CN116954873B - Heterogeneous computing system and method, apparatus, device, and medium for selecting computing power nodes thereof

Publication number: CN116954873B
Authority: CN (China)
Prior art keywords: computing, node, test, task, nodes
Legal status: Active (granted)
Application number: CN202311219994.1A
Other languages: Chinese (zh)
Other versions: CN116954873A
Inventors: 郭振华, 唐轶男, 王丽, 曹芳, 高开, 赵雅倩, 李仁刚
Current Assignee: Inspur Electronic Information Industry Co Ltd
Original Assignee: Inspur Electronic Information Industry Co Ltd
Application filed by Inspur Electronic Information Industry Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources to service a request
    • G06F 9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/505 Allocation of resources to service a request, the resource being a machine, considering the load
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F 18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a heterogeneous computing system and a method, an apparatus, a device, and a medium for selecting its computing power nodes, applied in the field of computer technology. When a task execution request is received, the task to be executed is run according to the user-selected custom computing power node parameters. Each test computing power node of the heterogeneous computing system is configured with different test task parameters for executing the task to be executed, generating a plurality of test tasks deployed to the heterogeneous computing system. Based on the test results of the test tasks and the custom computing power node parameters, a computing power invocation scheme that uses the fewest computing resources is determined, and the custom computing power node parameters are adjusted accordingly. The invention overcomes the improper selection of computing power nodes in the related art, achieves optimal selection of the computing power nodes that execute a task in the heterogeneous computing system without increasing cost, and improves the computing resource utilization of the heterogeneous computing system.

Description

Heterogeneous computing system and method, apparatus, device, and medium for selecting computing power nodes thereof
Technical Field
The present invention relates to the field of computer technology, and in particular to a heterogeneous computing system and a method, an apparatus, a device, and a medium for selecting its computing power nodes.
Background
Heterogeneous computing is a parallel and distributed computing technique that matches the parallelism type (i.e., code type) of a computing task to the computation type (i.e., machine capability) that a machine can efficiently support, making the best use of the various computing resources available. With the rapid development of artificial intelligence, heterogeneous computing systems, particularly multi-heterogeneous computing systems capable of cooperatively processing large amounts of data, are widely used in artificial intelligence applications.
Heterogeneous computing systems in the related art suffer from improper selection of computing power nodes, which wastes the resources of the system. Achieving optimal selection of the computing power nodes that execute a task in a heterogeneous computing system without increasing cost, and thereby improving resource utilization, is therefore a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The present invention provides a method and an apparatus for selecting computing power nodes of a heterogeneous computing system, an electronic device, a readable storage medium, and a heterogeneous computing system, which achieve optimal selection of the computing power nodes that execute a task in the heterogeneous computing system without increasing cost, improve the resource utilization of the heterogeneous computing system, effectively save computing resources, and improve task execution efficiency.
To solve the above technical problems, the present invention provides the following technical solutions:
A first aspect of the present invention provides a method for selecting computing power nodes of a heterogeneous computing system, including:
when a task execution request carrying a task to be executed and user-selected custom computing power node parameters is received, running the task to be executed according to the custom computing power node parameters;
configuring, for each test computing power node of the heterogeneous computing system, different test task parameters for executing the task to be executed, and generating a plurality of test tasks deployed to the heterogeneous computing system;
determining, based on the test results of the test tasks and the custom computing power node parameters, a computing power invocation scheme that uses the fewest computing resources, and adjusting the custom computing power node parameters accordingly based on the scheme. Each test computing power node has the same node type as a corresponding custom computing power node. The computing power invocation scheme includes the executing computing power nodes that run the task to be executed after optimization and the target task parameters corresponding to each executing computing power node; the executing computing power nodes are a subset of the custom computing power nodes.
In a first exemplary embodiment, generating the plurality of test tasks deployed to the heterogeneous computing system by configuring different test task parameters for each test computing power node to execute the task to be executed includes:
obtaining distributed training task information by parsing the task execution request;
generating a training test task according to the distributed training task information, and deploying the training test task to the corresponding classes of computing power nodes of the heterogeneous computing system;
where the distributed training task information carries the network model to be trained and the computing power node types, and the training test task includes a plurality of training subtasks and the training parameters corresponding to each training subtask.
In a second exemplary embodiment, the training parameters corresponding to each training subtask include a batch size, and deploying the training test task to the corresponding classes of computing power nodes of the heterogeneous computing system includes:
selecting one computing power node from each class of computing power nodes in the heterogeneous computing system as a test computing power node;
and issuing each training subtask and its corresponding batch size to the selected computing power node of the corresponding class, so that each test computing power node trains the same network model to be trained using training data of a different scale.
In a third exemplary embodiment, issuing each training subtask and its corresponding batch size to a selected computing power node of the corresponding class includes:
issuing each training subtask and its corresponding batch size to an idle computing power node of the corresponding class.
In a fourth exemplary embodiment, issuing each training subtask and its corresponding batch size to an idle computing power node of the corresponding class includes:
determining, for each corresponding class of computing power nodes of the heterogeneous computing system, whether the current class contains an idle computing power node;
if the current class contains an idle computing power node, issuing each training subtask and its corresponding batch size to that idle node; if it does not, waiting until an idle computing power node appears and then issuing each training subtask and its corresponding batch size to it.
In a fifth exemplary embodiment, when the current class contains no idle computing power node, the method further includes:
determining whether the current waiting time exceeds a preset waiting time threshold;
and if it does, leaving the custom computing power nodes and their corresponding task parameters unadjusted.
In a sixth exemplary embodiment, after each training subtask and its corresponding batch size are issued to the selected computing power node of the corresponding class, the method further includes:
monitoring the task running state of each test computing power node;
when the current training subtask of the current test computing power node is detected to be running stably, recording the single-iteration time of the current training subtask;
when the current test computing power node is detected to raise an insufficient-video-memory error, generating insufficient-video-memory information, which includes an error identifier, the training subtask identifier, the test computing power node identifier, and the current batch size value;
and generating a test result from the single-iteration time of each training subtask and the insufficient-video-memory information.
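To make the test-result structure concrete, the records such monitoring could produce might look like the following Python sketch; the field names are illustrative assumptions, not identifiers from the patent:

from dataclasses import dataclass
from typing import Optional

@dataclass
class SubtaskResult:
    subtask_id: str                    # training subtask identifier
    node_id: str                       # test computing power node identifier
    batch_size: int                    # batch size configured for the subtask
    iter_time: Optional[float] = None  # single-iteration time once running stably
    oom: bool = False                  # True if an insufficient-video-memory error occurred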
In a seventh exemplary embodiment, after recording the single-iteration time of the current training subtask, the method further includes:
sending the current test computing power node an instruction to stop running the current training subtask and execute the next training subtask.
In an eighth exemplary embodiment, after detecting that the current test computing power node has raised an insufficient-video-memory error, the method further includes:
if the current test computing power node holds a target training subtask whose batch size is larger than the current batch size value, generating an instruction to stop running that target training subtask.
In a ninth exemplary embodiment, adjusting the custom computing power node parameters based on the computing power invocation scheme includes:
selecting, from the custom computing power nodes, the target computing power nodes that differ from every executing computing power node; adjusting the task parameters of the remaining custom computing power nodes to the target task parameters of the corresponding executing computing power nodes; and releasing the resources of each target computing power node and setting its running state to idle.
In a tenth exemplary embodiment, the task to be executed is of a multi-iteration task type, and adjusting the custom computing power node parameters based on the computing power invocation scheme includes:
after the heterogeneous computing system detects that the current iteration update of the task to be executed has completed, adjusting the custom computing power nodes and their task parameters accordingly based on the computing power invocation scheme.
In an eleventh exemplary embodiment, after generating the plurality of test tasks deployed to the heterogeneous computing system, the method further includes:
invoking a pre-created test task table, whose entries comprise a test task identifier, a test computing power node type, test task parameters, and a test result;
and filling in the corresponding entries of the test task table according to the test computing power node executing each test task, the corresponding test task parameters, and the test result of each test task.
In a twelfth exemplary embodiment, after determining the computing power invocation scheme that uses the fewest computing resources, the method further includes:
invoking a pre-created node optimization table, whose entries comprise an executing computing power node type and target task parameters;
and filling in the corresponding entries of the node optimization table according to each executing computing power node and its target task parameters, and outputting the node optimization table.
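As an illustration only, the two tables could be represented in memory as follows; the column names mirror the entries listed above, while the concrete representation is an assumption:

# Hypothetical in-memory forms of the two tables.
test_task_table = [
    {"test_task_id": "t-001", "node_type": "A100",
     "test_params": {"batch_size": 4}, "result": {"iter_time": 0.8}},
]
node_optimization_table = [
    {"executing_node_type": "A100", "target_params": {"batch_size": 16}},
]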
In a thirteenth exemplary embodiment, determining the computing power invocation scheme that uses the fewest computing resources based on the test results of the test tasks and the custom computing power node parameters includes:
redistributing, by adjusting task parameters, the workload of first-class test computing power nodes whose computing performance does not meet a first preset target condition to second-class test computing power nodes whose computing performance meets a second preset target condition, in sequence, until the computing resources used to run the task to be executed reach a minimum;
and taking the test computing power nodes on which the heterogeneous computing system currently runs the task to be executed as the executing computing power nodes, taking the current task parameter of each as the corresponding target task parameter, and generating the computing power invocation scheme.
In a fourteenth exemplary embodiment, the task to be executed is a distributed training task, and the above redistribution until the computing resources used to run the task reach a minimum includes:
S101: obtaining the batch size of each custom computing power node from the custom computing power node parameters, and the single-iteration time of each test computing power node from the test results;
S102: obtaining the maximum-time computing power node, i.e., the node with the largest single-iteration time in the current test results;
S103: selecting the minimum-time computing power node, i.e., the node with the smallest single-iteration time in the current test results; setting its single-iteration time to that of a first target computing power node, and adjusting the batch size of the minimum-time node in the custom computing power node parameters while also adjusting the batch size of the maximum-time node; the first target computing power node has the same computing power type as the minimum-time node and a batch size larger than that of the minimum-time node;
S104: if the batch size of the maximum-time node in the custom computing power node parameters is 0, executing S105; otherwise, updating the custom computing power node parameters and the test results, and jumping to S103;
S105: if the largest single-iteration time in the original test results is greater than or equal to the largest single-iteration time in the updated test results, deleting the maximum-time node together with its batch size and single-iteration time, updating the custom computing power node parameters and the test results, and jumping to S102; repeating until the largest single-iteration time in the original test results is smaller than that in the updated results.
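Steps S101 to S105 amount to a greedy redistribution: batch-size units are repeatedly shifted from the node with the largest single-iteration time to the node with the smallest, using the measured times at larger batch sizes, and the slowest node is dropped entirely once its batch size reaches zero, provided this does not raise the overall iteration time. A condensed Python sketch under these assumptions (the preset step values are taken as 1, and the lookup table iter_time stands in for the recorded test results):

def redistribute(batch: dict[str, int],
                 iter_time: dict[str, dict[int, float]]) -> dict[str, int]:
    # batch: current batch size per custom computing power node (S101).
    # iter_time[n][b]: measured single-iteration time of node n at batch b;
    # a missing entry stands for "not measured" or an out-of-memory run.
    def t(n):
        return iter_time[n][batch[n]]
    while len(batch) > 1:
        slow = max(batch, key=t)          # S102: node with the largest iteration time
        baseline = t(slow)
        moved = True
        while batch[slow] > 0 and moved:  # shift work off the slowest node
            moved = False
            # S103: prefer the fastest node; fall back to slower ones when no
            # measurement exists at the larger batch size (sixteenth embodiment).
            for fast in sorted((n for n in batch if n != slow), key=t):
                larger = batch[fast] + 1  # preset increments assumed to be 1
                if larger in iter_time[fast]:
                    batch[fast] = larger
                    batch[slow] -= 1
                    moved = True
                    break
        if batch[slow] > 0:
            break                         # S104 never reached 0: keep as-is
        others = [n for n in batch if n != slow]
        if baseline >= max(t(n) for n in others):
            del batch[slow]               # S105: dropping the node does not hurt
        else:
            break   # would slow training; a full version reverts the last shifts
    return batch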
In a fifteenth exemplary embodiment, setting the single-iteration time of the minimum-time node to that of the first target computing power node, and adjusting the batch sizes of the minimum-time and maximum-time nodes in the custom computing power node parameters, includes:
setting the single-iteration time of the minimum-time node to that of the first target computing power node, increasing the batch size of the minimum-time node in the custom computing power node parameters by a second preset value each time, and decreasing the batch size of the maximum-time node by a third preset value each time;
where the first target computing power node has the same computing power type as the minimum-time node and a batch size larger than that of the minimum-time node by a first preset value.
In a sixteenth exemplary embodiment, after obtaining the maximum-time computing power node in the current test results, the method further includes:
if no single-iteration time exists for the first target computing power node, or its record is insufficient-video-memory information, selecting the node whose single-iteration time is second only to that of the minimum-time node as the new minimum-time node, and jumping to S103;
if all computing power nodes in the test results have been traversed without successfully adjusting the single-iteration time of the current minimum-time node, determining whether the batch size of each test computing power node in the currently updated test results is identical to the user-defined batch size in the task execution request;
if identical, leaving the custom computing power nodes and their task parameters unadjusted; if not, taking each test computing power node and its updated batch size in the updated test results as the executing computing power nodes and their target task parameters.
A second aspect of the present invention provides a method for selecting computing power nodes of a heterogeneous computing system, including:
when a distributed training task execution request carrying a network model to be trained and user-selected custom computing power node parameters is received, invoking each custom computing power node to run the network model to be trained with its corresponding batch size;
configuring, for each test computing power node of the heterogeneous computing system, a different batch size for training the network model to be trained, and generating a plurality of test tasks deployed to the heterogeneous computing system;
determining, from the test results of the test tasks and the custom computing power node parameters, a computing power invocation scheme that uses the fewest computing resources, and adjusting the custom computing power node parameters accordingly based on the scheme. Each test computing power node has the same node type as a corresponding custom computing power node. The computing power invocation scheme includes the executing computing power nodes that run the task to be executed after optimization and the batch size corresponding to each executing computing power node; the executing computing power nodes are a subset of the custom computing power nodes.
A third aspect of the present invention provides an apparatus for selecting computing power nodes of a heterogeneous computing system, including:
a task running module, configured to run the task to be executed according to user-selected custom computing power node parameters when a task execution request carrying the task to be executed and the custom computing power node parameters is received;
a test task deployment module, configured to configure, for each test computing power node of the heterogeneous computing system, different test task parameters for executing the task to be executed, and to generate a plurality of test tasks deployed to the heterogeneous computing system;
a node optimization module, configured to determine, based on the test results of the test tasks and the custom computing power node parameters, a computing power invocation scheme that uses the fewest computing resources, and to adjust the custom computing power node parameters accordingly based on the scheme; each test computing power node has the same node type as a corresponding custom computing power node; the computing power invocation scheme includes the executing computing power nodes that run the task to be executed after optimization and the target task parameters corresponding to each executing computing power node; the executing computing power nodes are a subset of the custom computing power nodes.
A fourth aspect of the present invention provides an apparatus for selecting computing power nodes of a heterogeneous computing system, including:
a training task running module, configured to invoke each custom computing power node to run the network model to be trained with its corresponding batch size when a distributed training task execution request carrying the network model and user-selected custom computing power node parameters is received;
a training task deployment module, configured to configure, for each test computing power node of the heterogeneous computing system, a different batch size for training the network model to be trained, and to generate a plurality of test tasks deployed to the heterogeneous computing system;
a training node optimization module, configured to determine, from the test results of the test tasks and the custom computing power node parameters, a computing power invocation scheme that uses the fewest computing resources, and to adjust the custom computing power node parameters accordingly based on the scheme; each test computing power node has the same node type as a corresponding custom computing power node; the computing power invocation scheme includes the executing computing power nodes that run the task to be executed after optimization and the target task parameters corresponding to each executing computing power node; the executing computing power nodes are a subset of the custom computing power nodes.
A fifth aspect of the present invention provides an electronic device, including a processor configured to implement the steps of any of the above methods for selecting computing power nodes of a heterogeneous computing system when executing a computer program stored in a memory.
A sixth aspect of the present invention provides a readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the above methods for selecting computing power nodes of a heterogeneous computing system.
A seventh aspect of the present invention provides a heterogeneous computing system, including a plurality of classes of computing power nodes and a processor, the processor being configured to implement the steps of any of the above methods for selecting computing power nodes of a heterogeneous computing system when executing a computer program stored in a memory.
In an exemplary embodiment, the processor is deployed on a target computing power node;
the target computing power node is the node with the highest computing performance among all computing power nodes of the heterogeneous computing system.
In another exemplary embodiment, the processor is deployed on a server;
the server is communicatively coupled to each computing power node of the heterogeneous computing system.
The technical solution provided by the invention has the following advantages. While the task to be executed runs on the computing power nodes selected by the user with their corresponding task parameters, the heterogeneous computing system obtains, through a small number of tests, the task running results of the relevant computing power nodes under different configuration parameters. Taking the minimum computing resources used to run the task as the optimization objective, the system can automatically detect improper selection of heterogeneous computing power from the running results of the different computing power nodes and the user-selected custom computing power node parameters, and automatically adjust the heterogeneous computing power nodes that run the task, without the user perceiving the change. Because the finally selected computing power nodes are a subset of the nodes currently running the task, optimal selection of the computing power nodes that execute a task in the heterogeneous computing system is achieved without increasing cost, the resource utilization of the system is improved, and its computing resources are effectively saved. And because the computing power nodes running the task are optimized rather than supplemented, a good task execution effect is obtained without introducing new high-performance nodes, improving task execution efficiency.
In addition, the invention provides a corresponding apparatus, an electronic device, a readable storage medium, and a heterogeneous computing system for the above method of selecting computing power nodes, making the method more practical; the apparatus, electronic device, readable storage medium, and heterogeneous computing system have the corresponding advantages.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
For a clearer description of the present invention and its technical solutions, the drawings used in the description of the embodiments and the related art are briefly introduced below. The drawings in the following description are only some embodiments of the present invention; other drawings may be obtained from them by those of ordinary skill in the art without inventive effort.
FIG. 1 is a schematic diagram of an exemplary application scenario provided by the present invention;
FIG. 2 is a flow chart of a method for selecting a computing power node of a heterogeneous computing system according to the present invention;
FIG. 3 is a flow chart of the method for determining the optimal computing power nodes and their corresponding node parameters according to the present invention;
FIG. 4 is a flow chart of another method for selecting computing power nodes of a heterogeneous computing system according to the present invention;
FIG. 5 is a schematic diagram of a hardware framework of an exemplary application scenario provided by the present invention;
FIG. 6 is a schematic diagram of a second electronic device in an exemplary application scenario provided by the present invention;
FIG. 7 is a schematic diagram of a first electronic device in an exemplary application scenario provided by the present invention;
FIG. 8 is a structural block diagram of an embodiment of a computing power node selection apparatus of a heterogeneous computing system according to the present invention;
FIG. 9 is a structural block diagram of another embodiment of a computing power node selection apparatus of a heterogeneous computing system according to the present invention;
FIG. 10 is a structural block diagram of an embodiment of an electronic device according to the present invention;
FIG. 11 is a block diagram of one embodiment of a heterogeneous computing system provided by the present invention.
Detailed Description
To make the technical solution of the present invention better understood by those skilled in the art, the present invention is further described in detail below with reference to the accompanying drawings and specific embodiments. The described embodiments are only some, not all, of the embodiments of the invention; all other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without inventive effort fall within the scope of the invention. The terms "first", "second", and the like in the description, the claims, and the drawings are used to distinguish different objects and do not necessarily describe a sequential or chronological order. Furthermore, the terms "comprise" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion. The term "exemplary" means "serving as an example, embodiment, or illustration"; any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Unlike a homogeneous computing system (i.e., one composed of computing nodes with identical computing performance), the computing performance of the heterogeneous computing nodes in a heterogeneous computing system differs, and the nodes also constrain one another during the execution of parallel tasks. Consequently, blindly adding computing power nodes to a parallel task such as distributed training, without considering computing performance, can have the undesirable effect of actually decreasing the task's efficiency.
Furthermore, the computing power caller of a heterogeneous computing system, such as a user renting the system, typically does not know the underlying principles of the task to be executed, so improper selection of computing power nodes occurs easily. This not only wastes computing resources but also seriously degrades task execution efficiency; for ultra-large-scale distributed training in particular, it may add days or even weeks of training time. In other words, when heterogeneous computing power of differing performance is used to execute large-scale tasks, improper computing power selection leads to poor utilization of computing resources.
Take a heterogeneous computing system running a distributed training task in a training mode based on data parallelism and synchronous parameter updating as an example. As shown in FIG. 1, the nodes participating in the distributed training task are heterogeneous computing nodes 1, 2, and 3, and the total time each node spends per iteration, i.e., the single-iteration time, comprises the local training time, the time high-performance nodes spend waiting, and the parameter server synchronization time. In the nth iteration of model training, the local training time of heterogeneous computing node 1 is long while those of nodes 2 and 3 are short; because the three nodes must update parameters synchronously, the two high-performance nodes 2 and 3 must wait for the poorly performing node 1. This both wastes the resources of the heterogeneous computing system and lowers the running efficiency of the entire distributed training task.
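Stated concretely, under synchronous parameter updating the effective single-iteration time of the whole system is governed by the slowest node. A minimal Python sketch of this straggler effect (node names and timings are illustrative assumptions):

def single_iteration_time(local_train_time: dict[str, float],
                          sync_time: float) -> float:
    # Under synchronous updates every node waits for the slowest one, so the
    # system-wide iteration time is the slowest local time plus the
    # parameter synchronization time.
    return max(local_train_time.values()) + sync_time

# Illustrative timings: node1 is the straggler.
print(single_iteration_time({"node1": 9.0, "node2": 3.0, "node3": 3.5}, 0.5))  # 9.5
# After dropping node1 and redistributing its workload to nodes 2 and 3:
print(single_iteration_time({"node2": 4.5, "node3": 5.0}, 0.5))                # 5.5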
When a task execution request is received, the task to be executed starts running according to the user-selected custom computing power nodes and their corresponding task parameters. While the task runs, a plurality of test tasks are generated and deployed to the heterogeneous computing system by configuring each test computing power node with different test task parameters for executing the task. From the test results and the custom computing power node parameters, the computing power nodes used to execute the task are then reselected with the minimum computing resources as the optimization objective, and the target task parameter of each node is determined; this achieves optimal selection of the computing power nodes that execute a task in the heterogeneous computing system, optimizes its computing resources, and improves task execution efficiency. Again taking FIG. 1 as an example, suppose that during the nth training iteration the technical solution of the invention determines that the executing computing power nodes for the distributed training task are heterogeneous computing nodes 2 and 3. When the nth iteration's parameters are updated, the workload of node 1 is redistributed to nodes 2 and 3 for the (n+1)th iteration, and node 1 terminates its task and releases its computing resources, becoming an idle node. Node 3 then only needs to wait briefly for node 2 plus the parameter server synchronization time; the task running time is shortened, task execution efficiency is effectively improved, the computing resources of the heterogeneous computing system are saved, and resource utilization is optimized.
Having described aspects of the invention, various non-limiting embodiments of the invention are described in detail below. Numerous specific details are set forth in the following description in order to provide a better understanding of the invention. It will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.
Referring first to FIG. 2, which is a flow chart of a method for selecting computing power nodes of a heterogeneous computing system according to the present invention, the method may include the following steps:
S201: when a task execution request carrying a task to be executed and user-selected custom computing power node parameters is received, running the task to be executed according to the custom computing power node parameters.
In this step, the task execution request carries the task to be executed, its source task parameters, and the user-selected custom computing power node parameters. The task to be executed is the task that must run on the heterogeneous computing system; the source task parameters are its operating parameters; and the custom computing power node parameters specify the computing power nodes of the heterogeneous computing system selected by the user, together with the parameters of each node when executing the task. Taking a user invoking the heterogeneous computing system to run a distributed training task as an example, the task to be executed is the distributed training task, and the source task parameters are the type of model trained and the maximum batch size (batch_size) used across all computing power nodes, the batch size being the number of data samples fed to the model-training program at one time. The custom computing power node parameters include the custom computing power nodes of the heterogeneous computing system selected by the user and the batch size of each. For clarity, this embodiment defines the nodes the user selects for running the task as custom computing power nodes, the nodes running test tasks as test computing power nodes, and the optimized nodes the system finally selects for running the task as executing computing power nodes. Each test computing power node has the same node type as a corresponding custom computing power node; that is, the system selects one node from each node type to which the custom computing power nodes belong as a test computing power node. Because the executing computing power nodes result from optimizing the custom computing power nodes, they are a subset of the custom computing power nodes. The operating parameters of the test computing power nodes are the test task parameters, and those of the executing computing power nodes are the target task parameters.
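For exposition, the contents of such a task execution request can be pictured with the following Python sketch; the field and type names are assumptions, not identifiers defined by the patent:

from dataclasses import dataclass, field

@dataclass
class CustomNodeParam:
    node_id: str      # identifier of a user-selected computing power node
    node_type: str    # heterogeneous computing power type, e.g. "A100" or "MLU370"
    batch_size: int   # batch size this node trains with

@dataclass
class TaskExecutionRequest:
    task_spec: str       # the task to be executed
    model_type: str      # source task parameter: the model trained
    max_batch_size: int  # source task parameter: max batch_size over all nodes
    custom_nodes: list[CustomNodeParam] = field(default_factory=list)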
S202: and generating a plurality of test tasks deployed to the heterogeneous computing system by utilizing each test computing node of the heterogeneous computing system and configuring different test task parameters for each test computing node to execute the task to be executed.
In the process of running the task to be executed by the heterogeneous computing system in the previous step, the heterogeneous computing system caller intercepted in the previous step generates a task execution request to be issued by the heterogeneous computing system, and a test task is generated based on the task to be executed in the task execution request, the source task parameters and the custom computing power node parameters thereof.
The heterogeneous computing system comprises a group of heterogeneous machines, a high-speed network for connecting the heterogeneous machines, and corresponding heterogeneous computing support software. Heterogeneous machines are a set of heterogeneous computing nodes with different computing power capabilities, including but not limited to heterogeneous computing chips, FPGAs (Field Programmable Gate Array, field programmable gate arrays), computing cards such as NVIDIA computing accelerator cards, and kadsura computing accelerator cards, and each class of computing power node resources may also include multiple computing power nodes, such as multiple FPGAs, multiple NVIDIA computing accelerator cards. In order to obtain the performance of each computing node in the heterogeneous computing system, the embodiment determines the performance of each computing node for executing the task to be executed by configuring different operation parameters for each computing node, namely, generates a test task by configuring different test task parameters for each test computing node and executing the task to be executed, wherein each test computing node at least comprises computing resources corresponding to the computing node specified by the user-defined node parameters, each test task is a plurality of tasks, each test task corresponds to different computing nodes and different operation parameters, the test task is deployed to the heterogeneous computing system, and the heterogeneous computing system sends the corresponding test computing node under each test task to each test computing node, which operates the corresponding test task.
S203: and determining a calculation power calling scheme when the calculation resources are least based on the test results of the test tasks and the custom calculation power node parameters, and correspondingly adjusting the custom calculation power node parameters based on the calculation power calling scheme.
After each test task is deployed to the heterogeneous computing system in the previous step, the operation result of each test computing node for operating the corresponding test task is obtained, and the operation result of each test computing node is the test result. The test result can reflect the calculation performance of each test computing node in the heterogeneous computing system, and the execution computing node for finally executing the task to be executed and the corresponding target task parameter are determined based on the task execution request and the test result, so that the calculation efficiency used for running the task to be executed is minimum. The computing power calling scheme comprises executing computing power nodes for executing the task to be executed after optimization and target task parameters corresponding to each executing computing power node, wherein each executing computing power node can be all user-defined computing power nodes of a user, or can be part of the user-defined computing power nodes, so long as the task to be executed can be executed and the final used computing resource is minimum. After the calculation power calling scheme is determined, each test calculation power node running the test task can be released to become an idle calculation power node so as to be called by the heterogeneous computing system to execute other tasks. After the calculation power calling scheme is determined, the calculation power nodes with poor performance in the running definition calculation power nodes can be terminated based on the calculation power calling scheme, task parameters of the remained user-defined calculation power nodes are correspondingly adjusted, the running tasks on the calculation power nodes are migrated to the remained user-defined calculation power nodes, and as the heterogeneous calculation system always runs the tasks to be executed, the heterogeneous calculation nodes used by the tasks to be executed can be automatically adjusted and optimized under the condition that users do not perceive, optimization of calculation resources can be achieved, and the execution effect of the tasks is improved. Compared with the mode that the task executed on the node with slower performance is directly replaced to the new high-performance node, which is equivalent to introducing extra computing resources, the method does not need to use the new high-performance node, does not increase the original requirement and cost of a user on the heterogeneous computing system, and does not cause the increase of the operation cost of the heterogeneous computing system.
In the technical solution provided by the invention, while the task to be executed runs on the user-selected computing power nodes with their corresponding task parameters, the heterogeneous computing system obtains, through a small number of tests, the running results of the relevant computing power nodes under different configuration parameters. Taking the minimum computing resources used to run the task as the optimization objective, it automatically detects improper selection of heterogeneous computing power from the running results of the different nodes and the user-selected custom computing power node parameters, and automatically adjusts the heterogeneous computing power nodes running the task without the user perceiving it. Because the finally selected nodes are a subset of the nodes currently running the task, optimal selection of the computing power nodes that execute a task in the heterogeneous computing system is achieved without increasing cost, the system's resource utilization is improved, and its computing resources are effectively saved. And because the nodes running the task are optimized rather than supplemented, a good task execution effect is obtained without introducing new high-performance nodes, improving task execution efficiency.
After the executing computing power nodes and their target task parameters are determined, to further save the computing resources of the heterogeneous computing system, the invention selects from the custom computing power nodes the target computing power nodes that differ from every executing computing power node, adjusts the task parameters of the remaining custom computing power nodes to the target task parameters of the corresponding executing computing power nodes, and releases the custom computing power nodes that are not executing computing power nodes. In other words, once the executing computing power nodes are determined, the custom computing power nodes running the task in the heterogeneous computing system are adjusted, the task parameters on each retained node are updated accordingly, and the resources of nodes used before the adjustment but not after it are released, with their running state set to idle, so that the target computing power nodes no longer running the task can be invoked by other tasks, saving the computing resources of the heterogeneous computing system.
Further, for a task to be executed of a multi-iteration task type, such as a distributed training task, the custom computing power nodes and their task parameters are adjusted based on the computing power invocation scheme after the heterogeneous computing system detects that the current iteration update of the task has completed. A node identifier uniquely determines a computing power node. In this embodiment, if the nodes of the heterogeneous computing system need optimizing, the deployment of the task to be executed is adjusted after the current iteration update completes, and the optimized-away computing power nodes are released into the idle state. Taking the distributed training task as an example: if node optimization is needed, the deployment of the distributed training task is adjusted after the model parameters of all its computing nodes have been synchronized and before the next iteration starts, and the optimized-away heterogeneous computing nodes are released. This preserves the current task execution progress, such as the model training progress, while optimizing the computing power nodes of the heterogeneous computing system without the user perceiving it, finally saving computing resources and accelerating task execution.
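A Python sketch of this deferred adjustment at an iteration boundary; every method name here is a hypothetical hook assumed for illustration, not an API from the patent:

def maybe_apply_scheme(training_task, scheme):
    # Defer the adjustment to an iteration boundary so training progress is kept.
    if scheme is None:
        return
    training_task.wait_for_parameter_sync()       # let the current iteration finish
    for node in scheme.removed_nodes:
        node.stop_task()                          # terminate work on dropped nodes
        node.release_resources()                  # node becomes idle for other jobs
    for node, params in scheme.target_params.items():
        node.update_task_params(params)           # e.g. the new batch size
    training_task.resume()                        # next iteration uses the new layout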
The above embodiments place no restriction on how the plurality of test tasks are generated; the invention also provides an exemplary generation method, which may include the following: obtaining distributed training task information by parsing the task execution request; generating a training test task according to the distributed training task information, and deploying it to the corresponding classes of computing power nodes of the heterogeneous computing system.
When a user of the heterogeneous computing system, such as a computing power tenant renting heterogeneous computing power, needs the system to run a task, a task execution request is issued to the system; once a distributed training task must execute on the heterogeneous computing system, the task, its relevant parameters, and node information are sent to the system as the task execution request that starts the task. While the request is being sent, the executing subject of this embodiment intercepts it and collects the data it carries. The distributed training task information carries the network model to be trained, i.e., the network model type, and the training test task comprises a plurality of training subtasks and the training parameters, such as the batch size, corresponding to each. The task execution request of this embodiment may include the model trained by the distributed training task, the heterogeneous computing power types and quantities used, the identifier of each computing power node used, the batch_size trained by each node, and the maximum batch_size used across all nodes; correspondingly, the distributed training task information may include the model trained, the computing power resource types involved, and the maximum batch_size across all nodes. The number of training subtasks can be chosen flexibly for the actual application scenario; it may be several times the maximum batch_size and should be at least twice it. The more training subtasks, the more computing power nodes can be scaled down, and the longer the test tasks take, though the test time remains far shorter than the whole distributed training task. For example, the distributed training task and node information carried by the task execution request may be: training a bert-base model (BERT base: Bidirectional Encoder Representations from Transformers, base size), using 5 NVIDIA A100, 5 NVIDIA RTX 4060 Ti, and 5 Cambricon MLU370 computing power nodes of the heterogeneous computing system, with the identifier and trained batch_size of each node, and a maximum batch_size of 16 across all heterogeneous computing power. The distributed training task information may then be: a training task involving NVIDIA A100, NVIDIA RTX 4060 Ti, and Cambricon MLU370, a maximum batch_size of 16, and bert-base as the trained model. According to the distributed training task information, several training subtasks are deployed onto each class of computing power node resource involved, and the running time of the network model to be trained is tested and recorded under different test task parameter settings, such as different batch_size configurations, for each class of heterogeneous computing power.
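Under one plausible reading of the above, the training subtasks for this example could be enumerated as in the following sketch, which assigns each test computing power node a sweep of batch sizes up to twice the maximum; the names and the sweep range are assumptions:

def build_test_subtasks(node_types: list[str], max_batch: int,
                        factor: int = 2) -> list[dict]:
    # One test computing power node per type; each runs subtasks sweeping
    # batch sizes 1 .. factor * max_batch, so single-iteration time is
    # measured at every batch size the redistribution step may need.
    return [{"node_type": nt, "batch_size": bs, "model": "bert-base"}
            for nt in node_types
            for bs in range(1, factor * max_batch + 1)]

# For the example above: 3 node types, max batch_size 16 -> 96 subtasks.
subtasks = build_test_subtasks(["A100", "RTX 4060 Ti", "MLU370"], max_batch=16)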
Further, the network model trained by the distributed training task information in the above embodiment, that is, the network model to be executed, adopts a training mode based on data parallelism and synchronization parameter updating, and accordingly, the test task parameter is a batch size, and based on this, an exemplary implementation manner of the above embodiment is as follows: and respectively selecting one computing node from each type of computing nodes in the heterogeneous computing system as a test computing node, issuing each training subtask and the corresponding batch size to one computing node in each type of computing node resource of the heterogeneous computing system, and training the same network model to be trained by using training data of different scales by each test computing node. Wherein, each type of computing node refers to each type of computing node type to which the self-defined computing node belongs, and not refers to each type of computing node type in the heterogeneous computing system.
In a training mode based on data parallelism and synchronous parameter updating, each computing power node in the heterogeneous computing system trains the same model using different training data. During training, whether the distributed training is realized through a centralized parameter-server mode or a decentralized allreduce (global reduction communication) mode, the computing power nodes need to synchronize their model parameters with each other regularly. Therefore, during one iteration of the distributed training task, a computing power node with better computing performance that finishes its step early must still wait for the computing power nodes with poorer computing performance to finish before the model parameters can be synchronized. The training speed of the distributed training task is thus limited by the computing power node with the worst computing performance, and in some extreme cases, using a poorly performing computing power node to run the distributed training task may severely degrade the running speed of the distributed training. The computing power nodes running the distributed training task in the heterogeneous computing system therefore need to be optimized with the technical scheme provided by the invention, so as to improve the running efficiency of the distributed training task.
As an exemplary implementation of the above embodiment, in order to further improve the computing performance of the heterogeneous computing system without occupying too many of its computing resources and affecting other running tasks, this embodiment uses idle computing power nodes to perform the test tasks, that is, the training subtasks of the above embodiment. For ease of description, a computing power node that currently carries no workload is referred to as an idle computing power node. In this embodiment, each training subtask and its corresponding batch size are issued to an idle computing power node of the corresponding class of computing power nodes of the heterogeneous computing system.
It will be appreciated that many users of a heterogeneous computing system perform many tasks simultaneously, so the computing power nodes of the system are not necessarily idle. An exemplary implementation of the above embodiment may therefore be as follows: for each corresponding class of computing power nodes of the heterogeneous computing system, judge whether an idle computing power node exists in the current class; if so, issue each training subtask and its corresponding batch size to an idle computing power node of the current class; if not, wait for an idle computing power node to appear and then issue each training subtask and its corresponding batch size to it.
Based on the above embodiment, in order to further improve flexibility and practicality and to improve the overall performance of the whole heterogeneous computing system, if no idle computing power node exists in the current class and the system keeps waiting for one, the waiting duration can be monitored in real time and compared with a preset waiting duration threshold. If the current waiting duration exceeds the threshold, the computing power nodes are not optimized, that is, the custom computing power node parameters are not adjusted, and the heterogeneous computing system continues to execute the task to be executed according to the source task parameters and the custom computing power node parameters in the task execution request.
In this embodiment, the waiting duration threshold may be preset and stored locally, or set anew when waiting for an idle computing power node becomes necessary; its value can be chosen flexibly according to the actual operating situation of the heterogeneous computing system and the user requirements, and the invention places no limitation on it. If no computing power node of the corresponding class is idle at that moment, the system waits for an idle node to appear before issuing. If the wait times out, for example exceeds 1 hour, node optimization is abandoned for this distributed training task, the computing power node selection flow of the whole heterogeneous computing system is exited directly, and the task to be executed is executed according to the user's task execution request. Of course, if the user does not select computing power nodes and/or does not specify task parameters such as the source task parameters or the custom computing power node parameters when issuing the task execution request, data-missing prompt information may be fed back to the user, and once the user supplies the required data, the task to be executed is executed according to the user-selected custom computing power nodes, the source task parameters, and the custom computing power node parameters.
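The waiting-with-timeout behaviour can be sketched as below, assuming a scheduler with a find_idle_node lookup; the polling interval and the use of 1 hour (3600 s) as the default threshold mirror the example above and are assumptions, not fixed values of the embodiment.

```python
import time

def acquire_idle_node(scheduler, node_type, timeout_s=3600, poll_s=5.0):
    """Wait for an idle computing power node of the given type; return None
    if the preset waiting duration threshold (timeout_s) is exceeded, in
    which case node optimization is skipped for this task."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        node = scheduler.find_idle_node(node_type)
        if node is not None:
            return node
        time.sleep(poll_s)   # poll until an idle node appears
    return None              # timed out: run the task unoptimized
```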
Based on the above embodiments, the present invention also provides an exemplary way of generating the test result, which may include the following:
monitoring the task running state of each test computing power node; when the current training subtask of the current test computing power node is detected to be running stably, counting the single-iteration time consumption of the current training subtask; when the current test computing power node is detected to report an insufficient-video-memory error, generating insufficient-video-memory information; and generating the test result according to the single-iteration time consumption of each training subtask and the insufficient-video-memory information.
In this embodiment, the test result is the running condition of each test computing power node of the heterogeneous computing system running its corresponding training subtasks, which covers two cases: successful running and unsuccessful running. For a successfully running subtask, the test result is the single-iteration time consumption of that training subtask on the test computing power node. For the unsuccessful case, as the batch_size value increases, a computing power node of the heterogeneous computing system may run out of video memory, that is, the current computing power node cannot successfully run the training subtask; the test result of that training subtask is then the insufficient-video-memory information, which may include an error-report identifier marking that an out-of-memory error occurred, a training subtask identifier uniquely determining the training subtask, a test computing power node identifier uniquely determining the computing power node, and the current batch size value.
In order to improve the execution efficiency of the test tasks, while the heterogeneous computing system executes the training subtasks one by one, no training subtask needs to wait for model convergence. Once a training subtask runs stably, for example after 10 iteration steps have been trained, its single-step time consumption, that is, its single-iteration time consumption, can be counted, and an instruction to stop running the current training subtask and execute the next one is sent to the current test computing power node, which then stops and moves on to the next training subtask.
It can be understood that once a computing power node reports an out-of-memory error, it will necessarily report the same error when running training subtasks with larger batch sizes. Therefore, if the current computing power node still has a target training subtask whose batch size value is larger than the current batch size value, an instruction to stop running that target training subtask is generated, and the node no longer runs training subtasks with larger batch_size values. This improves the execution efficiency of the test tasks, further improves the computing power node selection efficiency of the whole heterogeneous computing system, and effectively improves task running efficiency.
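A sketch of the per-node test loop, combining the stable-run early stop (here 10 iteration steps) with the pruning of larger batch sizes after an out-of-memory error; train_one_step is an assumed callable, and Python's MemoryError stands in for the framework-specific insufficient-video-memory error.

```python
import time

STABLE_STEPS = 10  # iteration steps after which a subtask counts as stable

def run_test_tasks(model_name, batch_sizes, train_one_step):
    """Run the training subtasks for one test computing power node, recording
    either a single-iteration time or an OOM marker per batch size."""
    results = {}
    for bs in sorted(batch_sizes):            # ascending batch sizes
        try:
            for _ in range(STABLE_STEPS):     # warm up until the run is stable
                train_one_step(model_name, bs)
            t0 = time.perf_counter()
            train_one_step(model_name, bs)    # time one further iteration
            results[bs] = time.perf_counter() - t0
        except MemoryError:                   # insufficient video memory
            results[bs] = "OOM"
            break  # every larger batch size would also OOM: stop testing them
    return results
```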
Based on the above embodiment, in order to facilitate implementation, operation and tracing of data, the present invention further provides another exemplary embodiment, which may include the following:
the test task table and the node optimization table are created in advance. The test task table may include a test task identification entry, a test computing power node type entry, a test task parameter entry, and a test result entry. For example, in the practical application scenario where the task to be executed is a network model to be trained, the corresponding test tasks are training subtasks, the test task parameter is the batch size of the network model during training, and the test task result is the single-iteration time consumption; the test task table may then be as shown in Table 1, where the test task identification entry corresponds to the training-subtask sequence number column, the test computing power node type entry corresponds to the heterogeneous computing power type column, the test task parameter entry corresponds to the training batch size column, and the test result entry corresponds to the single-iteration time consumption column. t_A,n denotes the single-iteration time consumption of the test computing power node of type A, that is, heterogeneous computing power A, running a training subtask at a batch size of n, and t_B,1 denotes the single-iteration time consumption of the test computing power node of type B, that is, heterogeneous computing power B, running a training subtask at a batch size of 1. OOM indicates that the test computing power node ran out of video memory when running the training subtask. Here n can be set to a multiple of the maximum batch_size, and at least 2 times it is recommended. The larger n is, the more heterogeneous computing power nodes can be adjusted away, but the number of training subtasks grows and the test tasks take longer; the test time nevertheless remains far less than that of the whole distributed training task. The identification of a computing power node in the heterogeneous computing system can be represented by the node's IP or by any custom coding scheme, as long as the corresponding heterogeneous computing power can be looked up; this does not affect the implementation of the invention. The node optimization table, as shown in Table 4, may include an execution computing power node type entry and a target task parameter entry.
During the execution of step S202, the pre-created test task table is called; the input data set can directly reuse the data set of the original distributed training task, and the corresponding entries of the test task table are filled in according to the test computing power node executing each test task, the corresponding test task parameters, and the test result of each test task. For example, for the A100, the single-iteration time consumption at every batch_size in Table 1 may be obtained while running the bert-base training subtasks. For the NVIDIA RTX 4060 Ti, single-iteration times may be obtainable while the batch size is no greater than 8, but at batch sizes greater than 8 an out-of-memory error may occur when running the bert-base training subtask because the video memory of the RTX 4060 Ti is insufficient. The distributed training task to be executed and the node information are summarized, the heterogeneous computing power types involved in the distributed training task are marked in the heterogeneous computing system, the training batch size of each computing power node is counted, and the results are filled into the pre-optimization information summary table shown in Table 2; the heterogeneous computing power used comprises 5 NVIDIA A100, 5 NVIDIA RTX 4060 Ti, and 5 Cambricon MLU370. After the target task parameters corresponding to each execution computing power node are determined, the pre-created node optimization table is called; its entries are filled in according to each execution computing power node and its corresponding target task parameters, and the identifiers of the remaining optimized computing power nodes and the updated batch size of each node are output in the form of the node optimization table. Comparing Tables 2 and 4: in Table 2 the task is run using 5 NVIDIA A100, 5 NVIDIA RTX 4060 Ti, and 5 Cambricon MLU370 (the first and second columns of Table 2), with a batch size of 16 for each A100, 8 for each RTX 4060 Ti, and 16 for each MLU370 (the third column of Table 2). The optimized Table 4, obtained by the optimal node selection method provided by the invention, runs the task using 5 NVIDIA A100, 2 NVIDIA RTX 4060 Ti, and 5 Cambricon MLU370 (the first column of Table 4), with the batch sizes of the five A100 nodes being 21, 21, 21, 21, and 20, the batch size of the two remaining RTX 4060 Ti nodes still being 8, and the batch size of each MLU370 still being 16 (the second column of Table 4). Table 4 thus uses 3 fewer RTX 4060 Ti nodes than Table 2; the load of those three RTX 4060 Ti nodes (identified in Table 2 as ID_RTX4060Ti_1, ID_RTX4060Ti_2, ID_RTX4060Ti_3), a total of 24 (3×8) batch_size units, is distributed to the 5 A100 nodes as evenly as possible, with the fifth A100 node, ID_A100_5, receiving the remainder for a batch size of 20. After the current iteration of the distributed computing task completes and the model parameters have been synchronized across all heterogeneous computing power, the distributed training task is adjusted and redeployed according to Table 4 before the next iteration starts, and the optimized-away computing power nodes are released so that they become idle.
Table 1 Test task table

Training subtask No. | Heterogeneous computing power type | Training batch size | Single-iteration time consumption
1                    | A                                  | 1                   | t_A,1
…                    | A                                  | …                   | …
n                    | A                                  | n                   | t_A,n
n+1                  | B                                  | 1                   | t_B,1
…                    | B                                  | …                   | … (or OOM)
Table 2 Pre-optimization information summary table

Heterogeneous computing power type | Quantity | Training batch size per node
NVIDIA A100                        | 5        | 16
NVIDIA RTX 4060 Ti                 | 5        | 8
Cambricon MLU370                   | 5        | 16
Table 3 exemplary test task Table
Table 4 Exemplary node optimization table

Execution computing power node | Target batch size
ID_A100_1                      | 21
ID_A100_2                      | 21
ID_A100_3                      | 21
ID_A100_4                      | 21
ID_A100_5                      | 20
ID_RTX4060Ti_4                 | 8
ID_RTX4060Ti_5                 | 8
ID_MLU370_1 … ID_MLU370_5      | 16 each
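The even redistribution behind Table 4 can be checked with a few lines of arithmetic; this sketch assumes the freed batch_size units are handed out one at a time, round-robin, across the receiving nodes.

```python
def redistribute(freed_units, receiver_batch_sizes):
    """Spread freed batch_size units across receivers as evenly as possible."""
    sizes = list(receiver_batch_sizes)
    for i in range(freed_units):         # hand out one unit at a time
        sizes[i % len(sizes)] += 1
    return sizes

# 3 removed RTX 4060 Ti nodes free 3 * 8 = 24 units; the 5 A100s start at 16:
print(redistribute(24, [16, 16, 16, 16, 16]))   # -> [21, 21, 21, 21, 20]
```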
The above embodiments place no limitation on how to select the execution computing power nodes for executing the task to be executed and their corresponding target task parameters. The present invention also provides an exemplary way of determining the execution computing power nodes and their corresponding target task parameters, which may include the following:
distributing, by adjusting task parameters, the workload of the first type of test computing power nodes, whose computing performance does not meet the first preset target computing performance condition, to the second type of test computing power nodes, whose computing performance meets the second preset target computing performance condition, in sequence, until the computing resources used to run the task to be executed reach the minimum; then taking the test computing power nodes still running the task to be executed in the heterogeneous computing system as the execution computing power nodes, taking the current task parameter corresponding to each such test computing power node as its target task parameter, and generating the computing power calling scheme.
In this embodiment, the first preset target performance condition is a preset condition for identifying a poorly performing computing power node, for example a test-task running time exceeding a first preset threshold such as 3 min, and the second preset target performance condition is a preset condition for identifying a well performing computing power node, for example a test-task running time below a second preset threshold such as 1 min; both can be determined flexibly by a person skilled in the art according to the actual application scenario. The first type of test computing power nodes are the poorly performing nodes and the second type are the well performing nodes; the workload of the first type is distributed to the second type, and this migration can be repeated at a certain rhythm until the computing resources used to run the task to be executed are minimal. Through this continued migration, once the computing resources used by the heterogeneous computing system to run the task reach the minimum, the computing power nodes still running the task and their corresponding test task parameters constitute the optimal result. The nodes still running the task at that moment, namely the well performing ones, are therefore taken as the execution computing power nodes, and their corresponding task parameters become the target task parameters. The nodes carrying no workload, namely those no longer running the task, are the optimized-away poorly performing nodes; all their resources can be released so that they become idle computing power nodes. This saves the computing resources of the poorly performing heterogeneous computing power nodes while fully utilizing the well performing ones, thereby accelerating the execution of the task to be executed.
As an exemplary implementation of the foregoing embodiment, referring to fig. 3, for an application scenario in which the task to be executed is a distributed training task, the process of distributing, by adjusting task parameters, the workload of the first type of test computing power nodes, whose computing performance does not meet the first preset target computing performance condition, to the second type of test computing power nodes, whose computing performance meets the second preset target computing performance condition, until the computing resources used to run the task to be executed reach the minimum, may be as follows:
S101: obtain the batch size of each custom computing power node from the custom computing power node parameters, and obtain the single-iteration time consumption of each test computing power node from the test results;
S102: obtain the maximum-time-consuming computing power node, namely the node with the maximum single-iteration time consumption in the current test results;
S103: select the minimum-time-consuming computing power node, namely the node with the minimum single-iteration time consumption in the current test results, adjust its single-iteration time consumption to the single-iteration time consumption of the first target computing power node, increase the batch size value of the minimum-time-consuming node in the custom computing power node parameters, and simultaneously decrease the batch size value of the maximum-time-consuming node in the custom computing power node parameters;
S104: judge whether the batch size value of the maximum-time-consuming computing power node in the custom computing power node parameters is 0; if so, execute S105; if not, update the custom computing power node parameters and the test results, and jump back to S103;
S105: judge whether the maximum single-iteration time consumption of the original test results is greater than or equal to the maximum single-iteration time consumption in the updated test results; if so, execute S106;
S106: delete the maximum-time-consuming computing power node together with its corresponding batch size and single-iteration time consumption, update the custom computing power node parameters and the test results again, and jump back to S102.
In this embodiment, the batch size of each custom computing power node is obtained from the custom computing power node parameters, the single-iteration time consumption of each test computing power node is obtained from the test results, and the maximum-time-consuming computing power node, the node with the maximum single-iteration time consumption in the current test results, is identified. The single-iteration time consumption of the minimum-time-consuming node in the current test results is adjusted to that of the first target computing power node, the batch size value of the minimum-time-consuming node in the custom computing power node parameters is increased, and the batch size value of the maximum-time-consuming node in the custom computing power node parameters is decreased. The node with the current minimum single-iteration time consumption is then selected as the new minimum-time-consuming node and the adjustment step is repeated, updating the test results and the custom computing power node parameters each time, until the batch size value of the maximum-time-consuming node in the custom computing power node parameters reaches 0. If the maximum single-iteration time consumption of the original test results is greater than or equal to that of the updated test results, the maximum-time-consuming node together with its batch size and single-iteration time consumption is deleted, the node with the current maximum single-iteration time consumption becomes the new maximum-time-consuming node, and the selection step is repeated. When the maximum single-iteration time consumption of the original test results is smaller than that of the updated test results, the current maximum single-iteration time consumption is already optimal.
Here the task to be executed is a distributed training task, and the custom computing power node parameters of the task execution request comprise the custom computing power nodes and the batch size of each of them. The maximum-time-consuming computing power node is the node of the heterogeneous computing system with the currently largest single-iteration time consumption, and the minimum-time-consuming node is the one with the currently smallest. Custom computing power nodes and test computing power nodes can be matched through computing power node identifiers such as node IPs: after the maximum- and minimum-time-consuming nodes are determined from the test results, the corresponding identifiers are used for the matching, and the corresponding batch size values are then read from the task execution request. Whenever the single-iteration time consumption of any node in the test results and/or a batch size value in the custom computing power node parameters changes, the test results and the custom computing power node parameters are updated in real time, and the maximum- and minimum-time-consuming nodes change accordingly. The whole tuning process is cyclic; for both the maximum- and minimum-time-consuming nodes determined in S102 and S103, the test results referred to are those current when the step runs. In other words, on the first execution the current test results are the original test results, while on the second and subsequent executions they are the updated ones; the original test results are those obtained after the execution of S202 completes. The first target computing power node is a node satisfying two conditions: its computing power type is the same as that of the minimum-time-consuming node, and its batch size value is larger than the batch size of the minimum-time-consuming node. The difference between the two batch size values can be determined flexibly according to the actual scenario. Because the test results contain the single-iteration time consumption of each computing power node under different batch sizes, the batch size and single-iteration time consumption of the first target computing power node can be obtained simply by querying the test results and the custom computing power node parameters.
As an exemplary embodiment, to improve the accuracy and flexibility of optimal node selection, the batch size value of each computing power node may be adjusted in predetermined step sizes, which can be determined flexibly according to the actual application scenario; that is, the first, second, and third preset values may be chosen freely without affecting the implementation of the invention. Correspondingly, when the single-iteration time consumption of the minimum-time-consuming node is adjusted to that of the first target computing power node, the batch size value of the minimum-time-consuming node in the custom computing power node parameters can be increased by the second preset value each time, and the batch size value of the maximum-time-consuming node can be decreased by the third preset value each time; the first target computing power node has the same computing power type as the minimum-time-consuming node, and its batch size value exceeds that of the minimum-time-consuming node by the first preset value.
As an exemplary embodiment, to further improve the practicability and flexibility of optimal node selection, after the maximum-time-consuming node of the current test results is obtained, the above embodiment may further include the following: if the single-iteration time consumption value of the first target computing power node does not exist or is insufficient-video-memory information, select the node whose single-iteration time consumption is the next smallest after the minimum-time-consuming node as the new minimum-time-consuming node, and repeat step S103. If all computing power nodes in the test results have been traversed without successfully adjusting the single-iteration time consumption of the current minimum-time-consuming node, judge whether the batch size of each test computing power node in the currently updated test results is identical to the user-defined batch size value in the task execution request. If identical, no optimal node selection is needed, that is, neither the custom computing power nodes nor their corresponding task parameters are adjusted, and the task to be executed continues to run as in S201. If different, take the batch size of each test computing power node in the updated test results and the correspondingly updated custom computing power node parameters as the execution computing power nodes and their corresponding target task parameters.
As an exemplary embodiment, to facilitate operation and reduce complexity, the batch size values of the custom computing power nodes in the custom computing power node parameters of a task execution request may be defined as a vector B = (b_1, b_2, …, b_z) together with a duplicate copy B' = (b_1, b_2, …, b_z), where b_z is the batch size of the z-th custom computing power node. The single-iteration time consumption of the test computing power nodes in the test results may likewise be defined as a vector T = (t_1, t_2, …, t_c) together with a duplicate copy T' = (t_1, t_2, …, t_c), where t_c is the single-iteration time consumption of the c-th test computing power node. The correspondence between test computing power nodes and custom computing power nodes is established through the computing power node identifiers and can be represented by Table 5, in which the computing power node batch size column corresponds to B and B', and the single-iteration time consumption column corresponds to T and T'.
Table 5 Computing power node mapping table
A1: Obtain the maximum single-iteration time consumption in T' and the batch size of the corresponding custom computing power node in B', defined as b.
A2: Find the minimum single-iteration time consumption in T' and adjust it to the corresponding single-iteration time consumption of the first target computing power node taken from the test results. If that value is marked OOM (the insufficient-video-memory identification string) or no record exists for it, continue with the next smallest single-iteration time consumption in T' and attempt to change that value instead, and so on. If T' has been traversed without finding a changeable value, perform step A4. If the adjustment of T' succeeds, change the batch size value of the corresponding custom computing power node in B', for example by +1, and correspondingly set b = b - 1. At this point, if b = 0, execute step A3; otherwise repeat step A2.
A3: Compare max(T) and max(T'). If max(T) ≥ max(T'), the computing power node optimization is effective: delete the computing power node corresponding to b = 0, delete the 0-valued element of B' and the corresponding element of T', then assign T = T' and B = B', and return to step A1. If max(T) < max(T'), this round of computing power node optimization is invalid, the computing power nodes currently running the distributed training task in the heterogeneous computing system are already optimal, and step A4 is executed.
A4: If the current B' is identical to the batch sizes defined in the original custom computing power node parameters, the computing power nodes running the distributed training task in the heterogeneous computing system are not modified. If not, output B' together with the corresponding computing power node identifiers as the optimized batch_size result.
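Steps A1 to A4 can be sketched in Python as follows. This is a simplified sketch, not the patented implementation itself: nodes are keyed by identifier, test_results maps each computing power type to a table of single-iteration times (or the string "OOM") per batch size, the per-move increment is fixed at 1 as in the example above, and a round that cannot be completed is rolled back to the last committed state.

```python
def optimize_nodes(batch, times, node_type, test_results):
    """batch: node_id -> batch size (B); times: node_id -> single-iteration
    time (T); node_type: node_id -> type; test_results: type -> {bs: time|"OOM"}."""
    while True:
        B, T = dict(batch), dict(times)       # working copies B', T'
        worst = max(T, key=T.get)             # A1: maximum-time-consuming node
        b = B[worst]
        while b > 0:                          # A2: drain the worst node
            for nid in sorted(T, key=T.get):  # try the fastest receivers first
                if nid == worst:
                    continue
                new_t = test_results[node_type[nid]].get(B[nid] + 1)
                if new_t is not None and new_t != "OOM":
                    T[nid], B[nid] = new_t, B[nid] + 1
                    b -= 1
                    B[worst] = b
                    break
            else:                             # nothing changeable left: A4
                return batch
        # A3: keep the round only if the bottleneck did not get worse
        if max(times.values()) >= max(T.values()):
            del B[worst], T[worst]            # node with b == 0 is removed
            batch, times = B, T               # B = B', T = T'
        else:
            return batch                      # round invalid: current is best
```

Under these assumptions, each effective round removes one computing power node, so the loop terminates after at most as many rounds as there are custom computing power nodes.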
As can be seen from the above, this embodiment iterates, continually attempting to shift load from computing power nodes with poor computing performance onto nodes with good computing performance, finally achieving the effect of optimizing the training nodes. This improves the running efficiency of the distributed training task, saves the computing resources of the heterogeneous computing system, and releases additional computing resources.
In addition, taking the case where the task to be executed is a distributed training task as the application scenario, the present invention further provides a computing power node selection method used while the heterogeneous computing system runs a distributed training task. Referring to fig. 4, fig. 4 is a flow chart of another computing power node selection method for a heterogeneous computing system provided by the present invention, which may include the following contents:
S401: when a distributed training task execution request carrying the network model to be trained and user-selected custom computing power node parameters is received, invoke each custom computing power node to run the network model to be trained at its corresponding batch size.
S402: generate a plurality of test tasks deployed to the heterogeneous computing system by using each test computing power node of the heterogeneous computing system and configuring different batch sizes for each test computing power node to train the network model to be trained.
S403: determine the computing power calling scheme that uses the fewest computing resources according to the test results of the test tasks and the custom computing power node parameters, and correspondingly adjust the custom computing power node parameters based on the computing power calling scheme.
The distributed training task execution request is the task execution request of the above embodiments for the case where the task to be executed is a distributed training task; it carries the network model to be trained, the training parameters of the network model, and the user-selected custom computing power node parameters, and each test computing power node corresponds to a computing power node type to which the custom computing power nodes belong. The computing power calling scheme comprises the optimized execution computing power nodes for executing the task to be executed and the target task parameters corresponding to each execution computing power node; the execution computing power nodes are a subset of the custom computing power nodes. The source task parameters of the above embodiments include the training parameters of this embodiment, where the training parameter is the maximum batch size used across all computing power nodes, and the user-selected custom computing power node parameters include the custom computing power nodes of the heterogeneous computing system selected by the user and the batch size of each of them. The target task parameter of the above embodiments is, in this embodiment, the batch size of the computing power nodes. For the method steps involved in selecting the optimal computing power nodes in this embodiment, reference can be made to the description of the foregoing embodiments, replacing the related terms there with the corresponding terms of this embodiment; they are not repeated here.
As can be seen from the above, this embodiment can solve the problem of improper computing power node selection in the related art and, without increasing cost, achieves optimal selection of the computing power nodes of a distributed training task in a heterogeneous computing system, effectively improving the running efficiency of the distributed training task and optimizing the computing resource utilization of the heterogeneous computing system.
It should be noted that the steps of the present invention need not be executed in a strict order; as long as they conform to the logical sequence, they may be executed simultaneously or in a certain preset order. Fig. 2, fig. 3, and fig. 4 are only schematic and do not imply that only such an execution order is possible.
Finally, based on the above technical solution of the present invention, an example is described with reference to fig. 5. Fig. 5 is a schematic diagram of a hardware composition framework to which the computing power node selection method of the heterogeneous computing system provided by the present invention is applicable, and it may include the following contents:
the hardware composition framework may include a first electronic device 51 and a second electronic device 52 connected through a network 53. The first electronic device 51 carries a processor for executing the computing power node selection method of the heterogeneous computing system described in any of the above embodiments, and the second electronic device 52 carries a multi-component heterogeneous computing system including a plurality of heterogeneous computing power nodes. As shown in fig. 6, the multi-component heterogeneous computing system may include heterogeneous computing power node 1, heterogeneous computing power node 2, heterogeneous computing power node 3, heterogeneous computing power node 4, and so on. In a multi-component heterogeneous computing system, heterogeneous computing power nodes with different computing performance, such as NVIDIA computing accelerator cards, Cambricon computing accelerator cards, and FPGAs, are connected into the same distributed computing system; the computing power nodes may communicate within a server or between servers, which does not affect the implementation of the invention. The heterogeneous computing system trains large-scale neural network models in a distributed training mode: the training data or the large-scale neural network model is first split, and the split data and sub-training tasks are then deployed onto a plurality of heterogeneous computing power nodes. Among the various modes of distributed training, training based on data parallelism and synchronous parameter updating is usually easy to deploy and yields the best-performing final model, so the heterogeneous computing system of this embodiment runs the distributed training task with a training mode based on data parallelism and synchronous parameter updating. As shown in fig. 7, a distributed training task information collection module, a test task deployment module, a test task information collection module, and a computing power node optimization module may be integrated in the processor of the first electronic device 51. The distributed training task information collection module may be configured to collect the distributed training task information to be executed by intercepting the to-be-executed distributed training task and the node information issued by a caller of the heterogeneous computing system. The test task deployment module is used to generate training test tasks according to the distributed training task information and then issue them to the computing power nodes involved in the distributed training task; for each computing power node type, only one idle heterogeneous computing power node of that type is needed to execute the test tasks. The test task information collection module is used to collect the test results of the test tasks on the computing power nodes. The computing power node optimization module is used to optimize the distributed training task, adjust the computing power nodes running the distributed training task in the heterogeneous computing system and the training configuration, and issue the adjusted distributed training task to the heterogeneous computing system, which correspondingly adjusts the custom computing power node parameters used to run the task to be executed.
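The cooperation of the four modules might be wired together as in the following sketch; the class and method names are assumptions chosen to mirror the roles just described.

```python
class NodeSelectionPipeline:
    """Hypothetical wiring of the four functional modules of fig. 7."""

    def __init__(self, info_collector, test_deployer, result_collector, optimizer):
        self.info_collector = info_collector      # intercepts task requests
        self.test_deployer = test_deployer        # deploys training subtasks
        self.result_collector = result_collector  # gathers test results
        self.optimizer = optimizer                # rebalances and prunes nodes

    def run(self, raw_request):
        task_info = self.info_collector.intercept(raw_request)
        self.test_deployer.deploy(task_info)
        results = self.result_collector.collect(task_info)
        scheme = self.optimizer.optimize(task_info, results)
        return scheme   # adjusted custom computing power node parameters
```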
The first electronic device 51 completes all or part of the steps of the computing power node selection method of the heterogeneous computing system described in the above embodiments through the aforementioned four functional modules.
It should be noted that the above application scenario is only shown for the convenience of understanding the idea and principle of the present invention, and the embodiment of the present invention is not limited in any way. Rather, embodiments of the invention may be applied to any scenario where applicable.
As can be seen from the above, this embodiment can likewise solve the problem of improper computing power node selection in the related art and, without increasing cost, achieves optimal selection of the computing power nodes of a distributed training task in a heterogeneous computing system, effectively improving the running efficiency of the distributed training task and optimizing the computing resource utilization of the heterogeneous computing system.
The present invention also provides a corresponding apparatus for the computing power node selection method of the heterogeneous computing system, making the method more practical. The apparatus can be described separately from the perspective of functional modules and from the perspective of hardware. The computing power node selection apparatus of the heterogeneous computing system described below is configured to implement the computing power node selection method of the heterogeneous computing system provided by the present invention. In this embodiment, the apparatus may include, or be divided into, one or more program modules that are stored in a storage medium and executed by one or more processors to implement the computing power node selection method disclosed in the first embodiment. A program module in the present invention refers to a series of computer program instruction segments capable of performing specified functions, better suited than the program itself to describing the execution of the computing power node selection apparatus in the storage medium. The following description details the functions of each program module of this embodiment; the computing power node selection apparatus described below and the computing power node selection method described above may be referred to in correspondence with each other.
Referring to fig. 8, fig. 8 is a block diagram of a computing node selecting device of a heterogeneous computing system according to an embodiment of the present invention, where the computing node selecting device may include:
the task running module 801 is configured to, when a task execution request carrying a task to be executed and a user-selected user-defined computing node parameter is received, run the task to be executed according to the user-defined computing node parameter;
the test task deployment module 802 is configured to utilize each test computing node of the heterogeneous computing system, and configure different test task parameters for each test computing node to execute a task to be executed, so as to generate a plurality of test tasks deployed to the heterogeneous computing system; the task execution request carries a task to be executed, a source task parameter and a user-defined computing node parameter selected by a user;
the node optimization module 803 is configured to determine, based on the test results of each test task and the custom computing power node parameters, the computing power calling scheme that uses the fewest computing resources, and to correspondingly adjust the custom computing power node parameters based on that scheme. Each test computing power node corresponds to a computing power node type to which the custom computing power nodes belong; the computing power calling scheme comprises the optimized execution computing power nodes for executing the task to be executed and the target task parameters corresponding to each execution computing power node; the execution computing power nodes are a subset of the custom computing power nodes.
Illustratively, in some implementations of this embodiment, the node optimization module 803 may be further configured to: select, from the custom computing power nodes, the target computing power nodes that differ from the execution computing power nodes, adjust the task parameters of the remaining custom computing power nodes to the target task parameters of the corresponding execution computing power nodes, release the resources of each target computing power node, and set the running state of each target computing power node to idle.
In an exemplary implementation of the foregoing embodiment, the node optimization module 803 may be further configured to: after the heterogeneous computing system detects that the current iteration update of the task to be executed is completed, correspondingly adjust the custom computing power nodes and their task parameters based on the computing power calling scheme.
Illustratively, in other implementations of this embodiment, the test task deployment module 802 may be further configured to: obtain the distributed training task information by parsing the task execution request; generate a training test task according to the distributed training task information, and deploy it to the corresponding classes of computing power nodes of the heterogeneous computing system; the distributed training task information carries the network model to be trained and the computing power node types, and the training test task comprises a plurality of training subtasks and the training parameters corresponding to each training subtask.
In an exemplary implementation of the foregoing embodiment, the test task deployment module 802 may be further configured to: respectively selecting one computing node from each class of computing nodes in the heterogeneous computing system as a test computing node; and respectively issuing each training subtask and the corresponding batch size to selected computing nodes in the corresponding class of computing nodes of the heterogeneous computing system, so that each test computing node uses training data of different scales to train the same network model to be trained.
In another exemplary implementation of the above embodiment, the test task deployment module 802 may be further configured to: and issuing each training subtask and the batch size corresponding to each training subtask to idle computing nodes in the corresponding class computing nodes of the heterogeneous computing system.
In yet another exemplary implementation of the foregoing embodiment, the test task deployment module 802 may be further configured to: judging whether an idle computing power node exists in the current class computing power node for the corresponding class computing power node of the heterogeneous computing system; if the current class of computing nodes have idle computing nodes, issuing each training subtask and the corresponding batch size to the idle computing nodes in the current class of computing nodes; if the current class of computing nodes do not have the idle computing nodes, after waiting for the idle computing nodes to appear, issuing each training subtask and the corresponding batch size to the idle computing nodes in the current class of computing nodes.
In yet another exemplary implementation of the above embodiment, the test task deployment module 802 may be further configured to: judge whether the current waiting duration exceeds a preset waiting duration threshold; if it does, the custom computing power nodes and their corresponding task parameters are not adjusted.
For example, in some other implementations of the present embodiment, the apparatus may further include a test result generating module, configured to monitor a task running state of each test computing node; when the current training subtask of the current test calculation node is detected to be in a stable running state, counting the time consumption of single iteration of the current training subtask; when the fact that the current test calculation node is in the fault reporting of the insufficient display memory is detected, generating the information of the insufficient display memory; the insufficient display memory information comprises error reporting marks, training subtask marks, test calculation node marks and current batch size values; and generating a test result according to the time consumption and the insufficient memory information of single iteration of each training subtask.
In an exemplary implementation of the foregoing embodiment, the test task deployment module 802 may further be configured to: and sending an instruction for stopping running the current training subtask and executing the next training subtask to the current test computing node.
In another exemplary implementation of the foregoing embodiment, the test task deployment module 802 may be further configured to: if the current test calculation force node has the target training subtask with the batch size value larger than the current batch size value, generating an instruction for stopping running the target training subtask.
In yet another exemplary implementation of the foregoing embodiment, the node optimization module 803 may be further configured to: select, from the custom computing power nodes, the target computing power nodes that differ from the execution computing power nodes, adjust the task parameters of the remaining custom computing power nodes to the target task parameters of the corresponding execution computing power nodes, release the resources of each target computing power node, and set the running state of each target computing power node to idle.
Illustratively, in some other implementations of this embodiment, the test task deployment module 802 may be further configured to: calling a pre-established test task table; the test task table comprises a test task identification bar item, a test calculation node type bar item, a test task parameter bar item and a test result bar item; and filling the corresponding item contents of the test task list according to the test calculation nodes executing the test tasks, the corresponding test task parameters and the test results of the test tasks.
In an exemplary implementation of the foregoing embodiment, the node optimization module 803 may be further configured to: invoking a node optimization table which is created in advance; the node optimization table comprises an execution computing power node type bar item and a target task parameter bar item; and filling the corresponding item content in the node optimization table according to each execution computing node and the corresponding target task parameter, and outputting the node optimization table.
Illustratively, in other implementations of this embodiment, the node optimization module 803 may be further configured to: distributing the workload of the first type of test computing nodes, the computing performance of which does not meet the first preset target computing performance condition, in each test computing node to the second type of test computing nodes, the computing performance of which meets the second preset target computing performance condition, in sequence in a mode of adjusting task parameters until the computing resources used for running the task to be executed reach the minimum; and taking the test computing power node of the task to be executed currently operated by the heterogeneous computing system as an execution computing power node, taking the current task parameter corresponding to each test computing power node as a corresponding target task parameter, and generating a computing power calling scheme.
In an exemplary implementation of the foregoing embodiment, the node optimization module 803 may be further configured to: the task to be executed being a distributed training task, obtain the batch size of each custom computing power node from the custom computing power node parameters and the single-iteration time consumption of each test computing power node from the test results; obtain the maximum-time-consuming computing power node with the maximum single-iteration time consumption in the current test results; adjust the single-iteration time consumption of the minimum-time-consuming computing power node to that of the first target computing power node, increase the batch size value of the minimum-time-consuming node in the custom computing power node parameters, and decrease the batch size value of the maximum-time-consuming node in the custom computing power node parameters; select the node with the current minimum single-iteration time consumption as the minimum-time-consuming node and repeat the adjustment step, updating the test results and the custom computing power node parameters each time, until the batch size value of the maximum-time-consuming node in the custom computing power node parameters is 0. If the maximum single-iteration time consumption of the original test results is greater than or equal to that of the updated test results, delete the maximum-time-consuming node together with its batch size and single-iteration time consumption, take the node with the current maximum single-iteration time consumption as the maximum-time-consuming node, and repeat the step of obtaining the maximum-time-consuming node of the current test results, until the maximum single-iteration time consumption of the original test results is smaller than that of the updated test results. The first target computing power node has the same computing power type as the minimum-time-consuming node, and its batch size value is larger than the batch size of the minimum-time-consuming node.
In another exemplary implementation of the foregoing embodiment, the node optimization module 803 may be further configured to: when adjusting the single-iteration time consumption of the minimum-time-consuming node to that of the first target computing power node, increase the batch size value of the minimum-time-consuming node in the custom computing power node parameters by the second preset value each time, and decrease the batch size value of the maximum-time-consuming node in the custom computing power node parameters by the third preset value each time; the first target computing power node has the same computing power type as the minimum-time-consuming node, and its batch size value exceeds that of the minimum-time-consuming node by the first preset value.
In yet another exemplary implementation of the foregoing embodiment, the node optimization module 803 may be further configured to: if the single-iteration time consumption value of the first target computing power node does not exist or is insufficient-video-memory information, select the node whose single-iteration time consumption is the next smallest after the minimum-time-consuming node as the minimum-time-consuming node, and repeat the step of adjusting the single-iteration time consumption of the minimum-time-consuming node to that of the first target computing power node; if all computing power nodes in the test results have been traversed without successfully adjusting the single-iteration time consumption value of the current minimum-time-consuming node, judge whether the batch size of each test computing power node in the currently updated test results is identical to the user-defined batch size value in the task execution request; if identical, neither the custom computing power nodes nor their corresponding task parameters are adjusted; if different, take the batch size of each test computing power node in the updated test results and the correspondingly updated custom computing power node parameters as the execution computing power nodes and their corresponding target task parameters.
From the perspective of functional modules, please refer to fig. 9, which is a block diagram of a computing power node selection apparatus of a heterogeneous computing system according to an embodiment of the present invention; the apparatus processes distributed training tasks using the heterogeneous computing system and may include:
the training task running module 901 is configured to, when a distributed training task execution request carrying a network model to be trained and user-selected custom computing power node parameters is received, invoke each custom computing power node to run the network model to be trained based on its corresponding batch size;
the training task deployment module 902 is configured to, when the distributed training task execution request is received, generate a plurality of test tasks deployed to the heterogeneous computing system by utilizing each test computing power node of the heterogeneous computing system and configuring each test computing power node to train the network model to be trained with a different batch size (a sketch of this profiling step follows the module list);
the training node optimization module 903 is configured to determine, according to the test result of each test task and the custom computing power node parameters, a computing power calling scheme that uses the fewest computing resources, and correspondingly adjust the custom computing power node parameters based on the computing power calling scheme; wherein each test computing power node has the same type as its corresponding custom computing power node; the computing power calling scheme comprises the execution computing power nodes for executing the optimized task to be executed and the target task parameters corresponding to each execution computing power node; the execution computing power nodes are a subset of the custom computing power nodes.
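As referenced in the module descriptions above, the following is a minimal sketch of the profiling step performed by the training task deployment module; the `submit` callable, the `is_idle` attribute, and the data shapes are assumptions for illustration rather than the patent's interfaces.

```python
def deploy_test_tasks(node_classes, model, batch_sizes, submit):
    # node_classes: {node_type: [node, ...]} for the heterogeneous system.
    # submit(node, model, batch_size) is assumed to run one training
    # subtask until it reaches a stable running state and return the
    # single-iteration time, or None on a GPU out-of-memory error.
    results = {}
    for node_type, members in node_classes.items():
        tester = next((n for n in members if n.is_idle), None)
        if tester is None:
            continue  # the patent instead waits for an idle node here
        for bs in batch_sizes:
            results[(node_type, bs)] = submit(tester, model, bs)
    return results
```

The returned mapping plays the role of the test results (the `profile`) consumed by the rebalancing sketch shown earlier.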
The functions of each functional module of the computing power node selection apparatus of the heterogeneous computing system can be implemented according to the method in the foregoing method embodiment; for the specific implementation process, reference may be made to the related description of the method embodiment, which is not repeated herein.
As can be seen from the above, the present embodiment can implement optimal selection of computing power nodes for executing tasks in heterogeneous computing systems, effectively improve task execution efficiency, and save computing resources of heterogeneous computing systems.
The computing power node selection apparatus of the heterogeneous computing system has been described from the perspective of functional modules; further, the present invention also provides an electronic device, described from the perspective of hardware. Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 10, the electronic device comprises a memory 100 for storing a computer program, and a processor 101 for implementing the steps of the computing power node selection method of a heterogeneous computing system mentioned in any of the above embodiments when executing the computer program.
Processor 101 may include one or more processing cores, such as a 4-core or 8-core processor, and may also be a controller, microcontroller, microprocessor, or other data processing chip. The processor 101 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 101 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 101 may be integrated with a GPU (Graphics Processing Unit) responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 101 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
Memory 100 may include one or more computer-readable storage media, which may be non-transitory. Memory 100 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, memory 100 may be an internal storage unit of the electronic device, such as a hard disk of a server. In other embodiments, memory 100 may be an external storage device of the electronic device, such as a plug-in hard disk provided on a server, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card. Further, the memory 100 may include both an internal storage unit and an external storage device of the electronic device. The memory 100 may be used to store not only application software installed in the electronic device and various types of data, such as the code of the program that performs the computing power node selection method of the heterogeneous computing system, but also to temporarily store data that has been output or is to be output. In this embodiment, the memory 100 is at least used to store a computer program 1001 which, when loaded and executed by the processor 101, implements the relevant steps of the computing power node selection method of the heterogeneous computing system disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 100 may further include an operating system 1002 and data 1003, and the storage manner may be transient or permanent. The operating system 1002 may include Windows, Unix, Linux, and the like. The data 1003 may include, but is not limited to, data corresponding to the computing power node selection results of the heterogeneous computing system.
In some embodiments, the electronic device may further include a display 102, an input/output interface 103, a communication interface 104 (also referred to as a network interface), a power supply 105, and a communication bus 106. The display 102 and the input/output interface 103, such as a keyboard, belong to the user interface, which may optionally also include standard wired and wireless interfaces. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device and for displaying a visual user interface. The communication interface 104 may optionally include a wired interface and/or a wireless interface, such as a WI-FI interface or a Bluetooth interface, and is typically used to establish a communication connection between the electronic device and other electronic devices. The communication bus 106 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 10, but this does not mean that there is only one bus or one type of bus.
Those skilled in the art will appreciate that the configuration shown in fig. 10 does not limit the electronic device, which may include more or fewer components than shown; for example, it may also include a sensor 107 to perform various functions.
The functions of each functional module of the electronic device of the present invention may be specifically implemented according to the method in the above method embodiment, and the specific implementation process may refer to the relevant description of the above method embodiment, which is not repeated herein.
As can be seen from the above, the present embodiment can implement optimal selection of computing power nodes for executing tasks in heterogeneous computing systems, effectively improve task execution efficiency, and save computing resources of heterogeneous computing systems.
It will be appreciated that the computing power node selection method of a heterogeneous computing system in the above embodiments, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored on a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the related art, in whole or in part, may be embodied in the form of a software product stored in a storage medium, which performs all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, registers, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a removable disk, a CD-ROM, a magnetic disk, an optical disk, or any other medium that can store program code.
Based on this, the present invention also provides a readable storage medium storing a computer program which, when executed by a processor, performs the steps of the computing power node selection method of a heterogeneous computing system according to any of the above embodiments.
Finally, the present invention also provides a heterogeneous computing system, see fig. 11, which may include:
the heterogeneous computing system may include multiple classes of computing power nodes 111 and a processor 101; each class of computing power nodes 111 may in turn include a plurality of computing power nodes, and the types and numbers of computing power nodes included in the heterogeneous computing system may be flexibly selected according to the actual situation. The processor 101 is configured to implement the steps of the computing power node selection method of a heterogeneous computing system described in any of the foregoing embodiments when executing the computer program stored in the memory. The processor 101 and each computing power node 111 may be connected by any communication means, such as a wired or remote connection.
As an exemplary implementation, the processor 101 may be deployed to one of the computing power nodes in the heterogeneous computing system, which is defined as the target computing power node for ease of description; the target computing power node is the computing power node with the highest computing power performance among all computing power nodes in the heterogeneous computing system.
As an exemplary embodiment, the processor 101 may also be deployed to any server in communication with each computing node of the heterogeneous computing system.
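As a hedged illustration of these two deployment options (the function name `place_scheduler` and the `peak_flops` attribute are assumptions for the sketch, not terms from the patent):

```python
def place_scheduler(nodes, external_server=None):
    # Second option: run the selection logic on a separate server that
    # is communicatively coupled to every computing power node.
    if external_server is not None:
        return external_server
    # First option: run it on the target computing power node, i.e. the
    # node with the highest computing power performance in the system.
    return max(nodes, key=lambda n: n.peak_flops)
```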
The functions of each functional module of the computing power node selection system of the heterogeneous computing system according to the embodiment of the present invention may be specifically implemented according to the method in the embodiment of the method, and the specific implementation process may refer to the related description of the embodiment of the method, which is not repeated herein.
As can be seen from the above, the present embodiment can implement optimal selection of computing power nodes for executing tasks in heterogeneous computing systems, effectively improve task execution efficiency, and save computing resources of heterogeneous computing systems.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the hardware including the apparatus, the electronic device and the system disclosed in the embodiments, since the hardware corresponds to the method disclosed in the embodiments, the description is simpler, and the relevant parts refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative elements and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The method, apparatus, electronic device, readable storage medium, and heterogeneous computing system for computing power node selection of a heterogeneous computing system provided by the present invention have been described in detail above. The principles and embodiments of the present invention are explained herein with reference to specific examples, and the description is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that various modifications and adaptations of the invention will be apparent to those skilled in the art and can be made without departing from the principles of the invention; such modifications and adaptations are intended to fall within the scope of the invention as defined by the appended claims.

Claims (23)

1. A method of computing power node selection for a heterogeneous computing system, comprising:
when a task execution request carrying a task to be executed and user-selected custom computing power node parameters is received, running the task to be executed according to the custom computing power node parameters;
utilizing each test computing power node of the heterogeneous computing system, configuring a different test task parameter for each test computing power node to execute the task to be executed, and generating a plurality of test tasks deployed to the heterogeneous computing system;
determining, based on the test results of the test tasks and the custom computing power node parameters, a computing power calling scheme that uses the fewest computing resources, and correspondingly adjusting the custom computing power node parameters based on the computing power calling scheme;
wherein the custom computing power node parameters specify the computing power nodes of the heterogeneous computing system selected by the user and the parameters of each computing power node when running the task; a custom computing power node is a computing power node selected by the user for running the task to be executed; a test computing power node is a computing power node of the heterogeneous computing system for running test tasks; an execution computing power node is an optimal computing power node finally selected by the heterogeneous computing system for running the optimized task to be executed; each test computing power node has the same type as its corresponding custom computing power node; the computing power calling scheme comprises the execution computing power nodes for executing the optimized task to be executed and the target task parameters corresponding to each execution computing power node; the execution computing power nodes are a subset of the custom computing power nodes;
wherein the determining, based on the test results of the test tasks and the custom computing power node parameters, of the computing power calling scheme that uses the fewest computing resources comprises:
distributing, by adjusting task parameters, the workload of first-class test computing power nodes whose computing performance does not meet a first preset target computing performance condition to second-class test computing power nodes whose computing performance meets a second preset target computing performance condition, in sequence, until the computing resources used for running the task to be executed are minimized;
taking the test computing power nodes on which the heterogeneous computing system currently runs the task to be executed as the execution computing power nodes, taking the current task parameters corresponding to these test computing power nodes as the corresponding target task parameters, and generating the computing power calling scheme;
wherein the task to be executed is a distributed training task, and the distributing, by adjusting task parameters, of the workload of the first-class test computing power nodes whose computing performance does not meet the first preset target computing performance condition to the second-class test computing power nodes whose computing performance meets the second preset target computing performance condition, in sequence, until the computing resources used for running the task to be executed are minimized, comprises the following steps:
S101: obtaining the batch size of each custom computing power node from the custom computing power node parameters, and obtaining the single-iteration time consumption of each test computing power node from the test results;
S102: obtaining the maximum time-consuming computing power node, namely the node with the maximum single-iteration time consumption in the current test results;
S103: selecting the minimum time-consuming computing power node, namely the node with the minimum single-iteration time consumption in the current test results, adjusting the single-iteration time consumption of the minimum time-consuming computing power node to that of a first target computing power node, adjusting the batch size of the minimum time-consuming computing power node in the custom computing power node parameters, and simultaneously adjusting the batch size of the maximum time-consuming computing power node in the custom computing power node parameters; wherein the first target computing power node has the same computing power type as the minimum time-consuming computing power node and a batch size larger than that of the minimum time-consuming computing power node;
S104: if the batch size of the maximum time-consuming computing power node in the custom computing power node parameters is 0, executing S105; if it is not 0, updating the custom computing power node parameters and the test results, and jumping to S103;
S105: if the maximum single-iteration time consumption of the original test results is greater than or equal to the maximum single-iteration time consumption in the updated test results, deleting the maximum time-consuming computing power node together with its batch size and single-iteration time consumption, updating the custom computing power node parameters and the test results, and jumping to S102, until the maximum single-iteration time consumption of the original test results is smaller than the maximum single-iteration time consumption in the updated test results;
wherein the adjusting of the custom computing power node parameters based on the computing power calling scheme comprises: selecting, from the custom computing power nodes, target computing power nodes that are not among the execution computing power nodes, terminating each target computing power node, and adjusting the task parameters of the remaining custom computing power nodes to the target task parameters of the corresponding execution computing power nodes.
2. The method for selecting a computing power node of a heterogeneous computing system according to claim 1, wherein the generating of a plurality of test tasks deployed to the heterogeneous computing system by utilizing each test computing power node of the heterogeneous computing system and configuring a different test task parameter for each test computing power node to execute the task to be executed comprises:
obtaining distributed training task information by parsing the task execution request;
generating a training test task according to the distributed training task information, and deploying the training test task to the corresponding classes of computing power nodes of the heterogeneous computing system;
wherein the distributed training task information carries the network model to be trained and the computing power node types, and the training test task comprises a plurality of training subtasks and the training parameters corresponding to each training subtask.
3. The method of claim 2, wherein the training parameters corresponding to each training subtask include a batch size, and the deploying of the training test task to the corresponding classes of computing power nodes of the heterogeneous computing system comprises:
selecting one computing power node from each class of computing power nodes in the heterogeneous computing system as a test computing power node;
and issuing each training subtask and its corresponding batch size to the selected computing power nodes in the corresponding classes of computing power nodes of the heterogeneous computing system, so that each test computing power node trains the same network model to be trained using training data of different scales.
4. The method of computing power node selection for a heterogeneous computing system according to claim 3, wherein the issuing of each training subtask and its corresponding batch size to the selected computing power nodes in the corresponding classes of computing power nodes of the heterogeneous computing system comprises:
issuing each training subtask and its corresponding batch size to idle computing power nodes in the corresponding classes of computing power nodes of the heterogeneous computing system.
5. The method of claim 4, wherein the issuing of each training subtask and its corresponding batch size to an idle computing power node in the corresponding class of computing power nodes of the heterogeneous computing system comprises:
for each class of computing power nodes of the heterogeneous computing system, judging whether an idle computing power node exists in the current class of computing power nodes;
if an idle computing power node exists in the current class, issuing each training subtask and its corresponding batch size to the idle computing power node in the current class; if no idle computing power node exists in the current class, waiting for an idle computing power node to appear and then issuing each training subtask and its corresponding batch size to that idle computing power node.
6. The method of computing power node selection for a heterogeneous computing system of claim 5, further comprising, when the current class of computing power nodes has no idle computing power node:
judging whether the current waiting time exceeds a preset waiting time threshold;
and if the current waiting time exceeds the waiting time threshold, leaving the custom computing power nodes and their corresponding task parameters unadjusted.
7. A method for selecting a computing power node of a heterogeneous computing system according to claim 3, wherein after each training subtask and the corresponding batch size thereof are issued to the selected computing power node of the corresponding class of computing power nodes of the heterogeneous computing system, the method further comprises:
monitoring the task running state of each test computing power node;
when it is detected that the current training subtask of the current test computing power node is in a stable running state, recording the single-iteration time consumption of the current training subtask;
when it is detected that the current test computing power node reports a GPU out-of-memory error, generating insufficient GPU memory information; wherein the insufficient GPU memory information comprises an error identifier, a training subtask identifier, a test computing power node identifier, and the current batch size value;
and generating the test results according to the single-iteration time consumption of each training subtask and the insufficient GPU memory information.
8. The method of computing power node selection for a heterogeneous computing system of claim 7, further comprising, after the recording of the single-iteration time consumption of the current training subtask:
sending, to the current test computing power node, an instruction to stop running the current training subtask and execute the next training subtask.
9. The method for selecting a computing power node of a heterogeneous computing system according to claim 7, further comprising, after detecting that the current test computing power node reports a GPU out-of-memory error:
if the current test computing power node has a target training subtask whose batch size value is larger than the current batch size value, generating an instruction to stop running the target training subtask.
10. The method for selecting a computing power node of a heterogeneous computing system according to claim 1, wherein the adjusting the custom computing power node parameter based on the computing power calling scheme comprises:
and releasing the resources of each target computing power node and setting the running state of each target computing power node to an idle state.
11. The method for selecting a computing power node of a heterogeneous computing system according to claim 10, wherein the task to be performed is a multi-iteration task type, and the adjusting the custom computing power node parameter based on the computing power calling scheme includes:
and after the heterogeneous computing system detects that the current iteration update of the task to be executed is completed, correspondingly adjusting the custom computing power nodes and their task parameters based on the computing power calling scheme.
12. The method of computing power node selection for a heterogeneous computing system of any of claims 1 to 11, wherein after generating a plurality of test tasks for deployment to the heterogeneous computing system, further comprising:
calling a pre-created test task table; wherein the test task table comprises a test task identifier entry, a test computing power node type entry, a test task parameter entry, and a test result entry;
and filling the corresponding entry contents of the test task table according to the test computing power nodes executing the test tasks, their corresponding test task parameters, and the test results of the test tasks.
13. The method of claim 12, further comprising, after determining the computing power calling scheme that uses the fewest computing resources:
calling a pre-created node optimization table; wherein the node optimization table comprises an execution computing power node type entry and a target task parameter entry;
and filling the corresponding entry contents of the node optimization table according to each execution computing power node and its corresponding target task parameters, and outputting the node optimization table.
14. The method for selecting a computing power node of a heterogeneous computing system according to claim 1, wherein the adjusting of the single-iteration time consumption of the minimum time-consuming computing power node to that of the first target computing power node, the adjusting of the batch size of the minimum time-consuming computing power node in the custom computing power node parameters, and the simultaneous adjusting of the batch size of the maximum time-consuming computing power node in the custom computing power node parameters comprise:
adjusting the single-iteration time consumption of the minimum time-consuming computing power node to that of the first target computing power node, increasing the batch size of the minimum time-consuming computing power node in the custom computing power node parameters by a second preset value at a time, and decreasing the batch size of the maximum time-consuming computing power node in the custom computing power node parameters by a third preset value at a time;
wherein the first target computing power node has the same computing power type as the minimum time-consuming computing power node, and its batch size is larger than that of the minimum time-consuming computing power node by a first preset value.
15. The method for selecting a computing power node of a heterogeneous computing system according to claim 1, wherein after obtaining the maximum time-consuming computing power node with the maximum single-iteration time consumption in the current test results, the method further comprises:
if the single-iteration time consumption of the first target computing power node does not exist or is recorded as insufficient GPU memory information, selecting the node whose single-iteration time consumption is the next smallest after the minimum time-consuming computing power node as the new minimum time-consuming computing power node, and jumping to S103;
if all computing power nodes in the test results have been traversed and the single-iteration time consumption of the current minimum time-consuming computing power node still cannot be successfully adjusted, judging whether the batch size of each test computing power node in the currently updated test results is identical to the batch size value customized by the user in the task execution request;
if identical, leaving the custom computing power nodes and their corresponding task parameters unadjusted; if different, taking each test computing power node in the updated test results, with its batch size and the updated custom computing power node parameters, as the execution computing power nodes and their corresponding target task parameters.
16. A method of computing power node selection for a heterogeneous computing system, comprising:
when a distributed training task execution request carrying a network model to be trained and user-selected custom computing power node parameters is received, calling each custom computing power node to run the network model to be trained based on its corresponding batch size; wherein the custom computing power node parameters specify the computing power nodes of the heterogeneous computing system selected by the user and the parameters of each computing power node when running the task; a custom computing power node is a computing power node selected by the user for running the task to be executed;
utilizing each test computing power node of the heterogeneous computing system, configuring a different batch size for each test computing power node to train the network model to be trained, and generating a plurality of test tasks deployed to the heterogeneous computing system;
determining, according to the test results of the test tasks and the custom computing power node parameters, a computing power calling scheme that uses the fewest computing resources, and correspondingly adjusting the custom computing power node parameters based on the computing power calling scheme; wherein a test computing power node is a computing power node of the heterogeneous computing system for running test tasks; an execution computing power node is an optimal computing power node finally selected by the heterogeneous computing system for running the optimized task to be executed; each test computing power node has the same type as its corresponding custom computing power node; the computing power calling scheme comprises the execution computing power nodes for running the optimized network model to be trained and the batch size corresponding to each execution computing power node; the execution computing power nodes are a subset of the custom computing power nodes;
wherein the determining, according to the test results of the test tasks and the custom computing power node parameters, of the computing power calling scheme that uses the fewest computing resources comprises:
distributing, by adjusting task parameters, the workload of first-class test computing power nodes whose computing performance does not meet a first preset target computing performance condition to second-class test computing power nodes whose computing performance meets a second preset target computing performance condition, in sequence, until the computing resources used for running the task to be executed are minimized;
taking the test computing power nodes on which the heterogeneous computing system currently runs the task to be executed as the execution computing power nodes, taking the current task parameters corresponding to these test computing power nodes as the corresponding target task parameters, and generating the computing power calling scheme;
wherein the task to be executed is a distributed training task, and the distributing, by adjusting task parameters, of the workload of the first-class test computing power nodes whose computing performance does not meet the first preset target computing performance condition to the second-class test computing power nodes whose computing performance meets the second preset target computing performance condition, in sequence, until the computing resources used for running the task to be executed are minimized, comprises the following steps:
S101: obtaining the batch size of each custom computing power node from the custom computing power node parameters, and obtaining the single-iteration time consumption of each test computing power node from the test results;
S102: obtaining the maximum time-consuming computing power node, namely the node with the maximum single-iteration time consumption in the current test results;
S103: selecting the minimum time-consuming computing power node, namely the node with the minimum single-iteration time consumption in the current test results, adjusting the single-iteration time consumption of the minimum time-consuming computing power node to that of a first target computing power node, adjusting the batch size of the minimum time-consuming computing power node in the custom computing power node parameters, and simultaneously adjusting the batch size of the maximum time-consuming computing power node in the custom computing power node parameters; wherein the first target computing power node has the same computing power type as the minimum time-consuming computing power node and a batch size larger than that of the minimum time-consuming computing power node;
S104: if the batch size of the maximum time-consuming computing power node in the custom computing power node parameters is 0, executing S105; if it is not 0, updating the custom computing power node parameters and the test results, and jumping to S103;
S105: if the maximum single-iteration time consumption of the original test results is greater than or equal to the maximum single-iteration time consumption in the updated test results, deleting the maximum time-consuming computing power node together with its batch size and single-iteration time consumption, updating the custom computing power node parameters and the test results, and jumping to S102, until the maximum single-iteration time consumption of the original test results is smaller than the maximum single-iteration time consumption in the updated test results;
wherein the adjusting of the custom computing power node parameters based on the computing power calling scheme comprises: selecting, from the custom computing power nodes, target computing power nodes that are not among the execution computing power nodes, terminating each target computing power node, and adjusting the task parameters of the remaining custom computing power nodes to the target task parameters of the corresponding execution computing power nodes.
17. A computing power node selection apparatus for a heterogeneous computing system, comprising:
a task running module, configured to run the task to be executed according to the custom computing power node parameters when a task execution request carrying the task to be executed and user-selected custom computing power node parameters is received; wherein the custom computing power node parameters specify the computing power nodes of the heterogeneous computing system selected by the user and the parameters of each computing power node when running the task; a custom computing power node is a computing power node selected by the user for running the task to be executed;
a test task deployment module, configured to utilize each test computing power node of the heterogeneous computing system, configure a different test task parameter for each test computing power node to execute the task to be executed, and generate a plurality of test tasks deployed to the heterogeneous computing system;
a node optimization module, configured to determine, based on the test results of the test tasks and the custom computing power node parameters, a computing power calling scheme that uses the fewest computing resources, and correspondingly adjust the custom computing power node parameters based on the computing power calling scheme; wherein a test computing power node is a computing power node of the heterogeneous computing system for running test tasks; an execution computing power node is an optimal computing power node finally selected by the heterogeneous computing system for running the optimized task to be executed; each test computing power node has the same type as its corresponding custom computing power node; the computing power calling scheme comprises the execution computing power nodes for executing the optimized task to be executed and the target task parameters corresponding to each execution computing power node; the execution computing power nodes are a subset of the custom computing power nodes;
wherein the node optimization module is further configured to:
distributing, by adjusting task parameters, the workload of first-class test computing power nodes whose computing performance does not meet a first preset target computing performance condition to second-class test computing power nodes whose computing performance meets a second preset target computing performance condition, in sequence, until the computing resources used for running the task to be executed are minimized;
taking the test computing power nodes on which the heterogeneous computing system currently runs the task to be executed as the execution computing power nodes, taking the current task parameters corresponding to these test computing power nodes as the corresponding target task parameters, and generating the computing power calling scheme;
wherein the task to be executed is a distributed training task, and the distributing, by adjusting task parameters, of the workload of the first-class test computing power nodes whose computing performance does not meet the first preset target computing performance condition to the second-class test computing power nodes whose computing performance meets the second preset target computing performance condition, in sequence, until the computing resources used for running the task to be executed are minimized, comprises the following steps:
S101: obtaining the batch size of each custom computing power node from the custom computing power node parameters, and obtaining the single-iteration time consumption of each test computing power node from the test results;
S102: obtaining the maximum time-consuming computing power node, namely the node with the maximum single-iteration time consumption in the current test results;
S103: selecting the minimum time-consuming computing power node, namely the node with the minimum single-iteration time consumption in the current test results, adjusting the single-iteration time consumption of the minimum time-consuming computing power node to that of a first target computing power node, adjusting the batch size of the minimum time-consuming computing power node in the custom computing power node parameters, and simultaneously adjusting the batch size of the maximum time-consuming computing power node in the custom computing power node parameters; wherein the first target computing power node has the same computing power type as the minimum time-consuming computing power node and a batch size larger than that of the minimum time-consuming computing power node;
S104: if the batch size of the maximum time-consuming computing power node in the custom computing power node parameters is 0, executing S105; if it is not 0, updating the custom computing power node parameters and the test results, and jumping to S103;
S105: if the maximum single-iteration time consumption of the original test results is greater than or equal to the maximum single-iteration time consumption in the updated test results, deleting the maximum time-consuming computing power node together with its batch size and single-iteration time consumption, updating the custom computing power node parameters and the test results, and jumping to S102, until the maximum single-iteration time consumption of the original test results is smaller than the maximum single-iteration time consumption in the updated test results;
wherein the adjusting of the custom computing power node parameters based on the computing power calling scheme comprises: selecting, from the custom computing power nodes, target computing power nodes that are not among the execution computing power nodes, terminating each target computing power node, and adjusting the task parameters of the remaining custom computing power nodes to the target task parameters of the corresponding execution computing power nodes.
18. A computing power node selection apparatus for a heterogeneous computing system, comprising:
a training task running module, configured to, when a distributed training task execution request carrying a network model to be trained and user-selected custom computing power node parameters is received, call each custom computing power node to run the network model to be trained based on its corresponding batch size; wherein the custom computing power node parameters specify the computing power nodes of the heterogeneous computing system selected by the user and the parameters of each computing power node when running the task; a custom computing power node is a computing power node selected by the user for running the task to be executed;
a training task deployment module, configured to utilize each test computing power node of the heterogeneous computing system, configure a different batch size for each test computing power node to train the network model to be trained, and generate a plurality of test tasks deployed to the heterogeneous computing system;
a training node optimization module, configured to determine, according to the test results of the test tasks and the custom computing power node parameters, a computing power calling scheme that uses the fewest computing resources, and correspondingly adjust the custom computing power node parameters based on the computing power calling scheme; wherein a test computing power node is a computing power node of the heterogeneous computing system for running test tasks; an execution computing power node is an optimal computing power node finally selected by the heterogeneous computing system for running the optimized task to be executed; each test computing power node has the same type as its corresponding custom computing power node; the computing power calling scheme comprises the execution computing power nodes for running the optimized network model to be trained and the target task parameters corresponding to each execution computing power node; the execution computing power nodes are a subset of the custom computing power nodes;
Wherein the training node optimization module is further configured to:
distributing, by adjusting task parameters, the workload of first-class test computing power nodes whose computing performance does not meet a first preset target computing performance condition to second-class test computing power nodes whose computing performance meets a second preset target computing performance condition, in sequence, until the computing resources used for running the task to be executed are minimized;
taking the test computing power nodes on which the heterogeneous computing system currently runs the task to be executed as the execution computing power nodes, taking the current task parameters corresponding to these test computing power nodes as the corresponding target task parameters, and generating the computing power calling scheme;
wherein the task to be executed is a distributed training task, and the distributing, by adjusting task parameters, of the workload of the first-class test computing power nodes whose computing performance does not meet the first preset target computing performance condition to the second-class test computing power nodes whose computing performance meets the second preset target computing performance condition, in sequence, until the computing resources used for running the task to be executed are minimized, comprises the following steps:
S101: obtaining the batch size of each custom computing power node from the custom computing power node parameters, and obtaining the single-iteration time consumption of each test computing power node from the test results;
S102: obtaining the maximum time-consuming computing power node, namely the node with the maximum single-iteration time consumption in the current test results;
S103: selecting the minimum time-consuming computing power node, namely the node with the minimum single-iteration time consumption in the current test results, adjusting the single-iteration time consumption of the minimum time-consuming computing power node to that of a first target computing power node, adjusting the batch size of the minimum time-consuming computing power node in the custom computing power node parameters, and simultaneously adjusting the batch size of the maximum time-consuming computing power node in the custom computing power node parameters; wherein the first target computing power node has the same computing power type as the minimum time-consuming computing power node and a batch size larger than that of the minimum time-consuming computing power node;
S104: if the batch size of the maximum time-consuming computing power node in the custom computing power node parameters is 0, executing S105; if it is not 0, updating the custom computing power node parameters and the test results, and jumping to S103;
S105: if the maximum single-iteration time consumption of the original test results is greater than or equal to the maximum single-iteration time consumption in the updated test results, deleting the maximum time-consuming computing power node together with its batch size and single-iteration time consumption, updating the custom computing power node parameters and the test results, and jumping to S102, until the maximum single-iteration time consumption of the original test results is smaller than the maximum single-iteration time consumption in the updated test results;
wherein the adjusting of the custom computing power node parameters based on the computing power calling scheme comprises: selecting, from the custom computing power nodes, target computing power nodes that are not among the execution computing power nodes, terminating each target computing power node, and adjusting the task parameters of the remaining custom computing power nodes to the target task parameters of the corresponding execution computing power nodes.
19. An electronic device comprising a processor and a memory, the processor being configured to implement the steps of the computing power node selection method of a heterogeneous computing system according to any of claims 1 to 15 and/or the computing power node selection method of a heterogeneous computing system according to claim 16 when executing a computer program stored in the memory.
20. A readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the computing power node selection method of a heterogeneous computing system according to any of claims 1 to 15 and/or the computing power node selection method of a heterogeneous computing system according to claim 16.
21. A heterogeneous computing system comprising a plurality of classes of computing power nodes and a processor, the processor being configured to implement the steps of the computing power node selection method of a heterogeneous computing system according to any of claims 1 to 15 and/or the computing power node selection method of a heterogeneous computing system according to claim 16 when executing a computer program stored in a memory.
22. The heterogeneous computing system of claim 21, wherein the processor is deployed to a target computing power node;
the target computing power node is the computing power node with the highest computing power performance in all computing power nodes in the heterogeneous computing system.
23. The heterogeneous computing system of claim 21, wherein the processor is deployed to a server;
the server is communicatively coupled to each of the computing nodes of the heterogeneous computing system.
CN202311219994.1A 2023-09-21 2023-09-21 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system Active CN116954873B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311219994.1A CN116954873B (en) 2023-09-21 2023-09-21 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system

Publications (2)

Publication Number Publication Date
CN116954873A CN116954873A (en) 2023-10-27
CN116954873B (en) 2024-01-23

Family

ID=88460537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311219994.1A Active CN116954873B (en) 2023-09-21 2023-09-21 Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system

Country Status (1)

Country Link
CN (1) CN116954873B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10839255B2 (en) * 2017-05-15 2020-11-17 International Business Machines Corporation Load-balancing training of recommender system for heterogeneous systems

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444026A (en) * 2020-04-20 2020-07-24 北京工业大学 Deep learning training resource allocation prediction method in cloud environment
WO2022033024A1 (en) * 2020-08-12 2022-02-17 中国银联股份有限公司 Distributed training method and apparatus of deep learning model
WO2022037039A1 (en) * 2020-08-18 2022-02-24 中国银联股份有限公司 Neural network architecture search method and apparatus
CN114996001A (en) * 2022-05-23 2022-09-02 杭州电子科技大学 Distributed machine learning task GPU resource scheduling and distributing method and system
CN114780225A (en) * 2022-06-14 2022-07-22 支付宝(杭州)信息技术有限公司 Distributed model training system, method and device
CN116680060A (en) * 2023-08-02 2023-09-01 浪潮电子信息产业股份有限公司 Task allocation method, device, equipment and medium for heterogeneous computing system
CN116702885A (en) * 2023-08-02 2023-09-05 浪潮电子信息产业股份有限公司 Synchronous data parallel training control method, system, device, equipment and medium
CN116701043A (en) * 2023-08-04 2023-09-05 浪潮电子信息产业股份有限公司 Heterogeneous computing system-oriented fault node switching method, device and equipment
CN116720544A (en) * 2023-08-04 2023-09-08 浪潮电子信息产业股份有限公司 Model training time-consuming prediction method, device and system based on heterogeneous computing system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
New scheduling approach using reinforcement learning for heterogeneous distributed systems; Alexandru Iulian Orhean et al.; Journal of Parallel and Distributed Computing; full text *
Training Sample Selection for Space-Time Adaptive Processing in Heterogeneous Environments; Yifeng Wu et al.; IEEE Geoscience and Remote Sensing Letters; full text *
Load-adaptive feedback scheduling strategy for heterogeneous Hadoop clusters; Pan Jiayi; Wang Fang; Yang Jingyi; Tan Zhipeng; Computer Engineering and Science (03); full text *
Optimized computing services and task scheduling strategies for complex process systems in grid environments; Wang Kexin; Shao Zhijiang; Qian Jixin; Systems Engineering Theory and Practice (11); full text *

Also Published As

Publication number Publication date
CN116954873A (en) 2023-10-27

Similar Documents

Publication Publication Date Title
CN110245023B (en) Distributed scheduling method and device, electronic equipment and computer storage medium
CN111324445B (en) Task scheduling simulation system
CN107729138B (en) Method and device for analyzing high-performance distributed vector space data
CN103593192A (en) Algorithm integration and evaluation platform and method based on SLURM scheduling
CN112085217A (en) Method, device, equipment and computer medium for deploying artificial intelligence service
CN110290166B (en) Cross-cluster data interaction method, system and device and readable storage medium
CN111208975A (en) Concurrent execution service
CN111143039A (en) Virtual machine scheduling method and device and computer storage medium
CN115292014A (en) Image rendering method and device and server
CN113157379A (en) Cluster node resource scheduling method and device
US20230351145A1 (en) Pipelining and parallelizing graph execution method for neural network model computation and apparatus thereof
CN113010598A (en) Dynamic self-adaptive distributed cooperative workflow system for remote sensing big data processing
CN113890712A (en) Data transmission method and device, electronic equipment and readable storage medium
CN112559143A (en) Task scheduling method and system and computing device
CN105933154A (en) Management method of cloud calculation resources
CN113703781B (en) Storage system interface generation method and device, electronic equipment and readable storage medium
CN107092502A (en) A kind of application method of automatic configuration and device based on cloud operating system
CN116954873B (en) Heterogeneous computing system, and method, device, equipment and medium for selecting power nodes of heterogeneous computing system
US20230188432A1 (en) Method and Apparatus for Determining Resource Configuration of Cloud Service System
CN117331668A (en) Job scheduling method, device, equipment and storage medium
CN110868330B (en) Evaluation method, device and evaluation system for CPU resources which can be divided by cloud platform
CN107819598A (en) A kind of method and device for managing network function node
CN103902445A (en) Regression test object determination method and device
CN115543543A (en) Application service processing method, device, equipment and medium
CN109710484A (en) Method of adjustment, device and the computer readable storage medium of equipment energy consumption

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant