CN116980423A - Model scheduling method, device, computing system, equipment and readable storage medium - Google Patents

Model scheduling method, device, computing system, equipment and readable storage medium

Info

Publication number
CN116980423A
Authority
CN
China
Prior art keywords
network layer
scheduling
accelerator
model
delay
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311220749.2A
Other languages
Chinese (zh)
Other versions
CN116980423B (en)
Inventor
王丽
郭振华
赵雅倩
唐轶男
曹芳
高开
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN202311220749.2A priority Critical patent/CN116980423B/en
Publication of CN116980423A publication Critical patent/CN116980423A/en
Application granted granted Critical
Publication of CN116980423B publication Critical patent/CN116980423B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H04L67/1008: Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • H04L67/101: Server selection for load balancing based on network conditions
    • H04L67/1012: Server selection for load balancing based on compliance of requirements or conditions with available server resources
    • H04L69/325: Intralayer communication protocols among peer entities or protocol data unit [PDU] definitions in the network layer [OSI layer 3], e.g. X.25
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/08: Learning methods
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a model scheduling method, apparatus, computing system, device, and readable storage medium in the technical field of deep learning. The method includes: mapping the network layers of a model onto accelerators of the computing system based on a computing-power-priority policy to obtain a scheduling policy; performing a tentative remapping of a specific network layer, performing communication delay optimization on the tentatively remapped scheduling policy by using the memory of an accelerator, and obtaining the optimized total system delay; when the optimized total system delay is lower than the total system delay before optimization, updating the scheduling policy to the remapped scheduling policy based on the tentative remapping; and scheduling the model according to the updated, remapped scheduling policy. The technical effect of the invention is that, at the cost of a small loss in computing performance, a larger reduction in communication cost is obtained, which ultimately improves overall system performance, balances computation and communication, and improves the utilization of computing and storage resources.

Description

Model scheduling method, device, computing system, equipment and readable storage medium
Technical Field
The present invention relates to the field of deep learning technologies, and in particular, to a model scheduling method, apparatus, computing system, device, and readable storage medium.
Background
Machine learning algorithms are evolving from handling single-modal, single-task workloads to handling multi-modal, multi-task workloads. This change leads to ever larger deep models, increasingly complex connections between blocks, great heterogeneity in network layer types, and complex inter-layer connections and data exchanges.
In a computing system, the types and number of accelerators are also increasing. Different accelerators exhibit different delays when running different network layers, and the delay also depends on where the data is stored. Delay is a primary metric for evaluating overall performance.
In summary, how to balance computing power and communication when scheduling a model has become a critical problem that model scheduling urgently needs to solve.
Disclosure of Invention
The invention aims to provide a model scheduling method, apparatus, computing system, device, and readable storage medium that can map the computing tasks of a deep learning network model onto accelerators well suited to their computing patterns, which is the key to coordinating multiple kinds of computing power efficiently and improving the overall computing efficiency of the computing system.
In order to solve the above technical problems, the invention provides the following technical solutions:
a model scheduling method, comprising:
mapping the network layers of a model onto accelerators of a computing system based on a computing-power-priority policy to obtain a scheduling policy;
performing a tentative remapping of a specific network layer, performing communication delay optimization on the tentatively remapped scheduling policy by using the memory of an accelerator, and obtaining the optimized total system delay;
updating the scheduling policy to the remapped scheduling policy based on the tentative remapping when the optimized total system delay is lower than the total system delay before optimization;
and scheduling the model according to the updated, remapped scheduling policy.
Preferably, searching for the specific network layer includes:
traversing each network layer of the model in sequence, and determining the accelerator that each network layer is mapped to;
judging whether the current network layer and its adjacent successor network layer are mapped to the same accelerator;
if yes, skipping the current network layer;
and if not, determining the current network layer as the specific network layer.
Preferably, performing the tentative remapping of the specific network layer includes:
tentatively remapping the specific network layer to the accelerator that its adjacent successor layer is mapped to, and obtaining the tentatively remapped scheduling policy.
Preferably, performing communication delay optimization on the tentatively remapped scheduling policy by using the memory of the accelerator includes:
writing the intermediate result of the specific network layer into the memory of the accelerator to which it is tentatively remapped.
Preferably, obtaining the optimized total system delay includes:
calculating the total delay of each accelerator using a performance analysis model;
and summing the total delays of all accelerators to obtain the total system delay.
Preferably, when the optimized total system delay is lower than the total system delay before optimization, updating the scheduling policy to the remapped scheduling policy based on the tentative remapping includes:
judging whether the optimized total system delay is lower than the total system delay before optimization;
if yes, remapping the specific network layer and updating its mapping relation in the scheduling policy to obtain the remapped scheduling policy;
and performing communication delay optimization on the remapped scheduling policy by using the memory of the accelerator, and iteratively updating the scheduling times of the specific network layer and its subsequent network layers.
Preferably, when the optimized total system delay is higher than or equal to the total system delay before optimization, the method includes:
skipping the current specific network layer and searching for the next specific network layer;
if a next specific network layer is found, performing a tentative remapping of that layer and obtaining the total system delay, and, when the total system delay is reduced, updating the scheduling policy to the optimized, remapped scheduling policy for that layer;
and if no next specific network layer exists, executing the step of scheduling the model according to the updated, remapped scheduling policy.
Preferably, mapping the network layers of the model onto accelerators of the computing system based on the computing-power-priority policy to obtain the scheduling policy includes:
iteratively obtaining the delays of the current network layer on different accelerators;
mapping the current network layer onto the accelerator with the smallest delay;
and obtaining the scheduling policy after all network layers of the model have been mapped based on the computing-power-priority policy.
Preferably, iteratively obtaining the delays of the current network layer on different accelerators includes:
obtaining a computation graph of the model;
taking, based on the computation graph, the network layers without predecessor nodes as a group;
and traversing the network layers in the group in turn, enumerating all possible mapping accelerators for the current network layer, and calculating the delay of running the current network layer on each mapping accelerator.
Preferably, enumerating all possible mapping accelerators for the current network layer includes:
enumerating, from a heterogeneous computing system, all mapping accelerators capable of running the current network layer;
wherein the accelerators in the heterogeneous computing system include a graphics processing unit, a field programmable gate array chip, a brain processing unit, and a tensor processing unit.
Preferably, calculating the delay of running the current network layer on each mapping accelerator includes:
clearing the local data of the mapping accelerator and setting the memory of the mapping accelerator as unavailable;
storing all weight parameters and intermediate results in host memory;
and calculating, using a performance analysis model, the delay of running the current network layer on the mapping accelerator.
Preferably, after mapping the network layers of the model onto accelerators of the computing system based on the computing-power-priority policy to obtain the scheduling policy, and before performing the tentative remapping of the specific network layer, the method further includes:
performing communication delay optimization on the scheduling policy by using the memory of the accelerator.
Preferably, performing communication delay optimization on the scheduling policy by using the memory of the accelerator includes:
writing the data content of the network layer into the memory of the accelerator that the network layer is mapped to;
and re-determining the communication delay of the network layer and updating the scheduling policy based on the communication delay.
Preferably, writing the data content of the network layer into the memory of the accelerator that the network layer is mapped to includes:
determining, based on the scheduling policy, the target accelerator that the network layer is mapped to;
obtaining the memory occupancy of the target accelerator;
and writing the data content of the network layer into the memory of the target accelerator when the memory occupancy is lower than a preset occupancy threshold.
Preferably, writing the data content of the network layer into the memory of the accelerator that the network layer is mapped to includes:
writing the weight parameters of the network layer into the memory of the accelerator that the network layer is mapped to.
Preferably, writing the data content of the network layer into the memory of the accelerator that the network layer is mapped to includes:
writing the intermediate result of the network layer into the memory of that accelerator if the adjacent successor layer of the network layer is mapped to the same accelerator.
Preferably, re-determining the communication delay of the network layer and updating the scheduling policy based on the communication delay includes:
obtaining the delay time for the accelerator to fetch the data content from the host;
updating the communication delay of the network layer based on the delay time;
and recursively updating the scheduling times in the scheduling policy with the updated communication delay of the network layer.
Preferably, the method includes:
the model is a heterogeneous model with a heterogeneous computing network layer;
the computing system is a heterogeneous computing system having heterogeneous accelerators.
Preferably, scheduling the model according to the updated, remapped scheduling policy includes:
scheduling the model according to the updated, remapped scheduling policy during the model training stage and/or the model use stage.
A heterogeneous computing system, comprising:
a plurality of heterogeneous accelerators, wherein the steps of the model scheduling method described above are implemented in the heterogeneous computing system.
A model scheduling apparatus comprising:
a computing-power-priority mapping module, configured to map the network layers of a model onto accelerators of a computing system based on a computing-power-priority policy to obtain a scheduling policy;
a remapping module, configured to perform a tentative remapping of a specific network layer, perform communication delay optimization on the tentatively remapped scheduling policy by using the memory of the accelerator, and obtain the optimized total system delay; and to update the scheduling policy to the remapped scheduling policy based on the tentative remapping when the optimized total system delay is lower than the total system delay before optimization;
and a scheduling module, configured to schedule the model according to the updated, remapped scheduling policy.
An electronic device, comprising:
a memory for storing a computer program;
and a processor for implementing the steps of the above model scheduling method when executing the computer program.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above model scheduling method.
By applying the method provided by the embodiments of the invention, the network layers of the model are mapped onto accelerators of the computing system based on the computing-power-priority policy to obtain a scheduling policy; a tentative remapping is performed for a specific network layer, communication delay optimization is performed on the tentatively remapped scheduling policy by using the memory of the accelerator, and the optimized total system delay is obtained; when the optimized total system delay is lower than the total system delay before optimization, the scheduling policy is updated to the remapped scheduling policy based on the tentative remapping; and the model is scheduled according to the updated, remapped scheduling policy.
In the invention, the scheduling policy is obtained after the network layers of the model are mapped onto accelerators with computing power taking priority. A tentative remapping is then performed for a specific network layer and, on that basis, communication delay optimization is performed on the tentatively remapped scheduling policy by using the memory of the accelerator, yielding the optimized total system delay. When comparison shows that the optimized total system delay is lower than the total system delay before optimization, the scheduling policy is updated to the remapped scheduling policy based on the tentative remapping. Finally, the model is scheduled according to the updated, remapped scheduling policy.
The technical effects of the invention are as follows: at the cost of a small loss in computing performance, a larger reduction in communication cost is obtained, which ultimately improves overall system performance, balances computation and communication, and improves the utilization of each accelerator's computing and storage resources during system-wide computation. At the same time, the limitations of the prevailing computing-power-priority mapping are overcome, and the complexity and heterogeneity of both the model and the system are taken into account, so that a mapping that considers computation and communication simultaneously can be found and the overall efficiency of the system is optimal.
Correspondingly, embodiments of the present invention also provide a model scheduling apparatus, a device, and a readable storage medium corresponding to the above model scheduling method, which have the same technical effects and are not described again here.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that other drawings can be derived from them by those skilled in the art without inventive effort.
FIG. 1 is a flowchart of a model scheduling method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a classical network model;
FIG. 3 is a schematic diagram of a multi-component heterogeneous computing system;
FIG. 4 is a schematic diagram illustrating an implementation of a model scheduling method according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating an implementation of computing-power-priority mapping according to an embodiment of the present invention;
FIG. 6 is a flow chart of a remapping implementation according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a heterogeneous network model and a heterogeneous accelerator according to an embodiment of the present invention;
FIG. 8 is a network layer grouping diagram of a heterogeneous network model according to an embodiment of the present invention;
FIG. 9 is a diagram illustrating the mapping obtained after computing-power-priority mapping according to an embodiment of the present invention;
FIG. 10 is a schematic diagram illustrating optimization of communication delay based on weight parameters according to an embodiment of the present invention;
FIG. 11 is a schematic diagram illustrating optimization of communication delay based on intermediate results according to an embodiment of the present invention;
FIG. 12 is a diagram illustrating a remapping according to an embodiment of the present invention;
FIG. 13 is a schematic diagram of a model scheduling apparatus according to an embodiment of the present invention;
FIG. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 15 is a schematic diagram of a specific structure of an electronic device according to an embodiment of the present invention;
FIG. 16 is a schematic diagram of a heterogeneous computing system according to an embodiment of the present invention.
Detailed Description
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart of a model scheduling method according to an embodiment of the invention, the method includes the following steps:
s101, mapping a network layer of the model to an accelerator of a computing system based on a power-calculation priority strategy to obtain a scheduling strategy.
The computing force priority strategy is to give priority to the optimal computing force matching when mapping the network layer of the model to the accelerator of the computing system, so as to obtain the scheduling strategy.
It should be noted that in the embodiment of the present invention, the type and number of network layers of the model are not limited, and the number and type of accelerators in the computing system are not limited. That is, the network layers of the model may be all the same network, or may be multiple networks, that is, the model may be a homogeneous model or a heterogeneous model; the network layer of the model can be small or large, namely, a small model or a large model. The accelerators may be the same or different for the computing system, i.e., the computing system may be a homogeneous system or a heterogeneous system, and the accelerators may be small or large, i.e., the system may be a microsystem or a larger system.
Wherein, heterogeneous model means that the network layer therein has at least two different networks. For example, a model comprising a convolution layer and a pooling layer is a heterogeneous model. In the embodiment of the present invention, the specific type of the network layer in the heterogeneous model is not specifically limited, and only at least two different networks exist, namely the heterogeneous model.
Wherein heterogeneous systems refer to at least two different accelerators within the system. For example, a computing system with an FPGA accelerator and a GPU accelerator is a heterogeneous system. Of course, in heterogeneous systems, the invention is not particularly limited as to which specific types of accelerators are specific.
Specifically, the running performance of each network layer on different accelerators can be obtained through modes of analog running and the like, and specific running performance can be measured by specific performance references such as delay and the like. And then, selecting the accelerator with the best operation effect of each network layer for mapping through transverse comparison.
Here, an accelerator refers to a device or card that can be used for acceleration, specifically including but not limited to at least one of the following:
GPU: Graphics Processing Unit, a graphics processor;
FPGA: Field Programmable Gate Array, a field programmable gate array chip;
APU: Accelerated Processing Unit, an accelerated processor;
BPU: Brain Processing Unit, a brain processor;
TPU: Tensor Processing Unit, a tensor processor;
DPU: Data Processing Unit, a data processor.
Based on the computing-power-priority policy, the network layers can be mapped one by one onto the accelerators on which they run best, thereby obtaining the scheduling policy. The scheduling policy includes the mapping relation between network layers and accelerators, as well as the scheduling time of each network layer.
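A minimal illustration (not from the patent text) of how such a scheduling policy could be represented; the field names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class SchedulingPolicy:
    # Which accelerator each network layer is mapped to, e.g. {"conv1": "GPU0"}.
    layer_to_accelerator: Dict[str, str] = field(default_factory=dict)
    # Scheduled start time of each network layer in milliseconds, e.g. {"conv1": 0.0}.
    scheduled_start: Dict[str, float] = field(default_factory=dict)
```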
In one embodiment of the present invention, mapping the network layers of the model onto accelerators of the computing system based on the computing-power-priority policy to obtain the scheduling policy includes:
iteratively obtaining the delays of the current network layer on different accelerators;
mapping the current network layer onto the accelerator with the smallest delay;
and obtaining the scheduling policy after all network layers of the model have been mapped with computing power taking priority.
For convenience of description, the above three steps are explained together below.
Considering inter-layer dependencies and the validity of the schedule, the network layers of the model can be traversed one by one in order, the delay of each network layer on different accelerators can be obtained, and each network layer is then mapped onto the accelerator with the smallest delay.
For example, suppose there are 10 accelerators in total and accelerator 2 has the smallest delay for the current network layer; the current network layer is then mapped onto accelerator 2.
Based on the computing-power-priority policy, the scheduling policy is obtained once all network layers of the model have been mapped one by one.
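As an illustration only (not the patent's implementation), the greedy rule described above can be sketched as follows, assuming a hypothetical estimate_delay(layer, accelerator) helper that wraps the performance analysis model and a supports() check for accelerators that cannot run a given layer:

```python
def compute_priority_mapping(layers, accelerators, estimate_delay):
    """Greedy computing-power-priority mapping: each network layer is mapped
    to the accelerator on which its estimated delay is smallest."""
    mapping = {}
    for layer in layers:  # traverse the network layers in order
        # only accelerators that can actually run this layer are candidates
        candidates = [acc for acc in accelerators if acc.supports(layer)]
        best = min(candidates, key=lambda acc: estimate_delay(layer, acc))
        mapping[layer] = best
    return mapping
```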
Iteratively obtaining the delays of the current network layer on different accelerators includes the following steps:
obtaining a computation graph of the model;
taking, based on the computation graph, the network layers without predecessor nodes as a group;
and traversing the network layers in each group in turn, enumerating all possible mapping accelerators for the current network layer, and calculating the delay of running the current network layer on each mapping accelerator.
Considering inter-layer dependencies and the validity of the schedule, the network layers may be grouped in advance. Specifically, the computation graph of the model is obtained first, and the network layers without predecessor nodes are taken as a group. The network layers in each group are then traversed in turn, all possible mapping accelerators for the current network layer are enumerated, and the delay of running the current network layer on each mapping accelerator is calculated.
For example, suppose a model consists of parts A, B, and C, where A and B have no predecessors and C must wait for A and B to complete. Then all network layers belonging to A form group 1, all network layers belonging to B form group 2, and all network layers belonging to C form group 3. When calculating delays, the delays of the network layers in A and B are calculated first, and only after these are finished are the delays of all network layers in C calculated.
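A minimal sketch of this dependency-based grouping, assuming the computation graph is given as a mapping from each node to the list of its predecessors (an assumed input format):

```python
def group_by_dependency(predecessors):
    """Group the nodes of a computation graph so that a node enters a group
    only after all of its predecessors have been placed in earlier groups."""
    remaining = set(predecessors)
    placed = set()
    groups = []
    while remaining:
        ready = sorted(n for n in remaining
                       if all(p in placed for p in predecessors[n]))
        if not ready:
            raise ValueError("cycle detected in the computation graph")
        groups.append(ready)
        placed.update(ready)
        remaining.difference_update(ready)
    return groups

# Mirrors the A/B/C example above:
# group_by_dependency({"A": [], "B": [], "C": ["A", "B"]}) -> [['A', 'B'], ['C']]
```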
Enumerating all possible mapping accelerators for the current network layer includes: enumerating, from the heterogeneous computing system, all mapping accelerators capable of running the current network layer; these accelerators may be graphics processing units, field programmable gate array chips, brain processing units, or tensor processing units. A heterogeneous computing system refers to an acceleration system that contains at least two different kinds of accelerators. For a given network layer, only a portion of the accelerators in the heterogeneous computing system may be able to run it, and only that portion needs to be enumerated, which avoids unnecessary delay calculations.
Here, calculating the delay of running the current network layer on each mapping accelerator includes:
clearing the local data of the mapping accelerator and setting the memory of the mapping accelerator as unavailable;
storing all weight parameters and intermediate results in host memory;
and calculating, using the performance analysis model, the delay of running the current network layer on the mapping accelerator.
That is, in the computing-power-priority mapping stage, communication delay optimization is not considered at all: the mapping accelerator is assumed to hold no local data, its local memory is not used, all weight parameters and intermediate results are stored in host memory, and the delay of running the current network layer on the mapping accelerator is then calculated with the performance analysis model.
In a simulated calculation, the delay of running the current network layer on the mapping accelerator can be computed with the performance analysis model under the assumption that the mapping accelerator's memory is set to 0, no local data exists in that memory, and all weight parameters and intermediate results are stored in host memory.
The mapping is thus based on computing-power priority, i.e., no communication delay optimization is applied. When the current network layer is evaluated on each candidate mapping accelerator, the accelerator's memory is assumed to be 0, no local data exists in that memory, and all weight parameters and intermediate results are stored in host memory; a performance analysis model is then used for simulation analysis to obtain the delay. An existing performance analysis model can be employed. The delay of a single accelerator includes both a pure computation component (the computation delay) and a communication component (the communication delay); the delay of a single accelerator may therefore also be referred to as the total delay of that accelerator.
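The per-layer delay estimate used in this stage might look like the following sketch; the perf_model interface (compute_delay, host_transfer_delay) is a hypothetical stand-in for the performance analysis model:

```python
def estimate_layer_delay(layer, accelerator, perf_model):
    """Delay estimate used during computing-power-priority mapping: the
    accelerator's local memory is treated as empty and unavailable, so all
    weight parameters and intermediate results live in host memory."""
    compute_delay = perf_model.compute_delay(layer, accelerator)
    # cost of fetching weights/inputs from, and writing outputs back to,
    # host memory (nothing is resident in the accelerator's local memory)
    communication_delay = perf_model.host_transfer_delay(
        layer, accelerator, local_memory_bytes=0)
    return compute_delay + communication_delay
```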
In one embodiment of the invention, mapping the network layers of the model onto accelerators of the computing system with computing power taking priority to obtain the scheduling policy includes:
obtaining a heterogeneous model having heterogeneous computing network layers;
and mapping the network layers of the heterogeneous model onto accelerators of the heterogeneous computing system based on the computing-power-priority policy to obtain the scheduling policy.
That is, the model that needs to be scheduled may be a heterogeneous model with heterogeneous computing network layers, and the computing system may also be a heterogeneous computing system.
Here, obtaining a heterogeneous model having heterogeneous computing network layers includes:
creating a heterogeneous model oriented to multi-modal, multi-task scenarios;
or copying a trained heterogeneous model;
wherein the heterogeneous computing network layers include a convolution layer, a fully connected layer, a long short-term memory network layer, and a conversion layer.
That is, in embodiments of the present invention, the heterogeneous computing network layers of a heterogeneous model may specifically include, but are not limited to, a convolution layer, a fully connected layer, a long short-term memory network layer, and a conversion layer.
Further, the heterogeneous model may be a newly created heterogeneous model or a trained heterogeneous model. That is, the model scheduling provided by embodiments of the invention can be performed for the training stage of a model as well as for the deployment and use stage of a trained model.
S102, performing a tentative remapping of a specific network layer, performing communication delay optimization on the tentatively remapped scheduling policy by using the memory of the accelerator, and obtaining the optimized total system delay.
In practical applications, the communication delay of some network layers may increase under the computing-power-optimal mapping. That is, when mapping is done purely for optimal computing power, the computation delay is lower overall but additional communication delay may be incurred, and the total system delay is the sum of computation delay and communication delay. A comprehensive approach should therefore consider neither computation optimization nor communication optimization alone, but should aim at the minimum total system delay.
On this basis, after the network layers of the model have been mapped onto accelerators of the computing system under the computing-power-priority policy, the total system delay can be further reduced by remapping specific network layers.
In general, a network layer whose remapping may reduce the total system delay is determined to be a specific network layer. To evaluate the effect of a remapping operation, the total system delay must be recalculated for every remapping attempt, so that a remapping is committed only when the total system delay is guaranteed to improve.
Thus, after a specific network layer is selected, a tentative remapping can be performed and the total system delay under the tentative remapping can be obtained.
In one embodiment of the present invention, searching for the specific network layer includes:
traversing each network layer of the model in sequence, and determining the accelerator that each network layer is mapped to;
judging whether the current network layer and its adjacent successor network layer are mapped to the same accelerator;
if yes, skipping the current network layer;
and if not, determining the current network layer as the specific network layer.
For convenience of description, the above steps are described in combination.
First, each network layer of the model is traversed in sequence, and the current network layer is determined to be a specific network layer when its adjacent successor layer is not mapped to the same accelerator. The reasoning is that when two adjacent network layers share an accelerator, their intermediate result can be kept in that accelerator's memory and communication delay is reduced; a layer whose adjacent successor resides on a different accelerator is therefore a remapping candidate, i.e., a specific network layer. A tentative remapping is then performed for that specific network layer.
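A minimal sketch of this search, assuming a hypothetical successor_of(layer) helper that returns the adjacent successor layer (or None for the last layer):

```python
def find_specific_layers(ordered_layers, mapping, successor_of):
    """Return the layers whose adjacent successor is mapped to a different
    accelerator; these are the remapping candidates ('specific' layers)."""
    specific = []
    for layer in ordered_layers:
        successor = successor_of(layer)
        if successor is None:
            continue                            # last layer: nothing to compare
        if mapping[layer] == mapping[successor]:
            continue                            # already co-located: skip it
        specific.append(layer)                  # different accelerator: candidate
    return specific
```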
In one embodiment of the present invention, performing the tentative remapping of the specific network layer includes:
tentatively remapping the specific network layer to the accelerator that its adjacent successor layer is mapped to, and obtaining the tentatively remapped scheduling policy. That is, the specific network layer is tentatively remapped to the accelerator of its adjacent successor layer, communication delay optimization is performed on the scheduling policy by using the memory of the accelerator, and the total system delay corresponding to the optimized scheduling policy is obtained.
In a specific embodiment of the present invention, performing communication delay optimization on the tentatively remapped scheduling policy by using the memory of the accelerator includes: writing the intermediate result of the specific network layer into the memory of the accelerator to which it is tentatively remapped. That is, when adjacent network layers are mapped to the same accelerator, communication delay can be optimized, so a specific network layer is preferentially remapped onto the accelerator to which its adjacent successor layer is mapped. After the tentative remapping of the specific network layer, its intermediate result is stored in that accelerator's memory to reduce communication delay, thereby optimizing the communication delay of the scheduling policy. The total system delay corresponding to the optimized scheduling policy is then obtained.
It should be noted that, in the communication delay optimization process, the model parameters of the network layer may also be stored in the corresponding accelerator memory, so as to reduce the communication delay between the host and the accelerator as much as possible.
In a specific embodiment of the present invention, obtaining the total system delay corresponding to the optimized scheduling policy includes:
calculating the total delay of each accelerator using the performance analysis model;
and summing the total delays of all accelerators to obtain the total system delay.
That is, since every network layer in the model is mapped to an accelerator, the total delay of each accelerator can be calculated directly with the performance analysis model, and the total system delay is then obtained by summing over all accelerators. Of course, in practical applications, the computation delay and communication delay of each network layer may also be calculated separately and then summed to obtain the total system delay.
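As a small illustration, the superposition described above amounts to summing the per-accelerator totals; accelerator_total_delay is an assumed method of the performance analysis model:

```python
def total_system_delay(accelerators, mapping, policy, perf_model):
    """Total system delay: the total delay (computation plus communication)
    of every accelerator, summed over all accelerators."""
    return sum(perf_model.accelerator_total_delay(acc, mapping, policy)
               for acc in accelerators)
```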
S103, when the optimized total system delay is lower than the total system delay before optimization, updating the scheduling policy to the remapped scheduling policy based on the tentative remapping.
After S102 is performed, a comparison is needed: the specific network layer is actually remapped only if the total system delay can thereby be reduced, and the scheduling policy is updated based on that remapping.
It should be noted that performing the remapping means changing the mapping relation of the specific network layer and, accordingly, storing its intermediate result in the memory of the newly mapped accelerator. For example, if the specific network layer was originally mapped onto accelerator 1 and is now remapped onto accelerator 2, its intermediate result must now be stored in the local memory of accelerator 2 instead. Therefore, the mapping relation of the specific network layer in the scheduling policy must be changed to accelerator 2, and the scheduling times of the specific network layer and of its subsequent network layers must be adjusted accordingly.
In a specific embodiment of the present invention, when the optimized total system delay is lower than the total system delay before optimization, updating the scheduling policy to the remapped scheduling policy based on the tentative remapping includes:
judging whether the optimized total system delay is lower than the total system delay before optimization;
if yes, remapping the specific network layer and updating its mapping relation in the scheduling policy to obtain the remapped scheduling policy;
and performing communication delay optimization on the remapped scheduling policy by using the memory of the accelerator, and iteratively updating the scheduling times of the specific network layer and its subsequent network layers.
The total system delay before optimization may be the total system delay corresponding to the scheduling policy before the tentative remapping of the specific network layer. In practice, the total system delay computed at each step can be cached, so whether a remapping reduces the total system delay can be determined simply by comparing the currently computed value with the previously computed one. If the delay is reduced, the tentative remapping is beneficial, it can be accepted for the specific network layer, and the scheduling policy is updated; otherwise, the communication delay saving brought by the tentative remapping does not outweigh the computation delay penalty relative to the computing-power-priority mapping.
When the optimized total system delay is higher than or equal to the total system delay before optimization, the following steps may be performed:
skipping the current specific network layer and searching for the next specific network layer;
if a next specific network layer is found, performing a tentative remapping of that layer and obtaining the total system delay, and, when the total system delay is reduced, updating the scheduling policy to the optimized, remapped scheduling policy for that layer;
and if no next specific network layer exists, executing the step of scheduling the model according to the updated, remapped scheduling policy.
That is, when remapping the current specific network layer together with communication delay optimization cannot further reduce the delay of the original mapping, that layer is skipped, the next specific network layer is searched for, a tentative remapping is performed on it to obtain a new total system delay, and whether to accept that remapping is decided by whether the total system delay is reduced, thereby further optimizing the scheduling policy.
This process is repeated until no specific network layer remains, at which point step S104 can be performed.
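Putting S102 and S103 together, the tentative-remapping loop can be sketched as follows; evaluate(mapping) is an assumed helper that applies the memory-based communication delay optimization and returns the resulting total system delay, and successor_of returns the adjacent successor layer:

```python
def remap_specific_layers(mapping, ordered_layers, successor_of, evaluate):
    """Tentatively remap each specific layer to its successor's accelerator
    and keep the move only if the re-evaluated total system delay drops."""
    best_delay = evaluate(mapping)
    for layer in ordered_layers:
        successor = successor_of(layer)
        if successor is None or mapping[layer] == mapping[successor]:
            continue                                  # not a specific layer
        candidate = dict(mapping)
        candidate[layer] = mapping[successor]         # tentative remapping
        candidate_delay = evaluate(candidate)
        if candidate_delay < best_delay:              # accept only if it helps
            mapping, best_delay = candidate, candidate_delay
        # otherwise skip this layer and try the next specific layer
    return mapping, best_delay
```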
S104, scheduling the model according to the updated, remapped scheduling policy.
After the computing-power-priority mapping, communication delay optimization is performed based on accelerator memory; specific network layers are then remapped and, where this further reduces the total system delay in combination with the communication delay optimization strategy, an optimization direction that jointly considers computing power and communication is obtained. The final updated, remapped scheduling policy therefore has better overall performance, i.e., the smallest total system delay. The model can then be scheduled according to this scheduling policy.
Specifically, scheduling the model according to the updated, remapped scheduling policy includes:
receiving a task request;
and scheduling the model according to the updated, remapped scheduling policy to process the task request.
That is, when a task request is associated with a model, the model may be scheduled based on the scheduling policy, thereby processing the task request.
Because the scheduling policy has a lower total system delay, task requests can be processed quickly, which improves the overall response efficiency of the system and the user experience.
By applying the method provided by the embodiments of the invention, the network layers of the model are mapped onto accelerators of the computing system based on the computing-power-priority policy to obtain a scheduling policy; a tentative remapping is performed for a specific network layer, communication delay optimization is performed on the tentatively remapped scheduling policy by using the memory of the accelerator, and the optimized total system delay is obtained; when the optimized total system delay is lower than the total system delay before optimization, the scheduling policy is updated to the remapped scheduling policy based on the tentative remapping; and the model is scheduled according to the updated, remapped scheduling policy.
In the invention, the scheduling policy is obtained after the network layers of the model are mapped onto accelerators with computing power taking priority. A tentative remapping is then performed for a specific network layer and, on that basis, communication delay optimization is performed on the tentatively remapped scheduling policy by using the memory of the accelerator, yielding the optimized total system delay. When comparison shows that the optimized total system delay is lower than the total system delay before optimization, the scheduling policy is updated to the remapped scheduling policy based on the tentative remapping. Finally, the model is scheduled according to the updated, remapped scheduling policy.
The technical effects of the invention are as follows: at the cost of a small loss in computing performance, a larger reduction in communication cost is obtained, which ultimately improves overall system performance, balances computation and communication, and improves the utilization of each accelerator's computing and storage resources during system-wide computation. At the same time, the limitations of the prevailing computing-power-priority mapping are overcome, and the complexity and heterogeneity of both the model and the system are taken into account, so that a mapping that considers computation and communication simultaneously can be found and the overall efficiency of the system is optimal.
It should be noted that, based on the above embodiments, embodiments of the present invention further provide corresponding improvements. Steps in the preferred/improved embodiments that are the same as or correspond to steps in the embodiments above, and the corresponding advantages, can be cross-referenced, so repeated descriptions are omitted here.
After step S101 above is performed, each network layer has been deployed to the accelerator with the lowest computation delay. After step S103 is performed, the communication delay of the network layers has been reduced as far as possible based on accelerator memory.
In an embodiment of the present invention, after step S101 and before step S102, communication delay optimization may additionally be performed on the scheduling policy by using the memory of the accelerator.
To run a network layer, an accelerator often needs to communicate with the host to fetch weight parameters and to forward intermediate result data through the host, which causes communication delay. Accelerators, however, typically have local memory. Thus, in embodiments of the present invention, communication delay can be optimized for the scheduling policy based on accelerator memory: some of the data content that would otherwise be read from the host is stored in the accelerator's memory instead. Because the accelerator then holds that data directly, it no longer needs to be obtained through communication with the host, communication delay is saved, and the scheduling times in the scheduling policy can be adjusted accordingly.
For example, suppose running network layer 1 on an accelerator takes time t. If the accelerator's local memory directly stores the data of that network layer, the data fetch time is saved, so running network layer 1 now takes t minus the communication delay that the data fetch would have required; the scheduling time of that network layer, and of the network layers after it, must then be adjusted accordingly in the scheduling policy, thereby optimizing communication delay.
It will be appreciated that when the communication delay is optimized, the total system delay is optimized as well, since the total system delay consists of communication delay and computation delay.
In one embodiment of the present invention, performing communication delay optimization on the scheduling policy by using the memory of the accelerator includes:
writing the data content of the network layer into the memory of the accelerator that the network layer is mapped to;
and re-determining the communication delay of the network layer and updating the scheduling policy based on the communication delay.
For convenience of description, the two steps are described in combination.
When relevant data content is written into accelerator memory, the data content of a network layer is written into the memory of the accelerator that the layer is mapped to, based on the mapping relation between network layers and accelerators. The communication delay of the network layer is then re-determined, and the scheduling policy is updated based on that communication delay.
Since writing the data content into accelerator memory does not affect computation, any change in the re-determined delay comes only from the communication delay optimization, so the scheduling policy can be updated directly based on the communication delay. Of course, in practical applications, the total delay of the network layer may also be determined directly, and the scheduling policy adjusted based on the new delay.
In one embodiment of the present invention, to ensure normal use of the accelerator, a portion of its memory must remain reserved for its normal operation when data is written into it. Thus, writing the data content of the network layer into the memory of the accelerator that the layer is mapped to includes:
determining, based on the scheduling policy, the target accelerator that the network layer is mapped to;
obtaining the memory occupancy of the target accelerator;
and writing the data content of the network layer into the memory of the target accelerator when the memory occupancy is lower than a preset occupancy threshold.
That is, when writing the data content of a network layer to an accelerator, the scheduling policy is first consulted to select the target accelerator that the network layer is mapped to.
Then, the memory occupancy of the target accelerator is obtained. If the occupancy is lower than the preset occupancy threshold, the data content of the network layer is written into the target accelerator; otherwise, no further data content of that network layer is written. The preset occupancy threshold can be set in advance according to factors such as the memory size and operating characteristics of the actual accelerators, or adjusted according to practical requirements; no specific value is imposed here.
In one embodiment of the present invention, writing the data content of the network layer into the memory of the accelerator that the layer is mapped to includes: writing the weight parameters of the network layer into the memory of that accelerator. In other words, the accelerator's local memory is fully used to reduce the time spent transferring weight parameters from host memory to the accelerator, optimizing the overall performance of the network layer on that accelerator. In embodiments of the invention, one or more network layers may be mapped to the same accelerator; to maximize the use of the accelerator's local memory and minimize data transfer, as many weight parameters as possible are kept in the accelerator's local memory, subject to its capacity limit. Once the weights are fixed in place, the communication delay of each affected network layer is updated first; because each layer's delay and schedule change, all subsequent layers are affected, so the layer schedule must be updated recursively until the schedule of the entire computation graph has been updated. This yields the updated total system delay and completes the communication delay optimization of the scheduling policy.
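A simplified sketch of this weight-placement step under a memory occupancy threshold; the 0.8 threshold and the byte-accounting inputs are illustrative assumptions, not values taken from the patent:

```python
def pin_weights(ordered_layers, mapping, weight_bytes,
                capacity_bytes, occupancy_threshold=0.8):
    """Keep a layer's weight parameters in its accelerator's local memory as
    long as that accelerator's projected occupancy stays below the threshold."""
    used_bytes = {acc: 0 for acc in capacity_bytes}
    pinned = {}
    for layer in ordered_layers:
        acc = mapping[layer]
        projected = used_bytes[acc] + weight_bytes[layer]
        if projected / capacity_bytes[acc] < occupancy_threshold:
            used_bytes[acc] = projected   # reserve space on the accelerator
            pinned[layer] = acc           # weights stay local, no host fetch
    return pinned
```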
In one embodiment of the present invention, writing the data content of the network layer into the memory of the accelerator that the layer is mapped to includes: writing the intermediate result of the network layer into that accelerator's memory if the adjacent successor layer of the network layer is mapped to the same accelerator.
If two adjacent network layers are mapped to the same accelerator, the intermediate result passed between them can be stored in that accelerator's local memory, which avoids interaction with host memory. Similar to the local placement of weight parameters, the relevant intermediate results are written to the accelerators recursively: for each network layer, its adjacent successor layer is checked, and if the two layers are on the same accelerator, the performance delay of the corresponding network layer is updated; the overall system schedule is then updated recursively, further optimizing the overall delay performance of the system.
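An illustrative sketch of this intermediate-result rule, again assuming a successor_of adjacency helper:

```python
def place_intermediate_results(ordered_layers, mapping, successor_of):
    """Decide where each layer's intermediate result lives: in the shared
    accelerator's memory when the layer and its adjacent successor are
    co-located, otherwise routed through host memory."""
    placement = {}
    for layer in ordered_layers:
        successor = successor_of(layer)
        if successor is not None and mapping[layer] == mapping[successor]:
            placement[layer] = mapping[layer]   # stays in accelerator memory
        else:
            placement[layer] = "host"           # goes back through the host
    return placement
```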
In one embodiment of the present invention, re-determining the communication delay of the network layer and updating the scheduling policy based on the communication delay includes:
obtaining the delay time for the accelerator to fetch the data content from the host;
updating the communication delay of the network layer based on the delay time;
and recursively updating the scheduling times in the scheduling policy with the updated communication delay of the network layer.
For convenience of description, the above three steps are explained together below.
Specifically, the delay time for fetching the data content of a network layer from the host can be obtained by simulating the accelerator that runs it. Since, after optimization, the data content is stored directly in the accelerator's local memory, the communication delay of the network layer can be updated based on that delay time, and the scheduling time of each network layer in the scheduling policy is then updated recursively based on the updated communication delay.
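A minimal sketch of this recursive schedule update: once a layer's delay changes, start times are propagated to every downstream layer (layers are assumed to be given in topological order):

```python
def update_schedule(ordered_layers, predecessors, layer_delay):
    """Recompute per-layer start and finish times after communication delays
    change: each layer starts once all of its predecessor layers finish."""
    start, finish = {}, {}
    for layer in ordered_layers:
        start[layer] = max((finish[p] for p in predecessors[layer]), default=0.0)
        finish[layer] = start[layer] + layer_delay[layer]
    return start, finish
```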
To facilitate understanding and implementation of the model scheduling method provided by the embodiment of the present invention by those skilled in the art, the method is described in detail below with reference to related technical solutions and specific application scenarios as examples.
Some existing deep neural network mapping algorithms map models mainly according to computation priority, for example by enlarging the data stream to raise the computational efficiency of a single accelerator, or by mapping convolutional layers onto accelerators of different types to raise convolution efficiency by 65%. However, these methods do not consider the data communication overhead at the system level, so computation-priority mapping does not yield globally optimal performance. A small number of communication-priority mapping algorithms divide the computation graph into several task clusters and distribute them to accelerators in units of clusters; however, the tasks in the same cluster are not necessarily suited to the accelerator they are assigned to, and the severe cross-layer dependence in heterogeneous models makes the clustering quality poor, so the computation efficiency is greatly damaged.
In view of the above, the model scheduling method provided by the embodiment of the invention takes the complexity and heterogeneity of both the neural network model and the heterogeneous system into account and performs hardware-aware mapping that considers computation and communication at the same time, so as to achieve optimal overall efficiency of the heterogeneous computing system.
In general, the complexity of real-world application scenarios requires both the machine learning model and the hardware computing system to be heterogeneous. Model heterogeneity results from the need for multi-task learning and perception when processing multi-modal, multi-task workloads, which leads to network models containing different network layer modules and computing modes. For example, as shown in fig. 2, the classical CNN-LSTM network uses a normalization layer (softmax) to connect multiple branches, and each branch contains an LSTM layer (Long Short-Term Memory network layer), an FC layer (fully connected layer), a pool layer, and a conv layer (convolutional layer).
As shown in fig. 3, the heterogeneity of the computing system comes from heterogeneous computing devices: different accelerators have their own specific computing advantages. Facing large-scale model computing tasks and complex application scenarios, efficient collaboration of multiple heterogeneous computing resources is needed, so that the characteristics of the different heterogeneous accelerators are fully exploited and the optimal computing efficiency of the system is achieved.
Aiming at the limitations of computation-priority mapping and communication-priority mapping, the model scheduling method provided by the embodiment of the invention faces complex model computation tasks in a multi-element heterogeneous computing system, considers computation and communication simultaneously, and, from the perspective of optimal overall system performance, provides a way to efficiently map model computation tasks onto the heterogeneous accelerators in the computing system.
The invention provides a model scheduling method with computation and communication awareness, which considers model computation and data communication performance at the same time to achieve system-level performance optimization. The implementation flow of the method can comprise three modules: a computation priority mapping module, a communication optimization module based on the accelerator local memory, and a network layer location aware remapping module. When model computation is executed in a multi-element heterogeneous computing system, the implementation flow of each module is shown schematically in fig. 4.
In order to more accurately describe the mapping between the heterogeneous computational graph (i.e., a computational graph corresponding to a heterogeneous model) and the heterogeneous accelerator, a computational graph and a system performance model may be constructed first, and the following parameters are designed:
G = (V, E): representing the computational graph, wherein V represents the model network layers and E represents the dependencies between network layers;
S_i: representing the sub-graph whose computation is performed on accelerator D_i; S_i is initially empty and is updated gradually during the iteration;
M: representing the overall computing task allocation of the system, consisting of the group of sub-graphs {S_i};
P_i: representing the performance analysis model of a single accelerator D_i; for different accelerators, existing performance analysis models can be employed, and the delay performance of a single accelerator comprises two parts, namely pure computation time and communication time;
T_sys: representing the overall performance of the system, used as the comparison basis for evaluating whether a mapping scheme is optimal.
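For illustration, these parameters could be represented in code roughly as follows; every class name, field and placeholder value here is an assumption made only for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ComputeGraph:                                   # G = (V, E)
    layers: List[str]                                 # V: model network layers
    deps: List[Tuple[str, str]]                       # E: (predecessor, successor) pairs

@dataclass
class Allocation:                                     # M: overall task allocation
    subgraphs: Dict[str, List[str]] = field(default_factory=dict)   # S_i per accelerator D_i

def perf(layer: str, accelerator: str) -> float:
    """P_i: delay of running `layer` on `accelerator` = pure compute time +
    communication time; a real performance analysis model would be plugged in here."""
    compute_time, comm_time = 1.0, 0.5                # placeholder profiled values
    return compute_time + comm_time

def system_delay(alloc: Allocation) -> float:
    """T_sys: overall system delay used to compare candidate mapping schemes
    (simplified here to the superposition of per-accelerator delays)."""
    return sum(perf(l, d) for d, ls in alloc.subgraphs.items() for l in ls)
```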
The computation priority mapping module performs the computation priority mapping. First, the accelerator local memory is assumed to be 0, i.e., no local data exists and all weight and intermediate-result data are placed in host memory; the system performance therefore includes two parts, computation and communication, where communication covers transferring the weight parameters from host memory to the accelerator and transferring intermediate results into or out of host memory.
The calculation priority mapping implementation flow is shown in fig. 5, and the whole process comprises the following steps:
The computational graph G and the performance analysis model P_i of each accelerator D_i are taken as input;
to take account of the inter-layer dependencies and the validity of the schedule, the algorithm determines the mapping and the schedule iteratively: each iteration selects the network layers (also called nodes) without unprocessed predecessors in the computational graph as a packet, traverses in turn all possible mappings for each network layer in the packet, uses the performance model P_i to estimate the performance of executing the layer on accelerator D_i, evaluates the system performance of the mapping according to T_sys, and selects the mapping and scheduling mode that makes the system performance optimal (delay minimal);
finally, the network layer is mapped onto the accelerator with the best computing performance, and the overall scheduling policy of the system is given. The overall scheduling policy may specifically include a list of scheduling times of the network layers and the mapping relationships between the network layers and the accelerators.
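For illustration, a much-simplified sketch of this greedy loop is given below; the delay table stands in for the per-accelerator performance analysis model, the helper names and numbers are assumptions, and the sketch ignores the scheduling-time bookkeeping and assumes an acyclic computation graph.

```python
from typing import Dict, List, Tuple

def compute_priority_mapping(
    layers: List[str],
    deps: List[Tuple[str, str]],                       # (predecessor, successor)
    accelerators: List[str],
    est_delay: Dict[Tuple[str, str], float],           # (layer, accelerator) -> delay
) -> Dict[str, str]:
    preds = {l: {p for p, s in deps if s == l} for l in layers}
    mapping: Dict[str, str] = {}
    unmapped = set(layers)
    while unmapped:
        # packet = layers whose predecessors have all been mapped already
        packet = [l for l in unmapped if preds[l] <= mapping.keys()]
        for layer in packet:
            # enumerate every candidate accelerator, keep the lowest-delay one
            best = min(accelerators, key=lambda acc: est_delay[(layer, acc)])
            mapping[layer] = best
            unmapped.remove(layer)
    return mapping

# usage with assumed profiled delays for two layers and two devices
demo = compute_priority_mapping(
    ["conv1", "lstm1"], [("conv1", "lstm1")], ["device0", "device1"],
    {("conv1", "device0"): 2.0, ("conv1", "device1"): 3.0,
     ("lstm1", "device0"): 4.0, ("lstm1", "device1"): 2.5},
)   # -> {"conv1": "device0", "lstm1": "device1"}
```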
After the computation priority mapping is completed, the communication optimization module based on the accelerator local memory performs data transmission optimization on the basis of that mapping. The module mainly comprises two parts: local optimization of the network layer weight parameters and local optimization of the intermediate results of adjacent network layers.
First, the weight parameters are optimized locally. This part mainly makes full use of the accelerator local memory to reduce the time for transferring the weight parameters from host memory to the accelerator and to optimize the overall performance of the network layer on the accelerator. In the present invention, one or more network layers may be mapped onto one accelerator; to maximize the utilization of the accelerator local memory and reduce data transmission, as many weight parameters as possible are stored in the local memory within the accelerator's local memory capacity limit. After the weight placement is completed, the delay performance of each network layer needs to be updated. Specifically, since every change in a network layer's delay and schedule affects all subsequent layers, the layer schedule must be updated recursively until the schedule update of the entire computational graph is completed, yielding the updated system delay performance.
The local optimization of intermediate results between adjacent network layers considers that, if two adjacent network layers are mapped to the same accelerator, the intermediate result between them can be stored in the local memory of that accelerator, so the interaction time between the accelerator and the host memory is avoided. The recursive intermediate-result optimization is similar to the weight-parameter local optimization: for each network layer, the mapping target of its succeeding adjacent network layer is checked; if the succeeding adjacent layer is on the same accelerator, the performance delay of the corresponding network layer is updated, and the overall schedule of the system is then updated recursively, further optimizing the overall delay performance of the system.
The network layer location aware remapping module defines an operation that remaps a specific network layer and re-optimizes its communication delay. Specifically, referring to fig. 6, if the next adjacent network layer of a network layer resides on a target accelerator, remapping the network layer onto that target accelerator is considered; the remapping mainly exploits the local optimization of intermediate results of adjacent network layers in the communication optimization module based on the accelerator local memory, so as to reduce the time for transmitting intermediate results. However, doing so may increase the computational delay of the network layer, and the weight parameter transfer delay may increase or decrease depending on the available local memory capacity of the destination accelerator. Therefore, to determine the effect of a remapping operation, the data transmission optimization based on the accelerator local memory must be re-performed at each remapping attempt. A greedy algorithm may be employed to make a remapping attempt for each specific network layer: the remapping is accepted only if it reduces the total system delay, i.e., the communication saving outweighs the increase in computation cost; otherwise it is not accepted. The algorithm terminates when no more specific network layers can be remapped.
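A sketch of such a greedy remapping loop is shown below, under assumed callbacks for delay evaluation and local-memory optimization; all names are illustrative.

```python
# Tentatively move a layer onto the accelerator hosting its adjacent successor,
# redo the memory-based communication optimization, and keep the move only if
# the total system delay drops.
def remap_greedily(layers, successors, mapping, evaluate_total_delay, optimize_locally):
    best = evaluate_total_delay(mapping)
    improved = True
    while improved:                                   # stop when no layer can be remapped
        improved = False
        for layer in layers:
            nxt = successors.get(layer)
            if nxt is None or mapping[nxt] == mapping[layer]:
                continue                              # no successor, or already co-located
            trial = dict(mapping)
            trial[layer] = mapping[nxt]               # quasi-remapping of the specific layer
            optimize_locally(trial)                   # re-run weight / intermediate placement
            cand = evaluate_total_delay(trial)
            if cand < best:                           # communication saving beats compute cost
                mapping, best, improved = trial, cand, True
    return mapping, best
```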
In an actual application scenario, the method can be directly embedded into model training/inference equipment, systems or platforms: the model computation task mapping and scheduling procedure is executed first, and the subsequent model computation process is then executed according to the mapping and scheduling result. Of course, it may also be used as a tool independent of the model application.
Taking the heterogeneous network model and heterogeneous accelerators shown in fig. 7 as an example, the model scheduling method will now be described in detail.
In a heterogeneous computing system comprising 3 different heterogeneous accelerators, device 0, device 1 and device 2, the network layers of the network model need to be mapped onto the accelerators according to computation priority.
First, the network layers without predecessor nodes in the computation graph of the heterogeneous network model are divided into packet 1 and packet 2. As shown in fig. 8, all possible mappings { device 0 (thick solid line box), device 1 (broken line box), device 2 (thin solid line box) } are traversed for each network layer in packet 1 and packet 2, the performance model is used to estimate the performance of executing the layer on each accelerator, and finally each network layer is mapped onto the device with the lowest delay. The overall system schedule after mapping is shown in fig. 9, where the line pattern of a network layer in the model indicates the accelerator drawn with the same line pattern onto which it is mapped.
For example, network layer 1.1 is mapped to accelerator device 1. Idle intervals appear in the accelerator's schedule, such as the idle interval before network layer 3.2, because that layer must wait for network layer 3.1, on which it has a data dependency, to finish before it can start computing.
For the local optimization of the weight parameters, as shown in fig. 10, the network layers (1.2, 3.2, etc.) drawn with the same line pattern as a device indicate that their weights are stored in the local memory of the corresponding device, so the delay of those network layers on the device is reduced because the data communication delay is reduced.
For example, after the delay of network layer 3.1 on its device is reduced, network layer 3.2 can obtain the intermediate result of 3.1 earlier and can therefore start computing earlier; at the same time, the idle waiting time between 1.1 and 3.2 on device 1 is reduced, and so on. After the local optimization of the weights, the performance delay and scheduling time of the network layers on each accelerator in the heterogeneous computing system are updated correspondingly, the overall delay on some accelerators of the system is reduced, and the computation efficiency is improved to a certain extent.
Local optimization of intermediate results between adjacent network layers: after the weight local optimization, it can be seen that, for example, network layers 3.4 and 3.5 are adjacent on accelerator device 0; network layers 3.1 and 3.2 are adjacent on accelerator device 1; and layers 1.3, 1.4 and 1.5 are adjacent on device 2. When adjacent network layers are computed on the same accelerator, the time for moving intermediate results in and out is reduced and the performance delay of each network layer decreases; after this part of the optimization is completed, the total delay of the system is further reduced and the overall system performance is improved.
After the local optimization of intermediate results between adjacent network layers in fig. 11, remapping can be attempted for each network layer, as shown in fig. 12. Taking network layer 1.1 as an example, its adjacent successor layer 1.2 is on device 0, so remapping 1.1 onto device 0 is attempted; the data transmission optimization based on the accelerator local memory (the communication optimization module described above) is then performed again, and the total system delay after remapping is calculated. If the total system delay decreases, the communication saving obtained by reducing intermediate-result transmission outweighs the computation delay increase caused by mapping 1.1 onto an accelerator that is not optimal for its computation mode, so the remapping is accepted and the scheduling times of the subsequent network layers are updated in turn. Otherwise, 1.1 is not mapped onto device 0 and its computation remains scheduled on device 1.
The method mainly obtains a larger reduction in communication cost at the expense of a small loss in computation performance, finally improves the overall performance of the system, achieves a balance between computation and communication in the heterogeneous computing system, makes full use of the performance advantages of the multi-element heterogeneous accelerators and their local memory space, and improves the utilization of the computing and storage resources of each accelerator during the overall computation of the system. Meanwhile, the model scheduling method can break the limitation of the existing mainstream computation-priority mapping: by considering the complexity and heterogeneity of the neural network model and the heterogeneous system, it finds a hardware-aware mapping algorithm that considers computation and communication at the same time, so as to achieve optimal overall efficiency of the heterogeneous computing system.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a model scheduling device, where the model scheduling device described below and the model scheduling method described above can be referred to correspondingly.
Referring to fig. 13, the apparatus includes the following modules:
the computing priority mapping module 101 is configured to map a network layer of the model onto an accelerator of the computing system based on the computing priority policy, so as to obtain a scheduling policy;
the remapping module 102 is configured to perform remapping on a specific network layer, perform communication delay optimization on a scheduling policy after remapping by using a memory of an accelerator, and obtain an optimized total system delay; under the condition that the total delay of the optimized system is lower than the total delay of the system before the optimization, updating the scheduling strategy after the remapping based on the quasi-remapping;
and the scheduling module 103 is used for scheduling the model according to the updated scheduling strategy after remapping.
By applying the device provided by the embodiment of the invention, the network layers of the model are mapped onto the accelerators of the computing system based on the computing priority policy to obtain a scheduling policy; quasi-remapping is performed on a specific network layer, communication delay optimization is performed on the quasi-remapped scheduling policy by using the memory of the accelerator, and the optimized total system delay is obtained; when the optimized total system delay is lower than the total system delay before optimization, the remapped scheduling policy is updated based on the quasi-remapping; and the model is scheduled according to the updated remapped scheduling policy.
In the invention, a scheduling policy is obtained after the network layers of the model are mapped onto the accelerators based on computing priority. Then, quasi-remapping is performed on a specific network layer, and communication delay optimization is performed on the quasi-remapped scheduling policy by using the memory of the accelerator on the basis of the quasi-remapping, thereby obtaining the optimized total system delay. When comparison determines that the optimized total system delay is lower than the total system delay before optimization, the remapped scheduling policy is updated based on the quasi-remapping. Finally, the model is scheduled according to the updated remapped scheduling policy.
The technical effects of the invention are as follows: at the cost of a small loss in computation performance, the method obtains a larger reduction in communication cost, finally improves the overall performance of the system, balances computation and communication, and improves the utilization of the computing and storage resources of each accelerator during the overall computation of the system. Meanwhile, the limitation of the existing mainstream computation-priority mapping can be broken; by taking the complexity and heterogeneity of the model and the system into account, a mapping that considers computation and communication simultaneously can be found, making the overall efficiency of the system optimal.
In one embodiment of the present invention, the remapping module is specifically configured to find a specific network layer, including: traversing each network layer of the model in sequence, and determining an accelerator with a mapping relation with the network layer; judging whether the current network layer and the next adjacent network layer are mapped to the same accelerator;
if yes, skipping the current network layer; if not, the current network layer is determined as the specific network layer.
In a specific embodiment of the present invention, the remapping module is specifically configured to remap the specific network layer to an accelerator mapped by a subsequent adjacent network layer, so as to obtain a scheduling policy after the remapping.
In one embodiment of the present invention, the remapping module is specifically configured to write the intermediate result of the specific network layer into the accelerator memory to be remapped.
In one embodiment of the present invention, the remapping module is specifically configured to calculate the total delay for each accelerator using a performance analysis model;
the total delay of all accelerators is superimposed to obtain the total delay of the system.
In one embodiment of the present invention, the remapping module is specifically configured to determine whether the total system delay after optimization is lower than the total system delay before optimization;
If yes, after remapping the specific network layer, updating the mapping relation of the specific network layer in the scheduling policy to obtain a remapped scheduling policy;
and carrying out communication delay optimization on the remapped scheduling strategy by using the memory of the accelerator, and iteratively updating the scheduling time of the specific network layer and the subsequent network layers of the specific network layer.
In a specific embodiment of the present invention, the remapping module is specifically configured to skip a specific network layer and search for a next specific network layer when the total system delay after optimization is higher than or equal to the total system delay before optimization;
if the next specific network layer is found, performing pseudo-remapping on the specific network layer and acquiring the total delay of the system, and updating the optimized scheduling strategy after remapping on the specific network layer under the condition of reducing the total delay of the system;
and if the next specific network layer does not exist, executing a step of scheduling the model according to the updated scheduling strategy after remapping.
In one specific embodiment of the invention, the computation priority mapping module is specifically configured to iteratively obtain delays of a current network layer on different accelerators;
mapping the current network layer to the accelerator with the minimum delay;
And after the mapping of all network layers of the model is completed based on the computing priority policy, the scheduling policy is obtained.
In one embodiment of the present invention, the computation priority mapping module is specifically configured to obtain a computation graph of the model;
taking the network layers without predecessor nodes as a group based on the computation graph;
traversing the network layers in the packet in turn, enumerating all possible mapping accelerators for the current network layer, and calculating the delay of running the current network layer on each mapping accelerator.
In one embodiment of the present invention, the computation priority mapping module is specifically configured to enumerate all mapping accelerators capable of running the current network layer from the heterogeneous computing system;
the accelerator in the heterogeneous computing system comprises an image processor, a field programmable gate array chip, a brain processor and a tensor processor.
In one embodiment of the present invention, a computation priority mapping module is specifically configured to empty local data of the mapping accelerator, and set a memory of the mapping accelerator to be unavailable;
storing all weight parameters and intermediate results in a host memory;
using the performance analysis model, the delay of running the current network layer on the mapped accelerator is calculated.
In one embodiment of the present invention, after the network layers of the model are mapped onto the accelerators of the computing system based on the computing priority policy to obtain the scheduling policy, and before the quasi-remapping is performed on the specific network layer, the apparatus further includes: a communication delay optimization module, configured to perform communication delay optimization on the scheduling policy by using the memory of the accelerator.
In one embodiment of the present invention, the communication delay optimization module is specifically configured to write data content of the network layer into a memory of an accelerator having a mapping relationship with the network layer;
the communication delay of the network layer is redetermined and the scheduling policy is updated based on the communication delay.
In one specific embodiment of the present invention, the communication delay optimization module is specifically configured to determine, based on a scheduling policy, a target accelerator having a mapping relationship with a network layer;
acquiring the memory occupancy rate of a target accelerator;
and writing the data content of the network layer into the memory of the target accelerator under the condition that the memory occupancy rate is lower than a preset occupancy threshold value.
In one embodiment of the present invention, the communication delay optimization module is specifically configured to write the weight parameter of the network layer into the memory of the accelerator having a mapping relationship with the network layer.
In one embodiment of the present invention, the communication delay optimization module is specifically configured to write the intermediate result of the network layer into the memory in the accelerator if the adjacent successor layers of the network layer are mapped to the same accelerator at the same time.
In one embodiment of the present invention, the communication delay optimization module is specifically configured to obtain a delay time for the accelerator to obtain data content from the host;
updating the communication delay of the network layer based on the delay time;
recursively updating the scheduling time in the scheduling policy with the updated communication delay of the network layer.
In one embodiment of the invention, the model is a heterogeneous model with heterogeneous computational network layers;
the computing system is a heterogeneous computing system having heterogeneous accelerators.
In a specific embodiment of the present invention, the scheduling module is specifically configured to perform scheduling processing on the model according to the updated scheduling policy after remapping during a model training stage and/or a model use stage.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a heterogeneous computing system, where a heterogeneous computing system described below and a model scheduling method described above may be referred to correspondingly.
Referring to fig. 16, fig. 16 is a schematic diagram of a heterogeneous computing system according to an embodiment of the invention.
The heterogeneous computing system has a plurality of heterogeneous accelerators, and the steps of the model scheduling method described above are implemented in the heterogeneous computing system.
It should be noted that the accelerators in the computing system include two or more kinds, that is, the system can be regarded as a heterogeneous computing system; the embodiments of the present invention do not limit the specific kind of accelerator.
In view of the fact that the steps of the above-mentioned model scheduling method can be implemented in the heterogeneous computing system, the heterogeneous computing system also has the technical effects of the above-mentioned model scheduling method, which are not described in detail herein.
Corresponding to the above method embodiment, the embodiment of the present invention further provides an electronic device, and an electronic device described below and a model scheduling method described above may be referred to correspondingly.
Referring to fig. 14, the electronic device includes:
a memory 332 for storing a computer program;
a processor 322 for implementing the steps of the model scheduling method of the above method embodiment when executing a computer program.
Specifically, referring to fig. 15, fig. 15 is a schematic diagram of a specific structure of an electronic device according to the present embodiment. The electronic device may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 322 (e.g., one or more processors) and a memory 332, where the memory 332 stores one or more computer programs 342 or data 344. The memory 332 may be transient or persistent storage. The program stored in the memory 332 may include one or more modules (not shown), each of which may include a series of instruction operations on the data processing apparatus. Further, the processor 322 may be configured to communicate with the memory 332 and execute the series of instruction operations in the memory 332 on the electronic device 301.
The electronic device 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341.
The steps in the model scheduling method described above may be implemented by the structure of the electronic device.
Corresponding to the above method embodiments, the embodiments of the present invention further provide a readable storage medium, where a readable storage medium described below and a model scheduling method described above may be referred to correspondingly.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the model scheduling method of the above method embodiment.
The readable storage medium may be a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, and the like.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms include, comprise, or any other variation is intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The principles and embodiments of the present invention have been described herein with reference to specific examples, which are intended only to assist in understanding the method of the present invention and its core idea; meanwhile, since those skilled in the art may make changes to the specific embodiments and application scope according to the idea of the present invention, the content of this specification should not be construed as limiting the present invention.

Claims (23)

1. A method of model scheduling, comprising:
Mapping a network layer of the model to an accelerator of a computing system based on a computing priority strategy to obtain a scheduling strategy;
performing quasi-remapping on a specific network layer, optimizing communication delay of the scheduling strategy after quasi-remapping by using the memory of an accelerator, and acquiring the total delay of the system after optimization;
updating the scheduling strategy after remapping based on the pseudo remapping under the condition that the total delay of the system after optimization is lower than the total delay of the system before optimization;
and carrying out scheduling processing on the model according to the updated scheduling strategy after remapping.
2. The model scheduling method of claim 1, wherein searching for the particular network layer comprises:
traversing each network layer of the model in sequence, and determining an accelerator with a mapping relation with the network layer;
judging whether the current network layer and the next adjacent network layer are mapped to the same accelerator;
if yes, skipping the current network layer;
if not, the current network layer is determined as the specific network layer.
3. The method of model scheduling according to claim 1, wherein said quasi-remapping of a specific network layer comprises:
And the specific network layer is to be remapped to an accelerator mapped by a next adjacent network layer, and the scheduling strategy after the remapping is obtained.
4. The method for scheduling a model according to claim 3, wherein the performing communication delay optimization on the scheduling policy after the remapping by using the memory of the accelerator includes:
and writing the intermediate result of the specific network layer into the accelerator memory to be remapped.
5. A method of model scheduling according to claim 3, wherein the obtaining the optimized total system delay comprises:
calculating a total delay for each of the accelerators using a performance analysis model;
and superposing the total delay of all the accelerators to obtain the total delay of the system.
6. The method according to claim 1, wherein updating the scheduling policy after remapping based on the pseudo-remapping in a case where the total system delay after optimization is lower than the total system delay before optimization, comprises:
judging whether the total delay of the optimized system is lower than the total delay of the system before optimization;
if yes, after remapping the specific network layer, updating the mapping relation of the specific network layer in the scheduling policy to obtain the remapped scheduling policy;
And carrying out communication delay optimization on the remapped scheduling strategy by utilizing the memory of the accelerator, and iteratively updating the scheduling time of the specific network layer and the subsequent network layers of the specific network layer.
7. The model scheduling method according to claim 1, wherein in the case where the total system delay after optimization is higher than or equal to the total system delay before optimization, comprising:
skipping the specific network layer and searching for the next specific network layer;
if the next specific network layer is found, performing pseudo-remapping on the specific network layer and acquiring the total system delay, and updating the optimized scheduling strategy after remapping on the specific network layer under the condition of reducing the total system delay;
and if the specific network layer does not exist next, executing the step of scheduling the model according to the updated scheduling strategy after remapping.
8. The method for scheduling models according to claim 1, wherein the mapping the network layer of the models to the accelerator of the computing system based on the computing priority strategy to obtain the scheduling strategy comprises:
iteratively obtaining delays of the current network layer on different accelerators;
Mapping the current network layer onto the accelerator with the smallest delay;
and after the mapping of all network layers of the model is completed based on the computing priority strategy, the scheduling strategy is obtained.
9. The model scheduling method of claim 8, wherein iteratively obtaining delays of a current network layer on different ones of the accelerators comprises:
obtaining a calculation map of the model;
based on the calculation graph, taking a network layer without a preamble node as a group;
traversing the network layers in the packet in turn, enumerating all possible mapping accelerators of the current network layer, and calculating a delay of running the current network layer on each of the mapping accelerators.
10. The model scheduling method of claim 9, wherein enumerating all possible mapping accelerators of the current network layer comprises:
enumerating all mapping accelerators capable of running the current network layer from a heterogeneous computing system;
the accelerator in the heterogeneous computing system comprises an image processor, a field programmable gate array chip, a brain processor and a tensor processor.
11. The model scheduling method of claim 9, wherein said calculating the delay of running the current network layer on each of the mapped accelerators comprises:
Clearing the local data of the mapping accelerator, and setting the memory of the mapping accelerator as unavailable;
storing all weight parameters and intermediate results in a host memory;
using a performance analysis model, calculating a delay to run the current network layer on the mapped accelerator.
12. The model scheduling method of claim 1, wherein after mapping the network layer of the model onto an accelerator of the computing system based on the computing priority strategy, resulting in a scheduling policy, before the quasi-remapping of the particular network layer, further comprising:
and carrying out communication delay optimization on the scheduling strategy by using the memory of the accelerator.
13. The method for model scheduling according to claim 12, wherein said optimizing the communication delay for the scheduling policy using the memory of the accelerator comprises:
writing the data content of the network layer into the memory of the accelerator with a mapping relation with the network layer;
and re-determining the communication delay of the network layer and updating the scheduling policy based on the communication delay.
14. The method of claim 13, wherein writing the data content of the network layer into the memory of the accelerator having a mapping relationship with the network layer comprises:
Determining a target accelerator with a mapping relation with the network layer based on the scheduling policy;
acquiring the memory occupancy rate of the target accelerator;
and writing the data content of the network layer into the memory of the target accelerator under the condition that the memory occupancy rate is lower than a preset occupancy threshold value.
15. The method of claim 13, wherein writing the data content of the network layer into the memory of the accelerator having a mapping relationship with the network layer comprises:
and writing the weight parameters of the network layer into the memory of the accelerator with the mapping relation with the network layer.
16. The method of claim 13, wherein writing the data content of the network layer into the memory of the accelerator having a mapping relationship with the network layer comprises:
if adjacent successor layers of the network layer are mapped to the same accelerator at the same time, the intermediate result of the network layer is written into the memory in the accelerator.
17. The model scheduling method of claim 13, wherein the re-determining the communication delay of the network layer and updating the scheduling policy based on the communication delay comprises:
Acquiring delay time of the accelerator for acquiring the data content from a host;
updating a communication delay of the network layer based on the delay time;
and recursively updating the scheduling time in the scheduling strategy by using the communication delay updated by the network layer.
18. The model scheduling method according to any one of claims 1 to 17, comprising:
the model is a heterogeneous model with a heterogeneous computing network layer;
the computing system is a heterogeneous computing system having heterogeneous accelerators.
19. The method for scheduling the model according to claim 18, wherein the scheduling the model according to the updated scheduling policy after remapping comprises:
and in the model training stage and/or the model using stage, scheduling the model according to the updated scheduling strategy after remapping.
20. A heterogeneous computing system, comprising:
having a plurality of heterogeneous accelerators, implementing the steps of the model scheduling method of any of claims 1 to 19 in the heterogeneous computing system.
21. A model scheduling apparatus, comprising:
The computing priority mapping module is used for mapping the network layer of the model to an accelerator of the computing system based on the computing priority strategy to obtain a scheduling strategy;
the remapping module is used for carrying out quasi-remapping on a specific network layer, carrying out communication delay optimization on the scheduling strategy after quasi-remapping by utilizing the memory of the accelerator, and obtaining the optimized total delay of the system; updating the scheduling strategy after remapping based on the pseudo remapping under the condition that the total delay of the system after optimization is lower than the total delay of the system before optimization;
and the scheduling module is used for scheduling the model according to the updated scheduling strategy after remapping.
22. An electronic device, comprising:
a memory for storing a computer program;
processor for implementing the steps of the model scheduling method according to any one of claims 1 to 19 when executing said computer program.
23. A readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the model scheduling method according to any one of claims 1 to 19.
CN202311220749.2A 2023-09-21 2023-09-21 Model scheduling method, device, computing system, equipment and readable storage medium Active CN116980423B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311220749.2A CN116980423B (en) 2023-09-21 2023-09-21 Model scheduling method, device, computing system, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311220749.2A CN116980423B (en) 2023-09-21 2023-09-21 Model scheduling method, device, computing system, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN116980423A true CN116980423A (en) 2023-10-31
CN116980423B CN116980423B (en) 2024-02-09

Family

ID=88476911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311220749.2A Active CN116980423B (en) 2023-09-21 2023-09-21 Model scheduling method, device, computing system, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116980423B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180183727A1 (en) * 2016-12-27 2018-06-28 Netspeed Systems, Inc. Traffic mapping of a network on chip through machine learning
CN112132271A (en) * 2019-06-25 2020-12-25 Oppo广东移动通信有限公司 Neural network accelerator operation method, architecture and related device
US20210158133A1 (en) * 2019-11-27 2021-05-27 Korea University Research And Business Foundation Deep neural network accelerator using heterogeneous multiply-accumulate unit
CN112116084A (en) * 2020-09-15 2020-12-22 中国科学技术大学 Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform
WO2022087822A1 (en) * 2020-10-27 2022-05-05 西门子股份公司 Mapping table construction method and apparatus, storage medium, and mobile device handover method
WO2022235251A1 (en) * 2021-05-03 2022-11-10 Google Llc Generating and globally tuning application-specific machine learning accelerators
CN113947181A (en) * 2021-09-22 2022-01-18 之江实验室 Neural network accelerator model conversion method and device
US20230281432A1 (en) * 2021-12-23 2023-09-07 University Of Southern California System and method for hybrid arithmetic and logic processing of neural networks
CN115470889A (en) * 2022-08-31 2022-12-13 南京大学 Network-on-chip autonomous optimal mapping exploration system and method based on reinforcement learning
CN116245150A (en) * 2023-02-28 2023-06-09 西北工业大学 Neural network reconfigurable configuration mapping method for FPGA (field programmable Gate array) resources
CN116781532A (en) * 2023-03-23 2023-09-19 北京中电飞华通信有限公司 Optimization mapping method of service function chains in converged network architecture and related equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王丽;郭振华;曹芳;高开;赵雅倩;赵坤;: "面向模型并行训练的模型拆分策略自动生成方法", 计算机工程与科学, no. 09 *

Also Published As

Publication number Publication date
CN116980423B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN110619595B (en) Graph calculation optimization method based on interconnection of multiple FPGA accelerators
KR102257028B1 (en) Apparatus and method for allocating deep learning task adaptively based on computing platform
CN110069341B (en) Method for scheduling tasks with dependency relationship configured according to needs by combining functions in edge computing
CN109561148A (en) Distributed task dispatching method in edge calculations network based on directed acyclic graph
CN114418127B (en) Machine learning calculation optimization method and platform
CN112068957B (en) Resource allocation method, device, computer equipment and storage medium
CN113037800B (en) Job scheduling method and job scheduling device
CN114915630A (en) Task allocation method based on Internet of things equipment, network training method and device
CN111752691B (en) Method, device, equipment and storage medium for sorting AI (advanced technology attachment) calculation graphs
KR20210148586A (en) Scheduler, method for operating the same and accelerator system including the same
US20110131554A1 (en) Application generation system, method, and program product
CN115134371A (en) Scheduling method, system, equipment and medium containing edge network computing resources
CN108170861B (en) Distributed database system collaborative optimization method based on dynamic programming
CN114429195A (en) Performance optimization method and device for hybrid expert model training
CN114217930A (en) Accelerator system resource optimization management method based on mixed task scheduling
CN116680063B (en) Task scheduling method, device, computing system, electronic equipment and storage medium
CN116980423B (en) Model scheduling method, device, computing system, equipment and readable storage medium
CN116996941A (en) Calculation force unloading method, device and system based on cooperation of cloud edge ends of distribution network
CN115860066A (en) Neural network reasoning pipeline multiplexing method based on batch processing
CN115220908A (en) Resource scheduling method, device, electronic equipment and storage medium
CN108205465A (en) The task-dynamic dispatching method and device of streaming applications
CN111736463A (en) Adaptive deep learning control method based on operation platform
CN114860417B (en) Multi-core neural network processor and multi-task allocation scheduling method for same
US20220345535A1 (en) Distribution of machine learning workflows on webscale infrastructures
EP4354355A1 (en) Multi-objective auto tuning for layer fusion and tensor tiling on multi-level cache hierarchy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant