CN115114033B - Heterogeneous computing method and device based on layer number segmentation - Google Patents

Heterogeneous computing method and device based on layer number segmentation

Info

Publication number
CN115114033B
CN115114033B (Application CN202211044043.0A)
Authority
CN
China
Prior art keywords
processor
submodel
shared physical
submodels
physical memory
Prior art date
Legal status
Active
Application number
CN202211044043.0A
Other languages
Chinese (zh)
Other versions
CN115114033A (en)
Inventor
孙坚伟
胡力
乔安成
Current Assignee
Shanghai Core Computing Technology Co ltd
Original Assignee
Shanghai Core Computing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Core Computing Technology Co ltd filed Critical Shanghai Core Computing Technology Co ltd
Priority to CN202211044043.0A
Publication of CN115114033A
Application granted
Publication of CN115114033B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a heterogeneous computing method and device based on layer number division, which perform operations on a plurality of processors and comprise the following steps: predicting a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, and, when the first time exceeds a first threshold, dividing the convolutional neural network model into at least two submodels; operating a first submodel of the at least two submodels on the first processor and a second submodel of the at least two submodels on a second processor of the plurality of processors; and dynamically allocating the numbers of layers of the at least two submodels, wherein the sum of the numbers of layers of the at least two submodels is the total number of layers of the convolutional neural network model. By segmenting the convolutional neural network into submodels according to layer number and distributing them to the hardware units of the plurality of processors, the heterogeneous computing method and device based on layer number segmentation improve the accelerated computing efficiency of the convolutional neural network in a system with residual computing power.

Description

Heterogeneous computing method and device based on layer number segmentation
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a heterogeneous computing method and device based on layer number segmentation.
Background
With the wide application of artificial intelligence in end-side computing systems, convolutional neural networks (CNNs) are widely deployed on various heterogeneous computing platforms. Owing to their large computation load and strict real-time requirements, CNNs have gradually become a bottleneck in the field of artificial intelligence.
In the prior art, common solutions include:
1. the use of a low complexity CNN model makes the various algorithm models currently available unprofitable.
2. Using an embedded neural network processor (NPU) with higher computing power, or interconnecting a plurality of NPUs, which greatly increases cost.
3. Sending the CNN model and the data to a cloud platform for calculation and returning the result. Because of the large CNN data volume and the synchronization required among different cloud platforms, the network transmission delay prevents this method from meeting real-time requirements.
Common embedded systems usually include different types of hardware units such as an NPU, a GPU (Graphics Processing Unit), and a CPU (Central Processing Unit). Usually only the NPU executes CNN inference, while hardware such as the GPU and the CPU is relatively idle. Because the various deep learning operators can now be implemented on different hardware, it is possible to assist the NPU's CNN inference by using the performance margins of the GPU and the CPU in the system.
Disclosure of Invention
The invention provides a heterogeneous computing method and device based on layer number segmentation, which are characterized in that a convolutional neural network is segmented into submodels according to the layer number and then distributed to hardware units of a plurality of processors, so that the accelerated computing efficiency of the convolutional neural network is improved in a system with residual computing power.
The embodiment of the invention provides a heterogeneous computing method based on layer number division, which is based on a plurality of processors to carry out operation and comprises the following steps:
predicting a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, and when the first time exceeds a first threshold, segmenting the convolutional neural network model into at least two sub-models;
a first sub-model of the at least two sub-models operating on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors;
and dynamically allocating the layer number of the at least two submodels, wherein the sum of the layer number of the at least two submodels is the total layer number of the convolutional neural network model.
Preferably, a second time required for the convolutional neural network model to operate on the first processor and the second processor of the plurality of processors is predicted, and when the second time exceeds a second threshold, the convolutional neural network model is divided into three sub-models;
the first one of the three submodels operates on the first processor, the second one of the three submodels operates on the second processor, and a third one of the three submodels operates on a third one of the plurality of processors.
Preferably, said partitioning the convolutional neural network model into at least two submodels when the first time exceeds a first threshold, including the first of the at least two submodels operating on the first processor, the second of the at least two submodels operating on the second processor, a third of the at least two submodels operating on a third processor of the plurality of processors.
Preferably, the dynamically allocating the number of layers of the at least two submodels includes dynamically allocating according to historical performance margins of the plurality of processors.
Preferably, the dynamic allocation according to the historical performance margins of the plurality of processors includes obtaining the idle-busy time ratio of the plurality of processors within a preset time unit, where the idle-busy time ratio per unit time is calculated by the following formula:

$R = T_{idle} / T_{busy}$

where $R$ is the idle-busy time ratio of the plurality of processors per unit time, $T_{idle}$ is the idle time of the plurality of processors per unit time, and $T_{busy}$ is the busy time of the plurality of processors per unit time.
Preferably, the dynamically allocating the number of layers of the at least two submodels comprises dynamically allocating according to the real-time busy states of the plurality of processors: when it is monitored that the first processor is currently overloaded and the second processor has a large performance margin, the first submodel is operated on the first processor and the second submodel is operated on the second processor in the next frame;
and when it is monitored that the second processor is currently overloaded and a third processor of the plurality of processors has a large performance margin, the second submodel is operated on the second processor and a third submodel of the at least two submodels is operated on the third processor in the next frame.
Preferably, according to the number of the at least two submodels, distributing a plurality of blocks of shared physical memories for the input and output layers shared among the submodels;
when the at least two submodels comprise the first submodel and the second submodel, the multi-block shared physical memory comprises a first shared physical memory and a second shared physical memory;
when the at least two submodels comprise the first submodel, the second submodel and a third submodel, the multi-block shared physical memory comprises the first shared physical memory, the second shared physical memory and a third shared physical memory.
Preferably, the multi-block shared physical memory is used for supporting the output of the first submodel and the second submodel and the reading of the second submodel and the third submodel;
when the plurality of shared physical memories comprise the first shared physical memory and the second shared physical memory, the first shared physical memory is used for storing the output of the ith frame of the first submodel or the second submodel, and the first shared physical memory is also used for reading the ith frame of the second submodel or the third submodel;
the second shared physical memory is used for storing the output of the first submodel or the second submodel of the (i + 1) th frame, and the second shared physical memory is also used for reading the second submodel or the third submodel of the (i + 1) th frame;
when the plurality of shared physical memories comprise the first shared physical memory, the second shared physical memory and the third shared physical memory, the third shared physical memory is used for storing the output of the i +2 frame first submodel or the second submodel, and the third shared physical memory is also used for reading the i +2 frame second submodel or the third submodel;
wherein i is a natural number, and i is more than or equal to 1.
Preferably, the plurality of processors includes at least two of an embedded neural network processor, a graphics processor, and a central processor.
The embodiment of the present invention further provides a heterogeneous computing device based on layer number division, which performs operations based on multiple processors, and includes:
a prediction module to predict a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, when the first time exceeds a first threshold, to partition the convolutional neural network model into at least two sub-models;
an operation module to operate a first sub-model of the at least two sub-models on the first processor, a second sub-model of the at least two sub-models on a second processor of the plurality of processors;
and the dynamic allocation module is used for dynamically allocating the layer number of the at least two sub models, and the sum of the layer number of the at least two sub models is the total layer number of the convolutional neural network model.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
according to the heterogeneous computing method and device based on layer number segmentation, the first time required by operation of a convolutional neural network model on a first processor in a plurality of processors is predicted, and when the first time exceeds a first threshold value, the convolutional neural network model is segmented into at least two sub models; a first sub-model of the at least two sub-models operating on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors; dynamically allocating the layer number of the at least two submodels, wherein the sum of the layer number of the at least two submodels is the total layer number of the convolutional neural network model, and dividing the convolutional neural network into submodels according to the layer number and then allocating the submodels to hardware units of a plurality of processors, so that the accelerated calculation efficiency of the convolutional neural network is improved in a system with residual calculation power;
further, distributing a plurality of blocks of shared physical memories for the input and output layers shared among the submodels according to the number of the at least two submodels; when the at least two submodels comprise the first submodel and the second submodel, the multi-block shared physical memory comprises a first shared physical memory and a second shared physical memory; when the at least two submodels comprise the first submodel, the second submodel and a third submodel, the plurality of blocks of shared physical memories comprise the first shared physical memory, the second shared physical memory and a third shared physical memory, and correct data can be read by the N-th layer neural network submodel by distributing the plurality of blocks of shared physical memories.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for describing the embodiments or the prior art, and it is apparent that the drawings in the following description are some embodiments of the present invention, but not all embodiments. For a person skilled in the art, other figures can also be obtained from these figures without inventive exercise.
Fig. 1 is a schematic flowchart of a heterogeneous computing method based on layer number segmentation according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a heterogeneous computing method based on layer number segmentation according to another embodiment of the present invention;
FIG. 3 is a block diagram of a heterogeneous computing device based on layer number partitioning according to an embodiment of the present invention;
fig. 4 is a schematic diagram illustrating real-time states of a plurality of processors in a heterogeneous computing method based on layer number segmentation according to an embodiment of the present invention;
fig. 5 is a schematic diagram of multiple shared physical memories based on a layer number division heterogeneous computing method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The technical solution of the present invention will be described in detail with reference to specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Based on the problems in the prior art, the embodiment of the invention provides a heterogeneous computing method and device based on layer number segmentation: the convolutional neural network is segmented into submodels according to layer number and then distributed to the hardware units of an embedded neural network processor, a graphics processor, and a central processing unit, so that the accelerated computing efficiency of the convolutional neural network is improved in a system with residual computing power.
Fig. 1 is a schematic flow chart of a heterogeneous computing method based on layer number segmentation according to an embodiment of the present invention. Referring now to fig. 1, an embodiment of the present invention provides a heterogeneous computing method based on layer number division, which performs operations based on a plurality of processors and includes: step S101: predicting a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, and when the first time exceeds a first threshold, segmenting the convolutional neural network model into at least two sub-models; step S102: a first sub-model of the at least two sub-models operating on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors; step S103: dynamically allocating the layer number of the at least two submodels, wherein the sum of the layer number of the at least two submodels is the total layer number of the convolutional neural network model.
In some embodiments, the plurality of processors includes at least two of an embedded neural network processor, a graphics processor, and a central processor.
In step S101, predicting a first time required for the convolutional neural network model to operate on a first processor of the plurality of processors specifically includes performing dynamic scheduling according to historical performance margins of the plurality of processors, and performing dynamic scheduling according to real-time busy states of the plurality of processors.
When the first time exceeds a first threshold value, the convolutional neural network model is divided into at least two submodels, namely, if the first processor is predicted to be incapable of processing the convolutional neural network model in time, the convolutional neural network is divided into at least two submodels for accelerating the calculation efficiency. The first threshold may be set according to system history data, which is not described herein.
Segmenting the convolutional neural network model into at least two sub-models specifically comprises: traversing all network layers of the N-layer convolutional neural network model and calculating the computation amount C[i] of each network layer, where the computation amount of a convolutional layer is 2 × (weight quantity) × (output size); the total computation amount SUM = C[1] + C[2] + … + C[N] is obtained by accumulation.
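As an illustration (not part of the patent text), the following is a minimal Python sketch of this per-layer computation estimate; the ConvLayer fields and helper names are assumptions made for the example:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ConvLayer:
    weight_count: int   # number of weights, e.g. k_h * k_w * c_in * c_out
    output_size: int    # number of output elements, e.g. h_out * w_out * c_out

def layer_cost(layer: ConvLayer) -> int:
    # C[i] = 2 * (weight quantity) * (output size), as stated above
    return 2 * layer.weight_count * layer.output_size

def total_cost(layers: List[ConvLayer]) -> int:
    # SUM = C[1] + C[2] + ... + C[N]
    return sum(layer_cost(l) for l in layers)

# Toy 3-layer network with made-up shapes
layers = [ConvLayer(3*3*3*16, 112*112*16),
          ConvLayer(3*3*16*32, 56*56*32),
          ConvLayer(3*3*32*64, 28*28*64)]
print(total_cost(layers))
```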
In step S102, a first submodel of the at least two submodels operates on the first processor and a second submodel of the at least two submodels operates on a second processor of the plurality of processors. The first processor may be any one of an embedded neural network processor and a graphics processor, and the second processor may be any one of a graphics processor and a central processor.
For example, the first sub-model operates on a first processor, i.e., an embedded neural network processor, and the second sub-model operates on a second processor, i.e., a graphics processor, or the first sub-model operates on the first processor, i.e., the graphics processor, and the second sub-model operates on the second processor, i.e., a central processing unit.
In step S103, dynamically allocating the number of layers of the at least two submodels, where the sum of the number of layers of the at least two submodels is the total number of layers of the convolutional neural network model. For example, the number of layers of a first submodel in the at least two submodels is a, the number of layers of a second submodel in the at least two submodels is b, and the sum of the number of layers of the first submodel a and the number of layers of the second submodel b is the total number of layers N of the convolutional neural network model.
In some embodiments, predicting a second time required for the convolutional neural network model to operate on the first processor and the second processor of the plurality of processors, when the second time exceeds a second threshold, partitioning the convolutional neural network model into three sub-models; that is, if it is predicted that neither the first processor nor the second processor can process the convolutional neural network model in time, the convolutional neural network is divided into three sub-models in order to accelerate the computational efficiency. The second threshold may be set according to system history data, which is not described herein.
For example, the number of layers of a first submodel in the three submodels is a, the number of layers of a second submodel in the three submodels is b, the number of layers of a third submodel in the three submodels is c, and the sum of the number of layers of the first submodel a, the number of layers of the second submodel b and the number of layers of the third submodel c is the total number of layers N of the convolutional neural network model.
The first one of the three submodels operates on the first processor, the second one of the three submodels operates on the second processor, and a third one of the three submodels operates on a third one of the plurality of processors. The first processor may be an embedded neural network processor, the second processor may be a graphics processor, and the third processor may be a central processor.
For example, a first submodel operates on a first processor, i.e., an embedded neural network processor, a second submodel operates on a second processor, i.e., a graphics processor, and a third submodel operates on a third processor, i.e., a central processor.
In some embodiments, the partitioning the convolutional neural network model into at least two submodels when the first time exceeds a first threshold, including the first one of the at least two submodels operating on the first processor, the second one of the at least two submodels operating on the second processor, a third one of the at least two submodels operating on a third one of the plurality of processors.
For example, when the first time exceeds a first threshold, the convolutional neural network model is divided into three submodels, the first submodel operates on a first processor, namely an embedded neural network processor, the second submodel operates on a second processor, namely a graphics processor, and the third submodel operates on a third processor, namely a central processing unit.
In some embodiments, dynamically allocating the number of layers of the at least two submodels includes dynamically allocating according to historical performance margins of the plurality of processors.
In some embodiments, the dynamic allocation according to the historical performance margins of the plurality of processors includes obtaining the idle-busy time ratios of the plurality of processors within a preset time unit, where the idle-busy time ratio per unit time is calculated by the following formula:

$R = T_{idle} / T_{busy}$

where $R$ is the idle-busy time ratio of the plurality of processors per unit time, $T_{idle}$ is the idle time of the plurality of processors per unit time, and $T_{busy}$ is the busy time of the plurality of processors per unit time.

Specifically, the idle-busy time ratio per unit time of the embedded neural network processor is calculated by:

$R_{NPU} = T_{NPU,idle} / T_{NPU,busy}$

where $R_{NPU}$ is the idle-busy time ratio of the embedded neural network processor per unit time, $T_{NPU,idle}$ is its idle time per unit time, and $T_{NPU,busy}$ is its busy time per unit time.

The idle-busy time ratio per unit time of the graphics processor is calculated by:

$R_{GPU} = T_{GPU,idle} / T_{GPU,busy}$

where $R_{GPU}$ is the idle-busy time ratio of the graphics processor per unit time, $T_{GPU,idle}$ is its idle time per unit time, and $T_{GPU,busy}$ is its busy time per unit time.

The idle-busy time ratio per unit time of the central processing unit is calculated by:

$R_{CPU} = T_{CPU,idle} / T_{CPU,busy}$

where $R_{CPU}$ is the idle-busy time ratio of the central processing unit per unit time, $T_{CPU,idle}$ is its idle time per unit time, and $T_{CPU,busy}$ is its busy time per unit time.
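For illustration, a small Python sketch of the idle-busy ratio computation above; the processor names and sample values are placeholder assumptions, and a real system would read these times from OS or driver counters:

```python
def idle_busy_ratio(idle_time: float, busy_time: float) -> float:
    """R = T_idle / T_busy for one processor over one time unit."""
    if busy_time == 0.0:
        return float("inf")  # the processor was fully idle in this window
    return idle_time / busy_time

window = 1.0  # preset time unit, in seconds
idle_samples = {"npu": 0.2, "gpu": 0.7, "cpu": 0.5}  # sample idle times
ratios = {name: idle_busy_ratio(idle, window - idle)
          for name, idle in idle_samples.items()}
print(ratios)  # larger ratio = larger historical performance margin
```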
Fig. 4 is a schematic diagram of the real-time states of a plurality of processors in a heterogeneous computing method based on layer number segmentation according to an embodiment of the present invention. Referring now to FIG. 4, in some embodiments, dynamically allocating the number of layers of the at least two submodels includes dynamically allocating according to the real-time busy states of the plurality of processors: when it is monitored that the first processor is currently overloaded and the second processor has a large performance margin, the next frame operates the first submodel on the first processor and the second submodel on the second processor;
and when it is monitored that the second processor is currently overloaded and a third processor of the plurality of processors has a large performance margin, the next frame operates the second submodel on the second processor and a third submodel of the at least two submodels on the third processor.
In some real-time situations the load on the system increases, for example when an autonomous vehicle needs to turn, when the vehicle speed rises and the sensor frequency increases, or when reversing-image detection is added while backing up; in such cases a processor may become overloaded.
If the frequency of the input data increases, whether a processor can complete its calculation on time can be predicted from its real-time busy state. For example, when it is monitored that the first processor, i.e. the embedded neural network processor, is currently overloaded, its computing power is judged insufficient, and in the next frame the first submodel is operated on the embedded neural network processor while the second submodel is operated on the second processor, i.e. the graphics processor. Likewise, when it is monitored that the second processor, i.e. the graphics processor, is currently overloaded, in the next frame the second submodel is operated on the graphics processor and the third submodel is operated on the third processor, i.e. the central processing unit.
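A hedged sketch of this next-frame reassignment logic, assuming a hypothetical status monitor interface that the patent does not specify:

```python
def plan_next_frame(status, assignment):
    # status: processor name -> "overloaded" | "margin" | "normal"
    # assignment: submodel name -> processor it will run on next frame
    if status.get("npu") == "overloaded" and status.get("gpu") == "margin":
        # offload the tail of the network from the NPU to the GPU
        assignment["submodel2"] = "gpu"
    if status.get("gpu") == "overloaded" and status.get("cpu") == "margin":
        # push the last submodel further down to the CPU
        assignment["submodel3"] = "cpu"
    return assignment

assignment = {"submodel1": "npu", "submodel2": "npu", "submodel3": "gpu"}
status = {"npu": "overloaded", "gpu": "margin", "cpu": "normal"}
print(plan_next_frame(status, assignment))
```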
In some embodiments, the whole neural network model is first run several times on the idle embedded neural network processor, graphics processor, and central processing unit respectively, to obtain the preliminary operation times $t_{NPU}$, $t_{GPU}$, and $t_{CPU}$ on the three processors. From these, the available computing power of the three processors can be obtained: the available computing power of the embedded neural network processor is $P_{NPU}$, that of the graphics processor is $P_{GPU}$, and that of the central processing unit is $P_{CPU}$.
In some embodiments, the first submodel spans from the first layer to the a-th layer, with a total computation amount of $C[1] + C[2] + \cdots + C[a]$; the remaining part forms a second submodel with N − a layers.
If the system predicts that the graphics processor cannot complete the operation of the convolutional neural network model in time, the second submodel is further segmented, and the number of layers of the segmented second submodel is reduced from N − a to b, i.e. from layer a + 1 to layer a + 1 + b. The computation amount of the b-layer second submodel, $C[a+1] + \cdots + C[a+1+b]$, is data that is updated in real time.
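For illustration, a Python sketch of one plausible split-point search: it chooses a so that the cumulative computation C[1..a] matches the NPU's share of total available computing power. The proportional-split rule is an assumption for the example; the patent only states that the layer counts are adjusted from real-time data:

```python
def choose_split(costs, p_npu, p_gpu, p_cpu):
    # costs: per-layer computation amounts C[1..N]
    # p_*: available computing power of each processor (arbitrary units)
    total = sum(costs)
    target = total * p_npu / (p_npu + p_gpu + p_cpu)
    acc, a = 0, 0
    for i, c in enumerate(costs, start=1):
        if acc + c > target:
            break
        acc += c
        a = i
    return a  # layers 1..a form the first submodel; the rest go to GPU/CPU

costs = [120, 80, 200, 150, 90, 60]  # toy values for C[1]..C[6]
print(choose_split(costs, p_npu=4.0, p_gpu=2.0, p_cpu=1.0))
```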
In some embodiments, according to the number of the at least two submodels, allocating a plurality of blocks of shared physical memory for the input/output layer shared between the submodels;
when the at least two submodels comprise the first submodel and the second submodel, the multi-block shared physical memory comprises a first shared physical memory and a second shared physical memory;
when the at least two submodels comprise the first submodel, the second submodel and a third submodel, the multi-block shared physical memory comprises the first shared physical memory, the second shared physical memory and a third shared physical memory.
The first shared physical memory, the second shared physical memory, and the third shared physical memory are used for shared-memory multi-buffering among the first submodel, the second submodel, and the third submodel. Fig. 5 is a schematic diagram of multiple shared physical memories of a layer number division-based heterogeneous computing method according to an embodiment of the present invention. Referring now to FIG. 5, in some embodiments, the plurality of blocks of shared physical memory are used to support the output of the first submodel and the second submodel and the reading of the second submodel and the third submodel;
when the plurality of shared physical memories comprise the first shared physical memory and the second shared physical memory, the first shared physical memory is used for storing the output of the ith frame of the first submodel or the second submodel, and the first shared physical memory is also used for reading the ith frame of the second submodel or the third submodel;
the second shared physical memory is used for storing the output of the first submodel or the second submodel of the (i + 1) th frame, and the second shared physical memory is also used for reading the second submodel or the third submodel of the (i + 1) th frame;
when the plurality of shared physical memories include the first shared physical memory, the second shared physical memory, and the third shared physical memory, the third shared physical memory is used for storing the output of the i +2 th frame of the first submodel or the second submodel, and the third shared physical memory is also used for reading the i +2 th frame of the second submodel or the third submodel;
wherein i is a natural number, and i is more than or equal to 1.
As shown in fig. 5, the frame-(i+1) first submodel and the frame-i second submodel execute in parallel in time. If there were only a single shared physical memory, the frame-i second submodel might fail to read the data written to memory by the frame-i first submodel, because the write operation of the frame-(i+1) first submodel could precede the read operation of the frame-i second submodel. By allocating a plurality of blocks of shared physical memory, the layer-segmented neural network submodels read correct data.
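A minimal Python sketch of the multi-buffer handoff shown in fig. 5, under the assumption that frame i uses buffer i mod K; synchronization (per-buffer ready flags or semaphores, which a real implementation needs) is omitted:

```python
class SharedBuffers:
    """K rotating buffers: a frame's producer and consumer share one slot,
    while the next frame's producer writes a different slot."""
    def __init__(self, num_buffers: int, size: int):
        self.buffers = [bytearray(size) for _ in range(num_buffers)]

    def write_slot(self, frame: int) -> bytearray:
        # output buffer for this frame's producer submodel
        return self.buffers[frame % len(self.buffers)]

    def read_slot(self, frame: int) -> bytearray:
        # input buffer for the same frame's consumer submodel
        return self.buffers[frame % len(self.buffers)]

bufs = SharedBuffers(num_buffers=2, size=1024)
bufs.write_slot(frame=1)[0] = 42   # frame-1 first submodel writes buffer 1
bufs.write_slot(frame=2)[0] = 43   # frame-2 first submodel writes buffer 0
print(bufs.read_slot(frame=1)[0])  # frame-1 second submodel still reads 42
```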
In some embodiments, the plurality of processors includes at least two of an embedded neural network processor, a graphics processor, and a central processor.
Fig. 2 is a flowchart illustrating a heterogeneous computing method based on layer number segmentation according to another embodiment of the present invention. Referring now to fig. 2, an embodiment of the present invention provides a heterogeneous computing method based on layer number partitioning, which performs operations based on multiple processors, including: step S201: predicting a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, and partitioning the neural network model into at least two sub-models when the first time exceeds a first threshold; step S202: a first sub-model of the at least two sub-models operating on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors; step S203: predicting a second time required for the convolutional neural network model to operate on the first processor and the second processor of the plurality of processors, and when the second time exceeds a second threshold, partitioning the convolutional neural network model into three submodels; step S204: the first one of the three submodels operating on the first processor, the second one of the three submodels operating on the second processor, a third one of the three submodels operating on a third one of the plurality of processors; step S205: and dynamically allocating the layer number of the at least two sub models, wherein the sum of the layer number of the at least two sub models is the total layer number of the convolutional neural network model.
Fig. 3 is a block diagram of a heterogeneous computing device based on layer number partitioning according to an embodiment of the present invention. Referring now to fig. 3, an embodiment of the present invention further provides a heterogeneous computing device based on layer number division, which performs operations based on multiple processors, including: a prediction module 31 for predicting a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, the convolutional neural network model being partitioned into at least two sub-models when the first time exceeds a first threshold; an operation module 32 for operating a first sub-model of the at least two sub-models on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors; and a dynamic allocation module 33, configured to dynamically allocate the number of layers of the at least two sub-models, where a sum of the number of layers of the at least two sub-models is a total number of layers of the convolutional neural network model.
In summary, in the heterogeneous computing method and apparatus based on layer number segmentation according to the embodiments of the present invention, a first time required for a convolutional neural network model to operate on a first processor of a plurality of processors is predicted, and when the first time exceeds a first threshold, the convolutional neural network model is segmented into at least two sub-models; a first sub-model of the at least two sub-models operating on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors; dynamically allocating the layer number of the at least two submodels, wherein the sum of the layer number of the at least two submodels is the total layer number of the convolutional neural network model, and dividing the convolutional neural network into submodels according to the layer number and then allocating the submodels to hardware units of a plurality of processors, so that the accelerated calculation efficiency of the convolutional neural network is improved in a system with residual calculation power;
furthermore, distributing a plurality of blocks of shared physical memories for the input and output layers shared between the submodels according to the quantity of the at least two submodels; when the at least two submodels comprise the first submodel and the second submodel, the multi-block shared physical memory comprises a first shared physical memory and a second shared physical memory; when the at least two submodels comprise the first submodel, the second submodel and a third submodel, the plurality of blocks of shared physical memories comprise the first shared physical memory, the second shared physical memory and a third shared physical memory, and correct data can be read by the N-th layer neural network submodel by distributing the plurality of blocks of shared physical memories.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the spirit of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A heterogeneous computing method based on layer number division is operated based on a plurality of processors, and is characterized by comprising the following steps:
predicting a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, and when the first time exceeds a first threshold, segmenting the convolutional neural network model into at least two sub-models;
a first sub-model of the at least two sub-models operates on the first processor, a second sub-model of the at least two sub-models operates on a second processor of the plurality of processors, the first processor is an embedded neural network processor, the second processor is a graphics processor, or the first processor is a graphics processor, the second processor is a central processor;
dynamically allocating the layer number of the at least two submodels, wherein the sum of the layer number of the at least two submodels is the total layer number of the convolutional neural network model;
distributing a plurality of blocks of shared physical memories for the input and output layers shared among the submodels according to the quantity of the at least two submodels;
when the at least two submodels comprise the first submodel and the second submodel, the multi-block shared physical memory comprises a first shared physical memory and a second shared physical memory;
when the at least two submodels include the first submodel, the second submodel, and a third submodel, the plurality of blocks of shared physical memory include the first shared physical memory, the second shared physical memory, and a third shared physical memory.
2. The layer-number-division-based heterogeneous computing method according to claim 1, wherein
predicting a second time required for the convolutional neural network model to operate on the first processor and the second processor of the plurality of processors, and when the second time exceeds a second threshold, partitioning the convolutional neural network model into three submodels;
the first one of the three submodels operates on the first processor, the second one of the three submodels operates on the second processor, and the third one of the three submodels operates on a third one of the plurality of processors.
3. The layer-number-division-based heterogeneous computation method according to claim 1,
the partitioning of the convolutional neural network model into at least two submodels when the first time exceeds a first threshold, including the first of the at least two submodels operating on the first processor, the second of the at least two submodels operating on the second processor, a third of the at least two submodels operating on a third of the plurality of processors.
4. The layer-number-division-based heterogeneous computation method according to claim 1,
and dynamically allocating the layer number of the at least two submodels comprises dynamically allocating according to the historical performance margins of the plurality of processors.
5. The layer-number-division-based heterogeneous computing method according to claim 4, wherein
the dynamic allocation according to the historical performance margins of the processors comprises the steps of obtaining idle and busy time ratios of the processors in a preset time unit, wherein the idle and busy time ratios of the processors in the unit time unit are calculated by the following formula:
Figure 819160DEST_PATH_IMAGE002
wherein, the first and the second end of the pipe are connected with each other,
Figure 641622DEST_PATH_IMAGE003
is a ratio of free busy times per unit time for the plurality of processors,
Figure 908656DEST_PATH_IMAGE004
is the idle time in the unit time of the plurality of processors,
Figure 107556DEST_PATH_IMAGE005
busy time in unit time for the plurality of processors.
6. The layer-number-division-based heterogeneous computation method according to claim 1,
dynamically allocating the number of layers of the at least two submodels comprises dynamically allocating according to the real-time busy states of the plurality of processors: when it is monitored that the first processor is currently overloaded and the second processor has a large performance margin, the first submodel is operated on the first processor and the second submodel is operated on the second processor in the next frame;
and when it is monitored that the second processor is currently overloaded and a third processor of the plurality of processors has a large performance margin, the second submodel is operated on the second processor and a third submodel of the at least two submodels is operated on the third processor in the next frame.
7. The layer-number-division-based heterogeneous computation method according to claim 1,
the multi-block shared physical memory is used for supporting the output of the first submodel and the second submodel and the reading of the second submodel and the third submodel;
when the plurality of shared physical memories comprise the first shared physical memory and the second shared physical memory, the first shared physical memory is used for storing the output of the ith frame first submodel or the ith frame second submodel, and the first shared physical memory is also used for reading the ith frame second submodel or the ith frame third submodel;
the second shared physical memory is used for storing the output of the (i + 1) th frame first submodel or the second submodel, and the second shared physical memory is also used for reading the (i + 1) th frame second submodel or the third submodel;
when the plurality of shared physical memories comprise the first shared physical memory, the second shared physical memory and the third shared physical memory, the third shared physical memory is used for storing the output of the i +2 frame first submodel or the second submodel, and the third shared physical memory is also used for reading the i +2 frame second submodel or the third submodel;
wherein i is a natural number, and i is more than or equal to 1.
8. The method of claim 1, wherein the plurality of processors includes at least two of an embedded neural network processor, a graphics processor, and a central processor.
9. A heterogeneous computing device based on layer number division, which performs operations based on a plurality of processors, comprising:
a prediction module to predict a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, when the first time exceeds a first threshold, to partition the convolutional neural network model into at least two sub-models;
an operation module to operate a first sub-model of the at least two sub-models on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors, the first processor being an embedded neural network processor, the second processor being a graphics processor, or the first processor being a graphics processor, the second processor being a central processor;
the dynamic allocation module is used for dynamically allocating the layer number of the at least two sub models, and the sum of the layer number of the at least two sub models is the total layer number of the convolutional neural network model;
distributing a plurality of blocks of shared physical memories for the input and output layers shared among the submodels according to the number of the at least two submodels;
when the at least two submodels comprise the first submodel and the second submodel, the multi-block shared physical memory comprises a first shared physical memory and a second shared physical memory;
when the at least two submodels comprise the first submodel, the second submodel and a third submodel, the multi-block shared physical memory comprises the first shared physical memory, the second shared physical memory and a third shared physical memory.
CN202211044043.0A 2022-08-30 2022-08-30 Heterogeneous computing method and device based on layer number segmentation Active CN115114033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211044043.0A CN115114033B (en) 2022-08-30 2022-08-30 Heterogeneous computing method and device based on layer number segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211044043.0A CN115114033B (en) 2022-08-30 2022-08-30 Heterogeneous computing method and device based on layer number segmentation

Publications (2)

Publication Number Publication Date
CN115114033A CN115114033A (en) 2022-09-27
CN115114033B true CN115114033B (en) 2022-12-06

Family

ID=83335686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211044043.0A Active CN115114033B (en) 2022-08-30 2022-08-30 Heterogeneous computing method and device based on layer number segmentation

Country Status (1)

Country Link
CN (1) CN115114033B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667046A (en) * 2019-03-08 2020-09-15 富泰华工业(深圳)有限公司 Deep learning acceleration method and user terminal

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7257756B2 (en) * 2018-08-20 2023-04-14 キヤノン株式会社 Image identification device, image identification method, learning device, and neural network
US20210056357A1 (en) * 2019-08-19 2021-02-25 Board Of Trustees Of Michigan State University Systems and methods for implementing flexible, input-adaptive deep learning neural networks
CN112783807B (en) * 2020-12-31 2023-12-29 深圳大普微电子科技有限公司 Model calculation method and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667046A (en) * 2019-03-08 2020-09-15 富泰华工业(深圳)有限公司 Deep learning acceleration method and user terminal

Also Published As

Publication number Publication date
CN115114033A (en) 2022-09-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant