CN115114033B - Heterogeneous computing method and device based on layer number segmentation - Google Patents

Heterogeneous computing method and device based on layer number segmentation

Info

Publication number
CN115114033B
CN115114033B (Application CN202211044043.0A)
Authority
CN
China
Prior art keywords
processor
submodel
shared physical
submodels
physical memory
Prior art date
Legal status
Active
Application number
CN202211044043.0A
Other languages
Chinese (zh)
Other versions
CN115114033A (en)
Inventor
孙坚伟
胡力
乔安成
Current Assignee
Shanghai Core Computing Technology Co ltd
Original Assignee
Shanghai Core Computing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Core Computing Technology Co ltd filed Critical Shanghai Core Computing Technology Co ltd
Priority to CN202211044043.0A
Publication of CN115114033A
Application granted
Publication of CN115114033B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a heterogeneous computing method and device based on layer number division, which perform operations on a plurality of processors and comprise the following steps: predicting a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, and, when the first time exceeds a first threshold, dividing the convolutional neural network model into at least two submodels; operating a first submodel of the at least two submodels on the first processor and a second submodel of the at least two submodels on a second processor of the plurality of processors; and dynamically allocating the numbers of layers of the at least two submodels, wherein the sum of the numbers of layers of the at least two submodels is the total number of layers of the convolutional neural network model. By segmenting the convolutional neural network into submodels according to layer number and distributing them to the hardware units of the plurality of processors, the heterogeneous computing method and device based on layer number segmentation improve the accelerated computing efficiency of the convolutional neural network in a system with residual computing power.

Description

Heterogeneous computing method and device based on layer number segmentation
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a heterogeneous computing method and device based on layer number segmentation.
Background
With the wide application of artificial intelligence in end-side computing systems, convolutional neural networks (CNNs) are widely deployed on various heterogeneous computing platforms. Owing to their large computation load and strict real-time requirements, CNNs have gradually become a bottleneck in the field of artificial intelligence.
In the prior art, common solutions include:
1. the use of a low complexity CNN model makes the various algorithm models currently available unprofitable.
2. Using an embedded neural network processor (NPU) with higher computing power, or interconnecting a plurality of NPUs, which greatly increases cost.
3. Sending the CNN model and the data to a cloud platform for calculation and returning the result. Because of the large CNN data volume and the synchronization required among different cloud platforms, the network transmission delay prevents this method from meeting real-time requirements.
Common embedded systems usually include different types of hardware units such as an NPU, a GPU (Graphics Processing Unit), and a CPU (Central Processing Unit). Usually only the NPU executes CNN inference, while hardware such as the GPU and the CPU is relatively idle. Because the various deep learning operators can now be implemented on different hardware, it is possible to assist the NPU's CNN inference by using the performance margins of the GPU and the CPU in the system.
Disclosure of Invention
The invention provides a heterogeneous computing method and device based on layer number segmentation, which are characterized in that a convolutional neural network is segmented into submodels according to the layer number and then distributed to hardware units of a plurality of processors, so that the accelerated computing efficiency of the convolutional neural network is improved in a system with residual computing power.
The embodiment of the invention provides a heterogeneous computing method based on layer number division, which is based on a plurality of processors to carry out operation and comprises the following steps:
predicting a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, and when the first time exceeds a first threshold, segmenting the convolutional neural network model into at least two sub-models;
a first sub-model of the at least two sub-models operating on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors;
and dynamically allocating the layer number of the at least two submodels, wherein the sum of the layer number of the at least two submodels is the total layer number of the convolutional neural network model.
Preferably, a second time required for the convolutional neural network model to operate on the first processor and the second processor of the plurality of processors is predicted, and when the second time exceeds a second threshold, the convolutional neural network model is divided into three sub-models;
the first one of the three submodels operates on the first processor, the second one of the three submodels operates on the second processor, and a third one of the three submodels operates on a third one of the plurality of processors.
Preferably, said partitioning the convolutional neural network model into at least two submodels when the first time exceeds a first threshold, including the first of the at least two submodels operating on the first processor, the second of the at least two submodels operating on the second processor, a third of the at least two submodels operating on a third processor of the plurality of processors.
Preferably, the dynamically allocating the number of layers of the at least two submodels includes dynamically allocating according to historical performance margins of the plurality of processors.
Preferably, the dynamic allocation according to the historical performance margins of the plurality of processors includes obtaining the idle-busy time ratio of the plurality of processors within a preset time unit, where the idle-busy time ratio per unit time is calculated by the following formula:

$R = T_{idle} / T_{busy}$

where $R$ is the idle-busy time ratio of the plurality of processors per unit time, $T_{idle}$ is the idle time of the plurality of processors per unit time, and $T_{busy}$ is the busy time of the plurality of processors per unit time.
Preferably, the dynamically allocating the number of layers of the at least two submodels comprises dynamically allocating according to the real-time busy states of the plurality of processors: when it is monitored that the first processor is currently overloaded and the second processor has a large performance margin, the first submodel is operated on the first processor and the second submodel is operated on the second processor in the next frame;
and when it is monitored that the second processor is currently overloaded and a third processor of the plurality of processors has a large performance margin, the second submodel is operated on the second processor and a third submodel of the at least two submodels is operated on the third processor in the next frame.
Preferably, according to the number of the at least two submodels, distributing a plurality of blocks of shared physical memories for the input and output layers shared among the submodels;
when the at least two submodels comprise the first submodel and the second submodel, the multi-block shared physical memory comprises a first shared physical memory and a second shared physical memory;
when the at least two submodels comprise the first submodel, the second submodel and a third submodel, the multi-block shared physical memory comprises the first shared physical memory, the second shared physical memory and a third shared physical memory.
Preferably, the multi-block shared physical memory is used for supporting the output of the first submodel and the second submodel and the reading of the second submodel and the third submodel;
when the plurality of shared physical memories comprise the first shared physical memory and the second shared physical memory, the first shared physical memory is used for storing the output of the ith frame of the first submodel or the second submodel, and the first shared physical memory is also used for reading the ith frame of the second submodel or the third submodel;
the second shared physical memory is used for storing the output of the first submodel or the second submodel of the (i + 1) th frame, and the second shared physical memory is also used for reading the second submodel or the third submodel of the (i + 1) th frame;
when the plurality of shared physical memories comprise the first shared physical memory, the second shared physical memory and the third shared physical memory, the third shared physical memory is used for storing the output of the i +2 frame first submodel or the second submodel, and the third shared physical memory is also used for reading the i +2 frame second submodel or the third submodel;
wherein i is a natural number, and i is more than or equal to 1.
Preferably, the plurality of processors includes at least two of an embedded neural network processor, a graphics processor, and a central processor.
The embodiment of the present invention further provides a heterogeneous computing device based on layer number division, which performs operations based on multiple processors, and includes:
a prediction module to predict a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, when the first time exceeds a first threshold, to partition the convolutional neural network model into at least two sub-models;
an operation module to operate a first sub-model of the at least two sub-models on the first processor, a second sub-model of the at least two sub-models on a second processor of the plurality of processors;
and the dynamic allocation module is used for dynamically allocating the layer number of the at least two sub models, and the sum of the layer number of the at least two sub models is the total layer number of the convolutional neural network model.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
according to the heterogeneous computing method and device based on layer number segmentation, the first time required by operation of a convolutional neural network model on a first processor in a plurality of processors is predicted, and when the first time exceeds a first threshold value, the convolutional neural network model is segmented into at least two sub models; a first sub-model of the at least two sub-models operating on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors; dynamically allocating the layer number of the at least two submodels, wherein the sum of the layer number of the at least two submodels is the total layer number of the convolutional neural network model, and dividing the convolutional neural network into submodels according to the layer number and then allocating the submodels to hardware units of a plurality of processors, so that the accelerated calculation efficiency of the convolutional neural network is improved in a system with residual calculation power;
further, distributing a plurality of blocks of shared physical memories for the input and output layers shared among the submodels according to the number of the at least two submodels; when the at least two submodels comprise the first submodel and the second submodel, the multi-block shared physical memory comprises a first shared physical memory and a second shared physical memory; when the at least two submodels comprise the first submodel, the second submodel and a third submodel, the plurality of blocks of shared physical memories comprise the first shared physical memory, the second shared physical memory and a third shared physical memory, and correct data can be read by the N-th layer neural network submodel by distributing the plurality of blocks of shared physical memories.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for describing the embodiments or the prior art, and it is apparent that the drawings in the following description are some embodiments of the present invention, but not all embodiments. For a person skilled in the art, other figures can also be obtained from these figures without inventive exercise.
Fig. 1 is a schematic flowchart of a heterogeneous computing method based on layer number segmentation according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a heterogeneous computing method based on layer number segmentation according to another embodiment of the present invention;
FIG. 3 is a block diagram of a heterogeneous computing device based on layer number partitioning according to an embodiment of the present invention;
fig. 4 is a schematic diagram illustrating real-time states of a plurality of processors in a heterogeneous computing method based on layer number segmentation according to an embodiment of the present invention;
fig. 5 is a schematic diagram of multiple shared physical memories based on a layer number division heterogeneous computing method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The technical solution of the present invention will be described in detail with reference to specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Based on the problems in the prior art, the embodiment of the invention provides a heterogeneous computing method and device based on layer number segmentation: the convolutional neural network is segmented into submodels according to layer number and then distributed to the hardware units of an embedded neural network processor, a graphics processor, and a central processing unit, so that the accelerated computing efficiency of the convolutional neural network is improved in a system with residual computing power.
Fig. 1 is a schematic flow chart of a heterogeneous computing method based on layer number segmentation according to an embodiment of the present invention. Referring now to fig. 1, an embodiment of the present invention provides a heterogeneous computing method based on layer number division, which performs operations based on a plurality of processors and includes: step S101: predicting a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, and when the first time exceeds a first threshold, segmenting the convolutional neural network model into at least two sub-models; step S102: a first sub-model of the at least two sub-models operating on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors; step S103: dynamically allocating the layer number of the at least two submodels, wherein the sum of the layer number of the at least two submodels is the total layer number of the convolutional neural network model.
In some embodiments, the plurality of processors includes at least two of an embedded neural network processor, a graphics processor, and a central processor.
In step S101, predicting a first time required for the convolutional neural network model to operate on a first processor of the plurality of processors specifically includes performing dynamic scheduling according to historical performance margins of the plurality of processors, and performing dynamic scheduling according to real-time busy states of the plurality of processors.
When the first time exceeds a first threshold value, the convolutional neural network model is divided into at least two submodels, namely, if the first processor is predicted to be incapable of processing the convolutional neural network model in time, the convolutional neural network is divided into at least two submodels for accelerating the calculation efficiency. The first threshold may be set according to system history data, which is not described herein.
Segmenting the convolutional neural network model into at least two sub-models specifically comprises: traversing all network layers of the N-layer convolutional neural network model and calculating the computation amount C[i] of each network layer, where the computation amount of a convolutional layer is 2 × (weight quantity) × (output size); the total computation amount SUM = C[1] + C[2] + … + C[N] is obtained by accumulation.
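As an illustration (not part of the patent text), the following is a minimal Python sketch of this per-layer computation estimate; the ConvLayer fields and helper names are assumptions made for the example:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ConvLayer:
    weight_count: int   # number of weights, e.g. k_h * k_w * c_in * c_out
    output_size: int    # number of output elements, e.g. h_out * w_out * c_out

def layer_cost(layer: ConvLayer) -> int:
    # C[i] = 2 * (weight quantity) * (output size), as stated above
    return 2 * layer.weight_count * layer.output_size

def total_cost(layers: List[ConvLayer]) -> int:
    # SUM = C[1] + C[2] + ... + C[N]
    return sum(layer_cost(l) for l in layers)

# Toy 3-layer network with made-up shapes
layers = [ConvLayer(3*3*3*16, 112*112*16),
          ConvLayer(3*3*16*32, 56*56*32),
          ConvLayer(3*3*32*64, 28*28*64)]
print(total_cost(layers))
```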
In step S102, a first submodel of the at least two submodels operates on the first processor and a second submodel of the at least two submodels operates on a second processor of the plurality of processors. The first processor may be any one of an embedded neural network processor and a graphics processor, and the second processor may be any one of a graphics processor and a central processor.
For example, the first sub-model operates on a first processor, i.e., an embedded neural network processor, and the second sub-model operates on a second processor, i.e., a graphics processor, or the first sub-model operates on the first processor, i.e., the graphics processor, and the second sub-model operates on the second processor, i.e., a central processing unit.
In step S103, dynamically allocating the number of layers of the at least two submodels, where the sum of the number of layers of the at least two submodels is the total number of layers of the convolutional neural network model. For example, the number of layers of a first submodel in the at least two submodels is a, the number of layers of a second submodel in the at least two submodels is b, and the sum of the number of layers of the first submodel a and the number of layers of the second submodel b is the total number of layers N of the convolutional neural network model.
In some embodiments, predicting a second time required for the convolutional neural network model to operate on the first processor and the second processor of the plurality of processors, when the second time exceeds a second threshold, partitioning the convolutional neural network model into three sub-models; that is, if it is predicted that neither the first processor nor the second processor can process the convolutional neural network model in time, the convolutional neural network is divided into three sub-models in order to accelerate the computational efficiency. The second threshold may be set according to system history data, which is not described herein.
For example, the number of layers of a first submodel in the three submodels is a, the number of layers of a second submodel in the three submodels is b, the number of layers of a third submodel in the three submodels is c, and the sum of the number of layers of the first submodel a, the number of layers of the second submodel b and the number of layers of the third submodel c is the total number of layers N of the convolutional neural network model.
The first one of the three submodels operates on the first processor, the second one of the three submodels operates on the second processor, and a third one of the three submodels operates on a third one of the plurality of processors. The first processor may be an embedded neural network processor, the second processor may be a graphics processor, and the third processor may be a central processor.
For example, a first submodel operates on a first processor, i.e., an embedded neural network processor, a second submodel operates on a second processor, i.e., a graphics processor, and a third submodel operates on a third processor, i.e., a central processor.
In some embodiments, the partitioning the convolutional neural network model into at least two submodels when the first time exceeds a first threshold, including the first one of the at least two submodels operating on the first processor, the second one of the at least two submodels operating on the second processor, a third one of the at least two submodels operating on a third one of the plurality of processors.
For example, when the first time exceeds a first threshold, the convolutional neural network model is divided into three submodels, the first submodel operates on a first processor, namely an embedded neural network processor, the second submodel operates on a second processor, namely a graphics processor, and the third submodel operates on a third processor, namely a central processing unit.
In some embodiments, dynamically allocating the number of layers of the at least two submodels includes dynamically allocating according to historical performance margins of the plurality of processors.
In some embodiments, the dynamic allocation according to the historical performance margins of the plurality of processors includes obtaining the idle-busy time ratios of the plurality of processors within a preset time unit, where the idle-busy time ratio per unit time is calculated by the following formula:

$R = T_{idle} / T_{busy}$

where $R$ is the idle-busy time ratio of the plurality of processors per unit time, $T_{idle}$ is the idle time of the plurality of processors per unit time, and $T_{busy}$ is the busy time of the plurality of processors per unit time.

Specifically, the idle-busy time ratio per unit time of the embedded neural network processor is calculated by:

$R_{NPU} = T_{NPU,idle} / T_{NPU,busy}$

where $R_{NPU}$ is the idle-busy time ratio of the embedded neural network processor per unit time, $T_{NPU,idle}$ is its idle time per unit time, and $T_{NPU,busy}$ is its busy time per unit time.

The idle-busy time ratio per unit time of the graphics processor is calculated by:

$R_{GPU} = T_{GPU,idle} / T_{GPU,busy}$

where $R_{GPU}$ is the idle-busy time ratio of the graphics processor per unit time, $T_{GPU,idle}$ is its idle time per unit time, and $T_{GPU,busy}$ is its busy time per unit time.

The idle-busy time ratio per unit time of the central processing unit is calculated by:

$R_{CPU} = T_{CPU,idle} / T_{CPU,busy}$

where $R_{CPU}$ is the idle-busy time ratio of the central processing unit per unit time, $T_{CPU,idle}$ is its idle time per unit time, and $T_{CPU,busy}$ is its busy time per unit time.
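For illustration, a small Python sketch of the idle-busy ratio computation above; the processor names and sample values are placeholder assumptions, and a real system would read these times from OS or driver counters:

```python
def idle_busy_ratio(idle_time: float, busy_time: float) -> float:
    """R = T_idle / T_busy for one processor over one time unit."""
    if busy_time == 0.0:
        return float("inf")  # the processor was fully idle in this window
    return idle_time / busy_time

window = 1.0  # preset time unit, in seconds
idle_samples = {"npu": 0.2, "gpu": 0.7, "cpu": 0.5}  # sample idle times
ratios = {name: idle_busy_ratio(idle, window - idle)
          for name, idle in idle_samples.items()}
print(ratios)  # larger ratio = larger historical performance margin
```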
Fig. 4 is a schematic diagram of the real-time states of a plurality of processors in a heterogeneous computing method based on layer number segmentation according to an embodiment of the present invention. Referring now to FIG. 4, in some embodiments, dynamically allocating the number of layers of the at least two submodels includes dynamically allocating according to the real-time busy states of the plurality of processors: when it is monitored that the first processor is currently overloaded and the second processor has a large performance margin, the next frame operates the first submodel on the first processor and the second submodel on the second processor;
and when it is monitored that the second processor is currently overloaded and a third processor of the plurality of processors has a large performance margin, the next frame operates the second submodel on the second processor and a third submodel of the at least two submodels on the third processor.
In some real-time situations the load on the system increases, for example when an autonomous vehicle needs to turn, when the vehicle speed rises and the sensor frequency increases, or when reversing-image detection is added while backing up; in such cases a processor may become overloaded.
If the frequency of the input data increases, whether a processor can complete its calculation on time can be predicted from its real-time busy state. For example, when it is monitored that the first processor, i.e. the embedded neural network processor, is currently overloaded, its computing power is judged insufficient, and in the next frame the first submodel is operated on the embedded neural network processor while the second submodel is operated on the second processor, i.e. the graphics processor. Likewise, when it is monitored that the second processor, i.e. the graphics processor, is currently overloaded, in the next frame the second submodel is operated on the graphics processor and the third submodel is operated on the third processor, i.e. the central processing unit.
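A hedged sketch of this next-frame reassignment logic, assuming a hypothetical status monitor interface that the patent does not specify:

```python
def plan_next_frame(status, assignment):
    # status: processor name -> "overloaded" | "margin" | "normal"
    # assignment: submodel name -> processor it will run on next frame
    if status.get("npu") == "overloaded" and status.get("gpu") == "margin":
        # offload the tail of the network from the NPU to the GPU
        assignment["submodel2"] = "gpu"
    if status.get("gpu") == "overloaded" and status.get("cpu") == "margin":
        # push the last submodel further down to the CPU
        assignment["submodel3"] = "cpu"
    return assignment

assignment = {"submodel1": "npu", "submodel2": "npu", "submodel3": "gpu"}
status = {"npu": "overloaded", "gpu": "margin", "cpu": "normal"}
print(plan_next_frame(status, assignment))
```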
In some embodiments, the whole neural network model is first run several times on the idle embedded neural network processor, graphics processor, and central processing unit respectively, to obtain the preliminary operation times $t_{NPU}$, $t_{GPU}$, and $t_{CPU}$ on the three processors. From these, the available computing power of the three processors can be obtained: the available computing power of the embedded neural network processor is $P_{NPU}$, that of the graphics processor is $P_{GPU}$, and that of the central processing unit is $P_{CPU}$.
In some embodiments, the first submodel spans from the first layer to the a-th layer, with a total computation amount of $C[1] + C[2] + \cdots + C[a]$; the remaining part forms a second submodel with N − a layers.
If the system predicts that the graphics processor cannot complete the operation of the convolutional neural network model in time, the second submodel is further segmented, and the number of layers of the segmented second submodel is reduced from N − a to b, i.e. from layer a + 1 to layer a + 1 + b. The computation amount of the b-layer second submodel, $C[a+1] + \cdots + C[a+1+b]$, is data that is updated in real time.
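For illustration, a Python sketch of one plausible split-point search: it chooses a so that the cumulative computation C[1..a] matches the NPU's share of total available computing power. The proportional-split rule is an assumption for the example; the patent only states that the layer counts are adjusted from real-time data:

```python
def choose_split(costs, p_npu, p_gpu, p_cpu):
    # costs: per-layer computation amounts C[1..N]
    # p_*: available computing power of each processor (arbitrary units)
    total = sum(costs)
    target = total * p_npu / (p_npu + p_gpu + p_cpu)
    acc, a = 0, 0
    for i, c in enumerate(costs, start=1):
        if acc + c > target:
            break
        acc += c
        a = i
    return a  # layers 1..a form the first submodel; the rest go to GPU/CPU

costs = [120, 80, 200, 150, 90, 60]  # toy values for C[1]..C[6]
print(choose_split(costs, p_npu=4.0, p_gpu=2.0, p_cpu=1.0))
```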
In some embodiments, according to the number of the at least two submodels, allocating a plurality of blocks of shared physical memory for the input/output layer shared between the submodels;
when the at least two submodels comprise the first submodel and the second submodel, the multi-block shared physical memory comprises a first shared physical memory and a second shared physical memory;
when the at least two submodels comprise the first submodel, the second submodel and a third submodel, the multi-block shared physical memory comprises the first shared physical memory, the second shared physical memory and a third shared physical memory.
The first shared physical memory, the second shared physical memory, and the third shared physical memory are used for shared-memory multi-buffering among the first submodel, the second submodel, and the third submodel. Fig. 5 is a schematic diagram of multiple shared physical memories of a layer number division-based heterogeneous computing method according to an embodiment of the present invention. Referring now to FIG. 5, in some embodiments, the plurality of blocks of shared physical memory are used to support the output of the first submodel and the second submodel and the reading of the second submodel and the third submodel;
when the plurality of shared physical memories comprise the first shared physical memory and the second shared physical memory, the first shared physical memory is used for storing the output of the ith frame of the first submodel or the second submodel, and the first shared physical memory is also used for reading the ith frame of the second submodel or the third submodel;
the second shared physical memory is used for storing the output of the first submodel or the second submodel of the (i + 1) th frame, and the second shared physical memory is also used for reading the second submodel or the third submodel of the (i + 1) th frame;
when the plurality of shared physical memories include the first shared physical memory, the second shared physical memory, and the third shared physical memory, the third shared physical memory is used for storing the output of the i +2 th frame of the first submodel or the second submodel, and the third shared physical memory is also used for reading the i +2 th frame of the second submodel or the third submodel;
wherein i is a natural number, and i is more than or equal to 1.
As shown in fig. 5, the frame-(i+1) first submodel and the frame-i second submodel execute in parallel in time. If there were only a single shared physical memory, the frame-i second submodel might fail to read the data written to memory by the frame-i first submodel, because the write operation of the frame-(i+1) first submodel could precede the read operation of the frame-i second submodel. By allocating a plurality of blocks of shared physical memory, the layer-segmented neural network submodels read correct data.
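A minimal Python sketch of the multi-buffer handoff shown in fig. 5, under the assumption that frame i uses buffer i mod K; synchronization (per-buffer ready flags or semaphores, which a real implementation needs) is omitted:

```python
class SharedBuffers:
    """K rotating buffers: a frame's producer and consumer share one slot,
    while the next frame's producer writes a different slot."""
    def __init__(self, num_buffers: int, size: int):
        self.buffers = [bytearray(size) for _ in range(num_buffers)]

    def write_slot(self, frame: int) -> bytearray:
        # output buffer for this frame's producer submodel
        return self.buffers[frame % len(self.buffers)]

    def read_slot(self, frame: int) -> bytearray:
        # input buffer for the same frame's consumer submodel
        return self.buffers[frame % len(self.buffers)]

bufs = SharedBuffers(num_buffers=2, size=1024)
bufs.write_slot(frame=1)[0] = 42   # frame-1 first submodel writes buffer 1
bufs.write_slot(frame=2)[0] = 43   # frame-2 first submodel writes buffer 0
print(bufs.read_slot(frame=1)[0])  # frame-1 second submodel still reads 42
```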
In some embodiments, the plurality of processors includes at least two of an embedded neural network processor, a graphics processor, and a central processor.
Fig. 2 is a flowchart illustrating a heterogeneous computing method based on layer number segmentation according to another embodiment of the present invention. Referring now to fig. 2, an embodiment of the present invention provides a heterogeneous computing method based on layer number partitioning, which performs operations based on multiple processors, including: step S201: predicting a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, and partitioning the neural network model into at least two sub-models when the first time exceeds a first threshold; step S202: a first sub-model of the at least two sub-models operating on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors; step S203: predicting a second time required for the convolutional neural network model to operate on the first processor and the second processor of the plurality of processors, and when the second time exceeds a second threshold, partitioning the convolutional neural network model into three submodels; step S204: the first one of the three submodels operating on the first processor, the second one of the three submodels operating on the second processor, a third one of the three submodels operating on a third one of the plurality of processors; step S205: and dynamically allocating the layer number of the at least two sub models, wherein the sum of the layer number of the at least two sub models is the total layer number of the convolutional neural network model.
Fig. 3 is a block diagram of a heterogeneous computing device based on layer number partitioning according to an embodiment of the present invention. Referring now to fig. 3, an embodiment of the present invention further provides a heterogeneous computing device based on layer number division, which performs operations based on multiple processors, including: a prediction module 31 for predicting a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, the convolutional neural network model being partitioned into at least two sub-models when the first time exceeds a first threshold; an operation module 32 for operating a first sub-model of the at least two sub-models on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors; and a dynamic allocation module 33, configured to dynamically allocate the number of layers of the at least two sub-models, where a sum of the number of layers of the at least two sub-models is a total number of layers of the convolutional neural network model.
In summary, in the heterogeneous computing method and apparatus based on layer number segmentation according to the embodiments of the present invention, a first time required for a convolutional neural network model to operate on a first processor of a plurality of processors is predicted, and when the first time exceeds a first threshold, the convolutional neural network model is segmented into at least two sub-models; a first sub-model of the at least two sub-models operating on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors; dynamically allocating the layer number of the at least two submodels, wherein the sum of the layer number of the at least two submodels is the total layer number of the convolutional neural network model, and dividing the convolutional neural network into submodels according to the layer number and then allocating the submodels to hardware units of a plurality of processors, so that the accelerated calculation efficiency of the convolutional neural network is improved in a system with residual calculation power;
furthermore, distributing a plurality of blocks of shared physical memories for the input and output layers shared between the submodels according to the quantity of the at least two submodels; when the at least two submodels comprise the first submodel and the second submodel, the multi-block shared physical memory comprises a first shared physical memory and a second shared physical memory; when the at least two submodels comprise the first submodel, the second submodel and a third submodel, the plurality of blocks of shared physical memories comprise the first shared physical memory, the second shared physical memory and a third shared physical memory, and correct data can be read by the N-th layer neural network submodel by distributing the plurality of blocks of shared physical memories.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the spirit of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A heterogeneous computing method based on layer number division is operated based on a plurality of processors, and is characterized by comprising the following steps:
predicting a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, and when the first time exceeds a first threshold, segmenting the convolutional neural network model into at least two sub-models;
a first sub-model of the at least two sub-models operates on the first processor, a second sub-model of the at least two sub-models operates on a second processor of the plurality of processors, the first processor is an embedded neural network processor, the second processor is a graphics processor, or the first processor is a graphics processor, the second processor is a central processor;
dynamically allocating the layer number of the at least two submodels, wherein the sum of the layer number of the at least two submodels is the total layer number of the convolutional neural network model;
distributing a plurality of blocks of shared physical memories for the input and output layers shared among the submodels according to the quantity of the at least two submodels;
when the at least two submodels comprise the first submodel and the second submodel, the multi-block shared physical memory comprises a first shared physical memory and a second shared physical memory;
when the at least two submodels include the first submodel, the second submodel, and a third submodel, the plurality of blocks of shared physical memory include the first shared physical memory, the second shared physical memory, and a third shared physical memory.
2. The layer-number-division-based heterogeneous computing method according to claim 1, wherein
predicting a second time required for the convolutional neural network model to operate on the first processor and the second processor of the plurality of processors, and when the second time exceeds a second threshold, partitioning the convolutional neural network model into three submodels;
the first one of the three submodels operates on the first processor, the second one of the three submodels operates on the second processor, and the third one of the three submodels operates on a third one of the plurality of processors.
3. The layer-number-division-based heterogeneous computation method according to claim 1,
the partitioning of the convolutional neural network model into at least two submodels when the first time exceeds a first threshold, including the first of the at least two submodels operating on the first processor, the second of the at least two submodels operating on the second processor, a third of the at least two submodels operating on a third of the plurality of processors.
4. The layer-number-division-based heterogeneous computation method according to claim 1,
and dynamically allocating the layer number of the at least two submodels comprises dynamically allocating according to the historical performance margins of the plurality of processors.
5. The layer-number-division-based heterogeneous computing method according to claim 4, wherein
the dynamic allocation according to the historical performance margins of the processors comprises the steps of obtaining idle and busy time ratios of the processors in a preset time unit, wherein the idle and busy time ratios of the processors in the unit time unit are calculated by the following formula:
Figure 819160DEST_PATH_IMAGE002
wherein, the first and the second end of the pipe are connected with each other,
Figure 641622DEST_PATH_IMAGE003
is a ratio of free busy times per unit time for the plurality of processors,
Figure 908656DEST_PATH_IMAGE004
is the idle time in the unit time of the plurality of processors,
Figure 107556DEST_PATH_IMAGE005
busy time in unit time for the plurality of processors.
6. The layer-number-division-based heterogeneous computation method according to claim 1,
dynamically allocating the number of layers of the at least two submodels comprises dynamically allocating according to the real-time busy states of the plurality of processors: when it is monitored that the first processor is currently overloaded and the second processor has a large performance margin, the first submodel is operated on the first processor and the second submodel is operated on the second processor in the next frame;
and when it is monitored that the second processor is currently overloaded and a third processor of the plurality of processors has a large performance margin, the second submodel is operated on the second processor and a third submodel of the at least two submodels is operated on the third processor in the next frame.
7. The layer-number-division-based heterogeneous computation method according to claim 1,
the multi-block shared physical memory is used for supporting the output of the first submodel and the second submodel and the reading of the second submodel and the third submodel;
when the plurality of shared physical memories comprise the first shared physical memory and the second shared physical memory, the first shared physical memory is used for storing the output of the ith frame first submodel or the ith frame second submodel, and the first shared physical memory is also used for reading the ith frame second submodel or the ith frame third submodel;
the second shared physical memory is used for storing the output of the (i + 1) th frame first submodel or the second submodel, and the second shared physical memory is also used for reading the (i + 1) th frame second submodel or the third submodel;
when the plurality of shared physical memories comprise the first shared physical memory, the second shared physical memory and the third shared physical memory, the third shared physical memory is used for storing the output of the i +2 frame first submodel or the second submodel, and the third shared physical memory is also used for reading the i +2 frame second submodel or the third submodel;
wherein i is a natural number, and i is more than or equal to 1.
8. The method of claim 1, wherein the plurality of processors includes at least two of an embedded neural network processor, a graphics processor, and a central processor.
9. A heterogeneous computing device based on layer number division, which performs operations based on a plurality of processors, comprising:
a prediction module to predict a first time required for a convolutional neural network model to operate on a first processor of the plurality of processors, when the first time exceeds a first threshold, to partition the convolutional neural network model into at least two sub-models;
an operation module to operate a first sub-model of the at least two sub-models on the first processor, a second sub-model of the at least two sub-models operating on a second processor of the plurality of processors, the first processor being an embedded neural network processor, the second processor being a graphics processor, or the first processor being a graphics processor, the second processor being a central processor;
the dynamic allocation module is used for dynamically allocating the layer number of the at least two sub models, and the sum of the layer number of the at least two sub models is the total layer number of the convolutional neural network model;
distributing a plurality of blocks of shared physical memories for the input and output layers shared among the submodels according to the number of the at least two submodels;
when the at least two submodels comprise the first submodel and the second submodel, the multi-block shared physical memory comprises a first shared physical memory and a second shared physical memory;
when the at least two submodels comprise the first submodel, the second submodel and a third submodel, the multi-block shared physical memory comprises the first shared physical memory, the second shared physical memory and a third shared physical memory.
CN202211044043.0A 2022-08-30 2022-08-30 Heterogeneous computing method and device based on layer number segmentation Active CN115114033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211044043.0A CN115114033B (en) 2022-08-30 2022-08-30 Heterogeneous computing method and device based on layer number segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211044043.0A CN115114033B (en) 2022-08-30 2022-08-30 Heterogeneous computing method and device based on layer number segmentation

Publications (2)

Publication Number Publication Date
CN115114033A CN115114033A (en) 2022-09-27
CN115114033B true CN115114033B (en) 2022-12-06

Family

ID=83335686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211044043.0A Active CN115114033B (en) 2022-08-30 2022-08-30 Heterogeneous computing method and device based on layer number segmentation

Country Status (1)

Country Link
CN (1) CN115114033B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667046A (en) * 2019-03-08 2020-09-15 富泰华工业(深圳)有限公司 Deep learning acceleration method and user terminal

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7257756B2 (en) * 2018-08-20 2023-04-14 キヤノン株式会社 Image identification device, image identification method, learning device, and neural network
US20210056357A1 (en) * 2019-08-19 2021-02-25 Board Of Trustees Of Michigan State University Systems and methods for implementing flexible, input-adaptive deep learning neural networks
CN112783807B (en) * 2020-12-31 2023-12-29 深圳大普微电子科技有限公司 Model calculation method and system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667046A (en) * 2019-03-08 2020-09-15 富泰华工业(深圳)有限公司 Deep learning acceleration method and user terminal

Also Published As

Publication number Publication date
CN115114033A (en) 2022-09-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant