CN111723900A - Mapping method of neural network based on many-core processor and computing device - Google Patents

Mapping method of neural network based on many-core processor and computing device

Info

Publication number
CN111723900A
CN111723900A (application CN201910203167.0A)
Authority
CN
China
Prior art keywords: network, sub-network, many-core, network layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910203167.0A
Other languages
Chinese (zh)
Other versions
CN111723900B (en)
Inventor
Zhang Weihao (张伟豪)
Li Han (李涵)
Pei Jing (裴京)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lynxi Technology Co Ltd
Original Assignee
Beijing Lynxi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lynxi Technology Co Ltd filed Critical Beijing Lynxi Technology Co Ltd
Priority to CN201910203167.0A priority Critical patent/CN111723900B/en
Priority to PCT/CN2020/077973 priority patent/WO2020187041A1/en
Publication of CN111723900A publication Critical patent/CN111723900A/en
Application granted granted Critical
Publication of CN111723900B publication Critical patent/CN111723900B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06N3/04 Architecture, e.g. interconnection topology (G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks)
    • G06N3/045 Combinations of networks
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management (Y02D Climate change mitigation technologies in information and communication technologies)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a mapping method and a computing device for a neural network based on a many-core processor, wherein the method comprises the following steps: acquiring a neural network to be mapped, and sequentially dividing all network layers of the neural network to be mapped into a plurality of network layer groups; splitting each of the network layers so that each network layer comprises a plurality of sub-network layers; fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and respectively mapping the sub-network layer groups to a plurality of cores of a preset many-core processor. The method allocates the computation, storage and routing resources of each core of the many-core processor so that the neural network runs more efficiently, the load of each core is better balanced than in conventional schemes, and the utilization of computation and storage resources is effectively improved.

Description

Mapping method of neural network based on many-core processor and computing device
Technical Field
The invention relates to the technical field of processors, and in particular to a mapping method for a neural network based on a many-core processor and a computing device.
Background
With the growing application of artificial intelligence in various fields, a variety of hardware platforms for running artificial intelligence algorithms have emerged, the many-core processor being one of them. Neural network algorithms are the mainstream artificial intelligence algorithms and are characterized by a large amount of computation and a high degree of parallelism. These characteristics make neural networks well suited to many-core architectures, which is also why many-core processor architectures are an important route for building neural network accelerators today. Given a neural network algorithm and a suitable many-core processor, how to map the algorithm onto the processor and how to allocate the computation, storage and routing resources of each core are problems that urgently need to be solved.
Disclosure of Invention
In view of the above, the present invention provides a mapping method and computing device for a many-core processor based neural network that overcomes or at least partially solves the above problems.
According to one aspect of the invention, a mapping method for a neural network based on a many-core processor is provided, and comprises the following steps:
acquiring a neural network to be mapped, and sequentially dividing all network layers of the neural network to be mapped into a plurality of network layer groups;
splitting each of the network layers, so that each network layer comprises a plurality of sub-network layers;
and fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and respectively mapping the sub-network layer groups to a plurality of cores of a preset many-core processor. By splitting the neural network group by group and then fusing within each group before mapping onto the cores, the routing between cores in the many-core processor can be greatly reduced and the processing efficiency of the processor improved.
Optionally, after the sub-network layers belonging to the same network layer group are fused according to the preset rule to obtain a plurality of sub-network layer groups, and before the plurality of sub-network layer groups are respectively mapped to a plurality of cores of a preset many-core processor, the method further includes:
judging whether the number of network layers included by the sub-network layer group is greater than a first preset threshold value or not;
if the number of network layers included in at least one sub-network layer group is greater than the first preset threshold, performing intra-group splitting again on the network layer group to which that sub-network layer group belongs. During intra-group fusion, a group may contain so many layers (because too many network layers were assigned to it during network grouping) that the fused task exceeds the resource limit a single core can bear; in that case, regrouping reduces the number of network layers in the group so as to achieve load balance across the cores.
Optionally, after mapping the sub-network layer groups to a plurality of cores of a preset many-core processor, the method further includes:
screening out, from the plurality of cores, at least one first core whose resource utilization is lower than a preset index;
and fusing the sub-network layer groups corresponding to the first core again to obtain at least one first sub-network layer group, and remapping the first sub-network layer group to a second core of the many-core processor. After the mapped cores have been obtained through the above steps, some cores may have a low resource utilization; re-fusing these cores effectively reduces the scheduling overhead during operation of the processor.
Optionally, after mapping the sub-network layer groups to a plurality of cores of a preset many-core processor, the method further includes:
judging whether the many-core processor has residual cores;
if the remaining cores exist, acquiring at least one third core of which the resource consumption rate is greater than a second preset threshold value in the many-core processor;
translating at least a portion of the sub-network layers mapped to the third core to the remaining cores. On this basis, if the many-core processor still has remaining cores to which no other tasks need to be allocated, an overall re-splitting strategy can be added: the computing tasks of the more heavily loaded cores are redistributed to the remaining cores, which improves the utilization of each core in the many-core processor as well as its operating efficiency.
Optionally, the converting at least part of the sub-network layer mapped to the third core to the remaining core includes:
translating the half subnet layer mapped to the third core to the remaining cores.
Optionally, all network layers in the same network layer group are connected in sequence in the neural network.
Optionally, when the plurality of network layer groups are subjected to intra-group splitting, each network layer in the same network layer group is split into an equal number of sub-network layers.
Optionally, for any network layer group, the sub-network layers with the same index in the network layer group are fused to obtain the sub-network layer group corresponding to the index.
There is also provided, in accordance with yet another aspect of the invention, a computing device, including a many-core processor, wherein,
the many-core processor is configured to execute the algorithm related to the neural network mapped by the above mapping method for a neural network based on a many-core processor.
Optionally, the computing device further comprises:
a storage device for storing a computer program which, when run in the computing device, is loaded and executed by the processor.
The invention provides a better-balanced mapping method for a neural network based on a many-core processor. Through the steps of network grouping, intra-group splitting, intra-group fusion and overall re-fusion, each network layer of the neural network to be mapped is reasonably mapped to the cores of the many-core processor, and the computation, storage and routing resources of each core are allocated so that the neural network runs more efficiently, the load of each core is better balanced than in conventional schemes, and the utilization of computation and storage resources is effectively improved.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a diagram of a many-core processor based neural network mapping, according to one embodiment of the invention;
FIG. 2 shows a diagram of a many-core processor based neural network mapping according to another embodiment of the invention;
FIG. 3 shows a diagram of a many-core processor based neural network mapping according to another embodiment of the invention;
FIG. 4 shows a flow diagram of a many-core processor based neural network mapping method, according to a preferred embodiment of the invention;
FIG. 5 shows a network packet diagram of a many-core processor based neural network, according to a preferred embodiment of the invention;
FIG. 6 shows a schematic diagram of the intra-group split of a many-core processor based neural network, according to a preferred embodiment of the present invention;
FIG. 7 shows an intra-group fusion schematic of a many-core processor based neural network, according to a preferred embodiment of the present invention;
FIG. 8 shows a schematic diagram of the routing between the pre-and post-fusion kernels within a group of a many-core processor-based neural network, according to a preferred embodiment of the present invention;
FIG. 9 shows an overall re-fusion schematic of a many-core processor based neural network, according to a preferred embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The algorithm responsible for distributing a parallel algorithm onto a many-core processor for execution is generally called a scheduling algorithm. Scheduling is divided into static scheduling and dynamic scheduling. Static scheduling means that the scheduling policy is determined before the parallel algorithm is executed, and execution then follows this policy exactly. Dynamic scheduling, by contrast, decides at run time how to schedule the next step according to the state of the algorithm itself and of the environment.
By "mapping" algorithm, we refer in the narrow sense to a static scheduling algorithm that emphasizes mapping some portion of the parallel algorithm onto cores of a many-core processor, each core running and only the portion of the algorithm that is mapped.
The present application is concerned with mapping algorithms for neural networks on many-core processors; mapping algorithms for general parallel algorithms on many-core processors, however, have been widely studied and some general methods have been formed. For example, the simplest general mapping algorithm maps each layer of the neural network to a core in turn until all layers have been allocated. Referring to FIG. 1, Layer0-Layer5 denote network layers 1-6 and Core0-Core5 denote cores 1-6; when mapping the layers of the neural network to the cores of the many-core processor, Layer0-Layer5 may be mapped to Core0-Core5, respectively.
Generally speaking, the layers of a neural network can be extremely unbalanced in computation and storage, and traditional parallel-algorithm mapping techniques are rarely optimized specifically for this characteristic of neural networks. Mapping techniques with simple general-purpose policies may therefore cause a severe load imbalance among the cores, resulting in a significant waste of computational and storage resources, or in routing congestion.
First, a network layer with a higher load can be split, i.e. one layer is mapped onto a plurality of cores and computed by those cores jointly, which benefits the load balance of the whole architecture; the technique used here may be called the splitting technique. As shown in FIG. 2, Layer5 is split, and the two resulting sub-network layers are mapped to Core5 and Core6, respectively.
In addition, layers with smaller loads can be fused so that one core computes several layers, which improves the resource utilization of the cores; the technique used here may be called the fusion technique. As shown in FIG. 3, Layer0 and Layer1 are fused and mapped together to Core0.
Based on the split and fusion technology, the embodiment of the invention provides a more efficient and balanced neural network mapping method based on a many-core processor.
As shown in fig. 4, the method provided by this embodiment may include:
step S401, acquiring a neural network to be mapped, and sequentially combining all network layers of the neural network to be mapped into a plurality of network layer groups; and all network layers in the same network layer group are sequentially connected in the neural network.
That is, after the neural network to be mapped has been acquired, the first step is network grouping. All network layers of the neural network to be mapped are divided in order into different network layer groups; in terms of connectivity, the network layers in the same group usually form a contiguous segment of the neural network. FIG. 5 shows a network grouping diagram of this embodiment: the neural network comprises Layer0-Layer5, and during grouping Layer0 and Layer1 may be assigned to Group0, Layer2-Layer4 to Group1, and Layer5 alone to Group2; the grouping shown in FIG. 5 is only one of many possible groupings.
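As an illustrative aside (not part of the original disclosure), the grouping step can be sketched in a few lines of Python; the helper name group_layers and the fixed group sizes are assumptions made only for this example:

```python
# Minimal sketch of the network-grouping step (step S401), assuming each
# network layer is identified by its index and the group sizes are given.

def group_layers(num_layers, group_sizes):
    """Divide layer indices 0..num_layers-1 into consecutive groups.

    group_sizes -- list of group sizes, e.g. [2, 3, 1] reproduces the
    FIG. 5 example: Group0 = {Layer0, Layer1}, Group1 = {Layer2..Layer4},
    Group2 = {Layer5}.
    """
    assert sum(group_sizes) == num_layers
    groups, start = [], 0
    for size in group_sizes:
        groups.append(list(range(start, start + size)))
        start += size
    return groups

print(group_layers(6, [2, 3, 1]))   # [[0, 1], [2, 3, 4], [5]]
```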
Step S402, splitting each of the network layers, so that each network layer comprises a plurality of sub-network layers.
The second step is intra-group splitting. Because some network layers in the neural network involve a large amount of computation, one or more layers are selected and split using intra-group splitting. Specifically, the splitting technique may be applied to each network layer group; during intra-group splitting, for example, every network layer in the same group may be split into an equal number of sub-network layers, i.e. layers in the same group preferentially share the same split number. For example, as shown in FIG. 5, if Group0 is split into 2 parts, Layer0 and Layer1 are each split into 2 parts. Similarly, if Group1 is split into 3 parts, Layer2, Layer3 and Layer4 are each split into 3 parts; if Group2 is split into 3 parts, Layer5 is split into 3 parts. The split is made as even as possible, i.e. the partial algorithms obtained by splitting have computation, storage and/or routing amounts that are as close as possible. An index can be used to denote the partial algorithm obtained after splitting a network layer: splitting Layer0 into 2 parts yields Layer0[0] and Layer0[1], where [0] and [1] are the indices of the sub-network layers; this process is shown in FIG. 6.
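Continuing the illustrative sketch above (again not part of the original disclosure), intra-group splitting can be expressed with the Layer i[k] index notation of FIG. 6; the per-group split numbers are assumed inputs:

```python
# Sketch of intra-group splitting (step S402): every layer in a group is
# split into the same number of sub-layers, identified by (layer, index).

def split_groups(groups, split_counts):
    """groups       -- output of group_layers, e.g. [[0, 1], [2, 3, 4], [5]]
    split_counts -- one split number per group, e.g. [2, 3, 3]
    Returns, per group, the list of sub-layers (layer_id, sub_index)."""
    split = []
    for layers, n in zip(groups, split_counts):
        split.append([(layer, k) for layer in layers for k in range(n)])
    return split

# Reproduces FIG. 6: Layer0 and Layer1 split into 2 parts, Layers 2-5 into 3.
print(split_groups([[0, 1], [2, 3, 4], [5]], [2, 3, 3]))
```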
Step S403, fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and respectively mapping the sub-network layer groups to a plurality of cores of a preset many-core processor.
Intra-group fusion using the fusion technique merges the sub-network layers within each network layer group into a plurality of sub-network layer groups. As introduced above, an index denotes the partial algorithm obtained after splitting a network layer. Optionally, fusing the sub-network layers belonging to the same network layer group according to the preset rule to obtain a plurality of sub-network layer groups further comprises: for any network layer group, fusing the sub-network layers with the same index in that group to obtain the sub-network layer group corresponding to the index. That is, within each group, sub-network layers sharing the same index are fused onto one core. For example, referring to FIG. 7, Layer0[0] and Layer1[0] are fused into Core0, and Layer0[1] and Layer1[1] into Core1; Layer2[0], Layer3[0] and Layer4[0] are fused into Core2, and so on.
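A corresponding illustrative sketch of intra-group fusion (again an editorial example, not part of the disclosure) fuses sub-network layers sharing the same index onto one core and reproduces the core assignment described for FIG. 7:

```python
# Sketch of intra-group fusion (step S403): within each group, sub-layers
# that share the same split index are fused and mapped onto one core.

def fuse_groups(groups, split_counts):
    cores = []
    for layers, n in zip(groups, split_counts):
        for k in range(n):
            # all sub-layers with index k in this group share one core
            cores.append([(layer, k) for layer in layers])
    return cores

# Core0 = {Layer0[0], Layer1[0]}, Core1 = {Layer0[1], Layer1[1]},
# Core2 = {Layer2[0], Layer3[0], Layer4[0]}, ... as in FIG. 7.
for core_id, contents in enumerate(fuse_groups([[0, 1], [2, 3, 4], [5]], [2, 3, 3])):
    print(f"Core{core_id}: {contents}")
```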
During intra-group fusion, a group may contain so many layers (because too many network layers were assigned to it during network grouping) that, no matter how the tasks are split, the fused multi-layer task exceeds the resource limit a single core can bear. In that case the procedure must return to the network grouping step and regroup.
Optionally, after the sub-network layers in the same network layer group are fused in step S403 to obtain a plurality of sub-network layer groups, it may further be judged whether the number of network layers included in any sub-network layer group is greater than a first preset threshold; if the number of network layers included in at least one sub-network layer group is greater than the first preset threshold, the network layer group to which that sub-network layer group belongs is split again within the group, thereby reducing the number of network layers in the group. The first preset threshold may be set according to actual needs, and the invention is not limited in this respect.
Intra-group fusion can greatly reduce the routing between cores. Take the intra-group fusion of Group0 as an example. As shown in FIG. 8, the total routing of Layer0 and Layer1 before fusion can be represented by two thicker arrows and two thinner arrows. Owing to the data locality of neural network operations, the routing volume represented by the thick arrows is generally much larger than that represented by the thin arrows (except for fully connected neural networks). After intra-group fusion, the routing represented by the thick arrows becomes data transfer inside a core; the only real inter-core routing is the volume represented by the two thin arrows, so the total routing volume is greatly reduced.
The embodiment of the invention further comprises a step S404 of screening out, from the plurality of cores, at least one first core whose resource utilization is lower than a preset index, fusing the sub-network layer groups corresponding to the first core again to obtain at least one first sub-network layer group, and remapping the first sub-network layer group to a second core of the many-core processor.
After the above steps, a number of mapped cores are obtained, among which there may be some 'small cores', i.e. cores that occupy few resources or whose resource utilization is lower than a preset index; the preset index may be set according to the particular many-core processor. If re-fusing some of the mapped cores does not hit the bottleneck of the original mapping, i.e. the neural network under the original mapping scheme neither runs slower nor exceeds the memory and routing limits, an overall re-fusion can be performed in which the cores with lower resource utilization are fused again. An example of this process is shown in FIG. 9.
Referring to FIG. 9, Core0 is responsible for Layer0[0] and Layer1[0], and Core5 is responsible for Layer5[0]; since both Core0 and Core5 are lightly utilized, they can be re-fused, the fused Core0 being responsible for Layer0[0], Layer1[0] and Layer5[0]. Similarly, Core6 is mainly responsible for Layer5[1] and Core7 for Layer5[2]; after the two are fused, Core5 can be responsible for Layer5[1] and Layer5[2] simultaneously.
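For illustration only, one possible greedy form of this overall re-fusion is sketched below; the threshold, the capacity check and the choice to merge the two least-loaded cores are assumptions made for the example, since the method does not prescribe a particular pairing rule:

```python
# Sketch of overall re-fusion (step S404): cores whose load is below a
# threshold are greedily merged, provided the merged core stays within its
# resource limit. Loads stand in for the resource-utilization index.

def refuse_small_cores(core_loads, threshold, capacity):
    """core_loads -- per-core loads (e.g. MACs); returns loads after merging."""
    loads = sorted(core_loads)
    while len(loads) >= 2 and loads[0] < threshold and loads[0] + loads[1] <= capacity:
        merged = loads.pop(0) + loads.pop(0)   # fuse the two smallest cores
        loads.append(merged)
        loads.sort()
    return loads

print(refuse_small_cores([100, 20, 15, 90], threshold=50, capacity=120))  # [35, 90, 100]
```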
On the basis of the above algorithm, if the number of cores required by the final mapping scheme is smaller than the total number of cores of the many-core processor, and the remaining cores do not need to be allocated to other tasks, an overall re-splitting strategy can be added.
The embodiment of the invention further comprises a step S405 of judging whether the many-core processor has remaining cores; if remaining cores exist, acquiring at least one third core of the many-core processor whose resource consumption rate is greater than a second preset threshold; and converting at least part of the sub-network layers mapped to the third core to the remaining cores, while the other part remains mapped to the third core. Optionally, one half of the sub-network layers mapped to the third core may remain mapped to it while the other half are converted to the remaining cores.
That is, after the overall re-fusion the mapping is split again as a whole: the most heavily loaded core is repeatedly selected and split, preferably using a one-into-two splitting strategy, until all cores are utilized.
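A minimal illustrative sketch of this overall re-splitting (not part of the disclosure) follows, under the simplifying assumption that a core's load can be halved when it is split in two:

```python
# Sketch of overall re-splitting (step S405): while spare cores remain, the
# most heavily loaded core is split in two (the one-into-two strategy).

import heapq

def resplit(core_loads, total_cores):
    heap = [-load for load in core_loads]   # max-heap via negated loads
    heapq.heapify(heap)
    while len(heap) < total_cores:
        heaviest = -heapq.heappop(heap)
        heapq.heappush(heap, -heaviest / 2)  # assumed: load halves when split
        heapq.heappush(heap, -heaviest / 2)
    return sorted(-l for l in heap)

print(resplit([100, 60, 40], total_cores=5))  # [30.0, 30.0, 40, 50.0, 50.0]
```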
The following describes the scheme of the above embodiment, taking VGG19 as an example, and the input feature map size is 224 × 224 × 3. The network structure of the convolutional neural network VGG19 is shown in table 1.
TABLE 1 Network structure of the convolutional neural network VGG19 (presented as an image in the original publication and not reproduced here)
A convolutional neural network (CNN) is composed of an input layer, convolutional layers, activation functions, pooling layers and fully connected layers, i.e. Input-Conv-ReLU-Pool-Fc, followed by prob (the classifier). For convenience of calculation, only the convolutional and ReLU layers are considered here, and each adjacent pair of convolutional and ReLU layers is treated as one layer, denoted layer_i.
Table 2 evaluates the amount of computation and the amount of storage of each layer. The amount of computation is expressed in MACs, i.e. the number of multiply-accumulate operations; for the ReLU layer, each activation function operation is assumed to cost one MAC. The amount of storage is expressed as the total number of feature-map and weight values. The unit of MAC is M and the unit of storage is K, where 1 M denotes 1,000,000 and 1 K denotes 1,000.
TABLE 2 calculation (MAC) and storage for layers in VGG19
layer0 layer1 layer2 layer3 layer4 layer5 layer6 layer7
MAC(M) 89.9 1852.9 926.4 1851.3 925.6 1850.5 1850.5 1850.5
Storage (K) 151.1 3212.0 804.0 1606.8 403.7 805.1 805.1 805.1
layer8 layer9 layer10 layer11 layer12 layer13 layer14 layer15
MAC(M) 925.2 1850.1 1850.1 1850.1 462.5 462.5 462.5 462.5
Storage (K) 205.3 406.0 406.0 406.0 105.0 105.0 105.0 105.0
Assume that the memory size of each core is 4 M (4000 K). The simple layer-per-core mapping described above uses 16 cores, and the computational utilization of each core is calculated using the following formula:
$$\mathrm{compute\_rate}_i = \frac{\mathrm{MAC}_i}{\max_j \mathrm{MAC}_j}$$
where compute_rate_i denotes the computational utilization of Core_i, i indexes Core_i (core i), MAC_i denotes the amount of computation of Core_i, and MAC_j denotes the amount of computation of Core_j.
The storage utilization of each core is calculated using the following formula.
$$\mathrm{memory\_rate}_i = \frac{\mathrm{Mem}_i}{\mathrm{Mem}_{\mathrm{core}}}$$
where memory_rate_i denotes the storage utilization of Core_i, Mem_i denotes the amount of storage occupied by Core_i, and Mem_core denotes the memory capacity of a single core (4 M, i.e. 4000 K, as assumed above).
The calculation and storage utilization of each core is obtained as shown in table 3.
TABLE 3 computation and storage utilization per core after VGG19 simple mapping
core0 core1 core2 core3 core4 core5 core6 core7
Calculation utilization 4.85% 100% 50% 99.91% 49.96% 99.87% 99.87% 99.87%
Storage utilization 3.78% 80.3% 20.1% 40.17% 10.09% 20.13% 20.13% 20.13%
core8 core9 core10 core11 core12 core13 core14 core15
Calculation utilization 49.94% 99.85% 99.85% 99.85% 24.96% 24.96% 24.96% 24.96%
Storage utilization 5.13% 10.15% 10.15% 10.15% 2.64% 2.64% 2.64% 2.64%
The average calculated utilization was 65.85%, and the average storage utilization was 16.31%.
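The per-core utilizations of Table 3 and the averages above follow directly from the figures of Table 2; the short sketch below (an editorial illustration, not part of the original disclosure) reproduces them under the stated assumptions of one layer per core and 4000 K of memory per core:

```python
# Sketch: per-core utilization of the simple one-layer-per-core mapping,
# computed from the Table 2 figures (MAC in millions, storage in K).
mac = [89.9, 1852.9, 926.4, 1851.3, 925.6, 1850.5, 1850.5, 1850.5,
       925.2, 1850.1, 1850.1, 1850.1, 462.5, 462.5, 462.5, 462.5]
mem = [151.1, 3212.0, 804.0, 1606.8, 403.7, 805.1, 805.1, 805.1,
       205.3, 406.0, 406.0, 406.0, 105.0, 105.0, 105.0, 105.0]
CORE_MEM = 4000.0  # assumed per-core memory in K (4 M as stated above)

compute_rate = [m / max(mac) for m in mac]      # one layer per core
memory_rate = [m / CORE_MEM for m in mem]

print(f"core0: compute {compute_rate[0]:.2%}, storage {memory_rate[0]:.2%}")
# core0: compute 4.85%, storage 3.78%  (cf. Table 3)
print(f"average: compute {sum(compute_rate)/16:.2%}, storage {sum(memory_rate)/16:.2%}")
# average: compute 65.85%, storage 16.31%
```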
Based on the method provided by the embodiment of the invention, the mapping of the VGG19 network is optimized, and the network grouping can be as follows:
Group0={layer0,layer1}
Group1={layer2,layer3}
Group2={layer4,layer5,layer6,layer7}
Group3={layer8,layer9,layer10,layer11}
Group4={layer12,layer13,layer14,layer15}
The split number of Group0 is 1, that of Group1 is 2, that of Group2 is 3, that of Group3 is 3, and that of Group4 is 1. Intra-group fusion then yields 10 cores. Inspection of these 10 cores shows that no small core can be fused further, which gives the final optimization scheme. The computation and storage utilization of each core under this scheme is shown in Table 4.
TABLE 4 computational utilization and storage utilization per core after VGG19 has been mapped by the patented scheme
core0 core1 core2 core3 core4 core5 core6 core7
Calculation utilization 89.99% 64.33% 64.33% 100% 100% 100% 99.98% 99.98%
Storage utilization 84.07% 30.13% 30.13% 23.49% 23.49% 23.49% 11.86% 11.86%
core8 core9
Calculation utilization 99.98% 85.69%
Storage utilization 11.86% 10.50%
The average computational utilization is 90.43% and the average storage utilization is 26.09%. Compared with the traditional scheme, the scheme provided by the embodiment of the invention reduces the number of cores used and greatly improves resource utilization. The optimization in this example is aimed primarily at improving computational utilization, so the storage utilization may remain at a relatively low level. Owing to differences between splitting schemes, the split algorithm may carry a slight storage redundancy; the calculation of this redundancy is omitted here.
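For completeness, the figures of Table 4 can likewise be reproduced from Table 2 under the assumption, consistent with the description above, that each group's total computation and storage are divided evenly among its split parts; this sketch is illustrative only:

```python
# Sketch: per-core load of the optimized VGG19 mapping, derived from the
# grouping and split numbers above (mac/mem are the Table 2 figures).
mac = [89.9, 1852.9, 926.4, 1851.3, 925.6, 1850.5, 1850.5, 1850.5,
       925.2, 1850.1, 1850.1, 1850.1, 462.5, 462.5, 462.5, 462.5]
mem = [151.1, 3212.0, 804.0, 1606.8, 403.7, 805.1, 805.1, 805.1,
       205.3, 406.0, 406.0, 406.0, 105.0, 105.0, 105.0, 105.0]
groups = [[0, 1], [2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
splits = [1, 2, 3, 3, 1]

core_mac, core_mem = [], []
for layers, n in zip(groups, splits):
    group_mac = sum(mac[l] for l in layers)
    group_mem = sum(mem[l] for l in layers)
    core_mac.extend([group_mac / n] * n)   # assumed even split within the group
    core_mem.extend([group_mem / n] * n)

compute_rate = [m / max(core_mac) for m in core_mac]
memory_rate = [m / 4000.0 for m in core_mem]
print(f"cores used: {len(core_mac)}")                                   # 10
print(f"core0: compute {compute_rate[0]:.2%}, storage {memory_rate[0]:.2%}")
# core0: compute ~90%, storage ~84% (cf. Table 4; minor rounding differences possible)
print(f"average: compute {sum(compute_rate)/10:.2%}, storage {sum(memory_rate)/10:.2%}")
# average: compute 90.43%, storage 26.09%
```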
Based on the same inventive concept, an embodiment of the invention also provides a computing device comprising a many-core processor, wherein the many-core processor is configured to execute the algorithm related to the neural network mapped by the above mapping method for a neural network based on a many-core processor.
Optionally, the computing device further comprises: a storage device for storing a computer program which, when run in the computing device, is loaded and executed by the processor.
The embodiment of the invention provides a neural network mapping method based on a many-core processor with wider applicability and higher efficiency. Through the steps of network grouping, intra-group splitting, intra-group fusion and overall re-fusion, each network layer of the neural network to be mapped is reasonably mapped to the cores of the many-core processor, and the computation, storage and routing resources of each core are allocated so that the neural network runs more efficiently, while the load of each core is better balanced than in conventional schemes. In theory the method can be applied to today's mainstream neural network algorithms, including fully connected neural networks and convolutional neural networks, and it performs particularly well on convolutional neural networks. On the processor side, the scheme provided by the embodiment of the invention is especially suitable for many-core accelerator architectures designed specifically for neural networks. Because the scheme is a static scheduling scheme, the overhead required for scheduling at run time is greatly reduced, and a neural network accelerator can devote its main computing power to the computation of the neural network algorithm itself.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
Thus, it should be appreciated by those skilled in the art that while a number of exemplary embodiments of the invention have been illustrated and described in detail herein, many other variations or modifications consistent with the principles of the invention may be directly determined or derived from the disclosure of the present invention without departing from the spirit and scope of the invention. Accordingly, the scope of the invention should be understood and interpreted to cover all such other variations or modifications.

Claims (10)

1. A mapping method for a neural network based on a many-core processor, comprising:
acquiring a neural network to be mapped, and sequentially dividing all network layers of the neural network to be mapped into a plurality of network layer groups;
splitting each of the network layers, so that each network layer comprises a plurality of sub-network layers;
fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and respectively mapping the sub-network layer groups to a plurality of cores of a preset many-core processor.
2. The method of claim 1, wherein after the sub-network layers belonging to the same network layer group are fused according to the preset rule to obtain the plurality of sub-network layer groups, and before the plurality of sub-network layer groups are respectively mapped to the plurality of cores of the preset many-core processor, the method further comprises:
judging whether the number of network layers included by the sub-network layer group is greater than a first preset threshold value or not;
if the number of network layers included in at least one sub-network layer group is greater than the first preset threshold value, the network layer group to which the sub-network layer group belongs, the number of network layers of which is greater than the first preset threshold value, is subjected to intra-group splitting again.
3. The method of claim 1 or 2, wherein after the plurality of sub-network layer groups are respectively mapped to the plurality of cores of the preset many-core processor, the method further comprises:
screening out, from the plurality of cores, at least one first core whose resource utilization is lower than a preset index;
and fusing the sub-network layer groups corresponding to the first core again to obtain at least one first sub-network layer group, and remapping the first sub-network layer group to a second core of the many-core processor.
4. The method of any one of claims 1-3, wherein after the plurality of sub-network layer groups are respectively mapped to the plurality of cores of the preset many-core processor, the method further comprises:
judging whether the many-core processor has residual cores;
if the remaining cores exist, acquiring at least one third core of which the resource consumption rate is greater than a second preset threshold value in the many-core processor;
translating at least a portion of the sub-network layers mapped to the third core to the remaining cores.
5. The method of claim 4, wherein the translating at least a portion of the sub-network layers mapped to the third core to the remaining cores comprises:
translating one half of the sub-network layers mapped to the third core to the remaining cores.
6. The method of any one of claims 1-5, wherein all network layers in the same network layer group are connected sequentially in the neural network.
7. The method of any one of claims 1-5, wherein the number of sub-network layers in each network layer group is equal.
8. The method of claim 7, wherein for any network layer group, merging sub-network layers with the same index in the network layer group to obtain the sub-network layer group corresponding to the index.
9. A computing device comprising a many-core processor, wherein,
the many-core processor being configured to execute the algorithm related to the neural network mapped by the mapping method for a neural network based on a many-core processor of any one of claims 1 to 8.
10. The computing device of claim 9, wherein the computing device further comprises:
a storage device for storing a computer program which is loaded and executed by a processor when running in the computing device.
CN201910203167.0A 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor Active CN111723900B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910203167.0A CN111723900B (en) 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor
PCT/CN2020/077973 WO2020187041A1 (en) 2019-03-18 2020-03-05 Neural network mapping method employing many-core processor and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910203167.0A CN111723900B (en) 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor

Publications (2)

Publication Number Publication Date
CN111723900A true CN111723900A (en) 2020-09-29
CN111723900B CN111723900B (en) 2023-10-20

Family

ID=72518948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910203167.0A Active CN111723900B (en) 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor

Country Status (2)

Country Link
CN (1) CN111723900B (en)
WO (1) WO2020187041A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112835718A (en) * 2021-02-10 2021-05-25 北京灵汐科技有限公司 Method and device for processing task, many-core system and computer readable medium
CN114418063A (en) * 2021-12-27 2022-04-29 北京百度网讯科技有限公司 Method and device for distributing network layer in neural network model
WO2022171002A1 (en) * 2021-02-10 2022-08-18 北京灵汐科技有限公司 Task processing method and apparatus, many-core system, and computer-readable medium
CN115098262A (en) * 2022-06-27 2022-09-23 清华大学 Multi-neural-network task processing method and device

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884123B (en) * 2021-02-23 2024-03-01 杭州海康威视数字技术股份有限公司 Neural network optimization method and device, electronic equipment and readable storage medium
CN113485836B (en) * 2021-07-21 2024-03-19 瀚博半导体(上海)有限公司 Tensor processing method and tensor processing system based on tensor segmentation
CN116167463B (en) * 2023-04-26 2023-07-07 之江实验室 Distributed model training container scheduling method and device for intelligent computing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095110A (en) * 2014-02-18 2015-11-25 新加坡国立大学 Fusible and reconfigurable cache architecture
CN106909971A (en) * 2017-02-10 2017-06-30 华南理工大学 A kind of BP neural network parallel method towards multinuclear computing environment
WO2018154494A1 (en) * 2017-02-23 2018-08-30 Cerebras Systems Inc. Accelerated deep learning
WO2019001418A1 (en) * 2017-06-26 2019-01-03 上海寒武纪信息科技有限公司 Data sharing system and data sharing method therefor
CN109409513A (en) * 2018-10-10 2019-03-01 广州市百果园信息技术有限公司 A kind of task processing method neural network based and relevant device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10868893B2 (en) * 2017-03-31 2020-12-15 Xilinx, Inc. Network interface device
US10108850B1 (en) * 2017-04-24 2018-10-23 Intel Corporation Recognition, reidentification and security enhancements using autonomous machines
CN110738316B (en) * 2018-07-20 2024-05-14 北京三星通信技术研究有限公司 Operation method and device based on neural network and electronic equipment
CN110515732B (en) * 2019-08-23 2021-06-18 中国人民解放军国防科技大学 Task allocation method based on deep learning inference of resource-constrained robot

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095110A (en) * 2014-02-18 2015-11-25 新加坡国立大学 Fusible and reconfigurable cache architecture
CN106909971A (en) * 2017-02-10 2017-06-30 华南理工大学 A kind of BP neural network parallel method towards multinuclear computing environment
WO2018154494A1 (en) * 2017-02-23 2018-08-30 Cerebras Systems Inc. Accelerated deep learning
WO2019001418A1 (en) * 2017-06-26 2019-01-03 上海寒武纪信息科技有限公司 Data sharing system and data sharing method therefor
CN109409513A (en) * 2018-10-10 2019-03-01 广州市百果园信息技术有限公司 A kind of task processing method neural network based and relevant device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112835718A (en) * 2021-02-10 2021-05-25 北京灵汐科技有限公司 Method and device for processing task, many-core system and computer readable medium
WO2022171002A1 (en) * 2021-02-10 2022-08-18 北京灵汐科技有限公司 Task processing method and apparatus, many-core system, and computer-readable medium
CN114418063A (en) * 2021-12-27 2022-04-29 北京百度网讯科技有限公司 Method and device for distributing network layer in neural network model
CN114418063B (en) * 2021-12-27 2023-01-06 北京百度网讯科技有限公司 Method and device for distributing network layer in neural network model
CN115098262A (en) * 2022-06-27 2022-09-23 清华大学 Multi-neural-network task processing method and device
CN115098262B (en) * 2022-06-27 2024-04-23 清华大学 Multi-neural network task processing method and device

Also Published As

Publication number Publication date
CN111723900B (en) 2023-10-20
WO2020187041A1 (en) 2020-09-24

Similar Documents

Publication Publication Date Title
CN111723900A (en) Mapping method of neural network based on many-core processor and computing device
US20220147403A1 (en) Reducing overlay network overhead across container hosts
Xiao et al. NFVdeep: Adaptive online service function chain deployment with deep reinforcement learning
WO2017156968A1 (en) Neural network computing method, system and device therefor
CN111199275B (en) System on chip for neural network
CN110308984B (en) Cross-cluster computing system for processing geographically distributed data
WO2017173754A1 (en) Method and device for on-chip repetitive addressing
CN117632361A (en) Resource scheduling method and device and related equipment
CN108923979B (en) Software defined network virtual network mapping method
CN108170861B (en) Distributed database system collaborative optimization method based on dynamic programming
CN110380906B (en) Large-scale multidimensional fusion virtual network mapping method
CN107360031A (en) It is a kind of based on optimization overhead gains than mapping method of virtual network
Huang et al. Fuzzy clustering with feature weight preferences for load balancing in cloud
CN111159859A (en) Deployment method and system of cloud container cluster
CN103955397B (en) A kind of scheduling virtual machine many policy selection method based on micro-architecture perception
CN107493574B (en) Wireless controller equipment, parallel authentication processing method, system and networking device
JP2018148455A (en) Information processor and method
Yang et al. Yun: a high-performance container management service based on openstack
Trejo-Sánchez et al. A multi-agent architecture for scheduling of high performance services in a GPU cluster
Xing et al. Allocating DNN layers computation between front-end devices and the cloud server for video big data processing
CN115834466B (en) Method, device, equipment, system and storage medium for analyzing path of computing power network
Cheng et al. Towards a deep-pipelined architecture for accelerating deep GCN on a multi-FPGA platform
Chen et al. Task scheduling based on fruit fly optimization algorithm in mobile cloud computing
WO2024022046A1 (en) Deep learning system and method
CN107317767A (en) Network fast flow optimization method based on anti-ant group algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Zhang Weihao

Inventor after: Li Han

Inventor before: Zhang Weihao

Inventor before: Li Han

Inventor before: Pei Jing