CN111723900B - Neural network mapping method and computing device based on many-core processor - Google Patents

Neural network mapping method and computing device based on many-core processor

Info

Publication number
CN111723900B
Authority
CN
China
Prior art keywords
network
sub-network layer
many-core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910203167.0A
Other languages
Chinese (zh)
Other versions
CN111723900A (en)
Inventor
张伟豪
李涵
裴京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lynxi Technology Co Ltd
Original Assignee
Beijing Lynxi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lynxi Technology Co Ltd filed Critical Beijing Lynxi Technology Co Ltd
Priority to CN201910203167.0A priority Critical patent/CN111723900B/en
Priority to PCT/CN2020/077973 priority patent/WO2020187041A1/en
Publication of CN111723900A publication Critical patent/CN111723900A/en
Application granted granted Critical
Publication of CN111723900B publication Critical patent/CN111723900B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a neural network mapping method and a computing device based on a many-core processor. The method comprises: acquiring a neural network to be mapped, and sequentially grouping all network layers of the neural network to be mapped into a plurality of network layer groups; splitting each of the network layers so that each network layer comprises a plurality of sub-network layers; and fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor respectively. The method allocates the computation, storage, routing and other resources of each core of the many-core processor, so that the neural network runs more efficiently, the loads of the cores are more balanced than in conventional schemes, and computation and storage resource efficiency is effectively improved.

Description

Neural network mapping method and computing device based on many-core processor
Technical Field
The present invention relates to the field of processor technologies, and in particular, to a mapping method and a computing device for a neural network based on a many-core processor.
Background
With the continuing application of artificial intelligence technology in various fields, a variety of hardware platforms for running artificial intelligence algorithms have emerged, and many-core processors are one of them. Neural network algorithms are the mainstream artificial intelligence algorithms and are characterized by a high computation load and a high degree of parallelism. These characteristics make neural networks well suited to running on many-core architectures, which is also why many-core processor architectures are currently an important way to build neural network accelerators. Given a neural network algorithm and a suitable many-core processor, how to map the algorithm onto the processor, and how to allocate the computation, storage, routing and other resources of each core of the many-core processor, is a problem to be solved.
Disclosure of Invention
In view of the foregoing, the present invention provides a mapping method and computing device for a many-core processor-based neural network that overcomes or at least partially solves the foregoing problems.
According to one aspect of the present invention, there is provided a mapping method of a neural network based on a many-core processor, including:
acquiring a neural network to be mapped, and sequentially grouping all network layers of the neural network to be mapped into a plurality of network layer groups;
splitting each of the network layers, so that each network layer comprises a plurality of sub-network layers;
and fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor respectively. By splitting the network within each group, then fusing within the group and mapping the result onto cores, the routing among the cores of the many-core processor can be greatly reduced, which further improves the processing efficiency of the processor.
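As an aid to reading, the following minimal sketch walks through this group-split-fuse-map flow on the six-layer example used later with Figs. 5-7. The layer names, group sizes, split counts and the rule of assigning each fused sub-network layer group to one core are illustrative assumptions, not the claimed implementation.

```python
# Minimal sketch of the group -> split -> fuse -> map flow described above.
# All names and sizes are illustrative assumptions.

def group_layers(layers, group_sizes):
    """Partition consecutive layers into network layer groups."""
    groups, start = [], 0
    for size in group_sizes:
        groups.append(layers[start:start + size])
        start += size
    return groups

def split_group(group, parts):
    """Split every layer in a group into `parts` index-tagged sub-network layers."""
    return [[f"{layer}[{i}]" for layer in group] for i in range(parts)]

def fuse_and_map(groups, split_counts):
    """Fuse same-index sub-layers of each group and map each fusion to one core."""
    cores = []
    for group, parts in zip(groups, split_counts):
        cores.extend(split_group(group, parts))  # one fused sub-layer group per core
    return cores

layers = [f"Layer{i}" for i in range(6)]
groups = group_layers(layers, group_sizes=[2, 3, 1])   # Group0..Group2
cores = fuse_and_map(groups, split_counts=[2, 3, 3])   # 2 + 3 + 3 = 8 cores
for core_id, contents in enumerate(cores):
    print(f"Core{core_id}: {contents}")
```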
Optionally, after fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, before mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor, the method further includes:
judging whether the network layer number included in the sub-network layer group is larger than a first preset threshold value or not;
and if the number of network layers included in at least one sub-network layer group is greater than the first preset threshold, re-splitting the network layer group to which that sub-network layer group belongs. During intra-group fusion it may turn out that too many layers were assigned to one group during network grouping, so that the layers in the group exceed the upper limit of resources that a single core can bear; in this case the number of network layers in the group is reduced by regrouping, so that the load of each core is balanced.
Optionally, after mapping the plurality of sub-network layer groups to the plurality of cores of the preset many-core processor, the method further includes:
screening out, from the plurality of cores, a first core whose resource utilization rate is lower than a preset index, wherein there is at least one first core;
and fusing the sub-network layer groups corresponding to the first cores again to obtain at least one first sub-network layer group, and remapping the first sub-network layer group to a second core of the many-core processor. Among the cores obtained through the above steps there may be cores with a low resource utilization rate; re-fusing these cores effectively reduces the overhead required for scheduling while the processor is running.
Optionally, after mapping the plurality of sub-network layer groups to the plurality of cores of the preset many-core processor, the method further includes:
judging whether there are remaining cores in the many-core processor;
if there are remaining cores, obtaining at least one third core in the many-core processor whose resource consumption rate is greater than a second preset threshold;
and remapping at least part of the sub-network layers mapped to the third core onto the remaining cores. On this basis, if the many-core processor still has remaining cores and no other tasks need to be allocated, an overall re-splitting strategy can be added, and the computing tasks of the cores with larger loads are redistributed to the remaining cores, which improves the utilization rate of each core in the many-core processor and thus the running efficiency of the many-core processor.
Optionally, the remapping of at least part of the sub-network layers mapped to the third core onto the remaining cores includes:
remapping one half of the sub-network layers mapped to the third core onto the remaining cores.
Optionally, all network layers in the same network layer group are sequentially connected in the neural network.
Optionally, when the plurality of network layer groups are split within each group, all network layers in the same network layer group are split into the same number of sub-network layers.
Optionally, for any network layer group, fusing the sub-network layers with the same index in the network layer group to obtain the sub-network layer group corresponding to the index.
According to yet another aspect of the present invention, there is also provided a computing device comprising a many-core processor, characterized in that,
the many-core processor is configured to execute the related algorithm of the neural network mapped by the mapping method of the neural network based on the many-core processor.
Optionally, the computing device further comprises:
a storage device for storing a computer program which, when run in the computing device, is loaded and executed by the processor.
The invention provides a more balanced neural network mapping method based on a many-core processor. For a neural network to be mapped, the steps of network grouping, intra-group splitting, intra-group fusion and overall re-fusion map each network layer of the neural network reasonably onto the cores of the many-core processor and allocate the computation, storage, routing and other resources of each core, so that the neural network runs more efficiently, the loads of the cores are more balanced than in conventional schemes, and computation and storage resource efficiency is effectively improved.
The foregoing is merely an overview of the technical solution of the present invention. In order that the technical means of the present invention may be more clearly understood and implemented in accordance with the content of the description, and in order to make the above and other objects, features and advantages of the present invention more readily apparent, specific embodiments of the invention are set forth below.
The above, as well as additional objectives, advantages, and features of the present invention will become apparent to those skilled in the art from the following detailed description of a specific embodiment of the present invention when read in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 illustrates a neural network mapping schematic based on a many-core processor, according to one embodiment of the invention;
FIG. 2 illustrates a neural network mapping schematic based on a many-core processor, according to another embodiment of the invention;
FIG. 3 illustrates a neural network mapping schematic based on a many-core processor, according to another embodiment of the invention;
FIG. 4 is a flowchart of a neural network mapping method based on a many-core processor according to a preferred embodiment of the present invention;
FIG. 5 shows a network grouping schematic of a many-core processor-based neural network in accordance with a preferred embodiment of the present invention;
FIG. 6 illustrates an intra-group split schematic of a many-core processor-based neural network in accordance with a preferred embodiment of the present invention;
FIG. 7 illustrates a schematic diagram of intra-group fusion of a many-core processor-based neural network in accordance with a preferred embodiment of the present invention;
FIG. 8 illustrates a schematic routing diagram between cores before and after intra-group fusion of a many-core processor-based neural network in accordance with a preferred embodiment of the present invention;
FIG. 9 shows a schematic diagram of the overall re-fusion of a many-core processor-based neural network in accordance with a preferred embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
An algorithm responsible for distributing a parallel algorithm to run on a many-core processor is generally referred to as a scheduling algorithm, and scheduling is classified into static scheduling and dynamic scheduling. Static scheduling means that the scheduling strategy is formulated before the parallel algorithm is executed, and the run then follows the formulated strategy exactly. Dynamic scheduling is different: while the algorithm is running, it decides how to schedule the next step according to its own state and the state of the environment.
A "mapping" algorithm, in the narrow sense, is a static scheduling algorithm that emphasizes mapping each portion of the parallel algorithm onto a core of the many-core processor, with each core running only the portion of the algorithm mapped to it.
Mapping algorithms for neural networks on many-core processors have been studied relatively little, but mapping algorithms for general parallel algorithms on many-core processors have been studied extensively, and a number of general methods exist. For example, the simplest general mapping algorithm maps each layer of the neural network to a core in turn until all layers have been assigned. Referring to fig. 1, Layer0-Layer5 represent network layers 1-6 and Core0-Core5 represent cores 1-6, respectively; when mapping the layers of the neural network onto the cores of the many-core processor, Layer0-Layer5 may be mapped to Core0-Core5, respectively.
In general, the computation and storage requirements of the layers of a neural network can be extremely unbalanced, and conventional parallel-algorithm mapping techniques are rarely optimized for these characteristics of neural networks. Mapping with such a simple general-purpose strategy may therefore cause a severe load imbalance among the cores, waste a great deal of computing and storage resources, or cause routing congestion.
To address this, a network layer with a higher load can be split, that is, one layer is mapped to a plurality of cores and computed by those cores together, which helps balance the load of the whole architecture; the technique adopted in this process may be called a splitting technique. As shown in fig. 2, Layer5 is split and the two resulting sub-network layers are mapped to Core5 and Core6, respectively.
In addition, layers with smaller loads can be fused so that one core computes multiple layers, which improves the resource utilization of the cores; the technique adopted in this process may be called a fusion technique. As shown in fig. 3, Layer0 and Layer1 are fused and mapped together to Core0.
Based on the splitting and fusing technology, the embodiment of the invention provides a neural network mapping method based on a many-core processor, which is more efficient and balanced.
As shown in fig. 4, the method provided in this embodiment may include:
step S401, obtaining a neural network to be mapped, and combining all network layers of the neural network to be mapped in sequence, and dividing the network layers into a plurality of network layer groups; all network layers in the same network layer group are sequentially connected in the neural network.
That is, after the neural network to be mapped is acquired, the first step is network grouping. All network layers of the neural network to be mapped are divided in sequence into different network layer groups, and the network layers in the same network layer group typically form a contiguous segment of the neural network in terms of connection relations. Fig. 5 shows a network grouping schematic of this embodiment. Referring to fig. 5, the neural network may include Layer0-Layer5; during network grouping, Layer0 and Layer1 may be divided into Group0, Layer2-Layer4 into Group1, and Layer5 alone into Group2. The network grouping shown in fig. 5 is only one of many possible groupings; in practical applications, all network layers of the neural network may be grouped according to different requirements, and the invention is not limited in this respect.
In step S402, each of all the network layers is split, and each network layer includes a plurality of sub-network layers.
The second step is intra-group splitting. Because the computation load of the network layers in the neural network is large, one or more layers of the neural network are split by means of intra-group splitting. Specifically, the splitting technique can be used to split each network layer group; during intra-group splitting, for example, every network layer in the same network layer group can be split into the same number of sub-network layers, i.e. the layers in the same group preferably have the same split number. Taking the example shown in fig. 5, assuming that Group0 is split into 2 parts, Layer0 and Layer1 will both be split into 2 parts. Similarly, if Group1 is split into 3 parts, then Layer2, Layer3 and Layer4 will all be split into 3 parts; if Group2 is split into 3 parts, Layer5 will be split into 3 parts. Splitting should be as uniform as possible, i.e. the partial algorithms obtained by splitting should have computation, memory and/or routing requirements that are as close as possible. Optionally, an index may be used to denote a part of the split algorithm of a network layer; for example, splitting Layer0 into 2 parts yields Layer0[0] and Layer0[1], where [0] and [1] are the indices of the sub-network layers. This process is illustrated in fig. 6.
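As one way to picture the "as uniform as possible" requirement, the sketch below divides a layer's output rows into n near-equal parts and tags each part with its index. The row-wise split dimension and the Layer[i] naming are assumptions made only for illustration.

```python
# Hypothetical sketch: split one layer's output rows as evenly as possible
# into `parts` sub-network layers, tagging each part with its index.
def split_layer(layer_name, num_rows, parts):
    base, extra = divmod(num_rows, parts)          # distribute the remainder
    sub_layers, start = [], 0
    for i in range(parts):
        rows = base + (1 if i < extra else 0)      # sizes differ by at most 1
        sub_layers.append({"name": f"{layer_name}[{i}]",
                           "rows": range(start, start + rows)})
        start += rows
    return sub_layers

# Example: splitting Layer0 (224 output rows) into 3 parts -> 75, 75, 74 rows.
for part in split_layer("Layer0", 224, 3):
    print(part["name"], len(part["rows"]))
```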
Step S403, fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor respectively.
Intra-group fusion is then performed using the fusion technique, and the plurality of sub-network layers in each network layer group can be fused to obtain a plurality of sub-network layer groups. As noted above, an index may be used to denote a part of the split algorithm of a network layer. Optionally, fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups further includes: for any network layer group, fusing the sub-network layers with the same index in that network layer group to obtain the sub-network layer group corresponding to that index. That is, within each group, sub-network layers having the same index are fused onto one core using the fusion technique. For example, referring to fig. 7, Layer0[0] and Layer1[0] are fused into Core0 and Layer0[1] and Layer1[1] are fused into Core1; Layer2[0], Layer3[0] and Layer4[0] are fused into Core2, and so forth.
During intra-group fusion, it may happen that too many layers were assigned to one group during network grouping, so that no matter how the layers are split, the fused tasks of those layers exceed the upper limit of resources that a single core can bear. In this case, the method returns to the network grouping step and regroups the layers.
Optionally, after the sub-network layers in the same network layer group are fused in step S403 to obtain a plurality of sub-network layer groups, it may also be determined whether the number of network layers included in a sub-network layer group is greater than a first preset threshold; if the number of network layers included in at least one sub-network layer group is greater than the first preset threshold, the network layer group to which that sub-network layer group belongs is split (regrouped) again, so that the number of network layers in each group is reduced. The first preset threshold can be set according to actual needs, and the invention is not limited in this respect. A small sketch of this check is given below.
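A minimal sketch of this regrouping check, under the simplifying assumption that the per-core limit can be expressed as a maximum number of fused layers (the first preset threshold); a real check would also account for memory and routing resources.

```python
# Hypothetical check: if any fused sub-network layer group contains more layers
# than one core can bear, split the offending network layer group in two and
# signal that grouping has to be redone.
def regroup_if_needed(groups, max_layers_per_core):
    new_groups, changed = [], False
    for group in groups:
        if len(group) > max_layers_per_core:          # fused group too large
            mid = len(group) // 2
            new_groups.extend([group[:mid], group[mid:]])
            changed = True
        else:
            new_groups.append(group)
    return new_groups, changed

groups = [["Layer0", "Layer1", "Layer2", "Layer3"], ["Layer4", "Layer5"]]
groups, changed = regroup_if_needed(groups, max_layers_per_core=3)
print(groups, changed)   # the first group is split into two smaller groups
```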
Intra-group fusion can greatly reduce routing between cores. Taking Group0 intra-Group fusion as an example. As shown in fig. 8, the total route of Layer0 and Layer1 before fusion can be represented by two thicker arrows and two thinner arrows. Wherein the amount of routing represented by the thick arrows is generally much greater than the amount of routing represented by the thin arrows (except for fully connected neural networks) due to the data locality of the neural network operations. After intra-group fusion, the routes represented by the thick arrows become intra-core data transfer, and the real inter-core routes are only the routes represented by the two thin arrows, so that the total route is greatly reduced.
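To make the routing argument of fig. 8 concrete, the toy count below compares inter-core traffic for Group0 before and after intra-group fusion, assuming a row-wise split of a 224-row feature map and a 3x3 convolution that only needs one halo row from the neighbouring part; these sizes are illustrative assumptions.

```python
# Toy estimate of inter-core routing for Group0 (Layer0 -> Layer1), split into
# 2 parts by feature-map rows. With a 3x3 convolution, each part needs one
# extra "halo" row from its neighbour in addition to its own rows.
rows, halo = 224, 1
own, other = rows // 2, halo

# Before fusion: Layer0[0], Layer0[1], Layer1[0], Layer1[1] sit on 4 different
# cores, so both the bulk transfer (thick arrows) and the halo transfer (thin
# arrows) cross core boundaries.
before = 2 * (own + other)

# After intra-group fusion: Layer0[i] and Layer1[i] share a core, so the bulk
# transfer stays inside the core and only the halo rows cross cores.
after = 2 * other

print(f"inter-core rows before fusion: {before}, after fusion: {after}")
# e.g. 226 rows of traffic shrink to 2 rows (the two thin arrows in fig. 8).
```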
The embodiment of the invention further includes step S404: screening out, from the plurality of cores, a first core whose resource utilization rate is lower than a preset index; fusing the sub-network layer groups corresponding to the first cores again to obtain at least one first sub-network layer group, and remapping the first sub-network layer group to a second core of the many-core processor, where there is at least one first core.
After the above steps, a plurality of mapped cores are obtained, and some "small cores" may exist among them, i.e. cores that occupy few resources or whose resource utilization rate is lower than a preset index, where the preset index can be set according to the particular many-core processor. If recombining some of these cores does not touch the bottleneck of the original mapping, i.e. the neural network under the original mapping scheme will not run more slowly and will not exceed the memory and routing limits, overall re-fusion can be performed to recombine the cores with lower resource utilization. Fig. 9 shows an example of this process.
Referring to FIG. 9, Core0 is responsible for Layer0[0] and Layer1[0], and Core5 is responsible for Layer5[0]. Because the utilization of Core0 and Core5 is low, they can be re-fused, and the fused Core0 is then responsible for Layer0[0], Layer1[0] and Layer5[0]. Similarly, Core6 is responsible for Layer5[1] and Core7 for Layer5[2]; after fusing the two, Core5 can be responsible for both Layer5[1] and Layer5[2].
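A minimal sketch of this overall re-fusion, assuming each core's utilization is summarized by a single load value and that two low-utilization cores may be merged whenever their combined load still fits on one core; the threshold and capacity values are illustrative assumptions.

```python
# Hypothetical greedy re-fusion: repeatedly merge the two least-loaded cores
# whose utilization is below `threshold`, as long as the merged load fits.
def re_fuse_small_cores(cores, threshold=0.5, capacity=1.0):
    cores = sorted(cores, key=lambda c: c["load"])
    merged = True
    while merged and len(cores) >= 2:
        merged = False
        a, b = cores[0], cores[1]
        if a["load"] < threshold and b["load"] < threshold \
                and a["load"] + b["load"] <= capacity:
            fused = {"layers": a["layers"] + b["layers"],
                     "load": a["load"] + b["load"]}
            cores = sorted([fused] + cores[2:], key=lambda c: c["load"])
            merged = True
    return cores

cores = [{"layers": ["Layer0[0]", "Layer1[0]"], "load": 0.30},
         {"layers": ["Layer5[0]"], "load": 0.25},
         {"layers": ["Layer2[0]", "Layer3[0]", "Layer4[0]"], "load": 0.95}]
print(re_fuse_small_cores(cores))   # the two small cores are fused into one
```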
Based on the above algorithm, if the number of cores required by the final mapping scheme is smaller than the total number of cores of the many-core processor, and the remaining cores have no other tasks to be allocated, an overall re-splitting strategy can be added.
The embodiment of the invention further includes step S405: judging whether there are remaining cores in the many-core processor; if there are remaining cores, obtaining at least one third core in the many-core processor whose resource consumption rate is greater than a second preset threshold; and remapping at least part of the sub-network layers mapped to the third core onto the remaining cores, while the other part remains mapped to the third core. Optionally, one half of the sub-network layers that were mapped to the third core may remain on the third core, while the other half is remapped to the remaining cores.
That is to say, after overall re-fusion, overall re-splitting is performed: the core with the largest load is repeatedly selected and split, and a one-into-two (halving) strategy can preferentially be adopted in this step, until all cores are utilized. A sketch of this strategy follows.
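A minimal sketch of the halving strategy, assuming a core's load can simply be cut in two and half of it moved to an idle core; the load values are illustrative assumptions.

```python
# Hypothetical overall re-splitting: while idle cores remain, pick the most
# loaded core and move half of its load onto an idle core (one-into-two).
def resplit_to_idle_cores(loads, total_cores):
    loads = list(loads)
    while len(loads) < total_cores:
        i = max(range(len(loads)), key=lambda k: loads[k])  # heaviest core
        half = loads[i] / 2
        loads[i] = half          # half stays on the original core
        loads.append(half)       # the other half goes to an idle core
    return loads

print(resplit_to_idle_cores([0.9, 0.5, 0.4], total_cores=5))
# -> the 0.9 core is halved first, then the next heaviest, until 5 cores are used
```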
Next, the scheme of the above embodiment will be described by taking VGG19 as an example, and the input feature map size is 224×224×3. The network structure of convolutional neural network VGG19 is shown in table 1.
TABLE 1
A Convolutional Neural Network (CNN) consists of an input layer, convolutional layers, activation functions, pooling layers and fully connected layers, i.e. Input-Conv-ReLU-Pool-Fc, plus prob (the classifier). For convenience of calculation, only the convolutional layers and the ReLU layers are considered, and each adjacent convolutional layer and ReLU layer are regarded as one layer, denoted layer_i.
Table 2 evaluates the computation amount and storage amount of each layer. The computation amount is expressed in MACs, i.e. the number of multiply-accumulate operations; for a ReLU layer, each activation-function operation on a value is counted as one MAC. The storage amount is expressed as the total number of values in the weights plus the feature maps. The unit of MACs is M and the unit of storage is K, where 1M denotes 1,000,000 and 1K denotes 1,000.
TABLE 2 calculation (MAC) and memory for each layer in VGG19
layer 0 layer 1 layer 2 layer 3 layer 4 layer 5 layer 6 layer 7
MAC(M) 89.9 1852.9 926.4 1851.3 925.6 1850.5 1850.5 1850.5
Storage (K) 151.1 3212.0 804.0 1606.8 403.7 805.1 805.1 805.1
layer 8 layer 9 layer 10 layer 11 layer 12 layer 13 layer 14 layer 15
MAC(M) 925.2 1850.1 1850.1 1850.1 462.5 462.5 462.5 462.5
Storage (K) 205.3 406.0 406.0 406.0 105.0 105.0 105.0 105.0
Assume that the memory size of each core is 4M (4000K). The above mapping uses 16 cores, and the computational utilization of each core is calculated using the following formula:

$$\mathrm{compute\_rate}_i = \frac{\mathrm{MAC}_i}{\max_j \mathrm{MAC}_j}$$

where compute_rate_i denotes the computational utilization of Core i, i indexes Core i, MAC_i denotes the computation amount of Core i, and MAC_j denotes the computation amount of Core j.

The storage utilization of each core is calculated using the following formula:

$$\mathrm{memory\_rate}_i = \frac{\mathrm{Mem}_i}{\mathrm{Mem}_{\mathrm{core}}}$$

where memory_rate_i denotes the storage utilization of Core i, Mem_i denotes the storage amount of Core i, and Mem_core denotes the memory size of a single core (4000K).
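These two measures can be checked directly against Table 2 and Table 3. The sketch below recomputes the utilization of core 0 and core 1 under the simple one-layer-per-core mapping, together with the averages, using the per-layer values of Table 2 and the 4000K per-core memory stated above.

```python
# Recompute the utilization figures of the simple mapping (one layer per core)
# from the per-layer MAC and storage values in Table 2.
MAC = [89.9, 1852.9, 926.4, 1851.3, 925.6, 1850.5, 1850.5, 1850.5,
       925.2, 1850.1, 1850.1, 1850.1, 462.5, 462.5, 462.5, 462.5]   # in M
MEM = [151.1, 3212.0, 804.0, 1606.8, 403.7, 805.1, 805.1, 805.1,
       205.3, 406.0, 406.0, 406.0, 105.0, 105.0, 105.0, 105.0]       # in K
MEM_PER_CORE = 4000.0  # K

compute_rate = [m / max(MAC) for m in MAC]
memory_rate = [m / MEM_PER_CORE for m in MEM]

print(f"core0: compute {compute_rate[0]:.2%}, memory {memory_rate[0]:.2%}")
print(f"core1: compute {compute_rate[1]:.2%}, memory {memory_rate[1]:.2%}")
print(f"average compute utilization: {sum(compute_rate)/16:.2%}")   # ~65.85%
print(f"average memory utilization:  {sum(memory_rate)/16:.2%}")    # ~16.31%
```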
The calculation and storage utilization of each core is shown in Table 3.
TABLE 3 calculated utilization and storage utilization for each core after VGG19 simple mapping
core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7
Calculated utilization 4.85% 100% 50% 99.91% 49.96% 99.87% 99.87% 99.87%
Storage utilization 3.78% 80.3% 20.1% 40.17% 10.09% 20.13% 20.13% 20.13%
core 8 core 9 core 10 core 11 core 12 core 13 core 14 core 15
Calculated utilization 49.94% 99.85% 99.85% 99.85% 24.96% 24.96% 24.96% 24.96%
Storage utilization 5.13% 10.15% 10.15% 10.15% 2.64% 2.64% 2.64% 2.64%
The average calculated utilization was 65.85% and the average storage utilization was 16.31%.
The method provided by the embodiment of the invention optimizes the mapping of the VGG19 network, and the network grouping can be as follows:
Group0 = {layer0, layer1}
Group1 = {layer2, layer3}
Group2 = {layer4, layer5, layer6, layer7}
Group3 = {layer8, layer9, layer10, layer11}
Group4 = {layer12, layer13, layer14, layer15}
The split number of Group0 is 1, that of Group1 is 2, that of Group2 is 3, that of Group3 is 3, and that of Group4 is 1. Intra-group fusion then yields 10 cores. Examining the splitting results of these 10 cores, no small cores are found that need to be further fused, so the final optimized scheme is obtained. The computational and storage utilization of each core under this scheme is shown in Table 4.
TABLE 4 Calculated utilization and storage utilization of each core after VGG19 is mapped by the proposed scheme
core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7
Calculated utilization 89.99% 64.33% 64.33% 100% 100% 100% 99.98% 99.98%
Storage utilization 84.07% 30.13% 30.13% 23.49% 23.49% 23.49% 11.86% 11.86%
core 8 core 9
Calculated utilization 99.98% 85.69%
Storage utilization 11.86% 10.50%
The average calculated utilization is 90.43% and the average storage utilization is 26.09%. Compared with the conventional scheme, the scheme provided by this embodiment of the invention uses fewer cores and greatly improves resource utilization. The optimization in this example mainly aims at improving computational utilization, so the storage utilization remains at a relatively low level. Owing to differences between splitting schemes, the split algorithm may carry a small amount of storage redundancy; this redundancy is omitted from the calculation.
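The figures above can be reproduced from Table 2 and the grouping given earlier. The sketch below assumes that each group's total computation is shared evenly by its split parts and that every fused part occupies one core; under these assumptions it yields the 10 cores and the calculated utilizations of Table 4.

```python
# Recompute the optimized VGG19 mapping from the per-layer MACs of Table 2,
# the grouping Group0..Group4 and the split numbers 1, 2, 3, 3, 1.
MAC = [89.9, 1852.9, 926.4, 1851.3, 925.6, 1850.5, 1850.5, 1850.5,
       925.2, 1850.1, 1850.1, 1850.1, 462.5, 462.5, 462.5, 462.5]   # in M
groups = [range(0, 2), range(2, 4), range(4, 8), range(8, 12), range(12, 16)]
splits = [1, 2, 3, 3, 1]

# Assumption: each group's total MACs are shared evenly by its split parts.
core_mac = []
for layer_ids, parts in zip(groups, splits):
    group_mac = sum(MAC[i] for i in layer_ids)
    core_mac.extend([group_mac / parts] * parts)

print(f"number of cores used: {len(core_mac)}")          # 10
for i, mac in enumerate(core_mac):
    print(f"core {i}: calculated utilization {mac / max(core_mac):.2%}")
print(f"average: {sum(core_mac) / (len(core_mac) * max(core_mac)):.2%}")  # ~90.4%
```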
Based on the same inventive concept, an embodiment of the present invention further provides a computing device, including a many-core processor, where the many-core processor is configured to execute a related algorithm of a neural network mapped by the mapping method of the many-core processor-based neural network described in any one of the above.
Optionally, the computing device further comprises: a storage device for storing a computer program that is loaded and executed by a processor when run in the computing device.
The embodiment of the invention provides a neural network mapping method based on a many-core processor that has wide applicability and high efficiency. For a neural network to be mapped, the steps of network grouping, intra-group splitting, intra-group fusion and overall re-fusion map each network layer of the neural network reasonably onto the cores of the many-core processor and allocate the computation, storage, routing and other resources of each core, so that the neural network runs more efficiently and the loads of the cores are more balanced than in conventional schemes. The method can in principle be applied to the current mainstream neural network algorithms, including fully connected neural networks and convolutional neural networks, and performs particularly well on convolutional neural networks. On the many-core processor side, the scheme provided by this embodiment is particularly suitable for many-core accelerator architectures specially designed for neural networks. Because the scheme is a static scheduling scheme, the overhead required for scheduling at runtime can be greatly reduced, and a neural network accelerator can devote its main computing power to the neural network algorithm itself.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
By now it should be appreciated by those skilled in the art that while a number of exemplary embodiments of the invention have been shown and described herein in detail, many other variations or modifications of the invention consistent with the principles of the invention may be directly ascertained or inferred from the present disclosure without departing from the spirit and scope of the invention. Accordingly, the scope of the present invention should be understood and deemed to cover all such other variations or modifications.

Claims (7)

1. A method of mapping a neural network based on a many-core processor, comprising:
acquiring a neural network to be mapped, and sequentially combining all network layers of the neural network to be mapped to divide the network layers into a plurality of network layer groups;
splitting each network layer in the network layers in a group, wherein each network layer comprises a plurality of sub-network layers; when the intra-group splitting is carried out, the number of sub-network layers in each network layer group is equal;
performing intra-group fusion on the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor respectively;
for any network layer group, fusing the sub-network layers with the same index in the network layer group to obtain a sub-network layer group corresponding to the index;
after fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, before mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor, the method further comprises:
judging whether the network layer number included in the sub-network layer group is larger than a first preset threshold value or not;
and if the network layer number included in at least one sub-network layer group is greater than the first preset threshold, re-splitting the network layer group to which the sub-network layer group with the network layer number greater than the first preset threshold belongs.
2. The method of claim 1, wherein after mapping the plurality of sub-network layer groups to the plurality of cores of the preset many-core processor, respectively, further comprises:
screening out a first kernel with the resource utilization rate lower than a preset index from the plurality of kernels, wherein the first kernel is at least one;
and fusing the sub-network layer groups corresponding to the first cores again to obtain at least one first sub-network layer group, and remapping the first sub-network layer group to a second core of the many-core processor.
3. The method of any of claims 1-2, wherein after mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor, respectively, further comprises:
judging whether residual kernels exist in the many-core processor;
if the residual kernels exist, at least one third kernel with the resource consumption rate larger than a second preset threshold value in the many-core processor is obtained;
at least part of the sub-network layer mapped to the third core is converted to the remaining cores.
4. The method of claim 3, wherein the transitioning at least a portion of the sub-network layers mapped to the third core to the remaining cores comprises:
and converting one half of the sub-network layer mapped to the third core to the remaining cores.
5. The method according to any of claims 1-2, wherein all network layers in the same network layer group are connected in sequence in the neural network.
6. A computing device comprising a many-core processor, characterized in that,
the many-core processor for executing the related algorithm of the neural network mapped by the mapping method of the many-core processor-based neural network of any one of claims 1-5.
7. The computing device of claim 6, wherein the computing device further comprises:
a storage device for storing a computer program that is loaded and executed by a processor when run in the computing device.
CN201910203167.0A 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor Active CN111723900B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910203167.0A CN111723900B (en) 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor
PCT/CN2020/077973 WO2020187041A1 (en) 2019-03-18 2020-03-05 Neural network mapping method employing many-core processor and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910203167.0A CN111723900B (en) 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor

Publications (2)

Publication Number Publication Date
CN111723900A CN111723900A (en) 2020-09-29
CN111723900B true CN111723900B (en) 2023-10-20

Family

ID=72518948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910203167.0A Active CN111723900B (en) 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor

Country Status (2)

Country Link
CN (1) CN111723900B (en)
WO (1) WO2020187041A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112835718A (en) * 2021-02-10 2021-05-25 北京灵汐科技有限公司 Method and device for processing task, many-core system and computer readable medium
WO2022171002A1 (en) * 2021-02-10 2022-08-18 北京灵汐科技有限公司 Task processing method and apparatus, many-core system, and computer-readable medium
CN112884123B (en) * 2021-02-23 2024-03-01 杭州海康威视数字技术股份有限公司 Neural network optimization method and device, electronic equipment and readable storage medium
CN113485836B (en) * 2021-07-21 2024-03-19 瀚博半导体(上海)有限公司 Tensor processing method and tensor processing system based on tensor segmentation
CN114418063B (en) * 2021-12-27 2023-01-06 北京百度网讯科技有限公司 Method and device for distributing network layer in neural network model
CN115098262B (en) * 2022-06-27 2024-04-23 清华大学 Multi-neural network task processing method and device
CN116167463B (en) * 2023-04-26 2023-07-07 之江实验室 Distributed model training container scheduling method and device for intelligent computing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095110A (en) * 2014-02-18 2015-11-25 新加坡国立大学 Fusible and reconfigurable cache architecture
CN106909971A (en) * 2017-02-10 2017-06-30 华南理工大学 A kind of BP neural network parallel method towards multinuclear computing environment
WO2018154494A1 (en) * 2017-02-23 2018-08-30 Cerebras Systems Inc. Accelerated deep learning
WO2019001418A1 (en) * 2017-06-26 2019-01-03 上海寒武纪信息科技有限公司 Data sharing system and data sharing method therefor
CN109409513A (en) * 2018-10-10 2019-03-01 广州市百果园信息技术有限公司 A kind of task processing method neural network based and relevant device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10868893B2 (en) * 2017-03-31 2020-12-15 Xilinx, Inc. Network interface device
US10108850B1 (en) * 2017-04-24 2018-10-23 Intel Corporation Recognition, reidentification and security enhancements using autonomous machines
CN110738316B (en) * 2018-07-20 2024-05-14 北京三星通信技术研究有限公司 Operation method and device based on neural network and electronic equipment
CN110515732B (en) * 2019-08-23 2021-06-18 中国人民解放军国防科技大学 Task allocation method based on deep learning inference of resource-constrained robot

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095110A (en) * 2014-02-18 2015-11-25 新加坡国立大学 Fusible and reconfigurable cache architecture
CN106909971A (en) * 2017-02-10 2017-06-30 华南理工大学 A kind of BP neural network parallel method towards multinuclear computing environment
WO2018154494A1 (en) * 2017-02-23 2018-08-30 Cerebras Systems Inc. Accelerated deep learning
WO2019001418A1 (en) * 2017-06-26 2019-01-03 上海寒武纪信息科技有限公司 Data sharing system and data sharing method therefor
CN109409513A (en) * 2018-10-10 2019-03-01 广州市百果园信息技术有限公司 A kind of task processing method neural network based and relevant device

Also Published As

Publication number Publication date
CN111723900A (en) 2020-09-29
WO2020187041A1 (en) 2020-09-24

Similar Documents

Publication Publication Date Title
CN111723900B (en) Neural network mapping method and computing device based on many-core processor
CN110490309B (en) Operator fusion method for neural network and related product thereof
WO2017156968A1 (en) Neural network computing method, system and device therefor
CN111178519B (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN107169560B (en) Self-adaptive reconfigurable deep convolutional neural network computing method and device
EP2472398B1 (en) Memory-aware scheduling for NUMA architectures
CN111199275B (en) System on chip for neural network
WO2017173754A1 (en) Method and device for on-chip repetitive addressing
CN105468439B (en) The self-adaptive parallel method of neighbours in radii fixus is traversed under CPU-GPU isomery frame
CN105471985A (en) Load balance method, cloud platform computing method and cloud platform
WO2020134703A1 (en) Neural network system-based image processing method and neural network system
CN110990154B (en) Big data application optimization method, device and storage medium
KR20210148586A (en) Scheduler, method for operating the same and accelerator system including the same
CN114356587A (en) Calculation power task cross-region scheduling method, system and equipment
KR20210108749A (en) Accelerator, method for operating the same and accelerator system including the same
Chen et al. Towards efficient allocation of graph convolutional networks on hybrid computation-in-memory architecture
CN111401543A (en) Neural network accelerator with full on-chip storage and implementation method thereof
CN104104621A (en) Dynamic adaptive adjustment method of virtual network resources based on nonlinear dimensionality reduction
CN108170861B (en) Distributed database system collaborative optimization method based on dynamic programming
CN110167031B (en) Resource allocation method, equipment and storage medium for centralized base station
CN104331336B (en) Be matched with the multilayer nest balancing method of loads of high-performance computer structure
CN115668222A (en) Data processing method and device of neural network
CN116304212A (en) Data processing system, method, equipment and storage medium
CN114064294B (en) Dynamic resource allocation method and system in mobile edge computing environment
CN106844037B (en) KNL-based test method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant