CN111723900A - Mapping method of neural network based on many-core processor and computing device - Google Patents

Mapping method of neural network based on many-core processor and computing device

Info

Publication number
CN111723900A
CN111723900A (application CN201910203167.0A)
Authority
CN
China
Prior art keywords: network, sub-network, many-core, network layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910203167.0A
Other languages
Chinese (zh)
Other versions
CN111723900B (en)
Inventor
Zhang Weihao (张伟豪)
Li Han (李涵)
Pei Jing (裴京)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lynxi Technology Co Ltd
Original Assignee
Beijing Lynxi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lynxi Technology Co Ltd filed Critical Beijing Lynxi Technology Co Ltd
Priority to CN201910203167.0A priority Critical patent/CN111723900B/en
Priority to PCT/CN2020/077973 priority patent/WO2020187041A1/en
Publication of CN111723900A publication Critical patent/CN111723900A/en
Application granted granted Critical
Publication of CN111723900B publication Critical patent/CN111723900B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06N3/04 Architecture, e.g. interconnection topology (G06N Computing arrangements based on specific computational models; G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks)
    • G06N3/045 Combinations of networks
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management (Y02D Climate change mitigation technologies in information and communication technologies)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a mapping method and a computing device for a neural network based on a many-core processor, wherein the method comprises the following steps: acquiring a neural network to be mapped, and sequentially dividing all network layers of the neural network to be mapped into a plurality of network layer groups; splitting each of the network layers so that each network layer comprises a plurality of sub-network layers; fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and respectively mapping the sub-network layer groups to a plurality of cores of a preset many-core processor. The method allocates the computation, storage and routing resources of each core of the many-core processor so that the neural network runs more efficiently, the load of each core is better balanced than in conventional schemes, and the utilization of computation and storage resources is effectively improved.

Description

Mapping method of neural network based on many-core processor and computing device
Technical Field
The invention relates to the technical field of processors, and in particular to a mapping method for a neural network based on a many-core processor and a computing device.
Background
With the growing application of artificial intelligence in various fields, a variety of hardware platforms for running artificial intelligence algorithms have emerged, the many-core processor being one of them. Neural network algorithms are the mainstream artificial intelligence algorithms and are characterized by a large amount of computation and a high degree of parallelism. These characteristics make neural networks well suited to many-core architectures, which is also why many-core processor architectures are an important route for building neural network accelerators today. Given a neural network algorithm and a suitable many-core processor, how to map the algorithm onto the processor and how to allocate the computation, storage and routing resources of each core are problems that urgently need to be solved.
Disclosure of Invention
In view of the above, the present invention provides a mapping method and computing device for a many-core processor based neural network that overcomes or at least partially solves the above problems.
According to one aspect of the invention, a mapping method for a neural network based on a many-core processor is provided, and comprises the following steps:
acquiring a neural network to be mapped, and sequentially dividing all network layers of the neural network to be mapped into a plurality of network layer groups;
splitting each of the network layers, so that each network layer comprises a plurality of sub-network layers;
and fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and respectively mapping the sub-network layer groups to a plurality of cores of a preset many-core processor. By splitting the neural network group by group and then fusing within each group before mapping onto the cores, the routing between cores in the many-core processor can be greatly reduced and the processing efficiency of the processor improved.
Optionally, after the sub-network layers belonging to the same network layer group are fused according to the preset rule to obtain a plurality of sub-network layer groups, and before the plurality of sub-network layer groups are respectively mapped to a plurality of cores of a preset many-core processor, the method further includes:
judging whether the number of network layers included by the sub-network layer group is greater than a first preset threshold value or not;
if the number of network layers included in at least one sub-network layer group is greater than the first preset threshold, performing intra-group splitting again on the network layer group to which that sub-network layer group belongs. During intra-group fusion, a group may contain so many layers (because too many network layers were assigned to it during network grouping) that the fused task exceeds the resource limit a single core can bear; in that case, regrouping reduces the number of network layers in the group so as to achieve load balance across the cores.
Optionally, after mapping the sub-network layer groups to a plurality of cores of a preset many-core processor, the method further includes:
screening out, from the plurality of cores, at least one first core whose resource utilization is lower than a preset index;
and fusing the sub-network layer groups corresponding to the first core again to obtain at least one first sub-network layer group, and remapping the first sub-network layer group to a second core of the many-core processor. After the mapped cores have been obtained through the above steps, some cores may have a low resource utilization; re-fusing these cores effectively reduces the scheduling overhead during operation of the processor.
Optionally, after mapping the sub-network layer groups to a plurality of cores of a preset many-core processor, the method further includes:
judging whether the many-core processor has residual cores;
if the remaining cores exist, acquiring at least one third core of which the resource consumption rate is greater than a second preset threshold value in the many-core processor;
translating at least a portion of the sub-network layers mapped to the third core to the remaining cores. On this basis, if the many-core processor still has remaining cores to which no other tasks need to be allocated, an overall re-splitting strategy can be added: the computing tasks of the more heavily loaded cores are redistributed to the remaining cores, which improves the utilization of each core in the many-core processor as well as its operating efficiency.
Optionally, the converting at least part of the sub-network layer mapped to the third core to the remaining core includes:
translating the half subnet layer mapped to the third core to the remaining cores.
Optionally, all network layers in the same network layer group are connected in sequence in the neural network.
Optionally, when the plurality of network layer groups are subjected to intra-group splitting, each network layer in the same network layer group is split into an equal number of sub-network layers.
Optionally, for any network layer group, the sub-network layers with the same index in the network layer group are fused to obtain the sub-network layer group corresponding to the index.
There is also provided, in accordance with yet another aspect of the invention, a computing device, including a many-core processor, wherein,
the many-core processor is configured to execute the algorithm related to the neural network mapped by the above mapping method for a neural network based on a many-core processor.
Optionally, the computing device further comprises:
a storage device for storing a computer program which, when run in the computing device, is loaded and executed by the processor.
The invention provides a better-balanced mapping method for a neural network based on a many-core processor. Through the steps of network grouping, intra-group splitting, intra-group fusion and overall re-fusion, each network layer of the neural network to be mapped is reasonably mapped to the cores of the many-core processor, and the computation, storage and routing resources of each core are allocated so that the neural network runs more efficiently, the load of each core is better balanced than in conventional schemes, and the utilization of computation and storage resources is effectively improved.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a diagram of a many-core processor based neural network mapping, according to one embodiment of the invention;
FIG. 2 shows a diagram of a many-core processor based neural network mapping according to another embodiment of the invention;
FIG. 3 shows a diagram of a many-core processor based neural network mapping according to another embodiment of the invention;
FIG. 4 shows a flow diagram of a many-core processor based neural network mapping method, according to a preferred embodiment of the invention;
FIG. 5 shows a network packet diagram of a many-core processor based neural network, according to a preferred embodiment of the invention;
FIG. 6 shows a schematic diagram of the intra-group split of a many-core processor based neural network, according to a preferred embodiment of the present invention;
FIG. 7 shows an intra-group fusion schematic of a many-core processor based neural network, according to a preferred embodiment of the present invention;
FIG. 8 shows a schematic diagram of the routing between the pre-and post-fusion kernels within a group of a many-core processor-based neural network, according to a preferred embodiment of the present invention;
FIG. 9 shows an overall re-fusion schematic of a many-core processor based neural network, according to a preferred embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The algorithm responsible for distributing a parallel algorithm onto a many-core processor for execution is generally called a scheduling algorithm. Scheduling is divided into static scheduling and dynamic scheduling. Static scheduling means that the scheduling policy is determined before the parallel algorithm is executed, and execution then follows this policy exactly. Dynamic scheduling, by contrast, decides at run time how to schedule the next step according to the state of the algorithm itself and of the environment.
By "mapping" algorithm, we refer in the narrow sense to a static scheduling algorithm that emphasizes mapping some portion of the parallel algorithm onto cores of a many-core processor, each core running and only the portion of the algorithm that is mapped.
The present application is concerned with mapping algorithms for neural networks on many-core processors; mapping algorithms for general parallel algorithms on many-core processors, however, have been widely studied and some general methods have been formed. For example, the simplest general mapping algorithm maps each layer of the neural network to a core in turn until all layers have been allocated. Referring to FIG. 1, Layer0-Layer5 denote network layers 1-6 and Core0-Core5 denote cores 1-6; when mapping the layers of the neural network to the cores of the many-core processor, Layer0-Layer5 may be mapped to Core0-Core5, respectively.
Generally speaking, the layers of a neural network can be extremely unbalanced in computation and storage, and traditional parallel-algorithm mapping techniques are rarely optimized specifically for this characteristic of neural networks. Mapping techniques with simple general-purpose policies may therefore cause a severe load imbalance among the cores, resulting in a significant waste of computational and storage resources, or in routing congestion.
First, a network layer with a higher load can be split, i.e. one layer is mapped onto a plurality of cores and computed by those cores jointly, which benefits the load balance of the whole architecture; the technique used here may be called the splitting technique. As shown in FIG. 2, Layer5 is split, and the two resulting sub-network layers are mapped to Core5 and Core6, respectively.
In addition, layers with smaller loads can be fused so that one core computes several layers, which improves the resource utilization of the cores; the technique used here may be called the fusion technique. As shown in FIG. 3, Layer0 and Layer1 are fused and mapped together to Core0.
Based on the split and fusion technology, the embodiment of the invention provides a more efficient and balanced neural network mapping method based on a many-core processor.
As shown in fig. 4, the method provided by this embodiment may include:
step S401, acquiring a neural network to be mapped, and sequentially combining all network layers of the neural network to be mapped into a plurality of network layer groups; and all network layers in the same network layer group are sequentially connected in the neural network.
That is, after the neural network to be mapped has been acquired, the first step is network grouping. All network layers of the neural network to be mapped are divided in order into different network layer groups; in terms of connectivity, the network layers in the same group usually form a contiguous segment of the neural network. FIG. 5 shows a network grouping diagram of this embodiment: the neural network comprises Layer0-Layer5, and during grouping Layer0 and Layer1 may be assigned to Group0, Layer2-Layer4 to Group1, and Layer5 alone to Group2; the grouping shown in FIG. 5 is only one of many possible groupings.
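As an illustrative aside (not part of the original disclosure), the grouping step can be sketched in a few lines of Python; the helper name group_layers and the fixed group sizes are assumptions made only for this example:

```python
# Minimal sketch of the network-grouping step (step S401), assuming each
# network layer is identified by its index and the group sizes are given.

def group_layers(num_layers, group_sizes):
    """Divide layer indices 0..num_layers-1 into consecutive groups.

    group_sizes -- list of group sizes, e.g. [2, 3, 1] reproduces the
    FIG. 5 example: Group0 = {Layer0, Layer1}, Group1 = {Layer2..Layer4},
    Group2 = {Layer5}.
    """
    assert sum(group_sizes) == num_layers
    groups, start = [], 0
    for size in group_sizes:
        groups.append(list(range(start, start + size)))
        start += size
    return groups

print(group_layers(6, [2, 3, 1]))   # [[0, 1], [2, 3, 4], [5]]
```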
Step S402, splitting each of the network layers, so that each network layer comprises a plurality of sub-network layers.
The second step is intra-group splitting. Because some network layers in the neural network involve a large amount of computation, one or more layers are selected and split using intra-group splitting. Specifically, the splitting technique may be applied to each network layer group; during intra-group splitting, for example, every network layer in the same group may be split into an equal number of sub-network layers, i.e. layers in the same group preferentially share the same split number. For example, as shown in FIG. 5, if Group0 is split into 2 parts, Layer0 and Layer1 are each split into 2 parts. Similarly, if Group1 is split into 3 parts, Layer2, Layer3 and Layer4 are each split into 3 parts; if Group2 is split into 3 parts, Layer5 is split into 3 parts. The split is made as even as possible, i.e. the partial algorithms obtained by splitting have computation, storage and/or routing amounts that are as close as possible. An index can be used to denote the partial algorithm obtained after splitting a network layer: splitting Layer0 into 2 parts yields Layer0[0] and Layer0[1], where [0] and [1] are the indices of the sub-network layers; this process is shown in FIG. 6.
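Continuing the illustrative sketch above (again not part of the original disclosure), intra-group splitting can be expressed with the Layer i[k] index notation of FIG. 6; the per-group split numbers are assumed inputs:

```python
# Sketch of intra-group splitting (step S402): every layer in a group is
# split into the same number of sub-layers, identified by (layer, index).

def split_groups(groups, split_counts):
    """groups       -- output of group_layers, e.g. [[0, 1], [2, 3, 4], [5]]
    split_counts -- one split number per group, e.g. [2, 3, 3]
    Returns, per group, the list of sub-layers (layer_id, sub_index)."""
    split = []
    for layers, n in zip(groups, split_counts):
        split.append([(layer, k) for layer in layers for k in range(n)])
    return split

# Reproduces FIG. 6: Layer0 and Layer1 split into 2 parts, Layers 2-5 into 3.
print(split_groups([[0, 1], [2, 3, 4], [5]], [2, 3, 3]))
```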
Step S403, fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and respectively mapping the sub-network layer groups to a plurality of cores of a preset many-core processor.
Intra-group fusion using the fusion technique merges the sub-network layers within each network layer group into a plurality of sub-network layer groups. As introduced above, an index denotes the partial algorithm obtained after splitting a network layer. Optionally, fusing the sub-network layers belonging to the same network layer group according to the preset rule to obtain a plurality of sub-network layer groups further comprises: for any network layer group, fusing the sub-network layers with the same index in that group to obtain the sub-network layer group corresponding to the index. That is, within each group, sub-network layers sharing the same index are fused onto one core. For example, referring to FIG. 7, Layer0[0] and Layer1[0] are fused into Core0, and Layer0[1] and Layer1[1] into Core1; Layer2[0], Layer3[0] and Layer4[0] are fused into Core2, and so on.
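A corresponding illustrative sketch of intra-group fusion (again an editorial example, not part of the disclosure) fuses sub-network layers sharing the same index onto one core and reproduces the core assignment described for FIG. 7:

```python
# Sketch of intra-group fusion (step S403): within each group, sub-layers
# that share the same split index are fused and mapped onto one core.

def fuse_groups(groups, split_counts):
    cores = []
    for layers, n in zip(groups, split_counts):
        for k in range(n):
            # all sub-layers with index k in this group share one core
            cores.append([(layer, k) for layer in layers])
    return cores

# Core0 = {Layer0[0], Layer1[0]}, Core1 = {Layer0[1], Layer1[1]},
# Core2 = {Layer2[0], Layer3[0], Layer4[0]}, ... as in FIG. 7.
for core_id, contents in enumerate(fuse_groups([[0, 1], [2, 3, 4], [5]], [2, 3, 3])):
    print(f"Core{core_id}: {contents}")
```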
During intra-group fusion, a group may contain so many layers (because too many network layers were assigned to it during network grouping) that, no matter how the tasks are split, the fused multi-layer task exceeds the resource limit a single core can bear. In that case the procedure must return to the network grouping step and regroup.
Optionally, after the sub-network layers in the same network layer group are fused in step S403 to obtain a plurality of sub-network layer groups, it may further be judged whether the number of network layers included in any sub-network layer group is greater than a first preset threshold; if the number of network layers included in at least one sub-network layer group is greater than the first preset threshold, the network layer group to which that sub-network layer group belongs is split again within the group, thereby reducing the number of network layers in the group. The first preset threshold may be set according to actual needs, and the invention is not limited in this respect.
Intra-group fusion can greatly reduce the routing between cores. Take the intra-group fusion of Group0 as an example. As shown in FIG. 8, the total routing of Layer0 and Layer1 before fusion can be represented by two thicker arrows and two thinner arrows. Owing to the data locality of neural network operations, the routing volume represented by the thick arrows is generally much larger than that represented by the thin arrows (except for fully connected neural networks). After intra-group fusion, the routing represented by the thick arrows becomes data transfer inside a core; the only real inter-core routing is the volume represented by the two thin arrows, so the total routing volume is greatly reduced.
The embodiment of the invention further comprises a step S404 of screening out, from the plurality of cores, at least one first core whose resource utilization is lower than a preset index, fusing the sub-network layer groups corresponding to the first core again to obtain at least one first sub-network layer group, and remapping the first sub-network layer group to a second core of the many-core processor.
After the above steps, a number of mapped cores are obtained, among which there may be some 'small cores', i.e. cores that occupy few resources or whose resource utilization is lower than a preset index; the preset index may be set according to the particular many-core processor. If re-fusing some of the mapped cores does not hit the bottleneck of the original mapping, i.e. the neural network under the original mapping scheme neither runs slower nor exceeds the memory and routing limits, an overall re-fusion can be performed in which the cores with lower resource utilization are fused again. An example of this process is shown in FIG. 9.
Referring to FIG. 9, Core0 is responsible for Layer0[0] and Layer1[0], and Core5 is responsible for Layer5[0]; since both Core0 and Core5 are lightly utilized, they can be re-fused, the fused Core0 being responsible for Layer0[0], Layer1[0] and Layer5[0]. Similarly, Core6 is mainly responsible for Layer5[1] and Core7 for Layer5[2]; after the two are fused, Core5 can be responsible for Layer5[1] and Layer5[2] simultaneously.
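For illustration only, one possible greedy form of this overall re-fusion is sketched below; the threshold, the capacity check and the choice to merge the two least-loaded cores are assumptions made for the example, since the method does not prescribe a particular pairing rule:

```python
# Sketch of overall re-fusion (step S404): cores whose load is below a
# threshold are greedily merged, provided the merged core stays within its
# resource limit. Loads stand in for the resource-utilization index.

def refuse_small_cores(core_loads, threshold, capacity):
    """core_loads -- per-core loads (e.g. MACs); returns loads after merging."""
    loads = sorted(core_loads)
    while len(loads) >= 2 and loads[0] < threshold and loads[0] + loads[1] <= capacity:
        merged = loads.pop(0) + loads.pop(0)   # fuse the two smallest cores
        loads.append(merged)
        loads.sort()
    return loads

print(refuse_small_cores([100, 20, 15, 90], threshold=50, capacity=120))  # [35, 90, 100]
```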
On the basis of the above algorithm, if the number of cores required by the final mapping scheme is smaller than the total number of cores of the many-core processor, and the remaining cores do not need to be allocated to other tasks, an overall re-splitting strategy can be added.
The embodiment of the invention further comprises a step S405 of judging whether the many-core processor has remaining cores; if remaining cores exist, acquiring at least one third core of the many-core processor whose resource consumption rate is greater than a second preset threshold; and converting at least part of the sub-network layers mapped to the third core to the remaining cores, while the other part remains mapped to the third core. Optionally, one half of the sub-network layers mapped to the third core may remain mapped to it while the other half are converted to the remaining cores.
That is, after the overall re-fusion the mapping is split again as a whole: the most heavily loaded core is repeatedly selected and split, preferably using a one-into-two splitting strategy, until all cores are utilized.
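A minimal illustrative sketch of this overall re-splitting (not part of the disclosure) follows, under the simplifying assumption that a core's load can be halved when it is split in two:

```python
# Sketch of overall re-splitting (step S405): while spare cores remain, the
# most heavily loaded core is split in two (the one-into-two strategy).

import heapq

def resplit(core_loads, total_cores):
    heap = [-load for load in core_loads]   # max-heap via negated loads
    heapq.heapify(heap)
    while len(heap) < total_cores:
        heaviest = -heapq.heappop(heap)
        heapq.heappush(heap, -heaviest / 2)  # assumed: load halves when split
        heapq.heappush(heap, -heaviest / 2)
    return sorted(-l for l in heap)

print(resplit([100, 60, 40], total_cores=5))  # [30.0, 30.0, 40, 50.0, 50.0]
```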
The following describes the scheme of the above embodiment, taking VGG19 as an example, and the input feature map size is 224 × 224 × 3. The network structure of the convolutional neural network VGG19 is shown in table 1.
TABLE 1 Network structure of the convolutional neural network VGG19 (presented as an image in the original publication and not reproduced here)
A convolutional neural network (CNN) is composed of an input layer, convolutional layers, activation functions, pooling layers and fully connected layers, i.e. Input-Conv-ReLU-Pool-Fc, followed by prob (the classifier). For convenience of calculation, only the convolutional and ReLU layers are considered here, and each adjacent pair of convolutional and ReLU layers is treated as one layer, denoted layer_i.
Table 2 evaluates the amount of computation and the amount of storage of each layer. The amount of computation is expressed in MACs, i.e. the number of multiply-accumulate operations; for the ReLU layer, each activation function operation is assumed to cost one MAC. The amount of storage is expressed as the total number of feature-map and weight values. The unit of MAC is M and the unit of storage is K, where 1 M denotes 1,000,000 and 1 K denotes 1,000.
TABLE 2 calculation (MAC) and storage for layers in VGG19
layer0 layer1 layer2 layer3 layer4 layer5 layer6 layer7
MAC(M) 89.9 1852.9 926.4 1851.3 925.6 1850.5 1850.5 1850.5
Storage (K) 151.1 3212.0 804.0 1606.8 403.7 805.1 805.1 805.1
layer8 layer9 layer10 layer11 layer12 layer13 layer14 layer15
MAC(M) 925.2 1850.1 1850.1 1850.1 462.5 462.5 462.5 462.5
Storage (K) 205.3 406.0 406.0 406.0 105.0 105.0 105.0 105.0
Assume that the memory size of each core is 4 M (4000 K). The simple layer-per-core mapping described above uses 16 cores, and the computational utilization of each core is calculated using the following formula:
$$\mathrm{compute\_rate}_i = \frac{\mathrm{MAC}_i}{\max_j \mathrm{MAC}_j}$$
where compute_rate_i denotes the computational utilization of Core_i, i indexes Core_i (core i), MAC_i denotes the amount of computation of Core_i, and MAC_j denotes the amount of computation of Core_j.
The storage utilization of each core is calculated using the following formula.
$$\mathrm{memory\_rate}_i = \frac{\mathrm{Mem}_i}{\mathrm{Mem}_{\mathrm{core}}}$$
where memory_rate_i denotes the storage utilization of Core_i, Mem_i denotes the amount of storage occupied by Core_i, and Mem_core denotes the memory capacity of a single core (4 M, i.e. 4000 K, as assumed above).
The calculation and storage utilization of each core is obtained as shown in table 3.
TABLE 3 computation and storage utilization per core after VGG19 simple mapping
core0 core1 core2 core3 core4 core5 core6 core7
Calculation utilization 4.85% 100% 50% 99.91% 49.96% 99.87% 99.87% 99.87%
Storage utilization 3.78% 80.3% 20.1% 40.17% 10.09% 20.13% 20.13% 20.13%
core8 core9 core10 core11 core12 core13 core14 core15
Calculation utilization 49.94% 99.85% 99.85% 99.85% 24.96% 24.96% 24.96% 24.96%
Storage utilization 5.13% 10.15% 10.15% 10.15% 2.64% 2.64% 2.64% 2.64%
The average calculated utilization was 65.85%, and the average storage utilization was 16.31%.
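The per-core utilizations of Table 3 and the averages above follow directly from the figures of Table 2; the short sketch below (an editorial illustration, not part of the original disclosure) reproduces them under the stated assumptions of one layer per core and 4000 K of memory per core:

```python
# Sketch: per-core utilization of the simple one-layer-per-core mapping,
# computed from the Table 2 figures (MAC in millions, storage in K).
mac = [89.9, 1852.9, 926.4, 1851.3, 925.6, 1850.5, 1850.5, 1850.5,
       925.2, 1850.1, 1850.1, 1850.1, 462.5, 462.5, 462.5, 462.5]
mem = [151.1, 3212.0, 804.0, 1606.8, 403.7, 805.1, 805.1, 805.1,
       205.3, 406.0, 406.0, 406.0, 105.0, 105.0, 105.0, 105.0]
CORE_MEM = 4000.0  # assumed per-core memory in K (4 M as stated above)

compute_rate = [m / max(mac) for m in mac]      # one layer per core
memory_rate = [m / CORE_MEM for m in mem]

print(f"core0: compute {compute_rate[0]:.2%}, storage {memory_rate[0]:.2%}")
# core0: compute 4.85%, storage 3.78%  (cf. Table 3)
print(f"average: compute {sum(compute_rate)/16:.2%}, storage {sum(memory_rate)/16:.2%}")
# average: compute 65.85%, storage 16.31%
```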
Based on the method provided by the embodiment of the invention, the mapping of the VGG19 network is optimized, and the network grouping can be as follows:
Group0={layer0,layer1}
Group1={layer2,layer3}
Group2={layer4,layer5,layer6,layer7}
Group3={layer8,layer9,layer10,layer11}
Group4={layer12,layer13,layer14,layer15}
The split number of Group0 is 1, that of Group1 is 2, that of Group2 is 3, that of Group3 is 3, and that of Group4 is 1. Intra-group fusion then yields 10 cores. Inspection of these 10 cores shows that no small core can be fused further, which gives the final optimization scheme. The computation and storage utilization of each core under this scheme is shown in Table 4.
TABLE 4 computational utilization and storage utilization per core after VGG19 has been mapped by the patented scheme
core0 core1 core2 core3 core4 core5 core6 core7
Calculation utilization 89.99% 64.33% 64.33% 100% 100% 100% 99.98% 99.98%
Storage utilization 84.07% 30.13% 30.13% 23.49% 23.49% 23.49% 11.86% 11.86%
core8 core9
Calculation utilization 99.98% 85.69%
Storage utilization 11.86% 10.50%
The average computational utilization is 90.43% and the average storage utilization is 26.09%. Compared with the traditional scheme, the scheme provided by the embodiment of the invention reduces the number of cores used and greatly improves resource utilization. The optimization in this example is aimed primarily at improving computational utilization, so the storage utilization may remain at a relatively low level. Owing to differences between splitting schemes, the split algorithm may carry a slight storage redundancy; the calculation of this redundancy is omitted here.
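For completeness, the figures of Table 4 can likewise be reproduced from Table 2 under the assumption, consistent with the description above, that each group's total computation and storage are divided evenly among its split parts; this sketch is illustrative only:

```python
# Sketch: per-core load of the optimized VGG19 mapping, derived from the
# grouping and split numbers above (mac/mem are the Table 2 figures).
mac = [89.9, 1852.9, 926.4, 1851.3, 925.6, 1850.5, 1850.5, 1850.5,
       925.2, 1850.1, 1850.1, 1850.1, 462.5, 462.5, 462.5, 462.5]
mem = [151.1, 3212.0, 804.0, 1606.8, 403.7, 805.1, 805.1, 805.1,
       205.3, 406.0, 406.0, 406.0, 105.0, 105.0, 105.0, 105.0]
groups = [[0, 1], [2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
splits = [1, 2, 3, 3, 1]

core_mac, core_mem = [], []
for layers, n in zip(groups, splits):
    group_mac = sum(mac[l] for l in layers)
    group_mem = sum(mem[l] for l in layers)
    core_mac.extend([group_mac / n] * n)   # assumed even split within the group
    core_mem.extend([group_mem / n] * n)

compute_rate = [m / max(core_mac) for m in core_mac]
memory_rate = [m / 4000.0 for m in core_mem]
print(f"cores used: {len(core_mac)}")                                   # 10
print(f"core0: compute {compute_rate[0]:.2%}, storage {memory_rate[0]:.2%}")
# core0: compute ~90%, storage ~84% (cf. Table 4; minor rounding differences possible)
print(f"average: compute {sum(compute_rate)/10:.2%}, storage {sum(memory_rate)/10:.2%}")
# average: compute 90.43%, storage 26.09%
```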
Based on the same inventive concept, an embodiment of the invention also provides a computing device comprising a many-core processor, wherein the many-core processor is configured to execute the algorithm related to the neural network mapped by the above mapping method for a neural network based on a many-core processor.
Optionally, the computing device further comprises: a storage device for storing a computer program which, when run in the computing device, is loaded and executed by the processor.
The embodiment of the invention provides a neural network mapping method based on a many-core processor with wider applicability and higher efficiency. Through the steps of network grouping, intra-group splitting, intra-group fusion and overall re-fusion, each network layer of the neural network to be mapped is reasonably mapped to the cores of the many-core processor, and the computation, storage and routing resources of each core are allocated so that the neural network runs more efficiently, while the load of each core is better balanced than in conventional schemes. In theory the method can be applied to today's mainstream neural network algorithms, including fully connected neural networks and convolutional neural networks, and it performs particularly well on convolutional neural networks. On the processor side, the scheme provided by the embodiment of the invention is especially suitable for many-core accelerator architectures designed specifically for neural networks. Because the scheme is a static scheduling scheme, the overhead required for scheduling at run time is greatly reduced, and a neural network accelerator can devote its main computing power to the computation of the neural network algorithm itself.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
Thus, it should be appreciated by those skilled in the art that while a number of exemplary embodiments of the invention have been illustrated and described in detail herein, many other variations or modifications consistent with the principles of the invention may be directly determined or derived from the disclosure of the present invention without departing from the spirit and scope of the invention. Accordingly, the scope of the invention should be understood and interpreted to cover all such other variations or modifications.

Claims (10)

1. A mapping method for a neural network based on a many-core processor, comprising:
acquiring a neural network to be mapped, and sequentially dividing all network layers of the neural network to be mapped into a plurality of network layer groups;
splitting each of the network layers, so that each network layer comprises a plurality of sub-network layers;
fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and respectively mapping the sub-network layer groups to a plurality of cores of a preset many-core processor.
2. The method of claim 1, wherein after the sub-network layers belonging to the same network layer group are fused according to the preset rule to obtain the plurality of sub-network layer groups, and before the plurality of sub-network layer groups are respectively mapped to the plurality of cores of the preset many-core processor, the method further comprises:
judging whether the number of network layers included by the sub-network layer group is greater than a first preset threshold value or not;
if the number of network layers included in at least one sub-network layer group is greater than the first preset threshold value, the network layer group to which the sub-network layer group belongs, the number of network layers of which is greater than the first preset threshold value, is subjected to intra-group splitting again.
3. The method of claim 1 or 2, wherein after the plurality of sub-network layer groups are respectively mapped to the plurality of cores of the preset many-core processor, the method further comprises:
screening out, from the plurality of cores, at least one first core whose resource utilization is lower than a preset index;
and fusing the sub-network layer groups corresponding to the first core again to obtain at least one first sub-network layer group, and remapping the first sub-network layer group to a second core of the many-core processor.
4. The method of any one of claims 1-3, wherein after the plurality of sub-network layer groups are respectively mapped to the plurality of cores of the preset many-core processor, the method further comprises:
judging whether the many-core processor has residual cores;
if the remaining cores exist, acquiring at least one third core of which the resource consumption rate is greater than a second preset threshold value in the many-core processor;
translating at least a portion of the sub-network layers mapped to the third core to the remaining cores.
5. The method of claim 4, wherein the translating at least a portion of the sub-network layers mapped to the third core to the remaining cores comprises:
translating one half of the sub-network layers mapped to the third core to the remaining cores.
6. The method of any one of claims 1-5, wherein all network layers in the same network layer group are connected sequentially in the neural network.
7. The method of any one of claims 1-5, wherein the number of sub-network layers in each network layer group is equal.
8. The method of claim 7, wherein for any network layer group, merging sub-network layers with the same index in the network layer group to obtain the sub-network layer group corresponding to the index.
9. A computing device comprising a many-core processor, wherein,
the many-core processor being configured to execute the algorithm related to the neural network mapped by the mapping method for a neural network based on a many-core processor of any one of claims 1 to 8.
10. The computing device of claim 9, wherein the computing device further comprises:
a storage device for storing a computer program which is loaded and executed by a processor when running in the computing device.
CN201910203167.0A 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor Active CN111723900B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910203167.0A CN111723900B (en) 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor
PCT/CN2020/077973 WO2020187041A1 (en) 2019-03-18 2020-03-05 Neural network mapping method employing many-core processor and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910203167.0A CN111723900B (en) 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor

Publications (2)

Publication Number Publication Date
CN111723900A true CN111723900A (en) 2020-09-29
CN111723900B CN111723900B (en) 2023-10-20

Family

ID=72518948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910203167.0A Active CN111723900B (en) 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor

Country Status (2)

Country Link
CN (1) CN111723900B (en)
WO (1) WO2020187041A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112835718A (en) * 2021-02-10 2021-05-25 北京灵汐科技有限公司 Method and device for processing task, many-core system and computer readable medium
CN114418063A (en) * 2021-12-27 2022-04-29 北京百度网讯科技有限公司 Method and device for distributing network layer in neural network model
WO2022171002A1 (en) * 2021-02-10 2022-08-18 北京灵汐科技有限公司 Task processing method and apparatus, many-core system, and computer-readable medium
CN115098262A (en) * 2022-06-27 2022-09-23 清华大学 Multi-neural-network task processing method and device

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884123B (en) * 2021-02-23 2024-03-01 杭州海康威视数字技术股份有限公司 Neural network optimization method and device, electronic equipment and readable storage medium
CN113485836B (en) * 2021-07-21 2024-03-19 瀚博半导体(上海)有限公司 Tensor processing method and tensor processing system based on tensor segmentation
CN116167463B (en) * 2023-04-26 2023-07-07 之江实验室 Distributed model training container scheduling method and device for intelligent computing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095110A (en) * 2014-02-18 2015-11-25 新加坡国立大学 Fusible and reconfigurable cache architecture
CN106909971A (en) * 2017-02-10 2017-06-30 华南理工大学 A kind of BP neural network parallel method towards multinuclear computing environment
WO2018154494A1 (en) * 2017-02-23 2018-08-30 Cerebras Systems Inc. Accelerated deep learning
WO2019001418A1 (en) * 2017-06-26 2019-01-03 上海寒武纪信息科技有限公司 Data sharing system and data sharing method therefor
CN109409513A (en) * 2018-10-10 2019-03-01 广州市百果园信息技术有限公司 A kind of task processing method neural network based and relevant device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10868893B2 (en) * 2017-03-31 2020-12-15 Xilinx, Inc. Network interface device
US10108850B1 (en) * 2017-04-24 2018-10-23 Intel Corporation Recognition, reidentification and security enhancements using autonomous machines
CN110738316B (en) * 2018-07-20 2024-05-14 北京三星通信技术研究有限公司 Operation method and device based on neural network and electronic equipment
CN110515732B (en) * 2019-08-23 2021-06-18 中国人民解放军国防科技大学 Task allocation method based on deep learning inference of resource-constrained robot

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095110A (en) * 2014-02-18 2015-11-25 新加坡国立大学 Fusible and reconfigurable cache architecture
CN106909971A (en) * 2017-02-10 2017-06-30 华南理工大学 A kind of BP neural network parallel method towards multinuclear computing environment
WO2018154494A1 (en) * 2017-02-23 2018-08-30 Cerebras Systems Inc. Accelerated deep learning
WO2019001418A1 (en) * 2017-06-26 2019-01-03 上海寒武纪信息科技有限公司 Data sharing system and data sharing method therefor
CN109409513A (en) * 2018-10-10 2019-03-01 广州市百果园信息技术有限公司 A kind of task processing method neural network based and relevant device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112835718A (en) * 2021-02-10 2021-05-25 北京灵汐科技有限公司 Method and device for processing task, many-core system and computer readable medium
WO2022171002A1 (en) * 2021-02-10 2022-08-18 北京灵汐科技有限公司 Task processing method and apparatus, many-core system, and computer-readable medium
CN114418063A (en) * 2021-12-27 2022-04-29 北京百度网讯科技有限公司 Method and device for distributing network layer in neural network model
CN114418063B (en) * 2021-12-27 2023-01-06 北京百度网讯科技有限公司 Method and device for distributing network layer in neural network model
CN115098262A (en) * 2022-06-27 2022-09-23 清华大学 Multi-neural-network task processing method and device
CN115098262B (en) * 2022-06-27 2024-04-23 清华大学 Multi-neural network task processing method and device

Also Published As

Publication number Publication date
CN111723900B (en) 2023-10-20
WO2020187041A1 (en) 2020-09-24

Similar Documents

Publication Publication Date Title
CN111723900A (en) Mapping method of neural network based on many-core processor and computing device
US20220147403A1 (en) Reducing overlay network overhead across container hosts
Xiao et al. NFVdeep: Adaptive online service function chain deployment with deep reinforcement learning
WO2017156968A1 (en) Neural network computing method, system and device therefor
CN111199275B (en) System on chip for neural network
CN110308984B (en) Cross-cluster computing system for processing geographically distributed data
WO2017173754A1 (en) Method and device for on-chip repetitive addressing
CN117632361A (en) Resource scheduling method and device and related equipment
CN108923979B (en) Software defined network virtual network mapping method
CN108170861B (en) Distributed database system collaborative optimization method based on dynamic programming
CN110380906B (en) Large-scale multidimensional fusion virtual network mapping method
CN107360031A (en) It is a kind of based on optimization overhead gains than mapping method of virtual network
Huang et al. Fuzzy clustering with feature weight preferences for load balancing in cloud
CN111159859A (en) Deployment method and system of cloud container cluster
CN103955397B (en) A kind of scheduling virtual machine many policy selection method based on micro-architecture perception
CN107493574B (en) Wireless controller equipment, parallel authentication processing method, system and networking device
JP2018148455A (en) Information processor and method
Yang et al. Yun: a high-performance container management service based on openstack
Trejo-Sánchez et al. A multi-agent architecture for scheduling of high performance services in a GPU cluster
Xing et al. Allocating DNN layers computation between front-end devices and the cloud server for video big data processing
CN115834466B (en) Method, device, equipment, system and storage medium for analyzing path of computing power network
Cheng et al. Towards a deep-pipelined architecture for accelerating deep GCN on a multi-FPGA platform
Chen et al. Task scheduling based on fruit fly optimization algorithm in mobile cloud computing
WO2024022046A1 (en) Deep learning system and method
CN107317767A (en) Network fast flow optimization method based on anti-ant group algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Zhang Weihao

Inventor after: Li Han

Inventor before: Zhang Weihao

Inventor before: Li Han

Inventor before: Pei Jing