CN111723900B - Neural network mapping method and computing device based on many-core processor - Google Patents

Neural network mapping method and computing device based on many-core processor

Info

Publication number
CN111723900B
Authority
CN
China
Prior art keywords
network
sub-network layer
many-core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910203167.0A
Other languages
Chinese (zh)
Other versions
CN111723900A (en)
Inventor
张伟豪
李涵
裴京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lynxi Technology Co Ltd
Original Assignee
Beijing Lynxi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Lynxi Technology Co Ltd filed Critical Beijing Lynxi Technology Co Ltd
Priority to CN201910203167.0A priority Critical patent/CN111723900B/en
Priority to PCT/CN2020/077973 priority patent/WO2020187041A1/en
Publication of CN111723900A publication Critical patent/CN111723900A/en
Application granted granted Critical
Publication of CN111723900B publication Critical patent/CN111723900B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a neural network mapping method and a computing device based on a many-core processor. The method comprises: acquiring a neural network to be mapped, and sequentially grouping all network layers of the neural network to be mapped into a plurality of network layer groups; splitting each of the network layers so that each network layer comprises a plurality of sub-network layers; and fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor respectively. The method allocates the computation, storage, routing and other resources of each core of the many-core processor, so that the neural network runs more efficiently, the loads of the cores are more balanced than in conventional schemes, and computation and storage resource efficiency is effectively improved.

Description

Neural network mapping method and computing device based on many-core processor
Technical Field
The present invention relates to the field of processor technologies, and in particular, to a mapping method and a computing device for a neural network based on a many-core processor.
Background
With the continuing application of artificial intelligence technology in various fields, a variety of hardware platforms for running artificial intelligence algorithms have emerged, and many-core processors are one of them. Neural network algorithms are the mainstream artificial intelligence algorithms and are characterized by a high computation load and a high degree of parallelism. These characteristics make neural networks well suited to running on many-core architectures, which is also why many-core processor architectures are currently an important way to build neural network accelerators. Given a neural network algorithm and a suitable many-core processor, how to map the algorithm onto the processor, and how to allocate the computation, storage, routing and other resources of each core of the many-core processor, is a problem to be solved.
Disclosure of Invention
In view of the foregoing, the present invention provides a mapping method and computing device for a many-core processor-based neural network that overcomes or at least partially solves the foregoing problems.
According to one aspect of the present invention, there is provided a mapping method of a neural network based on a many-core processor, including:
acquiring a neural network to be mapped, and sequentially grouping all network layers of the neural network to be mapped into a plurality of network layer groups;
splitting each of the network layers, so that each network layer comprises a plurality of sub-network layers;
and fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor respectively. By splitting the network within each group, then fusing within the group and mapping the result onto cores, the routing among the cores of the many-core processor can be greatly reduced, which further improves the processing efficiency of the processor.
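As an aid to reading, the following minimal sketch walks through this group-split-fuse-map flow on the six-layer example used later with Figs. 5-7. The layer names, group sizes, split counts and the rule of assigning each fused sub-network layer group to one core are illustrative assumptions, not the claimed implementation.

```python
# Minimal sketch of the group -> split -> fuse -> map flow described above.
# All names and sizes are illustrative assumptions.

def group_layers(layers, group_sizes):
    """Partition consecutive layers into network layer groups."""
    groups, start = [], 0
    for size in group_sizes:
        groups.append(layers[start:start + size])
        start += size
    return groups

def split_group(group, parts):
    """Split every layer in a group into `parts` index-tagged sub-network layers."""
    return [[f"{layer}[{i}]" for layer in group] for i in range(parts)]

def fuse_and_map(groups, split_counts):
    """Fuse same-index sub-layers of each group and map each fusion to one core."""
    cores = []
    for group, parts in zip(groups, split_counts):
        cores.extend(split_group(group, parts))  # one fused sub-layer group per core
    return cores

layers = [f"Layer{i}" for i in range(6)]
groups = group_layers(layers, group_sizes=[2, 3, 1])   # Group0..Group2
cores = fuse_and_map(groups, split_counts=[2, 3, 3])   # 2 + 3 + 3 = 8 cores
for core_id, contents in enumerate(cores):
    print(f"Core{core_id}: {contents}")
```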
Optionally, after fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, before mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor, the method further includes:
judging whether the network layer number included in the sub-network layer group is larger than a first preset threshold value or not;
and if the number of network layers included in at least one sub-network layer group is greater than the first preset threshold, re-splitting the network layer group to which that sub-network layer group belongs. During intra-group fusion it may turn out that too many layers were assigned to one group during network grouping, so that the layers in the group exceed the upper limit of resources that a single core can bear; in this case the number of network layers in the group is reduced by regrouping, so that the load of each core is balanced.
Optionally, after mapping the plurality of sub-network layer groups to the plurality of cores of the preset many-core processor, the method further includes:
screening out, from the plurality of cores, a first core whose resource utilization rate is lower than a preset index, wherein there is at least one first core;
and fusing the sub-network layer groups corresponding to the first cores again to obtain at least one first sub-network layer group, and remapping the first sub-network layer group to a second core of the many-core processor. Among the cores obtained through the above steps there may be cores with a low resource utilization rate; re-fusing these cores effectively reduces the overhead required for scheduling while the processor is running.
Optionally, after mapping the plurality of sub-network layer groups to the plurality of cores of the preset many-core processor, the method further includes:
judging whether there are remaining cores in the many-core processor;
if there are remaining cores, obtaining at least one third core in the many-core processor whose resource consumption rate is greater than a second preset threshold;
and remapping at least part of the sub-network layers mapped to the third core onto the remaining cores. On this basis, if the many-core processor still has remaining cores and no other tasks need to be allocated, an overall re-splitting strategy can be added, and the computing tasks of the cores with larger loads are redistributed to the remaining cores, which improves the utilization rate of each core in the many-core processor and thus the running efficiency of the many-core processor.
Optionally, the remapping of at least part of the sub-network layers mapped to the third core onto the remaining cores includes:
remapping one half of the sub-network layers mapped to the third core onto the remaining cores.
Optionally, all network layers in the same network layer group are sequentially connected in the neural network.
Optionally, when the plurality of network layer groups are split within each group, all network layers in the same network layer group are split into the same number of sub-network layers.
Optionally, for any network layer group, fusing the sub-network layers with the same index in the network layer group to obtain the sub-network layer group corresponding to the index.
According to yet another aspect of the present invention, there is also provided a computing device comprising a many-core processor, characterized in that,
the many-core processor is configured to execute the related algorithm of the neural network mapped by the mapping method of the neural network based on the many-core processor.
Optionally, the computing device further comprises:
a storage device for storing a computer program which, when run in the computing device, is loaded and executed by the processor.
The invention provides a more balanced neural network mapping method based on a many-core processor. For a neural network to be mapped, the steps of network grouping, intra-group splitting, intra-group fusion and overall re-fusion map each network layer of the neural network reasonably onto the cores of the many-core processor and allocate the computation, storage, routing and other resources of each core, so that the neural network runs more efficiently, the loads of the cores are more balanced than in conventional schemes, and computation and storage resource efficiency is effectively improved.
The foregoing is merely an overview of the technical solution of the present invention. In order that the technical means of the present invention may be more clearly understood and implemented in accordance with the content of the description, and in order to make the above and other objects, features and advantages of the present invention more readily apparent, specific embodiments of the invention are set forth below.
The above, as well as additional objectives, advantages, and features of the present invention will become apparent to those skilled in the art from the following detailed description of a specific embodiment of the present invention when read in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 illustrates a neural network mapping schematic based on a many-core processor, according to one embodiment of the invention;
FIG. 2 illustrates a neural network mapping schematic based on a many-core processor, according to another embodiment of the invention;
FIG. 3 illustrates a neural network mapping schematic based on a many-core processor, according to another embodiment of the invention;
FIG. 4 is a flowchart of a neural network mapping method based on a many-core processor according to a preferred embodiment of the present invention;
FIG. 5 shows a network grouping schematic of a many-core processor-based neural network in accordance with a preferred embodiment of the present invention;
FIG. 6 illustrates an intra-group split schematic of a many-core processor-based neural network in accordance with a preferred embodiment of the present invention;
FIG. 7 illustrates a schematic diagram of intra-group fusion of a many-core processor-based neural network in accordance with a preferred embodiment of the present invention;
FIG. 8 illustrates a schematic routing diagram between cores before and after intra-group fusion of a many-core processor-based neural network in accordance with a preferred embodiment of the present invention;
FIG. 9 shows a schematic diagram of the overall re-fusion of a many-core processor-based neural network in accordance with a preferred embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
An algorithm responsible for distributing a parallel algorithm to run on a many-core processor is generally referred to as a scheduling algorithm, and scheduling is classified into static scheduling and dynamic scheduling. Static scheduling means that the scheduling strategy is formulated before the parallel algorithm is executed, and the run then follows the formulated strategy exactly. Dynamic scheduling is different: while the algorithm is running, it decides how to schedule the next step according to its own state and the state of the environment.
A "mapping" algorithm, in the narrow sense, is a static scheduling algorithm that emphasizes mapping each portion of the parallel algorithm onto a core of the many-core processor, with each core running only the portion of the algorithm mapped to it.
Mapping algorithms for neural networks on many-core processors have been studied relatively little, but mapping algorithms for general parallel algorithms on many-core processors have been studied extensively, and a number of general methods exist. For example, the simplest general mapping algorithm maps each layer of the neural network to a core in turn until all layers have been assigned. Referring to fig. 1, Layer0-Layer5 represent network layers 1-6 and Core0-Core5 represent cores 1-6, respectively; when mapping the layers of the neural network onto the cores of the many-core processor, Layer0-Layer5 may be mapped to Core0-Core5, respectively.
In general, the computation and storage requirements of the layers of a neural network can be extremely unbalanced, and conventional parallel-algorithm mapping techniques are rarely optimized for these characteristics of neural networks. Mapping with such a simple general-purpose strategy may therefore cause a severe load imbalance among the cores, waste a great deal of computing and storage resources, or cause routing congestion.
To address this, a network layer with a higher load can be split, that is, one layer is mapped to a plurality of cores and computed by those cores together, which helps balance the load of the whole architecture; the technique adopted in this process may be called a splitting technique. As shown in fig. 2, Layer5 is split and the two resulting sub-network layers are mapped to Core5 and Core6, respectively.
In addition, layers with smaller loads can be fused so that one core computes multiple layers, which improves the resource utilization of the cores; the technique adopted in this process may be called a fusion technique. As shown in fig. 3, Layer0 and Layer1 are fused and mapped together to Core0.
Based on the splitting and fusing technology, the embodiment of the invention provides a neural network mapping method based on a many-core processor, which is more efficient and balanced.
As shown in fig. 4, the method provided in this embodiment may include:
step S401, obtaining a neural network to be mapped, and combining all network layers of the neural network to be mapped in sequence, and dividing the network layers into a plurality of network layer groups; all network layers in the same network layer group are sequentially connected in the neural network.
That is, after the neural network to be mapped is acquired, the first step is network grouping. All network layers of the neural network to be mapped are divided in sequence into different network layer groups, and the network layers in the same network layer group typically form a contiguous segment of the neural network in terms of connection relations. Fig. 5 shows a network grouping schematic of this embodiment. Referring to fig. 5, the neural network may include Layer0-Layer5; during network grouping, Layer0 and Layer1 may be divided into Group0, Layer2-Layer4 into Group1, and Layer5 alone into Group2. The network grouping shown in fig. 5 is only one of many possible groupings; in practical applications, all network layers of the neural network may be grouped according to different requirements, and the invention is not limited in this respect.
In step S402, each of all the network layers is split, and each network layer includes a plurality of sub-network layers.
The second step is intra-group splitting. Because the computation load of the network layers in the neural network is large, one or more layers of the neural network are split by means of intra-group splitting. Specifically, the splitting technique can be used to split each network layer group; during intra-group splitting, for example, every network layer in the same network layer group can be split into the same number of sub-network layers, i.e. the layers in the same group preferably have the same split number. Taking the example shown in fig. 5, assuming that Group0 is split into 2 parts, Layer0 and Layer1 will both be split into 2 parts. Similarly, if Group1 is split into 3 parts, then Layer2, Layer3 and Layer4 will all be split into 3 parts; if Group2 is split into 3 parts, Layer5 will be split into 3 parts. Splitting should be as uniform as possible, i.e. the partial algorithms obtained by splitting should have computation, memory and/or routing requirements that are as close as possible. Optionally, an index may be used to denote a part of the split algorithm of a network layer; for example, splitting Layer0 into 2 parts yields Layer0[0] and Layer0[1], where [0] and [1] are the indices of the sub-network layers. This process is illustrated in fig. 6.
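As one way to picture the "as uniform as possible" requirement, the sketch below divides a layer's output rows into n near-equal parts and tags each part with its index. The row-wise split dimension and the Layer[i] naming are assumptions made only for illustration.

```python
# Hypothetical sketch: split one layer's output rows as evenly as possible
# into `parts` sub-network layers, tagging each part with its index.
def split_layer(layer_name, num_rows, parts):
    base, extra = divmod(num_rows, parts)          # distribute the remainder
    sub_layers, start = [], 0
    for i in range(parts):
        rows = base + (1 if i < extra else 0)      # sizes differ by at most 1
        sub_layers.append({"name": f"{layer_name}[{i}]",
                           "rows": range(start, start + rows)})
        start += rows
    return sub_layers

# Example: splitting Layer0 (224 output rows) into 3 parts -> 75, 75, 74 rows.
for part in split_layer("Layer0", 224, 3):
    print(part["name"], len(part["rows"]))
```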
Step S403, fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor respectively.
Intra-group fusion is then performed using the fusion technique, and the plurality of sub-network layers in each network layer group can be fused to obtain a plurality of sub-network layer groups. As noted above, an index may be used to denote a part of the split algorithm of a network layer. Optionally, fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups further includes: for any network layer group, fusing the sub-network layers with the same index in that network layer group to obtain the sub-network layer group corresponding to that index. That is, within each group, sub-network layers having the same index are fused onto one core using the fusion technique. For example, referring to fig. 7, Layer0[0] and Layer1[0] are fused into Core0 and Layer0[1] and Layer1[1] are fused into Core1; Layer2[0], Layer3[0] and Layer4[0] are fused into Core2, and so forth.
During intra-group fusion, it may happen that too many layers were assigned to one group during network grouping, so that no matter how the layers are split, the fused tasks of those layers exceed the upper limit of resources that a single core can bear. In this case, the method returns to the network grouping step and regroups the layers.
Optionally, after the sub-network layers in the same network layer group are fused in step S403 to obtain a plurality of sub-network layer groups, it may also be determined whether the number of network layers included in a sub-network layer group is greater than a first preset threshold; if the number of network layers included in at least one sub-network layer group is greater than the first preset threshold, the network layer group to which that sub-network layer group belongs is split (regrouped) again, so that the number of network layers in each group is reduced. The first preset threshold can be set according to actual needs, and the invention is not limited in this respect. A small sketch of this check is given below.
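A minimal sketch of this regrouping check, under the simplifying assumption that the per-core limit can be expressed as a maximum number of fused layers (the first preset threshold); a real check would also account for memory and routing resources.

```python
# Hypothetical check: if any fused sub-network layer group contains more layers
# than one core can bear, split the offending network layer group in two and
# signal that grouping has to be redone.
def regroup_if_needed(groups, max_layers_per_core):
    new_groups, changed = [], False
    for group in groups:
        if len(group) > max_layers_per_core:          # fused group too large
            mid = len(group) // 2
            new_groups.extend([group[:mid], group[mid:]])
            changed = True
        else:
            new_groups.append(group)
    return new_groups, changed

groups = [["Layer0", "Layer1", "Layer2", "Layer3"], ["Layer4", "Layer5"]]
groups, changed = regroup_if_needed(groups, max_layers_per_core=3)
print(groups, changed)   # the first group is split into two smaller groups
```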
Intra-group fusion can greatly reduce routing between cores. Taking Group0 intra-Group fusion as an example. As shown in fig. 8, the total route of Layer0 and Layer1 before fusion can be represented by two thicker arrows and two thinner arrows. Wherein the amount of routing represented by the thick arrows is generally much greater than the amount of routing represented by the thin arrows (except for fully connected neural networks) due to the data locality of the neural network operations. After intra-group fusion, the routes represented by the thick arrows become intra-core data transfer, and the real inter-core routes are only the routes represented by the two thin arrows, so that the total route is greatly reduced.
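To make the routing argument of fig. 8 concrete, the toy count below compares inter-core traffic for Group0 before and after intra-group fusion, assuming a row-wise split of a 224-row feature map and a 3x3 convolution that only needs one halo row from the neighbouring part; these sizes are illustrative assumptions.

```python
# Toy estimate of inter-core routing for Group0 (Layer0 -> Layer1), split into
# 2 parts by feature-map rows. With a 3x3 convolution, each part needs one
# extra "halo" row from its neighbour in addition to its own rows.
rows, halo = 224, 1
own, other = rows // 2, halo

# Before fusion: Layer0[0], Layer0[1], Layer1[0], Layer1[1] sit on 4 different
# cores, so both the bulk transfer (thick arrows) and the halo transfer (thin
# arrows) cross core boundaries.
before = 2 * (own + other)

# After intra-group fusion: Layer0[i] and Layer1[i] share a core, so the bulk
# transfer stays inside the core and only the halo rows cross cores.
after = 2 * other

print(f"inter-core rows before fusion: {before}, after fusion: {after}")
# e.g. 226 rows of traffic shrink to 2 rows (the two thin arrows in fig. 8).
```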
The embodiment of the invention further includes step S404: screening out, from the plurality of cores, a first core whose resource utilization rate is lower than a preset index; fusing the sub-network layer groups corresponding to the first cores again to obtain at least one first sub-network layer group, and remapping the first sub-network layer group to a second core of the many-core processor, where there is at least one first core.
After the above steps, a plurality of mapped cores are obtained, and some "small cores" may exist among them, i.e. cores that occupy few resources or whose resource utilization rate is lower than a preset index, where the preset index can be set according to the particular many-core processor. If recombining some of these cores does not touch the bottleneck of the original mapping, i.e. the neural network under the original mapping scheme will not run more slowly and will not exceed the memory and routing limits, overall re-fusion can be performed to recombine the cores with lower resource utilization. Fig. 9 shows an example of this process.
Referring to FIG. 9, Core0 is responsible for Layer0[0] and Layer1[0], and Core5 is responsible for Layer5[0]. Because the utilization of Core0 and Core5 is low, they can be re-fused, and the fused Core0 is then responsible for Layer0[0], Layer1[0] and Layer5[0]. Similarly, Core6 is responsible for Layer5[1] and Core7 for Layer5[2]; after fusing the two, Core5 can be responsible for both Layer5[1] and Layer5[2].
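A minimal sketch of this overall re-fusion, assuming each core's utilization is summarized by a single load value and that two low-utilization cores may be merged whenever their combined load still fits on one core; the threshold and capacity values are illustrative assumptions.

```python
# Hypothetical greedy re-fusion: repeatedly merge the two least-loaded cores
# whose utilization is below `threshold`, as long as the merged load fits.
def re_fuse_small_cores(cores, threshold=0.5, capacity=1.0):
    cores = sorted(cores, key=lambda c: c["load"])
    merged = True
    while merged and len(cores) >= 2:
        merged = False
        a, b = cores[0], cores[1]
        if a["load"] < threshold and b["load"] < threshold \
                and a["load"] + b["load"] <= capacity:
            fused = {"layers": a["layers"] + b["layers"],
                     "load": a["load"] + b["load"]}
            cores = sorted([fused] + cores[2:], key=lambda c: c["load"])
            merged = True
    return cores

cores = [{"layers": ["Layer0[0]", "Layer1[0]"], "load": 0.30},
         {"layers": ["Layer5[0]"], "load": 0.25},
         {"layers": ["Layer2[0]", "Layer3[0]", "Layer4[0]"], "load": 0.95}]
print(re_fuse_small_cores(cores))   # the two small cores are fused into one
```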
Based on the above algorithm, if the number of cores required by the final mapping scheme is smaller than the total number of cores of the many-core processor, and the remaining cores have no other tasks to be allocated, an overall re-splitting strategy can be added.
The embodiment of the invention further includes step S405: judging whether there are remaining cores in the many-core processor; if there are remaining cores, obtaining at least one third core in the many-core processor whose resource consumption rate is greater than a second preset threshold; and remapping at least part of the sub-network layers mapped to the third core onto the remaining cores, while the other part remains mapped to the third core. Optionally, one half of the sub-network layers that were mapped to the third core may remain on the third core, while the other half is remapped to the remaining cores.
That is to say, after overall re-fusion, overall re-splitting is performed: the core with the largest load is repeatedly selected and split, and a one-into-two (halving) strategy can preferentially be adopted in this step, until all cores are utilized. A sketch of this strategy follows.
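A minimal sketch of the halving strategy, assuming a core's load can simply be cut in two and half of it moved to an idle core; the load values are illustrative assumptions.

```python
# Hypothetical overall re-splitting: while idle cores remain, pick the most
# loaded core and move half of its load onto an idle core (one-into-two).
def resplit_to_idle_cores(loads, total_cores):
    loads = list(loads)
    while len(loads) < total_cores:
        i = max(range(len(loads)), key=lambda k: loads[k])  # heaviest core
        half = loads[i] / 2
        loads[i] = half          # half stays on the original core
        loads.append(half)       # the other half goes to an idle core
    return loads

print(resplit_to_idle_cores([0.9, 0.5, 0.4], total_cores=5))
# -> the 0.9 core is halved first, then the next heaviest, until 5 cores are used
```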
Next, the scheme of the above embodiment will be described by taking VGG19 as an example, and the input feature map size is 224×224×3. The network structure of convolutional neural network VGG19 is shown in table 1.
TABLE 1
A Convolutional Neural Network (CNN) consists of an input layer, convolutional layers, activation functions, pooling layers and fully connected layers, i.e. Input-Conv-ReLU-Pool-Fc, plus prob (the classifier). For convenience of calculation, only the convolutional layers and the ReLU layers are considered, and each adjacent convolutional layer and ReLU layer are regarded as one layer, denoted layer_i.
Table 2 evaluates the computation amount and storage amount of each layer. The computation amount is expressed in MACs, i.e. the number of multiply-accumulate operations; for a ReLU layer, each activation-function operation on a value is counted as one MAC. The storage amount is expressed as the total number of values in the weights plus the feature maps. The unit of MACs is M and the unit of storage is K, where 1M denotes 1,000,000 and 1K denotes 1,000.
TABLE 2 calculation (MAC) and memory for each layer in VGG19
layer 0 layer 1 layer 2 layer 3 layer 4 layer 5 layer 6 layer 7
MAC(M) 89.9 1852.9 926.4 1851.3 925.6 1850.5 1850.5 1850.5
Storage (K) 151.1 3212.0 804.0 1606.8 403.7 805.1 805.1 805.1
layer 8 layer 9 layer 10 layer 11 layer 12 layer 13 layer 14 layer 15
MAC(M) 925.2 1850.1 1850.1 1850.1 462.5 462.5 462.5 462.5
Storage (K) 205.3 406.0 406.0 406.0 105.0 105.0 105.0 105.0
Assume that the memory size of each core is 4M (4000K). The above mapping uses 16 cores, and the computational utilization of each core is calculated using the following formula:

$$\mathrm{compute\_rate}_i = \frac{\mathrm{MAC}_i}{\max_j \mathrm{MAC}_j}$$

where compute_rate_i denotes the computational utilization of Core i, i indexes Core i, MAC_i denotes the computation amount of Core i, and MAC_j denotes the computation amount of Core j.

The storage utilization of each core is calculated using the following formula:

$$\mathrm{memory\_rate}_i = \frac{\mathrm{Mem}_i}{\mathrm{Mem}_{\mathrm{core}}}$$

where memory_rate_i denotes the storage utilization of Core i, Mem_i denotes the storage amount of Core i, and Mem_core denotes the memory size of a single core (4000K).
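These two measures can be checked directly against Table 2 and Table 3. The sketch below recomputes the utilization of core 0 and core 1 under the simple one-layer-per-core mapping, together with the averages, using the per-layer values of Table 2 and the 4000K per-core memory stated above.

```python
# Recompute the utilization figures of the simple mapping (one layer per core)
# from the per-layer MAC and storage values in Table 2.
MAC = [89.9, 1852.9, 926.4, 1851.3, 925.6, 1850.5, 1850.5, 1850.5,
       925.2, 1850.1, 1850.1, 1850.1, 462.5, 462.5, 462.5, 462.5]   # in M
MEM = [151.1, 3212.0, 804.0, 1606.8, 403.7, 805.1, 805.1, 805.1,
       205.3, 406.0, 406.0, 406.0, 105.0, 105.0, 105.0, 105.0]       # in K
MEM_PER_CORE = 4000.0  # K

compute_rate = [m / max(MAC) for m in MAC]
memory_rate = [m / MEM_PER_CORE for m in MEM]

print(f"core0: compute {compute_rate[0]:.2%}, memory {memory_rate[0]:.2%}")
print(f"core1: compute {compute_rate[1]:.2%}, memory {memory_rate[1]:.2%}")
print(f"average compute utilization: {sum(compute_rate)/16:.2%}")   # ~65.85%
print(f"average memory utilization:  {sum(memory_rate)/16:.2%}")    # ~16.31%
```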
The calculation and storage utilization of each core is shown in Table 3.
TABLE 3 calculated utilization and storage utilization for each core after VGG19 simple mapping
core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7
Calculated utilization 4.85% 100% 50% 99.91% 49.96% 99.87% 99.87% 99.87%
Storage utilization 3.78% 80.3% 20.1% 40.17% 10.09% 20.13% 20.13% 20.13%
core 8 core 9 core 10 core 11 core 12 core 13 core 14 core 15
Calculated utilization 49.94% 99.85% 99.85% 99.85% 24.96% 24.96% 24.96% 24.96%
Storage utilization 5.13% 10.15% 10.15% 10.15% 2.64% 2.64% 2.64% 2.64%
The average calculated utilization was 65.85% and the average storage utilization was 16.31%.
The method provided by the embodiment of the invention optimizes the mapping of the VGG19 network, and the network grouping can be as follows:
Group0 = {layer0, layer1}
Group1 = {layer2, layer3}
Group2 = {layer4, layer5, layer6, layer7}
Group3 = {layer8, layer9, layer10, layer11}
Group4 = {layer12, layer13, layer14, layer15}
The split number of Group0 is 1, that of Group1 is 2, that of Group2 is 3, that of Group3 is 3, and that of Group4 is 1. Intra-group fusion then yields 10 cores. Examining the splitting results of these 10 cores, no small cores are found that need to be further fused, so the final optimized scheme is obtained. The computational and storage utilization of each core under this scheme is shown in Table 4.
TABLE 4 Calculated utilization and storage utilization of each core after VGG19 is mapped by the proposed scheme
core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7
Calculated utilization 89.99% 64.33% 64.33% 100% 100% 100% 99.98% 99.98%
Storage utilization 84.07% 30.13% 30.13% 23.49% 23.49% 23.49% 11.86% 11.86%
core 8 core 9
Calculated utilization 99.98% 85.69%
Storage utilization 11.86% 10.50%
The average calculated utilization is 90.43% and the average storage utilization is 26.09%. Compared with the conventional scheme, the scheme provided by this embodiment of the invention uses fewer cores and greatly improves resource utilization. The optimization in this example mainly aims at improving computational utilization, so the storage utilization remains at a relatively low level. Owing to differences between splitting schemes, the split algorithm may carry a small amount of storage redundancy; this redundancy is omitted from the calculation.
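The figures above can be reproduced from Table 2 and the grouping given earlier. The sketch below assumes that each group's total computation is shared evenly by its split parts and that every fused part occupies one core; under these assumptions it yields the 10 cores and the calculated utilizations of Table 4.

```python
# Recompute the optimized VGG19 mapping from the per-layer MACs of Table 2,
# the grouping Group0..Group4 and the split numbers 1, 2, 3, 3, 1.
MAC = [89.9, 1852.9, 926.4, 1851.3, 925.6, 1850.5, 1850.5, 1850.5,
       925.2, 1850.1, 1850.1, 1850.1, 462.5, 462.5, 462.5, 462.5]   # in M
groups = [range(0, 2), range(2, 4), range(4, 8), range(8, 12), range(12, 16)]
splits = [1, 2, 3, 3, 1]

# Assumption: each group's total MACs are shared evenly by its split parts.
core_mac = []
for layer_ids, parts in zip(groups, splits):
    group_mac = sum(MAC[i] for i in layer_ids)
    core_mac.extend([group_mac / parts] * parts)

print(f"number of cores used: {len(core_mac)}")          # 10
for i, mac in enumerate(core_mac):
    print(f"core {i}: calculated utilization {mac / max(core_mac):.2%}")
print(f"average: {sum(core_mac) / (len(core_mac) * max(core_mac)):.2%}")  # ~90.4%
```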
Based on the same inventive concept, an embodiment of the present invention further provides a computing device, including a many-core processor, where the many-core processor is configured to execute a related algorithm of a neural network mapped by the mapping method of the many-core processor-based neural network described in any one of the above.
Optionally, the computing device further comprises: a storage device for storing a computer program that is loaded and executed by a processor when run in the computing device.
The embodiment of the invention provides a neural network mapping method based on a many-core processor that has wide applicability and high efficiency. For a neural network to be mapped, the steps of network grouping, intra-group splitting, intra-group fusion and overall re-fusion map each network layer of the neural network reasonably onto the cores of the many-core processor and allocate the computation, storage, routing and other resources of each core, so that the neural network runs more efficiently and the loads of the cores are more balanced than in conventional schemes. The method can in principle be applied to the current mainstream neural network algorithms, including fully connected neural networks and convolutional neural networks, and performs particularly well on convolutional neural networks. On the many-core processor side, the scheme provided by this embodiment is particularly suitable for many-core accelerator architectures specially designed for neural networks. Because the scheme is a static scheduling scheme, the overhead required for scheduling at runtime can be greatly reduced, and a neural network accelerator can devote its main computing power to the neural network algorithm itself.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
By now it should be appreciated by those skilled in the art that while a number of exemplary embodiments of the invention have been shown and described herein in detail, many other variations or modifications of the invention consistent with the principles of the invention may be directly ascertained or inferred from the present disclosure without departing from the spirit and scope of the invention. Accordingly, the scope of the present invention should be understood and deemed to cover all such other variations or modifications.

Claims (7)

1. A method of mapping a neural network based on a many-core processor, comprising:
acquiring a neural network to be mapped, and sequentially combining all network layers of the neural network to be mapped to divide the network layers into a plurality of network layer groups;
splitting each network layer in the network layers in a group, wherein each network layer comprises a plurality of sub-network layers; when the intra-group splitting is carried out, the number of sub-network layers in each network layer group is equal;
performing intra-group fusion on the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, and mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor respectively;
for any network layer group, fusing the sub-network layers with the same index in the network layer group to obtain a sub-network layer group corresponding to the index;
after fusing the sub-network layers belonging to the same network layer group according to a preset rule to obtain a plurality of sub-network layer groups, before mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor, the method further comprises:
judging whether the network layer number included in the sub-network layer group is larger than a first preset threshold value or not;
and if the network layer number included in at least one sub-network layer group is greater than the first preset threshold, re-splitting the network layer group to which the sub-network layer group with the network layer number greater than the first preset threshold belongs.
2. The method of claim 1, wherein after mapping the plurality of sub-network layer groups to the plurality of cores of the preset many-core processor, respectively, further comprises:
screening out a first kernel with the resource utilization rate lower than a preset index from the plurality of kernels, wherein the first kernel is at least one;
and fusing the sub-network layer groups corresponding to the first cores again to obtain at least one first sub-network layer group, and remapping the first sub-network layer group to a second core of the many-core processor.
3. The method of any of claims 1-2, wherein after mapping the plurality of sub-network layer groups to a plurality of cores of a preset many-core processor, respectively, further comprises:
judging whether residual kernels exist in the many-core processor;
if the residual kernels exist, at least one third kernel with the resource consumption rate larger than a second preset threshold value in the many-core processor is obtained;
at least part of the sub-network layer mapped to the third core is converted to the remaining cores.
4. The method of claim 3, wherein the transitioning at least a portion of the sub-network layers mapped to the third core to the remaining cores comprises:
and converting one half of the sub-network layer mapped to the third core to the remaining cores.
5. The method according to any of claims 1-2, wherein all network layers in the same network layer group are connected in sequence in the neural network.
6. A computing device comprising a many-core processor, characterized in that,
the many-core processor for executing the related algorithm of the neural network mapped by the mapping method of the many-core processor-based neural network of any one of claims 1-5.
7. The computing device of claim 6, wherein the computing device further comprises:
a storage device for storing a computer program that is loaded and executed by a processor when run in the computing device.
CN201910203167.0A 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor Active CN111723900B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910203167.0A CN111723900B (en) 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor
PCT/CN2020/077973 WO2020187041A1 (en) 2019-03-18 2020-03-05 Neural network mapping method employing many-core processor and computing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910203167.0A CN111723900B (en) 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor

Publications (2)

Publication Number Publication Date
CN111723900A CN111723900A (en) 2020-09-29
CN111723900B true CN111723900B (en) 2023-10-20

Family

ID=72518948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910203167.0A Active CN111723900B (en) 2019-03-18 2019-03-18 Neural network mapping method and computing device based on many-core processor

Country Status (2)

Country Link
CN (1) CN111723900B (en)
WO (1) WO2020187041A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112835718A (en) * 2021-02-10 2021-05-25 北京灵汐科技有限公司 Method and device for processing task, many-core system and computer readable medium
WO2022171002A1 (en) * 2021-02-10 2022-08-18 北京灵汐科技有限公司 Task processing method and apparatus, many-core system, and computer-readable medium
CN112884123B (en) * 2021-02-23 2024-03-01 杭州海康威视数字技术股份有限公司 Neural network optimization method and device, electronic equipment and readable storage medium
CN113485836B (en) * 2021-07-21 2024-03-19 瀚博半导体(上海)有限公司 Tensor processing method and tensor processing system based on tensor segmentation
CN114418063B (en) * 2021-12-27 2023-01-06 北京百度网讯科技有限公司 Method and device for distributing network layer in neural network model
CN115098262B (en) * 2022-06-27 2024-04-23 清华大学 Multi-neural network task processing method and device
CN116167463B (en) * 2023-04-26 2023-07-07 之江实验室 Distributed model training container scheduling method and device for intelligent computing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095110A (en) * 2014-02-18 2015-11-25 新加坡国立大学 Fusible and reconfigurable cache architecture
CN106909971A (en) * 2017-02-10 2017-06-30 华南理工大学 A kind of BP neural network parallel method towards multinuclear computing environment
WO2018154494A1 (en) * 2017-02-23 2018-08-30 Cerebras Systems Inc. Accelerated deep learning
WO2019001418A1 (en) * 2017-06-26 2019-01-03 上海寒武纪信息科技有限公司 Data sharing system and data sharing method therefor
CN109409513A (en) * 2018-10-10 2019-03-01 广州市百果园信息技术有限公司 A kind of task processing method neural network based and relevant device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10868893B2 (en) * 2017-03-31 2020-12-15 Xilinx, Inc. Network interface device
US10108850B1 (en) * 2017-04-24 2018-10-23 Intel Corporation Recognition, reidentification and security enhancements using autonomous machines
CN110738316B (en) * 2018-07-20 2024-05-14 北京三星通信技术研究有限公司 Operation method and device based on neural network and electronic equipment
CN110515732B (en) * 2019-08-23 2021-06-18 中国人民解放军国防科技大学 Task allocation method based on deep learning inference of resource-constrained robot

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095110A (en) * 2014-02-18 2015-11-25 新加坡国立大学 Fusible and reconfigurable cache architecture
CN106909971A (en) * 2017-02-10 2017-06-30 华南理工大学 A kind of BP neural network parallel method towards multinuclear computing environment
WO2018154494A1 (en) * 2017-02-23 2018-08-30 Cerebras Systems Inc. Accelerated deep learning
WO2019001418A1 (en) * 2017-06-26 2019-01-03 上海寒武纪信息科技有限公司 Data sharing system and data sharing method therefor
CN109409513A (en) * 2018-10-10 2019-03-01 广州市百果园信息技术有限公司 A kind of task processing method neural network based and relevant device

Also Published As

Publication number Publication date
CN111723900A (en) 2020-09-29
WO2020187041A1 (en) 2020-09-24

Similar Documents

Publication Publication Date Title
CN111723900B (en) Neural network mapping method and computing device based on many-core processor
CN110490309B (en) Operator fusion method for neural network and related product thereof
WO2017156968A1 (en) Neural network computing method, system and device therefor
CN111178519B (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN107169560B (en) Self-adaptive reconfigurable deep convolutional neural network computing method and device
EP2472398B1 (en) Memory-aware scheduling for NUMA architectures
CN111199275B (en) System on chip for neural network
WO2017173754A1 (en) Method and device for on-chip repetitive addressing
CN105468439B (en) The self-adaptive parallel method of neighbours in radii fixus is traversed under CPU-GPU isomery frame
CN105471985A (en) Load balance method, cloud platform computing method and cloud platform
WO2020134703A1 (en) Neural network system-based image processing method and neural network system
CN110990154B (en) Big data application optimization method, device and storage medium
KR20210148586A (en) Scheduler, method for operating the same and accelerator system including the same
CN114356587A (en) Calculation power task cross-region scheduling method, system and equipment
KR20210108749A (en) Accelerator, method for operating the same and accelerator system including the same
Chen et al. Towards efficient allocation of graph convolutional networks on hybrid computation-in-memory architecture
CN111401543A (en) Neural network accelerator with full on-chip storage and implementation method thereof
CN104104621A (en) Dynamic adaptive adjustment method of virtual network resources based on nonlinear dimensionality reduction
CN108170861B (en) Distributed database system collaborative optimization method based on dynamic programming
CN110167031B (en) Resource allocation method, equipment and storage medium for centralized base station
CN104331336B (en) Be matched with the multilayer nest balancing method of loads of high-performance computer structure
CN115668222A (en) Data processing method and device of neural network
CN116304212A (en) Data processing system, method, equipment and storage medium
CN114064294B (en) Dynamic resource allocation method and system in mobile edge computing environment
CN106844037B (en) KNL-based test method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant