CN114691142A - Compiling method of execution program, chip, electronic device, and computer-readable storage medium - Google Patents



Publication number
CN114691142A
Authority
CN
China
Prior art keywords
processing cores
program
subprogram
grouping
executing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011623138.9A
Other languages
Chinese (zh)
Inventor
Inventor not announced
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Simm Computing Technology Co ltd
Original Assignee
Beijing Simm Computing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Simm Computing Technology Co ltd filed Critical Beijing Simm Computing Technology Co ltd
Priority to CN202011623138.9A priority Critical patent/CN114691142A/en
Publication of CN114691142A publication Critical patent/CN114691142A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The embodiments of the present disclosure disclose a compiling method for an execution program, a chip, an electronic device, and a computer-readable storage medium. The compiling method for the execution program comprises the following steps: acquiring the total data amount of each subprogram in the original program; determining the number of processing cores required for executing each subprogram according to the size of the data storage area of a plurality of processing cores and the total data amount of each subprogram; grouping the subprograms according to the number of processing cores required for executing each subprogram; determining the positions of synchronization points according to the size of the parameters of each subprogram; and compiling the original program into the execution program according to the grouping and the positions of the synchronization points. The method groups the subprograms and generates the execution program according to the total amount of input and output data of the original program, solving the technical problem in the prior art that an execution program needs to frequently access an external memory during execution.

Description

Compiling method of execution program, chip, electronic device, and computer-readable storage medium
Technical Field
The present disclosure relates to the field of program compilation and processors, and more particularly, to a compiling method for an execution program, a chip, an electronic device, and a computer-readable storage medium.
Background
With the development of science and technology, human society is rapidly entering the intelligent era. Important characteristics of the intelligent era are that people obtain ever more data, the amount of data obtained grows ever larger, and the required speed of processing that data grows ever higher. Chips are the foundation of data processing and fundamentally determine people's ability to process data. In terms of application fields, chips mainly follow two routes: one is the general-purpose chip route, such as the CPU (Central Processing Unit), which offers great flexibility but lower computational efficiency when processing domain-specific algorithms; the other is the special-purpose chip route, such as the TPU (Tensor Processing Unit), which can deliver higher effective computing power in certain specific fields but has poorer, or even no, processing capability in the more flexible and general-purpose fields. Because the data of the intelligent era is varied and enormous in quantity, chips are required to have extremely high flexibility, able to process algorithms of different fields and of different kinds, and also extremely high processing power, able to rapidly process extremely large and sharply growing volumes of data.
When a multi-core (many-core) CPU or GPU performs neural network task processing, there are generally two processing methods:
The first is that each processing core processes its own task independently, and the cores do not affect one another, as shown in Fig. 1a; the second is that some or all of the processing cores process one task in parallel, each completing a portion of the task, as shown in Fig. 1b.
In both methods, the compiler compiles a suitable program according to the structure of the neural network and the characteristics of the traditional multi-core (many-core) CPU or GPU to compute the neural network. The Cache in each processing core is transparent to the program and cannot be accessed directly and independently; from the processing core's perspective, all data reads and writes during computation are performed via the access addresses of a DDR (Double Data Rate) memory. Neural network computation generates a large amount of intermediate data, and most of the intermediate data of each layer has no reuse relationship. Because a Cache relies on spatial locality and temporal locality, in neural network computation the Cache hit rate easily drops and the DDR memory is accessed frequently, which reduces the computation speed of the neural network and increases power consumption.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In order to solve the technical problem in the prior art that an execution program needs to frequently access an external memory during execution, embodiments of the present disclosure provide the following technical solutions:
in a first aspect, an embodiment of the present disclosure provides a compiling method for an execution program, including:
acquiring the total data amount of each subprogram in the original program;
determining the number of processing cores required for executing each subprogram according to the size of a data storage area of a plurality of processing cores and the total data amount of each subprogram;
grouping the subprograms according to the number of processing cores required for executing each subprogram, and allocating a corresponding number of processing cores to each group;
determining the position of a synchronization point according to the size of the parameter of each subprogram and the size of a parameter storage area of the processing core;
compiling the original program into an executive program according to the grouping and the position of the synchronization point.
Further, grouping the subroutines according to the number of processing cores required for executing each subroutine, and allocating a corresponding number of processing cores to the groups, includes:
determining the grouping of the subprograms according to the number of processing cores required by executing each subprogram and the total number of the processing cores;
allocating a corresponding number of processing cores to each group according to the number of processing cores required for executing the subprograms in each group and the total number of processing cores; wherein the subprograms in the same group are allocated the same number of processing cores.
Further, the original program includes a plurality of subroutines, the subroutines are sequentially executed, and determining the grouping of the subroutines according to the number of processing cores required for executing each subroutine and the total number of the processing cores includes:
sequentially acquiring the number of processing cores required for executing each subprogram;
calculating a value that divides the total number of processing cores evenly and is not less than the number of processing cores required to execute the subroutine, as the first value of the subroutine; wherein each subroutine corresponds to one first value;
and sequentially determining the subprograms corresponding to consecutively identical first values as one subprogram group.
Further, the original program includes a plurality of subroutines that are executed in sequence, and determining the grouping of the subroutines according to the number of processing cores required for executing each subroutine and the total number of processing cores includes:
sequentially acquiring the number Ni of processing cores required for executing each subprogram, where i denotes the sequence number of the subprogram;
obtaining the maximum value Nmax1 among the current Ni;
obtaining the largest sequence number j1 among the subprograms corresponding to Nmax1;
determining subprogram j1 and the subprograms before j1 as the group corresponding to j1.
Further, allocating a corresponding number of processing cores to each group according to the number of processing cores required for executing the subprograms in each group and the total number of processing cores includes:
calculating a value that divides the total number M of processing cores evenly and is not less than Nmax1, as the number Ng1 of processing cores of the group corresponding to j1.
Further, the method further comprises:
taking the subprograms after j1 as all the subprograms and Ng1 as the total number of processing cores, and continuing to perform the steps of determining the groups and the number of processing cores corresponding to the groups, to obtain the group corresponding to j2 and the number Ng2 of processing cores of that group;
if Ng2 is equal to Ng1, merging the group corresponding to j1 and the group corresponding to j2 into one group;
if Ng2 is not equal to Ng1, treating the group corresponding to j2 as a new group.
Further, the method further comprises:
the steps of determining the number of groups and processing cores to which the groups correspond are continued with the subroutines after j2 as all subroutines and Ng2 as the total number of processing cores until all subroutines have corresponding groups.
Further, the method further comprises:
determining the number of loop executions of the packet.
Further, the compiling the original program into an execution program according to the grouping and the location of the synchronization point includes:
compiling the original program into an execution program according to the grouping, the circulation execution times of the grouping and the position of the synchronization point.
In a second aspect, an embodiment of the present disclosure provides a chip, including:
a plurality of processing cores and a synchronization signal generator; wherein each processing core comprises a data storage area and a parameter storage area;
the processing cores in each group are used for executing a plurality of program segments corresponding to the grouping of the subprograms in the execution program; wherein the data storage area is used for storing input data and output data of the plurality of program segments, and the parameter storage area is used for storing parameters of the plurality of program segments;
and the synchronous signal generator is used for sending a synchronous signal to all the processing cores when all the processing cores executing the program segment finish executing.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: a memory for storing computer-readable instructions; and one or more processors for executing the computer-readable instructions, such that the processors, when running, implement the compiling method for an execution program described in the first aspect.
In a fourth aspect, the disclosed embodiments provide a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions for causing a computer to execute the compiling method for an execution program according to any one of the first aspect.
In a fifth aspect, an embodiment of the present disclosure provides a computer program product comprising computer instructions which, when executed by a computing device, cause the computing device to perform the compiling method for an execution program according to any one of the first aspect.
In a sixth aspect, the present disclosure provides a computing device, including the chip in any one of the second aspects.
The embodiments of the present disclosure disclose a compiling method for an execution program, a chip, an electronic device, and a computer-readable storage medium. The compiling method for the execution program comprises the following steps: acquiring the total data amount of each subprogram in the original program; determining the number of processing cores required for executing each subprogram according to the size of the data storage area of a plurality of processing cores and the total data amount of each subprogram; grouping the subprograms according to the number of processing cores required for executing each subprogram, and allocating a corresponding number of processing cores to the groups; determining the positions of synchronization points according to the size of the parameters of each subprogram and the size of the parameter storage area of the processing core; and compiling the original program into the execution program according to the grouping and the positions of the synchronization points. The method groups the subprograms and generates the execution program according to the total amount of input and output data of the original program, solving the technical problem in the prior art that an execution program needs to frequently access an external memory during execution.
The foregoing is a summary of the present disclosure. The present disclosure may be embodied in other specific forms without departing from its spirit or essential attributes.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIGS. 1a and 1b are schematic diagrams of the prior art;
fig. 2 is a flowchart illustrating a compiling method for executing a program according to an embodiment of the disclosure;
FIG. 3 is a flowchart illustrating a compiling method for executing a program according to an embodiment of the disclosure;
FIG. 4 is a flowchart illustrating a compiling method for executing a program according to an embodiment of the disclosure;
FIG. 5 is a flowchart illustrating a compiling method for executing a program according to an embodiment of the disclosure;
FIG. 6 is a flowchart illustrating a compiling method for executing a program according to an embodiment of the disclosure;
fig. 7a is an exemplary diagram of a chip provided by an embodiment of the disclosure;
FIG. 7b is a schematic diagram of a neural network to be compiled in an embodiment of the present disclosure;
Figs. 8a-8c are process diagrams illustrating the execution of grouped subprograms by processing cores according to embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a" or "an" in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will appreciate that references to "one or more" are intended to be exemplary and not limiting unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 2 is a flowchart illustrating a compiling method of an execution program according to an embodiment of the disclosure. The compiling method of the executive program is used for a system comprising a plurality of processing cores, and the processing cores comprise storage areas used for storing relevant data of the executive program. The data related to the execution program includes input data, output data, program instruction data, parameter data, and the like of the execution program, and correspondingly, the storage area includes a data storage area for storing the input data and the output data of the execution program, a program storage area for storing the program instruction data, and a parameter storage area for storing the parameter data.
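As an illustrative sketch only (the field names and sizes below are assumptions, not from the patent), the per-core storage layout described above can be modeled as a simple record:

```python
from dataclasses import dataclass

@dataclass
class CoreStorage:
    """On-chip storage areas of one processing core (sizes illustrative)."""
    data_bytes: int     # data storage area: input and output data of the program
    program_bytes: int  # program storage area: program instruction data
    param_bytes: int    # parameter storage area: parameter data (e.g. weights)

# A hypothetical core with 2 MB of data storage and 512 KB of parameter storage.
core = CoreStorage(data_bytes=2 * 1024 * 1024,
                   program_bytes=256 * 1024,
                   param_bytes=512 * 1024)
```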
Wherein the method comprises the following steps:
in step S201, the total amount of data of each subroutine in the original program is acquired.
Wherein the total amount of data of the subroutine comprises a sum of a size of input data and a size of output data of the subroutine.
Illustratively, the original program is a neural network, and each sub-program is a layer of sub-network in the neural network. In this alternative embodiment, the original program includes a plurality of subroutines; when executed, the original program runs sequentially in the order of the subroutines, and the output data of a subroutine is either the input data of the next subroutine or the output data of the original program.
When the original program is a neural network, the step S201 includes:
analyzing the size of the input data and the size of the output data of each layer in the neural network; and generating the total data amount of each layer in the neural network, wherein the total data amount is the sum of the size of the input data and the size of the output data. A neural network can generally be represented as a graph, and each layer's description includes the size of its input data and the size of its output data, such as the dimensions of the input and output data. The total data amount of each layer of sub-network can therefore be derived by analyzing the graph representing the neural network.
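This layer-by-layer analysis can be sketched roughly as follows; the dictionary-based layer description is an assumption for illustration, since the patent only requires that each layer's input and output sizes be recoverable from the graph:

```python
def total_data_per_layer(layers):
    """Total data amount of each layer = input size + output size."""
    return [layer["input_bytes"] + layer["output_bytes"] for layer in layers]

# Two-layer example; the sizes are made up for illustration.
network = [
    {"name": "conv1", "input_bytes": 150_528, "output_bytes": 802_816},
    {"name": "pool1", "input_bytes": 802_816, "output_bytes": 200_704},
]
print(total_data_per_layer(network))  # [953344, 1003520]
```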
It is to be understood that the original program may be various other types of programs, and a subprogram of the original program may be a program module divided according to the functional modules of the original program, or a program segment divided according to the execution order of the program, each segment obtaining its output data from its input data.
Since the original program may produce intermediate data during execution, such as the output data of the above subprograms, and accessing a storage area outside the processing core for this intermediate data should be avoided, the total data amount of each subprogram of the original program is first calculated in this step.
Returning to fig. 2, the compiling method of the executive program further includes:
step S202, determining the number of processing cores required to execute each sub-program according to the size of the data storage area of the plurality of processing cores and the total data amount of each sub-program.
After the total data amount of each subroutine of the original program is obtained, in step S202 the sizes of the data storage areas of the plurality of processing cores are obtained, and the number of processing cores required for executing the original program is calculated according to the total data amount of the subroutines and the sizes of the data storage areas. The total size of the data storage areas of the processing cores executing the original program is not less than the total data amount of the subprograms, so that no storage area outside the processing cores needs to be accessed during computation.
Optionally, the step S202 includes:
step S301, calculating the quotient of the total data amount of each subprogram and the size of the data storage area of the processing core;
step S302, using the rounded up value of each quotient as the number of processing cores required for executing each subroutine.
In step S301, assuming that the size of the data storage area of each processing core is the same, a plurality of quotients are obtained by dividing the total data amount of each subroutine by the size of the data storage area; these quotients may or may not be integers. Therefore, in step S302, the rounded-up values of the quotients are used as the numbers of processing cores required for executing the subroutines. It is understood that, in the result obtained in step S302, each subroutine corresponds to the number of processing cores required to execute it. When the sizes of the data storage areas of the processing cores differ, the minimum data storage area size among the processing cores can be used as the size when calculating the quotient, so that every processing core can store the input data and output data of each subroutine without using an external memory.
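Steps S301-S302 amount to a ceiling division, using the smallest data storage area when core sizes differ. A minimal sketch under those assumptions:

```python
import math

def cores_required(total_data_bytes, data_store_sizes):
    """Number of processing cores needed for one subprogram (steps S301-S302):
    ceil(total data amount / data storage size), taking the smallest data
    storage area when the cores are heterogeneous."""
    store = min(data_store_sizes)
    return math.ceil(total_data_bytes / store)

# 5 units of data, each core stores 2 units -> 3 cores.
assert cores_required(5, [2, 2, 2]) == 3
# Heterogeneous cores: the 1-unit core bounds the calculation -> 5 cores.
assert cores_required(5, [1, 2]) == 5
```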
Returning to fig. 2, the compiling method of the executive program further includes:
step S203, grouping the subroutines according to the number of processing cores required for executing each subroutine, and allocating a corresponding number of processing cores to the grouping.
In the disclosed embodiments, the subroutines are grouped such that subroutines in the same group use the same number of processing cores when executed.
Optionally, the step S203 includes:
step S401, determining the grouping of the subprograms according to the number of the processing cores needed for executing each subprogram and the total number of the processing cores;
step S402, distributing a corresponding number of processing cores to each group according to the number of the processing cores needed for executing the subprogram in each group and the total number of the processing cores; wherein, the number of the processing cores allocated to the same group is the same.
In the above step S401 and step S402, it is first determined, according to the number of processing cores required for executing each subroutine and the total number of processing cores, how many processing cores each subroutine will actually use given that total; then, subroutines requiring the same number of processing cores are divided into one subroutine group, and the required number of processing cores is allocated to each subroutine group.
The plurality of sub-programs are sequentially executed; for example, the plurality of sub-programs are the multilayer sub-networks of a neural network. Optionally, in an embodiment, the step S401 includes:
step S501, sequentially acquiring the number of processing cores required for executing each subprogram;
step S502, calculating a value that divides the total number of processing cores evenly and is not less than the number of processing cores required to execute the subroutine, as the first value of the subroutine; wherein each subroutine corresponds to one first value;
step S503, sequentially determining the sub-programs corresponding to consecutively identical first values as one sub-program group.
In step S501, the number of processing cores required for executing each subroutine is obtained sequentially in the order of the subroutines, denoted N1, N2, …, Nl, where Ni denotes the number of processing cores required to execute the i-th subroutine, i is an integer from 1 to l, and l is the number of subroutines.
In the above step S502, a value that divides the total number of processing cores evenly and is not less than the number of processing cores required to execute the subroutine is calculated as the first value of the subroutine; that is, the first value divides the total number of processing cores evenly and is not less than Ni. Illustratively, when the total number of processing cores M0 is 9 and N1 is 4: 4 does not divide 9 evenly, so 4 is increased by 1 to obtain 5; 5 does not divide 9 evenly, so 5 is increased by 1; and so on until 9 is reached, which does divide 9 evenly, so 9 is taken as the first value corresponding to the first subroutine. In this step, each subroutine obtains its corresponding first value.
In step S503, the sub-programs corresponding to consecutively identical first values are determined as one sub-program group. Specifically, for example, the original program contains 5 subroutines executed in sequence, whose corresponding first values are 4, 3, 3, 3, 1. The 1st subprogram alone constitutes the first subprogram group, the 2nd, 3rd and 4th subprograms constitute the second subprogram group, and the 5th subprogram alone constitutes the third subprogram group.
When or after the subprogram groups are obtained, the number of processing cores corresponding to each subprogram group can be determined from the first value; in this embodiment, the first value is the number of processing cores corresponding to each subprogram group. Taking M0 = 9 and N1 = 4 as an example: although only 4 processing cores are needed to execute the first subroutine, the number allocated must divide the total number of processing cores evenly, so all 9 processing cores are allocated to the first subroutine for execution.
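Steps S501-S503 can be sketched as follows (an illustrative reading, with the incremental search mirroring the M0 = 9, N1 = 4 example above; the function names are assumptions):

```python
def first_value(n_required, total_cores):
    """Smallest value that divides total_cores evenly and is >= n_required
    (step S502), found by incrementing as in the M0 = 9, N1 = 4 example."""
    v = n_required
    while total_cores % v != 0:
        v += 1
    return v

def group_subprograms(core_counts, total_cores):
    """Group consecutive subprograms whose first values are equal (step S503).
    Returns (first value, subprogram indices) per group; the first value is
    also the number of cores allocated to the group."""
    firsts = [first_value(n, total_cores) for n in core_counts]
    groups = []
    for i, f in enumerate(firsts):
        if groups and groups[-1][0] == f:
            groups[-1][1].append(i)  # extend the current run of equal values
        else:
            groups.append((f, [i]))  # start a new group
    return groups

assert first_value(4, 9) == 9  # the example from the text
# Five subprograms on 12 cores, first values 4, 3, 3, 3, 1 -> three groups.
print(group_subprograms([4, 3, 3, 3, 1], 12))
# [(4, [0]), (3, [1, 2, 3]), (1, [4])]
```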
Optionally, in another embodiment, the step S401 includes:
sequentially acquiring the number Ni of processing cores required for executing each subprogram, where i denotes the sequence number of the subprogram;
obtaining the maximum value Nmax1 among the current Ni;
obtaining the largest sequence number j1 among the subprograms corresponding to Nmax1;
determining subprogram j1 and the subprograms before j1 as the group corresponding to j1.
Since a subroutine that requires fewer processing cores can also be executed by more processing cores than it requires, in this embodiment the maximum value Nmax1 among the current Ni is obtained first, together with the largest sequence number j1 among the subprograms corresponding to Nmax1; subprogram j1 and the subprograms before it can then all be executed using the same number of processing cores, and may therefore be divided into the same subroutine group.
In a further embodiment, the step S402 includes:
calculating a value that divides the total number M0 of processing cores evenly and is not less than Nmax1, as the number Ng1 of processing cores of the group corresponding to j1.
The calculation process is the same as the process of calculating the first value of a sub-program in step S502 and is not repeated here. In this embodiment, the calculated Ng1 is used as the number of processing cores corresponding to the entire subroutine group.
Through the above steps, not all of the subroutines may yet be grouped, so in this embodiment, the method further includes:
taking the subprograms after j1 as all the subprograms and Ng1 as the total number of processing cores, and continuing to perform the steps of determining the groups and the number of processing cores corresponding to the groups, to obtain the group corresponding to j2 and the number Ng2 of processing cores of that group;
if Ng2 is equal to Ng1, merging the group corresponding to j1 and the group corresponding to j2 into one group;
if Ng2 is not equal to Ng1, treating the group corresponding to j2 as a new group.
In the above step, the subprograms after j1 are taken as all the subprograms and Ng1 as the total number of processing cores, and the steps of determining the subroutine groups and the number of processing cores corresponding to each group are performed again, thereby obtaining the group corresponding to j2 and the number Ng2 of processing cores of that group.
It is then further determined whether Ng2 and Ng1 are equal; if they are equal, the group corresponding to j1 and the group corresponding to j2 use the same number of processing cores and can be merged into one subroutine group. If Ng2 is not equal to Ng1, the group corresponding to j2 is treated as a new group.
Then, to complete the grouping of all subroutines, the subprograms after j2 are taken as all the subprograms and Ng2 as the total number of processing cores, and the above steps of determining the groups and the number of processing cores corresponding to the groups are continued until all subroutines have corresponding groups.
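A sketch of this variant, under the stated reading: the last occurrence of the current maximum closes a group; the allocated count is the smallest value dividing the remaining core budget evenly; adjacent groups with equal allocations merge. The function name and list-based interface are assumptions; the sketch also assumes the initial total of cores is at least the largest requirement.

```python
def group_by_maximum(core_counts, total_cores):
    """Iterative grouping per the Nmax1/j1 variant.  Returns a list of
    (subprogram indices, allocated cores); equal adjacent allocations are
    merged, and each allocation becomes the next iteration's core budget.
    Assumes total_cores >= max(core_counts)."""
    groups, start, budget = [], 0, total_cores
    while start < len(core_counts):
        tail = core_counts[start:]
        n_max = max(tail)                                    # Nmax1
        j = start + len(tail) - 1 - tail[::-1].index(n_max)  # largest j with Nj == Nmax1
        ng = n_max
        while budget % ng != 0:   # smallest value dividing the budget evenly
            ng += 1               # that is not less than Nmax1
        indices = list(range(start, j + 1))
        if groups and groups[-1][1] == ng:                   # Ng2 == Ng1 -> merge
            groups[-1] = (groups[-1][0] + indices, ng)
        else:
            groups.append((indices, ng))
        budget, start = ng, j + 1
    return groups

print(group_by_maximum([4, 2, 3, 1], 9))
# [([0], 9), ([1, 2], 3), ([3], 1)]
```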
Returning to fig. 2, the compiling method of the executive program further includes:
step S204, determining the position of the synchronization point according to the size of the parameter of each subroutine and the size of the parameter storage area of the processing core.
The instructions of the subprogram between the two synchronization points are program instructions which need to be executed by the processing core in one synchronization cycle.
The parameter storage area of the processing core is used for storing parameters of the subprograms in the original program, such as the convolution kernel size, weight values, stride and other parameters of each layer of sub-network in a convolutional neural network. If the parameter storage area in the processing core can store all the parameters of the next layer of sub-network, the processing core can complete the computation of one layer of sub-network within one synchronization cycle without reading parameters from storage outside the processing core, and the synchronization point can be set at the end of each sub-network's program instructions. In some cases the parameter storage area of the processing core is relatively small and cannot store all the parameters of one layer of sub-network; a synchronization point then needs to be inserted inside the sub-network, at a position up to which execution can proceed using only the parameters already stored in the parameter storage area.
Further, the step S204 includes:
step S601, determining the number of parameters of the subprogram that can be stored in the parameter storage area of the processing core according to the size of the parameter storage area of the processing core and the size of the parameter of the subprogram;
step S602, determining the position of the synchronization point of the subroutine according to the number of parameters of the subroutine that can be stored in the parameter storage area of the processing core.
In this embodiment, the number of parameters that can be stored in the parameter storage area is determined from the size of the subprogram's parameters and the size of the parameter storage area, and the positions of the synchronization points are determined based on that number. For example, if the subprogram's parameters are 50KB in size and the parameter storage area is 25KB, one synchronization point needs to be inserted at the midpoint of the subprogram. This determination of synchronization point positions is executed for each subprogram, so as to obtain the positions of all synchronization points of the original program.
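The midpoint example above follows from how many parameter-storage-sized pieces a subprogram's parameters occupy. A minimal sketch (the function name is illustrative, not from the patent):

```python
import math

def synchronization_points(param_size_kb, param_store_kb):
    """Synchronization points needed *inside* a subprogram so that each
    synchronization cycle only uses parameters that fit in the processing
    core's parameter storage area."""
    # the parameters occupy ceil(param/store) pieces; one synchronization
    # point separates each pair of adjacent pieces
    pieces = math.ceil(param_size_kb / param_store_kb)
    return max(pieces - 1, 0)
```

With 50KB of parameters and a 25KB parameter storage area this gives one synchronization point, at the midpoint of the subprogram, matching the example above.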
Further, after the positions of the synchronization points are obtained, the step S204 further includes: adding a synchronization instruction at the position of each synchronization point, wherein the synchronization instructions cause the system comprising the plurality of processing cores to generate a synchronization signal. That is, after a processing core executes the program instructions between two synchronization points, it continues by executing the synchronization instruction. Optionally, after executing the synchronization instruction, a processing core may generate a synchronization request signal to request that the system comprising the multiple processing cores generate a synchronization signal. The system includes a synchronization signal generator which, after receiving the synchronization request signals sent by every processing core participating in program execution, generates a synchronization signal that lets the multiple processing cores enter the next synchronization cycle and execute subsequent program instructions.
Returning to fig. 2, the compiling method of the executive program further includes:
step S205, compiling the original program into an execution program according to the grouping and the location of the synchronization point.
The execution program comprises a plurality of program segments generated with the positions of the synchronization points as boundary points. Each program segment comprises instructions from the original program and the control instructions the processing core requires to execute them; the control instructions include the synchronization instructions between program segments and task grouping instructions generated according to the grouping. A task grouping instruction divides the processing cores into at least one group so as to execute the group of program segments specified in it, the group of program segments corresponding to the subprograms of one subprogram grouping. As described above, synchronization points may be inserted inside a subprogram and the program segments are delimited by the synchronization points, so the instructions contained in one program segment may be some or all of the instructions of one subprogram, or the instructions of a plurality of subprograms.
The control instructions are used by the processing core to read, in each synchronization cycle, the parameters and/or next program segment required by the next synchronization cycle. The end of a program segment also includes the synchronization instruction described above for generating a synchronization request signal. The number of processing cores is used for generating allocation information and/or grouping information in the execution program, which is used to group the plurality of processing cores and to allocate program segments, parameters and input data when the system comprising the plurality of processing cores executes the execution program.
By the compiling method of the executive program, the original program is compiled into the executive program suitable for being executed by the multi-processing core system, wherein the basis of compiling the executive program is the total input and output data amount and the parameter size of the original program, so that the fit between the executive program and the multi-processing core system is enhanced, and the effective computing power of the multi-processing core system is improved; in addition, intermediate data generated by the original program all move in the multi-processing core system without exchanging with an external memory, so that the time delay is reduced, the pressure on the bandwidth of the external memory is reduced, and the power consumption of the whole multi-processing core system is also reduced.
Further, to avoid wasting computing power in the calculation process: for example, if the first subprogram grouping corresponds to 8 processing cores and the second subprogram grouping corresponds to 4 processing cores, the data output by the first grouping can only feed the 4 processing cores of one second grouping, leaving the other 4 processing cores idle. In this optional embodiment, after step S203, the method for compiling an execution program further includes:
step S206, determining the number of times of loop execution of the packet.
Wherein the step S206 includes: calculating, as the number of loop executions of a subprogram grouping, the product of the grouping numbers of the processing cores in all the subprogram groupings after that subprogram grouping.
The grouping number of the processing cores in a subprogram grouping is denoted Ak, where k denotes the index of the subprogram grouping; then Ak = M(k-1)/Ngk, and the number of loop executions of the k-th subprogram grouping is Ck = A(k+1)*A(k+2)*…*Alast, where last denotes the index of the last subprogram grouping; 0 ≤ k ≤ last, and Clast = 1; M0 denotes the number of all processing cores, and M(k-1) = Ng(k-1).
In actual execution, if full utilization of the computing power by the subprogram groupings is not a concern, the number of loop executions may be less than the value Ck; this is not described in detail herein.
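The formula Ck = A(k+1)*A(k+2)*…*Alast with Clast = 1 can be evaluated in one reverse pass. A minimal sketch (the function name is illustrative):

```python
def loop_counts(a):
    """Given the grouping numbers A_1..A_last (in order), return the loop
    execution counts C_1..C_last, where C_k is the product of all A values
    AFTER position k, and C_last = 1."""
    counts = []
    prod = 1
    for a_k in reversed(a):
        counts.append(prod)  # C_k excludes the grouping's own A_k
        prod *= a_k
    return list(reversed(counts))
```

For the grouping numbers A = (1, 2, 2) of the worked example later in this document, this yields C = (4, 2, 1), matching Table 4.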
In this embodiment, the step S205 further includes:
compiling the original program into an execution program according to the grouping, the circulation execution times of the grouping and the position of the synchronization point.
In this embodiment, information on the number of loop executions of the groupings is added on the basis of the above step S205. Thus, when the original program is actually executed, an earlier-numbered grouping can be executed in a loop multiple times with multiple sets of input data, producing the input data for multiple tasks that later-numbered groupings execute in parallel, so that the computing power of the processing cores is utilized more fully. Illustratively, the number of loop executions of a grouping is used to generate control instructions in the execution program, so that after each grouping completes its current task, it determines, according to its number of loop executions, whether to continue acquiring the input data of a next task.
Further, in or after step S202, once the number of processing cores required to execute each subprogram has been calculated, the amount of input data for each processing core may be further calculated; that is, the input data of each subprogram is distributed evenly among the processing cores in a group. Specifically, the input data may be divided evenly into Ni shares. However, there may be cases where input data overlaps between processing cores; it is then necessary to adjust the calculated number Ni of processing cores required to execute each subprogram so that Ni satisfies: M0 ≥ Dm + Din/Ni, where M0 is the size of the data storage area of a processing core and Dm is the incremental data generated by the overlapping data portion of each processing core, so that the input data to be distributed to each processing core is Dm + Din/Ni. After the subprograms are grouped, the number of processing cores corresponding to a grouping is not less than the number of processing cores required by each subprogram in the grouping, so the above condition is also satisfied when the number of processing cores corresponding to the subprogram grouping is used as Ni. In this way, for one task, the size of the input data that each processing core needs to receive and the way the input data of each subprogram is divided can be determined in advance.
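The adjustment of Ni for overlapping input data can be sketched as follows. This is an illustrative sketch: the function names are hypothetical, and it assumes the overlap increment Dm alone fits in the data storage area M0 (otherwise no core count would satisfy the condition).

```python
def per_core_input(d_in_kb, n_i, d_m_kb=0.0):
    """Input data each processing core must receive: its even share of the
    total input D_in plus the overlap increment D_m (if shares overlap)."""
    return d_m_kb + d_in_kb / n_i

def adjust_core_count(d_in_kb, n_i, m0_kb, d_m_kb=0.0):
    """Increase N_i until M_0 >= D_m + D_in / N_i holds."""
    assert d_m_kb < m0_kb, "overlap alone must fit in the data storage area"
    n = n_i
    while per_core_input(d_in_kb, n, d_m_kb) > m0_kb:
        n += 1
    return n
```

For instance, splitting a 2200KB input over 2 cores with a 50KB overlap exceeds a 1000KB data storage area, so the sketch bumps the count to 3 cores.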
The above embodiment discloses a compiling method of an executive program, which includes: acquiring the total data amount of each subprogram in the original program; determining the number of processing cores required for executing each subprogram according to the size of a data storage area of a plurality of processing cores and the total data amount of each subprogram; grouping the subprograms according to the number of processing cores required for executing each subprogram, and distributing the processing cores with corresponding number to the groups; determining the position of a synchronization point according to the size of the parameter of each subprogram and the size of a parameter storage area of the processing core; compiling the original program into an executive program according to the grouping and the position of the synchronization point. The method compiles and generates the executive program according to the total amount of the input and output data of the original program to group the subprogram and generate the executive program, and solves the technical problem that the executive program in the prior art needs to frequently access an external memory during execution.
Fig. 7a is an example of a schematic structural diagram of a system including multiple processing cores according to an embodiment of the present disclosure. As shown in fig. 7a, in this example, the system including multiple processing cores is a chip, and the chip 700 includes:
a plurality of processing cores 701 and a synchronization signal generator 702; wherein each processing core includes a data storage area 703 and a parameter storage area;
the processing cores 701 are used for grouping according to sub-program grouping information in an execution program, wherein the processing cores in each group are used for executing a plurality of program segments corresponding to the sub-program grouping in the execution program; wherein the data storage area is used for storing input data and output data of the plurality of program segments, and the parameter storage area is used for storing parameters of the plurality of program segments;
the synchronization signal generator 702 is configured to send a synchronization signal to all processing cores when all processing cores that execute the program segment are completely executed.
Further, the chip 700 may further include a shared memory area 704 for storing output data output by the plurality of processing cores executing each subroutine group.
The compiling process of the execution program will be described below by way of example with the structure of the chip shown in fig. 7 a. As shown in fig. 7a, the chip 700 includes 4 processing cores, C1, C2, C3 and C4. Each processing core includes a 1MB data storage area, and each processing core further includes a parameter storage area and a program storage area (both not shown). The chip further includes a 10MB shared storage area, and an external memory DDR is connected outside the chip for storing the input data, parameters and final output data of the original program.
The original program is exemplified by a 5-layer neural network; the structure and the total input and output data of each layer are shown in fig. 7 b. The input data of the first layer L1 of the neural network is 2200KB, and its output data is 1000KB; the input data of the second layer L2 is 1000KB, the same as the output data of the first layer, and its output data is 1200KB; the input data of the third layer L3 is 1200KB, the same as the output data of the second layer, and its output data is 400KB; the input data of the fourth layer L4 is 400KB, the same as the output data of the third layer, and its output data is 800KB; the input data of the fifth layer L5 is 800KB, the same as the output data of the fourth layer, and its output data is 10KB.
The compiling method of the execution program is executed by a neural network compiler. According to the embodiment of the compiling method described above, the neural network compiler first executes step S201: it analyzes the neural network, obtains the sum of the input data and output data of each layer, and generates Table 1 of the total data amount of each layer:
Layer  InData(KB)  OutData(KB)  Total Data(KB)
L1     2200        1000         3200
L2     1000        1200         2200
L3     1200        400          1600
L4     400         800          1200
L5     800         10           810
TABLE 1
Then, step S202 is executed: the number of processing cores required for executing the original program is calculated from the total data amount of each layer and the size of the data storage area of a processing core; if the calculation produces a fraction, the result is rounded up to an integer. The calculation results are shown in Table 2 below:
Layer  Calculation of Core Number  Core Number
L1     3200/1000=3.2               4
L2     2200/1000=2.2               3
L3     1600/1000=1.6               2
L4     1200/1000=1.2               2
L5     810/1000=0.81               1
TABLE 2
The calculation results indicate that executing the first layer's sub-network program requires 4 processing cores; executing the second layer's sub-network program requires 3 processing cores; executing the third and fourth layers' sub-network programs requires 2 processing cores each; and executing the fifth layer's sub-network program requires 1 processing core.
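The per-layer core counts of Table 2 reduce to a ceiling division of each layer's total data by the 1MB (here 1000KB) data storage area. A minimal sketch of that step (names are illustrative):

```python
import math

def cores_needed(total_data_kb, data_store_kb=1000):
    """Processing cores needed so a layer's combined input and output data
    fit entirely in the per-core data storage areas (rounded up)."""
    return math.ceil(total_data_kb / data_store_kb)

# per-layer totals from Table 1
totals = {"L1": 3200, "L2": 2200, "L3": 1600, "L4": 1200, "L5": 810}
cores = {layer: cores_needed(t) for layer, t in totals.items()}
```

Evaluating `cores` reproduces the right-hand column of Table 2.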
The subnetworks are then grouped by the steps in step S203. The steps of grouping the subnetworks by the method in one embodiment in step S203 are as follows:
first, the number of processing cores required by each subprogram is obtained in sequence. As shown in Table 2 above, each layer corresponds to a number of processing cores, where N1=4, N2=3, N3=2, N4=2, N5=1.
Then, a value that divides the total number of processing cores evenly and is not less than the number of processing cores required to execute the subprogram is calculated as the first value of the subprogram. The first value corresponding to each layer's subprogram is obtained by calculation, as shown in Table 3 below:
Layer  Core Number1
L1     4
L2     4
L3     2
L4     2
L5     1
TABLE 3
Then, sub-networks whose first values are equal and consecutive are determined as one subprogram group, as shown in Table 3: L1 and L2 form the first group, L3 and L4 form the second group, and L5 forms the third group.
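The first-value construction of Table 3 can be sketched as follows (function names are illustrative): the first value of a layer is the smallest divisor of the total core count that is at least the layer's requirement, and runs of equal consecutive first values form the groups.

```python
def first_value(n_required, total_cores):
    """Smallest value dividing total_cores evenly and >= n_required."""
    return next(d for d in range(1, total_cores + 1)
                if total_cores % d == 0 and d >= n_required)

def group_by_first_value(required, total_cores):
    """Consecutive layers with equal first values become one subprogram group."""
    groups = []
    for i, n in enumerate(required):
        fv = first_value(n, total_cores)
        if groups and groups[-1][1] == fv:
            groups[-1][0].append(i)   # same first value -> same group
        else:
            groups.append(([i], fv))  # first value changed -> new group
    return groups
```

Applied to the requirements of Table 2 (4, 3, 2, 2, 1 cores on a 4-core chip) this reproduces the grouping of Table 3: {L1, L2}, {L3, L4}, {L5}.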
As an alternative to the above procedure for determining subprogram groupings, the sub-networks may also be grouped by the method of another embodiment of step S203, as follows:
first, the maximum number of cores required to execute any sub-network, Nmax1, is found. In this example, the layer requiring the largest number of cores is the layer-1 sub-network, so layer 1 is taken as L1 and Nmax1 = 4; that is, when the layer-1 sub-network program is executed, at least 4 processing cores are needed to ensure that input and output data are not written outside the processing cores during operation. At this point Ng1 = 4 and A1 = 1, and grouping LG1 contains the layer-1 sub-network. The allocation of processing cores is shown in the following table:
Layer  Group of Cores  Cores of Layer Group  Layer Group
L1     A1=1            Ng1=4                 LG1
L2
L3
L4
L5
So far, the 1st grouping is preliminarily completed;
then, among the remaining layer subprograms, the maximum number of cores required to execute any sub-network, Nmax2, is found. In this example, the layer requiring the largest number of cores is the layer-2 subprogram, so layer 2 is taken as L2 and Nmax2 = 3; that is, when the layer-2 sub-network program is executed, at least 3 processing cores are needed to ensure that input and output data are not written outside the processing cores during operation. However, since Ng1 = 4 is not evenly divisible by Nmax2 = 3, only Ng2 = 4 can be taken, giving A2 = 1. Since A2 = 1 = A1, in this round of grouping LG2 needs to be merged into grouping LG1, so that the grouping is as shown in the following table:
Layer  Group of Cores  Cores of Layer Group  Layer Group
L1     A1=1            Ng1=4                 LG1
L2     A1=1            Ng1=4                 LG1
L3
L4
L5
in the calculation process, Nmax2, Ng2, A2 and the layer group LG2 will be reset for the next grouping.
So far, the 2nd grouping is preliminarily completed;
continuing among the remaining layers, the maximum number of cores required to execute any sub-network, Nmax2, is found. In this example, the layers requiring the largest number of cores are the layer-3 and layer-4 subprograms; in this case the layer 4 with the larger layer number is taken as L2, and Nmax2 = 2; that is, when the layer-3 and layer-4 sub-network programs are executed, at least 2 processing cores are needed to ensure that input and output data are not written outside the processing cores during operation. Since Ng1 = 4 is evenly divisible by Nmax2 = 2, Ng2 = 2 and A2 = 2, and grouping LG2 contains the layer-3 and layer-4 sub-networks, so that the grouping is as shown in the following table:
Layer  Group of Cores  Cores of Layer Group  Layer Group
L1     A1=1            Ng1=4                 LG1
L2     A1=1            Ng1=4                 LG1
L3     A2=2            Ng2=2                 LG2
L4     A2=2            Ng2=2                 LG2
L5
So far, the 3rd core allocation is preliminarily completed;
continuing among the remaining layers, the maximum number of cores required to execute any sub-network, Nmax3, is found. In this example, the layer requiring the largest number of cores is the layer-5 subprogram, so layer 5 is taken as L3 and Nmax3 = 1; that is, when the layer-5 sub-network program is executed, only 1 core is needed, and input and output data are not written outside the processing core during operation. Since Ng2 = 2 is evenly divisible by Nmax3 = 1, Ng3 = 1 and A3 = 2, and grouping LG3 contains the layer-5 sub-network, so that the grouping is as shown in the following table:
Layer  Group of Cores  Cores of Layer Group  Layer Group
L1     A1=1            Ng1=4                 LG1
L2     A1=1            Ng1=4                 LG1
L3     A2=2            Ng2=2                 LG2
L4     A2=2            Ng2=2                 LG2
L5     A3=2            Ng3=1                 LG3
So far, the compiler's layer-level allocation of cores for the neural network is complete.
All 5 layers of sub-networks form 3 groupings; the subprograms in each grouping are executed by the same number of processing cores, and intermediate data computed by sub-networks in the same grouping need not be written to storage outside the processing cores. For example, LG1 contains the layer-1 and layer-2 sub-networks, and the result computed by the layer-1 neural network, i.e. its output data, is used directly as input data of the layer-2 neural network, without passing through the off-core shared storage area SM or the off-chip storage area DDR.
After the grouping of the subprograms is determined, step S206 may be performed to obtain the number of loop executions of each grouping. For the grouping described above, the number of loop executions of each grouping is calculated according to Ck = A(k+1)*A(k+2)*…*Alast, as shown in Table 4:
Layer Group  Group of Cores  Number of Loops
LG1          A1=1            C1=4
LG2          A2=2            C2=2
LG3          A3=2            C3=1
TABLE 4
After the subprogram grouping scheme and the loop counts of the groupings are obtained, the embodiment of step S204 is executed next, and the positions of the synchronization points are determined according to the parameters of each layer and the size of the parameter storage area in the processing core. In each synchronization cycle, each processing core can read parameters from its internal parameter storage area to perform the neural network computation, and during the same cycle can read the parameters to be used in the next synchronization cycle from the off-chip storage area, i.e. the DDR, into its parameter storage area. A synchronization instruction is inserted at the position of each synchronization point.
Then, the embodiment of step S205 is executed to generate the program segments to run on the processing cores; this process may call a conventional compiler to generate executable program segments, where each program segment is generated from the portion of the neural network program between two synchronization points.
Fig. 8a-8c are schematic diagrams illustrating the process of each packet when executed by a processing core.
Fig. 8a is a schematic diagram of the process when the first grouping is executed by the processing cores. As shown in fig. 8a, the L1 input data of the 1st grouping LG1 is divided: the whole input data of L1 is 2200KB, and it is divided into 4 shares of 550KB (sometimes the division is not exactly even, and part of the data may need to be used by multiple cores; in that case each core's input data is larger than 550KB), which are sent respectively to the 4 processing cores C1, C2, C3 and C4 in the chip. When each core completes its partial computation of the first-layer sub-network L1, it produces one quarter of L1's total output data of 1000KB, i.e. 250KB. The 4 processing cores use the output data of their own L1 portion as the input of their respective L2 portion (a core's L1 output is not necessarily used only as its own L2 input; part of one core's output data may need to be sent to another core, so data exchange may occur between the processing cores, but such exchange takes place among the cores inside the chip, without reading or writing the off-core or off-chip memory space). When each core completes its partial computation of L2, producing one quarter of L2's total output data of 1200KB, i.e. 300KB, the partial outputs of all processing cores executing this grouping's subprograms combine into L2's total output data of 1200KB, which is stored in the shared memory space SM outside the processing cores. The computation of LG1 is executed in a loop 4 times, giving a total of 4 x 1200KB = 4.8MB, all stored in the SM; the input data of each pass is the input data of a different task.
Fig. 8b is a schematic diagram of the process when the second grouping is executed by the processing cores. As shown in fig. 8b, the calculation process is similar to that of the first grouping shown in fig. 8a, but now, since Ng2 = 2 and A2 = 2, the 4 processing cores are divided into 2 groups of 2 processing cores each. The input data and output data of each group of processing cores are divided into 2 shares for operation, and the results are likewise stored in the SM. In the operation of fig. 8b, LG2 is executed in a loop 2 times to obtain the input data of LG3; since the processing cores of LG2 are divided into 2 groups, the 2 loop executions yield the intermediate data of 4 tasks, which serves as the input data of LG3.
Fig. 8c is a schematic diagram of the process when the third packet is executed by the processing core. As shown in fig. 8c, the calculation process only involves one sub-network, and one processing core can execute one sub-program alone, so that when the third packet is executed by the processing core, the input data is directly allocated to the processing core for execution without being divided. The 4 processing cores execute 4 tasks in parallel, resulting in 4 independent outputs. Each output corresponds to the output data of an independent task. The subnetworks of the third group are performed only once.
In the above example, the first subprogram grouping receives 4 sets of input data and, after 4 loop executions, produces the first intermediate data of 4 tasks. The 4 sets of first intermediate data are divided into two groups and input to the second subprogram grouping, which runs on 2 groups of processing cores; each group loops 2 times, producing the second intermediate data of 4 tasks. The 4 sets of second intermediate data are then input to the third subprogram grouping, which runs on 4 processing cores, so the 4 tasks can execute in parallel. Thus, through the loop execution of the first and second subprogram groupings, the computing power of the 4 processing cores is fully utilized when the tasks are executed, and idle computing power is avoided.
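The task-balance argument above can be cross-checked numerically with the Ng and C values of the worked example (variable names are illustrative): each grouping should process the same number of tasks over one full cycle, so no computing power sits idle.

```python
total_cores = 4
ng = [4, 2, 1]      # Ng_k: processing cores per instance of each grouping
loops = [4, 2, 1]   # C_k: loop execution counts from Table 4

# how many copies of each grouping run in parallel on the chip
instances = [total_cores // n for n in ng]
# tasks processed by each grouping over one full cycle
tasks_per_cycle = [i * c for i, c in zip(instances, loops)]
```

Here `instances` comes out as (1, 2, 4) and `tasks_per_cycle` as (4, 4, 4): the single LG1 instance loops 4 times, the two LG2 instances loop twice each, and the four LG3 instances run once, so all three groupings keep pace at 4 tasks per cycle.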
When the chip executes the executive program, the input data is divided in the manner shown in fig. 8a to 8c, then the processing cores execute the program segment in the executive program according to the input data and the parameters, and each time a synchronization instruction is executed, a synchronization request is generated and sent to the synchronization signal generator.
As can be seen from the above examples, the original program is compiled using the execution program compiling method of the embodiment of the present disclosure, and the execution program is executed using the chip of the embodiment of the present disclosure. When the execution program runs, the parameters used by every processing core are the same, so the computation amount of each processing core in every synchronization cycle is the same and the computation time of all processing cores in each synchronization cycle is consistent; this avoids unequal computation times across processing cores, in which cores that finish first must wait for cores that finish later and computing power is lost, and thus greatly improves the effective computing power of the chip. In addition, since all processing cores use the same parameters, the parameters need to be read from the DDR only once to be shared by all cores, which greatly improves the reuse rate of the parameters, reduces the demand on DDR bandwidth, and also reduces power consumption.
An embodiment of the present disclosure further provides an electronic device, including: a memory for storing computer readable instructions; and one or more processors for executing the computer-readable instructions, so that the processors implement the compiling method of any of the execution programs in the embodiments when running.
The present disclosure also provides a non-transitory computer-readable storage medium, which stores computer instructions for causing a computer to execute the compiling method of the executive program in any one of the foregoing embodiments.
The embodiment of the present disclosure further provides a computer program product, which includes computer instructions that, when executed by a computing device, cause the computing device to execute the compiling method of the execution program in any of the preceding embodiments.
The embodiment of the present disclosure further provides a computing device, which includes the chip in any one of the embodiments.
The flowchart and block diagrams in the figures of the present disclosure illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of an element does not in some cases constitute a limitation on the element itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Claims (10)

1. A compiling method of an execution program used in a system including a plurality of processing cores, comprising:
acquiring the total data amount of each subprogram in the original program;
determining the number of processing cores required for executing each subprogram according to the size of the data storage area of the plurality of processing cores and the total data amount of each subprogram;
grouping the subprograms according to the number of processing cores required for executing each subprogram, and allocating a corresponding number of the processing cores to the groups;
determining the position of a synchronization point according to the size of the parameter of each subprogram and the size of a parameter storage area of the processing core;
compiling the original program into an executive program according to the grouping and the position of the synchronization point.
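The core-count determination in the method above can be illustrated with a minimal sketch (not part of the claims; the function and parameter names are hypothetical): the total data amount of a subprogram is divided by the per-core data storage area, rounded up.

```python
import math

def cores_required(total_data_bytes: int, data_store_bytes: int) -> int:
    """Number of processing cores needed so that a subprogram's total
    data fits into the cores' combined data storage areas (ceiling division)."""
    return math.ceil(total_data_bytes / data_store_bytes)

# A subprogram with 10 KB of data, on cores with a 4 KB data storage area each:
print(cores_required(10 * 1024, 4 * 1024))  # 3 cores
```

A real compiler would also have to account for parameter storage and alignment; this shows only the ceiling-division step implied by claim 1.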
2. The compiling method of an executive program according to claim 1, wherein the grouping the subprograms according to the number of processing cores required for executing each subprogram, and allocating a corresponding number of the processing cores to the groups, comprises:
determining the grouping of the subprograms according to the number of processing cores required by executing each subprogram and the total number of the processing cores;
allocating a corresponding number of processing cores to each group according to the number of processing cores needed for executing the subprograms in the group and the total number of the processing cores; wherein the subprograms in the same group are allocated the same number of processing cores.
3. The compiling method of an executive program according to claim 2, wherein the original program includes a plurality of subprograms, the subprograms being executed in sequence, and the determining the grouping of the subprograms based on the number of processing cores required to execute each subprogram and the total number of the processing cores comprises:
sequentially acquiring the number of processing cores required for executing each subprogram;
calculating a value by which the total number of the processing cores is evenly divisible and which is not less than the number of processing cores required for executing the subprogram, as the first value of the subprogram; wherein each subprogram corresponds to one first value;
and sequentially determining consecutive subprograms having the same first value as one subprogram group.
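One possible reading of claim 3, sketched in Python (illustrative only; all names are invented): the "first value" of a subprogram is the smallest divisor of the total core count that is not less than the number of cores the subprogram needs, and consecutive subprograms sharing a first value form one group.

```python
def first_value(total_cores: int, needed: int) -> int:
    """Smallest divisor of total_cores that is not less than `needed`
    (the 'first value' of a subprogram in claim 3)."""
    for d in range(needed, total_cores + 1):
        if total_cores % d == 0:
            return d
    raise ValueError("subprogram needs more cores than are available")

def group_by_first_value(needed_per_sub, total_cores):
    """Group consecutive subprograms whose first values are equal.
    Returns a list of (first_value, [subprogram indices]) pairs."""
    groups, current, current_fv = [], [], None
    for i, n in enumerate(needed_per_sub):
        fv = first_value(total_cores, n)
        if fv != current_fv and current:
            groups.append((current_fv, current))
            current = []
        current_fv = fv
        current.append(i)
    if current:
        groups.append((current_fv, current))
    return groups

# 16 cores; requirements [3, 4, 5, 2] give first values [4, 4, 8, 2],
# so subprograms 0 and 1 share a group.
print(group_by_first_value([3, 4, 5, 2], 16))
```

Grouping by a divisor of the total core count means each group can tile the chip evenly, which is presumably why the claim requires divisibility.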
4. The compiling method of an executive program according to claim 2, wherein the original program includes a plurality of subprograms, the subprograms being executed in sequence, and the determining the grouping of the subprograms based on the number of processing cores required to execute each subprogram and the total number of the processing cores comprises:
sequentially acquiring the number Ni of processing cores required for executing each subprogram, wherein i represents the serial number of the subprogram;
obtaining the maximum value Nmax1 among the current values of Ni;
obtaining the largest serial number j1 among the subprograms corresponding to Nmax1;
and determining the subprogram j1 and the subprograms preceding it as the grouping corresponding to j1.
5. The compiling method for an executive program according to claim 4, wherein said allocating a corresponding number of said processing cores to each of said groups based on a number of said processing cores required to execute the subprogram in said group and a total number of said processing cores comprises:
calculating a value by which the total number M of the processing cores is evenly divisible and which is not less than Nmax1, as the number Ng1 of processing cores of the grouping corresponding to j1.
6. The compiling method of the executive program according to claim 5, wherein the method further comprises:
taking the subprograms after j1 as all the subprograms and Ng1 as the total number of processing cores, continuing to perform the above steps of determining the grouping and the number of processing cores corresponding to the grouping, so as to obtain the grouping corresponding to j2 and its number Ng2 of processing cores;
if Ng2 is equal to Ng1, combining the grouping corresponding to j1 and the grouping corresponding to j2 into one grouping;
and if Ng2 is not equal to Ng1, taking the grouping corresponding to j2 as a new grouping.
7. The compiling method of the executive program according to claim 6, wherein the method further comprises:
the above steps of determining the number of groups and processing cores to which the groups correspond are continued with the subroutines after j2 as all subroutines and Ng2 as the total number of processing cores until all subroutines have corresponding groups.
8. The compiling method of executing a program according to any of claims 3 to 7, wherein the method further comprises:
determining the number of loop executions of the packet.
9. The compiling method of an executive program according to claim 8, wherein the compiling the original program into an execution program according to the grouping and the position of the synchronization point comprises:
compiling the original program into the execution program according to the grouping, the number of loop executions of the grouping, and the position of the synchronization point.
10. A chip, comprising:
a plurality of processing cores and a synchronization signal generator; wherein each processing core comprises a data storage area and a parameter storage area;
the processing cores in each group are used for executing a plurality of program segments corresponding to the grouping of the subprograms in the execution program; wherein the data storage area is used for storing input data and output data of the plurality of program segments, and the parameter storage area is used for storing parameters of the plurality of program segments;
and the synchronization signal generator is used for sending a synchronization signal to all the processing cores when all the processing cores executing the program segments finish executing.
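The synchronization signal generator of claim 10 behaves like a barrier: the signal is sent only once every core executing a program segment has finished. A minimal software analogy using Python's `threading.Barrier` (illustrative only; the actual chip uses a hardware signal, and all names here are invented):

```python
import threading

NUM_CORES = 4
# The barrier plays the role of the synchronization signal generator:
# it releases all "cores" only when every one has finished its segment.
barrier = threading.Barrier(NUM_CORES)
results = []
lock = threading.Lock()

def core(core_id: int, segments):
    for seg in segments:
        with lock:
            results.append((core_id, seg))  # "execute" the program segment
        barrier.wait()                      # wait for the synchronization signal

threads = [threading.Thread(target=core, args=(i, ["seg0", "seg1"]))
           for i in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 8 executed segments, in two synchronized waves
```

The barrier guarantees that no core starts `seg1` before every core has finished `seg0`, mirroring the claim's condition that the signal is sent only "when all the processing cores executing the program segment finish executing".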
CN202011623138.9A 2020-12-31 2020-12-31 Compiling method of execution program, chip, electronic device, and computer-readable storage medium Pending CN114691142A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011623138.9A CN114691142A (en) 2020-12-31 2020-12-31 Compiling method of execution program, chip, electronic device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN114691142A true CN114691142A (en) 2022-07-01

Family

ID=82134420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011623138.9A Pending CN114691142A (en) 2020-12-31 2020-12-31 Compiling method of execution program, chip, electronic device, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN114691142A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116126346A (en) * 2023-04-04 2023-05-16 上海燧原科技有限公司 Code compiling method and device of AI model, computer equipment and storage medium
CN116126346B (en) * 2023-04-04 2023-06-16 上海燧原科技有限公司 Code compiling method and device of AI model, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
JP6977239B2 (en) Matrix multiplier
US20240220785A1 (en) Schedule-Aware Tensor Distribution Module
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
Chen et al. On-the-fly parallel data shuffling for graph processing on OpenCL-based FPGAs
CN110516316B (en) GPU acceleration method for solving Euler equation by interrupted Galerkin method
CN114330730A (en) Quantum line block compiling method, device, equipment, storage medium and product
CN108470211B (en) Method and device for realizing convolution calculation and computer storage medium
CN113222125A (en) Convolution operation method and chip
CN110764824A (en) Graph calculation data partitioning method on GPU
CN114691142A (en) Compiling method of execution program, chip, electronic device, and computer-readable storage medium
Kobus et al. Gossip: Efficient communication primitives for multi-gpu systems
CN113553054A (en) Heterogeneous system based compiling method, device, equipment and storage medium
CN104050148A (en) FFT accelerator
CN110008436B (en) Fast Fourier transform method, system and storage medium based on data stream architecture
CN116755878A (en) Program running method, apparatus, device, medium and program product
Träff et al. Message-combining algorithms for isomorphic, sparse collective communication
CN113222136A (en) Convolution operation method and chip
CN113222099A (en) Convolution operation method and chip
Zhang et al. Yuenyeungsptrsv: a thread-level and warp-level fusion synchronization-free sparse triangular solve
WO2022141344A1 (en) Executive program compilation method, and chip, electronic device, and computer-readable storage medium
CN114911480A (en) Compiling method of execution program, chip, electronic device, and computer-readable storage medium
Gao et al. Revisiting thread configuration of SpMV kernels on GPU: A machine learning based approach
CN112100446B (en) Search method, readable storage medium, and electronic device
Zheng et al. Path merging based betweenness centrality algorithm in delay tolerant networks
Vincke et al. Refactoring sequential embedded software for concurrent execution using design patterns

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination