WO2022141344A1 - Compilation method for an execution program, chip, electronic device and computer-readable storage medium - Google Patents

Compilation method for an execution program, chip, electronic device and computer-readable storage medium

Info

Publication number
WO2022141344A1
Authority
WO
WIPO (PCT)
Prior art keywords
program
data
processing cores
storage area
original program
Prior art date
Application number
PCT/CN2020/141941
Other languages
English (en)
French (fr)
Inventor
冯杰
Original Assignee
北京希姆计算科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京希姆计算科技有限公司 filed Critical 北京希姆计算科技有限公司
Priority to CN202080108193.6A priority Critical patent/CN116710930A/zh
Priority to PCT/CN2020/141941 priority patent/WO2022141344A1/zh
Publication of WO2022141344A1 publication Critical patent/WO2022141344A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Definitions

  • the present disclosure relates to the field of program compilation and processors, and in particular, to a compilation method, chip, electronic device and computer-readable storage medium for executing a program.
  • chips are required to have both extremely high flexibility, so that they can handle algorithms from different fields that change day by day, and extremely strong processing capability, so that they can quickly process a huge and rapidly growing amount of data.
  • the first is that each processing core processes its own task independently, without affecting the others, as shown in Figure 1a; the second is that some or all of the processing cores jointly process a task in parallel, each completing a part of the task, as shown in Figure 1b.
  • in both methods, the compiler compiles a suitable program for the neural network computation according to the structure of the neural network and the characteristics of the traditional multi-core (many-core) CPU or GPU.
  • the Cache in each processing core is transparent to the program and cannot be accessed directly and independently; all data reads and writes during the computation are, from the processing core's point of view, performed through the access addresses of the DDR (Double Data Rate) memory.
  • a large amount of intermediate data is generated in neural network computation, and most of the intermediate data of the different layers are unrelated. Because the Cache relies on spatial locality and temporal locality, its hit rate easily drops during neural network computation and the DDR memory is accessed frequently, which reduces the computing speed of the neural network and increases power consumption.
  • an embodiment of the present disclosure provides a compiling method for executing a program, including:
  • the original program is compiled into an executable program according to the number of the processing cores and the position of the synchronization point.
  • determining the number of processing cores required to execute the original program according to the attributes of the data of the original program includes:
  • the number of processing cores required to execute the original program is determined according to the total amount of data of each subroutine and the size of the data storage area of the processing core.
  • calculating the number of processing cores required to execute the original program according to the total amount of data of each subroutine and the size of the data storage area of the processing core includes: calculating, for each subroutine, the quotient of its total amount of data and the size of the data storage area of the processing core;
  • the largest of these quotients, rounded up, is taken as the number of processing cores required to execute the original program.
  • determining the position of the synchronization point in the original program according to the parameters of the original program includes:
  • the positions of the synchronization points of the plurality of subroutines are determined according to the size of the parameter storage area of the processing core and the size of the parameters of the plurality of subroutines.
  • determining the position of the synchronization point of the subroutine according to the size of the parameter storage area of the processing core and the size of the parameter of the subroutine includes:
  • the position of the synchronization point of the subroutine is determined according to the number of parameters of the subroutine that can be stored in the parameter storage area of the processing core.
  • a synchronization instruction is added at the position of the synchronization point; wherein, the synchronization instruction is used to make the system including a plurality of processing cores generate a synchronization signal.
  • the execution program includes a plurality of program segments, and each program segment includes instructions from the original program and the control instructions required by the processing core to execute those instructions.
  • the original program is a neural network
  • the subprogram is one layer of sub-network in the neural network.
  • the acquisition of the total amount of data of each subprogram in the original program includes: acquiring the neural network; analyzing the size of the input data and the size of the output data of each layer of sub-network in the neural network; and generating the total amount of data of each layer of sub-network in the neural network.
  • an embodiment of the present disclosure provides a chip, including:
  • each processing core includes a data storage area and a parameter storage area;
  • the multiple processing cores are configured to be grouped according to the execution program, wherein the processing cores in each group are used to execute multiple program segments of the execution program; wherein the data storage area is used to store the input data and output data of the plurality of program segments, and the parameter storage area is used to store the parameters of the plurality of program segments;
  • the synchronization signal generator is configured to send a synchronization signal to all processing cores when all of the processing cores executing the program segments have finished execution.
  • embodiments of the present disclosure provide an electronic device, including: a memory for storing computer-readable instructions; and one or more processors for executing the computer-readable instructions, so that the processors, when running, implement the compilation method for an execution program according to any one of the foregoing first aspect.
  • an embodiment of the present disclosure provides a non-transitory computer-readable storage medium that stores computer instructions, the computer instructions being used to cause a computer to execute the compilation method for an execution program according to any one of the foregoing first aspect.
  • an embodiment of the present disclosure provides a computer program product comprising computer instructions which, when executed by a computing device, enable the computing device to execute the compilation method for an execution program according to any one of the foregoing first aspect.
  • an embodiment of the present disclosure provides a computing apparatus comprising the chip according to any one of the foregoing second aspect.
  • the embodiments of the present disclosure disclose a compilation method, a chip, an electronic device and a computer-readable storage medium for executing a program.
  • the compiling method for executing the program includes: determining the number of processing cores required to execute the original program according to attributes of the data of the original program; determining the position of the synchronization point in the original program according to the parameters of the original program; and compiling the original program into an execution program according to the number of the processing cores and the position of the synchronization point.
  • the above-mentioned method compiles and generates the execution program based on the attributes and parameters of the input and output data of the original program, which solves the technical problem that the execution program in the prior art needs to frequently access external memory during execution.
  • FIG. 1a and 1b are schematic diagrams of the prior art
  • FIG. 2 is a schematic flowchart of a method for compiling an execution program according to an embodiment of the present disclosure
  • FIG. 3 is a further schematic flowchart of a method for compiling an execution program provided by an embodiment of the present disclosure
  • FIG. 4 is a further schematic flowchart of a method for compiling an execution program provided by an embodiment of the present disclosure
  • FIG. 5 is a further schematic flowchart of a method for compiling an execution program provided by an embodiment of the present disclosure
  • FIG. 6a is an example diagram of a chip provided by an embodiment of the present disclosure.
  • FIG. 6b is a schematic diagram of the grouping of processing cores when the chip according to the embodiment of the present disclosure executes the execution program
  • FIG. 7 is a schematic diagram of a neural network to be compiled in an embodiment of the disclosure.
  • FIG. 8 is a schematic diagram of how the two processing cores in the first group split the data according to an embodiment of the present disclosure.
  • the term “including” and variations thereof are open-ended inclusions, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • FIG. 2 is a schematic flowchart of a method for compiling an execution program according to an embodiment of the present disclosure.
  • the method for compiling an execution program is used in a system including a plurality of processing cores, wherein the processing cores include a storage area for storing relevant data of the execution program.
  • the relevant data of the execution program includes the input data, output data, program instruction data, parameter data, etc. of the execution program; correspondingly, the storage area includes a data storage area for storing the input data and output data of the execution program, a program storage area for storing the program instruction data, and a parameter storage area for storing the parameter data.
  • the method includes:
  • Step S201: determine the number of processing cores required to execute the original program according to the attributes of the data of the original program;
  • the attributes of the data of the original program include the total amount of input and output data of the original program, and the like.
  • the data of the original program is analyzed, the allocation scheme of the processing cores is planned according to the data attributes of the original program, and the number of processing cores required to be able to calculate the original program in parallel is determined.
  • step S201 includes:
  • Step S301: obtain the total amount of data of each subprogram in the original program; wherein the total amount of data of the subprogram includes the sum of the size of the input data and the size of the output data of the subprogram;
  • Step S302: determine the number of processing cores required to execute the original program according to the total amount of data of each subroutine and the size of the data storage area of the processing core.
  • the original program is a neural network
  • the sub-program is a layer of sub-networks in the neural network.
  • the original program includes a plurality of subprograms, the original program is executed in sequence according to the sequence of the subprograms, and the output data of the subprograms are the input data of the next subprogram or the output data of the original program.
  • the step S301 includes:
  • a neural network can usually be represented in the form of a graph, and each layer includes the size of the input data and the size of the output data, such as the dimension of the input and output data.
  • the total amount of data of each layer of sub-networks can be obtained by analyzing the graph representing the neural network.
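  • (Illustrative sketch, not part of the original disclosure.) Assuming each layer of the graph carries its input and output sizes in KB, and using hypothetical names, the per-layer data totals of step S301 can be gathered roughly as follows:

        from dataclasses import dataclass

        @dataclass
        class Layer:
            name: str
            in_kb: int   # size of the layer's input data, in KB
            out_kb: int  # size of the layer's output data, in KB

        def layer_data_totals(layers):
            """Return {layer name: input size + output size} for each layer (step S301)."""
            return {layer.name: layer.in_kb + layer.out_kb for layer in layers}

        # The 2-layer network used later in the example (Figure 7).
        network = [Layer("L1", in_kb=400, out_kb=800), Layer("L2", in_kb=800, out_kb=10)]
        print(layer_data_totals(network))  # {'L1': 1200, 'L2': 810}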
  • the original program can be any of various other types of programs, and the subprograms of the original program can be program modules of the original program divided by functional module, or program modules divided according to the program execution sequence, each producing output data from input data.
  • since intermediate data, such as the output data of the above subprograms, may exist while the original program is executed, and in order that the intermediate data need not access storage areas outside the processing core, the total amount of data of each subroutine of the original program is first calculated in this step.
  • after the total amount of data of each subprogram of the original program is obtained, in step S302 the size of the data storage area of the processing core is obtained, and the number of processing cores required to execute the original program is calculated according to the total amount of data of the subprograms and the size of the data storage area.
  • the step S302 includes:
  • Step S401: calculate the quotient of the total amount of data of each subroutine and the size of the data storage area of the processing core;
  • Step S402: take the largest of these quotients, rounded up, as the number of processing cores required to execute the original program.
  • in step S401, assuming that the data storage areas of all processing cores are the same size, multiple quotients are obtained by dividing the total amount of data of each subroutine by the size of the data storage area, and a quotient may or may not be an integer. Therefore, in step S402, the largest of these quotients, rounded up, is taken as the number of processing cores required to execute the original program.
  • when the data storage areas of the processing cores differ in size, the smallest data storage area among the processing cores may be used as the size of the data storage area when calculating the quotients, to ensure that each processing core can store the input data and output data of each subroutine without using external memory. It can be understood that in the above steps S401 and S402, the minimum number of processing cores required to execute each subroutine is calculated, and the largest of these numbers is then taken as the minimum number of processing cores required to execute the original program.
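  • (Illustrative sketch, not part of the original disclosure.) Assuming per-layer data totals are given in KB and using hypothetical function names, steps S401/S402 can be written roughly as below; the min() handling of unequal storage areas follows the rule described above:

        import math

        def required_cores(layer_totals_kb, core_data_area_kb):
            """N_min = ceil(max_i total_i / data_area)  (steps S401/S402).

            If the cores' data storage areas differ in size, the smallest one is
            used, so that every core can hold its share of each subprogram's
            input and output data without spilling to external memory.
            """
            if isinstance(core_data_area_kb, (list, tuple)):
                area = min(core_data_area_kb)
            else:
                area = core_data_area_kb
            quotients = [total / area for total in layer_totals_kb]
            return math.ceil(max(quotients))

        print(required_cores([1200, 810], 1000))  # 2 -> N_min for the example network below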
  • in step S201, once the number of processing cores required to execute the original program has been obtained, the multiple processing cores are grouped according to that number; the number of processing cores in each group is the required number of processing cores, and the groups can execute multiple original programs in parallel.
  • let N_min be the number of processing cores required to execute the original program; after the multiple processing cores are grouped, the number of processing cores in each group is N_g, which must satisfy N_g ≥ N_min; if the number of processing cores in the system is N, then N = A * N_g, where A is a positive integer, namely the number of groups. For example, if N_min = 2 and N = 9, then N_g = 3 and A = 3.
  • after the number of processing cores required to execute the original program is calculated in this step, the amount of input data of each processing core can be further calculated, that is, the input data of each subroutine is distributed evenly over the processing cores of a group.
  • specifically, the input data can be divided into N_g equal parts; however, in some cases the input data of different processing cores may overlap, and in this case the number of processing cores N_g in each group must be adjusted so that N_g satisfies: M_0 ≥ D_m + D_in/N_g, where M_0 is the size of the data storage area of the processing core and D_m is the incremental data produced by the overlapping data for each additional processing core; the input data that each processing core needs to be allocated is thus D_m + D_in/N_g.
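  • (Illustrative sketch, not part of the original disclosure.) Under the assumption that the smallest group size that evenly divides the chip gives the most parallel groups, the grouping and per-core input can be sketched as below; the function names and the overlap term D_m are illustrative:

        def group_size(n_cores, n_min):
            """Pick N_g: the smallest divisor of N that is >= N_min,
            so the chip splits into as many groups as possible (N = A * N_g)."""
            for n_g in range(n_min, n_cores + 1):
                if n_cores % n_g == 0:
                    return n_g
            raise ValueError("not enough processing cores")

        def per_core_input_kb(d_in_kb, n_g, d_m_kb=0.0):
            """Input data each core receives: D_m + D_in / N_g (D_m = 0 if inputs don't overlap)."""
            return d_m_kb + d_in_kb / n_g

        n_g = group_size(n_cores=4, n_min=2)        # 2 -> two groups of two cores
        assert per_core_input_kb(400, n_g) <= 1000  # M_0 >= D_m + D_in / N_g holds for the 1 MB area
        print(n_g, 4 // n_g)                        # N_g = 2, A = 2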
  • in this way, through step S201, the grouping plan of the multiple processing cores of the system that executes the original program can be completed during the compilation stage of the original program.
  • the compilation method for the execution program further includes:
  • Step S202: determine the position of the synchronization point in the original program according to the parameters of the original program.
  • the instructions of the original program between the two synchronization points are program instructions that the processing core needs to execute in one synchronization cycle.
  • the step S202 includes: determining the positions of the synchronization points of the multiple subprograms according to the size of the parameter storage area of the processing core and the size of the parameters of the multiple subprograms.
  • the parameter storage area of the processing core is used to store the parameters of the original program, for example the size, weight values and stride of the convolution kernels used by each layer of sub-network in a convolutional neural network.
  • the size of the parameter storage area in the processing core determines the amount of program instructions that can be executed in one synchronization cycle.
  • if the parameter storage area in the processing core can store all the parameters of one layer of sub-network, then within one synchronization cycle the processing core can complete the computation of that layer without reading parameters from storage areas outside the processing core, and the synchronization point can be set at the end of the program instructions of each sub-network.
  • in some cases the parameter storage area of the processing core is relatively small and cannot store all the parameters of one layer of sub-network; in this case a synchronization point needs to be inserted inside the sub-network to determine the position up to which execution can proceed using the parameters stored in the parameter storage area.
  • step S202 includes:
  • Step S501: determine, according to the size of the parameter storage area of the processing core and the size of the parameters of the subroutine, the number of parameters of the subroutine that can be stored in the parameter storage area of the processing core;
  • Step S502: determine the position of the synchronization point of the subroutine according to the number of parameters of the subroutine that can be stored in the parameter storage area of the processing core.
  • the number of parameters that can be stored in the parameter storage area can be determined from the size of the parameters of the subroutine and the size of the parameter storage area, and the positions of the synchronization points of the multiple subroutines are determined according to that number.
  • for example, if the size of the parameters of the subroutine is 50KB and the size of the parameter storage area is 25KB, a synchronization point needs to be inserted at the midpoint of the subroutine. The above determination of synchronization point positions is performed for each subprogram, and the positions of all the synchronization points of the original program are obtained.
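  • (Illustrative sketch, not part of the original disclosure.) Interpreting the rule above as splitting a subprogram into as many pieces as its parameters need parameter-area refills, a minimal sketch with hypothetical names is:

        import math

        def sync_cycles_for_subprogram(param_size_kb, param_area_kb):
            """Number of pieces a subprogram is cut into (steps S501/S502):
            each piece's parameters must fit in the parameter storage area,
            and a synchronization point is placed at the end of each piece."""
            return math.ceil(param_size_kb / param_area_kb)

        # Example from the text: 50 KB of parameters, 25 KB parameter storage area
        # -> 2 pieces, i.e. one extra synchronization point at the midpoint.
        print(sync_cycles_for_subprogram(50, 25))  # 2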
  • the step S202 further includes: adding a synchronization instruction at the position of the synchronization point; wherein the synchronization instruction is used to cause the system including multiple processing cores to generate a synchronization signal. That is, after the processing core has executed the program instructions between synchronization points, it continues by executing the synchronization instruction.
  • when the processing core executes the synchronization instruction, a synchronization request signal is generated to request the system including a plurality of processing cores to generate a synchronization signal.
  • the system includes a synchronization signal generator, and the synchronization signal generator generates the synchronization signal.
  • after receiving the synchronization request signals sent by every processing core participating in program execution in the system, the synchronization signal generator generates a synchronization signal so that the multiple processing cores enter the next synchronization cycle and execute the subsequent program instructions.
  • the compilation method for the execution program further includes:
  • Step S203: compile the original program into an execution program according to the number of the processing cores and the position of the synchronization point.
  • the execution program includes a plurality of program segments, the program segments are generated by taking the positions of the synchronization points as boundaries, and each program segment includes instructions from the original program as well as the control instructions required by the processing core to execute those instructions.
  • synchronization points are inserted into subprograms, and the synchronization points serve as the boundaries between program segments, so the instructions contained in one program segment may be some or all of the instructions of one subprogram, or instructions from multiple subroutines.
  • the control instructions are used by the processing core, in each synchronization cycle, to read the parameters and/or the next program segment, etc., that are required for the next synchronization cycle.
  • the end of the program segment further includes the above synchronization instruction for generating a synchronization request signal.
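  • (Illustrative sketch, not part of the original disclosure.) One possible way to picture a program segment, with hypothetical names for the control and synchronization instructions, is:

        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class ProgramSegment:
            body: List[str]                                    # instructions taken from the original program
            control: List[str] = field(default_factory=list)  # e.g. prefetch of the next cycle's parameters/segment
            sync: str = "SYNC"                                 # trailing instruction that raises a sync request

        def build_segments(instructions, sync_positions):
            """Cut the original instruction stream at the synchronization points."""
            segments, start = [], 0
            for pos in list(sync_positions) + [len(instructions)]:
                body = instructions[start:pos]
                if body:
                    segments.append(ProgramSegment(
                        body=body,
                        control=["LOAD_PARAMS(next)", "LOAD_SEGMENT(next)"]))
                start = pos
            return segments

        segs = build_segments([f"op{i}" for i in range(6)], sync_positions=[3])
        print([s.body for s in segs])  # [['op0', 'op1', 'op2'], ['op3', 'op4', 'op5']]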
  • the number of the processing cores is used to generate allocation information and/or grouping information in the execution program, which the system including multiple processing cores uses, when executing the execution program, to group its processing cores and to allocate program segments, parameters, input data and so on.
  • through the above compilation method, the original program is compiled into an execution program suitable for execution by a multi-processing-core system, and the basis for generating the execution program is the attributes and parameters of the input and output data of the original program, which strengthens the fit between the execution program and the multi-processing-core system and increases the effective computing power of the multi-processing-core system.
  • in addition, the intermediate data generated by the original program move entirely within the multi-processing-core system and do not need to be exchanged with external memory, which reduces latency, reduces the pressure on the bandwidth of the external memory, and reduces the power consumption of the whole multi-processing-core system.
  • FIG. 6a is an example of a schematic structural diagram of a system including multiple processing cores provided by an embodiment of the present disclosure. As shown in FIG. 6a, in this example, the system including multiple processing cores is a chip, and the chip 600 includes:
  • each processing core includes a data storage area and a parameter storage area;
  • the plurality of processing cores 601 are used for grouping according to the execution program, wherein the processing cores in each group are used for executing a plurality of program segments in the execution program; wherein the data storage area is used for storing the input data and output data of multiple program segments, and the parameter storage area is used to store the parameters of the multiple program segments;
  • the synchronization signal generator 602 is configured to send a synchronization signal to all processing cores when all of the processing cores executing the program segments have finished execution.
  • the chip 600 includes four processing cores, namely C1, C2, C3 and C4; each processing core includes a 1MB data storage area, and each processing core also includes a parameter storage area and a program storage area (both not shown); an external memory, a DDR, is connected outside the chip for storing the input data, parameters and final output data of the original program.
  • the original program takes a 2-layer neural network as an example, and its structure and the total amount of input and output data of each layer are shown in Figure 7.
  • the input data of the first layer L1 of the neural network is 400KB, and the output data is 800KB;
  • the size of the input data of the second layer L2 of the neural network is the same as the size of the output data of the first layer, which is 800KB, and the output data is 10KB.
  • the above-mentioned compilation method for the execution program is executed by a neural network compiler.
  • the neural network compiler first executes step S301, analyzes the neural network, obtains for each layer the sum of the sizes of the input data and the output data, and generates the per-layer data totals of Table 1 (Layer / InData / OutData / Total Data, in KB): L1: 400 / 800 / 1200; L2: 800 / 10 / 810.
  • after that, the steps in the above step S302 are performed, and the number of processing cores required to execute the original program is calculated from the total amount of data of each layer and the size of the data storage area of the processing core; if the calculation produces a fraction, it is rounded up to an integer.
  • the calculation results are shown in Table 2: L1: 1200/1000 = 1.2, rounded up to 2 cores; L2: 810/1000 = 0.8, rounded up to 1 core.
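  • (Illustrative sketch, not part of the original disclosure.) The numbers in Tables 1 and 2 can be reproduced with a few lines of Python; the 1000 KB figure mirrors how Table 2 treats the 1 MB data storage area:

        import math

        layers = {"L1": (400, 800), "L2": (800, 10)}  # (input KB, output KB), Figure 7
        data_area_kb = 1000                            # per-core data storage area as used in Table 2

        totals = {name: i + o for name, (i, o) in layers.items()}                  # Table 1
        cores = {name: math.ceil(t / data_area_kb) for name, t in totals.items()}  # Table 2

        print(totals)               # {'L1': 1200, 'L2': 810}
        print(cores)                # {'L1': 2, 'L2': 1}
        print(max(cores.values()))  # N_min = 2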
  • then the number of processing cores Ng in each group is determined; Ng needs to meet the following two conditions in order to maximize the utilization of the chip's computing power: a. Ng >= Nmin; b. N = A*Ng, where A is a positive integer.
  • in the above example N = 4 and Nmin = 2, so A = 2 and Ng = Nmin = 2, i.e. the four processing cores are divided into two groups of two cores each. FIG. 6b shows the grouping of the processing cores in the chip shown in FIG. 6a, wherein C1 and C2 form the first group Group1, and C3 and C4 form the second group Group2.
  • the input and output data of each layer of the neural network are split so that they can be computed in parallel on the two processing cores and so that the computational load of the two processing cores is balanced.
  • the two processing cores use the same parameters in the same synchronization cycle.
  • FIG. 8 is a schematic diagram of two processing cores in the first group slicing data.
  • the input data of L1 is firstly divided.
  • the input data of L1 is 400KB; it is distributed to the two processing cores for computation and split into two sub-inputs of 200KB (not necessarily exactly equal halves: in some cases the two processing cores use part of the same input data, and this part of the input data must then be given to both processing cores, so both sub-inputs will be larger than 200KB), which are allocated to C1 and C2 in Group1 as their input data.
  • when C1 and C2 complete the computation of the first neural network layer L1, each produces half of L1's 800KB output data, i.e. a 400KB sub-output; C1 and C2 use their own L1 sub-outputs as the sub-inputs of their respective L2 computations (the output of a core's L1 cannot necessarily be used directly as the input of its own L2: part of the L1 sub-output of C1 may need to be allocated to C2 and used, together with the L1 sub-output of C2, as the sub-input of C2's L2, and likewise part of the L1 sub-output of C2 may need to be allocated to C1 as input of its L2; this data exchange, however, takes place entirely in the storage areas inside the chip, and there is no need to read or write off-chip storage).
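  • (Illustrative sketch, not part of the original disclosure.) The even split, with an optional overlap term as described above, can be pictured like this; the sizes come from the example and the overlap amount is hypothetical:

        def split_evenly(total_kb, n_cores, overlap_kb=0.0):
            """One shard per core; every shard grows by overlap_kb when cores share input rows."""
            return [total_kb / n_cores + overlap_kb for _ in range(n_cores)]

        # Group1: C1 and C2 each take ~200 KB of L1's 400 KB input, each produces
        # a 400 KB L1 sub-output, which stays on-chip as the input of L2.
        print(split_evenly(400, 2))  # [200.0, 200.0] -> L1 sub-inputs
        print(split_evenly(800, 2))  # [400.0, 400.0] -> L2 sub-inputs, never written to DDR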
  • step S202 then continues to be executed, and the position of each synchronization point is determined from the parameters of each layer and the size of the parameter storage area in the processing core, so that in each synchronization cycle each processing core can read parameters from the parameter storage area inside the processing core for the neural network computation, and during that synchronization cycle the parameters to be used in the next synchronization cycle can be read from the off-chip storage, i.e. the DDR, into the parameter storage area inside the processing core. According to the positions of the synchronization points, synchronization instructions are inserted at those positions.
  • step S203 then continues: the program segments to be executed by the Ng processing cores are generated; this process can call a traditional compiler to generate the executable program segments, each program segment being generated from the part of the neural network program between two synchronization points.
  • the processing cores execute the program segments in the execution program using the input data and parameters; each time a synchronization instruction is reached, a synchronization request is generated and sent to the synchronization signal generator; after the synchronization signal generator has received the synchronization request sent by each processing core in the chip, it generates a synchronization signal and sends it to each processing core, so that each processing core enters the next synchronization cycle and continues to execute the program segments of the execution program with the new parameters, until execution of the execution program is completed and the output result is obtained.
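  • (Illustrative sketch, not part of the original disclosure.) The synchronize-after-each-segment behaviour can be simulated with one thread per core and a barrier standing in for the synchronization signal generator; the per-segment work is only a placeholder:

        import threading

        N_CORES = 2                           # processing cores in one group
        barrier = threading.Barrier(N_CORES)  # plays the role of the synchronization signal generator

        def core(core_id, segments):
            for seg in segments:
                # ... execute the instructions of this program segment on this core's data shard ...
                print(f"core {core_id}: finished {seg}")
                barrier.wait()                # sync request; released once all cores have arrived

        threads = [threading.Thread(target=core, args=(i, ["segment0", "segment1"]))
                   for i in range(N_CORES)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()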
  • the above compilation method for the execution program of the present disclosure can be executed when the number of processing cores is known in advance, in which case the execution program obtained after compilation may contain allocation information, including the grouping information of the processing cores and the way the input data (including the intermediate data used as input data of other layers) are divided, and so on.
  • the compilation method may also be used when the number of processing cores is not known in advance, in which case the execution program obtained after compilation may contain the number of processing cores required to execute it and the grouping strategy for the processing cores. When the execution program is executed, the number of processing cores of the current chip is first obtained, and the optimal grouping for executing the execution program on the current chip is calculated from the grouping strategy and the number of processing cores required to execute the execution program.
  • tasks are then allocated to the processing cores of each group according to that grouping.
  • the original program is compiled by using the execution program compiling method in the embodiment of the present disclosure, and the execution program is executed by using the chip in the embodiment of the present disclosure.
  • since each processing core uses the same parameters, each processing core has the same amount of computation in each synchronization cycle, so that the computation times of all processing cores are the same in each synchronization cycle; this avoids the loss of computing power that would arise if the computation times of different processing cores differed and the processing cores that finish first had to wait for the processing cores that finish later, and thereby greatly increases the effective computing power of the chip. In addition, because all processing cores use the same parameters, the parameters only need to be read from the DDR once and can then be shared by all processing cores, which greatly increases parameter reuse, lowers the demand on DDR bandwidth and reduces power consumption.
  • Embodiments of the present disclosure also provide an electronic device, including: a memory for storing computer-readable instructions; and one or more processors for executing the computer-readable instructions, so that the processors, when running, implement the compilation method for an execution program described in any of the embodiments.
  • Embodiments of the present disclosure also provide a non-transitory computer-readable storage medium that stores computer instructions, the computer instructions being used to cause a computer to execute the compilation method for an execution program described in any of the foregoing embodiments.
  • Embodiments of the present disclosure also provide a computer program product comprising computer instructions which, when executed by a computing device, enable the computing device to execute the compilation method for an execution program described in any of the foregoing embodiments.
  • An embodiment of the present disclosure further provides a computing apparatus comprising the chip described in any one of the embodiments.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by dedicated hardware-based systems that perform the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure may be implemented in a software manner, and may also be implemented in a hardware manner. Among them, the name of the unit does not constitute a limitation of the unit itself under certain circumstances.
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

A compilation method for an execution program, a chip, an electronic device and a computer-readable storage medium. The compilation method for the execution program includes: determining, according to attributes of the data of an original program, the number of processing cores required to execute the original program (S201); determining the positions of synchronization points in the original program according to the parameters of the original program (S202); and compiling the original program into an execution program according to the number of processing cores and the positions of the synchronization points (S203). The above method generates the execution program on the basis of the attributes and parameters of the input and output data of the original program, which solves the technical problem in the prior art that an execution program needs to frequently access external memory while being executed.

Description

Compilation method for an execution program, chip, electronic device and computer-readable storage medium
Technical Field
The present disclosure relates to the field of program compilation and processors, and in particular to a compilation method for an execution program, a chip, an electronic device and a computer-readable storage medium.
Background
With the development of science and technology, human society is rapidly entering the intelligent era. An important characteristic of the intelligent era is that people obtain ever more kinds of data, in ever larger quantities, and require the data to be processed at ever higher speeds. Chips are the cornerstone of task scheduling; they fundamentally determine people's ability to process data. In terms of application fields, chips follow two main routes: one is the general-purpose chip route, such as the CPU (Central Processing Unit), which provides great flexibility but relatively low effective computing power when processing algorithms of a specific field; the other is the special-purpose chip route, such as the TPU (Tensor Processing Unit), which can deliver high effective computing power in certain specific fields, but whose processing capability is poor, or even nonexistent, in flexible and changeable general-purpose fields. Because the data of the intelligent era come in many kinds and in enormous quantities, a chip is required to have both extremely high flexibility, so that it can handle algorithms of different fields that change day by day, and extremely strong processing capability, so that it can quickly process the enormous and rapidly growing amount of data.
When a multi-core (many-core) CPU or GPU processes neural network tasks, there are generally two processing methods:
The first is that each processing core processes its own task independently, without affecting the others, as shown in Figure 1a; the second is that some or all of the processing cores jointly process a task in parallel, each completing a part of the task, as shown in Figure 1b.
In both of these methods, the compiler compiles a suitable program for the neural network computation according to the structure of the neural network and the characteristics of the traditional multi-core (many-core) CPU or GPU. The Cache in each processing core is transparent to the program and cannot be accessed directly and independently; all data reads and writes during the computation are, from the processing core's point of view, performed through the access addresses of the DDR (Double Data Rate) memory. Neural network computation generates a large amount of intermediate data, and most of the intermediate data of the different layers are unrelated. Since the Cache relies on spatial locality and temporal locality, its hit rate easily drops during neural network computation and the DDR memory is accessed frequently, which lowers the computing speed of the neural network and increases power consumption.
Summary
This Summary is provided to introduce concepts in a brief form; these concepts will be described in detail in the Detailed Description that follows. This Summary is not intended to identify key or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
In order to solve the technical problems in the prior art that task scheduling of the processing cores is inflexible and control is complicated, embodiments of the present disclosure propose the following technical solutions:
In a first aspect, an embodiment of the present disclosure provides a compilation method for an execution program, including:
determining, according to attributes of the data of an original program, the number of processing cores required to execute the original program;
determining the positions of synchronization points in the original program according to parameters of the original program; and
compiling the original program into an execution program according to the number of processing cores and the positions of the synchronization points.
Further, determining, according to the attributes of the data of the original program, the number of processing cores required to execute the original program includes:
acquiring the total amount of data of each subprogram in the original program, wherein the total amount of data of a subprogram includes the sum of the size of its input data and the size of its output data; and
determining the number of processing cores required to execute the original program according to the total amount of data of each subprogram and the size of the data storage area of the processing cores.
Further, calculating the number of processing cores required to execute the original program according to the total amount of data of each subprogram and the size of the data storage area of the processing cores includes:
calculating, for each subprogram, the quotient of its total amount of data and the size of the data storage area of the processing cores; and
taking the largest of these quotients, rounded up, as the number of processing cores required to execute the original program.
Further, determining the positions of the synchronization points in the original program according to the parameters of the original program includes:
determining the positions of the synchronization points of the multiple subprograms according to the size of the parameter storage area of the processing cores and the sizes of the parameters of the multiple subprograms.
Further, determining the position of a synchronization point of a subprogram according to the size of the parameter storage area of the processing cores and the size of the parameters of the subprogram includes:
determining, according to the size of the parameter storage area of the processing cores and the size of the parameters of the subprogram, the number of parameters of the subprogram that the parameter storage area of the processing cores can store; and
determining the position of the synchronization point of the subprogram according to the number of parameters of the subprogram that the parameter storage area of the processing cores can store.
Further, after the positions of the synchronization points are determined, the method further includes:
adding a synchronization instruction at the position of each synchronization point, wherein the synchronization instruction is used to cause the system including multiple processing cores to generate a synchronization signal.
Further, the execution program includes multiple program segments, and each program segment includes instructions from the original program and the control instructions required by the processing core to execute those instructions.
Further, the original program is a neural network, and the subprogram is one layer of sub-network of the neural network.
Further, acquiring the total amount of data of each subprogram in the original program includes:
acquiring the neural network;
analyzing the size of the input data and the size of the output data of each layer of sub-network in the neural network; and
generating the total amount of data of each layer of sub-network in the neural network.
In a second aspect, an embodiment of the present disclosure provides a chip, including:
multiple processing cores and a synchronization signal generator, wherein each processing core includes a data storage area and a parameter storage area;
wherein the multiple processing cores are configured to be grouped according to an execution program, the processing cores in each group being used to execute multiple program segments of the execution program; the data storage area is used to store the input data and output data of the multiple program segments, and the parameter storage area is used to store the parameters of the multiple program segments; and
the synchronization signal generator is configured to send a synchronization signal to all processing cores when all of the processing cores executing the program segments have finished execution.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: a memory for storing computer-readable instructions; and one or more processors for running the computer-readable instructions such that, when running, the processors implement the compilation method for an execution program according to any one of the foregoing first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to execute the compilation method for an execution program according to any one of the foregoing first aspect.
In a fifth aspect, an embodiment of the present disclosure provides a computer program product comprising computer instructions which, when executed by a computing device, enable the computing device to execute the compilation method for an execution program according to any one of the foregoing first aspect.
In a sixth aspect, an embodiment of the present disclosure provides a computing apparatus comprising the chip according to any one of the foregoing second aspect.
Embodiments of the present disclosure disclose a compilation method for an execution program, a chip, an electronic device and a computer-readable storage medium. The compilation method for the execution program includes: determining, according to attributes of the data of an original program, the number of processing cores required to execute the original program; determining the positions of synchronization points in the original program according to the parameters of the original program; and compiling the original program into an execution program according to the number of processing cores and the positions of the synchronization points. The above method generates the execution program on the basis of the attributes and parameters of the input and output data of the original program, which solves the technical problem in the prior art that an execution program needs to frequently access external memory while being executed.
The above description is only an overview of the technical solution of the present disclosure. In order that the technical means of the present disclosure can be understood more clearly and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present disclosure more apparent and understandable, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent in combination with the accompanying drawings and with reference to the following detailed description. Throughout the drawings, identical or similar reference numerals denote identical or similar elements. It should be understood that the drawings are schematic and that parts and elements are not necessarily drawn to scale.
Figures 1a and 1b are schematic diagrams of the prior art;
Figure 2 is a schematic flowchart of a compilation method for an execution program provided by an embodiment of the present disclosure;
Figure 3 is a further schematic flowchart of the compilation method for an execution program provided by an embodiment of the present disclosure;
Figure 4 is a further schematic flowchart of the compilation method for an execution program provided by an embodiment of the present disclosure;
Figure 5 is a further schematic flowchart of the compilation method for an execution program provided by an embodiment of the present disclosure;
Figure 6a is an example diagram of a chip provided by an embodiment of the present disclosure;
Figure 6b is a schematic diagram of the grouping of the processing cores when the chip provided by an embodiment of the present disclosure executes the execution program;
Figure 7 is a schematic diagram of the neural network to be compiled in an embodiment of the present disclosure;
Figure 8 is a schematic diagram of how the two processing cores in the first group split the data in an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although some embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as being limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are only for exemplary purposes and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps recorded in the method embodiments of the present disclosure may be executed in different orders and/or in parallel. Furthermore, the method embodiments may include additional steps and/or omit performing the steps shown. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules or units, and are not used to limit the order of, or the interdependence between, the functions performed by these apparatuses, modules or units.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of the messages or information exchanged between multiple apparatuses in the embodiments of the present disclosure are used for illustrative purposes only and are not intended to limit the scope of these messages or information.
Figure 2 is a schematic flowchart of a compilation method for an execution program provided by an embodiment of the present disclosure. The compilation method for the execution program is used in a system that includes multiple processing cores, where the processing cores include storage areas for storing the relevant data of the execution program. The relevant data of the execution program includes the input data, output data, program instruction data, parameter data, etc. of the execution program; correspondingly, the storage areas include a data storage area for storing the input data and output data of the execution program, a program storage area for storing the program instruction data, and a parameter storage area for storing the parameter data.
The method includes:
Step S201: determine, according to attributes of the data of the original program, the number of processing cores required to execute the original program.
In this step, the attributes of the data of the original program include the total amount of input and output data of the original program, and so on. In this step, the data of the original program are analyzed, an allocation scheme for the processing cores is planned according to the data attributes of the original program, and the number of processing cores needed to compute the original program in parallel is determined.
Optionally, step S201 includes:
Step S301: acquire the total amount of data of each subprogram in the original program, wherein the total amount of data of a subprogram includes the sum of the size of its input data and the size of its output data;
Step S302: determine the number of processing cores required to execute the original program according to the total amount of data of each subprogram and the size of the data storage area of the processing cores.
Exemplarily, the original program is a neural network, and a subprogram is one layer of sub-network of the neural network. In this optional embodiment, the original program includes multiple subprograms; when executed, the original program is executed subprogram by subprogram in order, and the output data of a subprogram serve either as the input data of the next subprogram or as the output data of the original program.
When the original program is a neural network, step S301 includes:
acquiring the neural network; analyzing the size of the input data and the size of the output data of each layer of the neural network; and generating the total amount of data of each layer of the neural network. A neural network can usually be represented in the form of a graph, and each layer includes the size of its input data and the size of its output data, e.g. the dimensions of the input and output data. The total amount of data of each layer of sub-network can therefore be obtained by analyzing the graph that represents the neural network.
It can be understood that the original program may be any of various other types of programs, and the subprograms of the original program may be program modules of the original program divided by functional module, or program modules divided according to the order of program execution, each producing output data from input data.
Since intermediate data, such as the output data of the above subprograms, may exist while the original program is executed, and in order that no intermediate data need to access storage areas outside the processing cores, the total amount of data of each subprogram of the original program is first calculated in this step.
After the total amount of data of each subprogram of the original program has been obtained, in step S302 the size of the data storage area of the processing cores is acquired, and the number of processing cores required to execute the original program is calculated according to the total amount of data of the subprograms and the size of the data storage area.
Since the total amount of data of each subprogram may differ, in order to be able to execute the complete original program, the largest of the numbers of processing cores required to execute the individual subprograms can be chosen as the number of processing cores required to execute the original program. Optionally, step S302 therefore includes:
Step S401: calculate, for each subprogram, the quotient of its total amount of data and the size of the data storage area of the processing cores;
Step S402: take the largest of these quotients, rounded up, as the number of processing cores required to execute the original program.
In step S401, assuming that the data storage areas of all processing cores are the same size, multiple quotients are obtained by dividing the total amount of data of each subprogram by the size of the data storage area, and a quotient may or may not be an integer. Therefore, in step S402, the largest of these quotients, rounded up, is taken as the number of processing cores required to execute the original program. When the data storage areas of the processing cores differ in size, the smallest data storage area among the processing cores can be used as the size of the data storage area when computing the quotients, so as to ensure that every processing core can store the input data and output data of every subprogram without using external memory. It can be understood that in steps S401 and S402 the minimum number of processing cores required to execute each subprogram is calculated, and the largest of these numbers is then taken as the minimum number of processing cores required to execute the original program.
In step S201, once the number of processing cores required to execute the original program has been obtained, the multiple processing cores are grouped according to this number; the number of processing cores in each group is the required number of processing cores, and the groups can execute multiple original programs in parallel.
Let the number of processing cores required to execute the original program be N_min. After the multiple processing cores have been grouped, the number of processing cores in each group is N_g, which must satisfy N_g ≥ N_min. If the number of processing cores in the system is N, then N = A * N_g, where A is a positive integer, namely the number of groups. For example, if N_min = 2 and N = 9, then N_g = 3 and A = 3.
After the number of processing cores required to execute the original program has been calculated in this step, the amount of input data of each processing core can be calculated further, i.e. the input data of each subprogram are distributed evenly over the processing cores of a group. Specifically, the input data can be split into N_g equal parts; in some cases, however, the input data of different processing cores may overlap, in which case N_g must be adjusted when the number of processing cores per group is calculated, so that N_g satisfies: M_0 ≥ D_m + D_in/N_g, where M_0 is the size of the data storage area of a processing core and D_m is the incremental data produced by the overlapping data for each additional processing core; the input data that each processing core needs to be allocated is then D_m + D_in/N_g.
In this way, through step S201, the grouping plan of the multiple processing cores of the system that executes the original program can be completed during the compilation stage of the original program.
Returning to Figure 2, the compilation method for the execution program further includes:
Step S202: determine the positions of the synchronization points in the original program according to the parameters of the original program.
The instructions of the original program between two synchronization points are the program instructions that the processing cores need to execute in one synchronization cycle.
Optionally, step S202 includes: determining the positions of the synchronization points of the multiple subprograms according to the size of the parameter storage area of the processing cores and the sizes of the parameters of the multiple subprograms.
The parameter storage area of a processing core is used to store the parameters of the original program, for example parameters such as the size, weight values and stride of the convolution kernels used by each layer of sub-network of a convolutional neural network. The size of the parameter storage area in the processing core determines the amount of program instructions that can be executed in one synchronization cycle. If the parameter storage area in the processing core can store all the parameters of one layer of sub-network, then within one synchronization cycle the processing core can complete the computation of one layer of sub-network without reading parameters from storage areas outside the processing core, and the synchronization point can be placed at the end of the program instructions of each sub-network. In some cases, however, the parameter storage area of the processing core is relatively small and cannot store all the parameters of one layer of sub-network; in that case, synchronization points need to be inserted inside the sub-network in order to determine the position up to which execution can proceed using the parameters stored in the parameter storage area.
Further, step S202 includes:
Step S501: determine, according to the size of the parameter storage area of the processing cores and the size of the parameters of the subprogram, the number of parameters of the subprogram that the parameter storage area of the processing cores can store;
Step S502: determine the position of the synchronization point of the subprogram according to the number of parameters of the subprogram that the parameter storage area of the processing cores can store.
In this embodiment, the number of parameters that the parameter storage area can store can be determined from the size of the parameters of the subprogram and the size of the parameter storage area, and the positions of the synchronization points of the multiple subprograms are determined according to this number; for example, if the size of the parameters of a subprogram is 50 KB and the size of the parameter storage area is 25 KB, a synchronization point needs to be inserted at the midpoint of the subprogram. This determination of synchronization point positions is performed for every subprogram, giving the positions of all synchronization points of the original program.
Further, after the positions of the synchronization points have been obtained, step S202 also includes: adding a synchronization instruction at the position of each synchronization point, wherein the synchronization instruction is used to cause the system including multiple processing cores to generate a synchronization signal. That is, after a processing core has executed the program instructions between synchronization points, it goes on to execute the synchronization instruction. Optionally, after executing the synchronization instruction, the processing core generates a synchronization request signal requesting the system including multiple processing cores to generate a synchronization signal; the system includes a synchronization signal generator which, after receiving the synchronization request signals sent by every processing core participating in program execution in the system, generates a synchronization signal so that the multiple processing cores enter the next synchronization cycle and execute the subsequent program instructions.
Returning to Figure 2, the compilation method for the execution program further includes:
Step S203: compile the original program into an execution program according to the number of processing cores and the positions of the synchronization points.
The execution program includes multiple program segments, which are generated by taking the positions of the synchronization points as boundaries; each program segment includes instructions from the original program together with the control instructions that the processing core needs in order to execute those instructions. As described above, synchronization points are inserted into subprograms and serve as the boundaries between program segments, so the instructions contained in one program segment may be some or all of the instructions of one subprogram, or instructions of several subprograms.
The control instructions are used by a processing core, within each synchronization cycle, to read the parameters and/or the next program segment, etc., that are needed for the next synchronization cycle. The end of a program segment also includes the above-mentioned synchronization instruction, which is used to generate a synchronization request signal. The number of processing cores is used to generate allocation information and/or grouping information in the execution program, which the system including multiple processing cores uses, when executing the execution program, to group its processing cores and to allocate program segments, parameters, input data and so on.
Through the above compilation method, the original program is compiled into an execution program suitable for execution by a multi-processing-core system, and the basis for generating the execution program is the attributes and parameters of the input and output data of the original program. This strengthens the fit between the execution program and the multi-processing-core system and increases the effective computing power of the multi-processing-core system. In addition, the intermediate data generated by the original program all move within the multi-processing-core system and do not need to be exchanged with external memory, which reduces latency, reduces the pressure on the bandwidth of the external memory, and also reduces the power consumption of the whole multi-processing-core system.
Figure 6a is an example of a schematic structural diagram of a system including multiple processing cores provided by an embodiment of the present disclosure. As shown in Figure 6a, in this example the system including multiple processing cores is a chip, and the chip 600 includes:
multiple processing cores 601 and a synchronization signal generator 602, wherein each processing core includes a data storage area and a parameter storage area;
wherein the multiple processing cores 601 are configured to be grouped according to the execution program, the processing cores in each group being used to execute multiple program segments of the execution program; the data storage area is used to store the input data and output data of the multiple program segments, and the parameter storage area is used to store the parameters of the multiple program segments; and
the synchronization signal generator 602 is configured to send a synchronization signal to all processing cores when all processing cores executing the program segments have finished execution.
The compilation process of the execution program described above is illustrated below using the structure of the chip shown in Figure 6a. As shown in Figure 6a, the chip 600 includes four processing cores, C1, C2, C3 and C4; each processing core includes a 1 MB data storage area and also includes a parameter storage area and a program storage area (neither is shown). An external memory, a DDR, is connected outside the chip and is used to store the input data, the parameters and the final output data of the original program.
A 2-layer neural network is taken as an example of the original program; its structure and the total amounts of input and output data of each layer are shown in Figure 7. The input data of the first layer L1 of the neural network is 400 KB and its output data is 800 KB; the input data of the second layer L2 of the neural network is the same size as the output data of the first layer, 800 KB, and its output data is 10 KB.
The above compilation method for the execution program is executed by a neural network compiler. According to the embodiment of the compilation method described above, the neural network compiler first executes step S301, analyzes the neural network, obtains for each layer the sum of the sizes of the input data and output data, and generates the per-layer data totals of Table 1:
Layer  InData (KB)  OutData (KB)  Total Data (KB)
L1     400          800           1200
L2     800          10            810
Table 1
Then the steps of step S302 above are executed, and the number of processing cores required to execute the original program is calculated from the total amount of data of each layer and the size of the data storage area of the processing cores; if the calculation produces a fraction, it is rounded up to an integer. The results are shown in Table 2 below:
Layer  Calculation        Core Number
L1     1200/1000 = 1.2    2
L2     810/1000 = 0.8     1
Table 2
This result means that 2 processing cores are needed to execute the sub-network program of the first layer and 1 processing core is needed to execute the sub-network program of the second layer. Since the layers of a neural network must be executed one after another and the sub-networks cannot be executed in parallel with each other, at least 2 processing cores are needed to execute the neural network, i.e. the number of processing cores required to execute the neural network is N_min = 2.
Next, the number of processing cores N_g in each group of the chip's processing core grouping is determined. N_g must satisfy the following two conditions in order to maximize the utilization of the chip's computing power:
a. N_g >= N_min;
b. N = A * N_g, where A is a positive integer.
In the above example, the chip includes 4 processing cores, so N = 4 and N_min = 2, and it follows that when A = 2, N_g = N_min = 2; that is, the 4 processing cores of the chip are divided into two groups of 2 cores each, the cores within a group compute the same task in parallel, and the 2 groups can process 2 tasks in parallel. Figure 6b shows the grouping of the processing cores of the chip shown in Figure 6a, where C1 and C2 form the first group, Group1, and C3 and C4 form the second group, Group2.
For each group, the input and output data of each layer of the neural network are split according to the number of processing cores N_g in the group, so that they can be computed in parallel on the two processing cores and the computational load of the two processing cores is balanced. The two processing cores use the same parameters in the same synchronization cycle.
Figure 8 is a schematic diagram of how the two processing cores in the first group split the data. As shown in Figure 8, the input data of L1 are split first. The input data of L1 is 400 KB; it is distributed to the two processing cores for computation and split into two sub-inputs of 200 KB (not necessarily exactly equal halves: in some cases the two processing cores use part of the same input data, and this part of the input data must then be given to both processing cores, so both sub-inputs will be larger than 200 KB), which are allocated to C1 and C2 in Group1 as their input data. When they finish the computation of the first neural network layer L1, each produces half of L1's 800 KB of output data, i.e. a 400 KB sub-output. C1 and C2 use their own L1 sub-outputs as the sub-inputs of their respective L2 computations (the output of a core's L1 cannot necessarily be used directly as the input of its own L2: some of C1's L1 sub-output may need to be given to C2 and used, together with C2's own L1 sub-output, as the sub-input of C2's L2, and likewise some of C2's L1 sub-output may need to be given to C1 as input of its L2; this data exchange, however, takes place entirely in the storage areas inside the chip and does not require reading or writing the off-chip DDR memory). When both cores finish the L2 computation, each produces half of L2's 10 KB of output data, a 5 KB sub-output, and the two sub-outputs are combined into the total 10 KB of output data. The chip has thereby completed one full run of the execution program. The situation of Group2 is similar to that of Group1, only with different input data; in this way, the chip in Figure 6a can execute two neural network computation tasks in parallel.
After the division scheme of the input data has been obtained, the embodiment of step S202 continues: the positions of the synchronization points are determined from the parameters of each layer and the size of the parameter storage area inside the processing cores, so that in each synchronization cycle every processing core can read parameters from the parameter storage area inside the processing core for the neural network computation, and during that synchronization cycle the parameters to be used in the next synchronization cycle can be read from the off-chip storage, i.e. the DDR, into the parameter storage area inside the processing core. According to the positions of the synchronization points, synchronization instructions are inserted at those positions.
Then the embodiment of step S203 continues: the program segments to be executed by the N_g processing cores are generated. This process can call a traditional compiler to generate the executable program segments, each program segment being generated from the part of the neural network program between two synchronization points.
When the chip executes the execution program, the input data are split in the manner shown in Figure 8, after which the processing cores execute the program segments of the execution program using the input data and parameters. Each time a synchronization instruction is reached, a synchronization request is generated and sent to the synchronization signal generator; after the synchronization signal generator has received the synchronization requests sent by every processing core of the chip, it generates a synchronization signal and sends it to every processing core, so that every processing core enters the next synchronization cycle and continues to execute the program segments of the execution program with the new parameters, until execution of the execution program finishes and the output result is obtained.
The above compilation method for the execution program of the present disclosure can be executed when the number of processing cores is known in advance; in that case the execution program obtained after compilation can carry allocation information, including the grouping information of the processing cores and the way in which the input data (including the intermediate data that serve as input data of other layers) are divided, and so on. The compilation method can also be used when the number of processing cores is not known in advance; in that case the execution program obtained after compilation can carry the number of processing cores required to execute the execution program and the grouping strategy for the processing cores. When the execution program is executed, the number of processing cores of the current chip is first obtained, the optimal grouping for executing the execution program on the current chip is computed from the grouping strategy and the number of processing cores required to execute the execution program, and tasks are then allocated to the processing cores of each group according to that grouping.
It can be seen from the above example that the original program is compiled with the compilation method of the embodiments of the present disclosure and the execution program is executed with the chip of the embodiments of the present disclosure. When the execution program is executed, each processing core uses the same parameters and therefore performs the same amount of computation in every synchronization cycle, so that the computation times of all processing cores are the same in every synchronization cycle. This avoids the loss of computing power that would arise if the computation times of different processing cores differed and the processing cores that finish first had to wait for the processing cores that finish later, and thereby greatly increases the effective computing power of the chip. In addition, because all processing cores use the same parameters, the parameters only need to be read from the DDR once and can then be shared by all processing cores, which greatly increases the reuse rate of the parameters, lowers the demand on DDR bandwidth and also lowers power consumption.
An embodiment of the present disclosure also provides an electronic device, including: a memory for storing computer-readable instructions; and one or more processors for running the computer-readable instructions such that, when running, the processors implement the compilation method for an execution program according to any of the embodiments.
An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to execute the compilation method for an execution program according to any of the foregoing embodiments.
An embodiment of the present disclosure also provides a computer program product including computer instructions which, when executed by a computing device, enable the computing device to execute the compilation method for an execution program according to any of the foregoing embodiments.
An embodiment of the present disclosure also provides a computing apparatus including the chip according to any of the embodiments.
The flowcharts and block diagrams in the accompanying drawings of the present disclosure illustrate the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware, and the name of a unit does not, in some cases, constitute a limitation on the unit itself.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Claims (10)

  1. A compilation method for an execution program, used in a system including multiple processing cores, characterized by comprising:
    determining, according to attributes of the data of an original program, the number of processing cores required to execute the original program;
    determining the positions of synchronization points in the original program according to parameters of the original program; and
    compiling the original program into an execution program according to the number of processing cores and the positions of the synchronization points.
  2. The compilation method for an execution program according to claim 1, characterized in that determining, according to the attributes of the data of the original program, the number of processing cores required to execute the original program comprises:
    acquiring the total amount of data of each subprogram in the original program, wherein the total amount of data of a subprogram includes the sum of the size of its input data and the size of its output data; and
    determining the number of processing cores required to execute the original program according to the total amount of data of each subprogram and the size of the data storage area of the processing cores.
  3. The compilation method for an execution program according to claim 2, characterized in that calculating the number of processing cores required to execute the original program according to the total amount of data of each subprogram and the size of the data storage area of the processing cores comprises:
    calculating, for each subprogram, the quotient of its total amount of data and the size of the data storage area of the processing cores; and
    taking the largest of these quotients, rounded up, as the number of processing cores required to execute the original program.
  4. The compilation method for an execution program according to any one of claims 1 to 3, characterized in that determining the positions of the synchronization points in the original program according to the parameters of the original program comprises:
    determining the positions of the synchronization points of the multiple subprograms according to the size of the parameter storage area of the processing cores and the sizes of the parameters of the multiple subprograms.
  5. The compilation method for an execution program according to claim 4, characterized in that determining the position of a synchronization point of a subprogram according to the size of the parameter storage area of the processing cores and the size of the parameters of the subprogram comprises:
    determining, according to the size of the parameter storage area of the processing cores and the size of the parameters of the subprogram, the number of parameters of the subprogram that the parameter storage area of the processing cores can store; and
    determining the position of the synchronization point of the subprogram according to the number of parameters of the subprogram that the parameter storage area of the processing cores can store.
  6. The compilation method for an execution program according to any one of claims 1 to 5, characterized by further comprising, after the positions of the synchronization points have been determined:
    adding a synchronization instruction at the position of each synchronization point, wherein the synchronization instruction is used to cause the system including multiple processing cores to generate a synchronization signal.
  7. The compilation method for an execution program according to any one of claims 1 to 6, characterized in that:
    the execution program includes multiple program segments, and each program segment includes instructions from the original program together with the control instructions that the processing cores need in order to execute those instructions.
  8. The compilation method for an execution program according to any one of claims 2 to 7, characterized in that the original program is a neural network and a subprogram is one layer of sub-network of the neural network.
  9. The compilation method for an execution program according to claim 8, characterized in that acquiring the total amount of data of each subprogram in the original program comprises:
    acquiring the neural network;
    analyzing the size of the input data and the size of the output data of each layer of sub-network in the neural network; and
    generating the total amount of data of each layer of sub-network in the neural network.
  10. A chip, characterized by comprising:
    multiple processing cores and a synchronization signal generator, wherein each processing core includes a data storage area and a parameter storage area;
    wherein the multiple processing cores are configured to be grouped according to an execution program, the processing cores in each group being used to execute multiple program segments of the execution program; the data storage area is used to store the input data and output data of the multiple program segments, and the parameter storage area is used to store the parameters of the multiple program segments; and
    the synchronization signal generator is configured to send a synchronization signal to all processing cores when all processing cores executing the program segments have finished execution.
PCT/CN2020/141941 2020-12-31 2020-12-31 执行程序的编译方法、芯片、电子设备及计算机可读存储介质 WO2022141344A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080108193.6A CN116710930A (zh) 2020-12-31 2020-12-31 执行程序的编译方法、芯片、电子设备及计算机可读存储介质
PCT/CN2020/141941 WO2022141344A1 (zh) 2020-12-31 2020-12-31 执行程序的编译方法、芯片、电子设备及计算机可读存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/141941 WO2022141344A1 (zh) 2020-12-31 2020-12-31 执行程序的编译方法、芯片、电子设备及计算机可读存储介质

Publications (1)

Publication Number Publication Date
WO2022141344A1 true WO2022141344A1 (zh) 2022-07-07

Family

ID=82258806

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141941 WO2022141344A1 (zh) 2020-12-31 2020-12-31 执行程序的编译方法、芯片、电子设备及计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN116710930A (zh)
WO (1) WO2022141344A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639901A (zh) * 2009-09-03 2010-02-03 王连明 基于多核技术的前馈神经网络硬件实现方法
CN109726806A (zh) * 2017-10-30 2019-05-07 上海寒武纪信息科技有限公司 信息处理方法及终端设备
CN109754073A (zh) * 2018-12-29 2019-05-14 北京中科寒武纪科技有限公司 数据处理方法、装置、电子设备和可读存储介质
US20200242189A1 (en) * 2019-01-29 2020-07-30 Hewlett Packard Enterprise Development Lp Generation of executable files corresponding to neural network models

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639901A (zh) * 2009-09-03 2010-02-03 王连明 基于多核技术的前馈神经网络硬件实现方法
CN109726806A (zh) * 2017-10-30 2019-05-07 上海寒武纪信息科技有限公司 信息处理方法及终端设备
CN109754073A (zh) * 2018-12-29 2019-05-14 北京中科寒武纪科技有限公司 数据处理方法、装置、电子设备和可读存储介质
US20200242189A1 (en) * 2019-01-29 2020-07-30 Hewlett Packard Enterprise Development Lp Generation of executable files corresponding to neural network models

Also Published As

Publication number Publication date
CN116710930A (zh) 2023-09-05

Similar Documents

Publication Publication Date Title
US11093526B2 (en) Processing query to graph database
Cai et al. DGCL: An efficient communication library for distributed GNN training
WO2022022670A1 (zh) 神经网络计算图的处理方法、装置及处理设备
Mojumder et al. Profiling dnn workloads on a volta-based dgx-1 system
CN103049241B (zh) 一种提高cpu+gpu异构装置计算性能的方法
CN111274016B (zh) 基于模块融合的动态部分可重构系统应用划分与调度方法
JP2014525640A (ja) 並列処理開発環境の拡張
CN104536937A (zh) 基于cpu-gpu异构集群的大数据一体机实现方法
Hu et al. Trix: Triangle counting at extreme scale
Shi et al. MG-WFBP: Merging gradients wisely for efficient communication in distributed deep learning
CN110659278A (zh) 基于cpu-gpu异构架构的图数据分布式处理系统
Zhang et al. Fine-grained multi-query stream processing on integrated architectures
Mahafzah et al. The hybrid dynamic parallel scheduling algorithm for load balancing on chained-cubic tree interconnection networks
US20170371713A1 (en) Intelligent resource management system
Yang et al. Fast All-Pairs Shortest Paths Algorithm in Large Sparse Graph
Kobus et al. Gossip: Efficient communication primitives for multi-gpu systems
WO2022141344A1 (zh) 执行程序的编译方法、芯片、电子设备及计算机可读存储介质
Davidović et al. Parallel local search to schedule communicating tasks on identical processors
Messina et al. Exploiting gpus to simulate complex systems
CN114691142A (zh) 执行程序的编译方法、芯片、电子设备及计算机可读存储介质
Chen et al. Mixed volume computation in parallel
Zou et al. Direction-optimizing breadth-first search on CPU-GPU heterogeneous platforms
Zheng et al. Galliot: Path merging based betweenness centrality algorithm on GPU
Karnagel et al. Heterogeneous placement optimization for database query processing
Czarnul et al. Auto-tuning methodology for configuration and application parameters of hybrid CPU+ GPU parallel systems based on expert knowledge

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20967683

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202080108193.6

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20967683

Country of ref document: EP

Kind code of ref document: A1