CN113553054A - Heterogeneous system based compiling method, device, equipment and storage medium


Info

Publication number
CN113553054A
CN113553054A
Authority
CN
China
Prior art keywords
graph
computing
region
computation
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110747965.7A
Other languages
Chinese (zh)
Inventor
蒋国跃
张力
杨柳西
高鹏
张广飞
詹克团
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Suneng Technology Co ltd
Original Assignee
Beijing Suneng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Suneng Technology Co ltd filed Critical Beijing Suneng Technology Co ltd
Priority to CN202110747965.7A
Publication of CN113553054A

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/40 - Transformation of program code
    • G06F8/41 - Compilation
    • G06F8/42 - Syntactic analysis
    • G06F8/427 - Parsing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/04 - Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a compiling method, apparatus, device and storage medium based on a heterogeneous system, where the heterogeneous system includes a plurality of computing cores with different architectures. The method includes: obtaining a computation graph corresponding to source language code, where the source language code is written based on a programming model and the programming model uses a scalar programming language and a tensor programming language to describe algorithms; dividing the computation graph into at least one target graph region, where the at least one target graph region includes a scalar graph region containing scalar computation nodes and/or a tensor graph region containing tensor computation nodes; generating a corresponding binary instruction code segment for each target graph region; and generating a binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and the binary instruction code segments. With this method, when compiling and developing on a heterogeneous system, the source language code can describe algorithms in both a scalar programming language and a tensor programming language, which improves development efficiency.

Description

Heterogeneous system based compiling method, device, equipment and storage medium
Technical Field
The present disclosure relates to, but is not limited to, the field of compilation development, and in particular to a compiling method, apparatus, device and storage medium based on a heterogeneous system.
Background
With the rise of computing-intensive fields such as artificial intelligence, high-performance data analysis and financial analysis, the traditional general-purpose computing mode can no longer meet the demand for computing capacity, and a computing mode with greater computing power, namely heterogeneous computing, has therefore been proposed.
Heterogeneous computing is a special form of parallel, distributed computing, and mainly refers to the computing mode of a system composed of computing units with different types of instruction sets and architectures. In a heterogeneous computing system, a common heterogeneous system (also referred to as a heterogeneous device) may include a combination of different computing cores, such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Tensor Processing Unit (TPU), an Application Specific Integrated Circuit (ASIC), and so on.
In a heterogeneous system, each computing core usually has its own instruction set and programming model, so when a program is developed on a heterogeneous system, each computing core needs to be programmed and compiled separately according to its own programming model, resulting in low development efficiency.
Disclosure of Invention
The present disclosure provides a compiling method, apparatus, device and storage medium based on a heterogeneous system, so as to implement unified compiling of the heterogeneous system and improve development efficiency.
According to a first aspect of embodiments of the present disclosure, the present disclosure provides a heterogeneous-system-based compiling method, where the heterogeneous system includes a plurality of computing cores with different architectures. The method includes: obtaining a computation graph corresponding to source language code, where the source language code is written based on a programming model and the programming model uses a scalar programming language and a tensor programming language to describe algorithms; dividing the computation graph into at least one target graph region, where the at least one target graph region includes a scalar graph region containing scalar computation nodes and/or a tensor graph region containing tensor computation nodes; generating a corresponding binary instruction code segment for each target graph region, where the binary instruction code segment corresponding to the scalar graph region is executable in a first computing core for scalar computation and the binary instruction code segment corresponding to the tensor graph region is executable in a second computing core for tensor computation; and generating a binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and the binary instruction code segments.
In the above scheme, the programming model includes a built-in function interface that is written in a tensor programming language and can run in the second computing core; the built-in function interface is called to perform tensor computation on tensor data.
In the above scheme, obtaining a computation graph corresponding to the source language code includes: translating the source language code into a first intermediate representation, where the first intermediate representation is used to represent scalar data and/or tensor data; and generating a computation graph based on the first intermediate representation, where the computation nodes in the computation graph represent functions corresponding to the source language code, and the directed edges in the computation graph represent the dependency relationships between the functions.
In the above scheme, after generating the computation graph based on the first intermediate representation, the method further includes: performing optimization processing on the computation graph, where the optimization processing includes at least one of: deleting redundant computation nodes, fusing computation nodes, equivalently converting complex computations into simple computations, and merging loop blocks.
In the above scheme, dividing the computation graph into at least one target graph region includes: dividing the computation nodes in the computation graph that have dependency relationships and the same computation type into a computation region; counting the mappable computing cores of each computation node in each computation region; dividing the computation nodes in each computation region that have dependency relationships and can be mapped to the same computing core into a graph region; and determining at least one target graph region from the graph regions in each computation region.
In the above scheme, when the computation region is a scalar computation region, determining at least one target graph region from the graph regions in each computation region includes: determining the graph regions as the at least one target graph region.
In the above scheme, when the computation region is a tensor computation region, determining at least one target graph region from the graph regions in each computation region includes: creating a graph region set for the i-th graph region, where i is a positive integer; traversing the graph region set, and dividing the computation nodes in the i-th graph region that have dependency relationships and can be mapped to the same computing core into a candidate graph region; determining a first profit value of the i-th graph region according to the execution durations of the candidate graph regions of the i-th graph region on the corresponding computing cores; determining, from the first profit values, a second profit value that is greater than a preset profit value and has the largest value; updating the graph region set with the candidate graph regions corresponding to the second profit value, and returning to the step of traversing the graph region set, until all first profit values are less than or equal to the preset profit value; and determining the candidate graph regions corresponding to the first profit values as target graph regions.
In the above scheme, dividing the computation graph into at least one target graph region includes: dividing the computation nodes in the computation graph that have dependency relationships and the same type into a computation region; calculating the execution duration of each computation node in each computation region on its mappable computing cores, and determining the mappable computing core with the minimum execution duration as the target computing core of the computation node; merging the computation nodes with the same target computing core in each computation region into a graph region; and determining at least one target graph region from the merged graph regions.
In the above scheme, generating a corresponding binary instruction code segment for each target graph region includes: determining the mapping relationship between each of the at least one target graph region and a computing core; converting each target graph region into a second intermediate representation according to the mapping relationship; and generating the binary instruction code segment based on the second intermediate representation.
In the above scheme, the first computing core includes at least one of: CPU, DSP; the second computing core includes at least one of: CPU, TPU, GPU, DSP, ASIC, FPGA.
According to a second aspect of embodiments of the present disclosure, there is provided a compiling apparatus for compiling programs for a heterogeneous system, where the heterogeneous system includes a plurality of computing cores with different architectures. The compiling apparatus includes: a programming front-end module, configured to obtain a computation graph corresponding to source language code, where the source language code is written based on a programming model and the programming model uses a scalar programming language and a tensor programming language to describe algorithms; a computation graph processing module, configured to divide the computation graph into at least one target graph region, where the at least one target graph region includes a scalar graph region containing scalar computation nodes and/or a tensor graph region containing tensor computation nodes; and a computation core compiling module, configured to generate a corresponding binary instruction code segment for each target graph region, and to generate a binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and the binary instruction code segments, where the binary instruction code segment corresponding to the scalar graph region is executable in a first computing core for scalar computation and the binary instruction code segment corresponding to the tensor graph region is executable in a second computing core for tensor computation.
In the above scheme, the programming model includes a built-in function interface that is written in a tensor programming language and can run in the second computing core; the built-in function interface is called to perform tensor computation on tensor data.
In the above scheme, the programming front-end module is configured to convert the source language code into a first intermediate representation, where the first intermediate representation is used to represent scalar data and/or tensor data; and to generate a computation graph based on the first intermediate representation, where the computation nodes in the computation graph represent functions corresponding to the source language code, and the directed edges in the computation graph represent the dependency relationships between the functions.
In the above scheme, the programming front-end module is further configured to perform optimization processing on the computation graph, where the optimization processing includes at least one of: deleting redundant computation nodes, fusing computation nodes, equivalently converting complex computations into simple computations, and merging loop blocks.
In the above scheme, the computation graph processing module is configured to divide the computation nodes that have dependency relationships and the same computation type into a computation region; count the mappable computing cores of each computation node in each computation region; divide the computation nodes in each computation region that have dependency relationships and can be mapped to the same computing core into a graph region; and determine at least one target graph region from the graph regions in each computation region.
In the above scheme, when the computation region is a scalar computation region, the computation graph processing module is configured to determine at least one graph region corresponding to the computation graph as the at least one target graph region.
In the above scheme, when the computation region is a tensor computation region, the computation graph processing module is configured to create a graph region set for the i-th graph region, where i is a positive integer; traverse the graph region set, and divide the computation nodes in the i-th graph region that have dependency relationships and can be mapped to the same computing core into a candidate graph region; determine a first profit value of the i-th graph region according to the execution durations of its candidate graph regions on the corresponding computing cores; determine, from the first profit values, a second profit value that is greater than a preset profit value and has the largest value; update the graph region set with the candidate graph regions corresponding to the second profit value, and return to the step of traversing the graph region set, until all first profit values are less than or equal to the preset profit value; and determine the candidate graph regions corresponding to the first profit values as target graph regions.
In the above scheme, the computation graph processing module is further configured to divide the computation nodes in the computation graph that have dependency relationships and the same type into a computation region; calculate the execution duration of each computation node in each computation region on its mappable computing cores, and determine the mappable computing core with the minimum execution duration as the target computing core of the computation node; merge the computation nodes with the same target computing core in each computation region into a graph region; and determine at least one target graph region from the merged graph regions.
In the above scheme, the computation core compiling module is configured to determine the mapping relationship between each of the at least one target graph region and a computing core; convert each target graph region into a second intermediate representation according to the mapping relationship; and generate the binary instruction code segment based on the second intermediate representation.
According to a third aspect of embodiments of the present disclosure, there is provided a compiling device, including: a processor; and a memory for storing processor-executable instructions; where the processor is configured to execute the executable instructions to implement the method described in the first aspect and any of its implementations.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing an executable program, where the executable program, when executed by a processor, implements the method described in the first aspect and any of its implementations.
In this method, when program development is performed based on a heterogeneous system, the source language code can describe algorithms using a scalar programming language and a tensor programming language, so scalar computation and tensor computation can be written together during programming without programming each heterogeneous computing core separately, which greatly improves development efficiency. Furthermore, through computation graph division, the same computation function can be mapped to different heterogeneous computing cores for execution, shortening the function's execution time and improving the effective utilization of computing resources.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic structural diagram of a heterogeneous system in an embodiment of the present disclosure;
FIG. 2 is a block diagram of a compiler according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart illustrating an implementation of a compiling method according to an embodiment of the disclosure;
FIG. 4 is a diagram illustrating a description of source language code in an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a computational graph in an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating an implementation flow of a partition computation graph according to an embodiment of the present disclosure;
FIG. 7 is a schematic flow chart illustrating another implementation of a partition computation graph in an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a tensor computation region in an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a compiling apparatus in an embodiment of the disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosed embodiments, as detailed in the appended claims.
The terminology used in the embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present disclosure. As used in the disclosed embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first", "second", "third", etc. may be used in the embodiments of the present disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, "first information" may also be referred to as "second information," and similarly, "second information" may also be referred to as "first information," without departing from the scope of embodiments of the present disclosure. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
With the rise of computing-intensive fields such as artificial intelligence, high-performance data analysis and financial analysis, the traditional general-purpose computing mode can no longer meet the demand for computing capacity, and a computing mode with greater computing power, namely heterogeneous computing, has therefore been proposed.
Heterogeneous computing mainly refers to the computing mode of a system composed of computing units with different types of instruction sets and architectures. In a heterogeneous computing system, common classes of heterogeneous computing cores may include CPUs, GPUs, DSPs, FPGAs, TPUs, ASICs, and so on. For example, a heterogeneous system may be a general-purpose computer integrating a CPU and a GPU, a system on chip (SoC) integrating a CPU, a GPU and a DSP, a machine learning system integrating a CPU with a TPU/FPGA/ASIC, and the like. In a heterogeneous system, each computing core usually has its own instruction set and programming model, so when a program is developed on the heterogeneous system, each computing core needs to be programmed and compiled separately according to its own programming model, resulting in low development efficiency.
In addition, consider a single computing task (such as a computing function): in a heterogeneous system, its core computing program executes on only one computing core, or one type of computing core, and the same core computing program cannot run across computing cores at the same time (that is, it can only run on one computing core or one type of computing core, and cannot run in parallel on computing cores of different types). Therefore, although multiple computing tasks can be executed in parallel, heterogeneous parallel operation cannot be achieved for a single computing task, and the effective utilization of computing resources is not high.
To solve the above problem, embodiments of the present disclosure provide a compiling method that may be applied to a heterogeneous system. Fig. 1 is a schematic structural diagram of a heterogeneous system in an embodiment of the present disclosure. Referring to the solid lines in fig. 1, a heterogeneous system 10 may include a programming program 11 (also referred to as a development program or development software) of the heterogeneous system 10 and a heterogeneous computing core 12. Optionally, the heterogeneous computing core 12 may include a plurality of computing cores of different architectures. The computing cores include one or more computing cores A 121 (e.g., a first computing core) for scalar computation and one or more computing cores B 122 (e.g., a second computing core) for tensor computation. Here, when there are a plurality of computing cores A, they are computing cores of the same architecture; similarly, when there are a plurality of computing cores B, they are computing cores of the same architecture. Illustratively, computing core A may be a CPU, a DSP, etc., and computing core B may be a CPU, a GPU, a TPU, an FPGA, a DSP, an ASIC, etc.
Still referring to fig. 1, the programming program 11 may include: a programming model module 111 and a compiler 112. The programming model module 111 is configured to carry the program description of the source language code 20 written by the user, a description that is independent of the underlying heterogeneous computing core 12. The compiler 112 is configured to compile the source language code 20 based on the heterogeneous computing cores to obtain the corresponding binary instruction sequence.
In some possible embodiments, in order to enable unified programming of heterogeneous systems, the programming program 11 may support general-purpose applications (e.g., including scalar computing functions) as well as Artificial Intelligence (AI) deep learning applications (e.g., including tensor computing functions). Of course, the programming program 11 may also support applications of other frameworks, which is not specifically limited in the embodiments of the present disclosure.
In some possible implementations, fig. 2 is a schematic structural diagram of a compiler in an embodiment of the present disclosure, and referring to fig. 2, the compiler 112 may include: a programming front end module 1121, a computation graph processing module 1122, and a computation core compilation module 1123.
In some possible embodiments, the compiler 112 may adopt a three-layer structure. The first layer is the programming front-end module 1121, which may include a programming language front-end analyzer 1121a for analyzing the source language code of general-purpose applications and a deep learning inference graph front-end analyzer 1121b for analyzing neural network graphs. The second layer is the computation graph processing module 1122, which may include a graph optimizer 1122a and a task allocator 1122b for the various heterogeneous computing cores. The third layer is the computation core compiling module 1123, which may include a compiler for each computing core, such as computing core compiler A 1123a for computing core A and computing core compiler B 1123b for computing core B. The compiler of each computing core receives the tasks issued by the second layer and generates the binary instruction code segment corresponding to its computing core, from which the binary instruction sequence corresponding to the source language code is further generated.
The following describes a compiling method provided by the embodiment of the present disclosure with reference to the above heterogeneous system.
Fig. 3 is a schematic implementation flow diagram of a compiling method in the embodiment of the present disclosure, and referring to fig. 3, the compiling method may include:
s301, the programming model module obtains source language codes written by a user according to the programming model by using the programming language.
The programming model uses a scalar programming language and a tensor programming language to describe algorithms.
In practical applications, the programming model may include scalar calculation functions written in a scalar programming language that can be executed in computing core A, and tensor calculation functions written in a tensor programming language that can be executed in computing core B (the tensor calculation functions can be understood as built-in functions of the programming model). Optionally, in order to perform tensor calculation using the tensor calculation functions, the programming model may further include a built-in function interface (i.e., a tensor calculation function interface) for calling a tensor calculation function to perform tensor calculation on input tensor data. Illustratively, the built-in function interface is an extension built on top of a scalar programming language (such as the C/C++ language).
In some possible implementations, when building the built-in function interfaces, a CPU, GPU, DSP or TPU may use its underlying instruction set to achieve its highest performance; an ASIC may implement the functionality in its own hardware; and an FPGA implementation may use the Verilog hardware description language.
In the embodiment of the disclosure, because the programming model includes both scalar calculation functions and tensor calculation functions, the difference of heterogeneous calculation cores can be shielded, and even if the heterogeneous calculation cores are changed (such as a new scalar calculation core and/or a new tensor calculation core is expanded), the source language code written according to the programming model can be unchanged.
For example, fig. 4 is a schematic diagram of source language code in an embodiment of the present disclosure. As shown in fig. 4, L2LossKernel is the function name of the source language code, and the upper box in the function is the built-in function interface. Scalar data can be regarded as 0-dimensional vector data, corresponding to oklang::scalar in fig. 4; tensor data can be regarded as vector data of 1 dimension or more, corresponding to oklang::tensor in the figure. If the tensor data is 1-dimensional, it can be understood as a vector; if 2-dimensional, as a matrix; if 3-dimensional or more, as a multidimensional vector. oklang::mul and oklang::muls in fig. 4 are built-in functions (which can also be understood as tensor calculation functions). These built-in functions can run in any computing core, as long as the underlying heterogeneous computing core supports them.
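As a concrete illustration, the following is a minimal C++ sketch of a kernel written in the spirit of the FIG. 4 description. Only the identifiers oklang::scalar, oklang::tensor, oklang::mul and oklang::muls are named in the text; the header, the function signatures, and the exact semantics of oklang::muls (assumed here to be tensor-by-scalar multiplication) are illustrative assumptions, not the patent's actual API.

// Hypothetical sketch only: header, signatures and semantics are assumed.
// oklang::mul / oklang::muls are the built-in tensor functions named in the
// text; the compiler may map them to any tensor computing core (computing
// core B) that supports them.
#include <oklang.h>  // assumed header for the extended C/C++ programming model

void ScaledSquareKernel(const oklang::tensor& x,  // >= 1-dimensional data
                        oklang::scalar alpha,     // 0-dimensional data
                        oklang::tensor& y) {
    oklang::tensor sq = oklang::mul(x, x);  // element-wise product (assumed)
    y = oklang::muls(sq, alpha);            // tensor-by-scalar product (assumed)
}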
S302, the programming front-end module generates a computation graph (computational graph) corresponding to the source language code.
As can be appreciated, the programming front-end module converts the source language code into a first intermediate representation capable of representing scalar data and/or tensor data. The programming front-end module then generates a computation graph corresponding to the source language code based on the first intermediate representation. The computation nodes in the computation graph represent functions corresponding to the source language code, and the directed edges in the computation graph represent the dependency relationships between the functions.
Illustratively, the programming front-end module converts the first intermediate representation it outputs into a computation graph consisting of nodes and directed edges connecting the nodes. Each node represents a function corresponding to the source language code (which can also be understood as a computation or an operation); the function may be a tensor calculation function (e.g., a built-in function) or a scalar calculation function. A node representing a tensor calculation function may be called a tensor computation node, and a node representing a scalar calculation function may be called a scalar computation node. FIG. 5 is a schematic diagram of a computation graph in an embodiment of the present disclosure. Referring to FIG. 5, a directed edge pointing from node u to node v indicates that node v depends on node u.
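For orientation, one possible in-memory form of such a computation graph is sketched below. The patent does not specify these data structures, so the types and fields are illustrative assumptions.

#include <string>
#include <vector>

enum class NodeKind { Scalar, Tensor };

// Illustrative sketch, not the patent's actual intermediate representation.
struct ComputeNode {
    std::string func;       // function the node represents, e.g. "oklang::mul"
    NodeKind kind;          // scalar computation node or tensor computation node
    std::vector<int> deps;  // indices of the nodes this node depends on
};

struct ComputationGraph {
    // Nodes are assumed stored in topological order; an edge u -> v is
    // recorded as u's index appearing in v.deps, meaning "v depends on u".
    std::vector<ComputeNode> nodes;
};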
Optionally, the programming model module passes the source language code written by the user to the programming front-end module. When the source language code written by the user is a neural network graph, the programming model module passes it to the deep learning inference graph front-end analyzer.
In some possible embodiments, after S302 and before S303, the method may further include: the graph optimizer performs optimization processing on the computation graph generated in S302. Illustratively, the optimization processing may include: deleting redundant computation nodes, fusing computation nodes, equivalently converting complex computations into simple computations, merging loop blocks, and the like. Of course, in practical applications, the graph optimizer may perform other graph optimizations on the computation graph, which is not specifically limited in the embodiments of the present disclosure.
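As a sketch of the first optimization listed above, deleting redundant computation nodes can be implemented as a liveness sweep over the graph structure sketched earlier. The outputs parameter and the topological-order assumption are illustrative additions, not taken from the patent.

#include <unordered_set>
#include <vector>

// Illustrative sketch of redundant-node deletion: any node that no graph
// output transitively depends on is redundant. Assumes the ComputationGraph
// sketch above, with nodes stored in topological order.
std::unordered_set<int> LiveNodes(const ComputationGraph& g,
                                  const std::vector<int>& outputs) {
    std::unordered_set<int> live(outputs.begin(), outputs.end());
    for (int v = static_cast<int>(g.nodes.size()) - 1; v >= 0; --v) {
        if (!live.count(v)) continue;                  // v itself is dead
        for (int u : g.nodes[v].deps) live.insert(u);  // keep its inputs alive
    }
    return live;  // nodes absent from the set may be deleted from the graph
}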
S303, the computation graph processing module divides the computation graph into at least one target graph region.
The at least one target graph region may include a scalar graph region containing scalar computation nodes and/or a tensor graph region containing tensor computation nodes.
Specifically, S303 can be implemented in, without being limited to, the following ways.
In a first manner, fig. 6 is a schematic implementation flow diagram of a partition calculation graph in the embodiment of the present disclosure, and referring to fig. 6, S303 may include:
S601, the task allocator divides the computation nodes that have dependency relationships and the same computation type into a computation region.
For example, the task allocator may divide at least one tensor computation node having a dependency relationship in the computation graph into one tensor computation region, and divide at least one scalar computation node having a dependency relationship into one scalar computation region. In this way, the task allocator can obtain a plurality of computation regions corresponding to the computation graph.
S602, for each computation region, the task allocator counts the mappable computing cores of each computation node.
S603, the task allocator divides the computation nodes in each computation region that have dependency relationships and can be mapped to the same computing core into a graph region.
S604, the task allocator determines at least one target graph region of the computation graph from the at least one graph region in each computation region.
In some possible embodiments, for a scalar computation region, the mappable computing cores of its scalar computation nodes are one or more computing cores A. If the heterogeneous computing cores include only one computing core A, such as a CPU, the scalar computation region may be divided into one graph region; if they include two computing cores A, such as a CPU and a DSP, the scalar computation region may be divided into two graph regions, and so on. Illustratively, the principle of graph region division is as follows: the part with more parallel scalar computation nodes is divided into one graph region, which can be mapped to the DSP; the other scalar computation nodes are divided into one or more graph regions, which are mapped to the CPU. Further, S604 may include: the task allocator determines the graph regions of each computation region as the target graph regions.
In some possible embodiments, for a tensor computation region, the mappable computing cores of each tensor computation node are one or more computing cores B. S604 may include: for each tensor computation region, creating a graph region set for the i-th graph region (i is a positive integer, i.e., any graph region of the computation region); traversing the graph region set, and dividing the computation nodes in the i-th graph region that have dependency relationships and can be mapped to the same computing core into a candidate graph region; determining a first profit value of the i-th graph region according to the execution durations of its candidate graph regions on the corresponding computing cores; determining, from the first profit values, a second profit value that is greater than a preset profit value and has the largest value; updating the graph region set with the candidate graph regions corresponding to the second profit value, returning to the step of traversing the graph region set, and iteratively calculating the profit values of mapping the i-th graph region to different computing cores, until all first profit values are less than or equal to the preset profit value; and determining the candidate graph regions corresponding to the first profit values as target graph regions. Optionally, the preset profit value may be 0.
In one embodiment, for a tensor graph region, the procedure is as follows. Step 1: the task allocator divides the tensor computation region into a graph region a0 that can be mapped to computing core B0, and creates a graph region set A = {ai | 0 <= i < m} for graph region a0, with m = 1 initially. Step 2: the task allocator traverses the graph regions in set A; when traversing graph region ai, the tensor computation nodes in ai that have dependency relationships and can be mapped to another computing core Bx (such as computing core B1, B2, ..., Bn, where n is a positive integer) are divided into a candidate graph region, which may be part of graph region ai or the whole of it. Step 3: graph region ai is thereby split into a plurality of candidate graph regions, e.g., ai1, ai2 and ai3, which may be mapped to different computing cores: candidate graph region ai2 is mapped to computing core Bx, while the mappable computing cores of candidate graph regions ai1 and ai3 are the same as the computing core mapped by graph region ai. Step 4: the task allocator determines the total execution duration ti' of the candidate graph regions divided from graph region ai on their corresponding mappable computing cores. Step 5: the task allocator subtracts the total execution duration ti' of the candidate graph regions from the execution duration ti of graph region ai on its mapped computing core, obtaining the first profit value pi = ti - ti' of graph region ai, and adds pi to the profit set P. Step 6: the task allocator selects from the profit set P the second profit value pk that is greater than the preset profit value (such as 0) and has the largest value, and determines the graph region ak corresponding to pk and the candidate graph regions of ak. Step 7: the task allocator deletes graph region ak from graph region set A and adds the divided candidate graph regions to it, obtaining the updated graph region set A. Step 8: the task allocator repeats Step 2 and traverses graph region set A again, until all first profit values in profit set P are less than or equal to the preset profit value. Step 9: the task allocator determines the candidate graph regions in graph region set A as the target graph regions.
Illustratively, fig. 8 is a schematic diagram of a tensor computation region in an embodiment of the present disclosure. As shown in fig. 8, suppose a tensor computation region is N0 → N1 → N2 → N3 → N4, where Ni (i = 0, 1, 2, 3, 4) is a tensor computation node and N0 → N1 indicates that N1 depends on N0. The heterogeneous computing cores include three computing cores B (computing core B1, computing core B2 and computing core B3). N0, N1, N2, N3 and N4 can be mapped to computing core B1; N1 and N2 can be mapped to computing core B2; and N2, N3 and N4 can be mapped to computing core B3.
First, the task allocator maps N0~N4 to computing core B1; that is, graph region a0 is N0 → N1 → N2 → N3 → N4 and the graph region set A = {a0}. The task allocator then determines that the execution duration t of graph region a0 on computing core B1 is 150×T.
Second, the task allocator traverses graph region set A. It divides N0 into a candidate graph region mapped to computing core B1, divides N1 → N2 into a candidate graph region mapped to computing core B2, and divides N3 → N4 into a candidate graph region mapped to computing core B1. Graph region a0 is thus split into three candidate graph regions, and the total execution duration t' of these candidate graph regions on their corresponding mappable computing cores is determined to be 100×T (T is the clock cycle). The first profit value of graph region a0 is calculated as p0 = t - t' = 50×T.
Third, the task allocator divides N2 → N3 → N4 in graph region a0 into a candidate graph region mapped to computing core B3. Graph region a0 is thus split into two candidate graph regions, and the total execution duration t' of these candidate graph regions is determined to be 120×T. Another first profit value of graph region a0 is calculated as p0 = t - t' = 30×T.
At this time, the profit set P is {50, 30}.
Fourth, the task allocator selects the second profit value p0 = 50×T, replaces graph region a0 with its corresponding candidate graph regions, and updates graph region set A. At this time, N1 → N2 is mapped to computing core B2, while N0 and N3 → N4 are mapped to computing core B1.
Fifth, the task allocator traverses graph region set A again. In the graph region N1 → N2, it divides N1 into a candidate graph region mappable to computing core B2 and divides N2 into a candidate graph region mapped to computing core B1; the graph region is thus split into two candidate graph regions, and a first profit value p1 = t - t' = -30×T is determined. Likewise, the task allocator divides N1 into a candidate graph region mapped to computing core B2 and N2 into a candidate graph region mapped to computing core B3; the graph region is again split into two candidate graph regions, and another first profit value p1 = t - t' = -20×T is determined.
Sixth, in the graph region containing N0 and N3 → N4, N0 has no other mappable computing core. The task allocator divides N3 → N4 into a candidate graph region mapped to computing core B3, so that the region is split into N0 (remaining on computing core B1) and the candidate graph region N3 → N4 (on computing core B3), and a further first profit value p1 = t - t' = 20×T is determined.
After graph region set A has been traversed, the profit set P is {-30, -20, 20}.
Seventh, the task allocator selects the second profit value p1 = 20×T, replaces the corresponding graph region with its candidate graph regions, and updates graph region set A. At this time, N1 → N2 is mapped to computing core B2, N0 is mapped to computing core B1, and N3 → N4 is mapped to computing core B3.
Eighth, the task allocator traverses graph region set A again and repeats the above steps to obtain the profit set P; at this point, all first profit values in P are less than or equal to 0, and the task allocator terminates the procedure.
Ninth, the task allocator determines the graph regions in the updated set A = {a0 = N1~N2, a1 = N0, a2 = N3~N4} as the target graph regions, where N1~N2 is mapped to computing core B2, N0 is mapped to computing core B1, and N3~N4 is mapped to computing core B3.
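Condensing the nine steps above, the iterative partitioning can be sketched as the following greedy loop. The GraphRegion and SplitCandidate types and the callbacks are placeholders: the patent leaves the concrete structures open, and the execution durations would come from the performance mathematical model or the trial runs described below.

#include <functional>
#include <vector>

struct GraphRegion { /* nodes of the region plus the computing core it maps to */ };

struct SplitCandidate {
    std::vector<GraphRegion> parts;  // candidate graph regions after the split
    double totalDuration = 0.0;      // t': summed duration on their mapped cores
};

// Illustrative sketch of the profit-driven loop (Steps 1-9 above).
void PartitionTensorRegion(
        std::vector<GraphRegion>& A,  // graph region set, initially {a0}
        const std::function<std::vector<SplitCandidate>(const GraphRegion&)>& enumerateSplits,
        const std::function<double(const GraphRegion&)>& duration) {
    const double presetProfit = 0.0;  // the text suggests 0 as the threshold
    for (;;) {
        double bestProfit = presetProfit;
        int bestIndex = -1;
        SplitCandidate bestSplit;
        for (int i = 0; i < static_cast<int>(A.size()); ++i) {
            for (const SplitCandidate& c : enumerateSplits(A[i])) {
                double profit = duration(A[i]) - c.totalDuration;  // p = t - t'
                if (profit > bestProfit) {
                    bestProfit = profit;
                    bestIndex = i;
                    bestSplit = c;
                }
            }
        }
        if (bestIndex < 0) break;  // all profits <= preset value: stop (Step 8)
        // Replace the split region with its candidate regions (Steps 6-7).
        A.erase(A.begin() + bestIndex);
        A.insert(A.end(), bestSplit.parts.begin(), bestSplit.parts.end());
    }
    // The regions remaining in A are the target graph regions (Step 9).
}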
In a second manner, fig. 7 is a schematic diagram of an implementation flow of another partition calculation graph in the embodiment of the present disclosure, and referring to fig. 7, S303 may include:
S701, the task allocator divides the computation nodes in the computation graph that have dependency relationships and the same computation type into a computation region.
It can be understood that the specific execution process of S701 may refer to the related description of S601, which is not repeated here.
S702, the task allocator calculates the execution duration of each computation node in each computation region on its mappable computing cores.
It can be understood that the task allocator first treats each node in each computation region as a graph region, and uses, for example, an integer linear programming algorithm, a branch and bound algorithm, or a genetic algorithm to map these graph regions to the computing cores with the goal of minimizing execution duration, obtaining the corresponding execution durations.
S703, the task allocator determines the mappable computing core with the minimum execution duration as the target computing core of the computation node.
It can be understood that the task allocator determines, among the different mappable computing cores of the same graph region, the computing core with the smallest execution duration as the target computing core of that graph region (i.e., of the computation node).
S704, the task allocator merges the computation nodes with the same target computing core in each computation region into a graph region.
S705, the task allocator determines at least one target graph region from the merged graph regions.
In some possible embodiments, S704 to S705 may include: after determining the target computing core corresponding to each graph region, the task allocator may merge graph regions with the same target computing core into one larger graph region. When no graph regions can be merged further, the task allocator determines the graph regions corresponding to the computation graph as the target graph regions.
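In code form, S702 to S703 amount to a per-node minimization, sketched here with assumed types; the duration callback stands in for the performance mathematical model or trial runs described below.

#include <functional>
#include <vector>

struct NodeAssignment { int node; int core; };

// Illustrative sketch of the second manner: pick, for each computation node,
// the mappable computing core with the smallest execution duration. Assumes
// every node has at least one mappable core.
std::vector<NodeAssignment> AssignFastestCores(
        const std::vector<std::vector<int>>& mappableCores,  // per-node core ids
        const std::function<double(int node, int core)>& duration) {
    std::vector<NodeAssignment> out;
    for (int n = 0; n < static_cast<int>(mappableCores.size()); ++n) {
        int best = mappableCores[n].front();
        for (int c : mappableCores[n])
            if (duration(n, c) < duration(n, best)) best = c;
        out.push_back({n, best});  // target computing core of node n (S703)
    }
    // S704-S705: dependent nodes sharing a target core are then merged into
    // graph regions until no further merging is possible.
    return out;
}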
Of course, S303 may also be implemented in other manners, which is not specifically limited in the embodiments of the present disclosure.
It should be noted that, in some possible implementations, the task allocator may obtain an estimate of a graph region's execution duration on a mappable computing core by invoking a performance mathematical model. Optionally, the performance mathematical model is trained on the historical execution durations of scalar computation nodes and/or tensor computation nodes on different kinds of computing cores. Estimating the execution duration through the performance mathematical model is computationally simple and efficient, but less accurate.
In other possible embodiments, the task allocator may instead pass the graph region to the compiler of the mappable computing core, which generates the corresponding binary instruction sequence; the sequence is then issued to the mappable computing core to run, yielding the graph region's execution duration on that core. Compared with the estimate obtained from the performance mathematical model, the execution duration obtained this way is accurate, but the computation is complex and the efficiency is low.
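The two alternatives can be viewed as two implementations of a single duration oracle, as in this small sketch (types assumed, not from the patent):

#include <functional>

// Illustrative sketch: the task allocator can query a graph region's
// execution duration through one interface, backed either way.
struct DurationOracle {
    // Fast but approximate: performance mathematical model trained on
    // historical execution durations per kind of computing core.
    std::function<double(int region, int core)> modelEstimate;
    // Accurate but slow: compile the region for the core and actually run it.
    std::function<double(int region, int core)> trialRun;
};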
S304, the computation core compiling module generates a corresponding binary instruction code segment for each target graph region.
The binary instruction code segment corresponding to a scalar graph region is executable in a first computing core for scalar computation, and the binary instruction code segment corresponding to a tensor graph region is executable in a second computing core for tensor computation.
In some possible implementations, after determining the at least one target graph region, the task allocator passes the at least one target graph region, together with the mapping relationships between the target graph regions and the computing cores, to the computation core compiling module. The computation core compiling module converts each target graph region into a second intermediate representation according to its mapping relationship and issues the second intermediate representations to the corresponding compilers, which generate the binary instruction code segments corresponding to the target graph regions.
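In code form, S304 amounts to the loop sketched below; the SecondIR, BinarySegment and CoreCompiler types are placeholders for structures the patent does not define.

#include <vector>

struct SecondIR {};       // placeholder for the second intermediate representation
struct BinarySegment {};  // placeholder for a binary instruction code segment

struct CoreCompiler {
    // Placeholder backend: emits a binary instruction code segment from IR.
    BinarySegment Emit(const SecondIR&) const { return BinarySegment{}; }
};

// Assumed lowering step: convert a target graph region (identified here by
// its index) into the second IR according to its core mapping.
SecondIR Lower(int /*region*/, int /*core*/) { return SecondIR{}; }

// Illustrative sketch of S304: one binary instruction code segment per target
// graph region, produced by the compiler of the core the region is mapped to.
std::vector<BinarySegment> CompileRegions(
        const std::vector<int>& mappedCore,           // region index -> core id
        const std::vector<CoreCompiler>& compilers) { // one compiler per core
    std::vector<BinarySegment> segments;
    for (int r = 0; r < static_cast<int>(mappedCore.size()); ++r)
        segments.push_back(compilers[mappedCore[r]].Emit(Lower(r, mappedCore[r])));
    return segments;
}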
S305, the computation core compiling module generates the binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and the binary instruction code segments.
It can be understood that the computation core compiling module mixes the binary instruction code segments generated by each compiler together and, by adding the dependency relationships between the graph regions, obtains the compiled binary instruction sequence.
At this point, the compilation process for the heterogeneous system is completed.
In some possible embodiments, after the development process described in S301 to S305 is completed, the compiled binary instruction sequence from S305 may be mapped to the heterogeneous computing cores for execution. Referring to the dashed line in fig. 1, the heterogeneous system 10 may further include: a runtime system 13. During actual operation, the programming program 11 issues the compiled binary instruction sequence to the heterogeneous computing cores. During issuing, the runtime system 13 may query whether there is a binary instruction code segment to be executed. When the runtime system 13 determines that the data required to execute a pending binary instruction code segment is ready and the heterogeneous computing core corresponding to that segment is idle, the runtime system 13 may issue the segment to the corresponding heterogeneous computing core for execution.
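The dispatch condition just described reduces to a small check, sketched here with assumed types:

#include <vector>

// Illustrative sketch of the runtime system's dispatch rule: a binary
// instruction code segment is issued only when its input data is ready and
// the heterogeneous computing core it targets is idle.
struct PendingSegment {
    int targetCore;  // heterogeneous computing core the segment was compiled for
    bool dataReady;  // have all depended-on segments produced their outputs?
};

bool CanDispatch(const PendingSegment& seg, const std::vector<bool>& coreIdle) {
    return seg.dataReady && coreIdle[seg.targetCore];
}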
In the embodiments of the present disclosure, when program development is performed based on a heterogeneous system, the source language code can describe algorithms using a scalar programming language and a tensor programming language, so there is no need to program each heterogeneous computing core separately; scalar computation and tensor computation can be written together during programming, which greatly improves development efficiency. Furthermore, through computation graph division, the same computation function can be mapped to different heterogeneous computing cores for execution, shortening the function's execution time and improving the effective utilization of computing resources.
Based on the same inventive concept, the disclosed embodiments provide a compiling apparatus for compiling programs for heterogeneous systems.
Fig. 9 is a schematic structural diagram of a compiling apparatus in an embodiment of the disclosure. Referring to fig. 9, the compiling apparatus 900 may include: a programming front-end module 901, configured to obtain a computation graph corresponding to source language code, where the source language code is written based on a programming model and the programming model uses a scalar programming language and a tensor programming language to describe algorithms; a computation graph processing module 902, configured to divide the computation graph into at least one target graph region, where the at least one target graph region includes a scalar graph region containing scalar computation nodes and/or a tensor graph region containing tensor computation nodes; and a computation core compiling module 903, configured to generate a corresponding binary instruction code segment for each target graph region, and to generate a binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and the binary instruction code segments, where the binary instruction code segment corresponding to the scalar graph region is executable in a first computing core for scalar computation and the binary instruction code segment corresponding to the tensor graph region is executable in a second computing core for tensor computation.
In the above scheme, the programming model includes a built-in function interface that is written in a tensor programming language and can run in the second computing core; the built-in function interface is called to perform tensor computation on tensor data.
In the above scheme, a front end module 901 is programmed for converting source language code into a first intermediate representation for representing scalar data and/or tensor data; and generating a computational graph based on the intermediate representation, wherein the computational nodes in the computational graph represent functions corresponding to the source language code, and the directed edges in the computational graph represent the dependency relationship between the functions.
In the above scheme, the front-end module 901 is programmed and is further configured to perform optimization processing on the calculation graph; wherein the optimization process comprises at least one of: deleting redundant computing nodes, fusing computing nodes, equivalently converting complex computing into simple computing, and merging cyclic blocks.
In the above scheme, the computation graph processing module 902 is configured to divide the computation nodes that have dependency relationships and are of the same computation type in the computation graph into a computation region; counting the mappable computing cores of each computing node in each computing area; dividing the computing nodes which have dependency relationship in each computing area and can be mapped to the same computing core into a graph area; at least one target map region is determined based on the map regions in each of the calculation regions.
In the above scheme, the calculation region is a scalar calculation region; and the computation graph processing module 902 is configured to determine at least one graph region corresponding to the computation graph as at least one target graph region.
In the above scheme, the calculation region is a tensor calculation region; a computation graph processing module 902, configured to create a graph region set for an ith graph region in the graph regions; traversing the graph region set, and dividing the computing nodes which have dependency relationship in the ith graph region and can be mapped in the same computing core into a candidate graph region, wherein i is a positive integer; determining a first profit value of the ith map area according to the execution duration of the candidate map areas of the ith map area on the corresponding computation core; determining a second profit value which is larger than the preset profit value and has the largest value from the first profit value; updating the candidate graph region corresponding to the second profit value to a graph region set, and returning to the step of traversing the graph region set until the first profit values are all smaller than or equal to the preset profit value; and determining the candidate map area corresponding to the first profit value as a target map area.
In the above scheme, the computation graph processing module 902 is further configured to divide the computation nodes having dependency relationships and the same type in the computation graph into a computation region; calculating the execution time of each computing node in each computing area in the mappable computing core, and determining the mappable computing core with the minimum execution time as a target computing core of the computing node; merging the same computing nodes of the target computing cores in each computing area into a graph area; and determining at least one target map area according to the merged map areas.
In the above scheme, the computation core decoding module 903 is configured to determine a mapping relationship between each target graph region of the at least one target graph region and a computing core; convert each target graph region into a second intermediate representation according to the mapping relationship; and generate the binary instruction code segment based on the second intermediate representation.
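The decoding module's two-stage lowering can be caricatured as follows; the textual second intermediate representation and the UTF-8 "assembly" are stand-ins for a real core-specific IR and machine-code emission.

```python
def lower_to_second_ir(region, core):
    # One textual IR statement per node; a real second IR would also carry
    # tensor layouts, memory placement, and synchronization.
    return [f"{core}.{node.name}" for node in region]

def assemble(ir):
    # Placeholder "assembler": real backends emit core-specific machine code.
    return "\n".join(ir).encode("utf-8")

def codegen(target_regions, core_of):
    """core_of maps each region to its computing core per the mapping step."""
    return [assemble(lower_to_second_ir(r, core_of(r))) for r in target_regions]
```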
Based on the same inventive concept, an embodiment of the present disclosure provides a compiling apparatus, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the compiling method described in one or more of the above embodiments when executing the executable instructions.
Based on the same inventive concept, the disclosed embodiments provide a computer-readable storage medium, which stores an executable program, wherein the executable program, when executed by a processor, implements the compiling method according to one or more of the embodiments.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (21)

1. A compiling method based on a heterogeneous system, the heterogeneous system including a plurality of computing cores having different architectures, the method comprising:
obtaining a computation graph corresponding to source language code, wherein the source language code is written based on a programming model, and the programming model supports algorithm description in a scalar programming language and a tensor programming language;
dividing the computation graph into at least one target graph region, the at least one target graph region including a scalar graph region containing scalar computation nodes and/or a tensor graph region containing tensor computation nodes;
generating a corresponding binary instruction code section for each target graph region, wherein the binary instruction code section corresponding to the scalar graph region is executable in a first computation core for scalar computation, and the binary instruction code section corresponding to the tensor graph region is executable in a second computation core for tensor computation;
and generating a binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and on the binary instruction code sections.
2. The method of claim 1, wherein the programming model comprises a built-in function interface that is written in the tensor programming language and is operable in the second computation core, the built-in function interface being invoked to perform tensor computation on tensor data.
3. The method of claim 1, wherein obtaining the computation graph corresponding to the source language code comprises:
converting the source language code into a first intermediate representation, the first intermediate representation representing scalar data and/or tensor data;
and generating the computational graph based on the first intermediate representation, wherein the computational nodes in the computational graph represent the functions corresponding to the source language code, and the directed edges in the computational graph represent the dependency relationships among the functions.
4. The method of claim 3, wherein after the generating the computational graph based on the first intermediate representation, the method further comprises:
optimizing the computation graph;
wherein the optimization process comprises at least one of: deleting redundant computing nodes, fusing computing nodes, equivalently converting complex computations into simpler ones, and merging loop blocks.
5. The method of claim 1, wherein the dividing the computation graph into at least one target graph region comprises:
dividing the computing nodes in the computation graph that have dependency relationships and the same computation type into a computing region;
counting the mappable computing cores of each computing node in each computing region;
dividing the computing nodes in each computing region that have dependency relationships and can be mapped to the same computing core into a graph region;
and determining the at least one target graph region according to the graph regions in each computing region.
6. The method of claim 5, wherein the computing region is a scalar computing region; said determining the at least one target graph region according to the graph regions in each computing region comprises:
determining the graph region as the at least one target graph region.
7. The method of claim 5, wherein the computing region is a tensor computing region; said determining the at least one target graph region according to the graph regions in each computing region comprises:
creating a graph region set for the i-th graph region among the graph regions, wherein i is a positive integer;
traversing the graph region set, and dividing the computing nodes in the i-th graph region that have dependency relationships and can be mapped to the same computing core into candidate graph regions;
determining first profit values of the i-th graph region according to the execution durations of the candidate graph regions of the i-th graph region on the corresponding computing cores;
determining, from the first profit values, a second profit value that is greater than a preset profit value and is the largest;
updating the candidate graph region corresponding to the second profit value into the graph region set, and returning to the step of traversing the graph region set until the first profit values are all less than or equal to the preset profit value;
and determining the candidate graph regions corresponding to the first profit values as the target graph regions.
8. The method of claim 1, wherein the dividing the computation graph into at least one target graph region comprises:
dividing the computing nodes in the computation graph that have dependency relationships and the same type into a computing region;
calculating the execution duration of each computing node in each computing region on its mappable computing cores, and determining the mappable computing core with the smallest execution duration as the target computing core of that computing node;
merging the computing nodes in each computing region whose target computing cores are the same into a graph region;
and determining the at least one target graph region according to the merged graph regions.
9. The method of claim 1, wherein the generating a corresponding binary instruction code section for each target graph region comprises:
determining a mapping relationship between each target graph region of the at least one target graph region and a computing core;
converting each target graph region into a second intermediate representation according to the mapping relationship;
and generating the binary instruction code section based on the second intermediate representation.
10. The method of any of claims 1 to 9, wherein the first computing core comprises at least one of: CPU, DSP; the second computing core comprises at least one of: CPU, TPU, GPU, DSP, ASIC, FPGA.
11. A compiling apparatus for compiling programs for a heterogeneous system, the heterogeneous system including a plurality of computing cores having different architectures, the compiling apparatus comprising:
a programming front-end module, configured to obtain a computation graph corresponding to source language code, wherein the source language code is written based on a programming model, and the programming model supports algorithm description in a scalar programming language and a tensor programming language;
a computation graph processing module, configured to divide the computation graph into at least one target graph region, wherein the at least one target graph region includes a scalar graph region containing scalar computing nodes and/or a tensor graph region containing tensor computing nodes;
and a computation core decoding module, configured to generate a corresponding binary instruction code section for each target graph region, and to generate a binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and on the binary instruction code sections, wherein the binary instruction code section corresponding to the scalar graph region is executable in a first computation core for scalar computation, and the binary instruction code section corresponding to the tensor graph region is executable in a second computation core for tensor computation.
12. The apparatus of claim 11, wherein the programming model comprises a built-in function interface that is written in the tensor programming language and is operable in the second computation core, the built-in function interface being invoked to perform tensor computation on tensor data.
13. The apparatus of claim 11, wherein the programming front-end module is configured to convert the source language code into a first intermediate representation, the first intermediate representation representing scalar data and/or tensor data; and to generate the computation graph based on the first intermediate representation, wherein the computing nodes in the computation graph represent the functions corresponding to the source language code, and the directed edges in the computation graph represent the dependency relationships between the functions.
14. The apparatus of claim 13, wherein the programming front-end module is further configured to perform optimization processing on the computation graph; wherein the optimization processing comprises at least one of: deleting redundant computing nodes, fusing computing nodes, equivalently converting complex computations into simpler ones, and merging loop blocks.
15. The apparatus according to claim 11, wherein the computation graph processing module is configured to divide the computing nodes in the computation graph that have dependency relationships and the same computation type into a computing region; count the mappable computing cores of each computing node in each computing region; divide the computing nodes in each computing region that have dependency relationships and can be mapped to the same computing core into a graph region; and determine the at least one target graph region according to the graph regions in each computing region.
16. The apparatus of claim 15, wherein the computing region is a scalar computing region; and the computation graph processing module is configured to determine the at least one graph region corresponding to the computation graph as the at least one target graph region.
17. The apparatus of claim 15, wherein the computing region is a tensor computing region; the computation graph processing module is configured to: create a graph region set for the i-th graph region among the graph regions, wherein i is a positive integer; traverse the graph region set, and divide the computing nodes in the i-th graph region that have dependency relationships and can be mapped to the same computing core into candidate graph regions; determine first profit values of the i-th graph region according to the execution durations of the candidate graph regions of the i-th graph region on the corresponding computing cores; determine, from the first profit values, a second profit value that is greater than a preset profit value and is the largest; update the candidate graph region corresponding to the second profit value into the graph region set, and return to the step of traversing the graph region set until the first profit values are all less than or equal to the preset profit value; and determine the candidate graph regions corresponding to the first profit values as the target graph regions.
18. The apparatus according to claim 11, wherein the computation graph processing module is further configured to divide the computing nodes in the computation graph that have dependency relationships and the same type into a computing region; calculate the execution duration of each computing node in each computing region on its mappable computing cores, and determine the mappable computing core with the smallest execution duration as the target computing core of that computing node; merge the computing nodes in each computing region whose target computing cores are the same into a graph region; and determine the at least one target graph region according to the merged graph regions.
19. The apparatus of claim 11, wherein the computation core decoding module is configured to determine a mapping relationship between each target graph region of the at least one target graph region and a computing core; convert each target graph region into a second intermediate representation according to the mapping relationship; and generate the binary instruction code section based on the second intermediate representation.
20. A compiling device characterized by comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1 to 10 when executing the executable instructions.
21. A computer-readable storage medium, characterized in that the readable storage medium stores an executable program, wherein the executable program, when executed by a processor, implements the method of any one of claims 1 to 10.
CN202110747965.7A 2021-07-02 2021-07-02 Heterogeneous system based compiling method, device, equipment and storage medium Pending CN113553054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110747965.7A CN113553054A (en) 2021-07-02 2021-07-02 Heterogeneous system based compiling method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110747965.7A CN113553054A (en) 2021-07-02 2021-07-02 Heterogeneous system based compiling method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113553054A true CN113553054A (en) 2021-10-26

Family

ID=78102559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110747965.7A Pending CN113553054A (en) 2021-07-02 2021-07-02 Heterogeneous system based compiling method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113553054A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116974580A (en) * 2023-09-25 2023-10-31 之江实验室 Multi-modal network compiling method, system and storage medium
CN116974580B (en) * 2023-09-25 2024-01-09 之江实验室 Multi-modal network compiling method, system and storage medium
CN117331541A (en) * 2023-10-27 2024-01-02 北京智源人工智能研究院 Compiling and operating method and device for dynamic graph frame and heterogeneous chip
CN118246377A (en) * 2024-05-27 2024-06-25 北京燧原智能科技有限公司 Simulator architecture, simulation method, simulation equipment and medium of tensor processor
CN118246377B (en) * 2024-05-27 2024-08-13 北京燧原智能科技有限公司 Simulator architecture, simulation method, simulation equipment and medium of tensor processor

Similar Documents

Publication Publication Date Title
Besta et al. To push or to pull: On reducing communication and synchronization in graph computations
WO2021000970A1 (en) Deep learning algorithm compiling method, device, and related product.
CN113553054A (en) Heterogeneous system based compiling method, device, equipment and storage medium
JP2634144B2 (en) Program parallel execution method and parallel execution compiler
US6212617B1 (en) Parallel processing method and system using a lazy parallel data type to reduce inter-processor communication
Vishkin Using simple abstraction to reinvent computing for parallelism
CN110383247A (en) Method, computer-readable medium and heterogeneous computing system performed by computer
US9891958B2 (en) System and method for parallelizing grid search method facilitating determination of PK-PD parameters
WO2021000971A1 (en) Method and device for generating operation data and related product
JP2020013608A (en) Data processing graph compilation
US12039305B2 (en) Method for compilation, electronic device and storage medium
Shi et al. Welder: Scheduling deep learning memory access via tile-graph
Chowdhury et al. Autogen: Automatic discovery of efficient recursive divide-&-conquer algorithms for solving dynamic programming problems
Coullon et al. The SIPSim implicit parallelism model and the SkelGIS library
Gosmann et al. Automatic optimization of the computation graph in the Nengo neural network simulator
Eliahu et al. Frpa: A framework for recursive parallel algorithms
Bilotta et al. Design and implementation of particle systems for meshfree methods with high performance
Fehling Algorithms for massively parallel generic hp-adaptive finite element methods
Verma et al. Fast, quasi-optimal, and pipelined instruction-set extensions
Lo et al. LaRCS: A language for describing parallel computations for the purpose of mapping
CN110717587B (en) Performance semantic acceleration method based on parallel acceleration loop body and application thereof
US20230259338A1 (en) Computer-implemented method and a computer-readable medium
Bai et al. Parallelization of matrix partitioning in hierarchical matrix construction on distributed memory systems
González-Domínguez et al. Performance evaluation of sparse matrix products in UPC
Dieterle et al. Skeleton composition versus stable process systems in Eden

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination