CN113553054A - Heterogeneous-system-based compiling method, apparatus, device, and storage medium
- Publication number: CN113553054A (application CN202110747965.7A, China)
- Legal status: Pending
Classifications
- G06F8/41 Compilation; G06F8/427 Parsing
- G06F9/5027 Allocation of resources (e.g. CPU) to service a request, the resource being a machine (e.g. CPUs, servers, terminals)
- G06N5/04 Inference or reasoning models
Abstract
The invention discloses a compiling method, apparatus, device, and storage medium based on a heterogeneous system, where the heterogeneous system includes a plurality of computing cores with different architectures. The method includes the following steps: obtaining a computation graph corresponding to source language code, where the source language code is written based on a programming model, and the programming model describes algorithms using a scalar programming language and a tensor programming language; dividing the computation graph into at least one target graph region, where the at least one target graph region includes a scalar graph region containing scalar computing nodes and/or a tensor graph region containing tensor computing nodes; generating a corresponding binary instruction code segment for each target graph region; and generating a binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and on the binary instruction code segments. With this method, when compiling and developing on a heterogeneous system, the source language code can describe algorithms in both a scalar programming language and a tensor programming language, which improves development efficiency.
Description
Technical Field
The present disclosure relates to, but is not limited to, the field of compilation and development, and in particular to a compiling method, apparatus, device, and storage medium based on a heterogeneous system.
Background
With the rise of compute-intensive fields such as artificial intelligence, high-performance data analytics, and financial analysis, the traditional general-purpose computing model can no longer meet the demand for computing capacity, which has motivated a more powerful computing paradigm: heterogeneous computing.
Heterogeneous computing is a special form of parallel and distributed computing, and mainly refers to the computing mode of a system built from computing units with different instruction sets and architectures. In a heterogeneous computing system, a common heterogeneous system (also referred to as a heterogeneous device) may combine different computing cores (cores) such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), a Tensor Processing Unit (TPU), an Application-Specific Integrated Circuit (ASIC), and so on.
In a heterogeneous system, each computing core usually has its own instruction set and programming model, so when a program is developed on the heterogeneous system, each computing core must be programmed and compiled separately against its own programming model, resulting in low development efficiency.
Disclosure of Invention
The present disclosure provides a compiling method, apparatus, device and storage medium based on a heterogeneous system, so as to implement unified compiling of the heterogeneous system and improve development efficiency.
According to a first aspect of the embodiments of the present disclosure, a heterogeneous-system-based compiling method is provided, where the heterogeneous system includes a plurality of computing cores with different architectures. The method includes the following steps: obtaining a computation graph corresponding to source language code, where the source language code is written based on a programming model, and the programming model describes algorithms using a scalar programming language and a tensor programming language; dividing the computation graph into at least one target graph region, where the at least one target graph region includes a scalar graph region containing scalar computing nodes and/or a tensor graph region containing tensor computing nodes; generating a corresponding binary instruction code segment for each target graph region, where the binary instruction code segment corresponding to the scalar graph region is executable on a first computing core for scalar computation, and the binary instruction code segment corresponding to the tensor graph region is executable on a second computing core for tensor computation; and generating a binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and on the binary instruction code segments.
In the above solution, the programming model includes a built-in function interface that is written in the tensor programming language and can run on the second computing core, and the built-in function interface is called to perform tensor computation on tensor data.
In the above solution, obtaining the computation graph corresponding to the source language code includes: converting the source language code into a first intermediate representation, where the first intermediate representation is used to represent scalar data and/or tensor data; and generating the computation graph based on the first intermediate representation, where the computing nodes in the computation graph represent the functions corresponding to the source language code, and the directed edges in the computation graph represent the dependency relationships between the functions.
In the above solution, after generating the computation graph based on the first intermediate representation, the method further includes: optimizing the computation graph, where the optimization includes at least one of: deleting redundant computing nodes, fusing computing nodes, equivalently converting complex computations into simple computations, and merging loop blocks.
In the above solution, dividing the computation graph into at least one target graph region includes: dividing the computing nodes in the computation graph that have dependency relationships and the same computation type into a computation region; collecting the mappable computing cores of each computing node in each computation region; dividing the computing nodes in each computation region that have dependency relationships and can be mapped to the same computing core into a graph region; and determining the at least one target graph region from the graph regions in each computation region.
In the above solution, the computation region is a scalar computation region, and determining the at least one target graph region from the graph regions in each computation region includes: determining the graph regions as the at least one target graph region.
In the above solution, the computation region is a tensor computation region, and determining the at least one target graph region from the graph regions in each computation region includes: creating a graph region set for the i-th graph region, where i is a positive integer; traversing the graph region set, and dividing the computing nodes in the i-th graph region that have dependency relationships and can be mapped to the same computing core into a candidate graph region; determining a first profit value of the i-th graph region according to the execution durations of the candidate graph regions of the i-th graph region on their corresponding computing cores; determining, from the first profit values, a second profit value that is greater than a preset profit value and has the largest value; updating the graph region set with the candidate graph regions corresponding to the second profit value, and returning to the step of traversing the graph region set until all first profit values are less than or equal to the preset profit value; and determining the candidate graph regions corresponding to the first profit values as target graph regions.
In the above solution, dividing the computation graph into at least one target graph region includes: dividing the computing nodes in the computation graph that have dependency relationships and the same type into a computation region; calculating the execution duration of each computing node in each computation region on its mappable computing cores, and determining the mappable computing core with the shortest execution duration as the target computing core of that computing node; merging the computing nodes in each computation region that have the same target computing core into a graph region; and determining the at least one target graph region from the merged graph regions.
In the above solution, generating a corresponding binary instruction code segment for each target graph region includes: determining a mapping relationship between each of the at least one target graph region and a computing core; converting each target graph region into a second intermediate representation according to the mapping relationship; and generating the binary instruction code segment based on the second intermediate representation.
In the above solution, the first computing core includes at least one of: a CPU and a DSP; the second computing core includes at least one of: a CPU, a TPU, a GPU, a DSP, an ASIC, and an FPGA.
According to a second aspect of the embodiments of the present disclosure, a compiling apparatus is provided for compiling programs for a heterogeneous system, where the heterogeneous system includes a plurality of computing cores with different architectures. The compiling apparatus includes: a programming front-end module, configured to obtain a computation graph corresponding to source language code, where the source language code is written based on a programming model, and the programming model describes algorithms using a scalar programming language and a tensor programming language; a computation graph processing module, configured to divide the computation graph into at least one target graph region, where the at least one target graph region includes a scalar graph region containing scalar computing nodes and/or a tensor graph region containing tensor computing nodes; and a computing core compiling module, configured to generate a corresponding binary instruction code segment for each target graph region, and to generate a binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and on the binary instruction code segments, where the binary instruction code segment corresponding to the scalar graph region is executable on a first computing core for scalar computation, and the binary instruction code segment corresponding to the tensor graph region is executable on a second computing core for tensor computation.
In the above solution, the programming model includes a built-in function interface that is written in the tensor programming language and can run on the second computing core, and the built-in function interface is called to perform tensor computation on tensor data.
In the above solution, the programming front-end module is configured to convert the source language code into a first intermediate representation, where the first intermediate representation is used to represent scalar data and/or tensor data; and to generate the computation graph based on the first intermediate representation, where the computing nodes in the computation graph represent the functions corresponding to the source language code, and the directed edges represent the dependency relationships between the functions.
In the above solution, the programming front-end module is further configured to optimize the computation graph, where the optimization includes at least one of: deleting redundant computing nodes, fusing computing nodes, equivalently converting complex computations into simple computations, and merging loop blocks.
In the above solution, the computation graph processing module is configured to divide the computing nodes in the computation graph that have dependency relationships and the same computation type into a computation region; collect the mappable computing cores of each computing node in each computation region; divide the computing nodes in each computation region that have dependency relationships and can be mapped to the same computing core into a graph region; and determine the at least one target graph region from the graph regions in each computation region.
In the above solution, the computation region is a scalar computation region, and the computation graph processing module is configured to determine the at least one graph region corresponding to the computation graph as the at least one target graph region.
In the above solution, the computation region is a tensor computation region, and the computation graph processing module is configured to: create a graph region set for the i-th graph region, where i is a positive integer; traverse the graph region set, and divide the computing nodes in the i-th graph region that have dependency relationships and can be mapped to the same computing core into a candidate graph region; determine a first profit value of the i-th graph region according to the execution durations of the candidate graph regions of the i-th graph region on their corresponding computing cores; determine, from the first profit values, a second profit value that is greater than a preset profit value and has the largest value; update the graph region set with the candidate graph regions corresponding to the second profit value, and return to the step of traversing the graph region set until all first profit values are less than or equal to the preset profit value; and determine the candidate graph regions corresponding to the first profit values as target graph regions.
In the above solution, the computation graph processing module is further configured to divide the computing nodes in the computation graph that have dependency relationships and the same type into a computation region; calculate the execution duration of each computing node in each computation region on its mappable computing cores, and determine the mappable computing core with the shortest execution duration as the target computing core of that computing node; merge the computing nodes in each computation region that have the same target computing core into a graph region; and determine the at least one target graph region from the merged graph regions.
In the above solution, the computing core compiling module is configured to determine a mapping relationship between each of the at least one target graph region and a computing core; convert each target graph region into a second intermediate representation according to the mapping relationship; and generate the binary instruction code segment based on the second intermediate representation.
According to a third aspect of the embodiments of the present disclosure, a compiling device is provided, including: a processor; and a memory for storing processor-executable instructions; where the processor is configured to execute the executable instructions to implement the method according to the first aspect or any implementation thereof.
According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium storing an executable program is provided, where the executable program, when executed by a processor, implements the method according to the first aspect or any implementation thereof.
With the above method, when programs are developed on a heterogeneous system, the source language code can describe algorithms using both a scalar programming language and a tensor programming language; scalar computation and tensor computation can therefore be written together in one program, without programming each heterogeneous computing core separately, which greatly improves development efficiency. Furthermore, through computation graph division, the same computation function can be mapped onto different heterogeneous computing cores for execution, shortening the execution time of the function and improving the effective utilization of computing resources.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic structural diagram of a heterogeneous system in an embodiment of the present disclosure;
FIG. 2 is a block diagram of a compiler according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart illustrating an implementation of a compiling method according to an embodiment of the disclosure;
FIG. 4 is a diagram illustrating a description of source language code in an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a computational graph in an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an implementation flow for dividing a computation graph in an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of another implementation flow for dividing a computation graph in an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a tensor computation region in an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a compiling apparatus in an embodiment of the disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosed embodiments, as detailed in the appended claims.
The terminology used in the embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present disclosure. As used in the disclosed embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first", "second", "third", etc. may be used in the embodiments of the present disclosure to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, "first information" may also be referred to as "second information," and similarly, "second information" may also be referred to as "first information," without departing from the scope of the embodiments of the present disclosure. The word "if" as used herein may be interpreted as "upon" or "when" or "in response to determining", depending on the context.
With the rise of compute-intensive fields such as artificial intelligence, high-performance data analytics, and financial analysis, the traditional general-purpose computing model can no longer meet the demand for computing capacity, which has motivated a more powerful computing paradigm: heterogeneous computing.
Heterogeneous computing mainly refers to the computing mode of a system composed of computing units with different instruction sets and architectures. In a heterogeneous computing system, common kinds of heterogeneous computing cores include CPUs, GPUs, DSPs, FPGAs, TPUs, ASICs, and the like. For example, a heterogeneous system may be a general-purpose computer integrating a CPU and a GPU, a system on chip (SoC) integrating a CPU, a GPU, and a DSP, a machine learning system integrating a CPU with a TPU/FPGA/ASIC, and so on. In a heterogeneous system, each computing core usually has its own instruction set and programming model, so when a program is developed on the heterogeneous system, each computing core must be programmed and compiled separately against its own programming model, resulting in low development efficiency.
In addition, for a single computing task (such as a computation function), the core computation program executes on only one computing core, or one type of computing core, in a heterogeneous system; the same core computation program cannot run across computing cores at the same time (that is, it can only run on one computing core or one type of computing core, and cannot run in parallel on computing cores of different types). Therefore, although multiple computing tasks can execute in parallel, heterogeneous parallel execution cannot be achieved for a single computing task, and the effective utilization of computing resources is low.
To solve the above problem, the embodiments of the present disclosure provide a compiling method that may be applied to a heterogeneous system. Fig. 1 is a schematic structural diagram of a heterogeneous system in an embodiment of the present disclosure. Referring to the solid lines in Fig. 1, a heterogeneous system 10 may include a programming program 11 (also referred to as a development program or development software) of the heterogeneous system 10 and heterogeneous computing cores 12. Optionally, the heterogeneous computing cores 12 may include a plurality of computing cores of different architectures: one or more computing cores A 121 (e.g., a first computing core) for scalar computation and one or more computing cores B 122 (e.g., a second computing core) for tensor computation. When there are multiple computing cores A, they share the same architecture; similarly, when there are multiple computing cores B, they share the same architecture. Illustratively, computing core A may be a CPU, a DSP, etc., and computing core B may be a CPU, a GPU, a TPU, an FPGA, a DSP, an ASIC, etc.
Still referring to Fig. 1, the programming program 11 may include a programming model module 111 and a compiler 112. The programming model module 111 is configured to carry the program description of the source language code 20 written by the user, which is independent of the underlying heterogeneous computing cores 12. The compiler 112 is configured to compile the source language code 20 against the heterogeneous computing cores to obtain the corresponding binary instruction sequence.
In some possible embodiments, to enable unified programming of heterogeneous systems, the programming program 11 may support general-purpose applications (e.g., including scalar computation functions) as well as Artificial Intelligence (AI) deep learning applications (e.g., including tensor computation functions). Of course, the programming program 11 may also support applications of other frameworks, which is not specifically limited in the embodiments of the present disclosure.
In some possible implementations, fig. 2 is a schematic structural diagram of a compiler in an embodiment of the present disclosure, and referring to fig. 2, the compiler 112 may include: a programming front end module 1121, a computation graph processing module 1122, and a computation core compilation module 1123.
In some possible embodiments, the compiler 112 may adopt a three-layer structure. The first layer is the programming front-end module 1121, which may include a programming language front-end analyzer 1121a for analyzing the source language code of general-purpose applications and a deep learning inference graph front-end analyzer 1121b for analyzing neural network graphs. The second layer is the computation graph processing module 1122, which may include a graph optimizer 1122a and a task allocator 1122b for the various heterogeneous computing cores. The third layer is the computing core compiling module 1123, which may include a compiler for each computing core, such as computing core compiler A 1123a for computing core A and computing core compiler B 1123b for computing core B. The compiler of each computing core receives the tasks issued by the second layer and generates the binary instruction code segments for its computing core, from which the binary instruction sequence corresponding to the source language code is further generated.
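For orientation only, the three layers can be sketched as interfaces. All class and type names below are assumptions made for this text; the patent does not publish an API for the compiler 112.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical types standing in for the patent's concepts.
struct ComputationGraph {};                                     // first-layer output
struct TargetGraphRegion {};                                    // second-layer output
struct BinaryCodeSegment { std::vector<std::uint8_t> bytes; };  // third-layer output

// Layer 1: programming front-end module 1121.
class ProgrammingFrontEnd {
public:
    ComputationGraph ParseSource(const std::string& src) { return {}; }          // 1121a
    ComputationGraph ParseInferenceGraph(const std::string& graph) { return {}; } // 1121b
};

// Layer 2: computation graph processing module 1122.
class ComputationGraphProcessor {
public:
    void Optimize(ComputationGraph& graph) {}                                    // optimizer 1122a
    std::vector<TargetGraphRegion> Allocate(const ComputationGraph& g) { return {}; } // allocator 1122b
};

// Layer 3: computing core compiling module 1123, one backend per core type.
class ComputeCoreCompiler {
public:
    virtual ~ComputeCoreCompiler() = default;
    virtual BinaryCodeSegment Compile(const TargetGraphRegion& region) = 0;
};
```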
The following describes a compiling method provided by the embodiment of the present disclosure with reference to the above heterogeneous system.
Fig. 3 is a schematic implementation flow diagram of a compiling method in the embodiment of the present disclosure, and referring to fig. 3, the compiling method may include:
S301, the programming model module obtains the source language code written by the user in the programming language according to the programming model.
The programming model describes algorithms using a scalar programming language and a tensor programming language.
In practical applications, the programming model may include scalar computation functions written in the scalar programming language that can execute on computing core A, as well as tensor computation functions written in the tensor programming language that can execute on computing core B (the tensor computation functions may be understood as built-in functions of the programming model). Optionally, to perform tensor computation with a tensor computation function, the programming model may further include a built-in function interface (i.e., a tensor computation function interface) for calling the tensor computation function on input tensor data. Illustratively, the built-in function interface is an extension of a scalar programming language (such as C/C++).
In some possible implementations, the built-in functions for a CPU, GPU, DSP, or TPU may be built on their underlying instruction sets to achieve the highest performance; an ASIC implements its fixed functionality directly; and an FPGA implementation may be written in the Verilog hardware description language.
In the embodiments of the present disclosure, because the programming model includes both scalar computation functions and tensor computation functions, the differences among heterogeneous computing cores can be shielded: even if the heterogeneous computing cores change (e.g., a new scalar computing core and/or a new tensor computing core is added), the source language code written against the programming model can remain unchanged.
For example, Fig. 4 is a schematic diagram illustrating source language code in an embodiment of the present disclosure. In the source language code shown in Fig. 4, L2LossKernel is the function name, and the upper box inside the function marks the built-in function interface. Scalar data can be regarded as 0-dimensional vector data, corresponding to oklang::Scalar in Fig. 4; tensor data can be regarded as vector data of dimension 1 or higher, corresponding to oklang::Tensor in the figure. 1-dimensional tensor data can be understood as a vector, 2-dimensional tensor data as a matrix, and tensor data of 3 or more dimensions as a multidimensional vector. oklang::mul and oklang::muls in Fig. 4 are built-in functions (which can also be understood as tensor computation functions). These built-in functions can run on any computing core, as long as the underlying heterogeneous computing core supports them.
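Fig. 4 itself is not reproduced in this text. The following is only a speculative reconstruction of what such a kernel might look like, based solely on the identifiers named above (L2LossKernel, oklang::Scalar, oklang::Tensor, oklang::mul, oklang::muls); oklang::sub and oklang::sum are assumed helpers, and the actual figure may differ.

```cpp
#include "oklang.h"  // hypothetical header: the real oklang API is not published here

// Speculative sketch of the Fig. 4 kernel. An L2 loss is roughly
// 0.5 * sum((pred - label)^2); mul/muls are the built-in tensor functions
// mentioned above (element-wise multiply, multiply-by-scalar).
oklang::Scalar L2LossKernel(const oklang::Tensor& pred,
                            const oklang::Tensor& label) {
    oklang::Tensor diff = oklang::sub(pred, label);  // assumed element-wise subtract
    oklang::Tensor sq   = oklang::mul(diff, diff);   // built-in: element-wise multiply
    oklang::Tensor half = oklang::muls(sq, 0.5f);    // built-in: multiply by scalar
    return oklang::sum(half);                        // assumed reduction to a Scalar
}
```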
S302, the programming front-end module generates a computation graph (computational graph) corresponding to the source language code.
It can be understood that the programming front-end module converts the source language code into a first intermediate representation capable of representing scalar data and/or tensor data, and then generates the computation graph corresponding to the source language code based on the first intermediate representation. The computing nodes in the computation graph represent the functions corresponding to the source language code, and the directed edges represent the dependency relationships between the functions.
Illustratively, the programming front-end module converts the first intermediate representation it outputs into a computation graph consisting of nodes and directed edges connecting the nodes. Each node represents a function (which can also be understood as a computation or an operation) corresponding to the source language code; the function may be a tensor computation function (e.g., a built-in function) or a scalar computation function. A node representing a tensor computation function may be referred to as a tensor computing node, and a node representing a scalar computation function may be referred to as a scalar computing node. Fig. 5 is a schematic diagram of a computation graph in an embodiment of the present disclosure; referring to Fig. 5, if a directed edge points from node u to node v, the edge indicates that node v depends on node u.
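A minimal data-structure sketch matching this description is given below; the field names are illustrative assumptions, not the patent's internal representation. Later sketches in this text reuse these types.

```cpp
#include <string>
#include <vector>

// Nodes are functions; directed edges are dependencies.
enum class NodeKind { Scalar, Tensor };

struct ComputeNode {
    std::string function;     // e.g. "oklang::mul", or a scalar function name
    NodeKind kind;            // scalar computing node or tensor computing node
    std::vector<int> inputs;  // nodes this node depends on; an edge u -> v is
                              // stored by putting u into v's inputs
};

struct Graph {
    std::vector<ComputeNode> nodes;
};
```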
Optionally, the programming model module passes the source language code written by the user to the programming front-end module. When the source language code written by the user is a neural network graph, the programming model module passes it to the deep learning inference graph front-end analyzer instead.
In some possible embodiments, after S302 and before S303, the method may further include: the graph optimizer optimizes the computation graph generated in S302. Illustratively, the optimization may include: deleting redundant computing nodes, fusing computing nodes, equivalently converting complex computations into simple computations, merging loop blocks, and the like. Of course, in practical applications, the graph optimizer may perform other graph optimizations on the computation graph, which is not specifically limited in the embodiments of the present disclosure.
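As one hedged illustration of the "deleting redundant computing nodes" pass: nodes whose results are never consumed can be found by walking dependency edges backwards from the graph outputs. This is an assumed realization reusing the Graph sketch above; the patent does not specify how the pass is implemented.

```cpp
#include <vector>

// Mark nodes reachable from the outputs; unmarked nodes are redundant.
std::vector<bool> MarkLiveNodes(const Graph& g, const std::vector<int>& outputs) {
    std::vector<bool> live(g.nodes.size(), false);
    std::vector<int> work(outputs.begin(), outputs.end());
    while (!work.empty()) {
        int n = work.back();
        work.pop_back();
        if (live[n]) continue;
        live[n] = true;
        for (int dep : g.nodes[n].inputs) work.push_back(dep);  // follow dependencies
    }
    return live;  // nodes still false afterwards can be deleted as redundant
}
```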
S303, the computation graph processing module divides the computation graph into at least one target graph region.
The at least one target graph region may include a scalar graph region containing scalar computing nodes and/or a tensor graph region containing tensor computing nodes.
Specifically, S303 can be implemented in, but is not limited to, the following ways.
In a first manner, Fig. 6 is a schematic diagram of an implementation flow for dividing the computation graph in an embodiment of the present disclosure. Referring to Fig. 6, S303 may include:
S601, the task allocator divides the computing nodes in the computation graph that have dependency relationships and the same computation type into a computation region.
For example, the task allocator may divide at least one tensor computing node having a dependency relationship in the computation graph into one tensor computation region, and divide at least one scalar computing node having a dependency relationship into one scalar computation region. In this way, the task allocator obtains the computation regions corresponding to the computation graph; a sketch of this grouping is given after the steps below.
S602, for each computation region, the task allocator collects the mappable computing cores of each computing node.
S603, the task allocator divides the computing nodes in each computation region that have dependency relationships and can be mapped to the same computing core into a graph region.
S604, the task allocator determines the at least one target graph region of the computation graph from the graph regions in each computation region.
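The S601 grouping criterion can be sketched as a union-find over dependency edges whose endpoints share a computation type. This is an assumed realization of the stated criterion, reusing the Graph sketch above, not the patent's code.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Group nodes that are connected by a dependency edge and have the same
// computation type into one computation region.
std::vector<int> DivideIntoComputationRegions(const Graph& g) {
    std::vector<int> parent(g.nodes.size());
    std::iota(parent.begin(), parent.end(), 0);
    std::function<int(int)> find = [&](int x) {
        return parent[x] == x ? x : parent[x] = find(parent[x]);  // path compression
    };
    for (int v = 0; v < (int)g.nodes.size(); ++v)
        for (int u : g.nodes[v].inputs)
            if (g.nodes[u].kind == g.nodes[v].kind)  // same type + dependency
                parent[find(u)] = find(v);           // merge into one region
    std::vector<int> region(g.nodes.size());
    for (int v = 0; v < (int)g.nodes.size(); ++v) region[v] = find(v);
    return region;  // computation-region id for every computing node
}
```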
In some possible embodiments, for a scalar computation region, the mappable computing cores of the scalar computing nodes are one or more computing cores A. If the heterogeneous computing cores include only one computing core A, such as a CPU, the scalar computation region may be divided into one graph region; if they include two computing cores A, such as a CPU and a DSP, the scalar computation region may be divided into two graph regions, and so on. Illustratively, the principle of graph region division is: the part with more parallelizable scalar computing nodes is divided into one graph region, which can be mapped to the DSP; the remaining scalar computing nodes are divided into one or more graph regions, which are mapped to the CPU. Further, S604 may include: the task allocator determines the graph regions of each computation region as the target graph regions.
In some possible embodiments, for a tensor computation region, the mappable computing cores of each tensor computing node are one or more computing cores B. S604 may include: for each tensor computation region, creating a graph region set for the i-th graph region (i is a positive integer, i.e., any graph region of the computation region); traversing the graph region set, and dividing the computing nodes in the i-th graph region that have dependency relationships and can be mapped to the same computing core into a candidate graph region; determining a first profit value of the i-th graph region according to the execution durations of the candidate graph regions of the i-th graph region on their corresponding computing cores; determining, from the first profit values, a second profit value that is greater than a preset profit value and has the largest value; updating the graph region set with the candidate graph regions corresponding to the second profit value, returning to the step of traversing the graph region set, and iteratively calculating the profit values of mapping the i-th graph region to different computing cores, until all first profit values are less than or equal to the preset profit value; and determining the candidate graph regions corresponding to the first profit values as target graph regions. Optionally, the preset profit value may be 0.
In one embodiment, for a tensor computation region, the procedure is as follows. Step 1: the task allocator divides the tensor computation region into a graph region a0 that can be mapped to computing core B0, and creates a graph region set A for a0, where A = {ai | 0 <= i < m}, with m = 1 initially. Step 2: the task allocator traverses the graph regions in set A. Step 3: when traversing region ai, the tensor computing nodes in ai that have dependency relationships and can be mapped to another computing core Bx (such as computing core B1, computing core B2, ..., computing core Bn, where n is a positive integer) are divided into a candidate graph region; the candidate graph region may be a part of ai or the whole of ai. At this point, ai is equivalently split into several candidate graph regions, and different parts of ai may map to different computing cores: the split-off candidate graph region maps to computing core Bx, while the remaining candidate graph regions keep the same mappable computing core as ai. Step 4: the task allocator determines the total execution duration ti' of the candidate graph regions split from ai on their corresponding mappable computing cores. Step 5: the task allocator subtracts ti' from the execution duration ti of ai on its corresponding mappable computing core to obtain the first profit value pi = ti - ti', and adds pi to a profit set P. Step 6: the task allocator selects from P the second profit value pk that is greater than the preset profit value (e.g., 0) and has the largest value, and determines the corresponding graph region ak and its candidate graph regions. Step 7: the task allocator deletes ak from set A and adds its candidate graph regions to A, obtaining the updated graph region set A. Step 8: the task allocator repeats from Step 2 and traverses set A again, until all first profit values in P are less than or equal to the preset profit value. Step 9: the task allocator determines the candidate graph regions in set A as target graph regions.
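Before the worked example below, a condensed sketch of these nine steps: keep splitting graph regions while the best first profit value p = t - t' stays positive. What counts as a Region and how candidate splits are enumerated are abstracted behind a callback, since the patent leaves that to the mappable-core analysis of Steps 2 to 4; all names are assumptions.

```cpp
#include <functional>
#include <vector>

struct Region { std::vector<int> nodes; int core = 0; };

struct Split {
    std::vector<Region> candidates;  // candidate graph regions replacing one region
    double profit = 0.0;             // p_i = t_i - t_i'
};

std::vector<Region> PartitionTensorRegion(
        const Region& initial,
        const std::function<Split(const Region&)>& bestSplitOf) {
    std::vector<Region> setA{initial};                 // Step 1: A = {a0}
    for (;;) {
        int k = -1;
        Split best;                                    // profit starts at the preset value 0
        for (int i = 0; i < (int)setA.size(); ++i) {   // Step 2: traverse A
            Split s = bestSplitOf(setA[i]);            // Steps 3-5: p_i = t_i - t_i'
            if (s.profit > best.profit) { best = s; k = i; }
        }
        if (k < 0) break;                              // Step 8: all p <= 0, stop
        setA.erase(setA.begin() + k);                  // Steps 6-7: replace a_k
        setA.insert(setA.end(), best.candidates.begin(), best.candidates.end());
    }
    return setA;                                       // Step 9: target graph regions
}
```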
Exemplarily, Fig. 8 is a schematic diagram of a tensor computation region in an embodiment of the present disclosure. As shown in Fig. 8, assume that the tensor computation region is N0 → N1 → N2 → N3 → N4, where Ni (i = 0, 1, 2, 3, 4) is a tensor computing node, and N0 → N1 indicates that N1 depends on N0. The heterogeneous computing cores include three computing cores B (computing core B1, computing core B2, and computing core B3). N0, N1, N2, N3, and N4 can all be mapped to computing core B1; N1 and N2 can be mapped to computing core B2; and N2, N3, and N4 can be mapped to computing core B3.
First, the task allocator maps N0-N4 to computing core B1 as one graph region a0 = N0 → N1 → N2 → N3 → N4, so the graph region set A = {a0}. The task allocator then determines that the execution duration of a0 on computing core B1 is t = 150 × T (T is the clock cycle).
Second, the task allocator traverses set A and splits a0 into three candidate graph regions: N0 mapped to computing core B1, N1 → N2 mapped to computing core B2, and N3 → N4 mapped to computing core B1. The total execution duration of the three candidate graph regions on their corresponding mappable computing cores is t' = 100 × T, so the first profit value of a0 is p0 = t - t' = 50 × T.
Third, the task allocator alternatively splits N2 → N3 → N4 out of a0 as a candidate graph region mapped to computing core B3, so that a0 becomes two candidate graph regions (N0 → N1 on B1 and N2 → N3 → N4 on B3) with total execution duration t' = 120 × T, giving a0 another first profit value p0 = t - t' = 30 × T.
At this point, the profit set P = {50 × T, 30 × T}.
Fourth, the task allocator selects the second profit value p0 = 50 × T and updates graph region set A with the corresponding candidate graph regions; now N1 → N2 maps to B2, while N0 and N3 → N4 map to B1.
Fifth, the task allocator traverses set A again. For the graph region N1 → N2 on B2, it splits N1 (mapped to computing core B2) and N2 (mapped to computing core B1) into two candidate graph regions, giving a first profit value p1 = t - t' = -30 × T; alternatively, it splits N1 (mapped to computing core B2) and N2 (mapped to computing core B3) into two candidate graph regions, giving another first profit value p1 = t - t' = -20 × T.
Sixth, N0 has no other mappable computing core. The task allocator splits the graph region N3 → N4 into a candidate graph region mapped to computing core B3, giving a further first profit value p1 = t - t' = 20 × T.
After set A has been traversed, the profit set P = {-30 × T, -20 × T, 20 × T}.
Seventh, the task allocator selects the second profit value p1 = 20 × T and updates graph region set A with the corresponding candidate graph region; now N1 → N2 maps to B2, N0 maps to B1, and N3 → N4 maps to B3.
Eighth, the task allocator traverses set A again and repeats the above steps to obtain a new profit set P; all first profit values in P are now less than or equal to 0, so the iteration terminates.
Ninth, the task allocator determines the graph regions in the updated set A = {a0 = N1 → N2, a1 = N0, a2 = N3 → N4} as target graph regions, where N1 → N2 maps to computing core B2, N0 maps to computing core B1, and N3 → N4 maps to computing core B3.
In a second manner, Fig. 7 is a schematic diagram of another implementation flow for dividing the computation graph in an embodiment of the present disclosure. Referring to Fig. 7, S303 may include:
S701, the task allocator divides the computing nodes in the computation graph that have dependency relationships and the same computation type into a computation region.
It can be understood that the specific execution process of S701 may refer to the related description of S601, which is not described herein again.
S702, the task allocator calculates the execution duration of each computing node in each computation region on its mappable computing cores.
It can be understood that the task allocator first treats each node in each computation region as a graph region, and adopts, for example, an integer linear programming algorithm, a branch-and-bound algorithm, or a genetic algorithm to map these graph regions to the computing cores with the goal of minimizing the execution duration, thereby obtaining the corresponding execution durations.
S703, the task allocator determines the mappable computing core with the shortest execution duration as the target computing core of the computing node.
It can be understood that, among the different mappable computing cores of the same graph region, the task allocator determines the computing core with the shortest execution duration as the target computing core of that graph region (i.e., that computing node).
S704, the task allocator merges the computing nodes in each computation region that have the same target computing core into a graph region.
S705, the task allocator determines the at least one target graph region from the merged graph regions.
In some possible embodiments, S704 to S705 may include: after determining the target computing core of each graph region, the task allocator may merge graph regions with the same target computing core into one larger graph region. When none of the graph regions can be merged further, the task allocator determines the graph regions corresponding to the computation graph as the target graph regions; a sketch of this assign-and-merge procedure follows.
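A sketch of S702 to S705 for one computation region, reusing the Graph sketch above: pick the fastest mappable computing core per node, then merge dependency-connected nodes that chose the same core. The durations table is an assumed input (see the two duration sources discussed below); a negative entry means the core is not mappable for that node.

```cpp
#include <vector>

std::vector<int> AssignAndMerge(const Graph& g,
                                const std::vector<std::vector<double>>& durations,
                                std::vector<int>& coreOf) {
    int n = (int)g.nodes.size();
    coreOf.assign(n, -1);
    for (int v = 0; v < n; ++v)                       // S702/S703: fastest core wins
        for (int c = 0; c < (int)durations[v].size(); ++c)
            if (durations[v][c] >= 0.0 &&
                (coreOf[v] < 0 || durations[v][c] < durations[v][coreOf[v]]))
                coreOf[v] = c;
    std::vector<int> region(n);
    for (int v = 0; v < n; ++v) region[v] = v;        // start: one region per node
    for (bool merged = true; merged; ) {              // S704: merge until stable
        merged = false;
        for (int v = 0; v < n; ++v)
            for (int u : g.nodes[v].inputs)
                if (coreOf[u] == coreOf[v] && region[u] != region[v]) {
                    int from = region[v], to = region[u];
                    for (int w = 0; w < n; ++w)
                        if (region[w] == from) region[w] = to;  // relabel whole region
                    merged = true;
                }
    }
    return region;                                    // S705: merged graph regions
}
```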
Of course, S303 may also be implemented in other manners, which is not specifically limited in the embodiments of the present disclosure.
It should be noted that, in some possible implementations, the task allocator may obtain an estimate of the execution duration of a graph region on a mappable computing core by invoking a performance model. Optionally, the performance model is trained on the historical execution durations of scalar computing nodes and/or tensor computing nodes on different kinds of computing cores. Estimating the execution duration through the performance model is computationally simple and efficient, but less accurate.
In other possible embodiments, the task allocator may instead hand the graph region to the compiler of a mappable computing core, have the compiler generate the corresponding binary instruction sequence, and issue that sequence to the mappable computing core for execution, thereby obtaining the actual execution duration of the graph region on that computing core. Compared with estimating the execution duration through the performance model, the measured duration is accurate, but the process is more complex and less efficient. Both duration sources are sketched below.
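The two duration sources just described can be expressed as one interface with a fast learned estimator and a slow measured one. Class and method names are assumptions; the trained model and the compile-and-run path are stubbed out, reusing the Region sketch above.

```cpp
class DurationEstimator {
public:
    virtual ~DurationEstimator() = default;
    // Estimated or measured execution duration of a graph region on a given
    // computing core, in clock cycles.
    virtual double Duration(const Region& region, int coreId) = 0;
};

// Fast but approximate: a performance model trained on historical durations
// of scalar/tensor computing nodes on each kind of computing core.
class ModelDurationEstimator : public DurationEstimator {
public:
    double Duration(const Region& region, int coreId) override {
        (void)region; (void)coreId;
        return 0.0;  // stub: would query the trained performance model
    }
};

// Accurate but slow: compile the region for the core, issue the binary
// instruction sequence to that core, run it, and time it.
class MeasuredDurationEstimator : public DurationEstimator {
public:
    double Duration(const Region& region, int coreId) override {
        (void)region; (void)coreId;
        return 0.0;  // stub: would compile, dispatch, execute, and time
    }
};
```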
S304, the computing core compiling module generates a corresponding binary instruction code segment for each target graph region.
The binary instruction code segments corresponding to scalar graph regions are executable on a first computing core for scalar computation, and the binary instruction code segments corresponding to tensor graph regions are executable on a second computing core for tensor computation.
In some possible implementations, after determining the at least one target graph region, the task allocator passes the at least one target graph region and the mapping relationships between the target graph regions and the computing cores to the computing core compiling module. The computing core compiling module converts each target graph region into a second intermediate representation according to its mapping relationship, and issues the second intermediate representations to the corresponding compilers to generate the binary instruction code segments for the target graph regions.
S305, the computing core compiling module generates the binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and on the binary instruction code segments.
It can be understood that the computing core compiling module combines the binary instruction code segments generated by the individual compilers and, by adding the dependency relationships between the graph regions, obtains the compiled binary instruction sequence.
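One way to realize S305 is to order the per-region segments by their dependencies with a topological sort; this is an assumed realization consistent with the description above, and the Segment layout is an assumption.

```cpp
#include <queue>
#include <vector>

struct Segment {
    std::vector<unsigned char> code;  // binary instruction code segment
    std::vector<int> deps;            // segments this one depends on
};

std::vector<int> SequenceSegments(const std::vector<Segment>& segs) {
    int n = (int)segs.size();
    std::vector<int> indegree(n, 0);
    std::vector<std::vector<int>> users(n);
    for (int v = 0; v < n; ++v)
        for (int u : segs[v].deps) { users[u].push_back(v); ++indegree[v]; }
    std::queue<int> ready;
    for (int v = 0; v < n; ++v)
        if (indegree[v] == 0) ready.push(v);          // segments with no prerequisites
    std::vector<int> order;
    while (!ready.empty()) {
        int v = ready.front(); ready.pop();
        order.push_back(v);                           // emit segment v next
        for (int w : users[v])
            if (--indegree[w] == 0) ready.push(w);
    }
    return order;  // emission order of the binary instruction code segments
}
```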
At this point, the compilation process for the heterogeneous system is completed.
In some possible embodiments, after the development process described in S301 to S305 is completed, the binary instruction sequence compiled in S305 may be mapped to the heterogeneous computing cores for execution. Referring to the dashed lines in Fig. 1, the heterogeneous system 10 may further include a runtime system 13. During actual operation, the programming program 11 issues the compiled binary instruction sequence to the heterogeneous computing cores. During issuing, the runtime system 13 may query whether there is a binary instruction code segment waiting to be executed. When the runtime system 13 determines that the data required to execute a pending binary instruction code segment is ready and the heterogeneous computing core corresponding to that segment is idle, the runtime system 13 issues the segment to the corresponding heterogeneous computing core for execution.
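The runtime dispatch rule just described (issue only when data is ready and the target core is idle) can be sketched as follows; all names are assumptions.

```cpp
#include <functional>
#include <vector>

struct PendingSegment {
    int targetCore = 0;     // id of the heterogeneous computing core
    bool dataReady = false; // inputs produced by earlier segments are available
    bool issued = false;
};

void DispatchOnce(std::vector<PendingSegment>& pending,
                  std::vector<bool>& coreIdle,
                  const std::function<void(int seg, int core)>& issue) {
    for (int s = 0; s < (int)pending.size(); ++s) {
        PendingSegment& p = pending[s];
        if (!p.issued && p.dataReady && coreIdle[p.targetCore]) {
            coreIdle[p.targetCore] = false;  // the core becomes busy
            p.issued = true;
            issue(s, p.targetCore);          // hand the segment to the core
        }
    }
}
```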
In the embodiments of the present disclosure, when programs are developed on a heterogeneous system, the source language code can describe algorithms using both a scalar programming language and a tensor programming language; there is no need to program each heterogeneous computing core separately, since scalar computation and tensor computation can be written together in one program, which greatly improves development efficiency. Furthermore, through computation graph division, the same computation function can be mapped onto different heterogeneous computing cores for execution, shortening the execution time of the function and improving the effective utilization of computing resources.
Based on the same inventive concept, the disclosed embodiments provide a compiling apparatus for compiling programs for heterogeneous systems.
Fig. 9 is a schematic structural diagram of a compiling apparatus in an embodiment of the disclosure. Referring to Fig. 9, the compiling apparatus 900 may include: a programming front-end module 901, configured to obtain a computation graph corresponding to source language code, where the source language code is written based on a programming model, and the programming model describes algorithms using a scalar programming language and a tensor programming language; a computation graph processing module 902, configured to divide the computation graph into at least one target graph region, where the at least one target graph region includes a scalar graph region containing scalar computing nodes and/or a tensor graph region containing tensor computing nodes; and a computing core compiling module 903, configured to generate a corresponding binary instruction code segment for each target graph region, and to generate a binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and on the binary instruction code segments, where the binary instruction code segment corresponding to the scalar graph region is executable on a first computing core for scalar computation, and the binary instruction code segment corresponding to the tensor graph region is executable on a second computing core for tensor computation.
In the above solution, the programming model includes a built-in function interface that is written in the tensor programming language and can run on the second computing core, and the built-in function interface is called to perform tensor computation on tensor data.
In the above solution, the programming front-end module 901 is configured to convert the source language code into a first intermediate representation, where the first intermediate representation is used to represent scalar data and/or tensor data; and to generate the computation graph based on the first intermediate representation, where the computing nodes in the computation graph represent the functions corresponding to the source language code, and the directed edges represent the dependency relationships between the functions.
In the above solution, the programming front-end module 901 is further configured to optimize the computation graph, where the optimization includes at least one of: deleting redundant computing nodes, fusing computing nodes, equivalently converting complex computations into simple computations, and merging loop blocks.
In the above solution, the computation graph processing module 902 is configured to divide the computing nodes in the computation graph that have dependency relationships and the same computation type into a computation region; collect the mappable computing cores of each computing node in each computation region; divide the computing nodes in each computation region that have dependency relationships and can be mapped to the same computing core into a graph region; and determine the at least one target graph region from the graph regions in each computation region.
In the above solution, the computation region is a scalar computation region, and the computation graph processing module 902 is configured to determine the at least one graph region corresponding to the computation graph as the at least one target graph region.
In the above solution, the computation region is a tensor computation region, and the computation graph processing module 902 is configured to: create a graph region set for the i-th graph region, where i is a positive integer; traverse the graph region set, and divide the computing nodes in the i-th graph region that have dependency relationships and can be mapped to the same computing core into a candidate graph region; determine a first profit value of the i-th graph region according to the execution durations of the candidate graph regions of the i-th graph region on their corresponding computing cores; determine, from the first profit values, a second profit value that is greater than a preset profit value and has the largest value; update the graph region set with the candidate graph regions corresponding to the second profit value, and return to the step of traversing the graph region set until all first profit values are less than or equal to the preset profit value; and determine the candidate graph regions corresponding to the first profit values as target graph regions.
In the above solution, the computation graph processing module 902 is further configured to divide the computing nodes in the computation graph that have dependency relationships and the same type into a computation region; calculate the execution duration of each computing node in each computation region on its mappable computing cores, and determine the mappable computing core with the shortest execution duration as the target computing core of that computing node; merge the computing nodes in each computation region that have the same target computing core into a graph region; and determine the at least one target graph region from the merged graph regions.
In the above solution, the computing core compiling module 903 is configured to determine a mapping relationship between each of the at least one target graph region and a computing core; convert each target graph region into a second intermediate representation according to the mapping relationship; and generate the binary instruction code segment based on the second intermediate representation.
Based on the same inventive concept, the embodiments of the present disclosure provide a compiling device, including: a processor; and a memory for storing processor-executable instructions; where the processor is configured to execute the executable instructions to implement the compiling method described in one or more of the above embodiments.
Based on the same inventive concept, the embodiments of the present disclosure provide a computer-readable storage medium storing an executable program, where the executable program, when executed by a processor, implements the compiling method described in one or more of the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (21)
1. A compilation method based on a heterogeneous system including a plurality of computing cores having different architectures, the method comprising:
obtaining a computation graph corresponding to source language code, wherein the source language code is written based on a programming model, and the programming model performs algorithm description through a scalar programming language and a tensor programming language;
dividing the computation graph into at least one target graph region, the at least one target graph region including a scalar graph region containing scalar computing nodes and/or a tensor graph region containing tensor computing nodes;
generating a corresponding binary instruction code segment for each target graph region, the binary instruction code segment corresponding to the scalar graph region being executable in a first computing core for scalar computation, and the binary instruction code segment corresponding to the tensor graph region being executable in a second computing core for tensor computation;
and generating a binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and the binary instruction code segments.
2. The method of claim 1, wherein the programming model comprises a built-in function interface written in the tensor programming language and operable in the second computing core, the built-in function interface being invoked to perform tensor computation on tensor data.
3. The method of claim 1, wherein obtaining the computation graph corresponding to the source language code comprises:
converting the source language code into a first intermediate representation, the first intermediate representation being used to represent scalar data and/or tensor data;
and generating the computation graph based on the first intermediate representation, wherein the computing nodes in the computation graph represent the functions corresponding to the source language code, and the directed edges in the computation graph represent the dependency relationships among the functions.
4. The method of claim 3, wherein after the generating the computation graph based on the first intermediate representation, the method further comprises:
performing optimization processing on the computation graph;
wherein the optimization processing comprises at least one of: deleting redundant computing nodes, fusing computing nodes, equivalently converting complex computations into simple computations, and merging loop blocks.
5. The method of claim 1, wherein the dividing the computation graph into at least one target graph region comprises:
dividing the computing nodes in the computation graph that have dependency relationships and the same computation type into a computation region;
counting the mappable computing cores of each computing node in each computation region;
dividing the computing nodes in each computation region that have dependency relationships and can be mapped to the same computing core into a graph region;
and determining the at least one target graph region according to the graph regions in each computation region.
6. The method of claim 5, wherein the computation region is a scalar computation region; the determining the at least one target graph region according to the graph regions in each computation region comprises:
determining the graph region as the at least one target graph region.
7. The method of claim 5, wherein the computation region is a tensor computation region; the determining the at least one target graph region according to the graph regions in each computation region comprises:
creating a graph region set for an ith graph region among the graph regions;
traversing the graph region set, and dividing the computing nodes in the ith graph region that have dependency relationships and can be mapped to the same computing core into a candidate graph region, wherein i is a positive integer;
determining a first profit value of the ith graph region according to the execution duration of the candidate graph regions of the ith graph region on the corresponding computing cores;
determining, from the first profit values, a second profit value that is greater than a preset profit value and has the largest value;
updating the candidate graph region corresponding to the second profit value into the graph region set, and returning to the step of traversing the graph region set until all the first profit values are less than or equal to the preset profit value;
and determining the candidate graph region corresponding to the first profit value as the target graph region.
8. The method of claim 1, wherein the dividing the computation graph into at least one target graph region comprises:
dividing the computing nodes in the computation graph that have dependency relationships and the same type into a computation region;
calculating the execution duration of each computing node in each computation region on its mappable computing cores, and determining the mappable computing core with the minimum execution duration as the target computing core of the computing node;
merging the computing nodes in each computation region that have the same target computing core into a graph region;
and determining the at least one target graph region according to the merged graph regions.
9. The method of claim 1, wherein the generating a corresponding binary instruction code segment for each target graph region comprises:
determining a mapping relationship between each target graph region in the at least one target graph region and a computing core;
converting each target graph region into a second intermediate representation according to the mapping relationship;
generating the binary instruction code segment based on the second intermediate representation.
10. The method of any of claims 1 to 9, wherein the first computing core comprises at least one of: CPU, DSP; the second computing core comprises at least one of: CPU, TPU, GPU, DSP, ASIC, FPGA.
11. A compiling apparatus for compiling programs for a heterogeneous system, the heterogeneous system including a plurality of computing cores having different architectures, the compiling apparatus comprising:
a programming front-end module, configured to obtain a computation graph corresponding to source language code, wherein the source language code is written based on a programming model, and the programming model performs algorithm description through a scalar programming language and a tensor programming language;
a computation graph processing module, configured to divide the computation graph into at least one target graph region, wherein the at least one target graph region includes a scalar graph region containing scalar computing nodes and/or a tensor graph region containing tensor computing nodes;
and a computation core decoding module, configured to generate a corresponding binary instruction code segment for each target graph region, and generate a binary instruction sequence corresponding to the source language code based on the dependency relationships between the target graph regions and the binary instruction code segments, wherein the binary instruction code segment corresponding to the scalar graph region is executable in a first computing core for scalar computation, and the binary instruction code segment corresponding to the tensor graph region is executable in a second computing core for tensor computation.
12. The apparatus of claim 11, wherein the programming model comprises a built-in function interface written in the tensor programming language and operable in the second computing core, the built-in function interface being invoked to perform tensor computation on tensor data.
13. The apparatus of claim 11, wherein the programming front-end module is configured to convert the source language code into a first intermediate representation, the first intermediate representation being used to represent scalar data and/or tensor data; and generate the computation graph based on the first intermediate representation, wherein the computing nodes in the computation graph represent the functions corresponding to the source language code, and the directed edges in the computation graph represent the dependency relationships between the functions.
14. The apparatus of claim 13, wherein the programming front-end module is further configured to perform optimization processing on the computation graph; wherein the optimization processing comprises at least one of: deleting redundant computing nodes, fusing computing nodes, equivalently converting complex computations into simple computations, and merging loop blocks.
15. The apparatus according to claim 11, wherein the computation graph processing module is configured to divide the computing nodes in the computation graph that have dependency relationships and the same computation type into a computation region; count the mappable computing cores of each computing node in each computation region; divide the computing nodes in each computation region that have dependency relationships and can be mapped to the same computing core into a graph region; and determine the at least one target graph region according to the graph regions in each computation region.
16. The apparatus of claim 15, wherein the computation region is a scalar computation region, and the computation graph processing module is configured to determine the at least one graph region corresponding to the computation graph as the at least one target graph region.
17. The apparatus of claim 15, wherein the computation region is a tensor computation region; the computation graph processing module is configured to: create a graph region set for an ith graph region among the graph regions; traverse the graph region set, and divide the computing nodes in the ith graph region that have dependency relationships and can be mapped to the same computing core into a candidate graph region, wherein i is a positive integer; determine a first profit value of the ith graph region according to the execution duration of the candidate graph regions of the ith graph region on the corresponding computing cores; determine, from the first profit values, a second profit value that is greater than a preset profit value and has the largest value; update the candidate graph region corresponding to the second profit value into the graph region set, and return to the step of traversing the graph region set until all the first profit values are less than or equal to the preset profit value; and determine the candidate graph region corresponding to the first profit value as the target graph region.
18. The apparatus according to claim 11, wherein the computation graph processing module is further configured to divide the computing nodes in the computation graph that have dependency relationships and the same type into a computation region; calculate the execution duration of each computing node in each computation region on its mappable computing cores, and determine the mappable computing core with the minimum execution duration as the target computing core of the computing node; merge the computing nodes in each computation region that have the same target computing core into a graph region; and determine the at least one target graph region according to the merged graph regions.
19. The apparatus of claim 11, wherein the computation core decoding module is configured to determine a mapping relationship between each target graph region in the at least one target graph region and a computing core; convert each target graph region into a second intermediate representation according to the mapping relationship; and generate the binary instruction code segment based on the second intermediate representation.
20. A compiling device characterized by comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1 to 10 when executing the executable instructions.
21. A computer-readable storage medium, characterized in that the computer-readable storage medium stores an executable program, wherein the executable program, when executed by a processor, implements the method of any one of claims 1 to 10.
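Claims 3 and 4 above describe the front end: the source language code is converted to a first intermediate representation, a computation graph is built from it, and optimizations such as deleting redundant computing nodes are applied. The following is a minimal sketch, assuming a toy record-based IR and using dead-node elimination as the example optimization; all structures and names are illustrative assumptions, not the patent's IR.

```python
def build_graph(first_ir):
    """first_ir: list of (result, function, operands) records; returns the
    computing nodes (result -> function) and directed dependency edges."""
    nodes = {res: fn for res, fn, _ in first_ir}
    edges = {(src, res) for res, _, ops in first_ir for src in ops if src in nodes}
    return nodes, edges

def delete_redundant_nodes(nodes, edges, outputs):
    """Drop computing nodes that no program output (transitively) depends on."""
    live = set(outputs)
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if dst in live and src not in live:
                live.add(src)
                changed = True
    nodes = {n: f for n, f in nodes.items() if n in live}
    edges = {(s, d) for s, d in edges if s in live and d in live}
    return nodes, edges

first_ir = [
    ("t0", "matmul", ["a", "b"]),
    ("t1", "relu",   ["t0"]),
    ("t2", "add",    ["a", "b"]),   # dead: nothing consumes t2
]
nodes, edges = build_graph(first_ir)
print(delete_redundant_nodes(nodes, edges, outputs=["t1"]))
# -> ({'t0': 'matmul', 't1': 'relu'}, {('t0', 't1')})
```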
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110747965.7A | 2021-07-02 | 2021-07-02 | Heterogeneous system based compiling method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113553054A (en) | 2021-10-26 |
Family
ID=78102559
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110747965.7A (Pending) | Heterogeneous system based compiling method, device, equipment and storage medium | 2021-07-02 | 2021-07-02
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113553054A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116974580A (en) * | 2023-09-25 | 2023-10-31 | Zhejiang Lab | Multi-modal network compiling method, system and storage medium
CN116974580B (en) * | 2023-09-25 | 2024-01-09 | Zhejiang Lab | Multi-modal network compiling method, system and storage medium
CN117331541A (en) * | 2023-10-27 | 2024-01-02 | Beijing Academy of Artificial Intelligence | Compiling and operating method and device for dynamic graph frame and heterogeneous chip
CN118246377A (en) * | 2024-05-27 | 2024-06-25 | Beijing Suiyuan Intelligent Technology Co., Ltd. | Simulator architecture, simulation method, simulation equipment and medium of tensor processor
CN118246377B (en) * | 2024-05-27 | 2024-08-13 | Beijing Suiyuan Intelligent Technology Co., Ltd. | Simulator architecture, simulation method, simulation equipment and medium of tensor processor
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |