CN116484947B - Operator automatic generation method, device, equipment and medium


Info

Publication number: CN116484947B (application published as CN116484947A)
Application number: CN202310744779.7A
Authority: CN (China)
Other languages: Chinese (zh)
Priority / filing date: 2023-06-25
Legal status: Active (granted)
Prior art keywords: forward data flow graph, node, chip
Inventors: 翟鑫奕, 陈禹东, 谭磊, 田野
Original assignee: Shanghai Enflame Technology Co., Ltd.
Current assignee: Shanghai Suiyuan Technology Co., Ltd.


Classifications

    • G06N 3/10 Neural networks: interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G06F 13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06F 15/781 System on chip: on-chip cache; off-chip memory
    • G06T 1/20 Processor architectures; processor configuration, e.g. pipelining
    • G06T 1/60 Memory management
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an operator automatic generation method, device, equipment and medium, comprising the following steps: compiling, through a front-end compiler, the algorithm primitive language input by a user to obtain an initial forward data flow graph; optimizing each node in the initial forward data flow graph according to specified rules to obtain an optimized forward data flow graph; acquiring a chip instruction set matched with the optimized forward data flow graph; and compiling, through a back-end compiler, a target algorithm according to the chip instruction set. After the front-end compiler compiles the algorithm primitive language input by the user into the initial forward data flow graph, the graph is optimized, a chip instruction set corresponding to the optimized forward data flow graph is obtained, and the target algorithm is automatically generated from the chip instruction set by the back-end compiler, so that an algorithm can be generated automatically on a DMA-based multi-level cache processor device without requiring the user to supply parameters.

Description

Operator automatic generation method, device, equipment and medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to an operator automatic generation method, device, equipment and medium.
Background
In the development of deep learning models, corresponding algorithm models need to be designed. Because deep learning advances day by day and new operators emerge endlessly, how to balance the algorithm and development efficiency has become important.
At present, most deep learning frameworks still rely on manually optimizing the corresponding operators. Implementing operators one by one in this way takes a long time to define, is difficult to maintain, and generally depends on senior deep learning algorithm engineers. Moreover, the existing optimization strategies mainly target cache-based chip architectures such as CPUs or GPUs, so algorithm generation under a multi-level cache chip architecture based on direct memory access (Direct Memory Access, DMA) cannot be realized.
Disclosure of Invention
The embodiment of the invention provides an operator automatic generation method, device, equipment and medium, which are used for realizing automatic generation of an algorithm under a multi-level cache processor architecture based on DMA.
In a first aspect, an embodiment of the present invention provides a method for automatically generating an operator, including: compiling according to an algorithm primitive language input by a user through a front-end compiler to obtain an initial forward data flow graph, wherein the initial forward data flow graph comprises calculation flow nodes and data flow nodes;
Optimizing each node in the initial forward data flow graph according to a specified rule to obtain an optimized forward data flow graph, wherein the specified rule comprises a node strategy mark, node merging and node rearrangement;
acquiring a chip instruction set matched with the optimized forward data flow graph, wherein the chip instruction set comprises chip instructions corresponding to all nodes in the optimized forward data flow graph;
and compiling the target algorithm through a back-end compiler according to the chip instruction set.
In a second aspect, an embodiment of the present invention provides an apparatus for automatically generating an operator, including:
an initial forward data flow graph acquisition module, which is used for obtaining an initial forward data flow graph through compiling by a front-end compiler according to an algorithm primitive input by a user, wherein the initial forward data flow graph comprises calculation flow nodes and data flow nodes;
the optimized forward data flow graph acquisition module is used for optimizing each node in the initial forward data flow graph according to a specified rule to acquire the optimized forward data flow graph, wherein the specified rule comprises a node strategy mark, node merging and node rearrangement;
the chip instruction set acquisition module is used for acquiring a chip instruction set matched with the optimized forward data flow graph, wherein the chip instruction set comprises chip instructions corresponding to all nodes in the optimized forward data flow graph;
And the target algorithm generating module is used for compiling and generating a target algorithm through a back-end compiler according to the chip instruction set.
In a third aspect, embodiments of the present application provide a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a method as described above when executing the program.
In a fourth aspect, embodiments of the present application provide a storage medium having a computer program stored thereon which, when executed by a processor, implements the method described above.
According to the present application, the front-end compiler compiles the algorithm primitive language input by the user into an initial forward data flow graph, the initial forward data flow graph is then optimized, a chip instruction set corresponding to the optimized forward data flow graph is obtained, and the target algorithm is automatically generated from the chip instruction set by the back-end compiler, so that an algorithm can be generated automatically on a DMA-based multi-level cache processor device without requiring the user to supply parameters.
Drawings
FIG. 1 is a flow chart of an operator auto-generation method according to a first embodiment of the present application;
FIG. 2 is an overall schematic diagram of an operator auto-generation method according to a first embodiment of the present invention;
FIG. 3 is a schematic diagram of DMA auto-segmentation based on a power of 2 approximation according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of multi-dimensional fusion using DMA insertion according to an embodiment of the present invention;
FIG. 5 is a flowchart of an operator auto-generation method according to a second embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an operator automatic generation device according to a third embodiment of the present invention;
fig. 7 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
Fig. 1 is a flowchart of an operator automatic generation method provided in an embodiment of the present invention, where the embodiment is applicable to a case where a DMA-based multi-level cache processor device performs automatic algorithm generation, the method may be performed by an operator automatic generation device, and the device may be implemented by software and/or hardware, and the operator automatic generation method includes:
Step S101, compiling is carried out through a front-end compiler according to the algorithm primitive language input by the user to obtain an initial forward data flow diagram.
Specifically, in this embodiment, a user may define in advance a set of algorithm primitives suitable for the compiler under the DMA-based multi-level cache processor architecture, and the user-defined algorithm primitives may include computation stream primitives and data stream primitives. The computation stream primitives may include basic operations or operator loop variables, such as addition, subtraction, multiplication and division; the data stream primitives include marking the location where data is stored on the chip, data handling, and tensor slicing or vectorization, for example a multi-level storage access handling strategy between low-speed storage, medium-low-speed storage and high-speed storage. The data stream primitives are particularly adapted to the requirements of the DMA-based multi-level cache processor architecture.
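To make the primitive split concrete, the following is a minimal sketch of one way such user-defined primitives could be expressed; the patent does not publish a primitive syntax, so the class and field names here (Level, ComputePrimitive, DataPrimitive and the example program) are illustrative assumptions rather than the actual interface.

```python
# Illustrative sketch only; all names are assumptions, not the patent's real primitive set.
from dataclasses import dataclass
from enum import Enum

class Level(Enum):          # storage levels of a DMA-based multi-level cache chip
    L1 = 1                  # high-speed on-chip memory
    L2 = 2                  # medium-speed memory
    L3 = 3                  # low-speed (e.g. off-chip) memory

@dataclass
class ComputePrimitive:     # computation stream primitive: basic op or operator loop variable
    op: str                 # "add", "sub", "mul", "div", ...
    inputs: tuple
    output: str

@dataclass
class DataPrimitive:        # data stream primitive: marks where a tensor lives and how it is carried
    tensor: str
    src: Level
    dst: Level

# A user-written operator, e.g. Y = A + B, expressed with both kinds of primitives:
program = [
    DataPrimitive("A", src=Level.L3, dst=Level.L1),   # carry A down to high-speed storage
    DataPrimitive("B", src=Level.L3, dst=Level.L1),   # carry B down to high-speed storage
    ComputePrimitive("add", inputs=("A", "B"), output="Y"),
    DataPrimitive("Y", src=Level.L1, dst=Level.L3),   # carry the result back up
]
```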
The front-end compiler in the DMA-based multi-level cache processor device receives the algorithm primitives defined by the user and compiles the data stream primitives and the computation stream primitives into an initial forward data flow graph by adopting a loop-optimization-based text analysis strategy, i.e. the loop stmt strategy; the initial forward data flow graph includes computation stream nodes and data stream nodes. The specific principle of the loop stmt strategy is not the focus of the present application, so a detailed description is omitted in this embodiment.
It should be noted that, compared with other operator generation frameworks, the embodiment of the invention extends the operator generation strategy, the operator description primitives and the like under a specific architecture scheme, filling the gap left by the lack of an operator generation strategy under the DMA-based multi-level cache chip architecture. Chip characteristics describing various DMA-based multi-level cache architectures can be defined, with the different memory modules in the chip named L1, L2, L3 and so on according to memory efficiency; for example, the high-speed memory level may be defined as L1 or any other name, so that the data of a computation flow can be marked at a specific location in the memory architecture. Meanwhile, for different chip handling modes, a defined DMA strategy can be adopted: handling between different levels is simply understood as assignment, the problem of crossing storage levels does not need to be considered, and the handling flow of the DMA is distributed and managed by the compiler.
Step S102, optimizing each node in the initial forward data flow graph according to a specified rule, and obtaining an optimized forward data flow graph.
Optionally, optimizing each node in the initial forward data flow graph according to a specified rule to obtain an optimized forward data flow graph, including: performing tensor processing on the calculation flow nodes in the initial forward data flow graph, and performing DMA strategy marking on the data flow nodes in the initial forward data flow graph to obtain a first optimized forward data flow graph; multidimensional fusion is carried out on data stream nodes with association relations in the first optimized forward data flow graph in a DMA (direct memory access) insertion mode, and calculation stream nodes in the first optimized forward data flow graph are reserved to obtain a second optimized forward data flow graph; determining the running time of each node in the second optimized forward data flow graph, and carrying out position rearrangement on the data flow nodes adjacent to the calculation flow nodes according to the running time to obtain a third optimized forward data flow graph, wherein the data flow nodes subjected to position rearrangement are marked by prefetch instructions; merging the calculation flow nodes of the designated type in the third optimization forward data flow graph, and reserving the data flow nodes in the third optimization forward data flow graph to obtain a fourth optimization forward data flow graph.
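Read as a pipeline, the four optimization passes described above could be chained as in the following sketch; the pass functions are placeholders named after the steps in this paragraph, not actual API of the patent's compiler, and each stub simply returns the graph unchanged.

```python
# Minimal sketch of the four-pass optimization pipeline; all function names are assumptions.

def tensorize_and_mark_dma(graph):      # pass 1: tensorize compute nodes, mark DMA policies
    return graph

def fuse_dma_nodes(graph):              # pass 2: multidimensional fusion of associated DMA nodes
    return graph

def reorder_for_prefetch(graph):        # pass 3: reorder data flow nodes, mark prefetch instructions
    return graph

def merge_compute_nodes(graph):         # pass 4: merge compute nodes of specified types
    return graph

def optimize(initial_forward_graph):
    g = tensorize_and_mark_dma(initial_forward_graph)   # first optimized forward data flow graph
    g = fuse_dma_nodes(g)                                # second optimized forward data flow graph
    g = reorder_for_prefetch(g)                          # third optimized forward data flow graph
    g = merge_compute_nodes(g)                           # fourth optimized forward data flow graph
    return g
```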
Optionally, the DMA policy marking is performed on the data flow node in the initial forward data flow graph, including: acquiring a pre-configured compiler chip search space, wherein the compiler chip search space comprises the corresponding relation between each chip type and a storage structure; determining the type of a current application chip of the adaptive front-end compiler, and determining a target storage structure corresponding to the current application chip by traversing a compiler chip search space, wherein the target storage structure comprises the number of storage levels and the capacity of the storage levels; acquiring a position level of each data stream node in an initial forward data flow graph, which is positioned on a current application chip, and determining a carrying strategy for the data stream node according to the position level and a target storage structure; the handling policy is marked on the corresponding data flow node in the initial forward data flow graph for DMA policy marking.
Optionally, determining a handling policy for the data flow node according to the location hierarchy and the target storage structure includes: determining an associated calculation stream node with a logical relationship with the data stream node, and acquiring attribute information of the associated calculation stream node, wherein the attribute information comprises an operand, tensor information and a type; splitting the lowest storage level in the target storage structure according to the attribute information of the associated calculation flow node to obtain calculation space capacity; determining a carrying direction for the data flow node according to the position hierarchy and the number of storage hierarchies in the target storage structure; determining the carrying times of the data stream node in each carrying direction according to the calculated space capacity and the storage hierarchy capacity in the target storage structure; the carrying direction and the carrying times are used as carrying strategies for the data flow nodes.
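As a rough illustration of how a handling policy could be derived from the position level and the target storage structure, the sketch below computes carrying directions and carrying counts; the simple integer division and the capacity numbers only mirror the worked example given later in this embodiment, and every name here is an assumption.

```python
def handling_policy(position_level, level_capacities_kb, compute_space_kb):
    """Derive a handling (carrying) policy for a data flow node.

    position_level      -- level where the node's data finally lives, e.g. 1 for L1
    level_capacities_kb -- {level: capacity in kB}, e.g. {1: 512, 2: 512 * 1024, 3: 8 * 1024 * 1024}
    compute_space_kb    -- computation space carved out of the lowest level (e.g. 384)
    """
    lowest = max(level_capacities_kb)                    # deepest (largest, slowest) level
    directions = [(src, src - 1) for src in range(lowest, position_level, -1)]
    counts = {}
    carried_per_step_kb = compute_space_kb
    for src, dst in reversed(directions):                # start from the level feeding the target level
        counts[(src, dst)] = level_capacities_kb[src] // carried_per_step_kb
        carried_per_step_kb *= counts[(src, dst)]        # one deeper carry supplies all shallower carries
    return directions, counts

# Example mirroring the numbers used later: L1 = 512 kB, L2 = 512 MB, L3 = 8 GB, 384 kB per carry.
dirs, times = handling_policy(1, {1: 512, 2: 512 * 1024, 3: 8 * 1024 * 1024}, 384)
print(dirs)    # [(3, 2), (2, 1)]
print(times)   # {(2, 1): 1365, (3, 2): 16}
```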
Specifically, as shown in fig. 2, which is an overall schematic diagram of the operator automatic generation method in this embodiment, after the front-end compiler compiles the algorithm primitives input by the user to obtain the initial forward data flow graph, and since the initial forward data flow graph includes a plurality of data flow nodes and computation flow nodes, the front-end compiler processes the data flow nodes and the computation flow nodes separately to obtain a first optimized forward data flow graph. Compiling optimization step a shown in fig. 2 is performed for the computation flow nodes: tensorization is applied to each computation flow node to obtain computation flow nodes in a parallelized tensor format. For example, when the computation flow node is the addition Y = A + B, each parameter is tensorized to obtain the tensor result [Y] = [A] + [B], where [A] = [a ... a] and [B] = [b ... b]; the parallelism of each parameter is the same and can be determined by the user, which is not limited in this embodiment.
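A tiny illustration of the tensorization in optimization step a: the scalar addition Y = A + B becomes an element-wise addition over parallel tensors of equal parallelism. NumPy is used here purely for illustration; the patent does not prescribe any particular tensor library, and the parallelism value is an arbitrary example.

```python
import numpy as np

parallelism = 8                      # degree of parallelism, chosen by the user
a, b = 3.0, 4.0                      # scalar parameters of the compute node Y = A + B

A = np.full(parallelism, a)          # [A] = [a ... a]
B = np.full(parallelism, b)          # [B] = [b ... b]
Y = A + B                            # [Y] = [A] + [B], computed element-wise in parallel

print(Y)                             # [7. 7. 7. 7. 7. 7. 7. 7.]
```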
In addition, for the data flow node in the initial forward data flow graph, DMA policy marking is performed, and when the DMA policy marking is performed, a preconfigured compiler chip search space needs to be obtained, where the search space includes a correspondence between each chip type and a storage structure, and table 1 below is an example of the search space:
TABLE 1
However, due to space limitations, Table 1 only illustrates the storage structures corresponding to two chip types, and the specific form of the storage structure corresponding to each chip type is not limited here. Therefore, after determining that the type of the chip currently adapted to the front-end compiler is type b, the target storage structure corresponding to the current application chip can be determined by traversing the compiler chip search space as follows: the number of storage levels is 3, the L1-level capacity is 512 kB, the L2-level capacity is 30 MB and the L3-level capacity is 10 GB, and the handling capacity supported by each level also differs. Because the position level of each data flow node in the initial forward data flow graph on the current application chip has already been acquired, the associated computation flow node having a logical relationship with a given data flow node can be determined, the attribute information of that associated computation flow node can be acquired, and the lowest storage level in the target storage structure can be split according to this attribute information to obtain the computation space capacity. As shown in fig. 3, which is a schematic diagram of DMA auto-segmentation based on a power-of-2 approximation, when it is determined that the associated computation flow node corresponding to data flow node F is W(x, y) = A(x, y) + B(x, y), and the constraint condition on the associated computation flow node is: maximize f(x, y) s.t. 0 ≤ x ≤ m, 0 ≤ y ≤ m, 3·x·y·sizeof(float) ≤ 512 kB, y mod vectorsize = 0, vectorsize = 32 B, the attribute information of the associated computation flow node is shown in Table 2 below:
TABLE 2
The associated computation flow node can thus be determined to include three operands, its tensor parallelization information is 32 and its type is float, and the final size is the computation space capacity required by a single parameter of the associated computation flow node; this capacity can then be computed with the power-of-2 approximation of the following formula (1):

3 × 2^n ≤ 512 kB, with n taken as the largest integer satisfying the inequality (1)

Solving formula (1) for n gives n = 7, so 2^7 = 128 kB is taken as the computation space capacity required by a single parameter. Accordingly, 128 kB is filled in as the final size corresponding to function name A, function name B and function name C in Table 2, respectively, and the total computation space capacity for the associated computation flow node is 128 × 3 = 384 kB. As shown in fig. 3, which is a schematic diagram of DMA auto-segmentation based on a power-of-2 approximation, when it is determined that the lowest storage level L1 in the target storage structure is 512 kB, L1 is segmented into a 384 kB current operator computation space and a 128 kB sub-graph parallel reserved space. When L1 = 512 kB, L2 = 512 MB and L3 = 8 GB, since the data flow node is known to be located at the L1 level, the number of carries required from L2 to L1 is 512 MB / 384 kB ≈ 1365, and the number of carries required from L3 to L2 is 8 GB / (1365 × 384 kB) ≈ 16; the obtained carrying direction and number of carries are then used as the handling policy of data flow node F, and this handling policy is marked on data flow node F in the initial forward data flow graph to perform DMA policy marking. In this embodiment, the first optimized forward data flow graph is obtained from the tensorized computation flow nodes and the DMA-policy-marked data flow nodes.
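The arithmetic of this worked example can be checked with a few lines; the sizes below are the ones quoted in the text, and the power-of-2 rounding follows formula (1) as reconstructed above.

```python
L1_KB, L2_KB, L3_KB = 512, 512 * 1024, 8 * 1024 * 1024   # 512 kB, 512 MB, 8 GB
OPERANDS = 3                                              # W, A and B in W(x, y) = A(x, y) + B(x, y)

# Formula (1): largest power of two (in kB) such that three such buffers fit in L1.
n = 0
while OPERANDS * 2 ** (n + 1) <= L1_KB:
    n += 1
per_param_kb = 2 ** n                                      # 2^7 = 128 kB per parameter
compute_space_kb = OPERANDS * per_param_kb                 # 384 kB operator computation space
reserved_kb = L1_KB - compute_space_kb                     # 128 kB sub-graph parallel reserve

carries_l2_to_l1 = L2_KB // compute_space_kb                          # carries from L2 to L1
carries_l3_to_l2 = L3_KB // (carries_l2_to_l1 * compute_space_kb)     # carries from L3 to L2

print(n, per_param_kb, compute_space_kb, reserved_kb)      # 7 128 384 128
print(carries_l2_to_l1, carries_l3_to_l2)                  # 1365 16
```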
It should be noted that after the first optimized forward data flow graph is obtained, compiling optimization b in fig. 2 is executed to perform multidimensional fusion, by means of DMA insertion, on the data flow nodes having an association relationship in the flow graph. As shown in fig. 4, which is a schematic diagram of multidimensional fusion using DMA insertion, DMAs of multiple dimensions, such as DMA L2->L1 + DMA L2->L1, are combined through the technique shown in fig. 4: if an upper node and a lower node have the same DMA level and position, the positions and sizes of the two nodes can be analyzed so that the DMA operations are merged, and the required number of executions drops from 2 to 1, saving execution time. In addition, the multidimensional fusion is applied only to the data flow nodes, and the computation flow nodes in the first optimized forward data flow graph are correspondingly retained, so that the second optimized forward data flow graph is obtained.
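A schematic sketch of the fusion check in optimization b: two adjacent data flow (DMA) nodes that move data between the same pair of levels, over contiguous positions, can be merged into a single DMA of twice the size. The node representation and the contiguity test below are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class DmaNode:
    src_level: str      # e.g. "L2"
    dst_level: str      # e.g. "L1"
    offset: int         # start position of the carried region, in bytes
    size: int           # size of the carried region, in bytes

def try_fuse(upper: DmaNode, lower: DmaNode):
    """Merge two DMA nodes if they use the same levels and their regions are contiguous."""
    same_path = (upper.src_level == lower.src_level and upper.dst_level == lower.dst_level)
    contiguous = upper.offset + upper.size == lower.offset
    if same_path and contiguous:
        return DmaNode(upper.src_level, upper.dst_level, upper.offset, upper.size + lower.size)
    return None

# Two L2->L1 carries of adjacent 128 kB regions become one 256 kB carry (2 executions -> 1).
fused = try_fuse(DmaNode("L2", "L1", 0, 128 * 1024), DmaNode("L2", "L1", 128 * 1024, 128 * 1024))
print(fused)
```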
Specifically, after the second optimized forward data flow graph is obtained, since the computation flow and the data flow exist separately and can run simultaneously, the latency of memory access needs to be hidden behind the computation. This is done by modifying the positions of the data flow nodes in the second optimized forward data flow graph to obtain the third optimized forward data flow graph, specifically by performing compiling optimization c in fig. 2.
For example, suppose the second optimized forward data flow graph contains four nodes distributed in the order: data flow node m, computation flow node n, data flow node p, computation flow node q, and the running times of nodes m, n, p and q are 1 s, 5 s, 2 s and 2 s respectively. If the nodes are executed in the original order, the total running time is 1 + 5 + 2 + 2 = 10 s. Since the computation flow and the data flow exist separately, the carrying performed by data flow node p does not depend on computation flow node n, so data flow node p can be placed before computation flow node n to obtain a new node distribution order: data flow node m, data flow node p, computation flow node n, computation flow node q. Data flow node p then executes in parallel with the computation, giving a 2 s time overlap, so the total running time obtained after the position rearrangement is 1 + 5 + 2 = 8 s.
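The timing in this example can be reproduced with a small scheduling sketch: only a data flow node that the rearrangement pass marks with a prefetch instruction overlaps the compute node that follows it, so its cost only counts where it exceeds the compute time. The node list and the simple cost model are illustrative assumptions.

```python
# Each node is (name, kind, runtime_s, prefetch); only prefetch-marked data flow nodes
# overlap the compute node placed directly after them.
original = [("m", "data", 1, False), ("n", "compute", 5, False),
            ("p", "data", 2, False), ("q", "compute", 2, False)]
reordered = [("m", "data", 1, False), ("p", "data", 2, True),
             ("n", "compute", 5, False), ("q", "compute", 2, False)]

def total_time(schedule):
    t, i = 0, 0
    while i < len(schedule):
        name, kind, cost, prefetch = schedule[i]
        nxt = schedule[i + 1] if i + 1 < len(schedule) else None
        if kind == "data" and prefetch and nxt and nxt[1] == "compute":
            t += max(cost, nxt[2])     # DMA hidden behind the following compute node
            i += 2
        else:
            t += cost
            i += 1
    return t

print(total_time(original))    # 10  (1 + 5 + 2 + 2)
print(total_time(reordered))   # 8   (1 + max(2, 5) + 2)
```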
After the third optimized forward data flow graph is obtained, graph optimization d in fig. 2 is performed to merge computation flow nodes of specified types, for example combining an addition and a multiplication into a multiply-add operation, while the data flow nodes in the third optimized forward data flow graph are retained, so that the fourth optimized forward data flow graph is obtained. In this way, after the front-end compiler acquires the initial forward data flow graph, it can simplify the graph by means of node policy marking, node merging, node rearrangement and the like, so that the subsequent back-end compiler can automatically generate a more streamlined algorithm.
Step S103, a chip instruction set matched with the optimized forward data flow graph is acquired.
Optionally, the compiler chip search space further includes a space map corresponding to each chip type; the space mapping comprises a corresponding relation between node types and chip instructions.
Optionally, acquiring a chip instruction set matched with the optimized forward dataflow graph includes: extracting each node in the fourth optimized forward data flow graph, and constructing a node set according to the extracted nodes, wherein the node set is marked with the types of each node; inquiring space mapping according to the node set to obtain chip instructions corresponding to all nodes in the node set; and constructing a chip instruction set according to the acquired chip instructions, wherein a back-end compiler supports the chip instruction set.
Specifically, after the fourth optimized forward data flow graph is obtained, a node set is constructed from the nodes in the fourth optimized forward data flow graph, and the type of each node is marked in the node set. Since the search space illustrated in Table 1 further includes a spatial mapping corresponding to each chip type, that is, a correspondence between node types and chip instructions (for example, the chip instruction corresponding to the node type AddOp is 120.Vadd, and this chip instruction can be recognized by the back-end compiler), the chip instruction set corresponding to the node set can be obtained by querying the spatial mapping, and this chip instruction set contains the chip instructions corresponding to each node in the fourth optimized forward data flow graph.
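Instruction selection from the spatial mapping then reduces to a lookup per node type, roughly as sketched below. Only the AddOp entry (120.Vadd) comes from the text; the other mapping entries and the function name are made-up placeholders.

```python
# Spatial mapping for the current chip type: node type -> back-end chip instruction.
spatial_mapping = {
    "AddOp": "120.Vadd",
    "MulOp": "121.Vmul",        # hypothetical entry
    "DmaOp": "200.Dma",         # hypothetical entry
}

def select_instructions(node_set):
    """Build the chip instruction set for the nodes of the fourth optimized graph."""
    missing = [t for t in node_set if t not in spatial_mapping]
    if missing:
        raise KeyError(f"no chip instruction for node types: {missing}")
    return [spatial_mapping[t] for t in node_set]

print(select_instructions(["DmaOp", "AddOp", "DmaOp"]))   # ['200.Dma', '120.Vadd', '200.Dma']
```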
Step S104, compiling the target algorithm through a back-end compiler according to the chip instruction set.
Optionally, compiling by a back-end compiler according to the chip instruction set to generate a target algorithm, including: compiling according to the chip instruction through back-end compiling to obtain an executable binary file; a target algorithm is generated from the binary file.
Specifically, since the obtained chip instruction set contains chip instructions that can be recognized by the back-end compiler, the back-end compiler performs binary compiling according to the chip instruction set to generate an executable binary file, and the target algorithm can be automatically generated from this binary file. Therefore, in this embodiment, by defining dedicated primitives and supporting DMA I/O operations, the compiler takes over the work done by hardware design under the traditional cache architecture. To break through the bandwidth bottleneck of I/O transmission, a prefetching strategy reduces the I/O latency within the computation flow and improves the transmission bandwidth, and computation and transmission proceed in parallel, which improves the running efficiency of the finally generated code. As a result, once the primitives and strategies provided by the invention are used, these schemes become transparent to operator developers, which greatly lowers the threshold required for operator development and improves operator development efficiency while guaranteeing operator execution speed.
According to the present application, the front-end compiler compiles the algorithm primitive language input by the user into an initial forward data flow graph, the initial forward data flow graph is then optimized, a chip instruction set corresponding to the optimized forward data flow graph is obtained, and the target algorithm is automatically generated from the chip instruction set by the back-end compiler, so that an algorithm can be generated automatically on a DMA-based multi-level cache processor device without requiring the user to supply parameters.
Example two
Fig. 5 is a flowchart of an operator automatic generation method according to a second embodiment of the present application, where the embodiment is based on the foregoing embodiment, and after compiling by a back-end compiler according to a chip instruction set to generate a target algorithm, the method further includes verifying the target algorithm, and specifically includes:
step S201, compiling is carried out through a front-end compiler according to the algorithm primitive language input by the user to obtain an initial forward data flow diagram.
Step S202, optimizing each node in the initial forward data flow graph according to a specified rule, and obtaining an optimized forward data flow graph.
Optionally, optimizing each node in the initial forward data flow graph according to a specified rule to obtain an optimized forward data flow graph, including: performing tensor processing on the calculation flow nodes in the initial forward data flow graph, and performing DMA strategy marking on the data flow nodes in the initial forward data flow graph to obtain a first optimized forward data flow graph; multidimensional fusion is carried out on data stream nodes with association relations in the first optimized forward data flow graph in a DMA (direct memory access) insertion mode, and calculation stream nodes in the first optimized forward data flow graph are reserved to obtain a second optimized forward data flow graph; determining the running time of each node in the second optimized forward data flow graph, and carrying out position rearrangement on the data flow nodes adjacent to the calculation flow nodes according to the running time to obtain a third optimized forward data flow graph, wherein the data flow nodes subjected to position rearrangement are marked by prefetch instructions; merging the calculation flow nodes of the designated type in the third optimization forward data flow graph, and reserving the data flow nodes in the third optimization forward data flow graph to obtain a fourth optimization forward data flow graph.
Optionally, the DMA policy marking is performed on the data flow node in the initial forward data flow graph, including: acquiring a pre-configured compiler chip search space, wherein the compiler chip search space comprises the corresponding relation between each chip type and a storage structure; determining the type of a current application chip of the adaptive front-end compiler, and determining a target storage structure corresponding to the current application chip by traversing a compiler chip search space, wherein the target storage structure comprises the number of storage levels and the capacity of the storage levels; acquiring a position level of each data stream node in an initial forward data flow graph, which is positioned on a current application chip, and determining a carrying strategy for the data stream node according to the position level and a target storage structure; the handling policy is marked on the corresponding data flow node in the initial forward data flow graph for DMA policy marking.
Optionally, determining a handling policy for the data flow node according to the location hierarchy and the target storage structure includes: determining an associated calculation stream node with a logical relationship with the data stream node, and acquiring attribute information of the associated calculation stream node, wherein the attribute information comprises an operand, tensor information and a type; splitting the lowest storage level in the target storage structure according to the attribute information of the associated calculation flow node to obtain calculation space capacity; determining a carrying direction for the data flow node according to the position hierarchy and the number of storage hierarchies in the target storage structure; determining the carrying times of the data stream node in each carrying direction according to the calculated space capacity and the storage hierarchy capacity in the target storage structure; the carrying direction and the carrying times are used as carrying strategies for the data flow nodes.
Step S203, a chip instruction set matched with the optimized forward data flow graph is acquired.
Optionally, the compiler chip search space further includes a space map corresponding to each chip type; the space mapping comprises a corresponding relation between node types and chip instructions.
Optionally, acquiring a chip instruction set matched with the optimized forward dataflow graph includes: extracting each node in the fourth optimized forward data flow graph, and constructing a node set according to the extracted nodes, wherein the node set is marked with the types of each node; inquiring space mapping according to the node set to obtain chip instructions corresponding to all nodes in the node set; and constructing a chip instruction set according to the acquired chip instructions, wherein a back-end compiler supports the chip instruction set.
Step S204, compiling the target algorithm through a back-end compiler according to the chip instruction set.
Optionally, compiling by a back-end compiler according to the chip instruction set to generate a target algorithm, including: compiling according to the chip instruction through back-end compiling to obtain an executable binary file; a target algorithm is generated from the binary file.
Step S205, checking the target algorithm.
Specifically, in this embodiment, after the target algorithm is generated, the target algorithm is checked: it is detected whether the binary file corresponding to the target algorithm contains obvious errors, for example garbled content or syntax errors, and when such an obvious error is found, the target algorithm corresponding to the binary file is determined to be an invalid algorithm, so the check fails.
It should be noted that, when the check is determined to have failed, a check-failure prompt is generated, for example "the current target algorithm is invalid, please adjust". The failure may occur because an optimization procedure of the front-end compiler is invalid, because the obtained optimized forward data flow graph is erroneous, or because the chip instruction obtained by the back-end compiler from the spatial mapping is erroneous, which is not limited in this embodiment. When the check is determined to have failed, the check-failure prompt information is displayed, so that the user can promptly overhaul the front-end compiler, the back-end compiler or the software, further improving the efficiency and accuracy of automatic operator generation.
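One simple form such a check could take is sketched below: the generated binary is inspected for obvious defects before it is accepted. The concrete checks (non-empty file, an expected header) and the file name are illustrative stand-ins for the "garbled content or syntax error" detection described above; the patent does not specify how the binary is inspected.

```python
def verify_target_binary(path, expected_magic=b"\x7fCHIP"):
    """Return (ok, message) for the binary produced by the back-end compiler.

    expected_magic is an illustrative stand-in for whatever header the real
    chip binary format uses; it is not taken from the patent.
    """
    try:
        with open(path, "rb") as f:
            data = f.read()
    except OSError as err:
        return False, f"cannot read binary: {err}"
    if not data:
        return False, "binary file is empty"
    if not data.startswith(expected_magic):
        return False, "binary header is garbled"
    return True, "check passed"

ok, msg = verify_target_binary("target_operator.bin")   # hypothetical file name
if not ok:
    print("current target algorithm is invalid, please adjust:", msg)
```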
According to the present application, the front-end compiler compiles the algorithm primitive language input by the user into an initial forward data flow graph, the initial forward data flow graph is then optimized, a chip instruction set corresponding to the optimized forward data flow graph is obtained, and the target algorithm is automatically generated from the chip instruction set by the back-end compiler, so that an algorithm can be generated automatically on a DMA-based multi-level cache processor device without requiring the user to supply parameters. In addition, when the check is determined to have failed, the check-failure prompt information is displayed, so that the user can promptly overhaul the front-end compiler, the back-end compiler or the software, further improving the efficiency and accuracy of automatic operator generation.
Example III
Fig. 6 is a schematic structural diagram of an operator automatic generating apparatus according to a third embodiment of the present invention, where the apparatus may execute the operator automatic generating method according to the foregoing embodiments. The device can be realized in a software and/or hardware mode, as shown in fig. 6, the automatic operator generating device specifically comprises: an initial forward dataflow graph acquisition module 310, an optimized forward dataflow graph acquisition module 320, a chip instruction set acquisition module 330, and a target algorithm generation module 340.
An initial forward data flow graph obtaining module 310, configured to obtain an initial forward data flow graph by compiling through a front-end compiler according to an algorithm primitive input by a user, where the initial forward data flow graph includes a computation flow node and a data flow node;
the optimized forward data flow graph obtaining module 320 is configured to optimize each node in the initial forward data flow graph according to a specified rule, and obtain the optimized forward data flow graph, where the specified rule includes a node policy flag, node merging and node rearrangement;
a chip instruction set obtaining module 330, configured to obtain a chip instruction set matched with the optimized forward data flow graph, where the chip instruction set includes chip instructions corresponding to nodes in the optimized forward data flow graph;
The target algorithm generating module 340 is configured to generate a target algorithm by compiling through a back-end compiler according to the chip instruction set.
Optionally, the optimizing forward data flow graph obtaining module includes: the first optimized forward data flow graph acquisition unit is used for tensor processing of calculation flow nodes in the initial forward data flow graph and DMA strategy marking of the data flow nodes in the initial forward data flow graph so as to acquire the first optimized forward data flow graph;
the second optimized forward data flow graph acquisition unit is used for carrying out multidimensional fusion on the data flow nodes with association relations in the first optimized forward data flow graph in a DMA (direct memory access) insertion mode, and reserving the calculation flow nodes in the first optimized forward data flow graph so as to acquire the second optimized forward data flow graph;
the third optimized forward data flow graph acquisition unit is used for determining the running time of each node in the second optimized forward data flow graph, and carrying out position rearrangement on the data flow nodes adjacent to the calculation flow nodes according to the running time so as to acquire the third optimized forward data flow graph, wherein the data flow nodes subjected to the position rearrangement are marked by adopting a pre-fetching instruction;
and the fourth optimization forward data flow diagram acquisition unit is used for merging the calculation flow nodes of the designated type in the third optimization forward data flow diagram and reserving the data flow nodes in the third optimization forward data flow diagram so as to acquire the fourth optimization forward data flow diagram.
Optionally, the first optimized forward data flow graph obtaining unit is configured to obtain a pre-configured compiler chip search space, where the compiler chip search space includes a correspondence between each chip type and a storage structure;
determining the type of a current application chip of the adaptive front-end compiler, and determining a target storage structure corresponding to the current application chip by traversing a compiler chip search space, wherein the target storage structure comprises the number of storage levels and the capacity of the storage levels;
acquiring a position level of each data stream node in an initial forward data flow graph, which is positioned on a current application chip, and determining a carrying strategy for the data stream node according to the position level and a target storage structure;
the handling policy is marked on the corresponding data flow node in the initial forward data flow graph for DMA policy marking.
Optionally, the first optimized forward data flow graph obtaining unit is further configured to determine an associated computation flow node having a logical relationship with the data flow node, and obtain attribute information of the associated computation flow node, where the attribute information includes an operand, tensor information and a type;
splitting the lowest storage level in the target storage structure according to the attribute information of the associated calculation flow node to obtain calculation space capacity;
Determining a carrying direction for the data flow node according to the position hierarchy and the number of storage hierarchies in the target storage structure;
determining the carrying times of the data stream node in each carrying direction according to the calculated space capacity and the storage hierarchy capacity in the target storage structure;
the carrying direction and the carrying times are used as carrying strategies for the data flow nodes.
Optionally, the compiler chip search space further includes a space map corresponding to each chip type;
the space mapping comprises a corresponding relation between node types and chip instructions.
Optionally, the chip instruction set obtaining module is configured to extract each node in the fourth optimized forward data flow graph, and construct a node set according to the extracted node, where the node set is labeled with a type of each node;
inquiring space mapping according to the node set to obtain chip instructions corresponding to all nodes in the node set;
and constructing a chip instruction set according to the acquired chip instructions, wherein a back-end compiler supports the chip instruction set.
Optionally, the target algorithm generating module is used for compiling according to the chip instruction through back-end compiling to obtain an executable binary file;
A target algorithm is generated from the binary file.
Example IV
Fig. 7 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. As shown in fig. 7, the computer device includes a processor 610, a memory 620, an input device 630 and an output device 640; the number of processors 610 in the computer device may be one or more, and one processor 610 is taken as an example in fig. 7; the processor 610, memory 620, input device 630 and output device 640 in the computer device may be connected by a bus or other means, a bus connection being taken as an example in fig. 7.
The memory 620 is a computer readable storage medium, and may be used to store a software program, a computer executable program, and modules, such as program instructions/modules corresponding to the automatic operator generating method in the embodiment of the present invention. The processor 610 executes various functional applications of the computer device and data processing, i.e., implements the above-described operator auto-generation method, by running software programs, instructions, and modules stored in the memory 620.
The automatic generation method of the operator is applied to the multi-level cache processor equipment based on DMA and comprises the following steps:
compiling according to an algorithm primitive language input by a user through a front-end compiler to obtain an initial forward data flow graph, wherein the initial forward data flow graph comprises calculation flow nodes and data flow nodes;
Optimizing each node in the initial forward data flow graph according to a specified rule, and obtaining an optimized forward data flow graph, wherein the specified rule comprises a node strategy mark, node merging and node rearrangement;
acquiring a chip instruction set matched with the optimized forward data flow graph, wherein the chip instruction set comprises chip instructions corresponding to all nodes in the optimized forward data flow graph;
and compiling the target algorithm through a back-end compiler according to the chip instruction set.
Memory 620 may include primarily a program storage area and a data storage area, wherein the program storage area may store an operating system, at least one application program required for functionality; the storage data area may store data created according to the use of the terminal, etc. In addition, memory 620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 620 may further include memory remotely located relative to processor 610, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 630 may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the computer device. The output device 640 may include a display device such as a display screen.
Example five
The fifth embodiment of the present invention also provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a method for automatically generating an operator;
the automatic generation method of the operator is applied to the multi-level cache processor equipment based on DMA and comprises the following steps:
compiling according to an algorithm primitive language input by a user through a front-end compiler to obtain an initial forward data flow graph, wherein the initial forward data flow graph comprises calculation flow nodes and data flow nodes;
optimizing each node in the initial forward data flow graph according to a specified rule, and obtaining an optimized forward data flow graph, wherein the specified rule comprises a node strategy mark, node merging and node rearrangement;
acquiring a chip instruction set matched with the optimized forward data flow graph, wherein the chip instruction set comprises chip instructions corresponding to all nodes in the optimized forward data flow graph;
And compiling the target algorithm through a back-end compiler according to the chip instruction set.
Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the above method operations, but may also perform related operations in the operator automatic generation method provided in any embodiment of the present invention.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method of the embodiments of the present invention.
It should be noted that, in the embodiment of the operator automatic generation device, each unit and module included are only divided according to the functional logic, but not limited to the above division, so long as the corresponding function can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (9)

1. An automatic generation method of an operator, which is applied to a multi-level cache processor device based on DMA, comprising:
Compiling according to an algorithm primitive language input by a user through a front-end compiler to obtain an initial forward data flow graph, wherein the initial forward data flow graph comprises calculation flow nodes and data flow nodes;
optimizing each node in the initial forward data flow graph according to a specified rule to obtain an optimized forward data flow graph, wherein the specified rule comprises a node strategy mark, node merging and node rearrangement;
acquiring a chip instruction set matched with the optimized forward data flow graph, wherein the chip instruction set comprises chip instructions corresponding to all nodes in the optimized forward data flow graph;
compiling the target algorithm through a back-end compiler according to the chip instruction set;
optimizing each node in the initial forward data flow graph according to a specified rule to obtain an optimized forward data flow graph, wherein the method comprises the following steps: performing tensor processing on the calculation flow nodes in the initial forward data flow graph, and performing DMA strategy marking on the data flow nodes in the initial forward data flow graph to obtain a first optimized forward data flow graph;
multidimensional fusion is carried out on the data stream nodes with association relations in the first optimized forward data flow graph in a DMA insertion mode, and the calculation stream nodes in the first optimized forward data flow graph are reserved to obtain a second optimized forward data flow graph;
Determining the running time of each node in the second optimized forward data flow graph, and carrying out position rearrangement on the data flow nodes adjacent to the calculation flow nodes according to the running time to obtain a third optimized forward data flow graph, wherein the data flow nodes subjected to position rearrangement are marked by prefetch instructions;
and merging the calculation flow nodes of the designated type in the third optimization forward data flow graph, and reserving the data flow nodes in the third optimization forward data flow graph to obtain a fourth optimization forward data flow graph.
2. The method of claim 1, wherein said DMA policy marking said data flow node in said initial forward data flow graph comprises:
acquiring a pre-configured compiler chip search space, wherein the compiler chip search space comprises corresponding relations between chip types and storage structures;
determining the type of a current application chip adapting to the front-end compiler, and determining a target storage structure corresponding to the current application chip by traversing the compiler chip search space, wherein the target storage structure comprises the number of storage levels and the capacity of the storage levels;
Acquiring a position level of each data flow node in the initial forward data flow graph, which is positioned on the current application chip, and determining a carrying strategy for the data flow node according to the position level and the target storage structure;
and marking the handling strategy on a corresponding data flow node in the initial forward data flow diagram so as to carry out DMA strategy marking.
3. The method of claim 2, wherein said determining a handling policy for the data flow node based on the location hierarchy and the target storage structure comprises:
determining an associated calculation stream node with a logical relationship with the data stream node, and acquiring attribute information of the associated calculation stream node, wherein the attribute information comprises an operand, tensor information and a type;
splitting the lowest storage hierarchy in the target storage structure according to the attribute information of the associated computation flow node to obtain computation space capacity;
determining a direction of conveyance for the data flow node according to the location hierarchy and the number of storage hierarchies in the target storage structure;
determining the number of handling times in each handling direction for the data flow node according to the calculated space capacity and the storage hierarchy capacity in the target storage structure;
And taking the conveying direction and the conveying times as conveying strategies aiming at the data flow nodes.
4. The method of claim 2, wherein the compiler chip search space further includes a spatial map corresponding to each chip type;
the space mapping comprises a corresponding relation between node types and chip instructions.
5. The method of claim 4, wherein the fetching of the chip instruction set that matches the optimized forward dataflow graph includes:
extracting each node in the fourth optimized forward data flow graph, and constructing a node set according to the extracted nodes, wherein the node set is marked with the type of each node;
inquiring the space mapping according to the node set to obtain chip instructions corresponding to all nodes in the node set;
and constructing the chip instruction set according to the acquired chip instructions, wherein the back-end compiler supports the chip instruction set.
6. The method of claim 1, wherein compiling by a back-end compiler according to the chip instruction set to generate a target algorithm comprises:
compiling according to the chip instruction set through the back-end compiler to obtain an executable binary file;
and generating the target algorithm according to the binary file.
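For illustration only, a sketch of this final back-end step follows. The backend_compile callable and the way the binary is wrapped into a callable are purely hypothetical stand-ins; the claim only requires compiling the chip instruction set into an executable binary file and generating the target algorithm from it.

# Illustrative sketch only; backend_compile is a hypothetical stand-in for the back-end compiler.
from pathlib import Path

def generate_target_algorithm(chip_instructions, backend_compile):
    """Compile the chip instruction set into an executable binary file, then wrap
    the binary as a callable (the 'target algorithm')."""
    binary = backend_compile(chip_instructions)    # bytes of the executable binary file
    out_path = Path("operator.bin")
    out_path.write_bytes(binary)

    def target_algorithm(*tensors):
        # A real implementation would hand the binary to the chip runtime here.
        raise NotImplementedError(f"dispatch {out_path} with {len(tensors)} input tensors")

    return target_algorithm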
7. An apparatus for automatically generating an operator, comprising:
an initial forward data flow graph acquisition module, configured to obtain an initial forward data flow graph through compiling by a front-end compiler according to an algorithm primitive input by a user, wherein the initial forward data flow graph comprises calculation flow nodes and data flow nodes;
an optimized forward data flow graph acquisition module, configured to optimize each node in the initial forward data flow graph according to specified rules to obtain an optimized forward data flow graph, wherein the specified rules comprise node strategy marking, node merging and node rearrangement;
a chip instruction set acquisition module, configured to acquire a chip instruction set matched with the optimized forward data flow graph, wherein the chip instruction set comprises chip instructions corresponding to all nodes in the optimized forward data flow graph;
and a target algorithm generation module, configured to compile and generate a target algorithm through a back-end compiler according to the chip instruction set;
wherein the optimized forward data flow graph acquisition module is configured to: perform tensor processing on the calculation flow nodes in the initial forward data flow graph, and perform DMA policy marking on the data flow nodes in the initial forward data flow graph, to obtain a first optimized forward data flow graph;
perform multidimensional fusion on the data flow nodes having association relationships in the first optimized forward data flow graph by means of DMA insertion, while retaining the calculation flow nodes in the first optimized forward data flow graph, to obtain a second optimized forward data flow graph;
determine the running time of each node in the second optimized forward data flow graph, and rearrange the positions of the data flow nodes adjacent to the calculation flow nodes according to the running times, to obtain a third optimized forward data flow graph, wherein the rearranged data flow nodes are marked with prefetch instructions;
and merge the calculation flow nodes of the designated type in the third optimized forward data flow graph, while retaining the data flow nodes in the third optimized forward data flow graph, to obtain a fourth optimized forward data flow graph.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-6 when executing the program.
9. A storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
CN202310744779.7A 2023-06-25 2023-06-25 Operator automatic generation method, device, equipment and medium Active CN116484947B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310744779.7A CN116484947B (en) 2023-06-25 2023-06-25 Operator automatic generation method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310744779.7A CN116484947B (en) 2023-06-25 2023-06-25 Operator automatic generation method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN116484947A (en) 2023-07-25
CN116484947B (en) 2023-09-08

Family

ID=87221799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310744779.7A Active CN116484947B (en) 2023-06-25 2023-06-25 Operator automatic generation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116484947B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116737605B (en) * 2023-08-11 2023-11-14 上海燧原科技有限公司 Data prefetching method, device, equipment and medium based on chip multilevel storage

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5522045A (en) * 1992-03-27 1996-05-28 Panasonic Technologies, Inc. Method for updating value in distributed shared virtual memory among interconnected computer nodes having page table with minimal processor involvement
CN1083644A (en) * 1992-07-15 1994-03-09 金星情报通信株式会社 Global bus network
CN109359732A (en) * 2018-09-30 2019-02-19 阿里巴巴集团控股有限公司 A kind of chip and the data processing method based on it
CN112947932A (en) * 2021-02-24 2021-06-11 上海商汤智能科技有限公司 Method and device for optimizing vectorization in compiling process and electronic equipment
CN114217807A (en) * 2021-04-09 2022-03-22 无锡江南计算技术研究所 Direct memory access compiling optimization method based on heterogeneous many-core architecture
US11467811B1 (en) * 2021-06-24 2022-10-11 Marvell Asia Pte Ltd Method and apparatus for generating metadata by a compiler
CN113283613A (en) * 2021-07-23 2021-08-20 上海燧原科技有限公司 Deep learning model generation method, optimization method, device, equipment and medium
CN116227566A (en) * 2023-03-09 2023-06-06 上海燧原科技有限公司 Calculation graph visualization method, device, equipment and medium applied to AI chip

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Parallel construction algorithm for fuzzy concept lattices based on same-level node set partitioning; Sun Jia; Chai Yumei; 《计算机应用与软件》 (Computer Applications and Software); Vol. 33, No. 07; pp. 261-265, 286 *

Also Published As

Publication number Publication date
CN116484947A (en) 2023-07-25

Similar Documents

Publication Publication Date Title
US9864590B2 (en) Method and system for automated improvement of parallelism in program compilation
CN101652746B (en) Improvements in and relating to floating point operations
US7917899B2 (en) Program development apparatus, method for developing a program, and a computer program product for executing an application for a program development apparatus
CN105550268A (en) Big data process modeling analysis engine
CN106547520B (en) Code path analysis method and device
CN116484947B (en) Operator automatic generation method, device, equipment and medium
CN101751333A (en) Method, computer program and computer system for assisting in analyzing program
KR102013582B1 (en) Apparatus and method for detecting error and determining corresponding position in source code of mixed mode application program source code thereof
CN110866029A (en) sql statement construction method, device, server and readable storage medium
CN114217886A (en) Function calling method, computing device and storage medium
CN112148343A (en) Rule issuing method and device and terminal equipment
EP4258175A1 (en) Node fusion method for computational graph, and device
CN105404611A (en) Matrix model based multi-calculation-engine automatic selection method
CN116560666B (en) AI front end unified computing method, device and medium based on multi-level code generation
CN104750533A (en) C program compiling method and C program compiler
CN113031954A (en) Code compiling method and device, electronic equipment, storage medium and heterogeneous system
US20170075961A1 (en) Irreducible modules
CN115374914B (en) Distributed training method, parallel deep learning framework and electronic equipment
CN112527304A (en) Self-adaptive node fusion compiling optimization method based on heterogeneous platform
CN103793502A (en) Cloud-based heat path optimization method in just-in-time compiler
US20170039044A1 (en) Compiling apparatus and compiling method
JPWO2017204139A1 (en) Data processing apparatus, data processing method, and program recording medium
CN113885877A (en) Compiling method, device, equipment and medium
Sarkar et al. A Hybrid Clone Detection Technique for Estimation of Resource Requirements of a Job
CN113031952A (en) Method and device for determining execution code of deep learning model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room a-522, 188 Yesheng Road, Lingang New District, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201306

Patentee after: Shanghai Suiyuan Technology Co.,Ltd.

Country or region after: China

Address before: Room a-522, 188 Yesheng Road, Lingang New District, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201306

Patentee before: SHANGHAI ENFLAME TECHNOLOGY Co.,Ltd.

Country or region before: China