CN118051261A

CN118051261A - Instruction fusion method, device and storage medium

Info

Publication number: CN118051261A
Application number: CN202211426902.2A
Authority: CN
Inventors: Name withheld at request
Original assignee: Cambricon Technologies Corp Ltd
Current assignee: Cambricon Technologies Corp Ltd
Priority date: 2022-11-15
Filing date: 2022-11-15
Publication date: 2024-05-17

Abstract

Embodiments of the application provide an instruction fusion method, device and storage medium. The method obtains program code to be processed, in which at least two instructions are marked in advance with compiling indication information that indicates whether the marked instructions can be fused; determines the instructions to be fused according to the read-write relationships of the instructions in the program code to be processed and the compiling indication information; and replaces at least two instructions to be fused in the program code to be processed with a target fusion instruction to generate fused target program code. Because fusible instructions are marked through the compiling indication information, the instructions to be fused can be determined accurately from the read-write relationships and the compiling indication information and fused in the compiler, which improves the performance of the stream-oriented computing hardware processor. No API (application programming interface) needs to be encapsulated and the program code does not need to be rewritten, which reduces cost. Instruction fusion is performed on hardware that supports it and is not performed on hardware that does not, which ensures compatibility.

Description

Instruction fusion method, device and storage medium

Technical Field

The embodiment of the application relates to the technical field of artificial intelligence, in particular to an instruction fusion method, an instruction fusion device and a storage medium.

Background

In a stream-oriented computing hardware processor, each stream computing instruction typically reads its input data from a storage unit, performs some processing (e.g., addition or convolution), and then writes the output data back to the storage unit. As stream computing hardware has developed, it has come to support the fused operation of multiple computing instructions, that is, the intermediate data of the fused computing instructions does not need to be written to or read from the storage unit.

In one related art, fused computing instructions are directly encapsulated into an API (Application Programming Interface), such as an API that fuses a multiply instruction and an add instruction into a multiply-accumulate (fma) instruction; in another related art, a compiler fuses computing instructions according to the program code.

In the first related art, encapsulating APIs for every combination of multiple computing instructions is costly, and if APIs are encapsulated for only some combinations, the other combinations are not covered; in the second related art, the program code needs to be rewritten according to the hardware characteristics, which involves a large development workload and raises compatibility problems, and furthermore the compiler may not be able to fuse computing instructions effectively.

Disclosure of Invention

Embodiments of the application provide an instruction fusion method, device and storage medium that reduce the cost of instruction fusion and offer higher flexibility and compatibility.

In a first aspect, an embodiment of the present application provides an instruction fusion method, including:

acquiring program code to be processed, wherein at least two instructions in the program code to be processed are marked in advance with compiling indication information, and the compiling indication information is used to indicate whether the marked instructions can be fused;

determining instructions to be fused according to the read-write relationships of the instructions in the program code to be processed and the compiling indication information; and

replacing at least two instructions to be fused in the program code to be processed with a target fusion instruction to generate fused target program code.

In a second aspect, an embodiment of the present application provides an instruction fusion apparatus, including:

an acquisition module, configured to acquire program code to be processed, wherein at least two instructions in the program code to be processed are marked in advance with compiling indication information, and the compiling indication information is used to indicate whether the marked instructions can be fused;

a determining module, configured to determine instructions to be fused according to the read-write relationships of the instructions in the program code to be processed and the compiling indication information; and

a generating module, configured to replace at least two instructions to be fused in the program code to be processed with a target fusion instruction to generate fused target program code.

In a third aspect, an embodiment of the present application provides an instruction fusion apparatus, including: at least one processor and memory;

the memory stores computer-executable instructions;

the at least one processor executes computer-executable instructions stored by the memory, causing the at least one processor to perform the method as described in the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored which, when executed by at least one processor, implements a method as described in the first aspect.

According to the instruction fusion method, device and storage medium provided by the embodiments of the application, program code to be processed is obtained, in which at least two instructions are marked in advance with compiling indication information that indicates whether the marked instructions can be fused; the instructions to be fused are determined according to the read-write relationships of the instructions in the program code to be processed and the compiling indication information; and at least two instructions to be fused in the program code to be processed are replaced with a target fusion instruction to generate fused target program code. Because fusible instructions are marked through the compiling indication information, the instructions to be fused can be determined accurately and fused in the compiler, which improves the performance of the stream-oriented computing hardware processor. No API (application programming interface) needs to be encapsulated and the program code does not need to be rewritten, which reduces cost. Instruction fusion is readily performed on hardware that supports it and is not performed on hardware that does not, which ensures compatibility.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 is a schematic diagram of an application scenario of an instruction fusion method according to an embodiment of the present application;

FIG. 2 is a flow chart of a method for instruction fusion according to an embodiment of the present application;

FIG. 3 is a flowchart of an instruction fusion method according to another embodiment of the present application;

FIG. 4 is a schematic diagram of a directed bipartite graph according to an embodiment of the present application;

FIG. 5 is a schematic diagram illustrating a split of a bidirectional edge in a directed bipartite graph according to an embodiment of the present application;

FIG. 6 is a flowchart of a method for instruction fusion according to another embodiment of the present application;

FIG. 7 is a schematic diagram of a directed bipartite graph according to another embodiment of the present application;

FIG. 8 is a schematic diagram of an instruction fusion device according to an embodiment of the present application;

FIG. 9 is a schematic diagram of an instruction fusion device according to another embodiment of the present application;

FIG. 10 is a structural diagram of a board according to an embodiment of the present application;

FIG. 11 is a block diagram of a combined processing apparatus according to an embodiment of the present application;

FIG. 12 is a schematic diagram showing the internal structure of a single core computing device according to an embodiment of the present application;

FIG. 13 is a schematic diagram illustrating the internal structure of a multi-core computing device according to an embodiment of the application;

FIG. 14 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present application.

Specific embodiments of the present disclosure have been shown by way of the above drawings and will be described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the disclosed concepts to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.

For a clear understanding of the technical solutions of the present application, the prior art solutions will be described in detail first.

In the first related art, APIs must be encapsulated for the various combinations of multiple computing instructions. For example, assuming that the hardware supports N kinds of operation instructions and supports fusing at most M operation instructions, the number of API interfaces that must be exposed to cover every combination is $N^1 + N^2 + \dots + N^M$. When M = 4 and N = 10, the total number of API interfaces is 11110, which is costly; and if APIs are encapsulated for only some combinations, the other combinations are not covered.
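The count above can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `api_count` is made up for illustration and is not part of the application.

```python
def api_count(n_ops: int, max_fused: int) -> int:
    """Sum N^1 + N^2 + ... + N^M: one API per ordered sequence of
    1..M operations chosen from N operation kinds."""
    return sum(n_ops ** k for k in range(1, max_fused + 1))


# N = 10 operation kinds, at most M = 4 fused operations
print(api_count(10, 4))  # 10 + 100 + 1000 + 10000 = 11110
```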

In the second related art, the program code needs to be rewritten according to the hardware characteristics, which involves a large development workload and raises compatibility problems. For example, suppose that generation-X hardware does not support instruction fusion and the later generation-Y hardware does. If program code developed for generation-X hardware is to use instruction fusion on generation-Y hardware, the program code must be rewritten so that the operation instructions to be fused are written as fused instruction code. Similarly, if program code with instruction fusion developed for generation-Y hardware is to be ported to generation-X hardware, the program code must also be rewritten, because generation-X hardware does not support instruction fusion. It can be seen that instruction fusion optimization is tied to the specific hardware and is costly to implement. The programming language and the compiler need to take compatibility into account, so that the same program code is fused on hardware that supports instruction fusion and is not fused on hardware that does not.

In addition, the compiler may be unable to fuse computing instructions effectively. The result of each operation instruction is written back to some storage space of the storage unit (such as NRAM), and the start address of that storage space is represented by a pointer; in a C-like language, a pointer may have aliases. Hardware instruction fusion actually skips writing the intermediate result to its storage space: the result of the previous operation is used directly as the input of the next operation, and the storage space C of the intermediate result is neither written nor read. Consequently, if a later pointer p refers to storage space C under a different alias, reading storage space C through p yields an unexpected result. If the compiler's static analysis cannot identify pointer aliases accurately, it conservatively declines to fuse, so instruction fusion cannot be carried out to the fullest extent.
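The hazard can be illustrated with a toy simulation in which a Python dictionary stands in for the storage unit. All names here (`run_unfused`, `run_fused`, `alias_p`) and the values are assumptions made for illustration; real hardware and real pointer semantics are more involved.

```python
def run_unfused(mem):
    """C = A + B is written back, then D = C + E reads it."""
    mem["C"] = mem["A"] + mem["B"]
    mem["D"] = mem["C"] + mem["E"]
    return mem


def run_fused(mem):
    """Fused form D = A + B + E: the intermediate C is never written."""
    mem["D"] = mem["A"] + mem["B"] + mem["E"]
    return mem


initial = {"A": 1, "B": 2, "C": 0, "E": 10}
alias_p = "C"  # hypothetical pointer p that aliases storage space C

unfused = run_unfused(dict(initial))
fused = run_fused(dict(initial))

print(unfused["D"], fused["D"])            # 13 13  (same final result)
print(unfused[alias_p], fused[alias_p])    # 3 0    (read through p differs)
```

The final result D is the same either way, but a later read through the alias sees the stale value after fusion, which is why a compiler that cannot rule out the alias has to stay conservative.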

To solve the above technical problems, an embodiment of the present application provides an instruction fusion method. The method obtains program code to be processed, in which at least two instructions are marked in advance with compiling indication information that indicates whether the marked instructions can be fused; determines the instructions to be fused according to the read-write relationships of the instructions in the program code to be processed and the compiling indication information; and replaces at least two instructions to be fused in the program code to be processed with a target fusion instruction to generate fused target program code. Because fusible instructions are marked through the compiling indication information, the instructions to be fused can be determined accurately and fused in the compiler, which improves the performance of the stream-oriented computing hardware processor. No API (application programming interface) needs to be encapsulated and the program code does not need to be rewritten, which reduces cost. Instruction fusion is readily performed on hardware that supports it and is not performed on hardware that does not, which ensures compatibility.

The instruction fusion method provided by the application is applied to the scenario shown in FIG. 1, which includes a compiler and a stream-oriented computing hardware processor. Program code to be processed is input into the compiler; it contains a plurality of instructions, at least two of which are marked in advance with compiling indication information. The compiler determines the instructions to be fused according to the read-write relationships of the instructions in the program code to be processed and the compiling indication information, replaces at least two instructions to be fused with a target fusion instruction to generate fused target program code, and sends the target program code to the stream-oriented computing hardware processor, which executes it. The compiler may run on a CPU, and the stream-oriented computing hardware processor may be an XPU, for example an IPU (Intelligence Processing Unit) or another streaming hardware processor.

The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

FIG. 2 is a flowchart of an instruction fusion method according to an embodiment of the present application. The execution body of this embodiment is a compiler or another electronic device with a compiling function. As shown in FIG. 2, the instruction fusion method provided in this embodiment includes the following steps:

S201, acquiring program code to be processed, wherein at least two instructions in the program code to be processed are marked in advance with compiling indication information, and the compiling indication information is used to indicate whether the marked instructions can be fused.

In this embodiment, the programming language (such as a C-like language) may be extended so that any instruction of the program code can be marked with compiling indication information, which indicates whether the marked instruction is allowed to be fused, for example by adding the compiling indication information before or after the instruction. Alternatively, the compiling indication information may mark a selected piece of code (e.g., an instruction block including a plurality of instructions) in the program code to be processed, to indicate whether the instructions in the selected code can be fused.

Optionally, if a preceding instruction and a subsequent instruction executed after it can be fused into one instruction, the output result of the preceding instruction can be used directly by the subsequent instruction, without reading or writing the storage unit that corresponds to that output result. On this basis, the compiling indication information may mark the storage unit corresponding to the output result of an instruction, and thereby mark the instruction. The compiling indication information may include compiling indication items. A compiling indication item may include a base address (base), indicating that reads and writes of that base address in the storage unit may be skipped; or it may include a base address (base) and a length (length), indicating that reads and writes of the address range [base, base+length] in the storage unit may be skipped and that the pointers of subsequent instructions do not alias that address range. The base address is usually the output address of some instruction, and the length corresponds to the data size of that instruction's operation result. The compiling indication information thus also lets the programmer state whether a pointer has an alias, which effectively avoids the alias problem.
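A compiling indication item of the base-plus-length form can be modelled as a small record with an overlap test. The sketch below is illustrative only: the class name `IndicationItem`, the method `overlaps`, and the choice to treat the range as half-open (base inclusive, base+length exclusive) are assumptions, not details given in the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndicationItem:
    """Hypothetical model of one compiling indication item."""
    base: int        # base address, usually an instruction's output address
    length: int = 1  # size of the result; 1 when only a base is given

    def overlaps(self, addr: int, size: int = 1) -> bool:
        """True if the access [addr, addr+size) touches the marked range.

        Assumption: the marked range is treated as half-open,
        [base, base+length).
        """
        return addr < self.base + self.length and self.base < addr + size


item = IndicationItem(base=0x1000, length=128)
print(item.overlaps(0x1000, 128))  # True: exactly the marked range
print(item.overlaps(0x1080, 4))    # False: starts just past the range
```

A compiler could use such a test to decide whether an instruction's output address falls inside a range whose reads and writes may be skipped.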

Optionally, the programmer may add the compiling indication information to the instructions that are allowed to be fused when writing the program code, for example to a selected piece of code that includes a plurality of instructions; the compiling indication information may also be added to such instructions in other ways. The compiler acquires the program code to be processed, in which at least two instructions are marked in advance with compiling indication information, and can then execute the subsequent steps of the instruction fusion method on it.

Optionally, the instructions in this embodiment may be vector operation instructions, matrix operation instructions, or the like. Parallel processing of data may be implemented by a plurality of operators, each of which may operate on the data in one dimension of a vector, which improves operation performance.

S202, determining the instructions to be fused according to the read-write relationships of the instructions in the program code to be processed and the compiling indication information.

The read-write relationships of the instructions describe which storage unit each instruction reads its input data from and which storage unit its output result is written into. If the output result of a preceding instruction is the input data of a subsequent instruction, the two instructions access the same storage unit. If the compiling indication information marks that storage unit as one whose reads and writes may be skipped, the preceding instruction does not need to write its output result into the storage unit and the subsequent instruction does not need to read it from there. The condition for instruction fusion is then met, and the two instructions can be fused into one instruction, omitting the write and read of the intermediate result. The instructions to be fused can therefore be determined according to the read-write relationships of the instructions and the compiling indication information.

For example, consider the instructions C=A+B; D=C+E; F=D+G. If the compiling indication information is [C,128], reads and writes of the storage unit C are allowed to be skipped. According to the read-write relationships, the instruction C=A+B writes its output result into the storage node corresponding to [C,128], the instruction D=C+E reads data from that storage node, and the instruction F=D+G does not. It can therefore be determined that the instructions to be fused are C=A+B and D=C+E, which yield D=A+B+E after fusion.

More specifically, according to the compiling indication information, the present disclosure may determine the instructions whose storage units are allowed not to be read or written, where the storage unit may be the one corresponding to a source operand of the instruction or the one corresponding to a destination operand of the instruction. A storage unit that is allowed not to be read or written is a storage unit included in the compiling indication information: the base address in the compiling indication information indicates the base address of the storage unit, and the length indicates its size.

Further, the present disclosure may determine the instructions to be fused according to the read-write relationships of the instructions in the program code to be processed and the instructions whose storage units are allowed not to be read or written. If the storage unit corresponding to the output result of a preceding instruction is allowed not to be read or written, and the read-write relationships show that this storage unit is read by only one subsequent instruction, it may be determined that the preceding instruction can be fused with that subsequent instruction, and the two are determined to be instructions to be fused.
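The selection rule of S202 can be sketched as follows. This is a simplified illustration, not the application's implementation: instructions are modelled as (destination, sources) pairs, each storage unit is assumed to be written at most once, and the names `find_fusible_pairs` and `skippable` are hypothetical.

```python
def find_fusible_pairs(instrs, skippable):
    """Return (producer_index, consumer_index) pairs that may be fused.

    instrs:    list of (dest, srcs) in program order
    skippable: storage units marked as 'reads and writes may be skipped'
    """
    pairs = []
    for i, (dest, _srcs) in enumerate(instrs):
        if dest not in skippable:
            continue  # the intermediate result must really be written
        readers = [j for j, (_d, srcs) in enumerate(instrs)
                   if j > i and dest in srcs]
        if len(readers) == 1:  # read by exactly one later instruction
            pairs.append((i, readers[0]))
    return pairs


# C = A + B; D = C + E; F = D + G with indication [C, 128]
program = [("C", ("A", "B")), ("D", ("C", "E")), ("F", ("D", "G"))]
print(find_fusible_pairs(program, {"C"}))  # [(0, 1)]
```

With the single marked unit C, only the first two instructions are selected, which matches the example in which D=A+B+E is obtained after fusion.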

S203, replacing at least two instructions to be fused in the program code to be processed with a target fusion instruction, and generating fused target program code.

In this embodiment, the compiler may generate the target fusion instruction from the at least two instructions to be fused, that is, rewrite the at least two instructions to be fused into one target fusion instruction. For example, if the instructions to be fused are d=a×b and e=d+c, the target fusion instruction is e=a×b+c.
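The replacement step can be sketched for the multiply-then-add case. The instruction encoding as (opcode, destination, sources) tuples, the opcode names `mul`, `add` and `fma`, and the function `fuse_mul_add` are all assumptions for illustration; they are not an instruction set defined by the application.

```python
def fuse_mul_add(instrs, i, j):
    """Replace instrs[i] (mul) and instrs[j] (add) with one fma instruction.

    Assumes instrs[j] reads the result of instrs[i] exactly once.
    """
    (op_i, dest_i, srcs_i), (op_j, dest_j, srcs_j) = instrs[i], instrs[j]
    if (op_i, op_j) != ("mul", "add") or dest_i not in srcs_j:
        raise ValueError("not a fusible mul/add pair")
    addend = next(s for s in srcs_j if s != dest_i)
    fused = ("fma", dest_j, (*srcs_i, addend))
    # Drop the producer, put the fused instruction where the consumer was.
    return [fused if k == j else ins
            for k, ins in enumerate(instrs) if k != i]


# d = a * b; e = d + c  ->  e = a * b + c
program = [("mul", "d", ("a", "b")), ("add", "e", ("d", "c"))]
print(fuse_mul_add(program, 0, 1))  # [('fma', 'e', ('a', 'b', 'c'))]
```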

According to the instruction fusion method provided by this embodiment, program code to be processed is obtained, in which at least two instructions are marked in advance with compiling indication information that indicates whether the marked instructions can be fused; the instructions to be fused are determined according to the read-write relationships of the instructions in the program code to be processed and the compiling indication information; and at least two instructions to be fused in the program code to be processed are replaced with a target fusion instruction to generate fused target program code. Because fusible instructions are marked through the compiling indication information, the instructions to be fused can be determined accurately and fused in the compiler, which improves the performance of the stream-oriented computing hardware processor. No API (application programming interface) needs to be encapsulated and the program code does not need to be rewritten, which reduces cost. Instruction fusion is readily performed on hardware that supports it and is not performed on hardware that does not, which ensures compatibility.

Furthermore, optionally, because of hardware characteristics, not every instruction is allowed to be fused. On the basis of the above embodiment, hardware information can therefore be taken into account to further screen the instructions to be fused. Specifically, the set of instructions for which the hardware supports fusion can be obtained from the hardware information; this instruction set may include the basic instructions for which the hardware supports fusion. Each instruction in the program code to be processed is then matched against the instruction set: if an instruction is contained in the instruction set, it is determined that the instruction is allowed to be fused; if it is not, it is determined that the instruction is not allowed to be fused.
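The hardware screening can be sketched as a set-membership test. The table `FUSION_SUPPORT`, the hardware names `gen_x` and `gen_y`, and the opcode names are invented for illustration; a real compiler would obtain this information from its target description.

```python
# Hypothetical hardware information: opcodes each target can fuse.
FUSION_SUPPORT = {
    "gen_x": frozenset(),                # no instruction fusion at all
    "gen_y": frozenset({"add", "mul"}),  # fusion of add and mul only
}


def fusion_allowed(instrs, hardware):
    """For each (opcode, dest, srcs) instruction, report whether the
    target hardware allows it to take part in fusion."""
    supported = FUSION_SUPPORT[hardware]
    return [op in supported for op, _dest, _srcs in instrs]


program = [("mul", "d", ("a", "b")),
           ("add", "e", ("d", "c")),
           ("conv", "f", ("e", "w"))]
print(fusion_allowed(program, "gen_y"))  # [True, True, False]
print(fusion_allowed(program, "gen_x"))  # [False, False, False]
```

Because the same marked program yields an empty candidate set on the target without fusion support, the source code itself does not need to change between targets.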

On the basis of any of the above embodiments, the compiling indication information may include information on one or more storage units whose reads and writes are allowed to be skipped, and different compiling indication information may lead to different fusion results. For example, consider the instructions C=A+B; D=C+E; F=D+G. If there is one piece of compiling indication information, [C,128], the instructions to be fused are C=A+B and D=C+E, and the fused program is D=A+B+E; F=D+G. If there are two pieces of compiling indication information, [C,128] and [D,128], the instructions to be fused are C=A+B, D=C+E and F=D+G, and the fused program is F=A+B+E+G. The more compiling indication information there is, the more fusion opportunities exist and the better the performance; how much compiling indication information is provided depends on the actual program code.
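The effect of one versus two indications can be reproduced with a small sketch that folds a marked intermediate into its single reader. It handles only chains of additions, models instructions as (destination, sources) pairs, and assumes each unit is written once; the name `fuse_chain` is hypothetical.

```python
def fuse_chain(instrs, skippable):
    """Fold each marked, singly-read intermediate into its reader.

    instrs: list of (dest, srcs) additions in program order.
    Returns the remaining instructions with expanded source lists.
    """
    out = []
    folded = {}  # dest -> expanded sources of an instruction folded away
    for idx, (dest, srcs) in enumerate(instrs):
        expanded = []
        for s in srcs:
            expanded.extend(folded.get(s, [s]))
        n_readers = sum(dest in later for _d, later in instrs[idx + 1:])
        if dest in skippable and n_readers == 1:
            folded[dest] = expanded  # result stays inside the fused op
        else:
            out.append((dest, expanded))
    return out


program = [("C", ["A", "B"]), ("D", ["C", "E"]), ("F", ["D", "G"])]
print(fuse_chain(program, {"C"}))
# [('D', ['A', 'B', 'E']), ('F', ['D', 'G'])]
print(fuse_chain(program, {"C", "D"}))
# [('F', ['A', 'B', 'E', 'G'])]
```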

On the basis of any of the above embodiments, as shown in FIG. 3, determining the instructions to be fused according to the read-write relationships of the instructions in the program code to be processed and the compiling indication information may specifically include:

S301, constructing a directed bipartite graph according to the read-write relationship of each instruction in the program code to be processed, wherein the directed bipartite graph comprises an operation node for each instruction, storage unit nodes, and directed edges representing the read-write relationships;

S302, configuring a fusion attribute for the relevant storage unit nodes in the directed bipartite graph according to the compiling indication information, wherein the fusion attribute indicates that reads and writes of the storage unit node are allowed to be skipped;

S303, determining at least two instructions to be fused according to the directed bipartite graph and the fusion attributes of the relevant nodes in it.

In this embodiment, an instruction generally includes an operation code and an operation field. The operation code indicates the function of the instruction, so the type of operation an instruction performs can be determined by identifying its operation code. The operation field indicates the data information of the instruction, which may be an immediate value, a data address, or the like; for example, the operation field may include the input data address and the output data address of the instruction. The compiler analyzes the program code to be processed and builds a directed bipartite graph based on the read-write relationship of each instruction. The directed bipartite graph contains two groups of nodes: the operation nodes, which correspond to the operation codes of the instructions, and the storage unit nodes, which correspond to the operation fields (operands) of the instructions. Directed edges identify the read-write relationships between operation nodes and storage unit nodes, and every directed edge crosses between the two groups. An outgoing edge of a storage unit node (pointing from the storage unit node to an operation node) indicates that the operation node reads the data of the storage unit node; an incoming edge of a storage unit node (pointing from an operation node to the storage unit node) indicates that the output data of the operation node is written into the storage unit node.
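The graph construction of S301 can be sketched as follows. Instructions are modelled as (operation node, destination unit, source units) tuples, and the names `build_bipartite_graph`, `readers` and `writers` are assumptions for illustration; the application does not prescribe a data structure.

```python
from collections import defaultdict


def build_bipartite_graph(instrs):
    """Build the directed bipartite graph of a straight-line program.

    Returns (edges, readers, writers):
      edges   - (tail, head) pairs, always between the two node groups
      readers - storage unit -> operation nodes reading it (out-edges)
      writers - storage unit -> operation nodes writing it (in-edges)
    """
    edges = []
    readers = defaultdict(list)
    writers = defaultdict(list)
    for op, dest, srcs in instrs:
        for src in srcs:
            edges.append((src, op))   # storage unit -> operation: read
            readers[src].append(op)
        edges.append((op, dest))      # operation -> storage unit: write
        writers[dest].append(op)
    return edges, dict(readers), dict(writers)


# c = a + b; d = c + e, with one operation node per instruction
program = [("add0", "c", ("a", "b")), ("add1", "d", ("c", "e"))]
edges, readers, writers = build_bipartite_graph(program)
print(readers["c"], writers["c"])  # ['add1'] ['add0']
```

The out-degree of a storage unit node is the length of its `readers` list, which is the quantity a fusion decision needs when it checks how many operations read an intermediate result.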

After the directed bipartite graph is constructed, the fusion attribute may be configured for the relevant storage unit nodes in the directed bipartite graph according to the compiling indication information. The relevant storage unit nodes are the storage unit nodes named in the compiling indication information, and the fusion attribute indicates that reads and writes of a storage unit node are allowed to be skipped; that is, the fusion attribute is configured for every storage unit node marked by the compiling indication information. Optionally, configuring the fusion attribute may consist of distinguishing, by color or in some other manner, the storage unit nodes whose reads and writes may be skipped from those whose reads and writes may not.

Taking the following program code as an example:

#pragma try-fusion(Buf2,Buf3,Buf5,Buf6,Buf7)

Buf3=A(Buf1);

Buf4=B(Buf1);

Buf6=C(Buf3,Buf4);

Buf7=D(Buf6);

As can be seen from the program code, the compiling indication information #pragma try-fusion(Buf2,Buf3,Buf5,Buf6,Buf7) indicates that reads and writes of Buf2, Buf3, Buf5, Buf6 and Buf7 are allowed to be skipped. Therefore, when the directed bipartite graph is constructed as shown in FIG. 4, the fusion attribute can be configured for the storage unit nodes Buf2, Buf3, Buf5, Buf6 and Buf7; in FIG. 4 these nodes are shown in gray to indicate that the fusion attribute is configured. The instructions to be fused are then determined according to the directed bipartite graph and the fusion attributes of the relevant nodes.
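How the marked buffers might be read out of the source text and attached to storage unit nodes can be sketched with two regular expressions. This is a toy parser written for the example above only; the patterns, the function name `fusion_attributes`, and the ASCII spelling of the statements are assumptions, and a real compiler front end would handle the pragma during parsing.

```python
import re

PRAGMA_RE = re.compile(r"#pragma\s+try-fusion\s*\(([^)]*)\)")
STMT_RE = re.compile(r"(\w+)\s*=\s*(\w+)\(([^)]*)\)\s*;")

source = """
#pragma try-fusion(Buf2,Buf3,Buf5,Buf6,Buf7)
Buf3 = A(Buf1);
Buf4 = B(Buf1);
Buf6 = C(Buf3,Buf4);
Buf7 = D(Buf6);
"""


def fusion_attributes(text):
    """Map each storage unit node used in the statements to True if the
    pragma marks it as 'reads and writes may be skipped'."""
    marked = set()
    for m in PRAGMA_RE.finditer(text):
        marked.update(n.strip() for n in m.group(1).split(",") if n.strip())
    nodes = set()
    for dest, _op, args in STMT_RE.findall(text):
        nodes.add(dest)
        nodes.update(a.strip() for a in args.split(",") if a.strip())
    return {node: node in marked for node in sorted(nodes)}


print(fusion_attributes(source))
# {'Buf1': False, 'Buf3': True, 'Buf4': False, 'Buf6': True, 'Buf7': True}
```

Buf2 and Buf5 are named in the pragma but do not occur in the four statements listed, so they do not appear as nodes in this sketch.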

On the basis of any of the above embodiments, when the fusion attribute is configured for the relevant storage unit nodes in the directed bipartite graph according to the compiling indication information, whether the compiling indication information is valid may further be determined. For valid compiling indication information, the fusion attribute is configured for the relevant storage unit node; for invalid compiling indication information, the fusion attribute is not configured for the relevant storage unit node. Whether the compiling indication information is invalid may be determined according to whether the execution performance after instruction fusion is better than the independent execution performance, or according to whether the relevant storage unit node is read by a plurality of operation nodes, as described in the following embodiments.

In an optional embodiment, when configuring the fusion attribute for the relevant storage unit node in the directed bipartite graph according to the compiling indication information, the method specifically further includes:

judging whether the execution performance of at least two fused instructions is better than the independent execution performance;

If the execution performance after fusion is better than the independent execution performance, configuring fusion attributes for the storage unit nodes; or if the fused execution performance is not superior to the independent execution performance, the configuration of the fusion attribute to the storage unit node is ignored.

In this embodiment, although the compiling indication information indicates that some storage unit nodes are allowed not to be read and written, that is, the instructions related to these storage unit nodes are allowed to be fused, the execution performance after fusion may not be better than the independent execution performance of the at least two instructions before fusion. For example, the instructions to be fused are a target instruction and a next instruction executed after it, and the output result of the target instruction is the input data of the next instruction. If the execution time of the fused target instruction and next instruction is longer than the sum of the execution time of the target instruction executed independently and the execution time of the next instruction executed independently, fusing the target instruction with the next instruction is not recommended. Therefore, the compiler determines whether the execution performance of the target instruction fused with the next instruction is better than the independent execution performance. If it is better, the fusion attribute may be configured for the storage unit node storing the output result of the target instruction; if it is not better, the compiling indication information is determined to be invalid, and the fusion attribute is not configured for that storage unit node. Through this process, the performance of the stream-oriented computing hardware processor can be effectively improved after instruction fusion.
The compiler can judge whether the execution performance of the target instruction fused with the next instruction is better than the independent execution performance based on the hardware architecture information, for example, if the fusion performance of the floating point operation instruction in the nth generation hardware architecture is not better than that of the independent instruction, the floating point operation instruction is not fused; or whether the execution performance of the target instruction after being fused with the next instruction is better than the independent execution performance is judged based on the information such as the pre-configuration information and/or the historical data (the execution performance related data).

In another optional embodiment, when configuring the fusion attribute for the relevant storage unit node in the directed bipartite graph according to the compiling instruction information, the method further includes:

judging whether the storage unit node is read by a plurality of operation nodes or not;

If the storage unit node is determined to be read by only one operation node, the fusion attribute is configured for the storage unit node; or if the storage unit node is determined to be read by a plurality of operation nodes, no fusion attribute is configured for the storage unit node.

In this embodiment, when instructions are fused, the intermediate data of the fused instructions is not written to or read from the storage unit. If the intermediate data of the fused instructions also needs to be read by other instructions that are not part of the fusion, the instructions cannot be fused; in that case, even if the storage unit node is a storage unit identified in the compiling indication information, the fusion attribute is not configured for it. Therefore, in this embodiment, it is determined whether the storage unit node storing the output result of an instruction is read by a plurality of operation nodes. If the storage unit node is read by only one operation node, the intermediate data need not be read or written, so the fusion attribute is configured for the storage unit node. If the storage unit node is read by a plurality of operation nodes, no fusion attribute is configured for it. For example, for D=C+E; F=D+G; H=D+I, the storage unit node D is read by a plurality of operation nodes. If the compiling indication information is [D,128] and instruction fusion is performed, the intermediate data is not written into the storage unit node D, so some operation nodes (e.g. the operation node H=D+I) cannot read the intermediate data, which may cause an operation error. The storage unit node D must therefore be read and written, so the fusion attribute is not configured for it, and the compiling indication information can be considered invalid with respect to the storage unit node D.

If the storage unit node is determined to be read by only one operation node, the fusion attribute is configured for the storage unit node as identified by the compiling indication information. For example, for D=C+E; F=D+G, the storage unit node D is read only by the single operation node that follows it, so the two instructions satisfy the condition for instruction fusion and the information identified in the compiling indication information can be considered valid. The compiler can therefore configure the fusion attribute for the storage unit node D according to the compiling indication information.
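The reader-count check in the two examples above can be illustrated with a short sketch. The tuple encoding `(output, op_name, inputs)` and the helper names are assumptions made for illustration. The text does not say how a buffer with no reader at all should be treated, so this sketch conservatively keeps only buffers with exactly one reader.

```python
def readers_of(instructions, buf):
    """Operation nodes that have an incoming edge from storage node `buf`."""
    return [op for _, op, inputs in instructions if buf in inputs]

def valid_fusion_buffers(instructions, pragma_buffers):
    """Keep only pragma-listed buffers read by exactly one operation node (assumed rule)."""
    return {b for b in pragma_buffers if len(readers_of(instructions, b)) == 1}

# D=C+E; F=D+G; H=D+I  -> D has two readers, so the mark on D is treated as invalid.
three_ops = [
    ("D", "add1", ["C", "E"]),
    ("F", "add2", ["D", "G"]),
    ("H", "add3", ["D", "I"]),
]
# D=C+E; F=D+G  -> D has a single reader, so the mark on D is kept.
two_ops = three_ops[:2]

invalid_case = valid_fusion_buffers(three_ops, ["D"])
valid_case = valid_fusion_buffers(two_ops, ["D"])
```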

It should be noted that the program code to be processed includes at least one basic block, each basic block includes at least one instruction, and no branch (for example if, else) exists within a basic block. The scope of the compiling indication information in this embodiment may therefore be one basic block, or a plurality of adjacent basic blocks that are single-in and single-out (forming a single-in single-out region in which no branch exists). In other words, instructions that can be fused with each other can only come from one basic block or from one single-in single-out region. Therefore, when the directed bipartite graph is constructed according to the read-write relationship of each instruction in the program code to be processed, it may be constructed from the read-write relationship of each instruction in any one basic block of the program code to be processed; or it may be constructed from the read-write relationship of each instruction in a plurality of adjacent basic blocks of the program code to be processed, wherein the plurality of adjacent basic blocks are single-in and single-out basic blocks.

On the basis of any of the above embodiments, after the directed bipartite graph is constructed, considering that the directed bipartite graph may contain a bidirectional edge or a directed ring, which affects the determination of the instructions to be fused, the directed bipartite graph can be optimized to obtain an optimized directed bipartite graph, so that instruction fusion judgment is performed according to the optimized directed bipartite graph.

Optionally, optimizing the directed bipartite graph for the bidirectional edge specifically may include:

Judging whether a bidirectional edge exists in the directed bipartite graph or not;

If the bidirectional edge exists, the storage unit node connected with the bidirectional edge is replaced by two virtual storage unit nodes, and the optimized directed bipartite graph is obtained, so that instruction fusion is carried out according to the optimized directed bipartite graph. One virtual storage unit node is connected with the directed edge corresponding to the outgoing degree, and the other virtual storage unit node is connected with the directed edge corresponding to the incoming degree.

In this embodiment, an in-place (in-situ) operation of a single instruction produces a bidirectional edge in the directed bipartite graph. For example, the instruction C=C+1 reads data from the storage unit node C, adds 1, and writes the result back to the storage unit node C, so a bidirectional edge exists between the storage unit node C and the operation node in the directed bipartite graph. Therefore, whether a bidirectional edge exists in the directed bipartite graph can be determined by identifying whether an in-place operation exists in the program code to be processed; alternatively, the bidirectional edge can be identified directly from the directed bipartite graph.

For a bidirectional edge, the semantics are that the data is read first and then written. Therefore, the storage unit node connected by the bidirectional edge can be split into two virtual storage unit nodes. As shown in fig. 5, Buffer1 is split into two virtual buffers, Buffer11 and Buffer12, where Buffer11 is used for reading data and Buffer12 is used for writing data. The out-degree edge of Buffer11 and the in-degree edge of Buffer12 are connected to the original operation node A; that is, the operation node A changes from reading data from Buffer1 to reading data from Buffer11, and from writing data to Buffer1 to writing data to Buffer12. In this way, one bidirectional edge is split into two unidirectional edges, at the cost of adding one virtual buffer to the directed bipartite graph. For example, for C=C+1; D=C+2, the storage unit node C is first read and then written, that is, a bidirectional edge exists. Splitting C into C₁ and C₂ gives C₂=C₁+1; D=C₂+2, so that the directed bipartite graph no longer contains the bidirectional edge and the bidirectional edge no longer affects the determination of the instructions to be fused.

Further, when configuring the fusion attribute for the storage unit nodes according to the optimized directed bipartite graph, the compiler can judge whether the storage unit node storing the output result of an instruction is read by a plurality of operation nodes. If the storage unit node is determined to be read by only one operation node, the fusion attribute is configured for the storage unit node; if the storage unit node is determined to be read by a plurality of operation nodes, no fusion attribute is configured for the storage unit node.

Continuing the above example, for C=C+1; D=C+2; E=C×F, the storage unit node C is first read and then written, that is, a bidirectional edge exists. C is split into C₁ and C₂, and the code may be expressed as C₂=C₁+1; D=C₂+2; E=C₂×F. At this time, since the storage unit node C₂ is read by two operation nodes, the fusion attribute is not configured for C₂ even though the storage unit C is identified in the compiling indication information as allowed not to be read and written.
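The splitting of a bidirectional edge into two virtual storage unit nodes can be sketched as a renaming pass over the instruction list. This is only an illustration under stated assumptions: the `(output, op_name, inputs)` encoding, the `_1`/`_2` naming and the function name are hypothetical, and the sketch assumes each storage unit is the target of at most one in-place operation.

```python
def split_bidirectional(instructions):
    """Rename in-place storage X into a read-side X_1 and a write-side X_2."""
    current = {}  # original storage name -> virtual name that later reads should use
    result = []
    for out, op, inputs in instructions:
        renamed = [current.get(name, name) for name in inputs]
        new_out = out
        if out in inputs:
            # Bidirectional edge: the op both reads and writes `out`.
            read_name, write_name = f"{out}_1", f"{out}_2"
            renamed = [read_name if name == out else r
                       for name, r in zip(inputs, renamed)]
            current[out] = write_name
            new_out = write_name
        else:
            # An ordinary write makes the original name current again.
            current.pop(out, None)
        result.append((new_out, op, renamed))
    return result

# C=C+1; D=C+2; E=C*F  becomes  C_2=C_1+1; D=C_2+2; E=C_2*F
program = [("C", "inc", ["C"]), ("D", "add2", ["C"]), ("E", "mul", ["C", "F"])]
split = split_bidirectional(program)
```

After the split, C_2 has two readers (`add2` and `mul`), which matches the reason given above for not configuring the fusion attribute on C₂.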

Optionally, the optimizing the directed bipartite graph with respect to the directed ring may specifically include:

judging whether a directed ring exists in the directed bipartite graph or not;

If the directed ring exists, identifying the last operation node in the directed ring according to the instruction execution sequence; and if the storage unit node written by the last operation node is allowed not to be read and written, determining that the last operation node can be fused.

In this embodiment, reuse of storage space between instructions may cause a directed ring to exist in the directed bipartite graph; when in-situ operation is not supported, the directed bipartite graph must be free of rings. First, it can be determined whether a directed ring exists in the directed bipartite graph. For example, for B=op1(A); C=op2(B); A=op3(C), the operation node op1 reads data from the storage unit node A, and the operation node op3 writes its output data into the storage unit node A, forming a directed ring. When judging whether a directed ring exists in the directed bipartite graph, optionally, the operation nodes are numbered sequentially according to the instruction execution sequence (namely, the sequence in which the instructions are executed during stream processing, or the sequence in which the instructions are generated in the program code), and the directed bipartite graph is traversed in the order of the operation node numbers. If the number of the write operation node of the currently traversed storage unit node is larger than the number of its read operation node, a directed ring exists in the directed bipartite graph; in this way, whether a directed ring exists can be accurately identified. In the above example, when the traversal reaches the storage unit node A, the number of the operation node op3 that writes the storage unit node A is larger than the number of the operation node op1 that reads data from the storage unit node A, which indicates that a directed ring exists. After it is determined that a directed ring exists in the directed bipartite graph, the last operation node in the directed ring may be identified; in the above example, the last operation node in the directed ring is identified as op3 according to the instruction execution sequence.
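The numbering criterion described above can be sketched as follows. The code is a hypothetical illustration that applies only the stated rule (a storage unit node whose writer has a larger number than one of its readers); the tuple encoding and the function name are assumptions, and the sketch expects bidirectional edges to have been split beforehand.

```python
def find_ring_closers(instructions):
    """Return (storage, writer_op) pairs where the writer is numbered after a reader.

    instructions: list of (output, op_name, inputs) in execution order;
    the list index serves as the operation node number.
    """
    first_reader = {}  # storage name -> smallest number of an op that reads it
    closers = []
    for number, (out, op, inputs) in enumerate(instructions):
        for name in inputs:
            first_reader.setdefault(name, number)
        # Strict comparison: an op that reads and writes the same storage
        # (a bidirectional edge) is not reported as a ring here.
        if out in first_reader and first_reader[out] < number:
            closers.append((out, op))
    return closers

# B=op1(A); C=op2(B); A=op3(C): op3 (number 2) writes A, which op1 (number 0) reads.
ring_program = [("B", "op1", ["A"]), ("C", "op2", ["B"]), ("A", "op3", ["C"])]
closers = find_ring_closers(ring_program)
```

The writer reported for each storage unit node is the last operation node of the corresponding ring in execution order, op3 in the example.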

Optionally, the last operation node in the directed ring may be fused with any other instruction. For example, for C=A+B; D=C-E; A=D×F; G=A+H, A→C→D→A forms a directed ring, and the last operation in the directed ring is A=D×F. If the compiling indication information identifies that the storage unit node A is allowed not to be read and written, the last operation node of the directed ring can be fused with other instructions. In this embodiment, the storage unit node at the closing position of the directed ring (that is, the storage unit node into which the last operation node writes its output, whose write operation node has a number greater than that of its read operation node) may be split into two virtual storage unit nodes, one used for reading data and the other used for writing data. In the above example, the storage unit node A is split into A₁ and A₂, giving C=A₁+B; D=C-E; A₂=D×F; G=A₂+H. If the compiling indication information identifies that the storage unit node A is allowed not to be read and written, the instructions are fused into G=(A₁+B-E)×F+H. In this way, the directed bipartite graph no longer contains the directed ring, and the directed ring no longer affects the determination of the instructions to be fused.

Optionally, the last operation node in the directed ring may not be fused with any other instruction, and the last operation node in the directed ring may be marked to indicate that the instruction corresponding to the last operation node is not fused with any instruction, so as to avoid the influence of the directed ring on determining the instruction to be fused.

Further, when the compiler configures the fusion attribute for the storage unit nodes according to the optimized directed bipartite graph, if the compiling indication information identifies that the storage unit node written by the last operation node in the directed ring is allowed not to be read and written, the fusion attribute is configured for that storage unit node. Similarly, when configuring the fusion attribute according to the optimized directed bipartite graph, the compiler can judge whether the storage unit node storing the output result of an instruction is read by a plurality of operation nodes. If the storage unit node is determined to be read by only one operation node, the fusion attribute is configured for the storage unit node; if the storage unit node is determined to be read by a plurality of operation nodes, no fusion attribute is configured for the storage unit node.

Based on any of the foregoing embodiments, after the compiler configures the fusion attribute for the relevant storage unit nodes based on the optimized directed bipartite graph, the compiler may determine at least two instructions to be fused according to the directed bipartite graph and the fusion attribute of the relevant nodes therein. As shown in fig. 6, this may specifically include:

S401, searching at least one directed path from the directed bipartite graph, wherein the directed path is composed of at least one target node configured with fusion attribute and an operation node connected with the target node through a directed edge;

S402, determining at least two instructions to be fused according to the at least one directed path, and generating a target fusion instruction according to the instructions to be fused.

In this embodiment, the directed bipartite graph may be traversed to search for at least one directed path, where a directed path is formed by at least one target node (a storage unit node configured with the fusion attribute) and the operation nodes connected to the target node through directed edges. For example, a directed path may be formed by one storage unit node configured with the fusion attribute and the two operation nodes connected to it through directed edges. Because this storage unit node need not be read or written, the output data of the preceding operation node is passed directly to the following operation node without being written into the storage unit node; that is, the instructions corresponding to the two operation nodes are instructions to be fused. As another example, a directed path may be formed by two storage unit nodes configured with the fusion attribute and the three operation nodes connected to them through directed edges. In this case, the output data of the first operation node is passed directly to the second operation node without being written into a storage unit node, and the output data of the second operation node is passed directly to the third operation node without being written into a storage unit node; that is, the instructions corresponding to the three operation nodes are instructions to be fused.

After the at least two instructions to be fused are determined, the target fusion instruction is generated from them; that is, the at least two instructions to be fused are written as one instruction. Compared with executing the at least two instructions separately, the intermediate results of the target fusion instruction are not written to or read from the storage unit.

The compiler may search for a directed path in the directed bipartite graph; for example, the directed path Buf1→A→Buf3→C→Buf6→D is found in the directed bipartite graph shown in fig. 4. Buf3 and Buf6 therefore need not be read or written, and the operations A, C and D may be fused together. That is, Buf3=A(Buf1); Buf6=C(Buf3,Buf4); Buf7=D(Buf6) are determined as the instructions to be fused, and the target fusion instruction Buf7=D(C(A(Buf1),Buf4)) is generated, whose intermediate data is neither written to nor read from the storage unit.

The program code thus becomes eventually:

//Buf4＝B(Buf1)

//Buf7＝D(C(A(Buf1),Buf4))

The directed bipartite graph at this time can be shown in fig. 7.
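The rewriting of the example program can be reproduced with a small sketch. It assumes a hypothetical `(output, op_name, inputs)` encoding and a non-empty chain of operation names that has already been selected; `fuse_chain` and `rewrite` are illustrative helpers, not functions of any real compiler.

```python
def fuse_chain(instructions, chain):
    """Inline each instruction of `chain` (op names in path order) into the next one."""
    by_op = {op: (out, inputs) for out, op, inputs in instructions}
    expr = {}   # storage name -> expression that replaces the skipped buffer
    out = None
    for op in chain:
        out, inputs = by_op[op]
        args = ",".join(expr.get(name, name) for name in inputs)
        expr = {out: f"{op}({args})"}
    return f"{out}={expr[out]}"

def rewrite(instructions, chain):
    """Drop the chained instructions and emit the fused one where the last of them stood."""
    fused = fuse_chain(instructions, chain)
    result = []
    for out, op, inputs in instructions:
        if op == chain[-1]:
            result.append(fused)
        elif op not in chain:
            result.append(f"{out}={op}({','.join(inputs)})")
    return result

program = [
    ("Buf3", "A", ["Buf1"]),
    ("Buf4", "B", ["Buf1"]),
    ("Buf6", "C", ["Buf3", "Buf4"]),
    ("Buf7", "D", ["Buf6"]),
]
fused_program = rewrite(program, ["A", "C", "D"])
```

Placing the fused instruction at the position of the last chained instruction keeps it after `Buf4=B(Buf1)`, whose result it consumes; this placement is a design choice of the sketch rather than something the text prescribes.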

Optionally, when searching for a directed path in the directed bipartite graph, a depth-first search may be used: each possible branch path is traversed as deep as possible until it cannot go any deeper, and each node is visited only once. Depth-first traversal allows directed paths to be searched and constructed quickly. During the depth-first search, the operation nodes that have been added to a directed path are marked with a specified identifier, so that only operation nodes without the specified identifier are traversed when the next directed path is searched and constructed.
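A depth-first search with marking might look like the following sketch. It is a simplified illustration, assuming the `(output, op_name, inputs)` encoding, a set of storage names that already carry a validated fusion attribute, and a graph from which rings have been removed; for brevity it does not memoize, so a branch may be re-explored within one search.

```python
def search_paths(instructions, fusible):
    """Return lists of operation names linked through fusible storage nodes."""
    writer = {out: op for out, op, _ in instructions}
    # successors[op]: ops that read a fusible storage node written by op
    successors = {op: [] for _, op, _ in instructions}
    for _, op, inputs in instructions:
        for name in inputs:
            if name in fusible and name in writer:
                successors[writer[name]].append(op)

    marked = set()  # operation nodes already placed on a directed path

    def deepest_from(op):
        # Depth-first: follow every unmarked branch to its end, keep the deepest.
        best = [op]
        for nxt in successors[op]:
            if nxt not in marked:
                candidate = [op] + deepest_from(nxt)
                if len(candidate) > len(best):
                    best = candidate
        return best

    paths = []
    for _, op, _ in instructions:
        if op in marked:
            continue
        path = deepest_from(op)
        if len(path) >= 2:        # a single op has nothing to fuse with
            marked.update(path)   # the "specified identifier" from the text
            paths.append(path)
    return paths

program = [
    ("Buf3", "A", ["Buf1"]),
    ("Buf4", "B", ["Buf1"]),
    ("Buf6", "C", ["Buf3", "Buf4"]),
    ("Buf7", "D", ["Buf6"]),
]
paths = search_paths(program, {"Buf3", "Buf6", "Buf7"})
```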

On the basis of the above embodiment, considering that one or more directed paths may be searched when searching the directed paths from the directed bipartite graph, when determining at least two instructions to be fused according to the at least one directed path, the method may include:

if only one directed path exists, determining the instruction corresponding to each operation node in the directed path as an instruction to be fused; or alternatively

If a plurality of directed paths exist, determining the longest directed path from the plurality of directed paths, and determining the instruction corresponding to each operation node in the longest directed path as the instruction to be fused.

In this embodiment, if only one directed path is searched, the instruction corresponding to each operation node in the directed path is directly determined as the instruction to be fused; if multiple directed paths are searched, the longest directed path may be determined from the multiple directed paths, for example, in the directed bipartite graph of fig. 4, the searched directed paths include:

Buf1→A→Buf3→C；

Buf3→C→Buf6→D；

Buf1→A→Buf3→C→Buf6→D；

the directional path Buf1→A→Buf3→C→Buf6→D is the longest directional path.

Of course, when the final directed path is selected from the plurality of directed paths, the longest directed path need not be adopted. For example, the compiler may also evaluate the execution performance of each directed path after instruction fusion and select the directed path with the best execution performance; or other strategies may be used for the selection, which are not enumerated here.

Of course, instructions cannot be fused without limit in the stream-oriented computing hardware processor. An upper limit on the number of fused instructions is generally set, and exceeding it may cause performance degradation or other consequences. Correspondingly, a preset length threshold exists for the length of a directed path. When a directed path is obtained, if its length exceeds the preset length threshold, the directed path is intercepted (that is, truncated) according to the preset length threshold so that the length of the intercepted directed path does not exceed the threshold, and the instructions corresponding to the operation nodes in the intercepted directed path are then determined as the instructions to be fused. The preset length threshold may be the maximum number of instructions that the hardware processor supports fusing.

The compiler may select the intercepting mode when intercepting the directed path, for example taking the front part, the middle part, or the rear part of the directed path. Alternatively, the compiler may determine which intercepting mode yields the directed path with the best execution performance after instruction fusion and select that mode, or other strategies may be used for intercepting, which are not enumerated here.
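The three intercepting modes mentioned above can be sketched as simple slicing. The function name, the mode strings and the choice to measure path length in operation nodes are assumptions made for illustration.

```python
def truncate_path(path, max_len, mode="front"):
    """Cut `path` (a list of operation names) down to at most `max_len` entries."""
    if len(path) <= max_len:
        return list(path)
    if mode == "front":
        return path[:max_len]
    if mode == "back":
        # Computed from the front so that max_len == 0 yields an empty list.
        return path[len(path) - max_len:]
    if mode == "middle":
        start = (len(path) - max_len) // 2
        return path[start:start + max_len]
    raise ValueError(f"unknown mode: {mode}")

path = ["A", "B", "C", "D", "E"]
front = truncate_path(path, 3, "front")
middle = truncate_path(path, 3, "middle")
back = truncate_path(path, 3, "back")
```

A compiler choosing the mode by estimated performance could call this helper once per mode and compare the candidates with whatever cost model it has available.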

Furthermore, due to hardware limitations, the hardware may impose constraints on the first or the last instruction among the instructions to be fused in a directed path. For example, a square operation may extend the bit width, so the hardware requires a square operation to be the last operation of a directed path, and the directed path then needs to be cut after the square operation. Of course, the hardware may impose many other instruction fusion constraints, which are not enumerated here. Accordingly, hardware information may be acquired, whether the directed path needs to be cut is judged according to the hardware information, and if so, the cutting position is determined and the directed path is cut at that position. In this process, the fusion optimization handles the instructions according to the constraints corresponding to the hardware information, so that upper-layer users do not need to adapt to the hardware, and compatibility and extensibility are good.
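The square-operation constraint can be illustrated by cutting a path after every operation that the hardware requires to be last. The function name and the `must_be_last` set are hypothetical stand-ins for the hardware information; real constraints may be richer than this.

```python
def cut_after_terminal_ops(path, must_be_last):
    """Split `path` into segments so that ops in `must_be_last` end their segment."""
    segments, current = [], []
    for op in path:
        current.append(op)
        if op in must_be_last:
            segments.append(current)   # cutting position: right after this op
            current = []
    if current:
        segments.append(current)
    return segments

segments = cut_after_terminal_ops(["mul", "square", "add", "sub"], {"square"})
# Segments holding a single operation have nothing left to fuse with.
fusible_segments = [s for s in segments if len(s) >= 2]
```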

Fig. 8 is a schematic structural diagram of an instruction fusion device according to an embodiment of the present application. As shown in fig. 8, the instruction fusion device according to this embodiment may be a compiler or another electronic device with a compiling function, and the instruction fusion device 50 according to this embodiment includes: an acquisition module 51, a determination module 52, and a generation module 53.

The acquiring module 51 is configured to acquire a program code to be processed, where at least two instructions in the program code to be processed are pre-marked with compiling instruction information, where the compiling instruction information is used to indicate whether the marked instructions can be fused;

The determining module 52 is configured to determine an instruction to be fused according to a read-write relationship of instructions in the program code to be processed and the compiling instruction information;

the generating module 53 is configured to replace at least two of the to-be-fused instructions in the to-be-processed program codes with target fusion instructions, and generate fused target program codes.

In one or more embodiments of the present application, the compilation instruction information includes information that allows at least one memory location not to be read from or written to by an instruction; the determining module 52 is configured to, when determining the instruction to be fused according to the read-write relationship between the instructions in the program code to be processed and the compiling instruction information:

And determining the instructions related to the storage unit which are allowed not to be read and written by the instructions according to the read-write relation of the instructions in the program codes to be processed and the compiling instruction information, and taking the instructions related to the storage unit as the instructions to be fused.

In one or more embodiments of the present application, the determining module 52 is configured to, when determining the instruction to be fused according to the read-write relationship between the instructions in the program code to be processed and the compiling instruction information:

Constructing a directed bipartite graph according to the read-write relation of each instruction in the program code to be processed; the directed bipartite graph comprises operation nodes of all instructions, storage unit nodes and directed edges representing read-write relations;

Configuring fusion attributes for related storage unit nodes in the directed bipartite graph according to the compiling indication information, wherein the fusion attributes are used for indicating that the storage unit nodes are allowed to be not read and written;

and determining at least two instructions to be fused according to the directed bipartite graph and the fusion attribute of the related nodes.

In one or more embodiments of the present application, the determining module 52 is further configured to, after constructing a directed bipartite graph according to the read-write relationship of each instruction in the pending program code:

And optimizing the directed bipartite graph to obtain an optimized directed bipartite graph.

In one or more embodiments of the application, the determining module 52, when optimizing the directed bipartite graph, is configured to:

if the bidirectional edge exists, the storage unit node connected with the bidirectional edge is replaced by two virtual storage unit nodes, and an optimized directed bipartite graph is obtained; one virtual storage unit node is connected with the directed edge corresponding to the outgoing degree, and the other virtual storage unit node is connected with the directed edge corresponding to the incoming degree.

In one or more embodiments of the present application, the determining module 52 is configured, when configuring the fusion attribute for the relevant storage unit node in the directed bipartite graph according to the compiling instruction information, to:

judging whether each storage unit node is read by a plurality of operation nodes or not respectively;

if the storage unit node is determined to be read by only one operation node, configuring fusion attributes for the storage unit node; or alternatively

If the storage unit node is determined to be read by a plurality of operation nodes, the configuration fusion attribute of the storage unit node is ignored.

if the execution performance after fusion is better than the independent execution performance, configuring fusion attributes for the storage unit nodes; or alternatively

If the performance of the fused execution is not superior to the independent execution performance, the configuration of the fused attribute to the storage unit node is ignored.

In one or more embodiments of the present application, the determining module 52 is configured, when determining at least two instructions to be fused according to the directed bipartite graph and the fusion attribute of the relevant node therein, to:

searching at least one directed path from the directed bipartite graph, wherein the directed path is composed of at least one target node configured with fusion attribute and an operation node connected with the target node through a directed edge;

And determining at least two instructions to be fused according to the at least one directed path, and generating a target fusion instruction according to the instructions to be fused.

In one or more embodiments of the present application, the determining module 52, when determining at least two instructions to be fused according to the at least one directed path, is configured to:

In one or more embodiments of the application, the determination module 52 is further configured to:

if the length of the directed path exceeds a preset length threshold, intercepting the directed path according to the preset length threshold, and determining the instruction corresponding to each operation node in the intercepted directed path as an instruction to be fused.

judging whether a directed ring exists in the directed bipartite graph or not;

if the directed ring exists, identifying the last operation node in the directed ring according to the instruction execution sequence;

And if the storage unit node written by the last operation node is allowed not to be read and written, determining that the last operation node can be fused.

In one or more embodiments of the present application, the determining module 52, when determining whether a directed ring exists in the directed bipartite graph, is configured to:

Sequentially numbering each operation node according to the instruction execution sequence;

traversing in turn according to the number of the operation node;

and if the number of the write operation node of the currently traversed storage unit node is larger than the number of the read operation node, determining that a directed ring exists in the directed bipartite graph.
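The numbering-based check can be sketched as below. This is a minimal sketch that assumes each instruction is given as a `(name, reads, writes)` tuple in execution order; the tuple layout and the function name `has_directed_ring` are hypothetical. It applies the criterion stated above: a storage unit node whose writing operation node has a larger number than one of its reading operation nodes indicates a directed ring.

```python
def has_directed_ring(instructions):
    """Detect a directed ring using execution-order numbers.

    instructions: list of (name, reads, writes) tuples in execution order,
                  where reads and writes are lists of storage unit node names.
    """
    first_reader = {}  # storage node -> smallest number of an operation reading it
    for number, (_name, reads, writes) in enumerate(instructions):
        for storage in reads:
            first_reader.setdefault(storage, number)
        for storage in writes:
            # The writer is numbered later than an earlier reader of the same node.
            if storage in first_reader and first_reader[storage] < number:
                return True
    return False


# op2 writes "a" after op1 has read it: a -> op1 -> b -> op2 -> a.
program = [("op0", [], ["a"]), ("op1", ["a"], ["b"]), ("op2", ["b"], ["a"])]
print(has_directed_ring(program))
```

An operation node that reads and writes the same storage unit node is not reported by this sketch, because the comparison is strict.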

In one or more embodiments of the application, the program code to be processed includes at least one basic block, each of the basic blocks including at least one instruction therein;

the determining module 52 is configured to, when constructing a directed bipartite graph according to the read-write relationship of each instruction in the program code to be processed:

constructing a directed bipartite graph for the read-write relation of each instruction in any basic block in the program code to be processed; or alternatively

And constructing a directed bipartite graph for the read-write relation of each instruction in a plurality of adjacent basic blocks in the program code to be processed, wherein the plurality of adjacent basic blocks are single-in and single-out basic blocks.
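The two construction options can be illustrated with the following sketch. It assumes a basic block is a list of `(name, reads, writes)` tuples and that the caller has already verified that the adjacent blocks to be merged are single-in and single-out; the names `build_graph` and `build_graphs` are hypothetical, and edges are represented simply as `(source, target)` pairs.

```python
def build_graph(instructions):
    """Build the directed edges of a bipartite graph from read-write relations."""
    edges = []
    for name, reads, writes in instructions:
        edges += [(storage, name) for storage in reads]   # storage -> operation
        edges += [(name, storage) for storage in writes]  # operation -> storage
    return edges


def build_graphs(blocks, merge_adjacent=False):
    """Build one graph per basic block, or one graph over merged adjacent blocks.

    merge_adjacent=True assumes the blocks are adjacent, single-in and single-out.
    """
    if not merge_adjacent:
        return [build_graph(block) for block in blocks]
    merged = [instruction for block in blocks for instruction in block]
    return [build_graph(merged)]


blocks = [[("ld", ["x"], ["a"])], [("add", ["a"], ["b"])]]
print(build_graphs(blocks))                       # one graph per basic block
print(build_graphs(blocks, merge_adjacent=True))  # one graph across both blocks
```

Merging the blocks exposes the path from `ld` through `a` to `add`, which cannot be seen when each block is analysed on its own.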

The instruction fusion device provided in this embodiment may execute the technical solution of the foregoing method embodiment, and its implementation principle and technical effects are similar, and are not described herein again.

Fig. 9 is a schematic structural diagram of an instruction fusion device according to another embodiment of the present application, and as shown in fig. 9, an instruction fusion device 60 according to an embodiment of the present application includes: at least one processor 61 and a memory 62;

Memory 62 stores computer-executable instructions;

At least one processor 61 executes computer-executable instructions stored in memory 62 such that the at least one processor performs the instruction fusion method provided in any one of the embodiments described above.

In a possible implementation manner, a computer readable storage medium is also disclosed, where a computer program is stored, and when the computer program is executed by at least one processor, the instruction fusion method provided in any one of the foregoing embodiments is implemented.

In one possible implementation, the above-mentioned stream-oriented computing hardware processor may be a processor structure shown in fig. 12 or 13, and further the processor may be integrated in a board, where the stream-oriented computing hardware processor may be an IPU or a GPU, and the application is not limited thereto.

In one possible implementation, a board card is also disclosed, which may be a device-side board card. Fig. 10 shows a schematic structural diagram of a board card 70 according to an embodiment of the application. As shown in fig. 10, the board card 70 includes a chip 701, which is a system on chip (SoC) integrated with one or more combination processing devices. The combination processing device is an artificial intelligence operation unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of complex fields such as computer vision, speech, natural language processing and data mining. In particular, deep learning is widely applied in cloud intelligence, where a notable characteristic of the applications is the large volume of input data, which places high demands on the storage capacity and computing capacity of the platform. The board card 70 of this embodiment is suitable for cloud intelligence applications and has large off-chip storage, on-chip storage and strong computing capacity.

The chip 701 is connected to an external device 703 through an external interface device 702. The external device 703 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a Wi-Fi interface, or the like. The data to be processed may be transferred by the external device 703 to the chip 701 through the external interface device 702. The calculation result of the chip 701 may be transmitted back to the external device 703 via the external interface device 702. The external interface device 702 may take different interface forms, such as a PCIe interface, according to different application scenarios.

The board card 70 also includes a memory device 704 for storing data, which includes one or more memory cells 705. The memory device 704 is connected to the control device 706 and the chip 701 through a bus and transmits data. The control device 706 in the board card 70 is configured to regulate the state of the chip 701. To this end, in one application scenario, the control device 706 may include a single chip microcomputer (Micro Controller Unit, MCU).

In one possible implementation, a combination processing apparatus is also provided, and fig. 11 is a block diagram showing the combination processing apparatus in the chip 701 of this embodiment. As shown in fig. 11, the combination processing device 80 includes a computing device 801, an interface device 802, a processing device 803, and a storage device 804.

The computing device 801 is configured to perform user-specified operations, primarily implemented as a single-core smart processor or as a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 803 through the interface device 802 to collectively accomplish the user-specified operations.

The interface device 802 is used for transmitting data and control instructions between the computing device 801 and the processing device 803. For example, the computing device 801 may obtain input data from the processing device 803 via the interface device 802 and write it to an on-chip storage device of the computing device 801. Further, the computing device 801 may obtain control instructions from the processing device 803 via the interface device 802 and write them into an on-chip control cache of the computing device 801. Alternatively or in addition, the interface device 802 may also read data in a storage device of the computing device 801 and transmit it to the processing device 803.

The processing device 803, as a general-purpose processing device, performs basic control including, but not limited to, data handling and starting and/or stopping the computing device 801. Depending on the implementation, the processing device 803 may be one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU) or a graphics processing unit (GPU), including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and their number may be determined according to actual needs. As mentioned above, the computing device 801 of the present application, considered alone, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 801 and the processing device 803 are considered together, they are regarded as forming a heterogeneous multi-core structure.

The storage device 804 is configured to store data to be processed. It may be a DRAM 804, that is, DDR memory, typically 16G or larger in size, and stores data for the computing device 801 and/or the processing device 803.

Fig. 12 shows a schematic internal architecture of a computing device 801 as a single core. The single-core computing device 901 is configured to process input data such as computer vision, voice, natural language, data mining, etc., and the single-core computing device 901 includes three modules: a control module 91, an operation module 92 and a storage module 93.

The control module 91 is used for coordinating and controlling the operation of the operation module 92 and the storage module 93 to complete the task of deep learning, and includes an instruction fetch unit (IFU) 911 and an instruction decode unit (IDU) 912. The instruction fetch unit 911 is configured to fetch an instruction from the processing device 803, and the instruction decode unit 912 decodes the fetched instruction and sends the decoded result to the operation module 92 and the storage module 93 as control information.

The operation module 92 includes a vector operation unit 921 and a matrix operation unit 922. The vector operation unit 921 is used for executing vector operation and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 922 is responsible for the core computation of the deep learning algorithm, i.e. matrix multiplication and convolution.

The storage module 93 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 931, a parameter storage unit (weight RAM, WRAM) 932, and a direct memory access module (direct memory access, DMA) 933. NRAM 931 is used to store input neurons, output neurons, and computed intermediate results; WRAM 932 is used to store the convolution kernels, i.e., the weights, of the deep learning network; DMA 933 is coupled to DRAM 804 via bus 94 and is responsible for data transfer between the single-core computing device 901 and DRAM 804.

Fig. 13 illustrates an internal architecture diagram of a computing device 801 that is multi-core. The multi-core computing device 1001 adopts a hierarchical design, and the multi-core computing device 1001 is a system-on-chip (soc) including at least one cluster (cluster), each cluster including a plurality of processor cores, in other words, the multi-core computing device 1001 is formed by a hierarchy of system-on-chip (soc) -cluster-processor cores.

At the level of the system-on-chip, as shown in fig. 13, the multi-core computing device 1001 includes an external memory controller 1001, a peripheral communication module 1002, an on-chip interconnect module 1003, a synchronization module 1004, and a plurality of clusters 1005.

There may be a plurality of external memory controllers 1001, 2 being shown by way of example, for accessing external memory devices, such as DRAM 804 in fig. 11, to read data from or write data to the off-chip in response to access requests issued by the processor cores. The peripheral communication module 1002 is configured to receive a control signal from the processing device 803 via the interface device 802, and activate the computing device 801 to perform a task. The on-chip interconnect module 1003 connects the external memory controller 1001, the peripheral communication module 1002, and the plurality of clusters 1005 for transmitting data and control signals between the respective modules. The synchronization module 1004 is a global synchronization barrier controller (global barrier controller, GBC) for coordinating the working progress of each cluster to ensure synchronization of information. The plurality of clusters 1005 are computing cores of the multi-core computing device 1001, 4 being illustratively shown in the figure, the 4 clusters 1005 forming 4 quadrants as in fig. 1. As hardware progresses, the multi-core computing device 1001 of the present application may also include 8, 16, 64, or even more clusters 1005. Cluster 1005 is used to efficiently perform the deep learning algorithm.

At the cluster level, as shown in FIG. 13, each cluster 1005 includes a plurality of processor cores (IPU cores) 1006 and a memory core (MEM core) 1007. Illustratively, each cluster 1005 includes 4 processor cores and 1 memory, which may be DRAM 804. Each processor core corresponds to one of the arithmetic units in fig. 1, and each memory corresponds to one of the memory units in fig. 1.

The processor cores 1006 are illustratively shown as 4, but the present application does not limit the number of processor cores 1006. Their internal architecture is shown in fig. 14. Each processor core 1006 is similar to the single-core computing device 901 of fig. 12 and likewise comprises three major modules: a control module 1101, an operation module 1102 and a storage module 1103. The functions and structures of the control module 1101, the operation module 1102 and the storage module 1103 are substantially the same as those of the control module 91, the operation module 92 and the storage module 93, and are not described again in detail. The control module 1101 includes an instruction fetch unit 11011 and an instruction decode unit 11012. The operation module 1102 includes a vector operation unit 11021 and a matrix operation unit 11022. It should be noted that the storage module 1103 includes an input/output direct memory access module (input/output direct memory access, IODMA) 11033 and a move direct memory access module (move direct memory access, MVDMA) 11034. IODMA 11033 controls access between NRAM 11031/WRAM 11032 and DRAM 804 over the broadcast bus 1009; MVDMA 11034 controls access between NRAM 11031/WRAM 11032 and the memory cell (SRAM) 1008.

Returning to FIG. 13, the memory core 1007 is primarily used for storage and communication, i.e., to store shared data or intermediate results between the processor cores 1006, and to perform communication between the clusters 1005 and the DRAM 804, between the clusters 1005, between the processor cores 1006, and so on. In other embodiments, the memory core 1007 is also capable of performing scalar operations.

The memory core 1007 includes SRAM 1008, broadcast bus 1009, cluster direct memory access module (cluster direct memory access, CDMA) 1010, and global direct memory access module (global direct memory access, GDMA) 1011. The SRAM 1008 acts as a high-performance data transfer station: data reused by different processor cores 1006 in the same cluster 1005 does not need to be fetched from the DRAM 804 by each processor core 1006 individually, but is relayed between the processor cores 1006 through the SRAM 1008. The memory core 1007 only needs to rapidly distribute the reused data from the SRAM 1008 to the plurality of processor cores 1006, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.

Broadcast bus 1009, CDMA 1010 and GDMA 1011 are used to perform communication between processor cores 1006, communication between clusters 1005, and data transfer between clusters 1005 and DRAM 804, respectively. Each is described below.

The broadcast bus 1009 is used to perform high-speed communication between the processor cores 1006 in the cluster 1005. The broadcast bus 1009 of this embodiment supports inter-core communication modes including unicast, multicast, and broadcast. Unicast is a communication mode that refers to the transfer of data from point to point (e.g., single processor core to single processor core), multicast is the transfer of a piece of data from SRAM 1008 to a specific number of processor cores 1006, and broadcast is the transfer of a piece of data from SRAM 1008 to all processor cores 1006, a special case of multicast.

CDMA 1010 is used to control access to SRAM 1008 between different clusters 1005 within the same computing device 801.

GDMA 1011 cooperates with the external memory controller 1001 to control access from the SRAM 1008 of the cluster 1005 to the DRAM 804, or to read data from the DRAM 804 into the SRAM 1008. From the foregoing, it can be appreciated that communication between DRAM 804 and NRAM 11031 or WRAM 11032 may be achieved via 2 channels. The first channel connects DRAM 804 directly with NRAM 11031 or WRAM 11032 through IODMA 11033; the second channel first transfers data between DRAM 804 and SRAM 1008 via GDMA 1011, and then transfers data between SRAM 1008 and NRAM 11031 or WRAM 11032 via MVDMA 11034. Although the second channel appears to involve more components and a longer data path, in some embodiments its bandwidth is much greater than that of the first channel, so communication between DRAM 804 and NRAM 11031 or WRAM 11032 may be more efficient through the second channel. Embodiments of the present application may select a data transmission channel according to the hardware conditions.
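The trade-off between the two channels can be made concrete with a toy cost model. The sketch below is an assumption-laden illustration, not a description of the hardware: it treats the two hops of the second channel as strictly sequential with no pipelining, the function name `pick_channel` is hypothetical, and the bandwidth figures in the example are invented for illustration only.

```python
def pick_channel(num_bytes, bw_iodma, bw_gdma, bw_mvdma):
    """Pick the faster DRAM <-> NRAM/WRAM channel under a toy cost model.

    bw_* are bandwidths in bytes per time unit (hypothetical values).
    Returns a (channel_name, estimated_time) pair.
    """
    # First channel: one direct hop through IODMA.
    direct = num_bytes / bw_iodma
    # Second channel: DRAM <-> SRAM via GDMA, then SRAM <-> NRAM/WRAM via MVDMA.
    via_sram = num_bytes / bw_gdma + num_bytes / bw_mvdma
    if via_sram < direct:
        return ("via_sram", via_sram)
    return ("direct", direct)


# Invented bandwidths: the two-hop channel wins when each hop is 4x faster.
print(pick_channel(1024, bw_iodma=1.0, bw_gdma=4.0, bw_mvdma=4.0))
```

Under this model the second channel is only preferable when its combined per-byte cost is lower than that of the direct hop, which is consistent with the remark that its bandwidth must be much greater.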

In other embodiments, the functions of GDMA 1011 and IODMA 11033 may be integrated in the same component. For convenience of description, GDMA 1011 and IODMA 11033 are treated as different components; any implementation whose functions and technical effects are similar to those of the present application falls within the scope of protection of the present application. Further, the functions of GDMA 1011, IODMA 11033, CDMA 1010 and MVDMA 11034 may also be implemented by the same component.

The foregoing may be better understood in light of the following clauses:

Clause 1, a method of instruction fusion, comprising:

Clause 2, the method of clause 1, wherein the compilation instruction information comprises information of at least one memory unit that is allowed not to be read from or written to by an instruction; the determining the to-be-fused instruction according to the read-write relation between the instructions in the to-be-processed program code and the compiling instruction information comprises the following steps:

Clause 3, the method according to clause 1 or 2, wherein determining the instruction to be fused according to the read-write relationship between the instructions in the program code to be processed and the compiling instruction information includes:

Clause 4, the method according to clause 3, wherein after constructing the directed bipartite graph according to the read-write relationship of each instruction in the to-be-processed program code, further includes:

Clause 5, the method of clause 4, the optimizing the directed bipartite graph, comprising:

Clause 6, the method according to clause 4 or 5, wherein the configuring the fusion attribute for the relevant storage unit node in the directed bipartite graph according to the compiling instruction information includes:

Clause 7, the method according to clause 4 or 5, wherein the configuring the fusion attribute for the relevant storage unit node in the directed bipartite graph according to the compiling instruction information includes:

Clause 8, the method according to clause 4 or 5, wherein determining at least two instructions to be fused according to the fusion attribute of the directed bipartite graph and the related nodes comprises:

Clause 9, the method of clause 8, wherein determining at least two instructions to be fused according to the at least one directed path comprises:

Clause 10, the method of clause 8 or 9, further comprising:

Clause 11, the method of any of clauses 4-10, the optimizing the directed bipartite graph, comprising:

judging whether a directed ring exists in the directed bipartite graph or not;

Clause 12, the method of clause 11, wherein the determining whether a directed ring exists in the directed bipartite graph, comprises:

traversing in turn according to the number of the operation node;

Clause 13, the method of any of clauses 3-12, wherein the program code to be processed comprises at least one basic block, each basic block comprising at least one instruction therein;

the constructing a directed bipartite graph according to the read-write relation of each instruction in the to-be-processed program code comprises the following steps:

Clause 14, an instruction fusion apparatus, comprising:

Clause 15, an instruction fusion apparatus, comprising: at least one processor and memory;

the memory stores computer-executable instructions;

The at least one processor executing computer-executable instructions stored in the memory causes the at least one processor to perform the method of any one of clauses 1-13.

Clause 16, a computer readable storage medium having stored therein a computer program which, when executed by at least one processor, implements the method of any of clauses 1-13.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are alternative embodiments, and that the acts and modules referred to are not necessarily required for the present application.

It should be further noted that, although the steps in the flowchart are sequentially shown as indicated by arrows, the steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least a portion of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order in which the sub-steps or stages are performed is not necessarily sequential, and may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps or other steps.

It will be appreciated that the device embodiments described above are merely illustrative and that the device of the application may be implemented in other ways. For example, the division of the units/modules in the above embodiments is merely a logic function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted or not performed.

In addition, each functional unit/module in each embodiment of the present application may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together, unless otherwise specified. The integrated units/modules described above may be implemented either in hardware or in software program modules.

The integrated units/modules, if implemented in hardware, may be digital circuits, analog circuits, etc. Physical implementations of hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the artificial intelligence processor may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP or ASIC. Unless otherwise specified, the storage elements may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), etc.

The integrated units/modules may be stored in a computer-readable memory if implemented in the form of software program modules and sold or used as a stand-alone product. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.

In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments. The technical features of the foregoing embodiments may be arbitrarily combined, and for brevity, all of the possible combinations of the technical features of the foregoing embodiments are not described, however, all of the combinations of the technical features should be considered as being within the scope of the disclosure.

Claims

1. A method of instruction fusion, comprising:

2. The method of claim 1, wherein the compilation instruction information includes information that allows at least one memory location not to be read from or written to by an instruction; the determining the to-be-fused instruction according to the read-write relation between the instructions in the to-be-processed program code and the compiling instruction information comprises the following steps:

3. The method according to claim 1 or 2, wherein determining the instruction to be fused according to the read-write relationship between the instructions in the program code to be processed and the compiling instruction information comprises:

4. A method according to claim 3, wherein after the directed bipartite graph is constructed according to the read-write relationship of each instruction in the program code to be processed, the method further comprises:

5. The method of claim 4, wherein optimizing the directed bipartite graph comprises:

6. The method according to claim 4 or 5, wherein the configuring the fusion attribute for the relevant storage unit node in the directed bipartite graph according to the compiling instruction information includes:

7. The method according to claim 4 or 5, wherein the configuring the fusion attribute for the relevant storage unit node in the directed bipartite graph according to the compiling instruction information includes:

8. The method according to claim 4 or 5, wherein determining at least two instructions to be fused according to the fusion attribute of the directed bipartite graph and the related nodes thereof comprises:

9. The method of claim 8, wherein said determining at least two instructions to be fused from said at least one directed path comprises:

10. The method according to claim 8 or 9, characterized in that the method further comprises:

11. The method of any of claims 4-10, wherein the optimizing the directed bipartite graph comprises:

judging whether a directed ring exists in the directed bipartite graph or not;

12. The method of claim 11, wherein said determining whether a directed ring exists in the directed bipartite graph comprises:

traversing in turn according to the number of the operation node;

13. The method according to any of claims 3-12, wherein the program code to be processed comprises at least one basic block, each basic block comprising at least one instruction therein;

14. An instruction fusion apparatus, comprising: at least one processor and memory;

the memory stores computer-executable instructions;

The at least one processor executing computer-executable instructions stored in the memory causes the at least one processor to perform the method of any one of claims 1-13.

15. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by at least one processor, implements the method according to any of claims 1-13.