CN118051260A - Instruction fusion method, device and storage medium

Instruction fusion method, device and storage medium

Info

Publication number
CN118051260A
CN118051260A (application CN202211426895.6A)
Authority
CN
China
Prior art keywords
instruction
fused
chain
candidate
fusion
Prior art date
Legal status
Pending
Application number
CN202211426895.6A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN202211426895.6A
Publication of CN118051260A

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The embodiments of the present application provide an instruction fusion method, an instruction fusion device, and a storage medium. An instruction chain to be fused is obtained from the program code to be processed, the chain comprising at least two instructions to be fused; a target fused instruction code is generated from the instructions to be fused in the chain and a preset general fused instruction code template; and the codes of the instructions to be fused are replaced with the target fused instruction code. Because the target fused instruction code is generated inside the compiler in a unified template format, the method applies to different instruction combinations without packaging additional APIs (Application Programming Interfaces) or rewriting program code, which gives good extensibility and reduces development workload. Instruction fusion is performed on hardware that supports it and simply omitted on hardware that does not, which preserves compatibility and improves the performance of stream-oriented computing hardware processors.

Description

Instruction fusion method, device and storage medium
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to an instruction fusion method, an instruction fusion device and a storage medium.
Background
In a streaming-oriented computing hardware processor, each streaming computing instruction typically requires that the input data be read from a memory unit for some processing (e.g., addition, convolution, etc.) and then the output data be written back to the memory unit. With the development of the streaming computing hardware, the streaming computing hardware can support the fusion operation of a plurality of computing instructions, that is, the intermediate data of the fused computing instructions does not need to be read and written by a storage unit.
In one related art, fused computation instructions are packaged directly into an API (Application Programming Interface), for example an API that fuses a multiply instruction and an add instruction into a multiply-accumulate (fma) instruction; in another related art, a compiler fuses computation instructions according to the program code.
In the first related art, packaging APIs for the many possible combinations of multiple computation instructions is costly, and if only some combinations are packaged, the APIs cannot be applied to other combinations, so compatibility and extensibility are poor; in the second related art, the compiler may be unable to fuse computation instructions effectively.
Disclosure of Invention
The embodiments of the present application provide an instruction fusion method, device, and storage medium, which reduce the cost of instruction fusion and offer high flexibility and compatibility.
In a first aspect, an embodiment of the present application provides an instruction fusion method, including:
Acquiring an instruction chain to be fused according to a program code to be processed, wherein the instruction chain to be fused comprises at least two instructions to be fused;
generating a target fusion instruction code according to an instruction to be fused included in the instruction chain to be fused and a preset general fusion instruction code template;
And replacing the code of the to-be-fused instruction with the target fusion instruction code.
In a second aspect, an embodiment of the present application provides an instruction fusion apparatus, including:
The acquisition unit is used for acquiring an instruction chain to be fused according to the program code to be processed, wherein the instruction chain to be fused comprises at least two instructions to be fused;
The generating unit is used for generating a target fusion instruction code according to the to-be-fused instruction included in the to-be-fused instruction chain and a preset general fusion instruction code template;
and the replacing unit is used for replacing the code of the to-be-fused instruction with the target fusion instruction code.
In a third aspect, an embodiment of the present application provides an instruction fusion apparatus, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes computer-executable instructions stored by the memory, causing the at least one processor to perform the method as described in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored which, when executed by at least one processor, implements a method as described in the first aspect.
According to the instruction fusion method, device, and storage medium provided by the embodiments of the present application, an instruction chain to be fused is obtained from the program code to be processed, the chain comprising at least two instructions to be fused; a target fused instruction code is generated from the instructions to be fused in the chain and a preset general fused instruction code template; and the codes of the instructions to be fused are replaced with the target fused instruction code. Because the target fused instruction code is generated inside the compiler in a unified template format, the method applies to different instruction combinations without packaging additional APIs or rewriting program code, which gives good extensibility and reduces development workload. Instruction fusion is performed on hardware that supports it and simply omitted on hardware that does not, which preserves compatibility and improves the performance of stream-oriented computing hardware processors.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic view of a scenario of an instruction fusion method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for instruction fusion according to an embodiment of the present application;
FIG. 3 is a flowchart of an instruction fusion method according to another embodiment of the present application;
FIG. 4 is a flowchart of an instruction fusion method according to another embodiment of the present application;
FIG. 5 is a schematic diagram of an instruction fusion apparatus according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an instruction fusion apparatus according to another embodiment of the present application;
Fig. 7 is a structural view showing a board according to an embodiment of the present application;
fig. 8 is a block diagram showing a combination processing apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram showing an internal structure of a single core computing device according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating the internal architecture of a multi-core computing device according to an embodiment of the application;
fig. 11 is a schematic diagram showing an internal structure of a processor core according to an embodiment of the present application.
Specific embodiments of the present disclosure have been shown by way of the above drawings and will be described in more detail below. These drawings and the written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the disclosed concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
For a clear understanding of the technical solutions of the present application, the prior art solutions will be described in detail first.
In a stream-oriented computing hardware processor, each streaming computation instruction generally needs to read input data from a storage unit, perform some processing (such as addition, convolution, etc.), and then write the output data back to the storage unit. If a subsequent computation instruction needs that output data, it reads the data from the storage unit, performs new processing, and writes its result back to the storage unit as output data. For example, if there are N instructions op1, op2, ..., opN, where the output of each opi is used only by the following op(i+1), then executing the N instructions requires reading from and writing to the storage unit N times each, resulting in a long total execution time. With the development of streaming computing hardware, multiple computation instructions can be fused: the intermediate data of the fused computation instructions no longer need to be read from or written to the storage unit, and the output of one operation is fed directly into the next without passing through the storage unit, which reduces the time cost of frequent storage accesses and improves performance.
In one related art, fused computation instructions are directly packaged into an API, such as an API that fuses multiply operation instructions and add operation instructions into multiply-accumulate (fma) operation instructions; the disadvantages of this related art are:
1) Poor scalability: whenever new hardware supports a new instruction combination, a new API must be packaged for it; the number of interfaces grows with each hardware iteration, the development workload is large, and the code size and compilation time also increase.
2) Heavy rewriting: for existing code to use this hardware feature, it must be rewritten so that fused-instruction interfaces replace the original combinations of basic instructions, which is a large development workload.
3) Poor compatibility: implementing hardware fusion optimization is relatively costly, and much hardware does not support the feature. The programming language and compiler therefore need to be designed with compatibility in mind, so that the same program is accelerated directly on hardware that supports the feature and still executes in the original way on hardware that does not; a program that uses a fusion interface, however, cannot execute on hardware that does not support that interface.
In another related art, a compiler analyzes the data dependences of each instruction in the program code and then fuses the instructions that conform to the fusion rules. The disadvantage of this approach is that pure static analysis cannot accurately identify aliases of the storage space, so fusion is performed conservatively: instructions involving storage space whose aliases cannot be determined are not fused, and the performance of the fusion feature cannot be fully exploited.
In order to solve the above technical problems, the present application provides an instruction fusion method: an instruction chain to be fused is obtained from the program code to be processed, the chain comprising at least two instructions to be fused; a target fused instruction code is generated from the instructions to be fused in the chain and a preset general fused instruction code template; and the codes of the instructions to be fused are replaced with the target fused instruction code. Because the target fused instruction code is generated inside the compiler in a unified template format, the method applies to different instruction combinations without packaging additional APIs or rewriting program code, which gives good extensibility and reduces development workload. Instruction fusion is performed on hardware that supports it and simply omitted on hardware that does not, which preserves compatibility and improves the performance of stream-oriented computing hardware processors.
The instruction fusion method provided by the present application is applied to the scenario shown in fig. 1, which comprises a compiler for stream-oriented computing and a hardware processor. The program code to be processed, which contains a plurality of instructions, is input to the compiler. The compiler obtains an instruction chain to be fused from the program code, the chain comprising at least two instructions to be fused; it then generates a target fused instruction code from the instructions in the chain and a preset general fused instruction code template, and replaces the codes of the instructions to be fused with the target fused instruction code. Finally, the compiler produces the fused target program code and passes it to the stream-oriented computing hardware processor, where it is executed. The compiler may run on a CPU, and the stream-oriented computing hardware processor may be an XPU such as an IPU (Intelligence Processing Unit) or a GPU (Graphics Processing Unit).
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart of an instruction fusion method according to an embodiment of the present application; the execution body of this embodiment is a compiler (or another electronic device with a compiling function). As shown in fig. 2, the instruction fusion method provided in this embodiment includes the following steps:
S201, acquiring an instruction chain to be fused according to a program code to be processed, wherein the instruction chain to be fused comprises at least two instructions to be fused.
In this embodiment, the program code to be processed includes a plurality of instructions, the instructions may be analyzed, and a chain of instructions to be fused is determined according to a relationship between the instructions, where the chain of instructions to be fused includes at least two instructions to be fused, where a subsequent instruction to be fused reads data in an output address of a previous instruction to be fused, and performs a certain processing operation, so as to form a chain relationship.
S202, generating a target fusion instruction code according to the to-be-fused instruction included in the to-be-fused instruction chain and a preset general fusion instruction code template.
In this embodiment, considering that there are many combinations of different instructions and that the same instruction may support multiple data types, a general fused instruction code template may be preconfigured to reduce development workload. The template specifies in advance how instruction fusion is performed and, in particular, contains variable items such as data types. The instruction information of the instructions to be fused in the chain that corresponds to these variable items is filled into the variable items, and the target fused instruction code is then generated from the general fused instruction code template with the instruction information added. Because the target fused instruction code is generated from the general template in a unified format, it applies to different instruction combinations, is convenient for code replacement, and is easy for the compiler to recognize and compile.
S203, replacing the code of the to-be-fused instruction with the target fusion instruction code.
In this embodiment, the code of the to-be-fused instruction in the to-be-processed program code may be replaced by the target fusion instruction code, and writing and reading of intermediate data into the storage unit are not performed in the target fusion instruction code, so as to realize fusion of the to-be-fused instructions.
Alternatively, the target fused instruction code may be an intermediate code (Intermediate Representation, IR) of the target fused instruction, and the intermediate codes of the instructions to be fused may be replaced with the target fused instruction code while the program code to be processed is converted into intermediate code.
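Putting S201 to S203 together, the following is a minimal C++ sketch of such a compiler pass. All type names and helpers are hypothetical placeholders (chain discovery and template filling are sketched in later sections), not the patent's actual implementation.

#include <string>
#include <vector>

struct IrInstr { std::string text; };
using Chain = std::vector<size_t>;   // indices of the instructions to be fused

// Placeholder for S201 (chain discovery, sketched in the sections below).
std::vector<Chain> findChainsToFuse(const std::vector<IrInstr>& code) { return {}; }

// Placeholder for S202 (filling the general fused instruction code template).
std::string generateFusedCode(const std::vector<IrInstr>& code, const Chain& chain) {
    return "fuse(...)";
}

// S203: replace the codes of the instructions to be fused with the fused code.
void fusePass(std::vector<IrInstr>& code) {
    for (const Chain& chain : findChainsToFuse(code)) {
        code[chain.front()].text = generateFusedCode(code, chain);
        for (size_t k = 1; k < chain.size(); ++k)
            code[chain[k]].text.clear();   // the remaining fused instructions are dropped
    }
}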
According to the instruction fusion method of this embodiment, an instruction chain to be fused is obtained from the program code to be processed, the chain comprising at least two instructions to be fused; a target fused instruction code is generated from the instructions to be fused in the chain and a preset general fused instruction code template; and the codes of the instructions to be fused are replaced with the target fused instruction code. Because the target fused instruction code is generated inside the compiler in a unified template format, the method applies to different instruction combinations without packaging additional APIs or rewriting program code, which gives good extensibility and reduces development workload. Instruction fusion is performed on hardware that supports it and simply omitted on hardware that does not, which preserves compatibility and improves the performance of stream-oriented computing hardware processors.
On the basis of the above embodiment, as shown in fig. 3, the obtaining the instruction chain to be fused according to the program code to be processed may specifically include:
S2011, determining, in the program code to be processed, a plurality of candidate instructions whose fusion is supported by the target hardware;
S2012, constructing an instruction chain to be fused according to the plurality of candidate instructions.
In this embodiment, since not all instructions are allowed to be fused due to the characteristics of the target hardware, a plurality of instructions in the program code to be processed, which are supported by the target hardware for fusion, may be determined first, and used as candidate instructions for determining the instruction chain to be fused later.
Optionally, an instruction set whose fusion is supported by the target hardware can be obtained from the target hardware information; the instruction set may include the basic instructions that the target hardware supports fusing. Each instruction in the program code to be processed is then matched against this instruction set: if an instruction is included in the instruction set, it is determined to be allowed to be fused and becomes a candidate instruction; that is, in this embodiment the instructions of the program code to be processed that appear in the instruction set may be determined to be candidate instructions. If an instruction is not included in the instruction set, it is determined not to be allowed to be fused; such an instruction is retained in subsequent processing and is not fused. The plurality of candidate instructions may form a candidate instruction set, and after the candidate instructions supported by the target hardware have been determined, a traversal search over this candidate instruction set can be performed to construct the instruction chain to be fused, as in the sketch below.
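A minimal C++ sketch of this candidate selection, using hypothetical type names (the real compiler data structures are not specified in the text):

#include <string>
#include <unordered_set>
#include <vector>

struct Instruction {
    std::string opcode;      // e.g. "add", "mul"
    bool visited = false;    // the "second identifier" used during traversal
};

std::vector<Instruction*> collectCandidates(
        std::vector<Instruction>& program,
        const std::unordered_set<std::string>& fusableSet) {
    std::vector<Instruction*> candidates;
    for (auto& inst : program) {
        // An instruction is a candidate only if the target hardware supports
        // fusing its opcode; unsupported instructions are kept unfused.
        if (fusableSet.count(inst.opcode)) {
            candidates.push_back(&inst);
        }
    }
    return candidates;
}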
Alternatively, this embodiment may search the candidate instruction set by traversal to construct the instruction chain to be fused. During the traversal, a candidate instruction that has been visited can be marked with the second identifier, so that the search that constructs the instruction chain to be fused only traverses candidate instructions not yet marked with the second identifier. The second identifier indicates whether the candidate instruction has already been added to some instruction chain to be fused; for example, the second identifier may be "visited".
More specifically, for any candidate instruction among the plurality of candidate instructions that is not marked with the second identifier, a depth-first traversal is used to find dependency chains that take the candidate instruction as head node, where a dependency chain comprises at least two candidate instructions with data dependences. Because branch conditions may exist, several dependency chains with the same head node may be found; the longest dependency chain is taken as the instruction chain to be fused, and the candidate instructions on this instruction chain to be fused (i.e., the longest dependency chain) are marked with the second identifier.
Depth-first traversal explores each possible branch path until it can go no further, and each node is visited only once. The traversal follows the data dependences between candidate instructions: if the output result of the previous candidate instruction is read and used by the next candidate instruction, there is a data dependence between them. Depth-first traversal therefore quickly finds and constructs the longest dependency chain, in which the candidate instructions are connected by this data-use relationship, i.e., the output result of each candidate instruction is read and used by the next one. A sketch of such a search follows.
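The following is a minimal C++ sketch of this depth-first search over a hypothetical dependence graph (field and function names are assumptions, not the patent's implementation); the caller starts it only from candidates that are not yet marked with the second identifier:

#include <vector>

struct Candidate {
    std::vector<Candidate*> users;  // candidates that read this one's output result
    bool visited = false;           // the "second identifier"
};

using Chain = std::vector<Candidate*>;

void dfs(Candidate* node, Chain& current, Chain& longest) {
    current.push_back(node);
    if (current.size() > longest.size()) longest = current;   // keep the longest chain
    for (Candidate* next : node->users) {
        if (!next->visited) dfs(next, current, longest);       // follow data dependences
    }
    current.pop_back();
}

Chain buildChain(Candidate* head) {
    Chain current, longest;
    dfs(head, current, longest);
    for (Candidate* c : longest) c->visited = true;   // mark the second identifier
    return longest;                                   // becomes an instruction chain to be fused
}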
Optionally, because the purpose of a dependency chain is to fuse multiple instructions into one instruction so that the intermediate results no longer need to be stored in, or read back from, the storage space, a general rule follows: the output result of any candidate instruction in the dependency chain other than the last one (denoted an intermediate candidate instruction) is read by at most one candidate instruction (the instruction chain to be fused is itself a dependency chain and therefore must also satisfy this rule). Otherwise, if the output result of some candidate instruction other than the last one were read by two or more candidate instructions and, after fusion, that output result were not stored in the storage space, at least one of those candidate instructions could not read the output result and could not execute normally. The output result of a candidate instruction may refer to the output address of the candidate instruction, or to the value stored in the storage space pointed to by that output address. For example:
While searching dependency chains in the program code before fusion, it is found that the output result dst1 of operation B is used by operations C and D, so operation B does not satisfy the rule that an output result is read by at most one following candidate instruction; the instruction of operation B therefore cannot be fused with operation C or D, and the dependency chain contains only the instructions corresponding to operation A and operation B. If this dependency chain is finally determined to be the longest one, it can be used as the instruction chain to be fused for subsequent fusion, and the fused program code is finally obtained. In this example, the output result of a candidate instruction refers to the output address of that candidate instruction: if the output address of an intermediate candidate instruction is the source operand address of two or more subsequent candidate instructions, the output result of the intermediate candidate instruction is judged to be read by two or more subsequent candidate instructions.
Optionally, on top of this general rule for constructing the instruction chain to be fused, the construction process can be optimized so that, while correctness is guaranteed, the chain fuses as many instructions as possible. Specifically, when the output address of an intermediate candidate instruction is read by two or more subsequent candidate instructions, the life cycle of the value stored at that output address is further examined, and whether the output result of the intermediate candidate instruction is read by two or more subsequent candidate instructions is decided according to that life cycle. If, within its life cycle, the value stored at the output address of the intermediate candidate instruction is read by only one candidate instruction following it, then the intermediate candidate instruction and that following candidate instruction may be fused. The life cycle of a value stored at the output address of an instruction is the time from when the value is generated until it is destroyed. The value may be generated by external initial input or by executing a previous instruction, in which case it is the execution result of that previous instruction; the value is destroyed when the storage space of the output address is cleared or released, or when a later instruction writes to that address, i.e., the value is overwritten by the write result of the later instruction.
Furthermore, the compiler may rename operands in the intermediate candidate instruction and in the two or more following candidate instructions, that is, rename the address of the storage space corresponding to the output address of the intermediate candidate instruction to different addresses, and then construct the instruction chain to be fused by the normal traversal process.
For example, there are 4 candidate instructions before fusion, denoted the first to fourth candidate instructions, where the first to third candidate instructions are intermediate candidate instructions as described above, and the data dependences among these candidate instructions involve a write-after-read situation. During the traversal, for the first candidate instruction dst1=A(src0), it is detected that dst1 is used by both operation B and operation D, i.e., the output address of the first candidate instruction dst1=A(src0) is read by two subsequent candidate instructions, so under the general rule dst1=A(src0) could not be fused with the subsequent operations B and D. In reality, however, the inputs dst1 of operations B and D point to the same address but read different values: operation B reads the output result of operation A, while operation D reads the output result of operation C. It can be seen that the first candidate instruction dst1=A(src0) and the subsequent second candidate instruction dst2=B(dst1), third candidate instruction dst1=C(dst2), and fourth candidate instruction dst3=D(dst1) can be fused without affecting operation D reading the output result of operation C. At this point the compiler may determine the life cycle of the value stored in dst1; if within its life cycle the value stored in dst1 is read by only one candidate instruction after operation A, operation A and that candidate instruction can be fused. Specifically, the life cycle of the value that operation A writes to dst1 starts when operation A finishes executing and ends when operation C finishes executing and overwrites dst1. If the compiler can confirm that each value stored in dst1 is used by only one instruction during its life cycle, i.e., the value generated by operation A is used only by dst2=B(dst1) and the value generated by operation C is used only by dst3=D(dst1), then it is determined that the first candidate instruction dst1=A(src0), second candidate instruction dst2=B(dst1), third candidate instruction dst1=C(dst2), and fourth candidate instruction dst3=D(dst1) can be fused. To facilitate the fusion, the compiler may rename the address dst1 in these candidate instructions to two addresses dst11 and dst12, namely dst11=A(src0), dst2=B(dst11); dst12=C(dst2), dst3=D(dst12). Since the value in dst11 is read only by operation B and the value in dst12 is read only by operation D, the four candidate instructions can eventually be fused normally into dst3=D(C(B(A(src0)))). Therefore, during the traversal, the life-cycle analysis should be applied to the values stored at the output address of a candidate instruction rather than to the output address itself, to ensure that within each value's life cycle at most one operation reads it.
For another example, there are four candidate instructions before fusion, where the output address of the first candidate instruction A=op1(B,C) is read by the subsequent second candidate instruction D=op2(A,E) and the fourth candidate instruction F=op3(A,I). The compiler may then determine the life cycle of the value stored at the output address A of the first candidate instruction A=op1(B,C). Specifically, the life cycle of the value in A starts when operation op1 finishes executing and ends when operation op4 finishes executing. The compiler may determine that only one instruction, the second candidate instruction D=op2(A,E), reads the value in A during its life cycle, and may therefore determine that the first and second candidate instructions can be fused; the fused instruction is D=op2(op1(B,C),E). Since the third candidate instruction A=op4(H,K) has no data dependence on the first and second candidate instructions, it cannot be fused with them; the third candidate instruction may instead be fused with the subsequent fourth candidate instruction F=op3(A,I), giving the fused instruction F=op3(op4(H,K),I).
If, within the life cycle of the value stored at the output address of an intermediate candidate instruction, two or more subsequent candidate instructions read that value, it is determined that the intermediate candidate instruction cannot be fused with the subsequent candidate instructions. For example, there are 3 candidate instructions before fusion, denoted the first candidate instruction A=op1(B,C), the second candidate instruction D=op2(A,E), and the third candidate instruction F=op3(A,I). Since the value stored in A is read, during its life cycle, by two candidate instructions following the first candidate instruction, the first candidate instruction cannot be fused with the subsequent second and third candidate instructions.
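A minimal C++ sketch of this life-cycle check and renaming, under the assumption of a simple per-address access list (the data model and the renaming scheme are hypothetical, not taken from the patent):

#include <string>
#include <vector>

struct Access { int instrIndex; bool isWrite; };

// Returns true if, within each value's life cycle (from its defining write to
// the next write of the same address), at most one candidate reads it.
bool eachValueReadAtMostOnce(const std::vector<Access>& accessesInProgramOrder) {
    int readsOfCurrentValue = 0;
    for (const Access& a : accessesInProgramOrder) {
        if (a.isWrite) readsOfCurrentValue = 0;           // a new value, new life cycle
        else if (++readsOfCurrentValue > 1) return false; // value read more than once
    }
    return true;
}

// If the check passes, the address can be split per defining write, e.g.
// dst1 -> dst1.0, dst1.1 (a hypothetical renaming scheme playing the role of
// the dst11 / dst12 split in the example above).
std::string renamed(const std::string& base, int valueId) {
    return base + "." + std::to_string(valueId);
}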
On the basis of the embodiment, special conditions may exist in the instruction chain to be fused, and the instruction chain to be fused needs to be cut off so as to further optimize the instruction chain to be fused determined by the method.
In one possible case, the compiler of this embodiment may determine whether the storage space corresponding to the output address of each instruction to be fused in the instruction chain to be fused has an alias. If it is determined that the storage space corresponding to the output address of some instruction to be fused has an alias, or it cannot be determined whether an alias exists, the instruction chain to be fused is disconnected after that instruction, and that instruction together with the preceding part of the chain is taken as an instruction chain to be fused.
In this case, because different pointer aliases may exist for the same storage space, different instructions to be fused may write their output results into the same storage space. Static analysis by a compiler generally cannot identify aliases accurately, so a conservative approach is taken: if it is determined that the storage space corresponding to the output address of some instruction to be fused has an alias, or it cannot be determined whether an alias exists, the instruction chain to be fused is disconnected after that instruction, i.e., that instruction is not fused with the subsequent instructions; otherwise the performance of the fusion feature could not be fully exploited. In this embodiment, when the chain is disconnected after such an instruction, that instruction and the preceding part of the chain can still be fused. For example, given the instruction chain to be fused A->B->C->D, if the output address of operation C may have an alias, operation C cannot be fused with operation D; the chain is disconnected after operation C, i.e., split into A->B->C and D, and operation C and the part A->B before it can still be fused. By actively judging whether the storage space has an alias, the problem that static analysis by the compiler cannot identify aliases accurately is avoided as far as possible. For example:
// Original program: variable s2 may have an alias; if the compiler cannot accurately determine whether an alias exists, ADD and MUL are not fused
__nram__ half s0[128];
__nram__ half s1[128];
__nram__ half s2[128];
__nram__ half s3[128];
s2=ADD(s0+s1);
s3=MUL(s2*s0);
Given that static analysis cannot fully exploit the fusion feature, the present application may add a syntax structure to the programming language that allows a fusion attribute modifier to be added when a vector variable is declared, stating that the corresponding storage space has no alias anywhere in the program, so that fusion optimization can be performed. The fusion attribute modifier may be a specific identifier (the first identifier), such as "__fused__". On this basis, judging whether the storage space corresponding to the output address of each instruction to be fused in the instruction chain to be fused has an alias specifically includes the following:
It is judged whether each instruction to be fused in the instruction chain to be fused has been marked in advance with the first identifier, which indicates that the storage space corresponding to the output address of the marked instruction has no alias. If an instruction to be fused has been marked in advance with the first identifier, it is determined that the storage space corresponding to its output address has no alias. If an instruction to be fused has not been marked in advance with the first identifier, the compiler performs alias analysis (i.e., static analysis) to judge whether the storage space corresponding to the output address of that instruction has an alias; some compilers can determine exactly whether such an alias exists, while others may be unable to determine it. If an alias exists, or it cannot be determined whether an alias exists, the instruction chain to be fused is disconnected after that instruction. For example:
// Program using the new syntax structure
__nram__ half s0[128];
__nram__ half s1[128];
__fused__ __nram__ half s2[128];  // the variable s2 carries the attribute modifier (first identifier) __fused__
__nram__ half s3[128];
s2=ADD(s0+s1);
s3=MUL(s2*s0);
// Fused program using the new syntax structure
s3=MUL(ADD(s0+s1)*s0);
By marking one or more instructions in the program code to be processed with a first identifier in advance, the storage space corresponding to the instruction is indicated to have no alias in the whole program, and further whether the storage space has the alias or not can be actively judged, so that the problem that static analysis of a compiler cannot accurately identify the alias is avoided as much as possible.
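A minimal C++ sketch of this decision, with hypothetical types and with the compiler's alias analysis passed in as a callback (assumptions, not the patent's actual interfaces):

#include <functional>
#include <vector>

enum class AliasResult { NoAlias, MayAlias };

struct FusedInstr {
    bool destHasFusedAttr = false;   // destination declared with the first identifier
};

// The output is treated as possibly aliased unless the declaration carries the
// fusion attribute or the compiler's alias analysis proves there is no alias.
bool outputMayAlias(const FusedInstr& instr,
                    const std::function<AliasResult(const FusedInstr&)>& aliasAnalysis) {
    if (instr.destHasFusedAttr) return false;              // alias-free for the whole program
    return aliasAnalysis(instr) != AliasResult::NoAlias;   // "cannot tell" is conservative
}

// Keep the prefix up to and including the first instruction whose output may
// alias; the instructions after the cut are examined as a separate chain.
std::vector<FusedInstr*> fusablePrefix(
        const std::vector<FusedInstr*>& chain,
        const std::function<AliasResult(const FusedInstr&)>& aliasAnalysis) {
    std::vector<FusedInstr*> prefix;
    for (FusedInstr* i : chain) {
        prefix.push_back(i);
        if (outputMayAlias(*i, aliasAnalysis)) break;   // disconnect the chain here
    }
    return prefix;
}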
On the basis of any of the above embodiments, because the target hardware imposes certain constraints, the instruction chain to be fused may need to be truncated. Whether the chain needs to be truncated can be judged according to the target hardware information; if it does, a cut position is determined and the chain is truncated at that position.
In one possible case, the instruction chain to be fused must be truncated because of limits of the target hardware. For example, the target hardware may specify a fused-instruction number threshold: if the length of the chain exceeds this threshold, it must be truncated so that its length no longer exceeds the threshold. For instance, if the target hardware supports fusing at most K different instructions and the number of instructions to be fused in the chain is greater than K, the chain must be truncated so that the truncated chain meets the requirement. That is, the compiler may judge whether the number of instructions to be fused in the chain exceeds the fused-instruction number threshold of the target hardware; if so, it determines that the chain needs to be truncated according to that threshold.
In another possible case, the target hardware may constrain the first or last instruction to be fused in the chain. For example, a square operation may expand the bit width, and the target hardware may require that a square operation be only the last operation of the chain; the chain then needs to be cut after the square operation. Of course, the target hardware may have many other instruction fusion constraints, which are not enumerated here. That is, the compiler may judge whether the instructions to be fused in the chain include a specific instruction related to the target hardware; if so, it determines, according to that specific instruction, that the chain needs to be truncated, where the specific instruction can only serve as the head or tail of the instruction chain to be fused.
Through the above process, the fusion optimization performs the corresponding fusion processing on the instructions according to the constraints given by the target hardware information; upper-layer users do not need to adapt to the target hardware, and compatibility and extensibility are good. A sketch of such hardware-driven truncation follows.
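A minimal C++ sketch of the truncation rules above: the chain is cut so that no piece exceeds the fused-instruction number threshold K, and an opcode that the hardware only allows at the tail of a chain forces a cut right after it (a head-only opcode would symmetrically force a cut before it). The data model is a hypothetical simplification, not the patent's implementation.

#include <string>
#include <unordered_set>
#include <vector>

struct HwInfo {
    size_t maxFusedInstrs;                        // fused-instruction number threshold K
    std::unordered_set<std::string> tailOnlyOps;  // ops allowed only as chain tail
};

std::vector<std::vector<std::string>> truncateChain(const std::vector<std::string>& chain,
                                                    const HwInfo& hw) {
    std::vector<std::vector<std::string>> pieces;
    std::vector<std::string> cur;
    for (const std::string& op : chain) {
        cur.push_back(op);
        bool cutHere = cur.size() == hw.maxFusedInstrs   // length limit reached
                    || hw.tailOnlyOps.count(op) > 0;     // tail-only op must end a piece
        if (cutHere) { pieces.push_back(cur); cur.clear(); }
    }
    if (!cur.empty()) pieces.push_back(cur);
    return pieces;
}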
On the basis of any of the above embodiments, consider that an instruction contains several variable items and that there are many combinations of different instructions. In particular, when the program code to be processed is converted into intermediate code (IR), if every fused instruction had its own corresponding IR, the number of IRs would be extremely large, the development workload would be heavy, and the code size would grow significantly; in addition, the instruction replacement stage of the fusion optimization would need to match against all of them, leading to large switch statements or multi-level if-else statements and longer compilation times. To reduce the development workload, a general fused instruction code template may therefore be preconfigured.
Alternatively, the variable items may comprise a combination of one or more of the following: the opcode and number of instructions to be fused, the data type of the instructions to be fused, the number of source operands of the instructions to be fused, the data size of the operands of the instructions to be fused, and so forth. These variable terms result in a number of possible combinations of fused instruction parameters and types:
1) Opcodes and number of instructions to be fused: if the fused-instruction number threshold of the target hardware is K=4, i.e., the target hardware supports fusing at most 4 instructions, the number of fused instructions may be 2, 3, or 4. The instruction set supported for fusion by the target hardware contains basic instructions such as add, sub, mul, div, eq, ne, lt, ge, and, or, xor, not, and the instruction order can be arbitrary, so a fused chain may be add-sub-mul-div, div-sub-mul-add, and so on. Since the instruction order can be arbitrary, designing one IR fusion scheme for every possible instruction chain to be fused would require far too many IRs.
2) Data type of the instruction to be fused: the operands in each instruction to be fused typically have a plurality of data types such as float, half, int, short, char; if one way to fuse the IR is designed for each data type, the number of IR implementations needed is too large.
3) Number of source operands of the instructions to be fused: an instruction may be a unary operation with only one operand (e.g., not) or a binary operation with two operands (e.g., xor); because the number of operands of the instructions to be fused varies, some of the fused operand slots may take a default value.
4) Data size of operands of instructions to be fused: the operands of the instructions to be fused may be matrices, vectors or scalars.
Based on this, the general fused instruction code template of the present disclosure may include a fused instruction identifier and at least one variable item. The fused instruction identifier indicates that the code segment is a fused instruction code; for example, the fused instruction identifier may be "fuse". The variable items include one variable item for each piece of instruction information of each instruction to be fused. Accordingly, the following general fused instruction code template may be provided as an example in this embodiment:
fuse(dst, type, op0, op1, op2, op3, src0, src1, src2, src3, src4, size); wherein:
fuse: the fused instruction identifier;
dst: the output of the fused instruction, whose output address corresponds to the output address of the last instruction to be fused in the instruction chain to be fused;
type: the data type of the fused instruction;
opi: the opcodes of the instructions to be fused, such as op0, op1, op2, op3, set to the default value InvalidOp by default;
srci: the source operands of the instructions to be fused, such as src0, src1, src2, src3, src4, which may be vectors or scalars, set to the default value InvalidSrc by default;
size: the size of the operands of the instructions to be fused; if all srci have the same size, only one size is required, and if the srci sizes differ, multiple sizes may be given.
It should be clear that, in other embodiments, the number of operation codes of the fusion operation may not be limited to 4, and the operands srci of the corresponding fusion operation may be extended accordingly, which is not specifically limited in this disclosure.
As shown in fig. 4, generating the target fused instruction code according to the general fused instruction code template may include:
s301, acquiring instruction information of each instruction to be fused, which is included in the instruction chain to be fused;
S302, adding instruction information of each instruction to be fused into the universal fused instruction code template to generate the target fused instruction code according to the universal fused instruction code template added with the instruction information.
The instruction information of an instruction to be fused comprises: the opcode of the instruction to be fused, the data type of each operand, at least one operand, the length of each operand (i.e., the size of the operand), and the output address of each instruction to be fused.
In this embodiment, the operation code, the data type, the operand length, and the output address of the last instruction to be fused are the most basic instruction information required for instruction fusion, and these instruction information can be obtained from the program code to be processed and added to the general fusion instruction code template, so as to obtain the general fusion instruction code template after instruction information is added, so as to generate the target fusion instruction code according to the general fusion instruction code template after instruction information is added.
Further, the compiler may add the data type, the operand length, and the output address of the last instruction to be fused to the corresponding variable-item positions of the general fused instruction code template, namely type, size, and dst. The opcodes of the instructions to be fused are arranged in the order of the instructions in the chain and added to the opcode variable-item positions of the template, i.e., the positions of op0, op1, op2, op3. The operands of the instructions to be fused are likewise arranged in chain order and added to the operand variable-item positions of the template, i.e., src0, src1, src2, src3, src4. Treating each instruction to be fused as a binary operation with two operands: src0 and src1 are the operands corresponding to op0; the output of op0 together with src2 are the operands corresponding to op1; the output of op1 together with src3 are the operands corresponding to op2; and so on. If the opcode and/or an operand of some instruction to be fused is absent, the corresponding position is set to its default value.
Two examples are given below.
Example one:
// Before fusion
dst0=add(s0,s1);
dst1=sub(dst0,s2);
dst2=mul(dst1,10);
// The instruction chain to be fused is add-sub-mul; the number of operations is smaller than the fused-instruction number threshold K=4 supported by the target hardware
// General fused instruction code template after fusion
fuse(dst2, type, add, sub, mul, InvalidOp, s0, s1, s2, 10, InvalidSrc, size); // the number of instructions to be fused is smaller than the fused-instruction number threshold supported by the target hardware, so the corresponding opcode and operand slots in the template are set to default values, i.e., the fourth opcode slot (op3) is set to InvalidOp and src4 is set to InvalidSrc
Example two:
// Before fusion
dst0=add(s0,s1);
dst1=sub(dst0,s2);
dst2=not(dst1);
dst3=eq(dst2,s3);
// The instruction chain to be fused is add-sub-not-eq, which contains the unary operation not; the number of operations equals the fused-instruction number threshold K=4 supported by the target hardware
// General fused instruction code template after fusion
fuse(dst3, type, add, sub, not, eq, s0, s1, s2, InvalidSrc, s3, size); // the fused operation not is a unary operation, so its corresponding operand slot is set to the default value InvalidSrc
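The following C++ sketch shows one way the template could be filled from a chain, reproducing the slot layout of the two examples above (op0 owns src0/src1, each later opcode owns one further source slot); the types and the string-based IR are hypothetical simplifications, not the patent's implementation:

#include <string>
#include <vector>

struct ToFuse {
    std::string opcode;
    std::vector<std::string> extraSrcs;  // sources other than the previous instruction's result
    std::string dst;
};

std::string buildFuseIR(const std::vector<ToFuse>& chain,   // non-empty, length <= maxOps
                        const std::string& type, const std::string& size,
                        size_t maxOps = 4, size_t maxSrcs = 5) {
    std::vector<std::string> ops(maxOps, "InvalidOp");
    std::vector<std::string> srcs(maxSrcs, "InvalidSrc");
    for (size_t i = 0; i < chain.size() && i < maxOps; ++i) {
        ops[i] = chain[i].opcode;
        if (i == 0) {
            // op0 owns src0 and src1
            for (size_t j = 0; j < 2 && j < chain[0].extraSrcs.size(); ++j)
                srcs[j] = chain[0].extraSrcs[j];
        } else if (i + 1 < maxSrcs && !chain[i].extraSrcs.empty()) {
            // each later op owns one slot; a unary op leaves it as InvalidSrc
            srcs[i + 1] = chain[i].extraSrcs[0];
        }
    }
    std::string ir = "fuse(" + chain.back().dst + "," + type;   // dst of the last instruction
    for (const auto& o : ops)  ir += "," + o;
    for (const auto& v : srcs) ir += "," + v;
    return ir + "," + size + ")";
}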
Further, generating the target fused instruction code according to the general fused instruction code template may further include:
In the back-end instruction printing stage (i.e., during generation of the target fused instruction), the default values (InvalidOp, InvalidSrc) in the general fused instruction code template are not printed, and the opcodes and data types in the template are printed as their corresponding character strings. For example:
If the instruction chain to be fused is add-sub, the type of the fused instruction may be float, char, or half. The data types are encoded, with different data types corresponding to different immediates; for example, the immediate corresponding to float is 1, the immediate corresponding to char is 2, and the immediate corresponding to half is 3. This encoding rule and correspondence are fixed and apply to all instructions. The type information is extracted from the instructions to be fused, converted into the corresponding immediate, and stored in the type field of the general fused instruction code template; in the subsequent instruction printing stage it is printed as the corresponding data-type character string. For example:
1)float s0,s1,s2,d0,d1;
d0=add(s0,s1);
d1=sub(d0,s2)
the instruction chain to be fused is add-sub, the data type is float, the type code corresponding to float is 1, and based on the fact, the general fused instruction code template corresponding to the instruction chain to be fused can be expressed as:
fuse(d0,1,add,sub,InvalidOp,InvalidOp,s0,s1,s2,InvalidSrc,InvalidSrc,size);
The compiler may generate a target fusion instruction code add.
Another example:
2)half s0,s1,s2,d0,d1;
d0=add(s0,s1);
d1=sub(d0,s2)
The instruction chain to be fused is add-sub, the data type is half, the type code corresponding to half is 3, and based on the fact, the general fusion instruction code template corresponding to the instruction chain to be fused can be expressed as:
fuse(d0,3,add,sub,InvalidOp,InvalidOp,s0,s1,s2,InvalidSrc,InvalidSrc,size);
The compiler may generate a target fusion instruction code add.
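A minimal C++ sketch of the printing stage described above: the float=1 / char=2 / half=3 codes come from the description, while the printed mnemonic format and helper names are hypothetical (the document does not show the full printed instruction string):

#include <map>
#include <string>
#include <vector>

// Fixed type encoding from the description: float=1, char=2, half=3.
const std::map<std::string, int> kTypeCode = {{"float", 1}, {"char", 2}, {"half", 3}};

std::string typeName(int code) {
    for (const auto& [name, c] : kTypeCode)
        if (c == code) return name;
    return "unknown";
}

// Print only the meaningful fields of the fuse(...) template, skipping the
// default values InvalidOp / InvalidSrc; the output format here is illustrative.
std::string printFused(const std::vector<std::string>& ops,
                       const std::vector<std::string>& srcs, int typeCode) {
    std::string out;
    for (const auto& op : ops)
        if (op != "InvalidOp") out += (out.empty() ? "" : ".") + op;
    out += "." + typeName(typeCode) + "(";
    bool first = true;
    for (const auto& s : srcs)
        if (s != "InvalidSrc") { out += (first ? "" : ", ") + s; first = false; }
    return out + ")";
}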
Because the target fused instruction codes are generated from the general fused instruction code template in a unified format, the scheme applies to different instruction combinations, to any data type, and to any unary or binary operation, and the operands may be matrices, vectors, or scalars. This gives good compatibility, avoids an excessively large number of IRs to implement, is easy to extend, is convenient for code replacement, and is easy for the compiler to recognize and compile.
On the basis of any of the above embodiments, the program code to be processed includes at least one basic block, each containing multiple instructions with no branches (e.g., if, else) inside the block. The scope of instruction fusion may therefore be one basic block, or several adjacent basic blocks that are single-entry and single-exit (forming a single-entry, single-exit region with no branches); that is, instructions to be fused that can be fused with one another must come from one basic block or from a single-entry, single-exit region. Accordingly, when the instruction chain to be fused is obtained from the program code to be processed, it may be obtained from multiple instructions within the same basic block, or from multiple instructions within several adjacent basic blocks that are single-entry and single-exit.
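A small C++ sketch of this scoping rule over a hypothetical control-flow-graph model (adjacency is simplified to layout order; all names are assumptions, not the patent's implementation):

#include <vector>

struct BasicBlock {
    int numPreds = 0;   // predecessor count in the control-flow graph
    int numSuccs = 0;   // successor count
};

// Two adjacent blocks may share a fusion scope only if their junction is
// single-entry / single-exit, so no branch enters or leaves in between.
bool canExtendScope(const BasicBlock& cur, const BasicBlock& next) {
    return cur.numSuccs == 1 && next.numPreds == 1;
}

// Group consecutive blocks into fusion scopes; the chain search runs per scope.
std::vector<std::vector<const BasicBlock*>> buildScopes(const std::vector<BasicBlock>& blocks) {
    std::vector<std::vector<const BasicBlock*>> scopes;
    for (size_t i = 0; i < blocks.size(); ++i) {
        if (scopes.empty() || !canExtendScope(*scopes.back().back(), blocks[i]))
            scopes.push_back({});
        scopes.back().push_back(&blocks[i]);
    }
    return scopes;
}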
Fig. 5 is a schematic structural diagram of an instruction fusion device according to an embodiment of the present application, as shown in fig. 5, where the instruction fusion device provided in this embodiment may be a compiler or other electronic devices with a compiling function, and the instruction fusion device 50 provided in this embodiment includes: an acquisition unit 51, a generation unit 52, and a replacement unit 53.
The acquiring unit 51 is configured to acquire a to-be-fused instruction chain according to a to-be-processed program code, where the to-be-fused instruction chain includes at least two to-be-fused instructions;
the generating unit 52 is configured to generate a target fusion instruction code according to the to-be-fused instruction included in the to-be-fused instruction chain and a preset general fusion instruction code template;
And a replacing unit 53, configured to replace the code of the instruction to be fused with the target fusion instruction code.
In one or more embodiments of the present application, the acquiring unit 51 is further configured to, after acquiring the instruction chain to be fused:
judging whether aliases exist in a storage space corresponding to an output address of each instruction to be fused in the instruction chain to be fused;
If the storage space corresponding to the output address of any instruction to be fused is determined to have an alias, or if it cannot be determined whether an alias exists, disconnecting the instruction chain to be fused after that instruction to be fused, and taking that instruction to be fused and the previous part of the instruction chain to be fused as an instruction chain to be fused.
In one or more embodiments of the present application, when the obtaining unit 51 determines whether the storage space corresponding to the output address of each to-be-fused instruction in the to-be-fused instruction chain has an alias, the obtaining unit is configured to:
Judging whether each instruction to be fused in the instruction chain to be fused is marked with a first identifier in advance, wherein the first identifier is used for indicating that a storage space corresponding to an output address of the marked instruction does not have an alias;
if any instruction to be fused is marked with a first identifier in advance, determining that the storage space corresponding to the output address of the instruction to be fused has no alias; or alternatively
If any instruction to be fused is not marked with the first identifier in advance, the compiler performs alias analysis to judge whether the storage space corresponding to the output address of the instruction to be fused has an alias or not.
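A minimal sketch of this alias check follows, under assumed data structures (FusedCandidate, markedNoAlias and aliasAnalysis are illustrative, not the patented code): each instruction either carries the pre-marked first identifier or is handed to the compiler's alias analysis, and the chain is broken after the first instruction whose output address may alias or cannot be proven alias-free.

#include <cstddef>
#include <vector>

enum class AliasState { NoAlias, MayAlias, Unknown };

struct FusedCandidate {
    bool markedNoAlias = false;        // the pre-marked "first identifier"
    AliasState aliasAnalysis() const { // stand-in for the compiler's alias analysis
        return AliasState::Unknown;    // conservative default used by this sketch
    }
};

// Returns how many instructions of the chain are kept: the chain is disconnected after the
// first instruction whose output storage may alias, and that instruction plus the part
// before it form the instruction chain to be fused.
std::size_t usablePrefix(const std::vector<FusedCandidate>& chain) {
    for (std::size_t i = 0; i < chain.size(); ++i) {
        bool noAlias = chain[i].markedNoAlias ||
                       chain[i].aliasAnalysis() == AliasState::NoAlias;
        if (!noAlias)
            return i + 1;              // keep this instruction and everything before it
    }
    return chain.size();               // no alias found: keep the whole chain
}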
In one or more embodiments of the present application, the acquiring unit 51 is further configured to, after acquiring the instruction chain to be fused:
judging whether the instruction chain to be fused needs to be cut off or not according to the target hardware information;
If the instruction chain to be fused needs to be cut off, determining a cutting position, and cutting off the instruction chain to be fused from the cutting position.
In one or more embodiments of the present application, the acquiring unit 51 is configured to, when determining whether the instruction chain to be fused needs to be cut according to the target hardware information:
Judging whether the number of the to-be-fused instructions included in the to-be-fused instruction chain exceeds a fused instruction number threshold corresponding to target hardware; if yes, determining that the instruction chain to be fused needs to be cut off according to the fusion instruction quantity threshold; and/or
Judging whether the to-be-fused instruction included in the to-be-fused instruction chain comprises a specific instruction related to target hardware or not; if so, determining that the instruction chain to be fused needs to be cut off according to the specific instruction; the specific instruction can only be used as the head or tail of the to-be-fused instruction chain.
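The two hardware-driven truncation rules above can be sketched as follows; HwInfo, maxFused and specificOps are illustrative assumptions, and cutting before a "specific" instruction found in the middle of a chain (so that it can head a new chain) is only one possible policy consistent with the head-or-tail restriction.

#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

struct HwInfo {
    std::size_t maxFused;                         // fusion instruction number threshold
    std::unordered_set<std::string> specificOps;  // instructions allowed only at head or tail
};

// Returns how many instructions of the chain are kept; chainOps.size() means no cut is needed.
std::size_t cutPosition(const std::vector<std::string>& chainOps, const HwInfo& hw) {
    std::size_t cut = chainOps.size();
    if (chainOps.size() > hw.maxFused)
        cut = hw.maxFused;                        // rule 1: respect the hardware threshold
    for (std::size_t i = 1; i + 1 < chainOps.size() && i < cut; ++i)
        if (hw.specificOps.count(chainOps[i]))    // rule 2: specific op found in the middle
            cut = i;                              // cut before it so it can head the next chain
    return cut;
}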
In one or more embodiments of the present application, the generating unit 52 is configured to, when generating the target fusion instruction code according to the to-be-fused instruction included in the to-be-fused instruction chain and a preset general fusion instruction code template:
acquiring instruction information of each instruction to be fused included in the instruction chain to be fused;
adding instruction information of each instruction to be fused into the universal fused instruction code template to generate the target fused instruction code;
the instruction information of the instruction to be fused includes: the operation code and operands of the instruction to be fused, and the data type, the operand length and the output address of the last instruction to be fused.
In one or more embodiments of the application, the generic fused instruction code template includes: fusing instruction identification and variable items;
The fusion instruction identifier is used for indicating that the code is a fusion instruction code, and the variable item comprises a variable item corresponding to each item of instruction information of each instruction to be fused.
In one or more embodiments of the present application, the generating unit 52 is configured to, when adding the instruction information of each instruction to be fused to the universal fused instruction code template to generate the target fused instruction code:
Respectively adding the data type, operand length and output address of the last to-be-fused instruction to the corresponding variable item position of the universal fused instruction code template;
according to the sequence of each instruction to be fused in the instruction chain to be fused, the operation codes of the instructions to be fused are sequentially arranged and added to the position of the variable item of the operation code of the universal fusion instruction code template;
According to the sequence of each instruction to be fused in the instruction chain to be fused, sequentially arranging operands of each instruction to be fused, and adding the operands to the position of an operand variable item of the universal fused instruction code template;
If the operation code and/or the operand of any one of the to-be-fused instructions is missing, the corresponding operation code and/or operand position of the universal fused instruction code template is set to a default value (for example, InvalidOp or InvalidSrc in the templates above).
In one or more embodiments of the present application, the acquiring unit 51 is configured to, when acquiring the instruction chain to be fused according to the program code to be processed:
Determining a plurality of candidate instructions in the to-be-processed program code, which are supported by target hardware to be fused;
and constructing an instruction chain to be fused according to the plurality of candidate instructions.
In one or more embodiments of the present application, the acquiring unit 51 is configured to, when constructing an instruction chain to be fused according to the plurality of candidate instructions:
aiming at any candidate instruction which is not marked with a second identifier in the plurality of candidate instructions, searching the longest dependent chain taking the candidate instruction as a head node in a depth-first traversal mode, determining the dependent chain as an instruction chain to be fused, and marking the candidate instruction on the dependent chain with the second identifier; wherein at least two candidate instructions having a data dependency are included in the dependency chain.
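A compact sketch of this chain construction is given below, written for clarity rather than efficiency and with assumed data structures (Candidate, consumers, visited): starting from every candidate not yet marked, the longest producer-to-consumer chain is found by depth-first traversal over the (acyclic) dependency edges, and the candidates on the chosen chain are then marked with the second identifier.

#include <vector>

struct Candidate {
    std::vector<int> consumers;  // indices of candidates that read this candidate's output
    bool visited = false;        // the "second identifier"
};

// Longest dependency chain starting at node i, skipping already-marked candidates.
std::vector<int> longestChainFrom(int i, const std::vector<Candidate>& g) {
    std::vector<int> best;
    for (int next : g[i].consumers) {
        if (g[next].visited) continue;
        std::vector<int> tail = longestChainFrom(next, g);
        if (tail.size() > best.size()) best = tail;
    }
    best.insert(best.begin(), i);
    return best;
}

// Builds instruction chains to be fused; a chain needs at least two data-dependent candidates.
std::vector<std::vector<int>> buildChains(std::vector<Candidate>& g) {
    std::vector<std::vector<int>> chains;
    for (int i = 0; i < static_cast<int>(g.size()); ++i) {
        if (g[i].visited) continue;
        std::vector<int> chain = longestChainFrom(i, g);
        for (int n : chain) g[n].visited = true;   // mark the second identifier on the chain
        if (chain.size() >= 2) chains.push_back(chain);
    }
    return chains;
}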
In one or more embodiments of the present application, the output result of any intermediate candidate instruction in the dependency chain, other than the last candidate instruction, is read by at most one subsequent candidate instruction, and the output result of the intermediate candidate instruction is the output address of the intermediate candidate instruction.
In one or more embodiments of the present application, when the output address of any intermediate candidate instruction in the dependency chain is read by two or more candidate instructions subsequent to it, the acquiring unit 51 is further configured to, when constructing the instruction chain to be fused:
if the output address of the intermediate candidate instruction in the dependency chain is read by more than two candidate instructions in the dependency chain, determining the life cycle of the data stored in the output address of the intermediate candidate instruction;
if only one candidate instruction of the two or more subsequent candidate instructions of the intermediate candidate instruction reads the data stored in the output address of the intermediate candidate instruction in the life cycle of the data stored in the output address of the intermediate candidate instruction, determining that the intermediate candidate instruction can be fused with the candidate instruction of which the subsequent candidate instruction reads the data stored in the output address of the intermediate candidate instruction.
In one or more embodiments of the present application, the acquiring unit 51 is further configured to, after determining that the intermediate candidate instruction can be fused with the candidate instruction that subsequently reads the data stored in the output address of the intermediate candidate instruction:
renaming operands in the intermediate candidate instruction and more than two candidate instructions subsequent to the intermediate candidate instruction.
In one or more embodiments of the application, the program code to be processed includes at least one basic block, each of the basic blocks including a plurality of instructions therein; the acquiring unit 51 is configured to, when acquiring the instruction chain to be fused according to the program code to be processed:
Acquiring the instruction chain to be fused according to a plurality of instructions included in the same basic block; or alternatively
And acquiring the instruction chain to be fused according to a plurality of instructions included in a plurality of adjacent basic blocks, wherein the plurality of adjacent basic blocks are single-in and single-out basic blocks.
In one or more embodiments of the present application, the acquiring unit 51 is configured to, when determining the plurality of candidate instructions in the to-be-processed program code that are supported by the target hardware to be fused:
Acquiring an instruction set of the target hardware support fusion according to the target hardware information;
And matching each instruction in the program code to be processed with the instruction set respectively, and determining the instruction contained in the instruction set in the program code to be processed as the candidate instruction.
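A minimal sketch of this matching step (names such as selectCandidates and fusibleSet are assumptions): every instruction of the program code to be processed is looked up in the set of operations that the target hardware can fuse, and the matching ones become candidate instructions.

#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::size_t> selectCandidates(const std::vector<std::string>& opcodes,
                                          const std::unordered_set<std::string>& fusibleSet) {
    std::vector<std::size_t> candidates;
    for (std::size_t i = 0; i < opcodes.size(); ++i)
        if (fusibleSet.count(opcodes[i]))
            candidates.push_back(i);   // instruction i is supported for fusion by the target hardware
    return candidates;
}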
The instruction fusion device provided in this embodiment may execute the technical solution of the foregoing method embodiment, and its implementation principle and technical effects are similar, and are not described herein again.
Fig. 6 is a schematic structural diagram of an instruction fusion device according to another embodiment of the present application, and as shown in fig. 6, an instruction fusion device 60 according to an embodiment of the present application includes: at least one processor 61 and a memory 62;
Memory 62 stores computer-executable instructions;
At least one processor 61 executes computer-executable instructions stored in memory 62 such that the at least one processor performs the instruction fusion method provided in any one of the embodiments described above.
In a possible implementation manner, a computer readable storage medium is also disclosed, where a computer program is stored, and when the computer program is executed by at least one processor, the instruction fusion method provided in any one of the foregoing embodiments is implemented.
In one possible implementation manner, the above-mentioned stream-oriented computing hardware processor may be a processor structure shown in fig. 9 or 13, and further the processor may be integrated in a board, where the stream-oriented computing hardware processor may be an IPU or a GPU, and the application is not limited thereto.
In one possible implementation, a board card, which may be a device-side board card, is also disclosed. Fig. 7 shows a schematic structural diagram of a board card 70 according to an embodiment of the application. As shown in fig. 7, the board card 70 includes a chip 701, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet intelligent processing requirements in complex fields such as computer vision, speech, natural language processing and data mining. In particular, deep learning technology is widely applied in the cloud intelligence field, and one notable characteristic of cloud intelligence applications is the large volume of input data, which places high requirements on the storage and computing capability of the platform; the board card 70 of this embodiment is therefore suitable for cloud intelligence applications, with large off-chip storage, large on-chip storage and strong computing capability.
The chip 701 is connected to an external device 703 through an external interface device 702. The external device 703 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 703 to the chip 701 through the external interface means 702. The calculation result of the chip 701 may be transmitted back to the external device 703 via the external interface means 702. The external interface device 702 may have different interface forms, such as a PCIe interface, etc., according to different application scenarios.
The board card 70 also includes a memory device 704 for storing data, which includes one or more memory cells 705. The memory device 704 is connected to the control device 706 and the chip 701 through a bus and transmits data. The control device 706 in the board card 70 is configured to regulate the state of the chip 701. To this end, in one application scenario, the control device 706 may include a single chip microcomputer (Micro Controller Unit, MCU).
In one possible implementation, a combination processing apparatus is also provided, and fig. 8 is a block diagram showing the combination processing apparatus in the chip 701 of this embodiment. As shown in fig. 8, the combination processing device 80 includes a computing device 801, an interface device 802, a processing device 803, and a storage device 804.
The computing device 801 is configured to perform user-specified operations, primarily implemented as a single-core smart processor or as a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 803 through the interface device 802 to collectively accomplish the user-specified operations.
The interface means 802 is used for transmitting data and control instructions between the computing means 801 and the processing means 803. For example, the computing device 801 may obtain input data from the processing device 803 via the interface device 802, write to a storage device on the computing device 801 chip. Further, the computing device 801 may obtain control instructions from the processing device 803 via the interface device 802, and write the control instructions into a control cache on the computing device 801 chip. Alternatively or in addition, the interface device 802 may also read data in a memory device of the computing device 801 and transmit it to the processing device 803.
The processing device 803, as a general-purpose processing device, performs basic control including, but not limited to, data handling and starting and/or stopping of the computing device 801. Depending on the implementation, the processing device 803 may be one or more types of processors, including but not limited to a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or other general-purpose and/or special-purpose processors, and their number may be determined according to actual needs. As mentioned above, the computing device 801 of the present application may be considered to have a single-core structure or a homogeneous multi-core structure. However, when the computing device 801 and the processing device 803 are considered together, they may be regarded as forming a heterogeneous multi-core structure.
The storage device 804 is configured to store data to be processed. It may be a DRAM 804, typically DDR memory with a size of 16G or more, and is used to store data of the computing device 801 and/or the processing device 803.
Fig. 9 shows a schematic diagram of the internal architecture of a computing device 801 as a single core. The single-core computing device 901 is configured to process input data such as computer vision, voice, natural language, data mining, etc., and the single-core computing device 901 includes three modules: a control module 91, an operation module 92 and a storage module 93.
The control module 91 is used to coordinate and control the operation of the operation module 92 and the storage module 93 to complete deep learning tasks, and includes an instruction fetch unit (instruction fetch unit, IFU) 911 and an instruction decode unit (instruction decode unit, IDU) 912. The instruction fetch unit 911 is configured to fetch an instruction from the processing device 803, and the instruction decode unit 912 decodes the fetched instruction and sends the decoded result to the operation module 92 and the storage module 93 as control information.
The operation module 92 includes a vector operation unit 921 and a matrix operation unit 922. The vector operation unit 921 is used for executing vector operation and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 922 is responsible for the core computation of the deep learning algorithm, i.e. matrix multiplication and convolution.
The storage module 93 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 931, a parameter storage unit (weight RAM, WRAM) 932, and a direct memory access module (direct memory access, DMA) 933. NRAM 931 is used to store input neurons, output neurons, and intermediate computation results; WRAM 932 is configured to store the convolution kernels, i.e., weights, of the deep learning network; DMA 933 is coupled to the DRAM 804 via bus 94 and is responsible for data transfer between the single-core computing device 901 and the DRAM 804.
Fig. 10 illustrates a schematic internal architecture of a computing device 801 that is multi-core. The multi-core computing device 1001 adopts a hierarchical design: it is a system on chip (SoC) that includes at least one cluster, and each cluster in turn includes a plurality of processor cores. In other words, the multi-core computing device 1001 is organized in a hierarchy of system on chip, cluster and processor core.
At the level of the system-on-chip, as shown in fig. 10, the multi-core computing device 1001 includes an external storage controller 1001, a peripheral communication module 1002, an on-chip interconnect module 1003, a synchronization module 1004, and a plurality of clusters 1005.
There may be a plurality of external memory controllers 1001, 2 being shown by way of example, for accessing external memory devices, such as DRAM 804 in FIG. 8, to read data from or write data to the off-chip in response to an access request issued by the processor core. The peripheral communication module 1002 is configured to receive a control signal from the processing device 803 via the interface device 802, and activate the computing device 801 to perform a task. The on-chip interconnect module 1003 connects the external memory controller 1001, the peripheral communication module 1002, and the plurality of clusters 1005 for transmitting data and control signals between the respective modules. The synchronization module 1004 is a global synchronization barrier controller (global barrier controller, GBC) for coordinating the working progress of each cluster to ensure synchronization of information. The plurality of clusters 1005 are computing cores of the multi-core computing device 1001, 4 being illustratively shown in the figure, the 4 clusters 1005 forming 4 quadrants as in fig. 1. As hardware progresses, the multi-core computing device 1001 of the present application may also include 8, 16, 64, or even more clusters 1005. Cluster 1005 is used to efficiently perform the deep learning algorithm.
At the cluster level, as shown in FIG. 10, each cluster 1005 includes a plurality of processor cores (IPU cores) 1006 and a memory core (MEM core) 1007. Illustratively, each cluster 1005 includes 4 processor cores and 1 memory, which may be DRAM804. Each processor core corresponds to one of the arithmetic units in fig. 1, and each memory corresponds to one of the memory units in fig. 1.
The processor cores 1006 are illustratively shown as 4 in the figure, but the present application does not limit the number of processor cores 1006. The internal architecture of a processor core is shown in fig. 11. Each processor core 1006 is similar to the single-core computing device 901 of fig. 9 and likewise comprises three major modules: a control module 1101, an operation module 1102 and a storage module 1103. The functions and structures of the control module 1101, the operation module 1102 and the storage module 1103 are substantially the same as those of the control module 91, the operation module 92 and the storage module 93: the control module 1101 includes an instruction fetch unit 11011 and an instruction decode unit 11012, and the operation module 1102 includes a vector operation unit 11021 and a matrix operation unit 11022, which are not described again. It should be noted that the storage module 1103 includes an input/output direct memory access module (input/output direct memory access, IODMA) 11033 and a move direct memory access module (move direct memory access, MVDMA) 11034. IODMA 11033 controls access between NRAM 11031/WRAM 11032 and DRAM 804 over the broadcast bus 1009; MVDMA 11034 is used to control access between NRAM 11031/WRAM 11032 and the memory cell (SRAM) 1008.
Returning to FIG. 8, the memory core 1007 is primarily used for storage and communication, i.e., for storing shared data or intermediate results between the processor cores 1006, and for performing communication between the clusters 1005 and the DRAM 804, between the clusters 1005, between the processor cores 1006, and so on. In other embodiments, the memory core 1007 has scalar operation capability and can perform scalar operations.
The memory core 1007 includes SRAM 1008, a broadcast bus 1009, a cluster direct memory access module (cluster direct memory access, CDMA) 1010, and a global direct memory access module (global direct memory access, GDMA) 1011. The SRAM 1008 serves as a high-performance data transfer station: data multiplexed between different processor cores 1006 in the same cluster 1005 does not need to be fetched from the DRAM 804 by each processor core 1006 individually, but is relayed between the processor cores 1006 through the SRAM 1008. The memory core 1007 only needs to distribute the multiplexed data quickly from the SRAM 1008 to the plurality of processor cores 1006, which improves inter-core communication efficiency and greatly reduces on-chip and off-chip input/output accesses.
The broadcast bus 1009, CDMA 1010 and GDMA 1011 are used to perform communication between the processor cores 1006, communication between the clusters 1005, and data transfer between the clusters 1005 and the DRAM 804, respectively, as described below.
The broadcast bus 1009 is used to perform high-speed communication between the processor cores 1006 in the cluster 1005. The broadcast bus 1009 of this embodiment supports inter-core communication modes including unicast, multicast, and broadcast. Unicast is a communication mode that refers to the transfer of data from point to point (e.g., single processor core to single processor core), multicast is the transfer of a piece of data from SRAM 1008 to a specific number of processor cores 1006, and broadcast is the transfer of a piece of data from SRAM 1008 to all processor cores 1006, a special case of multicast.
CDMA 1010 is used to control access to SRAM 1008 between different clusters 1005 within the same computing device 801.
GDMA 1011 cooperates with the external memory controller 1001 to control access from the SRAM 1008 of the cluster 1005 to the DRAM 804, or to read data from the DRAM 804 into the SRAM 1008. From the foregoing, it can be appreciated that communication between the DRAM 804 and the NRAM 11031 or WRAM 11032 may be achieved via two channels. The first channel is to connect the DRAM 804 directly with the NRAM 11031 or WRAM 11032 through IODMA 11033; the second channel is to transfer data between the DRAM 804 and the SRAM 1008 via GDMA 1011, and then transfer data between the SRAM 1008 and the NRAM 11031 or WRAM 11032 via MVDMA 11034. Although the second channel seemingly requires more elements to participate and the data path is longer, in practice, in some embodiments, the bandwidth of the second channel is much greater than that of the first channel, so communication between the DRAM 804 and the NRAM 11031 or WRAM 11032 may be more efficient through the second channel. Embodiments of the present application may select a data transmission channel according to the hardware conditions.
In other embodiments, the functions of GDMA 1011 and IODMA 11033 may be integrated in the same component. Although GDMA 1011 and IODMA 11033 are regarded as different components for convenience of description, implementations whose functions and technical effects are similar to those of the present application will occur to those skilled in the art and fall within the scope of protection of the present application. Further, the functions of GDMA 1011, IODMA 11033, CDMA 1010 and MVDMA 11034 may also be implemented by the same component.
The foregoing may be better understood in view of the following clauses:
Clause 1, a method of instruction fusion, comprising:
Acquiring an instruction chain to be fused according to a program code to be processed, wherein the instruction chain to be fused comprises at least two instructions to be fused;
generating a target fusion instruction code according to an instruction to be fused included in the instruction chain to be fused and a preset general fusion instruction code template;
And replacing the code of the to-be-fused instruction with the target fusion instruction code.
Clause 2, the method according to clause 1, after the obtaining the instruction chain to be fused, further includes:
judging whether aliases exist in a storage space corresponding to an output address of each instruction to be fused in the instruction chain to be fused;
If the storage space corresponding to the output address of any instruction to be fused is determined to have an alias, or if it cannot be determined whether an alias exists, disconnecting the instruction chain to be fused after that instruction to be fused, and taking that instruction to be fused and the previous part of the instruction chain to be fused as an instruction chain to be fused.
Clause 3, the method according to clause 2, wherein the determining whether the storage space corresponding to the output address of each to-be-fused instruction in the to-be-fused instruction chain has an alias, includes:
Judging whether each instruction to be fused in the instruction chain to be fused is marked with a first identifier in advance, wherein the first identifier is used for indicating that a storage space corresponding to an output address of the marked instruction does not have an alias;
if any instruction to be fused is marked with a first identifier in advance, determining that the storage space corresponding to the output address of the instruction to be fused has no alias; or alternatively
If any instruction to be fused is not marked with the first identifier in advance, the compiler performs alias analysis to judge whether the storage space corresponding to the output address of the instruction to be fused has an alias or not.
Clause 4, the method according to any one of clauses 1-3, wherein after the obtaining of the instruction chain to be fused, the method further comprises:
judging whether the instruction chain to be fused needs to be cut off or not according to the target hardware information;
If the instruction chain to be fused needs to be cut off, determining a cutting position, and cutting off the instruction chain to be fused from the cutting position.
Clause 5, the method according to clause 4, wherein the determining whether the instruction chain to be fused needs to be cut off according to the target hardware information includes:
Judging whether the number of the to-be-fused instructions included in the to-be-fused instruction chain exceeds a fused instruction number threshold corresponding to target hardware; if yes, determining that the instruction chain to be fused needs to be cut off according to the fusion instruction quantity threshold; and/or
Judging whether the to-be-fused instruction included in the to-be-fused instruction chain comprises a specific instruction related to target hardware or not; if so, determining that the instruction chain to be fused needs to be cut off according to the specific instruction; the specific instruction can only be used as the head or tail of the to-be-fused instruction chain.
Clause 6, the method according to clause 1, wherein the generating the target fusion instruction code according to the to-be-fused instruction included in the to-be-fused instruction chain and the preset general fusion instruction code template includes:
acquiring instruction information of each instruction to be fused included in the instruction chain to be fused;
adding instruction information of each instruction to be fused into the universal fused instruction code template to generate the target fused instruction code;
the instruction information of the instruction to be fused includes: the operation code and operands of the instruction to be fused, and the data type, the operand length and the output address of the last instruction to be fused.
Clause 7, the method of clause 6, the generic fusion instruction code template comprising: fusing instruction identification and variable items;
The fusion instruction identifier is used for indicating that the code is a fusion instruction code, and the variable item comprises a variable item corresponding to each item of instruction information of each instruction to be fused.
Clause 8, the method according to clause 7, wherein adding the instruction information of each to-be-fused instruction to the universal fused instruction code template, generating the target fused instruction code, includes:
Respectively adding the data type, operand length and output address of the last to-be-fused instruction to the corresponding variable item position of the universal fused instruction code template;
according to the sequence of each instruction to be fused in the instruction chain to be fused, the operation codes of the instructions to be fused are sequentially arranged and added to the position of the variable item of the operation code of the universal fusion instruction code template;
According to the sequence of each instruction to be fused in the instruction chain to be fused, sequentially arranging operands of each instruction to be fused, and adding the operands to the position of an operand variable item of the universal fused instruction code template;
If the operation code and/or the operand of any one of the to-be-fused instructions is missing, the operation code and/or the operand of the to-be-fused instruction is set to a default value.
Clause 9, the method according to clause 1, wherein the obtaining the instruction chain to be fused according to the program code to be processed includes:
Determining a plurality of candidate instructions in the to-be-processed program code, which are supported by target hardware to be fused;
and constructing an instruction chain to be fused according to the plurality of candidate instructions.
Clause 10, the method of clause 9, the constructing an instruction chain to be fused according to the plurality of candidate instructions, comprising:
aiming at any candidate instruction which is not marked with a second identifier in the plurality of candidate instructions, searching the longest dependent chain taking the candidate instruction as a head node in a depth-first traversal mode, determining the dependent chain as an instruction chain to be fused, and marking the candidate instruction on the dependent chain with the second identifier; wherein at least two candidate instructions having a data dependency are included in the dependency chain.
Clause 11, the method of clause 10, wherein the output result of any intermediate candidate instruction in the dependency chain, other than the last candidate instruction, is read by at most one subsequent candidate instruction, and the output result of the intermediate candidate instruction is the output address of the intermediate candidate instruction.
Clause 12, the method of clause 10, wherein the output address of any intermediate candidate instruction in the dependency chain is read by two or more candidate instructions subsequent thereto, the building the chain of instructions to be fused further comprising:
if the output address of the intermediate candidate instruction in the dependency chain is read by more than two candidate instructions in the dependency chain, determining the life cycle of the data stored in the output address of the intermediate candidate instruction;
if only one candidate instruction of the two or more subsequent candidate instructions of the intermediate candidate instruction reads the data stored in the output address of the intermediate candidate instruction in the life cycle of the data stored in the output address of the intermediate candidate instruction, determining that the intermediate candidate instruction can be fused with the candidate instruction of which the subsequent candidate instruction reads the data stored in the output address of the intermediate candidate instruction.
Clause 13, the method of clause 12, after the determining that the intermediate candidate instruction can be fused with the candidate instruction that subsequently reads the data stored in the output address of the intermediate candidate instruction, further comprising:
renaming operands in the intermediate candidate instruction and more than two candidate instructions subsequent to the intermediate candidate instruction.
Clause 14, the method of clause 1, wherein the program code to be processed comprises at least one basic block, each basic block comprising a plurality of instructions therein; the obtaining the to-be-fused instruction chain according to the to-be-processed program code comprises the following steps:
Acquiring the instruction chain to be fused according to a plurality of instructions included in the same basic block; or alternatively
And acquiring the instruction chain to be fused according to a plurality of instructions included in a plurality of adjacent basic blocks, wherein the plurality of adjacent basic blocks are single-in and single-out basic blocks.
Clause 15, the method of clause 9, the determining the plurality of candidate instructions in the pending program code that are fused supported by target hardware, comprising:
Acquiring an instruction set of the target hardware support fusion according to the target hardware information;
And matching each instruction in the program code to be processed with the instruction set respectively, and determining the instruction contained in the instruction set in the program code to be processed as the candidate instruction.
Clause 16, an instruction fusion apparatus, comprising:
The device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring to-be-processed program codes, at least one instruction in the to-be-processed program codes is marked with compiling instruction information in advance, and the compiling instruction information is used for indicating whether the marked instructions can be fused;
the determining module is used for determining the instruction to be fused according to the read-write relation of the instruction in the program code to be processed and the compiling instruction information;
and the generating module is used for replacing at least two instructions to be fused in the program codes to be processed with target fusion instructions to generate fused target program codes.
Clause 17, an instruction fusion apparatus, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
The at least one processor executing computer-executable instructions stored in the memory causes the at least one processor to perform the method of any one of clauses 1-15.
Clause 18, a computer readable storage medium having stored therein a computer program which, when executed by at least one processor, implements the method of any of clauses 1-15.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are alternative embodiments, and that the acts and modules referred to are not necessarily required for the present application.
It should be further noted that, although the steps in the flowchart are sequentially shown as indicated by arrows, the steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least a portion of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order in which the sub-steps or stages are performed is not necessarily sequential, and may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps or other steps.
It will be appreciated that the device embodiments described above are merely illustrative and that the device of the application may be implemented in other ways. For example, the division of the units/modules in the above embodiments is merely a logic function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted or not performed.
In addition, each functional unit/module in each embodiment of the present application may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together, unless otherwise specified. The integrated units/modules described above may be implemented either in hardware or in software program modules.
The integrated units/modules, if implemented in hardware, may be digital circuits, analog circuits, etc. Physical implementations of the hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the artificial intelligence processor may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, etc. Unless otherwise indicated, the storage element may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), etc.
The integrated units/modules may be stored in a computer-readable memory if implemented in the form of software program modules and sold or used as a stand-alone product. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned memory includes: a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program codes.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments. The technical features of the foregoing embodiments may be arbitrarily combined, and for brevity, all of the possible combinations of the technical features of the foregoing embodiments are not described, however, all of the combinations of the technical features should be considered as being within the scope of the disclosure.

Claims (18)

1. A method of instruction fusion, comprising:
Acquiring an instruction chain to be fused according to a program code to be processed, wherein the instruction chain to be fused comprises at least two instructions to be fused;
generating a target fusion instruction code according to an instruction to be fused included in the instruction chain to be fused and a preset general fusion instruction code template;
And replacing the code of the to-be-fused instruction with the target fusion instruction code.
2. The method of claim 1, wherein after the obtaining the instruction chain to be fused, further comprising:
judging whether aliases exist in a storage space corresponding to an output address of each instruction to be fused in the instruction chain to be fused;
If the storage space corresponding to the output address of any instruction to be fused is determined to have an alias, or if it cannot be determined whether an alias exists, disconnecting the instruction chain to be fused after that instruction to be fused, and taking that instruction to be fused and the previous part of the instruction chain to be fused as an instruction chain to be fused.
3. The method of claim 2, wherein the determining whether the alias exists in the storage space corresponding to the output address of each to-be-fused instruction in the to-be-fused instruction chain includes:
Judging whether each instruction to be fused in the instruction chain to be fused is marked with a first identifier in advance, wherein the first identifier is used for indicating that a storage space corresponding to an output address of the marked instruction does not have an alias;
if any instruction to be fused is marked with a first identifier in advance, determining that the storage space corresponding to the output address of the instruction to be fused has no alias; or alternatively
If any instruction to be fused is not marked with the first identifier in advance, the compiler performs alias analysis to judge whether the storage space corresponding to the output address of the instruction to be fused has an alias or not.
4. A method according to any one of claims 1-3, wherein after the obtaining of the instruction chain to be fused, the method further comprises:
judging whether the instruction chain to be fused needs to be cut off or not according to the target hardware information;
If the instruction chain to be fused needs to be cut off, determining a cutting position, and cutting off the instruction chain to be fused from the cutting position.
5. The method of claim 4, wherein the determining whether the chain of instructions to be fused needs to be cut according to the target hardware information comprises:
Judging whether the number of the to-be-fused instructions included in the to-be-fused instruction chain exceeds a fused instruction number threshold corresponding to target hardware; if yes, determining that the instruction chain to be fused needs to be cut off according to the fusion instruction quantity threshold; and/or
Judging whether the to-be-fused instruction included in the to-be-fused instruction chain comprises a specific instruction related to target hardware or not; if so, determining that the instruction chain to be fused needs to be cut off according to the specific instruction; the specific instruction can only be used as the head or tail of the to-be-fused instruction chain.
6. The method according to claim 1, wherein the generating a target fusion instruction code according to the to-be-fused instruction included in the to-be-fused instruction chain and a preset general fusion instruction code template includes:
acquiring instruction information of each instruction to be fused included in the instruction chain to be fused;
adding instruction information of each instruction to be fused into the universal fused instruction code template to generate the target fused instruction code;
The instruction information of the to-be-fused instruction comprises: the operation code, the data type, the operand length and the output address of each instruction to be fused.
7. The method of claim 6, wherein the generic fusion instruction code template comprises: fusing instruction identification and variable items;
The fusion instruction identifier is used for indicating that the code is a fusion instruction code, and the variable item comprises a variable item corresponding to each item of instruction information of each instruction to be fused.
8. The method of claim 7, wherein adding instruction information of each instruction to be fused to the universal fused instruction code template to generate the target fused instruction code, comprises:
Respectively adding the data type, operand length and output address of the last to-be-fused instruction to the corresponding variable item position of the universal fused instruction code template;
according to the sequence of each instruction to be fused in the instruction chain to be fused, the operation codes of the instructions to be fused are sequentially arranged and added to the position of the variable item of the operation code of the universal fusion instruction code template;
According to the sequence of each instruction to be fused in the instruction chain to be fused, sequentially arranging operands of each instruction to be fused, and adding the operands to the position of an operand variable item of the universal fused instruction code template;
If the operation code and/or the operand of any one of the to-be-fused instructions is missing, the operation code and/or the operand of the to-be-fused instruction is set to a default value.
9. The method according to any one of claims 1-8, wherein the obtaining a chain of instructions to be fused according to the program code to be processed comprises:
Determining a plurality of candidate instructions in the to-be-processed program code, which are supported by target hardware to be fused;
and constructing an instruction chain to be fused according to the plurality of candidate instructions.
10. The method of claim 9, wherein constructing an instruction chain to be fused from the plurality of candidate instructions comprises:
aiming at any candidate instruction which is not marked with a second identifier in the plurality of candidate instructions, searching the longest dependent chain taking the candidate instruction as a head node in a depth-first traversal mode, determining the dependent chain as an instruction chain to be fused, and marking the candidate instruction on the dependent chain with the second identifier; wherein at least two candidate instructions having a data dependency are included in the dependency chain.
11. The method of claim 10, wherein the output result of any one of the remaining intermediate candidate instructions in the dependency chain, except the last candidate instruction, is at most read by a subsequent one of the candidate instructions, the output result of the intermediate candidate instruction being the output address of the intermediate candidate instruction.
12. The method of claim 10, wherein the output address of any intermediate candidate instruction in the dependency chain is read by more than two candidate instructions subsequent thereto, the constructing the chain of instructions to be fused further comprising:
if the output address of the intermediate candidate instruction in the dependency chain is read by more than two candidate instructions in the dependency chain, determining the life cycle of the data stored in the output address of the intermediate candidate instruction;
if only one candidate instruction of the two or more subsequent candidate instructions of the intermediate candidate instruction reads the data stored in the output address of the intermediate candidate instruction in the life cycle of the data stored in the output address of the intermediate candidate instruction, determining that the intermediate candidate instruction can be fused with the candidate instruction of which the subsequent candidate instruction reads the data stored in the output address of the intermediate candidate instruction.
13. The method of claim 12, wherein after the determining that the intermediate candidate instruction can be fused with the candidate instruction that subsequently reads the data stored in the output address of the intermediate candidate instruction, the method further comprises:
renaming operands in the intermediate candidate instruction and more than two candidate instructions subsequent to the intermediate candidate instruction.
14. The method of claim 1, wherein the program code to be processed comprises at least one basic block, each basic block comprising a plurality of instructions therein; the obtaining the to-be-fused instruction chain according to the to-be-processed program code comprises the following steps:
Acquiring the instruction chain to be fused according to a plurality of instructions included in the same basic block; or alternatively
And acquiring the instruction chain to be fused according to a plurality of instructions included in a plurality of adjacent basic blocks, wherein the plurality of adjacent basic blocks are single-in and single-out basic blocks.
15. The method of claim 9, wherein the determining a plurality of candidate instructions in the pending program code that are fused by target hardware support comprises:
Acquiring an instruction set of the target hardware support fusion according to the target hardware information;
And matching each instruction in the program code to be processed with the instruction set respectively, and determining the instruction contained in the instruction set in the program code to be processed as the candidate instruction.
16. An instruction fusion apparatus, comprising:
The acquisition unit is used for acquiring an instruction chain to be fused according to the program code to be processed, wherein the instruction chain to be fused comprises at least two instructions to be fused;
The generating unit is used for generating a target fusion instruction code according to the to-be-fused instruction included in the to-be-fused instruction chain and a preset general fusion instruction code template;
and the replacing unit is used for replacing the code of the to-be-fused instruction with the target fusion instruction code.
17. An instruction fusion apparatus, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
The at least one processor executing computer-executable instructions stored in the memory cause the at least one processor to perform the method of any one of claims 1-15.
18. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by at least one processor, implements the method according to any of claims 1-15.
CN202211426895.6A 2022-11-15 2022-11-15 Instruction fusion method, device and storage medium Pending CN118051260A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211426895.6A CN118051260A (en) 2022-11-15 2022-11-15 Instruction fusion method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211426895.6A CN118051260A (en) 2022-11-15 2022-11-15 Instruction fusion method, device and storage medium

Publications (1)

Publication Number Publication Date
CN118051260A true CN118051260A (en) 2024-05-17

Family

ID=91045533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211426895.6A Pending CN118051260A (en) 2022-11-15 2022-11-15 Instruction fusion method, device and storage medium

Country Status (1)

Country Link
CN (1) CN118051260A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination