CN112257870B - Machine learning instruction conversion method and device, board card, main board and electronic equipment - Google Patents

Machine learning instruction conversion method and device, board card, main board and electronic equipment

Info

Publication number
CN112257870B
CN112257870B (application CN202011115613.1A)
Authority
CN
China
Prior art keywords
instruction
machine learning
learning instruction
instructions
basic block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011115613.1A
Other languages
Chinese (zh)
Other versions
CN112257870A (en)
Inventor
Name not published at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority to CN202011115613.1A priority Critical patent/CN112257870B/en
Publication of CN112257870A publication Critical patent/CN112257870A/en
Application granted granted Critical
Publication of CN112257870B publication Critical patent/CN112257870B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/40: Transformation of program code
    • G06F 8/41: Compilation
    • G06F 8/44: Encoding
    • G06F 8/443: Optimisation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The application relates to a machine learning instruction conversion method and device, a board card, a main board and electronic equipment. A machine learning instruction sequence is obtained and divided into at least one basic block, and the machine learning instructions in the basic block are then converted according to a peephole optimization algorithm to obtain converted machine learning instructions. Peephole optimization of machine learning instructions is thereby realized, the time cost of the machine learning instructions is reduced, and the overall performance of a machine learning computing device is greatly improved.

Description

Machine learning instruction conversion method and device, board card, main board and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for converting machine learning instructions, a board card, a motherboard, and an electronic device.
Background
In recent years, machine learning has developed rapidly, mainly because it meets the demand for extremely fast processing of huge volumes of data. Machine learning operations are powerful algorithms and have in recent years been applied in fields such as image and speech processing.
Peephole optimization is a highly local optimization technique: working on the generated code, the compiler exploits the characteristics of the CPU instructions and improves code performance either through conversion rules that are likely to bring performance gains or through overall analysis and instruction conversion.
However, the conventional technology lacks a solution for peephole optimization of machine learning instructions in a machine learning computing device, so how to realize peephole optimization of machine learning instructions is a problem to be solved by those skilled in the art.
Disclosure of Invention
Accordingly, it is necessary to provide a machine learning instruction conversion method and apparatus, a board card, a main board, and an electronic device to solve the above technical problem of how to perform peephole optimization on machine learning instructions.
A method of converting machine learning instructions, the method comprising:
acquiring a machine learning instruction sequence;
dividing the machine learning instruction sequence to obtain at least one basic block, wherein the basic block comprises at least one machine learning instruction;
and performing instruction conversion on the machine learning instructions in the basic block according to a peephole optimization algorithm to obtain converted machine learning instructions.
In one embodiment, the dividing the machine learning instruction sequence to obtain at least one basic block includes:
searching for a jump instruction in the machine learning instruction sequence;
and dividing the machine learning instruction sequence according to the jump instruction to obtain at least one basic block.
In one embodiment, the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions includes:
acquiring a first offset register instruction in the basic block;
searching for a second offset register instruction in the basic block according to the first offset register instruction, wherein the first offset register instruction and the second offset register instruction are used for offsetting the value in the same register;
and if no machine learning instruction that uses the value in the register exists between the first offset register instruction and the second offset register instruction, merging the first offset register instruction and the second offset register instruction to obtain a merged offset register instruction.
In one embodiment, the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions includes:
acquiring an advanceable machine learning instruction in the basic block;
moving the advanceable machine learning instruction forward, and determining, according to the position of the advanceable machine learning instruction, whether a logic error exists in the machine learning instruction sequence;
and if a logic error exists in the machine learning instruction sequence, stopping moving the advanceable machine learning instruction forward, and placing the advanceable machine learning instruction at any moved-forward position for which the machine learning instruction sequence has no logic error.
In one embodiment, the acquiring the advanceable machine learning instruction in the basic block includes:
matching the machine learning instructions in the basic block against preset advanceable machine learning instructions to obtain the advanceable machine learning instruction in the basic block.
In one embodiment, the moving the advanceable machine learning instruction forward includes:
determining whether moving the advanceable machine learning instruction forward would affect the execution of other machine learning instructions;
and if it is determined that moving the advanceable machine learning instruction forward does not affect the execution of other machine learning instructions, moving the advanceable machine learning instruction forward.
In one embodiment, the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions includes:
acquiring a store instruction in the basic block, wherein the store instruction is used for storing first data;
continuing to search, after the store instruction, for a load instruction for loading the first data;
and merging the store instruction and the load instruction to obtain a merged move instruction.
In one embodiment, the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions includes:
acquiring more than n consecutive first machine learning instructions, where n is the queue length of the instruction issue queue corresponding to the first machine learning instructions;
and moving forward a second machine learning instruction that is executed in parallel with the first machine learning instructions in the same time slice, and inserting it after the nth first machine learning instruction, where the number of inserted second machine learning instructions does not exceed the queue length of the instruction issue queue corresponding to the second machine learning instruction.
In one embodiment, the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions includes:
acquiring initial relative positions between instruction blocks containing synchronization instructions within a time slice, wherein each instruction block includes at least one machine learning instruction;
adjusting the positions of the instruction blocks relative to other machine learning instructions to obtain final relative positions between the instruction blocks containing synchronization instructions;
and if a final relative position is smaller than the corresponding initial relative position, determining the final position of the adjusted instruction block according to the final relative position.
In one embodiment, the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions includes:
if several multiplication operations or activation operations on the same piece of data occur in sequence within one time slice, fusing the multiplication operations or activation operations to obtain a fused machine learning instruction.
For example, when the scale is greater than 0, the scale can be multiplied into the slope of the activation and the instruction that multiplies by the scale removed; the result is the same.
A conversion apparatus of machine learning instructions, the apparatus comprising:
the instruction acquisition module is used for acquiring a machine learning instruction sequence;
the instruction dividing module is used for dividing the machine learning instruction sequence to obtain at least one basic block, wherein the basic block comprises at least one machine learning instruction;
and the instruction conversion module is used for performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions.
A board card, the board card comprising: a machine learning processor configured to perform the method of any one of the above embodiments.
A motherboard, the motherboard comprising: a general-purpose processor and the board card described in the above embodiment.
An electronic device comprising a motherboard as in the previous embodiments.
According to the machine learning instruction conversion method and apparatus, the board card, the motherboard, and the electronic equipment described above, a machine learning instruction sequence is obtained and divided into at least one basic block, and the machine learning instructions in the basic block are then converted according to the peephole optimization algorithm to obtain converted machine learning instructions, thereby realizing peephole optimization of machine learning instructions, reducing their time cost, and greatly improving the overall performance of the machine learning computing device.
Drawings
FIG. 1 is an application environment diagram of a method of converting machine learning instructions in one embodiment;
FIG. 2 is a flow diagram of a method of converting machine learning instructions in one embodiment;
FIG. 3 is a flow diagram of dividing a machine learning instruction sequence to obtain at least one basic block in one embodiment;
FIG. 4 is a flow diagram of realizing peephole optimization of machine learning instructions by merging offset register instructions in one embodiment;
FIG. 5 is a flow diagram of moving an advanceable machine learning instruction forward to realize peephole optimization of machine learning instructions in one embodiment;
FIG. 6 is a schematic diagram of data load, data compute, and data store operations forming a three-stage pipeline in one embodiment;
FIG. 7 is a flow chart of merging a store instruction and a load instruction corresponding to the same batch of data into a move instruction to realize peephole optimization of machine learning instructions in one embodiment;
FIG. 8 is a flow diagram of reordering the machine learning instructions of a machine learning instruction sequence to realize peephole optimization of machine learning instructions in one embodiment;
FIG. 9 is a flow diagram of reordering the machine learning instructions of a machine learning instruction sequence to realize peephole optimization of machine learning instructions in another embodiment;
FIG. 10 is a block diagram of a machine learning instruction conversion apparatus in one embodiment;
FIG. 11 is an internal structural diagram of an electronic device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The machine learning instruction conversion method can be applied in the application environment shown in FIG. 1, in which the processor 102 and the compiler 104 are each connected to the memory 106. The compiler 104 is configured to obtain a machine learning instruction sequence and perform instruction conversion on the instructions in the sequence using the peephole optimization technique, obtaining converted machine learning instructions. The processor 102 is configured to perform the corresponding machine learning operations according to the converted machine learning instructions obtained by the compiler 104. The machine learning instruction sequence and/or the converted machine learning instructions may be stored in the memory 106. Optimizing the machine learning instructions effectively reduces time cost.
Alternatively, the processor 102 may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. The memory 106 may be any medium capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
In one embodiment, the machine learning instruction conversion method can be applied to computing platforms such as central processing units and neural network accelerators. The method may be executed by a compiler, which is usually a software program running on a central processing unit, but may also be a software program or hardware circuit running in the central processing unit or the neural network accelerator. In this embodiment, the compiler converts the machine learning instructions on the central processor; the central processor then sends the converted machine learning instructions to the neural network accelerator, and after the neural network accelerator executes them, the operation result is returned to the central processor. Alternatively, the neural network accelerator may be a graphics processor, an embedded neural network processor, or a machine learning processing unit; this embodiment does not limit the specific type of the neural network accelerator. A specific implementation of the machine learning instruction conversion method is described in the following embodiments.
In one embodiment, as shown in FIG. 2, a method for converting machine learning instructions is provided. The method is described as applied to the compiler 104 in FIG. 1 and includes the following steps:
s202, acquiring a machine learning instruction sequence.
The machine learning instruction sequence is a linear queue formed by splicing the machine codes of a plurality of machine learning instructions in order, with bytes as the basic unit. Optionally, the sequence comprises a plurality of machine learning instructions. It should be appreciated that machine learning instructions include many vector instructions or stream instructions, each of which can process multiple values, and jump instructions are relatively rare.
In particular, the compiler may read a sequence of machine-learned instructions from memory.
S204, dividing the machine learning instruction sequence to obtain at least one basic block.
A basic block is a sequence of statements executed in program order that has exactly one entry and one exit: the entry is its first statement and the exit is its last. A basic block can only be entered through its entry and exited through its exit.
Wherein the basic block includes at least one machine learning instruction therein.
Specifically, the compiler divides the machine learning instruction sequence into at least one basic block according to a preset instruction division scheme. Alternatively, the preset scheme may divide the machine learning instruction sequence at if-else instructions, for-loop instructions, or other jump instructions. As noted above, since jump instructions are relatively rare, the divided basic blocks are generally large and the jump instructions impose few restrictions. Compared with the more scalar instructions of a CPU, the order of machine learning instructions can therefore be optimized at a coarser granularity, without considering dependencies between the data processed by a single vector instruction.
S206, performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions.
It should be noted that the peephole optimization algorithm optimizes machine learning instructions within a basic block, but not across different basic blocks.
Specifically, after obtaining the basic block, the compiler performs instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions, realizing local optimization of the machine learning instructions with the peephole optimization technique. Further, the compiler sends the converted machine learning instructions to the memory for storage, so that the processor can read them from the memory at any time and execute the operations corresponding to the instructions.
Optionally, peephole optimization includes instruction merging, instruction advancing, instruction reordering, and the like. For example, machine learning instructions to be merged are combined by instruction merging to obtain a merged machine learning instruction. It will be appreciated that merging is one way of implementing instruction conversion for machine learning instructions.
According to this machine learning instruction conversion method, the compiler obtains a machine learning instruction sequence, divides it to obtain at least one basic block, and then performs instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions, thereby realizing peephole optimization of machine learning instructions, reducing their time cost, and greatly improving the overall performance of the machine learning computing device.
In one embodiment, referring to FIG. 3, a possible implementation in which the compiler divides the machine learning instruction sequence into at least one basic block is described. On the basis of the above embodiment, S204 includes the following steps:
S212, searching for a jump instruction in the machine learning instruction sequence;
S214, dividing the machine learning instruction sequence according to the jump instruction to obtain at least one basic block.
A jump instruction is a machine learning instruction that implements a jump. Optionally, jump instructions include if-else instructions, for-loop instructions, and the like.
Specifically, after obtaining the machine learning instruction sequence, the compiler searches all machine learning instructions in the sequence to find all of its jump instructions. The compiler then divides the sequence according to the jump instructions and the machine learning instructions they jump to, obtaining at least one basic block. Optionally, jump instructions may be preset, and all jump instructions in the machine learning instruction sequence are then obtained by instruction matching.
Taking the if-else instruction as an example, suppose it represents if (expression) statement 1 else statement 2. Taking the machine learning instructions corresponding to if and else as boundaries, the machine learning instructions between if and else form one basic block, those between else and the end form another, and the machine learning instructions on the two sides each form a basic block, giving four basic blocks in total: the block before the if, the block between if and else, the block between else and the end, and the block after the end. The four basic blocks can be represented as follows:
Basic block 1
if (…) {
Basic block 2
} else {
Basic block 3
}
Basic block 4.
In the embodiment of the application, the compiler uses jump instructions as the division boundaries of basic blocks, so the machine learning instructions within each divided basic block cannot have logic problems, which in turn guarantees the correctness of executing the peephole-optimized machine learning instructions.
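By way of illustration, the following Python sketch shows one possible form of this division, assuming each machine learning instruction is encoded as a dict with an op field and that jump instructions carry a target index; the opcode names and fields are illustrative assumptions, not the actual instruction format.

```python
JUMP_OPS = {"jump", "if", "else", "for_begin", "for_end"}  # assumed opcodes

def split_basic_blocks(instructions):
    """Split an instruction sequence into basic blocks: a new block starts
    at index 0, after every jump instruction, and at every jump target."""
    leaders = {0}
    for i, inst in enumerate(instructions):
        if inst["op"] in JUMP_OPS:
            leaders.add(i + 1)               # the instruction after the jump
            if "target" in inst:
                leaders.add(inst["target"])  # the instruction jumped to
    starts = sorted(b for b in leaders if b < len(instructions))
    ends = starts[1:] + [len(instructions)]
    return [instructions[s:e] for s, e in zip(starts, ends)]
```

Because every block boundary is a jump or a jump target, each returned block has a single entry and a single exit, matching the definition above.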
In the embodiment of the present application, step S206 has various implementations; only a few are listed below.
In one embodiment, please refer to FIG. 4, which relates to a specific process of merging offset register instructions to realize peephole optimization of machine learning instructions. On the basis of the above embodiment, S206 includes the following steps:
S222, acquiring a first offset register instruction in the basic block;
S224, searching for a second offset register instruction in the basic block according to the first offset register instruction, wherein the first offset register instruction and the second offset register instruction are used for offsetting the value in the same register;
S226, if no machine learning instruction that uses the value in the register exists between the first offset register instruction and the second offset register instruction, merging the first offset register instruction and the second offset register instruction to obtain a merged offset register instruction.
The machine learning instructions include the offset register instruction. An offset register instruction is a fixreg instruction used to offset the value in a register.
Optionally, the offset register instruction adds an offset to the value in the register, subtracts an offset from it, or multiplies or divides it by an offset.
Specifically, after obtaining the basic block, the compiler looks up a first offset register instruction in the basic block and, starting from its position, searches the subsequent instructions for a second offset register instruction that offsets the value in the same register. The compiler then determines whether any machine learning instruction between the first offset register instruction and the second offset register instruction uses the value in the register. If not, the compiler merges the two instructions to obtain a merged offset register instruction; if so, it returns to step S222. Further, the compiler sends the merged offset register instruction to the memory for storage, so that the processor can read it from the memory at any time and execute the corresponding operation.
In the embodiment of the application, the compiler merges offset register instructions with the same function, reducing the number of machine learning instructions and improving the running efficiency of the machine learning computing device.
As one implementation, the offset register instruction may be implemented by a scalar add instruction, so the conversion method can also optimize scalar add instructions. As another implementation, the conversion method can likewise optimize other scalar calculations on the same register.
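The merge described above can be sketched as follows, assuming an additive fixreg encoded as a dict with op, reg, and offset fields, plus an optional reads field listing the registers an instruction consumes; these field names are assumptions for illustration only.

```python
def merge_offset_registers(block):
    """Fold later fixreg offsets on a register into an earlier fixreg when
    no instruction in between reads that register's value."""
    consumed = set()
    out = []
    for i, inst in enumerate(block):
        if i in consumed:
            continue
        if inst["op"] != "fixreg":
            out.append(inst)
            continue
        merged = dict(inst)
        for j in range(i + 1, len(block)):
            nxt = block[j]
            if nxt["op"] == "fixreg" and nxt.get("reg") == merged["reg"]:
                merged["offset"] += nxt["offset"]   # fold the two offsets
                consumed.add(j)
            elif merged["reg"] in nxt.get("reads", ()):
                break   # the register's value is used: stop merging here
        out.append(merged)
    return out
```

The sketch keeps scanning past a merged fixreg, so a chain of additive offsets on one register collapses into a single instruction, matching S222-S226 applied repeatedly.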
In one embodiment, please refer to FIG. 5, which is directed to a specific process of moving forward machine learning instructions that can be executed in advance, to realize peephole optimization of machine learning instructions. On the basis of the above embodiment, S206 includes the following steps:
S232, acquiring an advanceable machine learning instruction in the basic block;
S234, moving the advanceable machine learning instruction forward, and determining, according to the position of the advanceable machine learning instruction, whether a logic error exists in the machine learning instruction sequence;
S236, if a logic error exists in the machine learning instruction sequence, stopping moving the advanceable machine learning instruction forward, and placing the advanceable machine learning instruction at any moved-forward position for which the machine learning instruction sequence has no logic error.
The machine learning instructions include the advanceable machine learning instruction. An advanceable machine learning instruction is a machine learning instruction that can be executed ahead of its original position, and executing it in advance should not affect the execution of other machine learning instructions.
Optionally, the compiler matches the machine learning instructions in the basic block against preset advanceable machine learning instructions to obtain the advanceable machine learning instructions in the basic block. For example, taking the loadwt instruction as a preset advanceable machine learning instruction, the loadwt instruction is preset in the compiler and is then matched from among the machine learning instructions in the basic block by instruction matching.
Alternatively, step S234 may be performed after the compiler has obtained all advanceable machine learning instructions in the machine learning instruction sequence, or after the compiler has obtained any one advanceable machine learning instruction.
Specifically, after obtaining an advanceable machine learning instruction in the basic block by instruction matching, the compiler moves it forward from its initial position. During the move, the compiler determines, according to the instruction's current position, whether a logic error exists in the machine learning instruction sequence. If a logic error exists, the compiler stops moving the instruction forward and places it at any moved-forward position for which the sequence has no logic error. Alternatively, the compiler may place the advanceable machine learning instruction at the last moved-forward position for which the sequence has no logic error.
Further, the compiler sends the moved machine learning instruction to the memory for storage, so that the processor can read it from the memory at any time and execute the corresponding operation.
In the embodiment of the application, the compiler can execute a machine learning instruction in advance so that its execution time is hidden under the execution time of earlier instructions, optimizing the overall instruction execution time.
As one implementation, the compiler moves an advanceable machine learning instruction forward as follows: the compiler determines whether moving the advanceable machine learning instruction forward would affect the execution of other machine learning instructions; if not, it moves the instruction forward. Alternatively, if the advanceable machine learning instruction has no dependence on any other machine learning instruction, the compiler may determine that moving it forward does not affect the execution of other machine learning instructions.
In this implementation, the move is performed only on the premise that moving the advanceable machine learning instruction forward does not affect the execution of other machine learning instructions, which avoids errors in execution and guarantees the correctness of machine learning instruction execution.
Illustratively, take a neural network as an example; a neural network typically includes multiple layers. Suppose the network includes a pooling layer followed by a convolution layer. Since the pooling layer does not use weights, the weights of the convolution layer can be read in advance: the weight-reading process of the later convolution layer is moved forward so that it executes in parallel with the computation of the preceding pooling layer, saving overall execution time. Through this instruction optimization, the weight-load instruction of the later convolution layer is moved forward so that it can be executed in advance, which can greatly reduce the overall execution time of the neural network instructions.
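A minimal sketch of the forward move follows, assuming a caller-supplied has_conflict predicate that reports whether placing the advanceable instruction before a given instruction would introduce a logic error (e.g. a data dependence); the predicate and list encoding are illustrative assumptions, not the patent's actual mechanism.

```python
def advance_instruction(block, idx, has_conflict):
    """Bubble block[idx] toward the front of the block, stopping at the
    first instruction whose execution the move would affect."""
    inst = block[idx]
    pos = idx
    while pos > 0 and not has_conflict(block[pos - 1], inst):
        block[pos] = block[pos - 1]   # shift the crossed instruction back
        pos -= 1
    block[pos] = inst                 # last position with no logic error
    return pos
```

For the pooling/convolution example above, a loadwt for the later convolution layer would bubble past the pooling computation (no conflict, since pooling uses no weights) and stop at the first instruction it genuinely depends on.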
In application scenarios where machine learning instructions are arranged in a pipeline, peephole optimization can be performed on some of them. For example, referring to FIG. 6, "L" denotes a data load operation (load), "C" a data compute operation (compute), and "S" a data store operation (store). The time occupied by each operation is represented by a time slice. Specifically, in the first time slice, the previous load instruction is fetched and the data load operation is executed; in the second time slice, the previous compute instruction and the current load instruction are fetched, and the data compute and data load operations are executed in parallel; in the third time slice, the previous store instruction, the current compute instruction, and the next load instruction are fetched, and the data store, compute, and load operations are executed in parallel. Executing cyclically in this way, the machine learning instructions run as a three-stage pipeline until the last store instruction completes. During the execution of L, C, and S, the associated data are also stored in alternating ping-pong buffers.
In one embodiment, based on the above application scenario, please refer to FIG. 7, which relates to a specific process in which the compiler merges a store instruction and a load instruction corresponding to the same batch of data into a move instruction, to realize peephole optimization of machine learning instructions. On the basis of the above embodiment, S206 includes the following steps:
S242, acquiring a store instruction in the basic block, wherein the store instruction is used for storing first data;
S244, continuing to search, after the store instruction, for a load instruction for loading the first data;
S246, merging the store instruction and the load instruction to obtain a merged move instruction.
The machine learning instructions include the store instruction and the load instruction. The move instruction is used to select the memory space during ping-pong buffering.
Specifically, after obtaining the basic block, the compiler finds a store instruction in the basic block and searches the subsequent instructions for a load instruction on the same on-chip address; whether the store instruction and the load instruction act on the same batch of data can be determined from the identified on-chip address. If a store instruction and a load instruction acting on the same batch of data exist, the compiler merges them to obtain a merged move instruction.
For example, in a neural network, the next layer needs the output of the previous layer. Without merging, a store instruction is first read to store the output of the previous layer, and a load instruction is then read to load the stored output into the next layer. If the store instruction and the load instruction are merged into a move instruction, the output of the previous layer can be passed directly to the next layer.
Further, the compiler sends the merged move instruction to the memory for storage, so that the processor can read it from the memory at any time and execute the corresponding operation.
In the embodiment of the application, the compiler merges the store and load instructions that act on the same batch of data, reducing unnecessary IO operations and the time cost of executing machine learning instructions.
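A sketch of this store/load merge, assuming store and load instructions expose an addr field identifying the batch of data plus src/dst buffer fields, and that nothing rewrites addr between the pair; all field names are illustrative assumptions.

```python
def merge_store_load(block):
    """Replace a store of some data followed by a load of the same data
    with one move instruction that just switches the on-chip buffer."""
    consumed = set()
    out = []
    for i, inst in enumerate(block):
        if i in consumed:
            continue
        if inst["op"] == "store":
            for j in range(i + 1, len(block)):
                nxt = block[j]
                if nxt["op"] == "load" and nxt["addr"] == inst["addr"]:
                    # Same batch of data: fold the off-chip round trip
                    # into a single on-chip move.
                    out.append({"op": "move", "src": inst["src"],
                                "dst": nxt["dst"]})
                    consumed.add(j)
                    break
            else:   # no matching load found: keep the store as-is
                out.append(inst)
        else:
            out.append(inst)
    return out
```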
In one embodiment, referring to FIG. 8, a specific process in which the compiler reorders the machine learning instructions of a machine learning instruction sequence to realize peephole optimization is described. On the basis of the above embodiment, S206 includes the following steps:
S252, acquiring more than n consecutive first machine learning instructions, where n is the queue length of the instruction issue queue corresponding to the first machine learning instructions;
S254, moving forward a second machine learning instruction that is executed in parallel with the first machine learning instructions in the same time slice, and inserting it after the nth first machine learning instruction, where the number of inserted second machine learning instructions does not exceed the queue length of the instruction issue queue corresponding to the second machine learning instruction.
Optionally, take the pipelined execution of compute instructions and IO instructions as an example. As one implementation, in each time slice of the pipeline the compiler looks for more than n consecutive IO instructions, where n is the queue length of the hardware IO instruction issue queue. If it finds more than n consecutive IO instructions, the compiler moves several compute instructions from the same time slice forward and inserts them after the nth IO instruction, where the number of inserted compute instructions does not exceed the queue length of the instruction issue queue corresponding to the compute instructions.
As another implementation, in each time slice of the pipeline the compiler looks for more than n consecutive compute instructions, where n is the queue length of the hardware compute instruction issue queue. If it finds more than n consecutive compute instructions, the compiler moves several IO instructions from the same time slice forward and inserts them after the nth compute instruction, where the number of inserted IO instructions does not exceed the queue length of the instruction issue queue corresponding to the IO instructions.
The applicant found through research that if the number of consecutive machine learning instructions exceeds the length of the issue queue, the issue queue blocks and performance suffers. Thus, in embodiments of the present application, the compiler prevents the performance loss caused by a full instruction issue queue by reordering machine learning instructions: within a single time slice it can detect whether the IO queue or the compute queue would fill up and move instructions of the other queue forward to relieve the situation, thereby optimizing the performance of the machine learning instructions.
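The queue-aware reordering can be sketched as follows, assuming instructions carry a kind tag, that the queue lengths are known constants, and that instructions within one time slice are mutually independent (the patent states they execute in parallel); the symmetric compute-queue case follows by swapping the roles.

```python
IO_QUEUE_LEN = 4        # assumed length n of the hardware IO issue queue
COMPUTE_QUEUE_LEN = 4   # assumed length of the compute issue queue

def relieve_full_io_queue(slice_insts):
    """If a time slice contains more than n consecutive IO instructions,
    move compute instructions from the same slice in after the n-th IO."""
    ios = [x for x in slice_insts if x["kind"] == "io"]
    computes = [x for x in slice_insts if x["kind"] == "compute"]
    if len(ios) <= IO_QUEUE_LEN:
        return slice_insts            # queue cannot fill up; leave untouched
    moved = computes[:COMPUTE_QUEUE_LEN]   # bounded by the other queue length
    rest = computes[COMPUTE_QUEUE_LEN:]
    return ios[:IO_QUEUE_LEN] + moved + ios[IO_QUEUE_LEN:] + rest
```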
In one embodiment, please refer to FIG. 9, which is directed to another specific process in which the compiler reorders the machine learning instructions of a machine learning instruction sequence to realize peephole optimization. On the basis of the above embodiment, S206 includes the following steps:
S262, acquiring initial relative positions between instruction blocks containing synchronization instructions within a time slice, wherein each instruction block includes at least one machine learning instruction;
S264, adjusting the positions of the instruction blocks relative to other machine learning instructions to obtain final relative positions between the instruction blocks containing synchronization instructions;
S266, if a final relative position is smaller than the corresponding initial relative position, determining the final position of the adjusted instruction block according to the final relative position.
Synchronization instructions are placed between machine learning instructions with data dependences and are used to divide all machine learning instructions into the machine learning instructions of several time steps: the instructions of adjacent time steps execute serially, while the instructions within each time step execute in parallel.
Specifically, in each time slice of the pipeline, the compiler finds the instruction blocks containing synchronization instructions and determines their boundaries, thereby determining the initial relative positions between the instruction blocks of different synchronization instructions. The compiler then adjusts the instruction blocks within the same time slice relative to the other machine learning instructions so that the final relative positions between the instruction blocks of the synchronization instructions are smaller than the initial relative positions. Finally, the compiler determines the final position of each adjusted instruction block according to its final relative position.
Further, the compiler sends the position-adjusted machine learning instructions to the memory for storage, so that the processor can read them from the memory at any time and execute the corresponding operations.
In the embodiment of the application, under a pipelined instruction arrangement, synchronization instructions interrupt the pipelined execution of instructions, and placing synchronization instructions close to one another reduces the overhead of these interruptions. The compiler therefore groups the instruction blocks containing synchronization instructions together as much as possible, reducing the impact of synchronization instructions on the performance of the machine learning computing device.
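The acceptance test of S262-S266 reduces to a small helper: an adjustment is kept only when it brings the two synchronization-instruction blocks closer together. Positions here are assumed to be plain instruction indices, which is an illustrative simplification.

```python
def keep_if_closer(initial_pos, adjusted_pos):
    """Accept an adjustment of two sync-instruction blocks only when the
    final relative position is smaller than the initial one (S262-S266)."""
    initial_gap = abs(initial_pos[1] - initial_pos[0])
    final_gap = abs(adjusted_pos[1] - adjusted_pos[0])
    return adjusted_pos if final_gap < initial_gap else initial_pos
```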
In one embodiment, another possible implementation in which the compiler performs instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions is described. On the basis of the above embodiment, S206 includes the following step:
S272, if several multiplication operations or activation operations on the same piece of data occur in sequence within one time slice, fusing the multiplication operations or activation operations to obtain a fused machine learning instruction.
The optimization of this embodiment can be understood as a "constant folding" optimization. Specifically, the compiler scans the machine learning instructions according to the peephole optimization algorithm, and if several multiplication or activation operations on the same piece of data exist within a time slice, the compiler fuses them to obtain the fused machine learning instruction. Further, the compiler sends the fused machine learning instruction to the memory for storage, so that the processor can read it from the memory at any time and execute the corresponding operation.
Illustratively, when a convolution-relu-convolution sequence of layers occurs in the network, the compiler fuses these layers. Typically, a convolution layer's data is multiplied by a scale, which is a quantization parameter, and divided by the scale after the convolution. The compiler fuses some of these multiply-by-scale and divide-by-scale operations into the relu, thereby reducing overhead. For example, if the result of conv1 is to be multiplied by 1/scale1 and the input of conv2 is to be multiplied by scale2, the compiler can multiply scale2/scale1 into the slope of the relu, eliminating the overhead of two scale multiplications. Note that in this embodiment scale1 is a positive number, while scale2 may be negative.
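As a numeric check of this folding (a sketch with illustrative values, not the device's actual instructions): since scale1 is positive, applying relu after multiplying by 1/scale1 and then multiplying by scale2 equals a single relu whose positive-branch slope is scale2/scale1.

```python
def relu(x, slope=1.0):
    """ReLU whose positive branch is scaled by `slope`."""
    return x * slope if x > 0 else 0.0

scale1, scale2 = 0.5, 2.0   # scale1 must be positive for the fold to hold
x = 3.7                     # illustrative conv1 output value

unfused = relu(x * (1.0 / scale1)) * scale2   # two scale multiplications
fused = relu(x, slope=scale2 / scale1)        # one relu with a folded slope
assert abs(unfused - fused) < 1e-9            # identical results
```

The positivity test is unaffected because dividing by a positive scale1 does not change the sign of x, which is why the fold also remains valid when scale2 is negative.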
It should be understood that although the steps in the flowcharts of FIGS. 2-9 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-9 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments; these sub-steps or stages are likewise not necessarily executed in sequence, and may be performed in turn or alternately with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 10, a machine learning instruction conversion apparatus is provided, the apparatus comprising:
an instruction acquisition module 302, configured to acquire a machine learning instruction sequence;
the instruction dividing module 304 is configured to divide the machine learning instruction sequence to obtain at least one basic block, where the basic block includes at least one machine learning instruction;
the instruction conversion module 306, configured to perform instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions.
According to the machine learning instruction conversion apparatus above, a machine learning instruction sequence is obtained and divided into at least one basic block, and the machine learning instructions in the basic block are then converted according to the peephole optimization algorithm to obtain converted machine learning instructions, thereby realizing peephole optimization of machine learning instructions, reducing their time cost, and greatly improving the overall performance of the machine learning computing device.
For specific limitations on the machine learning instruction conversion apparatus, reference may be made to the limitations on the machine learning instruction conversion method above, which are not repeated here. Each module in the apparatus may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in hardware, independent of the processor in the computer device, or stored as software in the memory of the computer device, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a board card is also provided, comprising: a machine learning processor configured to execute the machine learning instruction conversion method of any one of the above embodiments.
In one embodiment, a motherboard is also provided, comprising: a general-purpose processor and the board card described above.
In one embodiment, an electronic device is also provided, which includes the motherboard described above.
In one embodiment, the electronic device may be a terminal, and the internal structure thereof may be as shown in fig. 11. The electronic device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of converting machine learning instructions. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the electronic equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the electronic device to which the present application is applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
According to the board card, the motherboard, and the electronic equipment described above, a machine learning instruction sequence is obtained and divided into at least one basic block, and the machine learning instructions in the basic block are then converted according to the peephole optimization algorithm to obtain converted machine learning instructions, thereby realizing peephole optimization of machine learning instructions, reducing their time cost, and greatly improving the overall performance of the machine learning computing device.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above examples merely represent several embodiments of the present application; their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, and these fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (11)

1. A method of converting machine learning instructions, the method comprising:
acquiring a machine learning instruction sequence;
dividing the machine learning instruction sequence to obtain at least one basic block, wherein the basic block comprises at least one machine learning instruction;
performing instruction conversion on the machine learning instructions in the basic block according to a peephole optimization algorithm to obtain converted machine learning instructions;
wherein the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions comprises:
acquiring an advanceable machine learning instruction in the basic block;
moving the advanceable machine learning instruction forward, and determining, according to the position of the advanceable machine learning instruction, whether a logic error exists in the machine learning instruction sequence;
and if a logic error exists in the machine learning instruction sequence, stopping moving the advanceable machine learning instruction forward, and placing the advanceable machine learning instruction at any moved-forward position for which the machine learning instruction sequence has no logic error.
2. The method of claim 1, wherein dividing the sequence of machine learning instructions to obtain at least one basic block comprises:
searching for a jump instruction in the machine learning instruction sequence;
and dividing the machine learning instruction sequence according to the jump instruction to obtain at least one basic block.
3. The method of claim 1, wherein the acquiring the advanceable machine learning instruction in the basic block comprises:
matching the machine learning instructions in the basic block against preset advanceable machine learning instructions to obtain the advanceable machine learning instruction in the basic block.
4. The method of claim 1, wherein the moving the advanceable machine learning instruction forward comprises:
determining whether moving the advanceable machine learning instruction forward would affect the execution of other machine learning instructions;
and if it is determined that moving the advanceable machine learning instruction forward does not affect the execution of other machine learning instructions, moving the advanceable machine learning instruction forward.
5. The method according to claim 1, wherein the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions comprises:
acquiring a store instruction in the basic block, wherein the store instruction is used for storing first data;
continuing to search, after the store instruction, for a load instruction for loading the first data;
and merging the store instruction and the load instruction to obtain a merged move instruction.
6. The method according to claim 1, wherein the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions comprises:
acquiring initial relative positions between instruction blocks containing synchronization instructions within a time slice, wherein each instruction block comprises at least one machine learning instruction;
adjusting the positions of the instruction blocks relative to other machine learning instructions to obtain final relative positions between the instruction blocks containing synchronization instructions;
and if a final relative position is smaller than the corresponding initial relative position, determining the final position of the adjusted instruction block according to the final relative position.
7. The method of claim 1, wherein performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain the converted machine learning instructions comprises:
if multiple multiplication operations or activation operations on the same piece of data exist within a time slice, fusing the multiplication operations or the activation operations to obtain a fused machine learning instruction.
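For example, two successive scalar multiplications of the same data fold into one multiplication by the product of the scalars; idempotent activations such as ReLU admit a similar collapse. A sketch under that assumed encoding:

def fuse_multiplies(block):
    """Claim 7 sketch: fuse consecutive scalar multiplies of the same
    data into a single multiply by the product of the scalars."""
    out = []
    for instr in block:
        prev = out[-1] if out else None
        if (prev is not None and prev["op"] == "mul"
                and instr["op"] == "mul" and prev["data"] == instr["data"]):
            prev["scale"] *= instr["scale"]  # fold into the prior multiply
        else:
            out.append(dict(instr))
    return out

block = [{"op": "mul", "data": "x", "scale": 2.0},
         {"op": "mul", "data": "x", "scale": 0.5},
         {"op": "relu", "data": "x"}]
print(fuse_multiplies(block))
# [{'op': 'mul', 'data': 'x', 'scale': 1.0}, {'op': 'relu', 'data': 'x'}]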
8. An apparatus for converting machine learning instructions, the apparatus comprising:
an instruction acquisition module configured to acquire a machine learning instruction sequence;
an instruction division module configured to divide the machine learning instruction sequence to obtain at least one basic block, wherein the basic block comprises at least one machine learning instruction; and
an instruction conversion module configured to perform instruction conversion on the machine learning instructions in the basic block according to a peephole optimization algorithm to obtain converted machine learning instructions;
wherein performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain the converted machine learning instructions comprises:
acquiring an advanceable machine learning instruction in the basic block;
moving the advanceable machine learning instruction forward in position, and judging, according to the position of the advanceable machine learning instruction, whether a logic error exists in the machine learning instruction sequence; and
if a logic error exists in the machine learning instruction sequence, stopping moving the advanceable machine learning instruction forward, and placing the advanceable machine learning instruction at any forward-moved position for which no logic error exists in the machine learning instruction sequence.
9. A board card, characterized in that the board card comprises: a machine learning processor configured to perform the method of any one of claims 1-7.
10. A motherboard, characterized in that the motherboard comprises: a general-purpose processor and the board card of claim 9.
11. An electronic device, characterized in that the electronic device comprises the motherboard of claim 10.
CN202011115613.1A 2019-11-08 2019-11-08 Machine learning instruction conversion method and device, board card, main board and electronic equipment Active CN112257870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011115613.1A CN112257870B (en) 2019-11-08 2019-11-08 Machine learning instruction conversion method and device, board card, main board and electronic equipment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911087323.8A CN110874643B (en) 2019-11-08 2019-11-08 Conversion method and device of machine learning instruction, board card, mainboard and electronic equipment
CN202011115613.1A CN112257870B (en) 2019-11-08 2019-11-08 Machine learning instruction conversion method and device, board card, main board and electronic equipment

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201911087323.8A Division CN110874643B (en) 2019-11-08 2019-11-08 Conversion method and device of machine learning instruction, board card, mainboard and electronic equipment

Publications (2)

Publication Number Publication Date
CN112257870A (en) 2021-01-22
CN112257870B (en) 2024-04-09

Family

ID=69718233

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202011570154.6A Active CN112667241B (en) 2019-11-08 2019-11-08 Machine learning instruction conversion method and device, board card, main board and electronic equipment
CN202011115613.1A Active CN112257870B (en) 2019-11-08 2019-11-08 Machine learning instruction conversion method and device, board card, main board and electronic equipment
CN201911087323.8A Active CN110874643B (en) 2019-11-08 2019-11-08 Conversion method and device of machine learning instruction, board card, mainboard and electronic equipment

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202011570154.6A Active CN112667241B (en) 2019-11-08 2019-11-08 Machine learning instruction conversion method and device, board card, main board and electronic equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201911087323.8A Active CN110874643B (en) 2019-11-08 2019-11-08 Conversion method and device of machine learning instruction, board card, mainboard and electronic equipment

Country Status (1)

Country Link
CN (3) CN112667241B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667241B (en) * 2019-11-08 2023-09-29 安徽寒武纪信息科技有限公司 Machine learning instruction conversion method and device, board card, main board and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021040A (en) * 2016-05-04 2016-10-12 中国人民解放军国防科学技术大学 Linear assembly instruction diversity conversion based DSP soft error detection method
CN110163362A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4656582A (en) * 1985-02-04 1987-04-07 International Business Machines Corporation Generating storage reference instructions in an optimizing compiler
US20060200811A1 (en) * 2005-03-07 2006-09-07 Cheng Stephen M Method of generating optimised stack code
CN100444118C (en) * 2007-03-19 2008-12-17 中国人民解放军国防科学技术大学 Software and hardware combined command relative controlling method based on logic transmitting rank
US20120151187A1 (en) * 2010-12-13 2012-06-14 Microsoft Corporation Instruction optimization
CN102662720B (en) * 2012-03-12 2015-01-28 天津国芯科技有限公司 Optimization method of compiler of multi-issue embedded processor
CN102945148A (en) * 2012-09-26 2013-02-27 中国航天科技集团公司第九研究院第七七一研究所 Method for realizing parallel instruction set
CN102981802B (en) * 2012-11-06 2015-10-07 无锡江南计算技术研究所 A kind of instruction morphing method and system
US9595205B2 (en) * 2012-12-18 2017-03-14 Neuron Fuel, Inc. Systems and methods for goal-based programming instruction
CN104516726B (en) * 2013-09-27 2018-08-07 联想(北京)有限公司 A kind of method and device of instruction processing
CN104049949B (en) * 2014-05-30 2016-10-05 南阳理工学院 A kind of peephole optimization method towards BSWAP instruction
US11281481B2 (en) * 2014-07-25 2022-03-22 Intel Corporation Using a plurality of conversion tables to implement an instruction set agnostic runtime architecture
CN105487839A (en) * 2015-11-24 2016-04-13 无锡江南计算技术研究所 Continuous non-alignment vector data access oriented compiling optimization method
CN105719184A (en) * 2016-01-15 2016-06-29 优品财富管理有限公司 Transaction command conversion method and system
CN105843660B (en) * 2016-03-21 2019-04-02 同济大学 A kind of code optimization dispatching method of compiler
US10789544B2 (en) * 2016-04-05 2020-09-29 Google Llc Batching inputs to a machine learning model
US10817802B2 (en) * 2016-05-07 2020-10-27 Intel Corporation Apparatus for hardware accelerated machine learning
US10169010B2 (en) * 2016-06-01 2019-01-01 International Business Machines Corporation Performing register promotion optimizations in a computer program in regions where memory aliasing may occur and executing the computer program on processor hardware that detects memory aliasing
CN109491659B (en) * 2017-09-11 2022-06-21 龙芯中科技术股份有限公司 Instruction conversion method and device
CN108154238B (en) * 2017-12-25 2020-11-27 东软集团股份有限公司 Migration method and device of machine learning process, storage medium and electronic equipment
CN110045960B (en) * 2018-01-16 2022-02-18 腾讯科技(深圳)有限公司 Chip-based instruction set processing method and device and storage medium
CN108427558A (en) * 2018-02-09 2018-08-21 芯海科技(深圳)股份有限公司 A kind of peephole optimization method of C compilers
CN108924187B (en) * 2018-06-07 2020-05-08 北京百度网讯科技有限公司 Task processing method and device based on machine learning and terminal equipment
CN108845830B (en) * 2018-07-03 2021-12-03 中国人民解放军国防科技大学 Execution method of one-to-one loading instruction
CN112667241B (en) * 2019-11-08 2023-09-29 安徽寒武纪信息科技有限公司 Machine learning instruction conversion method and device, board card, main board and electronic equipment


Also Published As

Publication number Publication date
CN112667241B (en) 2023-09-29
CN110874643B (en) 2021-01-12
CN110874643A (en) 2020-03-10
CN112667241A (en) 2021-04-16
CN112257870A (en) 2021-01-22

Similar Documents

Publication Title
EP3624020A1 (en) Computing method and related product
CN109543816B (en) Convolutional neural network calculation method and system based on weight kneading
US8321492B1 (en) System, method, and computer program product for converting a reduction algorithm to a segmented reduction algorithm
US10877733B2 (en) Segment divider, segment division operation method, and electronic device
CN111915001A (en) Convolution calculation engine, artificial intelligence chip and data processing method
US10558500B2 (en) Scheduling heterogenous processors
EP4071619A1 (en) Address generation method, related device and storage medium
US12061910B2 (en) Dispatching multiply and accumulate operations based on accumulator register index number
US20220291901A1 (en) Data processing method for processing unit, electronic device and computer readable storage medium
US8707013B2 (en) On-demand predicate registers
CN112257870B (en) Machine learning instruction conversion method and device, board card, main board and electronic equipment
CN106682258B (en) Multi-operand addition optimization method and system in high-level comprehensive tool
US7734456B2 (en) Method and apparatus for priority based data processing
US11188328B2 (en) Compute array of a processor with mixed-precision numerical linear algebra support
CN115469931B (en) Instruction optimization method, device, system, equipment and medium of loop program
CN116578425A (en) Load balancing method and system based on rasterization
CN115130672A (en) Method and device for calculating convolution neural network by software and hardware collaborative optimization
CN111798363B (en) Graphics processor
CN114490002A (en) Data processing system, task scheduling method, device, chip and electronic equipment
CN113496270A (en) Hybrid precision neural processing unit using spatial fusion with load balancing
CN114117896A (en) Method and system for realizing binary protocol optimization for ultra-long SIMD pipeline
CN117112033B (en) Random instruction generation method, device, equipment and storage medium
US9298421B2 (en) Performing quotient selection for a carry-save division operation
Gao et al. GPU acceleration of pyrosequencing noise removal
US20240004830A1 (en) Floorplan-optimized matrix extension architecture for processors

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant