CN112257870B - Machine learning instruction conversion method and device, board card, main board and electronic equipment - Google Patents

Machine learning instruction conversion method and device, board card, main board and electronic equipment

Info

Publication number
CN112257870B
CN112257870B (application CN202011115613.1A)
Authority
CN
China
Prior art keywords
instruction
machine learning
learning instruction
instructions
basic block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011115613.1A
Other languages
Chinese (zh)
Other versions
CN112257870A (en)
Inventor
Name not published at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority to CN202011115613.1A priority Critical patent/CN112257870B/en
Publication of CN112257870A publication Critical patent/CN112257870A/en
Application granted granted Critical
Publication of CN112257870B publication Critical patent/CN112257870B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/40: Transformation of program code
    • G06F 8/41: Compilation
    • G06F 8/44: Encoding
    • G06F 8/443: Optimisation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The application relates to a machine learning instruction conversion method and device, a board card, a main board and electronic equipment. A machine learning instruction sequence is obtained and divided into at least one basic block, and the machine learning instructions in the basic block are then converted according to a peephole optimization algorithm to obtain converted machine learning instructions. Peephole optimization of machine learning instructions is thereby realized, the time cost of the machine learning instructions is reduced, and the overall performance of a machine learning computing device is greatly improved.

Description

Machine learning instruction conversion method and device, board card, main board and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for converting machine learning instructions, a board card, a motherboard, and an electronic device.
Background
In recent years, machine learning has developed rapidly, mainly because it meets the demand for extremely fast processing of huge volumes of data. Machine learning operations are powerful algorithms and have in recent years been applied in fields such as image and speech processing.
Peephole optimization is a highly local optimization technique: working on the generated code, the compiler exploits the characteristics of the CPU instructions and improves code performance either through conversion rules that are likely to bring performance gains or through overall analysis and instruction conversion.
However, the conventional technology lacks a solution for peephole optimization of machine learning instructions in a machine learning computing device, so how to realize peephole optimization of machine learning instructions is a problem to be solved by those skilled in the art.
Disclosure of Invention
Accordingly, it is necessary to provide a machine learning instruction conversion method and apparatus, a board card, a main board, and an electronic device to solve the above technical problem of how to perform peephole optimization on machine learning instructions.
A method of converting machine learning instructions, the method comprising:
acquiring a machine learning instruction sequence;
dividing the machine learning instruction sequence to obtain at least one basic block, wherein the basic block comprises at least one machine learning instruction;
and performing instruction conversion on the machine learning instructions in the basic block according to a peephole optimization algorithm to obtain converted machine learning instructions.
In one embodiment, the dividing the machine learning instruction sequence to obtain at least one basic block includes:
searching for a jump instruction in the machine learning instruction sequence;
and dividing the machine learning instruction sequence according to the jump instruction to obtain at least one basic block.
In one embodiment, the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions includes:
acquiring a first offset register instruction in the basic block;
searching for a second offset register instruction in the basic block according to the first offset register instruction, wherein the first offset register instruction and the second offset register instruction are used for offsetting the value in the same register;
and if no machine learning instruction that uses the value in the register exists between the first offset register instruction and the second offset register instruction, merging the first offset register instruction and the second offset register instruction to obtain a merged offset register instruction.
In one embodiment, the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions includes:
acquiring an advanceable machine learning instruction in the basic block;
moving the advanceable machine learning instruction forward, and determining, according to the position of the advanceable machine learning instruction, whether a logic error exists in the machine learning instruction sequence;
and if a logic error exists in the machine learning instruction sequence, stopping moving the advanceable machine learning instruction forward, and placing the advanceable machine learning instruction at any moved-forward position for which the machine learning instruction sequence has no logic error.
In one embodiment, the acquiring the advanceable machine learning instruction in the basic block includes:
matching the machine learning instructions in the basic block against preset advanceable machine learning instructions to obtain the advanceable machine learning instruction in the basic block.
In one embodiment, the moving the advanceable machine learning instruction forward includes:
determining whether moving the advanceable machine learning instruction forward would affect the execution of other machine learning instructions;
and if it is determined that moving the advanceable machine learning instruction forward does not affect the execution of other machine learning instructions, moving the advanceable machine learning instruction forward.
In one embodiment, the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions includes:
acquiring a store instruction in the basic block, wherein the store instruction is used for storing first data;
continuing to search, after the store instruction, for a load instruction for loading the first data;
and merging the store instruction and the load instruction to obtain a merged move instruction.
In one embodiment, the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions includes:
acquiring more than n consecutive first machine learning instructions, where n is the queue length of the instruction issue queue corresponding to the first machine learning instructions;
and moving forward a second machine learning instruction that is executed in parallel with the first machine learning instructions in the same time slice, and inserting it after the nth first machine learning instruction, where the number of inserted second machine learning instructions does not exceed the queue length of the instruction issue queue corresponding to the second machine learning instruction.
In one embodiment, the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions includes:
acquiring initial relative positions between instruction blocks containing synchronization instructions within a time slice, wherein each instruction block includes at least one machine learning instruction;
adjusting the positions of the instruction blocks relative to other machine learning instructions to obtain final relative positions between the instruction blocks containing synchronization instructions;
and if a final relative position is smaller than the corresponding initial relative position, determining the final position of the adjusted instruction block according to the final relative position.
In one embodiment, the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions includes:
if several multiplication operations or activation operations on the same piece of data occur in sequence within one time slice, fusing the multiplication operations or activation operations to obtain a fused machine learning instruction.
For example, when the scale is greater than 0, the scale can be multiplied into the slope of the activation and the instruction that multiplies by the scale removed; the result is the same.
A conversion apparatus of machine learning instructions, the apparatus comprising:
the instruction acquisition module is used for acquiring a machine learning instruction sequence;
the instruction dividing module is used for dividing the machine learning instruction sequence to obtain at least one basic block, wherein the basic block comprises at least one machine learning instruction;
and the instruction conversion module is used for performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions.
A board card, the board card comprising: a machine learning processor configured to perform the method of any one of the above embodiments.
A motherboard, the motherboard comprising: a general-purpose processor and the board card described in the above embodiment.
An electronic device comprising a motherboard as in the previous embodiments.
According to the machine learning instruction conversion method and apparatus, the board card, the motherboard, and the electronic equipment described above, a machine learning instruction sequence is obtained and divided into at least one basic block, and the machine learning instructions in the basic block are then converted according to the peephole optimization algorithm to obtain converted machine learning instructions, thereby realizing peephole optimization of machine learning instructions, reducing their time cost, and greatly improving the overall performance of the machine learning computing device.
Drawings
FIG. 1 is an application environment diagram of a method of converting machine learning instructions in one embodiment;
FIG. 2 is a flow diagram of a method of converting machine learning instructions in one embodiment;
FIG. 3 is a flow diagram of dividing a machine learning instruction sequence to obtain at least one basic block in one embodiment;
FIG. 4 is a flow diagram of realizing peephole optimization of machine learning instructions by merging offset register instructions in one embodiment;
FIG. 5 is a flow diagram of moving an advanceable machine learning instruction forward to realize peephole optimization of machine learning instructions in one embodiment;
FIG. 6 is a schematic diagram of data load, data compute, and data store operations forming a three-stage pipeline in one embodiment;
FIG. 7 is a flow chart of merging a store instruction and a load instruction corresponding to the same batch of data into a move instruction to realize peephole optimization of machine learning instructions in one embodiment;
FIG. 8 is a flow diagram of reordering the machine learning instructions of a machine learning instruction sequence to realize peephole optimization of machine learning instructions in one embodiment;
FIG. 9 is a flow diagram of reordering the machine learning instructions of a machine learning instruction sequence to realize peephole optimization of machine learning instructions in another embodiment;
FIG. 10 is a block diagram of a machine learning instruction conversion apparatus in one embodiment;
FIG. 11 is an internal structural diagram of an electronic device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The machine learning instruction conversion method can be applied in the application environment shown in FIG. 1, in which the processor 102 and the compiler 104 are each connected to the memory 106. The compiler 104 is configured to obtain a machine learning instruction sequence and perform instruction conversion on the instructions in the sequence using the peephole optimization technique, obtaining converted machine learning instructions. The processor 102 is configured to perform the corresponding machine learning operations according to the converted machine learning instructions obtained by the compiler 104. The machine learning instruction sequence and/or the converted machine learning instructions may be stored in the memory 106. Optimizing the machine learning instructions effectively reduces time cost.
Alternatively, the processor 102 may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. The memory 106 may be any medium capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
In one embodiment, the machine learning instruction conversion method can be applied to computing platforms such as central processing units and neural network accelerators. The method may be executed by a compiler, which is usually a software program running on a central processing unit, but may also be a software program or hardware circuit running in the central processing unit or the neural network accelerator. In this embodiment, the compiler converts the machine learning instructions on the central processor; the central processor then sends the converted machine learning instructions to the neural network accelerator, and after the neural network accelerator executes them, the operation result is returned to the central processor. Alternatively, the neural network accelerator may be a graphics processor, an embedded neural network processor, or a machine learning processing unit; this embodiment does not limit the specific type of the neural network accelerator. A specific implementation of the machine learning instruction conversion method is described in the following embodiments.
In one embodiment, as shown in FIG. 2, a method for converting machine learning instructions is provided. The method is described as applied to the compiler 104 in FIG. 1 and includes the following steps:
s202, acquiring a machine learning instruction sequence.
The machine learning instruction sequence is a linear queue formed by splicing the machine codes of a plurality of machine learning instructions in order, with bytes as the basic unit. Optionally, the sequence comprises a plurality of machine learning instructions. It should be appreciated that machine learning instructions include many vector instructions or stream instructions, each of which can process multiple values, and jump instructions are relatively rare.
In particular, the compiler may read a sequence of machine-learned instructions from memory.
S204, dividing the machine learning instruction sequence to obtain at least one basic block.
A basic block is a sequence of statements executed in program order that has exactly one entry and one exit: the entry is its first statement and the exit is its last. A basic block can only be entered through its entry and exited through its exit.
Wherein the basic block includes at least one machine learning instruction therein.
Specifically, the compiler divides the machine learning instruction sequence into at least one basic block according to a preset instruction division scheme. Alternatively, the preset scheme may divide the machine learning instruction sequence at if-else instructions, for-loop instructions, or other jump instructions. As noted above, since jump instructions are relatively rare, the divided basic blocks are generally large and the jump instructions impose few restrictions. Compared with the more scalar instructions of a CPU, the order of machine learning instructions can therefore be optimized at a coarser granularity, without considering dependencies between the data processed by a single vector instruction.
S206, performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions.
It should be noted that the peephole optimization algorithm optimizes machine learning instructions within a basic block, but not across different basic blocks.
Specifically, after obtaining the basic block, the compiler performs instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions, realizing local optimization of the machine learning instructions with the peephole optimization technique. Further, the compiler sends the converted machine learning instructions to the memory for storage, so that the processor can read them from the memory at any time and execute the operations corresponding to the instructions.
Optionally, peephole optimization includes instruction merging, instruction advancing, instruction reordering, and the like. For example, machine learning instructions to be merged are combined by instruction merging to obtain a merged machine learning instruction. It will be appreciated that merging is one way of implementing instruction conversion for machine learning instructions.
According to this machine learning instruction conversion method, the compiler obtains a machine learning instruction sequence, divides it to obtain at least one basic block, and then performs instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions, thereby realizing peephole optimization of machine learning instructions, reducing their time cost, and greatly improving the overall performance of the machine learning computing device.
In one embodiment, referring to FIG. 3, a possible implementation in which the compiler divides the machine learning instruction sequence into at least one basic block is described. On the basis of the above embodiment, S204 includes the following steps:
S212, searching for a jump instruction in the machine learning instruction sequence;
S214, dividing the machine learning instruction sequence according to the jump instruction to obtain at least one basic block.
A jump instruction is a machine learning instruction that implements a jump. Optionally, jump instructions include if-else instructions, for-loop instructions, and the like.
Specifically, after obtaining the machine learning instruction sequence, the compiler searches all machine learning instructions in the sequence to find all of its jump instructions. The compiler then divides the sequence according to the jump instructions and the machine learning instructions they jump to, obtaining at least one basic block. Optionally, jump instructions may be preset, and all jump instructions in the machine learning instruction sequence are then obtained by instruction matching.
Taking the if-else instruction as an example, suppose it represents if (expression) statement 1 else statement 2. Taking the machine learning instructions corresponding to if and else as boundaries, the machine learning instructions between if and else form one basic block, those between else and the end form another, and the machine learning instructions on the two sides each form a basic block, giving four basic blocks in total: the block before the if, the block between if and else, the block between else and the end, and the block after the end. The four basic blocks can be represented as follows:
Basic block 1
if (…) {
Basic block 2
} else {
Basic block 3
}
Basic block 4.
In the embodiment of the application, the compiler uses jump instructions as the division boundaries of basic blocks, so the machine learning instructions within each divided basic block cannot have logic problems, which in turn guarantees the correctness of executing the peephole-optimized machine learning instructions.
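By way of illustration, the following Python sketch shows one possible form of this division, assuming each machine learning instruction is encoded as a dict with an op field and that jump instructions carry a target index; the opcode names and fields are illustrative assumptions, not the actual instruction format.

```python
JUMP_OPS = {"jump", "if", "else", "for_begin", "for_end"}  # assumed opcodes

def split_basic_blocks(instructions):
    """Split an instruction sequence into basic blocks: a new block starts
    at index 0, after every jump instruction, and at every jump target."""
    leaders = {0}
    for i, inst in enumerate(instructions):
        if inst["op"] in JUMP_OPS:
            leaders.add(i + 1)               # the instruction after the jump
            if "target" in inst:
                leaders.add(inst["target"])  # the instruction jumped to
    starts = sorted(b for b in leaders if b < len(instructions))
    ends = starts[1:] + [len(instructions)]
    return [instructions[s:e] for s, e in zip(starts, ends)]
```

Because every block boundary is a jump or a jump target, each returned block has a single entry and a single exit, matching the definition above.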
In the embodiment of the present application, step S206 has various implementations; only a few are listed below.
In one embodiment, please refer to FIG. 4, which relates to a specific process of merging offset register instructions to realize peephole optimization of machine learning instructions. On the basis of the above embodiment, S206 includes the following steps:
S222, acquiring a first offset register instruction in the basic block;
S224, searching for a second offset register instruction in the basic block according to the first offset register instruction, wherein the first offset register instruction and the second offset register instruction are used for offsetting the value in the same register;
S226, if no machine learning instruction that uses the value in the register exists between the first offset register instruction and the second offset register instruction, merging the first offset register instruction and the second offset register instruction to obtain a merged offset register instruction.
The machine learning instructions include the offset register instruction. An offset register instruction is a fixreg instruction used to offset the value in a register.
Optionally, the offset register instruction adds an offset to the value in the register, subtracts an offset from it, or multiplies or divides it by an offset.
Specifically, after obtaining the basic block, the compiler looks up a first offset register instruction in the basic block and, starting from its position, searches the subsequent instructions for a second offset register instruction that offsets the value in the same register. The compiler then determines whether any machine learning instruction between the first offset register instruction and the second offset register instruction uses the value in the register. If not, the compiler merges the two instructions to obtain a merged offset register instruction; if so, it returns to step S222. Further, the compiler sends the merged offset register instruction to the memory for storage, so that the processor can read it from the memory at any time and execute the corresponding operation.
In the embodiment of the application, the compiler merges offset register instructions with the same function, reducing the number of machine learning instructions and improving the running efficiency of the machine learning computing device.
As one implementation, the offset register instruction may be implemented by a scalar add instruction, so the conversion method can also optimize scalar add instructions. As another implementation, the conversion method can likewise optimize other scalar calculations on the same register.
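The merge described above can be sketched as follows, assuming an additive fixreg encoded as a dict with op, reg, and offset fields, plus an optional reads field listing the registers an instruction consumes; these field names are assumptions for illustration only.

```python
def merge_offset_registers(block):
    """Fold later fixreg offsets on a register into an earlier fixreg when
    no instruction in between reads that register's value."""
    consumed = set()
    out = []
    for i, inst in enumerate(block):
        if i in consumed:
            continue
        if inst["op"] != "fixreg":
            out.append(inst)
            continue
        merged = dict(inst)
        for j in range(i + 1, len(block)):
            nxt = block[j]
            if nxt["op"] == "fixreg" and nxt.get("reg") == merged["reg"]:
                merged["offset"] += nxt["offset"]   # fold the two offsets
                consumed.add(j)
            elif merged["reg"] in nxt.get("reads", ()):
                break   # the register's value is used: stop merging here
        out.append(merged)
    return out
```

The sketch keeps scanning past a merged fixreg, so a chain of additive offsets on one register collapses into a single instruction, matching S222-S226 applied repeatedly.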
In one embodiment, please refer to FIG. 5, which is directed to a specific process of moving forward machine learning instructions that can be executed in advance, to realize peephole optimization of machine learning instructions. On the basis of the above embodiment, S206 includes the following steps:
S232, acquiring an advanceable machine learning instruction in the basic block;
S234, moving the advanceable machine learning instruction forward, and determining, according to the position of the advanceable machine learning instruction, whether a logic error exists in the machine learning instruction sequence;
S236, if a logic error exists in the machine learning instruction sequence, stopping moving the advanceable machine learning instruction forward, and placing the advanceable machine learning instruction at any moved-forward position for which the machine learning instruction sequence has no logic error.
The machine learning instructions include the advanceable machine learning instruction. An advanceable machine learning instruction is a machine learning instruction that can be executed ahead of its original position, and executing it in advance should not affect the execution of other machine learning instructions.
Optionally, the compiler matches the machine learning instructions in the basic block against preset advanceable machine learning instructions to obtain the advanceable machine learning instructions in the basic block. For example, taking the loadwt instruction as a preset advanceable machine learning instruction, the loadwt instruction is preset in the compiler and is then matched from among the machine learning instructions in the basic block by instruction matching.
Alternatively, step S234 may be performed after the compiler has obtained all advanceable machine learning instructions in the machine learning instruction sequence, or after the compiler has obtained any one advanceable machine learning instruction.
Specifically, after obtaining an advanceable machine learning instruction in the basic block by instruction matching, the compiler moves it forward from its initial position. During the move, the compiler determines, according to the instruction's current position, whether a logic error exists in the machine learning instruction sequence. If a logic error exists, the compiler stops moving the instruction forward and places it at any moved-forward position for which the sequence has no logic error. Alternatively, the compiler may place the advanceable machine learning instruction at the last moved-forward position for which the sequence has no logic error.
Further, the compiler sends the moved machine learning instruction to the memory for storage, so that the processor can read it from the memory at any time and execute the corresponding operation.
In the embodiment of the application, the compiler can execute a machine learning instruction in advance so that its execution time is hidden under the execution time of earlier instructions, optimizing the overall instruction execution time.
As one implementation, the compiler moves an advanceable machine learning instruction forward as follows: the compiler determines whether moving the advanceable machine learning instruction forward would affect the execution of other machine learning instructions; if not, it moves the instruction forward. Alternatively, if the advanceable machine learning instruction has no dependence on any other machine learning instruction, the compiler may determine that moving it forward does not affect the execution of other machine learning instructions.
In this implementation, the move is performed only on the premise that moving the advanceable machine learning instruction forward does not affect the execution of other machine learning instructions, which avoids errors in execution and guarantees the correctness of machine learning instruction execution.
Illustratively, take a neural network as an example; a neural network typically includes multiple layers. Suppose the network includes a pooling layer followed by a convolution layer. Since the pooling layer does not use weights, the weights of the convolution layer can be read in advance: the weight-reading process of the later convolution layer is moved forward so that it executes in parallel with the computation of the preceding pooling layer, saving overall execution time. Through this instruction optimization, the weight-load instruction of the later convolution layer is moved forward so that it can be executed in advance, which can greatly reduce the overall execution time of the neural network instructions.
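A minimal sketch of the forward move follows, assuming a caller-supplied has_conflict predicate that reports whether placing the advanceable instruction before a given instruction would introduce a logic error (e.g. a data dependence); the predicate and list encoding are illustrative assumptions, not the patent's actual mechanism.

```python
def advance_instruction(block, idx, has_conflict):
    """Bubble block[idx] toward the front of the block, stopping at the
    first instruction whose execution the move would affect."""
    inst = block[idx]
    pos = idx
    while pos > 0 and not has_conflict(block[pos - 1], inst):
        block[pos] = block[pos - 1]   # shift the crossed instruction back
        pos -= 1
    block[pos] = inst                 # last position with no logic error
    return pos
```

For the pooling/convolution example above, a loadwt for the later convolution layer would bubble past the pooling computation (no conflict, since pooling uses no weights) and stop at the first instruction it genuinely depends on.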
In application scenarios where machine learning instructions are arranged in a pipeline, peephole optimization can be performed on some of them. For example, referring to FIG. 6, "L" denotes a data load operation (load), "C" a data compute operation (compute), and "S" a data store operation (store). The time occupied by each operation is represented by a time slice. Specifically, in the first time slice, the previous load instruction is fetched and the data load operation is executed; in the second time slice, the previous compute instruction and the current load instruction are fetched, and the data compute and data load operations are executed in parallel; in the third time slice, the previous store instruction, the current compute instruction, and the next load instruction are fetched, and the data store, compute, and load operations are executed in parallel. Executing cyclically in this way, the machine learning instructions run as a three-stage pipeline until the last store instruction completes. During the execution of L, C, and S, the associated data are also stored in alternating ping-pong buffers.
In one embodiment, based on the above application scenario, please refer to FIG. 7, which relates to a specific process in which the compiler merges a store instruction and a load instruction corresponding to the same batch of data into a move instruction, to realize peephole optimization of machine learning instructions. On the basis of the above embodiment, S206 includes the following steps:
S242, acquiring a store instruction in the basic block, wherein the store instruction is used for storing first data;
S244, continuing to search, after the store instruction, for a load instruction for loading the first data;
S246, merging the store instruction and the load instruction to obtain a merged move instruction.
The machine learning instructions include the store instruction and the load instruction. The move instruction is used to select the memory space during ping-pong buffering.
Specifically, after obtaining the basic block, the compiler finds a store instruction in the basic block and searches the subsequent instructions for a load instruction on the same on-chip address; whether the store instruction and the load instruction act on the same batch of data can be determined from the identified on-chip address. If a store instruction and a load instruction acting on the same batch of data exist, the compiler merges them to obtain a merged move instruction.
For example, in a neural network, the next layer needs the output of the previous layer. Without merging, a store instruction is first read to store the output of the previous layer, and a load instruction is then read to load the stored output into the next layer. If the store instruction and the load instruction are merged into a move instruction, the output of the previous layer can be passed directly to the next layer.
Further, the compiler sends the merged move instruction to the memory for storage, so that the processor can read it from the memory at any time and execute the corresponding operation.
In the embodiment of the application, the compiler merges the store and load instructions that act on the same batch of data, reducing unnecessary IO operations and the time cost of executing machine learning instructions.
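A sketch of this store/load merge, assuming store and load instructions expose an addr field identifying the batch of data plus src/dst buffer fields, and that nothing rewrites addr between the pair; all field names are illustrative assumptions.

```python
def merge_store_load(block):
    """Replace a store of some data followed by a load of the same data
    with one move instruction that just switches the on-chip buffer."""
    consumed = set()
    out = []
    for i, inst in enumerate(block):
        if i in consumed:
            continue
        if inst["op"] == "store":
            for j in range(i + 1, len(block)):
                nxt = block[j]
                if nxt["op"] == "load" and nxt["addr"] == inst["addr"]:
                    # Same batch of data: fold the off-chip round trip
                    # into a single on-chip move.
                    out.append({"op": "move", "src": inst["src"],
                                "dst": nxt["dst"]})
                    consumed.add(j)
                    break
            else:   # no matching load found: keep the store as-is
                out.append(inst)
        else:
            out.append(inst)
    return out
```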
In one embodiment, referring to FIG. 8, a specific process in which the compiler reorders the machine learning instructions of a machine learning instruction sequence to realize peephole optimization is described. On the basis of the above embodiment, S206 includes the following steps:
S252, acquiring more than n consecutive first machine learning instructions, where n is the queue length of the instruction issue queue corresponding to the first machine learning instructions;
S254, moving forward a second machine learning instruction that is executed in parallel with the first machine learning instructions in the same time slice, and inserting it after the nth first machine learning instruction, where the number of inserted second machine learning instructions does not exceed the queue length of the instruction issue queue corresponding to the second machine learning instruction.
Optionally, take the pipelined execution of compute instructions and IO instructions as an example. As one implementation, in each time slice of the pipeline the compiler looks for more than n consecutive IO instructions, where n is the queue length of the hardware IO instruction issue queue. If it finds more than n consecutive IO instructions, the compiler moves several compute instructions from the same time slice forward and inserts them after the nth IO instruction, where the number of inserted compute instructions does not exceed the queue length of the instruction issue queue corresponding to the compute instructions.
As another implementation, in each time slice of the pipeline the compiler looks for more than n consecutive compute instructions, where n is the queue length of the hardware compute instruction issue queue. If it finds more than n consecutive compute instructions, the compiler moves several IO instructions from the same time slice forward and inserts them after the nth compute instruction, where the number of inserted IO instructions does not exceed the queue length of the instruction issue queue corresponding to the IO instructions.
The applicant found through research that if the number of consecutive machine learning instructions exceeds the length of the issue queue, the issue queue blocks and performance suffers. Thus, in embodiments of the present application, the compiler prevents the performance loss caused by a full instruction issue queue by reordering machine learning instructions: within a single time slice it can detect whether the IO queue or the compute queue would fill up and move instructions of the other queue forward to relieve the situation, thereby optimizing the performance of the machine learning instructions.
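The queue-aware reordering can be sketched as follows, assuming instructions carry a kind tag, that the queue lengths are known constants, and that instructions within one time slice are mutually independent (the patent states they execute in parallel); the symmetric compute-queue case follows by swapping the roles.

```python
IO_QUEUE_LEN = 4        # assumed length n of the hardware IO issue queue
COMPUTE_QUEUE_LEN = 4   # assumed length of the compute issue queue

def relieve_full_io_queue(slice_insts):
    """If a time slice contains more than n consecutive IO instructions,
    move compute instructions from the same slice in after the n-th IO."""
    ios = [x for x in slice_insts if x["kind"] == "io"]
    computes = [x for x in slice_insts if x["kind"] == "compute"]
    if len(ios) <= IO_QUEUE_LEN:
        return slice_insts            # queue cannot fill up; leave untouched
    moved = computes[:COMPUTE_QUEUE_LEN]   # bounded by the other queue length
    rest = computes[COMPUTE_QUEUE_LEN:]
    return ios[:IO_QUEUE_LEN] + moved + ios[IO_QUEUE_LEN:] + rest
```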
In one embodiment, please refer to FIG. 9, which is directed to another specific process in which the compiler reorders the machine learning instructions of a machine learning instruction sequence to realize peephole optimization. On the basis of the above embodiment, S206 includes the following steps:
S262, acquiring initial relative positions between instruction blocks containing synchronization instructions within a time slice, wherein each instruction block includes at least one machine learning instruction;
S264, adjusting the positions of the instruction blocks relative to other machine learning instructions to obtain final relative positions between the instruction blocks containing synchronization instructions;
S266, if a final relative position is smaller than the corresponding initial relative position, determining the final position of the adjusted instruction block according to the final relative position.
Synchronization instructions are placed between machine learning instructions with data dependences and are used to divide all machine learning instructions into the machine learning instructions of several time steps: the instructions of adjacent time steps execute serially, while the instructions within each time step execute in parallel.
Specifically, in each time slice of the pipeline, the compiler finds the instruction blocks containing synchronization instructions and determines their boundaries, thereby determining the initial relative positions between the instruction blocks of different synchronization instructions. The compiler then adjusts the instruction blocks within the same time slice relative to the other machine learning instructions so that the final relative positions between the instruction blocks of the synchronization instructions are smaller than the initial relative positions. Finally, the compiler determines the final position of each adjusted instruction block according to its final relative position.
Further, the compiler sends the position-adjusted machine learning instructions to the memory for storage, so that the processor can read them from the memory at any time and execute the corresponding operations.
In the embodiment of the application, under a pipelined instruction arrangement, synchronization instructions interrupt the pipelined execution of instructions, and placing synchronization instructions close to one another reduces the overhead of these interruptions. The compiler therefore groups the instruction blocks containing synchronization instructions together as much as possible, reducing the impact of synchronization instructions on the performance of the machine learning computing device.
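The acceptance test of S262-S266 reduces to a small helper: an adjustment is kept only when it brings the two synchronization-instruction blocks closer together. Positions here are assumed to be plain instruction indices, which is an illustrative simplification.

```python
def keep_if_closer(initial_pos, adjusted_pos):
    """Accept an adjustment of two sync-instruction blocks only when the
    final relative position is smaller than the initial one (S262-S266)."""
    initial_gap = abs(initial_pos[1] - initial_pos[0])
    final_gap = abs(adjusted_pos[1] - adjusted_pos[0])
    return adjusted_pos if final_gap < initial_gap else initial_pos
```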
In one embodiment, another possible implementation in which the compiler performs instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions is described. On the basis of the above embodiment, S206 includes the following step:
S272, if several multiplication operations or activation operations on the same piece of data occur in sequence within one time slice, fusing the multiplication operations or activation operations to obtain a fused machine learning instruction.
The optimization of this embodiment can be understood as a "constant folding" optimization. Specifically, the compiler scans the machine learning instructions according to the peephole optimization algorithm, and if several multiplication or activation operations on the same piece of data exist within a time slice, the compiler fuses them to obtain the fused machine learning instruction. Further, the compiler sends the fused machine learning instruction to the memory for storage, so that the processor can read it from the memory at any time and execute the corresponding operation.
Illustratively, when a convolution-relu-convolution sequence of layers occurs in the network, the compiler fuses these layers. Typically, a convolution layer's data is multiplied by a scale, which is a quantization parameter, and divided by the scale after the convolution. The compiler fuses some of these multiply-by-scale and divide-by-scale operations into the relu, thereby reducing overhead. For example, if the result of conv1 is to be multiplied by 1/scale1 and the input of conv2 is to be multiplied by scale2, the compiler can multiply scale2/scale1 into the slope of the relu, eliminating the overhead of two scale multiplications. Note that in this embodiment scale1 is a positive number, while scale2 may be negative.
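As a numeric check of this folding (a sketch with illustrative values, not the device's actual instructions): since scale1 is positive, applying relu after multiplying by 1/scale1 and then multiplying by scale2 equals a single relu whose positive-branch slope is scale2/scale1.

```python
def relu(x, slope=1.0):
    """ReLU whose positive branch is scaled by `slope`."""
    return x * slope if x > 0 else 0.0

scale1, scale2 = 0.5, 2.0   # scale1 must be positive for the fold to hold
x = 3.7                     # illustrative conv1 output value

unfused = relu(x * (1.0 / scale1)) * scale2   # two scale multiplications
fused = relu(x, slope=scale2 / scale1)        # one relu with a folded slope
assert abs(unfused - fused) < 1e-9            # identical results
```

The positivity test is unaffected because dividing by a positive scale1 does not change the sign of x, which is why the fold also remains valid when scale2 is negative.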
It should be understood that although the steps in the flowcharts of FIGS. 2-9 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to that order, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-9 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments; these sub-steps or stages are likewise not necessarily executed in sequence, and may be performed in turn or alternately with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 10, a machine learning instruction conversion apparatus is provided, the apparatus comprising:
an instruction acquisition module 302, configured to acquire a machine learning instruction sequence;
the instruction dividing module 304 is configured to divide the machine learning instruction sequence to obtain at least one basic block, where the basic block includes at least one machine learning instruction;
the instruction conversion module 306, configured to perform instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions.
According to the machine learning instruction conversion apparatus above, a machine learning instruction sequence is obtained and divided into at least one basic block, and the machine learning instructions in the basic block are then converted according to the peephole optimization algorithm to obtain converted machine learning instructions, thereby realizing peephole optimization of machine learning instructions, reducing their time cost, and greatly improving the overall performance of the machine learning computing device.
For specific limitations on the machine learning instruction conversion apparatus, reference may be made to the limitations on the machine learning instruction conversion method above, which are not repeated here. Each module in the apparatus may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in hardware, independent of the processor in the computer device, or stored as software in the memory of the computer device, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a board card is also provided, comprising: a machine learning processor configured to execute the machine learning instruction conversion method of any one of the above embodiments.
In one embodiment, a motherboard is also provided, comprising: a general-purpose processor and the board card described above.
In one embodiment, an electronic device is also provided, which includes the motherboard described above.
In one embodiment, the electronic device may be a terminal, and the internal structure thereof may be as shown in fig. 11. The electronic device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the electronic device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of converting machine learning instructions. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the electronic equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 11 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the electronic device to which the present application is applied, and that a particular electronic device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
According to the board card, the motherboard, and the electronic equipment described above, a machine learning instruction sequence is obtained and divided into at least one basic block, and the machine learning instructions in the basic block are then converted according to the peephole optimization algorithm to obtain converted machine learning instructions, thereby realizing peephole optimization of machine learning instructions, reducing their time cost, and greatly improving the overall performance of the machine learning computing device.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above examples merely represent several embodiments of the present application; their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, and these fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (11)

1. A method of converting machine learning instructions, the method comprising:
acquiring a machine learning instruction sequence;
dividing the machine learning instruction sequence to obtain at least one basic block, wherein the basic block comprises at least one machine learning instruction;
performing instruction conversion on the machine learning instructions in the basic block according to a peephole optimization algorithm to obtain converted machine learning instructions;
wherein the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions comprises:
acquiring an advanceable machine learning instruction in the basic block;
moving the advanceable machine learning instruction forward, and determining, according to the position of the advanceable machine learning instruction, whether a logic error exists in the machine learning instruction sequence;
and if a logic error exists in the machine learning instruction sequence, stopping moving the advanceable machine learning instruction forward, and placing the advanceable machine learning instruction at any moved-forward position for which the machine learning instruction sequence has no logic error.
2. The method of claim 1, wherein dividing the sequence of machine learning instructions to obtain at least one basic block comprises:
searching for a jump instruction in the machine learning instruction sequence;
and dividing the machine learning instruction sequence according to the jump instruction to obtain at least one basic block.
3. The method of claim 1, wherein the acquiring the advanceable machine learning instruction in the basic block comprises:
matching the machine learning instructions in the basic block against preset advanceable machine learning instructions to obtain the advanceable machine learning instruction in the basic block.
4. The method of claim 1, wherein the moving the advanceable machine learning instruction forward comprises:
determining whether moving the advanceable machine learning instruction forward would affect the execution of other machine learning instructions;
and if it is determined that moving the advanceable machine learning instruction forward does not affect the execution of other machine learning instructions, moving the advanceable machine learning instruction forward.
5. The method according to claim 1, wherein the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions comprises:
acquiring a store instruction in the basic block, wherein the store instruction is used for storing first data;
continuing to search, after the store instruction, for a load instruction for loading the first data;
and merging the store instruction and the load instruction to obtain a merged move instruction.
6. The method according to claim 1, wherein the performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain converted machine learning instructions comprises:
acquiring initial relative positions between instruction blocks containing synchronization instructions within a time slice, wherein each instruction block comprises at least one machine learning instruction;
adjusting the positions of the instruction blocks relative to other machine learning instructions to obtain final relative positions between the instruction blocks containing synchronization instructions;
and if a final relative position is smaller than the corresponding initial relative position, determining the final position of the adjusted instruction block according to the final relative position.
7. The method of claim 1, wherein performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain the converted machine learning instructions comprises:
if multiple multiplication operations or activation operations on the same piece of data exist within a time slice, fusing the multiplication operations or the activation operations to obtain a fused machine learning instruction.
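For example, two successive scalar multiplications of the same data fold into one multiplication by the product of the scalars; idempotent activations such as ReLU admit a similar collapse. A sketch under that assumed encoding:

def fuse_multiplies(block):
    """Claim 7 sketch: fuse consecutive scalar multiplies of the same
    data into a single multiply by the product of the scalars."""
    out = []
    for instr in block:
        prev = out[-1] if out else None
        if (prev is not None and prev["op"] == "mul"
                and instr["op"] == "mul" and prev["data"] == instr["data"]):
            prev["scale"] *= instr["scale"]  # fold into the prior multiply
        else:
            out.append(dict(instr))
    return out

block = [{"op": "mul", "data": "x", "scale": 2.0},
         {"op": "mul", "data": "x", "scale": 0.5},
         {"op": "relu", "data": "x"}]
print(fuse_multiplies(block))
# [{'op': 'mul', 'data': 'x', 'scale': 1.0}, {'op': 'relu', 'data': 'x'}]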
8. An apparatus for converting machine learning instructions, the apparatus comprising:
an instruction acquisition module configured to acquire a machine learning instruction sequence;
an instruction division module configured to divide the machine learning instruction sequence to obtain at least one basic block, wherein the basic block comprises at least one machine learning instruction; and
an instruction conversion module configured to perform instruction conversion on the machine learning instructions in the basic block according to a peephole optimization algorithm to obtain converted machine learning instructions;
wherein performing instruction conversion on the machine learning instructions in the basic block according to the peephole optimization algorithm to obtain the converted machine learning instructions comprises:
acquiring an advanceable machine learning instruction in the basic block;
moving the advanceable machine learning instruction forward in position, and judging, according to the position of the advanceable machine learning instruction, whether a logic error exists in the machine learning instruction sequence; and
if a logic error exists in the machine learning instruction sequence, stopping moving the advanceable machine learning instruction forward, and placing the advanceable machine learning instruction at any forward-moved position for which no logic error exists in the machine learning instruction sequence.
9. A board card, characterized in that the board card comprises: a machine learning processor configured to perform the method of any one of claims 1-7.
10. A motherboard, characterized in that the motherboard comprises: a general-purpose processor and the board card of claim 9.
11. An electronic device, characterized in that the electronic device comprises the motherboard of claim 10.
CN202011115613.1A 2019-11-08 2019-11-08 Machine learning instruction conversion method and device, board card, main board and electronic equipment Active CN112257870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011115613.1A CN112257870B (en) 2019-11-08 2019-11-08 Machine learning instruction conversion method and device, board card, main board and electronic equipment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911087323.8A CN110874643B (en) 2019-11-08 2019-11-08 Conversion method and device of machine learning instruction, board card, mainboard and electronic equipment
CN202011115613.1A CN112257870B (en) 2019-11-08 2019-11-08 Machine learning instruction conversion method and device, board card, main board and electronic equipment

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201911087323.8A Division CN110874643B (en) 2019-11-08 2019-11-08 Conversion method and device of machine learning instruction, board card, mainboard and electronic equipment

Publications (2)

Publication Number Publication Date
CN112257870A (en) 2021-01-22
CN112257870B (en) 2024-04-09

Family

ID=69718233

Family Applications (3)

Application Number Title Priority Date Filing Date
CN202011570154.6A Active CN112667241B (en) 2019-11-08 2019-11-08 Machine learning instruction conversion method and device, board card, main board and electronic equipment
CN202011115613.1A Active CN112257870B (en) 2019-11-08 2019-11-08 Machine learning instruction conversion method and device, board card, main board and electronic equipment
CN201911087323.8A Active CN110874643B (en) 2019-11-08 2019-11-08 Conversion method and device of machine learning instruction, board card, mainboard and electronic equipment

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202011570154.6A Active CN112667241B (en) 2019-11-08 2019-11-08 Machine learning instruction conversion method and device, board card, main board and electronic equipment

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201911087323.8A Active CN110874643B (en) 2019-11-08 2019-11-08 Conversion method and device of machine learning instruction, board card, mainboard and electronic equipment

Country Status (1)

Country Link
CN (3) CN112667241B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667241B (en) * 2019-11-08 2023-09-29 安徽寒武纪信息科技有限公司 Machine learning instruction conversion method and device, board card, main board and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021040A (en) * 2016-05-04 2016-10-12 中国人民解放军国防科学技术大学 Linear assembly instruction diversity conversion based DSP soft error detection method
CN110163362A (en) * 2018-02-13 2019-08-23 上海寒武纪信息科技有限公司 A kind of computing device and method

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4656582A (en) * 1985-02-04 1987-04-07 International Business Machines Corporation Generating storage reference instructions in an optimizing compiler
US20060200811A1 (en) * 2005-03-07 2006-09-07 Cheng Stephen M Method of generating optimised stack code
CN100444118C (en) * 2007-03-19 2008-12-17 中国人民解放军国防科学技术大学 Software and hardware combined command relative controlling method based on logic transmitting rank
US20120151187A1 (en) * 2010-12-13 2012-06-14 Microsoft Corporation Instruction optimization
CN102662720B (en) * 2012-03-12 2015-01-28 天津国芯科技有限公司 Optimization method of compiler of multi-issue embedded processor
CN102945148A (en) * 2012-09-26 2013-02-27 中国航天科技集团公司第九研究院第七七一研究所 Method for realizing parallel instruction set
CN102981802B (en) * 2012-11-06 2015-10-07 无锡江南计算技术研究所 A kind of instruction morphing method and system
US9595205B2 (en) * 2012-12-18 2017-03-14 Neuron Fuel, Inc. Systems and methods for goal-based programming instruction
CN104516726B (en) * 2013-09-27 2018-08-07 联想(北京)有限公司 A kind of method and device of instruction processing
CN104049949B (en) * 2014-05-30 2016-10-05 南阳理工学院 A kind of peephole optimization method towards BSWAP instruction
US11281481B2 (en) * 2014-07-25 2022-03-22 Intel Corporation Using a plurality of conversion tables to implement an instruction set agnostic runtime architecture
CN105487839A (en) * 2015-11-24 2016-04-13 无锡江南计算技术研究所 Continuous non-alignment vector data access oriented compiling optimization method
CN105719184A (en) * 2016-01-15 2016-06-29 优品财富管理有限公司 Transaction command conversion method and system
CN105843660B (en) * 2016-03-21 2019-04-02 同济大学 A kind of code optimization dispatching method of compiler
US10789544B2 (en) * 2016-04-05 2020-09-29 Google Llc Batching inputs to a machine learning model
US10817802B2 (en) * 2016-05-07 2020-10-27 Intel Corporation Apparatus for hardware accelerated machine learning
US10169010B2 (en) * 2016-06-01 2019-01-01 International Business Machines Corporation Performing register promotion optimizations in a computer program in regions where memory aliasing may occur and executing the computer program on processor hardware that detects memory aliasing
CN109491659B (en) * 2017-09-11 2022-06-21 龙芯中科技术股份有限公司 Instruction conversion method and device
CN108154238B (en) * 2017-12-25 2020-11-27 东软集团股份有限公司 Migration method and device of machine learning process, storage medium and electronic equipment
CN110045960B (en) * 2018-01-16 2022-02-18 腾讯科技(深圳)有限公司 Chip-based instruction set processing method and device and storage medium
CN108427558A (en) * 2018-02-09 2018-08-21 芯海科技(深圳)股份有限公司 A kind of peephole optimization method of C compilers
CN108924187B (en) * 2018-06-07 2020-05-08 北京百度网讯科技有限公司 Task processing method and device based on machine learning and terminal equipment
CN108845830B (en) * 2018-07-03 2021-12-03 中国人民解放军国防科技大学 Execution method of one-to-one loading instruction
CN112667241B (en) * 2019-11-08 2023-09-29 安徽寒武纪信息科技有限公司 Machine learning instruction conversion method and device, board card, main board and electronic equipment


Also Published As

Publication number Publication date
CN112667241B (en) 2023-09-29
CN110874643B (en) 2021-01-12
CN110874643A (en) 2020-03-10
CN112667241A (en) 2021-04-16
CN112257870A (en) 2021-01-22

Similar Documents

Publication Title
EP3624020A1 (en) Computing method and related product
CN109543816B (en) Convolutional neural network calculation method and system based on weight kneading
US8321492B1 (en) System, method, and computer program product for converting a reduction algorithm to a segmented reduction algorithm
US10877733B2 (en) Segment divider, segment division operation method, and electronic device
CN111915001A (en) Convolution calculation engine, artificial intelligence chip and data processing method
US10558500B2 (en) Scheduling heterogenous processors
EP4071619A1 (en) Address generation method, related device and storage medium
US12061910B2 (en) Dispatching multiply and accumulate operations based on accumulator register index number
US20220291901A1 (en) Data processing method for processing unit, electronic device and computer readable storage medium
US8707013B2 (en) On-demand predicate registers
CN112257870B (en) Machine learning instruction conversion method and device, board card, main board and electronic equipment
CN106682258B (en) Multi-operand addition optimization method and system in high-level comprehensive tool
US7734456B2 (en) Method and apparatus for priority based data processing
US11188328B2 (en) Compute array of a processor with mixed-precision numerical linear algebra support
CN115469931B (en) Instruction optimization method, device, system, equipment and medium of loop program
CN116578425A (en) Load balancing method and system based on rasterization
CN115130672A (en) Method and device for calculating convolution neural network by software and hardware collaborative optimization
CN111798363B (en) Graphics processor
CN114490002A (en) Data processing system, task scheduling method, device, chip and electronic equipment
CN113496270A (en) Hybrid precision neural processing unit using spatial fusion with load balancing
CN114117896A (en) Method and system for realizing binary protocol optimization for ultra-long SIMD pipeline
CN117112033B (en) Random instruction generation method, device, equipment and storage medium
US9298421B2 (en) Performing quotient selection for a carry-save division operation
Gao et al. GPU acceleration of pyrosequencing noise removal
US20240004830A1 (en) Floorplan-optimized matrix extension architecture for processors

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant