CN109754062A - The execution method and Related product of convolution extended instruction - Google Patents


Info

Publication number
CN109754062A
CN109754062A (application CN201711086019.2A)
Authority
CN
China
Prior art keywords
activation
convolution
instruction
subdomain
address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711086019.2A
Other languages
Chinese (zh)
Inventor
Inventor not announced (request not to publish inventor's name)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201711086019.2A priority Critical patent/CN109754062A/en
Publication of CN109754062A publication Critical patent/CN109754062A/en
Pending legal-status Critical Current


Abstract

The present disclosure provides an execution method of a convolution extended instruction and related products. A computing device reads the convolution extended instruction from memory and obtains the instruction's input data, convolution kernel, and auxiliary operation. The convolution extended instruction comprises an operation code and an operation domain; the operation domain comprises registers and an auxiliary field, where the registers are used to determine the address of the input data and the address of the convolution kernel, and the auxiliary field identifies the auxiliary operation. The computing device performs the convolution operation and the auxiliary operation on the input data and the convolution kernel fetched from those addresses. The technical solution provided by the present disclosure has the advantages of reducing the amount of computation and reducing power consumption.

Description

Execution method of a convolution extended instruction and related products
Technical field
The present disclosure relates to the field of neural network technology, and in particular to an execution method of a convolution extended instruction and related products.
Background technique
Convolutional neural networks (CNNs) have in recent years become an efficient recognition approach widely used in fields such as pattern recognition and image processing. They have a simple structure, few training parameters, and are robust to translation, rotation, and scaling. Because the feature-detection layers of a CNN/DNN are learned from training data, explicit feature extraction is avoided when a CNN/DNN is used: features are learned implicitly from the training data. Furthermore, since the neurons on the same feature map share identical weights, the network can learn in parallel, which is a major advantage of convolutional networks over networks in which neurons are fully interconnected.
Applications involving convolution operations are now ubiquitous in the computing field. The present disclosure focuses on convolutional neural networks. The mainstream devices currently capable of executing such operations are as follows:
In the prior art, one known scheme for performing convolutional neural network operations uses a general-purpose processor, which executes general-purpose instructions through general-purpose registers and general-purpose functional units, thereby performing the convolutional neural network operation. One disadvantage of this approach is that a single general-purpose processor is designed chiefly for scalar operations, so its performance on convolutional neural network operations is low. And when multiple general-purpose processors execute in parallel, the communication between them is likely to become the performance bottleneck.
In another prior-art scheme, a graphics processing unit (GPU) performs the vector computation, in which general-purpose SIMD instructions are executed through a general-purpose register file and general-purpose stream processing units to carry out the convolutional neural network operation. However, in this scheme the GPU's on-chip cache is too small: large-scale convolutional neural network operations require constantly moving data on and off chip, and off-chip bandwidth becomes the main performance bottleneck.
Disclosure content
Embodiments of the present disclosure provide an execution method of a convolution extended instruction, the convolution extended instruction itself, and related products, which can alleviate the performance bottleneck and reduce power consumption.
In a first aspect, an embodiment of the present disclosure provides an execution method of a convolution extended instruction, the method comprising the following steps:
a computing device reads the convolution extended instruction from memory and obtains the instruction's input data, convolution kernel, and activation operation;
the convolution extended instruction comprises an operation code and an operation domain; the operation code comprises an identifier of the convolution extended instruction; the operation domain comprises a convolution subdomain and an activation subdomain, the convolution subdomain storing the address of the input data and the address of the convolution kernel, and the activation subdomain comprising either an identification code of the activation operation or an interpolation-table address of the activation operation;
the computing device performs a convolution operation on the input data and the convolution kernel to obtain an intermediate result, and then, according to the activation subdomain, performs the activation operation on the intermediate result to obtain the final result of the instruction.
Optionally, the activation operation comprises: a convolutional neural network Maxout operation, a convolutional neural network PReLU operation, a convolutional neural network RReLU operation, a convolutional neural network Leaky ReLU operation, a nonlinear activation operation, or a linear activation operation.
Optionally, when the activation subdomain comprises an interpolation-table address of the activation operation, performing the activation operation on the intermediate result according to the activation subdomain to obtain the final result of the instruction comprises:
the computing device extracting the interpolation table at the interpolation-table address of the activation operation, and applying the activation operation to the intermediate result using the interpolation table to obtain the final result of the instruction.
Optionally, when the activation subdomain comprises an identification code of the activation operation, performing the activation operation on the intermediate result according to the activation subdomain to obtain the final result of the instruction comprises:
the computing device recognizing the identification code to determine the activation operation, reading the interpolation table of that activation operation, and applying the activation operation to the intermediate result using the interpolation table to obtain the final result of the instruction.
Optionally, the computing device performing the convolution operation on the input data and the convolution kernel to obtain the intermediate result comprises:
a master computing module of the computing device splitting the input data into multiple parts to obtain multiple input sub-data, distributing the multiple input sub-data to multiple slave computing modules, and sending the convolution kernel to the multiple slave computing modules; the multiple slave computing modules executing, in parallel, the multiplication of their input sub-data with the convolution kernel to obtain multiple sub-results; and the master computing module splicing the multiple sub-results to obtain the intermediate result.
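The master/slave split described above can be sketched as follows. This is a minimal illustration in plain Python (the function name and sequential "slaves" are illustrative assumptions; in the device the slave modules run in parallel hardware):

```python
def master_slave_convolution(input_data, kernel, num_slaves=4):
    """Master splits the input into sub-data blocks, each slave multiplies
    its block with the (broadcast) kernel, and the master splices the
    sub-results into the intermediate result."""
    n = len(input_data)
    chunk = (n + num_slaves - 1) // num_slaves
    # Master: split the input data into one sub-block per slave module.
    sub_inputs = [input_data[i:i + chunk] for i in range(0, n, chunk)]

    # Slaves: each computes the multiplication for its own sub-data.
    sub_results = [[x * kernel for x in sub] for sub in sub_inputs]

    # Master: splice the sub-results into the intermediate result.
    return [y for sub in sub_results for y in sub]

print(master_slave_convolution([1, 2, 3, 4, 5, 6, 7, 8], 2))
# -> [2, 4, 6, 8, 10, 12, 14, 16]
```

Because the multiplications are independent per sub-block, the splice reproduces exactly what a single module would have computed.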
In a second aspect, a computing device is provided, the computing device comprising: a memory, a register unit, an interconnection module, an arithmetic unit, a controller unit, and a data access unit;
wherein the arithmetic unit comprises an addition operator and a multiplication operator;
the controller unit is configured to read the convolution extended instruction from the memory and obtain the instruction's input data, convolution kernel, and activation operation;
the convolution extended instruction comprises an operation code and an operation domain; the operation code comprises an identifier of the convolution extended instruction; the operation domain comprises a convolution subdomain and an activation subdomain, the convolution subdomain storing the address of the input data and the address of the convolution kernel, and the activation subdomain comprising either an identification code of the activation operation or an interpolation-table address of the activation operation;
the data access unit is configured to fetch the input data and the convolution kernel located at the input-data address and the convolution-kernel address, respectively;
the arithmetic unit is configured to perform the convolution operation on the input data and the convolution kernel to obtain an intermediate result, and, according to the activation subdomain, to perform the activation operation on the intermediate result to obtain the final result of the instruction.
Optionally, the activation operation comprises: a convolutional neural network Maxout operation, a convolutional neural network PReLU operation, a convolutional neural network RReLU operation, a convolutional neural network Leaky ReLU operation, a nonlinear activation operation, or a linear activation operation.
Optionally, if the activation subdomain comprises an interpolation-table address of the activation operation:
the data access unit is configured to extract the interpolation table at that interpolation-table address; and
the arithmetic unit is configured to apply the activation operation to the intermediate result using the interpolation table to obtain the final result of the instruction.
Optionally, if the activation subdomain comprises an identification code of the activation operation, the arithmetic unit further comprises an activation operator;
the controller unit is configured to recognize the identification code to determine the activation operation; and
the activation operator is configured to fetch the interpolation table of the activation operation and apply the activation operation to the intermediate result using the interpolation table to obtain the final result of the instruction.
Optionally, the arithmetic unit further comprises a master computing module and multiple slave computing modules, each of the master and slave computing modules comprising an addition operator and a multiplication operator;
the master computing module is configured to split the input data into multiple parts to obtain multiple input sub-data, distribute the multiple input sub-data to the multiple slave computing modules, and send the convolution kernel to the multiple slave computing modules; the multiple slave computing modules are configured to execute, in parallel, the multiplication of the input sub-data with the convolution kernel to obtain multiple sub-results; and the master computing module is configured to splice the multiple sub-results to obtain the intermediate result.
In a third aspect, a computer-readable storage medium is provided that stores a computer program for electronic data interchange, wherein the computer program causes a computer to execute the method provided in the first aspect.
In a fourth aspect, a computer program product is provided, the computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to execute the method of the first aspect.
In a fifth aspect, a chip is provided, the chip comprising the computing device provided in the second aspect.
In a sixth aspect, a chip package structure is provided, the chip package structure comprising the chip provided in the fifth aspect.
In a seventh aspect, a board card is provided, the board card comprising the chip package structure provided in the sixth aspect.
In an eighth aspect, an electronic device is provided, the electronic device comprising the board card provided in the seventh aspect.
As can be seen that realizing convolution algorithm by present disclosure embodiment with single instruction and activating the excellent of operation Point, so it, which has, reduces the advantages of calculating the time, saving power consumption.
Detailed description of the invention
To illustrate the technical solutions in the embodiments of the present disclosure more clearly, the drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings described below show only some embodiments of the present disclosure; those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a kind of structural schematic diagram for calculating equipment that present disclosure provides.
Fig. 2 is the schematic block diagram for the interconnection module that present disclosure embodiment provides.
Fig. 2a is a schematic block diagram of the master computing module in the device for executing a convolutional neural network forward operation provided by an embodiment of the present disclosure.
Fig. 2b is a schematic block diagram of a slave computing module in the device for executing a convolutional neural network forward operation provided by an embodiment of the present disclosure.
Fig. 3 is the flow chart that the convolutional neural networks arithmetic unit that present disclosure embodiment provides executes convolution transform instruction.
Fig. 3 a is a kind of schematic diagram for convolution kernel that present disclosure embodiment provides.
Fig. 3 b is a kind of schematic diagram for input data that present disclosure embodiment provides.
Fig. 3 c is a kind of schematic diagram of the movement for convolution kernel that present disclosure embodiment provides.
Fig. 3 d is the schematic diagram of the movement for another convolution kernel that present disclosure embodiment provides.
Fig. 3 e is the schematic diagram of the movement for another convolution kernel that present disclosure embodiment provides.
Specific embodiment
The technical solutions in the embodiments of the present disclosure are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments herein without creative effort fall within the scope of protection of the present disclosure.
The terms "first", "second", "third", "fourth", and the like in the specification, claims, and drawings of the present disclosure are used to distinguish different objects rather than to describe a particular order. Furthermore, the terms "comprise" and "have", and any variants thereof, are intended to cover non-exclusive inclusion: a process, method, system, product, or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally further comprises steps or units that are not listed, or optionally further comprises other steps or units inherent to the process, method, product, or device.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present disclosure. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
Illustrate the method for convolution ordering calculation by taking convolution algorithm instructs as an example below, convolution instruction can be applied in mind Through certainly in practical applications, also can be applied in other calculating scenes, present disclosure is not intended to limit above-mentioned convolution in network Concrete implementation scene is instructed, convolution algorithm instruction is referred to as convolutional neural networks.For convolution instruction, in fact The formula that border needs to be implemented can be s=s (∑ wxi+ b) wherein, i.e., by convolution kernel w (may include multiple data) multiplied by input Data xi,
It sums, then primary Calculation result h can be obtained plus biasing b according to actual calculate, then preliminary meter Activation operation s (h) can also be done by calculating result, to obtain final output result S.Calculating topology can be obtained according to the formula Structure is multiplicative operator-adder calculator-activation arithmetic unit.
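The multiply, accumulate, bias, and activate steps above can be sketched in a few lines. This is only an illustration of the formula S = s(∑ w x_i + b), not of the patented hardware; the function names are assumptions:

```python
def conv_activate(weights, inputs, bias, activation):
    """Fused multiply -> accumulate -> bias -> activation step:
    h = sum(w_i * x_i) + b is the preliminary (intermediate) result,
    and activation(h) is the final result S."""
    h = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(h)

relu = lambda h: max(0.0, h)
# s(1*3 + 2*(-1) + 0.5) = s(1.5) = 1.5 under ReLU
print(conv_activate([1.0, 2.0], [3.0, -1.0], 0.5, relu))  # -> 1.5
```

The point of the extended instruction is that this entire chain runs under one instruction fetch, rather than one instruction for h and a second for s(h).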
With existing convolution instructions, executing an activation operation requires multiple instructions. Taking the above formula as an example, a convolution operation instruction is first needed to obtain the preliminary result h, and a convolution activation instruction then performs the activation operation on h; that is, at least two convolution instructions are needed to obtain the result S of the above formula. This approach requires, first of all, several instructions; moreover, because the chip or computing device must repeatedly fetch the data the instructions invoke, it incurs more computational overhead, and the power consumption is also higher.
The present disclosure provides a computing device, shown in Fig. 1, comprising: a storage medium 111, a register unit 112, an interconnection module 113, an arithmetic unit 114, a controller unit 115, and a data access unit 116;
wherein the arithmetic unit 114 may comprise a multiplication operator and an addition operator, and may of course further comprise at least one of: a comparator, an activation operator, and an OP converter.
The interconnection module 113 is configured to control the connection relationships of the operators in the arithmetic unit 114 so that at least two kinds of operators can form different computation topologies.
The register unit 112 is configured to store the operation instruction, the addresses of the input data and the convolution kernel in the storage medium, and the computation topology corresponding to the operation instruction.
The storage medium 111 may be an off-chip memory or, of course, in practical applications, an on-chip memory; it is configured to store the input data and the convolution kernel, which may specifically be vectors, matrices, or multidimensional data.
The controller unit 115 is configured to extract from the register unit 112 the operation instruction (which may specifically be a convolution instruction), the operation domain corresponding to the operation instruction, and the first computation topology corresponding to the operation instruction; to decode the operation instruction into an execution instruction, which controls the arithmetic unit to perform the arithmetic operation; to transmit the operation domain to the data access unit 116; and to transmit the computation topology to the interconnection module 113.
The data access unit 116 is configured to extract the input data and convolution kernel corresponding to the operation domain from the storage medium 111, and transmit the input data and convolution kernel to the arithmetic unit 114.
The interconnection module 113 is configured to form the first computation topology by controlling the connection relationships of the operators in the arithmetic unit 114.
The arithmetic unit 114 is configured to invoke the operators according to the first computation topology and the execution instruction, perform the arithmetic operation on the data block to obtain an operation result, and transmit the operation result to the data access unit for storage in the storage medium.
The operation instruction may be as shown in Fig. 1 and comprises an operation domain and an operation code. Taking the convolution operation instruction as an example, the operation domain may comprise a convolution subdomain and an activation subdomain, as shown in Table 1, where register number 0, register number 1, register number 2, and register number 3 (each optionally a register file) may constitute the convolution subdomain, and, specifically, register number 4 (optionally a register file) may be the activation subdomain.
Table 1:

    Operation code  | Register 0 | Register 1 | Register 2 | Register 3 | Register 4
    CONV_ACTIVATE   | <--------- convolution subdomain --------->      | activation subdomain
When the activation subdomain carries an activation-function interpolation-table address, the computing device can dispense with a dedicated activation operator, and the decoder parsing overhead for the activation-function setting can also be saved, which reduces the amount of computation and saves chip power consumption and area. The concrete implementation is described in detail below. If, for example, a CONV_ACTIVATE instruction contains an activation-function interpolation-table address, then after executing the convolution operation to obtain its result (i.e. the intermediate result), the instruction extracts the interpolation table at that address and applies the activation operation to the convolution result, directly obtaining the final result. In this way only one CONV_ACTIVATE instruction needs to be read, and it executes without a separate activation operator, so the instruction-parsing overhead is small, the amount of computation is reduced, and hardware structure is saved. If instead CONV_ACTIVATE contains an activation-function operation code, then after executing the convolution operation to obtain the convolution result, the activation-function operation code is parsed to obtain the corresponding activation function, which is then sent to the activation operator; the activation operator extracts the interpolation table according to that activation function and applies the activation operation to the convolution result. This path requires parsing the instruction multiple times and, in addition, a separate activation operator to execute the activation operation.
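As a sketch of the interpolation-table path described above, the snippet below applies a piecewise-linear table lookup to an intermediate result. The table layout (sorted breakpoints with values, linear interpolation between them) is an assumption for illustration, not the patented format:

```python
import bisect

def activate_via_interpolation_table(h, xs, ys):
    """Apply an activation to intermediate result h using an interpolation
    table: breakpoints xs (sorted) with values ys, clamped at the ends and
    linearly interpolated in between."""
    if h <= xs[0]:
        return ys[0]
    if h >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, h) - 1          # segment containing h
    t = (h - xs[i]) / (xs[i + 1] - xs[i])       # position within segment
    return ys[i] + t * (ys[i + 1] - ys[i])

# A tiny table approximating ReLU on [-1, 1]:
xs = [-1.0, 0.0, 1.0]
ys = [0.0, 0.0, 1.0]
print(activate_via_interpolation_table(0.5, xs, ys))  # -> 0.5
```

Because any table can be stored at the interpolation-table address, one lookup mechanism serves ReLU, sigmoid, tanh, and other activations without dedicated hardware per function.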
The operation instruction may also be as shown in Table 2, comprising: the operation code CONV_AC_OP, register number 0, register number 1, register number 2, register number 3, register number 4 (each optionally a register file), and an auxiliary operation code. Register numbers 0 to 3 may constitute the convolution subdomain, register number 4 may be the activation subdomain, and the OP operation code may be the OP subdomain, specifically as shown in Table 2.
Table 2:

    Operation code | Register 0 | Register 1 | Register 2 | Register 3 | Register 4           | Auxiliary operation code
    CONV_AC_OP     | <--------- convolution subdomain --------->      | activation subdomain | OP subdomain
The above operation instructions may form a convolution instruction set, which comprises convolutional neural network instructions of different functions: the CONV instruction, the CONV_ACTIVATE instruction, and the CONV_OP instruction, as well as the CONFIG, I/O, NOP, JUMP, and MOVE instructions.
The auxiliary operation code shown in Table 1 and Table 2 may specifically encode the computing operation and the operator connection relationships. Taking the OP operation as an example, there are many possible OP operations; suppose 1 denotes transposition and 0 denotes conjugation, and suppose the auxiliary operation code is 4 bits (in practical applications it may also have another bit width, such as 6 or 8 bits). For the auxiliary operation code of CONV_OP, if it is 1111, it may denote a transposition operation; the objects on which the transposition may need to be executed include the input data, the convolution kernel, and the preliminary result. Suppose here that the 2nd bit of 1111 indicates whether the input data undergoes the OP operation, the 3rd bit indicates whether the convolution kernel undergoes it, and the 4th bit indicates whether the preliminary result undergoes it, with 1 meaning the OP operation is executed and 0 meaning it is not. Of course, other operations are also possible in practical applications.
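The bit layout just described can be decoded as follows. The 1-indexed, most-significant-first bit numbering follows the text; the field names are illustrative assumptions:

```python
def decode_aux_op(aux_code):
    """Decode a 4-bit auxiliary operation code: the 2nd bit flags an OP
    (e.g. transposition) on the input data, the 3rd bit on the convolution
    kernel, and the 4th bit on the preliminary result."""
    bits = format(aux_code, "04b")          # e.g. 0b1111 -> "1111"
    return {
        "op_on_input":  bits[1] == "1",     # 2nd bit
        "op_on_kernel": bits[2] == "1",     # 3rd bit
        "op_on_result": bits[3] == "1",     # 4th bit
    }

print(decode_aux_op(0b1111))
# -> {'op_on_input': True, 'op_on_kernel': True, 'op_on_result': True}
```

A wider code (6 or 8 bits, as the text allows) would simply extend this map with further flags.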
In one embodiment, the CONV_ACTIVATE instruction comprises:
a convolution activation instruction. According to this instruction, the device fetches input data of a set size and a convolution kernel from specified addresses in the memory (preferably a scratchpad memory), performs the convolution operation in the convolution arithmetic unit, and then applies the activation-function operation to the output result; the above set size may be defined by the manufacturer or the user.
The convolution activation instruction may specifically comprise:
a convolutional neural network Maxout instruction, which may specifically comprise: the device fetches input data of a set size and a convolution kernel from specified addresses in the memory (preferably a scratchpad memory), performs the convolution operation in the convolution arithmetic unit, and then applies Maxout activation to the output result; the above set size may be defined by the manufacturer or the user. A specific manifestation of the convolutional neural network Maxout instruction may be adding the interpolation table of Maxout, or a Maxout operation code, in register number 4 of the operation domain of the CONV_ACTIVATE instruction.
For Maxout, the mathematical expression may be:

h_i = max_j z_ij, where z_ij = x^T W_ij + b_ij

where h_i denotes the Maxout output result, W_ij denotes the convolution kernel, b_ij denotes the bias, and x^T denotes the transpose of the input data.
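The Maxout expression above can be demonstrated directly. This is a plain-Python sketch of h_i = max_j (x . W_ij + b_ij); the nested-list layout of W and b is an assumption for illustration:

```python
def maxout(x, W, b):
    """Maxout: for each output i, take the max over j of the affine pieces
    x . W[i][j] + b[i][j], where W[i][j] is a weight vector and b[i][j] a
    scalar bias."""
    def affine(w, bias):
        return sum(wk * xk for wk, xk in zip(w, x)) + bias
    return [max(affine(W[i][j], b[i][j]) for j in range(len(W[i])))
            for i in range(len(W))]

x = [1.0, 2.0]
W = [[[1.0, 0.0], [0.0, 1.0]]]   # one output unit with two linear pieces
b = [[0.0, 0.0]]
print(maxout(x, W, b))           # max(1*1 + 0*2, 0*1 + 1*2) -> [2.0]
```

Unlike ReLU-family activations, Maxout has no fixed shape: each output is the upper envelope of its learned linear pieces.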
a convolutional neural network PReLU instruction, for applying PReLU activation to the output result of the computing device according to the instruction: the device fetches input data of a set size and a convolution kernel from specified addresses in the scratchpad memory, performs the convolution operation in the convolution arithmetic unit, and then applies PReLU activation to the output result. A specific manifestation of the convolutional neural network PReLU instruction may be adding the interpolation table of PReLU, or a PReLU operation code, in register number 4 of the operation domain of the CONV_ACTIVATE instruction.
a convolutional neural network RReLU instruction, for applying RReLU activation to the output result of the computing device according to the instruction: the device fetches input data of a set size and a convolution kernel from specified addresses in the scratchpad memory, performs the convolution operation in the convolution arithmetic unit, and then applies RReLU activation to the output result. A specific manifestation of the convolutional neural network RReLU instruction may be adding the interpolation table of RReLU, or an RReLU operation code, in register number 4 of the operation domain of the CONV_ACTIVATE instruction.
a convolutional neural network Leaky ReLU instruction, for applying Leaky ReLU activation to the output result of the computing device according to the instruction: the device fetches input data of a set size and a convolution kernel from specified addresses in the scratchpad memory, performs the convolution operation in the convolution arithmetic unit, and then applies Leaky ReLU activation to the output result. A specific manifestation of the convolutional neural network Leaky ReLU instruction may be adding the interpolation table of Leaky ReLU, or a Leaky ReLU operation code, in register number 4 of the operation domain of the CONV_ACTIVATE instruction.
For ReLU, the mathematical expression is: f(x) = max(0, x).

For Leaky ReLU, RReLU, and PReLU, the mathematical expression may be:

f(x) = αx for x < 0, f(x) = x for x ≥ 0

In this expression, different choices of α correspond to Leaky ReLU, RReLU, or PReLU: when α > 0 it is PReLU; when α < 0 it is Leaky ReLU; and when α is a random number drawn from a Gaussian distribution, it is RReLU.
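The shared form f(x) = αx for x < 0, f(x) = x for x ≥ 0 can be sketched as a single function parameterized by α (the function name is illustrative):

```python
import random

def leaky_relu_family(x, alpha):
    """f(x) = alpha*x for x < 0, f(x) = x for x >= 0: the common form of
    Leaky ReLU, PReLU, and RReLU; only how alpha is chosen differs."""
    return x if x >= 0 else alpha * x

print(leaky_relu_family(3.0, 0.01))    # -> 3.0   (positive inputs pass through)
print(leaky_relu_family(-2.0, 0.01))   # -> -0.02 (small fixed alpha)
print(leaky_relu_family(-2.0, random.uniform(0.0, 0.1)))  # randomized alpha (RReLU-style)
```

This is why one instruction template with a per-variant operation code or interpolation table in register number 4 covers the whole family.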
The CONV_ACTIVATE instruction may also comprise other operation instructions that perform nonlinear or linear activation operations.
In one embodiment, the CONV_OP instruction comprises:
a convolution transform instruction. According to this instruction, the device fetches input data of a set size and a convolution kernel from specified addresses in the memory (preferably a scratchpad memory), applies the transform operation to the input data and/or the convolution kernel in the OP (conjugation or transposition) operator, then performs the convolution operation in the convolution arithmetic unit, and then transforms the output result; the above set size and OP type may be defined by the manufacturer or the user.
The convolution transform instruction specifically comprises:
a convolutional neural network Reshape instruction, for applying a Reshape operation to the output result of the computing device according to the instruction: the device fetches input data of a set size and a convolution kernel from specified addresses in the memory (preferably a scratchpad memory), performs a reshape operation (dimension reform, e.g. NCHW -> CHWN) in the OP operator, then performs the convolution operation in the convolution arithmetic unit, and then applies the reshape operation to the output result; the above set size may be defined by the manufacturer or the user.
So-called dimension reform refers to reordering the four dimensions of the input data and convolution kernel of the convolution operation.
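A dimension reform such as NCHW -> CHWN is a pure permutation of axes. The following plain-Python sketch makes that concrete on a nested-list 4-D block (a library implementation would simply transpose strides instead of copying):

```python
def reshape_nchw_to_chwn(data):
    """Dimension reform NCHW -> CHWN on a nested-list 4-D data block:
    element data[n][c][h][w] moves to out[c][h][w][n]."""
    N, C = len(data), len(data[0])
    H, W = len(data[0][0]), len(data[0][0][0])
    return [[[[data[n][c][h][w] for n in range(N)]
              for w in range(W)]
             for h in range(H)]
            for c in range(C)]

# A 2x1x1x2 block (N=2, C=1, H=1, W=2):
block = [[[[1, 2]]], [[[3, 4]]]]
print(reshape_nchw_to_chwn(block))  # CHWN -> [[[[1, 3], [2, 4]]]]
```

No values change, only their addressing order, which is why such an operation can be fused into the convolution instruction as an OP step.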
Fig. 3a shows M convolution kernels, each of which is a 5*3*3 three-dimensional data block, so its operation window is also a 5*3*3 three-dimensional data block. In the M convolution kernels shown in Fig. 3a, KH and KW denote the kernel dimensions corresponding, respectively, to the H dimension and the W dimension of the input data. The grey blocks in Figs. 3c, 3d, and 3e are the data used by the sliding operation window in each operation; the sliding direction may take H as the slide direction first and then W, or take W as the slide direction first and then H. Specifically, for convolution, the operation at each sliding-window position is the inner product of the data block indicated by the grey blocks with each of the M convolution-kernel data blocks shown in "Fig. 3a convolution 1 - convolution kernel"; the convolution outputs one numerical value per convolution kernel at each sliding-window position, i.e. M output values for each sliding window. In Figs. 3a to 3e, one block denotes one numerical value, which may also be called a weight. The numbers used in the schematic diagrams are examples only; in practice each dimension may be any number (including the case where some dimension is 1, in which case the four-dimensional data block automatically becomes a three-dimensional data block; for example, when the number of samples computed simultaneously is 1, the input data is a three-dimensional data block, and when the number of convolution kernels is 1, the convolution-kernel data is a three-dimensional data block). The chip device is used to carry out the convolution operation between input data B and convolution kernel A.
For a convolutional layer, the weight (the set of all convolution kernels) is as shown in "Fig. 3a convolution 1 - convolution kernel". Denote the number of kernels by M; each kernel consists of C matrices of KH rows and KW columns, so the weight of a convolutional layer can be expressed as a four-dimensional data block whose four dimensions are M, C, KH and KW. The input data of the convolutional layer is also a four-dimensional data block, composed of N three-dimensional data blocks, each of which consists of C feature matrices of H rows and W columns (i.e. a data block whose four dimensions are N, C, H and W).
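The four-dimensional layouts above can be illustrated with a small NumPy sketch (an illustration only, with hypothetical sizes; the patent describes hardware, not NumPy code): at each sliding-window position, the C*KH*KW window is inner-multiplied with each of the M kernels, giving M output values per position.

```python
import numpy as np

# Hypothetical sizes (the patent's example uses kernels of 5*3*3, i.e. C=5, KH=KW=3).
N, C, H, W = 1, 5, 8, 8      # input:  N samples, C channels, H rows, W columns
M, KH, KW = 4, 3, 3          # weight: M kernels, each of size C x KH x KW

inp = np.random.rand(N, C, H, W)
kernels = np.random.rand(M, C, KH, KW)

# One sliding-window position (h, w): the C*KH*KW window is inner-multiplied
# with each of the M kernels, yielding M output values for that position.
h, w = 2, 3
window = inp[0, :, h:h+KH, w:w+KW]                 # shape (C, KH, KW)
outputs = np.array([np.sum(window * k) for k in kernels])
assert outputs.shape == (M,)                       # M values per window position
```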
Convolutional neural network Pad instruction: used to perform a Pad operation according to the instruction. The device takes out input data and a convolution kernel of the set size from the specified addresses of the memory (preferably a scratchpad memory), performs a pad (periphery-expansion) operation on the convolution kernel in the OP arithmetic unit, and then performs the convolution operation in the convolution component. The above set size can be defined by the manufacturer or by the user. A concrete form of the convolutional neural network Pad instruction is to add a Pad operation code to the auxiliary operation code in the operation domain of the CONV_OP or CONV_AC_OP instruction.
Periphery expansion refers to adding N rings of data around the periphery of the convolution kernel, where N is a positive integer. When N is 1, the instruction format can remain unchanged. One ring means that the original two-dimensional data block of size kh*kw is expanded, by filling its periphery, to size (kh+2N)*(kw+2N).
If N is greater than 1, there are two options. Either the instruction format adds an operation domain (register 5) to store the value of N, i.e. the operation domain of CONV_OP gains a register 5 that stores N; or the instruction format stays unchanged and the method of executing the instruction changes: before the CONV instruction is executed, the value of N is loaded with a CONFIG instruction, and the pad operation is performed before the CONV instruction executes.
The added data may all be 0; this is the most basic pad operation.
Optionally, the added data may be a random distribution of 0s and 1s. In this case the operation code is changed to conv-pad-random. The method adds one step: a random-number generator produces the values that the pad needs to fill, (kh+2N)*(kw+2N) - kh*kw values in total.
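The two pad variants above can be sketched as follows (an illustrative NumPy model only; the function name and interface are hypothetical and do not reflect the instruction's actual encoding):

```python
import numpy as np

def pad_kernel(kernel, n, mode="zero", rng=None):
    """Expand a kh*kw kernel to (kh+2n)*(kw+2n) by adding n rings of data.

    mode="zero"   -> the basic pad: all added data are 0.
    mode="random" -> added data are a random distribution of 0s and 1s
                     (the conv-pad-random variant described above).
    """
    kh, kw = kernel.shape
    if mode == "zero":
        return np.pad(kernel, n)                    # fills with zeros by default
    rng = rng or np.random.default_rng(0)
    out = rng.integers(0, 2, size=(kh + 2*n, kw + 2*n)).astype(kernel.dtype)
    out[n:n+kh, n:n+kw] = kernel                    # original data kept in the centre
    return out

k = np.arange(9.0).reshape(3, 3)                    # kh = kw = 3
p = pad_kernel(k, n=1)
assert p.shape == (5, 5)
# Number of newly filled values: (kh+2n)*(kw+2n) - kh*kw = 25 - 9 = 16
assert p.size - k.size == 16
```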
Convolutional neural network Crop instruction: used to perform a Crop operation according to the instruction. The device takes out input data and a convolution kernel of the set size from the specified addresses of the memory (preferably a scratchpad memory), performs a crop (size-cutting) operation on the input in the OP arithmetic unit, and then performs the convolution operation in the convolution component. The above set size can be defined by the manufacturer or by the user.
Size cutting is defined as intercepting a two-dimensional data block of size H1*W1 from the original two-dimensional data block of size H*W, where H1 and W1 are less than or equal to H and W respectively.
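Size cutting can be sketched as a slice (an illustrative model; the offset parameters are hypothetical, as the patent does not specify where within H*W the H1*W1 block is taken):

```python
import numpy as np

def crop(data, h1, w1, top=0, left=0):
    """Cut an H1*W1 block out of an H*W two-dimensional data block (H1 <= H, W1 <= W)."""
    h, w = data.shape
    assert h1 <= h and w1 <= w
    return data[top:top+h1, left:left+w1]

x = np.arange(36).reshape(6, 6)     # original block, H = W = 6
y = crop(x, 4, 4, top=1, left=1)    # intercept a 4*4 block
assert y.shape == (4, 4)
```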
Convolutional neural network Dilate instruction: used to perform a Dilate operation according to the instruction. The device takes out input data and a convolution kernel of the set size from the specified addresses of the memory (preferably a scratchpad memory), performs a dilate (internal zero-insertion) operation on the convolution kernel in the OP arithmetic unit, and then performs the convolution operation in the convolution component. The above set size can be defined by the manufacturer or by the user.
Dilate (internal zero-insertion) is defined as follows: for a convolution kernel of size kh*kw, 0s or random numbers are inserted evenly or randomly in its interior (whereas the pad operation above works on the periphery), "diluting" the convolution kernel. Doing so can enhance the feature-extraction effect of the convolution kernel.
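The even-insertion case can be sketched as follows (an illustrative model of inserting d-1 zeros between adjacent kernel elements; the dilation factor d is a hypothetical parameter, since the patent does not fix the insertion pattern):

```python
import numpy as np

def dilate_kernel(kernel, d):
    """Insert d-1 zeros between adjacent kernel elements, "diluting" the
    kernel's interior (pad, by contrast, expands the periphery)."""
    kh, kw = kernel.shape
    out = np.zeros((d*(kh - 1) + 1, d*(kw - 1) + 1), dtype=kernel.dtype)
    out[::d, ::d] = kernel          # original weights land on a stride-d grid
    return out

k = np.ones((3, 3))
dk = dilate_kernel(k, d=2)
assert dk.shape == (5, 5)
assert dk.sum() == k.sum()          # only zeros were inserted
```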
The CONV_OP instruction may also include other transformation instructions, such as applying a BLAS transformation to the input or to the weight.
The above instruction set includes convolutional neural network CONV_AC_OP instructions of different functions as well as the CONFIG, I/O, NOP, JUMP and MOVE instructions.
In one embodiment, through the setting of the auxiliary operation code, CONV_AC_OP can realize any combination of the CONV, ACTIVATE and OP operations.
Fig. 2 schematically shows one embodiment of the interconnection module 113: an H-tree module. The interconnection module 113 constitutes the data paths between the main computing module 5 and the multiple slave computing modules 6, and is a binary-tree path made up of multiple nodes: each node merges the data returned by its two downstream nodes, likewise sends the data from upstream to its two downstream nodes, and returns the merged data to its upstream node. For example, in the starting calculation phase of the convolutional neural network, the neuron data in the main computing module 5 is sent through the interconnection module 4 to each slave computing module 6; after the calculation of the slave computing modules 6 is completed, the neuron values output by the slave computing modules are combined step by step in the interconnection module into a complete vector composed of neurons. As an illustration, suppose the device contains N slave computing modules in total. The input data Xi is sent to the N slave computing modules; each slave computing module convolves Xi with the convolution kernel corresponding to that module, obtaining one scalar, and the scalars of the slave computing modules are merged by the interconnection module 4 into an intermediate vector containing N elements. Suppose the convolution window traverses A*B input data blocks Xi in total (A in the X direction and B in the Y direction, X and Y being axes of a three-dimensional orthogonal coordinate system); then the above convolution operation is performed for all A*B blocks Xi, and all the resulting vectors are merged in the main computing module into a three-dimensional intermediate result of size A*B*N.
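The merge pattern just described can be modelled in a few lines (a functional sketch only, with hypothetical sizes; it models what the H-tree computes, not how the tree hardware routes it): each of the N slave modules produces one scalar per window position, the interconnect merges them into an N-element vector, and the A*B window positions yield an A*B*N block.

```python
import numpy as np

# Hypothetical sizes: N_slave kernels, A*B sliding-window positions.
N_slave, C, KH, KW = 4, 2, 3, 3
A, B = 5, 6

kernels = np.random.rand(N_slave, C, KH, KW)
windows = np.random.rand(A, B, C, KH, KW)   # the A*B input blocks Xi

result = np.zeros((A, B, N_slave))
for a in range(A):
    for b in range(B):
        xi = windows[a, b]
        # Each slave module computes one scalar (inner product with its kernel);
        # the interconnect merges the N_slave scalars into one intermediate vector.
        result[a, b] = [np.sum(xi * k) for k in kernels]

assert result.shape == (A, B, N_slave)      # the A*B*N intermediate block
```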
Fig. 2a shows an example block diagram of the structure of the main computing module 5 in the device for executing the convolutional neural network forward operation according to an embodiment of the present disclosure. As shown in Fig. 2a, the main computing module 5 includes a first arithmetic unit 51, a first data dependence judging unit 52 and a first storage unit 53.
The first arithmetic unit 51 includes a vector addition unit 511 and an activation unit 512. The first arithmetic unit 51 receives the control signal from the controller unit and completes the various calculation functions of the main computing module 5. The vector addition unit 511 implements the bias-add operation in the forward calculation of the convolutional neural network: it adds the bias data to the intermediate result element-wise to obtain a biased result, and the activation unit 512 performs the activation-function operation on the biased result. The bias data may be read in from the external address space, or may be stored locally.
The first data dependence judging unit 52 is the port through which the first arithmetic unit 51 reads and writes the first storage unit 53, and it guarantees the read-write consistency of the data in the first storage unit 53. At the same time, the first data dependence judging unit 52 is also responsible for sending the data read from the first storage unit 53 to the slave computing modules through the interconnection module 4, while the output data of the slave computing modules 6 is transmitted directly to the first arithmetic unit 51 through the interconnection module 4. The instructions output by the controller unit 2 are sent to the arithmetic unit 51 and the first data dependence judging unit 52 to control their behavior.
The first storage unit 53 caches the input data and output data used by the main computing module 5 in the calculation process.
Fig. 2b shows an example block diagram of the structure of a slave computing module 6 in the device for executing the convolutional neural network forward operation according to an embodiment of the present disclosure. As shown in Fig. 2b, each slave computing module 6 includes a second arithmetic unit 61, a data dependence judging unit 62, a second storage unit 63 and a third storage unit 64.
The second arithmetic unit 61 receives the control signal issued by the controller unit 2 and carries out the convolution operation. The second arithmetic unit includes an OP conversion unit 808, a vector multiplication unit 611 and a summation unit 612, which are responsible, respectively, for the vector multiplication, the accumulation and the OP transformation in the convolution operation.
The second data dependence judging unit 62 is responsible for the read and write operations on the second storage unit 63 in the calculation process. Before performing a read or write operation, the second data dependence judging unit 62 first guarantees that there is no read-write consistency conflict in the data used by the instructions. For example, all control signals sent to the data dependence unit 62 are stored in an instruction queue inside the data dependence unit 62; in this queue, if the range of data read by a read instruction conflicts with the range of data written by a write instruction positioned earlier in the queue, the read instruction can execute only after the write instruction on which it depends has been performed.
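The queue-based consistency check just described can be sketched as an address-range overlap test (an illustrative software model with hypothetical names, not the unit's actual circuit):

```python
def ranges_overlap(r1, r2):
    """Two half-open address ranges [start, end) conflict if they intersect."""
    return r1[0] < r2[1] and r2[0] < r1[1]

def can_issue(read_range, pending_writes):
    """A read instruction may execute only if its data range does not overlap
    the write range of any earlier, still-pending write in the queue."""
    return all(not ranges_overlap(read_range, w) for w in pending_writes)

pending = [(0, 16), (32, 48)]               # write ranges queued ahead of the read
assert can_issue((16, 32), pending)         # disjoint ranges: may execute
assert not can_issue((40, 44), pending)     # overlaps (32, 48): must wait
```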
The second storage unit 63 caches the input data and the output scalar data of the slave computing module 6.
The third storage unit 64 caches the convolution-kernel data needed by the slave computing module 6 in the calculation process.
Fig. 3 is a flow chart of the convolutional neural network computing device provided by an embodiment of the present disclosure executing a convolution transformation instruction. As shown in Fig. 3, the process of executing a convolutional neural network instruction is described here taking CONV_AC_OP as an example; in practical applications it may also be another extended instruction, such as a CONV_ACTIVATE or CONV_OP instruction. When the extended instruction is a CONV_OP instruction, only the OP operation needs to be performed, and the activation of the biased data in S9 is not performed; that is, for a CONV_OP instruction, the biased result is the final output result. When the extended instruction is a CONV_ACTIVATE instruction, the instruction does not need the OP module, and the OP transformation in step S7 is not performed.
In step S1, an I/O instruction is pre-stored at the first address of the register unit 112.
In step S2, the operation starts. The controller unit 115 reads this I/O instruction from the first address of the register unit 112, and according to the decoded control signal, the data access unit 116 reads all the corresponding convolutional neural network operation instructions from the storage medium 111 and caches them in the register unit 112.
In step S3, the controller unit 115 reads the next I/O instruction from the register unit 112; according to the decoded control signal, the data access unit 116 reads from the storage medium 111 all the data needed by the main computing module 5 (e.g. including the input data, the interpolation table for fast activation-function operation, the constant table for configuring the parameters of the arithmetic device, the bias data, etc.) into the first storage unit 53 of the main computing module 5.
In step S4, the controller unit 115 reads the next I/O instruction from the register unit 112; according to the decoded control signal, the data access unit 116 reads from the storage medium 111 the convolution-kernel data needed by the slave computing modules 6.
In step S5, the controller unit 115 reads the next CONFIG instruction from the register unit 112; according to the decoded control signal, the device configures the various constants needed by the calculation of this layer of the neural network. For example, the first arithmetic unit 51 and the second arithmetic unit 61 configure the values of their internal registers according to the parameters in the control signal; the parameters include, for example, the data needed by the activation function, as well as the constants needed by the OP operation, such as the pad N, the crop H1 and W1, and the reshape dimension order.
In step S6, the controller unit 115 then reads the next CONV_AC_OP instruction from the register unit 112; according to the decoded control signal, the main computing module 5 first sends the input data inside the convolution window to each slave computing module 6 through the interconnection module 113 and saves it to the second storage unit 63 of the slave computing module 6, and then moves the convolution window according to the instruction.
In step S7, according to the control signal decoded from the CONV_AC_OP instruction, the arithmetic unit 61 of a slave computing module 6 reads the convolution kernel from the third storage unit 64 and reads the input data from the second storage unit 63; the OP module applies the OP transformation to the input data and the convolution kernel, then the arithmetic unit 61 of the slave computing module 6 performs the convolution operation of the (OP-transformed) input data and the (OP-transformed) convolution kernel, and the intermediate result is returned through the interconnection module 113.
In step S8, in the interconnection module 113, the intermediate results returned by the slave computing modules 6 are combined step by step into a complete intermediate vector.
In step S9, the main computing module 5 obtains the intermediate vector returned by the interconnection module 4. After the convolution window has traversed all the input data, the main computing module splices all the returned vectors into an intermediate result. (Optionally,) according to the control signal decoded from the CONV_AC_OP instruction, the bias data is read from the first storage unit 53 and added to the intermediate result by the vector addition unit 511 to obtain a biased result; the main computing module 5 then reads the interpolation table corresponding to the activation-function interpolation-table address in register number 4 of CONV_AC_OP, performs the activation operation on the biased result with the interpolation table to obtain the final output data, and writes the final output data back into the first storage unit 53.
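The bias-add plus interpolation-table activation of step S9 can be sketched as follows (an illustrative model only; the tabulated sigmoid, the breakpoint count and all variable names are assumptions, since the patent does not specify the table's contents):

```python
import numpy as np

# Hypothetical interpolation table for a sigmoid-like activation: sample the
# function at fixed breakpoints, then evaluate by linear interpolation.
xs = np.linspace(-6.0, 6.0, 65)             # table breakpoints
ys = 1.0 / (1.0 + np.exp(-xs))              # tabulated activation values

def activate(v, xs=xs, ys=ys):
    return np.interp(v, xs, ys)             # table lookup + linear interpolation

intermediate = np.array([-1.0, 0.0, 2.0])   # spliced intermediate result
bias = np.array([0.5, 0.0, -0.5])
biased = intermediate + bias                # vector-add the bias (unit 511)
output = activate(biased)                   # final output data (unit 512)
```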
In step S10, the controller unit 115 then reads the next I/O instruction from the instruction storage unit; according to the decoded control signal, the data access unit 116 stores the output data in the first storage unit 53 to the specified address of the external address space, and the operation ends.
An embodiment of the present disclosure also provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data interchange, and the computer program causes a computer to execute some or all of the steps of the execution method of any convolution extended instruction recorded in the above method embodiments.
An embodiment of the present disclosure also provides a computer program product, the computer program product comprising a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute some or all of the steps of the execution method of any convolution extended instruction recorded in the above method embodiments.
Another embodiment of the present disclosure also discloses a chip, which includes the neural network computing device of the above embodiments (as shown in Fig. 1).
Another embodiment of the present disclosure also discloses a chip packaging structure, which includes the above chip.
Another embodiment of the present disclosure also discloses a board card, which includes the above chip packaging structure.
Another embodiment of the present disclosure also discloses an electronic device, which includes the above board card.
The electronic device includes a data processing device, robot, computer, printer, scanner, tablet computer, intelligent terminal, mobile phone, driving recorder, navigator, sensor, webcam, cloud server, camera, video camera, projector, watch, earphone, mobile storage, wearable device, vehicle, household appliance, and/or medical device.
The vehicle includes an aircraft, a ship and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound scanner and/or an electrocardiograph.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present disclosure is not limited by the described order of actions, because according to the present disclosure some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in this specification are alternative embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis; for a part not described in detail in one embodiment, reference can be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division into units is only a division by logical function, and in actual implementation there may be other division manners, e.g. multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical or of other forms.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
The embodiments of the present disclosure have been described in detail above; specific examples are used herein to explain the principles and implementation of the present disclosure, and the description of the above embodiments is only intended to help understand the method of the present disclosure and its core idea. At the same time, those skilled in the art may make changes to the specific implementation and application scope according to the idea of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (13)

1. An execution method of a convolution extended instruction, characterized in that the method includes the following steps:
a computing device reads the convolution extended instruction from a memory and obtains the input data, the convolution kernel and the activation operation of the convolution extended instruction;
the convolution extended instruction includes an operation code and an operation domain; the operation code includes the identifier of the convolution extended instruction; the operation domain includes a convolution subdomain and an activation subdomain; the convolution subdomain includes the address storing the input data and the address of the convolution kernel, and the activation subdomain includes the identification code of the activation operation or the interpolation-table address of the activation operation;
the computing device performs a convolution operation on the input data and the convolution kernel to obtain an intermediate result, and performs the activation operation on the intermediate result through the activation subdomain to obtain the final result of the instruction.
2. The method according to claim 1, characterized in that
the activation operation includes: a convolutional neural network Maxout operation, a convolutional neural network PReLU operation, a convolutional neural network RReLU operation, a convolutional neural network Leaky ReLU operation, a nonlinear activation operation, or a linear activation operation.
3. The method according to claim 1, characterized in that, if the activation subdomain includes the interpolation-table address of the activation operation, performing the activation operation on the intermediate result through the activation subdomain to obtain the final result of the instruction includes:
the computing device extracts the interpolation table corresponding to the interpolation-table address of the activation operation, and performs the activation operation on the intermediate result with the interpolation table to obtain the final result of the instruction.
4. The method according to claim 1, characterized in that, if the activation subdomain includes the identification code of the activation operation, performing the activation operation on the intermediate result through the activation subdomain to obtain the final result of the instruction includes:
the computing device identifies the identification code of the activation operation to determine the activation operation, reads the interpolation table of the activation operation, and performs the activation operation on the interpolation table and the intermediate result to obtain the final result of the instruction.
5. The method according to claim 1, characterized in that the computing device performing the convolution operation on the input data and the convolution kernel to obtain the intermediate result includes:
the main computing module of the computing device splits the input data into multiple parts to obtain multiple input sub-data, distributes the multiple input sub-data to multiple slave computing modules, and sends the convolution kernel to the multiple slave computing modules; the multiple slave computing modules perform the multiplication of the input sub-data and the convolution kernel in parallel to obtain multiple sub-results, and the main computing module of the computing device splices the multiple sub-results to obtain the intermediate result.
6. A computing device, characterized in that the computing device includes: a memory, an arithmetic unit, an interconnection module, a controller unit and a data access unit;
wherein the arithmetic unit includes an adder and a multiplier;
the controller unit is configured to read the convolution extended instruction from the memory and obtain the input data, the convolution kernel and the activation operation of the convolution extended instruction;
the convolution extended instruction includes an operation code and an operation domain; the operation code includes the identifier of the convolution extended instruction; the operation domain includes a convolution subdomain and an activation subdomain; the convolution subdomain includes the address storing the input data and the address of the convolution kernel, and the activation subdomain includes the identification code of the activation operation or the interpolation-table address of the activation operation;
the data access unit is configured to obtain the input data corresponding to the address of the input data and the convolution kernel corresponding to the address of the convolution kernel;
the arithmetic unit is configured to perform a convolution operation on the input data and the convolution kernel to obtain an intermediate result, and to perform the activation operation on the intermediate result through the activation subdomain to obtain the final result of the instruction.
7. The computing device according to claim 6, characterized in that
the activation operation includes: a convolutional neural network Maxout operation, a convolutional neural network PReLU operation, a convolutional neural network RReLU operation, a convolutional neural network Leaky ReLU operation, a nonlinear activation operation, or a linear activation operation.
8. The computing device according to claim 6, characterized in that, if the activation subdomain includes the interpolation-table address of the activation operation:
the data access unit is configured to extract the interpolation table corresponding to the interpolation-table address of the activation operation;
the arithmetic unit is configured to perform the activation operation on the intermediate result with the interpolation table to obtain the final result of the instruction.
9. The computing device according to claim 6, characterized in that, if the activation subdomain includes the identification code of the activation operation, the arithmetic unit further includes an activation arithmetic unit;
the controller unit is configured to identify the identification code of the activation operation to determine the activation operation;
the activation arithmetic unit is configured to fetch the interpolation table of the activation operation and to perform the activation operation on the interpolation table and the intermediate result to obtain the final result of the instruction.
10. The computing device according to claim 8, characterized in that the arithmetic unit further includes a main computing module and multiple slave computing modules, the main computing module includes an adder and a multiplier, and each slave computing module includes an adder and a multiplier;
the main computing module is configured to split the input data into multiple parts to obtain multiple input sub-data, to distribute the multiple input sub-data to the multiple slave computing modules, and to send the convolution kernel to the multiple slave computing modules; the multiple slave computing modules are configured to perform the multiplication of the input sub-data and the convolution kernel in parallel to obtain multiple sub-results; and the main computing module is configured to splice the multiple sub-results to obtain the intermediate result.
11. A computer-readable storage medium, characterized in that it stores a computer program for electronic data interchange, wherein the computer program causes a computer to execute the method according to any one of claims 1-5.
12. A computer program product, characterized in that the computer program product includes a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the method according to any one of claims 1-5.
13. An electronic device, characterized in that the electronic device includes a processor, and the processor includes the computing device according to any one of claims 6-10.
CN201711086019.2A 2017-11-07 2017-11-07 The execution method and Related product of convolution extended instruction Pending CN109754062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711086019.2A CN109754062A (en) 2017-11-07 2017-11-07 The execution method and Related product of convolution extended instruction


Publications (1)

Publication Number Publication Date
CN109754062A true CN109754062A (en) 2019-05-14

Family

ID=66400175


Country Status (1)

Country Link
CN (1) CN109754062A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110363168A (en) * 2019-07-19 2019-10-22 山东浪潮人工智能研究院有限公司 A kind of 3 dimensional drawing identifying system based on convolutional neural networks
CN111047036A (en) * 2019-12-09 2020-04-21 Oppo广东移动通信有限公司 Neural network processor, chip and electronic equipment
CN111090524A (en) * 2019-12-18 2020-05-01 山东浪潮人工智能研究院有限公司 Instruction data structure suitable for edge AI (Artificial Intelligence) calculation and design method thereof
CN111199273A (en) * 2019-12-31 2020-05-26 深圳云天励飞技术有限公司 Convolution calculation method, device, equipment and storage medium
CN112257843A (en) * 2020-09-23 2021-01-22 浙江大学 System for expanding instruction set based on MobileNetV1 network inference task
CN112650974A (en) * 2020-12-30 2021-04-13 南京大学 Efficient transposition convolution calculation method

Citations (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102637157A (en) * 2011-02-15 2012-08-15 郑磊 DTSOC (digital template system on chip)
CN102947818A (en) * 2010-05-19 2013-02-27 加利福尼亚大学董事会 Neural processing unit

Patent Citations (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102947818A (en) * 2010-05-19 2013-02-27 The Regents of the University of California Neural processing unit
CN102637157A (en) * 2011-02-15 2012-08-15 Zheng Lei DTSOC (digital template system on chip)
CN104756129A (en) * 2012-10-01 2015-07-01 Arm Limited A secure mechanism to switch between different domains of operation in a data processor
EP3035204A1 (en) * 2014-12-19 2016-06-22 Intel Corporation Storage device and method for performing convolution operations
US9600763B1 (en) * 2015-10-20 2017-03-21 Fujitsu Limited Information processing method, information processing device, and non-transitory recording medium for storing program
CN105468335A (en) * 2015-11-24 2016-04-06 Institute of Computing Technology, Chinese Academy of Sciences Pipeline-level operation device, data processing method and network-on-chip chip
CN106991077A (en) * 2016-01-20 2017-07-28 Nanjing Aixi Information Technology Co., Ltd. A matrix computation device
CN106991477A (en) * 2016-01-20 2017-07-28 Nanjing Aixi Information Technology Co., Ltd. An artificial neural network compression-encoding device and method
CN106990940A (en) * 2016-01-20 2017-07-28 Nanjing Aixi Information Technology Co., Ltd. A vector computation device
CN107305486A (en) * 2016-04-19 2017-10-31 Beijing Zhongke Cambricon Technology Co., Ltd. A neural network maxout layer computation device
CN107305538A (en) * 2016-04-22 2017-10-31 Beijing Zhongke Cambricon Technology Co., Ltd. A matrix operation device and method
CN107315718A (en) * 2016-04-26 2017-11-03 Beijing Zhongke Cambricon Technology Co., Ltd. An apparatus and method for performing vector inner product operations
CN107315716A (en) * 2016-04-26 2017-11-03 Beijing Zhongke Cambricon Technology Co., Ltd. An apparatus and method for performing vector outer product operations
CN107315574A (en) * 2016-04-26 2017-11-03 Beijing Zhongke Cambricon Technology Co., Ltd. An apparatus and method for performing matrix multiplication
CN107315715A (en) * 2016-04-26 2017-11-03 Beijing Zhongke Cambricon Technology Co., Ltd. An apparatus and method for performing matrix addition/subtraction operations
CN107315717A (en) * 2016-04-26 2017-11-03 Beijing Zhongke Cambricon Technology Co., Ltd. An apparatus and method for performing vector arithmetic operations
CN107315566A (en) * 2016-04-26 2017-11-03 Beijing Zhongke Cambricon Technology Co., Ltd. An apparatus and method for performing vector circular shift operations
CN107315568A (en) * 2016-04-26 2017-11-03 Beijing Zhongke Cambricon Technology Co., Ltd. A device for performing vector logic operations
CN107316078A (en) * 2016-04-27 2017-11-03 Beijing Zhongke Cambricon Technology Co., Ltd. Apparatus and method for performing artificial neural network self-learning operations
WO2017185335A1 (en) * 2016-04-29 2017-11-02 Beijing Zhongke Cambricon Technology Co., Ltd. Apparatus and method for executing batch normalization operation
WO2017185418A1 (en) * 2016-04-29 2017-11-02 Beijing Zhongke Cambricon Technology Co., Ltd. Device and method for performing neural network computation and matrix/vector computation
CN106228240A (en) * 2016-07-30 2016-12-14 Fudan University FPGA-based implementation method for deep convolutional neural networks
CN106355244A (en) * 2016-08-30 2017-01-25 Shenzhen Nuobilin Technology Co., Ltd. CNN (convolutional neural network) construction method and system
CN107239824A (en) * 2016-12-05 2017-10-10 Beijing Deephi Intelligent Technology Co., Ltd. Apparatus and method for implementing a sparse convolutional neural network accelerator
CN106845631A (en) * 2016-12-26 2017-06-13 Shanghai Cambricon Information Technology Co., Ltd. A stream execution method and device
CN107292458A (en) * 2017-08-07 2017-10-24 Beijing Vimicro Electronics Co., Ltd. A prediction method and prediction apparatus applied to a neural network chip

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Shijin Zhang et al.: "Cambricon-X: An Accelerator for Sparse Neural Networks", 2016 IEEE, 31 December 2016 (2016-12-31), pages 1 - 12 *
Niu Yuhu: "Convolutional sparse autoencoder neural networks" (卷积稀疏自编码神经网络), no. 02, pages 22 - 29 *
Chen Xu et al.: "Deep learning algorithms and examples for convolutional networks" (卷积网络深度学习算法与实例), no. 06, pages 24 - 30 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110363168A (en) * 2019-07-19 2019-10-22 Shandong Inspur Artificial Intelligence Research Institute Co., Ltd. A three-dimensional drawing recognition system based on convolutional neural networks
CN111047036A (en) * 2019-12-09 2020-04-21 Guangdong Oppo Mobile Telecommunications Co., Ltd. Neural network processor, chip and electronic equipment
CN111047036B (en) * 2019-12-09 2023-11-14 Guangdong Oppo Mobile Telecommunications Co., Ltd. Neural network processor, chip and electronic equipment
CN111090524A (en) * 2019-12-18 2020-05-01 Shandong Inspur Artificial Intelligence Research Institute Co., Ltd. Instruction data structure suitable for edge AI (artificial intelligence) computation and design method thereof
CN111199273A (en) * 2019-12-31 2020-05-26 Shenzhen Intellifusion Technologies Co., Ltd. Convolution calculation method, device, equipment and storage medium
CN111199273B (en) * 2019-12-31 2024-03-26 Shenzhen Intellifusion Technologies Co., Ltd. Convolution calculation method, device, equipment and storage medium
CN112257843A (en) * 2020-09-23 2021-01-22 Zhejiang University System for extending the instruction set based on MobileNetV1 network inference tasks
CN112257843B (en) * 2020-09-23 2022-06-28 Zhejiang University System for extending the instruction set based on MobileNetV1 network inference tasks
CN112650974A (en) * 2020-12-30 2021-04-13 Nanjing University Efficient transposed convolution computation method
CN112650974B (en) * 2020-12-30 2023-10-13 Nanjing University Efficient transposed convolution computation method

Similar Documents

Publication Publication Date Title
CN109754062A (en) Execution method of convolution extension instruction and related products
CN109101273B (en) Neural network processing device and method for executing vector maximum value instruction
CN107329734B (en) Apparatus and method for performing convolutional neural network forward operation
KR102443546B1 (en) Matrix multiplier
CN109117948B (en) Method for converting picture style and related products
CN108229654B (en) Neural network convolution operation device and method
CN112612521A (en) Apparatus and method for performing matrix multiplication operation
CN107341547A (en) An apparatus and method for performing convolutional neural network training
CN107632965B (en) Reconfigurable S-type (sigmoid) operation device and operation method
CN109032670A (en) Neural network processing device and its method for executing vector replication instructions
CN111047022B (en) Computing device and related products
CN110147249A (en) A computation method and device for a network model
EP3561732A1 (en) Operation apparatus and method for artificial neural network
CN109726353A (en) Convolution operation device and method
CN109993273A (en) Convolution implementation method of a convolutional neural network and related products
CN107305486B (en) Neural network maxout layer computing device
CN107957977A (en) A computation method and related products
CN110059797A (en) A computing device and related products
CN109389213A (en) Storage device and method, data processing device and method, electronic device
CN109754061A (en) Execution method of convolution extension instruction and related products
CN112801276B (en) Data processing method, processor and electronic equipment
CN113469326A (en) Integrated circuit device and board for executing pruning optimization in a neural network model
CN110472734A (en) A computing device and related products
CN113469365B (en) Inference and compilation method based on a neural network model and related products
CN114692846A (en) Data processing device, data processing method and related products

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination