CN109754062B

CN109754062B - Execution method of convolution expansion instruction and related product

Info

Publication number: CN109754062B
Application number: CN201711086019.2A
Authority: CN
Inventors: 请求不公布姓名
Original assignee: Shanghai Cambricon Information Technology Co Ltd
Current assignee: Shanghai Cambricon Information Technology Co Ltd
Priority date: 2017-11-07
Filing date: 2017-11-07
Publication date: 2024-05-14
Anticipated expiration: 2037-11-07
Also published as: CN109754062A

Abstract

The disclosure provides a method for realizing convolution expansion instruction and a related product, comprising the following steps: the computing device reads the convolution expansion instruction from the memory to acquire input data, convolution kernel and auxiliary operation of the convolution expansion instruction; the convolution expansion instruction includes: an opcode and an operation field, the operation field comprising: a register for determining an address of input data and an address of a convolution kernel, and an auxiliary field for identifying an auxiliary operation; the computing device performs a convolution operation and an auxiliary operation on the address of the input data and the address of the convolution kernel. The technical scheme provided by the disclosure has the advantages of reducing the calculated amount and reducing the power consumption.

Description

Execution method of convolution expansion instruction and related product

Technical Field

The disclosure relates to the technical field of neural networks, in particular to a method for realizing convolution expansion instructions and related products.

Background

The convolutional neural network is a high-efficiency recognition algorithm which is widely applied to the fields of pattern recognition, image processing and the like in recent years, and has the characteristics of simple structure, few training parameters, strong adaptability, translation, rotation, scaling and the like. Since the feature detection layer of CNN/DNN learns through training data, the feature extraction displayed is avoided when CNN/DNN is used, and learning is implicitly performed from the training data; furthermore, because the weights of the neurons on the same feature mapping plane are the same, the network can learn in parallel, which is also a great advantage of convolutional networks with respect to networks in which the neurons are connected to each other.

In existing computer field applications, applications related to convolution operations are very common. The present invention is directed to convolutional neural networks, and the current mainstream devices that can perform this operation are as follows:

in the prior art, a known solution for performing convolutional neural network operations is to use a general purpose processor, which executes general purpose instructions via a general purpose register file and general purpose functional blocks, thereby performing convolutional neural network operations. However, one of the drawbacks of this approach is that a single general purpose processor is used for scalar computation, which is less computationally efficient when performing convolutional neural network operations. When a plurality of general-purpose processors are used for parallel execution, the intercommunication among the general-purpose processors may become a performance bottleneck.

In another prior art technique, a Graphics Processor (GPU) is used for vector computation, where convolutional neural network operations are performed by executing a general purpose SIMD instruction using a general purpose register file and a general purpose stream processing unit. However, in the above scheme, the on-chip buffer of the GPU is too small, and off-chip data handling is required to be continuously performed when performing large-scale convolutional neural network operation, and off-chip bandwidth becomes a main performance bottleneck.

Disclosure of the invention

The embodiment of the disclosure provides a method for realizing a convolution extended instruction, the convolution extended instruction and related products, which can realize the advantages of improving performance bottleneck and reducing power consumption.

In a first aspect, an embodiment of the present disclosure provides a method for executing a convolution expansion instruction, the method including the steps of:

the computing device reads the convolution expansion instruction from the memory to acquire input data of the convolution expansion instruction, a convolution kernel and an activation operation;

The convolution expansion instruction includes: an opcode and an opcode field, the opcode comprising: the identity of the convolution expansion instruction; the operation domain includes: a convolution sub-domain and an activation sub-domain, the convolution sub-domain comprising: storing an address of input data and an address of a convolution kernel, the activation subfield comprising: an identification code of the activation operation or an interpolation table address of the activation operation;

the computing device executes convolution operation on the input data and the convolution kernel to obtain an intermediate result, and executes activation operation on the intermediate result through the activation subdomain to obtain a final result of the instruction.

Optionally, the activating operation includes: convolutional neural network Maxout operation, convolutional neural network PReLU operation, convolutional neural network RReLU operation, convolutional neural network leak ReLU operation, nonlinear activation operation, or linear activation operation.

Optionally, activating the subdomain includes: an interpolation table address of an activate operation, said performing an activate operation on said intermediate result by said activate subfield obtaining a final result of said instruction, comprising:

and the computing device extracts an interpolation table corresponding to the interpolation table address of the activating operation, and executes activating operation on the intermediate result and the interpolation table to obtain a final result of the instruction.

Optionally, activating the subdomain includes: and an identification code of an activation operation, wherein the execution of the activation operation on the intermediate result through the activation subdomain obtains a final result of the instruction, and the identification code comprises the following components:

The computing device identifies the identification code of the activation operation, determines the activation operation, reads an interpolation table of the activation operation, and executes activation operation on the interpolation table and the intermediate result to obtain a final result of the instruction.

Optionally, the computing device performs a convolution operation on the input data and the convolution kernel to obtain an intermediate result, including:

The main operation module of the computing device splits the input data into a plurality of parts to obtain a plurality of input sub-data, distributes the plurality of input sub-data to a plurality of auxiliary operation modules, sends convolution kernels to the plurality of auxiliary operation modules, and the plurality of auxiliary operation modules execute multiplication operation of the input sub-data and the convolution kernels in parallel to obtain a plurality of sub-results, and the main operation module of the computing device splices the plurality of sub-results to obtain the intermediate result.

In a second aspect, there is provided a computing device comprising: the device comprises a memory, an operation unit, an interconnection module, an operation unit, a controller unit and a data access unit;

wherein the arithmetic unit includes: an adder and a multiplier;

A controller unit for reading the convolution expansion instruction from a memory to obtain input data of the convolution expansion instruction, a convolution kernel, and an activation operation;

the data access unit is used for acquiring the input data and the convolution kernel corresponding to the address of the input data and the address of the convolution kernel;

And the operation unit is used for performing convolution operation on the input data and the convolution kernel to obtain an intermediate result, and performing activation operation on the intermediate result through the activation subdomain to obtain a final result of the instruction.

Optionally, activating the subdomain includes: activating an interpolation table address of the operation;

The data access unit is used for extracting an interpolation table corresponding to the interpolation table address of the activation operation;

and the operation unit is used for executing the activation operation on the intermediate result and the interpolation table to obtain the final result of the instruction.

Optionally, activating the subdomain includes: an identification code for the activation operation; the arithmetic unit further includes: activating an arithmetic unit;

The controller unit is used for identifying the identification code of the activation operation to determine the activation operation;

And the activation arithmetic unit is used for taking the interpolation table of the activation operation, and executing the activation operation on the interpolation table and the intermediate result to obtain the final result of the instruction.

Optionally, the operation unit further includes: the system comprises a master operation module and a plurality of slave operation modules, wherein the master operation module comprises: an adder and a multiplier, the slave operation module including: an adder and a multiplier;

The main operation module is used for splitting the input data into a plurality of parts to obtain a plurality of input sub-data, distributing the plurality of input sub-data to the plurality of slave operation modules, sending the convolution kernel to the plurality of slave operation modules, and executing multiplication operation of the input sub-data and the convolution kernel in parallel to obtain a plurality of sub-results, wherein the main operation module is used for splicing the plurality of sub-results to obtain the intermediate result.

In a third aspect, a computer-readable storage medium is provided, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method provided in the first aspect.

In a fourth aspect, there is provided a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform the method of the first aspect.

In a fifth aspect, there is provided a chip comprising the computing device provided in the second aspect.

In a sixth aspect, there is provided a chip packaging structure including the chip provided in the fifth aspect.

In a seventh aspect, a board is provided, where the board includes the chip package structure provided in the sixth aspect.

In an eighth aspect, there is provided an electronic device including a board card provided in the seventh aspect.

It can be seen that by the presently disclosed embodiments, they have the advantage of a single instruction to implement convolution operations as well as activate operations, so they have the advantage of reducing computation time and saving power consumption.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.

Fig. 1 is a schematic structural diagram of a computing device provided by the present disclosure.

Fig. 2 is a schematic block diagram of an interconnect module provided by an embodiment of the present disclosure.

Fig. 2a is a schematic block diagram of a main operation module in an apparatus for performing forward operations of a convolutional neural network provided in an embodiment of the present disclosure.

Fig. 2b is a schematic block diagram of a slave operation module in an apparatus for performing a convolutional neural network forward operation provided by an embodiment of the present disclosure.

Fig. 3 is a flowchart of a convolutional neural network operation device according to an embodiment of the present disclosure executing a convolutional transform instruction.

Fig. 3a is a schematic diagram of a convolution kernel provided by an embodiment of the present disclosure.

Fig. 3b is a schematic diagram of an input data provided by an embodiment of the present disclosure.

Fig. 3c is a schematic diagram of movement of a convolution kernel provided by an embodiment of the present disclosure.

Fig. 3d is a schematic diagram of another convolution kernel movement provided by an embodiment of the present disclosure.

Fig. 3e is a schematic diagram of movement of yet another convolution kernel provided by an embodiment of the present disclosure.

Detailed Description

The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.

The terms "first," "second," "third," and "fourth" and the like in the description and in the claims and drawings are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the disclosure. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

The following describes a method for calculating a convolution instruction by taking the convolution instruction as an example, where the convolution instruction may be applied to a neural network, and of course, in practical application, the convolution instruction may also be applied to other computing scenarios, and the disclosure is not limited to the specific implementation scenario of the convolution instruction. For a convolution instruction, the actual formula to be executed may be s=s (Σwx _i +b), where the convolution kernel w (which may include a plurality of data) is multiplied by the input data x _i, and summed, then the initial calculation result h may be obtained by adding the offset b to the actual calculation, and then the initial calculation result may also be subjected to the activation operation S (h) to obtain the final output result S. The calculation topology is obtained according to the formula, namely, a multiplication operator-addition operator-activation operator.

For the existing convolution instruction, if the activation operation needs to be performed by a plurality of instructions, taking the above formula as an example, firstly, the initial calculation result h needs to be obtained by the convolution operation instruction, and then the activation operation is performed on the h by the convolution activation instruction, that is, at least two convolution instructions are needed to obtain the result S of the above formula, this way firstly, a plurality of numbers are needed for the number of convolution instructions, and in addition, for a chip or a computing device, more calculation overhead is needed due to the need of repeatedly calling data, and power consumption is also higher.

The present disclosure provides a computing device, as shown in fig. 1, comprising: a storage medium 111, a register unit 112, an interconnect module 113, an operation unit 114, a controller unit 115, and a data access unit 116;

the operation unit 114 may include: a multiplication calculator and an addition calculator, of course the arithmetic unit may further comprise: at least one of a comparator, an activation arithmetic unit and an OP converter.

The interconnection module 113 is configured to control a connection relationship of the calculators in the operation unit 114 so that at least two calculators form different calculation topologies.

The register unit 112 is configured to store an operation instruction, input data, an address of a convolution kernel storage medium, and a computation topology corresponding to the convolution instruction.

The storage medium 111 may be an off-chip memory, or of course in practical applications, an on-chip memory, for storing input data, a convolution kernel, which may specifically be vector, matrix or multidimensional data.

The controller unit 115 is configured to extract an operation instruction (specifically, a convolution instruction) from the register unit 112, an operation domain corresponding to the operation instruction, and a first computation topology corresponding to the operation instruction, decode the operation instruction into an execution instruction, and the execution instruction is configured to control the operation unit to execute an operation, transmit the operation domain to the data access unit 116, and transmit the computation topology to the interconnection module 113.

The data access unit 116 is configured to extract the input data and the convolution kernel corresponding to the operation domain from the storage medium 111, and transmit the input data and the convolution kernel to the operation unit 114.

The interconnection module 113 is configured to form a first calculation topology according to a connection relationship of the calculator in the control operation unit 114.

The operation unit 114 is configured to call the calculator to perform an operation on the data block according to the first calculation topology and the execution instruction to obtain an operation result, and transmit the operation result to the data access unit for storage in the storage medium.

The operation instruction may be as shown in fig. 1, including: the operation domain and the operation code take convolution operation instruction as an example, and the operation domain may include: the convolution sub-field and the activation sub-field are shown in table 1, where register number (optional, register may also be a register file) 0, register number (optional, register file) 1, register number (optional, register file) 2, register number (optional, register file) 3, register number 4, register number 0, register number 1, register number 2, register number 3 may be the convolution sub-field, and in particular, register number 4 may be the activation sub-field.

Table 1:

When the address of the function interpolation table is activated, for the computing device, the setting of the activation calculator can be saved, and for the setting of the address of the function interpolation table, the resolving cost of a decoder can be saved, the calculated amount is reduced, and the power consumption and the area of a chip are saved. In the following, a specific implementation manner is described in detail, if the conv_activate contains an address of an interpolation table of an activation function, the conv_activate instruction obtains a result (i.e. an intermediate result) of the convolution operation after performing the convolution operation, and then extracts an interpolation table corresponding to the address of the interpolation table of the activation function to perform an activation operation on the result of the convolution operation to directly obtain the result. The method only needs to read the CONV_ACTIVATE instruction once, and the execution does not need to be carried out by a separate activation calculator, so that the method has the advantages of small instruction analysis cost, reduced calculation amount and hardware configuration saving.

The operation instruction may further include, as shown in table 2: operation code conv_ac_op, register number (optional, register file) 0, register number (optional, register file) 1, register number (optional, register file) 2, register number (optional, register file) 3, register number (optional, register file) 4, auxiliary operation code, register number 0, register number 1, register number 2, register number 3 may be convolution subfield, register number 4 may be activation subfield, OP operation code may be OP subfield, as specified in table 2.

Table 2:

the operation instruction may include a convolutional instruction set, where the instruction set includes a convolutional neural network CONV instruction, a conv_activate instruction, a conv_op, a CONFIG instruction, an IO instruction, a NOP instruction, a JUMP instruction, and a MOVE instruction with different functions.

The auxiliary operation codes shown in table 1 and table 2 may specifically include calculation operations and calculator connection relationships. Taking OP operations as an example, there are various OP operations, assuming that 1 represents transpose, 0 represents conjugate, assuming that the auxiliary operation code may be 4 bits, and in practical application, other bit numbers may also be other bits, for example, 6 bits, 8 bits, etc., and for the auxiliary operation code of conv_op, if it is 1111, this may be represented as transpose operation, where the need to perform transpose operation may include: input data, a convolution kernel, and a preliminary calculation result, it is assumed here that the 2 nd bit of 1111 represents whether the input data performs an OP operation, the 3 rd bit represents whether the convolution kernel performs an OP operation, the 4 th bit represents whether the preliminary calculation result performs an OP operation, it is assumed that 1 may perform an OP operation, and it is assumed that 0 may not perform an OP operation. Of course, other operations are also possible in practice.

In one embodiment, the conv_activate instruction includes:

A convolution activating instruction according to which the apparatus takes out the input data of a set size and a convolution kernel from a specified address of a memory (preferably, a scratch pad memory), respectively, performs a convolution operation in a convolution operation section, and then performs an activation function operation on an output result; the set size may be defined by the manufacturer or the user.

The convolution activation instructions may specifically include:

The convolutional neural network Maxout instructions may specifically include: the device respectively fetches the input data with set size and the convolution kernel from the appointed address of the memory (preferably, the scratch pad memory), and then makes convolution operation in the convolution operation part, and then makes Maxout activation on the output result; the set size may be defined by the manufacturer or the user. The convolutional neural network Maxout instruction may be embodied by adding Maxout interpolation table or Maxout opcode in register number 4 of the operation field of the conv_activate instruction.

For Maxout, the mathematical expression may be:

whereZ_ij＝x^TW_ij+b_ij

Where h _i denotes the output of Maxout, W _ij denotes the convolution kernel, b _ij denotes the offset, and X ^T denotes the transpose of the input data.

A convolutional neural network PReLU instruction, which is used for activating PReLU the output result of the computing device according to the instruction, wherein the device respectively takes out the input data with set size and the convolutional kernel from the designated address of the scratch pad memory, carries out convolutional operation in a convolutional operation part, and then activates PReLU the output result; the convolutional neural network PReLU instruction may be embodied by adding PReLU interpolation table or PReLU opcode in register number 4 of the operation field of the conv_activate instruction.

A convolutional neural network RReLU instruction, which is used for activating RReLU the output result of the computing device according to the instruction, wherein the device respectively takes out the input data with set size and the convolutional kernel from the designated address of the scratch pad memory, carries out convolutional operation in a convolutional operation part, and then activates RReLU the output result; the convolutional neural network RReLU instruction may be embodied by adding RReLU interpolation table or RReLU opcode in register number 4 of the operation field of the conv_activate instruction.

The convolution neural network leak ReLU instruction is used for activating the output result of the computing device according to the instruction, the device respectively takes out the input data with the set size and the convolution kernel from the designated address of the scratch pad memory, carries out convolution operation in the convolution operation component, and then carries out leak ReLU activation on the output result; the specific expression form of the convolutional neural network leak ReLU instruction may be that an interpolation table or a leak ReLU operation code of RReLU is added in a register number 4 of an operation field of the conv_active instruction.

For ReLU, its mathematical expression is: f (X) =max (0, X);

The mathematical expressions of the leak ReLU, RReLU, and prilu may be:

f(X)＝αx(x＜0)，f(X)＝x(x≥0)；

For the above mathematical expression, the values for α correspond to the leak ReLU, RReLU, or PReLU, and when α > 0, it is PReLU; when α < 0, it is a leakage ReLU, and when α is a gaussian-distributed random number, it is RReLU.

The CONV ACTIVATE instruction may also include other operation instructions to perform nonlinear or linear activation operations.

In one embodiment, the conv_op instruction includes:

A convolution conversion instruction according to which the apparatus fetches the input data and the convolution kernel of a set size from the specified address of the memory (preferably, the scratch pad memory), performs a conversion operation on the input data and/or the convolution kernel in an OP (conjugate or transpose) operation section, then performs a convolution operation in a convolution operation section, and then converts the output result; the set size and the OP type can be defined by a manufacturer or a user.

The convolution transformation instruction specifically comprises:

a convolutional neural network Reshape instruction for operating Reshape on the output result of the computing device according to the instruction, the device respectively fetching the input data with the set size and the convolution kernel from the designated address of the memory (preferably, the scratch pad memory), operating reshape (dimensional reforming, such as nchw- > chwn, etc.) in the OP operation part, then operating the convolution operation in the convolution operation part, and then operating reshape on the output result; the set size may be defined by the manufacturer or the user.

By dimensional reformation, it is meant that the four dimensions of the convolution operation and the input data are reformatted.

The M convolution kernels shown in fig. 3a, each convolution kernel is a three-dimensional data block of 5×3×3, and then the operation window of each convolution kernel is also a three-dimensional data block of 5×3×3, where KH and KW in the M convolution kernels shown in fig. 3a represent a dimension corresponding to KH as an H dimension of input data, and the corresponding dimension represented by KW is a W dimension of the input data. The gray squares in fig. 3c, 3d, and 3e are data used for each sliding operation window, and the sliding direction may be H as the sliding direction, W as the sliding direction, or W as the sliding direction, and H as the sliding direction. Specifically, for convolution, the operation at each sliding window is that the inner product operation is respectively performed on a data block represented by a gray part square block in the figure and M convolution kernel data blocks represented by 'convolution 1-convolution kernel of fig. 3 a', and the convolution outputs a value corresponding to each convolution kernel for each sliding window position, namely M output values for each sliding window; the "fig. 3 a-3 e" each use a square to represent a value, which may also be referred to as a weight; the numbers used in the schematic are only illustrative, and in practice the dimensional data may be any number (including a case where a certain dimension is 1, in which case the four-dimensional data block automatically becomes a three-dimensional data block, for example, in a case where the number of samples calculated simultaneously is 1, the input data is a three-dimensional data block, and in a case where the number of convolution kernels is 1, for example, the convolution sum data is a three-dimensional data block). Performing convolution operation between input data B and convolution kernel A by using the chip device;

for one convolution layer, the weight (all convolution kernels) of the convolution layer are shown as 'convolution 1-convolution kernels of fig. 3 a', the number of the convolution kernels is recorded as M, and each convolution kernel consists of a matrix of C KH rows KW columns, so that the weight of the convolution layer can be expressed as a four-dimensional data block with four dimensions of M, C, KH and KW respectively; the input data of the convolution layer is four-dimensional data blocks, and each three-dimensional data block consists of N three-dimensional data blocks, wherein each three-dimensional data block consists of C characteristic matrixes of H rows and W columns (namely, data blocks with four dimensions of N, C, H and W respectively);

A convolutional neural network Pad instruction for performing Pad operation on an output result of the computing device according to the instruction, wherein the device respectively fetches input data of a set size and a convolutional kernel from a specified address of a memory (preferably, a scratch Pad memory), performs Pad (peripheral amplification) operation on the convolutional kernel in an OP operation part, and then performs convolutional operation in the convolution operation part; the set size may be defined by the manufacturer or the user. The specific expression form of the convolutional neural network Pad instruction can be that a Pad operation code is added in an auxiliary operation code of an operation domain of the CONV_OP or CONV_AC_OP instruction.

Peripheral augmentation refers to the addition of N more cycles to the periphery of the convolution kernel, N being a positive integer. N may be 1 this time this instruction format is unchanged. The circle means a two-dimensional data block of original kh×kw, and is expanded to (kh+2n) by peripheral complement.

If N is greater than 1, either the instruction format adds an operation field (register 5) to store the value of N, i.e. adds a register 5 in the operation field of conv_op, which register 5 is used to store the value of N. If the instruction format is unchanged, the method of executing the instruction is changed, the value of n is called by the config instruction before the CONV instruction is executed, and the pad operation is executed before the CONV instruction is executed.

In addition, the data may be all 0s, which is the most basic pad operation.

Alternatively, the data may be randomly distributed with 0 and 1. In this case, the opcode is changed to conv-pad-random. The method comprises the following steps: the random number generator is used to generate the number that the pad needs to fill, totaling (kh+2N) × (kw+2N) -kh×hw data.

The device takes out the input data and convolution kernel with set size from the appointed address of the memory (preferably, the scratch pad memory), makes Crop (size cutting) on the input in the OP operation part, and then makes convolution operation in the convolution operation part, the set size can be defined by manufacturer or user.

The definition of size clipping is to intercept a two-dimensional data block with the size of H1 from a two-dimensional data block with the size of H1W, wherein H1 and W1 are smaller than or equal to H and W.

A convolutional neural network Dilate instruction for Dilate operating on the output result of the computing device according to the instruction, the device respectively fetching the input data of the set size and the convolutional kernel from the designated address of the memory (preferably, the scratch pad memory), performing dilate (internal interpolation 0) operation on the convolutional kernel in the OP operation part, and then performing convolution operation in the convolution operation part; the set size may be defined by the manufacturer or the user.

Dilate (internally inserted 0) is defined as that for a convolution kernel of kh kw, 0 or random number is inserted uniformly or randomly inside (the pad mentioned above is on the periphery) to play the role of "diluting" the convolution kernel, which can enhance the feature extraction effect of the convolution kernel.

The CONV OP instruction may also include other transformation instructions, such as blast transformation of inputs, weights, and the like.

The instruction set comprises a convolutional neural network CONV_AC_OP instruction, a CONFIG instruction, an IO instruction, a NOP instruction, a JUMP instruction and a MOVE instruction with different functions.

In one embodiment, the CONV AC OP may implement any combination of CONV, ACTICATE and OP operations through the setting of auxiliary operation codes.

Fig. 2 schematically illustrates one embodiment of the interconnect module 113: and an H tree module. The interconnection module 113 constitutes a data path between the master computing module 5 and the plurality of slave computing modules 6, and is a binary tree path formed by a plurality of nodes, each node transmits upstream data to two downstream nodes in the same way, and the data returned from the two downstream nodes are combined and returned to the upstream node. For example, in the initial calculation phase of the convolutional neural network, the neuron data in the master operation module 5 is sent to each slave operation module 6 through the interconnection module 4; when the calculation process of the slave operation module 6 is completed, the value of each neuron output by the slave operation module is gradually spliced into a complete vector consisting of neurons in the interconnection module. For example, assuming that there are N slave operation modules in the apparatus, the input data Xi is sent to the N slave operation modules, and each slave operation module convolves the input data Xi with the convolution kernel corresponding to the slave operation module to obtain scalar data, and the scalar data of each slave operation module is combined into an intermediate vector containing N elements by the interconnection module 4. Assuming that the convolution window is traversed in total to obtain a×b (the X direction is a, the Y direction is B, and X, Y is the coordinate axis of the three-dimensional orthogonal coordinate system) input data Xi, performing the convolution operation on the a×b Xi, and merging all the obtained vectors in the main operation module to obtain a three-dimensional intermediate result of a×b×n.

Fig. 2a shows an example block diagram of the structure of the main operation module 5 in an apparatus for performing a convolutional neural network forward operation according to an embodiment of the present disclosure. As shown in fig. 2a, the main operation module 5 includes a first operation unit 51, a first data dependency relationship determination unit 52, and a first storage unit 53.

The first operation unit 51 includes a vector addition unit 511 and an activation unit 512. The first operation unit 51 receives the control signal from the controller unit, completes various operation functions of the main operation module 5, and the vector addition unit 511 is used for implementing an offset adding operation in forward calculation of the convolutional neural network, where the offset adding unit adds offset data to the intermediate result in a bit to obtain an offset result, and the activation operation unit 512 performs an activation function operation on the offset result. The offset data may be read in from an external address space or may be stored locally.

The first data dependency relationship determination unit 52 is a port through which the first arithmetic unit 51 reads and writes the first storage unit 53, and ensures read-write consistency of data in the first storage unit 53. Meanwhile, the first data dependency relationship determination unit 52 is also responsible for transmitting the data read from the first storage unit 53 to the slave operation module through the interconnect module 4, and the output data of the slave operation module 6 is directly transmitted to the first operation unit 51 through the interconnect module 4. The instruction output from the controller unit 2 is sent to the calculation unit 51 and the first data dependency determination unit 52 to control the behavior thereof.

The storage unit 53 is used for buffering input data and output data used in the calculation process of the main operation module 5.

Fig. 2b shows an example block diagram of the structure of the slave operation module 6 in the apparatus for performing the convolutional neural network forward operation according to the embodiment of the present disclosure. As shown in fig. 2b, each slave computing module 6 includes a second computing unit 61, a data dependency relationship determination unit 62, a second storage unit 63, and a third storage unit 64.

The second arithmetic unit 61 receives the control signal from the controller unit 2 and performs convolution operation. The second operation unit includes an OP transform unit 808, a vector multiplication unit 611, and an accumulation unit 612, which are respectively responsible for vector multiplication operation, accumulation operation, and OP transform operation in convolution operation.

The second data dependency relationship determination unit 62 is responsible for the read and write operations to the second storage unit 63 during the calculation. The second data dependency determination unit 62 may first ensure that there is no read-write consistency conflict with the data used between instructions before performing read-write operations. For example, all control signals to the data dependency unit 62 are stored in an instruction queue within the data dependency unit 62, where if the read data range of a read instruction conflicts with the write data range of a write instruction preceding the queue position, the instruction must wait until the dependent write instruction is executed.

The second storage unit 63 caches the input data and the output scalar data of the slave operation module 6.

The third storage unit 64 caches the convolution kernel data required in the calculation process by the slave operation module 6.

Fig. 3 is a flowchart of a convolutional neural network operation device according to an embodiment of the present disclosure executing a convolutional transform instruction, where, as shown in fig. 3, a process of executing the convolutional neural network instruction includes, here, the convolutional neural network instruction to: for example, the conv_ac_op may be another extended instruction, such as a conv_active or conv_op instruction: if the extended instruction is a conv_op instruction, only the OP operation needs to be executed, and the execution of the activation operation on the offset data in S9 is not required, i.e., if the extended instruction is a conv_op instruction, the offset data is the final output result. If the extended instruction is a conv_activate instruction, the operation device does not need an OP module, and in step S7, the OP conversion is not needed.

In step S1, an IO instruction is stored in advance at the head address of the register unit 112.

In step S2, the operation starts, the controller unit 115 reads the IO instruction from the first address of the register unit 112, and according to the decoded control signal, the data access unit 116 reads all the corresponding convolutional neural network operation instructions from the storage medium 111 and buffers them in the register unit 112.

In step S3, the controller unit 115 reads in the next IO instruction from the register unit 11, and based on the decoded control signal, the data access unit 116 reads all data (including, for example, input data, an interpolation table for performing fast activation function operation, a constant table for configuring the operation device parameters, offset data, and the like) required for the main operation module 5 from the storage medium 111 to the first storage unit 53 of the main operation module 5.

In step S4, the controller unit 115 reads in the next IO instruction from the register unit 11, and the data access unit 116 reads the convolution kernel data required from the operation module 6 from the storage medium 111 in accordance with the decoded control signal.

In step S5, the controller unit 115 reads in the next CONFIG instruction from the register unit 11, and based on the decoded control signal, the device configures various constants required for the layer neural network calculation. For example, the first arithmetic unit 51, the second arithmetic unit 61 configures the values of the unit internal registers according to parameters in the control signals, the parameters including, for example, data required for activating the function; and constants required for OP operations, such as H1 and W1 of N, crop of pad, the dimensional order of reshape, and the like.

In step S6, the controller unit 115 then reads the next conv_ac_op instruction from the register unit 11, and according to the decoded control signal, the master operation module 5 first sends the input data in the convolution window to each slave operation module 6 through the interconnection module 113, stores the input data in the second storage unit 63 of the slave operation module 6, and then moves the convolution window according to the instruction.

In step S7, the convolution kernel is read from the third storage unit 64 by the operation unit 61 of the operation module 6, the input data is read from the second storage unit 63, the OP module makes an OP change to the input data and the convolution kernel, then the operation unit 61 of the operation module 6 performs a convolution operation of the input data (OP conversion) and the convolution kernel (OP conversion), and the intermediate result is returned through the interconnection module 113, based on the control signal decoded by the conv_ac_op instruction.

In step S8, in the interconnection module 113, the intermediate results returned from the operation module 6 are pieced together into complete intermediate vectors.

In step S9, the main operation module 5 obtains the intermediate vectors returned by the interconnection module 4, the convolution window traverses all input data, and the main operation module splices all the returned vectors into an intermediate result; the (optional) reads the offset data from the first storage unit 53 according to the control signal decoded by the conv_ac_op instruction, adds the offset data and the intermediate result through the vector adding unit 511 to obtain an offset result, the main operation module 5 reads the interpolation table corresponding to the address of the activated function interpolation table in the conv_ac_op register number 4, performs the activation operation on the offset result and the interpolation table to obtain the final output data, and writes the final output data back to the first storage unit 53.

In step S10, the controller unit 115 then reads the next IO instruction from the instruction storage unit, and the data access unit 116 stores the output data in the first storage unit 53 to the external address space designated address according to the decoded control signal, and the operation ends.

The disclosed embodiments also provide a computer storage medium storing a computer program for electronic data exchange, the computer program causing a computer to execute some or all of the steps of a method for implementing any one of the convolution expansion instructions described in the above method embodiments.

The disclosed embodiments also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of a method of implementing a convolution expansion instruction as described in any one of the method embodiments above.

In another embodiment of the present disclosure, a chip is also disclosed, which includes the neural network computing device (as shown in fig. 1) of the above embodiment.

In another embodiment of the present disclosure, a chip package structure is also disclosed, which includes the chip.

The other embodiment of the disclosure also discloses a board card, which comprises the chip packaging structure.

In another embodiment of the present disclosure, an electronic device is also disclosed, which includes the above board card.

The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device vehicle, a household appliance, and/or a medical device.

The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.

It should be noted that, for simplicity of description, the foregoing method embodiments are all depicted as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.

In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, such as the division of the units, merely a logical function division, and there may be additional manners of dividing the actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, or may be in electrical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The foregoing has described in detail embodiments of the present disclosure, with specific examples being employed herein to illustrate the principles and implementations of the present disclosure, the above examples being provided solely to assist in the understanding of the methods of the present disclosure and their core ideas; also, as will be apparent to those of ordinary skill in the art in light of the present disclosure, there are variations in the detailed description and the scope of the application, which in light of the foregoing description should not be construed to limit the present disclosure.

Claims

1. A method of executing a convolution extended instruction, the method comprising the steps of:

The computing device reads the instruction from the memory to obtain input data of the instruction, a convolution kernel and an activation operation;

the instructions include: an opcode and an opcode field, the opcode comprising: the identity of the convolution expansion instruction; the operation domain includes: a convolution sub-domain and an activation sub-domain, the convolution sub-domain comprising: storing an address of input data and an address of a convolution kernel, the activation subfield comprising: an interpolation table address of the activation operation;

The computing device reads a convolution expansion instruction once, performs convolution operation on the input data and the convolution kernel to obtain an intermediate result, and performs activation operation on the intermediate result through the activation subdomain to obtain a final result of the instruction, and specifically comprises the following steps:

2. The method of claim 1, wherein the step of determining the position of the substrate comprises,

The activating operation includes: convolutional neural network Maxout operation, convolutional neural network PReLU operation, convolutional neural network RReLU operation, convolutional neural network leak ReLU operation, nonlinear activation operation, or linear activation operation.

3. The method of claim 1, wherein the computing device performing a convolution operation on the input data and a convolution kernel to obtain an intermediate result, comprising:

4. A computing device, the computing device comprising: the device comprises a memory, an operation unit, an interconnection module, an operation unit, a controller unit and a data access unit;

wherein the arithmetic unit includes: an adder and a multiplier;

a controller unit for reading the instruction from the memory to obtain input data of the instruction, a convolution kernel, and an activation operation;

The instructions include: an opcode and an opcode field, the opcode comprising: identification of the instruction; the operation domain includes: a convolution sub-domain and an activation sub-domain, the convolution sub-domain comprising: storing an address of input data and an address of a convolution kernel, the activation subfield comprising: an interpolation table address of the activation operation;

the operation unit is used for reading an instruction once, performing convolution operation on the input data and the convolution kernel to obtain an intermediate result, and performing activation operation on the intermediate result through the activation subdomain to obtain a final result of the instruction;

5. The computing device of claim 4, wherein the computing device is configured to,

6. The computing device of claim 4, wherein the arithmetic unit further comprises: the system comprises a master operation module and a plurality of slave operation modules, wherein the master operation module comprises: an adder and a multiplier, the slave operation module including: an adder and a multiplier;

7. A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of claims 1-3.

8. A computer program product, characterized in that the computer program product comprises a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform the method of any of claims 1-3.

9. An electronic device comprising a processor comprising the computing device of any of claims 4-6.