CN109409514A - Fixed-point calculation method, apparatus, equipment and the storage medium of convolutional neural networks - Google Patents
- Publication number
- CN109409514A (application number CN201811302449.8A / CN201811302449A)
- Authority
- CN
- China
- Prior art keywords
- register
- eigenvalue
- weight
- value
- convolutional layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The embodiments of the invention disclose a fixed-point calculation method, apparatus, device and storage medium for a convolutional neural network, the convolutional neural network including a convolutional layer. The method includes: receiving the input activation values of the current convolutional layer through input channels, the input channels having corresponding weights; performing a fixed-point operation on the input activation values to obtain first eigenvalues; writing the first eigenvalues and the weights into the registers of multiple register groups; and, for each of the multiple register groups, performing multiply-add operations on the first eigenvalues and weights in its registers to obtain multiple second eigenvalues. Because a processor usually provides multiple registers, the accumulation operations can be spread across them, i.e. accumulated in groups, which reduces the number of multiply-add operations each group absorbs, lowers the overflow risk, improves the processing efficiency of the operation instructions, and increases overall throughput, while accuracy is maintained and the range of applications is preserved.
Description
Technical field
The embodiments of the present invention relate to deep-learning technology, and in particular to a fixed-point calculation method, apparatus, device and storage medium for a convolutional neural network.
Background technique
In recent years, deep learning has been widely applied in fields such as computer vision, where the family of algorithms built around the CNN (Convolutional Neural Network) achieves good results in applications such as image classification, object detection and pixel-level segmentation.
However, the computational load of a convolutional neural network (CNN) is large, with more than 90% of it concentrated in the convolution operations. A common implementation converts the convolutional layer into the multiplication of two matrices, whose core is:
O = Σ_{i=1}^{N} w_i · a_i
where w is the weight, a is the input activation value, O is the output activation value, and N is the common edge length of w and a.
Since the computational load of convolution is large, in actual deployment convolution usually requires hardware acceleration by a GPU (Graphics Processing Unit) or an FPGA (Field-Programmable Gate Array) to meet real-time requirements.
To port convolutional neural networks onto devices with limited computing resources, such as mobile terminals and embedded devices, the current practice is to train smaller, more streamlined models, supplemented by fixed-pointing and pruning, attempting to strike a balance between speed and accuracy.
Among these, fixed-pointing (or quantization) methods have attracted wide attention because they do not change the network structure and require no retraining.
Fixed-pointing converts a value originally represented as a 32-bit floating-point number into a representation with a fixed number of bits through a mapping method. Taking Q notation as an example, U2Q6 denotes an 8-bit fixed-point number with a 2-bit integer part and a 6-bit fractional part. On a general-purpose processor, fixed-point operations usually have lower latency and higher throughput than floating-point operations; compared with 32-bit floating-point operations, 8-bit fixed-point operations can theoretically bring a 4x performance gain for CNNs.
However, the bit width grows during fixed-point computation, and to preserve precision the output results need to be stored at a larger bit width. For example, multiplying two U2Q6 fixed-point numbers yields an S4Q12 result, i.e. 16 bits. This growth in bit width hurts instruction throughput, and ultimately the actual fixed-point performance falls far short of the theoretical performance.
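As a non-limiting illustration of the bit-width growth described above, the following Python sketch quantizes two values into a Q6 fixed-point format (6 fractional bits, as in U2Q6) and multiplies them; the product carries 12 fractional bits and needs roughly twice the bit width. The helper names are illustrative, not from the patent text.

```python
def to_fixed(x, frac_bits):
    """Quantize a float to an integer fixed-point value with `frac_bits` fractional bits."""
    return round(x * (1 << frac_bits))

def to_float(q, frac_bits):
    """Recover the real value represented by the fixed-point integer `q`."""
    return q / (1 << frac_bits)

# Two U2Q6-style values (8 bits each: 2 integer bits, 6 fractional bits)
a = to_fixed(1.75, 6)    # 112
b = to_fixed(2.5, 6)     # 160
prod = a * b             # the scale is now 2^12: a Q12 result needing a 16-bit store
print(prod, to_float(prod, 12))   # 17920 4.375
```

The doubled fractional scale is exactly why an 8-bit-by-8-bit fixed-point multiply must be written back at 16 bits.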
To improve the performance of fixed-point computation, there are currently two approaches.
The first method discards low-order bits by a shift operation after the fixed-point computation to reduce the bit width. For example, by adding a shift operation in the activation layer, some relatively unimportant low bits are removed, reducing the complexity of subsequent fixed-point processing. However, this method requires specially designed hardware support, because intermediate results usually need to be stored at non-power-of-two bit widths, while general-purpose processing units usually offer only operation instructions of power-of-two bit widths, so processing efficiency is low.
The second method replaces the multiplication instruction with instructions carrying a lower overflow risk, such as shift instructions. For example, by quantizing the weights to a few sparse specific values such as {0.125, 0.25, 0.5}, multiplication by these specific values can be converted into a shift by the corresponding number of bits. However, the accuracy of this method is lower and its application range is narrower; it is only suitable for simple image-classification tasks.
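The shift-based replacement can be sketched as follows; the Q6 format and the mapping table are assumptions chosen for illustration only.

```python
# Quantizing weights to sparse powers of two lets multiplication become a right shift.
# Mapping from the sparse weight value to its shift amount: w = 2**-s, so a * w == a >> s.
SHIFT = {0.125: 3, 0.25: 2, 0.5: 1}

def mul_by_sparse_weight(a_fixed, w):
    """Multiply a fixed-point activation by a sparse power-of-two weight via a shift."""
    return a_fixed >> SHIFT[w]

a = 96                                 # 1.5 in Q6 (1.5 * 64)
print(mul_by_sparse_weight(a, 0.25))   # 24, i.e. 0.375 in Q6
```

Since a shift never widens the operand, the overflow risk of the replaced multiplication disappears, at the cost of restricting the weights to those few values.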
Summary of the invention
The embodiments of the present invention provide a fixed-point calculation method, apparatus, device and storage medium for a convolutional neural network, so as to improve fixed-point computation efficiency while guaranteeing the range of applications.
In a first aspect, an embodiment of the invention provides a fixed-point calculation method for a convolutional neural network that includes a convolutional layer, the method comprising:
receiving the input activation values of the current convolutional layer through input channels, the input channels having corresponding weights;
performing a fixed-point operation on the input activation values to obtain first eigenvalues;
writing the first eigenvalues and the weights into the registers of multiple register groups;
for each of the multiple register groups, performing multiply-add operations on the first eigenvalues and the weights in its registers to obtain multiple second eigenvalues.
Preferably, the registers of each register group include a register for storing the first eigenvalue, a register for storing the weight, a multiplication register and an addition register;
performing, for each of the multiple register groups, multiply-add operations on the first eigenvalues and the weights in its registers to obtain multiple second eigenvalues comprises:
for each register group, multiplying, in the multiplication register, the first eigenvalue and the weight corresponding to the same input channel to obtain feature product data;
accumulating the feature product data in the addition register to obtain a second eigenvalue.
Preferably, the method further comprises:
merging the multiple second eigenvalues to obtain a third eigenvalue;
performing a floating-point operation on the third eigenvalue to obtain a fourth eigenvalue;
generating the output activation value of the current convolutional layer from the fourth eigenvalue.
Preferably, before receiving the input activation values of the current convolutional layer through the input channels, the method further comprises:
compressing the bit number of the weights, so as to compress the bit number of the second eigenvalues.
Preferably, the input activation values of the current convolutional layer are the output activation values of the previous convolutional layer;
before receiving the input activation values of the current convolutional layer through the input channels, the method further comprises:
compressing the bit number of the output activation values of the previous convolutional layer, so as to compress the bit number of the second eigenvalues.
Preferably, the convolutional layer is a grouped convolutional layer comprising multiple convolution groups;
the method further comprises:
for each convolution group, enumerating candidate arrangements of the input channels assigned to it, the input channels having corresponding first training values;
writing the first training values and the weights into the registers of the multiple register groups;
for each of the multiple register groups, performing multiply-add operations on the first training values and the weights in its registers under each candidate arrangement, to obtain multiple second training values;
determining, among the second training values, the second training value with the smallest absolute value as the target training value;
setting the candidate arrangement corresponding to the target training value as the target arrangement of the input channels.
Preferably, the registers of each register group include a register for storing the first eigenvalue, a register for storing the weight, a multiplication register and an addition register;
performing, for each of the multiple register groups, multiply-add operations on the first training values and the weights in its registers under each candidate arrangement to obtain multiple second training values comprises:
multiplying, in the multiplication register, the first training value and the weight corresponding to each input channel to obtain training product data;
selecting, from the training product data, the m training product data with the smallest values as target training product data;
writing the target training product data into the addition register;
accumulating the training product data other than the target training product data into the addition register according to each arrangement, to obtain the second training values.
Preferably, the convolutional layer is a grouped convolutional layer comprising multiple convolution groups;
receiving the input activation values of the current convolutional layer through the input channels comprises:
determining the input channels assigned to each convolution group;
in each convolution group, receiving input activation values through the assigned input channels;
writing the first eigenvalues and the weights into the registers of the multiple register groups comprises:
determining a target convolution group from the multiple convolution groups in turn;
in the target convolution group, writing the first eigenvalues corresponding to the assigned input channels and the weights into the registers of the multiple register groups;
performing, for each of the multiple register groups, multiply-add operations on the first eigenvalues and the weights in its registers to obtain multiple second eigenvalues comprises:
in the target convolution group, performing multiply-add operations on the first eigenvalues and the weights in the register groups according to a preset target arrangement, to obtain multiple second eigenvalues.
In a second aspect, an embodiment of the invention further provides a fixed-point calculation apparatus for a convolutional neural network that includes a convolutional layer, the apparatus comprising:
an input-activation-value receiving module, configured to receive the input activation values of the current convolutional layer through input channels, the input channels having corresponding weights;
a fixed-point conversion module, configured to perform a fixed-point operation on the input activation values to obtain first eigenvalues;
a grouped storage module, configured to write the first eigenvalues and the weights into the registers of multiple register groups;
a multiply-add operation module, configured to perform, for each of the multiple register groups, multiply-add operations on the first eigenvalues and the weights in its registers to obtain multiple second eigenvalues.
Preferably, the registers of each register group include a register for storing the first eigenvalue, a register for storing the weight, a multiplication register and an addition register;
the multiply-add operation module comprises:
a multiplication submodule, configured to multiply, for each register group, in the multiplication register, the first eigenvalue and the weight corresponding to the same input channel to obtain feature product data;
an addition submodule, configured to accumulate the feature product data in the addition register to obtain a second eigenvalue.
Preferably, the apparatus further comprises:
an eigenvalue merging module, configured to merge the multiple second eigenvalues to obtain a third eigenvalue;
a floating-point conversion module, configured to perform a floating-point operation on the third eigenvalue to obtain a fourth eigenvalue;
an output-activation-value generation module, configured to generate the output activation value of the current convolutional layer from the fourth eigenvalue.
Preferably, the apparatus further comprises:
a weight compression module, configured to compress the bit number of the weights, so as to compress the bit number of the second eigenvalues.
Preferably, the input activation values of the current convolutional layer are the output activation values of the previous convolutional layer;
the apparatus further comprises:
an output-activation-value compression module, configured to compress the bit number of the output activation values of the previous convolutional layer, so as to compress the bit number of the second eigenvalues.
Preferably, the convolutional layer is a grouped convolutional layer comprising multiple convolution groups;
the apparatus further comprises:
a candidate-arrangement enumeration module, configured to enumerate, for each convolution group, candidate arrangements of the input channels assigned to it, the input channels having corresponding first training values;
a training-set writing module, configured to write the first training values and the weights into the registers of the multiple register groups;
a training-set training module, configured to perform, for each of the multiple register groups, multiply-add operations on the first training values and the weights in its registers under each candidate arrangement, to obtain multiple second training values;
a target-training-value selection module, configured to determine, among the second training values, the second training value with the smallest absolute value as the target training value;
a target-arrangement setting module, configured to set the candidate arrangement corresponding to the target training value as the target arrangement of the input channels.
Preferably, the registers of each register group include a register for storing the first eigenvalue, a register for storing the weight, a multiplication register and an addition register;
the training-set training module comprises:
a training-product-data computation submodule, configured to multiply, in the multiplication register, the first training value and the weight corresponding to each input channel to obtain training product data;
a target-training-product-data selection submodule, configured to select, from the training product data, the m training product data with the smallest values as target training product data;
a target-training-product-data writing submodule, configured to write the target training product data into the addition register;
a training-product-data accumulation submodule, configured to accumulate the training product data other than the target training product data into the addition register according to each arrangement, to obtain the second training values.
Preferably, the convolutional layer is a grouped convolutional layer comprising multiple convolution groups;
the input-activation-value receiving module comprises:
a channel assignment submodule, configured to determine the input channels assigned to each convolution group;
a channel receiving submodule, configured to receive, in each convolution group, input activation values through the assigned input channels;
the grouped storage module comprises:
a target-convolution-group determination submodule, configured to determine a target convolution group from the multiple convolution groups in turn;
a channel storage submodule, configured to write, in the target convolution group, the first eigenvalues corresponding to the assigned input channels and the weights into the registers of the multiple register groups;
the multiply-add operation module comprises:
an arranged multiply-add submodule, configured to perform, in the target convolution group, multiply-add operations on the first eigenvalues and the weights in the register groups according to a preset target arrangement, to obtain multiple second eigenvalues.
In a third aspect, an embodiment of the invention further provides a device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the fixed-point calculation method for a convolutional neural network provided by the embodiments of the first aspect of the invention.
In a fourth aspect, an embodiment of the invention further provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the fixed-point calculation method for a convolutional neural network provided by the embodiments of the first aspect of the invention.
In the embodiments of the invention, the input activation values of the current convolutional layer are received through input channels; a fixed-point operation is performed on the input activation values to obtain first eigenvalues; the first eigenvalues and the weights are written into the registers of multiple register groups; and, for each register group, multiply-add operations are performed on the first eigenvalues and weights in its registers to obtain multiple second eigenvalues. Because a processor usually provides multiple registers, the accumulation operations can be spread across them, i.e. accumulated in groups, which reduces the number of multiply-add operations each group absorbs, lowers the overflow risk, improves the processing efficiency of the operation instructions, and increases overall throughput, while accuracy is maintained and the range of applications is preserved.
Brief description of the drawings
Fig. 1 is a flowchart of the fixed-point calculation of a convolutional neural network provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the convolutional layer in Embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of the register groups in Embodiment 1 of the present invention;
Fig. 4 is a flowchart of the fixed-point calculation of a convolutional neural network provided by Embodiment 2 of the present invention;
Fig. 5 is a flowchart of the fixed-point calculation of a convolutional neural network provided by Embodiment 3 of the present invention;
Fig. 6 is a schematic diagram of the grouped convolutional layer in Embodiment 3 of the present invention;
Fig. 7 is a structural schematic diagram of the fixed-point calculation apparatus for a convolutional neural network provided by Embodiment 4 of the present invention;
Fig. 8 is a structural schematic diagram of the device provided by Embodiment 5 of the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Fig. 1 is a flowchart of the fixed-point calculation of a convolutional neural network provided by Embodiment 1 of the present invention, which specifically includes the following steps:
S110: receive the input activation values of the current convolutional layer through input channels.
In a concrete implementation, the embodiments of the invention may be applied in a device that has a general-purpose computing unit, such as a CPU (Central Processing Unit) or a DSP (Digital Signal Processor), which commonly uses operation instructions of power-of-two bit widths.
The device may be one with limited computing resources, such as a mobile terminal or an embedded device, but may also be a server cluster with more abundant computing resources, such as a distributed system; the embodiments of the invention are not limited in this respect.
Currently, the basic layered structure of a convolutional neural network may include a convolution layer, a pooling layer, an activation layer and a full connection layer, among others. The convolution layer, pooling layer and activation layer can be combined into a single layer of operations for completing the convolution computation of the convolutional neural network.
For a given convolutional layer, the input activation values of that layer can be received through the input channels (in_channels); the input activation values may be floating-point data.
In addition, each input channel has a corresponding weight, also known as a convolution kernel. A convolutional layer generally consists of multiple neuron maps, each map consisting of multiple neural units; all neural units of the same map share one weight. A weight often represents a feature: for example, if a weight represents an arc segment, the convolution output may also be an arc segment.
It should be noted that, for fixed-point computation, the weight may be fixed-pointed in advance, i.e. the weight may be fixed-point data.
S120: perform a fixed-point operation on the input activation values to obtain first eigenvalues.
In the embodiments of the invention, each input activation value can be fixed-pointed, for example by the Q-notation method, converting it from floating-point data to fixed-point data; the result of this conversion is the first eigenvalue.
S130: write the first eigenvalues and the weights into the registers of multiple register groups.
In the embodiments of the invention, as shown in Fig. 2, multiple registers are provided in the device, and at least part of the registers can be divided in advance into multiple register groups 202.
In one example, the processor possesses 32 or more SIMD (Single Instruction Multiple Data) registers. A single multiply-add operation (MAC) occupies 4 registers: the weight and the first eigenvalue occupy 1 register each, and the 16-bit MAC output occupies 2 registers. In addition, collecting the accumulated results of the register groups occupies 8 registers: the 32-bit accumulated result occupies 4 registers, and to prevent the data from stalling the pipeline this is doubled, for 8 registers in total.
In this example, excluding the 8 registers occupied by collecting the accumulated results, 24 registers remain. Each register group performs its MACs independently and occupies at least 4 registers, so the registers can be divided into 6 register groups, each carrying out MACs with its own registers.
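The register budget of this example can be sketched as the following arithmetic; the figures follow the example above, and other processors may of course differ.

```python
# Sketch of the register budget described above (assumed: a 32-register SIMD file).
total_regs = 32
collect_regs = 8          # 32-bit accumulated result (4 regs), doubled to avoid pipeline stalls
regs_per_mac_group = 4    # weight (1) + first eigenvalue (1) + 16-bit MAC output (2)

available = total_regs - collect_regs         # 24 registers left for MAC groups
num_groups = available // regs_per_mac_group  # 6 independent register groups
print(num_groups)  # 6
```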
In the embodiments of the invention, as shown in Fig. 2, the first eigenvalues and weights corresponding to the input channels (in_channels) 201 can be written, sequentially, randomly or otherwise, into the registers of the corresponding register groups.
S140: for each of the multiple register groups, perform multiply-add operations on the first eigenvalues and the weights in its registers to obtain multiple second eigenvalues.
In a concrete implementation, as shown in Fig. 3, the registers of each register group include a register 301 for storing the first eigenvalue, a register 302 for storing the weight, a multiplication register 303 and an addition register 304.
Normally, the raw data input to the convolutional neural network can be split, by overlapping, into input feature values with the same quantity and size as the weights, so that input feature values and weights correspond one to one. For example, if the input to the convolutional neural network is a 320*240 image and the weight is 6*6*10, the image data can be split into multiple 6*6*10 input feature values.
For each register group, the first eigenvalue and the weight corresponding to the same input channel are multiplied in the multiplication register 303 to obtain feature product data, and the feature product data are accumulated in the addition register 304 to obtain a second eigenvalue.
Accumulation means that, starting from the first feature product datum, the addition register keeps adding the value of each feature product datum without clearing, until the last feature product datum has been added; the data obtained once the accumulation completes are the second eigenvalue.
More specifically, the multiply-add operation of the register groups can be expressed as:
O_g = Σ_{i=1}^{N/G} w_{g,i} · a_{g,i}, g = 1, …, G
where O_g is the second eigenvalue of register group g, G is the number of register groups, w is the weight, a is the first eigenvalue, N is the common edge length between w and a, and N/G is the number of multiplications assigned to each register group.
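A minimal sketch of the grouped accumulation, assuming N is divisible by G and using plain Python integers in place of SIMD registers; the function and variable names are illustrative, not from the patent.

```python
def grouped_mac(weights, activations, G):
    """Split N products across G accumulators; each absorbs only N/G multiply-adds."""
    N = len(weights)
    assert N % G == 0
    per_group = N // G
    partial = [0] * G
    for g in range(G):
        for i in range(per_group):
            idx = g * per_group + i
            partial[g] += weights[idx] * activations[idx]  # MAC within one register group
    return partial   # the G second eigenvalues; summing them yields O

w = [1, 2, 3, 4, 5, 6]
a = [6, 5, 4, 3, 2, 1]
parts = grouped_mac(w, a, G=3)
print(parts, sum(parts))   # [16, 24, 16] 56
```

Each accumulator sees only N/G additions instead of N, which is exactly how the grouping lowers the overflow risk of a fixed-bit-width accumulator without changing the final sum.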
In the embodiments of the invention, the input activation values of the current convolutional layer are received through input channels; a fixed-point operation is performed on the input activation values to obtain first eigenvalues; the first eigenvalues and the weights are written into the registers of multiple register groups; and, for each register group, multiply-add operations are performed on the first eigenvalues and weights in its registers to obtain multiple second eigenvalues. Because a processor usually provides multiple registers, the accumulation operations can be spread across them, i.e. accumulated in groups, which reduces the number of multiply-add operations each group absorbs, lowers the overflow risk, improves the processing efficiency of the operation instructions, and increases overall throughput, while accuracy is maintained and the range of applications is preserved.
Fig. 4 is a flowchart of the fixed-point calculation of a convolutional neural network provided by Embodiment 2 of the present invention. Based on the preceding embodiment, this embodiment further adds operations for compressing the bit number of the weights and/or the output activation values, and thereby the bit number of the second eigenvalues. The method specifically includes the following steps:
S410: compress the bit number of the weights, so as to compress the bit number of the second eigenvalues.
S420: compress the bit number of the output activation values of the previous convolutional layer, so as to compress the bit number of the second eigenvalues.
Since multiplying two 8-bit fixed-point numbers yields a 16-bit result, the bit number of the output can be limited to less than 16 bits. In practical applications, 8-bit weights and/or 8-bit input activation values are not strictly necessary; the bit number of the weights and/or input activation values can be compressed while a certain precision is maintained, so as to compress the bit number of the output second eigenvalues.
For the current convolutional layer, its input activation values are the output activation values of the previous convolutional layer; the bit number of the output activation values can therefore be compressed in the previous convolutional layer.
In a concrete implementation, the bit number of the weights and/or output activation values can be compressed in the following ways:
1. Uniform compression: multiply each value by 255/(maximum value - minimum value), then round to the nearest integer.
2. Truncated compression based on relative entropy: obtain a threshold by a statistical method; values whose absolute value exceeds the threshold are clamped to 255, and all other values are multiplied by 255/threshold.
3. Non-uniform quantization: multiply by a dynamically varying value, so that the value density is high in the parts requiring high accuracy and low elsewhere.
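The first (uniform) mode can be sketched as follows, assuming NumPy is available; zero-point/offset handling is omitted for brevity, and the function name is illustrative.

```python
import numpy as np

def uniform_compress(values):
    """Uniform compression: scale by 255/(max - min), then round to nearest."""
    values = np.asarray(values, dtype=np.float64)
    scale = 255.0 / (values.max() - values.min())
    return np.rint(values * scale).astype(np.int32)

w = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(uniform_compress(w))   # [-128  -64    0   64  128]
```

Note that `np.rint` rounds halves to the nearest even integer, which is one reasonable reading of "nearest-neighbor rounding" here.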
Of course, the above compression modes are only examples; when implementing the embodiments of the invention, other compression modes can be set according to the actual situation, and the embodiments of the invention are not limited in this respect. In addition to the above modes, those skilled in the art may also use other compression modes according to actual needs, and the embodiments of the invention place no restriction on this either.
The current standard signed 8-bit fixed-point value range is [-128, 127]. The following table lists how the output bit width changes when the range of one operand of a fixed-point multiplication is limited:
Here, operand A can be the weight and operand B the input activation value; the output range and output bit width are the value range and bit width of the result (such as the Second Eigenvalue) of multiplying operand A by operand B.
Optionally, since compressing the bit width of the output activation value in real time incurs overhead, one may choose to compress part of the bit width of the weights offline, so that the weights loaded by the convolutional neural network CNN at run time are already the compressed weights.
At present, for the vast majority of applications of convolutional neural networks CNN, including image classification, object detection and segmentation, the bit width of the weights can be compressed to the ranges in the table above while a certain accuracy is maintained.
In the embodiment of the present invention, by compressing the bit number of the weight and/or the output activation value, that is, by limiting the value range of the weight and/or the output activation value, the bit number of the Second Eigenvalue can be compressed, which avoids splitting the multiply-add operation into two separate instructions, a multiplication and an addition.
It should be noted that, besides limiting the value ranges of the weight and the input activation value, a new hardware instruction may be introduced so that the multiply-add operation MAC is completed by a single instruction, and the embodiment of the present invention is not limited thereto.
S430, receive the input activation value of this convolutional layer through input channels.
The input channels have corresponding weights.
For this convolutional layer, the input activation value is the output activation value of the upper convolutional layer; that is, the output activation value of the upper convolutional layer, after its bit number is compressed, is fed through the output channels out_channels into this convolutional layer, so the input activation value received by this convolutional layer has already been compressed.
S440, carry out a fixed-point operation on the input activation value to obtain the First Eigenvalue.
S450, write the First Eigenvalue and the weight respectively into the registers of multiple register groups.
S460, for the multiple register groups, carry out a multiply-add operation according to the First Eigenvalue and the weight in the registers of each group, obtaining multiple Second Eigenvalues.
S470, merge the multiple Second Eigenvalues to obtain the third feature value.
S480, carry out a floating-point operation on the third feature value to obtain the fourth feature value.
S490, generate the output activation value of this convolutional layer according to the fourth feature value.
In the embodiment of the present invention, since the device has multiple register groups, the multiply-add operations yield multiple Second Eigenvalues.
The multiple Second Eigenvalues are merged pairwise; the data obtained after merging is the third feature value.
The third feature value is fixed-point data; remapping it to floating-point data yields the fourth feature value.
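As an illustrative sketch of steps S450 to S490 (the round-robin channel assignment, the single scale factor for the fixed-point-to-float remapping, and all function names are assumptions of this sketch, not details stated in the embodiment), the grouped multiply-add, the merging, and the floating-point remapping can be simulated as:

```python
def grouped_mac(activations, weights, num_groups):
    """S450/S460: distribute the per-channel products across register
    groups and accumulate within each group, yielding one Second
    Eigenvalue per group."""
    sums = [0] * num_groups
    for i, (a, w) in enumerate(zip(activations, weights)):
        sums[i % num_groups] += a * w   # round-robin channel assignment
    return sums

def merge_and_remap(second_values, scale):
    """S470/S480: merge the per-group Second Eigenvalues into the third
    feature value, then remap the fixed-point result to floating point
    (here via a single illustrative scale factor)."""
    third = sum(second_values)
    return third * scale   # fourth feature value, floating point
```

With two groups, grouped_mac([1, 2, 3, 4], [5, 6, 7, 8], 2) accumulates channels 1 and 3 in one group and channels 2 and 4 in the other, matching the dispersal of accumulation described above.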
The fourth feature value is normalized by means such as a BN (Batch Normalization) operator and passed through an activation operation such as a ReLU (Rectified Linear Units) function or a Sigmoid function; the result can serve as the output activation value of this convolutional layer and is output through the output channels out_channels to the lower convolutional layer.
Optionally, for this convolutional layer, after the output activation value is generated, the bit number of the output activation value can be compressed in real time, and the compressed output activation value is output through the output channels out_channels to the lower convolutional layer.
During the multiply-add operations MAC of the register groups:
For each register group, the feature multiplicands are 8 bits and the accumulated Second Eigenvalue is 16 bits. This step theoretically takes (N/G)*t1, where t1 is the average time a processing unit needs to execute one multiply-add operation MAC; since the multiply-add operations MAC can be interleaved without pipeline stalls, only the pipeline setup time needs to be considered here.
Suppose there are 6 register groups, so that accumulation yields 6 Second Eigenvalues of 16 bits each. Pairing the Second Eigenvalues and accumulating them pairwise yields 3 intermediate values of 32 bits, taking 3*t2, where t2 is the time of a single accumulation including pipeline stalls.
Accumulating the above intermediate values twice more yields the third feature value, taking 2*t2.
When (N/G) is much larger than 5, the multiply-add operations take far longer than the merge accumulations, and the amount of data (throughput) that an 8-bit multiply-add operation MAC can process at a time is theoretically 4 times that of 32-bit full-precision data; more data can therefore be processed per unit time, so the overall time consumption is smaller.
It should be noted that the input operands of the merge operation are still 16 bits and the output is 32 bits; compared with the 8-bit multiplication, the throughput is halved. Therefore, in practical applications, those skilled in the art can adjust the number of register groups and the bit width of the weights to reach a compromise between speed and precision.
Fig. 5 is a flowchart of the fixed-point calculation of a convolutional neural network provided by embodiment three of the present invention. On the basis of the foregoing embodiments, this embodiment further adds the processing of a grouped convolutional layer. The method specifically comprises the following steps:
S501, for each convolution group, enumerate the candidate arrangement modes of the assigned input channels.
In the embodiment of the present invention, a certain convolutional layer in the convolutional neural network CNN is a grouped convolutional layer (Grouped Convolution), also called a group convolutional layer, which comprises multiple convolution groups.
Compared with an ordinary convolutional layer, a grouped convolutional layer has fewer parameters and a faster operation speed; moreover, owing to its excellent performance and cache-friendly characteristics, the grouped convolutional layer has become one of the classic structures with which embedded devices implement convolutional neural networks CNN.
Suppose the upper layer outputs N feature maps, i.e. the number of input channels in_channel = N, and suppose further that the grouped convolutional layer has M convolution groups. Then, in the grouped convolutional layer, the in_channel input channels are divided into M parts; each convolution group corresponds to, and is independently connected to, N/M input channels, and after each convolution group completes its convolution, the outputs are stacked (concatenated) as the output channels out_channel of this layer.
In practical applications, the input channels in_channel assigned to each convolution group of a grouped convolutional layer are generally fixed when the convolutional neural network CNN is designed; however, according to the commutative law of addition, the input channels in_channel within a convolution group can be reordered without affecting the result of the addition.
For a convolutional layer, the order of the input channels in_channel has no bearing on the output result, and the products of input activation values and weights are positive and negative with roughly equal probability; therefore, by adjusting the order of the input channels in_channel, the result of the addition within each convolution group can be made as small as possible, thereby reducing the overflow risk.
Accordingly, the candidate arrangement modes of the assigned input channels can be enumerated offline for each convolution group.
S502, write the first trained values and the weights respectively into the registers of multiple register groups.
On the one hand, the input channels in_channel have corresponding first trained values, which serve as the input feature values of this grouped convolutional layer for training the arrangement mode.
Since the first trained values are data intended to simulate the actual usage scenario, they can be extracted directly from the test set used for training the convolutional neural network CNN.
On the other hand, the input channels in_channel have corresponding weights.
Similarly, multiple registers are provided in the device, and at least some of the registers can be divided into multiple register groups in advance.
In a sequential, random or other manner, the first trained values and weights corresponding to the input channels in_channels are respectively written into the registers of the multiple register groups.
S503, for the multiple register groups, carry out a multiply-add operation under every candidate arrangement mode according to the first trained values and the weights in the registers, obtaining multiple second trained values.
S504, determine, among the second trained values, the second trained value with the smallest absolute value as the target trained value.
In a specific implementation, the convolution groups run serially, and each convolution group can call multiple register groups to carry out multiply-add operations MAC independently.
For each convolution group, a multiply-add operation MAC can be performed on the first trained values and the weights under every candidate arrangement mode, obtaining multiple second trained values.
The goal of the optimization is to make the absolute value of the accumulation result of each register group (i.e. the second trained value) as small as possible, so as to reduce the overflow risk.
The above optimization goal can be expressed as choosing the arrangement mode s that minimizes R(s), where G is the number of register groups and R(s) is the maximum of the second trained values over the G register groups under the current arrangement mode s.
In one embodiment of the present invention, the registers of each register group include a register for storing the First Eigenvalue, a register for storing the weight, a multiplication register and an addition register.
Then, in the embodiment of the present invention, S504 may include:
S5041, in the multiplication register, carry out a multiplication operation on the first trained value and the weight corresponding to each input channel, obtaining training product data.
S5042, select from the training product data the m training product data with the smallest values, as target training product data.
S5043, write the target training product data into the addition register.
S5044, add the training product data other than the target training product data into the addition register according to each arrangement mode, obtaining the second trained values.
In the embodiment of the present invention, the training product data obtained by the multiplication between the first trained value and the weight corresponding to each input channel are estimated.
From all the training product data, the m training product data with the smallest absolute values (m is a positive integer, generally 2-32) are taken as the target training product data.
The target training product data are written into the addition registers as the initial values of the accumulation.
The remaining training product data are taken from all the training product data and added into the addition registers according to the different arrangement modes, so as to obtain the second trained values.
At this point, the input channels in_channels whose accumulation with the current register value yields the smallest absolute value are assigned to the addition registers, and the accumulated values of the addition registers are then recalculated; when the accumulation is complete, the second trained value with the smallest absolute value is obtained.
The above second trained value can be expressed as follows:
R(s) = max over the G register groups of | Σ over the ⌈N/G⌉ assigned multiplications and over W, H of w·a |
where R is the second trained value, G is the number of register groups, w is the weight, a is the First Eigenvalue, N is the length of the common edge between w and a, ⌈N/G⌉ is the number of multiplications assigned to each register group, s is the arrangement mode, and W and H are the width and height of the first trained values.
It should be noted that, besides the brute-force traversal method, the arrangement operation on the channels can also be carried out by dynamic programming to choose the optimal arrangement mode; alternatively, the optimal arrangement mode can be chosen by judging indirect indicators such as accuracy, and the embodiment of the present invention is not limited thereto.
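A minimal sketch of the brute-force variant of this search (the round-robin assignment of positions to register groups, the scoring by largest per-group |accumulated sum|, and the function name are assumptions of this sketch; it operates on precomputed channel products, as in S5041):

```python
from itertools import permutations

def best_arrangement(products, num_groups):
    """Enumerate channel orderings; for each, assign the products
    round-robin to register groups and score by the largest per-group
    absolute accumulated sum (R under that arrangement). Return the
    ordering with the smallest score, i.e. the target arrangement."""
    best_order, best_score = None, float("inf")
    for order in permutations(range(len(products))):
        sums = [0.0] * num_groups
        for pos, idx in enumerate(order):
            sums[pos % num_groups] += products[idx]
        score = max(abs(s) for s in sums)
        if score < best_score:
            best_order, best_score = order, score
    return best_order, best_score
```

With products [3, -3, 2, -2] and two groups, the search pairs cancelling products in the same group, driving the peak accumulated magnitude to zero; for larger channel counts, the factorial enumeration would be replaced by dynamic programming as noted above.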
S505, set the candidate arrangement mode corresponding to the target trained value as the target arrangement mode of the input channels.
According to the channel IDs of the input channels in_channels distributed to each register, interleaving yields the final target arrangement mode.
In the grouped multiply-add stage, one 16-bit register needs to accumulate the results of (N/G) multiplications in total; once (N/G) is large, the absolute value may, during the accumulation, exceed the upper limit of what 15 bits (plus 1 sign bit) can store, causing an overflow and wrap-around (a positive number becomes negative).
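The wrap-around just described (a positive number becoming negative) can be illustrated for a signed 16-bit accumulator with a pure-Python simulation of two's-complement arithmetic (the helper name is illustrative and not tied to any particular processor):

```python
def wrap_int16(x):
    """Simulate two's-complement wrap-around of a signed 16-bit register:
    results are folded into the range [-32768, 32767]."""
    return ((x + 32768) % 65536) - 32768

# Accumulating one past 32767 (the 15-bit magnitude limit) wraps negative.
acc = wrap_int16(32767 + 1)   # -32768
```

This is exactly the failure mode the grouped accumulation and the channel reordering aim to avoid: keeping partial sums small so the accumulator never crosses this boundary.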
In the embodiment of the present invention, since the addition within a convolution group is independent of the order of the input channels, the input channels can be arranged in any order; the accumulation order is adjusted so that the increment of the absolute value after each multiply-add operation on the first trained values is minimal, thereby reducing the absolute-value peaks that occur during accumulation.
Since the first trained values realize a sampling of the ground truth, when their quantity is sufficient, the distribution of their absolute values approaches the true distribution according to the law of large numbers; it can therefore be considered that, in practical applications, this ordering also makes the increment of the absolute value after each multiply-add operation minimal.
S506, determine the input channels assigned to each convolution group.
S507, receive the input activation values in each convolution group through the assigned input channels.
As shown in Fig. 6, when the convolutional neural network CNN is applied online and this convolutional layer is a grouped convolutional layer, the input channels in_channels 601 pre-assigned to each convolution group 602 can be queried.
For each convolution group 602, the input activation values fed into the convolution group 602 are received through the assigned input channels in_channels 601.
S508, carry out a fixed-point operation on the input activation value to obtain the First Eigenvalue.
S509, determine target convolution groups in turn from the multiple convolution groups.
The multiple convolution groups run serially in sequence; the currently running convolution group is the target convolution group.
S510, in the target convolution group, write the First Eigenvalues and weights corresponding to the assigned input channels respectively into the registers of multiple register groups.
As shown in Fig. 6, when the convolution group 602 runs, it can call multiple register groups 603 and, in a sequential, random or other manner, write the First Eigenvalues and weights corresponding to the assigned input channels in_channels 601 respectively into the registers of the multiple register groups 603.
S511, in the target convolution group, carry out a multiply-add operation in each register group according to the First Eigenvalue, the weight and the preset target arrangement mode, obtaining multiple Second Eigenvalues.
In practical applications, the target arrangement mode corresponding to the input channels in_channels assigned to the target convolution group is queried, thereby determining the order in which the First Eigenvalues and weights are added within the target convolution group; after the multiplications, the additions are carried out in this order to obtain the Second Eigenvalues.
It should be noted that the target arrangement mode may be the arrangement mode trained offline, or the arrangement mode set by default when the convolutional neural network CNN is designed, and the embodiment of the present invention is not limited thereto.
In the embodiment of the present invention, by introducing convolution groups to divide the common edge, the number of common edges between the First Eigenvalues and the weights can be kept from growing as the number of input channels grows, extending the method to convolutional layers with any number of input channels.
In addition, the smaller number of input channels in each convolution group makes it possible to obtain the optimal solution of the input-channel reordering by the simple means of brute-force search.
It should be noted that, besides grouped convolutional layers, there are other ways to limit the number of elements participating in a single convolution, including but not limited to depthwise separable convolution (Depth-wise Convolution), network pruning and sparsification, dilated convolution, etc., and the embodiment of the present invention is not limited thereto.
Among them, network pruning and sparsification and dilated convolution reduce the number of elements participating in the operation by resetting some of the elements in the convolution kernel to 0.
In order to enable those skilled in the art to better understand the embodiments of the present invention, the fixed-point calculation method of the convolutional neural network in the embodiment of the present invention is illustrated below by specific examples.
Example one: an ordinary convolutional layer
This convolutional layer has 8 input channels: channel_1, channel_2, channel_3, channel_4, channel_5, channel_6, channel_7, channel_8.
Weights of 8 fixed-point data, w1, w2, w3, w4, w5, w6, w7, w8, are set for the 8 input channels respectively.
The device divides its registers into two register groups, A1 and A2.
When offline:
Compress the bit number of w1, w2, w3, w4, w5, w6, w7, w8
When online:
Receive input activation values a1, a2, a3, a4, a5, a6, a7, a8 from channel_1, channel_2, channel_3, channel_4, channel_5, channel_6, channel_7 and channel_8 respectively.
Convert a1, a2, a3, a4, a5, a6, a7, a8 into fixed-point data as the First Eigenvalues.
channel_1, channel_3, channel_5 and channel_7 are assigned to A1.
channel_2, channel_4, channel_6 and channel_8 are assigned to A2.
The Second Eigenvalue of the multiply-add operation computed in A1 is: a1*w1+a3*w3+a5*w5+a7*w7=D1
The Second Eigenvalue of the multiply-add operation computed in A2 is: a2*w2+a4*w4+a6*w6+a8*w8=D2
D1 and D2 merge into the third feature value: D1+D2=D3
D3 is mapped to a floating-point value and, after operations such as normalization and activation by an activation function, serves as the output activation value.
Compress the bit number of D3
D3 is output to lower layer's convolutional layer by output channel
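Example one can be replayed numerically (the activation and weight values below are toy numbers chosen for illustration; they are not values given in the embodiment):

```python
# Eight fixed-point input activation values (First Eigenvalues) and weights.
a = [1, 2, 3, 4, 5, 6, 7, 8]
w = [1, 1, 1, 1, 1, 1, 1, 1]

# Register group A1 takes the odd-numbered channels, A2 the even-numbered.
D1 = sum(a[i] * w[i] for i in (0, 2, 4, 6))   # a1*w1 + a3*w3 + a5*w5 + a7*w7
D2 = sum(a[i] * w[i] for i in (1, 3, 5, 7))   # a2*w2 + a4*w4 + a6*w6 + a8*w8
D3 = D1 + D2                                  # merged third feature value
```

Each register group accumulates only half of the eight products, which is the dispersal of the accumulation that reduces the per-register overflow risk.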
Example two: a grouped convolutional layer
The grouped convolutional layer has 2 convolution groups, B1 and B2, and 8 input channels: channel_1, channel_2, channel_3, channel_4, channel_5, channel_6, channel_7, channel_8.
Among them, channel_1, channel_2, channel_3 and channel_4 are assigned to B1, and channel_5, channel_6, channel_7 and channel_8 are assigned to B2.
Weights of 8 fixed-point data, w1, w2, w3, w4, w5, w6, w7, w8, are set for the 8 input channels respectively.
The device divides its registers into two register groups, A1 and A2.
When offline:
Compress the bit number of w1, w2, w3, w4, w5, w6, w7, w8
Receive the first trained values a1', a2', a3', a4', a5', a6', a7', a8' from channel_1, channel_2, channel_3, channel_4, channel_5, channel_6, channel_7 and channel_8 respectively.
In B1, estimate the output values of channel_1, channel_2, channel_3 and channel_4, i.e. a1'*w1, a2'*w2, a3'*w3 and a4'*w4.
The two channels with the smallest absolute values are channel_1 and channel_2, i.e. the absolute values of a1'*w1 and a2'*w2 are less than those of a3'*w3 and a4'*w4.
Add a3'*w3 and a4'*w4 respectively on the basis of a1'*w1.
Add a3'*w3 and a4'*w4 respectively on the basis of a2'*w2.
Suppose the sum of the absolute values of (a1'*w1+a3'*w3) and (a2'*w2+a4'*w4) is less than that of (a1'*w1+a4'*w4) and (a2'*w2+a3'*w3).
Then the optimal arrangement mode of the input channels in B1 (i.e. the target arrangement mode) is (channel_1, channel_3), (channel_2, channel_4).
In B2, estimate the output values of channel_5, channel_6, channel_7 and channel_8, i.e. a5'*w5, a6'*w6, a7'*w7 and a8'*w8.
The two channels with the smallest absolute values are channel_6 and channel_8, i.e. the absolute values of a6'*w6 and a8'*w8 are less than those of a5'*w5 and a7'*w7.
Add a5'*w5 and a7'*w7 respectively on the basis of a6'*w6.
Add a5'*w5 and a7'*w7 respectively on the basis of a8'*w8.
Suppose the sum of the absolute values of (a6'*w6+a5'*w5) and (a7'*w7+a8'*w8) is less than that of (a6'*w6+a8'*w8) and (a7'*w7+a5'*w5).
Then the optimal arrangement mode of the input channels in B2 (i.e. the target arrangement mode) is (channel_6, channel_5), (channel_7, channel_8).
When online:
Receive input activation values a1, a2, a3, a4, a5, a6, a7, a8 from channel_1, channel_2, channel_3, channel_4, channel_5, channel_6, channel_7 and channel_8 respectively.
Convert a1, a2, a3, a4, a5, a6, a7, a8 into fixed-point data as the First Eigenvalues.
channel_1, channel_2, channel_3 and channel_4 are assigned to B1.
channel_5, channel_6, channel_7 and channel_8 are assigned to B2.
B1 is processed first:
In B1, read the target arrangement mode of B1, assign channel_1 and channel_3 to A1, and assign channel_2 and channel_4 to A2.
The Second Eigenvalue of the multiply-add operation computed in A1 is: a1*w1+a3*w3=E1
The Second Eigenvalue of the multiply-add operation computed in A2 is: a2*w2+a4*w4=E2
E1 and E2 merge into: E1+E2=E3
After B1 is processed, B2 is processed:
In B2, read the target arrangement mode of B2, assign channel_6 and channel_5 to A1, and assign channel_7 and channel_8 to A2.
The Second Eigenvalue of the multiply-add operation computed in A1 is: a6*w6+a5*w5=E4
The Second Eigenvalue of the multiply-add operation computed in A2 is: a7*w7+a8*w8=E5
E4 and E5 merge into: E4+E5=E6
E3 and E6 merge into third feature value: E3+E6=E7
E7 is mapped to a floating-point value and, after operations such as normalization and activation by an activation function, serves as the output activation value.
Compress the bit number of E7
E7 is output to lower layer's convolutional layer by output channel
Fig. 7 is a structural schematic diagram of the fixed-point calculation device for a convolutional neural network provided by embodiment four of the present invention, which may specifically include the following modules:
an input activation value receiving module 710, configured to receive the input activation value of this convolutional layer through input channels, the input channels having corresponding weights;
a fixed-point conversion module 720, configured to carry out a fixed-point operation on the input activation value to obtain the First Eigenvalue;
a grouped storage module 730, configured to write the First Eigenvalue and the weight respectively into the registers of multiple register groups;
a multiply-add operation module 740, configured to, for the multiple register groups, carry out a multiply-add operation according to the First Eigenvalue and the weight in the registers, obtaining multiple Second Eigenvalues.
In an optional embodiment of the present invention, the registers of each register group include a register for storing the First Eigenvalue, a register for storing the weight, a multiplication register and an addition register;
the multiply-add operation module 740 includes:
a multiplication submodule, configured to, for each register group, carry out a multiplication operation in the multiplication register on the First Eigenvalue and the weight corresponding to the same input channel, obtaining feature product data;
an addition submodule, configured to accumulate the feature product data into the addition register to obtain the Second Eigenvalue.
In an optional embodiment of the present invention, the device further includes:
a characteristic value merging module, configured to merge the multiple Second Eigenvalues to obtain the third feature value;
a floating-point conversion module, configured to carry out a floating-point operation on the third feature value to obtain the fourth feature value;
an output activation value generation module, configured to generate the output activation value of this convolutional layer according to the fourth feature value.
In an optional embodiment of the present invention, the device further includes:
a weight compression module, configured to compress the bit number of the weight so as to compress the bit number of the Second Eigenvalue.
In an optional embodiment of the present invention, the input activation value of this convolutional layer is the output activation value of the upper convolutional layer;
the device further includes:
an output activation value compression module, configured to compress the bit number of the output activation value of the upper convolutional layer so as to compress the bit number of the Second Eigenvalue.
In an optional embodiment of the present invention, the convolutional layer is a grouped convolutional layer comprising multiple convolution groups;
the device further includes:
a candidate arrangement mode enumeration module, configured to, for each convolution group, enumerate the candidate arrangement modes of the assigned input channels, the input channels having corresponding first trained values;
a training set writing module, configured to write the first trained values and the weights respectively into the registers of multiple register groups;
a training set training module, configured to, for the multiple register groups, carry out a multiply-add operation under every candidate arrangement mode according to the first trained values and the weights in the registers, obtaining multiple second trained values;
a target trained value selection module, configured to determine, among the second trained values, the second trained value with the smallest absolute value as the target trained value;
a target arrangement mode setting module, configured to set the candidate arrangement mode corresponding to the target trained value as the target arrangement mode of the input channels.
In an optional embodiment of the present invention, the registers of each register group include a register for storing the First Eigenvalue, a register for storing the weight, a multiplication register and an addition register;
the training set training module includes:
a training product data computation submodule, configured to carry out, in the multiplication register, a multiplication operation on the first trained value and the weight corresponding to each input channel, obtaining training product data;
a target training product data selection submodule, configured to select from the training product data the m training product data with the smallest values, as target training product data;
a target training product data writing submodule, configured to write the target training product data into the addition register;
a training product data accumulation submodule, configured to add the training product data other than the target training product data into the addition register according to each arrangement mode, obtaining the second trained values.
In an optional embodiment of the present invention, the convolutional layer is a grouped convolutional layer comprising multiple convolution groups;
the input activation value receiving module includes:
a channel distribution submodule, configured to determine the input channels assigned to each convolution group;
a channel reception submodule, configured to receive the input activation values in each convolution group through the assigned input channels;
the grouped storage module includes:
a target convolution group determination submodule, configured to determine target convolution groups in turn from the multiple convolution groups;
a channel storage submodule, configured to, in the target convolution group, write the First Eigenvalues and weights corresponding to the assigned input channels respectively into the registers of multiple register groups;
the multiply-add operation module includes:
an arranged multiply-add submodule, configured to, in the target convolution group, carry out a multiply-add operation in each register group according to the First Eigenvalue, the weight and the preset target arrangement mode, obtaining multiple Second Eigenvalues.
The fixed-point calculation device for a convolutional neural network provided by the embodiment of the present invention can execute the fixed-point calculation method of a convolutional neural network provided by any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for executing the method.
Fig. 8 is a structural schematic diagram of the equipment provided by embodiment five of the present invention. As shown in Fig. 8, the equipment includes a processor 80, a memory 81, an input device 82 and an output device 83; the number of processors 80 in the equipment can be one or more, and one processor 80 is taken as an example in Fig. 8; the processor 80, memory 81, input device 82 and output device 83 in the equipment can be connected by a bus or in other ways, and connection by a bus is taken as an example in Fig. 8.
The processor 80 includes a central processing unit (Central Processing Unit, CPU), and the registers 801 are components of the central processing unit. The registers 801 are high-speed storage elements of limited capacity; they can be used to temporarily hold instructions, data and addresses. The control unit of the central processing unit includes registers such as the instruction register (IR) and the program counter (PC); the arithmetic and logic unit of the central processing unit includes registers 801 such as the accumulator (ACC).
The memory 81, as a computer-readable storage medium, can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the fixed-point calculation method of a convolutional neural network in the embodiment of the present invention (for example, the input activation value receiving module 710, the fixed-point conversion module 720, the grouped storage module 730 and the multiply-add operation module 740 in the fixed-point calculation device for a convolutional neural network). By running the software programs, instructions and modules stored in the memory 81, the processor 80 executes the various functional applications and data processing of the equipment/terminal/server, thereby realizing the above fixed-point calculation method of a convolutional neural network.
The memory 81 may mainly include a program storage area and a data storage area, where the program storage area can store an operating system and application programs required by at least one function, and the data storage area can store data created according to the use of the terminal, etc. In addition, the memory 81 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage component. In some examples, the memory 81 may further include memories remotely located relative to the processor 80, and these remote memories can be connected to the equipment/terminal/server through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network and combinations thereof.
The input device 82 may be used to receive input numeric or character information and to generate key signal inputs related to the user settings and function control of the device. The output device 83 may include a display device such as a display screen.
An embodiment of the present invention also provides a storage medium containing computer-executable instructions. When executed by a computer processor, the computer-executable instructions are used to perform a fixed-point calculation method of a convolutional neural network, the method comprising:
receiving the input activation value of the current convolutional layer through input channels, each input channel having a corresponding weight;
performing a fixed-point operation on the input activation value to obtain a first eigenvalue;
writing the first eigenvalue and the weights into the registers of multiple register groupings;
for the multiple register groupings, performing multiply-add operations respectively according to the first eigenvalue and the weights in the registers, obtaining multiple second eigenvalues.
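The grouped accumulation described by these steps can be sketched in plain Python. The group count, the 16-bit partial-accumulator width, and the round-robin channel-to-grouping assignment below are illustrative assumptions, not details fixed by the patent text:

```python
GROUPS = 4           # number of register groupings (assumed)
ACC_BITS = 16        # width of each partial accumulator (assumed)
ACC_MAX = (1 << (ACC_BITS - 1)) - 1   # 32767

def grouped_mac(activations, weights, groups=GROUPS):
    """Spread the per-channel products over several accumulators so each
    one absorbs only len(activations) / groups additions, which lowers the
    chance that any single accumulator overflows."""
    assert len(activations) == len(weights)
    partial = [0] * groups
    for i, (a, w) in enumerate(zip(activations, weights)):
        g = i % groups                 # round-robin channel-to-grouping map
        partial[g] += a * w            # multiply-add inside one grouping
        if abs(partial[g]) > ACC_MAX:
            raise OverflowError(f"grouping {g} overflowed")
    return partial                     # the multiple "second eigenvalues"

acts = [12, -7, 3, 25, -14, 8, 19, -2]    # fixed-point input activations
wts  = [ 3,  5, -2,  1,   4, -6,  2,  7]  # per-channel weights
partials = grouped_mac(acts, wts)
total = sum(partials)                     # merging recovers the full sum
```

Because each grouping accumulates only a fraction of the products, the peak magnitude any single accumulator must hold is reduced, which is the overflow-risk argument made in the abstract.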
Certainly, in the storage medium containing computer-executable instructions provided by the embodiments of the present invention, the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the fixed-point calculation of the convolutional neural network provided by any embodiment of the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention may be implemented by software plus the necessary general-purpose hardware, and certainly may also be implemented by hardware, although in many cases the former is the better embodiment. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
It is worth noting that, in the above embodiment of the fixed-point calculation device of the convolutional neural network, the included units and modules are divided only according to functional logic, but the division is not limited to the above, as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not intended to limit the protection scope of the present invention.
It should be noted that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, the present invention is not limited thereto and may include other equivalent embodiments without departing from the inventive concept, the scope of the present invention being determined by the scope of the appended claims.
Claims (14)
1. A fixed-point calculation method of a convolutional neural network, characterized in that the convolutional neural network includes a convolutional layer, and the method comprises:
receiving the input activation value of the current convolutional layer through input channels, each input channel having a corresponding weight;
performing a fixed-point operation on the input activation value to obtain a first eigenvalue;
writing the first eigenvalue and the weights into the registers of multiple register groupings;
for the multiple register groupings, performing multiply-add operations respectively according to the first eigenvalue and the weights in the registers, obtaining multiple second eigenvalues.
2. The fixed-point calculation method according to claim 1, characterized in that the registers of each register grouping include a register for storing the first eigenvalue, a register for storing the weight, a multiplication register, and an addition register;
for the multiple register groupings, performing multiply-add operations respectively according to the first eigenvalue and the weights in the registers, obtaining multiple second eigenvalues, comprises:
for each register grouping, performing a multiplication operation in the multiplication register on the first eigenvalue and the weight corresponding to the same input channel, obtaining feature product data;
accumulating the feature product data into the addition register, obtaining a second eigenvalue.
3. The fixed-point calculation method according to claim 1, characterized by further comprising:
merging the multiple second eigenvalues to obtain a third eigenvalue;
performing a floating-point operation on the third eigenvalue to obtain a fourth eigenvalue;
generating the output activation value of the current convolutional layer according to the fourth eigenvalue.
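The merge-then-dequantize step of claim 3 can be illustrated as follows. The scale factors and the use of ReLU to generate the output activation are assumptions made for the sketch; the claim does not fix a particular quantization scheme or activation function:

```python
ACT_SCALE = 1 / 128.0   # assumed fixed-point scale of the activations
WT_SCALE  = 1 / 64.0    # assumed fixed-point scale of the weights

def merge_and_dequantize(partials, bias=0.0):
    """Merge the grouped partial sums into the third eigenvalue, then
    convert back to floating point (the fourth eigenvalue) and apply an
    assumed ReLU to produce the layer's output activation value."""
    third = sum(partials)                          # merged third eigenvalue
    fourth = third * ACT_SCALE * WT_SCALE + bias   # floating-point operation
    return max(fourth, 0.0)                        # assumed ReLU output

out = merge_and_dequantize([-20, -83, 32, 11])     # negative sum -> 0.0
```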
4. The fixed-point calculation method according to any one of claims 1-3, characterized in that, before receiving the input activation value of the current convolutional layer through the input channels, the method further comprises:
compressing the bit number of the weights, so as to compress the bit number of the second eigenvalues.
5. The fixed-point calculation method according to any one of claims 1-3, characterized in that the input activation value of the current convolutional layer is the output activation value of the previous convolutional layer;
before receiving the input activation value of the current convolutional layer through the input channels, the method further comprises:
compressing the bit number of the output activation value of the previous convolutional layer, so as to compress the bit number of the second eigenvalues.
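Claims 4 and 5 both compress bit widths so that the multiply-add products, and hence the second eigenvalues, fit in fewer accumulator bits. The shift-based round-to-nearest scheme below is one assumed way to do such compression, not the patent's specified method:

```python
def compress_bits(value, from_bits=8, to_bits=4):
    """Requantize a signed fixed-point value to fewer bits by rounding away
    the low-order bits, then clamping to the narrower signed range."""
    shift = from_bits - to_bits
    rounded = (value + (1 << (shift - 1))) >> shift   # round to nearest
    lo, hi = -(1 << (to_bits - 1)), (1 << (to_bits - 1)) - 1
    return max(lo, min(hi, rounded))                  # clamp into range

# 8-bit weights compressed to 4 bits; extremes clamp to the int4 range.
compressed = [compress_bits(w) for w in [127, -128, 33, -7, 0]]
```

After compression, an int4 weight times an int8 activation needs at most 12 bits per product, so each grouped accumulator can absorb more additions before overflowing.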
6. The fixed-point calculation method according to claim 1, characterized in that the convolutional layer is a grouped convolutional layer, the grouped convolutional layer comprising multiple convolution groups;
the method further comprises:
for each convolution group, enumerating candidate arrangement modes of the assigned input channels, each input channel having a corresponding first training value;
writing the first training values and the weights into the registers of multiple register groupings;
for the multiple register groupings, performing multiply-add operations respectively according to the first training values and the weights in the registers under each candidate arrangement mode, obtaining multiple second training values;
determining the second training value with the smallest absolute value among the second training values as the target training value;
setting the candidate arrangement mode corresponding to the target training value as the target arrangement mode of the input channels.
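One literal reading of claim 6, sketched below: enumerate the candidate orderings of a convolution group's channels, run the grouped multiply-add under each ordering, and keep the ordering whose second training value has the smallest absolute value. The brute-force permutation search and the round-robin grouping are assumptions; a real implementation of the claimed training step would presumably prune this search:

```python
from itertools import permutations

def pick_arrangement(train_vals, weights, groups=2):
    """Try every candidate ordering of the input channels and keep the one
    whose smallest-magnitude second training value is minimal (the target
    training value of claim 6)."""
    best_order, best_abs = None, None
    for order in permutations(range(len(train_vals))):
        partial = [0] * groups
        for pos, ch in enumerate(order):
            partial[pos % groups] += train_vals[ch] * weights[ch]
        smallest = min(abs(p) for p in partial)     # candidate target value
        if best_abs is None or smallest < best_abs:
            best_order, best_abs = order, smallest
    return best_order, best_abs

best_order, best_abs = pick_arrangement([3, -2, 5, 1], [2, 4, 1, 3])
```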
7. The fixed-point calculation method according to claim 6, characterized in that the registers of each register grouping include a register for storing the first eigenvalue, a register for storing the weight, a multiplication register, and an addition register;
for the multiple register groupings, performing multiply-add operations respectively according to the first training values and the weights in the registers under each candidate arrangement mode, obtaining multiple second training values, comprises:
performing a multiplication operation in the multiplication register on the first training value and the weight corresponding to each input channel, obtaining training product data;
selecting the m training product data with the smallest values from the training product data as target training product data;
writing the target training product data into the addition register;
accumulating the other training product data, except the target training product data, into the addition register according to each arrangement mode, obtaining the second training values.
8. The fixed-point calculation method according to claim 1, 2, 3, 6, or 7, characterized in that the convolutional layer is a grouped convolutional layer, the grouped convolutional layer comprising multiple convolution groups;
receiving the input activation value of the current convolutional layer through input channels comprises:
determining the input channels assigned to each convolution group;
in each convolution group, receiving the input activation value through the assigned input channels;
writing the first eigenvalue and the weights into the registers of multiple register groupings comprises:
successively determining a target convolution group from the multiple convolution groups;
in the target convolution group, writing the first eigenvalues corresponding to the assigned input channels and the weights into the registers of multiple register groupings;
for the multiple register groupings, performing multiply-add operations respectively according to the first eigenvalue and the weights in the registers, obtaining multiple second eigenvalues, comprises:
in the target convolution group, performing multiply-add operations respectively according to the first eigenvalues and the weights in the register groupings under the preset target arrangement mode, obtaining multiple second eigenvalues.
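The grouped-convolution flow of claim 8 can be sketched as follows: each convolution group owns a slice of the input channels, the groups are processed in turn as the target convolution group, and the grouped multiply-add runs only over that group's channels. The contiguous channel split and the round-robin register grouping are assumptions made for illustration:

```python
def grouped_conv_mac(acts, wts, conv_groups=2, reg_groups=2):
    """Process each convolution group in turn as the target group, running
    the grouped multiply-add only over that group's assigned channels."""
    per = len(acts) // conv_groups
    results = []
    for g in range(conv_groups):                # successive target groups
        chans = range(g * per, (g + 1) * per)   # channels assigned to group g
        partial = [0] * reg_groups
        for pos, ch in enumerate(chans):
            partial[pos % reg_groups] += acts[ch] * wts[ch]
        results.append(partial)                 # second eigenvalues per group
    return results

res = grouped_conv_mac([1, 2, 3, 4], [5, 6, 7, 8])   # -> [[5, 12], [21, 32]]
```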
9. A fixed-point calculation device of a convolutional neural network, characterized in that the convolutional neural network includes a convolutional layer, and the device comprises:
an input activation value receiving module, configured to receive the input activation value of the current convolutional layer through input channels, each input channel having a corresponding weight;
a fixed-point conversion module, configured to perform a fixed-point operation on the input activation value to obtain a first eigenvalue;
a grouping storage module, configured to write the first eigenvalue and the weights into the registers of multiple register groupings;
a multiply-add operation module, configured to, for the multiple register groupings, perform multiply-add operations respectively according to the first eigenvalue and the weights in the registers, obtaining multiple second eigenvalues.
10. The fixed-point calculation device according to claim 9, characterized in that the registers of each register grouping include a register for storing the first eigenvalue, a register for storing the weight, a multiplication register, and an addition register;
the multiply-add operation module comprises:
a multiplication submodule, configured to, for each register grouping, perform a multiplication operation in the multiplication register on the first eigenvalue and the weight corresponding to the same input channel, obtaining feature product data;
an addition submodule, configured to accumulate the feature product data into the addition register, obtaining a second eigenvalue.
11. The fixed-point calculation device according to claim 9, characterized by further comprising:
an eigenvalue merging module, configured to merge the multiple second eigenvalues to obtain a third eigenvalue;
a floating-point conversion module, configured to perform a floating-point operation on the third eigenvalue to obtain a fourth eigenvalue;
an output activation value generation module, configured to generate the output activation value of the current convolutional layer according to the fourth eigenvalue.
12. The fixed-point calculation device according to any one of claims 9-11, characterized by further comprising:
a weight compression module, configured to compress the bit number of the weights, so as to compress the bit number of the second eigenvalues.
13. a kind of equipment including memory, processor and stores the computer journey that can be run on a memory and on a processor
Sequence, which is characterized in that the processor realizes such as convolutional Neural net described in any one of claims 1-8 when executing described program
The fixed-point calculation method of network.
14. A computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the fixed-point calculation method of the convolutional neural network according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811302449.8A CN109409514A (en) | 2018-11-02 | 2018-11-02 | Fixed-point calculation method, apparatus, equipment and the storage medium of convolutional neural networks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811302449.8A CN109409514A (en) | 2018-11-02 | 2018-11-02 | Fixed-point calculation method, apparatus, equipment and the storage medium of convolutional neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109409514A true CN109409514A (en) | 2019-03-01 |
Family
ID=65471379
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811302449.8A Pending CN109409514A (en) | 2018-11-02 | 2018-11-02 | Fixed-point calculation method, apparatus, equipment and the storage medium of convolutional neural networks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109409514A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110780845A (en) * | 2019-10-17 | 2020-02-11 | 浙江大学 | Configurable approximate multiplier for quantization convolutional neural network and implementation method thereof |
CN110796245A (en) * | 2019-10-25 | 2020-02-14 | 浪潮电子信息产业股份有限公司 | Method and device for calculating convolutional neural network model |
CN110874813A (en) * | 2020-01-16 | 2020-03-10 | 湖南极点智能科技有限公司 | Image processing method, device and equipment and readable storage medium |
CN110929862A (en) * | 2019-11-26 | 2020-03-27 | 陈子祺 | Fixed-point neural network model quantization device and method |
CN111210017A (en) * | 2019-12-24 | 2020-05-29 | 北京迈格威科技有限公司 | Method, device, equipment and storage medium for determining layout sequence and processing data |
CN111767980A (en) * | 2019-04-02 | 2020-10-13 | 杭州海康威视数字技术股份有限公司 | Model optimization method, device and equipment |
CN113408715A (en) * | 2020-03-17 | 2021-09-17 | 杭州海康威视数字技术股份有限公司 | Fixed-point method and device for neural network |
CN113785312A (en) * | 2019-05-16 | 2021-12-10 | 日立安斯泰莫株式会社 | Arithmetic device and arithmetic method |
WO2022006919A1 (en) * | 2020-07-10 | 2022-01-13 | 中国科学院自动化研究所 | Activation fixed-point fitting-based method and system for post-training quantization of convolutional neural network |
CN110298446B (en) * | 2019-06-28 | 2022-04-05 | 济南大学 | Deep neural network compression and acceleration method and system for embedded system |
CN114692833A (en) * | 2022-03-30 | 2022-07-01 | 深圳齐芯半导体有限公司 | Convolution calculation circuit, neural network processor and convolution calculation method |
CN115994561A (en) * | 2023-03-22 | 2023-04-21 | 山东云海国创云计算装备产业创新中心有限公司 | Convolutional neural network acceleration method, system, storage medium, device and equipment |
CN118426734A (en) * | 2024-07-02 | 2024-08-02 | 深圳鲲云信息科技有限公司 | Accumulator, method for accumulator and computing device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101882238A (en) * | 2010-07-15 | 2010-11-10 | 长安大学 | Wavelet neural network processor based on SOPC (System On a Programmable Chip) |
CN102665049A (en) * | 2012-03-29 | 2012-09-12 | 中国科学院半导体研究所 | Programmable visual chip-based visual image processing system |
CN103399304A (en) * | 2013-07-22 | 2013-11-20 | 西安电子科技大学 | Field programmable gate array (FPGA) implementation equipment and method for self-adaptive clutter suppression of external radiation source radar |
CN106127302A (en) * | 2016-06-23 | 2016-11-16 | 杭州华为数字技术有限公司 | Circuit for processing data, image processing system, and method and apparatus for processing data
CN107292382A (en) * | 2016-03-30 | 2017-10-24 | 中国科学院声学研究所 | Fixed-point quantization method for the activation function of a neural network acoustic model
CN107636697A (en) * | 2015-05-08 | 2018-01-26 | 高通股份有限公司 | Fixed-point neural network based on floating-point neural network quantization
CN108133270A (en) * | 2018-01-12 | 2018-06-08 | 清华大学 | Convolutional neural networks accelerating method and device |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101882238A (en) * | 2010-07-15 | 2010-11-10 | 长安大学 | Wavelet neural network processor based on SOPC (System On a Programmable Chip) |
CN102665049A (en) * | 2012-03-29 | 2012-09-12 | 中国科学院半导体研究所 | Programmable visual chip-based visual image processing system |
CN103399304A (en) * | 2013-07-22 | 2013-11-20 | 西安电子科技大学 | Field programmable gate array (FPGA) implementation equipment and method for self-adaptive clutter suppression of external radiation source radar |
CN107636697A (en) * | 2015-05-08 | 2018-01-26 | 高通股份有限公司 | Fixed-point neural network based on floating-point neural network quantization |
CN107292382A (en) * | 2016-03-30 | 2017-10-24 | 中国科学院声学研究所 | Fixed-point quantization method for the activation function of a neural network acoustic model |
CN106127302A (en) * | 2016-06-23 | 2016-11-16 | 杭州华为数字技术有限公司 | Circuit for processing data, image processing system, and method and apparatus for processing data |
CN108133270A (en) * | 2018-01-12 | 2018-06-08 | 清华大学 | Convolutional neural networks accelerating method and device |
Non-Patent Citations (1)
Title |
---|
Liu Yang: "Digital Image Object Recognition: Detailed Theory and Practice" (《数字图像物体识别理论详解与实战》), 31 January 2018, Beijing University of Posts and Telecommunications Press *
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111767980B (en) * | 2019-04-02 | 2024-03-05 | 杭州海康威视数字技术股份有限公司 | Model optimization method, device and equipment |
CN111767980A (en) * | 2019-04-02 | 2020-10-13 | 杭州海康威视数字技术股份有限公司 | Model optimization method, device and equipment |
CN113785312A (en) * | 2019-05-16 | 2021-12-10 | 日立安斯泰莫株式会社 | Arithmetic device and arithmetic method |
CN113785312B (en) * | 2019-05-16 | 2024-06-07 | 日立安斯泰莫株式会社 | Arithmetic device and arithmetic method |
CN110298446B (en) * | 2019-06-28 | 2022-04-05 | 济南大学 | Deep neural network compression and acceleration method and system for embedded system |
CN110780845A (en) * | 2019-10-17 | 2020-02-11 | 浙江大学 | Configurable approximate multiplier for quantization convolutional neural network and implementation method thereof |
CN110780845B (en) * | 2019-10-17 | 2021-11-30 | 浙江大学 | Configurable approximate multiplier for quantization convolutional neural network and implementation method thereof |
CN110796245A (en) * | 2019-10-25 | 2020-02-14 | 浪潮电子信息产业股份有限公司 | Method and device for calculating convolutional neural network model |
CN110796245B (en) * | 2019-10-25 | 2022-03-22 | 浪潮电子信息产业股份有限公司 | Method and device for calculating convolutional neural network model |
CN110929862B (en) * | 2019-11-26 | 2023-08-01 | 陈子祺 | Fixed-point neural network model quantification device and method |
CN110929862A (en) * | 2019-11-26 | 2020-03-27 | 陈子祺 | Fixed-point neural network model quantization device and method |
CN111210017A (en) * | 2019-12-24 | 2020-05-29 | 北京迈格威科技有限公司 | Method, device, equipment and storage medium for determining layout sequence and processing data |
CN111210017B (en) * | 2019-12-24 | 2023-09-26 | 北京迈格威科技有限公司 | Method, device, equipment and storage medium for determining layout sequence and data processing |
CN110874813A (en) * | 2020-01-16 | 2020-03-10 | 湖南极点智能科技有限公司 | Image processing method, device and equipment and readable storage medium |
WO2021185125A1 (en) * | 2020-03-17 | 2021-09-23 | 杭州海康威视数字技术股份有限公司 | Fixed-point method and apparatus for neural network |
CN113408715A (en) * | 2020-03-17 | 2021-09-17 | 杭州海康威视数字技术股份有限公司 | Fixed-point method and device for neural network |
CN113408715B (en) * | 2020-03-17 | 2024-05-28 | 杭州海康威视数字技术股份有限公司 | Method and device for fixing neural network |
WO2022006919A1 (en) * | 2020-07-10 | 2022-01-13 | 中国科学院自动化研究所 | Activation fixed-point fitting-based method and system for post-training quantization of convolutional neural network |
CN114692833A (en) * | 2022-03-30 | 2022-07-01 | 深圳齐芯半导体有限公司 | Convolution calculation circuit, neural network processor and convolution calculation method |
CN114692833B (en) * | 2022-03-30 | 2023-11-21 | 广东齐芯半导体有限公司 | Convolution calculation circuit, neural network processor and convolution calculation method |
CN115994561A (en) * | 2023-03-22 | 2023-04-21 | 山东云海国创云计算装备产业创新中心有限公司 | Convolutional neural network acceleration method, system, storage medium, device and equipment |
CN118426734A (en) * | 2024-07-02 | 2024-08-02 | 深圳鲲云信息科技有限公司 | Accumulator, method for accumulator and computing device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109409514A (en) | Fixed-point calculation method, apparatus, equipment and the storage medium of convolutional neural networks | |
CN110378468B (en) | Neural network accelerator based on structured pruning and low bit quantization | |
Gondimalla et al. | SparTen: A sparse tensor accelerator for convolutional neural networks | |
CN109063825B (en) | Convolutional neural network accelerator | |
US20180204110A1 (en) | Compressed neural network system using sparse parameters and design method thereof | |
CN107451659B (en) | Neural network accelerator for bit width partition and implementation method thereof | |
CN109478144B (en) | Data processing device and method | |
KR102476343B1 (en) | Apparatus and method for supporting neural network calculation of fixed-point numbers with relatively few digits | |
CN109543816B (en) | Convolutional neural network calculation method and system based on weight kneading | |
CN108053028A (en) | Data fixed point processing method, device, electronic equipment and computer storage media | |
US11797855B2 (en) | System and method of accelerating execution of a neural network | |
CN108701250A (en) | Data fixed point method and apparatus | |
CN112668708B (en) | Convolution operation device for improving data utilization rate | |
WO2019239254A1 (en) | Parallel computational architecture with reconfigurable core-level and vector-level parallelism | |
CN106127302A (en) | Circuit for processing data, image processing system, and method and apparatus for processing data | |
CN110717583B (en) | Convolution circuit, processor, chip, board card and electronic equipment | |
CN110705703A (en) | Sparse neural network processor based on systolic array | |
TWI738048B (en) | Arithmetic framework system and method for operating floating-to-fixed arithmetic framework | |
CN111985597B (en) | Model compression method and device | |
JP7085600B2 (en) | Similar area enhancement method and system using similarity between images | |
Shahshahani et al. | Memory optimization techniques for fpga based cnn implementations | |
Delaye et al. | Deep learning challenges and solutions with xilinx fpgas | |
CN110337636A (en) | Data transfer device and device | |
Ahn et al. | Deeper weight pruning without accuracy loss in deep neural networks: Signed-digit representation-based approach | |
Lei et al. | Compressing deep convolutional networks using k-means based on weights distribution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190301 ||