Specific embodiments
Preferred embodiments of the disclosure are described more fully below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be appreciated that the disclosure may be realized in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In this application, the improvement of artificial neural networks according to the present invention is illustrated mainly by taking a long short-term memory (LSTM) model for speech recognition as an example. The scheme of this application is applicable to various artificial neural networks, including deep neural networks (DNN), recurrent neural networks (RNN) and convolutional neural networks (CNN), and is particularly suitable for the above-mentioned LSTM model, which belongs to the RNN family.
Basic concepts of neural network compression
An artificial neural network (ANN) is a mathematical computing model that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. A neural network contains a large number of interconnected nodes, also called "neurons". Each neuron processes the weighted input values from other adjacent neurons through a specific output function, also called an "activation function", and the strength of information transmission between neurons is defined by so-called "weights". The algorithm continuously learns by itself and adjusts these weights.
Early neural networks had only two layers, an input layer and an output layer. Since they could not process complex logic, their practicality was considerably restricted. Deep neural networks (DNNs) greatly improve the ability of neural networks to process complex logic by adding hidden intermediate layers between the input and output layers. Fig. 1 shows a schematic diagram of a DNN model. It should be understood that in practical applications a DNN may have a large-scale structure much more complex than that shown in Fig. 1, but its basic structure is still as shown in Fig. 1.
Speech recognition sequentially maps an analog speech signal to a specific set of symbols. In recent years, methods based on artificial neural networks have achieved results in the field of speech recognition that far exceed all conventional methods and have become the mainstream of the whole industry, with deep neural networks finding extremely wide application.
A recurrent neural network (RNN) is a commonly used deep neural network model. Unlike a traditional feed-forward (BP) neural network, an RNN introduces directed cycles and can handle problems in which the inputs are correlated over time. In speech recognition this temporal correlation of the signal is very strong: recognizing a word in a sentence, for example, depends closely on the sequence of words that precedes it. Recurrent neural networks therefore have a very wide range of applications in the field of speech recognition.
To solve the problem of memorizing long-term information, Hochreiter & Schmidhuber proposed the long short-term memory (LSTM) model in 1997. An LSTM network is a kind of RNN that replaces the simple repeating neural network module inside an ordinary RNN with a complex structure of interacting connections. Fig. 2 shows a schematic diagram of the LSTM neural network model. LSTM networks have likewise achieved very good results in speech recognition.
When designing and training a deep neural network, a larger network has stronger expressive power and can represent stronger nonlinear relations between the input features and the outputs of the network. However, when learning the actually useful patterns, such a larger network is more easily influenced by noise in the training set, so that the learned patterns deviate from what is expected. Because this training-set noise is ubiquitous and differs from dataset to dataset, a network trained on one dataset is liable to overfit under the influence of the noise.
With rapid development in recent years, the scale of neural networks keeps growing; advanced networks may have hundreds of layers and hundreds of millions of connections. Since ever-larger neural networks consume a large amount of computation and memory-access resources, model compression becomes particularly important.
In a neural network, especially a deep neural network, the connections among neurons are mathematically represented as a series of matrices. Although a trained network predicts accurately, its matrices are all dense, i.e. "filled with non-zero elements". As neural networks become more complex, computation on these dense matrices consumes a large amount of storage and computing resources. The resulting low speed and high cost confront deployment on mobile terminals with enormous difficulty, thereby greatly constraining the development of neural networks.
Recent studies indicate that, in the model matrices obtained by training a neural network, only the elements with larger weights represent important connections, while the elements with smaller weights can be removed (set to zero), with the corresponding neurons pruned at the same time. The accuracy of the network declines after pruning, but the loss of accuracy can be reduced by retraining (fine-tuning), which adjusts the magnitudes of the weights still retained in the model matrices. Model compression sparsifies the dense matrices of a neural network; it can effectively reduce the amount of storage and computation and achieve acceleration while maintaining accuracy. Model compression is particularly important for dedicated sparse neural network accelerators. Fig. 3 shows a schematic flow of network compression by pruning and retraining. Fig. 4 correspondingly shows the distribution of nodes (neurons) and connections (synapses) before and after pruning.
After pruning, the degree of sparsification of a model matrix can be expressed by a compression ratio. At present, the compression ratio is usually selected by sensitivity analysis. Under the same compression ratio, different matrices in the same neural network have completely different effects on network accuracy. For example, a single-layer LSTM network contains nine matrices: Wgx, Wix, Wfx, Wox, Wgr, Wir, Wfr, Wor and Wrm. Compressing the matrix Wrm to 10% density causes a drastic drop in network accuracy (that is, a drastic rise in word error rate), whereas compressing the matrix Wor to 10% leaves the accuracy of the network basically unchanged. The prior art therefore usually uses sensitivity analysis: the network accuracy is tested for each matrix under different compression ratios, a suitable compression ratio is chosen as an initial value, and it is fine-tuned on this basis to give the final compression ratio. For example, each matrix in a single-layer LSTM network is compressed at nine densities, 0.1, 0.2, ..., 0.9, the word error rate (WER) of the network is tested at each density, and the smallest density whose |ΔWER| relative to the uncompressed network is below a specified threshold is chosen as the initial density for that matrix. Since the scanning interval of this parameter sweep is rather large, it may be called a coarse-grained selection method.
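By way of illustration only, this coarse-grained selection can be sketched in Python as follows. The sketch assumes magnitude-based pruning to a target density, and evaluate_wer is a hypothetical placeholder for measuring WER on a validation set:

```python
import numpy as np

def prune_to_density(mat: np.ndarray, density: float) -> np.ndarray:
    """Keep the largest-magnitude fraction `density` of elements; zero the rest."""
    k = int(round(mat.size * density))
    if k == 0:
        return np.zeros_like(mat)
    threshold = np.sort(np.abs(mat).ravel())[-k]
    return np.where(np.abs(mat) >= threshold, mat, 0.0)

def choose_initial_density(mat, baseline_wer, evaluate_wer, max_delta=0.5):
    """Scan densities 0.1 .. 0.9 and return the smallest one whose |dWER|
    against the uncompressed baseline stays below the specified threshold."""
    for density in np.arange(0.1, 1.0, 0.1):
        wer = evaluate_wer(prune_to_density(mat, density))
        if abs(wer - baseline_wer) < max_delta:
            return density  # smallest acceptable density becomes the initial value
    return 1.0              # matrix too sensitive to prune: leave it uncompressed
```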
Mask compression and retraining of neural networks
Compressing a deep neural network is essentially the sparsification of its weight matrices. A sparsified weight matrix contains many zero-valued elements. During computation these zero elements can be skipped, reducing the number of operations needed and thereby increasing computing speed. Meanwhile, if the degree of sparsification is high (for example, a density of 15%), only the non-zero weights need to be stored, reducing storage space. However, because the compression process removes a considerable portion of the weights, the accuracy of the whole deep neural network drops substantially, and a retraining stage is needed to adjust the magnitudes of the weights still retained in the network weight matrices and restore the model accuracy of the deep neural network. In general, however, since pruning resets some weights to zero, which is equivalent to adding a new constraint in the solution space, the accuracy after reconverging to a new local optimum, although improved, is still lower than that of the deep neural network before pruning.
According to the existing retraining approach, a matrix-shaped mask (denoted M) is generated during pruning. The mask is a set of 0-1 matrices corresponding one-to-one to the weight matrices of the LSTM. These matrix-shaped masks record the distribution of the non-zero elements of the matrices after compression: an element of 1 indicates that the element at the corresponding position of the corresponding weight matrix is retained, and an element of 0 indicates that the element at the corresponding position is set to zero. It should be understood that the purpose of introducing the above 0-1 mask matrices is to constrain the corresponding weights in the weight matrices according to the value of each element of the mask (that is, to set them to zero or keep them unchanged); the matrix-shaped mask is merely one means of realizing this constraint that is convenient for computer implementation. In other words, if in concrete practice the zeroing of the corresponding weights in the weight matrices is realized in another way, it can also be regarded as an implementation with an effect equivalent to that of the matrix mask.
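By way of illustration only, the semantics of the 0-1 mask can be sketched as follows; the matrix values are purely illustrative:

```python
import numpy as np

W = np.array([[0.9, -0.1, 0.0],
              [0.05, -0.8, 0.3]])
M = (np.abs(W) >= 0.3).astype(W.dtype)  # 0-1 mask produced by a magnitude rule

W_pruned = M * W                        # element-wise product M ⊙ W keeps 1-positions
print(M)         # [[1. 0. 0.]  [0. 1. 1.]]
print(W_pruned)  # [[0.9 0. 0.]  [0. -0.8 0.3]]
```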
The principle of mask compression of a neural network is now illustrated with specific reference to the compression process of an LSTM deep neural network for speech recognition. The pruning result is stored as a matrix-shaped mask, and by using the matrix-shaped mask appropriately during the retraining stage, the accuracy of the deep neural network is preserved as far as possible while the compression ratio is kept unchanged.
1. Retraining with a mask
For an ANN in which a plurality of neurons and the connections between the neurons are represented by connection weight matrices, Fig. 5 shows a pruning-retraining method for neural network compression. The method includes a pruning step S510 and a masked retraining step S520.

In step S510, one or more unimportant weights in a weight matrix obtained by previous training are set to zero to obtain a pruned weight matrix.
Here, "unimportant weights" refers to weights that have little influence on the accuracy of the neural network model and are therefore relatively unimportant. Various rules have been used to specify which weights are unimportant, for example selecting unimportant weights according to the Hessian matrix of the cost function; however, the view currently accepted by academia and verified experimentally is that the connections represented by the weights with smaller absolute values are relatively unimportant. "Unimportant weights" therefore preferably refers to the weights in the matrix with the smaller absolute values.
In step S520, the weight matrix is retrained using the mask corresponding to the pruned weight matrix (that is, a 0-1 matrix whose zero positions correspond to the pruned positions of the weight matrix).
The above masked retraining scheme is described below with reference to formulas.
Denote the process of retraining the network to optimize the per-frame cross entropy as:

nnet_o = R_mask(nnet_i, M)   (1)

where nnet_i is the pruned input network and nnet_o is the output network. R_mask denotes a masked training process, in which only the weights that have not been pruned are updated (the information on which weights are pruned is recorded in M). Through this process, the weights remaining in the network weight matrices are adjusted, and the deep neural network gradually converges to a new local optimum.
Let M ⊙ nnet_0 denote taking the element-wise product of each matrix in nnet_0 with its corresponding mask matrix in M. Let nnet_0 be the network to be compressed; the compression process is then as follows:

(a) nnet_0 → M (prune the input network to obtain the pruning mask M)
(b) nnet_i = M ⊙ nnet_0 (element-wise product of the input network and the mask, completing the pruning)
(c) nnet_o = R_mask(nnet_i, M) (retrain the pruned network with the mask to obtain nnet_o)
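By way of illustration only, R_mask can be sketched as follows under toy assumptions: a quadratic loss stands in for the per-frame cross entropy, and multiplying the gradient by the mask keeps the pruned positions at zero:

```python
import numpy as np

def retrain_with_mask(W, M, grad_fn, lr=0.01, steps=100):
    """R_mask: gradient descent in which pruned (M == 0) weights are never updated."""
    W = M * W                        # pruned positions start at exactly zero
    for _ in range(steps):
        W = W - lr * M * grad_fn(W)  # masked gradient: only retained weights move
    return W

# Toy usage: fit W towards a target T under the pruning constraint.
rng = np.random.default_rng(0)
T = rng.standard_normal((4, 4))          # stands in for the training objective
W0 = rng.standard_normal((4, 4))
M = (np.abs(W0) > 0.5).astype(float)     # pruning mask from a magnitude rule
W = retrain_with_mask(W0, M, grad_fn=lambda W: W - T)
assert np.all(W[M == 0] == 0)            # pruned weights remain zero
```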
2. Mask-free retraining + masked retraining
Compared with the flow according to Fig. 5, a mask-free retraining step S611 can be added between the pruning step S610 and the masked retraining step S620, as shown in Fig. 6; that is, a mask-free retraining link can be inserted between operations (b) and (c):

nnet_o = R_no_mask(nnet_i)   (2)
So-called mask-free retraining means removing the constraint of the pruned shape during retraining, allowing the pruned weights to regrow. An intuitive implementation is to use the pruned network weights as the initial values of the input network for retraining, the weights that have been pruned being equivalent to an input initial value of 0. Taking the pruned network weights as the initial values of the retrained input network is thus equivalent to letting the network iterate from a better starting point, giving the relatively important weights relatively larger initial values, so that the network is more likely to suppress the interference of noise and learn valuable patterns. Theory and practice show that the new network produced by this retraining can even exceed the accuracy of the network before pruning.
However, because no mask is used, the network actually generated after this training is dense and cannot achieve the goal of compression, so the weights that were originally pruned must be set to zero again, which in turn lowers the network accuracy once more. To restore the accuracy, masked training must then be carried out so that the network converges to a local optimum in the solution space to which the pruning constraint has been added, ensuring the accuracy of the pruned deep neural network.
The new compression process after this optimization is therefore as follows, where nnet_0 is the network to be compressed, R_mask is masked retraining, and R_no_mask is mask-free retraining:

(a) nnet_0 → M (prune the input network to obtain the pruning mask M)
(b) nnet_i = M ⊙ nnet_0 (element-wise product of the input network and the mask, completing the pruning)
(c) nnet_o1 = R_no_mask(nnet_i) (retrain the pruned network without the mask to obtain nnet_o1)
(d) nnet_i2 = M ⊙ nnet_o1 (element-wise product of nnet_o1 with M, removing again the weights that regrew at the pruned positions)
(e) nnet_o2 = R_mask(nnet_i2, M) (retrain the masked network with the mask to obtain the compressed network)
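By way of illustration only, steps (a)-(e) can be sketched end to end under the same toy assumptions as above (a quadratic loss standing in for the per-frame cross entropy):

```python
import numpy as np

def retrain_no_mask(W, grad_fn, lr=0.01, steps=100):
    """R_no_mask: unconstrained fine-tuning; pruned weights may regrow."""
    W = W.copy()
    for _ in range(steps):
        W = W - lr * grad_fn(W)
    return W

def retrain_with_mask(W, M, grad_fn, lr=0.01, steps=100):
    """R_mask: masked fine-tuning; pruned (M == 0) weights stay at zero."""
    W = M * W
    for _ in range(steps):
        W = W - lr * M * grad_fn(W)
    return W

rng = np.random.default_rng(1)
T = rng.standard_normal((4, 4))
grad_fn = lambda W: W - T                         # toy quadratic-loss gradient
nnet0 = rng.standard_normal((4, 4))               # network to be compressed

M = (np.abs(nnet0) > 0.5).astype(float)           # (a) prune -> pruning mask M
nnet_i = M * nnet0                                # (b) apply mask M
nnet_o1 = retrain_no_mask(nnet_i, grad_fn)        # (c) mask-free retraining
nnet_i2 = M * nnet_o1                             # (d) remove regrown weights
nnet_o2 = retrain_with_mask(nnet_i2, M, grad_fn)  # (e) masked retraining
```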
By first lifting the accuracy of the network through mask-free retraining, the common phenomenon of accuracy dropping after network compression is largely resolved. Engineering practice shows that in some cases the accuracy of the compressed network can even increase with this optimized method.
In engineering practice, operations (d)-(e) can preferably be repeated (as shown by the arrow in Fig. 6 returning from step S620 to step S611), so that the network converges to a better local optimum. The key link on which the iteration depends is its stopping condition, for example that the error rate of the retrained network no longer exceeds that of the original network, e(nnet_o2) ≤ e(nnet_0), where for the LSTM network model for speech recognition e(nnet_0) may refer to the error rate of nnet_0.
Fig. 7 shows an example in which mask-free retraining improves network accuracy. The figure shows the whole flow, and the corresponding results, of compressing an LSTM deep neural network trained on a Chinese speech dataset of several thousand hours using the added mask-free retraining step. The abscissa is the operation step; the ordinate is the word error rate (WER), an index measuring the accuracy of the deep neural network: the lower the WER, the higher the network accuracy. The black line represents the initial WER of the network to be compressed, and the grey line represents the process of three rounds of compression. The dotted lines mark the first pruning of each round of compression; it can be seen that the WER of the network rises after each pruning. After pruning, mask-free training is performed first and the WER falls; then pruning is performed again and the WER rises again; then masked training is performed and the WER falls once more. Steps 4, 8 and 12 correspond to the results of the three rounds of compression respectively; it can be seen that the accuracy is improved compared with the initial value.
Dynamic-mask compression of neural networks
For a complex neural network model, especially a multi-layer network model, the network model matrices of the individual layers are interrelated. Some important connections may therefore exhibit small weights and be cut off, and then, after most of the unimportant connections have been cut off and retraining has been performed, come to exhibit rather large weights again. Such weights have in effect been mis-pruned. In the compression flow described above in connection with Fig. 6, when the network whose accuracy dropped after pruning is retrained, the pruning mask M is still used; that is, only the magnitudes of the weights still retained in the model matrices can be adjusted, so the mis-pruning cannot be undone. This may cause the network model to converge to a poor local optimum, affecting both the compression ratio and the model accuracy. Since the above retraining process merely finds a local optimum again on the basis of the pruning and makes no attempt to correct mis-pruning and the like, a dynamic adjustment of the matrix-shaped mask that stores the pruning result can be added in the retraining stage to correct and recover some of the mis-pruning behaviors during pruning, achieving the goal of ensuring or even improving the performance of the artificial neural network (for example, an LSTM) after pruning.
Fig. 8 shows a schematic flowchart of adjusting a neural network using a dynamic mask. As shown in Fig. 8, a method of adjusting an artificial neural network includes a pruning step S810, a mask-free retraining step S811, a mask generation step S812 and a masked retraining step S820.

In step S810, n unimportant weights among all N weights of a trained first connection weight matrix are set to zero. The unimportant weights may preferably be the n weights with the smallest absolute values among all N weights. In step S811, the pruned second connection weight matrix is retrained without forcibly constraining any weights to be zero. In step S812, a matrix-shaped mask is generated according to the third connection weight matrix obtained by the mask-free retraining. In step S820, the third connection weight matrix is retrained using the mask matrix. It should be understood that "first", "second" and "third" here are intended merely to distinguish the matrices referred to, not to impose a limitation of any other meaning, nor are they meant to correspond to the network numbering in the formulas. In addition, although step S812 is called a mask generation step, since the pruning step S810 can in one embodiment be understood as a pruning step performed using a pruning mask, step S812 can also be regarded as a mask adjustment step that dynamically adjusts an existing mask (for example, the pruning mask) according to the third weight matrix.
In the adjustment method of Fig. 6, the pruning mask M remains constant. This means that once a weight has been taken as unimportant and cut off, it can never be recovered, even if it would become very large after retraining (indicating that the pruned connection is actually important). To correct this problem, that is, to recover from the mis-pruning of those elements that prove important after retraining, the mask must be adjusted dynamically (to distinguish it from the dynamically adjusted mask, the pruning mask M is hereinafter denoted M_0):

nnet_o1 → M_1 (dynamically adjust M_0 according to the retrained network to obtain M_1)
In other words, unlike the direct masked retraining performed using the pruning mask M, the method of Fig. 8 dynamically adjusts the mask used for masked retraining according to the result of the preceding mask-free retraining.
With this dynamic adjustment added, the compression process becomes:

(a) nnet_0 → M_0 (prune the input network to obtain the pruning mask M_0)
(b) nnet_i = M_0 ⊙ nnet_0 (element-wise product of the input network and the mask, completing the pruning)
(c) nnet_o1 = R_no_mask(nnet_i) (retrain the pruned network without the mask to obtain nnet_o1)
(d) nnet_o1 → M_1 (dynamically adjust M_0 according to nnet_o1 to obtain M_1)
(e) nnet_i2 = M_1 ⊙ nnet_o1 (element-wise product of nnet_o1 with M_1, removing the regrown weights at the positions that the adjusted mask still prunes)
(f) nnet_o2 = R_mask(nnet_i2, M_1) (retrain the masked network with the mask to obtain the compressed network)
In engineering practice, the above mask-free retraining step, mask generation step and masked retraining step can be repeated until an optimized solution of the connection weight matrix is obtained. That is, operations (c)-(f) can be repeated (as shown by the arrow in Fig. 9 returning from step 920 to step 911), so that the network converges to a better optimized solution, such as a better local optimum. Each such repetition requires readjusting the matrix mask, i.e. generating M_{k+1} from M_k. In addition, out of consideration for time cost and the like in real engineering, the retraining sometimes does not converge to a local optimum but stops at an optimized solution that meets a predetermined accuracy requirement.
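By way of illustration only, the repeated (c)-(f) loop can be sketched as follows, reusing the toy retrain_no_mask and retrain_with_mask from the sketches above; adjust_mask is a placeholder for either of the adjustment rules sketched further below:

```python
def compress_with_dynamic_mask(nnet0, grad_fn, adjust_mask, rounds=3):
    """Steps (a)-(b) once, then steps (c)-(f) repeated for several rounds."""
    M = (np.abs(nnet0) > 0.5).astype(float)       # (a) initial pruning mask M_0
    net = M * nnet0                               # (b) apply the mask
    for _ in range(rounds):
        net = retrain_no_mask(net, grad_fn)       # (c) pruned weights may regrow
        M = adjust_mask(net, M)                   # (d) M_k -> M_{k+1}
        net = M * net                             # (e) remove still-pruned weights
        net = retrain_with_mask(net, M, grad_fn)  # (f) masked retraining
    return net, M
```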
The above iteration can also be carried out over a smaller range. For example, the mask-free retraining computation can be performed several times, until an optimized mask-free retraining solution is obtained, before proceeding to the mask dynamic adjustment step. Likewise, the masked retraining can be computed several times with the same dynamically adjusted mask, until an optimized masked retraining solution is obtained. Such multiple mask-free retraining computations and multiple masked retraining computations can be regarded as contained within a single mask-free retraining step and a single masked retraining step, respectively.
In one embodiment, the mask generation step may set to zero the values of the mask at the positions corresponding to the n weights with the smallest absolute values in the third connection weight matrix.
An example of this dynamic adjustment is now given intuitively with formulas. Let mat be any weight matrix of the retrained network and m_k its corresponding mask matrix in M_k, and let the number of 0-valued elements in m_k be n. Generate an all-ones matrix M'_{k+1} of the same size as m_k, sort all elements of mat from large to small in absolute value, and set to 0 the elements of M'_{k+1} at the positions corresponding to the n smallest elements; this yields M_{k+1}. This dynamic adjustment process transfers the 0 elements of the mask to wherever the weights of the network are now smallest, while keeping the compression ratio unchanged. The method in effect swaps the weights outside the mask that have become larger with the weights inside the mask that have become smaller.
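By way of illustration only, this constant-compression-ratio adjustment can be sketched as follows under the notation above:

```python
import numpy as np

def adjust_mask_fixed_ratio(mat: np.ndarray, m_k: np.ndarray) -> np.ndarray:
    """Regenerate the 0-1 mask from the retrained matrix `mat`, preserving the
    number n of zeros of m_k: the n smallest-magnitude elements are pruned."""
    n = int((m_k == 0).sum())                       # zeros to carry over
    m_next = np.ones_like(m_k)                      # all-ones matrix M'_{k+1}
    if n > 0:
        order = np.argsort(np.abs(mat), axis=None)  # ascending by magnitude
        m_next.ravel()[order[:n]] = 0               # zero the n smallest elements
    return m_next                                   # this is M_{k+1}
```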
Fig. 10 shows an example of dynamically adjusting the mask while keeping the compression ratio unchanged. In the figure, the grey shading indicates positions covered by the mask, while the grid and diagonal-line backgrounds indicate positions where the mask has been adjusted. As shown, the middle element was cut off at the start because its weight (0.3) was small, but it became large (0.7) after retraining, indicating that the connection it represents is important. The lower-middle element was retained at the start but became small (0.3) after retraining, indicating that the connection is actually unimportant. With the above optimization strategy, the middle element can now be recovered and the lower-middle element cut off.
The above mask can also be adjusted dynamically in other ways. Academia currently has no widely accepted rule for choosing the compression ratio, so a coarse-grained selection method such as a parameter sweep is typically used. This means that the chosen compression ratio is not necessarily suitable in itself, and the compression ratio can therefore preferably be fine-tuned during the retraining link. Specifically, if after retraining a considerable portion of the elements of the neural network weight matrix corresponding to the 1 values of the mask matrix differ very little in magnitude from a considerable portion of the elements corresponding to the 0 values of the mask matrix, this means that the importance of the pruned weights is basically the same as that of the unpruned ones, and these weights should not be pruned again.
Accordingly, in one embodiment, the mask generation step sets to zero the values of the mask at the positions corresponding to the weights of the third connection weight matrix whose absolute values are below a weight threshold. The weight threshold may be predetermined based on empirical values, or it may be derived from the element values of the third weight matrix. Preferably, with all N weights of the third connection weight matrix sorted from large to small by absolute value, the weight threshold may be the average of the first (N-n) weights in the sorted sequence, i.e. the average of all elements other than the n elements to be set to zero. The weight threshold may also be the average of the n weights starting from the (N-2n+1)-th weight in the sorted sequence, i.e. the average of the n smallest elements other than the n elements to be set to zero. The above threshold may also be a value based on the above average, for example a value comparable or proportional to the average.
An example of this dynamic adjustment is now given intuitively with formulas. Let mat be any weight matrix of the retrained network and m_k its corresponding mask matrix in M_k; let the number of 0-valued elements in m_k be n and the total number of elements be N. Generate a matrix M'_{k+1} identical to m_k, and let the step size for choosing the compression ratio be ε. Sort from large to small all elements of mat corresponding to 1-valued elements of m_k, take the ε·N smallest of them, x_i (i = 1, 2, ..., ε·N), and compute the average of their magnitudes, x_avg = (|x_1| + ... + |x_{ε·N}|) / (ε·N). For every element x'_i of mat satisfying ||x'_i| - x_avg| < σ, set the corresponding position of M'_{k+1} to 1; this yields M_{k+1}. The method in effect recovers those weights whose magnitudes are close to x_avg.
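By way of illustration only, this threshold-based recovery can be sketched as follows under the notation above; the values of ε and σ are illustrative:

```python
import numpy as np

def adjust_mask_recover(mat: np.ndarray, m_k: np.ndarray,
                        eps: float = 0.1, sigma: float = 0.05) -> np.ndarray:
    """Recover mask positions whose weight magnitude is within sigma of the
    average magnitude x_avg of the eps*N smallest surviving weights."""
    kept = np.sort(np.abs(mat[m_k == 1]))            # surviving weights, ascending
    k = max(1, int(eps * mat.size))                  # the eps*N smallest of them
    x_avg = kept[:k].mean()
    m_next = m_k.copy()                              # M'_{k+1} starts as a copy of m_k
    m_next[np.abs(np.abs(mat) - x_avg) < sigma] = 1  # recover near-threshold weights
    return m_next                                    # compression ratio may change
```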
Fig. 11 shows an example of dynamically adjusting the mask without keeping the compression ratio unchanged. In the figure, the grey shading indicates positions covered by the mask, while the grid and diagonal-line backgrounds indicate positions where the mask has been adjusted. As shown, the middle element (0.5) was cut off at the start because its weight was small, but after retraining it became large (0.7), the same size as some elements outside the mask, indicating that the connection it represents is important. Since the elements outside the mask have not become very small at this point, the compression ratio should be adjusted: the middle element can be recovered using the above threshold comparison scheme.
Although Fig. 10 and Fig. 11 respectively illustrate the replacement and the recovery of zeroed elements, it should be understood that, according to one of the above dynamic adjustment manners or any combination thereof, the replacement and recovery of zeroed elements may also occur simultaneously, or non-zeroed elements may additionally be deleted. In addition, although the magnitude of a weight (sometimes simply expressed above as the weight's size) is mostly used herein to characterize whether a particular weight element is important, the pruning of the weight matrices and the dynamic adjustment of the mask may also be performed according to other importance rules, and all such variations fall within the scope of the claims of the present invention.
Fig. 12 shows a schematic diagram of an adjusting apparatus capable of carrying out the ANN adjustment scheme of the present invention. The ANN adjusting apparatus 1200 may include a pruning device 1210, a mask-free retraining device 1211, a mask generating device 1212 and a masked retraining device 1220.
The pruning device 1210 may be used to set to zero n unimportant weights among all N weights of a trained first connection weight matrix. The mask-free retraining device 1211 may retrain the pruned second connection weight matrix without forcibly constraining any weights to be zero. The mask generating device 1212 generates a matrix-shaped mask according to the third connection weight matrix obtained by the mask-free retraining. The masked retraining device 1220 may then retrain the third connection weight matrix using the mask matrix.
Similarly, the mask-free retraining device 1211, the mask generating device 1212 and the masked retraining device 1220 may repeat their operations until an optimized solution of the connection weight matrix is obtained. In addition, the mask-free retraining device 1211 and the masked retraining device 1220 may each internally perform multiple retraining computations for a plurality of weight matrices (if any), until an optimized retraining solution is obtained, before proceeding to the action of the next device or ending the retraining.
The mask generating device 1212 may set to zero the values of the mask at the positions corresponding to the n weights with the smallest absolute values in the third connection weight matrix, and may zero elements according to a weight threshold. The weight threshold may be derived from the third connection weight matrix. With the weights of the third connection weight matrix sorted from large to small by absolute value, the weight threshold may be set according to one of the following averages, or be that average itself: the average of the first (N-n) weights in the sorted sequence; or the average of the n weights starting from the (N-2n+1)-th weight in the sorted sequence.
The ANN adjusting apparatus 1200 may also be used to carry out the adjustment schemes according to Figs. 5-6 of the present invention, for example carrying out the scheme of Fig. 5 using the pruning device 1210 and the masked retraining device 1220, or carrying out the scheme shown in Fig. 6 using the pruning device 1210, the mask-free retraining device 1211 and the masked retraining device 1220.
Herein, masked and mask-free retraining (finetune) refer to continuing to train on the basis of existing training, and may be understood as "fine-tuning" rather than retraining (retrain) of the neural network from scratch. The adjustment of the present invention towards a locally optimal solution of the pruned network should obviously be construed as continuing to train on the basis of training already done.
The flowcharts and block diagrams in the accompanying drawings show possible architectures, functions and operations of implementations of systems and methods according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a part of code, which contains one or more executable instructions for realizing the specified logical function. It should also be noted that in some alternative implementations the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be realized by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Various embodiments of the present invention have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application or their improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.