CN109165734A - Matrix local response normalization vectorization implementation method - Google Patents

Matrix local response normalization vectorization implementation method

Info

Publication number
CN109165734A
CN109165734A (application CN201810758431.2A)
Authority
CN
China
Prior art keywords
vector
matrix
processing unit
normalization
local response
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810758431.2A
Other languages
Chinese (zh)
Other versions
CN109165734B (en)
Inventor
陈书明
李斌
陈海燕
扈啸
杨超
张军阳
陈伟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201810758431.2A priority Critical patent/CN109165734B/en
Publication of CN109165734A publication Critical patent/CN109165734A/en
Application granted granted Critical
Publication of CN109165734B publication Critical patent/CN109165734B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)
  • Error Detection And Correction (AREA)

Abstract

The invention discloses a vectorization implementation method for matrix local response normalization. The method comprises: loading the input feature maps into a vector processor and performing block transposition so that, in the multiple blocks of results produced by the convolution computation, the data points along each channel direction are gathered into one column; performing the local response normalization operation on the transposed matrix with the vector processing elements (VPEs) in parallel, each VPE simultaneously completing the computation of one output feature map; and storing each VPE's computation results by columns. The invention realizes vectorization of matrix local response normalization and has the advantages of a simple implementation method, good parallelism and high processing efficiency.

Description

A vectorization implementation method for matrix local response normalization
Technical field
The present invention relates to the field of machine learning techniques based on convolutional neural networks, and more particularly to a vectorization implementation method for matrix local response normalization.
Background technique
With the rise of deep learning, target recognition techniques based on convolutional neural networks have been widely applied in fields such as image recognition, speech recognition and natural language processing. The convolutional neural network is the most widely used neural network model among current deep learning algorithms and also the model with the best recognition accuracy. A convolutional neural network model generally comprises matrix convolution, activation functions, max pooling or average pooling, local response normalization operations, and the like.
Normalization is one of the common modules in convolutional neural network models. The local response normalization (LRN) layer performs a kind of "lateral inhibition" on its input by means of the local response normalization operation: it creates a competition mechanism among the activities of local neurons, so that values with relatively large responses become relatively larger while neurons with smaller feedback are suppressed, which enhances the generalization ability of the model. In neurobiology, lateral inhibition refers to an activated neuron inhibiting its neighboring neurons; the principle of local response normalization is precisely to imitate this inhibition of neighboring neurons by active neurons, i.e., to borrow the idea of lateral inhibition to achieve local suppression. The purpose of the normalization is "inhibition", and this lateral inhibition is particularly useful when the ReLU nonlinear activation function is used. By imitating the lateral inhibition mechanism of biological nervous systems, the LRN layer creates a competition mechanism among the activities of local neurons, so that values with relatively large responses become correspondingly larger, improving the generalization ability of the model. Local response normalization is equivalent to a smoothing process and can improve the recognition rate by 1 to 2%.
The main parameters required by the local response normalization layer are: norm_region: selects whether normalization is performed across adjacent channels or over a spatial region within a channel; the default is across-channel normalization. local_size: has two meanings: (1) for across-channel normalization it indicates the number of channels summed over; (2) for within-channel normalization it indicates the side length of the summation region; the default value is 5. alpha: the scaling factor, with a default value of 1. beta: the exponent term, with a default value of 0.75.
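For reference, the parameters listed above correspond to Caffe-style LRN settings; a minimal sketch of such a configuration in Python is given below (the dictionary form and the specific default values shown are assumptions based on common framework defaults, not text taken from the patent):

```python
# Caffe-style LRN hyperparameters as described above (illustrative sketch only).
lrn_params = {
    "norm_region": "ACROSS_CHANNELS",  # default: normalize across adjacent channels
    "local_size": 5,                   # channels summed over (or window side length within a channel)
    "alpha": 1.0,                      # scaling factor
    "beta": 0.75,                      # exponent term
}
```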
In the across-channel normalization mode, the local region spans adjacent channels but does not extend spatially; in the within-channel normalization mode, the local region extends spatially but is restricted to a single channel. Local response normalization can effectively prevent overfitting: because the data of different channels are no longer mutually independent, it can improve the generalization ability of the model and reduce the need for overfitting-suppression layers such as dropout.
With the continuous emergence of computation-intensive, real-time applications such as the solution of high-density large-scale linear systems, high-definition video encoding and decoding, 4G communication and digital image processing, computer architectures have changed significantly and new architectures keep emerging, such as many-core architectures, heterogeneous multi-core architectures and the vector processor architectures of GPUs. These new architectures integrate multiple processor cores on a single chip, each core containing abundant processing elements, thereby greatly improving the computing performance of the chip. The vector processor is one such new architecture. As shown in Figure 1, it generally includes a vector processing unit (VPU) and a scalar processing unit (SPU); the vector processing unit generally contains multiple parallel vector processing elements (VPEs), which can exchange data through reduction and shuffle operations, and all VPEs execute the same operations in SIMD fashion.
However, the local response normalization operation is both computation-intensive and memory-access-intensive; without a reasonable computation scheme, it is difficult to exploit the computing power of high-performance computing devices. Existing local response normalization implementations usually use non-vectorized methods, i.e., only one pixel of one feature map can be processed at a time, so the computational efficiency is low and such methods cannot exploit the vector processor architecture described above to achieve multi-core parallel operation. In particular, when local response normalization is performed on multiple input feature maps, the computational efficiency is even lower and cannot meet the demands of applications with high real-time requirements.
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the technical problems of the prior art, the present invention provides a vectorization implementation method for matrix local response normalization that is simple to implement, has good parallelism and high processing efficiency, can realize efficient vectorization of matrix local response normalization for multiple input feature maps, and can improve the parallelism of multi-core vector processors and the operating efficiency of the processor.
In order to solve the above technical problems, the technical solution proposed by the present invention is as follows:
A vectorization implementation method for matrix local response normalization, the method comprising: loading the input feature maps into the vector processor and performing block transposition, so that the data points corresponding to the same channel direction in the multiple blocks of results produced by the convolution computation are stored in one column; performing the local response normalization operation on the transposed matrix with the vector processing elements VPE in parallel, each VPE simultaneously completing the computation of one output feature map; and storing each VPE's computation results by columns.
As a further improvement of the present invention, the steps of the method include:
S1. According to the number p of vector processing elements VPE in the vector processor, the number of convolution kernels, the input feature map size, and the number n of weight groups of the convolution whose results serve as the input data, determine the size M × N of the output feature map that each VPE can compute;
S2. Load the input feature maps into the vector processor and distribute them to the VPEs;
S3. The VPEs work in parallel; each VPE performs block transposition on the multiple blocks of results after the convolution computation, obtaining a transposed matrix of size N × M;
S4. Perform local response normalization on the transposed matrix column by column, store the normalized results by rows, and output the normalized result of size M × N;
S5. Repeat steps S3-S4 until the p output feature maps have been computed in parallel.
As a further improvement of the present invention, step S1 specifically comprises: when the input feature map size is W*H*n and the computation amount of the VPEs of a single core in one pass is M*N*p, n is an integer multiple of p; the configuration is such that the output feature map each VPE can compute is M*N, M*N*p does not exceed the on-chip memory capacity of the core, and W and H are divisible by M and N respectively, so that the computation can be completed in an integer number of passes.
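As an illustration of the constraints just described for step S1, the following Python sketch checks a candidate tile size; the function name and the direction of the divisibility test (W divisible by M, H divisible by N) are an interpretation of the machine-translated text, not part of the patent:

```python
def tile_is_valid(M, N, W, H, p, n, core_capacity):
    """Check the step-S1 constraints for a candidate per-VPE output tile M x N."""
    if n % p != 0:                      # n must be an integer multiple of p
        return False
    if M * N * p > core_capacity:       # the per-pass workload must fit in core memory
        return False
    # the W x H map must be covered by an integer number of M x N passes
    return W % M == 0 and H % N == 0
```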
As a further improvement of the present invention, in step S2 the input feature maps are specifically loaded into the off-chip DDR of the vector processor, the input feature map data is evenly distributed to the VPEs, and, if there is extra data that cannot be distributed evenly, the extra data is processed by reusing several of the VPEs.
As a further improvement of the present invention, in step S3 block transposition is performed on the n/p blocks of results after the convolution computation according to the storage locations of the input feature maps, so that finally the N data points along the channel direction are stored in one column.
As a further improvement of the present invention, in step S4 one column of the transposed matrix is taken at a time, and the normalization computation is performed sequentially from top to bottom.
As a further improvement of the present invention, in step S4 the normalization operation is specifically performed using the following formula:

b_{x,y}^{i} = a_{x,y}^{i} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a_{x,y}^{j}\big)^{2} \Big)^{\beta}

where k, n, α and β are hyperparameters, i denotes the output of the i-th kernel at position (x, y) after the activation function ReLU, N is the number of channels, and a is the output of the convolutional layer, a four-dimensional tensor [batch, height, width, channel], where batch is the batch size, height is the image height, width is the image width, and channel is the processed image depth.
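A minimal NumPy sketch of this formula for a single spatial position is given below; the function name and the default hyperparameter values (taken from the later embodiment) are illustrative assumptions, not the patent's kernel code:

```python
import numpy as np

def lrn_at_position(a, i, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """LRN of channel i at one spatial position (x, y).

    a: 1-D array holding the convolution outputs of all N channels at (x, y).
    Returns a[i] / (k + alpha * sum of squares over the n/2 channels on either
    side of channel i, clipped to [0, N-1]) ** beta.
    """
    N = a.shape[0]
    lo = max(0, i - n // 2)
    hi = min(N - 1, i + n // 2)
    denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2)) ** beta
    return a[i] / denom
```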
As a further improvement of the present invention, when the sum of squares of several adjacent elements is computed during the local response normalization of step S4, the result of the current computation is obtained from the result of the previous computation by subtracting the square of the first element of the previous window and adding the square of the last element of the current window.
As a further improvement of the present invention, the input feature maps are specifically the results of pooling or convolution computations.
Compared with the prior art, the advantages of the present invention are as follows:
1) By taking the data produced by the convolution and pooling operations as the input feature map data and performing a transposition operation, the present invention gathers the results of all channels together and stores the results computed simultaneously by the PEs by columns, so that each value in a long vector can be operated on, thereby realizing vectorization of local response normalization. With this vectorized implementation, if the number of VPEs in the processor is p, then pixels of p feature maps can be operated on simultaneously, so that the local response normalization operation in the neural network can be performed in parallel, improving the parallelism and data-processing efficiency of the local response normalization computation.
2) Based on the architectural features of the vector processor and the number and scale of the input feature maps, the above vectorized implementation method of this embodiment determines the optimal implementation scheme and effectively improves the parallel operation of the vector processor. Different input feature maps are handed to different PEs for processing, with no associated operations between PEs at all, which means that as many input feature maps can be computed simultaneously as there are PEs. The implementation of the local normalization computation is therefore simple and convenient to operate, and the instruction-level, data-level and task-level parallelism of the vector processor can be fully exploited, giving full play to the high-performance computing capability of a vector processor with multiple PE arithmetic units.
Detailed description of the invention
Fig. 1 is a schematic diagram of the architecture of a vector processor.
Fig. 2 is a schematic flow diagram of the vectorization implementation method for matrix local response normalization in this embodiment.
Fig. 3 is a schematic diagram of the transposition of the local response normalization input data in this embodiment.
Fig. 4 is a schematic diagram of the column-wise normalization computation on the transposed matrix in this embodiment.
Fig. 5 is a schematic diagram of the local response normalization computation in this embodiment.
Specific embodiment
The invention is further described below with reference to the accompanying drawings and specific preferred embodiments, which, however, do not limit the scope of the invention.
The vectorization implementation method for matrix local response normalization of the present invention comprises: distributing the input feature maps to the vector processing elements VPE in the vector processor, each VPE computing one output feature map in parallel; when each VPE performs its computation, block-transposing the input data so that the data points belonging to each channel direction in the multiple blocks of convolution results are stored in one column; and then performing the local response normalization operation on the transposed matrix column by column.
Convolution and pooling operations may be performed on the data before the local response normalization operation; the vectorization implementation method for matrix local response normalization of the present invention is suitable for performing local response normalization on a vector processor after the convolution and pooling operations. When the convolution operation is performed on a vector processor, multiple cores process data simultaneously, so the results of the multiple channels of one element are stored block-wise in DDR. By taking the data produced by the convolution and pooling operations as the input feature map data and performing a transposition, the present invention gathers the results of all channels together and stores the results computed simultaneously by the PEs by columns, so that each value in a long vector can be operated on, realizing vectorization of local response normalization. With this vectorized implementation, if the number of VPEs in the processor is p, pixels of p feature maps can be operated on simultaneously, so that the local response normalization operation in the neural network is performed in parallel, improving the parallelism and data-processing efficiency of the local response normalization computation.
As shown in Fig. 2, the steps of the vectorization implementation method for matrix local response normalization in this embodiment include:
S1. According to the number p of vector processing elements VPE in the vector processor, the number of convolution kernels, the input feature map size, and the number n of weight groups of the convolution whose results serve as the input data, determine the size M × N of the output feature map that each VPE can compute;
S2. Load the input feature maps into the vector processor and distribute them to the VPEs;
S3. The VPEs work in parallel; each VPE performs block transposition on the multiple blocks of results after the convolution computation, obtaining a transposed matrix of size N × M;
S4. Perform local response normalization on the transposed matrix column by column, store the normalized results by rows, and output the normalized result of size M × N;
S5. Repeat steps S3-S4 until the p output feature maps have been computed in parallel.
Step S1 of this embodiment specifically comprises: when the input feature map size is W*H*n and the computation amount of the VPEs of a single core in one pass is M*N*p, n is an integer multiple of p; the configuration is such that the output feature map each VPE can compute is M*N, M*N*p does not exceed the on-chip memory capacity of the core, and W and H are divisible by M and N respectively, so that the computation can be completed in an integer number of passes. The data to be processed may consist of several images, in which case several feature maps need to be assigned to each core; for a particular problem, the number of cores determines the size of the input feature maps, and the output feature map that each VPE can compute can then be determined according to actual needs from the number p of PEs, the number of convolution kernels, the input feature map size and the number n of weight groups.
For a given vector processor, the number p of VPEs per core is fixed. When a particular neural network is computed on it, the size of every layer is also fixed once the network is determined. The local response normalization operation follows the convolution computation of a convolutional layer, and the number n of weight groups is the number of channels. When the size of the input feature maps, i.e. the size of the local response normalization input data (W*H*n), is determined, the per-pass computation amount of a single core (M*N*p) can be determined from the on-chip memory capacity of the core. When the output feature map M*N that each PE can compute is determined, it must satisfy that M*N*p does not exceed the on-chip memory capacity and that W and H are divisible by M and N respectively, so that the computation can be completed in an integer number of passes.
In this embodiment, in step S2 the input feature maps are loaded into the off-chip DDR of the vector processor, and the input feature map data is evenly distributed to the VPEs so that the multiple cores are fully parallel; if there is extra data that cannot be distributed evenly, the extra data is processed by reusing several of the VPEs.
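A small sketch of one possible even distribution with remainder handling is shown below; the cyclic assignment scheme is an assumption used only for illustration, since the patent states only that extra data is handled by reusing some of the VPEs:

```python
def assign_maps_to_vpes(num_maps, p):
    """Assign feature-map indices to p VPEs.

    Maps are dealt out cyclically, so each VPE gets num_maps // p maps and the
    first num_maps % p VPEs are reused once more for the leftover maps.
    """
    return {vpe: list(range(vpe, num_maps, p)) for vpe in range(p)}

# Example: 18 maps on 16 VPEs -> VPE 0 and VPE 1 each handle one extra map.
print(assign_maps_to_vpes(18, 16))
```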
In this embodiment, in step S3 block transposition is performed on the n/p blocks of results after the convolution computation according to the storage locations of the input feature maps, so that finally the N data points along the channel direction are stored in one column. The characteristic of a vector processor is that its p PEs operate simultaneously: loads and stores are always performed on p data elements at a time, and these p data elements are stored by rows, forming a long vector. The normalization operation follows the convolution, so the results computed by the p PEs after convolution are stored together in sequence, i.e. by rows. Since the p values computed simultaneously all need to be normalized, with row-wise storage the individual values in a long vector cannot be operated on. In this embodiment, by block-transposing the n/p blocks of data, the results computed simultaneously by the p PEs are stored by columns, and vectorized processing of the normalization operation can be realized on the transposed matrix. In a concrete application embodiment, as shown in Figure 3, the input feature maps are transposed into several N × M data blocks, and the results computed simultaneously by the p PEs are stored in one column, e.g. the results computed simultaneously by all PEs form the first column of a block. The feature map size after convolution is M × N, and the transposition can be completed with dedicated instructions.
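The effect of this block transposition can be illustrated with NumPy as follows; the block count and tile sizes are example values (M = 18, N = 64 follow the later embodiment), and np.transpose merely stands in for the processor's dedicated transposition instructions:

```python
import numpy as np

M, N = 18, 64                              # per-VPE output tile size (example values)
num_blocks = 4                             # stands in for n/p; illustrative value
blocks = np.random.rand(num_blocks, M, N)  # convolution results, stored by rows

transposed = blocks.transpose(0, 2, 1)     # each M x N block becomes N x M
# After transposition, column j of a block holds the N channel-direction values
# of output point j, so they can be normalized as one contiguous column.
assert transposed.shape == (num_blocks, N, M)
```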
In this embodiment, when the column-wise local response normalization of step S4 is performed, one column of the transposed matrix is taken at a time, and the normalization computation is carried out sequentially from top to bottom. As shown in Figure 4, in a concrete application embodiment the first column [a11 a12 ... a1n] of the transposed matrix on the left is taken and each element is normalized in turn from top to bottom, yielding the first row of the normalized matrix on the right; the remaining rows are normalized according to the same principle.
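The column-by-column processing of step S4 can be sketched in NumPy as below; this is a scalar reference loop rather than the vectorized kernel, and the hyperparameter defaults are assumptions taken from the values given later in this embodiment:

```python
import numpy as np

def lrn_columns(T, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Column-wise LRN of a transposed N x M block.

    Each column of T holds the N channel values of one output point; the
    normalized column is written back as a row, giving an M x N result,
    matching Figure 4.
    """
    N, M = T.shape
    out = np.empty((M, N))
    for col in range(M):                       # take one column at a time
        a = T[:, col]
        for i in range(N):                     # top to bottom within the column
            lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
            out[col, i] = a[i] / (k + alpha * np.sum(a[lo:hi + 1] ** 2)) ** beta
    return out
```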
As shown in Figure 5, in step S4 of this embodiment the normalization operation is specifically performed using the following formula (1):

b_{x,y}^{i} = a_{x,y}^{i} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a_{x,y}^{j}\big)^{2} \Big)^{\beta}    (1)

where k, n, α and β are hyperparameters, generally set to k = 2, n = 5, α = 1e-4 and β = 0.75; i denotes the output of the i-th kernel at position (x, y) after the activation function ReLU; N is the number of channels; a is the output of the convolutional layer (including the convolution and pooling operations) and is a four-dimensional tensor [batch, height, width, channel], where batch is the batch size (each batch element being one image), height is the image height, width is the image width, and channel is the number of channels, i.e. the processed image depth. a^i(x, y) denotes a position [a, b, c, d] in this output, which can be understood as the point at a certain height and width in a certain channel of a certain image, i.e. the point at height b and width c in the d-th channel of the a-th image. The summation Σ runs along the channel direction, i.e. the squares of the point values are summed along the third (channel) dimension of a: for a given point, the sum covers the n/2 channels in front of it (down to channel 0 at the lowest) and the n/2 channels behind it (up to the last channel at the highest), n+1 points in total. The number of input channels is the size of the channel dimension of the tensor, and the summation direction is likewise along the channel direction.
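For the four-dimensional [batch, height, width, channel] layout described above, the channel-direction summation can be sketched as follows; this plain NumPy loop is used only to make the window bounds explicit and is not the patent's vectorized kernel:

```python
import numpy as np

def lrn_nhwc(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Reference LRN over a [batch, height, width, channel] tensor.

    For each channel i, the squared values of the n/2 channels before and after
    it (clipped to the valid channel range) are summed along the channel axis.
    """
    a = np.asarray(a, dtype=float)
    c = a.shape[-1]
    sq = a ** 2
    out = np.empty_like(a)
    for i in range(c):
        lo, hi = max(0, i - n // 2), min(c - 1, i + n // 2)
        denom = (k + alpha * sq[..., lo:hi + 1].sum(axis=-1)) ** beta
        out[..., i] = a[..., i] / denom
    return out
```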
In the above local response normalization computation, in the across-channel normalization mode the local region spans adjacent channels but does not extend spatially; in the within-channel normalization mode the local region extends spatially but is restricted to a single channel.
The vectorized implementation of matrix local response normalization in this embodiment processes p pixels simultaneously using the above formula (1): each time, the local response normalization operation of formula (1) is applied column by column to the transposed N × M matrix and the normalized results are stored by rows; this is repeated M times, producing a normalization result of size M × N.
In this embodiment, during the local response normalization computation of step S4, when the sum of squares of several adjacent elements is computed, the current result is obtained from the previous result by subtracting the square of the first element of the previous window and adding the square of the last element of the current window. The normalization operation requires the sum of squares of several adjacent elements; to avoid redundant operations, this embodiment does not recompute the sum of squares of all elements for every window, but only subtracts the square of the front-most element from the previously computed sum and then adds the square of the newly entering element, thereby obtaining the required sum of squares. For example, suppose a segment of the input feature map is [a1, a2, a3, a4, a5, a6, a7] and it is to be normalized, with b1 = a1*a1 + a2*a2 + a3*a3 + a4*a4 + a5*a5 and b2 = a2*a2 + a3*a3 + a4*a4 + a5*a5 + a6*a6. When b2 is computed, it suffices to subtract the square of the first element a1 from the computed result b1 and add the square of the last element a6.
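The incremental update can be checked with a few lines of NumPy; the numeric values below are placeholders standing in for a1...a7:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])  # stands in for [a1, ..., a7]

b1 = np.sum(a[0:5] ** 2)            # a1^2 + a2^2 + a3^2 + a4^2 + a5^2
b2_direct = np.sum(a[1:6] ** 2)     # a2^2 + a3^2 + a4^2 + a5^2 + a6^2

# incremental update: reuse b1, drop a1^2, add a6^2
b2 = b1 - a[0] ** 2 + a[5] ** 2
assert np.isclose(b2, b2_direct)
```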
The above vectorized implementation method of this embodiment determines the optimal implementation scheme based on the architectural features of the vector processor and the number and scale of the input feature maps, and effectively improves the parallel operation of the vector processor. Different input feature maps are handed to different PEs for processing, with no associated operations between PEs at all, which means that as many input feature maps can be computed simultaneously as there are PEs; the implementation of the local normalization computation is therefore simple and convenient to operate, and the instruction-level, data-level and task-level parallelism of the vector processor can be fully exploited, giving full play to the high-performance computing capability of a vector processor with multiple PE arithmetic units.
In a concrete application embodiment, the specific process of the vectorization implementation method for matrix local response normalization of the present invention is as follows:
(1) According to the number p of VPEs in the vector processor, the size of the input feature maps and the number n of weight groups of the convolution performed before normalization, determine the output feature map that each PE can compute. Here p = 16, n = 64 and the input feature maps are of size 1152 × 16, so 16 feature maps can be computed simultaneously, and the output feature map size each PE can compute is determined to be 18 × 64.
(2) Load the input feature maps into the DDR of the vector processor.
(3) Perform block transposition according to the storage locations of the input feature maps; the size after transposition is 64 × 18, and the result is placed in AM.
(4) Perform the local response normalization operation column by column on the transposed 64 × 18 matrix.
(5) Repeat step (4) 18 times; after the operation ends, output a computation result of size 18 × 64.
(6) Repeat steps (3)-(5) until the computation of all 16 output feature maps is completed.
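The sizes used in this embodiment can be checked with the small sketch below; it is a consistency check only, and the variable names are not taken from the patent:

```python
p, n = 16, 64          # VPEs per core, weight groups (channels)
M, N = 18, 64          # output feature map tile per PE

assert n % p == 0                      # n is an integer multiple of p
assert M * N == 1152                   # matches the 1152 x 16 input feature map size
column_passes = M                      # step (4) is repeated 18 times per feature map
parallel_maps = p                      # 16 output feature maps computed in parallel
print(column_passes, parallel_maps)    # 18 16
```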
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Any simple modifications, equivalent changes and variations made to the above embodiments in accordance with the technical spirit of the present invention, without departing from the content of the technical solution of the present invention, shall fall within the scope of protection of the technical solution of the present invention.

Claims (9)

1. A vectorization implementation method for matrix local response normalization, characterized in that the method comprises: loading the input feature maps into a vector processor and performing block transposition so that the data points belonging to each channel direction in the multiple blocks of results produced by the convolution computation are stored in one column; performing the local response normalization operation on the transposed matrix with the vector processing elements VPE in parallel, each VPE simultaneously completing the computation of one output feature map; and storing each VPE's computation results by columns.
2. The vectorization implementation method for matrix local response normalization according to claim 1, characterized in that the steps of the method comprise:
S1. According to the number p of vector processing elements VPE in the vector processor, the number of convolution kernels, the input feature map size, and the number n of weight groups of the convolution whose results serve as the input data, determine the size M × N of the output feature map that each VPE can compute;
S2. Load the input feature maps into the vector processor and distribute them to the VPEs;
S3. The VPEs work in parallel; each VPE performs block transposition on the multiple blocks of results after the convolution computation, obtaining a transposed matrix of size N × M;
S4. Perform local response normalization on the transposed matrix column by column, store the normalized results by rows, and output the normalized result of size M × N;
S5. Repeat steps S3-S4 until the p output feature maps have been computed in parallel.
3. The vectorization implementation method for matrix local response normalization according to claim 2, characterized in that step S1 specifically comprises: when the input feature map size is W*H*n and the computation amount of the VPEs of a single core in one pass is M*N*p, n is an integer multiple of p; the configuration is such that the output feature map each VPE computes is M*N, M*N*p does not exceed the on-chip memory capacity of the core, and W and H are divisible by M and N respectively, so that the computation can be completed in an integer number of passes.
4. The vectorization implementation method for matrix local response normalization according to claim 2, characterized in that in step S2 the input feature maps are specifically loaded into the off-chip DDR of the vector processor, the input feature map data is evenly distributed to the vector processing elements VPE, and, if there is extra data that cannot be distributed evenly, the extra data is processed by reusing several of the VPEs.
5. The vectorization implementation method for matrix local response normalization according to claim 2, characterized in that in step S3 block transposition is performed on the n/p blocks of results after the convolution computation according to the storage locations of the input feature maps, so that finally the N data points along the channel direction are stored in one column.
6. The vectorization implementation method for matrix local response normalization according to any one of claims 2 to 5, characterized in that in step S4 one column of the transposed matrix is taken at a time and the normalization computation is performed sequentially from top to bottom.
7. The vectorization implementation method for matrix local response normalization according to any one of claims 2 to 5, characterized in that in step S4 the normalization operation is specifically performed using the following formula:

b_{x,y}^{i} = a_{x,y}^{i} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a_{x,y}^{j}\big)^{2} \Big)^{\beta}

where k, n, α and β are hyperparameters, i denotes the output of the i-th kernel at position (x, y) after the activation function ReLU, N is the number of channels, and a is the output of the convolutional layer, a four-dimensional tensor [batch, height, width, channel], where batch is the batch size, height is the image height, width is the image width, and channel is the processed image depth.
8. The vectorization implementation method for matrix local response normalization according to any one of claims 2 to 5, characterized in that, during the local response normalization computation of step S4, when the sum of squares of several adjacent elements is computed, the result of the current computation is obtained from the result of the previous computation by subtracting the square of the first element of the previous window and adding the square of the last element of the current window.
9. The vectorization implementation method for matrix local response normalization according to any one of claims 1 to 5, characterized in that the input feature maps are specifically the results of pooling or convolution computations.
CN201810758431.2A 2018-07-11 2018-07-11 Matrix local response normalization vectorization implementation method Active CN109165734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810758431.2A CN109165734B (en) 2018-07-11 2018-07-11 Matrix local response normalization vectorization implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810758431.2A CN109165734B (en) 2018-07-11 2018-07-11 Matrix local response normalization vectorization implementation method

Publications (2)

Publication Number Publication Date
CN109165734A (en) 2019-01-08
CN109165734B CN109165734B (en) 2021-04-02

Family

ID=64897593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810758431.2A Active CN109165734B (en) 2018-07-11 2018-07-11 Matrix local response normalization vectorization implementation method

Country Status (1)

Country Link
CN (1) CN109165734B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580450A (en) * 2019-08-12 2019-12-17 西安理工大学 traffic sign identification method based on convolutional neural network
CN110766157A (en) * 2019-10-21 2020-02-07 中国人民解放军国防科技大学 Multi-sample neural network forward propagation vectorization implementation method
CN110796236A (en) * 2019-10-21 2020-02-14 中国人民解放军国防科技大学 Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
WO2021147567A1 (en) * 2020-01-21 2021-07-29 北京希姆计算科技有限公司 Convolutional operation method and chip

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100073383A1 (en) * 2008-09-25 2010-03-25 Sergey Sidorov Cloth simulation pipeline
CN102411773A (en) * 2011-07-28 2012-04-11 中国人民解放军国防科学技术大学 Vector-processor-oriented mean-residual normalized product correlation vectoring method
CN107680092A (en) * 2017-10-12 2018-02-09 中科视拓(北京)科技有限公司 A kind of detection of container lock and method for early warning based on deep learning
CN108205703A (en) * 2017-12-29 2018-06-26 中国人民解放军国防科技大学 Multi-input multi-output matrix average value pooling vectorization implementation method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100073383A1 (en) * 2008-09-25 2010-03-25 Sergey Sidorov Cloth simulation pipeline
CN102411773A (en) * 2011-07-28 2012-04-11 中国人民解放军国防科学技术大学 Vector-processor-oriented mean-residual normalized product correlation vectoring method
CN107680092A (en) * 2017-10-12 2018-02-09 中科视拓(北京)科技有限公司 A kind of detection of container lock and method for early warning based on deep learning
CN108205703A (en) * 2017-12-29 2018-06-26 中国人民解放军国防科技大学 Multi-input multi-output matrix average value pooling vectorization implementation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王占立 (Wang Zhanli): "Design and Implementation of a High-Performance DMA Transfer Scheme for GPDSP-Oriented Scientific Computing", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580450A (en) * 2019-08-12 2019-12-17 西安理工大学 traffic sign identification method based on convolutional neural network
CN110766157A (en) * 2019-10-21 2020-02-07 中国人民解放军国防科技大学 Multi-sample neural network forward propagation vectorization implementation method
CN110796236A (en) * 2019-10-21 2020-02-14 中国人民解放军国防科技大学 Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
CN110766157B (en) * 2019-10-21 2022-03-18 中国人民解放军国防科技大学 Multi-sample neural network forward propagation vectorization implementation method
CN110796236B (en) * 2019-10-21 2022-06-17 中国人民解放军国防科技大学 Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
WO2021147567A1 (en) * 2020-01-21 2021-07-29 北京希姆计算科技有限公司 Convolutional operation method and chip

Also Published As

Publication number Publication date
CN109165734B (en) 2021-04-02

Similar Documents

Publication Publication Date Title
CN109165734A (en) Matrix local response normalization vectorization implementation method
US10394929B2 (en) Adaptive execution engine for convolution computing systems
CN107153873B (en) A kind of two-value convolutional neural networks processor and its application method
CN107578098B (en) Neural network processor based on systolic array
JP2021120871A (en) Efficient data layout for convolutional neural network
CN107239824A (en) Apparatus and method for realizing sparse convolution neutral net accelerator
CN109844738A (en) Arithmetic processing circuit and identifying system
CN109086244A (en) Matrix convolution vectorization implementation method based on vector processor
Cireşan et al. Deep big multilayer perceptrons for digit recognition
CN112084038B (en) Memory allocation method and device of neural network
CN106529668A (en) Operation device and method of accelerating chip which accelerates depth neural network algorithm
CN107301456A (en) Deep neural network multinuclear based on vector processor speeds up to method
Zeng et al. Improving multi-layer spiking neural networks by incorporating brain-inspired rules
CN109165733A (en) Multi-input multi-output matrix maximum pooling vectorization implementation method
CN116113941A (en) Neural network accelerator, acceleration method and device
CN107766292A (en) A kind of Processing with Neural Network method and processing system
CN107203808A (en) A kind of two-value Convole Unit and corresponding two-value convolutional neural networks processor
CN109918204A (en) Data processing system and method
Sommer et al. Efficient hardware acceleration of sparsely active convolutional spiking neural networks
Kang et al. ASIE: An asynchronous SNN inference engine for AER events processing
CN114842542A (en) Facial action unit identification method and device based on self-adaptive attention and space-time correlation
Tsaregorodtsev Parallel implementation of back-propagation neural network software on SMP computers
CN113485845A (en) Multithreading artificial intelligence resource allocation method and device
Dawwd et al. Video based face recognition using convolutional neural network
Sakivama et al. Deep learning on large-scale multicore clusters

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant