CN109165734B - Matrix local response normalization vectorization implementation method - Google Patents

Matrix local response normalization vectorization implementation method

Info

Publication number
CN109165734B
CN109165734B (application CN201810758431.2A)
Authority
CN
China
Prior art keywords
calculation
matrix
processing unit
vector processing
normalization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810758431.2A
Other languages
Chinese (zh)
Other versions
CN109165734A (en)
Inventor
陈书明
李斌
陈海燕
扈啸
杨超
张军阳
陈伟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201810758431.2A
Publication of CN109165734A
Application granted
Publication of CN109165734B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention discloses a vectorization implementation method for matrix local response normalization. An input feature map is loaded into a vector processor and block-transposed so that the data points along each channel direction in the multiple blocks of post-convolution calculation results are gathered into a single column; each vector processing unit VPE then performs the local response normalization operation on the transposed matrix, the VPEs complete the calculation of their respective output feature maps simultaneously, and the calculation results of each VPE are stored back by rows. The invention realizes the vectorization of matrix local response normalization and has the advantages of a simple implementation method, good parallelism, and high processing efficiency.

Description

Matrix local response normalization vectorization implementation method
Technical Field
The invention relates to the technical field of machine learning based on a convolutional neural network, in particular to a vectorization implementation method for matrix local response normalization.
Background
With the rise of deep learning technology, target recognition based on convolutional neural networks is widely used in image recognition, speech recognition, natural language processing, and other fields. The convolutional neural network is the most widely applied model among current deep learning algorithms and also the one with the best recognition rate; a convolutional neural network model generally comprises matrix convolution, activation functions, max-value or average-value pooling, local response normalization, and similar operations.
Normalization is one of the commonly used modules in convolutional neural network models. Local response normalization (LRN) is a mechanism that performs "lateral inhibition" on its input: it creates competition among the activities of local neurons, so that values with larger responses become relatively larger while neurons with smaller feedback are suppressed, thereby enhancing the generalization capability of the model. In neurobiology, lateral inhibition refers to activated neurons suppressing adjacent neurons, and the principle of local response normalization is to mimic this phenomenon of biologically active neurons inhibiting their neighbours; that is, local suppression is achieved by borrowing the idea of lateral inhibition, which is especially useful when the ReLU nonlinear activation function is used. By simulating the lateral-inhibition mechanism of the biological nervous system, the LRN layer creates a competition mechanism for the activity of local neurons, making larger response values correspondingly larger and improving the generalization capability of the model; local response normalization is akin to smoothing and can raise the recognition rate by 1-2%.
The main parameters needed by a local response normalization layer are: norm_region: selects normalization across adjacent channels or over spatial regions within a channel; the default is inter-channel normalization. local_size: has two meanings, (1) for inter-channel normalization it is the number of channels summed over; (2) for intra-channel normalization it is the side length of the summation window; the default value is 5. alpha: a scaling factor, with a default value of 1. beta: the exponent term, with a default value of 0.75.
In the inter-channel normalization mode of local response normalization, the local region ranges over adjacent channels, with no spatial extent; in the intra-channel normalization mode, the local region extends spatially but is confined to a single channel. Local response normalization can effectively prevent overfitting; because data across channels are not independent, it improves the generalization capability of the model and reduces the need for overfitting-control layers such as dropout.
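For concreteness, the following is a minimal Python sketch of the inter-channel mode applied to the channel vector of a single pixel, using the parameter defaults listed above; the function name and the bias term k = 2 (the common AlexNet/Caffe convention) are illustrative assumptions rather than anything mandated by this patent.

    import numpy as np

    def lrn_channel_vector(a, local_size=5, k=2.0, alpha=1.0, beta=0.75):
        """Inter-channel LRN at one spatial position (illustrative sketch).
        a: shape (N,), the responses of all N channels at that position."""
        N = a.shape[0]
        half = local_size // 2
        b = np.empty_like(a, dtype=np.float64)
        for i in range(N):
            lo, hi = max(0, i - half), min(N - 1, i + half)  # clamped channel window
            b[i] = a[i] / (k + alpha * np.sum(a[lo:hi + 1] ** 2)) ** beta
        return b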
With the continual emergence of high-density, real-time applications such as the solution of large dense linear systems, high-definition video encoding and decoding, 4G communication, and digital image processing, computer architectures have changed markedly: new architectures such as GPU many-core architectures, heterogeneous multi-core architectures, and vector processor architectures keep appearing. These architectures integrate multiple processor cores on a single chip, each core containing abundant processing units, which greatly improves the computing performance of the chip. The vector processor is one such new architecture; as shown in fig. 1, it generally comprises a vector processing unit (VPU) and a scalar processing unit (SPU), where the vector processing unit usually contains a number of parallel vector processing elements (VPEs) that can exchange data through reduction and shuffle operations, and all VPEs perform the same operation in SIMD fashion.
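To illustrate the SIMD semantics just described, here is a toy emulation (purely illustrative, with an assumed lane count p = 16) in which the p VPE lanes each apply the same operation to their own slice of a long vector:

    import numpy as np

    p = 16                                     # assumed number of VPE lanes
    long_vector = np.arange(64, dtype=np.float64)

    # Process p elements per step: one element per VPE, same operation on all lanes.
    out = np.empty_like(long_vector)
    for step in range(0, long_vector.size, p):
        lanes = long_vector[step:step + p]     # p data loaded simultaneously
        out[step:step + p] = lanes * lanes     # every VPE performs the same square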
However, the local response normalization operation is both computation-intensive and memory-access-intensive; without a reasonable computing method, its computational advantages are hard to realize even on high-performance computing equipment. In the prior art, local response normalization is usually performed without vectorization, i.e., only one pixel of one map can be operated on at a time, so the computation efficiency is low and the approach cannot exploit the multi-core parallelism of a vector processor architecture.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the above technical problems in the prior art, the invention provides a vectorization implementation method for matrix local response normalization that is simple to implement, offers good parallelism and high processing efficiency, can efficiently vectorize the local response normalization of multiple input feature maps, and can improve the parallelism of a multi-core vector processor and the operation efficiency of the processor.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
a vectorization implementation method for matrix local response normalization comprises the steps of accessing an input characteristic diagram into a vector processor and conducting block transposition, enabling data points corresponding to the same channel direction in a plurality of calculation results after convolution calculation to be stored in a row, conducting local response normalization operation on a reversed matrix by using each vector processing unit VPE, enabling each vector processing unit VPE to complete calculation of each output characteristic diagram at the same time, and enabling calculation results of each vector processing unit VPE to be stored in a row.
As a further improvement of the invention, the method comprises the following steps (a brief code sketch of steps S3-S4 follows the list):
s1, determining the size M multiplied by N of an output characteristic diagram which can be calculated by a vector processing unit VPE according to the number p of the vector processing unit VPE in a vector processor, the number of convolution kernels, the size of an input characteristic diagram and the weight group number N of convolution which is performed on input data obtained by calculating results after convolution;
s2, inputting an input characteristic diagram into a vector processor, and respectively distributing the input characteristic diagram to each vector processing unit VPE;
s3, each vector processing unit VPE carries out parallel processing, and each vector processing unit VPE carries out block transposition on a plurality of blocks of calculation results after convolution calculation to obtain an N multiplied by M transposed matrix;
s4, performing local response normalization calculation on the transposed matrix according to columns, storing the normalized result according to the rows, and outputting the normalized result with the size of M multiplied by N;
s5, repeatedly executing the steps S3-S4 until the calculation of the p output feature maps is completed in parallel.
As a further improvement of the present invention, the specific steps of step S1 are: when the input feature map size is W×H×N, the amount of calculation of the single-core vector processing unit VPE is M×N×p per pass, where N is an integer multiple of p; the configuration is such that the output feature map each vector processing unit VPE can calculate is M×N, M×N×p does not exceed the in-core storage capacity, and W and H are integer multiples of M and N respectively, so that a whole number of passes can be performed.
As a further improvement of the present invention, in step S2 the input feature map is specifically placed into the out-of-core DDR of the vector processor and the input feature map data are distributed evenly to the vector processing units VPE; if there is surplus data that cannot be distributed evenly, the surplus is processed in a further pass by several of the vector processing units VPE.
As a further improvement of the invention: in step S3, the n/p blocks of post-convolution calculation results are specifically block-transposed according to the storage positions of the input feature map, so that the data points along the n channel directions are stored in one column.
As a further improvement of the invention: in step S4, one column of the transposed matrix is taken at a time, and the normalization calculation proceeds sequentially from top to bottom.
As a further improvement of the invention: in step S4, the normalization operation is specifically performed using the following formula:
$$ b^{i}_{x,y} = a^{i}_{x,y} \Big/ \Bigl( k + \alpha \sum_{j=\max(0,\; i-n/2)}^{\min(N-1,\; i+n/2)} \bigl( a^{j}_{x,y} \bigr)^{2} \Bigr)^{\beta} \qquad (1) $$
wherein k, n, α, and β are hyper-parameters; a^i_{x,y} denotes the output of the i-th kernel at position (x, y) after applying the activation function ReLU; N is the number of channels; a is the output result of the convolution layer, a four-dimensional array [batch, height, width, channel], where batch is the number of batches, height is the picture height, width is the picture width, and channel is the depth of the processed picture.
As a further improvement of the invention: in the local response normalization calculation of step S4, when the sum of squares of several adjacent elements is computed, the current result is obtained from the previous result by subtracting the square of the element that leaves the window and adding the square of the element that enters it.
As a further improvement of the invention: the input feature map is specifically a result of pooling or convolution calculation.
Compared with the prior art, the invention has the advantages that:
1) The invention takes the data after the convolution and pooling operations as the input feature map data and performs a transposition that gathers the results of all channels together, so that the results computed simultaneously by the PEs are stored by columns; every value in the long vector can then be operated on, realizing the vectorization of local response normalization.
2) In this vectorization implementation method, an optimal implementation is determined from the architectural features of the vector processor and the number and scale of the input feature maps, which effectively improves the parallel operation of the vector processor. Different input feature maps are handed to different PEs for processing, and there are no correlated operations across PEs, so as many input feature maps can be computed simultaneously as there are PEs. The local normalization calculation is simple to implement and convenient to operate, can fully exploit the instruction-, data-, and task-level parallelism of the vector processor, and gives full play to the high-performance computing capability of a vector processor with multiple PE operation units.
Drawings
FIG. 1 is a schematic diagram of the architecture of a vector processor.
Fig. 2 is a schematic diagram of an implementation flow of the matrix local response normalization vectorization implementation method in this embodiment.
Fig. 3 is a schematic diagram of the principle of transposing the local response normalization input data in this embodiment.
Fig. 4 is a schematic diagram of the principle of the normalization calculation on the transposed matrix in this embodiment.
Fig. 5 is a schematic diagram of an implementation principle of the local response normalization calculation according to the present embodiment.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
The invention discloses a vectorization implementation method for matrix local response normalization: the input feature maps are distributed to the vector processing units VPE of a vector processor; the VPEs calculate the output feature maps in parallel; during calculation each VPE block-transposes its input data so that the data points corresponding to each channel direction in the multiple blocks of post-convolution calculation results are gathered into one column; and after the transposed matrix is obtained, the local response normalization operation is performed on it column by column.
The matrix local response normalization vectorization implementation method is suitable for the local response normalization processing that follows convolution and pooling operations executed on a vector processor. The invention builds on convolution performed on a vector processor: because multiple cores process the data simultaneously, the results of one element's multiple channels are stored in blocks at their positions in the DDR.
As shown in fig. 2, the steps of the matrix local response normalization vectorization implementation method in this embodiment include:
s1, determining the size M multiplied by N of an output characteristic diagram which can be calculated by a vector processing unit VPE according to the number p of the vector processing unit VPE in a vector processor, the number of convolution kernels, the size of an input characteristic diagram and the weight group number N of convolution which is performed on input data obtained by calculating results after convolution;
s2, inputting the input characteristic graph into a vector processor, and respectively distributing the input characteristic graph to each vector processing unit VPE;
s3, each vector processing unit VPE performs parallel processing, and each vector processing unit VPE performs block transposition on a plurality of blocks of calculation results after convolution calculation to obtain an NxM transposed matrix;
s4, carrying out local response normalization calculation on the post-conversion matrix according to columns, storing the normalized result according to the rows, and outputting the normalized result with the size of M multiplied by N;
s5, repeatedly executing the steps S3-S4 until the calculation of the p output feature maps is completed in parallel.
The specific steps of step S1 in this embodiment are: when the input feature map size is W×H×N, the amount of calculation of the single-core vector processing unit VPE is M×N×p per pass, where N is an integer multiple of p; the configuration is such that the output feature map each VPE can calculate is M×N, M×N×p does not exceed the in-core storage capacity, and W and H are integer multiples of M and N respectively, so that a whole number of passes can be performed. If the data to be processed contains several pictures, these maps need to be distributed over the cores; for a specific problem, the number of cores determines the size of the input feature map, and the output feature map that a vector processing unit VPE can calculate can be determined from the actual requirements by the number p of PEs, the number of convolution kernels, the size of the input feature map, and the number n of weight groups.
For a vector processor, the number p of VPEs of a single core is fixed; a given neural network is computed on those VPEs, and once the network is fixed the scale of every layer is fixed as well. The local response normalization operation follows the convolution calculation of a convolution layer, and the number n of weight groups is the number of channels. Once the input feature map, i.e., the size of the local response normalization input data, is determined (W×H×N), the computation volume of a single core (M×N×p) can be determined from the in-core storage capacity, and the output feature map M×N that each PE can calculate is chosen such that M×N×p does not exceed the in-core storage capacity and W and H are integer multiples of M and N respectively, enabling a whole number of passes.
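As a small sketch of the sizing rule just stated (the function name, the core capacity value, and H = 64 are assumptions for illustration):

    def tile_is_valid(M, N, W, H, p, core_capacity):
        """Check the sizing constraints described above: the per-pass workload
        M*N*p must fit in the in-core storage, and W and H must be whole
        multiples of M and N so an integer number of passes covers the input."""
        return (M * N * p <= core_capacity) and (W % M == 0) and (H % N == 0)

    # e.g. the embodiment's numbers with a hypothetical 32K-word core store:
    print(tile_is_valid(M=18, N=64, W=1152, H=64, p=16, core_capacity=32768))  # True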
In this embodiment, in step S2 the input feature map is specifically placed into the out-of-core DDR of the vector processor and the input feature map data are distributed evenly to the vector processing units VPE, ensuring that the cores run fully in parallel; if there is surplus data that cannot be distributed evenly, it is processed in a further pass by several of the vector processing units VPE.
In this embodiment, in step S3 the n/p blocks of post-convolution calculation results are block-transposed according to the storage positions of the input feature map, so that the data points along the n channel directions end up stored in one column. A vector processor is characterized by p PEs operating simultaneously, storing and loading p data at a time, and those p data are stored by rows to form a long vector. The normalization operation follows the convolution, whose results as computed by the p PEs are stored together in sequence, i.e., by rows. Since the normalization calculation has to be performed across all p simultaneously computed values, and row-wise storage does not allow operating on each individual value inside the long vector, this embodiment block-transposes the n/p blocks of data so that the simultaneously computed results are stored by columns; based on the transposed matrix, the normalization operation can then be vectorized. In the embodiment shown in fig. 3, the input feature map is transposed into N×M data blocks and the results of a simultaneous computation are stored in a column; for example, the channel-direction data points [a11, a12, ..., a1n] form the first column of the transposed matrix. After convolution the size of the feature map is M×N, and the transpose operation can be completed with a dedicated instruction.
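One plausible rendering of this block transposition in NumPy follows; the exact storage layout is fixed by the convolution pass on the real processor, so the (M x p)-block assumption and the names here are illustrative only.

    import numpy as np

    def block_transpose(rows, M, n, p):
        """Sketch of the n/p-block transposition: 'rows' holds the convolution
        results stored row-wise as n//p consecutive M x p blocks (p PEs write
        one row of p results at a time). Transposing each block and stacking
        gives an n x M matrix whose columns hold the n channel-direction
        points of one pixel."""
        blocks = rows.reshape(n // p, M, p)            # split the row-wise stream into blocks
        return np.concatenate([b.T for b in blocks])   # each block p x M; stacked: n x M

    M, n, p = 18, 64, 16                               # the embodiment's numbers
    rows = np.arange(M * n, dtype=np.float64).reshape(n // p * M, p)
    t = block_transpose(rows, M, n, p)                 # t.shape == (64, 18)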
In this embodiment, when the local response normalization calculation is performed on the transposed matrix by columns in step S4, one column of the transposed matrix is taken at a time and the normalization calculation proceeds sequentially from top to bottom. As shown in fig. 4, in the specific embodiment the first column [a11, a12, ..., a1n] of the left-hand transposed matrix is taken and all its elements are normalized in order from top to bottom, yielding the first column of the normalized matrix on the right; the remaining columns are normalized on the same principle.
As shown in fig. 5, in step S4 the normalization operation is performed according to the following formula:

$$ b^{i}_{x,y} = a^{i}_{x,y} \Big/ \Bigl( k + \alpha \sum_{j=\max(0,\; i-n/2)}^{\min(N-1,\; i+n/2)} \bigl( a^{j}_{x,y} \bigr)^{2} \Bigr)^{\beta} \qquad (1) $$
wherein k, n, α, and β are hyper-parameters, typically k = 2, n = 5, α = 1e-4, and β = 0.75; a^i_{x,y} denotes the output of the i-th kernel at position (x, y) after applying the activation function ReLU; N is the number of channels; a is the output result of the convolution layer (including the convolution operation and pooling operation), a four-dimensional array [batch, height, width, channel], where batch is the number of batches (each batch here being one picture), height is the picture height, width is the picture width, and channel is the number of channels, i.e., the depth of the processed picture. a^i_{x,y} corresponds to one position [a, b, c, d] in this output, i.e., the point at height b and width c under channel d of the a-th picture. The summation Σ runs along the channel direction, i.e., the squares of the point values are summed along the third (channel) dimension of a: for a given point, the squares of the corresponding points of the n/2 preceding channels (bounded below by channel 0) and the n/2 following channels (bounded above by channel N-1) are accumulated, n+1 points in total. The number of input channels is taken as the number of 3-dimensional matrices, and the direction of superposition is likewise the channel direction.
In the above local response normalization calculation, in the inter-channel normalization mode the local region ranges over adjacent channels without spatial extent, while in the intra-channel normalization mode the local region extends spatially but is confined to an independent channel.
The matrix local response normalization vectorization thus processes p pixel points simultaneously using formula (1): the transposed N×M matrix is normalized column by column with formula (1), the normalized data are stored by rows, and after repeating this M times a normalization result of size M×N is output.
In the present embodiment, in the local response normalization calculation of step S4, when the sum of squares of several adjacent elements is calculated, the current result is obtained from the previous result by subtracting the square of the element that leaves the window and adding the square of the element that enters it. The normalization operation needs the sum of squares of several adjacent elements; to avoid redundant operations, this embodiment does not recompute the full sum of squares every time. For example, if a segment [a1, a2, a3, a4, a5, a6, a7] of the input feature map is to be normalized with a window of five elements, then b1 = a1² + a2² + a3² + a4² + a5², and the next window sum is b2 = b1 − a1² + a6², so only the square of the departing element a1 and that of the arriving element a6 need to be computed.
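A short sketch of this running-window update, applied to the clamped channel window of formula (1); for interior windows it reduces exactly to the subtract-one-square, add-one-square step of the example above (the function name is hypothetical).

    import numpy as np

    def windowed_sq_sums(a, local_size=5):
        """Sum of squares over a sliding channel window, updated incrementally:
        add the square entering the window, subtract the square leaving it."""
        C = a.shape[0]
        half = local_size // 2
        sq = a.astype(np.float64) ** 2
        sums = np.empty(C)
        s = sq[: half + 1].sum()          # window for i = 0 covers channels 0 .. half
        sums[0] = s
        for i in range(1, C):
            if i + half < C:              # square entering the window
                s += sq[i + half]
            if i - half - 1 >= 0:         # square leaving the window
                s -= sq[i - half - 1]
            sums[i] = s
        return sums

    a = np.arange(1.0, 8.0)               # [a1 .. a7] as in the example above
    assert np.allclose(windowed_sq_sums(a),
                       [sum(x * x for x in a[max(0, i - 2): i + 3]) for i in range(7)])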
In a specific application embodiment, the procedure of the matrix local response normalization vectorization implementation method of the invention is as follows:
(1) According to the number p of vector processing units VPE in the vector processor, the size of the input feature map, and the number n of weight groups of the convolution performed before normalization, the output feature map that each PE can calculate is determined. Here p = 16, n = 64, and the input feature map is 1152×16, so 16 maps can be calculated at the same time, and the size of the output feature map each PE can calculate is determined to be 18×64.
(2) The input feature map is placed into the DDR of the vector processor.
(3) Block transposition is performed according to the storage positions of the input feature map; the size of each transposed block is 64×18, and the blocks are placed in the AM.
(4) The transposed 64×18 block is subjected to the local response normalization operation column by column.
(5) Step (4) is repeated 18 times; when the operation finishes, a calculation result of size 18×64 is output.
(6) Steps (3) to (5) are repeated until the calculation of all 16 output feature maps is completed.
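For reference, the shapes in this embodiment fit together as follows; this is a plain arithmetic check using only the numbers from steps (1)-(6).

    p, n = 16, 64               # 16 VPEs; 64 weight groups, i.e. channels
    M, N = 18, 64               # output tile each PE computes
    assert M * n == 1152        # 18 x 64 = 1152, matching the 1152 dimension of the input
    assert n % p == 0           # n/p = 4 blocks are block-transposed per map
    assert (N, M) == (64, 18)   # transposed block placed in the AM, step (3)
    assert M == 18              # step (4) repeated 18 times, one column per pass
    assert (M, N) == (18, 64)   # normalized result per map; 16 maps finish in parallel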
The foregoing is merely a preferred embodiment of the invention and is not to be construed as limiting the invention in any way. Although the invention has been disclosed above by way of preferred embodiments, it is not limited to them. Any simple modification, equivalent change, or adaptation of the above embodiments that relies on the technical substance of the invention and does not depart from the content of its technical scheme shall fall within the protection scope of the technical scheme of the invention.

Claims (8)

1. A matrix local response normalization vectorization implementation method, characterized by loading an input feature map into a vector processor and block-transposing it, so that the data points along each channel direction in the multiple blocks of post-convolution calculation results are gathered into one column; performing the local response normalization operation on the transposed matrix with each vector processing unit VPE, each vector processing unit VPE simultaneously completing the calculation of its output feature map, and the calculation results of each vector processing unit VPE being stored by rows; the method comprises the following steps:
s1, determining the size M multiplied by N of an output characteristic diagram which can be calculated by a vector processing unit VPE according to the number p of the vector processing unit VPE in a vector processor, the number of convolution kernels, the size of an input characteristic diagram and the weight group number N of convolution of input data obtained by calculating a result after the convolution;
s2, inputting an input characteristic diagram into a vector processor, and respectively distributing the input characteristic diagram to each vector processing unit VPE;
s3, each vector processing unit VPE carries out parallel processing, and each vector processing unit VPE carries out block transposition on a plurality of blocks of calculation results after convolution calculation to obtain an N multiplied by M transposed matrix;
s4, performing local response normalization calculation on the transposed matrix according to columns, storing the normalized result according to the rows, and outputting the normalized result with the size of M multiplied by N;
s5, repeatedly executing the steps S3-S4 until the calculation of the p output feature maps is completed in parallel.
2. The matrix local response normalization vectorization implementation method according to claim 1, wherein step S1 specifically comprises: when the input feature map size is W×H×N, the amount of calculation of the single-core vector processing unit VPE is M×N×p per pass, where N is an integer multiple of p; the configuration is such that the output feature map each vector processing unit VPE can calculate is M×N, M×N×p does not exceed the in-core storage capacity, and W and H are integer multiples of M and N respectively, so that a whole number of passes can be performed.
3. The matrix local response normalization vectorization implementation method according to claim 1, wherein: in step S2, the input feature map is specifically input to the out-of-core DDR of the vector processor, the input feature map data are distributed evenly to the vector processing units VPE, and if there is surplus data that cannot be distributed evenly, the surplus data are processed by several of the vector processing units VPE.
4. The matrix local response normalization vectorization implementation method according to claim 1, wherein: in step S3, the n/p blocks of post-convolution calculation results are specifically block-transposed according to the storage positions of the input feature map, so that the data points along the n channel directions are stored in one column.
5. The matrix local response normalization vectorization implementation method according to any one of claims 1 to 4, wherein: in step S4, one column of the transposed matrix is taken at a time, and the normalization calculation proceeds sequentially from top to bottom.
6. The matrix local response normalization vectorization implementation method according to any one of claims 1 to 4, wherein the normalization operation in step S4 is performed using the following formula:

$$ b^{i}_{x,y} = a^{i}_{x,y} \Big/ \Bigl( k + \alpha \sum_{j=\max(0,\; i-n/2)}^{\min(N-1,\; i+n/2)} \bigl( a^{j}_{x,y} \bigr)^{2} \Bigr)^{\beta} $$
wherein k, n, α, and β are hyper-parameters; a^i_{x,y} denotes the output of the i-th kernel at position (x, y) after applying the activation function ReLU; N is the number of channels; a is the output result of the convolution layer, a four-dimensional array [batch, height, width, channel], where batch is the number of batches, height is the picture height, width is the picture width, and channel is the depth of the processed picture.
7. The matrix local response normalization vectorization implementation method according to any one of claims 1 to 4, wherein in step S4, when the sum of squares of several adjacent elements is calculated in the local response normalization calculation, the current calculation result is obtained from the previous calculation result by subtracting the square of the element leaving the window and adding the square of the element entering it.
8. The matrix local response normalization vectorization implementation method according to any one of claims 1 to 4, wherein the input feature map is a result of pooling or convolution calculation.
CN201810758431.2A 2018-07-11 2018-07-11 Matrix local response normalization vectorization implementation method Active CN109165734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810758431.2A CN109165734B (en) 2018-07-11 2018-07-11 Matrix local response normalization vectorization implementation method


Publications (2)

Publication Number Publication Date
CN109165734A CN109165734A (en) 2019-01-08
CN109165734B (en) 2021-04-02

Family

ID=64897593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810758431.2A Active CN109165734B (en) 2018-07-11 2018-07-11 Matrix local response normalization vectorization implementation method

Country Status (1)

Country Link
CN (1) CN109165734B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580450A (en) * 2019-08-12 2019-12-17 西安理工大学 traffic sign identification method based on convolutional neural network
CN110766157B (en) * 2019-10-21 2022-03-18 中国人民解放军国防科技大学 Multi-sample neural network forward propagation vectorization implementation method
CN110796236B (en) * 2019-10-21 2022-06-17 中国人民解放军国防科技大学 Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
CN113222136A (en) * 2020-01-21 2021-08-06 北京希姆计算科技有限公司 Convolution operation method and chip


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100073383A1 (en) * 2008-09-25 2010-03-25 Sergey Sidorov Cloth simulation pipeline

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411773A (en) * 2011-07-28 2012-04-11 中国人民解放军国防科学技术大学 Vector-processor-oriented mean-residual normalized product correlation vectoring method
CN107680092A (en) * 2017-10-12 2018-02-09 中科视拓(北京)科技有限公司 A kind of detection of container lock and method for early warning based on deep learning
CN108205703A (en) * 2017-12-29 2018-06-26 中国人民解放军国防科技大学 Multi-input multi-output matrix average value pooling vectorization implementation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a High-Performance DMA Transfer Scheme for GPDSP Scientific Computing; Wang Zhanli; China Excellent Masters' Theses Full-text Database, Information Science and Technology; 2017-03-15 (No. 03); I137-120 *

Also Published As

Publication number Publication date
CN109165734A (en) 2019-01-08

Similar Documents

Publication Publication Date Title
CN109165734B (en) Matrix local response normalization vectorization implementation method
US11720800B2 (en) Efficient data layouts for convolutional neural networks
CN109543830B (en) Splitting accumulator for convolutional neural network accelerator
CN108205702B (en) Parallel processing method for multi-input multi-output matrix convolution
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN116113941A (en) Neural network accelerator, acceleration method and device
CN110796236B (en) Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
US20210150363A1 (en) Apparatus and method for multi-phase pruning for neural network with multi-sparsity levels
CN111738276A (en) Image processing method, device and equipment based on multi-core convolutional neural network
CN113010213A (en) Simplified instruction set storage and calculation integrated neural network coprocessor based on resistance change memristor
KR20220071723A (en) Method and apparatus for performing deep learning operations
JP7122041B2 (en) Joint Sparsity Method Based on Mixed Granularity Used in Neural Networks
CN113516580B (en) Method and device for improving neural network image processing efficiency and NPU
KR20230104235A (en) Method and system for convolution with workload-balanced activation sparsity
CN113888390A (en) Feature map processing method and device, electronic equipment and computer readable medium
CN110765413A (en) Matrix summation structure and neural network computing platform
KR102548283B1 (en) Convolutional neural network computing device
US20240028869A1 (en) Reconfigurable processing elements for artificial intelligence accelerators and methods for operating the same
Fazlali et al. GPU-based Parallel Technique for Solving the N-Similarity Problem in Textual Data Mining
DE202023104860U1 (en) Apparatus for matrix calculation using data conversion in a computing accelerator
Kondo et al. Accelerating Convolution Neural Networks in Embedded Intel GPUs
DE202023106035U1 (en) Device for compressing weight blocks in neural networks in a computing accelerator
US20200104669A1 (en) Methods and Apparatus for Constructing Digital Circuits for Performing Matrix Operations
KR20240025827A (en) In memory computing(imc) processor and operating method of imc processor
York _. Connection Machine Implementation of the Boundary Contour System

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant