CN106991472A - Vectorized implementation method fusing the ReLU activation function and max pooling - Google Patents

Vectorized implementation method fusing the ReLU activation function and max pooling

Info

Publication number
CN106991472A
CN106991472A CN201710201376.2A
Authority
CN
China
Prior art keywords
maximum
pooling
matrix
ReLU activation
max pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710201376.2A
Other languages
Chinese (zh)
Inventor
郭阳
张军阳
扈啸
王慧丽
胡敏慧
王子聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201710201376.2A priority Critical patent/CN106991472A/en
Publication of CN106991472A publication Critical patent/CN106991472A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a vectorized implementation method fusing the ReLU activation function and max pooling, the steps of which are: S1: compute the ReLU activation values of matrix A; S2: compute the max pooling of the matrix processed by the ReLU activation function in step S1; S3: repeat steps S1 and S2 until all sub-blocks of matrix A have been traversed, finally completing the ReLU activation processing and max pooling of the whole matrix A. The invention has the advantages of being simple in principle and convenient to implement, and can fully exploit the parallel computing capability of a vector processor and the parallelism of the algorithm.

Description

Vectorized implementation method fusing the ReLU activation function and max pooling
Technical field
The present invention relates generally to the technical field of convolutional neural networks, and in particular to a vectorized implementation method fusing the ReLU activation function and max pooling.
Background technology
In the 1960s, while studying neurons responsible for local sensitivity and direction selection in the cat visual cortex, Hubel and Wiesel found that their unique network structure could effectively reduce the complexity of feedback neural networks, and subsequently proposed the convolutional neural network (Convolutional Neural Network, CNN). Convolutional neural networks have since become one of the research hotspots in many fields, particularly pattern classification: because the network avoids complex image pre-processing and can take raw images directly as input, it has found increasingly wide application.
Typically, a convolutional neural network computation model comprises convolutional layers, pooling layers, fully connected layers, and a subsequent classifier such as a support vector machine (Support Vector Machine, SVM). The main types of computation involved in a convolutional neural network model are: matrix convolution; activation function processing, for example the linear activation function f(x) = x or nonlinear activation functions such as the sigmoid f(x) = 1/(1 + e^(-x)); and matrix pooling operations, including max pooling and average pooling. Finally, matrix operations and some transcendental-function processing produce the prediction output of the convolutional neural network model, completing the object recognition process. Because a convolutional neural network model alternates and iterates over different convolutional and pooling layers, its computational load is enormous, and how to accelerate the computation of such models is therefore an important research topic in both academia and industry.
The activation functions used in current convolutional neural network models fall mainly into two broad classes, linear and nonlinear, numbering roughly a dozen in all. The rectified linear unit (Rectified Linear Units, ReLU) is one of the most common; its mathematical expression is f(x) = max(0, x): when the input signal x is less than 0 the output is 0, and when it is greater than 0 the output equals the input. The outstanding advantages of the ReLU function are one-sided suppression and, relative to other activation functions, a wide excitation boundary and sparse activation. Neuroscientists have likewise observed the sparse activity of neurons: in 2001, based on observations of cerebral energy consumption, Attwell et al. inferred that neural coding is sparse and distributed, and in 2003 Lennie et al. estimated that only about 1-4% of the neurons in the brain are activated at any one time, further demonstrating the sparseness of neural activity. In signal terms, a neuron responds selectively to only a small part of its input signals; deliberately shielding a large number of signals improves learning precision and extracts sparse features better and faster. From this sparseness perspective, the ReLU function approximates the best model of the human neuron.
In a convolutional neural network model, after the image data has been processed by the activation function, the next stage of computation, the pooling operation, must be carried out. Pooling mainly comprises max pooling and average pooling: max pooling takes the maximum value within the pooling window as that window's output, while average pooling takes the average of all elements within the pooling window as its output. Whether average or max pooling, the purpose is to reduce the dimensionality of the image matrix as far as possible without significantly affecting model recognition accuracy, cutting the computational load and also guarding against over-fitting.
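As a worked instance (values chosen purely for illustration): for a 2 × 2 pooling window containing the values (1, -3, 2, 0), max pooling outputs max(1, -3, 2, 0) = 2, while average pooling outputs (1 - 3 + 2 + 0) / 4 = 0.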
Convolutional neural networks are one of the computing modules commonly used in current high-performance computing. They are typical memory-access-intensive and compute-intensive applications, placing very high demands on a processor's computing units and memory bandwidth, and their computational complexity is very high. Current mainstream acceleration platforms include convolutional neural network computing platforms based on GPUs, platforms based on FPGAs, platforms based on dedicated neural network accelerators, and acceleration of convolutional neural network models on general-purpose CPUs or vector processors. A vector processor generally comprises a vector processing unit (Vector Processing Unit, VPU) and a scalar processing unit (Scalar Processing Unit, SPU). The vector processing unit consists of a computing array of several vector processing elements (Vector Processing Element, VPE) and is mainly responsible for vector computation; each VPE contains multiple homogeneous functional units such as MAC0, MAC1, an ALU, and a bit-processing (BP) unit. The scalar processing unit is mainly responsible for scalar computation and flow control, and the VPU and SPU can transfer and exchange data over data channels. A vector data access unit supports Load and Store of vector data and provides a large-capacity dedicated vector memory.
Content of the invention
The technical problem to be solved by the present invention is: in view of the technical problems of the prior art, to provide a vectorized implementation method fusing the ReLU activation function and max pooling that is simple in principle, convenient to implement, and able to fully exploit the parallel computing capability of a vector processor and the parallelism of the algorithm; that is, by fusing the ReLU activation function with the max pooling operation, the method reduces the amount of data memory access, thereby shortening the computation time of the convolutional neural network and improving the computational efficiency of the convolutional neural network model.
In order to solve the above technical problems, the present invention adopts the following technical scheme:
A vectorized implementation method fusing the ReLU activation function and max pooling, the steps of which are (a reference sketch in C follows these steps):
S1: compute the ReLU activation values of matrix A;
S2: compute the max pooling of the matrix processed by the ReLU activation function in step S1;
S3: repeat steps S1 and S2 until all sub-blocks of matrix A have been traversed, finally completing the ReLU activation processing and max pooling of the whole matrix A.
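For concreteness, the following is a minimal scalar C sketch of this fused flow, not the patented vector-instruction sequence: it applies f(x) = max(0, x) and takes the window maximum in a single pass over each kx × ky sub-block, so the activation result is never stored to memory. The function name fused_relu_maxpool, the fixed sizes M, N, K, and the sample matrix are illustrative assumptions.

#include <stdio.h>

#define M 4            /* matrix rows (assumed multiple of K) */
#define N 4            /* matrix cols (assumed multiple of K) */
#define K 2            /* pooling window size, kx = ky = K    */

/* Fused ReLU + max pooling in a single pass: for each K x K window,
 * apply f(x) = max(0, x) to every element and keep the window maximum,
 * so the ReLU result is never stored to memory between the two stages. */
static void fused_relu_maxpool(const float a[M][N], float c[M / K][N / K])
{
    for (int bi = 0; bi < M; bi += K) {
        for (int bj = 0; bj < N; bj += K) {
            float best = 0.0f;                   /* ReLU floor: max(0, x) >= 0 */
            for (int i = 0; i < K; i++) {
                for (int j = 0; j < K; j++) {
                    float v = a[bi + i][bj + j];
                    float r = v > 0.0f ? v : 0.0f;   /* ReLU, step S1     */
                    if (r > best)
                        best = r;                    /* max pool, step S2 */
                }
            }
            c[bi / K][bj / K] = best;
        }
    }
}

int main(void)
{
    const float a[M][N] = {
        { 1.0f, -3.0f,  2.0f,  0.5f },
        { 2.0f,  0.0f, -1.0f,  4.0f },
        {-2.0f,  5.0f,  0.0f, -0.5f },
        { 3.0f, -4.0f,  1.0f,  2.0f },
    };
    float c[M / K][N / K];

    fused_relu_maxpool(a, c);                /* step S3 traverses all sub-blocks */
    for (int i = 0; i < M / K; i++)
        for (int j = 0; j < N / K; j++)
            printf("c[%d][%d] = %.1f\n", i, j, c[i][j]);
    return 0;
}

Because max(0, x) >= 0 for every x, initializing the running maximum to 0 folds the ReLU into the pooling reduction itself; this is the source of the STORE/LOAD traffic the invention saves.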
As a further improvement of the present invention, the specific steps of step S1 are:
S1.1: let the matrix requiring activation processing after the convolution operation be A(M, N), the ReLU activation function be f(x) = max(0, x), and the number of vector processing elements (VPEs) be p; N is taken to be an integral multiple of p and of kx, ky, where the max pooling window is kx × ky;
S1.2: use the vector VLOAD instruction to load the first row of elements of matrix A;
S1.3: use the vector compare instruction VFCMPGD to compare the vector registers, placing the logical result of the comparison in the condition register;
S1.4: use the conditional vector move instruction VMOV to move the values greater than 0 from step S1.3 into a vector register;
S1.5: obtain the result after ReLU activation processing;
S1.6: according to the max pooling window size k, repeat steps S1.2 to S1.5 k times to obtain the ReLU activation results of k rows of matrix A; the results are kept in vector registers and serve directly as the input values for max pooling in step S2 (see the sketch after this list).
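The steps above target a proprietary vector ISA (VLOAD, VFCMPGD, conditional VMOV), so they cannot be reproduced verbatim here. As a sketch under that caveat, the compare-then-conditional-move pattern of steps S1.2 to S1.5 can be modeled in portable C, with a lane loop standing in for the p parallel VPEs; P, the function name relu_row, and the register names in the comments are illustrative assumptions.

#define P 16                                 /* assumed number of VPE lanes */

/* Model of S1.2-S1.5: VR20 starts at 0 (VMOVI 0, VR20); VFCMPGD sets a
 * per-lane condition flag where VR10[i] > VR20[i]; the conditional VMOV
 * copies VR10[i] into VR20[i] only where the flag is set, leaving 0
 * elsewhere, i.e. VR20[i] = max(0, VR10[i]) across all p lanes at once. */
void relu_row(const float vr10[P], float vr20[P])
{
    int vr0[P];                              /* condition register VR0   */
    for (int i = 0; i < P; i++)
        vr20[i] = 0.0f;                      /* VMOVI 0, VR20            */
    for (int i = 0; i < P; i++)
        vr0[i] = (vr10[i] > vr20[i]);        /* VFCMPGD VR10, VR20, VR0  */
    for (int i = 0; i < P; i++)
        if (vr0[i])
            vr20[i] = vr10[i];               /* [VR0] VMOV VR10, VR20    */
}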
As a further improvement of the present invention, the specific steps of step S2 are:
S2.1: take the k rows of elements computed in step S1.6 directly as the input of this computation;
S2.2: compare the elements of row 1 with those of row 2, placing the logical result of the comparison in the condition register;
S2.3: use the conditional vector move instruction VMOV;
S2.4: obtain, for each element position, the maximum over the k rows by performing k-1 comparisons;
S2.5: configure the shuffle pattern and, by comparison, obtain the maxima over the corresponding k columns of the results of step S2.4;
S2.6: finally obtain, simultaneously, p/k max pooling results for pooling windows of size kx × ky (a sketch of this reduction follows).
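Likewise as an illustrative model only (the shuffle configuration of step S2.5 is hardware-specific), the reduction of steps S2.2 to S2.6 can be sketched in C for a square k × k window: the k rows of ReLU results are first reduced element-wise with k-1 compare/conditional-move passes, then each run of k adjacent lanes, the role played by the configured shuffle, is reduced to a single pooled output, yielding p/k results at once. maxpool_rows and P are assumptions carried over from the previous sketch.

#define P 16                                 /* assumed number of VPE lanes */

/* Model of S2.2-S2.6 for a k x k window: rows[0..k-1] hold k rows of
 * ReLU results (one element per lane). First reduce across rows with
 * k-1 element-wise max passes (compare + conditional move), then reduce
 * each run of k adjacent lanes (the shuffle's job), giving p/k outputs. */
void maxpool_rows(float rows[][P], int k, float out[/* P/k */])
{
    float acc[P];
    for (int i = 0; i < P; i++)
        acc[i] = rows[0][i];
    for (int r = 1; r < k; r++)              /* k-1 row comparisons       */
        for (int i = 0; i < P; i++)
            if (rows[r][i] > acc[i])
                acc[i] = rows[r][i];         /* conditional VMOV          */
    for (int w = 0; w < P / k; w++) {        /* column (shuffle) reduction */
        float best = acc[w * k];
        for (int j = 1; j < k; j++)
            if (acc[w * k + j] > best)
                best = acc[w * k + j];
        out[w] = best;
    }
}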
As a further improvement of the present invention, the formula for computing a max pooling result c_{0,0} in step S2.5 is:
$$c_{0,0} = \max_{0 \le i \le k_x - 1}\left(\max_{0 \le j \le k_y - 1} a_{i,j}\right)$$
where c_{0,0} is the first element of the max pooling result matrix, kx and ky give the size of the pooling window (in convolutional neural networks the pooling window is square, i.e. kx = ky = k), and a_{i,j} is an element of the matrix A to be max pooled.
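For example, with a square 2 × 2 window (kx = ky = 2) the formula expands to c_{0,0} = max(a_{0,0}, a_{0,1}, a_{1,0}, a_{1,1}), i.e. the largest of the four elements under the window.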
As a further improvement of the present invention, in the above steps the pooling window size is defined as sizeX, sizeY, and the horizontal or vertical displacement between two adjacent pooling windows is stride; in the max pooling operation the pooling windows do not overlap, i.e. sizeX = sizeY = stride.
Compared with the prior art, the advantage of the present invention is that the vectorized implementation method fusing the ReLU activation function and max pooling fuses the ReLU activation operation and the max pooling computation into a single computation flow, avoiding the time-consuming STORE and LOAD of intermediate results. At the same time, it makes full use of the fact that the multiple parallel processing elements of the vector unit in a vector processor can carry out the same operation simultaneously, performing large numbers of operations of the same type at once and thereby greatly improving the computational efficiency of the convolutional neural network model. The steps are simple and easy to implement.
Brief description of the drawings
Fig. 1 is a schematic flow diagram of the method of the present invention.
Fig. 2 is a schematic diagram of the general structure of a vector processor.
Fig. 3 is a schematic diagram of 2 × 2 max pooling in a concrete application example of the present invention.
Fig. 4 is a schematic image of the ReLU activation function used in a concrete application example of the present invention.
Fig. 5 is a schematic diagram of the vectorized implementation flow of the ReLU activation function in a concrete application example of the present invention.
Fig. 6 is a schematic diagram of the vectorized implementation flow of 2 × 2 max pooling in a concrete application example of the present invention.
Fig. 7 is a schematic diagram of non-overlapping pooling windows in the max pooling operation in a concrete application example of the present invention.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1 and Fig. 4, the steps of the vectorized implementation method fusing the ReLU activation function and max pooling of the present invention are:
S1: compute the ReLU activation values of matrix A;
S1.1: let the matrix requiring activation processing after the convolution operation be A(M, N), the ReLU activation function be f(x) = max(0, x), and the number of VPEs be p; N is typically taken to be an integral multiple of p and of kx, ky, with the max pooling window kx × ky;
S1.2: use the vector VLOAD instruction to load the first row of elements of matrix A, for example into vector register VR10, and use the VMOVI instruction to initialize a vector register VR20 to 0, i.e., VMOVI 0, VR20;
S1.3: use the vector compare instruction VFCMPGD to compare vector registers VR10 and VR20, placing the logical result of the comparison in a condition register such as VR0: VFCMPGD VR10, VR20, VR0; if VR10[i] > VR20[i], 1 ≤ i ≤ p, then VR0[i] = 1, otherwise VR0[i] = 0;
S1.4: use the conditional vector move instruction VMOV to move the values greater than 0 from step S1.3 into a vector register. The instruction is: [VR0] VMOV VR10, VR20. This single conditional vector instruction computes the ReLU activation values of p elements simultaneously: values in VR10 greater than 0 are placed in VR20, and values less than 0 are left as 0;
S1.5: obtain the result of ReLU activation processing in VR20;
S1.6: according to the max pooling window size k, repeat steps S1.2 to S1.5 k times to obtain the ReLU activation results of k rows of matrix A; the results are kept in vector registers and need not be stored to memory, serving directly as the input values for max pooling in step S2.
S2: compute the max pooling of the matrix processed by the ReLU activation function in step S1;
S2.1: take the k rows of elements computed in step S1.6; because the results of step S1.6 are kept directly in registers, they serve directly as the input of this computation. This avoids both the data-store time of step S1.6 and the data-LOAD time of step S2.2, reducing the computation time accordingly;
S2.2: compare the elements of row 1 with those of row 2, placing the logical result of the comparison in a condition register such as VR1: VFCMPGD VR20, VR21, VR1; if VR20[i] > VR21[i], 1 ≤ i ≤ p, then VR1[i] = 1, otherwise VR1[i] = 0;
S2.3: use the conditional vector move instruction VMOV to assign the values VR20[i] in the VPEs for which the condition register from step S2.2 holds VR1[i] = 1 to the corresponding VR21[i]; values in VR21[i] larger than VR20[i] remain unchanged;
S2.4: obtain, for each element position, the maximum over the k rows by performing k-1 comparisons;
S2.5: configure the shuffle pattern and, by comparison, obtain the maxima over the corresponding k columns of the results of step S2.4;
S2.6: finally obtain, simultaneously, p/k max pooling results for pooling windows of size kx × ky;
S3: repeat steps S1 and S2 until all sub-blocks of matrix A have been traversed, finally completing the ReLU activation processing and max pooling of the whole matrix A.
The present invention is mainly applicable to vector processors; Fig. 2 is a schematic diagram of the general structure of a vector processor. In a concrete application example, the formula for computing a max pooling result c_{0,0} in step S2.5 is:
$$c_{0,0} = \max_{0 \le i \le k_x - 1}\left(\max_{0 \le j \le k_y - 1} a_{i,j}\right)$$
where c_{0,0} is the first element of the max pooling result matrix, kx and ky give the size of the pooling window (in convolutional neural networks the pooling window is generally square, i.e. kx = ky = k), and a_{i,j} is an element of the matrix A to be max pooled; the max pooling flow is shown schematically in Fig. 3.
In a concrete application example, the pooling window size defined in the above steps is sizeX, sizeY, and the horizontal or vertical displacement between two adjacent pooling windows is stride; in the max pooling operation the pooling windows do not overlap, i.e. sizeX = sizeY = stride, as shown in Fig. 7. For example, pooling a 16 × 16 matrix with 2 × 2 windows and stride 2 yields an 8 × 8 result matrix.
As shown in Fig. 5 and Fig. 6, in a concrete application example of the present invention, the detailed steps are:
S100: compute the ReLU activation values of matrix A;
S1.1: let the matrix requiring activation processing after the convolution operation be A(16, 16), the ReLU activation function be f(x) = max(0, x), the number of VPEs p be 16, and the max pooling window kx = ky = 2;
S1.2: use vector VLOAD instructions to load the 16 elements of row 1 of matrix A into vector register VR10 and the 16 elements of row 2 into VR11, and use the vector move instruction VMOVI to initialize two vector registers VR20 and VR21 to 0, i.e., VMOVI 0, VR20 and VMOVI 0, VR21;
S1.3: use the vector compare instruction VFCMPGD to compare vector registers VR10 with VR20 and VR11 with VR21, placing the logical results in condition registers VR0 and VR1 respectively: VFCMPGD VR10, VR20, VR0 and VFCMPGD VR11, VR21, VR1; if VR10[i] > VR20[i] (1 ≤ i ≤ 16), then VR0[i] = 1, otherwise VR0[i] = 0; similarly VR1[i] = 1 if VR11[i] > VR21[i], otherwise VR1[i] = 0;
S1.4: use the conditional vector move instruction VMOV to move the values greater than or equal to 0 from step S1.3 into vector registers. The instructions are: [VR0] VMOV VR10, VR20 and [VR1] VMOV VR11, VR21; these conditional move instructions simultaneously compute the ReLU activation values of the first two rows of matrix A, 32 elements in total;
S1.5: the ReLU activation values of the first two rows of matrix A are now held in vector registers VR20 and VR21;
S200: compute the max pooling of the matrix processed by the ReLU activation function in step S100;
S2.1: according to the max pooling window size kx = ky = 2, take the two rows of elements computed in step S1.5, i.e. VR20 and VR21, as the input of the max pooling layer;
S2.2: compare row 1 (VR20) with row 2 (VR21), placing the logical result of the comparison in condition register VR2. The instruction is: VFCMPGD VR20, VR21, VR2; if VR20[i] > VR21[i], 1 ≤ i ≤ p, then VR2[i] = 1, otherwise VR2[i] = 0;
S2.3: use the conditional vector move instruction VMOV to assign the values VR20[i] in the VPEs for which the condition register from step S2.2 holds VR2[i] = 1 to the corresponding VR21[i]; values in VR21[i] larger than VR20[i] remain unchanged;
S2.4: with this single comparison, obtain the maximum over the 2 rows at each element position;
S2.5: configure the corresponding shuffle pattern and, by comparison, obtain the maxima of each pair of adjacent columns of the results of step S2.4;
S2.6: finally obtain, simultaneously, 16/2 = 8 max pooling results for 2 × 2 pooling windows;
S300: repeat steps S100 and S200 until all sub-blocks of matrix A have been traversed, finally completing the ReLU activation processing and max pooling of the whole matrix A.
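Tying the embodiment together, a hypothetical driver under the same assumptions as the sketches above (p = 16 lanes, k = 2, and the helper functions relu_row and maxpool_rows defined earlier) would walk the 16 × 16 matrix two rows at a time:

#define P 16
#define K 2

/* Assumed driver for the 16 x 16 embodiment: for each pair of rows, the
 * ReLU model of step S100 fills two register-like arrays, then the
 * row/column reduction of step S200 yields 8 pooled values; 8 row pairs
 * cover the whole matrix, which is the traversal of step S300. */
extern void relu_row(const float vr10[P], float vr20[P]);
extern void maxpool_rows(float rows[][P], int k, float out[]);

void pool_16x16(const float a[16][16], float c[8][8])
{
    float rows[K][P];
    for (int pair = 0; pair < 16 / K; pair++) {
        for (int r = 0; r < K; r++)
            relu_row(a[pair * K + r], rows[r]);  /* S100: ReLU of 2 rows   */
        maxpool_rows(rows, K, c[pair]);          /* S200: 8 pooled values  */
    }
}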
The above are only preferred embodiments of the present invention, and the scope of protection of the present invention is not limited to the above embodiments: all technical schemes falling under the idea of the present invention belong to its scope of protection. It should be pointed out that, for those of ordinary skill in the art, improvements and modifications that do not depart from the principles of the present invention should also be regarded as within the scope of protection of the present invention.

Claims (5)

1. A vectorized implementation method fusing the ReLU activation function and max pooling, characterized in that its steps are:
S1: compute the ReLU activation values of matrix A;
S2: compute the max pooling of the matrix processed by the ReLU activation function in step S1;
S3: repeat steps S1 and S2 until all sub-blocks of matrix A have been traversed, finally completing the ReLU activation processing and max pooling of the whole matrix A.
2. The vectorized implementation method fusing the ReLU activation function and max pooling according to claim 1, characterized in that the specific steps of step S1 are:
S1.1: let the matrix requiring activation processing after the convolution operation be A(M, N), the ReLU activation function be f(x) = max(0, x), and the number of vector processing elements (VPEs) be p; N is taken to be an integral multiple of p and of kx, ky, where the max pooling window is kx × ky;
S1.2: use the vector VLOAD instruction to load the first row of elements of matrix A;
S1.3: use the vector compare instruction VFCMPGD to compare the vector registers, placing the logical result of the comparison in the condition register;
S1.4: use the conditional vector move instruction VMOV to move the values greater than 0 from step S1.3 into a vector register;
S1.5: obtain the result after ReLU activation processing;
S1.6: according to the max pooling window size k, repeat steps S1.2 to S1.5 k times to obtain the ReLU activation results of k rows of matrix A; the results are kept in vector registers and serve directly as the input values for max pooling in step S2.
3. The vectorized implementation method fusing the ReLU activation function and max pooling according to claim 2, characterized in that the specific steps of step S2 are:
S2.1: take the k rows of elements computed in step S1.6 directly as the input of this computation;
S2.2: compare the elements of row 1 with those of row 2, placing the logical result of the comparison in the condition register;
S2.3: use the conditional vector move instruction VMOV;
S2.4: obtain, for each element position, the maximum over the k rows by performing k-1 comparisons;
S2.5: configure the shuffle pattern and, by comparison, obtain the maxima over the corresponding k columns of the results of step S2.4;
S2.6: finally obtain, simultaneously, p/k max pooling results for pooling windows of size kx × ky.
4. The vectorized implementation method fusing the ReLU activation function and max pooling according to claim 1, 2 or 3, characterized in that the formula for computing a max pooling result c_{0,0} in step S2.5 is:
$$c_{0,0} = \max_{0 \le i \le k_x - 1}\left(\max_{0 \le j \le k_y - 1} a_{i,j}\right)$$
where c_{0,0} is the first element of the max pooling result matrix, kx and ky give the size of the pooling window (in convolutional neural networks the pooling window is square, i.e. kx = ky = k), and a_{i,j} is an element of the matrix A to be max pooled.
5. The vectorized implementation method fusing the ReLU activation function and max pooling according to claim 1, 2 or 3, characterized in that in the above steps the pooling window size is defined as sizeX, sizeY, the horizontal or vertical displacement between two adjacent pooling windows is stride, and in the max pooling operation the pooling windows do not overlap, i.e. sizeX = sizeY = stride.
CN201710201376.2A 2017-03-30 2017-03-30 Vectorized implementation method fusing the ReLU activation function and max pooling Pending CN106991472A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710201376.2A CN106991472A (en) 2017-03-30 2017-03-30 Vectorized implementation method fusing the ReLU activation function and max pooling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710201376.2A CN106991472A (en) 2017-03-30 2017-03-30 Vectorized implementation method fusing the ReLU activation function and max pooling

Publications (1)

Publication Number Publication Date
CN106991472A true CN106991472A (en) 2017-07-28

Family

ID=59411852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710201376.2A Pending CN106991472A (en) 2017-03-30 2017-03-30 Vectorized implementation method fusing the ReLU activation function and max pooling

Country Status (1)

Country Link
CN (1) CN106991472A (en)


Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109583561B (en) * 2017-09-28 2021-05-07 杭州海康威视数字技术股份有限公司 Activation quantization method and device for deep neural network
CN109583561A (en) * 2017-09-28 2019-04-05 杭州海康威视数字技术股份有限公司 Activation quantization method and device for a deep neural network
CN109685058A (en) * 2017-10-18 2019-04-26 杭州海康威视数字技术股份有限公司 Image recognition method and apparatus, and computer device
US11347977B2 (en) 2017-10-18 2022-05-31 Hangzhou Hikvision Digital Technology Co., Ltd. Lateral and longitudinal feature based image object recognition method, computer device, and non-transitory computer readable storage medium
CN109754359A (en) * 2017-11-01 2019-05-14 腾讯科技(深圳)有限公司 Pooling processing method and system applied to convolutional neural networks
US11537857B2 (en) 2017-11-01 2022-12-27 Tencent Technology (Shenzhen) Company Limited Pooling processing method and system applied to convolutional neural network
US11734554B2 (en) 2017-11-01 2023-08-22 Tencent Technology (Shenzhen) Company Limited Pooling processing method and system applied to convolutional neural network
CN108205703A (en) * 2017-12-29 2018-06-26 中国人民解放军国防科技大学 Multi-input multi-output matrix average value pooling vectorization implementation method
CN109165733A (en) * 2018-07-11 2019-01-08 中国人民解放军国防科技大学 Multi-input multi-output matrix maximum pooling vectorization implementation method
US10970604B2 (en) 2018-09-27 2021-04-06 Industrial Technology Research Institute Fusion-based classifier, classification method, and classification system
CN109727376A (en) * 2018-12-29 2019-05-07 北京沃东天骏信息技术有限公司 Method and apparatus for generating a configuration file, and vending device
CN113892092A (en) * 2019-02-06 2022-01-04 瀚博控股公司 Method and system for convolution model hardware accelerator
WO2021035621A1 (en) * 2019-08-29 2021-03-04 深圳市大疆创新科技有限公司 Extreme point extraction method and apparatus, and computer-readable storage medium
CN110796236B (en) * 2019-10-21 2022-06-17 中国人民解放军国防科技大学 Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
CN110796236A (en) * 2019-10-21 2020-02-14 中国人民解放军国防科技大学 Vectorization implementation method for pooling of multi-sample multi-channel convolutional neural network
CN113762452A (en) * 2020-06-04 2021-12-07 合肥君正科技有限公司 Method for quantizing PRELU activation function
CN113762452B (en) * 2020-06-04 2024-01-02 合肥君正科技有限公司 Method for quantizing PRELU activation function
CN112598640B (en) * 2020-12-22 2021-09-14 哈尔滨市科佳通用机电股份有限公司 Water filling port cover plate loss detection method based on deep learning
CN112598640A (en) * 2020-12-22 2021-04-02 哈尔滨市科佳通用机电股份有限公司 Water filling port cover plate loss detection method based on deep learning

Similar Documents

Publication Publication Date Title
CN106991472A (en) A kind of fusion ReLU activation primitives and the vectorization implementation method in maximum pond
TWI759361B (en) An architecture, method, computer-readable medium, and apparatus for sparse neural network acceleration
CN105892989B (en) Neural network accelerator and operational method thereof
US10846591B2 (en) Configurable and programmable multi-core architecture with a specialized instruction set for embedded application based on neural networks
TWI775605B (en) Deep vision processor
CN107578098B (en) Neural network processor based on systolic array
Cong et al. Minimizing computation in convolutional neural networks
EP3407266B1 (en) Artificial neural network calculating device and method for sparse connection
CN107609641A (en) Sparse neural network framework and its implementation
CN106970896A (en) The vectorization implementation method of the two-dimensional matrix convolution of vector processor-oriented
CN107239824A (en) Apparatus and method for realizing sparse convolution neutral net accelerator
CN106529670A (en) Neural network processor based on weight compression, design method, and chip
CN108090565A (en) Accelerated method is trained in a kind of convolutional neural networks parallelization
CN108629406B (en) Arithmetic device for convolutional neural network
CN106529668A (en) Operation device and method of accelerating chip which accelerates depth neural network algorithm
CN107341541A (en) A kind of apparatus and method for performing full articulamentum neural metwork training
CN109190756A (en) Arithmetic unit based on Winograd convolution and the neural network processor comprising the device
CN105930902A (en) Neural network processing method and system
CN106991665A (en) Method based on CUDA image co-registration parallel computations
WO2022067508A1 (en) Neural network accelerator, and acceleration method and device
CN107704921A (en) The algorithm optimization method and device of convolutional neural networks based on Neon instructions
CN108388537A (en) A kind of convolutional neural networks accelerator and method
CN110163333A (en) The parallel optimization method of convolutional neural networks
US20220019408A1 (en) Method and apparatus with neural network processing
CN108205703A (en) Multi-input multi-output matrix average value pooling vectorization implementation method

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170728)