CN107704921A - The algorithm optimization method and device of convolutional neural networks based on Neon instructions - Google Patents

The algorithm optimization method and device of convolutional neural networks based on Neon instructions

Info

Publication number
CN107704921A
CN107704921A (application CN201710974484.3A)
Authority
CN
China
Prior art keywords
matrixes
convolution
matrix
row
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710974484.3A
Other languages
Chinese (zh)
Inventor
朱明
曾建平
张智鹏
耿磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhi Xinyuandong Science And Technology Ltd
Original Assignee
Beijing Zhi Xinyuandong Science And Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhi Xinyuandong Science And Technology Ltd filed Critical Beijing Zhi Xinyuandong Science And Technology Ltd
Priority to CN201710974484.3A priority Critical patent/CN107704921A/en
Publication of CN107704921A publication Critical patent/CN107704921A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G06F17/153 Multidimensional correlation or convolution

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides an algorithm optimization method for convolutional neural networks based on Neon instructions. The method includes: converting the convolution kernel images of a convolutional layer into a corresponding matrix A, with the number of columns of matrix A padded to a multiple of 4; inputting an image to be convolved and converting it into a corresponding matrix B, with the number of rows of matrix B padded to a multiple of 4; transposing matrix B to obtain the transposed matrix Bt; computing the row-by-row dot products of matrix A and matrix Bt; and performing parallel optimization using Neon instructions. Compared with the prior art, the invention can effectively improve the computational performance of convolutional neural networks.

Description

The algorithm optimization method and device of convolutional neural networks based on Neon instructions
Technical field
The present invention relates to image processing, video surveillance, and convolutional neural networks, and more particularly to an algorithm optimization method and device for convolutional neural networks based on Neon instructions.
Background technology
With the rapid development of artificial intelligence, deep learning has been increasingly introduced into the fields of image processing and pattern recognition, and performs well in solving the related problems. Among deep learning models, the convolutional neural network (CNN) is a model structure that is particularly good at processing images, especially machine learning problems involving large images, and is the most widely applied and most deeply studied.
However, in practical image processing and pattern recognition applications, convolutional neural networks are usually implemented with many network layers, so their computational complexity is high and they contain a large number of dense image convolution operations. The resulting long running time directly affects the performance of algorithms based on convolutional neural networks and limits their application, particularly on embedded front-end video surveillance devices such as ARM platforms.
From the current technical standpoint of convolutional neural network algorithm optimization, convolution operations are mainly accelerated through matrix acceleration: the convolution kernels and the input image are converted into two large matrices, and the convolution result is obtained from the product of the large matrices. Once convolution is converted into matrix operations, matrix acceleration can be realized on platforms that support third-party matrix acceleration libraries, and the performance of convolutional neural network algorithms is greatly improved. However, on embedded ARM platforms that do not support third-party matrix acceleration libraries, convolutional neural network algorithms remain very time-consuming and real-time performance is poor.
Neon is a 128-bit SIMD (Single Instruction, Multiple Data) extension architecture for ARM Cortex-A series processors. From smartphones and mobile computing devices to HDTV, it has been recognized as one of the most capable processor technologies in the multimedia application field. The dedicated design of the Neon instructions simplifies software porting between different platforms and provides low-power, flexible acceleration for intensive multimedia applications such as Dolby Mobile.
In summary, there is a present need for a Neon instruction-based convolutional neural network algorithm optimization method, applicable to ARM platforms, that can effectively reduce computation time.
Summary of the invention
In view of this, a primary object of the present invention is to reduce computing resource consumption and optimize convolutional neural network algorithms.
To achieve the above object, according to a first aspect of the present invention, an algorithm optimization method for convolutional neural networks based on Neon instructions is provided. The method includes:
a first step of converting the convolution kernel images of a convolutional layer into a corresponding matrix A, and padding the number of columns of matrix A to a multiple of 4;
a second step of inputting an image to be convolved, converting it into a corresponding matrix B, and padding the number of rows of matrix B to a multiple of 4;
a third step of transposing matrix B to obtain the transposed matrix Bt;
a fourth step of computing the row-by-row dot products of matrix A and matrix Bt; and
a fifth step of performing parallel optimization using Neon instructions.
Further, the first step includes: for the CNum convolution kernel images of size N × N in the convolutional layer, taking each convolution kernel image in turn as one row of matrix data to obtain matrix A with CNum rows and N × N columns; and expanding the number of columns of matrix A to a multiple of 4, with the values in each padded column set to 0.
Further, the second step includes: inputting the image to be convolved by the convolutional layer; performing sliding-window convolution with the N × N convolution kernel to obtain MNum convolution feature sub-images; taking each convolution feature sub-image in turn as one column of matrix data to obtain matrix B with N × N rows and MNum columns; and expanding the number of rows of matrix B to a multiple of 4, with the values in each padded row set to 0.
Further, in the third step, the rows and columns of matrix B are transposed to obtain matrix Bt with MNum rows and N × N columns padded to a multiple of 4.
Further, the fifth step includes: in the Neon instruction set, using the load instruction vld1q_f32 to load 4 floating-point numbers; using the multiply instruction vmulq_f32 to multiply 4 floating-point numbers; using the add instruction vaddq_f32 to add 4 floating-point numbers; using the split instructions vget_low_f32 and vget_high_f32 to obtain 2 floating-point numbers each; and using the pairwise-add instruction vpadd_f32 to first add the pairs of floating-point numbers from vget_low_f32 and vget_high_f32 and then accumulate the adjacent results.
According to another aspect of the present invention, an algorithm optimization device for convolutional neural networks based on Neon instructions is provided. The device includes:
a convolution kernel image matrix processing module, for converting the convolution kernel images of a convolutional layer into a corresponding matrix A and padding the number of columns of matrix A to a multiple of 4;
a to-be-convolved input image matrix processing module, for inputting an image to be convolved, converting it into a corresponding matrix B, and padding the number of rows of matrix B to a multiple of 4;
a matrix transposition module, for transposing matrix B to obtain the transposed matrix Bt;
a matrix row-by-row dot product module, for computing the row-by-row dot products of matrix A and matrix Bt; and
a Neon optimization processing module, for performing parallel optimization using Neon instructions.
Further, the convolution kernel image matrix processing module is configured to: for the CNum convolution kernel images of size N × N in the convolutional layer, take each convolution kernel image in turn as one row of matrix data to obtain matrix A with CNum rows and N × N columns; and expand the number of columns of matrix A to a multiple of 4, with the values in each padded column set to 0.
Further, the to-be-convolved input image matrix processing module is configured to: input the image to be convolved by the convolutional layer; perform sliding-window convolution with the N × N convolution kernel to obtain MNum convolution feature sub-images; take each convolution feature sub-image in turn as one column of matrix data to obtain matrix B with N × N rows and MNum columns; and expand the number of rows of matrix B to a multiple of 4, with the values in each padded row set to 0.
The matrix transposition module is configured to transpose the rows and columns of matrix B to obtain matrix Bt with MNum rows and N × N columns padded to a multiple of 4.
Further, the Neon optimization processing module is configured to: in the Neon instruction set, use the load instruction vld1q_f32 to load 4 floating-point numbers; use the multiply instruction vmulq_f32 to multiply 4 floating-point numbers; use the add instruction vaddq_f32 to add 4 floating-point numbers; use the split instructions vget_low_f32 and vget_high_f32 to obtain 2 floating-point numbers each; and use the pairwise-add instruction vpadd_f32 to first add the pairs of floating-point numbers from vget_low_f32 and vget_high_f32 and then accumulate the adjacent results.
Compared with existing convolutional neural network algorithm optimization methods, the Neon instruction-based algorithm optimization method of the present invention effectively improves the computational performance of convolutional neural networks through the matrix conversion of the convolution kernel images and of the image to be convolved, together with parallel optimization using the Neon instructions of the ARM platform.
Brief description of the drawings
Fig. 1 shows a flow chart of an embodiment of the algorithm optimization method for convolutional neural networks based on Neon instructions according to the present invention.
Fig. 2 shows a structural schematic diagram of an embodiment of the algorithm optimization device for convolutional neural networks based on Neon instructions according to the present invention.
Detailed description of the embodiments
To enable the examiner to further understand the structure, features, and other objects of the present invention, the preferred embodiments are described in detail below with reference to the accompanying drawings. The illustrated preferred embodiments serve only to explain the technical solutions of the invention and do not limit the invention.
Fig. 1 gives a flow chart of a first embodiment of the algorithm optimization method for convolutional neural networks based on Neon instructions according to the present invention. As shown in Fig. 1, the method includes:
First step S1: convert the convolution kernel images of a convolutional layer into a corresponding matrix A, and pad the number of columns of matrix A to a multiple of 4;
Second step S2: input an image to be convolved, convert it into a corresponding matrix B, and pad the number of rows of matrix B to a multiple of 4;
Third step S3: transpose matrix B to obtain the transposed matrix Bt;
Fourth step S4: compute the row-by-row dot products of matrix A and matrix Bt; and
Fifth step S5: perform parallel optimization using Neon instructions.
Further, the first step S1 includes: for the CNum convolution kernel images of size N × N in the convolutional layer, taking each convolution kernel image in turn as one row of matrix data to obtain matrix A with CNum rows and N × N columns; and expanding the number of columns of matrix A to a multiple of 4, with the values in each padded column set to 0.
In one embodiment, for the 16 convolution kernel images of size 3 × 3 in the convolutional layer, the i-th convolution kernel image is taken as the matrix data of the i-th row, i = {0, 1, 2, …, 15}, yielding matrix A with 16 rows and 9 columns; the number of columns of matrix A is then expanded to the next multiple of 4, namely 12, with the values in each padded column set to 0.
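As a minimal sketch in C (an illustration added here, not part of the patent text; the function and parameter names are assumptions made for illustration), this kernel-to-matrix step could look as follows:
#include <string.h>
/* Hypothetical sketch: flatten CNum N x N kernels into matrix A in
   row-major order, padding the column count up to a multiple of 4
   with zeros. With cnum = 16, n = 3 and cols_padded = 12 this
   reproduces the 16 x 12 matrix A of the embodiment above. */
void kernels_to_A(const float *kernels, int cnum, int n,
                  float *A, int cols_padded)
{
    int k2 = n * n;                       /* true column count, e.g. 9 */
    for (int i = 0; i < cnum; ++i) {
        /* copy the i-th flattened kernel into row i */
        memcpy(&A[i * cols_padded], &kernels[i * k2], k2 * sizeof(float));
        /* zero the alignment columns, e.g. columns 9..11 */
        for (int j = k2; j < cols_padded; ++j)
            A[i * cols_padded + j] = 0.0f;
    }
}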
Further, the second step S2 includes: inputting the image to be convolved by the convolutional layer; performing sliding-window convolution with the N × N convolution kernel to obtain MNum convolution feature sub-images; taking each convolution feature sub-image in turn as one column of matrix data to obtain matrix B with N × N rows and MNum columns; and expanding the number of rows of matrix B to a multiple of 4, with the values in each padded row set to 0.
In one embodiment, 3 × 3 sliding-window convolution is applied to the input image to obtain the convolution feature sub-images; the i-th convolution feature sub-image is taken as the matrix data of the i-th column, i = {0, 1, 2, …, MNum−1}, yielding matrix B with 9 rows and MNum columns; the number of rows of matrix B is then expanded to the next multiple of 4, namely 12, with the values in each padded row set to 0.
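A corresponding sketch of this sliding-window construction of matrix B, again with assumed names, and assuming stride 1 with no border padding (so that MNum = (img_h − n + 1) × (img_w − n + 1)):
/* Hypothetical sketch: build matrix B column by column. Column j
   holds the j-th N x N sliding window of the input image, flattened;
   rows n*n .. rows_padded-1 are the zero-filled alignment rows. */
void image_to_B(const float *img, int img_w, int img_h, int n,
                float *B, int rows_padded)
{
    int mnum = (img_h - n + 1) * (img_w - n + 1);
    int col = 0;
    for (int y = 0; y + n <= img_h; ++y) {
        for (int x = 0; x + n <= img_w; ++x, ++col) {
            int row = 0;
            for (int ky = 0; ky < n; ++ky)
                for (int kx = 0; kx < n; ++kx, ++row)
                    B[row * mnum + col] = img[(y + ky) * img_w + (x + kx)];
            for (; row < rows_padded; ++row)
                B[row * mnum + col] = 0.0f;   /* alignment rows */
        }
    }
}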
In the third step S3, the rows and columns of matrix B are transposed to obtain matrix Bt with MNum rows and N × N columns padded to a multiple of 4.
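The transposition of step S3 is a plain row/column swap; a sketch under the same assumed row-major layout:
/* Hypothetical sketch: Bt[j][i] = B[i][j]. After the transposition
   each row of Bt holds rows_padded contiguous floats (a multiple of
   4), so step S5 can consume it 4 floats at a time with vld1q_f32. */
void transpose_B(const float *B, int rows_padded, int mnum, float *Bt)
{
    for (int i = 0; i < rows_padded; ++i)
        for (int j = 0; j < mnum; ++j)
            Bt[j * rows_padded + i] = B[i * mnum + j];
}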
Further, the fifth step S5 includes: in the Neon instruction set, using the load instruction vld1q_f32 to load 4 floating-point numbers; using the multiply instruction vmulq_f32 to multiply 4 floating-point numbers; using the add instruction vaddq_f32 to add 4 floating-point numbers; using the split instructions vget_low_f32 and vget_high_f32 to obtain 2 floating-point numbers each; and using the pairwise-add instruction vpadd_f32 to first add the pairs of floating-point numbers from vget_low_f32 and vget_high_f32 and then accumulate the adjacent results.
In one embodiment, for an 8 × 8 matrix A and matrix Bt, let the first row vector of matrix A be [a1 a2 a3 … a8] and the first row vector of matrix Bt be [b1 b2 b3 … b8]. The Neon load instruction vld1q_f32 performs parallel accesses, loading 4 floating-point numbers in a single instruction: a 128-bit register Va stores the 4 floating-point numbers a1, a2, a3, a4 or a5, a6, a7, a8, and a 128-bit register Vb stores the 4 floating-point numbers b1, b2, b3, b4 or b5, b6, b7, b8. The multiply instruction vmulq_f32 computes the element-wise products Va×b = [a1×b1, a2×b2, a3×b3, a4×b4] or Va×b = [a5×b5, a6×b6, a7×b7, a8×b8]. The add instruction vaddq_f32 then adds the two product vectors: Va+b = [a1×b1+a5×b5, a2×b2+a6×b6, a3×b3+a7×b7, a4×b4+a8×b8]. The split instruction vget_low_f32 obtains the two floating-point numbers a1×b1+a5×b5 and a2×b2+a6×b6, and vget_high_f32 obtains the two floating-point numbers a3×b3+a7×b7 and a4×b4+a8×b8. The pairwise-add instruction vpadd_f32 first computes the sum of the two numbers from vget_low_f32, a1×b1+a5×b5+a2×b2+a6×b6, and the sum of the two numbers from vget_high_f32, a3×b3+a7×b7+a4×b4+a8×b8, and then accumulates these to obtain the result: Result = a1×b1+a2×b2+a3×b3+a4×b4+a5×b5+a6×b6+a7×b7+a8×b8.
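This instruction sequence can be expressed with the standard Neon intrinsics from arm_neon.h. The following sketch (an added illustration, not patent text) computes the 8-term dot product in exactly the order described; the final lane extraction via vget_lane_f32 is an added assumption, since the patent text does not name that intrinsic:
#include <arm_neon.h>
/* Dot product of two 8-float rows a = [a1..a8], b = [b1..b8]. */
static float dot8_neon(const float *a, const float *b)
{
    float32x4_t va_lo = vld1q_f32(a);        /* a1 a2 a3 a4 */
    float32x4_t va_hi = vld1q_f32(a + 4);    /* a5 a6 a7 a8 */
    float32x4_t vb_lo = vld1q_f32(b);        /* b1 b2 b3 b4 */
    float32x4_t vb_hi = vld1q_f32(b + 4);    /* b5 b6 b7 b8 */
    /* element-wise products of each half */
    float32x4_t p_lo = vmulq_f32(va_lo, vb_lo);
    float32x4_t p_hi = vmulq_f32(va_hi, vb_hi);
    /* [a1b1+a5b5, a2b2+a6b6, a3b3+a7b7, a4b4+a8b8] */
    float32x4_t sum4 = vaddq_f32(p_lo, p_hi);
    /* split into low and high 2-float halves, then pairwise add */
    float32x2_t pair = vpadd_f32(vget_low_f32(sum4), vget_high_f32(sum4));
    pair = vpadd_f32(pair, pair);            /* fold the two lanes */
    return vget_lane_f32(pair, 0);           /* full 8-term sum */
}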
Fig. 2 gives a structural schematic diagram of a first embodiment of the algorithm optimization device for convolutional neural networks based on Neon instructions according to the present invention. As shown in Fig. 2, the device includes:
a convolution kernel image matrix processing module 1, for converting the convolution kernel images of a convolutional layer into a corresponding matrix A and padding the number of columns of matrix A to a multiple of 4;
a to-be-convolved input image matrix processing module 2, for inputting an image to be convolved, converting it into a corresponding matrix B, and padding the number of rows of matrix B to a multiple of 4;
a matrix transposition module 3, for transposing matrix B to obtain the transposed matrix Bt;
a matrix row-by-row dot product module 4, for computing the row-by-row dot products of matrix A and matrix Bt; and
a Neon optimization processing module 5, for performing parallel optimization using Neon instructions.
Further, the convolution kernel image matrix processing module 1 is configured to: for the CNum convolution kernel images of size N × N in the convolutional layer, take each convolution kernel image in turn as one row of matrix data to obtain matrix A with CNum rows and N × N columns; and expand the number of columns of matrix A to a multiple of 4, with the values in each padded column set to 0.
Further, the to-be-convolved input image matrix processing module 2 is configured to: input the image to be convolved by the convolutional layer; perform sliding-window convolution with the N × N convolution kernel to obtain MNum convolution feature sub-images; take each convolution feature sub-image in turn as one column of matrix data to obtain matrix B with N × N rows and MNum columns; and expand the number of rows of matrix B to a multiple of 4, with the values in each padded row set to 0.
The matrix transposition module 3 is configured to transpose the rows and columns of matrix B to obtain matrix Bt with MNum rows and N × N columns padded to a multiple of 4.
Further, the Neon optimization processing module 5 is configured to: in the Neon instruction set, use the load instruction vld1q_f32 to load 4 floating-point numbers; use the multiply instruction vmulq_f32 to multiply 4 floating-point numbers; use the add instruction vaddq_f32 to add 4 floating-point numbers; use the split instructions vget_low_f32 and vget_high_f32 to obtain 2 floating-point numbers each; and use the pairwise-add instruction vpadd_f32 to first add the pairs of floating-point numbers from vget_low_f32 and vget_high_f32 and then accumulate the adjacent results.
Compared with existing convolutional neural network algorithm optimization methods, the Neon instruction-based algorithm optimization method of the present invention effectively improves the computational performance of convolutional neural networks through the matrix conversion of the convolution kernel images and of the image to be convolved, together with parallel optimization using the Neon instructions of the ARM platform.
The foregoing is only a description of the preferred embodiments of the present invention and is not intended to limit its scope. It should be understood that the invention is not limited to the implementations described herein, which are described to help those skilled in the art practice the invention. Those skilled in the art can readily make further improvements and refinements without departing from the spirit and scope of the invention; the invention is therefore limited only by the content and scope of its claims, which are intended to cover all alternatives and equivalents falling within the spirit and scope of the invention.

Claims (10)

1. An algorithm optimization method for convolutional neural networks based on Neon instructions, characterized in that the method comprises:
a first step of converting the convolution kernel images of a convolutional layer into a corresponding matrix A, and padding the number of columns of matrix A to a multiple of 4;
a second step of inputting an image to be convolved, converting it into a corresponding matrix B, and padding the number of rows of matrix B to a multiple of 4;
a third step of transposing matrix B to obtain the transposed matrix Bt;
a fourth step of computing the row-by-row dot products of matrix A and matrix Bt; and
a fifth step of performing parallel optimization using Neon instructions.
2. The method according to claim 1, characterized in that the first step comprises: for the CNum convolution kernel images of size N × N in the convolutional layer, taking each convolution kernel image in turn as one row of matrix data to obtain matrix A with CNum rows and N × N columns; and expanding the number of columns of matrix A to a multiple of 4, with the values in each padded column set to 0.
3. The method according to claim 1, characterized in that the second step comprises: inputting the image to be convolved by the convolutional layer; performing sliding-window convolution with the N × N convolution kernel to obtain MNum convolution feature sub-images; taking each convolution feature sub-image in turn as one column of matrix data to obtain matrix B with N × N rows and MNum columns; and expanding the number of rows of matrix B to a multiple of 4, with the values in each padded row set to 0.
4. The method according to claim 1, wherein in the third step the rows and columns of matrix B are transposed to obtain matrix Bt with MNum rows and N × N columns padded to a multiple of 4.
5. The method according to claim 1, characterized in that the fifth step comprises: in the Neon instruction set, using the load instruction vld1q_f32 to load 4 floating-point numbers; using the multiply instruction vmulq_f32 to multiply 4 floating-point numbers; using the add instruction vaddq_f32 to add 4 floating-point numbers; using the split instructions vget_low_f32 and vget_high_f32 to obtain 2 floating-point numbers each; and using the pairwise-add instruction vpadd_f32 to first add the pairs of floating-point numbers from vget_low_f32 and vget_high_f32 and then accumulate the adjacent results.
6. An algorithm optimization device for convolutional neural networks based on Neon instructions, characterized in that the device comprises: a convolution kernel image matrix processing module, for converting the convolution kernel images of a convolutional layer into a corresponding matrix A and padding the number of columns of matrix A to a multiple of 4;
a to-be-convolved input image matrix processing module, for inputting an image to be convolved, converting it into a corresponding matrix B, and padding the number of rows of matrix B to a multiple of 4;
a matrix transposition module, for transposing matrix B to obtain the transposed matrix Bt;
a matrix row-by-row dot product module, for computing the row-by-row dot products of matrix A and matrix Bt; and
a Neon optimization processing module, for performing parallel optimization using Neon instructions.
7. The device according to claim 6, characterized in that the convolution kernel image matrix processing module is configured to: for the CNum convolution kernel images of size N × N in the convolutional layer, take each convolution kernel image in turn as one row of matrix data to obtain matrix A with CNum rows and N × N columns; and expand the number of columns of matrix A to a multiple of 4, with the values in each padded column set to 0.
8. The device according to claim 6, characterized in that the to-be-convolved input image matrix processing module is configured to: input the image to be convolved by the convolutional layer; perform sliding-window convolution with the N × N convolution kernel to obtain MNum convolution feature sub-images; take each convolution feature sub-image in turn as one column of matrix data to obtain matrix B with N × N rows and MNum columns; and expand the number of rows of matrix B to a multiple of 4, with the values in each padded row set to 0.
9. The device according to claim 6, wherein the matrix transposition module is configured to transpose the rows and columns of matrix B to obtain matrix Bt with MNum rows and N × N columns padded to a multiple of 4.
10. The device according to claim 6, characterized in that the Neon optimization processing module is configured to: in the Neon instruction set, use the load instruction vld1q_f32 to load 4 floating-point numbers; use the multiply instruction vmulq_f32 to multiply 4 floating-point numbers; use the add instruction vaddq_f32 to add 4 floating-point numbers; use the split instructions vget_low_f32 and vget_high_f32 to obtain 2 floating-point numbers each; and use the pairwise-add instruction vpadd_f32 to first add the pairs of floating-point numbers from vget_low_f32 and vget_high_f32 and then accumulate the adjacent results.
CN201710974484.3A 2017-10-19 2017-10-19 The algorithm optimization method and device of convolutional neural networks based on Neon instructions Pending CN107704921A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710974484.3A CN107704921A (en) 2017-10-19 2017-10-19 The algorithm optimization method and device of convolutional neural networks based on Neon instructions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710974484.3A CN107704921A (en) 2017-10-19 2017-10-19 The algorithm optimization method and device of convolutional neural networks based on Neon instructions

Publications (1)

Publication Number Publication Date
CN107704921A true CN107704921A (en) 2018-02-16

Family

ID=61181715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710974484.3A Pending CN107704921A (en) 2017-10-19 2017-10-19 The algorithm optimization method and device of convolutional neural networks based on Neon instructions

Country Status (1)

Country Link
CN (1) CN107704921A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549931A (en) * 2018-04-25 2018-09-18 济南浪潮高新科技投资发展有限公司 A kind of accelerator and method of convolutional neural networks
CN109447239A (en) * 2018-09-26 2019-03-08 华南理工大学 A kind of embedded convolutional neural networks accelerated method based on ARM
CN109493300A (en) * 2018-11-15 2019-03-19 湖南鲲鹏智汇无人机技术有限公司 The real-time defogging method of Aerial Images and unmanned plane based on FPGA convolutional neural networks
CN109558944A (en) * 2018-12-13 2019-04-02 北京智芯原动科技有限公司 The algorithm optimization method and device of convolutional neural networks based on configurable convolutional layer
CN109615066A (en) * 2019-01-30 2019-04-12 新疆爱华盈通信息技术有限公司 A kind of method of cutting out of the convolutional neural networks for NEON optimization
CN109784372A (en) * 2018-12-17 2019-05-21 北京理工大学 A kind of objective classification method based on convolutional neural networks
CN110188869A (en) * 2019-05-05 2019-08-30 北京中科汇成科技有限公司 A kind of integrated circuit based on convolutional neural networks algorithm accelerates the method and system of calculating
CN110263909A (en) * 2018-03-30 2019-09-20 腾讯科技(深圳)有限公司 Image-recognizing method and device
CN110399971A (en) * 2019-07-03 2019-11-01 Oppo广东移动通信有限公司 A kind of convolutional neural networks accelerating method and device, storage medium
CN111178505A (en) * 2019-12-23 2020-05-19 福建星网视易信息系统有限公司 Acceleration method of convolutional neural network, computer-readable storage medium and application
WO2020135602A1 (en) * 2018-12-29 2020-07-02 北京市商汤科技开发有限公司 Image processing method and device, intelligent driving system, and vehicle-mounted computing platform
CN111754409A (en) * 2019-03-27 2020-10-09 北京沃东天骏信息技术有限公司 Image processing method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150286858A1 (en) * 2015-03-18 2015-10-08 Looksery, Inc. Emotion recognition in video conferencing
CN105184278A (en) * 2015-09-30 2015-12-23 深圳市商汤科技有限公司 Human face detection method and device
CN105320495A (en) * 2014-07-22 2016-02-10 英特尔公司 Weight-shifting mechanism for convolutional neural network
CN107003989A (en) * 2014-12-19 2017-08-01 英特尔公司 For the distribution and the method and apparatus of Collaboration computing in artificial neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320495A (en) * 2014-07-22 2016-02-10 英特尔公司 Weight-shifting mechanism for convolutional neural network
CN107003989A (en) * 2014-12-19 2017-08-01 英特尔公司 For the distribution and the method and apparatus of Collaboration computing in artificial neural network
US20150286858A1 (en) * 2015-03-18 2015-10-08 Looksery, Inc. Emotion recognition in video conferencing
CN105184278A (en) * 2015-09-30 2015-12-23 深圳市商汤科技有限公司 Human face detection method and device

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110263909B (en) * 2018-03-30 2022-10-28 腾讯科技(深圳)有限公司 Image recognition method and device
CN110263909A (en) * 2018-03-30 2019-09-20 腾讯科技(深圳)有限公司 Image-recognizing method and device
CN108549931A (en) * 2018-04-25 2018-09-18 济南浪潮高新科技投资发展有限公司 A kind of accelerator and method of convolutional neural networks
CN109447239A (en) * 2018-09-26 2019-03-08 华南理工大学 A kind of embedded convolutional neural networks accelerated method based on ARM
CN109447239B (en) * 2018-09-26 2022-03-25 华南理工大学 Embedded convolutional neural network acceleration method based on ARM
CN109493300A (en) * 2018-11-15 2019-03-19 湖南鲲鹏智汇无人机技术有限公司 The real-time defogging method of Aerial Images and unmanned plane based on FPGA convolutional neural networks
CN109558944A (en) * 2018-12-13 2019-04-02 北京智芯原动科技有限公司 The algorithm optimization method and device of convolutional neural networks based on configurable convolutional layer
CN109558944B (en) * 2018-12-13 2021-02-19 北京智芯原动科技有限公司 Algorithm optimization method and device of convolutional neural network based on configurable convolutional layer
CN109784372B (en) * 2018-12-17 2020-11-13 北京理工大学 Target classification method based on convolutional neural network
CN109784372A (en) * 2018-12-17 2019-05-21 北京理工大学 A kind of objective classification method based on convolutional neural networks
WO2020135602A1 (en) * 2018-12-29 2020-07-02 北京市商汤科技开发有限公司 Image processing method and device, intelligent driving system, and vehicle-mounted computing platform
CN109615066A (en) * 2019-01-30 2019-04-12 新疆爱华盈通信息技术有限公司 A kind of method of cutting out of the convolutional neural networks for NEON optimization
CN111754409A (en) * 2019-03-27 2020-10-09 北京沃东天骏信息技术有限公司 Image processing method, device, equipment and storage medium
CN110188869B (en) * 2019-05-05 2021-08-10 北京中科汇成科技有限公司 Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN110188869A (en) * 2019-05-05 2019-08-30 北京中科汇成科技有限公司 A kind of integrated circuit based on convolutional neural networks algorithm accelerates the method and system of calculating
CN110399971A (en) * 2019-07-03 2019-11-01 Oppo广东移动通信有限公司 A kind of convolutional neural networks accelerating method and device, storage medium
CN111178505A (en) * 2019-12-23 2020-05-19 福建星网视易信息系统有限公司 Acceleration method of convolutional neural network, computer-readable storage medium and application
CN111178505B (en) * 2019-12-23 2023-04-07 福建星网视易信息系统有限公司 Acceleration method of convolutional neural network and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN107704921A (en) The algorithm optimization method and device of convolutional neural networks based on Neon instructions
AU2022200600B2 (en) Superpixel methods for convolutional neural networks
JP7394104B2 (en) Executing kernel strides in hardware
US10394929B2 (en) Adaptive execution engine for convolution computing systems
US20190340510A1 (en) Sparsifying neural network models
CN108765247A (en) Image processing method, device, storage medium and equipment
US20190303757A1 (en) Weight skipping deep learning accelerator
TW201706917A (en) Rotating data for neural network computations
US11164032B2 (en) Method of performing data processing operation
CN107516131A (en) Acceleration method and device, electronic equipment and the storage medium of convolutional calculation
CN109447239B (en) Embedded convolutional neural network acceleration method based on ARM
CN109558944A (en) The algorithm optimization method and device of convolutional neural networks based on configurable convolutional layer
Zeng et al. Optimizing frequency domain implementation of CNNs on FPGAs
CN116980277B (en) Data processing method, device, computer equipment and storage medium
Chen et al. A TSQR Based Krylov Basis Computation Method on Hybrid GPU Cluster
CN116820577A (en) Parallel processing method and device for model, first computing equipment and electronic equipment
CN117413280A (en) Convolution with kernel expansion and tensor accumulation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180216