CN112446266A - Face recognition network structure suitable for front end - Google Patents
- Publication number
- CN112446266A (application CN201910830546.2A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- convolution
- module
- processing
- face recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
Abstract
The invention provides a face recognition network structure suitable for a front end. The network uses the idea of downsampling all the way to the fully connected layer: the feature map produced by each small module is downsampled and connected directly to the fully connected layer, so that the features extracted by every small module are used effectively. In each convolution, the result of the preceding convolution is cross-spliced (concatenated) into the input, which reinforces local features and prevents them from vanishing. The results of the preceding modules are likewise spliced into the input of each module, thereby avoiding vanishing gradients.
Description
Technical Field
The invention relates to the technical field of face image recognition, in particular to a face recognition network structure suitable for a front end.
Background
With the continuous development of science and technology, in particular computer vision, face recognition is widely applied in fields such as information security and electronic authentication, and image-feature-based methods achieve good recognition performance. Face recognition refers to techniques that, given a known library of face samples, identify one or more faces in a static or dynamic scene using image processing and/or pattern recognition. Current face recognition technology comprises: 1. various networks applied to face recognition, including ResNet, VGG, MobileNet, AlexNet, GoogLeNet and Inception-ResNet; 2. training frameworks for face recognition, including FaceNet and InsightFace. It has the following disadvantages: 1. current front-end networks use independent (depthwise-style) convolutions, feature-map addition, convolution kernels of various sizes, and so on, which many AI chips do not support, so those chips cannot implement the networks; 2. the amount of computation is relatively large; 3. the recognition rate is relatively low.
Disclosure of Invention
In order to solve the problems in the prior art, the aims of the present invention are:
1. Few operators: only 3 × 3 convolution kernels, feature-map splicing (concat) operations, downsampling operations, and fully connected operations are used.
2. A small amount of computation: fewer than 1G multiplications, and a model size (with floating-point parameters) of only 29 MB.
3. A face recognition rate improved to 99.8%.
The invention provides a face recognition network structure suitable for a front end. The network adopts downsampling all the way to the fully connected layer: network modules are arranged in the network, and the feature map produced by each module is downsampled and connected directly to the fully connected layer. In each convolution processing step, the result of the preceding convolution is cross-spliced into the input, and the results of the preceding modules are spliced into the input of each module.
The convolution processing used by each module is convolution with a 3 × 3 kernel, followed by normalization and then an excitation function.
In the design of each convolution layer, the number of output feature maps is designed independently.
The downsampling processing is downsampling with a kernel size of 2 × 2 and a step size of 2.
The structure further comprises first equalizing the input image data, and then performing convolution with a 3 × 3 kernel and a step size of 1 on the data to generate a feature map.
The network module further comprises subjecting the feature map to a first feature-map length-and-width reduction processing, which includes:
[1.1] performing convolution with a 3 × 3 kernel and a step size of 2 on the feature map generated from the image data, producing a feature map of depth 32, i.e. a 48 × 48 × 32 feature map;
[1.2] downsampling the convolved feature map of the image data with a kernel size of 2 × 2 and a step size of 2; the depth is unchanged (64), giving a 48 × 48 × 64 feature map.
The network module further comprises performing a second feature-map length-and-width reduction processing, which further comprises:
the input data being in_data1, in_data2, in_data3 and a depth N;
[2.1] performing convolution with a 3 × 3 kernel and a step size of 2 on in_data1, generating a feature map of depth N, i.e. W × H × N;
[2.2] downsampling in_data2 with a kernel size of 2 × 2 and a step size of 2; the generated feature map keeps the depth of the input feature map, denoted M, giving W × H × M;
[2.3] downsampling in_data3 with a kernel size of 2 × 2 and a step size of 2, likewise generating a W × H × M feature map, where M equals the input depth.
The network module comprises a first module, a second module, a third module and a fourth module.
The last module of the network is finally fully connected to 192 values.
The application has the advantages that:
1. The network uses downsampling all the way to the fully connected layer: the feature map of each small module's result is downsampled and connected directly to the fully connected layer, so the features extracted by every small module are used effectively. In each convolution, the result of the preceding convolution is cross-spliced into the input, which reinforces local features and prevents them from vanishing; the results of the preceding modules are spliced into the input of each module, thereby avoiding vanishing gradients.
2. Independently designing the number of output feature maps effectively reduces the amount of computation and maximizes the efficiency of convolutional feature extraction.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
Fig. 1 is a schematic block diagram of the network architecture of the present invention.
Fig. 2-1 through 2-22 are block flow diagrams of embodiments of the present invention.
Detailed Description
In the field of face recognition, some related technical terms currently include:
1. Deep learning: the concept of deep learning stems from the study of artificial neural networks; a multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning combines low-level features to form more abstract high-level representations (attribute classes or features), thereby discovering distributed feature representations of the data.
2. Convolution kernel: a convolution kernel is the parameter used to operate on the image matrix during image processing. It is typically a small matrix of numbers (e.g. a 3 × 3 matrix) with a weight for each cell of the region it covers. Common kernel shapes are 1 × 1, 3 × 3, 5 × 5, 7 × 7, 1 × 3, 3 × 1, 2 × 2, 1 × 5, 5 × 1, ….
3. Convolution: the centre of the convolution kernel is placed on the pixel to be calculated; each element of the kernel is multiplied by the image pixel it covers, the products are summed, and the result becomes the new pixel value at that location. This process is called convolution.
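The operation just described can be sketched in a few lines of plain Python. This is an illustrative "valid"-mode implementation (no padding), the variant CNN layers compute; the function name is our own.

```python
def conv2d_valid(img, kernel):
    """Slide the kernel over the image; at each position, multiply each
    kernel element by the pixel it covers and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(kernel[a][b] * img[i + a][j + b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
kernel = [[0, 0, 0],
          [0, 1, 0],
          [0, 0, 0]]  # identity kernel: picks out the centre pixel
print(conv2d_valid(img, kernel))  # -> [[6, 7], [10, 11]]
```

A 4 × 4 image convolved with a 3 × 3 kernel yields a 2 × 2 result, illustrating why the feature map shrinks unless padding is used.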
4. Feature map: the result of a convolution calculation on the input data is called a feature map; the result of a fully connected layer is also called a feature map. Feature map size is typically expressed as length × width × depth, or 1 × depth.
5. Step size (stride): the distance by which the centre of the convolution kernel moves between positions.
6. Two-ends non-alignment processing: when an image is processed with, for example, a kernel of size 3 and a step size of 2, the data at the borders may not fill a full kernel window; discarding the data on one or both sides is called two-ends non-alignment processing.
7. Front-end face detection: a face detection function running on an AI chip or ordinary chip is called front-end face detection; its speed and accuracy are lower than those of face detection on a cloud server.
8. Excitation function (activation function): each node in a neural network receives an input value and passes it to the next layer; input nodes pass the input attribute values directly to the next layer (hidden or output layer). In a neural network there is a functional relationship between the inputs and outputs of hidden-layer and output-layer nodes; this function is called the excitation function. Common excitation functions include linear, threshold (step), sigmoid, hyperbolic tangent and Gaussian excitation functions.
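Two of the excitation functions named above can be written out directly; these are textbook definitions, not code from the patent.

```python
import math

def sigmoid(x):
    """Sigmoid excitation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def step(x, threshold=0.0):
    """Threshold (step) excitation: outputs 1 above the threshold, else 0."""
    return 1 if x > threshold else 0

print(sigmoid(0.0), step(2.5))  # 0.5 1
```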
In this application, as shown in fig. 1, the network structure specifically includes:
1. network architecture
1) Characteristics of the network. The network uses downsampling all the way to the fully connected layer: the feature map of each small module's result is downsampled and connected directly to the fully connected layer, so the features extracted by every small module are used effectively. In each convolution, the result of the preceding convolution is cross-spliced into the input, which reinforces local features and prevents them from vanishing. The results of the preceding modules are spliced into the input of each module, thereby avoiding vanishing gradients.
2) In the specific design of each convolution layer, the number of output feature maps is designed independently, which effectively reduces the amount of computation and maximizes the efficiency of convolutional feature extraction. This is applied once to the feature map of each small module.
As shown in fig. 1, the input image data is first equalized: 128 is subtracted from the data to generate data1, and data1 is then divided by 128 to generate data2. Next, data1 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 64; the generated 96 × 96 × 64 feature map is named conv0. A first feature-map length-and-width reduction processing is then performed, comprising:
[1.1] conv0 is convolved with a 3 × 3 kernel and a step size of 2, generating a feature map of depth 32; the generated 48 × 48 × 32 feature map is named conv1.
[1.2] conv0 is downsampled with a kernel size of 2 × 2 and a step size of 2; the depth is unchanged (64), and the resulting 48 × 48 × 64 feature map is named pool1.
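The shape bookkeeping of this first reduction can be sketched as follows (shapes as (H, W, C)). This is a hedged sketch: it assumes the stride-2 operations exactly halve the 96 × 96 input and that the downsampling branch keeps conv0's depth of 64, consistent with the 48 × 48 × 64 map stated for pool1.

```python
def halve(shape, out_ch=None):
    """Halve the spatial size of an (H, W, C) shape; optionally change depth."""
    h, w, c = shape
    return (h // 2, w // 2, c if out_ch is None else out_ch)

conv0 = (96, 96, 64)             # 3x3 conv, stride 1, on the input image
conv1 = halve(conv0, out_ch=32)  # 3x3 conv, stride 2 -> (48, 48, 32)
pool1 = halve(conv0)             # 2x2 downsample, stride 2 -> (48, 48, 64)
print(conv1, pool1)
```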
Layer-by-layer convolution processing, excitation-function processing and downsampling processing then form a first module, a second module, a third module and a fourth module, with a second feature-map length-and-width reduction processing performed between modules. The second reduction processing further comprises:
the input data being in_data1, in_data2, in_data3 and a depth N;
[2.1] in_data1 is convolved with a 3 × 3 kernel and a step size of 2, generating a feature map of depth N, i.e. W × H × N;
[2.2] in_data2 is downsampled with a kernel size of 2 × 2 and a step size of 2; the generated feature map keeps the depth of the input, denoted M, giving W × H × M;
[2.3] in_data3 is downsampled with a kernel size of 2 × 2 and a step size of 2, likewise generating a W × H × M feature map, where M equals the input depth.
The last module of the network, the fourth module, is finally fully connected to 192 values.
Further, the specific setting method referred to in the present application includes the following as shown in fig. 2-1 to 2-22:
2. network module
1) The convolution used by each module is convolution with a 3 × 3 kernel, followed by normalization and then an excitation function. This is hereinafter referred to simply as convolution processing.
2) The input image data is first equalized: 128 is subtracted to generate data1, and data1 is then divided by 128 to generate data2;
3) data1 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 64; the generated 96 × 96 × 64 feature map is named conv0;
4) The first feature-map length-and-width reduction processing.
[1] conv0 is convolved with a 3 × 3 kernel and a step size of 2, generating a feature map of depth 32; the generated 48 × 48 × 32 feature map is named conv1.
[2] conv0 is downsampled with a kernel size of 2 × 2 and a step size of 2; the depth is unchanged (64), and the resulting 48 × 48 × 64 feature map is named pool1.
5) A first module.
[1] pool1 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 32; the generated 48 × 48 × 32 feature map is named block1_conv1.
[2] block1_conv1 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 64; the generated 48 × 48 × 64 feature map is named block1_conv2.
[3] block1_conv2 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 64; the generated 48 × 48 × 64 feature map is named block1_conv3.
[4] block1_conv3 is spliced with conv1, generating a result named block1_concat1.
6) The second feature-map length-and-width reduction processing.
The input data are in_data1, in_data2, in_data3 and a depth N.
[1] in_data1 is convolved with a 3 × 3 kernel and a step size of 2, generating a feature map of depth N; the generated W × H × N feature map is named out_conv1.
[2] in_data2 is downsampled with a kernel size of 2 × 2 and a step size of 2; the generated feature map keeps the depth of the input, here denoted M. The resulting W × H × M feature map is named out_pool1.
[3] in_data3 is downsampled with a kernel size of 2 × 2 and a step size of 2; the generated feature map likewise keeps the input depth M. The resulting W × H × M feature map is named out_pool2.
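This three-branch reduction can be sketched at the shape level. The helper below is illustrative; the example call mirrors the second module's use of it, assuming block1_concat1 is 48 × 48 × 96 (block1_conv3's depth 64 plus conv1's depth 32 after splicing).

```python
def reduce_step(in_data1, in_data2, in_data3, n):
    """Shape-level sketch of the second length-and-width reduction:
    one stride-2 3x3 convolution branch and two stride-2 downsampling
    branches, each halving height and width."""
    h1, w1, _ = in_data1
    out_conv1 = (h1 // 2, w1 // 2, n)                          # depth becomes N
    out_pool1 = (in_data2[0] // 2, in_data2[1] // 2, in_data2[2])  # depth kept
    out_pool2 = (in_data3[0] // 2, in_data3[1] // 2, in_data3[2])  # depth kept
    return out_conv1, out_pool1, out_pool2

c, p1, p2 = reduce_step((48, 48, 96), (48, 48, 96), (48, 48, 96), 48)
print(c, p1, p2)  # (24, 24, 48) (24, 24, 96) (24, 24, 96)
```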
7) A second module.
[1] The second feature-map length-and-width reduction processing of 6) is applied, where in_data1 is block1_concat1, in_data2 is block1_concat1, and N is 48. The outputs are named block2_conv0 = out_conv1 and block2_pool1 = out_pool1.
[2] block2_conv0 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 64; the generated 24 × 24 × 64 feature map is named block2_conv1.
[3] block2_conv1 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 48; the generated 24 × 24 × 48 feature map is named block2_conv2.
[4] block2_conv2 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 48; the generated 24 × 24 × 48 feature map is named block2_conv3.
[5] block2_conv3 is spliced with block2_conv0, generating a result named block2_concat1.
[6] block2_conv2 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 64; the generated 24 × 24 × 64 feature map is named block2_conv4.
8) A third module.
A first part:
[1] The second feature-map length-and-width reduction processing of 6) is applied, where in_data1 is block2_concat1, in_data2 is block2_conv4, in_data3 is block2_concat1, and N is 48. The outputs are named block31_conv0 = out_conv1, block31_pool1 = out_pool1 and block31_pool2 = out_pool2.
[2] block31_conv0 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 128; the generated 12 × 12 × 128 feature map is named block31_conv1.
[3] block31_conv1 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 80; the generated 12 × 12 × 80 feature map is named block31_conv2.
[4] block31_conv2 is spliced with block31_pool1, generating a result named block31_concat1.
[5] block31_concat1 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 64; the generated 12 × 12 × 64 feature map is named block31_conv3.
[6] block31_conv3 is spliced with block31_pool2, generating a result named block31_concat2.
[7] block31_concat1 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 96; the generated 12 × 12 × 96 feature map is named block32_conv1.
A second part:
[1] block31_concat2 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 48; the generated 12 × 12 × 48 feature map is named block32_conv4.
[2] block32_conv1 is spliced with block32_conv4, generating a result named block32_concat1.
[3] block32_concat1 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 96; the generated 12 × 12 × 96 feature map is named block32_conv2.
[4] block32_conv2 is spliced with block32_conv4, generating a result named block32_concat2.
[5] block32_concat2 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 96; the generated 12 × 12 × 96 feature map is named block32_conv3.
[6] block32_conv3 is spliced with block31_concat2, generating a result named block32_concat3.
A third part:
[1] block32_concat3 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 48; the generated 12 × 12 × 48 feature map is named block33_conv1.
[2] block33_conv1 is spliced with block32_conv3, generating a result named block33_concat1.
[3] block33_concat1 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 96; the generated 12 × 12 × 96 feature map is named block33_conv2.
[4] block33_conv2 is spliced with block33_conv1, generating a result named block33_concat2.
[5] block33_concat2 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 96; the generated 12 × 12 × 96 feature map is named block33_conv3.
[6] block33_conv3 is spliced with block32_concat3, generating a result named block33_concat3.
9) A fourth module.
[1] The second feature-map length-and-width reduction processing of 6) is applied, where in_data1 is block33_concat3, in_data2 is block33_concat3, in_data3 is block33_concat3, and N is 64. The outputs are named block4_conv0 = out_conv1, block4_pool1 = out_pool1 and block4_pool2 = out_pool2.
[2] block4_conv0 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 160; the generated 6 × 6 × 160 feature map is named block4_conv1.
[3] block4_conv1 is spliced with block4_pool1, generating a result named block4_concat1.
[4] block4_concat1 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 128; the generated 6 × 6 × 128 feature map is named block4_conv2.
[5] block4_conv2 is spliced with block4_conv0, generating a result named block4_concat2.
[6] block4_concat2 is convolved with a 3 × 3 kernel and a step size of 1, generating a feature map of depth 192; the generated 6 × 6 × 192 feature map is named block4_conv3.
[7] block4_conv3 is spliced with block4_pool2, generating a result named block4_concat3.
10) Full connection
block4_conv3 is fully connected to 192 values.
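The embodiment above can be summarized as a hedged end-to-end trace of the spatial resolution: each reduction halves the feature map until the final 6 × 6 maps feed the fully connected layer. The helper is illustrative only.

```python
def resolution_trace(start=96, halvings=4):
    """Spatial side length after each of the stride-2 reductions."""
    sizes = [start]
    for _ in range(halvings):
        sizes.append(sizes[-1] // 2)
    return sizes

print(resolution_trace())  # [96, 48, 24, 12, 6]

# block4_conv3 is 6 x 6 x 192; flattened, it feeds the 192-value output.
fc_inputs = 6 * 6 * 192    # 6912 inputs into the fully connected layer
print(fc_inputs)
```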
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its protection scope.
Claims (9)
1. A face recognition network structure suitable for a front end, characterized in that the network adopts downsampling all the way to a fully connected layer; network modules are arranged in the network, and the feature map produced by each module is downsampled and connected directly to the fully connected layer; in each convolution processing step, the result of the preceding convolution is cross-spliced into the input, and the result of each preceding module is spliced into the input of each module.
2. The front-end-adapted face recognition network structure of claim 1, wherein the convolution processing used by each module is convolution with a 3 × 3 kernel, followed by normalization and then an excitation function.
3. The front-end-adapted face recognition network structure of claim 1, wherein the number of output feature maps is designed independently for each convolution layer.
4. The front-end-adapted face recognition network structure of claim 1, wherein the downsampling processing is downsampling with a kernel size of 2 × 2 and a step size of 2.
5. The front-end-adapted face recognition network structure of claim 1, further comprising first equalizing the input image data; and performing convolution with a 3 × 3 kernel and a step size of 1 on the data to generate a feature map.
6. The front-end-adapted face recognition network structure of claim 1, wherein said network module further comprises subjecting the feature map to a first feature-map length-and-width reduction processing, which includes:
[1.1] performing convolution with a 3 × 3 kernel and a step size of 2 on the feature map generated from the image data, producing a feature map of depth 32, i.e. a 48 × 48 × 32 feature map;
[1.2] downsampling the convolved feature map of the image data with a kernel size of 2 × 2 and a step size of 2; the depth is unchanged (64), giving a 48 × 48 × 64 feature map.
7. The front-end-adapted face recognition network structure of claim 1, wherein said network module further comprises performing a second feature-map length-and-width reduction processing, which further comprises:
the input data being in_data1, in_data2, in_data3 and a depth N;
[2.1] performing convolution with a 3 × 3 kernel and a step size of 2 on in_data1, generating a feature map of depth N, i.e. W × H × N;
[2.2] downsampling in_data2 with a kernel size of 2 × 2 and a step size of 2; the generated feature map keeps the depth of the input feature map, denoted M, giving W × H × M;
[2.3] downsampling in_data3 with a kernel size of 2 × 2 and a step size of 2, likewise generating a W × H × M feature map, where M equals the input depth.
8. The front-end-adapted face recognition network structure of claim 1, wherein said network modules comprise a first module, a second module, a third module and a fourth module.
9. The front-end-adapted face recognition network structure of claim 8, wherein the last module of the network is finally fully connected to 192 values.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910830546.2A CN112446266B (en) | 2019-09-04 | 2019-09-04 | Face recognition network structure suitable for front end |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910830546.2A CN112446266B (en) | 2019-09-04 | 2019-09-04 | Face recognition network structure suitable for front end |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112446266A true CN112446266A (en) | 2021-03-05 |
CN112446266B CN112446266B (en) | 2024-03-29 |
Family
ID=74734921
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910830546.2A Active CN112446266B (en) | 2019-09-04 | 2019-09-04 | Face recognition network structure suitable for front end |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112446266B (en) |
Citations (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105975931A (en) * | 2016-05-04 | 2016-09-28 | 浙江大学 | Convolutional neural network face recognition method based on multi-scale pooling |
CN106570521A (en) * | 2016-10-24 | 2017-04-19 | 中国科学院自动化研究所 | Multi-language scene character recognition method and recognition system |
CN106782602A (en) * | 2016-12-01 | 2017-05-31 | 南京邮电大学 | Speech emotion recognition method based on long short-term memory networks and convolutional neural networks |
WO2017133009A1 (en) * | 2016-02-04 | 2017-08-10 | 广州新节奏智能科技有限公司 | Method for positioning human joints using depth images and a convolutional neural network |
CN107154037A (en) * | 2017-04-20 | 2017-09-12 | 西安交通大学 | Fan blade fault recognition method based on deep hierarchical feature extraction |
CN107622261A (en) * | 2017-11-03 | 2018-01-23 | 北方工业大学 | Face age estimation method and device based on deep learning |
CN107679487A (en) * | 2017-09-29 | 2018-02-09 | 中国科学院福建物质结构研究所 | Missing person identification method and system |
WO2018028255A1 (en) * | 2016-08-11 | 2018-02-15 | 深圳市未来媒体技术研究院 | Image saliency detection method based on adversarial network |
KR101845769B1 (en) * | 2017-03-21 | 2018-04-06 | 경남대학교 산학협력단 | Car rear detection system using convolutional neural network, and method thereof |
CN107977456A (en) * | 2017-12-15 | 2018-05-01 | 清华大学 | Multi-source big data analysis method based on multi-task deep networks |
CN108319962A (en) * | 2018-01-29 | 2018-07-24 | 安徽大学 | Tool wear monitoring method based on convolutional neural networks |
CN108764317A (en) * | 2018-05-21 | 2018-11-06 | 浙江工业大学 | Residual convolutional neural network image classification method based on multi-channel feature weighting |
WO2018214195A1 (en) * | 2017-05-25 | 2018-11-29 | 中国矿业大学 | Remote sensing imaging bridge detection method based on convolutional neural network |
CN109376625A (en) * | 2018-10-10 | 2019-02-22 | 东北大学 | Facial expression recognition method based on convolutional neural networks |
CN109522807A (en) * | 2018-10-22 | 2019-03-26 | 深圳先进技术研究院 | Satellite image recognition system, method and electronic device based on self-generated features |
US10262214B1 (en) * | 2018-09-05 | 2019-04-16 | StradVision, Inc. | Learning method, learning device for detecting lane by using CNN and testing method, testing device using the same |
US10304193B1 (en) * | 2018-08-17 | 2019-05-28 | 12 Sigma Technologies | Image segmentation and object detection using fully convolutional neural network |
CN109886341A (en) * | 2019-02-25 | 2019-06-14 | 厦门美图之家科技有限公司 | Training method for generating a face detection model |
CN110096964A (en) * | 2019-04-08 | 2019-08-06 | 厦门美图之家科技有限公司 | Method of generating an image recognition model |
CN110097003A (en) * | 2019-04-29 | 2019-08-06 | 中南民族大学 | Neural-network-based class attendance checking method, device, storage medium and apparatus |
Non-Patent Citations (1)
Title |
---|
张雨丰; 郑忠龙; 刘华文; 向道红; 何小卫; 李知菲; 何依然; KHODJA ABD ERRAOUF: "Lightweight Convolutional Neural Network Based on Feature Map Splitting", Pattern Recognition and Artificial Intelligence, no. 03, pages 237 - 246 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108717569B (en) | Dilated fully convolutional neural network device and construction method thereof | |
CN109509192B (en) | Semantic segmentation network integrating multi-scale feature space and semantic space | |
US20190311223A1 (en) | Image processing methods and apparatus, and electronic devices | |
CN107564025B (en) | Electric power equipment infrared image semantic segmentation method based on deep neural network | |
CN110414670B (en) | Image splicing tampering positioning method based on full convolution neural network | |
CN109543502B (en) | Semantic segmentation method based on deep multi-scale neural network | |
CN110287969B (en) | Moiré text image binarization system based on graph residual attention network | |
CN112750140B (en) | Information mining-based disguised target image segmentation method | |
CN108345827B (en) | Method, system and neural network for identifying document direction | |
CN106845529A (en) | Image feature recognition method based on multi-view convolutional neural networks | |
CN108985252B (en) | Improved image classification method of pulse depth neural network | |
CN109118504B (en) | Image edge detection method, device and equipment based on neural network | |
CN111340814A (en) | RGB-D image semantic segmentation method based on multi-modal adaptive convolution | |
CN111666948B (en) | Real-time high-performance semantic segmentation method and device based on multipath aggregation | |
JP2020119524A (en) | Learning method and learning device for extracting feature from input image in multiple blocks in cnn, so that hardware optimization which can satisfies core performance index can be performed, and testing method and testing device using the same | |
CN116863194A (en) | Foot ulcer image classification method, system, equipment and medium | |
Jung et al. | Extension of convolutional neural network with general image processing kernels | |
CN112132867B (en) | Remote sensing image change detection method and device | |
CN108921017A (en) | Face detection method and system | |
Kumar et al. | FPGA implementation of image segmentation by using edge detection based on Sobel edge operator | |
CN112446266A (en) | Face recognition network structure suitable for front end | |
Kaur et al. | Deep transfer learning based multiway feature pyramid network for object detection in images | |
CN112446267A (en) | Setting method of face recognition network suitable for front end | |
CN114049518A (en) | Image classification method and device, electronic equipment and storage medium | |
CN106469267A (en) | Verification code sample collection method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||