CN112446266B - Face recognition network structure suitable for front end - Google Patents

Face recognition network structure suitable for front end

Info

Publication number
CN112446266B
CN112446266B (application CN201910830546.2A)
Authority
CN
China
Prior art keywords
feature map
convolution
processing
module
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910830546.2A
Other languages
Chinese (zh)
Other versions
CN112446266A (en)
Inventor
田凤彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ingenic Semiconductor Co Ltd
Original Assignee
Beijing Ingenic Semiconductor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ingenic Semiconductor Co Ltd
Priority to CN201910830546.2A
Publication of CN112446266A
Application granted
Publication of CN112446266B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a face recognition network structure suitable for the front end. The network uses the idea of downsampling all the way to the fully connected layer: the feature map produced by each small module is downsampled and connected directly to the fully connected layer, so the features extracted by every small module are used effectively. In each convolution, the result of the preceding convolution is reused, which effectively avoids over-reinforcing local features and prevents features from vanishing. The results of the preceding modules are concatenated into the input of each module, which avoids vanishing gradients.

Description

Face recognition network structure suitable for front end
Technical Field
The invention relates to the technical field of face image recognition, and in particular to a face recognition network structure suitable for the front end.
Background
With the continuous development of technology, and in particular of computer vision, face recognition has been widely applied in fields such as information security and electronic authentication, and image feature extraction methods show good recognition performance. Face recognition refers to the technique of recognizing one or more faces in a static or dynamic scene, based on a known library of face samples, using image processing and/or pattern recognition techniques. Current face recognition technology comprises: 1. various networks applied to face recognition, such as Inception, ResNet, VGG, MobileNet, AlexNet, GoogLeNet and Inception-ResNet; 2. framework structures for face recognition training, such as FaceNet and InsightFace. The existing face recognition technology has the following defects: 1. current front-end networks use independent convolutions, feature map addition, convolution kernels of various sizes, and the like, which many AI chips cannot implement, so such networks cannot be realized on those chips. 2. The amount of computation is relatively large. 3. The recognition rate is relatively low.
Disclosure of Invention
In order to solve the above problems in the prior art, the objects of the present invention are as follows:
1. Few operators are used: only 3×3 convolution kernels, together with a feature map concatenation (concat) operation, a downsampling operation, and a fully connected operation.
2. The amount of computation is small, fewer than 1G multiplications, and the model size (with floating-point parameters) is only 29M.
3. The face recognition rate is raised to 99.8%.
The invention provides a face recognition network structure suitable for the front end. The network uses downsampling to reach a fully connected layer; network modules are arranged in the network, and the feature map of each module's result is downsampled and then connected directly to the fully connected layer. In each convolution process, the result of the preceding convolution process is concatenated in by cross-splicing, and the result of each preceding module is concatenated into the input of each module.
The convolution processing is as follows: the convolution used by each module is 3×3 convolution-kernel processing, with normalization performed after the convolution, followed by excitation-function processing.
In the design of each convolution layer, the number of output feature maps is designed independently.
The downsampling uses a processing kernel of size 2×2 with a stride of 2.
The method further comprises first equalizing the input image data, then convolving the data with a 3×3 kernel and a stride of 1 to generate a feature map.
The network module further undergoes a first feature-map length-and-width reduction, which comprises:
[1.1] the feature map generated by convolving the image data is convolved with a 3×3 kernel and a stride of 2; the generated feature map has depth 32, and the generated feature map is 48×48×32;
[1.2] the feature map used in [1.1] is downsampled with a 2×2 processing kernel and a stride of 2; the depth of the input (64) is preserved, and the generated feature map is 48×48×64.
The network module further undergoes a second feature-map length-and-width reduction, which comprises:
the inputs are in_data1, in_data2, in_data3 and a target depth N;
[2.1] in_data1 is convolved with a 3×3 kernel and a stride of 2, generating a feature map of depth N; the generated feature map is W×H×N;
[2.2] in_data2 is downsampled with a 2×2 processing kernel and a stride of 2; the generated feature map keeps the depth of the input, denoted M, and is W×H×M;
[2.3] in_data3 is downsampled with a 2×2 processing kernel and a stride of 2; the generated feature map keeps the depth of the input, denoted M, and is W×H×M.
The network modules comprise: a first module, a second module, a third module and a fourth module.
The last module of the network is finally fully connected to 192 outputs.
The application has the following advantages:
1. The idea of downsampling to the fully connected layer is used in the network: the feature map of each small module's result is downsampled and connected directly to the fully connected layer, so the features extracted by every small module are used effectively. In each convolution, the result of the preceding convolution is reused, which effectively avoids over-reinforcing local features and prevents features from vanishing. The results of the preceding modules are concatenated into the input of each module, which avoids vanishing gradients.
2. Independently designing the number of output feature maps effectively reduces the amount of computation and maximizes the efficiency with which convolution extracts features.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain it.
Fig. 1 is a schematic block diagram of the network architecture of the present invention.
Fig. 2-1 through 2-22 are flow diagrams of embodiments of the present invention.
Detailed Description
In the field of face recognition, some related technical terms currently include:
1. Deep learning: the concept of deep learning is derived from the study of artificial neural networks. A multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning forms more abstract high-level representations of attribute categories or features by combining low-level features, in order to discover distributed feature representations of data.
2. Convolution kernel: a convolution kernel is a matrix used in image processing, a parameter that operates on the original image. The convolution kernel is typically a small matrix of rows and columns (e.g., a 3×3 matrix), with a weight value for each cell of the region it covers. Typical kernel shapes are 1×1, 3×3, 5×5, 7×7, 1×3, 3×1, 2×2, 1×5, 5×1, and so on.
3. Convolution: the center of the convolution kernel is placed over the pixel to be calculated; the products of each element in the kernel and the image pixel value it covers are computed and summed, and the sum becomes the new pixel value for that location. This process is called convolution.
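As an illustration only (not part of the patent), this definition can be sketched in a few lines of NumPy; the 5×5 input and the mean-filter kernel are arbitrary choices for the example:

```python
import numpy as np

# Illustrative sketch of the convolution just defined: slide the kernel over
# the image, multiply element-wise with the covered neighborhood, and sum.
# Border pixels the kernel cannot fully cover are skipped (no padding).
def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # arbitrary 5x5 input
kernel = np.full((3, 3), 1.0 / 9.0)                # arbitrary 3x3 mean filter
print(convolve2d(image, kernel))                   # 3x3 output
```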
4. Feature map: the result obtained by convolving input data is called a feature map, and the result generated by a fully connected layer is also called a feature map. Feature map size is generally expressed as length × width × depth, or 1 × depth.
5. Step size (stride): the distance, in coordinates, by which the center of the convolution kernel moves at each step.
6: and (3) performing two-end misalignment treatment: processing an image or data with a convolution kernel size of 3 and a step size of 2 may result in insufficient data on both sides, where discarding data on both sides or on one side is used, a phenomenon called both sides not processing it.
7. Front-end face detection: a face detection function that runs on an AI chip or an ordinary chip is called front-end face detection; its speed and accuracy are lower than those of a cloud server.
8. Excitation function: each node in a neural network accepts an input value and passes it to the next layer; input nodes pass the input attribute values directly to the next layer (a hidden or output layer). In a neural network there is a functional relationship between the inputs and outputs of hidden-layer and output-layer nodes; this function is called the excitation function. Common excitation functions include: linear, threshold or step, sigmoid, hyperbolic tangent, and Gaussian excitation functions.
In this application, as shown in fig. 1, the network structure is specifically set up as follows:
1. Network structure
1) Network characteristics. The idea of downsampling to the fully connected layer is used in the network: the feature map of each small module's result is downsampled and connected directly to the fully connected layer, so the features extracted by every small module are used effectively. In each convolution, the result of the preceding convolution is reused, which effectively avoids over-reinforcing local features and prevents features from vanishing. The results of the preceding modules are concatenated into the input of each module, which avoids vanishing gradients.
2) In the specific design of each convolution layer, independently designing the number of output feature maps effectively reduces the amount of computation and maximizes the efficiency with which convolution extracts features. The feature map of each small module's result is mapped once.
As shown in fig. 1, the input image data is first equalized: 128 is subtracted to generate data1, and data1 is divided by 128 to generate data2. Then data1 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 64 and size 96×96×64, and is named conv0. A first feature-map length-and-width reduction follows:
[1.1] conv0 is convolved with a 3×3 kernel and a stride of 2; the generated feature map has depth 32. The generated feature map is 48×48×32 and is named conv1.
[1.2] conv0 is downsampled with a 2×2 processing kernel and a stride of 2; the depth of the input (64) is preserved. The generated feature map is 48×48×64 and is named pool1.
Layer-by-layer convolution processing, excitation-function processing and downsampling processing then produce the first module, the second module, the third module and the fourth module, each of which undergoes the second feature-map length-and-width reduction. The second feature-map length-and-width reduction comprises:
the inputs are in_data1, in_data2, in_data3 and a target depth N;
[2.1] in_data1 is convolved with a 3×3 kernel and a stride of 2, generating a feature map of depth N; the generated feature map is W×H×N;
[2.2] in_data2 is downsampled with a 2×2 processing kernel and a stride of 2; the generated feature map keeps the depth of the input, denoted M, and is W×H×M;
[2.3] in_data3 is downsampled with a 2×2 processing kernel and a stride of 2; the generated feature map keeps the depth of the input, denoted M, and is W×H×M.
The last module of the network, the fourth module, is finally fully connected to 192 outputs.
Further, the specific setting method of the present application, as shown in figs. 2-1 to 2-22, includes the following:
2. Network module
1) The convolution used by each module is 3×3 convolution-kernel processing, with normalization performed after the convolution, followed by excitation-function processing. This combination is hereinafter referred to simply as convolution processing.
2) The input image data is equalized: 128 is first subtracted to generate data1, and data1 is then divided by 128 to generate data2;
3) data1 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 64. The generated feature map is 96×96×64 and is named conv0;
4) First-type feature-map length-and-width reduction.
[1] conv0 is convolved with a 3×3 kernel and a stride of 2; the generated feature map has depth 32. The generated feature map is 48×48×32 and is named conv1.
[2] conv0 is downsampled with a 2×2 processing kernel and a stride of 2; the depth of the input (64) is preserved. The generated feature map is 48×48×64 and is named pool1.
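The patent text does not fix the normalization function, the excitation function, the downsampling operator, the input channel count or the padding. As a minimal PyTorch sketch only, assuming BatchNorm, ReLU, 2×2 max pooling, a 3-channel 96×96 input and padding 1 (which reproduces the stated 96×96 and 48×48 sizes), steps 2) to 4) can be written as:

```python
import torch
import torch.nn as nn

def conv_unit(cin: int, cout: int, stride: int = 1) -> nn.Sequential:
    # "Convolution processing" of step 1): 3x3 convolution, then normalization,
    # then the excitation function (BatchNorm and ReLU are assumed here).
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

image = torch.randint(0, 256, (1, 3, 96, 96)).float()
data1 = image - 128.0   # step 2): subtract 128
data2 = data1 / 128.0   # step 2): divide by 128 (data2 is not referenced further in the text)

conv0 = conv_unit(3, 64)(data1)                       # step 3): 1 x 64 x 96 x 96
conv1 = conv_unit(64, 32, stride=2)(conv0)            # step 4)[1]: 1 x 32 x 48 x 48
pool1 = nn.MaxPool2d(kernel_size=2, stride=2)(conv0)  # step 4)[2]: 1 x 64 x 48 x 48
print(conv0.shape, conv1.shape, pool1.shape)
```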
5) The first module.
[1] pool1 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 32. The generated feature map is 48×48×32 and is named block1_conv1.
[2] block1_conv1 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 64. The generated feature map is 48×48×64 and is named block1_conv2.
[3] block1_conv2 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 64. The generated feature map is 48×48×64 and is named block1_conv3.
[4] block1_conv3 and conv1 are concatenated; the result is named block1_concat1.
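Under the same assumptions, a sketch of the first module; the stand-in tensors take the places of pool1 and conv1 from the previous sketch:

```python
import torch
import torch.nn as nn

def conv_unit(cin, cout, stride=1):   # as in the previous sketch
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride, 1),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

pool1 = torch.randn(1, 64, 48, 48)   # stand-in for pool1
conv1 = torch.randn(1, 32, 48, 48)   # stand-in for conv1

block1_conv1 = conv_unit(64, 32)(pool1)           # 48 x 48 x 32
block1_conv2 = conv_unit(32, 64)(block1_conv1)    # 48 x 48 x 64
block1_conv3 = conv_unit(64, 64)(block1_conv2)    # 48 x 48 x 64
block1_concat1 = torch.cat([block1_conv3, conv1], dim=1)  # 48 x 48 x 96
print(block1_concat1.shape)
```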
6) Second-type feature-map length-and-width reduction.
The inputs are in_data1, in_data2, in_data3 and a target depth N.
[1] in_data1 is convolved with a 3×3 kernel and a stride of 2, generating a feature map of depth N. The generated feature map is W×H×N and is named out_conv1.
[2] in_data2 is downsampled with a 2×2 processing kernel and a stride of 2; the generated feature map keeps the depth of the input, denoted here by M. The generated feature map is W×H×M and is named out_pool1.
[3] in_data3 is downsampled with a 2×2 processing kernel and a stride of 2; the generated feature map keeps the depth of the input, denoted here by M. The generated feature map is W×H×M and is named out_pool2.
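As an illustrative helper only (max pooling again assumed for the downsampling), this reduction can be sketched as a reusable function:

```python
import torch
import torch.nn as nn

def conv_unit(cin, cout, stride=1):   # as in the earlier sketches
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride, 1),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def reduce2(in_data1, in_data2, in_data3, n):
    # [1] stride-2 convolution branch of depth n; [2]/[3] 2x2 stride-2
    # downsampling branches that keep their input depth.
    out_conv1 = conv_unit(in_data1.shape[1], n, stride=2)(in_data1)
    pool = nn.MaxPool2d(2, 2)
    out_pool1 = pool(in_data2)
    out_pool2 = pool(in_data3) if in_data3 is not None else None
    return out_conv1, out_pool1, out_pool2

x = torch.randn(1, 96, 48, 48)
c, p1, p2 = reduce2(x, x, x, 48)
print(c.shape, p1.shape, p2.shape)   # all halved to 24 x 24
```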
7) The second module.
[1] The second-type feature-map length-and-width reduction of 6) is applied, where in_data1 = block1_concat1, in_data2 = block1_concat1, and N = 48. The outputs are named block2_conv0 = out_conv1 and block2_pool1 = out_pool1.
[2] block2_conv0 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 64. The generated feature map is 24×24×64 and is named block2_conv1.
[3] block2_conv1 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 48. The generated feature map is 24×24×48 and is named block2_conv2.
[4] block2_conv2 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 48. The generated feature map is 24×24×48 and is named block2_conv3.
[5] block2_conv3 and block2_conv0 are concatenated; the result is named block2_concat1.
[6] block2_conv2 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 64. The generated feature map is 24×24×64 and is named block2_conv4.
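A sketch of the second module under the same assumptions; conv_unit and reduce2 are repeated from the sketches above so the snippet runs on its own:

```python
import torch
import torch.nn as nn

def conv_unit(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride, 1),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def reduce2(in_data1, in_data2, in_data3, n):
    out_conv1 = conv_unit(in_data1.shape[1], n, stride=2)(in_data1)
    pool = nn.MaxPool2d(2, 2)
    return (out_conv1, pool(in_data2),
            pool(in_data3) if in_data3 is not None else None)

block1_concat1 = torch.randn(1, 96, 48, 48)   # stand-in for the first module's output
block2_conv0, block2_pool1, _ = reduce2(block1_concat1, block1_concat1, None, 48)

block2_conv1 = conv_unit(48, 64)(block2_conv0)    # 24 x 24 x 64
block2_conv2 = conv_unit(64, 48)(block2_conv1)    # 24 x 24 x 48
block2_conv3 = conv_unit(48, 48)(block2_conv2)    # 24 x 24 x 48
block2_concat1 = torch.cat([block2_conv3, block2_conv0], dim=1)  # 24 x 24 x 96
block2_conv4 = conv_unit(48, 64)(block2_conv2)    # step [6]: 24 x 24 x 64
print(block2_concat1.shape, block2_conv4.shape)
```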
8) The third module.
First part:
[1] The second-type feature-map length-and-width reduction of 6) is applied, where in_data1 = block2_concat1, in_data2 = block2_conv4, in_data3 = block2_concat1, and N = 48. The outputs are named block31_conv0 = out_conv1, block31_pool1 = out_pool1, and block31_pool2 = out_pool2.
[2] block31_conv0 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 128. The generated feature map is 12×12×128 and is named block31_conv1.
[3] block31_conv1 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 80. The generated feature map is 12×12×80 and is named block31_conv2.
[4] block31_conv2 and block31_pool1 are concatenated; the result is named block31_concat1.
[5] block31_concat1 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 64. The generated feature map is 12×12×64 and is named block31_conv3.
[6] block31_conv3 and block31_pool2 are concatenated; the result is named block31_concat2.
[7] block31_concat1 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 96. The generated feature map is 12×12×96 and is named block32_conv1.
Second part:
[1] block31_concat2 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 48. The generated feature map is 12×12×48 and is named block32_conv4.
[2] block32_conv1 and block32_conv4 are concatenated; the result is named block32_concat1.
[3] block32_concat1 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 96. The generated feature map is 12×12×96 and is named block32_conv2.
[4] block32_conv2 and block32_conv4 are concatenated; the result is named block32_concat2.
[5] block32_concat2 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 96. The generated feature map is 12×12×96 and is named block32_conv3.
[6] block32_conv3 and block31_concat2 are concatenated; the result is named block32_concat3.
Third part:
[1] block32_concat3 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 48. The generated feature map is 12×12×48 and is named block33_conv1.
[2] block33_conv1 and block32_conv3 are concatenated; the result is named block33_concat1.
[3] block33_concat1 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 96. The generated feature map is 12×12×96 and is named block33_conv2.
[4] block33_conv2 and block33_conv1 are concatenated; the result is named block33_concat2.
[5] block33_concat2 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 96. The generated feature map is 12×12×96 and is named block33_conv3.
[6] block33_conv3 and block32_concat3 are concatenated; the result is named block33_concat3.
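A sketch of the first part of this module under the same assumptions (the second and third parts repeat the same convolve-then-concatenate pattern with the depths listed above); the stand-in channel counts follow from the module-2 arithmetic:

```python
import torch
import torch.nn as nn

def conv_unit(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride, 1),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def reduce2(in_data1, in_data2, in_data3, n):
    out_conv1 = conv_unit(in_data1.shape[1], n, stride=2)(in_data1)
    pool = nn.MaxPool2d(2, 2)
    return (out_conv1, pool(in_data2),
            pool(in_data3) if in_data3 is not None else None)

block2_concat1 = torch.randn(1, 96, 24, 24)   # stand-ins for module-2 outputs
block2_conv4 = torch.randn(1, 64, 24, 24)

b31_conv0, b31_pool1, b31_pool2 = reduce2(
    block2_concat1, block2_conv4, block2_concat1, 48)   # 12x12 maps: 48, 64, 96 channels

b31_conv1 = conv_unit(48, 128)(b31_conv0)               # 12 x 12 x 128
b31_conv2 = conv_unit(128, 80)(b31_conv1)               # 12 x 12 x 80
b31_concat1 = torch.cat([b31_conv2, b31_pool1], dim=1)  # 12 x 12 x 144
b31_conv3 = conv_unit(144, 64)(b31_concat1)             # 12 x 12 x 64
b31_concat2 = torch.cat([b31_conv3, b31_pool2], dim=1)  # 12 x 12 x 160
b32_conv1 = conv_unit(144, 96)(b31_concat1)             # 12 x 12 x 96
print(b31_concat2.shape, b32_conv1.shape)
```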
9) The fourth module.
[1] The second-type feature-map length-and-width reduction of 6) is applied, where in_data1 = block33_concat3, in_data2 = block33_conv3, in_data3 = block33_concat3, and N = 64. The outputs are named block4_conv0 = out_conv1, block4_pool1 = out_pool1, and block4_pool2 = out_pool2.
[2] block4_conv0 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 160. The generated feature map is 6×6×160 and is named block4_conv1.
[3] block4_conv1 and block4_pool1 are concatenated; the result is named block4_concat1.
[4] block4_concat1 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 128. The generated feature map is 6×6×128 and is named block4_conv2.
[5] block4_conv2 and block4_conv0 are concatenated; the result is named block4_concat2.
[6] block4_concat2 is convolved with a 3×3 kernel and a stride of 1; the generated feature map has depth 192. The generated feature map is 6×6×192 and is named block4_conv3.
[7] block4_conv3 and block4_pool2 are concatenated; the result is named block4_concat3.
10) Full connection.
block4_conv3 is fully connected to 192 outputs.
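A sketch of the fourth module and the final full connection under the same assumptions. The stand-in channel counts (352 and 96) follow from the module-3 arithmetic above; the sketch connects block4_conv3 to the 192 outputs as the text states, though block4_concat3 would otherwise go unused:

```python
import torch
import torch.nn as nn

def conv_unit(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride, 1),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def reduce2(in_data1, in_data2, in_data3, n):
    out_conv1 = conv_unit(in_data1.shape[1], n, stride=2)(in_data1)
    pool = nn.MaxPool2d(2, 2)
    return (out_conv1, pool(in_data2),
            pool(in_data3) if in_data3 is not None else None)

b33_concat3 = torch.randn(1, 352, 12, 12)   # stand-ins for module-3 outputs
b33_conv3 = torch.randn(1, 96, 12, 12)

b4_conv0, b4_pool1, b4_pool2 = reduce2(b33_concat3, b33_conv3, b33_concat3, 64)

b4_conv1 = conv_unit(64, 160)(b4_conv0)               # 6 x 6 x 160
b4_concat1 = torch.cat([b4_conv1, b4_pool1], dim=1)   # 6 x 6 x 256
b4_conv2 = conv_unit(256, 128)(b4_concat1)            # 6 x 6 x 128
b4_concat2 = torch.cat([b4_conv2, b4_conv0], dim=1)   # 6 x 6 x 192
b4_conv3 = conv_unit(192, 192)(b4_concat2)            # 6 x 6 x 192
b4_concat3 = torch.cat([b4_conv3, b4_pool2], dim=1)   # 6 x 6 x 544

# Step 10): the text connects block4_conv3 to 192 outputs, so the sketch
# flattens 6 x 6 x 192 into a 192-dimensional fully connected output.
embedding = nn.Linear(192 * 6 * 6, 192)(b4_conv3.flatten(1))
print(embedding.shape)   # torch.Size([1, 192])
```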
The above description covers only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art can make various modifications and variations to the embodiments of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (1)

1. A face recognition network structure suitable for the front end, characterized in that the network uses downsampling to reach a fully connected layer; network modules are arranged in the network, and the feature map of each module's result is downsampled and then connected directly to the fully connected layer; in each convolution process, the result of the preceding convolution process is concatenated in by cross-splicing, and the result of each preceding module is concatenated into the input of each module;
the convolution processing is as follows: the convolution used by each module is 3×3 convolution-kernel processing, with normalization performed after the convolution, followed by excitation-function processing;
in the design of each convolution layer, the number of output feature maps is designed independently;
the downsampling uses a processing kernel of size 2×2 with a stride of 2;
the input image data is first equalized; the data is then convolved with a 3×3 kernel and a stride of 1 to generate a feature map; a first feature-map length-and-width reduction follows, comprising:
[1.1] the feature map generated by convolving the image data is convolved with a 3×3 kernel and a stride of 2; the generated feature map has depth 32, and the generated feature map is 48×48×32;
[1.2] the feature map used in [1.1] is downsampled with a 2×2 processing kernel and a stride of 2; the depth of the input (64) is preserved, and the generated feature map is 48×48×64;
the network modules comprise a first module, a second module, a third module and a fourth module; the modules undergo a second feature-map length-and-width reduction, which comprises the following steps:
the inputs are in_data1, in_data2, in_data3 and a target depth N;
[2.1] in_data1 is convolved with a 3×3 kernel and a stride of 2, generating a feature map of depth N; the generated feature map is W×H×N;
[2.2] in_data2 is downsampled with a 2×2 processing kernel and a stride of 2; the generated feature map keeps the depth of the input, denoted M, and is W×H×M;
[2.3] in_data3 is downsampled with a 2×2 processing kernel and a stride of 2; the generated feature map keeps the depth of the input, denoted M, and is W×H×M;
the fourth module of the network is finally fully connected to 192 outputs.
CN201910830546.2A 2019-09-04 2019-09-04 Face recognition network structure suitable for front end Active CN112446266B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910830546.2A CN112446266B (en) 2019-09-04 2019-09-04 Face recognition network structure suitable for front end

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910830546.2A CN112446266B (en) 2019-09-04 2019-09-04 Face recognition network structure suitable for front end

Publications (2)

Publication Number Publication Date
CN112446266A (en) 2021-03-05
CN112446266B (en) 2024-03-29

Family

ID=74734921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910830546.2A Active CN112446266B (en) 2019-09-04 2019-09-04 Face recognition network structure suitable for front end

Country Status (1)

Country Link
CN (1) CN112446266B (en)

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975931A (en) * 2016-05-04 2016-09-28 浙江大学 Convolutional neural network face recognition method based on multi-scale pooling
CN106570521A (en) * 2016-10-24 2017-04-19 中国科学院自动化研究所 Multi-language scene character recognition method and recognition system
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
CN107154037A (en) * 2017-04-20 2017-09-12 西安交通大学 Fan blade fault recognition method based on depth level feature extraction
CN107622261A (en) * 2017-11-03 2018-01-23 北方工业大学 Face age estimation method and device based on deep learning
CN107679487A (en) * 2017-09-29 2018-02-09 中国科学院福建物质结构研究所 Missing Persons' discrimination method and system
WO2018028255A1 (en) * 2016-08-11 2018-02-15 深圳市未来媒体技术研究院 Image saliency detection method based on adversarial network
KR101845769B1 (en) * 2017-03-21 2018-04-06 경남대학교 산학협력단 Car rear detection system using convolution neural network, and method thereof
CN107977456A (en) * 2017-12-15 2018-05-01 清华大学 A kind of multi-source big data analysis method based on multitask depth network
CN108319962A (en) * 2018-01-29 2018-07-24 安徽大学 A kind of Tool Wear Monitoring method based on convolutional neural networks
CN108764317A (en) * 2018-05-21 2018-11-06 浙江工业大学 A kind of residual error convolutional neural networks image classification method based on multichannel characteristic weighing
WO2018214195A1 (en) * 2017-05-25 2018-11-29 中国矿业大学 Remote sensing imaging bridge detection method based on convolutional neural network
CN109376625A (en) * 2018-10-10 2019-02-22 东北大学 A kind of human facial expression recognition method based on convolutional neural networks
CN109522807A (en) * 2018-10-22 2019-03-26 深圳先进技术研究院 Satellite image identifying system, method and electronic equipment based on self-generating feature
US10262214B1 (en) * 2018-09-05 2019-04-16 StradVision, Inc. Learning method, learning device for detecting lane by using CNN and testing method, testing device using the same
US10304193B1 (en) * 2018-08-17 2019-05-28 12 Sigma Technologies Image segmentation and object detection using fully convolutional neural network
CN109886341A (en) * 2019-02-25 2019-06-14 厦门美图之家科技有限公司 A kind of trained method for generating Face datection model
CN110097003A (en) * 2019-04-29 2019-08-06 中南民族大学 Check class attendance method, equipment, storage medium and device neural network based
CN110096964A (en) * 2019-04-08 2019-08-06 厦门美图之家科技有限公司 A method of generating image recognition model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lightweight Convolutional Neural Network Based on Feature Map Splitting; 张雨丰; 郑忠龙; 刘华文; 向道红; 何小卫; 李知菲; 何依然; KHODJA Abd Erraouf; Pattern Recognition and Artificial Intelligence (03); pp. 237-246 *

Also Published As

Publication number Publication date
CN112446266A (en) 2021-03-05

Similar Documents

Publication Publication Date Title
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
CN109190752B (en) Image semantic segmentation method based on global features and local features of deep learning
CN107564025B (en) Electric power equipment infrared image semantic segmentation method based on deep neural network
US20190311223A1 (en) Image processing methods and apparatus, and electronic devices
CN112528976B (en) Text detection model generation method and text detection method
CN109118504B (en) Image edge detection method, device and equipment based on neural network
CN107564009B (en) Outdoor scene multi-target segmentation method based on deep convolutional neural network
CN110346808B (en) Point cloud data processing method and system of laser radar
CN111488901B (en) Method and device for extracting features from input images in multiple modules in CNN
CN113468996B (en) Camouflage object detection method based on edge refinement
CN111476247B (en) CNN method and device using 1xK or Kx1 convolution operation
CN112329801B (en) Convolutional neural network non-local information construction method
CN111553869A (en) Method for complementing generated confrontation network image under space-based view angle
CN111815526A (en) Rain image rainstrip removing method and system based on image filtering and CNN
CN113392728B (en) Target detection method based on SSA sharpening attention mechanism
Jung et al. Extension of convolutional neural network with general image processing kernels
CN113256546A (en) Depth map completion method based on color map guidance
CN112446266B (en) Face recognition network structure suitable for front end
CN112446267B (en) Setting method of face recognition network suitable for front end
CN116452900A (en) Target detection method based on lightweight neural network
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN113688783B (en) Face feature extraction method, low-resolution face recognition method and equipment
Kaur et al. Deep transfer learning based multiway feature pyramid network for object detection in images
Wang et al. Saliency detection by multilevel deep pyramid model

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant