CN114155573A - Human species identification method and device based on SE-ResNet network and computer storage medium - Google Patents

Human species identification method and device based on SE-ResNet network and computer storage medium Download PDF

Info

Publication number
CN114155573A
CN114155573A (application CN202111305054.5A)
Authority
CN
China
Prior art keywords
layer
resnet network
face
race
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111305054.5A
Other languages
Chinese (zh)
Inventor
虞志媛
杨立成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hongmu Intelligent Technology Co ltd
Original Assignee
Shanghai Hongmu Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hongmu Intelligent Technology Co ltd filed Critical Shanghai Hongmu Intelligent Technology Co ltd
Priority to CN202111305054.5A priority Critical patent/CN114155573A/en
Publication of CN114155573A publication Critical patent/CN114155573A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a race identification method based on an SE-ResNet network, comprising the following steps: acquiring real race image data as original data, detecting the face, correcting it by rotation and padding, and scaling it to a uniform size; removing featureless images in which the face is in profile, the head is lowered, or the facial-feature region is occluded over a large area; performing diversity enhancement on the image data; subtracting the per-channel mean from the RGB channels of the image data and then classifying and labeling it; adding SE residual modules to ResNet50 to build an SE-ResNet network and training it; and selecting a picture to be identified, inputting it into the trained SE-ResNet network, and performing classification to obtain the result. By detecting and processing the acquired image data and discarding samples without obvious features before training, the SE-ResNet network achieves good recognition speed and accuracy.

Description

Human species identification method and device based on SE-ResNet network and computer storage medium
Technical Field
The invention relates to a human race identification method, in particular to a human race identification method and device based on an SE-ResNet network and a computer storage medium.
Background
Most existing ethnicity recognition distinguishes the four major ethnic groups. In personnel management work, because people of different ethnic backgrounds have different characteristics and customs, their respective living and working habits should be respected, which requires identifying them. The prior art focuses on distinguishing the four major groups; for example, patent applications 202010996916.2 and 201811372085.0 address the problem that existing face datasets consist mostly of European and American face data, so that recognition results for other groups are poor. However, each group contains many subtypes, and existing recognition methods are not accurate enough to recognize a particular subtype.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a race identification method based on an SE-ResNet network, which solves the problem of low accuracy when classifying a single subtype within the four major ethnic groups.
The technical scheme of the invention is as follows: a race identification method based on an SE-ResNet network comprises the following steps:
S1, acquiring real race image data as original data, detecting the face, correcting the face by rotation and padding according to the facial features, and scaling it to a uniform size;
S2, removing featureless images in which the face is in profile, the head is lowered, or the facial-feature region is occluded over a large area;
S3, randomly adjusting the brightness, contrast, definition and sharpness of the image data to increase its diversity, and then applying Gaussian blur to the images;
S4, subtracting the per-channel mean from the RGB channels of the image data, and then labeling the dataset by class to distinguish general-race faces from specific-race faces, the general race being every race other than the specific race;
S5, establishing an SE-ResNet network which sequentially comprises a first convolution module, a second convolution module, a first pooling layer, a first SE residual module, a third convolution module, a second pooling layer, a second SE residual module, a third SE residual module, a fourth convolution module, a third pooling layer, fourth to ninth SE residual modules, a fifth convolution module, a fourth pooling layer, tenth to twelfth SE residual modules, a first fully connected layer, a second fully connected layer and a softmax layer. Each of the first to fifth convolution modules consists of a convolution layer plus an activation layer. Each SE residual module comprises a main path and a side path, both connected to an eltwise layer: the main path is, in order, a first convolution layer plus activation layer, a second convolution layer plus activation layer, an average pooling layer, a fully connected layer, an activation function, a fully connected layer and a Sigmoid, then connects to the eltwise layer, while the side path connects directly to the eltwise layer;
S6, training the SE-ResNet network with the dataset obtained in step S4;
S7, selecting a picture to be identified, inputting it into the SE-ResNet network trained in step S6, and performing classification to obtain the result.
Further, in step S4 the labeled dataset is divided into a test set and a training set, and step S6 comprises: S6-1, training the SE-ResNet network with the training set, then testing the trained network with the original data corresponding to the test set, and assembling an optimized training set from the test-set data corresponding to original data with correct and incorrect test results; and S6-2, performing optimization training on the network trained in step S6-1 using the optimized training set.
Further, during the training of step S6-1 the optimizer is SGD, the loss function is the cross-entropy loss, the initial learning rate is 0.001, 100000 epochs are trained, the learning strategy is multistep (the learning rate decays to 0.0001 at epoch 20000 and to 0.00001 at epoch 40000), and the momentum is 0.99. During the optimization training of step S6-2 the optimizer is Adam, the loss function is the cross-entropy loss, the initial learning rate is 0.0001, 50000 epochs are trained, the learning strategy is step (every 5000 epochs the learning rate decays to 50% of its previous value), the momentum is 0.99, and the learning rate and bias learning rate of the second fully connected layer in the SE-ResNet network are each multiplied by 10 during optimization to generate the network model. The result confidence threshold is set to 0.8 for the classification in step S7.
Further, during the optimization training of step S6-2, the test-set data corresponding to original data with incorrect test results in step S6-1 makes up 40% to 50% of the optimized training set.
Further, the ratio of general-race data to specific-race data in the test set, the training set and the optimized training set is 1:1.
Further, step S2 specifically comprises: S2-1, classifying and locating the original data with a first neural network, the classes being frontal face, profile and lowered head, and the located points being the brow head, middle and tail of the left and right eyebrows; the inner corner, middle and outer corner of the left and right eyes; the nose tip, both sides of the nostrils and the nose base; both mouth corners and the upper, middle and lower lip; and the upper and lower points where the left and right ears join the facial contour; S2-2, comparing the left ear-to-left eye distance with the right ear-to-right eye distance to compute the yaw angle, and comparing the y coordinates of the two ears with those of the two eyes to compute the pitch angle; S2-3, using a second neural network to detect whether the facial-feature region is occluded over a large area; S2-4, removing featureless images judged to be 90-degree profiles, heads lowered by 70 degrees or more, or large-area occlusions of the facial-feature region. The first neural network is a modified VGG network: a convolution layer with kernel size 5 and stride 1 plus an activation layer, a pooling layer, three blocks of a convolution layer with kernel size 3 and stride 1 plus an activation layer and a pooling layer, and a fully connected layer computing the results, which a slice layer splits into the classes and the coordinate points. The second neural network is a modified VGG network: a convolution layer with kernel size 7 and stride 4 plus an activation layer, a pooling layer, a convolution layer with kernel size 3 and stride 1 plus an activation layer, a pooling layer, and a fully connected layer computing the classification result.
Further, the random adjustment in step S3 randomly selects whether each of brightness, contrast, definition and sharpness is adjusted and, for each selected parameter, randomly chooses forward or backward adjustment, the magnitudes of forward and backward adjustment being the same.
The invention also provides a race recognition device based on the SE-ResNet network, comprising a processor and a memory in which a computer program is stored; when executed by the processor, the computer program implements the above race identification method based on the SE-ResNet network.
The invention also provides a computer storage medium on which a computer program is stored; when executed by a processor, the computer program implements the above race identification method based on the SE-ResNet network.
The technical scheme provided by the invention has the following advantages. The SE-ResNet network is constructed for the practical requirement of identifying race from images, improving recognition speed and reducing labor. Faces in the image data are detected in advance and inferior data taken at extreme angles or with occluded facial features is eliminated before forming the training data, reducing its influence on the training result; a VGG network classifies and locates the images and, combined with the angle computation, improves the efficiency and accuracy of data elimination. An ideal recognition model is obtained through preliminary training plus optimization training, and a reasonable confidence threshold further improves recognition accuracy.
Drawings
FIG. 1 is a schematic diagram of a training process of an SE-ResNet network adopted by the race identification method based on the SE-ResNet network.
Fig. 2 is a schematic diagram of a SE-ResNet network structure.
FIG. 3 is a schematic diagram of the structure of SE residual modules in the SE-ResNet network structure.
Detailed Description
The present invention is further described in the following examples, which are intended to be illustrative only and not to be limiting as to the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalent modifications within the scope of the following claims.
Referring to FIG. 1, the race identification method based on the SE-ResNet network of this embodiment comprises the following steps:
S1, acquiring real race image data and generating a dataset. Specifically, the race dataset used in the invention was captured in the applicant's actual projects: 60,000 general faces and 60,000 specific-race faces in total, covering different cameras, time periods, illumination conditions, scenes, genders and age groups. MTCNN is used to detect the face and locate five landmark coordinates (the two eyes, the nose tip and the two mouth corners), which are matched one by one against preset facial-feature coordinates (left eye 30.2946, 51.6963; right eye 65.5318, 51.5014; nose tip 48.0252, 71.7366; left mouth corner 33.5493, 92.3655; right mouth corner 62.7299, 92.2041), so that the face is corrected by rotation and scaling and the facial features of every picture lie at the same positions. Blank regions left after rotation are filled with black pixels, and the face is uniformly scaled to 96 x 112;
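The alignment in S1 (warping the five detected landmarks onto the preset template coordinates) can be sketched as a least-squares 2D similarity transform. This is an illustrative reconstruction, not code from the patent; the function name `similarity_transform` is ours, and the actual warp (e.g. `cv2.warpAffine`) is only mentioned in a comment.

```python
import numpy as np

# Template landmark positions from the patent (left eye, right eye,
# nose tip, left mouth corner, right mouth corner) for a 96x112 face.
TEMPLATE = np.array([
    [30.2946, 51.6963],
    [65.5318, 51.5014],
    [48.0252, 71.7366],
    [33.5493, 92.3655],
    [62.7299, 92.2041],
])

def similarity_transform(src):
    """Least-squares similarity transform (rotation + scale + shift)
    mapping the 5 detected landmarks `src` onto TEMPLATE.
    Solves u = a*x - c*y + tx, v = c*x + a*y + ty for (a, c, tx, ty)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, TEMPLATE):
        A.append([x, -y, 1, 0]); b.append(u)
        A.append([y,  x, 0, 1]); b.append(v)
    a, c, tx, ty = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)[0]
    return np.array([[a, -c, tx], [c, a, ty]])  # 2x3 affine matrix

# The 2x3 matrix can then be passed to an affine-warp routine with a
# 96x112 output; blank corners after rotation are filled with black.
```

With identical source and template points the result is (approximately) the identity transform, which makes the sketch easy to sanity-check.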
S2, removing unsatisfactory face data. The faces are divided as required into general faces and specific-race faces at a ratio of 1:1; mixed-race faces are classified according to which race the face leans toward. A calibrated, trained modified VGG network is used, whose structure is a convolution layer with kernel size 5 and stride 1 plus an activation layer, a pooling layer, three blocks of a convolution layer with kernel size 3 and stride 1 plus an activation layer and a pooling layer, and a fully connected layer computing the results; a slice layer splits the result into 3 classes and 50 values. The 3 classes are 0 for a frontal face, 1 for a 90-degree profile and 2 for a head lowered by 75 degrees or more, and the 50 values form 25 coordinate points: the brow head, middle and tail of the left and right eyebrows; the inner corner, middle and outer corner of the left and right eyes; the nose tip, both sides of the nostrils and the nose base; both mouth corners and the upper, middle and lower lip; and the upper and lower points where the left and right ears join the facial contour. The input is a 40 x 40 grayscale image. Beforehand, 50,000 faces of each posture were selected, the above coordinate points were annotated, and the network was trained; the trained network then produces the facial-feature coordinate points for an input face image. The right-eye width ratio is computed as the x-axis distance from the right eye to the right ear divided by (the x-axis distance from the left eye to the left ear plus the x-axis distance from the right eye to the right ear), giving a result in [0, 1]; subtracting 0.5 and multiplying by 2 maps it to [-1, 1], where [-1, 0) indicates the right side of the face and (0, 1] the left side, and multiplying the absolute value by 90 gives the yaw angle, so the direction and angle of the face are obtained. Using y = kx + b with the two points on the top edge of both ears, the coefficients k1 and b1 are computed; the two points on the bottom edge of both ears give k2 and b2 in the same way. Setting x to the x coordinate of the left eye, y1 and y2 are computed from the two lines and averaged as y0; with y the y coordinate of the left eye, the pitch is n(y - y0)/|y2 - y1| with n = 30, normalized to the range [-90, 90] by clamping the extremes; a value greater than 0 means the head is raised and a value less than 0 means it is lowered.
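The yaw and pitch computations of step S2-2 can be condensed into a short sketch. The sign conventions and clamping are reconstructed from the garbled translation, so treat them as assumptions; the function names are ours.

```python
def yaw_angle(left_eye_x, left_ear_x, right_eye_x, right_ear_x):
    """Yaw from the ratio of eye-to-ear x distances (patent step S2-2).
    Result in [-90, 90]: negative ~ right profile, positive ~ left."""
    d_l = abs(left_ear_x - left_eye_x)
    d_r = abs(right_ear_x - right_eye_x)
    ratio = d_r / (d_l + d_r)          # in [0, 1]
    return (ratio - 0.5) * 2 * 90      # map to [-90, 90]

def pitch_angle(eye_x, eye_y, ear_top_pts, ear_bot_pts, n=30):
    """Pitch from the ear lines y = k*x + b evaluated at the eye.
    > 0 means head raised, < 0 means head lowered (clamped to +-90)."""
    def line(p1, p2):
        k = (p2[1] - p1[1]) / (p2[0] - p1[0])
        return k, p1[1] - k * p1[0]
    k1, b1 = line(*ear_top_pts)        # line through ear-top points
    k2, b2 = line(*ear_bot_pts)        # line through ear-bottom points
    y1, y2 = k1 * eye_x + b1, k2 * eye_x + b2
    y0 = (y1 + y2) / 2                 # reference height at the eye's x
    a = n * (eye_y - y0) / abs(y2 - y1)
    return max(-90.0, min(90.0, a))
```

A symmetric face (equal eye-to-ear distances, eye midway between the ear lines) yields yaw 0 and pitch 0, matching the frontal-face case in the text.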
A second trained modified VGG network is also used: a convolution layer with kernel size 7 and stride 4 plus an activation layer, a pooling layer, a convolution layer with kernel size 3 and stride 1 plus an activation layer, a pooling layer, two convolution layers with kernel size 3 and stride 1 plus activation layers, a pooling layer, and a fully connected layer computing the classification result. Its input is a three-channel image of size 96 x 60: the face detected and corrected by the open-source MTCNN network and uniformly scaled to 96 x 112 is cropped from height 62 to the bottom of the image, giving a 96 x 60 partial face containing only the region below the nose. 50,000 faces obtained in this way, labeled 0 for masked and 1 for normal, were used to train the network to detect whether the facial-feature region is occluded over a large area. Using these two networks, face data judged to be a 90-degree profile or a head lowered by 70 degrees or more, and face data without obvious features because of large-area occlusion of the facial features, are removed from the original dataset;
S3, performing data enhancement on the images. OpenCV is used to adjust brightness, contrast, definition and sharpness; four binary random numbers in {0, 1} determine which of the four dimensions are adjusted (for example, if the contrast random number is 0 the contrast is left unchanged, and if it is 1 the contrast is adjusted), with the adjustment amplitude drawn at random: the brightness, contrast, definition and sharpness amplitudes are each in [0.5, 1.5]. This increases the diversity of the samples.
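The coin-flip-plus-amplitude scheme above can be sketched for the two pixel-wise adjustments. Definition and sharpness need filtering operations and are omitted; the function name `jitter` and the use of numpy instead of OpenCV are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(img):
    """Random photometric jitter as in S3: independent coin flips decide
    which dimensions are adjusted, with amplitude drawn from [0.5, 1.5].
    Only brightness and contrast are shown in this sketch."""
    out = img.astype(np.float32)
    if rng.integers(0, 2):                      # brightness switch
        out *= rng.uniform(0.5, 1.5)            # scale all pixels
    if rng.integers(0, 2):                      # contrast switch
        f = rng.uniform(0.5, 1.5)
        out = (out - out.mean()) * f + out.mean()  # stretch about the mean
    return np.clip(out, 0, 255).astype(np.uint8)
```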
Gaussian blur is then applied using the Gaussian function

G(x, y) = (1 / (2 * pi * sigma^2)) * exp(-(x^2 + y^2) / (2 * sigma^2)),

with a chosen value of sigma (3 in this embodiment). An image weight matrix is computed from it, and each pixel in the neighborhood is multiplied by its weight to obtain the Gaussian-blurred value of the center point. In this way every image is Gaussian-blurred after the brightness, contrast, definition and sharpness adjustment, enhancing the generalization ability of the samples.
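The weight matrix described above is just a normalized sampling of the 2D Gaussian; a minimal sketch (the kernel size 7 is our assumption, the patent only fixes sigma = 3):

```python
import numpy as np

def gaussian_kernel(size=7, sigma=3.0):
    """Weight matrix from G(x, y) = 1/(2*pi*sigma^2) *
    exp(-(x^2 + y^2) / (2*sigma^2)), normalised to sum to 1."""
    ax = np.arange(size) - size // 2            # e.g. [-3 .. 3]
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return k / k.sum()                          # weights sum to 1
```

Convolving each pixel's neighborhood with this kernel (a weighted sum) gives the blurred value of the center point, which is exactly the per-pixel operation the text describes.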
S4, preprocessing the images: they are unified into three-channel color images and their brightness is reduced; specifically, the per-channel means 104, 117 and 123 are subtracted from the B, G and R channels respectively, the means having been computed channel by channel over the face images of S2. Subtracting the mean removes what the images have in common and highlights individual differences; removing the average brightness also reduces the influence of illumination on the data to a certain extent. The dataset is then labeled by class in the form (n, l), where n is the picture path, l = 0 represents a general face and l = 1 to N-1 represent faces of the specific races; 10% is randomly drawn as the test set and 90% as the training set, and the dataset is randomly shuffled.
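The mean subtraction and the 90/10 split can be sketched directly; the helper names are ours, and only the channel means are taken from the patent.

```python
import numpy as np
import random

BGR_MEAN = np.array([104.0, 117.0, 123.0])  # per-channel means from the patent

def preprocess(img_bgr):
    """Subtract the per-channel BGR mean so that what all face images
    share (e.g. average brightness) is removed and differences remain."""
    return img_bgr.astype(np.float32) - BGR_MEAN

def split_dataset(samples, test_frac=0.1, seed=0):
    """Shuffle (path, label) pairs and split 90% train / 10% test."""
    rnd = random.Random(seed)
    samples = samples[:]
    rnd.shuffle(samples)
    n_test = int(len(samples) * test_frac)
    return samples[n_test:], samples[:n_test]
```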
S5, referring to FIGS. 2 and 3, an SE-ResNet network is constructed from a ResNet50 network and SE residual modules. It sequentially comprises a first convolution module, a second convolution module, a first pooling layer, a first SE residual module, a third convolution module, a second pooling layer, a second SE residual module, a third SE residual module, a fourth convolution module, a third pooling layer, fourth to ninth SE residual modules, a fifth convolution module, a fourth pooling layer, tenth to twelfth SE residual modules, a first fully connected layer, a second fully connected layer and a softmax layer. Different numbers of SE residual modules are inserted after the last four of the five convolution modules of the ResNet50 network, and every convolution module of the ResNet50 network comprises a convolution layer and an activation layer. The SE residual module comprises a main path that is, in order, a first convolution layer plus activation layer, a second convolution layer plus activation layer, an average pooling layer, a fully connected layer, an activation function (ReLU), a fully connected layer and a Sigmoid, then connected to the eltwise layer, and a side path connected directly to the eltwise layer. The newly built SE-ResNet network is trained on the training set.
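The squeeze-and-excitation branch of the SE residual module (average pooling, FC, ReLU, FC, Sigmoid, channel-wise rescaling) can be written as a short numpy forward pass. This is a sketch of the standard SE mechanism the patent names, not the patent's own code; weight shapes and the reduction ratio are assumptions.

```python
import numpy as np

def se_scale(feat, w1, b1, w2, b2):
    """Forward pass of the SE branch on a feature map `feat` of shape
    (C, H, W): global average pool -> FC -> ReLU -> FC -> Sigmoid ->
    per-channel rescaling. Weight shapes: w1 (C/r, C), w2 (C, C/r)."""
    z = feat.mean(axis=(1, 2))                  # squeeze: (C,)
    h = np.maximum(w1 @ z + b1, 0)              # excitation FC + ReLU
    s = 1 / (1 + np.exp(-(w2 @ h + b2)))        # FC + Sigmoid -> (C,)
    return feat * s[:, None, None]              # channel-wise rescale

# In the patent's SE residual module the main path is conv+ReLU,
# conv+ReLU, then this SE branch; the eltwise layer then adds the
# unscaled side-path input to the rescaled main-path output.
```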
S6-1, the training parameters are: input size 96 x 112, batch size 64, SGD optimizer with a cross-entropy loss function, initial learning rate 0.001, 100000 training epochs, multistep learning strategy in which the learning rate decays to 10% of its previous value at epoch 20000 and again at epoch 40000, and momentum 0.99.
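The multistep schedule of S6-1 reduces to a small function; the function name is ours, the numbers are the patent's (0.001, then 0.0001 from epoch 20000, then 0.00001 from epoch 40000).

```python
def multistep_lr(epoch, base_lr=0.001, milestones=(20000, 40000), gamma=0.1):
    """Multistep schedule from S6-1: the learning rate is multiplied by
    0.1 at each milestone, giving 0.001 -> 0.0001 -> 0.00001."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```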
S6-2, if the accuracy in a newly deployed scene falls below 80% in actual use, new samples are obtained by testing the model generated in step S6-1 on that scene's camera; 20,000 samples are selected according to the test results, 40% incorrectly and 60% correctly classified, with a 1:1 ratio between the two classes. The images are processed as in steps S1 to S4, and the model generated in step S6-1 is then optimized with input size 96 x 112, batch size 64, Adam optimizer, cross-entropy loss function, initial learning rate 0.0001, 50000 epochs, step learning strategy in which every 5000 epochs the learning rate decays to 50% of its previous value, momentum 0.99, and the learning rate and bias learning rate of the last fully connected layer (the second fully connected layer) each multiplied by 10, generating the optimized model.
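The step schedule used for the optimization phase is likewise one line; the function name is ours, the constants are the patent's.

```python
def step_lr(epoch, base_lr=0.0001, step=5000, gamma=0.5):
    """Step schedule from S6-2: every 5000 epochs the learning rate
    decays to 50% of its previous value."""
    return base_lr * gamma ** (epoch // step)
```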
S7, actual use: pictures collected by the camera are selected, MTCNN performs face detection and angle correction, the two modified VGG networks screen out faces without obvious features (90-degree profiles and large-area occlusions), the means [104, 117, 123] are subtracted from the image channels, and the face is fed to the SE-ResNet network optimized in step S6-2 with the result confidence threshold set to 0.8 to obtain the race classification result. The precision (the number of samples correctly identified as the race divided by the number of samples classified as the race) was 99.9%, and the recall (the number of samples correctly identified as the race divided by the total number of samples of the race) was 85%.
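The final accept-or-reject decision with the 0.8 confidence threshold can be sketched as a softmax followed by a threshold check; the function name and the "-1 means undecided" convention are ours.

```python
import math

def classify(logits, threshold=0.8):
    """Softmax over the network outputs; the top class is accepted only
    when its probability reaches the confidence threshold, otherwise
    the sample is left undecided (returns -1)."""
    m = max(logits)                              # for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else -1
```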
It should be noted that the particular methods of the embodiments described above may form a computer program product, and the computer program product embodied herein may therefore be stored on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM and optical storage). The invention may be implemented in hardware, in software, or in a combination of the two, or as a computer device comprising at least one processor and a memory, the memory storing a computer program implementing the steps of the above flow and the processor being configured to execute that program so as to perform the method of the embodiments described above.

Claims (9)

1. A race identification method based on an SE-ResNet network, characterized by comprising the following steps:
S1, acquiring real race image data as original data, detecting the face with an MTCNN model, correcting the face by rotation and padding according to the facial features, and scaling it to a uniform size;
S2, removing featureless images in which the face is in profile, the head is lowered, or the facial-feature region is occluded over a large area;
S3, adjusting the brightness, contrast, definition and sharpness of the image data to increase its diversity, and then applying Gaussian blur to the images;
S4, subtracting the per-channel mean from the RGB channels of the image data, and then labeling the dataset by class to distinguish general-race faces from specific-race faces, the general race being every race other than the specific race;
S5, establishing an SE-ResNet network which sequentially comprises a first convolution module, a second convolution module, a first pooling layer, a first SE residual module, a third convolution module, a second pooling layer, a second SE residual module, a third SE residual module, a fourth convolution module, a third pooling layer, fourth to ninth SE residual modules, a fifth convolution module, a fourth pooling layer, tenth to twelfth SE residual modules, a first fully connected layer, a second fully connected layer and a softmax layer. Each of the first to fifth convolution modules consists of a convolution layer plus an activation layer. Each SE residual module comprises a main path and a side path, both connected to an eltwise layer: the main path is, in order, a first convolution layer plus activation layer, a second convolution layer plus activation layer, an average pooling layer, a fully connected layer, an activation function, a fully connected layer and a Sigmoid, then connects to the eltwise layer, while the side path connects directly to the eltwise layer;
S6, training the SE-ResNet network with the dataset obtained in step S4;
S7, selecting a picture to be identified, inputting it into the SE-ResNet network trained in step S6, and performing classification to obtain the result.
2. The race identification method based on the SE-ResNet network as claimed in claim 1, wherein in step S4 the labeled dataset is divided into a test set and a training set, and step S6 comprises: S6-1, training the SE-ResNet network with the training set, then testing the trained network with the original data corresponding to the test set, and assembling an optimized training set from the test-set data corresponding to original data with correct and incorrect test results; and S6-2, performing optimization training on the network trained in step S6-1 using the optimized training set.
3. The race identification method based on the SE-ResNet network as claimed in claim 2, wherein during the training of step S6-1 the optimizer is SGD, the loss function is the cross-entropy loss, the initial learning rate is 0.001, 100000 epochs are trained, the learning strategy is multistep (the learning rate decays to 0.0001 at epoch 20000 and to 0.00001 at epoch 40000), and the momentum is 0.99; and during the optimization training of step S6-2 the optimizer is Adam, the loss function is the cross-entropy loss, the initial learning rate is 0.0001, 50000 epochs are trained, the learning strategy is step (every 5000 epochs the learning rate decays to 50% of its previous value), the momentum is 0.99, and the learning rate and bias learning rate of the second fully connected layer in the SE-ResNet network are each multiplied by 10 during optimization to generate the network model.
4. The race identification method based on the SE-ResNet network as claimed in claim 2, wherein during the optimization training of step S6-2 the test-set data corresponding to original data with incorrect test results in step S6-1 makes up 40% to 50% of the optimized training set.
5. The race identification method based on the SE-ResNet network as claimed in claim 2, wherein the ratio of general-race data to specific-race data in the test set, the training set and the optimized training set is 1:1.
6. The race recognition method based on the SE-ResNet network as claimed in claim 1, wherein step S2 specifically comprises: S2-1, classifying and locating faces in the original data with a first neural network, the classification distinguishing frontal faces, profile faces and lowered heads, and the locating yielding coordinate points for the head, middle and tail of the left and right eyebrows; the inner canthus, eye center and outer canthus of the left and right eyes; the top of the nose, both sides of the nose wings and the nose tip; the two corners and the upper-, middle- and lower-lip points of the mouth; and the upper and lower points where the left and right ears join the facial contour; S2-2, computing the yaw angle of the face by comparing the left ear-to-left eye distance with the right ear-to-right eye distance, and computing the pitch angle by comparing the y coordinates of the two ears with those of the two eyes; S2-3, identifying with a second neural network whether the facial-feature regions are occluded over a large area; S2-4, removing feature-poor images judged to be 90° profile faces, heads lowered by 70° or more, or faces with large-area occlusion of the facial-feature regions; the first neural network is a modified VGG network whose base is a convolutional layer with kernel size 5 and stride 1 followed by an activation layer and a pooling layer, then three groups each consisting of a convolutional layer with kernel size 3 and stride 1, an activation layer and a pooling layer, and finally a fully-connected layer that computes the result, which a slice layer splits into the classification and the coordinate points; the second neural network is a modified VGG network whose base is a convolutional layer with kernel size 7 and stride 4 followed by an activation layer and a pooling layer, then a convolutional layer with kernel size 3 and stride 1 followed by an activation layer and a pooling layer, and finally a fully-connected layer that computes the classification result.
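The landmark-based pose estimates of step S2-2 can be sketched as below. The claims state only which distances and y coordinates are compared; the exact mapping to degrees (the `acos`/`atan2` formulas here) is an illustrative assumption:

```python
import math

def _dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def estimate_yaw(left_ear, left_eye, right_ear, right_eye):
    """Yaw (left-right turn): compare the left ear-to-eye distance with the
    right one.  Equal distances mean a frontal face; the more one side
    foreshortens, the larger the turn."""
    d_left = _dist(left_ear, left_eye)
    d_right = _dist(right_ear, right_eye)
    if d_left == 0 or d_right == 0:
        return 90.0  # one side fully foreshortened: treat as full profile
    ratio = min(d_left, d_right) / max(d_left, d_right)
    return math.degrees(math.acos(ratio))  # 0 degrees when symmetric

def estimate_pitch(left_ear, right_ear, left_eye, right_eye):
    """Pitch (up-down tilt): compare the mean y of the ears with the mean y
    of the eyes, normalized by the eye span (image y grows downward)."""
    ear_y = (left_ear[1] + right_ear[1]) / 2.0
    eye_y = (left_eye[1] + right_eye[1]) / 2.0
    eye_span = abs(right_eye[0] - left_eye[0]) or 1.0
    return math.degrees(math.atan2(eye_y - ear_y, eye_span))
```

Images whose estimated yaw reaches a full profile or whose pitch indicates a head lowered by 70° or more would then be discarded in step S2-4.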
7. The race recognition method based on the SE-ResNet network as claimed in claim 1, wherein the random adjustment in step S3 randomly selects whether each of brightness, contrast, saturation and sharpness is adjusted, and for each selected parameter randomly selects a forward or a backward adjustment, the forward and backward adjustments having the same magnitude.
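The claim-7 augmentation policy can be sketched as follows. The parameter list and the multiplicative-factor representation are illustrative assumptions (the original fourth parameter is garbled in the translation; "saturation" is a guess), and in practice the factors would be fed to an image library such as Pillow's `ImageEnhance`:

```python
import random

def random_adjustments(params=("brightness", "contrast", "saturation", "sharpness"),
                       magnitude=0.2, rng=None):
    """For each parameter, independently decide whether to adjust it at all,
    then pick a forward (+magnitude) or backward (-magnitude) adjustment of
    equal size.  Returns a dict of multiplicative factors (1.0 = unchanged)."""
    rng = rng or random.Random()
    factors = {}
    for p in params:
        if rng.random() < 0.5:  # randomly select whether to adjust at all
            sign = 1 if rng.random() < 0.5 else -1  # forward or backward
            factors[p] = 1.0 + sign * magnitude
        else:
            factors[p] = 1.0
    return factors
```

Because the forward and backward magnitudes are equal, every factor is one of `1 - magnitude`, `1.0`, or `1 + magnitude`.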
8. A race recognition apparatus based on the SE-ResNet network, comprising a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, implements the race recognition method based on the SE-ResNet network according to any one of claims 1 to 7.
9. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the race recognition method based on the SE-ResNet network according to any one of claims 1 to 7.
CN202111305054.5A 2021-11-05 2021-11-05 Human species identification method and device based on SE-ResNet network and computer storage medium Pending CN114155573A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111305054.5A CN114155573A (en) 2021-11-05 2021-11-05 Human species identification method and device based on SE-ResNet network and computer storage medium

Publications (1)

Publication Number Publication Date
CN114155573A true CN114155573A (en) 2022-03-08

Family

ID=80459255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111305054.5A Pending CN114155573A (en) 2021-11-05 2021-11-05 Human species identification method and device based on SE-ResNet network and computer storage medium

Country Status (1)

Country Link
CN (1) CN114155573A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710831A (en) * 2018-04-24 2018-10-26 华南理工大学 A machine-vision-based face recognition algorithm for small data sets
CN109255340A (en) * 2018-10-29 2019-01-22 东北大学 A face recognition method fusing multiple improved VGG networks
CN112052772A (en) * 2020-08-31 2020-12-08 福建捷宇电脑科技有限公司 Face occlusion detection algorithm
CN112818967A (en) * 2021-04-16 2021-05-18 杭州魔点科技有限公司 Child identity recognition method based on face recognition and head-and-shoulder recognition
CN113052227A (en) * 2021-03-22 2021-06-29 山西三友和智慧信息技术股份有限公司 Pulmonary tuberculosis identification method based on SE-ResNet
WO2021169641A1 (en) * 2020-02-28 2021-09-02 深圳壹账通智能科技有限公司 Face recognition method and system


Similar Documents

Publication Publication Date Title
CN110222787B (en) Multi-scale target detection method and device, computer equipment and storage medium
Zhao et al. Multi-focus image fusion with a natural enhancement via a joint multi-level deeply supervised convolutional neural network
CN110363116B (en) Irregular human face correction method, system and medium based on GLD-GAN
CN109284738B (en) Irregular face correction method and system
CN111160269A (en) Face key point detection method and device
CN110287790B (en) Learning state hybrid analysis method oriented to static multi-user scene
US6917703B1 (en) Method and apparatus for image analysis of a gabor-wavelet transformed image using a neural network
JP2021517330A (en) A method for identifying an object in an image and a mobile device for carrying out the method.
US11893789B2 (en) Deep neural network pose estimation system
CN114241548A (en) Small target detection algorithm based on improved YOLOv5
CN110738161A (en) face image correction method based on improved generation type confrontation network
CN111445410A (en) Texture enhancement method, device and equipment based on texture image and storage medium
CN109725721B (en) Human eye positioning method and system for naked eye 3D display system
CN111738344A (en) Rapid target detection method based on multi-scale fusion
CN112819772A (en) High-precision rapid pattern detection and identification method
CN111291701B (en) Sight tracking method based on image gradient and ellipse fitting algorithm
CN112381061B (en) Facial expression recognition method and system
CN112541422A (en) Expression recognition method and device with robust illumination and head posture and storage medium
CN111209873A (en) High-precision face key point positioning method and system based on deep learning
CN110991256A (en) System and method for carrying out age estimation and/or gender identification based on face features
CN106529441A (en) Fuzzy boundary fragmentation-based depth motion map human body action recognition method
CN107194364B (en) Huffman-LBP multi-pose face recognition method based on a divide-and-conquer strategy
CN109784215B (en) In-vivo detection method and system based on improved optical flow method
CN111553250B (en) Accurate facial paralysis degree evaluation method and device based on face characteristic points
CN112861855A (en) Group-raising pig instance segmentation method based on confrontation network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination