WO2020244032A1 - Face image detection method and apparatus - Google Patents

Face image detection method and apparatus

Info

Publication number
WO2020244032A1
WO2020244032A1 (PCT/CN2019/096575)
Authority
WO
WIPO (PCT)
Prior art keywords
face
face image
image
sequence
image frame
Prior art date
Application number
PCT/CN2019/096575
Other languages
French (fr)
Chinese (zh)
Inventor
连桄雷
张龙
Original Assignee
罗普特科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 罗普特科技集团股份有限公司 filed Critical 罗普特科技集团股份有限公司
Publication of WO2020244032A1 publication Critical patent/WO2020244032A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/0002: Inspection of images, e.g. flaw detection
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161: Detection; Localisation; Normalisation
    • G06V 40/168: Feature extraction; Face representation
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/30: Subject of image; Context of image processing
    • G06T 2207/30168: Image quality inspection
    • G06T 2207/30196: Human being; Person
    • G06T 2207/30201: Face

Definitions

  • determining the sharpness of each face image based on the key point information set of each face image includes: extracting target key point information from the key point information set of each face image; determining, based on the target key point information, a target area from each face image, and determining the average pixel gradient of the pixels included in the target area; and determining the sharpness of each face image based on the average pixel gradient.
  • the above-mentioned execution subject may input the image frame into a pre-trained face detection model to obtain face position information.
  • the face detection model is used to characterize the correspondence between the image sequence and the face position information.
  • Step 1: determine the face pose angle information of each face image based on the key point information set of each face image included in the face image sequence.
  • the face detection model is also used to generate the key point information set of the image frame, where the key point information is used to characterize the position of the face key point in the face image; and
  • the output module 504 includes: a first determining unit (not shown in the figure), configured to determine the face pose angle information of each face image based on the key point information set of each face image included in the face image sequence;
  • the second determining unit (not shown in the figure) is configured to determine the quality score of each face image based on the face pose angle information.
  • the computer-readable storage medium described in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present invention disclose a face image detection method and apparatus. A specific embodiment of the method comprises: acquiring a target image frame sequence; for each image frame comprised in the target image frame sequence, inputting the image frame into a pre-trained face detection model to acquire face position information; determining, on the basis of the acquired face position information, at least one face image sequence from the image frames comprised in the target image frame sequence, wherein a face image comprised in each face image sequence is used to indicate the same face; for each face image sequence in the at least one face image sequence, determining a quality score of each face image comprised in the face image sequence; and extracting a face image from the face image sequence on the basis of the acquired quality score, and outputting the face image. This embodiment extracts a high-quality face image from the target image sequence, thereby improving the accuracy for operations such as face recognition performed using the extracted face image.

Description

用于检测人脸图像的方法和装置Method and device for detecting face image
相关申请Related application
本申请要求保护在2019年6月3日提交的申请号为201910475881.5的中国专利申请的优先权,该申请的全部内容以引用的方式结合到本文中。This application claims the priority of the Chinese patent application with application number 201910475881.5 filed on June 3, 2019, and the entire content of the application is incorporated herein by reference.
技术领域Technical field
本公开实施例涉及计算机技术领域,具体涉及用于检测人脸图像的方法和装置。The embodiments of the present disclosure relate to the field of computer technology, and in particular to methods and devices for detecting face images.
背景技术Background technique
目前视频监控网络已覆盖中国各大中小城市，人脸识别技术可以应用在安防监控领域。通常，为了建立从云端到前端软硬一体的新型智能安防体系，就必须在前端部署足够丰富的人脸抓拍设备。在监控点位改造和社会资源接入的过程中，纯粹依靠后端抓拍、分析的模式不仅给网络的数据传输能力带来挑战，而且也给后端平台的数据处理能力带来很大的压力，存在运行效率缩减、运营成本大的问题。通常，可以把人脸抓拍功能分担到前端，但是大批量的更换抓拍摄像机会造成项目建设成本的骤增。目前随着5G时代的来临，边缘计算作为云计算的补充，可以充当替代解决方案，这样，就需要一种网关设备能实现前端视频的人脸抓拍，供后端进行分析。At present, video surveillance networks cover large, medium, and small cities across China, and face recognition technology can be applied in the field of security surveillance. Generally, to establish a new intelligent security system integrating software and hardware from the cloud to the front end, sufficient face capture equipment must be deployed at the front end. In the process of upgrading monitoring points and integrating social resources, a mode that relies purely on back-end capture and analysis not only challenges the data transmission capacity of the network, but also puts great pressure on the data processing capacity of the back-end platform, leading to reduced operating efficiency and high operating costs. The face capture function can generally be offloaded to the front end, but replacing capture cameras in large quantities would cause a sharp increase in project construction costs. With the advent of the 5G era, edge computing, as a complement to cloud computing, can serve as an alternative solution. Thus, a gateway device is needed that can perform face capture on front-end video for back-end analysis.
公开内容Summary of the disclosure
本公开实施例的目的在于提出了一种改进的用于检测人脸图像的方法和装置,来解决以上背景技术部分提到的技术问题。The purpose of the embodiments of the present disclosure is to provide an improved method and device for detecting a face image to solve the technical problems mentioned in the background art section above.
第一方面，本公开实施例提供了一种用于检测人脸图像的方法，该方法包括：获取目标图像帧序列；对于目标图像帧序列包括的每个图像帧，将该图像帧输入预先训练的人脸检测模型，得到人脸位置信息；基于所得到的人脸位置信息，从目标图像帧序列包括的图像帧中，确定至少一个人脸图像序列，其中，每个人脸图像序列包括的人脸图像用于指示同一个人脸；对于至少一个人脸图像序列中的每个人脸图像序列，确定该人脸图像序列包括的每个人脸图像的质量评分；基于所得到的质量评分，从该人脸图像序列中提取人脸图像及输出。In a first aspect, embodiments of the present disclosure provide a method for detecting face images. The method includes: acquiring a target image frame sequence; for each image frame included in the target image frame sequence, inputting the image frame into a pre-trained face detection model to obtain face position information; determining, based on the obtained face position information, at least one face image sequence from the image frames included in the target image frame sequence, where the face images included in each face image sequence are used to indicate the same face; for each face image sequence in the at least one face image sequence, determining a quality score of each face image included in the face image sequence; and extracting a face image from the face image sequence based on the obtained quality scores and outputting the extracted face image.
在一些实施例中，基于所得到的人脸位置信息，从目标图像帧序列包括的图像帧中，确定至少一个人脸图像序列，包括：对于目标图像帧序列中的每两个相邻的图像帧，确定该两个相邻的图像帧的第一图像帧中的每个人脸图像中的特征点，以及确定第一图像帧中的每个人脸图像对应的、在第二图像帧中的预测特征点；从第二图像帧中的人脸图像中，确定包括的预测特征点的数量大于等于预设数值的人脸图像作为与在第一图像帧中的对应人脸图像指示的人脸相同的人脸图像。In some embodiments, determining at least one face image sequence from the image frames included in the target image frame sequence based on the obtained face position information includes: for every two adjacent image frames in the target image frame sequence, determining feature points in each face image in the first image frame of the two adjacent image frames, and determining, for each face image in the first image frame, corresponding predicted feature points in the second image frame; and, from the face images in the second image frame, determining a face image that includes a number of predicted feature points greater than or equal to a preset value as a face image indicating the same face as the corresponding face image in the first image frame.
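The predicted-feature-point criterion above reduces to counting how many predicted points fall inside a candidate face box. A minimal sketch follows; the box format `(x1, y1, x2, y2)` and the threshold of 3 points are illustrative assumptions, not values from the patent.

```python
def points_in_box(points, box):
    # count predicted feature points (x, y) that fall inside a face
    # bounding box given as (x1, y1, x2, y2)
    x1, y1, x2, y2 = box
    return sum(1 for (x, y) in points if x1 <= x <= x2 and y1 <= y <= y2)

def indicates_same_face(predicted_points, face_box, min_points=3):
    # a face image in the second frame is taken to indicate the same face
    # as the one in the first frame when it contains at least the preset
    # number (min_points) of predicted feature points
    return points_in_box(predicted_points, face_box) >= min_points
```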
在一些实施例中，基于所得到的人脸位置信息，从目标图像帧序列包括的图像帧中，确定至少一个人脸图像序列，包括：对于目标图像帧序列中的每两个相邻的图像帧，将该两个相邻的图像帧中的第一图像帧中的人脸图像与第二图像帧中的人脸图像中的，面积重合度大于等于预设的重合度阈值的人脸图像确定为指示相同人脸的人脸图像。In some embodiments, determining at least one face image sequence from the image frames included in the target image frame sequence based on the obtained face position information includes: for every two adjacent image frames in the target image frame sequence, determining, among the face images in the first image frame and the face images in the second image frame of the two adjacent image frames, face images whose area coincidence degree is greater than or equal to a preset coincidence threshold as face images indicating the same face.
在一些实施例中，人脸检测模型还用于生成图像帧的关键点信息集合，其中，关键点信息用于表征人脸关键点在人脸图像中的位置；以及确定该人脸图像序列包括的每个人脸图像的质量评分，包括：基于该人脸图像序列包括的每个人脸图像的关键点信息集合，确定每个人脸图像的人脸姿态角信息；基于人脸姿态角信息，确定每个人脸图像的质量评分。In some embodiments, the face detection model is further used to generate a key point information set of an image frame, where the key point information is used to characterize the positions of face key points in a face image; and determining the quality score of each face image included in the face image sequence includes: determining face pose angle information of each face image based on the key point information set of each face image included in the face image sequence; and determining the quality score of each face image based on the face pose angle information.
在一些实施例中，基于该人脸图像序列包括的每个人脸图像的关键点信息集合，确定每个人脸图像的人脸姿态角信息，包括：基于该人脸图像序列包括的每个人脸图像的关键点信息集合，生成每个人脸图像对应的关键点特征向量；将所生成的关键点特征向量乘以预先拟合的特征矩阵，得到人脸姿态角特征向量作为人脸姿态角信息。In some embodiments, determining the face pose angle information of each face image based on the key point information set of each face image included in the face image sequence includes: generating a key point feature vector corresponding to each face image based on the key point information set of each face image included in the face image sequence; and multiplying the generated key point feature vector by a pre-fitted feature matrix to obtain a face pose angle feature vector as the face pose angle information.
在一些实施例中，基于人脸姿态角信息，确定每个人脸图像的质量评分，包括：基于每个人脸图像的关键点信息集合，确定每个人脸图像的清晰度；利用人脸姿态角信息和清晰度，确定每个人脸图像的质量评分。In some embodiments, determining the quality score of each face image based on the face pose angle information includes: determining the sharpness of each face image based on the key point information set of each face image; and determining the quality score of each face image using the face pose angle information and the sharpness.
在一些实施例中，基于每个人脸图像的关键点信息集合，确定每个人脸图像的清晰度，包括：从每个人脸图像的关键点信息集合中提取目标关键点信息；基于目标关键点信息，从每个人脸图像中确定目标区域，以及确定目标区域包括的像素点的平均像素梯度；基于平均像素梯度，确定每个人脸图像的清晰度。In some embodiments, determining the sharpness of each face image based on the key point information set of each face image includes: extracting target key point information from the key point information set of each face image; determining, based on the target key point information, a target area from each face image, and determining the average pixel gradient of the pixels included in the target area; and determining the sharpness of each face image based on the average pixel gradient.
在一些实施例中，人脸检测模型包括结构为深度可分离卷积的卷积层。In some embodiments, the face detection model includes a convolutional layer structured as a depthwise separable convolution.
在一些实施例中，人脸检测模型预先利用批标准化方式训练得到。In some embodiments, the face detection model is trained in advance using batch normalization.
第二方面，本公开实施例提供了一种用于检测人脸图像的装置，该装置包括：获取模块，用于获取目标图像帧序列；生成模块，用于对于目标图像帧序列包括的每个图像帧，将该图像帧输入预先训练的人脸检测模型，得到人脸位置信息，其中，人脸位置信息用于表征人脸图像在该图像帧中的位置；确定模块，用于基于所得到的人脸位置信息，从目标图像帧序列包括的图像帧中，确定至少一个人脸图像序列，其中，每个人脸图像序列包括的人脸图像用于指示同一个人脸；输出模块，用于对于至少一个人脸图像序列中的每个人脸图像序列，确定该人脸图像序列包括的每个人脸图像的质量评分；基于所得到的质量评分，从该人脸图像序列中提取人脸图像及输出。In a second aspect, embodiments of the present disclosure provide an apparatus for detecting face images. The apparatus includes: an acquisition module configured to acquire a target image frame sequence; a generation module configured to, for each image frame included in the target image frame sequence, input the image frame into a pre-trained face detection model to obtain face position information, where the face position information is used to characterize the position of a face image in the image frame; a determining module configured to determine, based on the obtained face position information, at least one face image sequence from the image frames included in the target image frame sequence, where the face images included in each face image sequence are used to indicate the same face; and an output module configured to, for each face image sequence in the at least one face image sequence, determine the quality score of each face image included in the face image sequence, and extract and output a face image from the face image sequence based on the obtained quality scores.
在一些实施例中，确定模块进一步配置用于：对于目标图像帧序列中的每两个相邻的图像帧，确定该两个相邻的图像帧的第一图像帧中的每个人脸图像中的特征点，以及确定第一图像帧中的每个人脸图像对应的、在第二图像帧中的预测特征点；从第二图像帧中的人脸图像中，确定包括的预测特征点的数量大于等于预设数值的人脸图像作为与在第一图像帧中的对应人脸图像指示的人脸相同的人脸图像。In some embodiments, the determining module is further configured to: for every two adjacent image frames in the target image frame sequence, determine feature points in each face image in the first image frame of the two adjacent image frames, and determine, for each face image in the first image frame, corresponding predicted feature points in the second image frame; and, from the face images in the second image frame, determine a face image that includes a number of predicted feature points greater than or equal to a preset value as a face image indicating the same face as the corresponding face image in the first image frame.
在一些实施例中，确定模块进一步配置用于：对于目标图像帧序列中的每两个相邻的图像帧，将该两个相邻的图像帧中的第一图像帧中的人脸图像与第二图像帧中的人脸图像中的，面积重合度大于等于预设的重合度阈值的人脸图像确定为指示相同人脸的人脸图像。In some embodiments, the determining module is further configured to: for every two adjacent image frames in the target image frame sequence, determine, among the face images in the first image frame and the face images in the second image frame of the two adjacent image frames, face images whose area coincidence degree is greater than or equal to a preset coincidence threshold as face images indicating the same face.
在一些实施例中，人脸检测模型还用于生成图像帧的关键点信息集合，其中，关键点信息用于表征人脸关键点在人脸图像中的位置；以及输出模块包括：第一确定单元，用于基于该人脸图像序列包括的每个人脸图像的关键点信息集合，确定每个人脸图像的人脸姿态角信息；第二确定单元，用于基于人脸姿态角信息，确定每个人脸图像的质量评分。In some embodiments, the face detection model is further used to generate a key point information set of an image frame, where the key point information is used to characterize the positions of face key points in a face image; and the output module includes: a first determining unit configured to determine face pose angle information of each face image based on the key point information set of each face image included in the face image sequence; and a second determining unit configured to determine the quality score of each face image based on the face pose angle information.
在一些实施例中，第一确定单元包括：第一生成子单元，用于基于该人脸图像序列包括的每个人脸图像的关键点信息集合，生成每个人脸图像对应的关键点特征向量；第二生成子单元，用于将所生成的关键点特征向量乘以预先拟合的特征矩阵，得到人脸姿态角特征向量作为人脸姿态角信息。In some embodiments, the first determining unit includes: a first generating subunit configured to generate a key point feature vector corresponding to each face image based on the key point information set of each face image included in the face image sequence; and a second generating subunit configured to multiply the generated key point feature vector by a pre-fitted feature matrix to obtain a face pose angle feature vector as the face pose angle information.
在一些实施例中，第二确定单元包括：第一确定子单元，用于基于每个人脸图像的关键点信息集合，确定每个人脸图像的清晰度；第二确定子单元，用于利用人脸姿态角信息和清晰度，确定每个人脸图像的质量评分。In some embodiments, the second determining unit includes: a first determining subunit configured to determine the sharpness of each face image based on the key point information set of each face image; and a second determining subunit configured to determine the quality score of each face image using the face pose angle information and the sharpness.
在一些实施例中，第一确定子单元包括：提取子模块，用于从每个人脸图像的关键点信息集合中提取目标关键点信息；第一确定子模块，用于基于目标关键点信息，从每个人脸图像中确定目标区域，以及确定目标区域包括的像素点的平均像素梯度；第二确定子模块，用于基于平均像素梯度，确定每个人脸图像的清晰度。In some embodiments, the first determining subunit includes: an extracting submodule configured to extract target key point information from the key point information set of each face image; a first determining submodule configured to determine a target area from each face image based on the target key point information, and to determine the average pixel gradient of the pixels included in the target area; and a second determining submodule configured to determine the sharpness of each face image based on the average pixel gradient.
在一些实施例中，人脸检测模型包括结构为深度可分离卷积的卷积层。In some embodiments, the face detection model includes a convolutional layer structured as a depthwise separable convolution.
在一些实施例中，人脸检测模型预先利用批标准化方式训练得到。In some embodiments, the face detection model is trained in advance using batch normalization.
第三方面，本公开实施例提供了一种电子设备，包括一个或多个处理器；存储装置，用于存储一个或多个程序，当一个或多个程序被一个或多个处理器执行，使得一个或多个处理器实现如第一方面中任一实现方式描述的方法。In a third aspect, embodiments of the present disclosure provide an electronic device, including one or more processors, and a storage apparatus for storing one or more programs, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
第四方面,本公开实施例提供了一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现如第一方面中任一实现方式描述的方法。In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium having a computer program stored thereon, and the computer program, when executed by a processor, implements the method described in any implementation manner in the first aspect.
本公开实施例提供的用于检测人脸图像的方法和装置，通过从目标图像帧序列中确定至少一个人脸图像序列，其中，每个人脸图像序列用于指示同一个人脸，然后从每个人脸图像序列中确定每个人脸图像的质量评分，根据质量评分提取人脸图像及输出，从而实现了从目标图像序列中提取高质量的人脸图像，有利于提高利用提取出的人脸图像进行人脸识别等操作的准确性。According to the method and apparatus for detecting face images provided by the embodiments of the present disclosure, at least one face image sequence is determined from a target image frame sequence, where each face image sequence indicates the same face; the quality score of each face image in each face image sequence is then determined, and a face image is extracted and output according to the quality scores. A high-quality face image is thereby extracted from the target image sequence, which helps improve the accuracy of operations such as face recognition performed using the extracted face image.
附图说明Description of the drawings
通过阅读参照以下附图所作的对非限制性实施例所作的详细描述,本公开的其它特征、目的和优点将会变得更明显:By reading the detailed description of the non-limiting embodiments with reference to the following drawings, other features, purposes and advantages of the present disclosure will become more apparent:
图1是本公开可以应用于其中的示例性系统架构图;Fig. 1 is an exemplary system architecture diagram to which the present disclosure can be applied;
图2是根据本公开的用于检测人脸图像的方法的一个实施例的流程图;Fig. 2 is a flowchart of an embodiment of a method for detecting a face image according to the present disclosure;
图3是根据本公开的用于检测人脸图像的方法的又一个实施例的流程图;Fig. 3 is a flowchart of another embodiment of a method for detecting a face image according to the present disclosure;
图4是根据本公开的用于检测人脸图像的方法的人脸姿态角的示例性示意图；Fig. 4 is an exemplary schematic diagram of a face pose angle of the method for detecting a face image according to the present disclosure;
图5是根据本公开的用于检测人脸图像的装置的一个实施例的结构示意图;Fig. 5 is a schematic structural diagram of an embodiment of an apparatus for detecting a face image according to the present disclosure;
图6是适于用来实现本公开实施例的电子设备的计算机系统的结构示意图。Fig. 6 is a schematic structural diagram of a computer system suitable for implementing an electronic device of an embodiment of the present disclosure.
具体实施方式Detailed ways
下面结合附图和实施例对本公开作进一步的详细说明。可以理解的是,此处所描述的具体实施例仅仅用于解释相关公开,而非对该公开的限定。另外还需要说明的是,为了便于描述,附图中仅示出了与有关公开相关的部分。The present disclosure will be further described in detail below in conjunction with the drawings and embodiments. It can be understood that the specific embodiments described here are only used to explain the relevant disclosure, but not to limit the disclosure. In addition, it should be noted that, for ease of description, only the parts related to the relevant disclosure are shown in the drawings.
需要说明的是,在不冲突的情况下,本公开中的实施例及实施例中的特征可以相互组合。下面将参考附图并结合实施例来详细说明本公开。It should be noted that the embodiments in the present disclosure and the features in the embodiments can be combined with each other if there is no conflict. Hereinafter, the present disclosure will be described in detail with reference to the drawings and in conjunction with embodiments.
图1示出了可以应用本公开实施例的用于检测人脸图像的方法的示例性系统架构100。FIG. 1 shows an exemplary system architecture 100 to which embodiments of the method for detecting face images of the present disclosure can be applied.
如图1所示,系统架构100可以包括终端设备101,网络102,中间设备103和服务器104。网络102用以在终端设备101、中间设备103和服务器104之间提供通信链路的介质。网络102可以包括各种连接类型,例如有线、无线通信链路或者光纤电缆等等。As shown in FIG. 1, the system architecture 100 may include a terminal device 101, a network 102, an intermediate device 103, and a server 104. The network 102 is used to provide a medium for communication links between the terminal device 101, the intermediate device 103, and the server 104. The network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables.
服务器104可以是提供各种服务的服务器,例如对终端设备101上传的图像帧序列进行处理的图像处理服务器。图像处理服务器可以对接收的图像帧序列进行处理,并得到处理结果(例如高质量的人脸图像)。The server 104 may be a server that provides various services, for example, an image processing server that processes an image frame sequence uploaded by the terminal device 101. The image processing server can process the received image frame sequence and obtain the processing result (for example, a high-quality face image).
中间设备103可以是各种用于数据收发及处理的设备,包括但不限于以下至少一种:交换机、网关设备等。The intermediate device 103 may be various devices used for data transceiving and processing, including but not limited to at least one of the following: a switch, a gateway device, and the like.
需要说明的是，本公开实施例所提供的用于检测人脸图像的方法一般由中间设备103执行，相应地，用于检测人脸图像的装置一般设置于中间设备103中。还需要说明的是，本公开实施例所提供的用于检测人脸图像的方法还可以由终端设备101或服务器104执行，相应地，用于检测人脸图像的装置可以设置于终端设备101或服务器104中。It should be noted that the method for detecting face images provided by the embodiments of the present disclosure is generally executed by the intermediate device 103; accordingly, the apparatus for detecting face images is generally set in the intermediate device 103. It should also be noted that the method for detecting face images provided by the embodiments of the present disclosure may also be executed by the terminal device 101 or the server 104; accordingly, the apparatus for detecting face images may be set in the terminal device 101 or the server 104.
应该理解,图1中的数据服务器、网络和主服务器的数目仅仅是示意性的。根据实现需要,可以具有任意数目的终端设备、网络、中间设备和服务器。It should be understood that the numbers of data servers, networks, and main servers in Figure 1 are merely illustrative. According to implementation needs, there can be any number of terminal devices, networks, intermediate devices and servers.
继续参考图2，其示出了根据本公开的用于检测人脸图像的方法的一个实施例的流程200。该方法包括以下步骤：Continuing to refer to FIG. 2, it shows a flow 200 of an embodiment of the method for detecting face images according to the present disclosure. The method includes the following steps:
步骤201,获取目标图像帧序列。Step 201: Obtain a target image frame sequence.
在本实施例中，用于检测人脸图像的方法的执行主体（例如图1所示的中间设备或终端设备或服务器）可以获取目标图像帧序列。其中，目标图像帧序列可以是摄像头（例如上述执行主体包括的摄像头或与上述执行主体通信连接的电子设备包括的摄像头）对目标人脸（例如上述摄像头的拍摄范围内的人物的人脸）拍摄的视频包括的图像帧序列。通常，目标图像帧序列可以是摄像头当前拍摄的图像帧以及当前时间之前的预设时间段拍摄的图像帧组成的图像帧序列。In this embodiment, the execution subject of the method for detecting face images (for example, the intermediate device, terminal device, or server shown in FIG. 1) may acquire a target image frame sequence. The target image frame sequence may be the image frame sequence included in a video captured by a camera (for example, a camera included in the above-mentioned execution subject, or a camera included in an electronic device communicatively connected to the above-mentioned execution subject) of a target face (for example, the face of a person within the shooting range of the above-mentioned camera). Generally, the target image frame sequence may be composed of the image frame currently captured by the camera and the image frames captured during a preset time period before the current time.
步骤202,对于目标图像帧序列包括的每个图像帧,将该图像帧输入预先训练的人脸检测模型,得到人脸位置信息。Step 202: For each image frame included in the target image frame sequence, input the image frame into a pre-trained face detection model to obtain face position information.
在本实施例中,对于目标图像帧序列包括的每个图像帧,上述执行主体可以将该图像帧输入预先训练的人脸检测模型,得到人脸位置信息。其中,人脸检测模型用于表征图像序列和人脸位置信息的对应关系。In this embodiment, for each image frame included in the target image frame sequence, the above-mentioned execution subject may input the image frame into a pre-trained face detection model to obtain face position information. Among them, the face detection model is used to characterize the correspondence between the image sequence and the face position information.
作为示例，人脸检测模型可以是上述执行主体或其他电子设备，利用机器学习方法，将预设的训练样本集合中的训练样本包括的样本图像帧序列作为输入，将与输入的样本图像帧序列对应的样本位置信息作为期望输出，对初始模型（例如卷积神经网络、循环神经网络等）进行训练，针对每次训练输入的样本图像帧序列，可以得到实际输出。其中，实际输出是初始模型实际输出的数据，用于表征人脸图像的位置。然后，上述执行主体可以采用梯度下降法和反向传播法，基于实际输出和期望输出，调整初始模型的参数，将每次调整参数后得到的模型作为下次训练的初始模型，并在满足预设的训练结束条件的情况下，结束训练，从而训练得到人脸检测模型。这里预设的训练结束条件可以包括但不限于以下至少一项：训练时间超过预设时长；训练次数超过预设次数；利用预设的损失函数（例如交叉熵损失函数）计算所得的损失值小于预设损失值阈值。As an example, the face detection model may be obtained by the above-mentioned execution subject or another electronic device through machine learning: taking the sample image frame sequences included in the training samples of a preset training sample set as input and the sample position information corresponding to the input sample image frame sequences as expected output, an initial model (for example, a convolutional neural network or a recurrent neural network) is trained, and an actual output is obtained for each sample image frame sequence input during training. Here, the actual output is the data actually output by the initial model, which is used to characterize the position of the face image. Then, the above-mentioned execution subject may adjust the parameters of the initial model based on the actual output and the expected output using gradient descent and back propagation, take the model obtained after each parameter adjustment as the initial model for the next round of training, and end the training when a preset training end condition is met, thereby obtaining the trained face detection model. The preset training end condition may include, but is not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset number; or the loss value calculated by a preset loss function (for example, a cross-entropy loss function) is less than a preset loss value threshold.
上述初始模型可以是各种用于目标检测的模型,例如MTCNN(Multi-task convolutional neural network,多任务卷积神经网络)、RetinaFace等。The above-mentioned initial model may be various models for target detection, such as MTCNN (Multi-task Convolutional Neural Network), RetinaFace, etc.
在本实施例的一些可选的实现方式中，人脸检测模型可以包括结构为深度可分离卷积的卷积层。其中，采用深度可分离卷积（depthwise separable convolutions）结构的卷积神经网络可以降低卷积神经网络所占用的存储空间，以及能够降低卷积神经网络的计算量，从而有助于提高提取人脸图像的效率。采用深度可分离卷积结构的卷积神经网络是目前广泛研究和应用的公知技术，在此不再赘述。In some optional implementations of this embodiment, the face detection model may include a convolutional layer structured as a depthwise separable convolution. A convolutional neural network using depthwise separable convolutions can reduce the storage space occupied by the network and reduce its computational cost, thereby helping to improve the efficiency of extracting face images. Convolutional neural networks using depthwise separable convolution structures are a well-known technology that is currently widely studied and applied, and will not be described in detail here.
在本实施例的一些可选的实现方式中，人脸检测模型可以是预先利用批标准化方式训练得到模型。其中，批标准化（Batch Normalization，BN），又叫批量归一化，是一种用于改善人工神经网络的性能和稳定性的技术。采用批标准化方式训练模型，可以提升训练速度，收敛过程大大加快，另外可以简化调参过程，提高训练效率和模型处理数据的精度。In some optional implementations of this embodiment, the face detection model may be a model trained in advance using batch normalization. Batch Normalization (BN) is a technique used to improve the performance and stability of artificial neural networks. Training the model with batch normalization can increase the training speed and greatly accelerate convergence; in addition, it can simplify hyperparameter tuning and improve training efficiency and the accuracy of the model in processing data.
步骤203,基于所得到的人脸位置信息,从目标图像帧序列包括的图像帧中,确定至少一个人脸图像序列。Step 203: Determine at least one face image sequence from the image frames included in the target image frame sequence based on the obtained face position information.
在本实施例中,上述执行主体可以基于所得到的人脸位置信息,从目标图像帧序列包括的图像帧中,确定至少一个人脸图像序列。其中,每个人脸图像序列包括的人脸图像用于指示同一个人脸。In this embodiment, the above-mentioned execution subject may determine at least one face image sequence from the image frames included in the target image frame sequence based on the obtained face position information. Wherein, the face images included in each face image sequence are used to indicate the same face.
在本实施例的一些可选的实现方式中,上述执行主体可以按照如下步骤确定至少一个人脸图像序列:In some optional implementation manners of this embodiment, the above-mentioned execution subject may determine at least one face image sequence according to the following steps:
对于目标图像帧序列中的每两个相邻的图像帧,执行如下步骤:For every two adjacent image frames in the target image frame sequence, perform the following steps:
首先,确定该两个相邻的图像帧的第一图像帧中的每个人脸图像中的特征点,以及确定第一图像帧中的每个人脸图像对应的、在第二图像帧中的预测特征点。其中,第一图像帧为处于第二图像帧之前的图像帧。具体地,上述执行主体可以根据所得到的人脸位置信息,从各个图像帧中确定人脸图像,然后,利用各种方法确定人脸图像的特征点。例如采用SIFT(Scale-invariant feature transform,尺度不变特征转换)算法提取每个人脸图像的特征点。再然后,上述执行主体可以利用各种特征点预测算法(例如训练神经网络、条件随机场等),确定每个人脸图像对应的、在第二图像帧中的预测特征点。First, determine the feature point in each face image in the first image frame of the two adjacent image frames, and determine the prediction in the second image frame corresponding to each face image in the first image frame Feature points. Wherein, the first image frame is an image frame before the second image frame. Specifically, the above-mentioned execution subject may determine the face image from each image frame according to the obtained face position information, and then use various methods to determine the feature points of the face image. For example, the SIFT (Scale-invariant feature transform) algorithm is used to extract the feature points of each face image. Then, the above-mentioned execution subject can use various feature point prediction algorithms (for example, training a neural network, conditional random field, etc.) to determine the predicted feature point corresponding to each face image in the second image frame.
实践中,可以采用光流法确定人脸图像的特征点和预测特征点。其中,光流法是利用图像序列中像素在时间域上的变化以及相邻帧之间的相关性来找到相邻的两帧之间存在的对应关系,从而计算出相邻帧之间物体的运动信息的一种方法。光流法的优点在于它无须了解场景的信息,就可以准确地检测识别运动日标位置。而且光流不仅携带了运动物体的运动信息,而且还携带了有关景物三维结构的丰富信息,它能够在不知道场景的任何信息的情况下,检测出运动对象。In practice, the optical flow method can be used to determine the feature points and predict feature points of the face image. Among them, the optical flow method uses the changes in the time domain of pixels in the image sequence and the correlation between adjacent frames to find the correspondence between two adjacent frames, thereby calculating the object between adjacent frames. A method of sports information. The advantage of optical flow method is that it can accurately detect and identify the position of the moving day mark without knowing the information of the scene. Moreover, optical flow not only carries the movement information of the moving object, but also carries rich information about the three-dimensional structure of the scene. It can detect the moving object without knowing any information of the scene.
然后,从第二图像帧中的人脸图像中,确定包括的预测特征点的数量大于等 于预设数值的人脸图像作为与在第一图像帧中的对应人脸图像指示的人脸相同的人脸图像。具体地,上述执行主体可以根据第二图像帧对应的人脸位置信息确定人脸图像,以及确定每个人脸图像包括的预测特征点的数量。对于第二图像帧中的一个人脸图像,如果该人脸图像中的预测特征点的数量大于等于预设数值,且该人脸图像中的预测特征点是基于第一图像帧中的某个人脸图像生成的,则将这两个人脸图像确定为用于指示同一个人脸的人脸图像。通常,预测特征点具有对应的人脸图像标识(用于指示第一图像中的人脸图像),当第二图像中的某个人脸图像包括的预测特征点的数量大于等于预设数值时,将该人脸图像的人脸图像标识设置为与预测特征点对应的人脸图像标识。当第二图像中的某个人脸图像包括的预测特征点的数量小于预设数值时,为该人脸图像设置新的人脸图像标识。Then, from the face images in the second image frame, it is determined that the number of predicted feature points included is greater than or equal to the preset value as the face image that is the same as the face indicated by the corresponding face image in the first image frame. Face image. Specifically, the above-mentioned execution subject may determine the face image according to the face position information corresponding to the second image frame, and determine the number of predicted feature points included in each face image. For a face image in the second image frame, if the number of predicted feature points in the face image is greater than or equal to the preset value, and the predicted feature points in the face image are based on a person in the first image frame If the face image is generated, the two face images are determined as face images indicating the same face. Generally, the predicted feature point has a corresponding face image identifier (used to indicate the face image in the first image). When the number of predicted feature points included in a certain face image in the second image is greater than or equal to the preset value, The face image identifier of the face image is set as the face image identifier corresponding to the predicted feature point. When the number of predicted feature points included in a certain face image in the second image is less than the preset value, a new face image identifier is set for the face image.
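The identifier-propagation rule just described can be sketched as follows. The box format (x1, y1, x2, y2) and the helper names are illustrative assumptions; the predicted points are assumed to carry the face image identifier of the first-frame face they came from:

```python
def propagate_face_ids(boxes_frame2, predicted_points, min_points, new_id_start):
    """boxes_frame2: list of (x1, y1, x2, y2) face boxes in the second frame.
    predicted_points: list of ((x, y), face_id) pairs predicted from the first frame.
    A box inherits the dominant face_id among the predicted points it contains
    if that count reaches min_points; otherwise it is assigned a fresh identifier."""
    ids = []
    next_id = new_id_start
    for (x1, y1, x2, y2) in boxes_frame2:
        counts = {}
        for (px, py), fid in predicted_points:
            if x1 <= px <= x2 and y1 <= py <= y2:
                counts[fid] = counts.get(fid, 0) + 1
        best = max(counts, key=counts.get) if counts else None
        if best is not None and counts[best] >= min_points:
            ids.append(best)        # same face as in the first frame
        else:
            ids.append(next_id)     # a newly appearing face
            next_id += 1
    return ids
```

Boxes whose inherited identifier matches across consecutive frames are then grouped into one face image sequence.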
In some optional implementations of this embodiment, the above-mentioned execution subject may determine the at least one face image sequence according to the following steps:
For every two adjacent image frames in the target image frame sequence, determine a face image in the first image frame and a face image in the second image frame whose area overlap (also called the intersection-over-union, IoU, of their rectangles) is greater than or equal to a preset overlap threshold as face images indicating the same face.
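The IoU overlap test can be sketched directly; the (x1, y1, x2, y2) box format and the 0.5 threshold used in the example are assumptions for illustration:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero width/height if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two boxes of a slowly moving face overlap heavily between adjacent frames:
same_face = iou((0, 0, 10, 10), (1, 1, 11, 11)) >= 0.5
```

A face box pair passing the threshold is treated as the same face; this matcher trades the feature-point tracking above for a much cheaper geometric test.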
Step 204: for each face image sequence in the at least one face image sequence, determine a quality score for each face image included in the face image sequence; based on the obtained quality scores, extract a face image from the face image sequence and output it.
In this embodiment, for each face image sequence in the at least one face image sequence, the above-mentioned execution subject may first determine a quality score for each face image included in the face image sequence, and then, based on the obtained quality scores, extract a face image from the face image sequence and output it. The quality score characterizes the quality of a face image: the higher the score, the higher the quality. As an example, the face image with the highest quality score may be output as the optimal face image.
The above-mentioned execution subject may determine the quality score of a face image in various ways. As an example, it may determine the sharpness of the face image and take the sharpness as the quality score. Sharpness can be obtained with existing image sharpness algorithms, which may include, but are not limited to, at least one of the following: a pixel gradient function, a gray-level variance function, a gray-level variance product function, and the like.
The above-mentioned execution subject may extract and output the sharpest face image from the face image sequence, or extract and output a preset number of face images in descending order of sharpness.
In this embodiment, the above-mentioned execution subject may output the extracted face image in various ways. For example, the extracted face image and its identifier may be displayed on a display included in the execution subject, or the extracted face image may be sent to another electronic device communicatively connected to the execution subject.
In the method provided by the above embodiment of the present disclosure, at least one face image sequence is determined from the target image frame sequence, where each face image sequence indicates the same face; a quality score is then determined for each face image in each face image sequence, and face images are extracted and output according to the quality scores. High-quality face images are thus extracted from the target image sequence, which helps improve the accuracy of operations such as face recognition performed on the extracted face images.
With further reference to FIG. 3, a flow 300 of another embodiment of the method for detecting face images according to the present disclosure is shown. The method includes the following steps:
Step 301: obtain a target image frame sequence.
In this embodiment, step 301 is substantially the same as step 201 in the embodiment corresponding to FIG. 2, and is not described again here.
Step 302: for each image frame included in the target image frame sequence, input the image frame into a pre-trained face detection model to obtain face position information.
In this embodiment, the face detection model can determine the face position information of the face images in an input image frame, and can also be used to generate a key point information set for the input image frame. In practice, the face detection model may be a model trained based on MTCNN (Multi-task Convolutional Neural Network). Such a model includes multiple cascaded sub-models, which can be used respectively to detect face positions and to determine the key point information set. The key point information in the key point information set characterizes the positions of face key points in the face image; generally, it may include the coordinates of the face key points in the image frame. Face key points are points in a face image that characterize specific positions (for example, the eyes, nose, and mouth).
Step 303: determine at least one face image sequence from the image frames included in the target image frame sequence based on the obtained face position information.
In this embodiment, step 303 is substantially the same as step 203 in the embodiment corresponding to FIG. 2, and is not described again here.
Step 304: for each face image sequence in the at least one face image sequence, determine the face pose angle information of each face image based on the key point information set of each face image included in the face image sequence, and determine the quality score of each face image based on the face pose angle information.
In this embodiment, for each face image sequence in the at least one face image sequence, the execution subject of the method for detecting face images (for example, the intermediate device, terminal device, or server shown in FIG. 1) may perform the following steps:
Step 1: determine the face pose angle information of each face image based on the key point information set of each face image included in the face image sequence.
The face pose angle information characterizes the degree of deflection of the frontal orientation of the face relative to the camera that captures it. It may include three angles, the pitch angle, the yaw angle, and the roll angle, representing up-down rotation, left-right rotation, and in-plane rotation, respectively. As shown in FIG. 4, the x-axis, y-axis, and z-axis are the three axes of a rectangular coordinate system, where the z-axis may be the optical axis of the target camera 401, and the y-axis may be the straight line that passes through the center point of the top contour of the person's head and is perpendicular to the horizontal plane when the head is not turned sideways. The pitch angle is the rotation of the face about the x-axis, the yaw angle the rotation about the y-axis, and the roll angle the rotation about the z-axis. In the rectangular coordinate system of FIG. 4, when the head turns, a ray is determined with the origin of the coordinate system as its endpoint and passing through the midpoint of the line connecting the centers of the two eyeballs; the angles between this ray and the x-axis, y-axis, and z-axis may be determined as the frontal pose angles.
The above-mentioned execution subject may determine the face pose angle information in various ways. For example, existing face pose angle estimation methods may be used to determine the face pose angle information based on the key point information set. Face pose angle estimation methods may include, but are not limited to, at least one of the following: model-based methods, appearance-based methods, classification-based methods, and the like.
In some optional implementations of this embodiment, the above-mentioned execution subject may determine the face pose angle information of each face image according to the following steps:
First, generate a key point feature vector for each face image based on the key point information set of each face image included in the face image sequence, where the elements of the key point feature vector include the coordinates of M face key points.
As an example, assuming M is 5, for a face image the corresponding key point feature vector A may be generated as [x1, x2, x3, x4, x5, y1, y2, y3, y4, y5, b], where x1–x5 are the x-axis coordinates of the five face key points, y1–y5 are their y-axis coordinates, and b is a preset bias term, for example 0. The key point feature vector A is a 1×11 vector.
Then, multiply the generated key point feature vector by a pre-fitted feature matrix to obtain a face pose angle feature vector as the face pose angle information.
Continuing the above example, assuming the feature matrix X is an 11×3 matrix, multiplying the feature vector A by the feature matrix X yields a 1×3 vector, which is the face pose angle vector comprising the pitch angle, the yaw angle, and the roll angle.
The above feature matrix may be fitted in advance as follows:
Assume there are N sample key point feature vectors, each expressed as V = [x1, x2, x3, x4, x5, y1, y2, y3, y4, y5, 1], where the value 1 is the preset bias term. The N sample key point feature vectors are combined into a feature matrix B, which is an N×11 matrix. Each sample key point feature vector corresponds to a sample face pose angle vector (comprising the pitch, yaw, and roll angles), and the N sample pose angle vectors are combined into an N×3 matrix C. The relation B×X = C is established; since B and C are known, X can be obtained by solving this relation with the least squares method.
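The least squares fit of B×X = C can be sketched with NumPy. The synthetic data below is purely illustrative: in practice B would hold annotated key point coordinates and C the corresponding annotated (pitch, yaw, roll) angles, and the made-up X_true stands in for the unknown ground-truth mapping:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
B = np.hstack([rng.uniform(0, 112, size=(N, 10)),   # 5 x-coords and 5 y-coords per sample
               np.ones((N, 1))])                    # preset bias term (the trailing 1)
X_true = rng.normal(size=(11, 3))                   # hypothetical ground-truth mapping
C = B @ X_true                                      # N x 3 sample pose angle vectors

# Solve B @ X = C in the least squares sense for the 11 x 3 feature matrix X.
X, residuals, rank, _ = np.linalg.lstsq(B, C, rcond=None)

# A new 1 x 11 key point feature vector A then maps to its pose angles:
A = np.append(rng.uniform(0, 112, size=10), 1.0)
pose = A @ X                                        # [pitch, yaw, roll]
```

Because the fit is a single linear mapping, pose estimation at inference time reduces to one small matrix multiplication, which is far cheaper than a dedicated pose network.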
Step 2: determine the quality score of each face image based on the face pose angle information.
Specifically, the above-mentioned execution subject may determine the quality score of each face image using preset weights corresponding respectively to the three angles included in the face pose angle information. As an example, the quality score may be determined according to the following formula:
score1 = 0.2 × (15 − abs(roll)) + 0.5 × (15 − abs(yaw)) + 0.3 × (15 − abs(pitch))        Formula (1)
Here, score1 is the quality score of the face image; pitch, yaw, and roll are the pitch, yaw, and roll angles, respectively; 0.2, 0.5, and 0.3 are the weights corresponding to the three angles; abs() takes the absolute value of the enclosed angle; and 15 is the set angle threshold, i.e., when a pose angle exceeds 15 degrees, the corresponding term becomes negative.
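Formula (1) translates directly into code. The function and parameter names below are illustrative; the weights and the 15-degree threshold are the example values given above:

```python
def pose_quality_score(pitch, yaw, roll, threshold=15.0,
                       w_roll=0.2, w_yaw=0.5, w_pitch=0.3):
    """Formula (1): each pose angle contributes its weighted margin below the
    15-degree threshold, and a term goes negative once its angle exceeds it."""
    return (w_roll * (threshold - abs(roll))
            + w_yaw * (threshold - abs(yaw))
            + w_pitch * (threshold - abs(pitch)))

# A perfectly frontal face scores 15; a strongly yawed face is penalized most,
# consistent with yaw carrying the largest weight (0.5).
frontal = pose_quality_score(pitch=0, yaw=0, roll=0)
turned = pose_quality_score(pitch=0, yaw=30, roll=0)
```

With these example weights, a face yawed 30 degrees loses the entire frontal margin, so near-frontal faces dominate the ranking.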
In some optional implementations of this embodiment, the above-mentioned execution subject may determine the quality score of a face image according to the following steps:
First, determine the sharpness of each face image based on its key point information set. Sharpness can be obtained with existing image sharpness algorithms, which may include, but are not limited to, at least one of the following: a pixel gradient function, a gray-level variance function, a gray-level variance product function, and the like. Generally, the sharpness can be normalized to the interval [0, 1].
Then, determine the quality score of each face image using the face pose angle information and the sharpness. Specifically, the above-mentioned execution subject may use the face pose angle information to determine a first score, score1, according to Formula (1) above, take the sharpness as a second score, score2, and determine the quality score of the face image based on preset weights. As an example, the quality score of a face image may be determined according to the following Formula (2):
score = 0.6 × score1 + 0.4 × score2        Formula (2)
Here, score is the quality score, and 0.6 and 0.4 are the preset weights.
It should be noted that the first score and the second score share the same numerical range, for example both in [0, 1] or both in [0, 100].
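Formula (2) is then a straightforward weighted blend. The sketch below assumes both scores have already been normalized to the same range, as the note above requires, and the candidate score pairs are made-up examples:

```python
def combined_quality_score(score1, score2, w_pose=0.6, w_sharpness=0.4):
    """Formula (2): blend the pose-angle score and the sharpness score.
    Both inputs must share the same numerical range, e.g. [0, 1]."""
    return w_pose * score1 + w_sharpness * score2

# Picking the best face image in a sequence by combined score:
candidates = [(0.9, 0.4), (0.5, 0.95)]   # (score1, score2) per face image
best = max(candidates, key=lambda s: combined_quality_score(*s))
```

Here the well-posed but moderately blurry face (0.9, 0.4) wins over the sharp but turned one, reflecting the heavier 0.6 weight on pose.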
In some optional implementations of this embodiment, the above-mentioned execution subject may determine the sharpness of each face image according to the following steps:
First, extract target key point information from the key point information set of each face image. The target key point information may be preset key point information characterizing specific positions of the face; generally, it may be the key point information indicating the person's eyes and mouth.
Then, based on the target key point information, determine a target region in each face image, and determine the average pixel gradient of the pixels included in the target region. The target region may be a region that includes the face key points indicated by the target key point information, for example, the smallest rectangle containing those key points.
The above-mentioned execution subject may use existing methods for determining pixel gradients to determine the pixel gradient of each pixel in the target region, and average the determined pixel gradients to obtain the average pixel gradient.
Finally, determine the sharpness of each face image based on the average pixel gradient. Specifically, the sum S of the averages of the horizontal and vertical gradients of each pixel in the target region can be computed, and then the average gradient avg_g = S / (w × h × 255.0), where w and h are the width and height of the target region.
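The avg_g computation can be sketched as follows. Forward differences are an assumption here (the text does not fix the gradient operator), and the tiny test patches are illustrative:

```python
def sharpness_from_avg_gradient(region):
    """Average-gradient sharpness of a grayscale region (list of rows, values
    in 0-255). For each pixel, take the mean of its horizontal and vertical
    forward-difference gradients, sum these means into S, and normalize:
    avg_g = S / (w * h * 255.0), so a flat patch scores 0."""
    h = len(region)
    w = len(region[0])
    S = 0.0
    for y in range(h):
        for x in range(w):
            gx = abs(region[y][x + 1] - region[y][x]) if x + 1 < w else 0.0
            gy = abs(region[y + 1][x] - region[y][x]) if y + 1 < h else 0.0
            S += (gx + gy) / 2.0
    return S / (w * h * 255.0)

flat = sharpness_from_avg_gradient([[128, 128], [128, 128]])   # no edges
edges = sharpness_from_avg_gradient([[0, 255], [255, 0]])      # maximal contrast
```

Dividing by 255.0 keeps avg_g roughly within [0, 1], matching the normalized range expected by Formula (2).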
In the prior art, improving the accuracy of face image detection usually requires detecting images with a relatively deep neural network; if the network is too deep, extracting image features takes longer and overall inference slows down. Because the above optional implementation determines the quality score of a face image by combining the face pose angles with the image sharpness, it processes images faster and occupies fewer hardware resources than a deeper neural network. Therefore, the steps and optional implementations of the embodiments of the present disclosure can be combined and applied at the front end of a face image detection system (for example, the terminal device or intermediate device shown in FIG. 1), reducing the load on the back-end server.
Step 3: based on the obtained quality scores, extract a face image from the face image sequence and output it.
Step 3 is substantially the same as the method of extracting and outputting a face image in step 204 of the embodiment corresponding to FIG. 2, and is not described again here.
As can be seen from FIG. 3, compared with the embodiment corresponding to FIG. 2, the flow 300 of the method for detecting face images in this embodiment highlights the step of determining the quality score of each face image based on the face pose angle information. The face pose angle information can thus be used to further improve the accuracy of the determined quality scores, which helps further improve the quality of the extracted face images.
进一步参考图5,作为对上述各图所示方法的实现,本公开提供了一种用于检测人脸图像的装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a device for detecting a face image. The device embodiment corresponds to the method embodiment shown in FIG. 2 , The device can be specifically applied to various electronic equipment.
如图5所示,本实施例的用于检测人脸图像的装置500包括:获取模块501,用于获取目标图像帧序列;生成模块502,用于对于目标图像帧序列包括的每个图像帧,将该图像帧输入预先训练的人脸检测模型,得到人脸位置信息,其中,人脸位置信息用于表征人脸图像在该图像帧中的位置;确定模块503,用于基于所得到的人脸位置信息,从目标图像帧序列包括的图像帧中,确定至少一个人脸图像序列,其中,每个人脸图像序列包括的人脸图像用于指示同一个人脸;输出模块504,用于对于至少一个人脸图像序列中的每个人脸图像序列,确定该人脸图像序列包括的每个人脸图像的质量评分;基于所得到的质量评分,从该人脸图像序列中提取人脸图像及输出。As shown in FIG. 5, the apparatus 500 for detecting a face image of this embodiment includes: an acquisition module 501, which is used to acquire a target image frame sequence; and a generation module 502, which is used for each image frame included in the target image frame sequence. , Input the image frame into a pre-trained face detection model to obtain face position information, where the face position information is used to characterize the position of the face image in the image frame; the determination module 503 is used to The face position information is used to determine at least one face image sequence from the image frames included in the target image frame sequence, where the face image included in each face image sequence is used to indicate the same face; the output module 504 is used to For each face image sequence in at least one face image sequence, determine the quality score of each face image included in the face image sequence; based on the obtained quality score, extract the face image from the face image sequence and output .
在本实施例中,用于检测人脸图像的方法的获取模块501可以获取目标图像帧序列。其中,目标图像帧序列可以是摄像头对目标人脸(例如即上述摄像头的拍摄范围内的人物的人脸)拍摄的视频包括的图像帧序列。通常,目标图像帧可以是摄像头当前拍摄的图像帧以及当前时间之前的预设时间段拍摄的图像帧组成的图像帧序列。In this embodiment, the acquiring module 501 of the method for detecting a face image can acquire a target image frame sequence. The target image frame sequence may be a sequence of image frames included in a video captured by the camera on the target face (for example, the face of a person within the shooting range of the camera). Generally, the target image frame may be an image frame sequence composed of an image frame currently shot by the camera and an image frame shot by a preset time period before the current time.
在本实施例中,对于目标图像帧序列包括的每个图像帧,上述生成模块502可以将该图像帧输入预先训练的人脸检测模型,得到人脸位置信息。其中,人脸检测模型用于表征图像序列和人脸位置信息的对应关系。In this embodiment, for each image frame included in the target image frame sequence, the aforementioned generating module 502 can input the image frame into a pre-trained face detection model to obtain face position information. Among them, the face detection model is used to characterize the correspondence between the image sequence and the face position information.
作为示例,人脸检测模型可以是上述装置500或其他电子设备,利用机器学习方法,将预设的训练样本集合中的训练样本包括的样本图像帧序列作为输入,将与输入的样本图像帧序列对应的样本位置信息作为期望输出,对初始模型(例 如卷积神经网络、循环神经网络等)进行训练,针对每次训练输入的样本图像帧序列,可以得到实际输出。其中,实际输出是初始模型实际输出的数据,用于表征人脸图像的位置。然后,用于训练人脸检测模型的执行主体可以采用梯度下降法和反向传播法,基于实际输出和期望输出,调整初始模型的参数,将每次调整参数后得到的模型作为下次训练的初始模型,并在满足预设的训练结束条件的情况下,结束训练,从而训练得到语音识别模型。这里预设的训练结束条件可以包括但不限于以下至少一项:训练时间超过预设时长;训练次数超过预设次数;利用预设的损失函数(例如交叉熵损失函数)计算所得的损失值小于预设损失值阈值。As an example, the face detection model may be the aforementioned device 500 or other electronic equipment, using a machine learning method to take as input the sample image frame sequence included in the training samples in the preset training sample set, and compare it with the input sample image frame sequence The corresponding sample position information is used as the expected output, and the initial model (for example, convolutional neural network, cyclic neural network, etc.) is trained, and the actual output can be obtained for the sample image frame sequence input for each training. Among them, the actual output is the data actually output by the initial model, which is used to characterize the position of the face image. Then, the executive body used to train the face detection model can use the gradient descent method and the back propagation method to adjust the parameters of the initial model based on the actual output and the expected output, and use the model obtained after each adjustment of the parameters as the next training Initial model, and when the preset training termination conditions are met, the training ends, so that the training obtains the speech recognition model. The preset training end conditions here may include but are not limited to at least one of the following: training time exceeds the preset duration; training times exceeds the preset number of times; the loss value calculated by using the preset loss function (for example, the cross-entropy loss function) is less than Preset loss value threshold.
上述初始模型可以是各种用于目标检测的模型,例如MTCNN(Multi-task convolutional neural network,多任务卷积神经网络)、RetinaFace等。The above-mentioned initial model may be various models for target detection, such as MTCNN (Multi-task Convolutional Neural Network), RetinaFace, etc.
在本实施例中,确定模块503可以基于所得到的人脸位置信息,按照各种方式从目标图像帧序列包括的图像帧中,确定至少一个人脸图像序列。其中,每个人脸图像序列包括的人脸图像用于指示同一个人脸。In this embodiment, the determining module 503 may determine at least one face image sequence from the image frames included in the target image frame sequence in various ways based on the obtained face position information. Wherein, the face images included in each face image sequence are used to indicate the same face.
在本实施例中,对于上述至少一个人脸图像序列中的每个人脸图像序列,上述输出模块504可以首先确定该人脸图像序列包括的每个人脸图像的质量评分。然后,基于所得到的质量评分,从该人脸图像序列中提取人脸图像及输出。其中,人脸图像的质量评分可以用于表征人脸图像的质量,即,质量评分越高,表示人脸图像的质量越高。通常,可以将质量评分最大的人脸图像作为最优人脸图像输出。In this embodiment, for each face image sequence in the at least one face image sequence, the output module 504 may first determine the quality score of each face image included in the face image sequence. Then, based on the obtained quality score, the face image is extracted from the face image sequence and output. Among them, the quality score of the face image can be used to characterize the quality of the face image, that is, the higher the quality score, the higher the quality of the face image. Generally, the face image with the highest quality score can be output as the optimal face image.
上述输出模块504可以按照各种方法确定人脸图像的质量评分。作为示例,上述输出模块504可以确定人脸图像的清晰度,将清晰度确定为质量评分。其中,清晰度可以利用现有的确定图像清晰度的算法得到。例如,确定图像清晰度的算法可以包括但不限于以下至少一种:像素梯度函数、灰度方差函数、灰度方差乘积函数等。The aforementioned output module 504 can determine the quality score of the face image according to various methods. As an example, the aforementioned output module 504 may determine the sharpness of the face image, and determine the sharpness as the quality score. Among them, the definition can be obtained by using existing algorithms for determining the definition of an image. For example, the algorithm for determining the definition of an image may include, but is not limited to, at least one of the following: pixel gradient function, gray-scale variance function, gray-scale variance product function, and the like.
上述输出模块504可以从人脸图像序列中,提取清晰度最大的人脸图像并输出。或者,按照清晰度由大到小的顺序,提取预设数量个人脸图像并输出。The above-mentioned output module 504 can extract and output the face image with the greatest clarity from the face image sequence. Or, extract and output a preset number of face images in descending order of sharpness.
In some optional implementations of this embodiment, the determining module 503 is further configured to: for every two adjacent image frames in the target image frame sequence, determine the feature points of each face image in the first of the two adjacent frames, and determine the predicted feature points in the second frame corresponding to each face image in the first frame; then, among the face images in the second frame, determine a face image containing a number of predicted feature points greater than or equal to a preset value as a face image indicating the same face as the corresponding face image in the first frame.
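This matching rule can be sketched as follows, assuming the predicted feature points have already been computed (for example by optical flow, which this sketch omits); the names and the counting threshold are illustrative, not the embodiment's parameters:

```python
def count_points_in_box(points, box):
    # Count predicted feature points falling inside an (x1, y1, x2, y2) box.
    x1, y1, x2, y2 = box
    return sum(1 for (x, y) in points if x1 <= x <= x2 and y1 <= y <= y2)

def match_faces(predicted_points_per_face, boxes_frame2, min_points=3):
    """Map each face index in the first frame to the index of the first box
    in the second frame containing at least min_points predicted points."""
    matches = {}
    for i, pts in enumerate(predicted_points_per_face):
        for j, box in enumerate(boxes_frame2):
            if count_points_in_box(pts, box) >= min_points:
                matches[i] = j
                break
    return matches
```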
In some optional implementations of this embodiment, the determining module 503 is further configured to: for every two adjacent image frames in the target image frame sequence, determine a face image in the first frame and a face image in the second frame whose area coincidence is greater than or equal to a preset coincidence threshold as face images indicating the same face.
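A common way to realize such an area-coincidence test is intersection-over-union of the two face boxes; the sketch below assumes (x1, y1, x2, y2) boxes and an illustrative threshold, not the embodiment's specific parameters:

```python
def overlap_ratio(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def same_face(box1, box2, threshold=0.5):
    # Boxes in adjacent frames are taken to show the same face when
    # their area coincidence reaches the threshold.
    return overlap_ratio(box1, box2) >= threshold
```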
In some optional implementations of this embodiment, the face detection model is further used to generate a key point information set for an image frame, where key point information characterizes the positions of facial key points in a face image. The output module 504 includes: a first determining unit (not shown), configured to determine face pose angle information for each face image based on the key point information set of each face image included in the face image sequence; and a second determining unit (not shown), configured to determine the quality score of each face image based on the face pose angle information.
In some optional implementations of this embodiment, the first determining unit (not shown) includes: a first generating subunit (not shown), configured to generate a key point feature vector for each face image based on its key point information set; and a second generating subunit (not shown), configured to multiply the generated key point feature vector by a pre-fitted feature matrix to obtain a face pose angle feature vector as the face pose angle information.
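The linear mapping described here (a keypoint feature vector multiplied by a pre-fitted matrix) can be sketched as below; the matrix contents and the (yaw, pitch, roll) reading of the output are assumptions for illustration, not the disclosed fitted parameters:

```python
import numpy as np

def pose_from_keypoints(keypoints, fitted_matrix):
    """keypoints: (N, 2) array of key point coordinates; fitted_matrix:
    (2N, 3) pre-fitted matrix. Returns a pose angle feature vector,
    here read as (yaw, pitch, roll)."""
    feature = np.asarray(keypoints, dtype=float).reshape(-1)
    return feature @ fitted_matrix
```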
In some optional implementations of this embodiment, the second determining unit includes: a first determining subunit (not shown), configured to determine the sharpness of each face image based on its key point information set; and a second determining subunit (not shown), configured to determine the quality score of each face image using the face pose angle information and the sharpness.
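One plausible way to combine pose angle information and sharpness into a single quality score is a weighted trade-off; the absolute-angle penalty and the weights below are illustrative assumptions, not the disclosed formula:

```python
def quality_score(pose_angles, sharpness, w_pose=1.0, w_sharp=1.0):
    # Penalize large head rotations and reward sharpness, so frontal,
    # sharp faces score highest. Weights are illustrative.
    yaw, pitch, roll = pose_angles
    pose_penalty = abs(yaw) + abs(pitch) + abs(roll)
    return w_sharp * sharpness - w_pose * pose_penalty
```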
In some optional implementations of this embodiment, the first determining subunit includes: an extraction submodule (not shown), configured to extract target key point information from the key point information set of each face image; a first determining submodule (not shown), configured to determine a target region in each face image based on the target key point information and to determine the average pixel gradient of the pixels included in the target region; and a second determining submodule (not shown), configured to determine the sharpness of each face image based on the average pixel gradient.
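The target-region average pixel gradient could be computed as in this sketch, assuming the target region is taken as the bounding box of the target key points; both that choice and the function name are illustrative:

```python
import numpy as np

def region_sharpness(gray, target_keypoints):
    # Take the target region as the bounding box of the target key points
    # (given as (x, y) pairs), then use the mean pixel-gradient magnitude
    # inside that region as the sharpness value.
    pts = np.asarray(target_keypoints, dtype=float)
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0) + 1
    region = gray[int(y1):int(y2), int(x1):int(x2)].astype(float)
    gy, gx = np.gradient(region)
    return float(np.mean(np.hypot(gx, gy)))
```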
In some optional implementations of this embodiment, the face detection model includes a convolutional layer structured as a depthwise separable convolution.
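For reference, a depthwise separable convolution factors a standard convolution into a per-channel (depthwise) convolution followed by a 1x1 (pointwise) convolution, which is what reduces its computational cost; a minimal NumPy sketch with valid padding and stride 1, not tied to any particular deep-learning framework:

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """x: (C, H, W) input; depthwise_k: (C, kh, kw), one filter per input
    channel; pointwise_w: (C_out, C) 1x1 convolution weights."""
    c, h, w = x.shape
    _, kh, kw = depthwise_k.shape
    oh, ow = h - kh + 1, w - kw + 1
    depthwise = np.zeros((c, oh, ow))
    for ch in range(c):  # depthwise: each channel convolved independently
        for i in range(oh):
            for j in range(ow):
                patch = x[ch, i:i + kh, j:j + kw]
                depthwise[ch, i, j] = np.sum(patch * depthwise_k[ch])
    # Pointwise: a 1x1 convolution that mixes channels.
    out = pointwise_w @ depthwise.reshape(c, -1)
    return out.reshape(-1, oh, ow)
```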
In some optional implementations of this embodiment, the face detection model is trained in advance using batch normalization.
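Batch normalization, as used in such training, normalizes each feature over the mini-batch and then applies learned scale and shift parameters; a minimal sketch for (N, D) activations:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization for x of shape (N, D): normalize
    each feature over the batch, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```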
The apparatus provided by the above embodiment of the present disclosure determines at least one face image sequence from a target image frame sequence, where each face image sequence indicates the same face; it then determines a quality score for each face image in each sequence and, according to the quality scores, extracts and outputs face images. High-quality face images are thereby extracted from the target image sequence, which helps improve the accuracy of subsequent operations, such as face recognition, performed with the extracted images.
Reference is now made to FIG. 6, which shows a schematic structural diagram of a computer system 600 suitable for implementing an electronic device of the embodiments of the present disclosure. The electronic device shown in FIG. 6 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present disclosure.
As shown in FIG. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores the various programs and data required for the operation of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a liquid crystal display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.
In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above-described functions defined in the method of the present disclosure are performed.
It should be noted that the computer-readable storage medium described in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in combination with, an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted by any suitable medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules involved in the embodiments described in the present disclosure may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, a processor may be described as including an acquisition module, a generation module, a determining module, and an output module. The names of these modules do not, in some cases, limit the modules themselves; for example, the acquisition module may also be described as "a module for acquiring a target image frame sequence".
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable storage medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: acquire a target image frame sequence; for each image frame included in the target image frame sequence, input the image frame into a pre-trained face detection model to obtain face position information; based on the obtained face position information, determine at least one face image sequence from the image frames included in the target image frame sequence, where the face images included in each face image sequence indicate the same face; for each face image sequence in the at least one face image sequence, determine a quality score for each face image included in that sequence; and, based on the obtained quality scores, extract a face image from the sequence and output it.
The above description comprises only preferred embodiments of the present disclosure and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of the disclosure involved is not limited to technical solutions formed by the specific combination of the above technical features; it should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in this application.

Claims (12)

  1. A method for detecting a face image, characterized in that the method comprises: acquiring a target image frame sequence; for each image frame included in the target image frame sequence, inputting the image frame into a pre-trained face detection model to obtain face position information; based on the obtained face position information, determining at least one face image sequence from the image frames included in the target image frame sequence, wherein the face images included in each face image sequence indicate the same face; for each face image sequence in the at least one face image sequence, determining a quality score of each face image included in the face image sequence; and, based on the obtained quality scores, extracting a face image from the face image sequence and outputting it.
  2. The method according to claim 1, wherein determining at least one face image sequence from the image frames included in the target image frame sequence based on the obtained face position information comprises: for every two adjacent image frames in the target image frame sequence, determining the feature points of each face image in the first of the two adjacent frames, and determining the predicted feature points in the second frame corresponding to each face image in the first frame; and, among the face images in the second frame, determining a face image containing a number of predicted feature points greater than or equal to a preset value as a face image indicating the same face as the corresponding face image in the first frame.
  3. The method according to claim 1, wherein determining at least one face image sequence from the image frames included in the target image frame sequence based on the obtained face position information comprises: for every two adjacent image frames in the target image frame sequence, determining a face image in the first frame and a face image in the second frame whose area coincidence is greater than or equal to a preset coincidence threshold as face images indicating the same face.
  4. The method according to claim 1, wherein the face detection model is further used to generate a key point information set for an image frame, the key point information characterizing the positions of facial key points in a face image; and wherein determining the quality score of each face image included in the face image sequence comprises: determining face pose angle information for each face image based on the key point information set of each face image included in the face image sequence; and determining the quality score of each face image based on the face pose angle information.
  5. The method according to claim 4, wherein determining the face pose angle information of each face image based on the key point information set of each face image included in the face image sequence comprises: generating a key point feature vector corresponding to each face image based on its key point information set; and multiplying the generated key point feature vector by a pre-fitted feature matrix to obtain a face pose angle feature vector as the face pose angle information.
  6. The method according to claim 4, wherein determining the quality score of each face image based on the face pose angle information comprises: determining the sharpness of each face image based on its key point information set; and determining the quality score of each face image using the face pose angle information and the sharpness.
  7. The method according to claim 6, wherein determining the sharpness of each face image based on its key point information set comprises: extracting target key point information from the key point information set of each face image; determining a target region in each face image based on the target key point information, and determining the average pixel gradient of the pixels included in the target region; and determining the sharpness of each face image based on the average pixel gradient.
  8. The method according to any one of claims 1-7, wherein the face detection model comprises a convolutional layer structured as a depthwise separable convolution.
  9. The method according to any one of claims 1-7, wherein the face detection model is trained in advance using batch normalization.
  10. An apparatus for detecting a face image, characterized in that the apparatus comprises: an acquisition module, configured to acquire a target image frame sequence; a generation module, configured to, for each image frame included in the target image frame sequence, input the image frame into a pre-trained face detection model to obtain face position information, the face position information characterizing the position of a face image in the image frame; a determining module, configured to determine, based on the obtained face position information, at least one face image sequence from the image frames included in the target image frame sequence, wherein the face images included in each face image sequence indicate the same face; and an output module, configured to, for each face image sequence in the at least one face image sequence, determine a quality score of each face image included in the face image sequence and, based on the obtained quality scores, extract a face image from the face image sequence and output it.
  11. An electronic device, comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-9.
  12. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-9.
PCT/CN2019/096575 2019-06-03 2019-07-18 Face image detection method and apparatus WO2020244032A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910475881.5A CN110276277A (en) 2019-06-03 2019-06-03 Method and apparatus for detecting facial image
CN201910475881.5 2019-06-03

Publications (1)

Publication Number Publication Date
WO2020244032A1 (en)

Family

ID=67960421


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528903A (en) * 2020-12-18 2021-03-19 平安银行股份有限公司 Face image acquisition method and device, electronic equipment and medium
CN112541433A (en) * 2020-12-11 2021-03-23 中国电子技术标准化研究院 Two-stage human eye pupil accurate positioning method based on attention mechanism
CN112560725A (en) * 2020-12-22 2021-03-26 四川云从天府人工智能科技有限公司 Key point detection model, detection method and device thereof and computer storage medium
CN112597944A (en) * 2020-12-29 2021-04-02 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN112633250A (en) * 2021-01-05 2021-04-09 北京经纬信息技术有限公司 Face recognition detection experimental method and device
CN112651321A (en) * 2020-12-21 2021-04-13 浙江商汤科技开发有限公司 File processing method and device and server
CN112926542A (en) * 2021-04-09 2021-06-08 博众精工科技股份有限公司 Performance detection method and device, electronic equipment and storage medium
CN113379877A (en) * 2021-06-08 2021-09-10 北京百度网讯科技有限公司 Face video generation method and device, electronic equipment and storage medium
CN113489897A (en) * 2021-06-28 2021-10-08 杭州逗酷软件科技有限公司 Image processing method and related device
CN113536900A (en) * 2021-05-31 2021-10-22 浙江大华技术股份有限公司 Method and device for evaluating quality of face image and computer readable storage medium
CN113627394A (en) * 2021-09-17 2021-11-09 平安银行股份有限公司 Face extraction method and device, electronic equipment and readable storage medium
CN113627290A (en) * 2021-07-27 2021-11-09 歌尔科技有限公司 Sound box control method and device, sound box and readable storage medium
CN116432152A (en) * 2023-04-18 2023-07-14 山东广电信通网络运营有限公司 Cross-platform collaborative manufacturing system

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796108B (en) * 2019-11-04 2022-05-17 北京锐安科技有限公司 Method, device and equipment for detecting face quality and storage medium
CN110688994B (en) * 2019-12-10 2020-03-27 南京甄视智能科技有限公司 Human face detection method and device based on cross-over ratio and multi-model fusion and computer readable storage medium
CN113158706A (en) * 2020-01-07 2021-07-23 北京地平线机器人技术研发有限公司 Face snapshot method, device, medium and electronic equipment
CN111310562B (en) * 2020-01-10 2020-11-27 中国平安财产保险股份有限公司 Vehicle driving risk management and control method based on artificial intelligence and related equipment thereof
CN112188091B (en) * 2020-09-24 2022-05-06 北京达佳互联信息技术有限公司 Face information identification method and device, electronic equipment and storage medium
CN112183490A (en) * 2020-11-04 2021-01-05 北京澎思科技有限公司 Face snapshot picture filing method and device
CN112418098A (en) * 2020-11-24 2021-02-26 深圳云天励飞技术股份有限公司 Training method of video structured model and related equipment
WO2022133993A1 (en) * 2020-12-25 2022-06-30 京东方科技集团股份有限公司 Method and device for performing face registration on the basis of video data, and electronic whiteboard
CN115066712A (en) * 2020-12-28 2022-09-16 京东方科技集团股份有限公司 Identity recognition method, terminal, server and system
CN112954450B (en) * 2021-02-02 2022-06-17 北京字跳网络技术有限公司 Video processing method and device, electronic equipment and storage medium
CN113283319A (en) * 2021-05-13 2021-08-20 Oppo广东移动通信有限公司 Method and device for evaluating face ambiguity, medium and electronic equipment
CN113571051A (en) * 2021-06-11 2021-10-29 天津大学 Voice recognition system and method for lip voice activity detection and result error correction
CN113486829B (en) * 2021-07-15 2023-11-07 京东科技控股股份有限公司 Face living body detection method and device, electronic equipment and storage medium
CN113674224A (en) * 2021-07-29 2021-11-19 浙江大华技术股份有限公司 Monitoring point position management method and device
CN113793368A (en) * 2021-09-29 2021-12-14 北京朗达和顺科技有限公司 Video face privacy method based on optical flow
CN114332082B (en) * 2022-03-07 2022-05-27 飞狐信息技术(天津)有限公司 Definition evaluation method and device, electronic equipment and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090403A (en) * 2016-11-22 2018-05-29 上海银晨智能识别科技有限公司 Face dynamic identification method and system based on 3D convolutional neural network
CN109657612A (en) * 2018-12-19 2019-04-19 苏州纳智天地智能科技有限公司 A kind of quality-ordered system and its application method based on facial image feature
CN109753917A (en) * 2018-12-29 2019-05-14 中国科学院重庆绿色智能技术研究院 Face quality optimization method, system, computer readable storage medium and equipment
CN109784230A (en) * 2018-12-29 2019-05-21 中国科学院重庆绿色智能技术研究院 A kind of facial video image quality optimization method, system and equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012034174A1 (en) * 2010-09-14 2012-03-22 Dynamic Digital Depth Research Pty Ltd A method for enhancing depth maps
CN104517104B (en) * 2015-01-09 2018-08-10 苏州科达科技股份有限公司 A kind of face identification method and system based under monitoring scene
CN108256477B (en) * 2018-01-17 2023-04-07 百度在线网络技术(北京)有限公司 Method and device for detecting human face

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541433A (en) * 2020-12-11 2021-03-23 中国电子技术标准化研究院 Two-stage human eye pupil accurate positioning method based on attention mechanism
CN112541433B (en) * 2020-12-11 2024-04-19 中国电子技术标准化研究院 Two-stage human eye pupil accurate positioning method based on attention mechanism
CN112528903B (en) * 2020-12-18 2023-10-31 平安银行股份有限公司 Face image acquisition method and device, electronic equipment and medium
CN112528903A (en) * 2020-12-18 2021-03-19 平安银行股份有限公司 Face image acquisition method and device, electronic equipment and medium
CN112651321A (en) * 2020-12-21 2021-04-13 浙江商汤科技开发有限公司 File processing method and device and server
CN112560725A (en) * 2020-12-22 2021-03-26 四川云从天府人工智能科技有限公司 Key point detection model, detection method and device thereof and computer storage medium
CN112597944B (en) * 2020-12-29 2024-06-11 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN112597944A (en) * 2020-12-29 2021-04-02 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN112633250A (en) * 2021-01-05 2021-04-09 北京经纬信息技术有限公司 Face recognition detection experimental method and device
CN112926542A (en) * 2021-04-09 2021-06-08 博众精工科技股份有限公司 Performance detection method and device, electronic equipment and storage medium
CN112926542B (en) * 2021-04-09 2024-04-30 博众精工科技股份有限公司 Performance detection method and device, electronic equipment and storage medium
CN113536900A (en) * 2021-05-31 2021-10-22 浙江大华技术股份有限公司 Method and device for evaluating quality of face image and computer readable storage medium
CN113379877A (en) * 2021-06-08 2021-09-10 北京百度网讯科技有限公司 Face video generation method and device, electronic equipment and storage medium
CN113379877B (en) * 2021-06-08 2023-07-28 北京百度网讯科技有限公司 Face video generation method and device, electronic equipment and storage medium
CN113489897A (en) * 2021-06-28 2021-10-08 杭州逗酷软件科技有限公司 Image processing method and related device
CN113627290A (en) * 2021-07-27 2021-11-09 歌尔科技有限公司 Sound box control method and device, sound box and readable storage medium
CN113627394B (en) * 2021-09-17 2023-11-17 平安银行股份有限公司 Face extraction method and device, electronic equipment and readable storage medium
CN113627394A (en) * 2021-09-17 2021-11-09 平安银行股份有限公司 Face extraction method and device, electronic equipment and readable storage medium
CN116432152A (en) * 2023-04-18 2023-07-14 山东广电信通网络运营有限公司 Cross-platform collaborative manufacturing system

Also Published As

Publication number Publication date
CN110276277A (en) 2019-09-24

Similar Documents

Publication Publication Date Title
WO2020244032A1 (en) Face image detection method and apparatus
WO2020199931A1 (en) Face key point detection method and apparatus, and storage medium and electronic device
US11928800B2 (en) Image coordinate system transformation method and apparatus, device, and storage medium
WO2018019126A1 (en) Video category identification method and device, data processing device and electronic apparatus
US11238272B2 (en) Method and apparatus for detecting face image
WO2020140723A1 (en) Method, apparatus and device for detecting dynamic facial expression, and storage medium
WO2018188453A1 (en) Method for determining human face area, storage medium, and computer device
US20220270348A1 (en) Face recognition method and apparatus, computer device, and storage medium
WO2019238114A1 (en) Three-dimensional dynamic model reconstruction method, apparatus and device, and storage medium
CN111160202B (en) Identity verification method, device, equipment and storage medium based on AR equipment
WO2022041830A1 (en) Pedestrian re-identification method and device
CN112149615B (en) Face living body detection method, device, medium and electronic equipment
WO2021083069A1 (en) Method and device for training face swapping model
CN112132847A (en) Model training method, image segmentation method, device, electronic device and medium
WO2016165614A1 (en) Method for expression recognition in instant video and electronic equipment
WO2020124994A1 (en) Liveness detection method and apparatus, electronic device, and storage medium
WO2020124993A1 (en) Liveness detection method and apparatus, electronic device, and storage medium
CN111222459A (en) Visual angle-independent video three-dimensional human body posture identification method
CN111931544B (en) Living body detection method, living body detection device, computing equipment and computer storage medium
CN113723306B (en) Push-up detection method, push-up detection device and computer readable medium
KR102494811B1 (en) Apparatus and method for realtime gaze tracking based on eye landmarks
KR20230166840A (en) Method for tracking object movement path based on artificial intelligence
CN109493349B (en) Image feature processing module, augmented reality equipment and corner detection method
JP2023512359A (en) Associated object detection method and apparatus
CN112149598A (en) Side face evaluation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19931824

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19931824

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27.05.2022)