WO2023137905A1 - Image processing method and apparatus, and electronic device and storage medium - Google Patents


Info

Publication number
WO2023137905A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
key point
image
uncertainty
target face
Prior art date
Application number
PCT/CN2022/090297
Other languages
French (fr)
Chinese (zh)
Inventor
胡显
易军
邓巍
Original Assignee
小米科技(武汉)有限公司
北京小米移动软件有限公司
北京小米松果电子有限公司
Priority date
Filing date
Publication date
Application filed by 小米科技(武汉)有限公司, 北京小米移动软件有限公司, 北京小米松果电子有限公司
Publication of WO2023137905A1 publication Critical patent/WO2023137905A1/en


Classifications

    • G06T 7/73 — Determining position or orientation of objects or cameras using feature-based methods
    • G06N 3/04 — Neural network architecture, e.g. interconnection topology
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06T 7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 — Image or video recognition or understanding using neural networks
    • G06V 20/40 — Scenes; scene-specific elements in video content
    • G06V 40/16 — Human faces, e.g. facial parts, sketches or expressions
    • G06T 2207/10016 — Video; image sequence
    • G06T 2207/20081 — Training; learning
    • G06T 2207/20084 — Artificial neural networks [ANN]
    • G06T 2207/30201 — Face

Definitions

  • the present disclosure relates to the technical field of computer vision, and in particular to an image processing method, device, electronic equipment and storage medium.
  • Face key point detection refers to locating the feature key points of the face from the face image, such as the key points of the facial contour and the key points of the facial features. Due to the influence of factors such as pose, occlusion or light, face key point detection is a challenging task.
  • the embodiments of the present disclosure provide an image processing method, device, electronic equipment and storage medium.
  • an embodiment of the present disclosure provides an image processing method, including:
  • the face image to be tested includes the target face;
  • a detection result of the target face is determined according to the key point information and the first uncertainty.
  • the acquisition of the face image to be tested includes:
  • the image to be processed includes at least one human face
  • the face image to be tested corresponding to each face is obtained by cropping according to the face area information.
  • the performing image detection on the face image to be tested, and determining the key point information of at least one key point of the target face and the first uncertainty of the target face include:
  • the first uncertainty of the target face is determined according to the second uncertainty of each key point of the target face.
  • the preset key points include at least one of the following types:
  • key points of the face contour, key points of the eyes, key points of the eyebrows, key points of the nose, key points of the mouth, and key points of the ears.
  • the determining the detection result of the target face according to the key point information and the first uncertainty includes:
  • the key points are marked on the face image to be tested according to the key point information of each key point.
  • the determining the detection result of the target face according to the key point information and the first uncertainty includes:
  • the target face tracking model is determined from a plurality of preset face tracking models
  • the target face is detected and tracked by using the target face tracking model to obtain the detection result of the target face.
  • the determining the detection result of the target face according to the key point information and the first uncertainty includes:
  • the performing image detection on the face image to be tested, and determining the key point information of at least one key point of the target face and the first uncertainty of the target face include:
  • the method of the embodiment of the present disclosure also includes a training process for training the feature extraction network and the key point detection network, the training process includes:
  • each sample data in the sample data set includes a face sample image, and a key point label of each key point of the target face in the face sample image;
  • the face sample image is input into the feature extraction network to be trained, and the feature map of the face sample image output by the feature extraction network is obtained;
  • an image processing device including:
  • An acquisition module configured to acquire a face image to be tested, the face image to be tested includes a target face;
  • the image detection module is configured to perform image detection on the face image to be tested, and determine key point information of at least one key point of the target face and a first uncertainty of the target face; wherein, the first uncertainty is obtained according to the second uncertainty of all key points of the target face;
  • the result determination module is configured to determine the detection result of the target face according to the key point information and the first uncertainty.
  • the acquisition module is configured to:
  • the image to be processed includes at least one human face
  • the face image to be tested corresponding to each face is obtained by cropping according to the face area information.
  • the image detection module is configured to:
  • the first uncertainty of the target face is determined according to the second uncertainty of each key point of the target face.
  • the preset key points include at least one of the following types:
  • key points of the face contour, key points of the eyes, key points of the eyebrows, key points of the nose, key points of the mouth, and key points of the ears.
  • the result determination module is configured to:
  • the key points are marked on the face image to be tested according to the key point information of each key point.
  • the result determination module is configured to:
  • the target face tracking model is determined from a plurality of preset face tracking models
  • the target face is detected and tracked by using the target face tracking model to obtain the detection result of the target face.
  • the result determination module is configured to:
  • the image detection module includes:
  • a feature extraction module configured to input the face image to be tested into a pre-trained feature extraction network to obtain a feature map output by the feature extraction network;
  • the key point detection module is configured to input the feature map into a pre-trained key point detection network, and obtain the key point information and the first uncertainty of each key point of the target face output by the key point detection network.
  • the device described in the embodiments of the present disclosure further includes a training module configured to:
  • each sample data in the sample data set includes a face sample image, and a key point label of each key point of the target face in the face sample image;
  • the face sample image is input into the feature extraction network to be trained, and the feature map of the face sample image output by the feature extraction network is obtained;
  • an electronic device including:
  • the memory stores computer instructions that can be read by the processor, and when the computer instructions are read, the processor executes the method according to any implementation manner of the first aspect.
  • the embodiments of the present disclosure provide a storage medium for storing computer-readable instructions, and the computer-readable instructions are used to cause a computer to execute the method according to any embodiment of the first aspect.
  • the image processing method includes acquiring a face image to be tested, performing image detection on the face image to be tested, determining key point information of at least one key point of a target face and a first uncertainty of the target face, and determining a detection result of the target face according to the key point information and the first uncertainty.
  • the first uncertainty of the target face is used to assist the detection of key points of the face to improve the effect and accuracy of face detection, and at the same time, it is applicable to various task scenarios, and the first uncertainty representing the comprehensive error of the target face is determined based on the second uncertainty of all key points of the target face, so as to improve the network effect and training efficiency.
  • FIG. 1 is a flowchart of an image processing method according to some embodiments of the present disclosure.
  • FIG. 2 is a flowchart of an image processing method according to some embodiments of the present disclosure.
  • FIG. 3 is a flowchart of an image processing method according to some embodiments of the present disclosure.
  • Fig. 4 is a schematic diagram of facial key points according to some implementations of the present disclosure.
  • Fig. 5 is a schematic structural diagram of an image detection network according to some embodiments of the present disclosure.
  • FIG. 6 is a flowchart of an image processing method according to some embodiments of the present disclosure.
  • FIG. 7 is a flowchart of an image processing method according to some embodiments of the present disclosure.
  • FIG. 8 is a flowchart of an image processing method according to some embodiments of the present disclosure.
  • FIG. 9 is a flowchart of an image processing method according to some embodiments of the present disclosure.
  • FIG. 10 is a structural block diagram of an image processing device according to some embodiments of the present disclosure.
  • FIG. 11 is a structural block diagram of an image processing device according to some embodiments of the present disclosure.
  • Fig. 12 is a structural block diagram of an electronic device according to some embodiments of the present disclosure.
  • Face key point detection is a necessary means for face recognition tasks. Face key point detection refers to locating key points of facial features from the face image, such as key points of facial contour and key points of facial features. Key points of facial contour can include key points of chin, jaw, and cheeks.
  • DNN: Deep Neural Network
  • embodiments of the present disclosure provide an image processing method, device, electronic equipment, and storage medium, aiming at improving the accuracy of facial key point positioning and optimizing the structure and effect of an image detection network.
  • an embodiment of the present disclosure provides an image processing method, which can be applied to an electronic device.
  • the electronic device may be any type of device suitable for implementation, such as a mobile terminal, a vehicle terminal, a wearable device, an access control system, a video surveillance system, a cloud platform, and a server, etc., and the present disclosure does not limit this.
  • the image processing method of the present disclosure example includes:
  • the face image to be tested refers to an image in which a face object is expected to be detected, so that the face image to be tested may include one or more face objects, and the face object is the target face.
  • the face image to be tested may be a single frame image collected by the image collection device of the electronic device, or may be a frame image in a video stream collected by the image collection device of the electronic device.
  • the electronic device is a smart phone.
  • the smart phone includes a camera, and an image including a human face can be captured through the camera, and the image can be used as the human face image to be tested in the present disclosure.
  • the electronic device takes a video surveillance system as an example.
  • the video surveillance system includes a surveillance camera, which can capture a video stream including a human face in the target scene area through the surveillance camera.
  • the frame images in the video stream can be used as the face image to be tested in the present disclosure.
  • the face image to be tested can be any image that is expected to detect a face object from the image, it can be an image acquired in real time, or it can be a face image uploaded or downloaded through the network, which will not be repeated in this disclosure.
  • the face image acquired by the electronic device often has many interference factors, for example, the face image includes multiple face objects, and for example, the face image includes a large non-face area.
  • the face image can be cropped in advance, and the cropped image including only one face object can be used as the face image to be tested.
  • S120 Perform image detection on the face image to be tested, and determine key point information of at least one key point of the target face and a first uncertainty of the target face.
  • face key point detection needs to detect multiple face key points from the face image to be tested, and these face key points may respectively belong to different face key point types.
  • Key point types can include, for example, facial contour key points, eye key points, eyebrow key points, nose key points, mouth key points, ear key points, etc.
  • Each key point type can include multiple key points; for example, the eyebrow key points can include 5 points per eyebrow, 10 key points in total.
  • the key point information may include key point coordinates corresponding to each key point.
  • image detection may be performed on the face image to be tested based on image detection technology, so that all key points of the target face and the position coordinates of each key point in the image may be obtained from the face image to be tested.
  • the first uncertainty of the target face needs to be determined during key point detection, and the first uncertainty represents the comprehensive error of all key points of the target face, that is, the first uncertainty of the target face is obtained according to the second uncertainty of all key points.
  • The higher the first uncertainty, the greater the error of the key point detection of the target face; conversely, the lower the first uncertainty, the smaller the error of the key point detection of the target face.
  • the pre-trained key point detection network can be used to predict the position coordinates of each key point of the target face, so as to obtain the key point information of each key point.
  • the key point detection network can also predict the uncertainty of each key point to obtain the second uncertainty corresponding to each key point.
  • the key point detection network can predict the position coordinates (x, y) and the second uncertainty p of the key point A based on the eyebrow feature of the target face, and the second uncertainty p represents the error of the position coordinates (x, y) of the key point.
  • each key point corresponds to key point information and a second uncertainty.
  • the first uncertainty of the target face is calculated according to the second uncertainties of all key points, and the first uncertainty is used as the comprehensive uncertainty corresponding to the target face.
  • the root mean square of the second uncertainties of all key points may be used as the first uncertainty corresponding to the target face.
  • the mean value of the second uncertainty of all key points may be used as the first uncertainty corresponding to the target face.
  • the first uncertainty can also be obtained by fusing the second uncertainties of all key points in other ways, as long as the first uncertainty can represent the comprehensive error of the key points, which is not limited in the present disclosure.
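The fusion described above can be sketched as follows. This is a minimal illustration of the two example rules given in the text (root mean square and mean); the function name and the sample values are assumptions for illustration only, and the disclosure allows any fusion rule that represents the comprehensive error.

```python
import math

def fuse_uncertainties(second_uncertainties, method="rms"):
    """Fuse the per-key-point (second) uncertainties into a single
    face-level (first) uncertainty. Only the RMS and mean rules from
    the text are shown; other fusion rules are equally permissible."""
    n = len(second_uncertainties)
    if method == "rms":
        # Root mean square of the second uncertainties of all key points.
        return math.sqrt(sum(u * u for u in second_uncertainties) / n)
    if method == "mean":
        # Mean value of the second uncertainties of all key points.
        return sum(second_uncertainties) / n
    raise ValueError("unsupported fusion method: " + method)

# Hypothetical per-key-point uncertainties predicted by the network.
per_point = [0.1, 0.2, 0.2, 0.3]
first_rms = fuse_uncertainties(per_point, "rms")
first_mean = fuse_uncertainties(per_point, "mean")
```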
  • the corresponding post-processing logic can be set according to different downstream task scenarios, so as to obtain the detection result for the target face based on the key point information and the first uncertainty.
  • the face image uploaded by the user must meet certain requirements, such as not covering the eyebrows, not tilting the head too far, and so on. Therefore, through the above process of the present disclosure, image processing can be performed on the face photo uploaded by the user to obtain the key point information of each key point of the face image and the first uncertainty for the target face.
  • when the first uncertainty is greater than the preset threshold, it means that the key point detection deviation of the face image uploaded by the user is large and facial features may be occluded, so a detection result of "failed" can be output to the user, indicating that a certain facial feature is occluded.
  • the electronic device corresponds to different working conditions under different lighting conditions. For example, in an extremely dark light scene, the exposure of the face image collected by the electronic device is very low, so the first uncertainty obtained by key point detection is relatively large. On the contrary, for example, in a bright scene, the exposure of the face image collected by the electronic device is normal, so the first uncertainty obtained by the key point detection is relatively low. Based on this, by setting an appropriate threshold, the first uncertainty can be used to determine the current lighting environment of the device, so that the corresponding tracking algorithm model can be used to realize face tracking.
  • the first uncertainty representing the comprehensive error of the target face is determined based on the second uncertainty of all key points of the target face, so that for the key point detection network, it is not necessary to perform regression optimization on the uncertainty of each key point during the training process, but to optimize the comprehensive uncertainty of the face, the network is easy to converge, the effect is better, and the training efficiency is greatly improved.
  • the first uncertainty of the target face is used to assist the detection of key points of the face, so as to improve the effect and accuracy of face detection. And based on the second uncertainty of all the key points of the target face, the first uncertainty representing the comprehensive error of the target face is determined to improve the network effect and training efficiency.
  • the disclosed method does not limit the application scenarios, and can be applied to downstream tasks in various scenarios, such as face image quality detection, face tracking, key point positioning, etc., and has higher robustness.
  • the process of obtaining the face image to be tested includes:
  • S210 Acquire an image to be processed, where the image to be processed includes at least one human face.
  • S220 Perform image detection on the image to be processed, and determine face area information of each face on the image to be processed.
  • the image to be processed may be an original image collected by an image collection device of the electronic device, or an uploaded image uploaded to the electronic device by a user. It can be understood that the image to be processed may include one human face, or may include multiple human faces.
  • image detection may be performed on the image to be processed based on the image detection technology to obtain face area information of each face on the image to be processed.
  • the image to be processed can be detected through the CenterFace network, so as to obtain the face detection frame of each face area on the image to be processed, and the face detection frame is also the face area information.
  • the image to be processed can be cropped according to the face detection frame, so as to obtain a face image including each face area, which is the face image to be tested.
  • the center point of each face detection frame can be used as the origin, and the coordinates of the origin are kept unchanged to uniformly expand the entire face detection frame at a preset ratio, and the face image is cut out along the expanded face detection frame.
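The expansion-then-crop step above can be sketched as below. The box format `(x1, y1, x2, y2)` and the ratio value are assumptions for illustration; the text only requires that the frame be uniformly expanded about its center at a preset ratio before cropping.

```python
def expand_box(box, ratio):
    """Uniformly expand a face detection frame about its center point,
    keeping the center coordinates unchanged, as described above.
    `box` is (x1, y1, x2, y2); `ratio` scales the width and height."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) / 2.0 * ratio
    half_h = (y2 - y1) / 2.0 * ratio
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

# Example: a 100x100 detection frame expanded by a preset ratio of 1.2.
expanded = expand_box((100, 100, 200, 200), 1.2)
# The face image to be tested would then be cut out along `expanded`
# (clamped to the image bounds).
```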
  • the face image of each face can be cut out through the process of the embodiment shown in FIG. 2 , and these face images can be used as the face image to be tested in the present disclosure.
  • key point detection can be performed on the target face in the face image to be tested based on the image detection technology, which will be described below with reference to FIG. 3 .
  • the process of performing image detection on the face image to be tested includes:
  • S310 Perform key point detection on the face image to be tested, and determine key point information and a second uncertainty of each key point of the target face based on the preset key point type of the face.
  • S320 Determine the first uncertainty of the target face according to the second uncertainty of each key point of the target face.
  • Keypoint types include, for example, eye keypoints, eyebrow keypoints, nose keypoints, mouth keypoints, facial contour keypoints, etc., wherein each keypoint type may include multiple keypoints.
  • the preset face key point types can include the following table 1:
  • face key point types are not limited to the examples in Table 1 above, and may also include any other key point types suitable for implementation, such as ear key points, apple muscle key points, etc., which are not limited in the present disclosure.
  • the above-mentioned key point type detection may be performed on the to-be-tested face image based on image detection, so that the key point information of each key point of the target face and the second uncertainty of each key point may be determined.
  • the first uncertainty corresponding to the target face can be obtained based on the second uncertainty calculation of each key point.
  • the root mean square of the second uncertainties of all key points may be used as the first uncertainty corresponding to the target face.
  • the key point detection of the target face in the face image to be tested can be realized based on a pre-trained image detection network.
  • Fig. 5 shows the image detection network structure in some embodiments of the present disclosure, which will be described below in conjunction with Fig. 5 .
  • the image detection network of the example of the present disclosure includes a feature extraction network 510 and a key point detection network 520 .
  • the feature extraction network 510 is the backbone network of the image detection network, and is mainly used for feature extraction of the face image to be tested, thereby obtaining a feature map including semantic features and texture features of the face to be tested. That is, the input of the feature extraction network 510 is the face image to be tested, and the output is the feature map of the face image to be tested.
  • the feature extraction network 510 can adopt a learnable network based on a convolutional neural network (CNN, Convolutional Neural Network) architecture.
  • the feature extraction network 510 can adopt a relatively lightweight MobileNet neural network.
  • the key point detection network 520 is used to predict and output key point information and the first uncertainty according to the feature map output by the feature extraction network 510 .
  • the network structure of the key point detection network 520 includes two branches, that is, the output layer is divided into two fully connected layers.
  • One of the branches is key point information prediction, which is used to perform regression prediction on the position coordinates of each key point of the target face, and obtain the key point information of each key point.
  • the other branch is uncertainty prediction, which is used to predict and output the first uncertainty of the target face according to the uncertainty of each key point.
  • the pooling layer of the key point detection network 520 adopts a 7*7 pooling layer, and each fully connected layer adopts a 256*1-dimensional fully connected layer.
  • before using the image detection network shown in FIG. 5 to process the face image to be tested, the method also includes a process of normalizing the face image to be tested.
  • the purpose of the normalization process is to normalize the pixel values of the face image to be tested, so as to obtain an input image that meets the network design requirements and reduce the amount of calculation.
  • the face image to be tested may first be scaled to a preset size, such as 112 pixels × 112 pixels, by bilinear interpolation, and then the image is pixel-normalized, expressed as:
  • I Norm represents the normalized image pixel value
  • I represents the pixel value of the original image
  • the normalized image is used as the input image of the image detection network.
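The pixel normalization step can be sketched as follows. The patent's exact formula is not reproduced in this text, so the constants below (subtracting 127.5 and dividing by 128, mapping 8-bit pixels to roughly [-1, 1]) are an assumed, conventional choice rather than the disclosed formula.

```python
def normalize_pixels(pixels, mean=127.5, scale=128.0):
    """Pixel normalization sketch: I_Norm = (I - mean) / scale.
    The specific constants are an assumption; the source text omits
    the formula it refers to."""
    return [(p - mean) / scale for p in pixels]

# Example: a few 8-bit pixel values mapped to roughly [-1, 1].
normalized = normalize_pixels([0, 127.5, 255])
```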
  • the image processing method of the present disclosure determines the detection result of the target face includes:
  • the reliability score of the target face can be calculated based on the first uncertainty.
  • the first uncertainty represents the comprehensive error of key point detection and positioning of the target face, which reflects the reliability of the detected key point information, based on which the reliability score of the target face can be determined.
  • the first uncertainty output by the image detection network is a value between 0 and 1, so the reliability score of the determined target face can be expressed as:
  • represents the reliability score of the target face
  • represents the first uncertainty of the target face
  • a first preset threshold may be set in advance based on prior knowledge or scene requirements, and the first preset threshold represents a critical value for passing or failing the key point detection result of the target face.
  • when the reliability score is greater than the first preset threshold, it means that the detection result of the target face is a reliable result, that is, the detection is passed and the first preset condition is met.
  • when the reliability score is not greater than the first preset threshold, it indicates that the detection result of the target face is unreliable, that is, the detection fails and the first preset condition is not met.
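The pass/fail logic above can be sketched as follows. Since the text states the first uncertainty lies in [0, 1] but does not reproduce the score formula, `1 - uncertainty` is an assumed placeholder consistent with "lower uncertainty, higher reliability"; the threshold value is likewise hypothetical.

```python
def reliability_score(first_uncertainty):
    """Map the first uncertainty (a value in [0, 1]) to a reliability
    score. `1 - uncertainty` is an assumption standing in for the
    formula omitted from this text."""
    return 1.0 - first_uncertainty

def detection_passes(first_uncertainty, first_threshold):
    """The detection passes when the reliability score is greater than
    the first preset threshold, per the logic described above."""
    return reliability_score(first_uncertainty) > first_threshold

# Hypothetical threshold of 0.8: an uncertainty of 0.1 passes.
result = detection_passes(0.1, 0.8)
```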
  • each key point can be marked on the original face image to be tested according to the key point information of each key point, so that the user can view the position of each key point on the image, realizing visual output of the face key points.
  • the complexity of the current scene may be determined based on the first uncertainty to implement switching of the face tracking model, which will be described below with reference to the implementation in FIG. 7 .
  • determining the detection result of the target face includes:
  • S132-2 Determine a target face tracking model from a plurality of preset face tracking models according to the reliability score and the pre-established correspondence between the reliability score and the face tracking model.
  • the reliability score of the target face may be determined based on the aforementioned process of the implementation manner in FIG. 6 , which will not be repeated in this disclosure.
  • under different lighting conditions, the first uncertainty obtained by key point detection will also be different.
  • the exposure of the face image collected by the electronic device is very low, so the first uncertainty obtained by the key point detection is relatively large, and correspondingly, the reliability score of the target face is also lower.
  • the exposure of the face image collected by the electronic device is normal, so the first uncertainty obtained by the key point detection is lower, and correspondingly, the reliability score of the target face is higher.
  • the correspondence between the reliability score and the face tracking model can be established in advance based on prior knowledge or a limited number of experiments.
  • the pre-established correspondence can be shown in Table 2 below:
  • the face tracking model corresponding to the reliability score can be determined as the target face tracking model according to the correspondence in Table 2 above, and then the target face tracking model can be used to detect and track the target face. For example, if the reliability score of the target face in the image to be detected is 0.8, then based on the correspondence in Table 2 above, it can be determined that the current scene is a normal scene and the corresponding target face tracking model is "model 1", so that the target face is tracked and detected using model 1 to obtain the face detection result.
  • the current lighting scene can be judged based on the reliability score, so as to select the corresponding face tracking model for face tracking detection, and improve the effect of the detection system.
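The score-driven model switching can be sketched as a range lookup. Table 2 itself is not reproduced in this excerpt, so the score ranges, scene names, and model names below are hypothetical stand-ins, chosen only to be consistent with the "score 0.8 → normal scene → model 1" example above:

```python
# Hypothetical stand-in for Table 2: (low, high, scene, tracking model).
CORRESPONDENCE = [
    (0.7, 1.0, "normal scene", "model 1"),
    (0.4, 0.7, "low-light scene", "model 2"),
    (0.0, 0.4, "very dark scene", "model 3"),
]

def select_tracking_model(reliability_score: float) -> str:
    """Pick the target face tracking model for the current scene."""
    for low, high, _scene, model in CORRESPONDENCE:
        if low <= reliability_score <= high:
            return model
    raise ValueError("reliability score must lie in [0, 1]")
```

A score of 0.8 then selects "model 1", matching the worked example in the text.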
  • the quality inspection of the stored photos can be implemented according to the disclosed method.
  • face recognition scenarios such as identity verification
  • users are often required to upload a face photo that meets the requirements in advance, so as to be used as a template photo for subsequent identity verification.
  • the disclosed method can be used to detect the photos uploaded by users to determine whether the uploaded photos are qualified. The following will describe the embodiment in conjunction with FIG. 8 .
  • determining the detection result of the target face includes:
  • S133-1 Determine the reliability score of the target face according to the first uncertainty of the target face.
  • the face image can be used as the face image to be tested as described in the foregoing embodiments of the present disclosure.
  • the key point detection of the face image to be tested can obtain the key point information and the first uncertainty of the target face.
  • the reliability score of the target face can be determined based on the aforementioned process of the implementation manner in FIG. 6 , which will not be repeated in this disclosure.
  • a second preset threshold may be set in advance based on prior knowledge or a limited number of trials, and the second preset threshold represents a critical value for whether the detection of the target face passes.
  • when the reliability score is greater than the second preset threshold, the detection of the target face passes, the second preset condition is met, and the photo can be stored.
  • when the reliability score is not greater than the second preset threshold, the detection of the target face fails, the second preset condition is not met, and the photo cannot be stored.
  • in this case, the key points that do not meet the requirements can also be determined according to the key point information, so as to output prompt information to the user, such as "the eyebrows are blocked".
  • the method of the embodiments of the present disclosure can be applied to various face recognition scenarios, and can distinguish image quality or current environmental conditions based on the first uncertainty, which has strong practicability and robustness, and improves the effect of face recognition tasks.
  • the process of performing network training on the image detection network includes:
  • the sample data set includes a large amount of sample data.
  • for example, the sample data set may include 5000 pieces of sample data.
  • each sample data includes a face sample image and a key point label for each key point of the target face in the pre-labeled face sample image.
  • the key point label represents the ground truth of each key point of the target face in the face sample image
  • the key point label can be obtained by manual labeling.
  • the N key point coordinates of the target face in the face sample image can be marked by manual labeling to obtain the key point label corresponding to each face sample image.
  • the massive data in the sample data set can also be preprocessed in advance.
  • the preprocessing process can refer to the aforementioned embodiment in FIG.
  • the network structure of the image detection network may refer to the implementation manner shown in FIG. 5 above.
  • every n pieces of sample data can be used as one batch of training samples; typically, n can be 256.
  • the following takes one piece of sample data as an example to illustrate the training process.
  • before the face sample image is input into the image detection network, it can be normalized in advance; the normalization process can refer to the aforementioned formula (1), which will not be repeated here.
  • the face sample image included in the sample data is input into the feature extraction network 510 to be trained, so that the feature extraction network 510 outputs a feature map corresponding to the face sample image.
  • the feature map output by the feature extraction network 510 is used as the input of the key point detection network 520, and, through the pooling layer and the fully connected layer of the key point detection network 520, the key point information P of the target face and the first uncertainty σ of the target face are respectively output.
  • the key point information of the target face output by the key point detection network 520 can be expressed as: P = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, where:
  • P represents the key point information
  • N represents the number of key points
  • (x_i, y_i) represents the position coordinates of the i-th key point.
  • the key point information can include the position coordinates of the key points predicted by the image detection network, and the key point labels represent the real coordinates of the key points, so that the difference between the two can be calculated based on the pre-built loss function, that is, the loss.
  • the image detection network is not optimized solely on the difference between the key point information and the key point labels; the first uncertainty is incorporated into the same optimization, so no additional label needs to be set for the first uncertainty and the network converges more easily.
  • the image processing network uses a multi-objective constraint loss function, which can be expressed as: L = L_p + L_a, where:
  • L represents the loss between the key point information and the key point labels
  • L_p represents the key point error loss function
  • L_a represents the uncertainty error loss function, L_a = f(σ_p, σ)
  • σ_p represents the root mean square error of all key points of the target face
  • σ represents the first uncertainty of the predicted output
  • f represents the L1 loss function
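Numerically, the loss above can be sketched as follows. The exact form of the key point error term L_p is not given in this excerpt, so taking it as the plain key point RMSE is an assumption; the uncertainty term follows the description: an L1 loss f between the predicted first uncertainty σ and the key point RMSE σ_p, which is why no extra label is needed for the uncertainty:

```python
import math

def multi_objective_loss(pred, label, sigma):
    """Sketch of L = L_p + L_a for one face.
    pred, label: lists of (x, y) key point coordinates; sigma: predicted
    first uncertainty. L_p below (the key point RMSE itself) is assumed;
    L_a = |sigma - sigma_p| is the stated L1 loss between the predicted
    uncertainty and the actual key point error."""
    sq = [(px - lx) ** 2 + (py - ly) ** 2
          for (px, py), (lx, ly) in zip(pred, label)]
    sigma_p = math.sqrt(sum(sq) / len(sq))  # RMSE over all key points
    L_p = sigma_p                           # assumed key point error term
    L_a = abs(sigma - sigma_p)              # uncertainty error term
    return L_p + L_a
```

Minimizing L_a drives the predicted σ toward the network's actual key point error on each sample, so the uncertainty head is supervised by the key point labels alone.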
  • the network parameters of the feature extraction network and/or the key point detection network can be optimized and adjusted by backpropagating the difference.
  • the above process is repeated using the sample data in the sample data set, and the image detection network is iteratively optimized until the convergence condition is met, completing the network training.
  • the first uncertainty of the target face is integrated to optimize the training of the network to improve the effect of the image processing network.
  • the constructed loss function has a simple structure, and the optimization of the first uncertainty can be realized without setting additional labels for the first uncertainty, and the network is easier to converge.
  • the first uncertainty is the comprehensive uncertainty of the target face.
  • the embodiments of the present disclosure provide an image processing device, which can be applied to electronic equipment.
  • the electronic device may be any type of device suitable for implementation, such as a mobile terminal, a vehicle terminal, a wearable device, an access control system, a video surveillance system, a cloud platform, and a server, etc., and the present disclosure does not limit this.
  • the image processing device of the present disclosure includes:
  • the obtaining module 10 is configured to obtain a human face image to be tested, the human face image to be tested includes a target human face;
  • the image detection module 20 is configured to perform image detection on the face image to be tested, and determine key point information of at least one key point of the target face and a first uncertainty of the target face; wherein, the first uncertainty is obtained according to the second uncertainty of all key points of the target face;
  • the result determination module 30 is configured to determine the detection result of the target face according to the key point information and the first uncertainty.
  • the first uncertainty of the target face is used to assist the detection of key points of the face, so as to improve the effect and accuracy of face detection. And based on the second uncertainty of all the key points of the target face, the first uncertainty representing the comprehensive error of the target face is determined to improve the network effect and training efficiency.
  • the disclosed method does not limit the application scenarios, and can be applied to downstream tasks in various scenarios, such as face image quality detection, face tracking, key point positioning, etc., and has higher robustness.
  • the acquisition module 10 is configured to:
  • the image to be processed includes at least one human face
  • the human face image to be tested corresponding to the human face is obtained by cropping according to the human face area information.
  • the image detection module 20 is configured to:
  • the first uncertainty of the target face is determined according to the second uncertainty of each key point of the target face.
  • the preset key points include at least one of the following types:
  • Key points of face contour, key points of eyes, key points of eyebrows, key points of nose, key points of mouth, key points of ears.
  • the result determination module 30 is configured to:
  • the key points are output on the human face image to be tested according to the key point information of each key point.
  • the result determination module 30 is configured to:
  • the target face tracking model is determined from a plurality of preset face tracking models
  • the target face is detected and tracked by using the target face tracking model to obtain the detection result of the target face.
  • the result determination module 30 is configured to:
  • the image detection module 20 includes:
  • the feature extraction module 40 is configured to input the face image to be tested into a pre-trained feature extraction network, and obtain the feature map output by the feature extraction network;
  • the key point detection module 50 is configured to input the feature map into a pre-trained key point detection network, and obtain the key point information and the first uncertainty of each key point of the target face output by the key point detection network.
  • the device described in the embodiments of the present disclosure further includes a training module 60, the training module is configured to:
  • each sample data in the sample data set includes a face sample image, and a key point label of each key point of the target face in the face sample image;
  • the face sample image is input into the feature extraction network to be trained, to obtain the feature map of the face sample image output by the feature extraction network;
  • the network is optimized and trained by fusing the first uncertainty of the target face to improve the effect of the image processing network.
  • the constructed loss function has a simple structure, and the optimization of the first uncertainty can be realized without setting additional labels for the first uncertainty, and the network is easier to converge.
  • the first uncertainty is the comprehensive uncertainty of the target face.
  • an electronic device including:
  • the memory stores computer instructions that can be read by the processor, and when the computer instructions are read, the processor executes the method according to any implementation manner of the first aspect.
  • the embodiments of the present disclosure provide a storage medium for storing computer-readable instructions, and the computer-readable instructions are used to cause a computer to execute the method according to any embodiment of the first aspect.
  • FIG. 12 shows a schematic structural diagram of an electronic device 600 suitable for implementing the method of the present disclosure.
  • the electronic device shown in FIG. 12 can realize the corresponding functions of the above-mentioned processor and storage medium.
  • the electronic device 600 includes a processor 601 that can perform various appropriate actions and processes according to programs stored in the memory 602 or loaded from the storage part 608 into the memory 602 .
  • the memory 602 also stores various programs and data necessary for the operation of the electronic device 600.
  • the processor 601 and the memory 602 are connected to each other through a bus 604 .
  • An input/output (I/O) interface 605 is also connected to the bus 604 .
  • the following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, etc.; an output section 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage section 608 including a hard disk, etc.; and a communication section 609 including a network interface card such as a LAN card, a modem, etc.
  • the communication section 609 performs communication processing via a network such as the Internet.
  • a drive 610 is also connected to the I/O interface 605 as needed.
  • a removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608.
  • the above method process can be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method described above.
  • the computer program may be downloaded and installed from a network via the communication portion 609 and/or installed from a removable medium 611 .
  • each block in the flowchart or block diagram may represent a module, program segment, or a portion of code that includes one or more executable instructions for implementing specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

An image processing method. The image processing method comprises: acquiring a facial image to be subjected to detection, wherein said facial image comprises a target face; performing image detection on said facial image, and determining key point information of at least one key point of the target face and a first uncertainty of the target face; and determining a detection result of the target face according to the key point information and the first uncertainty.

Description

Image processing method, device, electronic device and storage medium
Cross-Reference to Related Applications
This application is based on a Chinese patent application with a filing date of January 21, 2022 and application number 202210074181.7, and claims the priority of that Chinese patent application, the entire content of which is hereby incorporated by reference.
Technical Field
The present disclosure relates to the technical field of computer vision, and in particular to an image processing method, device, electronic device and storage medium.
Background
At present, face recognition based on deep neural networks (DNN, Deep Neural Network) is one of the most important applications in the field of computer vision (CV, Computer Vision). Face key point detection refers to locating the characteristic key points of a face from a face image, such as facial contour key points and facial feature key points. Due to the influence of factors such as pose, occlusion or lighting, face key point detection is a challenging task.
Summary
In order to improve the detection effect of face key points, the embodiments of the present disclosure provide an image processing method, device, electronic device and storage medium.
In a first aspect, an embodiment of the present disclosure provides an image processing method, including:
acquiring a face image to be tested, the face image to be tested including a target face;
performing image detection on the face image to be tested, and determining key point information of at least one key point of the target face and a first uncertainty of the target face, wherein the first uncertainty is obtained according to second uncertainties of all key points of the target face;
determining a detection result of the target face according to the key point information and the first uncertainty.
In some embodiments, acquiring the face image to be tested includes:
acquiring an image to be processed, the image to be processed including at least one face;
performing image detection on the image to be processed, and determining face area information of each face in the image to be processed;
for any face, cropping according to the face area information to obtain the face image to be tested corresponding to the face.
In some embodiments, performing image detection on the face image to be tested and determining the key point information of at least one key point of the target face and the first uncertainty of the target face includes:
performing key point detection on the face image to be tested, and determining the key point information and a second uncertainty of each key point of the target face based on preset face key point types;
determining the first uncertainty of the target face according to the second uncertainty of each key point of the target face.
In some embodiments, the preset key points include at least one of the following types:
face contour key points, eye key points, eyebrow key points, nose key points, mouth key points, ear key points.
In some embodiments, determining the detection result of the target face according to the key point information and the first uncertainty includes:
determining a reliability score of the target face according to the first uncertainty of the target face;
in response to the reliability score satisfying a first preset condition, outputting the key points on the face image to be tested according to the key point information of each key point.
In some embodiments, determining the detection result of the target face according to the key point information and the first uncertainty includes:
determining a reliability score of the target face according to the first uncertainty of the target face;
determining a target face tracking model from a plurality of preset face tracking models according to the reliability score and a pre-established correspondence between reliability scores and face tracking models;
detecting and tracking the target face by using the target face tracking model to obtain the detection result of the target face.
In some embodiments, determining the detection result of the target face according to the key point information and the first uncertainty includes:
determining a reliability score of the target face according to the first uncertainty of the target face;
in response to the reliability score satisfying a second preset condition, determining that the target face detection of the face image to be tested passes.
In some embodiments, performing image detection on the face image to be tested and determining the key point information of at least one key point of the target face and the first uncertainty of the target face includes:
inputting the face image to be tested into a pre-trained feature extraction network to obtain a feature map output by the feature extraction network;
inputting the feature map into a pre-trained key point detection network to obtain the key point information of each key point of the target face and the first uncertainty output by the key point detection network.
In some embodiments, the method of the embodiments of the present disclosure further includes a training process for training the feature extraction network and the key point detection network, the training process including:
acquiring a sample data set, each sample data in the sample data set including a face sample image and a key point label of each key point of the target face in the face sample image;
for any sample data, inputting the face sample image into a feature extraction network to be trained to obtain a feature map of the face sample image output by the feature extraction network;
inputting the feature map of the face sample image into a key point detection network to be trained to obtain the key point information of each key point of the target face and the first uncertainty of the target face;
determining a difference between the key point information and the key point labels based on the key point information, the key point labels and the first uncertainty;
adjusting network parameters of the feature extraction network and/or the key point detection network according to the difference until a convergence condition is met, to obtain the trained feature extraction network and/or key point detection network.
In a second aspect, an embodiment of the present disclosure provides an image processing device, including:
an acquisition module configured to acquire a face image to be tested, the face image to be tested including a target face;
an image detection module configured to perform image detection on the face image to be tested, and determine key point information of at least one key point of the target face and a first uncertainty of the target face, wherein the first uncertainty is obtained according to second uncertainties of all key points of the target face;
a result determination module configured to determine a detection result of the target face according to the key point information and the first uncertainty.
In some embodiments, the acquisition module is configured to:
acquire an image to be processed, the image to be processed including at least one face;
perform image detection on the image to be processed, and determine face area information of each face in the image to be processed;
for any face, crop according to the face area information to obtain the face image to be tested corresponding to the face.
In some embodiments, the image detection module is configured to:
perform key point detection on the face image to be tested, and determine the key point information and a second uncertainty of each key point of the target face based on preset face key point types;
determine the first uncertainty of the target face according to the second uncertainty of each key point of the target face.
In some embodiments, the preset key points include at least one of the following types:
face contour key points, eye key points, eyebrow key points, nose key points, mouth key points, ear key points.
In some embodiments, the result determination module is configured to:
determine a reliability score of the target face according to the first uncertainty of the target face;
in response to the reliability score satisfying a first preset condition, output the key points on the face image to be tested according to the key point information of each key point.
In some embodiments, the result determination module is configured to:
determine a reliability score of the target face according to the first uncertainty of the target face;
determine a target face tracking model from a plurality of preset face tracking models according to the reliability score and a pre-established correspondence between reliability scores and face tracking models;
detect and track the target face by using the target face tracking model to obtain the detection result of the target face.
In some embodiments, the result determination module is configured to:
determine a reliability score of the target face according to the first uncertainty of the target face;
in response to the reliability score satisfying a second preset condition, determine that the target face detection of the face image to be tested passes.
In some embodiments, the image detection module includes:
a feature extraction module configured to input the face image to be tested into a pre-trained feature extraction network to obtain a feature map output by the feature extraction network;
a key point detection module configured to input the feature map into a pre-trained key point detection network to obtain the key point information of each key point of the target face and the first uncertainty output by the key point detection network.
In some embodiments, the device of the embodiments of the present disclosure further includes a training module, the training module being configured to:
acquire a sample data set, each sample data in the sample data set including a face sample image and a key point label of each key point of the target face in the face sample image;
for any sample data, input the face sample image into a feature extraction network to be trained to obtain a feature map of the face sample image output by the feature extraction network;
input the feature map of the face sample image into a key point detection network to be trained to obtain the key point information of each key point of the target face and the first uncertainty of the target face;
determine a difference between the key point information and the key point labels based on the key point information, the key point labels and the first uncertainty;
adjust network parameters of the feature extraction network and/or the key point detection network according to the difference until a convergence condition is met, to obtain the trained feature extraction network and/or key point detection network.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
a processor; and
a memory storing computer instructions readable by the processor, wherein when the computer instructions are read, the processor executes the method according to any implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a storage medium for storing computer-readable instructions, the computer-readable instructions being used to cause a computer to execute the method according to any implementation of the first aspect.
The image processing method of the embodiments of the present disclosure includes acquiring a face image to be tested, performing image detection on the face image to be tested, determining key point information of at least one key point of a target face and a first uncertainty of the target face, and determining a detection result of the target face according to the key point information and the first uncertainty. In the embodiments of the present disclosure, the first uncertainty of the target face assists the detection of face key points, improving the effect and accuracy of face detection; the method is applicable to a variety of task scenarios, and the first uncertainty, representing the comprehensive error of the target face, is determined based on the second uncertainties of all key points of the target face, improving the network effect and training efficiency.
附图说明Description of drawings
为了更清楚地说明本公开具体实施方式或现有技术中的技术方案,下面将对具体实施方式或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本公开的一些实施方式,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the specific embodiments of the present disclosure or the technical solutions in the prior art, the following will briefly introduce the accompanying drawings that need to be used in the description of the specific embodiments or prior art. Obviously, the accompanying drawings in the following description are some embodiments of the present disclosure. For those of ordinary skill in the art, other drawings can also be obtained based on these drawings without creative work.
图1是根据本公开一些实施方式的图像处理方法的流程图。FIG. 1 is a flowchart of an image processing method according to some embodiments of the present disclosure.
图2是根据本公开一些实施方式的图像处理方法的流程图。FIG. 2 is a flowchart of an image processing method according to some embodiments of the present disclosure.
图3是根据本公开一些实施方式的图像处理方法的流程图。FIG. 3 is a flowchart of an image processing method according to some embodiments of the present disclosure.
图4是根据本公开一些实施方式中人脸关键点的示意图。Fig. 4 is a schematic diagram of facial key points according to some implementations of the present disclosure.
图5是根据本公开一些实施方式中图像检测网络的结构示意图。Fig. 5 is a schematic structural diagram of an image detection network according to some embodiments of the present disclosure.
图6是根据本公开一些实施方式的图像处理方法的流程图。FIG. 6 is a flowchart of an image processing method according to some embodiments of the present disclosure.
图7是根据本公开一些实施方式的图像处理方法的流程图。FIG. 7 is a flowchart of an image processing method according to some embodiments of the present disclosure.
图8是根据本公开一些实施方式的图像处理方法的流程图。FIG. 8 is a flowchart of an image processing method according to some embodiments of the present disclosure.
图9是根据本公开一些实施方式的图像处理方法的流程图。FIG. 9 is a flowchart of an image processing method according to some embodiments of the present disclosure.
图10是根据本公开一些实施方式的图像处理装置的结构框图。FIG. 10 is a structural block diagram of an image processing device according to some embodiments of the present disclosure.
图11是根据本公开一些实施方式的图像处理装置的结构框图。FIG. 11 is a structural block diagram of an image processing device according to some embodiments of the present disclosure.
图12是根据本公开一些实施方式中电子设备的结构框图。Fig. 12 is a structural block diagram of an electronic device according to some embodiments of the present disclosure.
具体实施方式Detailed ways
下面将结合附图对本公开的技术方案进行清楚、完整地描述,显然,所描述的实施方式是本公开一部分实施方式,而不是全部的实施方式。基于本公开中的实施方式,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施方式,都属于本公开保护的范围。此外,下面所描述的本公开不同实施方式中所涉及的技术特征只要彼此之间未构成冲突就可以相互结合。The technical solutions of the present disclosure will be clearly and completely described below in conjunction with the accompanying drawings. Apparently, the described implementations are part of the implementations of the present disclosure, but not all of them. Based on the implementation manners in the present disclosure, all other implementation manners obtained by persons of ordinary skill in the art without creative efforts shall fall within the protection scope of the present disclosure. In addition, the technical features involved in different embodiments of the present disclosure described below may be combined with each other as long as they do not constitute a conflict with each other.
人脸关键点检测是人脸识别任务的必要手段,人脸关键点检测是指从人脸图像中定位出人脸面部的特征关键点,例如脸部轮廓关键点、五官关键点等,脸部轮廓关键点可以包括下巴关键点、下颌关键点、脸颊关键点等,五官关键点可以包括眼睛关键点、眉毛关键点、鼻子关键点、嘴部关键点、耳朵关键点等。Face key point detection is a necessary step in face recognition tasks. It refers to locating the characteristic key points of a face from a face image, such as facial contour key points and facial feature key points. Facial contour key points may include chin, jaw, and cheek key points; facial feature key points may include eye, eyebrow, nose, mouth, and ear key points.
目前,基于深度神经网络(DNN,Deep Neural Network)的人脸关键点定位是最为高效且常用的检测方式。相关技术中,为提高DNN对关键点预测定位的准确性,会针对每个关键点设计不确定度参数,利用DNN预测每个人脸关键点的不确定度,基于不确定度对关键点坐标进行回归预测,从而DNN输出精度相对较高的人脸关键点。At present, face key point location based on deep neural network (DNN, Deep Neural Network) is the most efficient and commonly used detection method. In related technologies, in order to improve the accuracy of DNN in predicting and positioning key points, uncertainty parameters are designed for each key point, and DNN is used to predict the uncertainty of each key point of a face, and the coordinates of key points are regressed and predicted based on the uncertainty, so that DNN outputs face key points with relatively high accuracy.
但是,在这种方案中,由于DNN检测的每个人脸关键点都需要回归一个不确定度,对于检测精度要求较高的网络,人脸关键点的数量可能达到几百上千个,导致DNN网络结构复杂度和计算量十分庞大,成本较高。并且,在DNN训练过程中,对于人脸关键点的不确定度无法设计明确的优化目标,导致网络难以收敛,实际使用效果较差。However, in this scheme, since each face key point detected by DNN needs to return an uncertainty, for a network that requires high detection accuracy, the number of face key points may reach hundreds or thousands, resulting in a very large complexity of DNN network structure and calculation, and high cost. Moreover, in the DNN training process, it is impossible to design a clear optimization goal for the uncertainty of key points of the face, which makes it difficult for the network to converge, and the actual use effect is poor.
基于上述缺陷,本公开实施方式提供了一种图像处理方法、装置、电子设备以及存储介质,旨在提高人脸关键点定位精度,并且优化图像检测网络的结构和效果。Based on the above defects, embodiments of the present disclosure provide an image processing method, device, electronic equipment, and storage medium, aiming at improving the accuracy of facial key point positioning and optimizing the structure and effect of an image detection network.
第一方面,本公开实施方式提供了一种图像处理方法,该方法可应用于电子设备。本公开实施方式中,电子设备可以是任何适于实施的设备类型,例如移动终端、车载终端、可穿戴设备、门禁系统、视频监控系统、云平台及服务器等,本公开对此不作限制。In a first aspect, an embodiment of the present disclosure provides an image processing method, which can be applied to an electronic device. In the embodiments of the present disclosure, the electronic device may be any type of device suitable for implementation, such as a mobile terminal, a vehicle terminal, a wearable device, an access control system, a video surveillance system, a cloud platform, and a server, etc., and the present disclosure does not limit this.
如图1所示,在一些实施方式中,本公开示例的图像处理方法,包括:As shown in Figure 1, in some implementations, the image processing method of the present disclosure example includes:
S110、获取待测人脸图像,待测人脸图像中包括目标人脸。S110. Acquire a face image to be tested, where the face image to be tested includes a target face.
具体而言,待测人脸图像是指期望于由图像中检测出人脸对象的图像,从而待测人脸图像中可包括一个或多个人脸对象,该人脸对象即为所述的目标人脸。Specifically, the face image to be tested refers to an image in which a face object is expected to be detected, so that the face image to be tested may include one or more face objects, and the face object is the target face.
在本公开实施方式中,待测人脸图像可以是由电子设备的图像采集装置采集到的单帧图像,也可以是由电子设备的图像采集装置采集的视频流中的帧图像。In the embodiments of the present disclosure, the face image to be tested may be a single frame image collected by the image collection device of the electronic device, or may be a frame image in a video stream collected by the image collection device of the electronic device.
例如一个示例中,电子设备以智能手机为例,智能手机包括摄像头,通过摄像头可以拍摄到包括人脸的图像,该图像即可作为本公开所述的待测人脸图像。For example, in one example, the electronic device is a smart phone. The smart phone includes a camera, and an image including a human face can be captured through the camera, and the image can be used as the human face image to be tested in the present disclosure.
例如另一个示例中,电子设备以视频监控系统为例,视频监控系统包括监控摄像头,通过监控摄像头可以采集到目标场景区域中包括人脸的视频流,视频流中的帧图像即可作为本公开所述的待测人脸图像。For example, in another example, the electronic device takes a video surveillance system as an example. The video surveillance system includes a surveillance camera, which can capture a video stream including a human face in the target scene area through the surveillance camera. The frame images in the video stream can be used as the face image to be tested in the present disclosure.
总而言之,待测人脸图像可以是任何期望于从图像中检测得到人脸对象的图像,可以是实时采集获取的图像,也可以是通过网络上传或者下载的人脸图像,本公开对此不再赘述。In a word, the face image to be tested can be any image that is expected to detect a face object from the image, it can be an image acquired in real time, or it can be a face image uploaded or downloaded through the network, which will not be repeated in this disclosure.
在一些实施方式中,考虑到电子设备获取的人脸图像中往往具备较多的干扰因素,例如人脸图像中包括多张人脸对象,又例如人脸图像中包括较大面积的非人脸区域。为提高后续关键点检测精度,可以预先对人脸图像进行裁切处理,将裁切后仅包括一个人脸对象的图像作为待测人脸图像。本公开下述实施方式中进行说明,在此暂不详述。In some implementations, it is considered that the face image acquired by the electronic device often has many interference factors, for example, the face image includes multiple face objects, and for example, the face image includes a large non-face area. In order to improve the accuracy of subsequent key point detection, the face image can be cropped in advance, and the cropped image including only one face object can be used as the face image to be tested. The present disclosure will be described in the following embodiments, and will not be described in detail here.
S120、对待测人脸图像进行图像检测,确定目标人脸的至少一个关键点的关键点信息以及目标人脸的第一不确定度。S120. Perform image detection on the face image to be tested, and determine key point information of at least one key point of the target face and a first uncertainty of the target face.
具体而言,人脸关键点检测需要由待测人脸图像中检测得到多个人脸关键点,这些人脸关键点可以分别属于不同的人脸关键点类型。关键点类型可以包括例如脸部轮廓关键点、眼睛关键点、眉毛关键点、鼻子关键点、嘴部关键点、耳朵关键点等,每一个关键点类型可以包括多个关键点,例如眉毛关键点可以包括5*2共计10个关键点。Specifically, face key point detection needs to detect multiple face key points from the face image to be tested, and these face key points may belong to different face key point types. Key point types may include, for example, facial contour key points, eye key points, eyebrow key points, nose key points, mouth key points, and ear key points. Each key point type may include multiple key points; for example, the eyebrow key points may include 5*2 = 10 key points in total.
本公开实施方式中,关键点信息可以包括每个关键点对应的关键点坐标,例如,可以基于图像检测技术对待测人脸图像进行图像检测,从而由待测人脸图像中检测得到目标人脸的所有关键点,以及每个关键点在图像中的位置坐标。In the embodiments of the present disclosure, the key point information may include key point coordinates corresponding to each key point. For example, image detection may be performed on the face image to be tested based on image detection technology, so that all key points of the target face and the position coordinates of each key point in the image may be obtained from the face image to be tested.
同时,在本公开实施方式中,关键点检测时还需要确定目标人脸的第一不确定度,第一不确定度表示目标人脸的所有关键点的综合误差,也即,目标人脸的第一不确定度根据所有关键点的第二不确定度得到。第一不确定度越高,表示对目标人脸的关键点检测的误差也越大,反之,第一不确定度越低,表示目标人脸的关键点检测的误差越小。At the same time, in the embodiments of the present disclosure, the first uncertainty of the target face needs to be determined during key point detection, and the first uncertainty represents the comprehensive error of all key points of the target face, that is, the first uncertainty of the target face is obtained according to the second uncertainty of all key points. The higher the first uncertainty, the greater the error of the key point detection of the target face, and on the contrary, the lower the first uncertainty, the smaller the error of the key point detection of the target face.
在一些实施方式中,可以利用预先训练的关键点检测网络,对目标人脸的每个关键点的位置坐标进行预测,从而得到每个关键点的关键点信息。同时,关键点检测网络还可以对每个关键点的不确定度进行预测,得到每个关键点对应的第二不确定度。In some embodiments, the pre-trained key point detection network can be used to predict the position coordinates of each key point of the target face, so as to obtain the key point information of each key point. At the same time, the key point detection network can also predict the uncertainty of each key point to obtain the second uncertainty corresponding to each key point.
以目标人脸的眉毛关键点中的某一个关键点A为例,关键点检测网络可以基于目标人脸的眉毛特征预测得到该关键点A的位置坐标(x,y)和第二不确定度p,第二不确定度p表示该关键点的位置坐标(x,y)的误差。第二不确定度p越大,表示位置坐标(x,y)的误差也越大,反之,第二不确定度p越小,表示位置坐标(x,y)的误差也越小。Taking a certain key point A in the eyebrow key points of the target face as an example, the key point detection network can predict the position coordinates (x, y) and the second uncertainty p of the key point A based on the eyebrow feature of the target face, and the second uncertainty p represents the error of the position coordinates (x, y) of the key point. The larger the second uncertainty p, the larger the error of the position coordinates (x, y), on the contrary, the smaller the second uncertainty p, the smaller the error of the position coordinates (x, y).
可以理解,上述仅以其中一个关键点为例进行说明,对于目标人脸的所有关键点,每个关键点均对应有关键点信息和第二不确定度。在本公开实施方式中,并非直接基于每个关键点的第二不确定度确定目标人脸的检测结果,而是根据所有关键点的第二不确定度计算得到目标人脸的第一不确定度,将第一不确定度作为目标人脸所对应的综合不确定度。It can be understood that only one of the key points is used as an example for illustration, and for all key points of the target face, each key point corresponds to key point information and a second uncertainty. In the embodiment of the present disclosure, instead of directly determining the detection result of the target face based on the second uncertainty of each key point, the first uncertainty of the target face is calculated according to the second uncertainties of all key points, and the first uncertainty is used as the comprehensive uncertainty corresponding to the target face.
在一些实施方式中,可以将所有关键点的第二不确定度的均方根作为目标人脸对应的第一不确定度。在另一些实施方式中,可以将所有关键点的第二不确定度的均值作为目标人脸对应的第一不确定度。当然,可以理解,还可以采用其他方式融合所有关键点的第二不确定度得到第一不确定度,只要保证第一不确定度可以代表关键点的综合误差即可,本公开对此不作限制。In some embodiments, the root mean square of the second uncertainties of all key points may be used as the first uncertainty corresponding to the target face. In other embodiments, the mean of the second uncertainties of all key points may be used as the first uncertainty corresponding to the target face. Of course, it can be understood that the first uncertainty can also be obtained by fusing the second uncertainties of all key points in other ways, as long as the first uncertainty can represent the comprehensive error of the key points, which is not limited in the present disclosure.
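As a concrete illustration, the two fusion options above (root mean square and mean) can be sketched as follows; the function name `face_uncertainty` and the `method` switch are assumptions for illustration only, not part of the disclosed design:

```python
import numpy as np

def face_uncertainty(keypoint_uncertainties, method="rms"):
    """Fuse the per-keypoint (second) uncertainties into one face-level
    (first) uncertainty, as described above."""
    p = np.asarray(keypoint_uncertainties, dtype=float)
    if method == "rms":        # root mean square of all second uncertainties
        return float(np.sqrt(np.mean(p ** 2)))
    elif method == "mean":     # plain average of all second uncertainties
        return float(np.mean(p))
    raise ValueError(f"unknown fusion method: {method}")
```

Any monotone fusion that reflects the overall keypoint error would serve the same purpose, which is all the disclosure requires.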
S130、根据关键点信息和第一不确定度,确定目标人脸的检测结果。S130. Determine the detection result of the target face according to the key point information and the first uncertainty.
具体而言,在确定目标人脸的每个关键点的关键点信息,以及目标人脸所对应的第一不确定度之后,即可根据不同的下游任务场景,设置对应的后处理逻辑,从而基于关键点信息和第一不确定度得到针对目标人脸的检测结果。Specifically, after determining the key point information of each key point of the target face and the first uncertainty corresponding to the target face, the corresponding post-processing logic can be set according to different downstream task scenarios, so as to obtain the detection result for the target face based on the key point information and the first uncertainty.
一个示例中,以人脸照片入库为例,用户上传的人脸图像必须符合一定的要求,例如不得遮挡眉毛、不得偏头角度过大等。从而通过本公开上述过程,可以对用户上传的人脸照片进行图像处理,得到人脸图像的每个关键点的关键点信息,以及针对目标人脸的第一不确定度。当第一不确定度大于预设阈值时,说明用户上传的人脸图像关键点检测的偏差较大,可能存在五官遮挡等问题,从而可向用户输出对应的检测结果为不通过,且某个五官存在遮挡。In one example, taking face photo enrollment as an example, the face image uploaded by the user must meet certain requirements, for example the eyebrows must not be occluded and the head must not be tilted at too large an angle. Through the above process of the present disclosure, image processing can be performed on the face photo uploaded by the user to obtain the key point information of each key point of the face image and the first uncertainty of the target face. When the first uncertainty is greater than the preset threshold, it indicates that the key point detection of the uploaded face image has a large deviation and that problems such as occluded facial features may exist, so the detection result output to the user may be a failure, together with an indication that a certain facial feature is occluded.
另一个示例中,以人脸追踪场景为例,不同的光照情况下电子设备对应的工况不同,例如在极暗光场景下,电子设备采集的人脸图像曝光度很低,从而关键点检测得到的第一不确定度较大。反之,例如在明亮场景下,电子设备采集的人脸图像曝光度正常,从而关键点检测得到的第一不确定度较低。基于此,可以通过设置合适的阈值,利用第一不确定度确定设备当前所处的光照环境,从而采用对应的追踪算法模型实现人脸追踪。In another example, taking the face tracking scene as an example, the electronic device corresponds to different working conditions under different lighting conditions. For example, in an extremely dark light scene, the exposure of the face image collected by the electronic device is very low, so the first uncertainty obtained by key point detection is relatively large. On the contrary, for example, in a bright scene, the exposure of the face image collected by the electronic device is normal, so the first uncertainty obtained by the key point detection is relatively low. Based on this, by setting an appropriate threshold, the first uncertainty can be used to determine the current lighting environment of the device, so that the corresponding tracking algorithm model can be used to realize face tracking.
当然,可以理解,本公开示例的场景并不局限于上述示例,本公开下文中对此进行具体说明,在此暂不详述。Of course, it can be understood that the scenarios of the examples in the present disclosure are not limited to the above examples, which will be described in detail below in the present disclosure, and will not be described in detail here.
值得说明的是,本公开实施方式中,基于目标人脸的所有关键点的第二不确定度确定代表目标人脸综合误差的第一不确定度,从而对于关键点检测网络,在训练过程中无需对每个关键点的不确定度进行回归优化,而是对人脸的综合不确定度进行优化,网络容易收敛,效果更好,并且大大提高训练效率。It is worth noting that in the embodiment of the present disclosure, the first uncertainty representing the comprehensive error of the target face is determined based on the second uncertainty of all key points of the target face, so that for the key point detection network, it is not necessary to perform regression optimization on the uncertainty of each key point during the training process, but to optimize the comprehensive uncertainty of the face, the network is easy to converge, the effect is better, and the training efficiency is greatly improved.
通过上述可知,本公开实施方式中,通过目标人脸的第一不确定度辅助对人脸关键点的检测,提高人脸检测效果和精度。并且基于目标人脸的所有关键点的第二不确定度确定代表目标人脸综合误差的第一不确定度,提高网络效果和训练效率。同时,本公开方法对于应用场景不作限制,可以适用于多种场景的下游任务,例如人脸图像质量检测、人脸追踪、关键点定位等,鲁棒性更高。It can be known from the above that in the embodiments of the present disclosure, the first uncertainty of the target face is used to assist the detection of key points of the face, so as to improve the effect and accuracy of face detection. And based on the second uncertainty of all the key points of the target face, the first uncertainty representing the comprehensive error of the target face is determined to improve the network effect and training efficiency. At the same time, the disclosed method does not limit the application scenarios, and can be applied to downstream tasks in various scenarios, such as face image quality detection, face tracking, key point positioning, etc., and has higher robustness.
如图2所示,在一些实施方式中,本公开示例的图像处理方法中,获取待测人脸图像的过程,包括:As shown in Figure 2, in some implementations, in the image processing method of the present disclosure example, the process of obtaining the face image to be tested includes:
S210、获取待处理图像,待处理图像中包括至少一个人脸。S210. Acquire an image to be processed, where the image to be processed includes at least one human face.
S220、对待处理图像进行图像检测,确定待处理图像上的每个人脸的人脸区域信息。S220. Perform image detection on the image to be processed, and determine face area information of each face on the image to be processed.
S230、对于任意一个人脸,根据人脸区域信息裁切得到人脸对应的待测人脸图像。S230. For any one face, crop according to the face area information to obtain a face image to be tested corresponding to the face.
具体而言,待处理图像可以是通过电子设备的图像采集装置采集到的原始图像,或者用户上传至电子设备的上传图像。可以理解,待处理图像中可能包括一个人脸,也可能包括多个人脸。Specifically, the image to be processed may be an original image collected by an image collection device of the electronic device, or an uploaded image uploaded to the electronic device by a user. It can be understood that the image to be processed may include one human face, or may include multiple human faces.
本公开实施方式中,可基于图像检测技术,对待处理图像进行图像检测,得到待处理图像上每个人脸的人脸区域信息。例如一个示例中,可以通过例如CenterFace网络对待处理图像进行图像检测,从而得到待处理图像上每个人脸区域的人脸检测框,人脸检测框也即人脸区域信息。In the embodiments of the present disclosure, image detection may be performed on the image to be processed based on the image detection technology to obtain face area information of each face on the image to be processed. For example, in one example, the image to be processed can be detected through the CenterFace network, so as to obtain the face detection frame of each face area on the image to be processed, and the face detection frame is also the face area information.
在得到每个人脸的人脸检测框之后,即可根据人脸检测框对待处理图像进行裁切处理,从而得到包括每个人脸区域的人脸图像,该人脸图像即为待测人脸图像。After the face detection frame of each face is obtained, the image to be processed can be cropped according to the face detection frame, so as to obtain a face image including each face area, which is the face image to be tested.
在一个示例中,可以每个人脸检测框的中心点为原点,保持原点坐标不变以预设比例对人脸检测框整体进行均匀外扩,沿外扩后的人脸检测框裁切出人脸图像。In an example, the center point of each face detection frame can be used as the origin, and the coordinates of the origin are kept unchanged to uniformly expand the entire face detection frame at a preset ratio, and the face image is cut out along the expanded face detection frame.
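A minimal sketch of this crop step, assuming a `(x1, y1, x2, y2)` box format and a hypothetical expansion ratio (the actual preset ratio is not specified in the disclosure):

```python
import numpy as np

def expand_and_crop(image, box, scale=1.25):
    """Expand a face detection box uniformly about its center point by a
    preset ratio, keeping the center fixed, then crop the face image.
    `image` is an H x W (x C) array; `box` is (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0      # center point kept fixed
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    h, w = image.shape[:2]
    # clamp the expanded box to the image bounds before cropping
    nx1, ny1 = max(0, int(cx - half_w)), max(0, int(cy - half_h))
    nx2, ny2 = min(w, int(cx + half_w)), min(h, int(cy + half_h))
    return image[ny1:ny2, nx1:nx2]
```

The clamping matters in practice: faces near the image border would otherwise yield out-of-range slices.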
可以理解,在待处理图像上包括多个人脸时,通过图2实施方式过程,可以裁切出每个人脸的人脸图像,这些人脸图像均可以作为本公开所述的待测人脸图像。It can be understood that when the image to be processed includes multiple faces, the face image of each face can be cut out through the process of the embodiment shown in FIG. 2 , and these face images can be used as the face image to be tested in the present disclosure.
在一些实施方式中,在得到待测人脸图像之后,即可基于图像检测技术对待测人脸图像中的目标人脸进行关键点检测,下面结合图3进行说明。In some embodiments, after obtaining the face image to be tested, key point detection can be performed on the target face in the face image to be tested based on the image detection technology, which will be described below with reference to FIG. 3 .
如图3所示,在一些实施方式中,本公开示例的图像处理方法,对待测人脸图像进行图像检测的过程包括:As shown in FIG. 3 , in some implementations, in the image processing method of the disclosed example, the process of performing image detection on the face image to be tested includes:
S310、对待测人脸图像进行关键点检测,基于预先设置的人脸关键点类型,确定目标人脸的每个关键点的关键点信息和第二不确定度。S310. Perform key point detection on the face image to be tested, and determine key point information and a second uncertainty of each key point of the target face based on the preset key point type of the face.
S320、根据目标人脸的各个关键点的第二不确定度,确定目标人脸的第一不确定度。S320. Determine the first uncertainty of the target face according to the second uncertainty of each key point of the target face.
具体而言,在对目标人脸进行关键点检测时,需要从待测人脸图像中检测到属于目标人脸的一种或者多种的关键点类型所包括的关键点。关键点类型包括例如眼睛关键点、眉毛关键点、鼻子关键点、嘴部关键点、脸部轮廓关键点等,其中每个关键点类型可包括多个关键点。Specifically, when performing key point detection on a target face, key points included in one or more key point types belonging to the target face need to be detected from the face image to be tested. Keypoint types include, for example, eye keypoints, eyebrow keypoints, nose keypoints, mouth keypoints, facial contour keypoints, etc., wherein each keypoint type may include multiple keypoints.
例如图4所示,预先设置的人脸关键点类型可以包括如下表一所示:For example, as shown in Figure 4, the preset face key point types can include the following table 1:
表一 Table 1

人脸关键点类型 Face key point type          关键点编号 Key point numbers
脸部轮廓关键点 Facial contour key points     0~32
眉毛关键点 Eyebrow key points               33~42
鼻子关键点 Nose key points                  43~51
眼睛关键点 Eye key points                   52~63
嘴部关键点 Mouth key points                 64~83
当然可以理解,人脸关键点类型并不局限于上述表一示例,还可以包括其他任何适于实施的关键点类型,例如耳朵关键点、苹果肌关键点等等,本公开对此不作限制。Of course, it can be understood that the face key point types are not limited to the examples in Table 1 above, and may also include any other key point types suitable for implementation, such as ear key points, apple muscle key points, etc., which are not limited in the present disclosure.
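For illustration, the mapping in Table 1 can be held in a small lookup structure; the English type names and the helper `keypoint_type` are assumptions for this sketch, not part of the disclosure:

```python
# Keypoint index ranges from Table 1 (84 keypoints in total, numbered 0~83).
FACE_KEYPOINT_TYPES = {
    "facial_contour": range(0, 33),   # 脸部轮廓关键点 0~32
    "eyebrows":       range(33, 43),  # 眉毛关键点 33~42
    "nose":           range(43, 52),  # 鼻子关键点 43~51
    "eyes":           range(52, 64),  # 眼睛关键点 52~63
    "mouth":          range(64, 84),  # 嘴部关键点 64~83
}

def keypoint_type(index):
    """Return the type name of a keypoint index, per Table 1."""
    for name, idx_range in FACE_KEYPOINT_TYPES.items():
        if index in idx_range:
            return name
    raise ValueError(f"index {index} outside 0~83")
```

A scheme with extra types (ears, etc.) would simply add more entries with their own index ranges.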
在本公开实施方式中,可以基于图像检测对待测人脸图像进行上述关键点类型的检测,从而可以确定目标人脸的每个关键点的关键点信息以及每个关键点的第二不确定度。In the embodiments of the present disclosure, the above-mentioned key point type detection may be performed on the to-be-tested face image based on image detection, so that the key point information of each key point of the target face and the second uncertainty of each key point may be determined.
基于前述可知,对于任意一个关键点,其关键点信息包括该关键点在图像坐标系中的位置坐标,而第二不确定度表示该关键点定位结果的不确定程度。从而,对于目标人脸的所有关键点,可基于每个关键点的第二不确定度计算得到目标人脸所对应的第一不确定度。例如一个示例中,可以将所有关键点的第二不确定度的均方根作为目标人脸所对应的第一不确定度。Based on the foregoing, for any key point, its key point information includes the position coordinates of the key point in the image coordinate system, and the second uncertainty represents the degree of uncertainty of the key point positioning result. Therefore, for all key points of the target face, the first uncertainty corresponding to the target face can be calculated based on the second uncertainty of each key point. For example, in one example, the root mean square of the second uncertainties of all key points may be used as the first uncertainty corresponding to the target face.
在一些实施方式中,可以基于预先训练的图像检测网络实现对待测人脸图像中目标人脸的关键点检测。图5示出了本公开一些实施方式中的图像检测网络结构,下面结合图5进行说明。In some implementations, the key point detection of the target face in the face image to be tested can be realized based on a pre-trained image detection network. Fig. 5 shows the image detection network structure in some embodiments of the present disclosure, which will be described below in conjunction with Fig. 5 .
如图5所示,在一些实施方式中,本公开示例的图像检测网络包括特征提取网络510和关键点检测网络520。As shown in FIG. 5 , in some implementations, the image detection network of the example of the present disclosure includes a feature extraction network 510 and a key point detection network 520 .
特征提取网络510为图像检测网络的骨干网络(Backbone Network),其主要用于对待测人脸图像进行特征提取,从而得到包括待测人脸语义特征和纹理特征的特征图(feature map)。也即,特征提取网络510的输入为待测人脸图像,输出为待测人脸图像的特征图。The feature extraction network 510 is the backbone network (Backbone Network) of the image detection network, which is mainly used for feature extraction of the face image to be tested, thereby obtaining a feature map (feature map) including semantic features and texture features of the face to be tested. That is, the input of the feature extraction network 510 is the human face image to be tested, and the output is the feature map of the human face image to be tested.
在一些示例实施方式中,特征提取网络510可以采用基于卷积神经网络(CNN,Convolutional Neural Network)架构的可学习网络,例如在一个示例中,为便于在移动终端中部署,特征提取网络510可以采用较为轻量级的MobileNet神经网络。In some exemplary implementations, the feature extraction network 510 can adopt a learnable network based on a convolutional neural network (CNN, Convolutional Neural Network) architecture. For example, in one example, in order to facilitate deployment in mobile terminals, the feature extraction network 510 can adopt a relatively lightweight MobileNet neural network.
关键点检测网络520用于根据特征提取网络510输出的特征图,预测输出关键点信息以及第一不确定度。例如图5示例中,关键点检测网络520的网络结构包括两个分支,也即输出层分为两个全连接层。其中一个分支为关键点信息预测,用于对目标人脸的每个关键点的位置坐标进行回归预测,得到每个关键点的关键点信息。其中另一个分支为不确定度预测,用于根据每个关键点的不确定度预测输出目标人脸的第一不确定度。The key point detection network 520 is used to predict and output key point information and the first uncertainty according to the feature map output by the feature extraction network 510 . For example, in the example in FIG. 5 , the network structure of the key point detection network 520 includes two branches, that is, the output layer is divided into two fully connected layers. One of the branches is key point information prediction, which is used to perform regression prediction on the position coordinates of each key point of the target face, and obtain the key point information of each key point. The other branch is uncertainty prediction, which is used to predict the first uncertainty of the output target face according to the uncertainty of each key point.
在一个示例中,关键点检测网络520的池化层采用7*7的池化层,每个全连接层采用 256*1维的全连接层。In an example, the pooling layer of the key point detection network 520 adopts a 7*7 pooling layer, and each fully connected layer adopts a 256*1-dimensional fully connected layer.
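The two-branch output structure of Fig. 5 can be sketched roughly as follows with NumPy; the dimensions, the global-average-pooling stand-in for the 7*7 pooling layer, and the sigmoid on the uncertainty branch are illustrative assumptions rather than the actual network design:

```python
import numpy as np

class KeypointHead:
    """Minimal sketch of the two-branch output in Fig. 5: a pooled feature
    vector feeds two fully connected branches, one regressing 2K keypoint
    coordinates and one predicting a single face-level uncertainty."""
    def __init__(self, feat_dim=256, num_keypoints=84,
                 rng=np.random.default_rng(0)):
        # toy weight matrices standing in for the two fully connected layers
        self.W_kp = rng.standard_normal((feat_dim, num_keypoints * 2)) * 0.01
        self.W_unc = rng.standard_normal((feat_dim, 1)) * 0.01

    def forward(self, feature_map):
        # global average pooling over the spatial dims (stands in for the 7*7 pool)
        f = feature_map.mean(axis=(1, 2))                       # (N, feat_dim)
        keypoints = f @ self.W_kp                               # (N, 2K) coordinates
        # a sigmoid keeps the face-level first uncertainty in (0, 1)
        uncertainty = 1.0 / (1.0 + np.exp(-(f @ self.W_unc)))   # (N, 1)
        return keypoints, uncertainty
```

In a real implementation the two branches would be trained layers of a deep-learning framework; the sketch only shows how one shared feature vector yields both outputs.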
在一些实施方式中,利用图5所示的图像检测网络对待测人脸图像进行处理之前,还包括对待测人脸图像进行归一化处理的过程,归一化处理的目的是将待测人脸图像的像素值进行归一化,从而得到符合网络设计要求的输入图像,减小计算量。In some embodiments, before using the image detection network shown in FIG. 5 to process the face image to be tested, it also includes a process of normalizing the face image to be tested. The purpose of the normalization process is to normalize the pixel values of the face image to be tested, so as to obtain an input image that meets the network design requirements and reduce the amount of calculation.
在一个示例中,在待测人脸图像输入图像检测网络之前,可首先通过例如双线性插值将待测人脸图像缩放至预设尺寸,例如112像素*112像素,并且对图像进行像素归一化,表示为:In one example, before the face image to be tested is input into the image detection network, the face image to be tested may first be scaled to a preset size, such as 112 pixels*112 pixels, by bilinear interpolation, for example, and the image is pixel-normalized, expressed as:
I_Norm = (I - 127.5) / 127.5           式(1) Equation (1)
式(1)中,I_Norm表示归一化处理后的图像像素值,I表示原图像的像素值,将归一化处理后的图像作为图像检测网络的输入图像。In Equation (1), I_Norm represents the normalized image pixel value, and I represents the pixel value of the original image; the normalized image is used as the input image of the image detection network.
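A direct sketch of the normalization in Equation (1), assuming uint8 input pixels; the resize to the preset input size (e.g. 112*112 by bilinear interpolation) is taken to happen beforehand:

```python
import numpy as np

def normalize_face(image):
    """Pixel normalization of Equation (1): I_Norm = (I - 127.5) / 127.5,
    mapping uint8 pixel values in [0, 255] to floats in [-1, 1]."""
    return (image.astype(np.float32) - 127.5) / 127.5
```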
本公开实施方式,在得到图像检测网络预测输出的目标人脸的关键点信息和目标人脸的第一不确定度之后,可以根据下游任务的具体需求,得到不同的针对目标人脸的检测结果,下面分别进行说明。In the embodiments of the present disclosure, after obtaining the key point information of the target face predicted and output by the image detection network and the first uncertainty of the target face, different detection results for the target face can be obtained according to the specific requirements of downstream tasks, which will be described separately below.
例如一些场景中,期望由场景图像中检测出人脸,并且在场景图像中显示出人脸关键点的可视化效果。如图6所示,在该场景中,本公开的图像处理方法,确定目标人脸的检测结果包括:For example, in some scenes, it is expected to detect the human face from the scene image, and display the visualization effect of the key points of the human face in the scene image. As shown in FIG. 6, in this scene, the image processing method of the present disclosure determines the detection result of the target face includes:
S131-1、根据目标人脸的第一不确定度,确定目标人脸的可靠性分值。S131-1. Determine the reliability score of the target face according to the first uncertainty of the target face.
S132-2、响应于可靠性分值满足第一预设条件,根据各个关键点的关键点信息,在待测人脸图像上输出关键点。S132-2. In response to the reliability score satisfying the first preset condition, output the key points on the face image to be tested according to the key point information of each key point.
具体而言,在得到图像检测网络输出的目标人脸的关键点信息和目标人脸的第一不确定度之后,可以基于第一不确定度计算出目标人脸的可靠性分值。Specifically, after obtaining the key point information of the target face output by the image detection network and the first uncertainty of the target face, the reliability score of the target face can be calculated based on the first uncertainty.
可以理解,第一不确定度表示对目标人脸进行关键点检测定位的综合误差,其反应的是检测出的关键点信息的可靠程度,基于此可以确定目标人脸的可靠性分值。It can be understood that the first uncertainty represents the comprehensive error of key point detection and positioning of the target face, which reflects the reliability of the detected key point information, based on which the reliability score of the target face can be determined.
在一个示例中,图像检测网络输出的第一不确定度为位于0~1之间的数值,从而确定的目标人脸的可靠性分值,即可表示为:In an example, the first uncertainty output by the image detection network is a value between 0 and 1, so the reliability score of the determined target face can be expressed as:
θ=1-α              式(2)θ=1-α Equation (2)
式(2)中,θ表示目标人脸的可靠性分值,α表示目标人脸的第一不确定度。In formula (2), θ represents the reliability score of the target face, and α represents the first uncertainty of the target face.
在本公开实施方式中,可以预先基于先验知识或者场景需求设置第一预设阈值,第一预设阈值表示目标人脸的关键点检测结果通过与否的临界值。当可靠性分值大于该第一预设阈值时,表示目标人脸的检测结果为可靠结果,也即检测通过,满足第一预设条件。反之,当可靠性分值不大于该第一预设阈值时,表示目标人脸的检测结果不可靠,也即检测不通过,不满足第一预设条件。In the embodiments of the present disclosure, a first preset threshold may be set in advance based on prior knowledge or scene requirements, and the first preset threshold represents a critical value for passing or failing the key point detection result of the target face. When the reliability score is greater than the first preset threshold, it means that the detection result of the target face is a reliable result, that is, the detection is passed, and the first preset condition is met. Conversely, when the reliability score is not greater than the first preset threshold, it indicates that the detection result of the target face is unreliable, that is, the detection fails, and the first preset condition is not met.
在确定可靠性分值满足第一预设条件的情况下,即可根据每个关键点的关键点信息,在原始的待测人脸图像上标注出各个关键点,从而用户可以观看到每个关键点在图像上的位置,实现人脸关键点的可视化输出。When it is determined that the reliability score satisfies the first preset condition, each key point can be marked on the original face image to be tested according to the key point information of each key point, so that the user can watch the position of each key point on the image, and realize the visual output of the key point of the face.
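A minimal sketch of this pass/fail check, combining Equation (2) with the first preset threshold; the threshold value 0.7 is an assumed example, since the disclosure leaves it to prior knowledge or scene requirements:

```python
def keypoint_detection_passes(first_uncertainty, threshold=0.7):
    """Equation (2): reliability score theta = 1 - alpha, where alpha is the
    face-level first uncertainty in [0, 1]. Returns the score and whether
    it satisfies the first preset condition (score > threshold)."""
    score = 1.0 - first_uncertainty
    return score, score > threshold
```

Only when the check passes would the keypoints be drawn on the original image for visualization.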
例如一些场景中,在对人脸进行实时跟踪时,往往需要针对不同的工况采用不同的跟踪模型。举例来说,对于例如极暗光、逆光、模糊等极端场景,需要采用适用于极端场景的人脸跟踪模型;而对于例如光照良好的普通场景,则采用适用于普通场景的人脸跟踪模型即可。从而,本公开一些实施方式中,可以基于第一不确定度确定当前所处的场景复杂度,实现对人脸跟踪模型的切换,下面结合图7实施方式进行说明。For example, in some scenarios, it is often necessary to use different tracking models for different working conditions when tracking human faces in real time. For example, for extreme scenes such as extremely dark light, backlight, blur, etc., it is necessary to use a face tracking model suitable for extreme scenes; and for ordinary scenes such as good lighting, it is sufficient to use a face tracking model suitable for ordinary scenes. Therefore, in some implementations of the present disclosure, the complexity of the current scene may be determined based on the first uncertainty to implement switching of the face tracking model, which will be described below with reference to the implementation in FIG. 7 .
如图7所示,在一些实施方式中,本公开的图像处理方法,确定目标人脸的检测结果包括:As shown in FIG. 7, in some implementations, in the image processing method of the present disclosure, determining the detection result of the target face includes:
S132-1、根据目标人脸的第一不确定度,确定目标人脸的可靠性分值。S132-1. Determine the reliability score of the target face according to the first uncertainty of the target face.
S132-2、根据可靠性分值,以及预先建立的可靠性分值与人脸跟踪模型的对应关系,由预先设置的多个人脸跟踪模型中确定目标人脸跟踪模型。S132-2. Determine a target face tracking model from a plurality of preset face tracking models according to the reliability score and the pre-established correspondence between the reliability score and the face tracking model.
S132-3、利用目标人脸跟踪模型对目标人脸进行检测跟踪,得到目标人脸的检测结果。S132-3. Use the target face tracking model to detect and track the target face, and obtain a detection result of the target face.
具体而言,在本示例中,可以基于前述图6实施方式的过程,确定目标人脸的可靠性分值,本公开对此不再赘述。Specifically, in this example, the reliability score of the target face may be determined based on the aforementioned process of the implementation manner in FIG. 6 , which will not be repeated in this disclosure.
可以理解的是,对于不同光照场景的待检测人脸图像,关键点检测得到的第一不确定度也应当不同。例如在极暗光场景下,电子设备采集的人脸图像曝光度很低,从而关键点检测得到的第一不确定度较大,相应的,目标人脸的可靠性分值也就越低。反之,例如在明亮场景下,电子设备采集的人脸图像曝光度正常,从而关键点检测得到的第一不确定度较低,相应的,目标人脸的可靠性分值也就越高。It can be understood that for face images to be detected in different lighting scenes, the first uncertainties obtained by key point detection should also be different. For example, in an extremely dark scene, the exposure of the face image collected by the electronic device is very low, so the first uncertainty obtained by the key point detection is relatively large, and correspondingly, the reliability score of the target face is also lower. Conversely, for example, in a bright scene, the exposure of the face image collected by the electronic device is normal, so the first uncertainty obtained by the key point detection is lower, and correspondingly, the reliability score of the target face is higher.
据此可以基于先验知识或者有限次试验,预先建立可靠性分值与人脸跟踪模型的对应关系。在一个示例中,预先建立的对应关系可如下表二所示:Accordingly, the correspondence between the reliability score and the face tracking model can be established in advance based on prior knowledge or a limited number of experiments. In an example, the pre-established correspondence can be shown in Table 2 below:
表二 Table II

可靠性分值 reliability score ｜ 人脸跟踪模型 face tracking model ｜ 光线场景 light scene
[0.6, 1] ｜ 模型1 model 1 ｜ 普通场景 normal scene
[0, 0.6) ｜ 模型2 model 2 ｜ 暗光场景 dark scene
从而在确定目标人脸的可靠性分值之后，即可根据上表二中的对应关系，确定与可靠性分值对应的人脸跟踪模型为目标人脸跟踪模型，然后即可利用该目标人脸跟踪模型对目标人脸进行检测跟踪。例如一个示例中，待检测图像中目标人脸的可靠性分值为0.8，则可基于上述表二对应关系，确定当前场景为普通场景，对应的目标人脸跟踪模型为“模型1”，从而利用模型1对目标人脸进行跟踪检测，得到人脸检测结果。Therefore, after the reliability score of the target face is determined, the face tracking model corresponding to the reliability score can be determined as the target face tracking model according to the correspondence in Table 2 above, and then the target face tracking model can be used to detect and track the target face. For example, if the reliability score of the target face in the image to be detected is 0.8, it can be determined based on the correspondence in Table 2 that the current scene is a normal scene and the corresponding target face tracking model is "model 1", so model 1 is used to track and detect the target face to obtain the face detection result.
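The lookup between score intervals and tracking models can be sketched as follows; the interval boundary 0.6 and the model names are illustrative example values only, and a higher score indicates a brighter, more ordinary scene:

```python
# Illustrative sketch: map a reliability score to a face tracking model.
# Interval boundaries and model names are example values only.
def select_tracking_model(theta):
    """theta: reliability score in [0, 1] -> (model name, scene label)."""
    if theta >= 0.6:
        return "model_1", "normal scene"  # good lighting, low uncertainty
    return "model_2", "dark scene"        # extreme/low-light, high uncertainty
```

A face scoring 0.8 is thus routed to the model for ordinary scenes, matching the worked example in the text.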
通过上述可知,在本示例实施方式中,可以基于可靠性分值判断当前光线场景,从而选择对应的人脸跟踪模型进行人脸跟踪检测,提高检测系统的效果。From the above, it can be seen that in this exemplary embodiment, the current lighting scene can be judged based on the reliability score, so as to select the corresponding face tracking model for face tracking detection, and improve the effect of the detection system.
例如一些场景中,可根据本公开方法实现对入库照片的质量检测。举例来说,对于身份验证等人脸识别场景,往往需要用户预先上传符合要求的人脸照片,从而作为后续身份验证时调取使用的模板照片。在此情况下,可通过本公开方法对用户上传照片进行检测,确定上传照片是否合格。下面结合图8实施方式进行说明。For example, in some scenarios, the quality inspection of the stored photos can be implemented according to the disclosed method. For example, for face recognition scenarios such as identity verification, users are often required to upload a face photo that meets the requirements in advance, so as to be used as a template photo for subsequent identity verification. In this case, the disclosed method can be used to detect the photos uploaded by users to determine whether the uploaded photos are qualified. The following will describe the embodiment in conjunction with FIG. 8 .
如图8所示,在一些实施方式中,本公开的图像处理方法,确定目标人脸的检测结果包括:As shown in FIG. 8, in some implementations, in the image processing method of the present disclosure, determining the detection result of the target face includes:
S133-1、根据目标人脸的第一不确定度,确定目标人脸的可靠性分值。S133-1. Determine the reliability score of the target face according to the first uncertainty of the target face.
S133-2、响应于可靠性分值满足第二预设条件,确定待测人脸图像的目标人脸检测通过。S133-2. In response to the reliability score satisfying the second preset condition, determine that the target face detection of the face image to be tested has passed.
具体而言,在用户上传人脸图像或者通过电子设备采集用户人脸图像之后,该人脸图像即可作为本公开前述实施方式所述的待测人脸图像,基于前述实施方式方法对待测人脸图像进行关键点检测,可以得到目标人脸的关键点信息以及第一不确定度。Specifically, after the user uploads the face image or collects the user's face image through an electronic device, the face image can be used as the face image to be tested as described in the foregoing embodiments of the present disclosure. Based on the methods of the foregoing embodiments, the key point detection of the face image to be tested can obtain the key point information and the first uncertainty of the target face.
在本示例中,可以基于前述图6实施方式的过程,确定目标人脸的可靠性分值,本公开对此不再赘述。In this example, the reliability score of the target face can be determined based on the aforementioned process of the implementation manner in FIG. 6 , which will not be repeated in this disclosure.
可以理解,对于人脸识别所需的入库照片,往往需要满足一定的要求,例如面部无遮挡、人脸倾斜角度不能过大等等,这些干扰因素会导致人脸关键点缺失或偏移,从而关键点检测得到的第一不确定度较大。It can be understood that for the photos required for face recognition, it is often necessary to meet certain requirements, such as the face is unobstructed, the tilt angle of the face cannot be too large, etc. These interference factors will cause the key points of the face to be missing or shifted, so the first uncertainty of the key point detection is relatively large.
据此可以基于先验知识或者有限次试验,预先设置第二预设阈值,第二预设阈值表示目标人脸是否检测通过的临界值。当可靠性分值大于该第二预设阈值时,表示目标人脸的检测通过,满足第二预设条件,可以入库。反之,当可靠性分值不大于该第二预设阈值时,表示目标人脸的检测结果不通过,不满足第二预设条件,无法入库。Accordingly, a second preset threshold may be preset based on prior knowledge or a limited number of trials, and the second preset threshold represents a critical value for whether the target face is detected or not. When the reliability score is greater than the second preset threshold, it means that the detection of the target face is passed, meets the second preset condition, and can be stored. Conversely, when the reliability score is not greater than the second preset threshold, it means that the detection result of the target face does not pass, does not meet the second preset condition, and cannot be stored.
在一些实施方式中,在确定目标人脸检测不通过的情况下,还可以根据关键点信息确定不符合要求的关键点,从而向用户输出提示信息,例如“眉毛存在遮挡”等。In some implementations, when it is determined that the target face detection fails, the key points that do not meet the requirements can also be determined according to the key point information, so as to output prompt information to the user, such as "eyebrows are blocked" and so on.
通过上述可知,本公开实施方式的方法,可以应用于各种人脸识别场景,可以基于第一不确定度区分图像质量或者当前环境条件,实用性和鲁棒性强,提高人脸识别任务的效果。From the above, it can be seen that the method of the embodiments of the present disclosure can be applied to various face recognition scenarios, and can distinguish image quality or current environmental conditions based on the first uncertainty, which has strong practicability and robustness, and improves the effect of face recognition tasks.
值得说明的是,本公开实施方式中,对于例如图5所示的图像检测网络,在训练过程中无需对每个关键点的不确定度进行回归优化,而是对人脸的综合不确定度进行优化,网络容易收敛,效果更好,并且大大提高训练效率。下面结合图9实施方式对训练过程进行具体说明。It is worth noting that, in the embodiment of the present disclosure, for the image detection network shown in Figure 5, for example, in the training process, it is not necessary to perform regression optimization on the uncertainty of each key point, but to optimize the comprehensive uncertainty of the face, the network is easy to converge, the effect is better, and the training efficiency is greatly improved. The training process will be described in detail below with reference to the embodiment shown in FIG. 9 .
如图9所示,在一些实施方式中,本公开示例的图像处理方法,对图像检测网络进行网络训练的过程包括:As shown in FIG. 9, in some implementations, in the image processing method of the disclosed example, the process of performing network training on the image detection network includes:
S910、获取样本数据集。S910. Acquire a sample data set.
具体而言,样本数据集包括海量的样本数据,例如一个示例中,样本数据集包括5000张样本数据。对于每一个样本数据,其包括人脸样本图像,以及预先标注的人脸样本图像中目标人脸的每个关键点的关键点标签。Specifically, the sample data set includes a large amount of sample data. For example, in one example, the sample data set includes 5000 pieces of sample data. For each sample data, it includes a face sample image and a key point label of each key point of the target face in the pre-labeled face sample image.
可以理解，关键点标签表示人脸样本图像中目标人脸的各个关键点的真实值（Ground truth），关键点标签可以通过人工标注的方式得到。例如一个示例中，可以通过人工标注的方式对人脸样本图像中的目标人脸的N个关键点坐标进行标记，得到每个人脸样本图像对应的关键点标签。It can be understood that the key point label represents the ground truth of each key point of the target face in the face sample image, and the key point label can be obtained by manual labeling. For example, in an example, the N key point coordinates of the target face in the face sample image can be marked by manual labeling to obtain the key point label corresponding to each face sample image.
在一些实施方式中，还可以预先对样本数据集中海量数据进行预处理，预处理的过程可参照前述图2实施方式，也即由人脸样本图像中裁切出人脸区域作为图像检测网络的输入图像。In some implementations, the massive data in the sample data set can also be preprocessed in advance. The preprocessing process can refer to the aforementioned embodiment in FIG. 2, that is, the face region is cropped from the face sample image as the input image of the image detection network.
S920、对于任意一个样本数据,将人脸样本图像输入待训练的特征提取网络,得到特征提取网络输出的人脸样本图像的特征图。S920. For any sample data, input the face sample image into the feature extraction network to be trained, and obtain the feature map of the face sample image output by the feature extraction network.
本公开实施方式中,图像检测网络的网络结构可参照前述图5实施方式所示。在利用样本数据集对图像检测网络进行网络训练时,可将每n个样本数据作为一个批次(Batch)的训练样本,通常n可以取256。下面以一个样本数据为例,对训练过程进行说明。In the implementation manner of the present disclosure, the network structure of the image detection network may refer to the implementation manner shown in FIG. 5 above. When using the sample data set to perform network training on the image detection network, each n sample data can be used as a batch (Batch) of training samples, usually n can be 256. The following takes a sample data as an example to illustrate the training process.
在一些实施方式中,在将人脸样本图像输入图像检测网络之前,可以预先对人脸样本图像进行归一化处理,归一化处理的过程可参照前述式(1),对此不再赘述。In some implementations, before inputting the face sample image into the image detection network, the face sample image can be normalized in advance, and the normalization process can refer to the aforementioned formula (1), which will not be repeated here.
将样本数据包括的人脸样本图像输入待训练的特征提取网络510中,从而特征提取网络510输出得到人脸样本图像所对应的特征图。The human face sample image included in the sample data is input into the feature extraction network 510 to be trained, so that the feature extraction network 510 outputs a feature map corresponding to the human face sample image.
S930、将人脸样本图像的特征图输入待训练的关键点检测网络,得到目标人脸的每个关键点的关键点信息,以及目标人脸的第一不确定度。S930. Input the feature map of the face sample image into the key point detection network to be trained, and obtain the key point information of each key point of the target face and the first uncertainty of the target face.
具体而言,特征提取网络510输出的特征图作为关键点检测网络520的输入,经过关键点检测网络520池化层和全连接层,分别输出目标人脸的关键点信息P以及目标人脸的第一不确定度α。Specifically, the feature map output by the feature extraction network 510 is used as the input of the key point detection network 520, and through the pooling layer and the fully connected layer of the key point detection network 520, the key point information P of the target face and the first uncertainty α of the target face are respectively output.
在一个示例中,关键点检测网络520输出的目标人脸的关键点信息表示为:In an example, the key point information of the target face output by the key point detection network 520 is expressed as:
P = {(x_1, y_1), (x_2, y_2), …, (x_i, y_i)}, i = 1, 2, …, N            式(3) Equation (3)
式(3)中，P表示关键点信息，N表示关键点数量，(x_i, y_i)表示第i个关键点的位置坐标。In formula (3), P represents key point information, N represents the number of key points, and (x_i, y_i) represents the position coordinates of the i-th key point.
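If the detection head regresses the 2N coordinates of Equation (3) together with the single face-level uncertainty as one flat vector (an assumed output layout, not one specified by the disclosure), splitting it back into P and α might look like:

```python
# Hypothetical sketch: split a flat head output [x1, y1, ..., xN, yN, alpha]
# into the key point set P of Equation (3) and the first uncertainty alpha.
def parse_head_output(raw, num_keypoints):
    assert len(raw) == 2 * num_keypoints + 1, "expected 2N coordinates plus alpha"
    coords = raw[:-1]
    P = [(coords[2 * i], coords[2 * i + 1]) for i in range(num_keypoints)]
    alpha = raw[-1]  # comprehensive uncertainty of the whole face
    return P, alpha
```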
S940、基于关键点信息、关键点标签以及第一不确定度,确定关键点信息与关键点标签之间的差异。S940. Determine a difference between the key point information and the key point label based on the key point information, the key point label, and the first uncertainty.
具体而言,关键点信息可以包括图像检测网络预测输出的关键点的位置坐标,而关键点标签表示关键点的真实坐标,从而可基于预先构建的损失函数计算出两者的差异,也即损失。Specifically, the key point information can include the position coordinates of the key points predicted by the image detection network, and the key point labels represent the real coordinates of the key points, so that the difference between the two can be calculated based on the pre-built loss function, that is, the loss.
值得说明的是,本公开实施方式方法中,并非仅根据关键点信息与关键点标签之间的差异对图像检测网络进行优化训练,而是融合第一不确定度同时进行优化训练,从而无需为第一不确定度设置额外的标签,网络更容易收敛。It is worth noting that, in the method of the embodiment of the present disclosure, the image detection network is not optimized for training based on the difference between key point information and key point labels, but the first uncertainty is integrated for optimal training at the same time, so that there is no need to set additional labels for the first uncertainty, and the network is easier to converge.
在一些实施方式中,图像处理网络采用多目标约束损失函数,表示如下:In some implementations, the image processing network uses a multi-objective constraint loss function, expressed as follows:
L = L_p + λ·L_α            式(4) Equation (4)
在式(4)中，L表示关键点信息与关键点标签之间的损失，L_p表示关键点误差损失函数，L_α表示不确定度误差损失函数，两者表示如下：In formula (4), L represents the loss between the key point information and the key point labels, L_p represents the key point error loss function, and L_α represents the uncertainty error loss function, which are expressed as follows:
σ_p = √( (1/N) · Σ_{i=1}^{N} [ (x_i − x̂_i)² + (y_i − ŷ_i)² ] )            式(5) Equation (5)
L_p = f(σ_p)            式(6) Equation (6)
L_α = f(σ_p − α)            式(7) Equation (7)
f(x) = |x|            式(8) Equation (8)
在式(5)~(8)中，σ_p表示目标人脸所有关键点的均方根误差，α表示预测输出的第一不确定度，f表示L1损失函数，x_i和x̂_i分别表示第i个关键点x坐标的标签值和预测值，y_i和ŷ_i分别表示第i个关键点y坐标的标签值和预测值。In formulas (5)-(8), σ_p represents the root mean square error over all key points of the target face, α represents the first uncertainty of the predicted output, f represents the L1 loss function, x_i and x̂_i respectively represent the label value and the predicted value of the x coordinate of the i-th key point, and y_i and ŷ_i respectively represent the label value and the predicted value of the y coordinate of the i-th key point.
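The multi-objective loss of Equations (4)-(8) can be sketched numerically as follows; λ is a weighting hyperparameter whose value, like the exact per-point averaging inside the RMSE, is an assumption here rather than a value fixed by the disclosure:

```python
import math

# Illustrative sketch of Equations (4)-(8): L = L_p + lambda * L_alpha, with
# sigma_p the RMSE over all key points and f the L1 (absolute value) function.
def multi_objective_loss(pred, label, alpha, lambda_=1.0):
    """pred/label: lists of (x, y) key points; alpha: predicted first uncertainty."""
    n = len(pred)
    sigma_p = math.sqrt(  # Equation (5): RMSE over all key points of the face
        sum((px - lx) ** 2 + (py - ly) ** 2
            for (px, py), (lx, ly) in zip(pred, label)) / n
    )
    f = abs                         # Equation (8): L1 loss
    L_p = f(sigma_p)                # Equation (6): key point error term
    L_alpha = f(sigma_p - alpha)    # Equation (7): uncertainty error term
    return L_p + lambda_ * L_alpha  # Equation (4)
```

Note that L_α pulls the predicted α toward σ_p itself, so no separate label is needed for the uncertainty, which is the property the surrounding text highlights.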
S950、根据差异调整特征提取网络和/或关键点检测网络的网络参数,直至满足收敛条件,得到训练后的特征提取网络和/或关键点检测网络。S950. Adjust the network parameters of the feature extraction network and/or the key point detection network according to the difference until the convergence condition is satisfied, and obtain the trained feature extraction network and/or the key point detection network.
具体而言,在确定预测值与标签值之间的差异之后,即可根据该差异反向传播对特征提取网络和/或关键点检测网络的网络参数进行优化调整。利用样本数据集中的样本数据反复重复上述过程,对图像检测网络进行迭代优化,直至满足收敛条件,网络训练完成。Specifically, after determining the difference between the predicted value and the label value, the network parameters of the feature extraction network and/or the key point detection network can be optimized and adjusted according to the difference backpropagation. The above process is repeated repeatedly using the sample data in the sample data set, and the image detection network is iteratively optimized until the convergence condition is met, and the network training is completed.
值得说明的是,本公开实施方式中,通过构建例如式(4)所示的损失函数,融合目标人脸的第一不确定度对网络进行优化训练,提高图像处理网络的效果。并且,构建的损失函数结构简单,无需对第一不确定度额外设置标签,即可实现对第一不确定度的优化,网络更容易收敛。而且第一不确定度为目标人脸的综合不确定度,在训练过程中无需对每一个关键点单独进行回归优化,简化计算量,提高网络训练效率。It is worth noting that, in the embodiments of the present disclosure, by constructing the loss function shown in formula (4), for example, the first uncertainty of the target face is integrated to optimize the training of the network to improve the effect of the image processing network. Moreover, the constructed loss function has a simple structure, and the optimization of the first uncertainty can be realized without setting additional labels for the first uncertainty, and the network is easier to converge. Moreover, the first uncertainty is the comprehensive uncertainty of the target face. During the training process, there is no need to perform regression optimization on each key point separately, which simplifies the amount of calculation and improves the efficiency of network training.
第二方面,本公开实施方式提供了一种图像处理装置,该装置可应用于电子设备。本公开实施方式中,电子设备可以是任何适于实施的设备类型,例如移动终端、车载终端、可穿戴设备、门禁系统、视频监控系统、云平台及服务器等,本公开对此不作限制。In a second aspect, the embodiments of the present disclosure provide an image processing device, which can be applied to electronic equipment. In the embodiments of the present disclosure, the electronic device may be any type of device suitable for implementation, such as a mobile terminal, a vehicle terminal, a wearable device, an access control system, a video surveillance system, a cloud platform, and a server, etc., and the present disclosure does not limit this.
如图10所示,在一些实施方式中,本公开示例的图像处理装置,包括:As shown in FIG. 10 , in some implementations, the image processing device of the present disclosure includes:
获取模块10,被配置为获取待测人脸图像,所述待测人脸图像中包括目标人脸;The obtaining module 10 is configured to obtain a human face image to be tested, the human face image to be tested includes a target human face;
图像检测模块20,被配置为对所述待测人脸图像进行图像检测,确定所述目标人脸的至少一个关键点的关键点信息以及所述目标人脸的第一不确定度;其中,所述第一不确定度根据所述目标人脸的所有关键点的第二不确定度得到;The image detection module 20 is configured to perform image detection on the face image to be tested, and determine key point information of at least one key point of the target face and a first uncertainty of the target face; wherein, the first uncertainty is obtained according to the second uncertainty of all key points of the target face;
结果确定模块30,被配置为根据所述关键点信息和所述第一不确定度,确定所述目标人脸的检测结果。The result determination module 30 is configured to determine the detection result of the target face according to the key point information and the first uncertainty.
通过上述可知,本公开实施方式中,通过目标人脸的第一不确定度辅助对人脸关键点的检测,提高人脸检测效果和精度。并且基于目标人脸的所有关键点的第二不确定度确定代表目标人脸综合误差的第一不确定度,提高网络效果和训练效率。同时,本公开方法对于应用场景不作限制,可以适用于多种场景的下游任务,例如人脸图像质量检测、人脸追踪、关键点定位等,鲁棒性更高。It can be known from the above that in the embodiments of the present disclosure, the first uncertainty of the target face is used to assist the detection of key points of the face, so as to improve the effect and accuracy of face detection. And based on the second uncertainty of all the key points of the target face, the first uncertainty representing the comprehensive error of the target face is determined to improve the network effect and training efficiency. At the same time, the disclosed method does not limit the application scenarios, and can be applied to downstream tasks in various scenarios, such as face image quality detection, face tracking, key point positioning, etc., and has higher robustness.
在一些实施方式中,所述获取模块10被配置为:In some implementations, the acquisition module 10 is configured to:
获取待处理图像,所述待处理图像中包括至少一个人脸;Acquiring an image to be processed, the image to be processed includes at least one human face;
对所述待处理图像进行图像检测,确定所述待处理图像上的每个所述人脸的人脸区域信息;performing image detection on the image to be processed, and determining face area information of each face on the image to be processed;
对于任意一个人脸,根据所述人脸区域信息裁切得到所述人脸对应的所述待测人脸图像。For any human face, the human face image to be tested corresponding to the human face is obtained by cropping according to the human face area information.
在一些实施方式中,所述图像检测模块20被配置为:In some embodiments, the image detection module 20 is configured to:
对所述待测人脸图像进行关键点检测,基于预先设置的人脸关键点类型,确定所述目标人脸的每个关键点的所述关键点信息和第二不确定度;Carry out key point detection on the face image to be tested, and determine the key point information and the second uncertainty of each key point of the target face based on the preset face key point type;
根据所述目标人脸的各个关键点的第二不确定度,确定所述目标人脸的第一不确定度。The first uncertainty of the target face is determined according to the second uncertainty of each key point of the target face.
在一些实施方式中，所述预设关键点类型包括以下至少之一：In some implementations, the preset key point types include at least one of the following:
脸部轮廓关键点,眼睛关键点,眉毛关键点,鼻子关键点,嘴部关键点,耳朵关键点。Key points of face contour, key points of eyes, key points of eyebrows, key points of nose, key points of mouth, key points of ears.
在一些实施方式中,所述结果确定模块30被配置为:In some embodiments, the result determination module 30 is configured to:
根据所述目标人脸的所述第一不确定度,确定所述目标人脸的可靠性分值;determining the reliability score of the target face according to the first uncertainty of the target face;
响应于所述可靠性分值满足第一预设条件,根据各个关键点的所述关键点信息,在所述待测人脸图像上输出所述关键点。In response to the reliability score satisfying a first preset condition, the key points are output on the human face image to be tested according to the key point information of each key point.
在一些实施方式中,所述结果确定模块30被配置为:In some embodiments, the result determination module 30 is configured to:
根据所述目标人脸的所述第一不确定度,确定所述目标人脸的可靠性分值;determining the reliability score of the target face according to the first uncertainty of the target face;
根据所述可靠性分值,以及预先建立的可靠性分值与人脸跟踪模型的对应关系,由预先设置的多个人脸跟踪模型中确定目标人脸跟踪模型;According to the reliability score and the corresponding relationship between the reliability score and the face tracking model established in advance, the target face tracking model is determined from a plurality of preset face tracking models;
利用所述目标人脸跟踪模型对所述目标人脸进行检测跟踪,得到所述目标人脸的所述检测结果。The target face is detected and tracked by using the target face tracking model to obtain the detection result of the target face.
在一些实施方式中,所述结果确定模块30被配置为:In some embodiments, the result determination module 30 is configured to:
根据所述目标人脸的所述第一不确定度,确定所述目标人脸的可靠性分值;determining the reliability score of the target face according to the first uncertainty of the target face;
响应于所述可靠性分值满足第二预设条件,确定所述待测人脸图像的所述目标人脸检测通过。In response to the reliability score satisfying a second preset condition, it is determined that the target face detection of the face image to be tested passes.
如图11所示,在一些实施方式中,所述图像检测模块20包括:As shown in Figure 11, in some implementations, the image detection module 20 includes:
特征提取模块40,被配置为将所述待测人脸图像输入预先训练的特征提取网络,得到所述特征提取网络输出的特征图;The feature extraction module 40 is configured to input the pre-trained feature extraction network of the human face image to be tested, and obtain the feature map output by the feature extraction network;
关键点检测模块50,被配置为将所述特征图输入预先训练的关键点检测网络,得到所述关键点检测网络输出的所述目标人脸的各个关键点的所述关键点信息以及所述第一不确定度。The key point detection module 50 is configured to input the feature map into a pre-trained key point detection network, and obtain the key point information and the first uncertainty of each key point of the target face output by the key point detection network.
在一些实施方式中,本公开实施方式所述的装置,还包括训练模块60,所述训练模块被配置为:In some embodiments, the device described in the embodiments of the present disclosure further includes a training module 60, the training module is configured to:
获取样本数据集,所述样本数据集中的每个样本数据包括人脸样本图像,以及所述人脸样本图像中目标人脸的每个关键点的关键点标签;Obtain a sample data set, each sample data in the sample data set includes a face sample image, and a key point label of each key point of the target face in the face sample image;
对于任意一个样本数据，将所述人脸样本图像输入待训练的特征提取网络，得到所述特征提取网络输出的所述人脸样本图像的特征图；For any sample data, the face sample image is input into the feature extraction network to be trained to obtain the feature map of the face sample image output by the feature extraction network;
将所述人脸样本图像的特征图输入待训练的关键点检测网络,得到所述目标人脸的每个关键点的关键点信息,以及所述目标人脸的第一不确定度;Input the feature map of the human face sample image into the key point detection network to be trained, obtain the key point information of each key point of the target human face, and the first uncertainty of the target human face;
基于所述关键点信息、关键点标签以及所述第一不确定度,确定所述关键点信息与所述关键点标签之间的差异;determining a difference between the key point information and the key point label based on the key point information, the key point label, and the first uncertainty;
根据所述差异调整所述特征提取网络和/或所述关键点检测网络的网络参数,直至满足收敛条件,得到训练后的所述特征提取网络和/或所述关键点检测网络。Adjust the network parameters of the feature extraction network and/or the key point detection network according to the difference until a convergence condition is satisfied, and obtain the trained feature extraction network and/or the key point detection network.
通过上述可知,本公开实施方式中,通过融合目标人脸的第一不确定度对网络进行优化训练,提高图像处理网络的效果。并且,构建的损失函数结构简单,无需对第一不确定度额外设置标签,即可实现对第一不确定度的优化,网络更容易收敛。而且第一不确定度为目标人脸的综合不确定度,在训练过程中无需对每一个关键点单独进行回归优化,简化计算量,提高网络训练效率。It can be known from the above that in the embodiments of the present disclosure, the network is optimized and trained by fusing the first uncertainty of the target face to improve the effect of the image processing network. Moreover, the constructed loss function has a simple structure, and the optimization of the first uncertainty can be realized without setting additional labels for the first uncertainty, and the network is easier to converge. Moreover, the first uncertainty is the comprehensive uncertainty of the target face. During the training process, there is no need to perform regression optimization on each key point separately, which simplifies the amount of calculation and improves the efficiency of network training.
第三方面,本公开实施方式提供了一种电子设备,包括:In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
处理器;以及processor; and
存储器,存储有能够被所述处理器读取的计算机指令,当所述计算机指令被读取时,所述处理器执行根据第一方面任一实施方式所述的方法。The memory stores computer instructions that can be read by the processor, and when the computer instructions are read, the processor executes the method according to any implementation manner of the first aspect.
第四方面,本公开实施方式提供了一种存储介质,用于存储计算机可读指令,所述计算机可读指令用于使计算机执行根据第一方面任一实施方式所述的方法。In a fourth aspect, the embodiments of the present disclosure provide a storage medium for storing computer-readable instructions, and the computer-readable instructions are used to cause a computer to execute the method according to any embodiment of the first aspect.
具体而言,图12示出了适于用来实现本公开方法的电子设备600的结构示意图,通过图12所示电子设备,可实现上述处理器及存储介质相应功能。Specifically, FIG. 12 shows a schematic structural diagram of an electronic device 600 suitable for implementing the method of the present disclosure. The electronic device shown in FIG. 12 can realize the corresponding functions of the above-mentioned processor and storage medium.
如图12所示,电子设备600包括处理器601,其可以根据存储在存储器602中的程序或者从存储部分608加载到存储器602中的程序而执行各种适当的动作和处理。在存储器602中,还存储有电子设备600操作所需的各种程序和数据。处理器601和存储器602通过总线604彼此相连。输入/输出(I/O)接口605也连接至总线604。As shown in FIG. 12 , the electronic device 600 includes a processor 601 that can perform various appropriate actions and processes according to programs stored in the memory 602 or loaded from the storage part 608 into the memory 602 . In the memory 602, various programs and data necessary for the operation of the electronic device 600 are also stored. The processor 601 and the memory 602 are connected to each other through a bus 604 . An input/output (I/O) interface 605 is also connected to the bus 604 .
以下部件连接至I/O接口605:包括键盘、鼠标等的输入部分606;包括诸如阴极射线管(CRT)、液晶显示器(LCD)等以及扬声器等的输出部分607;包括硬盘等的存储部分608;以及包括诸如LAN卡、调制解调器等的网络接口卡的通信部分609。通信部分609经由诸如因特网的网络执行通信处理。驱动器610也根据需要连接至I/O接口605。可拆卸介质611,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器610上,以便于从其上读出的计算机程序根据需要被安装入存储部分608。The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, etc.; an output section 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker; a storage section 608 including a hard disk, etc.; and a communication section 609 including a network interface card such as a LAN card, a modem, etc. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc. is mounted on the drive 610 as necessary so that a computer program read therefrom is installed into the storage section 608 as necessary.
特别地,根据本公开的实施方式,上文方法过程可以被实现为计算机软件程序。例如,本公开的实施方式包括一种计算机程序产品,其包括有形地包含在机器可读介质上的计算机程序,计算机程序包含用于执行上述方法的程序代码。在这样的实施方式中,该计算机程序可以通过通信部分609从网络上被下载和安装,和/或从可拆卸介质611被安装。In particular, according to the embodiments of the present disclosure, the above method process can be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method described above. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 609 and/or installed from a removable medium 611 .
附图中的流程图和框图，图示了按照本公开各种实施方式的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上，流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分，模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意，在有些作为替换的实现中，方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如，两个接连地表示的方框实际上可以基本并行地执行，它们有时也可以按相反的顺序执行，这依所涉及的功能而定。也要注意的是，框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合，可以用执行规定的功能或操作的专用的基于硬件的系统来实现，或者可以用专用硬件与计算机指令的组合来实现。The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagram may represent a module, program segment, or a portion of code that includes one or more executable instructions for implementing specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block in the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special purpose hardware-based systems that perform the specified functions or operations, or by combinations of special purpose hardware and computer instructions.
显然,上述实施方式仅仅是为清楚地说明所作的举例,而并非对实施方式的限定。对于所属领域的普通技术人员来说,在上述说明的基础上还可以做出其它不同形式的变化或变动。这里无需也无法对所有的实施方式予以穷举。而由此所引伸出的显而易见的变化或变动仍处于本公开创造的保护范围之中。Apparently, the above-mentioned implementation manners are only examples for clear description, rather than limiting the implementation manners. For those of ordinary skill in the art, other changes or changes in different forms can be made on the basis of the above description. It is not necessary and impossible to exhaustively list all the implementation manners here. And the obvious changes or changes derived therefrom are still within the scope of protection of the present disclosure.

Claims (12)

  1. An image processing method, comprising:
    acquiring a face image to be tested, the face image to be tested including a target face;
    performing image detection on the face image to be tested, and determining key point information of at least one key point of the target face and a first uncertainty of the target face, wherein the first uncertainty is obtained according to second uncertainties of all key points of the target face; and
    determining a detection result of the target face according to the key point information and the first uncertainty.
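Claim 1 derives a face-level ("first") uncertainty from the per-key-point ("second") uncertainties, but does not fix the aggregation function. A minimal sketch, assuming a simple mean as the aggregation:

```python
import numpy as np

def face_uncertainty(keypoint_uncertainties):
    """Aggregate per-key-point (second) uncertainties into one
    face-level (first) uncertainty. The mean is an assumption for
    illustration; the claim only requires that the result be
    obtained from the second uncertainties of all key points."""
    u = np.asarray(keypoint_uncertainties, dtype=float)
    return float(u.mean())
```

Any monotone aggregation (maximum, weighted mean over key point types) would satisfy the claim language equally well.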
  2. The method according to claim 1, wherein said acquiring a face image to be tested comprises:
    acquiring an image to be processed, the image to be processed including at least one face;
    performing image detection on the image to be processed, and determining face region information of each face in the image to be processed; and
    for any one of the faces, cropping the image to be processed according to the face region information to obtain the face image to be tested corresponding to the face.
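The per-face cropping step in claim 2 amounts to slicing each detected region out of the full image. A sketch, assuming the face region information is an `(x, y, w, h)` box and the image is an H×W×C array (both formats are assumptions, not stated in the claim):

```python
import numpy as np

def crop_faces(image, face_boxes):
    """Produce one face image to be tested per detected face region.
    Box format (x, y, w, h) in pixel coordinates is assumed."""
    crops = []
    for x, y, w, h in face_boxes:
        # Slice rows by y, columns by x; copy so crops are independent.
        crops.append(image[y:y + h, x:x + w].copy())
    return crops
```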
  3. The method according to claim 1, wherein said performing image detection on the face image to be tested, and determining key point information of at least one key point of the target face and a first uncertainty of the target face comprises:
    performing key point detection on the face image to be tested, and determining the key point information and a second uncertainty of each key point of the target face based on preset face key point types; and
    determining the first uncertainty of the target face according to the second uncertainties of the key points of the target face.
  4. The method according to claim 3, wherein the face key point types include at least one of the following:
    face contour key points, eye key points, eyebrow key points, nose key points, mouth key points, and ear key points.
  5. The method according to any one of claims 1 to 4, wherein said determining a detection result of the target face according to the key point information and the first uncertainty comprises:
    determining a reliability score of the target face according to the first uncertainty of the target face; and
    in response to the reliability score satisfying a first preset condition, outputting the key points on the face image to be tested according to the key point information of each key point.
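Claims 5 to 7 map the first uncertainty to a reliability score and gate further processing on a preset condition. Neither the score mapping nor the condition is fixed by the claims; one plausible sketch (the reciprocal mapping and the threshold value are both assumptions):

```python
def reliability_score(first_uncertainty):
    """Map uncertainty to a score in (0, 1]: lower uncertainty
    gives a higher score. The exact mapping is an assumption."""
    return 1.0 / (1.0 + first_uncertainty)

def keypoints_to_output(score, keypoints, threshold=0.5):
    """Return the key points only when the score satisfies the
    (assumed) first preset condition score >= threshold;
    otherwise suppress the unreliable detection."""
    return keypoints if score >= threshold else None
```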
  6. The method according to any one of claims 1 to 4, wherein said determining a detection result of the target face according to the key point information and the first uncertainty comprises:
    determining a reliability score of the target face according to the first uncertainty of the target face;
    determining a target face tracking model from a plurality of preset face tracking models according to the reliability score and a pre-established correspondence between reliability scores and face tracking models; and
    detecting and tracking the target face by using the target face tracking model to obtain the detection result of the target face.
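The model selection in claim 6 is a lookup from score ranges to trackers. A sketch with a hypothetical banded correspondence table (the band boundaries and model names are illustrative assumptions):

```python
def select_tracking_model(score, bands):
    """Pick a face tracking model from a pre-established
    correspondence given as (minimum score, model) pairs.
    Highest band whose minimum the score reaches wins."""
    bands = sorted(bands, key=lambda b: b[0], reverse=True)
    for min_score, model in bands:
        if score >= min_score:
            return model
    return bands[-1][1]  # fall back to the least demanding model
```

For example, a high-reliability face might be handed to a heavier, more precise tracker while a low-reliability one gets a lightweight or re-detection path.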
  7. The method according to any one of claims 1 to 4, wherein said determining a detection result of the target face according to the key point information and the first uncertainty comprises:
    determining a reliability score of the target face according to the first uncertainty of the target face; and
    in response to the reliability score satisfying a second preset condition, determining that the target face of the face image to be tested passes detection.
  8. The method according to any one of claims 1 to 7, wherein said performing image detection on the face image to be tested, and determining key point information of at least one key point of the target face and a first uncertainty of the target face comprises:
    inputting the face image to be tested into a pre-trained feature extraction network to obtain a feature map output by the feature extraction network; and
    inputting the feature map into a pre-trained key point detection network to obtain the key point information of each key point of the target face and the first uncertainty output by the key point detection network.
  9. The method according to claim 8, further comprising a training process for training the feature extraction network and the key point detection network, the training process comprising:
    acquiring a sample data set, each sample data item in the sample data set including a face sample image and a key point label of each key point of a target face in the face sample image;
    for any sample data item, inputting the face sample image into the feature extraction network to be trained to obtain a feature map of the face sample image output by the feature extraction network;
    inputting the feature map of the face sample image into the key point detection network to be trained to obtain the key point information of each key point of the target face and the first uncertainty of the target face;
    determining a difference between the key point information and the key point labels based on the key point information, the key point labels, and the first uncertainty; and
    adjusting network parameters of the feature extraction network and/or the key point detection network according to the difference until a convergence condition is satisfied, to obtain the trained feature extraction network and/or key point detection network.
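In the training process of claim 9, the difference between predictions and labels is computed based on the key point information, the labels, and the first uncertainty. One common realization of such an uncertainty-dependent difference (an assumption here, not stated in the claim) is a Gaussian negative-log-likelihood term, where a predicted log-variance down-weights the residual of unreliable samples:

```python
import numpy as np

def uncertainty_weighted_loss(pred, label, log_var):
    """Gaussian NLL-style loss: exp(-log_var) scales the squared
    residual, while the + log_var term penalizes the network for
    claiming high uncertainty everywhere. With log_var = 0 this
    reduces to ordinary mean squared error."""
    pred = np.asarray(pred, dtype=float)
    label = np.asarray(label, dtype=float)
    return float(np.mean(np.exp(-log_var) * (pred - label) ** 2 + log_var))
```

Gradients of such a loss would then drive the parameter adjustment of the feature extraction and key point detection networks until convergence.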
  10. An image processing apparatus, comprising:
    an acquisition module configured to acquire a face image to be tested, the face image to be tested including a target face;
    an image detection module configured to perform image detection on the face image to be tested, and determine key point information of at least one key point of the target face and a first uncertainty of the target face, wherein the first uncertainty is obtained according to second uncertainties of all key points of the target face; and
    a result determination module configured to determine a detection result of the target face according to the key point information and the first uncertainty.
  11. An electronic device, comprising:
    a processor; and
    a memory storing computer instructions readable by the processor, wherein when the computer instructions are read, the processor executes the method according to any one of claims 1 to 9.
  12. A storage medium storing computer-readable instructions for causing a computer to execute the method according to any one of claims 1 to 9.
PCT/CN2022/090297 2022-01-21 2022-04-29 Image processing method and apparatus, and electronic device and storage medium WO2023137905A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210074181.7A CN116543426A (en) 2022-01-21 2022-01-21 Image processing method, device, electronic equipment and storage medium
CN202210074181.7 2022-01-21

Publications (1)

Publication Number Publication Date
WO2023137905A1 true WO2023137905A1 (en) 2023-07-27

Family

ID=87347704

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/090297 WO2023137905A1 (en) 2022-01-21 2022-04-29 Image processing method and apparatus, and electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN116543426A (en)
WO (1) WO2023137905A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472494A (en) * 2019-06-21 2019-11-19 深圳壹账通智能科技有限公司 Face feature extracts model training method, facial feature extraction method, device, equipment and storage medium
CN111488774A (en) * 2019-01-29 2020-08-04 北京搜狗科技发展有限公司 Image processing method and device for image processing
CN112200176A (en) * 2020-12-10 2021-01-08 长沙小钴科技有限公司 Method and system for detecting quality of face image and computer equipment
CN112581480A (en) * 2020-12-22 2021-03-30 深圳市雄帝科技股份有限公司 Automatic image matting method, system and readable storage medium thereof


Also Published As

Publication number Publication date
CN116543426A (en) 2023-08-04


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22921336

Country of ref document: EP

Kind code of ref document: A1