CN113283319A - Method and device for evaluating face ambiguity, medium and electronic equipment - Google Patents

Method and device for evaluating face ambiguity, medium and electronic equipment

Info

Publication number
CN113283319A
Authority
CN
China
Prior art keywords
face
video frame
target video
frame
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110524303.3A
Other languages
Chinese (zh)
Inventor
邹子杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202110524303.3A
Publication of CN113283319A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • G06T5/73
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/50Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation

Abstract

The disclosure provides a method and a device for evaluating face blurriness, a computer-readable medium, and an electronic device, and relates to the technical field of image processing. The method comprises the following steps: extracting a target video frame containing a human face from video data, and acquiring a reference video frame corresponding to the target video frame from the video data; calculating a face pose feature, a face similarity feature and a face gradient feature corresponding to each face contained in the target video frame based on the target video frame and the reference video frame; and fusing, for each face, the corresponding face pose feature, face similarity feature and face gradient feature to obtain a fusion feature, and determining the face blurriness corresponding to each face based on its fusion feature. By using multiple frames and their semantic information, blur of the face region caused by motion and blur caused by image quality can be considered at the same time, yielding a more accurate face blur degree.

Description

Method and device for evaluating face ambiguity, medium and electronic equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method and an apparatus for evaluating face ambiguity, a computer-readable medium, and an electronic device.
Background
Image scene understanding has attracted great research interest in recent years as a basic and important task of image classification understanding, and is widely applied to various fields. In addition to high-level semantics, there is a need for scene understanding in terms of image quality. For example, a task related to image quality is to deblur an image.
Tasks related to image quality, such as image deblurring, are complicated because blur has many causes. For example, owing to exposure time and other factors, imaging blur often occurs in captured images; fig. 1 shows face images with different degrees of blur. In this case, it is necessary to first determine the degree of blur of the image, so that a suitable deblurring algorithm can be selected according to that degree.
Disclosure of Invention
The present disclosure aims to provide a method for evaluating a face ambiguity, a device for evaluating a face ambiguity, a computer-readable medium, and an electronic device, so as to improve the accuracy of ambiguity estimation at least to a certain extent.
According to a first aspect of the present disclosure, there is provided a method for evaluating face ambiguity, including: extracting a target video frame containing a human face from video data, and acquiring a reference video frame corresponding to the target video frame from the video data; calculating face posture characteristics, face similarity characteristics and face gradient characteristics corresponding to each face contained in the target video frame based on the target video frame and the reference video frame; and respectively fusing the face posture characteristic, the face similarity characteristic and the face gradient characteristic corresponding to each face to obtain a fusion characteristic, and determining the face fuzziness corresponding to each face based on the fusion characteristic corresponding to each face.
According to a second aspect of the present disclosure, there is provided an apparatus for evaluating face blurriness, comprising: the video frame acquisition module is used for extracting a target video frame containing a human face from the video data and acquiring a reference video frame corresponding to the target video frame from the video data; the characteristic calculation module is used for calculating face posture characteristics, face similarity characteristics and face gradient characteristics corresponding to each face contained in the target video frame based on the target video frame and the reference video frame; and the ambiguity evaluation module is used for respectively fusing the face posture characteristics, the face similarity characteristics and the face gradient characteristics corresponding to each face to obtain fusion characteristics, and determining the face ambiguity corresponding to each face based on the fusion characteristics corresponding to each face.
According to a third aspect of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the above-mentioned method.
According to a fourth aspect of the present disclosure, there is provided an electronic apparatus, comprising: a processor; and memory storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the above-described method.
According to the method for evaluating face blurriness provided by the embodiments of the present disclosure, a target video frame in the video data is obtained and face detection is performed on it. When the target video frame includes at least one face, a reference video frame corresponding to the target video frame is acquired; the face pose feature, face similarity feature and face gradient feature corresponding to each face are calculated based on the multiple video frames; a fusion feature is determined for each face from these three features; and the face blurriness of each face is determined from its fusion feature. By using multiple frames and their semantic information, blur of the face region caused by motion and blur caused by image quality can be considered at the same time, yielding a more accurate face blur degree.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 illustrates face images of different degrees of blur;
FIG. 2 illustrates a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
FIG. 3 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied;
FIG. 4 schematically illustrates a flow chart of a method of blur level determination by a secondary blur (re-blur) algorithm;
fig. 5 schematically illustrates a flowchart of a method for evaluating face blurriness in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates the extraction of reference video frames in an exemplary embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow chart of a regression model training method in an exemplary embodiment of the present disclosure;
fig. 8 schematically illustrates a flowchart of another face ambiguity evaluation method in an exemplary embodiment of the present disclosure;
fig. 9 schematically illustrates a composition diagram of an evaluation apparatus for face blur degree in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 2 is a schematic diagram illustrating a system architecture of an exemplary application environment to which the face ambiguity assessment method and apparatus according to the embodiment of the present disclosure may be applied.
As shown in fig. 2, the system architecture 200 may include one or more of terminal devices 201, 202, 203, a network 204, and a server 205. The network 204 serves as a medium for providing communication links between the terminal devices 201, 202, 203 and the server 205. Network 204 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The terminal devices 201, 202, 203 may be various electronic devices having an image processing function, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 2 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 205 may be a server cluster composed of a plurality of servers.
The evaluation method of the face ambiguity provided by the embodiment of the present disclosure is generally executed by the terminal devices 201, 202, and 203, and accordingly, the evaluation device of the face ambiguity is generally disposed in the terminal devices 201, 202, and 203. However, it is easily understood by those skilled in the art that the method for evaluating the face ambiguity provided in the embodiment of the present disclosure may also be executed by the server 205, and accordingly, the apparatus for evaluating the face ambiguity may also be disposed in the server 205, which is not particularly limited in the exemplary embodiment. For example, in an exemplary embodiment, a user may acquire video data for preview in real time through a camera module for acquiring an image included in the terminal devices 201, 202, and 203, determine a video frame corresponding to a moment when the user presses a photographing key as a target video frame, acquire a reference video frame corresponding to the target video frame from the video data, send the target video frame and the reference video frame corresponding to the target video frame to the server 205 through the network 204, calculate a face pose feature, a face similarity feature, and a face gradient feature corresponding to each face through the server 205, perform feature fusion to obtain a fusion feature corresponding to each face, calculate a face ambiguity corresponding to each face according to the fusion feature, and return the face ambiguity to the terminal devices 201, 202, and 203.
The exemplary embodiment of the present disclosure provides an electronic device for implementing an evaluation method of face ambiguity, which may be the terminal device 201, 202, 203 or the server 205 in fig. 2. The electronic device comprises at least a processor and a memory for storing executable instructions of the processor, the processor being configured to execute the method of evaluating face ambiguity via executing the executable instructions.
The following takes the mobile terminal 300 in fig. 3 as an example, and exemplifies the configuration of the electronic device. It will be appreciated by those skilled in the art that the configuration of figure 3 can also be applied to fixed type devices, in addition to components specifically intended for mobile purposes. In other embodiments, the mobile terminal 300 may include more or fewer components than illustrated, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware. The interfacing relationship between the components is only schematically illustrated and does not constitute a structural limitation of the mobile terminal 300. In other embodiments, the mobile terminal 300 may also interface differently than shown in fig. 3, or a combination of multiple interfaces.
As shown in fig. 3, the mobile terminal 300 may specifically include: the mobile terminal includes a processor 310, an internal memory 321, an external memory interface 322, a Universal Serial Bus (USB) interface 330, a charging management module 340, a power management module 341, a battery 342, an antenna 1, an antenna 2, a mobile communication module 350, a wireless communication module 360, an audio module 370, a speaker 371, a receiver 372, a microphone 373, an earphone interface 374, a sensor module 380, a display screen 390, a camera module 391, an indicator 392, a motor 393, a button 394, a Subscriber Identification Module (SIM) card interface 395, and the like. Wherein the sensor module 380 may include a depth sensor 3801, a pressure sensor 3802, a gyroscope sensor 3803, and the like.
Processor 310 may include one or more processing units, such as: the Processor 310 may include an Application Processor (AP), a modem Processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband Processor, and/or a Neural-Network Processing Unit (NPU), and the like. The different processing units may be separate devices or may be integrated into one or more processors. In an exemplary embodiment, the steps of acquiring a target video frame included in the video data and acquiring a reference video frame corresponding to the target video frame in the video data may be implemented by a processor.
The NPU is a Neural-Network (NN) computing processor, which processes input information quickly by using a biological Neural Network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. The NPU can implement applications such as intelligent recognition of the mobile terminal 300, for example: image recognition, face recognition, speech recognition, text understanding, and the like. In an exemplary embodiment, the process of face detection, face feature extraction, and face ambiguity determination according to the features may be performed by the NPU.
The mobile terminal 300 implements a display function through the GPU, the display screen 390, the application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display screen 390 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 310 may include one or more GPUs that execute program instructions to generate or alter display information.
The mobile terminal 300 may implement a photographing function through an ISP, a camera module 391, a video codec, a GPU, a display screen 390, an application processor, and the like. The ISP is used for processing data fed back by the camera module 391; the camera module 391 is used for capturing still images or videos; the digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals; the video codec is used to compress or decompress digital video, and the mobile terminal 300 may also support one or more video codecs. In an exemplary embodiment, the process of acquiring video data may be implemented by an ISP, a camera module 391, a video codec, a GPU, a display screen 390, an application processor, and the like.
In the related art, blur judgment is usually based on a single conventional feature and mainly falls into the following categories (a minimal code sketch of the first, gradient-based category follows this list):
First, judgment by image-gradient features, such as the Tenengrad gradient function, the Laplacian gradient function, the SMD (grayscale variance) function, the SMD2 (grayscale variance product) function, the Brenner gradient function, and so on.
Second, judgment by a secondary blur (re-blur) algorithm. If an image is already blurred, blurring it once more changes its high-frequency components only slightly; if the original image is sharp, one blurring pass changes them greatly. For example, as shown in fig. 4, in step S410 a degraded version of the image to be evaluated may be obtained by applying one pass of Gaussian blur; in step S420 the changes of adjacent pixel values between the original image and the degraded image are compared and a sharpness value is determined from the magnitude of the change, where a smaller result indicates a sharper image and, conversely, a larger result indicates a blurrier image. This approach may be referred to as a secondary-blur-based judgment.
Third, judgment by gradient-structure similarity. For example, the image I to be evaluated may be low-pass filtered to obtain a reference image Ir; gradient information of I and Ir is extracted to obtain gradient images G and Gr; the N blocks with the richest gradient information are found in G, together with the corresponding first N blocks of Gr; and the no-reference structural sharpness of I is finally computed from the N blocks found.
Fourth, judgment by machine-learning methods. For example, a Laplacian filter may be used to compute the variance and the maximum of the filtered pixel values over the edges of the input image: high variance (and a high maximum) indicates that the edges are sharp and visible, i.e. the image is sharp, while low variance indicates a blurred image.
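As a concrete illustration of the gradient-based measures in the first category, the following is a minimal sketch assuming OpenCV and NumPy; the input path and kernel sizes are illustrative assumptions and the snippet is not part of the claimed method.

```python
# Minimal sketch of two gradient-based sharpness measures named above:
# variance of the Laplacian and the Brenner gradient. Larger values generally
# indicate a sharper image; thresholds for "blurred" are application-dependent.
import cv2
import numpy as np

def laplacian_variance(gray):
    # Variance of the Laplacian response over the whole image.
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def brenner(gray):
    # Mean squared difference between pixels two rows apart.
    g = gray.astype(np.float64)
    diff = g[2:, :] - g[:-2, :]
    return float(np.mean(diff ** 2))

if __name__ == "__main__":
    img = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
    print(laplacian_variance(img), brenner(img))
```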
However, the above ways of detecting the degree of image blur mainly have the following disadvantage: because blur detection is applied in a wide variety of scenarios, it is difficult to define the boundary between a blurred and a non-blurred image. Specifically, different thresholds may need to be set manually as the blur boundary for different application scenarios. In this case, distinguishing blurred from non-blurred images requires manual thresholding, and grading the degree of blur within the blurred range is also subject to strong human subjectivity. Correspondingly, selecting a deblurring algorithm becomes very difficult.
Further, there are some methods that perform blur judgment based on multiple frames. For example, in the patent application with publication number CN1992813A, a comparison image with the maximum sharpness is first found during the focusing stage of the electronic photographing device and compared with the photographed image to determine the difference in sharpness between the two; if the difference is too large, the photographed image is blurred, whereas if the difference is very small, the photographed image is sharp. For another example, in the patent application with publication number CN106296688A, a set of corresponding points between two images in an image set is detected through feature-point detection to obtain corresponding regions; a Laplace convolution is applied to each corresponding region, its variance is computed, and the mean of the variances of all corresponding regions in the two images is taken as the characterization quantity of the image set. The relative blur between every two images is then expressed as a ratio, giving an ordering of the blur degrees of the images in the set.
However, of these two multi-frame approaches, one requires the comparison image to be determined manually, while the other can only rank the relative blur of the images within one image set and cannot produce an accurate evaluation result.
Based on one or more of the problems described above, the present exemplary embodiment provides a method for evaluating face blurriness. The method may be applied to the server 205, or to one or more of the terminal devices 201, 202 and 203, which is not particularly limited in this exemplary embodiment. Referring to fig. 5, the method for evaluating face blurriness may include the following steps S510 to S530:
in step S510, a target video frame including a human face is extracted from the video data, and a reference video frame corresponding to the target video frame is obtained from the video data.
Wherein, the target video frame can comprise at least one human face.
In an exemplary embodiment, when a target video frame containing a human face is extracted from video data, a video frame to be detected may first be extracted from the video data and face detection performed on it. If the video frame to be detected is found to contain at least one face, it can be determined to be the target video frame.
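A minimal sketch of this detection step is shown below, assuming an OpenCV Haar-cascade detector; the disclosure does not prescribe a particular face detector, so the detector choice and function name are illustrative only.

```python
# Minimal sketch: a frame to be detected becomes the target frame if at least
# one face is detected in it.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def is_target_frame(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces) > 0, faces  # (target-frame flag, detected face boxes)
```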
In an exemplary embodiment, after a user opens the camera of the terminal device, video data can be collected by the camera in real time to display a preview picture on the screen of the terminal device, and the video frame corresponding to the moment when the user takes a photograph (presses a photographing key such as the shutter) or performs a specific operation is determined to be the video frame to be detected. For example, assuming the video data corresponding to the preview picture during shooting is as shown in fig. 6, if the user presses the shutter at the 6th frame, the 6th frame may be determined to be the video frame to be detected.
It should be noted that, in an exemplary embodiment, in addition to determining the video frame to be detected by the above-mentioned shooting or performing a specific operation, the video frame to be detected may also be determined by other manners, which is not limited in this disclosure. For example, any one frame of the collected video data can be designated as a video frame to be detected, or a video frame meeting the requirement can be screened from the video data as the video frame to be detected according to the requirement of a user.
In an exemplary embodiment, when a reference video frame corresponding to a target video frame is obtained from video data, the reference video frame may be extracted from the video data based on a preset time condition and the time point of the target video frame in the video data. Specifically, the preset time condition may include a time range corresponding to the reference video frame. For example, referring to fig. 6, when the time point of the target video frame in the video data is the 6th frame, the preset time condition may specify that the 4 frames before the target video frame, that is, the 2nd to 5th frames, are the reference video frames corresponding to the target video frame; for another example, the preset time condition may specify that the 3 frames before and after the target video frame, that is, the 3rd to 5th frames and the 7th to 9th frames, are the reference video frames; for another example, the preset time condition may specify that the 5 frames after the target video frame, that is, the 7th to 11th frames, are the reference video frames.
It should be noted that the preset time condition may be set according to specific requirements. Specifically, with the target video frame as the reference point, reference video frames can be taken forward, backward, or both forward and backward in the video data, and the number of extracted reference video frames can also be set as required. In addition, since preceding video frames can generally be used to predict subsequent ones, the video frames before the target video frame are usually used as the reference video frames when they are acquired, so that the face blurriness of the faces contained in the target video frame can be evaluated more accurately.
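The following sketch illustrates one way such a preset time condition might be applied to a list of decoded frames; the function name, parameters and the default of four preceding frames are illustrative assumptions.

```python
# Minimal sketch of reference-frame selection under a preset time condition
# ("take num_before frames before and num_after frames after the target frame").
def get_reference_frames(frames, target_idx, num_before=4, num_after=0):
    """frames: decoded video frames in order; target_idx: 0-based index of the target frame."""
    before = frames[max(0, target_idx - num_before):target_idx]
    after = frames[target_idx + 1:target_idx + 1 + num_after]
    return before + after

# With the target at index 5 (the 6th frame) and num_before=4, this returns the
# 2nd to 5th frames, matching the example of Fig. 6.
```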
In step S520, face pose features, face similarity features, and face gradient features corresponding to each face included in the target video frame are calculated based on the target video frame and the reference video frame.
The face pose feature may include one or more features characterizing the face pose across the multiple frames; for example, it may include the amount of change of the face pose in each frame relative to the previous frame. The face similarity feature may include a feature characterizing the face similarity across the multiple frames; for example, it may include the average similarity of the face region in each frame relative to the previous frame. The face gradient feature may include a feature characterizing the face gradient across the multiple frames; for example, it may include the average gradient difference of the face region in each frame relative to the previous frame.
In an exemplary embodiment, after the target video frame and the reference video frame are determined, face pose features corresponding to respective faces in the target video frame may be calculated based on the target video frame and the reference video frame. Specifically, the target video frame and the reference video frame are sequenced in the video data according to the sequence of the target video frame and the reference video frame to obtain a video frame sequence, meanwhile, for each face, the head gestures corresponding to the face in the target video frame and the reference video frame can be respectively calculated, then, the gesture variable quantity of each video frame relative to the previous video frame is respectively calculated according to the sequence in the video frame sequence, and the face gesture feature corresponding to each face is determined based on the gesture variable quantity.
Wherein the head pose may comprise a combination of one or more of the following feature data for characterizing the head pose in the respective video frame: pitch angle (pitch), yaw angle (yaw), roll angle (roll), two-dimensional coordinates of face keypoints in the target video frame and the reference video frame, and the like.
It should be noted that when the number of reference video frames is greater than 1, that is, when multiple video frames are extracted as reference video frames, the pose variations of the multiple frames may be reduced in dimension directly to obtain the face pose feature when determining it from the pose variations. For example, a dimensionality-reduction algorithm such as PCA (Principal Component Analysis) can be used to reduce the obtained pose variations to the face pose feature.
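A minimal sketch of this computation is given below, assuming NumPy and scikit-learn, with the per-frame head poses already available as vectors; the function name and the choice of one principal component are illustrative.

```python
# Minimal sketch of the face pose feature: stack per-frame head poses in temporal
# order, take consecutive differences as the pose variations, and reduce the
# stacked variations with PCA.
import numpy as np
from sklearn.decomposition import PCA

def face_pose_feature(head_poses):
    """head_poses: array of shape (num_frames, pose_dim), ordered as in the video."""
    poses = np.asarray(head_poses, dtype=np.float64)
    motion = np.diff(poses, axis=0)                       # (num_frames - 1, pose_dim)
    reduced = PCA(n_components=1).fit_transform(motion)   # (num_frames - 1, 1)
    return reduced.ravel()                                # one-dimensional feature vector
```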
In an exemplary embodiment, after the target video frame and the reference video frame are determined, the face similarity feature corresponding to each face may be calculated based on the target video frame and the reference video frame. Specifically, for each face, the target video frame and the reference video frame may be sequenced according to the sequence of the target video frame and the reference video frame in the video data to obtain a video frame sequence, and then the similarity value of the image region corresponding to the face in each video frame with respect to the image region corresponding to the face in the previous video frame is calculated according to the sequence in the video frame sequence, and the face similarity feature is determined based on the similarity value.
In an exemplary embodiment, after obtaining the sequence of video frames, a histogram comparison method may be directly adopted, in which histogram data of an image region corresponding to a certain face in a video frame and an image region corresponding to the face in a previous video frame are obtained, and then the histogram data are compared to obtain a similarity value corresponding to the face. In addition, before the histogram data comparison, the video frame and the previous video frame may be normalized for the convenience of comparison. When calculating the similarity value, other image similarity calculation methods may be used to calculate the similarity value.
In an exemplary embodiment, when determining the face similarity feature based on the similarity values, for each face, an average value may be calculated for all the similarity values calculated from the target video frame and the reference video frame, and the calculated average value may be determined as the face similarity feature corresponding to the face. When determining the face similarity characteristic, the face similarity characteristic may be determined in other ways besides taking an average value.
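A minimal sketch of the histogram-based similarity feature is given below, assuming OpenCV and that the face region has already been cropped from each frame; the histogram bins and the correlation metric are illustrative choices, since the disclosure leaves the exact comparison method open.

```python
# Minimal sketch: histogram similarity between the face region in each frame and
# the face region in the previous frame, averaged over the sequence.
import cv2
import numpy as np

def face_similarity_feature(face_crops):
    """face_crops: BGR face-region images for one face, in temporal order."""
    def hist(img):
        h = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
        return cv2.normalize(h, h).flatten()
    sims = [cv2.compareHist(hist(prev), hist(curr), cv2.HISTCMP_CORREL)
            for prev, curr in zip(face_crops[:-1], face_crops[1:])]
    return float(np.mean(sims))   # mean similarity is taken as the feature
```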
In an exemplary embodiment, after the target video frame and the reference video frame are determined, a face gradient feature corresponding to each face may be calculated based on the target video frame and the reference video frame. Specifically, for each face, the target video frame and the reference video frame may be ordered according to their sequence in the video data to obtain a video frame sequence; the gradient difference between the image region corresponding to the face in each video frame and the image region corresponding to the face in the previous video frame is then calculated following the order of the video frame sequence, and the face gradient feature is determined based on the gradient differences.
When calculating the gradient, the method may select a Tenengrad gradient function, a Laplacian gradient function, an SMD (grayscale variance) function, an SMD2 (grayscale variance product) function, a Brenner gradient function, and other gradient functions, which is not particularly limited in this disclosure.
In an exemplary embodiment, when determining the face gradient feature based on the gradient difference, for each face, an average value may be calculated for all gradient differences calculated from the target video frame and the reference video frame, and the calculated average value may be determined as the face gradient feature corresponding to the face. It should be noted that, when determining the face gradient feature, the face gradient feature may be determined in other ways besides taking the average value.
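A minimal sketch of the gradient feature is given below; treating the gradient difference as the difference of scalar gradient values of the two face regions, here measured with a Tenengrad-style function, is one plausible reading, and all names are illustrative.

```python
# Minimal sketch: per-frame scalar gradient of the face region (Tenengrad-style),
# consecutive absolute differences, and their mean as the face gradient feature.
import cv2
import numpy as np

def tenengrad(gray):
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    return float(np.mean(gx ** 2 + gy ** 2))

def face_gradient_feature(face_crops_gray):
    """face_crops_gray: grayscale face-region images for one face, in temporal order."""
    grads = [tenengrad(f) for f in face_crops_gray]
    diffs = [abs(curr - prev) for prev, curr in zip(grads[:-1], grads[1:])]
    return float(np.mean(diffs))
```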
In step S530, the face pose features, the face similarity features, and the face gradient features corresponding to the respective faces are respectively fused to obtain fusion features, and the face ambiguity corresponding to the respective faces is determined based on the fusion features corresponding to the respective faces.
In an exemplary embodiment, after the face pose feature, the face similarity feature and the face gradient feature are obtained, feature fusion may be performed on the three features, and the three features may be spliced together to obtain a fusion feature. It should be noted that, when performing splicing, the three features may be spliced according to any one order, and this disclosure is not limited to this.
In an exemplary embodiment, when the face ambiguity corresponding to each face is determined based on the fusion features corresponding to the faces, regression prediction may be performed on the fusion features corresponding to each face according to a preset regression algorithm to obtain the face ambiguity corresponding to each face. The predetermined regression algorithm may include various types of regression algorithms, such as support vector machine regression, linear regression, logistic regression, etc., which is not limited by the disclosure.
The predetermined regression algorithm may include a predetermined regression model. Correspondingly, before the fusion features are subjected to regression prediction through a preset regression algorithm, a sample video frame and a reference video frame corresponding to the sample video frame can be obtained, the face ambiguity corresponding to the sample video frame is marked, and then the regression model is trained based on the sample video frame, the reference video frame corresponding to the sample video frame and the face ambiguity corresponding to the sample video frame. The regression model may include a support vector machine, a fully connected neural network, and other regression models, which are not particularly limited in this disclosure. The sample video frame and the reference video frame corresponding to the sample video frame are training samples which are collected in advance and used for model training, the face ambiguity corresponding to the sample video frame is a mark of the face ambiguity which is carried out on the sample video frame in advance, and supervised learning of the regression model can be achieved through the training samples and the mark.
Specifically, referring to fig. 7, in step S710, the calculation process of the face pose feature, the face similarity feature, and the face gradient feature is performed based on the sample video frame and the reference video frame corresponding to the sample video frame, so as to obtain the face pose feature, the face similarity feature, and the face gradient feature; s720, fusing the obtained human face posture characteristic, the human face similarity characteristic and the human face gradient characteristic to obtain a fusion characteristic; step S730, inputting the obtained fusion features into a regression model, updating weight parameters in the regression model through back propagation by using face blurriness and loss functions corresponding to the labeled sample video frames, and further obtaining a trained regression model, i.e., a preset regression model.
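A minimal sketch of the fusion and regression steps is given below, assuming scikit-learn's SVR as the preset regression algorithm; the feature inputs correspond to the three per-face features described above, and the labelled blur values, names and kernel choice are illustrative.

```python
# Minimal sketch: concatenate the three per-face features into a fusion feature,
# train a support-vector regressor on labelled samples, and predict blurriness.
import numpy as np
from sklearn.svm import SVR

def fuse(pose_feat, sim_feat, grad_feat):
    return np.concatenate([np.atleast_1d(pose_feat),
                           np.atleast_1d(sim_feat),
                           np.atleast_1d(grad_feat)])

def train_blur_regressor(X_train, y_train):
    # X_train: (num_samples, feature_dim) fusion features; y_train: labelled blur degrees.
    model = SVR(kernel="rbf")
    model.fit(X_train, y_train)
    return model

def predict_blur(model, pose_feat, sim_feat, grad_feat):
    fused = fuse(pose_feat, sim_feat, grad_feat).reshape(1, -1)
    return float(model.predict(fused)[0])   # predicted face blurriness
```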
In the following, taking the target video frame including only 1 face and taking the forward 4 frames as the reference video frame as an example, the technical solution of the embodiment of the present disclosure is described in detail with reference to fig. 8:
in step S801, after the 6 th frame is determined to be a video frame to be detected in the video data, face detection is performed on the 6 th frame.
In step S803, it is determined whether or not a human face exists in the 6 th frame.
In step S805, when at least one human face is included in the 6 th frame, the 6 th frame may be determined as a target video frame.
In step S807, after the target video frame is determined, 4 frames may be taken forward as reference video frames according to a preset time condition. I.e. taking frames 2-5 as reference video frames.
Assume that the user's head pose includes simultaneously pitch angle (pitch), yaw angle (yaw), roll angle (roll), and the two-dimensional coordinates of the face keypoints in the target video frame and the reference video frame.
In step S809, the target video frame and the reference video frame are sequenced to obtain a sequence of video frames.
Specifically, it is assumed that the target video frame is the 6 th frame in the video data, and the reference video frame is obtained by taking 4 frames ahead in the video data, i.e., the 2 nd to 5 th frames in the video data. At this time, the resulting video frame sequence is: frame 2, frame 3, frame 4, frame 5, frame 6.
In step S811, the head pose is calculated.
For each target or reference video frame, the 68 two-dimensional face keypoints contained in the frame are extracted, a rotation vector is calculated from these keypoints, the rotation matrix corresponding to the rotation vector is calculated, and the Euler angles, namely the pitch angle, the yaw angle and the roll angle, are then obtained; based on this process, the head pose of each face is obtained as shown in Equation 1 (a code sketch of this computation follows Equation 2):
Face = {Pitch, Yaw, Roll, Keyp}    (Equation 1)
where Pitch, Yaw and Roll respectively denote the pitch, yaw and roll angles of the head corresponding to the face, and Keyp denotes the two-dimensional coordinates of the 68 face keypoints.
After the head poses corresponding to all the target video frames and the reference video frames are obtained, the head poses of all the video frames can be expressed as the following formula 2 according to the sequence of the video frame sequences:
Face = {Face_2, Face_3, Face_4, Face_5, Face_6}    (Equation 2)
where Face_2, Face_3, Face_4, Face_5 and Face_6 respectively denote the head poses in the 2nd to 6th video frames.
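A minimal sketch of the head-pose computation in step S811 is given below, recovering the Euler angles from the 68 two-dimensional keypoints with a generic 3-D face model and OpenCV's solvePnP; the 3-D model points, camera matrix, keypoint detector and axis convention are assumptions not specified in the disclosure.

```python
# Minimal sketch: 2-D keypoints + generic 3-D face model -> rotation vector ->
# rotation matrix -> Euler angles (pitch, yaw, roll), in degrees.
import cv2
import numpy as np

def head_pose(keypoints_2d, model_points_3d, camera_matrix):
    """keypoints_2d: (68, 2) image coordinates; model_points_3d: (68, 3) generic model."""
    ok, rvec, tvec = cv2.solvePnP(model_points_3d.astype(np.float64),
                                  keypoints_2d.astype(np.float64),
                                  camera_matrix, None)   # no lens distortion assumed
    rot_mat, _ = cv2.Rodrigues(rvec)                     # rotation vector -> matrix
    sy = np.sqrt(rot_mat[0, 0] ** 2 + rot_mat[1, 0] ** 2)
    pitch = np.degrees(np.arctan2(rot_mat[2, 1], rot_mat[2, 2]))
    yaw = np.degrees(np.arctan2(-rot_mat[2, 0], sy))
    roll = np.degrees(np.arctan2(rot_mat[1, 0], rot_mat[0, 0]))
    return pitch, yaw, roll
```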
In step S813, the amount of change in the pose between every two frames is calculated in the order of the sequence of video frames.
Specifically, the Face set in formula 2 may be calculated based on the following formula 3:
[Equation 3: F_motion_i, computed from the head poses Face_i and Face_(i−1); the equation is rendered as an image in the publication]
where F_motion_i denotes the pose variation between the head pose corresponding to the face in the i-th video frame of the video frame sequence and the head pose in the previous video frame, t denotes time, and Mean() denotes taking the average.
In step S815, PCA dimension reduction is performed on the orientation variation.
It should be noted that, because there are many face keypoint coordinates, their average value may be computed during the calculation to simplify the process. When 5 video frames participate in the calculation, the above formula yields 4 one-dimensional vectors, and correspondingly a 4 × 4 matrix F_motion is obtained. To further simplify the calculation, this 4 × 4 matrix may be reduced by PCA to a one-dimensional vector, which may be expressed as the following Equation 4:
Fm_simi = PCA(F_motion)    (Equation 4)
where Fm_simi denotes the face pose feature after dimensionality reduction, F_motion denotes the face pose feature before dimensionality reduction, and PCA denotes the PCA dimensionality-reduction algorithm.
In step S817, based on the sequence of video frames determined in step S809, similarity calculation is performed, and a face similarity feature is determined.
Specifically, the calculation can be performed by the following equations 5 and 6:
hist_simi_i = Histogram(face_i, face_(i−1))    (Equation 5)
H_simi = Mean(hist_simi)    (Equation 6)
where hist_simi_i denotes the similarity value of the image region corresponding to the face in the i-th video frame of the video frame sequence relative to the image region corresponding to the face in the previous video frame; Histogram(face_i, face_(i−1)) denotes the histogram comparison of the image region corresponding to the face in the i-th video frame with that in the previous video frame (frame i−1); Mean() denotes taking the average; and H_simi denotes the face similarity feature.
In step S819, gradient calculation is performed based on the video frame sequence determined in step S809, and a face gradient feature is determined.
Specifically, the calculation can be performed by the following equations 7 and 8:
grad_simi_i = Grad(face_i, face_(i−1))    (Equation 7)
G_simi = Mean(grad_simi)    (Equation 8)
where grad_simi_i denotes the gradient difference of the image region corresponding to the face in the i-th video frame of the video frame sequence relative to the image region corresponding to the face in the previous video frame; Grad(face_i, face_(i−1)) denotes computing the difference between the gradient of the image region corresponding to the face in the i-th video frame and that in the previous video frame (frame i−1); Mean() denotes taking the average; and G_simi denotes the face gradient feature.
In step S821, feature fusion is performed on the face pose feature, the face similarity feature, and the face gradient feature.
After the face pose features, the face similarity features and the face gradient features are obtained, feature splicing can be performed based on the following formula 9 to obtain fusion features corresponding to the face:
Feature = {Fm_simi, H_simi, G_simi}    (Equation 9)
where Fm_simi denotes the face pose feature after dimensionality reduction, H_simi denotes the face similarity feature, and G_simi denotes the face gradient feature.
In step S823, regression prediction is performed based on the fusion features of the face to obtain the face ambiguity corresponding to the face.
After the fusion features are obtained, the fusion features are input into a preset support vector machine model which is trained in advance to carry out regression prediction, and a regression prediction value, namely the face ambiguity, is obtained.
It should be noted that the preset support vector machine model may be obtained by training a support vector machine with the sample video frames, the reference video frames corresponding to the sample video frames, and the face blurriness labels corresponding to the sample video frames. During training, the face pose feature, the face similarity feature and the face gradient feature are calculated from a sample video frame and its reference video frames; the three features are then concatenated, and the parameters of the support vector machine are updated according to the fusion feature and the labelled face blurriness of the sample video frame, yielding the preset support vector machine model.
In summary, in the present exemplary embodiment, both motion blur and image-quality blur of the face are computed from multiple frames and their semantic information, which improves the accuracy of the face-blurriness evaluation result. Traditional schemes can only detect either motion blur or blur caused by poor imaging quality in isolation, whereas a face typically exhibits blur caused by rigid motion and image-quality blur at the same time; the face blurriness is therefore evaluated by fusing the three features.
In addition, schemes based on traditional features alone, on deep learning alone, or on fusing traditional features with deep learning struggle to reach sufficient accuracy and can only distinguish sharp images from severely blurred ones; deep-learning schemes are also difficult to deploy on mobile terminals. Performing machine-learning regression on the fused features improves the accuracy of face-blurriness evaluation, is easy to deploy on a mobile terminal, provides more accurate blur semantics for portrait-photographing scenes, and helps subsequent image-quality algorithms achieve their best effect.
It is noted that the above-mentioned figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Further, referring to fig. 9, the present example embodiment also provides an apparatus 900 for evaluating face blurriness, which includes a video frame acquisition module 910, a feature calculation module 920, and an ambiguity evaluation module 930. Specifically:
the video frame acquiring module 910 may be configured to extract a target video frame containing a human face from video data, and acquire a reference video frame corresponding to the target video frame from the video data.
The feature calculation module 920 may be configured to calculate, based on the target video frame and the reference video frame, a face pose feature, a face similarity feature, and a face gradient feature corresponding to each face included in the target video frame.
The ambiguity evaluation module 930 may be configured to fuse the face pose features, the face similarity features, and the face gradient features corresponding to each face to obtain fusion features, and determine the face ambiguity corresponding to each face based on the fusion features corresponding to each face.
In an exemplary embodiment, the feature calculation module 920 may be configured to calculate a head pose in the target video frame and the reference video frame, respectively, for each face; sequencing the target video frames and the reference video frames based on the sequence of the target video frames and the reference video frames in the video data to obtain a video frame sequence; in a video frame sequence, respectively calculating the posture variation of each video frame relative to the previous video frame based on the head posture corresponding to each video frame aiming at each human face; and determining the face pose characteristics corresponding to each face based on the pose variation.
In an exemplary embodiment, the feature calculation module 920 may be configured to perform a dimension reduction process on the pose variation to obtain the face pose feature.
In an exemplary embodiment, the feature calculation module 920 may be configured to sort the target video frames and the reference video frames based on their order in the video data to obtain a sequence of video frames; in a video frame sequence, aiming at each face, respectively calculating the similarity value of an image area corresponding to the face in each video frame relative to an image area corresponding to the face in the previous video frame; and determining face similarity characteristics corresponding to the faces based on the similarity values.
In an exemplary embodiment, the feature calculation module 920 may be configured to sort the target video frames and the reference video frames based on their order in the video data to obtain a sequence of video frames; in a video frame sequence, aiming at each face, respectively calculating the gradient difference of an image area corresponding to the face in each video frame relative to an image area corresponding to the face in the previous video frame; and determining the gradient characteristic of the human face based on the gradient difference.
In an exemplary embodiment, the ambiguity evaluation module 930 may be configured to perform regression prediction on the fusion features corresponding to the respective faces through a preset regression algorithm, so as to obtain the face ambiguities corresponding to the respective faces.
In an exemplary embodiment, the ambiguity evaluation module 930 may be configured to train the regression model based on the sample video frame, the reference video frame corresponding to the sample video frame, and the face ambiguity corresponding to the sample frame, so as to obtain a preset regression model.
In an exemplary embodiment, the video frame acquiring module 910 may be configured to extract a reference video frame corresponding to a target video frame from video data based on a preset time condition and a time point of the target video frame in the video data.
In an exemplary embodiment, the video frame acquiring module 910 may be configured to extract a video frame to be detected from video data, and perform face detection on the video frame to be detected; and when the video frame to be detected comprises at least one human face, determining the video frame to be detected as a target video frame.
The specific details of each module in the above apparatus have been described in detail in the method section, and details that are not disclosed may refer to the method section, and thus are not described again.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product including program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device, for example, any one or more of the steps in fig. 5, 7 and 8 may be performed.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Furthermore, program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++, or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow, in general, the principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (12)

1. A method for evaluating the blurriness of a human face is characterized by comprising the following steps:
extracting a target video frame containing a human face from video data, and acquiring a reference video frame corresponding to the target video frame from the video data;
calculating face pose features, face similarity features and face gradient features corresponding to each face contained in the target video frame based on the target video frame and the reference video frame;
and fusing the face pose feature, the face similarity feature and the face gradient feature corresponding to each face to obtain a fused feature, and determining the face blurriness corresponding to each face based on the fused feature corresponding to that face.
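By way of illustration only, the following sketch shows the fusion and prediction step of claim 1, assuming the three per-face features have already been computed as arrays and that a fitted regressor with a scikit-learn-style predict interface is available; the plain concatenation used as the fusion operation is an assumption, not a limitation of the claim.

```python
import numpy as np

def evaluate_face_blurriness(pose_feat, sim_feat, grad_feat, regressor):
    """Fuse the three per-face features and predict a blurriness score.

    Concatenation as the fusion operation and the regressor interface are
    illustrative assumptions; the claim does not fix either choice.
    """
    fused = np.concatenate([np.ravel(pose_feat),
                            np.ravel(sim_feat),
                            np.ravel(grad_feat)])
    return float(regressor.predict(fused.reshape(1, -1))[0])
```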
2. The method of claim 1, wherein computing the face pose features corresponding to the face based on the target video frame and the reference video frame comprises:
sorting the target video frame and the reference video frame according to their order in the video data to obtain a video frame sequence;
calculating, for each human face, the head pose in the target video frame and in the reference video frame respectively;
in the video frame sequence, calculating, for each human face, the pose change amount of each video frame relative to the previous video frame based on the head pose corresponding to each video frame;
and determining the face pose feature corresponding to each face based on the pose change amounts.
3. The method of claim 2, wherein, when the number of reference video frames is greater than 1, determining the face pose feature based on the pose change amounts comprises:
performing dimension reduction on the pose change amounts to obtain the face pose feature.
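As an illustrative sketch of claims 2 and 3, the snippet below derives a face pose feature from per-frame head poses, assuming the head pose is expressed as (yaw, pitch, roll) Euler angles and that the optional dimension reduction uses a PCA transformer fitted offline; both assumptions are for illustration only.

```python
import numpy as np

def face_pose_feature(head_poses, fitted_pca=None):
    """head_poses: per-frame (yaw, pitch, roll) for one face, ordered by the
    video frame sequence.  fitted_pca: optionally, a dimension-reduction
    transformer (e.g. a sklearn.decomposition.PCA instance) fitted offline.
    Returns the face pose feature for that face."""
    poses = np.asarray(head_poses, dtype=float)   # shape (n_frames, 3)
    deltas = np.abs(np.diff(poses, axis=0))       # pose change vs. previous frame
    flat = deltas.reshape(1, -1)                  # one vector per face
    if fitted_pca is not None:                    # claim 3: reduce dimensionality
        return fitted_pca.transform(flat)[0]
    return flat[0]
```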
4. The method of claim 1, wherein computing the face similarity feature based on the target video frame and the reference video frame comprises:
sorting the target video frame and the reference video frame according to their order in the video data to obtain a video frame sequence;
in the video frame sequence, calculating, for each face, the similarity value of the image region corresponding to the face in each video frame relative to the image region corresponding to the face in the previous video frame;
and determining the face similarity feature corresponding to each face based on the similarity values.
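A hedged sketch of claim 4's similarity computation, assuming the face image regions have already been cropped per frame; SSIM is used here as one possible similarity value and the crop-resizing step is an assumption, neither being mandated by the claim.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def face_similarity_feature(face_crops):
    """face_crops: BGR face image regions for one face, ordered by the video
    frame sequence.  Returns the SSIM of each crop vs. the previous one."""
    sims = []
    for prev, curr in zip(face_crops[:-1], face_crops[1:]):
        h, w = prev.shape[:2]
        curr = cv2.resize(curr, (w, h))          # align crop sizes before comparing
        g0 = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        g1 = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)
        sims.append(structural_similarity(g0, g1, data_range=255))
    return np.asarray(sims)
```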
5. The method of claim 1, wherein computing the face gradient feature based on the target video frame and the reference video frame comprises:
sorting the target video frame and the reference video frame according to their order in the video data to obtain a video frame sequence;
in the video frame sequence, calculating, for each face, the gradient difference of the image region corresponding to the face in each video frame relative to the image region corresponding to the face in the previous video frame;
and determining the face gradient feature corresponding to each face based on the gradient differences.
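An illustrative sketch of claim 5's gradient difference, assuming Sobel gradients and a mean-magnitude statistic per face crop; the claim itself does not fix the gradient operator.

```python
import cv2
import numpy as np

def face_gradient_feature(face_crops):
    """Difference in mean gradient magnitude between consecutive face crops
    for one face; sharper crops tend to have larger gradient magnitudes."""
    def mean_grad(img):
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
        gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
        return float(np.mean(np.hypot(gx, gy)))

    grads = np.asarray([mean_grad(c) for c in face_crops])
    return np.diff(grads)                        # gradient difference vs. previous frame
```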
6. The method of claim 1, wherein determining the face blurriness for each face based on the fused features for each face comprises:
performing regression prediction on the fused feature corresponding to each face through a preset regression algorithm to obtain the face blurriness corresponding to each face.
7. The method of claim 6, wherein the preset regression algorithm comprises a preset regression model, and wherein, before performing the regression prediction on the fused feature through the preset regression algorithm, the method further comprises:
training a regression model based on a sample video frame, a reference video frame corresponding to the sample video frame, and the face blurriness corresponding to the sample video frame, to obtain the preset regression model.
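For claims 6 and 7, a minimal sketch of training and applying a preset regression model, assuming annotated face blurriness values are available for the sample video frames; the random-forest regressor is one possible choice of preset regression algorithm, not the one fixed by the disclosure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_blurriness_regressor(fused_features, blurriness_labels):
    """Offline training step (claim 7): fused_features is a matrix built from
    sample video frames and their reference frames, blurriness_labels holds
    the annotated face blurriness for each sample."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(np.asarray(fused_features), np.asarray(blurriness_labels))
    return model

def predict_blurriness(model, fused_feature):
    """Inference step (claim 6): score one fused feature vector."""
    return float(model.predict(np.asarray(fused_feature).reshape(1, -1))[0])
```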
8. The method according to claim 1, wherein said obtaining a reference video frame corresponding to the target video frame from the video data comprises:
extracting a reference video frame corresponding to the target video frame from the video data based on a preset time condition and the time point of the target video frame in the video data.
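A small sketch of one way to realize the preset time condition of claim 8: frames whose timestamps fall within a fixed window around the target frame's time point are taken as reference video frames; the ±0.5 s window is an assumed value, not one given by the disclosure.

```python
def select_reference_frames(frame_times, target_time, max_offset_s=0.5):
    """Return indices of frames within max_offset_s seconds of the target
    video frame's time point, excluding the target frame itself."""
    return [i for i, t in enumerate(frame_times)
            if t != target_time and abs(t - target_time) <= max_offset_s]
```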
9. The method of claim 1, wherein extracting the target video frame containing the human face from the video data comprises:
extracting a video frame to be detected from video data, and carrying out face detection on the video frame to be detected;
and when the video frame to be detected comprises at least one human face, determining the video frame to be detected as a target video frame.
10. An apparatus for evaluating face blurriness, comprising:
a video frame acquisition module, configured to extract a target video frame containing a human face from video data and to acquire a reference video frame corresponding to the target video frame from the video data;
a feature calculation module, configured to calculate face pose features, face similarity features and face gradient features corresponding to each face contained in the target video frame based on the target video frame and the reference video frame;
and a blurriness evaluation module, configured to fuse the face pose feature, the face similarity feature and the face gradient feature corresponding to each face to obtain a fused feature, and to determine the face blurriness corresponding to each face based on the fused feature corresponding to each face.
11. A computer-readable medium, on which a computer program is stored which, when executed by a processor, carries out the method according to any one of claims 1 to 9.
12. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1 to 9 via execution of the executable instructions.
CN202110524303.3A 2021-05-13 2021-05-13 Method and device for evaluating face ambiguity, medium and electronic equipment Pending CN113283319A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110524303.3A CN113283319A (en) 2021-05-13 2021-05-13 Method and device for evaluating face ambiguity, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110524303.3A CN113283319A (en) 2021-05-13 2021-05-13 Method and device for evaluating face ambiguity, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113283319A true CN113283319A (en) 2021-08-20

Family

ID=77278900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110524303.3A Pending CN113283319A (en) 2021-05-13 2021-05-13 Method and device for evaluating face ambiguity, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113283319A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084258A (en) * 2018-02-12 2019-08-02 成都视观天下科技有限公司 Face preferred method, equipment and storage medium based on video human face identification
CN109670444A (en) * 2018-12-18 2019-04-23 北京字节跳动网络技术有限公司 Generation, attitude detecting method, device, equipment and the medium of attitude detection model
CN110276277A (en) * 2019-06-03 2019-09-24 罗普特科技集团股份有限公司 Method and apparatus for detecting facial image
CN112183162A (en) * 2019-07-04 2021-01-05 北京航天长峰科技工业集团有限公司 Face automatic registration and recognition system and method in monitoring scene
CN110619628A (en) * 2019-09-09 2019-12-27 博云视觉(北京)科技有限公司 Human face image quality evaluation method
CN111079701A (en) * 2019-12-30 2020-04-28 河南中原大数据研究院有限公司 Face anti-counterfeiting method based on image quality

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170441A (en) * 2022-08-30 2022-10-11 荣耀终端有限公司 Image processing method and electronic equipment
CN115170441B (en) * 2022-08-30 2023-02-07 荣耀终端有限公司 Image processing method and electronic equipment

Similar Documents

Publication Publication Date Title
EP3008696B1 (en) Tracker assisted image capture
CN108269254B (en) Image quality evaluation method and device
US11276177B1 (en) Segmentation for image effects
CN111444744A (en) Living body detection method, living body detection device, and storage medium
CN108337505B (en) Information acquisition method and device
CN108389172B (en) Method and apparatus for generating information
CN110062157B (en) Method and device for rendering image, electronic equipment and computer readable storage medium
CN109784164B (en) Foreground identification method and device, electronic equipment and storage medium
CN112802033B (en) Image processing method and device, computer readable storage medium and electronic equipment
CN114511041B (en) Model training method, image processing method, device, equipment and storage medium
CN112561879B (en) Ambiguity evaluation model training method, image ambiguity evaluation method and image ambiguity evaluation device
CN110858316A (en) Classifying time series image data
CN114390201A (en) Focusing method and device thereof
CN110992395A (en) Image training sample generation method and device and motion tracking method and device
CN111199169A (en) Image processing method and device
CN109981989B (en) Method and device for rendering image, electronic equipment and computer readable storage medium
CN113902636A (en) Image deblurring method and device, computer readable medium and electronic equipment
CN113658065A (en) Image noise reduction method and device, computer readable medium and electronic equipment
CN113283319A (en) Method and device for evaluating face ambiguity, medium and electronic equipment
CN110349108B (en) Method, apparatus, electronic device, and storage medium for processing image
CN110766631A (en) Face image modification method and device, electronic equipment and computer readable medium
CN111507142A (en) Facial expression image processing method and device and electronic equipment
CN113362260A (en) Image optimization method and device, storage medium and electronic equipment
WO2021024860A1 (en) Information processing device, information processing method, and program
CN113920023A (en) Image processing method and device, computer readable medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination