CN111488774A - Image processing method and device for image processing


Info

Publication number
CN111488774A
CN111488774A (application number CN201910090781.0A)
Authority
CN
China
Prior art keywords
face
image
target
key points
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910090781.0A
Other languages
Chinese (zh)
Inventor
谷枫
李斌
徐祯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Sogou Hangzhou Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd and Sogou Hangzhou Intelligent Technology Co Ltd
Priority to CN201910090781.0A
Publication of CN111488774A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G06V40/168 - Feature extraction; Face representation

Abstract

The embodiment of the invention provides an image processing method, an image processing apparatus, and a device for image processing. The method specifically comprises the following steps: determining a face-changing area in a template image according to the face area detected in the template image, wherein the face area is detected by a face detection model, and the face detection model is a neural network model trained on sample images containing faces and the face labeling results corresponding to the sample images; determining target face key points in a target image; and transforming the face-changing area according to the target face key points to obtain a transformed template image, wherein the transformed template image contains the features of the target face key points. The embodiment of the invention can improve the accuracy of face detection, and further improve the accuracy of face changing.

Description

Image processing method and device for image processing
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an image processing method and apparatus, and an apparatus for image processing.
Background
With the development of computer technology, virtual face (also called face change) technology has also been greatly developed. For example, by the "face-changing" technique, the face in the template image can be replaced by the face in the photo; or, the actor's face in the movie may be replaced by a face in a photograph, etc.
At present, a Haar (wavelet feature) based AdaBoost (Adaptive Boosting) cascade face detection classifier (Haar classifier for short) is generally adopted to detect the face in a photo or a template; a frontal face image is cut out according to the detected face position, and the face in the photo can then be placed at the corresponding face position in the template.
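As an illustration of this prior-art pipeline, a minimal sketch using OpenCV's bundled Haar cascade might look as follows; the model file name and parameter values are OpenCV defaults, not taken from this disclosure:

```python
import cv2

img = cv2.imread("template.jpg")  # hypothetical input path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
# detectMultiScale returns (x, y, w, h) boxes for the detected faces.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    face_crop = img[y:y + h, x:x + w]  # cut out the frontal face image
```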
However, while the Haar classifier achieves high detection accuracy for face images in constrained environments (simple background, frontal face, no occlusion), its accuracy drops greatly for face images in unconstrained environments, such as complex backgrounds, varied poses, occlusion by hats or masks, and poor illumination, thereby degrading the face-changing effect.
Disclosure of Invention
The embodiment of the invention provides an image processing method and apparatus and a device for image processing, which can improve face-changing precision and efficiency.
In order to solve the above problem, an embodiment of the present invention discloses an image processing method, including:
determining a face-changing area in the template image according to the face area in the template image obtained by detection; the human face area is obtained by detection according to a human face detection model, and the human face detection model is a neural network model obtained by training according to a sample image containing a human face and a human face labeling result corresponding to the sample image;
determining a target face key point in a target image;
transforming the face-changing area according to the key points of the target face to obtain a transformed template image; wherein, the transformed template image comprises the characteristics of the key points of the target human face.
In another aspect, an embodiment of the present invention discloses an image processing apparatus, including:
the region determining module is used for determining a face changing region in the template image according to the face region in the template image obtained by detection; the human face area is obtained by detection according to a human face detection model, and the human face detection model is a neural network model obtained by training according to a sample image containing a human face and a human face labeling result corresponding to the sample image;
the first key point detection module is used for determining key points of a target face in a target image;
the transformation module is used for transforming the face-changing area according to the key points of the target face to obtain a transformed template image; wherein, the transformed template image comprises the characteristics of the key points of the target human face.
In yet another aspect, an embodiment of the present invention discloses an apparatus for image processing, including a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for:
determining a face-changing area in the template image according to the face area in the template image obtained by detection; the human face area is obtained by detection according to a human face detection model, and the human face detection model is a neural network model obtained by training according to a sample image containing a human face and a human face labeling result corresponding to the sample image;
determining a target face key point in a target image;
transforming the face-changing area according to the key points of the target face to obtain a transformed template image; wherein, the transformed template image comprises the characteristics of the key points of the target human face.
In yet another aspect, embodiments of the invention disclose a machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform an image processing method as described in one or more of the foregoing embodiments.
The embodiment of the invention has the following advantages:
The embodiment of the invention may first determine the face-changing area in the template image according to the face area detected in the template image, so as to ensure the accuracy of the face-changing area; then determine the target face key points in the target image, and transform the face-changing area according to the target face key points to obtain a transformed template image, so that the transformed template image contains the features of the target face key points, achieving the purpose of "face changing". The face area is determined by a face detection model, which is a neural network model trained on sample images containing faces and the face labeling results corresponding to the sample images; for example, the face detection model can be trained on sample images containing faces against complex backgrounds, in various poses, and wearing hats, masks, and the like. Therefore, compared with the Haar classifier, the face detection model provided by the embodiment of the invention can improve the accuracy of face detection, and further improve the accuracy of face changing.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of the steps of an embodiment of an image processing method of the present invention;
FIG. 2 is a flow chart of steps in another image processing method embodiment of the present invention;
FIG. 3 is a block diagram of an embodiment of an image processing apparatus according to the present invention;
FIG. 4 is a block diagram of an apparatus 800 for image processing of the present invention; and
FIG. 5 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Method embodiment one
Referring to fig. 1, a flowchart illustrating steps of an embodiment of an image processing method according to the present invention is shown, which may specifically include the following steps:
step 101, determining a face-changing area in a template image according to a face area in the template image obtained by detection; the human face area is obtained by detection according to a human face detection model, and the human face detection model is a neural network model obtained by training according to a sample image containing a human face and a human face labeling result corresponding to the sample image;
step 102, determining key points of a target face in a target image;
103, transforming the face-changing area according to the key points of the target face to obtain a transformed template image; wherein, the transformed template image comprises the characteristics of the key points of the target human face.
The image processing method of the embodiment of the invention can be applied to electronic equipment, wherein the electronic equipment includes but is not limited to a server, a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, an intelligent television, wearable equipment and the like.
In the embodiment of the present invention, the face displayed in the template image and/or the target image may be a partial face (i.e., a face with incomplete face information, such as a side face, or a face partly blocked by hair, accessories, etc.), or a complete face (i.e., a face with complete information, such as an unoccluded frontal face). The template image and/or the target image may be a color image or a grayscale image. In addition, the image format is not limited in the embodiment of the present invention; for example, the image format of the template image and/or the target image may be any format recognizable by an electronic device, such as JPG (Joint Photographic Experts Group, a picture format), BMP (Bitmap, an image file format), or RAW (raw image format).
Certainly, in the embodiment of the present invention, the face in the template image and/or the target image is not limited to a real person face, and may also be a cartoon character, or the like.
It can be understood that the number of faces in the template image and/or the target image is not limited by the embodiments of the present invention, for example, a plurality of template faces in the template image may be replaced by a plurality of target faces in a plurality of target images, where the plurality of target faces may be the same or different. For convenience of description, the embodiment of the present invention is described with the number of faces in the template image and the target image being 1, and the implementation processes of the scene transformation of a plurality of faces are similar and may be referred to each other.
In order to improve the accuracy of face detection and improve the face changing effect, the embodiment of the invention trains a face detection model in advance according to a large number of collected sample images containing faces and face labeling results corresponding to the sample images.
The face detection model can be obtained by performing supervised training on an existing neural network using training samples and a one-stage learning method. The training samples may include a large number of sample images and the face labeling result corresponding to each sample image, where the face labeling result may specifically include a label indicating whether a region is a face region, and a label (e.g., coordinate values) indicating the position of the face region.
The output of the face detection model may include: whether a face region exists in the input image, and the position of the face region in the input image. In the embodiment of the invention, the template image can be input into the face detection model to obtain the face region in the template image, and the face region can further be used as the face-changing region.
The face region determined by the face detection model may contain errors; for example, the template face in the template image may not be completely contained in the face region, which would greatly affect the face-changing effect. To avoid this, the embodiment of the present invention may expand the face region determined by the face detection model by a preset multiple (for example, 1.5 times), so as to ensure that the expanded face region contains the complete template face in the template image, and the expanded face region is used as the face-changing region.
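A minimal sketch of this region-expansion step is shown below, assuming the detector returns an (x, y, w, h) box; the function name and the clamping policy at the image borders are illustrative assumptions, not taken from the patent:

```python
def expand_face_region(x, y, w, h, img_w, img_h, factor=1.5):
    # Keep the box centered on the detected face while scaling it.
    cx, cy = x + w / 2, y + h / 2
    new_w, new_h = w * factor, h * factor  # expand by the preset multiple
    x0 = max(0, int(cx - new_w / 2))
    y0 = max(0, int(cy - new_h / 2))
    x1 = min(img_w, int(cx + new_w / 2))   # clamp to the image borders
    y1 = min(img_h, int(cy + new_h / 2))
    return x0, y0, x1 - x0, y1 - y0
```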
In an optional embodiment of the present invention, the face detection model may include a multilayer convolutional neural network, and a higher layer network in the multilayer convolutional neural network may be configured to detect macro information in an image, where the macro information includes at least any one of: semantic information and motion information, wherein a lower layer network of the multilayer convolutional neural network can be used for detecting detail related information in an image, and the detail related information at least comprises any one of the following items: edge information, color information.
In the multilayer convolutional neural network, the receptive field of the lower layers is small, so features with a lower degree of abstraction, such as edge and color details, can be extracted; the receptive field of the higher layers is large, so features with a higher degree of abstraction can be extracted, enabling recognition of shapes or targets, and the highest layers can extract the most abstract features, such as semantic, motion, and behavior features. That is, high-level features are combinations of low-level features, and the feature representations from low level to high level become increasingly abstract and increasingly capable of expressing semantics or intent. The process by which the multilayer convolutional neural network recognizes a target mirrors the layered way human vision recognizes targets, so detecting the face region with a multilayer convolutional neural network can improve the accuracy of face detection.
In a specific application, different images may include facial images of different scales, for example, some facial images may include only 20 pixels, some facial images may include more than 200 pixels, and different network layers of the multi-layer convolutional neural network may respond to facial features of different scales, so that the embodiments of the present invention may also accurately detect facial images of different scales by using the multi-layer convolutional neural network.
In an embodiment of the present invention, each unit of the multilayer convolutional neural network may include at least one convolutional layer and at least one pooling layer, where the convolutional layer may be used to extract image features, and the pooling layer may be used to down-sample input information. Because the convolutional neural network is a feedforward neural network, the artificial neurons of the convolutional neural network can respond to peripheral units in a part of coverage range and have excellent performance on image processing, the convolutional neural network is used for extracting image features, and the accuracy rate of face detection can be improved.
In addition, the embodiment of the invention can also set a proper step length for each network layer of the multilayer convolutional neural network so as to quickly reduce a feature map (feature map), thereby improving the detection speed of the face detection model. For example, the total step size of the convolutional and pooling layers may be set to 32, so that the feature map is quickly reduced to 1/32.
Furthermore, according to the embodiment of the present invention, Inception structures may be introduced after the convolutional layer and the pooling layer, and each Inception structure may use convolution kernels of different sizes at the same time; for example, one Inception structure may include convolution kernels of three different sizes, 1 × 1, 3 × 3, and 5 × 5, so as to increase the diversity of receptive fields, and thereby improve detection accuracy.
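A minimal sketch of such an Inception-style block with parallel 1 × 1, 3 × 3, and 5 × 5 branches is given below; the channel counts are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch, out_ch_per_branch=32):
        super().__init__()
        # Three parallel branches with different kernel sizes (and hence
        # different receptive fields); padding keeps spatial sizes equal.
        self.b1 = nn.Conv2d(in_ch, out_ch_per_branch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, out_ch_per_branch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, out_ch_per_branch, kernel_size=5, padding=2)

    def forward(self, x):
        # Concatenate the branch outputs along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
```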
After determining the face-changed region in the template image, the key points of the target face in the target image can be determined. The face key points can depict the face outline and key parts of the face, such as eyebrows, eyes, a nose, a mouth and the like, and the number of the face key points can be 68 or more. In the embodiment of the present invention, the target face key points may be a key point set including 68 face key points, and it is to be understood that the specific types and numbers of the face key points are not limited in the embodiment of the present invention.
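For concreteness, one widely available 68-point landmark detector (dlib's, with its downloadable pretrained model file) can produce a key-point set of the kind described above; the patent trains its own key point detection model, so this is only a stand-in for illustration:

```python
import dlib

detector = dlib.get_frontal_face_detector()
# External pretrained model file; path is an assumption.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def face_keypoints(gray_img):
    faces = detector(gray_img)
    if not faces:
        return None
    shape = predictor(gray_img, faces[0])
    # 68 (x, y) points covering the contour, eyebrows, eyes, nose, and mouth.
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```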
In addition, the embodiment of the invention does not limit the face key point detection algorithm. For example, a detection algorithm of ASM (active shape model) or the like can be employed. However, the ASM algorithm performs an accurate search for each feature point, and the exhaustive search process results in a low detection efficiency.
In order to improve the detection efficiency of the face key points, the embodiment of the invention can pre-train the key point detection model and detect the face key points in the image according to the key point detection model.
In an optional embodiment of the present invention, the determining the key point of the target face in the target image may specifically include: determining a target face key point in a target image according to the key point detection model; the key point detection model is a neural network model obtained by training according to a sample image containing a human face and a key point marking result corresponding to the sample image.
The key point annotation result may specifically include an annotation used for indicating whether the key point is a face key point, and an annotation (for example, a coordinate value or the like) used for indicating a position of the face key point.
The embodiment of the invention detects the face key points in the image according to the key point detection model, does not need to search all the feature points exhaustively, and further can improve the speed of detecting the face key points. In addition, the key point detection model is obtained by training in a machine learning mode according to a large amount of sample data, so that the key points of the face in the image are detected through the key point detection model, and the accuracy of detecting the key points of the face can be improved.
In an optional embodiment of the present invention, after the keypoint detection model is obtained by training according to a sample image containing a human face and a keypoint labeling result corresponding to the sample image, the method may further include: and optimizing the trained key point detection model according to the sample image which does not contain the key point labeling result to obtain the optimized key point detection model.
Specifically, the embodiment of the present invention may divide the training data of the keypoint detection model into two parts: one part is a sample image containing the result of the keypoint annotation, and the other part is a sample image not containing the result of the keypoint annotation. The embodiment of the invention can adopt a semi-supervised machine learning mode to train the key point detection model. Specifically, an initial keypoint detection model may be obtained by training according to a sample image containing a keypoint labeling result, and then the initial keypoint detection model may be optimized according to a sample image not containing a keypoint labeling result, so as to obtain an optimized keypoint detection model.
In an optional embodiment of the present invention, the sample image not containing the key point annotation result may specifically include: continuous frame images in the face video; the optimizing the trained keypoint detection model according to the sample image not containing the keypoint labeling result to obtain an optimized keypoint detection model specifically may include:
step S11, sequentially inputting continuous frame images in the face video into the trained key point detection model to output key point detection results corresponding to each frame image in the continuous frame images;
step S12, determining the tracking result of the key point detection result of the previous frame image in the current frame according to the key point detection result of the previous frame image;
and step S13, optimizing parameters of the key point detection model according to the difference between the key point detection result of the current frame image and the tracking result to obtain the optimized key point detection model.
Specifically, after an initial key point detection model is obtained by training according to sample images containing key point annotation results, sample images not containing key point annotation results may be input into the initial key point detection model. For example, consecutive frame images in a face video may be sequentially input into the trained key point detection model to output the key point detection result corresponding to each frame. Assume two consecutive frames (the (t-1)-th frame image and the t-th frame image) are input, so that the key point detection results of the (t-1)-th frame and the t-th frame are output. Taking the t-th frame as the current frame, the (t-1)-th frame is the previous frame, and the tracking result of the key points of the (t-1)-th frame in the t-th frame may be determined according to the key point detection result of the (t-1)-th frame; specifically, the Lucas-Kanade algorithm may be adopted to track the key points. Then, a time-series registration loss may be established according to the difference between the key point detection result of the t-th frame and the tracking result, and the parameters of the key point detection model may be optimized so that the time-series loss becomes less than a preset threshold.
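A rough sketch of this optical-flow-based consistency term is given below, using OpenCV's Lucas-Kanade tracker; the exact loss form (mean squared distance) is an assumption for illustration, not taken from the patent:

```python
import cv2
import numpy as np

def temporal_registration_loss(prev_gray, curr_gray, prev_pts, curr_pts):
    # prev_pts / curr_pts: (N, 1, 2) float32 key points predicted by the
    # model on the previous and current frames, respectively.
    tracked, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, prev_pts, None)
    ok = status.ravel() == 1  # keep only successfully tracked points
    # Penalize disagreement between the model's current-frame prediction
    # and the optical-flow tracking of the previous-frame prediction.
    diff = curr_pts[ok] - tracked[ok]
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```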
It can be understood that the face video may be any segment of video containing a face, and the number of consecutive frames of the face video is not limited by the embodiment of the present invention.
Therefore, the embodiment of the invention extracts supervision information from video in an unsupervised manner to optimize the key point detection model. Since the motion of objects in video is smooth, the embodiment of the invention in effect uses the principle of optical-flow consistency to optimize the key point detection model, so as to improve the detection precision and stability of the face key points.
It can be understood that the embodiment of the present invention uses two loss functions in the process of training and optimizing the keypoint detection model, one is a loss function used in the process of training the initial keypoint detection model according to the labeled sample data, and the loss function may specifically be a softmax (cross entropy) loss function, and the other is a loss function used in the process of optimizing the trained initial keypoint detection model, and the loss function may specifically be a time-series registration loss function. Of course, the softmax loss function and the time-series registration loss function are only one application example of the present invention, and the embodiment of the present invention does not limit the kinds of the loss functions.
After the target face key points in the target image are determined, the face-changed region can be transformed according to the target face key points to obtain a transformed template image, so that the transformed template image contains the features of the target face key points.
Specifically, the face in the target image could be directly matted out to obtain a matte image of the target face, and the matte image could be scaled, rotated, and so on to cover the template face position in the template image to obtain a transformed template image. However, such direct covering tends to produce an unnatural result.
Therefore, before replacing the target face, the embodiment of the invention carries out face alignment and face fusion processing on the target image and the template image according to the face key points in the target image and the template image, so that the face-changed image is more natural.
In an optional embodiment of the present invention, the transforming the face-changed region according to the target face key point to obtain a transformed template image specifically may include:
step S21, determining template face key points in the template image;
step S22, aligning the target face key points to the face-changing area according to the corresponding relation between the positions of the target face key points and the positions of the template face key points in the template image to obtain an aligned image;
step S23, determining key points of the aligned face in the aligned image;
step S24, according to the key points of the aligned face in the aligned image and the key points of the template face, the aligned image and the template image are fused to obtain a fused image;
and step S25, covering the face image in the fusion image at the face-changing area position in the template image to obtain a transformed template image.
Specifically, firstly, template face key points in a template image may be determined according to a key point detection model, and the target face key points are aligned to the face-changing region according to a correspondence between positions of the target face key points and positions of the template face key points, so as to obtain an aligned image.
The alignment operation refers to taking the template face key points as reference points, and aligning the target face key points to the reference points, that is, respectively aligning the face key features such as eyebrows, eyes, a nose, a mouth, a face contour and the like in the target image to the positions corresponding to the face key features such as eyebrows, eyes, a nose, a mouth, a face contour and the like in the template image.
In specific applications, the coordinates of the centers of the two eyes in the target image and the template image can be calculated from the positions of the target face key points and the template face key points respectively, and normalization processing such as scaling, rotation, and translation can be applied to the target image according to these coordinates, so that the target face key points are aligned with the template face key points. However, this alignment method considers only the key points of the two eyes, which yields an unsatisfactory alignment and in turn causes severe distortion of the facial features during the subsequent image fusion, finally affecting the "face-changing" effect.
In an optional embodiment of the present invention, the aligning, according to a correspondence between the positions of the target face key points and the positions of the template face key points, the target face key points to the face-changed region to obtain an aligned image may specifically include:
step S31, determining a first shape formed by the key points of the target face according to the positions of the key points of the target face;
step S32, determining a second shape formed by the template face key points according to the positions of the template face key points;
and step S33, aligning the key points of the target human face to the face-changing area according to the affine transformation of the first shape corresponding to the second shape to obtain an aligned image.
In order to improve the alignment accuracy, all the extracted face key points are considered in the alignment process. For example, a first shape formed by the target face key points may first be determined according to the positions of the target face key points, and a second shape formed by the template face key points may be determined according to the positions of the template face key points. The two shapes may then be aligned together using Procrustes analysis, based on the centers of gravity and the angles of the first and second shapes, so that the Procrustes distance is minimized.
Specifically, for the target face key points, the mean of the key points in the target image can be calculated, and then the mean is subtracted from each key point to realize normalization; the center of gravity of the first shape is then calculated from the centered data. Likewise, the center of gravity of the second shape may be calculated. The first and second shapes may then be aligned together according to their centers of gravity and angles so that the Procrustes distance is minimized.
The process of aligning the first shape and the second shape based on their centers of gravity and angles so that the Procrustes distance is minimized is iterative. Specifically, the rotation angle aligning the first shape to the second shape may be calculated by least squares, and partial derivatives are taken to obtain the rotation parameters of the affine transformation aligning the first shape to the second shape; the first shape is rotated according to these rotation parameters to obtain the new shape of one iteration. The iteration is then repeated until the first shape is sufficiently close to the second shape (for example, when a specified number of iterations is reached, or the norm of the shape change between two iterations falls below a preset threshold), at which point the iteration may stop, and the image obtained in the last iteration is used as the aligned image.
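For reference, a compact sketch of aligning one landmark shape to another in the Procrustes sense (centering, then solving the optimal rotation and scale by least squares via SVD) is given below. The patent describes an iterative variant; this closed-form version is an illustrative assumption:

```python
import numpy as np

def procrustes_align(src, dst):
    # src, dst: (N, 2) arrays of corresponding key points.
    src_c = src - src.mean(axis=0)  # subtract the mean (center of gravity)
    dst_c = dst - dst.mean(axis=0)
    # Optimal rotation minimizing the Procrustes distance (Kabsch/SVD);
    # reflection correction is omitted for brevity.
    u, s, vt = np.linalg.svd(src_c.T @ dst_c)
    rot = u @ vt
    scale = s.sum() / (src_c ** 2).sum()  # least-squares scale factor
    return scale * src_c @ rot + dst.mean(axis=0)
```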
Through the alignment operation, the target face key points can be aligned at the corresponding positions of the template face key points according to the display parameters such as the sizes and the rotation angles of the template face key points, so that the final face changing result is more natural.
After the target face key points are aligned to the face-changing area positions in the template image to obtain an aligned image, the face image in the aligned image is a new face image obtained by rotating and deforming the target face key points, so that the aligned face key points in the aligned image can be determined according to a key point detection model. And then, the alignment image and the template image can be fused according to the alignment face key points and the template face key points to obtain a fused image.
In an optional embodiment of the present invention, the fusing the aligned image and the template image according to the aligned face key points and the template face key points to obtain a fused image specifically may include:
step S41, determining a third face key point according to the first weight corresponding to the aligned face key point and the second weight corresponding to the template face key point;
step S42, determining a first affine transformation of the aligned face key points corresponding to the third face key points and a second affine transformation of the template face key points corresponding to the third face key points;
step S43, determining a first transformation result corresponding to the aligned image according to the first affine transformation, and determining a second transformation result corresponding to the template image according to the second affine transformation;
and step S44, fusing the first transformation result and the second transformation result according to the first weight and the second weight to obtain a fused image.
In the process of fusing the aligned image and the template image, the embodiment of the invention can set different weights for the face key points of the two images, for example, the weight corresponding to the aligned face key point is set as a first weight, and the weight corresponding to the template key point is set as a second weight, so that the fused image can simultaneously contain the face key point characteristics in the target image and the template image.
For example, if the first weight is set to be greater than the second weight, the obtained fused image will include more features of the target face key points and fewer features of the template face key points; that is, the face in the fused image is closer to the face in the target image. Thus, the display effect of the obtained fused image is more natural.
Specifically, a third face key point is determined according to a first weight corresponding to the aligned face key point and a second weight corresponding to the template face key point. For example, according to the first weight and the second weight, the alignment key point and the template face key point are weighted and averaged to obtain a third face key point.
And then, triangulating the three key point sets of the aligned face key points, the template face key points and the third face key points to obtain three triangle sets, and recording the three triangle sets as a first triangle set (including the aligned face key points), a second triangle set (including the template face key points) and a third triangle set (including the third face key points).
A first affine transformation mapping the first triangle set to the third triangle set and a second affine transformation mapping the second triangle set to the third triangle set are calculated respectively; according to the first affine transformation, the aligned image is deformed to obtain a first transformation result, and according to the second affine transformation, the template image is deformed to obtain a second transformation result.
And then, according to the first weight and the second weight, fusing the first transformation result and the second transformation result to obtain a fused image.
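An illustrative sketch of this triangulation-and-blend step, in the spirit of classic landmark-based face morphing, is given below; the function and variable names, the weight values, and the per-triangle full-image warp (simple but inefficient) are assumptions for illustration, not from the patent, and the two images are assumed to have the same size:

```python
import cv2
import numpy as np
from scipy.spatial import Delaunay

def morph_faces(aligned_img, template_img, pts_a, pts_t, w1=0.7, w2=0.3):
    pts_a, pts_t = np.float32(pts_a), np.float32(pts_t)
    pts_3 = w1 * pts_a + w2 * pts_t        # third key point set (weighted average)
    h, w = template_img.shape[:2]
    out = np.zeros_like(template_img, dtype=np.float32)
    for tri in Delaunay(pts_3).simplices:  # triangulate the averaged shape
        src_a, src_t, dst = pts_a[tri], pts_t[tri], pts_3[tri]
        m1 = cv2.getAffineTransform(src_a, dst)  # first affine transformation
        m2 = cv2.getAffineTransform(src_t, dst)  # second affine transformation
        warp_a = cv2.warpAffine(aligned_img, m1, (w, h))
        warp_t = cv2.warpAffine(template_img, m2, (w, h))
        mask = np.zeros((h, w), np.float32)
        cv2.fillConvexPoly(mask, np.int32(dst), 1.0)
        blend = w1 * warp_a + w2 * warp_t        # fuse by the same weights
        out = out * (1 - mask[..., None]) + blend * mask[..., None]
    return np.uint8(np.clip(out, 0, 255))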
Finally, the face image in the fused image is matted out and used to cover the face-changing region position in the template image to obtain a transformed template image. In the covering process, fine adjustments such as scaling and rotation can be applied to the face image in the fused image as needed, so that its edges join the face-changing region more naturally; in addition, the face skin color in the fused image and the face skin color in the face-changing region can be averaged through a filtering operation, so that the skin color after face changing is more natural.
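One common way to realize such a smooth covering step is Poisson blending, e.g. OpenCV's seamlessClone; the patent instead describes fine adjustment plus a filtering operation, so the following is only an analogous illustration with hypothetical argument names:

```python
import cv2

def paste_face(fused_face, template_img, face_mask, center_xy):
    # face_mask: uint8 mask of the matted face region in fused_face;
    # center_xy: (x, y) position of the face-changing region in the template.
    return cv2.seamlessClone(
        fused_face, template_img, face_mask, center_xy, cv2.NORMAL_CLONE)
```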
Optionally, in the process of fusing the alignment image and the template image, in the embodiment of the present invention, in addition to the key points of the face obtained based on the detection, other key points may be added, so that the fused image is more accurate and natural. Specifically, for the aligned image, a rectangle including a face region in the aligned image may be established, a point is taken at each of the middle positions of four sides of the rectangle, and the points are added to a key point set corresponding to the aligned face key points to increase the number of key points. Similarly, for the template image, the key points of the template are also added according to the method. And then, the alignment image and the template image can be fused according to the alignment key point set and the template key point set after the key points are added, so that the fused image is more natural.
Further, after the fused image is obtained, the key points corresponding to the face contour in the fused image may be moved inward in proportion to the face, so that the contour fits more closely, where the inward shift may be 1/3 to 1/2 of the distance between adjacent key points in the horizontal or vertical direction; it can be understood that the embodiment of the present invention does not limit this proportion.
To sum up, the embodiment of the present invention may first determine a face-changing region in a template image according to the face region detected in the template image, so as to ensure the accuracy of the face-changing region; then determine the target face key points in a target image, and transform the face-changing region according to the target face key points to obtain a transformed template image, so that the transformed template image contains the features of the target face key points, achieving the purpose of "face changing". The face region is determined by a face detection model, which is a neural network model trained on sample images containing faces and the face labeling results corresponding to the sample images; for example, the face detection model can be trained on sample images containing faces against complex backgrounds, in various poses, and wearing hats, masks, and the like. Therefore, compared with the Haar classifier, the face detection model provided by the embodiment of the invention can improve the accuracy of face detection, and further improve the accuracy of face changing.
Method embodiment two
Referring to fig. 2, a flowchart illustrating steps of another embodiment of an image processing method according to the present invention is shown, which may specifically include the following steps:
step 201, determining a face area in a template image according to a face detection model;
specifically, the template image is input into a face detection model, so that a face region in the template image is output through the face detection model. The face detection model is a neural network model obtained by training according to a sample image containing a face and a face labeling result corresponding to the sample image.
Optionally, the face detection model may include a multilayer convolutional neural network, where a higher layer network in the multilayer convolutional neural network is configured to detect macro information in an image, where the macro information includes at least any one of: semantic information and motion information, wherein a lower layer network of the multilayer convolutional neural network is used for detecting detail related information in an image, and the detail related information at least comprises any one of the following items: edge information, color information.
Step 202, determining a face-changing area in a template image, wherein the face-changing area comprises a face area of the template image;
specifically, the face region detected by the face detection model may be enlarged by 1.5 times to obtain the face-changed region.
Step 203, determining target face key points in the target image and template face key points in the template image according to the key point detection model;
specifically, the target image and the template image are respectively input into the key point detection model to obtain a face key point set of the target image, wherein the face key point set comprises 68 target face key points in the target image, and a face key point set of the template image, wherein the face key point set comprises 68 template face key points in the template image. The key point detection model is a neural network model obtained by training according to a sample image containing a human face and a key point marking result corresponding to the sample image.
Step 204, generating an alignment image according to the target face key points and the template face key points;
specifically, the target face key points are aligned to the face-changing area according to the corresponding relationship between the positions of the target face key points and the positions of the template face key points in the template image, so as to obtain an aligned image.
Step 205, determining key points of the aligned face in the aligned image according to the key point detection model;
specifically, the aligned images are input into the keypoint detection model to obtain a face keypoint set of the aligned images, wherein the face keypoint set comprises 68 aligned face keypoints in the aligned images.
Step 206, generating a fusion image according to the template face key points and the aligned face key points;
specifically, the alignment image and the template image are fused according to the alignment face key points in the alignment image and the template face key points to obtain a fused image.
Step 207, determining a fused face key point in the fused image according to the key point detection model;
specifically, the fused image is input into the key point detection model to obtain a face key point set of the fused image, wherein the face key point set comprises 68 fused face key points in the fused image.
And step 208, covering the face image in the fusion image at the position of the face-changing area in the template image to obtain a transformed template image.
Specifically, the face image in the fused image may be subjected to matting, and a certain fine adjustment process such as scaling and rotation is performed to cover the face-changed region position in the template image, so as to obtain a transformed template image.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Device embodiment
Referring to fig. 3, a block diagram of an embodiment of an image processing apparatus according to the present invention is shown, where the apparatus may specifically include:
the region determining module 301 is configured to determine a face-changing region in the template image according to a face region in the template image obtained through detection; the human face area is obtained by detection according to a human face detection model, and the human face detection model is a neural network model obtained by training according to a sample image containing a human face and a human face labeling result corresponding to the sample image;
a first key point detection module 302, configured to determine key points of a target face in a target image;
a transformation module 303, configured to transform the face-changed region according to the target face key point to obtain a transformed template image; wherein, the transformed template image comprises the characteristics of the key points of the target human face.
Optionally, the first keypoint detection module 302 may specifically include:
the key point detection model is used for determining key points of a target face in a target image; the key point detection model is a neural network model obtained by training according to a sample image containing a human face and a key point marking result corresponding to the sample image.
Optionally, the apparatus may further include:
and the optimization module is used for optimizing the trained key point detection model according to the sample image which does not contain the key point labeling result so as to obtain the optimized key point detection model.
Optionally, the sample image not containing the key point annotation result may specifically include: continuous frame images in the face video; the optimization module may specifically include:
the input submodule is used for sequentially inputting the continuous frame images in the face video into the trained key point detection model so as to output the key point detection result corresponding to each frame image in the continuous frame images;
the tracking submodule is used for determining a tracking result of the key point detection result of the previous frame image in the current frame according to the key point detection result of the previous frame image;
and the optimization submodule is used for optimizing the parameters of the key point detection model according to the difference between the key point detection result of the current frame image and the tracking result so as to obtain the optimized key point detection model.
Optionally, the transformation module may specifically include:
the second key point detection module is used for determining template face key points in the template image;
the alignment submodule is used for aligning the target face key points to the face-changing area according to the corresponding relation between the positions of the target face key points and the positions of the template face key points in the template image so as to obtain an aligned image;
the third key point detection module is used for determining key points of the aligned face in the aligned image;
the fusion submodule is used for fusing the alignment image and the template image according to the alignment face key points in the alignment image and the template face key points to obtain a fusion image;
and the transformation submodule is used for covering the face image in the fusion image at the position of the face-changing area in the template image so as to obtain a transformed template image.
Optionally, the alignment sub-module may specifically include:
the first determining unit is used for determining a first shape formed by the target face key points according to the positions of the target face key points;
the second determining unit is used for determining a second shape formed by the template face key points according to the positions of the template face key points;
and the alignment unit is used for aligning the key points of the target human face to the face-changing area according to the affine transformation of the first shape corresponding to the second shape so as to obtain an aligned image.
Optionally, the face detection model includes a multilayer convolutional neural network, a higher layer network in the multilayer convolutional neural network is used to detect macro information in the image, and the macro information at least includes any one of the following items: semantic information and motion information, wherein a lower layer network of the multilayer convolutional neural network is used for detecting detail related information in an image, and the detail related information at least comprises any one of the following items: edge information, color information.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention provides an apparatus for image processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for: determining a face-changing area in the template image according to the face area detected in the template image, the face area being detected by a face detection model, where the face detection model is a neural network model trained on sample images containing faces and the face labeling results corresponding to the sample images; determining target face key points in a target image; and transforming the face-changing area according to the target face key points to obtain a transformed template image, wherein the transformed template image contains the features of the target face key points.
Fig. 4 is a block diagram illustrating an apparatus 800 for image processing according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 4, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice information processing mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 5 is a schematic diagram of a server in some embodiments of the invention. The server 1900, which may vary widely in configuration or performance, may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and so forth.
There is also provided a non-transitory computer-readable storage medium in which instructions, when executed by a processor of an apparatus (a server or a terminal), enable the apparatus to perform the image processing method shown in Fig. 1, the method comprising: determining a face-changing area in a template image according to a detected face area in the template image, wherein the face area is detected by a face detection model, the face detection model being a neural network model trained on sample images containing faces and face labeling results corresponding to the sample images; determining target face key points in a target image; and transforming the face-changing area according to the target face key points to obtain a transformed template image, wherein the transformed template image includes features of the target face key points.
The embodiment of the invention discloses A1, an image processing method, comprising the following steps:
determining a face-changing area in a template image according to a detected face area in the template image; wherein the face area is detected by a face detection model, the face detection model being a neural network model trained on sample images containing faces and face labeling results corresponding to the sample images;
determining target face key points in a target image; and
transforming the face-changing area according to the target face key points to obtain a transformed template image; wherein the transformed template image includes features of the target face key points.
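By way of a non-authoritative illustration of this three-step flow, the Python sketch below strings the steps together; face_model.detect, detect_keypoints, and transform_face_area are hypothetical stand-ins (the patent's detectors are trained neural network models, and the transformation step is sketched under A5/A6 below), not the patented implementation.

    def face_swap(template_img, target_img, face_model, detect_keypoints):
        # Step 1: detect the face area in the template image; the detected
        # rectangle serves as the face-changing area (hypothetical API).
        face_area = face_model.detect(template_img)  # (x, y, w, h)

        # Step 2: determine target face key points in the target image
        # (hypothetical API; one possible detector is sketched under A2).
        target_pts = detect_keypoints(target_img)

        # Step 3: transform the face-changing area according to the target
        # face key points (alignment and fusion; see the A5/A6 sketch).
        return transform_face_area(template_img, face_area, target_pts)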
A2, the method of A1, wherein determining the target face key points in the target image comprises: determining the target face key points in the target image using a key point detection model; wherein the key point detection model is a neural network model trained on sample images containing faces and key point labeling results corresponding to the sample images.
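As a hedged example of obtaining such key points, the sketch below uses dlib's publicly distributed 68-point shape predictor as a stand-in for the trained key point detection model described here; the model file name is dlib's, and nothing in this snippet is the patent's own network.

    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def detect_keypoints(image_bgr):
        # dlib expects RGB; OpenCV-style images are BGR.
        rgb = np.ascontiguousarray(image_bgr[:, :, ::-1])
        faces = detector(rgb, 1)
        shape = predictor(rgb, faces[0])  # key points of the first detected face
        return np.array([(p.x, p.y) for p in shape.parts()])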
A3, the method of A2, wherein after the key point detection model is trained on the sample images containing faces and the corresponding key point labeling results, the method further comprises: optimizing the trained key point detection model according to sample images that do not contain key point labeling results, to obtain an optimized key point detection model.
A4, the method of A3, wherein the sample images that do not contain key point labeling results comprise consecutive frame images in a face video, and optimizing the trained key point detection model according to these sample images comprises: sequentially inputting the consecutive frame images in the face video into the trained key point detection model to output a key point detection result for each of the frame images;
determining, according to the key point detection result of the previous frame image, a tracking result of that detection result in the current frame; and
optimizing parameters of the key point detection model according to the difference between the key point detection result of the current frame image and the tracking result, to obtain the optimized key point detection model.
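For illustration only, a minimal sketch of the detection-versus-tracking signal described in A4, assuming detect_keypoints wraps the trained key point detection model (see A2) and using Lucas-Kanade optical flow as one plausible tracker; the patent does not name a specific tracking method, and in practice the difference would be expressed as a loss inside the model's training framework rather than computed in NumPy.

    import cv2
    import numpy as np

    def tracking_consistency(prev_frame, cur_frame, detect_keypoints):
        # Independent detections in consecutive frames.
        prev_pts = detect_keypoints(prev_frame).astype(np.float32)
        cur_pts = detect_keypoints(cur_frame).astype(np.float32)

        # Tracking result: previous-frame detections carried into the
        # current frame by optical flow.
        prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
        cur_gray = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
        tracked, status, _ = cv2.calcOpticalFlowPyrLK(
            prev_gray, cur_gray, prev_pts.reshape(-1, 1, 2), None)
        tracked, ok = tracked.reshape(-1, 2), status.reshape(-1) == 1

        # Mean squared difference between detection and tracking: the
        # self-supervised signal used to optimize the model parameters.
        return float(np.mean(np.sum((cur_pts[ok] - tracked[ok]) ** 2, axis=1)))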
A5, the method of A1, wherein transforming the face-changing area according to the target face key points to obtain the transformed template image comprises:
determining template face key points in the template image;
aligning the target face key points to the face-changing area according to the correspondence between the positions of the target face key points and the positions of the template face key points in the template image, to obtain an aligned image;
determining aligned face key points in the aligned image;
fusing the aligned image and the template image according to the aligned face key points in the aligned image and the template face key points, to obtain a fused image; and
covering the position of the face-changing area in the template image with the face image in the fused image, to obtain the transformed template image.
A6, the method of A5, wherein aligning the target face key points to the face-changing area according to the correspondence between the positions of the target face key points and the positions of the template face key points, to obtain the aligned image, comprises:
determining a first shape formed by the target face key points according to the positions of the target face key points; determining a second shape formed by the template face key points according to the positions of the template face key points; and aligning the target face key points to the face-changing area according to an affine transformation from the first shape to the second shape, to obtain the aligned image.
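A hedged sketch of A5/A6 with OpenCV, assuming the target and template key points are already available as (N, 2) arrays in corresponding order; for brevity the template key points stand in for the re-detected aligned-face key points of A5, and Poisson blending (cv2.seamlessClone) is one plausible choice of fusion, not necessarily the patent's.

    import cv2
    import numpy as np

    def align_and_fuse(target_img, target_pts, template_img, template_pts):
        # Affine transformation mapping the first shape (target key points)
        # onto the second shape (template key points).
        M, _ = cv2.estimateAffinePartial2D(
            target_pts.astype(np.float32), template_pts.astype(np.float32))
        h, w = template_img.shape[:2]
        aligned = cv2.warpAffine(target_img, M, (w, h))  # aligned image

        # Mask covering the face-changing area, from the key point hull.
        hull = cv2.convexHull(template_pts.astype(np.int32))
        mask = np.zeros((h, w), np.uint8)
        cv2.fillConvexPoly(mask, hull, 255)

        # Fuse the aligned face into the template at the face-changing area.
        x, y, bw, bh = cv2.boundingRect(hull)
        center = (x + bw // 2, y + bh // 2)
        return cv2.seamlessClone(aligned, template_img, mask, center,
                                 cv2.NORMAL_CLONE)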
A7, the method of any one of A1 to A6, wherein the face detection model comprises a multilayer convolutional neural network; a higher-layer network in the multilayer convolutional neural network is used for detecting macroscopic information in an image, the macroscopic information comprising at least one of semantic information and motion information; and a lower-layer network in the multilayer convolutional neural network is used for detecting detail-related information in an image, the detail-related information comprising at least one of edge information and color information.
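The patent does not disclose the network's layer counts or sizes, so the PyTorch toy below only illustrates the stated division of labor: shallow convolutions respond to detail-related information (edges, color), while deeper, downsampled layers carry macroscopic/semantic information. All dimensions and the detection head are hypothetical.

    import torch.nn as nn

    class MultiLevelFaceNet(nn.Module):
        def __init__(self):
            super().__init__()
            # Lower layers: high-resolution, detail-related features.
            self.low = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
            # Higher layers: downsampled, macroscopic/semantic features.
            self.high = nn.Sequential(
                nn.MaxPool2d(2), nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2), nn.Conv2d(64, 128, 3, padding=1), nn.ReLU())
            self.head = nn.Conv2d(128, 5, 1)  # hypothetical box/score maps

        def forward(self, x):
            detail = self.low(x)          # edge/color-level responses
            semantic = self.high(detail)  # coarser semantic responses
            return self.head(semantic)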
B8, an image processing apparatus, comprising:
a region determining module, configured to determine a face-changing area in a template image according to a detected face area in the template image; wherein the face area is detected by a face detection model, the face detection model being a neural network model trained on sample images containing faces and face labeling results corresponding to the sample images;
a first key point detection module, configured to determine target face key points in a target image; and
a transformation module, configured to transform the face-changing area according to the target face key points to obtain a transformed template image; wherein the transformed template image includes features of the target face key points.
B9, the apparatus of B8, wherein the first key point detection module comprises:
a first key point detection model, configured to determine the target face key points in the target image; wherein the key point detection model is a neural network model trained on sample images containing faces and key point labeling results corresponding to the sample images.
B10, the apparatus of B9, further comprising:
an optimization module, configured to optimize the trained key point detection model according to sample images that do not contain key point labeling results, to obtain an optimized key point detection model.
B11, the apparatus of B10, wherein the sample images that do not contain key point labeling results comprise consecutive frame images in a face video, and the optimization module comprises:
an input submodule, configured to sequentially input the consecutive frame images in the face video into the trained key point detection model to output a key point detection result for each of the frame images;
a tracking submodule, configured to determine, according to the key point detection result of the previous frame image, a tracking result of that detection result in the current frame; and
an optimization submodule, configured to optimize parameters of the key point detection model according to the difference between the key point detection result of the current frame image and the tracking result, to obtain the optimized key point detection model.
B12, the apparatus of B8, wherein the transformation module comprises:
a second key point detection module, configured to determine template face key points in the template image;
an alignment submodule, configured to align the target face key points to the face-changing area according to the correspondence between the positions of the target face key points and the positions of the template face key points in the template image, to obtain an aligned image;
a third key point detection module, configured to determine aligned face key points in the aligned image;
a fusion submodule, configured to fuse the aligned image and the template image according to the aligned face key points in the aligned image and the template face key points, to obtain a fused image; and
a transformation submodule, configured to cover the position of the face-changing area in the template image with the face image in the fused image, to obtain the transformed template image.
B13, the apparatus of B12, wherein the alignment submodule comprises:
a first determining unit, configured to determine a first shape formed by the target face key points according to the positions of the target face key points;
a second determining unit, configured to determine a second shape formed by the template face key points according to the positions of the template face key points; and
an alignment unit, configured to align the target face key points to the face-changing area according to an affine transformation from the first shape to the second shape, to obtain the aligned image.
B14, the apparatus of any one of B8 to B13, wherein the face detection model comprises a multilayer convolutional neural network; a higher-layer network in the multilayer convolutional neural network is used for detecting macroscopic information in an image, the macroscopic information comprising at least one of semantic information and motion information; and a lower-layer network in the multilayer convolutional neural network is used for detecting detail-related information in an image, the detail-related information comprising at least one of edge information and color information.
C15, an apparatus for image processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
determining a face-changing area in a template image according to a detected face area in the template image; wherein the face area is detected by a face detection model, the face detection model being a neural network model trained on sample images containing faces and face labeling results corresponding to the sample images;
determining target face key points in a target image; and
transforming the face-changing area according to the target face key points to obtain a transformed template image; wherein the transformed template image includes features of the target face key points.
C16, the apparatus of C15, wherein determining the target face key points in the target image comprises:
determining the target face key points in the target image using a key point detection model; wherein the key point detection model is a neural network model trained on sample images containing faces and key point labeling results corresponding to the sample images.
C17, the apparatus of C16, wherein the apparatus is further configured to execute, by the one or more processors, the one or more programs including instructions for:
optimizing the trained key point detection model according to sample images that do not contain key point labeling results, to obtain an optimized key point detection model.
C18, the apparatus of C17, wherein the sample images that do not contain key point labeling results comprise consecutive frame images in a face video, and optimizing the trained key point detection model according to these sample images comprises: sequentially inputting the consecutive frame images in the face video into the trained key point detection model to output a key point detection result for each of the frame images; determining, according to the key point detection result of the previous frame image, a tracking result of that detection result in the current frame; and optimizing parameters of the key point detection model according to the difference between the key point detection result of the current frame image and the tracking result, to obtain the optimized key point detection model.
C19, the apparatus of C15, wherein transforming the face-changing area according to the target face key points to obtain the transformed template image comprises:
determining template face key points in the template image;
aligning the target face key points to the face-changing area according to the correspondence between the positions of the target face key points and the positions of the template face key points in the template image, to obtain an aligned image;
determining aligned face key points in the aligned image;
fusing the aligned image and the template image according to the aligned face key points in the aligned image and the template face key points, to obtain a fused image; and
covering the position of the face-changing area in the template image with the face image in the fused image, to obtain the transformed template image.
C20, the apparatus of C19, wherein aligning the target face key points to the face-changing area according to the correspondence between the positions of the target face key points and the positions of the template face key points, to obtain the aligned image, comprises:
determining a first shape formed by the target face key points according to the positions of the target face key points; determining a second shape formed by the template face key points according to the positions of the template face key points; and aligning the target face key points to the face-changing area according to an affine transformation from the first shape to the second shape, to obtain the aligned image.
C21, the apparatus of any one of C15 to C20, wherein the face detection model comprises a multilayer convolutional neural network; a higher-layer network in the multilayer convolutional neural network is used for detecting macroscopic information in an image, the macroscopic information comprising at least one of semantic information and motion information; and a lower-layer network in the multilayer convolutional neural network is used for detecting detail-related information in an image, the detail-related information comprising at least one of edge information and color information.
D22, a machine-readable medium having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the image processing method of one or more of A1 to A7.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The image processing method, the image processing apparatus, and the apparatus for image processing provided by the present invention are described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the above description of the embodiments is intended only to help in understanding the method and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the present invention, vary the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. An image processing method, characterized in that the method comprises:
determining a face-changing area in a template image according to a detected face area in the template image; wherein the face area is detected by a face detection model, the face detection model being a neural network model trained on sample images containing faces and face labeling results corresponding to the sample images;
determining target face key points in a target image; and
transforming the face-changing area according to the target face key points to obtain a transformed template image; wherein the transformed template image includes features of the target face key points.
2. The method of claim 1, wherein determining the target face key points in the target image comprises:
determining the target face key points in the target image using a key point detection model; wherein the key point detection model is a neural network model trained on sample images containing faces and key point labeling results corresponding to the sample images.
3. The method of claim 2, wherein after the key point detection model is trained on the sample images containing faces and the corresponding key point labeling results, the method further comprises:
optimizing the trained key point detection model according to sample images that do not contain key point labeling results, to obtain an optimized key point detection model.
4. The method of claim 3, wherein the sample images that do not contain key point labeling results comprise consecutive frame images in a face video, and optimizing the trained key point detection model according to these sample images comprises:
sequentially inputting the consecutive frame images in the face video into the trained key point detection model to output a key point detection result for each of the frame images;
determining, according to the key point detection result of the previous frame image, a tracking result of that detection result in the current frame; and
optimizing parameters of the key point detection model according to the difference between the key point detection result of the current frame image and the tracking result, to obtain the optimized key point detection model.
5. The method of claim 1, wherein transforming the face-changing area according to the target face key points to obtain the transformed template image comprises:
determining template face key points in the template image;
aligning the target face key points to the face-changing area according to the correspondence between the positions of the target face key points and the positions of the template face key points in the template image, to obtain an aligned image;
determining aligned face key points in the aligned image;
fusing the aligned image and the template image according to the aligned face key points in the aligned image and the template face key points, to obtain a fused image; and
covering the position of the face-changing area in the template image with the face image in the fused image, to obtain the transformed template image.
6. The method of claim 5, wherein aligning the target face key points to the face-changing area according to the correspondence between the positions of the target face key points and the positions of the template face key points, to obtain the aligned image, comprises:
determining a first shape formed by the target face key points according to the positions of the target face key points;
determining a second shape formed by the template face key points according to the positions of the template face key points; and
aligning the target face key points to the face-changing area according to an affine transformation from the first shape to the second shape, to obtain the aligned image.
7. The method of any one of claims 1 to 6, wherein the face detection model comprises a multilayer convolutional neural network; a higher-layer network in the multilayer convolutional neural network is used for detecting macroscopic information in an image, the macroscopic information comprising at least one of semantic information and motion information; and a lower-layer network in the multilayer convolutional neural network is used for detecting detail-related information in an image, the detail-related information comprising at least one of edge information and color information.
8. An image processing apparatus, characterized in that the apparatus comprises:
a region determining module, configured to determine a face-changing area in a template image according to a detected face area in the template image; wherein the face area is detected by a face detection model, the face detection model being a neural network model trained on sample images containing faces and face labeling results corresponding to the sample images;
a first key point detection module, configured to determine target face key points in a target image; and
a transformation module, configured to transform the face-changing area according to the target face key points to obtain a transformed template image; wherein the transformed template image includes features of the target face key points.
9. An apparatus for image processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory, and execution of the one or more programs by one or more processors includes instructions for:
determining a face-changing area in a template image according to a detected face area in the template image; wherein the face area is detected by a face detection model, the face detection model being a neural network model trained on sample images containing faces and face labeling results corresponding to the sample images;
determining target face key points in a target image; and
transforming the face-changing area according to the target face key points to obtain a transformed template image; wherein the transformed template image includes features of the target face key points.
10. A machine-readable medium having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the image processing method of one or more of claims 1 to 7.
CN201910090781.0A 2019-01-29 2019-01-29 Image processing method and device for image processing Pending CN111488774A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910090781.0A CN111488774A (en) 2019-01-29 2019-01-29 Image processing method and device for image processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910090781.0A CN111488774A (en) 2019-01-29 2019-01-29 Image processing method and device for image processing

Publications (1)

Publication Number Publication Date
CN111488774A true CN111488774A (en) 2020-08-04

Family

ID=71812360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910090781.0A Pending CN111488774A (en) 2019-01-29 2019-01-29 Image processing method and device for image processing

Country Status (1)

Country Link
CN (1) CN111488774A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016034008A1 (en) * 2014-09-04 2016-03-10 华为技术有限公司 Target tracking method and device
CN107590807A (en) * 2017-09-29 2018-01-16 百度在线网络技术(北京)有限公司 Method and apparatus for detection image quality
CN107590482A (en) * 2017-09-29 2018-01-16 百度在线网络技术(北京)有限公司 information generating method and device
CN107679490A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Method and apparatus for detection image quality
CN108230357A (en) * 2017-10-25 2018-06-29 北京市商汤科技开发有限公司 Critical point detection method, apparatus, storage medium, computer program and electronic equipment
CN107767335A (en) * 2017-11-14 2018-03-06 上海易络客网络技术有限公司 A kind of image interfusion method and system based on face recognition features' point location
CN108876718A (en) * 2017-11-23 2018-11-23 北京旷视科技有限公司 The method, apparatus and computer storage medium of image co-registration
CN108229432A (en) * 2018-01-31 2018-06-29 广州市动景计算机科技有限公司 Face calibration method and device
CN108764048A (en) * 2018-04-28 2018-11-06 中国科学院自动化研究所 Face critical point detection method and device
CN109190467A (en) * 2018-07-26 2019-01-11 北京纵目安驰智能科技有限公司 A kind of more object detecting methods, system, terminal and storage medium returned based on key point
CN109034095A (en) * 2018-08-10 2018-12-18 杭州登虹科技有限公司 A kind of face alignment detection method, apparatus and storage medium
CN109214343A (en) * 2018-09-14 2019-01-15 北京字节跳动网络技术有限公司 Method and apparatus for generating face critical point detection model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ming Yue et al., "Audio-Visual Media Perception and Recognition", Beijing University of Posts and Telecommunications Press, 31 August 2015, pages 60-65 *
Li Deyi et al., "Introduction to Artificial Intelligence" (CAST New-Generation Information Technology Series), vol. 1, China Science and Technology Press, 31 August 2018, pages 123-126 *
Li Changyun et al., "Intelligent Sensing Technology and Its Applications in Electrical Engineering", vol. 1, University of Electronic Science and Technology of China Press, pages 157-158 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112184593A (en) * 2020-10-14 2021-01-05 北京字跳网络技术有限公司 Key point determination method, device, equipment and computer readable medium
CN112233207A (en) * 2020-10-16 2021-01-15 北京字跳网络技术有限公司 Image processing method, device, equipment and computer readable medium
CN114007099A (en) * 2021-11-04 2022-02-01 北京搜狗科技发展有限公司 Video processing method and device for video processing
WO2023137905A1 (en) * 2022-01-21 2023-07-27 小米科技(武汉)有限公司 Image processing method and apparatus, and electronic device and storage medium

Similar Documents

Publication Publication Date Title
US11182591B2 (en) Methods and apparatuses for detecting face, and electronic devices
US20200236425A1 (en) Method and apparatus for filtering video
CN110602527B (en) Video processing method, device and storage medium
CN111488774A (en) Image processing method and device for image processing
EP4154511A1 (en) Maintaining fixed sizes for target objects in frames
TWI754887B (en) Method, device and electronic equipment for living detection and storage medium thereof
CN104408402B (en) Face identification method and device
US20210390673A1 (en) Image-capturing device and method for controlling same
CN109859096A (en) Image Style Transfer method, apparatus, electronic equipment and storage medium
CN105654039B (en) The method and apparatus of image procossing
CN110956060A (en) Motion recognition method, driving motion analysis method, device and electronic equipment
CN111696176B (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN109063580A (en) Face identification method, device, electronic equipment and storage medium
CN107463903B (en) Face key point positioning method and device
Kang et al. Development of head detection and tracking systems for visual surveillance
WO2023098128A1 (en) Living body detection method and apparatus, and training method and apparatus for living body detection system
CN109937434B (en) Image processing method, device, terminal and storage medium
CN106980840A (en) Shape of face matching process, device and storage medium
KR20120120858A (en) Service and method for video call, server and terminal thereof
WO2020124993A1 (en) Liveness detection method and apparatus, electronic device, and storage medium
WO2024001095A1 (en) Facial expression recognition method, terminal device and storage medium
WO2024021742A1 (en) Fixation point estimation method and related device
CN109784164A (en) Prospect recognition methods, device, electronic equipment and storage medium
CN107977636B (en) Face detection method and device, terminal and storage medium
Polatsek et al. Novelty-based spatiotemporal saliency detection for prediction of gaze in egocentric video

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220722

Address after: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing

Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd.

Applicant before: SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right