CN113628322B - Image processing, AR display and live broadcast method, device and storage medium


Info

Publication number
CN113628322B
Authority
CN
China
Prior art keywords
face, feature, model, image, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110844521.5A
Other languages
Chinese (zh)
Other versions
CN113628322A (en)
Inventor
考月英
吕江靖
盘博文
李晓波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202110844521.5A
Publication of CN113628322A
Application granted
Publication of CN113628322B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06T3/04

Abstract

The embodiment of the application provides an image processing method, an AR display method, a live broadcast method, a device and a storage medium. In the embodiment of the application, three-dimensional reconstruction is performed on a 2D image based on a prior 3D face model. During the three-dimensional reconstruction, face pose estimation is performed in combination with the position mapping relation from the 2D face image to the reconstructed 3D face model, so as to obtain pose data of the 2D face. When the reconstructed 3D face model is projected onto the 2D face image, perspective projection onto the 2D face image is performed for the target 3D face model using the pose data of the 2D face. Because the pose data of the 2D face is fully taken into account, the result of projecting the 3D face model onto the 2D image fits better and looks more realistic, which opens up more possibilities for applications with high-precision requirements.

Description

Image processing, AR display and live broadcast method, device and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method, an AR display method, an AR live broadcast method, an AR display device, an AR live broadcast apparatus, and a storage medium.
Background
Vision-based three-dimensional face reconstruction and pose estimation have important application value in scenarios such as real-time 3D avatar creation, face animation generation, and AR makeup or AR try-on in augmented reality (AR) live broadcast. For example, in an AR makeup scenario, makeup effects (e.g., blush, lipstick, etc.) need to be added to the reconstructed three-dimensional face model, and the three-dimensional face model with the added makeup effects is then projected onto a two-dimensional face image for presentation. For another example, in an AR try-on scenario, a try-on product (e.g., sunglasses, earrings, etc.) needs to be added to the reconstructed three-dimensional face model, and the three-dimensional face model with the added product is then projected onto a two-dimensional face image for presentation. When the three-dimensional face model is projected onto the two-dimensional face image, the reconstructed three-dimensional face model often does not fit the two-dimensional face image, for example the makeup appears distorted or floating in an AR makeup scenario, which limits the application of three-dimensional face model reconstruction.
Disclosure of Invention
Aspects of the application provide an image processing method, an AR display method, a live broadcast method, a device and a storage medium, so that the projection of a 3D face model onto a 2D image fits better and looks more realistic, providing more possibilities for high-precision applications.
The embodiment of the application provides an image processing method, which includes: performing three-dimensional reconstruction on a 2D face image based on a prior 3D face model to obtain a target 3D face model corresponding to the 2D face image; during the three-dimensional reconstruction, performing face pose estimation in combination with the position mapping relation from the 2D face image to the target 3D face model to obtain pose data of the 2D face; and performing perspective projection processing onto the 2D face image for the target 3D face model according to the pose data of the 2D face.
The embodiment of the application also provides an AR display method suitable for an AR display device provided with a camera. The method includes: the AR display device collects a 2D face image of a user with the camera and displays the 2D face image on its display screen; three-dimensional reconstruction is performed on the 2D face image based on a prior 3D face model to obtain a target 3D face model corresponding to the 2D face image, and during the three-dimensional reconstruction, face pose estimation is performed in combination with the position mapping relation from the 2D face image to the target 3D face model to obtain pose data of the 2D face; a target object is added on the target 3D face model; and according to the pose data of the 2D face, perspective projection processing onto the 2D face image is performed for the target 3D face model with the added target object, and the 2D face image with the target object is displayed on the display screen.
The embodiment of the application also provides a live broadcast method, which includes: collecting an initial live video at the anchor end with a camera, the initial live video containing 2D face images; performing three-dimensional reconstruction on the 2D face image based on a prior 3D face model to obtain a target 3D face model corresponding to the 2D face image, and during the three-dimensional reconstruction, performing face pose estimation in combination with the position mapping relation from the 2D face image to the target 3D face model to obtain pose data of the 2D face; adding a live broadcast effect or product on the target 3D face model; and performing perspective projection processing onto the 2D face image for the target 3D face model with the added live broadcast effect or product according to the pose data of the 2D face, so as to obtain a target live video, and sending the target live video to a playing terminal, the target live video containing the 2D face image with the live broadcast effect or product.
The embodiment of the application also provides an image processing apparatus, which includes: a three-dimensional reconstruction module, configured to perform three-dimensional reconstruction on a 2D face image based on a prior 3D face model to obtain a target 3D face model corresponding to the 2D face image; a pose estimation module, configured to perform face pose estimation during the three-dimensional reconstruction in combination with the position mapping relation from the 2D face image to the target 3D face model, to obtain pose data of the 2D face; and a perspective projection module, configured to perform perspective projection processing onto the 2D face image for the target 3D face model according to the pose data of the 2D face.
The embodiment of the application also provides an image processing device, which includes a memory and a processor. The memory is for storing a computer program; the processor, coupled with the memory, is for executing the computer program to: perform three-dimensional reconstruction on a 2D face image based on a prior 3D face model to obtain a target 3D face model corresponding to the 2D face image; during the three-dimensional reconstruction, perform face pose estimation in combination with the position mapping relation from the 2D face image to the target 3D face model to obtain pose data of the 2D face; and perform perspective projection processing onto the 2D face image for the target 3D face model according to the pose data of the 2D face.
The embodiment of the application also provides a computer readable storage medium storing a computer program, which when executed by a processor, causes the processor to implement the steps in the image processing method, the AR display method and the live broadcast method provided by the embodiment of the application.
The embodiments of the present application also provide a computer program product, including a computer program/instruction, which when executed by a processor, causes the processor to implement the steps in the image processing method, the AR display method, and the live broadcast method provided by the embodiments of the present application.
In the embodiment of the application, three-dimensional reconstruction is performed on the 2D image based on the prior 3D face model. During the three-dimensional reconstruction, face pose estimation is performed in combination with the position mapping relation from the 2D face image to the reconstructed 3D face model, so as to obtain the pose data of the 2D face. When the reconstructed 3D face model is projected onto the 2D face image, perspective projection onto the 2D face image is performed for the target 3D face model using the pose data of the 2D face. Because the pose data of the 2D face is fully taken into account, the result of projecting the 3D face model onto the 2D image is more realistic, which provides more possibilities for applications with high-precision requirements.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flowchart of an image processing method according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a neural network model structure for generating a prior 3D face model according to an exemplary embodiment of the present application;
FIG. 3a is a schematic diagram of face image processing based on a neural network model according to an exemplary embodiment of the present application;
FIG. 3b is a schematic diagram of three-dimensional reconstruction based on a neural network model according to an exemplary embodiment of the present application;
FIG. 3c is a schematic diagram of another three-dimensional reconstruction and face pose estimation based on a neural network model according to an exemplary embodiment of the present application;
FIG. 3d is a schematic diagram of a first neural network model and its feature extraction process according to an exemplary embodiment of the present application;
FIG. 3e is a schematic diagram of a second neural network model and its feature extraction process according to an exemplary embodiment of the present application;
FIG. 3f is a schematic diagram of a neural network model structure based on a single RGB/RGB-D image and a 3D face shape prior according to an exemplary embodiment of the present application;
FIG. 3g is a schematic diagram of an RGB image according to an exemplary embodiment of the present application;
FIG. 3h is a schematic diagram of an RGB image after cropping the face region according to an exemplary embodiment of the present application;
FIG. 3i is an effect diagram of projecting a 3D face into a 2D image according to an exemplary embodiment of the present application;
FIG. 3j is a schematic diagram of a neural network model structure based on the fusion of an RGB image and a depth image and a 3D face shape prior according to an exemplary embodiment of the present application;
FIG. 4a is a schematic structural diagram of a live broadcast system according to an exemplary embodiment of the present application;
FIG. 4b is a flowchart of a live broadcast method according to an exemplary embodiment of the present application;
FIG. 4c is a schematic diagram of an AR display system according to an exemplary embodiment of the present application;
FIG. 4d is a flowchart of an AR display method according to an exemplary embodiment of the present application;
FIG. 5a is a schematic structural diagram of an image processing apparatus according to an exemplary embodiment of the present application;
FIG. 5b is a schematic structural diagram of an image processing device according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In some display scenarios based on face images, special effects or products need to be added to the face image. For example, in a live broadcast scenario, hanging ornaments or beautifying effects need to be added to the face image. For another example, in an AR makeup scenario, makeup effects need to be added to the face image. For yet another example, in an AR try-on scenario, try-on products such as sunglasses, earrings, or hats need to be added to the face image. Adding such special effects is in fact a process of reconstructing a 3D face model from the 2D face image and projecting the 3D face model, with the special effects added, back onto the 2D face image. The whole process therefore involves three-dimensional reconstruction of the face model and a projection operation from the reconstructed 3D face model to the 2D face image. Three-dimensional reconstruction refers to recovering the 3D face model from the 2D face image; the projection operation can be regarded as mapping three-dimensional points on the 3D face model to two-dimensional points on the 2D face image, which is essentially a conversion from three-dimensional coordinates to two-dimensional coordinates.
In existing schemes, the projection of the reconstructed 3D face model onto the 2D face image is usually realized by orthographic projection or weak perspective projection. The orthographic projection formula is x = s·Π(X) + t, where s is a constant, Π is the 3D-to-2D projective transformation, and t is a 2×1 translation vector. The weak perspective projection formula is x = s·P·R·X + t, where s is a constant, P is the orthographic projection matrix [1, 0, 0; 0, 1, 0], R is a 3×3 rotation matrix, and t is a 2×1 translation vector. However, with orthographic projection or weak perspective projection, part or all of the face pose data is lost. As a result, after the projection operation, the 3D face model with the added special effects or products may not fit the 2D face image, for example the makeup appears distorted or floating in an AR makeup scenario, which limits the application of reconstructed three-dimensional face models.
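For illustration only (not taken from the patent text), the two conventional projections discussed above can be sketched in a few lines of Python; the values of s, R and t below are hypothetical, and Π is taken to simply drop the z coordinate.

```python
import numpy as np

def orthographic_projection(X, s, t):
    # x = s * Pi(X) + t, where Pi drops the z coordinate of each 3D point
    return s * X[:, :2] + t                      # N x 2 image points

def weak_perspective_projection(X, s, R, t):
    # x = s * P * R * X + t, with the orthographic projection matrix P = [1,0,0; 0,1,0]
    P = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
    return s * (P @ R @ X.T).T + t               # N x 2 image points

X = np.random.rand(5, 3)          # toy 3D face points
R = np.eye(3)                     # toy rotation
print(orthographic_projection(X, s=1.0, t=np.zeros(2)))
print(weak_perspective_projection(X, s=1.0, R=R, t=np.zeros(2)))
```

Both formulas scale every point by the same constant s, so the depth-dependent scale of a real camera, and with it part of the face pose, is lost; this is the fit problem the following embodiments address.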
To address these technical problems, in some embodiments of the application three-dimensional reconstruction is performed on the 2D image based on a prior 3D face model. During the three-dimensional reconstruction, face pose estimation is performed in combination with the position mapping relation from the 2D face image to the reconstructed 3D face model, yielding pose data of the 2D face. When the reconstructed 3D face model is projected onto the 2D face image, perspective projection onto the 2D face image is performed for the target 3D face model using the pose data of the 2D face. Because the pose data of the 2D face is fully considered, the result of projecting the 3D face model onto the 2D image is more realistic, which provides more possibilities for high-precision applications.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a flowchart of an image processing method according to an exemplary embodiment of the present application; as shown in fig. 1, the method includes:
101. performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image;
102. during the three-dimensional reconstruction, performing face pose estimation in combination with the position mapping relation from the 2D face image to the target 3D face model, so as to obtain pose data of the 2D face;
103. performing perspective projection processing onto the 2D face image for the target 3D face model according to the pose data of the 2D face.
In the embodiment of the application, three-dimensional reconstruction is performed on the 2D face image using the prior 3D face model, and after the target 3D face model corresponding to the 2D face image is obtained, various special effects or try-on products can be added to the target 3D face model. Afterwards, projection onto the 2D face image can be performed for the target 3D face model with the added special effect or try-on product. In order to improve how well the target 3D face model, the special effect or try-on product, and the 2D face image fit together, in the embodiment of the application face pose estimation is performed while the 2D image is three-dimensionally reconstructed, yielding pose data of the 2D face. When projection onto the 2D face image is then performed for the reconstructed target 3D face model, perspective projection onto the 2D face image is performed using the pose data of the 2D face. Because the pose data of the 2D face is fully considered in the perspective projection, the result of projecting the 3D face model onto the 2D image is more realistic, which provides more possibilities for high-precision applications.
The perspective projection processing performed on the target 3D face model with the pose data of the 2D face is the process of projecting the target 3D face model, or a special effect or try-on product on it, from a projection center onto the 2D face image. The perspective projection transform is expressed as x = K(RX + T), where x is the 2D coordinate of a 3D point of the target 3D face model projected into the 2D face image, K is the camera intrinsic matrix, R is a 3×3 rotation matrix, T is a 3×1 translation vector, and X is a 3D point, or the point cloud formed by the 3D points, of the reconstructed target 3D face model. A face has six degrees of freedom in space, namely translational degrees of freedom along the three rectangular coordinate axes x, y and z, and rotational degrees of freedom around these three axes. In the above perspective transform, the translation vector T corresponds to the translational degrees of freedom along the x, y and z axes, and the rotation matrix R corresponds to the rotational degrees of freedom around the three axes. In this embodiment, the pose data of the face may be represented by the six-degree-of-freedom information of the face.
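As a hedged sketch of the x = K(RX + T) transform described above (the intrinsics K and the pose R, T below are hypothetical values, not taken from the patent):

```python
import numpy as np

def perspective_projection(X, K, R, T):
    # x = K(RX + T): project N x 3 model points to 2D pixel coordinates
    Xc = (R @ X.T).T + T            # transform points into the camera frame
    x = (K @ Xc.T).T                # apply the camera intrinsic matrix
    return x[:, :2] / x[:, 2:3]     # perspective divide -> N x 2 pixels

K = np.array([[800.0, 0.0, 320.0],          # assumed camera intrinsics
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(10.0)                    # small rotation about the y axis
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
T = np.array([0.0, 0.0, 0.5])               # translation along the optical axis
X = np.random.rand(5, 3) * 0.1              # toy 3D face points near the origin
print(perspective_projection(X, K, R, T))
```

R and T together carry the six degrees of freedom of the face, which is why this formulation keeps the pose information that orthographic and weak perspective projection discard.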
Specifically, three-dimensional reconstruction is performed on the 2D face image based on the prior 3D face model to obtain the target 3D face model corresponding to the 2D face image. For example, feature extraction is performed on the 2D face image and on the prior 3D face model, respectively, to obtain 2D face feature information and 3D face feature information; the 2D face feature information and the 3D face feature information are combined to generate 3D face deformation parameters; and the target 3D face model corresponding to the 2D face image is reconstructed from the 3D face deformation parameters and the prior 3D face model. The prior 3D face model is a general face model, which can be understood as a 3D face shape prior, or as an average set of face mesh points (mesh); because of its generality, the prior 3D face model is a 3D face model that contains no face pose information.
The embodiment of the present application does not limit how the prior 3D face model is acquired. Optionally, one way of acquiring the prior 3D face model includes: learning the prior 3D face model with an encode-decode network structure which, as shown in fig. 2, comprises an encoder 21 and a decoder 22. Specifically, some 2D face sample images are collected and three-dimensionally reconstructed to obtain 3D face meshes, or 3D face meshes may be collected directly with a 3D face collection device. These 3D face meshes are then used as samples, each sample containing the three-dimensional coordinates of a number of face 3D points. The samples are input to the encoder 21, which maps each input sample into a hidden-layer vector; the hidden-layer vectors of the samples are averaged, the averaged hidden-layer vector is input to the decoder 22, and the decoder 22 generates the prior 3D face model from the averaged hidden-layer vector.
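A minimal PyTorch sketch of this encode-decode idea is given below; the point count, layer sizes and latent dimension are illustrative assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

n_points, latent_dim = 1000, 128

# encoder maps a flattened 3D face mesh to a hidden-layer vector,
# decoder maps a hidden-layer vector back to a mesh
encoder = nn.Sequential(nn.Linear(n_points * 3, 512), nn.ReLU(),
                        nn.Linear(512, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                        nn.Linear(512, n_points * 3))

meshes = torch.rand(32, n_points * 3)              # 32 toy 3D face mesh samples
with torch.no_grad():
    mean_latent = encoder(meshes).mean(dim=0)      # average the hidden-layer vectors
    prior_face = decoder(mean_latent).view(n_points, 3)   # prior 3D face model
print(prior_face.shape)                            # torch.Size([1000, 3])
```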
Then, during the three-dimensional reconstruction, face pose estimation is performed in combination with the position mapping relation from the 2D face image to the target 3D face model to obtain the pose data of the 2D face. The position mapping relation refers to the positional conversion between pixel points in the 2D face image and 3D points in the target 3D face model. Face pose estimation mainly obtains face orientation information, and the pose data of the 2D face can be represented by a rotation matrix, a rotation vector, a quaternion, Euler angles, multiple degrees of freedom, and so on; these representations can be converted into one another. Alternatively, the pose data of the face may be represented by the six-degree-of-freedom information of the face, i.e. the translational and rotational degrees of freedom; details can be found in the foregoing embodiments and are not repeated here.
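The interchangeability of these pose representations can be illustrated with SciPy (the sample angles below are hypothetical):

```python
import numpy as np
from scipy.spatial.transform import Rotation

rot = Rotation.from_euler("xyz", [10.0, 20.0, 5.0], degrees=True)  # sample Euler angles
print(rot.as_matrix())                    # 3 x 3 rotation matrix
print(rot.as_rotvec())                    # rotation vector (axis * angle)
print(rot.as_quat())                      # quaternion (x, y, z, w)
print(rot.as_euler("xyz", degrees=True))  # back to Euler angles
```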
After the pose data of the 2D face is obtained, perspective projection processing onto the 2D face image may be performed for the target 3D face model according to the pose data of the 2D face, using the perspective projection transform x = K(RX + T). Performing perspective projection onto the 2D face image for the target 3D face model includes, but is not limited to, the following ways. First, a special-effect object may be added on a reference 3D face model; with the target 3D face model aligned with the reference 3D face model, the special-effect object on the reference 3D face model can be projected directly into the 2D face image. The special-effect object may be a makeup effect, such as blush, eyebrows or lipstick, a decoration effect such as stickers, graffiti, text or mosaics, or a try-on product such as sunglasses, a hat or earrings. Second, the special-effect object may be added on the target 3D face model, and the special-effect object and the target 3D face model are projected together into the 2D face image. Third, the special-effect object may be added on the target 3D face model, the target 3D face model is made transparent, and the special-effect object and the transparent target 3D face model are projected into the 2D face image.
In an optional embodiment of the application, a neural network model can be used to perform the three-dimensional reconstruction of the 2D face image based on the prior 3D face model and to complete face pose estimation during the reconstruction, finally obtaining the target 3D face model corresponding to the 2D face image and the pose data of the 2D face. The architecture for face image processing based on the neural network model is shown in fig. 3a. As shown in fig. 3a, the 2D face image and the prior 3D face model are input to the neural network model; inside the neural network model, the 2D face image is three-dimensionally reconstructed to obtain the target 3D face model, and face pose estimation is performed during the reconstruction to obtain the pose data of the 2D face; after the neural network model outputs the pose data of the 2D face and the target 3D face model, perspective projection processing onto the 2D face image is performed for the target 3D face model according to the pose data of the 2D face.
Further alternatively, the neural network model of this embodiment may be split into two branches, namely a first neural network model and a second neural network model, as shown in fig. 3b. The first neural network model is responsible for feature extraction of the 2D face image; the second neural network model is responsible for feature extraction of the prior 3D face model. Specifically, with reference to fig. 3b, during the three-dimensional reconstruction the 2D face image may be input to the first neural network model, and feature extraction is performed on the 2D face image with the first neural network model to obtain a 2D face global feature map. The first neural network model is any neural network model capable of feature extraction; for example, it may be a multi-layer neural network model, i.e. one comprising a plurality of network layers, with different network layers performing their respective feature extraction. The network layers in the neural network model may include, but are not limited to: a convolutional neural network (CNN), a recurrent neural network (RNN), a deep neural network (DNN), a residual network, or the like.
With reference to fig. 3b, during the three-dimensional reconstruction, the prior 3D face model may be processed by performing feature extraction on it with the second neural network model, obtaining feature vectors of a plurality of 3D points on the prior 3D face model. Likewise, the second neural network model is any neural network model capable of feature extraction; for example, it may be a multi-layer neural network model, i.e. one comprising a plurality of network layers, with different network layers performing their respective feature extraction. The network layers may include, but are not limited to: a convolutional neural network (CNN), a recurrent neural network (RNN), a deep neural network (DNN), a residual network, or the like.
Further, as shown in fig. 3b, after the 2D face global feature map and the feature vectors of the 3D points on the prior 3D face model are obtained, 3D face deformation parameters may be generated from the 2D face global feature map and the feature vectors of the 3D points, and the target 3D face model corresponding to the 2D face image is reconstructed from the 3D face deformation parameters and the prior 3D face model. The 3D face deformation parameters describe the deformation of the target 3D face model relative to the prior 3D face model, and can be represented by the offset of the target 3D face model relative to the prior 3D face model. The way in which the target 3D face model is reconstructed from the 3D face deformation parameters and the prior 3D face model is not limited. For example, the prior 3D face model can be deformed with the 3D face deformation parameters to obtain the target 3D face model corresponding to the 2D face image. For another example, the 3D face deformation parameters may be superimposed on the prior 3D face model to reconstruct the target 3D face model corresponding to the 2D face image.
Optionally, as shown in fig. 3c, one embodiment of generating the 3D face deformation parameters from the 2D face global feature map and the feature vectors of the plurality of 3D points includes: splicing the 2D face global feature map with the feature vectors of the plurality of 3D points to obtain a first fusion feature; and performing feature learning on the first fusion feature with a third neural network model to obtain the 3D face deformation parameters.
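A minimal PyTorch sketch of this splice-and-regress step, under assumed dimensions (f1, c2 and n are illustrative, and the small MLP merely stands in for the third neural network model):

```python
import torch
import torch.nn as nn

n, f1, c2 = 1000, 512, 64
prior_shape = torch.rand(n, 3)         # prior 3D face model points
global_2d = torch.rand(1, f1)          # 2D face global feature
point_feats = torch.rand(n, c2)        # feature vectors of the n 3D points

# first fusion feature: tile the 2D global feature onto every 3D point feature
fused = torch.cat([global_2d.expand(n, f1), point_feats], dim=1)   # n x (f1 + c2)

offset_mlp = nn.Sequential(nn.Linear(f1 + c2, 256), nn.ReLU(),
                           nn.Linear(256, 3))      # predicts a 3D offset per point
offsets = offset_mlp(fused)                        # 3D face deformation parameters
target_shape = prior_shape + offsets               # superimpose onto the prior model
print(target_shape.shape)                          # torch.Size([1000, 3])
```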
In an alternative embodiment, in addition to the 3D face deformation parameters, the neural network model may also generate 2D-3D mapping parameters. The 2D-3D mapping parameters reflect the position mapping relation between the 2D face image and the target 3D face model; specifically, they may be mapping parameters between pixel points of the 2D face image and 3D points of the target 3D face model, and may be represented by a mapping matrix, a functional relation, and so on. After the 2D-3D mapping parameters are obtained, face pose estimation can be performed based on them to obtain the pose data of the 2D face.
Specifically, as shown in fig. 3c, after the 2D face global feature map is obtained, feature vectors of a plurality of pixel points may be generated based on it; these are the per-pixel features of the face region of the 2D face image obtained by further feature extraction from the 2D face global feature map. Further, as shown in fig. 3c, when the second neural network model extracts features from the prior 3D face model, a 3D face global feature map is obtained, and the feature vectors of the plurality of 3D points are further generated based on the 3D face global feature map. From the 3D face global feature map and the feature vectors of the plurality of pixel points, the 2D-3D mapping parameters reflecting the position mapping relation between the 2D face image and the target 3D face model can be generated. On this basis, performing face pose estimation during the three-dimensional reconstruction in combination with the position mapping relation from the 2D face image to the target 3D face model, to obtain the pose data of the 2D face, includes: splicing the 3D face global feature with the feature vectors of the plurality of pixel points to obtain a second fusion feature; performing feature learning on the second fusion feature with a fourth neural network model to obtain the 2D-3D mapping parameters; and estimating the pose of the face in the 2D face image from the 2D-3D mapping parameters and the target 3D face model to obtain the pose data of the 2D face, as shown in fig. 3c.
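A hedged PyTorch sketch of this mapping branch, with illustrative dimensions (the small MLP stands in for the fourth neural network model, and the softmax interpretation of the mapping matrix is one plausible reading, not something the patent specifies):

```python
import torch
import torch.nn as nn

m, n, f2, c1 = 468, 1000, 512, 64
pixel_feats = torch.rand(m, c1)        # feature vectors of the m pixel points
global_3d = torch.rand(1, f2)          # 3D face global feature

# second fusion feature: tile the 3D global feature onto every pixel feature
fused = torch.cat([pixel_feats, global_3d.expand(m, f2)], dim=1)   # m x (c1 + f2)

mapping_mlp = nn.Sequential(nn.Linear(c1 + f2, 512), nn.ReLU(),
                            nn.Linear(512, n))
mapping = torch.softmax(mapping_mlp(fused), dim=1)   # m x n 2D-3D mapping matrix
matched_3d_index = mapping.argmax(dim=1)             # best 3D point for each pixel
print(mapping.shape, matched_3d_index.shape)
```

Pairing each pixel with its matched 3D point yields the 2D-3D correspondences from which the face pose is estimated, e.g. with the PnP solver sketched later in this description.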
The embodiment of the application does not limit the model structure of the first neural network model; any model structure whose feature extraction of the 2D face image can yield the 2D face global feature map and the feature vectors of the pixel points in the face region is applicable. In an alternative embodiment, as shown in fig. 3d, the first neural network model includes a face segmentation network layer, a first feature extraction network layer, and a second feature extraction network layer. First, the 2D face image is fed into the face segmentation network layer for face segmentation, yielding an initial feature map of the 2D face image and the face region in the 2D face image, where the face region contains a plurality of pixel points and the initial feature map contains the features of these pixel points. Then, the features of the pixel points are input to the first feature extraction network layer for 2D face feature extraction, obtaining the 2D face global feature map. Finally, the 2D face global feature map is input to the second feature extraction network layer for per-pixel feature extraction, obtaining the feature vectors of the plurality of pixel points. The face segmentation network layer can be trained in advance on face images whose face regions are already labelled, and includes, but is not limited to, at least one of a convolution layer, a pooling layer, an upsampling layer and a deconvolution layer. The loss function used by the face segmentation network layer may include, but is not limited to, the cross-entropy loss or the focal loss.
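A minimal PyTorch sketch of such a face segmentation network layer (convolution, upsampling and a two-class head trained with cross-entropy); the channel sizes and image size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FaceSegLayer(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )
        self.mask_head = nn.Conv2d(channels, 2, 1)   # face / non-face logits

    def forward(self, rgb):
        feats = self.backbone(rgb)                   # initial H x W x C feature map
        return feats, self.mask_head(feats)

net = FaceSegLayer()
rgb = torch.rand(1, 3, 128, 128)                     # toy 2D face image
feats, logits = net(rgb)
labels = torch.zeros(1, 128, 128, dtype=torch.long)  # toy labelled face region
loss = nn.CrossEntropyLoss()(logits, labels)
print(feats.shape, logits.shape, float(loss))
```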
The first feature extraction network layer is a pre-trained neural network model that extracts the 2D face global feature from the features of the pixel points in the face region; the number of input pixel points and the feature dimension of each pixel point supported by the model are determined by the model and the application. The second feature extraction network layer is a pre-trained neural network model that extracts the feature vector of each pixel point from the 2D face global feature; the dimension of the feature vector is determined by the model and the application requirements and is not limited here. The first or second feature extraction network layer may adopt, but is not limited to, a CNN, RNN, DNN, residual network, or multi-layer perceptron (MLP), and the model architecture is not limited. In an alternative embodiment, both the first and the second feature extraction network layers are implemented with MLPs; note, however, that the two layers serve different functions, so the structures of the two MLP network layers differ.
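One plausible reading of these two MLP layers, sketched in PyTorch with assumed dimensions: a point-wise MLP lifts the m selected pixel features, global pooling produces the 1×f1 2D face global feature, and a second MLP combines it with each pixel's lifted feature to give the per-pixel vectors.

```python
import torch
import torch.nn as nn

m, C, f1, c1 = 468, 32, 512, 64
pixel_feats = torch.rand(m, C)                        # m x C selected pixel features

first_mlp = nn.Sequential(nn.Linear(C, 128), nn.ReLU(), nn.Linear(128, f1))
point_wise = first_mlp(pixel_feats)                   # m x f1 point-wise features
global_feat = point_wise.max(dim=0, keepdim=True)[0]  # 1 x f1 via global pooling

# second layer: combine the global feature with each pixel's point-wise feature
second_mlp = nn.Sequential(nn.Linear(2 * f1, 128), nn.ReLU(), nn.Linear(128, c1))
per_pixel = second_mlp(torch.cat([point_wise, global_feat.expand(m, f1)], dim=1))
print(global_feat.shape, per_pixel.shape)             # (1, 512) (468, 64)
```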
In this embodiment, the 2D face image may be a Red-Green-Blue (RGB) image, an RGB image with depth information (an RGB-D image), or an RGB image together with a depth image. If the 2D face image is an RGB image, the features of the initial feature map contain the pixel information of the image and no depth information; a plurality of pixel points can be selected from the face region, and their features (features without depth information, taken from the initial feature map obtained from the RGB image) are input to the first feature extraction network layer for 2D face global feature extraction. If the 2D face image is an RGB-D image, the features of the initial feature map contain both the pixel information and the depth information of the image; a plurality of pixel points can be selected from the face region, and their features (features with depth information, taken from the initial feature map obtained from the RGB-D image) are input to the first feature extraction network layer for 2D face global feature extraction. Optionally, if the 2D face image consists of an RGB image together with a depth image, the RGB image may be input to the face segmentation network layer for face segmentation to obtain the initial feature map of the 2D face image and the face region, with the features of the initial feature map containing the pixel information of the image. A plurality of pixel points may then be selected from the face region, and before their features (features without depth information, obtained from the RGB image) are input to the first feature extraction network layer for 2D face global feature extraction, the depth image corresponding to the RGB image may also be acquired; the depth features of the pixel points are extracted from the depth image, the depth feature of each pixel point is fused into the feature of that pixel point (the feature from the initial feature map obtained from the RGB image), and the features of the pixel points (now fusing depth and pixel information) are input to the first feature extraction network layer for 2D face global feature extraction.
The way in which the depth features of the pixel points are extracted from the depth image is not limited. In an alternative embodiment, the depth image may be fed into a face segmentation network layer for face segmentation to obtain an initial feature map of the depth image, which contains the depth features of the pixel points. In another alternative embodiment, the depth image carries both depth information and pixel-point information: if the coordinate points of the face in the depth image have three-dimensional coordinates along the x, y and z axes, the x and y coordinates reflect the pixel-point information and the z coordinate reflects the depth information. The depth information and pixel-point information of the pixel points in the depth image can then be converted into 3D point cloud data, and the depth features of the pixel points are learned from the 3D point cloud data.
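A small numpy sketch of the second conversion (depth image to 3D point cloud); the camera intrinsics are hypothetical, since the patent does not give them:

```python
import numpy as np

fx, fy, cx, cy = 800.0, 800.0, 320.0, 240.0   # assumed camera intrinsics

def depth_to_point_cloud(u, v, z):
    # back-project pixel coordinates (u, v) with depth z into 3D camera coordinates
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)        # m x 3 point cloud

u = np.array([300.0, 320.0, 350.0])           # pixel columns in the face region
v = np.array([200.0, 240.0, 260.0])           # pixel rows in the face region
z = np.array([0.55, 0.52, 0.56])              # toy depth values
print(depth_to_point_cloud(u, v, z))
```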
Similarly, the embodiment of the application does not limit the model structure of the second neural network model; any model structure whose feature extraction of the prior 3D face model can yield the 3D face global feature map and the feature vectors of the plurality of 3D points is applicable. In an alternative embodiment, as shown in fig. 3e, the second neural network model includes a third feature extraction network layer and a fourth feature extraction network layer. With reference to fig. 3e, performing feature extraction on the prior 3D face model with the second neural network model to obtain the 3D face global feature includes: selecting a plurality of 3D points from the prior 3D face model and inputting their position coordinates into the third feature extraction network layer for feature extraction to obtain the 3D face global feature. This 3D face global feature reflects the position coordinates of the 3D points and contains no 3D face pose data, i.e. it is a pose-free feature. Correspondingly, generating the feature vectors of the plurality of 3D points based on the 3D face global feature map includes: inputting the 3D face global feature into the fourth feature extraction network layer for feature extraction to obtain the feature vectors of the plurality of 3D points. The third feature extraction network layer is a pre-trained neural network model that extracts the 3D face global feature from the position coordinates of the 3D points, and the fourth feature extraction network layer is a pre-trained neural network model that extracts the feature vector of each 3D point from the 3D face global feature; neither is limited here. The third or fourth feature extraction network layer may include, but is not limited to, a CNN, RNN, DNN, residual network, or MLP. In an alternative embodiment, both are implemented with MLPs; note, however, that the two layers serve different functions, so the structures of the two MLP network layers differ.
In some application scenarios, such as AR try-on, a model object, for example sunglasses or a hat, needs to be added on the target 3D model, so that perspective projection onto the 2D face image is performed for the target 3D face model with the added model object. In some of these scenarios, for example when the merchant of a wearable product provides a reference 3D face model, the perspective projection onto the 2D face image for the target 3D face model with the added model object can be assisted by the reference 3D face model carrying the model object (the object to be projected). On this basis, before the pose data of the 2D face is obtained, the method further includes: acquiring a reference 3D face model on which the object to be projected is arranged; and aligning the target 3D face model with the reference 3D face model. The alignment automatically locates key feature points of the face, such as the eyes, nose tip, mouth corners, eyebrows or face contour points, on the target 3D face model or the reference 3D face model, and rotates, scales, stretches or translates the target 3D face model with the reference 3D face model as the reference, so that the target 3D face model comes as close as possible to the shape of the reference 3D face model. Once the target 3D face model and the reference 3D face model are aligned, the position of the object to be projected on the reference 3D face model can stand in for its position on the target 3D face model. Correspondingly, estimating the pose of the face in the 2D face image from the 2D-3D mapping parameters and the target 3D face model to obtain the pose data of the 2D face includes: estimating the pose of the face in the 2D face image from the aligned target 3D face model and the 2D-3D mapping parameters to obtain the pose data of the 2D face. Because the target 3D face model and the reference 3D face model are aligned, the pose data of the 2D face also applies to the reference 3D face model and the object to be projected on it. On this basis, performing perspective projection onto the 2D face image for the target 3D face model according to the pose data of the 2D face includes: projecting the object to be projected into the 2D face image according to the pose data of the 2D face, to obtain the 2D face image with the object to be projected.
The above embodiments of the application do not limit the structure or the input/output dimensions of the neural network model. In the following embodiments, the technical solution is described in detail using two examples: a neural network model structure based on a single RGB/RGB-D image and a 3D face shape prior, and a neural network model structure based on the fusion of an RGB image and a depth image together with a 3D face shape prior. The single RGB/RGB-D image corresponds to the 2D face image in the method above, and the 3D face shape prior corresponds to the prior 3D face model.
As shown in fig. 3f, the neural network model structure based on a single RGB/RGB-D image and a 3D face shape prior is as follows:
First, an RGB image containing a face is acquired (as shown in fig. 3g), and face detection (with a traditional machine learning method or a deep neural network) is performed to obtain the face region framed by a face detection box. The original RGB image is cropped based on the face detection box to obtain a face RGB image (as shown in fig. 3h), denoted as I_RGB. If the face detection box does not frame the whole face region, for example it only frames the area below the eyebrows and above the chin, the detection box is enlarged appropriately and the original RGB image is cropped with the enlarged box to obtain a complete face RGB image.
In addition, prior information of the 3D face shape may be obtained, which can be understood here as an average set of 3D face mesh points (mesh). The 3D face shape prior can be learned with an encode-decode network structure (as shown in fig. 2), yielding a prior 3D face shape denoted as X ∈ R^3.
As further shown in fig. 3f, the face RGB image (a 2D image) cropped after face detection is input into a convolutional neural network (e.g. a CNN). The RGB image size is H×W×3, where H is the image height, W is the image width, and 3 is the number of RGB channels. The face RGB image is convolved by the convolutional neural network, and the convolution result is upsampled or deconvolved to obtain a feature map of size H×W×C, where C is the number of channels in the convolution process. In this process, face segmentation is performed on the face RGB image at the same time, i.e. a two-class segmentation that yields the face region and the non-face region of the image. According to the segmentation result, m pixel points are randomly selected in the face region; the feature of each pixel point is the C-dimensional value at that location in the H×W×C feature map, giving an m×C feature vector. The m×C feature vector is then input into a multi-layer perceptron (MLP) network layer and globally pooled to obtain a 1×f1 global feature, where f1 is a positive integer, for example 512 or 1024. The 1×f1 global feature is input to the next MLP network layer to obtain a 1×c1 feature vector for each of the m pixel points, so the feature vectors of the m pixel points are denoted m×c1.
Further, as shown in fig. 3f, suppose the 3D mesh points of the prior 3D face shape are n×3, where the positive integer n is the number of 3D mesh points (3D points for short) and each 3D point has three coordinate dimensions. The 3D mesh points of the prior 3D face shape are input to an MLP network layer for feature extraction and globally pooled to obtain a 1×f2 global feature, where f2 is a positive integer; the 1×f2 global feature is then input to the next MLP network layer for learning, giving a 1×c2 feature vector for each of the n 3D points, so the feature vectors of the n 3D points are denoted n×c2.
As further shown in fig. 3f, on the one hand the 1×f1 global feature is spliced with the n×c2 feature vectors, and the spliced features are input into a neural network model for prediction to obtain the m×n 2D-3D mapping matrix. On the other hand, the m×c1 feature vectors are spliced with the 1×f2 global feature, and from the spliced features the offset 3Δ of each 3D point relative to the prior 3D face shape (a geometric shape) is predicted; the offset is expressed in the three coordinate dimensions, so the offsets of the n 3D points relative to the prior 3D face shape are denoted n×3Δ. Adding the n×3Δ offsets to the three-dimensional coordinate values of the input prior 3D face shape yields the reconstructed 3D face model corresponding to the currently input 2D face image I_RGB.
Further, by combining the 2D-3D mapping matrix m×n with the coordinates of the m pixel points in the face RGB image, the 3D coordinate information of the m corresponding points on the reconstructed 3D face model can be obtained, and the six-degree-of-freedom pose information of the current 2D face, i.e. the pose data of the 2D face, is then computed with the Perspective-n-Point (PnP) method or one of its variants. Based on the six-degree-of-freedom pose of the 2D face, perspective projection onto the face RGB image is performed for the reconstructed 3D face model; fig. 3i shows the effect of projecting the reconstructed 3D face model into the face RGB image under the estimated six-degree-of-freedom pose, with the outline of the projected 3D face drawn as a dashed line for ease of viewing. It should be noted that when the network model shown in fig. 3f is trained, the face segmentation model need not be trained; the ground-truth face labels can be used directly during training in place of the face segmentation step.
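The PnP step can be illustrated with OpenCV's solver; the intrinsics and correspondences below are synthetic, so this is a sketch of the idea rather than the patent's exact pipeline:

```python
import cv2
import numpy as np

object_points = (np.random.rand(20, 3) * 0.1).astype(np.float32)   # matched 3D model points
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float32)
rvec_true = np.array([0.1, 0.2, 0.0], dtype=np.float32)            # toy ground-truth pose
tvec_true = np.array([0.0, 0.0, 0.6], dtype=np.float32)
image_points, _ = cv2.projectPoints(object_points, rvec_true, tvec_true, K, None)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None,
                              flags=cv2.SOLVEPNP_ITERATIVE)
R, _ = cv2.Rodrigues(rvec)           # rotation part of the six-degree-of-freedom pose
print(ok, tvec.ravel())              # translation part of the pose
```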
The neural network shown in fig. 3f is a multi-task network structure that realizes three tasks. The first is face segmentation (face region versus non-face region); the loss functions usable by the segmentation network layer include, but are not limited to, the cross-entropy loss and the focal loss. The second is generating the 2D-3D mapping matrix; this can be modelled by building a classification model for each 2D-to-3D point assignment, with a softmax loss or similar. The third is solving the coordinate offset of each face 3D point relative to the prior 3D face shape, for which a Smooth L1 loss or a Euclidean distance loss can be used.
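A hedged PyTorch sketch of how the three training signals could be combined (equal weights and all shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

m, n = 468, 1000
seg_logits = torch.randn(1, 2, 128, 128)              # face / non-face logits
seg_labels = torch.zeros(1, 128, 128, dtype=torch.long)
map_logits = torch.randn(m, n)                        # scores behind the 2D-3D mapping matrix
map_labels = torch.randint(0, n, (m,))                # ground-truth 3D point index per pixel
pred_offsets = torch.randn(n, 3)                      # predicted coordinate offsets
gt_offsets = torch.randn(n, 3)

loss = (nn.CrossEntropyLoss()(seg_logits, seg_labels)      # task 1: segmentation
        + nn.CrossEntropyLoss()(map_logits, map_labels)    # task 2: softmax-style mapping loss
        + nn.SmoothL1Loss()(pred_offsets, gt_offsets))     # task 3: offset regression
print(float(loss))
```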
As shown in fig. 3j, the neural network model structure based on the fusion of an RGB image and a depth image and a 3D face shape prior is as follows:
The face RGB image cropped after face detection (as shown in fig. 3h) is obtained and input into the convolutional neural network for convolution, and the convolution result is upsampled or deconvolved to obtain a feature map of size H×W×C while face segmentation is performed; details are as in the foregoing embodiments and are not repeated here. Further, as shown in fig. 3j, the original depth image corresponding to the original RGB image is obtained and optionally cropped, by the same method as for the face RGB image, to obtain a face depth image denoted as I_D, with size H×W×1. Then, according to the face segmentation result, m pixel points are randomly selected in the face region of the face depth image I_D and converted into a 3D point cloud. Features are extracted from the 3D point cloud through an MLP network layer (not shown in fig. 3j) and fused with the features of the corresponding points in the H×W×C feature map obtained by convolving the face RGB image, giving m×C' features. Feature fusion can be done in various ways, for example fusion over feature channels or point-by-point feature fusion.
As shown in fig. 3j, after the m×C' features are obtained, they are input to an MLP network layer and globally pooled to obtain a 1×f1 global feature, and the 1×f1 global feature is input to the next MLP network layer to obtain a 1×c1 feature vector for each pixel point, so the feature vectors of the m pixel points are denoted m×c1. In fig. 3j, the processing of the prior 3D face shape, the subsequent three-dimensional reconstruction based on it, and the acquisition of the 2D face pose data are as described in the foregoing embodiments and are not repeated here.
It should be noted that, besides computing the six-degree-of-freedom pose of the face with PnP and its variants, it can also be computed by least-squares fitting. Specifically, since the depth image can be converted into a 3D point cloud, the 2D-3D mapping matrix m×n can equally describe the mapping from the 3D point cloud derived from the depth image to the 3D points of the reconstructed face model, and the six-degree-of-freedom pose of the face can then be computed by least-squares fitting of 3D points.
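A minimal numpy sketch of this 3D-3D least-squares fit using the closed-form Kabsch/SVD solution; the toy correspondences are synthetic:

```python
import numpy as np

def rigid_fit(src, dst):
    # least-squares R, t such that dst ~= R @ src + t (src, dst are m x 3)
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))               # keep a proper rotation
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

model_pts = np.random.rand(50, 3)            # matched points of the reconstructed face model
t_true = np.array([0.0, 0.0, 0.5])
cloud_pts = model_pts + t_true               # toy depth-image point cloud (identity rotation)
R_est, t_est = rigid_fit(model_pts, cloud_pts)
print(np.allclose(R_est, np.eye(3), atol=1e-6), np.round(t_est, 3))
```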
It should also be noted that the face detection, image segmentation and feature extraction operations on the input RGB image or depth image can vary. For example, features of the depth image can also be extracted with a convolutional network; and the features of the RGB image and the depth image can be fused in many ways, such as fusion over feature channels or point-by-point feature fusion. In addition, the parameters and settings of the network layers in the network structure adopted by the application can be changed.
Scenario example 1:
The following description takes a live broadcast scenario as an example. The embodiment of the present application further provides a live broadcast system; as shown in fig. 4a, the live broadcast system 400a includes: an anchor end 401a, a playing end 402a, a server end 403a, and an image acquisition device 404a.
The anchor end 401a and the playing end 402a may be implemented as terminal devices such as a desktop computer, a notebook computer, or a smartphone, and their implementation forms may be the same or different. There are one or more playing ends 402a; fig. 4a illustrates a plurality of playing ends 402a as an example.
The image capture device 404a may be a monocular camera or an RGBD (Red-Green-Blue-Depth) camera. The image capture device 404a is used to capture an initial live video containing 2D face images, where the face may belong to the anchor or to other staff in the picture. The image capture device 404a provides the captured initial live video to the server end 403a via the anchor end 401a. If the image capture device 404a is a monocular camera, the 2D face images contained in the initial live video are RGB images; if it is an RGBD camera, the 2D face images are RGB-D images.
The server end 403a may be a server device such as a conventional server, a cloud server, or a server array; in fig. 4a, the cloud server is taken as an example. The server end 403a may perform three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image, and perform face pose estimation in the three-dimensional reconstruction process by combining the position mapping relationship from the 2D face image to the target 3D face model, so as to obtain pose data of the 2D face. When a live broadcast dynamic effect or a product is added by the anchor or other staff through the anchor end 401a, the anchor end 401a provides the live broadcast dynamic effect or product to the server end 403a. The server end 403a adds the live broadcast dynamic effect or product to the target 3D face model and performs perspective projection processing to the 2D face image on the target 3D face model to which the live broadcast dynamic effect or product has been added, according to the pose data of the 2D face. The live broadcast dynamic effect includes a headwear, a hanging ornament, a makeup special effect, or the like. The product may be, but is not limited to: a wig, sunglasses, a hat, etc. For example, the live broadcast dynamic effect or product may be added to a reference 3D face model, and the target 3D face model may be aligned with the reference 3D face model, on the basis of which the live broadcast dynamic effect or product on the reference 3D face model may be directly projected into the 2D face image. As another example, the live broadcast dynamic effect may be added on the target 3D face model, and the live broadcast dynamic effect and the target 3D face model may be projected together into the 2D face image. As a third example, the live broadcast dynamic effect or product may be added on the target 3D face model, the target 3D face model may be made transparent, and the live broadcast dynamic effect or product together with the transparent target 3D face model may be projected into the 2D face image. The server end 403a may provide the perspective-projected 2D face image to the playing end 402a, and the playing end 402a presents it to the viewing user. The details may be found in the foregoing embodiments and are not described herein.
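The projection described above reduces to standard pinhole perspective projection: given the pose data of the 2D face (a rotation R and a translation t) and a camera intrinsic matrix K, the vertices of the target 3D face model and of any attached live broadcast dynamic effect or product can be mapped to pixel coordinates, as in the sketch below. R, t, K and the vertex arrays are assumed inputs; a real renderer would additionally rasterize the projected triangles and handle occlusion.

    import numpy as np

    def project_vertices(vertices, rotation, translation, intrinsics):
        """Perspective projection of N x 3 model-space vertices into pixel coordinates."""
        cam = vertices @ rotation.T + translation    # model space -> camera space
        uvw = cam @ intrinsics.T                     # apply K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
        return uvw[:, :2] / uvw[:, 2:3]              # divide by depth -> N x 2 pixel positions

    # Example use (all names hypothetical): project the prop attached to the face model into the frame.
    # uv = project_vertices(prop_vertices, face_pose_R, face_pose_t, K)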
In this embodiment, the image processing in the system shown in fig. 4a may be performed by the server end or by the anchor end. As shown in fig. 4b, an embodiment of the present application further provides a live broadcast method, which includes the following steps:
401b, acquiring an initial live video of an anchor end by using a camera, wherein the initial live video includes 2D face images;
402b, performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image, and performing face pose estimation in the three-dimensional reconstruction process by combining the position mapping relationship from the 2D face image to the target 3D face model, so as to obtain pose data of the 2D face; and adding a live broadcast dynamic effect or a product on the target 3D face model;
403b, performing perspective projection processing to the 2D face image, according to the pose data of the 2D face, on the target 3D face model to which the live broadcast dynamic effect or product has been added, so as to obtain a target live broadcast video, and sending the target live broadcast video to a playing terminal, wherein the target live broadcast video contains the 2D face image with the live broadcast dynamic effect or product.
The details of the live broadcast method can be found in the foregoing embodiments, and will not be described herein.
According to the live broadcast method provided by the embodiment of the application, three-dimensional reconstruction is performed on the 2D image based on the prior 3D face model, and in the three-dimensional reconstruction process, face pose estimation is performed by combining the position mapping relationship from the 2D face image to the reconstructed 3D face model to obtain the 2D face pose data. When the three-dimensionally reconstructed 3D face model is projected to the 2D face image, the perspective projection processing to the 2D face image is performed on the target 3D face model using the 2D face pose data. Since the pose data of the 2D face is fully considered, the projection of the 3D face model onto the 2D image looks more realistic, which provides more possibilities for applications with high-precision requirements.
Application scenario example 2:
The following description takes an AR display scene as an example. An embodiment of the present application further provides an AR display system. As shown in fig. 4c, the AR display system 400c includes: an AR display device 401c and a server device 402c.
Wherein AR display device 401c may be, but is not limited to: a mobile phone, a tablet, a PC, a vehicle Head-up Display (HUD) or an AR intelligent interactive device, etc. The AR display device 401c has an image capturing function, for example, a monocular camera or an RGBD camera or the like is mounted on the AR display device 401c to realize the image capturing function. Server 402c may be a conventional server, cloud server, or server array, among other server devices.
In this embodiment, the AR display device 401c may collect a 2D face image of a user and display the 2D face image on a display screen thereof; the AR display device 401c may provide the 2D face image to the server device 402c. The server device 402c performs three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image, and performs face pose estimation by combining the position mapping relationship from the 2D face image to the target 3D face model in the three-dimensional reconstruction process to obtain pose data of the 2D face. When the user needs to add the target object on the 2D face image, the target object to be added may be sent to the server device 402c through the AR display device 401c. Wherein the target object may be a special effect or a product. The server device 402c receives the target object to be added, and adds the target object on the target 3D face model; according to the pose data of the 2D face, perspective projection processing to the 2D face image is performed on the target 3D face model added with the target object, a 2D face image with the target object is obtained, and the 2D face image with the target object is sent to the AR display device 401c. The AR display device 401c displays a 2D face image with the target object on the display screen.
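One simple way to realize "the 2D face image with the target object" on the display screen is to render the projected target object into an RGBA overlay and alpha-composite it over the captured frame, as sketched below; the rendering of the overlay itself is assumed to happen elsewhere, and the function name is illustrative.

    import numpy as np

    def composite_overlay(frame, overlay_rgba):
        """Alpha-blend a rendered RGBA overlay (H x W x 4, values in [0, 1])
        over an RGB frame (H x W x 3) to produce the displayed AR image."""
        alpha = overlay_rgba[..., 3:4]
        return (1.0 - alpha) * frame + alpha * overlay_rgba[..., :3]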
In this embodiment, the image processing procedure in the system shown in fig. 4c may be performed by the server device or by the AR display device 401c. As shown in fig. 4d, an embodiment of the present application further provides an AR display method, including the following steps:
401d, an AR display device collects a 2D face image of a user by using a camera, and displays the 2D face image on a display screen of the AR display device;
402d, performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image, and performing face pose estimation in the three-dimensional reconstruction process by combining the position mapping relationship from the 2D face image to the target 3D face model, so as to obtain pose data of the 2D face; and adding a target object on the target 3D face model;
403d, performing perspective projection processing to the 2D face image, according to the pose data of the 2D face, on the target 3D face model to which the target object has been added, and displaying the 2D face image with the target object on the display screen.
According to the AR display method provided by the embodiment of the application, three-dimensional reconstruction is performed on the 2D image based on the prior 3D face model, and in the three-dimensional reconstruction process, face pose estimation is performed by combining the position mapping relationship from the 2D face image to the reconstructed 3D face model to obtain the 2D face pose data. When the three-dimensionally reconstructed 3D face model is projected to the 2D face image, the perspective projection processing to the 2D face image is performed on the target 3D face model using the 2D face pose data. Since the pose data of the 2D face is fully considered, the projection of the 3D face model onto the 2D image looks more realistic, which provides more possibilities for applications with high-precision requirements.
It should be noted that, the execution subjects of each step of the method provided in the above embodiment may be the same device, or the method may also be executed by different devices. For example, the execution subject of steps 101 to 103 may be device a; for another example, the execution subject of steps 101 and 102 may be device a, and the execution subject of step 103 may be device B; etc.
In addition, some of the flows described in the above embodiments and the drawings include a plurality of operations that appear in a specific order, but it should be clearly understood that these operations may be performed out of the order in which they appear herein or performed in parallel; the sequence numbers of the operations, such as 101 and 102, are merely used to distinguish the various operations and do not by themselves represent any order of execution. In addition, the flows may include more or fewer operations, and these operations may be performed sequentially or in parallel. It should also be noted that the terms "first" and "second" herein are used to distinguish different messages, devices, modules, etc.; they do not represent a sequence, nor do they require that the "first" and the "second" be of different types.
Fig. 5a is a schematic structural diagram of an image processing apparatus according to an exemplary embodiment of the present application. As shown in fig. 5a, the image processing apparatus 50a includes: a three-dimensional reconstruction module 51a, a pose estimation module 52a and a perspective projection module 53a.
The three-dimensional reconstruction module 51a is configured to perform three-dimensional reconstruction on the 2D face image based on the prior 3D face model, so as to obtain a target 3D face model corresponding to the 2D face image;
the pose estimation module 52a is configured to perform pose estimation of a face in combination with a position mapping relationship from a 2D face image to a target 3D face model in a three-dimensional reconstruction process, so as to obtain pose data of the 2D face;
the perspective projection module 53a is configured to perform perspective projection processing on the 2D face image for the target 3D face model according to the pose data of the 2D face.
In an alternative embodiment, the three-dimensional reconstruction module 51a is configured to: performing feature extraction on the 2D face image by using a first neural network model to obtain a 2D face global feature map; extracting features of the prior 3D face model by using a second neural network model to obtain feature vectors of a plurality of 3D points on the prior 3D face model; generating a 3D face deformation parameter according to the 2D face global feature map and feature vectors of a plurality of 3D points; and reconstructing a target 3D face model corresponding to the 2D face image according to the 3D face deformation parameters and the priori 3D face model.
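The reconstruction step of this module is commonly realized as a linear morphable-model update, in which the prior 3D face model supplies a mean shape and a deformation basis and the predicted 3D face deformation parameters weight that basis; the sketch below reflects this common reading rather than the exact formulation of the embodiment.

    import numpy as np

    def reconstruct_target_shape(mean_shape, deform_basis, deform_params):
        """mean_shape: N x 3 prior 3D face vertices; deform_basis: K x (N*3);
        deform_params: K coefficients predicted by the third neural network model."""
        offsets = deform_params @ deform_basis             # flattened per-vertex offsets
        return mean_shape + offsets.reshape(mean_shape.shape)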
In an alternative embodiment, the three-dimensional reconstruction module 51a is configured to: splicing the 2D face global feature map and feature vectors of a plurality of 3D points to obtain a first fusion feature; and performing feature learning on the first fusion features by using a third neural network model to obtain the 3D facial deformation parameters.
In an alternative embodiment, the apparatus further comprises: a generating module; the generating module is used for: generating feature vectors of a plurality of pixel points based on the 2D face global feature map, wherein the plurality of pixel points are from a face region of a 2D face image; accordingly, the three-dimensional reconstruction module 51a is configured to: extracting features of the prior 3D face model by using a second neural network model to obtain a 3D face global feature map, and generating feature vectors of a plurality of 3D points based on the 3D face global feature map; correspondingly, the pose estimation module 52a is configured to splice the 3D face global feature and feature vectors of the plurality of pixel points to obtain a second fusion feature; performing feature learning on the second fusion feature by using a fourth neural network model to obtain a 2D-3D mapping parameter; and carrying out pose estimation on the face in the 2D face image according to the 2D-3D mapping parameters and the target 3D face model so as to obtain pose data of the 2D face.
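When the 2D-3D mapping parameters amount to correspondences between pixel points of the 2D face image and 3D points of the target 3D face model, the pose estimation can be carried out with a standard PnP solver, as in the OpenCV-based sketch below; the array shapes, the chosen PnP flag and the variable names are assumptions, not the specific variant used in the embodiment.

    import numpy as np
    import cv2

    def estimate_face_pose(points_3d, points_2d, camera_matrix):
        """points_3d: N x 3 vertices on the target 3D face model;
        points_2d: N x 2 matched pixel locations in the 2D face image."""
        ok, rvec, tvec = cv2.solvePnP(
            points_3d.astype(np.float64),
            points_2d.astype(np.float64),
            camera_matrix.astype(np.float64),
            distCoeffs=None,                     # assume an undistorted image
            flags=cv2.SOLVEPNP_EPNP)
        rotation, _ = cv2.Rodrigues(rvec)        # 3 x 3 rotation matrix
        return rotation, tvec.reshape(3)         # six-degree-of-freedom pose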
In an alternative embodiment, the three-dimensional reconstruction module 51a is configured to: sending the 2D face image into a face segmentation network layer in a first neural network model to carry out face segmentation to obtain an initial feature map of the 2D face image and a face region in the 2D face image, wherein the face region comprises a plurality of pixel points, and the initial feature map comprises features of the plurality of pixel points; inputting the features of the plurality of pixel points into a first feature extraction network layer in a first neural network model to perform feature extraction on the 2D face to obtain a 2D face global feature map; accordingly, the generating module is used for: and inputting the 2D face global feature map into a second feature extraction network layer in the first neural network model, and carrying out feature extraction on each pixel point to obtain feature vectors of a plurality of pixel points.
In an alternative embodiment, the 2D face image is an RGB image, and before inputting the features of the plurality of pixels into the first feature extraction network layer in the first neural network model to perform feature extraction on the 2D face, the apparatus further includes: the first acquisition module and the fusion module; the first acquisition module is used for: acquiring a depth image corresponding to the RGB image; the fusion module is used for: extracting depth features of a plurality of pixel points from the depth image, and fusing the depth features of each pixel point into the features of the pixel point.
In an alternative embodiment, the three-dimensional reconstruction module 51a is configured to: selecting a plurality of 3D points from the prior 3D face model, inputting the position coordinates of the 3D points into a third feature extraction network layer in the second neural network model for feature extraction, and obtaining the 3D face global feature; and inputting the global features of the 3D face into a fourth feature extraction network layer in the second neural network model to perform feature extraction, so as to obtain feature vectors of a plurality of 3D points.
In an alternative embodiment, before obtaining the pose data of the 2D face, the apparatus further includes: the second acquisition module and the alignment module; the second acquisition module is used for: acquiring a reference 3D face model, wherein an object to be projected is arranged on the reference 3D face model; the alignment module is used for: aligning the target 3D face model with the reference 3D face model; correspondingly, the pose estimation module 52a is configured to perform pose estimation on a face in the 2D face image according to the aligned target 3D face model and the 2D-3D mapping parameters, so as to obtain pose data of the 2D face; accordingly, the perspective projection module 53a is configured to project the object to be projected into the 2D face image according to the pose data of the 2D face.
According to the image processing apparatus provided by the embodiment of the application, three-dimensional reconstruction is performed on the 2D image based on the prior 3D face model, and in the three-dimensional reconstruction process, face pose estimation is performed by combining the position mapping relationship from the 2D face image to the reconstructed 3D face model to obtain the 2D face pose data. When the three-dimensionally reconstructed 3D face model is projected to the 2D face image, the perspective projection processing to the 2D face image is performed on the target 3D face model using the 2D face pose data. Since the pose data of the 2D face is fully considered, the projection of the 3D face model onto the 2D image looks more realistic, which provides more possibilities for applications with high-precision requirements.
Fig. 5b is a schematic structural diagram of an image processing device according to an exemplary embodiment of the present application. As shown in fig. 5b, the image processing device includes: a memory 54b and a processor 55b.
The memory 54b is for storing a computer program, and may be configured to store other various data to support operations on the image processing apparatus. Examples of such data include instructions for any application or method operating on an image processing device.
The memory 54b may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The processor 55b is coupled to the memory 54b and is configured to execute the computer program in the memory 54b to: perform three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image; in the three-dimensional reconstruction process, perform face pose estimation by combining the position mapping relationship from the 2D face image to the target 3D face model, so as to obtain pose data of the 2D face; and perform perspective projection processing on the 2D face image for the target 3D face model according to the pose data of the 2D face.
In an alternative embodiment, the processor 55b is specifically configured to, when performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain the target 3D face model corresponding to the 2D face image: performing feature extraction on the 2D face image by using a first neural network model to obtain a 2D face global feature map; extracting features of the prior 3D face model by using a second neural network model to obtain feature vectors of a plurality of 3D points on the prior 3D face model; generating a 3D face deformation parameter according to the 2D face global feature map and feature vectors of a plurality of 3D points; and reconstructing a target 3D face model corresponding to the 2D face image according to the 3D face deformation parameters and the priori 3D face model.
In an alternative embodiment, the processor 55b is specifically configured to, when generating the 3D face deformation parameter according to the 2D face global feature map and the feature vectors of the plurality of 3D points: splicing the 2D face global feature map and feature vectors of a plurality of 3D points to obtain a first fusion feature; and performing feature learning on the first fusion features by using a third neural network model to obtain the 3D facial deformation parameters.
In an alternative embodiment, in the process of extracting features from the 2D face image by using the first neural network model, the processor 55b is further configured to: generating feature vectors of a plurality of pixel points based on the 2D face global feature map, wherein the plurality of pixel points are from a face region of a 2D face image; accordingly, when the processor 55b performs feature extraction on the prior 3D face model by using the second neural network model to obtain feature vectors of a plurality of 3D points on the prior 3D face model, the processor is specifically configured to: extracting features of the prior 3D face model by using a second neural network model to obtain a 3D face global feature map, and generating feature vectors of a plurality of 3D points based on the 3D face global feature map; accordingly, when the processor 55b performs face pose estimation by combining the position mapping relationship from the 2D face image to the target 3D face model in the three-dimensional reconstruction process, so as to obtain pose data of the 2D face, the processor is specifically configured to: splicing the 3D face global feature and feature vectors of a plurality of pixel points to obtain a second fusion feature; performing feature learning on the second fusion feature by using a fourth neural network model to obtain a 2D-3D mapping parameter; and carrying out pose estimation on the face in the 2D face image according to the 2D-3D mapping parameters and the target 3D face model so as to obtain pose data of the 2D face.
In an alternative embodiment, the processor 55b is specifically configured to, when performing feature extraction on the 2D face image by using the first neural network model to obtain a 2D face global feature map: sending the 2D face image into a face segmentation network layer in a first neural network model to carry out face segmentation to obtain an initial feature map of the 2D face image and a face region in the 2D face image, wherein the face region comprises a plurality of pixel points, and the initial feature map comprises features of the plurality of pixel points; inputting the features of the plurality of pixel points into a first feature extraction network layer in a first neural network model to perform feature extraction on the 2D face to obtain a 2D face global feature map; accordingly, the processor 55b is specifically configured to, when generating feature vectors of a plurality of pixels based on the 2D face global feature map: and inputting the 2D face global feature map into a second feature extraction network layer in the first neural network model, and carrying out feature extraction on each pixel point to obtain feature vectors of a plurality of pixel points.
In an alternative embodiment, the 2D face image is an RGB image, and the processor 55b is further configured to, before inputting the features of the plurality of pixels into the first feature extraction network layer in the first neural network model for feature extraction of the 2D face: acquiring a depth image corresponding to the RGB image; extracting depth features of a plurality of pixel points from the depth image, and fusing the depth features of each pixel point into the features of the pixel point.
In an alternative embodiment, the processor 55b is specifically configured to, when performing feature extraction on the prior 3D face model by using the second neural network model to obtain the global feature of the 3D face: selecting a plurality of 3D points from the prior 3D face model, inputting the position coordinates of the 3D points into a third feature extraction network layer in the second neural network model for feature extraction, and obtaining the 3D face global feature; accordingly, the processor 55b is specifically configured to, when generating feature vectors of a plurality of 3D points based on the 3D face global feature map: and inputting the global features of the 3D face into a fourth feature extraction network layer in the second neural network model to perform feature extraction, so as to obtain feature vectors of a plurality of 3D points.
In an alternative embodiment, the processor 55b is further configured to, prior to obtaining pose data of the 2D face: acquiring a reference 3D face model, wherein an object to be projected is arranged on the reference 3D face model; aligning the target 3D face model with the reference 3D face model; accordingly, when estimating the pose of the face in the 2D face image according to the 2D-3D mapping parameters and the target 3D face model, the processor 55b is specifically configured to: according to the aligned target 3D face model and the 2D-3D mapping parameters, carrying out pose estimation on the face in the 2D face image to obtain pose data of the 2D face; accordingly, when performing perspective projection processing to the 2D face image for the target 3D face model according to the pose data of the 2D face, the processor 55b is specifically configured to: projecting the object to be projected into the 2D face image according to the pose data of the 2D face.
According to the image processing device provided by the embodiment of the application, three-dimensional reconstruction is performed on the 2D image based on the prior 3D face model, and in the three-dimensional reconstruction process, face pose estimation is performed by combining the position mapping relationship from the 2D face image to the reconstructed 3D face model to obtain the 2D face pose data. When the three-dimensionally reconstructed 3D face model is projected to the 2D face image, the perspective projection processing to the 2D face image is performed on the target 3D face model using the 2D face pose data. Since the pose data of the 2D face is fully considered, the projection of the 3D face model onto the 2D image looks more realistic, which provides more possibilities for applications with high-precision requirements.
Further, as shown in fig. 5b, the image processing device further includes: a communication component 56b, a display 57b, a power component 58b, an audio component 59b, and other components. Only some of the components are schematically shown in fig. 5b, which does not mean that the image processing device only includes the components shown in fig. 5b. It should be noted that the components within the dashed box in fig. 5b are optional components rather than mandatory components, and whether they are included depends on the product form of the image processing device.
The image processing device of the embodiment may be implemented as a terminal device such as a desktop computer, a notebook computer, or a smart phone, or may be a server device such as a conventional server, a cloud server, or a server array. If the image processing device of the embodiment is implemented as a terminal device such as a desktop computer, a notebook computer, a smart phone, etc., the image processing device may include components within the dashed line frame in fig. 5 b; if the image processing apparatus of the present embodiment is implemented as a server-side apparatus such as a conventional server, a cloud server, or a server array, the components within the dashed-line box in fig. 5b may not be included.
Accordingly, embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement steps of the method shown in fig. 1 that are executable by an image processing apparatus.
Accordingly, embodiments of the present application also provide a computer program product comprising a computer program/instructions which, when executed by a processor, cause the processor to carry out the steps of the method shown in fig. 1.
The communication component of fig. 5b is configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network of 2G, 3G, 4G/LTE or 5G, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further comprises a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The display in fig. 5b described above includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.
The power supply assembly of fig. 5b provides power to the various components of the device in which the power supply assembly is located. The power components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the devices in which the power components are located.
The audio component of fig. 5b described above may be configured to output and/or input audio signals. For example, the audio component includes a Microphone (MIC) configured to receive external audio signals when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a speech recognition mode. The received audio signal may be further stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims (10)

1. An image processing method, comprising:
performing feature extraction on the 2D face image by using a first neural network model to obtain a 2D face global feature map; generating feature vectors of a plurality of pixel points based on the 2D face global feature map, wherein the pixel points are from a face area of the 2D face image;
extracting features of the prior 3D face model by using a second neural network model to obtain a 3D face global feature map; generating feature vectors of a plurality of 3D points based on the 3D face global feature map;
splicing the 2D face global feature map and the feature vectors of the plurality of 3D points to obtain a first fusion feature; reconstructing a target 3D face model corresponding to the 2D face image according to the 3D face deformation parameters obtained by feature learning of the first fusion features by a third neural network;
In the three-dimensional reconstruction process, the 3D face global feature and feature vectors of the plurality of pixel points are spliced to obtain a second fusion feature; performing face pose estimation according to the position mapping relation between the 2D face image obtained by performing feature learning on the second fusion feature and the target 3D face model by a fourth neural network so as to obtain pose data of the 2D face;
and performing perspective projection processing to the 2D face image for the target 3D face model according to the pose data of the 2D face.
2. The method according to claim 1, wherein in the three-dimensional reconstruction process, the 3D face global feature and feature vectors of the plurality of pixel points are spliced to obtain a second fusion feature; performing face pose estimation according to a position mapping relation between the 2D face image obtained by performing feature learning on the second fusion feature and the target 3D face model by a fourth neural network to obtain pose data of the 2D face, including:
splicing the 3D face global feature and the feature vectors of the plurality of pixel points to obtain a second fusion feature; performing feature learning on the second fusion feature by using a fourth neural network model to obtain a 2D-3D mapping parameter; and carrying out pose estimation on the face in the 2D face image according to the 2D-3D mapping parameters and the target 3D face model so as to obtain pose data of the 2D face.
3. The method of claim 2, wherein the feature extraction of the 2D face image using the first neural network model to obtain a 2D face global feature map comprises:
sending the 2D face image into a face segmentation network layer in a first neural network model to carry out face segmentation to obtain an initial feature map of the 2D face image and a face region in the 2D face image, wherein the face region comprises the plurality of pixel points, and the initial feature map comprises the features of the plurality of pixel points; inputting the characteristics of the plurality of pixel points into a first characteristic extraction network layer in a first neural network model, and carrying out characteristic extraction on a 2D face to obtain a 2D face global characteristic map;
correspondingly, generating feature vectors of a plurality of pixel points based on the 2D face global feature map comprises: and inputting the 2D face global feature map into a second feature extraction network layer in the first neural network model, and carrying out feature extraction on each pixel point to obtain feature vectors of a plurality of pixel points.
4. The method of claim 2, wherein extracting features of the prior 3D face model using the second neural network model to obtain the 3D face global features comprises: selecting a plurality of 3D points from the prior 3D face model, inputting the position coordinates of the 3D points into a third feature extraction network layer in a second neural network model for feature extraction, and obtaining 3D face global features;
Correspondingly, generating feature vectors of the plurality of 3D points based on the 3D face global feature map includes: and inputting the 3D face global features into a fourth feature extraction network layer in a second neural network model to perform feature extraction, so as to obtain feature vectors of the plurality of 3D points.
5. The method according to any one of claims 2 to 4, further comprising, before obtaining pose data of the 2D face: acquiring a reference 3D face model, wherein an object to be projected is arranged on the reference 3D face model; aligning the target 3D face model with the reference 3D face model;
correspondingly, carrying out gesture estimation on the face in the 2D face image according to the 2D-3D mapping parameters and the target 3D face model to obtain gesture data of the 2D face, wherein the gesture data comprises the following steps: according to the aligned target 3D face model and the 2D-3D mapping parameters, carrying out pose estimation on the face in the 2D face image to obtain pose data of the 2D face;
accordingly, performing perspective projection processing on the 2D face image for the target 3D face model according to the pose data of the 2D face, including: projecting the object to be projected into the 2D face image according to the pose data of the 2D face.
6. An AR display method, suitable for an AR display device with a camera, comprising:
the AR display equipment acquires a 2D face image of a user by using a camera, and displays the 2D face image on a display screen of the AR display equipment;
performing feature extraction on the 2D face image by using a first neural network model to obtain a 2D face global feature map; generating feature vectors of a plurality of pixel points based on the 2D face global feature map, wherein the pixel points are from a face area of the 2D face image;
extracting features of the prior 3D face model by using a second neural network model to obtain a 3D face global feature map; generating feature vectors of a plurality of 3D points based on the 3D face global feature map;
splicing the 2D face global feature map and the feature vectors of the plurality of 3D points to obtain a first fusion feature; reconstructing a target 3D face model corresponding to the 2D face image according to the 3D face deformation parameters obtained by feature learning of the first fusion features by a third neural network; in the three-dimensional reconstruction process, the 3D face global feature and feature vectors of the plurality of pixel points are spliced to obtain a second fusion feature; performing face pose estimation according to the position mapping relation between the 2D face image obtained by performing feature learning on the second fusion feature and the target 3D face model by a fourth neural network so as to obtain pose data of the 2D face; adding a target object on the target 3D face model;
and according to the pose data of the 2D face, performing perspective projection processing on the 2D face image for the target 3D face model to which the target object has been added, and displaying the 2D face image with the target object on the display screen.
7. A live broadcast method, comprising:
acquiring an initial live video of an anchor end by using a camera, wherein the initial live video comprises 2D face images;
performing feature extraction on the 2D face image by using a first neural network model to obtain a 2D face global feature map; generating feature vectors of a plurality of pixel points based on the 2D face global feature map, wherein the pixel points are from a face area of the 2D face image;
extracting features of the prior 3D face model by using a second neural network model to obtain a 3D face global feature map; generating feature vectors of a plurality of 3D points based on the 3D face global feature map;
splicing the 2D face global feature map and the feature vectors of the plurality of 3D points to obtain a first fusion feature; reconstructing a target 3D face model corresponding to the 2D face image according to the 3D face deformation parameters obtained by feature learning of the first fusion features by a third neural network;
In the three-dimensional reconstruction process, the 3D face global feature and feature vectors of the plurality of pixel points are spliced to obtain a second fusion feature; performing face pose estimation according to the position mapping relation between the 2D face image obtained by performing feature learning on the second fusion feature and the target 3D face model by a fourth neural network so as to obtain pose data of the 2D face; adding live broadcast dynamic effects or products on the target 3D face model;
and performing perspective projection processing on the 2D face image, according to the pose data of the 2D face, for the target 3D face model to which the live broadcast dynamic effect or the product has been added, so as to obtain a target live broadcast video, and sending the target live broadcast video to a playing terminal, wherein the target live broadcast video comprises the 2D face image with the live broadcast dynamic effect or the product.
8. An image processing apparatus, comprising:
the three-dimensional reconstruction module is used for extracting features of the 2D face image by using the first neural network model to obtain a 2D face global feature map; generating feature vectors of a plurality of pixel points based on the 2D face global feature map, wherein the pixel points are from a face area of the 2D face image; extracting features of the prior 3D face model by using a second neural network model to obtain a 3D face global feature map; generating feature vectors of a plurality of 3D points based on the 3D face global feature map; splicing the 2D face global feature map and the feature vectors of the plurality of 3D points to obtain a first fusion feature; reconstructing a target 3D face model corresponding to the 2D face image according to the 3D face deformation parameters obtained by feature learning of the first fusion features by a third neural network;
The gesture estimation module is used for splicing the 3D face global feature and the feature vectors of the plurality of pixel points in the three-dimensional reconstruction process to obtain a second fusion feature; performing face pose estimation according to the position mapping relation between the 2D face image obtained by performing feature learning on the second fusion feature and the target 3D face model by a fourth neural network so as to obtain pose data of the 2D face;
and the perspective projection module is used for performing perspective projection processing on the 2D face image for the target 3D face model according to the pose data of the 2D face.
9. An image processing device, characterized by comprising: a memory and a processor; the memory is used for storing a computer program; the processor, coupled to the memory, is configured to execute the computer program for: performing feature extraction on the 2D face image by using a first neural network model to obtain a 2D face global feature map; generating feature vectors of a plurality of pixel points based on the 2D face global feature map, wherein the pixel points are from a face area of the 2D face image; extracting features of the prior 3D face model by using a second neural network model to obtain a 3D face global feature map; generating feature vectors of a plurality of 3D points based on the 3D face global feature map; splicing the 2D face global feature map and the feature vectors of the plurality of 3D points to obtain a first fusion feature; reconstructing a target 3D face model corresponding to the 2D face image according to the 3D face deformation parameters obtained by feature learning of the first fusion features by a third neural network; in the three-dimensional reconstruction process, the 3D face global feature and feature vectors of the plurality of pixel points are spliced to obtain a second fusion feature; performing face pose estimation according to the position mapping relation between the 2D face image obtained by performing feature learning on the second fusion feature and the target 3D face model by a fourth neural network so as to obtain pose data of the 2D face; and performing perspective projection processing to the 2D face image for the target 3D face model according to the pose data of the 2D face.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1-7.
CN202110844521.5A 2021-07-26 2021-07-26 Image processing, AR display and live broadcast method, device and storage medium Active CN113628322B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110844521.5A CN113628322B (en) 2021-07-26 2021-07-26 Image processing, AR display and live broadcast method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110844521.5A CN113628322B (en) 2021-07-26 2021-07-26 Image processing, AR display and live broadcast method, device and storage medium

Publications (2)

Publication Number Publication Date
CN113628322A CN113628322A (en) 2021-11-09
CN113628322B true CN113628322B (en) 2023-12-05

Family

ID=78380912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110844521.5A Active CN113628322B (en) 2021-07-26 2021-07-26 Image processing, AR display and live broadcast method, device and storage medium

Country Status (1)

Country Link
CN (1) CN113628322B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116033139B (en) * 2022-12-16 2023-12-12 江苏奥格视特信息科技有限公司 Digital virtual naked eye 3D display method and system


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697749A (en) * 2017-10-20 2019-04-30 虹软科技股份有限公司 A kind of method and apparatus for three-dimensional modeling
US10769411B2 (en) * 2017-11-15 2020-09-08 Qualcomm Technologies, Inc. Pose estimation and model retrieval for objects in images
KR102455468B1 (en) * 2018-06-22 2022-10-19 한국전자통신연구원 Method and apparatus for reconstructing three dimensional model of object
CN110866864A (en) * 2018-08-27 2020-03-06 阿里巴巴集团控股有限公司 Face pose estimation/three-dimensional face reconstruction method and device and electronic equipment
JP7150894B2 (en) * 2019-10-15 2022-10-11 ベイジン・センスタイム・テクノロジー・デベロップメント・カンパニー・リミテッド AR scene image processing method and device, electronic device and storage medium

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145688A (en) * 2017-06-28 2019-01-04 武汉斗鱼网络科技有限公司 The processing method and processing device of video image
CN108062791A (en) * 2018-01-12 2018-05-22 北京奇虎科技有限公司 A kind of method and apparatus for rebuilding human face three-dimensional model
CN110313020A (en) * 2018-01-22 2019-10-08 深圳市大疆创新科技有限公司 Image processing method, equipment and computer readable storage medium
DE102019106123A1 (en) * 2018-03-12 2019-09-12 Nvidia Corporation Three-dimensional (3D) pose estimation from the side of a monocular camera
CN108985220A (en) * 2018-07-11 2018-12-11 腾讯科技(深圳)有限公司 A kind of face image processing process, device and storage medium
CN109063678A (en) * 2018-08-24 2018-12-21 北京字节跳动网络技术有限公司 The method, apparatus and storage medium of face image identification
CN109508678A (en) * 2018-11-16 2019-03-22 广州市百果园信息技术有限公司 Training method, the detection method and device of face key point of Face datection model
CN109859296A (en) * 2019-02-01 2019-06-07 腾讯科技(深圳)有限公司 Training method, server and the storage medium of SMPL parametric prediction model
CN112132739A (en) * 2019-06-24 2020-12-25 北京眼神智能科技有限公司 3D reconstruction and human face posture normalization method, device, storage medium and equipment
CN112348937A (en) * 2019-08-09 2021-02-09 华为技术有限公司 Face image processing method and electronic equipment
CN111401234A (en) * 2020-03-13 2020-07-10 深圳普罗米修斯视觉技术有限公司 Three-dimensional character model construction method and device and storage medium
CN111598998A (en) * 2020-05-13 2020-08-28 腾讯科技(深圳)有限公司 Three-dimensional virtual model reconstruction method and device, computer equipment and storage medium
CN112508778A (en) * 2020-12-18 2021-03-16 咪咕文化科技有限公司 3D face prop mapping method, terminal and storage medium
CN112734911A (en) * 2021-01-07 2021-04-30 北京联合大学 Single image three-dimensional face reconstruction method and system based on convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Single-image 3D face reconstruction based on pose estimation; Zhan Hongyan; Zhang Lei; Tao Peiya; Microelectronics & Computer (09); full text *
Survey on virtual reality augmentation technology; Zhou Zhong; Zhou Yi; Xiao Jiangjian; Scientia Sinica Informationis (02); full text *

Also Published As

Publication number Publication date
CN113628322A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
US11838518B2 (en) Reprojecting holographic video to enhance streaming bandwidth/quality
US11748957B2 (en) Generating 3D data in a messaging system
US11961189B2 (en) Providing 3D data for messages in a messaging system
US10368047B2 (en) Six-degree of freedom video playback of a single monoscopic 360-degree video
US20180158246A1 (en) Method and system of providing user facial displays in virtual or augmented reality for face occluding head mounted displays
KR102624635B1 (en) 3D data generation in messaging systems
US11410401B2 (en) Beautification techniques for 3D data in a messaging system
CN110766777A (en) Virtual image generation method and device, electronic equipment and storage medium
US11783556B2 (en) Augmented reality content generators including 3D data in a messaging system
US11825065B2 (en) Effects for 3D data in a messaging system
JP6294054B2 (en) Video display device, video presentation method, and program
CN113628322B (en) Image processing, AR display and live broadcast method, device and storage medium
JP4892405B2 (en) Image processing apparatus and method
CN113744411A (en) Image processing method and device, equipment and storage medium
TWI608446B (en) Method of applying virtual makeup, virtual makeup electronic system and electronic device having virtual makeup electronic system
US20240096041A1 (en) Avatar generation based on driving views
US20240071007A1 (en) Multi-dimensional experience presentation using augmented reality
CN116582660A (en) Video processing method and device for augmented reality and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant