CN113628322A - Image processing method, AR display live broadcast method, AR display equipment, AR display live broadcast equipment and storage medium - Google Patents
- Publication number
- CN113628322A (application number CN202110844521.5A)
- Authority
- CN
- China
- Prior art keywords
- face
- model
- image
- target
- face image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/04—Context-preserving transformations, e.g. by using an importance map
Abstract
The embodiment of the application provides an image processing method, an AR display method, a live broadcast method, devices, and a storage medium. In the embodiment of the application, a 2D face image is three-dimensionally reconstructed based on a prior 3D face model. During the three-dimensional reconstruction, face pose estimation is performed by combining the position mapping relation from the 2D face image to the reconstructed 3D face model, so as to obtain pose data of the 2D face. When the reconstructed 3D face model is projected onto the 2D face image, perspective projection of the target 3D face model onto the 2D face image is performed using the 2D face pose data. Because the pose data of the 2D face are fully taken into account, the projection of the 3D face model onto the 2D image fits better and looks more realistic, which provides more possibilities for applications with high-precision requirements.
Description
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method, an AR display method, a live broadcast method, an AR display device, and a storage medium.
Background
Vision-based three-dimensional face reconstruction and pose estimation have important application value in scenarios such as 3D avatar creation for AR (Augmented Reality) real-time live broadcast, face animation generation, AR makeup, and AR try-on. For example, in an AR makeup scene, makeup special effects (such as blush or lipstick) need to be added to a reconstructed three-dimensional face model, and the three-dimensional face model with the makeup special effects is then projected onto a two-dimensional face image for presentation. For another example, in an AR try-on scene, a try-on product (e.g., sunglasses or earrings) needs to be added to the reconstructed three-dimensional face model, and the three-dimensional face model with the product added is then projected onto a two-dimensional face image for presentation. When the three-dimensional face model is projected onto the two-dimensional face image, the reconstructed three-dimensional face model often fails to fit the two-dimensional face image; for example, the makeup may appear distorted or suspended in an AR makeup scene, which limits the application of the reconstructed three-dimensional face model.
Disclosure of Invention
Aspects of the application provide an image processing method, an AR display method, a live broadcast method, devices, and a storage medium, so that the effect of projecting a 3D face model onto a 2D image fits better and looks more realistic, providing more possibilities for applications with high-precision requirements.
An embodiment of the present application provides an image processing method, including: performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image; in the three-dimensional reconstruction process, carrying out face pose estimation by combining the position mapping relation from the 2D face image to the target 3D face model to obtain pose data of the 2D face; and performing perspective projection processing on the 2D face image aiming at the target 3D face model according to the pose data of the 2D face.
The embodiment of the present application further provides an AR display method, applicable to an AR display device that has a camera, the method including: the AR display device acquires a 2D face image of a user with the camera and displays the 2D face image on a display screen of the AR display device; three-dimensional reconstruction is performed on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image, and during the three-dimensional reconstruction, face pose estimation is performed by combining the position mapping relation from the 2D face image to the target 3D face model to obtain pose data of the 2D face; a target object is added on the target 3D face model; and according to the pose data of the 2D face, perspective projection onto the 2D face image is performed for the target 3D face model with the target object added, and the 2D face image with the target object is displayed on the display screen.
The embodiment of the application further provides a live broadcast method, which includes: acquiring an initial live video at the anchor end with a camera, wherein the initial live video comprises a 2D face image; performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image, and performing face pose estimation by combining the position mapping relation from the 2D face image to the target 3D face model during the three-dimensional reconstruction to obtain pose data of the 2D face; adding a live broadcast dynamic effect or product on the target 3D face model; and performing perspective projection onto the 2D face image for the target 3D face model with the live broadcast dynamic effect or product added, according to the pose data of the 2D face, to obtain a target live video, and sending the target live video to a playing terminal, wherein the target live video comprises the 2D face image with the live broadcast dynamic effect or product.
An embodiment of the present application further provides an image processing apparatus, including: the three-dimensional reconstruction module is used for performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image; the pose estimation module is used for carrying out face pose estimation by combining a position mapping relation from the 2D face image to the target 3D face model in the three-dimensional reconstruction process so as to obtain pose data of the 2D face; and the perspective projection module is used for performing perspective projection processing on the 2D face image aiming at the target 3D face model according to the posture data of the 2D face.
An embodiment of the present application further provides an image processing apparatus, including: a memory and a processor; a memory for storing a computer program; a processor coupled with the memory for executing the computer program for: performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image; in the three-dimensional reconstruction process, carrying out face pose estimation by combining the position mapping relation from the 2D face image to the target 3D face model to obtain pose data of the 2D face; and performing perspective projection processing on the 2D face image aiming at the target 3D face model according to the pose data of the 2D face.
Embodiments of the present application further provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps in the image processing method, the AR display method, and the live broadcast method provided in the embodiments of the present application.
Embodiments of the present application further provide a computer program product, which includes a computer program/instruction, and when the computer program/instruction is executed by a processor, the processor is caused to implement the steps in the image processing method, the AR display method, and the live broadcast method provided in the embodiments of the present application.
In the embodiment of the application, a 2D image is three-dimensionally reconstructed based on a prior 3D face model. During the three-dimensional reconstruction, face pose estimation is performed by combining the position mapping relation from the 2D face image to the reconstructed 3D face model, so as to obtain pose data of the 2D face. When the reconstructed 3D face model is projected onto the 2D face image, perspective projection of the target 3D face model onto the 2D face image is performed using the 2D face pose data. Because the pose data of the 2D face are fully taken into account, the projection of the 3D face model onto the 2D image fits better and looks more realistic, which provides more possibilities for applications with high-precision requirements.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flowchart of an image processing method according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a neural network model structure for generating an a priori 3D face model according to an exemplary embodiment of the present application;
fig. 3a is an architecture diagram of a neural network model based facial image processing according to an exemplary embodiment of the present application;
FIG. 3b is an architecture diagram of a neural network model based three-dimensional reconstruction provided by an exemplary embodiment of the present application;
FIG. 3c is a block diagram of another architecture for three-dimensional reconstruction and face pose estimation based on a neural network model according to an exemplary embodiment of the present disclosure;
FIG. 3d is an architecture diagram of a first neural network model and a process for performing feature extraction thereof according to an exemplary embodiment of the present disclosure;
FIG. 3e is an architecture diagram of a second neural network model and a process for feature extraction thereof according to an exemplary embodiment of the present disclosure;
FIG. 3f is a schematic diagram of a neural network model structure based on a single RGB/RGB-D image and a 3D face shape prior provided by an exemplary embodiment of the present application;
FIG. 3g is a schematic diagram of an RGB image provided by an exemplary embodiment of the present application;
fig. 3h is a schematic diagram of an RGB image with face regions cut out according to an exemplary embodiment of the present application;
FIG. 3i is a diagram illustrating an effect of projecting a 3D face onto a 2D image according to an exemplary embodiment of the present application;
fig. 3j is a schematic diagram of a neural network model structure based on fusion of RGB images and depth images and 3D face shape prior provided by an exemplary embodiment of the present application;
fig. 4a is a schematic structural diagram of a live broadcast system provided in an exemplary embodiment of the present application;
fig. 4b is a schematic flowchart of a live broadcasting method according to an exemplary embodiment of the present application;
fig. 4c is a schematic structural diagram of an AR display system according to an exemplary embodiment of the present application;
fig. 4d is a schematic flowchart of an AR display method according to an exemplary embodiment of the present application;
fig. 5a is a schematic structural diagram of an image processing apparatus according to an exemplary embodiment of the present application;
fig. 5b is a schematic structural diagram of an image processing apparatus according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In some display scenes based on facial images, there are some application scenes that require special effects or products to be added to the facial images. For example, in a live scene, some ornamentation or beauty effect needs to be added to the face image. For another example, in an AR makeup scene, a makeup special effect needs to be added to a face image. For another example, in an AR fitting scene, some fitting products such as sunglasses, earrings, or hats need to be added to the face image. The process of adding the special effect is actually a process of reconstructing a 3D face model based on the 2D face image and performing a projection operation on the 3D face model to which the special effect is added to the 2D face image. In the whole process, three-dimensional reconstruction of the face model and projection operation of the reconstructed 3D face model to the 2D face image are involved. The three-dimensional reconstruction refers to a process of restoring a 3D face model based on a 2D face image; the projection operation can be regarded as a process of mapping three-dimensional points on the 3D face model to two-dimensional points on the 2D face image, and the essence thereof can be regarded as a process of transforming three-dimensional coordinates into two-dimensional coordinates.
In existing schemes, the projection of the reconstructed 3D face model onto the 2D face image is usually implemented by orthographic projection or weak perspective projection. The orthographic projection formula is x = s·P·X + t, where s is a scale constant, P is the 3D-to-2D projection matrix, and t is a 2×1 translation vector; the weak perspective projection formula is x = s·P·R·X + t, where s is a scale constant, P is the orthographic projection matrix [1, 0, 0; 0, 1, 0], R is a 3×3 rotation matrix, and t is a 2×1 translation vector. However, both orthographic projection and weak perspective projection lose part or all of the face pose data, so after the projection operation the 3D face model with the added special effect or product may not fit the 2D face image; for example, makeup may appear distorted or suspended in an AR makeup scene, which limits the application of the reconstructed three-dimensional face model.
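By way of a non-limiting illustration (not part of the claimed method), the following Python/NumPy sketch shows the two conventional projection models discussed above; the function names and array shapes are illustrative assumptions.

```python
# Illustrative NumPy sketch of orthographic and weak-perspective projection.
import numpy as np

P_ORTHO = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])  # 2x3 orthographic projection matrix [1,0,0; 0,1,0]

def orthographic_project(X, s, t):
    """x = s * P * X + t, simply dropping the depth axis.
    X: (N, 3) 3D points, s: scalar scale, t: (2,) translation."""
    return s * (X @ P_ORTHO.T) + t

def weak_perspective_project(X, s, R, t):
    """x = s * P * R * X + t: rotate first, then project orthographically.
    R: (3, 3) rotation matrix."""
    return s * ((X @ R.T) @ P_ORTHO.T) + t
```

Neither model applies depth-dependent scaling, which is why part of the face pose information is lost and the projected result may not fit the 2D face image.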
In order to solve the technical problems, in some embodiments of the present application, a prior 3D face model is used to perform three-dimensional reconstruction on a 2D image, in the three-dimensional reconstruction process, a position mapping relationship from the 2D face image to a reconstructed 3D face model is combined to perform face pose estimation, so as to obtain 2D face pose data, when the three-dimensional reconstructed 3D face model is projected onto the 2D face image, the 2D face pose data is used to perform perspective projection processing on the target 3D face model to the 2D face image, so that the pose data of the 2D face is fully considered, so that the effect of projecting the 3D face model onto the 2D image is more realistic, and more possibilities are provided for application of high-precision requirements.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of an image processing method according to an exemplary embodiment of the present application; as shown in fig. 1, the method includes:
101. performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image;
102. in the three-dimensional reconstruction process, carrying out face pose estimation by combining a position mapping relation from a 2D face image to a target 3D face model to obtain pose data of a 2D face;
103. and performing perspective projection processing on the 2D face image aiming at the target 3D face model according to the pose data of the 2D face.
In the embodiment of the application, after the prior 3D face model is used for carrying out three-dimensional reconstruction on the 2D face image to obtain the target 3D face model corresponding to the 2D face image, a plurality of special effects or try-on products can be added to the target 3D face model; thereafter, projection processing to the 2D face image may be performed on the target 3D face model to which the special effect or the try-on product is added. In order to improve the fit degree of a target 3D face model, a special effect or a try-on product and a 2D face image, in the embodiment of the application, face pose estimation is carried out in the process of carrying out three-dimensional reconstruction on the 2D image, and pose data of a 2D face are obtained; based on the pose data of the 2D face, when the projection processing to the 2D face image is executed aiming at the target 3D face model after three-dimensional reconstruction, the perspective projection processing to the 2D face image is executed on the target 3D face model by utilizing the 2D face pose data, and the pose data of the 2D face is fully considered in the perspective projection process, so that the effect of the 3D face model projected to the 2D image is more fit and more real, and more possibilities are provided for the application of high-precision requirements.
Performing perspective projection of the target 3D face model onto the 2D face image using the 2D face pose data is the process of projecting the target 3D face model, or the special effect or try-on product on it, onto the 2D face image from a certain projection center. The perspective projection transformation can be expressed as x = K(R·X + T), where x is the 2D coordinate of a 3D point of the target 3D face model projected onto the 2D face image, K is the camera intrinsic matrix, R is a 3×3 rotation matrix, T is a 3×1 translation vector, and X is a reconstructed 3D point (or the point cloud formed by the 3D points) of the target 3D face model. A face has six degrees of freedom in space, namely the translational degrees of freedom along the three orthogonal coordinate axes x, y, and z, and the rotational degrees of freedom around these three axes. In the perspective transformation formula above, the translation vector T corresponds to the translational degrees of freedom along the three orthogonal axes, and the rotation matrix R corresponds to the rotational degrees of freedom around them. In this embodiment, the pose data of the face may be represented by the six-degree-of-freedom information of the face.
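As a non-limiting sketch of the full perspective projection x = K(R·X + T) described above (K, R, and T are placeholders, not values prescribed by this application):

```python
# Hedged sketch of full perspective projection using the six-degree-of-freedom pose (R, T).
import numpy as np

def perspective_project(X, K, R, T):
    """Project (N, 3) model points onto the image plane.
    K: (3, 3) camera intrinsics, R: (3, 3) rotation, T: (3,) translation."""
    Xc = X @ R.T + T                  # transform into the camera frame (full 6-DoF pose)
    uvw = Xc @ K.T                    # apply the camera intrinsics
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide -> (N, 2) pixel coordinates
```

Unlike the orthographic and weak-perspective models, this projection keeps the full six-degree-of-freedom pose, which is what allows the projected 3D face model to fit the 2D face image.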
Specifically, three-dimensional reconstruction is performed on the 2D face image based on the prior 3D face model, and a target 3D face model corresponding to the 2D face image is obtained. For example, feature extraction is respectively performed on the 2D face image and the prior 3D face model to obtain 2D face feature information and 3D face feature information; generating a 3D face deformation parameter by combining the 2D face characteristic information and the 3D face characteristic information; and reconstructing a target 3D face model corresponding to the 2D face image according to the 3D face deformation parameters and the prior 3D face model. The prior 3D face model is a general face model, and may be understood as a 3D face shape prior, or may be understood as an average face mesh point (mesh), and due to the generality of the prior 3D face model, the prior 3D face model is a 3D face model that does not contain face pose information.
In the embodiment of the present application, an implementation of obtaining the prior 3D face model is not limited. Optionally, an embodiment of obtaining an a priori 3D face model includes: learning the prior 3D face model using an encoding-decoding (Encode-Decode) network structure, as shown in fig. 2, the encoding-decoding network structure comprising: an encoder 21 and a decoder 22. Specifically, some 2D face sample images are acquired, and three-dimensional reconstruction is performed on the 2D face sample images to obtain 3D face meshes, or some 3D face meshes can be acquired directly by using a 3D face acquisition device; then, these 3D face meshes are used as samples, each sample includes three-dimensional coordinates of a plurality of face 3D points, the plurality of samples are input to the encoder 21, the encoder 21 maps the input samples into hidden vectors, and averages the hidden vectors of the plurality of samples, and then inputs the averaged hidden vectors to the decoder 22, and the decoder 22 generates a priori 3D face model according to the averaged hidden vectors.
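A minimal PyTorch sketch of the encode-decode idea of fig. 2 is given below; the mesh size, latent width, and layer sizes are assumptions, and the training of the autoencoder itself (e.g., with a reconstruction loss over the 3D face mesh samples) is omitted.

```python
# Minimal PyTorch sketch: encode each 3D face mesh sample, average the hidden
# vectors, and decode the average into the prior 3D face model (fig. 2).
import torch
import torch.nn as nn

N_POINTS, LATENT = 1220, 128  # assumed mesh size and latent width

class MeshAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(N_POINTS * 3, 512), nn.ReLU(), nn.Linear(512, LATENT))
        self.decoder = nn.Sequential(
            nn.Linear(LATENT, 512), nn.ReLU(), nn.Linear(512, N_POINTS * 3))

    def prior_face(self, meshes):
        """meshes: (B, N_POINTS, 3) 3D face mesh samples."""
        z = self.encoder(meshes.flatten(1))      # hidden vector per sample
        z_mean = z.mean(dim=0, keepdim=True)     # average the hidden vectors
        return self.decoder(z_mean).view(N_POINTS, 3)  # prior 3D face mesh
```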
And then, in the three-dimensional reconstruction process, the position mapping relation from the 2D face image to the target 3D face model is combined to carry out face pose estimation so as to obtain pose data of the 2D face. The position mapping relationship refers to a position conversion relationship between a pixel point in the 2D face image and a 3D point in the target 3D face model. The face pose estimation mainly acquires face orientation information, the pose data of the 2D face can be represented by a rotation matrix, a rotation vector, a quaternion or Euler angle, multidimensional freedom and the like, and the four representation forms can be mutually converted. Optionally, the pose data of the face may be represented by six degrees of freedom information of the face, such as a degree of freedom of movement and a degree of freedom of rotation, for details, see the foregoing embodiments, and are not described herein again.
After the pose data of the 2D face is obtained, perspective projection onto the 2D face image may be performed for the target 3D face model according to the pose data of the 2D face, using the perspective projection transformation x = K(R·X + T). Performing perspective projection onto the 2D face image for the target 3D face model includes, but is not limited to, the following modes: first, a special effect object can be added on a reference 3D face model with which the target 3D face model is aligned; on this basis, the special effect object on the reference 3D face model can be directly projected into the 2D face image, where the special effect object can be a makeup special effect such as blush, eyebrow, or lipstick, a special effect such as a sticker, doodle, text, or mosaic, or a try-on product such as sunglasses, a hat, or earrings; second, a special effect object is added on the target 3D face model, and the special effect object and the target 3D face model are projected into the 2D face image; third, a special effect object is added on the target 3D face model, the target 3D face model is made transparent, and the special effect object and the transparent target 3D face model are projected into the 2D face image.
In an optional embodiment of the present application, a neural network model may be used to implement three-dimensional reconstruction of a 2D face image based on a priori 3D face model, and face pose estimation is completed in the three-dimensional reconstruction process, so as to finally obtain a target 3D face model corresponding to the 2D face image and pose data of the 2D face. Fig. 3a shows a scheme architecture for performing face image processing based on a neural network model. As shown in fig. 3a, a 2D face image and a prior 3D face model are input to a neural network model; in the neural network model, on one hand, three-dimensional reconstruction is carried out on the 2D face image to obtain a target 3D face model, and on the other hand, face posture estimation is carried out in the three-dimensional reconstruction process to obtain posture data of the 2D face; after the neural network model outputs the pose data of the 2D face and the target 3D face model, perspective projection processing to the 2D face image is performed for the target 3D face model according to the pose data of the 2D face.
Further alternatively, the neural network model of the present embodiment may be split into two branches, i.e. a first neural network model and a second neural network model, as shown in fig. 3b. The first neural network model is responsible for feature extraction of the 2D face image; the second neural network model is responsible for feature extraction of the prior 3D face model. Specifically, as shown in fig. 3b, in the process of performing three-dimensional reconstruction, for the processing of the 2D face image, the 2D face image may be input to the first neural network model, and the first neural network model is used to perform feature extraction on the 2D face image, so as to obtain a 2D face global feature map. The first neural network model is any neural network model capable of feature extraction; for example, the first neural network model may be a multi-layer neural network model, that is, it includes a plurality of network layers, and different network layers may perform corresponding feature extraction in a targeted manner. The network layers in the neural network model may include, but are not limited to: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Deep Neural Networks (DNNs), residual networks (ResNets), and the like.
With reference to fig. 3b, in the process of performing three-dimensional reconstruction, for the processing of the prior 3D face model, a second neural network model may be used to perform feature extraction on the prior 3D face model to obtain feature vectors of a plurality of 3D points on the prior 3D face model. Likewise, the second neural network model is any neural network model capable of feature extraction; for example, the second neural network model may be a multi-layer neural network model, i.e., it comprises a plurality of network layers, and different network layers may be targeted for respective feature extraction. The network layers in the neural network model may include, but are not limited to: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Deep Neural Networks (DNNs), residual networks (ResNets), and the like.
Further, as shown in fig. 3b, after obtaining the 2D face global feature map and the feature vectors of the plurality of 3D points on the prior 3D face model, a 3D face deformation parameter may be generated according to the 2D face global feature map and the feature vectors of the plurality of 3D points; and reconstructing a target 3D face model corresponding to the 2D face image according to the 3D face deformation parameters and the prior 3D face model. The 3D face deformation parameter is a deformation parameter of the target 3D face model relative to the prior 3D face model, and the deformation parameter may be represented by an offset of the target 3D face model relative to the prior 3D face model. And according to the 3D face deformation parameters and the prior 3D face model, a mode of reconstructing the target 3D face model corresponding to the 2D face image is not limited. For example, the prior 3D face model may be subjected to deformation processing by using the 3D face deformation parameters, so as to obtain a target 3D face model corresponding to the 2D face image. For another example, the 3D face deformation parameters may be superimposed on the prior 3D face model to reconstruct a target 3D face model corresponding to the 2D face image.
Optionally, as shown in fig. 3c, an embodiment of generating a 3D face deformation parameter according to a 2D face global feature map and feature vectors of a plurality of 3D points includes: splicing the 2D face global feature map and the feature vectors of the plurality of 3D points to obtain a first fusion feature; and performing feature learning on the first fusion features by using a third neural network model to obtain a 3D face deformation parameter.
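The following hedged PyTorch sketch illustrates this deformation branch under the assumption that the 2D face global feature can be handled as a single vector; all feature widths are illustrative, not values prescribed by this application.

```python
# Hedged sketch: fuse the 2D global feature with each 3D point's feature vector,
# let an MLP (standing in for the "third neural network model") predict a per-point
# offset (the 3D face deformation parameter), and add the offsets to the prior mesh.
import torch
import torch.nn as nn

F1, C2 = 1024, 64  # assumed global-feature and per-point feature widths

offset_mlp = nn.Sequential(nn.Linear(F1 + C2, 256), nn.ReLU(), nn.Linear(256, 3))

def reconstruct(prior_mesh, global_feat_2d, point_feats_3d):
    """prior_mesh: (n, 3); global_feat_2d: (F1,); point_feats_3d: (n, C2)."""
    n = prior_mesh.shape[0]
    fused = torch.cat([global_feat_2d.expand(n, F1), point_feats_3d], dim=1)  # first fusion feature
    offsets = offset_mlp(fused)   # (n, 3) deformation parameters
    return prior_mesh + offsets   # target 3D face model
```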
In an optional embodiment, by using the neural network model, in addition to the 3D face deformation parameter, a 2D-3D mapping parameter may be generated, where the 2D-3D mapping parameter reflects a position mapping relationship from the 2D face image to the target 3D face model, specifically, a mapping parameter between a pixel point on the 2D face image and a 3D point in the target 3D face model, and the mapping parameter may be represented by a mapping matrix or a functional relation. After the 2D-3D mapping parameters are acquired, face pose estimation can be performed based on the 2D-3D mapping parameters to obtain pose data of the 2D face.
Specifically, as shown in fig. 3c, after the 2D face global feature map is obtained, feature vectors of a plurality of pixel points may be generated based on the 2D face global feature map, where the plurality of pixel points are from a face region of the 2D face image, and the feature vectors are features of each pixel point obtained by further feature extraction based on the 2D face global feature map. Further, as shown in fig. 3c, in the process of extracting the features of the prior 3D face model by using the second neural network model, firstly, the features of the prior 3D face model are extracted by using the second neural network model to obtain a 3D face global feature map; and further generating a feature vector of a plurality of 3D points based on the 3D face global feature map. According to the 3D face global feature map and the feature vectors of the pixel points, 2D-3D mapping parameters for reflecting the position mapping relation from the 2D face image to the target 3D face model can be generated. Based on this, in the three-dimensional reconstruction process, the implementation mode of performing face pose estimation by combining the position mapping relationship from the 2D face image to the target 3D face model to obtain pose data of the 2D face includes: splicing the global feature of the 3D face and the feature vectors of the multiple pixel points to obtain a second fusion feature; performing feature learning on the second fusion feature by using a fourth neural network model to obtain a 2D-3D mapping parameter; and performing pose estimation on the face in the 2D face image according to the 2D-3D mapping parameters and the target 3D face model to obtain pose data of the 2D face, as shown in fig. 3 c.
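A hedged sketch of the mapping branch is given below: each pixel's feature vector is fused with the 3D global feature, and an MLP standing in for the fourth neural network model scores its correspondence to every 3D point, with a softmax over the n points giving one row of the m × n 2D-3D mapping matrix. The feature widths and the softmax reading are assumptions, consistent with the per-point classification formulation mentioned later for fig. 3f.

```python
# Hedged sketch of generating the 2D-3D mapping parameters from the second fusion feature.
import torch
import torch.nn as nn

F2, C1, N_POINTS = 1024, 64, 1220  # assumed dimensions

mapping_mlp = nn.Sequential(
    nn.Linear(F2 + C1, 512), nn.ReLU(), nn.Linear(512, N_POINTS))

def mapping_matrix(global_feat_3d, pixel_feats):
    """global_feat_3d: (F2,); pixel_feats: (m, C1) -> (m, N_POINTS) mapping matrix."""
    m = pixel_feats.shape[0]
    fused = torch.cat([global_feat_3d.expand(m, F2), pixel_feats], dim=1)  # second fusion feature
    return torch.softmax(mapping_mlp(fused), dim=1)  # soft 2D-to-3D point assignment
```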
In the embodiment of the present application, the model structure of the first neural network model is not limited, and all model structures that can perform feature extraction on the 2D face image to obtain the 2D face global feature map and the feature vectors of a plurality of pixel points in the face region are suitable for the embodiment of the present application. In an alternative embodiment, as shown in fig. 3d, the first neural network model comprises: the system comprises a face segmentation network layer, a first feature extraction network layer and a second feature extraction network layer. Firstly, sending the 2D face image into a face segmentation network layer for face segmentation to obtain an initial feature map of the 2D face image and a face region in the 2D face image, wherein the face region comprises: and the initial characteristic graph comprises the characteristics of the plurality of pixel points. And then, inputting the characteristics of a plurality of pixel points into a first characteristic extraction network layer to perform characteristic extraction on the 2D face to obtain a 2D face global characteristic diagram. And finally, inputting the 2D face global feature map into a second feature extraction network layer to perform feature extraction on each pixel point to obtain feature vectors of a plurality of pixel points. For the face segmentation network layer, the face image with the marked face region can be adopted in advance to train the face segmentation network layer, and the face segmentation network layer includes but is not limited to: at least one of a convolutional layer, a pooling layer, an upsampling layer, and a deconvolution layer. In addition, the loss function employed by the face segmentation network layer may include, but is not limited to: cross-entropy loss function (cross-entropy loss) or focal loss (focal loss), etc.
The first feature extraction network layer is a pre-trained neural network model used for extracting 2D face global features from the features of a plurality of pixel points in a face region, the number of the input pixel points and the feature dimensions of the pixel points can be determined by the model and application, the second feature extraction network layer is a pre-trained neural network model used for extracting feature vectors of the pixel points from the 2D face global features, and the dimensions of the feature vectors are determined by the model and application requirements and are not limited. The first feature extraction network layer or the second feature extraction network layer adopts, but is not limited to: CNN, RNN, DNN, residual network, or Multi-Layer Perceptron (MLP), and the model architecture thereof is not limited. In an optional embodiment, the first feature extraction network layer or the second feature extraction network layer is implemented by using MLP, but it should be noted that the two network layers have different functions, so the structures of the two MLP network layers are different.
In this embodiment, the 2D face image may be a Red-Green-Blue (RGB) image, an RGB image with depth information (i.e., an RGB-D image), or an RGB image used together with a separate depth image. If the 2D face image is an RGB image, the features of the initial feature map contain pixel information but no depth information; a plurality of pixel points may be selected from the face region, and the features of the selected pixel points (features without depth information, taken from the initial feature map obtained from the RGB image) are input to the first feature extraction network layer to perform global feature extraction for the 2D face. If the 2D face image is an RGB-D image, the features of the initial feature map contain both pixel information and depth information; a plurality of pixel points may be selected from the face region, and the features of the selected pixel points (features with depth information, taken from the initial feature map obtained from the RGB-D image) are input to the first feature extraction network layer to perform global feature extraction for the 2D face. Optionally, if the 2D face image consists of an RGB image together with a depth image, the RGB image may be input into the face segmentation network layer for face segmentation to obtain the initial feature map of the 2D face image and the face region in the 2D face image, where the features of the initial feature map contain pixel information; a plurality of pixel points may be selected from the face region, and before their features (features without depth information, taken from the initial feature map obtained from the RGB image) are input to the first feature extraction network layer for global feature extraction of the 2D face, the depth image corresponding to the RGB image may also be obtained. Depth features of the pixel points are extracted from the depth image, the depth feature of each pixel point is fused into that pixel point's feature (taken from the initial feature map obtained from the RGB image), and the fused features of the pixel points (combining depth information and pixel information) are then input to the first feature extraction network layer to perform global feature extraction for the 2D face.
The embodiment of extracting the depth features of the plurality of pixel points from the depth image is not limited. In an optional embodiment, the depth image may be sent to a face segmentation network layer to perform face segmentation, so as to obtain an initial feature map of the depth image, where the initial feature map of the depth image includes depth features of a plurality of pixel points. In another optional embodiment, the depth image has depth information and pixel point information, if a coordinate point of a face in the depth image includes three-dimensional coordinate information, and the three-dimensional coordinates are an x axis, a y axis, and a z axis, respectively, the x axis and the y axis coordinates may inversely map the pixel point information, the z axis coordinates may reflect the depth information, the depth information and the pixel point information of a plurality of pixel points in the depth image may be converted into 3D point cloud data, and based on the 3D point cloud data, the depth features of the plurality of pixel points in the depth image are learned.
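As one possible realization of the conversion described above (the application does not prescribe a formula), a depth pixel can be back-projected into a 3D point through assumed camera intrinsics fx, fy, cx, cy:

```python
# Hedged NumPy sketch: convert sampled depth pixels into a 3D point cloud by
# back-projection; the intrinsics are assumptions, not values given by this application.
import numpy as np

def depth_to_point_cloud(depth, uv, fx, fy, cx, cy):
    """depth: (H, W) depth map; uv: (m, 2) integer pixel coordinates (u, v)."""
    z = depth[uv[:, 1], uv[:, 0]]
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # (m, 3) 3D points in the camera frame
```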
Similarly, the embodiment of the application does not limit the model structure of the second neural network model, and all the model structures for performing feature extraction on the prior 3D face model to obtain the global feature map of the 3D face and the feature vectors of the plurality of 3D points on the prior 3D face model are suitable for the embodiment of the application. In an alternative embodiment, as shown in FIG. 3e, the second neural network model includes a third feature extraction network layer and a fourth feature extraction network layer. With reference to fig. 3e, performing feature extraction on the prior 3D face model by using a second neural network model to obtain global features of the 3D face, including: selecting a plurality of 3D points from a prior 3D face model, inputting the position coordinates of the plurality of 3D points into a third feature extraction network layer for feature extraction to obtain 3D face global features, wherein the 3D face global feature values reflect the position coordinates of the plurality of 3D points and do not contain the attitude data of a 3D face, and the 3D face global features belong to the features without the attitude data; correspondingly, generating feature vectors of a plurality of 3D points based on the 3D face global feature map comprises the following steps: and inputting the global features of the 3D face into a fourth feature extraction network layer for feature extraction to obtain feature vectors of a plurality of 3D points. The third feature extraction network layer is a pre-trained neural network model used for extracting the global features of the 3D face according to the position coordinates of the plurality of 3D points, and the fourth feature extraction network layer is a pre-trained neural network model used for extracting the feature vectors of the 3D points from the global features of the 3D face, and the third feature extraction network layer is not limited to the neural network model. The third feature extraction network layer or the fourth feature extraction network layer may include, but is not limited to: CNN, RNN, DNN, residual network or MLP, etc. In an optional embodiment, the third feature extraction network layer or the fourth feature extraction network layer is implemented by using MLP, but it should be noted that the two network layers have different functions, so the structures of the two MLP network layers are different.
In some application scenarios, such as an AR fitting scenario, a model object needs to be added on the target 3D model, and the model object may be sunglasses or a hat, so as to perform perspective projection processing to the 2D face image for the target 3D face model with the model object added. In some application scenarios, for example, a reference 3D face model is provided by a merchant trying on a product, in the process of performing perspective projection processing on a 2D face image with respect to a target 3D face model to which a model object is added, the perspective projection processing may be assisted by the reference 3D face model provided with the model object (object to be projected). Based on this, before obtaining the pose data of the 2D face, the method further includes: acquiring a reference 3D face model, wherein an object to be projected is arranged on the reference 3D face model; and aligning the target 3D face model with the reference 3D face model. The alignment is a process of automatically positioning key feature points of the face, such as eyes, nose tips, mouth corner points, eyebrows or face contour points, for a target 3D face model or a reference 3D face model, and performing operations such as rotation, scaling, expansion or translation on the target 3D face model by using the reference 3D face model as a reference, so that the target 3D face model is as close to the reference 3D face model in shape as possible. On the basis of aligning the target 3D face model and the reference 3D face model, the position effect of the object to be projected on the reference 3D face model can replace the effect of the object to be projected on the target 3D face model. Correspondingly, performing pose estimation on the face in the 2D face image according to the 2D-3D mapping parameters and the target 3D face model to obtain pose data of the 2D face, including: performing pose estimation on the face in the 2D face image according to the aligned target 3D face model and the 2D-3D mapping parameters to obtain pose data of the 2D face; because the target 3D face model is aligned with the reference 3D face model, the posture data of the 2D face is also suitable for the reference 3D face model and an object to be projected on the reference 3D face model; based on this, the perspective projection processing to the 2D face image is executed for the target 3D face model according to the pose data of the 2D face, and the perspective projection processing comprises the following steps: and projecting the object to be projected into the 2D face image according to the attitude data of the 2D face to obtain the 2D face image with the object to be projected.
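One standard way to realize the alignment step (not mandated by this application) is a similarity transform (rotation, scale, translation) estimated from corresponding key feature points, as in the following NumPy sketch in the style of the Umeyama/Procrustes method:

```python
# Hedged sketch: estimate the similarity transform that aligns the target 3D face
# model to the reference 3D face model from k corresponding key feature points.
import numpy as np

def align_to_reference(target_pts, reference_pts):
    """target_pts, reference_pts: (k, 3) corresponding key feature points."""
    mu_t, mu_r = target_pts.mean(0), reference_pts.mean(0)
    Xc, Yc = target_pts - mu_t, reference_pts - mu_r
    U, S, Vt = np.linalg.svd(Xc.T @ Yc)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflection
    R = Vt.T @ D @ U.T                              # rotation: target -> reference
    s = (S * np.diag(D)).sum() / (Xc ** 2).sum()    # isotropic scale
    t = mu_r - s * (R @ mu_t)                       # translation
    return s, R, t   # apply as s * (R @ p) + t to every vertex of the target model
```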
In the above embodiments of the present application, the structure, the input/output dimension, and the like of the neural network model are not limited, and in the following embodiments, the neural network model structure based on a single RGB/RGB-D image and 3D face shape prior, and the neural network model structure based on fusion of the RGB image and the depth image and 3D face shape prior will be taken as examples to describe in detail the technical solution of the embodiments of the present application. Wherein the single RGB/RGB-D image corresponds to the 2D face image in the aforementioned method, and the 3D face shape prior corresponds to the prior 3D face model in the aforementioned method.
As shown in fig. 3f, the neural network model structure based on a single RGB/RGB-D image and a 3D face shape prior is as follows:
Firstly, an RGB image containing a human face is acquired (as shown in fig. 3g), and face detection is performed using a face detection method (a traditional machine learning method or a deep neural network method) to obtain a face area framed by a face detection frame; based on the face detection frame, the original RGB image is cropped to obtain a face RGB image (as shown in fig. 3h), which is denoted as I_RGB. If the face detection frame cannot completely frame the face region, for example, if it only frames the region below the eyebrows and above the chin, the face detection frame is appropriately expanded, and the expanded detection frame is used to crop the original RGB image and obtain a complete face RGB image.
In addition, prior information of the 3D face shape may also be obtained, which may be understood as an average 3D face grid point (mesh), wherein the prior information of the 3D face shape may be learned through an encode-decode network structure (as shown in fig. 2), and a prior 3D face shape is obtained and is recorded as X ∈ R3。
As further shown in fig. 3f, the face RGB image (2D image) cropped after face detection is input into a convolutional neural network (e.g., a CNN), where the size of the RGB image is H × W × 3, H is the height of the RGB image, W is its width, and 3 is the number of RGB channels. The face RGB image is convolved, and the convolution result is then up-sampled or deconvolved to obtain a feature map of size H × W × C, where C is the number of channels in the convolution process. In this process, face segmentation is simultaneously performed on the face RGB image, i.e., a two-class segmentation that yields the face region and the non-face region of the face RGB image. According to the face segmentation result, m pixel points are randomly selected in the face region; the feature of each pixel point is its C-dimensional value in the H × W × C feature map, which finally gives an m × C feature. The m × C feature is then input to a multilayer perceptron (MLP) network layer, and a 1 × f1 global feature is obtained through global pooling, where f1 is a positive integer such as 512 or 1024. The 1 × f1 global feature is input to the next MLP network layer to obtain a 1 × c1 feature vector for each of the m pixel points, so the feature vectors of the m pixel points are expressed as m × c1.
As further shown in fig. 3f, it is assumed that 3D grid points of the prior 3D face shape are n × 3, n is a positive integer and represents the number of 3D grid points (referred to as 3D points for short), and 3 represents that the coordinate dimension of the 3D points is three-dimensional; then, inputting the 3D grid points of the prior 3D face shape into an MLP network layer for feature extraction, and obtaining the global features of 1 xf 2 through global pooling, wherein f2 is a positive integer; then, the global feature of 1 xf 2 is input into the next MLP network layer to learn, so as to obtain a feature vector 1 xc 2 of each of n 3D points, where the feature vectors of n 3D points are represented as nxc 2.
As further shown in fig. 3f, on one hand, the 1 × f1 global feature and the n × c2 feature vectors are concatenated, and the concatenated features are input into a neural network model for prediction to obtain a 2D-3D mapping matrix of size m × n; on the other hand, the m × c1 feature vectors are concatenated with the 1 × f2 global feature, and from the concatenated feature vectors the offset 3Δ of each 3D point relative to the prior 3D face shape (a geometric shape) is predicted, where 3Δ denotes the offsets in the three coordinate dimensions, so the offsets of the n 3D points relative to the prior 3D face shape are expressed as n × 3Δ. The offsets n × 3Δ are added to the three-dimensional coordinate values of the input prior 3D face shape to obtain the reconstructed 3D face model corresponding to the currently input 2D face I_RGB.
Further, by combining the 2D-3D mapping matrix m × n with the coordinates of the m pixel points in the face RGB image, the 3D coordinates on the reconstructed 3D face model corresponding to the m pixel points can be obtained, and the six-degree-of-freedom pose information of the current 2D face, namely the pose data of the 2D face, is then calculated by a Perspective-n-Point (PnP) method or one of its variants. Furthermore, based on the six-degree-of-freedom pose information of the 2D face, perspective projection onto the face RGB image is performed for the reconstructed 3D face model; fig. 3i shows the effect of projecting the reconstructed 3D face model onto the face RGB image under the estimated six-degree-of-freedom pose. For ease of viewing, the outline of the 3D face projection is drawn as a dotted line. It should be noted that, when the network model shown in fig. 3f is trained, the face segmentation model need not be trained; instead, the actually labeled face information can be used directly during training in place of the face segmentation step.
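By way of a non-limiting example, the pose step can be realized with OpenCV's solvePnP; using the mapping matrix as a soft assignment to obtain per-pixel 3D correspondences is an assumption (an argmax over the n points would also serve).

```python
# Hedged OpenCV sketch: recover the six-degree-of-freedom pose from 2D-3D correspondences
# implied by the m x n mapping matrix and the reconstructed 3D face model.
import cv2
import numpy as np

def estimate_pose(mapping, pixels_2d, model_points_3d, K):
    """mapping: (m, n) soft 2D-3D matrix; pixels_2d: (m, 2); model_points_3d: (n, 3)."""
    corres_3d = mapping @ model_points_3d          # expected 3D point for each sampled pixel
    ok, rvec, tvec = cv2.solvePnP(
        corres_3d.astype(np.float64), pixels_2d.astype(np.float64),
        K.astype(np.float64), distCoeffs=None)
    R, _ = cv2.Rodrigues(rvec)                     # 3x3 rotation matrix
    return R, tvec                                 # six-degree-of-freedom pose (R, T)
```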
The neural network shown in fig. 3f is a multi-task network structure that implements three tasks. The first is face segmentation (face region versus non-face region); the loss functions usable by the segmentation network layer include, but are not limited to, a cross-entropy loss or a focal loss. The second is generating the 2D-3D mapping matrix; this can be modeled as a classification problem for each 2D-to-3D point, with a loss function such as softmax loss. The third task is solving the coordinate offset of each face 3D point relative to the prior 3D face shape, for which a loss function such as smooth L1 or Euclidean distance loss can be used.
As shown in fig. 3j, the neural network model structure based on fusion of an RGB image and a depth image together with the 3D face shape prior is as follows:
The face RGB image cropped after face detection (as shown in fig. 3h) is obtained and input into a convolutional neural network; the convolution result is up-sampled or deconvolved to obtain a feature map of size H × W × C, and face segmentation is performed, as detailed in the foregoing embodiments and not repeated here. Further, as shown in fig. 3j, an original depth image corresponding to the original RGB image is obtained; optionally, the original depth image is cropped in the same way as the face RGB image to obtain a face depth image, denoted as I_D, whose size is H × W × 1. Then, according to the face segmentation result, m pixel points are randomly selected in the face region of the face depth image I_D and converted into a 3D point cloud. Feature extraction is performed on the 3D point cloud through an MLP network layer (not shown in fig. 3j), and the extracted features are fused with the features of the corresponding points in the H × W × C feature map obtained by convolving the face RGB image, giving m × C' features. The feature fusion can be done in various ways, for example fusion on feature channels or point-by-point feature fusion.
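A hedged sketch of the point-by-point fusion option is given below; concatenation is only one of the fusion manners the text allows, and the channel widths are assumptions.

```python
# Hedged PyTorch sketch: concatenate the per-point features extracted from the 3D
# point cloud with the features of the corresponding pixels in the H x W x C RGB
# feature map, giving the m x C' fused feature.
import torch

def fuse_point_features(rgb_feature_map, point_feats, pixel_idx):
    """rgb_feature_map: (C, H, W); point_feats: (m, Cp); pixel_idx: (m, 2) as (y, x)."""
    rgb_feats = rgb_feature_map[:, pixel_idx[:, 0], pixel_idx[:, 1]].T  # (m, C)
    return torch.cat([rgb_feats, point_feats], dim=1)                   # (m, C + Cp) = (m, C')
```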
As shown in fig. 3j, after the m × C' features are obtained, they are input into an MLP network layer; a 1 × f1 global feature is obtained through global pooling, and the 1 × f1 global feature is input into the next MLP network layer to obtain a 1 × c1 feature vector for each pixel, the feature vectors of the m pixels being denoted m × c1. In fig. 3j, the processing of the prior 3D face shape, the three-dimensional reconstruction based on the prior 3D face shape, and the derivation of the 2D face pose data are as described in the foregoing embodiments and are not repeated here.
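The MLP and global-pooling pattern described above can be sketched as follows; the layer widths are assumptions, and concatenating the per-point features with the pooled 1 × f1 global feature before the second MLP is one plausible reading (an assumption) that keeps the per-pixel vectors distinct.

```python
import torch
import torch.nn as nn

m, c_in, f1, c1 = 200, 35, 256, 64

point_mlp = nn.Sequential(nn.Linear(c_in, 128), nn.ReLU(), nn.Linear(128, f1))
pixel_mlp = nn.Sequential(nn.Linear(2 * f1, 128), nn.ReLU(), nn.Linear(128, c1))

fused = torch.randn(m, c_in)                                # m x C' fused features from the previous step
point_feat = point_mlp(fused)                               # shared point-wise MLP, m x f1
global_feat = point_feat.max(dim=0, keepdim=True).values    # 1 x f1 global feature via max pooling

# Assumed reading: concatenate each point feature with the broadcast global
# feature and pass it through the next MLP to get one c1-dim vector per sampled
# pixel, i.e. m x c1 in total.
pixel_vectors = pixel_mlp(torch.cat([point_feat, global_feat.expand(m, -1)], dim=1))
print(global_feat.shape, pixel_vectors.shape)               # [1, 256] and [200, 64]
```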
It should be noted that, in addition to calculating the six-degree-of-freedom pose information of the face by PnP and its variants, it may also be calculated by least-squares fitting. Specifically, since the depth image can be converted into a 3D point cloud, the 2D-3D mapping matrix m × n can equally be interpreted as a mapping from the 3D point cloud converted from the depth image to the 3D points of the face reconstruction model, and the six-degree-of-freedom pose information of the face can then be calculated by least-squares fitting of the corresponding 3D points.
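The least-squares alternative can be sketched as a closed-form rigid fit (Kabsch/Umeyama) between the depth-derived 3D points and their counterparts on the reconstructed model; the correspondences are assumed to be given by the mapping described above, and the synthetic data below is only a self-check.

```python
import numpy as np

def fit_rigid_transform(src, dst):
    """Least-squares rotation R and translation t such that R @ src_i + t ≈ dst_i."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)    # centre both point sets
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                 # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

# Synthetic check: recover a known pose from corresponding point sets.
model_pts = np.random.rand(100, 3)                         # points on the reconstructed model
angle = np.deg2rad(20.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
t_true = np.array([0.05, -0.02, 0.60])
cloud_pts = model_pts @ R_true.T + t_true                  # matching depth point cloud

R_est, t_est = fit_rigid_transform(model_pts, cloud_pts)
print(np.allclose(R_est, R_true, atol=1e-6), np.allclose(t_est, t_true, atol=1e-6))
```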
It should be noted that operations such as face detection, image segmentation, and feature extraction on the input RGB image or Depth image may be varied in the present application. For example, features of the Depth image may also be extracted by a convolutional network; as another example, there are many ways to fuse the features of the RGB image and the Depth image, such as fusion on feature channels or point-by-point feature fusion. In addition, the network layer parameters and settings of the network structures employed in the present application may also be varied.
Scenario example 1:
The following description takes a live broadcast scene as an example. The present application further provides a live broadcast system. As shown in fig. 4a, the live broadcast system 400a includes: an anchor terminal 401a, a playing terminal 402a, a server 403a, and an image acquisition device 404a.
The anchor terminal 401a and the playing terminal 402a may be implemented as terminal devices such as a desktop computer, a notebook computer, or a smart phone, and the anchor terminal 401a and the playing terminal 402a may take the same or different forms. The number of playing terminals 402a is one or more; fig. 4a illustrates a plurality of playing terminals 402a as an example.
The image capturing device 404a may be a monocular camera or a Red-Green-Blue-Depth (RGBD) camera, etc. The image capturing device 404a is configured to capture an initial live video that includes a 2D face image, where the face may be that of the anchor or of other staff appearing in the picture. The image capturing device 404a provides the captured initial live video to the server 403a via the anchor terminal 401a. If the image capturing device 404a is a monocular camera, the 2D face image included in the initial live video is an RGB image; if the image capturing device 404a is an RGBD camera, the 2D face image included in the initial live video is an RGB-D image.
The server 403a may be a server device such as a conventional server, a cloud server, or a server array; fig. 4a illustrates a cloud server as an example. The server 403a may perform three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image, and perform face pose estimation by combining the position mapping relationship from the 2D face image to the target 3D face model in the three-dimensional reconstruction process to obtain pose data of the 2D face. When the anchor or another staff member adds a live broadcast dynamic effect or product through the anchor terminal 401a, the anchor terminal 401a provides the live broadcast dynamic effect or product to the server 403a. The server 403a adds the live broadcast dynamic effect or product to the target 3D face model, and performs perspective projection processing onto the 2D face image, according to the pose data of the 2D face, for the target 3D face model to which the live broadcast dynamic effect or product has been added. The live broadcast dynamic effect includes special effects such as headwear, pendants, or makeup. The product may be, but is not limited to, a wig, sunglasses, or a hat. For example, the live broadcast dynamic effect or product may be added to a reference 3D face model, the target 3D face model may be aligned with the reference 3D face model, and on this basis the live broadcast dynamic effect or product on the reference 3D face model may be projected directly into the 2D face image. As another example, the live broadcast dynamic effect is added to the target 3D face model, and the live broadcast dynamic effect and the target 3D face model are projected together into the 2D face image. As a third example, the live broadcast dynamic effect or product is added to the target 3D face model, the target 3D face model is made transparent, and the live broadcast dynamic effect or product together with the transparent target 3D face model is projected into the 2D face image. The server 403a may provide the 2D face image subjected to the perspective projection processing to the playing terminal 402a, and the playing terminal 402a presents it to the viewing user. For details, reference may be made to the foregoing embodiments, which are not repeated here.
In this embodiment, the image processing process in the system shown in fig. 4a may be executed by the server side, or may be executed by the anchor side. As shown in fig. 4b, an embodiment of the present application further provides a live broadcasting method, including the following steps:
401b, acquiring an initial live video of an anchor terminal by using a camera, wherein the initial live video comprises a 2D face image;
402b, performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image, and, in the three-dimensional reconstruction process, performing face pose estimation by combining the position mapping relation from the 2D face image to the target 3D face model to obtain pose data of the 2D face; adding a live broadcast dynamic effect or product on the target 3D face model;
and 403b, performing perspective projection processing onto the 2D face image, according to the pose data of the 2D face, for the target 3D face model to which the live broadcast dynamic effect or product is added, to obtain a target live broadcast video, and sending the target live broadcast video to a playing terminal, wherein the target live broadcast video comprises the 2D face image with the live broadcast dynamic effect or product.
For details of the live broadcast method, reference may be made to the foregoing embodiments, which are not described herein again.
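For orientation, the following sketch strings the stages of one live-broadcast frame together; reconstruct_face and estimate_pose are placeholder stubs standing in for the networks and PnP step described earlier, and the prop geometry and camera intrinsics are assumptions, not elements defined by this application.

```python
import numpy as np
import cv2

def reconstruct_face(frame, prior_shape):
    """Placeholder: a real system runs the reconstruction network here."""
    return prior_shape.copy()                      # target 3D face model (n x 3)

def estimate_pose(frame, model_3d, K):
    """Placeholder: a real system uses the 2D-3D mapping plus PnP here."""
    return np.zeros(3), np.array([0.0, 0.0, 0.6])  # rvec, tvec

K = np.array([[500.0, 0, 112.0], [0, 500.0, 112.0], [0, 0, 1.0]])
prior_shape = np.random.rand(1220, 3) * 0.1
prop_vertices = np.random.rand(50, 3) * 0.1        # live effect / product anchored on the model

frame = np.zeros((224, 224, 3), dtype=np.uint8)    # one frame of the initial live video
model_3d = reconstruct_face(frame, prior_shape)
rvec, tvec = estimate_pose(frame, model_3d, K)

# Project the added effect into the 2D face image at the estimated pose and
# rasterise it onto the frame, producing one frame of the target live video.
pts, _ = cv2.projectPoints(prop_vertices, rvec, tvec, K, None)
for (u, v) in pts.reshape(-1, 2):
    if 0 <= u < 224 and 0 <= v < 224:
        cv2.circle(frame, (int(u), int(v)), 1, (0, 255, 0), -1)
print(frame.shape)
```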
The live broadcast method provided by the embodiment of the application performs three-dimensional reconstruction on the 2D image based on the prior 3D face model, and during the three-dimensional reconstruction performs face pose estimation by combining the position mapping relation from the 2D face image to the reconstructed 3D face model to obtain 2D face pose data. When the reconstructed 3D face model is projected onto the 2D face image, the perspective projection processing of the target 3D face model onto the 2D face image uses this 2D face pose data, so that the pose of the 2D face is fully taken into account, the effect of projecting the 3D face model onto the 2D image is more realistic, and more possibilities are provided for applications with high-precision requirements.
Scenario example 2:
the following description will take an AR display scene as an example. The present application also provides an AR display system, as shown in fig. 4c, the AR display system 400c includes: AR display device 401c and server device 402 c.
Wherein the AR display device 401c may be, but is not limited to: mobile phones, tablets, PCs, Head-up displays (HUDs) or AR smart interaction devices, etc. The AR display device 401c has an image capturing function, for example, a monocular camera or an RGBD camera is installed on the AR display device 401c, and is used to implement the image capturing function. Server 402c may be a server-side device such as a conventional server, a cloud server, or an array of servers.
In this embodiment, the AR display device 401c may collect a 2D face image of a user and display the 2D face image on a display screen thereof; the AR display device 401c may provide the 2D face image to the server device 402 c. The server device 402c performs three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image, and performs face pose estimation by combining a position mapping relationship from the 2D face image to the target 3D face model in the three-dimensional reconstruction process to obtain pose data of the 2D face. When the user needs to add a target object to the 2D face image, the target object to be added may be sent to the server device 402c through the AR display device 401 c. Wherein the target object may be a special effect or a product. The server device 402c receives a target object to be added, and adds the target object to the target 3D face model; according to the pose data of the 2D face, perspective projection processing on the 2D face image is performed on the target 3D face model to which the target object is added, so that a 2D face image with the target object is obtained, and the 2D face image with the target object is sent to the AR display device 401 c. The AR display device 401c displays the 2D face image with the target object on the display screen.
In this embodiment, the image processing procedure in the system shown in fig. 4c may be performed by the server device, or may be performed by the AR display device 401 c. As shown in fig. 4d, an embodiment of the present application further provides an AR display method, including the following steps:
401D, the AR display device collects a 2D face image of a user by using a camera, and displays the 2D face image on a display screen of the AR display device;
402D, performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image, and performing face pose estimation by combining a position mapping relation from the 2D face image to the target 3D face model in the three-dimensional reconstruction process to obtain pose data of the 2D face; adding a target object on the target 3D face model;
and 403D, according to the pose data of the 2D face, performing perspective projection processing onto the 2D face image for the target 3D face model to which the target object is added, and displaying the 2D face image with the target object on the display screen.
The AR display method provided by the embodiment of the application performs three-dimensional reconstruction on the 2D image based on the prior 3D face model, and during the three-dimensional reconstruction performs face pose estimation by combining the position mapping relation from the 2D face image to the reconstructed 3D face model to obtain 2D face pose data. When the reconstructed 3D face model is projected onto the 2D face image, the perspective projection processing of the target 3D face model onto the 2D face image uses this 2D face pose data, so that the effect of projecting the 3D face model onto the 2D image is more realistic, and more possibilities are provided for applications with high-precision requirements.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps 101 to 103 may be device a; for another example, the execution subject of steps 101 and 102 may be device a, and the execution subject of step 103 may be device B; and so on.
In addition, some of the flows described in the above embodiments and the drawings include a plurality of operations in a specific order, but it should be clearly understood that these operations may be executed out of the order presented herein or in parallel; sequence numbers such as 101 and 102 are merely used to distinguish different operations and do not by themselves represent any execution order. These flows may also include more or fewer operations, which may be performed sequentially or in parallel. It should be noted that the terms "first", "second", and the like herein are used to distinguish different messages, devices, modules, and so on; they do not represent a sequential order, nor do they require that "first" and "second" be of different types.
Fig. 5a is a schematic structural diagram of an image processing apparatus according to an exemplary embodiment of the present application. As shown in fig. 5a, the image processing apparatus 50a includes: a three-dimensional reconstruction module 51a, an attitude estimation module 52a, and a perspective projection module 53 a.
The three-dimensional reconstruction module 51a is configured to perform three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image;
the pose estimation module 52a is configured to perform face pose estimation by combining a position mapping relationship from a 2D face image to a target 3D face model in a three-dimensional reconstruction process to obtain pose data of a 2D face;
and a perspective projection module 53a, configured to perform perspective projection processing onto the 2D face image for the target 3D face model according to the pose data of the 2D face.
In an alternative embodiment, the three-dimensional reconstruction module 51a is configured to: performing feature extraction on the 2D face image by using a first neural network model to obtain a 2D face global feature map; performing feature extraction on the prior 3D face model by using a second neural network model to obtain feature vectors of a plurality of 3D points on the prior 3D face model; generating a 3D face deformation parameter according to the 2D face global feature map and the feature vectors of the plurality of 3D points; and reconstructing a target 3D face model corresponding to the 2D face image according to the 3D face deformation parameters and the prior 3D face model.
In an alternative embodiment, the three-dimensional reconstruction module 51a is configured to: splicing the 2D face global feature map and the feature vectors of the plurality of 3D points to obtain a first fusion feature; and performing feature learning on the first fusion features by using a third neural network model to obtain a 3D face deformation parameter.
In an optional embodiment, the apparatus further comprises: a generation module; the generation module is to: generating feature vectors of a plurality of pixel points based on the 2D face global feature map, wherein the pixel points are from a face region of the 2D face image; accordingly, the three-dimensional reconstruction module 51a is configured to: performing feature extraction on the prior 3D face model by using a second neural network model to obtain a 3D face global feature map, and generating feature vectors of a plurality of 3D points based on the 3D face global feature map; correspondingly, the pose estimation module 52a is configured to splice the global feature of the 3D face and the feature vectors of the multiple pixel points to obtain a second fusion feature; performing feature learning on the second fusion feature by using a fourth neural network model to obtain a 2D-3D mapping parameter; and carrying out pose estimation on the face in the 2D face image according to the 2D-3D mapping parameters and the target 3D face model to obtain pose data of the 2D face.
In an alternative embodiment, the three-dimensional reconstruction module 51a is configured to: sending the 2D face image into a face segmentation network layer in a first neural network model to perform face segmentation to obtain an initial feature map of the 2D face image and a face region in the 2D face image, wherein the face region comprises a plurality of pixel points, and the initial feature map comprises features of the plurality of pixel points; inputting the characteristics of a plurality of pixel points into a first characteristic extraction network layer in a first neural network model to extract the characteristics of the 2D face to obtain a 2D face global characteristic diagram; accordingly, the generation module is to: and inputting the 2D face global feature map into a second feature extraction network layer in the first neural network model to extract features of each pixel point, so as to obtain feature vectors of a plurality of pixel points.
In an optional embodiment, the 2D face image is an RGB image, and before inputting the features of the plurality of pixel points into the first feature extraction network layer in the first neural network model to perform feature extraction on the 2D face, the apparatus further includes: the system comprises a first acquisition module and a fusion module; the first obtaining module is used for: acquiring a depth image corresponding to the RGB image; the fusion module is used for: and extracting the depth characteristics of a plurality of pixel points from the depth image, and fusing the depth characteristics of each pixel point into the characteristics of the pixel point.
In an alternative embodiment, the three-dimensional reconstruction module 51a is configured to: selecting a plurality of 3D points from the prior 3D face model, inputting the position coordinates of the plurality of 3D points into a third feature extraction network layer in the second neural network model for feature extraction, and obtaining the global features of the 3D face; and inputting the global features of the 3D face into a fourth feature extraction network layer in the second neural network model for feature extraction to obtain feature vectors of a plurality of 3D points.
In an alternative embodiment, before obtaining pose data of the 2D face, the apparatus further comprises: a second acquisition module and an alignment module; the second obtaining module is used for: acquiring a reference 3D face model, wherein an object to be projected is arranged on the reference 3D face model; the alignment module is to: aligning the target 3D face model with the reference 3D face model; correspondingly, the pose estimation module 52a is configured to perform pose estimation on the face in the 2D face image according to the aligned target 3D face model and the 2D-3D mapping parameters, so as to obtain pose data of the 2D face; correspondingly, the perspective projection module 53a is configured to project the object to be projected into the 2D face image according to the pose data of the 2D face.
The image processing apparatus provided by the embodiment of the application performs three-dimensional reconstruction on the 2D image based on the prior 3D face model, and during the three-dimensional reconstruction performs face pose estimation by combining the position mapping relation from the 2D face image to the reconstructed 3D face model to obtain 2D face pose data. When the reconstructed 3D face model is projected onto the 2D face image, the perspective projection processing of the target 3D face model onto the 2D face image uses this 2D face pose data, so that the pose of the 2D face is fully taken into account, the projection of the 3D face model onto the 2D image fits better and looks more realistic, and more possibilities are provided for applications with high-precision requirements.
Fig. 5b is a schematic structural diagram of an image processing apparatus according to an exemplary embodiment of the present application. As shown in fig. 5b, the image processing apparatus includes: a memory 54b and a processor 55b.
The memory 54b is used for storing a computer program, and may be configured to store other various data to support operations on the image processing apparatus. Examples of such data include instructions for any application or method operating on an image processing device.
The memory 54b may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 55b, coupled to the memory 54b, for executing computer programs in the memory 54b for: performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image; in the three-dimensional reconstruction process, carrying out face pose estimation by combining the position mapping relation from the 2D face image to the target 3D face model to obtain pose data of the 2D face; and performing perspective projection processing on the 2D face image aiming at the target 3D face model according to the pose data of the 2D face.
In an optional embodiment, when performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain the target 3D face model corresponding to the 2D face image, the processor 55b is specifically configured to: performing feature extraction on the 2D face image by using a first neural network model to obtain a 2D face global feature map; performing feature extraction on the prior 3D face model by using a second neural network model to obtain feature vectors of a plurality of 3D points on the prior 3D face model; generating a 3D face deformation parameter according to the 2D face global feature map and the feature vectors of the plurality of 3D points; and reconstructing a target 3D face model corresponding to the 2D face image according to the 3D face deformation parameters and the prior 3D face model.
In an optional embodiment, when the processor 55b generates the 3D face deformation parameter according to the 2D face global feature map and the feature vectors of the plurality of 3D points, the processor is specifically configured to: splicing the 2D face global feature map and the feature vectors of the plurality of 3D points to obtain a first fusion feature; and performing feature learning on the first fusion features by using a third neural network model to obtain a 3D face deformation parameter.
In an alternative embodiment, in the process of extracting features of the 2D face image by using the first neural network model, the processor 55b is further configured to: generating feature vectors of a plurality of pixel points based on the 2D face global feature map, wherein the pixel points are from a face region of the 2D face image; correspondingly, when the processor 55b performs feature extraction on the prior 3D face model by using the second neural network model to obtain feature vectors of a plurality of 3D points on the prior 3D face model, the processor is specifically configured to: performing feature extraction on the prior 3D face model by using a second neural network model to obtain a 3D face global feature map, and generating feature vectors of a plurality of 3D points based on the 3D face global feature map; correspondingly, the processor 55b performs face pose estimation by combining the position mapping relationship from the 2D face image to the target 3D face model in the three-dimensional reconstruction process to obtain pose data of the 2D face, and is specifically configured to: splicing the global feature of the 3D face and the feature vectors of the multiple pixel points to obtain a second fusion feature; performing feature learning on the second fusion feature by using a fourth neural network model to obtain a 2D-3D mapping parameter; and carrying out pose estimation on the face in the 2D face image according to the 2D-3D mapping parameters and the target 3D face model to obtain pose data of the 2D face.
In an optional embodiment, when the processor 55b performs feature extraction on the 2D face image by using the first neural network model to obtain the 2D face global feature map, the processor is specifically configured to: sending the 2D face image into a face segmentation network layer in a first neural network model to perform face segmentation to obtain an initial feature map of the 2D face image and a face region in the 2D face image, wherein the face region comprises a plurality of pixel points, and the initial feature map comprises features of the plurality of pixel points; inputting the characteristics of a plurality of pixel points into a first characteristic extraction network layer in a first neural network model to extract the characteristics of the 2D face to obtain a 2D face global characteristic diagram; correspondingly, when the processor 55b generates the feature vectors of a plurality of pixel points based on the 2D face global feature map, it is specifically configured to: and inputting the 2D face global feature map into a second feature extraction network layer in the first neural network model to extract features of each pixel point, so as to obtain feature vectors of a plurality of pixel points.
In an optional embodiment, the 2D face image is an RGB image, and before inputting the features of the plurality of pixel points into the first feature extraction network layer in the first neural network model to perform feature extraction on the 2D face, the processor 55b is further configured to: acquiring a depth image corresponding to the RGB image; and extracting the depth characteristics of a plurality of pixel points from the depth image, and fusing the depth characteristics of each pixel point into the characteristics of the pixel point.
In an optional embodiment, when the processor 55b performs feature extraction on the prior 3D face model by using the second neural network model to obtain the global features of the 3D face, the processor is specifically configured to: selecting a plurality of 3D points from the prior 3D face model, inputting the position coordinates of the plurality of 3D points into a third feature extraction network layer in the second neural network model for feature extraction, and obtaining the global features of the 3D face; accordingly, the processor 55b, when generating the feature vectors of the plurality of 3D points based on the 3D face global feature map, is specifically configured to: and inputting the global features of the 3D face into a fourth feature extraction network layer in the second neural network model for feature extraction to obtain feature vectors of a plurality of 3D points.
In an alternative embodiment, before obtaining the pose data of the 2D face, the processor 55b is further configured to: acquiring a reference 3D face model, wherein an object to be projected is arranged on the reference 3D face model; aligning the target 3D face model with the reference 3D face model; accordingly, the processor 55b, when performing pose estimation on the face in the 2D face image according to the 2D-3D mapping parameters and the target 3D face model to obtain pose data of the 2D face, is specifically configured to: performing pose estimation on the face in the 2D face image according to the aligned target 3D face model and the 2D-3D mapping parameters to obtain pose data of the 2D face; accordingly, the processor 55b, when performing the perspective projection processing to the 2D face image for the target 3D face model according to the pose data of the 2D face, is specifically configured to: and projecting the object to be projected into the 2D face image according to the attitude data of the 2D face.
The image processing device provided by the embodiment of the application performs three-dimensional reconstruction on the 2D image based on the prior 3D face model, and during the three-dimensional reconstruction performs face pose estimation by combining the position mapping relation from the 2D face image to the reconstructed 3D face model to obtain 2D face pose data. When the reconstructed 3D face model is projected onto the 2D face image, the perspective projection processing of the target 3D face model onto the 2D face image uses this 2D face pose data, so that the effect of projecting the 3D face model onto the 2D image is more realistic, and more possibilities are provided for applications with high-precision requirements.
Further, as shown in fig. 5b, the image processing apparatus further includes: a communication component 56b, a display 57b, a power component 58b, an audio component 59b, and the like. Only some components are schematically shown in fig. 5b, which does not mean that the image processing apparatus includes only the components shown in fig. 5b. It should be noted that the components within the dashed box in fig. 5b are optional rather than mandatory components, and may be determined according to the product form of the image processing apparatus.
The image processing device of this embodiment may be implemented as a terminal device such as a desktop computer, a notebook computer, or a smart phone, or as a server-side device such as a conventional server, a cloud server, or a server array. If implemented as a terminal device such as a desktop computer, a notebook computer, or a smart phone, the image processing device may include the components within the dashed box in fig. 5b; if implemented as a server-side device such as a conventional server, a cloud server, or a server array, it may not include the components within the dashed box in fig. 5b.
Accordingly, embodiments of the present application also provide a computer readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps of the method shown in fig. 1, which can be performed by an image processing apparatus.
Accordingly, embodiments of the present application also provide a computer program product, which includes computer programs/instructions, when executed by a processor, cause the processor to implement the steps in the method shown in fig. 1.
The communication component of fig. 5b described above is configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G/LTE, 5G, or other mobile communication networks, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The display in fig. 5b described above includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The power supply module of fig. 5b provides power to the various components of the device in which the power supply module is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
The audio component of fig. 5b described above may be configured to output and/or input an audio signal. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (12)
1. An image processing method, comprising:
performing three-dimensional reconstruction on a 2D face image based on a priori 3D face model to obtain a target 3D face model corresponding to the 2D face image;
in the three-dimensional reconstruction process, carrying out face pose estimation by combining the position mapping relation from the 2D face image to the target 3D face model to obtain pose data of the 2D face;
and performing perspective projection processing on the 2D face image aiming at the target 3D face model according to the pose data of the 2D face.
2. The method of claim 1, wherein three-dimensionally reconstructing a 2D face image based on a prior 3D face model to obtain a target 3D face model corresponding to the 2D face image comprises:
performing feature extraction on the 2D face image by using a first neural network model to obtain a 2D face global feature map;
performing feature extraction on the prior 3D face model by using a second neural network model to obtain feature vectors of a plurality of 3D points on the prior 3D face model;
generating a 3D face deformation parameter according to the 2D face global feature map and the feature vectors of the plurality of 3D points;
and reconstructing a target 3D face model corresponding to the 2D face image according to the 3D face deformation parameters and the prior 3D face model.
3. The method according to claim 2, wherein in the process of extracting features of the 2D face image by using the first neural network model, the method further comprises: generating feature vectors of a plurality of pixel points based on the 2D face global feature map, wherein the pixel points are from a face region of the 2D face image;
correspondingly, performing feature extraction on the prior 3D face model by using a second neural network model to obtain feature vectors of a plurality of 3D points on the prior 3D face model, including: performing feature extraction on the prior 3D face model by using a second neural network model to obtain a 3D face global feature map, and generating feature vectors of the plurality of 3D points based on the 3D face global feature map;
correspondingly, in the three-dimensional reconstruction process, the face pose estimation is performed by combining the position mapping relation from the 2D face image to the target 3D face model to obtain pose data of the 2D face, which includes:
splicing the global feature of the 3D face and the feature vectors of the plurality of pixel points to obtain a second fusion feature; performing feature learning on the second fusion feature by using a fourth neural network model to obtain a 2D-3D mapping parameter; and performing pose estimation on the face in the 2D face image according to the 2D-3D mapping parameters and the target 3D face model to obtain pose data of the 2D face.
4. The method of claim 3, wherein performing feature extraction on the 2D face image by using a first neural network model to obtain a 2D face global feature map comprises:
sending the 2D face image into a face segmentation network layer in a first neural network model for face segmentation to obtain an initial feature map of the 2D face image and a face region in the 2D face image, wherein the face region comprises a plurality of pixel points, and the initial feature map comprises features of the pixel points; inputting the characteristics of the plurality of pixel points into a first characteristic extraction network layer in a first neural network model to perform characteristic extraction on the 2D face to obtain a 2D face global characteristic diagram;
correspondingly, generating feature vectors of a plurality of pixel points based on the 2D face global feature map comprises the following steps: and inputting the 2D face global feature map into a second feature extraction network layer in the first neural network model to extract features of each pixel point, so as to obtain feature vectors of a plurality of pixel points.
5. The method of claim 3, wherein performing feature extraction on the prior 3D face model by using a second neural network model to obtain 3D face global features comprises: selecting a plurality of 3D points from the prior 3D face model, inputting the position coordinates of the 3D points into a third feature extraction network layer in a second neural network model for feature extraction, and obtaining the global features of the 3D face;
correspondingly, generating the feature vectors of the plurality of 3D points based on the 3D face global feature map comprises the following steps: and inputting the global features of the 3D face into a fourth feature extraction network layer in a second neural network model for feature extraction to obtain feature vectors of the plurality of 3D points.
6. The method of any of claims 3-5, further comprising, prior to obtaining pose data for the 2D face: acquiring a reference 3D face model, wherein an object to be projected is arranged on the reference 3D face model; aligning the target 3D face model with the reference 3D face model;
correspondingly, performing pose estimation on the face in the 2D face image according to the 2D-3D mapping parameters and the target 3D face model to obtain pose data of the 2D face, including: performing pose estimation on the face in the 2D face image according to the aligned target 3D face model and the 2D-3D mapping parameters to obtain pose data of the 2D face;
correspondingly, performing the perspective projection processing onto the 2D face image for the target 3D face model according to the pose data of the 2D face comprises: projecting the object to be projected into the 2D face image according to the pose data of the 2D face.
7. An AR display method is suitable for AR display equipment, the AR display equipment is provided with a camera, and the method is characterized by comprising the following steps:
the AR display equipment acquires a 2D face image of a user by using a camera, and displays the 2D face image on a display screen of the AR display equipment;
performing three-dimensional reconstruction on the 2D face image based on a priori 3D face model to obtain a target 3D face model corresponding to the 2D face image, and performing face pose estimation by combining a position mapping relation from the 2D face image to the target 3D face model in the three-dimensional reconstruction process to obtain pose data of the 2D face; adding a target object on the target 3D face model;
and according to the pose data of the 2D face, performing perspective projection processing onto the 2D face image for the target 3D face model to which the target object is added, and displaying the 2D face image with the target object on the display screen.
8. A live broadcast method, comprising:
acquiring an initial live video of an anchor terminal by using a camera, wherein the initial live video comprises a 2D face image;
performing three-dimensional reconstruction on the 2D face image based on a priori 3D face model to obtain a target 3D face model corresponding to the 2D face image, and performing face pose estimation by combining a position mapping relation from the 2D face image to the target 3D face model in the three-dimensional reconstruction process to obtain pose data of the 2D face; adding a live broadcast dynamic effect or product on the target 3D face model;
and performing perspective projection processing onto the 2D face image, according to the pose data of the 2D face, for the target 3D face model to which the live broadcast dynamic effect or product is added, to obtain a target live broadcast video, and sending the target live broadcast video to a playing terminal, wherein the target live broadcast video comprises the 2D face image with the live broadcast dynamic effect or product.
9. An image processing apparatus characterized by comprising:
the three-dimensional reconstruction module is used for performing three-dimensional reconstruction on the 2D face image based on the prior 3D face model to obtain a target 3D face model corresponding to the 2D face image;
the pose estimation module is used for carrying out face pose estimation by combining the position mapping relation from the 2D face image to the target 3D face model in the three-dimensional reconstruction process so as to obtain pose data of the 2D face;
and the perspective projection module is used for performing perspective projection processing onto the 2D face image for the target 3D face model according to the pose data of the 2D face.
10. An image processing apparatus characterized by comprising: a memory and a processor; the memory for storing a computer program; the processor, coupled with the memory, to execute the computer program to: performing three-dimensional reconstruction on a 2D face image based on a priori 3D face model to obtain a target 3D face model corresponding to the 2D face image; in the three-dimensional reconstruction process, carrying out face pose estimation by combining the position mapping relation from the 2D face image to the target 3D face model to obtain pose data of the 2D face; and performing perspective projection processing on the 2D face image aiming at the target 3D face model according to the pose data of the 2D face.
11. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 8.
12. A computer program product comprising computer programs/instructions, characterized in that, when executed by a processor, causes the processor to implement the steps in the method of any of claims 1-8.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110844521.5A (granted as CN113628322B) | 2021-07-26 | 2021-07-26 | Image processing, AR display and live broadcast method, device and storage medium
Publications (2)

Publication Number | Publication Date
---|---
CN113628322A | 2021-11-09
CN113628322B | 2023-12-05
Family

ID=78380912

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202110844521.5A (granted as CN113628322B, active) | Image processing, AR display and live broadcast method, device and storage medium | 2021-07-26 | 2021-07-26

Country Status (1)

Country | Link
---|---
CN | CN113628322B (en)
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant