CN116363245A - Virtual face generation method, virtual face live broadcast method and device

Info

Publication number: CN116363245A
Application number: CN202310304671.6A
Authority: CN (China)
Legal status: Pending
Prior art keywords: face, vector, sample, trained, inputting
Other languages: Chinese (zh)
Inventors: 郑康元, 陈增海, 陈广
Applicant and current assignee: Guangzhou Cubesili Information Technology Co Ltd
Application filed by: Guangzhou Cubesili Information Technology Co Ltd
Priority: CN202310304671.6A

Classifications

    • G06T 11/00: 2D [Two Dimensional] image generation
        • G06T 11/001: Texturing; Colouring; Generation of texture or colour
        • G06T 11/60: Editing figures and text; Combining figures or text
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
        • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
        • G06V 40/161: Detection; Localisation; Normalisation
        • G06V 40/168: Feature extraction; Face representation
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present application relates to the technical field of image processing and network live broadcast, and provides a virtual face generation method, a virtual face live broadcast method and device, an electronic device and a storage medium. The method comprises the following steps: inputting a face picture into a face attribute decoupling model to obtain an identity feature vector, an expression feature vector and a texture feature vector; inputting a description text into a trained first depth network model to obtain a face appearance vector and a face texture vector; inputting the space coordinates of a plurality of sampling points in a preset three-dimensional space, the identity feature vector, the expression feature vector and the face appearance vector into a density prediction module to obtain the density of each sampling point; inputting the view angle information, the texture feature vector and the face texture vector of each sampling point into a color value prediction module to obtain the color value of each sampling point; and performing volume rendering on the density and color value of each sampling point to obtain a virtual face, thereby improving the quality of the generated virtual face.

Description

Virtual face generation method, virtual face live broadcast method and device
Technical Field
The embodiments of the present application relate to the technical field of image processing and network live broadcast, and in particular to a virtual face generation method, a virtual face live broadcast method and device, an electronic device and a storage medium.
Background
With the development of image processing technology, a virtual face resembling a real face can be generated from the face in a face image, so that a user can take the virtual face as his or her avatar. For example, an anchor may use the virtual face as the cover image of a live broadcast room, or replace the anchor's real face with the virtual face during live broadcast.
In the related art, a virtual face is obtained by extracting face features from a face image and inputting the face features into a deep learning network model. However, this solution suffers from inaccurate extraction of facial detail features (for example, some detail features may be missing), which results in a low-quality virtual face.
Disclosure of Invention
The embodiments of the present application provide a virtual face generation method, a virtual face live broadcast method and device, an electronic device and a storage medium, which can improve the quality of the generated virtual face. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a virtual face generating method, where the method includes the following steps:
acquiring a face picture, inputting the face picture into a face attribute decoupling model, and acquiring an identity feature vector, an expression feature vector and a texture feature vector;
acquiring a description text of the face picture, and inputting the description text into a trained first depth network model to obtain a face appearance vector and a face texture vector;
inputting the space coordinates of a plurality of sampling points in a preset three-dimensional space, the identity feature vector, the expression feature vector and the face appearance vector into a density prediction module of a trained second depth network model to obtain the density of each sampling point;
inputting the view angle information, the texture feature vector and the face texture vector of each sampling point into a color value prediction module of the trained second depth network model to obtain the color value of each sampling point;
and performing volume rendering on the density and the color value of each sampling point to obtain the virtual face.
In a second aspect, an embodiment of the present application provides a virtual face live broadcast method, where the method includes the following steps:
identifying a face image of an anchor from a live video frame of the anchor;
acquiring a description text of the face image in response to a text editing operation performed by the anchor on the face image;
generating a virtual face of the anchor according to the face image and the description text by adopting the above virtual face generation method;
and fusing the virtual face with the live video frame of the anchor, and playing the fused live video frame.
In a third aspect, an embodiment of the present application provides a virtual face generating apparatus, including:
the facial image acquisition module is used for acquiring a facial image, inputting the facial image into the facial attribute decoupling model and acquiring an identity feature vector, an expression feature vector and a texture feature vector;
the description text acquisition module is used for acquiring a description text of the face picture, inputting the description text into the trained first depth network model, and acquiring a face appearance vector and a face texture vector;
the density obtaining module is used for inputting the space coordinates of a plurality of sampling points in a preset three-dimensional space, the identity feature vector, the expression feature vector and the face appearance vector into the density prediction module of the trained second depth network model to obtain the density of each sampling point;
the color value obtaining module is used for inputting the view angle information, the texture feature vector and the face texture vector of each sampling point into the color value prediction module of the trained second depth network model to obtain the color value of each sampling point;
and the virtual face obtaining module is used for carrying out volume rendering on the density and the color value of each sampling point to obtain the virtual face.
In a fourth aspect, an embodiment of the present application provides a virtual face live broadcast device, including:
the face image recognition module is used for recognizing the face image of the anchor from the live video frame of the anchor;
the text acquisition module is used for responding to text editing operation of the anchor on the face image and acquiring a description text of the face image;
the virtual face generation module is used for generating a virtual face of the anchor according to the face image and the description text by adopting the above virtual face generation method;
and the video frame playing module is used for fusing the virtual face with the live video frame of the anchor and playing the fused live video frame.
In a fifth aspect, embodiments of the present application provide an electronic device, comprising a processor, a memory and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method of the first or second aspect when executing the computer program.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program which, when executed by a processor, implements steps of a method as in the first or second aspect.
According to the embodiments of the present application, a face picture is acquired and input into the face attribute decoupling model to obtain an identity feature vector, an expression feature vector and a texture feature vector; a description text of the face picture is acquired and input into the trained first depth network model to obtain a face appearance vector and a face texture vector; the space coordinates of a plurality of sampling points in a preset three-dimensional space, the identity feature vector, the expression feature vector and the face appearance vector are input into the density prediction module of the trained second depth network model to obtain the density of each sampling point; the view angle information, the texture feature vector and the face texture vector of each sampling point are input into the color value prediction module of the trained second depth network model to obtain the color value of each sampling point; and volume rendering is performed on the density and color value of each sampling point to obtain the virtual face. In the present application, the virtual face is obtained by extracting the identity, expression and texture features of the face in the face picture and combining them with the description text of the face picture, so that the quality of the generated virtual face is improved. Meanwhile, the virtual face can be changed according to the edited description text, which makes generating the virtual face more engaging.
For a better understanding and implementation, the technical solutions of the present application are described in detail below with reference to the accompanying drawings.
Drawings
Fig. 1 is an application scenario schematic diagram of a virtual face generating method provided in an embodiment of the present application;
fig. 2 is a flow chart of a virtual face generating method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a virtual face according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a training architecture of a trained second depth network model according to an embodiment of the present application;
fig. 5 is a flow chart of a live virtual face method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a virtual face generating device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a virtual face live broadcast device provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the appended claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
The virtual face generation method provided by the present application can be used for virtual face generation in application scenarios such as VR games, video calls and network live broadcast. The embodiments of the present application are illustrated by taking, as examples, applying the virtual face as the cover image of a live broadcast room or to the avatar-based live broadcast of an anchor.
Referring to fig. 1, fig. 1 is a schematic diagram of an application scenario of the virtual face generation method provided in an embodiment of the present application. The application scenario includes an anchor client 101, a server 102 and a viewer client 103, where the anchor client 101 and the viewer client 103 interact through the server 102.
The anchor client 101 refers to the end that transmits the live video, and is generally the client used by the anchor (i.e., the live anchor user) in a live webcast.
The viewer client 103 refers to the end that receives and views the live video, and is generally the client used by a viewer (i.e., a live viewer user) in a live webcast.
The hardware referred to by the anchor client 101 and the viewer client 103 is essentially computer equipment, which may be, as shown in fig. 1, specifically smart phones, smart interactive tablets, personal computers, and the like. Both the anchor client 101 and the viewer client 103 may access the internet via known network access means to establish a data communication link with the server 102.
The server 102 acts as a service server and may be responsible for further interfacing with related audio data servers, video streaming servers, and other servers providing related support, etc., to form a logically associated service cluster for serving related end devices, such as the anchor client 101 and the viewer client 103 shown in fig. 1.
In this embodiment of the present application, the anchor client 101 and the viewer client 103 may join the same live broadcast room (i.e., live broadcast channel). The live broadcast room is a chat room implemented by means of internet technology, and generally has audio/video playing control functions. The anchor user broadcasts live in the live broadcast room through the anchor client 101, and a viewer of the viewer client 103 can log into the server 102 to watch the live broadcast in the live broadcast room.
In the live broadcast room, interaction between the anchor and viewers can be realized through well-known online interaction modes such as voice, video and text. Generally, the anchor user performs for the viewers in the form of an audio/video stream, and economic transactions may occur during the interaction. Of course, the application form of the live broadcast room is not limited to online entertainment, and can be extended to other related scenarios, for example: user pairing interaction scenarios, video conference scenarios, product recommendation and sales scenarios, and any other scenarios requiring similar interaction.
Specifically, the process by which a viewer watches a live broadcast is as follows: the viewer can open a live broadcast application (e.g., YY) installed on the viewer client 103 and choose to enter any live broadcast room, triggering the viewer client 103 to load the live broadcast room interface for the viewer. The live broadcast room interface includes a plurality of interaction components; by loading these interaction components, the viewer can watch the live broadcast in the live broadcast room and perform various online interactions.
When an anchor performs a live broadcast, a cover image can be set for the live broadcast room, and a high-quality cover image can attract viewers to enter the anchor's live broadcast room. In particular, a virtual face may be used as the cover image of the live broadcast room. In order to make the network live broadcast more engaging, a virtual avatar can be displayed in the live broadcast picture instead of the anchor's real image. In particular, a virtual face may be used to replace the anchor's real face.
In the related art, a virtual face is obtained by extracting face features from a face image and inputting the face features into a deep learning network model. However, this solution suffers from inaccurate extraction of facial detail features (for example, some detail features may be missing), which results in a low-quality virtual face.
For this reason, the embodiments of the present application provide a virtual face generation method, which can be executed by the anchor client or by the server.
Referring to fig. 2, fig. 2 is a flowchart of a virtual face generating method according to an embodiment of the present application, where the method includes the following steps:
S10: And acquiring a face picture, inputting the face picture into a face attribute decoupling model, and acquiring an identity feature vector, an expression feature vector and a texture feature vector.
The face picture may be a selfie photo uploaded by the anchor, a frame captured (via a screenshot service) from a selfie video uploaded by the anchor, or a picture in a sequence of face screenshots captured from the live video stream obtained from the server.
The face attribute decoupling model is used for extracting face attribute features from the face picture, where the face attribute features include the identity features, expression features and texture features of the face. Specifically, the face attribute decoupling model may be a 3DMM (3D Morphable Model, a statistical model of 3D face deformation). The identity features include the facial features, contours, etc. of the face; the expression features include smiling, laughing, frowning, squinting, etc.; and the texture features include wrinkles, skin color, hair color, etc.
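As an illustration only, the sketch below shows one way such a decoupling encoder could be organized: a shared image backbone with three regression heads, one per attribute group. The class name, layer sizes and vector dimensions are assumptions made for the example and are not taken from the embodiment; a real 3DMM fitting network would be considerably deeper.

```python
import torch
import torch.nn as nn

class FaceAttributeDecoupler(nn.Module):
    """Illustrative 3DMM-style decoupling encoder: a shared convolutional backbone
    followed by three heads regressing identity, expression and texture coefficients.
    The dimensions (80 / 64 / 80) are placeholders, not values from the embodiment."""
    def __init__(self, id_dim=80, exp_dim=64, tex_dim=80):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.id_head = nn.Linear(64, id_dim)    # identity feature vector (alpha)
        self.exp_head = nn.Linear(64, exp_dim)  # expression feature vector (beta)
        self.tex_head = nn.Linear(64, tex_dim)  # texture feature vector (gamma)

    def forward(self, face_picture):            # face_picture: (B, 3, H, W)
        feat = self.backbone(face_picture)
        return self.id_head(feat), self.exp_head(feat), self.tex_head(feat)
```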
S20: and acquiring a description text of the face picture, and inputting the description text into the trained first depth network model to acquire a face appearance vector and a face texture vector.
The description text is used for describing the facial features, skin color and hair color of the face in the face picture. For example, the description text may be "yellowish skin tone, enlarged nose, rounded eyes", and so on.
The trained first depth network model is used for generating, from a given description text, a face appearance vector and a face texture vector that conform to that description text. Specifically, the trained first depth network model includes a text editing layer network, an appearance extraction layer network and a texture extraction layer network. The text editing layer network is the text encoder of a Contrastive Language-Image Pre-training (CLIP) model, the appearance extraction layer network is a multi-layer perceptron network ShapeNet, and the texture extraction layer network is a multi-layer perceptron network TextureNet.
In the embodiment of the present application, the description text of the face picture is input sequentially into the text editing layer network and the appearance extraction layer network to obtain the face appearance vector, and is input sequentially into the text editing layer network and the texture extraction layer network to obtain the face texture vector.
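For concreteness, a minimal sketch of this first depth network model is given below: a frozen text encoder produces the text embedding, and two small multi-layer perceptrons (standing in for ShapeNet and TextureNet) map it to the face appearance vector and the face texture vector. The embedding size, hidden width and output dimension are assumptions, and `text_encoder` is any callable that returns a text embedding (for example, a pre-trained CLIP text encoder).

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small multi-layer perceptron reused for the appearance and texture branches."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)

class TextToFaceVectors(nn.Module):
    """Description text embedding -> (face appearance vector, face texture vector)."""
    def __init__(self, text_encoder, embed_dim=512, out_dim=80):
        super().__init__()
        self.text_encoder = text_encoder            # text editing layer network (frozen)
        self.shape_net = MLP(embed_dim, out_dim)    # appearance extraction layer network
        self.texture_net = MLP(embed_dim, out_dim)  # texture extraction layer network

    def forward(self, tokenized_text):
        emb = self.text_encoder(tokenized_text)     # text embedding vector
        return self.shape_net(emb), self.texture_net(emb)
```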
S30: and inputting the space coordinates, the identity feature vectors, the expression feature vectors and the face appearance vectors of a plurality of sampling points in a preset three-dimensional space to a density prediction module of the trained second depth network model to obtain the density of each sampling point.
The plurality of sampling points are spatial coordinate points at different depths along preset rays in a plurality of directions passing through the three-dimensional space. Specifically, a plurality of rays are emitted by a ray sampler, each ray having a ray direction and a ray origin. As a ray passes through the three-dimensional space, spatial coordinate points (x, y, z) are sampled at different depths along the ray direction.
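A minimal sketch of this ray-based sampling is shown below, assuming evenly spaced depths between assumed near and far bounds; the bounds, sample count and function name are illustrative, not taken from the embodiment.

```python
import torch

def sample_points_along_rays(origins, directions, near=0.0, far=1.0, n_samples=64):
    """origins, directions: (R, 3) ray origins and unit ray directions.
    Returns the sampled (x, y, z) points, shape (R, n_samples, 3), and their depths."""
    depths = torch.linspace(near, far, n_samples)                  # (n_samples,)
    points = origins[:, None, :] + depths[None, :, None] * directions[:, None, :]
    return points, depths
```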
The density prediction module of the trained second depth network model is used for mapping the space coordinates, identity features, expression features and face appearance features of the sampling points into densities. Specifically, the density prediction module includes a first perceptron network and a second perceptron network, both of which are neural radiance field (Neural Radiance Field, NeRF) networks.
S40: and inputting the visual angle information, the texture feature vector and the face texture vector of each sampling point to a color value prediction module of the trained second depth network model to obtain the color value of each sampling point.
The view angle information of a sampling point is the ray direction of the sampling point, and can be represented by angular coordinates (θ, φ).
The color value prediction module of the trained second depth network model is used for mapping the view angle information, texture features and face texture features of the sampling points into color values. Specifically, the color value prediction module includes a third perceptron network, which is a neural radiance field network.
S50: and performing volume rendering on the density and the color value of each sampling point to obtain the virtual face.
Referring to fig. 3, fig. 3 is a schematic diagram of a virtual face according to an embodiment of the present application. Based on the face picture 10, the virtual face 11 is generated in combination with the description text of the face picture. For example, if the description text is "yellowish skin tone, enlarged nose, rounded eyes", then the nose of the virtual face 11 is larger than that in the face picture 10, the skin tone of the virtual face 11 is more yellow than that in the face picture 10, and the eyes of the virtual face 11 are rounder than those in the face picture 10.
Volume rendering refers to integrating the densities and color values of the sampling points along the ray direction to obtain the color value of each pixel of the virtual face. Specifically, the origin of each ray is connected with a pixel of the virtual face to determine the ray direction, and the densities and color values of the sampling points along that ray are integrated along the ray direction.
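The sketch below illustrates this integration with the standard NeRF-style numerical compositing of densities and colors along each ray; it is one common way to realize the volume rendering described here, not necessarily the exact quadrature used in the embodiment.

```python
import torch

def volume_render(densities, colors, depths):
    """densities: (R, N), colors: (R, N, 3), depths: (N,) sample depths along each ray.
    Composites density and color along the ray direction into one RGB value per pixel."""
    deltas = depths[1:] - depths[:-1]
    deltas = torch.cat([deltas, deltas[-1:]], dim=0)               # pad last interval
    alpha = 1.0 - torch.exp(-densities * deltas)                   # opacity per sample
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)             # accumulated transmittance
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alpha * trans
    rgb = (weights[..., None] * colors).sum(dim=1)                 # (R, 3) pixel colors
    return rgb
```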
By applying the embodiments of the present application, a face picture is acquired and input into the face attribute decoupling model to obtain an identity feature vector, an expression feature vector and a texture feature vector; a description text of the face picture is acquired and input into the trained first depth network model to obtain a face appearance vector and a face texture vector; the space coordinates of a plurality of sampling points in a preset three-dimensional space, the identity feature vector, the expression feature vector and the face appearance vector are input into the density prediction module of the trained second depth network model to obtain the density of each sampling point; the view angle information, the texture feature vector and the face texture vector of each sampling point are input into the color value prediction module of the trained second depth network model to obtain the color value of each sampling point; and volume rendering is performed on the density and color value of each sampling point to obtain the virtual face. In the present application, the virtual face is obtained by extracting the identity, expression and texture features of the face in the face picture and combining them with the description text of the face picture, so that the quality of the generated virtual face is improved. Meanwhile, the virtual face can be changed according to the edited description text, which makes generating the virtual face more engaging.
In an alternative embodiment, the trained first depth network model includes a text editing layer network, an appearance extraction layer network, and a texture extraction layer network, step S20 includes steps S201 to S203, specifically as follows:
S201: inputting the description text into the text editing layer network to obtain a text embedding vector;
S202: inputting the text embedding vector into the appearance extraction layer network to obtain a face appearance vector;
S203: and inputting the text embedding vector into the texture extraction layer network to obtain a face texture vector.
In the embodiment of the application, the text embedding vector is used for representing semantic information of the descriptive text, the appearance extraction layer network is used for mapping the semantic information of the descriptive text into face appearance characteristics, and the texture extraction layer network is used for mapping the semantic information of the descriptive text into face texture characteristics, so that mapping from the descriptive text of the face to the face appearance and the face texture can be automatically and rapidly realized.
In an alternative embodiment, the training process of the trained second deep network model includes steps S1 to S7, which are specifically as follows:
S1: obtaining a sample description text of a sample face picture;
S2: inputting the sample face picture into the face attribute decoupling model to obtain a sample identity feature vector, a sample expression feature vector and a sample texture feature vector;
S3: inputting the sample description text into the trained first depth network model to obtain a sample face appearance vector and a sample face texture vector;
S4: inputting the space coordinates of a plurality of sample sampling points in a preset three-dimensional space, the sample identity feature vector, the sample expression feature vector and the sample face appearance vector into the density prediction module of the second depth network model to be trained to obtain the density of each sample sampling point;
S5: inputting the view angle information, the sample texture feature vector and the sample face texture vector of each sample sampling point into the color value prediction module of the second depth network model to be trained to obtain the color value of each sample sampling point;
S6: performing volume rendering on the density and color value of each sample sampling point to obtain a sample virtual face;
S7: acquiring a real face, and training the density prediction module and the color value prediction module of the second depth network model to be trained according to the real face, the sample virtual face and the sample description text to obtain the trained second depth network model.
The real face may be any photo of a real user.
In the embodiment of the present application, referring to fig. 4, fig. 4 is a schematic diagram of the training architecture of the trained second depth network model provided in the embodiment of the present application; steps S1 to S6 may refer to steps S10 to S50 and are not repeated here. The sample identity feature vector, the sample expression feature vector and the sample texture feature vector are denoted α, β and γ respectively; the sample face appearance vector and the sample face texture vector are denoted ΔZ_s and ΔZ_α respectively; the density is denoted σ and the color value is denoted c. In step S7, the density prediction module and the color value prediction module of the second depth network model to be trained are trained with the sample virtual face and the real face, so as to improve the realism of the virtual faces generated by the trained second depth network model, and are trained with the sample virtual face and the sample description text, so as to improve the accuracy with which the trained second depth network model generates virtual faces according to description texts.
In an alternative embodiment, step S7 includes step S71, which is specifically as follows:
S71: inputting the real face and the sample virtual face into an adversarial network, calculating a first loss function of the adversarial network, and training the adversarial network and the density prediction module and color value prediction module of the second depth network model to be trained according to the first loss function, to obtain a trained adversarial network and a trained second depth network model.
In the embodiment of the present application, the adversarial network includes a discriminator. The real face and the sample virtual face are respectively input into the discriminator to calculate the first loss function of the discriminator, and the network parameters of the discriminator and the model parameters of the second depth network model to be trained are adjusted according to the first loss function until the discriminator cannot distinguish the real face from the sample virtual face, thereby obtaining the trained adversarial network and the trained second depth network model.
Inputting the real face and the sample virtual face into the adversarial network improves the realism of the virtual faces generated by the trained second depth network model.
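As one concrete, purely illustrative realization of this first loss function, the sketch below uses a small binary discriminator with a standard GAN binary cross-entropy loss; the discriminator architecture and the exact loss form are assumptions, not specified by the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

discriminator = nn.Sequential(                     # illustrative image discriminator
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1),
)

def first_loss(real_face, sample_virtual_face):
    """Real faces are labelled 1 and rendered sample virtual faces 0; the generator
    term pushes the rendered face towards being judged real by the discriminator."""
    real_logit = discriminator(real_face)
    fake_logit = discriminator(sample_virtual_face)
    loss_d = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) + \
             F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    loss_g = F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
    return loss_d, loss_g
```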
In an alternative embodiment, step S7 includes steps S72-S73, which are specifically as follows:
S72: inputting the sample virtual face into a trained face attribute classification model to obtain a face attribute classification result;
S73: calculating a second loss function between the face attribute classification result and the attribute category corresponding to the sample description text, and training the density prediction module and the color value prediction module of the second depth network model to be trained according to the second loss function to obtain a trained second depth network model.
The trained face attribute classification model is used for classifying face attribute features to obtain a plurality of face attribute categories.
In the embodiment of the application, the second loss function is a cross entropy loss function, and the model parameters of the second depth network model to be trained are adjusted according to the second loss function until the face attribute classification result is consistent with the attribute category corresponding to the sample description text, so as to obtain the trained second depth network model.
Training the second depth network model to be trained with the face attribute classification result and the attribute category corresponding to the sample description text improves the accuracy with which the trained second depth network model generates virtual faces according to description texts.
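A minimal sketch of this second loss follows, assuming the sample description text has already been parsed into a single attribute class index per sample; the single-label setup, tensor shapes and function name are simplifying assumptions.

```python
import torch.nn.functional as F

def second_loss(attribute_logits, text_attribute_labels):
    """attribute_logits: (B, num_classes), output of the trained face attribute
    classification model on the sample virtual face.
    text_attribute_labels: (B,), attribute class indices derived from the sample
    description text. Cross entropy pulls the classification result towards the
    attribute category implied by the text."""
    return F.cross_entropy(attribute_logits, text_attribute_labels)
```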
In an alternative embodiment, step S7 includes steps S74-S77, specifically as follows:
S74: inputting the sample description text into the text editing layer network to obtain a sample text embedding vector;
S75: inputting the sample virtual face into an image editing layer network to obtain a sample image embedding vector;
S76: inputting the sample text embedding vector and the sample image embedding vector into a linear network layer to obtain a first semantic vector and a second semantic vector;
S77: calculating a third loss function between the first semantic vector and the second semantic vector, and training the density prediction module and the color value prediction module of the second depth network model to be trained according to the third loss function to obtain a trained second depth network model.
The image editing layer network is the image encoder of the CLIP model, and the linear network layer is used for mapping the sample text embedding vector and the sample image embedding vector into a first semantic vector and a second semantic vector with the same vector dimension. The third loss function is an L2 loss function, i.e. the square of the difference between the predicted value and the actual value.
In the embodiment of the application, according to the third loss function, the model parameters of the second depth network model to be trained are adjusted until the first semantic vector and the second semantic vector are similar, and the trained second depth network model is obtained.
Training the second depth network model to be trained with the first semantic vector and the second semantic vector improves the accuracy with which the trained second depth network model generates virtual faces according to description texts.
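A sketch of this third loss is given below: two linear layers project the sample text embedding and the sample image embedding into a common dimension, and the squared difference between the resulting first and second semantic vectors is minimized. The embedding sizes (512 and 768) and the projection width (256) are assumed values, not taken from the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

text_proj = nn.Linear(512, 256)   # sample text embedding  -> first semantic vector
image_proj = nn.Linear(768, 256)  # sample image embedding -> second semantic vector

def third_loss(sample_text_embedding, sample_image_embedding):
    """L2 loss between the two projected semantic vectors (the linear network layer
    maps both embeddings to the same vector dimension before comparison)."""
    first_semantic = text_proj(sample_text_embedding)
    second_semantic = image_proj(sample_image_embedding)
    return F.mse_loss(first_semantic, second_semantic)
```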
Optionally, the second depth network model to be trained may be jointly trained according to the first loss function, the second loss function and the third loss function. The total loss function of the joint training may be expressed as: Loss = w_1·L_GAN + w_2·L_Cls + w_3·L_CLIP, where w_1, w_2 and w_3 represent the weights of the first, second and third loss functions respectively, and L_GAN, L_Cls and L_CLIP represent the first, second and third loss functions respectively. In this way, both the realism of the virtual faces generated by the trained second depth network model and the accuracy of generating them according to the description text are improved.
In an alternative embodiment, the density prediction module includes a first perceptron network and a second perceptron network, and step S30 includes steps S301-S304, which are specifically as follows:
S301: and inputting the space coordinates of a plurality of sampling points in a preset three-dimensional space into a position coding model to obtain a first coordinate vector corresponding to the space coordinates of each sampling point.
The position coding model is used for mapping the space coordinates of each sampling point into a first coordinate vector. In particular, the position coding model may be a Transformer model.
S302: and splicing the first coordinate vector, the identity feature vector, the expression feature vector and the face appearance vector to obtain a first spliced vector.
The first coordinate vector, the identity feature vector, the expression feature vector and the face appearance vector have the same vector dimension.
S303: inputting the first spliced vector into a first perceptron network to obtain a first feature vector;
s304: and inputting the first characteristic vector into a second perceptron network to obtain the density of each sampling point.
In the embodiment of the application, through the first perceptron network and the second perceptron network, the identity features, the expression features and the face appearance features corresponding to the description text of the face picture can be automatically and rapidly mapped to the density for the subsequent rendering of the virtual face.
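The sketch below assembles these steps with a sinusoidal positional encoding standing in for the position coding model (the embodiment says it may be a Transformer) and two plain MLPs as the first and second perceptron networks. The layer widths and conditioning dimensions are placeholders, and for simplicity the spliced vectors are not forced to share one dimension as described in S302.

```python
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs=10):
    """Sinusoidal encoding of coordinates; stands in for the position coding model."""
    out = [x]
    for i in range(n_freqs):
        out += [torch.sin((2.0 ** i) * x), torch.cos((2.0 ** i) * x)]
    return torch.cat(out, dim=-1)

class DensityModule(nn.Module):
    def __init__(self, coord_dim=63, cond_dim=80 + 64 + 80, hidden=256):
        super().__init__()
        # first perceptron network: encoded coordinates + identity/expression/appearance
        self.mlp1 = nn.Sequential(nn.Linear(coord_dim + cond_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        # second perceptron network: first feature vector -> density
        self.mlp2 = nn.Linear(hidden, 1)

    def forward(self, coords, identity, expression, appearance):
        first_coord = positional_encoding(coords)          # first coordinate vector
        first_splice = torch.cat([first_coord, identity, expression, appearance], dim=-1)
        first_feature = self.mlp1(first_splice)             # first feature vector
        density = torch.relu(self.mlp2(first_feature))      # non-negative density
        return density, first_feature
```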
In an alternative embodiment, the color value prediction module includes a third perceptron network, step S40, including steps S401-S403, specifically as follows:
S401: and inputting the view angle information of each sampling point into a position coding model to obtain a second coordinate vector corresponding to the view angle information of each sampling point.
The position coding model is used for mapping the view angle information of each sampling point into a second coordinate vector, where the second coordinate vector has the same vector dimension as the first coordinate vector. In particular, the position coding model may be a Transformer model.
S402: and splicing the second coordinate vector, the texture feature vector, the face texture vector and the first feature vector to obtain a second spliced vector.
The second coordinate vector, the texture feature vector, the face texture vector and the first feature vector have the same vector dimension.
S403: and inputting the second spliced vector into a third perceptron network to obtain the color value of each sampling point.
In the embodiment of the application, the texture features of the face picture and the face texture features can be automatically and quickly mapped into the color values through the third perceptron network so as to render the subsequent virtual face.
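Continuing the same illustrative sketch (and reusing the positional_encoding helper from the density sketch above), the color value prediction module can be written as a single MLP acting on the second spliced vector; the dimensions are again placeholders rather than values from the embodiment.

```python
import torch
import torch.nn as nn

class ColorModule(nn.Module):
    def __init__(self, view_dim=27, cond_dim=80 + 80, feat_dim=256, hidden=128):
        super().__init__()
        # third perceptron network: second spliced vector -> RGB color value
        self.mlp3 = nn.Sequential(nn.Linear(view_dim + cond_dim + feat_dim, hidden),
                                  nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, view_dirs, texture, face_texture, first_feature):
        second_coord = positional_encoding(view_dirs, n_freqs=4)  # second coordinate vector
        second_splice = torch.cat([second_coord, texture, face_texture, first_feature], dim=-1)
        return torch.sigmoid(self.mlp3(second_splice))            # color value in [0, 1]
```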
Referring to fig. 5, fig. 5 is a flow chart of a virtual face live broadcast method provided in an embodiment of the present application, which may be executed by the anchor client, and the method includes the following steps:
S100: identifying a face image of the anchor from a live video frame of the anchor;
S200: acquiring a description text of the face image in response to a text editing operation performed by the anchor on the face image;
S300: generating a virtual face of the anchor according to the face image and the description text by adopting the above virtual face generation method;
S400: and fusing the virtual face with the live video frame of the anchor, and playing the fused live video frame.
In the embodiment of the present application, face recognition technology may be used to extract the anchor's face image from the anchor's live video frame. The anchor can describe the face image in text, thereby editing the virtual face. For example, if the anchor's current hair color is black, the description text "hair turns yellow" may be edited so that the hair in the generated virtual face is yellow.
By fusing the virtual face with the anchor's live video frame, the anchor can broadcast live with the virtual face, which makes the live broadcast more engaging.
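A per-frame sketch of this flow is given below, with `detect_face` and `generate_virtual_face` as placeholders for the face recognition step (S100) and the virtual face generation method above (S300); the box format and the simple paste-back fusion are simplifying assumptions.

```python
import numpy as np

def render_live_frame(frame, description_text, detect_face, generate_virtual_face):
    """frame: (H, W, 3) array for one live video frame. Returns the fused frame in
    which the anchor's face region is replaced by the generated virtual face."""
    box = detect_face(frame)                    # assumed to return (x, y, w, h) or None
    if box is None:
        return frame                            # no face in this frame, play it as is
    x, y, w, h = box
    face_image = frame[y:y + h, x:x + w]
    # assumed to return an (h, w, 3) patch matching the detected face box
    virtual_face = generate_virtual_face(face_image, description_text)
    fused = frame.copy()
    fused[y:y + h, x:x + w] = np.asarray(virtual_face)   # S400: fuse and play
    return fused
```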
Referring to fig. 6, fig. 6 is a schematic structural diagram of a virtual face generating apparatus according to an embodiment of the present application. The apparatus may be implemented as all or part of a computer device by software, hardware, or a combination of both. The virtual face generating device 5 provided in the embodiment of the present application includes:
The face picture obtaining module 51 is configured to obtain a face picture, input the face picture to the face attribute decoupling model, and obtain an identity feature vector, an expression feature vector and a texture feature vector;
the descriptive text obtaining module 52 is configured to obtain descriptive text of a face picture, and input the descriptive text to a trained first depth network model to obtain a face appearance vector and a face texture vector;
the density obtaining module 53 is configured to input spatial coordinates, identity feature vectors, expression feature vectors, and face appearance vectors of a plurality of sampling points in a preset three-dimensional space to a density prediction module of the trained second depth network model, so as to obtain a density of each sampling point;
the color value obtaining module 54 is configured to input the view angle information, the texture feature vector and the face texture vector of each sampling point to the color value prediction module of the trained second depth network model, so as to obtain a color value of each sampling point;
the virtual face obtaining module 55 is configured to perform volume rendering on the density and the color value of each sampling point to obtain a virtual face.
It should be noted that, when the virtual face generating apparatus provided in the foregoing embodiment executes the virtual face generation method, the division of the above functional modules is merely illustrative; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the virtual face generating apparatus provided in the foregoing embodiment and the virtual face generation method embodiment belong to the same concept; the detailed implementation process is described in the method embodiment and is not repeated here.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a virtual face live broadcast device according to an embodiment of the present application. The apparatus may be implemented as all or part of a computer device by software, hardware, or a combination of both. The virtual face live broadcast device 7 provided in the embodiment of the present application includes:
a face image recognition module 71, configured to recognize the face image of the anchor from a live video frame of the anchor;
a text obtaining module 72, configured to obtain descriptive text of the face image in response to a text editing operation of the face image by the anchor;
a virtual face generation module 73, configured to generate a virtual face of the anchor according to the face image and the description text by adopting the above virtual face generation method;
the video frame playing module 74 is configured to fuse the virtual face with the live video frame of the anchor, and play the fused live video frame.
It should be noted that, when the virtual face live broadcast device provided in the foregoing embodiment executes the virtual face live broadcast method, the division of the above functional modules is merely illustrative; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the virtual face live broadcast device provided in the foregoing embodiment and the virtual face live broadcast method embodiment belong to the same concept; the detailed implementation process is described in the method embodiment and is not repeated here.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device provided in the present application. As shown in fig. 8, the electronic device 21 may include: a processor 210, a memory 211, and a computer program 212 stored in the memory 211 and executable on the processor 210, for example: virtual face generating program and virtual face live broadcasting program; the processor 210, when executing the computer program 212, implements the steps of the embodiments described above.
The processor 210 may include one or more processing cores. The processor 210 uses various interfaces and lines to connect various parts of the electronic device 21, and performs various functions of the electronic device 21 and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 211 and invoking data stored in the memory 211. Optionally, the processor 210 may be implemented in at least one hardware form of digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA) or programmable logic array (Programmable Logic Array, PLA). The processor 210 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is used for rendering and drawing the content to be displayed on the touch display screen; and the modem is used to handle wireless communications. It can be understood that the modem may also not be integrated into the processor 210 and may be implemented by a separate chip.
The Memory 211 may include a random access Memory (Random Access Memory, RAM) or a Read-Only Memory (Read-Only Memory). Optionally, the memory 211 includes a non-transitory computer readable medium (non-transitory computer-readable storage medium). Memory 211 may be used to store instructions, programs, code sets, or instruction sets. The memory 211 may include a storage program area and a storage data area, wherein the storage program area may store instructions for implementing an operating system, instructions for at least one function (such as touch instructions, etc.), instructions for implementing the above-described various method embodiments, etc.; the storage data area may store data or the like referred to in the above respective method embodiments. The memory 211 may optionally also be at least one storage device located remotely from the aforementioned processor 210.
The embodiment of the present application further provides a computer storage medium, where a plurality of instructions may be stored, where the instructions are adapted to be loaded and executed by a processor, and the specific implementation procedure may refer to the specific description of the foregoing embodiment, and details are not repeated herein.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other manners. For example, the apparatus/terminal device embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the steps of each method embodiment described above may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, executable files or in some intermediate form, etc.
The present invention is not limited to the above-described embodiments. Any modifications or variations that do not depart from the spirit and scope of the present invention are intended to be included within the scope of the claims and their equivalents.

Claims (13)

1. A virtual face generation method, characterized by comprising the following steps:
acquiring a face picture, inputting the face picture into a face attribute decoupling model, and acquiring an identity feature vector, an expression feature vector and a texture feature vector;
acquiring a description text of the face picture, and inputting the description text into a trained first depth network model to obtain a face appearance vector and a face texture vector;
inputting space coordinates of a plurality of sampling points in a preset three-dimensional space, the identity feature vector, the expression feature vector and the face appearance vector into a density prediction module of a trained second depth network model to obtain the density of each sampling point;
inputting the view angle information, the texture feature vector and the face texture vector of each sampling point to a color value prediction module of a trained second depth network model to obtain a color value of each sampling point;
And performing volume rendering on the density and the color value of each sampling point to obtain a virtual face.
2. The virtual face generation method according to claim 1, wherein:
the density prediction module comprises a first perceptron network and a second perceptron network;
the step of inputting the spatial coordinates of a plurality of sampling points in a preset three-dimensional space, the identity feature vector, the expression feature vector and the face appearance vector to a density prediction module of a trained second depth network model to obtain the density of each sampling point comprises the following steps:
inputting the space coordinates of a plurality of sampling points in a preset three-dimensional space into a position coding model to obtain a first coordinate vector corresponding to the space coordinates of each sampling point;
splicing the first coordinate vector, the identity feature vector, the expression feature vector and the face appearance vector to obtain a first spliced vector;
inputting the first spliced vector to the first perceptron network to obtain a first feature vector;
and inputting the first feature vector into the second perceptron network to obtain the density of each sampling point.
3. The virtual face generation method according to claim 2, wherein:
the color value prediction module comprises a third perceptron network;
the step of inputting the view angle information, the texture feature vector and the face texture vector of each sampling point to a color value prediction module of a trained second depth network model to obtain a color value of each sampling point includes:
inputting the view angle information of each sampling point into a position coding model to obtain a second coordinate vector corresponding to the view angle information of each sampling point;
splicing the second coordinate vector, the texture feature vector, the face texture vector and the first feature vector to obtain a second spliced vector;
and inputting the second spliced vector to the third perceptron network to obtain the color value of each sampling point.
4. The virtual face generation method according to claim 1, wherein:
the trained first depth network model comprises a text editing layer network, an appearance extraction layer network and a texture extraction layer network;
the step of obtaining the description text of the face picture, inputting the description text into a trained first depth network model to obtain a face appearance vector and a face texture vector, comprising the following steps:
inputting the description text into the text editing layer network to obtain a text embedding vector;
inputting the text embedding vector into the appearance extraction layer network to obtain a face appearance vector;
and inputting the text embedding vector into the texture extraction layer network to obtain a face texture vector.
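For illustration only and not part of the claim, the trained first depth network model of claim 4 can be read as a text encoder followed by two extraction heads; the token embedding with mean pooling below is only a placeholder for whatever text editing layer network the patent actually uses, and all dimensions are assumptions.

    import torch
    import torch.nn as nn

    class TextToFaceVectors(nn.Module):
        def __init__(self, vocab_size=30000, embed_dim=512, app_dim=64, face_tex_dim=64):
            super().__init__()
            self.token_embed = nn.Embedding(vocab_size, embed_dim)   # stand-in text editing layer network
            self.appearance_head = nn.Linear(embed_dim, app_dim)     # appearance extraction layer network
            self.texture_head = nn.Linear(embed_dim, face_tex_dim)   # texture extraction layer network

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) tokenized description text
            text_embedding = self.token_embed(token_ids).mean(dim=1) # text embedding vector
            face_appearance_vec = self.appearance_head(text_embedding)
            face_texture_vec = self.texture_head(text_embedding)
            return face_appearance_vec, face_texture_vec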
5. The virtual face generation method according to any one of claims 1 to 4, wherein:
the training process of the trained second depth network model comprises the following steps:
obtaining a sample description text of a sample face picture;
inputting the sample face picture into the face attribute decoupling model to obtain a sample identity feature vector, a sample expression feature vector and a sample texture feature vector;
inputting the sample description text into the trained first depth network model to obtain a sample face appearance vector and a sample face texture vector;
inputting space coordinates of a plurality of sample sampling points in a preset three-dimensional space, the sample identity feature vector, the sample expression feature vector and the sample face appearance vector into a density prediction module of a second depth network model to be trained, and obtaining the density of each sample sampling point;
inputting the view angle information of each sample sampling point, the sample texture feature vector and the sample face texture vector to a color value prediction module of the second depth network model to be trained to obtain a color value of each sample sampling point;
performing volume rendering on the density and the color value of each sample sampling point to obtain a sample virtual face;
and obtaining a real face, training a density prediction module and a color value prediction module of the second depth network model to be trained according to the real face, the sample virtual face and the sample description text, and obtaining a trained second depth network model.
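For illustration only and not part of the claim, the final training step of claim 5 can be pictured as one optimization step over a weighted sum of loss terms; claims 6 to 8 name three such terms. The loss helpers below are placeholders with assumed signatures and weights, and one possible form of each appears after claims 6, 7 and 8.

    # hypothetical training step; the three loss helpers stand in for the
    # loss terms of claims 6, 7 and 8 (see the sketches after those claims)
    def training_step(optimizer, real_face, sample_virtual_face, sample_text, w=(1.0, 1.0, 1.0)):
        loss = (w[0] * adversarial_loss(real_face, sample_virtual_face)
                + w[1] * attribute_loss(sample_virtual_face, sample_text)
                + w[2] * semantic_loss(sample_virtual_face, sample_text))
        optimizer.zero_grad()
        loss.backward()     # gradients flow into the density and color value prediction modules
        optimizer.step()
        return loss.item()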
6. The virtual face generation method of claim 5, wherein:
the step of obtaining a real face, training a density prediction module and a color value prediction module of the second depth network model to be trained according to the real face, the sample virtual face and the sample description text, and obtaining a trained second depth network model, comprising the following steps:
inputting the real face and the sample virtual face into an adversarial network, calculating a first loss function of the adversarial network, training the adversarial network and the density prediction module and the color value prediction module of the second depth network model to be trained according to the first loss function, and obtaining a trained adversarial network and a trained second depth network model.
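For illustration only and not part of the claim, the adversarial training of claim 6 can be sketched with a small convolutional discriminator and a non-saturating GAN loss; the claim only speaks of a first loss function of the adversarial network, so the specific loss form and architecture below are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FaceDiscriminator(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(128, 1, 4, stride=2, padding=1))   # real/fake score map

        def forward(self, img):
            return self.net(img)

    def discriminator_loss(disc, real_face, sample_virtual_face):
        # pushes real scores up and scores of the (detached) rendered face down
        real_logits = disc(real_face)
        fake_logits = disc(sample_virtual_face.detach())
        return F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()

    def generator_adversarial_loss(disc, sample_virtual_face):
        # non-saturating term back-propagated into the density and color value prediction modules
        return F.softplus(-disc(sample_virtual_face)).mean()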
7. The virtual face generation method of claim 5, wherein:
the step of obtaining a real face, training a density prediction module and a color value prediction module of the second depth network model to be trained according to the real face, the sample virtual face and the sample description text, and obtaining a trained second depth network model, comprising the following steps:
inputting the sample virtual face into a trained face attribute classification model to obtain a face attribute classification result;
and calculating a second loss function between the face attribute classification result and the attribute category corresponding to the sample description text, and training a density prediction module and a color value prediction module of the second depth network model to be trained according to the second loss function to obtain a trained second depth network model.
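For illustration only and not part of the claim, the second loss function of claim 7 can be sketched as a cross-entropy between the output of a trained face attribute classification model and per-attribute categories parsed from the sample description text; the classifier interface and the label extraction are assumptions.

    import torch
    import torch.nn.functional as F

    def attribute_classification_loss(attr_classifier, sample_virtual_face, text_attribute_labels):
        # attr_classifier: frozen, trained face attribute classification model
        # sample_virtual_face: (batch, 3, H, W) rendered sample virtual face
        # text_attribute_labels: (batch, num_attributes) attribute categories derived
        #                        from the sample description text (assumed preprocessing)
        logits = attr_classifier(sample_virtual_face)        # (batch, num_attributes, num_classes)
        return F.cross_entropy(logits.flatten(0, 1), text_attribute_labels.flatten())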
8. The virtual face generation method of claim 5, wherein:
the step of obtaining a real face, training a density prediction module and a color value prediction module of the second depth network model to be trained according to the real face, the sample virtual face and the sample description text, and obtaining a trained second depth network model, comprising the following steps:
inputting the sample description text into a text editing layer network to obtain a sample text embedding vector;
inputting the sample virtual face into an image editing layer network to obtain a sample image embedding vector;
inputting the sample text embedding vector and the sample image embedding vector into a linear network layer to obtain a first semantic vector and a second semantic vector;
and calculating a third loss function between the first semantic vector and the second semantic vector, and training a density prediction module and a color value prediction module of the second depth network model to be trained according to the third loss function to obtain a trained second depth network model.
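For illustration only and not part of the claim, the third loss function of claim 8 resembles a CLIP-style text-image alignment: both embedding vectors are projected by a linear network layer into one semantic space and their distance is penalized. The projection dimensions and the cosine form of the distance are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SemanticConsistencyLoss(nn.Module):
        def __init__(self, text_dim=512, image_dim=512, shared_dim=256):
            super().__init__()
            # linear network layer mapping both embeddings into the shared semantic space
            self.text_proj = nn.Linear(text_dim, shared_dim)
            self.image_proj = nn.Linear(image_dim, shared_dim)

        def forward(self, sample_text_embedding, sample_image_embedding):
            first_semantic = F.normalize(self.text_proj(sample_text_embedding), dim=-1)
            second_semantic = F.normalize(self.image_proj(sample_image_embedding), dim=-1)
            # third loss function: one minus cosine similarity of the two semantic vectors
            return (1.0 - (first_semantic * second_semantic).sum(dim=-1)).mean()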
9. A virtual face live broadcast method, characterized by comprising the following steps:
identifying a face image of a host from live video frames of the host;
responding to the text editing operation of the anchor on the face image, and acquiring the description text of the face image;
generating a virtual face of the anchor from the face image and the description text by using the virtual face generation method according to any one of claims 1 to 8;
and fusing the virtual face with the live video frame of the anchor, and playing the fused live video frame.
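For illustration only and not part of the claim, the per-frame flow of claim 9 reduces to: detect the anchor's face region, generate a virtual face from that region and the edited description text, and paste the result back into the frame before playback. The face detector, the generator call and the simple alpha blend below are assumed stand-ins for the method of claims 1 to 8.

    import numpy as np

    def fuse_virtual_face(frame, virtual_face, bbox, alpha=0.9):
        # frame: (H, W, 3) uint8 live video frame; bbox: (x, y, w, h) of the anchor's face
        # virtual_face: (h, w, 3) rendered virtual face already resized to the bbox
        x, y, w, h = bbox
        region = frame[y:y + h, x:x + w].astype(np.float32)
        blended = alpha * virtual_face.astype(np.float32) + (1.0 - alpha) * region
        frame[y:y + h, x:x + w] = blended.astype(np.uint8)
        return frame

    # hypothetical per-frame loop (detect_face_bbox and generate_virtual_face are assumed helpers)
    # bbox = detect_face_bbox(frame)
    # face_crop = frame[bbox[1]:bbox[1] + bbox[3], bbox[0]:bbox[0] + bbox[2]]
    # virtual = generate_virtual_face(face_crop, description_text)   # claims 1 to 8
    # frame = fuse_virtual_face(frame, virtual, bbox)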
10. A virtual face generation apparatus, comprising:
the face picture acquisition module is used for acquiring a face picture, inputting the face picture into a face attribute decoupling model, and acquiring an identity feature vector, an expression feature vector and a texture feature vector;
the text acquisition module is used for acquiring a description text of the face picture, and inputting the description text into the trained first depth network model to obtain a face appearance vector and a face texture vector;
the density obtaining module is used for inputting the space coordinates of a plurality of sampling points in a preset three-dimensional space, the identity feature vector, the expression feature vector and the face appearance vector into the density prediction module of the trained second depth network model to obtain the density of each sampling point;
the color value obtaining module is used for inputting the view angle information of each sampling point, the texture feature vector and the face texture vector to the color value prediction module of the trained second depth network model to obtain the color value of each sampling point;
and the virtual face obtaining module is used for carrying out volume rendering on the density and the color value of each sampling point to obtain a virtual face.
11. A virtual face live broadcast device, comprising:
the face image recognition module is used for recognizing the face image of the anchor from live video frames of the anchor;
the description text acquisition module is used for responding to the text editing operation of the anchor on the face image and acquiring the description text of the face image;
a virtual face generating module, configured to generate a virtual face of the anchor from the face image and the description text by using the virtual face generation method according to any one of claims 1 to 8;
and the video frame playing module is used for fusing the virtual face with the live video frame of the anchor and playing the fused live video frame.
12. An electronic device, comprising: a processor, a memory, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 8 or of the method according to claim 9.
13. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8 or of the method according to claim 9.
CN202310304671.6A 2023-03-24 2023-03-24 Virtual face generation method, virtual face live broadcast method and device Pending CN116363245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310304671.6A CN116363245A (en) 2023-03-24 2023-03-24 Virtual face generation method, virtual face live broadcast method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310304671.6A CN116363245A (en) 2023-03-24 2023-03-24 Virtual face generation method, virtual face live broadcast method and device

Publications (1)

Publication Number Publication Date
CN116363245A true CN116363245A (en) 2023-06-30

Family

ID=86913980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310304671.6A Pending CN116363245A (en) 2023-03-24 2023-03-24 Virtual face generation method, virtual face live broadcast method and device

Country Status (1)

Country Link
CN (1) CN116363245A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274450A (en) * 2023-11-21 2023-12-22 长春职业技术学院 Animation image generation system and method based on artificial intelligence
CN117274450B (en) * 2023-11-21 2024-01-26 长春职业技术学院 Animation image generation system and method based on artificial intelligence

Similar Documents

Publication Publication Date Title
US11748934B2 (en) Three-dimensional expression base generation method and apparatus, speech interaction method and apparatus, and medium
US10609332B1 (en) Video conferencing supporting a composite video stream
CN110418095B (en) Virtual scene processing method and device, electronic equipment and storage medium
CN111260545A (en) Method and device for generating image
CN110401810B (en) Virtual picture processing method, device and system, electronic equipment and storage medium
CN109472764B (en) Method, apparatus, device and medium for image synthesis and image synthesis model training
CN111178191A (en) Information playing method and device, computer readable storage medium and electronic equipment
CN110868554B (en) Method, device and equipment for changing faces in real time in live broadcast and storage medium
CN111182350B (en) Image processing method, device, terminal equipment and storage medium
CN114025219A (en) Rendering method, device, medium and equipment for augmented reality special effect
CN107995482A (en) The treating method and apparatus of video file
CN112581635B (en) Universal quick face changing method and device, electronic equipment and storage medium
CN113709543A (en) Video processing method and device based on virtual reality, electronic equipment and medium
CN116363245A (en) Virtual face generation method, virtual face live broadcast method and device
CN117036583A (en) Video generation method, device, storage medium and computer equipment
CN113965773A (en) Live broadcast display method and device, storage medium and electronic equipment
CN113095206A (en) Virtual anchor generation method and device and terminal equipment
CN110267079B (en) Method and device for replacing human face in video to be played
CN114519889A (en) Cover image detection method and device for live broadcast room, computer equipment and medium
CN114358112A (en) Video fusion method, computer program product, client and storage medium
CN114202615A (en) Facial expression reconstruction method, device, equipment and storage medium
CN113938696A (en) Live broadcast interaction method and system based on user-defined virtual gift and computer equipment
CN113411537A (en) Video call method, device, terminal and storage medium
CN111192305B (en) Method and apparatus for generating three-dimensional image
CN111510769A (en) Video image processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination