CN117171392A - Virtual anchor generation method and system based on neural radiance field and hidden attributes


Info

Publication number: CN117171392A
Application number: CN202311094348.7A
Authority: CN (China)
Prior art keywords: virtual anchor, voice, module, image, background
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: Zheng Bowen (郑博文), Dong Jianwu (董建武), Wu Lintao (吴林涛), Huang Meng (黄萌)
Assignee (current and original): Beijing Scistor Technologies Co., Ltd.
Priority date / Filing date: 2023-08-28

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a virtual anchor generation method and system based on neural radiance fields and hidden attributes, belonging to the technical field of artificial intelligence. The character image of the virtual anchor is determined according to requirements and synthesized by the virtual anchor generation system. First, the face feature extraction and construction module synthesizes the three-dimensional face of the virtual anchor. The voice synthesis module then converts the text to be broadcast into the voice of the virtual anchor. Voice, lip-movement, head-movement and eye-blink features of the virtual anchor are extracted, and the improved NeRF network module combines these features to synthesize the video of the virtual anchor. Finally, the background of the synthesized video is replaced to produce the final virtual anchor. The virtual anchor generated by this method is efficient, stable and more realistic, and is suitable for virtual anchor production in different fields.

Description

Virtual anchor generation method and system based on neural radiance field and hidden attributes
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to text, voice and video analysis and synthesis, and in particular relates to a virtual anchor generation method and system based on a neural radiance field and hidden attributes.
Background
With the continuous development of internet technology, live streaming is being applied ever more widely in social, entertainment, education and other fields. The anchor is the core figure of a live broadcast and also its main attraction. With virtual anchor technology, the voice, video, actions and so on of a virtual anchor can be generated through artificial intelligence and machine learning, achieving a realistic live-broadcast experience. In the field of radio broadcasting, the technology can generate virtual broadcasters; in live television, virtual hosts; in live entertainment, virtual network anchors; and in live education, virtual teachers that provide online teaching for students, and so on.
Virtual anchor generation is a new type of digital content generation technique: machine learning and deep learning algorithms based on artificial intelligence take large amounts of audio, video and text data as input to a model and generate content similar to the original data. The field involves artificial intelligence, machine learning, speech recognition, image processing, natural language processing and so on.
The virtual anchor analyzes the input text data with artificial-intelligence natural language processing techniques, captures the relevant context information, and can generate voice with consistent timbre, intonation, speaking rate and speech rhythm.
However, the character image of the virtual anchor remains a difficulty to be overcome at present. Currently generated virtual anchors still lack a lively, natural and harmonious facial expression. Because human facial expressions are extremely rich, character images generated with existing intelligent algorithms tend to have stiff, inflexible facial expressions, which gives people a sense of incongruity, so that it is difficult for viewers to feel genuine emotional resonance when facing such an emotionless puppet. These factors come down to small, easily overlooked details, referred to herein as hidden attributes.
Hidden attributes are the concealed attributes of character features that play an important role in generating a virtual anchor. The explicit character attributes mainly include: age, gender, pose angle, expression, whether glasses are worn, ethnicity, etc. The hidden character attributes are the less obvious features, mainly including: emotion, eye movement, blink frequency, body and hand gestures, micro-expressions, etc. These hidden attributes can be guided by other, deeper-dimensional information and provide richer information for synthesizing the character image, making the generated virtual anchor more realistic and natural, which is of great value for applications of generated virtual anchors.
NeRF (Neural Radiance Fields) denotes the neural radiance field, a technique for three-dimensional reconstruction using a deep neural network model. By modeling the anchor's facial expressions, body movements and other characteristics, it can generate a lifelike virtual anchor model. NeRF is highly extensible, and the generation quality of the model can be improved through continued training and refinement. Therefore, hidden character features are taken as input and used for virtual anchor rendering via NeRF, and the body posture, expression and so on of the virtual anchor can be fine-tuned, improving the quality of the generated virtual anchor. The original NeRF, however, requires a large amount of data to learn and a long time to generate a three-dimensional image.
Disclosure of Invention
The invention provides a virtual anchor generation method and system based on a neural radiance field and hidden attributes. It aims to improve the realism of the virtual anchor and broaden its application range and scenarios by learning and training the hidden attributes of the character on top of the NeRF neural radiance field technique, so as to realize a more vivid and realistic virtual anchor.
The virtual anchor generation system based on the neural radiance field and hidden attributes comprises a face feature extraction and construction module, a voice synthesis module, a voice feature extraction module, a hidden attribute feature extraction module, an improved NeRF network module and a background replacement module; the background replacement module comprises a background segmentation module and an image harmonization module.
The virtual anchor generation method based on the neural radiance field and hidden attributes specifically comprises the following steps:
Step one, constructing the character image of the virtual anchor according to actual needs, including character video data, voice, text data and background data, which serve as the input of the virtual anchor generation system.
The languages used by the virtual anchor are determined, and voice and text data sets of the corresponding languages are acquired for the different languages as model training data for the virtual anchor's semantic analysis and voice synthesis.
Step two, the face feature extraction and construction module performs face feature extraction and construction on the character video data to generate the three-dimensional face of the virtual anchor;
The face feature extraction and construction flow of the virtual anchor is divided into face parsing, 3DMM face feature extraction and face reconstruction, specifically:
Step 201, face parsing decomposes the character video data into face components through deep learning and obtains the corresponding facial features;
The face components include skin, hair, eyes, eyebrows, nose, mouth, etc.
Step 202, 3DMM face feature extraction performs three-dimensional feature encoding of the facial features of the different face parts, and the combination of reference shapes and texture maps that best represents the desired face is selected to synthesize the face;
Step 203, the reference shapes and texture maps of the face are weighted and combined via a database to generate the reconstructed three-dimensional face.
Step three, the voice synthesis network is trained with voice data; the text data are input to the text transcription module for text front-end processing; the processed text is input to the trained voice synthesis network model, and the synthesized voice of the virtual anchor is obtained after voice synthesis.
Text front-end processing means that punctuation marks, numbers, spaces and the like are removed from the input text, the text is split into individual words according to semantic understanding, each word is converted into phonemes, and voice tags such as tone and pitch are marked.
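A minimal sketch of such a text front end is given below, assuming a small hypothetical phoneme lexicon; a real system would use a trained grapheme-to-phoneme model and language-specific tone rules.

    import re

    # Hypothetical toy lexicon; a real front end would use a trained
    # grapheme-to-phoneme model for the chosen language.
    LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

    def text_front_end(text):
        # Remove punctuation, digits and redundant spaces.
        cleaned = re.sub(r"[^\w\s]|\d", " ", text).lower()
        words = cleaned.split()
        phonemes = []
        for w in words:
            # Fall back to spelling unknown words letter by letter.
            phonemes.extend(LEXICON.get(w, list(w.upper())))
        # Attach placeholder prosody tags (tone / pitch) to each phoneme.
        return [(p, {"tone": "neutral", "pitch": "mid"}) for p in phonemes]

    print(text_front_end("Hello, world 123!"))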
Step four, the voice feature extraction module performs feature extraction on the synthesized voice of the virtual anchor; at the same time, the hidden attribute feature extraction module, combined with the three-dimensional face, extracts the hidden attribute feature information in the video data, and all extracted feature information is output to the improved NeRF network module.
Voice feature extraction: feature extraction is performed on the synthesized voice of the virtual anchor; feature information such as spectrum, intonation and pitch is extracted and mapped to corresponding discrete values.
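The following is a minimal sketch of this kind of feature extraction using librosa, which is assumed to be available here (the detailed description names a DeepSpeech-style extractor); the bin counts and pitch range are illustrative.

    import numpy as np
    import librosa

    def extract_voice_features(wav_path, sr=16000, n_bins=64):
        y, sr = librosa.load(wav_path, sr=sr)
        # Spectral features: log-mel spectrogram frames (the "spectrum" cue).
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
        log_mel = librosa.power_to_db(mel)
        # Pitch contour (the intonation / pitch cue).
        f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
        # Map the continuous values to corresponding discrete bins.
        mel_ids = np.digitize(log_mel, np.linspace(log_mel.min(), log_mel.max(), n_bins))
        f0_ids = np.digitize(f0, np.linspace(60, 400, n_bins))
        return mel_ids, f0_ids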
Explicit attribute feature extraction: lip movement, facial movement and expression feature data that are strongly correlated with the voice data are extracted from the video data using the 3DMM and output to the improved NeRF network module as explicit attributes;
Hidden attribute feature extraction: attributes weakly correlated with the voice data, i.e. attributes related to the voice context or other attributes related to the personalized talking style, including head movement, blinking and the like; the motion of the relevant parts in the video data is extracted through the constructed three-dimensional face model and output to the improved NeRF network module as hidden attributes.
Step five, the improved NeRF network module models the static scene, dynamic head and dynamic torso of the virtual anchor according to the voice feature information and the hidden attribute feature information to obtain the synthesized video of the virtual anchor;
the method comprises the following steps:
(1) When the improved NeRF network performs static scene modeling, the MLP multi-layer perceptron is reduced and replaced with linear interpolation to maintain the reconstructed static information at each static 3D position, so that the features of the 3D scene are stored in a trainable static-scene grid structure.
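A minimal PyTorch sketch of this idea follows: a trainable 3D feature grid queried by trilinear interpolation instead of a large MLP. The class and parameter names are illustrative, not taken from the patent.

    import torch
    import torch.nn.functional as F

    class StaticFeatureGrid(torch.nn.Module):
        """Trainable 3D grid; features at a query point are obtained by trilinear
        interpolation (grid_sample) instead of evaluating a deep MLP."""
        def __init__(self, resolution=128, feat_dim=16):
            super().__init__()
            # Grid layout (1, C, D, H, W) so that grid_sample can be used directly.
            self.grid = torch.nn.Parameter(
                torch.randn(1, feat_dim, resolution, resolution, resolution) * 0.01)

        def forward(self, xyz):
            # xyz: (N, 3) query positions normalised to [-1, 1].
            pts = xyz.view(1, -1, 1, 1, 3)                    # (1, N, 1, 1, 3)
            feats = F.grid_sample(self.grid, pts, align_corners=True)
            return feats.view(self.grid.shape[1], -1).t()     # (N, C) per-point features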
(2) When the improved NeRF network performs dynamic head modeling, the high-dimensional character audio-video processing network is decomposed into three low-dimensional trainable feature grids, namely a lip movement model, a head movement model and an eye blink model. To synchronize the audio with the motion models, the audio-spatial encoding module is decomposed into a 3D spatial grid and a 2D audio grid, i.e. the audio and spatial representations are decomposed into two grids. While each motion model maintains static spatial coordinates in 3D, the audio dynamics are encoded into low-dimensional "coordinates".
When constructing the relationship between the explicit lip movement and the audio, the mouth movements are synchronized directly with the embedded audible utterance. Specifically, a CNN audio encoder E_a is used to extract phoneme features f_a from the input audio:
f_a = E_a(a)
where a denotes the input audio data.
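A minimal sketch of such a CNN audio encoder E_a in PyTorch is shown below; the input representation (mel-spectrogram frames) and the layer sizes are assumptions for illustration.

    import torch
    import torch.nn as nn

    class AudioEncoder(nn.Module):
        """E_a: maps a window of mel-spectrogram frames to a phoneme feature f_a."""
        def __init__(self, n_mels=80, feat_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(n_mels, 128, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.proj = nn.Linear(128, feat_dim)

        def forward(self, a):                 # a: (B, n_mels, T) audio window
            h = self.net(a).squeeze(-1)       # (B, 128)
            return self.proj(h)               # f_a: (B, feat_dim)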
A contrastive learning strategy is used to align the audio features with the mouth features: time-aligned audio and mouth features (f_a, f_m) are treated as positive pairs, while misaligned features are treated as negative pairs. Contrastive learning is performed with a binary cross-entropy loss, so that the distance between temporally aligned pairs becomes smaller than that of the misaligned negative pairs.
τ_con denotes the binary cross-entropy loss between lip and voice features; d(f_m, f_a) denotes the cosine distance of a positive (aligned) pair, and the corresponding distance for a misaligned pair denotes the cosine distance of the negative pair.
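A minimal sketch of such a contrastive synchronization loss follows, assuming batches of time-ordered mouth and audio features; forming negatives by rolling the audio features is an illustrative choice, not specified by the patent.

    import torch
    import torch.nn.functional as F

    def sync_loss(f_m, f_a):
        """f_m, f_a: (B, D) mouth and audio features from the same clip, in time order.
        Positive pairs are time-aligned rows; negatives are built by shifting the
        audio features by one step."""
        pos = F.cosine_similarity(f_m, f_a, dim=-1)                         # aligned pairs
        neg = F.cosine_similarity(f_m, torch.roll(f_a, 1, dims=0), dim=-1)  # misaligned pairs
        scores = torch.cat([pos, neg])
        labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
        # Binary cross-entropy pushes aligned pairs closer than misaligned ones.
        return F.binary_cross_entropy_with_logits(scores, labels)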
The synchronization of the audio with the hidden-attribute blink frequency and head pose proceeds as follows:
A controllable probabilistic model is used for blink and head-pose movements. Given a facial-attribute sequence h_{1:T} (head pose or blink) of length T and a conditioning audio sequence a_{1:T'} of length T', the facial-attribute sequence h_{T+1:T'} is generated by embedding prediction over frames T to T'. The facial-attribute sequence h_{T+1:T'} is built in two parts: (1) latent hidden-attribute space construction: a Transformer-VAE with a Gaussian Process (GP) prior is trained on a large dataset to establish the mapping between input facial-attribute sequences and a hidden-attribute space Z; (2) pose and blink space construction: a fine-tuned cross-modal encoder embeds the head-pose BOPs of the selected person and the blink-frequency audio into the hidden-attribute space Z.
After obtaining the generated head pose, the blink feature f_e and the synchronized audio feature f_a, the neural radiance field is used to generate the final image under these conditions. First the synchronized audio feature f_a and the blink feature f_e are concatenated into a new feature f_c. Then, taking this new feature as input, a conditional radiance field is constructed. After converting the head pose from camera space to canonical space, the head pose is used directly in place of the viewing direction d of the conditional radiance field. Finally, the feature f, the viewing direction d and the 3D position x in canonical space constitute the input of the implicit function F_θ. In practice, F_θ is implemented by a multi-layer perceptron. For all input vectors, the implicit function F_θ estimates the color value c together with the density σ along the assigned ray.
The implicit function F_θ is expressed as:
F_θ: (f, d, x) → (c, σ)
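A minimal sketch of such a conditional radiance field F_θ as a multi-layer perceptron is given below; positional encoding is omitted and the network sizes are assumptions.

    import torch
    import torch.nn as nn

    class ConditionalRadianceField(nn.Module):
        """F_theta: maps (condition f, direction d, position x) to (color c, density sigma)."""
        def __init__(self, cond_dim=68, hidden=256):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Linear(cond_dim + 3 + 3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.sigma_head = nn.Linear(hidden, 1)   # volume density
            self.rgb_head = nn.Linear(hidden, 3)     # color

        def forward(self, f, d, x):
            h = self.backbone(torch.cat([f, d, x], dim=-1))
            sigma = torch.relu(self.sigma_head(h))
            rgb = torch.sigmoid(self.rgb_head(h))
            return rgb, sigma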
(3) When the improved NeRF network performs dynamic torso modeling, the dynamics of the torso are simulated with another 2D grid in a lightweight pseudo-3D deformable module, and a natural torso image matching the head is synthesized.
Torso deformation is conditioned on head pose p such that torso motion is synchronized with head motion.
An MLP is used to predict the torso deformation:
Δx = MLP(x_t, p)
where x_t denotes the pixel coordinates sampled from image space and Δx denotes the pixel coordinates after torso deformation.
The deformed torso coordinates are fed to a two-dimensional feature-grid encoder to obtain the torso features f_t.
Another MLP is used to generate the torso RGB color and alpha value:
c_t, α_t = MLP(f_t, i_t)
where i_t is a learned latent embedding added for model learning, c_t is the torso RGB color, and α_t is the alpha value.
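A minimal PyTorch sketch of the two torso MLPs follows; the 2D feature-grid encoder is abbreviated to a bilinear grid lookup, and all dimensions and names are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TorsoModule(nn.Module):
        def __init__(self, pose_dim=6, feat_dim=16, latent_dim=8, res=256):
            super().__init__()
            self.deform = nn.Sequential(nn.Linear(2 + pose_dim, 64), nn.ReLU(),
                                        nn.Linear(64, 2))              # predicts delta x
            self.grid = nn.Parameter(torch.randn(1, feat_dim, res, res) * 0.01)
            self.i_t = nn.Parameter(torch.zeros(latent_dim))           # learned latent i_t
            self.rgba = nn.Sequential(nn.Linear(feat_dim + latent_dim, 64), nn.ReLU(),
                                      nn.Linear(64, 4))                # c_t (3) + alpha_t (1)

        def forward(self, x_t, p):
            # x_t: (N, 2) pixel coordinates in [-1, 1]; p: (N, pose_dim) head pose.
            dx = self.deform(torch.cat([x_t, p], dim=-1))
            warped = (x_t + dx).view(1, -1, 1, 2)
            f_t = F.grid_sample(self.grid, warped, align_corners=True)  # (1, C, N, 1)
            f_t = f_t.view(self.grid.shape[1], -1).t()                  # (N, C)
            i_t = self.i_t.expand(f_t.shape[0], -1)
            out = self.rgba(torch.cat([f_t, i_t], dim=-1))
            c_t, alpha_t = torch.sigmoid(out[:, :3]), torch.sigmoid(out[:, 3:])
            return c_t, alpha_t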
(4) The independently rendered head and torso models are composited with the static model to obtain the synthesized video of the virtual anchor.
Step six, the background replacement module replaces the background of the virtual anchor's synthesized video according to the background data, and fuses the character image, background and audio of the virtual anchor to synthesize the final virtual anchor.
Step 601, the virtual anchor's synthesized video is input to the background segmentation module, and the Alpha channel of the foreground object in the image is extracted through the Background-Matting background segmentation model, so that the virtual anchor's character image in the synthesized video is completely separated from the background.
The Background-Matting background segmentation model consists of a base network and a refinement network. The base network predicts the alpha matte and the foreground layer at low resolution and outputs an error-prediction map indicating the regions that require high-resolution refinement. The refinement network takes the low-resolution results and the original image as input and produces high-resolution output only in the indicated regions, thereby segmenting the character image of the video.
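The following is a schematic sketch of that two-stage flow; base_net and refine_net are stand-ins for the actual networks, and the patch-selection heuristic shown here is only illustrative.

    import torch
    import torch.nn.functional as F

    def matting_inference(image, base_net, refine_net, k=1000, patch=16):
        """image: (1, 3, H, W). base_net runs at 1/4 resolution and returns a coarse
        alpha matte, a foreground layer and an error map; refine_net re-predicts the
        k worst patches at full resolution. Both networks are stand-ins."""
        small = F.interpolate(image, scale_factor=0.25, mode="bilinear",
                              align_corners=False)
        alpha_lr, fgr_lr, err = base_net(small)
        alpha = F.interpolate(alpha_lr, size=image.shape[-2:], mode="bilinear",
                              align_corners=False)
        # Pick the patches with the largest predicted error and refine only those.
        err_patches = F.avg_pool2d(err, patch)
        idx = torch.topk(err_patches.flatten(), k).indices
        return refine_net(image, alpha, idx)   # high-resolution alpha in the chosen regions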
The background segmentation problem is modeled by representing each pixel of the image as a combination of foreground and background:
C=F*α+B*(1-α)
c is the given image, F is the foreground calculated for each pixel, B is the background calculated for each pixel, and α is the transparency of each pixel.
Step 602, the segmented virtual anchor image is composited onto another background image to obtain a composite image, and the composite image is harmonized by the image harmonization module to complete the background replacement.
Harmonization consists of sequentially applying color-illumination adjustment, brightness linear transformation, gray-histogram equalization, color correction and local contrast enhancement to the foreground picture.
The image after background replacement is:
I_c = M∘I_f + (1 - M)∘I_b
where I_b is the background image, I_f is the foreground image, M is the foreground image mask, I_c is the combined image, and ∘ is the Hadamard product.
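A minimal NumPy sketch of this per-pixel compositing (the Hadamard product is element-wise multiplication) is given below; the array layouts are assumptions.

    import numpy as np

    def replace_background(foreground, background, mask):
        """I_c = M * I_f + (1 - M) * I_b, applied per pixel.
        foreground, background: (H, W, 3) float arrays in [0, 1];
        mask: (H, W) alpha matte in [0, 1] produced by the segmentation model."""
        m = mask[..., None]                       # broadcast over the color channels
        return m * foreground + (1.0 - m) * background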
Step 603, the background-replaced virtual anchor video and the virtual anchor's synthesized voice are input to the FFmpeg tool for audio-video merging to synthesize the final virtual anchor.
The advantages and positive effects of the invention are as follows:
(1) The invention generates the virtual anchor by an automatic, intelligent method; it is a low-cost, efficient virtual anchor generation method that can greatly reduce the time and cost of virtual anchor production, can be applied to virtual anchor production in different fields, and enriches the choices available to audiences. The virtual anchor generated by the method is efficient and stable and can produce positive effects in large-scale application.
(2) Through the improved NeRF network, the invention generates a virtual anchor with high-quality lip-voice synchronization, giving it higher realism, stronger appeal and better information delivery.
(3) By learning the hidden attributes, the invention can adjust and optimize the generated image of the virtual anchor so that it leaves a strong, lifelike impression on the audience.
(4) By combining the background replacement algorithm with the image harmonization algorithm, real-time replacement of the video background is realized, which better meets the needs of different usage scenarios and different users, brings audiences a brand-new viewing experience, and allows wide application in more fields.
Drawings
FIG. 1 is a schematic diagram of a virtual anchor generation system of the present invention;
FIG. 2 is a schematic diagram of a face feature extraction and construction flow for generating a virtual anchor according to the present invention;
FIG. 3 is a flow chart of the virtual anchor speech generation of the present invention;
FIG. 4 is a flow chart of the virtual anchor video generation of the present invention;
fig. 5 is a schematic flow chart of virtual anchor background replacement according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be further described in detail below with reference to the accompanying drawings and examples.
The virtual anchor generation system based on the neural radiance field and hidden attributes is shown in fig. 1. It comprises a face feature extraction and construction module for extracting and modeling facial features, a voice synthesis module for synthesizing voice, a voice feature extraction module for extracting voice features, a hidden attribute extraction module for extracting hidden features such as lip shape, blink frequency and head pose, an improved NeRF network module for synthesizing the virtual anchor video, and a background replacement module for replacing the application scene, thereby realizing a more realistic and vivid virtual anchor.
The face feature extraction and construction module extracts and constructs the facial features of the virtual anchor; this is a method for recovering the three-dimensional shape of a face from a two-dimensional face image through the 3DMM model. The 3DMM is a basic statistical three-dimensional face model; loading a BFM database effectively extends the scenarios to which the 3DMM can be applied, since the BFM can fit an arbitrary three-dimensional face and store the 3DMM parameters.
The voice synthesis module performs the voice synthesis of the virtual anchor: the virtual anchor's input text is taken as the input of voice synthesis, the text is converted into phoneme information by text front-end processing using a TTS model, and the converted phonemes are synthesized into the voice of the virtual anchor through the FastSpeech acoustic synthesis model.
The voice feature extraction module and the hidden attribute extraction module extract features such as the voice and lip shape of the virtual anchor. First, spectral features of the voice are extracted with DeepSpeech: the voice signal is decomposed into a series of short-time Fourier transform features, which are then fed into a CNN to extract the spectral features. Then the 3DMM is used to obtain the head pose and expression parameters, and a facial-attribute decoupling module decouples blinking and mouth movement, yielding the explicit and hidden features of the virtual anchor.
The improved NeRF network module performs the virtual anchor video synthesis: the improved NeRF network adapts its efficient static-scene modeling capability to dynamic, audio-driven character modeling. Here the improved NeRF network decomposes the high-dimensional audio-to-image mapping into three low-dimensional trainable feature grids; for dynamic head modeling, a decomposed audio-spatial encoding module is proposed, decomposed mainly into eye blink, head pose and lip movement. This decomposed audio-spatial encoding realizes a NeRF network for voice-driven dynamic head modeling.
The background replacement module comprises a background segmentation module and an image harmonization module. Virtual anchor background replacement is realized by segmenting the virtual anchor's character from the background with Background-Matting, generating the matted virtual anchor foreground object, and using the background to be substituted as the background behind the matted character, thereby completing the background replacement of the virtual anchor.
A method for generating a virtual anchor using the virtual anchor generation system based on the neural radiance field and hidden attributes comprises the following steps:
Step one, constructing the character image of the virtual anchor according to actual needs, including character video data, voice, text data and background data, which serve as the input of the virtual anchor generation system.
The character video of the virtual anchor and the background picture of the virtual anchor are determined and taken as the input for generating the virtual anchor's action video.
The languages used by the virtual anchor are determined, and voice and text data sets of the corresponding languages are acquired for the different languages as training data for the virtual anchor's semantic analysis and voice synthesis models.
Step two, the face feature extraction and construction module extracts and constructs the face features of the character video data to generate the three-dimensional face of the virtual anchor;
As shown in fig. 2, the face feature extraction and construction flow of the virtual anchor is divided into face parsing, 3DMM face feature extraction and face reconstruction, specifically:
Step 201, face parsing decomposes the character video data into face components through deep learning and obtains the corresponding facial features;
Face parsing is a special case of semantic image segmentation, a technique in the field of computer vision that aims to decompose a face image or video into its constituent parts, such as the facial features of skin, hair, eyes, eyebrows, nose and mouth. Face parsing algorithms typically use deep learning techniques, such as convolutional neural networks (CNNs), to analyze images or videos and identify different facial features. They are trained on large annotated face-image datasets, which enables them to learn the relationships between different facial features and distinguish their visual cues.
Step 202, three-dimensional feature encoding is performed on the facial features of the different face parts through the 3DMM network, and the combination of reference shapes and texture maps that best represents the desired face is selected to synthesize the face;
Here, the 3DMM (3D Morphable Model) is a statistical model for representing shape and texture variations of faces and heads; it can be used to synthesize new faces or analyze existing ones. It is constructed from a dataset of a large number of 3D-scanned or photographed real faces. A 3DMM typically consists of a set of reference shapes, each representing a different aspect of the face shape, such as the overall shape of the head, nose or mouth. These reference shapes are combined with a set of texture maps that represent variations in skin color, texture and other details. Using the 3DMM, a face can be synthesized by selecting the combination of reference shapes and texture maps that best represents the desired face.
Step 203, the reference shapes and texture maps of the face are weighted and combined via the BFM database to generate the reconstructed three-dimensional face.
Used here is the BFM-2017 database, an improved version of the Basel Face Model (BFM) released in 2017 with more shape and texture variables. BFM-2017 contains a set of shape references representing different shape features of the face and corresponding texture maps representing the detailed information of the face. Different face shapes and textures can be synthesized by weighting and combining the shape references and texture maps, so that a reconstructed three-dimensional face can be generated; this effectively simulates the true form of the face and improves the accuracy of face reconstruction.
The face shape S can be represented as a combination of the 3DMM average shape, identity (geometry) parameters and expression parameters, as follows:
S = S̄ + B_id·F_id + B_exp·F_exp
where S̄ is the average face mesh, B_id is the PCA (principal component analysis) geometric (identity) basis, B_exp is the PCA expression basis, F_id is the geometric coefficient vector, and F_exp is the expression coefficient vector.
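A small NumPy sketch of assembling a face mesh from this linear model follows; the array shapes are assumptions.

    import numpy as np

    def reconstruct_face(S_mean, B_id, B_exp, f_id, f_exp):
        """S = S_mean + B_id @ f_id + B_exp @ f_exp
        S_mean: (3N,) average face mesh (N vertices, flattened);
        B_id:   (3N, K_id) identity (geometry) PCA basis;
        B_exp:  (3N, K_exp) expression PCA basis;
        f_id, f_exp: identity / expression coefficient vectors."""
        S = S_mean + B_id @ f_id + B_exp @ f_exp
        return S.reshape(-1, 3)               # back to (N, 3) vertex positions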
Step three, the voice synthesis network is trained with voice data; the text data are input to the text transcription module for text front-end processing; the processed text is input to the trained voice synthesis network model, and the synthesized voice of the virtual anchor, with realistic vocal expression, is obtained after voice synthesis.
As shown in fig. 3, the voice synthesis network is a FastSpeech model built on the Transformer architecture, a deep learning model based on the attention mechanism. The Transformer architecture can better process voice features and generate high-quality voice. Meanwhile, the FastSpeech model uses bidirectional prediction and an alignment loss, predicting the length and pronunciation of the voice at the same time, which effectively improves the generation efficiency and quality of the model.
Through a TTS (Text-to-Speech) model, text can be effectively converted into a form that the FastSpeech model can process, so that the FastSpeech model can better generate voice and better handle special characters and words, thereby producing more accurate and fluent voice.
Text front-end processing means that punctuation marks, numbers, spaces and the like are removed from the text data, the text is split into individual words according to semantic understanding, each word is converted into phonemes, and voice tags such as tone and pitch are marked.
Step four, the voice feature extraction module performs feature extraction on the synthesized voice of the virtual anchor; at the same time, the hidden attribute feature extraction module, combined with the three-dimensional face, extracts the hidden attribute feature information in the video data, and all extracted feature information is output to the improved NeRF network module.
Voice feature extraction: the voice feature extraction network is a speech recognition system using deep learning for extracting voice feature information from the voice. The voice feature extraction model DeepSpeech performs feature extraction on the synthesized voice data of the virtual anchor, extracts feature information such as spectrum, intonation and pitch, and maps it to corresponding discrete values.
The hidden attribute feature extraction network is an attribute feature extraction network based on deep learning.
Explicit attribute feature extraction: lip movements, facial movements and expressions that are strongly correlated with the voice data are extracted from the video data using the 3DMM and output to the improved NeRF network module as explicit attributes;
Hidden attribute feature extraction: hidden attributes are attributes weakly correlated with the voice data, i.e. attributes related to the voice context or other attributes related to the personalized talking style, including head movement, blinking and the like; the motion of the relevant parts in the video data is extracted through the constructed three-dimensional face model and output to the improved NeRF network module as hidden attributes.
The voice feature extraction network and the hidden attribute feature extraction network can be combined to produce a realistic synthesis effect from the feature information of the voice text and the hidden attribute information.
Step five, the improved NeRF network module models the static scene, dynamic head and dynamic torso of the virtual anchor according to the voice feature information and the hidden attribute feature information to obtain the synthesized video of the virtual anchor;
As shown in fig. 4, the invention uses the improved NeRF network to model the static scene, the dynamic head and the dynamic torso, as follows:
The NeRF neural radiance field has been successful in high-fidelity three-dimensional modeling, but the NeRF network is slow in training and inference, which severely limits its usability. The invention optimizes the NeRF network: it improves the neural radiance field, effectively achieves real-time synthesis for the voice-driven task and faster training convergence, and, with less data, learns the lip, blink-frequency and head-pose characteristics of different speakers to generate a high-quality virtual anchor image with lip-synchronized voice information.
The improved NeRF network decomposes the high-dimensional character audio-video processing network into three low-dimensional feature grids. Specifically, the decomposed audio-spatial encoding module models the dynamic head with a 3D spatial grid and a 2D audio grid, and the torso is handled with another 2D grid in a lightweight pseudo-3D deformable module.
For static scene modeling, the improved NeRF is used for novel view synthesis, achieving unprecedented realism. Because the synthesis efficiency of the NeRF network is low, in order to improve model efficiency the invention reduces the cost of the MLP multi-layer perceptron by replacing the MLP with linear interpolation, so as to maintain the reconstructed static information at each static 3D position and store the features of the 3D scene in a trainable static-scene grid structure, achieving low-cost static scene reconstruction.
For modeling the dynamic head, the invention proposes an improved NeRF model with real-time audio-spatial decomposition that allows efficient training and real-time synthesis of audio-driven human heads. The inherently high-dimensional audio used to synthesize the character head information is represented and explicitly decomposed into three low-dimensional trainable feature grids, namely a lip movement model, a head movement model and an eye blink model. A model that combines the explicit and hidden features of the virtual anchor can synthesize a natural, realistic virtual anchor more effectively. For dynamic head modeling, to synchronize the audio with the lips, expression and emotion, the invention proposes a decomposed audio-spatial encoding module that decomposes the audio and spatial representations into two grids. While static spatial coordinates are maintained in 3D, the audio dynamics are encoded into low-dimensional "coordinates". The advantage is that the synthesis of the video can be split into two separate low-dimensional feature grids instead of querying audio and spatial coordinates within one high-dimensional feature grid, thereby reducing the cost of interpolation.
A relationship between lip movement and the input audio is constructed. Instead of relying on visual cues, the choice here is to synchronize the mouth movements directly with the embedded audible utterance. Specifically, a CNN audio encoder E_a is used to extract phoneme features f_a from the input audio:
f_a = E_a(a)
where a denotes the input audio data.
A contrastive learning strategy is employed to align the audio features with the mouth features in search of their synchronization. Specifically, time-aligned audio and mouth features (f_a, f_m) are treated as positive pairs, while misaligned features are treated as negative pairs. Contrastive learning is performed with a binary cross-entropy loss, so that the distance between temporally aligned pairs should be smaller than that of the misaligned negative pairs.
τ_con denotes the binary cross-entropy loss between lip and voice features; d(f_m, f_a) denotes the cosine distance of a positive (aligned) pair, and the corresponding distance for a misaligned pair denotes the cosine distance of the negative pair.
A controllable probabilistic model is used for blink and head-pose movements. Given a facial-attribute sequence h_{1:T} (head pose or blink) of length T and a conditioning audio sequence a_{1:T'} of the greater length T', where T' > T, the audio frames T+1 to T' have no corresponding attributes, so the facial-attribute sequence h_{T+1:T'} must be generated by embedding prediction over frames T to T'. The facial-attribute sequence h_{T+1:T'} comprises: (1) latent hidden-attribute space construction, where a Transformer-VAE with a Gaussian Process (GP) prior is trained on a large dataset to establish the mapping between input facial-attribute sequences and a hidden-attribute space Z; (2) head-pose and blink space construction, where a fine-tuned cross-modal encoder embeds two BOPs (Beginning of Pose) of the selected person and the blink-frequency audio into the hidden-attribute space Z.
After obtaining the generated head pose, the blink feature f_e and the synchronized audio feature f_a, the neural radiance field is used to generate the final image under these conditions. First the synchronized audio embedding f_a and the blink embedding f_e are concatenated into a new embedding f_c. Then, taking this new embedding as input, a conditional radiance field is constructed. After converting the head pose from camera space to canonical space, the head pose is used directly in place of the viewing direction d of the conditional radiance field. Finally, the embedding f, the viewing direction d and the 3D position x in canonical space constitute the input of the implicit function F_θ. In practice, F_θ is implemented by a multi-layer perceptron. For all concatenated input vectors, the implicit function F_θ estimates the color value c together with the density σ along the assigned ray. The whole implicit function can be formulated as:
F_θ: (f, d, x) → (c, σ)
This realizes the synchronization of the audio with the hidden-attribute blink frequency and head pose.
For modeling the dynamic torso, the invention models the motion of the torso part in a lightweight manner to pursue lower computational cost. Because the torso of the virtual anchor moves little, the invention proposes a lightweight pseudo-3D deformable module that models the torso with a 2D feature grid. The dynamic-torso modeling module can successfully simulate the dynamics of the torso and synthesize a natural torso image that matches the head. More importantly, the pseudo-3D representation based on a 2D feature grid is very lightweight and efficient. The separately rendered head and torso images can be harmoniously composited with any provided background image to obtain the final output virtual anchor video.
Because the torso is almost stationary, containing only slight movements and no topology changes, the method of the invention can be seen as a deformation-based two-dimensional version of NeRF. Torso deformation is conditioned on the head pose p, so that the torso motion is synchronized with the head motion.
An MLP (Multi-Layer Perceptron) is used to predict the torso deformation:
Δx = MLP(x_t, p)
where x_t denotes the pixel coordinates sampled from image space and Δx denotes the pixel coordinates after torso deformation.
The deformed torso pixel coordinates are fed to a two-dimensional feature-grid encoder to obtain the torso features f_t.
Another MLP is used to generate the torso RGB color and alpha value:
c_t, α_t = MLP(f_t, i_t)
where i_t is a learned latent embedding added for model learning, c_t is the torso RGB color, and α_t is the alpha value.
The independently rendered head and torso models are composited with the static model to obtain the synthesized video of the virtual anchor.
Step six, the background replacement module replaces the background of the synthesized video of the virtual anchor according to the background data.
As shown in fig. 5, the background replacement module uses the background segmentation module to separate the foreground and background of the synthesized video, and uses the image harmonization module to fuse the foreground with the replacement background, specifically:
(1) The foreground-background separation of the synthesized video uses the Background-Matting background segmentation model. The character segmentation model first trains the network on two self-built large databases with markedly diverse human poses, so as to learn robust priors. Training is then performed on publicly available, manually curated datasets to learn the details of fine textures.
The Background-Matting algorithm is an image processing technique that realizes the replacement of the application scene by identifying the boundaries between foreground objects and background in an image. First, the algorithm separates the transparency of the foreground object completely from the background by extracting the Alpha channel of the foreground object in the image; this Alpha channel can be used to identify the boundaries between foreground objects and background. Then the separated foreground object, i.e. the image of the virtual anchor, is composited into the background image through the image harmonization algorithm to complete the background replacement, and the voice and video are combined to create a vivid virtual anchor. The model consists of two networks: the base network predicts the alpha matte and foreground layer at lower resolution and outputs an error-prediction map indicating the regions that may require high-resolution refinement; the refinement network takes the low-resolution results and the original image as input and produces high-resolution output only in the selected regions, thereby segmenting the character picture of the video.
The Background-Matting background segmentation model is a real-time, high-resolution background replacement technique that can achieve real-time segmentation on 4K (3840×2160) video at 30 fps and HD (1920×1080) video at 60 fps.
The background segmentation problem is modeled by representing each pixel of the image as a combination of foreground and background:
C=F*α+B*(1-α)
c is the given image, F is the foreground calculated for each pixel, B is the background calculated for each pixel, and α is the transparency of each pixel.
(2) The fusion of the foreground and the replacement background means that the foreground of one picture is pasted onto another background picture through image harmonization, obtaining a composite image.
However, a composite image obtained by simple pasting may have many problems, such as an unreasonable size and position of the foreground, an unreasonable perspective angle of the foreground, unnatural transitions between foreground and background, and disharmony between the color and illumination of the foreground and the background. These factors all degrade the quality of the composite image and make it look unrealistic.
Image harmonization can therefore effectively solve the problem of color and illumination disharmony between foreground and background. It sequentially applies color-illumination adjustment, brightness linear transformation, gray-histogram equalization, color correction and local contrast enhancement to the foreground image: adjusting the color and illumination of the foreground makes it fit the background better; a linear transformation of the foreground brightness widens the distribution range of the gray values of the image; histogram equalization makes the gray-level distribution of the foreground image more uniform, enhancing the contrast and detail of the image; color correction then improves the color balance and coordination of the image by adjusting color parameters of the foreground image (such as hue, saturation and color temperature); finally, local contrast enhancement is applied to local regions of the foreground image so that the details of the image become richer. This completes the replacement of the foreground and background of the synthesized video.
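A minimal OpenCV sketch of such a harmonization pipeline follows; the gain, bias and white-balance scheme are illustrative choices, not values from the patent.

    import cv2
    import numpy as np

    def harmonize_foreground(fg_bgr, gain=1.05, bias=5):
        """Illustrative pipeline: brightness linear transform, gray-histogram
        equalization, simple color correction and local contrast enhancement."""
        # Brightness linear transform: I' = gain * I + bias.
        img = cv2.convertScaleAbs(fg_bgr, alpha=gain, beta=bias)
        # Histogram equalization on the luminance channel only.
        ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
        ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
        img = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
        # Color correction: gray-world white balance.
        b, g, r = cv2.split(img.astype(np.float32))
        mean = (b.mean() + g.mean() + r.mean()) / 3.0
        img = cv2.merge([np.clip(c * mean / max(c.mean(), 1e-6), 0, 255)
                         for c in (b, g, r)]).astype(np.uint8)
        # Local contrast enhancement with CLAHE on the L channel.
        lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
        lab[:, :, 0] = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(lab[:, :, 0])
        return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)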
The image after background replacement is:
I_c = M∘I_f + (1 - M)∘I_b
where I_b is the background image, I_f is the foreground image, M is the foreground image mask, I_c is the combined image, and ∘ is the Hadamard product.
Step seven, the synthesized voice of the virtual anchor and the background-replaced virtual anchor video are fused audio-visually to synthesize the final virtual anchor.
The synthesized voice of the virtual anchor and the background-replaced virtual anchor video are taken as input and passed to the FFmpeg tool for audio-video merging to synthesize the final virtual anchor. FFmpeg is a set of open-source tools that can be used to record and convert digital audio and video and to turn them into streams; it contains a very advanced audio/video codec library.
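An example of such a merge with the FFmpeg command-line tool, driven from Python, is shown below; the file names are placeholders.

    import subprocess

    # Mux the background-replaced anchor video with the synthesized voice track.
    subprocess.run([
        "ffmpeg", "-y",
        "-i", "anchor_video.mp4",     # background-replaced virtual-anchor video
        "-i", "anchor_voice.wav",     # synthesized voice
        "-c:v", "copy",               # keep the video stream as-is
        "-c:a", "aac",                # encode the audio track
        "-shortest",                  # stop at the shorter of the two streams
        "virtual_anchor_final.mp4",
    ], check=True)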
With this method, a customer can automatically generate a vivid and natural virtual anchor by taking as input a video containing the character, the voice and the background of the virtual anchor's application scene, which greatly reduces the time and cost of producing a virtual anchor and is applicable to virtual anchor production in different fields.
It should be noted and appreciated that various modifications and improvements of the invention described in detail above can be made without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed subject matter is not limited by any particular exemplary teachings presented.

Claims (8)

1. The virtual anchor generation method based on the neural radiance field and hidden attributes is characterized by comprising the following steps:
step one, constructing character images of a virtual anchor according to actual needs, wherein the character images comprise character video data, voice, text data and background data, and the character video data, the voice, the text data and the background data are used as input of a virtual anchor generating system;
secondly, a face feature extraction and construction module performs face feature extraction and construction on the character video data to generate a three-dimensional face of a virtual anchor;
training a voice synthesis network through voice data, inputting text data into a text transcription module to process the front end of the text, inputting the processed text into a trained voice synthesis network model, and obtaining the synthesized voice of a virtual anchor after voice synthesis;
step four, the voice feature extraction module performs feature extraction on the synthesized voice of the virtual anchor, and simultaneously the implicit attribute feature extraction module extracts implicit attribute feature information in the video data by combining the three-dimensional face, and outputs each extracted feature information to the improved NeRF network module;
voice feature extraction: feature extraction is carried out on the synthesized voice of the virtual anchor; spectrum, intonation and pitch feature information is extracted and mapped to corresponding discrete values;
explicit attribute feature extraction: lip movement, facial movement and expression feature data that are strongly correlated with the voice data are extracted from the video data using the 3DMM and output to the improved NeRF network module as explicit attributes;
hidden attribute feature extraction: attributes weakly correlated with the voice data, namely attributes related to the voice context or other attributes related to the personalized talking style, including head movement and blinking; the motion of the relevant parts in the video data is extracted through the constructed three-dimensional face model and output as hidden attributes to the improved NeRF network module;
step five, the improved NeRF network module models static scenes, dynamic heads and dynamic trunk of the virtual anchor according to the voice characteristic information and the implicit attribute characteristic information to obtain a synthetic video of the virtual anchor;
the method comprises the following steps:
(1) When the improved NeRF network is used for static scene modeling, a reduced MLP multi-layer perceptron is adopted, and MLP is replaced by linear interpolation so as to maintain reconstructed static information at each static 3D position, thereby storing the characteristics of the 3D scene in a static scene trainable grid structure;
(2) When the improved NeRF network is used for dynamic head modeling, the audio and video processing network of the high-dimensional character is decomposed into three low-dimensional trainable feature grids, namely a lip movement model, a character head movement model and an eye blinking model; to achieve synchronization of audio and motion models, the audio-spatial coding module is decomposed into a 3D spatial grid and a 2D audio grid, and the audio and spatial representations are decomposed into two grids; while each motion model maintains static spatial coordinates in 3D, audio dynamics are encoded as low-dimensional "coordinates";
when constructing the relationship between the explicit-attribute lip movement and the audio, the mouth movements are synchronized directly with the embedded audible utterance; specifically, a CNN audio encoder E_a is used to extract phoneme features f_a from the input audio:
f_a = E_a(a)
where a denotes the input audio data;
a contrastive learning strategy is used to align the audio features with the mouth features: time-aligned audio and mouth features (f_a, f_m) are treated as positive pairs and misaligned features as negative pairs; contrastive learning is performed with a binary cross-entropy loss, so that the distance between temporally aligned pairs is smaller than that of the misaligned negative pairs;
τ_con denotes the binary cross-entropy loss between lip and voice features; d(f_m, f_a) denotes the cosine distance of a positive (aligned) pair, and the corresponding distance for a misaligned pair denotes the cosine distance of the negative pair;
the synchronization of the audio with the hidden-attribute blink frequency and head pose proceeds as follows:
a controllable probabilistic model is used for blink and head-pose movements; given a facial-attribute sequence h_{1:T} of length T, where the facial attributes include head pose or blink, and a conditioning audio sequence a_{1:T'} of length T', the facial-attribute sequence h_{T+1:T'} is generated by embedding prediction over frames T to T'; the facial-attribute sequence h_{T+1:T'} comprises: (1) latent hidden-attribute space construction, where a Transformer-VAE with a Gaussian Process prior is trained on a large dataset to establish the mapping between input facial-attribute sequences and a hidden-attribute space Z; (2) head-pose and blink space construction, where a fine-tuned cross-modal encoder embeds two head BOPs of the selected person and the blink-frequency audio into the hidden-attribute space Z;
after obtaining the generated head pose, the blink feature f_e and the synchronized audio feature f_a, the neural radiance field is used to generate the final image under these conditions; first the synchronized audio feature f_a and the blink feature f_e are concatenated into a new feature f_c; then, taking this new feature as input, a conditional radiance field is constructed; after converting the head pose from camera space to canonical space, the head pose is used directly in place of the viewing direction d of the conditional radiance field; finally, the feature f, the viewing direction d and the 3D position x in canonical space constitute the input of the implicit function F_θ; for all input vectors, the implicit function F_θ estimates the color value c together with the density σ along the assigned ray;
the implicit function F_θ is expressed as:
F_θ: (f, d, x) → (c, σ)
(3) When the improved NeRF network is used for modeling a dynamic trunk, the dynamic characteristic of the trunk is simulated by using another 2D grid in a lightweight pseudo 3D deformable module, and a natural trunk image matched with the head is synthesized;
(4) Synthesizing the head and trunk model and the static model which are independently rendered to obtain a synthetic video of a virtual anchor;
step six, the background replacing module replaces the background of the virtual anchor synthesized video according to the background data, and fuses the character image, the background and the audio of the virtual anchor to synthesize the final virtual anchor;
step 601, the synthesized video of the virtual anchor is input to the background segmentation module, and the Alpha channel of the foreground object in the image is extracted through the Background-Matting background segmentation model, so that the virtual anchor's character image in the synthesized video is completely separated from the background;
step 602, synthesizing the segmented virtual anchor image into another background image to obtain a synthesized image, and carrying out harmony processing on the synthesized image through an image harmony module to finish background replacement;
the image after background replacement is:
I_c = M∘I_f + (1 - M)∘I_b
where I_b is the background image, I_f is the foreground image, M is the foreground image mask, I_c is the combined image, and ∘ denotes the Hadamard product;
and 603, inputting the virtual anchor video with the replaced background and the virtual anchor synthesized voice into an FFmpeg tool for audio and video combination to synthesize a final virtual anchor.
2. The virtual anchor generation method based on the neural radiance field and hidden attributes according to claim 1, wherein in step two the face feature extraction and construction flow is divided into face parsing, 3DMM face feature extraction and face reconstruction, specifically:
step 201, face analysis is to decompose character video data into face components through a deep learning technology, and obtain corresponding facial features;
Step 202, carrying out three-dimensional feature coding on facial features of different parts of a human face through 3DMM human face feature extraction, and selecting a combination of a reference shape and a texture map which can most represent the required human face to synthesize the human face;
and 203, carrying out weighted combination on the reference shape and the texture map of the face through a database to generate a reconstructed three-dimensional face.
3. The virtual anchor generation method based on the neural radiance field and hidden attributes according to claim 2, wherein the face components include skin, hair, eyes, eyebrows, nose and mouth.
4. The virtual anchor generation method based on the neural radiance field and hidden attributes according to claim 1, wherein in step three the text front-end processing means that punctuation marks, numbers and spaces are removed from the input text, the text is split into individual words according to semantic understanding, each word is converted into phonemes, and voice tags are marked.
5. The virtual anchor generation method based on the neural radiance field and hidden attributes according to claim 1, wherein in step five the torso modeling is specifically:
torso deformation is conditioned on head pose p such that torso motion is synchronized with head motion;
an MLP is used to predict the torso deformation:

Δx = MLP(x_t, p)

where x_t denotes pixel coordinates sampled from image space, and Δx denotes the pixel coordinates after the torso deformation;
the coordinates after the torso deformation are fed to a two-dimensional feature grid encoder to obtain the torso feature f_t;
another MLP is used to generate the torso RGB color and alpha value:

c_t, α_t = MLP(f_t, i_t)

where i_t is an embedded hidden feature added to assist model learning, c_t is the torso RGB color, and α_t is the alpha value.
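The torso branch of claim 5 could be organized roughly as in the following PyTorch sketch; the pose encoding, grid resolution, feature sizes and the treatment of Δx as an offset added to x_t are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TorsoField(nn.Module):
    def __init__(self, pose_dim=12, feat_dim=32, latent_dim=16, grid_res=128, hidden=64):
        super().__init__()
        self.deform = nn.Sequential(                  # Δx = MLP(x_t, p)
            nn.Linear(2 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )
        # trainable 2D feature grid, queried at the deformed coordinates
        self.grid = nn.Parameter(torch.randn(1, feat_dim, grid_res, grid_res) * 0.01)
        self.latent = nn.Parameter(torch.zeros(latent_dim))   # embedded hidden feature i_t
        self.head = nn.Sequential(                    # c_t, α_t = MLP(f_t, i_t)
            nn.Linear(feat_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                     # 3 RGB channels + 1 alpha value
        )

    def forward(self, x_t, p):
        # x_t: (N, 2) pixel coordinates in [-1, 1]; p: (N, pose_dim) head pose
        dx = self.deform(torch.cat([x_t, p], dim=-1))
        coords = (x_t + dx).view(1, -1, 1, 2)         # deformed sampling coordinates
        f_t = F.grid_sample(self.grid, coords, align_corners=True)   # (1, C, N, 1)
        f_t = f_t.squeeze(0).squeeze(-1).transpose(0, 1)              # (N, C)
        i_t = self.latent.expand(f_t.shape[0], -1)
        out = self.head(torch.cat([f_t, i_t], dim=-1))
        c_t = torch.sigmoid(out[..., :3])
        alpha_t = torch.sigmoid(out[..., 3:])
        return c_t, alpha_t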
6. The virtual anchor generation method based on the neural radiation field and the hidden attribute according to claim 1, wherein in the sixth step, the Background Matting background segmentation model comprises a base network and a refinement network; the base network predicts the Alpha mask and the foreground layer at low resolution, and outputs an error-prediction map indicating the image blocks that require high-resolution refinement; the refinement network takes the low-resolution result and the original image as input and generates high-resolution output only in the indicated regions, thereby segmenting the character picture of the video;
in the modeling of the background segmentation problem, each pixel of the image is represented as a combination of foreground and background:

C = F * α + B * (1 − α)

where C is the given image, F is the foreground calculated for each pixel, B is the background calculated for each pixel, and α is the transparency of each pixel.
7. The method of claim 1, wherein the harmonization processing includes sequentially performing color illumination adjustment, luminance linear transformation, gray histogram equalization, color correction and local contrast enhancement on the foreground picture.
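For illustration only, the harmonization operations listed in claim 7 could be approximated with OpenCV as sketched below; the concrete gains, biases and CLAHE parameters are assumptions and not the patent's settings.

import cv2
import numpy as np

def harmonize_foreground(fg_bgr):
    """fg_bgr: (H, W, 3) uint8 foreground picture. Returns a harmonized copy."""
    lab = cv2.cvtColor(fg_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    # luminance linear transformation (illustrative gain and bias)
    l = np.clip(1.1 * l.astype(np.float32) - 5, 0, 255).astype(np.uint8)
    # gray histogram equalization on the luminance channel
    l = cv2.equalizeHist(l)
    # local contrast enhancement via CLAHE
    l = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(l)
    out = cv2.cvtColor(cv2.merge([l, a, b]), cv2.COLOR_LAB2BGR)
    # simple color correction: scale channels toward a common mean
    means = out.reshape(-1, 3).mean(axis=0)
    out = np.clip(out * (means.mean() / np.maximum(means, 1e-6)), 0, 255).astype(np.uint8)
    return out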
8. A virtual anchor generating system based on the neural radiation field and the hidden attribute, characterized by comprising a face feature extraction and construction module, a voice synthesis module, a voice feature extraction module, a hidden attribute extraction module, an improved NeRF network module and a background replacement module;
the face feature extraction and construction module restores the three-dimensional shape of the two-dimensional face image through the 3DMM model;
the voice synthesis module takes the input text of the virtual anchor, converts the text into phoneme information, and synthesizes the converted phonemes into the voice of the virtual anchor through an acoustic synthesis model;
the voice feature extraction module and the hidden attribute extraction module extract the voice, blink and lip features of the virtual anchor;
the improved NeRF network module utilizes the improved NeRF network to decompose the high-dimensional audio-to-image mapping into three low-dimensional trainable feature grids, and decomposes and synthesizes eye blinks, head pose and lip motion;
the background replacement module comprises a background segmentation module and an image harmonization module, and is used for realizing the virtual anchor background replacement.
CN202311094348.7A 2023-08-28 2023-08-28 Virtual anchor generation method and system based on nerve radiation field and hidden attribute Pending CN117171392A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311094348.7A CN117171392A (en) 2023-08-28 2023-08-28 Virtual anchor generation method and system based on nerve radiation field and hidden attribute

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311094348.7A CN117171392A (en) 2023-08-28 2023-08-28 Virtual anchor generation method and system based on nerve radiation field and hidden attribute

Publications (1)

Publication Number Publication Date
CN117171392A true CN117171392A (en) 2023-12-05

Family

ID=88944063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311094348.7A Pending CN117171392A (en) 2023-08-28 2023-08-28 Virtual anchor generation method and system based on nerve radiation field and hidden attribute

Country Status (1)

Country Link
CN (1) CN117171392A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953165A (en) * 2024-03-26 2024-04-30 合肥工业大学 New human face view synthesis method and system based on nerve radiation field


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination