CN111145282B - Avatar composition method, apparatus, electronic device, and storage medium - Google Patents
Avatar composition method, apparatus, electronic device, and storage medium
- Publication number
- CN111145282B (application CN201911274701.3A)
- Authority
- CN
- China
- Prior art keywords
- expression
- frame
- features
- avatar
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/001—Texturing; Colouring; Generation of texture or colour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The embodiment of the invention provides an avatar synthesis method, an avatar synthesis device, electronic equipment and a storage medium, wherein the method comprises: determining relevant features of voice data, the relevant features characterizing features related to the speaker's expression contained in the voice data; and inputting image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein the avatar in the avatar video is configured with an expression corresponding to the voice data. The expression synthesis model is trained based on a sample speaker video, the relevant features of the sample voice data corresponding to the sample speaker video, and sample image data. The method, device, electronic equipment and storage medium provided by the embodiments of the invention enable the avatar's expression to fit the voice data better and to appear more natural and realistic.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an avatar composition method, apparatus, electronic device, and storage medium.
Background
In recent years, with the continuous progress of computer speech synthesis and video synthesis technologies, various speech-driven avatar synthesis technologies have been developed in the industry. The avatar may perform news broadcasting, weather forecasting and game narration, provide ordering services, and so on.
When performing these tasks, most avatars only synthesize mouth shapes matched to the output voice while keeping a neutral expression, or have several preset basic expressions from which a corresponding expression is configured for different voice content. As a result, when the synthesized avatar outputs voice, the accompanying expression is often neither lifelike nor natural, and the user experience is poor.
Disclosure of Invention
The embodiment of the invention provides an avatar synthesis method, an avatar synthesis device, electronic equipment and a storage medium, which are used for solving the problem that the corresponding expression of the existing avatar is not lifelike and natural when outputting voice.
In a first aspect, an embodiment of the present invention provides an avatar composition method, including:
determining relevant characteristics of the voice data; the relevant features are used for representing the features related to the expression of the speaker contained in the voice data;
Inputting the image data and the related features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein the avatar in the avatar video is configured with expressions corresponding to the voice data;
the expression synthesis model is trained based on a sample speaker video, relevant characteristics of sample voice data corresponding to the sample speaker video and sample image data.
Preferably, the inputting the avatar data and the related features into an expression synthesis model to obtain an avatar video output by the expression synthesis model specifically includes:
inputting image data and related features corresponding to any frame to a feature extraction layer of the expression synthesis model respectively to obtain frame features output by the feature extraction layer;
and inputting the frame characteristics to an expression prediction layer of the expression synthesis model to obtain a virtual expression image of any frame output by the expression prediction layer.
Preferably, the inputting the image data and the related features corresponding to any frame to the feature extraction layer of the expression synthesis model to obtain the frame features output by the feature extraction layer specifically includes:
Inputting the image data and related features corresponding to any frame to a current feature extraction layer of the feature extraction layer to obtain current features output by the current feature extraction layer;
and inputting the virtual expression maps of preset frames preceding said any frame to a pre-frame feature extraction layer of the feature extraction layer, so as to obtain the pre-frame features output by the pre-frame feature extraction layer.
Preferably, the inputting the frame features to the expression prediction layer of the expression synthesis model to obtain a virtual expression map of the any frame output by the expression prediction layer specifically includes:
and the current features and the pre-frame features are fused and then input into the expression prediction layer, so that a virtual expression image of any frame output by the expression prediction layer is obtained.
Preferably, the fusing the current feature and the pre-frame feature and inputting the fused current feature and the pre-frame feature to the expression prediction layer to obtain a virtual expression map of any frame output by the expression prediction layer, which specifically includes:
the current characteristics and the frame front characteristics are fused and then input into a candidate expression prediction layer of the expression prediction layer, so that a candidate expression image output by the candidate expression prediction layer is obtained;
The current features and the frame front features are fused and then input into an optical flow prediction layer of the expression prediction layer, so that optical flow information output by the optical flow prediction layer is obtained;
and inputting the candidate expression images and the optical flow information into a fusion layer in the expression prediction layer to obtain a virtual expression image of any frame output by the fusion layer.
Preferably, the expression synthesis model is obtained by training based on a sample speaker video, relevant features of sample voice data corresponding to the sample speaker video, sample image data, and a discriminator, and the expression synthesis model and the discriminator form a generative adversarial network.
Preferably, the arbiter comprises an image arbiter and/or a video arbiter;
the image discriminator is used for judging the synthesis authenticity of any frame of virtual expression graph in the virtual image video, and the video discriminator is used for judging the synthesis authenticity of the virtual image video.
Preferably, the relevant features include language-related features, as well as emotional and/or speaker identity features.
Preferably, the avatar data is determined based on the speaker identity.
Preferably, the expression corresponding to the voice data of the avatar configuration in the avatar video includes a facial expression and a neck expression.
In a second aspect, an embodiment of the present invention provides an avatar composition device including:
a relevant feature determining unit for determining relevant features of the voice data; the relevant features are used for representing the features related to the expression of the speaker contained in the voice data;
the expression synthesis unit is used for inputting the image data and the related features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein the avatar in the avatar video is configured with an expression corresponding to the voice data;
the expression synthesis model is obtained by training relevant features of sample voice data and sample image data corresponding to a sample speaker video.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a bus, where the processor, the communication interface, and the memory are in communication with each other through the bus, and the processor may invoke logic instructions in the memory to perform the steps of the method as provided in the first aspect.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as provided by the first aspect.
According to the avatar synthesis method, device, electronic equipment and storage medium provided by the embodiments of the invention, expression synthesis of the avatar is performed using relevant features that contain rich expression-related information, so that the avatar's expression fits the voice data better and appears more natural and realistic. In addition, in the avatar video generated by the expression synthesis model the avatar's expression exists as a whole; compared with modeling each expression region separately, modeling the expression as a whole effectively handles the linkage between the muscles of the different regions, so that the muscle linkage of each region is more natural and lifelike.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart illustrating an avatar composition method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of an expression synthesis method according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a feature extraction method according to an embodiment of the present invention;
fig. 4 is a flow chart of an expression prediction method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an expression synthesis model according to an embodiment of the present invention;
fig. 6 is a flowchart illustrating an avatar composition method according to another embodiment of the present invention;
fig. 7 is a schematic structural view of an avatar composition device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the prior art, the synthesis technology of the avatar is mainly classified into the following three types:
the first category, speech driven avatar composition technology: the language information and expression information acquired from the voice are independently applied to the finally synthesized video. In the method, only a plurality of basic expressions are considered, the synthesized virtual image is compared with a crowing, only a plurality of predefined basic expressions can be made, and the problems that the mouth and the lips are not matched with the eyebrows, the throat, the cheeks and the like exist. The above problems are on the one hand because the opening and closing of the mouth shape is determined only according to the pronunciation characteristics of the voice content, and the difference between different people is not considered, and the physiological linkage between the face muscle blocks is not considered, so that the rich emotion cannot be expressed individually. On the other hand, because such a method can only select one or two from several or tens of fixed expressions to be superimposed on the synthesized video, a rich facial expression cannot be synthesized.
Second, the virtual image synthesis technology based on expression migration: and migrating the facial expression, the mouth shape and the rigid motion of the driving person to the virtual image. The video synthesized by the method is more lifelike, but is very dependent on real person performance, and can not be synthesized offline.
Third, based on the technology of synthesizing the virtual image expression by modeling each part of the face separately, the artist is required to design the motion of the whole face according to the physiological and aesthetic expertise, and a video segment is synthesized to edit the state of each part frame by frame, so that not only is strong expertise required, but also time and effort are consumed.
From the anatomical point of view, the face of the person has 42 muscles, so that rich expressions can be generated, and various different moods and emotions can be accurately conveyed. The stretching of these muscles is not independent, but has a strong correlation, for example: the person speaks in calm state, and the muscle of lips and chin stretches, and the same sentence that the person says when the emotion is excited, and the muscle of forehead, cheek muscle also stretches, and the muscle stretching intensity in areas such as lips, chin is obviously bigger than calm. In addition, there are thousands of human expressions, and the existing method only has several or tens of preset expressions, and the expression capability is not fine enough and personalized. Therefore, how to automatically compose an avatar with more lifelike and natural appearance is still a problem to be solved by those skilled in the art.
In this regard, the embodiment of the present invention provides an avatar composition method. Fig. 1 is a flow chart of an avatar composition method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
Step 110, determining relevant characteristics of voice data; the relevant features are used to characterize the features contained in the speech data that relate to the expression of the speaker.
Specifically, the voice data is voice data for performing avatar synthesis, where the avatar may be an avatar, or may be an avatar, an animal, or the like, and the embodiment of the present invention is not limited thereto. The voice data may be voice data of a speaker speaking collected by the radio device, or may be intercepted from voice data obtained through a network or the like, which is not particularly limited in the embodiment of the present invention.
The relevant features are features related to the speaker's expression. For example, the language-related features in the voice data correspond to different pronunciations, which require the speaker to mobilize facial muscles to form different mouth shapes. The emotional features in the voice data matter because, when a speaker says the same content under different emotions, the movements of the facial muscles, including the mouth shape, and of the neck muscles also differ. The scene features in the voice data matter because the speaking scene can also affect the speaker's facial expression: in a noisy environment the speaker may speak loudly and the facial expression may be relatively exaggerated, while in a quiet environment the speaker may speak softly and the facial expression may be relatively subtle. The speaker identity features in the voice data matter because different speakers show different expressions when speaking: for example, the host of a children's program may speak with a gentle, kindly expression, while the host of a comedy program may speak with exaggerated expressions.
Step 120, inputting the avatar data and related features into the expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein the avatar in the avatar video is configured with expressions corresponding to the voice data; the expression synthesis model is obtained by training based on the sample speaker video, the relevant characteristics of the sample voice data corresponding to the sample speaker video and the sample image data.
Specifically, the avatar data, that is, the image data used for avatar synthesis, may be an image of the speaker corresponding to the voice data or an image unrelated to that speaker, which is not specifically limited in the embodiments of the present invention. The avatar data includes a texture map and an expression mask map. The texture map is an image of the avatar itself, containing the avatar and every region of the avatar that performs expressions. The expression mask map is an avatar image in which every region that performs expressions has been masked out; there may be one expression mask map per frame, or one expression mask map shared by several frames.
The expression synthesis model is used to analyze the avatar's expression based on the relevant features and, in combination with the avatar data, to obtain an avatar video configured with an expression corresponding to the voice data. Before step 120 is executed, the expression synthesis model may be trained in advance, specifically as follows: first, a large number of sample speaker videos and the sample voice data corresponding to them are collected, and the sample image data in the sample speaker videos and the relevant features of the sample voice data are extracted; here, the sample speaker videos are videos of real human speakers. An initial model is then trained based on the sample speaker videos, the relevant features of the corresponding sample voice data and the sample image data, thereby obtaining the expression synthesis model.
According to the method provided by the embodiment of the invention, expression synthesis of the avatar is performed using relevant features that contain rich expression-related information, so that the avatar's expression fits the voice data better and appears more natural and realistic. In addition, in the avatar video generated by the expression synthesis model the avatar's expression exists as a whole; compared with modeling each expression region separately, modeling the expression as a whole effectively handles the linkage between the muscles of the different regions, so that the muscle linkage of each region is more natural and lifelike.
Based on the above embodiments, the expression synthesis model includes a feature extraction layer and an expression prediction layer. Fig. 2 is a flow chart of an expression synthesis method according to an embodiment of the present invention, as shown in fig. 2, step 120 specifically includes:
step 121, inputting the image data and the related features corresponding to any frame to the feature extraction layer of the expression synthesis model to obtain the frame features output by the feature extraction layer.
Specifically, the speech data may be divided into speech data of a plurality of frames, for which there is a corresponding correlation feature. Also, in the avatar data, the same texture map may correspond to each frame to embody the appearance of an avatar in the avatar video, and different expression mask maps may correspond to different frames to embody the actions of the avatar corresponding to different frames in the avatar video, in particular, the head actions.
In the expression synthesis model, the feature extraction layer is used for extracting frame features of any frame from image data and related features respectively corresponding to the frame. The frame features herein may be the image features of the frame and the expression related features of the frame, and may also include fusion features of the image features and the expression related features of the frame, which is not particularly limited in the embodiment of the present invention.
Step 122, inputting the frame characteristics into the expression prediction layer of the expression synthesis model to obtain the virtual expression map of the frame output by the expression prediction layer.
Specifically, in the expression synthesis model, the expression prediction layer is used to predict the virtual expression map of a frame based on that frame's features. Here, the virtual expression map is an image containing the avatar, in which the avatar is configured with an expression corresponding to that frame's voice data, and the position, action and so on of the avatar are consistent with the avatar data corresponding to the frame. The virtual expression maps of all frames together form the avatar video.
According to the method provided by the embodiment of the invention, the frame characteristics of any frame are obtained, the virtual expression image of the frame is obtained based on the frame characteristics, the virtual image video is finally obtained, and the overall naturalness and fidelity of the virtual image video are improved by improving the naturalness and fidelity of the virtual expression image of each frame.
Based on any of the above embodiments, the feature extraction layer includes a current feature extraction layer and a pre-frame feature extraction layer; fig. 3 is a flow chart of a feature extraction method according to an embodiment of the present invention, as shown in fig. 3, step 121 specifically includes:
and 1211, inputting the image data and the related features corresponding to any frame respectively to a current feature extraction layer of the feature extraction layer to obtain the current features output by the current feature extraction layer.
Step 1212, inputting the virtual expression image of the frame pre-set frame to the frame pre-feature extraction layer of the feature extraction layer to obtain the frame pre-feature output by the frame pre-feature extraction layer.
Specifically, the frame characteristics of any frame comprise two parts, namely a current characteristic and a frame front characteristic, wherein the current characteristic is obtained by extracting characteristics of image data and related characteristics respectively corresponding to the frame through a current characteristic extraction layer, and the current characteristic is used for reflecting the characteristics of the frame in the aspect of the virtual image, particularly the expression of the virtual image; the pre-frame features are obtained by extracting features of the virtual expression map of the preset frame before the frame through a pre-frame feature extraction layer, and are used for reflecting the virtual images, especially the features of the virtual image expression, in the virtual expression map of the preset frame before the frame.
Here, the preset frames preceding any frame are a preset number of frames immediately before that frame; for example, if the frame is the n-th frame and the preset frames are the two preceding frames, they are the (n-2)-th and (n-1)-th frames.
Based on any of the above embodiments, step 122 specifically includes: and the current characteristics and the frame front characteristics are fused and then input into the expression prediction layer, so that a virtual expression image of the frame output by the expression prediction layer is obtained.
In the embodiment of the invention, both the current feature and the pre-frame feature of a frame are used for expression prediction, so that the synthesized avatar expression not only matches the voice data of that frame naturally but also transitions naturally from the avatar expressions of the preceding frames, further improving the realism and naturalness of the avatar video.
Based on any of the above embodiments, the expression prediction layer includes a candidate expression prediction layer, an optical flow prediction layer, and a fusion layer; fig. 4 is a flowchart of an expression prediction method according to an embodiment of the present invention, as shown in fig. 4, step 122 specifically includes:
step 1221, inputting the fused current feature and the pre-frame feature into a candidate expression prediction layer of the expression prediction layer to obtain a candidate expression map output by the candidate expression prediction layer.
Here, the candidate expression prediction layer is configured to predict the avatar expression of a frame based on the current feature and pre-frame feature corresponding to that frame, and to output the candidate expression map of the frame. The candidate expression map of the frame is a virtual expression map configured with an expression corresponding to the frame's voice data.
Step 1222, merging the current feature and the pre-frame feature, and inputting the merged current feature and the pre-frame feature into an optical flow prediction layer of the expression prediction layer to obtain optical flow information output by the optical flow prediction layer.
Here, the optical flow prediction layer is configured to predict the optical flow between the previous frame and the current frame based on the current feature and pre-frame feature corresponding to the frame, and to output the optical flow information of the frame. The optical flow information may include the predicted optical flow between the previous frame and the frame, and may further include weights for fusing the optical-flow-warped previous frame with the candidate expression map.
Step 1223, inputting the candidate expression image and the optical flow information to a fusion layer in the expression prediction layer, so as to obtain a virtual expression image of the frame output by the fusion layer.
Here, the fusion layer is used to fuse the candidate expression map and the optical flow information of a frame to obtain the virtual expression map of that frame. For example, the fusion layer may directly superimpose the candidate expression map on the previous frame's virtual expression map after it has been deformed by the predicted optical flow, or may weight the two by the predicted weights before superimposing them, so as to obtain the virtual expression map of the frame.
According to the method provided by the embodiment of the invention, the current characteristics and the pre-frame characteristics are used for carrying out optical flow prediction, and the optical flow information is applied to the generation of the virtual expression graph, so that the muscle movement of each area of the virtual image executing expression in the virtual image video is more natural.
Based on any of the above embodiments, fig. 5 is a schematic structural diagram of an expression synthesis model provided in an embodiment of the present invention, and in fig. 5, the expression synthesis model includes a current feature extraction layer, a pre-frame feature extraction layer, a candidate expression prediction layer, an optical flow prediction layer, and a fusion layer.
The current feature extraction layer is used for obtaining the current features of any frame based on the image data and the related features respectively corresponding to the frame.
Assuming the relevant feature sequence of the voice data is M, feeding M into a long short-term memory network (LSTM) yields the hidden-layer features HT of the relevant features; the hidden-layer features corresponding to each frame are denoted HT(0), HT(1), …, HT(t), …, HT(N-1), where HT(t) denotes the hidden-layer feature of the relevant feature corresponding to the t-th frame and N is the total number of frames of the image data. The image data corresponding to the t-th frame includes I(0) and I_m(t), where I(0) denotes the texture map and I_m(t) denotes the expression mask map corresponding to the t-th frame.
In FIG. 5, in the current feature extraction layer, I(0) and I_m(t) are fed into a first convolution layer (kernel=3, stride=2, channel_out=64); the resulting feature map is fed into a second convolution layer (kernel=3, stride=2, channel_out=128), then a third (kernel=3, stride=2, channel_out=256), and then a fourth (kernel=3, stride=2, channel_out=512), yielding a 512-dimensional feature map, which then passes through 5 ResBlock layers (kernel=3, stride=1, channel_out=512) to give a 512-dimensional feature map. In this process, the hidden-layer features HT(t) of the relevant features are expanded and embedded into the second, third and fourth convolution layers and added to the convolution results, so that the relevant features are fused with the image data and the current feature CFT(t) of the t-th frame is obtained.
In the current feature extraction layer, when HT(t) is added to FT(t), the convolution result of I(0) and I_m(t), HT(t) is superimposed only on the mask region of FT(t), i.e. the regions of the avatar that perform expressions; it is not superimposed on the non-mask regions of FT(t). In this way the expression-related features are superimposed only where the expression needs to be performed, while the original avatar is retained where no expression needs to be performed, which can be written in the form:
CFT(t) = FT(t) + Mask(t)·HT(t), with FT(t) = f(I(0), I_m(t); θ)
where Mask(t) denotes the mask region, · denotes superposition restricted to that region, and θ is a relevant parameter of the current feature extraction layer.
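As an illustration only, the current feature extraction layer described above can be sketched in PyTorch roughly as follows; the 256×256 input size, the linear projection used to expand HT(t), and all module and variable names are assumptions made for this sketch rather than details taken from the patent.

```python
# Hypothetical sketch of the current feature extraction layer: 4 stride-2 convolutions
# plus 5 residual blocks, with the speech hidden feature HT(t) added only inside the
# mask region. Shapes, names and the projection scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))


class CurrentFeatureExtractor(nn.Module):
    def __init__(self, in_channels=6, speech_dim=512):
        super().__init__()
        chans = [64, 128, 256, 512]
        self.convs = nn.ModuleList()
        prev = in_channels
        for c in chans:
            self.convs.append(nn.Conv2d(prev, c, kernel_size=3, stride=2, padding=1))
            prev = c
        # project HT(t) so it can be broadcast onto the 2nd, 3rd and 4th feature maps
        self.speech_proj = nn.ModuleList([nn.Linear(speech_dim, c) for c in chans[1:]])
        self.resblocks = nn.Sequential(*[ResBlock(512) for _ in range(5)])

    def forward(self, texture, mask_img, ht):
        # texture: I(0), mask_img: I_m(t), both (B, 3, H, W); ht: HT(t), (B, speech_dim)
        # mask = 1 inside the zeroed (expression) region of I_m(t), 0 elsewhere
        mask = (mask_img.abs().sum(dim=1, keepdim=True) == 0).float()
        x = torch.cat([texture, mask_img], dim=1)
        for i, conv in enumerate(self.convs):
            x = F.relu(conv(x))
            if i >= 1:  # embed HT(t) into the 2nd, 3rd and 4th convolutions
                m = F.interpolate(mask, size=x.shape[-2:], mode="nearest")
                h = self.speech_proj[i - 1](ht)[:, :, None, None]
                x = x + m * h  # superimpose speech features only on the mask region
        return self.resblocks(x)  # CFT(t)


if __name__ == "__main__":
    net = CurrentFeatureExtractor()
    cft = net(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256), torch.randn(1, 512))
    print(cft.shape)  # e.g. torch.Size([1, 512, 16, 16])
```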
The frame front feature extraction layer is used for obtaining frame front features of any frame based on a virtual expression image of a preset frame before the frame.
Assume the preset frames preceding the frame are the two previous frames, namely the (t-1)-th and (t-2)-th frames, and that their virtual expression maps are Fake(t-1) and Fake(t-2). In the pre-frame feature extraction layer, Fake(t-1) and Fake(t-2) are fed into a 4-layer convolutional network (kernel=3, stride=2, channel_out=64,128,256,512) and then through a 5-layer ResBlock (kernel=3, stride=1, channel_out=512) to obtain a 512-dimensional feature map, namely the pre-frame feature PFT(t).
The frame feature of the t-th frame is thus obtained, denoted CFT(t)+PFT(t).
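A corresponding sketch of the pre-frame feature extraction layer, again under assumed input sizes and with hypothetical names, could look like this:

```python
# Hypothetical sketch of the pre-frame feature extraction layer: the two previous
# synthesized frames Fake(t-1) and Fake(t-2) are concatenated on the channel axis
# and passed through 4 stride-2 convolutions and 5 residual blocks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, 1, 1)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))


class PreFrameFeatureExtractor(nn.Module):
    def __init__(self, in_channels=6):  # two stacked RGB frames
        super().__init__()
        layers, prev = [], in_channels
        for c in (64, 128, 256, 512):
            layers += [nn.Conv2d(prev, c, kernel_size=3, stride=2, padding=1), nn.ReLU()]
            prev = c
        layers += [ResBlock(512) for _ in range(5)]
        self.net = nn.Sequential(*layers)

    def forward(self, fake_prev1, fake_prev2):
        return self.net(torch.cat([fake_prev1, fake_prev2], dim=1))  # PFT(t)


if __name__ == "__main__":
    pft = PreFrameFeatureExtractor()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
    print(pft.shape)  # same shape as CFT(t), so the two can be summed into the frame feature
```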
The candidate expression prediction layer is used to determine the corresponding candidate expression map from the input frame feature. In the candidate expression prediction layer, the frame feature CFT(t)+PFT(t) passes through 4 ResBlock layers (kernel=3, stride=1, channel_out=512) and 4 upsampling layers (kernel=3, stride=2, channel_out=256,128,64,1) to obtain the candidate expression map of the t-th frame, denoted S(t). This can be written in the form:
S(t) = g(CFT(t)+PFT(t); φ)
where g denotes the mapping realized by the ResBlock and upsampling layers and φ is a parameter of the candidate expression prediction layer.
The optical flow prediction layer is used to predict the optical flow between the previous frame and the current frame from the input frame feature and to output the optical flow information of the frame. In the optical flow prediction layer, the frame feature CFT(t)+PFT(t) passes through 4 ResBlock layers (kernel=3, stride=1, channel_out=512) and 4 upsampling layers (kernel=3, stride=2, channel_out=256,128,64,3) to obtain the optical flow F(t-1) and the weighting matrix W(t) between the previous frame's virtual expression map Fake(t-1) and the current frame's virtual expression map Fake(t).
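Since the candidate expression prediction layer and the optical flow prediction layer share the same structure and differ only in their output channels, both can be sketched with one hypothetical head; interpreting the three output channels of the optical-flow head as two flow components plus the weighting matrix W(t) is an assumption of this sketch.

```python
# Hypothetical sketch of the two prediction heads that share the frame feature
# CFT(t)+PFT(t): 4 residual blocks followed by 4 stride-2 upsampling layers.
# Channel counts follow the description above; everything else is assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, 1, 1)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(x)))


class PredictionHead(nn.Module):
    def __init__(self, out_channels):
        super().__init__()
        self.res = nn.Sequential(*[ResBlock(512) for _ in range(4)])
        ups, prev = [], 512
        for c in (256, 128, 64, out_channels):
            ups += [nn.ConvTranspose2d(prev, c, kernel_size=4, stride=2, padding=1), nn.ReLU()]
            prev = c
        self.up = nn.Sequential(*ups[:-1])  # no ReLU on the final output

    def forward(self, frame_feat):
        return self.up(self.res(frame_feat))


if __name__ == "__main__":
    frame_feat = torch.randn(1, 512, 16, 16)              # CFT(t) + PFT(t)
    s_t = PredictionHead(out_channels=1)(frame_feat)      # candidate expression map S(t)
    flow_w = PredictionHead(out_channels=3)(frame_feat)
    flow, weight = flow_w[:, :2], torch.sigmoid(flow_w[:, 2:])  # F(t-1), W(t)
    print(s_t.shape, flow.shape, weight.shape)
```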
The fusion layer is used to fuse the candidate expression map S(t) of the frame with the optical flow information F(t-1) and W(t) to obtain the virtual expression map Fake(t) of the frame. Specifically, the candidate expression map S(t) and the previous frame's virtual expression map Fake(t-1), deformed by the optical flow F(t-1), can be weighted and summed using the weighting matrix W(t), thereby realizing their fusion. The specific formula is as follows:
Fake(t)=S(t)*W(t)+(1-W(t))*F(t-1)⊙Fake(t-1)
where ⊙ denotes deforming the image with the optical flow, W(t) is the weight of the candidate expression map, and 1-W(t) is the weight of the previous frame's virtual expression map after optical-flow deformation.
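A minimal sketch of this fusion step, assuming the optical flow is expressed in pixels and implementing the deformation with grid sampling (an implementation choice made for this sketch, not taken from the patent), could be:

```python
# Hypothetical sketch of Fake(t) = S(t)*W(t) + (1-W(t)) * (F(t-1) ⊙ Fake(t-1)),
# where ⊙ is realized here by warping the previous frame with the predicted flow.
import torch
import torch.nn.functional as F


def warp(image, flow):
    """Warp `image` (B,C,H,W) with a dense flow field (B,2,H,W) given in pixels."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(image.device)   # (2,H,W)
    coords = grid[None] + flow                                     # sample positions
    # normalize to [-1, 1] as required by grid_sample
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)          # (B,H,W,2)
    return F.grid_sample(image, grid_norm, align_corners=True)


def fuse(candidate, weight, flow, fake_prev):
    """Weighted fusion of the candidate map and the warped previous frame."""
    warped_prev = warp(fake_prev, flow)
    return candidate * weight + (1.0 - weight) * warped_prev


if __name__ == "__main__":
    s_t = torch.rand(1, 3, 256, 256)        # candidate expression map S(t)
    w_t = torch.rand(1, 1, 256, 256)        # weighting matrix W(t)
    f_prev = torch.zeros(1, 2, 256, 256)    # predicted optical flow F(t-1)
    fake_prev = torch.rand(1, 3, 256, 256)  # previous virtual expression map Fake(t-1)
    print(fuse(s_t, w_t, f_prev, fake_prev).shape)
```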
Through the use of relevant features and the modeling of the expression as a whole, the expression synthesis model provided by the embodiment of the invention can reproduce the synthesis details of different people under different emotions more vividly while avoiding the incoherence caused by synthesizing regions independently. In addition, the optical flow information improves the inter-frame continuity of the synthesized avatar.
Based on any of the above embodiments, in the method, the expression synthesis model is trained based on a sample speaker video, the relevant features of the sample voice data corresponding to the sample speaker video, the sample image data and a discriminator, and the expression synthesis model and the discriminator form a generative adversarial network.
Specifically, a generative adversarial network (GAN) is a deep learning model and one of the most promising approaches to unsupervised learning on complex distributions. A GAN contains two modules, a generative model and a discriminative model, whose mutual game-playing during training produces good outputs. In the embodiment of the invention, the expression synthesis model is the generative model and the discriminator is the discriminative model.
The expression synthesis model is used for synthesizing continuous virtual image videos, and the discriminator is used for discriminating whether the input video is the virtual image video synthesized by the expression synthesis model or the truly recorded video. The role of the discriminator is to judge whether the virtual image video synthesized by the expression synthesis model is true and realistic.
According to the method provided by the embodiment of the invention, through the mutual game learning training of the expression synthesis model and the discriminator, the training effect of the expression synthesis model can be obviously improved, so that the fidelity and naturalness of the virtual image video output by the expression synthesis model can be effectively improved.
Based on any of the above embodiments, the arbiter comprises an image arbiter and/or a video arbiter; the image discriminator is used for judging the synthesis authenticity of any frame of virtual expression graph in the virtual image video, and the video discriminator is used for judging the synthesis authenticity of the virtual image video.
Specifically, the generated countermeasure network may include only an image discriminator or a video discriminator, or may include both the image discriminator and the video discriminator.
The image discriminator judges authenticity at the image level, i.e. whether the synthesized expression, such as the synthesized facial and neck muscles, looks realistic. The image discriminator takes the virtual expression map Fake(t) of the current frame synthesized by the expression synthesis model, feeds it into a 4-layer convolutional network (kernel=3, stride=1, channel_out=64,128,256,1), and computes the L2 norm between the resulting feature map and an all-0 matrix of the same size. Similarly, the image discriminator feeds any image frame Real(t) of the real recorded video into the same 4-layer convolutional network and computes the L2 norm between the resulting feature map and an all-1 matrix of the same size. Here the all-0 matrix corresponds to a synthesized image, the all-1 matrix corresponds to a real image, and these L2 norms serve as the loss of the image discriminator. To ensure that the synthesized virtual expression map is of high quality at every resolution, the virtual expression map output by the expression synthesis model can also be downsampled by factors of 2 and 4 for discrimination.
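A hypothetical sketch of such an image discriminator and its least-squares loss toward all-0/all-1 targets at three scales (the leaky activations and average-pooling downsampling are assumptions) might look like this:

```python
# Hypothetical sketch of the image discriminator: a 4-layer convolutional network
# whose output map is driven toward all-zeros for synthesized frames and all-ones
# for real frames, applied at full, 1/2 and 1/4 resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageDiscriminator(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        layers, prev = [], in_channels
        for c in (64, 128, 256, 1):
            layers += [nn.Conv2d(prev, c, kernel_size=3, stride=1, padding=1), nn.LeakyReLU(0.2)]
            prev = c
        self.net = nn.Sequential(*layers[:-1])  # raw score map from the last convolution

    def forward(self, image):
        return self.net(image)


def multiscale_d_loss(disc, fake_frame, real_frame):
    """L2 loss toward 0 for synthesized frames and 1 for real frames, at 3 scales."""
    loss = 0.0
    for scale in (1, 2, 4):
        fake = F.avg_pool2d(fake_frame, scale) if scale > 1 else fake_frame
        real = F.avg_pool2d(real_frame, scale) if scale > 1 else real_frame
        loss = loss + disc(fake.detach()).pow(2).mean()      # fake -> all-0 target
        loss = loss + (disc(real) - 1.0).pow(2).mean()       # real -> all-1 target
    return loss


if __name__ == "__main__":
    d = ImageDiscriminator()
    print(multiscale_d_loss(d, torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)))
```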
The video discriminator judges authenticity at the video level, i.e. whether the synthesized video, for example the coordinated movement of the facial and neck muscles, looks realistic. Several consecutive virtual expression maps synthesized by the expression synthesis model and the corresponding optical flows, e.g. Fake(t-2), Fake(t-1), Fake(t) and F(t-2), F(t-1), can be fed into a video discriminator formed by a 4-layer convolutional network (kernel=3, stride=1, channel_out=64,128,256,1) to compute the discrimination loss. Similarly, the video discriminator also computes the discrimination loss of the real recorded video. To ensure that the synthesized avatar video is of high quality at every resolution, the virtual expression maps output by the expression synthesis model can also be downsampled by factors of 2 and 4 for discrimination.
During training of the expression synthesis model, the opposing loss of the discriminator can be added to the loss function of the expression synthesis model, so that the expression synthesis model and the discriminator are trained against each other.
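For illustration, one combined training step under these ideas might be sketched as follows, with an assumed L1 reconstruction term and loss weight that the description above does not specify:

```python
# Hypothetical sketch of one adversarial training step: the discriminator's
# opposing objective is added to the expression synthesis model's own
# reconstruction loss, so the two networks are trained as an adversarial pair.
import torch.nn.functional as F


def train_step(generator, discriminator, g_opt, d_opt, gen_inputs, real_frame, adv_weight=0.1):
    """One adversarial update; `gen_inputs` is whatever the generator consumes."""
    fake_frame = generator(*gen_inputs)

    # 1) Discriminator: drive real frames toward 1 and synthesized frames toward 0.
    d_loss = (discriminator(fake_frame.detach()).pow(2).mean()
              + (discriminator(real_frame) - 1.0).pow(2).mean())
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Generator: reconstruction term plus the opposing adversarial term.
    g_adv = (discriminator(fake_frame) - 1.0).pow(2).mean()  # try to fool the discriminator
    g_loss = F.l1_loss(fake_frame, real_frame) + adv_weight * g_adv
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```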
Based on any of the above embodiments, the method wherein the relevant features include language-related features, as well as emotional features and/or speaker identity features.
The language-related features correspond to different pronunciations, which require the speaker to mobilize facial muscles to form different mouth shapes; the facial-muscle and neck-muscle movements corresponding to different mouth shapes differ. The emotional features characterize the speaker's emotion: when a speaker says the same content under different emotions, the movements of the facial muscles, including the mouth shape, and of the neck muscles also differ. The speaker identity features characterize the speaker's identity and may specifically be an identifier of the speaker, of the speaker's occupation, or of the speaker's personality and language-style characteristics.
Based on any of the above embodiments, in the method, the avatar data is determined based on a speaker identity feature.
Specifically, among the mass avatar data stored in advance, different avatar data correspond to different avatars having different identity characteristics. After the speaker identity characteristics in the related characteristics of the voice data are known, the image data matched with the speaker identity characteristics can be selected from the massive image data and applied to the synthesis of the virtual image video.
For example, image data of four A, B, C, D persons is stored in advance. When the speaker identity characteristics of the voice data are known to point to B, the avatar data of B may be correspondingly determined for the synthesis of the avatar video.
Based on any of the above embodiments, step 110 specifically includes: determining acoustic features of the speech data; relevant features are determined based on the acoustic features.
Specifically, the acoustic features here may be spectrogram and fbank features. For example, the speech data may be denoised with an adaptive filter and its sample rate and channels unified, here to 16 kHz mono, after which the spectrogram and fbank features (frame shift 10 ms, window length 1 s) are extracted.
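A possible front end for this step, sketched with librosa under the stated 16 kHz / 10 ms / 1 s settings (the mel-band count and the choice of librosa are assumptions, and the adaptive-filter denoising is omitted):

```python
# Hypothetical sketch of the acoustic front end: resample to 16 kHz mono and
# extract log-mel (fbank) features with a 10 ms frame shift and a 1 s window.
import librosa
import numpy as np


def extract_fbank(path, sr=16000, hop_ms=10, win_ms=1000, n_mels=80):
    y, _ = librosa.load(path, sr=sr, mono=True)             # unify sample rate / channels
    hop = int(sr * hop_ms / 1000)                           # 10 ms frame shift -> 160 samples
    win = int(sr * win_ms / 1000)                           # 1 s window -> 16000 samples
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=win, win_length=win, hop_length=hop, n_mels=n_mels
    )
    return np.log(mel + 1e-6).T                             # (num_frames, n_mels)
```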
Afterwards, a bottleneck network can be used to extract the BN feature sequence representing the language content as the language-related features; here a 256-dimensional BN feature is obtained every 40 ms, denoted L(0), L(1), …, L(N-1), with N the number of frames of the 25 fps video. Compared with the phoneme-based features of the prior art, BN features are language-independent: even if the expression synthesis model is trained only on Chinese, it can still synthesize correct mouth shapes for other languages. In addition, in the embodiment of the invention a convolutional long short-term memory network (ConvLSTM), fully trained on an eight-class basic-expression recognition task (lively, happy, afraid, depressed, excited, surprised, sad and neutral), is used to extract a high-dimensional feature sequence expressing emotion as the emotion features; here a 128-dimensional emotion vector is obtained every 40 ms, denoted E(0), E(1), …, E(N-1). Similarly, to achieve personalized customization, a speaker identification network based on a deep neural network (DNN) and i-vectors is used to extract the speaker identity feature sequence; here a 128-dimensional identity feature vector is obtained every 40 ms, denoted P(0), P(1), …, P(N-1). Finally, the three feature sequences are concatenated frame by frame, giving a 512-dimensional fused relevant feature for each frame, denoted M(0), M(1), …, M(N-1), where N is the number of frames of the 25 fps video.
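The final frame-wise concatenation can be sketched as follows; the upstream bottleneck, ConvLSTM and speaker-identification networks are not reproduced here.

```python
# Hypothetical sketch of the per-frame feature fusion: 256-dim language-related BN
# features, 128-dim emotion features and 128-dim speaker identity features, one
# vector each per 40 ms frame, concatenated into a 512-dim relevant feature M(t).
import numpy as np


def fuse_related_features(bn_feats, emotion_feats, speaker_feats):
    """bn_feats: (N,256), emotion_feats: (N,128), speaker_feats: (N,128) -> (N,512)."""
    assert bn_feats.shape[0] == emotion_feats.shape[0] == speaker_feats.shape[0]
    return np.concatenate([bn_feats, emotion_feats, speaker_feats], axis=1)


if __name__ == "__main__":
    n_frames = 250  # e.g. 10 s of 25 fps video
    m = fuse_related_features(
        np.random.randn(n_frames, 256),
        np.random.randn(n_frames, 128),
        np.random.randn(n_frames, 128),
    )
    print(m.shape)  # (250, 512)
```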
Based on any of the above embodiments, in the method, the expression corresponding to the voice data of the avatar configuration in the avatar video includes a facial expression and a neck expression.
Correspondingly, the expression mask map in the image data masks a region comprising the execution area of the facial expression and the execution area of the neck expression. Here, the facial expression execution area may include facial muscle regions such as the frontalis, orbicularis oculi, corrugator and orbicularis oris muscles, but not the eyeball and nose-bridge regions, because eyeball movement is not controlled by facial muscles and the nose bridge, being bony and approximately rigid, is barely affected by the movement of muscles in other regions of the face.
In the embodiment of the invention, the facial expression and the neck expression are combined and treated as a whole; compared with modeling each expression region separately, modeling the expression as a whole effectively handles the linkage between the muscles of the different regions, so that the muscle linkage of each region is more natural and lifelike.
Based on any of the above embodiments, fig. 6 is a schematic flow chart of an avatar composition method according to another embodiment of the present invention, as shown in fig. 6, the method includes:
Step 610, determining voice data:
extracting voice data from the collected video and audio data, denoising the voice data by using an adaptive filter, unifying an audio sampling rate and a sound channel, and then extracting a spectrogram and fbank characteristics from the voice data to be identified. In order to fully ensure the time sequence of the voice data, the embodiment of the invention does not need to split the input voice data.
Step 620, acquiring relevant features of the voice data:
and (3) respectively obtaining the language-related features, the emotion features and the speaker identity features corresponding to the voice data to be recognized of each frame through a neural network for extracting the language-related features, the emotion features and the speaker identity features of the voice data obtained in the previous step, and splicing the three features according to the corresponding frames to obtain the corresponding related features of each frame.
Step 630, determining video data, detecting a face area, and cutting a head area:
Extracting video data from the collected audio-visual data, detecting the face region of each frame, expanding the detected face box outward by a factor of 1.5 to obtain a region containing the whole head and neck, cropping this region, and storing it as an image sequence denoted I(0), I(1), …, I(N-1), where N is the number of frames of the 25 fps video.
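For illustration, this detection-and-cropping step might be sketched as below; the Haar-cascade detector is only a stand-in for whichever face detector is actually used, and the 1.5x expansion follows the description above.

```python
# Hypothetical sketch of step 630: detect the face in a frame, expand the detected
# box by a factor of 1.5 to cover the whole head and neck, and crop that region.
import cv2


def crop_head_region(frame, scale=1.5):
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    cx, cy = x + w / 2, y + h / 2
    nw, nh = w * scale, h * scale                    # expand outward by 1.5x
    x0 = max(int(cx - nw / 2), 0)
    y0 = max(int(cy - nh / 2), 0)
    x1 = min(int(cx + nw / 2), frame.shape[1])
    y1 = min(int(cy + nh / 2), frame.shape[0])
    return frame[y0:y1, x0:x1]
```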
Step 640, generating image data:
facial muscle areas and neck muscle areas such as frontal muscle, orbicularis oculi, frowning muscle, orbicularis stomatalis muscle and the like of each frame of cut image I (t) are segmented according to skin colors and physiological structural characteristics of a human face or by using a neural network, and eyeball areas and nose bridge areas are not included, so that the movement of eyeballs is not controlled by facial muscles, and the nose bridge has bones, is approximately rigid, and is little influenced by the movement of muscles of other areas of the face. The pixel values of the facial muscle region and the neck muscle region are set to zero to obtain an expression mask image sequence, which is marked as Im (0), im (1), …, im (N-1), and N is the number of frames of the video of 25 fps.
The image data thus obtained contains a texture map I (0) and an expression mask map corresponding to each frame.
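Given a segmentation of the facial- and neck-muscle regions, producing the expression mask image can be sketched as follows; the segmentation itself (by skin-colour rules or a network) is not shown, and the function name is hypothetical.

```python
# Hypothetical sketch of step 640: zero out the segmented facial/neck muscle
# regions of a cropped frame I(t) to obtain the expression mask image Im(t).
import numpy as np


def make_expression_mask(image, muscle_mask):
    """image: (H,W,3) uint8 crop I(t); muscle_mask: (H,W) bool, True inside the
    facial/neck muscle regions (eyeballs and nose bridge excluded)."""
    masked = image.copy()
    masked[muscle_mask] = 0   # zero out the regions that perform expressions
    return masked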
Step 650, inputting an expression synthesis model to obtain an avatar head video:
In the expression synthesis model, a feature map is obtained from the texture map and expression mask map in the image data through several convolutional layers; the feature map is fused with the concatenated relevant features; the face and neck regions are then synthesized through several more convolutional layers; and finally optical flow information is incorporated into the video, making the synthesized mouth shape, expression, throat movement and so on more natural.
For example, suppose the input texture map is expressionless and the voice data is an excited exclamation. Regions of the texture map not involved in the expression, such as the hair and nose, are kept unchanged, while the relevant regions, such as the mouth, cheeks and eyebrows, are deformed into new textures according to the relevant features and the texture image, and the final synthesized virtual expression map is obtained by fusion.
Step 660, merging the avatar header video and the body part of the video data:
If the synthesized virtual head region were simply pasted back into the video at its original coordinates, fine seams would appear at the boundary; preferably, a Poisson fusion algorithm is used to blend the seam regions so that the boundary transition is smoother.
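A sketch of this pasting-and-blending step using OpenCV's seamless cloning (one possible realisation of Poisson fusion; the crop coordinates are assumed to have been recorded in step 630):

```python
# Hypothetical sketch of step 660: paste the synthesized head crop back into the
# original frame and smooth the seam with Poisson (seamless) blending.
import cv2
import numpy as np


def paste_head(frame, head, x0, y0):
    h, w = head.shape[:2]
    mask = 255 * np.ones((h, w), dtype=np.uint8)    # clone the whole crop
    center = (x0 + w // 2, y0 + h // 2)             # centre of the paste region
    return cv2.seamlessClone(head, frame, mask, center, cv2.NORMAL_CLONE)
```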
Compared with the traditional speech-driven avatar synthesis technology and the expression-transfer-based face synthesis technology, the method provided by the embodiment of the invention can synthesize the facial and neck muscle movements of different people under different emotions more realistically, and can achieve fully automatic offline synthesis, saving a large amount of labor cost and improving production efficiency.
Based on any one of the above embodiments, fig. 7 is a schematic structural view of an avatar composition device according to an embodiment of the present invention, and as shown in fig. 7, the device includes a relevant feature determining unit 710 and an expression composition unit 720;
wherein the relevant feature determining unit 710 is configured to determine relevant features of the voice data; the relevant features are used for representing the features related to the expression of the speaker contained in the voice data;
the expression synthesis unit 720 is configured to input the avatar data and the related features into an expression synthesis model, so as to obtain an avatar video output by the expression synthesis model, where an avatar in the avatar video is configured with an expression corresponding to the voice data;
the expression synthesis model is obtained by training relevant features of sample voice data and sample image data corresponding to a sample speaker video.
The device provided by the embodiment of the invention synthesizes the avatar's expression using relevant features that contain rich expression-related information, so that the avatar's expression fits the voice data better and appears more natural and realistic. In addition, in the avatar video generated by the expression synthesis model the avatar's expression exists as a whole; compared with modeling each expression region separately, modeling the expression as a whole effectively handles the linkage between the muscles of the different regions, so that the muscle linkage of each region is more natural and lifelike.
Based on any of the above embodiments, the expression synthesis unit 720 includes:
the feature extraction unit is used for inputting the image data and the related features corresponding to any frame to a feature extraction layer of the expression synthesis model to obtain frame features output by the feature extraction layer;
and the expression prediction unit is used for inputting the frame characteristics to an expression prediction layer of the expression synthesis model to obtain a virtual expression image of any frame output by the expression prediction layer.
Based on any of the above embodiments, the feature extraction unit includes:
a current feature extraction subunit, configured to input image data and related features corresponding to any frame respectively to a current feature extraction layer of the feature extraction layer, so as to obtain a current feature output by the current feature extraction layer;
and the pre-frame feature extraction subunit is used for inputting the virtual expression maps of a preset number of frames preceding any frame to the pre-frame feature extraction layer of the feature extraction layer, so as to obtain the pre-frame features output by the pre-frame feature extraction layer.
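The split into a current branch and a pre-frame branch might look like the following sketch (PyTorch assumed; the number of preceding frames, the channel counts, and the assumption that the related features have already been tiled over the spatial grid are all illustrative):

```python
import torch
import torch.nn as nn

class FeatureExtractionLayer(nn.Module):
    """Hypothetical two-branch feature extraction layer: one branch encodes the
    current frame's image data plus related features, the other encodes the
    virtual expression maps of the preceding (preset) frames."""

    def __init__(self, n_prev=2, related_dim=256):
        super().__init__()
        self.current = nn.Sequential(      # current feature extraction layer
            nn.Conv2d(4 + related_dim, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pre_frame = nn.Sequential(    # pre-frame feature extraction layer
            nn.Conv2d(3 * n_prev, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, current_input, prev_expression_maps):
        # current_input: texture map + mask + tiled related features, concatenated on channels
        # prev_expression_maps: the last n_prev synthesized frames, concatenated on channels
        return self.current(current_input), self.pre_frame(prev_expression_maps)
```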
Based on any of the above embodiments, the expression prediction unit is specifically configured to:
and the current features and the pre-frame features are fused and then input into the expression prediction layer, so that a virtual expression image of any frame output by the expression prediction layer is obtained.
Based on any of the above embodiments, the expression prediction unit includes:
the candidate expression prediction subunit is used for fusing the current features and the pre-frame features and inputting them into a candidate expression prediction layer of the expression prediction layer, so as to obtain a candidate expression image output by the candidate expression prediction layer;
the optical flow prediction subunit is used for fusing the current features and the pre-frame features and inputting them into an optical flow prediction layer of the expression prediction layer, so as to obtain optical flow information output by the optical flow prediction layer;
and the fusion subunit is used for inputting the candidate expression images and the optical flow information into a fusion layer in the expression prediction layer to obtain the virtual expression image of any frame output by the fusion layer.
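Continuing the sketch above, the candidate-expression branch, the optical-flow branch, and the fusion layer could be wired as follows; interpreting the fusion as flow-based warping of the previous frame blended with the candidate image is one plausible reading, not a detail stated in this embodiment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionPredictionLayer(nn.Module):
    """Hypothetical expression prediction layer operating on the fused
    current/pre-frame features. Assumes the feature map is at image resolution
    and that the predicted flow is expressed in normalized coordinates."""

    def __init__(self, ch=256):
        super().__init__()
        self.candidate = nn.Conv2d(ch, 3, 3, padding=1)  # candidate expression image
        self.flow = nn.Conv2d(ch, 2, 3, padding=1)       # optical flow (dx, dy) per pixel
        self.weight = nn.Conv2d(ch, 1, 3, padding=1)     # per-pixel fusion weight

    def forward(self, fused_feat, prev_frame):
        cand = torch.tanh(self.candidate(fused_feat))
        flow = self.flow(fused_feat)
        b, _, h, w = flow.shape
        # build a sampling grid and shift it by the predicted flow
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
        )
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).to(flow)
        warped = F.grid_sample(prev_frame, grid + flow.permute(0, 2, 3, 1),
                               align_corners=True)
        # fusion layer: blend the warped previous frame with the candidate image
        a = torch.sigmoid(self.weight(fused_feat))
        return a * cand + (1 - a) * warped
```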
Based on any one of the above embodiments, the expression synthesis model is obtained by training based on a sample speaker video, relevant features of sample voice data corresponding to the sample speaker video, sample image data, and a discriminator, where the expression synthesis model and the discriminator form a generative adversarial network.
Based on any of the above embodiments, the discriminator comprises an image discriminator and/or a video discriminator;
the image discriminator is used for judging the synthesis authenticity of any frame of virtual expression graph in the virtual image video, and the video discriminator is used for judging the synthesis authenticity of the virtual image video.
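A minimal sketch of such an adversarial setup, assuming a PyTorch implementation with a patch-style image discriminator on single frames and a 3-D convolutional video discriminator on short clips (both architectures are placeholders, not taken from the patent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Image discriminator: judges the realism of a single synthesized frame (B, 3, H, W).
image_discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, stride=1, padding=0),
)

# Video discriminator: judges the realism of a short clip (B, 3, T, H, W)
# using 3-D convolutions over time and space.
video_discriminator = nn.Sequential(
    nn.Conv3d(3, 64, (3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)), nn.LeakyReLU(0.2),
    nn.Conv3d(64, 1, (3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1)),
)

def adversarial_loss(logits, is_real):
    """Standard GAN loss term shared by the generator and both discriminators."""
    target = torch.ones_like(logits) if is_real else torch.zeros_like(logits)
    return F.binary_cross_entropy_with_logits(logits, target)
```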
Based on any of the above embodiments, the relevant features include language-related features, as well as emotional features and/or speaker identity features.
Based on any of the above embodiments, the persona data is determined based on the speaker identity characteristics.
Based on any of the above embodiments, the expression, corresponding to the voice data, that is configured for the avatar in the avatar video includes a facial expression and a neck expression.
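For completeness, the related features fed to the model could simply be the per-frame concatenation of language-related features with an emotion embedding and/or a speaker-identity embedding; the helper below and its dimensions are hypothetical:

```python
import numpy as np

def build_related_features(language_feat, emotion_feat=None, speaker_feat=None):
    """Concatenate language-related features with optional emotional and
    speaker-identity features into one related-feature vector per frame."""
    parts = [language_feat]
    if emotion_feat is not None:
        parts.append(emotion_feat)
    if speaker_feat is not None:
        parts.append(speaker_feat)
    return np.concatenate(parts, axis=-1)
```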
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 8, the electronic device may include: a processor 810, a communication interface (Communications Interface) 820, a memory 830, and a communication bus 840, wherein the processor 810, the communication interface 820, and the memory 830 communicate with each other via the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the following method: determining relevant features of the voice data, the relevant features being used to characterize the expression-related features of the speaker contained in the voice data; inputting the image data and the relevant features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein the avatar in the avatar video is configured with an expression corresponding to the voice data; the expression synthesis model is obtained by training based on relevant features of sample voice data corresponding to a sample speaker video and sample image data.
Further, the above logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the methods provided by the above embodiments, for example, comprising: determining relevant characteristics of the voice data; the relevant features are used for representing the features related to the expression of the speaker contained in the voice data; inputting the image data and the related features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein the avatar in the avatar video is configured with expressions corresponding to the voice data; the expression synthesis model is obtained by training relevant features of sample voice data and sample image data corresponding to a sample speaker video.
The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. An avatar composition method, comprising:
determining relevant characteristics of the voice data; the relevant features are used for representing the features related to the expression of the speaker contained in the voice data;
inputting the image data and the related features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein the avatar in the avatar video is configured with expressions corresponding to the voice data;
the expression synthesis model is obtained by training based on a sample speaker video, relevant characteristics of sample voice data corresponding to the sample speaker video and sample image data;
Inputting the image data and the related features into the expression synthesis model to obtain the avatar video output by the expression synthesis model specifically comprises:
inputting image data and related features corresponding to any frame in the voice data to a feature extraction layer of the expression synthesis model respectively to obtain frame features output by the feature extraction layer, wherein the frame features comprise current features and pre-frame features;
inputting the frame characteristics to an expression prediction layer of the expression synthesis model to obtain a virtual expression image of any frame output by the expression prediction layer, wherein the virtual expression image is a frame of image containing an avatar, and each frame of virtual expression image forms an avatar video;
the step of inputting the frame characteristics to an expression prediction layer of the expression synthesis model to obtain a virtual expression map of any frame output by the expression prediction layer, comprises the following steps:
the current features and the pre-frame features are fused and then input into a candidate expression prediction layer of the expression prediction layer, so as to obtain a candidate expression image output by the candidate expression prediction layer;
the current features and the pre-frame features are fused and then input into an optical flow prediction layer of the expression prediction layer, so as to obtain optical flow information output by the optical flow prediction layer;
And inputting the candidate expression images and the optical flow information into a fusion layer in the expression prediction layer to obtain a virtual expression image of any frame output by the fusion layer.
2. The avatar composition method as claimed in claim 1, wherein the inputting of the avatar data and related features corresponding to any frame of the voice data to the feature extraction layer of the expression composition model, respectively, obtains frame features outputted from the feature extraction layer, comprises:
inputting the image data and related features corresponding to any frame to a current feature extraction layer of the feature extraction layer to obtain current features output by the current feature extraction layer;
and inputting the virtual expression images of a preset number of frames preceding any frame to a pre-frame feature extraction layer of the feature extraction layer to obtain the pre-frame features output by the pre-frame feature extraction layer.
3. The avatar composition method of claim 1, wherein the expression synthesis model is obtained by training based on a sample speaker video, relevant features of sample voice data corresponding to the sample speaker video, sample image data, and a discriminator, the expression synthesis model and the discriminator constituting a generative adversarial network.
4. A method of avatar composition according to claim 3, wherein the discriminator comprises an image discriminator and/or a video discriminator;
the image discriminator is used for judging the synthesis authenticity of any frame of virtual expression graph in the virtual image video, and the video discriminator is used for judging the synthesis authenticity of the virtual image video.
5. The avatar composition method of any one of claims 1 to 4, wherein the related features include language related features, and emotional and/or speaker identity features.
6. The avatar composition method of claim 5, wherein the avatar data is determined based on the speaker identity characteristics.
7. The avatar composition method of any one of claims 1 to 4, wherein the expression, corresponding to the voice data, that is configured for the avatar in the avatar video includes a facial expression and a neck expression.
8. An avatar composition device, comprising:
a relevant feature determining unit for determining relevant features of the voice data; the relevant features are used for representing the features related to the expression of the speaker contained in the voice data;
The expression synthesis unit is used for inputting the image data and the related features into an expression synthesis model to obtain an avatar video output by the expression synthesis model, wherein the avatar in the avatar video is configured with an expression corresponding to the voice data;
the expression synthesis model is obtained by training relevant features of sample voice data and sample image data corresponding to a sample speaker video;
the expression synthesis unit is specifically configured to:
inputting image data and related features corresponding to any frame in the voice data to a feature extraction layer of the expression synthesis model respectively to obtain frame features output by the feature extraction layer, wherein the frame features comprise current features and pre-frame features;
inputting the frame characteristics to an expression prediction layer of the expression synthesis model to obtain a virtual expression image of any frame output by the expression prediction layer, wherein the virtual expression image is a frame of image containing an avatar, and each frame of virtual expression image forms an avatar video;
the current features and the pre-frame features are fused and then input into a candidate expression prediction layer of the expression prediction layer, so as to obtain a candidate expression image output by the candidate expression prediction layer;
the current features and the pre-frame features are fused and then input into an optical flow prediction layer of the expression prediction layer, so as to obtain optical flow information output by the optical flow prediction layer;
and inputting the candidate expression images and the optical flow information into a fusion layer in the expression prediction layer to obtain a virtual expression image of any frame output by the fusion layer.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the avatar composition method as claimed in any one of claims 1 to 7 when the program is executed.
10. A non-transitory computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the avatar composition method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911274701.3A CN111145282B (en) | 2019-12-12 | 2019-12-12 | Avatar composition method, apparatus, electronic device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911274701.3A CN111145282B (en) | 2019-12-12 | 2019-12-12 | Avatar composition method, apparatus, electronic device, and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111145282A CN111145282A (en) | 2020-05-12 |
CN111145282B true CN111145282B (en) | 2023-12-05 |
Family
ID=70518080
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911274701.3A Active CN111145282B (en) | 2019-12-12 | 2019-12-12 | Avatar composition method, apparatus, electronic device, and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111145282B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111915479B (en) * | 2020-07-15 | 2024-04-26 | 抖音视界有限公司 | Image processing method and device, electronic equipment and computer readable storage medium |
CN112132915B (en) * | 2020-08-10 | 2022-04-26 | 浙江大学 | Diversified dynamic time-delay video generation method based on generation countermeasure mechanism |
CN112215927B (en) * | 2020-09-18 | 2023-06-23 | 腾讯科技(深圳)有限公司 | Face video synthesis method, device, equipment and medium |
CN112182173B (en) * | 2020-09-23 | 2024-08-06 | 支付宝(杭州)信息技术有限公司 | Man-machine interaction method and device based on virtual life and electronic equipment |
CN112465935A (en) * | 2020-11-19 | 2021-03-09 | 科大讯飞股份有限公司 | Virtual image synthesis method and device, electronic equipment and storage medium |
CN112492383A (en) * | 2020-12-03 | 2021-03-12 | 珠海格力电器股份有限公司 | Video frame generation method and device, storage medium and electronic equipment |
CN112650399B (en) * | 2020-12-22 | 2023-12-01 | 科大讯飞股份有限公司 | Expression recommendation method and device |
CN114793286A (en) * | 2021-01-25 | 2022-07-26 | 上海哔哩哔哩科技有限公司 | Video editing method and system based on virtual image |
CN114793300A (en) * | 2021-01-25 | 2022-07-26 | 天津大学 | Virtual video customer service robot synthesis method and system based on generation countermeasure network |
CN112785669B (en) * | 2021-02-01 | 2024-04-23 | 北京字节跳动网络技术有限公司 | Virtual image synthesis method, device, equipment and storage medium |
CN113096242A (en) * | 2021-04-29 | 2021-07-09 | 平安科技(深圳)有限公司 | Virtual anchor generation method and device, electronic equipment and storage medium |
WO2022255980A1 (en) * | 2021-06-02 | 2022-12-08 | Bahcesehir Universitesi | Virtual agent synthesis method with audio to video conversion |
CN114466179B (en) * | 2021-09-09 | 2024-09-06 | 马上消费金融股份有限公司 | Method and device for measuring synchronism of voice and image |
CN114911381B (en) * | 2022-04-15 | 2023-06-16 | 青岛海尔科技有限公司 | Interactive feedback method and device, storage medium and electronic device |
CN114937104B (en) * | 2022-06-24 | 2024-08-13 | 北京有竹居网络技术有限公司 | Virtual object face information generation method and device and electronic equipment |
CN115375809B (en) * | 2022-10-25 | 2023-03-14 | 科大讯飞股份有限公司 | Method, device and equipment for generating virtual image and storage medium |
CN116665695B (en) * | 2023-07-28 | 2023-10-20 | 腾讯科技(深圳)有限公司 | Virtual object mouth shape driving method, related device and medium |
CN117221465B (en) * | 2023-09-20 | 2024-04-16 | 北京约来健康科技有限公司 | Digital video content synthesis method and system |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106919251A (en) * | 2017-01-09 | 2017-07-04 | 重庆邮电大学 | A kind of collaborative virtual learning environment natural interactive method based on multi-modal emotion recognition |
CN107705808A (en) * | 2017-11-20 | 2018-02-16 | 合光正锦(盘锦)机器人技术有限公司 | A kind of Emotion identification method based on facial characteristics and phonetic feature |
WO2018113650A1 (en) * | 2016-12-21 | 2018-06-28 | 深圳市掌网科技股份有限公司 | Virtual reality language interaction system and method |
CN108989705A (en) * | 2018-08-31 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | A kind of video creating method of virtual image, device and terminal |
CN109118562A (en) * | 2018-08-31 | 2019-01-01 | 百度在线网络技术(北京)有限公司 | Explanation video creating method, device and the terminal of virtual image |
CN109145837A (en) * | 2018-08-28 | 2019-01-04 | 厦门理工学院 | Face emotion identification method, device, terminal device and storage medium |
CN109410297A (en) * | 2018-09-14 | 2019-03-01 | 重庆爱奇艺智能科技有限公司 | It is a kind of for generating the method and apparatus of avatar image |
CN110009716A (en) * | 2019-03-28 | 2019-07-12 | 网易(杭州)网络有限公司 | Generation method, device, electronic equipment and the storage medium of facial expression |
CN110414323A (en) * | 2019-06-14 | 2019-11-05 | 平安科技(深圳)有限公司 | Mood detection method, device, electronic equipment and storage medium |
CN110488975A (en) * | 2019-08-19 | 2019-11-22 | 深圳市仝智科技有限公司 | A kind of data processing method and relevant apparatus based on artificial intelligence |
CN110503942A (en) * | 2019-08-29 | 2019-11-26 | 腾讯科技(深圳)有限公司 | A kind of voice driven animation method and device based on artificial intelligence |
2019-12-12 | CN | CN201911274701.3A | patent CN111145282B (en) | active Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018113650A1 (en) * | 2016-12-21 | 2018-06-28 | 深圳市掌网科技股份有限公司 | Virtual reality language interaction system and method |
CN106919251A (en) * | 2017-01-09 | 2017-07-04 | 重庆邮电大学 | A kind of collaborative virtual learning environment natural interactive method based on multi-modal emotion recognition |
CN107705808A (en) * | 2017-11-20 | 2018-02-16 | 合光正锦(盘锦)机器人技术有限公司 | A kind of Emotion identification method based on facial characteristics and phonetic feature |
CN109145837A (en) * | 2018-08-28 | 2019-01-04 | 厦门理工学院 | Face emotion identification method, device, terminal device and storage medium |
CN108989705A (en) * | 2018-08-31 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | A kind of video creating method of virtual image, device and terminal |
CN109118562A (en) * | 2018-08-31 | 2019-01-01 | 百度在线网络技术(北京)有限公司 | Explanation video creating method, device and the terminal of virtual image |
CN109410297A (en) * | 2018-09-14 | 2019-03-01 | 重庆爱奇艺智能科技有限公司 | It is a kind of for generating the method and apparatus of avatar image |
CN110009716A (en) * | 2019-03-28 | 2019-07-12 | 网易(杭州)网络有限公司 | Generation method, device, electronic equipment and the storage medium of facial expression |
CN110414323A (en) * | 2019-06-14 | 2019-11-05 | 平安科技(深圳)有限公司 | Mood detection method, device, electronic equipment and storage medium |
CN110488975A (en) * | 2019-08-19 | 2019-11-22 | 深圳市仝智科技有限公司 | A kind of data processing method and relevant apparatus based on artificial intelligence |
CN110503942A (en) * | 2019-08-29 | 2019-11-26 | 腾讯科技(深圳)有限公司 | A kind of voice driven animation method and device based on artificial intelligence |
Non-Patent Citations (3)
Title |
---|
Speech and Auditory Interfaces for Ubiquitous, Immersive and Personalized Applications; Lei Xie et al.; IEEE Xplore; full text *
Facial expression motion analysis and application based on feature flow; Jin Hui et al.; Journal of Software (Issue 12); full text *
A survey of speech-driven facial animation research; Li Xinyi et al.; Computer Engineering and Applications (Issue 22); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111145282A (en) | 2020-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111145282B (en) | Avatar composition method, apparatus, electronic device, and storage medium | |
CN111145322B (en) | Method, apparatus, and computer-readable storage medium for driving avatar | |
Busso et al. | Rigid head motion in expressive speech animation: Analysis and synthesis | |
US8224652B2 (en) | Speech and text driven HMM-based body animation synthesis | |
CN112465935A (en) | Virtual image synthesis method and device, electronic equipment and storage medium | |
CN113781610B (en) | Virtual face generation method | |
US7136818B1 (en) | System and method of providing conversational visual prosody for talking heads | |
CN113454708A (en) | Linguistic style matching agent | |
CN110610534B (en) | Automatic mouth shape animation generation method based on Actor-Critic algorithm | |
US20120130717A1 (en) | Real-time Animation for an Expressive Avatar | |
KR102509666B1 (en) | Real-time face replay based on text and audio | |
Mariooryad et al. | Generating human-like behaviors using joint, speech-driven models for conversational agents | |
US20110131041A1 (en) | Systems And Methods For Synthesis Of Motion For Animation Of Virtual Heads/Characters Via Voice Processing In Portable Devices | |
EP3915108B1 (en) | Real-time generation of speech animation | |
US11989976B2 (en) | Nonverbal information generation apparatus, nonverbal information generation model learning apparatus, methods, and programs | |
WO2023284435A1 (en) | Method and apparatus for generating animation | |
KR102373608B1 (en) | Electronic apparatus and method for digital human image formation, and program stored in computer readable medium performing the same | |
US20210005218A1 (en) | Nonverbal information generation apparatus, method, and program | |
JPWO2019160105A1 (en) | Non-verbal information generator, non-verbal information generation model learning device, method, and program | |
Filntisis et al. | Video-realistic expressive audio-visual speech synthesis for the Greek language | |
JP2015038725A (en) | Utterance animation generation device, method, and program | |
US20210370519A1 (en) | Nonverbal information generation apparatus, nonverbal information generation model learning apparatus, methods, and programs | |
Verma et al. | Animating expressive faces across languages | |
Filntisis et al. | Photorealistic adaptation and interpolation of facial expressions using HMMS and AAMS for audio-visual speech synthesis | |
Chandrasiri et al. | Internet communication using real-time facial expression analysis and synthesis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||