WO2022242381A1 - Image generation method, apparatus, device and storage medium - Google Patents

Image generation method, apparatus, device and storage medium

Info

Publication number
WO2022242381A1
WO2022242381A1 (PCT/CN2022/086972; CN2022086972W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio
audio sequence
sequence
features
facial
Application number
PCT/CN2022/086972
Other languages
English (en)
French (fr)
Inventor
吴潜溢
吴文岩
戴勃
王宇欣
高娜
钱晨
Original Assignee
上海商汤智能科技有限公司
Application filed by 上海商汤智能科技有限公司
Publication of WO2022242381A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L2021/105 Synthesis of the lips movements from speech, e.g. for talking heads

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to an image generation method, device, equipment and storage medium.
  • pronunciation face image generation is a key technology in applications such as voice-driven characters and virtual digital humans.
  • generating a pronunciation face image refers to the process of generating, from received audio data and a face image, a face image that shows the pronunciation action made while speaking.
  • the present disclosure provides an image generation method.
  • the method may include: receiving audio data and a face image; extracting text features corresponding to an audio sequence included in the audio data, where the text features represent the text content corresponding to the audio sequence; performing facial feature mapping based on the text features corresponding to the audio sequence to obtain facial features corresponding to the audio sequence, where the facial features represent the pronunciation action corresponding to the audio sequence; and generating a pronunciation face image corresponding to the audio sequence according to the facial features corresponding to the audio sequence and the face image.
  • performing facial feature mapping based on the text features corresponding to the audio sequence to obtain the facial features corresponding to the audio sequence includes: obtaining, according to the audio sequence, sound features corresponding to the audio sequence, where the sound features represent at least one of the timbre, loudness, and pitch corresponding to the audio sequence; fusing the text features and the sound features corresponding to the audio sequence to obtain fusion features corresponding to the audio sequence; and using a facial feature mapping network to perform facial feature mapping on the fusion features corresponding to the audio sequence to obtain the facial features corresponding to the audio sequence.
  • the audio data includes a plurality of continuous audio sequences; performing facial feature mapping on the fusion features corresponding to the audio sequence by using the facial feature mapping network to obtain the facial features corresponding to the audio sequence includes: taking, as input, the fusion features respectively corresponding to the audio sequence, at least one audio sequence before the audio sequence among the plurality of audio sequences, and at least one audio sequence after the audio sequence among the plurality of audio sequences, and obtaining the facial features corresponding to the audio sequence by using the facial feature mapping network.
  • the facial features include three-dimensional coordinates of a plurality of key points of a facial region; generating the pronunciation face image corresponding to the audio sequence according to the facial features corresponding to the audio sequence and the face image includes: determining a projection matrix according to the face image, where the projection matrix represents the three-dimensional-to-two-dimensional mapping relationship of the coordinates of face key points in the face image; projecting, through the projection matrix, the three-dimensional coordinates of the plurality of key points corresponding to the audio sequence into two-dimensional coordinates; obtaining an occlusion image in which the target facial region in the face image is occluded; and using a generation network to generate the pronunciation face image corresponding to the audio sequence according to the occlusion image and the two-dimensional coordinates of the plurality of key points corresponding to the audio sequence.
  • the audio data includes a plurality of continuous audio sequences; before the generation network is used to generate the pronunciation face image corresponding to the audio sequence according to the occlusion image and the two-dimensional coordinates of the plurality of key points corresponding to the audio sequence, the method further includes: smoothing the two-dimensional coordinates of the plurality of key points corresponding to the audio sequence based on the two-dimensional coordinates of the plurality of key points respectively corresponding to the audio sequence, at least one audio sequence before the audio sequence among the plurality of audio sequences, and at least one audio sequence after the audio sequence among the plurality of audio sequences.
  • the target facial region includes at least one of: mouth; jaw; nose; eyes; eyebrows; ears.
  • the audio data includes a plurality of continuous audio sequences; the method further includes: for each audio sequence in the plurality of continuous audio sequences, generating a pronunciation face image corresponding to that audio sequence; and generating a pronunciation face video corresponding to the audio data according to the pronunciation face images corresponding to the respective audio sequences in the plurality of continuous audio sequences.
  • generating the pronunciation face video corresponding to the audio data according to the generated pronunciation face images includes: acquiring a background image corresponding to the face image; fusing the background image with the pronunciation face image corresponding to each audio sequence in the plurality of continuous audio sequences to obtain a plurality of fused images; and generating the pronunciation face video corresponding to the audio data according to the plurality of fused images.
  • extracting the text features corresponding to the audio sequence included in the audio data includes: acquiring audio signal features corresponding to the audio sequence; and performing text feature extraction on the audio signal features corresponding to the audio sequence to obtain the text features corresponding to the audio sequence.
  • acquiring the audio signal features corresponding to the audio sequence includes: acquiring audio signal features corresponding to the audio data through an audio signal analysis algorithm; and extracting, from the audio signal features corresponding to the audio data, the audio signal features corresponding to the audio sequence.
  • the audio data includes a plurality of continuous audio sequences; performing text feature extraction on the audio signal features corresponding to the audio sequence to obtain the text features corresponding to the audio sequence includes: generating input features according to the audio signal features respectively corresponding to the audio sequence, at least one audio sequence before the audio sequence among the plurality of audio sequences, and at least one audio sequence after the audio sequence among the plurality of audio sequences; and using a text feature extraction network to perform text feature extraction on the input features to obtain the text features corresponding to the audio sequence.
  • the audio signal features corresponding to the audio sequence include at least one of the following: Mel cepstrum features; Mel features; linear prediction features; linear prediction cepstral features; line spectrum frequency features; wavelet transform features.
  • the present disclosure also proposes an image generation apparatus, including: a receiving and extracting module, configured to receive audio data and a face image and to extract text features corresponding to an audio sequence included in the audio data, where the text features represent the text content corresponding to the audio sequence; a facial feature mapping module, configured to perform facial feature mapping based on the text features corresponding to the audio sequence to obtain facial features corresponding to the audio sequence, where the facial features represent the pronunciation action corresponding to the audio sequence; and an image generation module, configured to generate a pronunciation face image corresponding to the audio sequence according to the facial features corresponding to the audio sequence and the face image.
  • the apparatus further includes: a video generation module, configured to generate a plurality of pronunciation face images respectively corresponding to a plurality of continuous audio sequences included in the audio data, and to generate a pronunciation face video corresponding to the audio data according to the generated pronunciation face images.
  • the present disclosure also proposes an electronic device, including: a processor; and a memory for storing processor-executable instructions, where the processor executes the executable instructions to implement the image generation method shown in any of the foregoing embodiments.
  • the present disclosure also proposes a computer-readable storage medium, the storage medium stores a computer program, and the computer program is used to cause a processor to execute the image generation method as shown in any one of the foregoing embodiments.
  • facial features representing the pronunciation action corresponding to the audio sequence can be obtained, and the pronunciation face image corresponding to the audio sequence can then be generated according to those facial features. Because a given piece of text content has a unique pronunciation action, and because text content carries semantics but not characteristics tied to the individual speaker, facial features that accurately represent the pronunciation action can be obtained from the text content corresponding to the audio sequence. This helps reduce the influence of speaker-specific characteristics, such as individual pronunciation habits, on the determination of the facial features, yields facial features that accurately characterize the pronunciation action, and therefore helps produce a pronunciation face image that accurately expresses the pronunciation action and improves the visual effect.
  • the text features representing the text content and the sound features representing at least one of timbre, loudness, and pitch can be fused to obtain fusion features, on which facial feature mapping is performed to obtain the facial features corresponding to the audio sequence, so that the sound characteristics and the text content corresponding to the audio sequence are combined to obtain more accurate facial features.
  • the facial features are represented by the three-dimensional coordinates of multiple key points selected on the contour of the target facial region, which can accurately express the pronunciation action corresponding to the audio sequence and thereby improve the accuracy of the pronunciation action expressed by the pronunciation face image.
  • FIG. 1 is a method flowchart of an image generation method shown in an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a text feature extraction process shown in an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a facial feature mapping network structure shown in an embodiment of the present disclosure
  • FIG. 4 is a schematic flow diagram of a method for generating a pronunciation face video shown in an embodiment of the present disclosure
  • FIG. 5 is a schematic structural diagram of an image generating device shown in an embodiment of the present disclosure.
  • Fig. 6 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure.
  • the present disclosure proposes an image generation method.
  • the method can obtain, according to the text features representing the text content of an audio sequence, facial features that represent the pronunciation action corresponding to the audio sequence, and can then generate a pronunciation face image corresponding to the audio sequence according to those facial features. Because a given piece of text content has a unique pronunciation action, and because text content carries semantics but not characteristics tied to the individual speaker, facial features that accurately represent the pronunciation action can be obtained from the text content corresponding to the audio sequence. This helps reduce the influence of speaker-specific characteristics on the determination of the facial features, yields facial features that accurately characterize the pronunciation action, and therefore helps produce a pronunciation face image that accurately expresses the pronunciation action and improves the visual effect.
  • the method can be applied to electronic equipment.
  • the electronic device may implement the method by carrying a software device corresponding to the image generation method.
  • the electronic device may be a notebook computer, a desktop computer, a server, a mobile phone, a tablet (PAD) terminal, or the like.
  • the present disclosure does not specifically limit the specific type of the electronic device.
  • the electronic device may be a device on the client side or on the server side.
  • the server side may be a single server, a server cluster, or a cloud provided by a server, a server cluster, or a distributed server cluster.
  • an electronic device (hereinafter referred to as device) is taken as an example for description.
  • FIG. 1 is a method flowchart of an image generation method shown in an embodiment of the present disclosure.
  • the method may include the following steps S102 to S106.
  • S102 Receive audio data and a face image, and extract text features corresponding to audio sequences included in the audio data; wherein the text features represent text content corresponding to the audio sequences.
  • S104 Perform facial feature mapping based on the text features corresponding to the audio sequence to obtain facial features corresponding to the audio sequence; wherein the facial features represent pronunciation actions corresponding to the audio sequence.
  • the user can transmit the audio data and face image to the electronic device through the client program provided by the electronic device. After receiving the audio data, the device may execute S102.
  • the audio data may include voice information.
  • the audio data may be voice audio such as speech or singing.
  • the audio data may comprise a single audio sequence or a plurality of temporally consecutive audio sequences.
  • an audio sequence can be synthesized with the face image to obtain a pronunciation face image consistent with that audio sequence.
  • an audio sequence can usually express certain text content. For example, when the audio data is "I'm going to eat", the text content expressed by the first audio sequence it includes may be the first phoneme "w" of "wo" (Chinese for "I"). The same text content has a unique pronunciation action, and the text content carries semantics without characteristics tied to the individual speaker; therefore, facial features that accurately represent the pronunciation action can be obtained from the text content corresponding to the audio sequence, and an accurate pronunciation face image can then be generated.
  • the textual features may characterize the textual content of the audio sequence.
  • the text feature may be a vector expression of text content.
  • a pre-trained first text feature extraction network (hereinafter referred to as the first network) may be used to perform feature extraction on the audio sequence to obtain text features corresponding to the audio sequence.
  • the first network may be a regression or classification network constructed based on a neural network.
  • the first network is trained according to the acquired samples until the first network converges.
  • each audio sequence can be obtained first, the text features corresponding to the text content of each audio sequence can be determined according to the correspondence rules between text features and text content, and each audio sequence can then be annotated with its text features by means such as manual labeling, thereby obtaining a number of audio sequence samples.
  • a supervised training method may be adopted, and the network parameters of the first network may be iterated multiple times by using backpropagation until the network converges and the training is completed.
  • the audio sequences included in the received audio data may be respectively input into the first network, so as to obtain text features corresponding to the audio sequences.
  • S1022 may be performed to obtain audio signal features corresponding to the audio sequence. Then execute S1024, perform text feature extraction on the audio signal feature, and obtain the text feature corresponding to the audio sequence.
  • the audio signal features may represent sound characteristics (such as at least one of pitch, loudness, timbre, etc.) and text content.
  • the audio signal features may include at least one of the following: Mel-Frequency Cepstral Coefficient (MFCC) features; Mel features; linear prediction features; linear prediction cepstral features; line spectrum frequency features; wavelet transform features.
  • the audio signal features help to accurately describe the audio signal information, thereby helping to obtain more accurate text features.
  • the audio signal analysis method includes but not limited to Fourier transform, wavelet transform and so on.
  • the present disclosure does not limit the specific type of the audio signal analysis method.
  • the audio signal features of an audio sequence can be obtained from the audio signal features corresponding to the entire audio data. Compared with determining the audio signal features of a single-frame audio sequence from that frame alone, this allows the semantics expressed by the preceding and following audio sequences to be combined, yielding more accurate audio signal features for the single-frame audio sequence.
  • audio signal features corresponding to the audio data may be acquired through an audio signal analysis algorithm. Then the audio signal features corresponding to the audio sequence may be extracted from the audio signal features corresponding to the audio data.
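  • The following is a minimal sketch of this step, under the assumption that the audio signal features are MFCCs computed with the librosa library (the disclosure does not name a particular library), that each audio sequence is 40 ms long, and that a 10 ms analysis hop is used; the function name and parameters are illustrative only.

```python
# Sketch (not from the disclosure): compute MFCC features for the whole audio,
# then slice out the frames that fall inside one 40 ms audio sequence.
import librosa


def mfcc_for_sequence(path, seq_index, seq_ms=40, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)                      # mono waveform
    hop = int(sr * 0.010)                                  # 10 ms hop -> 4 MFCC frames per 40 ms sequence
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(sr * 0.025), hop_length=hop)
    frames_per_seq = seq_ms // 10
    start = seq_index * frames_per_seq
    return mfcc[:, start:start + frames_per_seq]           # (n_mfcc, frames_per_seq)
```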
  • the second text feature extraction network (hereinafter referred to as the second network) may be used to perform feature extraction on the audio signal features corresponding to the audio sequence to obtain the text feature corresponding to the audio sequence.
  • the second network includes a neural network obtained by training on a number of audio signal feature samples annotated with text features. Because text feature extraction for the audio sequence is performed on audio signal features that represent the audio signal information, the text features can be extracted directly from the signal content related to the text, which helps reduce the influence of other information included in the audio sequence on the extracted text features and thus yields more accurate text features.
  • more accurate text features can be obtained by combining the semantics between the current audio sequence and several consecutive sequences before and after it.
  • S1 may be executed to generate input features according to the audio signal features respectively corresponding to the audio sequence, at least one audio sequence before the audio sequence among the multiple audio sequences, and at least one audio sequence after the audio sequence among the multiple audio sequences.
  • S2. Using a text feature extraction network to perform text feature extraction on the input features to obtain text features corresponding to the audio sequence.
  • the text feature extraction network includes: a neural network obtained by training through several training samples marked with text features.
  • FIG. 2 is a schematic diagram of a text feature extraction process shown in an embodiment of the present disclosure.
  • operations such as feature splicing and weighted summation can be performed on the audio signal features corresponding to the audio sequence, its preceding m consecutive audio sequences, and its following n consecutive audio sequences to obtain the input features, where m and n are preset positive integers. Because the input features include not only the audio signal features of the audio sequence itself but also the semantic information between the audio sequence and its adjacent audio sequences, more accurate text features can be obtained.
  • the text feature extraction network may be a regression or classification network constructed based on a neural network.
  • the audio signal features of a plurality of continuous audio sequences can be obtained first; then, for any three consecutive audio sequences, taking the middle audio sequence as the reference, the differences between the audio signal features of the preceding and following audio sequences and those of the middle audio sequence are computed, and the resulting difference values are spliced with the audio signal features of the middle audio sequence to obtain the input features of the middle audio sequence.
  • various input features can be marked by methods such as manual labeling, and several training samples can be obtained.
  • supervised training may be adopted, and the text feature extraction network may be iterated multiple times by using backpropagation until the network converges and the training is completed.
  • the input features can be obtained by using the method of constructing input features used in training the network according to the audio signal features corresponding to the current audio sequence and the two preceding and following audio sequences. Then the input feature can be input into the text feature extraction network to obtain the text feature corresponding to the current audio sequence. In this way, the semantics between the current audio sequence and its preceding and subsequent audio sequences can be used to obtain more accurate text features of the audio sequence.
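  • A minimal sketch of this input-feature construction (for the case of one preceding and one following audio sequence, i.e. m = n = 1) is given below; the function name and the use of NumPy are illustrative assumptions, not part of the disclosure.

```python
# Difference features of the neighbouring sequences relative to the middle
# sequence are concatenated with the middle sequence's own features, as
# described above, and the result is fed to the text feature extraction network.
import numpy as np


def build_input_feature(prev_feat, cur_feat, next_feat):
    # prev_feat, cur_feat, next_feat: 1-D audio-signal feature vectors
    # (e.g. flattened MFCC blocks of the three consecutive audio sequences)
    diff_prev = prev_feat - cur_feat
    diff_next = next_feat - cur_feat
    return np.concatenate([diff_prev, cur_feat, diff_next])
```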
  • the device may execute S104.
  • the facial features in this step can represent the pronunciation action corresponding to the audio sequence.
  • at least two facial features can be used to characterize the pronunciation action.
  • the facial features may include texture features of the target facial region, and the pronunciation action can be represented by those texture features; alternatively, the facial features may include contour features of the target facial region, which can also represent the pronunciation action. In the following, the description takes facial features that include contour features of the target facial region as an example.
  • the target facial area refers to any area that can express pronunciation actions.
  • the target facial area can be selected according to business requirements.
  • the target facial region includes at least one of: mouth; jaw; nose; eyes; eyebrows; ears. Therefore, according to actual requirements, at least one region such as the mouth, jaw, or eyebrows can be flexibly selected to express the pronunciation movement made while speaking, so that the pronunciation movement is expressed more accurately and the accuracy of the pronunciation movement expressed by the pronunciation face image is improved.
  • the facial features may include three-dimensional coordinates of multiple key points selected for the target facial area.
  • the facial features are represented by the three-dimensional coordinates of multiple key points selected on the outline of the target facial area, and the pronunciation action corresponding to the audio sequence can be accurately expressed, thereby improving the accuracy of the pronunciation action expressed by the pronunciation face image.
  • a facial feature mapping network (hereinafter referred to as the third network) may be used to perform facial feature mapping on the text features corresponding to the audio sequence to obtain the facial feature corresponding to the audio sequence .
  • the facial features include three-dimensional coordinates of multiple key points selected for the target facial area.
  • the third network includes: a neural network obtained by training through several text feature samples marked with facial features.
  • the third network may be a regression network constructed based on a neural network.
  • text features corresponding to several audio sequences may be obtained first, and facial features corresponding to each audio sequence may be determined. Then use methods such as manual labeling to mark the text features, and obtain several text feature samples. Afterwards, a supervised training method may be adopted, and the network parameters of the third network may be iterated multiple times by using backpropagation until the network converges and the training is completed.
  • the text features corresponding to the audio sequence can be input into the third network to obtain the facial features corresponding to the audio sequence.
  • the text features representing the text content and the sound features representing at least one of timbre, loudness, and pitch can be fused to obtain fusion features, on which facial feature mapping is performed to obtain the facial features corresponding to the audio sequence.
  • S1042 may be executed to obtain the sound feature corresponding to the audio sequence according to the audio sequence; the sound feature represents at least one of timbre, loudness, and pitch of the corresponding audio sequence. Then execute S1044 to fuse the text features and sound features corresponding to the audio sequence to obtain the fusion feature corresponding to the audio sequence.
  • afterwards, a facial feature mapping network (hereinafter referred to as the fourth network) is used to perform facial feature mapping on the fusion features corresponding to the audio sequence to obtain the facial features corresponding to the audio sequence.
  • the fourth network may include: a neural network obtained by training several fused feature samples marked with facial features.
  • when performing S1042, the sound features may be obtained from the audio signal features corresponding to the audio sequence; the feature dimensions related to the sound characteristics may be taken from the audio signal features to obtain the sound features. Take the audio signal features being Mel cepstrum features (hereinafter referred to as MFCC) as an example: among the multi-dimensional MFCC features, the first-dimension features represent characteristics related to the sound, so the first-dimension features of the MFCC can be used as the sound features.
  • the text features and the sound features can be fused by means such as feature splicing or feature superposition, so that fusion features representing both the text content and the sound characteristics are obtained; in this way, when determining the facial features, the text content and the sound characteristics of the audio sequence can be taken into account at the same time, and facial features that express the pronunciation action more accurately can be determined.
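  • A minimal sketch of this fusion step is shown below, assuming (as in the MFCC example above) that the sound feature is the first MFCC dimension and that fusion is done by plain concatenation; the function name is hypothetical.

```python
# Fuse the text features of an audio sequence with its sound features.
import numpy as np


def fuse_features(text_feat, mfcc):
    # text_feat: 1-D text feature vector for the audio sequence
    # mfcc: (n_mfcc, n_frames) MFCC block for the same sequence
    sound_feat = mfcc[0, :]                      # first-dimension coefficients as the sound feature
    return np.concatenate([text_feat, sound_feat])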
  • the fourth network may be a regression network constructed based on a neural network.
  • fusion features corresponding to several audio sequences may be obtained first, and facial features corresponding to each audio sequence may be determined. Then use methods such as manual labeling to mark the fusion features, and obtain several fusion feature samples. Afterwards, a supervised training method may be adopted, and the network parameters of the fourth network may be iterated for several times by using backpropagation until the network converges and the training is completed.
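  • A hedged sketch of such a supervised training loop is given below, written with PyTorch (an assumed framework; the disclosure does not name one). It regresses facial features from fusion-feature samples with an MSE loss and backpropagation, as described above; the data loader, optimizer, and hyperparameters are illustrative.

```python
# Supervised training sketch for a regression-style mapping network.
import torch
import torch.nn as nn


def train_mapper(model, loader, epochs=10, lr=1e-3):
    # loader yields (fused_seq, target) pairs:
    #   fused_seq: (batch, n_steps, fusion_dim), target: (batch, K, 3)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for fused_seq, target in loader:
            pred = model(fused_seq)          # predicted 3D key point coordinates
            loss = loss_fn(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
```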
  • the fusion features can be obtained according to the sound features and text features corresponding to the audio sequence, and then input into the fourth network to obtain facial features.
  • more accurate facial features can be obtained by combining the audio sequence and the associated information between several consecutive audio sequences before and after it.
  • S3 may be executed: the fusion features respectively corresponding to the audio sequence, at least one audio sequence before the audio sequence among the multiple audio sequences, and at least one audio sequence after the audio sequence among the multiple audio sequences are taken as input, and the facial features corresponding to the audio sequence are obtained by using the facial feature mapping network.
  • the facial feature mapping network can be constructed based on the long short-term memory network.
  • a long short-term memory (LSTM) network can retain the fusion feature information of multiple audio sequences in time order, so the current audio sequence and the associated information between it and several consecutive sequences before and after it can be combined to obtain more accurate facial features.
  • FIG. 3 is a schematic diagram of a facial feature mapping network structure according to an embodiment of the present disclosure.
  • the facial feature mapping network shown in FIG. 3 may include an input layer, an LSTM layer, a fully connected layer and an output layer.
  • the input layer includes N nodes 31 (311, 312).
  • the N nodes respectively correspond to N LSTM processing units 32 ( 321 , 322 . . . ; hereinafter referred to as processing units) of the LSTM layer.
  • the N is a positive integer set according to service requirements, and the N is usually the same as the number of input audio sequences.
  • the N nodes are used to input fusion features corresponding to the audio sequence to the corresponding processing unit.
  • the LSTM processing unit may include a forget gate, an input gate and an output gate.
  • the output gate can divide the processing result of the current processing unit into two parts, one part is used as the output result of the current processing unit; the other part can be used as the input of the next processing unit.
  • the forget gate can filter out beneficial information in the output result of the previous processing unit.
  • the input gate can filter useful information from the input information of the node corresponding to the current processing unit.
  • the processing unit can process the input of the current node and the output of the previous processing unit through the three gates to obtain a processing result.
  • the fully connected layer 33 can fully connect the output results of the LSTM processing units to obtain the output result corresponding to the current audio sequence.
  • the fusion features corresponding to each audio sequence in the sequence set may be sequentially input into the N nodes included in the input layer according to the time sequence.
  • the facial features corresponding to the audio sequence can be obtained.
  • the facial features of the current audio sequence can be obtained according to the output features output by each processing unit, so that more accurate facial features can be obtained by further combining the associated information between the audio sequences in the sequence set.
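  • A PyTorch sketch of a network with the structure of Fig. 3 is shown below: the fusion features of N consecutive audio sequences pass through an LSTM layer, the per-step outputs are fully connected, and the network regresses the 3D coordinates of K key points for the current audio sequence. The layer sizes, class name, and use of PyTorch are illustrative assumptions, not the disclosure's own implementation.

```python
import torch
import torch.nn as nn


class FacialFeatureMapper(nn.Module):
    def __init__(self, fusion_dim=256, hidden_dim=128, n_steps=5, n_keypoints=20):
        super().__init__()
        self.n_keypoints = n_keypoints
        self.lstm = nn.LSTM(fusion_dim, hidden_dim, batch_first=True)   # LSTM layer over N steps
        self.fc = nn.Linear(hidden_dim * n_steps, n_keypoints * 3)      # fully connect all step outputs

    def forward(self, fused_seq):                 # (batch, n_steps, fusion_dim)
        out, _ = self.lstm(fused_seq)             # (batch, n_steps, hidden_dim)
        out = out.flatten(start_dim=1)            # concatenate the per-step outputs
        return self.fc(out).view(-1, self.n_keypoints, 3)   # 3D key point coordinates


# Example usage with random fusion features for 5 consecutive audio sequences.
mapper = FacialFeatureMapper()
coords = mapper(torch.randn(1, 5, 256))           # (1, 20, 3)
```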
  • the device may execute S106.
  • S21 may be executed to determine a projection matrix according to the received face image. Then S22 is executed to project the three-dimensional coordinates of the multiple key points corresponding to the audio sequence into two-dimensional coordinates through the projection matrix. Afterwards, S23 is executed to obtain an occlusion image in which the target facial region in the face image is occluded. Finally, S24 is executed, using a generation network to generate the pronunciation face image corresponding to the audio sequence according to the occlusion image and the two-dimensional coordinates of the multiple key points corresponding to the audio sequence. The generation network is a neural network obtained through adversarial training.
  • the projection matrix can represent the three-dimensional-to-two-dimensional mapping relationship of the coordinates of multiple key points of the face in the face image. There is a definite mapping relationship between corresponding points in the three-dimensional and two-dimensional coordinate systems; this mapping relationship can be represented by a projection matrix, and the three-dimensional coordinates can be mapped to two-dimensional coordinates by applying the projection matrix.
  • the received face image may include a face.
  • the human face can be a profile or a frontal human face.
  • an utterance face image expressing an utterance action may be generated based on the face image.
  • the multiple key points may be used to characterize facial contour information of the target facial region.
  • the plurality of key points may be feature points on facial contours.
  • the multiple key points may be feature points on the contours of the mouth and the jaw.
  • the received face image may first be input into a pre-trained three-dimensional face shape model to obtain a projection matrix corresponding to the face image.
  • the three-dimensional face shape model is used to generate a three-dimensional model according to two-dimensional images.
  • the projection matrix generated in the mapping process may be used as the projection matrix corresponding to the face image.
  • the projection matrix and the three-dimensional coordinate matrices of multiple key points corresponding to the audio sequence can be used to perform matrix operations to obtain the two-dimensional coordinate matrix of multiple key points corresponding to the current audio sequence .
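  • A sketch of this projection step follows, assuming the projection matrix produced by the 3D face model is a 3x4 matrix applied to homogeneous key point coordinates (the disclosure only says a matrix operation is performed, so this particular form is an assumption).

```python
# Project 3D key points into the 2D image plane with a projection matrix.
import numpy as np


def project_keypoints(P, pts3d):
    # P: (3, 4) projection matrix; pts3d: (K, 3) 3D key point coordinates
    homog = np.hstack([pts3d, np.ones((pts3d.shape[0], 1))])   # (K, 4) homogeneous coordinates
    proj = homog @ P.T                                         # (K, 3)
    return proj[:, :2] / proj[:, 2:3]                          # (K, 2) 2D coordinates
```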
  • when performing S23, the occlusion processing may be done manually, or a mask network constructed based on a neural network such as Faster R-CNN (Faster Region-based Convolutional Neural Network) or Mask R-CNN (Mask Region-based Convolutional Neural Network) may be used to perform occlusion processing on the face image, so as to obtain a face image in which the target facial region is occluded.
  • the generating network in S24 may be a regression network constructed based on a neural network.
  • the generation network can generate, according to the two-dimensional coordinates of the multiple key points representing the contour of the preset region, a partial image corresponding to the preset region by means of pixel filling, and can then fill the partial image into the occluded region of the face image by means such as image warping, so as to obtain a complete pronunciation face image.
  • the generation network may be trained in an adversarial training manner.
  • a classification network and a generation network can be constructed first.
  • the classification network is trained with a number of image samples labeled as real images or fake images, so as to obtain a classification network that can classify images relatively accurately.
  • a number of occlusion images and the two-dimensional coordinates of multiple key points representing the target facial region can then be obtained, and the parameters of the generation network are adjusted until the images obtained after the generation network completes the occlusion images according to the key point coordinates are judged as real images by the trained classification network. At this point, the adversarial training is completed.
  • the two-dimensional coordinates of multiple key points corresponding to the occlusion image and the audio sequence may be input into the generation network to obtain the speaking face image.
  • the contour of the target facial region can be accurately represented by the coordinates of the multiple key points, so that the accuracy of the pronunciation action expressed by the pronunciation face image can be improved.
  • before the pronunciation face image is generated, the two-dimensional coordinates of the multiple key points corresponding to the audio sequence may be smoothed based on the two-dimensional coordinates of the multiple key points corresponding to the audio sequence and to the audio sequences before and after it. The coordinates can be corrected by methods such as excluding abnormal data and interpolation, so that the pronunciation actions between audio sequences appear natural and the coherence of the pronunciation actions reflected in the pronunciation face video generated from the pronunciation face images corresponding to the audio sequences is improved.
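  • An illustrative smoothing sketch is shown below: a centred moving average of each key point's 2D coordinates over the current audio sequence and its neighbours. The disclosure also mentions abnormal-data exclusion and interpolation, which are omitted here; the window size and function name are assumptions.

```python
# Temporal smoothing of 2D key points across consecutive audio sequences.
import numpy as np


def smooth_keypoints(kp_seq, r=1):
    # kp_seq: (T, K, 2) 2D key points for T consecutive audio sequences
    smoothed = np.empty_like(kp_seq, dtype=float)
    for t in range(kp_seq.shape[0]):
        lo, hi = max(0, t - r), min(kp_seq.shape[0], t + r + 1)
        smoothed[t] = kp_seq[lo:hi].mean(axis=0)   # average over the window [t-r, t+r]
    return smoothed
```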
  • the received audio data may include consecutive multiple audio sequences.
  • the pronunciation face images respectively corresponding to the multiple continuous audio sequences included in the audio data may be generated.
  • a pronunciation face video corresponding to the audio data can then be generated.
  • the pronunciation face video may include multiple frames of pronunciation face images arranged in time order.
  • the audio data can be divided into multiple audio sequences whose playback duration equals the playback duration of a single video frame, so that a pronunciation face image can be determined for each audio sequence included in the audio data, and the images can then be sorted in time order to obtain a video whose playing time is the same as that of the audio data.
  • for example, the audio data is 5 s long and the frame rate of the video is 25 fps, that is, the playing duration of a single frame image is 40 milliseconds. The audio data may then be divided into 125 audio sequences each with a playback duration of 40 milliseconds. After the 125 pronunciation face images corresponding to these audio sequences are obtained, the face images can be sorted according to the time order of the audio to obtain the pronunciation face video.
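  • The alignment arithmetic from this example can be written out as a small check (the helper name is hypothetical): one 40 ms audio sequence per video frame at 25 fps, so a 5 s clip yields 125 sequences and 125 image frames.

```python
# Number of audio sequences (and video frames) for a clip at a given frame rate.
def num_sequences(audio_seconds, fps=25):
    frame_ms = 1000 / fps                      # 40 ms per frame at 25 fps
    return int(audio_seconds * 1000 / frame_ms)


assert num_sequences(5) == 125
```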
  • a more realistic pronunciation face video can be generated by fusing the video frames with a background image.
  • the background image may be a background image corresponding to a human face image.
  • the background image is an image related to the speaking environment.
  • the background image may be a background such as a lecture hall.
  • the background image may be a stage background or the like.
  • a background image corresponding to the face image may be acquired first. The background image is then fused with the pronunciation face image corresponding to each audio sequence in the plurality of continuous audio sequences to obtain a plurality of fused images, and the pronunciation face video corresponding to the audio data is generated according to the plurality of fused images.
  • the background image can be fused with each pronunciation face image through image fusion techniques to obtain the fused images, and the fused images can then be used as video frames and arranged according to the time order of the audio sequences to obtain a pronunciation face video with the background merged in, which better matches the real scene.
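  • The sketch below shows one way to assemble the final video with OpenCV (an assumed library): each pronunciation face image is fused with the background (a simple weighted blend here, since the disclosure does not fix a particular fusion technique) and written as a 25 fps video frame in audio-sequence order. The function name and blend weights are illustrative.

```python
# Fuse face images with the background and write them as video frames.
import cv2


def write_pronunciation_video(face_images, background, out_path, fps=25):
    h, w = background.shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for face in face_images:                          # already in audio time order
        face = cv2.resize(face, (w, h))
        fused = cv2.addWeighted(face, 0.8, background, 0.2, 0)   # simple blend with the background
        writer.write(fused)
    writer.release()
```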
  • the virtual character is used for news broadcasting.
  • the virtual character may be a public figure.
  • the pronunciation video generation method described in the present disclosure can be applied to the cloud.
  • the cloud can provide an interface for the user to upload the news audio to be played (hereinafter referred to as audio) and the character image including the virtual character.
  • the frame rate of the pronunciation video is 25fps.
  • the cloud can deploy a pre-trained text feature extraction network for extracting text features from audio sequences, a pre-trained three-dimensional key point mapping network for mapping text features to the three-dimensional coordinates of multiple key points, and an image completion network for completing occluded images based on the predicted key point coordinates.
  • FIG. 4 is a schematic flowchart of a method for generating a pronunciation face video according to an embodiment of the present disclosure.
  • after the cloud receives the news audio and the character image, it can execute S41 to obtain the MFCC corresponding to the audio and divide the MFCC so as to obtain the MFCC respectively corresponding to each audio sequence (duration: 40 ms).
  • S42 may be executed to extract, for each audio sequence, the text features corresponding to that audio sequence by using the text feature extraction network. Because the MFCC accurately expresses the audio signal, accurate text features can be obtained.
  • S43 may be executed to splice, for each audio sequence, the sound features representing the sound characteristics in its MFCC with its text features, and to use the three-dimensional key point mapping network on the spliced features to obtain the three-dimensional coordinates of multiple key points of the virtual character's mouth and jaw (the target facial region). In this way, facial features that accurately represent the pronunciation actions of the audio sequence can be obtained.
  • S44 may be executed to obtain, from the received face image, a projection matrix representing the mapping relationship between three-dimensional and two-dimensional coordinates, to map the three-dimensional coordinates of the multiple key points into two-dimensional coordinates using the projection matrix, and to smooth the multiple key point coordinates corresponding to each audio sequence.
  • S45 is then executed to generate, from the face image, an occlusion image in which the mouth and jaw of the virtual character are occluded, and to use the image completion network to complete the occlusion image according to the two-dimensional coordinates of the multiple key points corresponding to each audio sequence, so as to obtain the complete virtual character pronunciation face image corresponding to each audio sequence.
  • S46 may be executed to obtain the background image of the news broadcast, merge the background image into each pronunciation face image, and then use the pronunciation face images as video frames to produce a virtual character pronunciation face video according to the timing of the corresponding audio sequences.
  • the cloud can return the generated pronunciation face video and show it to the user.
  • text features that express only the textual content of the audio, independent of the personal characteristics of the person in the recording, are obtained; these text features are spliced with sound features representing the voice characteristics of the recorded speaker and mapped to the mouth and jaw contours; the face image is then completed according to the mouth and jaw contours and a video is generated. In this way both the text content of the audio and the sound characteristics are taken into account, so a pronunciation face video that accurately expresses the pronunciation movements of the virtual character is obtained and the visual effect of the pronunciation face video is improved.
  • the present disclosure proposes an image generating device.
  • FIG. 5 is a schematic structural diagram of an image generating device according to an embodiment of the present disclosure.
  • the device 50 may include:
  • a receiving and extracting module 51, configured to receive audio data and a face image and to extract text features corresponding to an audio sequence included in the audio data, where the text features represent the text content of the corresponding audio sequence;
  • a facial feature mapping module 52, configured to perform facial feature mapping based on the text features corresponding to the audio sequence to obtain facial features corresponding to the audio sequence, where the facial features represent the pronunciation action corresponding to the audio sequence;
  • an image generation module 53, configured to generate a pronunciation face image corresponding to the audio sequence according to the facial features corresponding to the audio sequence and the face image.
  • the facial feature mapping module 52 is used to:
  • obtain, according to the audio sequence, sound features corresponding to the audio sequence, where the sound features represent at least one of the timbre, loudness, and pitch of the corresponding audio sequence, and fuse the text features and the sound features corresponding to the audio sequence to obtain fusion features corresponding to the audio sequence;
  • use a facial feature mapping network to perform facial feature mapping on the fusion features corresponding to the audio sequence to obtain the facial features corresponding to the audio sequence.
  • the audio data includes a plurality of continuous audio sequences; the facial feature mapping module 52 is used for:
  • take, as input, the fusion features respectively corresponding to the audio sequence, at least one audio sequence before the audio sequence among the plurality of audio sequences, and at least one audio sequence after the audio sequence among the plurality of audio sequences, and use the facial feature mapping network to obtain the facial features corresponding to the audio sequence.
  • the facial features include three-dimensional coordinates of a plurality of key points of the facial region
  • the image generating module 53 is used for:
  • determine a projection matrix according to the face image, where the projection matrix represents the three-dimensional-to-two-dimensional mapping relationship of the coordinates of the face key points in the face image; project, through the projection matrix, the three-dimensional coordinates of the plurality of key points corresponding to the audio sequence into two-dimensional coordinates; obtain an occlusion image in which the target facial region in the face image is occluded;
  • use a generation network to generate the pronunciation face image corresponding to the audio sequence according to the occlusion image and the two-dimensional coordinates of the plurality of key points corresponding to the audio sequence.
  • the audio data includes a plurality of continuous audio sequences; the device 50 also includes:
  • a smoothing processing module, configured to smooth the two-dimensional coordinates of the multiple key points corresponding to the audio sequence based on the two-dimensional coordinates of the multiple key points respectively corresponding to the audio sequence, at least one audio sequence before the audio sequence among the plurality of audio sequences, and at least one audio sequence after the audio sequence among the plurality of audio sequences.
  • the target facial area includes at least one of the following:
  • the audio data includes a plurality of continuous audio sequences; the device 50 also includes:
  • a video generation module 54, configured to generate, for each audio sequence in the plurality of continuous audio sequences, a pronunciation face image corresponding to that audio sequence;
  • and to generate, according to the pronunciation face image corresponding to each audio sequence in the plurality of continuous audio sequences, a pronunciation face video corresponding to the audio data.
  • the video generation module 54 is used for:
  • the receiving and extracting module 51 is used for:
  • Text feature extraction is performed on audio signal features corresponding to the audio sequence to obtain text features corresponding to the audio sequence.
  • the receiving and extracting module 51 is used for:
  • the audio signal feature corresponding to the audio sequence is extracted from the audio signal feature corresponding to the audio data.
  • the audio data includes a plurality of continuous audio sequences; the receiving and extracting module 51 is used for:
  • the audio signal features respectively corresponding to the audio sequence and at least one audio sequence before the audio sequence among the plurality of audio sequences and at least one audio sequence after the audio sequence among the plurality of audio sequences, generate input features
  • a text feature extraction network is used to extract text features from the input features to obtain text features corresponding to the audio sequence.
  • the audio signal characteristics corresponding to the audio sequence include at least one of the following:
  • Mel cepstrum features; Mel features; linear prediction features; linear prediction cepstral features; line spectrum frequency features; wavelet transform features.
  • the image generating apparatus shown in the embodiments of the present disclosure can be applied to electronic equipment. Accordingly, the present disclosure provides an electronic device, which may include: a processor; and a memory for storing instructions executable by the processor. Wherein, the processor is configured to invoke the executable instructions stored in the memory to implement the image generation method shown in any one of the foregoing embodiments.
  • FIG. 6 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure.
  • the electronic device may include a processor for executing instructions, a network interface for connecting to a network, a memory for storing operating data for the processor, and a non-volatile memory for storing instructions corresponding to the image generation apparatus.
  • the embodiment of the device may be implemented by software, or by hardware or a combination of software and hardware.
  • taking software implementation as an example, the apparatus, as a logical device, is formed by the processor of the electronic device where it is located reading the corresponding computer program instructions from the non-volatile memory into memory for execution.
  • in addition to the components shown, the electronic device in which the apparatus of the embodiment is located may also include other hardware according to the actual functions of the electronic device, which will not be described in detail here.
  • the corresponding instructions of the image generation device may also be directly stored in the memory, which is not limited herein.
  • the present disclosure proposes a computer-readable storage medium, the storage medium stores a computer program, and the computer program can be used to cause a processor to execute the image generation method as shown in any one of the foregoing embodiments.
  • one or more embodiments of the present disclosure may be provided as a method, system or computer program product. Accordingly, one or more embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) having computer-usable program code embodied therein.
  • Embodiments of the subject matter and functional operations described in this disclosure can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this disclosure and their structural equivalents, or in a combination of one or more of them.
  • Embodiments of the subject matter described in this disclosure can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus.
  • alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode and transmit information to a suitable receiver device for execution by the data processing apparatus.
  • a computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the processes and logic flows described in this disclosure can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).
  • FPGA Field Programmable Gate Array
  • ASIC Application Specific Integrated Circuit
  • Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing system.
  • a central processing system will receive instructions and data from read only memory and/or random access memory.
  • the basic components of a computer include a central processing system for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • Generally, a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, to receive data from them, transfer data to them, or both.
  • a computer is not required to have such a device.
  • Furthermore, a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and memory can be supplemented by, or incorporated in, special purpose logic circuitry.

Abstract

本公开提出一种图像生成方法、装置、设备以及存储介质。其中,所述方法可以包括:接收音频数据和人脸图像;提取所述音频数据包括的音频序列对应的文本特征。所述文本特征表征所述音频序列的文本内容。基于所述音频序列对应的文本特征,进行面部特征映射,得到与所述音频序列对应的面部特征。所述面部特征表征所述音频序列对应的发音动作。根据所述音频序列对应的面部特征以及人脸图像,生成与所述音频序列对应的发音人脸图像。

Description

图像生成方法、装置、设备以及存储介质
相关申请的交叉引用
本公开要求于2021年05月21日提交的、申请号为202110560359.4的中国专利申请的优先权,该申请以引用的方式并入本文中。
技术领域
本公开涉及计算机技术领域,具体涉及一种图像生成方法、装置、设备以及存储介质。
背景技术
发音人脸图像的生成是语音驱动人物、虚拟数字人等应用中非常关键的一项技术。
生成发音人脸图像是指根据接收的音频数据和人脸图像,生成说话时呈现发音动作的发音人脸图像的过程。
如果发音人脸图像中体现的发音动作不准确,可能会影响观感效果。
发明内容
有鉴于此,本公开提供一种图像生成方法。该方法可以包括:接收音频数据和人脸图像;提取所述音频数据包括的音频序列对应的文本特征;其中,所述文本特征表征所述音频序列对应的文本内容;基于所述音频序列对应的文本特征,进行面部特征映射,得到与所述音频序列对应的面部特征;其中,所述面部特征表征所述音频序列对应的发音动作;根据所述音频序列对应的面部特征以及所述人脸图像,生成与所述音频序列对应的发音人脸图像。
在一些实施例中,所述基于所述音频序列对应的文本特征,进行面部特征映射,得到与所述音频序列对应的面部特征,包括:根据所述音频序列,得到所述音频序列对应的声音特征;其中,所述声音特征表征所述音频序列对应的音色、响度、音调中的至少一种特征;将所述音频序列对应的文本特征和声音特征进行融合,得到所述音频序列对应的融合特征;利用面部特征映射网络,对所述音频序列对应的融合特征进行面部特征映射,得到与所述音频序列对应的面部特征。
在一些实施例中,所述音频数据包括连续的多个音频序列;所述利用面部特征映射网络,对所述音频序列对应的融合特征进行面部特征映射,得到与所述音频序列对应的面部特征,包括:将所述音频序列,以及所述多个音频序列中在所述音频序列之前的至少一个音频序列和所述多个音频序列中在所述音频序列之后的至少一个音频序列分别对应的融合特征作为输入,利用所述面部特征映射网络,得到所述音频序列对应的面部特征。
在一些实施例中,所述面部特征包括面部区域的多个关键点的三维坐标;所述根据所述音频序列对应的面部特征以及所述人脸图像,生成与所述音频序列对应的发音人脸图像,包括:根据所述人脸图像确定投影矩阵;其中,所述投影矩阵表征所述人脸图像中的人脸关键点的坐标从三维到二维的映射关系;通过所述投影矩阵,将所述音频序列对应的多个关键点的三维坐标投影为二维坐标;获取将所述人脸图像中目标面部区域 遮挡之后的遮挡图像;利用生成网络,根据所述遮挡图像与所述音频序列对应的多个关键点的二维坐标,生成所述音频序列对应的发音人脸图像。
在一些实施例中,所述音频数据包括连续的多个音频序列;在利用生成网络,根据所述遮挡图像与所述音频序列对应的多个关键点的二维坐标,生成所述音频序列对应的发音人脸图像之前,所述方法还包括:基于所述音频序列以及所述多个音频序列中在所述音频序列之前的至少一个音频序列和所述多个音频序列中在所述音频序列之后的至少一个音频序列分别对应的多个关键点的二维坐标,对所述音频序列对应的多个关键点的二维坐标进行平滑处理。
在一些实施例中,所述目标面部区域包括以下中的至少一项:嘴部;下颚;鼻子;眼睛;眉毛;耳朵。
在一些实施例中,所述音频数据包括多个连续音频序列;所述方法还包括:针对所述连续的多个音频序列中的每个音频序列,生成与该音频序列对应的发音人脸图像;根据所述连续的多个音频序列中的每个音频序列对应的发音人脸图像,生成与所述音频数据对应的发音人脸视频。
在一些实施例中,所述根据生成的各发音人脸图像,生成与所述音频数据对应的发音人脸视频,包括:获取与所述人脸图像对应的背景图像;将所述背景图像与所述连续的多个音频序列中的每个音频序列对应的发音人脸图像融合以得到多个融合图像;根据所述多个融合图像生成与所述音频数据对应的发音人脸视频。
在一些实施例中,所述提取所述音频数据包括的音频序列对应的文本特征,包括:获取所述音频序列对应的音频信号特征;对所述音频序列对应的音频信号特征进行文本特征提取,得到所述音频序列对应的文本特征。
在一些实施例中,所述获取所述音频序列对应的音频信号特征,包括:通过音频信号分析算法获取所述音频数据对应的音频信号特征;从所述音频数据对应的音频信号特征中截取出与所述音频序列对应的音频信号特征。
在一些实施例中,所述音频数据包括连续的多个音频序列;所述对所述音频序列对应的音频信号特征进行文本特征提取,得到所述音频序列对应的文本特征,包括:根据所述音频序列以及所述多个音频序列中在所述音频序列之前的至少一个音频序列和所述多个音频序列中在所述音频序列之后的至少一个音频序列分别对应的音频信号特征,生成输入特征;利用文本特征提取网络,对所述输入特征进行文本特征提取,得到与所述音频序列对应的文本特征。
在一些实施例中,所述音频序列对应的音频信号特征,包括以下中的至少一项:梅尔倒谱特征;梅尔特征;线性预测特征;线性预测倒谱特征;线谱频率特征;小波变换特征。
本公开还提出一种图像生成装置,包括:接收与提取模块,用于接收音频数据和人脸图像,并且提取所述音频数据包括的音频序列对应的文本特征;其中,所述文本特征表征所述音频序列对应的文本内容;面部特征映射模块,用于基于所述音频序列对应的文本特征,进行面部特征映射,得到与所述音频序列对应的面部特征;其中,所述面部特征表征所述音频序列对应的发音动作;图像生成模块,根据所述音频序列对应的面部特征以及所述人脸图像,生成与所述音频序列对应的发音人脸图像。
在一些实施例中,所述装置还包括:视频生成模块,用于生成与所述音频数据所包括的连续的多个音频序列对应的多个发音人脸图像;根据所述多个发音人脸图像,生成与所述音频数据对应的发音人脸视频。
本公开还提出一种电子设备,包括:处理器;以及用于存储处理器可执行指令的存储器;其中,所述处理器通过运行所述可执行指令以实现如前述任一实施例示出的图像生成方法。
本公开还提出一种计算机可读存储介质,所述存储介质存储有计算机程序,所述计算机程序用于使处理器执行如前述任一实施例示出的图像生成方法。
在所述方案中,第一,可以根据表征音频序列的文本内容的文本特征,得到表征音频序列对应发音动作的面部特征,然后再根据所述面部特征生成与音频序列对应的发音人脸图像。由于同一文本内容具有唯一的发音动作,并且文本内容可以包含语义,且不包含与发声人员个人有关的特性,因此根据音频序列对应的文本内容可以获取准确的表征发音动作的面部特征,可以有助于减少由于诸如发音等与说话人员个人有关的特性带来的对确定面部特征的影响,获取准确的表征发音动作的面部特征,从而有助于获得准确表达发音动作的发音人脸图像,提升观感效果。
第二,可以将表征文本内容的文本特征和表征音色、响度、音调中至少一种特征的声音特征融合得到融合特征,并进行面部特征映射,得到与所述音频序列对应的面部特征,从而可以结合音频序列对应的声音特性与文本内容,得到更准确的面部特征。
第三,通过在目标面部区域轮廓上选取的多个关键点的三维坐标来表征面部特征,可以准确的表达出音频序列对应的发音动作,从而可以提升发音人脸图像表达的发音动作的准确性。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,并不能限制本公开。
附图说明
为了更清楚地说明本公开一个或多个实施例或相关技术中的技术方案,下面将对实施例或相关技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本公开一个或多个实施例中记载的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1为本公开实施例示出的一种图像生成方法的方法流程图;
图2为本公开实施例示出的一种文本特征提取流程示意图;
图3为本公开实施例示出的一种面部特征映射网络结构示意图;
图4为本公开实施例示出的一种发音人脸视频生成方法流程示意图;
图5为本公开实施例示出的一种图像生成装置的结构示意图;
图6为本公开实施例示出的一种电子设备的硬件结构示意图。
具体实施方式
下面将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的设备和方法的例子。
在本公开使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本公开。在本公开和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在 包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。还应当理解,本文中所使用的词语“如果”,取决于语境,可以被解释成为“在……时”或“当……时”或“响应于确定”。
有鉴于此,本公开提出一种图像生成方法。该方法可以根据表征音频序列的文本内容的文本特征,得到表征音频序列对应发音动作的面部特征,然后再根据所述面部特征生成与音频序列对应的发音人脸图像。由于同一文本内容具有唯一的发音动作,并且文本内容可以包含语义,且不包含与发声人员个人有关的特性,因此根据音频序列对应的文本内容可以获取准确的表征发音动作的面部特征,可以有助于减少由于诸如发音等与说话人员个人有关的特性带来的对确定面部特征的影响,获取准确的表征发音动作的面部特征,从而有助于获得准确表达发音动作的发音人脸图像,提升观感效果。
该方法可以应用于电子设备中。其中,所述电子设备可以通过搭载与图像生成方法对应的软件装置执行所述方法。所述电子设备的类型可以是笔记本电脑,计算机,服务器,手机,PAD终端等。本公开不对所述电子设备的具体类型进行特别限定。所述电子设备可以是客户端或服务端一侧的设备。所述服务端可以是由服务器、服务器集群或分布式服务器集群提供的服务端或云端。以下以执行主体为电子设备(以下简称设备)为例进行说明。
请参见图1,图1为本公开实施例示出的一种图像生成方法的方法流程图。
如图1所示,所述方法可以包括以下步骤S102至S106。
S102,接收音频数据和人脸图像,提取所述音频数据包括的音频序列对应的文本特征;其中,所述文本特征表征音频序列对应的文本内容。
S104,基于所述音频序列对应的文本特征,进行面部特征映射,得到与所述音频序列对应的面部特征;其中,所述面部特征表征所述音频序列对应的发音动作。
S106,根据所述音频序列对应的面部特征以及所述人脸图像,生成与所述音频序列对应的发音人脸图像。
在一些实施例中,用户可以通过所述电子设备提供的客户端程序,将音频数据与人脸图像传输至所述电子设备。接收到所述音频数据后,所述设备可以执行S102。
所述音频数据,可以包含语音信息。例如,所述音频数据可以是说话、唱歌等语音音频文件。所述音频数据可以包括单个音频序列或者多个在时序上连续的音频序列。本公开可以将所述音频序列与人脸图像进行合成,得到与音频序列一致的发音人脸图像。
音频序列通常可以表达一定的文本内容。例如,当音频数据为“我要去吃饭”时,其包括的首个音频序列表达的文本内容可能为“wo(我)”的第一个音素“w”。同一文本内容具有唯一的发音动作,并且文本内容可以包含语义,且不包含与发声人员个人有关的特性,因此根据音频序列对应的文本内容可以获取准确的表征发音动作的面部特征,进而获得准确的发音人脸图像。
所述文本特征可以表征所述音频序列的文本内容。在一些实施例中,所述文本特征可以是文本内容的向量表达。
在一些实施例中,在执行S102时,可以利用预先训练好的第一文本特征提取网络(以下称为第一网络),对所述音频序列进行特征提取得到音频序列对应的文本特征。
所述第一网络可以是基于神经网络构建的回归或分类网络。在训练该网络时,可以获取标注了文本特征的若干音频序列样本。然后再根据获取的样本对所述第一网络进 行训练,直至该第一网络收敛。
在一些实施例中,可以先获取若干音频序列,然后可以根据文本特征与文本内容的对应规则,确定各音频序列的文本内容对应的文本特征,并采用诸如人工标注等方式对各音频序列进行文本特征的标注,得到若干音频序列样本。之后可以采用有监督训练的方式,利用反向传播对所述第一网络的网络参数进行多次迭代,直至该网络收敛,完成训练。
在完成训练后,可以将接收到的音频数据包括的音频序列分别输入所述第一网络,从而得到与音频序列对应的文本特征。
在一些实施例中，为了获取更准确的文本特征，在执行S102时，可以执行S1022，获取所述音频序列对应的音频信号特征。然后执行S1024，对所述音频信号特征进行文本特征提取，得到所述音频序列对应的文本特征。
所述音频信号特征可以表征声音特性(如音调,响度,音色等中至少一种)和文本内容等。在一些实施例中,所述音频信号特征可以包括以下中的至少一项:梅尔倒谱特征(Mel-Frequency Cepstral Coefficients,MFCC);梅尔特征;线性预测特征;线性预测倒谱特征;线谱频率特征;小波变换特征。通过所述音频信号特征有助于对音频信号信息进行准确描述,从而有助于得到更准确的文本特征。
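作为示意，下面给出一段提取梅尔倒谱特征（MFCC）的简化代码。该示例假设使用开源的 librosa 库，采样率、MFCC 维数、帧移等参数均为示例性假设，并非本公开限定的实现方式：

```python
import librosa

def extract_mfcc(audio_path, sr=16000, n_mfcc=13, frame_ms=40):
    # 读取音频并重采样到假设的采样率
    y, sr = librosa.load(audio_path, sr=sr)
    # 帧移取 40ms，仅为与 25fps 视频单帧播放时长对齐的示例
    hop_length = int(sr * frame_ms / 1000)
    # 提取 MFCC，得到形状为 (n_mfcc, 帧数) 的特征矩阵
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    # 转置为 (帧数, n_mfcc)，每一行即一个音频序列对应的音频信号特征
    return mfcc.T
```

得到的逐帧特征即可作为后续文本特征提取的输入。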
需要说明的是,所述音频信号分析方法包括但不限于傅里叶变换,小波变换等。本公开不限定所述音频信号分析方法的具体类型。
在一些实施例中,可以根据整个音频数据对应的音频信号特征,得到音频序列的音频信号特征,与针对单帧音频序列确定该单帧音频序列的音频信号特征相比,可以结合单帧音频序列前后音频序列表征的语义得到该单帧音频序列更准确的音频信号特征。
在一些实施例中,在执行S1022时,可以通过音频信号分析算法获取所述音频数据对应的音频信号特征。然后可以从所述音频数据对应的音频信号特征中截取出与所述音频序列对应的音频信号特征。
在执行S1024时,可以通过第二文本特征提取网络(以下称为第二网络),对所述音频序列对应的音频信号特征进行特征提取得到所述音频序列对应的文本特征。其中,所述第二网络包括:通过标注了文本特征的若干音频信号特征样本进行训练得到的神经网络。由此根据表征音频信号信息的音频信号特征进行音频序列的文本特征提取,可以直接从与文本内容有关的音频信号中提取文本特征,有助于减少音频序列包括的其它信息对提取文本特征的影响,从而得到更准确的文本特征。
在一些实施例中,可以结合当前音频序列与其前后若干连续序列之间的语义,获取更准确的文本特征。在执行S1024时,可以执行S1,根据所述音频序列,以及多个音频序列中在所述音频序列之前的至少一个音频序列和多个音频序列中在所述音频序列之后的至少一个音频序列分别对应的音频信号特征,生成输入特征。S2,利用文本特征提取网络,对所述输入特征进行文本特征提取,得到与所述音频序列对应的文本特征。其中,所述文本特征提取网络包括:通过标注了文本特征的若干训练样本进行训练得到的神经网络。
请参见图2,图2为本公开实施例示出的一种文本特征提取流程示意图。
如图2所示,在执行S1时,可以对所述音频序列及其之前连续m个音频序列和之后连续n个音频序列分别对应的音频信号特征,执行诸如特征拼接,加权求和等步骤,然后得到所述输入特征。其中,所述m和n为预设正整数。由于所述输入特征除了包括所述音频序列的音频信号特征外,还包括所述音频序列和与其相邻的音频序列之间的语 义信息,因此可以得到更准确的文本特征。
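下面给出根据当前音频序列及其前 m 个、后 n 个音频序列的音频信号特征构建输入特征的一个简化示意（采用特征拼接方式，边界处的补齐策略为示例性假设）：

```python
import numpy as np

def build_input_feature(features, idx, m=1, n=1):
    """features: (帧数, 特征维度) 的音频信号特征；idx: 当前音频序列下标；
    m、n: 向前/向后选取的音频序列个数（示例值）。"""
    num_frames = features.shape[0]
    window = []
    for i in range(idx - m, idx + n + 1):
        # 边界处复用首帧/末帧特征进行补齐（示例性处理）
        j = min(max(i, 0), num_frames - 1)
        window.append(features[j])
    # 采用特征拼接；也可以改用加权求和等方式
    return np.concatenate(window, axis=0)
```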
所述文本特征提取网络,可以是基于神经网络构建的回归或分类网络。
在一些实施例中,在训练该网络时,可以先获取连续的多个音频序列的音频信号特征;然后可以将任意连续的三个音频序列,以中间音频序列为准,分别确定前后音频序列与中间音频序列的音频信号特征的差值,然后将确定的差值与所述中间音频序列的音频信号特征进行拼接,得到所述中间音频序列的输入特征。然后可以采用诸如人工标注等方式,对各输入特征进行标注,得到若干训练样本。之后,可以采用有监督训练的方式,利用反向传播对所述文本特征提取网络进行多次迭代,直至该网络收敛,完成训练。
请继续参见图2,在执行S2时,可以根据当前音频序列以及前后两个音频序列各自对应的音频信号特征,采用训练网络时采用的构建输入特征的方法,得到输入特征。然后可以将该输入特征输入所述文本特征提取网络,得到与所述当前音频序列对应的文本特征。由此可以利用当前音频序列与其前后音频序列之间的语义,得到所述音频序列更准确的文本特征。
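结合上述以差值与中间序列特征拼接作为输入的构建方式，下面给出文本特征提取网络的一个极简示意（以 PyTorch 为例，层数、隐藏维度与文本特征维度均为假设值，实际可采用其他回归或分类网络结构）：

```python
import torch
import torch.nn as nn

class TextFeatureNet(nn.Module):
    """文本特征提取网络的示意结构（各维度均为假设值）。"""
    def __init__(self, feat_dim=13, text_dim=64):
        super().__init__()
        # 输入为前后序列与中间序列特征的差值 + 中间序列特征，共 3 * feat_dim 维
        self.net = nn.Sequential(
            nn.Linear(feat_dim * 3, 128),
            nn.ReLU(),
            nn.Linear(128, text_dim),
        )

    def forward(self, prev_f, cur_f, next_f):
        # 按训练时的方式构建输入特征：差值与中间音频序列特征拼接
        x = torch.cat([prev_f - cur_f, cur_f, next_f - cur_f], dim=-1)
        return self.net(x)
```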
在得到音频序列分别对应的文本特征后,所述设备可以执行S104。
本步骤中的面部特征可以表征音频序列对应的发音动作。在本公开中,可以利用至少两种面部特征表征发音动作。其一,所述面部特征可以包括目标面部区域的纹理特征,通过目标面部区域的纹理特征可以表征发音动作,其二,所述面部特征可以包括目标面部区域的轮廓特征,通过目标面部区域的轮廓特征也可以表征发音动作。以下以面部特征包括目标面部区域的轮廓特征为例进行说明。
所述目标面部区域是指可以表达发音动作的任意区域。所述目标面部区域可以根据业务需求进行选定。在一些实施例中，所述目标面部区域包括以下中的至少一项：嘴部；下颚；鼻子；眼睛；眉毛；耳朵。由此可以根据实际要求灵活选择嘴部、下颚、眉毛等区域中的至少一个来表达说话时的发音动作，从而实现对发音动作更准确的表达，进而提升发音人脸图像表达发音动作的准确性。
在一些实施例中,所述面部特征可以包括针对目标面部区域选取的多个关键点的三维坐标。通过在目标面部区域轮廓上选取的多个关键点的三维坐标来表征面部特征,可以准确的表达出音频序列对应的发音动作,从而可以提升发音人脸图像表达的发音动作的准确性。
在一些实施例中,在执行S104时,可以利用面部特征映射网络(以下称为第三网络),对所述音频序列对应的文本特征进行面部特征映射,得到与所述音频序列对应的面部特征。其中,所述面部特征包括针对目标面部区域选取的多个关键点的三维坐标。
其中,所述第三网络包括:通过标注了面部特征的若干文本特征样本进行训练得到的神经网络。
在一些实施例中,所述第三网络可以是基于神经网络构建的回归网络。
在训练所述第三网络时,可以先获取若干音频序列对应的文本特征,并确定各音频序列对应的面部特征。然后采用诸如人工标注等方式对文本特征进行标注,得到若干文本特征样本。之后可以采用有监督训练的方式,利用反向传播对所述第三网络的网络参数进行多次迭代,直至该网络收敛,完成训练。
完成训练后,可以将所述音频序列对应的文本特征输入所述第三网络,得到与所述音频序列对应的面部特征。
在一些实施例中，可以利用表征文本内容的文本特征与表征音色、响度、音调中至少一种特征的声音特征融合得到融合特征，进行面部特征映射，得到与所述音频序列对应的面部特征，从而可以综合考虑音频序列对应的声音特性与文本内容，得到更准确的面部特征。在执行S104时，可以执行S1042，根据所述音频序列，得到所述音频序列对应的声音特征；所述声音特征表征对应音频序列的音色、响度、音调中的至少一种特征。然后执行S1044，将所述音频序列对应的文本特征和声音特征进行融合，得到所述音频序列对应的融合特征。之后执行S1046，利用面部特征映射网络（以下称为第四网络），对所述音频序列对应的融合特征进行面部特征映射，得到与所述音频序列对应的面部特征。其中，所述第四网络可以包括：通过标注了面部特征的若干融合特征样本进行训练得到的神经网络。
由于音频信号特征可以涵盖声音特征，因此在一些实施例中，在执行S1042时，可以根据所述音频序列对应的音频信号特征，得到所述声音特征。在一些实施例中，可以获取音频序列的音频信号特征包括的多维度特征中，与声音特征相关维度的特征，从而得到声音特征。以音频信号特征为梅尔倒谱特征（以下称为MFCC）为例，所述MFCC包括的多维特征中的首维特征表征与声音特性有关的特征，因此可以将MFCC的首维特征作为所述声音特征。
在一些实施例中,在执行S1044时,可以采用特征拼接或特征叠加等方式,将文本特征和声音特征进行融合,由此可以得到表征文本内容与声音特性的融合特征,以使在确定面部特征时,可以同时兼顾音频序列的文本内容与声音特性,从而确定出更准确表达发音动作的面部特征。
所述第四网络可以是基于神经网络构建的回归网络。
在训练所述第四网络时,可以先获取若干音频序列对应的融合特征,并确定各音频序列对应的面部特征。然后采用诸如人工标注等方式对融合特征进行标注,得到若干融合特征样本。之后可以采用有监督训练的方式,利用反向传播对所述第四网络的网络参数进行多次迭代,直至该网络收敛,完成训练。
完成训练后,可以根据所述音频序列对应的声音特征与文本特征,得到融合特征,然后输入所述第四网络,得到面部特征。
在一些实施例中,可以结合所述音频序列以及其前后若干连续音频序列之间的关联信息,获取更准确的面部特征。在执行S1046时,可以执行S3,将所述音频序列以及多个音频序列中在所述音频序列之前的至少一个音频序列和多个音频序列中在所述音频序列之后的至少一个音频序列分别对应的融合特征作为输入,利用面部特征映射网络,得到所述音频序列对应的面部特征。其中,面部特征映射网络可以基于长短期记忆网络构建。
所述长短期记忆网络(LSTM,Long Short-Term Memory),可以在时序上保留多个音频序列的融合特征信息,进而可以结合当前音频序列,以及其前后若干连续序列之间的关联信息,获取更准确的面部特征。
请参见图3,图3为本公开实施例示出的一种面部特征映射网络结构示意图。
图3示出的面部特征映射网络(以下称为第五网络)可以包括输入层,LSTM层,全连接层以及输出层。
其中,所述输入层包括N个节点31(311,312…)。所述N个节点分别对应LSTM层的N个LSTM处理单元32(321,322…;以下称为处理单元)。所述N为根据业务需求设定的正整数,所述N通常与输入的音频序列个数相同。所述N个节点用于向对应处 理单元输入音频序列对应的融合特征。
所述LSTM处理单元可以包括遗忘门,输入门与输出门。其中,输出门可以将当前处理单元的处理结果分为两份,一份作为当前处理单元的输出结果;另一份可以作为下一处理单元的输入。所述遗忘门可以筛选出上一处理单元的输出结果中有益的信息。所述输入门可以筛选出当前处理单元对应节点的输入信息中有益的信息。所述处理单元可以通过所述三个门,对当前节点的输入与上一处理单元的输出进行处理,得到处理结果。
所述全连接层33，可以对各LSTM处理单元的输出结果进行全连接，得到与当前音频序列对应的输出结果。
在执行S3时（以下，将所述音频序列以及多个音频序列中在所述音频序列之前的至少一个音频序列和多个音频序列中在所述音频序列之后的至少一个音频序列称为序列集合），可以按照时序，依次将序列集合中的各音频序列对应的融合特征输入所述输入层包括的N个节点。
然后经过LSTM层与全连接层处理后,可以得到与所述音频序列对应的面部特征。
其中,在LSTM层处理过程中,除了利用当前节点输入的融合特征外,还可以结合之前节点输入的信息,从而可以确定出更准确的输出特征。所述全连接层处理过程中,可以根据各处理单元输出的输出特征,得到当前音频序列的面部特征,从而可以进一步结合序列集合中各音频序列之间的关联信息,获得更准确的面部特征。
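结合图3的结构描述，下面给出基于 LSTM 的面部特征映射网络的一个简化示意（以 PyTorch 为例，输入维度、隐藏维度、序列集合长度与关键点个数均为假设值）：

```python
import torch
import torch.nn as nn

class FaceFeatureMapper(nn.Module):
    """面部特征映射网络示意：LSTM 层 + 全连接层，输出 K 个关键点的三维坐标。"""
    def __init__(self, fuse_dim=65, hidden_dim=128, seq_len=3, num_keypoints=20):
        super().__init__()
        self.num_keypoints = num_keypoints
        self.lstm = nn.LSTM(input_size=fuse_dim, hidden_size=hidden_dim,
                            batch_first=True)
        # 对各时间步（各 LSTM 处理单元）的输出结果进行全连接
        self.fc = nn.Linear(hidden_dim * seq_len, num_keypoints * 3)

    def forward(self, fused_seq):
        # fused_seq: (batch, seq_len, fuse_dim)，按时序输入序列集合中各音频序列的融合特征
        out, _ = self.lstm(fused_seq)
        coords = self.fc(out.reshape(out.shape[0], -1))
        # 返回 (batch, K, 3) 的关键点三维坐标
        return coords.view(-1, self.num_keypoints, 3)
```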
在得到音频序列对应的面部特征后,所述设备可以执行S106。
在一些实施例中，在执行S106时，可以执行S21，根据接收的人脸图像确定投影矩阵。然后执行S22，通过所述投影矩阵，将所述音频序列对应的多个关键点的三维坐标投影为二维坐标。之后执行S23，获取将所述人脸图像中目标面部区域遮挡之后的遮挡图像。最后执行S24，利用生成网络，根据所述遮挡图像与所述音频序列对应的多个关键点的二维坐标，生成所述音频序列对应的发音人脸图像。其中，所述生成网络包括通过对抗训练方式得到的神经网络。
所述投影矩阵，可以表征所述人脸图像中人脸的多个关键点的坐标从三维到二维的映射关系。三维与二维坐标系中的各坐标点存在一定的映射关系。在一些实施例中可以通过投影矩阵表征上述映射关系，通过所述投影矩阵可以将三维坐标映射为二维坐标。
接收的所述人脸图像可以包括人脸。所述人脸可以是侧面或正面人脸。在本公开中可以根据所述人脸图像,生成表达发音动作的发音人脸图像。
所述多个关键点可以用于表征所述目标面部区域的面部轮廓信息。在一些实施例中,所述多个关键点可以是面部轮廓上的特征点。例如,所述目标面部区域为嘴部和下颚时,所述多个关键点可以是嘴部和下颚轮廓上的特征点。
在一些实施例中，在执行S21时，可以先将接收的人脸图像输入预先训练的三维人脸形态模型中，得到与所述人脸图像对应的投影矩阵。所述三维人脸形态模型用于根据二维图像生成三维模型。在本公开中可以将所述映射过程中生成的投影矩阵作为与所述人脸图像对应的投影矩阵。
在一些实施例中,在执行S22时,可以利用所述投影矩阵与所述音频序列对应的多个关键点三维坐标矩阵进行矩阵运算,得到当前音频序列对应的多个关键点的二维坐标矩阵。
在一些实施例中,在执行S23时,可以通过人工方式,或基于Faster-Rcnn(Faster Region Convolutional Neural Networks,更快速的区域卷积神经网络)、Mask-Rcnn (Mask Region Convolutional Neural Networks,掩膜区域卷积神经网络)等神经网络构建的掩膜网络,对所述人脸图像进行遮挡处理,得到遮挡了所述目标面部区域的人脸图像。
S24中的生成网络可以是基于神经网络构建的回归网络。所述生成网络可以根据表征预设区域轮廓的多个关键点二维坐标,通过像素填充等方式生成预设区域对应的局部图像,然后再通过图像扭转等方式,将局部图像填充至人脸图像被掩盖的区域中,得到完整的发音人脸图像。
在一些实施例中,可以使用对抗训练的方式训练所述生成网络。在训练该网络时,可以先构建分类网络和生成网络。然后利用若干标注了真实图像或虚假图像分类的图像样本,对所述分类网络进行训练,得到对图像分类比较精准的分类网络。之后,可以获取若干遮挡图像和表征所述目标面部区域的多个关键点的二维坐标,再之后通过调整所述生成网络的参数,使得通过生成网络对所述遮挡图像与关键点坐标进行图像补充后得到的图像,可以被训练完成的所述分类网络判定为真实图像。至此则完成了对抗训练的过程。
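对抗训练的一个简化单步示意如下（以常见的判别网络与生成网络交替训练写法为例，与上文描述的先训练分类网络、再调整生成网络的流程在细节上可能不同，网络结构与损失设计均为示例性假设）：

```python
import torch
import torch.nn as nn

def adversarial_train_step(gen, disc, opt_g, opt_d, masked_img, kps_2d, real_img):
    """gen: 生成网络；disc: 分类（判别）网络；其余参数均为示例性占位。"""
    bce = nn.BCEWithLogitsLoss()
    fake_img = gen(masked_img, kps_2d)

    # 训练判别网络：区分真实图像与生成图像
    opt_d.zero_grad()
    d_real, d_fake = disc(real_img), disc(fake_img.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    loss_d.backward()
    opt_d.step()

    # 训练生成网络：使补全后的图像被判定为真实图像
    opt_g.zero_grad()
    d_fake = disc(fake_img)
    loss_g = bce(d_fake, torch.ones_like(d_fake))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```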
完成训练后,可以将所述遮挡图像与所述音频序列对应的多个关键点的二维坐标输入所述生成网络,得到所述发音人脸图像。
在所述例子中,通过多个关键点的坐标可以准确的表征出目标面部区域轮廓,从而可以提升发音人脸图像表达的发音动作的准确性。
在一些实施例中,在执行S24之前,可以基于所述音频序列以及多个音频序列中在所述音频序列之前的至少一个音频序列和多个音频序列中在所述音频序列之后的至少一个音频序列分别对应的多个关键点的二维坐标,对所述音频序列对应的多个关键点的二维坐标进行平滑处理。
在一些实施例中，可以通过异常数据排除法与插值法等方法，对所述音频序列以及所述音频序列前后多个音频序列分别对应的多个关键点的二维坐标进行修正，达到音频序列之间发音动作衔接自然的目的，提升基于各音频序列对应的发音人脸图像生成的发音人脸视频所体现的发音动作的连贯性。
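关键点二维坐标平滑处理的一种简化示意如下（以滑动平均为例；正文提到的异常数据排除法与插值法同样适用，窗口大小为示例值）：

```python
import numpy as np

def smooth_keypoints(kps_seq, window=5):
    """kps_seq: (帧数, K, 2) 的各音频序列关键点二维坐标；window: 平滑窗口（奇数）。"""
    pad = window // 2
    # 边界处复制首尾帧进行补齐
    padded = np.concatenate([np.repeat(kps_seq[:1], pad, axis=0),
                             kps_seq,
                             np.repeat(kps_seq[-1:], pad, axis=0)], axis=0)
    # 对每一帧取其前后相邻帧的平均值
    return np.stack([padded[i:i + window].mean(axis=0)
                     for i in range(kps_seq.shape[0])])
```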
在一些实施例中,接收的音频数据可以包括连续的多个音频序列。本例中可以根据前述任一实施例示出的图像生成方法,生成所述音频数据包括的连续的多个音频序列分别对应的发音人脸图像。然后可以根据生成的这些发音人脸图像,生成与所述音频数据对应的发音人脸视频。
所述发音人脸视频（以下简称视频），可以包括多帧按照时序排列的人脸发音图像。在一些实施例中，可以将音频数据分割为多个音频序列，所述音频序列的播放时长可以与所述视频单帧图像的播放时长相同，由此在确定所述音频数据包括的各音频序列对应的发音人脸图像后，按照时序将各图像排序即可得到与音频数据播放时长一致的视频。
例如,所述音频数据为5s。所述视频的帧率为25fps,即单帧图像的播放时长为40毫秒。此时可以将所述音频数据划分为125个播放时长为40毫秒的音频序列。在得到125个与音频序列对应的发音人脸图像后,可以将各人脸图像按照音频对应的时序排序,即可得到人脸发音视频。
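上述划分关系可以用如下简单示例说明（参数取正文示例中的数值）：

```python
def num_audio_sequences(duration_s=5.0, fps=25):
    """按视频帧率将音频划分为与单帧播放时长相同的音频序列，返回序列个数。"""
    frame_ms = 1000 / fps            # 单帧播放时长：40 毫秒
    total_ms = duration_s * 1000     # 音频总时长：5000 毫秒
    return int(total_ms / frame_ms)  # 5000 / 40 = 125 个音频序列
```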
在一些实施例中,可以通过将视频与背景图像进行融合,生成更真实的发音人脸视频。
所述背景图像可以是与人脸图像对应的背景图像。在一些实施例中，所述背景图像可以是与发音环境相关的图像。例如，在演讲场景中，所述背景图像可以是演讲大厅等背景。再例如，在歌唱场景中，所述背景图像可以是舞台背景等。
在生成与所述音频数据对应的发音人脸视频时，可以先获取与所述人脸图像对应的背景图像。然后将所述背景图像与所述连续的多个音频序列中的每个音频序列对应的发音人脸图像融合以得到多个融合图像，根据所述多个融合图像生成与所述音频数据对应的发音人脸视频。
在一些实施例中,可以通过图像融合技术,将所述背景图像分别与各发音人脸图像进行融合,得到融合后的图像,然后将融合后的图像作为视频帧,按照音频序列的时序进行排列,得到融合背景后的发音人脸视频,从而更符合真实场景。
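背景融合与视频生成的一个简化示意如下（以 OpenCV 的加权融合与视频写入为例，加权融合仅为示意，实际可采用更精细的图像融合技术，编码格式与参数为示例性假设）：

```python
import cv2

def write_talking_face_video(face_frames, background, out_path="out.mp4", fps=25):
    """face_frames: 按音频序列时序排列的发音人脸图像列表（BGR、尺寸一致）；
    background: 与人脸图像对应的背景图像。"""
    h, w = background.shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for face in face_frames:
        face = cv2.resize(face, (w, h))
        # 将背景图像与发音人脸图像融合，得到融合图像（加权融合仅为示例）
        fused = cv2.addWeighted(background, 0.5, face, 0.5, 0)
        # 将融合图像作为视频帧按时序写入
        writer.write(fused)
    writer.release()
```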
以下结合虚拟人物的场景进行实施例说明。
所述虚拟人物用于进行新闻播报。在一些实施例中,所述虚拟人物可以是某位公众人物。例如,主持人或公司负责人等。
本公开记载的发音视频生成方法可以应用于云端。所述云端可以为用户提供界面，供用户上传待播放的新闻音频（以下称为音频）与包括所述虚拟人物的人物图像。所述发音视频的帧率为25fps。
所述云端可以部署预先训练完成的文本特征提取网络，用于进行音频序列的文本特征提取，部署预先训练完成的三维关键点映射网络，用于进行文本特征到多个关键点三维坐标的映射，部署图像补全网络，用于根据预测的关键点坐标，补全遮挡图像。
请参见图4,图4为本公开实施例示出的一种发音人脸视频生成方法流程示意图。
如图4所示，所述云端在接收到所述新闻音频与人物图像后，可以执行S41，获取所述音频对应的MFCC，并对MFCC进行分割，得到所述音频包括的各音频序列（时长为40ms）分别对应的MFCC。
然后可以执行S42,针对各音频序列,利用所述文本特征提取网络,提取各音频序列对应的文本特征。由此通过准确表述音频信号的MFCC,可以得到准确的文本特征。
然后可以执行S43,将各音频序列的MFCC中表征声音特性的声音特征与文本特征进行拼接,并针对各音频序列拼接后的特征,利用三维关键点映射网络,得到表征虚拟人物嘴部与下颚(目标面部区域)的多个关键点的三维坐标。由此可以得到准确表述音频序列的发音动作的面部特征。
之后可以执行S44,利用接收到的人脸图像得到表征三维坐标到二维坐标映射关系的投影矩阵,并利用投影矩阵将所述多个关键点的三维坐标映射为二维坐标,并进行各音频序列对应的多个关键点坐标的平滑处理。
再执行S45,根据人脸图像,生成遮挡了虚拟人物嘴部和下颚的遮挡图像,然后利用所述图像补全网络,根据各音频序列对应的多个关键点的二维坐标,对遮挡图像进行补全,得到与各音频序列分别对应的完整的虚拟人物发音人脸图像。
最后可以执行S46，获取新闻播报背景图像，并将背景图像融合至各发音人脸图像，然后将各发音人脸图像作为视频帧，按照对应音频序列时序，生成虚拟人物发音人脸视频。
所述云端可以将生成的发音人脸视频返回,并向用户展示。
由此，先获取与录制所述音频的人员个人特性无关、仅表达音频文本内容的文本特征，再将文本特征与表征录音人员声音特性的声音特征拼接，进行嘴部与下颚轮廓的映射，再根据嘴部与下颚轮廓补全人脸图像并生成视频，可以既考虑音频的文本内容也考虑声音特性，得到准确表达虚拟人物发音动作的发音人脸视频，提升发音人脸视频的观感效果。
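将上述S41至S46各步骤串联起来的整体流程可以示意如下。其中 extract_text_feature、map_to_keypoints、estimate_projection_matrix、inpaint_face 均为假设的占位接口，其余函数沿用前文示例，仅用于说明数据在各步骤之间的流转，并非完整的工程实现：

```python
import numpy as np

def generate_talking_face_video(audio_path, face_img, background):
    mfcc = extract_mfcc(audio_path)                   # S41: 各音频序列的 MFCC
    proj_mat = estimate_projection_matrix(face_img)   # S44: 由人脸图像确定投影矩阵（假设接口）
    kps_2d_all = []
    for idx in range(mfcc.shape[0]):
        text_feat = extract_text_feature(mfcc, idx)   # S42: 文本特征（假设接口）
        fused = fuse_features(text_feat, mfcc[idx])   # S43: 融合特征
        kps_3d = map_to_keypoints(fused)              # S43: 关键点三维坐标（假设接口）
        kps_2d_all.append(project_keypoints(kps_3d, proj_mat))  # S44: 投影为二维坐标
    kps_2d_all = smooth_keypoints(np.stack(kps_2d_all))         # S44: 平滑处理
    frames = [inpaint_face(mask_target_region(face_img, k), k)  # S45: 遮挡并补全（假设接口）
              for k in kps_2d_all]
    write_talking_face_video(frames, background)                # S46: 融合背景并生成视频
```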
与所述实施例相应的,本公开提出一种图像生成装置。
请参见图5,图5为本公开实施例示出的一种图像生成装置的结构示意图。
如图5所示,所述装置50可以包括:
接收与提取模块51,用于接收音频数据和人脸图像,提取所述音频数据包括的音频序列对应的文本特征;所述文本特征表征对应音频序列的文本内容;
面部特征映射模块52,用于基于所述音频序列对应的文本特征,进行面部特征映射,得到与所述音频序列对应的面部特征;其中,所述面部特征表征所述音频序列对应的发音动作;
图像生成模块53,根据所述音频序列对应的面部特征以及所述人脸图像,生成与所述音频序列对应的发音人脸图像。
在一些实施例中,所述面部特征映射模块52用于:
根据所述音频序列,得到所述音频序列对应的声音特征;其中,所述声音特征表征对应音频序列的音色、响度、音调中的至少一种特征;
将所述音频序列对应的文本特征和声音特征进行融合,得到所述音频序列对应的融合特征;
利用面部特征映射网络,对所述音频序列对应的融合特征进行面部特征映射,得到与所述音频序列对应的面部特征。
在一些实施例中,所述音频数据包括连续的多个音频序列;所述面部特征映射模块52用于:
将所述音频序列,以及所述多个音频序列中在所述音频序列之前的至少一个音频序列和所述多个音频序列中在所述音频序列之后的至少一个音频序列分别对应的融合特征作为输入,利用面部特征映射网络,得到所述音频序列对应的面部特征。
在一些实施例中,所述面部特征包括面部区域的多个关键点的三维坐标;
所述图像生成模块53用于:
根据所述人脸图像确定投影矩阵;其中,所述投影矩阵表征所述人脸图像中的人脸关键点的坐标从三维到二维的映射关系;
通过所述投影矩阵,将所述音频序列对应的所述多个关键点的三维坐标投影为二维坐标;
获取将所述人脸图像中目标面部区域遮挡之后的遮挡图像;
利用生成网络,根据所述遮挡图像与所述音频序列对应的多个关键点的二维坐标,生成所述音频序列对应的发音人脸图像。
在一些实施例中,所述音频数据包括连续的多个音频序列;所述装置50还包括:
平滑处理模块，用于基于所述音频序列以及所述多个音频序列中在所述音频序列之前的至少一个音频序列和所述多个音频序列中在所述音频序列之后的至少一个音频序列分别对应的多个关键点的二维坐标，对所述音频序列对应的多个关键点的二维坐标进行平滑处理。
在一些实施例中,所述目标面部区域包括以下中的至少一项:
嘴部;下颚;鼻子;眼睛;眉毛;耳朵。
在一些实施例中,所述音频数据包括连续的多个音频序列;所述装置50还包括:
视频生成模块54,用于针对所述连续的多个音频序列中的每个音频序列,生成与该音频序列分别对应的发音人脸图像;
根据所述连续的多个音频序列中的每个音频序列对应的发音人脸图像,生成与所述音频数据对应的发音人脸视频。
在一些实施例中,所述视频生成模块54用于:
获取与所述人脸图像对应的背景图像;
将所述背景图像与所述连续的多个音频序列中的每个音频序列对应的发音人脸图像融合以得到多个融合图像;根据所述多个融合图像生成与所述音频数据对应的发音人脸视频。
在一些实施例中,所述接收与提取模块51用于:
获取所述音频序列对应的音频信号特征;
对所述音频序列对应的音频信号特征进行文本特征提取,得到所述音频序列对应的文本特征。
在一些实施例中,所述接收与提取模块51用于:
通过音频信号分析算法获取所述音频数据对应的音频信号特征;
从所述音频数据对应的音频信号特征中截取出与所述音频序列对应的音频信号特征。
在一些实施例中,所述音频数据包括连续的多个音频序列;所述接收与提取模块51用于:
根据所述音频序列以及所述多个音频序列中在所述音频序列之前的至少一个音频序列和所述多个音频序列中在所述音频序列之后的至少一个音频序列分别对应的音频信号特征,生成输入特征;
利用文本特征提取网络,对所述输入特征进行文本特征提取,得到与所述音频序列对应的文本特征。
在一些实施例中,所述音频序列对应的音频信号特征,包括以下中的至少一项:
梅尔倒谱特征;梅尔特征;线性预测特征;线性预测倒谱特征;线谱频率特征;小波变换特征。
本公开实施例示出的图像生成装置可以应用于电子设备上。相应地,本公开提供了一种电子设备,该设备可以包括:处理器;以及用于存储处理器可执行指令的存储器。其中,所述处理器被配置为调用所述存储器中存储的可执行指令,实现前述任一实施例示出的图像生成方法。
请参见图6,图6为本公开实施例示出的一种电子设备的硬件结构示意图。
如图6所示,该电子设备可以包括用于执行指令的处理器,用于进行网络连接的网络接口,用于为处理器存储运行数据的内存,以及用于存储图像生成装置对应指令的非易失性存储器。
其中,所述装置的实施例可以通过软件实现,也可以通过硬件或者软硬件结合的方式实现。以软件实现为例,作为一个逻辑意义上的装置,是通过其所在电子设备的处理器将非易失性存储器中对应的计算机程序指令读取到内存中运行形成的。从硬件层 面而言,除了图6所示的处理器、内存、网络接口、以及非易失性存储器之外,实施例中装置所在的电子设备通常根据该电子设备的实际功能,还可以包括其他硬件,对此不再赘述。
可以理解的是,为了提升处理速度,图像生成装置对应指令也可以直接存储于内存中,在此不作限定。
本公开提出一种计算机可读存储介质,所述存储介质存储有计算机程序,所述计算机程序可以用于使处理器执行如前述任一实施例示出的图像生成方法。
本领域技术人员应明白,本公开一个或多个实施例可提供为方法、系统或计算机程序产品。因此,本公开一个或多个实施例可采用完全硬件实施例、完全软件实施例或结合软件和硬件方面的实施例的形式。而且,本公开一个或多个实施例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、光学存储器等)上实施的计算机程序产品的形式。
本公开中记载的“和/或”表示至少具有两者中的其中一个,例如,“A和/或B”包括三种方案:A、B、以及“A和B”。
本公开中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于数据处理设备实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
以上对本公开特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下，在权利要求书中记载的行为或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外，在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中，多任务处理和并行处理也是可以的或者可能是有利的。
本公开中描述的主题及功能操作的实施例可以在以下中实现:数字电子电路、有形体现的计算机软件或固件、包括本公开中公开的结构及其结构性等同物的计算机硬件、或者它们中的一个或多个的组合。本公开中描述的主题的实施例可以实现为一个或多个计算机程序,即编码在有形非暂时性程序载体上以被数据处理装置执行或控制数据处理装置的操作的计算机程序指令中的一个或多个模块。可替代地或附加地,程序指令可以被编码在人工生成的传播信号上,例如机器生成的电、光或电磁信号,该信号被生成以将信息编码并传输到合适的接收机装置以由数据处理装置执行。计算机存储介质可以是机器可读存储设备、机器可读存储基板、随机或串行存取存储器设备、或它们中的一个或多个的组合。
本公开中描述的处理及逻辑流程可以由执行一个或多个计算机程序的一个或多个可编程计算机执行,以通过根据输入数据进行操作并生成输出来执行相应的功能。所述处理及逻辑流程还可以由专用逻辑电路—例如FPGA(现场可编程门阵列)或ASIC(专用集成电路)来执行,并且装置也可以实现为专用逻辑电路。
适合用于执行计算机程序的计算机包括,例如通用和/或专用微处理器,或任何其他类型的中央处理系统。通常,中央处理系统将从只读存储器和/或随机存取存储器接收指令和数据。计算机的基本组件包括用于实施或执行指令的中央处理系统以及用于存储指令和数据的一个或多个存储器设备。通常,计算机还将包括用于存储数据的一个或多个大容量存储设备,例如磁盘、磁光盘或光盘等,或者计算机将可操作地与此大容量存储设备耦接以从其接收数据或向其传送数据,抑或两种情况兼而有之。然而,计算机不是必须具有这样的设备。此外,计算机可以嵌入在另一设备中,例如移动电话、个 人数字助理(PDA)、移动音频或视频播放器、游戏操纵台、全球定位系统(GPS)接收机、或例如通用串行总线(USB)闪存驱动器的便携式存储设备,仅举几例。
适合于存储计算机程序指令和数据的计算机可读介质包括所有形式的非易失性存储器、媒介和存储器设备，例如包括半导体存储器设备（例如EPROM、EEPROM和闪存设备）、磁盘（例如内部硬盘或可移动盘）、磁光盘以及CD-ROM和DVD-ROM盘。处理器和存储器可由专用逻辑电路补充或并入专用逻辑电路中。
虽然本公开包含许多具体实施细节,但是这些不应被解释为限制任何公开的范围或所要求保护的范围,而是主要用于描述特定公开的具体实施例的特征。本公开内在多个实施例中描述的某些特征也可以在单个实施例中被组合实施。另一方面,在单个实施例中描述的各种特征也可以在多个实施例中分开实施或以任何合适的子组合来实施。此外,虽然特征可以如上所述在某些组合中起作用并且甚至最初如此要求保护,但是来自所要求保护的组合中的一个或多个特征在一些情况下可以从该组合中去除,并且所要求保护的组合可以指向子组合或子组合的变型。
类似地,虽然在附图中以特定顺序描绘了操作,但是这不应被理解为要求这些操作以所示的特定顺序执行或顺次执行、或者要求所有例示的操作被执行,以实现期望的结果。在某些情况下,多任务和并行处理可能是有利的。此外,所述实施例中的各种系统模块和组件的分离不应被理解为在所有实施例中均需要这样的分离,并且应当理解,所描述的程序组件和系统通常可以一起集成在单个软件产品中,或者封装成多个软件产品。
由此,主题的特定实施例已被描述。其他实施例在所附权利要求书的范围以内。在某些情况下,权利要求书中记载的动作可以以不同的顺序执行并且仍实现期望的结果。此外,附图中描绘的处理并非必需所示的特定顺序或顺次顺序,以实现期望的结果。在某些实现中,多任务和并行处理可能是有利的。
以上所述仅为本公开一个或多个实施例的较佳实施例而已,并不用以限制本公开一个或多个实施例,凡在本公开一个或多个实施例的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本公开一个或多个实施例保护的范围之内。

Claims (16)

  1. 一种图像生成方法,包括:
    接收音频数据和人脸图像;
    提取所述音频数据包括的音频序列对应的文本特征;其中,所述文本特征表征所述音频序列对应的文本内容;
    基于所述音频序列对应的文本特征,进行面部特征映射,得到与所述音频序列对应的面部特征;其中,所述面部特征表征所述音频序列对应的发音动作;
    根据所述音频序列对应的面部特征以及所述人脸图像,生成与所述音频序列对应的发音人脸图像。
  2. 根据权利要求1所述的方法,其中,所述基于所述音频序列对应的文本特征,进行面部特征映射,得到与所述音频序列对应的面部特征,包括:
    根据所述音频序列,得到所述音频序列对应的声音特征;其中,所述声音特征表征所述音频序列对应的音色、响度、音调中的至少一种特征;
    将所述音频序列对应的文本特征和声音特征进行融合,得到所述音频序列对应的融合特征;
    利用面部特征映射网络,对所述音频序列对应的融合特征进行面部特征映射,得到与所述音频序列对应的面部特征。
  3. 根据权利要求2所述的方法,其中,所述音频数据包括连续的多个音频序列;所述利用面部特征映射网络,对所述音频序列对应的融合特征进行面部特征映射,得到与所述音频序列对应的面部特征,包括:
    将所述音频序列,以及所述多个音频序列中在所述音频序列之前的至少一个音频序列和所述多个音频序列中在所述音频序列之后的至少一个音频序列分别对应的融合特征作为输入,利用所述面部特征映射网络,得到所述音频序列对应的面部特征。
  4. 根据权利要求1-3任一项所述的方法,其中,所述音频序列对应的面部特征包括目标面部区域的多个关键点的三维坐标;
    所述根据所述音频序列对应的面部特征以及所述人脸图像,生成与所述音频序列对应的发音人脸图像,包括:
    根据所述人脸图像确定投影矩阵;其中,所述投影矩阵表征所述人脸图像中的人脸关键点的坐标从三维到二维的映射关系;
    通过所述投影矩阵,将所述音频序列对应的所述多个关键点的三维坐标投影为二维坐标;
    获取将所述人脸图像中所述目标面部区域遮挡之后的遮挡图像;
    利用生成网络,根据所述遮挡图像与所述音频序列对应的所述多个关键点的二维坐标,生成所述音频序列对应的发音人脸图像。
  5. 根据权利要求4所述的方法,其中,所述音频数据包括连续的多个音频序列;
    在利用生成网络,根据所述遮挡图像与所述音频序列对应的所述多个关键点的二维坐标,生成所述音频序列对应的发音人脸图像之前,所述方法还包括:
    基于所述音频序列以及所述多个音频序列中在所述音频序列之前的至少一个音频序列和所述多个音频序列中在所述音频序列之后的至少一个音频序列分别对应的多个关键点的二维坐标,对所述音频序列对应的所述多个关键点的二维坐标进行平滑处理。
  6. 根据权利要求4或5所述的方法,其中,所述目标面部区域包括以下中的至少一项:
    嘴部;下颚;鼻子;眼睛;眉毛;耳朵。
  7. 根据权利要求1-6任一项所述的方法,其中,所述音频数据包括连续的多个音频序列;所述方法还包括:
    针对所述连续的多个音频序列中的每个音频序列，生成与该音频序列对应的发音人脸图像；
    根据所述连续的多个音频序列中的每个音频序列对应的发音人脸图像,生成与所述音频数据对应的发音人脸视频。
  8. 根据权利要求7所述的方法,其中,所述根据所述连续的多个音频序列中的每个音频序列对应的发音人脸图像,生成与所述音频数据对应的发音人脸视频,包括:
    获取与所述人脸图像对应的背景图像;
    将所述背景图像与所述连续的多个音频序列中的每个音频序列对应的发音人脸图像融合以得到多个融合图像;
    根据所述多个融合图像生成与所述音频数据对应的发音人脸视频。
  9. 根据权利要求1-8任一项所述的方法,其中,所述提取所述音频数据包括的音频序列对应的文本特征,包括:
    获取所述音频序列对应的音频信号特征;
    对所述音频序列对应的音频信号特征进行文本特征提取,得到所述音频序列对应的文本特征。
  10. 根据权利要求9所述的方法,其中,所述获取所述音频序列对应的音频信号特征,包括:
    通过音频信号分析算法获取所述音频数据对应的音频信号特征;
    从所述音频数据对应的音频信号特征中截取出与所述音频序列对应的音频信号特征。
  11. 根据权利要求9或10所述的方法,其中,所述音频数据包括连续的多个音频序列;所述对所述音频序列对应的音频信号特征进行文本特征提取,得到所述音频序列对应的文本特征,包括:
    根据所述音频序列以及所述多个音频序列中在所述音频序列之前的至少一个音频序列和所述多个音频序列中在所述音频序列之后的至少一个音频序列分别对应的音频信号特征,生成输入特征;
    利用文本特征提取网络,对所述输入特征进行文本特征提取,得到与所述音频序列对应的文本特征。
  12. 根据权利要求9-11任一项所述的方法,其中,所述音频序列对应的音频信号特征包括以下中的至少一项:
    梅尔倒谱特征;梅尔特征;线性预测特征;线性预测倒谱特征;线谱频率特征;小波变换特征。
  13. 一种图像生成装置,包括:
    接收与提取模块,用于接收音频数据和人脸图像,并且提取所述音频数据包括的音频序列对应的文本特征;其中,所述文本特征表征所述音频序列对应的文本内容;
    面部特征映射模块,用于基于所述音频序列对应的文本特征,进行面部特征映射,得到与所述音频序列对应的面部特征;其中,所述面部特征表征所述音频序列对应的发音动作;
    图像生成模块,根据所述音频序列对应的面部特征以及所述人脸图像,生成与所述音频序列对应的发音人脸图像。
  14. 根据权利要求13所述的装置,还包括:
    视频生成模块,用于生成与所述音频数据所包括的连续的多个音频序列对应的多个发音人脸图像;
    根据所述多个发音人脸图像,生成与所述音频数据对应的发音人脸视频。
  15. 一种电子设备,包括:
    处理器;
    用于存储处理器可执行指令的存储器;
    其中,所述处理器通过运行所述可执行指令以实现如权利要求1-12任一项所述的图像生成方法。
  16. 一种计算机可读存储介质,其存储有计算机程序,所述计算机程序用于使处理器执行如权利要求1-12任一项所述的图像生成方法。
PCT/CN2022/086972 2021-05-21 2022-04-15 图像生成方法、装置、设备以及存储介质 WO2022242381A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110560359.4 2021-05-21
CN202110560359.4A CN113299312B (zh) 2021-05-21 2021-05-21 一种图像生成方法、装置、设备以及存储介质

Publications (1)

Publication Number Publication Date
WO2022242381A1 true WO2022242381A1 (zh) 2022-11-24

Family

ID=77323911

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/086972 WO2022242381A1 (zh) 2021-05-21 2022-04-15 图像生成方法、装置、设备以及存储介质

Country Status (3)

Country Link
CN (1) CN113299312B (zh)
TW (1) TW202247144A (zh)
WO (1) WO2022242381A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299312B (zh) * 2021-05-21 2023-04-28 北京市商汤科技开发有限公司 一种图像生成方法、装置、设备以及存储介质
CN115914653A (zh) * 2021-09-30 2023-04-04 中兴通讯股份有限公司 视音频数据的发送方法、显示方法、发送端及接收端
CN115187727B (zh) * 2022-06-29 2023-06-13 北京百度网讯科技有限公司 一种虚拟面部图像的生成方法、装置、设备及存储介质
CN117014675A (zh) * 2022-09-16 2023-11-07 腾讯科技(深圳)有限公司 虚拟对象的视频生成方法、装置和计算机可读存储介质
CN116778041B (zh) * 2023-08-22 2023-12-12 北京百度网讯科技有限公司 基于多模态的人脸图像生成方法、模型的训练方法及设备

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110677598A (zh) * 2019-09-18 2020-01-10 北京市商汤科技开发有限公司 视频生成方法、装置、电子设备和计算机存储介质
CN110991329A (zh) * 2019-11-29 2020-04-10 上海商汤智能科技有限公司 一种语义分析方法及装置、电子设备和存储介质
CN111243626A (zh) * 2019-12-30 2020-06-05 清华大学 一种说话视频生成方法及系统
US20200234690A1 (en) * 2019-01-18 2020-07-23 Snap Inc. Text and audio-based real-time face reenactment
CN112562722A (zh) * 2020-12-01 2021-03-26 新华智云科技有限公司 基于语义的音频驱动数字人生成方法及系统
CN112668407A (zh) * 2020-12-11 2021-04-16 北京大米科技有限公司 人脸关键点生成方法、装置、存储介质及电子设备
CN112735371A (zh) * 2020-12-28 2021-04-30 出门问问(苏州)信息科技有限公司 一种基于文本信息生成说话人视频的方法及装置
CN112785671A (zh) * 2021-01-07 2021-05-11 中国科学技术大学 虚假人脸动画合成方法
CN113299312A (zh) * 2021-05-21 2021-08-24 北京市商汤科技开发有限公司 一种图像生成方法、装置、设备以及存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107944367B (zh) * 2017-11-16 2021-06-01 北京小米移动软件有限公司 人脸关键点检测方法及装置
CN110162598B (zh) * 2019-04-12 2022-07-12 北京搜狗科技发展有限公司 一种数据处理方法和装置、一种用于数据处理的装置
CN112188304B (zh) * 2020-09-28 2022-11-15 广州酷狗计算机科技有限公司 视频生成方法、装置、终端及存储介质

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200234690A1 (en) * 2019-01-18 2020-07-23 Snap Inc. Text and audio-based real-time face reenactment
CN110677598A (zh) * 2019-09-18 2020-01-10 北京市商汤科技开发有限公司 视频生成方法、装置、电子设备和计算机存储介质
CN110991329A (zh) * 2019-11-29 2020-04-10 上海商汤智能科技有限公司 一种语义分析方法及装置、电子设备和存储介质
CN111243626A (zh) * 2019-12-30 2020-06-05 清华大学 一种说话视频生成方法及系统
CN112562722A (zh) * 2020-12-01 2021-03-26 新华智云科技有限公司 基于语义的音频驱动数字人生成方法及系统
CN112668407A (zh) * 2020-12-11 2021-04-16 北京大米科技有限公司 人脸关键点生成方法、装置、存储介质及电子设备
CN112735371A (zh) * 2020-12-28 2021-04-30 出门问问(苏州)信息科技有限公司 一种基于文本信息生成说话人视频的方法及装置
CN112785671A (zh) * 2021-01-07 2021-05-11 中国科学技术大学 虚假人脸动画合成方法
CN113299312A (zh) * 2021-05-21 2021-08-24 北京市商汤科技开发有限公司 一种图像生成方法、装置、设备以及存储介质

Also Published As

Publication number Publication date
CN113299312A (zh) 2021-08-24
CN113299312B (zh) 2023-04-28
TW202247144A (zh) 2022-12-01

Similar Documents

Publication Publication Date Title
WO2022242381A1 (zh) 图像生成方法、装置、设备以及存储介质
JP6993353B2 (ja) ニューラルネットワークベースの声紋情報抽出方法及び装置
US20210357625A1 (en) Method and device for generating video, electronic equipment, and computer storage medium
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
Habibie et al. Learning speech-driven 3d conversational gestures from video
JP6019108B2 (ja) 文字に基づく映像生成
US11551393B2 (en) Systems and methods for animation generation
Xie et al. Realistic mouth-synching for speech-driven talking face using articulatory modelling
CN112465935A (zh) 虚拟形象合成方法、装置、电子设备和存储介质
Yu et al. Multimodal inputs driven talking face generation with spatial–temporal dependency
KR102509666B1 (ko) 텍스트 및 오디오 기반 실시간 얼굴 재연
CN113077537B (zh) 一种视频生成方法、存储介质及设备
JP2014519082A5 (zh)
US20210390945A1 (en) Text-driven video synthesis with phonetic dictionary
Hussen Abdelaziz et al. Modality dropout for improved performance-driven talking faces
WO2023088080A1 (zh) 说话视频生成方法、装置、电子设备以及存储介质
KR102540763B1 (ko) 머신 러닝 기반의 립싱크 영상 생성을 위한 학습 방법 및 이를 수행하기 위한 립싱크 영상 생성 장치
CN114895817B (zh) 交互信息处理方法、网络模型的训练方法及装置
Hajarolasvadi et al. Generative adversarial networks in human emotion synthesis: A review
RU2721180C1 (ru) Способ генерации анимационной модели головы по речевому сигналу и электронное вычислительное устройство, реализующее его
Wang et al. Fastlts: Non-autoregressive end-to-end unconstrained lip-to-speech synthesis
CN113269066B (zh) 说话视频生成方法、装置和电子设备
Hussen Abdelaziz et al. Speaker-independent speech-driven visual speech synthesis using domain-adapted acoustic models
CN116912375A (zh) 面部动画生成方法、装置、电子设备及存储介质
Hussen Abdelaziz et al. Audiovisual speech synthesis using tacotron2

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22803706

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE