CN115497150A - Virtual anchor video generation method and device, electronic equipment and storage medium - Google Patents

Virtual anchor video generation method and device, electronic equipment and storage medium

Info

Publication number
CN115497150A
CN115497150A CN202211296864.3A CN202211296864A
Authority
CN
China
Prior art keywords
video
real person
photo
network
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211296864.3A
Other languages
Chinese (zh)
Inventor
余国军 (Yu Guojun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiaoduo Intelligent Technology Beijing Co ltd
Original Assignee
Xiaoduo Intelligent Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaoduo Intelligent Technology Beijing Co ltd filed Critical Xiaoduo Intelligent Technology Beijing Co ltd
Priority to CN202211296864.3A
Publication of CN115497150A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a virtual anchor video generation method and device, an electronic device, and a storage medium. The method first obtains a target image video to be synthesized; acquires a 2D real-person photo, obtains information related to the photo, and derives face image data and lip-sync feature data from that information; and then inputs the face image data, the lip-sync feature data, and the target image video into a pre-trained generative adversarial network (GAN) for synthesis processing, obtaining a synthesized image video. Lip-sync is strengthened by extracting linguistic and prosodic features from the audio, and the GAN synthesis produces a virtual anchor face that approaches a real one; the product image video is then synthesized, so that viewers have a relatively realistic virtual anchor experience when watching the video.

Description

Virtual anchor video generation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image technologies, and in particular, to a method and an apparatus for generating a virtual anchor video, an electronic device, and a storage medium.
Background
In the field of internet anchor video, an anchor can shoot live video with an electronic device to provide live programs for an audience, and the audience watches the videos on their own electronic devices. However, most such video presentations fall short of expectations: the display form is monotonous, which hurts the viewing experience, or achieving the desired effect is costly and exceeds the budget.
Disclosure of Invention
Based on this, embodiments of the present application provide a virtual anchor video generation method and device, an electronic device, and a storage medium, which can address the high cost, poor stability, and limited flexibility of traditional anchor video shot with a real person.
In a first aspect, a virtual anchor video generation method is provided, and the method includes:
acquiring a target image video to be synthesized, wherein the target image video comprises a real-person image video and a virtual-person image video;
acquiring a 2D real-person photo, obtaining information related to the 2D real-person photo, and deriving face image data and lip-sync feature data from that information;
and inputting the face image data, the lip-sync feature data, and the target image video into a pre-trained generative adversarial network for synthesis processing to obtain a synthesized image video.
Optionally, the information related to the 2D real-person photo comprises at least the face data of the photo and audio of the person in the photo.
Optionally, deriving the face image data from the information related to the 2D real-person photo includes:
connecting consecutive key points in each predicted frame's key-point sequence with line segments and rendering them in different colors to obtain image features;
concatenating the image features with the original image along the channel dimension to obtain a feature map;
and generating each frame of the real-person image from the feature map through an encoder-decoder network.
Optionally, the encoder in the encoder-decoder network consists of 6 CNN layers, each followed by two residual blocks; the output of the encoder is fed directly into the decoder, which mirrors the encoder's structure, and skip connections link corresponding CNN layers.
Optionally, deriving lip-sync feature data from the information related to the 2D real-person photo includes:
performing content embedding on the audio clip in the information related to the 2D real-person photo and feeding it into an LSTM network to obtain c̃_t;
performing speaker identity embedding on the audio clip to obtain an s vector;
passing the s vector through an MLP, concatenating it with c̃_t, joining the corresponding results over a preset time window, and feeding them into a self-attention block to obtain features over that window;
and mapping these, together with the static key points of the 2D real-person photo, through an MLP to obtain the key-point changes, thereby obtaining a fine-tuned key-point prediction, which is taken as the lip-sync feature data.
Optionally, before inputting the face image data, the lip-sync feature data, and the target image video into the pre-trained generative adversarial network for synthesis processing, the method includes:
selecting different utterances of the same person to train the content feature extraction of the generative adversarial network; completing matching training of head motion and dynamic facial expression through the GAN; and, for image generation training, training on paired video frames and fine-tuning on high-resolution video, using the VoxCeleb2 dataset.
Optionally, the discriminator network in the generative adversarial network is:
r_t = Attn_d(y_t, c̃_t, s)
where y_t denotes the consecutive key points in each predicted frame's key-point sequence, c̃_t denotes the result of performing content embedding on the audio clip and feeding it into the LSTM network, s denotes the result of performing speaker identity embedding on the audio clip, Attn_d denotes an attention network layer, and r_t denotes the discriminator output.
In a second aspect, a virtual anchor video generation apparatus is provided, the apparatus comprising:
an acquisition module, configured to acquire a target image video to be synthesized, wherein the target image video comprises a real-person image video and a virtual-person image video;
a processing module, configured to acquire a 2D real-person photo, obtain information related to the 2D real-person photo, and derive face image data and lip-sync feature data from that information;
and a synthesis module, configured to input the face image data, the lip-sync feature data, and the target image video into a pre-trained generative adversarial network for synthesis processing to obtain a synthesized image video.
In a third aspect, an electronic device is provided, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the virtual anchor video generation method according to any one of the first aspect when executing the computer program.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the virtual anchor video generation method of any of the first aspects described above.
According to the technical solution provided by the embodiments of the present application, a target image video to be synthesized is acquired; a 2D real-person photo is acquired, information related to the photo is obtained, and face image data and lip-sync feature data are derived from that information; and the face image data, the lip-sync feature data, and the target image video are input into a pre-trained generative adversarial network for synthesis processing to obtain a synthesized image video.
The benefits of the technical solution provided by the embodiments of the present application include at least the following: linguistic and prosodic features are extracted from the audio to strengthen lip-sync; long- and short-term dependencies are captured by long short-term memory and a self-attention mechanism to produce coherent head motion; and image, audio, and timing are unified, so that a virtual anchor face approaching a real one is generated. The product image video is then synthesized, giving users the impression that it was shot with a real person and edited afterwards, and playing the processed synthesized image video gives viewers a relatively realistic virtual anchor experience.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It should be apparent that the drawings in the following description are merely exemplary, and that those of ordinary skill in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flowchart illustrating steps of a method for generating a virtual anchor video according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a virtual anchor video generation method according to an alternative embodiment of the present application;
fig. 3 is a block diagram of a virtual anchor video generation apparatus according to an embodiment of the present application;
fig. 4 is a schematic view of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and do not limit it.
In the description of the present invention, the terms "comprises," "comprising," "has," "having," and any variations thereof are intended to cover non-exclusive inclusion: a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those expressly listed, but may include other steps or elements that are not expressly listed, are inherent to it, or are added based on further optimization of the inventive concept.
To facilitate understanding of the present embodiment, a method for generating a virtual anchor video disclosed in the embodiments of the present application is first described in detail.
Referring to fig. 1, a flowchart of a virtual anchor video generation method provided by an embodiment of the present application is shown, where the method may include the following steps:
step 101, a target image video to be synthesized is obtained.
The target image video comprises a real person image video and a virtual person image video.
Step 102, acquiring a 2D real-person photo, obtaining information related to the 2D real-person photo, and deriving face image data and lip-sync feature data from that information.
The information related to the 2D real-person photo comprises at least the face data of the photo and audio of the person in the photo.
In the embodiment of the present application, deriving the face image data from the information related to the 2D real-person photo includes: connecting consecutive key points in each predicted frame's key-point sequence with line segments and rendering them in different colors to obtain image features; concatenating the image features with the original image along the channel dimension to obtain a feature map; and generating each frame of the real-person image from the feature map through an encoder-decoder network.
The encoder in the encoder-decoder network consists of 6 CNN layers, each followed by two residual blocks; the output of the encoder is fed directly into the decoder, which mirrors the encoder's structure, and skip connections link corresponding CNN layers.
In the embodiment of the present application, deriving lip-sync feature data from the information related to the 2D real-person photo includes: performing content embedding on the audio clip in that information and feeding it into an LSTM network to obtain c̃_t; performing speaker identity embedding on the audio clip to obtain an s vector; passing the s vector through an MLP, concatenating it with c̃_t, joining the corresponding results over a preset time window, and feeding them into a self-attention block to obtain features over that window; and mapping these, together with the static key points of the 2D real-person photo, through an MLP to obtain the key-point changes, thereby obtaining a fine-tuned key-point prediction, which is taken as the lip-sync feature data.
Step 103, inputting the face image data, the lip-sync feature data, and the target image video into a pre-trained generative adversarial network for synthesis processing to obtain a synthesized image video.
In the embodiment of the present application, before inputting the face image data, the lip-sync feature data, and the target image video into the pre-trained generative adversarial network for synthesis processing, the method includes: selecting different utterances of the same person to train the content feature extraction of the generative adversarial network; completing matching training of head motion and dynamic facial expression through the GAN; and, for image generation training, training on paired video frames and fine-tuning on high-resolution video, using the VoxCeleb2 dataset.
The discriminator network in the generative adversarial network is specifically:
r_t = Attn_d(y_t, c̃_t, s)
where y_t denotes the consecutive key points in each predicted frame's key-point sequence, c̃_t denotes the result of performing content embedding on the audio clip and feeding it into the LSTM network, s denotes the result of performing speaker identity embedding on the audio clip, Attn_d denotes an attention network layer, and r_t denotes the discriminator output.
As shown in fig. 2, a possible embodiment of another virtual anchor video generation method in the present application is given, specifically:
step 201, storing a character (real character or virtual character) video and product related information, and playing the video;
wherein the collected related information includes lip sound data and facial image information of the real person's photo
202, performing countermeasure training according to the collected lip sound characteristic information and the face image information of the 2D photo to generate a real face picture, and then performing synthesis processing with a person video;
the content of the real person video synthesis processing comprises the following steps: replacing the face of the person in the real person or virtual person video with the face of the 2D real person photo; the face of the person is adaptively adjusted based on the image being a 2D real person photograph.
When synthesizing the real person video, firstly matching lip sound characteristic data, face image information and the video real person of the 2D real person photo, and then synthesizing after matching successfully.
And step 203, playing the processed synthesized real person video.
In an embodiment of the present application, replacing the face in the video with the face from the 2D photo comprises the following sub-steps:
Face key points capture many subtle, speaker-specific dynamics, such as slight expression changes and lip habits. The method selects 68 face key points, and these key points can drive various facial actions.
The face key points are used to match facial features and to collect face data.
A network is used to predict face key points from the audio signal so as to capture expressions and head poses.
Speech content features and speaker characteristics are decoupled so as to generate talking-head dynamics from the spoken audio.
The result is evaluated with two face key-point synchronization methods, one for real persons and one for animated characters.
Collecting lip-sync features from the speech audio corresponding to the 2D real-person photo comprises the following sub-steps:
The goal is to predict subtle changes in the face key points (e.g., head movements and the coupling of eyebrows and mouth) so that the key points change more naturally.
Speaker identity embeddings are extracted with a speaker verification model.
The method maximizes the similarity of embeddings obtained from different utterances of the same person and minimizes the similarity of embeddings obtained from utterances of different persons, yielding speech features that distinguish speakers. The embedding layer outputs 256 dimensions, but reducing this to 128 improves generalization to images not seen during training.
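As an illustration only (not the patent's implementation), the following PyTorch sketch trains a toy speaker embedding with a contrastive objective of this kind: embeddings of different utterances from the same speaker are pulled together and embeddings from different speakers are pushed apart, and a 256-to-128 projection mirrors the dimensionality reduction described above. The module names, layer sizes, and margin are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Toy speaker encoder: mel-spectrogram frames -> 128-d identity embedding."""
    def __init__(self, n_mels=80, hidden=256, emb_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)          # 256 -> 128 reduction

    def forward(self, mel):                             # mel: [B, T, n_mels]
        _, (h, _) = self.lstm(mel)
        return F.normalize(self.proj(h[-1]), dim=-1)    # unit-norm embeddings

def speaker_contrastive_loss(emb, speaker_ids, margin=0.5):
    """Pull same-speaker embeddings together, push different-speaker embeddings apart.
    Assumes the batch contains at least one same-speaker pair."""
    sim = emb @ emb.t()                                          # pairwise cosine similarity
    same = speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)
    eye = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
    pos = (1.0 - sim)[same & ~eye]                               # same speaker, different utterance
    neg = F.relu(sim - margin)[~same]                            # different speakers
    return pos.mean() + neg.mean()

if __name__ == "__main__":
    enc = SpeakerEncoder()
    mel = torch.randn(8, 200, 80)                   # 8 utterances, 200 frames each
    ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])    # two utterances per speaker
    loss = speaker_contrastive_loss(enc(mel), ids)
    loss.backward()
    print(float(loss))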
The key-point changes predicted by the method should reflect the speaker's characteristics (i.e., key-point changes obtained from different utterances of the same speaker should be similar), which makes the key-point changes more plausible.
The number of audio frames does not correspond directly to the duration of the generated key-point head changes: an input speech frame typically spans only tens of milliseconds, while the head pose change produced over that span takes longer to complete (i.e., the head change is too large for a single frame).
To better match the number of audio frames to the generated key-point changes, longer-range dependencies must be captured, which is why a self-attention mechanism is used.
The specific procedure is as follows:
S1, performing content embedding on a given audio clip to obtain A.
S2, feeding A into the LSTM to obtain c̃_t.
S3, performing speaker identity embedding on the audio clip to obtain s.
S4, passing s through an MLP, concatenating it with c̃_t, joining it with the corresponding results over a longer time window (4 s), and feeding this into a self-attention block (the same as the encoder block in a Transformer) to obtain features over the longer window.
S5, mapping these, together with the given static key points q, through an MLP to obtain the key-point changes.
S6, adding the changes to the key-point prediction obtained in the previous stage to obtain the fine-tuned key-point prediction.
The specific formulas are as follows:
h_t = Attn_s(c̃_t, MLP(s))
Δp = MLP_s(h_t, q)
y_t = p_t + Δp
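A minimal PyTorch sketch of steps S1 to S6 and the formulas above is given below, showing only the data flow (content embedding through an LSTM, concatenation with the MLP-processed speaker embedding, self-attention over the window, and an MLP that outputs key-point offsets added to the base prediction). The layer sizes, attention configuration, and module names are assumptions rather than the patent's exact architecture.

import torch
import torch.nn as nn

class SpeakerAwareKeypointHead(nn.Module):
    """Audio content embedding + speaker embedding -> per-frame offsets for 68 2D key points."""
    def __init__(self, content_dim=80, spk_dim=128, hidden=256, n_kp=68):
        super().__init__()
        self.content_lstm = nn.LSTM(content_dim, hidden, batch_first=True)    # A -> c~_t
        self.spk_mlp = nn.Sequential(nn.Linear(spk_dim, hidden), nn.ReLU())   # MLP(s)
        # self-attention block, similar to a Transformer encoder block
        self.attn = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=4, batch_first=True)
        self.kp_mlp = nn.Sequential(                                          # MLP_s(h_t, q)
            nn.Linear(2 * hidden + n_kp * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, n_kp * 2))

    def forward(self, content, spk, static_kp, base_kp):
        # content: [B, T, content_dim] content embedding A; spk: [B, spk_dim] embedding s
        # static_kp: [B, 68, 2] static key points q; base_kp: [B, 68, 2] base prediction p_t
        T = content.shape[1]
        c_t, _ = self.content_lstm(content)                      # [B, T, hidden]
        s = self.spk_mlp(spk).unsqueeze(1).expand(-1, T, -1)     # broadcast MLP(s) over time
        h_t = self.attn(torch.cat([c_t, s], dim=-1))             # h_t = Attn_s(c~_t, MLP(s))
        q = static_kp.flatten(1).unsqueeze(1).expand(-1, T, -1)
        delta = self.kp_mlp(torch.cat([h_t, q], dim=-1))         # Δp = MLP_s(h_t, q)
        return base_kp.flatten(1).unsqueeze(1) + delta           # y_t = p_t + Δp -> [B, T, 136]

if __name__ == "__main__":
    head = SpeakerAwareKeypointHead()
    y = head(torch.randn(2, 100, 80), torch.randn(2, 128),
             torch.randn(2, 68, 2), torch.randn(2, 68, 2))
    print(y.shape)  # torch.Size([2, 100, 136])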
the lip sound features are input by key points and audio, and are not related to whether the original image is a real person or not, but the finally presented effect, cartoon and human are certainly different, so that the method is also different. Therefore, the purpose of collecting 2D photo face data is to associate predicted key points with images and obtain final image output.
In the embodiment of the present application, collecting face data from the 2D real-person photo comprises the following sub-steps:
generating a real-person image;
a UNet-like network is used to implement this procedure.
The consecutive key points y_t in each predicted frame's key-point sequence are connected with line segments and rendered in different colors to obtain the image feature Y_t.
Y_t and the original image Q are concatenated along the channel dimension to obtain a feature map of size [256, 256, 6].
Each frame of the real-person image F_t is generated from the feature map through an encoder-decoder network.
The structure is an encoder-decoder. The encoder consists of 6 CNN layers, each followed by two residual blocks, forming a bottleneck.
The output of the encoder is fed directly into the decoder, whose structure mirrors the encoder's; skip connections link corresponding CNN layers.
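The following is a minimal sketch of such a UNet-like generator, assuming illustrative channel counts, kernel sizes, and residual-block design (the patent only specifies six CNN encoder layers, two residual blocks per layer, a mirrored decoder, and skip connections): the input is the 6-channel [256, 256] feature map formed by concatenating the rendered key-point image Y_t with the original photo Q, and the output is an RGB frame F_t.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class UNetLikeGenerator(nn.Module):
    """6-level encoder (each strided conv followed by two residual blocks), mirrored decoder,
    skip connections between corresponding levels. Input [B, 6, 256, 256] -> output [B, 3, 256, 256]."""
    def __init__(self, in_ch=6, base=16, levels=6):
        super().__init__()
        chs = [base * 2 ** i for i in range(levels)]        # 16, 32, ..., 512
        self.enc, self.dec = nn.ModuleList(), nn.ModuleList()
        prev = in_ch
        for c in chs:                                       # downsampling path
            self.enc.append(nn.Sequential(nn.Conv2d(prev, c, 4, stride=2, padding=1),
                                          nn.ReLU(), ResBlock(c), ResBlock(c)))
            prev = c
        for i in reversed(range(levels)):                   # mirrored upsampling path
            in_c = chs[i] if i == levels - 1 else chs[i] * 2    # *2 from skip concatenation
            out = chs[i - 1] if i > 0 else base
            self.dec.append(nn.Sequential(nn.ConvTranspose2d(in_c, out, 4, stride=2, padding=1),
                                          nn.ReLU(), ResBlock(out), ResBlock(out)))
        self.to_rgb = nn.Conv2d(base, 3, 3, padding=1)

    def forward(self, x):
        skips = []
        for enc in self.enc:
            x = enc(x)
            skips.append(x)
        skips = skips[:-1][::-1]                            # encoder features, deepest first
        for j, dec in enumerate(self.dec):
            x = dec(x)
            if j < len(skips):
                x = torch.cat([x, skips[j]], dim=1)         # short-circuit (skip) connection
        return torch.tanh(self.to_rgb(x))

if __name__ == "__main__":
    gen = UNetLikeGenerator()
    print(gen(torch.randn(1, 6, 256, 256)).shape)  # torch.Size([1, 3, 256, 256])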
In the embodiment of the present application, adversarial training with the face data collected from the 2D real-person photo and the lip-sync features collected from the corresponding speech audio comprises the following sub-steps:
For content feature extraction training, different utterances of the same person are selected from the dataset so that the model learns features of the utterance content only.
The Obama Weekly Address dataset is used.
Loss function:
for each key point, calculating its graph laplacian coordinate and then its distance can promote correctness between the relative positions of the key points and preserve some detailed features of the face.
The formula is as follows
L_c = Σ_t ( ‖p_t − p̂_t‖² + λ_c Σ_i ‖L(p_i, t) − L(p̂_i, t)‖² )
L(p_i, t) = p_{i,t} − (1 / |N(p_i)|) Σ_{p_j ∈ N(p_i)} p_{j,t}
where p_t is the predicted key point, p̂_t is the ground-truth value, and λ_c is typically 1. N(p_i) denotes the key points in the neighborhood of key point p_i, and L(p_i, t) denotes the graph Laplacian transform.
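A small PyTorch sketch of this key-point loss is given below: a positional term plus λ_c times the distance between graph-Laplacian coordinates (each key point minus the mean of its neighbors). The neighborhood definition and the use of a squared L2 distance are assumptions made for illustration.

import torch

def graph_laplacian(kp, neighbors):
    """Laplacian coordinate of each key point: its position minus the mean of its neighbors.
    kp: [B, N, 2]; neighbors: list of N index lists defining the face-graph neighborhoods."""
    return torch.stack([kp[:, i] - kp[:, nbr].mean(dim=1)
                        for i, nbr in enumerate(neighbors)], dim=1)

def keypoint_loss(pred, target, neighbors, lambda_c=1.0):
    """Position term + lambda_c * Laplacian-coordinate term (both squared L2 distances)."""
    pos = ((pred - target) ** 2).sum(dim=-1).mean()
    lap = ((graph_laplacian(pred, neighbors) -
            graph_laplacian(target, neighbors)) ** 2).sum(dim=-1).mean()
    return pos + lambda_c * lap

if __name__ == "__main__":
    # toy example: 5 key points on a chain, each neighboring its adjacent points
    neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]
    pred = torch.randn(2, 5, 2, requires_grad=True)
    target = torch.randn(2, 5, 2)
    loss = keypoint_loss(pred, target, neighbors)
    loss.backward()
    print(float(loss))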
For speaker-based feature extraction, a dataset with the same utterance spoken by different people is needed, so that the model learns how to extract speaker-specific features.
The VoxCeleb2 dataset is used.
Generative adversarial network
During training, a GAN is needed to complete the matching of head motion and dynamic facial expression.
Its discriminator network is similar to the self-attention generator structure described above.
The purpose of this step is to judge whether the generated face key points look real or fake.
The discriminator network: r_t = Attn_d(y_t, c̃_t, s)
where y_t denotes the consecutive key points in each predicted frame's key-point sequence, c̃_t denotes the result of performing content embedding on the audio clip and feeding it into the LSTM network, s denotes the result of performing speaker identity embedding on the audio clip, Attn_d denotes an attention network layer, and r_t denotes the discriminator output.
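As an illustration of the discriminator equation r_t = Attn_d(y_t, c̃_t, s), the sketch below builds a self-attention discriminator that scores each frame of a key-point sequence conditioned on the audio content features and the speaker embedding. The feature sizes and the per-frame scoring head are assumptions; the source specifies only that the discriminator is an attention network over (y_t, c̃_t, s).

import torch
import torch.nn as nn

class KeypointDiscriminator(nn.Module):
    """r_t = Attn_d(y_t, c~_t, s): per-frame realism score for a key-point sequence,
    conditioned on audio content features and a speaker embedding."""
    def __init__(self, n_kp=68, content_dim=256, spk_dim=128, d_model=256):
        super().__init__()
        self.in_proj = nn.Linear(n_kp * 2 + content_dim + spk_dim, d_model)
        self.attn = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.score = nn.Linear(d_model, 1)               # realism logit per frame

    def forward(self, y, c, s):
        # y: [B, T, 68, 2] key points; c: [B, T, content_dim]; s: [B, spk_dim]
        T = y.shape[1]
        s = s.unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([y.flatten(2), c, s], dim=-1)      # per-frame (y_t, c~_t, s)
        return self.score(self.attn(self.in_proj(x))).squeeze(-1)   # r_t: [B, T]

if __name__ == "__main__":
    disc = KeypointDiscriminator()
    r = disc(torch.randn(2, 100, 68, 2), torch.randn(2, 100, 256), torch.randn(2, 128))
    print(r.shape)  # torch.Size([2, 100])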
Loss function:
[Equation image in source: adversarial loss L_GAN over the discriminator output r_t]
Speaker-aware animation loss function:
[Equation image in source: speaker-aware animation loss L_s]
where λ_s = 1 and μ_s = 0.001. The generator (minimizing L_s) and the discriminator (minimizing L_GAN) are trained alternately to improve each other.
Image generation training uses paired video frames and is fine-tuned on high-resolution video. The VoxCeleb2 dataset is used.
Loss function:
[Equation image in source: image reconstruction loss between the generated and ground-truth frames, plus a perceptual term weighted by λ_a]
where λ_a = 1 and Φ denotes the feature map obtained by feeding the image into VGG19.
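A hedged sketch of this image-generation loss (a pixel reconstruction term plus a perceptual term weighted by λ_a, with Φ taken from a frozen VGG19) is given below. The choice of L1 distance and of the relu4_4 layer for Φ are assumptions made for illustration; the source gives the equation only as an image.

import torch
import torch.nn as nn
from torchvision.models import vgg19

class ImageGenerationLoss(nn.Module):
    """Pixel reconstruction loss + lambda_a * perceptual loss, where Phi is a frozen
    VGG19 feature extractor (features up to relu4_4; ImageNet normalization omitted)."""
    def __init__(self, lambda_a=1.0):
        super().__init__()
        self.phi = vgg19(weights="IMAGENET1K_V1").features[:27].eval()
        for p in self.phi.parameters():
            p.requires_grad_(False)
        self.lambda_a = lambda_a

    def forward(self, generated, target):
        # generated, target: [B, 3, 256, 256] images in [0, 1]
        pixel = torch.abs(generated - target).mean()
        perceptual = torch.abs(self.phi(generated) - self.phi(target)).mean()
        return pixel + self.lambda_a * perceptual

if __name__ == "__main__":
    criterion = ImageGenerationLoss()
    fake = torch.rand(1, 3, 256, 256, requires_grad=True)
    real = torch.rand(1, 3, 256, 256)
    loss = criterion(fake, real)
    loss.backward()
    print(float(loss))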
In addition, the face key-point frame rate is 62.5 FPS, the audio sampling rate is 16 kHz, and the mel spectrogram frame rate is 62.5 Hz.
As for why the method applies to both real persons and non-real characters (as noted for animating non-real-person photos): the key points are only an intermediate feature (the photo is generated from them afterwards), and the model learns relative rather than absolute key-point coordinates. Absolute coordinates would be hard to realize on a non-real character, since they must be generated from the face position, and non-face features differ greatly from facial features.
Animating real photos
The algorithm generates faces without artifacts, but the background is distorted when the head moves, because the foreground (i.e., the head) and the background are not separated when the whole picture is generated.
Even if only real persons are used in training, the image translation stage can still produce non-real persons and even 3D pictures.
Referring to fig. 3, a block diagram of a virtual anchor video generating apparatus 300 according to an embodiment of the present application is shown. As shown in fig. 3, the apparatus 300 may include: an acquisition module 301, a processing module 302 and a synthesis module 303.
An acquisition module 301, configured to acquire a target image video to be synthesized, wherein the target image video comprises a real-person image video and a virtual-person image video;
a processing module 302, configured to collect a 2D real-person photo, obtain information related to the 2D real-person photo, and derive face image data and lip-sync feature data from that information;
and a synthesis module 303, configured to input the face image data, the lip-sync feature data, and the target image video into a pre-trained generative adversarial network for synthesis processing to obtain a synthesized image video.
For specific limitations of the virtual anchor video generation apparatus, reference may be made to the above limitations of the virtual anchor video generation method, which are not described herein again. The respective modules in the virtual anchor video generation apparatus described above may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, an electronic device is provided, which may be a computer, and its internal structure may be as shown in fig. 4. The electronic device includes a processor, a memory, and a network interface connected by a system bus. The processor of the device provides computing and control capabilities. The memory of the device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores data for the virtual anchor video generation method. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a virtual anchor video generation method.
It will be appreciated by those skilled in the art that the configuration shown in fig. 4 is a block diagram of only a portion of the configuration associated with the present application, and is not intended to limit the computing device to which the present application may be applied, and that a particular computing device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment of the present application, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, realizes the steps of the above-mentioned virtual anchor video generation method.
The implementation principle and technical effect of the computer-readable storage medium provided by this embodiment are similar to those of the above-described method embodiment, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
All the technical features of the above embodiments can be combined arbitrarily, as long as the combinations are not contradictory; for brevity of description, not all possible combinations of the technical features in the above embodiments are described, but combinations that are not explicitly described should still be considered within the scope of this description.
The present application has been described in considerable detail with reference to certain embodiments and examples thereof. It should be understood that several conventional adaptations or further innovations may be made to these specific embodiments based on the technical idea of the present application; such conventional modifications and further innovations also fall within the scope of the claims of the present application as long as they do not depart from its technical idea.

Claims (10)

1. A method for virtual anchor video generation, the method comprising:
acquiring a target image video to be synthesized, wherein the target image video comprises a real-person image video and a virtual-person image video;
acquiring a 2D real-person photo, obtaining information related to the 2D real-person photo, and deriving face image data and lip-sync feature data from that information;
and inputting the face image data, the lip-sync feature data, and the target image video into a pre-trained generative adversarial network for synthesis processing to obtain a synthesized image video.
2. The method of claim 1, wherein the information related to the 2D real-person photo comprises at least the face data of the photo and audio of the person in the photo.
3. The method of claim 1, wherein deriving face image data from the information related to the 2D real-person photo comprises:
connecting consecutive key points in each predicted frame's key-point sequence with line segments and rendering them in different colors to obtain image features;
concatenating the image features with the original image along the channel dimension to obtain a feature map;
and generating each frame of the real-person image from the feature map through an encoder-decoder network.
4. The method of claim 3, wherein the encoder in the encoder-decoder network consists of 6 CNN layers, each followed by two residual blocks; the output of the encoder is fed directly into the decoder, which mirrors the encoder's structure, and skip connections link corresponding CNN layers.
5. The method according to claim 1, wherein deriving lip-sync feature data from the information related to the 2D real-person photo comprises:
performing content embedding on the audio clip in the information related to the 2D real-person photo and feeding it into an LSTM network to obtain c̃_t;
performing speaker identity embedding on the audio clip to obtain an s vector;
passing the s vector through an MLP, concatenating it with c̃_t, joining the corresponding results over a preset time window, and feeding them into a self-attention block to obtain features over that window;
and mapping these, together with the static key points of the 2D real-person photo, through an MLP to obtain the key-point changes, thereby obtaining a fine-tuned key-point prediction, which is taken as the lip-sync feature data.
6. The method of claim 1, wherein before inputting the face image data, the lip-sync feature data, and the target image video into the pre-trained generative adversarial network for synthesis processing, the method comprises:
selecting different utterances of the same person to train the content feature extraction of the generative adversarial network; completing matching training of head motion and dynamic facial expression through the GAN; and, for image generation training, training on paired video frames and fine-tuning on high-resolution video, using the VoxCeleb2 dataset.
7. The method according to claim 6, wherein the discriminator network in the generative adversarial network is specifically:
r_t = Attn_d(y_t, c̃_t, s)
where y_t denotes the consecutive key points in each predicted frame's key-point sequence, c̃_t denotes the result of performing content embedding on the audio clip and feeding it into the LSTM network, s denotes the result of performing speaker identity embedding on the audio clip, Attn_d denotes an attention network layer, and r_t denotes the discriminator output.
8. An apparatus for virtual anchor video generation, the apparatus comprising:
an acquisition module, configured to acquire a target image video to be synthesized, wherein the target image video comprises a real-person image video and a virtual-person image video;
a processing module, configured to acquire a 2D real-person photo, obtain information related to the 2D real-person photo, and derive face image data and lip-sync feature data from that information;
and a synthesis module, configured to input the face image data, the lip-sync feature data, and the target image video into a pre-trained generative adversarial network for synthesis processing to obtain a synthesized image video.
9. An electronic device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, implements the virtual anchor video generation method of any of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the virtual anchor video generation method of any of claims 1 to 7.
CN202211296864.3A 2022-10-21 2022-10-21 Virtual anchor video generation method and device, electronic equipment and storage medium Pending CN115497150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211296864.3A CN115497150A (en) 2022-10-21 2022-10-21 Virtual anchor video generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211296864.3A CN115497150A (en) 2022-10-21 2022-10-21 Virtual anchor video generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115497150A true CN115497150A (en) 2022-12-20

Family

ID=84474329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211296864.3A Pending CN115497150A (en) 2022-10-21 2022-10-21 Virtual anchor video generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115497150A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023231712A1 (en) * 2022-05-30 2023-12-07 中兴通讯股份有限公司 Digital human driving method, digital human driving device and storage medium
CN116828129A (en) * 2023-08-25 2023-09-29 小哆智能科技(北京)有限公司 Ultra-clear 2D digital person generation method and system
CN116828129B (en) * 2023-08-25 2023-11-03 小哆智能科技(北京)有限公司 Ultra-clear 2D digital person generation method and system
CN117593473A (en) * 2024-01-17 2024-02-23 淘宝(中国)软件有限公司 Method, apparatus and storage medium for generating motion image and video

Similar Documents

Publication Publication Date Title
Aldausari et al. Video generative adversarial networks: a review
Wang et al. One-shot talking face generation from single-speaker audio-visual correlation learning
CN115497150A (en) Virtual anchor video generation method and device, electronic equipment and storage medium
Song et al. Talking face generation by conditional recurrent adversarial network
CN110266973B (en) Video processing method, video processing device, computer-readable storage medium and computer equipment
US20220375190A1 (en) Device and method for generating speech video
KR102437039B1 (en) Learning device and method for generating image
US11581020B1 (en) Facial synchronization utilizing deferred neural rendering
Hajarolasvadi et al. Generative adversarial networks in human emotion synthesis: A review
CN113948105A (en) Voice-based image generation method, device, equipment and medium
CN115278293A (en) Virtual anchor generation method and device, storage medium and computer equipment
CN116828129B (en) Ultra-clear 2D digital person generation method and system
CN111275778B (en) Face simple drawing generation method and device
Sun et al. Vividtalk: One-shot audio-driven talking head generation based on 3d hybrid prior
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
Wang et al. Talking faces: Audio-to-video face generation
Wang et al. Ca-wav2lip: Coordinate attention-based speech to lip synthesis in the wild
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
CN115294622B (en) Method, system and storage medium for synthesizing and enhancing voice-driven speaker head motion video
Zhai et al. Talking face generation with audio-deduced emotional landmarks
Ravichandran et al. Synthesizing photorealistic virtual humans through cross-modal disentanglement
RU2720361C1 (en) Multi-frame training of realistic neural models of speakers heads
Kawai et al. VAE/WGAN-based image representation learning for pose-preserving seamless identity replacement in facial images
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head
Korshunov et al. Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination