CN113178206B - AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN113178206B
CN113178206B (Application CN202110436997.5A)
Authority
CN
China
Prior art keywords
video
video frame
frame
discriminator
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110436997.5A
Other languages
Chinese (zh)
Other versions
CN113178206A (en)
Inventor
王炜华
董林坤
张晖
飞龙
高光来
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University
Original Assignee
Inner Mongolia University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University filed Critical Inner Mongolia University
Priority to CN202110436997.5A priority Critical patent/CN113178206B/en
Publication of CN113178206A publication Critical patent/CN113178206A/en
Application granted granted Critical
Publication of CN113178206B publication Critical patent/CN113178206B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L2021/105 Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an AI (artificial intelligence) composite anchor generation method, an electronic device, and a readable storage medium. The method comprises the following steps: collecting data and preprocessing it to obtain speech data and video frame data; training a mouth shape synchronization discriminator to judge the probability that the speech data and the video frame data are synchronized; and training a video generator to generate synthesized video frames. The AI composite anchor generated by the invention differs little from a real video; the speech and mouth shape in the video are well synchronized, and the video transitions are smooth and free of jumps.

Description

AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium
Technical Field
The invention belongs to the technical field of artificial intelligence, and relates to an AI (artificial intelligence) composite anchor generation method, electronic equipment and a readable storage medium.
Background
An AI (Artificial Intelligence) composite anchor is a digital stand-in model that is indistinguishable from a real person. It is built by jointly modeling multi-modal information such as speech and video frames with several front-line technologies, including facial keypoint detection, facial feature extraction, and face reconstruction. The technology can automatically generate a video with the corresponding content from input text, ensure that the audio, expression, and lip movements in the video are naturally consistent, and deliver information as convincingly as a real human anchor.
In recent years, deep-learning-based methods have gradually become the mainstream approach to the speech synthesis, mouth shape synthesis, face reconstruction, and posture synthesis technologies used in AI composite anchors. Sogou adopted a multi-modal deep learning framework and developed the first domestic AI composite anchor system; universities and companies such as Tsinghua University, iFLYTEK, Baidu, the Institute of Automation and the Institute of Acoustics of the Chinese Academy of Sciences, and Harbin Institute of Technology have carried out a series of studies on speech synthesis and video frame reconstruction and have released several related products. Internationally, the MIT Media Lab, the MIT Artificial Intelligence Laboratory, CMU (Carnegie Mellon University), the University of Edinburgh, ATR in Japan, Microsoft, Google, IBM, and others have established intelligent-interaction research groups and conducted a series of studies on video frame reconstruction techniques.
At present, speech synthesis, mouth shape synthesis, and expression synthesis have reached a relatively high level of performance, but many research difficulties remain: new anchor content cannot be generated from a small amount of speaker material; existing methods lack broad practicality and cannot be generalized to many speakers, working only for a specific person; and the combination of the anchor's voice and expression is still unnatural.
Disclosure of Invention
To solve the above problems, the present invention provides an AI composite anchor generation method for rapidly generating a corresponding news broadcast video from a speaker's video frames; in the synthesized video, the anchor's mouth shape and voice are well synchronized, the anchor's voice and expression combine naturally, and the video transitions smoothly.
The invention further provides an electronic device and a readable storage medium.
The technical solution adopted by the invention is an AI composite anchor generation method comprising the following steps:
collecting data, preprocessing the data to obtain voice data and video frame data;
training a mouth shape synchronization discriminator, respectively inputting voice data and video frame data into the mouth shape synchronization discriminator, and judging the synchronization probability of the voice data and the video frame data;
training a video generator, a multi-scale frame discriminator, and a multi-scale time discriminator, wherein the video generator generates synthesized video frames from the MFCC (Mel-frequency cepstral coefficient) features of the speech data and from the video frame data (a sketch of MFCC feature extraction is given after these steps), the multi-scale frame discriminator judges the detail difference between synthesized and real video frames, and the multi-scale time discriminator judges whether the synthesized video frames transition smoothly;
and inputting the voice file and the speaking video frame of the arbitrary anchor into a video generator to obtain an AI synthesized anchor video.
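As a concrete illustration of the MFCC features mentioned above, the following minimal sketch extracts them with librosa. The file name, sampling rate, and number of coefficients are illustrative assumptions, not values fixed by the method.

```python
# Minimal sketch of MFCC feature extraction for the speech branch, using librosa.
# The file name, sampling rate, and n_mfcc value are illustrative assumptions.
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Load a speech file and return its MFCC matrix of shape (n_mfcc, n_frames)."""
    y, sr = librosa.load(wav_path, sr=sr)                 # resample to a fixed rate
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

if __name__ == "__main__":
    feats = extract_mfcc("anchor_speech.wav")             # hypothetical file name
    print(feats.shape)                                    # e.g. (13, number_of_audio_frames)
```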
Further, the preprocessing comprises decomposing a speaking video of any person into video frames and a speech file, performing face detection on the video frames, and cropping the faces in the video frames to obtain the video frame data.
Further, the mouth shape synchronization discriminator comprises a face encoder and an audio encoder. Video frame data and the synchronized speech features are used as the inputs of the face encoder and the audio encoder respectively, and cosine similarity with a binary cross-entropy loss function is used to compute the dot product between a randomly sampled video frame vector v and the ReLU-activated speech vector s, yielding the probability P_sync that the input speech features and the video frame are synchronized; training stops when this value drops to 0.2, giving the mouth shape synchronization discriminator.
Further, the multi-scale frame discriminator includes three frame discriminators, and the multi-scale time discriminator includes three time discriminators.
Further, the process of training the video generator is as follows:
In the first stage, the video generator is constructed; the MFCC features of the speech file and the speaking video are input into the video generator to obtain synthesized video frames; the real and synthesized video frames are each down-sampled to obtain three groups of real and synthesized video frame sequences at different resolutions; the three groups of sequences are input into the three frame discriminators respectively; and the multi-scale frame discriminator and the video generator are trained adversarially.
In the second stage, the trained video generator is used to generate synthesized video frames; the synthesized and real video frames are each down-sampled to obtain three groups of real and synthesized video frame sequences at different resolutions; the three groups of sequences are input into the three time discriminators respectively; and the multi-scale time discriminator and the video generator are trained adversarially to obtain the final video generator.
Further, the loss function calculation of the first stage is shown in formula (1):
$$\min_{G}\max_{D'_1,D'_2,D'_3}\sum_{k=1}^{3}\Big[L_{GAN}(G,D'_k)+\lambda_{FM}\,L_{FM}(G,D'_k)\Big]\tag{1}$$
In formula (1), G denotes the video generator, k is the index of the discriminators (k = 1, 2, 3), D'_1, D'_2, D'_3 denote the three frame discriminators, L_GAN(G, D'_k) denotes the adversarial loss of the k-th frame discriminator, L_FM(G, D'_k) denotes the feature matching loss of the k-th frame discriminator, and λ_FM is the hyper-parameter controlling the importance of the feature matching loss L_FM(G, D'_k).
Further, the loss function of the second-stage adversarial training is calculated as shown in formula (2):
$$\min_{G}\bigg(\Big[\max_{D''_1,D''_2,D''_3}\sum_{k=1}^{3}\sum_{t}L_t(G,D''_k)\Big]+\lambda_{RL}L_{RL}+\lambda_{SL}L_{SL}+\lambda_{BL}L_{BL}\bigg)\tag{2}$$
In formula (2), G denotes the video generator, k is the index of the discriminators (k = 1, 2, 3), D''_1, D''_2, D''_3 denote the three time discriminators, t denotes the total time length of the speech features, L_t(G, D''_k) denotes the temporal adversarial loss of the video generator and the k-th time discriminator over a time length t, and λ_RL, λ_SL, λ_BL are hyper-parameters controlling the importance of the L1 reconstruction loss L_RL, the mouth shape synchronization loss L_SL, and the blink loss L_BL, respectively.
The electronic device comprises a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the steps of the method when executing the program stored in the memory.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the above-mentioned method steps.
The invention has the following beneficial effects. The multi-scale frame discriminator matches synthesized and real video frames progressively from coarse to fine, so the detail difference between them is smaller and the synthesized anchor video is of higher quality. The multi-scale time discriminator matches the speaker's mouth shape and eye changes in the synthesized video frames, so the transitions between synthesized frames are more natural and free of jumps. The invention can synthesize a speaking video for any person without targeted training for a specific target person, and is therefore convenient to use.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a network architecture diagram of the present invention.
Fig. 2 is a diagram showing the position of 6 eye key points in the loss of blinking.
Fig. 3 is an architecture diagram of a prior art method.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them; all other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
The AI synthetic anchor generation method specifically comprises the following steps:
step 1, collecting data and preprocessing the data, wherein the specific process is as follows:
step 1a, collecting the BBC (British Broadcasting Corporation) open-source dataset Lip Reading Sentences 2 (LRS2), which consists of speaking videos of several hundred people totaling about 29 hours, renumbering the videos, and keeping ascending five-digit names starting from 0;
step 1b, decomposing each video into video frames and a speech file using the FFmpeg tool, and storing them in the same directory;
step 1c, performing face detection on the video frames using the s3fd model, and cropping the face region of each frame to obtain a video frame sequence (a sketch of this preprocessing is given below);
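The preprocessing of step 1 could look roughly like the sketch below. The FFmpeg commands mirror steps 1b and 1c; the frame rate, output names, and the detect_face() helper are illustrative assumptions, since the text names the s3fd model but not its interface.

```python
# Minimal sketch of step 1 (data preprocessing): split each video into frames and a
# speech file with FFmpeg, then crop detected faces. detect_face() is a placeholder
# for an s3fd-style detector; names and parameters are illustrative assumptions.
import os
import subprocess
from PIL import Image

def split_video(video_path, out_dir, fps=25):
    """Decompose one video into numbered frames and a speech file in the same directory."""
    os.makedirs(out_dir, exist_ok=True)
    # video frames with ascending five-digit names
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vf", f"fps={fps}",
                    os.path.join(out_dir, "%05d.png")], check=True)
    # mono 16 kHz speech track stored alongside the frames
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000",
                    os.path.join(out_dir, "audio.wav")], check=True)

def crop_faces(frame_dir, detect_face):
    """detect_face(image_path) -> (left, top, right, bottom) stands in for an s3fd-style detector."""
    for name in sorted(os.listdir(frame_dir)):
        if not name.endswith(".png"):
            continue
        path = os.path.join(frame_dir, name)
        box = detect_face(path)
        if box is not None:
            Image.open(path).crop(box).save(path)   # overwrite the frame with its face crop
```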
step 2, training a mouth shape synchronous discriminator:
The mouth shape synchronization network comprises a face encoder and an audio encoder, both of which are stacks of 2D convolutional layers. A video frame (containing only the lower half of the face) and its synchronized speech features are used as the inputs of the face encoder and the audio encoder respectively, and cosine similarity with a binary cross-entropy loss function is used to compute the dot product between a randomly sampled video frame vector v and the ReLU-activated speech vector s, giving the probability P_sync ∈ [0, 1] that the input speech features and the video frame are synchronized. Training stops when this value drops to 0.2, yielding the mouth shape synchronization discriminator. P_sync is calculated as shown in equation (1):

$$P_{sync}=\frac{v\cdot s}{\max\big(\lVert v\rVert_2\,\lVert s\rVert_2,\ \epsilon\big)}\tag{1}$$

where P_sync denotes the probability that the input speech features and the video frame are synchronized, and ε denotes an infinitesimally small constant that prevents division by zero;
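A minimal sketch of such a discriminator, assuming SyncNet-style convolutional encoders, is shown below; the channel sizes, embedding dimension, and input shapes are illustrative assumptions rather than the patent's exact configuration.

```python
# Minimal sketch of the mouth shape synchronization discriminator: a face encoder and
# an audio encoder built from stacked 2D convolutions, a ReLU-based cosine similarity
# as P_sync, and a binary cross-entropy loss. Sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_stack(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(256, 256),
    )

class SyncDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.face_encoder = conv_stack(3)    # lower-half face image
        self.audio_encoder = conv_stack(1)   # speech feature "image" (e.g. an MFCC window)

    def forward(self, face, audio):
        v = F.relu(self.face_encoder(face))      # video frame vector v
        s = F.relu(self.audio_encoder(audio))    # speech vector s
        # cosine similarity of the ReLU-activated embeddings lies in [0, 1]
        return F.cosine_similarity(v, s, dim=1).clamp(min=1e-8)

def sync_bce_loss(p_sync, is_synced):
    # binary cross-entropy between P_sync and the ground-truth sync label (0 or 1)
    return F.binary_cross_entropy(p_sync, is_synced)
```

During the second training stage, the output P_sync of this (frozen) discriminator is what feeds the mouth shape synchronization loss of formula (8) below.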
step 3, training the video generator, which is used to generate a realistic face video frame sequence with synchronized mouth shape and speech:
A network architecture as shown in Fig. 1 is constructed using the Spatially-Adaptive Normalization (SPADE) architecture, which comprises a series of SPADE residual blocks with upsampling layers. The speech file is input into a DeepSpeech2 model to extract speech features, and the speech features together with the frames of the person's speaking video are then input into the video generator to obtain synthesized video frames (a minimal sketch of one SPADE residual block is given below);
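The sketch below shows one SPADE residual block of the kind stacked in such a generator; the hidden width and the choice of conditioning map (speech/identity features resized to the spatial size of the activation) are assumptions for illustration, not the patent's exact design.

```python
# Minimal sketch of a SPADE residual block: the activation is normalized and then
# modulated by a scale and bias predicted from a conditioning feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Normalize x, then modulate it with gamma/beta predicted from the condition map."""
    def __init__(self, channels, cond_channels, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.shared = nn.Sequential(nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, cond):
        cond = F.interpolate(cond, size=x.shape[2:], mode="nearest")
        h = self.shared(cond)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)

class SPADEResBlock(nn.Module):
    def __init__(self, channels, cond_channels):
        super().__init__()
        self.spade1 = SPADE(channels, cond_channels)
        self.spade2 = SPADE(channels, cond_channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, cond):
        h = self.conv1(F.leaky_relu(self.spade1(x, cond), 0.2))
        h = self.conv2(F.leaky_relu(self.spade2(h, cond), 0.2))
        return x + h   # residual connection; an upsampling layer typically follows
```

In the full generator, several such blocks are stacked and interleaved with upsampling layers, as described above.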
the video generator and each discriminator are trained in turn by the following two stages:
In the first stage, in order to avoid overfitting of the deep neural network and to generate higher-quality synthesized video frames, a multi-scale frame discriminator D' comprising three frame discriminators D'_1, D'_2, and D'_3 is trained adversarially against the video generator. The synthesized and real video frames are down-sampled by sampling layers to resolutions of 64 × 64, 128 × 128, and 256 × 256, giving three groups of synthesized and real video frame sequences at different resolutions, which serve as the inputs of the frame discriminators D'_1, D'_2, and D'_3 respectively. The loss of the first stage of adversarial training is calculated as shown in formula (2):
$$\min_{G}\max_{D'_1,D'_2,D'_3}\sum_{k=1}^{3}\Big[L_{GAN}(G,D'_k)+\lambda_{FM}\,L_{FM}(G,D'_k)\Big]\tag{2}$$
In formula (2), G denotes the video generator, k is the index of the discriminators (k = 1, 2, 3), D'_1, D'_2, D'_3 denote the three frame discriminators, L_GAN(G, D'_k) denotes the adversarial loss of the k-th frame discriminator, L_FM(G, D'_k) denotes the feature matching loss of the k-th frame discriminator, and λ_FM is the hyper-parameter controlling L_FM(G, D'_k);
the adversarial loss calculation of the frame discriminator is shown in equation (3):
$$L_{GAN}(G,D'_k)=\mathbb{E}_{x\sim P_d}\big[\log D'_k(x)\big]+\mathbb{E}_{z\sim P_z}\big[\log\big(1-D'_k(G(z))\big)\big]\tag{3}$$
In formula (3), x denotes a real video frame, P_d denotes the real video frame distribution, x ~ P_d indicates that the real video frames x are sampled from P_d, z denotes noise, P_z denotes the synthesized video frame distribution, and z ~ P_z indicates that the noise z is sampled from P_z; E_{x~P_d}[log D'_k(x)] denotes the expectation that the k-th frame discriminator judges a real video frame as real data, E_{z~P_z}[log(1 − D'_k(G(z)))] denotes the expectation that the k-th frame discriminator judges a synthesized video frame as fake data, D'_k(x) denotes the probability that the k-th frame discriminator judges the real video frame as real data, and D'_k(G(z)) denotes the probability that the k-th frame discriminator judges the synthesized video frame as fake data;
the feature matching loss calculation of the frame discriminator is shown in equation (4):
$$L_{FM}(G,D'_k)=\mathbb{E}_{(x,z)}\sum_{i=1}^{T}\frac{1}{N_i}\Big\lVert (D'_k)_i(x)-(D'_k)_i(G(z))\Big\rVert_1\tag{4}$$
In formula (4), E_{(x,z)} denotes the mathematical expectation of the matching loss between real video frames and noise features, T denotes the total number of layers of each frame discriminator network, i is the layer index, N_i denotes the number of nodes in the i-th layer, (D'_k)_i(x) denotes the real video frame features extracted by the k-th frame discriminator at the i-th layer, and (D'_k)_i(G(z)) denotes the synthesized video frame features extracted by the k-th frame discriminator at the i-th layer;
In this stage, the synthesized and real video frames are down-sampled into three groups of video frame sequences at different resolutions, the three sequences are input into the three frame discriminators respectively, and the synthesized and real video frames are matched progressively from coarse to fine, so the detail difference between them becomes smaller and the quality of the synthesized frames becomes higher. At the same time, the adversarial loss and the feature matching loss are introduced to train the video generator, which makes the training of the adversarial network more stable (a sketch of this multi-scale first-stage loss is given below);
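A minimal sketch of this first-stage objective follows; the discriminator architecture, the use of a non-saturating BCE form of the GAN term, and the value of λ_FM are illustrative assumptions rather than the patent's exact choices.

```python
# Minimal sketch of the first-stage losses (formulas (2)-(4)): three frame
# discriminators look at 256x256, 128x128 and 64x64 versions of the frames, and the
# generator is trained with a GAN loss plus a feature matching loss at each scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv2d(128, 1, 4, padding=1)),   # patch-level real/fake score map
        ])

    def forward(self, x):
        feats = []
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return feats                     # intermediate features + final score map

def first_stage_generator_loss(discriminators, real, fake, lambda_fm=10.0):
    """Sum of adversarial + feature-matching terms over the three scales (k = 0, 1, 2)."""
    total = 0.0
    for k, d in enumerate(discriminators):
        scale = 1 / (2 ** k)             # full, half, quarter resolution
        r = F.interpolate(real, scale_factor=scale) if k else real
        f = F.interpolate(fake, scale_factor=scale) if k else fake
        feats_r, feats_f = d(r), d(f)
        # non-saturating GAN term: the generator wants fakes judged as real
        adv = F.binary_cross_entropy_with_logits(feats_f[-1], torch.ones_like(feats_f[-1]))
        # feature matching: L1 between intermediate discriminator features
        fm = sum(F.l1_loss(a, b.detach()) for a, b in zip(feats_f[:-1], feats_r[:-1]))
        total = total + adv + lambda_fm * fm
    return total
```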
When the loss value stabilizes, the second stage begins: a multi-scale time discriminator D'' is trained adversarially against the video generator G. The multi-scale time discriminator likewise comprises three time discriminators D''_1, D''_2, and D''_3; the synthesized and real video frames are down-sampled to resolutions of 64 × 64, 128 × 128, and 256 × 256 and used as the inputs of the time discriminators D''_1, D''_2, and D''_3 respectively, and adversarial training continues until the loss stabilizes, giving the final video generator;
the second stage loss is shown in equation (5):
$$\min_{G}\bigg(\Big[\max_{D''_1,D''_2,D''_3}\sum_{k=1}^{3}\sum_{t}L_t(G,D''_k)\Big]+\lambda_{RL}L_{RL}+\lambda_{SL}L_{SL}+\lambda_{BL}L_{BL}\bigg)\tag{5}$$
In formula (5), t denotes the total time length of the speech features, L_t(G, D''_k) denotes the temporal adversarial loss of the video generator and the k-th time discriminator over a time length t, L_RL denotes the L1 reconstruction loss, L_SL denotes the mouth shape synchronization loss, L_BL denotes the blink loss, and λ_RL, λ_SL, λ_BL are hyper-parameters controlling the importance of L_RL, L_SL, and L_BL respectively;
L_t(G, D''_k) is calculated as shown in equation (6):
$$L_t(G,D''_k)=\frac{1}{J}\sum_{j\in[t-L,\,t]}\Big[\log D''_k(x_j)+\log\big(1-D''_k\big(G(z_j)\big)\big)\Big]\tag{6}$$
In formula (6), L denotes the length of the time interval over which the adversarial loss is computed, J denotes the total number of video frames within the time interval [t − L, t], j is the index of the video frames within [t − L, t], x_j denotes the j-th real video frame, G(z_j) denotes the j-th synthesized video frame, D''_k(x_j) denotes the probability that the k-th time discriminator judges x_j as real data, and D''_k(G(z_j)) denotes the probability that the k-th time discriminator judges G(z_j) as fake data (a sketch of this temporal loss is given below);
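A minimal sketch of this temporal adversarial term follows; treating the time discriminator as a callable that returns per-frame probabilities for a window, and averaging over the J frames, are assumptions for illustration.

```python
# Minimal sketch of the temporal adversarial loss of formula (6): a time discriminator
# scores a window of J consecutive frames ending at time t, and the real/fake
# log-likelihoods are averaged over the window.
import torch

def temporal_adversarial_loss(time_disc, real_window, fake_window, eps=1e-8):
    """
    real_window, fake_window: tensors of shape (J, C, H, W) covering [t - L, t].
    time_disc maps a window to per-frame probabilities of being real, shape (J,).
    """
    p_real = time_disc(real_window).clamp(eps, 1 - eps)
    p_fake = time_disc(fake_window).clamp(eps, 1 - eps)
    J = real_window.shape[0]
    # discriminator view of formula (6): real frames -> real, synthesized -> fake
    return (torch.log(p_real) + torch.log(1 - p_fake)).sum() / J
```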
The L1 reconstruction loss L_RL is calculated as shown in equation (7):
$$L_{RL}=\frac{1}{N}\sum_{n=1}^{N}\big\lVert G(z_n)-x_n\big\rVert_1\tag{7}$$
In formula (7), N denotes the total number of batches, n is the batch index, G(z_n) denotes the n-th batch of synthesized video frames, and x_n denotes the n-th batch of real video frames;
The mouth shape synchronization loss L_SL of the mouth shape synchronization discriminator is calculated as shown in equation (8):
$$L_{SL}=\frac{1}{N}\sum_{n=1}^{N}-\log\big(P_{sync}^{\,n}\big)\tag{8}$$

In formula (8), P_sync^n denotes the probability that the input speech features are synchronized with the video frames in the n-th batch;
The blink loss is calculated from facial keypoint detection and the resulting eye aspect ratio; training the video generator with the blink loss lets it learn the blinking action and produce a more realistic portrait. The blink loss is calculated according to formulas (9) and (10):
$$m=\frac{\lVert p_2-p_6\rVert+\lVert p_3-p_5\rVert}{2\,\lVert p_1-p_4\rVert}\tag{9}$$
$$L_{BL}=\lVert m_r-m_g\rVert\tag{10}$$
In formulas (9) and (10), m denotes the eye aspect ratio, p_a denotes the coordinates of the a-th eye keypoint (a = 1, 2, …, 6; the eye keypoints are shown in Fig. 2), m_r denotes the eye aspect ratio in the real video frame, and m_g denotes the eye aspect ratio in the synthesized video frame (a sketch of this blink loss is given below);
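A minimal sketch of the eye aspect ratio and blink loss follows; the pairing of the six keypoints follows the common eye-aspect-ratio convention and is an assumption about the numbering in Fig. 2.

```python
# Minimal sketch of formulas (9) and (10): the eye aspect ratio m is computed from six
# eye keypoints p1..p6, and the blink loss is the distance between the ratio measured
# in the real frame and in the synthesized frame.
import numpy as np

def eye_aspect_ratio(p):
    """p: array of shape (6, 2) with eye keypoints p1..p6 in image coordinates."""
    vertical = np.linalg.norm(p[1] - p[5]) + np.linalg.norm(p[2] - p[4])   # ||p2-p6|| + ||p3-p5||
    horizontal = np.linalg.norm(p[0] - p[3])                               # ||p1-p4||
    return vertical / (2.0 * horizontal)

def blink_loss(real_eye_pts, fake_eye_pts):
    m_r = eye_aspect_ratio(real_eye_pts)   # aspect ratio in the real frame
    m_g = eye_aspect_ratio(fake_eye_pts)   # aspect ratio in the synthesized frame
    return abs(m_r - m_g)                  # formula (10)
```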
Adversarial training with the temporal adversarial loss, the L1 reconstruction loss, the mouth shape synchronization loss, and the blink loss lets the video generator reduce the loss of the synthesized video frames through continuous iterative optimization. Using the multi-scale time discriminator to match mouth shape, eye movement, and speech encourages the video generator to produce synthesized frame sequences with natural transitions, ensures smooth, jump-free transitions between consecutive synthesized frames and good matching between the speaker's mouth shape and speech, and improves the quality of the synthesized video frames;
Step 4, calling a speech synthesis interface to synthesize the speaking content to be synthesized (text) into a corresponding speech file, inputting the speaker's video frames and the speech file into the video generator, and synthesizing the AI composite anchor video.
For any speaker and any text, the invention can generate a video of the target character delivering the text. After the video is produced, the Chroma-Key (chroma keying) function of the FFmpeg tool can also be used to replace the video background for the anchor and to add news subtitles and titles to the video (a sketch of such a chroma-key call is given below).
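A sketch of such a chroma-key call through FFmpeg is given below; the file names, the green key color, and the similarity/blend thresholds are illustrative assumptions.

```python
# Minimal sketch of replacing the anchor's background with FFmpeg's chromakey filter
# and overlaying the keyed anchor on a new background image. Assumes the anchor was
# shot against a green screen and the background image matches the video resolution.
import subprocess

def replace_background(anchor_video, background_image, out_path):
    filter_graph = (
        "[0:v]chromakey=0x00FF00:0.15:0.05[fg];"   # key out the green background
        "[1:v][fg]overlay=shortest=1[out]"         # put the keyed anchor on the new background
    )
    subprocess.run([
        "ffmpeg", "-y",
        "-i", anchor_video,
        "-loop", "1", "-i", background_image,      # loop the still image for the whole video
        "-filter_complex", filter_graph,
        "-map", "[out]", "-map", "0:a?",           # keep the original speech track if present
        out_path,
    ], check=True)
```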
Based on deep neural networks, the invention provides a network structure consisting of a multi-scale frame discriminator, a multi-scale time discriminator, a mouth shape synchronization discriminator, and a video generator. The multi-scale frame discriminator can distinguish fine detail differences between real and synthesized video frames, which helps the video generator produce high-quality synthesized frames; the multi-scale time discriminator ensures smooth transitions between synthesized frames, making the synthesized video more natural; and the mouth shape synchronization discriminator can accurately detect synchronization errors between mouth shape and speech, ensuring good synchronization of mouth shape and speech.
The technical architecture of the University of Washington paper "Synthesizing Obama: Learning Lip Sync from Audio" is shown in Fig. 3. That approach learns a sequence mapping from audio to video and attends only to the region around the synthesized mouth; the eyes, head, upper body, background, and so on are kept entirely from the source footage. Audio features are extracted as the input of an RNN, which outputs a sparse mouth shape for each output video frame; the texture of the mouth and lower face is then synthesized for each mouth shape and composited back into the original video as the output. This network can only be trained on the corpus of a specific person, which means the cost of use is very high: a model must be retrained whenever the speaker changes, and the composited portrait looks unnatural.
An embodiment of the invention further provides an electronic device comprising a processor, a communication interface, a memory, and a communication bus; the processor, the communication interface, and the memory communicate with one another via the communication bus. The communication bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. The communication interface is used for communication between the electronic device and other devices; the memory is used to store the computer program that implements video frame generation, and the processor executes the program stored in the memory to realize AI anchor video synthesis.
The memory may include Random Access Memory (RAM) and/or cache memory, and non-volatile memory such as at least one disk memory; it may also be at least one storage device located remotely from the processor. The memory may further include a program/utility having a set (at least one) of program modules including, but not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The electronic device may also communicate with one or more external devices (e.g., a keyboard, a pointing device, a Bluetooth device), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., a router or modem) that enables the electronic device to communicate with one or more other computing devices; such communication may take place through input/output (I/O) interfaces. The electronic device may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) through a network adapter, which may communicate with the other modules of the electronic device through a bus. It should be appreciated that other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
From the above description, those skilled in the art will readily understand that the exemplary embodiments described here can be implemented by software, or by software combined with the necessary hardware. The technical solution according to the embodiments of the invention can therefore be embodied as a software product, which can be stored in a non-volatile storage medium (a CD-ROM, USB flash drive, removable hard disk, etc.) or on a network and includes several instructions that cause a computing device (a personal computer, server, network device, etc.) to execute the above method according to the embodiments of the invention.
A program product implementing the above method may use a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited to this; here, a readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium, such as, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied in it, for example in baseband or as part of a carrier wave; such a propagated signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of these. A readable signal medium may also be any readable medium other than a readable storage medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented languages such as Java and C++ as well as conventional procedural languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's device and partly on a remote computing device, or entirely on a remote computing device or server. In the latter case, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (7)

  1. An AI composite anchor generation method, comprising the steps of:
    collecting data, preprocessing the data to obtain voice data and video frame data;
    training a mouth shape synchronization discriminator, respectively inputting voice data and video frame data into the mouth shape synchronization discriminator, and judging the synchronization probability of the voice data and the video frame data;
    the method comprises the steps of training a video generator, a multi-scale frame discriminator and a multi-scale time discriminator, wherein the video generator is used for generating a synthesized video frame according to MFCC (Mel frequency cepstrum coefficient) of voice data and video frame data, the multi-scale frame discriminator is used for judging the detail difference between the synthesized video frame and a real video frame, and the multi-scale time discriminator is used for judging the smooth transition of the synthesized video frame;
    the multi-scale frame arbiter comprises three frame arbiters, the multi-scale time arbiter comprises three time arbiters, and the process of training the video generator is as follows:
    the first stage, construct the video generator, input MFCC characteristic and speaking video of the voice file into the video generator and get the synthetic video frame, carry on the downsampling to real video frame and synthetic video frame separately, get three groups of real video frame sequences and synthetic video frame sequences of different resolutions, input three groups of video frame sequences into three frame discriminators separately, carry on the confrontation training to multi-scale frame discriminator and video generator;
    in the second stage, a trained video generator is used for generating a synthetic video frame, the synthetic video frame and the real video frame are respectively subjected to down-sampling to obtain three groups of real video frame sequences and synthetic video frame sequences with different resolutions, the three groups of video frame sequences are respectively input into three time discriminators, and the multi-scale time discriminators and the video generator are subjected to confrontation training to obtain a final video generator;
    and inputting the voice file and the speaking video frame of the arbitrary anchor into a video generator to obtain an AI synthesized anchor video.
  2. The AI composite anchor generation method of claim 1, wherein the preprocessing includes decomposing a video of any person speaking into video frames and a voice file, performing face detection on the video frames, and cropping faces in the video frames to obtain video frame data.
  3. The AI synthetic anchor generation method of claim 1, wherein the lip sync discriminator includes a face encoder and an audio encoder, and calculates a dot product between a ReLU-activated randomly sampled video frame vector v and a speech vector s using cosine similarity and a binary cross entropy loss function, with video frame data and synchronized speech features as inputs, to obtain a probability P_sync that the input speech features are synchronized with the video frame, and stopping training when the probability is reduced to 0.2 to obtain the mouth shape synchronization discriminator.
  4. The AI composite anchor generation method of claim 1, wherein the loss function of the first stage is calculated as shown in formula (1):

    $$\min_{G}\max_{D'_1,D'_2,D'_3}\sum_{k=1}^{3}\Big[L_{GAN}(G,D'_k)+\lambda_{FM}\,L_{FM}(G,D'_k)\Big]\tag{1}$$

    in formula (1), G denotes the video generator, k is the index of the discriminators (k = 1, 2, 3), D'_1, D'_2, D'_3 denote the three frame discriminators, L_GAN(G, D'_k) denotes the adversarial loss of the k-th frame discriminator, L_FM(G, D'_k) denotes the feature matching loss of the k-th frame discriminator, and λ_FM is the hyper-parameter controlling the importance of the feature matching loss L_FM(G, D'_k).
  5. The AI composite anchor generation method of claim 1, wherein the loss function of the second-stage adversarial training is calculated as shown in formula (2):

    $$\min_{G}\bigg(\Big[\max_{D''_1,D''_2,D''_3}\sum_{k=1}^{3}\sum_{t}L_t(G,D''_k)\Big]+\lambda_{RL}L_{RL}+\lambda_{SL}L_{SL}+\lambda_{BL}L_{BL}\bigg)\tag{2}$$

    in formula (2), G denotes the video generator, k is the index of the discriminators (k = 1, 2, 3), D''_1, D''_2, D''_3 denote the three time discriminators, t denotes the total time length of the speech features, L_t(G, D''_k) denotes the temporal adversarial loss of the video generator and the k-th time discriminator over a time length t, and λ_RL, λ_SL, λ_BL are hyper-parameters controlling the importance of the L1 reconstruction loss L_RL, the mouth shape synchronization loss L_SL, and the blink loss L_BL, respectively.
  6. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
    a memory for storing a computer program;
    a processor for implementing the method steps of any one of claims 1 to 5 when executing a program stored in the memory.
  7. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-5.
CN202110436997.5A 2021-04-22 2021-04-22 AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium Active CN113178206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110436997.5A CN113178206B (en) 2021-04-22 2021-04-22 AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110436997.5A CN113178206B (en) 2021-04-22 2021-04-22 AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113178206A CN113178206A (en) 2021-07-27
CN113178206B true CN113178206B (en) 2022-05-31

Family

ID=76924741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110436997.5A Active CN113178206B (en) 2021-04-22 2021-04-22 AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113178206B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403559A (en) * 2023-03-30 2023-07-07 东南大学 Implementation method of text-driven video generation system

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7209882B1 (en) * 2002-05-10 2007-04-24 At&T Corp. System and method for triphone-based unit selection for visual speech synthesis
US9094576B1 (en) * 2013-03-12 2015-07-28 Amazon Technologies, Inc. Rendered audiovisual communication
CN109637518A (en) * 2018-11-07 2019-04-16 北京搜狗科技发展有限公司 Virtual newscaster's implementation method and device
CN109697416A (en) * 2018-12-14 2019-04-30 腾讯科技(深圳)有限公司 A kind of video data handling procedure and relevant apparatus
CN110493613A (en) * 2019-08-16 2019-11-22 江苏遨信科技有限公司 A kind of synthetic method and system of video audio lip sync
US10521946B1 (en) * 2017-11-21 2019-12-31 Amazon Technologies, Inc. Processing speech to drive animations on avatars
CN110751708A (en) * 2019-10-21 2020-02-04 北京中科深智科技有限公司 Method and system for driving face animation in real time through voice
CN110874557A (en) * 2018-09-03 2020-03-10 阿里巴巴集团控股有限公司 Video generation method and device for voice-driven virtual human face
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 Speaking video generation method and system
CN111696182A (en) * 2020-05-06 2020-09-22 广东康云科技有限公司 Virtual anchor generation system, method and storage medium
CN111783566A (en) * 2020-06-15 2020-10-16 神思电子技术股份有限公司 Video synthesis method based on lip language synchronization and expression adaptation effect enhancement
CN112233210A (en) * 2020-09-14 2021-01-15 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for generating virtual character video
CN112652041A (en) * 2020-12-18 2021-04-13 北京大米科技有限公司 Virtual image generation method and device, storage medium and electronic equipment
CN112669417A (en) * 2020-12-18 2021-04-16 北京大米科技有限公司 Virtual image generation method and device, storage medium and electronic equipment

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7209882B1 (en) * 2002-05-10 2007-04-24 At&T Corp. System and method for triphone-based unit selection for visual speech synthesis
US9094576B1 (en) * 2013-03-12 2015-07-28 Amazon Technologies, Inc. Rendered audiovisual communication
US10521946B1 (en) * 2017-11-21 2019-12-31 Amazon Technologies, Inc. Processing speech to drive animations on avatars
CN110874557A (en) * 2018-09-03 2020-03-10 阿里巴巴集团控股有限公司 Video generation method and device for voice-driven virtual human face
CN109637518A (en) * 2018-11-07 2019-04-16 北京搜狗科技发展有限公司 Virtual newscaster's implementation method and device
CN109697416A (en) * 2018-12-14 2019-04-30 腾讯科技(深圳)有限公司 A kind of video data handling procedure and relevant apparatus
CN110493613A (en) * 2019-08-16 2019-11-22 江苏遨信科技有限公司 A kind of synthetic method and system of video audio lip sync
CN110751708A (en) * 2019-10-21 2020-02-04 北京中科深智科技有限公司 Method and system for driving face animation in real time through voice
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 Speaking video generation method and system
CN111696182A (en) * 2020-05-06 2020-09-22 广东康云科技有限公司 Virtual anchor generation system, method and storage medium
CN111783566A (en) * 2020-06-15 2020-10-16 神思电子技术股份有限公司 Video synthesis method based on lip language synchronization and expression adaptation effect enhancement
CN112233210A (en) * 2020-09-14 2021-01-15 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for generating virtual character video
CN112652041A (en) * 2020-12-18 2021-04-13 北京大米科技有限公司 Virtual image generation method and device, storage medium and electronic equipment
CN112669417A (en) * 2020-12-18 2021-04-16 北京大米科技有限公司 Virtual image generation method and device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于深度学习的语音分离研究";张晖;《中国优秀博士学位论文全文数据库 信息科技辑》;20181231;全文 *

Also Published As

Publication number Publication date
CN113178206A (en) 2021-07-27

Similar Documents

Publication Publication Date Title
Pepino et al. Emotion recognition from speech using wav2vec 2.0 embeddings
US11836593B1 (en) Devices, systems, and methods for learning and using artificially intelligent interactive memories
CN108717856B (en) Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
CN109785824B (en) Training method and device of voice translation model
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
Xie et al. Attention-based dense LSTM for speech emotion recognition
CN111914076B (en) User image construction method, system, terminal and storage medium based on man-machine conversation
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN113782048B (en) Multi-mode voice separation method, training method and related device
Deng et al. Foundations and trends in signal processing: Deep learning–methods and applications
CN112837669B (en) Speech synthesis method, device and server
CN114339450B (en) Video comment generation method, system, device and storage medium
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
Hu et al. Unified discrete diffusion for simultaneous vision-language generation
CN111274412A (en) Information extraction method, information extraction model training device and storage medium
CN114329041A (en) Multimedia data processing method and device and readable storage medium
Azuh et al. Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio.
CN113205793A (en) Audio generation method and device, storage medium and electronic equipment
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
CN113178206B (en) AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium
CN117634459B (en) Target content generation and model training method, device, system, equipment and medium
CN114999443A (en) Voice generation method and device, storage medium and electronic equipment
Atkar et al. Speech emotion recognition using dialogue emotion decoder and CNN Classifier
CN116665642A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN114360491B (en) Speech synthesis method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant