CN115580743A - Method and system for driving human mouth shape in video - Google Patents

Method and system for driving human mouth shape in video

Info

Publication number
CN115580743A
Authority
CN
China
Prior art keywords
video
mouth shape
audio
lip
face
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211568819.9A
Other languages
Chinese (zh)
Inventor
李志强
陈尧森
杨瀚
朱婷婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Sobey Digital Technology Co Ltd
Original Assignee
Chengdu Sobey Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Sobey Digital Technology Co Ltd filed Critical Chengdu Sobey Digital Technology Co Ltd
Priority to CN202211568819.9A priority Critical patent/CN115580743A/en
Publication of CN115580743A publication Critical patent/CN115580743A/en
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/43072Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/441Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N21/4415Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

The invention relates to the technical field of voice-driven animation, and discloses a method and a system for driving the mouth shape of a person in a video. The invention addresses problems of the prior art such as the difficulty of driving the mouth shape accurately and low editing efficiency.

Description

Method and system for driving human mouth shape in video
Technical Field
The invention relates to the technical field of voice-driven animation, and in particular to a method and a system for driving the mouth shape of a person in a video.
Background
With the rise of the metaverse and digital-twin concepts and the popularization of digital video and converged media, research on automated broadcasting by a general-purpose anchor has become increasingly important in news scenarios. In this field, mouth-shape driving technology for real presenters in news video is essential.
Mouth-shape driving is mainly used to drive the mouth of a real presenter in a video so that it accurately follows a given piece of speech, achieving the effect of the presenter broadcasting arbitrary news; it can also be used to correct segments in which a real presenter misspoke. Research on driving the mouth shape accurately is therefore of significant value to fields such as television news production and the metaverse.
However, the prior art suffers from problems such as difficulty in driving the mouth shape accurately and low editing efficiency.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a method and a system for driving the mouth shape of a person in a video, solving problems of the prior art such as inaccurate mouth-shape driving and low editing efficiency.
The technical solution adopted by the invention to solve these problems is as follows:
A method for driving the mouth shape of a person in a video: given a segment of video containing a speaking person and a segment of speech, the mouth shape of the person in the video is driven by learning from both the speech and the video.
As a preferred technical solution, the method comprises the following steps:
S1, collection: collect videos, and perform data cleaning and encoding on them;
S2, lip-sound synchronization discrimination: construct and train a lip-sound synchronization discrimination model for judging whether the lips and the audio are synchronized;
S3, mouth shape generation: construct and train a mouth shape generation network model that generates the corresponding mouth shape from the speech;
S4, mouth shape driving: construct and apply a mouth shape driving model to perform inference and prediction on any person's video.
As a preferred technical solution, step S2 includes the following steps:
S21, obtaining the input of the lip-sound synchronization discrimination model: obtain a continuous video frame segment T_f with a set number of frames as the video input data of the lip-sound synchronization discrimination model, and obtain the audio data T_a corresponding to the time window of the same set number of frames as the audio input data of the lip-sound synchronization discrimination model;
S22, constructing the lip-sound synchronization discrimination model: build the lip-sound synchronization discrimination model on the basis of the SyncNet network model and use it to compute the probability that the video and audio are synchronized; the lip-sound synchronization discrimination model comprises an audio encoder and a face encoder, wherein the audio encoder encodes the audio data and extracts the corresponding Mel-frequency cepstral coefficient (MFCC) features, the face encoder encodes the face image into face image features, and the lip-sound synchronization discrimination model computes its loss as the binary cross-entropy of the cosine similarity; the probability of video-audio synchronization is calculated as follows:
$$P_{sync} = \frac{v \cdot s}{\max\left(\lVert v \rVert_2 \cdot \lVert s \rVert_2,\ \varepsilon\right)}$$
wherein P_{sync} represents the probability that the video and audio are synchronized, v represents the face image features, s represents the audio Mel-frequency cepstral coefficient features, and ε represents an arbitrarily small number;
s23, the lip sound synchronization judging model generates data pairs through randomly sampling a video and audio synchronization window and a video and audio asynchronization window so as to train; then, for each data pair, two encoders based on convolutional neural networks are used for respectively carrying out dimension reduction and feature extraction on input voice and video frames, and features of the two encoders are mapped into the same dimension space; finally, measuring lip sound synchronism by using a cosine similarity loss function; the video and audio synchronous window refers to correct audio in the original video corresponding to the continuously set frame number video, the video and audio asynchronous window refers to any other audio segment in the video corresponding to the continuously set frame number video, and the cosine similarity loss function calculation formula is as follows:
$$\cos\theta = \frac{T_f \cdot T_a}{\lVert T_f \rVert \,\lVert T_a \rVert}$$
wherein cos θ represents the lip-sound synchronization similarity, θ represents the angle between the image feature vector and the speech feature vector, T_f represents a video feature of the set length, and T_a represents the audio feature corresponding to that video feature.
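The synchronization probability and the cosine-similarity loss above can be illustrated with a minimal PyTorch sketch. It is not the patent's implementation: the 512-dimensional embeddings stand in for the outputs of the SyncNet-style face and audio encoders (assumed non-negative, i.e. post-ReLU), and clamping the cosine similarity before the binary cross-entropy is an assumption made so the loss is well defined.

```python
import torch
import torch.nn.functional as F

def sync_probability(v: torch.Tensor, s: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """P_sync = (v . s) / max(||v||_2 * ||s||_2, eps), per step S22.
    v: face image features, s: audio MFCC features (post-ReLU encoder outputs)."""
    dot = (v * s).sum(dim=-1)
    denom = torch.clamp(v.norm(dim=-1) * s.norm(dim=-1), min=eps)
    return dot / denom

def lip_sync_loss(v: torch.Tensor, s: torch.Tensor, is_synced: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy of the cosine similarity (step S23):
    is_synced = 1 for pairs from a synchronized window, 0 otherwise."""
    cos_sim = F.cosine_similarity(v, s, dim=-1)   # cos(theta) between the two feature vectors
    prob = cos_sim.clamp(0.0, 1.0)                # treat the similarity as a probability
    return F.binary_cross_entropy(prob, is_synced.float())

# Hypothetical usage: batches of 512-d embeddings from the two encoders.
v = torch.randn(4, 512).relu()
s = torch.randn(4, 512).relu()
labels = torch.tensor([1, 0, 1, 0])
print(sync_probability(v, s), lip_sync_loss(v, s, labels).item())
```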
As a preferred technical solution, step S3 includes the following steps:
S31, constructing the generator in the mouth shape generation network model, the generator being used to generate the corresponding mouth shape from the speech features; the generator is based on an encoder-decoder network structure and comprises a face encoder, an audio encoder and a face decoder, wherein the face encoder receives video image frames and generates face intermediate features, and the audio encoder receives the audio signal and generates audio intermediate features; the face intermediate features and the audio intermediate features are then fused, the fused features are fed into the face decoder for decoding, and the generator finally outputs lip image frames synchronized with the audio;
S32, constructing the discriminator in the mouth shape generation network model, the discriminator being used to judge whether the mouth shape generated by the generator is correct and synchronized with the speech; the discriminator comprises a pre-trained lip-audio synchronization discriminator and a lip visual-quality discriminator; the lip-synchronized image frames generated by the generator are then fed to the discriminators: the original video frames and the generated image frames are input into the visual-quality discriminator, which makes a binary classification indicating whether an image is a real image or a generated one, while the generated image frames and the audio are input into the pre-trained lip synchronization discriminator, which judges whether the generated lips are synchronized with the corresponding audio.
As a preferred technical solution, step S4 includes the following steps:
S41, preprocessing the input data for the mouth shape generation network model;
S42, generating lip-synchronized video frames with the mouth shape generation network model;
and S43, combining the generated video frames into a video and merging it with the input speech to form the final output video.
As a preferred technical solution, the preprocessing operations in step S41 include face extraction, video framing, and speech feature extraction.
As a preferred technical solution, step S1 includes the following steps:
step S11, collecting the lip-reading dataset LRS2 and extracting video data of a set duration in which a presenter broadcasts Chinese speech;
and step S12, screening, cleaning and encoding the collected video data.
As a preferred technical solution, in step S11, the open-source algorithm S3FD is used in extracting the video data of set duration in which the presenter broadcasts Chinese speech.
As a preferred technical solution, extracting the video data of set duration in which the presenter broadcasts Chinese speech comprises the following steps (a schematic sketch follows these steps):
step S111, face recognition: recognize the face in the image through learning, using the deep-neural-network object detection method YOLO;
step S112, acquiring face detection boxes: owing to the output characteristics of the YOLO object detection algorithm, several face detection boxes with different sizes and confidences are generated when the face in the image is recognized;
step S113, removing redundant boxes with NMS: remove the redundant face detection boxes, keeping only the single correct one;
step S114, acquiring the face region picture: obtain the position of the face within the whole image from the unique detection box corresponding to the face, and crop the face region out of the original image.
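A minimal sketch of this face-region extraction flow, assuming hypothetical `detect_faces` and `nms` callables that wrap a YOLO-style detector and non-maximum suppression:

```python
from typing import Callable, List, Tuple
import numpy as np

Box = Tuple[int, int, int, int, float]  # x1, y1, x2, y2, confidence

def extract_face_region(image: np.ndarray,
                        detect_faces: Callable[[np.ndarray], List[Box]],
                        nms: Callable[[List[Box], float], List[Box]],
                        iou_threshold: float = 0.7) -> np.ndarray:
    """Steps S111-S114: detect candidate face boxes, suppress redundant ones
    with NMS, keep the single remaining box, and crop the face region."""
    candidates = detect_faces(image)        # S111/S112: boxes of varied sizes and confidences
    kept = nms(candidates, iou_threshold)   # S113: remove redundant boxes
    assert len(kept) == 1, "each frame is assumed to contain exactly one face"
    x1, y1, x2, y2, _ = kept[0]             # S114: position of the face in the whole image
    return image[y1:y2, x1:x2]              # crop the face region from the original image
```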
A system for driving the mouth shape of a person in a video, used to implement the above method for driving the mouth shape of a person in a video, comprising the following modules connected in sequence:
a collection module: used for collecting videos and performing data cleaning and encoding on them;
a lip-sound synchronization discrimination module: used for constructing and training the lip-sound synchronization discrimination model and judging whether the lips and the audio are synchronized;
a mouth shape generation module: used for constructing and training the mouth shape generation network model and generating the corresponding mouth shape from the speech;
a mouth shape driving module: used for constructing and applying the mouth shape driving model to perform inference and prediction on any person's video.
Compared with the prior art, the invention has the following beneficial effects:
(1) by combining image processing algorithms from artificial intelligence, a generative adversarial network and speech processing, the method drives the mouth shape of a real person in a video with arbitrary speech, so that the person appears to broadcast it;
(2) the invention can effectively improve the editing efficiency of news video and save the labor and time costs of news presenters.
Drawings
Fig. 1 is an overall flowchart of the method for driving the mouth shape of a person in a video according to the present invention;
FIG. 2 is a framework diagram of the mouth shape driving model;
FIG. 3 is a flow chart of face detection;
FIG. 4 is a framework diagram of the generator model.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.
Example 1
As shown in fig. 1 to 4, the technical problem to be solved by the present invention is: given a video of a television presenter broadcasting a news program and an arbitrary segment of speech data, accurately drive the mouth shape of the person in the video by learning from the speech and the video, so as to achieve the effect of the television presenter broadcasting the given speech data.
The invention provides a real-person mouth shape driving method for news scenes, comprising the following steps:
Step S1: collect relevant videos of presenters broadcasting in news scenes, and perform data cleaning and encoding.
Step S2: train the lip-sound synchronization discrimination model for judging whether the lips and the audio are synchronized.
Step S3: train the mouth shape generation network model, which generates the corresponding mouth shape from the speech.
Step S4: use the mouth shape driving model to perform inference and prediction on any person's video.
Further, step S1 comprises the following sub-steps:
step S11: the Lip language identification data set LRS2 (The Oxford-BBC Lip Reading sequences 2) sourced by The british broadcaster tv was collected, and video data for about 30 hours was collected for all hosts to broadcast chinese speech.
Step S12: the collected video data is filtered, cleaned and encoded into a specific format.
Further, step S2 comprises the following sub-steps:
step S21: and (5) constructing input of a lip sound synchronization discrimination model.
Step S22: and constructing a lip sound synchronous discrimination model.
Step S23: and reading data for training.
Further, step S3 comprises the following sub-steps:
step S31: and constructing a generator in the mouth shape generation network model, and generating a corresponding mouth shape according to the voice characteristics.
Step S32: and constructing a discriminator in the mouth shape generation network model for judging whether the mouth shape generated by the generator is correct and synchronous with the voice.
Further, step S4 includes the following sub-steps:
step S41: and preprocessing output data. The method comprises the operations of face extraction, video framing, voice feature extraction and the like.
Step S42: and (4) reasoning by the network model to generate lip sound synchronous video frames. And generating a lip sound synchronous video frame by using the mouth shape generation network model.
Step S43: and outputting the final mouth shape driving video. And combining the generated video frames into a video, and combining the video with the input voice to form a final output video.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
the invention provides a voice-driven mouth shape solution aiming at the problem that a host can automatically broadcast a specific text in the field of television media manufacturing, particularly in the field of news broadcasting. By combining an image processing algorithm in artificial intelligence, a countermeasure generation network and voice related processing, the mouth shape of a real person in any voice-driven video is realized, and broadcasting is carried out. The invention can effectively improve the editing efficiency of the news video and save the labor and time cost of the news host.
Example 2
As shown in fig. 1 to fig. 4, as a further optimization of embodiment 1, this embodiment further includes the following technical features:
As shown in fig. 1, this embodiment provides a real-person mouth shape driving method for news scenes, comprising the following steps:
Step S1: collect relevant videos of presenters broadcasting in news scenes, and perform data cleaning and encoding.
Step S2: train the lip-sound synchronization discrimination model for judging whether the lips and the audio are synchronized.
Step S3: train the mouth shape generation network model, which generates the corresponding mouth shape from the speech.
Step S4: use the mouth shape driving model to perform inference and prediction on any person's video.
Further, step S1 comprises the following sub-steps:
step S11: and downloading an LRS2 (Lip Reading sequences 2) data set which is all foreign language video data and contains pictures of different characters speaking. Meanwhile, in order to ensure that the mouth shape driving model has better generalization on Chinese, about 30 hours of video data for broadcasting Chinese voice for a host are collected in each video platform, television station official channel and the like.
As shown in fig. 3, more specifically:
the method for extracting the video data with the set duration and the Chinese voice broadcasted for the host comprises the following steps:
Step S111, face recognition: recognize the face in the image through learning, using the deep-neural-network object detection method YOLO (You Only Look Once).
Step S112, acquiring face detection boxes: owing to the output characteristics of the YOLO object detection algorithm, several face detection boxes with different sizes and confidences are generated when the face in the image is recognized.
Step S113, removing redundant boxes with NMS: the YOLO object detection algorithm detects several rectangular face detection boxes in one image, and the main purpose of NMS (Non-Maximum Suppression) is to remove the redundant rectangles and keep only the one correct face rectangle. The idea is as follows: first select the detection box with the highest confidence (call it C); then compute the IoU (Intersection over Union) between C and each remaining box; every box whose IoU exceeds the set threshold (commonly 0.7 in object detection, and set to 0.7 in the invention) is suppressed by setting its score to 0. After one round, the procedure repeats on the remaining boxes: the highest-scoring box is selected and boxes whose IoU with it exceeds the threshold are suppressed, until only boxes with almost no overlap remain (a sketch of this procedure follows step S114). Since in the invention each image contains only one face, a single detection box is finally retained for each face image.
Step S114, acquiring the face region picture: the unique detection box corresponding to the face gives the position of the face within the whole image; using this position information, the face region is cropped out of the original image.
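The NMS procedure of step S113 can be written out roughly as below; this is a generic sketch of standard non-maximum suppression with the 0.7 IoU threshold mentioned above, not code taken from the patent.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.7) -> list:
    """Keep the highest-scoring box, suppress boxes whose IoU with it exceeds
    the threshold (by zeroing their score), then repeat on the remainder."""
    scores = scores.copy()
    keep = []
    while (scores > 0).any():
        best = int(np.argmax(scores))          # box with the highest confidence
        keep.append(best)
        for i in range(len(boxes)):
            if scores[i] > 0 and i != best and iou(boxes[best], boxes[i]) > iou_threshold:
                scores[i] = 0.0                # suppress the overlapping box
        scores[best] = 0.0                     # remove the kept box from further rounds
    return keep

# Hypothetical example: two overlapping candidates for one face plus a distant box.
boxes = np.array([[10, 10, 110, 110], [12, 8, 108, 112], [300, 40, 380, 130]], dtype=float)
scores = np.array([0.95, 0.80, 0.30])
print(nms(boxes, scores))  # [0, 2]: the redundant second box is suppressed
```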
Step S12: and screening and cleaning the video data, wherein only one character and the voice of the same person appear in the video as much as possible. Using an open source tool ffmpeg to split all videos into short videos of about 3s each segment, then performing frame extraction on all split videos, then extracting human faces from the images after frame extraction, and finally separating the whole segment of audio from the split videos. Face extraction is performed by using an open source algorithm S3FD (Single Shot Scale-innovative Face Detector), and an extraction flow chart is shown in fig. 3.
Further, step S2 comprises the following sub-steps:
step S21: and acquiring the input of the lip-voice synchronous discrimination model. A video frame fragment Tf of 5 consecutive frames is acquired as the video input data for the discriminator. Meanwhile, audio data Ta corresponding to 5 consecutive frame time windows is acquired as audio input data for the discriminator.
Step S22: and constructing a lip sound synchronous discrimination model. A lip sound synchronization judging model is built based on a Syncnet network model (a basic neural network model for detecting audio and lip synchronization in videos), and the Syncnet is a network for judging whether videos and audios are synchronous from end to end. The lip sound synchronization distinguishing model comprises an audio encoder and a face encoder, wherein the audio encoder is used for encoding audio frequency into MFCC characteristics and face image characteristics, and the model calculates loss by using cosine similarity of binary cross entropy. Wherein, the lip-voice synchronous discrimination model comprises an audio coder and a human face coder, the audio encoder encodes the audio data using the open source audio processing library Librosa (Librosa is a python toolkit used for audio, music analysis, processing), extracting the MFCC (Mel-Freguency ceptraI Coefficients) characteristics corresponding to the audio data; the face encoder encodes the face image into the face image characteristics by using a deep convolutional neural network, and the lip sound synchronous discrimination model calculates loss by using cosine similarity of binary cross entropy; the probability of video-audio synchronization is represented by calculating the dot product of the image (V) and the audio coding feature (S) after Relu, the formula:
$$P_{sync} = \frac{v \cdot s}{\max\left(\lVert v \rVert_2 \cdot \lVert s \rVert_2,\ \varepsilon\right)}$$
wherein P_{sync} represents the probability that the video and audio are synchronized, v represents the face image features, s represents the audio Mel-frequency cepstral coefficient features, and ε (epsilon) denotes an arbitrarily small number;
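For the audio branch, the MFCC features can be extracted with Librosa roughly as follows. The 25 fps frame rate, 16 kHz sample rate and 13 MFCC coefficients are assumptions for illustration; the patent does not specify these parameters.

```python
import librosa
import numpy as np

def mfcc_for_window(wav_path: str, start_frame: int, fps: int = 25,
                    window_frames: int = 5, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
    """Extract MFCC features for the audio corresponding to a window of
    `window_frames` consecutive video frames starting at `start_frame`."""
    offset = start_frame / fps          # start time of the window in seconds
    duration = window_frames / fps      # 5 frames at 25 fps -> 0.2 s of audio
    audio, _ = librosa.load(wav_path, sr=sr, offset=offset, duration=duration)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc                         # shape: (n_mfcc, time_steps)
```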
step S23: and reading data for training. The lip sound synchronization discrimination model generates data pairs by randomly sampling the video and audio synchronization window and the video and audio asynchronization window, so as to train. And (3) performing dimension reduction and feature extraction on each data pair by using two encoders based on a convolutional neural network to input voice and video frames respectively, mapping the features of the two encoders to the same dimensional space, and finally measuring the lip sound synchronism by using a cosine similarity loss function (cosine _ loss), wherein a cosine similarity calculation formula is as follows.
$$\cos\theta = \frac{T_f \cdot T_a}{\lVert T_f \rVert \,\lVert T_a \rVert}$$
wherein cos θ represents the lip-sound synchronization similarity, θ represents the angle between the image feature vector and the speech feature vector, T_f represents a video feature of the set length, and T_a represents the audio feature corresponding to that video feature. Preferably, this embodiment selects the set length as 5 consecutive frames of images.
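The sampling of synchronized and unsynchronized windows in step S23 can be pictured as below: a positive pair keeps the audio that truly belongs to the 5-frame window, while a negative pair substitutes any other audio window from the same video. The function names and the 50/50 batch split are illustrative assumptions.

```python
import random
from typing import List, Tuple

def sample_pair(num_frames: int, window: int = 5,
                synced: bool = True) -> Tuple[Tuple[int, int], Tuple[int, int], int]:
    """Return (video_window, audio_window, label) as frame-index ranges.
    label = 1 for a synchronized pair, 0 for an unsynchronized one."""
    v_start = random.randint(0, num_frames - window)
    video_window = (v_start, v_start + window)
    if synced:
        return video_window, video_window, 1          # correct audio for this window
    while True:
        a_start = random.randint(0, num_frames - window)
        if a_start != v_start:                        # any other audio segment in the video
            return video_window, (a_start, a_start + window), 0

def sample_batch(num_frames: int, batch_size: int = 8) -> List[Tuple[Tuple[int, int], Tuple[int, int], int]]:
    """Half synchronized, half unsynchronized pairs for one training batch."""
    return [sample_pair(num_frames, synced=(i % 2 == 0)) for i in range(batch_size)]
```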
Further, step S3 comprises the following sub-steps:
step S31: and constructing a generator in the mouth shape generation network model. The generator is based on an encoder-decoder network structure and comprises a face encoder, an audio encoder, a face decoder, a video image frame and an intermediate feature, wherein the face encoder in the generator receives the video image frame and generates the intermediate feature; an audio encoder in the generator receives the audio signal to generate an audio intermediate feature; and respectively carrying out feature fusion on the obtained face intermediate features and the audio intermediate features, sending the fused features into a face decoder for decoding, and finally outputting the lip-shaped image frame which is generated by the generator and is synchronous with the audio. The generator structure block diagram is shown in fig. 4.
Step S32: and constructing a discriminator in the mouth shape generation network model. The lip-shaped synchronous image frame generated by the generator is input into a discriminator, the discriminator comprises two parts, one is a pre-trained lip-shaped and audio-frequency synchronous discriminator which receives an audio signal and a generated lip-shaped synchronous image as input to discriminate whether the generated lip-shaped image is synchronous with audio frequency or not, the discriminator is pre-trained in the step S2, and the parameters of the discriminator need to be frozen during training of a mouth shape generation model without training, so that the capability of discriminating the lip-shaped and the audio frequency synchronously is enhanced; and the discriminator receives the lip image generated by the generator and the lip image which is synchronized with the audio to discriminate the lip image from true or false, so as to drive the lip image to generate better quality. The original video frame and the generated image frame are input into a visual quality discriminator, the discriminator judges by using two classifications, and the result of the two classifications shows whether the image is a real image or a generated picture, thereby improving the image quality. And inputting the generated image frame and the audio into a lip synchronization discriminator trained in advance, and judging whether the lip is generated accurately.
Further, step S4 comprises the following sub-steps:
step S41: and preprocessing output data. And (3) performing frame extraction on the real person video to be driven by the mouth shape, and performing face detection and face extraction on each frame of image. And separating the video and the audio of the whole video, extracting the audio part data and carrying out Mel cepstrum coefficient coding.
Step S42: and (4) reasoning by the network model to generate lip sound synchronous video frames. And generating a lip sound synchronous video frame by using the mouth shape generation network model.
Step S43: and outputting the final mouth shape driving video. And combining the generated video frames into a video, and combining the video with the input voice to form a final output video.
As described above, the present invention can be preferably implemented.
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
The foregoing is only a preferred embodiment of the present invention, and the present invention is not limited thereto in any way, and any simple modification, equivalent replacement and improvement made to the above embodiment within the spirit and principle of the present invention still fall within the protection scope of the present invention.

Claims (10)

1. A method for driving the mouth shape of a person in a video, characterized in that, for a segment of video containing a speaking person and a segment of speech, the mouth shape of the person in the video is driven by learning from the speech and from the video.
2. The method for driving the mouth shape of the person in the video according to claim 1, comprising the following steps:
s1, collecting: collecting videos, and cleaning and encoding the videos;
s2, lip synchronization judgment: constructing and training a lip sound synchronization discrimination model for judging whether lips and sound are synchronous or not;
s3, generating a mouth shape: constructing and training a mouth shape generation network model, and generating a corresponding mouth shape according to the voice;
s4, driving the mouth shape: and constructing and applying a mouth shape driving model to carry out reasoning and prediction on the video of any person.
3. The method for driving the mouth shape of the person in the video according to claim 2, wherein the step S2 comprises the steps of:
s21, obtaining the input of a lip sound synchronization judging model: obtaining a continuous video frame segment T with a set frame number f The video input data is used as the lip synchronization distinguishing model; and acquiring audio data T corresponding to a time window with the same set frame number a The voice frequency input data is used as the voice frequency input data of the lip sound synchronization judging model;
s22, constructing a lip synchronization discrimination model: building a lip sound synchronization discrimination model based on a Syncnet network model, and calculating the probability of video and audio synchronization by using the lip sound synchronization discrimination model; the lip synchronization judging model comprises an audio encoder and a face encoder, wherein the audio encoder is used for encoding audio data and extracting Mel cepstrum coefficient (MFCC) characteristics corresponding to the audio data, the face encoder is used for encoding a face image into face image characteristics, and the lip synchronization judging model calculates loss by using cosine similarity of binary cross entropy; the probability of video and audio synchronization is calculated as follows:
$$P_{sync} = \frac{v \cdot s}{\max\left(\lVert v \rVert_2 \cdot \lVert s \rVert_2,\ \varepsilon\right)}$$
wherein P_{sync} represents the probability that the video and audio are synchronized, v represents the face image features, s represents the audio Mel-frequency cepstral coefficient features, and ε represents an arbitrarily small number;
s23, the lip sound synchronization judging model generates data pairs through randomly sampling a video and audio synchronization window and a video and audio asynchronization window so as to train; then, for each data pair, two encoders based on convolutional neural networks are used for respectively carrying out dimension reduction and feature extraction on input voice and video frames, and features of the two encoders are mapped into the same dimension space; finally, measuring lip sound synchronism by using a cosine similarity loss function; the video and audio synchronous window refers to correct audio in the original video corresponding to the continuously set frame number video, the video and audio asynchronous window refers to any other audio segment in the video corresponding to the continuously set frame number video, and the cosine similarity loss function calculation formula is as follows:
$$\cos\theta = \frac{T_f \cdot T_a}{\lVert T_f \rVert \,\lVert T_a \rVert}$$
wherein cos θ represents the lip-sound synchronization similarity, θ represents the angle between the image feature vector and the speech feature vector, T_f represents a video feature of the set length, and T_a represents the audio feature corresponding to that video feature.
4. The method for driving the mouth shape of the person in the video according to claim 3, wherein the step S3 comprises the following steps:
s31, constructing a generator in the mouth shape generation network model, wherein the generator is used for generating a corresponding mouth shape according to the voice characteristics; the generator is based on an encoder-decoder network structure and comprises a face encoder, an audio encoder and a face decoder, wherein the face encoder is used for receiving video image frames and generating face intermediate features, and the audio encoder is used for receiving audio signals and generating audio intermediate features; then respectively carrying out feature fusion on the obtained face intermediate features and the audio intermediate features, sending the fused features into a face decoder for decoding, and finally outputting lip-shaped image frames which are generated by a generator and are synchronous with audio;
s32, constructing a discriminator in the mouth shape generation network model, wherein the discriminator is used for judging whether the mouth shape generated by the generator is correct or not and whether the mouth shape is synchronous with the voice or not; wherein, the discriminator comprises a pre-trained discriminator for synchronizing the lip shape and the audio frequency and a discriminator for the lip shape visual quality; then, inputting the lip-shaped synchronous image frame generated by the generator into a discriminator, and inputting the original video frame and the generated image frame into a visual quality discriminator, wherein the discriminator uses two classifications for judgment, and the result of the two classifications shows whether the image is a real image or a generated picture; and the generated image frame and the audio are input into a lip synchronization discriminator trained in advance, and whether the generated lip is synchronized with the corresponding audio is judged.
5. The method for driving the mouth shape of the person in the video according to claim 4, wherein the step S4 comprises the steps of:
s41, preprocessing output data of the mouth shape generation network model;
s42, generating a lip sound synchronous video frame by using the mouth shape generation network model;
and S43, combining the generated video frames into a video, and combining the video with the input voice to form a final output video.
6. The method as claimed in claim 5, wherein the preprocessing operations in step S41 include face extraction, video framing, and voice feature extraction.
7. The method for driving the mouth shape of a person in a video according to any one of claims 1 to 6, wherein the step S1 comprises the following steps:
s11, collecting a lip language identification data set LRS2, and extracting video data with set duration and broadcasting Chinese voice for a host;
and S12, screening, cleaning and coding the collected video data.
8. The method for driving the mouth shape of a character in a video according to claim 7, wherein in step S11, an open source algorithm S3FD is used to extract video data with a set duration and all of which broadcast chinese speech to a host.
9. The method for driving the mouth shape of a person in a video according to claim 8, wherein extracting the video data of set duration in which the presenter broadcasts Chinese speech comprises the following steps:
step S111, face recognition: recognize the face in the image through learning, using the deep-neural-network object detection method YOLO;
step S112, acquiring face detection boxes: owing to the output characteristics of the YOLO object detection algorithm, several face detection boxes with different sizes and confidences are generated when the face in the image is recognized;
step S113, removing redundant boxes with NMS: remove the redundant face detection boxes, keeping only the single correct one;
step S114, acquiring the face region picture: obtain the position of the face within the whole image from the unique detection box corresponding to the face, and crop the face region out of the original image.
10. A system for driving the mouth shape of a person in a video, used to implement the method for driving the mouth shape of a person in a video according to any one of claims 1 to 9, comprising the following modules connected in sequence:
a collection module: used for collecting videos and performing data cleaning and encoding on them;
a lip-sound synchronization discrimination module: used for constructing and training the lip-sound synchronization discrimination model and judging whether the lips and the audio are synchronized;
a mouth shape generation module: used for constructing and training the mouth shape generation network model and generating the corresponding mouth shape from the speech;
a mouth shape driving module: used for constructing and applying the mouth shape driving model to perform inference and prediction on any person's video.
CN202211568819.9A 2022-12-08 2022-12-08 Method and system for driving human mouth shape in video Pending CN115580743A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211568819.9A CN115580743A (en) 2022-12-08 2022-12-08 Method and system for driving human mouth shape in video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211568819.9A CN115580743A (en) 2022-12-08 2022-12-08 Method and system for driving human mouth shape in video

Publications (1)

Publication Number Publication Date
CN115580743A true CN115580743A (en) 2023-01-06

Family

ID=84590144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211568819.9A Pending CN115580743A (en) 2022-12-08 2022-12-08 Method and system for driving human mouth shape in video

Country Status (1)

Country Link
CN (1) CN115580743A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562720A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Lip-synchronization video generation method, device, equipment and storage medium
CN113378697A (en) * 2021-06-08 2021-09-10 安徽大学 Method and device for generating speaking face video based on convolutional neural network
CN113901894A (en) * 2021-09-22 2022-01-07 腾讯音乐娱乐科技(深圳)有限公司 Video generation method, device, server and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665695A (en) * 2023-07-28 2023-08-29 腾讯科技(深圳)有限公司 Virtual object mouth shape driving method, related device and medium
CN116665695B (en) * 2023-07-28 2023-10-20 腾讯科技(深圳)有限公司 Virtual object mouth shape driving method, related device and medium


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20230106