CN116112737A - Video data processing method and system - Google Patents

Video data processing method and system

Info

Publication number
CN116112737A
CN116112737A (application CN202211738171.5A)
Authority
CN
China
Prior art keywords: data, model, audio, video data, mouth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211738171.5A
Other languages
Chinese (zh)
Inventor
司马华鹏
王培雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Silicon Intelligence Technology Co Ltd
Original Assignee
Nanjing Silicon Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Silicon Intelligence Technology Co Ltd filed Critical Nanjing Silicon Intelligence Technology Co Ltd
Priority to CN202211738171.5A priority Critical patent/CN116112737A/en
Publication of CN116112737A publication Critical patent/CN116112737A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439: Processing of audio elementary streams
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06: Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10: Transforming into visible information
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439: Processing of audio elementary streams
    • H04N21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402: Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06: Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10: Transforming into visible information
    • G10L2021/105: Synthesis of the lips movements from speech, e.g. for talking heads

Abstract

The invention provides a video data processing method, which comprises: obtaining basic video data for basic model training; performing data preprocessing on the basic video data; performing basic training on the basic model with the preprocessed video data to generate a first model; performing fine-tuning training on the first model with sample video data to generate a second model; generating target mouth data corresponding to audio to be processed, according to that audio and the second model; and outputting target video data according to the target mouth data and the video data to be processed. The invention also provides a corresponding system. The method and system address the difficulty of correcting errors in already-recorded video: the mouth shape can be regenerated from audio at low cost, avoiding the expense of re-shooting and improving the viewing experience.

Description

Video data processing method and system
Technical Field
The present invention relates to a data processing method, and in particular, to a method and system for processing video data.
Background
With the development of computer technology, video is widely used in daily life: short-video platforms, live-stream e-commerce, online education and similar scenarios all rely on video to convey information. Recording video has gradually become a means of socializing and communicating, and post-processing after recording has accordingly become an indispensable task.
However, recording and producing video is time-consuming, and usually requires a specific recording environment and specialized recording equipment. If a recording error occurs or some lines need to be revised, the affected segments must be re-recorded, which costs time and labor and also raises the problem of splicing the newly recorded footage with the original video. The same problems arise in film shooting, where the equipment is more specialized and more personnel are involved, so the cost of recording is even higher. In addition, dubbing that does not match the lip movements is common in film translation, Mandarin dubbing of dialect films, and similar cases. A technology is therefore needed that can modify a person's mouth shape according to audio and generate video in which audio and mouth shape are consistent, greatly improving working efficiency, reducing video production cost, and giving viewers a better viewing experience.
Disclosure of Invention
The invention provides a video data processing method and a video data processing system, which solve the problem that, after video recording or film shooting, the mouth shape of a person in the video cannot be adjusted efficiently and simply to match the audio as the user requires.
In one aspect, the present invention provides a video data processing method, including:
acquiring basic video data for basic model training;
performing data preprocessing on the basic video data, wherein the data preprocessing comprises extracting audio features and face data to obtain audio feature data and mouth feature data;
performing basic training on the basic model through the audio feature data and the mouth feature data to generate a first model;
acquiring sample video data for training a first model, performing fine tuning training on the first model through the sample video data, and generating a second model;
generating target mouth data corresponding to the audio to be processed according to the audio to be processed and the second model;
and generating target video data according to the target mouth data and the video data to be processed.
Optionally, the basic video data is required to fully expose a mouth that is synchronized with the audio of the basic video data, and the basic video is required to be sufficiently clear.
Optionally, the extracted audio features are features from which semantic information can be extracted.
Optionally, extracting the audio features refers to extracting speech recognition features of the basic video data as the audio features obtained by the data preprocessing.
Optionally, extracting face data refers to first cropping the face region in the video as a base picture, and then processing the mouth region of the base picture to obtain the mouth feature data.
Optionally, training the basic model through the audio feature data and the mouth feature data means training the basic model with the audio feature data and the mouth feature data as model inputs and the base picture as the output.
Optionally, generating the target video data means modifying the corresponding part of the video data to be processed according to the target mouth data, thereby generating the target video data.
Optionally, the sample video data is subjected to data preprocessing to obtain second audio feature data and second mouth feature data, and fine tuning training is performed on the first model through the second audio feature data and the second mouth feature data.
Optionally, fusion processing is performed on the target mouth data.
In another aspect, the present invention provides a video data processing system comprising:
the video acquisition module is used for acquiring basic video data for basic model training;
the data preprocessing module is used for performing data preprocessing on the basic video data, wherein the data preprocessing comprises extracting audio features and face data to obtain audio feature data and mouth feature data;
the first training module is used for carrying out basic training on the basic model through the audio characteristic data and the mouth characteristic data to generate a first model;
the second training module is used for acquiring sample video data for training the first model, performing fine tuning training on the first model through the sample video data and generating a second model;
the video generation module is used for generating target mouth data corresponding to the audio to be processed according to the audio to be processed and the second model;
and the video output module is used for outputting target video data according to the target mouth data and the video data to be processed.
The advantages or beneficial effects of this technical solution at least include:
The invention is a technology for modifying the mouth data in video according to audio, and can be widely applied in scenarios such as film and short video. After shooting is completed, the mouth shape can be regenerated from the audio at very little cost, avoiding the expense of re-shooting. The invention can also be used for dubbing, for example of translated or dialect films, generating mouth shapes that match the dubbed audio and thereby improving the viewing experience.
The invention applies to any field of video creation that involves mouth shapes, such as film shooting, news broadcasting, self-media video creation, film translation and animation production. In film shooting, lines often have to be modified or added, but by the time a film is edited the shooting work is usually finished; additional shooting at that point is expensive, and in many cases a changed environment or damaged sets make re-shooting practically impossible.
For such cases the method is a well-suited solution: only about one minute of footage of the specific actor needs to be cut out for fine-tuning training, after which the actor's mouth shape can be changed arbitrarily to correspond to the modified lines. Video can even be generated directly from new lines, adding larger plot possibilities. This greatly reduces cost and gives artistic creators more room to work.
Self-media video is increasingly common today. Its production process resembles that of film, but with lower cost and a freer shooting environment, which makes it another good application scenario for the invention.
The foregoing summary is for the purpose of the specification only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will become apparent by reference to the drawings and the following detailed description.
Drawings
In the drawings, the same reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily drawn to scale. It is appreciated that these drawings depict only some embodiments according to the disclosure and are not therefore to be considered limiting of its scope.
Fig. 1 is a flow chart of data processing according to a first embodiment of the present invention.
Fig. 2 is a flow chart of speech-based mouth shape generation according to a second embodiment of the present invention.
Fig. 3 is a schematic diagram of a speech-based mouth shape generation network according to a third embodiment of the present invention.
Detailed Description
Hereinafter, only certain exemplary embodiments are briefly described. As will be recognized by those skilled in the pertinent art, the described embodiments may be modified in numerous different ways without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive.
Human pronunciation has the characteristic that a specific sound generally corresponds to a specific mouth shape. The mutual mapping between face images and audio is the core of voice-driven facial video animation synthesis: exploiting the association between audio features and facial features, audio data is taken as input to obtain face (especially mouth shape) data. In this process, a neural network model extracts the semantic information in the audio data and then generates a target mouth-shape picture in combination with the face information.
At the heart of speech-based mouth shape generation is producing the corresponding mouth shape from audio features. What must be extracted from the audio is semantic information, not loudness, timbre, or similar properties of the speech: different people reading the same text at different loudness should produce the same mouth shape, and different people uttering the same speech should likewise produce the same mouth shape. In addition, each person has individual characteristics of skin color, tooth shape, lip size and so on. To generate these lips, face information must be provided, from which the neural network model can generate the corresponding lip information. That is, in this process the speech features provide the mouth shape information, the facial features provide the lip information, and together the mouth shape information and lip information form the mouth data.
Referring to fig. 1, an embodiment of the present invention proposes a method for processing video data, the method comprising:
s1.1, acquiring a large amount of basic video data for basic model training.
In the invention, to obtain a good result, the basic model must be trained with a large amount of data, and the quality of the result is directly related to the data used. The training data must reach a certain definition, preferably 720p or higher, so that the video predicted by the model can be sufficiently clear; it must fully expose the complete mouth region, preferably filmed head-on, so that the model can learn the relation between audio and mouth shape; and its mouth shapes must correspond accurately to the speech, otherwise the model can hardly learn correct mouth shapes. The videos are not limited to particular people and can be recorded or downloaded from the internet; films, news broadcasts and lectures are all good material. The more training video there is, the better the accuracy and generalization of the trained model; according to training experience, at least 3 hours of training video are needed for reasonably accurate results.
Training the basic model on a large amount of video data, covering different people speaking and different actions, lets the model learn a wealth of information such as face shape, environment, timbre, loudness and speech rate, which greatly enhances its generalization.
S1.2, data preprocessing is carried out on the basic video data.
To carry out the training, the data must first be preprocessed; the preprocessing mainly comprises extraction of the audio features and extraction of the face data.
The extraction of audio features mainly yields the audio feature data. As noted in the principle description above, semantic information must be extracted from the audio in order to obtain an accurate mouth shape. Speech recognition models are closely tied to speech semantics, so using one to extract the audio features is a good choice.
One preferred embodiment extracts the audio features with a neural network model, such as the audio pre-training model Audionet, a speech recognition model trained on 10,000 hours of data; the intermediate features extracted by Audionet are used as the audio features of the invention.
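As an illustration of this step, the following sketch extracts intermediate features from a pretrained speech recognition model. Audionet itself is not publicly available, so torchaudio's wav2vec2 ASR bundle is used here purely as a stand-in; the choice of layer and all names are assumptions, not the invention's actual model.

```python
# Sketch: semantic audio features from a pretrained ASR encoder.
# wav2vec2 stands in for the Audionet model named above (an assumption).
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
asr_model = bundle.get_model().eval()

def extract_audio_features(wav_path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(wav_path)
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.inference_mode():
        # extract_features returns the hidden states of each encoder layer;
        # an intermediate layer serves as the semantic audio feature.
        layers, _ = asr_model.extract_features(waveform)
    return layers[len(layers) // 2]  # (batch, frames, feature_dim)
```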
The extraction of face data mainly yields the mouth feature data. One preferred embodiment converts the basic video data into pictures, detects the face position in each picture with the face detection model dlib, crops the whole face region as the crop picture, and then zeroes out the mouth region of the cropped picture to obtain the mask picture.
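A minimal sketch of this preprocessing follows, using dlib's frontal face detector as the embodiment suggests. Zeroing the lower half of the face crop is a simplification of masking the exact mouth region.

```python
# Sketch: crop the face ("crop" picture) and zero the mouth area ("mask" picture).
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()

def crop_and_mask(frame: np.ndarray):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if len(faces) == 0:
        return None, None
    f = faces[0]
    crop = frame[max(f.top(), 0):f.bottom(), max(f.left(), 0):f.right()].copy()
    mask = crop.copy()
    mask[mask.shape[0] // 2:, :] = 0  # zero out the (approximate) mouth region
    return crop, mask
```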
S1.3, performing basic training on the basic model through a large amount of preprocessed video data to generate a first model.
The basic model is trained on the video data produced by the large-batch preprocessing of S1.2, generating the first model. The preprocessed video data comprise the audio feature data and the mouth feature data. In a preferred embodiment, the extracted audio features and the cropped mask pictures serve as the model inputs and the cropped crop pictures as the outputs; the basic model is trained on these, and the trained first model can then be used for subsequent fine-tuning.
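The following training-loop sketch illustrates this input/output arrangement. The generator, dataset layout and single L1 reconstruction loss are illustrative assumptions; the full loss of the invention is described in S2.5 below.

```python
# Sketch: base training with (audio features, mask picture) as input and the
# full face crop as target.
import torch
from torch.utils.data import DataLoader

def train_base_model(generator, dataset, epochs=100, lr=1e-4, device="cuda"):
    generator.to(device).train()
    opt = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.999))
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    for _ in range(epochs):
        for audio_feat, mask_img, crop_img in loader:
            audio_feat = audio_feat.to(device)
            mask_img, crop_img = mask_img.to(device), crop_img.to(device)
            pred = generator(mask_img, audio_feat)   # predicted face picture
            loss = torch.nn.functional.l1_loss(pred, crop_img)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return generator
```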
S1.4, performing fine tuning training on the first model through sample video data to generate a second model.
Sample video data for training the first model are acquired, and fine-tuning training is performed on the first model through the sample video data to generate the second model. The basic model, the first model and the fine-tuned second model share the same network structure, training strategy and data preprocessing, so fine-tuning can directly load the first model; with a first model trained on big data, only about one minute of fine-tuning data is needed to obtain accurate mouth shape data. The big-data basic model is trained from scratch on more than 3 hours of data, learning both the rules that map audio to mouth shape and the rules for generating mouth-shape pictures from features other than the mouth shape, including facial features such as skin texture, tooth shape and lip shape. Loading the basic model at the start of fine-tuning is equivalent to already mastering these general rules, so only the specific characteristics of the target speaker need to be learned from a small amount of data.
In other words, once the first model from the big-data basic training is obtained, the basic parameters are in place, and a large amount of training video is no longer needed in a practical training scenario.
In a preferred embodiment, based on training experience, only about one minute of video of the target speaker is needed as sample video data to modify that speaker's mouth shape; after fine-tuning training of the first model, the second model is obtained. This greatly lowers the barrier to using the model. In an actual usage scenario, about 20 rounds of training are normally enough for an excellent result, as sketched below.
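A fine-tuning sketch under the same assumptions: the first-model checkpoint is loaded and the training routine from S1.3 is reused with the roughly one minute of sample data for about 20 rounds. The checkpoint path and learning rate are illustrative.

```python
# Sketch: fine-tune the loaded first model into the second model.
import torch

def finetune_second_model(generator, sample_dataset,
                          ckpt="first_model.pth", device="cuda"):
    generator.load_state_dict(torch.load(ckpt, map_location=device))
    # Same network, strategy and preprocessing as the base training;
    # only the data and the epoch budget change.
    return train_base_model(generator, sample_dataset,
                            epochs=20, lr=1e-5, device=device)
```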
In another preferred embodiment, the sample video data is subjected to data preprocessing to obtain second audio feature data and second mouth feature data, and the first model is subjected to fine tuning training through the second audio feature data and the second mouth feature data to obtain a second model.
And S1.5, generating target mouth data corresponding to the audio to be processed according to the audio to be processed and the second model.
After the second model is obtained, the video with the erroneous mouth shape is preprocessed in the same way, and the preprocessed audio information is taken as the input of the second model, yielding the correct mouth shape data for the corresponding audio.
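An inference sketch reusing the helpers above; the frame-to-audio-feature alignment is simplified to one feature tensor per video frame, which is an assumption.

```python
# Sketch: predict corrected mouth pictures for each frame of the erroneous video.
import torch

@torch.inference_mode()
def predict_mouth_frames(generator, frames, audio_feats, device="cuda"):
    generator.to(device).eval()
    preds = []
    for frame, feat in zip(frames, audio_feats):
        crop, mask = crop_and_mask(frame)
        if mask is None:
            preds.append(None)  # no face found in this frame
            continue
        x = torch.from_numpy(mask).permute(2, 0, 1)[None].float().to(device) / 255.0
        y = generator(x, feat[None].to(device))
        preds.append((y[0] * 255).byte().permute(1, 2, 0).cpu().numpy())
    return preds
```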
The correct mouth-shape picture predicted by the second model may show a slight color difference from the original picture. In this case, a preferred embodiment performs a fusion operation with OpenCV to remove the color difference.
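One concrete way to do this fusion with OpenCV is Poisson blending via cv2.seamlessClone, sketched below; the patent does not name the exact operation, so treat this as one plausible choice.

```python
# Sketch: blend the predicted face back into the frame, removing color difference.
import cv2
import numpy as np

def fuse_face(pred_face: np.ndarray, frame: np.ndarray, box) -> np.ndarray:
    x, y, w, h = box  # location of the face crop inside the original frame
    pred = cv2.resize(pred_face, (w, h))
    mask = 255 * np.ones(pred.shape[:2], dtype=np.uint8)
    center = (x + w // 2, y + h // 2)
    return cv2.seamlessClone(pred, frame, mask, center, cv2.NORMAL_CLONE)
```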
And S1.6, outputting target video data according to the target mouth data and the video data to be processed.
The target mouth shape data are spliced back into the video data to be processed, which completes the whole mouth-shape correction.
In other words, by replacing the corresponding parts of the erroneous video with pictures formed from the correct mouth shape data, correct video data strictly corresponding to the audio can be output.
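Writing the corrected frames back out and re-attaching the audio might look as follows; the ffmpeg muxing step and all file names are illustrative.

```python
# Sketch: write corrected frames to video and mux the audio back in.
import subprocess
import cv2

def write_video(frames, fps, audio_path="audio.wav", out_path="out.mp4"):
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter("silent.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
    # Copy the corrected video stream and attach the target audio track.
    subprocess.run(["ffmpeg", "-y", "-i", "silent.mp4", "-i", audio_path,
                    "-c:v", "copy", "-c:a", "aac", out_path], check=True)
```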
The embodiment of the invention also provides a mouth shape generating method based on voice, and the method is described in detail below with reference to fig. 2 and 3:
s2.1, obtaining a mask picture. The input picture here removes the mouth region, leaving information of eyes, ears, eyebrows, etc., and the left image information can provide some characteristic information of the model, such as skin texture color, face shape, ID characteristics, etc., to the neural network. As previously described, the voice features provide mouth shape information, the facial features provide lip information, and the mouth shape information and lip information form mouth data.
S2.2, acquiring the audio features. Audio carries a great deal of information: loudness, frequency, timbre, environment, reverberation, speech rate, pitch and so on. A speech-based mouth shape generation system needs the semantic information of the speech, because human mouth shapes correspond one-to-one with semantics but have no regular correspondence with the loudness, frequency, timbre, environment, reverberation, speech rate or pitch of the sound. Because raw speech contains so much extraneous information, feeding it directly into the network greatly increases the learning difficulty and causes convergence problems. A preferred embodiment therefore extracts the semantic information with a pre-trained speech recognition model trained on 10,000 hours of data, which has good accuracy and generalization.
S2.3, the mask picture and the speech features are fed into the E1 image convolution network and the E2 audio convolution network respectively. The convolution networks further reduce the dimensionality of the picture and audio features and extract the useful information; after the convolution operations the two features are concatenated. The concatenated features are fed into a resnet structure, in which the input of each convolution operation is added to its result before being passed on; this alleviates model degradation and allows deep learning models to be made deeper and more complex. A 9-layer resnet performs the convolution processing on the concatenated features, and its large number of parameters fuses the two features to accomplish the mouth-shape generation task. The resnet is followed by a transposed convolution network that upsamples the extracted features back to the dimensions of the input picture and outputs the predicted image; its number of layers corresponds to that of the input convolutions. The input convolutions, the resnet and the transposed convolution network are collectively called the generator.
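A structural sketch of such a generator in PyTorch follows. Channel counts, the audio feature dimension and the frame pooling are illustrative assumptions; only the overall shape (E1 + E2, concatenation, nine residual blocks, transposed-convolution decoder) mirrors the description.

```python
# Sketch: generator = image encoder E1 + audio encoder E2 + 9 res blocks + decoder.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)  # input added to convolution result

class Generator(nn.Module):
    def __init__(self, audio_dim=768):
        super().__init__()
        self.e1 = nn.Sequential(              # E1: image convolution network
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(True))
        self.e2 = nn.Linear(audio_dim, 128)   # E2: audio feature projection
        self.res = nn.Sequential(*[ResBlock(256) for _ in range(9)])
        self.dec = nn.Sequential(             # transposed convs restore input size
            nn.ConvTranspose2d(256, 64, 4, 2, 1), nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, mask_img, audio_feat):
        img = self.e1(mask_img)                # (B, 128, H/4, W/4)
        aud = self.e2(audio_feat.mean(dim=1))  # pool audio frames -> (B, 128)
        aud = aud[:, :, None, None].expand(-1, -1, *img.shape[2:])
        return self.dec(self.res(torch.cat([img, aud], dim=1)))
```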
S2.4, the training process of the speech-based mouth shape generation network follows the training paradigm of a GAN, comprising two networks: a generator and a discriminator. The generator produces pictures close to real pictures from the input features, and the discriminator judges whether a generated picture or a real picture is genuine. In short, the generator tries to produce pictures realistic enough to pass for real, while the discriminator tries to find the differences between generated and real pictures, pushing the generator to produce ever more realistic output. The discriminator used here is a multi-layer, multi-scale discriminator: the loss is computed at several coding layers whose receptive fields differ. A large receptive field learns more global features, while a small receptive field learns finer features such as material and texture, so this multi-layer discriminator has stronger learning capability.
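A sketch of a multi-scale discriminator in this spirit: the same PatchGAN-style network is applied at several downsampled scales, so the coarse scale sees global structure and the fine scale sees texture. Depth and channel counts are assumptions.

```python
# Sketch: multi-scale (multi-layer) discriminator.
import torch.nn as nn

def patch_discriminator():
    return nn.Sequential(
        nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
        nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2, True),
        nn.Conv2d(128, 1, 4, 1, 1))  # per-patch real/fake score map

class MultiScaleDiscriminator(nn.Module):
    def __init__(self, num_scales=3):
        super().__init__()
        self.nets = nn.ModuleList(patch_discriminator() for _ in range(num_scales))
        self.down = nn.AvgPool2d(3, stride=2, padding=1)

    def forward(self, img):
        outs = []
        for net in self.nets:
            outs.append(net(img))
            img = self.down(img)  # half the resolution for the next scale
        return outs
```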
S2.5, given the generator and the discriminator, loss functions are designed to compute the loss, and the optimizer updates the model parameters to fit the model. The invention adopts three loss functions: L_per, L_gan and L_vgg. L_per is a first-order loss that directly computes the difference between the predicted picture and the real picture, evaluating the accuracy of the generated picture. L_gan computes a squared-difference loss on features extracted from the generated and real pictures, and is used to discriminate real pictures from generated ones. L_vgg is also a first-order loss, computed between the VGG features of the generated picture and the real picture; since the VGG model is trained on a large number of pictures, it extracts more representative picture features.
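The three terms might be combined as sketched below: L_per as a pixel-level L1 difference, L_gan as a least-squares loss on the discriminator's output maps, and L_vgg as an L1 difference of pretrained VGG features. The loss weights and feature layer are illustrative assumptions.

```python
# Sketch: generator-side loss combining L_per, L_gan and L_vgg.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG feature extractor (ImageNet normalization omitted for brevity).
vgg_feat = vgg19(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_feat.parameters():
    p.requires_grad_(False)

def generator_loss(pred, real, disc_outs_on_pred):
    l_per = F.l1_loss(pred, real)                       # pixel accuracy
    l_gan = sum(F.mse_loss(o, torch.ones_like(o))       # fool the discriminator
                for o in disc_outs_on_pred)
    l_vgg = F.l1_loss(vgg_feat(pred), vgg_feat(real))   # perceptual features
    return l_per + l_gan + 10.0 * l_vgg
```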
The above flow constitutes the speech-based mouth shape generation system and completes the mouth shape generation task.
The embodiment of the invention also provides a system for processing video data, which comprises:
the video acquisition module is used for acquiring basic video data for basic model training;
the data preprocessing module is used for performing data preprocessing on the basic video data, wherein the data preprocessing comprises extracting audio features and face data to obtain audio feature data and mouth feature data;
the first training module is used for carrying out basic training on the basic model through the audio characteristic data and the mouth characteristic data to generate a first model;
the second training module is used for acquiring sample video data for training the first model, performing fine tuning training on the first model through the sample video data and generating a second model;
the video generation module is used for generating target mouth data corresponding to the audio to be processed according to the audio to be processed and the second model;
and the video output module is used for outputting target video data according to the target mouth data and the video data to be processed.
Other functions of each module in the video data processing system according to the embodiment of the present invention may be referred to the corresponding descriptions in the above method, and will not be described herein.
The embodiment of the invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the method provided in the embodiment of the invention.
The embodiment of the invention also provides a chip, which comprises a processor configured to call and run instructions stored in a memory, so that a communication device provided with the chip executes the method provided in the embodiment of the invention.
The embodiment of the invention also provides a chip, which comprises: the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the invention.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (digital signal processing, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), field programmable gate arrays (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be a processor supporting an advanced reduced instruction set machine (advanced RISC machines, ARM) architecture.
Further, optionally, the memory may include a read-only memory and a random access memory, and may further include a nonvolatile random access memory. The memory may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may include a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory, among others. Volatile memory can include random access memory (random access memory, RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available. For example, static RAM (SRAM), dynamic RAM (dynamic random access memory, DRAM), synchronous DRAM (SDRAM), double data rate synchronous DRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with the present invention are fully or partially produced. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. Computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Any process or method description in a flowchart or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed in a substantially simultaneous manner or in an opposite order from that shown or discussed, including in accordance with the functions that are involved.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the embodiments described above may be performed by a program that, when executed, comprises one or a combination of the steps of the method embodiments, instructs the associated hardware to perform the method.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules described above, if implemented in the form of software functional modules and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that various changes and substitutions are possible within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A method of video data processing, the method comprising:
acquiring basic video data for basic model training;
performing data preprocessing on the basic video data, wherein the data preprocessing comprises extracting audio features and face data to obtain audio feature data and mouth feature data;
performing basic training on the basic model through the audio feature data and the mouth feature data to generate a first model;
acquiring sample video data for training a first model, performing fine tuning training on the first model through the sample video data, and generating a second model;
generating target mouth data corresponding to the audio to be processed according to the audio to be processed and the second model;
and outputting target video data according to the target mouth data and the video data to be processed.
2. The method of claim 1, wherein the basic video data is required to fully expose a mouth that is synchronized with the audio of the basic video data, and the basic video is required to be sufficiently clear.
3. The method of claim 1, wherein the extracted audio features are features that enable semantic information to be extracted.
4. The method of claim 1, wherein extracting audio features comprises extracting speech recognition features of the base video data as the audio features obtained by the data preprocessing.
5. The method according to claim 1, wherein extracting face data comprises first cropping the face region in the video as a base picture, and then processing the mouth region of the base picture to obtain the mouth feature data.
6. The method of claim 5, wherein the training the base model with the audio feature data and the mouth feature data is training the base model with the audio feature data and the mouth feature data as model inputs and the base picture as an output.
7. The method of claim 1, wherein generating the target video data is by modifying a corresponding portion of the video data to be processed based on the target mouth data, thereby generating the target video data.
8. The method of claim 1, wherein the sample video data is subjected to data preprocessing to obtain second audio feature data and second mouth feature data, and wherein the first model is subjected to fine tuning training through the second audio feature data and the second mouth feature data.
9. The method of claim 1, wherein fusion processing is performed for the target mouth data.
10. A system for video data processing, the system comprising:
the video acquisition module is used for acquiring basic video data for basic model training;
the data preprocessing module is used for performing data preprocessing on the basic video data, wherein the data preprocessing comprises extracting audio features and face data to obtain audio feature data and mouth feature data;
the first training module is used for carrying out basic training on the basic model through the audio characteristic data and the mouth characteristic data to generate a first model;
the second training module is used for acquiring sample video data for training the first model, performing fine tuning training on the first model through the sample video data and generating a second model;
the video generation module is used for generating target mouth data corresponding to the audio to be processed according to the audio to be processed and the second model;
and the video output module is used for outputting target video data according to the target mouth data and the video data to be processed.
CN202211738171.5A 2022-12-29 2022-12-29 Video data processing method and system Pending CN116112737A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211738171.5A CN116112737A (en) 2022-12-29 2022-12-29 Video data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211738171.5A CN116112737A (en) 2022-12-29 2022-12-29 Video data processing method and system

Publications (1)

Publication Number Publication Date
CN116112737A true CN116112737A (en) 2023-05-12

Family

ID=86257508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211738171.5A Pending CN116112737A (en) 2022-12-29 2022-12-29 Video data processing method and system

Country Status (1)

Country Link
CN (1) CN116112737A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111370020A (en) * 2020-02-04 2020-07-03 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN111741326A (en) * 2020-06-30 2020-10-02 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN112766166A (en) * 2021-01-20 2021-05-07 中国科学技术大学 Lip-shaped forged video detection method and system based on polyphone selection
US20220058850A1 (en) * 2020-08-23 2022-02-24 Tata Consultancy Services Limited Method and system for generating 2d animated lip images synchronizing to an audio signal

Similar Documents

Publication Publication Date Title
Garrido et al. Vdub: Modifying face video of actors for plausible visual alignment to a dubbed audio track
US11281945B1 (en) Multimodal dimensional emotion recognition method
CN107707931B (en) Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
CN113077537B (en) Video generation method, storage medium and device
US11398255B1 (en) Modification of objects in film
CN113299312B (en) Image generation method, device, equipment and storage medium
US11562597B1 (en) Visual dubbing using synthetic models
WO2023072067A1 (en) Face attribute editing model training and face attribute editing methods
Sun et al. Masked lip-sync prediction by audio-visual contextual exploitation in transformers
CN115830193A (en) Method and device for generating digital human animation, electronic equipment and storage medium
US8297754B2 (en) Apparatus and method of controlling camera work based on direction rule
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
US11830159B1 (en) Generative films
CN116112737A (en) Video data processing method and system
CN116385629A (en) Digital human video generation method and device, electronic equipment and storage medium
US20230319223A1 (en) Method and system for deep learning based face swapping with multiple encoders
Ysique‐Neciosup et al. DeepHistory: A convolutional neural network for automatic animation of museum paintings
CN116248974A (en) Video language conversion method and system
CN113886639A (en) Digital human video generation method and device, electronic equipment and storage medium
Koumparoulis et al. Audio-assisted image inpainting for talking faces
Bigioi et al. Pose-aware speech driven facial landmark animation pipeline for automated dubbing
US11887403B1 (en) Mouth shape correction model, and model training and application method
CN113722513B (en) Multimedia data processing method and equipment
Chen High-Fidelity Talking Avatar Video Generation
US20220165088A1 (en) Imaging device and imaging method using feature compensation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination