CN116229311A - Video processing method, device and storage medium - Google Patents

Video processing method, device and storage medium

Info

Publication number
CN116229311A
Authority
CN
China
Prior art keywords
video
image
lip
model
target user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211723702.3A
Other languages
Chinese (zh)
Other versions
CN116229311B (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shengshu Technology Co ltd
Original Assignee
Beijing Shengshu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shengshu Technology Co ltd filed Critical Beijing Shengshu Technology Co ltd
Priority to CN202211723702.3A priority Critical patent/CN116229311B/en
Publication of CN116229311A publication Critical patent/CN116229311A/en
Application granted granted Critical
Publication of CN116229311B publication Critical patent/CN116229311B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G06V40/171 - Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships

Abstract

The embodiment of the application relates to the technical field of deep synthesis, and provides a video processing method, a video processing device and a storage medium. The video processing method comprises the following steps: acquiring a video to be processed, wherein the video to be processed comprises a plurality of first face images of a target user and is obtained by driving an initial video with preset audio; performing feature extraction on the lip features of the target user in at least one first face image in the video to be processed to obtain initial lip features of the target user; and decoding the initial lip features to generate a first video, wherein the first video comprises a plurality of second face images of the target user, and the naturalness of the lip features of the target user in the second face images is higher than that in the first face images. The scheme can obtain high-quality images in which the avatar speaks with a high degree of naturalness.

Description

Video processing method, device and storage medium
Technical Field
The embodiment of the application relates to the technical field of deep synthesis, and in particular to a video processing method, a video processing device and a storage medium.
Background
As a branch of artificial intelligence (Artificial Intelligence, AI) technology, so-called digital human technology is beginning to be applied to scenes such as short video platforms, live-streaming e-commerce, and online education. A digital human is a virtual character obtained by virtually simulating different forms and functions of the human body with AI technology. With the rapid development of AI and image processing technologies, digital human generation technology is becoming increasingly mature. Taking the application of digital humans in video technology as an example, a virtual object image can be constructed by, for example, deep learning, while the facial expression of this virtual object is driven with speech to simulate a real person speaking. Although this approach achieves a high degree of synchronization between lips and voice, the details of the lips and surrounding areas (e.g., teeth, mouth-corner wrinkles) of a virtual object, whether obtained by face swapping or otherwise, are not clear enough, and the drawback becomes even harder to tolerate once the virtual object is enlarged.
Disclosure of Invention
The embodiment of the application provides a video processing method, a video processing device and a storage medium, which can be used to generate high-quality images in which an avatar speaks with a high degree of naturalness.
In a first aspect, an embodiment of the present application provides a video processing method, including:
acquiring a video to be processed, wherein the video to be processed comprises a plurality of first face images of a target user, and the video to be processed is obtained by utilizing a preset audio to drive an initial video;
feature extraction is carried out on the lip features of the target user in at least one first face image in the video to be processed, and initial lip features of the target user are obtained;
and decoding the initial lip feature to generate a first video, wherein the first video comprises a plurality of second face images of the target user, and the naturalness of the lip feature of the target user in the second face images is higher than that of the lip feature of the target user in the first face images.
In one possible design, the training the encoder, first decoder, and first discriminator using a first training set includes:
the encoder performs feature extraction on the input first training set to obtain first lip features of a target user, wherein the first lip features are lip features of the target user contained in images in the first training set;
the first decoder decodes the first lip feature to obtain a first target image;
The first discriminator judges the confidence of the first target image according to the first target image and the first training set;
calculating a loss function of the first sub-model according to the confidence coefficient of the first target image to obtain a first loss function value;
and taking the first loss function value as the back-propagated quantity, adjusting model parameters of the first sub-model to train the first sub-model until the first loss function value reaches a first preset loss threshold value.
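Purely as a non-authoritative illustration of the training iteration described above, the following PyTorch-style sketch shows how one step of training the first sub-model (encoder, first decoder, first discriminator) could be organised. The function name, optimizer handling and the binary-cross-entropy adversarial loss are assumptions; the application does not fix a concrete implementation.

```python
import torch
import torch.nn.functional as F

def train_first_submodel_step(encoder, first_decoder, first_discriminator,
                              gen_optimizer, disc_optimizer, first_training_batch):
    """One illustrative training iteration of the first sub-model.
    Assumes the discriminator ends in a sigmoid, i.e. outputs a confidence in [0, 1]."""
    # Feature extraction: first lip features from the first training set.
    first_lip_features = encoder(first_training_batch)
    # Decoding: reconstruct the first target image from the first lip features.
    first_target_image = first_decoder(first_lip_features)

    # Discriminator update: confidence for real training images vs. generated images.
    real_confidence = first_discriminator(first_training_batch)
    fake_confidence = first_discriminator(first_target_image.detach())
    disc_loss = (F.binary_cross_entropy(real_confidence, torch.ones_like(real_confidence))
                 + F.binary_cross_entropy(fake_confidence, torch.zeros_like(fake_confidence)))
    disc_optimizer.zero_grad()
    disc_loss.backward()
    disc_optimizer.step()

    # Generator (encoder + first decoder) update: the loss value is back-propagated
    # to adjust the model parameters, as described above.
    gen_confidence = first_discriminator(first_target_image)
    gen_loss = F.binary_cross_entropy(gen_confidence, torch.ones_like(gen_confidence))
    gen_optimizer.zero_grad()
    gen_loss.backward()
    gen_optimizer.step()
    return gen_loss.item(), disc_loss.item()
```

The loop would repeat this step until the loss value reaches the preset loss threshold.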
In one possible design, the training the encoder, second decoder, and second discriminator using the second training set includes:
the encoder performs feature extraction on the input second training set to obtain second lip features of the target user, wherein the second lip features are lip features of the target user contained in the second training set;
the second decoder decodes the second lip feature to obtain a second target image;
the second discriminator judges the confidence of the second target image according to the second target image and the second training set;
calculating a loss function of the second sub-model according to the confidence coefficient of the second target image to obtain a second loss function value;
And taking the second loss function value as the back-propagated quantity, adjusting model parameters of the second sub-model to train the second sub-model until the second loss function value reaches a second preset loss threshold value.
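The second sub-model mirrors the first: the encoder is shared, while the second decoder and second discriminator take the place of the first ones. A hypothetical reuse of the helper sketched above (all names are assumptions) could therefore be as simple as:

```python
# Hypothetical reuse of the training step above for the second sub-model: the encoder
# is shared, the second decoder/discriminator and their optimizers replace the first ones.
gen_loss_2, disc_loss_2 = train_first_submodel_step(
    encoder, second_decoder, second_discriminator,
    gen_optimizer_2, disc_optimizer_2, second_training_batch)
```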
In one possible design, the second training set is derived from video generated using a speech-driven model, and the naturalness of the lip features of the target user in the second training set is inferior to the naturalness of the lip features of the target user in the first training set; the faces in the training images belong to the same user as the target user.
In a second aspect, embodiments of the present application provide a video processing model, where the video processing model includes an image processing model and a face-changing model, and the image processing model includes a first sub-model and a second sub-model;
the image processing model is used for carrying out lip naturalness optimization on the video to be processed to obtain a first video, and inputting the face changing model;
the face changing model is used for carrying out lip feature replacement on the first video input from the image processing model to obtain a second video;
the first sub-model is obtained based on a first training set, the second sub-model is obtained based on a second training set, and the first training set comprises video materials of a plurality of users speaking; the second training set includes historical videos of the target user speaking.
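For orientation only, the two-stage pipeline described in this aspect (lip-naturalness optimization followed by lip-feature replacement) might be wired together as in the following sketch; `ImageProcessingModel` and `FaceSwapModel` callables and the per-frame application are assumptions, not part of the application.

```python
def run_video_processing_model(video_to_process, image_processing_model, face_swap_model):
    """Illustrative two-stage pipeline: lip-naturalness optimization, then lip replacement."""
    # Stage 1: optimize lip naturalness of the video to be processed -> first video.
    first_video = [image_processing_model(frame) for frame in video_to_process]
    # Stage 2: replace lip features in the first video -> second video.
    second_video = [face_swap_model(frame) for frame in first_video]
    return second_video
```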
In a third aspect, an embodiment of the present application provides a video processing apparatus, including:
the input/output module is used for acquiring a video to be processed, wherein the video to be processed comprises a plurality of first face images of a target user, and the video to be processed is obtained by driving an initial video by using preset audio;
the processing module is used for performing feature extraction on the lip features of the target user in at least one first face image in the video to be processed acquired by the input/output module, to obtain initial lip features of the target user;
and decoding the initial lip feature to generate a first video, wherein the first video comprises a plurality of second face images of the target user, and the naturalness of the lip feature of the target user in the second face images is higher than that of the lip feature of the target user in the first face images.
In a fourth aspect, an embodiment of the present application provides an electronic device, including:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method as described above.
A fifth aspect of the present application provides a storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform a method as described above.
Compared with the prior art, the technical scheme provided by the application can have the following beneficial effects: feature extraction and decoding processing are performed in sequence on the target face in at least one first face image in the video to be processed, and when the first video is generated, second face images whose lip naturalness is superior to that of the first face images are generated. On the one hand, even if the lip naturalness of a first face image in the current video to be processed is low, the lip features in the first face image are extracted; through this extraction, the low-naturalness lip features that affect the playing effect of the video to be processed can be given targeted, focused processing, and the extracted lip features can be decoded to replace the low-naturalness lip features in the first face image. In other words, by reconstructing the lip features in the first face image, a second face image with highly natural lip features can be obtained, which solves the problem of poor viewing experience caused by unclear details at specific positions of the virtual object's face in application scenes such as short video platforms, live streaming, and online education. On the other hand, performing feature extraction and decoding processing on the target face in the first face image in sequence makes it possible to obtain the higher-quality second face image quickly, which effectively improves the efficiency of obtaining the first video with higher lip naturalness and further improves the speed at which the first video presenting the avatar can go online, as well as the user's viewing experience.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
Fig. 1 is an application scenario schematic diagram of a video processing method provided in an embodiment of the present application;
FIG. 2a is a schematic structural diagram of an image processing model according to an embodiment of the present application;
FIG. 2b is a schematic diagram of an image processing model according to another embodiment of the present application;
FIG. 2c is a schematic diagram of a video processing apparatus according to another embodiment of the present application for processing a video to be processed;
FIG. 3 is a schematic diagram of a training process of an image processing model according to an embodiment of the present application;
FIG. 4a is a schematic structural diagram of an image processing model of a training phase provided in an embodiment of the present application;
FIG. 4b is a schematic diagram of a first face image processing using a trained image processing model, as shown in an embodiment of the present application;
FIG. 5a is a flow chart of a video processing method according to an embodiment of the present disclosure;
FIG. 5b is a flow chart of a video processing method according to an embodiment of the present disclosure;
FIG. 6a is a schematic diagram showing the comparison of the effects of inputting a video to be processed into an image processing model to generate a first video according to an embodiment of the present application;
FIG. 6b is a schematic diagram showing the comparison of the effects of generating a second video from a first video input face-changing model according to an embodiment of the present application;
fig. 7 is a schematic diagram of a video to be processed according to an embodiment of the present application, where the video to be processed includes video frames of at least two different target users, and the video frames respectively call corresponding image processing models to obtain corresponding second face images;
fig. 8 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device shown in an embodiment of the present application;
fig. 10 is a schematic structural view of a video processing apparatus according to another embodiment of the present application;
fig. 11 is a schematic diagram of a server structure according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message, without departing from the scope of the present application. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The embodiment of the application provides a video processing method, a video processing device and a storage medium, which can acquire a high-quality image containing a virtual object, can be used for a server or terminal equipment, and particularly can be used for carrying out feature extraction and decoding processing on a plurality of first face images comprising at least one target user, so as to generate a second face image with quality superior to that of the first face image.
In some embodiments, when the scheme is applied to an application environment as shown in fig. 1, the application environment may include a server, a database and a terminal, where the database may be a database independent of the server, or may be a database integrated with the server, and the terminal may be a personal computer or the like, where the terminal obtains a video to be processed by using a preset audio driving initial video through a neural network, or the terminal may be an intelligent terminal (e.g., a smart phone) with a photographing function or an image capturing device such as a camera, and photographs a section of the video to be processed against a real human. When the video processing method is implemented based on the application environment shown in fig. 1, the terminal acquires the video to be processed, uploads the video to the database, and the server runs the trained image processing model after acquiring the video to be processed from the database, and sequentially performs feature extraction and decoding processing on the target face in at least one first face image in the video to be processed, so as to generate a first video.
The solution of the embodiments of the present application may be implemented based on artificial intelligence (Artificial Intelligence, AI), natural language processing (Natural Language Processing, NLP), machine learning (Machine Learning, ML), and other technologies, and is specifically described by the following embodiments:
the AI is a theory, a method, a technology and an application system which simulate, extend and extend human intelligence by using a digital computer or a machine controlled by the digital computer, sense environment, acquire knowledge and acquire an optimal result by using the knowledge. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
AI technology is a comprehensive discipline, and relates to a wide range of technologies, both hardware and software. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
NLP is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field therefore involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
For audio and video processing in the field of artificial intelligence, the embodiments of the application can adopt artificial intelligence technology to make up for the lack of character detail in voice-driven video.
It should be noted that, the server (for example, the video processing apparatus) according to the embodiments of the present application may be an independent physical server, may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The video processing device according to the embodiment of the present application may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a personal digital assistant, and the like. The video processing device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
As an important branch of AI technology, so-called digital human technology is beginning to be applied to scenes such as short video platforms, live-streaming e-commerce, and online education. A digital human is a virtual character obtained by virtually simulating different forms and functions of the human body with AI technology. With the rapid development of AI and image processing technologies, digital human generation technology is becoming increasingly mature. Taking the application of digital humans in video technology as an example, a virtual object image can be constructed by, for example, deep learning, while the facial expression of this virtual object is driven with speech to simulate a real person speaking. Although this approach achieves a high degree of synchronization between lips and voice, the details of the lips and surrounding areas (e.g., teeth, mouth-corner wrinkles) of a virtual object, whether obtained by face swapping or otherwise, are not clear enough, and the drawback becomes even harder to tolerate once the virtual object is enlarged.
Aiming at the problems, the embodiment of the application mainly adopts the following technical scheme: acquiring a video to be processed; and sequentially carrying out feature extraction and decoding processing on target faces in at least one first face image in the video to be processed to generate a first video, wherein the video to be processed comprises a plurality of first face images of at least one target user, the video to be processed is obtained by driving an initial video by using preset audio, the first video comprises a plurality of second face images of the target user, and the quality of the second face images is better than that of the first face images.
The following describes the technical solutions of the embodiments of the present application in detail with reference to fig. 2a to 9.
Because the video processing method of the embodiments of the application can process face images based on a pre-trained image processing model, the training process of the image processing model is introduced before the video processing method itself. In order to replace a first face image in the video to be processed and thereby improve the definition of the target user in the video to be processed, the image processing model in the embodiments of the application trains the initial model along two paths with training sets of two different image qualities, so that the image processing model is trained on images of two different qualities respectively.
In some embodiments, as shown in fig. 2a, the image processing model includes an encoder, a first decoder, a first discriminator, a second decoder, and a second discriminator; the structure of the image processing model used to implement the video processing method is not limited to this. In the image processing model, the first training set and the second training set share one encoder, but different training sets use different decoders and discriminators: after the first training set is input into the image processing model, its processing path is the encoder, the first decoder, and the first discriminator, as shown by the thick solid line in fig. 2a, while after the second training set is input into the image processing model, its processing path is the encoder, the second decoder, and the second discriminator, as shown by the thick dashed line in fig. 2a. When the video to be processed contains at least two avatars, the above image processing model may be provided with a two-path-trained image processing sub-model (substantially identical in structure to the image processing model shown in fig. 2a) for each avatar separately, such as the model structure diagram shown in fig. 2b or the video processing apparatus structure diagram shown in fig. 2c. In the embodiments of the present application, the image processing model to which a single avatar belongs may be deployed separately, or the image processing sub-models to which at least two avatars belong may be deployed in an integrated manner.
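A minimal structural sketch of the model in fig. 2a is given below, under the assumption of simple convolutional blocks; the layer configuration is not specified in the application and is purely illustrative. The shared encoder feeds either the first or the second decoder/discriminator branch depending on which training set is being processed.

```python
import torch.nn as nn

class ImageProcessingModel(nn.Module):
    """Shared encoder with two decoder/discriminator branches (cf. fig. 2a).
    Layer sizes are illustrative assumptions only."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(            # extracts lip features
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU())

        def make_decoder():                      # reconstructs a face image from lip features
            return nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

        def make_discriminator():                # outputs a confidence in [0, 1]
            return nn.Sequential(
                nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(64, 1, 4, stride=2, padding=1),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Sigmoid())

        self.first_decoder, self.second_decoder = make_decoder(), make_decoder()
        self.first_discriminator, self.second_discriminator = make_discriminator(), make_discriminator()

    def forward(self, face_image, use_first_branch=True):
        lip_features = self.encoder(face_image)
        decoder = self.first_decoder if use_first_branch else self.second_decoder
        return decoder(lip_features)
```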
The encoder of the embodiments of the present application essentially works by coding, i.e., converting signals (e.g., bit streams) or data into a signal form that can be used for communication, transmission, and storage, and it is used here to extract lip features from the first face image in the video to be processed. The decoder is a hardware/software device capable of decoding a digital video and audio data stream back into an analog video and audio signal, and it is used here to decode the extracted lip features to generate the second face image.
Fig. 3 is a schematic diagram of a training process for the image processing model shown in fig. 2a, taking training for an image processing scene with one avatar as an example, where the training process includes steps S301 to S303:
s301: a first training set and a second training set are obtained.
The first training set comprises video materials of a plurality of users speaking; it is used to collect material for basic lip effects and to train the first sub-model. For example, short clips of many people around the world speaking are collected (e.g., 100,000 videos of 3-5 seconds, covering different genders, different ages, different mouth shapes, different distances from the lens, different postures/effects/emotions when speaking, different languages, etc.).
The second training set comprises historical videos of the target user speaking; it is spliced data formed from 402 syllables, namely the target user's speaking-habit data and the target user's continuous-reading effect data. When the second training set is used to optimize the lip naturalness effect, the second sub-model needs to be trained with data of the target user. Specifically, the concatenated data of 402 syllables in the second training set can be used to simulate the speaking habits of the target user, i.e., each target user has a unique style when speaking a syllable. This is typically embodied as follows: pronunciation, lip size, and the magnitude of lip opening and closing all differ from person to person when speaking, so the individual data of the target user need to be specifically selected for learning, i.e., for training the second sub-model in the image processing model.
In some embodiments, to ensure lip accuracy of the training, some filtering may be performed on the acquisition of the first training set and the second training set, in particular, the video material in the first training set meets at least one of the following characteristics:
the playing time length is smaller than the preset time length;
different sexes;
Different ages;
different mouth shapes;
different shooting focal lengths;
different poses when speaking;
different effects when speaking;
different moods when speaking;
alternatively, different languages;
in other embodiments, the video material in the second training set satisfies at least one of the following characteristics:
habit syllable data of the target user when speaking in the history time;
or effect syllable data of continuous speaking of the target user in the history time.
In this embodiment, since the first training set contains video materials of a plurality of users speaking, the basic lip shapes of many users speaking can be learned when the first training set is used to train the first sub-model. The speaking-habit data of the target user in the second training set, i.e., the target user's continuous-reading effect data, can effectively improve the lip accuracy of training when the second training set is used to optimize the lip naturalness effect, thereby providing a good basis for the subsequent lip-naturalness optimization.
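Under assumptions about how clip metadata is stored, the screening criteria listed above could be applied with a simple filter such as the following; the metadata keys and the five-second cutoff are hypothetical.

```python
def screen_first_training_set(clips, preset_duration_s=5.0):
    """Keep clips whose playing time is below the preset duration and summarize the
    diversity of the retained material (keys 'duration', 'gender', 'age', 'mouth_shape',
    'language' are assumed metadata fields)."""
    kept = [c for c in clips if c["duration"] < preset_duration_s]
    coverage = {key: {c.get(key) for c in kept}
                for key in ("gender", "age", "mouth_shape", "language")}
    return kept, coverage
```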
S302: training the encoder, first decoder and first discriminator using the first training set and training the encoder, second decoder and second discriminator using the second training set until a first loss function value reaches a first preset loss threshold for the first sub-model and a second loss function value reaches a second preset loss threshold for the second sub-model.
The first loss function value is the value of the loss function of the first sub-model formed by the encoder, the first decoder, and the first discriminator; the second loss function value is the value of the loss function of the second sub-model formed by the encoder, the second decoder, and the second discriminator; and the lip naturalness of the target user in the first training set is better than that of any user in the second training set. It should be noted that, in this embodiment of the present application, the first training set and the second training set contain the same target user, represented or referred to by the same identifier; in other words, the content contained in the first training set and the second training set, namely the target user, is the same, and what differs is the style or quality of the images. From the viewpoint of training cost or efficiency of the model, it would be desirable to feed the image processing model two kinds of samples that are highly similar in content and style and of high quality. In engineering practice, however, either the possibility of acquiring two samples of the same content and similar style is low, or the cost of acquiring a higher-quality sample is high, and this low probability or high cost is even more pronounced when the samples are image data. For example, it is easy to capture images of the same content as training images, but it is not easy to capture two different styles of images of the same target user as training images; likewise, it is easy to capture an ordinary or low-quality image of a given target user, but not easy to capture a high-quality image of that target user.
In order to reduce the overall training cost of the image conversion model by reducing the acquisition cost of the image, in the embodiment of the present application, besides the first training set and the second training set should contain target users with the same identification, the lip features of the users in the first training set and the second training set are not required to have similar or identical naturalness, and only the lip naturalness of the target users in the first training set is required to be significantly higher than that of the second training set. For example, in terms of lip naturalness, the lip naturalness of a target user in the first training set is significantly higher than that of any user in the second training set, and/or in terms of detail representation, the lip naturalness in the first training set is significantly higher than that in the second training set, and the lip features are rich and more consistent with the real speaking habit of the target user. Taking the digital person as an example, although the digital person in the first training set and the digital person in the second training set are the same digital person, when the digital person is used as video content, the mouth shapes of the two digital persons are completely matched with the audio content. However, the digital person in the first training set is clearer as a whole than the digital person in the second training set, and details of the digital person in the first training set such as the lip shape, teeth, mouth angle wrinkles and chin can be naturally, clearly and normally displayed, but details of the digital person in the second training set such as the lip shape, teeth, mouth angle wrinkles and chin may be blurred or distorted.
As for the specific acquisition modes of the first training set and the second training set, the first training set and the second training set may be acquired by a photographing mode, for example, the face of the same person is photographed by using an image acquisition device such as a camera. In some embodiments, the second training set may be obtained synthetically because of low lip naturalness requirements for any user in the second training set. In particular, the second training set may be derived from video generated using a speech driven model, and the lip naturalness of the images in the second training set is inferior to the lip naturalness of the images in the first training set. For example, the lip naturalness of any user in the second training set is lower than the lip naturalness of the target user in the first training set. Meanwhile, for the first training set, an image acquisition device (for example, a single-lens reflex camera for professional shooting, etc.) with a better tracking shooting effect can be used for shooting aiming at a target user such as a face of a real human. As for the video generated by using the voice driving model, the implementation is the same as the technical scheme of obtaining the video to be processed by using the preset audio driving initial video in the foregoing embodiment, and the description of the embodiment of obtaining the video to be processed by using the preset audio driving initial video can be referred to, which is not repeated herein.
S303: and adjusting parameters of the first sub-model and the second sub-model according to the first loss function value and the second loss function value until the difference between the training result output by the first sub-model and the naturalness of the first training set does not exceed a first preset naturalness threshold value, and the difference between the training result output by the second sub-model and the naturalness of the second training set does not exceed a second preset naturalness threshold value.
How to train the encoder, the first decoder and the first discriminator using the first training set, and how to train the encoder, the second decoder and the second discriminator using the second training set, respectively, are described below.
(1) Training an encoder, a first decoder, and a first discriminator using a first training set
As an embodiment of the present application, training the encoder, the first decoder, and the first discriminator using the first training set may proceed as follows: the encoder performs feature extraction on the input first training set to obtain first lip features of the target user; the first decoder decodes the first lip features to obtain a first target image; the first discriminator judges the confidence of the first target image according to the first target image and the first training set; a loss function of the first sub-model is calculated according to the confidence of the first target image to obtain a first loss function value; and, taking the first loss function value as the back-propagated quantity, the model parameters of the first sub-model are adjusted to train the first sub-model until the first loss function value reaches the first preset loss threshold, where the first lip features are lip features of the target user contained in the images in the first training set.
To further illustrate the above solution, the encoder and the first decoder of the image processing model illustrated in fig. 2a are abstracted here as a first generator denoted G1(x1), the encoder and the second decoder are abstracted as a second generator denoted G2(x2), the first discriminator is denoted D1(y1), and the second discriminator is denoted D2(y2); the abstracted image processing model is shown in fig. 4a. In the image processing model illustrated in fig. 4a, the input x1 of the first generator G1(x1) may represent the first training set of the above embodiment, and the output y1 may represent the first target image reconstructed after the first training set x1 is input into the first generator G1(x1); x1 and y1 may be input into the first discriminator D1(y1). The goal of training the first generator G1(x1) is to make the input x1 generate a y1 highly similar to x1, so that D1(y1) cannot identify whether the y1 input to it is data output by the first generator G1(x1) or data from x1; this is referred to as the "deception ability" of G1(x1). Meanwhile, the goal of training D1(y1) is, by "feeding" it a large amount of x1, or data x'1 with the same features as x1, to make it continuously learn the features of x1, so that it can discriminate whether the y1 input to it is data output by the first generator G1(x1) or data from x1; this is referred to as the "discrimination ability" of D1(y1).
The training of G1(x1) and D1(y1) is not synchronized or performed simultaneously. That is, G1(x1) can be trained first so that it outputs y1, and D1(y1) outputs a discrimination result by discriminating y1; this result is expressed as a probability value, i.e., the probability that y1 is identified as coming from x1. If the probability value is too large and exceeds a preset threshold, it indicates that the discrimination result of D1(y1) does not meet expectations, and the parameters of D1(y1) are adjusted to train D1(y1). Conversely, if the probability value is too small, far below the preset threshold, it indicates that the similarity between the y1 generated by G1(x1) and x1 (or x'1) is too low, so that D1(y1) can easily identify that y1 was generated by G1(x1) rather than coming from x1 or x'1; the parameters of G1(x1) are then adjusted and a new round of training of G1(x1) is started, the new round being similar to the previous one. From the above description of training G1(x1) and D1(y1), G1(x1) hopes that the probability value output by D1(y1) as its discrimination result is as large as possible, because the larger the probability value, the more thoroughly D1(y1) has been misled. Theoretically, the discrimination result output by D1(y1) is ideal when it equals 1, but this would introduce other problems. Therefore, the preferred state for training G1(x1) and D1(y1) is that the "deception ability" of G1(x1) and the "discrimination ability" of D1(y1) reach equilibrium, expressed as D1(y1) outputting a probability value of 0.5: D1(y1) can neither determine that y1 was generated by G1(x1) nor that it came from x1 or x'1; in other words, D1(y1) can only consider that the y1 input to it has a 50% probability of being generated by G1(x1) and a 50% probability of coming from x1 or x'1.
Since, in general, the first training set is regarded as the reference for training G1(x1) and D1(y1), in the above embodiment the first discriminator, when judging the confidence of the first target image according to the first target image and the first training image, is actually judging the similarity between the first target image and the first training image: the higher the similarity between the first target image and the first training image, the higher the confidence of the first target image. As for the first loss function value, it in fact corresponds to the "deception ability" of G1(x1) and the "discrimination ability" of D1(y1); when the two reach equilibrium, i.e., when the probability value output by D1(y1) is 0.5, the corresponding first loss function value can be considered to have reached the first preset loss threshold, the first loss function converges, and the training of G1(x1) and D1(y1) ends.
From the above description, by training G1(x1) and D1(y1) so that the "deception ability" of G1(x1) and the "discrimination ability" of D1(y1) reach equilibrium, any low-quality face image (whose face corresponds to the same user as the faces in the first training set) can be input into the network formed by G1(x1) and D1(y1); since the features of high-naturalness lip images were learned in the training stage, G1(x1) can then generate a face image with a highly natural lip.
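Although the application does not write the loss out explicitly, the equilibrium described above (the discriminator outputting 0.5) is characteristic of the conventional adversarial objective; under that assumption, the objective for the first sub-model would read:

```latex
\min_{G_1}\max_{D_1}\;
\mathbb{E}_{x_1}\!\left[\log D_1(x_1)\right]
+ \mathbb{E}_{x_1}\!\left[\log\bigl(1 - D_1\bigl(G_1(x_1)\bigr)\bigr)\right],
```

with the optimal discriminator outputting 0.5 once the distribution of generated images matches that of the first training set.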
(2) Training an encoder, a second decoder, and a second discriminator using a second training set
As one embodiment of the present application, training the encoder, the second decoder, and the second discriminator using the second training set may proceed as follows: the encoder performs feature extraction on the input second training set to obtain second lip features of the target user; the second decoder decodes the second lip features to obtain a second target image; the second discriminator judges the confidence of the second target image according to the second target image and the second training set; a loss function of the second sub-model is calculated according to the confidence of the second target image to obtain a second loss function value; and, taking the second loss function value as the back-propagated quantity, the model parameters of the second sub-model are adjusted to train the second sub-model until the second loss function value reaches the second preset loss threshold, where the second lip features are lip features of the target user contained in the images in the second training set. The more detailed process of training the encoder, the second decoder, and the second discriminator is similar to the training of the encoder, the first decoder, and the first discriminator; reference is made to the related description of the foregoing embodiments, which is not repeated here.
From the above description, it can be seen that G2(x2) and D2(y2) of fig. 4a are trained so that the "deception ability" of G2(x2) and the "discrimination ability" of D2(y2) reach equilibrium. When a face image with a low-naturalness lip needs to be reconstructed, the face image of any low-naturalness lip (whose face corresponds to the same user as the faces in the second training set, such as the target user) only needs to be input into the network formed by G2(x2) and D2(y2); since the features of low-naturalness lip images were learned in the training stage, G2(x2) can generate a face image with a low degree of lip naturalness.
Therefore, in the embodiment of the application, after the image processing model has been trained along two paths with the first training set and the second training set as described above, the image processing model has the ability to replace the lip features in any face image with a low-naturalness lip (whose face corresponds to the same user as the faces in the first training set) with highly natural lip features (namely, the high-naturalness lip features provided by the first training set); that is, a face image with highly natural lip features can be obtained by refining the details of the input face image or improving its resolution.
As described in the above embodiment, after the image processing model is obtained through model training, the face image that needs to be optimized for the details of the face feature in the video to be processed may be processed based on the image processing model, so as to obtain the face image with high lip naturalness (for example, the lip feature is not stiff, and accords with the habit of the target user when speaking). Specifically, referring to fig. 5a, a flow chart of a video processing method according to an embodiment of the present application is shown, where the video processing method is implemented based on an unsupervised learning image processing model. Taking the processing of faces in a virtual anchor video as an example, the method can be executed by a service server side, and the service server can be a training platform, a social platform, a government platform, a short video platform and the like which need to interact based on virtual images (such as virtual characters, virtual animals, cartoon animals and the like). The embodiment of the present application mainly includes steps S501 to S503 illustrated in fig. 5a, and is described as follows:
s501: and acquiring the video to be processed.
The video to be processed includes a plurality of first face images of at least one target user (in this embodiment, 1 target user is taken as an example), and the video to be processed is obtained by driving an initial video with preset audio. For example, a video is driven with preset audio, news virtual hosting is performed instead of a real person, game virtual commentary is performed instead of a game master, and the like.
In this embodiment of the present application, the video to be processed includes a plurality of first face images of at least one target user, where the target user may be a person or another animal; the embodiment only takes a human avatar of the target user as an example, and a non-human avatar can refer to the embodiments for a human avatar, which is not repeated here. In some embodiments, the first face image may be obtained by synthesis, for example, a video to be processed is obtained by driving an initial video with preset audio, and any video frame containing a face in the video to be processed may be used as a first face image. One method of obtaining the video to be processed by driving an initial video with preset audio is as follows: obtain an audio fragment and at least two video fragments, obtain a target fragment according to the at least two video fragments, determine the correspondence between the (N×i)-th to (N×(i+1)-1)-th audio frames of the audio fragment and the i-th video frame of the target fragment, and, according to the correspondence, drive each video frame with its corresponding audio frames to obtain the video to be processed. In the video to be processed, the lips of the target user are synchronized with the speech.
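Assuming N audio frames per video frame, the correspondence just described can be sketched as follows; the pairing function and data layout are assumptions for illustration only.

```python
def map_audio_frames_to_video_frames(audio_frames, video_frames, n):
    """Associate audio frames N*i .. N*(i+1)-1 with the i-th video frame of the target
    fragment (N, the number of audio frames per video frame, is assumed to be known)."""
    return {i: audio_frames[n * i : n * (i + 1)] for i in range(len(video_frames))}
```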
S502: and extracting features of lip features of a target user in at least one first face image in the video to be processed to obtain initial lip features of the target user.
The first video comprises a plurality of second face images of the target user, and the quality of the second face images is better than that of the first face images. For example, the resolution of the pixel region to which the face feature of the second face image belongs is high definition, and the resolution of the pixel region to which the face feature of the first face image belongs is low definition, or the texture, expression, and other attributes of the face portion of the second face image are finer and more realistic than those of the face portion of the first face image.
S503: and decoding the initial lip feature to generate a first video.
The first video comprises a plurality of second face images of the target user, and the naturalness of the lip features of the target user in the second face images is higher than that of the lip features of the target user in the first face images.
In some implementations, when the embodiments of the present application are implemented based on a neural network model, the image processing model includes a first sub-model including an encoder, a first arbiter, and a first decoder; the first sub-model is obtained based on training of a first training set. Specifically, the first video may be generated based on the image processing model trained in the embodiment corresponding to fig. 2 a:
The encoder performs feature extraction on the first face image to obtain initial lip features of the target user, and inputs the initial lip features into the first decoder;
the first decoder decodes the initial lip feature to obtain the second face image, and the matching degree of the lip feature of the target user in the second face image and the audio is higher than that of the lip feature of the target user in the first face image.
The lip features of the above embodiment may refer to semantic features of the lip information; semantic features are abstract features of things in the image, fused with spatial information, including color, texture, shape, attribute features, and the like. The encoder of the embodiments of the present application essentially works by coding, converting signals (e.g., bit streams) or data into a signal form that can be used for communication, transmission, and storage, and is used to extract the lip features of the first face image, while the decoder is a hardware/software device capable of decoding a digital video and audio data stream back into an analog video and audio signal, and its work here is to decode the lip features to generate the second face image. It should be noted that, unlike the training phase of the image processing model, where the encoder and the second decoder are used to process images with low lip naturalness (such as the second training set mentioned in the previous embodiment), in the phase of applying the trained image processing model the lip naturalness of the first face image is low (e.g., lower than a first threshold); therefore, to obtain a second face image whose lip naturalness is superior to that of the first face image, the encoder and the first decoder may be used to reconstruct the lip features in the first face image. Specifically, when the first face image is input into the encoder and the first decoder in sequence, the first decoder may decode the lip features extracted from the first face image by the encoder to generate a second face image whose lip naturalness is superior to that of the first face image; for example, the lip naturalness of the second face image is higher than a second threshold, and the first threshold is smaller than the second threshold (the difference between the two is not limited in the embodiments of the present application), so that the lip naturalness of the first face image is lower than that of the second face image.
As shown in fig. 4b, when the first face image is input into the trained image processing model, it is processed by the first generator G1(x1) composed of the encoder and the first decoder, rather than by the second generator G2(x2) composed of the encoder and the second decoder. Since the first generator G1(x1) is already trained, the first discriminator D1(y1) is no longer needed to discriminate the second face image generated by the first generator G1(x1); the second face image generated by the first generator G1(x1) can be used directly.
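A hypothetical inference sketch of this step is given below, using the attribute names from the structural sketch earlier in this description (those names are assumptions); only the encoder and the first decoder are exercised, and the discriminators are not used.

```python
import torch

@torch.no_grad()
def generate_second_face_image(model, first_face_image):
    """Reconstruct a second face image with higher lip naturalness from a first face image."""
    initial_lip_features = model.encoder(first_face_image)          # feature extraction
    second_face_image = model.first_decoder(initial_lip_features)   # decoding / reconstruction
    return second_face_image
```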
Referring to fig. 6a, the first face image is a face image of at least one target user contained in a video to be processed that was obtained by driving an initial video with preset audio. The face image is generally unclear, particularly in specific areas such as the lips, whose details become even more blurred when enlarged (see, in fig. 6a, the enlarged lip region in the upper-left corner of the first face image). When the first face image is input into the trained image processing model, the encoder of the trained image processing model performs feature extraction on the first face image to obtain the initial lip features of the face. Since the first decoder of the trained image processing model has learned the features of high-definition face images, the first decoder decodes the initial lip features of the face to reconstruct the second face image. Referring to the enlarged lip region in the upper-left corner of the second face image in fig. 6a, it can be seen that the second face image is significantly better than the first face image in sharpness, detail representation of the region, and so on.
From the above training of the image processing model, each image processing model is obtained by training on an image training set containing a specific target user; that is, the target user and the image processing model have a correspondence. Therefore, in the above embodiment, if the video to be processed contains video frames of at least two different target users, the corresponding image processing models are invoked respectively to obtain the corresponding second face images. For example, if the video to be processed contains video frames of target user O1 and target user O2, the image processing model M1 corresponding to target user O1 (i.e., the image processing model trained with an image training set containing target user O1) needs to be invoked to generate the corresponding second face image M'1, and the image processing model M2 corresponding to target user O2 (i.e., the image processing model trained with an image training set containing target user O2) needs to be invoked to generate the corresponding second face image M'2, as shown in fig. 7.
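The per-user routing described here could be realized with a simple registry keyed by user identity; everything below, including the reuse of the inference helper sketched above, is an assumption about how such routing might be organised.

```python
def process_multi_user_frame(face_crops_by_user, model_registry):
    """face_crops_by_user: {user_id: first face image}; model_registry: {user_id: its trained
    image processing model}, e.g. O1 -> M1 and O2 -> M2 following the example above."""
    return {user_id: generate_second_face_image(model_registry[user_id], face)
            for user_id, face in face_crops_by_user.items()}
```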
In the above embodiment, when the first face image is processed, the whole face information of the target user included in the first face image is processed, and the processing range is relatively large, which objectively affects the training efficiency of the image processing model. Considering that the related art mainly processes the detailed parts (e.g., teeth, corner wrinkles, etc.) of specific areas such as lips and their surroundings when generating digital persons, in other words, lips of the target user are important attention areas in image processing in the embodiments of the present application. In order to improve the training efficiency of the image processing model and reduce the range during image processing, a feature enhancer and a feature converter can be further added to the trained image processing model, and correspondingly, the embodiment of the application further comprises the steps of a and b:
a. And the feature enhancer is used for carrying out feature enhancement on the concerned region in the lip feature of the target user to obtain enhanced features.
In some embodiments, the region of interest may be a detailed portion of a specific region (e.g., teeth, corner of mouth wrinkles, etc.) of the lips of the target user and their surroundings.
b. The feature converter maps the enhancement features to image features in the second face image and distribution thereof to obtain converted lip features.
It should be noted that, similar to the encoder, the first decoder, the second decoder, the first discriminator, and the second discriminator of the foregoing embodiments, the feature enhancer and the feature converter also need to be trained. Through training, the feature enhancer learns to enhance the information of the concerned region in the lip feature and inhibit the information of the rest part.
In this embodiment, in the scene of driving an avatar video with voice, the facial motion of the avatar is critical and is usually the focal area for viewers watching the video. The region of interest of the lip features in each first face image of the video to be processed is therefore the lips and the surrounding facial regions. By enhancing this region of interest and mapping the enhanced lip features to the image features and distribution of the second face image, the training efficiency of the image processing model can be improved, and the naturalness of the lip motion, the texture definition, and the richness of detail in the region of interest of the face image (i.e., the lip feature region) can also be improved.
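The following sketch illustrates steps a and b under stated assumptions: the feature enhancer is modeled as a learned spatial attention mask over the lip features, and the feature converter as a projection plus normalization toward the second-face-image feature distribution; both module structures are assumptions, not the patented implementation.

```python
# Assumed module structures: the enhancer boosts the region of interest (lips,
# teeth, mouth-corner wrinkles) and suppresses the rest; the converter maps the
# enhanced features toward the image-feature distribution of the second face image.
import torch
import torch.nn as nn

class FeatureEnhancer(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, lip_feat: torch.Tensor) -> torch.Tensor:
        attn = self.mask(lip_feat)             # close to 1 inside the region of interest
        return lip_feat * attn                 # remaining parts are suppressed

class FeatureConverter(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, 1)
        self.norm = nn.InstanceNorm2d(channels, affine=True)

    def forward(self, enhanced: torch.Tensor) -> torch.Tensor:
        return self.norm(self.proj(enhanced))  # converted lip features

feat = torch.randn(1, 256, 32, 32)             # initial lip features from the encoder
converted = FeatureConverter()(FeatureEnhancer()(feat))
```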
Optionally, in other embodiments of the present application, steps S501 to S503 are performed on the voice-driven video to be processed after it is fed into the image processing model. Because the image processing model is provided with the first sub-model and the second sub-model, the naturalness of the lip features of each target object in the first video can be effectively improved. However, the sharpness of the lip features of the target object in the first video output by the image processing model may still not meet the business requirement. Therefore, to further improve the overall quality of the first video, after the first video is obtained, the following steps (i.e., S504 to S505) may be adopted to improve the sharpness of the lip features of the target user in each frame of the first video, as shown in the flowchart of fig. 5b:
S504: determining the first video frames in the first video whose lip definition does not meet a preset condition.
Wherein the preset condition may include at least one of:
the lip definition is lower than a preset definition;
the number of video frames having a lip definition below the preset definition is greater than the preset threshold.
The embodiment of the present application does not limit the preset conditions, or the manner and number of first video frames selected to be sent into the face-changing model for lip definition optimization; these can be determined flexibly according to actual quality requirements, business requirements, and the like.
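As a minimal sketch of S504 under the two example preset conditions above, the lip definition could be scored with the variance of the Laplacian over a lip crop (an assumed metric using OpenCV), and the frames could be sent for optimization only when enough of them fall below the preset definition; the thresholds and helper names are assumptions.

```python
# Assumed sharpness metric: variance of the Laplacian over a lip crop (OpenCV).
import cv2

def lip_sharpness(lip_crop) -> float:
    """lip_crop: BGR numpy array cropped around the lips of one frame."""
    gray = cv2.cvtColor(lip_crop, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def select_first_video_frames(frames, lip_crops, preset_definition=100.0,
                              preset_count=10):
    """Returns frames whose lip definition is below the preset definition, but
    only when their number exceeds the preset threshold (second condition)."""
    low = [f for f, crop in zip(frames, lip_crops)
           if lip_sharpness(crop) < preset_definition]
    return low if len(low) > preset_count else []
```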
S505: inputting the first video frames into a face-changing model, so as to obtain a second video by replacing the lip features in the first video frames with preset lip features of the target user.
The definition of the preset lip features is higher than that of the lip features in the first video frames. The face-changing model is trained based on a third training set, which includes a plurality of face images of the target user whose yaw angle is higher than a preset angle; a face image can be considered valid when the pose angle of the face is smaller than a preset threshold. For example, taking the yaw angle of the face as the pose angle: when the target user speaks with the face turned right by about 20 degrees or less, the training effect on the face-changing model is not affected, so such a face image is a valid frontal face image and can be used to train the face-changing model.
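A minimal sketch of assembling the third training set under the pose constraint discussed above might look as follows; the estimate_yaw() helper and the 20-degree default are assumptions.

```python
# Hypothetical helper: estimate_yaw(image) returns the face's yaw angle in degrees.
def build_third_training_set(face_images, estimate_yaw, max_yaw_deg=20.0):
    """Keep only face images whose pose (yaw) stays within the preset angle."""
    return [img for img in face_images if abs(estimate_yaw(img)) <= max_yaw_deg]
```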
In this embodiment, the training method of the face-changing model includes the following steps:
(1) A third training set is obtained.
For example, the third training set may include facial image information covering different ages, skin colors, face shapes, and both male and female faces.
(2) Preprocessing the third training set to obtain a candidate image set, where the candidate image set includes lip images with definition lower than a first threshold and lip images with definition higher than a second threshold.
For example, frames may be extracted from the third training set, and low-definition and high-definition lip-shaped picture data may then be obtained.
(3) Uniformly aligning the images in the candidate image set according to the lip shape to obtain a first image set and a second image set.
The first image set consists of the lip images with definition lower than the first threshold, and the second image set consists of the lip images with definition higher than the second threshold.
(4) Respectively extracting features from the first image set and the second image set to obtain a first coding feature and a second coding feature, respectively.
The first coding feature is the lip coding feature of the images whose definition is below the first threshold, and the second coding feature is the lip coding feature of the images whose definition is above the second threshold.
The first image set and the second image set are respectively input into two feature extractors with the same structure, yielding the low-definition and high-definition lip coding features, respectively.
(5) Respectively decoding the first coding feature and the second coding feature to obtain a first decoded image and a second decoded image.
The first coding feature is input to a low-definition decoder to obtain the first decoded image, and the second coding feature is input to a high-definition decoder to obtain the second decoded image.
(6) Loss values (e.g., L1 loss values) between the first decoded image and its original image and between the second decoded image and its original image are calculated respectively, and gradient back-propagation is performed to update the model parameters of the face-changing model until the loss values fall below an expected value, at which point training of the face-changing model ends and a trained face-changing model is obtained.
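The following is a hedged sketch of training steps (4) to (6): two feature extractors with the same structure encode the low-definition and high-definition lip image sets, two decoders reconstruct them, and an L1 loss against the original images drives gradient back-propagation; the network shapes, optimizer settings, and stopping value are assumptions.

```python
# Assumed shapes and optimizer settings; images are expected in [-1, 1].
import torch
import torch.nn as nn

def make_extractor():  # two feature extractors with the same structure
    return nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
                         nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU())

def make_decoder():
    return nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
                         nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())

enc_lo, enc_hi = make_extractor(), make_extractor()
dec_lo, dec_hi = make_decoder(), make_decoder()
params = (list(enc_lo.parameters()) + list(enc_hi.parameters()) +
          list(dec_lo.parameters()) + list(dec_hi.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
l1 = nn.L1Loss()
expected_value = 0.05                            # assumed stopping threshold

def training_step(batch_lo: torch.Tensor, batch_hi: torch.Tensor) -> float:
    recon_lo = dec_lo(enc_lo(batch_lo))          # first decoded image
    recon_hi = dec_hi(enc_hi(batch_hi))          # second decoded image
    loss = l1(recon_lo, batch_lo) + l1(recon_hi, batch_hi)   # L1 vs. originals
    optimizer.zero_grad()
    loss.backward()                              # gradient back-propagation
    optimizer.step()                             # update face-changing model
    return loss.item()                           # training ends once < expected_value
```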
In some embodiments, the face-changing model may adopt a DeepFaceLab-style model, and the mouth region is replaced by this model alone; this both addresses the face-changing task and improves the lip definition.
On the one hand, using the face-changing model to further optimize the first video allows the first video frames whose lip definition does not meet the preset condition to be replaced with the higher-definition lip features of the target user learned during pre-training, so that a second video is obtained in which both the naturalness and the definition of the lip features of the target user meet requirements. On the other hand, changing the lip features in the first video frame from low definition and low resolution to high resolution (for example, to a second video frame with a definition of 256×256, 480P, etc.) only raises the resolution of the second video frame; it does not affect the motion track of the lips of the target user in the first video frame in three-dimensional space, that is, it does not affect the naturalness of the target user's lips when speaking.
As can be seen from the video processing methods illustrated in fig. 2a to fig. 7, when the first video is generated by sequentially performing feature extraction and decoding processing on the target face in at least one first face image in the video to be processed, a second face image with lip naturalness better than that of the first face image can be generated.
On the one hand, even if the lip naturalness of the first face image in the current video to be processed is low, extracting the lip features from the first face image allows targeted processing of the low-naturalness lip features that degrade the playing effect of the video, and decoding the extracted lip features replaces those low-naturalness lip features. In other words, a second face image with high naturalness can be obtained by reconstructing the lip features in the first face image, which alleviates the poor viewing experience caused by unclear details at specific positions of the virtual object's face in application scenarios such as short-video platforms, live streaming, and online education.
On the other hand, by sequentially performing feature extraction and decoding on the target face in the first face image, a higher-quality second face image can be obtained quickly, which effectively improves the efficiency of obtaining a first video with high lip naturalness and lip definition, and further improves the launch speed of the first video presenting the avatar and the viewing experience of users.
Any technical features mentioned in the embodiments corresponding to any one of fig. 1 to fig. 7 are also applicable to the embodiments corresponding to fig. 8 to fig. 11 in the embodiments of the present application, and the following similar parts will not be repeated.
A video processing method according to an embodiment of the present application is described above, and a video processing apparatus that performs the video processing method is described below.
Referring to fig. 8, the video processing apparatus 80 shown in fig. 8 is applicable to processing a video to be processed that includes a plurality of first face images of at least one target user, for example in scenarios where a neural-network-based search engine crawls face images of a specific user from the network (e.g., face images of a target user historically released from a service server) and, after the face images are preprocessed, the target face images matching that specific user cannot be directly identified. The video processing apparatus 80 in the embodiment of the present application can implement the steps of the video processing method performed by the video processing apparatus 80 in the embodiments corresponding to any one of fig. 1 to 6 above. The functions implemented by the video processing apparatus 80 may be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the functions described above, and the modules may be software and/or hardware. The video processing apparatus 80 may include an input/output module 801 and a processing module 802. For the functional implementation of the input/output module 801 and the processing module 802, reference may be made to the operations performed in any of the embodiments corresponding to fig. 1 to 6, which are not repeated here.
In some embodiments, the input/output module 801 may be configured to acquire the video to be processed, where the video to be processed includes a plurality of first face images of the target user and is obtained by driving an initial video with a preset audio;
the processing module 802 may be configured to perform feature extraction on a lip feature of a target user in at least one of the first face images in the video to be processed acquired by the input/output module 801, so as to obtain an initial lip feature of the target user;
and decoding the initial lip feature to generate a first video, wherein the first video comprises a plurality of second face images of the target user, and the naturalness of the lip feature of the target user in the second face images is higher than that of the lip feature of the target user in the first face images.
Alternatively, the video processing apparatus 80 described above is implemented based on an image processing model, the image processing model including a first sub-model, and the first sub-model including an encoder, a first discriminator, and a first decoder; the first sub-model is obtained based on training with a first training set;
the encoder is used for extracting the characteristics of the first face image to obtain initial lip characteristics of a target user, and inputting the initial lip characteristics into the first decoder;
The first decoder is used for decoding the initial lip features to obtain the second face image, wherein the definition of the second face image is higher than that of the first face image.
Optionally, the image processing model further includes a second decoder and a second discriminator, and the video processing apparatus 80 may further include:
a training module, configured to train the encoder, the first decoder, and the first discriminator using the first training set, and to train the encoder, the second decoder, and the second discriminator using the second training set, until a first loss function value of the first sub-model reaches a first preset loss threshold and a second loss function value of the second sub-model reaches a second preset loss threshold; the first training set comprises video materials of a plurality of users speaking; the second training set comprises historical videos of the target user speaking;
the adjustment module is used for adjusting parameters of the first sub-model and the second sub-model according to the first loss function value and the second loss function value, until the difference in naturalness between the training result output by the first sub-model and the first training set does not exceed a first preset naturalness threshold, and the difference in naturalness between the training result output by the second sub-model and the second training set does not exceed a second preset naturalness threshold.
Optionally, after the first video is generated, the video processing apparatus further includes a face-changing model: the processing module is further configured to determine the first video frames in the first video whose lip definition does not meet a preset condition;
and to input the first video frames into the face-changing model through the input/output module 801, so as to obtain a second video by replacing the lip features in the first video frames with preset lip features of the target user; the definition of the preset lip features is higher than that of the lip features in the first video frames; the face-changing model is obtained based on training with a third training set, and the third training set comprises a plurality of face images of the target user whose yaw angle is higher than a preset angle.
Optionally, the training module includes:
the encoder is used for extracting the characteristics of the input first training set to obtain first lip characteristics of the target user, wherein the first lip characteristics are the lip characteristics of the target user contained in the images in the first training set;
the first decoder is used for decoding the first lip feature to obtain a first target image;
the first discriminator is used for judging the confidence level of the first target image according to the first target image and the first training set;
The first calculation unit is used for calculating the loss function of the first sub-model according to the confidence coefficient of the first target image to obtain a first loss function value;
and the first parameter adjusting unit is used for adjusting the model parameters of the first sub-model with the first loss function value as the back-propagation quantity, so as to train the first sub-model until the first loss function value reaches the first preset loss threshold.
Optionally, the training module includes:
the encoder is used for extracting the characteristics of the input second training set to obtain second lip characteristics of the target user, wherein the second lip characteristics are the lip characteristics of the target user contained in the second training set;
the second decoder is used for decoding the second lip feature to obtain a second target image;
the second discriminator is used for judging the confidence level of the second target image according to the second target image and the second training set;
the second calculation unit is used for calculating the loss function of the second sub-model according to the confidence coefficient of the second target image to obtain a second loss function value;
and the second parameter adjusting unit is used for adjusting the model parameters of the second sub-model with the second loss function value as the back-propagation quantity, so as to train the second sub-model until the second loss function value reaches the second preset loss threshold.
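For illustration, a single training step of the first sub-model as the units above describe it might look like the sketch below: the encoder and first decoder produce a first target image, the first discriminator outputs a confidence, and the resulting loss is back-propagated to adjust the sub-model; the architectures, the binary cross-entropy loss form, and the optimizer settings are assumptions.

```python
# Assumed architectures, adversarial BCE loss, and Adam settings.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
                        nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU())
first_decoder = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
                              nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())
first_discriminator = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),
                                    nn.Conv2d(64, 1, 4, 2, 1), nn.Sigmoid())
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(first_decoder.parameters()), lr=2e-4)
bce = nn.BCELoss()

def first_submodel_step(first_training_batch: torch.Tensor) -> float:
    target = first_decoder(encoder(first_training_batch))  # first target image
    confidence = first_discriminator(target)               # confidence of the image
    # Push the generated image toward being judged as a real training image.
    loss = bce(confidence, torch.ones_like(confidence))    # first loss function value
    opt.zero_grad()
    loss.backward()                                        # back-propagation quantity
    opt.step()                                             # adjust the first sub-model
    return loss.item()                                     # compare to the preset threshold
```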
Optionally, the second training set is derived from videos generated using a speech-driven model; the quality of the images in the second training set is inferior to that of the images in the first training set, and the face images in the training images belong to the same person as the current target user.
The specific manner in which the respective modules perform the operations in the apparatus of the above embodiments has been described in detail in the embodiments related to the method, and will not be described in detail herein.
As can be seen from the video processing apparatus illustrated in fig. 8, when the first video is generated by sequentially performing feature extraction and decoding on the target face in at least one first face image of the video to be processed, a second face image of better quality than the first face image can be generated. On the one hand, even if the image quality of a first face image in the current video to be processed is low (for example, its definition is low), extracting the lip features from the first face image allows targeted processing of the low-definition lip features that degrade the playing effect of the video, and decoding the extracted lip features replaces those low-definition lip features; that is, a second face image with high-definition lip features can be obtained by reconstructing the lip features in the first face image, which alleviates the poor viewing experience caused by unclear details at specific positions of the virtual object's face in application scenarios such as short-video platforms, live streaming, and online education. On the other hand, by sequentially performing feature extraction and decoding on the target face in the first face image, a higher-quality second face image can be obtained quickly, which effectively improves the efficiency of obtaining a high-quality first video, and further improves the launch speed of the first video presenting the avatar and the viewing experience of users.
The video processing apparatus 80 for performing the video processing method in the embodiment of the present application has been described above from the viewpoint of modularized functional entities; it is described below from the viewpoint of hardware processing. It should be noted that, in the embodiment shown in fig. 8 of the present application, the physical device corresponding to the input/output module 801 may be an input/output unit, a transceiver, a radio frequency circuit, a communication module, an input/output interface, or the like, and the physical device corresponding to the processing module 802 may be a processor. The video processing apparatus 80 shown in fig. 8 may have the structure of the electronic device 900 shown in fig. 9; in that case, the memory 910 and the processor 920 in fig. 9 can implement the same or similar functions as the input/output module 801 and the processing module 802 provided in the foregoing apparatus embodiment corresponding to the video processing apparatus 80, and the memory 910 in fig. 9 stores the computer program that the processor 920 invokes when performing the foregoing video processing method.
The embodiment of the present application further provides another video processing apparatus, as shown in fig. 10. For convenience of explanation, only the portions related to the embodiments of the present application are shown; for specific technical details that are not disclosed, please refer to the method portions of the embodiments of the present application. The video processing apparatus may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (English: Personal Digital Assistant, abbreviated: PDA), a point-of-sale terminal (English: Point of Sales, abbreviated: POS), a vehicle-mounted computer, and the like. The mobile phone is taken as an example:
Fig. 10 is a block diagram showing a part of the structure of a mobile phone related to the video processing apparatus provided in the embodiment of the present application. Referring to fig. 10, the mobile phone includes: radio Frequency (RF) circuit 710, memory 720, input unit 730, display unit 740, sensor 780, audio circuit 760, wireless-fidelity (Wi-Fi) module 7100, processor 780, and power supply 790. It will be appreciated by those skilled in the art that the handset construction shown in fig. 10 is not limiting of the handset and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The following describes the components of the mobile phone in detail with reference to fig. 10:
The RF circuit 710 may be configured to receive and transmit signals during information transmission and reception or during a call; in particular, after receiving downlink information from a base station, it forwards the information to the processor 780 for processing, and it also sends uplink data to the base station. Generally, the RF circuit 710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (English full name: Low Noise Amplifier, English abbreviation: LNA), a duplexer, and the like. In addition, the RF circuit 710 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to the global system for mobile communications (English: Global System of Mobile communication, abbreviated: GSM), general packet radio service (English: General Packet Radio Service, abbreviated: GPRS), code division multiple access (English: Code Division Multiple Access, abbreviated: CDMA), wideband code division multiple access (English: Wideband Code Division Multiple Access, abbreviated: WCDMA), long term evolution (English: Long Term Evolution, abbreviated: LTE), email, short message service (English: Short Messaging Service, abbreviated: SMS), and the like.
The memory 720 may be used to store software programs and modules, and the processor 780 performs various functional applications and data processing of the handset by running the software programs and modules stored in the memory 720. The memory 720 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, memory 720 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 730 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the handset. In particular, the input unit 730 may include a touch panel 731 and other input devices 732. The touch panel 731, also referred to as a touch screen, may collect touch operations thereon or thereabout by a user (e.g., operations of the user on or thereabout the touch panel 731 using any suitable object or accessory such as a finger, a stylus, etc.), and drive the corresponding connection device according to a predetermined program. Alternatively, the touch panel 731 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device and converts it into touch point coordinates, which are then sent to the processor 780, and can receive commands from the processor 780 and execute them. In addition, the touch panel 731 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 730 may include other input devices 732 in addition to the touch panel 731. In particular, the other input devices 732 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, mouse, joystick, etc.
The display unit 740 may be used to display information input by a user or information provided to the user and various menus of the mobile phone. The display unit 740 may include a display panel 741, and optionally, the display panel 741 may be configured in the form of a liquid crystal display (english: liquid Crystal Display, abbreviated as LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 731 may cover the display panel 741, and when the touch panel 731 detects a touch operation thereon or thereabout, the touch operation is transferred to the processor 780 to determine the type of touch event, and then the processor 780 provides a corresponding visual output on the display panel 741 according to the type of touch event. Although in fig. 10, the touch panel 731 and the display panel 741 are two separate components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 731 and the display panel 741 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 780, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 741 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 741 and/or the backlight when the mobile phone moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the acceleration in all directions (generally three axes), and can detect the gravity and direction when stationary, and can be used for applications of recognizing the gesture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and knocking), and the like; other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc. that may also be configured with the handset are not described in detail herein.
The audio circuit 760, the speaker 761, and the microphone 762 may provide an audio interface between the user and the mobile phone. The audio circuit 760 may transmit an electrical signal converted from received audio data to the speaker 761, and the speaker 761 converts the electrical signal into a sound signal for output; on the other hand, the microphone 762 converts a collected sound signal into an electrical signal, which is received by the audio circuit 760 and converted into audio data; after the audio data is processed by the processor 780, it is sent via the RF circuit 710 to, for example, another mobile phone, or output to the memory 720 for further processing.
Wi-Fi belongs to a short-distance wireless transmission technology, and a mobile phone can help a user to send and receive e-mails, browse webpages, access streaming media and the like through a Wi-Fi module 7100, so that wireless broadband Internet access is provided for the user. Although fig. 10 shows Wi-Fi module 7100, it is understood that it does not belong to the necessary constitution of the mobile phone, and can be omitted entirely as required within the scope of not changing the essence of the application.
The processor 780 is a control center of the mobile phone, connects various parts of the entire mobile phone using various interfaces and lines, and performs various functions and processes of the mobile phone by running or executing software programs and/or modules stored in the memory 720 and calling data stored in the memory 720, thereby performing overall monitoring of the mobile phone. Optionally, the processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor that primarily processes operating systems, user interfaces, applications, etc., with a modem processor that primarily processes wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 780.
The handset further includes a power supply 790 (e.g., a battery) for powering the various components, which may be logically connected to the processor 780 through a power management system, thereby performing functions such as managing charging, discharging, and power consumption by the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.
In the embodiment of the present application, the processor 780 included in the mobile phone further has a control function for executing the above method performed by the video processing apparatus 80 shown in fig. 8. The steps performed by the video processing apparatus in the above embodiments may be based on the structure of the mobile phone shown in fig. 10.
The embodiment of the present application further provides another video processing apparatus for implementing the video processing method. As shown in fig. 11, fig. 11 is a schematic diagram of a server structure provided in the embodiment of the present application. The server 100 may vary considerably in configuration or performance, and may include one or more central processing units (English: central processing units, abbreviated: CPU) 1022 (for example, one or more processors), a memory 1032, and one or more storage media 1030 (for example, one or more mass storage devices) storing application programs 1042 or data 1044. The memory 1032 and the storage medium 1030 may be transitory or persistent storage. The program stored on the storage medium 1030 may include one or more modules (not shown), and each module may include a series of instruction operations on the server. Further, the central processing unit 1022 may be configured to communicate with the storage medium 1030 to execute, on the server 100, the series of instruction operations in the storage medium 1030.
The server 100 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like.
The steps performed by the service server (e.g., the video processing apparatus 80 shown in fig. 8) in the above embodiments may be based on the structure of the server 100 shown in fig. 11. For example, the processor 1022 may perform the following operations by invoking the instructions in the memory 1032:
acquiring a video to be processed through an input/output interface 1058, wherein the video to be processed comprises a plurality of first face images of a target user, and the video to be processed is obtained by driving an initial video by using preset audio;
feature extraction is performed on the lip features of the target user in at least one first face image in the video to be processed acquired through the input/output interface 1058, so as to obtain initial lip features of the target user;
and decoding the initial lip feature to generate a first video, wherein the first video comprises a plurality of second face images of the target user, and the naturalness of the lip feature of the target user in the second face images is higher than that of the lip feature of the target user in the first face images.
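Tying the recap above together, a minimal end-to-end sketch of S501 to S505 is shown below; the image_model, face_model, and lip_sharpness callables are hypothetical stand-ins for the trained image processing model, the face-changing model, and a lip-definition metric.

```python
# Hypothetical stand-ins: image_model (S501-S503), lip_sharpness (S504),
# face_model (S505).
def process_video(frames, image_model, face_model, lip_sharpness,
                  preset_definition=100.0):
    # S501-S503: run every frame through the image processing model.
    first_video = [image_model(frame) for frame in frames]
    second_video = []
    for second_face in first_video:
        if lip_sharpness(second_face) < preset_definition:    # S504
            second_video.append(face_model(second_face))      # S505: replace lips
        else:
            second_video.append(second_face)
    return second_video
```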
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and modules described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program is loaded and executed on a computer, the flow or functions described in accordance with embodiments of the present application are fully or partially produced. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a storage medium or transmitted from one storage medium to another storage medium, for example, from one website, computer, server, or data center by wired (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The storage medium may be any available medium that can be stored by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Disk (SSD)), etc.
The foregoing describes in detail the technical solution provided by the embodiments of the present application, in which specific examples are applied to illustrate the principles and implementations of the embodiments of the present application, where the foregoing description of the embodiments is only used to help understand the methods and core ideas of the embodiments of the present application; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope according to the ideas of the embodiments of the present application, the present disclosure should not be construed as limiting the embodiments of the present application in view of the above.

Claims (10)

1. A method of video processing, the method comprising:
acquiring a video to be processed, wherein the video to be processed comprises a plurality of first face images of a target user, and the video to be processed is obtained by utilizing a preset audio to drive an initial video;
feature extraction is carried out on the lip features of the target user in at least one first face image in the video to be processed, and initial lip features of the target user are obtained;
and decoding the initial lip feature to generate a first video, wherein the first video comprises a plurality of second face images of the target user, and the naturalness of the lip feature of the target user in the second face images is higher than that of the lip feature of the target user in the first face images.
2. The video processing method of claim 1, wherein the method is implemented based on an image processing model, the image processing model comprising a first sub-model, and the first sub-model comprising an encoder, a first discriminator, and a first decoder; the first sub-model is obtained based on training with a first training set; and the extracting of the lip features of the target user in at least one first face image in the video to be processed to obtain initial lip features of the target user, and the decoding of the initial lip features to generate a first video, comprise:
the encoder performs feature extraction on the first face image to obtain initial lip features of the target user;
the encoder inputting the initial lip feature into the first decoder;
the first decoder decodes the initial lip feature to obtain the second face image, and the matching degree of the lip feature of the target user in the second face image and the audio is higher than that of the lip feature of the target user in the first face image.
3. The video processing method of claim 2, wherein the image processing model further comprises a second sub-model comprising an encoder, a second discriminator, and a second decoder; the method further comprises:
Acquiring a first training set and a second training set, wherein the first training set comprises a plurality of video materials for speaking of users; the second training set comprises historical videos of speaking of the target user;
training the encoder, first decoder and first discriminator using the first training set and training the encoder, second decoder and second discriminator using the second training set until a first loss function value reaches a first preset loss threshold for the first sub-model and a second loss function value reaches a second preset loss threshold for the second sub-model;
and adjusting parameters of the first sub-model and the second sub-model according to the first loss function value and the second loss function value until the difference between the training result output by the first sub-model and the naturalness of the first training set does not exceed a first preset naturalness threshold value, and the difference between the training result output by the second sub-model and the naturalness of the second training set does not exceed a second preset naturalness threshold value.
4. A video processing method according to claim 2 or 3, wherein the video material in the first training set meets at least one of the following characteristics:
The playing time length is smaller than the preset time length;
different sexes;
different ages;
different mouth shapes;
different shooting focal lengths;
different poses when speaking;
different effects when speaking;
different moods when speaking;
alternatively, different languages;
the video material in the second training set meets at least one of the following characteristics:
habit syllable data of the target user when speaking in the history time;
or effect syllable data of continuous speaking of the target user in the history time.
5. The method of video processing according to claim 4, wherein after the generating the first video, the method further comprises:
determining a first video frame of which the lip definition in the first video does not meet a preset condition;
inputting the first video frame into a face-changing model, so as to obtain a second video by replacing the lip feature in the first video frame with a preset lip feature of the target user; the definition of the preset lip feature is higher than that of the lip feature in the first video frame; the face-changing model is obtained based on training with a third training set, and the third training set comprises a plurality of face images of the target user, wherein the yaw angle of the face images is higher than a preset angle.
6. The video processing method according to claim 5, wherein the training mode of the face-changing model includes:
Preprocessing the obtained third training set to obtain a candidate image set, wherein the candidate image set comprises a lip image with definition lower than a first threshold value and a lip image with definition higher than a second threshold value;
uniformly aligning the candidate image set according to the lip shape to obtain a first image set and a second image set;
respectively extracting features of the first image set and the second image set to respectively obtain a first coding feature and a second coding feature;
respectively decoding the first coding feature and the second coding feature to obtain a first decoded image and a second decoded image;
and respectively calculating the loss values of the first decoding image, the second decoding image and the original image, and carrying out gradient feedback to update the model parameters of the face-changing model until the loss values are lower than expected values, and ending the training of the face-changing model.
7. A video processing model, wherein the video processing model comprises an image processing model and a face-changing model, and the image processing model comprises a first sub-model and a second sub-model;
the image processing model is used for carrying out lip naturalness optimization on the video to be processed to obtain a first video, and for inputting the first video into the face-changing model;
The face changing model is used for carrying out lip feature replacement on the first video input from the image processing model to obtain a second video;
the first sub-model is obtained based on a first training set, the second sub-model is obtained based on a second training set, and the first training set comprises video materials of a plurality of users speaking; the second training set includes historical videos of the target user speaking.
8. A video processing apparatus, the video processing apparatus comprising:
the input/output module is used for acquiring a video to be processed, wherein the video to be processed comprises a plurality of first face images of a target user, and the video to be processed is obtained by driving an initial video by using preset audio;
the processing module is used for performing feature extraction on the lip feature of the target user in at least one first face image in the video to be processed acquired by the input/output module, so as to obtain the initial lip feature of the target user;
and decoding the initial lip feature to generate a first video, wherein the first video comprises a plurality of second face images of the target user, and the naturalness of the lip feature of the target user in the second face images is higher than that of the lip feature of the target user in the first face images.
9. An electronic device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor causes the processor to perform the method of any of claims 1 to 7.
10. A storage medium having stored thereon executable code which when executed by a processor of an electronic device causes the processor to perform the method of any of claims 1 to 7.
CN202211723702.3A 2022-12-30 2022-12-30 Video processing method, device and storage medium Active CN116229311B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211723702.3A CN116229311B (en) 2022-12-30 2022-12-30 Video processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211723702.3A CN116229311B (en) 2022-12-30 2022-12-30 Video processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN116229311A true CN116229311A (en) 2023-06-06
CN116229311B CN116229311B (en) 2023-12-05

Family

ID=86579604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211723702.3A Active CN116229311B (en) 2022-12-30 2022-12-30 Video processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116229311B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116828129A (en) * 2023-08-25 2023-09-29 小哆智能科技(北京)有限公司 Ultra-clear 2D digital person generation method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021103698A1 (en) * 2019-11-29 2021-06-03 广州华多网络科技有限公司 Face swapping method, device, electronic apparatus, and storage medium
CN114820891A (en) * 2022-04-25 2022-07-29 中国平安人寿保险股份有限公司 Lip shape generating method, device, equipment and medium
CN115526772A (en) * 2022-06-28 2022-12-27 北京瑞莱智慧科技有限公司 Video processing method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Jianming, Tao Hong, Wang Liangmin, Zhan Yongzhao, Song Shunlin: "SVD-based visual speech feature extraction from lip movements", Journal of Jiangsu University (Natural Science Edition), no. 05, pages 63-66

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116828129A (en) * 2023-08-25 2023-09-29 小哆智能科技(北京)有限公司 Ultra-clear 2D digital person generation method and system
CN116828129B (en) * 2023-08-25 2023-11-03 小哆智能科技(北京)有限公司 Ultra-clear 2D digital person generation method and system

Also Published As

Publication number Publication date
CN116229311B (en) 2023-12-05

Similar Documents

Publication Publication Date Title
WO2021043053A1 (en) Animation image driving method based on artificial intelligence, and related device
CN110418208B (en) Subtitle determining method and device based on artificial intelligence
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
JP7312853B2 (en) AI-BASED VOICE-DRIVEN ANIMATION METHOD AND APPARATUS, DEVICE AND COMPUTER PROGRAM
CN111209440B (en) Video playing method, device and storage medium
CN112040263A (en) Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
CN112740709A (en) Gated model for video analysis
US11670015B2 (en) Method and apparatus for generating video
WO2021098338A1 (en) Model training method, media information synthesizing method, and related apparatus
CN114401417B (en) Live stream object tracking method, device, equipment and medium thereof
US20230368461A1 (en) Method and apparatus for processing action of virtual object, and storage medium
CN110234018A (en) Multimedia content description generation method, training method, device, equipment and medium
CN116229311B (en) Video processing method, device and storage medium
CN111368127B (en) Image processing method, image processing device, computer equipment and storage medium
CN116137673A (en) Digital human expression driving method and device, equipment and medium thereof
CN115526772B (en) Video processing method, device, equipment and storage medium
CN116248811B (en) Video processing method, device and storage medium
CN116453005A (en) Video cover extraction method and related device
CN114567693B (en) Video generation method and device and electronic equipment
CN113723168A (en) Artificial intelligence-based subject identification method, related device and storage medium
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
Guo et al. Sign-to-911: Emergency Call Service for Sign Language Users with Assistive AR Glasses
CN116708899B (en) Video processing method, device and storage medium applied to virtual image synthesis
CN116708920B (en) Video processing method, device and storage medium applied to virtual image synthesis
Hao et al. Context-adaptive online reinforcement learning for multi-view video summarization on Mobile devices

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant