CN114007099A - Video processing method and device for video processing - Google Patents

Video processing method and device for video processing

Info

Publication number
CN114007099A
Authority
CN
China
Prior art keywords
image
target
face
fusion
key point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111298553.6A
Other languages
Chinese (zh)
Inventor
辛晓哲
刘怀飙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202111298553.6A
Publication of CN114007099A

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Abstract

The embodiment of the invention provides a video processing method and device and a device for video processing. The method comprises the following steps: performing target part fusion processing on a first image and a second image through a style-based generative model to generate a fusion image, wherein the fusion image comprises target part characteristics of the first image and target part characteristics of the second image; carrying out target part detection on each frame of image in an original video, and determining a target frame image containing a target part to be replaced; and performing target part replacement processing on the target frame image in the original video by using the fusion image to obtain a target video. The embodiment of the invention can reduce the cost of recording template images and improve the efficiency and quality of video processing.

Description

Video processing method and device for video processing
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a video processing method and apparatus, and an apparatus for video processing.
Background
With the popularization of video applications such as short videos and live broadcasts, video face changing has emerged as a technology in recent years that can replace a target face in a video with a template face. Video face changing is widely used in scenes such as content production, movie production and entertainment video production.
However, existing video face changing schemes require long and repeated training, which consumes considerable resources. In addition, if the face image of a real user is used as the template face to replace the target face in the video image, the cost of recording the template image (the image containing the template face) is high, and the efficiency of video processing is low.
Disclosure of Invention
Embodiments of the present invention provide a video processing method and apparatus, and an apparatus for video processing, which can reduce the cost of recording a template image and improve the efficiency and quality of video processing.
In order to solve the above problem, an embodiment of the present invention discloses a video processing method, where the method includes:
performing target part fusion processing on a first image and a second image through a style-based generative model to generate a fusion image, wherein the fusion image comprises target part characteristics of the first image and target part characteristics of the second image;
carrying out target part detection on each frame of image in an original video, and determining a target frame image containing a target part to be replaced;
and performing target part replacement processing on the target frame image in the original video by using the fusion image to obtain a target video.
Optionally, the target part comprises any one or more of: face, lips, hands, legs, body.
Optionally, the target portion is a human face, and performing target portion replacement processing on a target frame image in the original video by using the fused image to obtain a target video includes:
extracting a first face key point sequence of the fused image and a second face key point sequence of the target frame image;
determining a migration face key point sequence of the target frame image based on the first face key point sequence and the second face key point sequence by utilizing a triangular mesh deformation transfer algorithm;
and carrying out triangulation mapping processing on the key point sequence of the migrated human face of the target frame image by using the fused image to obtain a target video.
Optionally, the determining, by using a triangular mesh deformation transfer algorithm, a migration face key point sequence of the target frame image based on the first face key point sequence and the second face key point sequence includes:
calculating a transformation matrix between second key point sequences of a previous frame image and a next frame image in two adjacent target frame images based on a triangular mesh deformation transfer algorithm;
taking the first face key point sequence as a migration face key point sequence of a first target frame image in the original video;
and calculating the migration face key point sequence of the next frame image according to the migration face key point sequence of the previous frame image and the transformation matrix to obtain the migration face key point sequence of each target frame image in the original video.
Optionally, the target portion is a human face, and performing target portion replacement processing on a target frame image in the original video by using the fused image to obtain a target video includes:
acquiring a first texture image corresponding to the fusion image and a second texture image corresponding to the target frame image;
performing fusion processing on the first texture image and the second texture image to obtain a fusion texture image;
and rendering the target frame image in the original video based on the fusion texture image to obtain a target video.
Optionally, the obtaining a first texture image corresponding to the fusion image and a second texture image corresponding to the target frame image includes:
respectively inputting the fusion image and the target frame image into a pre-trained three-dimensional face reconstruction network for three-dimensional face reconstruction processing to obtain three-dimensional model parameters of the fusion image and three-dimensional model parameters of the target frame image;
respectively constructing a three-dimensional face model corresponding to the fusion image and a three-dimensional face model corresponding to the target frame image according to the three-dimensional model parameters of the fusion image and the three-dimensional model parameters of the target frame image;
and drawing a first texture image corresponding to the fusion image based on the three-dimensional face model corresponding to the fusion image, and drawing a second texture image corresponding to the target frame image based on the three-dimensional face model corresponding to the target frame image.
Optionally, the style-based generative model may include a mapping network and a synthesis network, and the performing target part fusion processing on the first image and the second image by the style-based generative model to generate a fused image includes:
acquiring a first image and a second image to be fused;
encoding the first image and the second image into input hidden codes of the mapping network respectively;
inputting the input hidden code corresponding to the first image and the input hidden code corresponding to the second image into the mapping network, and respectively converting the input hidden code corresponding to the first image and the input hidden code corresponding to the second image into intermediate hidden codes through the mapping network;
and inputting the intermediate hidden code corresponding to the first image and the intermediate hidden code corresponding to the second image into the synthesis network for fusion processing to obtain a fused image.
Optionally, the style-based generative model comprises a StyleGAN generative model or a StyleGAN2 generative model.
In another aspect, an embodiment of the present invention discloses a video processing apparatus, where the apparatus includes:
the fusion processing module is used for carrying out target part fusion processing on the first image and the second image through a style-based generative model to generate a fusion image, and the fusion image comprises target part characteristics of the first image and target part characteristics of the second image;
the target detection module is used for detecting a target part of each frame of image in the original video and determining a target frame image containing the target part to be replaced;
and the replacement processing module is used for performing target part replacement processing on the target frame image in the original video by using the fusion image to obtain a target video.
Optionally, the target part comprises any one or more of: face, lips, hands, legs, body.
Optionally, the target region is a human face, and the replacement processing module includes:
the key point extraction submodule is used for extracting a first face key point sequence of the fused image and extracting a second face key point sequence of the target frame image;
a key point migration submodule, configured to determine a migration face key point sequence of the target frame image based on the first face key point sequence and the second face key point sequence by using a triangular mesh deformation transfer algorithm;
and the mapping processing submodule is used for carrying out triangulation mapping processing on the transferred human face key point sequence of the target frame image by using the fusion image to obtain a target video.
Optionally, the key point migration sub-module includes:
the matrix calculation unit is used for calculating a transformation matrix between second key point sequences of a previous frame image and a next frame image in two adjacent target frame images based on a triangular mesh deformation transfer algorithm;
a sequence determining unit, configured to use the first face key point sequence as a migration face key point sequence of a first target frame image in the original video;
and the migration determining unit is used for calculating the migration face key point sequence of the next frame image according to the migration face key point sequence of the previous frame image and the transformation matrix to obtain the migration face key point sequence of each target frame image in the original video.
Optionally, the target region is a human face, and the replacement processing module includes:
the texture obtaining submodule is used for obtaining a first texture image corresponding to the fusion image and a second texture image corresponding to the target frame image;
the texture fusion submodule is used for carrying out fusion processing on the first texture image and the second texture image to obtain a fusion texture image;
and the rendering processing submodule is used for rendering the target frame image in the original video based on the fusion texture image to obtain the target video.
Optionally, the texture fetching sub-module includes:
the three-dimensional reconstruction unit is used for respectively inputting the fusion image and the target frame image into a pre-trained three-dimensional face reconstruction network to carry out three-dimensional face reconstruction processing so as to obtain three-dimensional model parameters of the fusion image and three-dimensional model parameters of the target frame image;
the face construction unit is used for respectively constructing a three-dimensional face model corresponding to the fusion image and a three-dimensional face model corresponding to the target frame image according to the three-dimensional model parameters of the fusion image and the three-dimensional model parameters of the target frame image;
and the texture drawing unit is used for drawing a first texture image corresponding to the fused image based on the three-dimensional face model corresponding to the fused image and drawing a second texture image corresponding to the target frame image based on the three-dimensional face model corresponding to the target frame image.
Optionally, the style-based generative model may include a mapping network and a synthesis network, and the fusion processing module includes:
the image acquisition submodule is used for acquiring a first image and a second image to be fused;
an image encoding sub-module, configured to encode the first image and the second image into input hidden codes of the mapping network, respectively;
the hidden code conversion sub-module is used for inputting the input hidden codes corresponding to the first image and the input hidden codes corresponding to the second image into the mapping network, and converting the input hidden codes corresponding to the first image and the input hidden codes corresponding to the second image into intermediate hidden codes through the mapping network;
and the fusion processing submodule is used for inputting the intermediate hidden code corresponding to the first image and the intermediate hidden code corresponding to the second image into the synthesis network for fusion processing to obtain a fusion image.
Optionally, the style-based generative model comprises a StyleGAN generative model or a StyleGAN2 generative model.
In yet another aspect, an embodiment of the present invention discloses a device for video processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs comprise instructions for performing one or more of the video processing methods described above.
In yet another aspect, embodiments of the invention disclose a machine-readable medium having instructions stored thereon, which when executed by one or more processors of an apparatus, cause the apparatus to perform one or more of the video processing methods described above.
In yet another aspect, an embodiment of the present invention discloses a computer program product, which comprises computer instructions stored in a computer-readable storage medium and adapted to be read and executed by a processor, so as to cause a computer device having the processor to perform one or more of the video processing methods described above.
The embodiment of the invention has the following advantages:
the video processing method provided by the embodiment of the invention mainly comprises the following two stages: a fusion phase and a replacement phase. And the fusion stage is used for carrying out target part fusion processing on the first image and the second image to generate a fusion image. The fusion image includes a virtual target portion image generated by performing fusion processing on the target portion in the first image and the target portion in the second image, and the virtual target portion image fuses the features of the target portion in the first image and the target portion in the second image. And the replacing stage is used for replacing the target part in the original video with the virtual target part in the fused image to obtain the target video. The embodiment of the invention replaces the target part in the original video by using the generated virtual target part image, thereby reducing the cost of recording the template image and improving the efficiency of video processing. In addition, the embodiment of the invention can generate the fusion images containing the virtual target part images in batch, can perform video processing in batch and further improves the efficiency of video processing. Furthermore, the embodiment of the invention generates the fusion image through the pattern-based generation model, so that the quality of the fusion image in the details such as hair, wrinkles, skin color and the like can be improved, the image quality in the target video is further improved, and the replaced target video is more real and natural.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of the steps of one embodiment of a video processing method of the present invention;
fig. 2 is an application scene architecture diagram of a video processing method according to an embodiment of the present invention;
FIG. 3 is a schematic view of a video processing flow in one example of the invention;
FIG. 4 is a block diagram of a video processing apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram of an apparatus 800 for video processing of the present invention;
fig. 6 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms first, second and the like in the description and in the claims of the present invention are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that embodiments of the invention may be practiced other than those illustrated or described herein, and that the objects identified as "first," "second," etc. are generally a class of objects and do not limit the number of objects, e.g., a first object may be one or more. Furthermore, the term "and/or" in the specification and claims is used to describe an association relationship of associated objects, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The term "plurality" in the embodiments of the present invention means two or more, and other terms are similar thereto.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a video processing method according to the present invention is shown, which may specifically include the following steps:
step 101, performing target part fusion processing on a first image and a second image through a style-based generative model to generate a fusion image, wherein the fusion image comprises target part characteristics of the first image and target part characteristics of the second image;
102, carrying out target part detection on each frame of image in an original video, and determining a target frame image containing a target part to be replaced;
and 103, performing target part replacement processing on a target frame image in the original video by using the fusion image to obtain a target video.
The video processing method provided by the embodiment of the invention can be applied to a terminal and/or a server. Referring to fig. 2, an application scene architecture diagram of a video processing method according to an embodiment of the present invention is shown. As shown in fig. 2, an application scenario of the embodiment of the present invention may include a terminal 201 and/or a server 202. The terminal 201 and the server 202 are connected through a wireless or wired network. The terminal 201 includes, but is not limited to, an intelligent terminal, a computer, a Personal Digital Assistant (PDA), a tablet computer, an in-vehicle device, and other electronic devices. The server 202 may be a single server, a server cluster composed of several servers, or a cloud computing center. The terminal 201 or the server 202 may execute the video processing method provided in the embodiment of the present invention on its own, and the terminal 201 and the server 202 may also cooperate to execute the video processing method provided in the embodiment of the present invention.
The video processing method provided by the embodiment of the invention can comprise two stages: a fusion phase and a replacement phase. And the fusion stage is used for carrying out target part fusion processing on the first image and the second image to generate a fusion image. The fused image may be a single image of the avatar. Embodiments of the present invention refer to the target portion of the avatar as a virtual target portion, and the virtual target portion image in the fused image fuses the features of the target portion in the first image and the target portion in the second image. And the replacing stage is used for replacing the target part of the target object in the original video with the virtual target part in the fused image to obtain the target video.
In an alternative embodiment of the present invention, the target part may include, but is not limited to, any one or more of the following: face, lips, hands, legs, body.
The embodiment of the invention does not limit the target part needing to be replaced. For example, the face of the target person in the original video may be replaced with the face of the avatar in the fused image (containing five sense organs). As another example, the lips of the target person in the original video may be replaced with the lips of the avatar in the fused image. As another example, the hand of the target person in the original video may be replaced with the hand of the avatar in the fused image, and so on.
It should be noted that the avatar in the fusion image and the target object in the target frame image are not limited to human figures (real or virtual); they may also be animal figures (real or virtual).
In one example, the first image includes character E, the second image includes character F, and the target portion is an eye. In the fusion stage, eye fusion processing is carried out on the first image and the second image to obtain a fusion image, wherein the fusion image comprises a virtual character G, and the eye features of the character E and the eye features of the character F are fused with the eyes of the virtual character G. In the replacement stage, assuming that eye replacement needs to be performed on the character image D in the original video, eye image detection is performed on each frame of image in the original video, a target frame image containing the eye image of the character image D is determined, and the eyes in the target frame image are replaced with the eyes of the fused image, so that the target video is obtained.
For convenience of description, the embodiment of the present invention takes replacing the face of a character as an example; the processing procedures for replacing other target parts of a character and for replacing target parts of an animal are similar to the procedure for replacing the face of a character, and these procedures may be referred to one another.
Taking the replacement of a character's face as an example, the first image may include a first face image, and the second image may include a second face image. The first face image and/or the second face image may be a partial face image (i.e., an image in which face information is incomplete, such as a side face or a face partially covered by hair, ornaments, and the like), or may be a complete face image (i.e., an image in which face information is complete, such as an image in which no part of the face is covered). The first face image and/or the second face image may be a color image or a grayscale image. In addition, the image format is not limited in the embodiment of the present invention; for example, the image format of the first face image and/or the second face image may be any format that can be recognized by an electronic device, such as JPG (Joint Photographic Experts Group, a picture format), BMP (Bitmap, an image file format), or RAW (RAW image format).
Taking a human face replacing a human image as an example, the video processing method provided by the embodiment of the invention can comprise two stages: a face fusion stage and a face changing stage. And the face fusion stage is used for carrying out face fusion processing on the first image and the second image to generate a fusion image. The fused image may be a single image of the avatar. The virtual image in the fused image can be a virtual character, and the face of the virtual character fuses the first face feature in the first image and the second face feature in the second image. The face changing stage is used for replacing the face of a target object (such as a target person) in the original video with the face (a virtual face) of a virtual person image in the fusion image to obtain the target video.
In the face fusion stage, the embodiment of the invention performs face fusion processing on the first image and the second image through a style-based generative model to generate a fusion image, wherein the fusion image comprises a virtual face image (a face image of a virtual character) generated by fusing the first image and the second image.
It should be noted that the first image and/or the second image may be a face image obtained by the electronic device through an image acquisition device, a face image pre-stored in the electronic device, or a face image downloaded by the electronic device from a network. In addition, the faces in the first image and/or the second image are not limited to human faces, and may be faces of cartoon characters, cartoon characters and the like.
In an alternative embodiment of the present invention, the style-based generative model may comprise a StyleGAN generative model or a StyleGAN2 generative model.
The style-based generative model is a generator network structure designed with a style transfer algorithm on the basis of GAN (Generative Adversarial Networks). Through unsupervised automatic learning, the style-based generative model can decouple and separate, to a certain extent, the high-level semantic attributes of an image, such as the pose and identity of a face image, from the stochastic variations of the generated image, such as freckles and hair, and it also allows a certain degree of control over synthesis.
For example, StyleGAN, inspired by style transfer algorithms, treats a picture as a collection of many styles. StyleGAN uses style to affect the pose, identity and so on of a face, and uses noise to affect details such as hair strands, wrinkles and skin tone. Style here refers to the style of the face, including the expression on the face, the face orientation, the hairstyle and the like, and also includes texture details such as the complexion of the face and the face lighting.
Each style controls the effect of the image at a different scale. The scales may include the following three: coarse, medium and fine. The coarse scale can be used to control styles such as pose, hair and face shape. The medium scale can be used to control styles such as facial features and eyes. The fine scale can be used to control styles such as color matching. By adjusting different styles, the face picture can be adjusted at different scales to obtain different final fusion effects.
In one example, the first image is image A and the second image is image B. Image A may determine the gender, age, hair length, pose and so on of the generated virtual face image. Image B may determine other factors of the generated virtual face image, such as skin color, hair color and clothes color. Thus, part of the identity of image B can be migrated to image A, while facial features such as the orientation and expression of the face remain those of image A. That is, the generated fused image has the facial features of image A and the identity features of image B.
StyleGAN2 is an enhanced version of StyleGAN that eliminates the artifacts typical of StyleGAN and generates a higher-quality fused image. The embodiment of the present invention mainly takes StyleGAN as an example for description; the description applies to StyleGAN2 in a similar manner.
In the face changing stage, the embodiment of the invention carries out face detection on each frame of image in the original video and determines a target frame image containing a face image to be replaced; and performing face changing processing on each target frame image in the original video by using the fused image to obtain a target video.
The embodiment of the invention generates a fusion image containing a virtual face image through a style-based generative model, and performs face changing processing on the target frame images in the original video by using the fusion image to obtain a target video. The face image in the target video has the facial features of the virtual face image and the identity features of the target frame image in the original video. The video processing method provided by the embodiment of the invention can be used in scenes such as virtual anchor broadcasting and virtual customer service communication. Through the embodiment of the invention, fusion images containing virtual face images can be generated in batches, so that faces in original videos are changed in batches and target videos with virtual face images are generated in batches, which can reduce the cost of recording template images and improve video processing efficiency. In addition, the embodiment of the invention generates the fusion image through a style-based generative model, so that the quality of the fusion image in details such as hair, wrinkles and skin color can be improved, the image quality in the target video is further improved, and the target video after face changing is more real and natural.
In an optional embodiment of the present invention, the style-based generative model may include a mapping network and a synthesis network, and the performing target part fusion processing on the first image and the second image through the style-based generative model to generate a fused image may include:
step S11, acquiring a first image and a second image to be fused;
step S12, encoding the first image and the second image into input hidden codes of the mapping network, respectively;
step S13, inputting the input hidden code corresponding to the first image and the input hidden code corresponding to the second image into the mapping network, and converting the input hidden code corresponding to the first image and the input hidden code corresponding to the second image into intermediate hidden codes through the mapping network;
and step S14, inputting the intermediate hidden code corresponding to the first image and the intermediate hidden code corresponding to the second image into the synthesis network for fusion processing to obtain a fused image.
Taking the StyleGAN generative model as an example, the StyleGAN generative model includes a mapping network f and a synthesis network g. Specifically, the first image and the second image to be fused are first encoded into input hidden codes of the mapping network f, denoted z. For example, the input hidden code corresponding to the first image is denoted z1, and the input hidden code corresponding to the second image is denoted z2. Then, the input hidden code corresponding to the first image and the input hidden code corresponding to the second image are input to the mapping network, which converts them into intermediate hidden codes, denoted w.
The mapping network f may convert the input covert code z into the intermediate covert code w using eight fully-connected layers. w may be used to control the style, i.e. style, of the generated fused image. Through the mapping network f, an input hidden code z of 512 dimensions can be converted into an intermediate hidden code w of 512 dimensions. For example, z1 and z2 are input into mapping network f, which converts z1 and z2 into intermediate covert codes w1 and w2, respectively.
And inputting the intermediate hidden code w1 corresponding to the first image and the intermediate hidden code w2 corresponding to the second image into the synthesis network for fusion processing, so as to obtain a fused image. The fused image has facial features of the first image and identity features of the second image.
A and B are input into each layer (convolutional layer) of the synthesis network, where A is the affine transformation obtained from w and is used to control the style of the generated fused image (such as facial features, expression and eyes), and B is transformed random noise used to enrich the details of the generated fused image (such as hair strands, wrinkles and skin color). Each convolutional layer can adjust the style of the generated fused image according to the input A.
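As an illustration of this fusion step, the following minimal sketch mixes the two intermediate hidden codes by layer range in the spirit of StyleGAN style mixing. The `load_mapping_network` and `load_synthesis_network` helpers, the 18-layer count, and the choice of taking coarse-layer styles from the first image and finer styles from the second image are assumptions for illustration; actual StyleGAN/StyleGAN2 code bases expose these networks through their own interfaces.

```python
import torch

# Hypothetical loaders for a pre-trained style-based generator; real
# StyleGAN/StyleGAN2 implementations expose these through their own APIs.
mapping_network = load_mapping_network()      # assumed: z (512-d) -> w (512-d)
synthesis_network = load_synthesis_network()  # assumed: per-layer w -> image

def fuse(z1, z2, num_layers=18, coarse_layers=8):
    """Fuse two encoded images: coarse-scale styles (pose, face shape) come
    from the first image, finer styles (color, texture) from the second."""
    w1 = mapping_network(z1)                  # intermediate hidden code of image 1
    w2 = mapping_network(z2)                  # intermediate hidden code of image 2

    # One copy of w per synthesis layer, then mix by layer range.
    w_mix = w1.unsqueeze(1).repeat(1, num_layers, 1)
    w_mix[:, coarse_layers:] = w2.unsqueeze(1)      # fine layers use image 2's style
    return synthesis_network(w_mix)                 # fused virtual face image
```

The encoding of the first and second images into the input hidden codes z1 and z2 (for example by an encoder network or latent optimization) is assumed to have been done beforehand, as described above.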
In the face changing stage, the embodiment of the invention can provide two implementations of the step of performing face changing processing on the target frame images in the original video by using the fused image. One is a fusion face changing mode based on face key point migration and triangulation mapping. The other is a face changing mode based on 3D (3 Dimensions) face modeling. In addition, in the face changing stage the embodiment of the invention only replaces the facial features (the five sense organs), which avoids the unnatural hair, ears and other parts that would result from directly pasting the head image of the virtual face image into the original video. That is, the hair, ears and other parts of the target frame image in the original video are still retained in the face changing stage, and only the facial features of the fused image are placed onto the face of the target frame image.
In an optional embodiment of the present invention, the target portion is a human face, and performing target portion replacement processing on a target frame image in the original video by using the fused image to obtain a target video may include:
step S21, extracting a first face key point sequence of the fused image and extracting a second face key point sequence of the target frame image;
step S22, determining a migration face key point sequence of the target frame image based on the first face key point sequence and the second face key point sequence by utilizing a triangular mesh deformation transfer algorithm;
and step S23, carrying out triangulation mapping processing on the migration human face key point sequence of the target frame image by using the fusion image to obtain a target video.
First, the first face changing method provided by the present invention is explained: a fusion face changing mode based on face key point migration and triangulation mapping.
Specifically, firstly, face key point detection is carried out on a fusion image and each target frame image, a first face key point sequence of the fusion image is extracted, and a second face key point sequence of the target frame image is extracted.
The embodiment of the invention does not limit the specific method for detecting face key points. For example, a preset face detection model, such as a Dlib detection model and/or a RetinaFace detection model, may be used to extract the first face key point sequence of the fused image and the second face key point sequence of the target frame image, respectively.
Taking the Dlib detection model as an example, face detection is first performed on the fused image by using the Dlib detection model to obtain a face bounding box, and face key point detection is then performed on the image within the bounding box by using the Dlib detection model to obtain face key point coordinates, so as to obtain the first face key point sequence of the fused image. The first face key point sequence may comprise 68 face key point coordinates.
And extracting a second face key point sequence for each target frame image in the original video by adopting the same method. The second face keypoint sequence for each target frame image may include 68 face keypoint coordinates.
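A minimal sketch of this key point extraction step is given below, using the Dlib frontal face detector and 68-point shape predictor mentioned above. The path to the shape predictor model file is an assumption; that file must be downloaded separately.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Assumed local path to Dlib's 68-point landmark model (downloaded separately).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_face_keypoints(image_rgb):
    """Return a (68, 2) array of face key point coordinates, or None if no face."""
    boxes = detector(image_rgb, 1)            # face detection -> face bounding boxes
    if len(boxes) == 0:
        return None
    shape = predictor(image_rgb, boxes[0])    # 68 landmarks inside the first box
    return np.array([[p.x, p.y] for p in shape.parts()])
```

The same function can be applied to the fused image and to each target frame image to obtain the first and second face key point sequences.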
After extracting the first face key point sequence of the fused image and the second face key point sequence of each target frame image in the original video, performing key point migration on the first face key point sequence and the second face key point sequence by using a triangular mesh deformation transfer algorithm to obtain a migrated face key point sequence of each target frame image. And carrying out triangulation mapping treatment on the transferred human face key point sequence of each target frame image by using the texture image of the fusion image to obtain a target video.
In an alternative embodiment of the present invention, the determining, by using a triangular mesh deformation transfer algorithm in step S22, a migration face keypoint sequence of the target frame image based on the first face keypoint sequence and the second face keypoint sequence may include:
step S221, calculating a transformation matrix between second key point sequences of a previous frame image and a next frame image in two adjacent target frame images based on a triangular mesh deformation transfer algorithm;
step S222, taking the first face key point sequence as the migration face key point sequence of the first target frame image in the original video;
step S223, calculating the migration face key point sequence of the next frame image according to the migration face key point sequence of the previous frame image and the transformation matrix, so as to obtain the migration face key point sequence of each target frame image in the original video.
The triangular mesh deformation transfer algorithm may be, for example, the dt2d (Deformation Transfer for Triangle Meshes) algorithm.
Illustratively, assume that the first face key point sequence extracted from the fused image is denoted a1, the second face key point sequence extracted from the first target frame image in the original video is denoted c1, and the second face key point sequence extracted from the second target frame image is denoted c2, where a1, c1 and c2 each comprise 68 face key point coordinates. Based on the dt2d algorithm, the transformation matrix from c1 to c2 is calculated and denoted T1. The first face key point sequence a1 is taken as the migration face key point sequence of the first target frame image in the original video, denoted b1. The migration face key point sequence of the next frame image is then calculated according to the migration face key point sequence of the previous frame image and the transformation matrix. For example, the migration face key point sequence b2 of the second target frame image is calculated from the migration face key point sequence b1 of the first target frame image and the transformation matrix T1, e.g., b2 = b1 · T1. The operation here may be a simple multiplication, or a predetermined transform function.
Assume that the second face key point sequence extracted from the third target frame image is denoted c3, where c3 comprises 68 face key point coordinates. Based on the dt2d algorithm, the transformation matrix from c2 to c3 is calculated and denoted T2. The migration face key point sequence b3 of the third target frame image is calculated from the migration face key point sequence b2 of the second target frame image and the transformation matrix T2, e.g., b3 = b2 · T2. By analogy, the migration face key point sequences b1 to bn of all the target frame images in the original video can be calculated, where each of b1 to bn comprises 68 face key point coordinates.
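The recursion above can be sketched as follows. For illustration, the per-frame transformation matrix T between consecutive key point sequences is approximated by a least-squares fit in homogeneous coordinates; a full dt2d implementation operates on the triangle mesh rather than directly on the 68 points, so this is a simplified stand-in under that assumption.

```python
import numpy as np

def fit_transform(c_prev, c_next):
    """Least-squares 3x3 transform (homogeneous coordinates) mapping the key
    points of the previous frame onto those of the next frame. This stands in
    for the per-frame transformation matrix T of the deformation transfer step."""
    src = np.hstack([c_prev, np.ones((len(c_prev), 1))])   # (68, 3)
    dst = np.hstack([c_next, np.ones((len(c_next), 1))])   # (68, 3)
    T, _, _, _ = np.linalg.lstsq(src, dst, rcond=None)
    return T                                               # (3, 3)

def migrate_keypoints(a1, c_list):
    """a1: key points of the fused image, shape (68, 2); c_list: key point
    sequences c1..cn of the target frame images. Returns migrated b1..bn."""
    b_list = [np.asarray(a1, dtype=np.float64)]            # b1 = a1
    for c_prev, c_next in zip(c_list[:-1], c_list[1:]):
        T = fit_transform(c_prev, c_next)                  # T_k: c_k -> c_{k+1}
        b_prev = np.hstack([b_list[-1], np.ones((len(a1), 1))])
        b_list.append((b_prev @ T)[:, :2])                 # b_{k+1} = b_k * T_k
    return b_list
```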
The fused image is used to perform triangulation mapping processing on the migration face key point sequence of each target frame image, so as to obtain each target frame image after face changing and, in turn, the target video. The triangulation mapping can use existing algorithms.
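One common way to realize the triangulation mapping is Delaunay triangulation of the key points followed by a per-triangle affine warp, for example with OpenCV and SciPy as sketched below. This is an illustrative implementation under that assumption, not necessarily the exact algorithm used by the embodiment.

```python
import cv2
import numpy as np
from scipy.spatial import Delaunay

def triangulation_map(fused_img, fused_pts, target_img, migrated_pts):
    """Warp the fused face onto the target frame triangle by triangle.
    fused_pts / migrated_pts are (68, 2) key point arrays."""
    output = target_img.copy()
    tri = Delaunay(migrated_pts)                        # triangles as key point indices
    for simplex in tri.simplices:
        src = np.float32(fused_pts[simplex])            # triangle in the fused image
        dst = np.float32(migrated_pts[simplex])         # triangle in the target frame
        sx, sy, sw, sh = cv2.boundingRect(src)
        dx, dy, dw, dh = cv2.boundingRect(dst)
        src_local = src - np.float32([sx, sy])
        dst_local = dst - np.float32([dx, dy])
        M = cv2.getAffineTransform(src_local, dst_local)
        patch = fused_img[sy:sy + sh, sx:sx + sw]
        warped = cv2.warpAffine(patch, M, (dw, dh))
        mask = np.zeros((dh, dw), dtype=np.uint8)
        cv2.fillConvexPoly(mask, np.int32(dst_local), 255)
        roi = output[dy:dy + dh, dx:dx + dw]
        roi[mask > 0] = warped[mask > 0]                # paste the warped triangle
    return output
```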
Referring to fig. 3, a schematic view of a video processing flow in an example of the present invention is shown, as shown in fig. 3, taking a human face replacing a character image as an example, the video processing flow includes a human face fusion stage and a face changing stage. And in the face fusion stage, a fusion image is generated through a StyleGAN generation model. In the face changing stage, a fusion face changing mode of human face key point migration and triangulation mapping is adopted based on a dt2d algorithm.
The fusion face changing mode based on face key point migration and triangulation mapping does not need to train a model or require a large amount of training data, and can therefore reduce resource consumption.
In an optional embodiment of the present invention, the target portion is a human face, and performing target portion replacement processing on a target frame image in the original video by using the fused image to obtain a target video may include:
step S31, acquiring a first texture image corresponding to the fusion image and a second texture image corresponding to the target frame image;
step S32, carrying out fusion processing on the first texture image and the second texture image to obtain a fusion texture image;
and step S33, rendering the target frame image in the original video based on the fusion texture image to obtain the target video.
With the fusion face changing mode based on face key point migration and triangulation mapping, the face changing effect for frontal face images is good, but the effect for side face images is difficult to guarantee.
The embodiment of the invention also can provide a face changing mode based on 3D face modeling. Specifically, a first texture image corresponding to the fusion image and a second texture image corresponding to each target frame image are obtained first. The first texture image corresponding to the fusion image and the second texture image corresponding to each target frame image can be acquired through a 3D scanning device. For example, a 3D face model corresponding to the fusion image and a 3D face model corresponding to each target frame image are obtained by a 3D scanning device, and a first mapping relationship and a second mapping relationship are respectively established, where the first mapping relationship refers to a mapping relationship between a point on the 3D face model and each pixel point in the texture image, and the second mapping relationship refers to a mapping relationship between a point on the 3D face model and each pixel point in the face image. The electronic device can obtain a first texture image corresponding to the fused image and a second texture image corresponding to each target frame image through the first mapping relation and the second mapping relation.
And then, carrying out fusion processing on the first texture image and the second texture image to obtain a fusion texture image. In particular, the first texture image may be fused into the second texture image based on the dynamic mask. It should be noted that the dynamic mask is a binary map used to indicate the range to be fused. For example, if a mouth in the fused image is to be replaced in the target frame image, the region indicated by the dynamic mask is the region where the mouth is located in the fused image. By dynamic masking, the fusion area of the first texture image and the second texture image can be controlled within a valid range.
Finally, each target frame image is rendered based on its corresponding fused texture image, so that each target frame image after face changing, and thus the target video, can be obtained. The process of rendering the target frame image based on the fused texture image adjusts the texture value of each pixel point of the face area to be replaced (such as the facial features) in the target frame image to the texture value of the corresponding pixel point in the fused texture image.
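A minimal sketch of the texture fusion and rendering step is given below. The texture images are assumed to be aligned arrays in the same texture (UV) space, the dynamic mask is the binary map described above, and `render_with_texture` is a placeholder for whichever renderer rasterizes the target frame's three-dimensional face model with the fused texture, not a real library call.

```python
import numpy as np

def fuse_textures(first_texture, second_texture, dynamic_mask):
    """Blend the fused-image texture into the target-frame texture inside the
    region indicated by the binary dynamic mask (1 = region to be replaced)."""
    mask = dynamic_mask.astype(np.float32)
    if mask.ndim == 2:
        mask = mask[..., None]                    # broadcast over color channels
    blended = first_texture * mask + second_texture * (1.0 - mask)
    return blended.astype(second_texture.dtype)

def change_face_3d(target_frame, target_face_model, first_texture, second_texture, mask):
    """Re-texture one target frame with the fused texture and render it back.
    `render_with_texture` is an assumed renderer, standing in for the 3D pipeline."""
    fused_texture = fuse_textures(first_texture, second_texture, mask)
    return render_with_texture(target_frame, target_face_model, fused_texture)
```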
In an optional embodiment of the present invention, the acquiring, in step S31, a first texture image corresponding to the fused image and a second texture image corresponding to the target frame image may include:
step S311, inputting the fusion image and the target frame image into a pre-trained three-dimensional face reconstruction network respectively for three-dimensional face reconstruction processing to obtain three-dimensional model parameters of the fusion image and three-dimensional model parameters of the target frame image;
step S312, respectively constructing a three-dimensional face model corresponding to the fusion image and a three-dimensional face model corresponding to the target frame image according to the three-dimensional model parameters of the fusion image and the three-dimensional model parameters of the target frame image;
step S313, drawing a first texture image corresponding to the fusion image based on the three-dimensional face model corresponding to the fusion image, and drawing a second texture image corresponding to the target frame image based on the three-dimensional face model corresponding to the target frame image.
The three-dimensional face reconstruction network can adopt any neural network model which is good at three-dimensional face reconstruction in the field. The three-dimensional face reconstruction network can be trained by adopting a weak supervision training method, and then the trained three-dimensional face reconstruction network is adopted to respectively carry out three-dimensional face reconstruction processing on the fusion image and each target frame image so as to obtain three-dimensional model parameters of the fusion image and three-dimensional model parameters of each target frame image.
It should be noted that the three-dimensional model parameters in the embodiment of the present invention include, but are not limited to, shape parameters, expression parameters, and pose parameters in the face image. And constructing a three-dimensional face model corresponding to the fusion image and a three-dimensional face model corresponding to each target frame image according to the three-dimensional model parameters.
For the fused image, each pixel point has a corresponding coordinate point in the three-dimensional face model corresponding to the fused image; similarly, for a certain target frame image, each pixel point in the target frame image has a corresponding coordinate point in the three-dimensional face model corresponding to that target frame image. Then, according to the preset mapping relation between each coordinate point in the three-dimensional face model and each coordinate point in texture space, the first texture image corresponding to the fused image and the second texture image corresponding to each target frame image are determined.
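The pipeline from a single image to its texture image can be sketched as follows. `load_reconstruction_network` and `load_face_model` are placeholders for a pre-trained three-dimensional face reconstruction network and a morphable face model, and the per-vertex nearest-neighbour sampling is a simplification of the mapping between model coordinates and texture space; a practical implementation rasterizes the mesh in texture space.

```python
import numpy as np

# Assumed components: a pre-trained reconstruction network that predicts
# shape/expression/pose parameters, and a face model that turns them into
# a mesh with fixed texture (UV) coordinates per vertex.
reconstruction_net = load_reconstruction_network()
face_model = load_face_model()

def draw_texture_image(image, uv_size=256):
    """Reconstruct the 3D face of one image and unwrap its colors to UV space."""
    params = reconstruction_net(image)                    # three-dimensional model parameters
    vertices_2d, uv_coords = face_model.project(params)   # assumed: per-vertex image xy and uv

    texture = np.zeros((uv_size, uv_size, 3), dtype=image.dtype)
    for (x, y), (u, v) in zip(vertices_2d, uv_coords):
        px, py = int(round(x)), int(round(y))
        tu, tv = int(u * (uv_size - 1)), int(v * (uv_size - 1))
        if 0 <= px < image.shape[1] and 0 <= py < image.shape[0]:
            texture[tv, tu] = image[py, px]               # sample image color into the texture
    return texture
```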
In the embodiment of the invention, in the face changing stage, the fusion face changing mode based on face key point migration and triangulation mapping may be adopted, which has a low operation cost and can reduce resource consumption. The face changing mode based on 3D face modeling may also be adopted, which can improve the face changing effect for side face images. In specific implementations, a suitable face changing mode can be selected according to actual needs.
To sum up, the video processing method provided by the embodiment of the present invention mainly includes the following two stages: a fusion stage and a replacement stage. The fusion stage is used for carrying out target part fusion processing on the first image and the second image to generate a fusion image. The fusion image includes a virtual target portion image generated by fusing the target portion in the first image with the target portion in the second image, so that the virtual target portion image combines the features of the target portion in the first image and the target portion in the second image. The replacement stage is used for replacing the target part in the original video with the virtual target part in the fused image to obtain the target video. The embodiment of the invention replaces the target part in the original video with the generated virtual target part image, thereby reducing the cost of recording template images and improving the efficiency of video processing. In addition, the embodiment of the invention can generate fusion images containing virtual target part images in batches and perform video processing in batches, which further improves the efficiency of video processing. Furthermore, the embodiment of the invention generates the fusion image through a style-based generative model, so that the quality of the fusion image in details such as hair, wrinkles and skin color can be improved, the image quality in the target video is further improved, and the replaced target video is more real and natural.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Device embodiment
Referring to fig. 4, a block diagram of a video processing apparatus according to an embodiment of the present invention is shown, where the apparatus may include:
a fusion processing module 401, configured to perform target part fusion processing on the first image and the second image through a style-based generative model, so as to generate a fusion image, where the fusion image includes a target part feature of the first image and a target part feature of the second image;
a target detection module 402, configured to perform target portion detection on each frame of image in an original video, and determine a target frame image including a target portion to be replaced;
a replacement processing module 403, configured to perform target portion replacement processing on the target frame image in the original video by using the fused image, so as to obtain a target video.
Optionally, the target part comprises any one or more of: face, lips, hands, legs, body.
Optionally, the target region is a human face, and the replacement processing module includes:
the key point extraction submodule is used for extracting a first face key point sequence of the fused image and extracting a second face key point sequence of the target frame image;
a key point migration submodule, configured to determine a migration face key point sequence of the target frame image based on the first face key point sequence and the second face key point sequence by using a triangular mesh deformation transfer algorithm;
and the mapping processing submodule is used for carrying out triangulation mapping processing on the transferred human face key point sequence of the target frame image by using the fusion image to obtain a target video.
Optionally, the key point migration sub-module includes:
the matrix calculation unit is used for calculating a transformation matrix between second key point sequences of a previous frame image and a next frame image in two adjacent target frame images based on a triangular mesh deformation transfer algorithm;
a sequence determining unit, configured to use the first face key point sequence as a migration face key point sequence of a first target frame image in the original video;
and the migration determining unit is used for calculating the migration face key point sequence of the next frame image according to the migration face key point sequence of the previous frame image and the transformation matrix to obtain the migration face key point sequence of each target frame image in the original video.
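A minimal numerical sketch of the processing performed by the matrix calculation, sequence determining and migration determining units is given below: a transform is estimated between the key points of each pair of adjacent target frames and applied to propagate the migrated key points forward, starting from the fused image's key point sequence. For brevity it fits a single least-squares affine transform per frame pair rather than a full per-triangle deformation transfer, so it is an approximation of the embodiment rather than the embodiment itself; all function names are illustrative.

```python
import numpy as np

def affine_between(src_pts: np.ndarray, dst_pts: np.ndarray) -> np.ndarray:
    """Least-squares 2D affine transform (3x3 homogeneous) mapping src_pts onto dst_pts.

    A simplified stand-in for the per-triangle deformation transfer described in the
    embodiment; src_pts and dst_pts are (N, 2) face key point arrays of a previous
    frame and the next frame.
    """
    n = src_pts.shape[0]
    src_h = np.hstack([src_pts, np.ones((n, 1))])          # (N, 3) homogeneous points
    M, *_ = np.linalg.lstsq(src_h, dst_pts, rcond=None)    # (3, 2) least-squares solution
    T = np.eye(3)
    T[:2, :] = M.T
    return T

def propagate_keypoints(fused_kps: np.ndarray, frame_kps: list) -> list:
    """Build the migration face key point sequence for every target frame.

    The first target frame reuses the fused image's key points directly; each later
    frame applies the transform estimated between the original key points of the
    previous and the next frame to the previously migrated key points.
    """
    migrated = [fused_kps.astype(float)]
    for prev, curr in zip(frame_kps[:-1], frame_kps[1:]):
        T = affine_between(prev, curr)
        prev_m = migrated[-1]
        prev_h = np.hstack([prev_m, np.ones((len(prev_m), 1))])
        migrated.append((prev_h @ T.T)[:, :2])
    return migrated
```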
Optionally, the target region is a human face, and the replacement processing module includes:
the texture obtaining submodule is used for obtaining a first texture image corresponding to the fusion image and a second texture image corresponding to the target frame image;
the texture fusion submodule is used for carrying out fusion processing on the first texture image and the second texture image to obtain a fusion texture image;
and the rendering processing submodule is used for rendering the target frame image in the original video based on the fusion texture image to obtain the target video.
Optionally, the texture fetching sub-module includes:
the three-dimensional reconstruction unit is used for respectively inputting the fusion image and the target frame image into a pre-trained three-dimensional face reconstruction network to carry out three-dimensional face reconstruction processing so as to obtain three-dimensional model parameters of the fusion image and three-dimensional model parameters of the target frame image;
the face construction unit is used for respectively constructing a three-dimensional face model corresponding to the fusion image and a three-dimensional face model corresponding to the target frame image according to the three-dimensional model parameters of the fusion image and the three-dimensional model parameters of the target frame image;
and the texture drawing unit is used for drawing a first texture image corresponding to the fused image based on the three-dimensional face model corresponding to the fused image and drawing a second texture image corresponding to the target frame image based on the three-dimensional face model corresponding to the target frame image.
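The texture drawing unit can be pictured as sampling image colours at the projected positions of the reconstructed face mesh vertices into a UV texture map. The sketch below assumes the vertices' 2D projections and UV coordinates have already been produced by the three-dimensional face reconstruction step; the nearest-vertex splatting used here is a deliberate simplification of true triangle rasterisation, and the function name is illustrative.

```python
import numpy as np

def draw_uv_texture(image_bgr: np.ndarray, vertices_2d: np.ndarray,
                    uv_coords: np.ndarray, size: int = 256) -> np.ndarray:
    """Sample image colours at the projected mesh vertices into a UV texture map.

    vertices_2d are the (N, 2) image-plane projections of the reconstructed face
    mesh vertices and uv_coords the (N, 2) UV coordinates of the same vertices in
    [0, 1]; both are assumed to come from the three-dimensional face reconstruction
    network. Nearest-vertex splatting keeps the sketch short; a real texture drawing
    step would rasterise the mesh triangles instead.
    """
    tex = np.zeros((size, size, 3), dtype=image_bgr.dtype)
    h, w = image_bgr.shape[:2]
    xs = np.clip(vertices_2d[:, 0].round().astype(int), 0, w - 1)
    ys = np.clip(vertices_2d[:, 1].round().astype(int), 0, h - 1)
    us = np.clip((uv_coords[:, 0] * (size - 1)).round().astype(int), 0, size - 1)
    vs = np.clip((uv_coords[:, 1] * (size - 1)).round().astype(int), 0, size - 1)
    tex[vs, us] = image_bgr[ys, xs]    # one colour sample per mesh vertex
    return tex
```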
Optionally, the style-based generative model may include a mapping network and a synthesis network, and the fusion processing module includes:
the image acquisition submodule is used for acquiring a first image and a second image to be fused;
an image encoding sub-module, configured to encode the first image and the second image into input hidden codes of the mapping network, respectively;
the hidden code conversion sub-module is used for inputting the input hidden codes corresponding to the first image and the input hidden codes corresponding to the second image into the mapping network, and converting the input hidden codes corresponding to the first image and the input hidden codes corresponding to the second image into intermediate hidden codes through the mapping network;
and the fusion processing submodule is used for inputting the intermediate hidden code corresponding to the first image and the intermediate hidden code corresponding to the second image into the synthesis network for fusion processing to obtain a fusion image.
Optionally, the style-based generative model comprises a StyleGAN generative model or a StyleGAN2 generative model.
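The fusion carried out by the mapping network and the synthesis network can be sketched as a style-mixing operation on a pretrained StyleGAN2-like generator. The snippet below assumes a generator object G that exposes mapping and synthesis sub-networks in the style of the public stylegan2-ada-pytorch interface, and that the two images have already been encoded into input hidden codes; the split point mix_from and all identifiers are illustrative assumptions, not part of the embodiment.

```python
import torch

def fuse_faces(G, z_first: torch.Tensor, z_second: torch.Tensor,
               mix_from: int = 8) -> torch.Tensor:
    """Style-mixing sketch for fusing two face latents with a StyleGAN2-like generator.

    Assumes G exposes `mapping` and `synthesis` sub-networks as in the public
    stylegan2-ada-pytorch interface, and that z_first / z_second are the input
    hidden codes obtained by encoding (inverting) the first and second images.
    Layers before `mix_from` take coarse styles (pose, face shape) from the first
    image; the remaining layers take fine styles (skin tone, hair detail) from the
    second, so the synthesised face blends features of both target parts.
    """
    with torch.no_grad():
        w_first = G.mapping(z_first, None)    # (1, num_ws, w_dim) intermediate hidden codes
        w_second = G.mapping(z_second, None)
        w_mixed = w_first.clone()
        w_mixed[:, mix_from:] = w_second[:, mix_from:]
        return G.synthesis(w_mixed)           # fused face image, NCHW in [-1, 1]
```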
The video processing method provided by the embodiment of the present invention mainly includes two stages: a fusion stage and a replacement stage. In the fusion stage, target part fusion processing is performed on the first image and the second image to generate a fusion image. The fusion image includes a virtual target part image obtained by fusing the target part in the first image with the target part in the second image, so that the virtual target part image combines the features of both target parts. In the replacement stage, the target part in the original video is replaced with the target part in the fusion image to obtain the target video. Because the embodiment of the invention replaces the target part in the original video with a generated virtual target part image, the cost of recording template images is reduced and the efficiency of video processing is improved. In addition, the embodiment of the invention can generate fusion images containing virtual target part images in batches and therefore perform video processing in batches, which further improves the efficiency of video processing. Furthermore, the embodiment of the invention generates the fusion image through a style-based generative model, which improves the quality of the fusion image in details such as hair, wrinkles and skin color, further improves the image quality of the target video, and makes the replaced target video more realistic and natural.
Since the device embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiment description.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention provides a device for video processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for: performing target part fusion processing on the first image and the second image through a style-based generative model to generate a fusion image, wherein the fusion image comprises target part characteristics of the first image and target part characteristics of the second image; carrying out target part detection on each frame of image in an original video, and determining a target frame image containing a target part to be replaced; and performing target part replacement processing on the target frame image in the original video by using the fusion image to obtain a target video.
Optionally, the target site comprises any one or more of: face, lips, hands, legs, body.
Optionally, the target portion is a human face, and performing target portion replacement processing on a target frame image in the original video by using the fused image to obtain a target video includes:
extracting a first face key point sequence of the fused image and a second face key point sequence of the target frame image;
determining a migration face key point sequence of the target frame image based on the first face key point sequence and the second face key point sequence by utilizing a triangular mesh deformation transfer algorithm;
and performing triangulation mapping processing on the migrated face key point sequence of the target frame image by using the fused image to obtain the target video.
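The triangulation mapping processing in the last step above can be realised as a piecewise affine warp: the face region of the fused image is divided into triangles over its key points, and each triangle is warped to the migrated key point positions in the target frame. The following OpenCV sketch is illustrative only; the triangle index array would typically come from a Delaunay triangulation of the key points, and all names are assumptions rather than part of the embodiment.

```python
import numpy as np
import cv2

def warp_triangle(src: np.ndarray, dst: np.ndarray,
                  tri_src: np.ndarray, tri_dst: np.ndarray) -> None:
    """Affine-warp one triangle of `src` onto `dst` in place (face assumed inside frame)."""
    r_src = cv2.boundingRect(tri_src.astype(np.float32))
    r_dst = cv2.boundingRect(tri_dst.astype(np.float32))
    src_off = (tri_src - r_src[:2]).astype(np.float32)
    dst_off = (tri_dst - r_dst[:2]).astype(np.float32)
    M = cv2.getAffineTransform(src_off, dst_off)
    patch = src[r_src[1]:r_src[1] + r_src[3], r_src[0]:r_src[0] + r_src[2]]
    warped = cv2.warpAffine(patch, M, (r_dst[2], r_dst[3]),
                            flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_REFLECT_101)
    mask = np.zeros((r_dst[3], r_dst[2]), dtype=np.uint8)
    cv2.fillConvexPoly(mask, np.round(dst_off).astype(np.int32), 255)
    roi = dst[r_dst[1]:r_dst[1] + r_dst[3], r_dst[0]:r_dst[0] + r_dst[2]]
    roi[mask > 0] = warped[mask > 0]

def map_fused_face(fused_img: np.ndarray, frame_img: np.ndarray,
                   fused_kps: np.ndarray, migrated_kps: np.ndarray,
                   triangles: np.ndarray) -> np.ndarray:
    """Map the fused face onto a target frame, triangle by triangle.

    `triangles` is an (M, 3) index array over the key points (for example from a
    Delaunay triangulation of the fused-image key points); each triangle of the
    fused image is warped to the migrated key point positions in the frame.
    """
    out = frame_img.copy()
    for ia, ib, ic in triangles:
        warp_triangle(fused_img, out,
                      fused_kps[[ia, ib, ic]], migrated_kps[[ia, ib, ic]])
    return out
```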
Optionally, the determining, by using a triangular mesh deformation transfer algorithm, a migration face key point sequence of the target frame image based on the first face key point sequence and the second face key point sequence includes:
calculating a transformation matrix between second key point sequences of a previous frame image and a next frame image in two adjacent target frame images based on a triangular mesh deformation transfer algorithm;
taking the first face key point sequence as the migration face key point sequence of the first target frame image in the original video;
and calculating the migration face key point sequence of the next frame image according to the migration face key point sequence of the previous frame image and the transformation matrix to obtain the migration face key point sequence of each target frame image in the original video.
Optionally, the target portion is a human face, and performing target portion replacement processing on a target frame image in the original video by using the fused image to obtain a target video includes:
acquiring a first texture image corresponding to the fusion image and a second texture image corresponding to the target frame image;
performing fusion processing on the first texture image and the second texture image to obtain a fusion texture image;
and rendering the target frame image in the original video based on the fusion texture image to obtain a target video.
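A minimal sketch of the texture fusion step: the two UV texture images are combined with a weighted blend before the fused texture is used to render the target frame. The weight alpha and the use of cv2.addWeighted are illustrative choices, since the embodiment does not mandate a specific blending operator.

```python
import numpy as np
import cv2

def fuse_textures(fused_tex: np.ndarray, frame_tex: np.ndarray,
                  alpha: float = 0.7) -> np.ndarray:
    """Weighted blend of the two UV texture images into a fused texture image.

    alpha controls how strongly the virtual face's texture replaces the original
    frame's texture; this weighted sum is only one illustrative choice of operator.
    """
    assert fused_tex.shape == frame_tex.shape, "texture maps must share one UV layout"
    return cv2.addWeighted(fused_tex, alpha, frame_tex, 1.0 - alpha, 0.0)
```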
Optionally, the obtaining a first texture image corresponding to the fusion image and a second texture image corresponding to the target frame image includes:
respectively inputting the fusion image and the target frame image into a pre-trained three-dimensional face reconstruction network for three-dimensional face reconstruction processing to obtain three-dimensional model parameters of the fusion image and three-dimensional model parameters of the target frame image;
respectively constructing a three-dimensional face model corresponding to the fusion image and a three-dimensional face model corresponding to the target frame image according to the three-dimensional model parameters of the fusion image and the three-dimensional model parameters of the target frame image;
and drawing a first texture image corresponding to the fusion image based on the three-dimensional face model corresponding to the fusion image, and drawing a second texture image corresponding to the target frame image based on the three-dimensional face model corresponding to the target frame image.
Optionally, the style-based generative model may include a mapping network and a synthesis network, and the performing target part fusion processing on the first image and the second image through the style-based generative model to generate a fusion image includes:
acquiring a first image and a second image to be fused;
encoding the first image and the second image into input hidden codes of the mapping network respectively;
inputting the input hidden code corresponding to the first image and the input hidden code corresponding to the second image into the mapping network, and respectively converting the input hidden code corresponding to the first image and the input hidden code corresponding to the second image into intermediate hidden codes through the mapping network;
and inputting the intermediate hidden code corresponding to the first image and the intermediate hidden code corresponding to the second image into the synthesis network for fusion processing to obtain a fused image.
Optionally, the style-based generative model comprises a StyleGAN generative model or a StyleGAN2 generative model.
Fig. 5 is a block diagram illustrating an apparatus 800 for video processing according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the apparatus 800. For example, the sensor assembly 814 may detect the open/closed state of the apparatus 800 and the relative positioning of components, such as the display and keypad of the apparatus 800; the sensor assembly 814 may also detect a change in the position of the apparatus 800 or of a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in the temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The apparatus 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 6 is a schematic diagram of a server in some embodiments of the invention. The server 1900 may vary widely in configuration or performance and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage media 1930 may be transient or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform the video processing method shown in fig. 1.
A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of a device (server or terminal), enable the device to perform the video processing method described in the embodiment corresponding to fig. 1, which is therefore not repeated here. The beneficial effects of the same method are likewise not described again.
Further, it should be noted that embodiments of the present application also provide a computer program product or computer program, which may include computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the video processing method described in the embodiment corresponding to fig. 1, which is therefore not repeated here. Likewise, the beneficial effects of the same method are not described again. For technical details not disclosed in the embodiments of the computer program product or the computer program, reference is made to the description of the method embodiments of the present application.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The video processing method, the video processing apparatus, and the device for video processing provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the descriptions of the above embodiments are only intended to help understand the method and core idea of the present invention. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.

Claims (15)

1. A method of video processing, the method comprising:
performing target part fusion processing on the first image and the second image through a style-based generative model to generate a fusion image, wherein the fusion image comprises target part characteristics of the first image and target part characteristics of the second image;
carrying out target part detection on each frame of image in an original video, and determining a target frame image containing a target part to be replaced;
and performing target part replacement processing on the target frame image in the original video by using the fusion image to obtain a target video.
2. The method of claim 1, wherein the target site comprises any one or more of: face, lips, hands, legs, body.
3. The method according to claim 1, wherein the target portion is a human face, and performing target portion replacement processing on a target frame image in the original video by using the fused image to obtain a target video comprises:
extracting a first face key point sequence of the fused image and a second face key point sequence of the target frame image;
determining a migration face key point sequence of the target frame image based on the first face key point sequence and the second face key point sequence by utilizing a triangular mesh deformation transfer algorithm;
and carrying out triangulation mapping processing on the key point sequence of the migrated human face of the target frame image by using the fused image to obtain a target video.
4. The method according to claim 3, wherein the determining the migration face key point sequence of the target frame image based on the first face key point sequence and the second face key point sequence by using a triangular mesh deformation transfer algorithm comprises:
calculating a transformation matrix between second key point sequences of a previous frame image and a next frame image in two adjacent target frame images based on a triangular mesh deformation transfer algorithm;
taking the first face key point sequence as the migration face key point sequence of a first target frame image in the original video;
and calculating the migration face key point sequence of the next frame image according to the migration face key point sequence of the previous frame image and the transformation matrix to obtain the migration face key point sequence of each target frame image in the original video.
5. The method according to claim 1, wherein the target portion is a human face, and performing target portion replacement processing on a target frame image in the original video by using the fused image to obtain a target video comprises:
acquiring a first texture image corresponding to the fusion image and a second texture image corresponding to the target frame image;
performing fusion processing on the first texture image and the second texture image to obtain a fusion texture image;
and rendering the target frame image in the original video based on the fusion texture image to obtain a target video.
6. The method according to claim 5, wherein the obtaining a first texture image corresponding to the fused image and a second texture image corresponding to the target frame image comprises:
respectively inputting the fusion image and the target frame image into a pre-trained three-dimensional face reconstruction network for three-dimensional face reconstruction processing to obtain three-dimensional model parameters of the fusion image and three-dimensional model parameters of the target frame image;
respectively constructing a three-dimensional face model corresponding to the fusion image and a three-dimensional face model corresponding to the target frame image according to the three-dimensional model parameters of the fusion image and the three-dimensional model parameters of the target frame image;
and drawing a first texture image corresponding to the fusion image based on the three-dimensional face model corresponding to the fusion image, and drawing a second texture image corresponding to the target frame image based on the three-dimensional face model corresponding to the target frame image.
7. The method according to claim 1, wherein the style-based generative model comprises a mapping network and a synthesis network, and the performing target part fusion processing on the first image and the second image through the style-based generative model to generate a fusion image comprises:
acquiring a first image and a second image to be fused;
encoding the first image and the second image into input hidden codes of the mapping network respectively;
inputting the input hidden code corresponding to the first image and the input hidden code corresponding to the second image into the mapping network, and respectively converting the input hidden code corresponding to the first image and the input hidden code corresponding to the second image into intermediate hidden codes through the mapping network;
and inputting the intermediate hidden code corresponding to the first image and the intermediate hidden code corresponding to the second image into the synthesis network for fusion processing to obtain a fused image.
8. The method of any one of claims 1 to 7, wherein the style-based generative model comprises a StyleGAN generative model or a StyleGAN2 generative model.
9. A video processing apparatus, characterized in that the apparatus comprises:
the fusion processing module is used for carrying out target part fusion processing on the first image and the second image through a style-based generative model to generate a fusion image, and the fusion image comprises target part characteristics of the first image and target part characteristics of the second image;
the target detection module is used for detecting a target part of each frame of image in the original video and determining a target frame image containing the target part to be replaced;
and the replacement processing module is used for performing target part replacement processing on the target frame image in the original video by using the fusion image to obtain a target video.
10. The apparatus of claim 9, wherein the target site comprises any one or more of: face, lips, hands, legs, body.
11. The apparatus of claim 9, wherein the target region is a human face, and the replacement processing module comprises:
the key point extraction submodule is used for extracting a first face key point sequence of the fused image and extracting a second face key point sequence of the target frame image;
a key point migration submodule, configured to determine a migration face key point sequence of the target frame image based on the first face key point sequence and the second face key point sequence by using a triangular mesh deformation transfer algorithm;
and the mapping processing submodule is used for performing triangulation mapping processing on the migrated face key point sequence of the target frame image by using the fusion image to obtain the target video.
12. The apparatus of claim 11, wherein the keypoint migration sub-module comprises:
the matrix calculation unit is used for calculating a transformation matrix between second key point sequences of a previous frame image and a next frame image in two adjacent target frame images based on a triangular mesh deformation transfer algorithm;
a sequence determining unit, configured to use the first face key point sequence as a migration face key point sequence of a first target frame image in the original video;
and the migration determining unit is used for calculating the migration face key point sequence of the next frame image according to the migration face key point sequence of the previous frame image and the transformation matrix to obtain the migration face key point sequence of each target frame image in the original video.
13. An apparatus for video processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the video processing method of any one of claims 1 to 8.
14. A machine-readable medium having stored thereon instructions which, when executed by one or more processors of an apparatus, cause the apparatus to perform the video processing method of any of claims 1 to 8.
15. A computer program product comprising computer instructions stored in a computer readable storage medium and adapted to be read and executed by a processor to cause a computer device having the processor to perform the video processing method of any of claims 1-8.