CN111263226B - Video processing method, video processing device, electronic equipment and medium - Google Patents


Info

Publication number
CN111263226B
CN111263226B (granted patent; application CN202010057682.5A)
Authority
CN
China
Prior art keywords
replaced
training
encoder
video
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010057682.5A
Other languages
Chinese (zh)
Other versions
CN111263226A (en)
Inventor
张勇东
胡梓珩
谢洪涛
邓旭冉
李岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Research Institute
University of Science and Technology of China USTC
Original Assignee
Beijing Zhongke Research Institute
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Research Institute and University of Science and Technology of China (USTC)
Priority to CN202010057682.5A
Publication of application CN111263226A
Application granted
Publication of granted patent CN111263226B
Legal status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06T 11/60: Editing figures and text; combining figures or text
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00: Image coding
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; operations thereof
    • H04N 21/47: End-user applications
    • H04N 21/472: End-user interface for requesting content, additional data or services; end-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N 21/47205: End-user interface for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally

Abstract

Video processing method, apparatus, electronic device, and medium. A video processing method comprising: decoding a video to be replaced and a target video into a first frame sequence and a second frame sequence, respectively, and acquiring the corresponding object image to be replaced and target object image; encoding the object image to be replaced, with preselected noise added during encoding; performing style migration on the encoding result; decoding and reconstructing the style-migrated encoding result so that the target object image replaces the object image to be replaced, yielding a reconstructed image; and fusing the reconstructed image into the first frame sequence to obtain a replaced first frame sequence, which is restored to video. The method saves time and material costs, reduces replacement artifacts, ensures a clear and realistic face-swapping result with good viewing quality, and is simple to operate.

Description

Video processing method, video processing device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of image processing, and in particular, to a video processing method, apparatus, electronic device, and medium.
Background
In today's film and entertainment industry it is sometimes desirable, for various reasons, to replace certain actors after shooting has been completed. Engaging other actors to re-shoot the affected scenes consumes significant time and material costs. A video is a sequence of frames, usually measured in frames per second (FPS). At the conventional 24 FPS, every 10-second clip contains 240 frames, so manual processing is extremely labor-intensive and must be performed by professionals with strong skills and proficiency; otherwise the quality of the result cannot be guaranteed. With the continuous development of deep learning, applying artificial intelligence to automatic face replacement in video has become feasible.
Disclosure of Invention
Technical problem to be solved
In view of the above technical problems, the present disclosure provides a video processing method, apparatus, electronic device, and medium, intended to at least solve the technical problems described above.
(II) technical scheme
According to a first aspect of the embodiments of the present disclosure, there is provided a video processing method, including: decoding a video to be replaced and a target video into a first frame sequence and a second frame sequence, respectively, and acquiring the corresponding object image to be replaced and target object image; encoding the object image to be replaced, with preselected noise added during encoding; performing style migration on the encoding result; decoding and reconstructing the style-migrated encoding result so that the target object image replaces the object image to be replaced, yielding a reconstructed image; and fusing the reconstructed image into the first frame sequence to obtain a replaced first frame sequence, which is restored to video.
Optionally, encoding the object image to be replaced includes: acquiring a first denoising autoencoder; and inputting the object image to be replaced into the first denoising autoencoder for encoding.
Optionally, decoding and reconstructing the style-migrated encoding result includes: acquiring a second denoising autoencoder; and inputting the style-migrated encoding result into the second denoising autoencoder for decoding and reconstruction.
Optionally, the method further comprises training the first and second denoising autoencoders, including: acquiring a first training data set and a second training data set, wherein the first training data set comprises a video to be replaced for training and the second training data set comprises a target video for training; extracting first image data from the video to be replaced for training and second image data from the target video for training; and, using a hierarchical training method, training the first denoising autoencoder with the first image data and the second denoising autoencoder with the second image data.
Optionally, training the first denoising autoencoder with the first image data using the hierarchical training method includes: first training on the first image data with a two-layer convolution to obtain first parameters; training on the first image data and the first parameters with a four-layer convolution to obtain second parameters; training on the first image data and the second parameters with a six-layer convolution to obtain third parameters; and so on, adding two convolution layers each time, one belonging to the encoder and one to the decoder of the first denoising autoencoder.
Optionally, the method further comprises acquiring position information of the object image to be replaced within the video to be replaced.
Optionally, fusing the reconstructed image into the frame sequence comprises fusing the reconstructed image at the position in the frame sequence indicated by the position information.
Optionally, the preselected noise is Gaussian noise.
According to a second aspect of the embodiments of the present disclosure, there is provided a video processing apparatus including: a decomposition module for decoding the video to be replaced and the target video into a first frame sequence and a second frame sequence, respectively, and acquiring the corresponding object image to be replaced and target object image; a first self-encoder for encoding the object image to be replaced, with preselected noise added during encoding; a migration module for performing style migration on the encoding result; a second self-encoder for decoding and reconstructing the style-migrated encoding result so that the target object image replaces the object image to be replaced, yielding a reconstructed image; and a replacement module for fusing the reconstructed image into the first frame sequence to obtain a replaced first frame sequence and restoring the replaced first frame sequence to video.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program comprising computer executable instructions for implementing the method as described above when executed.
(III) advantageous effects
The video processing method, apparatus, electronic device and medium provided by the present disclosure have the following beneficial effects:
1. When replacing an object in a video work, training one model for the object to be replaced suffices to process every shot in the video, saving time and material costs.
2. A denoising mechanism is added while encoding the object to be replaced, style migration is applied to the encoding result, and a hierarchical training method is used during model training. Together these reduce replacement artifacts, ensuring a clear and realistic face-swapping result and better viewing quality.
3. The operations of the method are packaged in a program stored on a device or electronic apparatus, so a user needs only basic computer skills to follow the training steps and usage flow, without expert knowledge of computer science or image processing; the method is therefore simple to operate.
Drawings
For a more complete understanding of the present disclosure and its advantages, reference is made to the following description taken in conjunction with the accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure, and together with the description serve to explain its principles. In the drawings:
fig. 1 schematically shows a flow chart of a video processing method according to an exemplary embodiment of the present disclosure;
FIG. 2 schematically illustrates the principle of a denoising autoencoder according to an exemplary embodiment of the present disclosure;
fig. 3 schematically illustrates the network structure of an autoencoder according to an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a training method of an autoencoder according to an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates a network architecture diagram of a VGG19 according to an exemplary embodiment of the present disclosure;
fig. 6 schematically shows a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure; and
fig. 7 schematically shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The embodiments of the present disclosure provide a video processing method including: decoding a video to be replaced and a target video into a first frame sequence and a second frame sequence, respectively, to obtain the corresponding object image to be replaced and target object image; encoding the object image to be replaced, with preselected noise added during encoding; performing style migration on the encoding result; decoding and reconstructing the style-migrated encoding result so that the target object image replaces the object image to be replaced, yielding a reconstructed image; and fusing the reconstructed image into the first frame sequence to obtain a replaced first frame sequence, which is restored to video.
Fig. 1 schematically shows a flow diagram of a video processing method according to an exemplary embodiment of the present disclosure; the method may include, for example, operations S101 to S105.
S101, decode the video to be replaced and the target video into a first frame sequence and a second frame sequence, respectively, and obtain the corresponding object image to be replaced and target object image.
In one feasible implementation of this embodiment, video decoding software may be used to decode the video to be replaced into frames, yielding the first frame sequence, and the target video into frames, yielding the second frame sequence; the two sequences may be stored in separate folders. From each frame, the object image to be replaced and the target object image are extracted. The object image to be replaced may, for example, be a human face, although the disclosure is not limited thereto. For faces, a face detector such as DLIB or a Multi-task Cascaded Convolutional Neural Network (MTCNN) may be used to extract and align the face image in each frame; the specific extraction method is not limited by the disclosure. To allow the replaced object (face) to be restored to its original position in the video later, the position information of the object to be replaced is acquired and stored so that the replacement can be fused back into the original frame sequence.
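The crop-and-record bookkeeping of S101 can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: the function name, the box format and the dictionary layout are assumptions, and a real pipeline would obtain the box from a DLIB or MTCNN detection rather than pass it in by hand.

```python
import numpy as np

def extract_face(frame: np.ndarray, box: tuple):
    """Crop a face region from a decoded frame and record its position.

    `frame` is an H x W x 3 array from the decoded frame sequence; `box`
    is (top, left, height, width), a stand-in for a DLIB/MTCNN detection.
    The returned position record is what lets S105 paste the replacement
    back at the original location.
    """
    top, left, h, w = box
    crop = frame[top:top + h, left:left + w].copy()
    position = {"top": top, "left": left, "height": h, "width": w}
    return crop, position

# usage on a dummy 720p frame
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
crop, pos = extract_face(frame, (100, 200, 256, 256))
```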
S102, encode the object image to be replaced, adding preselected noise during encoding.
In one feasible implementation of this embodiment, an autoencoder may be used to encode the object image to be replaced. An autoencoder is an unsupervised learning model composed of an encoder and a decoder. The encoder can be represented by a function h = f(x) and the decoder by a function r = g(h). The autoencoder imposes constraints on the output so that the encoder learns discriminative features of the image and the image data x' reconstructed by the decoder reproduces the input image x as closely as possible.
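The h = f(x), r = g(h) interface can be made concrete with a deliberately tiny linear autoencoder. This toy is our own construction for illustration only; the patent's encoder and decoder are 5-layer convolutional networks, and the tied-weight linear maps below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear autoencoder: the encoder h = f(x) compresses a 32-dim input
# to an 8-dim code, and the decoder r = g(h) maps it back. Tied weights
# (W and W.T) are an illustrative choice, not taken from the patent.
W = 0.1 * rng.standard_normal((8, 32))

def f(x):
    """Encoder: h = f(x)."""
    return W @ x

def g(h):
    """Decoder: r = g(h)."""
    return W.T @ h

x = rng.standard_normal(32)   # stand-in for a flattened input image
h = f(x)                      # latent code, trained to be discriminative
r = g(h)                      # reconstruction x', trained to reproduce x
```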
In practice, scenes in video are complex: the angle and expression of a face vary greatly, and local regions of the face are easily occluded by hair or other objects; that is, face images carry considerable irrelevant noise. Face replacement may therefore produce local blurring or even erroneous replacement, leaving large noise in the output image. To overcome this, the present embodiment retrains an ordinary autoencoder into a first denoising autoencoder, whose principle is shown in fig. 2. The training process is described in detail below using face replacement in video as an example, though it is not limited to faces.
First, a first training data set is acquired; it contains the video to be replaced for training, i.e. a to-be-replaced video used for training purposes.
Then the video to be replaced for training is decoded into a frame sequence, and the face image x in each frame is extracted as the first image data; DLIB, for example, may be used for extraction. Noise, e.g. Gaussian noise, is added to each face image to obtain a noisy picture. The noisy picture is input to the autoencoder to be trained. The network structure of its encoder and decoder is shown in fig. 3, with the convolution kernel sizes and output channel counts of each layer annotated; the encoder and the decoder each comprise 5 convolutional layers.
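The noisy-picture construction (corrupt the face image with Gaussian noise, then train the network to output the clean image) can be sketched as below; sigma = 0.1 and the [0, 1] pixel range are assumed values, since the patent does not fix them.

```python
import numpy as np

rng = np.random.default_rng(42)

def corrupt(x: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Add zero-mean Gaussian noise to a face image in [0, 1] to build
    the denoising-autoencoder training input; sigma is an assumed value,
    the patent does not specify one."""
    noisy = x + rng.normal(0.0, sigma, size=x.shape)
    return np.clip(noisy, 0.0, 1.0)

x = np.full((64, 64, 3), 0.5)   # stand-in clean face image
x_noisy = corrupt(x)            # network input; the training target stays x
```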
Training the network with plain stochastic gradient descent is prone to vanishing gradients. Therefore, in one feasible implementation of this embodiment, the first image data is trained with a hierarchical training method, shown in fig. 4: first train with a two-layer convolution on the first image data to obtain first parameters, i.e. only the first encoder/decoder layer pair is trained; then train with a four-layer convolution on the first image data and the first parameters to obtain second parameters, i.e. the first-layer parameters are carried into the second stage; then train with a six-layer convolution on the first image data and the second parameters to obtain third parameters; and so on, adding two convolutional layers each time, one for the encoder and one for the decoder of the first denoising autoencoder. The network is deepened by one layer pair every 100000 training rounds. Example training parameters: batch_size 16, weight decay 0.0001, initial learning rate 0.0001. For the network output image y, the loss function is the mean square error (MSE):

L(x, y) = ||x - y||^2

The encoder obtained at this point is denoted E_A and the decoder D_A.
After training, the first denoising autoencoder is obtained, and the object image to be replaced is input into it for encoding.
Because the video must later be restored with the object image to be replaced swapped for the target object image, the encoding result must also be decoded. A second denoising autoencoder is therefore trained with the same method; its encoder is denoted E_B and its decoder D_B. This training uses the second training data set, which contains the target video for training.
S103, perform style migration on the encoding result.
The object to be replaced and the target object differ in video scene and shooting environment, so their lighting differs considerably. Natural light passes through the lens and CCD before reaching the camera's electronics, which inevitably add dark-current, thermal and shot noise. The camera also performs mathematical quantization operations, such as interpolation, white balance and gamma correction, on the sensed signal to produce the final digital image, introducing quantization noise. The object to be replaced therefore has a different light-and-shadow appearance and noise pattern from the target object. If the encoder E_A and decoder D_B were used directly, the replaced faces would meet the basic requirements of face swapping but would not look real and natural enough, with noticeable artificial traces and poor viewing quality.
Therefore, in one feasible implementation of this embodiment, a migration network T_θ is cascaded between the encoder E_A and the decoder D_B, and the replacement process is further trained with a VGG loss; the original style characteristics are preserved and the overall effect is more realistic and harmonious.
The migration network T_θ may, for example, be trained as follows.
The first training data set, i.e. the input face images x to be replaced, is used as training data. The migration network may, for example, consist of two fully connected layers followed by one convolutional layer; the fully connected layers output 1024 and 4 × 1024 neurons respectively, and the convolutional layer has a 3 × 3 kernel and 1024 output channels.
The training loss function is:

Loss = L_vgg(y, x) = L_vgg(D_B(T_θ(E_A(x))), x)

where L_vgg(y, x) = ||φ(y) - φ(x)||^2, φ(x) is the feature map of the image x output by the VGG19 network, y is the image reconstructed from x by encoding, migration and decoding, and φ(y) is the feature map of the reconstructed image y output by the VGG19 network. The structure of the VGG19 network is shown in fig. 5.
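A runnable sketch of this perceptual loss follows. The 2x2 average-pool stands in for the VGG19 feature map φ so the computation runs without pretrained weights, and the 1/(C·H·W) normalization is the common perceptual-loss convention, added here as an assumption.

```python
import numpy as np

def phi(img):
    """Stand-in for the VGG19 feature map: a 2x2 average-pool over an
    H x W x C image. A real implementation would use VGG19 activations."""
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def vgg_loss(y, x):
    """L_vgg(y, x) = ||phi(y) - phi(x)||^2 over the feature map,
    normalized by C*H*W (assumed standard perceptual-loss form)."""
    fy, fx = phi(y), phi(x)
    h, w, c = fy.shape
    return float(np.sum((fy - fx) ** 2) / (c * h * w))

x = np.ones((8, 8, 3))    # stand-in source face image
y = np.zeros((8, 8, 3))   # stand-in reconstruction D_B(T_theta(E_A(x)))
loss = vgg_loss(y, x)
```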
S104, decode and reconstruct the style-migrated encoding result so that the target object image replaces the object image to be replaced, yielding a reconstructed image.
The trained second denoising autoencoder is acquired, the style-migrated encoding result is input into it for decoding, and an image is reconstructed from the decoding result; the reconstructed image carries the identity of the target object. Thus the input image x passes through the encoder E_A of the first denoising autoencoder, then through the migration network, and finally through the decoder D_B of the second denoising autoencoder, giving the reconstructed image y = D_B(T_θ(E_A(x))).
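The composition y = D_B(T_θ(E_A(x))) can be sketched with stand-in maps; the linear encoder/decoder and the tanh "migration" below are placeholders of ours with the right interfaces, not the trained networks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins with the right interfaces: E_A encodes, T_theta transforms
# the latent code, D_B decodes. The real E_A/D_B are trained CNNs and
# T_theta is two fully connected layers plus a convolution.
W_e = 0.1 * rng.standard_normal((8, 32))
W_d = 0.1 * rng.standard_normal((32, 8))

def E_A(x):
    return W_e @ x            # encoder of the first denoising autoencoder

def T_theta(h):
    return np.tanh(h)         # placeholder style-migration map

def D_B(h):
    return W_d @ h            # decoder of the second denoising autoencoder

x = rng.standard_normal(32)   # object image to be replaced (flattened)
y = D_B(T_theta(E_A(x)))      # reconstruction carrying the target identity
```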
S105, fuse the reconstructed image into the first frame sequence to obtain a replaced first frame sequence, and restore the replaced first frame sequence to video.
The replaced reconstructed image is seamlessly fused into the first frame sequence according to the stored position information, and the frame sequence is restored to video, completing the replacement of the object in the video to be replaced.
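The paste-back of S105 can be sketched as follows. The feathered alpha mask is a simplified stand-in of ours for the seamless (e.g. Poisson) blending the text implies, and the position-record format is an assumption.

```python
import numpy as np

def fuse(frame: np.ndarray, patch: np.ndarray, pos: dict) -> np.ndarray:
    """Paste the reconstructed face back at the stored position using a
    feathered alpha mask: full patch weight at the centre, ramping to
    zero at the border so the seam blends into the original frame."""
    out = frame.astype(np.float64).copy()
    t, l = pos["top"], pos["left"]
    h, w = patch.shape[:2]
    # distance-to-border ramps along each axis, combined by minimum
    ry = np.minimum(np.arange(h), np.arange(h)[::-1])
    rx = np.minimum(np.arange(w), np.arange(w)[::-1])
    alpha = np.minimum.outer(ry, rx).astype(np.float64)
    alpha = np.clip(alpha / max(alpha.max(), 1), 0.0, 1.0)[..., None]
    region = out[t:t + h, l:l + w]
    out[t:t + h, l:l + w] = alpha * patch + (1.0 - alpha) * region
    return out.astype(frame.dtype)

frame = np.zeros((720, 1280, 3), dtype=np.uint8)
patch = np.full((64, 64, 3), 255, dtype=np.uint8)
fused = fuse(frame, patch, {"top": 100, "left": 200})
```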
When replacing an object in a video work with this embodiment, training one model for the object to be replaced suffices to process every shot in the video, saving time and material costs. A denoising mechanism is added while encoding the object to be replaced, style migration is applied to the encoding result, and a hierarchical training method is used during model training; together these reduce replacement artifacts, ensuring a clear and realistic face-swapping result and better viewing quality. The operations of the method are packaged in a program stored on a device or electronic apparatus, so a user needs only basic computer skills to follow the training steps and usage flow, without expert knowledge of computer science or image processing; the method is simple to operate.
Fig. 6 schematically shows a block diagram of a video processing apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 6, the apparatus 600 may include, for example, a decomposition module 610, a first self-encoder 620, a migration module 630, a second self-encoder 640, and a replacement module 650.
The decomposition module 610 decodes the video to be replaced and the target video into a first frame sequence and a second frame sequence, respectively, and obtains the corresponding object image to be replaced and target object image.
The first self-encoder 620 encodes the object image to be replaced, with preselected noise added during encoding.
The migration module 630 performs style migration on the encoding result.
The second self-encoder 640 decodes and reconstructs the style-migrated encoding result so that the target object image replaces the object image to be replaced, yielding a reconstructed image.
The replacement module 650 fuses the reconstructed image into the first frame sequence to obtain a replaced first frame sequence and restores the replaced first frame sequence to video.
It should be noted that the embodiment of the apparatus portion is similar to the embodiment of the method portion, and please refer to the method embodiment portion for details, which are not described herein again.
Any of the modules, units, or at least part of the functionality of any of them according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules and units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, units according to the embodiments of the present disclosure may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by any other reasonable means of hardware or firmware by integrating or packaging the circuits, or in any one of three implementations of software, hardware and firmware, or in any suitable combination of any of them. Alternatively, one or more of the modules, units according to embodiments of the present disclosure may be implemented at least partly as computer program modules, which, when executed, may perform the respective functions.
For example, any plurality of the decomposition module 610, the first self-encoder 620, the migration module 630, the second self-encoder 640, and the replacement module 650 may be combined in one module to be implemented, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the decomposition module 610, the first self-encoder 620, the migration module 630, the second self-encoder 640, and the replacement module 650 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented by any one of three implementations of software, hardware, and firmware, or any suitable combination of any of them. Alternatively, at least one of the decomposition module 610, the first self-encoder 620, the migration module 630, the second self-encoder 640 and the replacement module 650 may be at least partially implemented as a computer program module that, when executed, may perform a corresponding function.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, electronic device 700 includes a processor 710, a computer-readable storage medium 720. The electronic device 700 may perform a method according to an embodiment of the present disclosure.
In particular, processor 710 may comprise, for example, a general purpose microprocessor, an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), and/or the like. The processor 710 may also include on-board memory for caching purposes. Processor 710 may be a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
Computer-readable storage medium 720, for example, may be a non-volatile computer-readable storage medium, specific examples including, but not limited to: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); memory such as Random Access Memory (RAM) or flash memory, etc.
The computer-readable storage medium 720 may include a computer program 721, which computer program 721 may include code/computer-executable instructions that, when executed by the processor 710, cause the processor 710 to perform a method according to an embodiment of the disclosure, or any variation thereof.
The computer program 721 may be configured with, for example, computer program code comprising computer program modules. For example, in an example embodiment, the code in the computer program 721 may include one or more program modules, for example modules 721A, 721B, etc. It should be noted that the division and number of modules are not fixed; those skilled in the art may use suitable program modules or combinations of program modules according to the actual situation, so that when these program modules are executed by the processor 710, the processor 710 may perform the method according to an embodiment of the present disclosure, or any variation thereof.
At least one of the decomposition module 610, the first self-encoder 620, the migration module 630, the second self-encoder 640, and the replacement module 650 may be implemented as a computer program module described with reference to fig. 7, which when executed by the processor 710 may implement the respective operations described above, according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be included in the apparatus/device/system described in the above embodiments, or may exist separately without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While the present disclosure has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure as defined by the appended claims and their equivalents. Accordingly, the scope of the present disclosure should not be limited to the above-described embodiments, but should be defined by the appended claims and their equivalents.

Claims (8)

1. A method of video processing, the method comprising:
decoding a video to be replaced and a target video into a first frame sequence and a second frame sequence, respectively, and acquiring an object image to be replaced and a corresponding target object image from the respective frame sequences;
acquiring a first noise reduction self-encoder, inputting the object image to be replaced into the first noise reduction self-encoder for encoding, and adding preselected noise in the encoding process;
carrying out style migration on the coding result;
acquiring a second noise reduction self-encoder, inputting the encoding result of the style migration into the second noise reduction self-encoder for decoding and reconstruction, and enabling the target object image to replace the object image to be replaced to obtain a reconstructed image;
and fusing the reconstructed image to the first frame sequence to obtain a replaced first frame sequence, and restoring the replaced first frame sequence to a video.
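For illustration only, the pipeline of claim 1 can be sketched with toy stand-ins: a single linear map plays the role of each noise reduction self-encoder, small patches stand in for frames, and style migration is approximated by matching the latent code's mean and standard deviation to the target's. All names and sizes here (`D`, `H`, `W_enc`, `W_dec`, `encode_with_noise`, `style_migrate`) are invented for this sketch and are not part of the claimed method, which uses convolutional self-encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a "frame" is an 8x8 grayscale patch, and each self-encoder
# is a single linear map. A real embodiment would use convolutional networks.
D, H = 64, 16                                 # flattened patch size, latent size
W_enc = rng.normal(scale=0.1, size=(H, D))    # first (noise reduction) encoder
W_dec = rng.normal(scale=0.1, size=(D, H))    # second self-encoder's decoder

def encode_with_noise(patch, sigma=0.1):
    """Encode the object-to-replace patch, adding Gaussian noise during encoding."""
    x = patch.reshape(-1)
    x_noisy = x + rng.normal(scale=sigma, size=x.shape)  # preselected noise
    return W_enc @ x_noisy

def style_migrate(code, target_code):
    """Match the code's mean/std to the target code's (a crude stand-in
    for the style migration step)."""
    mu_s, std_s = code.mean(), code.std() + 1e-8
    mu_t, std_t = target_code.mean(), target_code.std() + 1e-8
    return (code - mu_s) / std_s * std_t + mu_t

def decode(code):
    """Decode and reconstruct via the second self-encoder's decoder."""
    return (W_dec @ code).reshape(8, 8)

frame_patch = rng.random((8, 8))        # object image to be replaced
target_patch = rng.random((8, 8))       # target object image

code = encode_with_noise(frame_patch)
code = style_migrate(code, W_enc @ target_patch.reshape(-1))
reconstructed = decode(code)            # reconstructed (replaced) image
print(reconstructed.shape)
```

The reconstructed patch would then be fused back into the first frame sequence as in the final step of the claim.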
2. The method of claim 1, further comprising training the first noise reduction self-encoder or the second noise reduction self-encoder, the training comprising:
acquiring a first training data set and a second training data set, wherein the first training data set comprises videos to be replaced for training, and the second training data set comprises target videos for training;
extracting first image data in the video to be replaced for training, and extracting second image data of the target video for training;
and training, using a hierarchical training method, the first noise reduction self-encoder with the first image data and the second noise reduction self-encoder with the second image data.
3. The method of claim 2, wherein training the first noise-reducing self-encoder using the hierarchical training method using the first image data comprises:
performing first training on the first image data by adopting double-layer convolution to obtain a first parameter;
training the first image data and the first parameters by adopting four layers of convolution to obtain second parameters;
training the first image data and the second parameter by adopting six layers of convolution to obtain a third parameter;
and so on, adding two layers of convolution each time, wherein one layer corresponds to the encoder of the first noise reduction self-encoder and the other layer corresponds to the decoder of the first noise reduction self-encoder.
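A minimal sketch of the hierarchical training of claim 3, with plain linear layers standing in for the claimed convolution layers: each round adds one encoder layer and one matching decoder layer and trains the new pair by a few gradient steps. For brevity, earlier layers are kept frozen here, whereas the claim retrains the network initialized from the previous round's parameters; the widths, learning rate, and step count are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((64, 256))          # "first image data": 256 flattened 8x8 patches
sizes = [64, 32, 16, 8]            # encoder layer widths per round (assumed)

enc, dec = [], []
H = X
for k in range(1, len(sizes)):
    n_in, n_out = sizes[k - 1], sizes[k]
    We = rng.normal(scale=0.1, size=(n_out, n_in))   # new encoder layer
    Wd = rng.normal(scale=0.1, size=(n_in, n_out))   # new matching decoder layer
    # Train only the newly added linear pair to reconstruct the current
    # representation H (earlier layers stay frozen in this sketch).
    lr, n = 1e-3, H.shape[1]
    for _ in range(200):
        Z = We @ H                 # encode
        G = Wd @ Z - H             # reconstruction error (gradient of MSE w.r.t. output)
        Wd -= lr * (G @ Z.T) / n
        We -= lr * (Wd.T @ G @ H.T) / n
    err = float(np.mean((Wd @ (We @ H) - H) ** 2))
    print(f"round {k}: {2 * k} total layers, reconstruction MSE = {err:.4f}")
    enc.append(We)
    dec.insert(0, Wd)              # decoder mirrors the encoder stack
    H = We @ H                     # input representation for the next round
```

After three rounds this yields a six-layer stack, matching the two-then-four-then-six progression of the claim.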
4. The method of claim 1, further comprising:
and acquiring the position information of the object image to be replaced in the video to be replaced.
5. The method of claim 4, wherein the fusing the reconstructed image to the first frame sequence comprises:
fusing the reconstructed image to the position in the first frame sequence pointed to by the position information.
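Claims 4 and 5 can be illustrated by a simple paste of the reconstructed patch at the recorded location. This is a hard paste; a real fusion step would typically blend the patch boundary (for example, with Poisson blending or feathering), and the frame size, patch size, and coordinates below are invented for the sketch.

```python
import numpy as np

frame = np.zeros((720, 1280, 3), dtype=np.uint8)       # one frame of the first frame sequence
reconstructed = np.full((96, 96, 3), 255, np.uint8)    # reconstructed image (toy: all white)

def fuse(frame, patch, top, left):
    """Paste the reconstructed patch at the position recorded for the
    object to be replaced (hard paste; no boundary blending)."""
    h, w = patch.shape[:2]
    out = frame.copy()
    out[top:top + h, left:left + w] = patch
    return out

# Position information for the object image to be replaced (assumed values).
fused = fuse(frame, reconstructed, top=100, left=200)
print(fused[100, 200], fused[0, 0])    # [255 255 255] [0 0 0]
```

Repeating this per frame, with per-frame position information, produces the replaced first frame sequence that is then re-encoded into a video.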
6. The method of claim 1, wherein the preselected noise is Gaussian noise.
7. A video processing apparatus, characterized in that the apparatus comprises:
the decomposition module is used for decoding the video to be replaced and the target video into a first frame sequence and a second frame sequence, respectively, and acquiring the object image to be replaced and the corresponding target object image;
the first self-encoder is used for encoding the object image to be replaced, and preselection noise is added in the encoding process;
the migration module is used for carrying out style migration on the coding result;
the second self-encoder is used for decoding and reconstructing the encoding result of the style migration, so that the target object image replaces the object image to be replaced, and a reconstructed image is obtained;
and the replacing module is used for fusing the reconstructed image to the first frame sequence to obtain a replaced first frame sequence and restoring the replaced first frame sequence into a video.
8. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.
CN202010057682.5A 2020-01-17 2020-01-17 Video processing method, video processing device, electronic equipment and medium Active CN111263226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010057682.5A CN111263226B (en) 2020-01-17 2020-01-17 Video processing method, video processing device, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN111263226A CN111263226A (en) 2020-06-09
CN111263226B true CN111263226B (en) 2021-10-22

Family

ID=70948985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010057682.5A Active CN111263226B (en) 2020-01-17 2020-01-17 Video processing method, video processing device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN111263226B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112752147A (en) * 2020-09-04 2021-05-04 腾讯科技(深圳)有限公司 Video processing method, device and storage medium
CN112651449B (en) * 2020-12-29 2023-08-01 北京百度网讯科技有限公司 Method, device, electronic equipment and storage medium for determining content characteristics of video

Citations (4)

Publication number Priority date Publication date Assignee Title
CN108564166A (en) * 2018-03-22 2018-09-21 南京大学 Based on the semi-supervised feature learning method of the convolutional neural networks with symmetrical parallel link
CN110503703A (en) * 2019-08-27 2019-11-26 北京百度网讯科技有限公司 Method and apparatus for generating image
CN110533579A (en) * 2019-07-26 2019-12-03 西安电子科技大学 Based on the video style conversion method from coding structure and gradient order-preserving
CN110533585A (en) * 2019-09-04 2019-12-03 广州华多网络科技有限公司 A kind of method, apparatus that image is changed face, system, equipment and storage medium

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9639743B2 (en) * 2013-05-02 2017-05-02 Emotient, Inc. Anonymization of facial images


Non-Patent Citations (1)

Title
A face recognition method based on stacked denoising autoencoders; Ran Peng et al.; Industrial Control Computer; 2016-12-31; Vol. 29, No. 9; main text *


Similar Documents

Publication Publication Date Title
Lutz et al. Alphagan: Generative adversarial networks for natural image matting
Niu et al. HDR-GAN: HDR image reconstruction from multi-exposed LDR images with large motions
Lin et al. Real-time high-resolution background matting
Zhang et al. Semantic image inpainting with progressive generative networks
US11055828B2 (en) Video inpainting with deep internal learning
Wang et al. Deep learning for hdr imaging: State-of-the-art and future trends
Lu et al. Layered neural rendering for retiming people in video
CN113658051A (en) Image defogging method and system based on cyclic generation countermeasure network
CN111263226B (en) Video processing method, video processing device, electronic equipment and medium
Wang et al. Event-driven video frame synthesis
US11562597B1 (en) Visual dubbing using synthetic models
Messikommer et al. Multi-bracket high dynamic range imaging with event cameras
US20220156987A1 (en) Adaptive convolutions in neural networks
US20230274400A1 (en) Automatically removing moving objects from video streams
Wan et al. Purifying low-light images via near-infrared enlightened image
CN113542780B (en) Method and device for removing compression artifacts of live webcast video
CN115719399A (en) Object illumination editing method, system and medium based on single picture
Barua et al. Arthdr-net: Perceptually realistic and accurate hdr content creation
CN112669234A (en) High-resolution image restoration method and system based on neural network
Yang et al. Multi-scale extreme exposure images fusion based on deep learning
Flaxton HD Aesthetics and Digital Cinematography
Que et al. Residual dense U‐Net for abnormal exposure restoration from single images
CN114449280B (en) Video coding and decoding method, device and equipment
US20230044969A1 (en) Video matting
Suraj et al. A Technique for Video Inpainting using Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant