CN117278800A - Video content replacement method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117278800A
Authority
CN
China
Prior art keywords
video
frame
target
sequence
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311235918.XA
Other languages
Chinese (zh)
Inventor
王倓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority claimed from CN202311235918.XA
Publication of CN117278800A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47205End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/64Computer-aided capture of images, e.g. transfer from script file into camera, check of taken image quality, advice or proposal for image composition or decision on when to take image

Abstract

The application relates to a video content replacement method and device, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring an original video to be processed; locating a target area in each video frame of the original video and erasing the target area in each video frame to obtain a background image frame sequence; analyzing the original video frame by frame to obtain a pose change sequence of the camera relative to the target area during shooting of the original video; generating a rendered image of the target object under each camera pose in the pose change sequence to obtain a rendered image frame sequence of the target object; and synthesizing the background image frame sequence and the rendered image frame sequence of the target object frame by frame to obtain a target video, wherein each video frame of the target video contains the target object. The method and device solve the technical problem that, in the prior art, video content replacement cannot achieve a good replacement effect while also guaranteeing efficient video processing.

Description

Video content replacement method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of video processing, and in particular, to a method and apparatus for replacing video content, an electronic device, and a storage medium.
Background
With the development of multimedia technology, the replacement of objects in video becomes a common video editing requirement in various application scenes such as creation, entertainment, education and the like. For example, a character in a video may be replaced with a cartoon character, or an item in a video may be replaced with another item.
Currently, there are two main methods for replacing objects in video. Method one: manual retouching, in which a user edits the video frame by frame with professional software or tools, covering or replacing objects in the original video with target objects. Method two: analyzing and reconstructing the original video with a video generation algorithm based on deep learning or similar techniques to generate new video content.
The first method, while guaranteeing the sharpness and harmony of the video, requires a large amount of manual operation and a high level of user expertise, making it time-consuming and labor-intensive. The second method is automatic and fast, but the generated video is of low quality and prone to problems such as blurring, distortion and unnaturalness; in particular, when the camera moves, the content of the generated area cannot change with the camera's viewing angle.
Disclosure of Invention
The application provides a video content replacement method and device, an electronic device and a storage medium, to solve the technical problem that, in the prior art, video content replacement cannot achieve a good replacement effect while also guaranteeing efficient video processing.
In a first aspect, the present application provides a video content replacement method, the method comprising:
acquiring an original video to be processed;
positioning a target area in each video frame of the original video respectively, and erasing the target area in each video frame to obtain a background image frame sequence;
analyzing the original video frame by frame to obtain a pose change sequence of a camera relative to the target area in the shooting process of the original video;
generating a rendering image of a target object under each camera pose in the pose change sequence to obtain a rendering image frame sequence of the target object;
and synthesizing the background image frame sequence and the rendering image frame sequence of the target object frame by frame to obtain a target video, wherein each video frame of the target video contains the target object.
In a possible implementation manner, the generating a rendered image of the target object under each camera pose in the pose change sequence, to obtain a rendered image frame sequence of the target object, includes:
acquiring a plurality of imaging pictures of a target object under different shooting angles;
constructing a reconstruction model of the target object by using the plurality of imaging pictures;
and inputting the pose change sequence into the reconstruction model to obtain a rendered image frame sequence of the target object corresponding to the pose change sequence.
In a possible implementation manner, the constructing a reconstruction model of the target object using the plurality of imaging pictures includes:
and training an initial neural radiance field model with the plurality of imaging pictures to obtain a reconstruction model of the target object.
In a possible implementation manner, the positioning the target area in each video frame of the original video includes:
identifying an object to be replaced from each video frame of the original video respectively;
and determining the area including the object to be replaced in each video frame as a target area.
In a possible implementation manner, the erasing the target area in each video frame to obtain a background image frame sequence includes:
The following is performed for each of the video frames:
determining a target pixel value;
and resetting the pixel value of each pixel in the target area in the video frame to be the target pixel value to obtain a background image frame.
In a possible implementation manner, the determining the target pixel value includes:
acquiring a pixel value of each pixel in the video frame in other areas except the target area;
and determining a target pixel value according to the pixel value of each pixel in the other region.
In a possible implementation manner, the analyzing the original video frame by frame to obtain a pose change sequence of a camera relative to the target area in the shooting process of the original video includes:
extracting a feature descriptor set of the target region from each video frame of the original video;
matching the feature descriptor set of each video frame in the original video to obtain a matching feature point set of each video frame;
determining a pose change sequence of the camera in a world coordinate system based on the matched feature point set of each video frame;
constructing a dense point cloud for the target area, and obtaining the centroid position of the target area based on the dense point cloud;
determining an offset of the target area relative to the origin of the world coordinate system according to the centroid position;
and shifting the pose change sequence of the camera in the world coordinate system according to the offset to obtain the pose change sequence of the camera relative to the target area in the shooting process of the original video.
In a second aspect, the present application provides a video content replacement apparatus, the apparatus comprising:
the video acquisition module is used for acquiring an original video to be processed;
the positioning module is used for positioning the target area in each video frame of the original video respectively;
the erasing module is used for erasing the target area in each video frame to obtain a background image frame sequence;
the pose determining module is used for analyzing the original video frame by frame to obtain a pose change sequence of the camera relative to the target area in the shooting process of the original video;
the rendering module is used for generating a rendering image of the target object under each camera pose in the pose change sequence to obtain a rendering image frame sequence of the target object;
and the synthesis module is used for synthesizing the background image frame sequence and the rendering image frame sequence of the target object frame by frame to obtain a target video, wherein each video frame of the target video contains the target object.
In a possible implementation manner, the rendering module includes:
the imaging unit is used for acquiring a plurality of imaging pictures of the target object under different shooting angles;
a model construction unit for constructing a reconstruction model of the target object using the plurality of imaging pictures;
and the model processing unit is used for inputting the pose change sequence into the reconstruction model to obtain a rendered image frame sequence of the target object corresponding to the pose change sequence.
In a possible embodiment, the model building unit is specifically configured to:
and training an initial neural radiance field model with the plurality of imaging pictures to obtain a reconstruction model of the target object.
In a possible implementation manner, the positioning module is specifically configured to:
identifying an object to be replaced from each video frame of the original video respectively;
and determining the area including the object to be replaced in each video frame as a target area.
In one possible embodiment, the erase module includes:
a pixel value determining unit configured to determine a target pixel value for each of the video frames;
and a pixel resetting unit, configured to reset, for each video frame, a pixel value of each pixel in the target area in the video frame to the target pixel value, to obtain a background image frame.
In a possible implementation manner, the pixel value determining unit is specifically configured to:
acquiring a pixel value of each pixel in the video frame in other areas except the target area;
and determining a target pixel value according to the pixel value of each pixel in the other region.
In one possible embodiment, the pose determining module includes:
a feature extraction unit, configured to extract a feature descriptor set of the target region from each video frame of the original video;
the feature matching unit is used for matching the feature descriptor set of each video frame in the original video to obtain a matching feature point set of each video frame;
a first pose determining unit, configured to determine a pose changing sequence of the camera in the world coordinate system based on the matching feature point set of each video frame;
the mass center determining unit is used for constructing a dense point cloud for the target area and obtaining the mass center position of the target area based on the dense point cloud;
an offset identification unit, configured to determine an offset of the target area relative to the origin of the world coordinate system according to the centroid position;
and the second pose determining unit is used for shifting the pose changing sequence of the camera in the world coordinate system according to the offset to obtain the pose changing sequence of the camera relative to the target area in the shooting process of the original video.
In a third aspect, the present application provides an electronic device, including: a processor and a memory, the processor being configured to execute a replacement program for video content stored in the memory, to implement the video content replacement method according to any one of the first aspects.
In a fourth aspect, the present application provides a storage medium storing one or more programs executable by one or more processors to implement the method of replacing video content of any of the first aspects.
Compared with the prior art, the technical solution provided by the embodiments of the present application has the following advantages. In the method provided by the embodiments of the present application, a target area is located in each video frame of the original video and erased, yielding a background image frame sequence in which the object to be replaced is no longer displayed; the original video is analyzed frame by frame to obtain the pose change sequence of the camera relative to the target area during shooting of the original video; a rendered image of the target object is generated under each camera pose in the pose change sequence, yielding a rendered image frame sequence that conforms to the camera's original shooting angles; and the two sequences are synthesized frame by frame into the target video. Applying this technical solution therefore yields a replaced video whose content is clear, realistic and natural; at the same time, no user operation is needed, so video content is replaced at zero labor cost, effectively guaranteeing efficient video processing.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.
FIG. 1 is an embodiment flowchart of a video content replacement method provided in an embodiment of the present application;
FIG. 2 is a flowchart, provided in an embodiment of the present application, of analyzing an original video frame by frame to obtain a pose change sequence of a camera relative to a target area during shooting of the original video;
FIG. 3 is an embodiment flowchart of another video content replacement method provided by embodiments of the present application;
FIG. 4 is a system architecture diagram provided in an embodiment of the present application;
FIG. 5 is a block diagram of an embodiment of a video content replacement apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present application based on the embodiments herein.
The following disclosure provides many different embodiments, or examples, for implementing different structures of the invention. In order to simplify the present disclosure, components and arrangements of specific examples are described below. They are, of course, merely examples and are not intended to limit the invention. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
In order to solve the technical problem that, in the prior art, video content replacement cannot achieve a good replacement effect while also guaranteeing efficient video processing, the present application provides a video content replacement method. The method replaces video content efficiently while avoiding the problems of current video generation algorithms, in which the generated video is of low quality and prone to blurring, distortion and unnaturalness, and the content of the generated area cannot change with the camera's viewing angle.
FIG. 1 is a flowchart of an embodiment of a video content replacement method according to an embodiment of the present application. As shown in FIG. 1, the method comprises the following steps:
step 101, obtaining an original video to be processed.
In an embodiment, the execution body of the embodiments of the present application is a component having a video content replacement function. As an optional implementation, the component is installed on a client. The client may then obtain the original video to be processed locally; for example, the client obtains it from a user-specified storage path. As another optional implementation, the component is deployed in the cloud. The cloud may receive an original video from a user device and determine it as the original video to be processed, or receive a URL from the user device and obtain the original video to be processed from that URL.
Step 102, positioning target areas in each video frame of the original video respectively, and performing erasure processing on the target areas in each video frame to obtain a background image frame sequence.
The video content in the target area is video content that needs to be replaced. In practical applications, the video content in the target area may be a person, an article, or an object of text waiting for replacement, which is not limited in the embodiments of the present application.
In an embodiment, locating the target area in each video frame of the original video may include: identifying the object to be replaced in each video frame of the original video, and determining the area of each video frame that includes the object to be replaced as the target area. Here, the object to be replaced is, for example, article A, and the area including the object to be replaced may be the area occupied by the object itself, or a larger area containing it, for example the area corresponding to the circumscribed rectangular frame of the object. The area occupied by the object to be replaced is the area formed by the pixels corresponding to the object; that is, the object can be identified in the video frame at the pixel level and then replaced at the pixel level, so that the replaced video frame is clear and faithful. In application, a general segmentation algorithm based on deep learning may be used to obtain the MASK (pixel-level mask) of article A in the video frame, and the target area is then segmented from the video frame through the MASK.
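The patent provides no code; as an illustrative sketch (the function name is hypothetical, and the mask is assumed to come from a segmentation network as described above), the circumscribed rectangular frame of the object to be replaced can be derived from the pixel-level MASK like this:

```python
import numpy as np

def target_region_bbox(mask: np.ndarray):
    """Given a pixel-level mask (1 = object to be replaced, 0 = background),
    return the circumscribed rectangle of the object as (y0, y1, x0, x1)."""
    ys, xs = np.nonzero(mask)
    return int(ys.min()), int(ys.max()) + 1, int(xs.min()), int(xs.max()) + 1

# Toy mask with a 2x3 "object" at rows 1-2, columns 3-5.
mask = np.zeros((6, 8), dtype=np.uint8)
mask[1:3, 3:6] = 1
print(target_region_bbox(mask))  # (1, 3, 3, 6)
```

Either the tight pixel mask or this bounding box can then serve as the target area, depending on whether pixel-level or rectangle-level replacement is desired.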
Here, the region information of the target region (position, size, and so on) may be the same or different across video frames. More commonly, as the pose of the camera changes, or as the position or angle of the object to be replaced changes, the shooting angle toward the object changes, so the position and angle of the object differ across video frames, and the region information of the target region differs accordingly. Therefore, in the embodiment of the application, for each video frame of the original video, the object to be replaced is identified in that frame, and the area of that frame that includes the object is determined as its target area.
In this embodiment of the present application, erasing the target area in a video frame means making the video content of the target area no longer displayed in the video frame. In an application scenario where article A in the original video is replaced, erasing the target area causes article A to no longer be displayed in the video frame. For convenience of description, a video frame whose target region has been erased is referred to as a background image frame, so erasing the target region in each video frame of the original video yields a background image frame sequence.
In one embodiment, erasing the target area in each video frame to obtain the background image frame sequence includes: the following processing is performed for each video frame: and determining a target pixel value, and resetting the pixel value of each pixel in a target area in the video frame to be the target pixel value to obtain a background image frame.
As an optional implementation, determining the target pixel value includes: determining a set pixel value, for example (255, 255, 255), as the target pixel value. Here, the same target pixel value may be set for every video frame in the original video, or different target pixel values may be set.
As another optional implementation manner, the determining the target pixel value includes: for each video frame in the original video, acquiring a pixel value of each pixel in other areas except for the target area in the video frame, and determining the target pixel value according to the pixel value of each pixel in the other areas. It follows that the target pixel values are determined separately for different video frames.
Optionally, determining the target pixel value according to the pixel value of each pixel in the other region includes: determining the average of the pixel values of the pixels in the other areas and taking that average as the target pixel value; or obtaining the pixel value that occurs most frequently in the other areas and taking that pixel value as the target pixel value.
By determining the target pixel value according to the pixel value of each pixel in the other areas except the target area and resetting the pixel value of each pixel in the target area in the video frame to the target pixel value, on one hand, the video content of the target area can be erased in the video frame, and on the other hand, the visual effects of the erased target area and the background area (namely the other areas) can be more similar, so that the problems that after the video content of the target area is erased, the target area is abrupt in the whole video frame and the erasing effect is unnatural are avoided. This can further enhance the sense of realism of the final content replacement effect.
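The two erasing strategies above (background mean, or most frequent background colour) can be sketched as follows; the function name is illustrative, not from the patent:

```python
import numpy as np

def erase_target_area(frame: np.ndarray, mask: np.ndarray,
                      strategy: str = "mean") -> np.ndarray:
    """Reset every pixel of the target area (mask == 1) to a target pixel
    value derived from the other areas (mask == 0) of the same frame."""
    background = frame[mask == 0]  # (N, channels) array of background pixels
    if strategy == "mean":
        target_value = background.mean(axis=0).astype(frame.dtype)
    else:  # "mode": the background colour with the largest number of occurrences
        colours, counts = np.unique(background, axis=0, return_counts=True)
        target_value = colours[counts.argmax()]
    out = frame.copy()
    out[mask == 1] = target_value
    return out

# Toy frame: uniform grey background (100) with a dark 2x2 object.
frame = np.full((4, 4, 3), 100, dtype=np.uint8)
frame[1:3, 1:3] = 0
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
erased = erase_target_area(frame, mask)
print((erased == 100).all())  # True: the object blends into the background
```

Because the fill value is taken from the surrounding background, the erased region matches the visual effect of the rest of the frame, as the paragraph above describes.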
And 103, analyzing the original video frame by frame to obtain a pose change sequence of the camera relative to the target area in the shooting process of the original video.
The pose change sequence of the camera relative to the target area in the shooting process of the original video can represent the shooting visual angle of the camera to the target area in the shooting process of the original video.
In an embodiment, the frame-by-frame analysis of the original video is implemented through the flow shown in fig. 2, so as to obtain the pose change sequence of the camera relative to the target area in the shooting process of the original video.
As shown in fig. 2, the method comprises the following steps:
step 201, extracting a feature descriptor set of a target area from each video frame of an original video.
As an optional implementation, the feature extraction function of COLMAP (a general-purpose Structure-from-Motion and Multi-View Stereo pipeline with graphical and command-line interfaces) may be used to extract feature descriptors of the feature points of the target region in each video frame of the original video, yielding a feature descriptor set.
Step 202, matching feature descriptor sets of all video frames in an original video to obtain matching feature point sets of all video frames.
The matching feature point set includes at least four feature points, and the feature points in the matching feature point sets of different video frames correspond to the same entities. For example, the matching feature point set {A1, B1, C1, D1} is determined from the 1st video frame, {A2, B2, C2, D2} from the 2nd video frame, and {A3, B3, C3, D3} from the 3rd video frame, where A1, A2 and A3 correspond to the same entity, as do B1, B2 and B3; C1, C2 and C3; and D1, D2 and D3.
As an alternative implementation manner, a feature matching function may be used in the COLMAP to match feature descriptor sets of each video frame in the original video, so as to obtain a matching feature point set of each video frame.
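The embodiment relies on COLMAP's feature matching; as a simplified, self-contained illustration of the underlying idea (not COLMAP's actual implementation), descriptors of two frames can be matched by mutual nearest neighbours:

```python
import numpy as np

def match_descriptor_sets(desc_a: np.ndarray, desc_b: np.ndarray):
    """Mutual nearest-neighbour matching between two descriptor sets
    (one row per feature point). Returns (i, j) index pairs where feature i
    of frame A and feature j of frame B correspond to the same entity."""
    dist = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    a_to_b = dist.argmin(axis=1)  # nearest B-feature for each A-feature
    b_to_a = dist.argmin(axis=0)  # nearest A-feature for each B-feature
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]

# Three toy 2-D descriptors per frame, listed in different orders.
desc_a = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
desc_b = np.array([[0.1, 9.9], [9.9, 0.1], [0.1, 0.1]])
print(match_descriptor_sets(desc_a, desc_b))  # [(0, 2), (1, 1), (2, 0)]
```

Real pipelines additionally apply ratio tests and geometric verification before accepting matches.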
Step 203, determining a pose change sequence of the camera in the world coordinate system based on the matched feature point set of each video frame.
Since each matching feature point set includes at least four feature points, the pose of the camera in the world coordinate system can be estimated from these feature points. Sorting the estimated poses in the order of the video frames then yields the pose change sequence of the camera in the world coordinate system during shooting of the original video.
Step 204, constructing a dense point cloud for the target area, and obtaining the centroid position of the target area based on the dense point cloud.
Step 205, determining the offset of the target area relative to the origin of the world coordinate system according to the centroid position.
The above offset includes offsets along the X, Y and Z axes.
And 206, shifting the pose change sequence of the camera in the world coordinate system according to the offset to obtain the pose change sequence of the camera relative to the target area in the shooting process of the original video.
Shifting the pose change sequence of the camera in the world coordinate system according to the offset means shifting each pose in the sequence by the offset toward the centroid of the target area. This yields the pose of the camera relative to the target area, and thus the pose change sequence of the camera relative to the target area during shooting of the original video.
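A minimal sketch of step 206, assuming each pose is a (rotation, translation) pair in the world coordinate system and the centroid offset from step 205 is already known; all names are illustrative:

```python
import numpy as np

def offset_pose_sequence(poses_world, centroid_offset):
    """Shift each camera pose in the world-frame sequence by the offset of
    the target area's centroid (X, Y, Z) from the world origin, yielding
    the pose sequence of the camera relative to the target area."""
    offset = np.asarray(centroid_offset, dtype=float)
    return [(R, np.asarray(t, dtype=float) - offset) for R, t in poses_world]

# One toy pose: identity rotation, camera at (1, 2, 3) in world coordinates.
poses = [(np.eye(3), np.array([1.0, 2.0, 3.0]))]
shifted = offset_pose_sequence(poses, (1.0, 1.0, 1.0))
print(shifted[0][1])  # [0. 1. 2.]
```

The rotation component is unchanged because the shift is a pure translation of the coordinate origin to the target area's centroid.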
Step 104, generating a rendering image of the target object under each camera pose in the pose change sequence, and obtaining a rendering image frame sequence of the target object.
As can be seen from the description of step 104, the embodiment of the present application generates the rendered image of the target object according to the pose of the camera relative to the target area during the shooting of the original video. This makes the rendered image of the target object conform to the view angle from which the camera originally shot the object to be replaced: for example, when the camera rotates 90 degrees, the target object in the rendered image also appears rotated by 90 degrees, so that its details are presented clearly, realistically and naturally. By generating a rendered image of the target object under each camera pose in the pose change sequence, the details of the target object presented in the video frames change with the shooting view angle of the camera.
Step 105, synthesizing the background image frame sequence and the rendered image frame sequence of the target object frame by frame to obtain a target video, wherein each video frame of the target video contains the target object.
As can be seen from the descriptions of steps 102 to 104 above, the object to be replaced is no longer displayed in the target area of the background image frames, while the rendered image frames display the target object. Since the rendered image frames of the target object are generated according to the pose of the camera relative to the target area during the shooting of the original video, the rendered images conform to the view angle from which the camera originally shot the object to be replaced. On this basis, the background image frames and the rendered image frames of the target object are synthesized: the resulting synthesized frames display the target object in place of the object to be replaced in the original video frames, and present the details of the target object clearly and naturally.
Further, synthesizing the background image frame sequence and the rendered image frame sequence of the target object frame by frame yields a target video that conforms to the time sequence of the original video frames, in which the details of the replacing target object change with the shooting view angle of the camera, making the content-replaced video more realistic.
In an embodiment, a specific implementation of synthesizing the background image frame sequence and the rendered image frame sequence of the target object frame by frame to obtain the target video may include: inputting the background image frame sequence and the rendered image frame sequence of the target object into a trained GAN (Generative Adversarial Network) model or a diffusion model, so that the GAN model or the diffusion model synthesizes the two sequences frame by frame to obtain the target video. The advantage of using the GAN model is the high stability of the finally generated target video; the advantage of using the diffusion model is the better visual quality of the finally generated target video.
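The embodiment performs this synthesis with a trained GAN or diffusion model; as a point of reference, the frame-by-frame compositing those models refine can be sketched as a naive per-pixel mask blend (a NumPy baseline only, not the learned synthesis of the embodiment, and the mask input is an assumption):

```python
import numpy as np

def composite_frames(bg_frames, rendered_frames, masks):
    """Blend rendered target frames over background frames with per-pixel masks."""
    # bg_frames, rendered_frames: (T, H, W, 3) floats; masks: (T, H, W) in [0, 1]
    alpha = masks[..., None]
    return alpha * rendered_frames + (1.0 - alpha) * bg_frames
```

A learned model additionally harmonizes lighting, shadows and boundaries, which this simple blend cannot do.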
According to the embodiment of the application, the background image frame sequence and the rendering image frame sequence of the target object are synthesized frame by means of the model, so that the target video is obtained, user operation can be saved, and the replacement efficiency of video content is improved.
According to the technical scheme provided by the embodiment of the present application, the target area is located in each video frame of the original video and erased, yielding a background image frame sequence in which the object to be replaced is no longer displayed; the original video is analyzed frame by frame to obtain the pose change sequence of the camera relative to the target area during the shooting of the original video; and a rendered image of the target object is generated under each camera pose in the pose change sequence, yielding a rendered image frame sequence of the target object that conforms to the original shooting view angles. Therefore, by applying the technical scheme provided by the embodiment of the present application, a video whose content is replaced clearly, realistically and naturally can be obtained without any user operation, so that video content is replaced at zero labor cost and the efficiency of video processing is effectively ensured.
Fig. 3 is a flowchart of another embodiment of the video content replacement method provided in an embodiment of the present application. The flow shown in fig. 3 describes, on the basis of the flow shown in fig. 1, how to generate a rendered image of the target object under each camera pose in the pose change sequence to obtain the rendered image frame sequence of the target object. As shown in fig. 3, the method comprises the following steps:
Step 301, acquiring a plurality of imaging pictures of a target object at different shooting angles.
In an embodiment, the image capturing device may be controlled to move along a set trajectory, for example, along a circular trajectory centered on the target object, and to capture a picture of the target object (referred to herein as an imaging picture for convenience of description) every time the image capturing device moves by a set angle, so as to obtain a plurality of imaging pictures of the target object at different capturing angles.
Further, the plurality of imaging pictures of the target object at different shooting angles may be stored in a picture library, so that these imaging pictures can be reused.
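The circular capture trajectory described in step 301 can be sketched as follows (an illustrative NumPy fragment with hypothetical names; the camera is kept at the target's height and one position is generated per set angle):

```python
import numpy as np

def circular_capture_positions(center, radius, step_deg):
    """Camera positions on a circle around the target, one per step_deg of rotation."""
    angles = np.deg2rad(np.arange(0.0, 360.0, step_deg))
    x = center[0] + radius * np.cos(angles)
    y = np.full_like(angles, center[1])   # keep the camera at the target's height
    z = center[2] + radius * np.sin(angles)
    return np.stack([x, y, z], axis=1)
```

For example, a set angle of 30 degrees yields 12 shooting positions evenly spaced around the target object.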
Step 302, constructing a reconstruction model of the target object by using a plurality of imaging pictures.
Step 303, inputting the pose change sequence into the reconstruction model to obtain a rendered image frame sequence of the target object corresponding to the pose change sequence.
In one embodiment, a specific implementation of constructing the reconstruction model of the target object using the plurality of imaging pictures includes: training an initial neural radiance field (NeRF) model with the plurality of imaging pictures to obtain the reconstruction model of the target object. That is, the embodiment of the present application employs a neural radiance field model to reconstruct the rendered image of the target object at a specified view angle. The neural radiance field model is an implicit 3D reconstruction method: a neural network learns a continuous representation of the 3D scene, the volume density and color of the target object at the specified view angle are queried from the radiance field, and the final rendered image is then obtained through volume rendering. This approach can reconstruct high-quality novel-view images from a sparse set of 2D images without explicitly reconstructing the 3D scene.
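The volume-rendering step that NeRF relies on can be written as the standard quadrature over the samples along one ray. The NumPy sketch below shows only this compositing formula, not the full trained model:

```python
import numpy as np

def volume_render_ray(sigmas, colors, deltas):
    """Composite per-sample densities and colors along one ray into a pixel color."""
    # sigmas: (S,) volume densities; colors: (S, 3); deltas: (S,) sample spacings
    alphas = 1.0 - np.exp(-sigmas * deltas)   # opacity contributed by each sample
    # transmittance T_i: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))[:-1]
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0), weights
```

A dense, opaque sample early on the ray dominates the result, which is how the rendered image shows the nearest visible surface of the target object.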
According to the technical scheme provided by the embodiment of the present application, the rendered image of the target object at the specified view angle is reconstructed using the neural radiance field model. Compared with other generation algorithms, this guarantees the detail definition of the target object and requires no professional to edit in 3D software, thereby saving user operations and improving the efficiency of video content replacement.
Fig. 4 is a system architecture diagram provided in an embodiment of the present application; in combination with the above embodiments, it gives an overall explanation of the video content replacement method provided in the embodiments of the present application.
The system architecture shown in fig. 4 comprises a positioning and segmentation module, a NeRF reconstruction module and a generation module. The positioning and segmentation module is used for detecting and segmenting an object A (the object to be replaced) in the video frames of the original video to obtain the background image frame sequence, and for estimating the pose of the camera relative to the target area during the shooting of the original video to obtain the pose change sequence (i.e., the camera pose sequence in fig. 4). The NeRF reconstruction module constructs a NeRF model of object B (the target object) based on the neural radiance field, so as to obtain, using the NeRF model, the rendered image frame sequence of object B corresponding to the pose change sequence. The generation module synthesizes the background image frame sequence and the rendered image frame sequence of object B using a generative model to obtain the target video, thereby completing the replacement of object A in the original video with object B.
Fig. 5 is a block diagram of an embodiment of a video content replacement apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:
a video acquisition module 51, configured to acquire an original video to be processed;
a positioning module 52, configured to position a target area in each video frame of the original video;
An erasing module 53, configured to erase the target area in each video frame to obtain a background image frame sequence;
the pose determining module 54 is configured to analyze the original video frame by frame, so as to obtain a pose change sequence of the camera relative to the target area in the shooting process of the original video;
a rendering module 55, configured to generate a rendered image of a target object under each camera pose in the pose change sequence, to obtain a rendered image frame sequence of the target object;
and a synthesis module 56, configured to synthesize the background image frame sequence and the rendered image frame sequence of the target object frame by frame, so as to obtain a target video, where each video frame of the target video contains the target object.
In a possible implementation, the rendering module 55 includes:
the imaging unit is used for acquiring a plurality of imaging pictures of the target object under different shooting angles;
a model construction unit for constructing a reconstruction model of the target object using the plurality of imaging pictures;
and the model processing unit is used for inputting the pose change sequence into the reconstruction model to obtain a rendered image frame sequence of the target object corresponding to the pose change sequence.
In a possible embodiment, the model building unit is specifically configured to:
and training the initial neural radiance field model by using the imaging pictures to obtain a reconstruction model of the target object.
In a possible embodiment, the positioning module 52 is specifically configured to:
identifying an object to be replaced from each video frame of the original video respectively;
and determining the area including the object to be replaced in each video frame as a target area.
In one possible implementation, the erasing module 53 includes:
a pixel value determining unit configured to determine a target pixel value for each of the video frames;
and a pixel resetting unit, configured to reset, for each video frame, a pixel value of each pixel in the target area in the video frame to the target pixel value, to obtain a background image frame.
In a possible implementation manner, the pixel value determining unit is specifically configured to:
acquiring a pixel value of each pixel in the video frame in other areas except the target area;
and determining a target pixel value according to the pixel value of each pixel in the other region.
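The pixel value determining unit and the pixel resetting unit can be sketched together as follows (an illustrative NumPy implementation; using the mean of the non-target pixels as the target pixel value is one of the possible choices, assumed here for concreteness):

```python
import numpy as np

def erase_target_region(frame, mask):
    """Reset pixels inside the target area to a value derived from the other regions."""
    # frame: (H, W, 3) floats; mask: (H, W) bool, True inside the target area
    target_value = frame[~mask].mean(axis=0)   # mean colour of the regions outside the target area
    out = frame.copy()
    out[mask] = target_value
    return out
```

Applying this function to every video frame yields the background image frame sequence in which the object to be replaced is no longer displayed.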
In one possible embodiment, the pose determining module includes:
a feature extraction unit, configured to extract a feature descriptor set of the target region from each video frame of the original video;
the feature matching unit is used for matching the feature descriptor set of each video frame in the original video to obtain a matching feature point set of each video frame;
a first pose determining unit, configured to determine a pose changing sequence of the camera in the world coordinate system based on the matching feature point set of each video frame;
the mass center determining unit is used for constructing a dense point cloud for the target area and obtaining the mass center position of the target area based on the dense point cloud;
an offset identification unit, configured to determine an offset of the target area relative to the origin of the world coordinate system according to the centroid position;
and the second pose determining unit is used for shifting the pose changing sequence of the camera in the world coordinate system according to the offset to obtain the pose changing sequence of the camera relative to the target area in the shooting process of the original video.
As shown in fig. 6, the embodiment of the present application provides an electronic device, which includes a processor 611, a communication interface 612, a memory 613 and a communication bus 614, wherein the processor 611, the communication interface 612 and the memory 613 communicate with one another through the communication bus 614,
A memory 613 for storing a computer program;
in one embodiment of the present application, the processor 611 is configured to implement the method for replacing video content provided in any one of the foregoing method embodiments when executing the program stored in the memory 613, where the method includes:
acquiring an original video to be processed;
positioning a target area in each video frame of the original video respectively, and erasing the target area in each video frame to obtain a background image frame sequence;
analyzing the original video frame by frame to obtain a pose change sequence of a camera relative to the target area in the shooting process of the original video;
generating a rendering image of a target object under each camera pose in the pose change sequence to obtain a rendering image frame sequence of the target object;
and synthesizing the background image frame sequence and the rendering image frame sequence of the target object frame by frame to obtain a target video, wherein each video frame of the target video contains the target object.
The present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video content replacement method provided by any of the method embodiments described above.
The apparatus embodiments described above are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general-purpose hardware platform, or by hardware. Based on this understanding, the part of the foregoing technical solution that in essence contributes to the related art may be embodied in the form of a software product, which may be stored in a computer readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the respective embodiments or in some parts of the embodiments.
It is to be understood that the terminology used herein is for the purpose of describing particular example embodiments only, and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "includes," "including," and "having" are inclusive and therefore specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order described or illustrated, unless an order of performance is explicitly stated. It should also be appreciated that additional or alternative steps may be used.
The foregoing describes only specific embodiments of the invention, provided to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of replacing video content, the method comprising:
acquiring an original video to be processed;
positioning a target area in each video frame of the original video respectively, and erasing the target area in each video frame to obtain a background image frame sequence;
analyzing the original video frame by frame to obtain a pose change sequence of a camera relative to the target area in the shooting process of the original video;
generating a rendering image of a target object under each camera pose in the pose change sequence to obtain a rendering image frame sequence of the target object;
and synthesizing the background image frame sequence and the rendering image frame sequence of the target object frame by frame to obtain a target video, wherein each video frame of the target video contains the target object.
2. The method of claim 1, wherein generating a rendered image of the target object at each camera pose in the sequence of pose changes results in a sequence of rendered image frames of the target object, comprising:
acquiring a plurality of imaging pictures of a target object under different shooting angles;
Constructing a reconstruction model of the target object by using the plurality of imaging pictures;
and inputting the pose change sequence into the reconstruction model to obtain a rendered image frame sequence of the target object corresponding to the pose change sequence.
3. The method of claim 2, wherein constructing a reconstruction model of the target object using the plurality of imaging pictures comprises:
and training the initial neural radiance field model by using the imaging pictures to obtain a reconstruction model of the target object.
4. The method of claim 1, wherein said locating the target region in each video frame of the original video, respectively, comprises:
identifying an object to be replaced from each video frame of the original video respectively;
and determining the area including the object to be replaced in each video frame as a target area.
5. The method of claim 1, wherein said erasing said target region in each of said video frames results in a sequence of background image frames, comprising:
the following is performed for each of the video frames:
determining a target pixel value;
And resetting the pixel value of each pixel in the target area in the video frame to be the target pixel value to obtain a background image frame.
6. The method of claim 5, wherein determining the target pixel value comprises:
acquiring a pixel value of each pixel in the video frame in other areas except the target area;
and determining a target pixel value according to the pixel value of each pixel in the other region.
7. The method according to claim 1, wherein the analyzing the original video frame by frame to obtain a pose change sequence of a camera relative to the target area during the capturing process of the original video includes:
extracting a feature descriptor set of the target region from each video frame of the original video;
matching the feature descriptor set of each video frame in the original video to obtain a matching feature point set of each video frame;
determining a pose change sequence of the camera in a world coordinate system based on the matched feature point set of each video frame;
constructing a dense point cloud for the target area, and obtaining the centroid position of the target area based on the dense point cloud;
Determining an offset of the target area relative to the origin of the world coordinate system according to the centroid position;
and shifting the pose change sequence of the camera in the world coordinate system according to the offset to obtain the pose change sequence of the camera relative to the target area in the shooting process of the original video.
8. A video content replacement apparatus, the apparatus comprising:
the video acquisition module is used for acquiring an original video to be processed;
the positioning module is used for positioning the target area in each video frame of the original video respectively;
the erasing module is used for erasing the target area in each video frame to obtain a background image frame sequence;
the pose determining module is used for analyzing the original video frame by frame to obtain a pose change sequence of the camera relative to the target area in the shooting process of the original video;
the rendering module is used for generating a rendering image of the target object under each camera pose in the pose change sequence to obtain a rendering image frame sequence of the target object;
and the synthesis module is used for synthesizing the background image frame sequence and the rendering image frame sequence of the target object frame by frame to obtain a target video, wherein each video frame of the target video contains the target object.
9. An electronic device, comprising: a processor and a memory, the processor being configured to execute a replacement program for video content stored in the memory to implement the video content replacement method of any one of claims 1 to 7.
10. A storage medium storing one or more programs executable by one or more processors to implement the method of replacing video content of any of claims 1-7.
CN202311235918.XA 2023-09-22 2023-09-22 Video content replacement method and device, electronic equipment and storage medium Pending CN117278800A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311235918.XA CN117278800A (en) 2023-09-22 2023-09-22 Video content replacement method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN117278800A true CN117278800A (en) 2023-12-22

Family

ID=89213824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311235918.XA Pending CN117278800A (en) 2023-09-22 2023-09-22 Video content replacement method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117278800A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination