CN111862275A - Video editing method, device and equipment based on 3D reconstruction technology

Video editing method, device and equipment based on 3D reconstruction technology

Info

Publication number: CN111862275A
Application number: CN202010725481.8A
Authority: CN (China)
Prior art keywords: video, model, edited, frame, image
Other languages: Chinese (zh)
Other versions: CN111862275B (granted)
Inventors: 吴善思源, 龚秋棠, 吴方灿, 林奇
Current Assignee: Xiamen Zhenjing Technology Co ltd
Original Assignee: Xiamen Zhenjing Technology Co ltd
Application filed 2020-07-24 by Xiamen Zhenjing Technology Co ltd
Priority to CN202010725481.8A
Legal status: Granted; Active

Classifications

    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06N 3/08 - Computing arrangements based on biological models; neural networks; learning methods
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 19/20 - Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G06V 20/40 - Scenes; scene-specific elements in video content
    • G06T 2219/2016 - Indexing scheme for editing of 3D models: rotation, translation, scaling
    • G06V 2201/07 - Indexing scheme relating to image or video recognition or understanding: target detection

Abstract

The invention discloses a video editing method based on 3D reconstruction technology, which comprises the following steps: acquiring a video to be edited; detecting identifiable objects in each frame of the video to be edited; reconstructing a first 3D model corresponding to each of the objects using a neural network; selecting the current frame of the object in the video to be edited, editing the selected object, applying the edited content to the first 3D model, and generating a second 3D model; and performing real-time pose estimation of the object in each frame image based on the second 3D model, driving the second 3D model according to the estimated pose to generate a replacement image, and rendering the replacement image onto all frames of the same object in the video to be edited. With the scheme provided by the invention, an object edited in a single frame of a video is automatically updated on the same object across all video frames, which improves the user's video-editing efficiency and experience.

Description

Video editing method, device and equipment based on 3D reconstruction technology
Technical Field
The present invention relates to the field of video processing technologies, and in particular to a video editing method, apparatus and device based on 3D reconstruction technology.
Background
With the development of 5G and short-video applications, users are gradually moving from editing pictures to editing videos. Video editing software at the present stage mostly operates on the whole video timeline, for example deleting useless segments or adding music. If a user wants to edit a particular object in a video, such as changing the color of a piece of furniture or modifying the pattern on a person's clothes, the object has to be modified frame by frame; a 5-minute video means roughly 7200 frame images to edit, an enormous workload. There is no way to edit an object once and have the edit synchronized to the subsequent video frames, so the user's video-editing experience is poor.
Disclosure of Invention
In view of this, the present invention provides a video editing method, apparatus and device based on 3D reconstruction technology, in which an object edited in a single frame of a video is automatically updated on the same object across all video frames, improving the user's video-editing efficiency and experience.
In order to achieve the above object, the present invention provides a video editing method based on 3D reconstruction technology, the method comprising:
acquiring a video to be edited;
detecting identifiable objects in each frame of the video to be edited;
reconstructing a first 3D model corresponding to each of the objects using a neural network;
selecting the current frame of the object in the video to be edited, editing the selected object, applying the edited content to the first 3D model, and generating a second 3D model;
and performing real-time pose estimation of the object in each frame image based on the second 3D model, driving the second 3D model according to the estimated pose to generate a replacement image, and rendering the replacement image onto all frames of the same object in the video to be edited.
Preferably, detecting identifiable objects in each frame of the video to be edited includes:
detecting identifiable objects in each frame of the video to be edited using a generic object detection technique.
Preferably, reconstructing the first 3D model corresponding to each of the objects using a neural network includes:
reconstructing, with an auto-encoder, the first 3D model corresponding to each object as a composition of the object's voxels.
Preferably, performing real-time pose estimation of the object in each frame image based on the second 3D model, driving the second 3D model according to the estimated pose to generate a replacement image, and rendering the replacement image onto all frames of the same object in the video to be edited includes:
cropping the object out of each frame image according to its coordinates, and inputting it into the second 3D model;
outputting the coordinates of the object in each frame image and the three-dimensional pose parameters of the object;
and driving the second 3D model, according to the coordinates and the three-dimensional pose parameters, to rotate and translate to the position where the object appears in each corresponding frame image, projecting the edited content onto all frames of the same object, and replacing the pixels in those frames to complete the rendering.
In order to achieve the above object, the present invention further provides a video editing apparatus based on 3D reconstruction technology, the apparatus comprising:
an acquisition unit for acquiring a video to be edited;
a detection unit for detecting identifiable objects in each frame of the video to be edited;
a reconstruction unit for reconstructing a first 3D model corresponding to each of the objects using a neural network;
an editing unit for selecting the current frame of the object in the video to be edited, editing the selected object, applying the edited content to the first 3D model, and generating a second 3D model;
and a rendering unit for performing real-time pose estimation of the object in each frame image based on the second 3D model, driving the second 3D model according to the estimated pose to generate a replacement image, and rendering the replacement image onto all frames of the same object in the video to be edited.
Preferably, the detection unit is further configured to:
detect identifiable objects in each frame of the video to be edited using a generic object detection technique.
Preferably, the editing unit is further configured to:
reconstruct, with an auto-encoder, the first 3D model corresponding to each object as a composition of the object's voxels.
Preferably, the rendering unit further includes:
an input unit for cropping the object out of each frame image according to its coordinates and inputting it into the second 3D model;
an output unit for outputting the coordinates of the object in each frame image and the three-dimensional pose parameters of the object;
and a driving unit for driving the second 3D model, according to the coordinates and the three-dimensional pose parameters, to rotate and translate to the position where the object appears in each corresponding frame image, projecting the edited content onto all frames of the same object, and replacing the pixels in those frames to complete the rendering.
In order to achieve the above object, the present invention further proposes a video editing device based on 3D reconstruction technology, comprising a processor, a memory, and a computer program stored in the memory, wherein the computer program, when executed by the processor, implements the video editing method based on 3D reconstruction technology according to any one of the above.
In order to achieve the above object, the present invention further provides a computer-readable storage medium comprising a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium resides is controlled to perform the video editing method based on 3D reconstruction technology according to any one of the above.
It can be seen from the above scheme that a video to be edited is acquired; identifiable objects are detected in each frame of the video; a first 3D model corresponding to each object is reconstructed using a neural network; the current frame of the object is selected, the selected object is edited, and the edited content is applied to the first 3D model to generate a second 3D model; real-time pose estimation of the object is then performed for each frame image based on the second 3D model, the second 3D model is driven according to the estimated pose to generate a replacement image, and the replacement image is rendered onto all frames of the same object in the video. An object edited in a single frame of the video is thus automatically updated on the same object across all video frames, which improves the user's video-editing efficiency and experience.
Furthermore, the above scheme detects the identifiable objects in each frame of the video to be edited using a generic object detection technique, which has the advantage of accurately recognizing multiple objects in the video across many object categories.
Furthermore, the above scheme uses an auto-encoder to reconstruct the first 3D model corresponding to each object as a composition of the object's voxels, so that an object edited on a single frame is automatically updated across the whole video, removing the need for frame-by-frame editing.
Furthermore, the above scheme crops the object out of each frame image according to its coordinates and inputs it into the second 3D model; outputs the coordinates of the object in each frame image and its three-dimensional pose parameters; and drives the second 3D model, according to the coordinates and pose parameters, to rotate and translate to the position where the object appears in each corresponding frame image, projecting the edited content onto all frames of the same object and replacing the pixels in those frames to complete the rendering. An object edited in a single frame of the video is thus automatically updated across all video frames, improving the user's video-editing efficiency and experience.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a video editing method based on 3D reconstruction technology according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a video editing apparatus based on 3D reconstruction technology according to another embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments. Note that the following embodiments are only illustrative of the present invention and do not limit its scope. Likewise, the following embodiments are only some, not all, embodiments of the present invention; all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.
The invention provides a video editing method based on 3D reconstruction technology, in which an object edited in a single frame of a video is automatically updated on the same object across all video frames, improving the user's video-editing efficiency and experience.
Fig. 1 is a schematic flowchart of the video editing method based on 3D reconstruction technology according to an embodiment of the present invention. The method comprises the following steps:
S1, acquiring the video to be edited.
S2, detecting identifiable objects in each frame of the video to be edited.
Here, detecting identifiable objects in each frame of the video to be edited comprises: detecting identifiable objects in each frame of the video to be edited using a generic object detection technique.
In this embodiment, the entire video is traversed and the identifiable objects appearing in it are found using a generic object detection technique; these objects include items, people, animals and the like that the user can select and edit.
In generic object detection, a neural network model is trained on a large amount of labeled data; given an image, it can then detect the objects the image contains, for example cats, dogs, people, beds and quilts, together with the bounding boxes where those objects are located in the image.
Since a video is essentially a sequence of frame images (one second of video generally contains 30 frames), detecting identifiable objects in a video means feeding each frame image into the generic object detection network, which returns the objects contained in that frame. All per-frame detection results are then aggregated, the n objects (for example, 5) with the highest occurrence frequency are selected as the video-level detection result, and the positions of these objects in the video are recorded.
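By way of illustration only, the following Python sketch shows one way this per-frame detection and aggregation loop could look. The detect_objects function is a hypothetical stand-in for whatever generic object detector is used; only the shape of its output (a label and a bounding box per detection) is assumed here, and any off-the-shelf detector producing such pairs could be substituted without changing the aggregation logic.

```python
from collections import Counter

import cv2  # OpenCV, used here only to decode the video frame by frame


def detect_objects(frame):
    """Hypothetical generic object detector: returns a list of
    (label, bounding_box) pairs for one frame image."""
    raise NotImplementedError


def detect_video_objects(video_path, top_n=5):
    """Run detection on every frame, then keep the top_n labels by
    occurrence frequency and record where each one appears."""
    counts = Counter()
    positions = {}  # label -> list of (frame_index, bounding_box)

    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for label, box in detect_objects(frame):
            counts[label] += 1
            positions.setdefault(label, []).append((frame_idx, box))
        frame_idx += 1
    cap.release()

    # Aggregate all per-frame results: the most frequent labels are
    # taken as the video-level detection result.
    keep = [label for label, _ in counts.most_common(top_n)]
    return {label: positions[label] for label in keep}
```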
S3, reconstructing a first 3D model corresponding to each of the objects using the neural network.
In this embodiment, depending on the time and accuracy requirements of the actual application scenario, when the user selects an object, a neural network reconstructs the corresponding 3D model from either a single frame or multiple frames. When the time requirement is strict, a single frame can be used; when the accuracy requirement is strict, multiple frames can be used.
Here, reconstructing the first 3D model corresponding to each of the objects using a neural network comprises: reconstructing, with an auto-encoder, the first 3D model corresponding to each object as a composition of the object's voxels.
Specifically, a self-encoding network (auto-encoder) takes an image as input and outputs the reconstructed 3D model composed of the object's voxels. The input image may be an object detected by the generic object detection step, cropped out of the frame according to the detected position.
Furthermore, to trade off time against accuracy, 3D reconstruction offers two modes. The first is fast: only one image is input. The second is high-precision: n frame images from the video (for example, 5) are each passed through the same network as in the first mode, producing n 3D models whose voxel values are averaged position by position to obtain the final high-precision 3D model.
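As a concrete illustration of the kind of image-to-voxel auto-encoder and the two reconstruction modes described above, here is a minimal PyTorch sketch. The layer sizes, the 64x64 input crop and the 32x32x32 voxel grid are assumptions chosen for the example, not values given by the patent.

```python
import torch
import torch.nn as nn


class ImageToVoxelAE(nn.Module):
    """Image-to-voxel auto-encoder sketch: a 2D convolutional encoder
    compresses the cropped object image to a latent code, and a 3D
    transposed-convolutional decoder expands the code into a voxel
    occupancy grid. All sizes here are illustrative."""

    def __init__(self, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 4 * 4 * 4),
            nn.Unflatten(1, (128, 4, 4, 4)),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1),  # 4 -> 8
            nn.ReLU(),
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1),   # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),    # 16 -> 32
            nn.Sigmoid(),  # per-voxel occupancy in [0, 1]
        )

    def forward(self, image):  # image: (B, 3, 64, 64)
        return self.decoder(self.encoder(image))  # (B, 1, 32, 32, 32)


def reconstruct(model, crops):
    """Fast mode: one crop. High-precision mode: n crops of the same
    object, reconstructed independently and averaged voxel by voxel."""
    with torch.no_grad():
        voxel_grids = [model(c.unsqueeze(0)) for c in crops]
    return torch.mean(torch.stack(voxel_grids), dim=0)
```

In the fast mode, crops contains a single image; in the high-precision mode it contains n crops of the same object taken from different frames.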
S4, selecting the current frame of the object in the video to be edited, editing the selected object, applying the edited content to the first 3D model, and generating a second 3D model.
In this embodiment, when the user edits the object in the image, for example changing its color or shape, the change is recorded on the reconstructed 3D model, yielding the modified 3D model (the second 3D model).
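One plausible way to record such an edit is sketched below, assuming the voxel grid carries a per-voxel RGB attribute alongside occupancy; that attribute, and the names used here, are illustrative assumptions rather than something the patent specifies.

```python
import torch


def apply_color_edit(voxel_colors, mask, new_rgb):
    """Record a user's recoloring edit on the reconstructed model.

    voxel_colors: (3, D, H, W) tensor, RGB attribute per voxel.
    mask: boolean (D, H, W) tensor selecting the voxels the user edited.
    new_rgb: three floats in [0, 1].
    Returns the color attribute of the second (edited) 3D model."""
    edited = voxel_colors.clone()
    for channel, value in enumerate(new_rgb):
        edited[channel][mask] = value
    return edited
```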
And S5, performing real-time pose estimation of the object in each frame image based on the second 3D model, driving the second 3D model according to the estimated pose to generate a replacement image, and rendering the replacement image onto all frames of the same object in the video to be edited.
Specifically, this step includes:
S5-1, cropping the object out of each frame image according to its coordinates, and inputting it into the second 3D model;
S5-2, outputting the coordinates of the object in each frame image and the three-dimensional pose parameters of the object;
and S5-3, driving the second 3D model, according to the coordinates and the three-dimensional pose parameters, to rotate and translate to the position where the object appears in each corresponding frame image, projecting the edited content onto all frames of the same object, and replacing the pixels in those frames to complete the rendering.
In this embodiment, a neural network model is trained for each object; its input is an image of the object, and its output is the coordinates (x, y) of the object's center in the image together with the object's three-dimensional pose (the three rotation angles yaw, pitch and roll).
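A minimal sketch of such a 5-parameter pose regressor follows; only the output convention (x, y, yaw, pitch, roll) comes from the description, while the backbone layers are illustrative assumptions.

```python
import torch.nn as nn


class PoseEstimator(nn.Module):
    """Per-object pose regressor sketch: from a cropped object image it
    predicts the 5 parameters named in the description, i.e. the centre
    coordinates (x, y) and the rotation angles (yaw, pitch, roll)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling: any crop size works
            nn.Flatten(),
        )
        self.head = nn.Linear(64, 5)  # -> (x, y, yaw, pitch, roll)

    def forward(self, crop):  # crop: (B, 3, H, W)
        return self.head(self.backbone(crop))
```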
According to the object selected by the user, the 3D model of the corresponding object is called up. For each frame of the video, the object is cropped out of the image using the bounding box and coordinates produced by the generic object detection step, the crop is fed to the pose-estimation network, and the network outputs the 5 pose parameters x, y, yaw, pitch and roll for subsequent use.
Using the 3D model and the 5 output pose parameters, the 3D model is driven to rotate and translate to the position where the object appears in the corresponding frame image; the user's edits to the 3D model are projected directly onto the 2-dimensional image, the pixel points in the frame image are replaced, and the rendering is finished.
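The driving-and-projection step might look like the following sketch, under stated assumptions: an orthographic projection, a Z-Y-X Euler-angle convention, and a point-cloud view of the edited model's occupied voxels. None of these specifics are fixed by the patent; any convention consistent with the pose network would serve.

```python
import numpy as np


def euler_to_matrix(yaw, pitch, roll):
    """Compose a rotation matrix from the three pose angles (radians),
    using a Z-Y-X intrinsic convention (an assumption of this sketch)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return rz @ ry @ rx


def render_edit(frame, voxel_points, voxel_rgb, pose, scale=1.0):
    """Drive the edited model into one frame: rotate the occupied voxel
    centres with the estimated pose, translate them to the detected
    centre (x, y), project orthographically onto the image plane, and
    overwrite the pixels they land on. A real renderer would also handle
    depth ordering and interpolation; this keeps only the core idea.

    frame: (H, W, 3) uint8 image; voxel_points: (N, 3) float centres;
    voxel_rgb: (N, 3) uint8 colors; pose: (x, y, yaw, pitch, roll)."""
    x, y, yaw, pitch, roll = pose
    rotated = voxel_points @ euler_to_matrix(yaw, pitch, roll).T
    cols = (rotated[:, 0] * scale + x).astype(int)
    rows = (rotated[:, 1] * scale + y).astype(int)
    h, w = frame.shape[:2]
    ok = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    frame[rows[ok], cols[ok]] = voxel_rgb[ok]  # replace the pixel points
    return frame
```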
For example, in a showcase video of a home environment, the user selects a quilt detected by the generic object detection step; the neural network reconstructs a 3D model of the quilt; the user changes the color of the quilt on the bed with a color picker; and once the edit is confirmed, the quilt's color is modified throughout the video.
As another example, in a piece of self-shot video, the people and clothes in the scene are detected by the generic object detection step; the user selects a person's clothes; the neural network reconstructs a 3D model of the clothes; the user edits the pattern of the clothes; and once the edit is confirmed, the clothes' pattern is modified throughout the video.
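Tying the steps together, the sketch below shows how the illustrative pieces above could compose into the edit-once, propagate-everywhere workflow of these examples. Every helper it calls (detect_video_objects, render_edit, the preprocess callback, and the pose model) is one of the hypothetical names defined in the earlier sketches, not a real API.

```python
import cv2


def edit_object_in_video(video_path, output_path, label, pose_model,
                         voxel_points, edited_rgb, preprocess):
    """Propagate a single edit through the whole video. voxel_points and
    edited_rgb describe the already-edited (second) 3D model; preprocess
    is a hypothetical callback turning a BGR crop into the pose model's
    input tensor."""
    by_frame = dict(detect_video_objects(video_path)[label])

    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    out = cv2.VideoWriter(output_path,
                          cv2.VideoWriter_fourcc(*"mp4v"), fps, size)

    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        box = by_frame.get(idx)
        if box is not None:
            x0, y0, x1, y1 = box
            crop = preprocess(frame[y0:y1, x0:x1])
            pose = pose_model(crop)[0].tolist()  # (x, y, yaw, pitch, roll)
            frame = render_edit(frame, voxel_points, edited_rgb, pose)
        out.write(frame)
        idx += 1
    cap.release()
    out.release()
```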
Fig. 2 is a schematic structural diagram of the video editing apparatus based on 3D reconstruction technology according to another embodiment of the present invention. The apparatus 10 comprises:
an acquisition unit 11, configured to acquire a video to be edited;
a detection unit 12, configured to detect identifiable objects in each frame of the video to be edited;
a reconstruction unit 13, configured to reconstruct a first 3D model corresponding to each of the objects using a neural network;
an editing unit 14, configured to select the current frame of the object in the video to be edited, edit the selected object, apply the edited content to the first 3D model, and generate a second 3D model;
and a rendering unit 15, configured to perform real-time pose estimation of the object in each frame image based on the second 3D model, drive the second 3D model according to the estimated pose to generate a replacement image, and render the replacement image onto all frames of the same object in the video to be edited.
Optionally, the detection unit 12 is further configured to:
detect identifiable objects in each frame of the video to be edited using a generic object detection technique.
Optionally, the editing unit 14 is further configured to:
reconstruct, with an auto-encoder, the first 3D model corresponding to each object as a composition of the object's voxels.
Optionally, the rendering unit 15 further includes:
an input unit (not labeled in the figure) for cropping the object out of each frame image according to its coordinates and inputting it into the second 3D model;
an output unit (not labeled in the figure) for outputting the coordinates of the object in each frame image and the three-dimensional pose parameters of the object;
and a driving unit (not labeled in the figure) for driving the second 3D model, according to the coordinates and the three-dimensional pose parameters, to rotate and translate to the position where the object appears in each corresponding frame image, projecting the edited content onto all frames of the same object, and replacing the pixels in those frames to complete the rendering.
The functions and operation steps implemented by each unit of the above video editing apparatus based on 3D reconstruction technology are substantially the same as those of the foregoing method embodiments and are not repeated here.
An embodiment of the present invention further provides a video editing device based on 3D reconstruction technology, which includes a processor, a memory, and a computer program stored in the memory, where the computer program is executable by the processor to implement the video editing method based on 3D reconstruction technology described in the foregoing embodiments.
An embodiment of the present invention further provides a computer-readable storage medium, which includes a stored computer program; when the computer program runs, the device on which the computer-readable storage medium resides is controlled to execute the video editing method based on 3D reconstruction technology described in the foregoing embodiments.
Illustratively, the computer program may be divided into one or more units, which are stored in the memory and executed by the processor to carry out the present invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program in the video editing device based on 3D reconstruction technology.
The video editing device based on 3D reconstruction technology may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the schematic diagram is merely an example of a video editing device based on 3D reconstruction technology and does not constitute a limitation of it; the device may include more or fewer components than those shown, combine certain components, or use different components. For example, the video editing device based on 3D reconstruction technology may further include input and output devices, network access devices, buses, and the like.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the video editing device based on 3D reconstruction technology, connecting the various parts of the whole device by means of various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the video editing device based on 3D reconstruction technology by running or executing the computer programs and/or modules stored in the memory and calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the device (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
If the units integrated in the video editing device based on 3D reconstruction technology are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods of the above embodiments may also be accomplished by a computer program: the computer program may be stored in a computer-readable storage medium, and when executed by a processor, it implements the steps of the method embodiments above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. Note that what a computer-readable medium may contain can be increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media exclude electrical carrier signals and telecommunications signals.
It should be noted that the above device embodiments are merely illustrative: units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the device embodiments provided by the present invention, the connection relationships between modules indicate communication connections between them, which may be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.
The above embodiments may be further combined or substituted. They only describe preferred embodiments of the present invention and do not limit its concept and scope; various changes and modifications made to the technical solution of the present invention by those skilled in the art without departing from its design concept fall within the protection scope of the present invention.

Claims (10)

1. A video editing method based on 3D reconstruction technology, the method comprising:
acquiring a video to be edited;
detecting identifiable objects in each frame of the video to be edited;
reconstructing a first 3D model corresponding to each of the objects using a neural network;
selecting the current frame of the object in the video to be edited, editing the selected object, applying the edited content to the first 3D model, and generating a second 3D model;
and performing real-time pose estimation of the object in each frame image based on the second 3D model, driving the second 3D model according to the estimated pose to generate a replacement image, and rendering the replacement image onto all frames of the same object in the video to be edited.
2. The video editing method based on 3D reconstruction technology according to claim 1, wherein detecting identifiable objects in each frame of the video to be edited includes:
detecting identifiable objects in each frame of the video to be edited using a generic object detection technique.
3. The video editing method based on 3D reconstruction technology according to claim 1, wherein reconstructing the first 3D model corresponding to each of the objects using a neural network comprises:
reconstructing, with an auto-encoder, the first 3D model corresponding to each object as a composition of the object's voxels.
4. The video editing method based on 3D reconstruction technology according to claim 1, wherein performing real-time pose estimation of the object in each frame image based on the second 3D model, driving the second 3D model according to the estimated pose to generate a replacement image, and rendering the replacement image onto all frames of the same object in the video to be edited includes:
cropping the object out of each frame image according to its coordinates, and inputting it into the second 3D model;
outputting the coordinates of the object in each frame image and the three-dimensional pose parameters of the object;
and driving the second 3D model, according to the coordinates and the three-dimensional pose parameters, to rotate and translate to the position where the object appears in each corresponding frame image, projecting the edited content onto all frames of the same object, and replacing the pixels in those frames to complete the rendering.
5. A video editing apparatus based on 3D reconstruction technology, the apparatus comprising:
an acquisition unit for acquiring a video to be edited;
a detection unit for detecting identifiable objects in each frame of the video to be edited;
a reconstruction unit for reconstructing a first 3D model corresponding to each of the objects using a neural network;
an editing unit for selecting the current frame of the object in the video to be edited, editing the selected object, applying the edited content to the first 3D model, and generating a second 3D model;
and a rendering unit for performing real-time pose estimation of the object in each frame image based on the second 3D model, driving the second 3D model according to the estimated pose to generate a replacement image, and rendering the replacement image onto all frames of the same object in the video to be edited.
6. The video editing apparatus based on 3D reconstruction technology according to claim 5, wherein the detection unit is further configured to:
detect identifiable objects in each frame of the video to be edited using a generic object detection technique.
7. The video editing apparatus based on 3D reconstruction technology according to claim 5, wherein the editing unit is further configured to:
reconstruct, with an auto-encoder, the first 3D model corresponding to each object as a composition of the object's voxels.
8. The video editing apparatus based on 3D reconstruction technology according to claim 5, wherein the rendering unit further includes:
an input unit for cropping the object out of each frame image according to its coordinates and inputting it into the second 3D model;
an output unit for outputting the coordinates of the object in each frame image and the three-dimensional pose parameters of the object;
and a driving unit for driving the second 3D model, according to the coordinates and the three-dimensional pose parameters, to rotate and translate to the position where the object appears in each corresponding frame image, projecting the edited content onto all frames of the same object, and replacing the pixels in those frames to complete the rendering.
9. A video editing device based on 3D reconstruction technology, comprising a processor, a memory, and a computer program stored in the memory, the computer program being executable by the processor to implement the video editing method based on 3D reconstruction technology according to any one of claims 1 to 4.
10. A computer-readable storage medium, comprising a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium resides is controlled to perform the video editing method based on 3D reconstruction technology according to any one of claims 1 to 4.
Application CN202010725481.8A, priority date 2020-07-24, filing date 2020-07-24: Video editing method, device and equipment based on 3D reconstruction technology. Status: Active. Granted as CN111862275B.

Priority Applications (1)

Application Number: CN202010725481.8A
Priority Date / Filing Date: 2020-07-24
Title: Video editing method, device and equipment based on 3D reconstruction technology

Publications (2)

CN111862275A, published 2020-10-30 (application)
CN111862275B, published 2023-06-06 (grant)

Family

ID: 72950754

Family Applications (1)

CN202010725481.8A (granted as CN111862275B): Video editing method, device and equipment based on 3D reconstruction technology; priority date 2020-07-24, filing date 2020-07-24

Country Status (1)

CN: CN111862275B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party

US9736449B1 * (priority 2013-08-12, published 2017-08-15), Google Inc.: Conversion of 2D image to 3D video
CN106254941A * (priority 2016-10-10, published 2016-12-21), 乐视控股(北京)有限公司: Video processing method and device
CN107067429A * (priority 2017-03-17, published 2017-08-18), 徐迪: Video editing system and method for deep-learning-based face three-dimensional reconstruction and face replacement
CN108765529A * (priority 2018-05-04, published 2018-11-06), 北京比特智学科技有限公司: Video generation method and device
CN110475157A * (priority 2019-07-19, published 2019-11-19), 平安科技(深圳)有限公司: Multimedia information display method and device, computer equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party

CN112270736A * (published 2021-01-26) / CN112270736B (granted 2024-03-01), Oppo广东移动通信有限公司: Augmented reality processing method and device, storage medium and electronic equipment
CN112767534A * (published 2021-05-07) / CN112767534B (granted 2024-02-09), 北京达佳互联信息技术有限公司: Video image processing method and device, electronic equipment and storage medium
CN113518187A * (published 2021-10-19) / CN113518187B (granted 2024-01-09), 北京达佳互联信息技术有限公司: Video editing method and device

Also Published As

CN111862275B, published 2023-06-06


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant