CN112995757A - Video clipping method and device - Google Patents

Video clipping method and device

Info

Publication number
CN112995757A
Authority
CN
China
Prior art keywords
video frame
frame
video
target
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110497779.2A
Other languages
Chinese (zh)
Other versions
CN112995757B (en)
Inventor
张韵璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110497779.2A priority Critical patent/CN112995757B/en
Publication of CN112995757A publication Critical patent/CN112995757A/en
Application granted granted Critical
Publication of CN112995757B publication Critical patent/CN112995757B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a video clipping method and device, relating to the field of computer technology. The video cropping method provided by the embodiment of the application performs scene division on the video frames contained in a video to be cropped to obtain at least one frame set, determines a frame set to be cropped, determines a key video frame contained in the frame set to be cropped, determines at least one target object corresponding to the key video frame, determines the tracking video frames corresponding to the key video frame according to the at least one target object, crops the key video frame and each tracking video frame based on the at least one target object to obtain a cropped target frame set, and generates the cropped target video from the target frame set. Compared with the manual video clipping used in the related art, the method provided by the embodiment of the application completes the clipping of a video automatically, and can therefore improve the efficiency of video clipping.

Description

Video clipping method and device
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a video clipping method and device.
Background
With the rapid development of internet multimedia technology, original video content can no longer satisfy people's demand for consuming fragmented entertainment time; people tend to watch attractive video content that is rich in information. Therefore, original video content needs to be cropped, so that the most attractive and information-rich content in the video is retained.
Traditional video cropping requires professional editors and editing software to operate manually. However, completing the cropping of a video manually consumes a great deal of editing time and labor cost, and the efficiency of cropping the video is low.
Disclosure of Invention
In order to solve technical problems in the related art, embodiments of the present application provide a video clipping method and apparatus, which can improve the efficiency of clipping a video.
In order to achieve the above purpose, the technical solution of the embodiment of the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a video cropping method, where the method includes:
obtaining a frame set to be cut, wherein the frame set to be cut is obtained after each video frame contained in a video to be cut is subjected to scene division;
determining key video frames contained in the frame set to be cropped, and determining at least one target object based on the key video frames;
determining, based on the at least one target object, respective tracking video frames corresponding to the key video frames in the set of frames to be cropped;
based on the at least one target object, respectively clipping the key video frames and the tracking video frames to obtain a target frame set consisting of the clipped target video frames; the target video frame comprises the at least one target object;
and generating the clipped target video based on the target frame set.
In an optional embodiment, the cropping the key video frames and the tracking video frames according to the lens tracking trajectory respectively includes:
and respectively clipping the key video frame and each tracking video frame based on the lens tracking position in the key video frame and each tracking video frame according to a preset clipping proportion.
In an alternative embodiment, the determining at least one target object based on the key video frames comprises:
performing target detection on the key video frame to obtain at least one candidate object, performing feature extraction on the key video frame, and determining a feature image corresponding to the key video frame;
aligning the at least one candidate object with the feature image to obtain sub-feature images corresponding to the at least one candidate object;
and selecting at least one target object corresponding to the key video frame from the at least one candidate object according to the sub-feature image corresponding to the at least one candidate object.
In an optional embodiment, the determining, based on the at least one target object, each tracking video frame corresponding to the key video frame in the to-be-cropped frame set includes:
based on the at least one target object, in the set of frames to be cropped, for each video frame located after the key video frame, performing the following operations respectively:
tracking the at least one target object in one video frame of all the video frames, and if the at least one target object is tracked in the one video frame, taking the one video frame as a tracking video frame corresponding to the key video frame;
and determining each tracking video frame corresponding to the key video frame.
In an alternative embodiment, the tracking the at least one target object within one of the video frames includes:
performing target detection on one video frame in each video frame to obtain at least one candidate object;
matching the at least one target object with the at least one candidate object respectively;
and if the at least one candidate object is determined to comprise the at least one target object according to the matching result, determining that the at least one target object is tracked in the video frame.
In an optional embodiment, the method further comprises:
and if the at least one target object cannot be tracked in the video frame and the at least one target object is tracked in a video frame before the video frame, taking the video frame as a key video frame.
In a second aspect, an embodiment of the present application further provides a video cropping device, where the device includes:
the device comprises a frame set acquisition unit, a frame set generation unit and a frame set generation unit, wherein the frame set acquisition unit is used for acquiring a frame set to be cut, and the frame set to be cut is acquired after each video frame contained in a video to be cut is subjected to scene division;
the target object determining unit is used for determining key video frames contained in the frame set to be cut and determining at least one target object based on the key video frames;
a target object tracking unit, configured to determine, based on the at least one target object, each tracking video frame corresponding to the key video frame in the to-be-cropped frame set;
a video frame clipping unit, configured to clip the key video frame and each of the tracking video frames based on the at least one target object, respectively, and obtain a target frame set composed of clipped target video frames; the target video frame comprises the at least one target object;
and the target video generating unit is used for generating the clipped target video based on the target frame set.
In an optional embodiment, the frame set obtaining unit is specifically configured to:
inputting each video frame contained in the video to be cut into a trained scene boundary detection model to obtain at least one scene boundary frame, wherein each scene boundary frame is a video frame of which the similarity with the adjacent next video frame is smaller than a set threshold value;
for the at least one scene boundary frame, respectively performing the following operations: attributing a scene boundary frame in the at least one scene boundary frame and all video frames between the adjacent previous scene boundary frame to the same frame set;
and determining a frame set to be clipped from the obtained at least one frame set.
In an alternative embodiment, the target object comprises a target subject and an adjacent subject; the target object determination unit is specifically configured to:
performing target detection on the key video frame to obtain a plurality of candidate objects, and determining the distance between each candidate object and the center position of the picture in the adjacent video frame; the adjacent video frame is a video frame adjacent to the key video frame;
and sorting the candidate objects in order of distance from nearest to farthest, taking the candidate object with the closest distance as the target subject corresponding to the key video frame, and selecting the first N candidate objects from the remaining candidate objects as the adjacent subjects corresponding to the key video frame.
In an optional embodiment, the target object determining unit is further configured to:
determining a subject probability value of each candidate object through the trained subject selection model; the main body probability value is used for representing the distance between the corresponding candidate object and the picture center position in the adjacent video frame; the main body selection model is obtained by training a sample image sequence marked with a candidate object; the labeled candidate object is provided with a main body label labeled according to the distance between the labeled candidate object and the center position of the picture in the sample image; the subject label is used for representing that the corresponding labeled candidate object is a target subject or an adjacent subject.
In an optional embodiment, the target object tracking unit is specifically configured to:
taking a first video frame in the frame set to be cropped as a key video frame;
detecting the target subject in each video frame after the key video frame one by one until a cut-off video frame not containing the target subject is detected, and taking each video frame between the key video frame and the cut-off video frame as the tracking video frames corresponding to the key video frame;
after determining each tracking video frame corresponding to the key video frame, the target object tracking unit is further configured to:
and taking the cut video frame as the next key video frame in the frame set to be cut, and returning to execute the steps of determining at least one target object based on the key video frame and the subsequent steps.
In an optional embodiment, the target object tracking unit is further configured to:
for each video frame in the video frames, respectively executing the following operations:
performing target detection on one video frame in each video frame to obtain at least one candidate object;
matching the target subject with the at least one candidate object respectively;
determining that the video frame includes the target subject if the target subject is included in the at least one candidate object.
In an optional embodiment, the target object tracking unit is further configured to:
respectively determining the distance between each candidate object except the target subject and the target subject in the video frame;
and taking the candidate object whose distance to the target subject is smaller than a set distance threshold value as an adjacent subject in the video frame.
In an alternative embodiment, the video frame cropping unit is specifically configured to:
for each video frame in the frame set to be cropped, respectively performing the following operations:
determining an image area containing each target object in the video frame based on the position of each target object in the video frame, and taking the central position of the image area as a lens tracking position of the video frame;
determining a lens tracking track of the frame set to be cut according to the key video frame and the lens tracking position in each tracking video frame;
and respectively clipping the key video frames and all the tracking video frames according to the lens tracking track.
In an alternative embodiment, the video frame cropping unit is further configured to:
performing sliding window filtering on the lens tracking track to obtain a corresponding smooth tracking track;
and clipping the key video frames and the tracking video frames respectively according to the smooth tracking track.
In an alternative embodiment, the video frame cropping unit is further configured to:
and respectively clipping the key video frame and each tracking video frame based on the lens tracking position in the key video frame and each tracking video frame according to a preset clipping proportion.
In an optional embodiment, the target object determining unit is further configured to:
performing target detection on the key video frame to obtain at least one candidate object, performing feature extraction on the key video frame, and determining a feature image corresponding to the key video frame;
aligning the at least one candidate object with the feature image to obtain sub-feature images corresponding to the at least one candidate object;
and selecting at least one target object corresponding to the key video frame from the at least one candidate object according to the sub-feature image corresponding to the at least one candidate object.
In an optional embodiment, the target object tracking unit is further configured to:
based on the at least one target object, in the set of frames to be cropped, for each video frame located after the key video frame, performing the following operations respectively:
tracking the at least one target object in one video frame of all the video frames, and if the at least one target object is tracked in the one video frame, taking the one video frame as a tracking video frame corresponding to the key video frame;
and determining each tracking video frame corresponding to the key video frame.
In an optional embodiment, the target object tracking unit is further configured to:
performing target detection on one video frame in each video frame to obtain at least one candidate object;
matching the at least one target object with the at least one candidate object respectively;
and if the at least one candidate object is determined to comprise the at least one target object according to the matching result, determining that the at least one target object is tracked in the video frame.
In an optional embodiment, the target object tracking unit is further configured to:
and if the at least one target object cannot be tracked in the video frame and the at least one target object is tracked in a video frame before the video frame, taking the video frame as a key video frame.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method for video cropping according to the first aspect is implemented.
In a fourth aspect, this embodiment of the present application further provides an electronic device, including a memory and a processor, where the memory stores a computer program operable on the processor, and when the computer program is executed by the processor, the processor is caused to implement the video cropping method of the first aspect.
The video clipping method and device provided by the embodiment of the application perform scene division on the video frames contained in a video to be clipped, determine a frame set to be clipped from the obtained at least one frame set, determine a key video frame contained in the frame set to be clipped, determine at least one target object based on the key video frame, determine, according to the at least one target object, the tracking video frames corresponding to the key video frame in the frame set to be clipped, clip the key video frame and each tracking video frame based on the at least one target object to obtain the clipped target frame set, and generate the clipped target video based on the target frame set. Compared with clipping a video manually, as in the related art, the method provided by the embodiment of the application completes the clipping of a video automatically, without manual participation, and can therefore improve the efficiency of clipping the video.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art from these drawings without inventive effort.
Fig. 1 is an application scene diagram of a video cropping method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a video cropping method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of another video cropping method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another video cropping method according to an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a target object corresponding to a key video frame according to an embodiment of the present disclosure;
FIG. 6 is a diagram illustrating cropped key video frames according to an embodiment of the present application;
fig. 7 is a schematic flowchart of a training method for a scene boundary detection model according to an embodiment of the present disclosure;
fig. 8 is a schematic flowchart of another video cropping method according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a video cropping device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of another electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that references in the specification of the present application to the terms "comprises" and "comprising," and variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
(1) Video frame: the minimum unit of a video, i.e., a static image; for example, when a video is played, freezing the picture at any moment yields a video frame.
(2) Shot (lens): a video is typically made up of more than one shot, each shot corresponding to a segment of the video stream and representing one video scene; within a shot, the video content typically changes continuously.
The word "exemplary" is used hereinafter to mean "serving as an example, embodiment, or illustration. Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The terms "first" and "second" are used herein for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature, and in the description of embodiments of the application, unless stated otherwise, "plurality" means two or more.
The embodiment of the application relates to a Blockchain (Blockchain) technology, wherein the Blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A blockchain is essentially a decentralized database, a string of blocks that are generated using cryptographic methods. Each block records a batch of test data of user behavior for verifying the validity (anti-counterfeiting) of the test data and generating the next block. Each block of the block chain comprises a hash value of the test data stored in the block (the hash value of the block) and a hash value of a previous block, and the blocks are connected through the hash values to form the block chain. Each block of the block chain may further include information such as a time stamp when the block is generated.
In the embodiment of the application, the video to be cut can be stored in the block chain in real time, the server acquires the video to be cut from the block chain, and then each video frame contained in the video to be cut is cut to obtain the cut target video. And the training data set containing the video frame samples can also be stored on the blockchain in real time, and the server acquires the training data set from the blockchain to train the scene boundary detection model to obtain the trained scene boundary detection model.
Embodiments of the present application also relate to Artificial Intelligence (AI) and Machine Learning (ML) techniques, which are designed based on Computer Vision (CV) techniques and Machine Learning in Artificial Intelligence.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. The artificial intelligence technology mainly comprises a computer vision technology, a voice processing technology, machine learning/deep learning and other directions.
With the research and progress of artificial intelligence technology, artificial intelligence is developed and researched in a plurality of fields, such as common smart home, image retrieval, video monitoring, video detection, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, intelligent medical treatment and the like.
Computer vision technology is an important application of artificial intelligence, which studies relevant theories and techniques in an attempt to build an artificial intelligence system capable of obtaining information from images, videos or multidimensional data to replace human visual interpretation. Typical computer vision techniques generally include image processing and video analysis. The embodiment of the application provides a video clipping method, belonging to a method for video analysis.
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning and inductive learning. In the video clipping process, the trained scene boundary detection model is obtained by training a scene boundary detection model based on machine learning or deep learning.
In order to better understand the technical solution provided by the embodiment of the present application, some brief descriptions are provided below for application scenarios to which the technical solution provided by the embodiment of the present application is applicable, and it should be noted that the application scenarios described below are only used for illustrating the embodiment of the present application and are not limited. In specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The video cropping method provided by the embodiment of the application can be applied to the application scene shown in fig. 1. Referring to fig. 1, the server 100 is communicatively connected to the terminal device 300 through a network 200, wherein the network 200 may be, but is not limited to, a local area network, a metropolitan area network, a wide area network, or the like, and the number of the terminal devices 300 connected to the server 100 may be plural. The terminal device 300 can transmit communication data and messages to and from the server 100 through the network 200.
The terminal 300 may be a portable device (e.g., a mobile phone, a tablet Computer, a notebook Computer, etc.), or may be a Computer, a smart screen, a Personal Computer (PC), etc. The server 100 may be a server or a server cluster or a cloud computing center composed of a plurality of servers, or a virtualization platform, and may also be a personal computer, a large and medium-sized computer, or a computer cluster, etc. According to implementation needs, the application scenario in the embodiment of the present application may have any number of terminal devices and servers. The embodiment of the present application is not particularly limited to this. The video cropping method provided by the embodiment of the application can be executed by the server 100, the terminal device 300, or the terminal device 300 and the server 100 cooperatively execute.
Illustratively, the terminal device 300 may record a video and upload the recorded video to the server 100, or may acquire a locally stored video and upload the video to the server 100. After receiving a video uploaded by a terminal device, the server 100 may first perform scene division on each video frame included in a video to be clipped, determine a frame set to be clipped from at least one obtained frame set, then determine a key video frame included in the frame set to be clipped, determine at least one target object based on the key video frame, determine each tracking video frame corresponding to the key video frame in the frame set to be clipped according to the at least one target object, respectively clip the key video frame and each tracking video frame based on the at least one target object, obtain a clipped target frame set, and finally generate a clipped target video based on the target frame set. After the server 100 clips the video to obtain the target video, the target video may be sent to the terminal device 300, and after receiving the target video, the terminal device 300 may display the target video to the relevant user.
To further illustrate the technical solutions provided by the embodiments of the present application, the following detailed description is made with reference to the accompanying drawings and specific embodiments. Although the embodiments of the present application provide the method operation steps shown in the following embodiments or figures, the method may include more or fewer steps based on conventional or non-inventive labor. For steps between which no necessary causal relationship logically exists, the order of execution is not limited to that provided by the embodiments of the present application. In an actual processing procedure, or when executed by a device, the method may be executed sequentially or in parallel according to the order shown in the embodiments or the figures.
Fig. 2 shows a flowchart of a video cropping method provided in an embodiment of the present application, which may be executed by the server 100 in fig. 1, or by a terminal device or other electronic devices. Illustratively, a specific implementation procedure of the video cropping method according to the embodiment of the present application is described below with a server for video cropping as an execution subject. The specific implementation process performed by other devices is similar to the process performed by the server alone, and is not described herein again.
As shown in fig. 2, the video cropping method includes the following steps:
in step S201, the server obtains a frame set to be clipped.
After the video to be cropped is acquired, each video frame included in the video to be cropped can be input into the trained scene boundary detection model, and at least one scene boundary frame can be obtained based on the scene boundary detection model. Each scene boundary frame is a video frame whose similarity with the adjacent next video frame is less than a set threshold.
Then, for at least one scene boundary frame, the following operations are respectively performed: and attributing one scene boundary frame in the at least one scene boundary frame and all video frames between the adjacent previous scene boundary frame to the same frame set. According to the operation, the scene division of each video frame contained in the video to be clipped can be completed, and at least one frame set is obtained.
After obtaining the at least one frame set, a set of frames to be cropped may be determined from the obtained at least one frame set. Wherein each frame set corresponds to a scene. And, the frame set to be clipped may be any one of the obtained at least one frame set.
For example, a video to be cropped includes 15 video frames. The 15 video frames may be input into a trained scene boundary detection model; if the scene boundary detection model outputs the 4th video frame, the 8th video frame and the 15th video frame as scene boundary frames, it can be determined that the 1st to 4th video frames belong to the same frame set, the 5th to 8th video frames belong to the same frame set, and the 9th to 15th video frames belong to the same frame set.
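As a minimal illustration of how scene boundary frames partition the video into frame sets, the following Python sketch groups 1-based frame indices by the boundary indices produced in the example above; the function name and the representation of frames as plain indices are illustrative assumptions, not part of the embodiment.

```python
from typing import List

def group_frames_into_sets(num_frames: int, boundary_indices: List[int]) -> List[List[int]]:
    """Group frame indices into frame sets (scenes), given the indices of
    scene boundary frames. Each boundary frame closes the current set.

    Mirrors the example above: 15 frames with boundary frames at indices
    4, 8 and 15 yield the sets [1..4], [5..8] and [9..15].
    """
    frame_sets = []
    current_set = []
    boundaries = set(boundary_indices)
    for frame_idx in range(1, num_frames + 1):
        current_set.append(frame_idx)
        if frame_idx in boundaries:          # a boundary frame ends the current scene
            frame_sets.append(current_set)
            current_set = []
    if current_set:                          # trailing frames with no closing boundary
        frame_sets.append(current_set)
    return frame_sets

print(group_frames_into_sets(15, [4, 8, 15]))
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15]]
```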
In step S202, the server determines key video frames included in the frame set to be cropped, and determines at least one target object based on the key video frames.
After the to-be-cropped frame set is obtained, the key video frames included in the to-be-cropped frame set may be determined, and the first video frame in the to-be-cropped frame set may be used as the key video frame.
In an embodiment, the target object may include a target subject and an adjacent subject, and after determining a key video frame included in the frame set to be cropped, target detection may be performed on the key video frame to obtain a plurality of candidate objects, and a distance between the center position of the picture and the candidate objects in the adjacent video frame is determined. And the adjacent video frame is a video frame adjacent to the key video frame.
When determining the distances between a plurality of candidate objects and the center position of the picture in the adjacent video frames, the body probability value of each candidate object can be determined through the trained body selection model. Wherein the subject probability value may be used to characterize a distance between the corresponding candidate object and a picture center position in the adjacent video frame. The subject selection model is obtained by training a sample image sequence of labeled candidate objects, wherein the labeled candidate objects are provided with subject labels labeled according to the distance between the labeled candidate objects and the center position of the picture in the sample image. The subject label is used for representing that the corresponding labeled candidate object is a target subject or an adjacent subject.
Then, the candidate objects may be ranked according to the order of the distance from near to far, the candidate object with the closest distance is used as the target subject corresponding to the key video frame, and the top N candidate objects are selected from the remaining candidate objects as the neighboring subjects corresponding to the key video frame.
For example, after performing target detection on a key video frame included in a frame set to be cropped, 4 candidate objects, namely a candidate object a, a candidate object B, a candidate object C, and a candidate object D, may be obtained, and assuming that a distance between the candidate object a and a center position of a picture in an adjacent video frame is 15mm, a distance between the candidate object B and the center position of the picture in the adjacent video frame is 10mm, a distance between the candidate object C and the center position of the picture in the adjacent video frame is 8mm, and a distance between the candidate object D and the center position of the picture in the adjacent video frame is 5mm, the candidate object D may be taken as a target subject of the key video frame. Assuming that 2 candidates can be selected from the remaining 3 candidates as neighboring subjects of the key video frame, candidate B and candidate C can be taken as neighboring subjects of the key video frame.
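The selection of the target subject and the adjacent subjects described above can be sketched as follows. The sketch assumes that candidate objects are given as bounding boxes in the adjacent video frame and measures, in pixels, the distance between each box centre and the picture centre; this is an assumption made only for illustration (the embodiment may equally obtain the distances from a trained subject selection model).

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in the adjacent video frame

def select_subjects(candidates: Dict[str, Box],
                    frame_size: Tuple[int, int],
                    num_neighbors: int) -> Tuple[str, List[str]]:
    """Pick the candidate closest to the picture centre as the target subject
    and the next `num_neighbors` closest candidates as adjacent subjects."""
    cx, cy = frame_size[0] / 2.0, frame_size[1] / 2.0

    def distance_to_center(box: Box) -> float:
        bx, by = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
        return math.hypot(bx - cx, by - cy)

    ranked = sorted(candidates, key=lambda name: distance_to_center(candidates[name]))
    target_subject = ranked[0]
    neighboring_subjects = ranked[1:1 + num_neighbors]
    return target_subject, neighboring_subjects

# Candidates A-D roughly mirroring the distance ordering in the example above.
candidates = {
    "A": (900, 500, 960, 540),   # farthest from the picture centre
    "B": (930, 520, 970, 540),
    "C": (945, 540, 985, 560),
    "D": (940, 525, 970, 545),   # closest to the picture centre
}
print(select_subjects(candidates, (1920, 1080), num_neighbors=2))
# ('D', ['C', 'B'])
```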
In another embodiment, after determining the key video frames included in the frame set to be cropped, target detection may be performed on the key video frames to obtain at least one candidate object, and feature extraction may be performed on the key video frames to determine feature images corresponding to the key video frames. Then, at least one candidate object and the feature image may be aligned to obtain a sub-feature image corresponding to each of the at least one candidate object, and at least one target object corresponding to the key video frame may be selected from the at least one candidate object according to the sub-feature image corresponding to each of the at least one candidate object.
For example, target detection is performed on a key video frame included in a frame set to be clipped, 4 candidate objects including a candidate object a, a candidate object B, a candidate object C, and a candidate object D may be obtained, feature extraction is performed on the key video frame, a feature image corresponding to the key video frame may be obtained, the candidate object a, the candidate object B, the candidate object C, and the candidate object D are aligned with the feature image, and a sub-feature image corresponding to the candidate object a, a sub-feature image corresponding to the candidate object B, a sub-feature image corresponding to the candidate object C, and a sub-feature image corresponding to the candidate object D may be determined respectively. Respectively determining whether the candidate object A, the candidate object B, the candidate object C and the candidate object D meet the set selection standard according to the sub-feature image corresponding to the candidate object A, the sub-feature image corresponding to the candidate object B, the sub-feature image corresponding to the candidate object C and the sub-feature image corresponding to the candidate object D, and assuming that the candidate object A and the candidate object B both meet the set selection standard, taking the candidate object A and the candidate object B as 2 target objects corresponding to the key video frame.
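A hedged sketch of this feature-alignment variant is shown below, using PyTorch/torchvision purely as an illustrative framework: the feature extractor, the ROI size and the final scoring rule are placeholders, since the embodiment does not specify a particular backbone or selection criterion.

```python
import torch
from torchvision.ops import roi_align

# Placeholder feature extractor; the embodiment does not name a specific backbone.
feature_extractor = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), torch.nn.ReLU(),
)

key_frame = torch.rand(1, 3, 720, 1280)             # key video frame as (N, C, H, W)
candidate_boxes = torch.tensor([[100., 80., 300., 400.],    # candidate A
                                [500., 60., 700., 420.]])   # candidate B

feature_map = feature_extractor(key_frame)           # feature image of the key frame
scale = feature_map.shape[-1] / key_frame.shape[-1]  # frame pixels -> feature-map cells

# Align each candidate box with the feature image: one sub-feature image per candidate.
sub_features = roi_align(feature_map, [candidate_boxes], output_size=(7, 7),
                         spatial_scale=scale, aligned=True)
print(sub_features.shape)                            # (num_candidates, 32, 7, 7)

# A scoring rule over the sub-feature images then selects the target object(s);
# the mean activation used here is only a stand-in for the unspecified criterion.
scores = sub_features.mean(dim=(1, 2, 3))
target_index = torch.argmax(scores)
```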
Step S203, the server determines, based on the at least one target object, each tracking video frame corresponding to the key video frame in the frame set to be cropped.
In an embodiment, after at least one target object is determined in a key video frame included in a frame set to be cropped, a target subject included in the target object may be detected one by one in each video frame after the key video frame until an end video frame not including the target subject is detected, and each video frame between the key video frame and the end video frame is used as each tracking video frame corresponding to the key video frame. And, the cut-off video frame can be used as the next key video frame in the frame set to be cut, and the determination of at least one target object based on the key video frame and the subsequent steps are returned to be executed.
For each video frame after the key video frame, the following operations may be performed: the method comprises the steps of carrying out target detection on one video frame in each video frame to obtain at least one candidate object, respectively matching a target main body with the at least one candidate object, if the at least one candidate object comprises the target main body, determining that the video frame comprises the target main body, after determining that the video frame comprises the target main body, respectively determining the distance between each candidate object except the target main body and the target main body in the video frame, and taking the candidate object of which the distance between the candidate object and the target main body is smaller than a set distance threshold value as an adjacent main body in the video frame.
For example, 4 target objects are determined in the key video frame of the frame set to be cropped, and the 4 target objects include 1 target subject and 3 adjacent subjects. The 1 target subject may then be detected, one frame at a time, in each video frame located after the key video frame in the frame set to be cropped. Suppose that 4 candidate objects, including the target subject, are detected in the 1st video frame after the key video frame; 3 candidate objects, including the target subject, are detected in the 2nd video frame after the key video frame; 5 candidate objects, including the target subject, are detected in the 3rd video frame after the key video frame; and 4 candidate objects, not including the target subject, are detected in the 4th video frame after the key video frame. Then the 1st, 2nd and 3rd video frames after the key video frame can be taken as the tracking video frames corresponding to the key video frame, that is, the 1st, 2nd and 3rd tracking video frames corresponding to the key video frame are determined. The 4th video frame after the key video frame can be taken as the next key video frame in the frame set to be cropped, and the process returns to the step of determining at least one target object based on the key video frame and the subsequent steps.
After the 1st, 2nd and 3rd tracking video frames are determined, the distances between the target subject and the 3 candidate objects other than the target subject in the 1st tracking video frame, the distances between the target subject and the 2 candidate objects other than the target subject in the 2nd tracking video frame, and the distances between the target subject and the 4 candidate objects other than the target subject in the 3rd tracking video frame may be determined, respectively. Assuming that the distances between 2 of the candidate objects and the target subject in the 1st tracking video frame are smaller than the set distance threshold, these 2 candidate objects may be taken as the adjacent subjects in the 1st tracking video frame. Assuming that the distance between 1 candidate object and the target subject in the 2nd tracking video frame is smaller than the set distance threshold, this 1 candidate object may be taken as the adjacent subject in the 2nd tracking video frame. Assuming that the distances between 2 of the candidate objects and the target subject in the 3rd tracking video frame are smaller than the set distance threshold, these 2 candidate objects may be taken as the adjacent subjects in the 3rd tracking video frame.
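A compact sketch of this tracking loop is given below; `detect_objects` and `matches` are hypothetical helpers standing in for the target detection and matching steps, and the distance function used for selecting adjacent subjects is likewise assumed, since the embodiment does not prescribe concrete implementations for them.

```python
def collect_tracking_frames(frames, start_idx, target_subject, detect_objects, matches):
    """Walk forward from the key frame at `start_idx`; every frame in which the
    target subject is re-detected becomes a tracking frame. The first frame that
    no longer contains the target subject is returned as the cut-off frame, which
    then serves as the next key video frame."""
    tracking_frames = []
    for idx in range(start_idx + 1, len(frames)):
        candidates = detect_objects(frames[idx])
        if any(matches(c, target_subject) for c in candidates):
            tracking_frames.append(idx)          # target subject tracked in this frame
        else:
            return tracking_frames, idx          # cut-off frame: next key video frame
    return tracking_frames, None                 # subject visible to the end of the set

def neighbors_in_frame(candidates, target_subject, matches, distance, threshold):
    """Candidates (other than the target subject) whose distance to the target
    subject is below `threshold` are the adjacent subjects of that frame."""
    return [c for c in candidates
            if not matches(c, target_subject) and distance(c, target_subject) < threshold]
```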
In another embodiment, after at least one target object is determined in a key video frame included in the set of frames to be cropped, based on the at least one target object, the following operations may be performed for each video frame located after the key video frame in the set of frames to be cropped, respectively: firstly, performing target detection on one video frame in each video frame to obtain at least one candidate object, then respectively matching the at least one target object with the at least one candidate object, if the at least one candidate object comprises the at least one target object according to a matching result, determining that the at least one target object is tracked in the one video frame, and when the at least one target object is tracked in the one video frame, taking the one video frame as a tracking video frame corresponding to a key video frame; if at least one target object cannot be tracked in one video frame and at least one target object can be tracked in the previous video frame of the one video frame, the one video frame is taken as the next key video frame, and the steps of determining at least one target object based on the key video frame and the following steps can be returned to be executed. According to the above operation, each tracking video frame corresponding to the key video frame in the frame set to be cropped can be determined, and the next key video frame in the frame set to be cropped can be determined.
For example, 2 target objects are determined in a key video frame included in the frame set to be cropped. The 2 target objects may be tracked in turn in each video frame of the frame set to be cropped located after the key video frame. Suppose the 2 target objects can be tracked in the 1st video frame after the key video frame and can also be tracked in the 2nd video frame after the key video frame, while only one of the 2 target objects, or neither of them, can be tracked in the 3rd video frame after the key video frame. Then the 1st and 2nd video frames after the key video frame can be taken as the tracking video frames corresponding to the key video frame, and the 3rd video frame after the key video frame can be taken as the next key video frame in the frame set to be cropped. After the next key video frame is determined, the target objects included in that key video frame can be determined, and the subsequent tracking steps are continued.
Step S204, the server cuts the key video frames and the tracking video frames respectively based on at least one target object, and a target frame set composed of the cut target video frames is obtained.
For each video frame in the set of frames to be cropped, the following operations may be performed respectively: and determining an image area containing each target object in the video frame based on the position of each target object in the video frame, and taking the central position of the image area as the lens tracking position of the video frame. After the key video frames and the shot tracking positions in all the tracking video frames are determined, the shot tracking track of the frame set to be cut can be determined, the key video frames and all the tracking video frames are cut based on the shot tracking track, all the cut target video frames are obtained, and the target frame set consisting of all the target video frames is obtained.
In some embodiments, after determining the lens tracking trajectory of the frame set to be clipped, the lens tracking trajectory may be further subjected to sliding window filtering to obtain a corresponding smooth tracking trajectory. Based on the smoothed tracking trajectory, the key video frames and the respective tracking video frames can be cropped separately. The smooth lens tracking track is more stable, so that the image shake caused by violent motion can be inhibited, and the cut video effect is improved.
For example, assume that a set of frames to be cropped contains 3 video frames, where the first video frame contains 2 target objects, the second video frame contains 3 target objects, the third video frame contains 4 target objects, and the positions of the respective target objects contained in each video frame can be determined respectively, then the image areas containing 2 target objects in the first video frame, the image areas containing 3 target objects in the second video frame, the image areas containing 4 target objects in the third video frame can be determined respectively, and the center position of the image area containing 2 target objects can be taken as the lens tracking position of the first video frame, the center position of the image area containing 3 target objects can be taken as the lens tracking position of the second video frame, and the center position of the image area containing 4 target objects can be taken as the lens tracking position of the third video frame. According to the lens tracking positions, a lens tracking track of the frame set to be cut can be determined, the lens tracking track is smoothed, and then each video frame is cut according to the smoothed lens tracking track.
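The following sketch shows one way to compute the lens tracking position of each frame and to smooth the resulting lens tracking trajectory with sliding-window filtering; the bounding-box representation of target objects and the window size are assumptions made for illustration only.

```python
import numpy as np

def lens_tracking_position(object_boxes):
    """Lens tracking position of one frame: the centre of the image area that
    encloses all target objects (boxes given as (x1, y1, x2, y2))."""
    boxes = np.asarray(object_boxes, dtype=float)
    x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()
    x2, y2 = boxes[:, 2].max(), boxes[:, 3].max()
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])

def smooth_trajectory(positions, window=5):
    """Sliding-window (moving-average) filtering of the lens tracking trajectory;
    the window size is an assumed parameter, not given in the embodiment."""
    positions = np.asarray(positions, dtype=float)
    smoothed = np.empty_like(positions)
    half = window // 2
    for i in range(len(positions)):
        lo, hi = max(0, i - half), min(len(positions), i + half + 1)
        smoothed[i] = positions[lo:hi].mean(axis=0)
    return smoothed

# One lens tracking position per frame in the set, then the smoothed trajectory.
trajectory = [lens_tracking_position(frame_boxes) for frame_boxes in [
    [(100, 200, 180, 360), (220, 210, 300, 380)],                        # frame 1: 2 objects
    [(120, 205, 200, 365), (240, 215, 320, 385), (360, 220, 430, 390)],  # frame 2: 3 objects
    [(140, 210, 220, 370), (260, 220, 340, 390), (380, 225, 450, 395), (500, 230, 560, 400)],
]]
smoothed = smooth_trajectory(trajectory, window=3)
```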
In one embodiment, when the key video frames and the respective tracking video frames are cropped based on the lens tracking trajectory of the frame set to be cropped, an image area containing the respective target object in each video frame may be determined as a target image area, and the cropping is performed based on the corresponding target image area for each video frame, so that the cropped video frames at least include the image content of the target image area, and even in one implementation, the cropped video frames only include the image content of the target image area. Therefore, a cutting video for performing lens tracking on each target object can be obtained based on the original video.
In another embodiment, the key video frames and the tracking video frames may be cropped based on the lens tracking trajectory according to a preset cropping ratio. For example, the target image area determined in each video frame may be cropped based on the lens tracking trajectory according to a preset cropping ratio of 3:4. In this embodiment, if there are multiple target objects, that is, there are a target subject and at least one neighboring subject, the following operations may be performed for each video frame in the set of frames to be cropped when determining the lens tracking position of each video frame in the set of frames to be cropped: determining a main shot position based on a target subject in the video frame, and determining a main image area of the video frame according to a preset cropping proportion based on the main shot position, so that the main image area contains as many adjacent subjects in the video frame as possible under the condition that the main image area can contain the image of the target subject, and then taking the central position of the main image area as a shot tracking position of the video frame. After determining the key video frames and the shot tracking positions in the tracking video frames, the shot tracking trajectory of the frame set to be cropped can be determined.
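A sketch of cropping at a preset ratio around the lens tracking position is shown below; fixing the crop height to the full frame height and clamping the window to the frame borders are illustrative choices, not requirements of the embodiment.

```python
def crop_at_ratio(frame_width, frame_height, center, ratio=(3, 4)):
    """Compute a crop window with the preset aspect ratio (here 3:4, width:height)
    centred on the lens tracking position, clamped to the frame borders.
    The crop uses the full frame height; other sizing policies are equally possible."""
    cx, _ = center
    crop_h = frame_height
    crop_w = int(round(crop_h * ratio[0] / ratio[1]))
    x1 = int(round(cx - crop_w / 2.0))
    x1 = max(0, min(x1, frame_width - crop_w))       # keep the window inside the frame
    return x1, 0, x1 + crop_w, crop_h                # (x1, y1, x2, y2)

# Example: a 1920x1080 frame whose lens tracking position lies right of centre.
print(crop_at_ratio(1920, 1080, center=(1210.0, 540.0)))
# (805, 0, 1615, 1080)
```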
In step S205, the server generates a clipped target video based on the target frame set.
After the cropped target frame set is obtained, the cropped target video can be generated according to the cropped target frame set.
The video cropping method provided by the embodiment of the application performs scene division on the video frames contained in a video to be cropped, determines a frame set to be cropped from the obtained at least one frame set, determines a key video frame contained in the frame set to be cropped, determines at least one target object based on the key video frame, determines, according to the at least one target object, the tracking video frames corresponding to the key video frame in the frame set to be cropped, crops the key video frame and each tracking video frame based on the at least one target object to obtain the cropped target frame set, and generates the cropped target video based on the target frame set. Compared with clipping a video manually, as in the related art, the method provided by the embodiment of the application completes the clipping of a video automatically, without manual participation, and can therefore improve the efficiency of clipping the video.
In some embodiments, the video cropping method proposed in the present application may be implemented according to the process shown in fig. 3, which may be executed by the server 100 in fig. 1, or may be executed by a terminal device or other electronic devices. For example, a server for video cropping is used as an execution subject, and a specific implementation process performed by other devices is similar to a process performed by the server alone, and is not described herein again.
As shown in fig. 3, the following steps may be included:
step S301, the server performs scene division on each video frame contained in the video to be clipped to obtain at least one frame set.
A video usually includes a series of continuous shots. Within the same shot, the main object contained in the video frames often does not change; between shots, the main object contained in the video frames often changes, and even when the main object does not change, information such as its position and size in the video frame still changes. Therefore, the video needs to be split into shots, that is, each video frame included in the video to be cropped is subjected to scene division to obtain at least one frame set. Each frame set corresponds to one scene, that is, each frame set corresponds to one shot of the video.
Specifically, each video frame included in the video to be cropped may be input into a trained scene boundary detection model. The scene boundary detection model may perform Shot Boundary Detection (SBD) on the video to be cropped and segment the video to be cropped into a series of continuous shots. Based on the scene boundary detection model, at least one scene boundary frame can be output, where each scene boundary frame is a video frame whose similarity with the adjacent next video frame is smaller than a set threshold. For each scene boundary frame, the following operations may be performed: attributing one scene boundary frame and all video frames between it and the adjacent previous scene boundary frame to the same frame set. In this way, at least one frame set corresponding to the video to be cropped can be obtained.
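The embodiment uses a trained scene boundary detection model for this decision; purely to illustrate the "similarity below a set threshold" criterion, the sketch below substitutes a simple grey-level histogram intersection as the similarity measure, which is only a stand-in for the learned model.

```python
import numpy as np

def scene_boundary_frames(frames, threshold=0.6):
    """Flag frame i as a scene boundary frame when its grey-level histogram
    similarity to frame i+1 falls below `threshold`.
    `frames` is a list of grayscale frames as 2-D numpy arrays (e.g. loaded
    with OpenCV); the bin count and threshold are assumed values."""
    def histogram(frame):
        h, _ = np.histogram(frame, bins=64, range=(0, 255))
        return h / max(h.sum(), 1)

    boundaries = []
    for i in range(len(frames) - 1):
        similarity = np.minimum(histogram(frames[i]), histogram(frames[i + 1])).sum()
        if similarity < threshold:
            boundaries.append(i)         # frame i ends the current shot
    boundaries.append(len(frames) - 1)   # the last frame always closes the last shot
    return boundaries
```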
In step S302, the server determines a frame set to be clipped from the obtained at least one frame set.
The frame set to be cropped may be any one of at least one frame set corresponding to the obtained video to be cropped.
In step S303, the server determines the key video frames included in the frame set to be cropped, and determines the target subject and the neighboring subject based on the key video frames.
The set of frames to be cropped may include a plurality of key video frames, and the first video frame in the set of frames to be cropped may be a key video frame. Target detection can be performed on the key video frame to obtain a plurality of candidate objects, and the distance between the plurality of candidate objects and the center position of the picture in the adjacent video frame is determined. And the adjacent video frame is a video frame adjacent to the key video frame.
When determining the distances between a plurality of candidate objects and the center position of the picture in the adjacent video frames, the body probability value of each candidate object can be determined through the trained body selection model. Wherein the subject probability value may be used to characterize a distance between the corresponding candidate object and a picture center position in the adjacent video frame. The subject selection model is obtained by training a sample image sequence of labeled candidate objects, wherein the labeled candidate objects are provided with subject labels labeled according to the distance between the labeled candidate objects and the center position of the picture in the sample image. The subject label is used for representing that the corresponding labeled candidate object is a target subject or an adjacent subject.
Then, the candidate objects may be ranked according to the order of the distance from near to far, the candidate object with the closest distance is used as the target subject corresponding to the key video frame, and the top N candidate objects are selected from the remaining candidate objects as the neighboring subjects corresponding to the key video frame.
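A small sketch of this ranking step, assuming each candidate object has already been assigned a distance to the picture center of the adjacent frame (for example, derived from the subject selection model's probability values):

```python
def pick_subjects(candidates, n_neighbors):
    """Split detected candidates into one target subject (the candidate
    closest to the adjacent frame's picture center) and up to N adjacent
    subjects. `candidates` is a list of (object_id, distance) pairs."""
    ranked = sorted(candidates, key=lambda c: c[1])           # near -> far
    target_subject = ranked[0][0]
    adjacent_subjects = [obj for obj, _ in ranked[1:1 + n_neighbors]]
    return target_subject, adjacent_subjects
```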
In step S304, the server determines, based on the target subject, each tracking video frame corresponding to the key video frame.
After the target subject and the adjacent subjects in the key video frame are determined, the target subject can be detected, frame by frame, in each video frame after the key video frame until a cut-off video frame that does not contain the target subject is detected; each video frame between the key video frame and the cut-off video frame is then used as a tracking video frame corresponding to the key video frame. The cut-off video frame can in turn be used as the next key video frame in the frame set to be cropped, and the process returns to the step of determining at least one target object based on the key video frame and the subsequent steps.
For each video frame after the key video frame, the following operations may be performed: target detection is performed on the video frame to obtain at least one candidate object, the target subject is matched against the at least one candidate object, and if the at least one candidate object includes the target subject, the video frame is determined to contain the target subject.
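The frame-by-frame search for the cut-off frame could look roughly like the following sketch; `detect` and `match` stand in for the target detection and subject matching steps described above and are not concrete APIs from the application:

```python
def collect_tracking_frames(frames, key_index, target_subject, detect, match):
    """Walk the frames after the key frame, keeping every frame in which the
    target subject is re-detected; stop at the first frame that no longer
    contains it (the cut-off frame, which becomes the next key frame)."""
    tracking_frames = []
    cutoff_index = None
    for i in range(key_index + 1, len(frames)):
        candidates = detect(frames[i])
        if any(match(target_subject, c) for c in candidates):
            tracking_frames.append(frames[i])     # subject still present
        else:
            cutoff_index = i                       # next key frame of the set
            break
    return tracking_frames, cutoff_index
```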
In step S305, the server determines the target subject in each tracking video frame.
Since the target subject is detected frame by frame in each video frame after the key video frame, each tracking video frame corresponding to the key video frame is determined and, at the same time, the target subject in each tracking video frame is determined.
In step S306, the server determines the adjacent subjects in each tracking video frame.
After the target subject included in each of the tracking video frames is determined, the distance between each candidate object except the target subject and the target subject may be determined separately for each of the tracking video frames, and the candidate object whose distance from the target subject is smaller than the set distance threshold may be regarded as the neighboring subject in each of the tracking video frames.
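A minimal illustration of this distance test, assuming objects are represented by bounding boxes and the distance is measured between box centers (both are assumptions made for the sketch):

```python
def neighbors_in_frame(target_box, candidate_boxes, distance_threshold):
    """Treat every other detected object whose center lies within
    `distance_threshold` of the target subject's center as an adjacent
    subject for this tracking frame. Boxes are (x1, y1, x2, y2)."""
    def center(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

    tx, ty = center(target_box)
    neighbors = []
    for box in candidate_boxes:
        cx, cy = center(box)
        if ((cx - tx) ** 2 + (cy - ty) ** 2) ** 0.5 < distance_threshold:
            neighbors.append(box)
    return neighbors
```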
In step S307, the server determines image regions including the target subject and the adjacent subject in the key video frame and each of the tracking video frames, respectively, and uses the center position of the image region as the lens tracking position of the key video frame and each of the tracking video frames.
After the target subject and adjacent subjects contained in the key video frame are determined, the target subject and adjacent subjects in each tracking video frame corresponding to the key video frame can also be determined, together with their positions in the key video frame and in each tracking video frame. Based on these positions, the image region containing the target subject and the adjacent subjects can be determined in the key video frame and in each tracking video frame, and the center position of each image region is used as the lens tracking position of the corresponding video frame.
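The lens tracking position of a frame can then be computed as the center of the smallest region enclosing all subject boxes, for example:

```python
def lens_tracking_position(subject_boxes):
    """Return the center of the smallest image region enclosing the target
    subject and all adjacent subjects, plus the region itself.
    Boxes are (x1, y1, x2, y2)."""
    x1 = min(b[0] for b in subject_boxes)
    y1 = min(b[1] for b in subject_boxes)
    x2 = max(b[2] for b in subject_boxes)
    y2 = max(b[3] for b in subject_boxes)
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0), (x1, y1, x2, y2)
```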
Step S308, the server determines the lens tracking track of the frame set to be cut according to the key video frame and the lens tracking position of each tracking video frame.
After the lens tracking positions of the key video frame and of all tracking video frames are determined, the lens tracking track of the frame set to be cropped can be determined. The lens tracking track comprises the target cropping region of every video frame contained in the frame set to be cropped; each target cropping region contains not only the image region enclosing the target subject and the adjacent subjects, but also a background portion of a set width around them.
Step S309, the server smoothes the lens tracking track of the frame set to be cut to obtain a smooth tracking track of the frame set to be cut, cuts the key video frames and the tracking video frames according to the smooth tracking track, and obtains a target frame set formed by the cut target video frames.
After the lens tracking track of the frame set to be cut is obtained, sliding window filtering can be performed on the lens tracking track to obtain a smooth tracking track of the frame set to be cut, and key video frames and tracking video frames included in the frame set to be cut are cut according to the smooth tracking track to obtain a target frame set formed by the cut target video frames. Each target video frame includes a corresponding target subject and an adjacent subject.
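A rough sketch of the smoothing and cropping steps; it assumes the lens tracking positions have been collected per frame, uses a Gaussian sliding-window filter as one possible smoothing choice, and clamps the crop window to the frame borders:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_track(positions, sigma=2.0):
    """Smooth the per-frame lens tracking positions with a Gaussian sliding
    window so the virtual camera does not jitter; `positions` is an (N, 2)
    array of (x, y) centers, one per frame of the frame set."""
    positions = np.asarray(positions, dtype=np.float32)
    return np.stack([
        gaussian_filter1d(positions[:, 0], sigma=sigma),
        gaussian_filter1d(positions[:, 1], sigma=sigma),
    ], axis=1)

def crop_frame(frame, center, crop_w, crop_h):
    """Crop a fixed-size window around the smoothed center, clamped so the
    window stays inside the frame; `frame` is an H x W x C array."""
    h, w = frame.shape[:2]
    cx = int(np.clip(center[0], crop_w / 2, w - crop_w / 2))
    cy = int(np.clip(center[1], crop_h / 2, h - crop_h / 2))
    return frame[cy - crop_h // 2: cy + crop_h // 2,
                 cx - crop_w // 2: cx + crop_w // 2]
```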
In step S310, the server generates a clipped target video based on the target frame set.
After the target frame set is obtained, a clipped target video may be generated according to the clipped target frame set.
The following describes the video cropping method in further detail by using a specific application scenario:
suppose that a video to be cropped contains 5 video frames, which are video frame a, video frame B, video frame C, video frame D, and video frame E, respectively. And inputting the 5 video frames into a trained scene boundary detection model to perform scene boundary detection on the video to be cropped. Based on the scene boundary detection model, it can be obtained that the video frame B is a scene boundary frame, and the video frame E is a scene boundary frame.
Then, the video frame a and the video frame B may be divided into the same frame set, and the video frame C, the video frame D, and the video frame E are divided into the same frame set, then 5 video frames included in the video to be cropped may be divided into 2 frame sets, namely, frame set 1 and frame set 2, and each frame set corresponds to one scene of the video. The frame set 1 comprises 2 video frames including a video frame A and a video frame B, and the frame set 2 comprises 3 video frames including a video frame C, a video frame D and a video frame E.
After the frame set 1 and the frame set 2 are obtained, the frame set 1 and the frame set 2 can be respectively used as frame sets to be clipped, the video frame a in the frame set 1 is used as a key video frame corresponding to the frame set 1, and the video frame C in the frame set 2 is used as a key video frame corresponding to the frame set 2. Then, target detection may be performed on the video frame a to obtain 3 candidate objects, and a target subject a, an adjacent subject b, and an adjacent subject c may be determined from the 3 candidate objects. And performing target detection on the video frame C to obtain 2 candidate objects, and determining a target main body h and an adjacent main body i from the 2 candidate objects.
Target detection can be performed on the video frame B to obtain 4 candidate objects, and the target subject a is matched against these 4 candidates. Since the 4 candidate objects include the target subject a, the video frame B is a tracking video frame corresponding to the video frame A and contains the target subject a. From the remaining 3 candidate objects it can further be determined that the video frame B also contains the adjacent subject b and the adjacent subject d, so that the video frame B contains the target subject a, the adjacent subject b, and the adjacent subject d.
At the same time, target detection can be performed on the video frame D and the video frame E one by one, yielding 3 candidate objects in the video frame D and 5 candidate objects in the video frame E. The target subject h is then matched against the 3 candidates in the video frame D and the 5 candidates in the video frame E. The 3 candidates in the video frame D include the target subject h, while the 5 candidates in the video frame E do not, so the video frame D is a tracking video frame corresponding to the video frame C and the video frame E is the next key video frame in the frame set 2. Since the video frame D contains the target subject h, it can be determined from the remaining 2 candidate objects that the video frame D also contains the adjacent subject i and the adjacent subject j, so that the video frame D contains the target subject h, the adjacent subject i, and the adjacent subject j. Since the video frame E is the next key video frame in the frame set 2, the target subject o, the adjacent subject p, and the adjacent subject q can be determined from its 5 candidate objects.
According to the above operations, in the frame set 1, it can be determined that the video frame a includes the target subject a, the adjacent subject B, and the adjacent subject c, and the video frame B includes the target subject a, the adjacent subject B, and the adjacent subject d. In the frame set 2, it can be determined that the video frame C includes a target subject h and an adjacent subject i, the video frame D includes the target subject h, the adjacent subject i and an adjacent subject j, and the video frame E includes a target subject o, an adjacent subject p and an adjacent subject q. According to the target subject and the adjacent subject respectively included in the video frame A, the video frame B, the video frame C, the video frame D and the video frame E, image areas including the target subject and the adjacent subject in the video frame A, the video frame B, the video frame C, the video frame D and the video frame E can be respectively determined, and the central position of the image area can be used as the lens tracking position of each of the video frame A, the video frame B, the video frame C, the video frame D and the video frame E. The lens tracking track of the frame set 1 can be obtained according to the respective lens tracking positions of the video frame A and the video frame B, the lens tracking track of the frame set 2 can be obtained according to the respective lens tracking positions of the video frame C, the video frame D and the video frame E, and after the lens tracking tracks are smoothed, the video frame A, the video frame B, the video frame C, the video frame D and the video frame E can be respectively cut to obtain a cut target video frame A, a target video frame B, a target video frame C, a target video frame D and a target video frame E.
And finally, obtaining the clipped target video based on the clipped target video frame A, the clipped target video frame B, the clipped target video frame C, the clipped target video frame D and the clipped target video frame E.
In other embodiments, the video cropping method proposed in the present application may be implemented according to the process shown in fig. 4, which may be executed by the server 100 in fig. 1, or by a terminal device or another electronic device. The following description again takes a server that performs video cropping as the execution subject; the implementation process when the method is performed by other devices is similar and is not repeated here.
As shown in fig. 4, the following steps may be included:
step S401, the server obtains a frame set to be clipped.
After the server acquires the video to be cropped, the server can perform scene division on each video frame included in the video to be cropped to obtain a plurality of frame sets, and then any one of the frame sets is used as the frame set to be cropped.
In step S402, the server determines key video frames included in the frame set to be cropped, and determines at least one target object based on the key video frames.
After the key video frame contained in the frame set to be cropped is determined, target detection may be performed on the key video frame to obtain at least one candidate object, and feature extraction may be performed on the key video frame to determine the feature image corresponding to it. The at least one candidate object is then aligned with the feature image to obtain the sub-feature image corresponding to each candidate object, and at least one target object corresponding to the key video frame is selected from the at least one candidate object according to these sub-feature images.
Specifically, the image shown in fig. 5 is a key video frame. Target detection may be performed on this key video frame to determine 3 candidate objects, namely candidate object 1, candidate object 2, and candidate object 3. In parallel with determining the candidate objects, feature extraction may be performed on the key video frame to obtain the common features, salient features, and blur features corresponding to the key video frame. The common features and the salient features are extracted through a Convolutional Neural Network (CNN), and the blur features are extracted through the Tenengrad function. After the common features, salient features, and blur features corresponding to the key video frame are obtained, they can be combined into the feature image corresponding to the key video frame. The candidate objects may then be aligned with the feature image to determine the sub-feature images corresponding to candidate object 1, candidate object 2, and candidate object 3, respectively. Since the sub-feature image corresponding to candidate object 1 does not include the face features of candidate object 1, candidate object 1 may be deleted; candidate object 2 and candidate object 3 are then used as the target objects corresponding to the key video frame, that is, 2 target objects in total, such as the target object 1 and the target object 2 shown in fig. 4, may be determined in the key video frame.
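As an illustration of the blur feature mentioned above, the Tenengrad measure can be computed from Sobel gradient magnitudes; this sketch assumes OpenCV and a grayscale input, and is independent of the CNN used for the common and salient features:

```python
import cv2
import numpy as np

def tenengrad_blur_feature(gray):
    """Tenengrad focus measure used as a stand-in for the blur feature:
    the mean squared Sobel gradient magnitude. Low values indicate a blurry
    region. `gray` is a single-channel image (numpy array)."""
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    return float(np.mean(gx ** 2 + gy ** 2))
```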
In an embodiment, when at least one target object corresponding to the key video frame is selected from the candidate objects according to their sub-feature images, and the candidate objects are all people, a candidate object may be deleted if its sub-feature image contains too few of its face features, or incomplete or too few of its body features.
In another embodiment, for a key video frame whose subjects are people, a plurality of candidate objects corresponding to the key video frame may be determined together with their sub-feature images. If the sub-feature image of every candidate object contains complete face features and body features of that candidate, the candidate object near the center of the picture of the key video frame may be used as the target object corresponding to the key video frame.
In step S403, the server tracks at least one target object in each video frame after the key video frame in the to-be-cropped frame set based on at least one target object corresponding to the key video frame.
Based on at least one target object corresponding to the key video frame, the following operations may be sequentially and respectively performed on each video frame located after the key video frame in the frame set to be cropped:
and performing target detection on one video frame in each video frame to obtain at least one candidate object, respectively matching the at least one target object with the at least one candidate object, and determining that the at least one target object is tracked in the one video frame if the at least one candidate object comprises the at least one target object according to a matching result.
In one embodiment, if only one target object is determined in the key video frame, that target object may be tracked in each video frame following the key video frame in the set of frames to be cropped using the SiamMask algorithm. Specifically, target detection may be performed in each video frame after the key video frame to determine the candidate objects corresponding to each video frame, and the SiamMask algorithm is then used to match the target object corresponding to the key video frame against the candidate objects of each video frame, so as to determine whether each video frame contains the target object.
In another embodiment, if a plurality of target objects are determined in the key video frame, that is, if the number of target objects exceeds one, a Multi-Object Tracking (MOT) algorithm may be used to track the target objects in each video frame following the key video frame in the set of frames to be cropped. Specifically, target detection may be performed in each video frame after the key video frame, to determine candidate objects corresponding to each video frame, and then an MOT algorithm is used to match a plurality of target objects corresponding to the key video frame with the candidate objects corresponding to each video frame, so as to determine whether each video frame includes the plurality of target objects.
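As a simplified stand-in for the matching step performed by such trackers (real SiamMask or MOT trackers rely on learned appearance features, which are omitted here), a greedy overlap-based association between the tracked targets and the current frame's candidates could look like this:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_targets(target_boxes, candidate_boxes, iou_threshold=0.3):
    """Greedily associate each tracked target with its best-overlapping
    candidate; the frame 'contains' the targets only if every target finds
    a match above the threshold (otherwise None is returned)."""
    matches, used = {}, set()
    for t_id, t_box in target_boxes.items():
        best_id, best_iou = None, iou_threshold
        for c_id, c_box in enumerate(candidate_boxes):
            if c_id in used:
                continue
            overlap = iou(t_box, c_box)
            if overlap > best_iou:
                best_id, best_iou = c_id, overlap
        if best_id is None:
            return None            # at least one target lost in this frame
        matches[t_id] = best_id
        used.add(best_id)
    return matches
```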
Step S404, the server determines each tracking video frame corresponding to the key video frame in the frame set to be cropped.
Based on the at least one target object corresponding to the key video frame, the at least one target object is tracked, in sequence, in each video frame located after the key video frame in the frame set to be cropped. If the at least one target object cannot be tracked in a certain video frame after the key video frame but can still be tracked in the video frame immediately before it, all video frames located after the key video frame and before that video frame are used as the tracking video frames corresponding to the key video frame.
In addition, if the at least one target object cannot be tracked in a certain video frame after the key video frame but can still be tracked in the video frame immediately before it, that video frame may be regarded as a new key video frame in the set of frames to be cropped. At least one target object corresponding to this new key video frame is determined in it, and based on that at least one target object, tracking is performed in each video frame after the new key video frame, so that the tracking video frames corresponding to the new key video frame can be determined.
In an embodiment, in order to keep the shot position in the video relatively smooth, the target object corresponding to a video frame adjacent to the key video frame generally needs to be considered when determining the target object corresponding to the key video frame. That is, when feature extraction is performed on the key video frame to determine its feature image, the shot center position of the adjacent video frame is introduced; then, after the candidate objects corresponding to the key video frame are determined, a candidate object closer to the shot center position of the adjacent video frame may be preferentially selected as the target object corresponding to the key video frame.
In step S405, the server determines at least one target object corresponding to each tracking video frame included in the frame set to be clipped.
After each tracking video frame corresponding to the key video frame is determined in the frame set to be cropped based on the at least one target object corresponding to the key video frame, at least one target object corresponding to each tracking video frame can be respectively determined in each tracking video frame.
In step S406, the server determines a tracking track corresponding to at least one target object.
When each key video frame contained in the frame set to be cropped and at least one target object corresponding to each tracking video frame are respectively determined, a tracking track corresponding to the at least one target object can be determined.
Step S407, the server smoothes the tracking trajectory, and cuts the key video frames and each tracking video frame based on at least one target object, respectively, to obtain a set of cut target frames.
In order to suppress the picture jitter caused by severe motion, a Gaussian filter may be used to perform sliding window filtering on the tracking trajectory corresponding to the at least one target object, so as to obtain a smooth tracking trajectory corresponding to the at least one target object.
After obtaining the smooth tracking track corresponding to at least one target object, each key video frame and each tracking video frame may be respectively clipped according to the at least one target object, so as to obtain a clipped target frame set.
For example, cropping the key video frames in fig. 5 based on 2 target objects identified in the key video frames shown in fig. 5 may result in cropped key video frames as shown in fig. 6.
In step S408, the server generates the clipped target video based on the target frame set.
After the target frame set is obtained, a clipped target video may be generated according to the clipped target frame set.
The video cropping method provided by the embodiment of the application first performs scene division on the video frames contained in a video to be cropped to obtain at least one frame set. For each frame set, the key video frame contained in it is determined, and at least one target object corresponding to the key video frame is determined based on the key video frame. Based on the at least one target object, the target objects are tracked in each video frame located after the key video frame in the frame set, and the tracking video frames corresponding to the key video frame are determined. After each key video frame and each tracking video frame in the frame set, together with their corresponding target objects, have been determined, each key video frame and each tracking video frame can be cropped to obtain a cropped target frame set, and the cropped target video is generated based on the target frame set. Compared with the manual cropping used in the related art, the method provided by the embodiment of the application completes the cropping automatically by determining the target objects in the key video frame and tracking them in the corresponding tracking video frames; it can accurately determine the main targets in the video frames and crop the video without manual participation, thereby improving the efficiency of video cropping.
In one embodiment, the scene boundary detection model in step S301 may employ TransNet V2 to implement shot boundary detection for the video to be cropped; TransNet V2 may include a convolutional layer, a pooling layer, and a fully-connected layer. Each video frame contained in the video to be cropped can be input into the convolutional layer, and feature extraction of each video frame is completed based on the convolutional layer to obtain the feature map corresponding to each video frame. The feature map corresponding to each video frame is input into the pooling layer, and the reduced-dimension feature corresponding to each video frame is obtained based on the pooling layer. The reduced-dimension features corresponding to each video frame are input into the fully-connected layer, and the scene boundary detection result of the video to be cropped is output based on the fully-connected layer, that is, it is determined whether each video frame contained in the video to be cropped is a scene boundary frame.
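For orientation only, a heavily simplified convolution / pooling / fully-connected pipeline of this kind could be sketched as follows in PyTorch; this is an illustrative stand-in, not the published TransNet V2 architecture:

```python
import torch
import torch.nn as nn

class SimpleBoundaryNet(nn.Module):
    """Toy stand-in for the conv / pooling / fully-connected pipeline
    described above: it consumes a short clip of frames and predicts,
    per frame, a logit for 'this frame is a shot boundary'."""
    def __init__(self, n_frames=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),                   # spatial pooling
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((n_frames, 1, 1)),                # one vector per frame
        )
        self.classifier = nn.Linear(32, 1)                         # boundary logit

    def forward(self, clip):                                       # clip: (B, 3, T, H, W)
        feats = self.features(clip)                                # (B, 32, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)      # (B, T, 32)
        return self.classifier(feats).squeeze(-1)                  # (B, T) logits
```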
Fig. 7 is a training method of the scene boundary detection model, as shown in fig. 7, the training process may include the following steps:
step S701, a training data set is acquired.
The acquired training data set may include a plurality of video frame samples belonging to a plurality of scenes. Each video frame sample can be labeled: video frames belonging to a scene boundary are labeled with a scene boundary frame label, and video frames not belonging to a scene boundary are labeled with a normal frame label.
Step S702, a video frame sample is extracted from the training data set.
The training data set can be obtained in advance, and when the model is trained, video frame samples are extracted from the training data set to serve as training sample data.
Step S703, inputting the extracted video frame sample into a scene boundary detection model to be trained, and obtaining a scene boundary detection result corresponding to the video frame sample.
And inputting the extracted video frame sample into a scene boundary detection model to be trained to obtain a scene boundary detection result corresponding to the video frame sample, and determining whether the video frame sample is a scene boundary frame according to the scene boundary detection result corresponding to the video frame sample.
Step S704, determining a loss value according to the scene boundary detection result corresponding to the video frame sample and the label of the video frame sample.
And comparing the scene boundary detection result corresponding to the video frame sample output by the scene boundary detection model with the label of the video frame sample to determine the loss value. For example, the scene boundary detection result output by the scene boundary detection model shows that the video frame sample is a scene boundary frame, and the label of the video frame sample is a normal frame, the scene boundary detection result output by the scene boundary detection model does not conform to the label of the video frame sample, so that the corresponding loss value can be determined. For another example, the scene boundary detection result output by the scene boundary detection model shows that the video frame sample is the scene boundary frame, and the label of the video frame sample is the scene boundary frame, then the scene boundary detection result output by the scene boundary detection model is consistent with the label of the video frame sample, so that another corresponding loss value can be determined.
When the loss value is calculated, a preset loss function can be used. The loss function may be a cross entropy loss function, for example a sigmoid cross entropy loss. The loss function used may also be, but is not limited to, a multi-class cross entropy loss function, a contrastive loss function (Contrastive Loss) or a triplet loss function (Triplet Loss) related to metric learning, and the like. In general, the loss value is a measure of how close the actual output is to the desired output: the smaller the loss value, the closer the actual output is to the desired output.
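A single training step under this binary "boundary frame vs. normal frame" labelling might look as follows, using sigmoid cross entropy; the model and data names are placeholders, and the model is assumed to output one boundary logit per frame:

```python
import torch
import torch.nn as nn

def training_step(model, optimizer, clip, boundary_labels):
    """One optimization step for the boundary detector: compare per-frame
    boundary logits with the per-frame labels (1 = scene boundary frame,
    0 = normal frame) via sigmoid cross entropy, then update the weights."""
    criterion = nn.BCEWithLogitsLoss()            # sigmoid + cross entropy in one call
    logits = model(clip)                           # (B, T) boundary logits
    loss = criterion(logits, boundary_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()                             # fed into the convergence check below
```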
Step S705, determining whether the loss value converges to a preset target value; if not, executing step S706; if so, step S707 is executed.
Judging whether the loss value converges to a preset target value, if the loss value is smaller than or equal to the preset target value, or if the variation amplitude of the loss value obtained by continuous N times of training is smaller than or equal to the preset target value, considering that the loss value converges to the preset target value, and indicating that the loss value converges; otherwise, it indicates that the loss value has not converged.
And step S706, adjusting parameters of the scene boundary detection model to be trained according to the determined loss value.
And if the loss value is not converged, adjusting the model parameters, and after adjusting the model parameters, returning to execute the step S702 to continue the next round of training process.
And step S707, finishing the training to obtain the trained scene boundary detection model.
And if the loss value is converged, taking the currently obtained scene boundary detection model as a trained scene boundary detection model.
Referring to fig. 8, the following describes in further detail a video cropping method provided in the embodiment of the present application with a specific application scenario:
assuming that a video to be cropped comprises 7 video frames, namely a video frame A, a video frame B, a video frame C, a video frame D, a video frame E, a video frame F and a video frame G, the 7 video frames are input into a trained scene boundary detection model to perform scene boundary detection on the video to be cropped. Based on the scene boundary detection model, it can be obtained that the video frame C is a scene boundary frame, and the video frame G is a scene boundary frame.
After determining the scene boundary frame in 7 video frames, the 7 video frames may be subjected to scene division by dividing video frame a, video frame B, and video frame C into the same frame set, and dividing video frame D, video frame E, video frame F, and video frame G into the same frame set, so that the 7 video frames included in the video to be cropped may be divided into 2 frame sets, that is, frame set 1 and frame set 2, and each frame set corresponds to one scene of the video. The frame set 1 comprises 3 video frames including a video frame A, a video frame B and a video frame C, and the frame set 2 comprises 4 video frames including a video frame D, a video frame E, a video frame F and a video frame G.
After the frame set 1 and the frame set 2 are obtained, the frame set 1 and the frame set 2 can be respectively used as frame sets to be clipped, the video frame a in the frame set 1 is used as a key video frame corresponding to the frame set 1, and the video frame D in the frame set 2 is used as a key video frame corresponding to the frame set 2. Then, the target object a may be determined in the video frame a, and the target object b and the target object c may be determined in the video frame D.
Based on the target object a, the target object a can be respectively tracked in the video frame B and the video frame C in sequence, and because the target object a can be tracked in the video frame B and the video frame C, the video frame B and the video frame C can be used as tracking video frames corresponding to the video frame a, and the target object a can be respectively determined in the video frame B and the video frame C. Based on the target object b and the target object c, the target object b and the target object c can be respectively tracked in the video frame E, the video frame F and the video frame G in sequence, and because the target object b and the target object c can be tracked in the video frame E and the target object b and the target object c cannot be tracked in the video frame F, the video frame E can be used as a tracking video frame corresponding to the video frame D, and the target object b and the target object c can be determined in the video frame E.
Because the target object b and the target object c can be tracked in the video frame E but cannot be tracked in the video frame F, the video frame F can be used as a new key video frame in the frame set 2, and the target object d and the target object e are determined in the video frame F. The target object d and the target object e are then tracked in the video frame G; since they can be tracked in the video frame G, the video frame G can be used as a tracking video frame corresponding to the video frame F, and the target object d and the target object e are determined in the video frame G.
According to the above operations, the target objects corresponding to the video frame A, the video frame B, the video frame C, the video frame D, the video frame E, the video frame F and the video frame G can be respectively determined, and the tracking tracks corresponding to the target objects can be obtained. After the tracking tracks are smoothed, the video frame A, the video frame B and the video frame C may be cropped according to the target object a they contain, the video frame D and the video frame E according to the target object b and the target object c, and the video frame F and the video frame G according to the target object d and the target object e, so as to obtain the cropped video frame A, video frame B, video frame C, video frame D, video frame E, video frame F and video frame G.
And obtaining the clipped target video according to the clipped video frame A, the clipped video frame B, the clipped video frame C, the clipped video frame D, the clipped video frame E, the clipped video frame F and the clipped video frame G.
Based on the same inventive concept as the video cropping method shown in fig. 2, the embodiment of the present application further provides a video cropping device, which may be disposed in a server or a terminal device. Because the device is a device corresponding to the video cropping method of the application and the principle of solving the problem of the device is similar to that of the method, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Fig. 9 is a schematic structural diagram of a video cropping device according to an embodiment of the present application, and as shown in fig. 9, the video cropping device includes a frame set acquisition unit 901, a target object determination unit 902, a target object tracking unit 903, a video frame cropping unit 904, and a target video generation unit 905.
A frame set obtaining unit 901, configured to obtain a frame set to be clipped, where the frame set to be clipped is obtained after performing scene division on each video frame included in a video to be clipped;
a target object determining unit 902, configured to determine a key video frame included in the frame set to be cropped, and determine at least one target object based on the key video frame;
a target object tracking unit 903, configured to determine, based on at least one target object, tracking video frames corresponding to the key video frames in the frame set to be cropped;
a video frame clipping unit 904, configured to clip the key video frames and the tracking video frames, respectively, based on at least one target object, to obtain a target frame set composed of clipped target video frames; the target video frame comprises at least one target object;
and a target video generating unit 905 configured to generate the clipped target video based on the target frame set.
In an alternative embodiment, the frame set obtaining unit 901 is specifically configured to:
inputting each video frame contained in the video to be cut into a trained scene boundary detection model to obtain at least one scene boundary frame, wherein each scene boundary frame is a video frame of which the similarity with the adjacent next video frame is smaller than a set threshold value;
for at least one scene boundary frame, respectively performing the following operations: attributing a scene boundary frame in at least one scene boundary frame and all video frames between the adjacent previous scene boundary frame to the same frame set;
and determining a frame set to be clipped from the obtained at least one frame set.
In an alternative embodiment, the target object comprises a target body and an adjacent body; the target object determining unit 902 is specifically configured to:
performing target detection on the key video frame to obtain a plurality of candidate objects, and determining the distance between the plurality of candidate objects and the center position of the picture in the adjacent video frame; the adjacent video frame is a video frame adjacent to the key video frame;
and sequencing the candidate objects according to the sequence of the distances from near to far, taking the candidate object with the closest distance as a target main body corresponding to the key video frame, and selecting the first N candidate objects from the rest candidate objects as adjacent main bodies corresponding to the key video frame.
In an alternative embodiment, the target object determining unit 902 is further configured to:
determining a subject probability value of each candidate object through the trained subject selection model; the main body probability value is used for representing the distance between the corresponding candidate object and the picture center position in the adjacent video frame; the main body selection model is obtained by training a sample image sequence marked with a candidate object; the labeled candidate object is provided with a main body label labeled according to the distance between the labeled candidate object and the center position of the picture in the sample image; the body label is used for representing the corresponding marked candidate object as a target body or an adjacent body.
In an alternative embodiment, the target object tracking unit 903 is specifically configured to:
taking a first video frame in a frame set to be cut out as a key video frame;
detecting the target main body in each video frame after the key video frame one by one until a cut-off video frame not containing the target main body is detected, and taking each video frame between the key video frame and the cut-off video frame as each tracking video frame corresponding to the key video frame;
after determining each tracking video frame corresponding to the key video frame, the target object tracking unit 903 is further configured to:
and taking the cut video frame as the next key video frame in the frame set to be cut, and returning to execute the steps of determining at least one target object based on the key video frame and the subsequent steps.
In an alternative embodiment, the target object tracking unit 903 is further configured to:
for each video frame in the video frames, respectively executing the following operations:
performing target detection on one video frame in each video frame to obtain at least one candidate object;
respectively matching the target subject with at least one candidate object;
and if the target subject is included in the at least one candidate object, determining that one video frame comprises the target subject.
In an alternative embodiment, the target object tracking unit 903 is further configured to:
respectively determining the distance between each candidate object except the target subject and the target subject in one video frame;
and taking the candidate object with the distance to the target object smaller than the set distance threshold value as the adjacent object in one video frame.
In an alternative embodiment, the video frame cropping unit 904 is specifically configured to:
for each video frame in the frame set to be cropped, the following operations are respectively performed:
determining an image area containing each target object in the video frame based on the position of each target object in the video frame, and taking the central position of the image area as a lens tracking position of the video frame;
determining a lens tracking track of a frame set to be cut according to the key video frame and the lens tracking position in each tracking video frame;
and respectively clipping the key video frames and all the tracking video frames according to the lens tracking track.
In an alternative embodiment, the video frame cropping unit 904 is further configured to:
carrying out sliding window filtering on the lens tracking track to obtain a corresponding smooth tracking track;
and respectively clipping the key video frame and each tracking video frame according to the smooth tracking track.
In an alternative embodiment, the video frame cropping unit 904 is further configured to:
and respectively clipping the key video frames and each tracking video frame based on the lens tracking positions in the key video frames and each tracking video frame according to a preset clipping proportion.
In an alternative embodiment, the target object determining unit 902 is further configured to:
performing target detection on the key video frame to obtain at least one candidate object, performing feature extraction on the key video frame, and determining a feature image corresponding to the key video frame;
aligning at least one candidate object with the characteristic image to obtain a sub-characteristic image corresponding to each candidate object;
and selecting at least one target object corresponding to the key video frame from the at least one candidate object according to the sub-feature image corresponding to the at least one candidate object.
In an alternative embodiment, the target object tracking unit 903 is further configured to:
based on at least one target object, in the frame set to be cropped, for each video frame located after the key video frame, the following operations are respectively performed:
tracking at least one target object in one video frame in each video frame, and if the at least one target object is tracked in one video frame, taking the one video frame as a tracking video frame corresponding to the key video frame;
and determining each tracking video frame corresponding to the key video frame.
In an alternative embodiment, the target object tracking unit 903 is further configured to:
performing target detection on one video frame in each video frame to obtain at least one candidate object;
respectively matching at least one target object with at least one candidate object;
and if the at least one candidate object comprises at least one target object according to the matching result, determining that the at least one target object is tracked in one video frame.
In an alternative embodiment, the target object tracking unit 903 is further configured to:
if at least one target object cannot be tracked in one video frame and at least one target object is tracked in the previous video frame of one video frame, one video frame is taken as a key video frame.
The embodiment of the method and the embodiment of the device are based on the same inventive concept, and the embodiment of the application also provides electronic equipment. The electronic device may be a server, such as server 100 shown in FIG. 1. In this embodiment, the electronic device may be configured as shown in fig. 10, and include a memory 1001, a communication module 1003, and one or more processors 1002.
A memory 1001 for storing computer programs executed by the processor 1002. The memory 1001 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, a program required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
Memory 1001 may be a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 1001 may also be a non-volatile memory (non-volatile memory) such as, but not limited to, a read-only memory (rom), a flash memory (flash memory), a Hard Disk Drive (HDD) or a solid-state drive (SSD), or any other medium which can be used to carry or store desired program code in the form of instructions or data structures and which can be accessed by a computer. The memory 1001 may be a combination of the above memories.
The processor 1002 may include one or more Central Processing Units (CPUs), a digital processing unit, and the like. The processor 1002 is configured to implement the video cropping method when a computer program stored in the memory 1001 is called.
The communication module 1003 is used for communicating with the terminal device and other electronic devices. If the electronic device is a server, the server may receive the video to be cropped sent by the terminal device through the communication module 1003.
In the embodiment of the present application, the specific connection medium among the memory 1001, the communication module 1003, and the processor 1002 is not limited. In fig. 10, the memory 1001 and the processor 1002 are connected by a bus 1004, the bus 1004 is represented by a thick line in fig. 10, and the connection manner between other components is merely illustrative and not limited. The bus 1004 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 10, but this is not intended to represent only one bus or type of bus.
In another embodiment, the electronic device may be any electronic device such as a mobile phone, a tablet computer, a Point of sale (POS), a vehicle-mounted computer, a smart wearable device, a PC, and the like, and may also be the terminal device 300 shown in fig. 1 by way of example.
Fig. 11 shows a block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 11, the electronic apparatus includes: radio Frequency (RF) circuitry 1110, memory 1120, input unit 1130, display unit 1140, sensors 1150, audio circuitry 1160, wireless fidelity (WiFi) module 1170, processor 1180, and the like. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 11 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the electronic device in detail with reference to fig. 11:
RF circuit 1110 may be used for receiving and transmitting signals during a message transmission or call, and in particular, for receiving downlink messages from a base station and then processing the received downlink messages to processor 1180; in addition, the data for designing uplink is transmitted to the base station.
The memory 1120 can be used for storing software programs and modules, such as program instructions/modules corresponding to the video cropping method and apparatus in the embodiment of the present application, and the processor 1180 executes various functional applications and data processing of the electronic device, such as the video cropping method provided in the embodiment of the present application, by executing the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program of at least one application, and the like; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 1120 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 1130 may be used to receive numeric or character information input by a user and generate key signal inputs related to user settings and function control of the terminal.
Optionally, the input unit 1130 may include a touch panel 1131 and other input devices 1132.
The touch panel 1131, also referred to as a touch screen, may collect touch operations of a user on or near the touch panel 1131 (for example, operations of the user on or near the touch panel 1131 using any suitable object or accessory such as a finger or a stylus pen), and implement corresponding operations according to a preset program, for example, operations of the user clicking a shortcut identifier of a function module, and the like. Alternatively, the touch panel 1131 may include two parts, namely, a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 1180, and can receive and execute commands sent by the processor 1180. In addition, the touch panel 1131 can be implemented by using various types, such as resistive, capacitive, infrared, and surface acoustic wave.
Optionally, other input devices 1132 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 1140 may be used to display information input by a user or interface information presented to the user, and various menus of the electronic device. The display unit 1140 is a display system of the terminal device, and is used for presenting an interface, such as a display desktop, an operation interface of an application, or an operation interface of a live application.
The display unit 1140 may include a display panel 1141. Alternatively, the Display panel 1141 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
Further, the touch panel 1131 can cover the display panel 1141, and when the touch panel 1131 detects a touch operation on or near the touch panel, the touch panel is transmitted to the processor 1180 to determine the type of the touch event, and then the processor 1180 provides a corresponding interface output on the display panel 1141 according to the type of the touch event.
Although in fig. 11, the touch panel 1131 and the display panel 1141 are two independent components to implement the input and output functions of the electronic device, in some embodiments, the touch panel 1131 and the display panel 1141 may be integrated to implement the input and output functions of the terminal.
The electronic device may also include at least one sensor 1150, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 1141 according to the brightness of ambient light, and the proximity sensor may turn off the backlight of the display panel 1141 when the electronic device moves to the ear. As one of the motion sensors, the accelerometer sensor may detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when stationary, and may be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration), vibration recognition related functions (such as pedometer, tapping), and the like, for recognizing the attitude of the electronic device, and the electronic device may further be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and the like, which are not described herein again.
Audio circuitry 1160, speakers 1161, and microphone 1162 may provide an audio interface between a user and the electronic device. The audio circuit 1160 may transmit the electrical signal converted from the received audio data to the speaker 1161, and convert the electrical signal into a sound signal for output by the speaker 1161; on the other hand, the microphone 1162 converts the collected sound signals into electrical signals, which are received by the audio circuit 1160 and converted into audio data, which are then processed by the audio data output processor 1180, and then transmitted to, for example, another electronic device via the RF circuit 1110, or output to the memory 1120 for further processing.
WiFi belongs to short-range wireless transmission technology, and the electronic device can help the user send and receive e-mail, browse web pages, access streaming media, etc. through the WiFi module 1170, and it provides wireless broadband internet access for the user. Although fig. 11 shows the WiFi module 1170, it is understood that it does not belong to the essential constitution of the electronic device, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 1180 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 1120 and calling data stored in the memory 1120, thereby performing overall monitoring of the electronic device. Optionally, processor 1180 may include one or more processing units; optionally, the processor 1180 may integrate an application processor and a modem processor, wherein the application processor mainly processes software programs such as an operating system, applications, and functional modules inside the applications, for example, a video cropping method provided in the embodiment of the present application. The modem processor handles primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated within processor 1180.
It will be appreciated that the configuration shown in fig. 11 is merely illustrative and that the electronic device may include more or fewer components than shown in fig. 11 or have a different configuration than shown in fig. 11. The components shown in fig. 11 may be implemented in hardware, software, or a combination thereof.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the video cropping method in the above-described embodiment. The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The above description covers only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application.

Claims (10)

1. A method of video cropping, the method comprising:
obtaining a frame set to be cropped, wherein the frame set to be cropped is obtained after each video frame contained in a video to be cropped is subjected to scene division;
determining key video frames contained in the frame set to be cropped, and determining at least one target object based on the key video frames;
determining, based on the at least one target object, respective tracking video frames corresponding to the key video frames in the frame set to be cropped;
cropping the key video frames and the tracking video frames respectively based on the at least one target object to obtain a target frame set consisting of the cropped target video frames, wherein each target video frame comprises the at least one target object;
and generating the cropped target video based on the target frame set.
2. The method of claim 1, wherein the obtaining the frame set to be cropped comprises:
inputting each video frame contained in the video to be cropped into a trained scene boundary detection model to obtain at least one scene boundary frame, wherein each scene boundary frame is a video frame whose similarity with the adjacent next video frame is smaller than a set threshold;
for the at least one scene boundary frame, respectively performing the following operation: attributing each scene boundary frame in the at least one scene boundary frame, together with all video frames between it and the adjacent previous scene boundary frame, to the same frame set;
and determining the frame set to be cropped from the obtained at least one frame set.
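For illustration only, the following minimal sketch mirrors the frame-set bookkeeping of claim 2 above. It stands in for the trained scene boundary detection model with a simple grayscale-histogram similarity and an assumed threshold of 0.8; all function and variable names are hypothetical, not taken from the application.

```python
import numpy as np

def frame_similarity(a, b, bins=32):
    """Cosine similarity of the grayscale histograms of two frames (H x W uint8 arrays)."""
    ha, _ = np.histogram(a, bins=bins, range=(0, 255), density=True)
    hb, _ = np.histogram(b, bins=bins, range=(0, 255), density=True)
    denom = np.linalg.norm(ha) * np.linalg.norm(hb) + 1e-8
    return float(np.dot(ha, hb) / denom)

def split_into_frame_sets(frames, threshold=0.8):
    """Group consecutive frames into frame sets; a frame whose similarity with the
    adjacent next frame is below the threshold is treated as a scene boundary frame."""
    frame_sets, current = [], []
    for i, frame in enumerate(frames):
        current.append(frame)
        if i + 1 < len(frames) and frame_similarity(frame, frames[i + 1]) < threshold:
            frame_sets.append(current)   # close the frame set at the boundary frame
            current = []
    if current:
        frame_sets.append(current)
    return frame_sets

# 20 synthetic frames with an abrupt content change at frame 10 -> two frame sets.
frames = [np.full((90, 160), 40, dtype=np.uint8) for _ in range(10)]
frames += [np.full((90, 160), 200, dtype=np.uint8) for _ in range(10)]
print([len(s) for s in split_into_frame_sets(frames)])   # [10, 10]
```

In a full implementation, the trained model's boundary decision would simply replace the histogram test.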
3. The method of claim 1, wherein the target object comprises a target subject and an adjacent subject; said determining at least one target object based on said key video frames comprises:
performing target detection on the key video frame to obtain a plurality of candidate objects, and determining the distance between each candidate object and the picture center position in an adjacent video frame, wherein the adjacent video frame is a video frame adjacent to the key video frame;
and sorting the candidate objects in order of distance from near to far, taking the candidate object with the closest distance as the target subject corresponding to the key video frame, and selecting the first N candidate objects from the remaining candidate objects as the adjacent subjects corresponding to the key video frame.
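As a rough illustration of the subject selection in claim 3, the sketch below ranks detections (assumed to be (x, y, w, h) boxes from any object detector) by distance to the picture center, keeping the closest box as the target subject and the next N as adjacent subjects. For simplicity it measures the distance in the key frame itself rather than in the adjacent video frame; the names are hypothetical.

```python
import numpy as np

def select_subjects(candidate_boxes, frame_size, n_neighbors=2):
    """Return (target_subject, adjacent_subjects) from (x, y, w, h) candidate boxes."""
    width, height = frame_size
    center = np.array([width / 2.0, height / 2.0])

    def distance_to_center(box):
        x, y, w, h = box
        return float(np.linalg.norm(np.array([x + w / 2.0, y + h / 2.0]) - center))

    ranked = sorted(candidate_boxes, key=distance_to_center)   # near -> far
    return ranked[0], ranked[1:1 + n_neighbors]

boxes = [(10, 10, 50, 60), (300, 150, 80, 120), (500, 40, 40, 40)]
print(select_subjects(boxes, frame_size=(640, 360)))
# the box nearest the (320, 180) picture center becomes the target subject
```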
4. The method of claim 3, wherein the determining the distance between each candidate object and the picture center position in the adjacent video frame comprises:
determining a subject probability value of each candidate object through the trained subject selection model, wherein the subject probability value is used for representing the distance between the corresponding candidate object and the picture center position in the adjacent video frame; the subject selection model is obtained by training on a sample image sequence labeled with candidate objects; each labeled candidate object carries a subject label assigned according to its distance from the picture center position in the sample image; and the subject label is used for representing whether the corresponding labeled candidate object is a target subject or an adjacent subject.
5. The method according to claim 3, wherein the determining, based on the at least one target object, respective tracking video frames corresponding to the key video frames in the frame set to be cropped comprises:
taking a first video frame in the frame set to be cropped as a key video frame;
detecting the target subject in each video frame after the key video frame one by one until a cut-off video frame not containing the target subject is detected, and taking each video frame between the key video frame and the cut-off video frame as each tracking video frame corresponding to the key video frame;
after determining the respective tracking video frames to which the key video frames correspond, the method further comprises:
and taking the cut-off video frame as the next key video frame in the frame set to be cropped, and returning to perform the step of determining at least one target object based on the key video frame and the subsequent steps.
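The key-frame/cut-off-frame loop of claim 5 can be sketched as follows. Here contains_subject stands in for the per-frame detection-and-matching test of claim 6, and the frame representation is a placeholder; this is an illustration, not the claimed implementation.

```python
def find_tracking_frames(frames, key_index, contains_subject):
    """Scan the frames after the key frame one by one; the first frame for which
    contains_subject() is False is the cut-off frame.  Returns the indices of the
    tracking frames and the cut-off index, which becomes the next key frame
    (or len(frames) if the subject persists to the end of the frame set)."""
    i = key_index + 1
    while i < len(frames) and contains_subject(frames[i]):
        i += 1
    return list(range(key_index + 1, i)), i

# Toy example: each "frame" is just a flag saying whether the target subject is visible.
frames = [True, True, True, False, True, True]
tracking, cutoff = find_tracking_frames(frames, key_index=0, contains_subject=lambda f: f)
print(tracking, cutoff)   # [1, 2] 3  -> frame 3 is the next key video frame
```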
6. The method according to claim 5, wherein said detecting the target subject in each video frame after the key video frame one by one comprises:
for each of the video frames, respectively performing the following operations:
performing target detection on one of the video frames to obtain at least one candidate object;
matching the target subject with the at least one candidate object respectively;
determining that the one video frame contains the target subject if the at least one candidate object includes the target subject.
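Claim 6 does not fix the matching criterion between the tracked target subject and the freshly detected candidate objects. One common choice, assumed here purely for illustration, is intersection-over-union (IoU) against the subject's last known bounding box:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def frame_contains_subject(subject_box, candidate_boxes, iou_threshold=0.3):
    """The frame is taken to contain the target subject if any candidate box
    overlaps the subject's last known box strongly enough."""
    return any(iou(subject_box, box) >= iou_threshold for box in candidate_boxes)

print(frame_contains_subject((100, 100, 50, 50), [(110, 105, 50, 50), (400, 10, 30, 30)]))
# True: the first candidate matches the tracked subject
```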
7. The method of claim 6, wherein after determining that the one video frame contains the target subject, the method further comprises:
determining, in the one video frame, the distance between the target subject and each candidate object other than the target subject;
and taking each candidate object whose distance to the target subject is smaller than a set distance threshold as an adjacent subject in the one video frame.
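A minimal sketch of the adjacent-subject rule in claim 7, assuming (x, y, w, h) boxes and a hypothetical distance threshold of 200 pixels measured between box centers:

```python
import math

def adjacent_subjects(subject_box, candidate_boxes, distance_threshold=200.0):
    """Keep the candidates whose center lies within the threshold of the target subject."""
    def center(box):
        x, y, w, h = box
        return x + w / 2.0, y + h / 2.0

    sx, sy = center(subject_box)
    return [box for box in candidate_boxes
            if box != subject_box
            and math.hypot(center(box)[0] - sx, center(box)[1] - sy) < distance_threshold]

print(adjacent_subjects((300, 150, 80, 120),
                        [(300, 150, 80, 120), (360, 180, 60, 100), (900, 40, 40, 40)]))
# only the nearby second box is kept as an adjacent subject
```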
8. The method according to any one of claims 1 to 7, wherein the cropping the key video frame and the respective tracking video frames based on the at least one target object comprises:
for each video frame in the frame set to be cropped, respectively performing the following operations:
determining, based on the position of each target object in the video frame, an image area containing each target object in the video frame, and taking the center position of the image area as the shot tracking position of the video frame;
determining a shot tracking trajectory of the frame set to be cropped according to the shot tracking positions in the key video frame and in each tracking video frame;
and cropping the key video frame and each tracking video frame respectively according to the shot tracking trajectory.
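To illustrate claim 8, the sketch below takes the center of the smallest image area enclosing all target objects as the frame's shot tracking position and crops a fixed-size window around it. The 405x720 (9:16) window is an assumed example; the claim does not fix a crop size or aspect ratio.

```python
import numpy as np

def shot_tracking_position(object_boxes):
    """Center of the smallest image area enclosing every target object box (x, y, w, h)."""
    x1 = min(b[0] for b in object_boxes)
    y1 = min(b[1] for b in object_boxes)
    x2 = max(b[0] + b[2] for b in object_boxes)
    y2 = max(b[1] + b[3] for b in object_boxes)
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def crop_around(frame, center, crop_w, crop_h):
    """Crop a crop_w x crop_h window centered on the tracking position,
    clamped so the window stays inside the frame."""
    h, w = frame.shape[:2]
    cx = int(round(min(max(center[0], crop_w / 2), w - crop_w / 2)))
    cy = int(round(min(max(center[1], crop_h / 2), h - crop_h / 2)))
    x1, y1 = cx - crop_w // 2, cy - crop_h // 2
    return frame[y1:y1 + crop_h, x1:x1 + crop_w]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)           # a placeholder video frame
center = shot_tracking_position([(600, 200, 80, 160), (700, 250, 60, 120)])
print(center, crop_around(frame, center, 405, 720).shape)  # (680.0, 285.0) (720, 405, 3)
```

Collecting one such position per frame (the key video frame plus its tracking video frames) yields the shot tracking trajectory that claim 9 then smooths.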
9. The method according to claim 8, wherein said cropping the key video frames and the respective tracking video frames according to the shot tracking trajectory comprises:
performing sliding window filtering on the shot tracking trajectory to obtain a corresponding smoothed tracking trajectory;
and cropping the key video frame and each tracking video frame respectively according to the smoothed tracking trajectory.
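Claim 9 does not specify the sliding-window filter; a moving average over the per-frame shot tracking positions is one straightforward reading, sketched here with an assumed odd window length:

```python
import numpy as np

def smooth_trajectory(positions, window=9):
    """Sliding-window (moving-average) filtering of a shot tracking trajectory.
    positions: array-like of shape (num_frames, 2) holding (x, y) tracking positions."""
    positions = np.asarray(positions, dtype=float)
    half = window // 2
    # Pad with edge values so the smoothed trajectory keeps the original length.
    padded = np.pad(positions, ((half, half), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    return np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid") for d in range(2)], axis=1)

# A jittery horizontal trajectory becomes steadier after filtering.
raw = [(x + (5 if i % 2 else -5), 180.0) for i, x in enumerate(range(100, 160, 2))]
print(smooth_trajectory(raw, window=5)[:3])
```

Cropping each frame around the smoothed position rather than the raw one avoids visible jitter in the generated target video.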
10. A video cropping device, comprising:
a frame set acquisition unit, configured to acquire a frame set to be cropped, wherein the frame set to be cropped is obtained after each video frame contained in a video to be cropped is subjected to scene division;
a target object determining unit, configured to determine key video frames contained in the frame set to be cropped, and to determine at least one target object based on the key video frames;
a target object tracking unit, configured to determine, based on the at least one target object, each tracking video frame corresponding to the key video frame in the frame set to be cropped;
a video frame cropping unit, configured to crop the key video frame and each tracking video frame respectively based on the at least one target object, to obtain a target frame set composed of cropped target video frames, wherein each target video frame comprises the at least one target object;
and a target video generating unit, configured to generate the cropped target video based on the target frame set.
CN202110497779.2A 2021-05-08 2021-05-08 Video clipping method and device Active CN112995757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110497779.2A CN112995757B (en) 2021-05-08 2021-05-08 Video clipping method and device

Publications (2)

Publication Number Publication Date
CN112995757A true CN112995757A (en) 2021-06-18
CN112995757B CN112995757B (en) 2021-08-10

Family

ID=76337238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110497779.2A Active CN112995757B (en) 2021-05-08 2021-05-08 Video clipping method and device

Country Status (1)

Country Link
CN (1) CN112995757B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110933488A (en) * 2018-09-19 2020-03-27 传线网络科技(上海)有限公司 Video editing method and device
CN110189378A (en) * 2019-05-23 2019-08-30 北京奇艺世纪科技有限公司 A kind of method for processing video frequency, device and electronic equipment
CN111586473A (en) * 2020-05-20 2020-08-25 北京字节跳动网络技术有限公司 Video clipping method, device, equipment and storage medium
CN112561839A (en) * 2020-12-02 2021-03-26 北京有竹居网络技术有限公司 Video clipping method and device, storage medium and electronic equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113810765A (en) * 2021-09-17 2021-12-17 北京百度网讯科技有限公司 Video processing method, apparatus, device and medium
CN113810765B (en) * 2021-09-17 2023-08-29 北京百度网讯科技有限公司 Video processing method, device, equipment and medium
CN115022711A (en) * 2022-04-28 2022-09-06 之江实验室 System and method for ordering lens videos in movie scene
CN115022711B (en) * 2022-04-28 2024-05-31 之江实验室 System and method for ordering shot videos in movie scene

Also Published As

Publication number Publication date
CN112995757B (en) 2021-08-10

Similar Documents

Publication Publication Date Title
EP3940638A1 (en) Image region positioning method, model training method, and related apparatus
CN109086709B (en) Feature extraction model training method and device and storage medium
CN109189879B (en) Electronic book display method and device
CN112101329B (en) Video-based text recognition method, model training method and model training device
CN110598046A (en) Artificial intelligence-based identification method and related device for title party
CN110704661B (en) Image classification method and device
CN110209810B (en) Similar text recognition method and device
CN111209423B (en) Image management method and device based on electronic album and storage medium
CN112203115B (en) Video identification method and related device
CN111428522B (en) Translation corpus generation method, device, computer equipment and storage medium
CN112995757B (en) Video clipping method and device
CN111491123A (en) Video background processing method and device and electronic equipment
CN114357278B (en) Topic recommendation method, device and equipment
CN113723378B (en) Model training method and device, computer equipment and storage medium
CN113596601A (en) Video picture positioning method, related device, equipment and storage medium
CN111314771B (en) Video playing method and related equipment
CN113822427A (en) Model training method, image matching device and storage medium
CN113269279B (en) Multimedia content classification method and related device
CN112270238A (en) Video content identification method and related device
CN116758362A (en) Image processing method, device, computer equipment and storage medium
CN114722234B (en) Music recommendation method, device and storage medium based on artificial intelligence
CN113535055B (en) Method, equipment and storage medium for playing point-to-read based on virtual reality
CN112632222B (en) Terminal equipment and method for determining data belonging field
CN116453005A (en) Video cover extraction method and related device
CN111428523B (en) Translation corpus generation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code: Ref country code: HK; Ref legal event code: DE; Ref document number: 40047263; Country of ref document: HK