CN111263234B - Video clipping method, related device, equipment and storage medium


Info

Publication number
CN111263234B
CN111263234B (granted from application CN202010062446.2A)
Authority
CN
China
Prior art keywords
video
segment
frame
audio
target
Prior art date
Legal status
Active
Application number
CN202010062446.2A
Other languages
Chinese (zh)
Other versions
CN111263234A (en)
Inventor
梁涛
张晗
马连洋
衡阵
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010062446.2A
Publication of CN111263234A
Application granted
Publication of CN111263234B
Legal status: Active


Classifications

    • H: Electricity; H04: Electric communication technique; H04N: Pictorial communication, e.g. television; H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/47205: End-user interface for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • H04N 21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/8456: Structuring of content, e.g. decomposing content into time segments, by decomposing the content in the time domain

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application discloses a video clipping method, comprising: acquiring the video content corresponding to a video to be clipped and the audio content corresponding to the video to be clipped; acquiring at least two video segments according to the video content; acquiring at least one audio segment from the audio content according to the change state of the audio frequency of the audio content in unit time; and generating a target clip segment set corresponding to the video to be clipped according to the at least two video segments and the at least one audio segment. The application also discloses a related apparatus, a device and a storage medium. Because the video content and the audio content jointly serve as the reference basis for the clip, their information is complementary, the clipped segments do not suffer information loss, and the accuracy of video clipping is improved.

Description

Video clipping method, related device, equipment and storage medium
Technical Field
The present application relates to the field of computer processing, and in particular to a video clipping method, a related apparatus, a device, and a storage medium.
Background
With the growth of user demand and media technology, the number of videos has increased explosively, and clipping videos has become a video processing task of wide concern. Video clipping is a video processing technique that cuts an object to be clipped into clipped video segments, and it is often applied in scenarios such as short-video production and video compilations.
At present, most video clipping methods clip by equal time division, that is, a video is automatically cut into equal-length periods; for example, a 60-second video cut every 10 seconds yields 6 clipped video segments.
However, the video segments obtained by such uniform cutting may contain incomplete video scenes, so the clipped segments suffer information loss and the accuracy of the clipping is poor.
Disclosure of Invention
The embodiment of the application provides a video clipping method, a related apparatus, a device, and a storage medium, in which video content and audio content jointly serve as the reference basis for video clipping, so that their information is complementary, the clipped segments do not suffer information loss, and the accuracy of video clipping is improved.
In view of the above, the present application provides in a first aspect a method of video clipping, comprising: acquiring video content corresponding to a video to be edited and audio content corresponding to the video to be edited;
acquiring at least two video segments according to video content, wherein the video content comprises at least one object frame, the at least one object frame comprises an object frame for segment segmentation, the object frame for segment segmentation and a next adjacent object frame have target similarity, the target similarity is less than or equal to a similarity threshold, and the object frame for segment segmentation is used for determining the video segments;
acquiring at least one audio clip from the audio content according to the change state of the audio frequency of the audio content in unit time;
and generating a target clipping segment set corresponding to the video to be clipped according to the at least two video segments and the at least one audio segment, wherein the target clipping segment set comprises at least one target clipping segment.
A second aspect of the present application provides a video editing apparatus comprising:
the acquisition module is used for acquiring video content corresponding to a video to be edited and audio content corresponding to the video to be edited;
the acquisition module is further used for acquiring at least two video segments according to the video content, wherein the video segments comprise object frames for segment segmentation, target similarity exists between the object frames for segment segmentation in two adjacent video segments, and the target similarity is smaller than or equal to a similarity threshold;
the acquisition module is also used for acquiring at least one audio clip from the audio content according to the change state of the audio frequency of the audio content in unit time;
and the generating module is used for generating a target clipping segment set corresponding to the video to be clipped according to the at least two video segments and the at least one audio segment acquired by the acquiring module, wherein the target clipping segment set comprises at least one target clipping segment.
In one possible design, in a first implementation of the second aspect of an embodiment of the present application,
an acquisition module specifically configured to:
acquiring a video to be edited;
parsing the video to be edited by protocol decoding to obtain format-encapsulated data, wherein the protocol decoding is used for converting data corresponding to a first protocol into data corresponding to a second protocol, and the first protocol and the second protocol belong to different protocol types;
decapsulating the format encapsulated data to obtain audio code stream data and video code stream data;
decoding the audio code stream data to obtain audio content;
and decoding the video code stream data to obtain video content.
In one possible design, in a second implementation of the second aspect of the embodiments of the present application,
an acquisition module specifically configured to:
acquiring an object frame sequence according to video content, wherein the object frame sequence comprises N object frames, and N is an integer greater than or equal to 2;
generating at least one object frame subsequence according to the object frame sequence, wherein the object frame subsequence comprises M object frames for segment segmentation, and M is an integer which is greater than or equal to 1 and less than or equal to N;
at least two video segments are generated based on the at least one object frame sub-sequence.
In one possible design, in a third implementation of the second aspect of the embodiments of the present application,
an acquisition module specifically configured to:
acquiring a first image frame and a second image frame from video content, wherein the first image frame is a previous frame image adjacent to the second image frame;
generating a difference image according to the first image frame and the second image frame;
determining a target pixel value according to the differential image;
and if the target pixel value meets the object frame extraction condition, determining that the second image frame belongs to one object frame in the object frame sequence.
In one possible design, in a fourth implementation of the second aspect of the embodiment of the present application,
an acquisition module specifically configured to:
acquiring a first image frame, a second image frame and a third image frame from video content, wherein the first image frame is a previous frame image adjacent to the second image frame, and the second image frame is a previous frame image adjacent to the third image frame;
generating a first difference image according to the first image frame and the second image frame;
generating a second difference image according to the second image frame and the third image frame;
generating a target differential image according to the first differential image and the second differential image;
determining a target pixel value according to the target differential image;
and if the target pixel value meets the object frame extraction condition, determining that the third image frame belongs to one object frame in the object frame sequence.
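A minimal sketch of this three-frame difference, assuming OpenCV; combining the two difference images with a bitwise AND and the fixed threshold value are illustrative assumptions, not choices stated in the patent:

```python
import cv2

def is_object_frame_3diff(frame1, frame2, frame3, pixel_sum_threshold=500_000):
    """Decide via three-frame difference whether frame3 is an object frame.

    frame1, frame2, frame3 are consecutive BGR images; the threshold is an
    illustrative value, not one given by the patent.
    """
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)
    g3 = cv2.cvtColor(frame3, cv2.COLOR_BGR2GRAY)

    # First and second difference images: absolute pixel-wise differences.
    d1 = cv2.absdiff(g1, g2)
    d2 = cv2.absdiff(g2, g3)

    # Target difference image: here the two differences are combined with a
    # bitwise AND, keeping only motion present in both (one common choice).
    target = cv2.bitwise_and(d1, d2)

    # Target pixel value: total intensity of the target difference image.
    target_pixel_value = int(target.sum())

    # Object frame extraction condition: the summed difference is large enough.
    return target_pixel_value >= pixel_sum_threshold
```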
In one possible design, in a fifth implementation of the second aspect of the embodiments of the present application,
a generation module specifically configured to:
step one, acquiring a first object frame and a second object frame from the object frame sequence, wherein the first object frame is the object frame immediately preceding the second object frame;
step two, acquiring a first key point set corresponding to the first object frame and a second key point set corresponding to the second object frame, wherein the first key point set comprises at least one first key point, and the second key point set comprises at least one second key point;
step three, determining the similarity according to the first key point set and the second key point set;
step four, if the similarity is smaller than or equal to the similarity threshold, determining that the first object frame belongs to the object frames for segment segmentation in the object frame subsequence;
and if the similarity is larger than the similarity threshold, removing the first object frame from the object frame sequence;
the above steps one to four are performed for every two adjacent object frames in the object frame sequence until the object frame subsequence is extracted from the object frame sequence, as sketched below.
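A minimal sketch of steps one to four, assuming OpenCV's ORB detector for the key point sets and a match-ratio similarity; the patent does not prescribe a particular key-point or similarity algorithm, so both choices and the threshold are illustrative:

```python
import cv2

orb = cv2.ORB_create()
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def keyframe_similarity(frame_a, frame_b):
    """Similarity of two object frames from their matched key points,
    defined here as the fraction of descriptors that find a match."""
    _, desc_a = orb.detectAndCompute(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), None)
    _, desc_b = orb.detectAndCompute(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), None)
    if desc_a is None or desc_b is None:
        return 0.0
    matches = matcher.match(desc_a, desc_b)
    return len(matches) / max(len(desc_a), len(desc_b))

def extract_splitting_frames(object_frames, similarity_threshold=0.5):
    """Keep an object frame as a segment-splitting frame when its similarity
    to the next adjacent object frame is <= the threshold; remove it from
    consideration otherwise, as in steps one to four above."""
    splitters = []
    for current, nxt in zip(object_frames, object_frames[1:]):
        if keyframe_similarity(current, nxt) <= similarity_threshold:
            splitters.append(current)
    return splitters
```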
In one possible design, in a sixth implementation of the second aspect of the embodiments of the present application,
a generation module specifically configured to:
generating a first video segment from the video content based on the first object frame sub-sequence;
generating a second video segment from the video content based on the second object frame subsequence;
or, generating at least two video segments from at least one object frame sub-sequence, comprising:
acquiring a first object frame for segment division and a second object frame for segment division from M object frames for segment division included in the object frame subsequence;
intercepting a first video segment from the video content according to a first object frame for segment segmentation;
and intercepting a second video segment from the video content according to a second object frame for segment segmentation, wherein the second video segment and the first video segment belong to two different video segments.
In one possible design, in a seventh implementation of the second aspect of the embodiments of the present application,
an acquisition module specifically configured to:
acquiring the change state of the audio frequency in unit time of the audio content;
and intercepting at least one audio clip from the audio content according to the change state of the audio frequency in the unit time and the audio frequency threshold, wherein the audio frequency of each audio clip is greater than or equal to the audio frequency threshold.
In one possible design, in an eighth implementation of the second aspect of the embodiments of the present application,
a generation module specifically configured to:
acquiring a first video clip and a second video clip from at least two video clips, wherein the second video clip and the first video clip belong to two different video clips, and the second video clip is the next video clip adjacent to the first video clip;
acquiring the last image frame in the first video clip;
acquiring a first image frame in a second video segment;
and if the last image frame in the first video segment and the first image frame in the second video segment both correspond to the target audio segment, merging the first video segment and the second video segment to obtain a target clip segment in the target clip segment set, wherein the target audio segment belongs to any one of the at least one audio segment.
In a possible design, in a ninth implementation form of the second aspect of the embodiment of the present application, the video editing apparatus further includes a determining module,
a determining module, configured to determine that the first video segment is one target clip segment in the target clip segment set and the second video segment is another target clip segment in the target clip segment set if the last image frame in the first video segment corresponds to the target audio segment and the first image frame in the second video segment does not correspond to the target audio segment;
the determining module is further configured to determine that the first video segment is one target clip segment of the set of target clip segments and the second video segment is another target clip segment of the set of target clip segments if the last image frame of the first video segment does not correspond to the target audio segment and the first image frame of the second video segment corresponds to the target audio segment.
In one possible design, in a tenth implementation of the second aspect of the embodiment of the present application,
a generation module specifically configured to:
obtaining a target audio clip from at least one audio clip, wherein the target audio clip corresponds to a target time period;
acquiring a video to be detected according to a target time period corresponding to the target audio clip;
acquiring a first image frame in a video to be detected;
acquiring the last image frame in a video to be detected;
if the first image frame in the video to be detected and the last image frame in the video to be detected both belong to the target video clip, determining that the target video clip is one target clip in the target clip set, wherein the target video clip belongs to any one of at least two video clips.
In an eleventh implementation form of the second aspect of the embodiments of the present application, the video clipping device further comprises a merging module,
and the merging module is used for merging the first video segment and the second video segment to obtain a target clip segment in the target clip segment set if the first image frame in the video to be detected belongs to the first video segment and the last image frame in the video to be detected belongs to the second video segment, wherein the first video segment and the second video segment belong to two different video segments.
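A sketch of this audio-as-reference check in interval form, where every segment is a (start_sec, end_sec) tuple; this interval reading of the "video to be detected" and the helper names are assumptions for illustration only:

```python
def audio_reference_check(audio_segment, video_segments):
    """Use the target audio segment's time period as the video to be
    detected: if its first and last instants fall in the same video segment,
    that segment is a target clip segment; if they fall in two different
    video segments, those two segments are merged into one target clip."""
    a_start, a_end = audio_segment

    def containing(t):
        for seg in video_segments:
            if seg[0] <= t <= seg[1]:
                return seg
        return None

    first_seg = containing(a_start)
    last_seg = containing(a_end)
    if first_seg is not None and first_seg == last_seg:
        return first_seg                      # one target clip segment
    if first_seg is not None and last_seg is not None:
        return (first_seg[0], last_seg[1])    # merge the two video segments
    return None
```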
A third aspect of the present application provides a terminal device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein, the memory is used for storing programs;
the processor is used for executing the program in the memory and comprises the following steps:
acquiring video content corresponding to a video to be clipped and audio content corresponding to the video to be clipped;
acquiring at least two video segments according to the video content, wherein the video segments comprise object frames for segment segmentation, and target similarity exists between the object frames for segment segmentation in two adjacent video segments, and the target similarity is smaller than or equal to a similarity threshold;
acquiring at least one audio clip from the audio content according to the change state of the audio frequency of the audio content in unit time;
generating a target clipping segment set corresponding to the video to be clipped according to the at least two video segments and the at least one audio segment, wherein the target clipping segment set comprises at least one target clipping segment;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
A fourth aspect of the present application provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the method of the above-described aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
the embodiment of the application provides a video clipping method, which includes the steps of firstly obtaining video content corresponding to a video to be clipped and audio content corresponding to the video to be clipped, then obtaining at least two video segments according to the video content, then obtaining at least one audio segment from the audio content according to the change state of audio frequency of the audio content in unit time, and finally generating a target clipping segment set corresponding to the video to be clipped according to the at least two video segments and the at least one audio segment. By the mode, in the process of clipping the video, the integrity of the picture in the video content and the integrity of the audio segment in the audio content are considered, the video content and the audio content are jointly used as the reference basis of the video clip, information complementation is realized, the clipped segment cannot have information loss, and the accuracy of the video clip is improved.
Drawings
FIG. 1 is a block diagram of an architecture of a video editing system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an embodiment of a method for video editing in an embodiment of the present application;
FIG. 3 is a schematic diagram of an embodiment of a method for video editing in an embodiment of the present application;
FIG. 4 is a schematic diagram of an embodiment of generating a video clip in an embodiment of the present application;
FIG. 5 is a schematic diagram of another embodiment of generating a video clip in the embodiment of the present application;
FIG. 6 is a diagram illustrating the variation state of the audio frequency per unit time in the embodiment of the present application;
FIG. 7 is a schematic diagram of an embodiment of merging video segments in an embodiment of the present application;
FIG. 8 is a diagram of an embodiment of generating a target clip segment in an embodiment of the present application;
FIG. 9 is a schematic diagram of an embodiment of generating a target clip segment in an embodiment of the present application;
FIG. 10 is a schematic diagram of an embodiment of a video clipping device in the embodiment of the present application;
fig. 11 is a schematic structural diagram of a terminal device in an embodiment of the present application;
fig. 12 is a schematic structural diagram of a server in an embodiment of the present application.
Detailed Description
The embodiment of the application provides a video clipping method, a related apparatus, a device, and a storage medium, in which video content and audio content jointly serve as the reference basis for video clipping, so that their information is complementary, the clipped segments do not suffer information loss, and the accuracy of video clipping is improved.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that the present application can be applied to various scenarios involving video clips. For example, a news client may extract and play the highlight segments of a long video while playing the long video in order to attract the user's attention, so the long video needs to be clipped to extract the highlights. As another example, an application with a video-upload function may offer personalized editing: after recording a long video, the user can clip it into several segments and then further edit a favorite segment or upload it directly. As yet another example, a video-playing client may personalize video recommendation, recommending different segments of a long video to different users, so the long video again needs to be clipped. These examples are given only to ease understanding of the solution and do not exhaust all application scenarios of the present application.
In order to obtain a video clip with high accuracy in the above various scenarios, the present application provides a method for video clipping, which is applied to the video clipping system shown in fig. 1, please refer to fig. 1, where fig. 1 is a schematic structural diagram of the video clipping system in an embodiment of the present application, and as shown in the figure, the video clipping system includes a server and a terminal device. The execution subject of the video clipping method (i.e. the video clipping device) can be deployed in a server or a terminal device with strong computing power.
Specifically, the video clipping device may obtain video content and audio content corresponding to the video to be clipped after obtaining the video to be clipped, and then obtain at least two video segments according to the video content, where the video content includes at least one object frame, the at least one object frame includes an object frame for segment segmentation, the object frame for segment segmentation and a next adjacent object frame have a target similarity, the target similarity is less than or equal to a similarity threshold, and the object frame for segment segmentation is used for segmentation to obtain the video segments. And acquiring at least one audio segment from the audio content according to the change state of the audio frequency of the audio content in unit time, and further generating a target clip segment corresponding to the video to be clipped according to the at least two video segments and the at least one audio segment, namely, the video content and the audio content are jointly used as reference bases of the video clip, so that the accuracy of the video clip is improved.
More specifically, the video clipping device may be embodied as a client deployed on the terminal device, such as any of the clients in the application scenarios above, and the server may deliver the video clipping device to the terminal device over a wireless network. The video clipping device may also be embodied as a terminal device dedicated to video clipping: after generating the video clipping device, the server may configure it on the terminal device through a wired network, a removable storage medium, or the like. The video clipping device may also be deployed on a server: the terminal device sends the acquired video to be clipped to the server, and the server performs the video clipping operation and returns the result to the terminal device. Further, the wireless networks described above use standard communication techniques and/or protocols. The network is typically the Internet, but can be any network, including but not limited to Bluetooth, a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile or private network, or any combination of virtual private networks. In some embodiments, custom or dedicated data communication techniques may be used in place of, or in addition to, those described above.
As shown in fig. 1, the terminal devices include, but are not limited to, tablet computers, notebook computers, palmtop computers, mobile phones, voice interaction devices, and personal computers (PC), and are not limited here. The voice interaction devices include, but are not limited to, smart speakers and smart home appliances. In some implementations, the client may be a web page client or an application client deployed on the terminal device. The server in fig. 1 may be a single server, a server cluster composed of multiple servers, a cloud computing center, or the like, which is not limited here.
Although only five terminal devices and one server are shown in fig. 1, it should be understood that the example in fig. 1 is only used for understanding the present solution, and the number of the specific terminal devices and the number of the servers should be flexibly determined according to actual situations.
With reference to fig. 2, a method for video clipping in the present application will be described below, and an embodiment of the method for video clipping in the present application includes:
101. the video clipping device acquires video content corresponding to a video to be clipped and audio content corresponding to the video to be clipped;
In this embodiment, after acquiring the video to be clipped, the video clipping device may acquire the video content corresponding to the video to be clipped and the audio content corresponding to the video to be clipped. The video content refers to the image frames in the video to be clipped, i.e., all the image frames and none of the sound; correspondingly, the audio content includes all the sound in the video to be clipped and none of the image frames. Specifically, the video clipping device may capture the video to be clipped directly with a camera, for example when a user opens a local camera through a client with a video-upload function; it may acquire the video to be clipped from a media file stored on a local internal storage device, for example by selecting it from an album; it may download the video to be clipped from the cloud; or it may acquire the video to be clipped from a media file stored on a local external storage device, for example reading a video file stored on a hard disk over a wired connection, which is not limited here.
102. The video clipping device acquires at least two video segments according to video content, wherein the video content comprises at least one object frame, the at least one object frame comprises an object frame for segment segmentation, the object frame for segment segmentation and a next adjacent object frame have target similarity, the target similarity is less than or equal to a similarity threshold, and the object frame for segment segmentation is used for determining the video segments;
In this embodiment, after acquiring the video content corresponding to the video to be clipped, the video clipping device may divide the video content into at least two video segments. The video to be clipped may include a plurality of image frames, among which are object frames: an object frame is a frame whose picture is retained in full and which can be decoded using only its own data, that data describing the details of the image background and the moving subject; an object frame may also be called a key frame. The object frames for segment segmentation indicate the cut points of the video to be clipped; each video segment comprises an object frame for segment segmentation, and an object frame for segment segmentation has a target similarity with the next adjacent object frame, the target similarity being smaller than or equal to the similarity threshold.
103. The video clipping device acquires at least one audio clip from the audio content according to the change state of the audio frequency of the audio content in unit time;
in this embodiment, after acquiring the audio content corresponding to the video to be clipped, the video clipping device may cut the audio content according to the change state of the audio frequency of the audio content in unit time, and further acquire one or more audio segments from the audio content. Wherein the higher the audio frequency, the higher the sound pitch, and the lower the audio frequency, the lower the sound pitch.
Specifically, the video clipping device may preset an audio frequency value and use the audio points whose frequency matches that value as cut points for cutting the audio content; or it may preset an audio frequency threshold and intercept from the audio content at least one audio segment whose audio frequency is greater than or equal to that threshold; at least one audio segment may also be obtained from the audio content in other ways, which are not exhausted here.
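A minimal sketch of the threshold-based variant, assuming the audio frequency per unit time is measured as the dominant FFT frequency of each one-second window; the window length and this frequency measure are illustrative choices, not fixed by the patent:

```python
import numpy as np

def audio_segments_by_frequency(samples, sample_rate, freq_threshold,
                                window_sec=1.0):
    """Intercept audio segments whose per-window dominant frequency is
    greater than or equal to freq_threshold; samples is a 1-D PCM array."""
    win = int(sample_rate * window_sec)
    qualifies = []
    for start in range(0, len(samples) - win + 1, win):
        chunk = samples[start:start + win]
        spectrum = np.abs(np.fft.rfft(chunk))
        freqs = np.fft.rfftfreq(win, d=1.0 / sample_rate)
        qualifies.append(freqs[np.argmax(spectrum)] >= freq_threshold)

    # Merge consecutive qualifying windows into (start_sec, end_sec) segments.
    segments, seg_start = [], None
    for i, ok in enumerate(qualifies):
        if ok and seg_start is None:
            seg_start = i
        elif not ok and seg_start is not None:
            segments.append((seg_start * window_sec, i * window_sec))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start * window_sec, len(qualifies) * window_sec))
    return segments
```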
104. The video clipping device generates a target clipping segment set corresponding to a video to be clipped according to at least two video segments and at least one audio segment, wherein the target clipping segment set comprises at least one target clipping segment.
In this embodiment, after obtaining at least two video segments and at least one audio segment, the video clipping device may correct the cut point of the video to be clipped through the video segments and the audio segments to cut the video to be clipped into at least one target clip segment because the start positions of the video segments and the audio segments may not be aligned.
Specifically, in one case, the video clipping device takes the cut points of the video segments as the reference and corrects them using the audio segments. That is, after obtaining the at least two video segments and the at least one audio segment, it determines for each audio segment whether that audio segment appears in two video segments at once; if so, the two video segments need to be merged into one, and the cut points of the video to be clipped are then determined from the merged segment. For example, if the first video segment corresponds to seconds 1 to 3 of the video to be clipped, the second video segment to seconds 4 to 6, and some audio segment to seconds 2 to 5, then the first and second video segments are merged, so the cut points change from the original two cut points at second 3 and second 6 to a single cut point at second 6; this is only an example and is not meant to limit the solution. Optionally, if an audio segment appears in two video segments that are not adjacent, those two video segments and every video segment between them need to be merged into one, and the cut points are then determined from the merged segment. For example, if the first video segment corresponds to seconds 1 to 3 of the video to be clipped, the second to seconds 4 to 6, the third to seconds 7 to 9, and some audio segment to seconds 2 to 8, then the first, second and third video segments are merged, so the cut points change from the original three at seconds 3, 6 and 9 to a single cut point at second 9; again, this is merely an example. If no audio segment appears in two video segments at once, the cut points of the video segments acquired in step 102 may be determined as the cut points of the video to be clipped, and the video is cut into at least two target clip segments.
In the other case, the video clipping device takes the cut points of the audio segments as the reference and corrects them using the video segments. That is, after obtaining the at least two video segments and the at least one audio segment, it determines for each video segment whether that video segment appears in two audio segments at once; if so, the two audio segments need to be merged into one, and the cut points of the video to be clipped are then determined from the merged audio segment. Optionally, if a video segment appears in two audio segments that are not adjacent, those two audio segments and every audio segment between them need to be merged into one, and the cut points are then determined from the merged audio segment. If no video segment appears in two audio segments at once, the cut points of the audio segments acquired in step 103 may be determined as the cut points of the video to be clipped, and the video is cut into at least two target clip segments.
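Both correction cases amount to dropping a cut point whenever a segment of the other modality spans it. A minimal sketch of the first case (video cut points corrected by audio segments), with every segment a (start_sec, end_sec) tuple sorted by start time; the interval representation is an illustrative assumption:

```python
def merge_video_segments_by_audio(video_segments, audio_segments):
    """Merge adjacent video segments whenever an audio segment spans the cut
    point between them; chained merges cover the non-adjacent case too."""
    def cut_inside_audio(t):
        return any(a_start < t < a_end for a_start, a_end in audio_segments)

    merged = [list(video_segments[0])]
    for seg in video_segments[1:]:
        cut = merged[-1][1]            # cut point before this segment
        if cut_inside_audio(cut):
            merged[-1][1] = seg[1]     # drop the cut point: extend the clip
        else:
            merged.append(list(seg))   # keep the cut point
    return [tuple(s) for s in merged]

# The example from the description: video segments covering seconds 1-3 and
# 4-6 with an audio segment covering seconds 2-5 merge into one clip, 1-6.
print(merge_video_segments_by_audio([(1, 3), (4, 6)], [(2, 5)]))  # [(1, 6)]
```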
To further understand the solution, please refer to fig. 3, a schematic diagram of one embodiment of the video clipping method in the embodiment of the present application. After the video to be clipped is obtained, the corresponding video content and audio content are obtained first; at least two video segments are obtained from the video content and at least one audio segment from the audio content; the cut points of the video to be clipped are then determined using the video segments and the audio segments together, thereby generating the target clip segment set corresponding to the video to be clipped. Fig. 3 shows 3 target clip segments in the set as an example; it should be understood that the example in fig. 3 is only given for ease of understanding and does not limit the solution.
The embodiment of the application provides a video clipping method: first the video content corresponding to the video to be clipped and the audio content corresponding to the video to be clipped are obtained; then at least two video segments are obtained according to the video content; then at least one audio segment is obtained from the audio content according to the change state of the audio frequency of the audio content in unit time; and finally a target clip segment set corresponding to the video to be clipped is generated according to the at least two video segments and the at least one audio segment. In this way, both the completeness of the pictures in the video content and the completeness of the audio segments in the audio content are considered during clipping; the video content and the audio content jointly serve as the reference basis for the clip, their information is complementary, the clipped segments do not suffer information loss, and the accuracy of video clipping is improved.
Optionally, on the basis of the embodiment corresponding to fig. 2, in an optional embodiment of the method for video clipping provided in the embodiment of the present application, the obtaining, by the video clipping device, the video content corresponding to the video to be clipped and the audio content corresponding to the video to be clipped may include:
the video editing device acquires a video to be edited;
the video editing device parses the video to be edited by protocol decoding to obtain format-encapsulated data, wherein the protocol decoding is used for converting data corresponding to a first protocol into data corresponding to a second protocol, and the first protocol and the second protocol belong to different protocol types;
the video editing device carries out de-encapsulation processing on the format encapsulated data to obtain audio code stream data and video code stream data;
the video editing device decodes the audio code stream data to obtain audio content;
the video editing device decodes the video code stream data to obtain video content.
In this embodiment, after acquiring the video to be edited corresponding to the first protocol, the video editing device needs to obtain the video content and the audio content from it. Specifically, in one case, if the video editing device obtained the video to be edited by downloading it over a network, streaming-media protocols are often used for network transmission, and these protocols carry signalling data alongside the video data, the signalling data including control data for playback and description data for the network state. Protocol decoding is then the process of removing the signalling data from the video to be edited corresponding to the first protocol and retaining only the video data, thereby obtaining the format-encapsulated data. In another case, if the video editing device acquires the video to be edited from a locally stored media file and the first protocol type of the stored video differs from the second protocol type applicable to the device, protocol decoding is also required. Protocol decoding thus converts the video to be edited corresponding to the first protocol into the video to be edited corresponding to the second protocol, the first protocol and the second protocol belonging to different protocol types. Further, the first protocol may be a streaming-media protocol, for example the hypertext transfer protocol (HTTP), the real-time messaging protocol (RTMP), or the Microsoft media server protocol (MMS), while the second protocol refers to the form adopted after the parsing, for example a container format such as FLV, and is not limited here. Further, the first protocol and the second protocol may or may not have a fixed correspondence; for example, a video to be edited transmitted via the RTMP protocol may, after the protocol decoding operation, be output as a video to be edited in the FLV format, and other cases are not exhausted here.
After obtaining the format-encapsulated video to be edited, the video editing device needs to decapsulate it to separate the audio code stream data and the video code stream data. The audio code stream formats include, but are not limited to, advanced audio coding (AAC), MP3, and audio coding 3 (AC-3); the video code stream formats include, but are not limited to, H.264, MPEG-2, and VC-1. For example, decapsulating a video to be edited in FLV format yields AAC-encoded audio code stream data and H.264-encoded video code stream data; other formats are not enumerated here. The video editing device can then decode the audio code stream data into uncompressed audio content and the video code stream data into uncompressed video content. The audio content may be in pulse-code modulation (PCM) or another audio format; the video content may be in YUV420P, RGB, or another video format; here YUV420 denotes a 4:2:0 sampling ratio of luminance (Y) to the two chrominance components (U and V), and RGB denotes red (R), green (G), and blue (B). All the above examples merely demonstrate the feasibility of the scheme and do not limit it. After obtaining the audio content and the video content, the video editing device may further synchronize them; the synchronization reference may be the timeline of the video to be edited, for example aligning the audio content and the video content at second 1 of the video. Further, the time granularity of the alignment may be seconds or a finer-grained level, which is not limited here.
To further understand the solution, please refer to fig. 4, a schematic diagram of one embodiment of generating video segments in the embodiment of the present application. The video to be edited corresponding to the first protocol is obtained and protocol-decoded to obtain the format-encapsulated data, i.e., the video to be edited in format-encapsulated form; the format-encapsulated data is decapsulated to obtain the audio code stream data and the video code stream data respectively; the audio code stream data is then decoded to obtain the audio content, and the video code stream data is decoded to obtain the video content. It should be noted that in the embodiment of the present application the two decoding steps may be executed in any order: the audio code stream data and the video code stream data may be decoded simultaneously, the audio first and the video afterwards, or the video first and the audio afterwards. It should be understood that the example in fig. 4 is only given for ease of understanding and does not limit the solution.
In the embodiment of the application, the video to be edited is obtained and parsed by protocol decoding to obtain the format-encapsulated data, where the protocol decoding converts data corresponding to a first protocol into data corresponding to a second protocol; the format-encapsulated data is decapsulated to obtain the audio code stream data and the video code stream data, the audio code stream data is decoded to obtain the audio content, and the video code stream data is decoded to obtain the video content. In this way, after the video to be edited is obtained, it is format-converted and the converted data is decapsulated to obtain the audio content and the video content; that is, the audio content and the video content can be obtained from a video to be edited in any format, which broadens the applicable scenarios of the scheme and ensures its smooth implementation.
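As an illustration of this pipeline (protocol parsing, decapsulation, and decoding), here is a minimal sketch assuming PyAV, whose underlying FFmpeg handles streaming protocols such as HTTP or RTMP as well as container demuxing; the library choice is an assumption and not part of the patent:

```python
import av  # PyAV: Python bindings for FFmpeg

def extract_contents(source):
    """Open a source (file path or URL), demux it into audio and video
    bitstream packets, and decode them into uncompressed content."""
    container = av.open(source)            # protocol parsing + decapsulation
    video_frames, audio_frames = [], []
    for packet in container.demux():       # audio/video code stream data
        for frame in packet.decode():      # decode to raw content
            if packet.stream.type == "video":
                video_frames.append(frame.to_ndarray(format="rgb24"))
            elif packet.stream.type == "audio":
                audio_frames.append(frame.to_ndarray())
    container.close()
    return video_frames, audio_frames
```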
Optionally, on the basis of the embodiment corresponding to fig. 2, in an optional embodiment of the method for video clipping provided in the embodiment of the present application, the obtaining, by the video clipping device, at least two video segments according to the video content may include:
the video clipping device acquires an object frame sequence according to the video content, wherein the object frame sequence comprises N object frames, and N is an integer greater than or equal to 2;
the video clipping device generates at least one object frame subsequence according to the object frame sequence, wherein the object frame subsequence comprises M object frames for segment division, M is an integer which is greater than or equal to 1 and less than or equal to N;
the video editing device generates at least two video segments from at least one object frame sub-sequence.
In this embodiment, after acquiring the video content, the video clipping device may feed the video content into an object frame extractor to obtain the object frame sequence output by the extractor, the sequence including at least two object frames; similarity judgment is then performed on the at least two object frames of the sequence so as to generate the object frame subsequence(s) from the object frame sequence.
Specifically, in case A, the video clipping device may treat similar object frames as object frames of the same video segment and use the dissimilar object frames to determine the object frames for segment segmentation (which may also be called splitting object frames). The object frame sequence is then split at the splitting object frames, yielding at least two object frame subsequences that together contain the M splitting object frames, and at least two video segments are generated from the at least two subsequences, each subsequence corresponding to one video segment. That is, in case A each object frame subsequence includes at least one object frame, of which at least one is an object frame for segment segmentation; when a subsequence contains two or more object frames, it contains not only the splitting object frame but also other object frames besides it.
In case B, the video clipping device may perform similarity judgment on the at least two object frames of the object frame sequence, select from them the M object frames for segment segmentation, and form a single object frame subsequence from those M frames; the cut points of the video to be clipped are then determined from all the splitting object frames in that subsequence, cutting the video to be clipped into at least two video segments. That is, in case B the object frame subsequence contains only the M object frames for segment segmentation.
More specifically, the object frame extractor may extract object frames from the image frames of the video content using background subtraction, frame differencing, optical flow, or other algorithms; the more drastically the video content changes, the more object frames are extracted. The similarity between object frames is then computed using histogram comparison, image template matching, peak signal-to-noise ratio (PSNR), structural similarity (SSIM), a perceptual hash algorithm, or other algorithms, so as to select the object frames for segment segmentation from the at least two object frames, and the object frame sequence is split at those splitting object frames to obtain at least two object frame subsequences.
To further understand the solution, please refer to fig. 5, a schematic diagram of another embodiment of generating video segments in the embodiment of the present application. An object frame extraction operation is performed on the video content to obtain the object frame sequence KF_i, which includes KF_1, KF_2 … KF_N as shown in fig. 5. Similarity judgment is then performed on the object frames of the sequence using a similarity rule, and an object frame merging operation is performed: object frames with high similarity are merged into one object frame subsequence, and object frames with low similarity are divided into different subsequences, so that the clipping result obtained from the video content follows from the at least two object frame subsequences, namely v_dual_1, v_dual_2 … v_dual_M (an example of the at least two video segments). It should be understood that the example in fig. 5 is only given for ease of understanding and does not limit the solution.
In the embodiment of the application, an object frame sequence is obtained according to the video content, at least one object frame subsequence is generated according to the object frame sequence, and at least two video segments are generated according to the object frame subsequence(s). In this way, only the object frames need to be processed to obtain the object frame subsequences and hence the at least two video segments, which saves processing resources, shortens processing time, and improves the efficiency of the scheme.
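A minimal sketch of case B above, in which the object frame subsequence is reduced to cut points (the timestamps of its segment-splitting object frames) that slice the video into consecutive segments; the second-based representation is illustrative:

```python
def cut_video_by_splitters(duration_sec, splitter_times_sec):
    """Turn the timestamps of the segment-splitting object frames into
    consecutive (start_sec, end_sec) video segments covering the video."""
    cuts = sorted(t for t in splitter_times_sec if 0 < t < duration_sec)
    bounds = [0.0] + cuts + [duration_sec]
    return list(zip(bounds, bounds[1:]))

# Splitting object frames at 12.4 s and 31.0 s in a 60 s video give three
# video segments: [(0.0, 12.4), (12.4, 31.0), (31.0, 60.0)].
print(cut_video_by_splitters(60.0, [12.4, 31.0]))
```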
Optionally, on the basis of the embodiment corresponding to fig. 2, in an optional embodiment of the method for video clipping provided in the embodiment of the present application, the obtaining, by the video clipping device, the sequence of object frames according to the video content may include:
the video clipping device acquires a first image frame and a second image frame from video content, wherein the first image frame is a previous frame image adjacent to the second image frame;
the video clipping device generates a difference image according to the first image frame and the second image frame;
the video clipping device determines a target pixel value according to the difference image;
if the target pixel value satisfies the object frame extraction condition, the video clipping device determines that the second image frame belongs to one object frame in the object frame sequence.
In this embodiment, the video clipping device obtains the image frames of the video content and takes from them a first image frame and a second image frame, the first being the frame immediately preceding the second. After aligning the two frames, the pixel values at corresponding positions are subtracted and their absolute values taken to generate the difference image; the target pixel value is then obtained from the difference image, and if the target pixel value satisfies the object frame extraction condition, the second image frame is determined to be an object frame of the object frame sequence.
For the operation of acquiring pixel values from the difference image, the video clipping device may directly acquire the total pixel value of the difference image. Optionally, the device may perform binarization on the difference image, that is, use a threshold to set the gray value of each pixel of the difference image to either 0 or 255, thereby converting the difference image into a black-and-white binarized image in which a point with gray value 255 is a foreground point and a point with gray value 0 is a background point. Optionally, the device may further perform connectivity analysis on the background points in the binarized image: if background points are connected together, those connected background points may be considered misclassified and should instead be foreground points, so the classification threshold needs to be adjusted and the binarized image regenerated. After the binarized image is obtained, the total pixel value of the entire binarized image is computed from it.
For the operation of determining that the target pixel value satisfies the object frame extraction condition, in one implementation, a time window and a step size may be preset in the video clipping device. The device acquires the image frames within one time window from the video content, for example 50 image frames per window. The window contains multiple adjacent pairs of first and second image frames; the operations of generating a difference image and obtaining a pixel value are performed for every two adjacent image frames, yielding a plurality of pixel values for the window. The largest of these pixel values is selected as the target pixel value satisfying the object frame extraction condition, and the second image frame of the pair corresponding to that value is determined to be the object frame for that window. The device then slides the window backwards by the step size to obtain the object frame from the image frames in the next window in the same manner; since these operations are similar for each window, details are not repeated here. The step size is less than or equal to the window length, and may be, for example, 20 or 30 image frames, which is not limited here. The device repeats the foregoing operations until the object frame sequence has been extracted from all image frames of the video content. In another implementation, a pixel value threshold may be preset in the device: after a pixel value is obtained from the difference image of two adjacent image frames, it is compared against the threshold, and if it is greater than or equal to the threshold, it is determined to be a target pixel value, and the second image frame corresponding to it belongs to the object frame sequence. The device performs these operations for every two adjacent image frames in the video content until the object frame sequence is extracted.
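A minimal sketch of the two-frame differencing described above, assuming the frames are grayscale NumPy arrays of equal size; the window length, step size, and binarization threshold are illustrative values only, since this embodiment leaves them open.

```python
# Sketch of two-frame differencing with binarization; frames is a list of
# equally sized grayscale uint8 numpy arrays, constants are assumed values.
import numpy as np

BIN_THRESHOLD = 25        # gray-level threshold for binarization (assumed)
WINDOW = 50               # image frames per time window (assumed)
STEP = 30                 # slide step, <= WINDOW (assumed)

def diff_score(prev: np.ndarray, curr: np.ndarray) -> int:
    """Total pixel value of the binarized absolute difference image."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    binary = np.where(diff >= BIN_THRESHOLD, 255, 0)
    return int(binary.sum())

def extract_object_frames(frames: list) -> list:
    """Per window, keep the second frame of the pair with the largest score."""
    picked = []
    for start in range(0, max(len(frames) - 1, 0), STEP):
        window = frames[start:start + WINDOW]
        scores = [(diff_score(a, b), start + i + 1)
                  for i, (a, b) in enumerate(zip(window, window[1:]))]
        if scores:
            picked.append(max(scores)[1])  # index of the chosen object frame
    return sorted(set(picked))
```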
In the embodiment of the application, a first image frame and a second image frame are obtained from video content, wherein the first image frame is a previous frame image adjacent to the second image frame, a difference image is generated according to the first image frame and the second image frame, a target pixel value is obtained according to the difference image, and if the target pixel value meets an object frame extraction condition, it is determined that the second image frame belongs to an object frame in an object frame sequence. Through the method, the implementation mode for determining the object frame is provided, the implementation mode has good adaptability to scenes which change rapidly, and the realizability of the scheme is improved.
Optionally, on the basis of the embodiment corresponding to fig. 2, in an optional embodiment of the method for video clipping provided in the embodiment of the present application, the obtaining, by the video clipping device, the sequence of object frames according to the video content may include:
the video clipping device acquires a first image frame, a second image frame and a third image frame from video content, wherein the first image frame is an adjacent previous frame image of the second image frame, and the second image frame is an adjacent previous frame image of the third image frame;
the video clipping device generates a first difference image according to the first image frame and the second image frame;
the video clipping device generates a second difference image according to the second image frame and the third image frame;
the video clipping device generates a target differential image according to the first differential image and the second differential image;
the video clipping device determines a target pixel value according to the target differential image;
if the target pixel value satisfies the object frame extraction condition, the video clipping device determines that the third image frame belongs to one object frame in the object frame sequence.
In this embodiment, the video clipping device acquires a first image frame, a second image frame and a third image frame from the video content, where the first image frame is the previous frame adjacent to the second image frame, and the second image frame is the previous frame adjacent to the third image frame. The device aligns the first image frame with the second image frame, subtracts the pixel values at corresponding positions, and takes the absolute value to generate a first difference image; after aligning the second image frame with the third image frame, it does the same to generate a second difference image. A logical AND operation is then performed on the first difference image and the second difference image to generate a target difference image, a target pixel value is obtained from the target difference image, and, in a case where the target pixel value satisfies the object frame extraction condition, the third image frame is determined to be an object frame of the object frame sequence.
For the operation of obtaining the pixel value according to the difference image, the pixel value of the difference image may be directly obtained, or the pixel value of the binarized image may be obtained after the binarization processing of the difference image.
For the operation of determining that the target pixel value satisfies the object frame extraction condition, similar to the foregoing embodiment, in one implementation a time window and a step size may be preset in the video clipping device. The device acquires the image frames within one time window from the video content; the window contains multiple adjacent triples of first, second and third image frames, and the operations of generating difference images and obtaining a pixel value are performed for every three adjacent image frames in the window, yielding a plurality of pixel values. The largest of these is selected as the target pixel value satisfying the object frame extraction condition, and the third image frame corresponding to that value is determined to be the object frame for that window. The device may then slide the window backwards by the step size to obtain the object frame in the next window, and repeats the foregoing operations until the object frame sequence is extracted from all image frames of the video content. In another implementation, a pixel value threshold may be preset in the device: after a pixel value is obtained from the target difference image of three adjacent image frames, it is compared against the threshold, and if it is greater than or equal to the threshold, it is determined to be a target pixel value, and the third image frame corresponding to it belongs to the object frame sequence. The device performs these operations for every three adjacent image frames in the video content until the object frame sequence is extracted.
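A minimal sketch of the three-frame variant under the same assumptions: both difference images are binarized and combined with a bitwise AND, as this embodiment describes, with the threshold again an assumed value.

```python
# Sketch of three-frame differencing: binarize both difference images and
# AND them, so only motion present across both pairs survives.
import numpy as np

BIN_THRESHOLD = 25  # assumed gray-level threshold

def binarized_diff(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    diff = np.abs(b.astype(np.int16) - a.astype(np.int16))
    return (diff >= BIN_THRESHOLD).astype(np.uint8) * 255

def three_frame_score(f1: np.ndarray, f2: np.ndarray, f3: np.ndarray) -> int:
    """Total pixel value of the AND of the two binarized difference images."""
    target = np.bitwise_and(binarized_diff(f1, f2), binarized_diff(f2, f3))
    return int(target.sum())
```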
In the embodiment of the application, through the method, another implementation mode for determining the object frame is provided, and the implementation flexibility of the scheme is improved; in addition, the implementation mode has good adaptability to scenes with slow change, and the application scene of the scheme is expanded.
Optionally, on the basis of the embodiment corresponding to fig. 2, in an optional embodiment of the method for video clipping provided in this embodiment of the present application, the video clipping apparatus generates at least one object frame subsequence according to the object frame sequence, and may include:
step one, a video clipping device acquires a first object frame and a second object frame from an object frame sequence, wherein the first object frame is a previous object frame adjacent to the second object frame;
step two, the video clipping device acquires a first key point set corresponding to the first object frame and a second key point set corresponding to the second object frame, wherein the first key point set can comprise at least one first key point, and the second key point set can comprise at least one second key point;
step three, the video clipping device determines the similarity according to the first key point set and the second key point set;
step four, if the similarity is less than or equal to the similarity threshold, the video clipping device determines that the first object frame belongs to the object frames for segment segmentation in the object frame subsequence;
if the similarity is greater than the similarity threshold, the video clipping device eliminates a first object frame from the object frame sequence;
the video clipping device performs the above-described steps one to four for every two adjacent object frames in the sequence of object frames until a sub-sequence of object frames is extracted from the sequence of object frames.
In this embodiment, after the video clipping device acquires the object frame sequence, since the object frames are sequentially arranged, the device can acquire a first object frame and a second object frame from the sequence, where the first object frame is the previous object frame adjacent to the second object frame. The device then acquires a first key point set corresponding to the first object frame and a second key point set corresponding to the second object frame, and determines the similarity from the two sets. Specifically, in one case, the device may determine the similarity between the two object frames by computing, from the two key point sets, a Euclidean distance, a Minkowski distance, a Manhattan distance, a Chebyshev distance, or the like between the first object frame and the second object frame, where the greater the distance value, the smaller the similarity. More specifically, the device may preset a correspondence between the distance value and the similarity, so that after the distance value between the two object frames is obtained, the similarity can be generated from it. In another case, the device may compute the cosine similarity between the first object frame and the second object frame from the two key point sets, and either directly take the cosine similarity as the similarity or preset a correspondence between cosine similarity and similarity, where the greater the cosine similarity, the greater the similarity. The device may also obtain the similarity between the two object frames in other ways, which are not exhaustively listed here.
After obtaining the similarity between the first object frame and the second object frame, the video clipping device may determine whether the similarity is less than or equal to a preset similarity threshold. If so, the device determines that the first object frame belongs to the object frames for segment segmentation in the object frame subsequence; if the similarity is greater than the threshold, the first object frame is determined not to be a segmentation object frame of the video to be clipped and is removed from the object frame sequence. The device performs these operations of generating the similarity, comparing it with the threshold, and either keeping the object frame for segment segmentation or removing it, for every two adjacent object frames in the sequence, until the object frame subsequence is extracted from the object frame sequence. Further, for the last object frame in the sequence, that is, in the case where the first object frame is the last frame and there is no second object frame to compare it with, the device may either determine the last object frame to be an object frame for segment segmentation in the subsequence, or directly remove it; this is not limited here. Furthermore, since the object frame subsequence includes only object frames for segment segmentation, that is, it contains exactly the segmentation object frames of the video to be clipped, the device can determine the position of each such object frame in the video to be clipped, that is, the cut points of the video to be clipped, so that after the cutting operation is performed, at least two video segments are obtained. As an example, suppose the video to be clipped includes 15 image frames, the object frame sequence includes the 3rd, 5th, 8th and 10th frames, the similarity between the 3rd and 5th frames is greater than the similarity threshold, the similarity between the 5th and 8th frames is less than the threshold, and the similarity between the 8th and 10th frames is equal to the threshold. Taking the case where the last object frame of the sequence is determined to be an object frame for segment segmentation, the object frame subsequence then includes the 5th, 8th and 10th frames. It should be understood that this example is only for convenience of understanding and is not used to limit the present solution.
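A minimal sketch of this key-point screening, assuming each object frame's key points are already available as a NumPy array of matched coordinates; the Euclidean distance is one of the measures listed above, and the distance-to-similarity mapping and threshold are illustrative assumptions.

```python
# Sketch of selecting segmentation object frames by key-point similarity.
# keypoints: ordered list of (frame_index, np.ndarray of shape (k, 2)) pairs;
# all arrays are assumed to hold the same number of matched key points.
import numpy as np

SIM_THRESHOLD = 0.6  # assumed similarity threshold

def similarity(kp_a: np.ndarray, kp_b: np.ndarray) -> float:
    """Map mean Euclidean key-point distance into a (0, 1] similarity."""
    dist = float(np.linalg.norm(kp_a - kp_b, axis=1).mean())
    return 1.0 / (1.0 + dist)   # assumed distance-to-similarity mapping

def split_frames(keypoints) -> list:
    """Keep a frame when it is dissimilar to its successor (a cut point)."""
    splits = []
    for (idx_a, kp_a), (_, kp_b) in zip(keypoints, keypoints[1:]):
        if similarity(kp_a, kp_b) <= SIM_THRESHOLD:
            splits.append(idx_a)   # kept as an object frame for segmentation
        # otherwise the first object frame of the pair is removed
    return splits
```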
In this embodiment of the present application, a video editing apparatus acquires a first object frame and a second object frame from an object frame sequence, where the first object frame is a previous object frame adjacent to the second object frame, acquires a first key point set corresponding to the first object frame and a second key point set corresponding to the second object frame, determines a similarity according to the first key point set and the second key point set, determines, if the similarity is less than or equal to a similarity threshold, that the first object frame belongs to an object frame for segment segmentation in an object frame subsequence, if the similarity is greater than the similarity threshold, removes the first object frame from the object frame sequence, and performs the above steps until an object frame subsequence is extracted from the object frame sequence for every two adjacent object frames in the object frame sequence. Through the method, the similarity judgment is carried out on each adjacent object frame, so that the object frame subsequence only comprising the object frames used for segment segmentation is obtained from the object frame sequence, the interference of the object frames not used for segment segmentation is eliminated, and the efficiency of obtaining the video segments is improved.
Optionally, on the basis of the embodiment corresponding to fig. 2, in an optional embodiment of the method for video clipping provided in this embodiment of the present application, the video clipping apparatus generates at least two video segments according to at least one object frame sub-sequence, and may include:
the video clipping device generates a first video segment from the video content based on the first object frame sub-sequence;
the video clipping device generates a second video segment from the video content based on the second object frame subsequence;
alternatively, the video clipping device generating at least two video segments from at least one object frame sub-sequence may comprise:
the video clipping device acquires a first object frame for segment division and a second object frame for segment division from M object frames for segment division included in the object frame subsequence;
the video clipping device cuts out a first video segment from the video content according to a first object frame for segment division;
the video clipping device clips a second video segment from the video content according to the second object frame for segment division, where the second video segment and the first video segment are two different video segments.
In this embodiment, based on the description in the embodiment corresponding to fig. 5, in case A, at least one object frame subsequence includes at least two object frame subsequences, obtained by segmenting the object frame sequence using the object frames for segment division. The video clipping device may obtain a first object frame subsequence from the at least two object frame subsequences, the first subsequence including one or more object frames, and generate a first video segment according to it; and obtain a second object frame subsequence from the at least two object frame subsequences, the second subsequence also including one or more object frames, and generate a second video segment according to it. Here the first and second object frame subsequences are each arbitrary subsequences among the at least two object frame subsequences, and they are different subsequences.
Based on the description in the embodiment corresponding to fig. 5, in case B, at least one object frame subsequence includes one object frame subsequence, and that subsequence includes M object frames for segment division. The video clipping device acquires a first object frame for segment division and a second object frame for segment division from the M object frames, where each is any one of the M object frames for segment division. Specifically, in the case where the subsequence includes only object frames for segment division, the device may arbitrarily select the first and second object frames for segment division from all object frames in the subsequence. The device then cuts a first video segment from the video content according to the first object frame for segment division, and cuts a second video segment according to the second object frame for segment division, where the two are different video segments. Specifically, after acquiring the first object frame for segment division, the device may acquire the corresponding image frame from the video to be clipped and determine it as a segmentation object frame. More specifically, that object frame may be used as the last frame of the preceding video segment, thereby determining the end frame of a video segment; since every object frame for segment division is then the end frame of some video segment, one video segment can be cut from the video to be clipped according to two such end frames. Alternatively, the object frame may be used as the first frame of the following video segment, thereby determining the start frame of a video segment; since every object frame for segment division is then the start frame of some video segment, one video segment can be cut according to two such start frames. The processing of the second object frame for segment division is similar to that of the first and is not repeated here. In this way, the device can cut the video to be clipped according to each of the M object frames for segment division to obtain M video segments.
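A minimal sketch of cutting segments from split frames under the first convention described above (each segmentation object frame is the end frame of the preceding segment); frame counts and indices are illustrative.

```python
# Sketch: cut a video of n_frames into segments, treating each split index
# as the last frame of the preceding segment (convention chosen above).
def cut_segments(n_frames: int, split_indices: list) -> list:
    segments = []
    start = 0
    for end in sorted(split_indices):
        segments.append((start, end))      # inclusive frame range
        start = end + 1
    if start < n_frames:                   # trailing frames form a final segment
        segments.append((start, n_frames - 1))
    return segments

# Example: 15 frames with split frames 4 and 9 (0-based) yield
# [(0, 4), (5, 9), (10, 14)].
print(cut_segments(15, [4, 9]))
```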
In the embodiment of the application, a first object frame for segment division and a second object frame for segment division are obtained from M object frames for segment division included in an object frame subsequence, a first video segment is cut from video content according to the first object frame for segment division, and a second video segment is cut from the video content according to the second object frame for segment division. By the method, a specific implementation mode for obtaining the video clip by using the object frame for clip segmentation is provided, and feasibility of the scheme is improved.
Optionally, on the basis of the embodiment corresponding to fig. 2, in an optional embodiment of the method for video clipping provided in this embodiment of the present application, the obtaining, by the video clipping apparatus, at least one audio clip from the audio content according to a change state of an audio frequency of the audio content in a unit time may include:
the video clipping device acquires the change state of the audio frequency of the audio content in unit time;
the video clipping device cuts out at least one audio segment from the audio content according to the change state of the audio frequency in the unit time and the audio frequency threshold, wherein the audio frequency of each audio segment is greater than or equal to the audio frequency threshold.
In this embodiment, an audio frequency threshold may be preset in the video clipping device. After the audio content is acquired, the change state of its audio frequency in unit time may be obtained, where the change state may include the audio frequency in each time unit, and the time unit may be a second or a finer-grained unit. After the change state of the audio frequency of the audio content in unit time is obtained, the audio frequency value of each time unit can be compared with the audio frequency threshold, the audio content whose frequency value is less than the threshold is culled, and the audio content remaining after culling consists of at least one discontinuous audio segment, thereby completing the interception of at least one audio segment from the audio content. The audio frequency threshold may take a value of 1 Hz, 5 Hz, 10 Hz, 20 Hz, 25 Hz, or another value, which is not limited here. To further understand the present solution, please refer to fig. 6, which is a schematic diagram of the change state of the audio frequency in unit time in the embodiment of the present application. The change state is shown as a line graph, where the vertical axis is the frequency value corresponding to the audio content and the horizontal axis is the corresponding time; the line labeled A1 corresponds to the audio frequency threshold, and the audio frequencies in the audio segments indicated by A2, A3 and A4 are all greater than or equal to the threshold, so that 3 audio segments, namely A2, A3 and A4, can be extracted from the audio content.
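A minimal sketch of this frequency-threshold interception, assuming the per-time-unit frequency values have already been computed from the audio content; the threshold value and the representation of segments as index ranges are assumptions for illustration.

```python
# Sketch: cut audio into segments whose per-unit frequency stays at or above
# a threshold; freqs[i] is the audio frequency (Hz) in time unit i.
FREQ_THRESHOLD_HZ = 20.0  # assumed threshold

def audio_segments(freqs: list) -> list:
    segments, start = [], None
    for i, f in enumerate(freqs):
        if f >= FREQ_THRESHOLD_HZ and start is None:
            start = i                        # a segment opens at unit i
        elif f < FREQ_THRESHOLD_HZ and start is not None:
            segments.append((start, i - 1))  # the segment closes before unit i
            start = None
    if start is not None:
        segments.append((start, len(freqs) - 1))
    return segments

# Example: silence (0 Hz) separates two voiced stretches.
print(audio_segments([0, 0, 120, 180, 150, 0, 0, 90, 110, 0]))  # [(2, 4), (7, 8)]
```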
To further understand the present solution, please refer to fig. 7, which is a schematic diagram of an embodiment of merging video segments in an embodiment of the present application. After the audio content (audio) corresponding to the video to be clipped is obtained, the change state of the audio frequency of the audio content in unit time, which may also be referred to as the audio frequency distribution, is obtained, and then at least one audio segment (i.e., a_duration_1, a_duration_2 … a_duration_K shown in fig. 7) is cut from the audio content according to the change state of the audio frequency in unit time and the audio frequency threshold. It should be understood that the example in fig. 7 is only for convenience of understanding the present solution and is not used to limit it.
In the embodiment of the application, the change state of the audio frequency of the audio content in unit time is obtained, and at least one audio segment is intercepted from the audio content according to that change state and an audio frequency threshold, where the audio frequency of each audio segment is greater than or equal to the threshold. When part of the audio content is silent, the frequency value corresponding to the silent portion is 0 Hz, and transitions between different scene segments in the video content are generally silent; a portion of the audio content with continuous sound is therefore unlikely to contain a cut point of the video to be clipped. By intercepting at least one audio segment whose audio frequency is greater than or equal to the threshold, the cut points between audio segments fall only where the audio frequency is below the threshold, which avoids clipping a stretch of continuous sound into two audio segments. This matches the actual scene, is simple to operate, and improves the efficiency of the scheme.
Optionally, on the basis of the embodiment corresponding to fig. 2, in an optional embodiment of the method for video clipping provided in the embodiment of the present application, the generating, by the video clipping device, a target clip segment set corresponding to a video to be clipped according to at least two video segments and at least one audio segment may include:
the video clipping device acquires a first video segment and a second video segment from at least two video segments, wherein the second video segment and the first video segment belong to two different video segments, and the second video segment is the next video segment adjacent to the first video segment;
the video clipping device acquires the last image frame in the first video segment;
the video clipping device acquires a first image frame in the second video segment;
if the last image frame in the first video segment and the first image frame in the second video segment both correspond to the target audio segment, the video clipping device merges the first video segment and the second video segment to obtain the target clipping segment in the target clipping segment set, wherein the target audio segment belongs to any one of the at least one audio segment.
In this embodiment, the video clipping device obtains a first video segment and a second video segment from the at least two video segments, where the second video segment is the next video segment adjacent to the first video segment, obtains the last image frame in the first video segment, and obtains the first image frame in the second video segment. The device may further obtain the at least one audio segment corresponding to the two video segments: specifically, it first obtains, from the video to be clipped, a first time corresponding to the last image frame in the first video segment and a second time corresponding to the first image frame in the second video segment, and then obtains, from all the audio segments corresponding to the audio content, at least one audio segment corresponding to the first time and the second time, where the at least one audio segment includes the audio content corresponding to the first time and/or the second time.
If the last image frame in the first video segment and the first image frame in the second video segment both correspond to the same target audio segment, that is, the audio content corresponding to the first time and the audio content corresponding to the second time are included in the same target audio segment, the video clipping device merges the first video segment and the second video segment. The device performs this operation on every two adjacent video segments among the at least two video segments acquired from the video content, so that the audio segments correct the cut frames of the video segments; the cut points of the video to be clipped are determined according to the video segments after the merging operation, and the video is clipped at the updated cut points to obtain the target clip segments in the target clip segment set, where the target audio segment belongs to any one of the at least one audio segment. To further understand the present solution, please refer to fig. 8, a schematic diagram of an embodiment of generating a target clip segment in the embodiment of the present application. Taking as an example a video to be clipped that includes 20 image frames, each corresponding to 1 second of the video, the first video segment is the 1st to 7th frames (B1 in fig. 8), the second video segment is the 8th to 10th frames (B2 in fig. 8), and the third video segment is the 11th to 12th frames (B3 in fig. 8). For the first and second video segments, the video clipping device acquires the corresponding audio segments, i.e., the audio segments within the 1st to the 12th second; the audio segment B4 in fig. 8 spans the 5th to the 10th second and therefore includes the audio content corresponding to the 7th and 8th seconds. Because the audio content of the 7th and 8th seconds lies in the same audio segment (B4), that is, the 7th frame and the 8th frame correspond to the same audio segment, the first video segment and the second video segment need to be merged. The processing of the second and third video segments is similar to the foregoing operation and is not repeated here. It should be understood that the example in fig. 8 is only for convenience of understanding the present solution and is not used to limit it.
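A minimal sketch of this audio-guided merge, reusing the fig. 8 convention of one image frame per second so that frame indices double as times; the tuple representations of segments are illustrative assumptions.

```python
# Sketch: merge adjacent video segments whose boundary frames fall inside the
# same audio segment. Segments are inclusive (start, end) ranges; with one
# frame per second (as in the fig. 8 example), a frame index is also a time.
def find_audio_segment(t: int, audio_segments: list):
    for k, (s, e) in enumerate(audio_segments):
        if s <= t <= e:
            return k
    return None   # boundary time was culled from the audio content

def merge_by_audio(video_segments: list, audio_segments: list) -> list:
    merged = [video_segments[0]]
    for nxt in video_segments[1:]:
        last_end = merged[-1][1]          # last frame of the previous segment
        a = find_audio_segment(last_end, audio_segments)
        b = find_audio_segment(nxt[0], audio_segments)
        if a is not None and a == b:      # same target audio segment: merge
            merged[-1] = (merged[-1][0], nxt[1])
        else:                             # keep as two target clip segments
            merged.append(nxt)
    return merged

# Fig. 8 style example: B1=(1,7), B2=(8,10), B3=(11,12); B4 spans seconds 5-10.
print(merge_by_audio([(1, 7), (8, 10), (11, 12)], [(5, 10)]))  # [(1, 10), (11, 12)]
```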
In an embodiment of the application, a first video segment and a second video segment are obtained from at least two video segments, where the second video segment is a next video segment adjacent to the first video segment, a last image frame in the first video segment is obtained, a first image frame in the second video segment is obtained, and if the last image frame in the first video segment and the first image frame in the second video segment both correspond to a target audio segment, the video clipping device merges the first video segment and the second video segment to obtain a target clipping segment in a target clipping segment set. By the mode, the splitting point is corrected by utilizing the audio content on the basis of the video segments, and because the adjacent video segments are necessarily continuous and the two audio segments are not necessarily continuous, all the content in the video to be clipped can be ensured to be considered in the clipping process on the basis of the video segments, so that the precision of the clipping process is improved; and if the last image frame in the first video segment and the first image frame in the second video segment both correspond to the same target audio segment, the video object is edited in the process of continuous speaking, which is obviously unreasonable, and the audio content is utilized for calibration, so that the reasonability and the precision of the editing process are further improved.
Optionally, on the basis of the embodiment corresponding to fig. 2, in an optional embodiment of the method for video clips provided in the embodiment of the present application, the method may further include:
if the last image frame in the first video segment corresponds to the target audio segment and the first image frame in the second video segment does not correspond to the target audio segment, the video clipping device determines that the first video segment is one target clip segment in the set of target clip segments and the second video segment is another target clip segment in the set of target clip segments;
if the last image frame in the first video segment does not correspond to the target audio segment and the first image frame in the second video segment corresponds to the target audio segment, the video clipping device determines that the first video segment is one target clip segment in the set of target clip segments and the second video segment is another target clip segment in the set of target clip segments.
In this embodiment, since the at least one audio segment acquired from the audio content corresponding to the video to be clipped may be discontinuous, the audio content corresponding to the last image frame in the first video segment, or to the first image frame in the second video segment, may have been culled during the audio-based interception. That is, among all the audio segments corresponding to the audio content, there may be no audio segment corresponding to the last image frame in the first video segment, or none corresponding to the first image frame in the second video segment, or none corresponding to either of them. If the last image frame in the first video segment corresponds to the target audio segment and the first image frame in the second video segment does not, or vice versa, or if neither image frame corresponds to any audio segment, the video clipping device determines that the first video segment and the second video segment do not need to be merged, and that they are two different target clip segments in the target clip segment set.
In this embodiment, if the last image frame in the first video segment corresponds to the target audio segment and the first image frame in the second video segment does not, or vice versa, the video clipping device determines that the first video segment is one target clip segment in the target clip segment set and the second video segment is another. This provides a specific processing manner for the case where no audio segment corresponds to the last image frame in the first video segment or to the first image frame in the second video segment, instead of reporting an error when this scenario occurs, which ensures the execution fluency of the scheme, matches the requirements of actual scenes, and improves the rationality of the scheme.
Optionally, on the basis of the embodiment corresponding to fig. 2, in an optional embodiment of the method for video clipping provided in the embodiment of the present application, the generating, by the video clipping device, a target clip segment set corresponding to a video to be clipped according to at least two video segments and at least one audio segment may include:
the video clipping device acquires a target audio segment from at least one audio segment, wherein the target audio segment corresponds to a target time period;
the video editing device acquires a video to be detected according to a target time period corresponding to the target audio clip;
the video editing device acquires a first image frame in a video to be detected;
the video editing device acquires the last image frame in the video to be detected;
if the first image frame in the video to be detected and the last image frame in the video to be detected both belong to the target video segment, the video clipping device determines that the target video segment is one target clipping segment in the target clipping segment set, wherein the target video segment belongs to any one of at least two video segments.
In this embodiment, the video clipping device obtains a target audio segment from the at least one audio segment, where the target audio segment here refers to any one of the at least one audio segment, and obtains the target time period corresponding to it. The device may further acquire the start image frame corresponding to the start time of the target time period and the end image frame corresponding to its end time, so that the video to be detected is acquired from the video to be clipped according to the start and end image frames. The first image frame and the last image frame in the video to be detected are then acquired, and at least one video segment corresponding to these two image frames is acquired from all the video segments of the video to be clipped, the at least one video segment including the first image frame and the last image frame of the video to be detected. If the at least one video segment includes only one target video segment, that is, the first image frame and the last image frame of the video to be detected both belong to the same target video segment, the device determines that the target video segment is one target clip segment in the target clip segment set, where the target video segment belongs to any one of the at least two video segments.
In the embodiment of the application, a target audio clip is obtained from at least one audio clip, a video to be detected is obtained according to a target time period corresponding to the target audio clip, a first image frame in the video to be detected and a last image frame in the video to be detected are obtained, and if the first image frame in the video to be detected and the last image frame in the video to be detected both belong to the target video clip, the target video clip is determined to be one target clip in a target clip set. By the method, another implementation mode for verifying the video clip by using the audio clip is provided, and the implementation flexibility of the scheme is improved.
Optionally, on the basis of the embodiment corresponding to fig. 2, in an optional embodiment of the method for video clips provided in the embodiment of the present application, the method may further include:
if the first image frame in the video to be detected belongs to the first video segment and the last image frame in the video to be detected belongs to the second video segment, the video clipping device combines the first video segment and the second video segment to obtain a target clipping segment in the target clipping segment set, wherein the first video segment and the second video segment belong to two different video segments.
In this embodiment, the video clipping device acquires a video to be detected according to a target time period corresponding to a target audio clip, and after acquiring a first image frame and a last image frame in the video to be detected, acquires at least one video clip corresponding to the first image frame in the video to be detected and the last image frame in the video to be detected, if the first image frame in the video to be detected belongs to a first video clip and the last image frame in the video to be detected belongs to a second video clip, the video clipping device determines that a start audio frequency point and an end audio frequency point corresponding to the same audio clip are located in the two video clips respectively, which is obviously unreasonable, and merges the first video clip and the second video clip to obtain the target clipping clip in the target clipping set. Optionally, if the second video segment is not the video segment immediately after the first video segment, that is, there are other video segments between the first video segment and the second video segment, the video clipping apparatus merges the first video segment and the second video segment and the video segments between the first video segment and the second video segment, so as to obtain the target clipping segment in the target clipping segment set.
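A minimal sketch of this reverse check, mapping each audio segment's time span back onto the video segments and merging any segments it straddles, including the segments lying between them as noted above; the representations are the same illustrative assumptions as in the earlier sketch.

```python
# Sketch: for each audio segment, find the video segments containing its first
# and last frames; if they differ, merge them and everything in between.
def containing_segment(frame: int, segments: list):
    for k, (s, e) in enumerate(segments):
        if s <= frame <= e:
            return k
    return None

def verify_by_audio(video_segments: list, audio_segments: list) -> list:
    segments = sorted(video_segments)
    for a_start, a_end in audio_segments:
        i = containing_segment(a_start, segments)
        j = containing_segment(a_end, segments)
        if i is not None and j is not None and i != j:
            lo, hi = min(i, j), max(i, j)
            # merge segments lo..hi into one target clip segment
            segments[lo:hi + 1] = [(segments[lo][0], segments[hi][1])]
    return segments

# One audio segment (5, 14) straddles (1, 7), (8, 10) and (11, 20): all merge.
print(verify_by_audio([(1, 7), (8, 10), (11, 20)], [(5, 14)]))  # [(1, 20)]
```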
To further understand the present solution, please refer to fig. 9, where fig. 9 is a schematic diagram of an embodiment of generating a target clip segment in the embodiment of the present application, where after a video to be clipped is obtained, video content corresponding to the video to be clipped and audio content corresponding to the video to be clipped are obtained respectively. For video content, after the object frame extraction operation is executed, the video content is clipped into at least two video segments by utilizing a similarity rule; and for the audio content, the audio content is clipped into at least one audio segment by using the change state of the audio frequency of the audio content in unit time. And then correcting the cut points of at least two video segments by using at least one audio segment, and if the audio content in the same audio segment appears in two different video segments, merging the different video segments, so as to determine the cut points of the video to be clipped according to the merged video segments, and further clip the video to be clipped to obtain the target clip segment in the target clip segment set.
In the embodiment of the application, if the first image frame in the video to be detected belongs to the first video segment and the last image frame in the video to be detected belongs to the second video segment, the first video segment and the second video segment are merged to obtain the target clip segment in the target clip segment set. By the method, another implementation mode for correcting the video clip by using the audio clip is provided, and the implementation flexibility of the scheme is improved.
Referring to fig. 10, fig. 10 is a schematic view of an embodiment of a video clipping device according to the present application, and the video clipping device 20 includes:
an obtaining module 201, configured to obtain video content corresponding to a video to be clipped and audio content corresponding to the video to be clipped;
an obtaining module 201, configured to obtain at least two video segments according to the video content, where the video content includes at least one object frame, the at least one object frame includes an object frame for segment segmentation, and the object frame for segment segmentation has a target similarity with a next adjacent object frame, where the target similarity is less than or equal to a similarity threshold, and the object frame for segment segmentation is used to determine a video segment;
the obtaining module 201 is further configured to obtain at least one audio clip from the audio content according to a change state of the audio frequency of the audio content in a unit time;
the generating module 202 is configured to generate a target clip segment set corresponding to a video to be clipped according to the at least two video segments and the at least one audio segment acquired by the acquiring module, where the target clip segment set includes at least one target clip segment.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video clipping device 20 provided in the embodiment of the present application, the obtaining module 201 is specifically configured to:
acquiring a video to be edited;
analyzing the video to be edited by adopting a solution protocol to obtain format encapsulated data, wherein the solution protocol is used for converting data corresponding to a first protocol into data corresponding to a second protocol, and the first protocol and the second protocol belong to different protocol types;
decapsulating the format encapsulated data to obtain audio code stream data and video code stream data;
decoding the audio code stream data to obtain audio content;
and decoding the video code stream data to obtain video content.
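A minimal sketch of this module's pipeline using the ffmpeg command-line tool, which is only one possible implementation of the solution-protocol/decapsulation/decoding chain described above; the file names are illustrative assumptions.

```python
# Sketch: split a container file into decoded audio and video using the
# ffmpeg CLI (one possible implementation; paths are illustrative).
import subprocess

def split_streams(src: str = "to_clip.mp4"):
    # -vn drops video, decoding the audio stream to PCM WAV (audio content)
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", "audio.wav"], check=True)
    # -an drops audio, keeping only the video stream (video content)
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "video.mp4"], check=True)

split_streams()
```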
Alternatively, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video editing apparatus 20 provided in the embodiment of the present application,
an obtaining module 201, configured to obtain an object frame sequence according to video content, where the object frame sequence includes N object frames, where N is an integer greater than or equal to 2;
generating at least one object frame subsequence according to the object frame sequence, wherein the object frame subsequence comprises M object frames for segment segmentation, and M is an integer which is greater than or equal to 1 and less than or equal to N;
at least two video segments are generated based on the at least one object frame sub-sequence.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video clipping device 20 provided in the embodiment of the present application, the obtaining module 201 is specifically configured to:
acquiring a first image frame and a second image frame from video content, wherein the first image frame is a previous frame image adjacent to the second image frame;
generating a difference image according to the first image frame and the second image frame;
determining a target pixel value according to the differential image;
and if the target pixel value meets the object frame extraction condition, determining that the second image frame belongs to one object frame in the object frame sequence.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video clipping device 20 provided in the embodiment of the present application, the obtaining module 201 is specifically configured to:
acquiring a first image frame, a second image frame and a third image frame from video content, wherein the first image frame is a previous frame image adjacent to the second image frame, and the second image frame is a previous frame image adjacent to the third image frame;
generating a first difference image according to the first image frame and the second image frame;
generating a second difference image according to the second image frame and the third image frame;
generating a target differential image according to the first differential image and the second differential image;
determining a target pixel value according to the target differential image;
and if the target pixel value meets the object frame extraction condition, determining that the third image frame belongs to one object frame in the object frame sequence.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video clipping device 20 provided in the embodiment of the present application, the generating module 202 is specifically configured to:
step one, acquiring a first object frame and a second object frame from an object frame sequence, wherein the first object frame is a previous object frame adjacent to the second object frame;
acquiring a first key point set corresponding to the first object frame and a second key point set corresponding to the second object frame, wherein the first key point set comprises at least one first key point, and the second key point set comprises at least one second key point;
determining similarity according to the first key point set and the second key point set;
if the similarity is smaller than or equal to the similarity threshold, determining that the first object frame belongs to an object frame for segment segmentation in the object frame subsequence;
if the similarity is larger than the similarity threshold value, removing the first object frame from the object frame sequence;
for every two adjacent object frames in the object frame sequence, the steps one to four are performed as described above until the object frame subsequence is extracted from the object frame sequence.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video clipping device 20 provided in the embodiment of the present application, the generating module 202 is specifically configured to:
generating a first video segment from the video content based on the first object frame sub-sequence;
generating a second video segment from the video content based on the second object frame subsequence;
or, generating at least two video segments from at least one object frame sub-sequence, comprising:
acquiring a first object frame for segment division and a second object frame for segment division from M object frames for segment division included in the object frame subsequence;
intercepting a first video segment from the video content according to a first object frame for segment segmentation;
and intercepting a second video segment from the video content according to a second object frame for segment segmentation, wherein the second video segment and the first video segment belong to two different video segments.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video clipping device 20 provided in the embodiment of the present application, the obtaining module 201 is specifically configured to:
acquiring the change state of the audio frequency of the audio content in unit time;
and intercepting at least one audio clip from the audio content according to the change state of the audio frequency in the unit time and the audio frequency threshold, wherein the audio frequency of each audio clip is greater than or equal to the audio frequency threshold.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video clipping device 20 provided in the embodiment of the present application, the generating module 202 is specifically configured to:
acquiring a first video clip and a second video clip from at least two video clips, wherein the second video clip and the first video clip belong to two different video clips, and the second video clip is the next video clip adjacent to the first video clip;
acquiring the last image frame in the first video clip;
acquiring a first image frame in a second video segment;
and if the last image frame in the first video segment and the first image frame in the second video segment both correspond to the target audio segment, merging the first video segment and the second video segment to obtain a target clip segment in the target clip segment set, wherein the target audio segment belongs to any one of the at least one audio segment.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video clipping device 20 provided in the embodiment of the present application, the video clipping device 20 further includes a determining module 203,
a determining module 203, configured to determine that the first video segment is one target clip segment in the target clip segment set and the second video segment is another target clip segment in the target clip segment set if the last image frame in the first video segment corresponds to the target audio segment and the first image frame in the second video segment does not correspond to the target audio segment;
the determining module 203 is further configured to determine that the first video segment is one target clip segment of the set of target clip segments and the second video segment is another target clip segment of the set of target clip segments if the last image frame of the first video segment does not correspond to the target audio segment and the first image frame of the second video segment corresponds to the target audio segment.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video clipping device 20 provided in the embodiment of the present application, the generating module 202 is specifically configured to:
obtaining a target audio clip from the at least one audio clip, wherein the target audio clip corresponds to a target time period;
acquiring a video to be detected according to the target time period corresponding to the target audio clip;
acquiring the first image frame in the video to be detected;
acquiring the last image frame in the video to be detected;
and if the first image frame in the video to be detected and the last image frame in the video to be detected both belong to a target video segment, determining that the target video segment is one target clip segment in the target clip segment set, wherein the target video segment belongs to any one of the at least two video segments.
Optionally, on the basis of the embodiment corresponding to fig. 10, in another embodiment of the video clipping device 20 provided in the embodiment of the present application, the video clipping device 20 further includes a merging module 204,
the merging module 204 is configured to, if the first image frame in the video to be detected belongs to a first video segment and the last image frame in the video to be detected belongs to a second video segment, merge the first video segment and the second video segment to obtain a target clip segment in the target clip segment set, where the first video segment and the second video segment belong to two different video segments.
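A hedged sketch of this audio-driven variant: locate which video segment holds the first and the last image frame of the video to be detected (derived from the target audio clip's time period), then either keep that segment or merge the two segments it spans. The second-based spans and helper names are illustrative assumptions:

```python
def clip_for_audio(audio_span, video_segments):
    """audio_span: (start, end) seconds of the target audio clip, i.e. the
    time period of the video to be detected.
    video_segments: ordered, non-overlapping (start, end) spans in seconds."""
    def holder_of(t):
        for i, (s, e) in enumerate(video_segments):
            if s <= t <= e:
                return i
        return None

    first, last = holder_of(audio_span[0]), holder_of(audio_span[1])
    if first is None or last is None:
        return None                       # boundary frame outside all segments
    if first == last:
        return video_segments[first]      # both frames in one segment: keep it
    # first and last frames fall in two different segments: merge them
    return (video_segments[first][0], video_segments[last][1])
```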
An embodiment of the present application further provides an electronic device, on which the video clipping apparatus provided in the embodiment corresponding to fig. 10 may be deployed, and which is used to execute the steps performed by the video clipping apparatus in the embodiments corresponding to fig. 2 to fig. 9. As shown in fig. 11, the electronic device may be a terminal device. For convenience of explanation, only the parts related to the embodiments of the present application are shown; for specific technical details that are not disclosed, refer to the method embodiments of the present application. The terminal may be any electronic device, including a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sales) terminal, a vehicle-mounted computer, and the like. The following description takes a mobile phone as an example:
fig. 11 is a block diagram illustrating a partial structure of a mobile phone serving as the terminal provided in an embodiment of the present application. Referring to fig. 11, the mobile phone includes: a Radio Frequency (RF) circuit 310, a memory 320, an input unit 330, a display unit 340, a sensor 350, an audio circuit 360, a wireless fidelity (WiFi) module 370, a processor 380, and a power supply 390. Those skilled in the art will appreciate that the mobile phone structure shown in fig. 11 is not limiting: the phone may include more or fewer components than those shown, combine some components, or arrange the components differently.
The following describes each component of the mobile phone in detail with reference to fig. 11:
the RF circuit 310 may be used for receiving and transmitting signals during information transmission and reception or during a call. In particular, it receives downlink information from a base station and forwards it to the processor 380 for processing, and it transmits uplink data to the base station. In general, the RF circuit 310 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 310 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
The memory 320 may be used to store software programs and modules, and the processor 380 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 320. The memory 320 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data created according to the use of the mobile phone (such as audio data or a phonebook), and the like. Furthermore, the memory 320 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 330 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 330 may include a touch panel 331 and other input devices 332. The touch panel 331, also referred to as a touch screen, can collect touch operations of a user on or near it (e.g., operations performed on or near the touch panel 331 using any suitable object or accessory such as a finger or a stylus) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 331 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position of the user's touch, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 380; it can also receive and execute commands sent by the processor 380. In addition, the touch panel 331 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 331, the input unit 330 may include other input devices 332, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 340 may be used to display information input by the user or provided to the user, as well as the various menus of the mobile phone. The display unit 340 may include a display panel 341; optionally, the display panel 341 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. Further, the touch panel 331 can cover the display panel 341; when the touch panel 331 detects a touch operation on or near it, the operation is transmitted to the processor 380 to determine the type of the touch event, after which the processor 380 provides a corresponding visual output on the display panel 341 according to the type of the touch event. Although in fig. 11 the touch panel 331 and the display panel 341 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 331 and the display panel 341 may be integrated to implement these functions.
The mobile phone may also include at least one sensor 350, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor, which adjusts the brightness of the display panel 341 according to the brightness of ambient light, and a proximity sensor, which turns off the display panel 341 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally along three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the attitude of the mobile phone (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and in vibration-recognition-related functions (such as a pedometer or tap detection). Other sensors that can be configured on the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described here.
The audio circuit 360, a speaker 361, and a microphone 362 may provide an audio interface between the user and the mobile phone. The audio circuit 360 may transmit the electrical signal converted from received audio data to the speaker 361, which converts it into a sound signal for output; conversely, the microphone 362 converts collected sound signals into electrical signals, which the audio circuit 360 receives and converts into audio data. The audio data is then processed by the processor 380 and either transmitted via the RF circuit 310 to, for example, another mobile phone, or output to the memory 320 for further processing.
WiFi is a short-distance wireless transmission technology. Through the WiFi module 370, the mobile phone can help the user receive and send e-mails, browse web pages, access streaming media, and the like, providing wireless broadband Internet access for the user. Although fig. 11 shows the WiFi module 370, it is understood that it is not an essential component of the mobile phone and may be omitted as needed.
The processor 380 is the control center of the mobile phone. It connects the various parts of the whole phone by means of various interfaces and lines, and performs the phone's functions and processes data by running or executing the software programs and/or modules stored in the memory 320 and calling the data stored in the memory 320, thereby monitoring the phone as a whole. Optionally, the processor 380 may include one or more processing units; preferably, the processor 380 may integrate an application processor, which mainly handles the operating system, user interfaces, applications, and so on, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 380.
The mobile phone also includes a power supply 390 (e.g., a battery) for powering the various components. Preferably, the power supply is logically connected to the processor 380 via a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment of the application, when the video clipping apparatus provided in the embodiment corresponding to fig. 10 is deployed on the terminal device, the processor 380 is further configured to execute the steps performed by the video clipping apparatus in the embodiments corresponding to fig. 2 to fig. 9. For the specific manner in which the processor 380 executes these steps, reference may be made to the descriptions in the method embodiments corresponding to fig. 2 to fig. 9, which are not repeated here.
An embodiment of the present application further provides an electronic device, where the video clipping apparatus provided in the embodiment corresponding to fig. 10 may be deployed on the electronic device, and is used to execute steps performed by the video clipping apparatus in the embodiments corresponding to fig. 2 to fig. 9. As shown in fig. 12, the electronic device may specifically be a server.
Fig. 12 is a schematic diagram of a server structure provided by an embodiment of the present application. The server 400 may vary considerably in configuration or performance, and may include one or more Central Processing Units (CPUs) 422 (e.g., one or more processors), a memory 432, and one or more storage media 430 (e.g., one or more mass storage devices) storing applications 442 or data 444. The memory 432 and the storage medium 430 may provide transient or persistent storage. The program stored on the storage medium 430 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processing unit 422 may be arranged to communicate with the storage medium 430 and to execute, on the server 400, the series of instruction operations in the storage medium 430.
The server 400 may also include one or more power supplies 426, one or more wired or wireless network interfaces 450, one or more input-output interfaces 458, and/or one or more operating systems 441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 12.
In this embodiment of the application, when the video clipping apparatus provided in the embodiment corresponding to fig. 10 is deployed on the server, the central processing unit 422 is further configured to execute the steps performed by the video clipping apparatus in the embodiments corresponding to fig. 2 to fig. 9. For the specific manner in which the central processing unit 422 executes these steps, reference may be made to the descriptions in the method embodiments corresponding to fig. 2 to fig. 9, which are not repeated here.
Also provided in the embodiments of the present application is a computer-readable storage medium storing a computer program which, when run on a computer, causes the computer to perform the steps performed by the video clipping apparatus in the methods described in the foregoing embodiments shown in fig. 2 to fig. 9.
Also provided in the embodiments of the present application is a computer program product comprising a program which, when run on a computer, causes the computer to perform the steps performed by the video clipping apparatus in the methods described in the foregoing embodiments shown in fig. 2 to fig. 9.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A method of video clipping, comprising:
acquiring video content corresponding to a video to be clipped and audio content corresponding to the video to be clipped;
acquiring at least two video segments according to the video content, wherein the video content comprises at least one object frame, the at least one object frame comprises an object frame for segment segmentation, the object frame for segment segmentation and a next adjacent object frame have target similarity, the target similarity is smaller than or equal to a similarity threshold, and the object frame for segment segmentation is used for determining the video segments;
acquiring at least one audio clip from the audio content according to the change state of the audio frequency of the audio content in unit time;
and generating a target clipping segment set corresponding to the video to be clipped according to the at least two video segments and the at least one audio segment, wherein the target clipping segment set comprises at least one target clipping segment.
2. The method according to claim 1, wherein the acquiring the video content corresponding to the video to be clipped and the audio content corresponding to the video to be clipped comprises:
acquiring the video to be edited;
analyzing the video to be edited by adopting a solution protocol to obtain format encapsulated data, wherein the solution protocol is used for converting data corresponding to a first protocol into data corresponding to a second protocol, and the first protocol and the second protocol belong to different protocol types;
decapsulating the format encapsulated data to obtain audio code stream data and video code stream data;
decoding the audio code stream data to obtain the audio content;
and decoding the video code stream data to obtain the video content.
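As a practical illustration of the decapsulation and decoding steps in claim 2 (not part of the claim itself), the two ffmpeg invocations below demultiplex a local file into its audio and video streams and decode them to uncompressed PCM and raw YUV frames. The file names are hypothetical, and the protocol-conversion ("solution protocol") step is assumed to have already produced the local input:

```python
import subprocess

SRC = "input.mp4"  # hypothetical local file; for a live source, the solution-
                   # protocol step would first pull e.g. an RTMP/HLS stream

# Decapsulate and decode the video stream to raw YUV frames.
subprocess.run(["ffmpeg", "-y", "-i", SRC, "-an",
                "-f", "rawvideo", "-pix_fmt", "yuv420p", "video.yuv"], check=True)

# Decapsulate and decode the audio stream to uncompressed PCM samples.
subprocess.run(["ffmpeg", "-y", "-i", SRC, "-vn",
                "-acodec", "pcm_s16le", "audio.wav"], check=True)
```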
3. The method of claim 1, wherein the obtaining at least two video segments according to the video content comprises:
acquiring an object frame sequence according to the video content, wherein the object frame sequence comprises N object frames, and N is an integer greater than or equal to 2;
generating at least one object frame subsequence according to the object frame sequence, wherein the object frame subsequence comprises M object frames for segment segmentation, and M is an integer which is greater than or equal to 1 and less than or equal to N;
generating the at least two video segments from the at least one object frame sub-sequence.
4. The method of claim 3, wherein the obtaining a sequence of object frames from the video content comprises:
acquiring a first image frame and a second image frame from the video content, wherein the first image frame is a previous frame image adjacent to the second image frame;
generating a difference image according to the first image frame and the second image frame;
determining a target pixel value according to the differential image;
and if the target pixel value meets the object frame extraction condition, determining that the second image frame belongs to one object frame in the object frame sequence.
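A minimal sketch of the two-frame difference test in claim 4, using OpenCV. The concrete extraction condition (a changed-pixel ratio against a per-pixel threshold) is an assumption; the claim only requires that a target pixel value derived from the difference image meet an extraction condition:

```python
import cv2
import numpy as np

def is_object_frame(prev_gray: np.ndarray, curr_gray: np.ndarray,
                    pixel_threshold: int = 30, ratio: float = 0.05) -> bool:
    """Two-frame difference: the second image frame counts as an object
    frame when enough pixels changed versus the adjacent previous frame."""
    diff = cv2.absdiff(prev_gray, curr_gray)          # the difference image
    changed = np.count_nonzero(diff > pixel_threshold)
    return changed / diff.size >= ratio               # extraction condition
```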
5. The method of claim 3, wherein the obtaining a sequence of object frames from the video content comprises:
acquiring a first image frame, a second image frame and a third image frame from the video content, wherein the first image frame is an adjacent previous frame image of the second image frame, and the second image frame is an adjacent previous frame image of the third image frame;
generating a first difference image according to the first image frame and the second image frame;
generating a second difference image according to the second image frame and the third image frame;
generating a target differential image according to the first differential image and the second differential image;
determining a target pixel value according to the target difference image;
and if the target pixel value meets the object frame extraction condition, determining that the third image frame belongs to one object frame in the object frame sequence.
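Claim 5's three-frame variant can be sketched the same way: the two pairwise difference images are combined (here by a bitwise AND, an assumption) into the target difference image before the extraction condition is applied:

```python
import cv2
import numpy as np

def is_object_frame_3(gray1: np.ndarray, gray2: np.ndarray, gray3: np.ndarray,
                      pixel_threshold: int = 30, ratio: float = 0.05) -> bool:
    """Three-frame difference: keep only the motion present across both
    frame intervals, then test the target difference image against the
    extraction condition."""
    d1 = cv2.absdiff(gray1, gray2)          # first difference image
    d2 = cv2.absdiff(gray2, gray3)          # second difference image
    target = cv2.bitwise_and(d1, d2)        # target difference image
    changed = np.count_nonzero(target > pixel_threshold)
    return changed / target.size >= ratio
```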
6. The method of claim 3, wherein generating at least one subsequence of object frames from the sequence of object frames comprises:
acquiring a first object frame and a second object frame from the object frame sequence, wherein the first object frame is a previous object frame adjacent to the second object frame;
acquiring a first key point set corresponding to the first object frame and a second key point set corresponding to the second object frame, wherein the first key point set comprises at least one first key point, and the second key point set comprises at least one second key point;
determining similarity according to the first key point set and the second key point set;
if the similarity is smaller than or equal to the similarity threshold, determining that the first object frame belongs to an object frame for segment segmentation in the object frame subsequence;
and if the similarity is greater than the similarity threshold value, removing the first object frame from the object frame sequence.
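A sketch of the similarity computation in claim 6. The claim does not fix a keypoint detector or a similarity formula; ORB keypoints with a ratio-test match share are one plausible choice and are purely illustrative:

```python
import cv2

def keypoint_similarity(frame_a, frame_b, ratio: float = 0.75) -> float:
    """Similarity between two object frames as the share of frame_a's
    keypoints that find a good (ratio-test) match in frame_b."""
    orb = cv2.ORB_create()
    kp_a, des_a = orb.detectAndCompute(frame_a, None)
    kp_b, des_b = orb.detectAndCompute(frame_b, None)
    if des_a is None or des_b is None:
        return 0.0                                   # no keypoints detected
    matches = cv2.BFMatcher(cv2.NORM_HAMMING).knnMatch(des_a, des_b, k=2)
    good = [p for p in matches
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) / max(len(kp_a), 1)
```

Under this sketch, a first object frame whose similarity to the adjacent next object frame is at or below the similarity threshold is kept as an object frame for segment segmentation; otherwise it is removed from the object frame sequence, mirroring the two branches of the claim.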
7. The method of claim 3, wherein generating the at least two video segments from the at least one object frame sub-sequence comprises:
generating a first video segment from the video content based on a first object frame subsequence;
generating a second video segment from the video content based on a second object frame subsequence;
or, said generating said at least two video segments from said at least one object frame sub-sequence comprises:
acquiring a first object frame for segment division and a second object frame for segment division from the M object frames for segment division included in the object frame subsequence;
intercepting a first video segment from the video content according to the first object frame for segment segmentation;
and intercepting a second video segment from the video content according to the second object frame for segment segmentation, wherein the second video segment and the first video segment belong to two different video segments.
8. The method according to claim 1, wherein the obtaining at least one audio clip from the audio content according to the change state of the audio frequency in the unit time corresponding to the audio content comprises:
acquiring the change state of the audio frequency in unit time corresponding to the audio content;
and intercepting at least one audio clip from the audio content according to the change state of the audio frequency in the unit time and an audio frequency threshold, wherein the audio frequency of each audio clip is greater than or equal to the audio frequency threshold.
9. The method according to any one of claims 1 to 8, wherein the generating a target clip segment set corresponding to the video to be clipped according to the at least two video segments and the at least one audio segment comprises:
acquiring a first video clip and a second video clip from the at least two video clips, wherein the second video clip and the first video clip belong to two different video clips, and the second video clip is a next video clip adjacent to the first video clip;
acquiring the last image frame in the first video segment;
acquiring a first image frame in the second video segment;
if the last image frame in the first video segment and the first image frame in the second video segment both correspond to a target audio segment, merging the first video segment and the second video segment to obtain a target clip segment in the target clip segment set, wherein the target audio segment belongs to any one of the at least one audio segment.
10. The method of claim 9, further comprising:
determining that the first video segment is one of the set of target clip segments and the second video segment is another one of the set of target clip segments if a last image frame of the first video segment corresponds to the target audio segment and a first image frame of the second video segment does not correspond to the target audio segment;
determining that the first video segment is one of the set of target clip segments and the second video segment is another one of the set of target clip segments if the last image frame of the first video segment does not correspond to the target audio segment and the first image frame of the second video segment corresponds to the target audio segment.
11. The method according to any one of claims 1 to 8, wherein the generating a target clip segment set corresponding to the video to be clipped according to the at least two video segments and the at least one audio segment comprises:
obtaining a target audio clip from the at least one audio clip, wherein the target audio clip corresponds to a target time period;
acquiring a video to be detected according to the target time period corresponding to the target audio clip;
acquiring a first image frame in the video to be detected;
acquiring the last image frame in the video to be detected;
if the first image frame in the video to be detected and the last image frame in the video to be detected both belong to a target video segment, determining that the target video segment is one target clip segment in the target clip segment set, wherein the target video segment belongs to any one of the at least two video segments.
12. The method of claim 11, further comprising:
if the first image frame in the video to be detected belongs to a first video segment and the last image frame in the video to be detected belongs to a second video segment, merging the first video segment and the second video segment to obtain a target clip segment in the target clip segment set, wherein the first video segment and the second video segment belong to two different video segments.
13. A video clipping apparatus, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring video content corresponding to a video to be clipped and audio content corresponding to the video to be clipped;
the acquiring module is further configured to acquire at least two video segments according to the video content, where the video content includes at least one object frame, the at least one object frame includes an object frame for segment segmentation, the object frame for segment segmentation has a target similarity with a next adjacent object frame, the target similarity is less than or equal to a similarity threshold, and the object frame for segment segmentation is used to determine a video segment;
the obtaining module is further configured to obtain at least one audio clip from the audio content according to a change state of an audio frequency of the audio content in a unit time;
the generating module is configured to generate a target clipping segment set corresponding to the video to be clipped according to the at least two video segments and the at least one audio segment acquired by the acquiring module, where the target clipping segment set includes at least one target clipping segment.
14. An electronic device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is configured to execute a program in the memory, including the method of any of claims 1 to 12;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
15. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1 to 12.
CN202010062446.2A 2020-01-19 2020-01-19 Video clipping method, related device, equipment and storage medium Active CN111263234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010062446.2A CN111263234B (en) 2020-01-19 2020-01-19 Video clipping method, related device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111263234A (en) 2020-06-09
CN111263234B (en) 2021-06-15

Family

ID=70950918

Country Status (1)

Country Link
CN (1) CN111263234B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109819314B (en) * 2019-03-05 2022-07-12 广州酷狗计算机科技有限公司 Audio and video processing method and device, terminal and storage medium
CN114390352A (en) * 2020-10-16 2022-04-22 上海哔哩哔哩科技有限公司 Audio and video processing method and device
CN112532897B (en) * 2020-11-25 2022-07-01 腾讯科技(深圳)有限公司 Video clipping method, device, equipment and computer readable storage medium
CN113709561B (en) * 2021-04-14 2024-04-19 腾讯科技(深圳)有限公司 Video editing method, device, equipment and storage medium
CN114339392B (en) * 2021-11-12 2023-09-12 腾讯科技(深圳)有限公司 Video editing method, device, computer equipment and storage medium
CN114222159A (en) * 2021-12-01 2022-03-22 北京奇艺世纪科技有限公司 Method and system for determining video scene change point and generating video clip
CN114173194B (en) * 2021-12-08 2024-04-12 广州品唯软件有限公司 Page smoothness detection method and device, server and storage medium
CN114666657B (en) * 2022-03-18 2024-03-19 北京达佳互联信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN115052188A (en) * 2022-05-09 2022-09-13 北京有竹居网络技术有限公司 Video editing method, device, equipment and medium
CN114866839B (en) * 2022-07-11 2022-10-25 深圳市鼎合丰科技有限公司 Video editing software system based on repeated frame image merging

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4923517B2 (en) * 2004-10-27 2012-04-25 パナソニック株式会社 Imaging device, imaging method, and semiconductor device
CN101409831B (en) * 2008-07-10 2010-10-27 浙江师范大学 Method for processing multimedia video object
CN108833969A (en) * 2018-06-28 2018-11-16 腾讯科技(深圳)有限公司 A kind of clipping method of live stream, device and equipment

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010191955A (en) * 2009-01-26 2010-09-02 Mitsubishi Electric R&D Centre Europe Bv Method and apparatus for processing sequence of image, storage medium and signal
CN102547139A (en) * 2010-12-30 2012-07-04 北京新岸线网络技术有限公司 Method for splitting news video program, and method and system for cataloging news videos
CN102348049A (en) * 2011-09-16 2012-02-08 央视国际网络有限公司 Method and device for detecting position of cut point of video segment
CN102902756A (en) * 2012-09-24 2013-01-30 南京邮电大学 Video abstraction extraction method based on story plots
CN104519401A (en) * 2013-09-30 2015-04-15 华为技术有限公司 Video division point acquiring method and equipment
CN104519401B (en) * 2013-09-30 2018-04-17 贺锦伟 Video segmentation point preparation method and equipment
CN104284241A (en) * 2014-09-22 2015-01-14 北京奇艺世纪科技有限公司 Video editing method and device
CN104394422A (en) * 2014-11-12 2015-03-04 华为软件技术有限公司 Video segmentation point acquisition method and device
WO2016154158A1 (en) * 2015-03-25 2016-09-29 Microsoft Technology Licensing, Llc Machine learning to recognize key moments in audio and video calls
CN109947991A (en) * 2017-10-31 2019-06-28 腾讯科技(深圳)有限公司 A kind of extraction method of key frame, device and storage medium
CN110265057A (en) * 2019-07-10 2019-09-20 腾讯科技(深圳)有限公司 Generate multimedia method and device, electronic equipment, storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Divide-and-conquer based summarization framework for extracting affective video content; Irfan Mehmood et al.; Neurocomputing; 2016-01-22; pp. 393-403 *
Current Status and Prospects of Internet Video and Audio Technology; Luo Yunchuan et al.; Shanghai Culture; 2014-04-20; paragraphs 72-78 *
Research on Content-Based Movie Video Retrieval and Highlight Video Clipping System; Cheng Yuan; China Masters' Theses Full-text Database (Electronic Journal); 2007-08-15; full text *

Similar Documents

Publication Publication Date Title
CN111263234B (en) Video clipping method, related device, equipment and storage medium
CN111131884B (en) Video clipping method, related device, equipment and storage medium
CN109379641B (en) Subtitle generating method and device
US10750116B2 (en) Automatically curating video to fit display time
CN108737908B (en) Media playing method, device and storage medium
CN109819313B (en) Video processing method, device and storage medium
CN109819179B (en) Video editing method and device
US9924205B2 (en) Video remote-commentary synchronization method and system, and terminal device
KR101874895B1 (en) Method for providing augmented reality and terminal supporting the same
WO2017054605A1 (en) Picture processing method and device
US9451178B2 (en) Automatic insertion of video into a photo story
CN102760077A (en) Method and device for self-adaptive application scene mode on basis of human face recognition
WO2019052329A1 (en) Facial recognition method and related product
EP2998960B1 (en) Method and device for video browsing
WO2020228418A1 (en) Video processing method and device, electronic apparatus, and storage medium
US20150341559A1 (en) Thumbnail Editing
CN112312144B (en) Live broadcast method, device, equipment and storage medium
CN109121008B (en) Video preview method, device, terminal and storage medium
CN110662090B (en) Video processing method and system
CN109286772B (en) Sound effect adjusting method and device, electronic equipment and storage medium
CN107948729B (en) Rich media processing method and device, storage medium and electronic equipment
CN108900855B (en) Live content recording method and device, computer readable storage medium and server
CN108460769B (en) image processing method and terminal equipment
WO2018141109A1 (en) Image processing method and device
CN112131438A (en) Information generation method, information display method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40024412

Country of ref document: HK

GR01 Patent grant