CN112153462A - Video processing method, device, terminal and storage medium - Google Patents

Video processing method, device, terminal and storage medium

Info

Publication number
CN112153462A
Authority
CN
China
Prior art keywords
video
segment
target
video scene
duration
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910565725.8A
Other languages
Chinese (zh)
Other versions
CN112153462B (en)
Inventor
赵舒羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910565725.8A
Publication of CN112153462A
Application granted
Publication of CN112153462B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/472End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/47202End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for requesting content on demand, e.g. video on demand
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Abstract

The embodiment of the invention discloses a video processing method, a video processing device, a video processing terminal and a storage medium, wherein the method comprises the following steps: determining a first type of video segment from the target video; dividing the first type of video segments into a video scene segment set according to the similarity among the image frames included in the first type of video segments, wherein the similarity among the image frames included in each video scene segment in the video scene segment set meets the similarity condition; carrying out time length compression processing on each video scene segment according to a time length threshold value; and splicing the video scene segments subjected to the duration compression to obtain a compressed first-class video segment, and splicing a second-class video segment included in the target video and the compressed first-class video segment according to a video playing sequence to obtain a compressed target video. The embodiment of the invention realizes the intelligent time length compression processing of the target video.

Description

Video processing method, device, terminal and storage medium
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a video processing method, an apparatus, a terminal, and a storage medium.
Background
Generally, when a continuous sequence of images changes at more than 24 frames per second, the human eye cannot distinguish the individual static images according to the principle of persistence of vision, and the sequence appears as a smooth, continuous visual effect; such a continuous sequence of images is called a video, for example a movie, a television series, or a short film shot with a shooting device. A user can watch network videos through a terminal, such as watching movies using a mobile phone or watching television series using a tablet.
Due to time constraints, a user may choose accelerated playback in order to save viewing time while watching a video. The terminal then compresses the duration of the video according to the accelerated playing operation selected by the user, shortening the playing duration of the video and achieving accelerated playback. However, some wonderful content may be lost during the duration compression of the target video, which affects the viewing experience of the user. Therefore, how to compress video intelligently has become a hot research issue today.
Disclosure of Invention
The embodiment of the invention provides a video processing method, a video processing device, a video processing terminal and a storage medium, which can realize time length compression processing on a target video intelligently.
In one aspect, an embodiment of the present invention provides a video processing method, including:
determining a first type of video segment from the target video;
dividing the first type of video segments into a video scene segment set according to the similarity among the image frames included in the first type of video segments, wherein the similarity among the image frames included in each video scene segment in the video scene segment set meets the similarity condition;
carrying out time length compression processing on each video scene segment according to a time length threshold value;
splicing the video scene segments subjected to the duration compression to obtain a compressed first type video segment, and splicing a second type video segment included in the target video and the compressed first type video segment according to a video playing sequence to obtain a compressed target video.
In another aspect, an embodiment of the present invention provides a video processing apparatus, including:
the acquisition unit is used for determining a first type of video segment from the target video;
the processing unit is used for dividing the first type of video segment into a video scene segment set according to the similarity among the image frames included in the first type of video segment, and the similarity among the image frames included in each video scene segment in the video scene segment set meets the similarity condition;
the processing unit is further configured to perform duration compression processing on each video scene segment according to a duration threshold;
the processing unit is further configured to splice the video scene segments subjected to the duration compression processing to obtain a compressed first-class video segment, and splice a second-class video segment included in the target video and the compressed first-class video segment according to a video playing sequence to obtain a compressed target video.
In another aspect, an embodiment of the present invention provides a terminal, where the terminal includes:
a processor adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the steps of:
determining a first type of video segment from the target video;
dividing the first type of video segments into a video scene segment set according to the similarity among the image frames included in the first type of video segments, wherein the similarity among the image frames included in each video scene segment in the video scene segment set meets the similarity condition;
carrying out time length compression processing on each video scene segment according to a time length threshold value;
splicing the video scene segments subjected to the duration compression to obtain a compressed first type video segment, and splicing a second type video segment included in the target video and the compressed first type video segment according to a video playing sequence to obtain a compressed target video.
In the embodiment of the invention, a first type of video segment is determined from a target video, wherein the target video is divided into the first type of video segment and a second type of video segment according to an audio sequence included by the target video, the first type of video segment does not include speech information, and the second type of video segment includes speech information; therefore, the situation that wonderful or important lines are missed due to compression can be avoided by compressing the first type of video segment; furthermore, each image frame included in the first type of video segment which needs to be compressed is divided into a video scene segment set, each video scene segment in the video scene segment set is composed of a plurality of image frames, and the similarity between the image frames meets the similarity condition, that is, the similarity between the image frames in each video scene segment is higher. Therefore, even if the image frames included in each video scene segment are reduced after each video scene segment is subjected to time length compression processing, the watching of the video scene segment by a user is not influenced, and finally, the second type video segment of the target video and the first type video segment subjected to compression processing are spliced to obtain the compressed target video. In the compression processing process, only the first type of video segment in the target video is compressed, so that the time length of the target video can be shortened while the power consumption expense of the terminal is saved, the time length for playing the target video is reduced, and the time length compression processing on the target video is intelligently performed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of a data structure of a target video according to an embodiment of the present invention;
FIG. 2a is a schematic diagram of a terminal interface according to an embodiment of the present invention;
FIG. 2b is a schematic diagram of a user interface provided by an embodiment of the present invention;
FIG. 2c is a schematic diagram of another user interface provided by embodiments of the present invention;
FIG. 2d is a schematic diagram of yet another user interface provided by an embodiment of the present invention;
fig. 3 is a schematic flowchart of a video processing method according to an embodiment of the present invention;
fig. 4 is a schematic flow chart of another video processing method according to an embodiment of the present invention;
fig. 5 is a schematic diagram of partitioning a target video according to an embodiment of the present invention;
fig. 6 is a schematic diagram of performing duration compression processing on a video scene segment set according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
An image is a vivid description or portrayal of an objective thing; an image can refer to any picture with a visual effect, such as a photo, a painting, a fax, a movie picture, an electrocardiogram, and the like. A frame can be understood as a time unit of an image, one frame typically being equal to one twenty-fifth of a second. One frame image may refer to a still picture.
Persistence of vision means that after the image of a rapidly moving object seen by the human eye disappears, the eye still retains the image of the object for about 0.1 to 0.4 seconds. For example, consider a disc held on both sides by strings, with a bird drawn on one side and an empty cage on the other: when the disc is rotated rapidly, the bird appears to be inside the cage. From this it can be seen that when the human eye sees a series of continuously changing images, it retains each image for a moment.
Based on the above image and persistence-of-vision theory, the definition of video may be: generally, when a continuous sequence of images changes at more than 24 frames per second, the human eye cannot distinguish a single static image according to the persistence-of-vision principle, and the sequence appears as a smooth, continuous visual effect, so that the continuous images are called a video. In other words, a video is composed of a plurality of images in the time domain, i.e., each video includes a plurality of image frames.
In one embodiment, when playing a video, the concepts of the time axis and the video playing order are involved. The time axis is used for indicating the total duration of the video, the multiple image frames included in one video are arranged in sequence along the time axis, the video playing sequence refers to the sequence of playing the multiple image frames included in the video, and the video playing sequence can be that the image frames are played in sequence according to the time axis sequence.
For example, assuming that the total duration of a video is 5 minutes, the time axis has a duration of 5 minutes, and along the time axis 25 image frames are arranged in time order per second, that is, 25 images are played in sequence every second.
In the research process of video playing, it has been found that some images in a target video (here, the target video may refer to any video that a user is playing or scheduled to play) may be extremely similar, so that to the human eye the picture of the video appears to be in a still state; alternatively, the user may choose to speed up the video due to time constraints. The user can input an accelerated playing operation through a user interface displayed on the terminal, and in response to this operation the terminal can compress the duration of the target video to achieve accelerated playback.
In an embodiment, an embodiment of the present invention provides a video processing method, which can perform time-length compression processing on a target video, specifically: determining a first type of video segment from the target video; dividing the first type of video segments into a video scene segment set according to the similarity among image frames included in the first type of video segments; performing time length compression processing on each video scene segment in the video scene segment set according to a time length threshold; and finally, splicing all the video scene segments subjected to the duration compression processing to obtain a compressed first-class video, and splicing a second-class video included in the target video and the compressed first-class video according to the video playing sequence to obtain a compressed target video.
The first type of video segment refers to a video segment needing compression processing in the target video, and the second type of video segment refers to a video segment needing no compression processing in the target video. In one embodiment, the similarity between image frames may be used to measure the degree of difference between the image frames, with a greater similarity indicating a smaller difference between two image frames, the more similar the two image frames; a smaller similarity indicates a larger difference between the two image frames, the more dissimilar the two image frames are.
In order to clarify the relationship between the target video, the first type of video segment and the video scene segment set, the embodiment of the present invention provides a data structure diagram of the target video as shown in fig. 1, n image frames included in the target video 101 are represented as F1, F2, F3, … Fm … Fq … Fw … Fs … Fn, where m, q, w, s, n are positive integers greater than 1 and n is the largest and m is the smallest; the target video 101 can be divided into a first type video segment 102 and a second type video segment 103 according to a division rule (the division rule is specifically described in the following description); the first type of video segment 102 can include a plurality of groups of picture frames, which can be contiguous or non-contiguous. Similarly, the second type video segment 103 may also include a plurality of groups of picture frames, and the groups of picture frames may be continuous or discontinuous. For example, the plurality of sets of image frames that can be included in the first type of video segment 102 are F1-Fm-1, Fq-Fw-1, Fs-Fn; for example, the plurality of groups of picture frames that can be included in the second type video segment 103 are Fm-Fq-1, Fw-Fs-1. Further, the first type of video segment 102 may be divided into a plurality of video scene segments according to the similarity between the image frames, the plurality of video scene segments constitute the video scene segment set 1021, and each video scene segment may be composed of a plurality of image frames. It should be noted that, for how to divide the first type video segment into a plurality of video scene segments according to the similarity between the image frames, it will be specifically described in the following description.
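For concreteness, the structure of fig. 1 can be sketched in Python; this is only an illustrative reading of the figure, and the type and field names are assumptions rather than anything defined by the embodiment:

    from dataclasses import dataclass
    from typing import List

    Frame = bytes  # stand-in for a decoded image frame; the concrete type is an assumption

    @dataclass
    class VideoSceneSegment:
        frames: List[Frame]  # consecutive image frames whose similarity meets the similarity condition

    @dataclass
    class TargetVideo:
        frames: List[Frame]              # F1 ... Fn in time-axis order
        first_type_ranges: List[range]   # image frame groups to be duration-compressed
        second_type_ranges: List[range]  # image frame groups left untouched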
The video processing method provided by the embodiment of the invention can be applied to any terminal with a video playing function, wherein the terminal can be a mobile phone, a tablet, a notebook computer and other equipment. The terminal can be provided with a video player, and the terminal can play videos through the video player, or the terminal can directly play videos on a webpage. In one embodiment, if the terminal plays a video through a video player, a user may start the video player by clicking a button corresponding to the video player included in the terminal; or, the user can also input a start instruction for starting the video player by waking up a voice assistant of the terminal to start the video player; still alternatively, the user may also start the video player by entering a shortcut gesture in the terminal.
When the accelerated playing operation of the target video is detected, the terminal intelligently compresses the target video by adopting the video processing method provided by the embodiment of the invention, so that the playing time can be saved, and the aim of accelerating the playing of the video is fulfilled.
Application scenarios of embodiments of the present invention are described below with reference to fig. 2 a-2 d. Referring to fig. 2a, 201 denotes a terminal, an application of a player may be installed in the terminal, and a player icon corresponding to the application of the player may be displayed in the terminal, and a user may input a start operation for starting the player by clicking and long-pressing the player icon; after the terminal 201 detects the start operation of the user, the player is started and the user interface of the player is displayed as shown in fig. 2 b. As can be seen from fig. 2b, the user interface of the player may include a plurality of videos, and the user may select a video to be played according to his/her preference, for example, if the user wants to view the video to be played, the user may select the video to be played by clicking, long pressing, or other predetermined manners. If the terminal 201 detects the selection operation of the user on the video to be played, the user interface displaying the video to be played is shown in fig. 2 c.
The user interface of the video to be played may include a play/pause icon 202, and the user may trigger the terminal to start playing the video to be played or pause playing the video to be played by double-clicking or clicking the play icon 202. The user interface of the video to be played may further include a play setting icon 203, and if the user wants to perform play setting on the video to be played, the user may call up a play setting area by clicking the play setting icon 203 as shown in fig. 2d, where 204 in fig. 2d represents the play setting area; the play setting area 204 may include a play speed control area 2041 and other areas 2042, the play speed control area 2041 includes icons corresponding to at least one acceleration mode, the at least one acceleration mode may include intelligent acceleration, 1-time acceleration, 2-time acceleration, and the like, and each acceleration mode corresponds to a similarity threshold; the other area 2042 may include a loop play icon, a download icon, a bullet screen setting icon, and the like.
When the selection operation of the user on any one acceleration mode is detected, the terminal 201 determines a target acceleration mode corresponding to the selection operation of the user, and acquires a similarity threshold corresponding to the target acceleration mode; then determining a similarity condition according to a similarity threshold; further, a target video requiring time length compression processing is determined according to the playing condition of the current video to be played, specifically: if the video to be played is not played yet, determining the video to be played as a target video; if the video to be played has been played for a period of time, the video that is not played in the video to be played can be determined as the target video.
Then the terminal 201 determines the first type of video segment needing duration compression from the target video; divides the first type of video segment according to the similarity between its image frames and the similarity condition determined in advance to obtain a video scene segment set; then performs duration compression processing on each video scene segment in the set; and finally splices the compressed video scene segments to obtain a compressed first type of video segment, and splices the compressed first type of video segment with the second type of video segment to obtain the compressed target video. The duration of the compressed target video is less than that of the target video before compression, so the playing time of the target video is shortened and the aim of accelerated playing is achieved.
Based on the above description, an embodiment of the present invention provides a flow chart diagram of a video processing method, as shown in fig. 3. The video processing method described in fig. 3 may be executed by a terminal, and in particular may be executed by a processor of the terminal. The video processing method shown in fig. 3 may include the steps of:
s301, a first type of video segment is determined from the target video.
Wherein the target video may refer to an unplayed video included in the user interface. In one embodiment, the target video may be a complete piece of video; for example, the terminal obtains a video to be played selected by the user, and before that video starts playing, the terminal detects an accelerated playing operation input by the user; the target video is then the complete video to be played. The terminal can provide multiple acceleration modes, such as intelligent acceleration, 1-time acceleration, 2-time acceleration and the like, and the accelerated playing operation refers to the user's selection of any acceleration mode. For example, the user selects the 53rd episode of a television series he is following as the video to be played, and before detecting an operation to start playing it, the terminal first detects the accelerated playing operation; at this time, the video to be played can be regarded as the target video.
In other embodiments, the target video may also be a portion of a complete video. For example, assume that the terminal acquires a video to be played selected by a user and starts playing it; during playback, the user finds that the video contains considerable redundant content, such as similar pictures or pictures without lines, and inputs an accelerated playing operation. At this time, the portion of the video to be played that has not yet been played by the current moment is determined as the target video. For example, the video to be played selected by the user is a movie with a duration of 100 minutes; an accelerated playing operation by the user is detected during playback, and if the movie has been playing for 20 minutes by the current time, the remaining 80 minutes of movie content are determined as the target video.
In one embodiment, the first type video segment in step S301 refers to a video segment included in the target video that needs duration compression, and the relationship between the target video and the first type video segment may be as shown in fig. 1, where the first type video segment includes a plurality of image frame groups, the image frames within each group are continuous, and the groups themselves may be continuous or discontinuous. For example, the target video comprises image frames F1-Fn, where n is a positive integer greater than 10, and the first type of video segment comprises the image frame groups F1-F4, F7-F10 and F11-Fn; it can be seen that the first image frame group F1-F4 is discontinuous with the second image frame group F7-F10, while the second and third image frame groups are continuous, together spanning F7-Fn.
In one embodiment, the terminal may divide the target video into a first type of video segment and a second type of video segment according to different division rules, where the first type of video segment is a video segment requiring time duration compression processing, and the second type of video segment is a video segment not requiring time duration compression processing. In one embodiment, the partition rule may indicate that the partition is performed according to the presence or absence of speech-related information in the audio sequence corresponding to the target video, the first type of video segment may refer to a video segment corresponding to an audio segment not containing speech-related information, and the second type of video segment may refer to a video segment corresponding to an audio segment containing speech-related information. The speech information may refer to information such as a human dialogue, an aside, an independent inner word, and lyrics in the video. In other embodiments, the division rule may further indicate that the division is performed according to whether the image frames include preset posture information, the first type of video segment may refer to a video segment composed of image frames that do not include the preset posture information, and the second type of video segment may refer to a video segment composed of image frames that include the preset posture information. The preset posture information may include a preset dance action and the like.
S302, dividing the first type of video segments into a video scene segment set according to the similarity among the image frames included in the first type of video segments, wherein the similarity among the image frames included in each video scene segment in the video scene segment set meets the similarity condition.
In one embodiment, the similarity condition may refer to: among the image frames included in a video scene segment, the similarity between the first image frame and each other image frame is greater than the similarity threshold. For example, assuming the image frames included in a video scene segment are F1-F4, the similarity condition is satisfied if: the similarity between F1 and F2, the similarity between F1 and F3, and the similarity between F1 and F4 are all greater than the similarity threshold.
In other embodiments, the similarity condition may also refer to: among the image frames included in a video scene segment, the similarity between every two image frames is greater than the similarity threshold. For example, assuming the image frames included in a video scene segment are F1-F3, the similarity condition is satisfied if: the similarity between F1 and F2, the similarity between F2 and F3, and the similarity between F1 and F3 are all greater than the similarity threshold.
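Both readings of the similarity condition can be written down directly. The following Python sketch assumes a caller-supplied `similarity` function (for example the SSIM measure described below) and is illustrative only; the function names are not from the embodiment:

    from itertools import combinations

    def meets_condition_first_frame(frames, similarity, threshold):
        """First reading: every later frame is similar enough to the first frame."""
        return all(similarity(frames[0], f) > threshold for f in frames[1:])

    def meets_condition_pairwise(frames, similarity, threshold):
        """Second reading: every two frames in the segment are similar enough."""
        return all(similarity(a, b) > threshold for a, b in combinations(frames, 2))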
Wherein, the similarity threshold may be preset by the terminal. Optionally, multiple acceleration modes may be preset in the terminal, each acceleration mode corresponds to one similarity threshold, and when an accelerated play operation is detected, the similarity threshold is determined according to the acceleration mode included in the accelerated play operation.
In one embodiment, the terminal may divide the first type video segment into a plurality of video scene segments according to the similarity between the image frames included in the first type video segment and the similarity condition. Wherein, the similarity between two image frames can be evaluated by Structural Similarity Index (SSIM). Using the two image frames F1 and Fn as two inputs of the SSIM algorithm, a value between 0 and 1 can be obtained, a larger value indicating a larger similarity between the two image frames and a smaller value indicating a smaller similarity between the two image frames.
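The embodiment only names the SSIM index; as one possible realization, the score can be computed with scikit-image and OpenCV (the library choice and the grayscale conversion are assumptions, not part of the patent):

    import cv2
    from skimage.metrics import structural_similarity

    def frame_similarity(frame_a, frame_b):
        """SSIM between two same-sized BGR frames; larger values mean more similar."""
        gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
        gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
        return structural_similarity(gray_a, gray_b)

Frames read with cv2.VideoCapture can be passed to this function directly.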
In an embodiment, the terminal may divide the first type video segment into a plurality of video scene segments according to the similarity between the image frames included in the first type video segment and the similarity condition as follows: starting from the first image frame of the first type of video segment, sequentially find the consecutive image frames whose similarity to the first image frame is greater than or equal to the similarity threshold, until an nth image frame is met whose similarity to the first image frame is less than the similarity threshold, where n is a positive integer greater than 1, and compose F1-Fn-1 into a video scene segment. By analogy, the first type of video segment may be divided into a plurality of video scene segments, which constitute a video scene segment set.
For example, suppose the first type of video segment includes F1-F6. Starting from the first image frame F1, if the consecutive frames up to F4 all have a similarity to F1 greater than or equal to the similarity threshold, while the similarity between F5 and F1 is less than the similarity threshold, then F1-F4 are combined into a video scene segment.
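This anchor-based grouping can be sketched as follows; `similarity` is again a caller-supplied function and the function name is an assumption:

    def split_into_scene_segments(frames, similarity, threshold):
        """Each segment runs from an anchor frame up to (excluding) the first
        frame whose similarity to that anchor drops below the threshold."""
        segments, start = [], 0
        while start < len(frames):
            end = start + 1
            while end < len(frames) and similarity(frames[start], frames[end]) >= threshold:
                end += 1
            segments.append(frames[start:end])  # e.g. F1-F4 in the example above
            start = end                         # the breaking frame anchors the next segment
        return segments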
In other embodiments, the terminal may divide the first type video segment into a plurality of video scene segments according to the similarity between the image frames included in the first type video segment and the similarity condition as follows: starting from the first image frame of the first type of video segment, sequentially find the consecutive image frames whose similarity to the first image frame is greater than or equal to the similarity threshold, until an mth image frame is met whose similarity to the first image frame is less than the similarity threshold, obtaining an image frame set F1-Fm-1; then, within the image frame set F1-Fm-1, starting from the second image frame, sequentially find the consecutive image frames whose similarity to the second image frame is greater than or equal to the similarity threshold, until a wth image frame is met whose similarity to the second image frame is less than the similarity threshold, where w is smaller than m, obtaining another image frame set F1-Fw-1; the searching step is performed again within the image frame set F1-Fw-1, until the similarity comparison has been performed between every two image frames in the current set. The image frames in the set obtained at this point constitute one video scene segment. This step is iterated until no ungrouped image frames remain in the first type of video segment, so that the first type of video segment is divided into a plurality of video scene segments, which form the video scene segment set.
For example, assuming that the first type of video segment is F1-F8: first, starting from F1, sequentially find backwards the consecutive image frames F2-F6 whose similarity to F1 is greater than or equal to the similarity threshold, obtaining the image frame set F1-F6; then, within F1-F6, starting from F2, sequentially find the consecutive image frames F3-F5 whose similarity to F2 is greater than or equal to the similarity threshold, obtaining another image frame set F1-F5; then within F1-F5, starting from F3, find backwards the consecutive frame F4 whose similarity to F3 is greater than or equal to the similarity threshold, obtaining the image frame set F1-F4, and F1-F4 can be combined into one video scene segment. Next, starting from F5, the above process is repeated to divide the rest of the first type of video segment into video scene segments.
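The stricter variant keeps shrinking the candidate group until every pair of frames inside it passes the threshold; the iterative re-checking from F2, F3, ... described above converges to the longest such leading run. A sketch under the same assumptions as before:

    def all_pairs_prefix(frames, similarity, threshold):
        """Longest leading run in which every pair of frames meets the threshold."""
        end = len(frames)
        i = 0
        while i < end:
            j = i + 1
            while j < end and similarity(frames[i], frames[j]) >= threshold:
                j += 1
            end = j  # the first frame dissimilar to anchor i caps the group
            i += 1
        return frames[:end]

    def split_all_pairs(frames, similarity, threshold):
        """Iterate the prefix search over the remainder to build the segment set."""
        segments = []
        while frames:
            segment = all_pairs_prefix(frames, similarity, threshold)
            segments.append(segment)
            frames = frames[len(segment):]
        return segments

On the F1-F8 example above, all_pairs_prefix returns F1-F4, and the split then restarts from F5.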
And S303, carrying out time length compression processing on each video scene segment according to the time length threshold value.
In one embodiment, the duration threshold may be preset by the terminal, for example, the terminal sets one duration threshold for each acceleration mode. Optionally, the terminal may determine the duration threshold according to an acceleration mode included in the received accelerated playback operation.
In one embodiment, the performing of the duration compression process on each video scene segment in step S303 may include: and carrying out time length compression processing on the video scene fragments with the time length larger than the time length threshold value in each video scene fragment so as to ensure that the time length of each video scene fragment is not larger than the time length threshold value.
Assuming that the video scene segments include a target video scene segment whose duration is greater than the duration threshold, where the target video scene segment is any one of the video scene segments, the following description takes the target video scene segment as an example of duration compression processing. The duration compression processing may refer to: directly cutting out, from the target video scene segment whose duration is greater than the duration threshold, a segment whose duration is equal to or less than the duration threshold, and using it as the compressed target video scene segment. For example, if the duration threshold is 3 seconds and the target video scene segment lasts 5 seconds, the first 3 seconds can be cut out directly from the target video scene segment as the duration-compressed target video scene segment, or the middle 3 seconds or the last 3 seconds can be cut out instead.
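With frames and a frame rate in hand, this truncation is a slice; the `fps` parameter and the `mode` switch below are illustrative assumptions:

    def compress_scene_segment(frames, fps, duration_threshold_s, mode="head"):
        """Keep at most duration_threshold_s worth of frames: the first,
        middle, or last portion, matching the three options in the text."""
        max_frames = int(duration_threshold_s * fps)
        if len(frames) <= max_frames:
            return frames                      # already within the threshold
        if mode == "head":
            return frames[:max_frames]
        if mode == "tail":
            return frames[-max_frames:]
        offset = (len(frames) - max_frames) // 2
        return frames[offset:offset + max_frames]  # "middle" mode

With the numbers above (25 frames per second, a 3-second threshold, a 5-second segment of 125 frames), 75 frames are kept.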
S304, splicing the video scene segments subjected to the duration compression processing to obtain a compressed first-class video segment, and splicing a second-class video segment included in the target video and the compressed first-class video segment according to the video playing sequence to obtain a compressed target video.
In an embodiment, the compressed first-class video segment is obtained by splicing the video scene segments subjected to the duration compression processing. Because the duration of each video scene segment is reduced, the duration of the spliced first-class video segment is reduced as well, which is equivalent to performing duration compression on the first-class video segment itself. Consequently, the duration of the target video obtained by splicing the second-class video segment included in the target video with the duration-compressed first-class video segment is also reduced, shortening the time needed to play the target video.
In an embodiment, the splicing processing of the video scene segments after the duration compression processing may refer to directly splicing the video scene segments after the duration compression processing together, or may refer to splicing the video scene segments after the duration compression processing together by using a splicing tool.
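Treating each segment as a list of frames, both splicing steps reduce to concatenation in play order; a minimal sketch (the variable names in the comments are hypothetical):

    from itertools import chain

    def splice(segments):
        """Concatenate segments end to end along the time axis."""
        return list(chain.from_iterable(segments))

    # compressed_first = splice(compressed_scene_segments)
    # compressed_target = splice(pieces_in_play_order)  # second-type and compressed
    #                                                   # first-type pieces, interleaved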
In the embodiment of the invention, each image frame included in the first type of video segment needing compression processing is divided into the video scene segment set, each video scene segment in the video scene segment set is composed of a plurality of image frames, and the similarity among the image frames meets the similarity condition, namely the similarity among the image frames in each video scene segment is higher. Therefore, even if the image frames included in each video scene segment are reduced after each video scene segment is subjected to time length compression processing, the watching of the video scene segment by a user is not influenced, and finally, the second type video segment of the target video and the first type video segment subjected to compression processing are spliced to obtain the compressed target video. The time length of the compressed target video is shortened, the time length for playing the target video is reduced, and the time length compression processing of the target video is intelligently carried out.
Fig. 4 is a schematic flow chart of another video processing method according to an embodiment of the present invention. The video processing method shown in fig. 4 may be performed by a terminal, specifically by a processor of the terminal, and may include the following steps:
s401, obtaining the video to be played and displaying the user interface for playing the video to be played.
The video to be played is the video to be played or scheduled to be played selected by the user. After the video to be played is obtained, if an instruction for starting playing the video to be played is detected, displaying the video to be played which is being played in a user interface; if the instruction for starting playing the video to be played is not detected, the first image frame of the video to be played or the video cover corresponding to the video to be played can be displayed in the user interface.
In one embodiment, a play setting icon may be included in the user interface, and a user may call up a play setting area included in the user interface by selecting the icon, where the play setting area may include a play speed control area, the play speed control area includes at least one acceleration mode, and the play control area is configured to receive an acceleration play operation of the user. The acceleration mode may include intelligent acceleration, 1-time acceleration, 2-time acceleration, 1.5-time acceleration, and the like.
In one embodiment, the play speed control region may be used to receive accelerated play operations by the user. Optionally, if the user selects a selection item corresponding to any one of the at least one acceleration mode by clicking, long-pressing, sliding, or the like, it may be determined that the user has input an accelerated play operation; alternatively, the user may input the accelerated play operation through voice control.
In other embodiments, the user may input the selection operation of the acceleration mode within a preset position range of the user interface. For example, the user may input preset gesture information within a preset position range, or a preset slide operation.
In one embodiment, if the user' S selection operation of at least one acceleration mode is detected, it is determined that the user inputs an accelerated play operation, and the terminal performs steps S402 to S408 according to the accelerated play operation in response to the accelerated play operation for the purpose of accelerated play.
S402, responding to the received accelerated playing operation, determining a target acceleration mode included in the accelerated playing operation, and determining a similarity threshold corresponding to the target acceleration mode.
As can be seen from the above, the accelerated playing operation received by the terminal may be the user's selection of one of at least one acceleration mode, where the acceleration mode selected by the user is the target acceleration mode corresponding to the accelerated playing operation. For example, when the user clicks intelligent acceleration, the terminal receives the accelerated playing operation, and the target acceleration mode corresponding to this operation is intelligent acceleration.
In other embodiments, the accelerated playing operation received by the terminal may also refer to that the user inputs preset gesture information or preset sliding operation at a preset position of the user interface, in this case, the terminal may preset a corresponding relationship between the gesture information and the acceleration mode, or a corresponding relationship between the sliding operation and the acceleration mode, and then determine a target acceleration mode corresponding to the accelerated playing operation of the user according to the gesture information or the sliding operation input by the user within the preset position range.
In one embodiment, the terminal may preset a similarity threshold for each acceleration mode according to empirical values. A larger acceleration multiple implies a shorter playing time and more redundant content to be removed from the target video, so the corresponding similarity threshold is smaller; for example, the similarity threshold set for 1-time acceleration may be 0.8, and the similarity threshold set for 2-time acceleration may be 0.7.
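Such a preset can be as simple as a lookup table; only the 0.8 and 0.7 values come from the text, the rest of this sketch is assumed:

    SIMILARITY_THRESHOLDS = {
        "1x": 0.8,     # from the example above
        "1.5x": 0.75,  # placeholder between the two documented values
        "2x": 0.7,     # from the example above
    }

    def threshold_for(acceleration_mode: str) -> float:
        """Larger acceleration multiples map to smaller similarity thresholds."""
        return SIMILARITY_THRESHOLDS[acceleration_mode]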
In one embodiment, for intelligent acceleration, the terminal may set the corresponding similarity threshold according to the feature information of the target video. The feature information of the target video may include rating information and historical acceleration information. The rating information may refer to evaluations left by other users when watching the target video, including evaluations of the target video content, such as "too many repeated pictures in the video; it is suggested to cut the useless shots" or "wonderful, the plot is tightly linked, recommended". The historical acceleration information may be the accelerated-playing behavior of other users when watching the target video, for example, that user A used 2-time accelerated playing, user B used 1.5-time accelerated playing, and the like.
Specifically, the manner in which the terminal sets the similarity threshold corresponding to intelligent acceleration may include: the terminal acquires the feature information of the target video according to the identification information of the target video, and determines the similarity threshold corresponding to intelligent acceleration according to the feature information. The identification information of the target video may refer to the name of the target video plus the year of release, for example, "A Story Sadder Than Sadness, 2019"; alternatively, the identification information of the target video may refer to the name of the target video plus the names of the characters in the video, such as "Charming Dynasty".
In one embodiment, the determining a similarity threshold corresponding to the intelligent acceleration according to the feature information may include: determining a target scoring result of the target video according to the characteristic information; and determining the similarity threshold corresponding to the target scoring result as the similarity threshold corresponding to the intelligent acceleration according to the corresponding relation between the preset scoring result and the similarity threshold.
In one embodiment, determining the target video scoring result according to the feature information may include: determining the target scoring result of the target video according to the scoring information and the historical acceleration information. Specifically: a first scoring rule and weight value corresponding to the scoring information, and a second scoring rule and weight value corresponding to the historical acceleration information, are preset; the scoring information is processed according to the first scoring rule to obtain an evaluation score; the historical acceleration information is processed according to the second scoring rule to obtain a historical acceleration score; the evaluation score is multiplied by the weight value corresponding to the scoring information, the historical acceleration score is multiplied by the weight value corresponding to the historical acceleration information, and the two weighted scores are combined to obtain the target scoring result of the target video.
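Reading the combination as a weighted sum (an assumption; the weights below are placeholders, not values from the embodiment), the target scoring result is:

    def target_score(evaluation_score, history_score, w_eval=0.6, w_history=0.4):
        """Weighted combination of the rating-based and history-based scores."""
        return w_eval * evaluation_score + w_history * history_score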
In other embodiments, the determining the target video scoring result according to the feature information may include: and determining a target scoring result of the target video according to the scoring information or the historical acceleration information. Specifically, the method comprises the following steps: taking the evaluation score as a target scoring result of the target video; alternatively, the historical acceleration score is taken as the target scoring result of the target video.
S403, acquiring a target video according to the video to be played, and acquiring a video sequence and an audio sequence included in the target video, wherein the video sequence corresponds to the audio sequence.
In an embodiment, obtaining the target video according to the video to be played may refer to: when it is detected that part of the video to be played has already been played at the moment the accelerated playing operation is received, taking the video content remaining in the video to be played as the target video.
In one embodiment, it should be understood that a piece of video is composed of images and sounds, the images in the video may be referred to as a video sequence of the video, the sounds in the video may be referred to as an audio sequence of the video, and the video sequence and the audio sequence are in a corresponding relationship with each other.
Optionally, in the embodiment of the present invention, the execution sequence of step S402 and step S403 is not limited, and step S403 may be executed first, and then step S402 is executed.
S404, dividing the audio sequence into a first type of audio segment and a second type of audio segment according to a speech recognition algorithm, and determining the video content corresponding to the first type of audio segment in the video sequence as a first type of video segment, wherein the first type of video segment does not include speech information.
In one embodiment, the speech information may include the dialogue of characters in the target video, and may also include the lyrics of songs inserted in the video, or voice-over speech. A speech recognition algorithm can recognize speech in an audio sequence; one embodiment of dividing the audio sequence into a first type of audio segment and a second type of audio segment according to the speech recognition algorithm may include: sequentially recognizing the audio sequence along the time axis with the speech recognition algorithm, determining the audio segments without speech-line information as first-class audio segments, and determining the audio segments with speech-line information as second-class audio segments.
For example, referring to fig. 5, a schematic diagram of a partitioning method for a target video according to an embodiment of the present invention is shown, where the target video shown in fig. 5 includes a video sequence and an audio sequence; and adopting a voice recognition algorithm to recognize the audio sequence, determining an audio segment comprising the speech-line information and an audio segment not comprising the speech-line information, and forming the audio segment not comprising the speech-line information into a first-class audio segment, wherein the video content corresponding to the first-class audio segment is a first-class video segment.
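Given some speech detector, the division reduces to grouping consecutive audio windows by whether line information was recognised in them; `has_line_info` below is a hypothetical stand-in for the speech recognition step, not a real API:

    from itertools import groupby

    def split_audio_by_lines(audio_windows, has_line_info):
        """Return (contains_lines, [window indices]) runs in time-axis order."""
        segments = []
        keyed = lambda i: has_line_info(audio_windows[i])
        for contains_lines, run in groupby(range(len(audio_windows)), key=keyed):
            segments.append((contains_lines, list(run)))
        return segments

The video content aligned with the runs where contains_lines is False then forms the first-class video segment.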
In other embodiments, the terminal may further divide the target video into the first type video segment and the second type video segment according to whether preset posture information is included in the image frame. The preset posture information may include a dance posture and the like, and the judgment of whether the image frame includes the preset posture information may be implemented by an image recognition algorithm. Specifically, the method comprises the following steps: acquiring each image frame included in the target video, and identifying each image frame of the target video by adopting an image identification algorithm to obtain an identification result; and determining a first type of video segment included by the target video according to the recognition result, wherein the first type of video segment does not include preset attitude information. The determining that the target video includes the first type of video segment according to the identification result may include: and composing the image frames which are identified not to comprise the preset posture information into a first type video segment.
S405, determining a first type of video segment from the target video.
S406, the first type of video segments are divided into a video scene segment set according to the similarity among the image frames included in the first type of video segments, and the similarity among the image frames included in each video scene segment in the video scene segment set meets the similarity condition.
In an embodiment, some possible implementations included in step S405 may refer to descriptions of relevant parts in the embodiment in fig. 3, and are not described herein again.
In one embodiment, the implementation of step S406 may include: selecting a starting image frame from image frames included in the first type of video segments according to a video playing sequence; sequentially traversing each image frame behind the initial image frame, and if the similarity between the currently traversed current image frame and the initial image frame is detected to be smaller than a similarity threshold, determining the current image frame as an end image frame; composing the starting image frame and the image frames between the starting image frame and the ending image frame into a video scene segment; and repeating the steps to obtain a plurality of video scene segments, and forming a video scene segment set by the plurality of video scene segments.
As can be seen from the foregoing, the video playing sequence refers to a sequence indicated by a time axis, and the principle of selecting a starting image frame from the image frames included in the first type of video segment may be: selecting an end frame in the last traversal process as a starting image frame of the next traversal process, for example, the first type of video segment includes 6 image frames, which are respectively F1-F6; during the first pass, all 6 image frames are image frames which are not traversed, the image frames are sequentially selected according to the principle, and F1 is determined as the initial image frame of the first pass; if during the first traversal, F2-F4 was traversed and the traversal stopped at F4, i.e., F4 is the end image frame of the first traversal; in the next pass, the end image frame F4 in the previous pass is used as the start image frame in the next pass.
In each traversal process, if the similarity between the currently traversed image frame and the starting image frame is detected to be smaller than the similarity threshold, that image frame is determined as the end image frame of the current traversal. For example, if the first type of video segment includes F1-F6 and the first traversal takes F1 as the starting image frame, then, while sequentially traversing the image frames after F1, if the similarity between F2 and F1 is greater than the similarity threshold, the similarity between F3 and F1 is greater than the similarity threshold, and the similarity between F4 and F1 is less than the similarity threshold, F4 may be determined as the ending image frame of this traversal.
After the starting image frame and the end image frame of the first traversal are determined, the starting image frame and each image frame between the starting image frame and the end image frame are composed into one video scene segment, e.g., F1-F3 are composed into video scene segment A. The second traversal then starts with the end image frame F4 of the first traversal as its starting image frame, and the traversal steps are repeated to determine another video scene segment B. The above process is repeated until every image frame in the first type of video segment has been traversed; the plurality of video scene segments generated in this process form the video scene segment set.
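The traversal just described can be summarized in a minimal sketch; Python is used purely for illustration, and similarity is an assumed frame-comparison function (e.g. a histogram or feature comparison), which the patent leaves open:

```python
def split_into_scene_segments(frames, similarity, threshold):
    """Split a first-type video segment into video scene segments.

    Frames are traversed in play order; a traversal ends at the first frame
    whose similarity to the starting image frame drops below the threshold,
    and that end image frame starts the next traversal.
    """
    if not frames:
        return []
    segments = []
    start = 0  # index of the starting image frame of the current traversal
    for i in range(1, len(frames)):
        if similarity(frames[start], frames[i]) < threshold:
            segments.append(frames[start:i])  # e.g. F1-F3 form segment A
            start = i                         # e.g. F4 starts the next traversal
    segments.append(frames[start:])           # remaining frames form the last segment
    return segments
```

On the F1-F6 example above, with F4 the first frame dissimilar to F1 and the remaining frames mutually similar, this returns [F1-F3] and [F4-F6], i.e. video scene segments A and B.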
S407, performing duration compression processing on each video scene segment according to the duration threshold.
Assume that the video scene segment set includes a target video scene segment, which may be any video scene segment in the set. Step S407 is described below by taking the duration compression processing of the target video scene segment as an example; the other video scene segments in the set may be processed with the same duration compression method.
In one embodiment, the duration compression processing on the target video scene segment may include: if the duration of the target video scene segment is greater than the duration threshold, clipping the target video scene segment to obtain a clipped target video scene segment, wherein the duration of the clipped target video scene segment is not greater than the duration threshold; if the duration of the target video scene segment is not greater than the duration threshold, keeping the target video scene segment unchanged. That is to say, before the duration compression processing is performed, it is first judged whether the duration of the target video scene segment is greater than the duration threshold; if it is greater, the step of clipping the target video scene segment is executed; if it is not greater, the target video scene segment may be kept unchanged.
In an embodiment, the clipping processing on the target video scene segment may include: extracting at least one sub-segment to be spliced from the target video scene segment according to a duration extraction rule; and forming the clipped target video scene segment from the at least one sub-segment to be spliced. The duration extraction rule may comprise extracting, from the target video scene segment according to the duration threshold, one sub-segment to be spliced whose duration is not greater than the duration threshold; or the duration extraction rule may comprise extracting, from the target video scene segment according to the duration threshold, at least two sub-segments to be spliced whose total duration is not greater than the duration threshold.
In general, in order to ensure that the video after the duration compression processing is as close as possible to the uncompressed video, the duration of the sub-segment to be spliced, or the total duration of the sub-segments to be spliced, is usually equal to the duration threshold. The following description takes this case as an example.
If the duration extraction rule refers to extracting, from the target video scene segment, one sub-segment to be spliced whose duration is equal to the duration threshold, it can be understood as: selecting, from the target video scene segment, continuous video content whose duration is equal to the duration threshold. In this case, the obtained sub-segment to be spliced can be determined as the clipped target video scene segment. For example, assuming that the duration threshold is equal to T seconds, the video content corresponding to the first T seconds of the target video scene segment may be directly extracted as the sub-segment to be spliced; or continuous video content with a duration equal to T seconds may be extracted from any position as the sub-segment to be spliced.
If the duration extraction rule refers to extracting at least two sub-segments to be spliced from the target video scene segment according to the duration threshold, it can be understood as: selecting at least two continuous or discontinuous pieces of video content from the target video scene segment, the total duration of which is equal to the duration threshold. For example, if the duration threshold is 2 seconds, the video content of seconds 1-2 can be extracted from the target video scene segment as one sub-segment to be spliced, and the video content of seconds 3-4 can be extracted as another sub-segment to be spliced; the total duration of the two sub-segments to be spliced is equal to the duration threshold. The at least two sub-segments to be spliced may be extracted linearly, for example at seconds 0-1, 2-3, 4-5, and so on; or they may be extracted randomly, for example at seconds 2-3 and 6-7.
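A non-authoritative sketch of the two extraction rules, assuming the segment is handled as a frame list at a fixed frame rate fps:

```python
def extract_subsegments(frames, fps, duration_threshold, linear=False):
    """Clip a scene segment whose duration exceeds the duration threshold.

    linear=False: keep one contiguous sub-segment whose duration equals the
    threshold (here simply the first T seconds, one allowed choice).
    linear=True: keep alternating one-second windows (seconds 0-1, 2-3,
    4-5, ...) until the kept duration reaches the threshold.
    """
    target = int(duration_threshold * fps)  # number of frames in T seconds
    if not linear:
        return frames[:target]              # one sub-segment to be spliced
    kept = []
    second = 0
    while len(kept) < target and second * fps < len(frames):
        window = frames[second * fps:(second + 1) * fps]
        kept.extend(window[:target - len(kept)])  # never exceed the threshold
        second += 2                               # skip every other second
    return kept
```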
Based on the above description, how duration compression processing is performed on a plurality of video scene segments can be understood from fig. 6, a schematic diagram of performing duration compression processing on a plurality of video scene segments according to an embodiment of the present invention. Assume that the plurality of video scene segments obtained by dividing the first type of video segment include video scene segment A, video scene segment B and video scene segment C, where video scene segment A is composed of image frames F1-Fn-1, video scene segment B is composed of Fn-Fw-1, and video scene segment C is composed of Fw-Fq, where n, w and q are positive integers that are not equal to each other, q being the largest and n the smallest. Assume further that the duration threshold is 5 seconds, the duration of video scene segment A is equal to 5 seconds, the duration of video scene segment B is 8 seconds, and the duration of video scene segment C is 3 seconds.
It is judged that the durations of video scene segment A and video scene segment C are not greater than the duration threshold, so no duration compression processing is performed on them and they are kept unchanged; the duration of video scene segment B is greater than the duration threshold, so duration compression processing needs to be performed on video scene segment B, compressing its duration to be equal to or less than the duration threshold. As can be seen from fig. 6, the duration of video scene segment B is reduced, and accordingly the duration of the compressed first type of video segment obtained by splicing the three video scene segments is also reduced.
In other embodiments, the terminal may also perform clipping processing on the target video scene segment based on key frame extraction. Specifically, the clipping the target video scene segment to obtain a clipped target video scene segment may include: determining the target number of required image frames to be spliced according to the duration threshold and the duration of each image frame included in the target video scene segment; cutting out the target number of image frames to be spliced from the target video scene segment; and splicing the target number of image frames to be spliced to obtain the clipped target video scene segment.
In brief, according to the duration threshold and the duration of each image frame included in the target video scene segment, it is determined how many image frames to be spliced make up a piece of video content whose duration is equal to the duration threshold. For example, if the duration threshold is 2 seconds and the duration of each image frame is 40 milliseconds, the target number of image frames to be spliced is 2000/40 = 50. Then 50 image frames are selected from the target video scene segment to form the clipped target video scene segment.
In one embodiment, selecting 50 image frames from the target video scene segment may refer to: randomly selecting 50 image frames from the target video scene segment; or selecting 50 image frames according to a certain selection rule. For example, assuming the duration of the target video scene segment is 3 seconds and each second includes 25 image frames, the selection rule may be to select 16 image frames from each of the frames in seconds 0-1 and 1-2, and 18 image frames from the frames in seconds 2-3 (16 + 16 + 18 = 50). The above two cases are merely examples of the embodiment of the present invention; in a specific application, the selection rule may be determined according to the actual situation.
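A minimal sketch of this key-frame style clipping, using an evenly spaced selection rule as one concrete choice (the patent equally allows random selection or per-second quotas such as the 16 + 16 + 18 example above):

```python
def crop_by_frame_count(frames, frame_duration_ms, duration_threshold_s):
    """Determine the target number of image frames to be spliced from the
    duration threshold and the per-frame duration, then pick that many
    frames spread evenly across the target video scene segment."""
    target = int(duration_threshold_s * 1000 / frame_duration_ms)  # 2000 / 40 = 50
    if target >= len(frames):
        return list(frames)        # segment is already short enough
    step = len(frames) / target    # evenly spaced sampling
    return [frames[int(i * step)] for i in range(target)]
```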
S408, splicing the video scene segments subjected to the duration compression processing to obtain a compressed first type of video segment, and splicing a second type of video segment included in the target video and the compressed first type of video segment according to the video playing sequence to obtain a compressed target video.
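The splicing in step S408 is plain concatenation in play order; a sketch, assuming each segment carries its original start time:

```python
def splice_segments(segments):
    """segments: list of (start_time, frames) pairs, mixing compressed
    first-type segments and unchanged second-type segments; returns all
    frames in the video playing sequence."""
    spliced = []
    for _, frames in sorted(segments, key=lambda seg: seg[0]):
        spliced.extend(frames)
    return spliced
```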
In summary, in the embodiment of the present invention, a first type of video segment requiring duration compression processing is determined from the target video according to the audio of the target video, and is then divided into a plurality of video scene segments according to the similarity between the image frames it includes; each video scene segment is compressed in turn, the compressed video scene segments are spliced to obtain a compressed first type of video segment, and finally the compressed first type of video segment and the second type of video segment included in the target video are spliced to obtain a compressed target video. Because the first type of video segment does not include speech information, compressing its duration shortens the playing duration of the target video and speeds up playback while ensuring that the user does not miss key speech, thereby realizing intelligent duration compression processing of the video.
In other embodiments, the terminal may also perform duration compression processing on the target video directly according to the similarity between the image frames of the target video, without dividing the target video according to the audio it includes. In a specific embodiment: acquiring the image frames in each time interval distributed in sequence on the time axis of the target video; selecting a preset number of target image frames from the image frames of each time interval; comparing the image similarity of target image frames that are adjacent in time sequence on the time axis to obtain a similarity value between each pair of adjacent target image frames; taking the time interval in which the target image frames whose similarity values are greater than a preset threshold are located as an acceleration interval of the time axis; and when the target video is played, playing the image frames in the acceleration interval at a preset accelerated speed.
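A minimal sketch of this alternative, again with an assumed similarity function; each time interval keeps its frames, a preset number of target image frames is sampled per interval, and intervals whose sampled frames stay highly similar are marked as acceleration intervals:

```python
def acceleration_intervals(interval_frames, similarity, preset_threshold, samples=3):
    """interval_frames: list of (interval, frames) in time-axis order.
    Returns the intervals to be played at the preset accelerated speed."""
    accelerated = []
    for interval, frames in interval_frames:
        step = max(1, len(frames) // samples)
        targets = frames[::step][:samples]       # preset number of target frames
        pairs = list(zip(targets, targets[1:]))  # adjacent in time sequence
        if pairs and all(similarity(a, b) > preset_threshold for a, b in pairs):
            accelerated.append(interval)
    return accelerated
```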
Practice shows that, compared with the former method (the embodiment in fig. 4), this duration compression method can save part of the power consumption overhead of the terminal and speed up the playing of the target video, but it may cause the user to miss important speech information. In practical applications, the terminal can select the duration compression method according to the actual needs of the user. For example, the terminal may display two compression options in the user interface: compression that does not lose speech information, and energy-saving compression. If the user selects compression without losing speech information, the method of the embodiment in fig. 4 is used to perform duration compression processing on the target video; if the user selects energy-saving compression, the latter method is used to compress the target video.
In the embodiment of the invention, after the video to be played is acquired, the video to be played can be displayed in the user interface. If an accelerated playing operation input by the user in the user interface is detected, a similarity threshold corresponding to the accelerated playing operation is acquired. Then, a target video is acquired according to the video to be played, the video sequence and the audio sequence included in the target video are acquired, the audio sequence is divided into a first type of audio segment and a second type of audio segment by a speech recognition algorithm, and the video content corresponding to the first type of audio segment is determined as the first type of video segment, which does not include speech information. Further, the first type of video segment is divided into a plurality of video scene segments according to the similarity between the image frames it includes, duration compression processing is performed on the video scene segments according to the duration threshold, the video scene segments subjected to duration compression processing are spliced to obtain a compressed first type of video segment, and the second type of video segment included in the target video and the compressed first type of video segment are spliced according to the video playing sequence to obtain a compressed target video.
In the above video processing procedure, the first type of video segment is determined based on the audio included in the target video, and duration compression processing is then performed on the first type of video segment based on its image frames, thereby achieving duration compression of the target video. Because the first type of video segment does not contain speech information, compressing it does not cause the user to miss important lines or key information; the video processing method thus shortens the playing duration of the target video while ensuring the integrity of the speech information.
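The audio-based determination of the first type of video segment can likewise be sketched. Here is_speech is a hypothetical speech-detection predicate over fixed-length audio windows (e.g. a voice activity detector), standing in for the speech recognition algorithm, which the patent does not pin down:

```python
def first_type_segments(audio_windows, is_speech):
    """Classify fixed-length audio windows, then map each run of non-speech
    windows to a first-type segment, given as a half-open (start, end)
    pair of window indices along the time axis."""
    segments = []
    run_start = None
    for i, window in enumerate(audio_windows):
        if not is_speech(window):
            if run_start is None:
                run_start = i          # a non-speech run begins
        elif run_start is not None:
            segments.append((run_start, i))
            run_start = None
    if run_start is not None:          # run extends to the end of the audio
        segments.append((run_start, len(audio_windows)))
    return segments
```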
Based on the description of the video processing method, the embodiment of the invention also discloses a video processing device, and the video processing device can execute the methods shown in fig. 3 and fig. 4. Referring to fig. 7, the video processing apparatus may operate as follows:
a determining unit 701, configured to determine a first type of video segment from a target video;
the processing unit 702 is configured to divide a first type of video segment into a set of video scene segments according to similarities between image frames included in the first type of video segment, where the similarities between the image frames included in the video scene segments in the set of video scene segments satisfy a similarity condition;
the processing unit 702 is further configured to perform duration compression processing on each video scene segment according to a duration threshold;
the processing unit 702 is further configured to splice the video scene segments subjected to the duration compression processing to obtain a compressed first-class video segment, and splice a second-class video segment included in the target video and the compressed first-class video segment according to a video playing sequence to obtain a compressed target video.
In one embodiment, the determining unit 701 is further configured to: acquire a target video, and acquire a video sequence and an audio sequence included in the target video, wherein the video sequence and the audio sequence correspond to each other; the processing unit 702 is further configured to: divide the audio sequence into a first type of audio segment and a second type of audio segment according to a speech recognition algorithm; and the processing unit 702 is further configured to: determine the video content corresponding to the first type of audio segment in the video sequence as the first type of video segment, wherein the first type of video segment does not include speech information.
In one embodiment, the determining unit 701 is further configured to: acquire each image frame included in the target video, and recognize each image frame included in the target video by adopting an image recognition algorithm to obtain a recognition result; the processing unit 702 is further configured to: determine the first type of video segment included in the target video according to the recognition result, wherein the first type of video segment does not include preset posture information.
In one embodiment, the processing unit 702 performs the following operations when dividing the first type of video segment into a set of video scene segments according to the similarity between the image frames included in the first type of video segment: according to the video playing sequence, selecting a starting image frame from image frames included in the first type of video segment; sequentially traversing each image frame behind the initial image frame, and if the similarity between the currently traversed current image frame and the initial image frame is detected to be smaller than a similarity threshold, determining the current image frame as an end image frame; composing the starting image frame and the image frames between the starting image frame and the ending image frame into a video scene segment; and repeating the steps to obtain a plurality of video scene segments, and forming a video scene segment set by the plurality of video scene segments.
In an embodiment, the video scene segment set includes a target video scene segment, where the target video scene segment is any one of the video scene segment set, and the processing unit 702 performs the following operations when performing duration compression processing on each of the video scene segments according to a duration threshold: if the duration of the target video scene segment is greater than the duration threshold, clipping the target video scene segment to obtain a clipped target video scene segment, wherein the duration of the clipped target video scene segment is not greater than the duration threshold; if the duration of the target video scene segment is not greater than the duration threshold, keeping the target video scene segment unchanged.
In one embodiment, when performing a cropping process on the target video scene segment to obtain a cropped target video scene segment, the processing unit 702 performs the following operations: extracting at least one sub-segment to be spliced from the target video scene segment according to a time length extraction rule; forming a cut target video scene segment according to the at least one sub-segment to be spliced; the duration extraction rule comprises extracting a sub-segment to be spliced from the target video scene segment according to the duration threshold, wherein the duration of the sub-segment to be spliced is not greater than the duration threshold; or the duration extraction rule comprises extracting at least two sub-segments to be spliced from the target video scene segment according to the duration threshold, wherein the total duration of the at least two sub-segments to be spliced is not greater than the duration threshold.
In one embodiment, when performing a cropping process on the target video scene segment to obtain a cropped target video scene segment, the processing unit 702 performs the following operations: determining the target number of the required image frames to be spliced according to the time length threshold value and the time length of each image frame included by the target video scene segment; cutting out a target number of image frames to be spliced from the target video scene segment; and splicing the target number of image frames to be spliced to obtain a cut target video scene segment.
In one embodiment, the video processing apparatus further includes a display unit 703 configured to display a user interface, where the user interface includes a play setting area, the play setting area includes a play speed control area, the play speed control area includes at least one acceleration mode, and the play speed control area is configured to receive an acceleration play operation of a user; the determining unit 701 is further configured to: if the accelerated playing operation of the user is detected, determining a target accelerated mode included by the accelerated playing operation, and determining a similarity threshold corresponding to the target accelerated mode; the processing unit 702 is further configured to: and determining a similarity condition according to a similarity threshold corresponding to the target acceleration mode.
In one embodiment, the target acceleration mode includes smart acceleration, and the determining unit 701 performs the following operations when determining a similarity threshold corresponding to the target acceleration mode: acquiring feature information of the target video according to the identification information of the target video, wherein the feature information comprises scoring information and historical acceleration information; and determining a similarity threshold corresponding to the intelligent acceleration according to the characteristic information.
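The patent does not fix how the scoring information and historical acceleration information are combined into a similarity threshold. Purely as a hypothetical illustration, one could clamp a weighted combination; every weight, bound and sign below is an assumption of this sketch, not taken from the patent:

```python
def smart_similarity_threshold(score, historical_rate,
                               base=0.80, lo=0.60, hi=0.95):
    """Hypothetical mapping from feature information to a similarity
    threshold. score: e.g. a 0-10 rating of the target video;
    historical_rate: e.g. the playback speed users typically chose
    for this video (1.0 = normal speed)."""
    t = base + 0.02 * (score - 5.0) - 0.05 * (historical_rate - 1.0)
    return max(lo, min(hi, t))  # clamp into an assumed sane range
```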
The steps involved in the method shown in fig. 3 or fig. 4 may be performed by various units in the video processing apparatus shown in fig. 7 according to an embodiment of the present invention. For example, step S301 shown in fig. 3 may be performed by the determination unit 701 in the video processing apparatus shown in fig. 7, and steps S302 to S304 may be performed by the processing unit 702 in the video processing apparatus shown in fig. 7; as another example, steps S401 to S403 and S405 shown in fig. 4 may be performed by the determination unit 701 in the video processing apparatus shown in fig. 7, and steps S404 and S406 to S408 may be performed by the processing unit 702 in the video processing apparatus shown in fig. 7.
According to another embodiment of the present invention, the units in the video processing apparatus shown in fig. 7 may be separately or entirely combined into one or several other units, or one (or some) of the units may be further split into multiple functionally smaller units that achieve the same operation, without affecting the technical effect of the embodiment of the present invention. The above units are divided based on logical functions; in practical applications, the function of one unit may be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present invention, the video processing apparatus may also include other units; in practical applications, these functions may also be realized with the assistance of other units, and may be realized by the cooperation of multiple units.
According to another embodiment of the present invention, the video processing apparatus shown in fig. 7 may be constructed, and the video processing method according to the embodiment of the present invention may be implemented, by running a computer program (including program code) capable of executing the steps involved in the methods shown in fig. 3 or fig. 4 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM) and a read-only memory (ROM). The computer program may be recorded on, for example, a computer-readable storage medium, loaded into the above computing device via the computer-readable storage medium, and executed therein.
In the embodiment of the invention, the image frames included in the first type of video segment requiring compression processing are divided into a video scene segment set, each video scene segment in the set is composed of a plurality of image frames, and the similarity between these image frames satisfies the similarity condition, that is, the image frames within each video scene segment are highly similar. Therefore, even though each video scene segment contains fewer image frames after duration compression processing, the user's viewing of the video scene segment is not affected; finally, the second type of video segment of the target video and the compressed first type of video segment are spliced to obtain the compressed target video. The duration of the compressed target video is shortened, the time needed to play the target video is reduced, and the duration compression processing of the target video is carried out intelligently.
Based on the description of the above method embodiment and apparatus embodiment, an embodiment of the present invention further provides a terminal, where the terminal corresponds to the first terminal in the method embodiments shown in fig. 3 and fig. 4. Referring to fig. 8, the terminal may include a processor 801 and a computer storage medium 802, and may further include a display device 803, such as a display screen, the display device 803 being used to display a user interface.
The computer storage medium 802 may be located in the memory of the terminal and is used to store a computer program comprising program instructions, and the processor 801 is used to execute the program instructions stored in the computer storage medium 802. The processor 801, or CPU (Central Processing Unit), is the computing core and control core of the terminal; it is adapted to implement one or more instructions, and in particular to load and execute one or more instructions so as to implement a corresponding method flow or function. In one embodiment, the processor 801 according to the embodiment of the present invention may be configured to perform: determining a first type of video segment from the target video; dividing the first type of video segment into a video scene segment set according to the similarity between the image frames included in the first type of video segment, wherein the similarity between the image frames included in each video scene segment in the set satisfies the similarity condition; performing duration compression processing on each video scene segment according to a duration threshold; and splicing the video scene segments subjected to the duration compression processing to obtain a compressed first type of video segment, and splicing a second type of video segment included in the target video and the compressed first type of video segment according to the video playing sequence to obtain a compressed target video.
The embodiment of the invention also provides a computer storage medium (Memory), which is a Memory device in the terminal and is used for storing programs and data. It is understood that the computer storage medium herein may include a built-in storage medium in the terminal, and may also include an extended storage medium supported by the terminal. The computer storage medium provides a storage space that stores an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), suitable for loading and execution by the processor 801. The computer storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory; and optionally at least one computer storage medium located remotely from the processor.
In one embodiment, one or more instructions stored in a computer storage medium may be loaded and executed by the processor 801 to perform the corresponding steps of the method described above in connection with the embodiments of the video processing apparatus; in particular implementations, one or more instructions in the computer storage medium are loaded and executed by the processor 801 to perform the steps of:
determining a first type of video segment from the target video; dividing the first type of video segments into a video scene segment set according to the similarity among the image frames included in the first type of video segments, wherein the similarity among the image frames included in each video scene segment in the video scene segment set meets the similarity condition; carrying out time length compression processing on each video scene segment according to a time length threshold value; splicing the video scene segments subjected to the duration compression processing to obtain a compressed first type video segment, and splicing a second type video segment included in the target video and the compressed first type video segment according to a video playing sequence to obtain a compressed target video.
In one embodiment, one or more instructions in the computer storage medium are loaded by the processor 801 to further perform the following steps: acquiring a target video, and acquiring a video sequence and an audio sequence included in the target video, wherein the video sequence and the audio sequence correspond to each other; dividing the audio sequence into a first type of audio segment and a second type of audio segment according to a speech recognition algorithm; and determining the video content corresponding to the first type of audio segment in the video sequence as a first type of video segment, wherein the first type of video segment does not include speech information.
In one embodiment, one or more instructions in the computer storage medium are loaded by the processor 801 to further perform the following steps: acquiring each image frame included in the target video, and recognizing each image frame included in the target video by adopting an image recognition algorithm to obtain a recognition result; and determining a first type of video segment included in the target video according to the recognition result, wherein the first type of video segment does not include preset posture information.
In one embodiment, when one or more instructions in the computer storage medium are loaded by the processor 801 to perform the dividing of the first type of video segment into a set of video scene segments according to the similarity between the image frames included in the first type of video segment, the following operations are performed: according to the video playing sequence, selecting a starting image frame from the image frames included in the first type of video segment; sequentially traversing each image frame behind the starting image frame, and if the similarity between the currently traversed image frame and the starting image frame is detected to be smaller than the similarity threshold, determining the current image frame as an end image frame; composing the starting image frame and the image frames between the starting image frame and the end image frame into a video scene segment; and repeating the above steps to obtain a plurality of video scene segments, which form a video scene segment set.
In one embodiment, the set of video scene segments includes a target video scene segment, where the target video scene segment is any one of the set of video scene segments, and the processor 801, when executing one or more instructions in the loaded computer storage medium to perform the duration compression processing on each of the video scene segments according to the duration threshold, performs the following operations: if the duration of the target video scene segment is greater than the duration threshold, clipping the target video scene segment to obtain a clipped target video scene segment, wherein the duration of the clipped target video scene segment is not greater than the duration threshold; if the duration of the target video scene segment is not greater than the duration threshold, keeping the target video scene segment unchanged.
In one embodiment, when one or more instructions in the computer storage medium are loaded to perform the cropping processing on the target video scene segment, so as to obtain a cropped target video scene segment, the processor 801 performs the following operations: extracting at least one sub-segment to be spliced from the target video scene segment according to a time length extraction rule; forming a cut target video scene segment according to the at least one sub-segment to be spliced; the duration extraction rule comprises extracting a sub-segment to be spliced from the target video scene segment according to the duration threshold, wherein the duration of the sub-segment to be spliced is not greater than the duration threshold; or the duration extraction rule comprises extracting at least two sub-segments to be spliced from the target video scene segment according to the duration threshold, wherein the total duration of the at least two sub-segments to be spliced is not greater than the duration threshold.
In one embodiment, when one or more instructions in the computer storage medium are loaded to perform the cropping processing on the target video scene segment, so as to obtain a cropped target video scene segment, the processor 801 performs the following operations: determining the target number of the required image frames to be spliced according to the time length threshold value and the time length of each image frame included by the target video scene segment; cutting out a target number of image frames to be spliced from the target video scene segment; and splicing the target number of image frames to be spliced to obtain a cut target video scene segment.
In one embodiment, the loading of one or more instructions in a computer storage medium by processor 801 further performs the steps of: displaying a user interface, wherein the user interface comprises a play setting area, the play setting area comprises a play speed control area, the play speed control area comprises at least one acceleration mode, and the play speed control area is used for receiving acceleration play operation of a user; if the accelerated playing operation of the user is detected, determining a target accelerated mode included by the accelerated playing operation, and determining a similarity threshold corresponding to the target accelerated mode; and determining a similarity condition according to a similarity threshold corresponding to the target acceleration mode.
In one embodiment, the target acceleration mode includes intelligent acceleration, and the processor 801, when loading one or more instructions in the computer storage medium to perform the determining the similarity threshold corresponding to the target acceleration mode, performs the following: acquiring feature information of the target video according to the identification information of the target video, wherein the feature information comprises scoring information and historical acceleration information; and determining a similarity threshold corresponding to the intelligent acceleration according to the characteristic information.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is intended to be illustrative of only some embodiments of the invention, and is not intended to limit the scope of the invention.

Claims (12)

1. A video processing method, comprising:
determining a first type of video segment from the target video;
dividing the first type of video segments into a video scene segment set according to the similarity among the image frames included in the first type of video segments, wherein the similarity among the image frames included in each video scene segment in the video scene segment set meets the similarity condition;
carrying out time length compression processing on each video scene segment according to a time length threshold value;
splicing the video scene segments subjected to the duration compression processing to obtain a compressed first type video segment, and splicing a second type video segment included in the target video and the compressed first type video segment according to a video playing sequence to obtain a compressed target video.
2. The method of claim 1, wherein the method further comprises:
acquiring a target video, and acquiring a video sequence and an audio sequence included in the target video, wherein the video sequence and the audio sequence correspond to each other;
dividing the audio sequence into a first type of audio segment and a second type of audio segment according to a speech recognition algorithm;
and determining the video content corresponding to the first type of audio segment in the video sequence as a first type of video segment, wherein the first type of video segment does not include speech information.
3. The method of claim 1, wherein the method further comprises:
acquiring each image frame included in a target video, and identifying each image frame included in the target video by adopting an image identification algorithm to obtain an identification result;
and determining a first type of video segment included in the target video according to the identification result, wherein the first type of video segment does not include preset posture information.
4. The method according to claim 1, wherein said dividing said first type of video segment into a set of video scene segments based on similarities between image frames comprised in said first type of video segment comprises:
according to the video playing sequence, selecting a starting image frame from image frames included in the first type of video segment;
sequentially traversing each image frame behind the initial image frame, and if the similarity between the currently traversed current image frame and the initial image frame is detected to be smaller than a similarity threshold, determining the current image frame as an end image frame;
composing the starting image frame and the image frames between the starting image frame and the ending image frame into a video scene segment;
and repeating the steps to obtain a plurality of video scene segments, and forming a video scene segment set by the plurality of video scene segments.
5. The method of claim 1, wherein the set of video scene segments includes a target video scene segment, the target video scene segment is any one of the set of video scene segments, and the performing the duration compression processing on each of the video scene segments according to the duration threshold comprises:
if the duration of the target video scene segment is greater than the duration threshold, clipping the target video scene segment to obtain a clipped target video scene segment, wherein the duration of the clipped target video scene segment is not greater than the duration threshold;
if the duration of the target video scene segment is not greater than the duration threshold, keeping the target video scene segment unchanged.
6. The method of claim 5, wherein the cropping the target video scene segment to obtain a cropped target video scene segment comprises:
extracting at least one sub-segment to be spliced from the target video scene segment according to a time length extraction rule;
forming a cut target video scene segment according to the at least one sub-segment to be spliced;
the duration extraction rule comprises extracting a sub-segment to be spliced from the target video scene segment according to the duration threshold, wherein the duration of the sub-segment to be spliced is not greater than the duration threshold;
or the duration extraction rule comprises extracting at least two sub-segments to be spliced from the target video scene segment according to the duration threshold, wherein the total duration of the at least two sub-segments to be spliced is not greater than the duration threshold.
7. The method of claim 5, wherein the cropping the target video scene segment to obtain a cropped target video scene segment comprises:
determining the target number of the required image frames to be spliced according to the time length threshold value and the time length of each image frame included by the target video scene segment;
cutting out a target number of image frames to be spliced from the target video scene segment;
and splicing the target number of image frames to be spliced to obtain a cut target video scene segment.
8. The method of claim 1, wherein the method further comprises:
displaying a user interface, wherein the user interface comprises a play setting area, the play setting area comprises a play speed control area, the play speed control area comprises at least one acceleration mode, and the play speed control area is used for receiving acceleration play operation of a user;
if the accelerated playing operation of the user is detected, determining a target accelerated mode included by the accelerated playing operation, and determining a similarity threshold corresponding to the target accelerated mode;
and determining a similarity condition according to a similarity threshold corresponding to the target acceleration mode.
9. The method of claim 8, wherein the target acceleration pattern comprises smart acceleration, and wherein determining a similarity threshold corresponding to the target acceleration pattern comprises:
acquiring feature information of the target video according to the identification information of the target video, wherein the feature information comprises scoring information and historical acceleration information;
and determining a similarity threshold corresponding to the intelligent acceleration according to the characteristic information.
10. A video processing apparatus, comprising:
the acquisition unit is used for determining a first type of video segment from the target video;
the processing unit is used for dividing the first type of video segments into a video scene segment set according to the similarity among the image frames included in the first type of video segments, and the similarity among the image frames included in each video scene segment in the video scene segment set meets the similarity condition;
the processing unit is further configured to perform duration compression processing on each video scene segment according to a duration threshold;
the processing unit is further configured to splice the video scene segments subjected to the duration compression processing to obtain a compressed first-class video segment, and splice a second-class video segment included in the target video and the compressed first-class video segment according to a video playing sequence to obtain a compressed target video.
11. A terminal, characterized in that it comprises:
a processor adapted to implement one or more instructions; and the number of the first and second groups,
a computer storage medium having stored thereon one or more instructions adapted to be loaded by the processor and to execute the video processing method according to any of claims 1-9.
12. A computer storage medium having computer program instructions stored therein, which when executed by a processor, are adapted to perform a video processing method according to any of claims 1-9.
CN201910565725.8A 2019-06-26 2019-06-26 Video processing method, device, terminal and storage medium Active CN112153462B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910565725.8A CN112153462B (en) 2019-06-26 2019-06-26 Video processing method, device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910565725.8A CN112153462B (en) 2019-06-26 2019-06-26 Video processing method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN112153462A true CN112153462A (en) 2020-12-29
CN112153462B CN112153462B (en) 2023-02-14

Family

ID=73868645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910565725.8A Active CN112153462B (en) 2019-06-26 2019-06-26 Video processing method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN112153462B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140347481A1 (en) * 2007-06-15 2014-11-27 Physical Optics Corporation System and method for video and image compression
CN102902756A (en) * 2012-09-24 2013-01-30 南京邮电大学 Video abstraction extraction method based on story plots
CN106658227A (en) * 2015-10-29 2017-05-10 阿里巴巴集团控股有限公司 Video play time length compressing method and device
EP3185137A1 (en) * 2015-12-21 2017-06-28 Thomson Licensing Method, apparatus and arrangement for summarizing and browsing video content
CN106210902A (en) * 2016-07-06 2016-12-07 华东师范大学 A kind of cameo shot clipping method based on barrage comment data
CN106911961A (en) * 2017-02-22 2017-06-30 北京小米移动软件有限公司 Multimedia data playing method and device
CN109213895A (en) * 2017-07-05 2019-01-15 合网络技术(北京)有限公司 A kind of generation method and device of video frequency abstract
CN107517406A (en) * 2017-09-05 2017-12-26 语联网(武汉)信息技术有限公司 A kind of video clipping and the method for translation
CN108322831A (en) * 2018-02-28 2018-07-24 广东美晨通讯有限公司 video playing control method, mobile terminal and computer readable storage medium
CN108966004A (en) * 2018-06-27 2018-12-07 维沃移动通信有限公司 A kind of method for processing video frequency and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DONG, LELE: "Intelligent video fast-forward technology based on key frame extraction", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113709584A (en) * 2021-03-05 2021-11-26 腾讯科技(北京)有限公司 Video dividing method, device, server, terminal and storage medium
CN113055705B (en) * 2021-03-25 2022-08-19 郑州师范学院 Cloud computing platform data storage method based on big data analysis
CN113055705A (en) * 2021-03-25 2021-06-29 郑州师范学院 Cloud computing platform data storage method based on big data analysis
CN113438538A (en) * 2021-06-28 2021-09-24 康键信息技术(深圳)有限公司 Short video preview method, device, equipment and storage medium
CN113438538B (en) * 2021-06-28 2023-02-10 康键信息技术(深圳)有限公司 Short video preview method, device, equipment and storage medium
CN113633970A (en) * 2021-08-18 2021-11-12 腾讯科技(成都)有限公司 Action effect display method, device, equipment and medium
CN113633970B (en) * 2021-08-18 2024-03-08 腾讯科技(成都)有限公司 Method, device, equipment and medium for displaying action effect
CN114339391A (en) * 2021-08-18 2022-04-12 腾讯科技(深圳)有限公司 Video data processing method, video data processing device, computer equipment and storage medium
CN114157873A (en) * 2021-11-25 2022-03-08 中国通信建设第四工程局有限公司 Video compression method and video compression system
CN114222159A (en) * 2021-12-01 2022-03-22 北京奇艺世纪科技有限公司 Method and system for determining video scene change point and generating video clip
CN114501132A (en) * 2021-12-24 2022-05-13 北京达佳互联信息技术有限公司 Resource processing method and device, electronic equipment and storage medium
CN114501132B (en) * 2021-12-24 2024-03-12 北京达佳互联信息技术有限公司 Resource processing method and device, electronic equipment and storage medium
CN114222165B (en) * 2021-12-31 2023-11-10 咪咕视讯科技有限公司 Video playing method, device, equipment and computer storage medium
CN114222165A (en) * 2021-12-31 2022-03-22 咪咕视讯科技有限公司 Video playing method, device, equipment and computer storage medium
CN114666656A (en) * 2022-03-15 2022-06-24 北京沃东天骏信息技术有限公司 Video clipping method, video clipping device, electronic equipment and computer readable medium
WO2023195916A3 (en) * 2022-04-07 2023-12-21 脸萌有限公司 Video processing method and apparatus, electronic device and medium
WO2023202522A1 (en) * 2022-04-21 2023-10-26 维沃移动通信有限公司 Playing speed control method and electronic device
CN116320535A (en) * 2023-04-14 2023-06-23 北京百度网讯科技有限公司 Method, device, electronic equipment and storage medium for generating video
CN116320535B (en) * 2023-04-14 2024-03-22 北京百度网讯科技有限公司 Method, device, electronic equipment and storage medium for generating video
CN117459665A (en) * 2023-10-25 2024-01-26 杭州友义文化传媒有限公司 Video editing method, system and storage medium
CN117459665B (en) * 2023-10-25 2024-05-07 杭州友义文化传媒有限公司 Video editing method, system and storage medium

Also Published As

Publication number Publication date
CN112153462B (en) 2023-02-14

Similar Documents

Publication Publication Date Title
CN112153462B (en) Video processing method, device, terminal and storage medium
US9786326B2 (en) Method and device of playing multimedia and medium
US11438510B2 (en) System and method for editing video contents automatically technical field
CN105874780A (en) Method and apparatus for generating a text color for a group of images
US20090079840A1 (en) Method for intelligently creating, consuming, and sharing video content on mobile devices
CN110691281B (en) Video playing processing method, terminal device, server and storage medium
CN106921883B (en) Video playing processing method and device
EP4148597A1 (en) Search result display method and apparatus, readable medium, and electronic device
CN104244101A (en) Method and device for commenting multimedia content
EP4300431A1 (en) Action processing method and apparatus for virtual object, and storage medium
CN108712667A (en) A kind of smart television, its screenshotss application process, device and readable storage medium
CN110958470A (en) Multimedia content processing method, device, medium and electronic equipment
CN110796094A (en) Control method and device based on image recognition, electronic equipment and storage medium
CN114143575A (en) Video editing method and device, computing equipment and storage medium
US10924637B2 (en) Playback method, playback device and computer-readable storage medium
CN110321042B (en) Interface information display method and device and electronic equipment
CN114339423A (en) Short video generation method and device, computing equipment and computer readable storage medium
CN110110146A (en) Video clip searching method, device, medium and equipment based on artificial intelligence
CN110460874B (en) Video playing parameter generation method and device, storage medium and electronic equipment
CN109889883A (en) A kind of Wonderful time video recording method and device
WO2021225581A1 (en) Systems and methods for recommending content using progress bars
CN114245229B (en) Short video production method, device, equipment and storage medium
CN110622517A (en) Video processing method and device
CN110225391B (en) Video playing control method and device, electronic equipment and storage medium
CN114339451A (en) Video editing method and device, computing equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant