CN114143575A - Video editing method and device, computing equipment and storage medium

Video editing method and device, computing equipment and storage medium

Info

Publication number
CN114143575A
Authority
CN
China
Prior art keywords: video, processed, target, segments, segment
Prior art date
Legal status
Pending
Application number
CN202111675867.3A
Other languages
Chinese (zh)
Inventor
张云栋 (Zhang Yundong)
刘程 (Liu Cheng)
Current Assignee
Shanghai IQIYI New Media Technology Co Ltd
Original Assignee
Shanghai IQIYI New Media Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai IQIYI New Media Technology Co Ltd filed Critical Shanghai IQIYI New Media Technology Co Ltd
Priority to CN202111675867.3A
Publication of CN114143575A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/23424 Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44016 Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application discloses a video clipping method and apparatus, a computing device, and a storage medium. The method includes: acquiring a video to be processed and its corresponding highlight score data, where the highlight score data is determined based on behavior data of one or more users watching the video to be processed; and segmenting the video to be processed according to the highlight score data to obtain a plurality of target video segments, where the highlight scores corresponding to the target video segments are greater than a preset score threshold, so that a target video is spliced from the target video segments, the playing duration of the target video being shorter than that of the video to be processed. In this way, not only can labor cost be effectively reduced, but the target video is also generally generated more efficiently. In addition, a target video generated from target video segments with relatively high highlight scores allows the automatic clipping of the video to be processed to reach a high standard.

Description

Video editing method and device, computing equipment and storage medium
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video editing method and apparatus, a computing device, and a storage medium.
Background
In practical application scenarios, a video with a long playing duration may be clipped to generate a clipped video that has a relatively short playing duration and contains the core video content. For example, below a full episode on an Internet video website, a compilation of highlight clips cut from that episode is usually released so that viewers can quickly watch all of its highlights.
At present, clipped videos are usually produced by manual editing, which not only incurs high labor cost but is also generally inefficient.
Disclosure of Invention
The embodiments of the application provide a video clipping method and apparatus, a computing device, and a storage medium, aiming to improve the efficiency of generating clipped videos and to reduce cost through automatic video clipping.
In a first aspect, an embodiment of the present application provides a video clipping method, where the method includes:
acquiring a video to be processed and highlight score data corresponding to the video to be processed, wherein the highlight score data is determined based on behavior data of at least one user watching the video to be processed;
segmenting the video to be processed according to the highlight score data corresponding to the video to be processed to obtain a plurality of target video segments, wherein the highlight scores corresponding to the target video segments are greater than a preset score threshold;
and splicing to obtain a target video based on the target video segments, wherein the playing time of the target video is shorter than that of the video to be processed.
In a possible implementation, the playing duration of each of the plurality of target video segments is greater than a preset duration, and the acquiring of the highlight score data corresponding to the video to be processed includes:
acquiring behavior data of at least one user when watching the video to be processed;
dividing the video to be processed into a plurality of sub-video segments according to a preset time length, wherein the playing time length of each sub-video segment does not exceed the preset time length;
determining the behavior characteristics of each sub video clip in at least one dimension according to the behavior data;
and calculating the highlight score data corresponding to each sub-video segment according to the behavior characteristics of each sub-video segment in at least one dimension.
In a possible implementation, the segmenting of the plurality of target video segments from the video to be processed according to the highlight score data corresponding to the video to be processed includes:
determining start segmentation points and end segmentation points corresponding to a plurality of candidate video segments in the video to be processed according to the highlight score data corresponding to the video to be processed;
performing semantic analysis on audio content corresponding to a target candidate video segment to obtain a semantic analysis result corresponding to the target candidate video segment, wherein the target candidate video segment is any one of the candidate video segments;
adjusting the start segmentation point and/or the end segmentation point of the target candidate video segment according to the semantic analysis result;
and segmenting the video to be processed according to the adjusted start segmentation point and/or end segmentation point to obtain the target candidate video segment.
In a possible implementation, the adjusting of the start segmentation point and/or the end segmentation point of the target candidate video segment according to the semantic analysis result includes:
identifying a transition position in the video to be processed;
and adjusting the start segmentation point and/or the end segmentation point of the target candidate video segment according to the transition position and the semantic analysis result.
In a possible implementation, when the video to be processed is a first type of video, the acquiring of the video to be processed includes:
acquiring an original video;
and determining a video to be processed from the original video by using a music recognition algorithm, wherein the audio content of the video to be processed comprises at least one piece of complete music.
In a possible implementation, when the video to be processed is a second type of video, the method further includes:
filtering out invalid segments among the plurality of target video segments by using a preset filtering rule, wherein the invalid segments include video segments containing specific people and/or advertising content;
then, the splicing to obtain the target video based on the plurality of target video segments includes:
and splicing to obtain the target video based on the target video segments remaining after filtering among the plurality of target video segments.
In a second aspect, an embodiment of the present application further provides a video editing apparatus, including:
an acquisition module, configured to acquire a video to be processed and highlight score data corresponding to the video to be processed, wherein the highlight score data is determined based on behavior data of at least one user watching the video to be processed;
a segmentation module, configured to segment a plurality of video segments from the video to be processed according to the highlight score data corresponding to the video to be processed, wherein the highlight scores corresponding to the segmented video segments are greater than a preset score threshold;
and a splicing module, configured to splice the video segments into a target video, wherein the playing duration of the target video is shorter than that of the video to be processed.
In a possible implementation manner, a playing time of each of the plurality of target video segments is greater than a preset time, and the obtaining module includes:
the first acquisition unit is used for acquiring behavior data when at least one user watches the video to be processed;
the dividing unit is used for dividing the video to be processed into a plurality of sub-video clips according to preset time length, wherein the playing time length of each sub-video clip does not exceed the preset time length;
a first determining unit, configured to determine, according to the behavior data, behavior characteristics of each sub-video segment in at least one dimension;
and the calculating unit is used for calculating the highlight score data corresponding to each sub-video segment according to the behavior characteristics of each sub-video segment in at least one dimension.
In one possible embodiment, the segmentation module includes:
the second determining unit is used for determining start segmentation points and end segmentation points corresponding to a plurality of candidate video segments in the video to be processed according to the highlight score data corresponding to the video to be processed;
the semantic analysis unit is used for performing semantic analysis on the audio content corresponding to the target candidate video clip to obtain a semantic analysis result corresponding to the target candidate video clip, wherein the target candidate video clip is any one of the candidate video clips;
the adjusting unit is used for adjusting the start segmentation point and/or the end segmentation point of the target candidate video segment according to the semantic analysis result;
and the segmentation unit is used for segmenting the target candidate video segment from the video to be processed according to the adjusted start segmentation point and/or end segmentation point.
In a possible implementation, the adjusting unit includes:
the identification subunit is used for identifying a transition position in the video to be processed;
and the adjusting subunit is used for adjusting the start segmentation point and/or the end segmentation point of the target candidate video segment according to the transition position and the semantic analysis result.
In a possible implementation manner, when the video to be processed is a first type of video, the obtaining module includes:
a second obtaining unit for obtaining an original video;
and the third determining unit is used for determining a video to be processed from the original video by using a music recognition algorithm, wherein the audio content of the video to be processed comprises at least one piece of complete music.
In a possible implementation manner, when the video to be processed is a second type of video, the apparatus further includes:
the filtering module is used for filtering out invalid segments among the plurality of target video segments by using a preset filtering rule, wherein the invalid segments include video segments containing specific people and/or advertising content;
the splicing module is specifically configured to splice to obtain the target video based on the remaining target video segments of the plurality of target video segments after filtering.
In a third aspect, an embodiment of the present application further provides a computing device, which may include a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to perform the method according to the first aspect or any one of its implementations.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is configured to store a computer program, and the computer program is configured to execute the method according to the first aspect or any one of its implementations.
In the implementation of the embodiments of the application, the video to be processed and the highlight score data corresponding to it are acquired, where the highlight score data may be determined based on behavior data of one or more users watching the video to be processed; a plurality of target video segments are then segmented from the video to be processed according to the highlight score data, where the highlight scores corresponding to the target video segments are greater than those corresponding to the other video segments in the video to be processed; finally, the target video is spliced from the target video segments, with a playing duration shorter than that of the video to be processed.
Compared with manual clipping, this not only effectively reduces labor cost but also generally generates the target video more efficiently. In addition, because the highlight scores are determined from the behavior data of users watching the video to be processed, they reflect the users' preference for the video content; a target video generated from target video segments with relatively high highlight scores therefore generally consists of content users prefer, so the automatic clipping of the video to be processed can reach a high standard.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an exemplary application scenario in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a video editing method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a video editing apparatus according to an embodiment of the present application;
fig. 4 is a schematic hardware structure diagram of a computing device in an embodiment of the present application.
Detailed Description
Referring to fig. 1, a schematic view of an application scenario provided in an embodiment of the present application is shown. In the application scenario illustrated in fig. 1, a client 101 may have a communication connection with a computing device 102. The client 101 may receive a video provided by a user (e.g., a video clip maker) and send the video to the computing device 102; the computing device 102 is configured to clip the received video, generate a target video, and present the generated target video to the user through the client 101.
The computing device 102 refers to a device having a data processing capability, and may be, for example, a terminal, a server, or the like. The client 101 may be implemented in a physical device separate from the computing device 102. For example, when the computing device 102 is implemented by a server, the client 101 may run on a user terminal or the like on the user side. Alternatively, the client 101 may also run on the computing device 102.
In practical application scenarios, generating a clipped video by manual clipping is both costly and inefficient. Therefore, an embodiment of the application provides a video clipping method in which the computing device 102 automatically clips the video to be processed, so as to improve the efficiency of generating the clipped video and reduce cost. In a specific implementation, the computing device 102 may acquire a video to be processed, for example one provided by a user through the client 101, and further acquire the highlight score data corresponding to the video to be processed, where the highlight score data may be determined based on behavior data of one or more users watching the video to be processed and may reflect the users' preference for different video content in it. Then, the computing device 102 may segment a plurality of target video segments from the video to be processed according to the highlight score data, where the highlight scores corresponding to the target video segments are greater than a preset score threshold, and splice the segmented target video segments into a target video whose playing duration is shorter than that of the video to be processed.
Because the computing device 102 can automatically clip the video to be processed according to its highlight score data and generate the target video, labor cost can be effectively reduced compared with manual clipping, and the target video is generally generated more efficiently. In addition, since the highlight scores are determined from the behavior data of users watching the video to be processed, they reflect the users' preference for the video content; the target video the computing device 102 generates from target video segments with relatively high highlight scores therefore generally consists of content users prefer, so the automatic clipping effect on the video to be processed can reach a high standard.
It should be noted that the video in this embodiment refers to a video having both image and audio contents, that is, a video file includes not only video frame images of consecutive frames, but also audio data synchronized with the video frame images.
It is understood that the architecture of the application scenario shown in fig. 1 is only one example provided in the embodiment of the present application, and in practical applications, the embodiment of the present application may also be applied to other applicable scenarios, for example, the computing device 102 may automatically obtain one or more videos from the internet, and automatically generate clip videos corresponding to the respective videos through the above implementation manner. In summary, the embodiments of the present application may be applied in any applicable scenario and are not limited to the scenario examples described above.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, various non-limiting embodiments of the present application are described below with reference to the accompanying drawings. It is to be understood that the described embodiments are only a few embodiments of the present application, not all of them. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a video editing method in an embodiment of the present application, where the method may be applied to the application scenario illustrated in fig. 1, or may be applied to other applicable application scenarios, and the like. For convenience of explanation and understanding, the following description will be given by taking an application scenario shown in fig. 1 as an example. The method specifically comprises the following steps:
s201: the method comprises the steps of obtaining a video to be processed and highlight rating data corresponding to the video to be processed, wherein the highlight rating data are determined based on behavior data of at least one user when the user views the video to be processed.
For the convenience of distinction and description, the video to be clipped is referred to as the video to be processed in the present embodiment, and the video generated by clipping is referred to as the target video.
In one possible implementation, the pending video may be provided to the computing device 102 by a user. Specifically, the client 101 may present a video import interface to the user, so that the user may import the to-be-processed video to the client 101 by performing a corresponding operation on the video import interface. The client 101 may then transmit the pending video provided by the user to the computing device 102 over a network connection with the computing device 102.
In yet another possible implementation, the video to be processed may also be obtained by the computing device 102 from the Internet. For example, the user may send an instruction to generate a clipped video to the computing device 102 through the client 101, so that, based on the instruction, the computing device 102 downloads videos of a specific type from the Internet, such as talk-show videos or vocal-performance videos, and uses them as videos to be processed for subsequent clipping.
Alternatively, the video to be processed may be a video that has been pre-processed by the computing device 102 (or another device). For example, when the video to be processed is a first type of video (e.g., a music variety show), the computing device 102 may acquire an original video provided by the user through the client 101, whose content may include, for example, several contestants singing in turn and commentators rating/scoring each contestant. The computing device 102 may then determine the video to be processed from the original video by using a music recognition algorithm, where the audio content of the video to be processed includes at least one complete piece of music; for example, the computing device 102 may cut out the video content of each contestant's singing performance in the original video and use the cut-out sub-videos as videos to be processed.
It should be noted that the video to be processed acquired by the computing device 102 may be one video or multiple videos; for example, the computing device 102 may generate a single clipped video from multiple videos to be processed, which this embodiment does not limit. For ease of understanding and explanation, one video to be processed is taken as an example below; when there are multiple videos to be processed, the implementation is similar, the difference being that the video segments used for subsequent splicing come from multiple different videos to be processed.
In this embodiment, the computing device 102 acquires not only the video to be processed but also the highlight score data corresponding to it. Highlight scoring automatically judges users' preference for video content by comprehensively analyzing their behavior while watching the video, and the highlight score indicates the degree of that preference. For example, a higher highlight score for a video may indicate that users prefer its content more; conversely, a lower highlight score may indicate that users prefer its content less. Accordingly, the computing device 102 may use the highlight score data corresponding to the video to be processed to determine which of its video content users prefer and which they relatively do not.
The highlight score data corresponding to the video to be processed may be calculated in advance, for example by other devices, so that the computing device 102 can read it from them directly.
In yet another example, the computing device 102 may itself calculate the highlight score data corresponding to the video to be processed. In a specific implementation, the computing device 102 may acquire behavior data of at least one user watching the video to be processed, where the behavior data may indicate one or more operation behaviors such as posting a bullet-screen comment, reviewing a piece of video content, adjusting the playback speed (for example, from the default 1x speed to 2x speed), skipping part of the video content (for example, skipping content the user considers not wonderful), or exiting before playback ends (for example, closing the playback page early or closing the player software because the user considers the video content not wonderful). This embodiment does not limit the behavior data.
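As an illustration only, such behavior data might be represented as simple event records; the field and event names below are assumptions of this sketch, not part of the application.

```python
from dataclasses import dataclass

# Hypothetical record type for one playback event; names are illustrative.
@dataclass
class BehaviorEvent:
    user_id: str
    event_type: str    # e.g. "danmaku", "rewatch", "speed_change", "skip", "early_exit"
    position_s: float  # playback position (seconds) at which the event occurred
    value: float = 1.0 # optional payload, e.g. the newly selected playback speed

# Sample events from one viewer of the video to be processed.
events = [
    BehaviorEvent("u1", "danmaku", 62.4),
    BehaviorEvent("u1", "speed_change", 300.0, value=2.0),
    BehaviorEvent("u1", "skip", 310.0),
]
```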
Then, the computing device 102 divides the video to be processed into a plurality of sub-video segments according to the preset time duration, wherein the playing time duration of each sub-video segment does not exceed the preset time duration, so that the behavior feature of each sub-video segment in at least one dimension can be determined according to the behavior data, wherein the behavior feature of each dimension can correspond to one operation behavior of the user. For example, when the behavior data includes 5 operation behaviors of popping up a bullet screen, reviewing a piece of video content, adjusting the playback speed, skipping playing of a part of the video content, and exiting before the end of the playback, the computing device 102 may count the behavior characteristics of the 5 dimensions of the bullet screen dimension, the review dimension, the speed dimension, the skip playing dimension, and the playback exiting dimension when the user views each sub-video segment. In practical application, the computing device 102 may divide the video to be processed into a plurality of sub-video segments with a playing duration of 1 second by taking the duration of 1 second as a unit (or taking other playing durations as a unit), and count behavior characteristics corresponding to each sub-video segment with a duration of 1 second according to the behavior data.
Finally, the computing device 102 may calculate, from the behavior characteristics of each sub-video segment in at least one dimension, the highlight score data corresponding to each sub-video segment, that is, the highlight score data corresponding to the video to be processed. Specifically, when each sub-video segment has behavior characteristics in only one dimension, the computing device 102 may score the sub-video segment directly according to that dimension. For example, when the single dimension is the number of bullet-screen comments, that number is positively correlated with the highlight score: the more bullet-screen comments correspond to a sub-video segment, the higher the score the computing device 102 assigns it, and the fewer the comments, the lower the score. The scoring rules of other dimensions may be set according to the requirements of the practical application and are not repeated here. When each sub-video segment has behavior characteristics in multiple dimensions, the computing device 102 may score each sub-video segment using a corresponding scoring policy; for example, with behavior characteristics in 5 dimensions, the computing device 102 may assign each dimension a weight and use the weighted sum of the sub-video segment's behavior characteristics across the dimensions as its highlight score. It should be noted that, when a sub-video segment lacks the behavior characteristic of some dimension, the computing device 102 may fill it in by linear interpolation or the like, or may set it to a default value such as 0, which this embodiment does not limit.
In practical application, these scoring policies may be packaged into a scoring model, so that after the computing device 102 inputs the behavior characteristics of one or more dimensions corresponding to each sub-video segment into the scoring model, the model outputs the highlight score corresponding to each sub-video segment. The specific scoring rules of the model may be set according to the requirements of the practical application and are not repeated here.
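A minimal sketch of this scoring step, reusing the hypothetical BehaviorEvent records above: the video is divided into 1-second sub-segments, per-dimension counts are accumulated, and a weighted sum gives each sub-segment's highlight score. The weights are illustrative assumptions; the application leaves the concrete scoring rules open.

```python
from collections import defaultdict

DIMENSIONS = ["danmaku", "rewatch", "speed_change", "skip", "early_exit"]
# Illustrative weights: behaviors suggesting interest score positively,
# behaviors suggesting disinterest score negatively. Values are assumptions.
WEIGHTS = {"danmaku": 1.0, "rewatch": 1.5, "speed_change": -0.5,
           "skip": -1.0, "early_exit": -2.0}

def highlight_scores(events, duration_s, seg_len_s=1.0):
    """Divide the video into 1-second sub-segments, accumulate per-dimension
    behavior counts, and return each sub-segment's weighted-sum score."""
    n = int(duration_s // seg_len_s) + 1
    features = [defaultdict(float) for _ in range(n)]
    for ev in events:
        idx = int(ev.position_s // seg_len_s)
        if 0 <= idx < n:
            features[idx][ev.event_type] += 1.0
    # Missing dimensions simply default to 0 here; the application also
    # mentions linear interpolation as an alternative.
    return [sum(WEIGHTS[d] * f[d] for d in DIMENSIONS) for f in features]
```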
S202: segmenting a plurality of target video segments from the video to be processed according to the highlight score data corresponding to the video to be processed, wherein the highlight scores corresponding to the target video segments are greater than a preset score threshold.
In this embodiment, after obtaining the highlight score data, the computing device 102 may clip the video to be processed according to it to obtain a plurality of target video segments with relatively high highlight scores, for example by locating maxima of the highlight scores or peaks of the highlight score curve to determine target video segments whose highlight scores are greater than a preset score threshold; the highlight scores corresponding to the other video segments in the video to be processed are relatively low.
In one possible implementation, the computing device 102 may determine, according to the highlight score data corresponding to the video to be processed, the start segmentation points and end segmentation points corresponding to a plurality of candidate video segments in the video to be processed. The start segmentation point is the starting point of a candidate video segment, and the video image at the start segmentation point is the first frame of the candidate video segment; correspondingly, the end segmentation point is the end point of the candidate video segment, and the video image at the end segmentation point is its last frame. For example, the computing device 102 may determine, from the highlight score data, the sub-video segment with the largest highlight score, take the playing position 15 (or another number of) seconds before that sub-video segment as the start segmentation point of a candidate video segment, take the playing position 15 (or another number of) seconds after it as the end segmentation point, and so on.
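Following the example above, preliminary candidate boundaries can be derived from score peaks; the sketch below (local-maximum detection plus a 15-second window on each side) is one plausible reading of that example, not the application's definitive method.

```python
def candidate_windows(scores, threshold, half_window_s=15.0, seg_len_s=1.0):
    """Return preliminary (start_s, end_s) windows around score peaks.

    A sub-segment qualifies when its score exceeds the preset threshold and
    is a local maximum of the score curve; the window then spans 15 seconds
    (configurable) on each side of it.
    """
    windows = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if s > threshold and s >= left and s >= right:
            center = i * seg_len_s
            windows.append((max(0.0, center - half_window_s),
                            center + half_window_s))
    return windows
```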
The current start segmentation point and/or end segmentation point of a candidate video segment may leave part of the video content in the segment incomplete. Taking the start segmentation point as an example, assume the original video includes a dialogue in which person A asks: "Have you watched the recent video?" and person B answers: "No." If the start segmentation point determined by the computing device 102 falls exactly where person A finishes speaking, the candidate video segment contains only person B's answer ("No"), which leaves subsequent viewers confused when they watch that part, because they do not know what question person B is answering "No" to. Accordingly, the computing device 102 may also adjust the start segmentation point and/or the end segmentation point.
In a specific implementation, the computing device 102 may analyze the audio content corresponding to the target candidate video segment to obtain a corresponding semantic analysis result, where the target candidate video segment is any one of the candidate video segments. For example, the computing device 102 may recognize the subtitles in the target candidate video segment, e.g., by Optical Character Recognition (OCR), to obtain the subtitle text corresponding to the segment, and then perform semantic analysis on that subtitle text to obtain the semantic analysis result. Alternatively, the computing device 102 may first perform speech recognition on the audio data of the target candidate video segment to obtain the corresponding text content, and then perform semantic analysis on the recognized text.
After obtaining the semantic analysis result, the computing device 102 may adjust the start segmentation point and/or the end segmentation point of the target candidate video segment accordingly. For example, the computing device 102 may determine, from the semantic analysis result, a passage of complete semantics contained in the target candidate video segment; it may locate where the first sentence of that passage first appears in the video, specifically the first frame in which the subtitle of that first sentence is displayed, and set that frame as the start segmentation point; and/or it may locate where the last sentence of the passage last appears, specifically the last frame in which the subtitle of that last sentence is displayed, and set that frame as the end segmentation point. In this way, the semantics of the video content in the segment cut at these points are complete and coherent. When the computing device 102 adjusts only the start segmentation point based on the semantic analysis result, it cuts the target candidate video segment, i.e., a target video segment, from the video to be processed using the adjusted start segmentation point and the unadjusted end segmentation point; when it adjusts only the end segmentation point, it cuts the segment using the unadjusted start segmentation point and the adjusted end segmentation point; and when it adjusts both, it cuts the segment using the adjusted start and end segmentation points. In this manner, the computing device 102 may segment a plurality of target video segments from the video to be processed.
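A sketch of this boundary adjustment, assuming the subtitle/speech pass yields sentences with start and end timestamps; that upstream format is an assumption, not part of the application.

```python
def snap_to_sentences(start_s, end_s, sentences):
    """Pull the cut points onto sentence boundaries so no sentence is split.

    `sentences` is assumed to be a list of (sent_start_s, sent_end_s) pairs
    derived from subtitle OCR or speech recognition plus semantic analysis.
    """
    for s0, s1 in sentences:
        if s0 < start_s < s1:  # start point falls mid-sentence:
            start_s = s0       # move it back to the sentence's first frame
        if s0 < end_s < s1:    # end point falls mid-sentence:
            end_s = s1         # move it forward to the sentence's last frame
    return start_s, end_s
```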
Further, the computing device 102 may adjust the start segmentation point and/or the end segmentation point by considering both the transition position and the semantic analysis result. A transition position is a position in the video to be processed where the type of person on screen switches, for example where the camera cuts from the performer to the audience or the judges' seats. In practical applications, such a switch generally indicates that a piece of video content related to that person has temporarily ended; for instance, when the story or punchline a person is delivering finishes, the shot may cut to the auditorium to capture the audience's reaction (laughing, nodding, shaking their heads, etc.). Determining the start and/or end segmentation point of the target candidate video segment based on the transition position therefore generally makes the video content near the beginning or end of the segment more coherent. Notably, the computing device 102 may or may not adjust the start and/or end segmentation points based on the transition position; for example, when the video to be processed is a music variety video, it may skip this adjustment.
In a specific implementation, the computing device 102 may identify a transition position in the video to be processed and adjust the start segmentation point and/or the end segmentation point of the target candidate video segment according to the transition position and the semantic analysis result. For example, the start and/or end segmentation points may first be preliminarily adjusted according to the semantic analysis result; the computing device 102 then checks whether a transition position lies near the preliminarily adjusted points, e.g., within a video span no longer than a playing-duration threshold from them. If so, the computing device 102 may move the start and/or end segmentation point in the video to be processed to the transition position.
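The "nearby transition" check might look like the following sketch; the 2-second distance stands in for the unspecified playing-duration threshold.

```python
def snap_to_transition(cut_s, transition_times, max_dist_s=2.0):
    """Move a semantically adjusted cut point onto a recognized transition
    if one lies within max_dist_s of it; otherwise leave it unchanged."""
    nearby = [t for t in transition_times if abs(t - cut_s) <= max_dist_s]
    return min(nearby, key=lambda t: abs(t - cut_s)) if nearby else cut_s
```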
Illustratively, the computing device 102 may locate transition positions by comparing the similarity between video images. Specifically, it may sequentially compare the image similarity of each pair of adjacent frames near the start and/or end segmentation point; if the similarity of some pair falls below a preset threshold, the position of the earlier (or later) frame is taken as the transition position, and if the similarity of every pair near the points is above the threshold, the computing device 102 determines that there is no transition position. For example, when calculating the similarity of two frames, the computing device 102 may first shrink both to 8 pixels by 8 pixels, so each reduced frame has 64 pixels; this removes image detail, keeps only basic information such as structure and brightness, and reduces subsequent computation. The computing device 102 then converts the two reduced frames to grayscale and computes each frame's average gray value (the mean of its 64 gray values). Next, it compares each pixel's gray value with the frame's average: a pixel at or above the average is marked 1 and one below it is marked 0, and the 64 pixels of each frame, combined in a uniform order, produce a 64-bit hash (of 1s and 0s) that serves as the frame's fingerprint. The computing device 102 may then compare the two frames' 64-bit hashes: when the number of differing bits exceeds a preset value (e.g., 5), it judges the similarity of the two frames to be small; otherwise it judges the similarity to be large.
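The fingerprint comparison described above is essentially an average hash (aHash); a sketch using the Pillow library follows, with the 5-bit difference threshold from the text.

```python
from PIL import Image

def average_hash(img: Image.Image) -> int:
    """64-bit fingerprint per the steps above: shrink to 8x8, convert to
    grayscale, mark each pixel 1/0 against the mean gray value, pack bits."""
    small = img.resize((8, 8), Image.LANCZOS).convert("L")
    pixels = list(small.getdata())
    mean = sum(pixels) / 64
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def is_transition(frame_a: Image.Image, frame_b: Image.Image,
                  max_differing_bits: int = 5) -> bool:
    """Adjacent frames count as a transition when more than
    max_differing_bits of their 64 hash bits differ (5, as in the text)."""
    distance = bin(average_hash(frame_a) ^ average_hash(frame_b)).count("1")
    return distance > max_differing_bits
```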
S203: and splicing to obtain a target video based on the plurality of target video segments, wherein the playing time length of the target video is less than that of the video to be processed.
After cutting the plurality of target video segments from the video to be processed, the computing device 102 may splice them to generate the target video. The computing device 102 may splice the segments in their playing order in the video to be processed, or in another order, which this embodiment does not limit. A splicing sketch follows.
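A minimal splicing sketch; the third-party moviepy library (1.x-style API) is one possible tool chosen here for illustration, since the application does not name an implementation.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def splice(source_path, segments, out_path):
    """Cut each (start_s, end_s) target segment out of the video to be
    processed and splice them, in playback order, into the target video."""
    src = VideoFileClip(source_path)
    clips = [src.subclip(s, e) for s, e in sorted(segments)]
    concatenate_videoclips(clips).write_videofile(out_path)
    src.close()
```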
In practical application, when clipping a video to be processed, there are usually requirements on the playing duration of the target video generated by the clipping. For example, for a video to be processed with a playing duration of 2 hours, the playing duration of the target video clipped from it may be required not to exceed 10 minutes. Therefore, if the total playing duration of the plurality of target video segments is greater than the maximum allowed playing duration of the target video to be generated, the computing device 102 selects some of the target video segments to generate the target video.
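The application only states that a subset of segments is selected when the budget is exceeded; a simple greedy policy by score, as sketched below, is one assumed way to do it.

```python
def select_within_budget(scored_segments, max_total_s):
    """Keep the highest-scoring segments until the duration budget is spent.

    `scored_segments` is a list of (score, start_s, end_s) triples; the
    greedy-by-score policy is an assumption of this sketch.
    """
    chosen, total = [], 0.0
    for score, s, e in sorted(scored_segments, reverse=True):
        if total + (e - s) <= max_total_s:
            chosen.append((s, e))
            total += e - s
    return sorted(chosen)  # back to playback order for splicing
```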
In some practical scenarios, among the plurality of target video segments cut from the video to be processed, there may be segments that do not meet the requirements of the application. For example, when the video to be processed is an observation-type variety show, the target video segments cut by the computing device 102 through the above process may include segments whose content is people in the studio, or segments containing advertisements; such segments are generally excluded in a manual clipping process (and are generally also segments users do not prefer). Therefore, in a further possible implementation, when the video to be processed is a second type of video, such as an observation-type variety show, the computing device 102 may filter out invalid segments among the plurality of target video segments after cutting them from the video to be processed, and then generate the target video by splicing the target video segments remaining after filtering.
Illustratively, the invalid segments may be video segments containing advertising content and/or video segments containing a specific person. Of course, in practical application, invalid segments may also be defined in other ways, which this embodiment does not limit.
As some examples, when the invalid segment is a target video segment containing advertising content, the computing device 102 may use image recognition to check whether any of the frames in the target video segment contains "advertisement" text; if so, it may identify the segment as invalid, otherwise it determines the segment is not invalid. Alternatively, the computing device 102 may use a pre-trained Artificial Intelligence (AI) model: the target video segment is input into the AI model, which outputs an indication of whether it is an invalid segment, so that when the indication is "yes" (or another value) the segment is identified as invalid, and otherwise it is not.
In other examples, when the invalid segment is a target video segment containing a specific person, the computing device 102 may use a face recognition algorithm (or a face recognition model trained on one) to check whether the target video segment contains that person's face image; if so, it may identify the segment as invalid, otherwise it determines the segment is not invalid.
It should be noted that the above implementations for identifying invalid segments are merely exemplary; in practical applications, the computing device 102 may also identify the invalid segments among the multiple target video segments in other manners, which this embodiment does not limit.
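Pulling the examples above together, a filtering pass might look like the following sketch; detect_text and match_face are hypothetical callables standing in for the OCR, AI-model, or face-recognition components, which the application does not specify.

```python
def is_invalid(segment_frames, specific_faces, detect_text, match_face):
    """A target segment is treated as invalid if any sampled frame carries
    an advertisement marker or a specific person's face. detect_text and
    match_face are hypothetical stand-ins, not a real library's API."""
    for frame in segment_frames:
        if "advertisement" in detect_text(frame):
            return True
        if any(match_face(frame, face) for face in specific_faces):
            return True
    return False

# Usage sketch: keep only valid segments before splicing.
# valid = [seg for seg in target_segments
#          if not is_invalid(sample_frames(seg), faces, ocr, matcher)]
```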
In this embodiment, because the computing device 102 can automatically clip the video to be processed according to its highlight score data and generate the target video, labor cost can be effectively reduced compared with manual clipping, and the target video is generally generated more efficiently. In addition, since the highlight scores reflect users' preference for the video content, a target video generated from target video segments with relatively high highlight scores generally consists of content users prefer, so the automatic clipping effect on the video to be processed can reach a high standard.
For ease of understanding, the following takes an observation-type variety video and a music variety video as examples of the video to be processed.
After acquiring the observation-type variety video, the computing device 102 may obtain its corresponding highlight score data, for example by reading it directly from other devices or by calculating it from the behavior data of at least one user watching the video. The computing device 102 may then segment the video according to the highlight score data to obtain a plurality of candidate video segments with relatively high highlight scores. Because part of the video content of these preliminarily determined candidate segments may be incomplete, the computing device 102 may preliminarily adjust the start and end segmentation points of each candidate segment through semantic analysis of subtitles or audio data, and further adjust them in combination with transition positions, so that the semantics of the video content in the adjusted candidate segments are complete and coherent, finally cutting out a plurality of target video segments (i.e., the adjusted candidate segments). Then, since some target video segments may contain studio footage or advertising content, which usually does not participate in the clipping (and is usually also content users do not prefer), the computing device 102 may filter out the target video segments containing studio people or advertisements and splice the remaining ones to generate the clipped video corresponding to the observation-type variety video.
After acquiring the music variety video, the computing device 102 may use a music recognition algorithm to identify the video segments in it that each contain a complete song, the playing duration of each segment being, for example, 3 to 5 minutes, and acquire the behavior data of at least one user watching each segment, from which it calculates the highlight score data corresponding to each segment. The computing device 102 may then cut, from each segment, candidate video segments with relatively high highlight scores, for example candidate segments 30 seconds long, use OCR to identify the positions where the lyrics appear and disappear, and take those positions as the start and end segmentation points of the candidate segments. Finally, the computing device 102 may splice the candidate segments cut from the video segments to generate the clipped video corresponding to the music variety video.
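For the lyric-boundary step, a rough sketch using OpenCV and pytesseract follows; sampling one frame per second and cropping the bottom strip of the frame, where lyrics usually sit, are assumptions of this sketch.

```python
import cv2
import pytesseract

def lyric_visible(frame_bgr) -> bool:
    """Rough OCR check for on-screen lyrics in the bottom fifth of the frame."""
    h = frame_bgr.shape[0]
    strip = frame_bgr[int(h * 0.8):, :]
    return bool(pytesseract.image_to_string(strip).strip())

def lyric_boundaries(video_path, step_s=1.0):
    """Sample one frame per step_s and return the first and last times at
    which lyrics are seen, usable as start/end segmentation points."""
    cap = cv2.VideoCapture(video_path)
    t, first, last = 0.0, None, None
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = cap.read()
        if not ok:
            break
        if lyric_visible(frame):
            first = t if first is None else first
            last = t
        t += step_s
    cap.release()
    return first, last
```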
In addition, the embodiment of the application also provides a video clipping device. Referring to fig. 3, fig. 3 is a schematic diagram illustrating a structure of a video clipping device 300 according to an embodiment of the present application, where the video clipping device includes:
an obtaining module 301, configured to obtain a video to be processed and highlight score data corresponding to the video to be processed, where the highlight score data is determined based on behavior data of at least one user watching the video to be processed;
a segmentation module 302, configured to segment a plurality of video segments from the video to be processed according to the highlight score data corresponding to the video to be processed, where the highlight scores corresponding to the segmented video segments are greater than a preset score threshold;
a splicing module 303, configured to splice to obtain a target video based on the multiple video segments, where a playing time of the target video is shorter than a playing time of the to-be-processed video.
In a possible implementation manner, a playing time of each of the plurality of target video segments is greater than a preset time, and the obtaining module 301 includes:
the first acquisition unit is used for acquiring behavior data when at least one user watches the video to be processed;
the dividing unit is used for dividing the video to be processed into a plurality of sub-video clips according to preset time length, wherein the playing time length of each sub-video clip does not exceed the preset time length;
a first determining unit, configured to determine, according to the behavior data, behavior characteristics of each sub-video segment in at least one dimension;
and the calculating unit is used for calculating the highlight score data corresponding to each sub-video segment according to the behavior characteristics of each sub-video segment in at least one dimension.
In a possible implementation, the segmentation module 302 includes:
the second determining unit is used for determining start segmentation points and end segmentation points corresponding to a plurality of candidate video segments in the video to be processed according to the highlight score data corresponding to the video to be processed;
the semantic analysis unit is used for performing semantic analysis on the audio content corresponding to the target candidate video clip to obtain a semantic analysis result corresponding to the target candidate video clip, wherein the target candidate video clip is any one of the candidate video clips;
the adjusting unit is used for adjusting the start segmentation point and/or the end segmentation point of the target candidate video segment according to the semantic analysis result;
and the segmentation unit is used for segmenting the target candidate video segment from the video to be processed according to the adjusted start segmentation point and/or end segmentation point.
In a possible implementation, the adjusting unit includes:
the identification subunit is used for identifying a transition position in the video to be processed;
and the adjusting subunit is used for adjusting the start segmentation point and/or the end segmentation point of the target candidate video segment according to the transition position and the semantic analysis result.
In a possible implementation manner, when the video to be processed is a first type of video, the obtaining module 301 includes:
a second obtaining unit, configured to obtain an original video;
and a third determining unit, configured to determine the video to be processed from the original video by using a music recognition algorithm, where the audio content of the video to be processed includes at least one complete piece of music (a sketch follows).
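Assuming a music-recognition pass that returns (start, end) intervals of complete songs, the first-type path could be sketched as follows; the margin and the minimum-song-length guard are assumptions of this sketch.

```python
# Hypothetical first-type-video path: each detected complete-song interval
# (plus a small margin) becomes one video to be processed.
def extract_song_videos(original_duration, song_intervals, margin=2.0):
    videos = []
    for start, end in song_intervals:
        s = max(0.0, start - margin)
        e = min(original_duration, end + margin)
        if e - s >= 60.0:  # assume a complete song lasts at least a minute
            videos.append((s, e))
    return videos
```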
In a possible implementation, when the video to be processed is a second type of video, the apparatus 300 further includes:
a filtering module, configured to filter out invalid segments from the plurality of target video segments by using a preset filtering rule, where the invalid segments include video segments containing specific persons and/or advertisement content;
the splicing module is specifically configured to splice the target video segments remaining after the filtering to obtain the target video, as sketched below.
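A minimal sketch of the second-type path, assuming upstream detectors have already flagged each target video segment for advertisement content or specific persons; the flag names are illustrative, and the preset filtering rule itself is not specified here.

```python
# Hypothetical filtering step: invalid segments are dropped and the
# remainder is kept in playback order for splicing.
def filter_and_order(segments):
    """`segments`: dicts like {"start": 12.0, "end": 42.0,
    "has_ad": False, "has_blocked_person": False}."""
    kept = [s for s in segments
            if not (s["has_ad"] or s["has_blocked_person"])]
    return sorted(kept, key=lambda s: s["start"])
```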
It should be noted that, since the information interaction and execution processes between the modules and units of the above apparatus are based on the same concept as the method embodiments of the present application, their technical effects are the same as those of the method embodiments. For details, reference may be made to the description in the foregoing method embodiments, which is not repeated here.
In addition, an embodiment of the present application further provides a computing device. Referring to fig. 4, fig. 4 shows a hardware structure diagram of a computing device according to an embodiment of the present application; the computing device 400 may include a processor 401 and a memory 402.
wherein the memory 402 is configured to store a computer program;
the processor 401 is configured to execute the following steps according to the computer program:
acquiring a video to be processed and highlight rating data corresponding to the video to be processed, wherein the highlight rating data is determined based on behavior data of at least one user watching the video to be processed;
segmenting a plurality of target video segments from the video to be processed according to the highlight rating data corresponding to the video to be processed, wherein the highlight score corresponding to each target video segment is greater than a preset score threshold;
and splicing the plurality of target video segments to obtain a target video, wherein the playing duration of the target video is shorter than the playing duration of the video to be processed, as in the splicing sketch below.
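The embodiment does not name a splicing tool; one common way to realize the splicing step is ffmpeg's concat demuxer, sketched below. With stream copy, cuts land on keyframes, which is acceptable for a rough clip; the file names and the cut-then-concat layout are choices of this sketch only.

```python
# Illustrative splicing step using ffmpeg's concat demuxer.
import os
import subprocess
import tempfile

def splice(source, segments, output):
    """`segments`: list of (start, end) tuples in seconds, in play order."""
    parts = []
    for i, (start, end) in enumerate(segments):
        part = f"part_{i}.mp4"
        subprocess.run(["ffmpeg", "-y", "-i", source,
                        "-ss", str(start), "-to", str(end),
                        "-c", "copy", part], check=True)
        parts.append(part)
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for p in parts:
            f.write(f"file '{os.path.abspath(p)}'\n")
        list_path = f.name
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", list_path, "-c", "copy", output], check=True)
```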
In a possible implementation manner, the playing duration of each of the plurality of target video segments is greater than a preset duration, and the processor 401 is specifically configured to execute the following steps according to the computer program:
acquiring behavior data of at least one user watching the video to be processed;
dividing the video to be processed into a plurality of sub-video segments according to a preset duration, wherein the playing duration of each sub-video segment does not exceed the preset duration;
determining behavior features of each sub-video segment in at least one dimension according to the behavior data;
and calculating the highlight rating data corresponding to each sub-video segment according to the behavior features of each sub-video segment in the at least one dimension.
In a possible implementation, the processor 401 is specifically configured to execute the following steps according to the computer program:
determining start segmentation points and end segmentation points corresponding to a plurality of candidate video segments in the video to be processed according to the highlight rating data corresponding to the video to be processed;
performing semantic analysis on audio content corresponding to a target candidate video segment to obtain a semantic analysis result corresponding to the target candidate video segment, wherein the target candidate video segment is any one of the plurality of candidate video segments;
adjusting the start segmentation point and/or the end segmentation point of the target candidate video segment according to the semantic analysis result;
and segmenting the target candidate video segment from the video to be processed according to the adjusted start segmentation point and/or end segmentation point.
In a possible implementation, the processor 401 is specifically configured to execute the following steps according to the computer program:
identifying transition positions in the video to be processed;
and adjusting the start segmentation point and/or the end segmentation point of the target candidate video segment according to the transition positions and the semantic analysis result.
In a possible implementation manner, when the video to be processed is a first type of video, the processor 401 is specifically configured to execute the following steps according to the computer program:
acquiring an original video;
and determining the video to be processed from the original video by using a music recognition algorithm, wherein the audio content of the video to be processed comprises at least one complete piece of music.
In a possible implementation, when the video to be processed is a second type of video, the processor 401 is further configured to execute the following step according to the computer program: filtering out invalid segments from the plurality of target video segments by using a preset filtering rule, wherein the invalid segments comprise video segments containing specific persons and/or advertisement content;
and the processor 401 is specifically configured to execute the following step according to the computer program: splicing the target video segments remaining after the filtering to obtain the target video.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is configured to store a computer program for executing the method described in the foregoing method embodiments.
As can be seen from the above description of the embodiments, those skilled in the art will clearly understand that all or part of the steps of the methods in the above embodiments may be implemented by software plus a general-purpose hardware platform. Based on this understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium such as a read-only memory (ROM)/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present application.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiment is described relatively briefly because it is substantially similar to the method embodiment, and reference may be made to the description of the method embodiment for the relevant details. The apparatus embodiments described above are merely illustrative: the modules described as separate parts may or may not be physically separate, and the parts shown as modules may or may not be physical modules, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The above description is only an exemplary embodiment of the present application, and is not intended to limit the scope of the present application.

Claims (10)

1. A video clipping method, characterized in that the method comprises:
acquiring a video to be processed and highlight rating data corresponding to the video to be processed, wherein the highlight rating data is determined based on behavior data of at least one user watching the video to be processed;
segmenting a plurality of target video segments from the video to be processed according to the highlight rating data corresponding to the video to be processed, wherein the highlight score corresponding to each target video segment is greater than a preset score threshold;
and splicing the plurality of target video segments to obtain a target video, wherein the playing duration of the target video is shorter than the playing duration of the video to be processed.
2. The method according to claim 1, wherein the playing duration of each of the plurality of target video segments is greater than a preset duration, and the obtaining of the highlight rating data corresponding to the video to be processed comprises:
acquiring behavior data of at least one user watching the video to be processed;
dividing the video to be processed into a plurality of sub-video segments according to a preset duration, wherein the playing duration of each sub-video segment does not exceed the preset duration;
determining behavior features of each sub-video segment in at least one dimension according to the behavior data;
and calculating the highlight rating data corresponding to each sub-video segment according to the behavior features of each sub-video segment in the at least one dimension.
3. The method according to claim 1, wherein the segmenting a plurality of target video segments from the video to be processed according to the highlight rating data corresponding to the video to be processed comprises:
determining start segmentation points and end segmentation points corresponding to a plurality of candidate video segments in the video to be processed according to the highlight rating data corresponding to the video to be processed;
performing semantic analysis on audio content corresponding to a target candidate video segment to obtain a semantic analysis result corresponding to the target candidate video segment, wherein the target candidate video segment is any one of the plurality of candidate video segments;
adjusting the start segmentation point and/or the end segmentation point of the target candidate video segment according to the semantic analysis result;
and segmenting the target candidate video segment from the video to be processed according to the adjusted start segmentation point and/or end segmentation point.
4. The method according to claim 3, wherein said adjusting the start segmentation point and/or the end segmentation point of the target candidate video segment according to the semantic analysis result comprises:
identifying transition positions in the video to be processed;
and adjusting the start segmentation point and/or the end segmentation point of the target candidate video segment according to the transition positions and the semantic analysis result.
5. The method according to any one of claims 1 to 4, wherein when the video to be processed is a first type of video, the obtaining of the video to be processed comprises:
acquiring an original video;
and determining the video to be processed from the original video by using a music recognition algorithm, wherein the audio content of the video to be processed comprises at least one complete piece of music.
6. The method according to any one of claims 1 to 4, wherein when the video to be processed is a second type of video, the method further comprises:
filtering out invalid segments from the plurality of target video segments by using a preset filtering rule, wherein the invalid segments comprise video segments containing specific persons and/or advertisement content;
the splicing, based on the plurality of target video segments, to obtain the target video then comprises:
splicing the target video segments remaining after the filtering to obtain the target video.
7. A video clipping apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a video to be processed and the highlight rating data corresponding to the video to be processed, and the highlight rating data is determined based on behavior data of at least one user when watching the video to be processed;
the segmentation module is used for segmenting the video to be processed into a plurality of video segments according to the wonderness score data corresponding to the video to be processed, wherein the wonderness scores corresponding to the plurality of video segments obtained by segmentation are larger than a preset score threshold;
and the splicing module is used for splicing the video segments to obtain a target video, wherein the playing time of the target video is shorter than that of the video to be processed.
8. The apparatus of claim 7, wherein the playing duration of each of the plurality of target video segments is greater than a preset duration, and the obtaining module comprises:
a first obtaining unit, configured to obtain behavior data of at least one user watching the video to be processed;
a dividing unit, configured to divide the video to be processed into a plurality of sub-video segments according to a preset duration, wherein the playing duration of each sub-video segment does not exceed the preset duration;
a first determining unit, configured to determine, according to the behavior data, behavior features of each sub-video segment in at least one dimension;
and a calculating unit, configured to calculate the highlight rating data corresponding to each sub-video segment according to the behavior features of each sub-video segment in the at least one dimension.
9. A computing device, the device comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to perform the method of any of claims 1-6 in accordance with the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program for performing the method of any of claims 1-6.
CN202111675867.3A 2021-12-31 2021-12-31 Video editing method and device, computing equipment and storage medium Pending CN114143575A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111675867.3A CN114143575A (en) 2021-12-31 2021-12-31 Video editing method and device, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111675867.3A CN114143575A (en) 2021-12-31 2021-12-31 Video editing method and device, computing equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114143575A true CN114143575A (en) 2022-03-04

Family

ID=80381646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111675867.3A Pending CN114143575A (en) 2021-12-31 2021-12-31 Video editing method and device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114143575A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110650374A (en) * 2019-08-16 2020-01-03 咪咕文化科技有限公司 Clipping method, electronic device, and computer-readable storage medium
US20210201038A1 (en) * 2019-12-30 2021-07-01 Alibaba Group Holding Limited Method and apparatus for video processing
CN111447505A (en) * 2020-03-09 2020-07-24 咪咕文化科技有限公司 Video clipping method, network device, and computer-readable storage medium
CN111787356A (en) * 2020-07-09 2020-10-16 易视腾科技股份有限公司 Target video clip extraction method and device
CN112532897A (en) * 2020-11-25 2021-03-19 腾讯科技(深圳)有限公司 Video clipping method, device, equipment and computer readable storage medium
CN113709561A (en) * 2021-04-14 2021-11-26 腾讯科技(深圳)有限公司 Video editing method, device, equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115052188A (en) * 2022-05-09 2022-09-13 北京有竹居网络技术有限公司 Video editing method, device, equipment and medium
CN115022654A (en) * 2022-05-18 2022-09-06 北京达佳互联信息技术有限公司 Video editing method and device in live scene
CN115022654B (en) * 2022-05-18 2024-01-19 北京达佳互联信息技术有限公司 Video editing method and device in live broadcast scene
CN115190356A (en) * 2022-06-10 2022-10-14 北京达佳互联信息技术有限公司 Multimedia data processing method and device, electronic equipment and storage medium
CN115190356B (en) * 2022-06-10 2023-12-19 北京达佳互联信息技术有限公司 Multimedia data processing method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination