CN114339451A - Video editing method and device, computing equipment and storage medium - Google Patents

Info

Publication number
CN114339451A
Authority
CN
China
Prior art keywords
video
segment
original
target
playing time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111679121.XA
Other languages
Chinese (zh)
Inventor
张云栋
刘程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Iqiyi New Media Technology Co ltd
Original Assignee
Shanghai Iqiyi New Media Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Iqiyi New Media Technology Co ltd
Priority to CN202111679121.XA
Publication of CN114339451A

Abstract

The application discloses a video clipping method, a video clipping device, a computing device and a storage medium. The method comprises: acquiring an original video to be processed and identifying a plurality of key positions in the original video, wherein the audio content of the original video at the key positions comprises a target type of sound; segmenting the original video into a plurality of video segments according to the plurality of key positions; and splicing a target video based on the video segments, the playing duration of the target video being shorter than that of the original video. Compared with manual clipping, the method not only effectively reduces labor cost but also generally generates the clipped video more efficiently, and the automatic clipping of the original video can reach a high quality level.

Description

Video editing method and device, computing equipment and storage medium
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video editing method and apparatus, a computing device, and a storage medium.
Background
In practical application scenarios, a video with a long playing duration may be clipped to generate a shorter clipped video that contains the core video content. For example, under a talk-show feature on an Internet video website, several highlight clips of its funniest moments, cut from the feature, are usually published so that viewers can quickly watch the comedic highlights of the whole show.
At present, clipped videos are usually generated by manual editing, which not only incurs a high labor cost but is also generally inefficient.
Disclosure of Invention
The embodiments of the application provide a video clipping method, a video clipping device, a computing device and a storage medium, which aim to improve the efficiency of generating a clipped video and reduce its cost by clipping videos automatically.
In a first aspect, an embodiment of the present application provides a video clipping method, where the method includes:
acquiring an original video to be processed;
identifying a plurality of key locations in the original video, audio content of the original video at the key locations comprising target type sounds;
segmenting the original video to obtain a plurality of video segments according to a plurality of key positions in the original video;
and splicing to obtain a target video based on the plurality of video segments, wherein the playing time of the target video is shorter than that of the original video.
In a possible implementation, the segmenting a plurality of video segments from the original video according to a plurality of key positions in the original video includes:
determining starting segmentation points and ending segmentation points corresponding to a plurality of candidate video clips in the original video according to a plurality of key positions in the original video;
performing semantic analysis on audio content corresponding to the target candidate video clip to obtain a semantic analysis result corresponding to the target candidate video clip;
adjusting the initial segmentation point of the target candidate video clip according to the semantic analysis result;
and segmenting the original video to obtain the target candidate video segment according to the adjusted initial segmentation point and the termination segmentation point corresponding to the target candidate video segment.
In a possible implementation manner, the termination dividing point corresponding to the target candidate video segment is determined according to a transition position in the target candidate video segment, and a similarity between a video image at the transition position in the target candidate video segment and a previous frame video image is smaller than a preset threshold.
In a possible implementation manner, the splicing to obtain the target video based on the plurality of video segments includes:
dividing the plurality of video clips into a plurality of video sets, wherein different video clips in each video set have the same action character, and the action characters in the video clips in different video sets are different;
determining a playing time limit value corresponding to each video set according to the number of the video clips included in each video set and the maximum playing time of the target video, wherein the number of the video clips included in the video set is positively correlated with the playing time limit value corresponding to the video set;
and selecting a first video segment from each video set to be spliced according to the playing time limit value corresponding to each video set, so as to generate the target video, wherein the first video segment is the video segment selected from the video segments included in each video set.
In a possible implementation manner, each video set further includes a second video segment, where the second video segment is the rest of the video segments in the video set except for the first video segment;
then, the playing time length of the first video segment in each video set is longer than the playing time length of the second video segment in the video set, or the playing time length corresponding to the target type sound included in the first video segment in each video set is longer than the playing time length corresponding to the target type sound included in the second video segment in the video set.
In one possible implementation, the second video segment in each video set includes a character entrance segment and/or a performance ending segment.
In one possible embodiment, the dividing the plurality of video segments into a plurality of video sets includes:
performing face recognition on the plurality of video segments, and determining an action figure corresponding to each video segment;
and dividing the plurality of video clips into a plurality of video sets according to the action figures corresponding to the video clips.
In a second aspect, an embodiment of the present application further provides a video editing apparatus, including:
the acquisition module is used for acquiring an original video to be processed;
an identification module for identifying a plurality of key locations in the original video, audio content of the original video at the key locations comprising a target type sound;
the segmentation module is used for segmenting the original video to obtain a plurality of video segments according to a plurality of key positions in the original video;
and the splicing module is used for splicing the video segments to obtain a target video, wherein the playing time of the target video is shorter than that of the original video.
In one possible embodiment, the segmentation module includes:
the first determining unit is used for determining a starting segmentation point and an ending segmentation point corresponding to a plurality of candidate video clips in the original video according to a plurality of key positions in the original video;
the semantic analysis unit is used for performing semantic analysis on the audio content corresponding to the target candidate video clip to obtain a semantic analysis result corresponding to the target candidate video clip, wherein the target candidate video clip is any one of the candidate video clips;
the adjusting unit is used for adjusting the initial segmentation point of the target candidate video clip according to the semantic analysis result;
and the segmentation unit is used for segmenting the original video to obtain the target candidate video segment according to the adjusted initial segmentation point and the termination segmentation point corresponding to the target candidate video segment.
In a possible implementation manner, the termination dividing point corresponding to the target candidate video segment is determined according to a transition position in the target candidate video segment, and a similarity between a video image at the transition position in the target candidate video segment and a previous frame video image is smaller than a preset threshold.
In one possible embodiment, the splicing module includes:
the dividing unit is used for dividing the plurality of video clips into a plurality of video sets, wherein different video clips in each video set have the same action character, and the action characters in the video clips in the different video sets are different;
the second determining unit is used for determining playing time limit values respectively corresponding to the video sets according to the number of the video clips included in each video set and the maximum playing time of the target video, wherein the number of the video clips included in the video sets is positively correlated with the playing time limit value corresponding to the video set;
and the selecting unit is used for selecting a first video clip from each video set to splice according to the playing time limit value respectively corresponding to each video set, so as to generate the target video, wherein the first video clip is selected from the video clips included in each video set.
In a possible implementation manner, each video set further includes a second video segment, where the second video segment is the rest of the video segments in the video set except for the first video segment;
then, the playing time length of the first video segment in each video set is longer than the playing time length of the second video segment in the video set, or the playing time length corresponding to the target type sound included in the first video segment in each video set is longer than the playing time length corresponding to the target type sound included in the second video segment in the video set.
In one possible implementation, the second video segment in each video set includes a character entrance segment and/or a performance ending segment.
In a possible implementation manner, the dividing unit is specifically configured to:
performing face recognition on the plurality of video segments, and determining an action figure corresponding to each video segment;
and dividing the plurality of video clips into a plurality of video sets according to the action figures corresponding to the video clips.
In a third aspect, an embodiment of the present application further provides a computing device, which may include a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to perform the method according to any of the embodiments of the first aspect and the first aspect.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program is configured to execute the method described in the foregoing first aspect or any implementation of the first aspect.
In the implementation of the embodiments of the application, an original video to be processed is acquired and a plurality of key positions in the original video are identified, wherein the audio content of the original video at the key positions comprises a target type of sound; the original video is segmented into a plurality of video segments according to the plurality of key positions, and a target video is then obtained by splicing based on those video segments, the playing duration of the target video being shorter than that of the original video.
Compared with manual clipping, this not only effectively reduces labor cost but also generally generates the clipped video (namely, the target video) more efficiently. In addition, the audio content at the key positions includes target type sounds such as applause and laughter; since video segments that include such sounds are usually the highlight segments of the original video, the target video generated from them usually covers most or even all of the core segments of the original video, so the automatic clipping of the original video can reach a high quality level.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art according to the drawings.
FIG. 1 is a schematic diagram of an exemplary application scenario in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a video editing method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating another video editing method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a video editing apparatus according to an embodiment of the present application;
fig. 5 is a schematic hardware structure diagram of a computing device in an embodiment of the present application.
Detailed Description
Referring to fig. 1, a schematic view of an application scenario provided in the embodiment of the present application is shown. In the application scenario illustrated in fig. 1, a client 101 may have a communication connection with a computing device 102. Also, the client 101 may receive video provided by a user (e.g., a video clipmaker) and send the video to the computing device 102; the computing device 102 is configured to perform audio recognition, face recognition, video segmentation, and other processing on the received video, generate a clip video, and present the clip video to the user via the client 101.
The computing device 102 refers to a device having a data processing capability, and may be, for example, a terminal, a server, or the like. The client 101 may be implemented in a physical device separate from the computing device 102. For example, when the computing device 102 is implemented by a server, the client 101 may run on a user terminal or the like on the user side. Alternatively, the client 101 may also run on the computing device 102.
In practical application scenarios, the manner of manually generating the clip video is not only costly, but also inefficient. To this end, the embodiment of the present application provides a video clipping method, in which the computing device 102 automatically clips the original video, so as to improve the efficiency of generating the clipped video and reduce the cost. In particular implementations, the client 101 may send the original video provided by the user to the computing device 102. The computing device 102 identifies a plurality of key locations in the original video, the audio content of the original video at the key locations including a target type of sound, such as applause, laughter, and the like. Then, the computing device 102 segments the original video into a plurality of video segments according to a plurality of key positions in the original video, so that the computing device 102 splices the plurality of video segments with the target type of sound to obtain a target video, wherein the playing time of the target video is shorter than that of the original video.
Because the computing device 102 can automatically clip the original video and generate the target video according to the key positions in the original video, not only can the labor cost be effectively reduced compared to a manual clipping manner, but also the efficiency of generating the clipped video (i.e., the target video) is generally higher. In addition, the audio content at the key positions includes target type sounds such as applause and laughter; since video segments that include such sounds are usually the highlight segments of the original video, the target video generated from them usually covers most or even all of the core segments of the original video, so the automatic clipping performed by the computing device 102 on the original video can reach a high quality level.
It should be noted that the video in this embodiment refers to a video having both image and audio contents, that is, a video file includes not only video frame images of consecutive frames, but also audio data synchronized with the video frame images.
It is understood that the architecture of the application scenario shown in fig. 1 is only one example provided in the embodiment of the present application, and in practical applications, the embodiment of the present application may also be applied to other applicable scenarios, for example, the computing device 102 may automatically obtain one or more videos from the internet, and automatically generate clip videos corresponding to the respective videos through the above implementation manner. In summary, the embodiments of the present application may be applied in any applicable scenario and are not limited to the scenario examples described above.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, various non-limiting embodiments of the present application are described below with reference to the accompanying drawings. It is to be understood that the embodiments described are only some, not all, of the embodiments of the present application. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a video editing method in an embodiment of the present application, where the method may be applied to the application scenario illustrated in fig. 1, or may be applied to other applicable application scenarios, and the like. For convenience of explanation and understanding, the following description will be given by taking an application scenario shown in fig. 1 as an example. The method specifically comprises the following steps:
s201: and acquiring an original video to be processed.
For the sake of distinction and description, the video to be edited is referred to as original video in the present embodiment, and the video generated by editing is referred to as target video.
In one possible implementation, the raw video may be provided to the computing device 102 by a user. Specifically, the client 101 may present a video import interface to the user, so that the user can import the original video into the client 101 by performing a corresponding operation on the video import interface. The client 101 may then transmit the original video provided by the user to the computing device 102 over a network connection with the computing device 102.
In yet another possible implementation, the raw video may also be obtained by the computing device 102 from the internet. For example, the user may send an instruction to generate a clip video to the computing device 102 through the client 101, so that the computing device 102 may download a specific type of video from the internet, such as a talk show type video or a vocal type video, and the like, based on the instruction, and take the videos as original videos for subsequent clipping processing of the original videos.
It should be noted that the original video acquired by the computing device 102 may be one video or multiple videos, for example, the computing device 102 may generate a target video based on multiple original videos, and the like, which is not limited in this embodiment. For convenience of understanding and explanation, in this embodiment, an original video is taken as an example for explanation, and when the original video includes a plurality of videos, the implementation manner is similar to that of this embodiment, except that a plurality of video segments for subsequent splicing are derived from a plurality of different original videos.
S202: a plurality of key locations in an original video are identified, the audio content of the original video at each key location including a target type sound.
The target type of sound may be any one or more types of sound, such as laughter, applause, crying, or a special melody, or may be other types of sound; this embodiment does not limit it. Moreover, video content that includes the target type of sound is usually exactly the content that an actual application would want to clip out. For example, applause and/or laughter in the original video expresses the audience's approval of and cheering for a performer's wonderful performance (for example, a talk show performer or vocal performer), which generally means that the video segments near the applause and/or laughter are highlight segments of the performance and are the segments that would normally be chosen in manual editing. To this end, the computing device 102 may identify a plurality of key positions in the original video for the subsequent editing of the original video, where each key position can be represented by the position of a frame of video image in the original video.
As an implementation example, the computing device 102 may determine a plurality of key positions in the original video by way of audio recognition. Specifically, the computing device 102 may input the original video into an Artificial Intelligence (AI) model that has been trained in advance, and the AI model outputs a plurality of key positions in the original video. The AI model is trained in advance on video samples annotated with the target type of sound, so that the trained model can identify the target type of sound in a video.
In yet another example, the computing device 102 may determine multiple key positions in the original video by comparing voiceprint features. Specifically, the computing device 102 may obtain audio data that contains the target type of sound and extract the voiceprint features of that sound; the computing device 102 may then compare these voiceprint features, segment by segment, with the voiceprint features of the audio data in the original video, and determine the positions whose audio data have matching voiceprint features as key positions, thereby determining a plurality of key positions in the original video.
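As a purely illustrative sketch of the key-position identification described above (the patent does not prescribe any particular library or model, so the `is_target_sound` detector and the window/hop sizes below are assumptions), the audio track can be scanned in short windows and the windows flagged by a laughter/applause detector recorded as key positions:

```python
# Illustrative sketch only: `is_target_sound` stands in for whichever trained
# AI model or voiceprint matcher flags laughter/applause in a short audio
# window; window_sec and hop_sec are assumed values, not taken from the patent.
import numpy as np

def find_key_positions(audio: np.ndarray, sample_rate: int,
                       is_target_sound,
                       window_sec: float = 1.0, hop_sec: float = 0.5):
    """Scan the original video's audio track and return key positions
    (in seconds) whose audio content contains the target type of sound."""
    win = int(window_sec * sample_rate)
    hop = int(hop_sec * sample_rate)
    key_positions = []
    for start in range(0, max(len(audio) - win, 0) + 1, hop):
        window = audio[start:start + win]
        if is_target_sound(window, sample_rate):  # e.g. laughter/applause detector
            key_positions.append(start / sample_rate)
    return key_positions
```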
It should be noted that the above two implementation manners for determining the key positions are merely exemplary; in practical applications, the computing device 102 may also determine the key positions having the target type of sound in the original video in other manners, which this embodiment does not limit.
S203: and segmenting the original video to obtain a plurality of video segments according to a plurality of key positions in the original video.
And the audio content in each video segment obtained by segmentation comprises the target type sound.
In this embodiment, after determining each key position in the original video, the computing device 102 may segment a section of video near each key position to obtain a plurality of video segments including the target type sound content.
In one possible implementation, the computing device 102 may first determine a starting segmentation point and an ending segmentation point corresponding to a plurality of candidate video segments in the original video according to the plurality of key positions in the original video. The starting segmentation point refers to the starting point of a candidate video segment, and the video frame image at the starting segmentation point is the first frame image of the candidate video segment. Correspondingly, the ending segmentation point refers to the ending point of the candidate video segment, and the video frame image at the ending segmentation point is the last frame image of the candidate video segment. For example, the computing device 102 may determine the playing position 15 seconds (or another value, which may be set by an expert) before the key position as the starting segmentation point of the candidate video segment, and determine the playing position 1 second (or another value) after the key position as the ending segmentation point of the candidate video segment.
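A minimal sketch of this default boundary rule, assuming the offsets (15 seconds before, 1 second after the key position) are configurable parameters rather than values fixed by the patent:

```python
# Sketch: derive (start, end) boundaries, in seconds, for the candidate
# video segments around each key position, clamped to the video duration.
def candidate_boundaries(key_positions, video_duration,
                         before_sec=15.0, after_sec=1.0):
    """Return (start_point, end_point) pairs for the candidate segments."""
    segments = []
    for pos in key_positions:
        start = max(0.0, pos - before_sec)
        end = min(video_duration, pos + after_sec)
        segments.append((start, end))
    return segments
```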
Further, the computing device 102 may also determine the ending segmentation point of a candidate video segment based on a transition position in the original video. A transition position is a position where the person shown in the original video switches, for example from a performer to the audience or to the guest seats, and it corresponds to the position of a frame of video image in the original video. In a specific implementation, for any candidate video segment (hereinafter referred to as the target candidate video segment), the computing device 102 may first identify whether a transition position is included in the portion of the video that is no more than a playing-duration threshold away from the key position corresponding to the target candidate video segment. If so, the computing device 102 may further determine the ending segmentation point of the target candidate video segment based on the transition position, for example by taking the transition position itself, or any position after the transition position, as the ending segmentation point. In practice, when a complete piece of video content has been shot, the camera often switches to a different type of person (to enrich the viewing angles of the video, among other things), so a transition position can indicate the end of a complete piece of video content; determining the ending segmentation point of the target candidate video segment based on the transition position therefore makes the segment more coherent when played. If not, the computing device 102 may determine the ending segmentation point of the target candidate video segment based on the key position.
Illustratively, the computing device 102 may determine the transition position by comparing the similarity between two video images. Specifically, for the portion of the video within a preset playing duration (e.g., 3 seconds) after the key position, the computing device 102 may sequentially compare the image similarity between every two consecutive video frame images, and if the similarity between two frames is smaller than a preset threshold, determine the position of the latter frame as the transition position. If the image similarity between every pair of adjacent frames in that portion is greater than the preset threshold, the computing device 102 determines that no transition position exists there. For example, when calculating the similarity between two frames, the computing device 102 may first reduce both frames to a size of 8 by 8 pixels, i.e., 64 pixels per reduced frame; this step removes image detail, keeps only basic information such as structure and brightness, and reduces the subsequent amount of computation. Then, the computing device 102 may convert the two reduced frames to grayscale and compute the average gray value of each frame (i.e., the average of its 64 gray values). Next, the computing device 102 compares the gray value of each pixel in a frame with that frame's average gray value: a pixel whose gray value is greater than or equal to the average is marked as 1, and a pixel whose gray value is smaller than the average is marked as 0. Combining the 64 pixel marks of each frame according to a uniform rule yields a 64-bit hash value (composed of 1s and 0s), which can serve as the fingerprint of that frame. In this way, the computing device 102 may compare the 64-bit hash values of the two frames: when the number of differing bits between the two hash values exceeds a preset value (e.g., 5), the computing device 102 determines that the similarity between the two frames is low, and when it does not exceed the preset value, the computing device 102 determines that the similarity between the two frames is high.
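The comparison described above is essentially an average-hash (aHash) image fingerprint. The following sketch reproduces it with Pillow and NumPy; the 8×8 size and the 5-bit threshold follow the example values in the text, and the choice of libraries is an assumption for illustration only:

```python
# Minimal average-hash sketch of the similarity test described above.
import numpy as np
from PIL import Image

def average_hash(frame: Image.Image) -> np.ndarray:
    """Shrink to 8x8, convert to grayscale, and threshold each pixel
    against the mean gray value to obtain a 64-bit fingerprint."""
    small = frame.resize((8, 8)).convert("L")
    pixels = np.asarray(small, dtype=np.float32)
    return (pixels >= pixels.mean()).flatten()

def is_transition(prev_frame: Image.Image, curr_frame: Image.Image,
                  max_differing_bits: int = 5) -> bool:
    """Two frames are treated as a transition (low similarity) when more
    than max_differing_bits of their 64 hash bits differ."""
    distance = int(np.sum(average_hash(prev_frame) != average_hash(curr_frame)))
    return distance > max_differing_bits
```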
After determining the starting and ending segmentation points corresponding to each candidate video segment, for any candidate video segment (i.e., the target candidate video segment), the computing device 102 may perform semantic analysis on the audio content corresponding to the target candidate video segment to obtain the corresponding semantic analysis result. For example, the computing device 102 may recognize the subtitles in the target candidate video segment, for instance by means of Optical Character Recognition (OCR), to obtain the subtitle text corresponding to the segment, and then perform semantic analysis on that subtitle text to obtain the corresponding semantic analysis result. Alternatively, the computing device 102 may perform speech recognition on the audio data of the target candidate video segment to obtain the corresponding text content, and then perform semantic analysis on the recognized text to obtain the corresponding semantic analysis result.
After obtaining the semantic analysis result, the computing device 102 may adjust the starting segmentation point of the target candidate video segment accordingly; for example, based on a speech-integrity algorithm, the position in the original video where the first semantically complete sentence begins may be determined as the starting segmentation point of the target candidate video segment, so that the subtitle semantics in the segment cut on that basis are complete and coherent. The computing device 102 may then cut the target candidate video segment from the original video based on the adjusted starting segmentation point and the aforementioned ending segmentation point, and cut the plurality of candidate video segments from the original video in a similar manner.
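A sketch of this start-point adjustment, assuming the OCR/ASR step above has already produced timed sentences as (start, end, text) triples; `is_complete_sentence` is a hypothetical stand-in for the speech-integrity check, which the patent does not specify:

```python
# Move the start point forward to the beginning of the first semantically
# complete sentence inside the candidate segment (illustrative only).
def is_complete_sentence(text: str) -> bool:
    """Hypothetical completeness check: here, a sentence counts as complete
    if it ends with terminal punctuation."""
    return text.endswith(("。", "！", "？", ".", "!", "?"))

def adjust_start_point(start_point, end_point, timed_sentences):
    """timed_sentences: iterable of (start_sec, end_sec, text).
    Returns the adjusted starting segmentation point in seconds."""
    for sent_start, _end, text in sorted(timed_sentences):
        if start_point <= sent_start < end_point and is_complete_sentence(text):
            return sent_start
    return start_point  # fall back to the original start point
```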
S204: and splicing to obtain a target video based on the plurality of video segments, wherein the playing time of the target video is shorter than that of the original video.
After cutting the video segments from the original video, the computing device 102 may splice the video segments to generate the target video. The computing device 102 may splice them sequentially according to the playing order of each video segment in the original video, or may splice them in another order, which this embodiment does not limit.
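One possible way to realize the splicing step, shown here only as an illustration: if the selected segments have already been cut to separate files with identical codecs, they can be concatenated with ffmpeg's concat demuxer (this assumes ffmpeg is available on the system; the patent itself does not prescribe any splicing tool).

```python
# Concatenate segment files in the desired order into the target video.
import os
import subprocess
import tempfile

def splice_segments(segment_paths, output_path):
    # Write the ffmpeg concat list file.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for path in segment_paths:
            f.write(f"file '{os.path.abspath(path)}'\n")
        list_path = f.name
    # Copy-concatenate without re-encoding (segments must share codecs).
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0",
         "-i", list_path, "-c", "copy", output_path],
        check=True)
    os.unlink(list_path)
```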
In this embodiment, since the computing device 102 may automatically clip the original video and generate the target video according to the key position in the original video, compared to a manual clipping manner, not only the labor cost may be effectively reduced, but also the efficiency of generating the clipped video (i.e., the target video) is generally higher. Additionally, the computing device 102 generates the target video based on the splicing of video segments that include the target type of sound, which may generally result in a higher level of automated clipping effects of the computing device 102 on the original video.
In practical applications, there is usually a requirement on the playing duration of the target video generated by clipping an original video. For example, for an original video with a playing duration of 2 hours, the playing duration of the clipped target video may be required not to exceed 10 minutes. Therefore, if the total playing duration of the plurality of video segments is greater than the maximum playing duration of the target video to be generated, the computing device 102 may select a subset of the video segments to generate the target video.
Further, when the original video includes multiple sets of content performed individually by different performers, the computing device 102 may determine which video segments to select based on the number of video segments associated with each performer.
Specifically, referring to fig. 3, a flowchart of another video clipping method is provided in the embodiment of the present application, and as shown in fig. 3, the method may specifically include:
s301: and acquiring an original video to be processed.
S302: a plurality of key locations in an original video are identified, the audio content of the original video at each key location including a target type sound.
S303: and segmenting the original video to obtain a plurality of video segments according to a plurality of key positions in the original video.
The specific implementation manners of steps S301 to S303 are similar to the specific implementation manners of steps S201 to S203 in the foregoing embodiment, and specific reference may be made to the descriptions of the relevant portions of the foregoing embodiment, which is not described herein again.
S304: dividing the plurality of video segments into a plurality of video sets, wherein different video segments in each video set have the same action character, and the action characters in the video segments of different video sets are different.
In this embodiment, the computing device 102 may divide the plurality of video segments into a plurality of video sets, wherein each video set includes at least one video segment; when a video set includes multiple video segments, the different video segments in that set have the same action figure, while the action figures in different video sets are different.
As an implementation example, the computing device 102 may determine in advance the video interval of each action figure in the original video. For example, since in an original video such as a talk show it is common for one (or two, etc.) action figures to speak over a continuous time period, the computing device 102 may sample the original video to extract video frame images and determine the start and end points of each action figure's interval by means of face recognition. For example, the computing device 102 may extract 2 frames of video images per second from the original video, and recognize and record the faces appearing in each frame by means of a face recognition algorithm (or a face recognition model trained on a large number of face samples); when the same face appears continuously for more than 4 minutes, the start and end points of its continuous appearance are recorded as the start and end points of that character's interval. In this way, the computing device 102 may subsequently place the video segments that fall within the same start and end points into the same video set.
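A sketch of this sampling rule, assuming frames have already been sampled at about 2 frames per second and that `recognize_face` is a stand-in for the face recognition algorithm or model (assumed to return a stable identifier for the face found in a frame, or None when no face is present):

```python
# Record intervals in which the same face appears continuously for more
# than 4 minutes; these intervals later define the video sets.
MIN_CONTINUOUS_SEC = 4 * 60  # threshold from the example above

def character_intervals(sampled_frames, recognize_face):
    """sampled_frames: iterable of (timestamp_sec, frame) pairs.
    Returns (face_id, start_sec, end_sec) intervals per character."""
    intervals = []
    current_id, start, last_t = None, None, None
    for t, frame in sampled_frames:
        face_id = recognize_face(frame)
        if face_id != current_id:
            # close the previous run if it lasted long enough
            if current_id is not None and last_t - start > MIN_CONTINUOUS_SEC:
                intervals.append((current_id, start, last_t))
            current_id, start = face_id, t
        last_t = t
    if current_id is not None and last_t - start > MIN_CONTINUOUS_SEC:
        intervals.append((current_id, start, last_t))
    return intervals
```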
In another implementation example, the computing device 102 may also extract some of the video frame images from each video segment and identify the face images they contain through a face recognition algorithm (or a face recognition model), so as to determine the action figure corresponding to each video segment. The computing device 102 may then divide the plurality of video segments into a plurality of video sets according to the action figures corresponding to the video segments; specifically, video segments whose video frame images contain the same face image may be placed into the same video set.
S305: and determining the playing time limit value corresponding to each video set according to the number of the video clips included in each video set and the maximum playing time of the target video, wherein the number of the video clips included in the video set is positively correlated with the playing time limit value corresponding to the video set.
The number of the video clips included in the video set is positively correlated with the playing time limit value corresponding to the video set. That is, the larger the number of video segments included in a video set is, the larger the play duration limit value allocated to the video set is; conversely, the smaller the number of video segments included in a video set, the smaller the play duration limit value assigned to the video set.
After obtaining the plurality of video sets, the computing device 102 may determine the playing duration limit corresponding to each video set according to the number of video segments included in each set and the maximum playing duration of the target video. For example, if video set A includes 4 video segments and video set B includes 8 video segments, and the maximum playing duration of the target video to be generated is 3 minutes, then according to the proportions of video segments in sets A and B, the playing duration allocated to video set A is 1 minute (i.e., 3 × 4/12) and that allocated to video set B is 2 minutes (i.e., 3 × 8/12).
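The allocation rule just described amounts to scaling the maximum target duration by each set's share of segments; a small sketch (values in seconds, using the example figures from the text):

```python
# Allocate each video set a playing-duration limit proportional to the
# number of segments it contains.
def duration_limits(segment_counts, max_target_sec):
    """segment_counts: {set_id: number_of_segments}; returns
    {set_id: play_duration_limit_in_seconds}."""
    total = sum(segment_counts.values())
    return {set_id: max_target_sec * count / total
            for set_id, count in segment_counts.items()}

# Example from the text: duration_limits({"A": 4, "B": 8}, 180)
# -> {"A": 60.0, "B": 120.0}
```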
S306: and selecting a first video segment from each video set to splice according to the playing time limit value corresponding to each video set, so as to generate the target video.
The video segments selected from the video sets are referred to in this embodiment as first video segments, and the total playing duration of the first video segments selected from each video set does not exceed the playing duration limit corresponding to that set. The video segments in a video set that are not selected are referred to in this embodiment as second video segments.
In this embodiment, the first video segments may be selected from each video set according to the playing duration limit corresponding to that set, and the target video may be generated by splicing the selected first video segments. Still taking the 1-minute playing duration limit of video set A as an example, assume that video set A includes video segment 1, video segment 2, video segment 3 and video segment 4, whose playing durations are 14 seconds, 20 seconds, 18 seconds and 21 seconds respectively. The computing device 102 may take video segment 1, video segment 2 and video segment 3 in the set as first video segments and splice these three segments; the playing duration after splicing is 52 seconds (i.e., 14 seconds + 20 seconds + 18 seconds), which does not exceed the 1-minute (i.e., 60-second) playing duration limit of video set A, and the unselected video segment 4 is a second video segment.
In one exemplary way of selecting first video segments, the computing device 102 may preferentially select the video segments with longer playing durations as first video segments, so that the unselected second video segments have relatively shorter playing durations. For example, when selecting from the 4 video segments in video set A the first video segments used to generate the clipped video, the computing device 102 may first select video segment 4, which has the longest playing duration. Since the playing duration of video segment 4 does not exceed the playing duration limit (1 minute) of video set A, the computing device 102 may continue by selecting video segment 2, the longest of the remaining 3 segments. Since the total playing duration of video segments 2 and 4 still does not exceed 1 minute, the computing device 102 may continue by selecting video segment 3, the longest of the remaining 2 segments. At this point, the total playing duration of video segments 2, 3 and 4 still does not exceed 1 minute, but selecting the remaining segment would push the total beyond 1 minute; therefore, the computing device 102 may determine that the first video segments selected from video set A are video segments 2, 3 and 4.
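A sketch of this longest-first selection strategy; the segment ids and durations are taken from the example above, and the greedy loop is only one way to realize the described preference:

```python
# Keep taking the longest remaining segment whose addition stays within
# the set's play-duration limit.
def select_first_segments(durations, limit_sec):
    """durations: {segment_id: play_duration_sec}; returns the ids chosen
    as first video segments for one video set."""
    chosen, used = [], 0.0
    for seg_id, dur in sorted(durations.items(), key=lambda kv: -kv[1]):
        if used + dur <= limit_sec:
            chosen.append(seg_id)
            used += dur
    return chosen

# With video set A: {1: 14, 2: 20, 3: 18, 4: 21} and a 60 s limit, this
# picks segments 4, 2 and 3 (59 s total), leaving segment 1 unselected.
```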
Alternatively, the computing device 102 may preferentially select as first video segments those video segments in which the playing duration corresponding to the included target type of sound is longer, so that the playing duration corresponding to the target type of sound in the first video segments of a video set is greater than that in the second video segments of the same set. Of course, the above ways of selecting the first video segments that participate in video splicing are merely exemplary; other ways of determining the first video segments may be used in practice, which this embodiment does not limit.
In practical scenarios, a target type of sound such as applause or laughter may also occur when a character enters the stage or when a performance ends. Therefore, in a further possible implementation, the computing device 102 may identify the entrance segments (i.e., the video segments in which a character enters the stage) and/or the performance ending segments (i.e., the video segments in which a character's performance ends) included in each video set, and determine these segments as second video segments that do not participate in the subsequent splicing. Alternatively, the computing device 102 may identify the characters' entrance and/or performance ending segments while segmenting the plurality of video segments according to the key positions, and filter them out of the segmented video segments; this embodiment does not limit this.
For ease of understanding, the following takes as an example an original video that is a talk show in which multiple performers take the stage in turn. After obtaining the talk show video, the computing device 102 may identify the positions where laughter and/or applause occur by means of an AI model, or determine the key positions where laughter and/or applause occur by comparing voiceprint features, among other approaches. Then, for each key position, the computing device 102 may further determine the playing position 15 seconds before and the playing position 3 seconds after the key position, and identify whether a transition position is included between the key position and the playing position 3 seconds after it. If so, the computing device 102 may take the playing position 15 seconds before the key position as the starting segmentation point and the transition position as the ending segmentation point, and cut the video between those two points out of the original video. In this way, the computing device 102 can cut out, for the multiple key positions, multiple video segments that include laughter and/or applause. Since the clipped video to be generated is subject to a playing duration limit, when the sum of the playing durations of the cut-out video segments exceeds that limit, the computing device 102 may select a subset of the video segments to splice into the clipped video. In a specific implementation, the computing device 102 may divide the video segments into multiple video sets using a face recognition algorithm or a face recognition model, where each video set includes at least one video segment; for a set that includes multiple video segments, the different segments in that set feature the same performer, while different sets feature different performers. The computing device 102 may then count the number of video segments in each set and, according to the proportions of those counts and the maximum playing duration of the clipped video, determine for each set a limit on the total playing duration of the segments from that set that participate in the clip, so that one or more video segments can be selected from each set subject to its playing duration limit, with the total playing duration of the segments selected from a set not exceeding that limit. Moreover, when selecting video segments from a set, segments with relatively long playing durations, or segments in which the included laughter and/or applause lasts relatively long, may be selected preferentially, among other strategies. Finally, the computing device 102 may generate a talk show highlights compilation, i.e., the clipped video desired by the user, based on the video segments ultimately selected from the respective video sets.
In addition, the embodiment of the application also provides a video clipping device. Referring to fig. 4, fig. 4 is a schematic diagram illustrating a structure of a video clipping device 400 according to an embodiment of the present application, where the video clipping device includes:
an obtaining module 401, configured to obtain an original video to be processed;
an identifying module 402 for identifying a plurality of key locations in the original video, audio content of the original video at the key locations comprising a target type sound;
a segmentation module 403, configured to segment the original video into a plurality of video segments according to the plurality of key positions in the original video;
a splicing module 404, configured to splice to obtain a target video based on the multiple video segments, where a playing time of the target video is shorter than a playing time of the original video.
In a possible implementation, the segmentation module 403 includes:
the first determining unit is used for determining a starting segmentation point and an ending segmentation point corresponding to a plurality of candidate video clips in the original video according to a plurality of key positions in the original video;
the semantic analysis unit is used for performing semantic analysis on the audio content corresponding to the target candidate video clip to obtain a semantic analysis result corresponding to the target candidate video clip, wherein the target candidate video clip is any one of the candidate video clips;
the adjusting unit is used for adjusting the initial segmentation point of the target candidate video clip according to the semantic analysis result;
and the segmentation unit is used for segmenting the original video to obtain the target candidate video segment according to the adjusted initial segmentation point and the termination segmentation point corresponding to the target candidate video segment.
In a possible implementation manner, the termination dividing point corresponding to the target candidate video segment is determined according to a transition position in the target candidate video segment, and a similarity between a video image at the transition position in the target candidate video segment and a previous frame video image is smaller than a preset threshold.
In a possible implementation, the splicing module 404 includes:
the dividing unit is used for dividing the plurality of video clips into a plurality of video sets, wherein different video clips in each video set have the same action character, and the action characters in the video clips in the different video sets are different;
the second determining unit is used for determining playing time limit values respectively corresponding to the video sets according to the number of the video clips included in each video set and the maximum playing time of the target video, wherein the number of the video clips included in the video sets is positively correlated with the playing time limit value corresponding to the video set;
and the selecting unit is used for selecting a first video clip from each video set to splice according to the playing time limit value respectively corresponding to each video set, so as to generate the target video, wherein the first video clip is selected from the video clips included in each video set.
In a possible implementation manner, each video set further includes a second video segment, where the second video segment is the rest of the video segments in the video set except for the first video segment;
then, the playing time length of the first video segment in each video set is longer than the playing time length of the second video segment in the video set, or the playing time length corresponding to the target type sound included in the first video segment in each video set is longer than the playing time length corresponding to the target type sound included in the second video segment in the video set.
In one possible implementation, the second video segment in each video set includes a character entrance segment and/or a performance ending segment.
In a possible implementation manner, the dividing unit is specifically configured to:
performing face recognition on the plurality of video segments, and determining an action figure corresponding to each video segment;
and dividing the plurality of video clips into a plurality of video sets according to the action figures corresponding to the video clips.
It should be noted that, for the contents of information interaction, execution process, and the like between the modules and units of the apparatus, since the same concept is based on the method embodiment in the embodiment of the present application, the technical effect brought by the contents is the same as that of the method embodiment in the embodiment of the present application, and specific contents may refer to the description in the foregoing method embodiment in the embodiment of the present application, and are not described herein again.
In addition, the embodiment of the application also provides the computing equipment. Referring to fig. 5, fig. 5 is a schematic diagram illustrating a hardware structure of a computing device in an embodiment of the present application, where the computing device 500 may include a processor 501 and a memory 502.
Wherein the memory 502 is used for storing computer programs;
the processor 501 is configured to execute the following steps according to the computer program:
acquiring an original video to be processed;
identifying a plurality of key locations in the original video, audio content of the original video at the key locations comprising target type sounds;
segmenting the original video to obtain a plurality of video segments according to a plurality of key positions in the original video;
and splicing to obtain a target video based on the plurality of video segments, wherein the playing time of the target video is shorter than that of the original video.
In a possible implementation, the processor 501 is specifically configured to execute the following steps according to the computer program:
determining starting segmentation points and ending segmentation points corresponding to a plurality of candidate video clips in the original video according to a plurality of key positions in the original video;
performing semantic analysis on audio content corresponding to a target candidate video clip to obtain a semantic analysis result corresponding to the target candidate video clip, wherein the target candidate video clip is any one of the candidate video clips;
Adjusting the initial segmentation point of the target candidate video clip according to the semantic analysis result;
and segmenting the original video to obtain the target candidate video segment according to the adjusted initial segmentation point and the termination segmentation point corresponding to the target candidate video segment.
In a possible implementation manner, the termination dividing point corresponding to the target candidate video segment is determined according to a transition position in the target candidate video segment, and a similarity between a video image at the transition position in the target candidate video segment and a previous frame video image is smaller than a preset threshold.
In a possible implementation, the processor 501 is specifically configured to execute the following steps according to the computer program:
dividing the plurality of video clips into a plurality of video sets, wherein different video clips in each video set have the same action character, and the action characters in the video clips in different video sets are different;
determining a playing time limit value corresponding to each video set according to the number of the video clips included in each video set and the maximum playing time of the target video, wherein the number of the video clips included in the video set is positively correlated with the playing time limit value corresponding to the video set;
and selecting a first video segment from each video set to be spliced according to the playing time limit value corresponding to each video set, so as to generate the target video, wherein the first video segment is the video segment selected from the video segments included in each video set.
In a possible implementation manner, each video set further includes a second video segment, where the second video segment is the rest of the video segments in the video set except for the first video segment;
then, the playing time length of the first video segment in each video set is longer than the playing time length of the second video segment in the video set, or the playing time length corresponding to the target type sound included in the first video segment in each video set is longer than the playing time length corresponding to the target type sound included in the second video segment in the video set.
In one possible implementation, the second video segment in each video set includes a character entrance segment and/or a performance ending segment.
In a possible implementation, the processor 501 is specifically configured to execute the following steps according to the computer program:
performing face recognition on the plurality of video segments, and determining the character featured in each video segment;
and dividing the plurality of video segments into a plurality of video sets according to the character featured in each video segment.
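A sketch of this grouping step using the open-source face_recognition library as one possible face recognizer; the frame-sampling helper and the 0.6 distance tolerance are assumptions for illustration.

    import face_recognition
    import numpy as np

    def group_by_character(segments, sample_frame, tolerance=0.6):
        """Group video segments into video sets by the character they feature.

        sample_frame is a hypothetical helper returning one RGB frame (numpy array)
        from a segment; the 0.6 face-distance tolerance is assumed here.
        """
        known_encodings, video_sets = [], {}
        for segment in segments:
            encodings = face_recognition.face_encodings(sample_frame(segment))
            if not encodings:
                continue                                  # no face found; skip the segment
            encoding = encodings[0]
            if known_encodings:
                distances = face_recognition.face_distance(np.array(known_encodings), encoding)
                best = int(np.argmin(distances))
                if distances[best] <= tolerance:          # same character as an existing set
                    video_sets[best].append(segment)
                    continue
            known_encodings.append(encoding)              # a new character: open a new video set
            video_sets[len(known_encodings) - 1] = [segment]
        return video_sets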
In addition, the embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is used for storing a computer program, and the computer program is used for executing the method described in the above method embodiment.
From the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps of the methods in the above embodiments may be implemented by software plus a general-purpose hardware platform. Based on this understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium such as a read-only memory (ROM)/RAM, a magnetic disk, or an optical disk, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present application.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus embodiment is described relatively briefly because it is substantially similar to the method embodiment, and reference may be made to the description of the method embodiment for relevant details. The apparatus embodiments described above are merely illustrative: the modules described as separate parts may or may not be physically separate, and the parts shown as modules may or may not be physical modules; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without inventive effort.
The above description is only an exemplary embodiment of the present application, and is not intended to limit the scope of the present application.

Claims (10)

1. A method of video clipping, the method comprising:
acquiring an original video to be processed;
identifying a plurality of key positions in the original video, wherein the audio content of the original video at the key positions comprises a target type sound;
segmenting the original video to obtain a plurality of video segments according to a plurality of key positions in the original video;
and splicing to obtain a target video based on the plurality of video segments, wherein the playing time of the target video is shorter than that of the original video.
2. The method of claim 1, wherein the segmenting the original video into a plurality of video segments according to the plurality of key positions in the original video comprises:
determining starting segmentation points and ending segmentation points corresponding to a plurality of candidate video clips in the original video according to a plurality of key positions in the original video;
performing semantic analysis on audio content corresponding to a target candidate video clip to obtain a semantic analysis result corresponding to the target candidate video clip, wherein the target candidate video clip is any one of the candidate video clips;
adjusting the starting segmentation point of the target candidate video clip according to the semantic analysis result;
and segmenting the original video to obtain the target candidate video clip according to the adjusted starting segmentation point and the ending segmentation point corresponding to the target candidate video clip.
3. The method as claimed in claim 2, wherein the ending segmentation point corresponding to the target candidate video clip is determined according to a transition position in the target candidate video clip, and the similarity between the video image at the transition position in the target candidate video clip and the immediately preceding video image is smaller than a preset threshold.
4. The method according to any one of claims 1 to 3, wherein said splicing the target video based on the plurality of video segments comprises:
dividing the plurality of video segments into a plurality of video sets, wherein the video segments in each video set feature the same character, and the video segments in different video sets feature different characters;
determining a playing time limit value corresponding to each video set according to the number of the video segments included in the video set and the maximum playing time of the target video, wherein the number of the video segments included in a video set is positively correlated with the playing time limit value corresponding to the video set;
and according to the playing time limit value corresponding to each video set, selecting a first video segment from the video set and splicing the selected first video segments, so as to generate the target video, wherein the first video segment is a video segment selected from the video segments included in the video set.
5. The method according to claim 4, wherein each video set further includes second video segments, and the second video segments are the video segments in the video set other than the first video segment;
wherein the playing time of the first video segment in each video set is longer than the playing time of the second video segments in the video set, or the playing time of the target type sound included in the first video segment in each video set is longer than the playing time of the target type sound included in the second video segments in the video set.
6. The method of claim 5, wherein the second video segments in each video set comprise a character exit segment and/or an end-of-performance segment.
7. The method of claim 4, wherein the dividing the plurality of video segments into a plurality of video sets comprises:
performing face recognition on the plurality of video segments, and determining the character featured in each video segment;
and dividing the plurality of video segments into a plurality of video sets according to the character featured in each video segment.
8. A video clipping apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring an original video to be processed;
an identification module for identifying a plurality of key positions in the original video, wherein the audio content of the original video at the key positions comprises a target type sound;
the segmentation module is used for segmenting the original video to obtain a plurality of video segments according to a plurality of key positions in the original video;
and the splicing module is used for splicing the video segments to obtain a target video, wherein the playing time of the target video is shorter than that of the original video.
9. A computing device, the device comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to perform the method of any one of claims 1-7 in accordance with the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program for performing the method of any of claims 1-7.
CN202111679121.XA 2021-12-31 2021-12-31 Video editing method and device, computing equipment and storage medium Pending CN114339451A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111679121.XA CN114339451A (en) 2021-12-31 2021-12-31 Video editing method and device, computing equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114339451A true CN114339451A (en) 2022-04-12

Family

ID=81023208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111679121.XA Pending CN114339451A (en) 2021-12-31 2021-12-31 Video editing method and device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114339451A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117119266A (en) * 2023-02-16 2023-11-24 荣耀终端有限公司 Video score processing method, electronic device, and computer-readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113163272A (en) * 2020-01-07 2021-07-23 海信集团有限公司 Video editing method, computer device and storage medium
US20210271891A1 (en) * 2018-12-27 2021-09-02 Shenzhen Tcl New Technology Co., Ltd. Target character video clip playing method, system and apparatus, and storage medium
CN113709561A (en) * 2021-04-14 2021-11-26 腾讯科技(深圳)有限公司 Video editing method, device, equipment and storage medium
CN113766314A (en) * 2021-11-09 2021-12-07 北京中科闻歌科技股份有限公司 Video segmentation method, device, equipment, system and storage medium


Similar Documents

Publication Title
CN107707931B (en) Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
CN111460219B (en) Video processing method and device and short video platform
KR100707189B1 (en) Apparatus and method for detecting advertisment of moving-picture, and compter-readable storage storing compter program controlling the apparatus
CN110992993B (en) Video editing method, video editing device, terminal and readable storage medium
CN106162223B (en) News video segmentation method and device
RU2440606C2 (en) Method and apparatus for automatic generation of summary of plurality of images
CN108419141B (en) Subtitle position adjusting method and device, storage medium and electronic equipment
CN112511854B (en) Live video highlight generation method, device, medium and equipment
CN114143575A (en) Video editing method and device, computing equipment and storage medium
US20230215472A1 (en) Video-log production system
CN110881115B (en) Strip splitting method and system for conference video
CN108307250B (en) Method and device for generating video abstract
CN113613065A (en) Video editing method and device, electronic equipment and storage medium
CN114302174A (en) Video editing method and device, computing equipment and storage medium
CN112733654A (en) Method and device for splitting video strip
CN104320670A (en) Summary information extracting method and system for network video
WO2019128724A1 (en) Method and device for data processing
CN112995756A (en) Short video generation method and device and short video generation system
CN114339451A (en) Video editing method and device, computing equipment and storage medium
CN110797001A (en) Method and device for generating voice audio of electronic book and readable storage medium
CN113347489A (en) Video clip detection method, device, equipment and storage medium
CN116708055A (en) Intelligent multimedia audiovisual image processing method, system and storage medium
US20240062545A1 (en) Information processing device, information processing method, and recording medium
CN113012723A (en) Multimedia file playing method and device and electronic equipment
CN106375867A (en) Method and device for cutting advertisement in television video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination