CN109348287B - Video abstract generation method and device, storage medium and electronic equipment - Google Patents

Video abstract generation method and device, storage medium and electronic equipment

Info

Publication number
CN109348287B
CN109348287B
Authority
CN
China
Prior art keywords
style
key frame
dictionary
cluster
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811229502.6A
Other languages
Chinese (zh)
Other versions
CN109348287A (en
Inventor
冯俐铜
旷章辉
张伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd filed Critical Shenzhen Sensetime Technology Co Ltd
Priority to CN201811229502.6A
Publication of CN109348287A
Application granted
Publication of CN109348287B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Television Signal Processing For Recording (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure provide a video summary generation method and apparatus, a storage medium, and an electronic device. The video summary generation method includes: acquiring a plurality of first key frames in a video sequence; determining, from the plurality of first key frames, at least one valid key frame corresponding to a first style based on a summary dictionary of the first style; and generating a video summary of the first style based on the at least one valid key frame corresponding to the first style. With the technical solution of the embodiments of the present disclosure, a video summary of a specific style can be generated automatically, avoiding the heavy workload of manual summary extraction, thereby improving the efficiency and reducing the cost of video summary generation.

Description

Video abstract generation method and device, storage medium and electronic equipment
Technical Field
Embodiments of the present disclosure relate to the field of computer vision technology, and in particular to a video summary generation method and apparatus, a computer-readable storage medium, and an electronic device.
Background
With the continuous development and popularization of Internet technology, Internet video (including traditional video platforms and emerging short-video apps) has gradually become the main way users watch video, and the number of online videos has grown explosively. However, faced with a huge number of movies, television series, and variety shows, it is difficult for viewers to choose what they prefer in a short time. To facilitate viewing and increase click-through rates, a video platform can extract a content summary from each video for quick browsing by the audience; the content summary contains only the highlight clips of the video and can effectively help users find videos that match their own viewing preferences.
At present, video content summaries are mainly extracted manually, which requires a worker to watch the entire target video, record the storylines and highlights that occur in it, and then screen out the summary. Manually extracting a summary for every video is high-intensity labor and is impractical for the massive number of videos on a video platform.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a video summary generation technique.
According to an aspect of the embodiments of the present disclosure, there is provided a video summary generation method, including: acquiring a plurality of first key frames in a video sequence; determining, from the plurality of first key frames, at least one valid key frame corresponding to a first style based on a summary dictionary of the first style; and generating a video summary of the first style based on the at least one valid key frame corresponding to the first style.
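For concreteness, the following is a minimal sketch of how these three steps might be orchestrated in Python. Everything here is an illustrative assumption rather than the disclosed implementation: the color-histogram stand-in for feature extraction, the Euclidean distance, the threshold value, and all function names; acquisition of the key frames themselves is assumed to have happened upstream.

```python
import numpy as np

def frame_feature(frame: np.ndarray) -> np.ndarray:
    # Stand-in feature extractor: an L2-normalized 8x8x8 color histogram.
    # (A real system might use a deep CNN, as the detailed description suggests.)
    hist, _ = np.histogramdd(frame.reshape(-1, 3).astype(np.float64),
                             bins=(8, 8, 8), range=((0, 256),) * 3)
    v = hist.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def generate_style_summary(key_frames, summary_dictionary, threshold=0.5):
    """key_frames: list of (timestamp, HxWx3 uint8 image) pairs (step 1, assumed done).
    summary_dictionary: (K, D) array of cluster features for one style."""
    valid = []
    for ts, frame in key_frames:                      # step 2: select valid key frames
        feat = frame_feature(frame)
        if np.linalg.norm(summary_dictionary - feat, axis=1).min() <= threshold:
            valid.append((ts, frame))
    valid.sort(key=lambda item: item[0])              # step 3: keep temporal order
    return [frame for _, frame in valid]
```

Given the key frames and one dictionary per style, a summary of each style is then just one call per style.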
Optionally, the acquiring a plurality of first key frames in a video sequence includes: performing shot segmentation processing on the video sequence to obtain a plurality of first shots, wherein each first shot comprises a plurality of image frames in the video sequence; and performing key frame extraction processing on the plurality of first shots to obtain a plurality of first key frames.
Optionally, the summary dictionary of the first style includes cluster feature data corresponding to the first style.
Optionally, the cluster feature data includes at least one cluster feature, where the cluster feature is obtained based on feature data of a plurality of sample images included in a cluster corresponding to the first style.
Optionally, the determining, from the plurality of first key frames, at least one valid key frame corresponding to the first style based on the summary dictionary of the first style includes: acquiring feature data of each first key frame in the plurality of first key frames; and determining the at least one valid key frame corresponding to the first style from the plurality of first key frames according to the feature data of each of the plurality of first key frames and the summary dictionary of the first style.
Optionally, the determining at least one valid key frame corresponding to the first style from the plurality of first key frames according to the feature data of each of the plurality of first key frames and the summary dictionary of the first style includes: determining, based on the feature data of each first key frame of the plurality of first key frames, whether the first key frame matches the summary dictionary of the first style; and determining a first key frame, among the plurality of first key frames, that matches the summary dictionary of the first style as a valid key frame corresponding to the first style.
Optionally, the determining whether each first key frame matches the summary dictionary of the first style based on the feature data of the first key frame includes: determining the first key frame as a valid key frame corresponding to the first style in response to the minimum distance between the feature data of the first key frame and at least one cluster feature contained in the summary dictionary of the first style being less than or equal to a preset distance threshold; or, determining the first key frame as a valid key frame corresponding to the first style in response to at least one cluster feature contained in the summary dictionary of the first style including a cluster feature that matches the feature data of the first key frame.
Optionally, the method further includes: determining at least one valid key frame corresponding to a second style from the plurality of first key frames based on a summary dictionary of the second style, where the summary dictionary of the second style is different from the summary dictionary of the first style; and generating a video summary of the second style based on the at least one valid key frame corresponding to the second style.
Optionally, the generating a video summary of the first style based on the at least one valid key frame corresponding to the first style includes: sorting the at least one valid key frame corresponding to the first style according to timing information to obtain the video summary of the first style.
Optionally, before the determining, based on the summary dictionary of the first style, at least one valid key frame corresponding to the first style from the plurality of first key frames, the method further includes: acquiring a plurality of sample images in a video sequence sample; clustering the plurality of sample images to obtain at least one cluster corresponding to the first style; and obtaining the summary dictionary of the first style based on the at least one cluster corresponding to the first style.
Optionally, the plurality of sample images are a plurality of second key frames; the acquiring a plurality of sample images in a video sequence sample includes: performing shot segmentation processing on the video sequence sample to obtain a plurality of second shots; and performing key frame extraction on the plurality of second shots to obtain the plurality of second key frames.
Optionally, the clustering the plurality of sample images to obtain at least one cluster corresponding to the first style includes: extracting feature data of each sample image of the plurality of sample images; clustering the plurality of sample images based on the feature data of each sample image to obtain a plurality of clusters; and determining the at least one cluster corresponding to the first style from the plurality of clusters.
Optionally, the obtaining the summary dictionary of the first style based on the at least one cluster corresponding to the first style includes: averaging the feature data of the plurality of sample images included in a cluster to obtain the cluster feature of that cluster, where the summary dictionary of the first style includes the cluster feature of each cluster in the at least one cluster corresponding to the first style.
Optionally, the method further includes: obtaining at least one cluster corresponding to a second style based on a clustering result obtained by clustering the plurality of sample images; and obtaining a summary dictionary of the second style based on the at least one cluster corresponding to the second style.
Optionally, the video sequence samples comprise video trailers or video highlights.
Optionally, the method further includes: crawling the video sequence sample using style keywords.
According to a second aspect of the embodiments of the present disclosure, there is provided a video summary generation apparatus, including: the first acquisition module is used for acquiring a plurality of first key frames in a video sequence; a first selecting module, configured to determine, based on a summary dictionary of a first style, at least one valid key frame corresponding to the first style from the plurality of first key frames; and the first generation module is used for generating the video abstract of the first style based on at least one effective key frame corresponding to the first style.
Optionally, the first obtaining module includes: the first segmentation unit is used for carrying out shot segmentation processing on the video sequence to obtain a plurality of first shots, and each first shot comprises a plurality of image frames in the video sequence; and the first extraction unit is used for extracting key frames of the plurality of first shots to obtain a plurality of first key frames.
Optionally, the summary dictionary of the first style includes cluster feature data corresponding to the first style.
Optionally, the cluster feature data includes at least one cluster feature, where the cluster feature is obtained based on feature data of a plurality of sample images included in a cluster corresponding to the first style.
Optionally, the first selecting module includes: a first obtaining unit, configured to obtain feature data of each of the plurality of first key frames; a first determining unit, configured to determine at least one valid key frame corresponding to the first style from the plurality of first key frames according to the feature data of each of the plurality of first key frames and the summary dictionary of the first style.
Optionally, the first determining unit includes: a matching subunit, configured to determine, based on the feature data of each first key frame in the plurality of first key frames, whether the first key frame matches the summary dictionary of the first style; and a determining subunit, configured to determine a first key frame, among the plurality of first key frames, that matches the summary dictionary of the first style as a valid key frame corresponding to the first style.
Optionally, the determining subunit is configured to: determine the first key frame as a valid key frame corresponding to the first style in response to the minimum distance between the feature data of the first key frame and at least one cluster feature contained in the summary dictionary of the first style being less than or equal to a preset distance threshold; or, determine the first key frame as a valid key frame corresponding to the first style in response to at least one cluster feature contained in the summary dictionary of the first style including a cluster feature that matches the feature data of the first key frame.
Optionally, the apparatus further includes: a second selecting module, configured to determine at least one valid key frame corresponding to a second style from the plurality of first key frames based on a summary dictionary of the second style, where the summary dictionary of the second style is different from the summary dictionary of the first style; and a second generating module, configured to generate a video summary of the second style based on the at least one valid key frame corresponding to the second style.
Optionally, the first generating module is configured to: sort the at least one valid key frame corresponding to the first style according to timing information to obtain the video summary of the first style.
Optionally, the apparatus further includes: a second acquiring module, configured to acquire a plurality of sample images in a video sequence sample; a first clustering module, configured to cluster the plurality of sample images to obtain at least one cluster corresponding to the first style; and a third generating module, configured to obtain the summary dictionary of the first style based on the at least one cluster corresponding to the first style.
Optionally, the plurality of sample images are a plurality of second key frames; the second acquiring module includes: a second segmentation unit, configured to perform shot segmentation processing on the video sequence sample to obtain a plurality of second shots; and a second extraction unit, configured to perform key frame extraction on the plurality of second shots to obtain the plurality of second key frames.
Optionally, the first clustering module includes: a feature extraction unit, configured to extract feature data of each sample image in the plurality of sample images; a clustering unit, configured to cluster the plurality of sample images based on the feature data of each sample image to obtain a plurality of clusters; and a second determining unit, configured to determine the at least one cluster corresponding to the first style from the plurality of clusters.
Optionally, the third generating module is configured to: average the feature data of the plurality of sample images included in a cluster to obtain the cluster feature of that cluster, where the summary dictionary of the first style includes the cluster feature of each cluster in the at least one cluster corresponding to the first style.
Optionally, the apparatus further includes: a second clustering module, configured to obtain at least one cluster corresponding to a second style based on a clustering result obtained by clustering the plurality of sample images; and a fourth generating module, configured to obtain a summary dictionary of the second style based on the at least one cluster corresponding to the second style.
Optionally, the video sequence samples comprise video trailers or video highlights.
Optionally, the apparatus further includes: a crawling module, configured to crawl the video sequence samples using style keywords.
According to a third aspect of the embodiments of the present disclosure, there is also provided a computer-readable storage medium having stored thereon computer program instructions, where the program instructions, when executed by a processor, are used to implement any one of the video summary generation methods provided by the embodiments of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, there is also provided an electronic apparatus, including: a processor, a memory; the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute any video summary generation method provided by the embodiment of the disclosure.
According to the video summary generation solution provided by the embodiments of the present disclosure, a plurality of key frames in a video sequence are acquired, at least one valid key frame corresponding to a first style is determined from the acquired key frames according to a summary dictionary of the first style, and a video summary of the first style is generated based on the at least one valid key frame. A video summary of a specific style is thus generated automatically, avoiding the heavy workload of manual summary extraction, which improves the efficiency and reduces the cost of video summary generation.
Drawings
FIG. 1 is a flow diagram of a method of video summary generation according to some embodiments of the present disclosure;
FIG. 2 is a flow diagram of a method of summary dictionary generation provided in accordance with some embodiments of the present disclosure;
FIG. 3 is a block diagram of a video summary generation apparatus according to some embodiments of the present disclosure;
FIG. 4 is a block diagram of a video summary generation apparatus according to further embodiments of the present disclosure;
FIG. 5 is a schematic structural diagram of an electronic device according to some embodiments of the present disclosure.
Detailed Description
The following detailed description of embodiments of the present disclosure is provided in conjunction with the accompanying drawings (like numerals represent like elements throughout the several figures) and examples. The following examples are intended to illustrate the present disclosure, but are not intended to limit the scope of the present disclosure.
It will be understood by those skilled in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and neither imply any particular technical meaning nor any necessary logical order between them.
Fig. 1 is a flow diagram of a video summary generation method according to some embodiments of the present disclosure.
Referring to fig. 1, in step S110, a plurality of first key frames in a video sequence are acquired.
In the embodiments of the present disclosure, the video sequence may contain arbitrary content. It includes a plurality of video frames, among which are a plurality of first key frames that are continuous or discontinuous in time. A first key frame may be a video frame image at a point of significant change, or of slow change, in the motion or change of a video subject (a person or an object) in the video sequence.
In some optional embodiments, key frame extraction is performed on a plurality of video frames included in the video sequence to obtain a plurality of first key frames.
In other embodiments, a shot segmentation process is performed on a video sequence to obtain a plurality of shots, each shot comprising a plurality of image frames in the video sequence; and performing key frame extraction processing on the plurality of shots to obtain a plurality of first key frames.
Specifically, the video sequence is divided into at least one shot, each covering a part of the video sequence. Key frame extraction is performed on the at least two image frames (i.e., video frame images) included in each shot, obtaining one or more key frames per shot. For example, one key frame is extracted from each shot; as another example, two or more key frames are extracted from shots of longer duration. In this way, at least one key frame of the video sequence is obtained.
Shot segmentation processing divides a video sequence into one or more shots. Specific methods include, but are not limited to, edge-based segmentation, histogram-based segmentation, block matching, gradual shot-transition detection, and the like.
When key frame extraction is performed on a shot, optionally, key frames are extracted according to one or more factors, such as the contrast or brightness of each image frame in the shot, the frame difference between adjacent image frames, or state changes of a target object. For example, the image frame having the smallest frame difference from its previous image frame is selected as a key frame from the image frames of a shot. As another example, to ensure the image quality of the extracted key frames, image frames that are too dark are first filtered out, and key frames are selected from the remaining image frames. The embodiments of the present disclosure do not limit the specific implementation of key frame extraction.
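As a hedged illustration of the histogram-based segmentation and representative-frame selection just described (the gray-level histogram, the bin count, the boundary threshold, and the "closest to the shot mean" criterion are arbitrary choices for this sketch, not values from the disclosure):

```python
import numpy as np

def gray_hist(frame: np.ndarray, bins: int = 64) -> np.ndarray:
    # Normalized gray-level histogram of an HxWx3 uint8 frame.
    h, _ = np.histogram(frame.mean(axis=2), bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)

def segment_shots(frames, boundary_threshold=0.35):
    """Histogram-difference shot segmentation: start a new shot whenever
    the L1 distance between consecutive frame histograms spikes."""
    hists = [gray_hist(f) for f in frames]
    shots, current = [], [0]
    for i in range(1, len(frames)):
        if np.abs(hists[i] - hists[i - 1]).sum() > boundary_threshold:
            shots.append(current)
            current = []
        current.append(i)
    shots.append(current)
    return shots  # each shot is a list of frame indices

def key_frame_per_shot(frames, shots):
    # Pick, per shot, the frame whose histogram is closest to the shot's
    # mean histogram -- one simple notion of "representative".
    keys = []
    for shot in shots:
        hs = np.stack([gray_hist(frames[i]) for i in shot])
        keys.append(shot[int(np.abs(hs - hs.mean(axis=0)).sum(axis=1).argmin())])
    return keys
```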
At step S120, at least one valid key frame corresponding to the first style is determined from the plurality of first key frames according to the summary dictionary of the first style.
Here, the first style is, for example, comedy, romance, action, horror, or sentimental, and the embodiments of the present disclosure do not limit the specific implementation of the first style.
In the embodiments of the present disclosure, the summary dictionary of the first style is predetermined, and one or more valid key frames are selected from the at least one key frame of the video sequence based on the summary dictionary. The summary dictionary of the first style may be determined in various ways: it may be manually defined or obtained by a machine learning method, for example, learned from video sequence samples, which is not limited in the embodiments of the present disclosure.
In some implementations, the summary dictionary of the first style is obtained by clustering sample data corresponding to the first style. In this case, the summary dictionary may be implemented in various ways. For example, the summary dictionary of the first style includes cluster feature data corresponding to the first style; as another example, it includes cluster sample data corresponding to the first style, and so on.
At S120, at least one valid key frame corresponding to the first style may be determined from the plurality of key frames based on the summary dictionary of the first style, where a valid key frame is a key frame, among the plurality of key frames, that matches the summary dictionary of the first style. In some optional embodiments, feature data of each of the plurality of key frames is acquired, and the at least one valid key frame corresponding to the first style is determined from the plurality of key frames according to the feature data of each key frame and the summary dictionary of the first style.
The feature data of a key frame can be acquired in various ways. For example, the feature data is extracted using a machine-learning-based feature extraction algorithm, or using a neural network. As an example, the key frame is input into a deep convolutional neural network, either directly or after one or more kinds of preprocessing, and its feature data is obtained through feature extraction processing, but the embodiments of the present disclosure are not limited thereto.
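The following sketch shows one common way to obtain such deep features with an off-the-shelf backbone; the choice of PyTorch/torchvision, ResNet-50 (pretrained weights assumed downloadable), the pooled 2048-d layer, and this preprocessing pipeline are assumptions for illustration, not something the embodiments prescribe:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pretrained ResNet-50 with its classification head replaced by identity,
# so each key frame maps to a 2048-d feature vector.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def key_frame_feature(frame_hwc_uint8):
    x = preprocess(frame_hwc_uint8).unsqueeze(0)  # (1, 3, 224, 224)
    return backbone(x).squeeze(0).numpy()         # (2048,) feature data
```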
The feature data of an image frame represents its semantic information. Selecting valid key frames corresponding to the first style based on the feature data of the key frames allows the generated video summary to correspond better to the first style, thereby enhancing user experience.
Optionally, valid key frames matching the summary dictionary of the first style are determined from the plurality of key frames based on feature data of each key frame of the plurality of key frames. For example, based on the feature data of each key frame, it is determined whether each key frame matches the summary dictionary of the first style, and a key frame matching the summary dictionary of the first style among the plurality of key frames is determined as a valid key frame corresponding to the first style, but the embodiments of the present disclosure are not limited thereto.
In some implementations, the summary dictionary includes at least one cluster feature, each derived from the feature data of the sample images included in a cluster corresponding to the first style. In this case, whether a key frame matches the summary dictionary, i.e., whether it is a valid key frame, may be determined in a variety of ways.
For example, based on the feature data of the key frame, it is determined whether the summary dictionary contains a cluster feature matching the key frame; if so, the key frame is determined to match the summary dictionary, i.e., to be a valid key frame. Here, optionally, the key frame is determined to match a cluster feature when the distance between the feature data of the key frame and the cluster feature is less than or equal to a specific value, or the match may be determined in other ways, which is not limited by the embodiments of the present disclosure.
As another example, the distance between the feature data of the key frame and each of the at least one cluster feature included in the summary dictionary is computed, yielding at least one distance, and whether the key frame matches the summary dictionary is determined based on the at least one distance. This determination can in turn be made in various ways. In one example, it is based on the minimum of the at least one distance: when the minimum distance is less than or equal to a specific value, the key frame is determined to match the summary dictionary, i.e., to be a valid key frame. In another example, it is based on the average or maximum of the at least one distance, which is not limited by the embodiments of the present disclosure.
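The two matching rules sketched below mirror the alternatives described above; the Euclidean metric and the threshold values are assumptions for illustration:

```python
import numpy as np

def matches_any_cluster(feat, dictionary, per_cluster_threshold=0.5):
    # Rule 1: the key frame matches if SOME cluster feature is close enough.
    return any(np.linalg.norm(c - feat) <= per_cluster_threshold
               for c in dictionary)

def matches_by_min_distance(feat, dictionary, min_distance_threshold=0.5):
    # Rule 2: compute all distances, then compare only the minimum
    # against the preset distance threshold.
    return np.linalg.norm(dictionary - feat, axis=1).min() <= min_distance_threshold
```

With the same metric and threshold the two rules accept exactly the same key frames; the average- or maximum-distance variants mentioned above would simply swap .min() for .mean() or .max().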
In step S130, a video summary of the first style is generated based on the at least one valid key frame corresponding to the first style.
After the at least one valid key frame corresponding to the first style is determined, optionally, the valid key frames may be merged directly, or merged after one or more kinds of preprocessing, to obtain the video summary of the first style. In some optional embodiments, the at least one valid key frame corresponding to the first style is merged according to timing information, so that the temporal order of the valid key frames in the video summary is the same as their order in the video sequence. Alternatively, the valid key frames are merged according to other rules, for example, based on information such as the target objects, scenes, or events they contain, which is not limited in this disclosure.
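A small sketch of the timing-based merging step, assuming OpenCV is available; the codec, frame rate, and the idea of holding each key frame for one output frame are arbitrary choices here:

```python
import cv2  # OpenCV, assumed available

def write_summary(valid_key_frames, path="summary.mp4", fps=2.0):
    """valid_key_frames: list of (timestamp, HxWx3 BGR uint8 frame) pairs."""
    if not valid_key_frames:
        raise ValueError("no valid key frames to merge")
    # Restore the original temporal order before merging.
    ordered = sorted(valid_key_frames, key=lambda item: item[0])
    h, w = ordered[0][1].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for _, frame in ordered:
        writer.write(frame)  # same order as in the source video sequence
    writer.release()
```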
In practical applications, the above video summary generation method can be applied to generate video summaries of one or more styles. For example, video summaries of multiple styles (such as action, romance, thriller, and suspense) are generated for one video sequence: based on the summary dictionary of each of the multiple styles, at least one valid key frame corresponding to that style is determined from the plurality of key frames, and a video summary of that style is generated, so as to meet the different requirements of users.
For example, in step S120, at least one valid key frame corresponding to a second style is determined from the plurality of first key frames based on a summary dictionary of the second style, where the second style is different from the first style. In step S130, the at least one valid key frame corresponding to the second style is further merged according to timing information to obtain a video summary of the second style.
Specifically, according to the summary dictionary of each style, the valid key frames corresponding to that style are determined from the plurality of key frames, and the valid key frames corresponding to each of the multiple styles are merged to generate a video summary of that style.
In some optional embodiments of the present disclosure, the first style summary dictionary is generated before step S110 is performed or before step S120 is performed.
In some alternative embodiments, the first style of summary dictionary is generated using the summary dictionary generation method illustrated in fig. 2.
Referring to fig. 2, in step S210, a plurality of sample images in a video sequence sample are acquired.
Optionally, the video sequence samples may include video trailers or video highlight compilations, so that the processing of steps S220 and S230 described below is performed based on them to generate the summary dictionary of the first style. A video trailer or highlight compilation is representative of the full video content; generating the summary dictionary from it therefore achieves coverage of the full content without processing the entire video sequence, saving processing time and improving processing efficiency while keeping the generated summary dictionary effective.
The video sequence samples may be acquired in various ways. Optionally, they are crawled using style keywords. For example, video trailers or highlight compilations of the first style are crawled from a network video platform using keywords corresponding to the first style. Here, the crawling process includes, for example, crawling, from the video database of a network video platform, video trailers or highlight compilations whose video tags include the keywords corresponding to the first style; however, the embodiments of the present disclosure do not limit the specific implementation of acquiring video sequence samples.
According to an exemplary embodiment of the present disclosure, the acquired sample images are some or all of the image frames in the video sequence sample; for example, they are a plurality of key frames of the video sequence sample (hereinafter referred to as second key frames or key frame samples). To obtain the sample images, shot segmentation processing can be performed on the video sequence sample to obtain a plurality of second shots (shot samples), and key frame extraction is performed on each second shot to extract at least one key frame from it, thus obtaining a plurality of second key frames of the video sequence sample. Here, for the specific manner of the shot segmentation processing and the second key frame extraction, reference may be made to the shot segmentation and first key frame extraction in step S110, which are not repeated here.
In step S220, a plurality of sample images are clustered to obtain at least one cluster corresponding to the first style.
In the embodiments of the present disclosure, the clustering process may be performed in various ways. Optionally, feature data of each of the plurality of sample images is extracted, the sample images are clustered based on their feature data to obtain a plurality of clusters, and at least one cluster corresponding to the first style is then determined from the plurality of clusters.
In some embodiments, feature extraction is performed on the plurality of sample images with a deep neural network, obtaining deep feature data for each sample image. Optionally, the sample images are clustered according to the positions of their feature data in the feature space, yielding a clustering result that contains a plurality of clusters. For example, the sample images are clustered based on the distances between the feature data of pairs of sample images: sample images whose feature data are within a set distance of each other are assigned to the same cluster, but this is not limited by the embodiments of the present disclosure.
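As one hedged illustration, scikit-learn's K-means can play the role of the clustering step; the algorithm, the cluster count, and the function name are assumptions for this sketch, since the embodiments do not prescribe a particular clustering method:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_sample_features(features: np.ndarray, n_clusters: int = 50):
    """features: (N, D) array, one deep feature vector per sample image.
    Returns per-sample cluster labels and a mapping cluster -> sample rows."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(features)
    clusters = {k: np.flatnonzero(labels == k) for k in range(n_clusters)}
    return labels, clusters
```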
In determining at least one cluster corresponding to the first style from the clustering result, the cluster(s) may be obtained from the plurality of clusters based on a pre-trained classification model (a classifier, or a neural network system for classification). Here, optionally, a classification model for identifying pictures of the first style is trained using, as training samples, a plurality of pictures of the first style crawled from a network platform; at least one cluster is then selected from the clustering result based on the trained classification model, the selected cluster(s) corresponding to image frames of the first style, that is, to the first style.
Alternatively, the style corresponding to a cluster is determined based on style information annotated on the sample images it contains. Or, at least one cluster corresponding to the first style is screened manually from the plurality of clusters; that is, the style of each cluster is determined by inspecting the sample images it contains, and the cluster(s) corresponding to the first style are then screened out. The embodiments of the present disclosure do not limit the specific implementation of the screening method.
In step S230, a summary dictionary of the first style is obtained based on the at least one cluster corresponding to the first style.
Optionally, the feature data of the sample images included in each of the at least one cluster corresponding to the first style is averaged to obtain the cluster feature of that cluster; that is, the center of the cluster is taken as its cluster feature. For example, the average of the feature values of the feature data of the sample images in each cluster is computed and taken as the cluster feature value. Here, the feature data of a sample image is represented, for example, as a feature vector or a feature matrix, which is not limited in the embodiments of the present disclosure.
The summary dictionary of the first style is then generated based on the cluster features of the at least one cluster corresponding to the first style: the resulting summary dictionary includes the cluster feature of each cluster in the at least one cluster corresponding to the first style.
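Continuing the clustering sketch above, the summary dictionary for one style could then be assembled as the per-cluster means; which clusters belong to the style is assumed to come from the classifier-based or manual screening described earlier:

```python
import numpy as np

def build_summary_dictionary(features, clusters, style_cluster_ids):
    """features: (N, D) sample-image features; clusters: cluster -> row indices;
    style_cluster_ids: ids of clusters judged to correspond to the target style."""
    # One cluster feature per style cluster: the mean of its members' features.
    return np.stack([features[clusters[k]].mean(axis=0)
                     for k in style_cluster_ids])
```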
In practical applications, style-specific summary dictionaries can be generated by weakly supervised learning, avoiding the huge labor and computation costs of intensive manual annotation; moreover, the models required for generating the summary dictionary are quick to train and fast to run.
It should be noted that, after steps S210 to S230 are performed to generate the summary dictionary of the first style, steps S220 to S230 may be performed repeatedly to generate a summary dictionary for each of a plurality of other styles. For example, based on the clustering result obtained by clustering the plurality of sample images, at least one cluster corresponding to the second style is obtained, and the summary dictionary of the second style is obtained based on that cluster.
Here, if the video sequence sample acquired in step S210 is specific to the first style, steps S210 to S230 are re-executed: a video sequence sample corresponding to the second style is acquired, and the summary dictionary of the second style is generated.
That is, before the video summary generation method (steps S110 to S130) is performed, the summary dictionary generation method (steps S210 to S230) may be performed multiple times to generate a summary dictionary for each of a plurality of styles. Specifically, a plurality of sample images in the video sequence sample of each style is acquired; the sample images of each style are clustered to obtain at least one cluster corresponding to that style; and a summary dictionary for each style is generated based on its corresponding cluster(s). The video summary generation method of the embodiments of the present disclosure can then be executed with the obtained summary dictionaries to generate a video summary of each style, meeting user requirements.
According to the video summary generation method described above, a plurality of key frames in a video sequence are acquired, at least one valid key frame corresponding to a first style is determined from them according to a summary dictionary of the first style, and a video summary of the first style is generated based on the at least one valid key frame. This realizes automatic generation of style-specific video summaries, avoids the heavy workload of manual summary extraction, improves generation efficiency, and reduces generation cost. Selecting valid key frames by matching key frame feature data against the summary dictionary improves the effectiveness of the generated video summary and enhances user experience. Weakly supervised learning avoids large amounts of manual annotation and further improves generation efficiency. Moreover, because the video summary is generated by performing shot segmentation and key frame extraction on the video sequence and then feature extraction on the acquired key frames, the video sequence does not need to be processed frame by frame; coverage of the entire video sequence is still achieved, further improving generation efficiency while ensuring the effectiveness of the generated summary.
The video summary generation method of the embodiments of the present disclosure may be performed by any suitable device having corresponding image or data processing capability, including but not limited to a terminal device such as a computer, or a computer program or processor integrated on a terminal device.
Based on the same technical concept, fig. 3 is a block diagram of a video summary generation apparatus according to some embodiments of the present disclosure. The apparatus can be used to execute the video summary generation method flow described in the above embodiments.
Referring to fig. 3, a video summary generation apparatus according to some alternative embodiments of the present disclosure includes: a first obtaining module 310, configured to obtain a plurality of first key frames in a video sequence; a first selecting module 320, configured to determine, based on a summary dictionary of a first style, at least one valid key frame corresponding to the first style from the plurality of first key frames; a first generating module 330, configured to generate a video summary of the first style based on at least one valid key frame corresponding to the first style.
Optionally, referring to fig. 4, on the basis of the video summary generating apparatus shown in fig. 3, the first obtaining module 310 includes: a first dividing unit 3101, configured to perform shot division processing on the video sequence to obtain a plurality of first shots, where each first shot includes a plurality of image frames in the video sequence; a first extracting unit 3102, configured to perform key frame extraction processing on the plurality of first shots to obtain a plurality of first key frames.
Optionally, the summary dictionary of the first style includes cluster feature data corresponding to the first style.
Optionally, the cluster feature data includes at least one cluster feature, where the cluster feature is obtained based on feature data of a plurality of sample images included in a cluster corresponding to the first style.
Optionally, the first selecting module 320 includes: a first obtaining unit 3201, configured to obtain feature data of each first key frame in the plurality of first key frames; a first determining unit 3202, configured to determine at least one valid key frame corresponding to the first style from the plurality of first key frames according to the feature data of each of the plurality of first key frames and the summary dictionary of the first style.
Optionally, the first determining unit 3202 includes: a matching subunit (not shown in the figure) for determining whether each first key frame of the plurality of first key frames matches with the summary dictionary of the first style based on the feature data of the first key frame; a determining subunit (not shown in the figure), configured to determine a first key frame, which is matched with the summary dictionary of the first style, in the plurality of first key frames as a valid key frame corresponding to the first style.
Optionally, the determining subunit is configured to: determine the first key frame as a valid key frame corresponding to the first style in response to the minimum distance between the feature data of the first key frame and at least one cluster feature contained in the summary dictionary of the first style being less than or equal to a preset distance threshold; or, determine the first key frame as a valid key frame corresponding to the first style in response to at least one cluster feature contained in the summary dictionary of the first style including a cluster feature that matches the feature data of the first key frame.
Optionally, the apparatus further includes: a second selecting module 340, configured to determine at least one valid key frame corresponding to a second style from the plurality of first key frames based on a summary dictionary of the second style, where the summary dictionary of the second style is different from the summary dictionary of the first style; and a second generating module 350, configured to generate a video summary of the second style based on the at least one valid key frame corresponding to the second style.
Optionally, the first generating module 330 is configured to: sort the at least one valid key frame corresponding to the first style according to timing information to obtain the video summary of the first style.
Optionally, the apparatus further comprises: a second obtaining module 350, configured to obtain a plurality of sample images in a video sequence sample; the first clustering module 360 is configured to perform clustering processing on the plurality of sample images to obtain at least one cluster corresponding to the first style; a third generating module 370, configured to obtain a summary dictionary of the first style based on the at least one cluster corresponding to the first style.
Optionally, the plurality of sample images are a plurality of second keyframes; the second obtaining module 350 includes: a second segmentation unit 3501, configured to perform shot segmentation processing on the video sequence samples to obtain a plurality of second shots; a second extracting unit 3502, configured to perform key frame extraction on the plurality of second shots to obtain a plurality of second key frames.
Optionally, the first clustering module 360 includes: a feature extraction unit 3601, configured to extract feature data of each sample image in the plurality of sample images; a clustering unit 3602, configured to cluster the plurality of sample images based on the feature data of each sample image to obtain a plurality of clusters; and a second determining unit 3603, configured to determine the at least one cluster corresponding to the first style from the plurality of clusters.
Optionally, the third generating module 370 is configured to: average the feature data of the plurality of sample images included in a cluster to obtain the cluster feature of that cluster, where the summary dictionary of the first style includes the cluster feature of each cluster in the at least one cluster corresponding to the first style.
Optionally, the apparatus further includes: a second clustering module 380, configured to obtain at least one cluster corresponding to a second style based on a clustering result obtained by clustering the plurality of sample images; and a fourth generating module 381, configured to obtain a summary dictionary of the second style based on the at least one cluster corresponding to the second style.
Optionally, the video sequence samples comprise video trailers or video highlights.
Optionally, the apparatus further includes: a crawling module 390, configured to crawl the video sequence samples using style keywords.
The video summary generation device in the embodiment of the present disclosure is used to implement the corresponding video summary generation method in the foregoing method embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Some embodiments of the present disclosure further provide a computer program, which includes computer program instructions, and when the program instructions are executed by a processor, the computer program instructions are used to implement the steps corresponding to any video summary generation method provided by the embodiments of the present disclosure.
Some embodiments of the present disclosure also provide a computer readable storage medium, on which computer program instructions are stored, and the program instructions are used for implementing the steps corresponding to any video summary generation method provided by the embodiments of the present disclosure when being executed by a processor.
Some embodiments of the present disclosure also provide an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet, a server, or the like. Referring now to fig. 5, there is shown a schematic diagram of an electronic device 500 suitable for implementing a terminal device or server of an embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 includes one or more processors, communication elements, and the like, for example: one or more central processing units (CPUs) 501 and/or one or more graphics processors (GPUs) 513, which may perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 502 or loaded from a storage section 508 into a random access memory (RAM) 503. The communication elements include a communication component 512 and/or a communication interface 509. The communication component 512 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card; the communication interface 509 includes an interface such as a LAN network interface card or a modem, and performs communication processing via a network such as the Internet.
The processor may communicate with the read-only memory 502 and/or the random access memory 503 to execute the executable instructions, connect with the communication component 512 through the communication bus 504, and communicate with other target devices through the communication component 512, thereby completing the operations corresponding to any video summary generation method provided by the embodiments of the present disclosure, for example, acquiring a plurality of first key frames in a video sequence; determining at least one valid key frame corresponding to a first style from the plurality of first key frames based on a digest dictionary of the first style; and generating the video abstract of the first style based on at least one effective key frame corresponding to the first style.
In addition, the RAM 503 can also store various programs and data necessary for the operation of the apparatus. The CPU 501 or GPU 513, the ROM 502, and the RAM 503 are connected to one another through a communication bus 504. When the RAM 503 is present, the ROM 502 is an optional module: the RAM 503 stores executable instructions, or executable instructions are written into the ROM 502 at run time, and the executable instructions cause the processor to perform the operations corresponding to the video summary generation method. An input/output (I/O) interface 505 is also connected to the communication bus 504. The communication component 512 may be integrated, or may be configured with multiple sub-modules (e.g., multiple IB cards) linked over the communication bus.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 508 including a hard disk and the like; and the communication interface 509 including a network interface card such as a LAN card or a modem. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it can be installed into the storage section 508 as needed.
It should be noted that the architecture shown in fig. 5 is only an optional implementation manner, and in a specific practical process, the number and types of the components in fig. 5 may be selected, deleted, added or replaced according to actual needs; in different functional component settings, separate settings or integrated settings may also be used, for example, the GPU and the CPU may be separately set or the GPU may be integrated on the CPU, the communication element may be separately set, or the GPU and the CPU may be integrated, and so on. These alternative embodiments are all within the scope of the present disclosure.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flowchart, the program code may include instructions corresponding to performing the steps of the video summary generation method provided by embodiments of the present disclosure, for example, obtaining a plurality of first key frames in a video sequence; determining at least one valid key frame corresponding to a first style from the plurality of first key frames based on a digest dictionary of the first style; and generating the video abstract of the first style based on at least one effective key frame corresponding to the first style. In such an embodiment, the computer program may be downloaded and installed from a network via the communication element, and/or installed from the removable medium 511. The computer program, when executed by a processor, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that, according to implementation requirements, each component/step described in the embodiments of the present disclosure may be split into more components/steps, and two or more components/steps, or partial operations thereof, may be combined into a new component/step to achieve the purpose of the embodiments of the present disclosure.
The above methods according to the embodiments of the present disclosure may be implemented in hardware or firmware, or as software or computer code storable in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium and downloaded through a network to be stored in a local recording medium. The methods described herein may thus be processed by software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, it implements the processing methods described herein. Further, when a general-purpose computer accesses code for implementing the processes shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the processes shown herein.
Those of ordinary skill in the art will appreciate that the various illustrative units and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present disclosure.
The above description is only a specific implementation of the embodiments of the present disclosure, but the protection scope of the embodiments is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope of the embodiments of the present disclosure shall be covered by that scope. Therefore, the protection scope of the embodiments of the present disclosure shall be subject to the protection scope of the claims.

Claims (30)

1. A method for generating a video summary, characterized by comprising the following steps:
acquiring a plurality of first key frames in a video sequence;
determining at least one valid key frame corresponding to a first style from the plurality of first key frames based on a digest dictionary of the first style, wherein the digest dictionary of the first style comprises clustering feature data corresponding to the first style, the clustering feature data comprises at least one clustering feature, and the clustering feature is obtained based on feature data of a plurality of sample images included in a cluster corresponding to the first style;
and generating the video summary of the first style based on the at least one valid key frame corresponding to the first style.
2. The method of claim 1, wherein acquiring a plurality of first key frames in a video sequence comprises:
performing shot segmentation processing on the video sequence to obtain a plurality of first shots, wherein each first shot comprises a plurality of image frames in the video sequence;
and performing key frame extraction processing on the plurality of first shots to obtain a plurality of first key frames.
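One plausible realization of these two steps, sketched in Python with OpenCV: shot boundaries are detected where the grayscale-histogram correlation between consecutive frames drops, and the middle frame of each shot is taken as its key frame. The histogram method and the 0.7 cutoff are assumptions of the sketch; the claim fixes neither.

```python
import cv2

def shots_and_key_frames(path, cut_thresh=0.7):
    cap = cv2.VideoCapture(path)
    frames, shots = [], []
    prev_hist, start, idx = None, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        cv2.normalize(hist, hist)
        # a sharp drop in histogram correlation marks a shot cut
        if prev_hist is not None and \
                cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < cut_thresh:
            shots.append((start, idx - 1))
            start = idx
        prev_hist = hist
        idx += 1
    cap.release()
    if frames:
        shots.append((start, idx - 1))
    # the middle frame of each shot serves as that shot's key frame
    key_frames = [frames[(a + b) // 2] for a, b in shots]
    return shots, key_frames
```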
3. The method according to claim 1, wherein determining at least one valid key frame corresponding to the first style from the plurality of first key frames based on the digest dictionary of the first style comprises:
acquiring feature data of each first key frame in the plurality of first key frames;
and determining at least one valid key frame corresponding to the first style from the plurality of first key frames according to the feature data of each first key frame in the plurality of first key frames and the digest dictionary of the first style.
4. The method according to claim 3, wherein determining at least one valid key frame corresponding to the first style from the plurality of first key frames according to the feature data of each first key frame in the plurality of first key frames and the digest dictionary of the first style comprises:
determining, based on the feature data of each first key frame in the plurality of first key frames, whether each first key frame matches the digest dictionary of the first style;
and determining a first key frame in the plurality of first key frames that matches the digest dictionary of the first style as a valid key frame corresponding to the first style.
5. The method according to claim 4, wherein determining whether each first key frame of the plurality of first key frames matches the digest dictionary of the first style based on the feature data of each first key frame comprises:
determining the first key frame as a valid key frame corresponding to the first style in response to the minimum distance between the feature data of the first key frame and the at least one clustering feature contained in the digest dictionary of the first style being less than or equal to a preset distance threshold; or
determining the first key frame as a valid key frame corresponding to the first style in response to a clustering feature that matches the feature data of the first key frame existing among the at least one clustering feature contained in the digest dictionary of the first style.
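Both alternatives of this matching test can be sketched with numpy as follows; the Euclidean metric in the first function and the caller-supplied match predicate in the second are assumptions, since the claim fixes neither.

```python
import numpy as np

def valid_by_min_distance(feat, digest_dictionary, threshold):
    # first alternative: the minimum distance from the frame's feature
    # data to any clustering feature is at most the preset threshold
    return np.linalg.norm(digest_dictionary - feat, axis=1).min() <= threshold

def valid_by_matching_cluster(feat, digest_dictionary, matches):
    # second alternative: some clustering feature "matches" the frame's
    # feature data under a caller-supplied predicate
    return any(matches(feat, c) for c in digest_dictionary)
```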
6. The method according to any one of claims 1 to 5, further comprising:
determining at least one valid key frame corresponding to a second style from the plurality of first key frames based on a digest dictionary of the second style, wherein the digest dictionary of the second style is different from the digest dictionary of the first style;
and generating the video summary of the second style based on the at least one valid key frame corresponding to the second style.
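Generating a second-style summary amounts to rerunning the same selection over the same first key frames with a different dictionary. A usage sketch building on valid_by_min_distance from the block after claim 5; key_frame_feats, the two dictionaries, and the threshold are placeholders.

```python
# one pass per style over the shared first key frames
summary_style_1 = [f for f in key_frame_feats
                   if valid_by_min_distance(f, dict_style_1, threshold=12.0)]
summary_style_2 = [f for f in key_frame_feats
                   if valid_by_min_distance(f, dict_style_2, threshold=12.0)]
```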
7. The method according to any one of claims 1 to 5, wherein generating the video summary of the first style based on the at least one valid key frame corresponding to the first style comprises:
sorting the at least one valid key frame corresponding to the first style according to its timing information to obtain the video summary of the first style.
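The ordering step is a plain sort by timestamp; a self-contained sketch, with the (timestamp, frame) pairing assumed as the representation of a valid key frame.

```python
# valid key frames as (timestamp_in_seconds, frame_id) pairs
valid = [(8.2, "kf_3"), (1.5, "kf_1"), (4.0, "kf_2")]
summary = [frame for _, frame in sorted(valid)]
print(summary)   # ['kf_1', 'kf_2', 'kf_3']
```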
8. The method according to any one of claims 1 to 5, wherein, prior to determining at least one valid key frame corresponding to the first style from the plurality of first key frames based on the digest dictionary of the first style, the method further comprises:
acquiring a plurality of sample images in a video sequence sample;
clustering the plurality of sample images to obtain at least one cluster corresponding to the first style;
and obtaining the digest dictionary of the first style based on the at least one cluster corresponding to the first style.
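One way to realize this dictionary construction (elaborated by claims 9 to 11 below): extract one feature vector per sample image, cluster the vectors, and keep the clusters attributed to the first style. In the sketch the features are stubbed with random vectors, and scikit-learn's KMeans, k=5, and the hand-picked style clusters are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
sample_feats = rng.normal(size=(200, 128))   # stub: features of the sample images

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(sample_feats)

# assume clusters 1 and 3 were attributed to the first style, e.g. because
# their members came from trailers gathered with that style's keywords
first_style_clusters = [1, 3]
members = {c: sample_feats[km.labels_ == c] for c in first_style_clusters}
```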
9. The method of claim 8, wherein the plurality of sample images are a plurality of second key frames;
and acquiring a plurality of sample images in a video sequence sample comprises:
performing shot segmentation processing on the video sequence sample to obtain a plurality of second shots;
and performing key frame extraction on the plurality of second shots to obtain the plurality of second key frames.
10. The method according to claim 8, wherein clustering the plurality of sample images to obtain at least one cluster corresponding to the first style comprises:
extracting feature data of each sample image of the plurality of sample images;
clustering the plurality of sample images based on the feature data of each sample image in the plurality of sample images to obtain a plurality of clusters;
and determining at least one cluster corresponding to the first style from the plurality of clusters.
11. The method of claim 8, wherein obtaining the digest dictionary of the first style based on the at least one cluster corresponding to the first style comprises:
averaging, for each cluster of the at least one cluster corresponding to the first style, the feature data of the plurality of sample images included in the cluster to obtain the clustering feature of the cluster, wherein the digest dictionary of the first style includes the clustering feature of each cluster of the at least one cluster corresponding to the first style.
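Continuing the sketch after claim 8, the averaging step is one line of numpy: the clustering feature of a cluster is the mean of its members' feature data, and the digest dictionary stacks one such mean per style cluster.

```python
import numpy as np

digest_dictionary = np.stack(
    [members[c].mean(axis=0) for c in first_style_clusters]
)
print(digest_dictionary.shape)   # (2, 128): one clustering feature per cluster
```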
12. The method of claim 8, further comprising:
obtaining at least one cluster corresponding to a second style based on a clustering result obtained by clustering the plurality of sample images;
and obtaining the digest dictionary of the second style based on the at least one cluster corresponding to the second style.
13. The method of claim 8, wherein the video sequence sample comprises a video trailer or a video highlight.
14. The method of claim 8, further comprising:
crawling the video sequence sample using style keywords.
15. A video summary generation apparatus, comprising:
a first acquisition module, configured to acquire a plurality of first key frames in a video sequence;
a first selection module, configured to determine at least one valid key frame corresponding to a first style from the plurality of first key frames based on a digest dictionary of the first style, wherein the digest dictionary of the first style comprises clustering feature data corresponding to the first style, the clustering feature data comprises at least one clustering feature, and the clustering feature is obtained based on feature data of a plurality of sample images included in a cluster corresponding to the first style;
and a first generation module, configured to generate the video summary of the first style based on the at least one valid key frame corresponding to the first style.
16. The apparatus of claim 15, wherein the first acquisition module comprises:
a first segmentation unit, configured to perform shot segmentation processing on the video sequence to obtain a plurality of first shots, each first shot comprising a plurality of image frames in the video sequence;
and a first extraction unit, configured to perform key frame extraction on the plurality of first shots to obtain the plurality of first key frames.
17. The apparatus of claim 15, wherein the first selection module comprises:
a first obtaining unit, configured to obtain feature data of each of the plurality of first key frames;
a first determining unit, configured to determine at least one valid key frame corresponding to the first style from the plurality of first key frames according to the feature data of each of the plurality of first key frames and the digest dictionary of the first style.
18. The apparatus of claim 17, wherein the first determining unit comprises:
a matching subunit, configured to determine, based on the feature data of each first key frame in the plurality of first key frames, whether each first key frame matches the digest dictionary of the first style;
and a determining subunit, configured to determine a first key frame in the plurality of first key frames that matches the digest dictionary of the first style as a valid key frame corresponding to the first style.
19. The apparatus of claim 18, wherein the determining subunit is configured to:
determine the first key frame as a valid key frame corresponding to the first style in response to the minimum distance between the feature data of the first key frame and the at least one clustering feature contained in the digest dictionary of the first style being less than or equal to a preset distance threshold; or
determine the first key frame as a valid key frame corresponding to the first style in response to a clustering feature that matches the feature data of the first key frame existing among the at least one clustering feature contained in the digest dictionary of the first style.
20. The apparatus of any one of claims 15 to 19, further comprising:
a second selection module, configured to determine at least one valid key frame corresponding to a second style from the plurality of first key frames based on a digest dictionary of the second style, wherein the digest dictionary of the second style is different from the digest dictionary of the first style;
and a second generation module, configured to generate the video summary of the second style based on the at least one valid key frame corresponding to the second style.
21. The apparatus of any one of claims 15 to 19, wherein the first generation module is configured to:
sort the at least one valid key frame corresponding to the first style according to its timing information to obtain the video summary of the first style.
22. The apparatus of any one of claims 15 to 19, further comprising:
a second acquisition module, configured to acquire a plurality of sample images in a video sequence sample;
a first clustering module, configured to cluster the plurality of sample images to obtain at least one cluster corresponding to the first style;
and a third generation module, configured to obtain the digest dictionary of the first style based on the at least one cluster corresponding to the first style.
23. The apparatus of claim 22, wherein the plurality of sample images are a plurality of second key frames;
and the second acquisition module comprises:
a second segmentation unit, configured to perform shot segmentation processing on the video sequence sample to obtain a plurality of second shots;
and a second extraction unit, configured to perform key frame extraction on the plurality of second shots to obtain the plurality of second key frames.
24. The apparatus of claim 22, wherein the first clustering module comprises:
a feature extraction unit configured to extract feature data of each of the plurality of sample images;
a clustering unit, configured to cluster the plurality of sample images based on the feature data of each sample image in the plurality of sample images to obtain a plurality of clusters;
a second determining unit, configured to determine at least one cluster corresponding to the first style from the plurality of clusters.
25. The apparatus of claim 22, wherein the third generation module is configured to:
average, for each cluster of the at least one cluster corresponding to the first style, the feature data of the plurality of sample images included in the cluster to obtain the clustering feature of the cluster, wherein the digest dictionary of the first style includes the clustering feature of each cluster of the at least one cluster corresponding to the first style.
26. The apparatus of claim 22, further comprising:
a second clustering module, configured to obtain at least one cluster corresponding to a second style based on a clustering result obtained by clustering the plurality of sample images;
and a fourth generation module, configured to obtain the digest dictionary of the second style based on the at least one cluster corresponding to the second style.
27. The apparatus of claim 22, wherein the video sequence sample comprises a video trailer or a video highlight.
28. The apparatus of claim 22, further comprising:
a crawling module, configured to crawl the video sequence sample using style keywords.
29. A computer-readable storage medium having computer program instructions stored thereon, wherein the program instructions, when executed by a processor, implement the video summary generation method of any one of claims 1 to 14.
30. An electronic device, comprising: a processor and a memory;
wherein the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the video summary generation method of any one of claims 1 to 14.
CN201811229502.6A 2018-10-22 2018-10-22 Video abstract generation method and device, storage medium and electronic equipment Active CN109348287B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811229502.6A CN109348287B (en) 2018-10-22 2018-10-22 Video abstract generation method and device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN109348287A CN109348287A (en) 2019-02-15
CN109348287B true CN109348287B (en) 2022-01-28

Family

ID=65310696


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110650379B (en) * 2019-09-26 2022-04-01 北京达佳互联信息技术有限公司 Video abstract generation method and device, electronic equipment and storage medium
CN110796088B (en) * 2019-10-30 2023-07-04 行吟信息科技(上海)有限公司 Video similarity judging method and device
CN113139468B (en) * 2021-04-24 2023-04-11 西安交通大学 Video abstract generation method fusing local target features and global features
CN115457432B (en) * 2022-08-25 2023-10-27 埃洛克航空科技(北京)有限公司 Data processing method and device for video frame extraction

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006095292A1 (en) * 2005-03-10 2006-09-14 Koninklijke Philips Electronics N.V. Summarization of audio and/or visual data
CN103024607A (en) * 2011-09-20 2013-04-03 三星电子株式会社 Method and apparatus for displaying summary video
CN106888407A (en) * 2017-03-28 2017-06-23 腾讯科技(深圳)有限公司 A kind of video abstraction generating method and device
CN106993240A (en) * 2017-03-14 2017-07-28 天津大学 Many video summarization methods based on sparse coding
CN107180074A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 A kind of video classification methods and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105611413B (en) * 2015-12-24 2018-10-02 小米科技有限责任公司 A kind of method and apparatus of addition video-frequency band category label




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant