WO2019007020A1 - Method and device for generating video summary - Google Patents

Method and device for generating video summary

Info

Publication number
WO2019007020A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
scene switching
video
scene
similarity
Prior art date
Application number
PCT/CN2018/072191
Other languages
French (fr)
Chinese (zh)
Inventor
Ge Leiming (葛雷鸣)
Original Assignee
Youku Network Technology (Beijing) Co., Ltd. (优酷网络技术(北京)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Youku Network Technology (Beijing) Co., Ltd.
Publication of WO2019007020A1 publication Critical patent/WO2019007020A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer

Definitions

  • the present application relates to the field of Internet technologies, and in particular, to a method and an apparatus for generating a video summary.
  • the video playing platform usually creates a corresponding video summary for the uploaded video.
  • the video summary may be a short duration video, and a part of the scene in the original video may be included in the video summary. In this way, the user can quickly understand the approximate content of the original video while viewing the video summary.
  • Video digests created in this way can more accurately characterize the information contained in the video, but as the number of videos grows rapidly, making video summaries this way takes a great deal of manpower, and the speed at which video summaries are produced is quite slow.
  • To improve efficiency, video summaries are currently often produced using image recognition technology.
  • the uploaded video may be sampled at a fixed time interval to extract a multi-frame image in the video.
  • the similarity between every two adjacent sampled frames can be calculated in turn, and frames with lower similarity can be retained, thereby ensuring that the retained image frames display the contents of multiple scenes.
  • the finally retained image frames can then be assembled into a video summary of the video.
  • An object of the embodiments of the present application is to provide a method and an apparatus for generating a video summary, which can accurately represent the theme of a video while improving efficiency.
  • an embodiment of the present application provides a method for generating a video summary, where the video has text description information. The method includes: extracting a plurality of scene switching frames from the video, and setting a scene label for each scene switching frame, wherein a similarity between two adjacent scene switching frames satisfies a specified condition; extracting a topic tag corresponding to the video from the text description information; and filtering out a target frame from the plurality of scene switching frames according to an association between the scene labels of the scene switching frames and the topic tag, and generating a video summary of the video based on the target frame.
  • an embodiment of the present application further provides a video summary generating apparatus, where the video has text description information
  • the apparatus includes: a scene switching frame extracting unit, configured to extract multiple scene switching frames from the video and set a scene label for each scene switching frame, wherein a similarity between two adjacent scene switching frames satisfies a specified condition; a topic label extracting unit, configured to extract a topic tag corresponding to the video from the text description information; and a video summary generating unit, configured to filter a target frame from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and to generate a video summary of the video based on the target frame.
  • the present application can first extract a scene switching frame whose similarity meets the specified condition from the video, and set a corresponding scene label for the scene switching frame.
  • the textual description of the video can then be combined to determine the subject tag of the video.
  • This topic tag accurately represents the subject of the video.
  • the target frame closely related to the topic can be retained from the scene switching frame. In this way, the video summary generated based on the target frame can thereby accurately characterize the subject content of the video.
  • FIG. 1 is a flowchart of a method for generating a video digest in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a target frame and a scene switching frame in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of extracting a scene switching frame according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of extracting a scene label in an embodiment of the present application.
  • FIG. 5 is a functional block diagram of a video summary generating apparatus in an embodiment of the present application.
  • the present application provides a method for generating a video summary, which can be applied to an electronic device having a data processing function.
  • the electronic device may be, for example, a desktop computer, a tablet computer, a notebook computer, a smart phone, a digital assistant, a smart wearable device, a shopping guide terminal, a television set with network access function, or the like.
  • the method can also be applied to software running in the above electronic device.
  • the software may be software having a video production function or a video playback function.
  • the method can also be applied to a server of a video playing website.
  • the video playing website may be, for example, iQiyi, Sohu video, Acfun, and the like.
  • the number of the servers is not specifically limited in the present embodiment.
  • the server may be a single server, several servers, or a server cluster formed by several servers.
  • the video summary may be generated based on video.
  • the video may be a video local to the user or a video uploaded by the user to the video playing website.
  • the video usually has text description information.
  • the text description information may be a title of the video or an introduction to the video.
  • the title and the introduction may be pre-edited by the video creator or the video uploader, or may be added by a staff member who reviews the video.
  • the present application is not limited in this respect.
  • the text description information may include a text label of the video or a descriptive phrase extracted from the barrage information of the video, in addition to the title and the introduction of the video.
  • the method for generating a video summary may include the following steps.
  • S1: Extract a plurality of scene switching frames from the video, and set a scene label for the scene switching frame, wherein a similarity between two adjacent scene switching frames satisfies a specified condition.
  • the video may be a video stored locally or a video stored in another device.
  • the manner in which the video is obtained may include loading the video locally according to a specified path or downloading the video according to a Uniform Resource Locator (URL) provided by another device.
  • each frame of the video may be analyzed to extract a plurality of scene switching frames therein.
  • the extraction may be performed by frame-by-frame comparison. Specifically, a reference frame may first be determined in the video, and the similarity between each frame after the reference frame and the reference frame may be sequentially calculated.
  • the reference frame may be a picture frame randomly specified within a certain range.
  • for example, the reference frame may be a picture frame randomly selected within the first 2 minutes of the video.
  • the first frame of the video may be used as the reference frame.
  • starting from the reference frame, each picture frame subsequent to the reference frame may be sequentially compared with the reference frame to calculate the similarity between each subsequent frame and the reference frame.
  • the first feature vector and the second feature vector of the reference frame and the current frame may be separately extracted.
  • the first feature vector and the second feature vector may have various forms.
  • the feature vector of the frame picture can be constructed based on the pixel value of the pixel point in each frame of the picture.
  • Each picture frame is usually composed of a plurality of pixel points arranged in a certain order, and each pixel point carries a pixel value, thereby forming a colorful picture.
  • the pixel value may be a value within a specified interval.
  • the pixel value may be any one of 0 to 255.
  • the size of the value can indicate the shade of the color.
  • the pixel value of each pixel in each picture frame can be acquired, and the feature vector of the picture frame is formed from the acquired pixel values.
  • specifically, the pixel values of the pixels may be acquired sequentially and arranged in order from left to right and top to bottom; for a picture frame of 9×9 pixels, for example, this forms an 81-dimensional vector.
  • the 81-dimensional vector can be used as the feature vector of the current picture frame.
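As an illustrative sketch of this flattening step (the 9×9 grid and its pixel values are hypothetical, chosen only so that the vector comes out 81-dimensional):

```python
def frame_feature_vector(frame):
    """Flatten a frame's pixel values row by row, left to right and
    top to bottom, into a single feature vector."""
    return [pixel for row in frame for pixel in row]

# Hypothetical 9x9 grayscale frame; each pixel value lies in [0, 255].
frame = [[(r * 9 + c) % 256 for c in range(9)] for r in range(9)]
vec = frame_feature_vector(frame)
assert len(vec) == 81  # 9 x 9 pixels -> an 81-dimensional vector
```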
  • the feature vector may also be a CNN (Convolutional Neural Network) feature of each frame of the picture.
  • the reference frame and each frame picture subsequent to the reference frame may be input into a convolutional neural network, and then the convolutional neural network may output the reference frame and the feature vector corresponding to each frame picture.
  • the first feature vector and the second feature vector may also be scale-invariant features representing the reference frame and the current frame, respectively.
  • even if the rotation angle, image brightness, or shooting angle of view of the picture changes, such scale-invariant features extracted as the first feature vector and the second feature vector can still reflect the contents of the reference frame and the current frame well.
  • for example, the first feature vector and the second feature vector may be SIFT (Scale-Invariant Feature Transform) features, SURF (Speeded-Up Robust Features) features, color histogram features, etc.
  • the similarity between the first feature vector and the second feature vector may be calculated.
  • the similarity may be expressed in the vector space as the distance between the two vectors: the closer the distance, the more similar the two vectors, and so the higher the similarity; the further the distance, the greater the difference between the two vectors, and so the lower the similarity. Therefore, when calculating the similarity between the reference frame and the current frame, the spatial distance between the first feature vector and the second feature vector may be calculated, and the reciprocal of the spatial distance taken as the similarity between the reference frame and the current frame.
  • the smaller the spatial distance, the greater the corresponding similarity, indicating that the reference frame and the current frame are more similar.
  • the larger the spatial distance, the smaller the corresponding similarity, indicating that the reference frame and the current frame are more dissimilar.
  • the similarity between each frame after the reference frame and the reference frame can be sequentially calculated in the above manner.
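The reciprocal-of-distance similarity described above might be sketched as follows (the Euclidean distance and the small epsilon guard against division by zero for identical frames are assumptions of this illustration):

```python
import math

def similarity(vec_a, vec_b, eps=1e-9):
    """Similarity taken as the reciprocal of the Euclidean distance
    between two feature vectors; eps avoids division by zero when the
    two frames are identical."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    return 1.0 / (dist + eps)

sim_close = similarity([10, 20, 30], [11, 20, 30])  # nearly identical frames
sim_far = similarity([10, 20, 30], [200, 5, 90])    # very different frames
assert sim_close > sim_far  # smaller distance -> larger similarity
```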
  • the content displayed in two frames with higher similarity is generally similar, while the main purpose of a video summary is to display the content of different scenes in the video to the user. Therefore, in the present embodiment, when the similarity between the reference frame and the current frame is less than or equal to a specified threshold, the current frame may be determined as a scene switching frame.
  • the specified threshold may be a preset value, and the value may be flexibly adjusted according to actual conditions. For example, when the number of scene switching frames filtered according to the specified threshold is excessive, the size of the specified threshold may be appropriately reduced.
  • conversely, when the number of scene switching frames filtered according to the specified threshold is too small, the size of the specified threshold may be appropriately increased.
  • a similarity less than or equal to the specified threshold may indicate that the contents of the two frames are already significantly different, i.e., the scene displayed by the current frame has changed relative to the scene displayed by the reference frame.
  • the current frame can therefore be retained as a scene switching frame.
  • subsequent scene switching frames may then be determined. Specifically, from the reference frame to the current frame, the scene may be considered to have changed once, so the current scene is the content displayed by the current frame. Based on this, the current frame may be used as a new reference frame, and the similarity between each frame after the new reference frame and the new reference frame may be sequentially calculated; the next scene switching frame is determined according to the calculated similarity. Similarly, when determining the next scene switching frame, the similarity between two frames can still be determined by extracting feature vectors and calculating the spatial distance, and the determined similarity can still be compared with the specified threshold, thereby determining the next scene switching frame in which the scene changes again after the new reference frame.
  • the scene switching frame may be used as a new reference frame, and the subsequent scene switching frame extraction process may be continued.
  • each frame of the scene in which the scene changes can be extracted, so that the scene displayed in the video is not missed, so as to ensure the completeness of the video summary.
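The reference-frame update loop above could be sketched roughly like this (the one-dimensional toy "frames" and the similarity function are purely illustrative assumptions, not the patent's actual feature extraction):

```python
def extract_scene_switch_frames(frames, threshold, sim):
    """Walk the video frame by frame: whenever the similarity to the
    current reference frame drops to or below the threshold, record a
    scene switching frame and make it the new reference frame."""
    if not frames:
        return []
    switches = []
    ref = frames[0]                # first frame serves as the initial reference
    for idx in range(1, len(frames)):
        if sim(ref, frames[idx]) <= threshold:
            switches.append(idx)   # scene changed at this frame
            ref = frames[idx]      # current frame becomes the new reference
    return switches

# Toy one-dimensional "frames": the value jumps at indices 3 and 6,
# mimicking two scene changes.
frames = [[0], [1], [2], [50], [51], [52], [120], [121]]
sim = lambda a, b: 1.0 / (1.0 + abs(a[0] - b[0]))
cuts = extract_scene_switch_frames(frames, threshold=0.1, sim=sim)
assert cuts == [3, 6]
```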
  • as illustrated, a rectangular strip filled with diagonal lines may represent a scene switching frame, and the similarity between two adjacent scene switching frames may be less than or equal to the specified threshold.
  • in this way, the similarity between any two adjacent scene switching frames is less than or equal to the specified threshold; therefore, the condition that the similarity between two adjacent scene switching frames satisfies the specified condition may mean that it is less than or equal to the specified threshold.
  • a scene label may be set for the scene switching frame.
  • the scene tag may be a text tag for characterizing content displayed in the scene switch frame. For example, if a scene switching frame shows that two people are fighting, the scene label corresponding to the scene switching frame may be “martial arts”, “fighting” or “Kung Fu”.
  • the content in the scene switching frame may be identified to determine a scene label corresponding to the scene switching frame.
  • features of the scene switching frame may be extracted, wherein the features may include at least one of a color feature, a texture feature, and a shape feature.
  • the color feature may be a feature extracted based on different color spaces.
  • the color space may include, for example, RGB (Red, Green, Blue) space, HSV (Hue, Saturation, Value) space, HSI (Hue, Saturation, Intensity) space, etc.
  • the R component, the G component, and the B component may be provided in the RGB space.
  • the color components will differ for different pictures.
  • the color components can be used to characterize the features of the scene switching frame.
  • the texture feature may be used to describe a material corresponding to the scene switching frame.
  • the texture features can generally be represented by a distribution of gray levels.
  • the texture features may correspond to low frequency components and high frequency components in the image spectrum.
  • the low frequency component and the high frequency component of the image contained in the scene switching frame can be used as features of the scene switching frame.
  • the shape features may include edge-based shape features and region-based shape features.
  • for example, a Fourier descriptor of the boundary may be used as the edge-based shape feature, and an invariant moment descriptor may be used as the region-based shape feature.
  • the extracted features may be compared with each feature sample in the feature sample library.
  • the feature sample library may be a sample set summarized based on historical data of image recognition.
  • feature samples representing different contents may be provided.
  • the feature sample may also be at least one of the color feature, the texture feature, and the shape feature described above.
  • the feature samples in the feature sample library may be associated with a text tag, and the text tag may be used to describe the display content corresponding to the feature sample.
  • the text label associated with the feature sample representing the soccer game may be "playing football”
  • the text label representing the feature sample of the dance may be "square dance”.
  • the extracted features and the feature samples in the feature sample library may all be represented by a vector form.
  • comparing the extracted features with each feature sample in the feature sample library may refer to calculating a distance between the feature and each feature sample. The closer the distance, the more similar the extracted features are to the feature samples.
  • target feature samples of the feature sample library that are most similar to the extracted features can be determined.
  • the distance calculated between the most similar target feature sample and the extracted feature may be the smallest.
  • the extracted feature is the most similar to the target feature sample, indicating that the content displayed by the two is also the most similar. Therefore, the text label associated with the target feature sample can be used as the scene label corresponding to the scene switching frame, thereby Each scene switching frame sets a corresponding scene label.
  • for example, the distances between the feature extracted from the scene switching frame and the feature samples in the feature sample library may be 0.8, 0.5, 0.95, and 0.6, respectively; the text label corresponding to the feature sample at distance 0.5 can then be used as the scene label of the scene switching frame.
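A nearest-sample lookup of this kind might look as follows; the sample library, its labels, and the three-dimensional feature vectors are hypothetical stand-ins for the real feature sample library:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scene_label(feature, sample_library):
    """Return the text label of the library sample whose feature vector
    is nearest to the extracted feature (smallest distance wins)."""
    return min(sample_library, key=lambda s: euclidean(feature, s[1]))[0]

# Hypothetical sample library: (text label, feature-sample vector) pairs.
library = [
    ("playing football", [0.9, 0.1, 0.0]),
    ("square dance",     [0.1, 0.8, 0.3]),
    ("martial arts",     [0.2, 0.2, 0.9]),
]
label = scene_label([0.15, 0.75, 0.25], library)
assert label == "square dance"  # closest sample in the library
```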
  • the text description information may more accurately indicate the subject of the video. Therefore, the topic tag corresponding to the video may be extracted from the text description information.
  • the video playing website can collect and summarize the text description information of a large number of videos, filter out the text labels that may serve as video topics, and form the selected text labels into a text label library.
  • the content in the text tag library can be continuously updated. In this way, when the topic tag is extracted from the text description information, the text description information may be matched with each text tag in the text tag library, and the matched text tag is used as the theme tag of the video.
  • for example, if the text description information of the video is “foreign guy and Chinese aunt dancing square dance, stunned everyone!”, then matching the text description information with each text label in the text label library can yield the matching result “square dance”. Therefore, “square dance” can be used as the topic tag of the video.
  • since the text description information of a video is usually long, at least two results may be matched when it is compared with the text labels in the text label library.
  • for example, if the text description information of the video is “foreign guy and Chinese aunt dancing square dance, stunned everyone!”, then matching it with each text label in the text label library may yield three matching results: “foreign guy”, “Chinese aunt”, and “square dance”.
  • in this case, all three matching results can be used as topic tags of the video at the same time.
  • a suitable topic tag can be selected from the matched multiple results.
  • each text label in the text label library may be associated with a statistical number, wherein the number of statistics may be used to represent the total number of times the text label is a topic label.
  • the greater the statistical count, the more often the corresponding text label has been used as the topic label of a video, and the higher its credibility as a topic label. Therefore, when the number of matched text labels is at least two, the matched text labels may be sorted in descending order of their statistical counts, and a specified number of top-ranked text labels in the ranking result are used as the topic tags of the video.
  • the specified number may be a predefined number of subject tags of the video.
  • for example, if the number of topic tags of a video is limited to at most two, the three matching results “foreign guy”, “Chinese aunt”, and “square dance” can be sorted by statistical count, and the top two, “Chinese aunt” and “square dance”, are taken as the topic tags of the video.
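Combining the matching and the statistics-based ranking, a rough sketch (the substring matching, the tag library, and its usage counts are illustrative assumptions):

```python
def topic_tags(description, tag_library, max_tags):
    """Match library tags against the description text, then keep the
    max_tags tags with the highest historical usage counts."""
    matched = [(tag, count) for tag, count in tag_library.items()
               if tag in description]
    matched.sort(key=lambda item: item[1], reverse=True)
    return [tag for tag, _ in matched[:max_tags]]

# Hypothetical tag library: tag -> times it has served as a topic tag.
library = {"square dance": 950, "foreign guy": 120, "Chinese aunt": 430}
desc = "foreign guy and Chinese aunt dancing square dance, stunned everyone!"
assert topic_tags(desc, library, max_tags=2) == ["square dance", "Chinese aunt"]
```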
  • S5: Filter the target frame from the plurality of scene switching frames according to the association between the scene label of the scene switching frame and the topic label, and generate a video summary of the video based on the target frame.
  • the target frame may be selected from the plurality of scene switching frames according to the association between the scene label of the scene switching frame and the theme tag.
  • the association between the scene tag and the topic tag may refer to the degree of similarity between the scene tag and the topic tag.
  • the manner of determining the association between the scene label and the topic label may include calculating a similarity between the scene label of each of the scene switching frames and the theme label.
  • the scene label and the topic tag may each consist of a word or phrase.
  • the scene label and the topic tag may each be represented by a word vector.
  • in this way, the similarity between the scene label and the topic tag can be represented by the spatial distance between the two word vectors.
  • a scene switching frame whose calculated similarity is greater than the specified similarity threshold may be determined as the target frame.
  • the specified similarity threshold may be used as a threshold for measuring whether the scene switching frame and the topic are sufficiently related.
  • when the similarity is greater than the specified similarity threshold, the current scene switching frame may be sufficiently associated with the video theme, and the content displayed by the scene switching frame can accurately reflect the subject of the video, so the scene switching frame can be determined as the target frame.
  • the target frames selected from the scene switching frames are all closely related to the theme of the video. Therefore, the video summary of the video may be generated based on the target frames. Specifically, the target frames may be arranged sequentially in the order in which they appear in the video, thereby forming a video summary of the video. Alternatively, if the logical continuity of content between preceding and following frames in the summary is not a concern, the target frames may be arranged randomly, and the resulting target frame sequence used as the video summary of the video.
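The threshold-based selection of target frames might be sketched as follows, using cosine similarity between hypothetical two-dimensional word vectors (both the vectors and the threshold value are assumptions of this illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical word vector for the topic tag, e.g. "dance".
topic_vec = [0.9, 0.1]
# (frame id, word vector of the frame's scene label)
switch_frames = [
    (0, [0.85, 0.2]),   # scene label close to the topic
    (1, [0.1, 0.95]),   # unrelated scene label
    (2, [0.8, 0.15]),
]
threshold = 0.9
targets = [fid for fid, vec in switch_frames if cosine(vec, topic_vec) > threshold]
assert targets == [0, 2]  # frame order in the video is preserved
```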
  • the scene label of the scene switching frame is generally set for the overall content of the scene switching frame, so the scene label cannot accurately reflect the local details in the scene switching frame.
  • the target object included in the scene switching frame may be identified, and the target frame may be filtered based on the identified target object.
  • the weighting coefficient may be set for the corresponding scene switching frame according to the calculated similarity. The higher the similarity between the scene label and the topic label, the larger the weight coefficient set for the corresponding scene switching frame.
  • the weighting factor can be a value between 0 and 1.
  • for example, the scene switching frame whose scene label is “dance” may be given a weight coefficient of 0.8, and the scene switching frame whose scene label is “Kung Fu” may be given a weight coefficient of 0.4.
  • the target object included in the scene switching frame can be identified.
  • for example, an AdaBoost algorithm, an R-CNN (Region-based Convolutional Neural Network) algorithm, or an SSD (Single Shot Detector) algorithm may be used to detect the target objects included in the scene switching frame.
  • for example, for a scene switching frame whose scene label is “dance”, the R-CNN algorithm may recognize that the scene switching frame includes two kinds of target objects: “woman” and “audio”.
  • the associated value may be set for the scene switching frame according to the association between the identified target objects and the topic tag.
  • the subject tag can be associated with at least one object.
  • the object may be an object that is more closely associated with the subject tag.
  • At least one object associated with the topic tag may be obtained by analyzing historical data. For example, when the topic tag is “beach”, the associated objects may include “sea water”, “beach”, “seagull”, “swimwear”, “sun umbrella”, and the like. In this way, the target objects identified in the scene switching frame can be compared with the at least one object, and the number of target objects appearing in the at least one object can be counted.
  • when the identified target objects are compared with the at least one object, it may be determined that the target objects appearing in the at least one object are “sun umbrella”, “beach”, and “sea water”; that is, the number of target objects appearing in the at least one object is three.
  • the product of the number of statistics and the specified value may be used as the associated value of the scene switching frame.
  • the specified value may be a preset value. For example, the specified value may be 10, and the associated value of the scene switching frame in the above example may be 30.
  • the greater the number of target objects appearing in the at least one object, the more closely the local details of the scene switching frame are associated with the video theme, and the higher the corresponding associated value.
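The associated-value computation can be sketched directly from the beach example above (the extra "dog" detection is a hypothetical addition used to show a non-matching object):

```python
def associated_value(detected_objects, topic_objects, unit=10):
    """Count the detected objects that also appear in the topic's
    associated-object list, then scale by the specified unit value."""
    hits = sum(1 for obj in detected_objects if obj in topic_objects)
    return hits * unit

# Associated objects for the topic tag "beach", per the example above.
beach_objects = {"sea water", "beach", "seagull", "swimwear", "sun umbrella"}
detected = ["sun umbrella", "beach", "sea water", "dog"]  # "dog" is hypothetical
assert associated_value(detected, beach_objects) == 30    # 3 hits x 10
```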
  • when determining the target frame, the determination may be made based on both the overall features and the local features of the scene switching frame. Specifically, the product of each scene switching frame's weight coefficient and associated value may be calculated, and a scene switching frame whose product is greater than a specified product threshold is determined as the target frame. Using the product as the basis for the judgment integrates the overall features and local features of the scene switching frame.
  • the specified product threshold may be a threshold for measuring whether a scene switching frame is a target frame. The specified product threshold can be flexibly adjusted in an actual application scenario.
  • the total number of picture frames (or the total duration) in the video digest may be limited in advance.
  • in this case, the scene switching frames may be sorted in descending order of the product of the weight coefficient and the associated value calculated in the above embodiment.
  • the specified total number of top-ranked picture frames in the ranking result may then be determined as the target frames.
  • for example, suppose the total number of frames allowed in the video digest is 1440, while the number of scene switching frames currently extracted from the video is 2000. The product of the weight coefficient and the associated value for each scene switching frame can be calculated in turn; after sorting the products from large to small, the top 1440 scene switching frames are used as the target frames, and these 1440 target frames constitute a video summary that meets the requirements.
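The budget-limited ranking described in this example might be sketched as follows (the four toy frames and their coefficients are illustrative; ids are restored to video order after ranking):

```python
def pick_top_frames(scored_frames, frame_budget):
    """Rank frames by weight_coefficient * associated_value and keep
    the top frame_budget, restoring original video order afterwards."""
    ranked = sorted(scored_frames, key=lambda f: f[1] * f[2], reverse=True)
    kept = ranked[:frame_budget]
    return sorted(fid for fid, _, _ in kept)

# (frame id, weight coefficient, associated value) -- all hypothetical.
frames = [(0, 0.8, 30), (1, 0.4, 10), (2, 0.9, 20), (3, 0.2, 40)]
assert pick_top_frames(frames, frame_budget=2) == [0, 2]  # products 24 and 18
```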
  • if the total number of scene switching frames is less than the specified total number of frames, all the currently selected scene switching frames together are not enough to constitute a video summary that meets the requirements.
  • a certain number of picture frames in the original video need to be inserted between the extracted scene switching frames, thereby achieving the requirement of the total number of frames defined by the video summary.
  • when picture frames from the original video are inserted, the insertion can be performed between two scene switching frames with a large scene jump, so that the continuity of the content is maintained.
  • at least one video frame in the video may be inserted between two adjacent scene switching frames whose similarity is less than the determination threshold.
  • the two adjacent scene switching frames whose similarity is smaller than the determination threshold may be regarded as two scene switching frames with weak content relevance.
  • specifically, picture frames from the original video may be inserted frame by frame between two weakly correlated scene switching frames until the total number of frames after inserting the at least one video frame equals the specified total number of frames. In this way, the original scene switching frames and the inserted picture frames as a whole can be used as the target frames, thereby constituting the video summary of the video.
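The frame-insertion step might be sketched like this (the frame indices, the weak pair, and the frame budget are hypothetical; "weak pairs" stands for adjacent scene switching frames whose similarity fell below the determination threshold):

```python
def pad_summary(switch_idxs, weak_pairs, frame_budget):
    """Fill the summary up to frame_budget by inserting original-video
    frames into the gaps between weakly related adjacent scene
    switching frames (the big scene jumps)."""
    frames = set(switch_idxs)
    # Candidate fill frames: every original frame inside a weak gap.
    candidates = [i for a, b in weak_pairs for i in range(a + 1, b)]
    for idx in candidates:
        if len(frames) >= frame_budget:
            break
        frames.add(idx)
    return sorted(frames)

switch_idxs = [0, 10, 20]
weak_pairs = [(0, 10)]   # the cut between frames 0 and 10 is a big jump
padded = pad_summary(switch_idxs, weak_pairs, frame_budget=5)
assert padded == [0, 1, 2, 10, 20]
```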
  • the number of the topic tags extracted from the text description information of the video may be at least two.
  • in this case, for each scene switching frame, the similarity between its scene label and each of the topic tags may be calculated. For example, if the current topic tags are tag 1 and tag 2, the similarity between the scene label of the current scene switching frame and each of tag 1 and tag 2 can be calculated separately, giving a first similarity and a second similarity corresponding to the current scene switching frame. After the respective similarities corresponding to a scene switching frame have been calculated, they may be accumulated to obtain a cumulative similarity corresponding to that scene switching frame.
  • the sum of the first similarity and the second similarity described above may be used as the cumulative similarity corresponding to the current scene switching frame.
  • Then, the cumulative similarity may be compared with a specified similarity threshold, and a scene switching frame whose cumulative similarity is greater than the specified similarity threshold may be determined as a target frame.
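The cumulative-similarity filtering in the preceding paragraphs might look like the sketch below. The names `select_target_frames` and `label_sim`, and the use of a plain dict mapping frame ids to scene labels, are illustrative assumptions rather than the disclosed implementation.

```python
def select_target_frames(frame_labels, topic_tags, label_sim, sim_threshold):
    """Keep the scene switching frames whose cumulative similarity to all
    topic tags exceeds the specified similarity threshold.

    frame_labels: {frame_id: scene_label} for each scene switching frame.
    topic_tags: topic tags extracted from the video's text description.
    label_sim(a, b): similarity (0..1) between two labels.
    """
    targets = []
    for frame_id, scene_label in frame_labels.items():
        # Accumulate this frame's similarity to every topic tag.
        cumulative = sum(label_sim(scene_label, tag) for tag in topic_tags)
        if cumulative > sim_threshold:
            targets.append(frame_id)
    return targets
```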
  • the present application further provides a device for generating a video summary, where the video has text description information, and the device includes:
  • the scene switching frame extraction unit 100 is configured to extract a plurality of scene switching frames from the video, and set a scene label for the scene switching frame, where the similarity between two adjacent scene switching frames satisfies a specified condition;
  • the topic tag extracting unit 200 is configured to extract, from the text description information, a topic tag corresponding to the video;
  • the video summary generating unit 300 is configured to filter target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the theme label, and generate a video summary of the video based on the target frames.
  • the scene switching frame extraction unit 100 includes:
  • a similarity calculation module configured to determine a reference frame in the video, and sequentially calculate a similarity between the frame after the reference frame and the reference frame;
  • a scene switching frame determining module configured to determine the current frame as a scene switching frame when the similarity between the reference frame and the current frame is less than or equal to a specified threshold;
  • a loop execution module configured to use the current frame as a new reference frame, and sequentially calculate the similarity between each frame after the new reference frame and the new reference frame, so as to determine the next scene switching frame according to the calculated similarity.
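The reference-frame loop performed by these modules can be sketched as below. The function name and the `similarity` callback are assumptions; a real implementation would compare feature vectors as described later in this document.

```python
def extract_scene_frames(frames, similarity, threshold):
    """Walk the video once: whenever a frame's similarity to the current
    reference frame drops to `threshold` or below, record it as a scene
    switching frame and make it the new reference frame."""
    if not frames:
        return []
    switches = []
    ref = frames[0]  # the first frame serves as the initial reference frame
    for frame in frames[1:]:
        if similarity(ref, frame) <= threshold:
            switches.append(frame)
            ref = frame  # the scene changed: compare later frames against it
    return switches
```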
  • the scene switching frame extraction unit 100 includes:
  • a feature extraction module configured to extract features of the scene switching frame, the features including at least one of a color feature, a texture feature, and a shape feature;
  • a comparison module configured to compare the extracted features with the feature samples in the feature sample library, wherein the feature samples in the feature sample library are all associated with a text label;
  • the target feature sample determining module is configured to determine a target feature sample that is most similar to the extracted feature in the feature sample library, and use a text tag associated with the target feature sample as a scene tag corresponding to the scene switching frame.
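A minimal sketch of matching an extracted feature against the feature sample library follows. Euclidean distance over plain Python lists is an assumption here; the disclosure does not fix a particular distance measure.

```python
import math

def label_scene_frame(frame_feature, sample_library):
    """Return the text label of the feature sample nearest to the frame's
    extracted feature. sample_library: list of (feature_vector, text_label)."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    # The nearest sample's associated text label becomes the scene label.
    _, best_label = min(sample_library, key=lambda s: dist(frame_feature, s[0]))
    return best_label
```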
  • the video summary generating unit 300 includes:
  • a similarity calculation module configured to calculate the similarity between the scene label of the scene switching frame and the theme label;
  • a weight coefficient setting module configured to set a weight coefficient for the corresponding scene switching frame according to the calculated similarity;
  • a correlation value setting module configured to identify a target object included in the scene switching frame, and set an association value for the scene switching frame according to the identified association between the target object and the theme label;
  • a target frame determining module configured to calculate a product of a weight coefficient of the scene switching frame and an associated value, and determine a scene switching frame whose product is greater than a specified product threshold as the target frame.
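The product-threshold filtering performed by the target frame determining module can be sketched as follows; the function name and the dict-based inputs are illustrative assumptions.

```python
def filter_by_product(frames, weight, assoc, product_threshold):
    """Keep frames whose (weight coefficient x association value) product
    exceeds the specified product threshold.

    weight[f]: coefficient set from the scene-label/theme-label similarity.
    assoc[f]: association value set from the recognized target objects.
    """
    return [f for f in frames if weight[f] * assoc[f] > product_threshold]
```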
  • the application can be described in the general context of computer-executable instructions executed by a computer, such as a program module.
  • program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the present application can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are connected through a communication network.
  • program modules can be located in both local and remote computer storage media including storage devices.
  • In addition to implementing the device purely as computer readable program code, it is entirely possible to implement the same functions in hardware by logically programming the method steps by means of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like.
  • Such a device may therefore be considered a hardware component, and the means for implementing the various functions included therein may also be considered as structures within the hardware component.
  • A device for implementing various functions can even be considered both as a software module implementing a method and as a structure within a hardware component.
  • the present application can first extract a scene switching frame whose similarity meets the specified condition from the video, and set a corresponding scene label for the scene switching frame.
  • the textual description of the video can then be combined to determine the subject tag of the video.
  • The topic tag can accurately represent the subject of the video.
  • Then, by determining the association between the scene labels and the topic tag, the target frames closely related to the topic can be retained from the scene switching frames. In this way, the video summary generated based on the target frames can accurately characterize the subject content of the video.
  • PLD: Programmable Logic Device
  • FPGA: Field Programmable Gate Array
  • HDL: Hardware Description Language
  • The present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments of the present application or in portions of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of the present application disclose a method and device for generating a video summary, where the video has text description information. The method comprises: extracting a plurality of scene switching frames from the video, and setting scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames meets a designated condition; extracting a theme label corresponding to the video from the text description information; and selecting target frames from the plurality of scene switching frames according to the correlation between the scene labels of the scene switching frames and the theme label, and generating a video summary of the video based on the target frames. According to the technical solution provided by the present application, efficiency can be improved while the theme of the video is precisely characterized.

Description

Method and device for generating video summary
This application claims priority to Chinese Patent Application No. 201710541793.1, filed on July 5, 2017 and entitled "Method and Device for Generating Video Summary", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of Internet technologies, and in particular, to a method and an apparatus for generating a video summary.
Background
Currently, in order to let users learn the content of a video in a short time, a video playing platform usually creates a corresponding video summary for an uploaded video. The video summary may be a video of short duration that contains some of the scenes of the original video. In this way, by watching the video summary, a user can quickly understand the approximate content of the original video.
At present, one way to create a video summary is manual editing: staff of the video playing platform watch the entire video and then clip out the more critical segments to form the video summary. Video summaries created in this way can characterize the information contained in the video fairly accurately, but as the number of videos grows rapidly, this approach consumes considerable manpower, and the speed at which video summaries are produced is quite slow.
In view of this, in order to save manpower and improve the efficiency of video summary production, video summaries are currently often produced by image recognition techniques. Specifically, the uploaded video may be sampled at fixed time intervals to extract multiple image frames from the video. The similarity between each pair of adjacent frames can then be calculated in turn, and pairs of frames with low similarity can be retained, ensuring that the retained image frames display the content of multiple scenes. The image frames finally retained in this way constitute the video summary of the video.
Although the prior-art method of creating a video summary by image recognition improves production efficiency, selecting image frames by fixed sampling and similarity comparison easily misses key scenes in the video, so that the generated video summary cannot accurately reflect the subject of the video.
Summary
An object of the embodiments of the present application is to provide a method and an apparatus for generating a video summary, which can accurately represent the theme of a video while improving efficiency.
To achieve the above objective, an embodiment of the present application provides a method for generating a video summary, where the video has text description information. The method includes: extracting a plurality of scene switching frames from the video, and setting scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition; extracting a topic tag corresponding to the video from the text description information; and filtering target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and generating a video summary of the video based on the target frames.
To achieve the above objective, an embodiment of the present application further provides an apparatus for generating a video summary, where the video has text description information. The apparatus includes: a scene switching frame extraction unit, configured to extract a plurality of scene switching frames from the video and set scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition; a topic tag extraction unit, configured to extract a topic tag corresponding to the video from the text description information; and a video summary generation unit, configured to filter target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and generate a video summary of the video based on the target frames.
As can be seen from the above, the present application can first extract, from the video, scene switching frames whose similarity satisfies a specified condition, and set corresponding scene labels for the scene switching frames. The topic tag of the video can then be determined from the text description information of the video; the topic tag can accurately represent the theme of the video. Then, by determining the association between the scene labels and the topic tag, the target frames closely related to the theme can be retained from the scene switching frames. In this way, the video summary generated based on the target frames can accurately characterize the subject content of the video.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments described in the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for generating a video summary in an embodiment of the present application;
FIG. 2 is a schematic diagram of target frames and scene switching frames in an embodiment of the present application;
FIG. 3 is a schematic diagram of the extraction of scene switching frames in an embodiment of the present application;
FIG. 4 is a schematic diagram of the extraction of scene labels in an embodiment of the present application;
FIG. 5 is a functional block diagram of an apparatus for generating a video summary in an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
The present application provides a method for generating a video summary, which can be applied to an electronic device with data processing functions. The electronic device may be, for example, a desktop computer, a tablet computer, a notebook computer, a smartphone, a digital assistant, a smart wearable device, a shopping guide terminal, or a television with network access. The method may also be applied to software running on the above electronic devices, such as software with video production or video playback functions. In addition, the method may be applied to a server of a video playing website, such as iQiyi, Sohu Video, or AcFun. The number of servers is not specifically limited in this embodiment: there may be a single server, several servers, or a server cluster formed by several servers.
In this embodiment, the video summary may be generated based on a video. The video may be stored locally by the user, or uploaded by the user to a video playing website. The video usually has text description information, which may be the title of the video or an introduction to it. The title and the introduction may be edited in advance by the video creator or uploader, or added by staff who review the video; the present application is not limited in this respect. Of course, in practical applications, in addition to the title and introduction of the video, the text description information may also include text labels of the video or descriptive phrases extracted from the video's bullet-screen comments.
Referring to FIG. 1 and FIG. 2, the method for generating a video summary provided by the present application may include the following steps.
S1: Extract a plurality of scene switching frames from the video, and set scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition.
In this embodiment, the video may be stored locally or on another device. Accordingly, the video may be obtained by loading it locally according to a specified path, or by downloading it according to a Uniform Resource Locator (URL) provided by the other device.
In this embodiment, after the video is obtained, each frame of the video may be analyzed to extract a plurality of scene switching frames. In order to obtain the scene switching frame corresponding to each scene of the video, the extraction may be performed in this embodiment by frame-by-frame comparison. Specifically, a reference frame may first be determined in the video, and the similarity between each frame after the reference frame and the reference frame may be calculated in turn.
In this embodiment, the reference frame may be a frame randomly selected within a certain range. For example, the reference frame may be a frame randomly selected within the first 2 minutes of the video. Of course, in order not to miss any scene in the video, the first frame of the video may be used as the reference frame.
In this embodiment, after the reference frame is determined, each frame after the reference frame may be compared with the reference frame in turn, starting from the reference frame, to calculate the similarity between each subsequent frame and the reference frame. Specifically, when calculating the similarity between a frame and the reference frame, a first feature vector of the reference frame and a second feature vector of the current frame may be extracted respectively.
In this embodiment, the first feature vector and the second feature vector may take various forms. A feature vector of a frame may be constructed based on the pixel values of the pixels in that frame. Each frame is usually composed of a number of pixels arranged in a certain order, each pixel corresponding to its own pixel value, which together form a colorful picture. A pixel value may be a number within a specified interval, for example, any value from 0 to 255, where the magnitude of the value indicates the depth of the color. In this embodiment, the pixel value of each pixel in a frame may be obtained, and the obtained pixel values may form the feature vector of the frame. For example, for a current frame with 9*9 = 81 pixels, the pixel values may be read in order from left to right and top to bottom and arranged into an 81-dimensional vector, which serves as the feature vector of the current frame.
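The row-major flattening described here (e.g. a 9*9 frame becoming an 81-dimensional vector) can be sketched as follows; the function name is an illustrative assumption.

```python
def pixel_feature_vector(frame):
    """Flatten a frame's pixel values row by row (left-to-right,
    top-to-bottom) into a 1-D feature vector, so a 9x9 frame yields
    an 81-dimensional vector. frame: 2-D list of pixel values (0..255)."""
    return [value for row in frame for value in row]
```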
In this embodiment, the feature vector may also be a CNN (Convolutional Neural Network) feature of each frame. Specifically, the reference frame and each frame after it may be input into a convolutional neural network, and the convolutional neural network can then output the feature vectors corresponding to the reference frame and the other frames.
In this embodiment, in order to accurately characterize the content displayed in the reference frame and the current frame, the first feature vector and the second feature vector may also represent scale-invariant features of the reference frame and the current frame, respectively. In this way, even if the rotation angle, brightness, or shooting angle of the image changes, the extracted first and second feature vectors can still represent the content of the reference frame and the current frame well. Specifically, the first feature vector and the second feature vector may be SIFT (Scale-Invariant Feature Transform) features, SURF (Speeded-Up Robust Features) features, or color histogram features.
In this embodiment, after the first feature vector and the second feature vector are determined, the similarity between them may be calculated. Specifically, the similarity may be expressed in vector space as the distance between the two vectors: the closer the distance, the more similar the two vectors, and hence the higher the similarity; the farther the distance, the greater the difference between the two vectors, and hence the lower the similarity. Therefore, when calculating the similarity between the reference frame and the current frame, the spatial distance between the first feature vector and the second feature vector may be calculated, and the reciprocal of the spatial distance may be used as the similarity between the reference frame and the current frame. In this way, the smaller the spatial distance, the greater the corresponding similarity, indicating that the reference frame and the current frame are more similar; conversely, the larger the spatial distance, the smaller the corresponding similarity, indicating that the reference frame and the current frame are less similar.
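The reciprocal-of-distance similarity described above can be sketched as follows, assuming Euclidean distance between the two feature vectors; treating identical vectors (distance 0) as infinitely similar is an assumption made here for an edge case the text does not address.

```python
import math

def frame_similarity(vec_a, vec_b):
    """Similarity between two frames as the reciprocal of the Euclidean
    distance between their feature vectors."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    return float("inf") if d == 0 else 1.0 / d
```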
In this embodiment, the similarity between each frame after the reference frame and the reference frame can be calculated in turn in the above manner. The content shown in two frames with high similarity is usually also quite similar, while the purpose of a video summary is to show the user the content of different scenes in the video. Therefore, in this embodiment, when the similarity between the reference frame and the current frame is less than or equal to a specified threshold, the current frame may be determined as a scene switching frame. The specified threshold may be a preset value that can be adjusted flexibly according to the actual situation. For example, when too many scene switching frames are selected according to the specified threshold, the threshold may be reduced appropriately; when too few are selected, the threshold may be increased appropriately. In this embodiment, a similarity less than or equal to the specified threshold may indicate that the content of the two frames is already clearly different, so the scene shown in the current frame can be considered to have changed from the scene shown in the reference frame. At this point, the current frame may be retained as a frame at which the scene switches.
In this embodiment, when the current frame is determined as a scene switching frame, subsequent scene switching frames may be determined in the same way. Specifically, from the reference frame to the current frame, the scene can be considered to have changed once, so the current scene is the content shown in the current frame. Based on this, the current frame may be used as a new reference frame, and the similarity between each frame after the new reference frame and the new reference frame may be calculated in turn, to determine the next scene switching frame according to the calculated similarity. Likewise, when determining the next scene switching frame, the similarity between two frames can still be determined by extracting feature vectors and calculating the spatial distance, and the determined similarity can still be compared with the specified threshold, thereby determining the next scene switching frame at which the scene changes again after the new reference frame.
Referring to FIG. 3, in this embodiment, after the next scene switching frame is determined, that scene switching frame may be used as a new reference frame, and the extraction of subsequent scene switching frames continues. In this way, by changing the reference frame in turn, each frame at which the scene changes can be extracted from the video, so that no scene shown in the video is missed and the completeness of the video summary is ensured. In FIG. 3, the rectangular bars filled with diagonal lines represent scene switching frames, and the similarity between any two adjacent scene switching frames is less than or equal to the specified threshold.
In this embodiment, among the scene switching frames extracted in the above manner, the similarity between any two adjacent scene switching frames is less than or equal to the specified threshold. Therefore, the statement that the similarity between two adjacent scene switching frames satisfies a specified condition may mean that the similarity between two adjacent scene switching frames is less than or equal to the specified threshold.
In this embodiment, after the plurality of scene switching frames are extracted, scene labels may be set for the scene switching frames. A scene label may be a text label used to characterize the content shown in a scene switching frame. For example, if a scene switching frame shows two people fighting, the scene label corresponding to that frame may be "martial arts", "fighting", or "kung fu".
In this embodiment, the content in a scene switching frame may be recognized to determine the scene label corresponding to the scene switching frame. Specifically, features of the scene switching frame may be extracted, where the features may include at least one of a color feature, a texture feature, and a shape feature. The color feature may be a feature extracted based on different color spaces, for example the RGB (Red, Green, Blue) space, the HSV (Hue, Saturation, Value) space, or the HSI (Hue, Saturation, Intensity) space. Each color space has multiple color components; for example, the RGB space has an R component, a G component, and a B component. Different pictures have different color components, so the color components can be used to characterize the features of a scene switching frame.
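As an illustrative example of a color feature of the kind described (not taken from the disclosure), a simple per-channel histogram over RGB pixels might be computed as follows; the function name and the default bin count are assumptions.

```python
def color_histogram(pixels, bins=4):
    """Build a per-channel histogram over RGB pixels as a color feature.
    pixels: list of (r, g, b) tuples with values 0..255."""
    hist = [[0] * bins for _ in range(3)]  # one histogram per channel
    step = 256 // bins
    for pixel in pixels:
        for ch in range(3):
            hist[ch][min(pixel[ch] // step, bins - 1)] += 1
    # Concatenate the three channel histograms into one feature vector.
    return [count for channel in hist for count in channel]
```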
In addition, the texture feature may be used to describe the material appearing in the scene switching frame. Texture is usually reflected by the distribution of gray levels, and corresponds to the low-frequency and high-frequency components of the image spectrum. The low-frequency and high-frequency components of the image contained in a scene switching frame can therefore serve as features of that frame.
In this embodiment, the shape features may include edge-based shape features and region-based shape features. Specifically, a Fourier descriptor of the boundary may serve as the edge-based shape feature, and an invariant moment descriptor may serve as the region-based shape feature.
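Of the three feature families above, the color feature is the simplest to make concrete. The following is a minimal sketch of a quantized color histogram over raw RGB pixels; the bin count and the per-channel normalization are illustrative choices, not specified by the application.

```python
def color_histogram(pixels, bins=4):
    """Quantize each RGB channel into `bins` buckets and count pixels.
    Returns a flat histogram of length 3 * bins, normalized so each
    channel's buckets sum to 1 -- a minimal stand-in for the color
    feature described above."""
    hist = [0] * (3 * bins)
    for r, g, b in pixels:
        for channel, value in enumerate((r, g, b)):
            bucket = min(value * bins // 256, bins - 1)
            hist[channel * bins + bucket] += 1
    n = len(pixels)
    return [count / n for count in hist]

# Two reddish pixels and two bluish pixels.
pixels = [(255, 0, 0), (250, 5, 3), (0, 0, 255), (2, 1, 250)]
print(color_histogram(pixels))
```

The resulting vector is what gets compared against the feature sample library in the next step; texture and shape features would be concatenated onto it in the same vector form.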
Referring to FIG. 4, in this embodiment, after the features of each scene switching frame are extracted, the extracted features may be compared with the feature samples in a feature sample library. The feature sample library may be a sample set summarized from historical image recognition data, containing feature samples that characterize different contents; each feature sample may likewise be at least one of the color, texture, and shape features described above. For example, the library may contain feature samples characterizing playing football, characterizing dancing, characterizing fighting, and so on. Each feature sample in the library may be associated with a text label describing the displayed content that the sample corresponds to. For example, the text label associated with a feature sample characterizing playing football may be "playing football", and the text label of a feature sample characterizing dancing may be "square dance".
In this embodiment, both the extracted features and the feature samples in the library may be represented as vectors. Comparing the extracted features with the feature samples may then mean computing the distance between the extracted feature and each feature sample: the smaller the distance, the more similar the two are. In this way, the target feature sample in the library that is most similar to the extracted feature, i.e. the one at the smallest computed distance, can be determined. Since the extracted feature is most similar to the target feature sample, the contents they represent are also the most similar; the text label associated with the target feature sample can therefore be taken as the scene label of the scene switching frame, so that each scene switching frame is given a corresponding scene label.
As shown in FIG. 4, the distances between the feature extracted from a scene switching frame and the feature samples in the library may be 0.8, 0.5, 0.95, and 0.6, respectively; the text label of the feature sample at distance 0.5 can then be taken as the scene label of that scene switching frame.
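The nearest-sample selection in FIG. 4 reduces to an argmin over distances. A minimal sketch follows; the specific label names are hypothetical stand-ins for the library's contents.

```python
def nearest_scene_label(distances_to_samples):
    """Given {text_label: distance} for every sample in the feature
    sample library, return the label of the closest (most similar)
    feature sample."""
    return min(distances_to_samples, key=distances_to_samples.get)

# Distances from the FIG. 4 example: the 0.5 sample wins.
distances = {"playing football": 0.8, "square dance": 0.5,
             "martial arts": 0.95, "seaside": 0.6}
print(nearest_scene_label(distances))  # -> square dance
```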
S3: Extract a topic tag corresponding to the video from the text description information.
In this embodiment, the text description information can indicate the subject of the video fairly precisely, so the topic tag corresponding to the video may be extracted from it. Specifically, a video playing website may summarize the text description information of a large number of videos, filter out the text labels that could serve as video topics, and build the filtered labels into a text label library whose content can be continuously updated. When extracting a topic tag from the text description information, the description may be matched against the text labels in the library, and a matched label is taken as the topic tag of the video. For example, if the description reads "A foreign guy and a Chinese aunt dance a square dance, stunning everyone!", matching it against the library may yield "square dance", which can then serve as the topic tag of the video.
It should be noted that, because the text description information of a video is usually fairly long, matching it against the text label library may yield at least two results. For example, the description "A foreign guy and a Chinese aunt dance a square dance, stunning everyone!" may match "foreign guy", "Chinese aunt", and "square dance". On one hand, all three matches can be used together as topic tags of the video. On the other hand, when the number of topic tags is limited, suitable tags may be selected from the matches. Specifically, in this embodiment each text label in the library may be associated with a statistic count representing the total number of times the label has served as a topic tag. The larger the count, the more often the corresponding label has been used as a video topic tag, and the more credible it is as one. Therefore, when at least two text labels are matched, they may be sorted in descending order of their statistic counts, and the top specified number of labels in the sorted result are taken as the topic tags of the video, where the specified number may be a predefined limit on the number of topic tags. For example, if the video is limited to at most two topic tags, the three matches "foreign guy", "Chinese aunt", and "square dance" may be sorted by their counts, and the top two, say "Chinese aunt" and "square dance", are taken as the topic tags of the video.
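The match-then-rank step above can be sketched as follows. Substring matching and the particular counts are illustrative assumptions; the application does not prescribe a matching scheme.

```python
def extract_topic_tags(description, tag_counts, max_tags):
    """Match the description against the tag library (simple substring
    matching here), then keep the `max_tags` tags with the highest
    historical usage counts."""
    matched = [tag for tag in tag_counts if tag in description]
    matched.sort(key=lambda tag: tag_counts[tag], reverse=True)
    return matched[:max_tags]

tag_counts = {"square dance": 900, "Chinese aunt": 700, "foreign guy": 300}
description = ("A foreign guy and a Chinese aunt dance a square dance, "
               "stunning everyone!")
print(extract_topic_tags(description, tag_counts, 2))
# -> ['square dance', 'Chinese aunt']
```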
S5: Select target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and generate a video summary of the video based on the target frames.
In this embodiment, a video may contain many scenes, but not every scene switching frame is closely related to the subject of the video. So that the generated video summary accurately reflects that subject, target frames may be selected from the plurality of scene switching frames according to the association between each frame's scene label and the topic tag.
In this embodiment, the association between a scene label and the topic tag may refer to their degree of similarity: the more similar they are, the more relevant the content of the scene switching frame is to the subject of the video. Specifically, determining the association may include computing the similarity between the scene label of each scene switching frame and the topic tag. In practice, both the scene label and the topic tag are words, and their similarity can be computed by representing each as a word vector and measuring the spatial distance between the two vectors: the closer the vectors, the higher the similarity between the scene label and the topic tag; conversely, the farther apart the vectors, the lower the similarity. In a practical application scenario, the reciprocal of the spatial distance between the two word vectors may therefore be used as the similarity between the scene label and the topic tag.
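The reciprocal-distance similarity can be written down directly. The 2-D "word vectors" below are toy stand-ins; real word vectors would come from a trained embedding model.

```python
import math

def tag_similarity(vec_a, vec_b):
    """Similarity between a scene label and a topic tag, taken as the
    reciprocal of the Euclidean distance between their word vectors."""
    distance = math.dist(vec_a, vec_b)
    return float("inf") if distance == 0 else 1.0 / distance

dance = (0.9, 0.1)         # hypothetical word vectors
square_dance = (0.8, 0.2)
kung_fu = (0.1, 0.9)

# "dance" sits much closer to "square dance" than "kung fu" does.
print(tag_similarity(dance, square_dance) > tag_similarity(kung_fu, square_dance))
# -> True
```

The zero-distance guard handles identical vectors, where the reciprocal would otherwise divide by zero.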
In this embodiment, after the similarity between a scene label and the topic tag is computed, a scene switching frame whose computed similarity exceeds a specified similarity threshold may be determined as a target frame. The specified similarity threshold serves as the bar for whether a scene switching frame is sufficiently related to the subject: when the similarity exceeds the threshold, the frame is sufficiently related to the subject of the video and its content accurately reflects that subject, so the frame can be determined as a target frame.
In this embodiment, the target frames selected from the scene switching frames are all closely related to the subject of the video, so a video summary of the video can be generated from them. Specifically, the summary may be generated by arranging the target frames in the order in which they appear in the video. Alternatively, since the frames of a summary need not follow the normal logic of consecutive content, the target frames may be arranged randomly, and the resulting sequence used as the video summary.
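The chronological variant of summary assembly is a one-line sort. Frames are assumed to carry their original position in the video; the (index, data) pairing is an illustrative representation.

```python
def build_summary(target_frames):
    """Arrange target frames by their position in the source video;
    each frame is an (index_in_video, frame_data) pair."""
    return [frame for _, frame in sorted(target_frames)]

targets = [(120, "B"), (30, "A"), (410, "C")]
print(build_summary(targets))  # -> ['A', 'B', 'C']
```

For the random variant mentioned above, `random.shuffle` over the same list would replace the sort.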
In one embodiment of the present application, because the scene label of a scene switching frame is usually set for the frame's overall content, the scene label cannot accurately reflect local details within the frame. To further improve the relevance of the target frames to the subject of the video, the target objects contained in the scene switching frames may be recognized, and target frames selected on the basis of the recognized objects. Specifically, after the similarity between each scene label and the topic tag is computed, a weight coefficient may be set for the corresponding scene switching frame according to that similarity: the higher the similarity, the larger the weight coefficient, which may be a value between 0 and 1. For example, if the topic tag of the current video is "square dance", then for two scene switching frames labelled "dance" and "kung fu", the frame labelled "dance" may be given a weight coefficient of 0.8, and the frame labelled "kung fu" a weight coefficient of 0.4.
In this embodiment, after the weight coefficients are set for the scene switching frames, the target objects contained in each frame may be recognized, for example using the AdaBoost algorithm, the R-CNN (Region-based Convolutional Neural Network) algorithm, or the SSD (Single Shot Detector) algorithm. For example, for a scene switching frame labelled "dance", the R-CNN algorithm may recognize two kinds of target objects in the frame: "woman" and "loudspeaker". After the target objects in each scene switching frame are recognized, a relevance value may be set for the frame according to the association between the recognized target objects and the topic tag. Specifically, the topic tag may be associated with at least one object, namely objects closely tied to the tag, which may be obtained by analyzing historical data. For example, when the topic tag is "beach", its associated objects may include "sea water", "beach", "seagull", "swimsuit", "parasol", and so on. The target objects recognized in the scene switching frame can then be compared with these associated objects, and the number of recognized objects that appear among them counted. For instance, for the topic tag "beach", suppose the recognized target objects are "parasol", "car", "beach", "trees", and "sea water"; comparing them with the associated objects shows that "parasol", "beach", and "sea water" appear among them, i.e. the count is 3. In this embodiment, the product of this count and a specified value may be taken as the relevance value of the scene switching frame. The specified value may be preset; for example, if it is 10, the relevance value in the example above is 30. Thus, the more recognized target objects that appear among the associated objects, the more closely the local details of the frame are tied to the subject of the video, and the higher the relevance value.
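The relevance value computation is a set intersection scaled by the specified value. The object lists below reuse the "beach" example; the unit value of 10 follows the text.

```python
def relevance_value(detected_objects, tag_objects, unit=10):
    """Count how many detected objects appear among the objects
    associated with the topic tag, multiplied by a specified unit
    value, as described for the "beach" example."""
    return len(set(detected_objects) & set(tag_objects)) * unit

beach_objects = {"sea water", "beach", "seagull", "swimsuit", "parasol"}
detected = ["parasol", "car", "beach", "trees", "sea water"]
print(relevance_value(detected, beach_objects))  # -> 30
```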
In this embodiment, a target frame may be determined from both the overall and the local features of a scene switching frame. Specifically, the product of each frame's weight coefficient and relevance value may be computed, and frames whose product exceeds a specified product threshold are determined as target frames. Using the product as the criterion combines the overall features of the scene switching frame with its local features. The specified product threshold is the bar for whether a scene switching frame qualifies as a target frame, and can be adjusted flexibly in practical application scenarios.
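The product filter is a single pass over the scored frames. The frame names, scores, and threshold below are illustrative values consistent with the earlier examples (weight 0.8 and relevance 30 for the "dance" frame).

```python
def select_target_frames(frames, product_threshold):
    """Keep frames whose weight-coefficient x relevance-value product
    exceeds the threshold; each frame is (name, weight, relevance)."""
    return [name for name, weight, relevance in frames
            if weight * relevance > product_threshold]

frames = [("dance scene", 0.8, 30), ("kung fu scene", 0.4, 10)]
print(select_target_frames(frames, 20))  # -> ['dance scene']
```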
In one embodiment of the present application, some scenarios may limit in advance the total number of frames (or the total duration) of the video summary. In that case, determining the target frames must also take the predefined frame total into account. Specifically, when the total number of scene switching frames is greater than or equal to the specified frame total, enough frames can be extracted from the scene switching frames to form the summary. In this case, the scene switching frames may be sorted in descending order of the weight-coefficient and relevance-value products computed in the embodiment above, and the top specified-frame-total frames in the sorted result are determined as the target frames. For example, suppose the summary is limited to a total of 1440 frames while 2000 scene switching frames have currently been extracted from the video. The product of the weight coefficient and relevance value of each scene switching frame can be computed in turn, the frames sorted by product in descending order, and the top 1440 taken as the target frames, so that 1440 target frames form a video summary meeting the requirement.
In this embodiment, when the total number of scene switching frames is smaller than the specified frame total, the currently extracted scene switching frames alone cannot form a summary meeting the requirement. In that case, a number of picture frames from the original video need to be inserted between the extracted scene switching frames, so as to reach the frame total required of the summary. Specifically, the insertion may be performed between two scene switching frames where the scene jump is large, which helps keep the content coherent. In this embodiment, at least one video frame from the video may be inserted between two adjacent scene switching frames whose similarity is below a judgement threshold; such a pair can be regarded as two scene switching frames with weak content relevance. Picture frames from the original video may be inserted between weakly related scene switching frames one frame at a time, until the total number of frames after insertion equals the specified frame total. The original scene switching frames together with the inserted picture frames then serve as the target frames forming the video summary of the video.
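One possible reading of the padding procedure is sketched below, with frames represented as indices into the source video. The choice of which frame in the gap to insert (here the midpoint) and the toy similarity function are assumptions; the application only requires inserting frames between weakly related neighbors until the total is reached.

```python
def pad_with_source_frames(switch_frames, source_frames, similarity,
                           judge_threshold, required_total):
    """Insert original video frames between adjacent scene switching
    frames whose similarity is below the judgement threshold, one at a
    time, until the required total is reached or no eligible gap
    remains.  Frames are indices into `source_frames`."""
    frames = list(switch_frames)
    while len(frames) < required_total:
        inserted = False
        for i in range(len(frames) - 1):
            a, b = frames[i], frames[i + 1]
            gap = b - a
            if gap > 1 and similarity(source_frames[a], source_frames[b]) < judge_threshold:
                frames.insert(i + 1, a + gap // 2)  # take a frame from the gap
                inserted = True
                break
        if not inserted:  # no eligible weakly-related gap left
            break
    return frames

source = list(range(10))                       # toy "video"
sim = lambda x, y: 1 - abs(x - y) / 10         # toy similarity
print(pad_with_source_frames([0, 9], source, sim, 0.5, 3))  # -> [0, 4, 9]
```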
In one embodiment of the present application, at least two topic tags may be extracted from the text description information of the video. In that case, the similarity between the scene label of each scene switching frame and each of the topic tags may be computed. For example, if the current topic tags are tag 1 and tag 2, the similarity of the current scene switching frame to tag 1 and to tag 2 can be computed separately, yielding a first similarity and a second similarity for that frame. The similarities computed for a scene switching frame may then be accumulated to obtain the frame's cumulative similarity; for example, the sum of the first and second similarities above is the cumulative similarity of the current frame. After the cumulative similarity of each scene switching frame is computed, it may likewise be compared with the specified similarity threshold, and frames whose cumulative similarity exceeds the threshold are determined as target frames.
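The accumulation over multiple topic tags is a simple sum. The 1-D "vectors" and the similarity function here are toy assumptions standing in for the word-vector similarity described earlier.

```python
def cumulative_similarity(scene_vec, topic_vecs, similarity):
    """Sum the similarity of one scene label against every topic tag,
    giving the frame's cumulative similarity."""
    return sum(similarity(scene_vec, t) for t in topic_vecs)

sim = lambda a, b: 1.0 - abs(a - b)   # toy 1-D similarity
topics = [0.2, 0.3]                   # two topic-tag vectors
print(cumulative_similarity(0.25, topics, sim))   # approximately 1.9
```

Frames whose cumulative similarity exceeds the specified threshold would then be kept as target frames, exactly as in the single-tag case.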
Referring to FIG. 5, the present application further provides an apparatus for generating a video summary, the video having text description information, the apparatus comprising:

a scene switching frame extraction unit 100, configured to extract a plurality of scene switching frames from the video and set scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition;

a topic tag extraction unit 200, configured to extract a topic tag corresponding to the video from the text description information; and

a video summary generation unit 300, configured to select target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and generate a video summary of the video based on the target frames.
In this embodiment, the scene switching frame extraction unit 100 comprises:

a similarity calculation module, configured to determine a reference frame in the video and successively compute the similarity between each frame after the reference frame and the reference frame;

a scene switching frame determination module, configured to determine the current frame as a scene switching frame when the similarity between the reference frame and the current frame is less than or equal to a specified threshold; and

a loop execution module, configured to take the current frame as a new reference frame and successively compute the similarity between each frame after the new reference frame and the new reference frame, so as to determine the next scene switching frame from the computed similarities.
In this embodiment, the scene switching frame extraction unit 100 further comprises:

a feature extraction module, configured to extract features of the scene switching frame, the features comprising at least one of a color feature, a texture feature, and a shape feature;

a comparison module, configured to compare the extracted features with the feature samples in a feature sample library, where each feature sample in the library is associated with a text label; and

a target feature sample determination module, configured to determine the target feature sample in the library that is most similar to the extracted features, and take the text label associated with the target feature sample as the scene label of the scene switching frame.
In this embodiment, the video summary generation unit 300 comprises:

a similarity calculation module, configured to compute the similarity between the scene label of the scene switching frame and the topic tag;

a weight coefficient setting module, configured to set a weight coefficient for the corresponding scene switching frame according to the computed similarity;

a relevance value setting module, configured to recognize the target objects contained in the scene switching frame and set a relevance value for the frame according to the association between the recognized target objects and the topic tag; and

a target frame determination module, configured to compute the product of the weight coefficient and the relevance value of the scene switching frame, and determine frames whose product exceeds a specified product threshold as the target frames.
The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
Those skilled in the art also know that, besides implementing an apparatus purely in computer-readable program code, it is entirely possible to logically program the method steps so that the apparatus achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such an apparatus may therefore be regarded as a hardware component, and the means within it for implementing various functions may also be regarded as structures within the hardware component; or the means for implementing various functions may even be regarded as both software modules implementing the method and structures within the hardware component.
As can be seen from the above, the present application first extracts from the video scene switching frames whose similarity satisfies a specified condition, and sets a corresponding scene label for each. The topic tag of the video is then determined from its text description information; this topic tag accurately characterizes the subject of the video. Next, by determining the association between the scene labels and the topic tag, the target frames most closely related to the subject can be retained from the scene switching frames. A video summary generated from these target frames can thus accurately characterize the subject matter of the video.
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). With the development of technology, however, many of today's improvements to method flows can already be regarded as direct improvements to hardware circuit structures: designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be implemented with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it himself, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of making integrated circuit chips by hand, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the original code to be compiled must be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
It will be apparent to a person skilled in the art from the description of the embodiments above that the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus embodiments can be understood by reference to the description of the corresponding method embodiments.
Although the present application has been described through the embodiments, a person of ordinary skill in the art will recognize that many variations and modifications can be made without departing from the spirit of the present application, and it is intended that the appended claims cover such variations and modifications without departing from the spirit of the present application.

Claims (17)

  1. A method for generating a video summary, wherein the video has text description information, the method comprising:
    extracting a plurality of scene switching frames from the video, and setting a scene label for each scene switching frame, wherein the similarity between two adjacent scene switching frames satisfies a specified condition;
    extracting a topic tag corresponding to the video from the text description information;
    filtering target frames out of the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and generating a video summary of the video based on the target frames.
  2. The method according to claim 1, wherein extracting a plurality of scene switching frames from the video comprises:
    determining a reference frame in the video, and sequentially calculating the similarity between each frame after the reference frame and the reference frame;
    when the similarity between the reference frame and the current frame is less than or equal to a specified threshold, determining the current frame as a scene switching frame;
    taking the current frame as a new reference frame, and sequentially calculating the similarity between each frame after the new reference frame and the new reference frame, so as to determine the next scene switching frame according to the calculated similarity.
  3. The method according to claim 2, wherein the similarity between two adjacent scene switching frames satisfying the specified condition comprises:
    the similarity between the two adjacent scene switching frames being less than or equal to the specified threshold.
  4. The method according to claim 2, wherein calculating the similarity between a frame after the reference frame and the reference frame comprises:
    extracting a first feature vector of the reference frame and a second feature vector of the current frame, wherein the first feature vector and the second feature vector represent scale-invariant features of the reference frame and the current frame, respectively;
    calculating the spatial distance between the first feature vector and the second feature vector, and taking the reciprocal of the spatial distance as the similarity between the reference frame and the current frame.
  5. The method according to claim 1, wherein setting a scene label for the scene switching frame comprises:
    extracting features of the scene switching frame, the features comprising at least one of a color feature, a texture feature, and a shape feature;
    comparing the extracted features with feature samples in a feature sample library, wherein the feature samples in the feature sample library are associated with text labels;
    determining the target feature sample in the feature sample library that is most similar to the extracted features, and taking the text label associated with the target feature sample as the scene label corresponding to the scene switching frame.
  6. The method according to claim 1, wherein the text description information comprises a title and/or synopsis of the video; and accordingly, extracting the topic tag corresponding to the video from the text description information comprises:
    matching the text description information against the text labels in a text label library, and taking the matched text labels as the topic tags of the video.
  7. The method according to claim 6, wherein each text label in the text label library is associated with a statistical count representing the total number of times the text label has served as a topic tag;
    accordingly, when at least two text labels are matched, the method further comprises:
    sorting the matched text labels in descending order of their statistical counts, and taking a specified number of top-ranked text labels in the sorting result as the topic tags of the video.
  8. The method according to claim 1, wherein filtering target frames out of the plurality of scene switching frames comprises:
    calculating the similarity between the scene label of each scene switching frame and the topic tag, and determining the scene switching frames whose calculated similarity is greater than a specified similarity threshold as the target frames.
  9. The method according to claim 8, wherein after calculating the similarity between the scene label of the scene switching frame and the topic tag, the method further comprises:
    setting a weight coefficient for the corresponding scene switching frame according to the calculated similarity;
    identifying target objects contained in the scene switching frame, and setting an association value for the scene switching frame according to the association between the identified target objects and the topic tag;
    calculating the product of the weight coefficient and the association value of the scene switching frame, and determining the scene switching frames whose product is greater than a specified product threshold as the target frames.
  10. The method according to claim 9, wherein the topic tag is associated with at least one object; accordingly, setting an association value for the scene switching frame comprises:
    comparing the target objects identified in the scene switching frame with the at least one object, and counting the number of target objects that appear among the at least one object;
    taking the product of the counted number and a specified value as the association value of the scene switching frame.
  11. The method according to claim 9, wherein the video summary of the video has a specified total number of frames; accordingly, after calculating the product of the weight coefficient and the association value of the scene switching frame, the method further comprises:
    when the total number of scene switching frames is greater than or equal to the specified total number of frames, sorting the scene switching frames in descending order of the products, and determining the top-ranked scene switching frames, up to the specified total number of frames, as the target frames.
  12. The method according to claim 11, further comprising:
    when the total number of scene switching frames is less than the specified total number of frames, inserting at least one video frame from the video between two adjacent scene switching frames whose similarity is less than a determination threshold, so that the total number of scene switching frames after inserting the at least one video frame equals the specified total number of frames.
  13. The method according to claim 1, wherein when there are at least two topic tags, filtering target frames out of the plurality of scene switching frames comprises:
    for each scene switching frame, calculating the similarity between the scene label of the scene switching frame and each topic tag, and accumulating the similarities calculated for the scene switching frame to obtain a cumulative similarity corresponding to the scene switching frame;
    determining the scene switching frames whose cumulative similarity is greater than a specified similarity threshold as the target frames.
  14. An apparatus for generating a video summary, wherein the video has text description information, the apparatus comprising:
    a scene switching frame extraction unit, configured to extract a plurality of scene switching frames from the video and set a scene label for each scene switching frame, wherein the similarity between two adjacent scene switching frames satisfies a specified condition;
    a topic tag extraction unit, configured to extract a topic tag corresponding to the video from the text description information;
    a video summary generation unit, configured to filter target frames out of the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and generate a video summary of the video based on the target frames.
  15. The apparatus according to claim 14, wherein the scene switching frame extraction unit comprises:
    a similarity calculation module, configured to determine a reference frame in the video and sequentially calculate the similarity between each frame after the reference frame and the reference frame;
    a scene switching frame determination module, configured to determine the current frame as a scene switching frame when the similarity between the reference frame and the current frame is less than or equal to a specified threshold;
    a loop execution module, configured to take the current frame as a new reference frame and sequentially calculate the similarity between each frame after the new reference frame and the new reference frame, so as to determine the next scene switching frame according to the calculated similarity.
  16. The apparatus according to claim 14, wherein the scene switching frame extraction unit comprises:
    a feature extraction module, configured to extract features of the scene switching frame, the features comprising at least one of a color feature, a texture feature, and a shape feature;
    a comparison module, configured to compare the extracted features with feature samples in a feature sample library, wherein the feature samples in the feature sample library are each associated with a text label;
    a target feature sample determination module, configured to determine the target feature sample in the feature sample library that is most similar to the extracted features, and take the text label associated with the target feature sample as the scene label corresponding to the scene switching frame.
  17. The apparatus according to claim 14, wherein the video summary generation unit comprises:
    a similarity calculation module, configured to calculate the similarity between the scene label of the scene switching frame and the topic tag;
    a weight coefficient setting module, configured to set a weight coefficient for the corresponding scene switching frame according to the calculated similarity;
    an association value setting module, configured to identify target objects contained in the scene switching frame and set an association value for the scene switching frame according to the association between the identified target objects and the topic tag;
    a target frame determination module, configured to calculate the product of the weight coefficient and the association value of the scene switching frame, and determine the scene switching frames whose product is greater than a specified product threshold as the target frames.
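Claims 2-4 describe the frame-similarity test concretely: the similarity between two frames is the reciprocal of the spatial distance between their feature vectors, and a frame whose similarity to the reference frame falls to a threshold or below is a scene switching frame. A minimal sketch under stated assumptions: the distance is taken as Euclidean, and the feature vectors are assumed to come from some scale-invariant descriptor (the claims do not name one), so plain number lists stand in for them here.

```python
import math


def frame_similarity(vec_a, vec_b):
    # Claim 4: similarity = reciprocal of the spatial (here Euclidean)
    # distance between the two frames' feature vectors. Identical vectors
    # would divide by zero, so treat that case as maximal similarity.
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    return math.inf if dist == 0 else 1.0 / dist


def is_scene_switch(reference_vec, current_vec, threshold):
    # Claims 2-3: the current frame starts a new scene when its similarity
    # to the reference frame is less than or equal to the specified threshold.
    return frame_similarity(reference_vec, current_vec) <= threshold
```

Because the similarity is a reciprocal distance, a *low* similarity (large distance) triggers a scene switch, which is why the claims compare against the threshold with "less than or equal to" rather than "greater than".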
PCT/CN2018/072191 2017-07-05 2018-01-11 Method and device for generating video summary WO2019007020A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710541793.1A CN109213895A (en) 2017-07-05 2017-07-05 A kind of generation method and device of video frequency abstract
CN201710541793.1 2017-07-05

Publications (1)

Publication Number Publication Date
WO2019007020A1 true WO2019007020A1 (en) 2019-01-10

Family

ID=64949707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072191 WO2019007020A1 (en) 2017-07-05 2018-01-11 Method and device for generating video summary

Country Status (3)

Country Link
CN (1) CN109213895A (en)
TW (1) TWI712316B (en)
WO (1) WO2019007020A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298270A (en) * 2019-06-14 2019-10-01 天津大学 A kind of more video summarization methods based on the perception of cross-module state importance

Families Citing this family (9)

Publication number Priority date Publication date Assignee Title
TWI762764B (en) * 2019-02-15 2022-05-01 國風傳媒有限公司 Apparatus, method, and computer program product thereof for integrating terms
CN110263650B (en) * 2019-05-22 2022-02-22 北京奇艺世纪科技有限公司 Behavior class detection method and device, electronic equipment and computer readable medium
CN110149531A (en) * 2019-06-17 2019-08-20 北京影谱科技股份有限公司 The method and apparatus of video scene in a kind of identification video data
CN112153462B (en) * 2019-06-26 2023-02-14 腾讯科技(深圳)有限公司 Video processing method, device, terminal and storage medium
CN110297943B (en) * 2019-07-05 2022-07-26 联想(北京)有限公司 Label adding method and device, electronic equipment and storage medium
CN111275097B (en) * 2020-01-17 2021-06-18 北京世纪好未来教育科技有限公司 Video processing method and system, picture processing method and system, equipment and medium
TWI741550B (en) * 2020-03-31 2021-10-01 國立雲林科技大學 Method for bookmark frame generation, and video player device with automatic generation of bookmark and user interface thereof
CN111641868A (en) * 2020-05-27 2020-09-08 维沃移动通信有限公司 Preview video generation method and device and electronic equipment
CN115086783B (en) * 2022-06-28 2023-10-27 北京奇艺世纪科技有限公司 Video generation method and device and electronic equipment

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101308501A (en) * 2008-06-30 2008-11-19 腾讯科技(深圳)有限公司 Method, system and device for generating video frequency abstract
CN103810711A (en) * 2014-03-03 2014-05-21 郑州日兴电子科技有限公司 Keyframe extracting method and system for monitoring system videos
CN106612468A (en) * 2015-10-21 2017-05-03 上海文广互动电视有限公司 A video abstract automatic generation system and method
CN106713964A (en) * 2016-12-05 2017-05-24 乐视控股(北京)有限公司 Method of generating video abstract viewpoint graph and apparatus thereof

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
JP2006510248A (en) * 2002-12-11 2006-03-23 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Method and system for obtaining text keywords or phrases for providing content-related links to network-based resources using video content
US8705933B2 (en) * 2009-09-25 2014-04-22 Sony Corporation Video bookmarking
US8665345B2 (en) * 2011-05-18 2014-03-04 Intellectual Ventures Fund 83 Llc Video summary including a feature of interest
CN103200463A (en) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 Method and device for generating video summary
CN103440640B (en) * 2013-07-26 2016-02-10 北京理工大学 A kind of video scene cluster and browsing method
CN103646094B (en) * 2013-12-18 2017-05-31 上海紫竹数字创意港有限公司 Realize that audiovisual class product content summary automatically extracts the system and method for generation
CN106921891B (en) * 2015-12-24 2020-02-11 北京奇虎科技有限公司 Method and device for displaying video characteristic information
CN105868292A (en) * 2016-03-23 2016-08-17 中山大学 Video visualization processing method and system


Also Published As

Publication number Publication date
TWI712316B (en) 2020-12-01
CN109213895A (en) 2019-01-15
TW201907736A (en) 2019-02-16

Similar Documents

Publication Publication Date Title
TWI712316B (en) Method and device for generating video summary
CN109151501B (en) Video key frame extraction method and device, terminal equipment and storage medium
US11113587B2 (en) System and method for appearance search
US10528821B2 (en) Video segmentation techniques
Mussel Cirne et al. VISCOM: A robust video summarization approach using color co-occurrence matrices
CN104994426B (en) Program video identification method and system
US20120027295A1 (en) Key frames extraction for video content analysis
Mahapatra et al. Coherency based spatio-temporal saliency detection for video object segmentation
Thomas et al. Perceptual video summarization—A new framework for video summarization
CN113766330A (en) Method and device for generating recommendation information based on video
CN102156686B (en) Method for detecting specific contained semantics of video based on grouped multi-instance learning model
CN110765314A (en) Video semantic structural extraction and labeling method
EP2345978B1 (en) Detection of flash illuminated scenes in video clips and related ranking of video clips
JP2009060413A (en) Method and system for extracting feature of moving image, and method and system for retrieving moving image
Premaratne et al. Structural approach for event resolution in cricket videos
WO2020192869A1 (en) Feature extraction and retrieval in videos
Cirne et al. Summarization of videos by image quality assessment
Khan et al. RICAPS: residual inception and cascaded capsule network for broadcast sports video classification
Karthick et al. Automatic genre classification from videos
Glasberg et al. Cartoon-recognition using visual-descriptors and a multilayer-perceptron
Lotfi A Novel Hybrid System Based on Fractal Coding for Soccer Retrieval from Video Database
Premaratne et al. A Novel Hybrid Adaptive Filter to Improve Video Keyframe Clustering to Support Event Resolution in Cricket Videos
Saha et al. Cricket Highlight Generation: Automatic Generation Framework Comprising Score Extraction and Action Recognition
Mpountouropoulos et al. Visual Information Analysis for Big-Data Using Multi-core Technologies
Lee et al. A comparative study of the objectionable video classification approaches using single and group frame features

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27/02/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18827892

Country of ref document: EP

Kind code of ref document: A1