WO2019007020A1 - Method and device for generating video summary - Google Patents

Method and device for generating video summary

Info

Publication number
WO2019007020A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
scene switching
video
scene
similarity
Prior art date
Application number
PCT/CN2018/072191
Other languages
French (fr)
Chinese (zh)
Inventor
Ge Leiming (葛雷鸣)
Original Assignee
Youku Network Technology (Beijing) Co., Ltd. (优酷网络技术(北京)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Youku Network Technology (Beijing) Co., Ltd.
Publication of WO2019007020A1 publication Critical patent/WO2019007020A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer

Definitions

  • the present application relates to the field of Internet technologies, and in particular, to a method and an apparatus for generating a video summary.
  • the video playing platform usually creates a corresponding video summary for the uploaded video.
  • the video summary may be a short duration video, and a part of the scene in the original video may be included in the video summary. In this way, the user can quickly understand the approximate content of the original video while viewing the video summary.
  • Video digests created in this way can more accurately characterize the information contained in the video, but as the number of videos grows rapidly, making video summaries this way takes a great deal of manpower, and the speed at which video summaries are produced is quite slow.
  • To improve efficiency, video summaries are currently often produced using image recognition technology.
  • the uploaded video may be sampled at a fixed time interval to extract a multi-frame image in the video.
  • the similarity between every two adjacent sampled frames can be calculated in turn, and frames with lower similarity can be retained, thereby ensuring that the retained image frames display the contents of multiple scenes.
  • the finally retained image frames can then be assembled into a video summary of the video.
  • An object of the embodiments of the present application is to provide a method and an apparatus for generating a video summary, which can accurately represent the theme of a video while improving efficiency.
  • an embodiment of the present application provides a method for generating a video summary, where the video has text description information. The method includes: extracting a plurality of scene switching frames from the video, and setting a scene label for each scene switching frame, wherein a similarity between two adjacent scene switching frames satisfies a specified condition; extracting a topic tag corresponding to the video from the text description information; and filtering out a target frame from the plurality of scene switching frames according to an association between the scene labels of the scene switching frames and the topic tag, and generating a video summary of the video based on the target frame.
  • an embodiment of the present application further provides a video summary generating apparatus, where the video has text description information
  • the apparatus includes: a scene switching frame extracting unit, configured to extract multiple scene switching frames from the video and set a scene label for each scene switching frame, wherein a similarity between two adjacent scene switching frames satisfies a specified condition; a topic label extracting unit, configured to extract a topic tag corresponding to the video from the text description information; and a video summary generating unit, configured to filter a target frame from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and to generate a video summary of the video based on the target frame.
  • the present application can first extract a scene switching frame whose similarity meets the specified condition from the video, and set a corresponding scene label for the scene switching frame.
  • the textual description of the video can then be combined to determine the subject tag of the video.
  • This topic tag accurately represents the subject of the video.
  • the target frame closely related to the topic can be retained from the scene switching frame. In this way, the video summary generated based on the target frame can thereby accurately characterize the subject content of the video.
  • FIG. 1 is a flowchart of a method for generating a video digest in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a target frame and a scene switching frame in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of extracting a scene switching frame according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of extracting a scene label in an embodiment of the present application.
  • FIG. 5 is a functional block diagram of a video summary generating apparatus in an embodiment of the present application.
  • the present application provides a method for generating a video summary, which can be applied to an electronic device having a data processing function.
  • the electronic device may be, for example, a desktop computer, a tablet computer, a notebook computer, a smart phone, a digital assistant, a smart wearable device, a shopping guide terminal, a television set with network access function, or the like.
  • the method can also be applied to software running in the above electronic device.
  • the software may be software having a video production function or a video playback function.
  • the method can also be applied to a server of a video playing website.
  • the video playing website may be, for example, iQiyi, Sohu video, Acfun, and the like.
  • the number of the servers is not specifically limited in the present embodiment.
  • the server may be a single server, several servers, or a server cluster formed by several servers.
  • the video summary may be generated based on video.
  • the video may be a video local to the user or a video uploaded by the user to the video playing website.
  • the video usually has text description information.
  • the text description information may be a title of the video or an introduction to the video.
  • the title and the introduction may be pre-edited by the video creator or the video uploader, or may be added by a staff member who reviews the video.
  • the present application is not limited in this respect.
  • the text description information may include a text label of the video or a descriptive phrase extracted from the barrage information of the video, in addition to the title and the introduction of the video.
  • the method for generating a video summary may include the following steps.
  • S1: Extract a plurality of scene switching frames from the video, and set a scene label for the scene switching frame, wherein a similarity between two adjacent scene switching frames satisfies a specified condition.
  • the video may be a video stored locally or a video stored in another device.
  • the manner in which the video is obtained may include loading the video locally according to a specified path or downloading the video according to a Uniform Resource Locator (URL) provided by another device.
  • each frame of the video may be analyzed to extract a plurality of scene switching frames therein.
  • the extraction may be performed by frame-by-frame comparison. Specifically, a reference frame may first be determined in the video, and the similarity between each frame after the reference frame and the reference frame may be sequentially calculated.
  • the reference frame may be a picture frame randomly specified within a certain range.
  • for example, the reference frame may be a picture frame randomly selected within the first 2 minutes of the video.
  • the first frame of the video may be used as the reference frame.
  • starting from the reference frame, each picture frame subsequent to the reference frame may be sequentially compared with the reference frame to calculate the similarity between each subsequent frame and the reference frame.
  • the first feature vector and the second feature vector of the reference frame and the current frame may be separately extracted.
  • the first feature vector and the second feature vector may have various forms.
  • the feature vector of the frame picture can be constructed based on the pixel value of the pixel point in each frame of the picture.
  • Each picture frame is usually composed of a plurality of pixel points arranged in a certain order, and each pixel point carries a pixel value, thereby forming a colorful picture.
  • the pixel value may be a value within a specified interval.
  • the pixel value may be any one of 0 to 255.
  • the size of the value can indicate the shade of the color.
  • the pixel value of each pixel in each picture frame can be acquired, and the feature vector of the picture frame is formed from the acquired pixel values.
  • specifically, the pixel values of the pixels may be acquired sequentially and arranged in order from left to right and top to bottom; for a picture frame of 9×9 pixels, for example, this forms an 81-dimensional vector.
  • the 81-dimensional vector can be used as the feature vector of the current picture frame.
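As an illustrative sketch of this flattening step (the 9×9 grid and its pixel values are hypothetical, chosen only so that the vector comes out 81-dimensional):

```python
def frame_feature_vector(frame):
    """Flatten a frame's pixel values row by row, left to right and
    top to bottom, into a single feature vector."""
    return [pixel for row in frame for pixel in row]

# Hypothetical 9x9 grayscale frame; each pixel value lies in [0, 255].
frame = [[(r * 9 + c) % 256 for c in range(9)] for r in range(9)]
vec = frame_feature_vector(frame)
assert len(vec) == 81  # 9 x 9 pixels -> an 81-dimensional vector
```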
  • the feature vector may also be a CNN (Convolutional Neural Network) feature of each frame of the picture.
  • the reference frame and each frame picture subsequent to the reference frame may be input into a convolutional neural network, and then the convolutional neural network may output the reference frame and the feature vector corresponding to each frame picture.
  • the first feature vector and the second feature vector may also be scale-invariant features representing the reference frame and the current frame, respectively.
  • even if the rotation angle, image brightness, or shooting angle of view of the picture changes, such scale-invariant features extracted as the first feature vector and the second feature vector can still reflect the contents of the reference frame and the current frame well.
  • for example, the first feature vector and the second feature vector may be SIFT (Scale-Invariant Feature Transform) features, SURF (Speeded-Up Robust Features) features, color histogram features, etc.
  • the similarity between the first feature vector and the second feature vector may be calculated.
  • the similarity may be expressed in the vector space as the distance between the two vectors: the closer the distance, the more similar the two vectors, and so the higher the similarity; the further the distance, the greater the difference between the two vectors, and so the lower the similarity. Therefore, when calculating the similarity between the reference frame and the current frame, the spatial distance between the first feature vector and the second feature vector may be calculated, and the reciprocal of the spatial distance taken as the similarity between the reference frame and the current frame.
  • the smaller the spatial distance, the greater the corresponding similarity, indicating that the reference frame and the current frame are more similar.
  • the larger the spatial distance, the smaller the corresponding similarity, indicating that the reference frame and the current frame are more dissimilar.
  • the similarity between each frame after the reference frame and the reference frame can be sequentially calculated in the above manner.
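The reciprocal-of-distance similarity described above might be sketched as follows (the Euclidean distance and the small epsilon guard against division by zero for identical frames are assumptions of this illustration):

```python
import math

def similarity(vec_a, vec_b, eps=1e-9):
    """Similarity taken as the reciprocal of the Euclidean distance
    between two feature vectors; eps avoids division by zero when the
    two frames are identical."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    return 1.0 / (dist + eps)

sim_close = similarity([10, 20, 30], [11, 20, 30])  # nearly identical frames
sim_far = similarity([10, 20, 30], [200, 5, 90])    # very different frames
assert sim_close > sim_far  # smaller distance -> larger similarity
```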
  • the content displayed in two frames with higher similarity is generally similar, while the main purpose of a video summary is to display the content of different scenes in the video to the user. Therefore, in the present embodiment, when the similarity between the reference frame and the current frame is less than or equal to a specified threshold, the current frame may be determined as a scene switching frame.
  • the specified threshold may be a preset value, and the value may be flexibly adjusted according to actual conditions. For example, when the number of scene switching frames filtered according to the specified threshold is excessive, the size of the specified threshold may be appropriately reduced.
  • conversely, when the number of scene switching frames filtered according to the specified threshold is too small, the size of the specified threshold may be appropriately increased.
  • a similarity less than or equal to the specified threshold may indicate that the contents of the two frames are already significantly different, i.e., the scene displayed by the current frame has changed relative to the scene displayed by the reference frame.
  • the current frame can therefore be retained as a scene switching frame.
  • subsequent scene switching frames may then be determined. Specifically, from the reference frame to the current frame, the scene may be considered to have changed once, so the current scene is the content displayed by the current frame. Based on this, the current frame may be used as a new reference frame, and the similarity between each frame after the new reference frame and the new reference frame may be sequentially calculated; the next scene switching frame is determined according to the calculated similarity. Similarly, when determining the next scene switching frame, the similarity between two frames can still be determined by extracting feature vectors and calculating the spatial distance, and the determined similarity can still be compared with the specified threshold, thereby determining the next scene switching frame in which the scene changes again after the new reference frame.
  • the scene switching frame may be used as a new reference frame, and the subsequent scene switching frame extraction process may be continued.
  • each frame of the scene in which the scene changes can be extracted, so that the scene displayed in the video is not missed, so as to ensure the completeness of the video summary.
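The reference-frame update loop above could be sketched roughly like this (the one-dimensional toy "frames" and the similarity function are purely illustrative assumptions, not the patent's actual feature extraction):

```python
def extract_scene_switch_frames(frames, threshold, sim):
    """Walk the video frame by frame: whenever the similarity to the
    current reference frame drops to or below the threshold, record a
    scene switching frame and make it the new reference frame."""
    if not frames:
        return []
    switches = []
    ref = frames[0]                # first frame serves as the initial reference
    for idx in range(1, len(frames)):
        if sim(ref, frames[idx]) <= threshold:
            switches.append(idx)   # scene changed at this frame
            ref = frames[idx]      # current frame becomes the new reference
    return switches

# Toy one-dimensional "frames": the value jumps at indices 3 and 6,
# mimicking two scene changes.
frames = [[0], [1], [2], [50], [51], [52], [120], [121]]
sim = lambda a, b: 1.0 / (1.0 + abs(a[0] - b[0]))
cuts = extract_scene_switch_frames(frames, threshold=0.1, sim=sim)
assert cuts == [3, 6]
```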
  • as illustrated, a rectangular strip filled with diagonal lines may represent a scene switching frame, and the similarity between two adjacent scene switching frames may be less than or equal to the specified threshold.
  • in this way, the similarity between any two adjacent scene switching frames is less than or equal to the specified threshold; therefore, the condition that the similarity between two adjacent scene switching frames satisfies the specified condition may mean that it is less than or equal to the specified threshold.
  • a scene label may be set for the scene switching frame.
  • the scene tag may be a text tag for characterizing content displayed in the scene switch frame. For example, if a scene switching frame shows that two people are fighting, the scene label corresponding to the scene switching frame may be “martial arts”, “fighting” or “Kung Fu”.
  • the content in the scene switching frame may be identified to determine a scene label corresponding to the scene switching frame.
  • features of the scene switching frame may be extracted, wherein the features may include at least one of a color feature, a texture feature, and a shape feature.
  • the color feature may be a feature extracted based on different color spaces.
  • the color space may include, for example, RGB (Red, Green, Blue) space, HSV (Hue, Saturation, Value) space, HSI (Hue, Saturation, Intensity) space, etc.
  • the R component, the G component, and the B component may be provided in the RGB space.
  • the color components will differ for different pictures.
  • the color components can be used to characterize the features of the scene switching frame.
  • the texture feature may be used to describe a material corresponding to the scene switching frame.
  • the texture features can generally be represented by a distribution of gray levels.
  • the texture features may correspond to low frequency components and high frequency components in the image spectrum.
  • the low frequency component and the high frequency component of the image contained in the scene switching frame can be used as features of the scene switching frame.
  • the shape features may include edge-based shape features and region-based shape features.
  • for example, a Fourier descriptor of the boundary may be used as the edge-based shape feature, and an invariant moment descriptor may be used as the region-based shape feature.
  • the extracted features may be compared with each feature sample in the feature sample library.
  • the feature sample library may be a sample set summarized based on historical data of image recognition.
  • feature samples representing different contents may be provided.
  • the feature sample may also be at least one of the color feature, the texture feature, and the shape feature described above.
  • the feature samples in the feature sample library may be associated with a text tag, and the text tag may be used to describe the display content corresponding to the feature sample.
  • the text label associated with the feature sample representing the soccer game may be "playing football”
  • the text label representing the feature sample of the dance may be "square dance”.
  • the extracted features and the feature samples in the feature sample library may all be represented by a vector form.
  • comparing the extracted features with each feature sample in the feature sample library may refer to calculating a distance between the feature and each feature sample. The closer the distance, the more similar the extracted features are to the feature samples.
  • target feature samples of the feature sample library that are most similar to the extracted features can be determined.
  • the distance calculated between the most similar target feature sample and the extracted feature may be the smallest.
  • the extracted feature is the most similar to the target feature sample, indicating that the content displayed by the two is also the most similar. Therefore, the text label associated with the target feature sample can be used as the scene label corresponding to the scene switching frame, thereby Each scene switching frame sets a corresponding scene label.
  • for example, the distances between the feature extracted from the scene switching frame and the feature samples in the feature sample library may be 0.8, 0.5, 0.95, and 0.6, respectively; the text label corresponding to the feature sample at distance 0.5 can then be used as the scene label of the scene switching frame.
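A nearest-sample lookup of this kind might look as follows; the sample library, its labels, and the three-dimensional feature vectors are hypothetical stand-ins for the real feature sample library:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scene_label(feature, sample_library):
    """Return the text label of the library sample whose feature vector
    is nearest to the extracted feature (smallest distance wins)."""
    return min(sample_library, key=lambda s: euclidean(feature, s[1]))[0]

# Hypothetical sample library: (text label, feature-sample vector) pairs.
library = [
    ("playing football", [0.9, 0.1, 0.0]),
    ("square dance",     [0.1, 0.8, 0.3]),
    ("martial arts",     [0.2, 0.2, 0.9]),
]
label = scene_label([0.15, 0.75, 0.25], library)
assert label == "square dance"  # closest sample in the library
```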
  • the text description information may more accurately indicate the subject of the video. Therefore, the topic tag corresponding to the video may be extracted from the text description information.
  • the video playing website can collect and summarize the text description information of a large number of videos, filter out the text labels that may serve as video topics, and form the selected text labels into a text label library.
  • the content in the text tag library can be continuously updated. In this way, when the topic tag is extracted from the text description information, the text description information may be matched with each text tag in the text tag library, and the matched text tag is used as the theme tag of the video.
  • for example, if the text description information of the video is “foreign guy and Chinese aunt dancing square dance, stunned everyone!”, then matching the text description information with each text label in the text label library can yield the matching result “square dance”. Therefore, “square dance” can be used as the topic tag of the video.
  • since the text description information of a video is usually long, at least two results may be matched when it is compared with the text labels in the text label library.
  • for example, if the text description information of the video is “foreign guy and Chinese aunt dancing square dance, stunned everyone!”, then matching it with each text label in the text label library may yield three matching results: “foreign guy”, “Chinese aunt”, and “square dance”.
  • in this case, all three matching results can be used as topic tags of the video at the same time.
  • a suitable topic tag can be selected from the matched multiple results.
  • each text label in the text label library may be associated with a statistical number, wherein the number of statistics may be used to represent the total number of times the text label is a topic label.
  • the greater the statistical count, the more often the corresponding text label has been used as the topic label of a video, and the higher its credibility as a topic label. Therefore, when the number of matched text labels is at least two, the matched text labels may be sorted in descending order of their statistical counts, and a specified number of top-ranked text labels in the ranking result are used as the topic tags of the video.
  • the specified number may be a predefined number of subject tags of the video.
  • for example, if the number of topic tags of a video is limited to at most two, the three matching results “foreign guy”, “Chinese aunt”, and “square dance” can be sorted by statistical count, and the top two, “Chinese aunt” and “square dance”, are taken as the topic tags of the video.
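Combining the matching and the statistics-based ranking, a rough sketch (the substring matching, the tag library, and its usage counts are illustrative assumptions):

```python
def topic_tags(description, tag_library, max_tags):
    """Match library tags against the description text, then keep the
    max_tags tags with the highest historical usage counts."""
    matched = [(tag, count) for tag, count in tag_library.items()
               if tag in description]
    matched.sort(key=lambda item: item[1], reverse=True)
    return [tag for tag, _ in matched[:max_tags]]

# Hypothetical tag library: tag -> times it has served as a topic tag.
library = {"square dance": 950, "foreign guy": 120, "Chinese aunt": 430}
desc = "foreign guy and Chinese aunt dancing square dance, stunned everyone!"
assert topic_tags(desc, library, max_tags=2) == ["square dance", "Chinese aunt"]
```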
  • S5: Filter the target frame from the plurality of scene switching frames according to the association between the scene label of the scene switching frame and the topic label, and generate a video summary of the video based on the target frame.
  • the target frame may be selected from the plurality of scene switching frames according to the association between the scene label of the scene switching frame and the theme tag.
  • the association between the scene tag and the topic tag may refer to the degree of similarity between the scene tag and the topic tag.
  • the manner of determining the association between the scene label and the topic label may include calculating a similarity between the scene label of each of the scene switching frames and the theme label.
  • the scene label and the topic tag may each consist of a word or phrase.
  • the scene label and the topic tag may each be represented by a word vector.
  • in this way, the similarity between the scene label and the topic tag can be represented by the spatial distance between the two word vectors.
  • a scene switching frame whose calculated similarity is greater than the specified similarity threshold may be determined as the target frame.
  • the specified similarity threshold may be used as a threshold for measuring whether the scene switching frame and the topic are sufficiently related.
  • when the similarity is greater than the specified similarity threshold, the current scene switching frame may be sufficiently associated with the video theme, and the content displayed by the scene switching frame can accurately reflect the subject of the video, so the scene switching frame can be determined as the target frame.
  • the target frames selected from the scene switching frames are all closely related to the theme of the video. Therefore, the video summary of the video may be generated based on the target frames. Specifically, the target frames may be arranged sequentially in the order in which they appear in the video, thereby forming a video summary of the video. Alternatively, if the logical continuity of content between preceding and following frames in the summary is not a concern, the target frames may be arranged randomly, and the resulting target frame sequence used as the video summary of the video.
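The threshold-based selection of target frames might be sketched as follows, using cosine similarity between hypothetical two-dimensional word vectors (both the vectors and the threshold value are assumptions of this illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical word vector for the topic tag, e.g. "dance".
topic_vec = [0.9, 0.1]
# (frame id, word vector of the frame's scene label)
switch_frames = [
    (0, [0.85, 0.2]),   # scene label close to the topic
    (1, [0.1, 0.95]),   # unrelated scene label
    (2, [0.8, 0.15]),
]
threshold = 0.9
targets = [fid for fid, vec in switch_frames if cosine(vec, topic_vec) > threshold]
assert targets == [0, 2]  # frame order in the video is preserved
```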
  • the scene label of the scene switching frame is generally set for the overall content of the scene switching frame, so the scene label cannot accurately reflect the local details in the scene switching frame.
  • the target object included in the scene switching frame may be identified, and the target frame may be filtered based on the identified target object.
  • the weighting coefficient may be set for the corresponding scene switching frame according to the calculated similarity. The higher the similarity between the scene label and the topic label, the larger the weight coefficient set for the corresponding scene switching frame.
  • the weighting factor can be a value between 0 and 1.
  • for example, the scene switching frame whose scene label is “dance” may be given a weight coefficient of 0.8, and the scene switching frame whose scene label is “Kung Fu” may be given a weight coefficient of 0.4.
  • the target object included in the scene switching frame can be identified.
  • for example, an AdaBoost algorithm, an R-CNN (Region-based Convolutional Neural Network) algorithm, or an SSD (Single Shot Detector) algorithm may be used to detect the target objects included in the scene switching frame.
  • for example, for a scene switching frame whose scene label is “dance”, the R-CNN algorithm may recognize that the scene switching frame includes two kinds of target objects: “woman” and “audio”.
  • the associated value may be set for the scene switching frame according to the association between the identified target objects and the topic tag.
  • the subject tag can be associated with at least one object.
  • the object may be an object that is more closely associated with the subject tag.
  • At least one object associated with the topic tag may be obtained by analyzing historical data. For example, when the topic tag is “beach”, the associated objects may include “sea water”, “beach”, “seagull”, “swimwear”, “sun umbrella”, and the like. In this way, the target objects identified in the scene switching frame can be compared with the at least one object, and the number of target objects appearing in the at least one object can be counted.
  • when the identified target objects are compared with the at least one object, it may be determined that the target objects appearing in the at least one object are “sun umbrella”, “beach”, and “sea water”; that is, the number of target objects appearing in the at least one object is three.
  • the product of the number of statistics and the specified value may be used as the associated value of the scene switching frame.
  • the specified value may be a preset value. For example, the specified value may be 10, and the associated value of the scene switching frame in the above example may be 30.
  • the greater the number of target objects appearing in the at least one object, the more closely the local details of the scene switching frame are associated with the video theme, and the higher the corresponding associated value.
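The associated-value computation can be sketched directly from the beach example above (the extra "dog" detection is a hypothetical addition used to show a non-matching object):

```python
def associated_value(detected_objects, topic_objects, unit=10):
    """Count the detected objects that also appear in the topic's
    associated-object list, then scale by the specified unit value."""
    hits = sum(1 for obj in detected_objects if obj in topic_objects)
    return hits * unit

# Associated objects for the topic tag "beach", per the example above.
beach_objects = {"sea water", "beach", "seagull", "swimwear", "sun umbrella"}
detected = ["sun umbrella", "beach", "sea water", "dog"]  # "dog" is hypothetical
assert associated_value(detected, beach_objects) == 30    # 3 hits x 10
```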
  • when determining the target frame, the determination may be made based on both the overall features and the local features of the scene switching frame. Specifically, the product of each scene switching frame's weight coefficient and associated value may be calculated, and a scene switching frame whose product is greater than a specified product threshold is determined as the target frame. Using the product as the basis for the judgment integrates the overall features and local features of the scene switching frame.
  • the specified product threshold may be a threshold for measuring whether a scene switching frame is a target frame. The specified product threshold can be flexibly adjusted in an actual application scenario.
  • the total number of picture frames (or the total duration) in the video digest may be limited in advance.
  • in this case, the scene switching frames may be sorted in descending order of the product of the weight coefficient and the associated value calculated in the above embodiment.
  • the specified total number of top-ranked picture frames in the ranking result may then be determined as the target frames.
  • for example, suppose the total number of frames allowed in the video digest is 1440, while the number of scene switching frames currently extracted from the video is 2000. The product of the weight coefficient and the associated value for each scene switching frame can be calculated in turn; after sorting the products from large to small, the top 1440 scene switching frames are used as the target frames, and these 1440 target frames constitute a video summary that meets the requirements.
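The budget-limited ranking described in this example might be sketched as follows (the four toy frames and their coefficients are illustrative; ids are restored to video order after ranking):

```python
def pick_top_frames(scored_frames, frame_budget):
    """Rank frames by weight_coefficient * associated_value and keep
    the top frame_budget, restoring original video order afterwards."""
    ranked = sorted(scored_frames, key=lambda f: f[1] * f[2], reverse=True)
    kept = ranked[:frame_budget]
    return sorted(fid for fid, _, _ in kept)

# (frame id, weight coefficient, associated value) -- all hypothetical.
frames = [(0, 0.8, 30), (1, 0.4, 10), (2, 0.9, 20), (3, 0.2, 40)]
assert pick_top_frames(frames, frame_budget=2) == [0, 2]  # products 24 and 18
```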
  • if the total number of scene switching frames is less than the specified total number of frames, all the currently selected scene switching frames together are not enough to constitute a video summary that meets the requirements.
  • a certain number of picture frames in the original video need to be inserted between the extracted scene switching frames, thereby achieving the requirement of the total number of frames defined by the video summary.
  • when picture frames from the original video are inserted, the insertion can be performed between two scene switching frames with a large scene jump, so that the continuity of the content is maintained.
  • at least one video frame in the video may be inserted between two adjacent scene switching frames whose similarity is less than the determination threshold.
  • the two adjacent scene switching frames whose similarity is smaller than the determination threshold may be regarded as two scene switching frames with weak content relevance.
  • specifically, picture frames from the original video may be inserted frame by frame between two weakly correlated scene switching frames until the total number of frames after inserting the at least one video frame equals the specified total number of frames. In this way, the original scene switching frames and the inserted picture frames as a whole can be used as the target frames, thereby constituting the video summary of the video.
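The frame-insertion step might be sketched like this (the frame indices, the weak pair, and the frame budget are hypothetical; "weak pairs" stands for adjacent scene switching frames whose similarity fell below the determination threshold):

```python
def pad_summary(switch_idxs, weak_pairs, frame_budget):
    """Fill the summary up to frame_budget by inserting original-video
    frames into the gaps between weakly related adjacent scene
    switching frames (the big scene jumps)."""
    frames = set(switch_idxs)
    # Candidate fill frames: every original frame inside a weak gap.
    candidates = [i for a, b in weak_pairs for i in range(a + 1, b)]
    for idx in candidates:
        if len(frames) >= frame_budget:
            break
        frames.add(idx)
    return sorted(frames)

switch_idxs = [0, 10, 20]
weak_pairs = [(0, 10)]   # the cut between frames 0 and 10 is a big jump
padded = pad_summary(switch_idxs, weak_pairs, frame_budget=5)
assert padded == [0, 1, 2, 10, 20]
```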
  • the number of the topic tags extracted from the text description information of the video may be at least two.
  • in this case, for each scene switching frame, the similarity between its scene label and each of the topic tags may be calculated. For example, if the current topic tags are tag 1 and tag 2, the similarity between the scene label of the current scene switching frame and each of tag 1 and tag 2 can be calculated separately, giving a first similarity and a second similarity corresponding to the current scene switching frame. After the respective similarities corresponding to a scene switching frame have been calculated, they may be accumulated to obtain a cumulative similarity corresponding to that scene switching frame.
  • the sum of the first similarity and the second similarity described above may be used as the cumulative similarity corresponding to the current scene switching frame.
  • Then, the cumulative similarity may be compared with a specified similarity threshold, and a scene switching frame whose cumulative similarity is greater than the specified similarity threshold may be determined as a target frame.
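The cumulative-similarity filtering in the preceding paragraphs might look like the sketch below. The names `select_target_frames` and `label_sim`, and the use of a plain dict mapping frame ids to scene labels, are illustrative assumptions rather than the disclosed implementation.

```python
def select_target_frames(frame_labels, topic_tags, label_sim, sim_threshold):
    """Keep the scene switching frames whose cumulative similarity to all
    topic tags exceeds the specified similarity threshold.

    frame_labels: {frame_id: scene_label} for each scene switching frame.
    topic_tags: topic tags extracted from the video's text description.
    label_sim(a, b): similarity (0..1) between two labels.
    """
    targets = []
    for frame_id, scene_label in frame_labels.items():
        # Accumulate this frame's similarity to every topic tag.
        cumulative = sum(label_sim(scene_label, tag) for tag in topic_tags)
        if cumulative > sim_threshold:
            targets.append(frame_id)
    return targets
```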
  • the present application further provides a device for generating a video summary, where the video has text description information, and the device includes:
  • the scene switching frame extraction unit 100 is configured to extract a plurality of scene switching frames from the video, and set a scene label for the scene switching frame, where the similarity between two adjacent scene switching frames satisfies a specified condition;
  • the topic tag extracting unit 200 is configured to extract, from the text description information, a topic tag corresponding to the video;
  • the video summary generating unit 300 is configured to filter target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the theme label, and generate a video summary of the video based on the target frames.
  • the scene switching frame extraction unit 100 includes:
  • a similarity calculation module configured to determine a reference frame in the video, and sequentially calculate a similarity between the frame after the reference frame and the reference frame;
  • a scene switching frame determining module configured to determine the current frame as a scene switching frame when the similarity between the reference frame and the current frame is less than or equal to a specified threshold;
  • a loop execution module configured to use the current frame as a new reference frame, and sequentially calculate the similarity between each frame after the new reference frame and the new reference frame, so as to determine the next scene switching frame according to the calculated similarity.
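The reference-frame loop performed by these modules can be sketched as below. The function name and the `similarity` callback are assumptions; a real implementation would compare feature vectors as described later in this document.

```python
def extract_scene_frames(frames, similarity, threshold):
    """Walk the video once: whenever a frame's similarity to the current
    reference frame drops to `threshold` or below, record it as a scene
    switching frame and make it the new reference frame."""
    if not frames:
        return []
    switches = []
    ref = frames[0]  # the first frame serves as the initial reference frame
    for frame in frames[1:]:
        if similarity(ref, frame) <= threshold:
            switches.append(frame)
            ref = frame  # the scene changed: compare later frames against it
    return switches
```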
  • the scene switching frame extraction unit 100 includes:
  • a feature extraction module configured to extract features of the scene switching frame, the features including at least one of a color feature, a texture feature, and a shape feature;
  • a comparison module configured to compare the extracted features with the feature samples in the feature sample library, wherein the feature samples in the feature sample library are all associated with a text label;
  • the target feature sample determining module is configured to determine a target feature sample that is most similar to the extracted feature in the feature sample library, and use a text tag associated with the target feature sample as a scene tag corresponding to the scene switching frame.
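A minimal sketch of matching an extracted feature against the feature sample library follows. Euclidean distance over plain Python lists is an assumption here; the disclosure does not fix a particular distance measure.

```python
import math

def label_scene_frame(frame_feature, sample_library):
    """Return the text label of the feature sample nearest to the frame's
    extracted feature. sample_library: list of (feature_vector, text_label)."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    # The nearest sample's associated text label becomes the scene label.
    _, best_label = min(sample_library, key=lambda s: dist(frame_feature, s[0]))
    return best_label
```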
  • the video summary generating unit 300 includes:
  • a similarity calculation module configured to calculate the similarity between the scene label of the scene switching frame and the theme label;
  • a weight coefficient setting module configured to set a weight coefficient for the corresponding scene switching frame according to the calculated similarity;
  • a correlation value setting module configured to identify a target object included in the scene switching frame, and set an association value for the scene switching frame according to the identified association between the target object and the theme label;
  • a target frame determining module configured to calculate a product of a weight coefficient of the scene switching frame and an associated value, and determine a scene switching frame whose product is greater than a specified product threshold as the target frame.
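The product-threshold filtering performed by the target frame determining module can be sketched as follows; the function name and the dict-based inputs are illustrative assumptions.

```python
def filter_by_product(frames, weight, assoc, product_threshold):
    """Keep frames whose (weight coefficient x association value) product
    exceeds the specified product threshold.

    weight[f]: coefficient set from the scene-label/theme-label similarity.
    assoc[f]: association value set from the recognized target objects.
    """
    return [f for f in frames if weight[f] * assoc[f] > product_threshold]
```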
  • the application can be described in the general context of computer-executable instructions executed by a computer, such as a program module.
  • program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the present application can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are connected through a communication network.
  • program modules can be located in both local and remote computer storage media including storage devices.
  • In addition to implementing the device purely as computer readable program code, it is entirely possible to implement the same functions in hardware by logically programming the method steps by means of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like.
  • Such a device may therefore be considered a hardware component, and the means for implementing the various functions included therein may also be considered as structures within the hardware component.
  • A device for implementing various functions can even be considered both as a software module implementing a method and as a structure within a hardware component.
  • the present application can first extract a scene switching frame whose similarity meets the specified condition from the video, and set a corresponding scene label for the scene switching frame.
  • the textual description of the video can then be combined to determine the subject tag of the video.
  • The topic tag can accurately represent the subject of the video.
  • Then, by determining the association between the scene labels and the topic tag, the target frames closely related to the topic can be retained from the scene switching frames. In this way, the video summary generated based on the target frames can accurately characterize the subject content of the video.
  • PLD: Programmable Logic Device
  • FPGA: Field Programmable Gate Array
  • HDL: Hardware Description Language
  • The present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments of the present application or in portions of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

Embodiments of the present application disclose a method and device for generating a video summary, where the video has text description information. The method comprises: extracting a plurality of scene switching frames from the video, and setting scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames meets a designated condition; extracting a theme label corresponding to the video from the text description information; and selecting target frames from the plurality of scene switching frames according to the correlation between the scene labels of the scene switching frames and the theme label, and generating a video summary of the video based on the target frames. According to the technical solution provided by the present application, efficiency can be improved while the theme of the video is precisely characterized.

Description

Method and device for generating video summary
This application claims priority to Chinese Patent Application No. 201710541793.1, filed on July 5, 2017 and entitled "Method and Device for Generating Video Summary", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of Internet technologies, and in particular, to a method and an apparatus for generating a video summary.
Background
Currently, in order to let users learn the content of a video in a short time, a video playing platform usually creates a corresponding video summary for an uploaded video. The video summary may be a video of short duration that contains some of the scenes of the original video. In this way, by watching the video summary, a user can quickly understand the approximate content of the original video.
At present, one way to create a video summary is manual editing: staff of the video playing platform watch the entire video and then clip out the more critical segments to form the video summary. Video summaries created in this way can characterize the information contained in the video fairly accurately, but as the number of videos grows rapidly, this approach consumes considerable manpower, and the speed at which video summaries are produced is quite slow.
In view of this, in order to save manpower and improve the efficiency of video summary production, video summaries are currently often produced by image recognition techniques. Specifically, the uploaded video may be sampled at fixed time intervals to extract multiple image frames from the video. The similarity between each pair of adjacent frames can then be calculated in turn, and pairs of frames with low similarity can be retained, ensuring that the retained image frames display the content of multiple scenes. The image frames finally retained in this way constitute the video summary of the video.
Although the prior-art method of creating a video summary by image recognition improves production efficiency, selecting image frames by fixed sampling and similarity comparison easily misses key scenes in the video, so that the generated video summary cannot accurately reflect the subject of the video.
Summary
An object of the embodiments of the present application is to provide a method and an apparatus for generating a video summary, which can accurately represent the theme of a video while improving efficiency.
To achieve the above objective, an embodiment of the present application provides a method for generating a video summary, where the video has text description information. The method includes: extracting a plurality of scene switching frames from the video, and setting scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition; extracting a topic tag corresponding to the video from the text description information; and filtering target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and generating a video summary of the video based on the target frames.
To achieve the above objective, an embodiment of the present application further provides an apparatus for generating a video summary, where the video has text description information. The apparatus includes: a scene switching frame extraction unit, configured to extract a plurality of scene switching frames from the video and set scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition; a topic tag extraction unit, configured to extract a topic tag corresponding to the video from the text description information; and a video summary generation unit, configured to filter target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and generate a video summary of the video based on the target frames.
As can be seen from the above, the present application can first extract, from the video, scene switching frames whose similarity satisfies a specified condition, and set corresponding scene labels for the scene switching frames. The topic tag of the video can then be determined from the text description information of the video; the topic tag can accurately represent the theme of the video. Then, by determining the association between the scene labels and the topic tag, the target frames closely related to the theme can be retained from the scene switching frames. In this way, the video summary generated based on the target frames can accurately characterize the subject content of the video.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments described in the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for generating a video summary in an embodiment of the present application;
FIG. 2 is a schematic diagram of target frames and scene switching frames in an embodiment of the present application;
FIG. 3 is a schematic diagram of the extraction of scene switching frames in an embodiment of the present application;
FIG. 4 is a schematic diagram of the extraction of scene labels in an embodiment of the present application;
FIG. 5 is a functional block diagram of an apparatus for generating a video summary in an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
The present application provides a method for generating a video summary, which can be applied to an electronic device with data processing functions. The electronic device may be, for example, a desktop computer, a tablet computer, a notebook computer, a smartphone, a digital assistant, a smart wearable device, a shopping guide terminal, or a television with network access. The method may also be applied to software running on the above electronic devices, such as software with video production or video playback functions. In addition, the method may be applied to a server of a video playing website, such as iQiyi, Sohu Video, or AcFun. The number of servers is not specifically limited in this embodiment: there may be a single server, several servers, or a server cluster formed by several servers.
In this embodiment, the video summary may be generated based on a video. The video may be stored locally by the user, or uploaded by the user to a video playing website. The video usually has text description information, which may be the title of the video or an introduction to it. The title and the introduction may be edited in advance by the video creator or uploader, or added by staff who review the video; the present application is not limited in this respect. Of course, in practical applications, in addition to the title and introduction of the video, the text description information may also include text labels of the video or descriptive phrases extracted from the video's bullet-screen comments.
Referring to FIG. 1 and FIG. 2, the method for generating a video summary provided by the present application may include the following steps.
S1: Extract a plurality of scene switching frames from the video, and set scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition.
In this embodiment, the video may be stored locally or on another device. Accordingly, the video may be obtained by loading it locally according to a specified path, or by downloading it according to a Uniform Resource Locator (URL) provided by the other device.
In this embodiment, after the video is obtained, each frame of the video may be analyzed to extract a plurality of scene switching frames. In order to obtain the scene switching frame corresponding to each scene of the video, the extraction may be performed in this embodiment by frame-by-frame comparison. Specifically, a reference frame may first be determined in the video, and the similarity between each frame after the reference frame and the reference frame may be calculated in turn.
In this embodiment, the reference frame may be a frame randomly selected within a certain range. For example, the reference frame may be a frame randomly selected within the first 2 minutes of the video. Of course, in order not to miss any scene in the video, the first frame of the video may be used as the reference frame.
In this embodiment, after the reference frame is determined, each frame after the reference frame may be compared with the reference frame in turn, starting from the reference frame, to calculate the similarity between each subsequent frame and the reference frame. Specifically, when calculating the similarity between a frame and the reference frame, a first feature vector of the reference frame and a second feature vector of the current frame may be extracted respectively.
In this embodiment, the first feature vector and the second feature vector may take various forms. A feature vector of a frame may be constructed based on the pixel values of the pixels in that frame. Each frame is usually composed of a number of pixels arranged in a certain order, each pixel corresponding to its own pixel value, which together form a colorful picture. A pixel value may be a number within a specified interval, for example, any value from 0 to 255, where the magnitude of the value indicates the depth of the color. In this embodiment, the pixel value of each pixel in a frame may be obtained, and the obtained pixel values may form the feature vector of the frame. For example, for a current frame with 9*9 = 81 pixels, the pixel values may be read in order from left to right and top to bottom and arranged into an 81-dimensional vector, which serves as the feature vector of the current frame.
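The row-major flattening described here (e.g. a 9*9 frame becoming an 81-dimensional vector) can be sketched as follows; the function name is an illustrative assumption.

```python
def pixel_feature_vector(frame):
    """Flatten a frame's pixel values row by row (left-to-right,
    top-to-bottom) into a 1-D feature vector, so a 9x9 frame yields
    an 81-dimensional vector. frame: 2-D list of pixel values (0..255)."""
    return [value for row in frame for value in row]
```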
In this embodiment, the feature vector may also be a CNN (Convolutional Neural Network) feature of each frame. Specifically, the reference frame and each frame after it may be input into a convolutional neural network, and the convolutional neural network can then output the feature vectors corresponding to the reference frame and the other frames.
In this embodiment, in order to accurately characterize the content displayed in the reference frame and the current frame, the first feature vector and the second feature vector may also represent scale-invariant features of the reference frame and the current frame, respectively. In this way, even if the rotation angle, brightness, or shooting angle of the image changes, the extracted first and second feature vectors can still represent the content of the reference frame and the current frame well. Specifically, the first feature vector and the second feature vector may be SIFT (Scale-Invariant Feature Transform) features, SURF (Speeded-Up Robust Features) features, or color histogram features.
In this embodiment, after the first feature vector and the second feature vector are determined, the similarity between them may be calculated. Specifically, the similarity may be expressed in vector space as the distance between the two vectors: the closer the distance, the more similar the two vectors, and hence the higher the similarity; the farther the distance, the greater the difference between the two vectors, and hence the lower the similarity. Therefore, when calculating the similarity between the reference frame and the current frame, the spatial distance between the first feature vector and the second feature vector may be calculated, and the reciprocal of the spatial distance may be used as the similarity between the reference frame and the current frame. In this way, the smaller the spatial distance, the greater the corresponding similarity, indicating that the reference frame and the current frame are more similar; conversely, the larger the spatial distance, the smaller the corresponding similarity, indicating that the reference frame and the current frame are less similar.
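The reciprocal-of-distance similarity described above can be sketched as follows, assuming Euclidean distance between the two feature vectors; treating identical vectors (distance 0) as infinitely similar is an assumption made here for an edge case the text does not address.

```python
import math

def frame_similarity(vec_a, vec_b):
    """Similarity between two frames as the reciprocal of the Euclidean
    distance between their feature vectors."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    return float("inf") if d == 0 else 1.0 / d
```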
In this embodiment, the similarity between each frame after the reference frame and the reference frame can be calculated in turn in the above manner. The content shown in two frames with high similarity is usually also quite similar, while the purpose of a video summary is to show the user the content of different scenes in the video. Therefore, in this embodiment, when the similarity between the reference frame and the current frame is less than or equal to a specified threshold, the current frame may be determined as a scene switching frame. The specified threshold may be a preset value that can be adjusted flexibly according to the actual situation. For example, when too many scene switching frames are selected according to the specified threshold, the threshold may be reduced appropriately; when too few are selected, the threshold may be increased appropriately. In this embodiment, a similarity less than or equal to the specified threshold may indicate that the content of the two frames is already clearly different, so the scene shown in the current frame can be considered to have changed from the scene shown in the reference frame. At this point, the current frame may be retained as a frame at which the scene switches.
In this embodiment, when the current frame is determined as a scene switching frame, subsequent scene switching frames may be determined in the same way. Specifically, from the reference frame to the current frame, the scene can be considered to have changed once, so the current scene is the content shown in the current frame. Based on this, the current frame may be used as a new reference frame, and the similarity between each frame after the new reference frame and the new reference frame may be calculated in turn, to determine the next scene switching frame according to the calculated similarity. Likewise, when determining the next scene switching frame, the similarity between two frames can still be determined by extracting feature vectors and calculating the spatial distance, and the determined similarity can still be compared with the specified threshold, thereby determining the next scene switching frame at which the scene changes again after the new reference frame.
Referring to FIG. 3, in this embodiment, after the next scene switching frame is determined, that scene switching frame may be used as a new reference frame, and the extraction of subsequent scene switching frames continues. In this way, by changing the reference frame in turn, each frame at which the scene changes can be extracted from the video, so that no scene shown in the video is missed and the completeness of the video summary is ensured. In FIG. 3, the rectangular bars filled with diagonal lines represent scene switching frames, and the similarity between any two adjacent scene switching frames is less than or equal to the specified threshold.
In this embodiment, among the scene switching frames extracted in the above manner, the similarity between any two adjacent scene switching frames is less than or equal to the specified threshold. Therefore, the statement that the similarity between two adjacent scene switching frames satisfies a specified condition may mean that the similarity between two adjacent scene switching frames is less than or equal to the specified threshold.
In this embodiment, after the plurality of scene switching frames are extracted, scene labels may be set for the scene switching frames. A scene label may be a text label used to characterize the content shown in a scene switching frame. For example, if a scene switching frame shows two people fighting, the scene label corresponding to that frame may be "martial arts", "fighting", or "kung fu".
In this embodiment, the content in a scene switching frame may be recognized to determine the scene label corresponding to the scene switching frame. Specifically, features of the scene switching frame may be extracted, where the features may include at least one of a color feature, a texture feature, and a shape feature. The color feature may be a feature extracted based on different color spaces, for example the RGB (Red, Green, Blue) space, the HSV (Hue, Saturation, Value) space, or the HSI (Hue, Saturation, Intensity) space. Each color space has multiple color components; for example, the RGB space has an R component, a G component, and a B component. Different pictures have different color components, so the color components can be used to characterize the features of a scene switching frame.
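As an illustrative example of a color feature of the kind described (not taken from the disclosure), a simple per-channel histogram over RGB pixels might be computed as follows; the function name and the default bin count are assumptions.

```python
def color_histogram(pixels, bins=4):
    """Build a per-channel histogram over RGB pixels as a color feature.
    pixels: list of (r, g, b) tuples with values 0..255."""
    hist = [[0] * bins for _ in range(3)]  # one histogram per channel
    step = 256 // bins
    for pixel in pixels:
        for ch in range(3):
            hist[ch][min(pixel[ch] // step, bins - 1)] += 1
    # Concatenate the three channel histograms into one feature vector.
    return [count for channel in hist for count in channel]
```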
In addition, the texture feature may be used to describe the material appearing in the scene switching frame. Texture is usually reflected by the distribution of gray levels, and corresponds to the low-frequency and high-frequency components of the image spectrum. The low-frequency and high-frequency components of the image contained in a scene switching frame can therefore serve as features of that frame.
In this embodiment, the shape features may include edge-based shape features and region-based shape features. Specifically, a Fourier descriptor of the boundary may serve as the edge-based shape feature, and an invariant moment descriptor may serve as the region-based shape feature.
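Of the three feature families above, the color feature is the simplest to make concrete. The following is a minimal sketch of a quantized color histogram over raw RGB pixels; the bin count and the per-channel normalization are illustrative choices, not specified by the application.

```python
def color_histogram(pixels, bins=4):
    """Quantize each RGB channel into `bins` buckets and count pixels.
    Returns a flat histogram of length 3 * bins, normalized so each
    channel's buckets sum to 1 -- a minimal stand-in for the color
    feature described above."""
    hist = [0] * (3 * bins)
    for r, g, b in pixels:
        for channel, value in enumerate((r, g, b)):
            bucket = min(value * bins // 256, bins - 1)
            hist[channel * bins + bucket] += 1
    n = len(pixels)
    return [count / n for count in hist]

# Two reddish pixels and two bluish pixels.
pixels = [(255, 0, 0), (250, 5, 3), (0, 0, 255), (2, 1, 250)]
print(color_histogram(pixels))
```

The resulting vector is what gets compared against the feature sample library in the next step; texture and shape features would be concatenated onto it in the same vector form.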
Referring to FIG. 4, in this embodiment, after the features of each scene switching frame are extracted, the extracted features may be compared with the feature samples in a feature sample library. The feature sample library may be a sample set summarized from historical image recognition data, containing feature samples that characterize different contents; each feature sample may likewise be at least one of the color, texture, and shape features described above. For example, the library may contain feature samples characterizing playing football, characterizing dancing, characterizing fighting, and so on. Each feature sample in the library may be associated with a text label describing the displayed content that the sample corresponds to. For example, the text label associated with a feature sample characterizing playing football may be "playing football", and the text label of a feature sample characterizing dancing may be "square dance".
In this embodiment, both the extracted features and the feature samples in the library may be represented as vectors. Comparing the extracted features with the feature samples may then mean computing the distance between the extracted feature and each feature sample: the smaller the distance, the more similar the two are. In this way, the target feature sample in the library that is most similar to the extracted feature, i.e. the one at the smallest computed distance, can be determined. Since the extracted feature is most similar to the target feature sample, the contents they represent are also the most similar; the text label associated with the target feature sample can therefore be taken as the scene label of the scene switching frame, so that each scene switching frame is given a corresponding scene label.
As shown in FIG. 4, the distances between the feature extracted from a scene switching frame and the feature samples in the library may be 0.8, 0.5, 0.95, and 0.6, respectively; the text label of the feature sample at distance 0.5 can then be taken as the scene label of that scene switching frame.
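The nearest-sample selection in FIG. 4 reduces to an argmin over distances. A minimal sketch follows; the specific label names are hypothetical stand-ins for the library's contents.

```python
def nearest_scene_label(distances_to_samples):
    """Given {text_label: distance} for every sample in the feature
    sample library, return the label of the closest (most similar)
    feature sample."""
    return min(distances_to_samples, key=distances_to_samples.get)

# Distances from the FIG. 4 example: the 0.5 sample wins.
distances = {"playing football": 0.8, "square dance": 0.5,
             "martial arts": 0.95, "seaside": 0.6}
print(nearest_scene_label(distances))  # -> square dance
```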
S3: Extract a topic tag corresponding to the video from the text description information.
In this embodiment, the text description information can indicate the subject of the video fairly precisely, so the topic tag corresponding to the video may be extracted from it. Specifically, a video playing website may summarize the text description information of a large number of videos, filter out the text labels that could serve as video topics, and build the filtered labels into a text label library whose content can be continuously updated. When extracting a topic tag from the text description information, the description may be matched against the text labels in the library, and a matched label is taken as the topic tag of the video. For example, if the description reads "A foreign guy and a Chinese aunt dance a square dance, stunning everyone!", matching it against the library may yield "square dance", which can then serve as the topic tag of the video.
It should be noted that, because the text description information of a video is usually fairly long, matching it against the text label library may yield at least two results. For example, the description "A foreign guy and a Chinese aunt dance a square dance, stunning everyone!" may match "foreign guy", "Chinese aunt", and "square dance". On one hand, all three matches can be used together as topic tags of the video. On the other hand, when the number of topic tags is limited, suitable tags may be selected from the matches. Specifically, in this embodiment each text label in the library may be associated with a statistic count representing the total number of times the label has served as a topic tag. The larger the count, the more often the corresponding label has been used as a video topic tag, and the more credible it is as one. Therefore, when at least two text labels are matched, they may be sorted in descending order of their statistic counts, and the top specified number of labels in the sorted result are taken as the topic tags of the video, where the specified number may be a predefined limit on the number of topic tags. For example, if the video is limited to at most two topic tags, the three matches "foreign guy", "Chinese aunt", and "square dance" may be sorted by their counts, and the top two, say "Chinese aunt" and "square dance", are taken as the topic tags of the video.
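The match-then-rank step above can be sketched as follows. Substring matching and the particular counts are illustrative assumptions; the application does not prescribe a matching scheme.

```python
def extract_topic_tags(description, tag_counts, max_tags):
    """Match the description against the tag library (simple substring
    matching here), then keep the `max_tags` tags with the highest
    historical usage counts."""
    matched = [tag for tag in tag_counts if tag in description]
    matched.sort(key=lambda tag: tag_counts[tag], reverse=True)
    return matched[:max_tags]

tag_counts = {"square dance": 900, "Chinese aunt": 700, "foreign guy": 300}
description = ("A foreign guy and a Chinese aunt dance a square dance, "
               "stunning everyone!")
print(extract_topic_tags(description, tag_counts, 2))
# -> ['square dance', 'Chinese aunt']
```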
S5: Select target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and generate a video summary of the video based on the target frames.
In this embodiment, a video may contain many scenes, but not every scene switching frame is closely related to the subject of the video. So that the generated video summary accurately reflects that subject, target frames may be selected from the plurality of scene switching frames according to the association between each frame's scene label and the topic tag.
In this embodiment, the association between a scene label and the topic tag may refer to their degree of similarity: the more similar they are, the more relevant the content of the scene switching frame is to the subject of the video. Specifically, determining the association may include computing the similarity between the scene label of each scene switching frame and the topic tag. In practice, both the scene label and the topic tag are words, and their similarity can be computed by representing each as a word vector and measuring the spatial distance between the two vectors: the closer the vectors, the higher the similarity between the scene label and the topic tag; conversely, the farther apart the vectors, the lower the similarity. In a practical application scenario, the reciprocal of the spatial distance between the two word vectors may therefore be used as the similarity between the scene label and the topic tag.
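The reciprocal-distance similarity can be written down directly. The 2-D "word vectors" below are toy stand-ins; real word vectors would come from a trained embedding model.

```python
import math

def tag_similarity(vec_a, vec_b):
    """Similarity between a scene label and a topic tag, taken as the
    reciprocal of the Euclidean distance between their word vectors."""
    distance = math.dist(vec_a, vec_b)
    return float("inf") if distance == 0 else 1.0 / distance

dance = (0.9, 0.1)         # hypothetical word vectors
square_dance = (0.8, 0.2)
kung_fu = (0.1, 0.9)

# "dance" sits much closer to "square dance" than "kung fu" does.
print(tag_similarity(dance, square_dance) > tag_similarity(kung_fu, square_dance))
# -> True
```

The zero-distance guard handles identical vectors, where the reciprocal would otherwise divide by zero.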
In this embodiment, after the similarity between a scene label and the topic tag is computed, a scene switching frame whose computed similarity exceeds a specified similarity threshold may be determined as a target frame. The specified similarity threshold serves as the bar for whether a scene switching frame is sufficiently related to the subject: when the similarity exceeds the threshold, the frame is sufficiently related to the subject of the video and its content accurately reflects that subject, so the frame can be determined as a target frame.
In this embodiment, the target frames selected from the scene switching frames are all closely related to the subject of the video, so a video summary of the video can be generated from them. Specifically, the summary may be generated by arranging the target frames in the order in which they appear in the video. Alternatively, since the frames of a summary need not follow the normal logic of consecutive content, the target frames may be arranged randomly, and the resulting sequence used as the video summary.
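The chronological variant of summary assembly is a one-line sort. Frames are assumed to carry their original position in the video; the (index, data) pairing is an illustrative representation.

```python
def build_summary(target_frames):
    """Arrange target frames by their position in the source video;
    each frame is an (index_in_video, frame_data) pair."""
    return [frame for _, frame in sorted(target_frames)]

targets = [(120, "B"), (30, "A"), (410, "C")]
print(build_summary(targets))  # -> ['A', 'B', 'C']
```

For the random variant mentioned above, `random.shuffle` over the same list would replace the sort.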
In one embodiment of the present application, because the scene label of a scene switching frame is usually set for the frame's overall content, the scene label cannot accurately reflect local details within the frame. To further improve the relevance of the target frames to the subject of the video, the target objects contained in the scene switching frames may be recognized, and target frames selected on the basis of the recognized objects. Specifically, after the similarity between each scene label and the topic tag is computed, a weight coefficient may be set for the corresponding scene switching frame according to that similarity: the higher the similarity, the larger the weight coefficient, which may be a value between 0 and 1. For example, if the topic tag of the current video is "square dance", then for two scene switching frames labelled "dance" and "kung fu", the frame labelled "dance" may be given a weight coefficient of 0.8, and the frame labelled "kung fu" a weight coefficient of 0.4.
In this embodiment, after the weight coefficients are set for the scene switching frames, the target objects contained in each frame may be recognized, for example using the AdaBoost algorithm, the R-CNN (Region-based Convolutional Neural Network) algorithm, or the SSD (Single Shot Detector) algorithm. For example, for a scene switching frame labelled "dance", the R-CNN algorithm may recognize two kinds of target objects in the frame: "woman" and "loudspeaker". After the target objects in each scene switching frame are recognized, a relevance value may be set for the frame according to the association between the recognized target objects and the topic tag. Specifically, the topic tag may be associated with at least one object, namely objects closely tied to the tag, which may be obtained by analyzing historical data. For example, when the topic tag is "beach", its associated objects may include "sea water", "beach", "seagull", "swimsuit", "parasol", and so on. The target objects recognized in the scene switching frame can then be compared with these associated objects, and the number of recognized objects that appear among them counted. For instance, for the topic tag "beach", suppose the recognized target objects are "parasol", "car", "beach", "trees", and "sea water"; comparing them with the associated objects shows that "parasol", "beach", and "sea water" appear among them, i.e. the count is 3. In this embodiment, the product of this count and a specified value may be taken as the relevance value of the scene switching frame. The specified value may be preset; for example, if it is 10, the relevance value in the example above is 30. Thus, the more recognized target objects that appear among the associated objects, the more closely the local details of the frame are tied to the subject of the video, and the higher the relevance value.
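The relevance value computation is a set intersection scaled by the specified value. The object lists below reuse the "beach" example; the unit value of 10 follows the text.

```python
def relevance_value(detected_objects, tag_objects, unit=10):
    """Count how many detected objects appear among the objects
    associated with the topic tag, multiplied by a specified unit
    value, as described for the "beach" example."""
    return len(set(detected_objects) & set(tag_objects)) * unit

beach_objects = {"sea water", "beach", "seagull", "swimsuit", "parasol"}
detected = ["parasol", "car", "beach", "trees", "sea water"]
print(relevance_value(detected, beach_objects))  # -> 30
```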
In this embodiment, a target frame may be determined from both the overall and the local features of a scene switching frame. Specifically, the product of each frame's weight coefficient and relevance value may be computed, and frames whose product exceeds a specified product threshold are determined as target frames. Using the product as the criterion combines the overall features of the scene switching frame with its local features. The specified product threshold is the bar for whether a scene switching frame qualifies as a target frame, and can be adjusted flexibly in practical application scenarios.
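The product filter is a single pass over the scored frames. The frame names, scores, and threshold below are illustrative values consistent with the earlier examples (weight 0.8 and relevance 30 for the "dance" frame).

```python
def select_target_frames(frames, product_threshold):
    """Keep frames whose weight-coefficient x relevance-value product
    exceeds the threshold; each frame is (name, weight, relevance)."""
    return [name for name, weight, relevance in frames
            if weight * relevance > product_threshold]

frames = [("dance scene", 0.8, 30), ("kung fu scene", 0.4, 10)]
print(select_target_frames(frames, 20))  # -> ['dance scene']
```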
In one embodiment of the present application, some scenarios may limit in advance the total number of frames (or the total duration) of the video summary. In that case, determining the target frames must also take the predefined frame total into account. Specifically, when the total number of scene switching frames is greater than or equal to the specified frame total, enough frames can be extracted from the scene switching frames to form the summary. In this case, the scene switching frames may be sorted in descending order of the weight-coefficient and relevance-value products computed in the embodiment above, and the top specified-frame-total frames in the sorted result are determined as the target frames. For example, suppose the summary is limited to a total of 1440 frames while 2000 scene switching frames have currently been extracted from the video. The product of the weight coefficient and relevance value of each scene switching frame can be computed in turn, the frames sorted by product in descending order, and the top 1440 taken as the target frames, so that 1440 target frames form a video summary meeting the requirement.
In this embodiment, when the total number of scene switching frames is smaller than the specified frame total, the currently extracted scene switching frames alone cannot form a summary meeting the requirement. In that case, a number of picture frames from the original video need to be inserted between the extracted scene switching frames, so as to reach the frame total required of the summary. Specifically, the insertion may be performed between two scene switching frames where the scene jump is large, which helps keep the content coherent. In this embodiment, at least one video frame from the video may be inserted between two adjacent scene switching frames whose similarity is below a judgement threshold; such a pair can be regarded as two scene switching frames with weak content relevance. Picture frames from the original video may be inserted between weakly related scene switching frames one frame at a time, until the total number of frames after insertion equals the specified frame total. The original scene switching frames together with the inserted picture frames then serve as the target frames forming the video summary of the video.
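One possible reading of the padding procedure is sketched below, with frames represented as indices into the source video. The choice of which frame in the gap to insert (here the midpoint) and the toy similarity function are assumptions; the application only requires inserting frames between weakly related neighbors until the total is reached.

```python
def pad_with_source_frames(switch_frames, source_frames, similarity,
                           judge_threshold, required_total):
    """Insert original video frames between adjacent scene switching
    frames whose similarity is below the judgement threshold, one at a
    time, until the required total is reached or no eligible gap
    remains.  Frames are indices into `source_frames`."""
    frames = list(switch_frames)
    while len(frames) < required_total:
        inserted = False
        for i in range(len(frames) - 1):
            a, b = frames[i], frames[i + 1]
            gap = b - a
            if gap > 1 and similarity(source_frames[a], source_frames[b]) < judge_threshold:
                frames.insert(i + 1, a + gap // 2)  # take a frame from the gap
                inserted = True
                break
        if not inserted:  # no eligible weakly-related gap left
            break
    return frames

source = list(range(10))                       # toy "video"
sim = lambda x, y: 1 - abs(x - y) / 10         # toy similarity
print(pad_with_source_frames([0, 9], source, sim, 0.5, 3))  # -> [0, 4, 9]
```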
In one embodiment of the present application, at least two topic tags may be extracted from the text description information of the video. In that case, the similarity between the scene label of each scene switching frame and each of the topic tags may be computed. For example, if the current topic tags are tag 1 and tag 2, the similarity of the current scene switching frame to tag 1 and to tag 2 can be computed separately, yielding a first similarity and a second similarity for that frame. The similarities computed for a scene switching frame may then be accumulated to obtain the frame's cumulative similarity; for example, the sum of the first and second similarities above is the cumulative similarity of the current frame. After the cumulative similarity of each scene switching frame is computed, it may likewise be compared with the specified similarity threshold, and frames whose cumulative similarity exceeds the threshold are determined as target frames.
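The accumulation over multiple topic tags is a simple sum. The 1-D "vectors" and the similarity function here are toy assumptions standing in for the word-vector similarity described earlier.

```python
def cumulative_similarity(scene_vec, topic_vecs, similarity):
    """Sum the similarity of one scene label against every topic tag,
    giving the frame's cumulative similarity."""
    return sum(similarity(scene_vec, t) for t in topic_vecs)

sim = lambda a, b: 1.0 - abs(a - b)   # toy 1-D similarity
topics = [0.2, 0.3]                   # two topic-tag vectors
print(cumulative_similarity(0.25, topics, sim))   # approximately 1.9
```

Frames whose cumulative similarity exceeds the specified threshold would then be kept as target frames, exactly as in the single-tag case.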
Referring to FIG. 5, the present application further provides an apparatus for generating a video summary, the video having text description information, the apparatus comprising:

a scene switching frame extraction unit 100, configured to extract a plurality of scene switching frames from the video and set scene labels for the scene switching frames, where the similarity between two adjacent scene switching frames satisfies a specified condition;

a topic tag extraction unit 200, configured to extract a topic tag corresponding to the video from the text description information; and

a video summary generation unit 300, configured to select target frames from the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and generate a video summary of the video based on the target frames.
In this embodiment, the scene switching frame extraction unit 100 comprises:

a similarity calculation module, configured to determine a reference frame in the video and successively compute the similarity between each frame after the reference frame and the reference frame;

a scene switching frame determination module, configured to determine the current frame as a scene switching frame when the similarity between the reference frame and the current frame is less than or equal to a specified threshold; and

a loop execution module, configured to take the current frame as a new reference frame and successively compute the similarity between each frame after the new reference frame and the new reference frame, so as to determine the next scene switching frame from the computed similarities.
In this embodiment, the scene switching frame extraction unit 100 further comprises:

a feature extraction module, configured to extract features of the scene switching frame, the features comprising at least one of a color feature, a texture feature, and a shape feature;

a comparison module, configured to compare the extracted features with the feature samples in a feature sample library, where each feature sample in the library is associated with a text label; and

a target feature sample determination module, configured to determine the target feature sample in the library that is most similar to the extracted features, and take the text label associated with the target feature sample as the scene label of the scene switching frame.
In this embodiment, the video summary generation unit 300 comprises:

a similarity calculation module, configured to compute the similarity between the scene label of the scene switching frame and the topic tag;

a weight coefficient setting module, configured to set a weight coefficient for the corresponding scene switching frame according to the computed similarity;

a relevance value setting module, configured to recognize the target objects contained in the scene switching frame and set a relevance value for the frame according to the association between the recognized target objects and the topic tag; and

a target frame determination module, configured to compute the product of the weight coefficient and the relevance value of the scene switching frame, and determine frames whose product exceeds a specified product threshold as the target frames.
The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
Those skilled in the art also know that, besides implementing an apparatus purely in computer-readable program code, it is entirely possible to logically program the method steps so that the apparatus achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such an apparatus may therefore be regarded as a hardware component, and the means within it for implementing various functions may also be regarded as structures within the hardware component; or the means for implementing various functions may even be regarded as both software modules implementing the method and structures within the hardware component.
As can be seen from the above, the present application first extracts from the video scene switching frames whose similarity satisfies a specified condition, and sets a corresponding scene label for each. The topic tag of the video is then determined from its text description information; this topic tag accurately characterizes the subject of the video. Next, by determining the association between the scene labels and the topic tag, the target frames most closely related to the subject can be retained from the scene switching frames. A video summary generated from these target frames can thus accurately characterize the subject matter of the video.
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). With the development of technology, however, many of today's improvements to method flows can already be regarded as direct improvements to hardware circuit structures: designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be implemented with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it himself, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of making integrated circuit chips by hand, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the original code to be compiled must be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
It will be apparent to a person skilled in the art from the description of the embodiments above that the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application or in certain parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus embodiments can be understood by reference to the description of the corresponding method embodiments.
Although the present application has been described through the embodiments, a person of ordinary skill in the art will recognize that many variations and modifications can be made without departing from the spirit of the present application, and it is intended that the appended claims cover such variations and modifications without departing from the spirit of the present application.

Claims (17)

  1. A method for generating a video summary, wherein the video has text description information, the method comprising:
    extracting a plurality of scene switching frames from the video, and setting a scene label for each scene switching frame, wherein the similarity between two adjacent scene switching frames satisfies a specified condition;
    extracting a topic tag corresponding to the video from the text description information;
    filtering target frames out of the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and generating a video summary of the video based on the target frames.
  2. The method according to claim 1, wherein extracting a plurality of scene switching frames from the video comprises:
    determining a reference frame in the video, and sequentially calculating the similarity between each frame after the reference frame and the reference frame;
    when the similarity between the reference frame and the current frame is less than or equal to a specified threshold, determining the current frame as a scene switching frame;
    taking the current frame as a new reference frame, and sequentially calculating the similarity between each frame after the new reference frame and the new reference frame, so as to determine the next scene switching frame according to the calculated similarity.
  3. The method according to claim 2, wherein the similarity between two adjacent scene switching frames satisfying the specified condition comprises:
    the similarity between the two adjacent scene switching frames being less than or equal to the specified threshold.
  4. The method according to claim 2, wherein calculating the similarity between a frame after the reference frame and the reference frame comprises:
    extracting a first feature vector of the reference frame and a second feature vector of the current frame, wherein the first feature vector and the second feature vector represent scale-invariant features of the reference frame and the current frame, respectively;
    calculating the spatial distance between the first feature vector and the second feature vector, and taking the reciprocal of the spatial distance as the similarity between the reference frame and the current frame.
  5. The method according to claim 1, wherein setting a scene label for the scene switching frame comprises:
    extracting features of the scene switching frame, the features comprising at least one of a color feature, a texture feature, and a shape feature;
    comparing the extracted features with feature samples in a feature sample library, wherein the feature samples in the feature sample library are associated with text labels;
    determining the target feature sample in the feature sample library that is most similar to the extracted features, and taking the text label associated with the target feature sample as the scene label corresponding to the scene switching frame.
  6. The method according to claim 1, wherein the text description information comprises a title and/or synopsis of the video; and accordingly, extracting the topic tag corresponding to the video from the text description information comprises:
    matching the text description information against the text labels in a text label library, and taking the matched text labels as the topic tags of the video.
  7. The method according to claim 6, wherein each text label in the text label library is associated with a statistical count representing the total number of times the text label has served as a topic tag;
    accordingly, when at least two text labels are matched, the method further comprises:
    sorting the matched text labels in descending order of their statistical counts, and taking a specified number of top-ranked text labels in the sorting result as the topic tags of the video.
  8. The method according to claim 1, wherein filtering target frames out of the plurality of scene switching frames comprises:
    calculating the similarity between the scene label of each scene switching frame and the topic tag, and determining the scene switching frames whose calculated similarity is greater than a specified similarity threshold as the target frames.
  9. The method according to claim 8, wherein after calculating the similarity between the scene label of the scene switching frame and the topic tag, the method further comprises:
    setting a weight coefficient for the corresponding scene switching frame according to the calculated similarity;
    identifying target objects contained in the scene switching frame, and setting an association value for the scene switching frame according to the association between the identified target objects and the topic tag;
    calculating the product of the weight coefficient and the association value of the scene switching frame, and determining the scene switching frames whose product is greater than a specified product threshold as the target frames.
  10. The method according to claim 9, wherein the topic tag is associated with at least one object; accordingly, setting an association value for the scene switching frame comprises:
    comparing the target objects identified in the scene switching frame with the at least one object, and counting the number of target objects that appear among the at least one object;
    taking the product of the counted number and a specified value as the association value of the scene switching frame.
  11. The method according to claim 9, wherein the video summary of the video has a specified total number of frames; accordingly, after calculating the product of the weight coefficient and the association value of the scene switching frame, the method further comprises:
    when the total number of scene switching frames is greater than or equal to the specified total number of frames, sorting the scene switching frames in descending order of the products, and determining the top-ranked scene switching frames, up to the specified total number of frames, as the target frames.
  12. The method according to claim 11, further comprising:
    when the total number of scene switching frames is less than the specified total number of frames, inserting at least one video frame from the video between two adjacent scene switching frames whose similarity is less than a determination threshold, so that the total number of scene switching frames after inserting the at least one video frame equals the specified total number of frames.
  13. The method according to claim 1, wherein when there are at least two topic tags, filtering target frames out of the plurality of scene switching frames comprises:
    for each scene switching frame, calculating the similarity between the scene label of the scene switching frame and each topic tag, and accumulating the similarities calculated for the scene switching frame to obtain a cumulative similarity corresponding to the scene switching frame;
    determining the scene switching frames whose cumulative similarity is greater than a specified similarity threshold as the target frames.
  14. An apparatus for generating a video summary, wherein the video has text description information, the apparatus comprising:
    a scene switching frame extraction unit, configured to extract a plurality of scene switching frames from the video and set a scene label for each scene switching frame, wherein the similarity between two adjacent scene switching frames satisfies a specified condition;
    a topic tag extraction unit, configured to extract a topic tag corresponding to the video from the text description information;
    a video summary generation unit, configured to filter target frames out of the plurality of scene switching frames according to the association between the scene labels of the scene switching frames and the topic tag, and generate a video summary of the video based on the target frames.
  15. The apparatus according to claim 14, wherein the scene switching frame extraction unit comprises:
    a similarity calculation module, configured to determine a reference frame in the video and sequentially calculate the similarity between each frame after the reference frame and the reference frame;
    a scene switching frame determination module, configured to determine the current frame as a scene switching frame when the similarity between the reference frame and the current frame is less than or equal to a specified threshold;
    a loop execution module, configured to take the current frame as a new reference frame and sequentially calculate the similarity between each frame after the new reference frame and the new reference frame, so as to determine the next scene switching frame according to the calculated similarity.
  16. The apparatus according to claim 14, wherein the scene switching frame extraction unit comprises:
    a feature extraction module, configured to extract features of the scene switching frame, the features comprising at least one of a color feature, a texture feature, and a shape feature;
    a comparison module, configured to compare the extracted features with feature samples in a feature sample library, wherein the feature samples in the feature sample library are each associated with a text label;
    a target feature sample determination module, configured to determine the target feature sample in the feature sample library that is most similar to the extracted features, and take the text label associated with the target feature sample as the scene label corresponding to the scene switching frame.
  17. The apparatus according to claim 14, wherein the video summary generation unit comprises:
    a similarity calculation module, configured to calculate the similarity between the scene label of the scene switching frame and the topic tag;
    a weight coefficient setting module, configured to set a weight coefficient for the corresponding scene switching frame according to the calculated similarity;
    an association value setting module, configured to identify target objects contained in the scene switching frame and set an association value for the scene switching frame according to the association between the identified target objects and the topic tag;
    a target frame determination module, configured to calculate the product of the weight coefficient and the association value of the scene switching frame, and determine the scene switching frames whose product is greater than a specified product threshold as the target frames.
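Claims 2-4 describe the frame-similarity test concretely: the similarity between two frames is the reciprocal of the spatial distance between their feature vectors, and a frame whose similarity to the reference frame falls to a threshold or below is a scene switching frame. A minimal sketch under stated assumptions: the distance is taken as Euclidean, and the feature vectors are assumed to come from some scale-invariant descriptor (the claims do not name one), so plain number lists stand in for them here.

```python
import math


def frame_similarity(vec_a, vec_b):
    # Claim 4: similarity = reciprocal of the spatial (here Euclidean)
    # distance between the two frames' feature vectors. Identical vectors
    # would divide by zero, so treat that case as maximal similarity.
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    return math.inf if dist == 0 else 1.0 / dist


def is_scene_switch(reference_vec, current_vec, threshold):
    # Claims 2-3: the current frame starts a new scene when its similarity
    # to the reference frame is less than or equal to the specified threshold.
    return frame_similarity(reference_vec, current_vec) <= threshold
```

Because the similarity is a reciprocal distance, a *low* similarity (large distance) triggers a scene switch, which is why the claims compare against the threshold with "less than or equal to" rather than "greater than".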
PCT/CN2018/072191 2017-07-05 2018-01-11 Method and device for generating video summary WO2019007020A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710541793.1A CN109213895A (en) 2017-07-05 2017-07-05 A kind of generation method and device of video frequency abstract
CN201710541793.1 2017-07-05

Publications (1)

Publication Number Publication Date
WO2019007020A1 true WO2019007020A1 (en) 2019-01-10

Family

ID=64949707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072191 WO2019007020A1 (en) 2017-07-05 2018-01-11 Method and device for generating video summary

Country Status (3)

Country Link
CN (1) CN109213895A (en)
TW (1) TWI712316B (en)
WO (1) WO2019007020A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298270A (en) * 2019-06-14 2019-10-01 天津大学 A kind of more video summarization methods based on the perception of cross-module state importance

Families Citing this family (9)

Publication number Priority date Publication date Assignee Title
TWI762764B (en) * 2019-02-15 2022-05-01 國風傳媒有限公司 Apparatus, method, and computer program product thereof for integrating terms
CN110263650B (en) * 2019-05-22 2022-02-22 北京奇艺世纪科技有限公司 Behavior class detection method and device, electronic equipment and computer readable medium
CN110149531A (en) * 2019-06-17 2019-08-20 北京影谱科技股份有限公司 The method and apparatus of video scene in a kind of identification video data
CN112153462B (en) * 2019-06-26 2023-02-14 腾讯科技(深圳)有限公司 Video processing method, device, terminal and storage medium
CN110297943B (en) * 2019-07-05 2022-07-26 联想(北京)有限公司 Label adding method and device, electronic equipment and storage medium
CN111275097B (en) * 2020-01-17 2021-06-18 北京世纪好未来教育科技有限公司 Video processing method and system, picture processing method and system, equipment and medium
TWI741550B (en) * 2020-03-31 2021-10-01 國立雲林科技大學 Method for bookmark frame generation, and video player device with automatic generation of bookmark and user interface thereof
CN111641868A (en) * 2020-05-27 2020-09-08 维沃移动通信有限公司 Preview video generation method and device and electronic equipment
CN115086783B (en) * 2022-06-28 2023-10-27 北京奇艺世纪科技有限公司 Video generation method and device and electronic equipment

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101308501A (en) * 2008-06-30 2008-11-19 腾讯科技(深圳)有限公司 Method, system and device for generating video frequency abstract
CN103810711A (en) * 2014-03-03 2014-05-21 郑州日兴电子科技有限公司 Keyframe extracting method and system for monitoring system videos
CN106612468A (en) * 2015-10-21 2017-05-03 上海文广互动电视有限公司 A video abstract automatic generation system and method
CN106713964A (en) * 2016-12-05 2017-05-24 乐视控股(北京)有限公司 Method of generating video abstract viewpoint graph and apparatus thereof

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
JP2006510248A (en) * 2002-12-11 2006-03-23 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Method and system for obtaining text keywords or phrases for providing content-related links to network-based resources using video content
US8705933B2 (en) * 2009-09-25 2014-04-22 Sony Corporation Video bookmarking
US8665345B2 (en) * 2011-05-18 2014-03-04 Intellectual Ventures Fund 83 Llc Video summary including a feature of interest
CN103200463A (en) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 Method and device for generating video summary
CN103440640B (en) * 2013-07-26 2016-02-10 北京理工大学 A kind of video scene cluster and browsing method
CN103646094B (en) * 2013-12-18 2017-05-31 上海紫竹数字创意港有限公司 Realize that audiovisual class product content summary automatically extracts the system and method for generation
CN106921891B (en) * 2015-12-24 2020-02-11 北京奇虎科技有限公司 Method and device for displaying video characteristic information
CN105868292A (en) * 2016-03-23 2016-08-17 中山大学 Video visualization processing method and system


Also Published As

Publication number Publication date
TWI712316B (en) 2020-12-01
CN109213895A (en) 2019-01-15
TW201907736A (en) 2019-02-16

Similar Documents

Publication Publication Date Title
TWI712316B (en) Method and device for generating video summary
CN109151501B (en) Video key frame extraction method and device, terminal equipment and storage medium
US11113587B2 (en) System and method for appearance search
US10528821B2 (en) Video segmentation techniques
Mussel Cirne et al. VISCOM: A robust video summarization approach using color co-occurrence matrices
CN104994426B (en) Program video identification method and system
US20120027295A1 (en) Key frames extraction for video content analysis
Mahapatra et al. Coherency based spatio-temporal saliency detection for video object segmentation
Thomas et al. Perceptual video summarization—A new framework for video summarization
CN113766330A (en) Method and device for generating recommendation information based on video
CN102156686B (en) Method for detecting specific contained semantics of video based on grouped multi-instance learning model
CN110765314A (en) Video semantic structural extraction and labeling method
EP2345978B1 (en) Detection of flash illuminated scenes in video clips and related ranking of video clips
JP2009060413A (en) Method and system for extracting feature of moving image, and method and system for retrieving moving image
Premaratne et al. Structural approach for event resolution in cricket videos
WO2020192869A1 (en) Feature extraction and retrieval in videos
Cirne et al. Summarization of videos by image quality assessment
Khan et al. RICAPS: residual inception and cascaded capsule network for broadcast sports video classification
Karthick et al. Automatic genre classification from videos
Glasberg et al. Cartoon-recognition using visual-descriptors and a multilayer-perceptron
Lotfi A Novel Hybrid System Based on Fractal Coding for Soccer Retrieval from Video Database
Premaratne et al. A Novel Hybrid Adaptive Filter to Improve Video Keyframe Clustering to Support Event Resolution in Cricket Videos
Saha et al. Cricket Highlight Generation: Automatic Generation Framework Comprising Score Extraction and Action Recognition
Mpountouropoulos et al. Visual Information Analysis for Big-Data Using Multi-core Technologies
Lee et al. A comparative study of the objectionable video classification approaches using single and group frame features

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27/02/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18827892

Country of ref document: EP

Kind code of ref document: A1