US20190066732A1 - Video Skimming Methods and Systems - Google Patents

Video Skimming Methods and Systems

Info

Publication number
US20190066732A1
US20190066732A1
Authority
US
United States
Prior art keywords
video
saliency
shots
shot
concept
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/171,116
Inventor
Taoran Lu
Zheng Yuan
Yu Huang
Dapeng Oliver Wu
Hong Heather Wu
Current Assignee
Vid Scale Inc
Original Assignee
Vid Scale Inc
Priority date
Filing date
Publication date
Application filed by Vid Scale Inc filed Critical Vid Scale Inc
Priority to US16/171,116
Publication of US20190066732A1
Status: Abandoned

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06K 9/00751
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/47: Detecting features for summarising video content
    • G11: INFORMATION STORAGE
    • G11B: INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00: Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02: Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031: Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B 27/034: Electronic editing of digitised analogue information signals, e.g. audio or video signals, on discs
    • G11B 27/10: Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/19: Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B 27/28: Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information signals recorded by the same method as the main recording

Definitions

  • the present invention relates to image processing and, in particular embodiments, to video skimming from the perspective of hierarchical audio-visual reconstruction with saliency-masked Bag-of-Words features.
  • A video summary, also called a still abstract, is a set of salient images (key frames) selected or reconstructed from an original video sequence.
  • Video skimming, also called a moving abstract, is a collection of image sequences, along with the corresponding audio, from an original video sequence.
  • Video skimming is also called a preview of an original video and can be classified into two sub-types: highlight and summary sequence.
  • A highlight contains the most interesting and attractive parts of a video.
  • A summary sequence renders the impression of the content of an entire video.
  • A summary sequence conveys the highest semantic meaning of the content of an original video.
  • One prior art method is uniformly sampling frames to shrink the video size while losing the audio part, which is similar to the fast-forward function seen in many digital video players.
  • Time compression methods can compress audio and video at the same time to keep them synchronized, using frame dropping and audio sampling.
  • However, the compression ratio can be limited by speech distortion in some cases.
  • Frame-level skimming mainly relies on a user attention model to compute a saliency curve, but this method is weak in preserving the video structure, especially for a long video.
  • Shot clustering is a middle-level method in video abstraction, but its readability is mostly ignored. Semantic-level skimming is a method that tries to understand the video content, but it can be difficult to realize its goal due to the "semantic gap" puzzle.
  • a method of creating a skimming preview of a video includes electronically receiving a plurality of video shots and analyzing each frame in a video shot from the plurality of video shots, where analyzing includes determining a saliency of each frame of the video shot.
  • the method also includes determining a key frame of the video shot based on the saliency of each frame of the video shot, extracting visual features from the key frame, performing shot clustering of the plurality of video shots to determine concept patterns based on the visual features, and generating a reconstruction reference tree based on the shot clustering.
  • the reconstruction reference tree includes video shots categorized according to each concept pattern.
  • FIG. 1 provides frame shots illustrating embodiment visual saliency masking
  • FIG. 2 illustrates embodiment SIFT features in active regions on video frames
  • FIG. 3 provides a graph illustrating an embodiment BoW feature for example video frames
  • FIG. 4 provides a flow chart of an embodiment visual BoW feature extraction
  • FIG. 5 illustrates an embodiment saliency masking of audio words
  • FIG. 6 illustrates a flow chart of audio BoW feature extraction
  • FIG. 7 illustrates an example graph of an audio BoW feature
  • FIG. 8 illustrates an embodiment reconstruction reference tree and shot table
  • FIG. 9 illustrates an embodiment post processing saliency curve
  • FIG. 10 illustrates an embodiment system that implements embodiment algorithms
  • FIG. 11 illustrates a processing system that can be utilized to implement embodiment methods
  • FIG. 12 provides an appendix listing notations used herein.
  • Video skimming is a task that shortens a video into a temporally condensed version from which viewers may still understand the plot of the original video. This technique allows viewers to quickly browse a large multimedia library and thus facilitates tasks such as fast video browsing, video indexing, and retrieval.
  • the performance of video summarization mainly lies in the following three aspects: skeleton preservation, exciting and interesting summarization and smooth transition.
  • Video summarization enables viewers to quickly and efficiently grasp what a video describes or presents from a shorter summarized version.
  • the main skeleton from the original video is extracted and kept in the summarized video.
  • a video skeleton can be seen as a queue of concept patterns with certain semantic implications in a temporal order.
  • the term "concept pattern" here is not as high-level as a real semantic concept learned through human intervention. Rather, a concept pattern encodes semantic meanings of shots (sets of consecutive similar video frames), symbolizes a shot group that portrays consistent semantic settings, and generally possesses the capability of a hallmark or self-evident clue that hints at the development of the original video. Viewers may possibly recover the plot by only watching and hearing a handful of shots, as long as all concept patterns are conveyed.
  • Some embodiments of the present invention present to viewers an exciting and interesting summary of a video. Often in a video, there are various shots conveying the same concept pattern. When selecting one shot conveying a concept pattern from many, the one with a high saliency value, or equivalently the one generating the largest stimulus to human attention, is favored, so that the resultant summarized video not only contains integral concept patterns but also carefully selects the shot instances with the richest information to reflect these concept patterns. Hence, a plain or even dull summarization is avoided.
  • Some embodiments of the present invention provide a smooth transition between adjacent concept patterns by providing additional frame level summarization.
  • Embodiments of the present invention generate video summarization by providing unsupervised learning of original video concepts and hierarchical (both frame and shot levels) reconstruction.
  • the skeleton of the original video is analyzed by concept pattern mining, viewing it as a clustering problem. Bag-of-Words features (SIFT-based visual words and matching-pursuit-based audio words) are extracted for each shot from both the visual and audio sensory channels after filtering with saliency masking. The Bag-of-Words features are then clustered into several groups using spectral clustering techniques. Each cluster represents a certain concept pattern.
  • the original video is summarized from reconstruction point of view based on the learned concept pattern.
  • summarization is regarded as a "summation" process rather than a "subtraction" process. By keeping at least one shot for each concept pattern, the concept pattern integrity of the summarized video offers viewers the capability of context recovery.
  • a video that also contains maximum achievable saliency accumulation is generated.
  • the summarization process is conducted in an iterative fashion, allowing flexible control of summarized video information richness vs. skimming ratio.
  • a good understanding of video content can help achieve a good video summary.
  • the most common contents for a typical video sequence are visual and acoustic channels.
  • visual signals provide the majority of the information from which latent concept patterns are learned from the original video.
  • audio sensory channels can also provide important information regarding concept pattern in ways not offered by the visual channel, for example, in low light environments and nighttime shots.
  • a concept pattern can be derived that also shares both visual and audio consistency at the same time.
  • learned concept results can be jointly analyzed in a parity-check fashion to enhance co-reliability. Therefore, in some embodiments, an audio stream is extracted from raw video and processed in parallel with the video stream to detect possible audio concepts.
  • temporal segmentation of the video stream is achieved using shot detection.
  • a variance-difference based approach is used to detect a shot change, and robustly detects scene cuts and fades between scenes.
  • the variance of each frame is calculated, and the delta variance with respect to its previous frame, Dvar, is recorded.
  • the criteria for Dvar to start a new shot are:
  • shot boundaries can also be found using color histogram based approaches or optical-flow motion features.
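As a rough illustration of the variance-difference criterion described above, the following sketch starts a new shot when the frame-variance delta Dvar jumps; the relative threshold and minimum shot length are illustrative parameters, not values taken from the embodiments.

```python
import numpy as np

def detect_shot_boundaries(frames, rel_thresh=0.5, min_shot_len=5):
    """Variance-difference shot boundary detection (sketch).
    A boundary is declared when the frame variance changes by more
    than rel_thresh relative to the previous frame's variance and the
    current shot already has min_shot_len frames."""
    boundaries = [0]
    prev_var = np.var(frames[0].astype(np.float64))
    for t in range(1, len(frames)):
        var = np.var(frames[t].astype(np.float64))
        d_var = abs(var - prev_var)  # Dvar in the text
        if d_var > rel_thresh * max(prev_var, 1e-6) and t - boundaries[-1] >= min_shot_len:
            boundaries.append(t)
        prev_var = var
    return boundaries
```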
  • audio data are segmented into pieces, where each piece has its boundaries synchronized to its co-located video shot on the time axis.
  • an attention model and bag-of-words feature construction are applied to the shots.
  • Skeleton preservation uses some distinctive feature for shot discrimination.
  • an exciting summarization uses a content attentiveness (saliency) measurement.
  • Embodiment saliency measurement methods effectively reflect how informative a shot is, and shot features are selected to represent video skeleton with discrimination (i.e., to be used to find the similarity with other shots).
  • Bag-of-Words (BoW) models are used to characterize the shot properties in visual and aural domains, respectively.
  • the Bag-of-Words (BoW) model was initially utilized in Natural Language Processing to represent the structure of a text.
  • a BoW model regards a text document as a collection of certain words belonging to a reference dictionary but ignores the order and semantic implications of words.
  • a BoW model uses the occurrence of each word in the dictionary as the feature of text, and therefore, often produces a sparse vector.
  • the BoW model can be regarded as the “histogram representation based on independent features.” As such, a shot can be regarded as a sort of “text document” with regard to some embodiments.
  • visual words are derived using saliency detection according to PQFT-based attention modeling.
  • Such an attention model has been shown to be successful in imitating humans' perceptual properties on video frames.
  • the generated saliency map is used as a good indicator of how conspicuous a frame is, and which part within the frame incurs the highest human attention.
  • a measure of visual frame-saliency is formulated by calculating the average value of the saliency map for a frame t:
  • SM refers to the saliency map for frame t.
  • the visual conspicuity level of a shot is calculated by averaging the visual frame saliency over that shot:
  • the visual structure of original video observed from a middle-level video concept pattern is derived.
  • a video concept pattern can be viewed as a self-learned set featured by a combination of certain Spatially Local Visual Atoms (SLVAs), where each SLVA stands for a single visual pattern, found within a localized neighborhood at a particular spatial location, with plausible semantic implications, like green grass, blue sky, etc.
  • a noticeable property of the video concept pattern is that importance is only attached to the occurrence of SLVAs, without regard to their order (spatial location).
  • a shot of a far view of green hills with blooming yellow flowers and a shot of a near view of the grass and flowers should both imply the same concept, even though the grass and flowers may appear in different locations and in different scales.
  • the BoW model for visual shots, which felicitously expresses this order-irrelevant property, is employed by embodiments of the present invention using SLVAs as the visual words.
  • other techniques, such as the part-based methods described in B. Leibe, A. Leonardis, and B. Schiele, "Robust Object Detection with Interleaved Categorization and Segmentation", IJCV Special Issue on Learning for Vision and Vision for Learning, August 2007, can be used.
  • SIFT feature points are generally detected on every frame in the shot and on every region within a frame. This procedure, although precise, is especially time-consuming. Thus, some embodiments employ pre-processing steps prior to SIFT feature detection. In an embodiment, key frames are used to balance computational cost and accuracy. Further, a filtering process called saliency masking is used in some embodiments to improve the robustness and efficiency of the SIFT feature extraction.
  • a key frame is selected as the most representative frame in a shot.
  • There are key frame selection methods known in the art that can be used. Some straightforward methods include choosing the first, last, or middle frame of a shot. Some motion-based approaches use motion intensity to guide the key frame selection, such as those used in MPEG-7. In an embodiment, however, human attention models are used, and the most salient frame is used to represent a shot as follows:
  • Embodiment key frame selection techniques can save a large amount of computation resources at a minor cost of precision loss, assuming that frames are similar within a shot. If such an assumption does not hold, the attention model can be exploited with respect to a single frame to exclude some inattentive regions on the key frame.
  • An embodiment attention model based on image phase spectrum and motion residue is used to imitate human perceptual properties.
  • a saliency map SM is generated whose pixel values indicate how much attention each pixel of the original frame attracts.
  • the movement of the camera through an environment, e.g., relative to a fixed background, is the ego-motion.
  • the impact of ego-motion is incorporated in determining the saliency of a frame. As described in further detail below, this is accomplished by computing a motion channel as the difference between an intensity map and an ego-motion compensated intensity map.
  • First, the camera motion between adjacent frames is estimated.
  • Camera motion between two adjacent frames can be computed by estimating a 2-D rigid transformation based on the corresponding KLT (Kanade-Lucas-Tomasi Feature Tracker) key point tracks on the two adjacent frames.
  • Embodiments of the invention also include alternative methods, such as SIFT matching or Speeded-Up Robust Features (SURF) correspondence.
  • s, ⁇ , b, and by are camera parameters, wherein s is the zooming factor, ⁇ is the counterclockwise rotation degree, b x corresponds to the pan movement, and b y corresponds to the tilt movement.
  • matrix A and vector b may be solved for using robust RANSAC (RANdom SAmple Consensus) rigid transformation estimation, which is a known iterative method for estimating the parameters of a mathematical model from a set of observed data containing outliers.
  • Embodiments of the invention may also use alternative methods, such as Least Median of Squares or M-Estimators.
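The 2-D rigid (similarity) transformation estimation with RANSAC described above can be sketched as follows. This is an illustrative sketch, not the claimed embodiment: the parametrization a = s·cos(θ), c = s·sin(θ), the sample size, and the inlier threshold are all assumptions, and a production system would more likely call a library routine (e.g., an OpenCV estimator) on KLT tracks.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares 2-D similarity transform from point pairs.
    Parametrized as x' = a*x - c*y + bx, y' = c*x + a*y + by,
    where a = s*cos(theta) and c = s*sin(theta)."""
    n = len(src)
    A = np.zeros((2 * n, 4))
    rhs = np.empty(2 * n)
    A[0::2] = np.column_stack([src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)])
    A[1::2] = np.column_stack([src[:, 1], src[:, 0], np.zeros(n), np.ones(n)])
    rhs[0::2], rhs[1::2] = dst[:, 0], dst[:, 1]
    return np.linalg.lstsq(A, rhs, rcond=None)[0]  # (a, c, bx, by)

def apply_similarity(p, pts):
    a, c, bx, by = p
    return np.column_stack([a * pts[:, 0] - c * pts[:, 1] + bx,
                            c * pts[:, 0] + a * pts[:, 1] + by])

def ransac_similarity(src, dst, iters=200, tol=2.0, seed=0):
    """Minimal RANSAC: sample 2 correspondences, fit exactly, keep the
    largest consensus set, then refit on all of its inliers."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(iters):
        idx = rng.choice(len(src), size=2, replace=False)
        p = fit_similarity(src[idx], dst[idx])
        err = np.linalg.norm(apply_similarity(p, src) - dst, axis=1)
        inl = err < tol
        if best is None or inl.sum() > best.sum():
            best = inl
    return fit_similarity(src[best], dst[best])
```

The exact 2-point fit makes each hypothesis cheap, and the final least-squares refit over the consensus set recovers s, θ, b_x, b_y even when a fraction of the KLT correspondences are outliers.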
  • the visual saliency of each frame may be determined.
  • the camera motion may be applied to compensate the ego motion and the residual may be fused into the color information to generate visual saliency.
  • the intensity channel I(t) of a frame t is calculated using the color channels of the frame as follows.
  • a given frame t may comprise red r(t), green g(t), and blue b(t) channels.
  • Four broadly tuned color channels may be generated by the following equations:
  • Y(t) = (r(t) + g(t))/2 - |r(t) - g(t)|/2 - b(t)
  • two color difference channels are defined as following.
  • RG(t) = R(t) - G(t)
  • BY(t) = B(t) - Y(t)
  • the intensity channel is calculated as follows:
  • I(t) = (r(t) + g(t) + b(t))/3.
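The channel construction above can be sketched directly. The broadly tuned channel formulas follow the Itti-style definitions commonly paired with PQFT; clipping negative channel responses to zero is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def color_channels(frame):
    """Broadly tuned color channels R, G, B, Y and intensity I for one
    RGB frame (float arithmetic; negative responses clipped to zero,
    which is an assumed convention)."""
    r, g, b = (frame[..., i].astype(np.float64) for i in range(3))
    R = r - (g + b) / 2.0
    G = g - (r + b) / 2.0
    B = b - (r + g) / 2.0
    Y = (r + g) / 2.0 - np.abs(r - g) / 2.0 - b
    I = (r + g + b) / 3.0
    R, G, B, Y = (np.maximum(x, 0.0) for x in (R, G, B, Y))
    return R, G, B, Y, I

def color_differences(R, G, B, Y):
    """Color difference channels RG(t) and BY(t) used to build the
    quaternion image q(t)."""
    return R - G, B - Y
```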
  • the ego-motion compensated intensity map I(t - τ) for the previous frame (t - τ) is computed.
  • the motion channel M(t) is computed as the absolute difference between the intensity map I(t) and the ego-motion compensated intensity map I(t - τ): M(t) = |I(t) - I(t - τ)|.
  • A(t - τ → t) and b(t - τ → t) are the estimated camera parameters from frame (t - τ) to frame t.
  • the frame t can be represented as a quaternion image q(t):
  • q(t) can be represented in symplectic form as follows:
  • QFT: Quaternion Fourier Transform.
  • (u,v) is the location of each pixel in frequency domain, while N and M are the image's height and width.
  • Φ is the phase spectrum of Q(t).
  • the frequency domain representation Q(t) of the quaternion image q(t) includes only the phase spectrum in frequency domain. Therefore, the inverse Quaternion Fourier Transform (IQFT) of the phase spectrum of the frequency domain representation Q(t) of the quaternion image q(t) may be performed.
  • the IQFT of the phase spectrum q′(t) is a 2-D image map and may be computed as follows:
  • the saliency map SM(t) of frame t may be obtained by taking a smoothing filter kernel g and convolving it with the 2-D image map q′(t):
  • g is a 2-D Gaussian smoothing filter.
  • PQFT: Phase Spectrum of Quaternion Fourier Transform.
  • the visual saliency value S_v(t) of the frame t may be computed by taking the average over the entire saliency map as follows:
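A minimal single-channel sketch of the phase-spectrum saliency pipeline follows. It is the PFT simplification of the quaternion PQFT described above (one real channel instead of the four-channel quaternion image), with a separable Gaussian standing in for the kernel g; kernel width and sigma are illustrative.

```python
import numpy as np

def _gaussian_blur(img, sigma=3.0):
    """Separable Gaussian smoothing (the 2-D kernel g in the text)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    img = np.apply_along_axis(np.convolve, 1, img, k, mode='same')
    return np.apply_along_axis(np.convolve, 0, img, k, mode='same')

def phase_saliency(channel, sigma=3.0):
    """Keep only the phase of the Fourier transform, invert, square,
    and smooth: a single-channel stand-in for the PQFT saliency map."""
    F = np.fft.fft2(channel)
    q = np.fft.ifft2(np.exp(1j * np.angle(F)))  # unit magnitude, phase kept
    return _gaussian_blur(np.abs(q) ** 2, sigma)

def frame_saliency(sm):
    """S_v(t): average of the saliency map over the whole frame."""
    return float(sm.mean())
```

An isolated bright point survives the phase-only reconstruction as a sharp peak, which is exactly the "pop-out" behavior the attention model relies on.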
  • Camera motion may be utilized to emphasize or neglect certain objects.
  • camera motion may be used to guide viewers' attentions during a scene.
  • the rigid motion estimation as described above may be used to determine the camera motion type and speed.
  • further information is required to understand the relationship between camera motion and the significance of a particular camera motion in guiding a viewer. For example, it is necessary to be able to map the computed camera parameters to their ability to attract a viewer's attention.
  • Embodiments of the invention use general camera work rules to set up a user attention based model.
  • the user attention based model is obtained based on the following assumptions from general movie production.
  • zooming is assumed to emphasize something.
  • the speed of zooming is assumed to scale linearly with the importance of the media segment; therefore, faster zooming indicates more important content.
  • zoom-in is used to emphasize details
  • zoom-out is used to emphasize an overview scene.
  • a video producer may apply panning if the video producer wants to neglect or de-emphasize something.
  • the speed of the panning operation may be used as a metric of importance. Unlike zooming, the faster the panning speed is, the less important the content is.
  • an attention factor W_cm caused by camera motion is quantified over a pre-determined range, for example, [0, 2]. For example, a value greater than 1 may represent emphasis, while a value smaller than 1 may represent neglect.
  • an active region on the key frame is defined by thresholding the saliency map:
  • AR_{t_k}(i, j) = F_{t_k}(i, j) if SM_{t_k}(i, j) > T, and 0 otherwise,
  • T is the active threshold.
  • SIFT feature detection in remaining active regions then generates prominent and robust SLVAs of the frame.
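Saliency masking itself reduces to a threshold test on the saliency map. In this sketch the active threshold T is taken relative to the map maximum, which is an assumption; the text only states that T is "the active threshold."

```python
import numpy as np

def saliency_mask(frame, sal_map, T=0.5):
    """Active region AR by thresholding the saliency map: pixels whose
    saliency exceeds the threshold keep their frame values, all others
    are zeroed out before SIFT detection."""
    mask = sal_map > T * sal_map.max()
    return np.where(mask, frame, 0), mask
```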
  • FIG. 1 illustrates the results of saliency masking on shots 19 and 45 of a frame sequence.
  • frame 102 represents shot 19 prior to the application of saliency masking.
  • frame 104 represents shot 19 after saliency masking has been applied.
  • Regions 103 represent the masked regions of shot 19.
  • frame 106 represents shot 45 prior to the application of saliency masking.
  • frame 108 represents shot 45 after saliency masking has been applied.
  • Regions 105 represent the masked regions of shot 45.
  • Lowe's algorithm is used for SIFT feature detection in active regions on the key frame.
  • the frame is convolved with Gaussian filters at different scales, and then the differences of successive Gaussian-blurred versions are taken.
  • Key points are located as maxima/minima of the Difference of Gaussians (DoG) that occur at multiple scales.
  • each key point is assigned one or more orientations based on the local gradient directions.
  • a highly distinctive 128-dimension vector is generated as the point descriptor, i.e., the SLVA.
  • FIG. 2 shows detected SIFT feature points 109 in frames 104 and 108, representing shots 19 and 45, respectively.
  • the shot, as a bag, has a collection of "visual words," each of which is a vector of dimension 128.
  • the number of words is the number of SIFT feature points on the key frame.
  • a shot bag with its SIFT feature descriptors can now be regarded as a text document that has many words.
  • a "dictionary" is built as the collection of all the "words" from all the bags, and similar "words" should be treated as one "codeword"; just as in text documents, "take", "takes", "taken", and "took" should all be regarded as the same codeword, "take".
  • the bags of words in visual appearance are referred to in L.
  • a codeword can be considered as a representative of several similar SLVAs.
  • K-means clustering over all the SLVAs is used, where the number of the clusters is the codebook size.
  • codewords are the centers of the clusters, and each “word” is mapped to a certain codeword through the clustering process.
  • each shot can be represented by a histogram of the codewords.
  • In an embodiment, 200 codewords are used.
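The codebook construction and histogram steps above can be sketched as follows. This is illustrative only: 2-D toy descriptors stand in for the 128-D SLVAs, 2 codewords stand in for the 200 used in the embodiment, and a tiny k-means with a deterministic initialization (points spread over the data) stands in for a production clustering routine.

```python
import numpy as np

def build_codebook(words, k, iters=20):
    """Minimal k-means codebook: codewords are the cluster centers.
    Deterministic init over spread-out data points is an assumption;
    k-means++ or random restarts would be typical in practice."""
    centers = words[np.linspace(0, len(words) - 1, k).astype(int)].copy()
    for _ in range(iters):
        d = np.linalg.norm(words[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = words[labels == j].mean(axis=0)
    return centers

def bow_feature(descriptors, centers):
    """Assign each descriptor (SLVA or audio word) to its nearest
    codeword and return the shot's normalized codeword histogram."""
    d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

The same two functions cover both the visual BoW (FIG. 4) and the audio BoW (FIG. 6), since both reduce to nearest-codeword assignment plus a normalized occurrence histogram.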
  • FIG. 3 depicts the visual BoW feature for shot 19 (frame 104), shown in FIGS. 1 and 2 above.
  • the x-axis represents the index of words.
  • the y-axis represents the normalized frequency of words occurring in the key frame of the shot.
  • FIG. 4 illustrates flowchart 200 of an embodiment visual BoW feature extraction for a shot.
  • Key frame 202 of SHOT k is input to SIFT feature detection block 204, which applies SIFT feature detection to SHOT k.
  • Each detected SLVA 206 is assigned a codeword in step 208 based on codeword generation by K-means clustering block 210.
  • the frequency of each determined codeword 218 is counted in step 220 to produce visual BoW 222 for SHOT k.
  • Codeword generation by K-means clustering block 210 generates codewords based on all SLVAs (block 212) found in the key frame of SHOT k 202, as well as key frames 216 and SLVAs 214 from other shots.
  • Visual BoWs 224 for other shots are similarly derived.
  • the audio structure of the original video is observed with respect to an audio concept pattern.
  • an audio concept pattern is interpreted as an acoustic environment featured by a combination of certain Temporally Local Acoustic Atoms (TLAAs).
  • Each TLAA stands for a single audio pattern with plausible semantic implications.
  • For example, an audio concept of a conversation between John and Mary at the shore is featured as a combination of John's short-time voice (a TLAA) switching with Mary's (a TLAA) and the continuous environmental sound of sea waves (a TLAA).
  • an audio skeleton is sought that is usually comprised of "self-contained" concept patterns, meaning that, in the set of shots that form a concept pattern, every shot has TLAAs from the same closed subset of plausible audio patterns, and reshuffling of the plausible audio patterns is allowed.
  • This assumption originates from the fact that humans recognize an audio scene from a macroscopic perspective, which emphasizes the components instead of an exact time and location of every component.
  • the feature vectors of different shots may be much closer to each other as long as their acoustic component TLAAs are alike. In some embodiments, they are then prone to be clustered into the same group, which captures the underlying common characteristics of an audio scene.
  • With indicator-like features, which identify a shot as a single acoustic source (for example, speech from a single person, sound from a machine or the environment, or background music), each shot ends up as a sparse vector with only one 1-entry, indicating to which acoustic source the shot belongs. While this hard-decision-like feature can be viewed as contradictory to the fact that an audio segment corresponding to a shot usually consists of multiple intervening sources, this fact is implicitly reflected by a BoW feature.
  • embodiment BoW features encode intervening sources of a concept softly, which provides a closer approximation to the nature of an underlying concept as perceived by humans, and thus yields more accuracy.
  • the BoW model can suitably represent the audio features of a detected shot. If the audio stream of a shot is chopped into multiple overlapping short-time audio segments of equal length, the shot can be regarded as a bag containing multiple audio segments as audio words. Each word, with its feature extracted by Matching Pursuit decomposition, represents a unique TLAA, which is an audio pattern with plausible semantic implications. A shot is consequently considered a bag containing the audio patterns.
  • the histogram of each word occurrence is a summarized feature of a shot through all the words within.
  • an encoding scheme is applied to avoid the over-sparsity of feature vectors (which negatively impacts the classification result) that results from a direct word-occurrence statistic.
  • all audio words from all shots in raw video are stored into a dictionary, and K-means clustering is conducted over the dictionary to produce K codewords. Each word is then assigned to a nearest codeword.
  • the BoW feature of each shot is the occurrence of codewords inside the shot.
  • the robustness of an audio BoW feature is improved by taking into account only audio words above an acoustic saliency level, to avoid the negative effect on BoW accuracy exerted by audio words of low saliency, whose small values are easily masked by noise.
  • audio saliency is measured by a variety of low-level audio features (scalar values), including Spectral Centroid, Root Mean Square (RMS), Absolute Value Maximum, Zero-Crossing Ratio (ZCR), and Spectral Flux.
  • the spectral centroid is the center of the spectrum of the signal, and is computed by considering the spectrum as a distribution whose values are the frequencies and whose observation probabilities are the normalized amplitudes.
  • Root Mean Square is a measure of the short-time energy of a signal, based on the 2-norm.
  • Absolute Value Maximum is a measure of the short-time energy of a signal, based on the 1-norm.
  • the zero-crossing ratio is a measure of the number of times the signal value crosses the zero axis.
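The low-level audio saliency features listed above can be computed directly from a short-time frame of samples; the framing and normalization choices here are illustrative, not prescribed by the text.

```python
import numpy as np

def spectral_centroid(x, sr):
    """Center of the magnitude spectrum, treating normalized
    amplitudes as observation probabilities over frequency."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return float((freqs * mag).sum() / max(mag.sum(), 1e-12))

def rms(x):
    """Root Mean Square: short-time energy from the 2-norm."""
    return float(np.sqrt(np.mean(np.square(x))))

def abs_max(x):
    """Absolute Value Maximum of the frame."""
    return float(np.max(np.abs(x)))

def zero_crossing_ratio(x):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(x)
    return float(np.mean(signs[1:] != signs[:-1]))
```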
  • the time signature of shot members is used as an alternative feature to bind two concept sets. A concept pair producing more similar time signatures on both sides is considered a good pair and is matched up.
  • the time signature is the starting/ending time and duration of a shot.
  • an algorithm is used to progressively generate a summarized video clip by means of collecting shots.
  • a video skimming process is regarded as a video reconstruction process. Starting from an empty output sequence, a shot is recruited each time into the output sequence until the target skimming ratio is achieved. The duration of the output video is thus controlled by recruiting different amounts of video shots to satisfy an arbitrary skimming ratio. The recruiting order plays an important role in the final result.
  • each concept contributes shots to the skimmed video.
  • the skimmed video reflects the diversity of concepts of the original video and thus yields the maximum entropy, even though some concepts may not seem salient.
  • the concept importance is used as a factor for deciding the recruiting order of different concept patterns. It is not equivalent to concept saliency. Rather, concept importance is a more high-level quantity that can reveal a video producer's intention for the concepts' representation. Most commonly, if the producer gives a long shot to a concept pattern, or repeats the concept in many shots, then this concept can be considered intentionally of high importance. Under this assumption, the concept importance can be expressed as:
  • N_k is the total number of frames in shot k within concept l.
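Under the stated assumption, a concept's importance accumulates the frame counts of its member shots; normalizing by the total frame count so that importances sum to one is an assumption of this sketch.

```python
import numpy as np

def concept_importance(shot_lengths, shot_concepts, num_concepts):
    """Importance of each concept as the fraction of total frames its
    shots occupy (long or repeated shots imply intended importance)."""
    total = float(sum(shot_lengths))
    imp = np.zeros(num_concepts)
    for n, c in zip(shot_lengths, shot_concepts):
        imp[c] += n / total
    return imp
```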
  • a shot is first picked from the most important concept.
  • every shot is assigned an average audio-visual saliency value to indicate how exciting this shot is to viewers.
  • Some shots have mismatched audio-visual concepts. For example, in a video of two people, A and B, talking, most shots will consistently show a person's figure and play that person's voice. Some shots will show A's figure while playing B's voice. The case is rare but possible, and we call it a mismatch. After concept registration, the mismatch can easily be found by comparing the registered spectral clustering results.
  • the audio-visual saliency of the shot is decreased in some embodiments, since recruiting such a shot may cause some misunderstanding to viewers.
  • the audio-visual saliency is reduced according to the following expression:
  • AvgSal_k = λ AvgSal_k^v + (1−λ) AvgSal_k^a
  • AvgSal_k ← AvgSal_k − d_k,
  • where λ is a weighting parameter that balances audio and visual saliency and d_k is a saliency penalty for audio-visual concept mismatch.
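The fusion and penalty rules above can be sketched directly; the default values of `lam` and `penalty` here are illustrative assumptions, not values given in the text.

```python
def fused_shot_saliency(avg_sal_v, avg_sal_a, lam=0.5, mismatch=False, penalty=0.2):
    """AvgSal_k = lam * AvgSal_k^v + (1 - lam) * AvgSal_k^a, reduced by a
    penalty d_k when the shot's audio and visual concepts mismatch."""
    sal = lam * avg_sal_v + (1 - lam) * avg_sal_a
    if mismatch:
        sal -= penalty  # AvgSal_k <- AvgSal_k - d_k
    return sal
```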
  • the most salient shot in each concept is defined as a “must-in” shot, which means that these shots are recruited in the skimmed video regardless of the skimming ratio. This helps guarantee concept integrity.
  • the other shots are “optional” shots in the sense that they can be recruited or not depending on the target skimming ratio.
  • the reconstruction reference tree (RRT) is a tree structure for video reconstruction guidance. According to embodiments, the RRT is built according to principles regarding concept integrity, concept importance, shot saliency penalties for audio-visual concept mismatch, and “must-in” versus optional shots for each concept.
  • the root of the RRT is the video concept space, which is learned through the spectral clustering process.
  • the first level leaves are the concepts, which are sorted in importance descending order from left to right, and the second level leaves are the shots. Under each concept, the shots are sorted in saliency descending order from top to bottom.
  • the first child of each concept is the “must-in” shot and the rest of the shots are optional shots. Since each concept may have a different number of shots, some “virtual shots” with zero saliency are included to form an array of all shots. The resulting array is called the shot table.
  • Shot table 402 has shots ordered according to concept 404 and shot saliency.
  • concept categories 404 are ordered according to concept importance.
  • shots within each category are ordered according to saliency, such that the more salient shots are ordered at the top of each concept category and the least salient shots are ordered toward the bottom of each concept category.
  • ordered shots within each category can include must-in shots 406 , optional shots 408 and virtual shots 410 as described above.
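The shot-table construction described above can be sketched as follows (a minimal sketch; the importance and saliency values in the test are illustrative, and the dictionary input format is this sketch's assumption):

```python
def build_shot_table(concepts):
    """concepts: {concept_id: (importance, [shot saliencies])}.
    Columns are concepts sorted by descending importance; within each
    concept, shots are sorted by descending saliency, padded with
    zero-saliency 'virtual shots' so all columns have equal depth."""
    cols = sorted(concepts.values(), key=lambda c: -c[0])
    depth = max(len(sals) for _, sals in cols)
    table = []
    for _, sals in cols:
        ordered = sorted(sals, reverse=True)        # must-in shot first
        ordered += [0.0] * (depth - len(ordered))   # virtual shots
        table.append(ordered)
    return table  # table[c][0] is the "must-in" shot of concept c
```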
  • the current skimming ratio R c may not perfectly equal the target skimming ratio. In some embodiments, it may be more likely that R c is slightly larger than R t (due to the stop criteria).
  • pure frame-level skimming, which is based on the attention model, is used as post processing.
  • Sal_t = λ Sal_t^v + (1−λ) Sal_t^a.
  • the audio-visual saliency of every frame that appears in the output sequence is checked again. By thresholding on the saliency curve, frames with relatively low saliency are discarded, thereby allowing the final duration of the output video to satisfy the target duration. In addition, the smoothness requirement is also considered to yield a viewer-friendly skimmed video.
  • a morphological smoothing operation is adopted which includes deleting curve segments that are shorter than K frames, and joining together curve segments that are less than K frames apart.
  • K is generally a small number, for example, 10 frames. Alternatively, other numbers can be used for K.
  • the post processing algorithm is described as follows:
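The algorithm listing itself is not reproduced in this extract. The following is a minimal sketch of such a post-processing step, assuming the threshold is chosen as a saliency quantile and applying the K-frame smoothing rule described above (`frame_runs`, `smooth_mask`, and `postprocess` are hypothetical names):

```python
import numpy as np

def frame_runs(mask):
    """Yield (start, end, value) runs of a boolean keep/drop mask."""
    start = 0
    for i in range(1, len(mask) + 1):
        if i == len(mask) or mask[i] != mask[start]:
            yield start, i, bool(mask[start])
            start = i

def smooth_mask(keep, K=10):
    """Morphological smoothing: first bridge gaps shorter than K frames,
    then drop kept segments shorter than K frames."""
    keep = keep.copy()
    for s, e, v in list(frame_runs(keep)):
        if not v and e - s < K:
            keep[s:e] = True   # join nearby segments
    for s, e, v in list(frame_runs(keep)):
        if v and e - s < K:
            keep[s:e] = False  # delete too-short segments
    return keep

def postprocess(saliency, preserve_ratio, K=10):
    """Keep roughly preserve_ratio of frames by thresholding the
    audio-visual saliency curve, then smooth the keep/drop mask."""
    thr = np.quantile(saliency, 1.0 - preserve_ratio)
    return smooth_mask(np.asarray(saliency) >= thr, K)
```

Note that the order of the two passes matters: joining first prevents a run that would survive after gap-filling from being deleted prematurely.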
  • FIG. 9 illustrates an embodiment saliency curve 450 with threshold 452 chosen to obtain a preserving ratio of 95%. It should be appreciated that in alternative embodiments, other thresholds can be applied to obtain other preserving ratios.
  • shot segmentor 502 is configured to segment a video into individual shots.
  • the saliency of each shot is determined by video saliency determination block 504 , the output of which is used by key frame determining block 506 to determine a key frame within each shot.
  • Visual feature extractor block 508 is configured to extract visual features, and visual word clustering block 509 is configured to cluster visual words to form visual concepts using methods described above.
  • Shot clustering block 510 is configured to cluster shots based on different visual and audio descriptors to build concept patterns.
  • audio feature determination block 516 is configured to determine audio features from each segmented shot, and audio saliency determination block 518 is configured to determine the saliency of each audio feature.
  • Audio clustering is performed by audio word clustering block 520 to produce audio concepts. Furthermore, audio and visual concepts are aligned by block 522 .
  • reconstruction reference tree generation block 512 creates an RRT based on saliency and concept importance, according to embodiments described herein.
  • video skimming preview generator 514 is configured to generate the skimming preview.
  • FIG. 11 illustrates a processing system 600 that can be utilized to implement methods of the present invention.
  • the main processing is performed by processor 602 , which can be a microprocessor, digital signal processor or any other appropriate processing device.
  • Program code (e.g., the code implementing the algorithms disclosed above) and data can be stored in memory 604 .
  • the memory can be local memory such as DRAM or mass storage such as a hard drive, optical drive or other storage (which may be local or remote). While the memory is illustrated functionally with a single block, it is understood that one or more hardware blocks can be used to implement this function.
  • the processor can be used to implement some or all of the units shown in FIG. 11 .
  • the processor can serve as a specific functional unit at different times to implement the subtasks involved in performing the techniques of the present invention.
  • In some embodiments, different hardware blocks (e.g., the same as or different from the processor) can be used, such that some subtasks are performed by the processor while others are performed using separate circuitry.
  • FIG. 11 also illustrates I/O port 606 , which can be used to provide the video to and from the processor.
  • Video source 608 , the destination of which is not explicitly shown, is illustrated in dashed lines to indicate that it is not necessarily part of the system.
  • the source can be linked to the system by a network such as the Internet or by local interfaces (e.g., a USB or LAN interface).
  • display device 612 is coupled to I/O port 606 and supplies display 616 , such as a CRT or flat-panel display, with a video signal.
  • audio device 610 is coupled to I/O port 606 and drives acoustic transducer 614 with an audio signal.
  • audio device 610 and display device 612 can be coupled via a computer network, cable television network, or other type of network.
  • video skimming previews can be generated offline and included in portable media such as DVDs, flash drives, and other types of portable media.
  • FIG. 12 provides a listing of notations used herein.
  • Embodiments also include the ability to provide progressive video reconstruction from concept groups for high-level summarization, concept group categorization by spectral clustering of video shots, and alignment of audio concept groups with video concept groups.
  • visual and audio Bag-of-Words techniques are used for feature extraction, where visual words are constructed by using SIFT (Scale-Invariant Feature Transform), and audio words are constructed by using Gabor-dictionary based Matching Pursuit (MP) techniques.
  • saliency masking is used to provide for robust and distinguishable Bag-of-Word feature extraction.
  • visual saliency curve shaping uses a PQFT+dominant motion contrast attention model
  • audio saliency curve shaping is performed using low-level audio features, such as Maximum Absolute Value, Spectrum Centroid, RMS, and ZCR.
  • saliency-curve based skimming is used as a low level summarization.
  • spectral clustering favors classifying locally-correlated data into one cluster because it adds a constraint that distinguishes closely-located or locally-connected data and increases their similarity so that they are grouped together. With this constraint, the clustering result approaches the human intuition that a cluster with consistent members generally follows a concentrated distribution.
  • a further advantage of spectral clustering is that the clustering result is not sensitive to the number of members in the clusters.
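Grouping shot features into concept patterns with spectral clustering can be sketched with scikit-learn; the affinity choice, number of concepts, and function name below are illustrative assumptions, not parameters given in the text.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_shots(bow_features, n_concepts):
    """Assign each shot (a row of BoW-feature histogram values) to one of
    n_concepts concept patterns via spectral clustering."""
    model = SpectralClustering(n_clusters=n_concepts, affinity="rbf",
                               assign_labels="kmeans", random_state=0)
    return model.fit_predict(np.asarray(bow_features))
```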


Abstract

In an embodiment, an apparatus and method of creating a skimming preview of a video includes electronically receiving a plurality of video shots, analyzing each frame in a video shot from the plurality of video shots, where analyzing includes determining a saliency of each frame of the video shot. The method also includes determining a key frame of the video shot based on the saliency of each frame of the video shot, extracting visual features from the key frame, performing shot clustering of the plurality of video shots to determine concept patterns based on the visual features, and generating a reconstruction reference tree based on the shot clustering. The reconstruction reference tree includes video shots categorized according to each concept pattern.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application is a continuation application of U.S. patent application Ser. No. 14/922,936, filed Oct. 26, 2015, entitled “Video Skimming Methods and Systems,” which is a continuation of U.S. patent application Ser. No. 13/103,810, filed May 9, 2011, now issued as U.S. Pat. No. 9,171,578, which claims priority from U.S. Provisional Application, Ser. No. 61/371,458, filed Aug. 6, 2010, entitled “Video Skimming Methods and Systems,” which application is hereby incorporated herein by reference in its entirety.
  • TECHNICAL FIELD
  • The present invention relates to image processing, and, in particular embodiments, to video skimming by the perspective of hierarchical audio-visual reconstruction with saliency-masked Bag-of-Words features.
  • BACKGROUND
  • The fast evolution of digital video has brought many new applications and, consequently, research and development of new technologies that lower the costs of video archiving, cataloging and indexing, as well as improve the efficiency, usability and accessibility of stored videos, are greatly needed. Among all possible research areas, one important topic is how to enable a quick browse of a large collection of video data and how to achieve efficient content access and representation.
  • To address these issues, video abstraction techniques have emerged and have been attracting more research interest in recent years. There are two types of video abstraction: video summary and video skimming. Video summary, also called a still abstract, is a set of salient images (key frames) selected or reconstructed from an original video sequence.
  • Video skimming, also called a moving abstract, is a collection of image sequences, along with the corresponding audio, from an original video sequence. Video skimming is also called a preview of an original video, and can be classified into two sub-types: highlight and summary sequence. A highlight contains the most interesting and attractive parts of a video, while a summary sequence renders the impression of the content of an entire video. Among all types of video abstractions, the summary sequence conveys the highest semantic meaning of the content of an original video.
  • One prior art method is uniformly sampling frames to shrink the video size while losing the audio part, which is similar to the fast-forward function seen in many digital video players. Time compression methods can compress audio and video at the same time to synchronize them, using frame dropping and audio sampling. However, the compression ratio can be limited by speech distortion in some cases. Frame-level skimming mainly relies on the user attention model to compute a saliency curve, but this method is weak in keeping the video structure, especially for a long video. Shot clustering is a middle-level method in video abstraction, but its readability is mostly ignored. Semantic-level skimming is a method that tries to understand the video content, but it can be difficult to realize its goal due to the “semantic gap” puzzle.
  • SUMMARY OF THE INVENTION
  • In accordance with an embodiment, a method of creating a skimming preview of a video includes electronically receiving a plurality of video shots, analyzing each frame in a video shot from the plurality of video shots, where analyzing includes determining a saliency of each frame of the video shot. The method also includes determining a key frame of the video shot based on the saliency of each frame of the video shot, extracting visual features from the key frame, performing shot clustering of the plurality of video shots to determine concept patterns based on the visual features, and generating a reconstruction reference tree based on the shot clustering. The reconstruction reference tree includes video shots categorized according to each concept pattern.
  • The foregoing has outlined rather broadly the features of an embodiment of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of embodiments of the invention will be described hereinafter, which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiments disclosed may be readily utilized as a basis for modifying or designing other structures or processes for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • For a more complete understanding of the present invention, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawing, in which:
  • FIG. 1 provides frame shots illustrating embodiment visual saliency masking;
  • FIG. 2 illustrates embodiment SIFT features in active regions on video frames;
  • FIG. 3 provides a graph illustrating an embodiment BoW feature for example video frames;
  • FIG. 4 provides a flow chart of an embodiment visual BoW feature extraction;
  • FIG. 5 illustrates an embodiment saliency masking of audio words;
  • FIG. 6 illustrates a flow chart of audio BoW feature extraction;
  • FIG. 7 illustrates an example graph of an audio BoW feature;
  • FIG. 8 illustrates an embodiment reconstruction reference tree and shot table;
  • FIG. 9 illustrates an embodiment post processing saliency curve;
  • FIG. 10 illustrates an embodiment system that implements embodiment algorithms;
  • FIG. 11 illustrates a processing system that can be utilized to implement embodiment methods; and
  • FIG. 12 provides an appendix listing notations used herein.
  • DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
  • The making and using of the presently preferred embodiments are discussed in detail below. It should be appreciated, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific ways to make and use the invention, and do not limit the scope of the invention.
  • A novel approach to video summarization is disclosed. This approach includes unsupervised learning of original video concepts and hierarchical (both frame- and shot-level) reconstruction.
  • Video skimming is a task that shortens video into a temporally condensed version by which viewers may still understand the plot of the original video. This technique allows viewers to quickly browse a large multimedia library and thus facilitates tasks such as fast video browsing, video indexing and retrieval. The performance of video summarization mainly lies in the following three aspects: skeleton preservation, exciting and interesting summarization, and smooth transition.
  • Video summarization enables viewers to quickly and efficiently grasp what a video describes or presents from a shorter summarized version. To meet this need, the main skeleton of the original video is extracted and kept in the summarized video. A video skeleton can be seen as a queue of concept patterns with certain semantic implications in a temporal order. The term “concept pattern” here is not as high-level as a real semantic concept learned with human intervention. Rather, a concept pattern encodes semantic meanings of shots (sets of consecutive similar video frames), symbolizes a shot group that portrays consistent semantic settings, and generally serves as a hallmark or self-evident clue that hints at the development of the original video. Viewers may possibly recover the plot by only watching and hearing a handful of shots as long as all concept patterns are conveyed.
  • Some embodiments of the present invention present to viewers an exciting and interesting summary of a video. Often in a video, there are various shots conveying the same concept pattern. When selecting one shot conveying a concept pattern from many, the one with the highest saliency value, or equivalently the one generating the largest stimulus to human attention, is favored, so that the resulting summarized video not only contains integral concept patterns, but also carefully selects the shot instances with the richest information to reflect these concept patterns. Hence, a plain or even dull summarization is avoided.
  • In some cases, an unnatural transition between two adjacent concept patterns due to the elimination of a number of visually and acoustically similar shots is apparent in conventional video skimming previews. Some embodiments of the present invention provide a smooth transition between adjacent concept patterns by providing additional frame level summarization.
  • Embodiments of the present invention generate video summarization by providing unsupervised learning of original video concepts and hierarchical (both frame- and shot-level) reconstruction. In an embodiment, the skeleton of the original video is analyzed by concept pattern mining. Viewing it as a clustering problem, Bag-of-Words features (SIFT-based visual words and Matching-Pursuit-based audio words) are extracted for each shot from both visual and audio sensory channels and filtered with saliency masking. The Bag-of-Words features are then clustered into several groups using spectral clustering techniques. Each cluster represents a certain concept pattern.
  • Next, based on the discovered concept patterns, the original video is summarized from a reconstruction point of view. In some embodiments, summarization is regarded as a “summation” process rather than a “subtraction” process. By keeping at least one shot for each concept pattern, the concept-pattern integrity of the summarized video offers viewers the capability of context recovery. In addition, given a specified skimming ratio, a video that also contains the maximum achievable saliency accumulation is generated. In some embodiments, the summarization process is conducted in an iterative fashion, allowing flexible control of summarized video information richness versus skimming ratio.
  • Finally, to meet the skimming ratio specification and keep smooth transitions in the summarized video, frame-level saliency thresholding is used, followed by a temporal morphological operation as post processing.
  • A good understanding of video content can help achieve a good video summary. The most common content channels for a typical video sequence are the visual and acoustic channels. Most of the time, visual signals provide the majority of the information from which latent concept patterns are learned from the original video. However, audio sensory channels can also provide important information regarding concept patterns in ways not offered by the visual channel, for example, in low-light environments and nighttime shots. Furthermore, in embodiments, a concept pattern can be derived that shares both visual and audio consistency at the same time. Thus, if independent feature extraction and unsupervised concept learning from both visual and audio sensory data are used, learned concept results can be jointly analyzed in a parity-check fashion to enhance co-reliability. Therefore, in some embodiments, an audio stream is extracted from the raw video and processed in parallel with the video stream to detect possible audio concepts.
  • In an embodiment, temporal segmentation of the video stream is achieved using shot detection. A variance-difference based approach is used to detect a shot change, and robustly detects scene cuts and fades between scenes. The variance of a frame is calculated, and the delta variance with respect to the previous frame, Dvar, is recorded. In an embodiment, the criteria for Dvar to start a new shot are:
  • a. Dvar (current)<Th1 (stability requirement)
  • b. maxDvar(start to current)−minDvar(start to current)>Th2 (tolerance requirement)
  • c. Frame number in current shot>Th3 (shot length requirement)
  • In alternative embodiments, other techniques can be used. For example, shot boundaries can also be found using color histogram based approaches or optical-flow motion features. For processing convenience, in some embodiments, audio data are segmented into pieces, where each piece has its boundaries synchronized to its co-located video shot in time axis.
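The three criteria above can be sketched as a simple scan over the per-frame variance differences. This is a minimal sketch: the threshold values Th1, Th2, Th3 below are illustrative, not taken from the text, and `detect_shots` is a hypothetical name.

```python
def detect_shots(dvars, th1=0.5, th2=10.0, th3=15):
    """dvars[t]: |var(frame t) - var(frame t-1)|.
    Returns the indices at which new shots start."""
    starts = [0]
    shot_start = 0
    for t, dv in enumerate(dvars):
        window = dvars[shot_start:t + 1]
        if (dv < th1                                # a. stability requirement
                and max(window) - min(window) > th2  # b. tolerance requirement
                and t - shot_start > th3):           # c. shot length requirement
            starts.append(t)
            shot_start = t
    return starts
```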
  • In an embodiment, an attention model and a bag-of-words feature construction on shots is performed. Skeleton preservation uses some distinctive feature for shot-discrimination, and an exciting summarization uses a content attentiveness (saliency) measurement. Embodiment saliency measurement methods effectively reflect how informative a shot is, and shot features are selected to represent video skeleton with discrimination (i.e., to be used to find the similarity with other shots).
  • In embodiments, Bag-of-Words (BoW) models are used to characterize the shot properties in visual and aural domains, respectively. The Bag-of-Words (BoW) model was initially utilized in Natural Language Processing to represent the structure of a text. For example, a BoW model regards a text document as a collection of certain words belonging to a reference dictionary but ignores the order and semantic implications of words. A BoW model uses the occurrence of each word in the dictionary as the feature of text, and therefore, often produces a sparse vector. The BoW model can be regarded as the “histogram representation based on independent features.” As such, a shot can be regarded as a sort of “text document” with regard to some embodiments. However, since neither the “visual word” nor the “aural word” in a shot is readily apparent like real words in a text document, such visual and aural “words” need to be defined. In an embodiment, the determination of a “word” usually involves two steps: feature extraction and codeword generation.
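The occurrence-histogram idea above can be sketched as follows, assuming a codebook of codewords has already been learned (e.g., by clustering descriptors); the nearest-codeword assignment and normalization here are common BoW practice, not details specified by the text.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """descriptors: (n, d) array of per-shot features;
    codebook: (V, d) array of codewords.
    Returns the normalized (typically sparse) occurrence histogram."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)  # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```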
  • In an embodiment, visual words are derived using saliency detection according to PQFT-based attention modeling. Such an attention model has been shown to be successful in imitating humans' perceptual properties on video frames. The generated saliency map is used as a good indicator of how conspicuous a frame is, and which part within the frame incurs the highest human attention. Given the saliency map for each frame, a measure of visual frame saliency is formulated by calculating the average value of the saliency map for a frame t:
  • Sal_t^v = (1/(W×H)) Σ_{i=1}^{W} Σ_{j=1}^{H} SM_t(i, j),
  • where W and H are the frame width and height, respectively, and SM_t refers to the saliency map for frame t.
  • For a shot, the visual conspicuous level is calculated by averaging the visual frame saliency in that shot:
  • AvgSal_k^v = (1/N_k) Σ { Sal_t^v | F_t ∈ Shot_k }
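The two averaging formulas above can be written out directly (the function names are illustrative):

```python
import numpy as np

def frame_saliency(sm):
    """Sal_t^v: average of the W x H saliency map SM_t."""
    return float(sm.mean())

def avg_shot_saliency(saliency_maps):
    """AvgSal_k^v: mean frame saliency over the N_k frames of shot k."""
    return sum(frame_saliency(sm) for sm in saliency_maps) / len(saliency_maps)
```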
  • In an embodiment, the visual structure of the original video is observed from a middle-level video concept pattern. In general, a video concept pattern can be viewed as a self-learned set featured by a combination of certain Spatially Local Visual Atoms (SLVAs), where each SLVA stands for a single visual pattern, found within a localized neighborhood at a particular spatial location, with plausible semantic implications, like green grass, blue sky, etc. A noticeable property of the video concept pattern is that importance is attached only to the occurrence of SLVAs, without regard to their order (spatial location). For example, a shot of a far view of green hills with blooming yellow flowers and a shot of a near view of the grass and flowers should both imply the same concept, even though the grass and flowers may appear in different locations and at different scales. As such, the BoW model for visual shots, which graciously expresses the order-irrelevant property, is employed by embodiments of the present invention using SLVAs as the visual words. Alternatively, other techniques, such as part-based methods, described in B. Leibe, A. Leonardis, and B. Schiele, “Robust Object Detection with Interleaved Categorization and Segmentation”, IJCV Special Issue on Learning for Vision and Vision for Learning, August 2007, can be used.
  • In a regular full process mode, SIFT feature points are generally detected on every frame in the shot and on every region within a frame. This procedure, although precise, is especially time-consuming. Thus, some embodiments employ pre-processing steps prior to SIFT feature detection. In an embodiment key frames are used to balance computational cost and accuracy. Further, a filtering process called saliency masking is used to improve the robustness and efficiency of the SIFT feature extraction in some embodiments.
  • Considering the fact that some frames within a shot appear to have only minor differences, one frame, referred to as a key frame, is selected as the most representative frame in a shot. There are many key frame selection methods known in the art that can be used. Some straightforward methods include choosing the first/last frame, or the middle frame, in a shot. Some motion-based approaches use motion intensity to guide the key frame selection, such as those used in MPEG-7. In an embodiment, however, human attention models are used, and the most salient frame is used to represent a shot as follows:

  • t_k = arg max{ Sal_t^v | F_t ∈ Shot_k }
  • Embodiment key frame selection techniques can save a large amount of computational resources at a minor cost of precision loss, assuming that frames are similar within a shot. If such an assumption does not hold, the attention model can be exploited with respect to a single frame to exclude some inattentive regions on the key frame. An embodiment attention model, based on image phase spectrum and motion residue, is used to imitate human perceptual properties. In an embodiment, a saliency map SM is generated whose pixel values indicate how attentive the pixels of the original frame are.
  • The movement of the camera through an environment (e.g., relative to a fixed background) is the ego-motion. In an embodiment, the impact of ego-motion is incorporated in determining the saliency of a frame. As described further in detail below, this is accomplished by computing a motion channel as the difference between an intensity map and an ego-motion compensated intensity map.
  • In an embodiment, the camera motion between adjacent frames is estimated. Camera motion between two adjacent frames can be computed by estimating a 2-D rigid transformation based on the corresponding KLT (Kanade-Lucas-Tomasi Feature Tracker) key point tracks on the two adjacent frames. Embodiments of the invention also include alternative methods such as SIFT matching or Speeded Up Robust Features (SURF) correspondence, etc.
  • Suppose a KLT key point is located at (x,y) in frame t, the corresponding KLT key point is tracked at (x′,y′) in frame (t+1), and the transformation from (x,y) to (x′,y′) can be expressed as follows:
  • [x′, y′]ᵀ = A [x, y]ᵀ + b = [[s cos θ, s sin θ], [−s sin θ, s cos θ]] [x, y]ᵀ + [b_x, b_y]ᵀ.
  • In the above equation, s, θ, b_x, and b_y are camera parameters, wherein s is the zooming factor, θ is the counterclockwise rotation degree, b_x corresponds to the pan movement, and b_y corresponds to the tilt movement.
  • For a set of KLT key point correspondences, matrix A and vector b may be solved using the robust RANSAC (RANdom SAmple Consensus) rigid transformation estimation, which is a known iterative method to estimate parameters of a mathematical model from a set of observed data having outliers. RANSAC is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, which increases with the number of allowed iterations. Embodiments of the invention may also use alternative methods such as Least Median of Squares or M-Estimator etc.
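The similarity-transform fit with a RANSAC loop can be sketched in pure NumPy (a production system might instead use OpenCV's `estimateAffinePartial2D`). This is a minimal sketch: the iteration count, inlier tolerance, and function names are illustrative choices, not values from the text.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares fit of [x'; y'] = A [x; y] + b with
    A = [[a, c], [-c, a]], a = s*cos(theta), c = s*sin(theta)."""
    rows, rhs = [], []
    for (xi, yi), (xpi, ypi) in zip(src, dst):
        rows.append([xi, yi, 1.0, 0.0]); rhs.append(xpi)   # x' = a*x + c*y + bx
        rows.append([yi, -xi, 0.0, 1.0]); rhs.append(ypi)  # y' = -c*x + a*y + by
    a, c, bx, by = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)[0]
    return np.array([[a, c], [-c, a]]), np.array([bx, by])

def ransac_similarity(src, dst, iters=100, tol=1.0, seed=0):
    """Repeatedly fit on minimal 2-point samples and keep the model
    with the most inliers (points transformed within tol of dst)."""
    rng = np.random.default_rng(seed)
    best, best_inliers = None, 0
    for _ in range(iters):
        idx = rng.choice(len(src), size=2, replace=False)  # 2 points fix s, theta, b
        A, b = fit_similarity(src[idx], dst[idx])
        err = np.linalg.norm(src @ A.T + b - dst, axis=1)
        inliers = int((err < tol).sum())
        if inliers > best_inliers:
            best, best_inliers = (A, b), inliers
    return best
```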
  • After estimating the camera motion parameters, the visual saliency of each frame may be determined. The camera motion may be applied to compensate the ego motion and the residual may be fused into the color information to generate visual saliency.
  • Next, the intensity channel I(t) of a frame t is calculated using the color channels of the frame as follows. A given frame t may comprise red r(t), green g(t), and blue b(t) channels. Four broadly tuned color channels may be generated by the following equations:

  • R(t)=r(t)−(g(t)+b(t))/2

  • G(t)=g(t)−(r(t)+b(t))/2

  • B(t)=b(t)−(r(t)+g(t))/2

  • Y(t)=(r(t)+g(t))/2−|r(t)−g(t)|/2−b(t).
  • In addition, two color difference channels are defined as follows.

  • RG(t)=R(t)−G(t)

  • BY(t)=B(t)−Y(t).
  • The intensity channel is calculated as follows:

  • I(t)=(r(t)+g(t)+b(t))/3.
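The channel definitions above translate directly into array operations (the function name is illustrative):

```python
import numpy as np

def color_channels(r, g, b):
    """Broadly tuned color channels, the two difference channels
    RG and BY, and the intensity channel I, per the equations above."""
    R = r - (g + b) / 2
    G = g - (r + b) / 2
    B = b - (r + g) / 2
    Y = (r + g) / 2 - np.abs(r - g) / 2 - b
    RG = R - G
    BY = B - Y
    I = (r + g + b) / 3
    return RG, BY, I
```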
  • The ego-motion compensated intensity map I(t−τ) for the previous frame (t−τ) is computed. The motion channel M(t) is computed as an absolute difference between intensity map I(t) and ego-motion compensated intensity map I(t−τ) as follows:

  • M(t) = |I(t) − (A_{t−τ}^t I(t−τ) + b_{t−τ}^t)|.
  • In the above equation, A_{t−τ}^t and b_{t−τ}^t are the estimated camera parameters from frame (t−τ) to frame t.
  • Next, the frame t can be represented as a quaternion image q(t):

  • q(t) = M(t) + RG(t)μ₁ + BY(t)μ₂ + I(t)μ₃.
  • In the above equation, μ_j² = −1, j = 1, 2, 3; and μ₁ ⊥ μ₂, μ₁ ⊥ μ₃, μ₂ ⊥ μ₃, μ₃ = μ₁μ₂.
  • Furthermore, q(t) can be represented in symplectic form as follows:

  • q(t) = f₁(t) + f₂(t)μ₂

  • f₁(t) = M(t) + RG(t)μ₁

  • f₂(t) = BY(t) + I(t)μ₁.
  • A Quaternion Fourier Transform (QFT) is performed on the quaternion image q(n,m,t), where (n,m) is the location of each pixel in time domain:

  • Q[u, v] = F₁[u, v] + F₂[u, v]μ₂
  • F_i[u, v] = (1/MN) Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} e^{−μ₁2π((mv/M)+(nu/N))} f_i(n, m).
  • In the above equations, (u,v) is the location of each pixel in frequency domain, while N and M are the image's height and width.
  • The inverse Fourier transform is obtained as follows:
  • f_i[n, m] = (1/MN) Σ_{v=0}^{M−1} Σ_{u=0}^{N−1} e^{μ₁2π((mv/M)+(nu/N))} F_i[u, v].
  • A Frequency domain representation Q(t) of the quaternion image q(t) can be rewritten in the polar form as follows:

  • Q(t) = ‖Q(t)‖ e^{μ₁φ(t)},
  • where φ(t) is the phase spectrum of Q(t).
  • If ‖Q(t)‖ = 1 in the above polar form, the frequency domain representation Q(t) of the quaternion image q(t) includes only the phase spectrum in the frequency domain. Therefore, the inverse Quaternion Fourier Transform (IQFT) of the phase spectrum of Q(t) may be performed. The IQFT of the phase spectrum, q′(t), is a 2-D image map and may be computed as follows:

  • q′(t) = a(t) + b(t)μ₁ + c(t)μ₂ + d(t)μ₃.
  • The saliency map (sM(t)) of frame t may be obtained by taking a smoothing filter kernel and running a convolution with the 2-D image map q′(t):

  • sM(t) = g ∗ ||q′(t)||²,
  • where g is a 2-D Gaussian smoothing filter. In various embodiments, for computational efficiency, only the Phase Spectrum of Quaternion Fourier Transform (PQFT) of a resized image (e.g., whose width equals 128) may be computed.
  • Next, the visual saliency value Sv(t) of the frame t may be computed by taking the average over the entire saliency map as follows:
  • S_v(t) = (1/(MN)) Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} sM(n, m, t).
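For intuition, the phase-spectrum saliency computation above can be sketched with a single-channel simplification, using a plain 2-D Fourier transform in place of the full quaternion QFT. The function names, the Gaussian σ, and the kernel radius below are illustrative choices of this sketch, not from the original:

```python
import numpy as np

def _gauss1d(sigma, radius):
    """1-D Gaussian kernel, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def _smooth(img, sigma=3.0):
    """Separable 2-D Gaussian smoothing (the filter g above)."""
    k = _gauss1d(sigma, int(3 * sigma))
    img = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    img = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, img)
    return img

def phase_spectrum_saliency(channel):
    """Phase-only saliency for one channel: keep only the phase of the
    Fourier transform, invert it, square the magnitude, and smooth
    (sM = g * ||q'||^2)."""
    phase = np.angle(np.fft.fft2(channel))
    q = np.fft.ifft2(np.exp(1j * phase))
    return _smooth(np.abs(q) ** 2)

def visual_saliency_value(sM):
    """Sv(t): the average of the saliency map over all MN pixels."""
    return float(sM.mean())
```

In the full method the four channels M, RG, BY, and I are packed into one quaternion image and a single QFT is taken; the single-channel version keeps the same phase-only principle.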
  • Embodiments of the invention for tuning the saliency to account for camera motion will next be described. Camera motion may be utilized to emphasize or neglect certain objects. Alternatively, camera motion may be used to guide viewers' attentions during a scene.
  • In one or more embodiments, the rigid motion estimation as described above, may be used to determine the camera motion type and speed. However, further information is required to understand the relationship between camera motion and the significance of a particular camera motion in guiding a user. For example, it is necessary to be able to map the computed camera parameters to their ability to attract a viewer's attention. Embodiments of the invention use general camera work rules to set up a user attention based model.
  • The user attention based model is obtained based on the following assumptions from general movie production. First, zooming is assumed to emphasize something. In particular, the speed of zooming scales linearly with the importance of the media segment; faster zooming therefore indicates more important content. Usually, zoom-in is used to emphasize details, while zoom-out is used to emphasize an overview scene. Second, a video producer may apply panning to neglect or de-emphasize something. As in zooming, the speed of the panning operation may be used as a metric of importance. Unlike zooming, however, the faster the panning speed is, the less important the content is.
  • The visual saliency value Sv(t) of frame t is then scaled by the corresponding camera attention factor ωcm(t). Therefore, the effective visual saliency Sv*(t) is computed as:

  • Sv*(t) ← ωcm(t)·Sv(t).
  • In various embodiments, the attention factor ωcm caused by camera motion is quantified over a pre-determined range, for example, [0~2]. For example, a value greater than 1 may represent emphasis, while a value smaller than 1 may represent neglect.
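One way to realize such a camera attention factor is sketched below. The linear mapping, the speed normalization, and the function names are assumptions of this sketch; the text above only fixes the qualitative behavior (faster zoom emphasizes, faster pan de-emphasizes, values in [0~2]):

```python
def camera_attention_factor(motion_type, speed, max_speed=1.0):
    """Map estimated camera motion to an attention factor in [0, 2].
    Zooming scales the factor above the neutral value 1.0 (emphasis);
    panning scales it below 1.0 (neglect)."""
    s = min(abs(speed) / max_speed, 1.0)  # normalized speed in [0, 1]
    if motion_type == "zoom":
        return 1.0 + s                    # emphasis: factor in (1, 2]
    if motion_type == "pan":
        return 1.0 - s                    # neglect: factor in [0, 1)
    return 1.0                            # static camera: neutral

def effective_visual_saliency(sv, factor):
    """Sv*(t) = wcm(t) * Sv(t)."""
    return factor * sv
```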
  • Next, an active region on the key frame is defined by thresholding the saliency map:

  • AR t k(i,j)={F t k(i,j)|SM t k(i,j)>T, 1≤i≤W,1≤j≤H}
  • where T is the active threshold. SIFT feature detection in the remaining active regions then generates prominent and robust SLVAs of the frame.
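As a concrete illustration, thresholding the saliency map into an active-region mask might look like the following numpy sketch (the function names and the zero fill value for masked-out pixels are assumptions):

```python
import numpy as np

def active_region_mask(saliency_map, T):
    """AR: boolean mask of pixels whose saliency exceeds the active
    threshold T; feature detection is restricted to this region."""
    return saliency_map > T

def apply_saliency_mask(frame, saliency_map, T):
    """Zero out non-salient pixels, as in the saliency masking of FIG. 1."""
    mask = active_region_mask(saliency_map, T)
    return np.where(mask, frame, 0)
```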
  • FIG. 1 illustrates the results of saliency masking on shot 19 and 45 of a frame sequence. For example, frame 102 represents shot 19 prior to the application of saliency masking, and frame 104 represents shot 19 after saliency masking has been applied. Regions 103 represent the masked regions of shot 19. Similarly, frame 106 represents shot 45 prior to the application of saliency masking, and frame 108 represents shot 45 after saliency masking has been applied. Regions 105 represent the masked regions of shot 45.
  • In an embodiment, Lowe's algorithm for SIFT feature detection in active regions on the key frame is used. The frame is convolved with Gaussian filters at different scales, and then the differences of successive Gaussian-blurred versions are taken. Key points are located as maxima/minima of the Difference of Gaussians (DoG) that occur at multiple scales. Then, low-contrast key points are discarded and high edge responses are eliminated. Next, each key point is assigned one or more orientations based on the local gradient directions. Finally, a highly distinctive 128-dimension vector is generated as the point descriptor; i.e., the SLVA. For example, FIG. 2 shows detected SIFT feature points 109 in frames 104 and 108 representing shots 19 and 45, respectively.
  • After SIFT feature points are found on the key frame of each shot, the shot can be treated as a bag holding a collection of "visual words," each of which is a vector of dimension 128. The number of words is the number of SIFT feature points on the key frame. A shot, with its SIFT feature descriptors, can then be regarded as a text document containing many words. To generate the histogram representation as the feature for the shot, a "dictionary" is built as the collection of all the "words" from all the bags, and similar "words" are treated as one "codeword"; just as, in text documents, "take," "takes," "taken," and "took" would all be regarded as the same codeword, "take." Bags of words in visual appearance are described in L. Fei-Fei and P. Perona, "A Bayesian Hierarchical Model for Learning Natural Scene Categories," IEEE Computer Vision and Pattern Recognition, pp. 524-531, 2005, which is incorporated herein by reference. Alternatively, other algorithms can be used, such as, but not limited to, those described in G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual Categorization with Bags of Keypoints," Proc. of ECCV International Workshop on Statistical Learning in Computer Vision, 2004. Furthermore, other vector dimensions can be used as well.
  • A codeword can be considered as a representative of several similar SLVAs. In an embodiment, K-means clustering over all the SLVAs is used, where the number of the clusters is the codebook size. Such an embodiment can be viewed as being analogous to the number of different words in a text dictionary. Here, codewords are the centers of the clusters, and each “word” is mapped to a certain codeword through the clustering process.
  • Thus, each shot can be represented by a histogram of the codewords. In one example, to take into account the complexity of a particular video sequence, 200 codewords are used. FIG. 3 depicts the visual BoW feature for shot 19 (frame 104 shown in FIGS. 1 and 2 above). Here, the x-axis represents the index of words, and the y-axis represents the normalized frequency of word occurrences in the key frame of the shot.
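The codeword generation and histogram steps can be sketched as follows. In the actual method the words are 128-D SIFT descriptors and the codebook has about 200 entries; this sketch uses tiny dimensions, a plain Lloyd's K-means, and a deterministic initialization, all of which are simplifying assumptions:

```python
import numpy as np

def kmeans(words, k, iters=20):
    """Plain Lloyd's K-means over all descriptors ("words") pooled from
    all shots; the k cluster centers become the codewords of the
    dictionary. Evenly-spaced initialization keeps the sketch
    deterministic."""
    idx = np.linspace(0, len(words) - 1, k).astype(int)
    centers = words[idx].astype(float)
    for _ in range(iters):
        # assign every word to its nearest center
        d = np.linalg.norm(words[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned words
        for j in range(k):
            if np.any(labels == j):
                centers[j] = words[labels == j].mean(axis=0)
    return centers

def bow_histogram(shot_words, centers):
    """Map each word of one shot to its nearest codeword and return the
    normalized histogram of codeword frequencies (the shot's BoW feature)."""
    d = np.linalg.norm(shot_words[:, None, :] - centers[None, :, :], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers)).astype(float)
    return hist / hist.sum()
```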
  • FIG. 4 illustrates flowchart 200 of an embodiment visual BoW feature extraction for a shot. Key frame 202 of SHOTk is input to SIFT feature detection block 204, which applies SIFT feature detection to SHOTk. Each detected SLVA 206 is assigned a codeword in step 208 based on codeword generation by K-means clustering block 210. The frequency of each determined codeword 218 is counted in step 220 to produce visual BoW 222 for SHOTk. Codeword generation by K-means clustering block 210 generates codewords based on all SLVAs (block 212) found among the key frame of SHOTk 202, as well as key frames 216 and SLVAs 214 from other shots. Visual BoWs 224 for other shots are similarly derived.
  • In an embodiment, the audio structure of the original video is observed with respect to audio concept patterns. In general, an audio concept pattern is interpreted as an acoustic environment featured by a combination of certain Temporally Local Acoustic Atoms (TLAAs). Each TLAA stands for a single audio pattern with plausible semantic implications. For example, an audio concept of a conversation between John and Mary at the shore is featured as a combination of John's short-time voice (a TLAA) switching with Mary's (a TLAA) and the continuous environmental sound of sea waves (a TLAA). Note that for the purpose of video summarization, an audio skeleton is sought that is usually composed of "self-contained" concept patterns, meaning that in the set of shots that form a concept pattern, every shot has TLAAs from the same closed subset of plausible audio patterns, and reshuffling of those audio patterns is allowed. This assumption originates from the fact that humans recognize an audio scene from a macroscopic perspective, which emphasizes the components instead of the exact time and location of every component.
  • As in the above example, if another audio scene also includes John, Mary, and the sea waves, but this time John continuously talks during the first half and Mary talks during the second half, without any voice switching, this scene is still considered to have the same concept pattern as the example above, since the second example also conveys the semantic implication of John and Mary's conversation at the shore. With respect to one audio concept, the member shots are subject to consistent TLAA compositions, regardless of the order in which these TLAAs are arranged.
  • In the context of audio concept clustering, at this level the feature vectors of different shots may be much closer to each other, as long as their acoustic component TLAAs are alike. In some embodiments, such shots are then clustered into the same group, which captures the underlying common characteristics of an audio scene. Compare this to indicator-like features, which identify a shot as a single acoustic source, for example, speech from a single person, sound from a machine or environment, or background music: each shot ends up as a sparse vector with only one 1-entry, indicating to which acoustic source the shot belongs. While this hard-decision-like feature can be viewed as contradictory to the fact that an audio segment corresponding to a shot usually consists of multiple intervening sources, that fact is implicitly reflected by a BoW feature.
  • For indicator-like features, the sparse nature of the shot data highlights the differences between shots by assuming each shot is a single source with the majority contribution, and those majority sources are usually different. In this way, the clustering may miss the opportunity to learn a reasonable concept pattern in which shots have similar acoustic components but different majority sources. Therefore, embodiment BoW features encode the intervening sources of a concept softly, which provides a closer approximation to the nature of an underlying concept as perceived by humans, and thus yields more accuracy.
  • To serve the needs of concept pattern mining that focuses on components rather than their order, the BoW model can suitably represent the audio features of a detected shot. If the audio stream of a shot is chopped into multiple overlapping short-time audio segments of equal length, the shot can be regarded as a bag containing multiple audio segments as audio words. Each word, with features extracted by Matching Pursuit decomposition, represents a unique TLAA, which is an audio pattern with plausible semantic implications. A shot is consequently considered a bag containing audio patterns. The histogram of word occurrences summarizes a shot over all the words within it. Here, an encoding scheme is applied to avoid the over-sparsity of feature vectors (which would negatively impact the classification result) that a direct word-occurrence statistic would produce. In an embodiment, all audio words from all shots in the raw video are stored in a dictionary, and K-means clustering is conducted over the dictionary to produce K codewords. Each word is then assigned to its nearest codeword. The BoW feature of each shot is the occurrence count of codewords inside the shot.
  • In an embodiment, the robustness of the audio BoW feature is improved by taking into account only audio words above an acoustic saliency level, to avoid the negative effect on BoW accuracy exerted by audio words of low saliency, whose small values may be dominated by noise. Here, audio saliency is measured by a variety of low-level audio features (scalar values), including Spectral Centroid, Root Mean Square (RMS), Absolute Value Maximum, Zero-Crossing Ratio (ZCR), and Spectral Flux. By using saliency masking, the audio words undergo a reliability test, so that the accuracy of the feature for every word is increased.
  • The spectral centroid is the center of the spectrum of the signal, computed by treating the spectrum as a distribution whose values are the frequencies and whose probabilities are the normalized amplitudes. Root Mean Square is a measure of the short-time energy of a signal based on the L2 norm. Absolute Value Maximum is a measure of the short-time energy of a signal based on the L1 norm. The zero-crossing ratio is a measure of the number of times the signal value crosses the zero axis. These measures are further discussed in G. Peeters, "A Large Set of Audio Features for Sound Description (Similarity and Classification) in the CUIDADO Project," Report for the Institut de Recherche et Coordination Acoustique/Musique, April 2004, which is incorporated herein by reference.
  • If the numbers of members of more than one concept are equal, a match ambiguity emerges in the one-to-one mapping between visual and audio concepts. Here, the time signature of the shot members is used as an alternative feature to bind the two concept sets. The concept pair producing the most similar time signatures on both sides is considered a good pair and is matched up. The time signature of a shot is its starting/ending time and duration.
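The scalar features just named (spectral centroid, RMS, absolute-value maximum, ZCR) might be computed for one short-time audio word as below. Spectral flux, which compares the spectra of consecutive words, is omitted from this single-word sketch, and the function name is an assumption:

```python
import numpy as np

def audio_saliency_features(x, sr):
    """Low-level scalar features for one short-time audio word x
    (a 1-D sample array) at sample rate sr."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return {
        # spectrum treated as a distribution over frequency
        "centroid": float((freqs * mag).sum() / max(mag.sum(), 1e-12)),
        "rms": float(np.sqrt(np.mean(x ** 2))),       # norm-2 energy
        "abs_max": float(np.max(np.abs(x))),          # norm-1 style peak
        # fraction of adjacent sample pairs whose sign differs
        "zcr": float(np.mean(np.abs(np.diff(np.sign(x))) > 0)),
    }
```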
  • In an embodiment, an algorithm is used to progressively generate a summarized video clip by means of collecting shots. In other words, the video skimming process is regarded as a video reconstruction process. Starting from an empty output sequence, a shot is recruited each time into the output sequence until the target skimming ratio is achieved. The duration of the output video is thus controlled by recruiting different numbers of video shots to satisfy an arbitrary skimming ratio. The recruiting order plays an important role in the final result.
  • As discussed hereinabove, all three aspects of video skimming are considered: efficiency, saliency, and smoothness. Given these requirements, we design several rules and propose a "reconstruction reference tree" structure for our skimming algorithm.
  • To maintain concept integrity, some embodiments require that each concept contribute shots to the skimmed video. By having each concept contribute shots to the skimmed video, the skimmed video reflects the diversity of concepts of the original video and thus yields the maximum entropy, even though some concepts may not seem salient. In some embodiments, it is possible to have less salient shots added into the video skimming preview as a way of trading off concept integrity and saliency maximization.
  • In embodiments, the concept importance is used as a factor for deciding the recruiting order of different concept patterns. It is not equivalent to the concept saliency. Rather, concept importance is a more high-level argument that can reveal a video producer's intention for the concepts' representation. Most commonly, if the producer gives a long shot for a concept pattern, or repeats the concept in many shots, then this concept can be considered to be of high importance intentionally. Under this assumption, the concept importance can be expressed as:

  • I_l = Σ { N_k | Shot_k ∈ C_l },
  • where N_k is the total number of frames in shot k within concept l. In an embodiment reconstruction framework, a shot is first picked from the most important concept.
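The concept-importance formula is a direct summation; a minimal sketch (the dict-based data layout is an assumption of this sketch):

```python
def concept_importance(frame_counts, concepts):
    """I_l = sum of frame counts N_k over all shots k in concept C_l.
    `frame_counts` maps shot id -> N_k; `concepts` maps concept id ->
    list of member shot ids."""
    return {l: sum(frame_counts[k] for k in members)
            for l, members in concepts.items()}
```

Long shots and frequently repeated concepts both raise the frame total, matching the intuition that the producer emphasized them intentionally.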
  • To increase or maximize saliency, in some embodiments, every shot is assigned an average audio-visual saliency value to indicate how exciting the shot is to viewers. Some shots, however, have mismatched audio-visual concepts. For example, in a video of two people, A and B, talking, most shots will consistently show a person's figure and play that person's voice. Some shots, however, may show A's figure while playing B's voice. The case is rare but possible, and we call it a mismatch. After concept registration, the mismatch can easily be found by comparing the registered spectral clustering results.
  • When there is a mismatch, the audio-visual saliency of the shot is decreased in some embodiments, since recruiting such a shot may cause some misunderstanding to viewers. The audio-visual saliency is reduced according to the following expression:

  • AvgSal_k = λ·AvgSal_k^v + (1−λ)·AvgSal_k^a

  • AvgSal_k ← AvgSal_k − α·d_k,
  • where λ is a weighting parameter that balances audio and visual saliency, α is the saliency penalty for an audio-visual concept mismatch, and d_k indicates a mismatch for shot k.
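The two expressions above combine into a few lines. The λ and α values here are illustrative placeholders, not values from the text:

```python
def shot_saliency(avg_visual, avg_audio, lam=0.5, alpha=0.2, mismatch=False):
    """AvgSal_k = lam*AvgSal_k^v + (1-lam)*AvgSal_k^a, then subtract the
    penalty alpha when the shot's audio and visual concepts mismatch
    (recruiting such a shot may confuse viewers)."""
    s = lam * avg_visual + (1 - lam) * avg_audio
    if mismatch:
        s -= alpha
    return s
```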
  • The most salient shot in each concept is defined as a "must-in" shot, meaning that these shots are recruited into the skimmed video regardless of the skimming ratio. This helps guarantee concept integrity. The other shots are "optional" shots in the sense that they may or may not be recruited, depending on the target skimming ratio.
  • The reconstruction reference tree (RRT) is a tree structure for video reconstruction guidance. According to embodiments, the RRT is built according to the principles described above regarding concept integrity, concept importance, the shot saliency penalty for audio-visual concept mismatch, and "must-in" versus optional shots for each concept.
  • In an embodiment, the root of the RRT is the video concept space, which is learned through the spectral clustering process. The first-level leaves are the concepts, sorted in descending order of importance from left to right, and the second-level leaves are the shots. Under each concept, the shots are sorted in descending order of saliency from top to bottom.
  • The first child of each concept is the "must-in" shot and the rest of the shots are optional shots. Since each concept may have a different number of shots, some "virtual shots" with zero saliency are included to form a rectangular array of all shots. The resulting array is called the shot table.
  • Turning to FIG. 8, embodiment RRT 400 is illustrated. Shot table 402 has shots ordered according to concept 404 and shot saliency. In an embodiment, concept categories 404 are ordered according to concept importance. Similarly, shots within each category are ordered according to saliency, such that the more salient shots are ordered at the top of each concept category and the least salient shots are ordered toward the bottom. In an embodiment, the ordered shots within each category can include must-in shots 406, optional shots 408, and virtual shots 410, as described above.
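Under the ordering rules above, the shot table might be built as follows. Representing shots as dicts with "frames" and "saliency" keys is an assumption of this sketch:

```python
def build_shot_table(concepts):
    """Arrange shots into the RRT shot table: one column per concept,
    columns sorted by descending importance (total frames), rows sorted
    by descending saliency, and short columns padded with zero-saliency
    virtual shots. Row 0 of each column is that concept's "must-in" shot."""
    cols = sorted(concepts.values(),
                  key=lambda shots: sum(s["frames"] for s in shots),
                  reverse=True)
    cols = [sorted(c, key=lambda s: s["saliency"], reverse=True) for c in cols]
    depth = max(len(c) for c in cols)
    return [c + [{"frames": 0, "saliency": 0.0, "virtual": True}
                 for _ in range(depth - len(c))]
            for c in cols]
```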
  • Given the RRT and shot table, the reconstruction process is relatively straightforward. The following describes an embodiment reconstruction algorithm:
  • ALGORITHM 1
    (Reconstruction)
    Input: RRT, Target skimming ratio Rt
    Output: skimmed video Vim, Current skimming ratio Rc
    Initialization: skimmed video = Empty;
    Current skimming ratio Rc = 0;
    Begin:
    Recruit a shot (must-in or optional, skip the virtual shots)
    in the shot table in raster scan order.
    Update Rc;
    If Rc≥Rt and all must-in shots are recruited
    Stop;
    Else Loop;
    End
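ALGORITHM 1's raster-scan recruitment can be sketched in a few lines, assuming the same dict-based shot table (shots as dicts with "frames" and "saliency" keys, virtual shots flagged with "virtual"):

```python
def reconstruct(shot_table, total_frames, target_ratio):
    """Scan the shot table in raster order (row by row across concepts),
    recruiting real shots until the current skimming ratio Rc reaches the
    target Rt and every must-in shot (row 0 of each column) is recruited."""
    must_ins = {(col, 0) for col, column in enumerate(shot_table)
                if not column[0].get("virtual")}
    recruited, frames = [], 0
    for row in range(len(shot_table[0])):
        for col, column in enumerate(shot_table):
            shot = column[row]
            if shot.get("virtual"):
                continue                    # virtual padding shots are skipped
            recruited.append((col, row))
            frames += shot["frames"]
            rc = frames / total_frames      # current skimming ratio Rc
            if rc >= target_ratio and must_ins <= set(recruited):
                return recruited, rc
    return recruited, frames / total_frames
```

Because the scan starts with row 0, the must-in shots of the most important concepts are always recruited first, which is what guarantees concept integrity.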
  • As the reconstruction is based on shots, the current skimming ratio Rc may not perfectly equal the target skimming ratio. In some embodiments, it may be more likely that Rc is slightly larger than Rt (due to the stop criteria). In order to precisely control the output video duration, pure frame-level skimming, based on the attention model, is used as post-processing. The audio-visual saliency of a frame t is computed as:

  • Sal_t = λ·Sal_t^v + (1−λ)·Sal_t^a.
  • The audio-visual saliency of every frame that appears in the output sequence is checked again. By thresholding on the saliency curve, frames with relatively low saliency are discarded, thereby allowing the final duration of the output video to satisfy the target duration. In addition, the smoothness requirement is also considered to yield a viewer-friendly skimmed video. A morphological smoothing operation is adopted, which includes deleting curve segments that are shorter than K frames and joining together curve segments that are less than K frames apart. In some embodiments, K is generally a small number, for example, 10 frames. Alternatively, other numbers can be used for K. The post-processing algorithm is described as follows:
  • ALGORITHM 2
    (post processing)
    Input: Vim, Rc, Rt
    Output: final skimmed video Vo
    Initialization: saliency curve formation of Vim using the user
    attention model calculate curve preserving ratio R = Rt/Rc
    Begin:
    smooth the saliency curve using median filter
    calculate a threshold T such that R percent of curve are on
    top of the threshold.
    thresholding Vim using T, do morphological smoothing
    and the remaining frames compose Vo
    End.
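ALGORITHM 2's thresholding and morphological smoothing can be sketched in plain Python. The median pre-filtering step is omitted here, and the helper names are assumptions of this sketch:

```python
def _runs(mask):
    """Return (start, end, value) runs of a boolean sequence."""
    runs, start = [], 0
    for i in range(1, len(mask) + 1):
        if i == len(mask) or mask[i] != mask[start]:
            runs.append((start, i, mask[start]))
            start = i
    return runs

def morphological_smooth(keep, k=10):
    """Delete kept segments shorter than k frames, then join kept
    segments separated by gaps of fewer than k frames."""
    keep = list(keep)
    for s, e, v in _runs(keep):
        if v and e - s < k:
            keep[s:e] = [False] * (e - s)     # drop a too-short segment
    runs = _runs(keep)
    for i, (s, e, v) in enumerate(runs):
        if not v and 0 < i < len(runs) - 1 and e - s < k:
            keep[s:e] = [True] * (e - s)      # bridge a too-short gap
    return keep

def postprocess(curve, preserve_ratio, k=10):
    """Pick a threshold T so roughly `preserve_ratio` of frames stay
    above it, then morphologically smooth the keep/drop decision."""
    ordered = sorted(curve)
    idx = min(int(len(curve) * (1.0 - preserve_ratio)), len(curve) - 1)
    t = ordered[idx]
    return morphological_smooth([c > t for c in curve], k)
```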
  • FIG. 9 illustrates an embodiment saliency curve 450 with threshold 452 chosen to obtain a preserving ratio of 95%. It should be appreciated that, in alternative embodiments, other thresholds can be applied to obtain other preserving ratios.
  • The generation of embodiment video skimming previews described hereinabove can be implemented in system 500 as shown in FIG. 10. Referring to that figure, shot segmentor 502 is configured to segment a video into individual shots. The saliency of each shot is determined by video saliency determination block 504, the output of which is used by key frame determining block 506 to determine a key frame from each shot. Visual feature extractor block 508 is configured to extract visual features, and visual word clustering block 509 is configured to cluster visual words to form visual concepts using the methods described above. Shot clustering block 510 is configured to cluster shots based on different visual and audio descriptors to build concept patterns.
  • In an embodiment, audio feature determination block 516 is configured to determine audio features from each segmented shot, and audio saliency determination block 518 is configured to determine the saliency of each audio feature. Audio clustering is performed by audio word clustering block 520 to produce audio concepts. Furthermore, audio and visual concepts are aligned by block 522.
  • In an embodiment, reconstruction reference tree generation block 512 creates an RRT based on saliency and concept importance, according to embodiments described herein. Moreover, video skimming preview generator 514 is configured to generate the skimming preview.
  • FIG. 11 illustrates a processing system 600 that can be utilized to implement methods of the present invention. In this case, the main processing is performed in by processor 602, which can be a microprocessor, digital signal processor or any other appropriate processing device. Program code (e.g., the code implementing the algorithms disclosed above) and data can be stored in memory 604. The memory can be local memory such as DRAM or mass storage such as a hard drive, optical drive or other storage (which may be local or remote). While the memory is illustrated functionally with a single block, it is understood that one or more hardware blocks can be used to implement this function.
  • In one embodiment, the processor can be used to implement some or all of the units shown in FIG. 11. For example, the processor can serve as a specific functional unit at different times to implement the subtasks involved in performing the techniques of the present invention. Alternatively, different hardware blocks (e.g., the same as or different from the processor) can be used to perform different functions. In other embodiments, some subtasks are performed by the processor while others are performed using separate circuitry.
  • FIG. 11 also illustrates I/O port 606, which can be used to provide video to and from the processor. Video source 608, the destination of which is not explicitly shown, is illustrated in dashed lines to indicate that it is not necessarily part of the system. For example, the source can be linked to the system by a network such as the Internet or by local interfaces (e.g., a USB or LAN interface). In some embodiments, display device 612 is coupled to I/O port 606 and supplies display 616, such as a CRT or flat-panel display, with a video signal, and audio device 610 is coupled to I/O port 606 and drives acoustic transducer 614 with an audio signal. Alternatively, audio device 610 and display device 612 can be coupled via a computer network, cable television network, or other type of network. In a further embodiment, video skimming previews can be generated offline and included on portable media such as DVDs and flash drives.
  • FIG. 12 provides a listing of notations used herein.
  • As discussed above, the present application provides a number of new features including, but not limited to, using a hierarchical video summarization framework for arbitrary and accurate skimming ratio control with integrated concept preservation. Embodiments also include the ability to provide progressive video reconstruction from concept groups for high-level summarization, concept group categorization by spectral clustering of video shots, and alignment of audio concept groups with video concept groups. In some embodiments, visual and audio Bag-of-Words techniques are used for feature extraction, where visual words are constructed by using SIFT (Scale-Invariant Feature Transform), and audio words are constructed by using Gabor-dictionary based Matching Pursuit (MP) techniques.
  • In embodiments, saliency masking is used to provide for robust and distinguishable Bag-of-Words feature extraction. In some embodiments, visual saliency curve shaping uses a PQFT plus dominant motion contrast attention model, and audio saliency curve shaping is performed using low-level audio features, such as Maximum Absolute Value, Spectral Centroid, RMS, and ZCR. Furthermore, saliency-curve based skimming is used as a low-level summarization.
  • An advantage of embodiments using spectral clustering is that spectral clustering favors the classification of locally-correlated data into one cluster, because it adds a constraint that distinguishes closely-located or locally-connected data and increases their similarity so that they are divided into one group. By this constraint, the clustering result approaches the human intuition that a cluster with consistent members is generally subject to a concentrated distribution. A further advantage of spectral clustering is that the clustering result is not sensitive to the number of members in the clusters.
  • While this invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications and combinations of the illustrative embodiments, as well as other embodiments of the invention, will be apparent to persons skilled in the art upon reference to the description. It is therefore intended that the appended claims encompass any such modifications or embodiments.

Claims (20)

We claim:
1. An apparatus comprising:
a processor;
a memory coupled to the processor;
a port coupled to the processor to electronically receive a plurality of video shots; and
a non-transitory computer-readable medium storing instructions that are operative, when executed by the processor to perform acts including:
analyzing each frame in a video shot from the plurality of video shots, the analyzing comprising determining a saliency of each frame of the video shot, the saliency being a content attentiveness saliency providing a measurement of representative shot properties;
determining an effective visual saliency based on the determined saliency, the effective visual saliency based on camera motion of the plurality of video shots and a human attention model of the video shot;
selecting a key frame of the video shot based on the effective visual saliency of each frame of the video shot;
extracting visual features from the key frame; performing shot clustering of the plurality of video shots to determine concept patterns based on the visual features; and
generating a hierarchical reconstruction based on the shot clustering, the hierarchical reconstruction enabling a skimming preview of the video.
2. The apparatus of claim 1, wherein the human attention model is based on camera motion.
3. The apparatus of claim 2, wherein:
the human attention model comprises a camera attenuation factor based on the camera motion,
wherein the processor performs acts further including:
determining the camera attenuation factor; and
determining the effective visual saliency comprises multiplying the determined effective visual saliency with the determined camera attenuation factor.
4. The apparatus of claim 3, wherein the determined camera attenuation factor is proportional to a zooming speed of a video shot from the plurality of video shots.
5. The apparatus of claim 3, wherein the determined camera attenuation factor is inversely proportional to a panning speed of a video shot from the plurality of video shots.
6. The apparatus of claim 1, wherein the hierarchical reconstruction comprises a reconstruction reference tree that includes video shots categorized according to each concept pattern.
7. The apparatus of claim 6, wherein generating the reconstruction reference tree comprises categorizing video shots within concept categories ordered according to concept importance, and ordering video shots within each concept category according to effective visual saliency.
8. The apparatus of claim 7, wherein the concept importance includes determining a total number of frames in a shot having a same concept.
9. The apparatus of claim 6, wherein the hierarchical reconstruction includes generating a video skimming preview based on the reconstruction reference tree.
10. The apparatus of claim 1, wherein the processor performs acts further including extracting audio features from the video shot.
11. The apparatus of claim 10, wherein extracting audio features comprises:
determining audio words from the video shot; and
performing clustering on the audio words.
12. The apparatus of claim 11, further comprising:
determining visual concept patterns based on the performing shot clustering; and
determining audio concept patterns based on the performing clustering on the audio words.
13. The apparatus of claim 12, further comprising:
calculating a number of member shots for each visual concept of the visual concept patterns and for each audio concept of the audio concept patterns;
sorting each visual concept by calculated number of shots;
sorting each audio concept by calculated number of shots; and
aligning visual concepts and audio concepts having a same number of shots.
14. A method comprising:
electronically receiving a reconstruction reference tree comprising video shots categorized within concept categories ordered according to concept importance, wherein video shots within each concept category are ordered according to saliency;
selecting shots starting from categories of highest importance and shots of highest saliency within the categories of highest importance; and
generating a preview based on the selected shots.
15. The method of claim 14, further comprising:
selecting frames within the selected shots having a highest saliency; and
using the frames within the selected shots having the highest saliency to generate the preview.
16. The method of claim 15, wherein selecting frames within the selected shots having a highest saliency comprises:
selecting a target skimming ratio;
determining a threshold according to the selected target skimming ratio; and comparing the saliency of the selected shots to the threshold.
17. The method of claim 16, wherein the selected target skimming ratio is an arbitrary length.
18. The method of claim 14, wherein the saliency comprises an effective visual saliency based on camera motion of the video shots.
19. A non-transitory computer readable medium with an executable program stored thereon, wherein the program instructs a microprocessor to perform the following steps:
analyzing each frame in a video shot from a plurality of video shots, the analyzing including determining a saliency of each frame of the video shot, the saliency being a content attentiveness saliency;
determining an effective visual saliency based on the determined saliency and based on camera motion of each frame of the video shot;
selecting a key frame of the video shot based on the effective visual saliency of each frame of the video shot;
extracting visual features from the key frame;
performing shot clustering of the plurality of video shots to determine concept patterns based on the visual features; and
generating a reconstruction reference tree based on the shot clustering, the reconstruction reference tree comprising video shots categorized according to each concept pattern.
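The effective-saliency and key-frame steps of claim 19 can be sketched as follows. The multiplicative camera-motion weighting is an assumption for illustration; the claim specifies only that effective saliency is based on both content saliency and camera motion:

```python
def key_frame_index(frame_saliencies, camera_motion_weights):
    """Select the key frame of a shot by effective visual saliency.

    frame_saliencies: content-attentiveness saliency per frame.
    camera_motion_weights: per-frame factor derived from camera motion
    (e.g., lower during fast pans); the exact weighting is assumed.
    Returns the index of the frame with the highest effective saliency.
    """
    # Effective saliency: content saliency modulated by camera motion.
    effective = [s * w for s, w in zip(frame_saliencies, camera_motion_weights)]
    return max(range(len(effective)), key=lambda i: effective[i])
```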
20. The non-transitory computer readable medium of claim 19, wherein the program instructs the microprocessor to further perform the steps of:
determining audio features of the video shot;
determining saliency of the determined audio features;
clustering determined audio features; and
aligning audio and video concept categories.
US16/171,116 2010-08-06 2018-10-25 Video Skimming Methods and Systems Abandoned US20190066732A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/171,116 US20190066732A1 (en) 2010-08-06 2018-10-25 Video Skimming Methods and Systems

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US37145810P 2010-08-06 2010-08-06
US13/103,810 US9171578B2 (en) 2010-08-06 2011-05-09 Video skimming methods and systems
US14/922,936 US10153001B2 (en) 2010-08-06 2015-10-26 Video skimming methods and systems
US16/171,116 US20190066732A1 (en) 2010-08-06 2018-10-25 Video Skimming Methods and Systems

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US14/922,936 Continuation US10153001B2 (en) 2010-08-06 2015-10-26 Video skimming methods and systems

Publications (1)

Publication Number Publication Date
US20190066732A1 true US20190066732A1 (en) 2019-02-28

Family

ID=45556235

Family Applications (3)

Application Number Title Priority Date Filing Date
US13/103,810 Expired - Fee Related US9171578B2 (en) 2010-08-06 2011-05-09 Video skimming methods and systems
US14/922,936 Expired - Fee Related US10153001B2 (en) 2010-08-06 2015-10-26 Video skimming methods and systems
US16/171,116 Abandoned US20190066732A1 (en) 2010-08-06 2018-10-25 Video Skimming Methods and Systems

Family Applications Before (2)

Application Number Title Priority Date Filing Date
US13/103,810 Expired - Fee Related US9171578B2 (en) 2010-08-06 2011-05-09 Video skimming methods and systems
US14/922,936 Expired - Fee Related US10153001B2 (en) 2010-08-06 2015-10-26 Video skimming methods and systems

Country Status (1)

Country Link
US (3) US9171578B2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814844A (en) * 2020-03-17 2020-10-23 同济大学 Intensive video description method based on position coding fusion
CN112364850A (en) * 2021-01-13 2021-02-12 北京远鉴信息技术有限公司 Video quality inspection method and device, electronic equipment and storage medium
US11445272B2 (en) 2018-07-27 2022-09-13 Beijing Jingdong Shangke Information Technology Co, Ltd. Video processing method and apparatus

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9171578B2 (en) * 2010-08-06 2015-10-27 Futurewei Technologies, Inc. Video skimming methods and systems
JP2012114559A (en) * 2010-11-22 2012-06-14 Jvc Kenwood Corp Video processing apparatus, video processing method and video processing program
US9020244B2 (en) * 2011-12-06 2015-04-28 Yahoo! Inc. Ranking and selecting representative video images
US20140328570A1 (en) * 2013-01-09 2014-11-06 Sri International Identifying, describing, and sharing salient events in images and videos
US20140157096A1 (en) * 2012-12-05 2014-06-05 International Business Machines Corporation Selecting video thumbnail based on surrounding context
KR102025362B1 (en) * 2013-11-07 2019-09-25 한화테크윈 주식회사 Search System and Video Search method
US10079040B2 (en) 2013-12-31 2018-09-18 Disney Enterprises, Inc. Systems and methods for video clip creation, curation, and interaction
CN107005676A (en) * 2014-12-15 2017-08-01 索尼公司 Information processing method, image processor and program
JP2016144080A (en) * 2015-02-03 2016-08-08 ソニー株式会社 Information processing device, information processing system, information processing method, and program
US9449248B1 (en) * 2015-03-12 2016-09-20 Adobe Systems Incorporated Generation of salient contours using live video
WO2016161136A1 (en) * 2015-03-31 2016-10-06 Nxgen Partners Ip, Llc Compression of signals, images and video for multimedia, communications and other applications
WO2017074448A1 (en) 2015-10-30 2017-05-04 Hewlett-Packard Development Company, L.P. Video content summarization and class selection
US10229324B2 (en) * 2015-12-24 2019-03-12 Intel Corporation Video summarization using semantic information
KR20170098079A (en) * 2016-02-19 2017-08-29 삼성전자주식회사 Electronic device method for video recording in electronic device
US10303984B2 (en) * 2016-05-17 2019-05-28 Intel Corporation Visual search and retrieval using semantic information
US11205103B2 (en) 2016-12-09 2021-12-21 The Research Foundation for the State University Semisupervised autoencoder for sentiment analysis
US10643485B2 (en) 2017-03-30 2020-05-05 International Business Machines Corporation Gaze based classroom notes generator
CN107220597B (en) * 2017-05-11 2020-07-24 北京化工大学 Key frame selection method based on local features and bag-of-words model human body action recognition process
CN107886109B (en) * 2017-10-13 2021-06-25 天津大学 Video abstraction method based on supervised video segmentation
CN107784662B (en) * 2017-11-14 2021-06-11 郑州布恩科技有限公司 Image target significance measurement method
KR102429901B1 (en) * 2017-11-17 2022-08-05 삼성전자주식회사 Electronic device and method for generating partial image
CN108111537B (en) * 2018-01-17 2021-03-23 杭州当虹科技股份有限公司 Method for quickly previewing online streaming media video content in MP4 format
CN108427713B (en) * 2018-02-01 2021-11-16 宁波诺丁汉大学 Video abstraction method and system for self-made video
US10777228B1 (en) 2018-03-22 2020-09-15 Gopro, Inc. Systems and methods for creating video edits
US10834295B2 (en) * 2018-08-29 2020-11-10 International Business Machines Corporation Attention mechanism for coping with acoustic-lips timing mismatch in audiovisual processing
US20200160889A1 (en) * 2018-11-19 2020-05-21 Netflix, Inc. Techniques for identifying synchronization errors in media titles
CN110555434B (en) * 2019-09-03 2022-03-29 浙江科技学院 Method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN112331337B (en) * 2021-01-04 2021-04-16 中国科学院自动化研究所 Automatic depression detection method, device and equipment

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5521841A (en) * 1994-03-31 1996-05-28 Siemens Corporate Research, Inc. Browsing contents of a given video sequence
US5664227A (en) * 1994-10-14 1997-09-02 Carnegie Mellon University System and method for skimming digital audio/video data
US5956026A (en) * 1997-12-19 1999-09-21 Sharp Laboratories Of America, Inc. Method for hierarchical summarization and browsing of digital video
US6331859B1 (en) * 1999-04-06 2001-12-18 Sharp Laboratories Of America, Inc. Video skimming system utilizing the vector rank filter
US6535639B1 (en) * 1999-03-12 2003-03-18 Fuji Xerox Co., Ltd. Automatic video summarization using a measure of shot importance and a frame-packing method
KR20030054352A (en) * 2001-12-24 2003-07-02 주식회사 케이티 Method of Video Summary through Hierarchical Shot Clustering having Threshold Time using Video Summary Time
US20030210886A1 (en) * 2002-05-07 2003-11-13 Ying Li Scalable video summarization and navigation system and method
US20030234805A1 (en) * 2002-06-19 2003-12-25 Kentaro Toyama Computer user interface for interacting with video cliplets generated from digital video
US20040125877A1 (en) * 2000-07-17 2004-07-01 Shin-Fu Chang Method and system for indexing and content-based adaptive streaming of digital video content
US6964021B2 (en) * 2000-08-19 2005-11-08 Lg Electronics Inc. Method and apparatus for skimming video data
US7158676B1 (en) * 1999-02-01 2007-01-02 Emuse Media Limited Interactive system
US20070101269A1 (en) * 2005-10-31 2007-05-03 Microsoft Corporation Capture-intention detection for video content analysis
US20080112684A1 (en) * 2006-11-14 2008-05-15 Microsoft Corporation Space-Time Video Montage
US20080127270A1 (en) * 2006-08-02 2008-05-29 Fuji Xerox Co., Ltd. Browsing video collections using hypervideo summaries derived from hierarchical clustering
US20090083790A1 (en) * 2007-09-26 2009-03-26 Tao Wang Video scene segmentation and categorization
US8363960B2 (en) * 2007-03-22 2013-01-29 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Method and device for selection of key-frames for retrieving picture contents, and method and device for temporal segmentation of a sequence of successive video pictures or a shot
US8493448B2 (en) * 2006-12-19 2013-07-23 Koninklijke Philips N.V. Method and system to convert 2D video into 3D video
US20130282747A1 (en) * 2012-04-23 2013-10-24 Sri International Classification, search, and retrieval of complex video events
US8599316B2 (en) * 2010-05-25 2013-12-03 Intellectual Ventures Fund 83 Llc Method for determining key video frames
US20150143239A1 (en) * 2013-11-20 2015-05-21 Google Inc. Multi-view audio and video interactive playback
US9171578B2 (en) * 2010-08-06 2015-10-27 Futurewei Technologies, Inc. Video skimming methods and systems

Family Cites Families (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5635982A (en) * 1994-06-27 1997-06-03 Zhang; Hong J. System for automatic video segmentation and key frame extraction for video sequences having both sharp and gradual transitions
US5485611A (en) * 1994-12-30 1996-01-16 Intel Corporation Video database indexing and method of presenting video database index to a user
US5708767A (en) * 1995-02-03 1998-01-13 The Trustees Of Princeton University Method and apparatus for video browsing based on content and structure
US5828809A (en) * 1996-10-01 1998-10-27 Matsushita Electric Industrial Co., Ltd. Method and apparatus for extracting indexing information from digital video data
US6956573B1 (en) * 1996-11-15 2005-10-18 Sarnoff Corporation Method and apparatus for efficiently representing storing and accessing video information
US6340971B1 (en) * 1997-02-03 2002-01-22 U.S. Philips Corporation Method and device for keyframe-based video displaying using a video cursor frame in a multikeyframe screen
US6473095B1 (en) * 1998-07-16 2002-10-29 Koninklijke Philips Electronics N.V. Histogram method for characterizing video content
US6721454B1 (en) * 1998-10-09 2004-04-13 Sharp Laboratories Of America, Inc. Method for automatic extraction of semantically significant events from video
US6492998B1 (en) * 1998-12-05 2002-12-10 Lg Electronics Inc. Contents-based video story browsing system
US6342904B1 (en) * 1998-12-17 2002-01-29 Newstakes, Inc. Creating a slide presentation from full motion video
US6748158B1 (en) * 1999-02-01 2004-06-08 Grass Valley (U.S.) Inc. Method for classifying and searching video databases based on 3-D camera motion
SE9902328A0 (en) * 1999-06-18 2000-12-19 Ericsson Telefon Ab L M Procedure and system for generating summary video
KR100350792B1 (en) * 1999-09-22 2002-09-09 엘지전자 주식회사 Multimedia data browsing system based on user profile
US7016540B1 (en) * 1999-11-24 2006-03-21 Nec Corporation Method and system for segmentation, classification, and summarization of video images
US6549643B1 (en) * 1999-11-30 2003-04-15 Siemens Corporate Research, Inc. System and method for selecting key-frames of video data
AUPQ535200A0 (en) * 2000-01-31 2000-02-17 Canon Kabushiki Kaisha Extracting key frames from a video sequence
US6642940B1 (en) * 2000-03-03 2003-11-04 Massachusetts Institute Of Technology Management of properties for hyperlinked video
US6616529B1 (en) * 2000-06-19 2003-09-09 Intel Corporation Simulation and synthesis of sports matches
US7653530B2 (en) * 2000-07-13 2010-01-26 Novell, Inc. Method and mechanism for the creation, maintenance, and comparison of semantic abstracts
US6807361B1 (en) * 2000-07-18 2004-10-19 Fuji Xerox Co., Ltd. Interactive custom video creation system
US20040125124A1 (en) * 2000-07-24 2004-07-01 Hyeokman Kim Techniques for constructing and browsing a hierarchical video structure
US20020157116A1 (en) * 2000-07-28 2002-10-24 Koninklijke Philips Electronics N.V. Context and content based information processing for multimedia segmentation and indexing
US6697523B1 (en) * 2000-08-09 2004-02-24 Mitsubishi Electric Research Laboratories, Inc. Method for summarizing a video using motion and color descriptors
US7110458B2 (en) * 2001-04-27 2006-09-19 Mitsubishi Electric Research Laboratories, Inc. Method for summarizing a video using motion descriptors
US7296231B2 (en) * 2001-08-09 2007-11-13 Eastman Kodak Company Video structuring by probabilistic merging of video segments
US7474698B2 (en) * 2001-10-19 2009-01-06 Sharp Laboratories Of America, Inc. Identification of replay segments
US7120873B2 (en) * 2002-01-28 2006-10-10 Sharp Laboratories Of America, Inc. Summarization of sumo video content
US7263660B2 (en) * 2002-03-29 2007-08-28 Microsoft Corporation System and method for producing a video skim
US8872979B2 (en) * 2002-05-21 2014-10-28 Avaya Inc. Combined-media scene tracking for audio-video summarization
US7222300B2 (en) * 2002-06-19 2007-05-22 Microsoft Corporation System and method for automatically authoring video compositions using video cliplets
US7260257B2 (en) * 2002-06-19 2007-08-21 Microsoft Corp. System and method for whiteboard and audio capture
FR2845179B1 (en) * 2002-09-27 2004-11-05 Thomson Licensing Sa METHOD FOR GROUPING IMAGES OF A VIDEO SEQUENCE
US7131059B2 (en) * 2002-12-31 2006-10-31 Hewlett-Packard Development Company, L.P. Scalably presenting a collection of media objects
US7212666B2 (en) * 2003-04-01 2007-05-01 Microsoft Corporation Generating visually representative video thumbnails
JP2005277531A (en) * 2004-03-23 2005-10-06 Seiko Epson Corp Moving image processing apparatus
US20050228849A1 (en) * 2004-03-24 2005-10-13 Tong Zhang Intelligent key-frame extraction from a video
US7986372B2 (en) * 2004-08-02 2011-07-26 Microsoft Corporation Systems and methods for smart media content thumbnail extraction
US7388586B2 (en) * 2005-03-31 2008-06-17 Intel Corporation Method and apparatus for animation of a human speaker
US8316301B2 (en) * 2005-08-04 2012-11-20 Samsung Electronics Co., Ltd. Apparatus, medium, and method segmenting video sequences based on topic
US7904455B2 (en) * 2005-11-03 2011-03-08 Fuji Xerox Co., Ltd. Cascading cluster collages: visualization of image search results on small displays
JP2007150724A (en) * 2005-11-28 2007-06-14 Toshiba Corp Video viewing support system and method
US8379154B2 (en) * 2006-05-12 2013-02-19 Tong Zhang Key-frame extraction from video
US8301669B2 (en) * 2007-01-31 2012-10-30 Hewlett-Packard Development Company, L.P. Concurrent presentation of video segments enabling rapid video file comprehension
KR101456652B1 (en) * 2007-02-01 2014-11-04 이섬 리서치 디벨러프먼트 컴파니 오브 더 히브루 유니버시티 오브 예루살렘 엘티디. Method and System for Video Indexing and Video Synopsis
EP2399386A4 (en) * 2009-02-20 2014-12-10 Indian Inst Technology Bombay A device and method for automatically recreating a content preserving and compression efficient lecture video
US8571330B2 (en) * 2009-09-17 2013-10-29 Hewlett-Packard Development Company, L.P. Video thumbnail selection
US8280158B2 (en) * 2009-10-05 2012-10-02 Fuji Xerox Co., Ltd. Systems and methods for indexing presentation videos
CN103210651B (en) * 2010-11-15 2016-11-09 华为技术有限公司 Method and system for video summary

Also Published As

Publication number Publication date
US9171578B2 (en) 2015-10-27
US20120033949A1 (en) 2012-02-09
US10153001B2 (en) 2018-12-11
US20160111130A1 (en) 2016-04-21

Similar Documents

Publication Publication Date Title
US20190066732A1 (en) Video Skimming Methods and Systems
EP2641401B1 (en) Method and system for video summarization
Guan et al. Keypoint-based keyframe selection
Ejaz et al. Efficient visual attention based framework for extracting key frames from videos
Dang et al. RPCA-KFE: Key frame extraction for video using robust principal component analysis
CN108307229B (en) Video and audio data processing method and device
US8457469B2 (en) Display control device, display control method, and program
US8467610B2 (en) Video summarization using sparse basis function combination
US20120057775A1 (en) Information processing device, information processing method, and program
Sreeja et al. Towards genre-specific frameworks for video summarisation: A survey
Mironică et al. A modified vector of locally aggregated descriptors approach for fast video classification
Mademlis et al. Multimodal stereoscopic movie summarization conforming to narrative characteristics
Sidiropoulos et al. Enhancing video concept detection with the use of tomographs
WO2010084738A1 (en) Collation weighting information extracting device
Gade et al. Audio-visual classification of sports types
Sun et al. Automatic annotation of web videos
Kamishima et al. Event detection in consumer videos using GMM supervectors and SVMs
Acar et al. Fusion of learned multi-modal representations and dense trajectories for emotional analysis in videos
Psallidas et al. Multimodal video summarization based on fuzzy similarity features
Cohendet et al. Transfer Learning for Video Memorability Prediction.
Khan et al. Semantic analysis of news based on the deep convolution neural network
Gao et al. Multi-modality movie scene detection using kernel canonical correlation analysis
Quenot et al. Rushes summarization by IRIM consortium: redundancy removal and multi-feature fusion
Vallet et al. Robust visual features for the multimodal identification of unregistered speakers in tv talk-shows
Bhaumik et al. Real-time storyboard generation in videos using a probability distribution based threshold

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION