WO2005093752A1 - Method and system for detecting audio and video scene changes - Google Patents



Publication number
WO2005093752A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
video
scene
shots
shot
Application number
PCT/GB2005/001027
Other languages
English (en)
Inventor
Li-Qun Xu
Sergio Benini
Original Assignee
British Telecommunications Public Limited Company
Application filed by British Telecommunications Public Limited Company
Publication of WO2005093752A1

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/102 Programmed access in sequence to addressed parts of tracks of operating record carriers
    • G11B27/19 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording
    • G11B27/34 Indicating arrangements

Definitions

  • This invention relates to a video and audio content analysis and segmentation method and system which allows for segmentation of video scenes and audio scenes at the semantic level.
  • The two segmentation results may be integrated together to provide richer semantic understanding of the content.
  • The hierarchical model of a movie structure can usually be organised on a three-level basis, comprising (from low to high level) the shot level, event level, and episode (or scene) level.
  • A shot is a segment of audio-visual data filmed in a single camera take.
  • Most multimedia content analysis tasks start with the decomposition of the entire video into elementary shots, which is necessary for the extraction of audio-visual content descriptors.
  • An event is the smallest semantic unit of a movie. It can be a dialogue, an action scene or, in general, a set of contiguous shots which share location and time.
  • An episode (or scene) is normally defined to be a sequence of shots that share a common semantic thread and can contain one or more events.
  • Typically, episode boundary detection is performed using only automatically detected low-level features, without any prior knowledge. It is often the case, therefore, that the detected scene boundaries do not correspond precisely to those of an actual scene.
  • For this reason, researchers have introduced the so-called computable scene [6], or logical story unit (LSU) [1], which represents the best approximation to real movie episodes.
  • LSUs are defined in terms of specific spatio-temporal features which are characteristic of the scene under analysis.
  • A shot can either be part of an event or serve for its 'description' [1]. This means that a shot can show a particular aspect of an event which is taking place (such as a human face during dialogue) or can show the scenery where the succeeding event takes place.
  • These two kinds of shots are respectively referred to as 'event' shots and 'descriptive' shots.
  • Usually, the presence of a 'descriptive' shot at the beginning of an episode works as an introduction to the scenery for the following 'event' shots. For example, in the comedy film "Notting Hill" (copyright Polygram Holding, Inc.) we see many times a shot showing a bookshop from the exterior, while succeeding shots elaborate on what is happening inside the bookshop.
  • N-type scenes (normal scenes) are characterised by a long-term consistency of chromatic composition, lighting condition and sound.
  • M-type scenes (montage scenes) are characterised by widely different visual contents (e.g., different location, timing, lighting condition, characters, etc.), often with long-term consistency in the audio content.
  • Many post-production video programme genres, such as movies, documentaries and sitcoms, exhibit this kind of scene structure.
  • The present invention aims to provide a further video and audio segmentation technique in addition to those described above, one which allows for ready integration of the video and audio segmentations to provide a richer semantic understanding of the content.
  • According to one aspect, there is provided a method of determining semantic content information relating to a video sequence provided with an audio sequence, comprising the steps of: segmenting the video sequence into video segments each corresponding to an identifiable video shot; determining data defining video scenes of the video sequence, each comprising one or more of the video segments; segmenting the audio sequence into audio segments, each audio segment temporally corresponding to a respective video segment; determining data defining audio scenes of the audio sequence, each comprising one or more of the audio segments; and integrating the video scene data and the audio scene data to provide semantic scene information data indicative of the semantic content of the audio-video sequence.
  • Preferably, the integrating step comprises applying one or more heuristic rules.
  • The concept of an audio shot is thus introduced, corresponding to readily identifiable video shots.
  • As a result, the integration of the respective determined audio and video semantic scene information becomes relatively straightforward, giving richer semantic understanding of the content.
  • The fusion of audio and visual analysis results through heuristic rules also provides further advantages.
  • The industrial applicability of the invention lies in automating the time-consuming and laborious process of organising and indexing increasingly large video databases, such that they can be easily browsed and searched using natural query structures that are close to human concepts.
  • According to another aspect, there is provided a method of determining semantic content information relating to a video sequence having an associated audio sequence, comprising the steps of identifying different video scenes of the video sequence, each different video scene being separated by a video scene boundary, identifying different audio scenes of the audio sequence, each different audio scene being separated by an audio scene boundary, and deriving information indicative of the semantic content of the audio-video sequence based on a comparison of the relative timing between the audio and video scene boundaries.
  • Figure 1(a) is a block diagram of a system architecture of an embodiment of the invention
  • Figure 1 (b) is a block diagram of a system architecture of an embodiment of the invention
  • Figures 2(a)-(c) are diagrams illustrating vector quantisation codebook generation
  • Figure 3 is a diagram illustrating how VQ codebook distances are generated between video shots
  • Figure 4 is a diagram illustrating a sequence of shots as an original shot graph (OSG)
  • Figure 5 is a diagram illustrating distances between clusters of shots, called VQ distance graph
  • Figure 6 is a diagram illustrating distances between clusters of shots
  • Figure 7 is a diagram illustrating an example of clustering of shots for the first iteration
  • Figure 8 is a diagram illustrating the replacement of the two clusters (C_2, C_4) by a new cluster C_2'
  • Figure 9 is a diagram illustrating the validation of a clustering operation
  • Figure 10 is a diagram illustrating the time-constrained sub-clustering of shots
  • Figure 11 is an example of the directed graph of temporally local sub-clusters thus obtained
  • The method and system of embodiments of the invention are intended to operate on a video sequence such as provided by an MPEG video stream or the like. It should be noted, however, that the embodiments are not concerned with decoding any encoded video sequence, whether the encoding be MPEG or otherwise. It is assumed that input to the embodiments of the invention is in the form of decoded video data.
  • The method and system are, in this embodiment, implemented by way of a computer program arranged to be executed under the control of a processor, such as in a personal computer.
  • The computer program may be made available on a portable storage medium, such as a floppy or compact disk, from which it may thereafter be stored on and/or executed by the computer.
  • In a first step 1, the entire video stream is broken up into elementary camera shots using a known automated method.
  • Techniques for decomposing the video sequence into individual shots are known in the art, such as those described in L-Q. Xu, J. Zhu, and F. W. M. Stentiford, "Video summarisation and semantic editing tools," in Storage and Retrieval for Media Databases, Proc. of SPIE, Vol. 4315, San Jose, USA, 21 - 26 Jan, 2001. The contents of this document are incorporated herein by reference.
  • In a subsequent step 3, for each elementary shot, one or more keyframes are extracted to represent that shot's visual content by way of some characteristic 'signature'.
  • The signature is provided by a vector quantisation (VQ) codebook, which is generated using known techniques in a subsequent step 5.
  • The VQ codebook is derived from low-level visual features (e.g., colours, textures, etc.).
  • Shot keyframe selection is known in the art, as described in Y. Zhuang, Y. Rui, T. Huang, S. Mehrotra, "Adaptive key frame extraction using unsupervised clustering," in Proc. of IEEE Int'l Conf. on Image Processing, pp. 866-870, Chicago, October 1998.
  • Vector quantisation codebook techniques are also known in the art, as described in R.M. Gray, "Vector quantization," IEEE ASSP Magazine, Vol. 1, pp. 4-29, April 1984.
  • An audio shot can also be defined.
  • The length of the audio shot is normally selected so as to correspond to the length of a visual shot, but it can be a concatenation of a few adjacent visual shots if a visual shot is too short.
  • Each audio shot is characterised by short-term spectral characteristics, such as Mel-Frequency Cepstrum Coefficients (MFCCs).
  • The audio shot's content 'signature', which can also comprise a VQ codebook, is then computed in a subsequent stage 11 on the basis of the aggregated short-term audio characteristics over the entire shot, as described later.
  • Following the codebook generation of step 5, visual clustering of shots is performed in a subsequent step 7 based on the VQ codebooks.
  • The aim is to perform a global grouping of all shots of the video stream into a number of so-called clusters based on the similarity of their visual content signatures, i.e. the VQ codebooks.
  • Initially, each cluster contains a single shot.
  • A clustering algorithm employing a well-defined distance metric is used, which algorithm also has error-correction capabilities if two clusters are similar enough (in terms of having minimal distance between them) to allow merging.
  • The visual clusters are output to an audio-visual (A/V) profile analysis stage 15.
  • Segmentation of the accompanying audio stream is performed in stage 13 using a similar vector quantisation technique to that applied to the video stream.
  • The dissimilarity between audio content signatures for consecutive shots is computed using the well-known Earth Mover's Distance (EMD) metric.
  • Next, time-constrained cluster analysis 17 is performed on the received video clusters. In the previous stage 7, each shot within a cluster is time-stamped. In order to distinguish visually similar scenes that occur at different times/stages of a programme, e.g. scenes having been captured in a similar physical setting, such as a particular pub or flat, the time-constrained cluster analysis stage 17 is arranged to perform the task of checking the temporal consistency of the scene using a sliding window technique.
  • As a result, a number of clusters are generated in which the shots are not only similar to each other in terms of appearance, but are also adjacent in time. There is also generated a graphical representation, a Scene Transition Graph (STG), describing the temporal relationship between these clusters.
  • Graphical analysis is performed to derive the final semantic scenes by associating a weaker cluster A with the same semantic label as a second cluster B. Although cluster A may be visually different from cluster B, it is nevertheless temporally sandwiched between the shots belonging to B.
  • The STG-based analysis step 19 detects the "cut edges" of the transition graph, which delimit semantically distinct video segments.
  • The output of the STG-based analysis step 19 provides useful information as to the semantic structure of the video sequence at the scene level, rather than merely at the shot level, and can be used in several ways.
  • The subsequent automatic scene change detection step 21 provides a first step towards greater semantic understanding of a video sequence, such as a movie, since breaking the movie up into scenes assists in creating content summaries, which in turn can be exploited to enable non-linear navigation inside the movie.
  • The determination of visual structure within each scene helps the process of visualising each scene in the film summary.
  • The visual clustering stage 7 can itself provide useful information, even without the subsequent time-constrained analysis stage 17. This is because the visual clustering stage 7 groups all shots having similar semantic content together in the same cluster.
  • The visual clustering can then be used to cluster together all shots taken at a given location in an automatic manner, for identification and subsequent display to the user.
  • Finally, fusion of the video and audio results can be performed. This stage 23 takes, as input, the outcome of the two above-described groupings of steps and generates three different explanations of audio-visual scene changes based on a set of heuristic rules. The processing involved in the above steps will now be described in more detail.
  • Key-frames corresponding to each shot are received and analysed.
  • Each selected keyframe is decoded as a still 352x288 image in LUV format.
  • Each key-frame is sub-sampled by a factor of 2, and the image is subdivided into blocks of 4x4 pixels. Since the sub-sampled image will have a display format of 176x144, it will be appreciated that 1584 4x4 pixel blocks will be present.
  • These pixel blocks serve as input vectors (or a 'training set') to a vector quantiser for generation of a suitable codebook which is used to characterise the particular key-frame.
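The block extraction described above can be sketched as follows. This is a minimal sketch assuming the decoded 352x288 LUV frame is held in a NumPy array; the function name is illustrative, not from the patent:

```python
import numpy as np

def keyframe_training_set(frame_luv: np.ndarray) -> np.ndarray:
    """Turn a 288x352x3 LUV keyframe into VQ training vectors.

    The frame is sub-sampled by a factor of 2 (to 144x176) and cut
    into non-overlapping 4x4 pixel blocks; each block, with its 3
    colour channels, is flattened into a 48-dimensional vector.
    """
    assert frame_luv.shape == (288, 352, 3)
    sub = frame_luv[::2, ::2, :]                  # 144x176x3
    h, w, c = sub.shape
    blocks = (sub.reshape(h // 4, 4, w // 4, 4, c)
                 .transpose(0, 2, 1, 3, 4)        # (36, 44, 4, 4, 3)
                 .reshape(-1, 4 * 4 * c))         # (1584, 48)
    return blocks

# A dummy frame yields the 1584 training vectors mentioned in the text.
frame = np.zeros((288, 352, 3))
print(keyframe_training_set(frame).shape)         # (1584, 48)
```

Each row of the result is one 48-dimensional training vector for the codebook generation described below.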
  • This type of vector quantiser is called a 1-dimensional 2-bit vector quantiser, having a rate of 2 bits per dimension.
  • In Figure 2(c) an example of a two-dimensional vector quantiser is shown. It will be seen that every pair of numbers falling in a particular region is approximated by a single value indicated by a circle 25. In this case, there are 16 regions and 16 circles, each of which can be uniquely represented by 4 bits. Thus, Figure 2(c) represents a 2-dimensional 4-bit vector quantiser having a rate of 2 bits per dimension. In the examples of Figures 2(b) and 2(c) the circles are called codewords and the regions are called encoding regions. The set of all codewords is called a codebook.
  • The VQ process involves substituting each input image block with a predefined block (chosen among the codebook vectors) so as to minimise a distortion measure.
  • In this way, the entire image can be reconstructed using only blocks belonging to the codebook.
  • An image is characterised by some homogeneous colour zones of different sizes, implying that the pixels belonging to the same block share some colour properties, so that correlation inside one block is likely to be very high. The larger the block size, however, the less correlation there is likely to be between pixels inside the block.
  • In each image there is always the presence of dominant colours corresponding to particular combinations of colour components.
  • VQ Codebook Generation
The codebook to be generated for each visual shot contains C codewords, each of D dimensions. Thereafter, we call this VQ codebook the signature of the key-frame (and so the visual shot) to which it relates.
  • The codebook comprises the following elements.
  • The C codewords, which are, respectively, the centroid values of each cluster in the final codebook.
  • Each 48-dimensional vector is applied to a vector quantiser and, depending on the region within which the vector lies, the corresponding centroid value is thereafter assigned to that vector. Here D = 48, and the i-th D-dimensional vector is denoted (p_1^i, ..., p_D^i).
  • Weights of the codewords, which account for the number of 4x4 blocks M_c associated with each codeword c. A normalised weight between 0 and 1 is used, i.e. w_c = M_c / Σ_c M_c, where the denominator is the total number of training vectors and M_c is the number of vectors falling within the c-th cluster. Note that all codewords with no or only one associated block are discarded.
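A minimal sketch of such a codebook trainer, using plain Lloyd's (k-means) iterations. The patent extract does not specify the training algorithm, so the algorithm choice and function name here are assumptions; only the weight normalisation and the discarding of near-empty codewords follow the text:

```python
import numpy as np

def generate_codebook(vectors, C=16, iters=20, seed=0):
    """Train a C-codeword VQ codebook on D-dimensional training
    vectors and return (codewords, normalised weights)."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), C, replace=False)].copy()
    for _ in range(iters):
        # Assign every training vector to its nearest centroid.
        d = np.linalg.norm(vectors[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(C):
            if (labels == c).any():
                centroids[c] = vectors[labels == c].mean(axis=0)
    counts = np.bincount(labels, minlength=C)
    keep = counts > 1            # discard codewords with 0 or 1 blocks
    # Normalised weights: the denominator is the total vector count.
    return centroids[keep], counts[keep] / counts.sum()
```

The returned centroids and weights together form the shot signature used by the distance metric below.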
  • The VQ codebook distance metric (VQCDM)
As mentioned above, once the codebook for each video shot is obtained, a clustering algorithm employing a well-defined distance metric is used. Specifically, in this embodiment, a VQ codebook distance metric (VQCDM) is used. Referring to Figure 3, the VQCDM between any two shots can be computed in two steps.
  • The two codebooks being compared may have different valid sizes, i.e. the first shot's codebook has a size N and the second a size M, where M ≤ N. This is not unusual since, as mentioned before, when some codebook vectors have no associated blocks they are simply discarded to reduce the codebook size.
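The extract does not spell out the two computation steps of the VQCDM, so the following is only a plausible sketch: a symmetrised, weight-averaged nearest-codeword distance that copes with codebooks of different sizes M ≤ N:

```python
import numpy as np

def vqcdm(cb_a, w_a, cb_b, w_b):
    """Plausible VQ codebook distance between two shots' signatures.

    cb_*: (size, D) codeword arrays; w_*: matching weight vectors.
    """
    # Pairwise Euclidean distances between the two codeword sets.
    d = np.linalg.norm(cb_a[:, None] - cb_b[None], axis=2)
    # Match each codeword to its closest counterpart in the other
    # codebook, weighted by the codeword's normalised weight.
    a_to_b = float((w_a * d.min(axis=1)).sum())
    b_to_a = float((w_b * d.min(axis=0)).sum())
    return 0.5 * (a_to_b + b_to_a)
```

Identical codebooks give a distance of zero, and the measure grows as the dominant colour blocks of the two shots drift apart.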
  • The grouping of video shots according to visual similarity gives the next hierarchical-level description of the video sequence.
  • In the next step 7 we use the VQ-based shot-level visual content descriptions, and the VQCDMs described above, in a clustering algorithm. Note that this scheme is neither tied to the genre of a video, nor does it need specific knowledge of the underlying story structure.
  • The clustering process assumes the presence of repetitive (or at least similar) shot structures along the sequence. This is a reasonable assumption for highly-structured programmes in a wide range of genres, including feature movies, be they comedies or dramas, situation-comedies and cartoons.
  • The story structure can be partially lost, though, when, for example, the director uses a rapid succession of shots to highlight suspenseful moments, or uses a series of shots merely to develop the plot of the movie.
  • Nevertheless, the clustering algorithm based on VQCDMs provides good performance, since shots belonging to the same scene usually share at least a similar chromatic composition, or environmental lighting condition, of the scene.
  • The clustering procedure is split into two parts: a time-unconstrained procedure and a time-constrained procedure.
  • Time-unconstrained clustering procedure
Initially, we assume there are M clusters C_1, ..., C_M, each representing a respective shot S_1, ..., S_M.
  • This situation is represented by a simple graph, called an Original Shot Graph (OSG), in which nodes correspond to the clusters (or shots) and edges/arrows indicate the transitions between clusters.
  • The VQCDM is computed for all shot combinations along the temporal axis so as to explore exhaustively the visual similarities within the entire video sequence.
  • A cluster, together with the VQCDMs representing its distance from all other clusters, is represented as a node on an updated graph, called a VQ distance graph.
  • An example VQ distance graph is illustrated in Figure 5.
  • The VQ distance graph contains 4 clusters C_1, ..., C_4, some of which include more than one shot.
  • For instance, cluster C_1 includes two shots. Since the VQCDMs for a pair of clusters are symmetric, for ease of explanation, Figure 6 shows only one distance value for each pair.
  • The above-described procedure aims to merge a reference cluster R with its most visually similar test cluster T, wherein R ≠ T, in the sense of minimal VQCDM, thereby forming a new single cluster R' in place of R's position in the timeline. According to this merge operation, all shots belonging to reference cluster R and test cluster T become shots belonging to the new cluster R'.
  • Figure 7 shows an example of a first merge operation (in which each cluster includes only one shot), the merge taking place between cluster C_2 (i.e. the reference cluster R) and C_4 (i.e. the test cluster T).
  • For the new cluster R', a new VQ codebook is required to represent or characterise its visual content.
  • The same process described above is employed to generate the codebook, with the difference that the key-frames of all shots belonging to R' are used for the codebook generation.
  • The VQCDMs between the cluster R' and all the other clusters are then calculated in preparation for the next step.
  • As clusters merge, the shot representation error in the VQ codebook is likely to increase for a given cluster.
  • In Figure 8 we show how the error is generated after the first merging of clusters C_2 and C_4 with respect to the OSG.
  • The error resulting from representing a shot with a new cluster VQ codebook can easily be computed using the VQCDM.
  • Specifically, the error is given by the sum of the distances between the VQ codebook of cluster R' and the OSG VQ codebooks of all shots belonging to R'.
  • The error is given by:

VQ_err(step) = Σ_{S_i ∈ R'} VQ_Dist(S_i, R')

where R' is the newly formed cluster and the sum runs over all shots S_i ∈ R'. Since we are grouping similar shots into clusters, the VQ codebook for a particular cluster is at risk of losing specificity and accuracy in representing the visual contents of its individual shots as the cluster size increases. To prevent this degenerative process, after each iteration, statistical analysis is performed on the error generated in the latest merging step, in order to evaluate how generic the new VQ codebook is in representing shots of the cluster, and hence to decide whether the merging step is to be retained or invalidated.
  • If the merge is invalidated, the previous clustering operation is reversed (i.e. the merged cluster is split into the shots/clusters which existed prior to the clustering operation).
  • The reference cluster is then locked out of future clustering processes, and a new reference and test cluster (from those currently classed as unlocked) are selected with the minimal VQCDM for the next clustering iteration. The iteration process is repeated until no more unlocked clusters are available for merging. Note again that, as the size of the merged cluster increases, visually dissimilar shots are likely to enter the cluster, further corrupting the representativeness of the VQ codebook. Whilst the analysis and subsequent un-clustering operations described above can be used to prevent such a degenerative process, it is also useful to lock a cluster when its size exceeds a certain threshold, e.g. 12-15 shots. Further cluster locking criteria may also be used.
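The merge-lock-validate loop just described can be sketched as follows. The `dist` and `validate` callables are placeholders for the VQCDM and the statistical error test, which are not reproduced here; the structure of the loop (pick closest unlocked pair, merge, validate or lock) follows the text:

```python
def cluster_shots(shots, dist, validate, max_size=12):
    """Iterative merge-and-validate clustering sketch.

    dist(a, b): VQCDM stand-in between two clusters (lists of shots).
    validate(cluster): returns False to invalidate a merge, standing
    in for the statistical analysis of the VQ representation error.
    """
    clusters = [[s] for s in shots]          # one cluster per shot
    locked = set()
    while True:
        # Pick the unlocked pair with minimal distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if i in locked or j in locked:
                    continue
                d = dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None:
            break                            # nothing left to merge
        _, i, j = best
        merged = clusters[i] + clusters[j]
        if validate(merged) and len(merged) <= max_size:
            clusters[i] = merged             # R' takes R's position
            del clusters[j]
            locked = {k - (k > j) for k in locked}
        else:
            locked.add(i)                    # reverse merge, lock R
    return clusters
```

With `dist` as the difference of cluster means and `validate` as a bound on cluster spread, shots `[1, 2, 10, 11]` collapse into `[[1, 2], [10, 11]]`.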
  • The time-unconstrained analysis described above is very useful for many types of video programme, such as movies, since it groups shots into the same cluster based only on the visual contents of the shots and without considering the timing of the context.
  • This approach does not set an a priori time limit on scene duration (which is a problem in [2], for instance) and, furthermore, can be useful for certain retrieval purposes, such as user-defined queries searching for repeats. For example, when watching the film "Notting Hill" a viewer may wish to see all the scenes set around the 'bookshop' in the order in which they appear. This is straightforward if all similar shots are grouped in the same cluster, which is possible with the time-unconstrained approach.
  • Nevertheless, to recover the temporal structure of scenes, time-constrained analysis should be performed on every cluster.
  • The aim of time-constrained analysis is to split a cluster into one or more temporally-consistent sub-clusters (see Figure 10) according to a temporal locality criterion.
  • The time-constrained splitting criterion is as follows: ∀ x_h ∈ C_i^j, ∃ x_k ∈ C_i^j : |h - k| < TW, where:
  • TW is a time window denoting the duration (in terms of number of shots) of a user-selected time window
  • C_i is the i-th cluster
  • C_i^j is one temporally-local sub-cluster of C_i
  • x_1, x_2, ..., x_n are the shots belonging to C_i^j
  • Each pair of shots falling within the time window TW, as it is moved along the timeline, belongs to the same sub-cluster. When only one shot falls within TW, there is a split at the end of TW.
  • As TW moves from left to right, a sub-cluster including S1 and S3, followed by S5, is established. As indicated, there is a point where TW includes only S5, and so a split is made at the end of TW.
  • The first sub-cluster is labelled C_i^0.
  • A new TW then starts and, in the same way, shots S9, S10 (not shown in the timeline) and S11 are grouped together in a new sub-cluster C_i^1.
  • If no split occurs, the cluster itself becomes a temporally local sub-cluster.
  • This condition can be applied to the shots in each cluster to split the cluster into one or more temporally local sub-clusters, dependent on the scenes represented by the shots contained therein.
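Assuming shots are identified by their temporal indices, the splitting condition amounts to starting a new sub-cluster whenever the gap between successive shots of a cluster exceeds the time window TW. A minimal sketch (the function name is illustrative):

```python
def split_time_constrained(shot_ids, tw):
    """Split a cluster's sorted shot indices into temporally local
    sub-clusters: a new sub-cluster starts whenever the gap to the
    previous shot exceeds the time window TW (measured in shots)."""
    subs = [[shot_ids[0]]]
    for prev, cur in zip(shot_ids, shot_ids[1:]):
        if cur - prev > tw:
            subs.append([cur])          # split at the end of TW
        else:
            subs[-1].append(cur)
    return subs

# Shots S1, S3, S5 then S9, S10, S11 with TW = 3 reproduce the
# example in the text.
print(split_time_constrained([1, 3, 5, 9, 10, 11], 3))
# → [[1, 3, 5], [9, 10, 11]]
```

The transitions between the resulting sub-clusters are what the directed graph of the next step records.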
  • Transition information indicating the temporal flow between sub-clusters is retained at each sub-clustering operation, such that a directed graph comprising a number of temporally local sub-clusters, as well as the transitions/edges between sub-clusters, is obtained.
  • Each sub-cluster contains visually similar and temporally adjacent shots, and each transition represents the time evolution of the story-line.
  • An example of this splitting of clusters into temporally local sub-clusters is shown in Figure 10, and an example of the directed graph thus obtained is shown in Figure 11.
  • The directed graph is then subjected to Scene Transition Graph (STG) analysis to automatically extract its structure, to account for the semantic structure and time flow of the underlying video programme.
  • Scene Transition Graphs
As described previously, a logical story unit (LSU) is regarded as a sequential collection of interrelated shots unified by common semantic visual content. Given the output from the preceding visual similarity and temporal analysis steps, in this section we show how the STG concept, originally proposed in [16], can be efficiently used to find edges of LSUs, providing a compact representation of the story structure in the video programme. As already mentioned, the output from the previous processing step is a so-called directed graph comprising a number of nodes and transitions/edges between nodes. Each node may contain some visually similar and temporally adjacent shots, and each transition represents the time evolution of the story-line.
  • We first introduce the concept of STGs, and thereafter discuss how to extract the structure of the STG automatically without a priori knowledge of the semantic structure and time flow of the video programme.
  • STG cut-edges for LSU detection
  • One important type of transition between two nodes is called a "cut edge". A transition is considered a "cut edge" if, when removed, it leaves the graph as two disconnected sub-graphs.
  • The cut edges correspondingly induce a partition on G such that there are n disjoint STGs G_1, G_2, ..., G_n, where each G_i = (V_i, E_i, F), and the mapping F from G is preserved in each G_i.
  • Each connected sub-graph, following removal of cut-edges, will represent an LSU, while the collection of all cut edges of the STG represents all the transitions from one LSU to the next, thus reflecting the natural evolution of the video flow and allowing hierarchical organisation of the story structure.
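A brute-force sketch of cut-edge detection: an edge is a cut edge if its removal disconnects the graph. Transitions are treated as undirected for the connectivity test, and efficient linear-time bridge-finding algorithms exist but are omitted here for clarity:

```python
def cut_edges(nodes, edges):
    """Return the transitions whose removal splits the STG in two."""
    def connected(edge_set):
        adj = {n: [] for n in nodes}
        for a, b in edge_set:
            adj[a].append(b)
            adj[b].append(a)           # treat transitions as undirected
        seen, stack = set(), [next(iter(nodes))]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n])
        return len(seen) == len(nodes)
    # An edge is a cut edge when the graph without it is disconnected.
    return [e for e in edges if not connected([x for x in edges if x != e])]
```

On a graph of two shot-cycles joined by a single transition, only that joining transition is reported, and the two remaining connected sub-graphs correspond to two LSUs.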
  • In this way, LSUs, and the transitions between them, can be detected in the video sequence. These LSUs (or scenes) represent the video sequence semantically at a higher level than shots, and have a number of uses, as discussed previously.
  • Audio Signal Processing
The steps involved in processing audio information, which steps may be performed in parallel to the above-described video processing steps, will now be described with particular reference to Figures 12 to 17.
  • Current approaches to semantic video analysis focus more on visual cues than associated audio cues. There is, however, a significant amount of information contained in audio data, which can often be more important than, or complementary to, the visual part.
  • The director often employs creative montage techniques in which a succession of short but visually different shots share the same audio characteristics (usually a musical melody); in this sense the shots belong to the same semantic theme. In such cases, it is contended that the audio cues actually play a primary role in parsing and segmenting the video data.
  • Normally, audio data is considered to play a supporting role in relation to visual processing results, the visual part remaining the main reference for detecting real scene changes.
  • In cases such as montage sequences, however, the audio segmentation may be more important.
  • The video segmentation then supports the audio segmentation.
  • Audio scene changes can be identified according to a distance measure between two consecutive audio shots.
  • Each audio shot is characterised by a 'signature' derived from a VQ codebook, in a similar manner to the method described previously with respect to the visual signatures.
  • The spectral features are provided in the form of Mel-Frequency Cepstrum Coefficients (MFCCs).
  • A simple thresholding method is used to detect audio scene changes and separate coherent segments of audio data.
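The thresholding step can be expressed very simply. This is a sketch: the threshold value is a free parameter not fixed by the text, and the function name is illustrative:

```python
def audio_scene_changes(distances, threshold):
    """Threshold consecutive-shot distances: a scene change is
    declared at shot boundary i+1 when the distance between audio
    shots i and i+1 exceeds the threshold."""
    return [i + 1 for i, d in enumerate(distances) if d > threshold]

# distances[i] = EMD between the signatures of audio shots i and i+1
print(audio_scene_changes([0.2, 0.1, 0.9, 0.15, 0.8], 0.5))  # → [3, 5]
```

In practice the threshold would be tuned, or derived adaptively from the distance statistics of the programme being analysed.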
  • Figures 12 and 13 respectively represent the step of audio shot signature extraction, and the distance computation of consecutive audio shots for audio scene change detection. Further explanation of each stage is given below.
  • Audio shot data preparation
As mentioned above, the audio stream is first split into arbitrary segments.
  • The advantage of treating audio data in terms of corresponding video shot segments is that the combination of results from audio and visual data analysis is made more straightforward.
  • The processing objective is not classification (i.e. deciding whether the audio is music, speech, silence, noise, etc.), but to identify variations in audio characteristics that may correspond either to a scene change or to an important event in the story evolution underlined by a significant change in the audio properties.
  • short term spectral analysis is performed on each audio shot to generate feature vectors characterising the shot. This is achieved by first dividing the audio shot into audio frames, each being locally stationary and lasting for a few tens of milliseconds. Then, for each audio frame we conduct spectral analysis, involving the extraction of 19 Mel Frequency Cepstral Coefficients (MFCCs) plus a sound energy component.
  • the audio data is preferably sampled at 22.050 kHz, each sample being represented by 16 bits. The samples are then divided into audio frames of 20 ms long and weighted by a Hamming window; the sliding window is overlapped by 10 ms, so that an output feature vector, or the set of 19 MFCCs, is obtained every 10 ms.
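The framing step just described can be sketched as follows. This is an illustrative Python/NumPy sketch of our own, not part of the patent: a 22.050 kHz mono signal is split into 20 ms Hamming-windowed frames with a 10 ms hop, so that one feature vector is produced every 10 ms.

```python
import numpy as np

def frame_signal(x, sr=22050, frame_ms=20, hop_ms=10):
    """Split a mono signal into overlapping Hamming-windowed frames:
    20 ms frames with a 10 ms hop, as in the described embodiment."""
    frame_len = int(sr * frame_ms / 1000)    # 441 samples at 22.05 kHz
    hop = int(sr * hop_ms / 1000)            # 220 samples
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames  # one MFCC vector would then be computed per row

# a 2.3 s audio shot yields roughly 230 windows, one every 10 ms
print(frame_signal(np.zeros(int(22050 * 2.3))).shape)  # → (229, 441)
```

Each windowed frame would then be passed to an MFCC routine (filterbank plus DCT) to obtain the 19 coefficients and the energy component; that routine is omitted here.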
  • Vector Quantisation of audio shots
  • the MFCCs thus obtained are, in the subsequent step 11 , used to derive a signature for each audio shot.
  • an entire audio shot is represented by a sequence of 19-dimensional real vectors (MFCCs) representing a 10ms audio frame of the shot. So, for example, if an audio shot lasts for 2.3 seconds, there will be 230 vectors available for the codebook generation process. Note that since an audio frame with high energy has more influence on the human ear, when we compute frame level features, weighted versions of these MFCCs are used. The weighting factor is proportional to the energy of the frame.
  • the relative weight of each audio frame is obtained using the ratio of the frame energy value and the value of the most energetic one-second clip computed over the audio file (with clip overlapping by 99%).
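The energy-based weighting above can be sketched as follows (an illustrative Python sketch under our own assumptions: frame energies are taken as a 1-D array, and with a 10 ms hop a one-second clip spans 100 frames; sliding the clip one frame at a time gives the 99% overlap):

```python
import numpy as np

def frame_weights(frame_energies, frames_per_clip=100):
    """Weight each 10 ms frame by the ratio of its energy to the energy
    of the most energetic one-second clip computed over the file."""
    e = np.asarray(frame_energies, dtype=float)
    # energy of every one-second clip, sliding by one frame (99% overlap)
    clip_energies = np.convolve(e, np.ones(frames_per_clip), mode='valid')
    return e / clip_energies.max()
```

The resulting weights are then used to scale the MFCCs of each frame before codebook generation.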
  • the algorithm for VQ codebook generation starts by randomly positioning K seeds (which will form the centres of the clusters) into a 19-dimension hypercube that contains all spectral audio frames. Each frame is positioned in the hypercube according to its spectral coordinates (i.e. its MFCCs values).
  • the VQ structure is defined by the final position of the gravity centres of its cells, the centroids, which are directly related to the statistical density of features describing the content of the audio shots.
  • the codebook generated for each audio shot thus contains C codewords, each having D dimensions.
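The codebook generation described above is essentially a k-means procedure. The following minimal Python sketch (our own; the function name, default K and iteration count are assumptions) places K random seeds in the bounding hypercube of the 19-dimensional feature vectors and iteratively moves each to the gravity centre of its cell:

```python
import numpy as np

def vq_codebook(frames, K=32, iters=20, seed=0):
    """Generate a VQ codebook for one audio shot: K seeds are placed at
    random inside the hypercube bounding the feature vectors, then each
    is iteratively moved to the centroid of the frames in its cell."""
    rng = np.random.default_rng(seed)
    lo, hi = frames.min(axis=0), frames.max(axis=0)
    centroids = rng.uniform(lo, hi, size=(K, frames.shape[1]))
    for _ in range(iters):
        # assign every frame to its nearest centroid
        d = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            members = frames[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    counts = np.bincount(labels, minlength=K)
    return centroids, counts / counts.sum()  # codewords and their weights
```

The returned centroids are the C codewords (each of D = 19 dimensions) and the normalised counts are the codeword weights described below.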
  • This VQ codebook becomes the 'signature' of the audio shot to which it relates, the codebook comprising the following information.
  • Weights of the codewords, which account for the number of audio frames M_c associated with each codeword c. Usually, a normalised weight between 0 and 1 is used, i.e. w_c = M_c / Σ_c' M_c'. Note that any codewords with no audio frames associated are insignificant. Also, if there is only one frame associated with a codeword, then its corresponding cluster will have zero variance and infinite distance from every other codeword according to the distance metric discussed below, so these codewords are discarded.
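The pruning rule above can be sketched directly (illustrative Python; the function name is our own):

```python
import numpy as np

def prune_codebook(centroids, counts):
    """Discard codewords with no associated frames (insignificant) and
    codewords with a single frame (zero-variance cluster, hence infinite
    distance from every other codeword under the metric used), then
    renormalise the remaining weights to sum to 1."""
    keep = counts >= 2
    kept_counts = counts[keep]
    return centroids[keep], kept_counts / kept_counts.sum()
```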
  • the EMD (Earth Mover's Distance) is a method of evaluating the dissimilarity between two signatures.
  • one signature can be seen as a mass of earth properly spread in space, and the other a collection of holes in that same space.
  • the EMD provides a measure of the least amount of work needed to fill the holes with earth.
  • a unit of work corresponds to transporting a unit of earth by a unit of ground distance.
  • the EMD is applied between each pair of consecutive audio shots to determine the distance therebetween, and the results stored for use in the next stage. Calculation of the EMD between audio shots is represented in Figure 13. It will be seen that a graphical representation of EMD values can be plotted in relation to the temporal axis.
  • This is referred to as an audio shot distance curve.
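Computing the EMD between two codebook signatures is a transportation problem. The sketch below uses SciPy's linear-programming solver (an assumed dependency; the function name and the usual per-unit-flow normalisation are our own, not quoted from the patent):

```python
import numpy as np
from scipy.optimize import linprog

def emd(c1, w1, c2, w2):
    """Earth Mover's Distance between two VQ signatures: the minimum
    work needed to move the 'earth' (weights w1 at codewords c1) into
    the 'holes' (weights w2 at codewords c2), where moving one unit of
    weight by one unit of ground distance costs one unit of work."""
    n, m = len(w1), len(w2)
    # ground distance between every pair of codewords
    cost = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=2).ravel()
    A_eq = []
    for i in range(n):                     # earth leaving codeword i of c1
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1; A_eq.append(row)
    for j in range(m):                     # earth arriving at codeword j of c2
        col = np.zeros(n * m); col[j::m] = 1; A_eq.append(col)
    res = linprog(cost, A_eq=np.array(A_eq),
                  b_eq=np.concatenate([w1, w2]), bounds=(0, None))
    return res.fun / min(w1.sum(), w2.sum())
```

For normalised signatures (weights summing to 1) the denominator is 1, so the solver's optimal cost is the EMD directly.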
  • Segmentation procedure for audio scene detection Having calculated the EMD values between consecutive pairs of audio shots, the resulting distance measures are then used to segment the audio stream into scenes.
  • the objective here is to detect the boundaries of spatially similar (in the sense of spectral properties) and temporally adjacent audio shots in order to identify possible audio scene changes.
  • an audio scene change is likely to occur when the majority of dominant audio features in the sound change [9]. This can happen just before (or just after) a new visual scene begins. However, this can also indicate an important event in the story, even in the middle of a scene. For example, the dominant audio features in the sound may change to indicate a kiss between two main characters, or to raise suspense before something thrilling happens.
  • in step 13, a statistical analysis is performed to calculate the mean μ and the standard deviation σ of all pairs of consecutive audio shot distances.
  • An empirically chosen threshold value multiplied by the standard deviation σ is then used to detect peaks in the audio shot distance curve and to partition the audio shots into different segments, as shown in Figure 14.
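The thresholding rule above can be sketched as follows (illustrative Python; alpha stands for the empirically chosen threshold value, and its default here is an assumption):

```python
import numpy as np

def audio_scene_breaks(distances, alpha=1.5):
    """Flag a peak in the audio shot distance curve wherever the EMD
    between shots i and i+1 exceeds mean + alpha * standard deviation;
    the returned index i marks a candidate audio scene change."""
    d = np.asarray(distances, dtype=float)
    threshold = d.mean() + alpha * d.std()
    return [i for i, v in enumerate(d) if v > threshold]
```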
  • Audio-assisted video scene hierarchical segmentation The final stage 23 in the present embodiment is arranged to integrate the segmented audio scene information with the segmented video scene information.
  • the importance of audio in video segmentation has been recognised by many researchers, and recently in reference [20], although the combined use of visual and audio analysis results remains a challenging issue.
  • the most likely video scene change is searched for in the neighbouring shots of the shot associated with said audio change.
  • the audio output is used to support, or complement, the visual processing results.
  • a video scene boundary is detected, and an audio scene boundary exists at the same time-stamp: This is the simplest case to detect, given that the audio and visual features change at the same time. It may be that the audio change just anticipates (as an introduction) or just follows a corresponding visual scene change. These situations are taken into account when defining the rules to be used. As stated before, it is not always true that detected scene boundaries correspond to the actual start and/or ending of a scene, and that is one of the reasons why LSUs (Logical Story Units) are considered the best approximation to real movie episodes. In this case, however, since both audio and visual features are changing at the same time, we can be relatively certain that the detected LSU boundary is also a real scene break.
  • Audio pattern hints inside an LSU If an audio change happens but is not followed by a visual change, then, from the definition of an LSU, we cannot state anything about a possible scene break. However, we do know that, for some reason, the audio has changed with respect to the previous shot - possibly underlining an important event which is taking place, a change in the mood of the movie, or a romantic moment. For this reason, and because our technique relies mainly on visual analysis for detecting scene changes, we refer to these audio changes as 'audio pattern hints' that the video programme creator intended to be significant for the evolution of the storyline.
  • Audio-visual scene change If an audio change occurring at shot i coincides with a visual change on or around that shot, we detect an 'audio-visual scene change'. In this case, it is very likely that the detected LSU boundary is a real scene break.
  • the ambiguity window of length N takes into account the case of an audio change anticipating (occurring just before) or following (occurring just after) a visual scene change.
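The two cases above can be sketched as a simple classification over shot indices (illustrative Python; the window size N = 2 is an assumption, and shot indices stand in for time-stamps):

```python
def classify_audio_changes(audio_changes, video_changes, N=2):
    """Label each audio change: it is an 'audio-visual scene change' if a
    video scene change falls within the ambiguity window of +/- N shots
    (the audio may anticipate or follow the visual change); otherwise it
    is an 'audio pattern hint' inside the current LSU."""
    labels = {}
    for a in audio_changes:
        near = any(abs(a - v) <= N for v in video_changes)
        labels[a] = 'audio-visual scene change' if near else 'audio pattern hint'
    return labels
```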
  • the third rule is defined as follows.
  • the invention is primarily, although not exclusively, implemented by way of a software program running on a processor, although dedicated hardware implementations can equally be envisaged.
  • a personal computer, DVD recorder, or other audio-visual equipment capable of reproducing an audio-visual sequence could be provided with software arranged to embody the invention when executed. Execution of the software is performed under the control of the user, for example by the user pressing a "Find Similar" control button on a remote control for the DVD or the like. During playback of an audio video sequence the user presses the "Find Similar" button at a scene for which he wishes the DVD player to search the entire sequence to find semantically similar scenes.
  • the DVD player then executes the software embodying the present invention up to the stage at which time-unconstrained clustering is performed, and then displays to the user all of the found scenes which are in the same cluster as the scene which the user was viewing when he activated the "Find Similar" function. In this way the user is able to view all of the semantically similar scenes in an audio-video sequence at the touch of a button.
  • the invention may be used to automatically generate chapter markers relating to audio scene changes, video scene changes, and logical story unit boundaries.
  • a device such as a personal computer, DVD recorder, or other audio-visual equipment capable of reproducing an audio-visual sequence could be provided with software arranged to embody the invention when executed.
  • the user loads an audio-visual sequence (which could be stored on a DVD disc, downloaded from the internet, or otherwise input) into the device and then commands the device to execute the software embodying the invention (for example by pressing a suitable button or controlling the device using a graphical user interface).
  • the software then operates to determine video and audio scene boundaries, and logical story unit boundaries as described, and markers relating to the different types of boundary can be stored within the content or in an index thereto. Having generated such markers the user can then navigate through the audio visual content using the markers.
  • the different types of markers provide for different types of navigation, and hence the user experience of the content is enriched. Other applications of the invention will be readily apparent to the intended readers.
  • the preferred embodiment provides a technique for deriving content information at the semantic level so as to provide meaningful information relating to the content of a video sequence, such as a film or television programme.
  • the technique initially divides the video sequence into individual shots.
  • the accompanying audio sequence is divided into audio shots on the basis of the shot-based video division.
  • a two branch analysis stage is employed to respectively process the video and audio shots.
  • a representative keyframe is provided for each shot.
  • the keyframe is divided into pixel blocks which constitute training vectors for a VQ codebook learning process, the codebook thereafter characterising the keyframe, and so the shot.
  • a known distance metric is employed to compute the distance (indicative of the visual similarity) between each pair of codebooks (shots), following which clustering is performed by grouping together shots whose inter-shot distance falls within a predetermined range.
  • time-constrained clustering is performed in which the temporal locality of shots grouped in a common cluster is taken into account.
  • the resulting sub-clusters are representative of video shots having visually similar and temporally-adjacent content.
  • a number of steps are then employed to identify logical story units (LSUs) from the sub-clusters.
  • for the audio branch, Mel Frequency Cepstrum Coefficients (MFCCs) are extracted from each audio shot and used to derive a VQ codebook signature for that shot; the Earth Mover's Distance (EMD) between the signatures of consecutive audio shots is then thresholded to segment the audio stream into audio shot scenes.
  • a set of heuristic rules are applied to the resulting LSUs and audio shot scenes to identify information relating to the audio-video sequence at the semantic level. This uses a comparison between respective boundary edges of the LSUs and the audio shot scenes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Television Signal Processing For Recording (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and system for segmenting a video sequence and a corresponding audio sequence into respective video and audio scenes. To allow direct integration of the audio and video scene segmentation information, the concept of an audio shot corresponding to the directly identifiable video shots is introduced. By synchronising the audio stream and the video stream on the basis of corresponding audio and video shots, the integration of the respective determined semantic audio and video scene information (a scene comprising one or more shots) becomes relatively straightforward, so as to provide an improved semantic understanding of the content. Moreover, fusing the audio and visual analysis results by means of heuristic rules offers further benefits. The method and system are industrially applicable in automating the laborious and time-consuming processes of organising and indexing ever-growing video databases, such that those databases can be browsed and searched simply by means of natural, human-concept-like query structures.
PCT/GB2005/001027 2004-03-23 2005-03-17 Procede et systeme de detection de changements de scenes audio et video WO2005093752A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0406504A GB0406504D0 (en) 2004-03-23 2004-03-23 Method and system for detecting audio and video scene changes
GB0406504.1 2004-03-23

Publications (1)

Publication Number Publication Date
WO2005093752A1 true WO2005093752A1 (fr) 2005-10-06

Family

ID=32188532

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2005/001027 WO2005093752A1 (fr) 2004-03-23 2005-03-17 Procede et systeme de detection de changements de scenes audio et video

Country Status (2)

Country Link
GB (1) GB0406504D0 (fr)
WO (1) WO2005093752A1 (fr)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009063383A1 (fr) * 2007-11-14 2009-05-22 Koninklijke Philips Electronics N.V. Procédé de détermination du point de départ d'une unité sémantique dans un signal audiovisuel
EP2269374A1 (fr) * 2008-04-23 2011-01-05 Samsung Electronics Co., Ltd Procédé de stockage et d'affichage de contenus de diffusion et appareil associé
CN103279580A (zh) * 2013-06-24 2013-09-04 魏骁勇 基于新型语义空间的视频检索方法
WO2015038121A1 (fr) * 2013-09-12 2015-03-19 Thomson Licensing Segmentation vidéo par sélection audio
US10248864B2 (en) 2015-09-14 2019-04-02 Disney Enterprises, Inc. Systems and methods for contextual video shot aggregation
US10339959B2 (en) 2014-06-30 2019-07-02 Dolby Laboratories Licensing Corporation Perception based multimedia processing
CN112347303A (zh) * 2020-11-27 2021-02-09 上海科江电子信息技术有限公司 媒体视听信息流监测监管数据样本及其标注方法
CN113163272A (zh) * 2020-01-07 2021-07-23 海信集团有限公司 视频剪辑方法、计算机设备及存储介质
CN114465737A (zh) * 2022-04-13 2022-05-10 腾讯科技(深圳)有限公司 一种数据处理方法、装置、计算机设备及存储介质

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1067800A1 (fr) * 1999-01-29 2001-01-10 Sony Corporation Procede de traitement des signaux et dispositif de traitement de signaux video/vocaux
WO2003058623A2 (fr) * 2002-01-09 2003-07-17 Koninklijke Philips Electronics N.V. Procede et appareil de segmentation d'histoire plurimodale permettant de lier un contenu multimedia

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1067800A1 (fr) * 1999-01-29 2001-01-10 Sony Corporation Procede de traitement des signaux et dispositif de traitement de signaux video/vocaux
WO2003058623A2 (fr) * 2002-01-09 2003-07-17 Koninklijke Philips Electronics N.V. Procede et appareil de segmentation d'histoire plurimodale permettant de lier un contenu multimedia

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
SARACENO C ET AL: "Identification of story units in audio-visual sequences by joint audio and video processing", IMAGE PROCESSING, 1998. ICIP 98. PROCEEDINGS. 1998 INTERNATIONAL CONFERENCE ON CHICAGO, IL, USA 4-7 OCT. 1998, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, vol. 1, 4 October 1998 (1998-10-04), pages 363 - 367, XP010308744, ISBN: 0-8186-8821-1 *
SARACENO C ET AL: "INDEXING AUDIOVISUAL DATABASES THROUGH JOINT AUDIO AND VIDEO PROCESSING", INTERNATIONAL JOURNAL OF IMAGING SYSTEMS AND TECHNOLOGY, WILEY AND SONS, NEW YORK, US, vol. 9, no. 5, 1998, pages 320 - 331, XP000782119, ISSN: 0899-9457 *
SHIH-FU CHANG ET AL: "Structural and semantic analysis of video", MULTIMEDIA AND EXPO, 2000. ICME 2000. 2000 IEEE INTERNATIONAL CONFERENCE ON NEW YORK, NY, USA 30 JULY-2 AUG. 2000, PISCATAWAY, NJ, USA,IEEE, US, vol. 2, 30 July 2000 (2000-07-30), pages 687 - 690, XP010513105, ISBN: 0-7803-6536-4 *
SUNDARAM H ET AL: "Video scene segmentation using video and audio features", MULTIMEDIA AND EXPO, 2000. ICME 2000. 2000 IEEE INTERNATIONAL CONFERENCE ON NEW YORK, NY, USA 30 JULY-2 AUG. 2000, PISCATAWAY, NJ, USA,IEEE, US, vol. 2, 30 July 2000 (2000-07-30), pages 1145 - 1148, XP010513212, ISBN: 0-7803-6536-4 *
WANG Y ET AL: "MULTIMEDIA CONTENT ANALYSIS USING BOTH AUDIO AND VISUAL CLUES", IEEE SIGNAL PROCESSING MAGAZINE, IEEE INC. NEW YORK, US, November 2000 (2000-11-01), pages 12 - 36, XP001127426, ISSN: 1053-5888 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009063383A1 (fr) * 2007-11-14 2009-05-22 Koninklijke Philips Electronics N.V. Procédé de détermination du point de départ d'une unité sémantique dans un signal audiovisuel
US20100259688A1 (en) * 2007-11-14 2010-10-14 Koninklijke Philips Electronics N.V. method of determining a starting point of a semantic unit in an audiovisual signal
EP2269374A1 (fr) * 2008-04-23 2011-01-05 Samsung Electronics Co., Ltd Procédé de stockage et d'affichage de contenus de diffusion et appareil associé
EP2269374A4 (fr) * 2008-04-23 2011-05-04 Samsung Electronics Co Ltd Procédé de stockage et d'affichage de contenus de diffusion et appareil associé
US8352985B2 (en) 2008-04-23 2013-01-08 Samsung Electronics Co., Ltd. Method of storing and displaying broadcast contents and apparatus therefor
CN103279580A (zh) * 2013-06-24 2013-09-04 魏骁勇 基于新型语义空间的视频检索方法
WO2015038121A1 (fr) * 2013-09-12 2015-03-19 Thomson Licensing Segmentation vidéo par sélection audio
US10339959B2 (en) 2014-06-30 2019-07-02 Dolby Laboratories Licensing Corporation Perception based multimedia processing
US10748555B2 (en) 2014-06-30 2020-08-18 Dolby Laboratories Licensing Corporation Perception based multimedia processing
US10248864B2 (en) 2015-09-14 2019-04-02 Disney Enterprises, Inc. Systems and methods for contextual video shot aggregation
CN113163272A (zh) * 2020-01-07 2021-07-23 海信集团有限公司 视频剪辑方法、计算机设备及存储介质
CN113163272B (zh) * 2020-01-07 2022-11-25 海信集团有限公司 视频剪辑方法、计算机设备及存储介质
CN112347303A (zh) * 2020-11-27 2021-02-09 上海科江电子信息技术有限公司 媒体视听信息流监测监管数据样本及其标注方法
CN114465737A (zh) * 2022-04-13 2022-05-10 腾讯科技(深圳)有限公司 一种数据处理方法、装置、计算机设备及存储介质

Also Published As

Publication number Publication date
GB0406504D0 (en) 2004-04-28

Similar Documents

Publication Publication Date Title
US7949050B2 (en) Method and system for semantically segmenting scenes of a video sequence
Truong et al. Video abstraction: A systematic review and classification
US10134440B2 (en) Video summarization using audio and visual cues
WO2005093752A1 (fr) Procede et systeme de detection de changements de scenes audio et video
Asghar et al. Video indexing: a survey
Wang et al. A multimodal scheme for program segmentation and representation in broadcast video streams
Gunsel et al. Hierarchical temporal video segmentation and content characterization
Huang et al. A film classifier based on low-level visual features
WO2004019224A2 (fr) Unite et procede de detection d'une propriete de contenu dans une sequence d'images video
Jiang et al. Hierarchical video summarization in reference subspace
Banjar et al. Sports video summarization using acoustic symmetric ternary codes and SVM
Darabi et al. Video summarization by group scoring
Brezeale Learning video preferences using visual features and closed captions
Huang et al. Movie classification using visual effect features
Adami et al. The ToCAI description scheme for indexing and retrieval of multimedia documents
Adami et al. An overview of video shot clustering and summarization techniques for mobile applications
WO2005093712A1 (fr) Procede et systeme de segmentation semantique d'une sequence audio
Cotsaces et al. Semantic video fingerprinting and retrieval using face information
Benini et al. Identifying video content consistency by vector quantization
Choroś Reduction of faulty detected shot cuts and cross dissolve effects in video segmentation process of different categories of digital videos
Benini et al. Audio-Visual VQ Shot Clustering for Video Programs
Yaşaroğlu et al. Summarizing video: Content, features, and HMM topologies
Barbieri et al. Movie-in-a-minute: automatically generated video previews
Camastra et al. Video segmentation and keyframe extraction
Zwicklbauer et al. Improving scene detection algorithms using new similarity measures

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase