WO2014082812A1 - Clustering and synchronizing multimedia contents - Google Patents

Clustering and synchronizing multimedia contents

Info

Publication number
WO2014082812A1
Authority
WO
WIPO (PCT)
Prior art keywords
mel
salient
sequences
frequency cepstrum
clustering
Application number
PCT/EP2013/072697
Other languages
French (fr)
Inventor
Franck Thudor
Pierre HELLER
Alexey Ozerov
Ashish BAGRI
Original Assignee
Thomson Licensing
Application filed by Thomson Licensing filed Critical Thomson Licensing
Priority to EP13785447.7A (EP2926337A1)
Priority to US14/648,705 (US20150310008A1)
Publication of WO2014082812A1

Classifications

    • G06F16/433: Information retrieval of multimedia data; query formulation using audio data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system
    • G06F16/285: Clustering or classification
    • G10L25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G11B27/10: Indexing; Addressing; Timing or synchronising; Measuring tape travel

Abstract

A method and a device for clustering sequences of multimedia contents with regard to a certain event are recommended, wherein mel-frequency cepstrum coefficients of the sequences' audio tracks are used for clustering and synchronizing the multimedia contents with regard to a certain event by computing salient mel-frequency cepstrum coefficients from mel-frequency cepstrum coefficient features and by clustering sequences having an overlapping audio segment by comparing the salient mel-frequency cepstrum coefficients. Method and device provide an improvement in comparison to fingerprint detection.

Description

CLUSTERING AND SYNCHRONIZING MULTIMEDIA CONTENTS
TECHNICAL FIELD
The invention relates to a method and a device for clustering and synchronizing sequences of multimedia contents with regard to a certain event, e.g. independently recorded multimedia contents of that event. A further aspect relates to clustering sequences of multimedia content belonging to a certain event in a database, wherein said clustering and synchronizing relies on the audio similarity of the multimedia content, the content being audio or audiovisual content.
BACKGROUND
The popularity of portable devices, e.g. smartphones, leads to the creation of a huge number of audio-visual recordings of the same or different multimedia presentation events. For example, a concert of a popular music band can be filmed by hundreds of fans, with all these recordings then being uploaded to YouTube. Such collections could, for example, be exploited to enhance the corresponding audio-visual content, to create summaries of a particular event, etc. However, to do so, one first needs to identify the videos corresponding to the same event and to synchronize them in time. Doing this relying on the video alone is challenging due to the high variation of points of view and to the fact that two devices often film completely different parts of a visual scene. The task becomes easier if one relies on the audio tracks alone: whatever the location and orientation of two devices in the same place, they record more or less the same sounds.
Bryan et al. address in "Clustering and synchronizing multi-camera video via landmark cross-correlation," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Kyoto, Japan, March 2012, the problem of joint clustering and synchronization of audiovisual contents by their audio tracks, that is, regrouping audiovisual contents by event and registering them temporally. This is done by using audio fingerprinting to match the audiovisual contents corresponding to the same event and to temporally register the matched contents. However, it has been found that audio fingerprints may wrongly identify two recordings made at different locations of a similar event as belonging to the same event.
SUMMARY OF THE INVENTION
It is an aspect of the present invention to provide an improved differentiation regarding whether sequences of multimedia contents correspond to the same event or not, wherein multimedia content means audio or audiovisual content.
Although it is the task of Mel Frequency Cepstrum Coefficients - in the following also denoted as MFCC - to represent the information of an audio signal as efficiently as possible, that is, in a decorrelated manner, it is nevertheless recommended to use MFCCs for clustering and synchronizing multimedia contents. It is furthermore recommended to determine salient features from said MFCCs by computing dimension-wise maxima of the MFCCs and to compare the salient MFCC features of at least two audio tracks of multimedia content for a voting-based clustering and a rough synchronization of the audio tracks. Finally, after clustering has been established, a precise synchronization is performed by a precise realignment within each created cluster, carried out on MFCC features using MFCC cross-correlations computed over a window corresponding to the salient MFCC computation window. In the case of audiovisual multimedia content, a pairwise comparison between all videos belonging to the same cluster is performed to find a precise alignment between them. Using the clusters created in the previous step, a pairwise comparison is done between videos belonging to the same cluster to find the precise time offset between them. Each video in a cluster is only compared to the other videos in the same cluster, since non-overlapping videos have already been separated beforehand: a new cluster is formed if a video does not match any existing cluster representative or if there is a match but the video has a non-overlapping region. A cluster representative is a minimal set of recordings whose union covers the entire cluster time line. The comparison of two videos is then done in the salient MFCC domain and is based on cross-correlation. A complete match-list with time offsets between the matching videos is generated. The match-list is used to categorize the videos into events. In such a way, videos which have an overlapping region form part of the same event. Videos which are not overlapping but are connected to each other via a common video sequence also form part of the same event, so that all videos belonging to the same event are clustered and videos belonging to a different event are excluded.
That means a method is proposed for clustering and synchronizing multimedia contents with regard to a certain event, wherein mel-frequency cepstrum coefficients of audio tracks of the multimedia contents are used for clustering and synchronizing the multimedia contents by: computing salient mel-frequency cepstrum coefficients as dimension-wise maxima over a predetermined window from the mel-frequency cepstrum coefficients; creating clusters such that every pair of segments having an overlapping audio segment belongs to the same cluster, by comparing the salient mel-frequency cepstrum coefficient features with regard to whether a majority of features corresponds to a maximum correlation; creating cluster representatives by matching the longest sequences with others to form intermediate clusters in the salient mel-frequency cepstrum coefficient domain; performing a fine synchronization by a pairwise comparison between all sequences belonging to the same intermediate cluster to provide a complete match-list with time offsets between the matching sequences; and categorizing sequences into events for final clustering.
The method for clustering and synchronizing multimedia contents with regard to a certain event is performed in a device comprising: extracting means for extracting mel-frequency cepstrum coefficients from audio tracks of the multimedia contents; computing means for calculating dimension-wise maxima over a predetermined window from the mel-frequency cepstrum coefficients to provide salient mel-frequency cepstrum coefficients; comparing means for comparing the features of the salient mel-frequency cepstrum coefficients with regard to whether a majority of features corresponds to a maximum correlation, for creating clusters such that every pair of segments having an overlapping audio segment belongs to the same cluster; voting means for providing cluster representatives by matching the longest sequences with others to form intermediate clusters in the salient mel-frequency cepstrum coefficient domain; synchronizing means for a pairwise comparison between all sequences belonging to the same intermediate cluster to provide a complete match-list with time offsets between the matching sequences; and sorting means for categorizing sequences into events for final clustering. That means that the invention is characterized in that mel-frequency cepstrum coefficients of audio tracks are used for clustering multimedia contents with regard to a certain event by determining salient mel-frequency cepstrum coefficient values from mel-frequency cepstrum coefficient vectors and by clustering segments having an overlapping audio segment by comparing the salient mel-frequency cepstrum coefficient values. Synchronization is performed by comparing sequences of the same cluster with regard to a time offset, and a final clustering comprises categorizing sequences into events, as sequences which have an overlapping segment form part of the same event and sequences which do not overlap but are connected via a common sequence also form part of the same event.
The problem of clustering and synchronizing multimedia contents with regard to a certain event is solved by a method and a device as a processor-controlled machine disclosed in the independent claims. Advantageous embodiments of the invention are disclosed in respective dependent claims.
It has been found that audio fingerprints may be too robust for the task of identifying the same event, as they are resistant against additive noise. This property makes them too robust to distinguish the same music played at different events. In such a way, two audio sequences, being the same song but played at two different parties, could be wrongly clustered together. Audio fingerprints are robust to ambient sounds and would most probably wrongly identify the two corresponding recordings as belonging to the same event.
In contrast, MFCCs, while not robust to additive perturbations, also capture information about ambient sounds. MFCCs, as compared to fingerprints, therefore allow a better differentiation between the same songs played by the same group in different concerts.
Preferably, according to the invention, the comparing is the result of a voting approach function of the determined MFCC features which only requires fixing one non-adaptive threshold, thereby avoiding other heuristics with adaptive threshold values for filtering out a high number of false positives. It is an advantage of the recommended method and device that one non-adaptive threshold is sufficient and that cluster representatives are used to address the large-scale issue with regard to the size of the dataset. To address this large-scale issue, joint clustering and alignment are performed in a bottom-up hierarchical manner by splitting the database into subsets at the lower stages and by comparing only cluster representatives at the higher stages. Such a strategy, applied in several stages, reduces the computational complexity and thus allows addressing much bigger datasets, as illustrated by the sketch below. Favorably, a created cluster contains one or more cluster representatives, and a new audiovisual segment is added to the created cluster if it matches the one or more representatives.
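For illustration only, a minimal Python sketch of this bottom-up strategy (the patent itself contains no code); `cluster_subset` and `merge_by_representatives` are hypothetical helpers standing for the per-subset clustering and the representative-level comparison described above:

```python
# Minimal sketch of the bottom-up hierarchical strategy. cluster_subset()
# is assumed to cluster one subset of recordings into clusters carrying
# their representatives; merge_by_representatives() is assumed to merge
# two sets of clusters while comparing only their representatives.

def hierarchical_clustering(recordings, n_subsets=4):
    # Lower stage: split the database into subsets and cluster each
    # subset independently (these jobs are mutually independent).
    subsets = [recordings[i::n_subsets] for i in range(n_subsets)]
    partial = [cluster_subset(s) for s in subsets]

    # Higher stage: merge the partial clusterings, comparing cluster
    # representatives only, never the full recordings again.
    merged = partial[0]
    for other in partial[1:]:
        merged = merge_by_representatives(merged, other)
    return merged
```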
According to another aspect of the invention, a positive comparison leads to the determination of a time offset between the two audiovisual segments of a pair of segments. Preferably, the audiovisual segments of a created cluster are temporally aligned using the determined offset.
For a better understanding, the invention shall now be explained in more detail in the following description with reference to the figures. It is understood that the invention is not limited to the described embodiments and that specified features can also expediently be combined and/or modified without departing from the scope of the present invention as defined in the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention.
In the drawings:
Figure 1 shows users equipped with a smartphone comprising audiovisual capturing means during a concert;
Figure 2 is a schematic illustrating the structure of the invention;
Figure 3 is a schematic illustrating examples of cluster representatives;
Figure 4 shows in a diagram the standard deviation of video length per cluster versus the average video length per cluster for a dataset of concert videos;
Figure 5 illustrates in a diagram the accuracy of the method according to the invention; and
Figure 6 illustrates in a diagram the clustering performance of the inventive method with regard to split configurations.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Reference will now be made in detail to the preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings. In the description and drawings, the same reference characters are given to the same elements.
Fig. 2 is a schematic illustrating the structure of the method and of a device of the present invention. Mel-frequency cepstral coefficients MFCC are first extracted for each multimedia content as audio recording Audio. Cepstral coefficients obtained for the mel-cepstrum are often referred to as Mel-Frequency Cepstral Coefficients, here also denoted by MFCC. An MFCC is a representation of the audio signal Audio: audio samples within a window W are combined through a discrete Fourier transformation and a discrete cosine transformation on a mel-scale to create one MFCC sample as a multi-dimensional vector d of floating values. A sketch of this extraction step is given below.

To reduce the number of features describing an audio sequence and hence limit the complexity, salient mel-frequency cepstrum coefficients Salient MFCC are computed from the original MFCC vectors as illustrated in figure 2. Only the maximal MFCC values over a sliding window W are retained, for each dimension of an MFCC independently. This selection of salient mel-frequency cepstrum coefficients Salient MFCC is based on the notion that the maximum value is likely to be retained in other audio of the same content even under the influence of noise. A salient mel-frequency cepstrum coefficient Salient MFCC representation has only a fraction of about 10% of the components of the original MFCC features and is still sufficiently robust to compare two audio files. It also provides a way to perform the comparison at a coarse level to filter out obvious non-matches and reduces the number of matchings performed at the granular level. A two-stage approach has been used, but performing the comparisons at several different levels can also be envisioned.

Clustering sequences having an overlapping audio segment with regard to a certain event is performed by comparing the salient mel-frequency cepstrum coefficients Salient MFCC, applying a voting approach function Voting-based clustering to the mel-frequency cepstrum coefficient features as a comparison with regard to whether a majority of features corresponds to a maximum correlation, for a rough synchronization. As illustrated in figure 2, said clustering already provides a rough synchronization with regard to a certain event, as non-matching sequences have already been excluded, and cluster representatives can be generated by matching the longest sequences with others to form intermediate clusters in a salient mel-frequency cepstrum coefficient domain. That means that for clustering multimedia contents with regard to a certain event, mel-frequency cepstrum coefficients MFCC of the audio tracks of the multimedia contents are used for clustering and synchronizing or aligning the multimedia contents, as mel-frequency cepstrum coefficients MFCC in addition capture information about ambient sound, which in comparison to fingerprints makes it possible to distinguish more precisely between different events.

Cluster representatives are advantageous with regard to forming clusters, as newly processed recordings are compared - aligned and matched - only to these representatives; as can be seen from the illustration in figure 3, cluster representatives drastically limit the required number of comparisons for clustering. Finally, a fine synchronization and a final clustering are recommended.
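For illustration only, and not part of the patent text itself, a minimal sketch of this extraction step in Python; the librosa library and the parameter defaults are assumptions (the 40 ms window, 50% overlap and d = 12 are taken from the embodiment described below):

```python
import librosa  # assumed third-party audio library, not named in the patent

def extract_mfcc(path, n_mfcc=12, win_ms=40, overlap=0.5):
    """Return an (n_frames, n_mfcc) array of MFCC vectors for one recording."""
    y, sr = librosa.load(path, sr=None)   # audio samples and sample rate
    n_fft = int(sr * win_ms / 1000)       # 40 ms analysis window W
    hop = int(n_fft * (1.0 - overlap))    # 50 % overlap between windows
    # librosa applies the windowed DFT, the mel filter bank and the DCT,
    # i.e. the chain sketched in figure 2, and returns one d-dimensional
    # vector of floating values per window.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    return mfcc.T
```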
That means that the method further comprises a synchronization by comparing sequences of the same cluster with regard to a time offset, and further comprises a final clustering by categorizing sequences into events, as sequences which have an overlapping segment form part of the same event and sequences which do not overlap but are connected via a common sequence also form part of the same event.
That means for a concrete embodiment that, for a given set of audio (Audio) or audiovisual files, also named AV files, MFCC features are first extracted for all recordings.
Then, salient MFCC features as salient mel-frequency cepstrum coefficients Salient MFCC, that are dimension-wise maxima of MFCCs over some window, are computed. Joint clustering and synchronization is then performed on the salient MFCCs. This is done in two substeps:
In the first substep, recordings are compared sequentially - starting from the longest ones - while creating clusters with their representatives, and newly processed recordings are only compared - that is, temporally registered and matched - to these representatives.
In a second substep, voting is applied: while comparing two recordings, the cross-correlation of the two recordings is computed independently for each salient MFCC dimension, and a match is established if and only if the cross-correlation maximum location is the same for a sufficient pre-defined number of dimensions.
Finally, once a clustering has been established, a precise realignment within each created cluster is performed on the MFCC features using MFCC cross-correlations computed over a reduced window or a window corresponding to the salient MFCC computation window.
The proposed approach for joint clustering and synchronization is more robust to the presence of similar predominant audio content, e.g. the same music played at different parties, since it relies on MFCCs that, in contrast to audio fingerprints, describe the overall audio content; it scales with dataset size and average recording size thanks to the use of cluster representatives and salient MFCCs; and it is easier to implement and reproduce thanks to the proposed voting approach for the matching decision, which allows avoiding adaptive thresholds and heuristic post-filtering.
There are a few steps that can be done off-line before the clustering and temporal registration process starts.
In the following example, the window W has a width of 40 ms with an overlap of 50% and the dimension d of the multi-dimensional vector is set to 12. To reduce the number of features describing an audio sequence and hence limit the complexity, salient MFCC values are extracted from the original MFCC vectors. This is a representation that has only a fraction of about 10% of the components of the original MFCC features and is still robust enough to compare two audio files. To compute the salient MFCC, only the maximal MFCC values are retained over a sliding window of width Ws. This is done over each of the d dimensions of the MFCC independently.
This selection of salient MFCC is based on the notion that the maximum value is likely to be retained in other audio of the same content even under the influence of noise. This framework also provides a way to perform the comparison at a coarse level to filter out obvious non-matches and reduces the number of matchings performed at the granular level. In the present approach a two-stage scheme is used, but it can be envisioned to perform the comparisons at several different levels. A minimal sketch of the salient MFCC computation follows.
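A minimal numpy sketch of this selection, under one plausible reading in which non-maximal values are zeroed out so that the result stays alignable by cross-correlation; `mfcc` is an (n_frames, d) array as produced above, and the defaults mirror the Ws = 20, 50% overlap configuration retained in the experiments:

```python
import numpy as np

def salient_mfcc(mfcc, ws=20, overlap=0.5):
    """Keep, per dimension, only the maximal MFCC value in each window.

    mfcc: (n_frames, d) array; ws: sliding-window length Ws in MFCC samples.
    Returns an array of the same shape with non-maxima zeroed out, i.e. a
    sparse representation holding only a small fraction of the components.
    """
    step = max(1, int(ws * (1.0 - overlap)))       # 50 % window overlap
    salient = np.zeros_like(mfcc)
    d = mfcc.shape[1]
    for start in range(0, mfcc.shape[0] - ws + 1, step):
        window = mfcc[start:start + ws]
        idx = window.argmax(axis=0)                # dimension-wise maxima
        salient[start + idx, np.arange(d)] = window.max(axis=0)
    return salient
```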
A first-level clustering is performed to group the sets of videos which have a common overlapping segment. Since one goal is to work with large datasets, it quickly becomes infeasible to compare all videos with each other. To avoid comparing each video with every other video in the database, clusters are created and each cluster has a cluster representative. Cluster representatives are videos which have an overlapping segment with all the other videos in that cluster. To form clusters, the videos are arranged based on their lengths, starting with the longest video first. The longest video is made the cluster representative of the first cluster. At every stage of this clustering process, videos are only compared to the existing cluster representatives.
If a video has an overlapping segment with an existing cluster representative, that video is added to that cluster.
A new cluster is formed if a video does not match any existing representative or if there is a match but the video also has a non-overlapping region. The comparison of two videos is done in the salient MFCC domain and is based on cross-correlation, a description of which is detailed further below. The clustering technique of not comparing all videos with each other, together with the fact that the comparison is done on the sparse salient MFCCs, provides an effective mechanism to deal with very large datasets without the computation time increasing exponentially. A sketch of this first-level pass follows.
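For illustration, a sketch of this first-level pass; `match(rep, cand)` is a hypothetical helper, assumed to compare two salient-MFCC arrays by cross-correlation (see the voting sketch further below) and to return whether they match and whether the candidate has a non-overlapping region:

```python
def first_level_clustering(videos, match):
    """First-level clustering against cluster representatives only.

    videos: list of (name, salient_features) pairs; match(rep, cand) is a
    hypothetical helper returning (is_match, has_nonoverlapping_region)."""
    # Longest recording first: it seeds the first cluster as representative.
    videos = sorted(videos, key=lambda v: len(v[1]), reverse=True)
    clusters = []   # each entry: {"representative": video, "members": [...]}
    for video in videos:
        placed = False
        for cluster in clusters:
            is_match, has_nonoverlap = match(cluster["representative"][1],
                                             video[1])
            if is_match and not has_nonoverlap:
                cluster["members"].append(video)   # fully covered, same event
                placed = True
                break
        if not placed:
            # No match, or a match with a non-overlapping region: the video
            # seeds a new cluster and becomes its representative.
            clusters.append({"representative": video, "members": [video]})
    return clusters
```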
The temporal registration and matching of videos as well as the final clustering will now be described.
A pairwise comparison is done between all the videos belonging to the same cluster to find a precise alignment between them. Using the clusters created in the previous step, a pairwise comparison is done between videos belonging to the same cluster to find the precise time offset between them. Each video in a cluster is only compared to the other videos in the same cluster, as the non-overlapping videos have already been separated as described before. A complete match-list with time offsets in seconds between the matching videos is generated. Using this match-list, videos are categorized into events. Videos which have an overlapping region form part of the same event. Videos which are not overlapping but are connected to each other via a common video also form part of the same event.
The actual comparison between any two videos is carried out by computing the cross-correlation on the feature values. In the clustering step, the features used are the salient MFCC values, while in the temporal registration of matching videos and the final clustering, the features used are the complete MFCC values. Cross-correlation is an effective way to find the time offset between two signals which are shifted versions of each other.
To find the offset, a novel voting approach is used. Since an MFCC sample is a multi-dimensional vector d whose dimensions are decorrelated during the creation of the features, the cross-correlation is performed on each dimension separately. The peak in each dimension points to a time offset between the two compared signals. If the two signals really do match, then the time offset in most of the dimensions points to the same correct value. If the signals do not match, the cross-correlation in each dimension has its peak at a different offset, and hence it can easily be detected that there is no match between these signals. A voting approach is used where each dimension votes for its selected time offset, and if the majority of the dimensions point to the same window of time offsets, a match is declared between the two signals with the given time offset. This voting decision is sketched below.
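The voting decision might be sketched as follows; the pooling tolerance (the "window of time offset" within which votes are counted together) and the majority fraction are assumptions, as the patent fixes a single non-adaptive threshold without stating its value:

```python
import numpy as np
from scipy.signal import correlate

def vote_match(feat_a, feat_b, tolerance=5, majority=0.5):
    """Compare two (n_frames, d) feature arrays by per-dimension voting.

    Returns (is_match, offset); offset is the lag of b relative to a in
    MFCC frames, or None when no majority is reached."""
    d = feat_a.shape[1]
    offsets = np.empty(d, dtype=int)
    for dim in range(d):
        # Cross-correlate this dimension of the two signals independently.
        xc = correlate(feat_a[:, dim], feat_b[:, dim], mode="full")
        # Index 0 of the full correlation corresponds to lag -(len_b - 1).
        offsets[dim] = int(np.argmax(xc)) - (feat_b.shape[0] - 1)
    # Each dimension votes for its peak lag; votes within `tolerance`
    # frames of each other are pooled into the same offset window.
    votes = np.array([(np.abs(offsets - o) <= tolerance).sum()
                      for o in offsets])
    best = int(np.argmax(votes))
    if votes[best] > majority * d:
        return True, int(offsets[best])
    return False, None
```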
In the context of this application, new additional videos can be added to a database/system where the temporal registrations have already been computed. To add these additional videos, the new videos need not be compared to all the existing videos in the database. In the adopted approach, for each intermediate cluster computed, a cluster center is identified. It is generally the longest video, which has the largest overlapping region with all the other videos in that cluster. This cluster center is identified and stored for further use. For every new video that is being added, instead of comparing it with all the existing videos to find if they have an overlap, it is enough to match it with the existing cluster centers. This way the proposed framework handles incremental data while still retaining the advantages that it provides in the first place. The intermediate clusters provide a starting point for new videos to be added. Once the event of a new video has been identified, it is then matched to the existing videos of that event to create a precise temporal registration. This has the advantage of making the system more scalable. The proposed framework can handle large amounts of data without exponentially increasing the computations. The comparison carried out on salient MFCC features makes the comparison quick and robust, while the intermediate clusters provide a mechanism to reduce the number of comparisons to the bare minimum required. A sketch of this incremental path follows.
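A sketch of this incremental path, reusing `vote_match` from the sketch above and assuming clusters are stored as dictionaries carrying their center (the role played by the representative in the first-level sketch):

```python
def add_video(new_video, clusters):
    """Place one new recording by matching it against cluster centers only.

    new_video: (name, salient_features); clusters: list of dicts holding a
    "center" video and a "members" list, as assumed above."""
    for cluster in clusters:
        is_match, offset = vote_match(cluster["center"][1], new_video[1])
        if is_match:
            # Same event: a fine registration on full MFCCs against the
            # event's existing members would follow at this point.
            cluster["members"].append((new_video, offset))
            return cluster
    # No center matched: the new video starts a cluster of its own.
    cluster = {"center": new_video, "members": [(new_video, 0)]}
    clusters.append(cluster)
    return cluster
```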
In the following, some experimental results are shown.
The dataset consists of user-contributed videos taken from YouTube. A total of 164 videos from 6 separate artists and bands, having a cumulative duration of 17.56 hours, were used. The longest sequence was 21 minutes, while the shortest one was 44 seconds. A hand-made groundtruth of 36 clusters was realized on this dataset. From this groundtruth, a binary matrix of size 164*164 is generated, where ones and zeros code respectively for matching and non-matching sequences. This matrix is denoted GT matching. The details of the dataset can be seen in Figure 4, in which each cluster of videos is represented by a bubble whose width is proportional to the number of videos inside the cluster and whose coordinates are given by the average video length per cluster in seconds and the standard deviation of video length per cluster in seconds.
The salient MFCC representation is first evaluated on the entire dataset, through the exhaustive 164*163/2 = 13366 comparisons, which are compared to the GT matching matrix. An F-measure criterion is used to summarize precision P and recall R as F = 2PR/(P + R). F-measure results are plotted in Figure 5 with different sets of parameters. The parameters are the sliding window Ws, equal to 10, 20 or 40 MFCC samples, and an overlap ove between consecutive windows of 0% or 50%. These results show that the proposed method is really robust for comparing the videos with a light representation.
They also show that the salient representation is not very sensitive to parameterization. The configuration Ws = 20 and ove = 50% was selected. In a second step, the cluster results obtained with the temporal registration and final clustering method are compared. Of the 36 clusters of the groundtruth, all but one are found correctly. The missed one is a two-song cluster - Muse-Unintended - which is wrongly merged with a five-song cluster - Muse-Feeling Good - captured during the same event. The two songs are correctly synchronized together, but the analysis of the *.wav files showed that one of them exhibits a very low signal-to-noise ratio SNR, leading to a mismatch with one of the representatives of the other cluster. Such cases could be alleviated by filtering the sequences before creating the dataset. For each individual cluster, a manual check has been performed a posteriori by loading the cluster's elements into Audacity and listening to them. To the human ear, all the sequences are correctly synchronized.
Regarding the complexity analysis, the cross-correlation between two signals over every possible shift is O(N log N) when FFT-based cross-correlation is used. To create the match-list for K = 164 sequences, normally the number of cross-correlations needed would be 13366 (164*163/2), leading to a complexity Cbaseline:
Cbaseline = K * (K - 1)/2 * N * log(N), where N is the average number of MFCCs per sequence. Using the salient representation allows a reduction in the size of the signals to be compared. Hence, a clustering based on the salient MFCCs would exhibit a complexity Csalient:
Csalient = K * (K - 1)/2 * Nc * log(Nc), where Nc is the average number of salient MFCCs per sequence.
When N becomes high, this reduction is proportional to the ratio Nc/N, of 10% in the current case. But in the adopted approach, not all comparisons need to be made. The complexity formula is separated into two parts. The first one deals with the salient MFCCs and is devoted to the clustering. The second one deals with the MFCCs and is devoted to the fine synchronization around the coarse synchronization given by the salient MFCC correlation.
Hence, the complexity becomes Cours:
Cours = Nbcrude * Nc * log(Nc) + Nbfine * N * log(N), where Nbcrude and Nbfine are respectively the numbers of computations performed at the salient and fine levels according to the present invention. Some values were computed for the dataset and are presented in Table 1.
Table 1. Comparison of targeted complexity with respect to the baseline (i.e. all cross-correlations at MFCC level) on our dataset: only a small fraction of the baseline's computations is needed with the proposed method. [Table values not reproduced in this text extraction.]
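For illustration, the three cost formulas can be written out directly; K = 164 and the Nc/N ratio of about 10% come from the text, while N and the Nb* counts are placeholders, since neither the average sequence length in MFCC samples nor Table 1's values are reproduced here:

```python
import math

def complexities(K=164, N=50_000, salient_ratio=0.10,
                 nb_crude=None, nb_fine=None):
    """Evaluate the three cost formulas from the text.

    K = 164 and the 10 % Nc/N ratio come from the dataset description;
    N (average MFCCs per sequence) and the Nb* counts are illustrative
    assumptions."""
    Nc = int(N * salient_ratio)
    pairs = K * (K - 1) // 2                 # 13366 comparisons for K = 164
    c_baseline = pairs * N * math.log(N)     # all pairs, full MFCCs
    c_salient = pairs * Nc * math.log(Nc)    # all pairs, salient MFCCs
    c_ours = None
    if nb_crude is not None and nb_fine is not None:
        c_ours = nb_crude * Nc * math.log(Nc) + nb_fine * N * math.log(N)
    return c_baseline, c_salient, c_ours
```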
Regarding scalability, stability tests were carried out to simulate the effectiveness of the adopted approach for incremental additions of videos into an existing database. For this purpose, the dataset was split into two parts. The first part is clustered and aligned using the recommended approach, and the second part is incrementally added to the database. The following configurations were tested:
120+44; 100+64; 90+74; 84+80. For each configuration, many different splits were randomly run, leading to a total of 175 tests. The precision, recall and F-measure of the final match-list were then calculated for all 175 tests and compared to the GT matching matrix. As summarized in Table 2 below, showing the mean μ and standard deviation σ, and as also illustrated in figure 6, the results show equivalent performance whatever the configuration.
Table 2. Mean and standard deviation of precision, recall and F-measure when the database is split. [Table values not reproduced in this text extraction.]
Figure 6 illustrates, for the configurations mentioned above, the probability that the F-measure (in percent) is greater than the abscissa value when the database is split.
Tests showed the ability of the invention to incrementally add videos to the database while keeping the same performance, without extra calculations compared to adding all the videos together. The split approach provides a way to make the system scalable and incremental and to effectively split the task when a very large number of videos need to be compared and synchronized.
Although the present invention has been described in terms of the presently preferred embodiment, it is to be understood that such disclosure is not to be interpreted as limiting. Various alterations and modifications will no doubt become apparent to those skilled in the art after reading the above disclosure. Accordingly, it is intended that the appended claims be interpreted as covering all alterations and modifications as fall within the true spirit and scope of the claims.

Claims

1. Method for clustering sequences of multimedia contents with regard to a certain multimedia presentation event wherein
mel-frequency cepstrum coefficients (MFCC) of audio tracks of the multimedia contents are used for clustering and synchronizing multimedia contents with regard to a certain event by
computing salient mel-frequency cepstrum coefficients (Salient MFCC) from mel-frequency cepstrum coefficient (MFCC) features and clustering sequences having an overlapping audio segment (ove) by comparing the salient mel-frequency cepstrum coefficients (Salient MFCC).
2. Method according to claim 1, wherein the mel-frequency cepstrum coefficient features are mel-frequency cepstrum coefficient vectors.
3. Method according to claim 1 or 2, further comprising a synchronization by comparing sequences of the same cluster with regard to a time offset.
4. Method according to claim 1, 2 or 3, further comprising a final clustering by categorizing sequences into events as sequences which have an overlapping segment form a part of the same event and sequences which do not overlap but are connected via a common sequence also form part of the same event.
5. Method according to one of the claims 1 to 4, wherein said salient mel- frequency cepstrum coefficients (Salient MFCC) are computed as dimension-wise maxima over a predetermined window from the mel- frequency cepstrum coefficients (MFCC).
6. Method according to one of the claims 1 to 5, wherein said mel- frequency cepstrum coefficient features are compared with regard to whether a majority of features corresponds to a maximum correlation.
7. Method according to claim 6, wherein the comparing is a result of a voting approach function (Voting-based clustering) of the mel-frequency cepstrum coefficient features.
8. Method according to claim 1, wherein cluster representatives are generated by matching the longest sequences with others to form intermediate clusters in a salient mel-frequency cepstrum coefficient domain.
9. Method according to claim 8, wherein a created cluster contains one or more cluster representative and comprises the adding of a new audio or audiovisual segment to the created cluster if a new audio or audiovisual segment matches the one or more representatives.
10. Device for clustering sequences of multimedia contents with regard to a certain multimedia presentation event comprising:
extracting means for extracting mel-frequency cepstrum coefficients (MFCCs) from the sequences audio tracks of the multimedia contents, computing means for calculating dimension-wise maxima over a predetermined window from the mel-frequency cepstrum coefficients (MFCCs) to provide salient mel-frequency cepstrum coefficients (Salient MFCCs),
comparing means for comparing the features of the salient mel-frequency cepstrum coefficients (Salient MFCCs) with regard to whether a majority of features corresponds to a maximum correlation, for creating clusters such that every pair of segments having an overlapping audio segment belongs to the same cluster.
11. Device according to claim 10, wherein voting means are provided for determining cluster representatives by matching the longest sequences with others to form intermediate clusters in the salient mel-frequency cepstrum coefficient domain.
12. Device according to claim 10 or 11, further comprising:
synchronizing means for a pairwise comparison between all sequences belonging to the same intermediate cluster to provide a complete match-list with time offsets between the matching sequences.
13. Device according to one of the claims 10 to 12 further comprising:
sorting means for categorizing sequences into events for final clustering.
14. Device according to one of the claims 10 to 13, characterized in that the device for clustering sequences of multimedia contents with regard to a certain event is a processor-controlled machine.
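By way of a non-authoritative sketch of the computations recited in claims 5 to 7 and 12, the following Python fragment derives salient coefficients as dimension-wise maxima over a sliding window and estimates the time offset between two sequences by a majority vote across feature dimensions. All function names and default parameters (`salient_mfcc`, `vote_offset`, `window`, `max_lag`) are assumptions made for illustration, not language from the claims.

```python
import numpy as np

def salient_mfcc(mfcc, window=20):
    # Dimension-wise maxima over a predetermined window (cf. claim 5).
    # `mfcc` is a (frames x dimensions) array; each output entry is the
    # maximum of its dimension over the surrounding window.
    frames, _ = mfcc.shape
    out = np.empty_like(mfcc)
    for t in range(frames):
        lo = max(0, t - window // 2)
        hi = min(frames, t + window // 2 + 1)
        out[t] = mfcc[lo:hi].max(axis=0)
    return out

def vote_offset(sal_a, sal_b, max_lag=500):
    # Voting-based matching (cf. claims 6 and 7): each dimension votes
    # for the lag that maximizes its cross-correlation, and the two
    # sequences are declared a match when a majority of dimensions
    # agrees on one lag, which then serves as the synchronization
    # offset (cf. claim 12).
    lags = np.arange(-(sal_b.shape[0] - 1), sal_a.shape[0])
    keep = np.abs(lags) <= max_lag
    votes = []
    for d in range(sal_a.shape[1]):
        corr = np.correlate(sal_a[:, d], sal_b[:, d], mode="full")
        votes.append(int(lags[keep][np.argmax(corr[keep])]))
    best = max(set(votes), key=votes.count)
    return best, votes.count(best) > len(votes) / 2
```

For example, two recordings of the same concert offset by a few seconds would yield per-dimension lags that largely agree on that offset, whereas unrelated recordings would spread their votes across many lags and fail the majority test.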
PCT/EP2013/072697 2012-11-30 2013-10-30 Clustering and synchronizing multimedia contents WO2014082812A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP13785447.7A EP2926337A1 (en) 2012-11-30 2013-10-30 Clustering and synchronizing multimedia contents
US14/648,705 US20150310008A1 (en) 2012-11-30 2013-10-30 Clustering and synchronizing multimedia contents

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP12306488.3 2012-11-30
EP12306488 2012-11-30

Publications (1)

Publication Number Publication Date
WO2014082812A1 true WO2014082812A1 (en) 2014-06-05

Family

ID=47469806

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2013/072697 WO2014082812A1 (en) 2012-11-30 2013-10-30 Clustering and synchronizing multimedia contents

Country Status (3)

Country Link
US (1) US20150310008A1 (en)
EP (1) EP2926337A1 (en)
WO (1) WO2014082812A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016029039A1 (en) * 2014-08-20 2016-02-25 Puretech Management, Inc. Systems and techniques for identifying and exploiting relationships between media consumption and health
US10141009B2 (en) 2016-06-28 2018-11-27 Pindrop Security, Inc. System and method for cluster-based audio event detection
AU2017327003B2 (en) 2016-09-19 2019-05-23 Pindrop Security, Inc. Channel-compensated low-level features for speaker recognition
WO2018053537A1 (en) 2016-09-19 2018-03-22 Pindrop Security, Inc. Improvements of speaker recognition in the call center
US11019201B2 (en) 2019-02-06 2021-05-25 Pindrop Security, Inc. Systems and methods of gateway detection in a telephone network
US11646018B2 (en) 2019-03-25 2023-05-09 Pindrop Security, Inc. Detection of calls from voice assistants

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7164797B2 (en) * 2002-04-25 2007-01-16 Microsoft Corporation Clustering
CN100397387C (en) * 2002-11-28 2008-06-25 新加坡科技研究局 Summarizing digital audio data
US9299364B1 (en) * 2008-06-18 2016-03-29 Gracenote, Inc. Audio content fingerprinting based on two-dimensional constant Q-factor transform representation and robust audio identification for time-aligned applications
US8595005B2 (en) * 2010-05-31 2013-11-26 Simple Emotion, Inc. System and method for recognizing emotional state from a speech signal

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090087161A1 * 2007-09-28 2009-04-02 Gracenote, Inc. Synthesizing a presentation of a multimedia event
WO2012001216A1 (en) * 2010-07-01 2012-01-05 Nokia Corporation Method and apparatus for adapting a context model
EP2450898A1 (en) * 2010-11-05 2012-05-09 Research in Motion Limited Mixed video compilation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BRYAN ET AL.: "Clustering and synchronizing multi-camera video via landmark cross-correlation", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING ICASSP, March 2012 (2012-03-01)
LYNDON KENNEDY ET AL: "Less talk, more rock", 18TH INTERNATIONAL WORLD WIDE WEB CONFERENCE, 20 April 2009 (2009-04-20), pages 311 - 320, XP058025602, ISBN: 978-1-60558-487-4, DOI: 10.1145/1526709.1526752 *
MCKINNEY M F ET AL: "Features for Audio and Music Classification", PROCEEDINGS ANNUAL INTERNATIONAL SYMPOSIUM ON MUSIC INFORMATION RETRIEVAL, 1 January 2003 (2003-01-01), pages 1 - 8, XP002374912 *
PRARTHANA SHRESTHA ET AL: "Synchronization of multi-camera video recordings based on audio", PROCEEDINGS OF THE 15TH INTERNATIONAL CONFERENCE ON MULTIMEDIA, MULTIMEDIA '07, 1 January 2007 (2007-01-01), New York, New York, USA, pages 545, XP055098830, ISBN: 978-1-59-593702-5, DOI: 10.1145/1291233.1291367 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104505101A (en) * 2014-12-24 2015-04-08 北京巴越赤石科技有限公司 Real-time audio comparison method
US9653094B2 (en) 2015-04-24 2017-05-16 Cyber Resonance Corporation Methods and systems for performing signal analysis to identify content types
EP3171599A1 (en) 2015-11-19 2017-05-24 Thomson Licensing Method for generating a user interface presenting a plurality of videos
EP3171600A1 (en) 2015-11-19 2017-05-24 Thomson Licensing Method for generating a user interface presenting a plurality of videos

Also Published As

Publication number Publication date
US20150310008A1 (en) 2015-10-29
EP2926337A1 (en) 2015-10-07

Similar Documents

Publication Publication Date Title
US20150310008A1 (en) Clustering and synchronizing multimedia contents
US20230196809A1 (en) Robust audio identification with interference cancellation
US8977067B1 (en) Audio identification using wavelet-based signatures
JP5362178B2 (en) Extracting and matching characteristic fingerprints from audio signals
US9093120B2 (en) Audio fingerprint extraction by scaling in time and resampling
Malekesmaeili et al. A local fingerprinting approach for audio copy detection
US20140280304A1 (en) Matching versions of a known song to an unknown song
WO2011045424A1 (en) Method for detecting audio and video copy in multimedia streams
Jégou et al. Babaz: a large scale audio search system for video copy detection
Ustubioglu et al. Robust copy-move detection in digital audio forensics based on pitch and modified discrete cosine transform
EP2926273A1 (en) Synchronization of different versions of a multimedia content
Pandey et al. Cell-phone identification from audio recordings using PSD of speech-free regions
Williams et al. Efficient music identification using ORB descriptors of the spectrogram image
CN108198573B (en) Audio recognition method and device, storage medium and electronic equipment
Duong et al. Movie synchronization by audio landmark matching
You et al. Music identification system using MPEG-7 audio signature descriptors
Burges et al. Identifying audio clips with RARE
Khemiri et al. A generic audio identification system for radio broadcast monitoring based on data-driven segmentation
Lin et al. Generalized time-series active search with Kullback–Leibler distance for audio fingerprinting
Lopez-Otero et al. Introducing a Framework for the Evaluation of Music Detection Tools.
Anguera et al. Multimodal video copy detection applied to social media
KR101002731B1 (en) Method for extracting feature vector of audio data, computer readable medium storing the method, and method for matching the audio data using the method
Ulutas et al. Forge Audio Detection Using Keypoint Features on Mel Spectrograms
Lee et al. A TV commercial monitoring system using audio fingerprinting
Uikey et al. A highly robust deep learning technique for overlap detection using audio fingerprinting

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 13785447

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2013785447

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 14648705

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE