WO2015177513A1 - Method for grouping, synchronising and composing a plurality of videos corresponding to the same event; also disclosed are a system and a computer readable medium comprising computer readable code therefor

Method for grouping, synchronising and composing a plurality of videos corresponding to the same event; also disclosed are a system and a computer readable medium comprising computer readable code therefor

Info

Publication number
WO2015177513A1
Authority
WO
WIPO (PCT)
Prior art keywords
videos
video
features
synchronised
reference video
Prior art date
Application number
PCT/GB2015/051395
Other languages
English (en)
Inventor
Andrea Cavallaro
Sophia BANO
Original Assignee
Queen Mary & Westfield College, University Of London
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Queen Mary & Westfield College, University Of London filed Critical Queen Mary & Westfield College, University Of London
Publication of WO2015177513A1


Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B27/034 Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/732 Query formulation
    • G06F16/7328 Query by example, e.g. a complete video frame or video sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B27/19 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
    • G11B27/28 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier by using information signals recorded by the same method as the main recording

Definitions

  • a method for collating videos is disclosed. More specifically, but not exclusively, a method for synchronising multiple videos of the same event by analysing the audio streams of the multiple videos is disclosed.
  • a method of generating a composite video from a plurality of videos comprising extracting features from a plurality of videos, comparing the extracted features of the plurality of videos to extracted features of a reference video, synchronising the plurality of videos and the reference video based on the extracted features, and generating a composite video from the plurality of videos and the reference video.
  • a method of synchronising videos comprising receiving a plurality of videos, receiving a reference video, and synchronising the plurality of videos and the reference video based on audio features of the plurality of videos and the reference video.
  • a method of synchronising videos comprising receiving a plurality of videos, receiving a reference video, and synchronising the plurality of videos and the reference video based on audio chroma features of the plurality of videos and the reference video.
  • a method of synchronising videos comprising receiving a plurality of videos, organising the plurality of videos into a plurality of clusters, receiving a reference video, identifying a cluster in the plurality of clusters which corresponds to a same event as the reference video, and synchronising the videos in the identified cluster and the reference video.
  • a method of synchronising videos comprising receiving a plurality of videos, organising the plurality of videos into a plurality of clusters, selecting a representative video for each cluster, receiving a reference video, identifying a cluster in the plurality of clusters by its representative video which corresponds to a same event as the reference video, and synchronising the videos in the identified cluster and the reference video.
  • a method of synchronising videos comprising extracting features from each of a plurality of videos, using the extracted features to determine characteristics of the plurality of videos, receiving a reference video, and synchronising a number of the plurality of videos and the reference video based on the determined characteristics.
  • a method of generating a composite video from a plurality of videos comprising receiving a plurality of videos, receiving a reference video, synchronising a number of videos in the plurality of videos to the reference video based on obtaining characteristics of the plurality of videos and the reference video, and generating a composite video from the number of videos.
  • a method of synchronising videos comprising receiving a database of unorganised and unsynchronised videos, identifying videos in the database corresponding to a particular event, synchronising the identified videos and a reference video based on extracting features of the videos.
  • a method for synchronising a plurality of videos each video of the plurality of videos comprising a video element and an audio element, the method comprising obtaining features of one or more of the plurality of videos, identifying a reference video to which a number of videos of the plurality of videos are to be synchronised, identifying the number of videos of the plurality of videos that are to be synchronised to the reference video, and synchronising in time the number of videos and the reference video based on the obtained features.
  • Obtaining the features may comprise extracting features of each of the plurality of videos.
  • the extracted features may be frequency features of each of the plurality of videos.
  • the frequency features may be audio features of the audio element of each of the plurality of videos.
  • the audio features may be audio chroma features of the audio element of each of the plurality of videos.
  • Obtaining the features may comprise receiving the features at a receiving unit.
  • Obtaining the features may comprise processing each video of the plurality of videos in order to obtain the features.
  • Identifying a reference video may comprise selection of the reference video from the plurality of videos by a user.
  • Identifying the number of videos of the plurality of videos that are to be synchronised to the reference video may comprise organising the plurality of videos into a plurality of clusters of one or more videos based on the obtained features, wherein each cluster may correspond to a particular event, and identifying, based on the obtained features, a cluster within the plurality of clusters which corresponds to a same event as the reference video.
  • Organising the plurality of videos into a plurality of clusters may comprise comparing the obtained features of each video in the plurality of videos with the obtained features of all other videos in the plurality of videos to identify videos corresponding to a same event.
  • the method may further comprise determining a parameter indicative of the likelihood of videos of the plurality of videos relating to a same event wherein the plurality of videos may be organised into a plurality of clusters in accordance with the parameter.
  • the parameter may be determined by determining a plurality of matching histograms, each matching histogram relating to one or more videos in the plurality of videos in combination with one or more other videos in the plurality of videos, and comparing the determined matching histograms to identify the parameter indicative of the likelihood of videos of the plurality of videos relating to the same event.
  • Comparing the determined matching histograms may further comprise normalizing each matching histogram, counting a number of peaks in each normalized histogram, computing a descriptor for each matching histogram based on the number of peaks and determining the parameter based on the computed descriptors of each matching histogram.
  • Synchronising may comprise calculating a relative time difference between each of the videos in the number of videos to be synchronised and the reference video for aligning each of the number of videos with the reference video in time, and aligning the number of videos with the reference video in accordance with the calculated relative time differences.
  • the relative time difference may be calculated in accordance with the obtained features of each video in the number of videos to be synchronised.
  • the method may further comprise validating the synchronisation to detect any video that does not correspond to a same event as the reference video.
  • Validating may comprise comparing a relative time difference between each of the synchronised videos and the reference video to a relative time difference between all other synchronised videos and the reference video to identify anomalous values.
  • the method may further comprise playback of the synchronised videos by a multi-display visualiser for selection, by a user, of videos to be included in a composite video.
  • the method may further comprise generating a composite video by selecting portions of the synchronised videos, wherein a portion of a synchronised video comprises a plurality of frames of the synchronised video.
  • the portions of the synchronised videos may be selected in accordance with characteristics of the synchronised videos.
  • the portions of the synchronised videos may be selected in accordance with the number of characteristics for a given frame.
  • a portion of a synchronised video may be selected if it has a higher number of characteristics than all other synchronised videos for a given frame.
  • the characteristics of the synchronised videos may be spatio-temporal characteristics. Obtaining features may be performed on a frame-by-frame basis.
  • a system comprising a processor arranged to perform the method.
  • a computer readable medium comprising computer readable code operable, in use, to instruct a computer system to perform the method.
  • a system for taking multiple videos of a particular event and automatically selecting relevant clips, ranking them and generating a coherent video by selecting appropriate clips from each source - as a directed film camera crew would.
  • Such a system acts as a filtering tool for vast amounts of video and thus dramatically reduces the time needed to collect, select, rank and merge videos of breaking news and high-profile events.
  • Such a system significantly reduces the amount of processing required by a computer to put together a montage of media.
  • the video production system disclosed herein may be available as a web service, personalized for each user by learning their preferences.
  • Audio chroma feature is exploited for this purpose, as it gives a condensed representation of the tonal content of an audio signal, which makes it robust to audio noise.
  • For a query video input to the recording identification and synchronization framework by the user, feature matching is performed against a database of pre-computed features for UGVs.
  • A matching strategy which maximizes the feature similarity is proposed and is used to estimate the synchronization time shift.
  • An automatic classification threshold estimation strategy is then proposed, which allows effective identification of UGVs belonging to the same event from the larger database of videos.
  • UGC user-generated content
  • a query by example automatic video identification and synchronization framework for multi-camera user-generated videos is disclosed.
  • the approach may be audio-based and exploit chroma features for feature extraction of multiple recordings. Chroma features are used because they give a powerful representation of the audio signal, and are highly robust to audio channel and ambient noise.
  • a feature matching strategy may also be used that provides the time shift estimation. For a query video input to a video identification and synchronization framework by the user, its feature matching may be performed with a database of pre-computed features for UGVs.
  • An automatic classification threshold estimation strategy may be used, which allows effective identification of UGVs belonging to the same event from our larger database of videos.
  • Figure 1 shows a method for identifying and synchronising videos
  • Figure 2 illustrates how each audio frame is composed of an audio segment of frame size f_r and overlap shift h_p between two consecutive frames
  • Figure 3 shows the distance matrix for two feature vectors obtained from two overlapping camera recordings
  • Figure 4 shows a block diagram of a learning classification threshold system
  • Figure 5 shows a post-processing step used to obtain the descriptor P'_ij, with an example for the match and non-match classes
  • Figure 6 is a block diagram of the video identification and synchronization framework
  • Figure 7 shows the result of a recording identification system, with varying size of training sample set
  • Figure 8 shows the average computation time per matching pair for varying frame size
  • Figure 9 shows a plot for accuracy and area under the curve (AUC) with varying frame size
  • Figure 10 shows the effect of varying frame size on the recording identification system
  • Figure 11 shows the synchronization result on some collected datasets
  • Figure 12 illustrates the accuracy of the method disclosed herein compared to the existing state-of-the-art methods.
  • Figure 13 illustrates the architecture of a system which performs the method
  • Synchronization of multi-camera user-generated recordings involves several challenges particularly because of the nature of these recordings.
  • User-generated recordings are captured free-handed with small and light devices and this sometimes leads to jerkiness or shakiness in the footage.
  • a query by example video search framework for the automatic identification and synchronization of multi-camera UGVs which uses audio chroma features.
  • the method utilises audio chroma features for building this framework as it gives a condensed yet powerful representation of the tonal content of an audio signal, making it robust to audio noise.
  • the proposed approach allows grouping of overlapping videos belonging to the same event into clusters.
  • For a query video input to the video identification and synchronization system by the user, feature matching is performed against a database of pre-computed features for large-scale UGVs.
  • An automatic classification threshold estimation strategy allows effective identification of UGVs belonging to the same event from a larger database of videos.
  • the following is proposed: (i) the first query by example video search system for the automatic identification and synchronization of UGVs, (ii) a method for clustering of videos belonging to the same event, and (iii) the use of chroma features.
  • the system receives a plurality of video recordings as an input and provides the grouping of these recordings in clusters representing events as an output, for example a sporting event, concert or birthday party.
  • the video recordings may contain both a video and audio element.
  • the system identifies the set C_k of recordings corresponding to the same event as Q, a query video provided by the user.
  • the query video, or reference video, Q is a video selected by the user of an event for which they would like to see compiled footage.
  • the user may be able to identify an event that they would like to see a composite video of and a query video may be provided automatically by a video identification unit of the system performing the method, or by a third party.
  • the system synchronizes the set C_k and the query video Q on a common timeline.
  • the query video Q is aligned in time with the videos in the set C_k, relating to a particular event.
  • a set of videos of a given event, each synchronised in time with the other videos in the set, is then provided.
  • a single, composite video can be generated as a composition of the set of videos in C_k and Q.
  • the composition is performed by extracting spatio-temporal features from the video and audio streams. This provides an automatic tool for showing a single video out of many that are recording a single event.
  • the method takes a set of videos and collates and synchronises those which relate to a particular event.
  • the collation and synchronisation is performed based on a reference video that may be selected by the user.
  • the set of synchronised videos is then compiled to provide a single output video containing elements from a number of videos in the set.
  • the primary step towards solving the problem of video identification and synchronization involves feature extraction and matching of frequency features of the videos, particularly audio. Therefore, before method 100 can be performed, an initialisation process must be performed to extract features and facilitate the identification and synchronisation of the videos. This process involves analysing audio information of user-generated videos (UGVs) such that relevant information is obtained in order to perform the method 100 introduced above. This initialisation process will now be described.
  • UGVs user-generated videos
  • Audio signals of each video in the set form the basis of the feature extraction process.
  • A_n = {a_n(t_i^n)}, ∀ i, 0 ≤ i < K_n, (2)
  • i is the index of the audio sample
  • t_i^n is the time at the i-th sample for the n-th recording
  • a_n is the amplitude of the audio sample at time t_i^n
  • s_n is the audio sampling rate
  • K_n is the total number of audio samples.
  • the audio chroma feature is used as it gives a condensed yet powerful representation of the tonal content of A_n, making it robust to audio noise.
  • the extracted audio chroma features can then be used to identify matching videos and synchronise them in time.
  • a feature matching strategy is then proposed, which for a pair of overlapping feature vectors F_i and F_j maximizes their feature similarity. This will now be explained in detail.
  • Chroma feature is a descriptor which gives a condensed representation of an audio signal. It is derived from the pitch by combining the pitch bands belonging to twelve pitch classes (namely C, C#, D, D#, E, F, F#, G, G#, A, A#, B), which correspond to the same distinct semitones (or chroma). The chroma feature of an audio signal shows its distribution of energy along the different pitch classes.
  • the pitch class index l depends on the centre frequency f(l) in a logarithmic way and is given by: l = 12·log2(f(l)/440) + 69, (4)
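  • For example, under eq. 4 a centre frequency f(l) = 440 Hz maps to l = 12·log2(440/440) + 69 = 69, and doubling the frequency increases l by exactly 12, which is why pitches one octave apart fold into the same chroma class.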
  • a pitch class is the set of all pitches which share the same chroma.
  • the pitch class corresponding to chroma C# is {C#0, C#1, C#2, ..., C#8}, which relates to the pitch sub-bands {13, 25, 37, ..., 109}.
  • Chroma feature is used in this system as a descriptor for the audio content analysis of the user-generated recording as it gives a coarser representation of pitch; higher frequencies get mapped onto the chromagram while lower frequencies get suppressed. This makes the chroma feature robust to noise.
  • FIG. 2 illustrates the process of extraction of the chroma feature of a particular audio frame 202 of an audio signal, wherein the spectrum is calculated and is divided into sub-bands.
  • Each audio frame (202, 204, 206) is composed of an audio segment of frame length or size f_r and overlap shift h_p between two consecutive frames (represented in Figure 2(a)).
  • the number of audio frames Y_n in A_n is a function of the number of audio samples K_n in A_n and is computed by the following relation:
  • the frequency spectrum of each audio frame is then computed by applying discrete Fourier transform (DFT), as shown in Figure 2(b).
  • DFT discrete Fourier transform
  • This spectrum is mapped into the pitch classes via their centre frequencies f(l) using eq. 4.
  • the chroma feature vector for a particular audio frame is thus represented as a 12-dimensional vector v_m, such that m defines the time stamp of a particular frame position.
  • a chromagram is then formed by summing all pitch bands belonging to a particular chroma (Figure 2(c)). Chroma features for the n-th audio signal A_n, segmented into Y_n audio frames, are given by F_n = [v_1, v_2, ..., v_{Y_n}], where v_m ∈ R^{12×1} is the chroma feature vector for the m-th frame of the n-th camera's audio signal.
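  • By way of illustration only, the following is a minimal sketch in Python/NumPy of chroma extraction along the lines described above. The frame-based DFT, the 12 chroma classes, the 100 Hz - 5000 Hz analysis band and the mapping of eq. 4 follow the text; the Hann window, the per-frame normalisation and the default frame/hop sizes are assumptions made so that the sketch is self-contained, not details taken from the disclosure.

```python
import numpy as np

def chroma_features(audio, sample_rate, frame_size=0.08, hop_size=0.04):
    """Return a (num_frames, 12) chromagram for a mono audio signal (1-D array)."""
    frame_len = int(frame_size * sample_rate)
    hop_len = int(hop_size * sample_rate)
    num_frames = max(0, 1 + (len(audio) - frame_len) // hop_len)

    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    band = (freqs >= 100.0) & (freqs <= 5000.0)        # analysis band named in the text
    # Eq. (4): pitch index l = 12*log2(f/440) + 69, folded into 12 chroma classes.
    pitch = 12.0 * np.log2(freqs[band] / 440.0) + 69.0
    chroma_bin = np.round(pitch).astype(int) % 12

    chromagram = np.zeros((num_frames, 12))
    window = np.hanning(frame_len)                      # assumed window (not specified)
    for m in range(num_frames):
        frame = audio[m * hop_len : m * hop_len + frame_len] * window
        energy = np.abs(np.fft.rfft(frame)) ** 2        # frame spectrum (Figure 2(b))
        # Sum the energy of all pitch bands sharing the same chroma (Figure 2(c)).
        v_m = np.bincount(chroma_bin, weights=energy[band], minlength=12)
        chromagram[m] = v_m / (np.linalg.norm(v_m) + 1e-12)
    return chromagram
```

  • Applied to the audio element A_n of a recording sampled at s_n Hz, chroma_features(A_n, s_n) returns a chromagram whose rows play the role of the vectors v_m above.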
  • the features may have been extracted beforehand and are provided to a receiving unit of the system performing the initialisation process.
  • the next step is to perform feature matching between pairs of recordings. Since similar features are expected to be positioned at a similar, if not equal, time shift, feature matching is beneficial for computing the time shifts between a pair of recordings. Time shift is defined as the time misalignment between a pair of recordings.
  • the proposed matching method operates by minimizing a feature distance between pairs of camera recordings. This is explained below.
  • Let F_i and F_j be the extracted features for recordings O_i and O_j, and in particular from the audio streams A_i and A_j respectively.
  • the chroma features F_i and F_j are compared and their distance matrix is computed.
  • Figure 3(a) shows the distance matrix for two feature vectors obtained from two overlapping camera recordings each of 2 seconds duration.
  • This distance matrix is a rectangular matrix where the main diagonal corresponds to zero time shift. Lower diagonals correspond to negative time shifts and upper diagonals correspond to positive time shifts. The minimum across each row is then calculated.
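  • As an illustrative sketch only: given two chromagrams, the distance matrix, the per-row minima and the matching histogram over diagonals can be computed as below. The Euclidean distance and the rule that the most populated diagonal gives the time shift follow the description above; the function name, the tie handling and the lag bookkeeping are assumptions.

```python
import numpy as np

def matching_histogram(F_i, F_j):
    """F_i, F_j: (frames, 12) chromagrams. Returns (lags, counts, best_lag)."""
    # Rectangular distance matrix: entry (r, c) compares frame r of F_i with frame c of F_j.
    dist = np.linalg.norm(F_i[:, None, :] - F_j[None, :, :], axis=2)

    # Minimum across each row, and the diagonal (lag) on which it falls.
    min_cols = np.argmin(dist, axis=1)
    lags = min_cols - np.arange(len(F_i))          # main diagonal = zero time shift

    # Matching histogram V: how many row minima fall on each diagonal.
    all_lags = np.arange(-(len(F_i) - 1), len(F_j))
    counts = np.array([(lags == k).sum() for k in all_lags])

    # The most populated diagonal gives the estimated time shift in frames;
    # multiplying by the hop size converts it to seconds.
    best_lag = int(all_lags[np.argmax(counts)])
    return all_lags, counts, best_lag
```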
  • the framework for video identification requires learning a parameter that indicates the likelihood of two particular videos corresponding to the same event.
  • This parameter is a classification threshold that is determined from matching histograms of some training recordings (an offline stage).
  • Figure 4 shows the block diagram of the proposed learning classification threshold system 400, where a set of training pairs of camera recordings is considered for learning the classification threshold θ.
  • a small number of UGVs is taken as a learning set, much smaller than the total number M of videos in the large database; these videos are drawn from Z_k different events, such that two overlapping recordings from each event are provided.
  • a threshold is used to identify whether or not there is a dominant peak in the matching plots V.
  • the matching plots V (see also Figure 3(b)) for all training pairs are calculated and the matching descriptors P' are computed.
  • if the database is small, a fixed threshold may be set, but time-consuming manual tuning of such a threshold may still be required; if the database is large and unseen, it may be non-trivial to manually set a fixed threshold. Therefore, we provide a method based on a support vector classifier that automatically obtains the classification threshold θ for the UGVs.
  • the classification threshold gives a decision boundary between match and non-match video pairs.
  • the threshold θ is learned by computing a descriptor P'_ij for each matching histogram V_ij at operation 406, then using P'_ij and its ground-truth class to calculate the threshold at operation 408. In this case, the ground-truth class specifies if O_i and O_j are a matching or non-matching pair.
  • FIG. 5 shows an example for match and non-match classes.
  • Figure 5(a) shows an example of the histogram obtained for match class
  • 5(b) shows an example of the histogram obtained for non-match class.
  • the histogram V_ij is first normalized with respect to the maximum obtained in each row of the distance matrix in order to limit its range to [0, 1].
  • the number of peaks existing at each scanning threshold is counted in the normalized histogram, thus providing the number of matches P_ij at each step of the scanning threshold T_r.
  • the number of matches corresponds to the number of features matched with a particular time shift.
  • the derivative P'_ij of P_ij is then computed, which gives a 100-point (the number of incremental steps) descriptor of the histogram V_ij. Since the histograms for match and non-match pairs have quite dissimilar trends, their descriptors P'_ij are also dissimilar. This is also visible in Figure 5.
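  • The post-processing of a matching histogram into the descriptor P' can be sketched as follows. The normalisation to [0, 1], the 100 incremental steps of the scanning threshold T_r and the final derivative follow the text; the specific peak test (a bin larger than both of its neighbours) is an assumption made for illustration.

```python
import numpy as np

def matching_descriptor(counts, steps=100):
    """counts: matching histogram V (1-D array). Returns the descriptor P'."""
    counts = np.asarray(counts, dtype=float)
    v = counts / (counts.max() + 1e-12)            # limit the range to [0, 1]

    # A bin counts as a peak if it exceeds both of its neighbours (assumption).
    is_peak = (v[1:-1] > v[:-2]) & (v[1:-1] > v[2:])
    peak_heights = v[1:-1][is_peak]

    # Scan the threshold T_r in `steps` incremental steps and count the peaks
    # surviving each step: the number of matches P at each step of T_r.
    thresholds = np.linspace(0.0, 1.0, steps)
    P = np.array([(peak_heights > t).sum() for t in thresholds])

    # The derivative of P gives the `steps`-point descriptor P'.
    return np.diff(P, prepend=P[0])
```

  • Descriptors computed in this way for labelled match and non-match training pairs could then be fed to a support vector classifier to obtain the decision boundary θ of Figure 4.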
  • a classification threshold θ is obtained which can be used to identify matching audio features and therefore matching videos.
  • the threshold may also be used for synchronising matching videos along a common timeline.
  • the method 100 may then be performed.
  • Chroma features F_q for the query recording and {F_k} for all representative recordings of the clusters are pre-computed at operation 602.
  • the matching histogram V_qk is then obtained by comparing F_q with each F_k at operation 604, where the features F_k are extracted at operation 603. It will be understood that operations 602 and 603 are the same operation, as the same type of features will be extracted from the query recording Q and all other videos.
  • Post-processing of these histograms is then performed at operation 606 to obtain the descriptors P'_qk.
  • the obtained descriptors are then classified at operation 608 using the threshold decision boundary θ, which gives the identified recording representing the cluster with the same overlapping event as that of Q.
  • the query-by-example video identification framework eliminates the need for a tagged document or keyword-based search, which is widely used in social media sharing websites. Instead, it takes as input a query recording Q and identifies all recordings which are similar to it.
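  • Putting the pieces together, the identification stage of Figure 6(a) might look like the sketch below, which reuses the matching_histogram and matching_descriptor helpers sketched earlier. The linear decision function (a 100-element weight vector plus the threshold θ) stands in for the trained support vector classifier and is an illustrative assumption, not the disclosed implementation.

```python
import numpy as np

def identify_matching_clusters(F_q, representatives, weights, theta):
    """F_q: query chromagram; representatives: {cluster_id: chromagram}.
    weights, theta: linear decision function learned offline (Figure 4)."""
    matching_ids = []
    for cluster_id, F_k in representatives.items():
        _, counts, _ = matching_histogram(F_q, F_k)      # operation 604
        descriptor = matching_descriptor(counts)         # operation 606
        score = float(np.dot(weights, descriptor))       # operation 608 (stand-in classifier)
        if score >= theta:
            matching_ids.append(cluster_id)
    return matching_ids
```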
  • the synchronization of a set of video recordings involves the estimation of the relative time differences, or time shifts, between video recordings before the time shifts are validated for consistency in the estimation.
  • a multi-display visualizer may then be used to play back the set C_k.
  • the universal time t of a frame in a recording is an instant referring to the continuous physical time. Frames captured at the same instant by multiple cameras should refer to the same universal time. Without loss of generality, consider two camera recordings O_k,1 and O_k,2 of the cluster C_k.
  • the time shift estimation is performed for all N_k recordings in order to synchronize them.
  • the validation of synchronization time shifts is then performed in order to filter out any false identification, if this occurs.
  • A block diagram of the proposed multi-camera synchronization framework is illustrated in Figure 6(b), where O_k,1 is considered as the representative recording. Positive time shifts correspond to recordings that start later than O_k,1 and negative time shifts to those that start earlier.
  • the two main steps involved in multi-camera synchronization are feature extraction and feature matching as discussed in detail above.
  • the feature vector F_k,n obtained from the n-th camera recording O_k,n of cluster C_k at operation 610 is compared with the feature vectors {F_k,n} of all the camera recordings in C_k at operation 612; matching is thus performed between all pairs of camera recordings (N_k × N_k).
  • operations 602 and 610 are the same operation, as the same type of features will be extracted from the query recording Q and the identified videos.
  • the feature vector F_k,n may be of a different type, or a chroma feature computed with different parameters with respect to the feature vector used for the identification described above; thus the operations 602 and 610 of Figure 6(b) may be different to operations 602 and 603 of Figure 6(a).
  • the feature vector representing each video is compared with all the others in order to obtain a consistent synchronization among all the videos in the same group. Even if the group is large, this process is highly parallelisable and can take advantage of modern search space partition methods.
  • a scoring scheme is developed based on the analysis of the distance matrix, which makes use of the fact that the diagonal containing the maximum number of minimum distances x_t (across each row), given by eq. 8, corresponds to the estimated time shift between two recordings, as explained above.
  • the estimated time shift between O_k,i and O_k,j is thus given by:
  • the (relative) delay matrix D_k is not anti-symmetric if any false identification occurs (a recording does not belong to the same event). Further analysis of D_k is thus required for validating the identification results in order to eliminate any false identification and to calculate consistent time shifts.
  • A histogram h_k,ii' is generated, where i ≠ i', ∀ i, i' ∈ [1, N_k].
  • the histogram h_k,ii' contains the time shifts between camera recordings O_k,i and O_k,i', where each camera O_k,j, ∀ j ∈ [1, N_k], is used as reference to test the time shift between O_k,i and O_k,i'.
  • one difference in Eq. 11 is performed in the top half of the delay matrix D_k and one in the bottom half, thus providing a cross-validation for the time shift.
  • the most frequently occurring value over all video recordings O_k,j in this histogram h_k,ii' is selected as the consistent time shift Δt_k,ii'.
  • Δt_k,ii' denotes the consistent time shift
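  • A hedged sketch of this cross-validation step is given below. D is the N_k × N_k delay matrix of pairwise time-shift estimates; since the exact form of Eq. 11 is not reproduced in the text, the chained difference used here (consistent with an anti-symmetric D) and the tolerance for flagging suspect recordings are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def consistent_time_shifts(D, tolerance=0.05):
    """D: N x N delay matrix, D[i, j] = estimated shift of recording i w.r.t. j.
    Returns (shifts, suspect): the consensus shifts and a per-recording flag."""
    N = D.shape[0]
    shifts = np.zeros_like(D, dtype=float)
    suspect = np.zeros(N, dtype=bool)
    for i in range(N):
        for k in range(N):
            if i == k:
                continue
            # Histogram h of shifts between i and k, tested through every other
            # recording j used as an intermediate reference (cf. Eq. 11).
            chained = [round(D[i, j] + D[j, k], 2) for j in range(N) if j not in (i, k)]
            if not chained:                      # only two recordings: keep the direct estimate
                shifts[i, k] = D[i, k]
                continue
            mode, _ = Counter(chained).most_common(1)[0]
            shifts[i, k] = mode
            # A direct estimate far from the consensus hints at a false identification.
            if abs(D[i, k] - mode) > tolerance:
                suspect[k] = True
    return shifts, suspect
```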
  • a multi-display visualizer has been developed as well in order to validate the obtained results and to coherently playback the identified UGVs using the estimated time shifts.
  • a single video is generated out of them by the system that mixes multiple video recordings in a single video with a coherent timing.
  • a clip from one of the synchronized videos is selected for a given time.
  • the final composite video may then be made up of a selection of consecutive clips, or portions, from the set of synchronized videos, where a portion is one or more frames of the video.
  • the decision of which video recording within the set is selected and shown in the final single video at each time step is performed by extracting characteristics that describe the spatio-temporal evolution of the scene.
  • the video recordings with the maximum number of characteristics are those that will potentially be selected to be part of the final composite video.
  • the video recording with the highest number of characteristics is selected and maintained in the composite video for a minimum number of m_f frames.
  • a decision whether to maintain the actual video recording or to switch to another video recording is made based on the spatio-temporal characteristics. The switch happens if there exists another video recording with a higher number of characteristics. If the switch happens, the new video recording will be maintained for m_f frames. If the switch does not happen, the initial video is maintained until a video recording with a higher number of characteristics appears. This process is repeated over time and continues until all video recordings are played.
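  • The switching rule just described can be sketched as follows; the scores array (number of spatio-temporal characteristics per recording per frame, after synchronisation on the common timeline) and the default value of m_f are assumptions standing in for the characteristic extraction, which is outside the scope of this sketch.

```python
import numpy as np

def select_composite(scores, m_f=25):
    """scores: (num_recordings, num_frames) array of characteristic counts.
    Returns the index of the recording shown at each frame of the composite."""
    _, num_frames = scores.shape
    selection = np.zeros(num_frames, dtype=int)
    current = int(np.argmax(scores[:, 0]))   # start with the richest recording
    hold = m_f                               # frames the current choice must still be kept
    for t in range(num_frames):
        if hold <= 0:
            best = int(np.argmax(scores[:, t]))
            # Switch only if another recording has a strictly higher count.
            if scores[best, t] > scores[current, t]:
                current = best
                hold = m_f                   # the new recording is kept for at least m_f frames
        selection[t] = current
        hold -= 1
    return selection
```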
  • the event clustering step may not be performed and, instead, the query video Q may be compared to all M videos in the data set.
  • descriptors P'_qn are extracted for all matching histograms V_qn and the set of recordings that overlaps with Q is obtained directly, without clustering the data set.
  • the synchronisation process may then be carried out as outlined above.
  • Table 1 summarizes the main characteristics of the datasets along with the introduced challenges.
  • the ground truth time shifts for all the recordings were manually generated. This is done by manually observing an audio event, visual event or audio-visual event in all recordings for each dataset.
  • the generated ground truth has a tolerance interval of ⁇ 1 video frame which approximately lies within ⁇ 0.04 seconds. The obtained manual ground truth is useful for the validation of our proposed method.
  • This value of frame size for synchronization was selected since it gives an accuracy of 0.01 milliseconds.
  • intelligent selection of frame size was required, since a value of frame size as small as 0.08 seconds would be sufficient for the identification process while a large value might not give accurate results.
  • the energy spectrum of the audio frames is computed on the logarithmic scale, where the minimum and maximum are set to 100Hz and 5000Hz. The computed spectrum energy is then redistributed along the 12 pitch classes (chroma) and matching is performed using the Euclidean distance measure. In each event dataset, one recording is considered as the representative camera recording for the time shift estimation, and the results are presented with respect to it.
  • the recording identification system comprises a first step of computing a classification threshold.
  • a small dataset was selected consisting of 7 events containing 42 recordings.
  • the frame size was selected as 3.5 seconds through experimentation.
  • This dataset gave 1764 matching pairs, on which training is performed and the classification threshold boundary is learned.
  • the testing is performed on 39 events containing 207 recordings for which all possible matches (42,849) are computed and ground truth for recording identification is generated.
  • Figure 7 shows the result of the recording identification system, with varying size of the training sample set.
  • Figure 7 shows the ROC for varying training sample size. It is observed through experimentation that with a sample size as small as 40 samples (with 20 positive and 20 negative), high accuracy is achieved. Training sample sizes of 20 and 40 were selected to learn the threshold by randomly selecting 10 and 20 positive and negative matching examples from the total 1764 pairs, respectively. Training is also performed by taking all 1764 matching pairs. The result shows that even with just 40 training samples, high accuracy is achieved.
  • the next step is to perform synchronization to bring them on a common time line.
  • a frame size of 0.08 seconds is selected to obtain precision of 0.01 seconds of the estimated time shifts.
  • Synchronization results on some of the collected datasets are shown in Figure 11 where the results are presented on a common timeline. Each recording is represented by two horizontal bars. The time shift estimation obtained using the proposed method is shown by the top bar, while the ground truth time shift is shown by the bottom bar.
  • all recordings are synchronized with an error between estimated and ground truth time shifts of less than 0.03 seconds, thus proving that the proposed method is robust to audio noise.
  • Figure 11(g) shows that for O_7 of the Event7_TorchMileEnd dataset, the obtained time shift estimate is not the same as that of the ground truth. This is due to a malfunctioning recording device that lost the audio signal for most of the time during the recording.
  • the results clearly depict that the proposed method is capable of handling non-amplified sound recordings containing high ambient noise.
  • the time-shift validation is also performed in order to verify that the obtained cluster of recordings belongs to the same event.
  • a multi-camera visualization software has been developed to play the synchronized recordings to further validate the obtained results.
  • AO indicates audio-onset based method
  • AF indicates audio fingerprinting method
  • AV indicates audio-visual method
  • AC indicates the proposed chroma based method.
  • audio-onset and audio-visual methods are comparable, while at times audio-visual gave slightly worse results than audio-onset. Since these two methods are sensitive to audio noise, for most of the public dataset recordings they did not obtain the correct time shift estimates with respect to the ground truth. Audio-fingerprint showed robustness toward ambient noise but failed to give the correct result for some recordings containing audio channel noise. The presented method outperformed the tested state-of-the-art methods giving an overall accuracy of 99%, as it was able to synchronise 238 out of 239 recordings.
  • Figure 13 shows a system 1300 which is used to carry out method 100 outlined above.
  • the system includes server 1302.
  • Server 1302 includes server processor 1304 and server memory 1306.
  • Server 1302 is connected to user port 1310 via network 1308.
  • Network 1308 may be a wireless internet network or any other suitable network.
  • User port 1310 includes port processor 1312, port memory 1314 and user interface 1316.
  • UGVs may be stored in server memory 1306, port memory 1314 or any other suitable memory not shown.
  • Method 100 may be performed on server processor 1304, port processor 1312 or any other suitable processor not shown.
  • the user enters commands into the user interface 1316 in order to perform method 100.
  • the various methods described above may be implemented by a computer program.
  • the computer program may include computer code arranged to instruct a computer to perform the functions of one or more of the various methods described above.
  • the computer program and/or the code for performing such methods may be provided to an apparatus, such as a computer, on a computer readable medium or computer program product.
  • the computer readable medium could be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or a propagation medium for data transmission, for example for downloading the code over the Internet.
  • the computer readable medium could take the form of a physical computer readable medium such as semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disc, or an optical disk, such as a CD-ROM, CD-R/W or DVD.
  • An apparatus such as a computer may be configured in accordance with such code to perform one or more processes in accordance with the various methods discussed herein.
  • Such an apparatus may take the form of a data processing system.
  • a data processing system may be a distributed system.
  • such a data processing system may be distributed across a network.

Abstract

A method, a system and a computer readable medium comprising computer readable code for synchronising a plurality of videos, each video of the plurality of videos comprising a video element and an audio element, the method comprising: obtaining features of one or more of the plurality of videos; identifying a reference video to which a number of videos of the plurality of videos are to be synchronised; identifying the number of videos of the plurality of videos that are to be synchronised to the reference video; and synchronising in time the number of videos and the reference video based on the obtained features.
PCT/GB2015/051395 2014-05-22 2015-05-13 Method for grouping, synchronising and composing a plurality of videos corresponding to the same event; also a system and a computer readable medium comprising computer readable code therefor WO2015177513A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1409147.4 2014-05-22
GBGB1409147.4A GB201409147D0 (en) 2014-05-22 2014-05-22 Media processing

Publications (1)

Publication Number Publication Date
WO2015177513A1 (fr) 2015-11-26

Family

ID=51177310

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2015/051395 WO2015177513A1 (fr) 2015-05-13 Method for grouping, synchronising and composing a plurality of videos corresponding to the same event; also a system and a computer readable medium comprising computer readable code therefor

Country Status (2)

Country Link
GB (1) GB201409147D0 (fr)
WO (1) WO2015177513A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090087161A1 (en) * 2007-09-28 2009-04-02 Gracenote, Inc. Synthesizing a presentation of a multimedia event
EP2450898A1 (fr) * 2010-11-05 2012-05-09 Research in Motion Limited Compilation vidéo mixte
US20120198317A1 (en) * 2011-02-02 2012-08-02 Eppolito Aaron M Automatic synchronization of media clips
US20140079372A1 (en) * 2012-09-17 2014-03-20 Google Inc. Method for synchronizing multiple audio signals

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ASHISH BAGRI ET AL: "A SCALABLE FRAMEWORK FOR JOINT CLUSTERING AND SYNCHRONIZING MULTI-CAMERA VIDEOS", EUSIPCO 2013, 8 September 2013 (2013-09-08), XP055098545, Retrieved from the Internet <URL:http://hal.inria.fr/docs/00/87/03/81/PDF/Bagri_et_al_EUSIPCO_2013.pdf> [retrieved on 20140127] *
LLAGOSTERA CASANOVAS ANNA ET AL: "Audio-visual events for multi-camera synchronization", MULTIMEDIA TOOLS AND APPLICATIONS, KLUWER ACADEMIC PUBLISHERS, BOSTON, US, vol. 74, no. 4, 23 March 2014 (2014-03-23), pages 1317 - 1340, XP035449847, ISSN: 1380-7501, [retrieved on 20140323], DOI: 10.1007/S11042-014-1872-Y *
SEBASTIAN EWERT ET AL: "High resolution audio synchronization using chroma onset features", ACOUSTICS, SPEECH AND SIGNAL PROCESSING, 2009. ICASSP 2009. IEEE INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 19 April 2009 (2009-04-19), pages 1869 - 1872, XP031459618, ISBN: 978-1-4244-2353-8 *
SOPHIA BANO ET AL: "Discovery and organization of multi-camera user-generated videos of the same event", INFORMATION SCIENCES, vol. 302, 1 May 2015 (2015-05-01), pages 108 - 121, XP055203323, ISSN: 0020-0255, Retrieved from the Internet <URL:http://www.sciencedirect.com/science/article/pii/S0020025514008159> [retrieved on 20150720], DOI: 10.1016/j.ins.2014.08.026 *

Also Published As

Publication number Publication date
GB201409147D0 (en) 2014-07-09

Similar Documents

Publication Publication Date Title
US11336952B2 (en) Media content identification on mobile devices
KR100893671B1 Generating and matching hashes of multimedia content
US10540993B2 (en) Audio fingerprinting based on audio energy characteristics
US20180144194A1 (en) Method and apparatus for classifying videos based on audio signals
EP1081960A1 Signal processing method and video/voice signal processing device
US20060013451A1 (en) Audio data fingerprint searching
US20140245463A1 (en) System and method for accessing multimedia content
US11736762B2 (en) Media content identification on mobile devices
Dimoulas et al. Syncing shared multimedia through audiovisual bimodal segmentation
CN109644283B Audio fingerprinting based on audio energy characteristics
US20150310008A1 (en) Clustering and synchronizing multimedia contents
RU2413990C2 Method and device for detecting boundaries of a content item
Bano et al. Discovery and organization of multi-camera user-generated videos of the same event
JP6159989B2 Scenario generation system, scenario generation method and scenario generation program
JP2022537894A System and method for synchronizing video and audio clips using audio data
US20210360313A1 (en) Event Source Content and Remote Content Synchronization
Kekre et al. A review of audio fingerprinting and comparison of algorithms
Bost et al. Constrained speaker diarization of TV series based on visual patterns
Duong et al. Movie synchronization by audio landmark matching
Chenot et al. A large-scale audio and video fingerprints-generated database of tv repeated contents
WO2015177513A1 (fr) Method for grouping, synchronising and composing a plurality of videos corresponding to the same event; also a system and a computer readable medium comprising computer readable code therefor
Bestagini et al. Feature-based classification for audio bootlegs detection
Shao et al. Automatically generating summaries for musical video
US10219047B1 (en) Media content matching using contextual information
Basaran et al. Multiresolution alignment for multiple unsynchronized audio sequences using sequential Monte Carlo samplers

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15724001

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15724001

Country of ref document: EP

Kind code of ref document: A1