AU2003248029B2

AU2003248029B2 - Audio Object Classification Based on Statistically Derived Semantic Information

Info

Publication number: AU2003248029B2
Application number: AU2003248029A
Authority: AU
Inventors: Timothy John Wark
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2002-09-17
Filing date: 2003-09-15
Publication date: 2005-12-08
Anticipated expiration: 2023-09-15
Also published as: AU2003248029A1

Description

S&F Ref: 645957

AUSTRALIA

PATENTS ACT 1990 COMPLETE SPECIFICATION FOR A STANDARD PATENT Name and Address of Applicant Actual Inventor(s): Address for Service: Invention Title: Canon Kabushiki Kaisha 30-2, Shimomaruko 3-chome, Ohta-ku Tokyo 146 Japan Timothy John Wark Spruson Ferguson St Martins Tower Level 31 Market Street Sydney NSW 2000 (CCN 3710000177) Audio Object Classification Based on Statistically Derived Semantic Information ASSOCIATED PROVISIONAL APPLICATION DETAILS [33] Country [31] Applic. No(s) AU 2002951439 [32] Application Date 17 Sep 2002 The following statement is a full description of this invention, including the best method of performing it known to me/us:- 5815c -1- AUDIO OBJECT CLASSIFICATION BASED ON STATISTICALLY DERIVED SEMANTIC INFORMATION Field of the Invention The present invention relates generally to the classification of audio objects and, in particular, to the use of statistically derived semantic features in order to describe audio objects.

Background There is an increasing demand for automated computer systems that extract meaningful information from large amounts of data. One such application is the extraction of information from continuous streams of audio. Such continuous audio streams may include speech from, for example, a news broadcast or a telephone conversation, or non-speech, such as music or background noise.

Hitherto a number of systems have been developed for automatically determining the identity of some "event", or "object", that occurs in audio. Such systems range from systems that attempt to identify a speaker from a short section of speech, identify a type of music given a short sample, or search for a particular audio occurrence, such as a type of noise, in a long section of audio. All these systems are based upon the idea of training an event or object model based on features from a set of samples with known identity, and then comparing test samples to a number of object models.

Many of the prior classification systems are based on the use of short-term or frame features in order to characterise objects. Each short-term feature is generally obtained from a small window of signal, typically between 10ms and 40ms in length.

Common short-term features are features such as energy, mel-cepstrum, pitch, linear- 645957.doc predictive coefficients, zero-crossing rate, etc. Whilst the use of these features is effective in scenarios where there is little mismatch or variation between training and testing conditions, they are far less effective when large variations occur. The prime reason for this is that very little semantic information is captured by such short-term features, because just the immediate characteristics of the observed signal are captured. Thus when the signal varies, for example through a channel change or environment change, whilst the overall semantic difference might be negligible, the differences in the immediate characteristics of the signal are enough to make the classification system ineffective.

Some more recent classification systems have considered the use of long-term features in order to characterise objects. Long term features are derived from a set of short-term features and alleviate many of the problems with short-term features by capturing much of the higher-level, semantic information. Examples of long-term features are measures such as the volume dynamic range, the volume undulation, the pitch variance, the frequency centroid. Typically a long-term feature will be derived from a section of speech at least 0.5 seconds long, and could be as long as 10 seconds or more.

These previous systems designed for the detection of objects in audio based on clip features have utilised somewhat heuristic techniques to develop specific algorithms to extract clip features well suited to the types of objects that are being searched for.

However, whenever new object types have to be detected, such a system has to be revised to accommodate the new object types. This often requires for a new heuristic approach to be developed in order to ensure that the clip features are well suited to the new object types.

645957.doc

I

-3- Summary of the Invention It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.

Improved long-term features for extracting semantic information from an audio stream are disclosed. The long-term features describe the underlying characteristics of an object under varying conditions, and are based upon the observation that much useful semantic information about an object is contained within its higher-order moments.

According to an aspect of the invention, there is provided a method of extracting semantic information from an audio clip comprising a sequence of audio samples, said method comprising the steps of: forming a sequence of frames along said sequence of audio samples, each of said frames comprising a plurality of said audio samples; extracting for each frame at least one frame feature; and extracting said semantic information from moments of said at least one frame feature, said moments including moments of the order higher than 2.

Other aspects of the invention are also disclosed.

Brief Description of the Drawings One or more embodiments of the present invention will now be described with reference to the drawings, in which: Fig. 1 shows a flow diagram of a method of segmenting an audio stream into homogeneous segments, and then classifying each of those homogeneous segments; Fig. 2 shows a schematic block diagram of a system upon which audio classification can be practiced; 645957.doc -4- Fig. 3 shows a flow diagram of the preferred sub-steps of segmenting the audio stream into homogeneous segments; Fig. 4 illustrates a sequence of sampled audio, the division of the sequence into frames, and the segmentation of the frames; Fig. 5A illustrates a distribution of example frame features and the distribution of a Gaussian event model that best fits the set of frame features obtained from a segment of speech; Fig. 5B illustrates a distribution of the example frame features of Fig. 5A and the distribution of a Laplacian event model that best fits the set of frame features; Fig. 6A illustrates a distribution of example frame features and the distribution of a Gaussian event model that best fits the set of frame features obtained from a segment of music; Fig. 6B illustrates a distribution of the example frame features of Fig. 6A and the distribution of a Laplacian event model that best fits the set of frame features; Figs. 7A and 7B show a sequence of frames and the sequence of frames being divided into two segments; Fig. 8A shows a flow diagram of a method for detecting a single transition-point within a sequence of frame features; Fig. 8B shows a flow diagram of a method for detecting multiple transition-points within a sequence of frame features; Fig. 9 shows a flow diagram of the preferred sub-steps for extracting a feature vector for each clip; Fig. 10 illustrates the classification of the segment against 4 known classes A, B.

C and D; 645957.doc Fig. 11 is a diagram illustrating visually the nature of the Gaussian mixture model; Fig. 12 is a diagram illustrating the inter-class distances for object models; Figs. 13A and 13B illustrate the manner in which Parzen windowing is utilized to estimate the probability density function (pdf) from a set of samples; Fig. 14 illustrates the manner in which the Characteristic Function (CF) is windowed with log-spaced bins; and Fig. 15 shows a plot of the distribution of two particular semantic features, namely the first two moments derived from energy.

Detailed Description Some portions of the description which follow are explicitly or implicitly presented in terms of algorithms and symbolic representations of operations on data within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated.

It should be borne in mind, however, that the above and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels.

Unless specifically stated otherwise, and as apparent from the following, it will be appreciated that throughout the present specification, discussions refer to the action and processes of a computer system, or similar electronic device, that manipulates and 645957.doc transforms data represented as physical (electronic) quantities within the registers and memories of the computer system into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.

Fig. 1 shows a flow diagram of a method 200 of segmenting an audio stream in the form of a sequence x(n) of sampled audio from unknown origin into homogeneous segments, and then classifying each of those homogeneous segments to assign to each an object label. A homogeneous segment is a segment only containing samples from a source having constant acoustic characteristic, such as from a particular human speaker, a type of background noise, or a type of music.

Fig. 2 shows a schematic block diagram of a system 100 upon which audio classification can be practiced. The system 100 comprises a computer module 101, such as a conventional general-purpose computer module, input devices including a keyboard 102, pointing device 103 and a microphone 115, and output devices including a display device 114 and one or more loudspeakers 116.

The computer module 101 typically includes at least one processor unit 105, a memory unit 106, for example formed from semiconductor random access memory (RAM) and read only memory (ROM), input/output interfaces including a video interface 107 for the video display 114, an IO interface 113 for the keyboard 102, the pointing device 103 and interfacing the computer module 101 with a network 118, such as 645957.doc -7the Internet, and an audio interface 108 for the microphone 115 and the loudspeakers 116.

A storage device 109 is provided and typically includes a hard disk drive and a floppy disk drive. A CD-ROM or DVD drive 112 is typically provided as a non-volatile source of data. The components 105 to 113 of the computer module 101, typically communicate via an interconnected bus 104 and in a manner which results in a conventional mode of operation of the computer module 101 known to those in the relevant art.

Audio data for processing by the system 100, and in particular the processor 105, may be derived from a compact disk or video disk inserted into the CD-ROM or DVD drive 112 and may be received by the processor 105 as a data stream encoded in a particular format. Audio data may alternatively be derived from downloading audio data from the network 118. Yet another source of audio data may be recording audio using the microphone 115. In such a case, the audio interface 108 samples an analog signal received from the microphone 115 and provides the audio data to the processor 105 in a particular format for processing and/or storage on the storage device 109.

The audio data may also be provided to the audio interface 108 for conversion into an analog signal suitable for output to the loudspeakers 116.

Referring again to Fig. 1, the method 200 is preferably implemented in the system 100 by a software program executed by the processor 105 (Fig. It is assumed that the audio stream is appropriately digitised at a sampling rate F. Those skilled in the art would understand the steps required to convert an analog audio stream into the sequence x(n) of sampled audio. In an example arrangement, the audio stream is sampled at a sampling rate F of 16 kHz and the sequence x(n) of sampled audio is stored on the storage device 109 in a form such as a .wav file or a .raw file. The method 200 starts in step 202 where the 645957.doc sequence x(n) of sampled audio are read from the storage device 109 and placed in memory 106.

The method 200 continues to step 204 where the sequence x(n) of sampled audio is segmented into a number of homogeneous segments. In other words, step 204 locates the boundaries in time where the characteristics of the audio signal, contained in the sequence x(n) of sampled audio, significantly change. For example, this could constitute such changes as a transition from speech to music, or a transition from silence to noise.

The preferred segmentation utilises the Bayesian Information Criterion (BIC) for segmenting the sequence x(n) of sampled audio into a number of homogeneous segments.

In order for the BIC to be applied to the sequence x(n) of sampled audio, one or more features must be extracted for each small, incremental interval of K samples along the sequence An underlying assumption is that the properties of the audio signal change relatively slowly in time, and that each extracted feature provides a succinct description of important characteristics of the audio signal in the associated interval. Ideally, such features extract enough information from the underlying audio signal so that the subsequent segmentation algorithm can perform well, and yet be compact enough that segmentation can be performed very quickly.

Fig. 3 shows a flow diagram of the preferred sub-steps of step 204 for segmenting the sequence x(n) of sampled audio. Step 204 starts in sub-step 304 where the processor 105 forms interval windows or frames, each containing K audio samples. In the example, a frame of 20 ms is used, which corresponds to K=320 samples at the sampling rate F of 16 kHz. Further, the frames are overlapping, with the start position of the next frame positioned only 10 ms later in time, or 160 samples later, providing a shift-time of 10 ms.

645957.doc -9- Fig. 4 illustrates a sequence x(n) of sampled audio. Frames 701 to 704 are also illustrated.

Referring again to Fig. 3, in sub-step 306 a Hamming window function of the same length as that of the frames, i.e. K sample long, is applied by the processor 105 to the sequence samples x(n) in each frame to give a modified set of windowed audio samples s(i,k) for frame i, with ke The purpose of applying the Hamming window is to reduce the side-lobes created when applying the Fast Fourier Transform (FFT) in subsequent operations.

In sub-step 308 the bandwidth bw(i) of the modified set of windowed audio samples s(i,k) of the i'th frame is calculated by the processor 105 as follows: S( o FC(i)) 2 Si (o 2 do bw(i)= (1) f ISi(cof 2 dco where |S i (c 2 is the power spectrum of the modified windowed audio samples s(i,k) of the i'th frame, co is a signal frequency variable for the purposes of calculation, and FC is the frequency centroid, defined as: r wS;( )2 dw FC(i)= f S 2 d (2) fIs,(c2 d co The Simpson's Rule of integration is used to evaluate the integrals. The Fast Fourier Transform is used to calculate the power spectrum S,( 1 2 whereby the samples having length K, are zero padded until the next highest power of 2 is reached. Thus, in the example where the length of the samples s(i,k) is 320, the FFT would be applied to a vector of length 512, formed from 320 modified windowed audio samples s(i,k) and 192 zero components.

645957.doc In sub-step 310 the energy E(i) of the modified set of windowed audio samples s(i,k) of the i'th frame is calculated by the processor 105 as follows: K s k) (3) A segmentation frame featurefs(i) for each frame i is calculated by the processor 105 in sub-step 312 by weighting the frame bandwidth bw(i) by the frame energy E(i).

This forces a bias in the measurement of bandwidth bw(i) in those frames i that exhibit a higher energy and are thus more likely to come from an event of interest, rather than just background noise. The segmentation frame featuresfs(i) are thus calculated as being: f, E(i)bw(i) (4) The BIC is used in sub-step 314 by the processor 105 to segment the sequence of segmentation frame features fs(i) into homogeneous segments, such as the segments illustrated in Fig. 4. The output of sub-step 314 is one or more frame numbers of the frames where changes in acoustic characteristic were detected ("transition-points").

The value of the BIC is a statistical measure for how well a model represents a set of segmentation frame featuresfi(i), and is calculated as:

D

BIC log(L) log(N) 2 where L is the maximum-likelihood probability for a chosen model to represent the set of segmentation frame features fs(i), D is the dimension of the model which is 1 when the segmentation frame features fs(i) of Equation is used, and N is the number of segmentation frame featuresfs(i) being tested against the model.

645957.doc 11- The maximum-likelihood L is calculated by finding parameters 0 of the model that maximise the probability of the segmentation frame features fi(i) being from that model. Thus, for a set of parameters 0, the maximum-likelihood L is: L max P(f, (6) 0 Segmentation using the BIC operates by testing whether the sequence of segmentation frame features f(i) are better described by a single-distribution event model, or a twin-distribution event model, where the first m number of frames, those being frames are from a first source and the remainder of the N frames, those being frames are from a second source. The frame m is accordingly termed the transitionpoint. To allow a comparison, a criterion difference ABIC is calculated between the BIC using the twin-distribution event model with that using the single-distribution eventmodel. As the change-point m approaches a transition in acoustic characteristics, the criterion difference ABIC typically increases, reaching a maximum at the transition, and reducing again towards the end of the N frames under consideration. If the maximum criterion difference ABIC is above a predefined threshold, then the two-distribution event model is deemed a more suitable choice, indicating a significant transition in acoustic characteristics at the transition-point m where the criterion difference ABIC reached a maximum.

Current BIC segmentation systems assume that D-dimensional segmentation frame features fs(i) are best represented by a Gaussian event model having a probability density function of the form: lD exp{ (7) (27c 2 2 645957.doc 12where is the mean vector of the segmentation frame features fs(i), and Z is the covariance matrix. The segmentation frame featurefs(i) of the preferred implementation is one-dimensional and as calculated in Equation Fig. 5A illustrates a distribution 500 of segmentation frame features fs(i), where the segmentation frame features fs(i) were obtained from an audio stream of duration 1 second containing voice. Also illustrated is the distribution of a Gaussian event model 502 that best fits the set of segmentation frame featuresfi(i).

It is proposed that segmentation frame features fs(i) representing the characteristics of audio signals such as a particular speaker or block of music, is better represented by a leptokurtic distribution, particularly where the number N of features being tested against the model is small. A leptokurtic distribution is a distribution that is more peaked than a Gaussian distribution. An example of a leptokurtic distribution is a Laplacian distribution. Fig. 5B illustrates the distribution 500 of the same segmentation frame features fs(i) as those of Fig. 5A, together with the distribution of a Laplacian event model 505 that best fits the set of segmentation frame featuresfs(i). It can be seen that the Laplacian event model gives a much better characterisation of the feature distribution 500 than the Gaussian event model.

This proposition is further illustrated in Figs. 6A and 6B wherein a distribution 600 of segmentation frame features fi(i) obtained from an audio stream of duration 1 second containing music is shown. The distribution of a Gaussian event model 602 that best fits the set of segmentation frame features fs(i) is shown in Fig. 6A, and the distribution of a Laplacian event model 605 is illustrated in Fig. 6B.

A quantitative measure to substantiate that the Laplacian distribution provides a better description of the distribution characteristics of the segmentation frame features f(i) 645957.doc 13for short events rather than the Gaussian model is the Kurtosis statistical measure which provides a measure of the "peakness" of a distribution and may be calculated for a sample set X as: 3 (8) (var(x))2 For a true Gaussian distribution, the Kurtosis measure K is 0, whilst for a true Laplacian distribution the Kurtosis measure K is 3. In the case of the distributions 500 and 600 shown in Figs. 5A and 6A, the Kurtosis measures K are 2.33 and 2.29 respectively.

Hence the distributions 500 and 600 are more Laplacian in nature than Gaussian.

The Laplacian probability density function in one dimension is: exp- (9) where pu is the mean of the segmentation frame features fi(i) and a is their standard deviation. In a higher order feature space with segmentation frame features fs(i), each having dimension D, the feature distribution is represented as: 2 (f (f g(f, (i),z2 2(f, (27)2 2 2 where v (2 D) 2 and K v is the modified Bessel function of the third kind.

Whilst the method 200 can be used with multi-dimensional segmentation features fs(i), the rest of the analysis is contained to the one-dimensional space due to the use of the one-dimensional segmentation frame featurefs(i) shown in Equation 645957.doc -14- Given N segmentation frame featuresf/(i) as illustrated in Fig. 7A, the maximum likelihood L for the set of segmentation frame featuresfs(i) falling under a single Laplacian distribution is: exp f, (11) where cr is the standard deviation of the segmentation frame features fs(i) and j is the mean of the segmentation frame features Equation (11) may be simplified in order to provide: L= 2cr2) exp f (12) SC i The maximum log-likelihood log(L), assuming natural logs, for all N frame featuresj(i) to fall under a single Laplacian event model is thus: log() N (2 2 s)-p (13) 2 a Fig. 7B shows the N frames being divided at frame m into two segments 550 and 555, with the first m number of frames forming segment 550 and the remainder of the N frames forming segment 555. A log-likelihood ratio R(m) of a twin- Laplacian distribution event model to a single Laplacian distribution event model, with the division at frame m and assuming segment 550 is from a first source and segment 555 is from a second source, is: R(m) log(L,) log(L 2 log(L) (14) where: log(L,)= m(2, 2 2 i, 2 al i=1 645957.doc and (N2-rn) 2 -2 N(16) log(L2)= )2 fS(i)-p2 (16) 2 "2 i=m+1 wherein and {/p 2 '2 are the means and standard-deviations of the segmentation frame featuresfi(i) before and after the change point m.

The criterion difference ABIC for the Laplacian case having a change point m is calculated as: ABIC(m)= R(m)-log (17) 2N In a simplest of cases where only a single transition is to be detected in a section of audio represented by a sequence of N segmentation frame features the most likely transition point m is given by: m arg(max ABIC(m)) (18) Fig. 8A shows a flow diagram of a method 300 for detecting a single transitionpoint m within a sequence of N segmentation frame features f(i) that may be substituted as sub-step 314 in step 204 shown in Fig. 3. When more than one transition-point rhj is to be detected, the method 400 shown in Fig. 8B is substituted as sub-step 314 in step 204 (Fig. Method 400 uses method 300 as is described below.

Method 300, performed by the processor 105, receives a sequence of N' segmentation frame features fs(i) as input. When method 300 is substituted as sub-step 314 in step 204, then the number of frames VN equals the number of features N. In step 805 the change-point m is set by the processor 105 to 1. The change-point m sets the point dividing the sequence of N' frame featuresf(i) into two separate sequences namely m] and 645957.doc

I

-16- Step 810 follows where the processor 105 calculates the log-likelihood ratio R(m) by first calculating the means and standard deviations and 2 u,,a, 2 of the segmentation frame featuresfi(i) before and after the change-point m. Equations and (16) are then calculated by the processor 105, and the results are substituted into Equation The criterion difference ABIC(m) for the Laplacian case having the change-point m is then calculated by the processor 105 using Equation (17) in step 815.

In step 820 the processor 105 determines whether the change point m has reached the end of the sequence of length If the change-point m has not reached the end of the sequence, then the change-point m is incremented by the processor 105 in step 825, and steps 810 to 820 are repeated for the next change-point m. When the processor 105 determines in step 820 that the change-point m has reached the end of the sequence, then the method 300 proceeds to step 830 where the processor 105 determines whether a significant change in the sequence of N' segmentation frame features fi(i) occurred by determining whether the maximum criterion difference max[ABIC(m)] has a value that is greater than a predetermined threshold. In the example, the predetermined threshold is set to 0. If the change was determined by the processor 105 in step 830 to be significant, then the method 300 proceeds to step 835 where the most likely transition-point A is determined using Equation and the result is provided as output of method 300.

Alternatively, if the change was determined step 830 not to be significant, then the method 300 proceeds to step 840 where the null string is provided as output.

Fig. 8B shows a flow diagram of the method 400 for detecting multiple transition-points thi within the sequence of N segmentation frame features fs(i) that may be used as sub-step 314 in step 204 shown in Fig. 3. Method 400 thus receives the sequence of N segmentation frame features fi(i) from sub-step 312 (Fig. Given an 645957.doc 17audio stream that is assumed to contain an unknown number of transition-points hj, the method 400 operates principally by analysing short sequences of segmentation frame features fs(i), with each sequence consisting of Nmin segmentation frame features and detecting a single transition-point t. within each sequence, if it occurs, using method 300 (Fig. 8A). Once all the transition-points hi have been detected, the method 400 performs a second pass wherein each of the transition-points ,ti detected are verified as being significant by analysing the sequence of segmentation frame features fi(i) included in the segments either side of the transition-point mA under consideration, and eliminating any transition-points thj verified not to be significant. The verified significant transitionpoints rh' are then provided as output.

Method 400 starts in step 405 where the sequence of segmentation frame features fs(i) are defined by the processor 105 as being the sequence The first sequence includes Nmin features and method 400 is therefore initiated with a=l and b=a+Nmi. The number of features Nmn, is variable and is determined for each application. By varying Nmi,, the user can control whether short or spurious events should be detected or ignored, where the requirement being different with each scenario. In the example, a minimum segment length of 1 second is assumed. Given that the segmentation frame features f,(i) are extracted every 10 ms, being the window shift time, the number of features Nmj, is set to 100.

Step 410 follows where the processor 105 detects a single transition-point ri within the sequence if it occurs, using method 300 (Fig. 8A) with In step 415 the processor 105 determines whether the output received from step 410, i.e.

method 300, is a transition-point rh or a null string indicating that no transition-point thj 645957.doc 18 occurred in the sequence If a transition-point rh i was detected in the sequence then the method 400 proceeds to step 420 where that transition-point 4i is stored in the memory 106. Step 425 follows wherein a next sequence is defined by the processor 105 by setting a= hj+61 and b=a+Nmin, where 81 is a predetermined small number of frames.

If the processor 105 determines in step 415 that no significant transition-point A., was detected in the sequence then the sequence [fs(a)fs(b)] is lengthened by the processor 105 in step 430 by appending a small number 62 of segmentation frame featuresfs(i) to the sequence by defining b=b+5 2 From either step 425 or 430 the method 400 proceeds to step 435 where the processor 105 determines whether all N segmentation frame features fs(i) have been considered. If all N segmentation frame featuresfs(i) have not been considered, then control is passed by the processor 105 to step 410 from where steps 410 to 435 are repeated until all the segmentation frame features f(i) have been considered.

The method 400 then proceeds to step 440, which is the start of the second pass.

In the second pass the method 400 verifies each of the N, transition-point detected in steps 405 to 435. The transition-point th is verified by analysing the sequence of segmentation frame featuresfs(i) included in the segments either side of a transition-point rhi under consideration. Hence, when considering the transition-point rh, the sequence 1 is analysed, with counter n initially being 0 and the verified transition-point ri also being 0.

Step 440 starts by setting counters j to 1 and n to 0. Step 445 follows where the processor 105 detects a single transition-point r within the sequence 645957.doc -19- [fi( +1)fs(i if it occurs, using again method 300 (Fig. 8A). In step 450 the processor 105 determines whether the output received from step 445, i.e. method 300, is a transition-point m or a null string indicating that no significant transition-point A occurred in the sequence 1 +n If a transition-point A was detected in the sequence 1)f,(i+ l +n then the method 400 proceeds to step 455 where that transition-point m is stored in memory 106 and in a sequence of verified transition-points h' Step 460 follows where the counterj is incremented and n is reset to 0.

Alternatively if the processor 105 in step 450 determined that no significant transition-point it was detected by step 445, then the sequence [fs( is merged as a single segment, and the counter n is also incremented thereby extending the sequence of segmentation frame features f(i) under consideration to the next transitionpoint i From either step 460 or 465 the method 400 proceeds to step 470 where it is determined by the processor 105 whether all the transition-points fim have been considered for verification. If any transition-points mj remain, control is returned to step 445 from where steps 445 to 470 are repeated until all the transition-points mi have been considered. The method 400 provides the sequence of verified transition-points m' as output. Referring again to Fig. 3, sub-step 314, and thus also step 204, provide as output homogeneous segments defined by the sequence of verified transition-points ri' Referring again to Fig. 1, from step 204 the method 200 proceeds to label the identity of an object contained within a particular homogeneous segment. To classify each of the homogeneous segments, a number of statistical features are extracted from each segment. Whilst previous systems extract from each segment a feature vector, and then 645957.doc classify the segments based on the distribution of the feature vectors, method 200 divides each homogenous segment into a number of smaller sub-segments, or clips hereinafter, with each clip large enough to extract a meaningful feature vector f from the clip. The clip feature vectors f are then used to classify the segment from which it is extracted. The key advantage of extracting a number of feature vectors f from a series of smaller clips rather than a single feature vector for a whole segment is that the characteristics of the distribution of the feature vectors f over the segment of interest may be examined. Thus, whilst the signal characteristics over the length of the segment are expected to be reasonably consistent, by virtue of the segmentation algorithm, some important characteristics in the distribution of the feature vectors f over the segment of interest is significant for classification purposes.

Step 206 divides each homogenous segment into a number of clips, with each clip comprising B frames. The number B of frames included in each clip is predefined, except the last clip of each segment which extends beyond the predefined length to the end of the segment. In the example where each frame is 20 ms long and overlapping with a shifttime of 10 ms, each clip is defined to be at least 0.8 seconds long. The clip thus comprises at least 80 frames.

Step 208 follows where the processor 105 extracts for each clip a feature vector f containing semantic information of the clip. In particular, the feature vector f of each clip is extracted from the characteristic function (CF) of the frame bandwidths bw(i) and frame energies E(i) of the frames associated with that clip. The frame bandwidths bw(i) and frame energies E(i) of the frames were extracted in sub-steps 308 and 310 (Fig. 3) respectively.

645957.doc -21 The CF is defined as the Fourier transform of the probability density function of a signal, in this case the frame bandwidths bw(i) and frame energies E(i) of the frames associated with the clip. Hence, the CF is defined as: cf(t)= e p(x)dx (19) where p(x) is the probability density function and t is an arbitrary variable within the CF sub-space. The CF uniquely describes the probability density function p(x) and also contains the information about the moments of the probability density function p(x) as follows: cf )-j where p is the nth moment about 0 (raw moment) andj is the square root of negative one.

In order to estimate the CFs of the frame bandwidths bw(i) and frame energies E(i) of the frames associated with the clip, the probability density functions p(x) of those signals bw(i) and denoted pbw(x) and pE(x) respectively, within the clip have to be estimated. Whilst some parametric techniques, such as the Expectation-Maximisation (EM) approach, may be used to estimate the probability density functions pbw(x) and pE(x), the preferred implementation utilizes a Parzen Window approach, which is a faster approach when only a small number of samples are present.

The Parzen window approach to density estimation effectively smoothes the histogram of the samples being averaged by constraining the characteristics of the probability density function p(x) with a "kernel" function In the preferred 645957.doc -22implementation the kernel function used is a zero mean, unit variance, univariate normal function or Gaussian of the form: 1 e2/2 (21) The Gaussian kernel p(u) is used as the Gaussian distribution provides a good description of the nature of the underlying modalities within real distributions. The probability density functions Pbw(x) and pE(x) of the frame bandwidths bw(i) and frame energies E(i) respectively are then estimated over the set of samples bw(i) and with i as follows: Pbw 1 P Bhb x bw(i)' and (22) Bhb hbw PE(X)= (23) Bh h E 1= where B is the number of frames in the clip, while hbw and hE are parameters of the Parzen window. In a preferred implementation the parameters hbw and hE are chosen to be: hbw (max(bw(i))- min(bw(i)), and (24) 8 f hE= 1 min(E(i))) 8 i The effectiveness of using the Parzen Window approach is shown in Figs. 13A and 13B. Fig. 13A shows a histogram determined from arbitrary raw data, whilst Fig. 13B shows the estimated probability density function of the raw data using the Gaussian kernel of Equation As can be seen from Fig. 13B, the probability density function is estimated to be the sum of a number of normal functions.

645957.doc -23- Fig. 9 shows a flow diagram of the preferred sub-steps of step 208 for extracting for each clip a feature vector f. Step 208 starts in sub-step 902 where the frame bandwidths bw(i) and frame energies E(i) are normalised with their respective first order moments by dividing the frame bandwidths bw(i) and frame energies E(i) with their respective means. Sub-step 904 follows where the probability density functions pbw(x) and pE(x) are estimated using Equations (22) and (23) respectively. In sub-step 906 the processor 105 calculates the CFs cfbw and efE of the frame bandwidths bw(i) and frame energies E(i) respectively using Equation Preferably the Fast Fourier Transform (FFT) is used to calculate the characteristic function cf(t) for values t e Due to the symmetry of the FFT, the characteristic function cf(t) for values t e is the mirror image of the characteristic function cf(t) for values x e (0,oo).

The CF, cf(t), typically includes more information in the lower values of variable t. This property of the CF cf(t) is due to the fact that most information about a random variable is typically contained in its first few moments, whereas many higher level moments are required to give the complete accurate description. In order to better extract the semantic information contained in the CFs cfbw and cfE, the CFs cfbw and cfE are logweighted in sub-step 908.

Preferably, log-spaced binning is used in order to focus more bins over the lower values of the probability density functions pbw(X) and pE(x). In particular, triangular bins are used to weight the values of the CF cf(t). Fig. 14 illustrates a CF cf(t) and the bins used to weight the CF cf(t). Each bin is defined to have a value of 1 at the center of the bin, and the value of 0 at the borders thereof. Hence, all values of the CF cf(t) are weighted by a value between 0 and 1 by each bin, and then summed with all other weighted values under the bin. The log-spaced binning provides Nbi, weighted 645957.doc -24characteristic function values cfsumi, Nb,. The lower and upper bounds of the ith bin are calculated as follows: K( i -1 1127 (N 2) binlower. =700 exp Nbin 2 -1 (26) binupper, 700 exp f '2 -1 (27) 1127 where parameter K controls the relative spacing of bins and Nbin is the number of bins.

Preferably 12 bins are used to bin the various CF values within the range of t, and parameter K 1127 x log(l +15000/700), which has been derived from various empirical observations.

Referring again to Fig. 9, step 208 continues in sub-step 910 where the processor 105 reduces the order of the weighted characteristic function values cfsumi to extract only the most important information therefrom. The most common statistical method used do this is via Principal Component Analysis (PCA). However, PCA is often a computationally expensive process as it requires the calculation of eigenvalues and eigenvectors from the covariance matrix of the data.

With this particular data, a more efficient method of dimension reduction is possible via the Discrete Cosine Transform (DCT). Here the Cosine function is very close to the mapping applied by the eigenvectors through PCA, but is much quicker to apply.

Thus for calculating N c final CF values from the weighted characteristic function values cfsumi, the following transform is used: 645957.doc 2 Nb,, 0 5 (28) f( j) Nb cfsum 5 2 8 where j N c The preferred value of Nc is 2.

Step 208 ends in sub-step 912 where the processor 105 extracts the feature vector f, which is preferably based upon a combination of the final CF valuesfvE andfvbw, which were derived from the respective CFs cfE and cfbw, together with some lower level central moments p'2) and The final CF values fVE and fbw contributes information from the higher order moments. The preferred feature vector f for the clip is thus: (2) fvE(1) f= fvE (29) Pbw f bw (2) frbw((2)2 The lower level central moments I2) andb being the variance of the frame energies E(i) and the mean of the frame bandwidths bw(i) of the frames associated with that clip, may be computed directly without the need to calculate the moments cfE 2 and cfbw for the CFs cfE and cfbw.

To illustrate the nature of the distribution of individual semantic features over a homogenous segment, Fig. 15 shows a plot of the distribution of two particular semantic features derived from the frame energies The distributions of each clip features, as shown in this example, are clearly multi-modal in nature.

Referring again to Fig. 1, from step 208 method 200 continues to step 210 where the processor 105 classifies each segment based on the distribution of its clip feature 645957.doc -26vectors f. Because the segment is homogeneous, it relates to a single object and only one label type is attached to each segment.

The classification step 210 operates to solve what is known in pattern recognition literature as an open-set identification problem. The open-set identification may be considered as a combination between a standard closed-set identification scenario and a verification scenario. In a standard closed-set identification scenario, a set of test features from unknown origin are classified against features from a finite set of classes, with the most probable class being allocated as the identity label for the object associated with the set of test features. In a verification scenario, again a set of test features from an unknown origin is presented. However, after determining the most probable class, it is then determined whether the test features match the features of the class closely enough in order to verify its identity. If the match is not close enough, the identity is labelled as "unknown".

In an open-set identification problem, a finite number of classes are presented for classification against. An additional class is also used, labelled "unknown", and is assigned to test features that are deemed to not belong to any of the primary classes. The open-set identification problem is well suited to classification in an audio stream, as it is not possible to adequately model every type of event that may occur in an audio sample of unknown origin. It is therefore far better to label an event, which is dissimilar to any of the trained models, as "unknown", rather than falsely label it as another class.

Fig. 10 illustrates the classification of the segment, characterised by its extracted feature vectors f, against 4 known classes A, B, C and D, with each class being defined by an object model. The extracted feature vectors f are "matched" against the object models by determining a model score between the feature vectors f of the segment and each of the 645957.doc -27object models. An empirically determined threshold is applied to the best model score. If the best model score is above the threshold, then the label of the class A, B, C or D to which the segment was more closely matched is assigned as the object label. However, if the best model score is below the threshold, then the segment does not match any of the object models closely enough, and the segment is assigned the label "unknown".

Given that the distribution of clip features is multi-modal, a simple distance measure, such as Euclidean or Mahalanobis, will not suffice for calculating a score for the classification. A classifier based on a mixture of Gaussians, or Gaussian Mixture Model (GMM) is used for the object classification. A Gaussian mixture density is defined as a weighted sum of M component densities, expressed as:

M

p(x A) pib i=1 where x is a D dimensional random sample vector, bi(x) are the component density functions, and pi are the mixture weights.

Each density function b, is a D dimensional Gaussian function of the form: 1 exp (31) (2)TEl l 2 i where is the covariance matrix and pi the mean vector for the density function b,.

The Gaussian mixture model A c with c 1,2,...,Cwhere C is the number of class models, is then defined by the covariance matrix and mean vector [i for each density function b i and the mixture weights pi, collectively expressed as: c i= (32) 645957.doc -28- The characteristics of the probability distribution function p(x/c) of the GMM can be more clearly visualized when using two-dimensional sample data x. The example shown in Fig. 11 shows a five-mixture GMM for a sample of two-dimensional speech features xl and x 2 where x=[xlx 2 The GMM A c is formed from a set of labelled training data via the expectationmaximization (EM) algorithm known in the art. The labelled training data is clip feature vectors f extracted from clips with known origin. The EM algorithm is an iterative algorithm that, after each pass, updates the estimates of the mean vector A i ,'covariance matrix Z, and mixture weights pi. Around 20 iterations are usually satisfactory for convergence.

In a preferred implementation GMM's with 6 mixtures and diagonal covariance matrices are used. The preference for diagonal covariance matrices Z i over full covariance matrices is based on the observation that GMM's with diagonal covariance matrices E i are more robust to mismatches between training and test data.

With the segment being classified comprising T clips, and hence being characterised by T clip feature vectors ft, the model score between the clip feature vectors f, of the segment and one of the C object models is calculated by summing the log statistical likelihoods of each of T feature vectors f, as follows:

T

c logp(f, c) (33) t=1 where the model likelihoods p(f, 2c) are determined by evaluating Equation The log of the model likelihoods p(f, is taken to ensure no computational underflows occur due to very small likelihood values.

645957.doc

I

-29- As described in relation to Fig. 10, an adaptive technique is used to determine whether or not the best model score Sp is good enough to attribute the label of the class resulting in the best model score s^ to the segment. The best model score sp is defined as: max( c (34) c -The adaptive technique is based upon a distance measure Dij between object models of the classes to which the test segment may belong. Fig. 12 illustrates four classes and the inter-class distances Dij between each object model i andj. As the object models are made up of a mixture of Gaussians, the distance measure Dij is based on a weighted sum of the Mahalanobis distance between the mixtures of the models i andj as follows: M N

D

1 m=1 n=1 where M and N are the number of mixtures in class models i and j respectively, p' and pI are the mixture weights within each model, and is the Mahalanobis distance between mixture m of class i and mixture n of class j. The inter-class distances Dij may be predetermined from the set of labelled training data, and stored in memory 106.

The Mahalanobis distance between two mixtures is calculated as: i= j )T (Fi Y.j)-I (36) Because diagonal covariance matrices are used, the two covariance matrices I,' and 2- may simply be added in the manner shown. It is noted that the Mahalanobis 645957.doc distance A, is not strictly speaking a correct measure of distance between two distributions. When the distributions are the same, the distance should be zero. However, this is not the case for the Mahalanobis distance defined in Equation For this to be achieved, various constraints have to be placed on Equation This adds a huge amount of computation to the process and is not necessary for the classification, as a relative measures of class distances is all that is needed.

In order to decide whether the segment should be assigned the label of the class with the highest score, or labelled us "unknown", a confidence score is calculated. This is achieved by taking the difference of the top two model scores and and normalizing that difference by the distance measure Dpq between their class models p and q. This is based on the premise that an easily identifiable segment should be a lot closer to the model it belongs to than the next closest model. With further apart models, the model scores 9, should also be well separated before the segment is assigned the class label of the class with the highest score. More formally, the confidence score may be defined as: 1000 S P q (37) Dpq The additional constant of 1000 is used to bring the confidence score (D into a more sensible range. A threshold T is applied to the confidence score D. In the preferred implementation a threshold t of 5 is used. If the confidence score cI is equal or above the threshold T, then the segment is given the class label of the highest model score 9p, else the segment is given the label "unknown".

Referring again to Fig. 1, with each of the segments being assigned a label, the segment labels are provided as output in step 212, usually accompanied by the verified 645957.doc -31 transition-points that define the boundaries of the segments. The output may be directly to the display 114. Typically method 200 would form part of another application in which the labels are used for further processing. A number of such applications are now described as examples only, with the applications being illustrative and not restrictive.

A number of applications of the segmentation and classification method 200 will now be described. A key application in which the segmentation and classification method 200 may be used is in the automatic generation of object identity metadata from a large amount of offline audio data. In such an application method 200 processes the continuous stream of audio, and then generates and stores to a text-file the location information and identities of objects such as speech, music, a particular speaker, etc.

The object identity metadata may then be used in aiding subsequent fast retrieval of information from a large stream of data. For example, a user may which to locate where in the audio stream a particular speaker starts talking or where music starts, etc. In such a case, all segments with the object label of the event of interest may easily be retrieved by simply extracting all segments with a particular label.

Another application in which the segmentation and classification method 200 may be used is in a real-time automatic filming application for sensibly film a scene based on incoming metadata. Most of the metadata would be comprised from visual information. The segmentation and classification method 200 operates within a small buffer of audio obtained from an attached microphone to create additional metadata. The metadata from the audio is used as an additional input to control a camera (not illustrated).

For example. if a new noise type is detected, the camera is controlled to pan out.

645957.doc 32 Yet another application in which the segmentation and classification method 200 may be used is in the automatic recording of video in unattended systems. In unattended systems it is desirable to save storage space. This is achieved by only hard-recording events that could be of potential interest, which may be determined by the label assigned to audio segments. For example, a security system may only wish to record video when significant instances of sound activity or speech are detected.

In order for this to be implemented, given that the segmentation and classification method 200 is non-causal and must have access to data ahead in time of the point it is currently analysing, a memory buffer is used to hold audio data for a specified length of time, such as 30 seconds. The segmentation and classification method 200 then segments the audio data in the buffer in the normal manner, and classifies each segment by comparing the clip features f of the segment against known event models, such as speech or general noise models. Such an application can then determine whether these events have occurred within the audio in the buffer. If one or more of the segments is deemed to contain sound activity, then this and subsequent buffers of audio, with the respective video information, is written to the storage device 109. The writing to storage device 109 continues until either no more sound activity is detected in the current audio buffer or until a specified length of time has elapsed after the first sound event.

The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiment(s) being illustrative and not restrictive.

In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including" and not "consisting only 645957.doc -33 of'. Variations of the word comprising, such as "comprise" and "comprises" have corresponding meanings.

645957.doc

Claims

1. A method of extracting semantic information from an audio clip comprising a sequence of audio samples, said method comprising the steps of: forming a sequence of frames along said sequence of audio samples, each of said frames comprising a plurality of said audio samples; extracting for each frame at least one frame feature; and extracting said semantic information from moments of said at least one frame feature, said moments including moments of the order higher than 2.

2. The method as claimed in claim 1 wherein said semantic information from said moments of said at least one frame feature are extracted by the sub-steps of: estimating a characteristic function from said frame features; and extracting said semantic information from said characteristic function.

3. The method as claimed in claim 2, wherein said semantic information relates to collective characteristics of said moments.

4. The method as claimed in claim 2 or 3 wherein said characteristic function is summed under log-spaced bins in order to extract said semantic information from said characteristic function.

The method as claimed in claim 4 wherein said semantic information is refined by performing a dimension reduction transform on said characteristic function after said characteristic function is summed under log-spaced bins.

6. The method as claimed in claim 5 wherein said dimension reduction transform is the Discrete Cosine Transform.

7. The method as claimed in any one of claims 1 to 6 wherein said at least one frame feature includes at least one of: a bandwidth of audio samples of said frame; and

645957.doc 35 an energy value of audio samples of said frame.

8. An apparatus for extracting semantic information from an audio clip comprising a sequence of audio samples, said apparatus comprising: means for forming a sequence of frames along said sequence of audio samples, each of said frames comprising a plurality of said audio samples; means for extracting for each frame at least one frame feature; and means for extracting said semantic information from moments of said at least one frame feature, said moments including moments of the order higher than 2.

9. The apparatus as claimed in claim 8 wherein said semantic information from said moments of said at least one frame feature are extracted by: estimating a characteristic function from said frame features; and extracting said semantic information from said characteristic function. The apparatus as claimed in claim 9, wherein said semantic information relates to collective characteristics of said moments. 11. The apparatus as claimed in claim 9 or 10 wherein said characteristic function is summed under log-spaced bins in order to extract said semantic information from said characteristic function. 12. The apparatus as claimed in claim 11 wherein said semantic information is refined by performing a dimension reduction transform on said characteristic function after said characteristic function is summed under log-spaced bins. 13. The apparatus as claimed in claim 12 wherein said dimension reduction transform is the Discrete Cosine Transform. 14. The apparatus as claimed in any one of claims 8 to 13 wherein said at least one frame feature includes at least one of: a bandwidth of audio samples of said frame; and 645957.doc 36 an energy value of audio samples of said frame. A program stored on a memory medium for extracting semantic information from an audio clip comprising a sequence of audio samples, said program comprising: code for forming a sequence of frames along said sequence of audio samples, each of said frames comprising a plurality of said audio samples; code for extracting for each frame at least one frame feature; and code for extracting said semantic information from moments of said at least one frame feature, said moments including moments of the order higher than 2. 16. The program as claimed in claim 15 wherein said semantic information from said moments of said at least one frame feature are extracted by: estimating a characteristic function from said frame features; and extracting said semantic information from said characteristic function. 17. The program as claimed in claim 16, wherein said semantic information relates to collective characteristics of said moments. 18. The program as claimed in claim 16 or 17 wherein said characteristic function is summed under log-spaced bins in order to extract said semantic information from said characteristic function. 19. The program as claimed in claim 18 wherein said semantic information is refined by performing a dimension reduction transform on said characteristic function after said characteristic function is summed under log-spaced bins. The program as claimed in claim 19 wherein said dimension reduction transform is the Discrete Cosine Transform. 21. The program as claimed in any one of claims 15 to 20 wherein said at least one frame feature includes at least one of: a bandwidth of audio samples of said frame; and 645957.doc -37- an energy value of audio samples of said frame. 22. A method of extracting semantic information from an audio clip, said method being substantially as described herein with reference to the accompanying drawings. 23. An apparatus for extracting semantic information from an audio clip, said apparatus being substantially as described herein with reference to the accompanying drawings. 24. A program stored on a memory medium for extracting semantic information from an audio clip, said program being substantially as described herein with reference to the accompanying drawings. Dated this 12th day of September 2003 CANON KABUSHIKI KAISHA Patent Attorneys for the Applicant SPRUSON&FERGUSON 645957.doc