AU2005100274A4 - Method and apparatus for analyising sound - Google Patents


Info

Publication number
AU2005100274A4
AU2005100274A4
Authority
AU
Australia
Prior art keywords
sample
sound
coefficients
statistical
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
AU2005100274A
Inventor
Dinesh Kant
Neil Mclachlan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kapur Ruchika Ms
Original Assignee
Kapur Ruchika Ms
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2004901712A external-priority patent/AU2004901712A0/en
Application filed by Kapur Ruchika Ms filed Critical Kapur Ruchika Ms
Priority to AU2005100274A priority Critical patent/AU2005100274A4/en
Application granted granted Critical
Publication of AU2005100274A4 publication Critical patent/AU2005100274A4/en
Anticipated expiration legal-status Critical
Expired legal-status Critical Current


Landscapes

  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Description

METHOD AND APPARATUS FOR ANALYISING SOUND

FIELD OF INVENTION

The present invention relates to the field of sound analysis. In one form, the invention relates to the analysis of sound to determine biometrics for security purposes. In other forms, the present invention relates to sound analysis in applications such as environmental and traffic sound monitoring, surveillance, maintenance of vehicles (cars, planes, trams, etc.), maintenance of static machinery (electric power transformers), and maintenance of moving machinery such as motors, turbines, conveyors, and the like.
It will be convenient to hereinafter describe the invention in relation to biometrics and security, however it should be appreciated that the present invention is not limited to that use only.
BACKGROUND ART

The computer classification of sound has been researched by parties interested in environmental acoustic logging, automated logging of music databases according to instrumentation, machine monitoring, military and other surveillance, and speaker recognition systems.
The present inventor has identified that the most common approaches have been to apply statistical distance measures to band-limited or STFT (Short Time Fourier Transform) spectral data or to cepstral coefficients. This approach has a number of disadvantages, such as the use of a static spectrum for a non-stationary signal, and the assumption that the differences between signals are located in the higher-energy content of the signal.
The inventor has also identified that heuristic methods have been attempted for environmental sound analysis. To date, few commercially viable products exist for these applications apart from relatively simple examples of machine monitoring, which suffer from the disadvantage that they require extensive examples and supervision, and require the source properties to be reasonably well known.
The inventor has also identified that speech (phoneme) recognition is another field in which many products are commercially available. In this field the range of sounds is restricted to the phonemes of a given language. These techniques use a combination of the spectral and temporal properties of the signal, and STFT and wavelet analysis are commonly used to generate the time-varying spectral data for classification by neural nets. Under good acoustic conditions these techniques can achieve accuracies of greater than 95% where a single person is speaking, but they tend to have relatively lower accuracy when multiple persons are speaking. Thus, the practical implementation of this technology in an outdoor or acoustically complex environment is considered to be very limited.
The inventors have also identified that many prior art systems take a model-based approach to speech analysis. These methods and systems base the analysis on building a mathematical model of the source and then determining the properties of the source that produce the sound. Examples are found in automobile noise emission studies, in human-voice-based Structured Audio (MPEG-2 and MPEG-4), and in the Linear Predictive source modelling used for mobile telephone compression. The model-based approach suffers from the requirement that the model of the source be deterministic: it is suitable for sound approximation and provides good compression, but it is suitable neither for sound source authentication nor for sound classification.
Any discussion of documents, devices, acts or knowledge in this specification is included to explain the context of the invention. It should not be taken as an admission that any of the material forms a part of the prior art base or the common general knowledge in the relevant art in Australia or elsewhere on or before the priority date of the disclosure and claims herein.
An object of the present invention is to provide an improved sound classification and recognition method and system.
A further object of the present invention is to alleviate at least one disadvantage associated with the prior art.
SUMMARY OF INVENTION

The present invention provides, in one aspect, a method of and apparatus for identifying the source of sound or other similar signals by the use of statistical descriptions of the time-frequency analysis coefficients of the signal or a section of the signal.
The present invention provides a method of and apparatus for analysing sound, including the steps of providing a sample of sound, applying time-frequency analysis methods such as wavelet transforms or the STFT, statistically analysing the time-frequency coefficients, and defining the classes of sounds being analysed using neural networks or statistical methods for the purpose of verifying the sound source. If required, the method normalises the amplitude of the sound samples before undertaking the time-frequency analysis. The statistical analysis of the time-frequency coefficients provides coefficients related to the time-averaged energy content within each frequency band and the range of energy fluctuation over the length of the sample within each frequency band.
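As a hedged illustration of the steps just described (optional normalisation, a time-frequency decomposition, then statistics per band), the following pure-Python sketch uses a Haar wavelet decomposition. The function names and the choice of the Haar basis are illustrative assumptions, not taken from the patent.

```python
# Hypothetical sketch of the feature-extraction step: a multi-level Haar
# wavelet decomposition followed by per-band statistics. Names such as
# haar_bands and band_features are illustrative, not from the patent.
from statistics import mean, pvariance

def normalise(sample):
    """Amplitude-normalise the sample, as the method optionally does."""
    peak = max(abs(x) for x in sample) or 1.0
    return [x / peak for x in sample]

def haar_bands(sample, levels):
    """Return `levels` Haar detail bands (highest frequency first) plus the
    final approximation band."""
    bands, approx = [], list(sample)
    for _ in range(levels):
        detail = [(approx[i] - approx[i + 1]) / 2 for i in range(0, len(approx) - 1, 2)]
        approx = [(approx[i] + approx[i + 1]) / 2 for i in range(0, len(approx) - 1, 2)]
        bands.append(detail)
    bands.append(approx)
    return bands

def band_features(sample, levels=4):
    """For each band, compute the time-averaged energy (mean of squared
    coefficients) and its fluctuation (variance/mean), mirroring the two
    statistical measures the patent describes."""
    feats = []
    for band in haar_bands(normalise(sample), levels):
        energy = [c * c for c in band]
        m = mean(energy)
        feats.append((m, pvariance(energy) / m if m else 0.0))
    return feats
```

In this sketch the feature vector, not the sound, is what would be passed on to the classifier and eventually stored.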
The present invention also provides a method of and apparatus for analysing sound, including the steps of providing a sample of sound, statistically analysing the sample, and classifying the analysed sample according to a band or bands of time-frequency coefficients that lie within a predetermined magnitude range. The selection of both the time-frequency coefficients used for the classification and the magnitude range in which these coefficients will be found for particular sound classes are iteratively determined during training of the system.
The present invention also provides a method of and apparatus for determining biometrics, including the steps of analysing or identifying a sound sample as disclosed herein, comparing the analysed sample against a reference sample, and verifying whether or not the sound sample is substantially the same as the reference sample.
Other aspects and preferred aspects are disclosed in the specification and/or defined in the appended claims, forming a part of the description of the invention.
In essence, the present invention stems from a hybrid approach incorporating statistical analysis of wavelet coefficients and neural networks.
The present invention, in one aspect, uses an iterative technique and incorporates a neural network for classification of the statistical properties of the signal in a combination of time and frequency domains using multi-scale wavelet coefficients.
The uniqueness of the technique is that it does not model the source and look for identification features, but instead develops a database of the source. A specific sound of a word (for example) is distinguished against a database of the same word originating from different sources. What must be stored to identify the source is not information about the source or the sound, but simply the weights of the neural network after it is trained for the specific user.
The present invention has been found to result in a number of advantages, such as:

Low complexity of source data: the present invention requires a comparison of the statistical properties and the weights of the neural network of the source, and requires little if any information about the speaker's voice sample or about the speech itself. The stored file can be as small as 2 KB, making it convenient to load onto smart cards or even magnetic stripes. This makes the present invention well suited to security at all levels, as it is considered virtually impossible to reverse-engineer the voice or the text of the speaker from the weights of a network that classifies the statistical properties of the wavelet coefficients of the recording.
The present invention will identify the occurrence of sounds and their temporal location, even in mixed and complex sounds.
The present invention may use the complete word or group of words uttered (such as a password or the name of the speaker). This makes it very easy for the present invention to be trained for a new user, with or without the effort (or knowledge) of the speaker, over the telephone or similar.
It stores only the statistical properties and not the signal, thus ensuring substantially improved security and privacy.
It classifies the speaker against the set of known (pre-recorded) signals.
The present invention is adjustable to the set of sounds to be classified both through parameters used in the sound feature extraction and through the use of a trained neural net signal classifier.
The present invention can also be used to classify groups of sounds as widely varying as 'noises' such as wind and engine noise as well as information rich sounds like speech and music, or to classify relatively similar sounds such as groups of individual human voices speaking the same word.
By 'neural network', we mean a supervised neural network. Examples include back-propagation neural networks.
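Since back-propagation networks are named as an example, a minimal supervised network of that kind can be sketched as follows in pure Python. The layer sizes, learning rate, and training task here are illustrative assumptions; the point is that, as the text above notes, only the trained weights need to be stored.

```python
# A minimal one-hidden-layer sigmoid network trained by back-propagation,
# of the general kind the specification refers to. All hyperparameters are
# illustrative; this is a sketch, not the patent's implementation.
import math, random

def train_backprop(data, n_in, n_hid, epochs=3000, lr=0.5, seed=1):
    """Train on (inputs, target) pairs with target in {0, 1}; return the
    weights, which are the only artefact stored for a given source."""
    rnd = random.Random(seed)
    w1 = [[rnd.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]
    w2 = [rnd.uniform(-1, 1) for _ in range(n_hid + 1)]
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    for _ in range(epochs):
        for x, t in data:
            xb = x + [1.0]                                # input plus bias
            h = [sig(sum(w * v for w, v in zip(row, xb))) for row in w1]
            hb = h + [1.0]                                # hidden plus bias
            y = sig(sum(w * v for w, v in zip(w2, hb)))
            d_out = (y - t) * y * (1 - y)                 # output delta
            d_hid = [d_out * w2[j] * h[j] * (1 - h[j]) for j in range(n_hid)]
            for j in range(n_hid + 1):
                w2[j] -= lr * d_out * hb[j]
            for j in range(n_hid):
                for i in range(n_in + 1):
                    w1[j][i] -= lr * d_hid[j] * xb[i]
    return w1, w2

def predict(w1, w2, x):
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    h = [sig(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w1] + [1.0]
    return sig(sum(w * v for w, v in zip(w2, h)))
```

In the patent's setting, the inputs would be the per-band statistical features rather than the toy vectors used here.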
Further scope of applicability of the present invention will become apparent from the detailed description given hereinafter. However, it should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS

Further disclosure, objects, advantages and aspects of the present application may be better understood by those skilled in the relevant art by reference to the following description of preferred embodiments, taken in conjunction with the accompanying drawing, which is given by way of illustration only and thus is not limitative of the present invention, and in which:

Figure 1 illustrates an embodiment of the present invention.
DETAILED DESCRIPTION

The present invention is a sound classification method and system that applies statistical measures to wavelet coefficients of a sound sample for use in a neural net classifier. The system has to be trained to classify a finite set of sounds.
The present invention is distinguished from existing speech recognition software by the particular statistical measures used. The present invention is adjustable to the set of sounds to be classified both through parameters used in the sound feature extraction and through the use of a trained neural net signal classifier. The present invention can be used to classify groups of sounds as widely varying as 'noises' such as wind and engine noise to information rich sounds like speech and music, or to classify relatively similar sounds such as groups of individual human voices speaking the same word.
Turning to Figure 1, a representation of the present invention is illustrated.
A sound sample 1 is provided as an input for analysis. The sample is then iteratively analysed 2. Studies on human recognition of environmental sounds have revealed that sounds are most distinguishable by a combination of spectral (frequency) and temporal features, based on the spectral energies, the time-varying behaviour of these spectral components, and the statistical properties of the signals. In the present invention, it has also been identified that the difference between two audio signals often lies in a component of the signal that has a relatively small energy content. The present invention seeks to exploit this with the multi-resolution time-frequency attributes of wavelet transforms, extracting statistical information about the distribution of energy across frequency bands and across the time of the sound sample. Lower frequency regions of the sound are well resolved in frequency but not in time, while higher frequency regions are well resolved in time but not in frequency.
The output of the iterative analysis is provided to a neural network for classification 3. The neural network provides a band threshold for the coefficients, with a lower and an upper bound, determined by maximising the statistical distance between signals of different origin/source. Signals are classified to the source based on the band that best defines the values of the coefficients.
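The band-threshold idea above might be sketched as follows. Representing each class as one [lower, upper] interval per feature, and widening the interval by a fixed margin in place of the iterative tuning the patent describes, are assumptions made purely for illustration; the sketch also assumes non-negative (energy-like) features.

```python
# Hedged sketch of band-threshold classification: each class keeps a
# [lower, upper] band per feature, and a sample is assigned to the class
# whose bands contain the most of its feature values. The margin stands in
# for the iteratively determined thresholds; features are assumed >= 0.

def learn_bands(training_features, margin=0.1):
    """training_features maps class label -> list of feature vectors.
    Derive a band [min*(1-margin), max*(1+margin)] per feature per class."""
    bands = {}
    for label, examples in training_features.items():
        per_feat = list(zip(*examples))
        bands[label] = [(min(v) * (1 - margin), max(v) * (1 + margin)) for v in per_feat]
    return bands

def classify(bands, features):
    """Return the class whose bands contain the most feature values."""
    def score(b):
        return sum(lo <= f <= hi for (lo, hi), f in zip(b, features))
    return max(bands, key=lambda label: score(bands[label]))
```

Narrowing the margin tightens the classification band, which is the flexibility the next paragraph describes.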
The system provides the flexibility to select the narrowness of the signal classification band so that it may be used to determine an exact match or a wide match, depending on the application. The system does not need to store the entire signal but simply the values of the features mentioned above.
The statistical measures used are the mean of the coefficients and the variance/mean of the coefficients for each wavelet band over the time of the sample. The mean of the coefficients is related to the time-averaged energy content in each band, while the variance/mean is related to the range of energy fluctuation over the length of the sample. In one embodiment, 12 wavelet bands are used to cover the frequency range of 11–22,000 Hz. Extra low-frequency bands can be included to capture slower variations of the sound envelope.
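To see how roughly a dozen dyadic wavelet bands can span 11 Hz to 22,000 Hz, consider the band edges implied by successively halving the spectrum. The 44.1 kHz sample rate assumed below is not stated in the patent; it is chosen only because it makes the quoted range come out naturally.

```python
# Illustrative only: dyadic (octave) band edges of a wavelet decomposition.
# The 44.1 kHz sample rate is an assumption, not taken from the patent.

def dyadic_band_edges(sample_rate, levels):
    """Band k spans [nyquist/2**k, nyquist/2**(k-1)] Hz, highest band first."""
    nyq = sample_rate / 2.0
    return [(nyq / 2 ** k, nyq / 2 ** (k - 1)) for k in range(1, levels + 1)]
```

At 44.1 kHz, eleven successive halvings already reach from 22,050 Hz down to about 11 Hz, consistent with the range quoted above.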
The classified sample 6 is then compared 4 with a previously known sample 8 provided from a library 5 of various sounds and/or references.
Obviously the references or contents of the library will vary depending on the particular use of the present invention.
Certain criteria can be used in identifying or determining a 'match' between the classified sample 6 and the library sample 8. Such matching criteria may include setting a tolerance value in determining the requirement for an absolute 'match' (allowing for a certain tolerance). The tolerance value may be preset or determined according to the application of the present invention.
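The tolerance-based matching step might look like the following sketch. Applying a single relative tolerance feature-by-feature is an assumption for illustration; the patent leaves the criterion open.

```python
# Sketch of tolerance-based matching between the classified sample's
# feature vector and a library reference. The single relative tolerance is
# an illustrative assumption; the patent leaves the criterion application-
# dependent (tight tolerance for an exact match, loose for a wide match).

def is_match(sample_feats, reference_feats, tolerance=0.05):
    """True if every feature is within `tolerance` (relative) of the reference."""
    return all(
        abs(s - r) <= tolerance * max(abs(r), 1e-12)
        for s, r in zip(sample_feats, reference_feats)
    )
```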
If there is no 'match', an output 9 is given, whereas if a 'match' is found, an output 10 is given.
The present sound classification system has applications in biometrics, security monitoring and surveillance, remote machine monitoring, and environmental acoustic monitoring.
This system is ideally suited to biometrics applications, as it does not require the storage of the original sound, so the system naturally encodes the data. It also uses a very small amount of memory and can thus be stored on a magnetic swipe card or similar card and used to gain entry into buildings, data or systems. This enables a multi-tier security system: once the card is passed through the reader, the person is prompted to speak his/her name and/or a password into a microphone for computer recognition. This adds a measure of the identity of the person bearing the card.
For security monitoring this system enables video monitoring to be alerted and guided by the occurrence of specific sounds such as breaking glass or human distress calls. This will be important in cases where many locations are being monitored simultaneously and rapid responses are required. Surveillance methods can be partially automated by the use of this system to search and log audio-tapes generated over long surveillance periods. Remote machine monitoring can be facilitated by the improved discriminatory power of this system to detect the presence of, or changes to indicator sounds in the presence of other sounds.
The system has applications in environmental and traffic sound monitoring.
The system is flexible, and the user needs to train the system for the sounds that are to be monitored. Based on the thresholding and the use of statistical features of the time and frequency components, the system will identify the combination of the features mentioned earlier to determine the occurrence of sounds and their temporal location, even in mixed and complex sounds.
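Localising a trained sound in time within a longer recording can be sketched as a sliding-window search. The window and hop sizes, and the abstracted `extract` and `matches` callables standing in for the feature extraction and classification stages, are illustrative assumptions.

```python
# Sketch of temporal localisation: slide a window over the signal, extract
# features per window, and report the windows whose features match a
# trained target. `extract` and `matches` stand in for the earlier stages.

def locate_events(signal, window, hop, extract, matches):
    """Return (start, end) sample ranges where `matches(extract(chunk))` holds."""
    hits = []
    for start in range(0, len(signal) - window + 1, hop):
        chunk = signal[start:start + window]
        if matches(extract(chunk)):
            hits.append((start, start + window))
    return hits
```

In a real deployment the hits would trigger the outputs described earlier (for example, alerting a video monitor to a breaking-glass event).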
Other applications of the technology are in the field of vehicular maintenance (cars, planes, trams, etc.), where the sound of the engine and the body is often used by mechanics to identify engine and body problems.
These technologies provide a means for identifying problems in the vehicle and generating alerts for preventive maintenance. The system can also give early warning for static machinery maintenance (electric power transformers) and for moving machinery such as motors, turbines, conveyors, etc.
The present invention is suitable for use in applications that can be broadly put in three categories:

1. Biometrics, for verifying the identity of a person: confirming the identity of an individual over the telephone (such as for telephone banking).
Integrated with smart cards for the purpose of entry into an office or other such space or data or network.
Integrated with other biometrics technology for multi-tier security.
For accessing a computer in place of, or in conjunction with, passwords.
For ensuring security of devices such as a mobile phone.
For automobile access and security.
2. Audio monitoring: of buildings, for identifying the time when certain predefined audio events occurred, such as breaking of glass, voices of people, etc.
For telephone surveillance, for automatic identification of certain audio events such as the voice of an individual in a conversation.
3. Environmental and noise monitoring: for monitoring street noise for better city noise management.
For monitoring audio events where there is noise-related litigation between two parties or groups.
For street barrier design and monitoring of road, aircraft and other transport noise.
For machine noise monitoring, to predict and thus prevent possible engine or machine failure. This is based on the use of machine sound as a powerful and early indicator of machine defects.
While this invention has been described in connection with specific embodiments thereof, it will be understood that it is capable of further modification(s). This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention, and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains and as may be applied to the essential features hereinbefore set forth.
As the present invention may be embodied in several forms without departing from the spirit of the essential characteristics of the invention, it should be understood that the above described embodiments are not to limit the present invention unless otherwise specified, but rather should be construed broadly within the spirit and scope of the invention as defined in the appended claims.
Various modifications and equivalent arrangements are intended to be included within the spirit and scope of the invention and appended claims. Therefore, the specific embodiments are to be understood as illustrative of the many ways in which the principles of the present invention may be practised. In the following claims, means-plus-function clauses are intended to cover the structures described herein as performing the defined function, including not only structural equivalents but also equivalent structures. For example, although a nail and a screw may not be structural equivalents in that a nail employs a cylindrical surface to secure wooden parts together, whereas a screw employs a helical surface, in the environment of fastening wooden parts a nail and a screw are equivalent structures.
"Comprises/comprising" when used in this specification is taken to specify the presence of stated features, integers, steps or components but does not preclude the presence or addition of one or more other features, integers, steps, components or groups thereof."

Claims (15)

1. A method of classifying sounds or other similar signals by the use of statistical descriptions of the time-frequency analysis coefficients of the signal or a section of the signal.
2. A method of analysing sound, including the steps of:
a. providing a sample of sound;
b. undertaking a time-frequency analysis of the sample;
c. statistically analysing the time-frequency coefficients;
d. classifying the analysed sample using statistical coefficients; and
e. classifying the analysed sample according to band thresholds using iteratively determined coefficients.
3. A method as claimed in claim 1 or 2, wherein the statistical analysis includes providing information about the distribution of energy across frequency bands.

4. A method as claimed in claim 1, 2 or 3, wherein the statistical analysis includes providing information about the distribution of energy across the time of the sample.

5. A method as claimed in claim 1 or 2, wherein the mean coefficients are related to the time-averaged energy content in each band.
6. A method as claimed in claim 1 or 2, wherein the variance/mean is related to the range of energy fluctuation over the length of the sample or section of the sample.
7. A method as claimed in claim 1 or 2, wherein the classifying is performed by the use of neural networks or statistical distance classifiers trained using the statistical properties (claims 5 and 6) of samples of each sound class.
8. A method as claimed in claim 1 or 2, further including the step of comparing the unknown sample to the necessary information of a sample signal of the source used during training of the system.

9. A method as claimed in claim 1 or 2, where the signal from a source is not required to be saved in the database, but only the statistical descriptors (such as mean and standard deviation) of some or all scales of the wavelet coefficients of the signal need to be saved.

10. A method as claimed in claim 1 or 2, where only the weights of the neural network trained for the specific source need to be saved.

11. A method of analysing sound, including the steps of: providing a sample of sound, determining at least one frequency band attributable to the sample, and analysing the sample to provide coefficients related to the time-averaged energy content in each band and the range of energy fluctuation over the length of the sample.

12. A method of distinguishing the spoken password by the authentic user according to any one of claims 1 to 11, comparing the analysed sample properties against a reference sample.
13. A method as claimed in claim 12, wherein the biometrics are used for security purposes.
14. A method as claimed in claim 12, wherein the biometric data is stored on smart cards or similar technology for real-time comparison with biometric data obtained from the card bearer.

15. A method as claimed in claim 12, where the method is used to identify the source of sound or other similar signal in the presence of other sounds or other similar signals based on the method described in claim 1.
16. A method as claimed in claim 1, where the method is used to identify the source of sound or similar signal for monitoring environmental noise or audio monitoring or other similar application.
17. A method as claimed in claim 1, where the identification of sound is used to identify possible disorders of machinery such as electrical transformers, automobiles and aircraft.
18. A method as claimed in claim 1, where identified sounds are used to identify the possible speaker in telephone monitoring for security applications.
19. Apparatus adapted to analyse sound, said apparatus including: processor means adapted to operate in accordance with a predetermined instruction set, said apparatus, in conjunction with said instruction set, being adapted to perform the method as claimed in any one of claims 1 to 18.

20. A computer program product including: a computer usable medium having computer readable program code and computer readable system code embodied on said medium for operation in association with a data processing system, said computer program product including computer readable code within said computer usable medium for analysing sound according to any one of claims 1 to 18.
21. A method as herein disclosed.
22. An apparatus and/or device as herein disclosed.

DATED THIS 31st day of March 2005
AU2005100274A 2004-03-31 2005-04-01 Method and apparatus for analyising sound Expired AU2005100274A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2005100274A AU2005100274A4 (en) 2004-03-31 2005-04-01 Method and apparatus for analyising sound

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
AU2004901712A AU2004901712A0 (en) 2004-03-31 Method and apparatus for analyising sound
AU2004901712 2004-03-31
AU2005100274A AU2005100274A4 (en) 2004-03-31 2005-04-01 Method and apparatus for analyising sound

Publications (1)

Publication Number Publication Date
AU2005100274A4 true AU2005100274A4 (en) 2005-06-23

Family

ID=34701663

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2005100274A Expired AU2005100274A4 (en) 2004-03-31 2005-04-01 Method and apparatus for analyising sound

Country Status (1)

Country Link
AU (1) AU2005100274A4 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7986790B2 (en) 2006-03-14 2011-07-26 Starkey Laboratories, Inc. System for evaluating hearing assistance device settings using detected sound environment
US8068627B2 (en) 2006-03-14 2011-11-29 Starkey Laboratories, Inc. System for automatic reception enhancement of hearing assistance devices
US8494193B2 (en) 2006-03-14 2013-07-23 Starkey Laboratories, Inc. Environment detection and adaptation in hearing assistance devices
US9264822B2 (en) 2006-03-14 2016-02-16 Starkey Laboratories, Inc. System for automatic reception enhancement of hearing assistance devices
US8958586B2 (en) 2012-12-21 2015-02-17 Starkey Laboratories, Inc. Sound environment classification by coordinated sensing using hearing assistance devices
US9584930B2 (en) 2012-12-21 2017-02-28 Starkey Laboratories, Inc. Sound environment classification by coordinated sensing using hearing assistance devices
CN103971908A (en) * 2014-05-06 2014-08-06 国家电网公司 Transformer noise suppression method
CN111460966A (en) * 2020-03-27 2020-07-28 中国地质大学(武汉) Hyperspectral remote sensing image classification method based on metric learning and neighbor enhancement
CN111460966B (en) * 2020-03-27 2024-02-02 中国地质大学(武汉) Hyperspectral remote sensing image classification method based on metric learning and neighbor enhancement
CN113053412A (en) * 2021-02-04 2021-06-29 国网江苏省电力有限公司检修分公司 Sound-based transformer fault identification method
CN113053412B (en) * 2021-02-04 2023-12-22 国网江苏省电力有限公司检修分公司 Transformer fault identification method based on sound

Similar Documents

Publication Publication Date Title
US20070299671A1 (en) Method and apparatus for analysing sound- converting sound into information
Singh et al. Statistical Analysis of Lower and Raised Pitch Voice Signal and Its Efficiency Calculation.
Singh et al. Multimedia utilization of non-computerized disguised voice and acoustic similarity measurement
US8036884B2 (en) Identification of the presence of speech in digital audio data
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
AU2005100274A4 (en) Method and apparatus for analyising sound
Ntalampiras et al. An adaptive framework for acoustic monitoring of potential hazards
Wu et al. Identification of electronic disguised voices
Singh et al. Acoustic comparison of electronics disguised voice using different semitones
Sebastian et al. FASR: Effect of voice disguise
Paul et al. Countermeasure to handle replay attacks in practical speaker verification systems
CN108831456B (en) Method, device and system for marking video through voice recognition
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
Gupta et al. Gender-based speaker recognition from speech signals using GMM model
Yudin et al. Speaker’s voice recognition methods in high-level interference conditions
Shareef et al. Gender voice classification with huge accuracy rate
Gupta et al. Speech Recognition Using Correlation Technique
Ranjan Speaker Recognition and Performance Comparison based on Machine Learning
Kekre et al. Speaker identification using row mean vector of spectrogram
Limkar et al. Speaker Recognition using VQ and DTW
Kanrar Robust threshold selection for environment specific voice in speaker recognition
Ahmad et al. The impact of low-pass filter in speaker identification
CN113838469A (en) Identity recognition method, system and storage medium
Silveira et al. Convolutive ICA-based forensic speaker identification using mel frequency cepstral coefficients and gaussian mixture models
Markowitz The many roles of speaker classification in speaker verification and identification

Legal Events

Date Code Title Description
NB Applications allowed - extensions of time section 223(2)

Free format text: THE TIME IN WHICH TO ASSOCIATE WITH A COMPLETE APPLICATION HAS BEEN EXTENDED TO 01 APR 2005.

FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry