EP1062659A1

EP1062659A1 - Method and device for processing a sound signal

Info

Publication number: EP1062659A1
Application number: EP99917771A
Authority: EP
Inventors: Tobias Schneider
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 1998-03-19
Filing date: 1999-03-08
Publication date: 2000-12-27
Anticipated expiration: 2019-03-08
Also published as: US6804646B1; DE59900797D1; WO1999048084A1; EP1062659B1; JP4276781B2; JP2002507775A

Abstract

The invention relates to a method and a device for processing a sound signal comprising a useful signal and an interfering signal. The sound signal is transformed into the frequency domain, and the course of at least one frequency over a given time period is represented by an envelope. The envelope is divided into segments and a maximum is determined for each segment. The lowest maximum is weighted by a factor and subtracted from the sound signal. A minimum can also be taken into account in order to reduce the interfering signal.

Description


Method and device for processing a sound signal

The invention relates to a method and a device for processing a sound signal.

A system for speech recognition is known from [1]. That publication also gives a basic introduction to the components of a speech recognition system and to important techniques commonly used in speech recognition.

A wavelet transformation is known from [2]. A wavelet transformation is preferably carried out in a plurality of transformation stages, each stage dividing a pattern into a high-pass component and a low-pass component. Each component preferably has a reduced resolution compared to the pattern (technical term: subsampling, i.e. a reduced sampling rate and thus a reduced resolution). The pattern can be reconstructed from the high-pass and low-pass components. This is ensured in particular by the special form of the transformation filters used. The wavelet transformation can be one-dimensional, two-dimensional or multidimensional.
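A single transformation stage of this kind can be sketched with the Haar wavelet, which is assumed here only because it is the simplest filter pair with the reconstruction property; reference [2] covers the general case. The function names `haar_stage` and `haar_inverse` and the sample values are illustrative.

```python
import numpy as np

def haar_stage(x):
    """One analysis stage: split x (even length) into low-pass and high-pass parts."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)    # low-pass component, subsampled by 2
    high = (even - odd) / np.sqrt(2.0)   # high-pass component, subsampled by 2
    return low, high

def haar_inverse(low, high):
    """Rebuild the original pattern from its two half-resolution components."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2.0)
    x[1::2] = (low - high) / np.sqrt(2.0)
    return x

pattern = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
low, high = haar_stage(pattern)
restored = haar_inverse(low, high)
```

Each component has half as many samples as the pattern, and `restored` equals `pattern` up to floating-point rounding.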

A sound signal comprises a useful signal and an interference signal, the strength of the interference signal depending on the environment. An essential requirement for further processing of the sound signal is to separate the useful signal from the interference signal.

Methods are known which suppress different regions of the frequency spectrum of the sound signal to a greater or lesser degree. A disadvantage of these methods is that the dynamic development of the interference signal is not taken into account.

The object of the invention is to provide a method and a device for processing a sound signal in such a way that the disadvantage described above is avoided.

This object is achieved according to the features of the independent claims.

When a time signal is transformed into the frequency domain, e.g. by means of a Fast Fourier Transform (FFT), a section of the time signal comprising a predetermined number of samples is transformed into the frequency domain. This is done for different points in time, so that, as time progresses, the individual frequencies take on different values depending on the section of the time signal transformed in each case. In this way, the course of a frequency over time can be represented.
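As a minimal sketch of this step, the following code transforms successive sections of a time signal with an FFT and collects the magnitude of one frequency bin over time. The frame length, hop size, Hann window, sampling rate and the function name `frequency_course` are assumptions made for illustration; the text does not prescribe them.

```python
import numpy as np

def frequency_course(x, frame_len, hop, bin_index):
    """Magnitude of one FFT bin for successive sections (frames) of the signal x."""
    window = np.hanning(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([
        np.abs(np.fft.rfft(x[s:s + frame_len] * window))[bin_index]
        for s in starts
    ])

fs = 8000                                  # assumed sampling rate in Hz
t = np.arange(fs) / fs                     # one second of samples
x = np.sin(2 * np.pi * 1000 * t) * t       # 1 kHz tone that grows louder
# With 256 samples per frame at 8 kHz, bin 32 is centred on 1000 Hz.
course = frequency_course(x, frame_len=256, hop=128, bin_index=32)
```

`course` holds one value per transformed section, so it is the course of the 1 kHz component over time.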

In addition to the FFT, a wavelet transformation or any other transformation can be used to map the time domain into the frequency domain.

A method for processing a sound signal is specified in which the sound signal is transformed into the frequency domain. For at least one predetermined frequency, an envelope of the transformed sound signal is determined over time. The envelope is subdivided into a set of segments, each of which has a predetermined duration. A maximum of the envelope is determined for each segment of the set of segments. The smallest maximum is determined for a predetermined number of segments of the set of segments. The sound signal is processed by subtracting the smallest maximum, weighted by a factor, from the sound signal.

An advantage is that the smallest maximum, determined over a predetermined duration for the frequency whose envelope is tracked over time, preferably captures the interference signal of a sound signal that comprises a useful signal and an interference signal. This is particularly evident when the sound signal is naturally spoken language. Speech comprises several words and, even when spoken fluently, contains places with spectral minima (in particular pauses between the individual words). At such spectral minima the useful signal is almost absent, whereas the interference signal dominates.
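The segment-wise procedure can be sketched for a single frequency as follows. The envelope values, the function names and the clamping of negative results to zero are assumptions for illustration; the text only specifies the subtraction of the weighted smallest maximum.

```python
import numpy as np

def smallest_segment_maximum(envelope, segment_len):
    """Split the envelope into equal segments and return the smallest segment maximum."""
    env = np.asarray(envelope, dtype=float)
    n_segments = len(env) // segment_len
    segments = env[:n_segments * segment_len].reshape(n_segments, segment_len)
    return segments.max(axis=1).min()

def suppress(envelope, segment_len, factor=1.0):
    """Subtract the weighted smallest maximum (clamping at zero is an assumption)."""
    noise = factor * smallest_segment_maximum(envelope, segment_len)
    return np.maximum(np.asarray(envelope, dtype=float) - noise, 0.0)

# Toy envelope: a noise floor near 1.0 with speech in the second and fourth segment.
envelope = np.array([1.0, 0.9, 1.1, 1.0,    # pause
                     5.0, 6.0, 4.0, 1.2,    # word
                     0.8, 1.0, 1.1, 0.9,    # pause
                     7.0, 3.0, 1.0, 1.1])   # word
noise_level = smallest_segment_maximum(envelope, segment_len=4)
cleaned = suppress(envelope, segment_len=4)
```

The segment maxima are 1.1, 6.0, 1.1 and 7.0, so the smallest maximum comes from a pause and tracks the noise floor rather than the speech.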

A further advantage is that the smallest maximum is determined over the number of segments, so that the dynamic course of the interference signal over time is captured. For example, the interference signal can be the engine noise in a motor vehicle that accelerates continuously over a period of time. The interference signal in the motor vehicle thus increases over time (during acceleration). Since the smallest maximum is determined anew over time for each such number of segments, the dynamic development of the interference signal is also taken into account.
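The effect of re-determining the smallest maximum over a limited number of recent segments can be illustrated with a rising noise floor. This is a sketch under assumed values; the window of three segments and the name `rolling_noise_estimate` are hypothetical.

```python
import numpy as np

def rolling_noise_estimate(envelope, segment_len, n_segments):
    """For each segment, the smallest maximum over the last n_segments segments."""
    env = np.asarray(envelope, dtype=float)
    count = len(env) // segment_len
    maxima = env[:count * segment_len].reshape(count, segment_len).max(axis=1)
    estimates = np.empty(count)
    for k in range(count):
        first = max(0, k - n_segments + 1)    # fewer segments available at the start
        estimates[k] = maxima[first:k + 1].min()
    return estimates

# Noise floor rising from 1 to 4 (e.g. an accelerating engine), without speech.
floor = np.linspace(1.0, 4.0, 40)
estimates = rolling_noise_estimate(floor, segment_len=4, n_segments=3)
```

Because old segments drop out of the window, the estimate follows the rising floor instead of staying at its initial level.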

A further development of the invention consists in determining a minimum for a further number of segments of the set of segments and in processing the sound signal by subtracting the smallest maximum, combined with the minimum, from the sound signal.

Including the minimum, which is determined for a further number of segments, proves to be extremely advantageous for adapting the interference signal that is to be subtracted from the sound signal in order to obtain the useful signal. Provided that no useful signal is currently present, the minimum characterizes the interference signal and is therefore subtracted from the sound signal.

In another development, the minimum and the smallest maximum are combined according to the relationship

$$\mathrm{max} \cdot a + b \cdot \mathrm{min}$$

where a denotes a first predetermined coefficient, b a second predetermined coefficient, max the smallest maximum and min the minimum.

The coefficients are to be specified in such a way that the interference signal is reduced favorably for the application.

An advantageous development consists in carrying out an update each time the number or the further number of segments has elapsed, so that an updated interference signal is subtracted from the sound signal.

As part of an additional development, it is advantageous if the sound signal is a speech signal, preferably naturally spoken speech.

It is also a further development that the processed sound signal is used for speech recognition. A clear useful signal, as free of interference as possible, is an advantageous prerequisite, especially for a speech recognition system. The clearer the useful signal, the better the speech recognition system recognizes the spoken language. The useful signal can also be output.

Furthermore, a device for processing a sound signal is specified which has a processor unit set up in such a way that the sound signal can be transformed into the frequency domain. For at least one predetermined frequency, an envelope of the transformed sound signal can be determined over time. The envelope can be subdivided into a set of segments, each of which has a predetermined duration. A maximum of the envelope is determined for each segment of the set of segments. The smallest maximum is determined for a number of segments of the set of segments. The sound signal is processed by subtracting the smallest maximum, weighted by a factor, from the sound signal.

A possible development of the device for processing a sound signal is that the processor unit is set up in such a way that a minimum is determined for a further number of segments and that the sound signal is processed by subtracting the smallest maximum, combined with the minimum, from the sound signal.

The device is particularly suitable for carrying out the method according to the invention or a further development described above.

Further developments also result from the dependent claims.

Exemplary embodiments of the invention are illustrated in more detail with the aid of the following figures.

The figures show:

FIG. 1 a block diagram showing steps of a method for processing a sound signal;

FIG. 2 a profile of an envelope f_H(t) of a frequency f_i over time t;

FIG. 3 a processor unit;

FIG. 4 a system for speech recognition.

FIG. 1 shows a block diagram with the steps of a method for processing a sound signal. Two variants for processing the sound signal are described below with reference to FIG. 1a and FIG. 1b.

In FIG. 1a the sound signal is transformed into the frequency domain (see step 101). This transformation is preferably a Fast Fourier Transform (FFT). The transformation is carried out at specific points in time t_i, so that the course of at least one frequency over the points in time t_i is obtained. In a step 102, an envelope is determined from this time-dependent course of the frequency. This is done for at least one frequency, in particular for several significant frequencies of the sound signal. In a step 103, the respective envelope is subdivided into a set of segments, which preferably have the same duration. A maximum of the envelope is determined for each segment (see step 104). The smallest maximum of a predetermined number of segments is determined in a step 105, and this smallest maximum, in particular weighted by a factor, is subtracted from the sound signal in order to reduce the interference signal and obtain the strongest possible useful signal (see step 106).

The smallest maximum is determined for a certain number of preceding segments. After a predefined time it is updated, taking into account the predefined number of segments preceding this new point in time. The smallest maximum for the envelope of the respective frequency is thus adjusted dynamically over time, in each case over the N preceding segments. An example that illustrates the need for dynamic adjustment of the interference signal is an accelerating vehicle in which the engine noise increases over time with the acceleration. The interference signal corresponding to the increasing engine noise is adapted by updating the smallest maximum at predetermined times for the envelopes of predetermined frequencies, in order to obtain a high-quality useful signal from the sound signal.

FIG. 1b shows the blocks 101, 102, 103, 104 and 105 corresponding to FIG. 1a. After step 103, in addition to the determination of the maximum (104 and 105), a minimum of the envelope of the frequency under examination is determined over a predetermined time (see step 107). Of particular interest is the (smallest) minimum of a predetermined number of preceding segments, i.e. the minimum of the envelope over a given duration looking back from the current point in time. Finally, in step 108, the smallest maximum and the minimum are combined to obtain the interference signal that is subtracted from the sound signal, thereby decisively improving the quality of the useful signal.

The minimum is combined with the smallest maximum according to the relationship

$$\mathrm{max} \cdot a + b \cdot \mathrm{min}$$

where a denotes a first predetermined coefficient, b a second predetermined coefficient, max the smallest maximum and min the minimum.

Then it is preferred

max i

X - | a + b N now /

calculated where

S the new (suppressed) sound signal,

X the disturbed sound signal,

N denotes a noise estimate or a value strongly correlated with the noise.

This combination also takes the variation of the interference signal over time into account. If a constant interference signal is superimposed on the useful signal, exactly this interference signal, or a part proportional to it, is eliminated.
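The variant of FIG. 1b (steps 104 to 108) can be sketched for one frequency as follows. The coefficients a = b = 0.5, the envelope values and the function name `noise_estimate_1b` are assumptions chosen for illustration only; the text leaves the coefficients to the application.

```python
import numpy as np

def noise_estimate_1b(envelope, segment_len, n_segments, a, b):
    """Combine the smallest maximum and the minimum of the last n_segments segments."""
    env = np.asarray(envelope, dtype=float)
    recent = env[-n_segments * segment_len:].reshape(n_segments, segment_len)
    smallest_max = recent.max(axis=1).min()    # steps 104 and 105
    minimum = recent.min()                     # step 107
    return smallest_max * a + b * minimum      # step 108

envelope = np.array([1.0, 1.2, 0.9,    # pause
                     4.0, 5.0, 1.1,    # word
                     0.8, 1.0, 1.3])   # pause
estimate = noise_estimate_1b(envelope, segment_len=3, n_segments=3, a=0.5, b=0.5)
```

Here the smallest maximum is 1.2 and the minimum is 0.8, so the combined estimate is 1.0. With b = 0 the function reduces to the weighted smallest maximum of FIG. 1a.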

The time interval T used to determine the minimum and, where applicable, the smallest maximum, i.e. the duration of the number of preceding segments, is chosen in particular to be longer than a spoken word (the sound signal being naturally spoken language). The minimum and the smallest maximum are updated at times t = n · T, i.e. at every integer multiple n of the time interval T.
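The update schedule can be expressed in terms of transform frames. The sketch below assumes an interval of T = 1.5 s, a sampling rate of 8 kHz and a hop of 128 samples between transforms; none of these values, nor the function names, come from the text.

```python
def frames_per_interval(interval_s, hop, sample_rate):
    """Number of transform frames that fit into one update interval T."""
    return max(1, int(interval_s * sample_rate / hop))

def update_frames(total_frames, frames_in_interval):
    """Frame indices corresponding to t = n * T (n = 1, 2, ...)."""
    return list(range(frames_in_interval, total_frames + 1, frames_in_interval))

# Assumed values: T = 1.5 s (longer than a typical word), 8 kHz audio, hop 128.
per_interval = frames_per_interval(1.5, hop=128, sample_rate=8000)
updates = update_frames(total_frames=400, frames_in_interval=per_interval)
```

At each index in `updates` the minimum and the smallest maximum would be determined anew from the frames of the interval that has just elapsed.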

FIG. 2 shows a profile of an envelope f_H(t) of a frequency f_i over time t. The amplitude Af of the frequency f_i is plotted on the ordinate and the time t on the abscissa. The envelope f_H(t) is shown over time t. The time axis t is divided into segments SEG_i, i being a running index. The segments SEG1, SEG2, ..., SEG6 are shown in FIG. 2 as an example. For each segment SEG_i a maximum Max_i is determined, which is the maximum of the envelope f_H(t) of the frequency f_i within the respective segment SEG_i. This yields the maxima Max1, Max2, ..., Max6. Next, the smallest of the maxima is determined, in the example the maximum Maxβ from segment SEGβ. The minimum Min of the segments SEG_i shown lies in segment SEG2. The smallest maximum Maxβ and the minimum Min determined in this way are combined in the manner described above and subtracted from the sound signal, i.e. from the frequency f_i, in order to improve the useful signal (again with respect to the frequency f_i).
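The evaluation of six segments can be walked through with made-up numbers. The values below are purely hypothetical, since the text does not give the amplitudes of FIG. 2; they are chosen only so that the minimum falls in the second segment, as in the description.

```python
import numpy as np

# Hypothetical envelope samples, one row per segment SEG1 .. SEG6.
segments = np.array([
    [2.0, 5.0, 3.0],
    [0.5, 4.0, 2.5],
    [1.5, 3.5, 2.0],
    [2.0, 6.0, 4.0],
    [1.0, 4.5, 3.0],
    [1.2, 2.2, 1.8],
])
maxima = segments.max(axis=1)                          # Max1 .. Max6
smallest_max = maxima.min()                            # smallest of the six maxima
smallest_max_segment = int(maxima.argmin()) + 1        # 1-based segment number
minimum = segments.min()                               # minimum over all segments
minimum_segment = int(segments.min(axis=1).argmin()) + 1
```

With these numbers the maxima are 5.0, 4.0, 3.5, 6.0, 4.5 and 2.2, and the overall minimum of 0.5 lies in the second segment.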

In particular, a weighted average of the smallest maximum and the minimum is subtracted from the sound signal (with respect to the frequency f_i under consideration in each case).

Furthermore, the smallest maximum and the minimum at a time t_akt are determined taking into account a predetermined number N of segments before this time t_akt. To adapt the interference signal that is subtracted from the sound signal, the smallest maximum and the minimum (over the preceding N segments) are determined anew at different times t_akt, combined with one another and subtracted from the sound signal (with respect to the respective frequency f_i).

FIG. 2 shows an example of the envelope f_H(t) for a predetermined frequency f_i. After transformation of the sound signal x(t) into the frequency domain (for example by carrying out an FFT), exactly one amplitude value Af at the respective time t is obtained for each frequency f_i. The course f_i(t) of the frequency over time t results from transformations into the frequency domain carried out at different times t. In this way, the time course f_i(t) of a predetermined frequency is obtained. The envelope f_H(t) is determined from this time course f_i(t) and is shown in FIG. 2. In particular, an envelope f_H(t) is determined for each of several frequencies f_i, so that the invention is applied to a plurality of envelopes f_H(t), which represent the course of a plurality of frequencies f_i over time. A significant improvement of the sound signal is thus achieved by subtracting the determined interference signal from the information-bearing sound signal.
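Applying the procedure to several frequencies at once can be sketched on a magnitude array with one row per frequency. The array layout (bins × frames), the random test data, the clamping at zero and the function name `suppress_spectrogram` are assumptions for illustration.

```python
import numpy as np

def suppress_spectrogram(mag, segment_len, a=1.0, b=0.0):
    """Estimate and subtract the interference separately for every frequency row."""
    bins, frames = mag.shape
    count = frames // segment_len
    seg = mag[:, :count * segment_len].reshape(bins, count, segment_len)
    smallest_max = seg.max(axis=2).min(axis=1)    # one value per frequency
    minimum = seg.min(axis=2).min(axis=1)         # one value per frequency
    noise = smallest_max * a + b * minimum
    # Clamping at zero is an assumption; the text only describes the subtraction.
    return np.maximum(mag - noise[:, None], 0.0)

rng = np.random.default_rng(0)
mag = rng.uniform(0.5, 1.0, size=(3, 12))    # three frequencies of background noise
mag[1, 4:8] += 5.0                           # a burst of "speech" at one frequency
cleaned = suppress_spectrogram(mag, segment_len=4)
```

The burst in the second row survives the subtraction, while the rows that contain only background noise are reduced to small residual values.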

A processor unit PRZE is shown in FIG. 3. The processor unit PRZE comprises a processor CPU, a memory SPE and an input/output interface IOS, which is used in different ways via an interface IFC: via a graphics interface, output is displayed on a monitor MON and/or output on a printer PRT. Input is made using a mouse MAS or a keyboard TAST. The processor unit PRZE also has a data bus BUS, which connects a memory MEM, the processor CPU and the input/output interface IOS. Additional components, e.g. additional memory, data storage (hard disk) or a scanner, can also be connected to the data bus BUS.

FIG. 4 shows a speech recognition system. A prerequisite for recognizing naturally spoken language is a suitable formalism for representing knowledge. A complete speech recognition system comprises several levels of processing. These are in particular acoustic phonetics, intonation, syntax, semantics and pragmatics. FIG. 4 shows the processing levels during recognition (see [1]).

The natural speech signal SPRS enters the speech recognition system. A feature extraction is carried out there in a component MEX. After the feature extraction, speech sounds are recognized with the aid of known acoustic-phonetic units APE (see block SPLE); this involves calculating acoustic distance parameters. After the speech sound recognition SPLE, lexical decoding (word recognition) takes place in a block LDK with the aid of the pronunciation model or word lexicon WOLX, followed by a syntax analysis SYAL with the aid of the language model GRSML, which includes the grammar. The word recognition LDK and the syntax analysis SYAL constitute the search for a match for the speech signal. Finally, semantic post-processing is carried out in a block SENB, taking into account context knowledge and pragmatics KWPM, which yields the speech ERSPR recognized by the speech recognition system.

The following publications were cited in this document:

[1] A. Hauenstein: "Optimization of algorithms and design of a processor for automatic speech recognition", Chair for Integrated Circuits, Technical University of Munich, dissertation, July 19, 1993, chapter 2, pages 13 to 26.

[2] S.G. Mallat: "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 11, No. 7, July 1989, pages 674-693.

Claims


1. A method for processing a sound signal,
a) in which the sound signal is transformed into the frequency domain,
b) in which, for at least one predetermined frequency, an envelope of the transformed sound signal is determined over time,
c) in which the envelope is divided into a set of segments, each determined by a predetermined duration,
d) in which a maximum of the envelope is determined for each segment of the set of segments,
e) in which the smallest maximum is determined for a number of segments of the set of segments,
f) in which the sound signal is processed by subtracting the smallest maximum, weighted by a factor, from the sound signal.

2. The method of claim 1,
a) in which a minimum is determined for a further number of segments of the set of segments,
b) in which the sound signal is processed by subtracting the smallest maximum, combined with the minimum, from the sound signal.

3. The method of claim 2, wherein the minimum and the smallest maximum are combined according to the following relationship:

$$\mathrm{max} \cdot a + b \cdot \mathrm{min}$$

where a denotes a first predetermined coefficient, b a second predetermined coefficient, max the smallest maximum and min the minimum.

4. The method according to any one of the preceding claims, in which the sound signal is processed each time the number or the further number of segments has elapsed.

5. The method according to any one of the preceding claims, wherein the sound signal is a speech signal.

6. The method according to any one of the preceding claims, in which the processed sound signal is used for speech recognition.

7. A device for processing a sound signal, in which a processor unit is set up such that
a) the sound signal can be transformed into the frequency domain,
b) for at least one predetermined frequency, an envelope of the transformed sound signal can be determined over time,
c) the envelope can be divided into a set of segments, each determined by a predetermined duration,
d) a maximum of the envelope is determined for each segment of the set of segments,
e) the smallest maximum is determined for a number of segments of the set of segments,
f) the sound signal is processed by subtracting the smallest maximum, weighted by a factor, from the sound signal.

8. The device according to claim 7, in which the processor unit is set up in such a way that
a) a minimum is determined for a further number of segments of the set of segments,
b) the sound signal is processed by subtracting the smallest maximum, combined with the minimum, from the sound signal.