WO2001091106A1

WO2001091106A1 - Adaptive analysis windows for speech recognition

Info

Publication number: WO2001091106A1
Application number: PCT/FR2001/001218
Authority: WO
Inventors: Frédéric SOUFFLET; Teddy Furon
Original assignee: Thomson Licensing S.A.
Priority date: 2000-05-23
Filing date: 2001-04-20
Publication date: 2001-11-29
Also published as: AU2001254892A1

Abstract

The invention concerns speech recognition comprising a sampling step which consists in sampling a voice signal (400) in a time window (501) and a step which consists in modifying (707, 713) the length of the time window, on the basis of at least a predetermined criterion.

Description

ADAPTIVE ANALYSIS WINDOWS FOR SPEECH RECOGNITION

The present invention relates to the field of voice interfaces.

More specifically, the invention relates to the optimization of acoustic-phonetic decoders (also called "front ends") used in voice recognition, for example for interface applications or human-machine dialogue.

According to known techniques, acoustic-phonetic decoders produce acoustic vectors at regular intervals by applying a sampling window to the speech signal to be processed. These vectors are then generally delivered to a voice recognition engine, as described in the book by Frederick Jelinek, "Statistical Methods for Speech Recognition", published by MIT Press in 1997.

Different types of methods are used to obtain a model for speech recognition. Today, the most widely used are the MFCC method ("Mel Frequency Cepstral Coefficients") and the PLP method ("Perceptual Linear Prediction") for recognition in noise-free conditions, and the so-called RASTA-PLP method for recognition in noisy conditions or over telephone lines, which distort the signal. These techniques are described in particular in the article "Spectral Signal Processing for ASR" by M. Hunt, published in "Proceedings 1999 IEEE Automatic Speech Recognition and Understanding Workshop, Colorado, USA, December 12-15", and in the article "Perceptual linear predictive (PLP) analysis of speech" by H. Hermansky, published in the April 1990 issue of the Journal of the Acoustical Society of America.

Information and control systems are increasingly using a voice interface to make interaction with the user quick and intuitive. As these systems become more complex, the styles of dialogue supported are more and more rich and varied.

For applications involving voice recognition, in particular those aimed at the general public, it is important that the voice recognition used supports a spontaneous speech style comprising, for example, hesitations, silences in the middle of sentences, stuttering, etc. Most recognition systems are based on the use of hidden Markov models (HMMs) to model the temporal sequence of the phonetic units composing the language. In such systems, however, the duration of a sound is modeled only implicitly by the probability of remaining in the same state, so that the probability of a given duration decreases exponentially with that duration.

The inventors have found that this model is not consistent with observation: for example, a vowel held for a long time contains more information for recognition than a short vowel, so the contribution to the final score of the state associated with the vowel should be greatest for a long pronunciation.

Variants of the solution based on standard HMMs have been proposed to remedy this drawback. Thus, according to the document "Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition" by M. J. Russell and R. K. Moore (published in the ICASSP proceedings, pages 5 to 8, in 1985), the duration of the state is modeled explicitly and is not of the exponentially decreasing type.

According to this approach, there is an optimal value for the duration of a sound in the pronunciation of a word. This duration can be long or short. It is therefore possible not to penalize sustained voiced sounds. But then, conversely, it is standard speech which receives a lower score. The problem is thus simply displaced.

The invention according to its different aspects aims in particular to overcome these drawbacks of the prior art.

More specifically, an object of the invention is to provide a method and a device for voice recognition making it possible to respond effectively to the problems posed by spontaneous speech, with long voiced sounds for example, without penalizing the processing of standard speech, which contains neither hesitation nor slowness.

To this end, the invention proposes a voice recognition method comprising a sampling step in which a voice signal is sampled in a time window, remarkable in that it comprises a step of modifying the length of the time window, based on at least one predetermined criterion.

Thus, the invention makes it possible to consider a larger or smaller number of samples depending on the length of the time window, which makes it possible to take into account, for example, long voiced sounds, hesitations and slowness without penalizing speakers who speak at a regular pace.

According to a particular characteristic, the method is remarkable in that one of the predetermined criteria is information representative of the stationarity of the voice signal, the time window being longer the more stationary the voice signal is. It is recalled that a signal is stationary here if it is the periodic reproduction (or, more generally, quasi-reproduction), at a given frequency, of the same temporal pattern.

Thus, the voice recognition method advantageously allows the stationarity of the signal to be taken into account as early as the acoustic-phonetic decoding step, which results in a relatively simple implementation and greater efficiency of the recognition engine. In particular, since fewer acoustic vectors are sent to the recognition engine in a given time interval than with prior-art techniques, the computation time the recognition engine needs to decode a signal is reduced.

According to a particular characteristic, the method is remarkable in that the information representative of the stationarity of the voice signal is obtained during a step of analysis of the signal taking into account a psycho-acoustic model.

According to a particular characteristic, the method is remarkable in that the step of analyzing the stationarity of the voice signal comprises an analysis of formants in the voice signal, allowing the detection of voiced sounds.

According to a particular characteristic, the method is remarkable in that one of the predetermined criteria is information representative of the presence of a voiced sound, the time window being longer when a voiced sound has been detected in the voice signal.

The invention also relates to a voice recognition device comprising a sampler for sampling a voice signal in a time window and comprising means for modifying the length of the time window, as a function of at least one predetermined criterion.

The invention further relates to a voice recognition computer program product comprising program elements, recorded on a medium readable by at least one processing device, remarkable in that the program elements control the device so that it performs:

- a sampling step in which a voice signal is sampled in a time window; and

- a step of modifying the length of the time window as a function of at least one predetermined criterion.

The invention also relates to a computer program product characterized in that said program comprises sequences of instructions suitable for implementing the voice recognition method as described above when the program is executed on a computer.

The advantages of the voice recognition device and of the computer program products are the same as those of the voice recognition method; they are not described in more detail.

Other characteristics and advantages of the invention will appear more clearly on reading the following description of a preferred embodiment, given by way of simple illustrative and nonlimiting example, and of the appended drawings, among which:

- Figure 1 shows a general block diagram of a system comprising a voice-controlled unit, in which the technique of the invention is implemented;

- Figure 2 shows a block diagram of the voice recognition unit of the system of Figure 1;

- Figure 3 shows an electronic diagram of a voice recognition unit implementing the block diagram of Figure 2;

- Figure 4 shows a voice signal sampled in accordance with the state of the art;

- Figure 5 shows a voice signal sampled according to the invention according to a particular embodiment;

- Figure 6 shows two successive sampling windows of the signal as illustrated with reference to Figure 5;

- Figure 7 shows a flow diagram for processing voice data as implemented by the voice recognition unit of Figures 2 and 3;

- Figure 8 shows an example of a sampled voice signal close to the sound of the vowel "I", at the input of an element of the voice recognition unit illustrated with reference to Figure 1;

- Figures 9 and 10 illustrate the voice signal of Figure 8 after filtering by the voice recognition unit of Figure 1;

- Figure 11 shows the voice signal of Figure 8 processed in a larger sampling window; and

- Figure 12 shows an example of a sampled voice signal close to the sound of the stop consonant "T" of the French word "PETIT" ("small"), at the input of an element of the voice recognition unit illustrated with reference to Figure 1.

The general principle of the invention therefore rests on the adaptation of the size of the sampling window of a voice signal.

The invention thus proposes to replace processing based on a fixed-size window, which delivers acoustic vectors at regular intervals according to a fixed period (often close to 10 ms) independently of the information encoded, with an extraction of acoustic vectors over variable-size windows at irregular intervals, spaced further apart the less the information contained in the signal varies.

Thus, for a relatively stationary signal for which the information contained in the signal is quite reduced (because, in this case, there is reproduction of the same fundamental form), large windows will be used. Conversely, short windows will be used for non or weakly stationary signals.

Thus, the principle of the invention is not to provide an acoustic vector at regular intervals over time, independently of the information contained in the signal, but to provide a vector each time the information contained in the signal changes sufficiently. This is more in line with the principle of the Markov models used in decoding.

Indeed, when a voice signal is received, it is sampled in a window having an initial size allowing the acquisition of NInit voice samples.

A stationarity analysis is then performed. This analysis is, for example, similar to that used in the EPAC coder, which is based on the principle of perceptual analysis as described in the document "The AT&T Perceptual Audio Coder (PAC)" by J. D. Johnston and D. Sinha, presented at the AES convention (New York, October 1995). The methods implemented in MPEG-1 Layer 3 (MP3) and Dolby AC-3 coders can also be used; they are described in particular in the document "The modulated lapped transform, its time-varying forms, and its applications to audio coding standards" by S. Shlien (IEEE Transactions on Speech and Audio Processing, vol. 5, pp. 359-366, July 1997).

In audio coding, a so-called "perfect reconstruction" constraint limits the size of the windows used, because the coded signal must be reconstructed for listening. This constraint does not exist in the case of the invention, and a different method, better suited to voice recognition, is described below with regard to a particular embodiment. According to the invention, by combining a stationarity analysis with a method for determining and adjusting the sampling window, a long voiced sound is analyzed with a long window while a short voiced sound is analyzed with a short window.

If the signal is stationary, the size of the window is multiplied by a multiplicative coefficient α, for example equal to 2. If the signal is not stationary, the size of the window is divided by a divisor coefficient β, for example equal to 2.

The window size is thus increased or decreased one or more times:

- until the window size is the largest window size for which the signal is stationary; or

- down to a minimum size corresponding to a predetermined number NMin of samples, without a stationary signal having been observed; or

- up to a maximum size corresponding to a predetermined number NMax of samples, while a stationary signal is still observed.

Thus, the modification of the length of the window is carried out dynamically or adaptively.

More generally, other rules for calculating the size can of course be envisaged (values different from the multiplicative and divisor coefficients; addition and subtraction of time units; selection from predetermined sizes ...).

Once the window size has been determined, the samples corresponding to this window size are processed, and the process is repeated with a window shifted by one third of the length of the window thus determined and an initial window size allowing the acquisition of NInit voice samples.

Of course, the different values NInit, NMin, NMax, α and β are configurable or even variable, as is the offset value between two iterations of the process.
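By way of illustration only, the configurable parameters and a single enlargement or reduction of the window can be sketched as follows. The names `WindowConfig` and `next_size` are hypothetical and do not appear in the description; the default values are the example values given elsewhere in the description (512, 64, 2048, and coefficients equal to 2), and leaving the size unchanged when a bound would be crossed is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowConfig:
    """Tunable parameters (hypothetical container; defaults are the example values)."""
    n_init: int = 512   # NInit: initial window size, in samples
    n_min: int = 64     # NMin: smallest allowed window size
    n_max: int = 2048   # NMax: largest allowed window size
    alpha: int = 2      # multiplicative coefficient applied when stationary
    beta: int = 2       # divisor coefficient applied when not stationary


def next_size(n: int, stationary: bool, cfg: WindowConfig) -> int:
    """Return the next candidate size, or n unchanged if a bound would be crossed."""
    if stationary:
        grown = n * cfg.alpha
        return grown if grown <= cfg.n_max else n
    shrunk = n // cfg.beta
    return shrunk if shrunk >= cfg.n_min else n
```

Other rules (additive steps, a table of predetermined sizes) would only change the body of `next_size`.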

There is presented, in relation to FIG. 1, a general block diagram of a system comprising a voice-controlled unit 102 implementing the technique of the invention.

This system notably includes:

- a voice source 100, which may in particular consist of a microphone intended to pick up a voice signal produced by a speaker;

- a voice recognition unit 102;

- a control unit 105 intended to control a device 107; and

- a controlled device 107, for example of the television or video recorder type.

The source 100 is connected to the voice recognition unit 102, via a link 101 which allows it to transmit an analog source wave representative of a voice signal to the unit 102.

The unit 102 can retrieve context information (such as, for example, the type of device 107 that can be controlled by the control unit 105, or the list of command codes) via a link 104 and send commands to the control unit 105 via a link 103.

The control unit 105 sends commands via a link 106, for example infrared, to the device 107.

According to the embodiment considered, the source 100, the voice recognition unit 102 and the control unit 105 are part of the same device, and the links 101, 103 and 104 are thus internal to the device. The link 106, on the other hand, is typically a wireless link.

According to a first alternative embodiment of the invention described in Figure 1, the elements 100, 102 and 105 are partly or completely separate and are not part of the same device. In this case, the links 101, 103 and 104 are external connections, wired or not.

According to a second variant, the source 100, the units 102 and 105 and the device 107 are part of the same device and are connected to each other by internal buses (links 101, 103, 104 and 106). This variant is particularly advantageous when the device is, for example, a telephone or a portable telecommunication terminal.

FIG. 2 shows a block diagram of a voice recognition unit such as the unit 102 illustrated with reference to FIG. 1.

The unit 102 receives from outside the analog source wave 101, which is processed by an Acoustic-Phonetic Decoder 200 or DAP (also called a "front end"). The DAP 200 samples the source wave 101 at regular intervals (typically every 10 ms) to produce real-valued vectors or vectors belonging to codebooks, typically representing oral resonances, which are sent via a link 201 to a recognition engine 203.

With the aid of a dictionary 202, the recognition engine 203 analyzes the real vectors which it receives using in particular hidden Markov models or HMM (from the English Hidden Markov Models) and language models (which represent the probability that a word will follow another word). Recognition engines are notably described in detail in the book "Statistical Methods for Speech Recognition" written by Frederick Jelinek, and published by MIT Press in 1997.

The recognition engine 203 supplies the words it has identified from the vectors received to a means for translating these words into commands that can be understood by the device 107. This means uses an artificial intelligence translation method which itself takes into account a context 104 provided by the control unit 105 before issuing one or more commands 103 to the control unit 105.

FIG. 3 schematically illustrates a voice recognition module or device 102 as illustrated with reference to FIG. 1, implementing the block diagram of FIG. 2.

The unit 102 comprises, interconnected by an address and data bus:

- a voice interface 301;

- an analog-to-digital converter 302;

- a processor 304;

- a non-volatile memory 305;

- a random access memory 306; and

- an interface 307 for controlling a device.

Each of the elements illustrated in Figure 3 is well known to those skilled in the art. These common elements are not described here.

It is further noted that the word "register" used throughout the description designates, in each of the memories mentioned, both a low-capacity memory area (a few binary data items) and a high-capacity memory area (used to store an entire program or an entire sequence of transaction data).

The non-volatile memory 305 (or ROM) stores, in registers which, for convenience, have the same names as the data they hold:

- the operating program of the processor 304 in a "prog" register 308;

- a dictionary of words to be understood by the recognition engine in a register 309;

- a value NInit (for example 512), representing an initial window size, in a register 310;

- a value NMin (for example 64), representing a minimum window size, in a register 311; and

- a value NMax (for example 2048), representing a maximum window size, in a register 312.

The random access memory 306 stores data, variables and intermediate processing results and comprises in particular:

- a register 313 in which values b0 and e representative of the glottal excitation of the received signal are stored;

- a vector a1, a2, ..., ap representing a resonator in a register 314;

- a vector s1, s2, ..., sN representing a voice signal in a register 315;

- a value N of the current window size in a register 316; and

- a Boolean value Stationary, which can take the values "Stationary" or "Non Stationary", in a register 317.

FIG. 4 represents a voice signal sampled in accordance with the state of the art.

The voice signal 400 is represented along two axes:

- a horizontal axis 402 representing time t; and

- a vertical axis 401 representing an intensity.

According to the state of the art, the signal 400 is sampled at regular intervals in a window 403 of fixed duration L and containing a fixed number of samples, equal to N.

After processing the N samples, the sampling window is shifted by a time equal to L / 3. A second window 404 is thus obtained, then a third window 405, each of the same length L as the window 403.
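For comparison with the adaptive scheme, this prior-art framing can be sketched in a few lines. The function name `fixed_window_starts` is hypothetical, and expressing the shift of one third of the window as an integer number of samples is an assumption of this sketch.

```python
from typing import List


def fixed_window_starts(num_samples: int, n: int) -> List[int]:
    """Start indices of fixed-size windows of n samples, each shifted by n // 3."""
    hop = n // 3  # shift of one third of the window, in whole samples
    return list(range(0, num_samples - n + 1, hop))
```

Every window has the same size and the same shift, whatever the signal contains.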

FIG. 5 represents a voice signal sampled according to a particular embodiment of the invention as it is implemented in the box illustrated with reference to FIGS. 2 and 3.

In this figure, the voice signal 400 is represented in the same way as in FIG. 4.

According to the invention, the signal 400 is sampled at regular intervals in a window 500 of initial duration L and containing a number of samples, equal to N.

After a first processing of the N samples, according to the algorithm described later with reference to FIG. 7, the size of the sampling window can be:

- either divided by 2 to obtain a sampling window 501 of duration L / 2 containing N / 2 samples;

- or multiplied by 2 to obtain a sampling window 502 of duration 2L containing 2N samples.

FIG. 6 more particularly illustrates the offset between two successive windows as implemented in the unit illustrated with reference to FIGS. 2 and 3.

In this figure, the voice signal 400 is represented in the same way as in FIG. 5.

According to the invention, the signal 400 is sampled at regular intervals in a window 501, for example of final duration L / 2 and containing a number of samples equal to N / 2 (this value having been obtained by execution of the algorithm illustrated with reference to FIG. 7).

After processing the N / 2 samples of the sampling window 501, a new windowing is determined from a first window 600:

- shifted by the duration of the window 501 (determined during the previous windowing operation) divided by 3, that is (L / 2) / 3, or L / 6; and

- having an initial duration equal to L and taking into account N samples.

The offset of a window with respect to the preceding one is a function of the length of that preceding window and is advantageously equal to a fraction of this length.

FIG. 7 represents a flow diagram for processing the voice data as implemented by the voice recognition unit of FIGS. 2 and 3.

After an initialization step 700, the unit 102 launches the program Prog 308 and initializes the various variables (in particular the value of t0, initial instant corresponding to the start of a first window containing samples). Then, during a step 701, the unit 102 performs sampling at a frequency of 22050 Hz and an analog / digital conversion of the voice signal 400 which it receives.

After enough samples have been acquired (for example a number greater than or equal to the value NMax 312 stored in the memory 305), during a step 702, the unit 102 initializes the window size N to a predetermined value NInit 310, for example equal to 512. This predetermined value is an intermediate value of N between NMin 311 and NMax 312.

According to a first alternative embodiment, the value of NInit is equal to NMin; the value of N can then only increase. Indeed, at the onset of a sentence or word, the voice signal is very often in a non-stationary zone.

According to a second alternative embodiment, the value of NInit is equal to NMax; the value of N can then only decrease.

Then, during a step 703, the unit 102 performs windowing corresponding to a window of current size N starting at time t0. Then, during a step 704, the unit 102 performs a psycho-acoustic conversion and a perceptual analysis.

Then, during a test 705, the unit 102 determines whether the signal is stationary in the analysis window of current size N.

If so, during a test 706, the unit 102 determines whether the value of N has reached a maximum limit, in other words whether the value of N multiplied by 2 is strictly greater than the value NMax 312 stored in memory.

If the result of test 706 is negative, the value of N has not reached the upper limit NMax and, during a step 707, N is assigned a new value equal to 2 times N. In other words, the content of the register N 316 is multiplied by 2. Then, during a step 708, the unit 102 performs windowing corresponding to a window of the new current size N starting at an instant t0 which is the same as the window start time defined by the last windowing step 703.

Then the unit 102 performs a step 709 of psycho-acoustic conversion and perceptual analysis taking into account the samples of the current window, step 709 being quite similar to step 704 described above.

Then, during a stationarity test 710 similar to test 705, the unit determines whether the signal is stationary in the analysis window of current size N.

If the result of test 710 is positive, test 706 is repeated. If the result of test 710 is negative, the previous window size, that is N / 2, is restored during a step 711; it is the largest window size for which the signal was stationary within the window. Thus, N is assigned the value N divided by 2.

If the result of test 705 is negative, during a test 712, the unit 102 determines whether the value of N has reached a minimum limit, in other words whether the value of N divided by 2 is strictly less than the value NMin 311 stored in memory.

If the result of test 712 is negative, the value of N has not reached the lower limit NMin and, during a step 713, N is assigned a new value equal to N / 2. In other words, the content of the register N 316 is divided by 2.

Then, during a step 714, the unit 102 performs windowing corresponding to a window of the new current size N starting at a time t0 which is the same as the start time of the window defined by the last windowing step 703. Then the unit 102 performs a step 715 of psycho-acoustic conversion and perceptual analysis taking into account the samples of the current window, step 715 being quite similar to step 704 described above.

Then, during a stationarity test 716 similar to test 705, the unit determines whether the signal is stationary in the analysis window of current size N.

If the result of test 716 is negative, the signal is not stationary in the current window and test 712 is repeated.
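The sequence of tests 705 to 716 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the unit 102: `select_window_size` and the `is_stationary` callback are hypothetical names, the defaults reuse the example values 512, 64 and 2048, the buffer is assumed to hold at least NMax samples from t0, and the sketch reads the lower-bound test as "N / 2 would fall below NMin" with the shrink loop returning to that same test.

```python
from typing import Callable, Sequence


def select_window_size(
    samples: Sequence[float],
    t0: int,
    is_stationary: Callable[[Sequence[float]], bool],
    n_init: int = 512,
    n_min: int = 64,
    n_max: int = 2048,
) -> int:
    """Return the window size N retained for the window starting at t0."""
    n = n_init                                         # step 702
    if is_stationary(samples[t0:t0 + n]):              # steps 703-704, test 705
        while n * 2 <= n_max:                          # test 706 negative
            n *= 2                                     # enlarge the window
            if not is_stationary(samples[t0:t0 + n]):  # steps 708-709, test 710
                n //= 2                                # back to the last stationary size
                break
        return n
    while n // 2 >= n_min:                             # lower-bound test negative
        n //= 2                                        # shrink the window
        if is_stationary(samples[t0:t0 + n]):          # steps 714-715, test 716
            break
    return n
```

With a toy predicate that declares any window of at most 1024 samples stationary, the function grows from 512 to 2048, finds that size non-stationary, and falls back to 1024.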

A step 717 is carried out in one of the following cases:

- after a step 711;

- after a positive result in test 706, N having reached its maximum value;

- after a positive result in test 712, N having reached its minimum value; or

- after a positive result in the stationarity test 716 carried out after one or more decreases in the value of N.

During this step 717, the unit 102 calculates the acoustic coefficients, delivers them to the recognition engine 203 and then shifts the instant t0, which becomes equal to t0 plus a duration equal to one third of the size of the current window (in other words, the duration of reception of N / 3 samples). The samples received before the new time t0 are no longer useful and can then be discarded.
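The outer loop around step 717 can be sketched as below. The names `adaptive_windows` and `choose_size` are hypothetical; `choose_size` stands for whatever procedure determines the window size, and the computation of acoustic coefficients is left out. Requiring NMax samples to be available before each window is an assumption taken from the example given for step 702.

```python
from typing import Callable, Iterator, Sequence, Tuple


def adaptive_windows(
    samples: Sequence[float],
    choose_size: Callable[[Sequence[float], int], int],
    n_max: int = 2048,
) -> Iterator[Tuple[int, int]]:
    """Yield (t0, N) for each analysis window, in samples."""
    t0 = 0
    while t0 + n_max <= len(samples):  # enough samples for the largest window
        n = choose_size(samples, t0)   # window-size determination
        yield t0, n                    # step 717: coefficients would be computed here
        t0 += max(1, n // 3)           # shift t0 by one third of the chosen window
```

Long stationary stretches therefore produce fewer, more widely spaced vectors than non-stationary ones.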

Then the window size initialization step 702 is repeated.

The steps of psycho-acoustic conversion, perceptual analysis and stationarity testing will now be detailed with reference to the example signals illustrated in FIGS. 8 to 12.

FIG. 8 shows the example of a signal 800 close to the sound of the vowel "I", pronounced by a man in his thirties, over a window of the desired size (512 at the start), time being represented on the horizontal axis 801 at the sampling rate F, that is 22050 values per second, and a pressure being represented on the vertical axis 802 according to an arbitrary scale.

A first step consists in filtering the signal 800 with a low-pass filter, to remove unnecessary details from the sound wave.

For this purpose, the following filter is used, for example (where Y is the filtered signal and S the original signal):

Y_n = (1/7) (S_{n+3} + S_{n+2} + S_{n+1} + S_n + S_{n-1} + S_{n-2} + S_{n-3})

Figure 9 shows the result (filtered signal 900) for the same time window as Figure 8.
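The 7-point averaging filter can be written directly from its definition. In this sketch the function name `smooth7` is hypothetical, and treating samples outside the window as zero is an assumption, since the description does not say how the window edges are handled.

```python
from typing import List, Sequence


def smooth7(s: Sequence[float]) -> List[float]:
    """Centered 7-point moving average; samples outside the window count as zero."""
    n = len(s)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(-3, 4):  # offsets n-3 .. n+3
            j = i + k
            if 0 <= j < n:
                total += s[j]
        out.append(total / 7.0)
    return out
```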

The zero crossings of the signal on rising edges and on falling edges are then located using a conventional algorithm. Figure 9 illustrates:

- points 906, 908, 911, 913 and 915, obtained on rising edges; and

- points 907, 910, 912, 914 and 916, obtained on falling edges.

The extrema of the signal between two zero crossings are also located:

- maxima 917, 920, 918, 921 and 919; and

- minima 922, 923, 924 and 925.
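One conventional way of locating these points is sketched below. The function names are hypothetical, and the convention used (a crossing is reported at the first sample on the new side of zero, with zero counted as non-negative) is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple


def zero_crossings(y: Sequence[float]) -> Tuple[List[int], List[int]]:
    """Return (rising, falling): indices of the first sample after each zero crossing."""
    rising, falling = [], []
    for i in range(1, len(y)):
        if y[i - 1] < 0 <= y[i]:
            rising.append(i)
        elif y[i - 1] >= 0 > y[i]:
            falling.append(i)
    return rising, falling


def extrema(
    y: Sequence[float], rising: Sequence[int], falling: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Maximum of each positive lobe and minimum of each negative lobe."""
    maxima = []
    for r in rising:
        end = next((f for f in falling if f > r), None)  # lobe ends at next falling edge
        if end is not None:
            maxima.append(max(y[r:end]))
    minima = []
    for f in falling:
        end = next((r for r in rising if r > f), None)   # lobe ends at next rising edge
        if end is not None:
            minima.append(min(y[f:end]))
    return maxima, minima
```

Lobes that are not closed by a crossing inside the window are ignored in this sketch.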

Thresholding is then carried out: the rising-edge points whose associated next maximum is less than a threshold S, calculated as a fraction of the maximum value M over the window (the level of point 919 in the example), are eliminated. This fraction can typically be equal to 0.3. The same is done for the falling edges.

According to the example, points 908 and 913 (corresponding to rising edges having maximum points 920 and 921 respectively) are eliminated while no point corresponding to a falling edge is.
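The thresholding of rising edges can be sketched as follows, using the typical fraction 0.3. The function name `threshold_rising_edges` is hypothetical, and taking the lobe after a rising edge to end at the next falling edge (or at the end of the window) is an assumption; falling edges would be handled symmetrically.

```python
from typing import List, Sequence


def threshold_rising_edges(
    y: Sequence[float],
    rising: Sequence[int],
    falling: Sequence[int],
    fraction: float = 0.3,
) -> List[int]:
    """Keep the rising-edge indices whose following maximum reaches fraction * max(y)."""
    limit = fraction * max(y)  # threshold S derived from the window maximum M
    kept = []
    for r in rising:
        end = next((f for f in falling if f > r), len(y))
        if max(y[r:end]) >= limit:
            kept.append(r)
    return kept
```

In the toy signal `[-1, 10, -1, 2, -1, 9, -1]`, the small middle lobe (peak 2, below 0.3 x 10) is discarded while the two large lobes are kept.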

The list (of rising edges or of falling edges) containing the fewest candidates is then determined. In the example, it is the list of rising edges, which contains the three remaining points 906, 911 and 915. The area formed by the signal between two consecutive rising-edge abscissas is then calculated. By way of illustration, for the values corresponding to points 906 and 911, this is the hatched area 1001 in FIG. 10.

Then the difference between this initial area 1001 and the area of the signal shifted by the difference 1000 between the two abscissas 906 and 911 is calculated. If this difference is less than a fraction of the initial area (typically 15% of its value), then the initial signal between points 906 and 911 is a possible candidate for the stationary pattern reproduced by the sound wave. In the illustrated case, the pattern does indeed match its neighbor.
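One possible reading of this area comparison is sketched below. The description does not state exactly how the two areas are compared, so this sketch assumes that the "difference" is the area enclosed between the pattern and its copy shifted by one period, measured as a sum of absolute sample values; the function name `repeats_with_period` is hypothetical.

```python
from typing import Sequence


def repeats_with_period(
    y: Sequence[float], a: int, b: int, tolerance: float = 0.15
) -> bool:
    """True if the pattern y[a:b] is reproduced, within tolerance, one period later."""
    period = b - a
    if period <= 0 or b + period > len(y):
        return False  # not enough samples to compare with the shifted pattern
    area = sum(abs(v) for v in y[a:b])                          # initial area
    diff = sum(abs(y[i] - y[i + period]) for i in range(a, b))  # area between curves
    return area > 0 and diff < tolerance * area
```

Here `a` and `b` play the role of two consecutive retained rising edges, and the 15% tolerance is the typical value given in the description.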

It can be decided at this stage that the signal is voiced, and the size of the window multiplied by two. It is then checked whether the wave over the whole new window is well represented, as before, by translates of the basic shape obtained.

The window with 1024 values is represented in FIG. 11, for the signal 1100 before filtering. The algorithm described above determines that the signal is still stationary over this window. The algorithm continues until the maximum allowable size NMax is reached, typically 2048 values, or a non-stationary window is encountered.

According to a variant, a Fourier analysis can be carried out on the basic shape of the signal supplemented with a few translates (typically 4), followed by a psycho-acoustic analysis to determine whether a listener would judge the spectral content obtained to be different from the spectral content obtained on the following basic shape. The principle of analysis detailed in the document "Perceptual coding of digital audio" by Ted Painter, published in the Proceedings of the IEEE in April 2000 (pages 451 to 513), can be used.

FIG. 12 illustrates the signal 1200 over a window of 512 values of the non-stationary sound of the stop consonant "T" in the French word "PETIT" ("small"). The algorithm concludes that the signal is not stationary, so the window is not enlarged; instead, a new window of the same size, shifted for example by 256 values, is generated to start a new analysis.

Of course, the invention is not limited to the exemplary embodiments mentioned above. In particular, the person skilled in the art may make any variation:

- in the definition of an initial window size (NInit which can be fixed or vary from one windowing operation to another);

- in the definition of window size variations (which may, for example, be large at the start of a window size determination step and finer at the end);

- in the implementation of the window size determination step (which can be based, for example, on a regular increase and/or decrease in window sizes or on a dichotomy between two window size values); and/or

- in the implementation of the step of shifting between two successive windows producing acoustic vectors taken into account by a voice recognition engine.

It should be noted that the voice recognition engine may be any type of engine using acoustic vectors produced by an acoustic-phonetic decoder with a window of variable size according to the invention.

It will also be noted that the invention is not limited to a purely hardware implementation: it can also be implemented as a sequence of instructions of a computer program, or in any form combining a hardware part and a software part. Where the invention is implemented partially or completely in software, the corresponding sequence of instructions may be stored in a storage means that is removable (such as, for example, a floppy disk, a CD-ROM or a DVD-ROM) or not, this storage means being partially or totally readable by a computer or a microprocessor.

Claims

1. A voice recognition method comprising a sampling step in which a voice signal (400) is sampled in a time window (501), characterized in that it comprises a step of modification (707, 713) of the length of said time window, as a function of at least one predetermined criterion.

2. A voice recognition method according to claim 1, characterized in that one of said predetermined criteria is information representative of the stationarity of said voice signal (400), said length of said time window being greater the more stationary said voice signal is.

3. A voice recognition method according to claim 2, characterized in that said information representative of the stationarity of said voice signal is obtained during a step of analysis (709, 715) of said signal taking into account a psychoacoustic model.

4. Voice recognition method according to one of claims 1 to 3, characterized in that said analysis step (709, 715) of the stationarity of said voice signal comprises an analysis of formants in said voice signal, allowing the detection of voiced sounds.

5. A voice recognition method according to claim 4, characterized in that one of said predetermined criteria is information representative of the presence of a voiced sound, the length of said time window being greater when a voiced sound has been detected in said voice signal.

6. Device (102) for voice recognition comprising a sampler sampling a voice signal (400) in a time window (501), characterized in that it comprises means for modifying the length of said time window, as a function of at least one predetermined criterion.

7. Voice recognition computer program product comprising program elements, recorded on a medium readable by at least one microprocessor, characterized in that said program elements control said microprocessor(s) so that it or they perform:

- a sampling step in which a voice signal is sampled in a time window; and

- a step of modifying the length of said time window as a function of at least one predetermined criterion.

8. Computer program product characterized in that said program comprises sequences of instructions adapted to implement a voice recognition method according to any one of claims 1 to 5 when said program is executed on a computer.