EP0594480B1 - Speech detection method - Google Patents

Speech detection method

Info

Publication number
EP0594480B1
Authority
EP
European Patent Office
Prior art keywords
noise
frames
frame
speech
voiced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
EP93402522A
Other languages
German (de)
French (fr)
Other versions
EP0594480A1 (en)
Inventor
Dominique Pastor
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thales Avionics SAS
Original Assignee
Thales Avionics SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to FR9212582A (published as FR2697101B1)
Application filed by Thales Avionics SAS filed Critical Thales Avionics SAS
Publication of EP0594480A1
Application granted
Publication of EP0594480B1
Anticipated expiration
Legal status: Expired - Lifetime

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal
    • G10L25/93 - Discriminating between voiced and unvoiced parts of speech signals
    • G10L2025/932 - Decision in previous or following frames
    • G10L2025/937 - Signal energy in various frequency bands
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique

Description

The present invention relates to a method of speech detection.

When trying to determine the effective start and end of speech, various solutions are possible:

  • (1) One can work on the instantaneous amplitude, by reference to an experimentally determined threshold, and confirm the speech detection by a voicing detection (see the article "Speech-noise discrimination and its applications" by V. PETIT and F. DUMONT, published in the THOMSON-CSF Technical Review, Vol. 12, No. 4, Dec. 1980).
  • (2) One can also work on the energy of the total signal over a time slice of duration T, by thresholding this energy, again experimentally, using local histograms for example, and then confirming by means of a voicing detection, or of a computation of the minimum energy of a vowel. The use of the minimum energy of a vowel is a technique described in the report "AMADEUS Version 1.0" by J.L. GAUVAIN of the LIMSI laboratory of the CNRS.
  • (3) The preceding systems allow the detection of voicing, but not of the effective start and end of speech, that is to say the detection of unvoiced fricative sounds (/F/, /S/, /CH/) and unvoiced plosive sounds (/P/, /T/, /Q/). They must therefore be supplemented by an algorithm for detecting these fricatives. A first technique may consist in the use of local histograms, as recommended in "Problem of detection of word boundaries in the presence of additive noise" by P. WACRENIER, a DEA thesis of the University of PARIS-SUD, Centre d'Orsay.
  • Other techniques, similar to the preceding ones and relatively close to the one presented here, were presented in the article "A Study of Endpoint Detection Algorithms in Adverse Conditions: Incidence on a DTW and HMM Recognizer" by J.C. JUNQUA, B. REAVES and B. MAK, at the EUROSPEECH 1991 Congress.

    In all of these approaches, much is left to heuristics, and few powerful theoretical tools are used.

    Works on speech denoising similar to the one presented here are much more numerous; we will cite in particular the book "Speech Enhancement" by J.S. LIM, Prentice-Hall Signal Processing Series; "Suppression of Acoustic Noise in Speech Using Spectral Subtraction" by S.F. BOLL, published in IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-27, No. 2, April 1979; and "Noise Reduction For Speech Enhancement In Cars: Non-Linear Spectral Subtraction / Kalman Filtering" by P. LOCKWOOD, C. BAILLARGEAT, J.M. GILLOT, J. BOUDY and G. FAUCON, published in the EUROSPEECH 91 proceedings. Only denoising techniques operating in the spectral domain will be considered here, and they will be referred to below as "spectral" denoising, by abuse of language.

    In all of these works, the close relationship between detection and denoising is never really highlighted, except in the article "Suppression of Acoustic Noise in Speech Using Spectral Subtraction" cited above, which offers an empirical solution to this problem.

    However, it is obvious that denoising speech, when one does not have two recording channels, requires the use of "pure" noise frames, not polluted by speech, which in turn requires defining a detection tool capable of distinguishing between noise and noise + speech.

    The closest state of the art is represented by Dermatas et al., "Fast Endpoint Detection Algorithm for Isolated Word Recognition in Office Environment", ICASSP '91, Vol. 1, pp. 733-736.

    The subject of the present invention is a method of speech detection and denoising which detects as surely as possible the actual beginnings and ends of speech, whatever the types of speech sounds, and which allows the detected signals to be denoised as effectively as possible, even when the statistical characteristics of the noise affecting these signals vary widely.

    The method of the invention consists, in a low-noise environment, in detecting voiced frames, and in detecting a vowel nucleus to which a confidence interval is attached.

    In a noisy environment, after having detected at least one voiced frame, noise frames preceding this voiced frame are sought; an autoregressive model of the noise and an average noise spectrum are built; the frames preceding the voicing are whitened by a rejection filter and denoised by a spectral denoiser; the actual start of speech is sought in these whitened frames; from the frames between the actual start of speech and the first voiced frame, the acoustic vectors used by the voice recognition system are extracted; as long as voiced frames are detected, they are denoised and then parameterized for recognition (i.e. the acoustic vectors adapted to the recognition of these frames are extracted); when no more voiced frames are detected, the effective end of speech is sought, and the frames between the last voiced frame and the effective end of speech are then denoised and parameterized.

    In the following, when it comes to parameterization of the frames, it should be understood that one extracts from the frame the acoustic vector (or, equivalently, the acoustic parameters) used by the recognition algorithm.

    An example of such acoustic parameters is the cepstral coefficients, well known to speech processing specialists.

    In the following, whitening will mean the application of the rejection filtering computed from the autoregressive noise model, and denoising the application of the spectral denoiser.

    Whitening and spectral denoising are not applied sequentially, but in parallel: whitening enables the detection of unvoiced sounds, while denoising improves the quality of the voice signal to be recognized.

    Thus, the method of the invention is characterized by the use of theoretical tools allowing a rigorous approach to the detection problems (voicing and fricatives), and by its great adaptability, because this method is above all local to the speech: should the statistical characteristics of the noise change over time, the process will, by construction, remain capable of adapting to them. It is also characterized by the development of a detection expertise built on the results of the signal processing algorithms (which minimizes the number of detection false alarms by taking into account the specific nature of the speech signal), by denoising processes coupled with speech detection, by a "real time" approach at all levels of analysis, by its synergy with other voice signal processing techniques, and by the use of two different denoisers:

    • Rejection filtering, mainly used for the detection of fricatives, by virtue of its whitening properties.
    • Wiener filtering in particular, used to denoise the speech signal for recognition. Spectral subtraction can also be used.

    In the process of the invention, three levels of processing must therefore be distinguished:

    • The "elementary" level which implements signal processing algorithms which are in fact the basic elements of all higher level processing.

    So the "elementary" level of voicing detection is a calculation and thresholding algorithm for the function of correlation. The result is assessed by the higher level.

    This processing is implemented on signal processing processors, for example of the DSP 96000 type.

    • The intermediate level of expertise develops "intelligent" voicing and speech detection from the "raw" detections provided by the elementary level. The expertise can use an appropriate computer language, such as Prolog.
    • The "higher" or user level manages the different algorithms for detecting, denoising and analyzing the voice signal in real time. The C language, for example, is suitable for the implementation of this management.

    The invention is described in detail below according to the following plan. We first describe the algorithm that appropriately chains the different signal processing techniques and the necessary expertise.

    At this highest level of the design hierarchy, we will assume that we have reliable detection and denoising methods, including all the signal processing algorithms and all the expertise, necessary and sufficient. This description is therefore very general. It is even independent of the expertise and signal processing described below, and can therefore apply to techniques other than those described here.

    We then describe the detection expertise for voicing and for the beginning and end of speech, using elementary-level algorithms, some examples of which are cited.

    Finally, we describe the methods used for the detection and denoising of speech.

    It is the results of these techniques (voiced frames, unvoiced speech, etc.) that are used by the higher levels of processing.
    Conventions and vocabulary used.

    We will call frame the elementary time unit of processing. The duration of a frame is conventionally 12.8 ms, but can of course take other values in different implementations. The processing uses discrete Fourier transforms of the processed signals. These Fourier transforms are applied to all the samples obtained on two consecutive frames, which corresponds to performing a Fourier transform over 25.6 ms.

    When two Fourier transforms are consecutive in time, these transforms are calculated not on four consecutive frames, but on three consecutive frames with an overlap of one frame. This is illustrated by the following diagram:

    [Diagram: two successive 25.6 ms Fourier windows, each covering two 12.8 ms frames, overlapping by one frame.]
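
    By way of illustration, this frame and overlap convention can be sketched as follows (a minimal Python/NumPy sketch, not part of the patent text, assuming the 10 kHz sampling frequency and 128-sample frames used in the numerical examples later in this description):

    import numpy as np

    FS = 10_000         # sampling frequency (Hz), as in the text's examples
    FRAME = 128         # one 12.8 ms frame at 10 kHz
    NFFT = 2 * FRAME    # each Fourier transform spans two frames (25.6 ms)

    def fft_windows(signal):
        # Yield the successive 256-point FFTs: each window covers two
        # consecutive frames, and successive windows overlap by one frame.
        for start in range(0, len(signal) - NFFT + 1, FRAME):
            yield np.fft.rfft(signal[start:start + NFFT])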

    We first describe here the functioning of the algorithm at the design level closest to the user.

    The preferred embodiment of the invention is described below with reference to the analysis of signals from very noisy avionics environments, where an initial piece of information is available: the push-to-talk switch that pilots use. This information indicates a time zone close to the signal to be processed.

    However, this push-to-talk event can be more or less close to the actual start of speech, and therefore only low credit can be granted to it for any precise detection. It will therefore be necessary to determine the effective start of speech from this first piece of information.

    First, we look for the first voiced frame located around this push-to-talk event. This first voiced frame is sought first among the N1 frames which precede the push-to-talk (N1 = approximately 30 frames). If no voiced frame is found among these N1 frames, then voicing is sought on the frames following the push-to-talk, as they arrive.

    As soon as the first voiced frame is found by this method, the denoisers are initialized. For this, it is necessary to identify frames made up solely of noise. These noise frames are sought among the N2 frames which precede the first voiced frame (N2 = approximately 40 frames). Indeed, each of these N2 frames is either:

    • consisting of noise alone
    • consisting of noise + breathing
    • consisting of noise + unvoiced fricative or plosive.

    The assumption made is that the energy of noise alone is on average lower than that of noise + breathing, itself lower than that of noise + fricative.

    So, if we consider, among the N2 frames, the one that has the lowest energy, it is very likely that this frame consists only of noise.

    From the knowledge of this frame, we search for all those that are compatible with it, and pairwise compatible, in the sense given below in the paragraph "Compatibility between energies".

    When the noise frames have been detected, we build the two noise models which will be used later:

    • An autoregressive model of the noise, used to build the rejection filter which whitens the noise.
    • Average noise spectrum for spectral denoising.

    These models are described below.

    With the noise models constructed, the N3 frames which precede the voicing, and among which we will look for the effective start of speech, are whitened (by rejection filter) and denoised (by spectral denoiser) (N3 = approximately 30). It goes without saying that N3 is less than N2. This detection is done by fricative detection and is described below.

    When the start of speech is known, we denoise all the frames between the start of speech and the first voiced frame, then we parameterize these frames for their recognition. As these frames are denoised and parameterized, they are sent to the recognition.

    Since the actual start of speech is known, we can continue to process the frames following the first voiced frame.

    Each acquired frame is no longer whitened but only denoised, then parameterized for recognition. A voicing test is performed on each frame.

    If this frame is voiced, the acoustic vector is actually sent to the recognition algorithm.

    If it is not voiced, we check whether it is in fact the last frame of the current vowel nucleus.

    If it is not the last frame of the vowel nucleus, we acquire a new frame and repeat the process, until we find the last voiced frame.

    When the last voiced frame is detected, the N4 frames following this last voiced frame are whitened (N4 = approximately 30 frames), then the effective end of speech is sought among these N4 whitened frames. The process associated with this detection is described below.

    When the effective end of speech is detected, the frames between the end of voicing and this end of speech are denoised, then parameterized and sent to the recognition for processing.

    When the last speech frame has been denoised, parameterized and sent to the recognition system, all processing parameters are reset for the processing of the following speech.

    As can be seen, this process is local to the speech being processed (i.e. it processes each sentence, or each set of words without "gaps" between words), and therefore adapts very well to any changes in the noise statistics, all the more so since we use adaptive algorithms for the autoregressive modeling of the noise, and relatively theoretical models for noise detection and fricative detection.

    In the absence of a push-to-talk signal, the process is launched as soon as voicing is detected.

    A significant simplification of the method described above is possible when the processed signals are not very noisy. The use of denoising and whitening algorithms can then prove useless, even harmful, when the noise level is negligible (laboratory conditions). This phenomenon is known, especially in the case of denoising, where denoising a barely noisy signal can induce a distortion of speech detrimental to good recognition.
    The simplifications are:

    • in suppressing spectral denoising for recognition, in order to avoid any distortion of the speech that would not be compensated by the gain in signal-to-noise ratio obtainable by denoising, and that would thus be detrimental to good recognition.
    • in the possible removal of the whitening filter (and therefore of the calculation of the autoregressive noise model, which also involves the removal of the noise confirmation module). This removal is not necessarily appropriate in low-noise environments; preliminary testing is preferable in order to decide.

    We will now detail the voicing detection and fricative detection expertise procedures.

    These expertise procedures use well-known signal processing and detection tools, which are so many elementary automata whose role is to decide, in a raw way, whether the processed frame is voiced or not, or is a frame of unvoiced fricative or unvoiced plosive, etc.

    The expertise consists of combining the different results obtained using these tools, so as to highlight coherent sets forming, for example, the vowel nucleus, or blocks of unvoiced fricative (or plosive) sounds.

    By nature, the implementation language of such procedures is preferably PROLOG.

    Unlike the process described above, this expertise is the same whether the environment is noisy or not.

    For the voicing detection expertise, we call on a known voicing detection process which, for a given frame, decides whether this frame is voiced or not by returning the value of the "pitch" associated with this frame. The "pitch" is the repetition frequency of the voicing pattern. This pitch value is zero if there is no voicing, and non-zero otherwise.

    This elementary voicing detection is done without using results from the previous frames, and without anticipating the results of future frames.

    As a vowel nucleus can consist of several voiced segments separated by unvoiced holes, an expertise is necessary in order to validate a voicing or not.

    We will now set out the general rules of expertise.

  • Rule 1: Between two voiced frames that are consecutive or separated by a relatively small number of frames (of the order of three or four frames), the pitch values obtained cannot differ by more than a certain delta (approximately ± 20 Hz, depending on the speaker). On the other hand, when the gap between two voiced frames exceeds a certain number of frames, the pitch value can change very quickly.
  • Rule 2: A vowel nucleus consists of voiced frames interspersed with holes. These holes must satisfy the following condition: the size of a hole must not exceed a maximum size, which can be a function of the speaker and especially of the vocabulary (about 40 frames). The size of the nucleus is the sum of the number of voiced frames and the size of the holes in this nucleus.
  • Rule 3: the effective start of the vowel nucleus is given as soon as the size of the nucleus is sufficiently large (approximately 4 frames).
  • Rule 4: the end of the vowel nucleus is determined by the last voiced frame followed by a hole exceeding the maximum size allowed for a hole in the vowel nucleus.
  • Expertise process

    The preceding rules are used as explained below, once a pitch value has been calculated.

    First part of the expertise:

    We validate the calculated pitch value or not, depending on the pitch value of the previous frame and on the last non-zero pitch value, according to the number of frames separating the currently processed frame from that of the last non-zero pitch. This corresponds to the application of rule 1.

    Second part of the expertise:

    This second part of the expertise is broken down according to different cases.

  • Case 1: First voiced frame:
    • We increment the possible size of the nucleus, which is therefore 1.
    • The possible beginning of the vowel nucleus is therefore the current frame.
    • The possible end of the vowel nucleus is therefore the current frame.
  • Case 2: The current frame is voiced, as is the previous one. We are therefore processing a voiced segment.
    • We increment the possible number of voiced frames of the nucleus.
    • We increment the possible size of the nucleus.
    • The possible end of the nucleus can be the current frame, which is also the possible end of the segment.
    If the size of the nucleus is large enough (about four frames, as specified above),
    And if the actual start of the vowel nucleus is not known,
    Then:
    • the start of the nucleus is the first frame detected as voiced. This corresponds to the implementation of rule 3.
  • Case 3: The current frame is not voiced, while the previous frame is.
    We are processing the first frame of a hole.
    • We increase the size of the hole, which goes to 1.
  • Case 4: The current frame is not voiced, and neither is the previous frame. We are processing a hole.
    • We increase the size of the hole.
      If the size of the hole exceeds the maximum size allowed for a vowel nucleus hole,
      Then:
    If the effective start of voicing is known,
    Then:
    the end of the vowel nucleus is the last voiced frame determined before this hole. We stop the expertise and reset all the data for the processing of the next speech (see rule 4).
    If the effective start of voicing is still not known,
    Then: the expertise is continued on the following frames after reinitialization of all the parameters used, since those which were updated previously are not valid.
    Otherwise, this hole may be part of the vowel nucleus and we cannot yet make a final decision.
  • Case 5: The current frame is voiced and the previous one is not.
    We have just finished a hole, and we are starting a new voiced segment.
    • The number of voiced frames of the nucleus is incremented.
    • We increase the size of the nucleus.
    If the hole that we have just finished can be part of the vowel nucleus (that is to say, if its size is less than the maximum size authorized for a nucleus hole, according to rule 2),
    Then:
    • The size of this hole is added to the current size of the nucleus.
    • The size of the hole is reset, for the processing of the next unvoiced frames.
    If the effective start of the voicing is not yet known,
    And if the size of the nucleus is now sufficient (rule 3),
    Then:
    • the beginning of the voicing is the beginning of the voiced segment preceding the hole that we have just finished.
    Otherwise, this hole cannot be part of the vowel nucleus:
    If the effective start of the voicing is known,
    Then:
    • the end of the vowel nucleus is the last voiced frame determined before this hole. We stop the expertise and reset all the data for the processing of the next speech (see rule 4).
    If the effective start of voicing is still not known,
    Then:
    • The expertise is continued on the following frames after reinitialization of all the parameters used, because those which were updated previously are not valid.

    This procedure is applied to each frame, after calculation of the pitch associated with this frame.
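
    By way of illustration, the following sketch (Python; function and parameter names are hypothetical) implements Cases 1 to 5 and Rules 2 to 4 above; the pitch-continuity check of Rule 1 is assumed to have been applied upstream when producing the per-frame pitch values:

    def vowel_nucleus(pitches, max_hole=40, min_core=4):
        # Scan per-frame pitch values (0 = unvoiced) and return the
        # (start, end) frame indices of the vowel nucleus, or None.
        # max_hole and min_core follow Rules 2 and 3 (about 40 and 4 frames).
        first = last = start = None   # possible start, last voiced, confirmed start
        size = hole = 0               # nucleus size (voiced + holes), current hole
        for i, p in enumerate(pitches):
            if p > 0:                                   # voiced frame
                if first is not None and hole > max_hole:
                    if start is not None:               # Rule 4: the nucleus has ended
                        return start, last
                    first = None                        # previous data invalid: reset
                if first is None:                       # Case 1: (re)start a nucleus
                    first, size, hole = i, 0, 0
                size += 1 + hole                        # voiced frame + absorbed hole
                hole, last = 0, i
                if start is None and size >= min_core:  # Rule 3: confirm the start
                    start = first
            elif first is not None:                     # Cases 3 and 4: inside a hole
                hole += 1
                if hole > max_hole and start is not None:
                    return start, last                  # Rule 4
        return (start, last) if start is not None else None

    For example, vowel_nucleus([100, 110, 0, 0, 105, 108]) confirms the nucleus as soon as its size reaches four frames and returns (0, 5).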

    Unvoiced speech detection expertise.

    We use here a process known per se for detecting unvoiced speech.

    This elementary detection is done without using results from the previous frames, and without anticipating the results of future frames.

    Unvoiced speech signals located at the start or end of speech can be made up:

    • of a single fricative segment, as in "chaff"
    • of a fricative segment followed by a plosive segment, as in "stop"
    • of a single plosive segment, as in "speech"

    There is therefore the possibility of holes in the set of unvoiced speech frames.

    In addition, such fricative blocks must not be too large. An expertise intervening after the detection of these sounds is therefore necessary.

    In the following, by abuse of language, the term fricative will refer to unvoiced fricatives as well as to unvoiced plosives.

    General rules of expertise.

    The expertise presented here is similar to that described above in the case of voicing. The differences are mainly due to the taking into account of two new parameters: the distance between the vowel nucleus and the fricative block, and the size of the fricative block.

  • Rule 1: the distance between the vowel nucleus and the first detected fricative frame must not be too large (about 15 frames maximum).
  • Rule 2: the size of a fricative block must not be too large. Equivalently, this means that the distance between the vowel nucleus and the last frame detected as fricative must not be too large (about 10 frames maximum).
  • Rule 3: the size of a hole in a fricative block must not exceed a maximum size (about 15 frames). The total size of the block is the sum of the number of frames detected as fricative and the size of the holes in this block.
  • Rule 4: the effective start of the fricative block is determined as soon as the size of a segment has become sufficient, and the distance between the vowel nucleus and the first frame of this processed fricative segment is not too large, in accordance with rule 1. The actual start of the fricative block corresponds to the first frame of this segment.
  • Rule 5: the end of the fricative block is determined by the last frame of the fricative block followed by a hole exceeding the maximum size authorized for a hole in a fricative block, and when the size of the fricative block thus determined is not too large, in accordance with rule 2.
  • Conduct of the expertise.

    This expertise is used to detect fricative blocks preceding or following the vowel nucleus. The reference point chosen in this expertise is therefore the vowel nucleus.

    In the case of the detection of a fricative block preceding the vowel nucleus, the processing starts from the first voiced frame and "goes back" in time. Thus, when we say that a frame i follows a frame j (previously processed), this is with respect to this first frame of the vowel nucleus; in reality, frame j is chronologically later than frame i. What we call the beginning of the fricative block in the expertise described below is in fact, chronologically, the end of this block, and what we call the end of the fricative block is actually its chronological beginning. The distance between the vowel nucleus and a frame detected as fricative is the distance between the first frame of the voiced block and this fricative frame.

    In the case of the detection of a fricative block located after the vowel nucleus, the processing is done after the last voiced frame, and therefore follows the natural chronological order, and the terms of the expertise are perfectly adequate.
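
    The scan order can be made explicit by a small sketch (Python; n_before and n_after are hypothetical parameter names standing for the numbers of whitened frames examined before and after the nucleus, the N3 and N4 values above):

    def frames_to_scan(first_voiced, last_voiced, n_before=30, n_after=30):
        # Order in which frames are examined by the fricative expertise:
        # before the nucleus the scan "goes back" in time from the first
        # voiced frame; after the nucleus it follows chronological order.
        before = range(first_voiced - 1, max(first_voiced - 1 - n_before, -1), -1)
        after = range(last_voiced + 1, last_voiced + 1 + n_after)
        return list(before), list(after)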

  • Case 1: As long as there is no fricative detection, we are in a hole which follows the vowel nucleus and precedes the fricative block.
    • The distance between the voiced segment and the fricative block is incremented. The distance thus calculated is a lower bound of the distance between the fricative block and the vowel nucleus. This distance will be frozen as soon as the first fricative frame is detected.
  • Case 2: First detection of fricative. We begin to process a fricative segment.
    • The size of the fricative block is initialized to 1.
    • The distance between the voiced block and the fricative block is frozen.
    If the distance between the vowel nucleus and the fricative block is not too large (in accordance with rule 2),
    Then:
    • The possible start of the fricative block can be the current frame.
    • The possible end of the fricative block can be the current frame.
    If the size of the fricative block is large enough, And if the actual start of the fricative block is not yet known,
    Then:
    • the start of the block can be confirmed. It will be noted that this If (in "If the size of the fricative block is large enough") is useless if the minimum size for a fricative block is greater than one frame; but when one seeks to detect plosives in a noisy environment, these can appear over the duration of a single frame. We must therefore take the minimum size of a fricative block equal to 1, and keep this condition.
      If the distance between the vowel nucleus and the fricative block is too large (see rule 2),
      there is no acceptable fricative block:
      • We reset everything for the processing of the next speech.
      • We exit the processing.
    As the test on the distance between the vowel nucleus and the fricative block is carried out upon the first detection of fricative, it will not be repeated in the following cases, especially since, if this distance is too large here, the procedure is stopped for this speech.
  • Case 3: The current frame and the previous frame are both fricative frames.
    We are processing a frame located right in the middle of an acceptable fricative segment (located at a correct distance from the vowel nucleus, in accordance with rule 1).
    • The possible end of the fricative block is the current frame.
    • We increase the size of the fricative block.
    If the size of the fricative block is large enough (see rule 4),
    And if the size of this block is not too large (see rule 2),
    And if the actual start of the fricative block is not yet known,
    Then:
    • the start of the block can be confirmed as the start of this fricative segment.
  • Case 4: The current frame is not a fricative frame, unlike the previous frame. We are processing the first frame of a hole located inside the fricative block.
    • We increment the total size of the hole (which becomes equal to 1).
  • Case 5: Neither the current frame nor the previous frame is a fricative frame. We are processing a frame located right inside a hole of the fricative block.
    • We increase the total size of the hole.
    If the current size of the fricative block, increased by the size of the hole, is greater than the maximum size authorized for a fricative block (rule 2),
    Or if the size of the hole is too large,
    Then:
    If the start of the fricative block is known,
    Then:
    • The end of the fricative block is the last frame detected as fricative.
    • We reset all the data in order to process the next speech.
    Otherwise:
    • we reset all the data, even those that were previously updated, because they are no longer valid. We then process the next frame.
    Otherwise, this hole may be part of the fricative block and we cannot yet make a final decision.
  • Case 6: The current frame is a fricative frame, unlike the previous frame.
    We are processing the first frame of a fricative segment located after a hole.
    • We increase the size of the fricative block.
    If the current size of the fricative block, increased by the size of the previously detected hole, is greater than the maximum size authorized for a fricative block,
    Or if the size of the hole is too large,
    Then:
    If the start of the fricative block is known,
    Then:
    • The end of the fricative block is then the last frame detected as fricative.
    • We reset all the data in order to process the next speech.
    Otherwise:
    • We reset all the data, even those that were previously updated, because they are not valid. We then process the next frame.
    Otherwise (the hole is part of the fricative segment):
    • The size of the fricative block is increased by the size of the hole.
    • The size of the hole is reset to 0.
      If the size of the fricative block is large enough,
      And if this size is not too large,
      And if the actual start of the fricative block is not known,
      Then:
    • The start of the block can be confirmed.
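
    A compact sketch of this fricative-block expertise is given below (Python; frame indices are in scan order, i.e. reversed time for a block preceding the vowel nucleus, as explained above, and the default sizes are the indicative values of Rules 1 to 3):

    def fricative_block(is_fric, max_dist=15, max_block=10, max_hole=15,
                        min_block=1):
        # is_fric: per-frame booleans from the elementary fricative detector,
        # ordered moving away from the vowel nucleus. Returns the (start, end)
        # scan-order indices of the fricative block, or None.
        dist = 0                      # hole between the nucleus and the block
        start = end = None            # confirmed start / last fricative frame
        size = hole = 0               # block size (fricatives + holes), current hole
        for i, f in enumerate(is_fric):
            if size == 0:                             # Case 1: before the block
                if not f:
                    dist += 1
                    if dist > max_dist:               # Rule 1: block too far
                        return None
                    continue
                size, end = 1, i                      # Case 2: first fricative frame
                if min_block <= 1:
                    start = i                         # Rule 4 with minimum size 1
            elif f:                                   # Cases 3 and 6
                if hole:                              # a hole has just ended
                    if size + hole > max_block or hole > max_hole:
                        return (start, end) if start is not None else None  # Rule 5
                    size += hole                      # absorb the hole into the block
                    hole = 0
                size += 1
                end = i
                if start is None and min_block <= size <= max_block:
                    start = i - size + 1              # Rule 4: confirm the start
            else:                                     # Cases 4 and 5: inside a hole
                hole += 1
                if size + hole > max_block or hole > max_hole:
                    return (start, end) if start is not None else None      # Rule 5
        return (start, end) if start is not None else None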
  • Simplification in the case of a slightly noisy environment.

    In the case where the user considers that the environment is insufficiently noisy to require the sophisticated processing described above, it is possible not only to simplify the expertise presented above, but even to eliminate it. In this case, speech detection is reduced to a simple detection of the vowel nucleus, to which a confidence interval expressed in number of frames is attached, which is sufficient to improve the performance of a voice recognition algorithm. It is thus possible to start recognition around ten or even fifteen frames before the start of the vowel nucleus, and to end it around ten or even fifteen frames after the vowel nucleus.

    Signal Processing Algorithms.

    The calculation procedures and methods described below are the constituents used by the expertise and the management. Such functions are advantageously implemented on a signal processor, and the language used is preferably assembler.

    For voicing detection in low-noise environments, an interesting solution is the thresholding of the A.M.D.F. (Average Magnitude Difference Function), a description of which can be found, for example, in the book "Speech Processing" by R. BOITE and M. KUNT, published by Presses Polytechniques Romandes.

    The AMDF is the function D(k) = Σn |x(n+k) - x(n)|. This function is bounded in terms of the correlation function, according to:
    D(k) ≤ 2(Γx(0) - Γx(k))^(1/2). This function therefore has downward "peaks", and must therefore be thresholded, like the correlation function.

    Other methods, based on the calculation of the signal spectrum, are possible, with equally acceptable results (see the book cited above). However, it is interesting to use the AMDF function, for simple reasons of computational cost.
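
    A minimal AMDF sketch (Python/NumPy, an illustration only; note that this naive form computes D(k) on N - k points, which is precisely the inhomogeneity corrected below for the correlation):

    import numpy as np

    def amdf(x, k_min=30, k_max=100):
        # D(k) = sum_n |x(n+k) - x(n)|; voicing shows up as downward "peaks".
        # k_min..k_max bound the pitch search (30..100 lags = 333..100 Hz
        # at a 10 kHz sampling frequency).
        x = np.asarray(x, dtype=float)
        return np.array([np.abs(x[k:] - x[:-k]).sum()
                         for k in range(k_min, k_max + 1)])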

    In noisy environments, the AMDF function is a distance between the signal and its delayed version. However, this distance does not admit an associated dot product, and therefore does not allow the concept of orthogonal projection to be introduced. Yet, in a noisy environment, the orthogonal projection of the noise can be zero if the projection axis is well chosen. The AMDF is therefore not an adequate solution in a noisy environment.

    The method of the invention is then based on the correlation, because the correlation is a scalar product and performs an orthogonal projection of the signal onto its delayed version. This method is therefore more robust to noise than other techniques such as the AMDF. Indeed, suppose that the observed signal is x(n) = s(n) + b(n), where b(n) is a white noise independent of the useful signal s(n). The correlation function is by definition Γx(k) = E[x(n) x(n-k)], therefore Γx(k) = E[s(n) s(n-k)] + E[b(n) b(n-k)] = Γs(k) + Γb(k). Since the noise is white: Γx(0) = Γs(0) + Γb(0), and Γx(k) = Γs(k) for k ≠ 0.

    In practice, noise whiteness is not a valid hypothesis. However, the result remains a good approximation as soon as the noise correlation function decreases quickly and k is large enough, as in the case of pink noise (white noise filtered by a bandpass filter), whose correlation is a cardinal sine, therefore practically zero as soon as k is large enough.

    We will now describe a procedure for pitch calculation and pitch detection applicable to noisy environments as well as to low-noise environments.

    Let x(n) be the processed signal, where n ∈ {0, ..., N-1}.

    In the case of the AMDF, r(k) = D(k) = Σn |x(n+k) - x(n)|.

    In the case of the correlation, the mathematical expectation giving access to the correlation function can only be estimated, so that the function r(k) used is: r(k) = K Σ0≤n≤N-1 x(n) x(n-k), where K is a calibration constant.

    In both cases, the value of the pitch is theoretically obtained by proceeding as follows: r(k) is maximum at k = 0. If the second maximum of r(k) is obtained at k = k0, then the voicing frequency is F0 = Fe / k0, where Fe is the sampling frequency.

    However, this theoretical description must be revised in practice.

    Indeed, if the signal is known only on samples 0 to N-1, then x(n-k) is taken as zero as long as n is not greater than k. There will therefore not be the same number of calculation points from one value of k to another. For example, if the pitch range is taken equal to [100 Hz, 333 Hz], for a sampling frequency of 10 kHz, the index k1 corresponding to 100 Hz is equal to:
    k1 = Fe / F0 = 10000/100 = 100, and the one corresponding to 333 Hz is k2 = Fe / F0 = 10000/333 = 30.

    The pitch calculation for this range will therefore be from k = 30 to k = 100.

    If, for example, 256 samples are available (2 frames of 12.8 ms sampled at 10 kHz), the calculation of r(30) is done from n = 30 to n = 128, i.e. on 99 points, and that of r(100) from n = 100 to n = 128, i.e. on 29 points.

    The calculations are therefore not homogeneous with one another and do not have the same validity.

    For the calculation to be correct, the observation window must always be the same whatever k. So, if n - k is less than 0, the past values of the signal x(n) are used, so as to compute the function r(k) on the same number of points whatever k. The value of the constant K then no longer matters.

    This is only detrimental to the pitch calculation on the first actually voiced frame, since, in this case, the samples used for the calculation come from an unvoiced frame and are therefore not representative of the signal to be processed. However, from the third consecutive voiced frame, when working, for example, on frames of 128 points sampled at 10 kHz, the pitch calculation will be valid. This assumes, in general, that a voicing lasts at least 3 × 12.8 ms, which is a realistic assumption. This assumption must be taken into account during the expertise, and the minimum duration to validate a voiced segment will be 3 × 12.8 ms in this same expertise.

    This function r(k) being calculated, it then remains to threshold it. The threshold is chosen experimentally, according to the dynamics of the processed signals. Thus, in an example of application where the quantization is done on 16 bits, where the dynamics of the samples does not exceed ± 10000, and where the calculations are done for N = 128 (sampling frequency of 10 kHz), we have chosen Threshold = 750000. But remember that these values are only given as examples for particular applications, and must be modified for other applications. In any case, this does not change the methodology described above.
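
    The following sketch summarizes this pitch computation by thresholded correlation on a constant observation window (Python/NumPy; the default values correspond to the example above: 10 kHz sampling, [100 Hz, 333 Hz] pitch range, threshold 750000; history must contain at least k1 past samples):

    import numpy as np

    def pitch_by_correlation(x, history, fs=10_000, f_lo=100, f_hi=333,
                             threshold=750_000.0):
        # r(k) = sum_{0<=n<=N-1} x(n) x(n-k), with past samples supplying
        # x(n-k) for n < k so that every r(k) is computed on the same N
        # points whatever k (the calibration constant K is dropped).
        x = np.asarray(x, dtype=float)
        k1, k2 = fs // f_lo, fs // f_hi      # lags 100 and 30 at 10 kHz
        past = np.asarray(history, dtype=float)[-k1:]
        ext = np.concatenate([past, x])      # x(-k1), ..., x(N-1)
        r = np.array([np.dot(x, ext[k1 - k:k1 - k + len(x)])
                      for k in range(k2, k1 + 1)])
        k_best = k2 + int(np.argmax(r))
        return fs / k_best if r.max() >= threshold else 0.0  # 0 = unvoiced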
    We will now describe the method of detecting noise frames.

    Apart from the vowel nucleus, the signal frames that can be encountered are of three types:

  • 1) noise alone
  • 2) noise + voiceless fricative
  • 3) noise + breathing.
  • The detection algorithm aims to detect the start and end of speech from a whitened version of the signal, while the denoising algorithm requires knowledge of the average noise spectrum. To build the noise models which will make it possible to whiten the speech signal for the detection of unvoiced sounds as described below, and to denoise the speech signal, it is obviously necessary to detect the noise frames and to confirm them as such. This search for noise frames is done among a number of frames N1 defined by the user once and for all for his application (for example N1 = 40), these N1 frames being located before the vowel nucleus.

    Recall that this algorithm serves to build the noise models, and is therefore not used when the user judges the noise level insufficient.

    We will first define the "positive" Gaussian random variables:
    A random variable X will be said to be positive when Pr{X < 0} << 1.
    Let X0 be the normalized centered variable associated with X. We have:
    Pr{X < 0} = Pr{X0 < -m/σ}, where m = E[X] and σ² = E[(X - m)²].

    As soon as m / σ is large enough, X can be considered as positive.

    When X is Gaussian, we denote by F(x) the normal distribution function, and we have:
    Pr{X < 0} = F(-m/σ) for X ∈ N(m, σ²).

    An immediate essential property is that the sum X of N independent positive Gaussian variables Xi ∈ N(mi; σi²) remains a positive Gaussian variable: X = Σ1≤i≤N Xi ∈ N(Σ1≤i≤N mi; Σ1≤i≤N σi²).

    Fundamental result:

    If X = X1/X2, where X1 and X2 are independent Gaussian random variables such that X1 ∈ N(m1; σ1²) and X2 ∈ N(m2; σ2²), we set m = m1/m2, α1 = m1/σ1, α2 = m2/σ2.

    When α 1 and α 2 are large enough to assume X 1 and X 2 positive, the probability density f X (x) of X = X 1 / X 2 can then be approximated by:

    [Approximate probability density f(x, m | α1, α2) of the ratio X = X1/X2; see the expressions below.]
    where U(x) is the indicator function of R+:
    U(x) = 1 if x ≥ 0 and U(x) = 0 if x < 0.
    In the following, we will denote this density by f(x, m | α1, α2), so that: fX(x) = f(x, m | α1, α2) U(x).
    Let h(x, y | α, β) = αβ(x - y) / (α²x² + β²y²)^(1/2), and set P(x, y | α, β) = F[h(x, y | α, β)]. We then have:
    Pr{X < x} = P(x, m | α1, α2), with f(x, y | α, β) = ∂P(x, y | α, β)/∂x and, in particular, f(x, m | α1, α2) = ∂P(x, m | α1, α2)/∂x.
    Special case α = β: we will write fα(x, y) = f(x, y | α, α), hα(x, y) = h(x, y | α, α) and Pα(x, y) = P(x, y | α, α).
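
    These expressions can be checked numerically (Python with NumPy and SciPy; the numerical values below are illustrative assumptions only). Note that h(x, m | α1, α2) reduces to (m2 x - m1) / (σ2² x² + σ1²)^(1/2), a classical approximation for the ratio of two "positive" Gaussian variables:

    import numpy as np
    from scipy.stats import norm

    def P(x, y, a, b):
        # P(x, y | α, β) = F[h(x, y | α, β)], F being the normal CDF.
        return norm.cdf(a * b * (x - y) / np.sqrt(a**2 * x**2 + b**2 * y**2))

    rng = np.random.default_rng(0)
    m1, s1, m2, s2 = 10.0, 1.0, 5.0, 0.5      # alpha1 = alpha2 = 10: "positive"
    x1 = rng.normal(m1, s1, 100_000)
    x2 = rng.normal(m2, s2, 100_000)
    x = 2.2
    print(np.mean(x1 / x2 < x))               # empirical Pr{X < x}
    print(P(x, m1 / m2, m1 / s1, m2 / s2))    # approximation P(x, m | α1, α2)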

    We will describe below some basic models of "positive" Gaussian variables usable in the following.

  • (1) Signal with deterministic energy: consider the samples x(0), ..., x(N-1) of any signal whose energy is deterministic and constant, or can be approximated by a deterministic and constant energy.
    We then have U = Σ0≤n≤N-1 x(n)² ∈ N(Nµ, 0), where µ = (1/N) Σ0≤n≤N-1 x(n)². Let us take as an example the signal x(n) = A cos(n + φ), where φ is equidistributed over [0, 2π]. For N sufficiently large, we have:
    (1/N) Σ0≤n≤N-1 x(n)² ≈ E[x(n)²] = A²/2. For N large enough, U can thus be likened to NA²/2, and therefore to a constant energy.
  • (2) Gaussian white process: let x(n) be a white Gaussian process such that σx² = E[x(n)²]. For N sufficiently large, U = Σ0≤n≤N-1 x(n)² ∈ N(Nσx²; 2Nσx⁴). The parameter α is α = (N/2)^(1/2).
  • (3) Narrow-band Gaussian process: the noise x(n) comes from the sampling of a process x(t), itself resulting from the filtering of a white Gaussian noise b(t) by a bandpass filter h(t): x(t) = (h * b)(t), assuming that the transfer function of the filter h(t) is: H(f) = U[-f0-B/2, -f0+B/2](f) + U[f0-B/2, f0+B/2](f),
    where U denotes the characteristic function of the interval given as index, and f0 the central frequency of the filter.
    We then have U ∈ N(Nσx², 2σx⁴ Σ0≤i,j≤N-1 g(i-j)²), with g(k) = gf0,B,Te(k) = cos(2πk f0 Te) sinc(πk B Te).
    The parameter α is α = N / [2 Σ0≤i,j≤N-1 g(i-j)²]^(1/2).
  • (4) Subsampling of a Gaussian process: this model is more practical than theoretical. If the correlation function is unknown, we know however that lim k→+∞ Γx(k) = 0. Therefore, for k large enough, k > k0, the correlation function tends to 0. Hence, instead of processing the sequence of samples x(0), ..., x(N-1), one can process the subsequence x(0), x(k0), x(2k0), ..., and the energy associated with this subsequence remains a positive Gaussian random variable, provided that enough points remain in this subsequence to apply the approximations due to the central limit theorem.
    Compatibility between energies.
    Let C1 = N(m1, σ1²) and C2 = N(m2, σ2²).
    We set: m = m1/m2, α1 = m1/σ1 and α2 = m2/σ2.
    α1 and α2 are assumed large enough for the random variables of C1 and C2 to be considered as positive random variables.
    Let (U, V) be a couple belonging to (C1 ∪ C2) × (C1 ∪ C2).
    As before, U and V are assumed to be independent.
    We set: U ≡ V ⇔ (U, V) ∈ (C1 × C1) ∪ (C2 × C2).
    Let (u, v) be a value of the couple (U, V). If x = u/v, then x is a value of the random variable X = U/V.
    Let s > 1.
    1/s < x < s ⇒ we decide that U ≡ V is true, which will be the decision D = D1.
    x < 1/s or x > s ⇒ we decide that U ≡ V is false, which will be the decision D = D2. This decision rule is therefore associated with two hypotheses:
    H1 ⇔ U ≡ V is true, H2 ⇔ U ≡ V is false.
    We will set I = [1/s, s].
    The detection rule is then also expressed as: x ∈ I ⇒ D = D1,
    x ∈ R - I ⇒ D = D2.
  • We will say that u and v are compatible when the decision D = D 1 is taken.

    This decision rule admits a correct decision probability, the expression of which will in fact depend on the value of the probabilities Pr {H 1 } and Pr {H 2 }.

    However, these probabilities are generally not known in practice.

    We therefore prefer a Neyman-Pearson type approach, since the decision rule reduces to two hypotheses, seeking to ensure a certain value, fixed a priori, of the probability of false alarm, which is: Pfa = Pr{D1 | H2} = P(s, m | α1, α2) - P(1/s, m | α1, α2).

    The choice of the signal and noise models determines α1 and α2. We will see that m then appears to be homogeneous to a signal-to-noise ratio, which will be fixed heuristically. The threshold s is then fixed so as to ensure a certain value of Pfa.
    Special case α1 = α2 = α: it then comes Pfa = Pα(s, m) - Pα(1/s, m).
    Compatibility of a set of values:
    Let {u1, ..., un} be a set of values of positive Gaussian random variables. We will say that these values are compatible with each other if and only if the ui are pairwise compatible.
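
    A sketch of this compatibility test is given below (Python/SciPy; threshold_for is a hypothetical helper that inverts the Pfa relation by a simple scan, the text only requiring that s be fixed so as to ensure the chosen Pfa):

    import numpy as np
    from scipy.stats import norm

    def P(x, y, a, b):
        # P(x, y | α, β) = F[αβ(x - y) / (α²x² + β²y²)^(1/2)] (see above).
        return norm.cdf(a * b * (x - y) / np.sqrt(a**2 * x**2 + b**2 * y**2))

    def pfa(s, m, a1, a2):
        # Pfa = P(s, m | α1, α2) - P(1/s, m | α1, α2); increases with s.
        return P(s, m, a1, a2) - P(1.0 / s, m, a1, a2)

    def threshold_for(pfa_target, m, a1, a2):
        # Largest s still ensuring Pfa <= pfa_target (coarse grid scan).
        grid = np.arange(1.0, 50.0, 1e-3)
        ok = grid[pfa(grid, m, a1, a2) <= pfa_target]
        return float(ok[-1]) if ok.size else None

    def compatible(u, v, s):
        # Decision D1 (u ≡ v accepted) iff 1/s < u/v < s; a set of values
        # is compatible iff its elements are pairwise compatible.
        return 1.0 / s < u / v < s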
    Signal and noise models used by the method of the invention.

    In order to apply the procedures corresponding to the preceding theoretical reminders, it is necessary to fix a noise and signal model. We will use the following example, governed by the following hypotheses:

  • Hypothesis 1: We assume that the useful signal is not known in its form, but we make the following hypothesis: for the values s(0), ..., s(N-1) of s(n), the mean energy (1/N) Σ0≤n≤N-1 s(n)² is bounded below by µs² as soon as N is sufficiently large, so that: S = Σ0≤n≤N-1 s(n)² ≥ Nµs².
  • Hypothesis 2: The useful signal is disturbed by an additive noise denoted x(n), assumed to be Gaussian and narrow-band. We assume that the processed process x(n) is obtained by narrow-band filtering of a white Gaussian noise. The correlation function of such a process is then: Γx(k) = Γx(0) cos(2πk f0 Te) sinc(πk B Te). If we consider N samples x(n) of this noise, and we set
    g(k) = gf0,B,Te(k) = cos(2πk f0 Te) sinc(πk B Te), we have: V = Σ0≤n≤N-1 x(n)² ∈ N(Nσx², 2σx⁴ Σ0≤i,j≤N-1 g(i-j)²). The parameter α of this variable is α = N / [2 Σ0≤i,j≤N-1 g(i-j)²]^(1/2).
  • Hypothesis 3: The signals s(n) and x(n) are assumed to be independent. We suppose that the independence between s(n) and x(n) implies decorrelation in the temporal sense of the term, that is to say that we can write: c = Σ0≤n≤N-1 s(n) x(n) / [(Σ0≤n≤N-1 s(n)²)^(1/2) (Σ0≤n≤N-1 x(n)²)^(1/2)] = 0. This correlation coefficient is only the expression in the time domain of the ensemble correlation coefficient defined by
    E[s(n) x(n)] / (E[s(n)²] E[x(n)²])^(1/2) when the processes are ergodic.
    Let u(n) = s(n) + x(n) be the total signal, and U = Σ0≤n≤N-1 u(n)².
    We can then approximate U by: U = Σ0≤n≤N-1 s(n)² + Σ0≤n≤N-1 x(n)².
    As we have Σ0≤n≤N-1 s(n)² ≥ Nµs²,
    we will have: U ≥ Nµs² + Σ0≤n≤N-1 x(n)².
  • Hypothesis 4: As we assume that the signal has a mean energy bounded below, we will assume that an algorithm capable of detecting an energy µs² will be able to detect any signal of higher energy. Taking into account the preceding hypotheses, class C1 is defined as the class of energies when the useful signal is present. According to hypothesis 3, U ≥ Nµs² + Σ0≤n≤N-1 x(n)², and according to hypothesis 4, if the energy Nµs² + Σ0≤n≤N-1 x(n)² is detected, the total energy U will also be detected.
    According to hypothesis 2,
    Nµs² + Σ0≤n≤N-1 x(n)² ∈ N(Nµs² + Nσx²,
    2σx⁴ Σ0≤i,j≤N-1 g(i-j)²).
    So C1 = N(Nµs² + Nσx², 2σx⁴ Σ0≤i,j≤N-1 g(i-j)²),
    and the parameter α of this variable is
    α1 = N(1 + r) / [2 Σ0≤i,j≤N-1 g(i-j)²]^(1/2),
    where r = µs² / σx² represents the signal-to-noise ratio.
    C2 is the class of energies corresponding to noise alone. According to hypothesis 2, if the noise samples are x(0), ..., x(M-1),
    it comes V = Σ0≤n≤M-1 x(n)² ∈ N(Mσx²,
    2σx⁴ Σ0≤i,j≤M-1 g(i-j)²).
    The parameter α of this variable is:
    α2 = M / [2 Σ0≤i,j≤M-1 g(i-j)²]^(1/2).
    We therefore have: C1 = N(m1, σ1²) and C2 = N(m2, σ2²),
    with: m1 = Nµs² + Nσx², m2 = Mσx²,
    σ1 = σx² [2 Σ0≤i,j≤N-1 g(i-j)²]^(1/2) and
    σ2 = σx² [2 Σ0≤i,j≤M-1 g(i-j)²]^(1/2).
    Hence m = m1/m2 = (N/M)(1 + r),
    α1 = m1/σ1 = N(1 + r) / [2 Σ0≤i,j≤N-1 g(i-j)²]^(1/2) and
    α2 = m2/σ2 = M / [2 Σ0≤i,j≤M-1 g(i-j)²]^(1/2).
  • Note that:

    • if the original noise is white and Gaussian, the preceding hypotheses remain valid. It suffices to note that then g(k) = δ0(k). The preceding formulas are simplified:
      C1 = N(m1, σ1²) and C2 = N(m2, σ2²),
      with: m1 = Nµs² + Nσx², m2 = Mσx², σ1² = 2Nσx⁴ and σ2² = 2Mσx⁴.
      Hence m = m1/m2 = (N/M)(1 + r),
      α1 = m1/σ1 = (1 + r)(N/2)^(1/2) and
      α2 = m2/σ2 = (M/2)^(1/2).
      It is possible to tend towards such a model by subsampling the noise, keeping only one sample out of k0, where k0 is such that: k > k0 ⇒ Γx(k) ≈ 0.
    • the notion of compatibility between energies is set up only conditionally on an a priori knowledge of the parameter m, and therefore of the signal-to-noise ratio r. This ratio can be fixed heuristically from preliminary measurements of the signal-to-noise ratios presented by the signals that we do not want the noise confirmation algorithm to detect, or fixed peremptorily. The second solution is preferably used. Indeed, the object of this processing is to highlight, not all the noise frames, but only a few having a high probability of consisting only of noise. We therefore have every interest in making the algorithm very selective. This selectivity is obtained by playing on the value of the false-alarm probability that we decide to ensure, which will therefore be chosen very low (maximum selectivity being reached for Pfa = 0, which leads to a zero threshold and no noise detection at all, the extreme and absurd case). But this selectivity is also obtained by the choice of r: chosen too large, there is a risk of considering as representative of the noise energies which are, for example, breathing energies with a signal-to-noise ratio lower than r. Conversely, choosing an r that is too small can limit the accessible Pfa, which would then be too high to be acceptable.

    Taking into account the preceding models, and the threshold having been computed, we then apply the following noise detection and confirmation algorithm, mainly based on the notion of compatibility described above.

    The search for and confirmation of the noise frames is done among a number of frames N1 defined by the user once and for all for his application (for example N1 = 40), these frames being located before the vowel nucleus. The following hypothesis is made: the energy of the frames of noise alone is on average lower than that of the noise + breathing and noise + signal frames. The frame having the minimum energy among the N1 frames is therefore assumed to consist only of noise. We then look for all the frames compatible with this frame, in the sense recalled above, using the aforementioned models.

    The noise detection algorithm will search, among a set of frames T 1 , ..., T n , for those which can be considered as noise.

    Let E(T1), ..., E(Tn) be the energies of these frames, calculated in the form: E(Ti) = Σ0≤n≤N-1 u(n)², where the u(n) are the N samples constituting the frame Ti.

    We make the following assumption: the frame with the lowest energy is a noise frame. Let T i0 be this frame.

    The algorithm works as follows:
    The set of noise frames is initialized: Noise = {Ti0}.
    For each frame Ti of {T1, ..., Tn} - {Ti0},
    Do:
    If E(Ti) is compatible with the energy of each element of Noise:
    Noise = Noise ∪ {Ti}
    End for
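
    A direct transcription of this algorithm (Python/NumPy sketch; the compatibility test is the interval test of the preceding paragraphs, s being the threshold fixed from the desired Pfa):

    import numpy as np

    def detect_noise_frames(frames, s):
        # frames: array of shape (n, N); s: compatibility threshold.
        energies = (np.asarray(frames, dtype=float) ** 2).sum(axis=1)
        i0 = int(np.argmin(energies))        # minimum-energy frame: assumed noise
        noise = [i0]
        for i in range(len(energies)):
            if i != i0 and all(1.0 / s < energies[i] / energies[j] < s
                               for j in noise):
                noise.append(i)              # compatible with every confirmed frame
        return noise                         # indices of the confirmed noise frames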
    Autoregressive model of noise.

    Since the noise confirmation algorithm provides a certain number of frames which can be considered as noise with a very high probability, we seek to build, from these time samples, an autoregressive model of the noise.

    If x(n) designates the noise samples, we model x(n) in the form: x(n) = Σ1≤i≤p ai x(n-i) + b(n), where p is the order of the model, the ai are the coefficients of the model to be determined, and b(n) is the modeling noise, assumed to be white and Gaussian if we follow a maximum likelihood approach.

    This type of modeling is widely described in the literature, notably in "Spectrum Analysis - A Modern Perspective" by S.M. KAY and S.L. MARPLE Jr., published in Proceedings of the IEEE, Vol. 69, No. 11, November 1981.

    As for the algorithms for calculating the model, many methods are available (Burg, Levinson-Durbin, Kalman, Fast Kalman, etc.).

    The Kalman and Fast Kalman type methods are preferably used; see the articles "Transverse Adaptive Filtering" by O. MACCHI and M. BELLANGER, published in the journal Signal Processing, Vol. 5, No. 3, 1988, and "Signal analysis and adaptive digital filtering" by M. BELLANGER, published in the CNET-ENST Collection, MASSON, which have very good real-time performance. But this choice is not the only one possible. The order of the filter is, for example, chosen equal to 12, without this value being limiting.
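
    By way of illustration, here is a sketch of the autoregressive fit by the Levinson-Durbin recursion, one of the cited options (Python/NumPy; the Kalman and Fast Kalman algorithms preferred by the text for real time are not shown; order p = 12 as in the example):

    import numpy as np

    def ar_noise_model(x, p=12):
        # Fit x(n) = sum_{1<=i<=p} a_i x(n-i) + b(n) on the biased
        # autocorrelation estimate; returns (a_1..a_p, prediction-error power).
        x = np.asarray(x, dtype=float)
        N = len(x)
        r = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(p + 1)])
        a = np.zeros(p + 1)
        a[0], err = 1.0, r[0]
        for k in range(1, p + 1):
            lam = -np.dot(a[:k], r[k:0:-1]) / err   # reflection coefficient
            a[1:k + 1] += lam * a[k - 1::-1]        # Levinson-Durbin update
            err *= 1.0 - lam ** 2
            if err <= 0:                            # degenerate: perfectly predictable
                break
        return -a[1:], err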

    Rejector filtering

    Let u(n) = s(n) + x(n) be the total signal, composed of the speech signal s(n) and the noise x(n).
    Let the filter be H(z) = 1 - Σ1≤i≤p ai z^(-i).
    Applied to the signal U(z), it yields H(z) U(z) = H(z) S(z) + H(z) X(z).
    Now H(z) X(z) = B(z), hence H(z) U(z) = H(z) S(z) + B(z).

    The rejection filter H(z) whitens the signal, so that the signal at the output of this filter is a speech signal (filtered, therefore distorted), plus an approximately white Gaussian noise.

    The signal obtained is in fact unsuitable for recognition, because the rejector filter distorts the original speech signal.

    However, since the signal obtained is disturbed by practically white, Gaussian noise, this signal is very useful for detecting the signal s(n) according to the theory set out below; either the broadband signal obtained is kept, or it is first filtered in the fricative band, as described below (cf. "fricative detection").

    It is for this reason that this rejector filtering is used after the autoregressive modeling of the noise.
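
    A minimal sketch of this rejector (whitening) filtering, assuming the AR coefficients a_i have been estimated as above; the function name is illustrative:

        import numpy as np
        from scipy.signal import lfilter

        def rejector_filter(u, ar_coeffs):
            # H(z) = 1 - sum_{1<=i<=p} a_i z^{-i}: an FIR filter whose output is
            # the (distorted) speech plus approximately white Gaussian residual noise
            b = np.concatenate(([1.0], -np.asarray(ar_coeffs, dtype=float)))
            return lfilter(b, [1.0], u)

    Applied to each frame preceding the voicing, this yields the whitened frames in which the actual start of speech is then sought.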
    Average noise spectrum.

    As we have a certain number of frames confirmed as being noise frames, we can then calculate an average spectrum of this noise, so as to implement a spectral filtering of the spectral subtraction or Wiener type.

    We choose, for example, Wiener filtering. We therefore need to calculate C_XX(f) = E[|X(f)|²], which represents the average noise spectrum. As the calculations are digital, we only have access to FFTs of digital signals weighted by a weighting window. In addition, the ensemble average can only be approximated.

    Let X_1(n), ..., X_M(n) be the FFTs of the M noise frames confirmed as such, these FFTs being obtained by weighting the initial time signal by an adequate apodization window.
    C_XX(f) = E[|X(f)|²] is approximated by: Ĉ_XX(n) = M_XX(n) = (1/M) Σ_{1≤i≤M} |X_i(n)|²
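
    A sketch of this estimator, under the assumption that the M confirmed noise frames all have the same length and that `window` is the chosen apodization window of that length:

        import numpy as np

        def average_noise_spectrum(noise_frames, window):
            # M_XX(n) = (1/M) * sum_{1<=i<=M} |X_i(n)|^2, with X_i the FFT of
            # the i-th confirmed noise frame weighted by the apodization window
            spectra = [np.abs(np.fft.fft(window * np.asarray(f, dtype=float))) ** 2
                       for f in noise_frames]
            return np.mean(spectra, axis=0)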

    The performance of this estimator is given, for example, in the book "Digital Signal Processing" by L. Rabiner and C.M. Rader, published by IEEE Press.

    As for the Wiener filter, we recall below some classic results, explained in particular in the book "Speech Enhancement" by J.S. Lim, published in the Prentice-Hall Signal Processing Series.

    Let u(t) = s(t) + x(t) be the total signal observed, where s(t) denotes the useful signal (speech) and x(t) the noise.
    In the frequency domain, we obtain: U(f) = S(f) + X(f), with obvious notations.

    We then look for the filter H(f) such that the signal Ŝ(f) = H(f)U(f) is the closest to S(f) in the sense of the L² norm. We therefore seek H(f) minimizing: E[|S(f) − Ŝ(f)|²].
    It can then be shown that: H(f) = 1 − C_XX(f)/C_UU(f), where
    C_XX(f) = E[|X(f)|²] and C_UU(f) = E[|U(f)|²].

    This type of filter, because its expression is directly in the frequency domain, is particularly interesting to apply as soon as the parameterization is based on the calculation of the spectrum.
    Implementation by smoothed correlogram.

    In practice, C_XX and C_UU are not accessible; we can only estimate them. A procedure for estimating C_XX(f) has been described above.

    C_UU is the average spectrum of the total signal u(n), available only over a single frame. In addition, this frame must be parameterized so that it can take part in the recognition process. There is therefore no question of carrying out any averaging of the signal u(n), all the more so since the speech signal is particularly non-stationary.

    It is therefore necessary to construct, from the data u(n), an estimate of C_UU(n). For this we use the smoothed correlogram.

    We then estimate C_UU(n) by: Ĉ_UU(k) = Σ_{0≤n≤N−1} F(k−n) |X(n)|²
    where F is a smoothing window constructed as follows, and N is the number of points used for the FFT calculation: N = 256 points, for example.
    We choose a smoothing window in the time domain:
    f(n) = a_0 + a_1 cos(2πn/N) + a_2 cos(4πn/N). These windows are widely described in the aforementioned article "On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform" by F.J. Harris, published in Proceedings of the IEEE, Vol. 66, No. 1, January 1978.
    The function F(k) is then simply the discrete Fourier transform of f(n).
    Ĉ_UU(k) = Σ_{0≤n≤N−1} F(k−n) |X(n)|² appears as a discrete convolution between F(k) and V(k) = |X(k)|², so that Ĉ_UU = F * V.
    Let ĉ_UU be the inverse FFT of Ĉ_UU: ĉ_UU(k) = f(k) v(k), where v(k) is the inverse FFT of V(k).
    We therefore calculate Ĉ_UU(k) according to the so-called smoothed correlogram algorithm:

  • (1) Calculation of v(k) by inverse FFT of V(n) = |X(n)|²
  • (2) Calculation of the product f·v
  • (3) Direct FFT of the product f·v, which yields Ĉ_UU
    Rather than applying the same estimator to the noise and to the total signal, the method of the invention applies the preceding smoothed correlogram algorithm to the average noise spectrum M_XX(n).
    Ĉ_XX(k) is therefore obtained by: Ĉ_XX(k) = Σ_{0≤n≤N−1} F(k−n) M_XX(n). The Wiener filter H(f) is then estimated by the following values:
    Ĥ(n) = 1 − Ĉ_XX(n)/Ĉ_UU(n)
    The denoised signal has the spectrum: Ŝ(n) = Ĥ(n) U(n)
    An inverse FFT can, if required, recover the denoised time signal.

    The denoised spectrum Ŝ(n) thus obtained is the spectrum used for the parameterization of the frame for recognition.
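
    Putting the pieces together, here is a sketch of the smoothed correlogram and of the Wiener estimate Ĥ(n). The clipping of Ĥ to [0, 1] is a practical safeguard added in this example, not a step stated in the description:

        import numpy as np

        def smoothing_window(N, a0, a1, a2):
            # f(n) = a0 + a1 cos(2*pi*n/N) + a2 cos(4*pi*n/N)
            n = np.arange(N)
            return a0 + a1 * np.cos(2 * np.pi * n / N) + a2 * np.cos(4 * np.pi * n / N)

        def smoothed_correlogram(power_spectrum, f):
            # (1) v(k) = inverse FFT of the power spectrum, (2) product f.v,
            # (3) direct FFT of f.v -> smoothed spectrum estimate
            v = np.fft.ifft(power_spectrum)
            return np.real(np.fft.fft(f * v))

        def wiener_denoise(u_frame, m_xx, f, apod):
            # ^C_UU from the current frame, ^C_XX from M_XX(n), then
            # ^H(n) = 1 - ^C_XX(n)/^C_UU(n) and ^S(n) = ^H(n) U(n)
            U = np.fft.fft(apod * np.asarray(u_frame, dtype=float))
            c_uu = smoothed_correlogram(np.abs(U) ** 2, f)
            c_xx = smoothed_correlogram(m_xx, f)
            H = np.clip(1.0 - c_xx / np.maximum(c_uu, 1e-12), 0.0, 1.0)
            return H * U               # denoised spectrum ^S(n)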

    To detect unvoiced signals, the procedures described above are used as well, since energies representative of the noise are available (see the noise detection algorithm above).
    Activity detection.
    Let C_1 = N(m_1, σ_1²) and C_2 = N(m_2, σ_2²).
    Since we have an algorithm capable of highlighting values of random variables belonging to the same class, say class C_2, with a very low probability of error, it becomes much easier to decide, by observing the ratio U/V, whether U belongs to class C_1 or to class C_2.
    There are thus two distinct possible hypotheses,
    H_1: U ∈ C_1 and H_2: U ∈ C_2,
    corresponding to two distinct possible decisions:
    D = D_1: decision U ∈ C_1, noted "U ∈ C_1"
    D = D_2: decision U ∈ C_2, noted "U ∈ C_2"
    Optimal decision.
    We set m = m_1/m_2, α_1 = m_1/σ_1 and α_2 = m_2/σ_2.
    Let (U, V) be a couple of independent random variables, where we suppose that V ∈ C_2 and U ∈ C_1 ∪ C_2. By observing the variable X = U/V, we seek to decide between the two possibilities: X is the ratio of a C_1 variable to a C_2 variable, or X is the ratio of two C_2 variables.
    We therefore have two hypotheses: H_1: U ∈ C_1, H_2: U ∈ C_2.
    Let p = Pr{U ∈ C_1}.
    The decision rule is expressed in the following form:
    x > s ⇒ decide U ∈ C_1; x < s ⇒ decide U ∈ C_2
    The probability of correct decision P_c(s, m | α_1, α_2) is then:
    P_c(s, m | α_1, α_2) = p [1 − P(s, m | α_1, α_2)] + (1 − p) P(s, 1 | α_2, α_2)
    where p = Pr{U ∈ C_1}.
    The optimal threshold is the one for which P_c(s, m | α_1, α_2) is maximum. We therefore solve the equation:
    ∂P_c(s, m | α_1, α_2)/∂s = 0 ⇔ p f(s, m | α_1, α_2) − (1 − p) f(s, 1 | α_2, α_2) = 0
    where f denotes the density associated with the distribution function P.
    Neyman-Pearson approach
    In the previous approach, the probability p was assumed to be known. When this probability is unknown, a Neyman-Pearson type approach can be used.
    We define the probabilities of non-detection and of false alarm:
    P_nd = Pr{x < s | H_1} and P_fa = Pr{x > s | H_2}
    We have: P_nd = P(s, m | α_1, α_2) and P_fa = 1 − P(s, 1 | α_2, α_2)
    We then set P_fa or P_nd to determine the value of the threshold.
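
    As a rough numerical illustration, the threshold can be computed from a target P_fa. The exact law P(s, m | α_1, α_2) of the ratio of two Gaussian variables is developed in the description; the sketch below replaces it by the classical Gaussian approximation of U − sV (reasonable when α_2 is large, i.e. when V is almost surely positive), which is an assumption of this example, not of the patent:

        import numpy as np
        from scipy.stats import norm
        from scipy.optimize import brentq

        def p_fa(s, alpha2):
            # Pr{X > s | H2}: under H2, U and V are both C2 variables, so
            # U - s*V is approximately N(m2(1 - s), sigma2^2 (1 + s^2))
            return norm.sf(alpha2 * (s - 1.0) / np.sqrt(1.0 + s * s))

        def p_nd(s, m, alpha1, alpha2):
            # Pr{x < s | H1}, same Gaussian-difference approximation
            return norm.cdf((s - m) / np.sqrt((m / alpha1) ** 2 + (s / alpha2) ** 2))

        def threshold_for_pfa(target, alpha2, s_hi=1e3):
            # invert the monotone function s -> P_fa(s) by bisection
            return brentq(lambda s: p_fa(s, alpha2) - target, 1e-9, s_hi)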

    In order to apply activity detection as described above to the case of speech, it is necessary to establish an energy model of the unvoiced signals consistent with the assumptions that govern the good operation of the methods described above. We therefore look for a model of the energies of the unvoiced fricatives /F/, /S/, /CH/ and the unvoiced plosives /P/, /T/, /Q/ which yields energies whose statistical law is approximately Gaussian.

    Model 1.

    The sounds /F/, /S/, /CH/ lie spectrally in a frequency band which ranges from around 4 kHz to over 5 kHz. The sounds /P/, /T/, /Q/, being short-lived phenomena, spread over a wider band. In the chosen band, it is assumed that the spectrum of these fricative sounds is relatively flat, so that the fricative signal in this band can be modeled by a narrowband signal. This can be realistic in some practical cases without resorting to the whitening described above. In most cases, however, it makes sense to work on a whitened signal, so as to obtain a suitable narrowband noise model.

    By accepting such a narrowband noise model, we are led to deal with a ratio of two energies, which can be processed by the methods described above.

    Let s(n) be the speech signal in the band studied and x(n) the noise in this same band. The signals s(n) and x(n) are assumed independent.

    Class C_1 corresponds to the energy U of the total signal u(n) = s(n) + x(n) observed over N points; class C_2 corresponds to the energy V of the noise alone, observed over M points.

    The signals being Gaussian and independent, u(n) is itself a Gaussian signal, so that:
    U = Σ_{0≤n≤N−1} u(n)² ∈ N(Nσ_u², 2σ_u⁴ Σ_{0≤i≤N−1, 0≤j≤N−1} g_{f0,B}(i−j)²)
    Similarly:
    V = Σ_{0≤n≤M−1} y(n)² ∈ N(Mσ_x², 2σ_x⁴ Σ_{0≤i≤M−1, 0≤j≤M−1} g_{f0,B}(i−j)²), where y(n) denotes, it will be recalled, another realization of the noise x(n) over a time slot different from that in which u(n) is observed.
    We can therefore apply the above theoretical results with:
    m = (N/M) σ_u²/σ_x²,
    α_1 = N / (2 Σ_{0≤i≤N−1, 0≤j≤N−1} g_{f0,B}(i−j)²)^{1/2},
    α_2 = M / (2 Σ_{0≤i≤M−1, 0≤j≤M−1} g_{f0,B}(i−j)²)^{1/2}

    It will be noted that m = (N/M)(1 + r), where r = σ_s²/σ_x² ultimately denotes the signal-to-noise ratio.

    To solve this problem completely, one must know the signal-to-noise ratio r as well as the probability of presence p of the useful signal. What appears here as a limitation is common to the two other models discussed below.

    Model 2.

    As in the case of model 1, we seek to detect only unvoiced fricatives, therefore to detect a signal in a particular band.

    Here, the model of the fricative signal is not the same as previously. It is assumed that the fricatives have a known minimum energy per sample, µ_s² = (1/N) Σ_{0≤n≤N−1} s(n)², obtained for example through learning, or estimated.

    The unvoiced sound is independent of the noise x(n), which is here narrowband Gaussian.

    If y(n), for n between 0 and M−1, denotes another realization of the noise x(n) over a time slot distinct from that where the total signal u(n) = s(n) + x(n) is observed, we have:
    V = Σ_{0≤n≤M−1} y(n)² ∈ N(Mσ_x², 2 Tr(C_{x,M}²)), where C_{x,M} denotes the correlation matrix of the M-tuple ᵗ(y(0), ..., y(M−1)).

    As for the energy U = Σ_{0≤n≤N−1} u(n)² of the total signal, it can be expressed as:
    U = Nµ_s² + Σ_{0≤n≤N−1} x(n)²

    This result is obtained by supposing that the independence between s(n) and x(n) is expressed by decorrelation in the temporal sense of the term, that is to say that we can write: c = Σ_{0≤n≤N−1} s(n)x(n) / [(Σ_{0≤n≤N−1} s(n)²)^{1/2} (Σ_{0≤n≤N−1} x(n)²)^{1/2}] = 0

    As V′ = Σ_{0≤n≤N−1} x(n)² ∈ N(Nσ_x², 2 Tr(C_{x,N}²)), where C_{x,N} denotes the correlation matrix of the N-tuple ᵗ(x(0), ..., x(N−1)), we then have:
    U = Nµ_s² + Σ_{0≤n≤N−1} x(n)² ∈ N(Nµ_s² + Nσ_x², 2 Tr(C_{x,N}²)). We can therefore apply the above theoretical results with:
    C_1 = N(Nµ_s² + Nσ_x², 2 Tr(C_{x,N}²)), C_2 = N(Mσ_x², 2 Tr(C_{x,M}²)), m = (N/M)(1 + µ_s²/σ_x²),
    α_1 = N(µ_s² + σ_x²) / (2 Tr(C_{x,N}²))^{1/2}, α_2 = Mσ_x² / (2 Tr(C_{x,M}²))^{1/2}
    Note that m = (N/M)(1 + r), where r = µ_s²/σ_x² ultimately designates the signal-to-noise ratio. The same remark as for Model 1, concerning the signal-to-noise ratio r and the probability p of presence of the useful signal, holds here.

    Model 3.

    In this model, we seek to perform a detection of all unvoiced signals, under a white Gaussian noise hypothesis.

    The narrowband signal model used previously is therefore no longer valid. We can then only assume that we are dealing with a broadband signal whose minimum energy µ_s² is known.
    We then have:
    C_1 = N(Nµ_s² + Nσ_x², 2Nσ_x⁴), C_2 = N(Mσ_x², 2Mσ_x⁴)
    m = (N/M)(1 + r), with r = µ_s²/σ_x²
    α_1 = (1 + r)(N/2)^{1/2}, α_2 = (M/2)^{1/2}
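
    The class parameters of Model 3 follow directly from N, M and r; a small helper (illustrative names), usable with the threshold routine sketched in the activity-detection section:

        def model3_parameters(N, M, r):
            # Model 3 (white Gaussian noise):
            # m = (N/M)(1 + r), alpha1 = (1 + r) sqrt(N/2), alpha2 = sqrt(M/2)
            m = (N / M) * (1.0 + r)
            alpha1 = (1.0 + r) * (N / 2.0) ** 0.5
            alpha2 = (M / 2.0) ** 0.5
            return m, alpha1, alpha2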

    To use this model, the noise must be white and Gaussian. If the original noise is not white, we can approach this model by subsampling the observed signal, that is to say by keeping only one sample out of 2, 3, or even more, depending on the noise autocorrelation function, and assuming that the speech signal thus sampled still has detectable energy. But we can also, and this is preferable, use this algorithm on a signal whitened by a rejector filter, since the residual noise is then approximately white and Gaussian.

    The preceding remarks concerning the a priori value of the signal-to-noise ratio and the probability of presence of the useful signal remain valid here as well.

    Unvoiced sound detection algorithms.

    Using the preceding models, we present below two unvoiced sound detection algorithms.

    Algorithm 1:

    Having representative noise energies available, these energies can be averaged so as to obtain a noise "reference" energy. Let E_0 be this energy. For the N_3 frames T_1, ..., T_n which precede the first voiced frame, the procedure is as follows (a code sketch is given after the pseudocode):

    Let E(T_1), ..., E(T_n) be the energies of these frames, calculated as E(T_i) = Σ_{0≤n≤N−1} u(n)², where u(n) are the N samples constituting the frame T_i.
    For E(T_i) in {E(T_1), ..., E(T_n)}
    Do
        If E(T_i) is compatible with E_0 (decision on the value of the ratio E(T_i)/E_0):
            Detection on the frame T_i.
    End for.
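
    A minimal sketch of Algorithm 1, assuming frame_energy as defined earlier and a decision threshold s obtained as in the activity-detection section. Reading the test E(T_i)/E_0 > s as the decision "signal present" is an interpretation made explicit here:

        def detect_unvoiced_fixed(frames, e0, s):
            # Algorithm 1: fixed noise reference energy E0 (average of the
            # representative noise energies); decision on the ratio E(T_i)/E0
            detected = []
            for i, f in enumerate(frames):
                if frame_energy(f) / e0 > s:
                    detected.append(i)      # detection on frame T_i
            return detected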

    Algorithm 2:

    This algorithm is a variant of the previous one. E_0 is either the average energy of the frames detected as noise, or the value of the lowest energy of all the frames detected as noise.

    We then proceed as follows (a code sketch is given after the pseudocode):
    For E(T_i) in {E(T_1), ..., E(T_n)}
    Do
        If E(T_i) is compatible with E_0 (decision on the value of the ratio E(T_i)/E_0):
            Detection on the frame T_i.
        Otherwise E_0 = E(T_i).
    End for
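
    And the corresponding sketch of Algorithm 2, identical except that the reference E_0 is re-estimated from every frame on which no detection occurs:

        def detect_unvoiced_adaptive(frames, e0, s):
            # Algorithm 2: the noise reference E0 tracks the most recent
            # frame declared to be noise
            detected = []
            for i, f in enumerate(frames):
                e = frame_energy(f)
                if e / e0 > s:
                    detected.append(i)      # detection on frame T_i
                else:
                    e0 = e                  # otherwise E0 = E(T_i)
            return detected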

    The signal-to-noise ratio r can be estimated or fixed heuristically, provided that some preliminary experimental measurements, characteristic of the field of application, are made so as to establish an order of magnitude of the signal-to-noise ratio presented by the fricatives in the selected band.

    The probability p of the presence of unvoiced speech is also a heuristic datum, which modulates the selectivity of the algorithm in the same way as the signal-to-noise ratio. This value can be estimated according to the vocabulary used and the number of frames over which the search for unvoiced sounds is performed.

    Simplification in the case of a slightly noisy environment.

    In the case of a low-noise environment, for which no noise model has been determined, under the simplifications proposed above, the theory recalled above justifies the use of a threshold which is no longer bijectively related to the signal-to-noise ratio but is fixed entirely empirically.

    An interesting alternative for environments where noise is negligible is to be satisfied with the detection of voicing, to dispense with the detection of unvoiced sounds, and to set the start of speech a few frames before the vocal kernel (about 15 frames) and the end of speech a few frames after the end of the vocal kernel (about 15 frames).

    Claims (12)

    1. Method of detecting speech, for use in a voice recognition system, in noisy signals, characterized in that, after having carried out, in these signals, the detection of at least one voiced frame, noise frames preceding this voiced frame are sought, an autoregressive model of noise and a mean noise spectrum are constructed, the frames preceding the voicing are bleached by rejector filter and noise is removed by spectral noise removal, the actual start of speech is sought in these bleached frames, from the noise-removed frames lying between the actual start of speech and the first voiced frame are extracted the acoustic vectors used by the voice recognition system, as long as voiced frames are detected, the latter have the noise removed then are parameterized for the purpose of recognizing them, when no more voiced frames are detected, the actual end of speech is sought in the bleached frames following the last voiced frame, the frames lying between the last voiced frame and the actual end of speech have the noise removed and are then parameterized.
    2. Method according to Claim 1, characterized in that the bleaching carried out by a rejector filtering is calculated on the basis of the autoregressive model of the noise.
    3. Method according to Claim 2, characterized in that, when the last speech frame has been parameterized, all the processing parameters are reinitialized.
    4. Method according to one of the preceding claims, characterized in that the frames of signals to be processed are processed by Fourier transforms, and in that, when two transforms are consecutive in time, they are calculated over three consecutive frames with an overlap of one frame.
    5. Method according to one of the preceding claims, characterized in that the detection of voicing is done, for each frame, with the aid of the value of the pitch associated with this frame.
    6. Method according to Claim 5, characterized in that the calculation of the pitch is validated after having recognized at least three voiced frames, i.e. 3 x 12.8 ms.
    7. Method according to Claim 5 or 6, characterized in that the calculation of the pitch is done from the correlation of the signal with its delayed form.
    8. Method according to one of Claims 5 to 7, characterized in that the detection of unvoiced sounds is done by thresholding.
    9. Method according to one of the preceding claims, characterized in that, in order to detect unvoiced speech, the distance between the vocal kernel and the fricative block, and the size of this fricative block, are examined.
    10. Method according to one of the preceding claims, characterized in that the mean noise spectrum is obtained by Wiener filtering.
    11. Method according to Claim 10, characterized in that the algorithm of the smooth correlogram is applied to the mean noise spectrum.
    12. Method according to Claim 1, characterized in that, furthermore, it is determined whether the medium is noisy or slightly noisy, and in that, in the latter case, only a detection of voiced frames, and a vocal kernel detection to which a confidence interval is attached, are carried out.