EP1451548A2 - Einrichtung zur sprachdetektion in einem audiosignal bei lauter umgebung - Google Patents

Device for speech detection in an audio signal in a noisy environment

Info

Publication number
EP1451548A2
Authority
EP
European Patent Office
Prior art keywords
audio signal
speech
frame
information
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP02788059A
Other languages
English (en)
French (fr)
Inventor
Arnaud Martin
Laurent Mauuary
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orange SA
Original Assignee
France Telecom SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by France Telecom SA filed Critical France Telecom SA
Publication of EP1451548A2

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90Pitch determination of speech signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93Discriminating between voiced and unvoiced parts of speech signals

Definitions

  • the present invention relates to systems for detecting speech in an audio signal and in particular in a noisy environment.
  • the invention relates to a method of detecting speech in an audio signal comprising a step of obtaining energy information of the audio signal, the energy information being used to detect speech in the audio signal.
  • the invention also relates to a speech detection device capable of implementing such a method.
  • Spoken language is the most natural mode of communication in humans. With the automation of man-machine communication, the dream of a voice interaction between man and machine appeared very early.
  • a voice recognition system conventionally consists of a speech detection module and a speech recognition module.
  • the function of the detection module is to detect the speech periods in an audio input signal, so as to prevent the recognition module from attempting to recognize speech over periods of the input signal corresponding to phases of silence.
  • the presence of a speech detection module therefore makes it possible both to improve performance and to reduce the cost of the voice recognition system.
  • the operation of a speech detection module in an audio signal is conventionally represented by a finite state machine (also designated by an automaton).
  • the change of states of a detection module involves a criterion based on obtaining and processing energy information relating to the audio signal.
  • Such a speech detection module is described in the document entitled "Improving the performance of interactive voice servers", by L. Mauuary, Doctoral thesis, University of Rennes 1, 1994.
  • the current technical challenges are linked to the recognition of a large number of isolated words (for example, for a voice directory), to the recognition of continuous speech (i.e., sentences of everyday language), and to the transmission/reception of the signal in a noisy environment, for example in the context of mobile telephony.
  • the main objective of the present invention is to provide a speech detection system whose efficiency in a noisy context is better than that of conventional detection systems, and which therefore makes it possible, in this context, to improve the performance of the associated voice recognition system.
  • the proposed detection system is therefore particularly suitable for use in the context of telephone speech recognition robust to surrounding noise.
  • the invention relates, according to a first aspect, to a method of detecting speech in an audio signal comprising a step of obtaining energy information of the audio signal, this energy information being used to detect speech in the audio signal.
  • this method is remarkable in that it further comprises a step of obtaining voicing information of the audio signal, this voicing information being used in conjunction with the energy information for the detection of speech in the audio signal.
  • the invention relates to a speech detection device capable of implementing a detection method as defined succinctly above.
  • this device further comprises means for obtaining voicing information of the audio signal, this voicing information being used in conjunction with the energy information for detecting speech in the audio signal.
  • the combined use of the energy of the input signal and a voicing parameter improves speech detection by reducing noise detections, and thus improves the overall accuracy of the voice recognition system. This improvement is accompanied by a decrease in the dependence of the adjustment of the detection system on the characteristics of the communication.
  • the present invention applies to the general field of processing an audio signal.
  • the invention can be applied, non-exhaustively, to:
  • - speech recognition robust to the acoustic environment, for example recognition in the street (mobile telephony), in the car, etc.;
  • - speech transmission, for example within the framework of telephony or within the framework of teleconferencing/videoconferencing;
  • FIG. 3 is a graphical representation of the values of a voicing parameter calculated, according to an embodiment of the invention, on audio files from databases obtained on PSTN and GSM networks;
  • FIG. 4 illustrates the use of a new detection criterion based on a voicing parameter calculated according to the invention and applied to the state machine of FIG. 2, according to a preferred embodiment;
  • FIG. 5 is a graphic representation of the results obtained by a detection module according to the invention, on a database of audio files recorded on a GSM network;
  • FIG. 6 is a graphical representation of the results obtained by a detection module according to the invention, on another database of audio files recorded on a PSTN network;
  • FIG. 7 is a graphical representation of the results obtained by a voice recognition system incorporating a speech detection module according to the invention, on a database of audio files recorded on the PSTN network.
  • Voicing - A voiced sound is a sound characterized by the vibration of the vocal cords. Voicing is a characteristic of most speech sounds; only certain plosives and fricatives are not voiced. In addition, the majority of noises are not voiced. Consequently, a voicing parameter can provide useful information for discriminating, in an input signal, between energetic sounds coming from speech and energetic noises.
  • Fundamental frequency or pitch - The measurement of the fundamental frequency F0 (in the sense of Fourier analysis) of the speech signal serves as an estimate of the vibration frequency of the vocal cords.
  • the fundamental frequency F0 varies with the gender, age, accent, and emotional state of the speaker, etc. Its variations can lie between 50 and 200 Hz.
  • the recognition system shown comprises a speech detection module 14 designated by DBP (Noise / Speech Detection) and a voice recognition module 12 (RECO).
  • the speech detection module 14 determines the periods of the audio input signal in which speech is present.
  • This determination is preceded by the analysis of the audio signal by an analysis module 11, so as to extract from it coefficients relevant for the detection module 14 and for the recognition module 12.
  • the extracted coefficients are cepstral coefficients, also called MFCC coefficients (Mel Frequency Cepstrum Coefficients).
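  • As an illustration only, cepstral coefficients of this kind can be extracted as in the following minimal sketch; the librosa library, the file name and all parameter values are assumptions for illustration, not prescribed by the patent:

```python
# Illustrative MFCC extraction; librosa and the parameter values are
# assumptions, not taken from the patent.
import librosa

# Telephone-band audio at an assumed sampling rate of 8 kHz.
y, sr = librosa.load("utterance.wav", sr=8000)

# 16 ms hop (128 samples at 8 kHz), 12 cepstral coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=256, hop_length=128)
print(mfcc.shape)  # (12, number_of_frames)
```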
  • the detection (14) and recognition (12) modules operate simultaneously.
  • the recognition module 12, used for recognizing isolated words and continuous speech, is based on a known method relying on the use of Markov chains.
  • the detection module 14 supplies the speech-start and then speech-end information to the recognition module 12. When all the speech frames have been processed, the speech recognition system provides the recognition result via a decision module 13.
  • Speech-in-noise detection systems (DBP)
  • a finite state machine (or automaton) is used for this purpose. For example, a two-state automaton can be used in the simplest case (for example for voice activity detection), as well as three-state, four-state, or even five-state automata.
  • the decision is made at each of the frames of the input signal, whose period can be, for example, 16 milliseconds (ms).
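  • For illustration, a minimal sketch of such framing (the 8 kHz sampling rate is an assumption; one detection decision is then taken per returned frame):

```python
import numpy as np

def frame_signal(x, sr=8000, frame_ms=16):
    """Split signal x into consecutive, non-overlapping 16 ms frames."""
    n = int(sr * frame_ms / 1000)        # 128 samples per frame at 8 kHz
    n_frames = len(x) // n
    return x[: n_frames * n].reshape(n_frames, n)
```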
  • the use of an automaton with a large number of finite states allows finer modeling of the decision to be taken, by taking into account structural considerations of speech.
  • this automaton is modified, in accordance with a preferred embodiment of the invention, so as to incorporate therein a voicing parameter as an additional criterion for changing states.
  • the automaton used here comprises five states, among which state 1 "noise or silence", state 3 "speech", state 4 "unvoiced plosive or silence", and state 5 "possible speech recovery".
  • the transitions from one state to another of the automaton are conditioned by a test on the energy of the input signal and by structural constraints of duration (minimum duration of a vowel and maximum duration of a plosive).
  • the transition to state 3 (“speech") determines the boundary at which speech begins in the input signal.
  • the recognition module 12 takes into account the speech start boundary with a predetermined safety margin on this boundary, for example 160 ms (10 frames of 16 ms each).
  • the return of the automaton to state 1 signifies confirmation of the end of speech.
  • the end-of-speech boundary is therefore determined during the transition from state 3 or 5 to state 1 of the automaton.
  • the recognition module 12 takes into account the end-of-speech boundary with a predetermined safety margin on this boundary, for example 240 ms (15 frames of 16 ms each).
  • condition C1 corresponds to a frame whose energy is greater than a predetermined detection threshold; Non_C1 denotes the opposite condition.
  • the automaton enters state 3 when conditions C1 and C2 are fulfilled simultaneously, that is to say when the automaton has remained in state 2 for a predetermined minimum number "Speech Minimum" (condition C2) of successive energetic frames (condition C1) received. It then remains in this state until it receives frames verifying Non_C1 whose cumulative duration is greater than "Silence End" (condition C3), which confirms a state of silence and causes a return to state 1 "noise or silence".
  • the variable "Silence End" is therefore used to confirm a state of silence due to the end of speech. For example, in the case of continuous speech, "Silence End" can reach 1 second.
  • condition Non_C1 causes a return to state 1 "noise or silence" or to state 4 "unvoiced plosive or silence", depending on whether the duration of silence ("Silence Duration" - DS) is greater (C3) or not (Non_C3) than a predefined number of frames ("Silence End").
  • the duration of silence represents the time spent in state 4 "unvoiced plosive or silence" and in state 5 "possible speech recovery".
  • the state "unvoiced plosive or silence" (4) models low-energy passages within a word or a sentence, such as intra-word pauses or plosives.
  • during the transitions of the automaton, a certain number of actions are executed.
  • action A1 indicates the duration of silence after the last detected speech frame.
  • action A6 resets the variable "Silence Duration" (DS), intended to count the silences, as well as the variable "Speech Duration" (DP).
  • action A3 makes it possible to specify the number of frames of silence after the last speech frame of state 3 ("speech"), in order to determine the end-of-speech boundary.
  • during this transition, actions A3 and A6 are performed.
  • Actions A2 and A5, for their part, set the variables "Speech Duration" (DP) and "Silence Duration" (DS) to "1" respectively. Finally, action A4 increments the variable DP.
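  • By way of illustration, a highly simplified Python skeleton of such a decision automaton is sketched below. It covers only the transitions and actions described above; state 5 "possible speech recovery" is omitted, and the duration constants are placeholder assumptions, not the values of FIG. 2:

```python
SPEECH_MINIMUM = 5   # C2: assumed minimum number of successive energetic frames
SILENCE_END = 30     # C3: assumed number of silence frames confirming end of speech

class SpeechAutomaton:
    """Partial sketch of the noise/speech decision automaton (state 5 omitted)."""

    def __init__(self):
        self.state = 1   # state 1: "noise or silence"
        self.dp = 0      # "Speech Duration" (DP)
        self.ds = 0      # "Silence Duration" (DS)

    def step(self, c1):
        """Advance one frame; c1 is condition C1 (energetic frame)."""
        if self.state == 1:                  # "noise or silence"
            if c1:
                self.state, self.dp = 2, 1   # A2: DP := 1
        elif self.state == 2:                # presumed speech
            if c1:
                self.dp += 1                 # A4: increment DP
                if self.dp >= SPEECH_MINIMUM:
                    self.state = 3           # speech start boundary
            else:
                self.state = 1
        elif self.state == 3:                # "speech"
            if not c1:
                self.state, self.ds = 4, 1   # A5: DS := 1
        elif self.state == 4:                # "unvoiced plosive or silence"
            if c1:
                self.state = 3               # low-energy passage inside a word
            else:
                self.ds += 1                 # A1: count silence after speech
                if self.ds >= SILENCE_END:   # C3: silence confirmed
                    self.state = 1           # end-of-speech boundary
                    self.dp = self.ds = 0    # A6: reset DP and DS
        return self.state
```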
  • the condition C1 for changing states is based on a detection criterion which uses energy information from the frames of the input signal: energy information of a given frame of the input signal is compared with a predetermined threshold.
  • in addition to condition C1, another condition (C4) is used, based on a second detection criterion using a voicing parameter.
  • the speech detection system (14) comprises means for measuring the energy of the input signal, used to define the energy criterion of the condition C1.
  • this energy criterion is based on the use of noise statistics. We make the classic assumption that the logarithm of the noise energy E(n) follows a normal law of parameters (μ, σ²).
  • E(n) is the logarithm of the short-term energy of the noise, that is to say the logarithm of the sum of the squares of the samples of a given frame n of the input signal.
  • the statistics of the logarithm of the noise energy are estimated when the automaton is in state 1 "noise or silence".
  • the mean and the standard deviation are estimated recursively by equations (1) and (2) which follow:
  • μ(n) = λ · μ(n−1) + (1 − λ) · E(n)   (1)
  • σ²(n) = λ · σ²(n−1) + (1 − λ) · (E(n) − μ(n))²   (2)
  • where the forgetting factor λ is for example 0.995, which corresponds to a time constant of 3200 ms (1/(1 − λ) = 200 frames of 16 ms).
  • detection threshold values between 1.5 and 3.5 standard deviations above the estimated mean can be used.
  • this first criterion, called the SB criterion, is based on the use of the energy information E(n) of the input signal.
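  • A minimal sketch of this energy criterion, assuming the recursive estimators of equations (1) and (2) and a threshold of the form μ + α·σ (the update order, the initial values and α = 2.5 are illustrative assumptions):

```python
import numpy as np

LAMBDA = 0.995   # forgetting factor: time constant of 200 frames = 3200 ms
ALPHA = 2.5      # threshold factor: values between 1.5 and 3.5 can be used

class EnergyCriterion:
    """Sketch of condition C1 based on the noise statistics (mu, sigma^2)."""

    def __init__(self):
        self.mu = 0.0    # estimated mean of the log noise energy
        self.var = 1.0   # estimated variance (arbitrary initial value)

    @staticmethod
    def log_energy(frame):
        # E(n): logarithm of the short-term energy of frame n
        return np.log(np.sum(np.asarray(frame, dtype=float) ** 2) + 1e-12)

    def update_noise_stats(self, e):
        # Called only while the automaton is in state 1 ("noise or silence")
        self.mu = LAMBDA * self.mu + (1.0 - LAMBDA) * e
        self.var = LAMBDA * self.var + (1.0 - LAMBDA) * (e - self.mu) ** 2

    def c1(self, frame):
        # C1: frame energy exceeds mu + alpha * sigma
        return self.log_energy(frame) > self.mu + ALPHA * np.sqrt(self.var)
```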
  • the speech-in-noise detection system further comprises means for calculating a voicing parameter, which is associated with the energy information for the detection of speech in noise.
  • this parameter is calculated as follows.
  • the voicing parameter used is estimated from the fundamental frequency.
  • other types of voicing parameter obtained by other methods, can be used in the context of the present invention.
  • the fundamental frequency is calculated by a spectral method. This method searches for the harmonicity of the signal by inter-correlation with a comb function whose tooth spacing is varied.
  • the period of the harmonics in the spectrum over the entire input signal is calculated at regular time intervals.
  • the period of the harmonics in the spectrum is calculated every four milliseconds (ms) over the whole of the input signal, that is to say even in the non-speech periods.
  • the period of the harmonics in the spectrum is the fundamental frequency.
  • the term "fundamental frequency" is used in the rest of the description to designate the period of the harmonics in the spectrum.
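  • A sketch of such a spectral comb search is given below; the FFT size, the analysis window, the number of comb teeth and the 50-200 Hz search range (the range of variation quoted above) are illustrative assumptions:

```python
import numpy as np

def comb_f0(frame, sr=8000, f_min=50.0, f_max=200.0, n_harm=5, n_fft=1024):
    """Estimate F0 by correlating the magnitude spectrum with a harmonic comb.

    The tooth spacing of the comb is varied over [f_min, f_max]; the
    spacing that collects the most spectral energy is returned.
    """
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    best_f0, best_score = 0.0, -np.inf
    for f0 in np.arange(f_min, f_max, 1.0):
        # Indices of the spectral bins nearest the comb teeth k * f0
        teeth = np.searchsorted(freqs, f0 * np.arange(1, n_harm + 1))
        teeth = teeth[teeth < len(spec)]
        score = spec[teeth].sum()
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0
```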
  • the median is then calculated between the current value of fundamental frequency and a predetermined number of previous values of fundamental frequency. In practice, in the chosen implementation, the median is calculated between the current value of fundamental frequency and the two previous values. The use of the median in particular makes it possible to eliminate certain errors in estimating the fundamental frequency.
  • a median med(m) is thus calculated for each subframe m of the input audio signal.
  • successive frames of the input signal of length 16 ms are considered, and a median value is calculated every 4 ms, that is to say for each sub-frame of length 4 ms.
  • for each subframe, the local variation of the median is computed: Δmed(m) = |med(m) − med(m − 1)|
  • the arithmetic mean of Δmed(m) over the subframes of the current frame is then computed (equation (6)). This average is a criterion of the local variation of the fundamental frequency: if the fundamental frequency varies little, the current frame is assumed to be a speech frame.
  • this arithmetic mean of the Δmed(m) therefore constitutes an estimate of a degree of voicing.
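  • A sketch of this computation, following the description above (one F0 value per 4 ms subframe, a 3-point median, then the mean of the absolute median differences over the four subframes of each 16 ms frame); the function name is illustrative:

```python
import numpy as np

def voicing_parameter(f0_track, sub_per_frame=4):
    """Return one voicing value per 16 ms frame from a 4 ms F0 track."""
    f0 = np.asarray(f0_track, dtype=float)
    # med(m): median of the current F0 value and the two previous ones
    med = np.array([np.median(f0[max(0, m - 2): m + 1]) for m in range(len(f0))])
    # Delta med(m) = |med(m) - med(m - 1)|
    dmed = np.abs(np.diff(med, prepend=med[0]))
    # Arithmetic mean over the subframes of each frame (equation (6))
    n_frames = len(dmed) // sub_per_frame
    return dmed[: n_frames * sub_per_frame].reshape(n_frames, -1).mean(axis=1)
```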
  • FIG. 3 is a plot of curves representing the value of the voicing parameter calculated according to equation (6) above, as a function of the audio file number, for files of different types (speech, impulsive noises, background noises). More specifically, the curves in FIG. 3 represent the average of the degree of voicing measured on bases of audio files recorded on PSTN and GSM networks.
  • the voicing parameter makes it possible to discriminate speech from impulsive noises. Indeed, by applying for example a threshold of 15 to this parameter value, one can effectively distinguish speech from impulsive noises and background noise.
  • the use of this voicing parameter, in addition to the energy information of the input signal, to discriminate speech from noise, is implemented in the detection module (14) by the decision automaton described above in relation to FIG. 2.
  • the joint use of the energy of the input signal and the voicing parameter then makes it possible to define a more precise criterion for triggering the transitions between all or some of the states of the automaton.
  • FIG. 4 illustrates, by way of example, the insertion of the new criterion above based on a voicing parameter according to the invention in the state machine of FIG. 2.
  • the present invention can therefore also apply to detection systems whose function is to detect only the start of speech.
  • condition C4 is defined as follows: mean(Δmed) < threshold_voicing   (7), where mean(Δmed) is the voicing parameter of equation (6) computed for the current frame.
  • detection tests on the noisy part of a GSM base of audio files were used to determine the value "10" as the optimized value of the threshold threshold_voicing. This threshold may also be adapted to the noise conditions present in the input signal so as to guarantee precise detection whatever the acoustic environment.
  • the combination of the new condition C4 with the condition C1 thus makes it possible to obtain a double detection criterion, based on a measurement of the energy of the input signal and on a measurement of the voicing.
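  • The combination can be sketched as a simple conjunction of the two criteria (the helper name and the threshold wiring are illustrative; the value 10 is the one quoted above):

```python
def is_speech_candidate(energy_ok, voicing_value, voicing_threshold=10.0):
    """Double detection criterion: C1 (energy) AND C4 (voicing).

    energy_ok: result of condition C1 for the current frame.
    voicing_value: voicing parameter of equation (6); low values mean a
    stable fundamental frequency, hence voiced speech (condition C4).
    """
    return energy_ok and voicing_value < voicing_threshold
```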
  • the GSM_T base is a laboratory base recorded on a GSM network in four different environments: interior, exterior, stationary vehicle and moving vehicle. Normally each word is pronounced only once, except if a loud noise occurs during the pronunciation of the word. The numbers of occurrences of each word are therefore substantially identical.
  • the vocabulary consists of 65 words. The 29,558 segments from manual segmentation are divided into 85% vocabulary words, 3% non-vocabulary words and 12% noises.
  • the GSM_T base is composed of two sub-bases defined according to the signal-to-noise ratio (SNR) of each file making up these sub-bases.
  • the AGORA base is a base for experimenting with a man-machine dialogue application, recorded on a PSTN switched network. It is therefore a continuous speech base.
  • the AGORA base is mainly used as a test base. It is composed of 64 recordings.
  • the 3,115 reference segments include 12,635 words.
  • the vocabulary of the recognition model is 1633 words. There are no non-vocabulary word segments for this base.
  • the speech segments constitute 81% of the reference segments and the noise segments 19%.
  • the results of the speech detection alone are considered first, then the results of this detection in the context of voice recognition, by studying the results obtained by the speech recognition system.
  • the results of the detection alone are studied by considering the final error rate as a function of the rejectable error rate.
  • the final errors generated by the detection module are composed of omitted speech, fragmentations of a word or a sentence, and groupings of several words or several sentences. These errors are said to be "final” because they cause definitive errors at the level of the recognition module.
  • the rejectable errors generated by the detection module are composed of noise insertions (or noise detections). A rejectable error can be rejected by a rejection model incorporated in the decision module (FIG. 1, 13) of the recognition module. Otherwise, it causes a voice recognition error.
  • the approach consisting in evaluating the detection module alone makes it possible to place oneself in a context independent of voice recognition.
  • the results of the recognition system using a detection module according to the invention are studied by considering three types of error in the case of recognition of isolated words, and four types of error in the case of continuous speech recognition.
  • a so-called "substitution" error represents a word of the vocabulary recognized as being another word of the vocabulary.
  • a so-called "false acceptance" error is a noise detection recognized as being a word.
  • a so-called "wrongly rejected" error is the rejection of a vocabulary word by the rejection model, or corresponds to a word not detected by the detection module.
  • in the case of continuous speech recognition, a so-called "insertion" error concerns a word inserted in a sentence (or request);
  • a so-called "omission" error concerns a word omitted in a sentence;
  • a so-called "substitution" error concerns a substituted word in a sentence;
  • a so-called "wrongly rejected" error concerns a sentence wrongly rejected by the rejection model, or not detected by the detection module.
  • These wrongly rejected errors are expressed by a rate of omitting words in sentences.
  • the insertion, omission and substitution errors are represented as a function of the wrongly-rejected errors.
  • FIG. 5 is a graphic representation of the results obtained by a detection module according to the invention on the GSM_T database of audio files recorded on a GSM network.
  • the curves of FIG. 5 represent, for each sub-base (noisy and non-noisy) of the GSM_T base, the results obtained with the detection automaton of FIG. 2 (condition C1 alone), and the results obtained using the detection automaton modified according to FIG. 4 (combination of conditions C1 and C4).
  • the results are expressed as a rejectable error rate compared to the final error rate. For a given rejectable error rate, the lower the final error rate, the better the performance obtained.
  • curves 51 and 52 correspond to the results obtained with the "non-noisy" sub-base, that is to say corresponding to a signal-to-noise ratio (SNR) greater than 18 decibels (dB).
  • curves 53 and 54 correspond to the results obtained with the "noisy" sub-base, that is to say corresponding to an SNR less than 18 dB.
  • curves 51 and 53 correspond to the use of the "energy" criterion alone, based on the energy of the input signal (condition C1), while curves 52 and 54 correspond to the joint use of the energy criterion and the voicing criterion (conditions C1 and C4).
  • FIG. 6 represents the results obtained with a detection module according to the invention on the AGORA continuous-speech base of audio files recorded on a PSTN network.
  • FIG. 7 is a graphic representation of the results obtained by a voice recognition system integrating a speech detection module according to the invention, on the AGORA base of audio files recorded on a PSTN network. These results were obtained using the optimal thresholds for recognition. For recognition, the results are assessed by comparing the wrongly-rejected error rate with the rate of word omission, insertion and substitution errors.
  • in FIG. 7, curve 71 represents the results obtained with the use of the energy criterion alone (condition C1), while curve 72 represents the results obtained with the joint use of the energy criterion and the voicing criterion (conditions C1 and C4). It can be observed that the voice recognition results (curve 72) are also better with the use of the double energy-voicing criterion for the detection module.

EP02788059A 2001-12-05 2002-11-15 Einrichtung zur sprachdetektion in einem audiosignal bei lauter umgebung Withdrawn EP1451548A2 (de)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
FR0115685 2001-12-05
FR0115685A FR2833103B1 (fr) 2001-12-05 2001-12-05 Systeme de detection de parole dans le bruit
PCT/FR2002/003910 WO2003048711A2 (fr) 2001-12-05 2002-11-15 System de detection de parole dans un signal audio en environnement bruite

Publications (1)

Publication Number Publication Date
EP1451548A2 true EP1451548A2 (de) 2004-09-01

Family

ID=8870113

Family Applications (1)

Application Number Title Priority Date Filing Date
EP02788059A Withdrawn EP1451548A2 (de) 2001-12-05 2002-11-15 Einrichtung zur sprachdetektion in einem audiosignal bei lauter umgebung

Country Status (5)

Country Link
US (1) US7359856B2 (de)
EP (1) EP1451548A2 (de)
AU (1) AU2002352339A1 (de)
FR (1) FR2833103B1 (de)
WO (1) WO2003048711A2 (de)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2856506B1 (fr) * 2003-06-23 2005-12-02 France Telecom Procede et dispositif de detection de parole dans un signal audio
FR2864319A1 (fr) * 2005-01-19 2005-06-24 France Telecom Procede et dispositif de detection de parole dans un signal audio
CN1815550A (zh) * 2005-02-01 2006-08-09 松下电器产业株式会社 可识别环境中的语音与非语音的方法及系统
US8175877B2 (en) * 2005-02-02 2012-05-08 At&T Intellectual Property Ii, L.P. Method and apparatus for predicting word accuracy in automatic speech recognition systems
GB2450886B (en) * 2007-07-10 2009-12-16 Motorola Inc Voice activity detector and a method of operation
KR100930039B1 (ko) * 2007-12-18 2009-12-07 한국전자통신연구원 음성 인식기의 성능 평가 장치 및 그 방법
US8380497B2 (en) * 2008-10-15 2013-02-19 Qualcomm Incorporated Methods and apparatus for noise estimation
US8938389B2 (en) * 2008-12-17 2015-01-20 Nec Corporation Voice activity detector, voice activity detection program, and parameter adjusting method
CN102667927B (zh) * 2009-10-19 2013-05-08 瑞典爱立信有限公司 语音活动检测的方法和背景估计器
KR20140026229A (ko) * 2010-04-22 2014-03-05 퀄컴 인코포레이티드 음성 액티비티 검출
CN102237081B (zh) * 2010-04-30 2013-04-24 国际商业机器公司 语音韵律评估方法与系统
US8898058B2 (en) 2010-10-25 2014-11-25 Qualcomm Incorporated Systems, methods, and apparatus for voice activity detection
JP5747562B2 (ja) * 2010-10-28 2015-07-15 ヤマハ株式会社 音響処理装置
US20150281853A1 (en) * 2011-07-11 2015-10-01 SoundFest, Inc. Systems and methods for enhancing targeted audibility
KR20140147587A (ko) * 2013-06-20 2014-12-30 한국전자통신연구원 Wfst를 이용한 음성 끝점 검출 장치 및 방법
WO2015098079A1 (ja) * 2013-12-26 2015-07-02 パナソニックIpマネジメント株式会社 音声認識処理装置、音声認識処理方法、および表示装置
ES2664348T3 (es) 2014-07-29 2018-04-19 Telefonaktiebolaget Lm Ericsson (Publ) Estimación de ruido de fondo en señales de audio
CN111739515B (zh) * 2019-09-18 2023-08-04 北京京东尚科信息技术有限公司 语音识别方法、设备、电子设备和服务器、相关系统
KR20210089347A (ko) * 2020-01-08 2021-07-16 엘지전자 주식회사 음성 인식 장치 및 음성데이터를 학습하는 방법
CN111599377B (zh) * 2020-04-03 2023-03-31 厦门快商通科技股份有限公司 基于音频识别的设备状态检测方法、系统及移动终端
CN111554314A (zh) * 2020-05-15 2020-08-18 腾讯科技(深圳)有限公司 噪声检测方法、装置、终端及存储介质
CN115602152B (zh) * 2022-12-14 2023-02-28 成都启英泰伦科技有限公司 一种基于多阶段注意力网络的语音增强方法

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4696039A (en) * 1983-10-13 1987-09-22 Texas Instruments Incorporated Speech analysis/synthesis system with silence suppression
US5276765A (en) * 1988-03-11 1994-01-04 British Telecommunications Public Limited Company Voice activity detection
US5579431A (en) * 1992-10-05 1996-11-26 Panasonic Technologies, Inc. Speech detection in presence of noise by determining variance over time of frequency band limited energy
US5598466A (en) * 1995-08-28 1997-01-28 Intel Corporation Voice activity detector for half-duplex audio communication system
JPH0990974A (ja) * 1995-09-25 1997-04-04 Nippon Telegr & Teleph Corp <Ntt> 信号処理方法
US5819217A (en) * 1995-12-21 1998-10-06 Nynex Science & Technology, Inc. Method and system for differentiating between speech and noise
US5890109A (en) * 1996-03-28 1999-03-30 Intel Corporation Re-initializing adaptive parameters for encoding audio signals
US6023674A (en) * 1998-01-23 2000-02-08 Telefonaktiebolaget L M Ericsson Non-parametric voice activity detection
US6122531A (en) * 1998-07-31 2000-09-19 Motorola, Inc. Method for selectively including leading fricative sounds in a portable communication device operated in a speakerphone mode
US6327564B1 (en) * 1999-03-05 2001-12-04 Matsushita Electric Corporation Of America Speech detection using stochastic confidence measures on the frequency spectrum
US6775649B1 (en) * 1999-09-01 2004-08-10 Texas Instruments Incorporated Concealment of frame erasures for speech transmission and storage system and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JUNQUA J.-C.; MAK B.; REAVES B.: "A robust algorithm for word boundary detection in the presence of noise", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, vol. 2, no. 3, July 1994, USA, pages 406-412 *
LAMEL L. F.; RABINER L. R.; ROSENBERG A. E.; WILPON J. G.: "An improved endpoint detector for isolated word recognition", IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, vol. ASSP-29, no. 4, August 1981, pages 777-785, XP002062762, DOI: 10.1109/TASSP.1981.1163642 *
See also references of WO03048711A3 *

Also Published As

Publication number Publication date
WO2003048711A2 (fr) 2003-06-12
US20050143978A1 (en) 2005-06-30
AU2002352339A1 (en) 2003-06-17
FR2833103A1 (fr) 2003-06-06
WO2003048711A3 (fr) 2004-02-12
FR2833103B1 (fr) 2004-07-09
US7359856B2 (en) 2008-04-15
AU2002352339A8 (en) 2003-06-17

Similar Documents

Publication Publication Date Title
EP1451548A2 (de) Einrichtung zur sprachdetektion in einem audiosignal bei lauter umgebung
EP1154405B1 (de) Verfahren und Vorrichtung zur Spracherkennung in einer Umgebung mit variablerem Rauschpegel
EP2415047B1 (de) Klassifizieren von in einem Tonsignal enthaltenem Hintergrundrauschen
EP0867856B1 (de) Verfahren und Vorrichtung zur Sprachdetektion
JP4568371B2 (ja) 少なくとも2つのイベント・クラス間を区別するためのコンピュータ化された方法及びコンピュータ・プログラム
EP2419900B1 (de) Verfahren und einrichtung zur objektiven evaluierung der sprachqualität eines sprachsignals unter berücksichtigung der klassifikation der in dem signal enthaltenen hintergrundgeräusche
CA2404441C (fr) Parametres robustes pour la reconnaissance de parole bruitee
EP3078027A1 (de) Stimmendetektionsverfahren
EP1131813B1 (de) Verfahren und vorrichtung zur spracherkennung eines mit störungen behafteten akustischen signals
EP3627510A1 (de) Filterung eines tonsignals, das durch ein stimmerkennungssystem erfasst wurde
Odriozola et al. An on-line VAD based on Multi-Normalisation Scoring (MNS) of observation likelihoods
Martin et al. Robust speech/non-speech detection based on LDA-derived parameter and voicing parameter for speech recognition in noisy environments
Skorik et al. On a cepstrum-based speech detector robust to white noise
EP1665231B1 (de) Verfahren für unbeaufsichtigter doping und ablehnung von wörter ausser dem spracherkennungwortschatz
FR2856506A1 (fr) Procede et dispositif de detection de parole dans un signal audio
FR2864319A1 (fr) Procede et dispositif de detection de parole dans un signal audio
FR2823361A1 (fr) Procede et dispositif d'extraction acoustique d'un signal vocal
WO2001091106A1 (fr) Fenetres d'analyse adaptatives pour la reconnaissance de la parole

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LI LU MC NL PT SE SK TR

17P Request for examination filed

Effective date: 20040617

RIN1 Information on inventor provided before grant (corrected)

Inventor name: MARTIN, ARNAUD

Inventor name: MAUUARY, LAURENT

17Q First examination report despatched

Effective date: 20070620

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: ORANGE

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

RIC1 Information provided on ipc code assigned before grant

Ipc: G10L 25/78 20130101AFI20160712BHEP

INTG Intention to grant announced

Effective date: 20160805

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20161216

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN