EP1451548A2 - Einrichtung zur sprachdetektion in einem audiosignal bei lauter umgebung - Google Patents
Einrichtung zur sprachdetektion in einem audiosignal bei lauter umgebung
- Publication number
- EP1451548A2 EP02788059A
- Authority
- EP
- European Patent Office
- Prior art keywords
- audio signal
- speech
- frame
- information
- detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 87
- 230000005236 sound signal Effects 0.000 title claims abstract description 45
- 238000000034 method Methods 0.000 claims abstract description 24
- 238000004364 calculation method Methods 0.000 claims description 7
- 230000007704 transition Effects 0.000 claims description 7
- 238000011156 evaluation Methods 0.000 claims 1
- 230000009471 action Effects 0.000 description 7
- 230000006870 function Effects 0.000 description 7
- 238000004891 communication Methods 0.000 description 5
- 238000006467 substitution reaction Methods 0.000 description 5
- 238000003780 insertion Methods 0.000 description 4
- 230000037431 insertion Effects 0.000 description 4
- 238000005259 measurement Methods 0.000 description 4
- 238000001228 spectrum Methods 0.000 description 4
- 230000002452 interceptive effect Effects 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 238000012545 processing Methods 0.000 description 3
- 238000011084 recovery Methods 0.000 description 3
- 238000012360 testing method Methods 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 230000003595 spectral effect Effects 0.000 description 2
- 210000001260 vocal cord Anatomy 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 238000005311 autocorrelation function Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 230000001143 conditioned effect Effects 0.000 description 1
- 238000012790 confirmation Methods 0.000 description 1
- 230000001186 cumulative effect Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000002996 emotional effect Effects 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000006062 fragmentation reaction Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000005096 rolling process Methods 0.000 description 1
- 230000035945 sensitivity Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
Definitions
- the present invention relates to systems for detecting speech in an audio signal and in particular in a noisy environment.
- the invention relates to a method of detecting speech in an audio signal comprising a step of obtaining energy information of the audio signal, the energy information being used to detect speech in the audio signal.
- the invention also relates to a speech detection device capable of implementing such a method.
- Spoken language is the most natural mode of communication in humans. With the automation of man-machine communication, the dream of a voice interaction between man and machine appeared very early.
- a voice recognition system conventionally consists of a speech detection module and a speech recognition module.
- the function of the detection module is to detect the speech periods in an audio input signal, in order to prevent the recognition module from attempting to recognize speech over periods of the input signal corresponding to phases of silence.
- the presence of a speech detection module therefore makes it possible both to improve performance and to reduce the cost of the voice recognition system.
- the operation of a speech detection module in an audio signal is conventionally represented by a finite state machine (also designated by an automaton).
- the change of states of a detection module involves a criterion based on obtaining and processing energy information relating to the audio signal.
- Such a speech detection module is described in the document entitled "Improving the performance of interactive voice servers", by L. Mauuary, Doctoral thesis, University of Rennes 1, 1994.
- the current technical challenges are linked to the recognition of a large number of isolated words (for example, for a voice directory), to the recognition of continuous speech (i.e., sentences of everyday language), and to the transmission/reception of the signal in a noisy environment, for example in the context of mobile telephony.
- the main objective of the present invention is to provide a speech detection system whose efficiency in a noisy context is better than that of conventional detection systems, and which therefore makes it possible, in this context, to improve the performance of the associated voice recognition system.
- the proposed detection system is therefore particularly suitable for use in the context of telephone speech recognition robust to surrounding noise.
- the invention relates, according to a first aspect, to a method of detecting speech in an audio signal comprising a step of obtaining energy information of the audio signal, this energy information being used to detect speech in the audio signal.
- this method is remarkable in that it further comprises a step of obtaining voicing information of the audio signal, this voicing information being used in conjunction with the energy information for the detection of speech in the audio signal.
- the invention relates to a speech detection device capable of implementing a detection method as defined succinctly above.
- this device further comprises means for obtaining voicing information of the audio signal, this voicing information being used in conjunction with the energy information for detecting speech in the audio signal.
- the combined use of the energy of the input signal and a voicing parameter improves speech detection by reducing noise detections, and thus improves the overall accuracy of the voice recognition system. This improvement is accompanied by a decrease in the dependence of the adjustment of the detection system on the characteristics of the communication.
- the present invention applies to the general field of processing an audio signal.
- the invention can be applied, in a non-exhaustive manner, to:
- - speech recognition robust to the acoustic environment, for example recognition in the street (mobile telephony), in the car, etc. ;
- - speech transmission for example within the framework of telephony or else within the framework of teleconference / videoconference;
- FIG. 3 is a graphical representation of the values of a voicing parameter calculated, according to an embodiment of the invention, on audio files from databases obtained on PSTN and GSM networks;
- FIG. 4 illustrates the use of a new detection criterion based on a voicing parameter calculated according to the invention and applied to the state machine of FIG. 2, according to a preferred embodiment of the invention;
- FIG. 5 is a graphic representation of the results obtained by a detection module according to the invention, on a database of audio files recorded on a GSM network;
- FIG. 6 is a graphical representation of the results obtained by a detection module according to the invention, on another database of audio files recorded on a PSTN network;
- FIG. 7 is a graphical representation of the results obtained by a voice recognition system incorporating a speech detection module according to the invention, on the basis of audio file data recorded on the PSTN network.
- Voicing - A voiced sound is a sound characterized by the vibration of the vocal cords. Voicing is a characteristic of most speech sounds; only certain plosives and fricatives are unvoiced. In addition, the majority of noises are not voiced. Consequently, a voicing parameter can provide useful information for discriminating, in an input signal, between energetic sounds originating from speech and energetic noise.
- Fundamental frequency or pitch - The measurement of the fundamental frequency F0 (in the sense of Fourier analysis) of the speech signal provides an estimate of the vibration frequency of the vocal cords.
- the fundamental frequency F0 varies with the gender, age, accent, emotional state of the speaker, etc. Its variations can range between 50 and 200 Hz.
- the recognition system shown comprises a speech detection module 14 designated by DBP (Noise / Speech Detection) and a voice recognition module 12 (RECO).
- DBP Noise / Speech Detection
- RECO voice recognition module
- the speech detection module 14 determines the periods of the audio input signal in which speech is present.
- This determination is preceded by the analysis of the audio signal by an analysis module 11, so as to extract from it coefficients relevant for the detection module 14 and for the recognition module 12.
- the extracted coefficients are cepstral coefficients, also called MFCC coefficients (Mel Frequency Cepstrum Coefficients).
- the detection (14) and recognition (12) modules operate simultaneously.
- the recognition module 12, used for recognizing isolated words and continuous speech, is based on a known method using Markov chains.
- the detection module 14 supplies the start and then end of speech information to the recognition module 12. When all the speech frames have been processed, the speech recognition system provides the recognition result via a decision module 13.
- DBP speech in noise detection systems
- a finite state machine (or automaton) is used for the decision. For example, a two-state machine can be used in the simplest case (for example for voice activity detection), or a three-, four-, or even five-state machine.
- the decision is made for each frame of the input signal, whose period can be, for example, 16 milliseconds (ms).
- the use of an automaton with a large number of finite states allows finer modeling of the decision to be taken, by taking into account structural considerations of speech.
- this automaton is modified, in accordance with a preferred embodiment of the invention, so as to incorporate therein a voicing parameter as an additional criterion for changing states.
- - state 5 "possible speech recovery”.
- the transitions from one state to another of the automaton are conditioned by a test on the energy of the input signal and by structural constraints of duration (minimum duration of a vowel and maximum duration of a plosive).
- the transition to state 3 (“speech") determines the boundary at which speech begins in the input signal.
- the recognition module 12 takes into account the speech start border with a predetermined safety margin on this border, for example 160 ms (10 frames of 16 ms each).
- the return of the automaton to state 1 confirms the end of speech.
- the end of speech border is therefore determined during the transition from state 3 or 5 to state 1 of the automaton.
- the recognition module 12 takes into account the end of speech border with a predetermined safety margin on this border, for example 240 ms (15 frames of 16 ms each).
- condition C1 denotes a frame whose energy is greater than a predetermined detection threshold; Non_C1 denotes its negation (a non-energetic frame).
- the automaton enters state 3 when conditions C1 and C2 are fulfilled simultaneously, that is to say when the automaton has remained in state 2 for a predetermined minimum number "Speech Minimum" (condition C2) of successive energetic frames (condition C1). It then remains in state 3.
- a succession of Non_C1 frames whose cumulative duration is greater than "End Silence" (condition C3) confirms a state of silence and causes a return to state 1 ("noise or silence").
- the variable "End Silence" is therefore used to confirm a state of silence due to the end of speech. For example, in the case of continuous speech, "End Silence" can reach 1 second.
- condition Non_C1 causes a return to state 1 ("noise or silence") or to state 4 ("unvoiced plosive or silence"), depending on whether the duration of silence (Silence Duration - DS) is greater (C3) or not (Non_C3) than a predefined number of frames ("End Silence").
- the duration of silence represents the time spent in state 4 "plosive unvoiced or silence” and in state 5 "possible speech recovery”.
- the state "unvoiced plosive or silence" (4) models low-energy passages in a word or a sentence, such as intra-word pauses or plosives.
- a certain number of actions are executed.
- action A1 indicates the duration of silence after the last detected speech frame
- action A6 resets the variable "Silence Duration" (DS), intended to count the silences, as well as the variable "Speech Duration" (DP).
- action A3 records the number of frames of silence after the last speech frame of state 3 ("speech"), in order to determine the end of speech border.
- actions A3 and A6 are performed.
- Actions A2 and A5, for their part, set the variables "Speech Duration" (DP) and "Silence Duration" (DS) to "1" respectively. Finally, action A4 increments the variable DP.
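The counters and transitions described above can be sketched as a simplified automaton. The sketch below keeps only three of the five states (silence, speech presumption, speech) and uses illustrative default durations; the names DP, DS, "Speech Minimum" and "End Silence" follow the text, while everything else (class name, default values) is an assumption for illustration, not the patented implementation.

```python
# Simplified 3-state sketch of the decision automaton (silence -> speech
# presumption -> speech); the full automaton of the patent has five states.
SILENCE, PRESUMED, SPEECH = 1, 2, 3

class SpeechAutomaton:
    def __init__(self, min_speech=3, end_silence=15):
        self.state = SILENCE
        self.dp = 0                     # "Speech Duration" (DP)
        self.ds = 0                     # "Silence Duration" (DS)
        self.min_speech = min_speech    # "Speech Minimum" (condition C2)
        self.end_silence = end_silence  # "End Silence" (condition C3)

    def step(self, c1):
        """Advance by one 16 ms frame; c1 is the energy condition of the frame."""
        if self.state == SILENCE:
            if c1:                                   # first energetic frame (action A2)
                self.state, self.dp = PRESUMED, 1
        elif self.state == PRESUMED:
            if c1:
                self.dp += 1                         # action A4: increment DP
                if self.dp >= self.min_speech:       # condition C2 fulfilled
                    self.state = SPEECH              # speech start boundary
            else:
                self.state, self.dp = SILENCE, 0
        elif self.state == SPEECH:
            if c1:
                self.ds = 0
            else:
                self.ds += 1                         # count trailing non-energetic frames
                if self.ds >= self.end_silence:      # condition C3: end of speech
                    self.state, self.dp, self.ds = SILENCE, 0, 0
        return self.state
```

Feeding the automaton a run of energetic frames moves it from state 1 to state 3, and a sufficiently long run of non-energetic frames returns it to state 1, mirroring the speech-start and speech-end boundaries described above.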
- the condition C1 for changing states is based on a detection criterion which uses energy information from the frames of the input signal: energy information of a given frame of the input signal is compared with a predetermined threshold.
- in addition to condition C1, another condition (C4) is used, based on a second detection criterion using a voicing parameter.
- the speech detection system (14) comprises means for measuring the energy of the input signal, used to define the energy criterion of the condition C1.
- this energy criterion is based on the use of noise statistics. The classic assumption is made that the logarithm of the noise energy E(n) follows a normal law of parameters (μ, σ²).
- E(n) is the logarithm of the short-term energy of the noise, that is to say the logarithm of the sum of the squares of the samples of a given frame n of the input signal.
- the statistics of the logarithm of the noise energy are estimated while the automaton is in state 1 ("noise or silence").
- the mean and the standard deviation are estimated respectively by equations (1) and (2) which follow:
- λ = 0.995, which corresponds to a time constant of 3200 ms.
- threshold values between 1.5 and 3.5 can be used.
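Since equations (1) and (2) are not reproduced in this record, the Python sketch below assumes a standard exponential-forgetting estimator with the forgetting factor λ = 0.995 quoted above; the class name, the helper `log_energy`, and the threshold factor `ALPHA` are illustrative assumptions, not taken from the patent.

```python
import math

LAMBDA = 0.995  # forgetting factor quoted in the text (~3200 ms time constant)
ALPHA = 2.5     # threshold factor; the text suggests values between 1.5 and 3.5

def log_energy(frame):
    """E(n): logarithm of the short-term energy (sum of squared samples) of frame n."""
    return math.log(sum(s * s for s in frame) + 1e-12)

class NoiseStats:
    """Recursive estimate of the mean and variance of the log noise energy.

    Assumed form of equations (1) and (2): exponential forgetting, updated
    only while the automaton is in state 1 ("noise or silence").
    """
    def __init__(self, mu=0.0, var=1.0):
        self.mu = mu
        self.var = var

    def update(self, e):
        # assumed equation (1): recursive mean of E(n)
        self.mu = LAMBDA * self.mu + (1.0 - LAMBDA) * e
        # assumed equation (2): recursive variance of E(n)
        self.var = LAMBDA * self.var + (1.0 - LAMBDA) * (e - self.mu) ** 2

    def condition_c1(self, e):
        """C1: the frame's log energy exceeds the adaptive detection threshold."""
        return e > self.mu + ALPHA * math.sqrt(self.var)
```

With this formulation the detection threshold μ + α·σ adapts to the noise floor estimated during silence, which is what makes the energy criterion usable across communications with different noise levels.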
- this first criterion (the energy criterion) is based on the use of the energy information E(n) of the input signal.
- the speech in noise detection system further comprises means for calculating a voicing parameter which is associated with the energy information for the detection of the speech in noise.
- this parameter is calculated as follows.
- the voicing parameter used is estimated from the fundamental frequency.
- other types of voicing parameter obtained by other methods, can be used in the context of the present invention.
- the fundamental frequency is calculated using a spectral method. This method searches for the harmonicity of the signal by inter-correlation with a comb function, the distance between the teeth of the comb being varied.
- the period of the harmonics in the spectrum over the entire input signal is calculated at regular time intervals.
- the period of the harmonics in the spectrum is calculated every four milliseconds (ms) over the whole of the input signal, that is to say even in the non-speech periods.
- the period of the harmonics in the spectrum is the fundamental frequency.
- the term "fundamental frequency" is used in the rest of the description to designate the period of the harmonics in the spectrum.
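The harmonicity search described above can be sketched as follows. This is a generic illustration of a spectral comb search, not the patented implementation: the Hann window, the 1 Hz sweep step, the 50-400 Hz search range, and the 1500 Hz comb cap are all simplifying assumptions.

```python
import numpy as np

def comb_pitch(frame, fs, f0_min=50.0, f0_max=400.0):
    """Estimate F0 by correlating the magnitude spectrum with a harmonic comb
    whose tooth spacing is swept over candidate frequencies; the spacing that
    gives the strongest mean response is kept."""
    n = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(n)))
    df = fs / n                                          # spectral resolution in Hz
    best_f0, best_score = f0_min, -np.inf
    for f0 in np.arange(f0_min, f0_max, 1.0):
        teeth = np.arange(f0, min(fs / 2, 1500.0), f0)   # comb tooth positions
        idx = np.round(teeth / df).astype(int)           # nearest spectrum bins
        score = spectrum[idx].mean()                     # comb / spectrum correlation
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0
```

Normalizing by the number of teeth (the `.mean()`) penalizes sub-harmonic candidates, whose combs place teeth where the spectrum has no energy.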
- the median is then calculated between the current value of fundamental frequency and a predetermined number of previous values of fundamental frequency. In practice, in the chosen implementation, the median is calculated between the current value of fundamental frequency and the two previous values. The use of the median in particular makes it possible to eliminate certain errors in estimating the fundamental frequency.
- a median, med(m), is calculated for each of the subframes m of the input signal (audio signal).
- med(m) is the median calculated for subframe m.
- successive frames of the input signal of length 16 ms are considered, and a median value is calculated every 4 ms, that is to say for each subframe of length 4 ms.
- Δmed(m) = |med(m) − med(m − 1)|
- This average is a criterion for the local variation of the fundamental frequency. If the fundamental frequency varies little, the current frame is assumed to be a speech frame.
- the arithmetic mean of Δmed(m) therefore constitutes an estimate of a degree of voicing.
- FIG. 3 is a plot of curves representing the value of the voicing parameter calculated according to equation (6) above, as a function of the number of audio files of different types (speech, impulsive noises, background noises). More specifically, the curves in FIG. 3 represent the average of the degree of voicing measured on the basis of audio files recorded on PSTN and GSM networks.
- the voicing parameter makes it possible to discriminate speech from impulsive noises. Indeed, by applying for example a threshold of 15 to this value of the parameter, one can effectively distinguish speech from impulsive noises and background noise.
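The median-filtered variation measure can be sketched as follows. The 3-value median window and the mean of |med(m) − med(m−1)| follow the description above; the function names and the "voiced when below threshold" reading of the 15 mentioned for FIG. 3 are illustrative assumptions.

```python
import statistics

def voicing_degree(f0_track):
    """Mean local variation of the median-filtered F0 track (one value per
    4 ms subframe). med(m) is the median of the current F0 value and its two
    predecessors; the result is the arithmetic mean of |med(m) - med(m-1)|.
    Small values (slowly varying F0) indicate voiced speech."""
    medians = [
        statistics.median(f0_track[max(0, m - 2): m + 1])
        for m in range(len(f0_track))
    ]
    deltas = [abs(medians[m] - medians[m - 1]) for m in range(1, len(medians))]
    return sum(deltas) / len(deltas) if deltas else 0.0

def is_voiced(f0_track, threshold=15.0):
    """A frame is considered speech-like when the voicing measure stays below
    the threshold (15 in the example given for FIG. 3)."""
    return voicing_degree(f0_track) < threshold
```

A steady F0 track (voiced speech) yields a value near zero, while the erratic F0 estimates produced on impulsive noise yield large values, which is the discrimination illustrated in FIG. 3.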
- the use of this voicing parameter, in addition to the energy information of the input signal, to discriminate speech from noise, is implemented in the detection module (14) by the decision automaton described above in relation to FIG. 2.
- the joint use of the energy of the input signal and the voicing parameter then makes it possible to define a more precise criterion for triggering the transitions between all or part of the states of the automaton.
- FIG. 4 illustrates, by way of example, the insertion of the new criterion above based on a voicing parameter according to the invention in the state machine of FIG. 2.
- the present invention can therefore also apply to detection systems whose function is to detect only the start of speech.
- condition C4 compares the voicing measure Δmed, computed over the subframes of the current frame n, with a threshold (equation (7)).
- detection tests on a noisy part of the GSM base of audio files were used to determine the value "10" as the optimized value of this threshold. The threshold can also be adapted to the noise conditions present in the input signal, so as to guarantee precise detection whatever the acoustic environment.
- the combination of the new condition C4 with the condition C1 thus makes it possible to obtain a double detection criterion, based on a measurement of the energy of the input signal and on a measurement of the voicing.
- the GSM_T base is a laboratory base recorded on a GSM network in four different environments: interior, exterior, stationary vehicle and moving vehicle. Normally each word is repeated only once, except if there is a loud noise during the pronunciation of the word. The numbers of occurrences of each word are therefore substantially identical.
- the vocabulary consists of 65 words. The 29,558 segments from manual segmentation are divided into 85% vocabulary words, 3% non-vocabulary words and 12% noises.
- the GSM_T base is composed of two sub-bases defined according to the signal-to-noise ratio (SNR) of each file making up these sub-bases.
- SNR signal to noise ratio
- the AGORA base is a base for experimenting with a man-machine dialogue application, recorded on a PSTN switched network. It is therefore a continuous speech base.
- the AGORA base is mainly used as a test base. It is composed of 64 records.
- the 3,115 reference segments include 12,635 words.
- the vocabulary of the recognition model is 1633 words. There are no non-vocabulary word segments for this base.
- the speech segments constitute 81% of the reference segments and the noise segments 19%.
- the results of the speech detection alone are considered first, then the results of this detection in the context of voice recognition, by studying the results obtained by the speech recognition system.
- the results of the detection alone are studied by considering the final error rate as a function of the rejectable error rate.
- the final errors generated by the detection module are composed of omitted speech, fragmentations of a word or a sentence, and groupings of several words or several sentences. These errors are said to be "final" because they cause definitive errors at the level of the recognition module.
- the rejectable errors generated by the detection module are composed of noise insertions (or noise detections). A rejectable error can be rejected by a rejection model incorporated in the decision module (FIG. 1, 13) of the recognition module. Otherwise, it causes a voice recognition error.
- the approach consisting in evaluating the detection module alone makes it possible to place oneself in a context independent of voice recognition.
- the results of the recognition system using a detection module according to the invention are studied by considering three types of error in the case of recognition of isolated words, and four types of error in the case of continuous speech recognition.
- substitution error represents a word of the vocabulary recognized as being another word of the vocabulary.
- False acceptance error is a noise detection recognized as being a word.
- Wrongly rejected error is the rejection of a word from the vocabulary by the rejection model or corresponds to a word not detected by the detection module.
- a so-called "insertion” error concerns a word inserted in a sentence (or request)
- a so-called “omission” error concerns a word omitted in a sentence
- a so-called "substitution" error concerns a substituted word in a sentence;
- a so-called "wrongly rejected" error concerns a sentence wrongly rejected by the rejection model, or not detected by the detection module.
- These wrongly rejected errors are expressed by a rate of omitting words in sentences.
- the insertion, omission and substitution errors are plotted as a function of the wrongly rejected errors.
- FIG. 5 is a graphic representation of the results obtained by a detection module according to the invention on the GSM_T database of audio files recorded on a GSM network.
- the curves of FIG. 5 represent, for each sub-base (noisy and non-noisy) of the GSM_T base, the results obtained with the detection automaton of FIG. 2 (condition C1 only), and the results obtained using the detection automaton modified according to FIG. 4 (combination of conditions C1 and C4).
- the results are expressed as a rejectable error rate compared to the final error rate. For a given rejectable error rate, the lower the final error rate, the better the performance obtained.
- curves 51 and 52 correspond to the results obtained with the "noiseless" sub-base, that is to say corresponding to a signal to noise ratio (SNR) greater than 18 decibels (dB).
- SNR signal to noise ratio
- the curves 53, 54 correspond to the results obtained with the "noisy" sub-base, that is to say corresponding to an SNR less than 18 dB.
- curves 51, 53 correspond to the use of the "energy" criterion alone, based on the energy of the input signal (condition C1), while curves 52, 54 correspond to the joint use of the energy criterion and the voicing criterion (conditions C1 and C4).
- FIG. 6 represents the results obtained with a detection module according to the invention on the AGORA continuous speech base of audio files recorded on a PSTN network.
- FIG. 7 is a graphic representation of the results obtained by a voice recognition system integrating a speech detection module according to the invention, on the AGORA base of audio files recorded on a PSTN network. These results were obtained using the optimal thresholds for recognition. For recognition, the results are assessed by comparing the wrongly rejected rate with the rate of omission, insertion and substitution errors of words.
- curve 71 represents the results obtained with the sole use of the energy criterion (condition C1); while curve 72 represents the results obtained with the joint use of the energy criterion and the voicing criterion (conditions C1 and C4). It can be observed that the results (curve 72) on the voice recognition are also better with the use of the double energy-voicing criterion for the detection module.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR0115685 | 2001-12-05 | ||
FR0115685A FR2833103B1 (fr) | 2001-12-05 | 2001-12-05 | Systeme de detection de parole dans le bruit |
PCT/FR2002/003910 WO2003048711A2 (fr) | 2001-12-05 | 2002-11-15 | System de detection de parole dans un signal audio en environnement bruite |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1451548A2 true EP1451548A2 (de) | 2004-09-01 |
Family
ID=8870113
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP02788059A Withdrawn EP1451548A2 (de) | 2001-12-05 | 2002-11-15 | Einrichtung zur sprachdetektion in einem audiosignal bei lauter umgebung |
Country Status (5)
Country | Link |
---|---|
US (1) | US7359856B2 (de) |
EP (1) | EP1451548A2 (de) |
AU (1) | AU2002352339A1 (de) |
FR (1) | FR2833103B1 (de) |
WO (1) | WO2003048711A2 (de) |
Families Citing this family (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2856506B1 (fr) * | 2003-06-23 | 2005-12-02 | France Telecom | Procede et dispositif de detection de parole dans un signal audio |
FR2864319A1 (fr) * | 2005-01-19 | 2005-06-24 | France Telecom | Procede et dispositif de detection de parole dans un signal audio |
CN1815550A (zh) * | 2005-02-01 | 2006-08-09 | 松下电器产业株式会社 | 可识别环境中的语音与非语音的方法及系统 |
US8175877B2 (en) * | 2005-02-02 | 2012-05-08 | At&T Intellectual Property Ii, L.P. | Method and apparatus for predicting word accuracy in automatic speech recognition systems |
GB2450886B (en) * | 2007-07-10 | 2009-12-16 | Motorola Inc | Voice activity detector and a method of operation |
KR100930039B1 (ko) * | 2007-12-18 | 2009-12-07 | 한국전자통신연구원 | 음성 인식기의 성능 평가 장치 및 그 방법 |
US8380497B2 (en) * | 2008-10-15 | 2013-02-19 | Qualcomm Incorporated | Methods and apparatus for noise estimation |
US8938389B2 (en) * | 2008-12-17 | 2015-01-20 | Nec Corporation | Voice activity detector, voice activity detection program, and parameter adjusting method |
CN102667927B (zh) * | 2009-10-19 | 2013-05-08 | 瑞典爱立信有限公司 | 语音活动检测的方法和背景估计器 |
KR20140026229A (ko) * | 2010-04-22 | 2014-03-05 | 퀄컴 인코포레이티드 | 음성 액티비티 검출 |
CN102237081B (zh) * | 2010-04-30 | 2013-04-24 | 国际商业机器公司 | 语音韵律评估方法与系统 |
US8898058B2 (en) | 2010-10-25 | 2014-11-25 | Qualcomm Incorporated | Systems, methods, and apparatus for voice activity detection |
JP5747562B2 (ja) * | 2010-10-28 | 2015-07-15 | ヤマハ株式会社 | 音響処理装置 |
US20150281853A1 (en) * | 2011-07-11 | 2015-10-01 | SoundFest, Inc. | Systems and methods for enhancing targeted audibility |
KR20140147587A (ko) * | 2013-06-20 | 2014-12-30 | 한국전자통신연구원 | Wfst를 이용한 음성 끝점 검출 장치 및 방법 |
WO2015098079A1 (ja) * | 2013-12-26 | 2015-07-02 | パナソニックIpマネジメント株式会社 | 音声認識処理装置、音声認識処理方法、および表示装置 |
ES2664348T3 (es) | 2014-07-29 | 2018-04-19 | Telefonaktiebolaget Lm Ericsson (Publ) | Estimación de ruido de fondo en señales de audio |
CN111739515B (zh) * | 2019-09-18 | 2023-08-04 | 北京京东尚科信息技术有限公司 | 语音识别方法、设备、电子设备和服务器、相关系统 |
KR20210089347A (ko) * | 2020-01-08 | 2021-07-16 | 엘지전자 주식회사 | 음성 인식 장치 및 음성데이터를 학습하는 방법 |
CN111599377B (zh) * | 2020-04-03 | 2023-03-31 | 厦门快商通科技股份有限公司 | 基于音频识别的设备状态检测方法、系统及移动终端 |
CN111554314A (zh) * | 2020-05-15 | 2020-08-18 | 腾讯科技(深圳)有限公司 | 噪声检测方法、装置、终端及存储介质 |
CN115602152B (zh) * | 2022-12-14 | 2023-02-28 | 成都启英泰伦科技有限公司 | 一种基于多阶段注意力网络的语音增强方法 |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4696039A (en) * | 1983-10-13 | 1987-09-22 | Texas Instruments Incorporated | Speech analysis/synthesis system with silence suppression |
US5276765A (en) * | 1988-03-11 | 1994-01-04 | British Telecommunications Public Limited Company | Voice activity detection |
US5579431A (en) * | 1992-10-05 | 1996-11-26 | Panasonic Technologies, Inc. | Speech detection in presence of noise by determining variance over time of frequency band limited energy |
US5598466A (en) * | 1995-08-28 | 1997-01-28 | Intel Corporation | Voice activity detector for half-duplex audio communication system |
JPH0990974A (ja) * | 1995-09-25 | 1997-04-04 | Nippon Telegr & Teleph Corp <Ntt> | 信号処理方法 |
US5819217A (en) * | 1995-12-21 | 1998-10-06 | Nynex Science & Technology, Inc. | Method and system for differentiating between speech and noise |
US5890109A (en) * | 1996-03-28 | 1999-03-30 | Intel Corporation | Re-initializing adaptive parameters for encoding audio signals |
US6023674A (en) * | 1998-01-23 | 2000-02-08 | Telefonaktiebolaget L M Ericsson | Non-parametric voice activity detection |
US6122531A (en) * | 1998-07-31 | 2000-09-19 | Motorola, Inc. | Method for selectively including leading fricative sounds in a portable communication device operated in a speakerphone mode |
US6327564B1 (en) * | 1999-03-05 | 2001-12-04 | Matsushita Electric Corporation Of America | Speech detection using stochastic confidence measures on the frequency spectrum |
US6775649B1 (en) * | 1999-09-01 | 2004-08-10 | Texas Instruments Incorporated | Concealment of frame erasures for speech transmission and storage system and method |
- 2001
  - 2001-12-05 FR FR0115685A patent/FR2833103B1/fr not_active Expired - Fee Related
- 2002
  - 2002-11-15 AU AU2002352339A patent/AU2002352339A1/en not_active Abandoned
  - 2002-11-15 US US10/497,874 patent/US7359856B2/en not_active Expired - Fee Related
  - 2002-11-15 EP EP02788059A patent/EP1451548A2/de not_active Withdrawn
  - 2002-11-15 WO PCT/FR2002/003910 patent/WO2003048711A2/fr not_active Application Discontinuation
Non-Patent Citations (3)
Title |
---|
JUNQUA J.-C.; MAK B.; REAVES B.: "A robust algorithm for word boundary detection in the presence of noise", IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, vol. 2, no. 3, July 1994 (1994-07-01), USA, pages 406 - 412 * |
LAMEL L. F.; RABINER L. R.; ROSENBERG A. E.; WILPON J. G.: "An improved endpoint detector for isolated word recognition", IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, vol. ASSP-29, no. 4, August 1981 (1981-08-01), pages 777 - 785, XP002062762, DOI: 10.1109/TASSP.1981.1163642 * |
See also references of WO03048711A3 * |
Also Published As
Publication number | Publication date |
---|---|
WO2003048711A2 (fr) | 2003-06-12 |
US20050143978A1 (en) | 2005-06-30 |
AU2002352339A1 (en) | 2003-06-17 |
FR2833103A1 (fr) | 2003-06-06 |
WO2003048711A3 (fr) | 2004-02-12 |
FR2833103B1 (fr) | 2004-07-09 |
US7359856B2 (en) | 2008-04-15 |
AU2002352339A8 (en) | 2003-06-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1451548A2 (de) | Device for speech detection in an audio signal in a noisy environment | |
EP1154405B1 (de) | Method and device for speech recognition in an environment with a variable noise level | |
EP2415047B1 (de) | Classifying background noise contained in an audio signal | |
EP0867856B1 (de) | Method and device for speech detection | |
JP4568371B2 (ja) | Computerized method and computer program for distinguishing between at least two event classes | |
EP2419900B1 (de) | Method and device for the objective evaluation of the voice quality of a speech signal taking into account the classification of the background noise contained in the signal | |
CA2404441C (fr) | Robust parameters for noisy speech recognition | |
EP3078027A1 (de) | Voice detection method | |
EP1131813B1 (de) | Method and device for speech recognition of an acoustic signal affected by interference | |
EP3627510A1 (de) | Filtering of an audio signal captured by a voice recognition system | |
Odriozola et al. | An on-line VAD based on Multi-Normalisation Scoring (MNS) of observation likelihoods | |
Martin et al. | Robust speech/non-speech detection based on LDA-derived parameter and voicing parameter for speech recognition in noisy environments | |
Skorik et al. | On a cepstrum-based speech detector robust to white noise | |
EP1665231B1 (de) | Method for unsupervised doping and rejection of words outside a speech recognition vocabulary | |
FR2856506A1 (fr) | Method and device for speech detection in an audio signal | |
FR2864319A1 (fr) | Method and device for speech detection in an audio signal | |
FR2823361A1 (fr) | Method and device for acoustic extraction of a voice signal | |
WO2001091106A1 (fr) | Adaptive analysis windows for speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
AK | Designated contracting states |
Kind code of ref document: A2 |
Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LI LU MC NL PT SE SK TR |
|
17P | Request for examination filed |
Effective date: 20040617 |
|
RIN1 | Information on inventor provided before grant (corrected) |
Inventor name: MARTIN, ARNAUD |
Inventor name: MAUUARY, LAURENT |
|
17Q | First examination report despatched |
Effective date: 20070620 |
|
RAP1 | Party data changed (applicant data changed or rights of an application transferred) |
Owner name: ORANGE |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
RIC1 | Information provided on ipc code assigned before grant |
Ipc: G10L 25/78 20130101AFI20160712BHEP |
|
INTG | Intention to grant announced |
Effective date: 20160805 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: GRANT OF PATENT IS INTENDED |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20161216 |
|