CN1902684A - Method and device for processing a voice signal for robust speech recognition - Google Patents
Method and device for processing a voice signal for robust speech recognition
- Publication number
- CN1902684A CN200480040358.1A
- Authority
- CN
- China
- Prior art keywords
- noise
- voice signal
- voice
- signal
- normalized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 36
- 238000012545 processing Methods 0.000 title abstract description 7
- 238000012549 training Methods 0.000 claims description 30
- 239000013598 vector Substances 0.000 claims description 27
- 238000004891 communication Methods 0.000 claims description 8
- 230000000694 effects Effects 0.000 claims description 7
- 238000001514 detection method Methods 0.000 claims description 5
- 230000005540 biological transmission Effects 0.000 claims description 4
- 230000006835 compression Effects 0.000 claims description 4
- 238000007906 compression Methods 0.000 claims description 4
- 238000010606 normalization Methods 0.000 abstract description 21
- 230000009467 reduction Effects 0.000 abstract description 3
- 230000006872 improvement Effects 0.000 description 5
- 239000005441 aurora Substances 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 238000012360 testing method Methods 0.000 description 3
- 230000000712 assembly Effects 0.000 description 2
- 238000000429 assembly Methods 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 238000010295 mobile communication Methods 0.000 description 2
- 238000012935 Averaging Methods 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 230000006837 decompression Effects 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000002349 favourable effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
- 238000003786 synthesis reaction Methods 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
- Noise Elimination (AREA)
Abstract
The invention relates to methods for processing a speech signal (S) for subsequent speech recognition (SR), said speech signal being corrupted by noise and representing at least one voice command. Said methods comprise the following steps: a) recording of the noisy speech signal (S); b) application of noise reduction (NR) to the speech signal (S) to generate a noise-reduced speech signal (S'); c) normalization of the noise-reduced speech signal (S') to a target signal value with the aid of a normalization factor, to generate a noise-reduced, normalized speech signal (S'').
Description
The present invention relates to a method and an apparatus for processing a noisy speech signal for subsequent speech recognition.
Speech recognition is used more and more to simplify the operation of electrical devices. To perform speech recognition, a so-called acoustic model must be created. For this purpose voice commands are trained, which in the case of speaker-independent speech recognition can, for example, be done at the factory. Here, "training" means creating so-called feature vectors that describe a voice command, based on the command being spoken several times. These feature vectors (also referred to as prototypes) are then collected in an acoustic model, for example a so-called HMM (hidden Markov model). During recognition, the acoustic model is used to determine, for a given sequence of feature vectors, the probability of a voice command or word selected from the vocabulary under investigation.
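Purely as an illustration of how an acoustic model assigns a probability to a sequence of feature vectors — the patent does not specify a model topology — the following sketch scores a sequence against a small Gaussian HMM using the forward algorithm; the two-state model, its parameters and the random feature vectors are invented for this example.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var):
    # Log-density of one feature vector under a diagonal-covariance Gaussian.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_likelihood(features, log_pi, log_A, means, variances):
    # Forward algorithm in the log domain: log P(feature sequence | HMM).
    log_alpha = log_pi + np.array(
        [gaussian_log_pdf(features[0], m, v) for m, v in zip(means, variances)])
    for x in features[1:]:
        emit = np.array(
            [gaussian_log_pdf(x, m, v) for m, v in zip(means, variances)])
        log_alpha = emit + np.logaddexp.reduce(log_alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(log_alpha)

means = [np.zeros(13), np.ones(13)]            # per-state prototype vectors
variances = [np.ones(13), np.ones(13)]
log_pi = np.log([0.9, 0.1])                    # initial state probabilities
log_A = np.log([[0.8, 0.2], [1e-12, 1.0]])     # left-to-right transition matrix
features = np.random.default_rng(0).normal(size=(20, 13))  # e.g. 20 cepstral vectors
print(log_likelihood(features, log_pi, log_A, means, variances))
```

During recognition, such a score would be computed against the model of each vocabulary entry and the best-scoring entry chosen.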
For recognizing fluent speech, a so-called language model is also used in addition to the acoustic model; the language model specifies the probability of successive words in the speech to be recognized.
A current goal in improving speech recognition is to achieve ever better recognition rates, that is, to increase the probability that a word or voice command spoken by the user of a mobile communication device is actually recognized as that word or voice command.
Since speech recognition is used in many situations, it is also employed in noisy environments. In such cases the recognition rate drops significantly, because the feature vectors stored in the acoustic model, for example an HMM, were created from clean, i.e. noise-free, speech. This leads to unsatisfactory speech recognition in noisy and busy environments, for example on the street, in buildings visited by many people, or in a car.
Starting from this prior art, the object of the present invention is to make speech recognition with a high recognition rate possible even in noisy environments.
This object is achieved by the independent claims. Advantageous developments are the subject matter of the dependent claims.
The core of the invention is that the speech signal is processed before it is fed, for example, into speech recognition. Within this processing the speech signal is subjected to noise suppression and is subsequently normalized with respect to its signal level. The speech signal contains one or more voice commands.
This has the advantage that the recognition rate for voice commands in a noisy speech signal preprocessed in this way is higher than the recognition rate obtained with conventional speech recognition on a noisy speech signal.
Optionally, after noise suppression the speech signal can also be fed into a unit for determining speech activity. Based on the noise-reduced speech signal it is then determined whether speech or a speech pause is present, and the normalization factor for the signal level normalization is determined accordingly. In particular, the normalization factor can be chosen such that speech pauses are suppressed more strongly. The difference between speech signal segments containing speech and those containing no speech (speech pauses) thus becomes more pronounced, which makes speech recognition easier.
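A minimal sketch of this processing chain, assuming a trivial placeholder for the noise suppression, a crude energy-based speech activity decision and an invented target level; it only illustrates the order of the steps and is not the method's actual implementation.

```python
import numpy as np

def noise_suppression(x, alpha=0.97):
    # Placeholder only (simple pre-emphasis high-pass); the description refers
    # to frequency-domain noise reduction methods for the real step NR.
    return np.append(x[0], x[1:] - alpha * x[:-1])

def energy_vad(x, frame_len=160, threshold=0.02):
    # Crude speech/pause decision per frame from the frame RMS.
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1)) > threshold

def normalize(x, vad, frame_len=160, target_rms=0.05, pause_gain=0.3):
    # One normalization factor towards the target value, reduced in pauses so
    # that the speech/pause contrast grows.
    n = len(vad)
    frames = x[:n * frame_len].reshape(n, frame_len)
    speech_rms = np.sqrt(np.mean(frames[vad] ** 2)) + 1e-12
    factor = target_rms / speech_rms
    gains = np.where(vad, factor, pause_gain * factor)
    return (frames * gains[:, None]).ravel()

fs = 8000
t = np.arange(fs) / fs
s = 0.2 * np.sin(2 * np.pi * 440 * t) * (t > 0.25) + 0.005 * np.random.randn(fs)
s1 = noise_suppression(s)          # S'  (noise-suppressed)
vad = energy_vad(s1)               # speech activity decisions
s2 = normalize(s1, vad)            # S'' (noise-suppressed, normalized)
```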
A method with the above features can also be used in a so-called distributed speech recognition system. A distributed speech recognition system is characterized in that not all steps of the speech recognition are performed on the same component, so more than one component is required. For example, one component can be a communication device and another component a unit of a communication network. In this case the speech signal is detected, for example, in the communication device embodied as a mobile station, while the actual speech recognition is carried out on the network side in a network unit.
The method can be used not only in speech recognition but also in the creation of the acoustic model, for example an HMM. Applying it when creating the acoustic model, in combination with speech recognition based on signals preprocessed according to the invention, yields a further increase in the recognition rate.
Further advantages are described with reference to selected embodiments, which are also illustrated in the drawings.
Fig. 1 shows a histogram for the case of training for creating an acoustic model, in which a speech signal containing one or more voice commands is plotted against its signal level;
Fig. 2 shows the histogram of a speech signal against its signal level for the case of speech recognition;
Fig. 3 shows a schematic embodiment of the processing according to the invention;
Fig. 4 shows a histogram in which a noise-reduced and level-normalized speech signal is plotted against the speech signal level;
Fig. 5 shows a histogram in which a noise-reduced speech signal is plotted against the signal level;
Fig. 6 shows a histogram for the case in which the speech signal is preprocessed according to the invention during training;
Fig. 7 shows a scheme for distributed speech processing;
Fig. 8 shows an electrical device that can be used within the scope of distributed speech processing.
Fig. 8 shows an electrical device embodied as a mobile phone or mobile station MS. The device has a microphone M for picking up a speech signal containing voice commands, a processor unit CPU for processing the speech signal, and a radio interface FS for transmitting data (for example the processed speech signal).
The device can perform speech recognition of the received or detected voice commands either on its own or together with other components.
First, the detailed investigations that led to the present invention are described:
Fig. 1 shows a histogram in which a speech signal containing one or more voice commands is sorted according to its signal level L, and the frequency H is plotted against the signal level L. As in the following figures, the speech signal S contains, for example, one or more voice commands. For simplicity it is assumed below that the speech signal contains one voice command. In an electrical device embodied as a mobile phone, a voice command can, for example, consist of the request "call" and optionally a specific name. For speech recognition the voice command must be trained, that is, one or more (i.e. possibly more than one) feature vectors are created from the command being spoken several times. This training is carried out when creating the acoustic model, for example an HMM, which is done at the manufacturer's side. These feature vectors are later used for speech recognition.
The training of the voice command used to create the feature vectors is carried out at a fixed signal or loudness level ("single-level training"). To make the best use of the dynamic range of the A/D converter that converts the speech signal into a digital signal, work is preferably done at -26 dB. Signal levels are given in decibels (dB), with 0 dB denoting overload (i.e. exceeding the maximum loudness or maximum level). Alternatively, instead of "single-level training", training can also be carried out at several signal levels, for example at -16, -26 and -36 dB.
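A small worked example of this level convention (0 dB = overload of the A/D converter, i.e. digital full scale); the helper names and the sine test signal are illustrative, not taken from the patent.

```python
import numpy as np

def level_db(x):
    # RMS level relative to full scale (1.0 = overload).
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def normalize_to(x, target_db=-26.0):
    # Single multiplicative normalization factor towards the target level.
    factor = 10 ** ((target_db - level_db(x)) / 20)
    return factor * x

x = 0.05 * np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
print(round(level_db(x), 1))                # about -29.0 dB
print(round(level_db(normalize_to(x)), 1))  # -26.0 dB
```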
Fig. 1 thus shows the frequency distribution of the speech level when the voice command is used for training.
For the voice command an average signal value X_mean and a certain distribution of the speech signal levels are obtained. This can be described as a Gaussian function with mean signal level X_mean and variance σ.
While Fig. 1 shows the distribution for the voice command during training, Fig. 2 shows the situation during speech recognition, again with the frequency H plotted against the signal level L as in Fig. 1: here the speech signal S' (as in the later figures) containing one or more voice commands is sorted according to its signal level L and the frequency H is plotted. Owing to environmental influences, a distribution is obtained, even after noise suppression NR has been applied (cf. Fig. 3), that is shifted with respect to the training in Fig. 1 and has an average signal level x_mean offset from the training mean X_mean.

The investigations showed that the recognition rate drops significantly because of this shifted average signal level x_mean.
This can be seen from Table 1 below:
Table 1: Training with clean speech at different loudness or signal levels (multi-level training). The recognition rates relate to test speech normalized to signal levels of -16, -26 and -36 dB.
Recognition rate [%] by noise type, for clean speech and for speech at a 5 dB signal-to-noise ratio:

Test speech level | Subway, clean | Subway, 5 dB | Babble, clean | Babble, 5 dB | Car, clean | Car, 5 dB | Exhibition, clean | Exhibition, 5 dB
---|---|---|---|---|---|---|---|---
-16 dB | 98.83 | 80.14 | 98.79 | 66.99 | 98.72 | 88.01 | 99.11 | 79.76
-26 dB | 99.14 | 85.66 | 99.15 | 76.66 | 99.19 | 91.35 | 99.35 | 85.00
-36 dB | 99.39 | 85.05 | 99.21 | 82.41 | 99.28 | 89.41 | 99.57 | 85.47
Table 1 lists the recognition rates (word accuracies) obtained when training was performed with noise-free speech ("clean speech") at different loudness levels and testing was performed in different noise environments. The test speech, i.e. the speech signal of Fig. 1, was normalized to three different levels: -16 dB, -26 dB and -36 dB. For these test speech levels the recognition rates are given for different noise types at a signal-to-noise ratio of 5 dB. The noise types are typical ambient noises: background noise in the subway, so-called babble noise (i.e. a café-like environment with voices and other sounds), car noise, and an exhibition-hall environment (i.e. similar to babble noise but aggravated by possible loudspeaker announcements, music, etc.). Table 1 shows that for clean speech the recognition rate is largely unaffected by the test speech level. For noisy speech, however, the recognition rate drops significantly. Speech recognition was performed here using the terminal-based preprocessing AFE described below, which is used to create the feature vectors.
Even so, the recognition rates examined in Table 1, while still unsatisfactory, are a considerable improvement over speech recognition based on training at only a single loudness level.
In other words, the influence of ambient noise degrades an acoustic model created from training speech at only one loudness level even more severely.
This leads to the improvement according to the invention described below:
Fig. 3 shows the flow of one embodiment of the invention. The voice command or speech signal S, for example words spoken by a person, is subjected to noise suppression NR. After this noise suppression NR, the noise-suppressed speech signal S' is available.
Subsequently, the noise-reduced speech signal S' undergoes signal level normalization SLN of its signal value. This normalization produces a signal value comparable to the average signal value denoted X_mean in Fig. 1. It has been shown that a higher recognition rate is achieved when the signal averages are comparable; in other words, shifting the signal value improves the recognition rate.
After the signal value normalization SLN, the normalized, noise-reduced speech signal S'' is available. It can then be used, for example, in subsequent speech recognition SR with a higher recognition rate, even when the test speech was initially noisy.
Optionally, the noise-reduced signal S' is branched off and, in addition to the signal value normalization SLN, also fed into a speech activity detection unit ("Voice Activity Detection", VAD). Depending on whether speech or a speech pause is present, the normalization value with which the noise-reduced speech signal S' is normalized is adjusted. For example, a smaller multiplicative normalization factor can be used during speech pauses, so that the signal level of the noise-reduced speech signal S' is reduced more strongly during pauses than while speech is present. This achieves a greater difference between speech, i.e. for example individual voice commands, and speech pauses, which further noticeably improves the downstream speech recognition in terms of recognition rate.
It can further be provided that the normalization factor is changed not only between speech pauses and speech segments, but also within the spoken material between different speech segments. This, too, can improve speech recognition, because some speech segments have a very high signal level owing to the phonemes they contain, for example plosives (such as "p"), while other speech segments are inherently rather soft.
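Such a segment-dependent factor could, for example, be realized as sketched below: frame-wise gains that blend a global factor with a per-frame factor (so that loud frames, e.g. plosives, receive a somewhat smaller gain), smoothed across frames, with pause frames still attenuated. All constants are illustrative assumptions; the patent does not prescribe any particular rule.

```python
import numpy as np

def segmentwise_normalize(x, vad, frame_len=160, target_rms=0.05,
                          pause_gain=0.3, blend=0.7, smooth=5):
    # vad: boolean speech/pause decision per frame (e.g. from an energy VAD).
    n = len(vad)
    frames = x[:n * frame_len].reshape(n, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-12
    global_factor = target_rms / np.sqrt(np.mean(frames[vad] ** 2))
    frame_factor = target_rms / rms                 # smaller for loud frames
    gain = blend * global_factor + (1 - blend) * frame_factor
    gain = np.convolve(gain, np.ones(smooth) / smooth, mode='same')
    gain = np.where(vad, gain, pause_gain * global_factor)
    return (frames * gain[:, None]).ravel()

# Example input: 0.3 s of near-silence followed by a louder tone.
fs = 8000
x = np.concatenate([0.005 * np.random.randn(2400),
                    0.2 * np.sin(2 * np.pi * 440 * np.arange(fs) / fs)])
vad = np.sqrt(np.mean(x[:len(x) // 160 * 160].reshape(-1, 160) ** 2, axis=1)) > 0.02
y = segmentwise_normalize(x, vad)
```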
Various methods can be considered for the signal level normalization, for example real-time energy normalization as described in section C (pages 149-150) of the article "Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition" by Qi Li et al., IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 3, March 2002. In addition, a signal level normalization method has been described within the scope of the ITU; it can be found in ITU-T "SVP56: The Speech Voltmeter", Software Tool Library 2000 User's Manual (pages 151-161, Geneva, Switzerland, December 2000). The normalization described there works "off-line" or in so-called "batch mode", i.e. not synchronously or simultaneously with the speech detection.
Various known methods are available for the noise reduction or noise suppression (cf. Fig. 3), for example methods that likewise operate in the frequency domain. One such method is described by Ch. Beaugeant et al. in "Computationally efficient speech enhancement using RLS and psycho-acoustic motivated algorithm", Proceedings of the 6th World Multi-conference on Systemics, Cybernetics and Informatics (Orlando, 2002). The system described there is based on an analysis-by-synthesis (Analyse-durch-Synthese) approach in which the parameters describing the (clean) speech signal and the noise signal are extracted recursively frame by frame (cf. the second section "Noise Reduction in the Frequency Domain" and the third section "Recursive implementation of the least square algorithm" therein). The clean speech signal obtained in this way is weighted (cf. the fourth section "Practical RLS Weighting Rule"), and the power of the noise signal is estimated (cf. the fifth section "Noise Power Estimation"). Optionally, the results can be improved by means of a psychoacoustically motivated method (sixth section: "Psychoacoustic motivated method"). Other noise reduction methods that can be considered for the embodiment of Fig. 3 are described, for example, in section 5.1 ("Noise Reduction") of ETSI ES 202 050 V1.1.1 of October 2002.
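The cited methods operate on the noisy signal in the frequency domain. As a generic illustration of that principle only — not the RLS/psychoacoustic method of the cited paper and not the ETSI scheme — the following sketch performs a plain magnitude spectral subtraction; the frame parameters, the spectral floor and the assumption that the first frames contain only noise are all illustrative.

```python
import numpy as np

def spectral_subtraction(x, frame_len=256, hop=128, noise_frames=10, floor=0.05):
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    noise_mag = np.mean(np.abs(spectra[:noise_frames]), axis=0)   # noise estimate
    mag = np.maximum(np.abs(spectra) - noise_mag, floor * np.abs(spectra))
    cleaned = np.fft.irfft(mag * np.exp(1j * np.angle(spectra)), n=frame_len, axis=1)
    # Weighted overlap-add resynthesis.
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += cleaned[i] * window
        norm[i * hop:i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

fs = 8000
t = np.arange(2 * fs) / fs
noisy = np.sin(2 * np.pi * 300 * t) * (t > 0.5) + 0.1 * np.random.randn(len(t))
denoised = spectral_subtraction(noisy)   # first 0.5 s is noise-only here
```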
The frequency distributions in Fig. 1 (training) and Fig. 2 (test case, i.e. for speech recognition) are based on the speech signal S without noise suppression NR or signal level normalization SLN. The frequency distribution in Fig. 5 is based on the noise-reduced speech signal S'. The distributions in Fig. 4 (test case) and Fig. 6 (training) are based on the noise-reduced and level-normalized signal.
The idea underlying the schematic flow shown in Fig. 3 for processing the speech signal for downstream speech recognition is illustrated in Figs. 4 to 6.
Fig. 5 shows the frequency distribution of the noise-reduced speech signal S', as it occurs, for example, after the noise suppression NR in Fig. 3. In contrast to Fig. 2, the noise suppression NR has thus already been carried out; Fig. 2 relates, for example, to the frequency distribution of the speech signal S shown in Fig. 3.
The frequency distribution of this noise-reduced speech signal S' over the speech level L is centered at a mean value x_mean' and has a width σ'. In the transition to Fig. 4, the signal level normalization SLN is applied to the noise-reduced speech signal S' shown in Fig. 5. The distribution in Fig. 4 is therefore based on a speech signal corresponding, for example, to the noise-reduced and level-normalized speech signal S''. The signal level normalization brings the actual signal level of Fig. 5 to the desired signal level, for example the level denoted X_mean in Fig. 1 that was reached in training. In addition, the signal level normalization SLN makes the distribution narrower, i.e. σ'' is smaller than σ'. As a result, the average signal level X_mean reached in training (Fig. 1) covers the average signal level x_mean'' in Fig. 4 more easily. This leads to a higher recognition rate.
The application of the above aspects to speech recognition is now discussed with reference to Fig. 7. As explained at the beginning, speech recognition can be carried out in one component or distributed over several components.
For example, the device for detecting the speech signal (e.g. the microphone M shown in Fig. 8), the device for noise suppression NR and the device for signal level normalization SLN can be located in the electrical device MS embodied as a mobile station. These devices can be implemented within the processor unit CPU. The idea of processing the speech signal according to the embodiment of the invention shown in Fig. 3, and the subsequent speech recognition, can thus be realized in a mobile radio device or mobile station alone or in combination with a unit of the communication network.
According to one alternative, the speech recognition SR (cf. Fig. 3) is performed on the network side. For this purpose, the feature vectors created from the speech signal S'' are transmitted via a channel, in particular a radio channel, to a central unit in the network. There, speech recognition is carried out on the basis of the transmitted feature vectors using a model created in particular at the factory. "At the factory" can in particular mean that the acoustic model is created by the network operator.
In particular, the proposed speech recognition can be applied to speaker-independent speech recognition, as carried out within the scope of the so-called Aurora framework.
A further improvement is obtained if the voice commands are also normalized with respect to their signal level when the acoustic model is created at the factory, i.e. during training. The distribution of the signal levels then becomes narrower, so that a better match is achieved between the distribution shown in Fig. 4 and the distribution reached in training. Fig. 6 shows this distribution of the frequency H over the signal level L for a voice command in training, where signal level normalization was carried out during training. The resulting training mean value X_mean_new coincides with the mean value x_mean'' of the noise-reduced and level-normalized speech signal S'' (Fig. 3, Fig. 4). As explained, matching mean values are one of the criteria for a high recognition rate. In addition, the distribution in Fig. 6 is very narrow, which makes it easier for it to be covered by the distribution in Fig. 4, i.e. for both distributions to reach the same signal level.
Fig. 7 shows distributed speech recognition ("Distributed Speech Recognition", DSR). Distributed speech recognition can be applied, for example, within the scope of the already mentioned AURORA project of ETSI STQ (Speech Transmission Quality).
In distributed speech recognition, the speech signal, for example a voice command, is detected in one unit and feature vectors describing this speech signal are created there. These feature vectors are transmitted to another unit, for example a network server, where they are processed and speech recognition is carried out on their basis.
Fig. 7 shows a mobile station MS as the first unit or component, and a network element NE.
The mobile station MS, also referred to as the terminal, has a device AFE for terminal-based preprocessing, which is used to create the feature vectors. The mobile station MS is, for example, a mobile radio terminal, a portable computer or any other mobile communication device. The device AFE for terminal-based preprocessing is, for example, the "Advanced Front End" discussed within the scope of the AURORA project.
The device AFE for terminal-based preprocessing comprises means for standard processing of the speech signal. This standard speech processing is illustrated, for example, in Fig. 4.1 of the standard ETSI ES 202 050 V1.1.1 (October 2002). On the mobile station side, the standard speech processing comprises feature extraction with the following steps: noise reduction, waveform processing, cepstrum calculation and blind equalization. This is followed by feature compression and preprocessing for transmission. This processing is known to the person skilled in the art and is therefore not discussed in more detail here. According to one development of the invention, the device AFE for terminal-based preprocessing additionally comprises means for signal level normalization and voice activity detection, so as to implement the preprocessing according to Fig. 3.
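To make the feature-extraction step concrete, here is a plain cepstral feature computation (windowed frames, log power spectrum, DCT). It is a simplified stand-in, not the ETSI ES 202 050 front end, which additionally includes the noise reduction, waveform processing and blind equalization steps listed above; frame length, hop and the number of coefficients are illustrative choices.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_features(x, frame_len=256, hop=128, n_ceps=13):
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    log_power = np.log(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + 1e-10)
    return dct(log_power, type=2, axis=1, norm='ortho')[:, :n_ceps]

features = cepstral_features(np.random.randn(8000))   # one second at 8 kHz
print(features.shape)                                  # (frames, 13) feature vectors
```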
These means can be integrated into the device AFE or alternatively implemented as separate components.
Following the terminal-based preprocessing AFE, the one or more feature vectors created from the voice command are compressed by the device FC for feature vector compression, so that they can be transmitted via the channel CH.
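As a sketch of what such a compression stage might do — the DSR standards actually use split vector quantization, and the value range and bit depth below are invented — feature vectors can, for example, be scalar-quantized to one byte per component before transmission and restored on the receiving side:

```python
import numpy as np

def compress(features, lo=-20.0, hi=20.0):
    # Clip each coefficient to [lo, hi] and map it to an 8-bit code.
    codes = np.round((np.clip(features, lo, hi) - lo) / (hi - lo) * 255)
    return codes.astype(np.uint8)

def decompress(codes, lo=-20.0, hi=20.0):
    # Inverse mapping, performed on the receiving side (cf. device FDC below).
    return codes.astype(np.float64) / 255 * (hi - lo) + lo

features = np.random.uniform(-15, 15, size=(20, 13))    # e.g. 20 feature vectors
codes = compress(features)                               # bytes sent over channel CH
restored = decompress(codes)
print(np.max(np.abs(restored - features)) < 40.0 / 255)  # True: error below one step
```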
The other unit is formed, for example, by a network server as network element NE. In this network element NE the feature vectors are decompressed again by the device FDC for feature vector decompression. In addition, server-side preprocessing is carried out by the device SSP, so that the device SR for speech recognition can then perform speech recognition based on hidden Markov models HMM.
The results of the improvement according to the invention are now presented: Tables 1 and 2 show the recognition rates for different training of the voice commands and for different speech levels or loudnesses of the test speech used for recognition.
Table 2 now shows the recognition rates for the different test speech levels. The training was carried out at a speech level of -26 dB. The test speech was subjected to the noise suppression and speech level normalization according to Fig. 3. As can be seen from Table 2, the recognition rate for clean speech is again as high as before. The important improvement over previous speech recognition methods is that the dependence of the recognition rate for noisy speech on the test speech level (at a signal-to-noise ratio of 5 dB), which is apparent in Table 1, has been eliminated. The "Advanced Front End" described above was used for the speech recognition.
Table 2:
Recognition rate [%] by noise type, for clean speech and for speech at a 5 dB signal-to-noise ratio:

Test speech level | Subway, clean | Subway, 5 dB | Babble, clean | Babble, 5 dB | Car, clean | Car, 5 dB | Exhibition, clean | Exhibition, 5 dB
---|---|---|---|---|---|---|---|---
-16 dB | 99.45 | 83.79 | 98.85 | 75.63 | 99.02 | 86.34 | 99.35 | 79.67
-26 dB | 99.20 | 84.71 | 98.88 | 74.37 | 99.05 | 87.89 | 99.32 | 80.56
-36 dB | 98.86 | 84.71 | 98.70 | 75.00 | 98.78 | 87.77 | 99.01 | 80.47
Claims (15)
1. A method for processing a noisy speech signal (S) for subsequent speech recognition (SR), wherein said speech signal (S) represents at least one voice command, the method comprising the following steps:
a) detecting the noisy speech signal (S);
b) applying noise suppression (NR) to the speech signal (S) to produce a noise-suppressed speech signal (S');
c) normalizing the noise-suppressed speech signal (S') to a target signal value by means of a normalization factor, to produce a noise-suppressed, normalized speech signal (S'').
2. The method according to claim 1, wherein the value of the normalization factor is determined as a function of speech activity.
3. The method according to claim 1 or 2, wherein the speech activity is determined on the basis of the noise-suppressed speech signal.
4. The method according to one of the preceding claims, further comprising the step of:
d) describing the noise-suppressed, normalized voice command by one or more feature vectors.
5. The method according to claim 4, wherein one or more feature vectors describing the noise-suppressed, normalized voice command are created.
6. The method according to one of the preceding claims, further comprising the step of:
e) transmitting a signal describing the one or more feature vectors.
7. The method according to one of the preceding claims, further comprising the step of:
f) performing speech recognition on the basis of the noise-suppressed, normalized voice command.
8. The method according to claim 6 or 7, wherein the detection of the speech signal in step a) and the execution of the speech recognition in step f) are performed at separate locations.
9. The method according to one of the preceding claims, wherein the preprocessing (AFE) and the compression (FC) of the feature vectors describing the speech signal are performed either at spatially separate locations or at the same location.
10. A method for training a voice command contained in a noisy speech signal, comprising the following steps:
a') detecting the noisy speech signal;
b') applying noise suppression to the speech signal to produce a noise-suppressed speech signal;
c') normalizing the noise-suppressed speech signal to a target signal value by means of a normalization factor, to produce a noise-suppressed, normalized speech signal.
11. The method according to claim 10, wherein the training is used to create an acoustic model, in particular an HMM.
12. An electrical device (MS) having a microphone (M) and a processor unit (CPU), the electrical device (MS) being configured to carry out a method according to one of claims 1 to 11, in particular to perform steps a), b) and c).
13. The device according to claim 12, having means for creating feature vectors that describe the speech signal.
14. The electrical device according to claim 12 or 13, embodied as a communication device, in particular a mobile station, the electrical device having a transmitting/receiving device (FS) and a device according to claim 12 or 13.
15. A communication system comprising a mobile station according to claim 14 and a communication network, wherein speech recognition is carried out in the communication network.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE102004001863A DE102004001863A1 (en) | 2004-01-13 | 2004-01-13 | Method and device for processing a speech signal |
DE102004001863.4 | 2004-01-13 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1902684A true CN1902684A (en) | 2007-01-24 |
Family
ID=34744705
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200480040358.1A Pending CN1902684A (en) | 2004-01-13 | 2004-10-04 | Method and device for processing a voice signal for robust speech recognition |
Country Status (5)
Country | Link |
---|---|
US (1) | US20080228477A1 (en) |
EP (1) | EP1704561A1 (en) |
CN (1) | CN1902684A (en) |
DE (1) | DE102004001863A1 (en) |
WO (1) | WO2005069278A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106340306A (en) * | 2016-11-04 | 2017-01-18 | 厦门盈趣科技股份有限公司 | Method and device for improving speech recognition degree |
CN107103904A (en) * | 2017-04-12 | 2017-08-29 | 奇瑞汽车股份有限公司 | A kind of dual microphone noise reduction system recognized applied to vehicle-mounted voice and noise-reduction method |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1949364B (en) * | 2005-10-12 | 2010-05-05 | 财团法人工业技术研究院 | System and method for testing identification degree of input speech signal |
US8831183B2 (en) | 2006-12-22 | 2014-09-09 | Genesys Telecommunications Laboratories, Inc | Method for selecting interactive voice response modes using human voice detection analysis |
WO2014018004A1 (en) * | 2012-07-24 | 2014-01-30 | Nuance Communications, Inc. | Feature normalization inputs to front end processing for automatic speech recognition |
KR102188090B1 (en) * | 2013-12-11 | 2020-12-04 | 엘지전자 주식회사 | A smart home appliance, a method for operating the same and a system for voice recognition using the same |
WO2019176830A1 (en) * | 2018-03-12 | 2019-09-19 | 日本電信電話株式会社 | Learning voice data-generating device, method for generating learning voice data, and program |
CN111161171B (en) * | 2019-12-18 | 2023-04-07 | 三明学院 | Blasting vibration signal baseline zero drift correction and noise elimination method, device, equipment and system |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS60184691A (en) * | 1984-03-02 | 1985-09-20 | Permelec Electrode Ltd | Durable electrode and its manufacture |
DE4111995A1 (en) * | 1991-04-12 | 1992-10-15 | Philips Patentverwaltung | CIRCUIT ARRANGEMENT FOR VOICE RECOGNITION |
US5465317A (en) * | 1993-05-18 | 1995-11-07 | International Business Machines Corporation | Speech recognition system with improved rejection of words and sounds not in the system vocabulary |
SE505156C2 (en) * | 1995-01-30 | 1997-07-07 | Ericsson Telefon Ab L M | Procedure for noise suppression by spectral subtraction |
JPH0990974A (en) * | 1995-09-25 | 1997-04-04 | Nippon Telegr & Teleph Corp <Ntt> | Signal processor |
JPH10257583A (en) * | 1997-03-06 | 1998-09-25 | Asahi Chem Ind Co Ltd | Voice processing unit and its voice processing method |
US6122384A (en) * | 1997-09-02 | 2000-09-19 | Qualcomm Inc. | Noise suppression system and method |
US6098040A (en) * | 1997-11-07 | 2000-08-01 | Nortel Networks Corporation | Method and apparatus for providing an improved feature set in speech recognition by performing noise cancellation and background masking |
US6173258B1 (en) * | 1998-09-09 | 2001-01-09 | Sony Corporation | Method for reducing noise distortions in a speech recognition system |
US6266633B1 (en) * | 1998-12-22 | 2001-07-24 | Itt Manufacturing Enterprises | Noise suppression and channel equalization preprocessor for speech and speaker recognizers: method and apparatus |
US6524647B1 (en) * | 2000-03-24 | 2003-02-25 | Pilkington Plc | Method of forming niobium doped tin oxide coatings on glass and coated glass formed thereby |
US6990446B1 (en) * | 2000-10-10 | 2006-01-24 | Microsoft Corporation | Method and apparatus using spectral addition for speaker recognition |
US20020117199A1 (en) * | 2001-02-06 | 2002-08-29 | Oswald Robert S. | Process for producing photovoltaic devices |
US7035797B2 (en) * | 2001-12-14 | 2006-04-25 | Nokia Corporation | Data-driven filtering of cepstral time trajectories for robust speech recognition |
US20040148160A1 (en) * | 2003-01-23 | 2004-07-29 | Tenkasi Ramabadran | Method and apparatus for noise suppression within a distributed speech recognition system |
-
2004
- 2004-01-13 DE DE102004001863A patent/DE102004001863A1/en not_active Withdrawn
- 2004-10-04 CN CN200480040358.1A patent/CN1902684A/en active Pending
- 2004-10-04 US US10/585,747 patent/US20080228477A1/en not_active Abandoned
- 2004-10-04 EP EP04791139A patent/EP1704561A1/en not_active Withdrawn
- 2004-10-04 WO PCT/EP2004/052427 patent/WO2005069278A1/en not_active Application Discontinuation
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106340306A (en) * | 2016-11-04 | 2017-01-18 | 厦门盈趣科技股份有限公司 | Method and device for improving speech recognition degree |
CN107103904A (en) * | 2017-04-12 | 2017-08-29 | 奇瑞汽车股份有限公司 | A kind of dual microphone noise reduction system recognized applied to vehicle-mounted voice and noise-reduction method |
CN107103904B (en) * | 2017-04-12 | 2020-06-09 | 奇瑞汽车股份有限公司 | Double-microphone noise reduction system and method applied to vehicle-mounted voice recognition |
Also Published As
Publication number | Publication date |
---|---|
US20080228477A1 (en) | 2008-09-18 |
DE102004001863A1 (en) | 2005-08-11 |
EP1704561A1 (en) | 2006-09-27 |
WO2005069278A1 (en) | 2005-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110197670B (en) | Audio noise reduction method and device and electronic equipment | |
CN107910011B (en) | Voice noise reduction method and device, server and storage medium | |
CN103578470B (en) | A kind of processing method and system of telephonograph data | |
CN1122970C (en) | Signal noise reduction by time-domain spectral subtraction using fixed filters | |
US20030061037A1 (en) | Method and apparatus for identifying noise environments from noisy signals | |
EP2898510B1 (en) | Method, system and computer program for adaptive control of gain applied to an audio signal | |
KR20050115857A (en) | System and method for speech processing using independent component analysis under stability constraints | |
CN1650349A (en) | On-line parametric histogram normalization for noise robust speech recognition | |
US20140316775A1 (en) | Noise suppression device | |
JP2006079079A (en) | Distributed speech recognition system and its method | |
WO2004095420A2 (en) | System and method for combined frequency-domain and time-domain pitch extraction for speech signals | |
CN112004177B (en) | Howling detection method, microphone volume adjustment method and storage medium | |
CN105118522B (en) | Noise detection method and device | |
CN1454380A (en) | System and method for voice recognition with a plurality of voice recognition engines | |
KR20190096855A (en) | Method and apparatus for sound processing | |
CN107705791A (en) | Caller identity confirmation method, device and Voiceprint Recognition System based on Application on Voiceprint Recognition | |
CN105719657A (en) | Human voice extracting method and device based on microphone | |
US20060241937A1 (en) | Method and apparatus for automatically discriminating information bearing audio segments and background noise audio segments | |
KR20080064557A (en) | Apparatus and method for improving speech intelligibility | |
CN110383798A (en) | Acoustic signal processing device, acoustics signal processing method and hands-free message equipment | |
CN112151055B (en) | Audio processing method and device | |
CN112002307B (en) | Voice recognition method and device | |
US9558730B2 (en) | Audio signal processing system | |
CN113593599A (en) | Method for removing noise signal in voice signal | |
CN1902684A (en) | Method and device for processing a voice signal for robust speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20070124 |