CN101447182B - Vocal-tract length normalization method capable of fast online application - Google Patents
- Publication number
- CN101447182B CN101447182B CN2008100979810A CN200810097981A CN101447182B CN 101447182 B CN101447182 B CN 101447182B CN 2008100979810 A CN2008100979810 A CN 2008100979810A CN 200810097981 A CN200810097981 A CN 200810097981A CN 101447182 B CN101447182 B CN 101447182B
- Authority
- CN
- China
- Prior art keywords
- consolidation
- alpha
- factor
- vocal
- acoustic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The invention relates to a vocal-tract length normalization (VTLN) method suitable for fast online application. The method comprises the following steps: 1) in the training stage, train a normalized acoustic model that is independent of vocal-tract length; 2) classify the training data according to the different warping factors and train one GMM per class; 3) during testing, score the GMMs segment by segment to compute the vocal-tract length warping factor quickly; 4) select different numbers of segments according to the real-time requirement of the recognition system and update the warping factor; and 5) decode the warped acoustic features with the vocal-tract-length-normalized acoustic model. With this method, the segment length of the test speech can be chosen according to the real-time requirement of the recognition system, so that VTLN can be applied in an online system. Segmentation eliminates the influence of inaccurately detected silence without splitting continuous speech frame by frame, which would distort the delta values of the dynamic acoustic features; in addition, different weights can be assigned according to the condition of each segment.
Description
Technical field
The present invention relates to a speaker acoustic-feature normalization method in speech recognition technology, and more particularly to a speaker vocal-tract length normalization (VTLN) method suitable for fast online application.
Background art
Voice are one of natural qualities of people.Because the behavior difference that the differences of Physiological of speaker's vocal organs and the day after tomorrow form, the performance of speaker's related system is better than speaker's system without interaction in speech recognition.For the speaker's system without interaction performance decrease that reduces to cause owing to speaker's difference, the sound channel length consolidation is a kind of effective ways commonly used.The sound channel length consolidation is a kind of feature consolidation technology based on model, depends on speaker's sound channel length consolidation model.Document, H.Wakita " Normalization of Vowels by Vocal-Tract Length and itsApplication to Vowel Identification; " ICASSP77 (1977) proposes to use removal speaker sound channel length first and causes that the thought of formant frequency drift improves the discrimination of isolated vowel.Position that sound channel is different and shape have determined the generation of voice, document, E.Eide et al. 
" A Parametric Approach to Vocal Tract LengthNormalization; " ICASSP96 (1996), think that the simplest model of speaker's sound channel is the even pipe of a length from the glottis to the lip, and be the sealing of an end opening one end.They give the influence of different consolidation functions to last recognition performance.Based on the model of this even pipeline, the centre frequency that the influence of speaker's sound channel length equals the voice signal resonance peak multiply by the inverse of sound channel length.Usually speaker's sound channel length from about schoolgirl's 13cm to more than boy student's the 18cm, these change speech recognition all is disadvantageous.The thought of sound channel length consolidation technology is exactly to find certain consolidation function that the data of training and testing are all transformed to a data field that has nothing to do with speaker's sound channel length.Based on the theory of pipeline model, resonance peak is with the sound channel length linear change.In most cases the consolidation function only depends on a simple feature consolidation factor.Concrete enforcement is exactly to seek the best consolidation factor of each speaker, eliminates the different influences that bring of speaker's sound channel length by this consolidation factor pair frequency axis stretching or compression then.The principle of sound channel length consolidation technology is very simple, but effectively concrete enforcement is quite difficult.Maximum challenge is how effectively to estimate the best consolidation factor from limited data.The considerable method of tradition is based on the method for twice decoding of maximal possibility estimation, obtain speaker's content of speaking by acoustic feature before the consolidation being carried out a decoding, on acoustic model, do mandatory alignment with the feature after the text message of the content of speaking and the different consolidation factors (normally 
with the fixed step size traversal) consolidation, with the consolidation factor of likelihood value maximum the best consolidation factor as this people.This method can obtain all well and good effect, but needs twice decode time.Document, L.Lee etal. " Speaker Normalization using Efficient Frequency Warping Procedures, " ICASSP96 (1996) has proposed some comparatively successful method.For training data, they have proposed a kind of method of falling generation, with acoustic model of half training data training, take this acoustic model to estimate the consolidation factor of an other half data, on original acoustic model, estimate new acoustic model again with the data after the consolidation then.Test the time a kind of method of text-independent has been proposed, selected relevant GMM (the Gaussian Mixture Model) model of the consolidation factor for use, saved the first pass decode time.The above-mentioned consolidation factor method of asking all is that the speaker is relevant, document, S.Wegmann etal. " Speaker Normalization on Conversational Telephone Speech " ICASSP96 (1996), proposed the relevant vocal-tract length normalization method of a kind of sentence fast, vocal-tract length normalization method can be worked under half off-line provides possibility.Reported method has all obtained all well and good recognition effect now, but how many these methods have certain limitation, all need a certain amount of priori data, so can only be operated under the mode of off-line or half off-line, is difficult to be applied in the actual system.In the system of reality, particularly online system, speaker information and the content of speaking are unknown, and system can not allow long time-delay, be difficult in the existing method find a suitable solution, so be difficult to use sound channel length consolidation technology.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art and provide a vocal-tract length normalization method suitable for fast online application, so that VTLN can be used in an online speech recognition system.
The object of the present invention is achieved as follows:
The vocal-tract length normalization method of the present invention, suitable for fast online application, comprises a training stage and a test stage; the concrete steps are as follows:
1) In the training stage, train a normalized acoustic model that is independent of vocal-tract length;
2) Classify the training data according to the different warping factors and train one GMM per class;
3) During testing, score the GMMs segment by segment to compute the vocal-tract length warping factor quickly;
4) Select different numbers of segments according to the real-time requirement of the recognition system and update the warping factor;
5) Decode the warped acoustic features with the vocal-tract-length-normalized acoustic model.
The flow of the fast online vocal-tract length normalization method of the present invention is shown in Fig. 1.
In Fig. 1, the left side is the VTLN acoustic-model training flow, and the right side is the test flow.
Acoustic-model training: the purpose of applying VTLN in training is to train an acoustic model independent of the speaker's vocal-tract length, thereby eliminating its influence. When training the acoustic model, the transcription of the training data is known, so the main problems are the unknown best warping factors and the unknown model parameters. Estimating the best warping factor by maximum likelihood requires the normalized acoustic model, which does not exist yet. The usual solution is to compute the best warping factor in advance with some auxiliary function, compute the warped features with it, and then train the acoustic model. In the present invention, a single-Gaussian acoustic model is used in place of the normalized acoustic model to compute the best warping factor: a single-Gaussian acoustic model has weaker descriptive power than a Gaussian mixture model and therefore better reflects the original attributes of the speech signal. A single-Gaussian acoustic model is trained on the unwarped training data, and features warped with the different factors are force-aligned against the transcription on this model. The warping factor is usually traversed over a certain range (0.80-1.20) with a certain step size (0.02).
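The frequency-axis stretching or compression by a single warping factor, together with the 0.80-1.20 grid traversed with step 0.02, can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: the piecewise-linear warping function, the 8 kHz bandwidth, and the 0.85 cutoff are assumptions added for the example.

```python
import numpy as np

def alpha_grid(lo=0.80, hi=1.20, step=0.02):
    """Warping-factor grid traversed during training (0.80..1.20, step 0.02)."""
    return np.round(np.arange(lo, hi + step / 2, step), 2)

def warp_frequency(f, alpha, f_max=8000.0, f_cut=0.85):
    """Piecewise-linear frequency warping: divide by alpha below a cutoff
    knee, then interpolate linearly so that f_max still maps to f_max."""
    f = np.asarray(f, dtype=float)
    knee = f_cut * f_max * min(alpha, 1.0)
    return np.where(
        f <= knee,
        f / alpha,
        knee / alpha + (f - knee) * (f_max - knee / alpha) / (f_max - knee),
    )

grid = alpha_grid()
print(len(grid))                     # 21 factors: 0.80, 0.82, ..., 1.20
print(float(warp_frequency(1000.0, 1.0)))  # alpha = 1 leaves the axis unchanged
```

With alpha above 1 the low frequencies are compressed (1000 Hz maps to about 909 Hz for alpha = 1.1), while the band edge is pinned at f_max so the feature bandwidth is preserved.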
Training in the method of the present invention is divided into three steps, as follows:
1) Train a single-Gaussian acoustic model with the unwarped acoustic features:

θ₀ = argmax_θ ∏_{r=1}^{R} p(X_r | W_r; θ)   (1.1)

where θ₀ is the single-Gaussian acoustic model, r = 1, ..., R indexes the R speakers, X_r is the unwarped acoustic feature of speaker r, and W_r is the transcription of the corresponding spoken content.

2) Find the best warping factor for each speaker:

α_r = argmax_α p(X_r^α | W_r; θ₀)   (1.2)

where α_r is the best warping factor of speaker r, and X_r^α is the acoustic feature of speaker r warped with factor α.

3) Train the acoustic model θ′ with the warped acoustic features:

θ′ = argmax_θ ∏_{r=1}^{R} p(X_r^{α_r} | W_r; θ)   (1.3)

where θ′ is the normalized acoustic model.
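The per-speaker grid search of step 2 can be sketched as follows. This is a toy sketch, not the patent's implementation: a frame-independent diagonal-Gaussian score stands in for the forced alignment of formula 1.2, and all names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def diag_gauss_loglik(x, mean, var):
    """Total log-likelihood of the frames x under one diagonal Gaussian."""
    x = np.atleast_2d(x)
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)))

def best_alpha(warp, frames, mean, var, grid):
    """Grid search: warp the frames with each candidate factor and keep
    the factor whose warped features score highest on the model."""
    scores = {a: diag_gauss_loglik(warp(frames, a), mean, var) for a in grid}
    return max(scores, key=scores.get)

# Toy check: the features were generated shrunk by 1/1.10, and "warping"
# here simply rescales them, so the search should recover alpha = 1.1.
rng = np.random.default_rng(0)
mean, var = np.full(3, 5.0), np.ones(3)
frames = rng.normal(5.0 / 1.10, 0.05, (200, 3))
warp = lambda X, a: X * a
print(best_alpha(warp, frames, mean, var, [0.9, 1.0, 1.1, 1.2]))  # -> 1.1
```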
Test flow in the method of the present invention:
Compared with the training part, the normalized acoustic model is available during testing, but the speaker's identity, the spoken content, and the best warping factor are unknown. The conventional approach obtains the speaker's identity by clustering and the spoken content by a first decoding pass, and then computes each speaker's best warping factor by formula 1.2. In a real online system, however, this procedure is computationally expensive and introduces latency, which is basically unacceptable. Speaker identity is usually unknown and hard to obtain, so at test time the warping factor is generally computed per sentence. Because a speaker's vocal-tract length is independent of the particular spoken content, the speaker's warping factor can be obtained directly from the speaker's speech. In testing, we adopt a text-independent method to find the best warping factor: it does not rely on what the speaker says, but estimates the best factor directly from the corresponding acoustic features.
First, in training, the unwarped features are classified according to their corresponding best warping factors, and a Gaussian mixture model (GMM) is then trained for each class. The concrete flow is shown in Fig. 2:

λ_α = argmax_λ p(X_α | λ)   (1.4)

where X_α is the unwarped acoustic feature whose corresponding best warping factor is α.

Second, during recognition, the warping factor whose GMM gives the maximum likelihood for the unwarped acoustic features is taken as the best warping factor α′:

α′ = argmax_α p(X | λ_α)   (1.5)

Then, the warped features are decoded:

W = argmax_W p(X^{α′} | W; θ′)   (1.6)

where W is the recognition result and X^{α′} is the feature warped with factor α′.
Silence segments contain no information about the speaker's vocal-tract length and may even corrupt the computation of the best warping factor, so when training the GMM models, silence is removed from the training data according to speech energy. The computation of the warping factor during testing is shown in Fig. 3: initialize α = 1; every n = 5 frames, decide whether the segment is silence; if it is not silence, accumulate its likelihood on the GMM models, and take the factor with the maximum cumulative likelihood as the current warping factor. By choosing the number of frames n per segment, the latency and real-time behavior of the system can be controlled.
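The segment-wise accumulation just described can be sketched as follows, under stated assumptions: a simple energy threshold stands in for the patent's silence decision, and `score_segment` stands in for the per-class GMM log-likelihood; both names are illustrative, not from the patent.

```python
import numpy as np

def online_warping_factor(frames, energies, score_segment, grid,
                          n=5, energy_floor=0.01):
    """Yield the current best warping factor after each n-frame segment.

    score_segment(segment, alpha) returns the log-likelihood of the
    segment under the GMM of class alpha (a stand-in here).
    """
    cum = {a: 0.0 for a in grid}
    alpha = 1.0                        # initial factor: no warping
    for start in range(0, len(frames) - n + 1, n):
        seg = frames[start:start + n]
        if np.mean(energies[start:start + n]) < energy_floor:
            yield alpha                # silent segment: keep the old factor
            continue
        for a in grid:
            cum[a] += score_segment(seg, a)
        alpha = max(cum, key=cum.get)  # running argmax of cumulative score
        yield alpha

# Toy run: the "GMM score" simply prefers alpha = 1.1 on speech segments.
frames = np.zeros((20, 3))
energies = np.array([1.0] * 10 + [0.0] * 10)   # second half is silence
score = lambda seg, a: -abs(a - 1.1)
factors = list(online_warping_factor(frames, energies, score, [0.9, 1.0, 1.1]))
print(factors)   # -> [1.1, 1.1, 1.1, 1.1]
```

The silent second half leaves the accumulator untouched, so the factor chosen on the speech segments persists, matching the behavior described for Fig. 3.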
The invention has the following advantages:
The method can choose the segment length of the test speech according to the real-time requirement of the recognition system, so that VTLN can be applied in an online system. The purpose of segmentation is to eliminate the influence of inaccurately detected silence without splitting continuous speech frame by frame, which would distort the delta values of the dynamic acoustic features; in addition, different weights can be assigned according to the condition of each segment.
Description of drawings
Fig. 1 is the vocal-tract length normalization system;
Fig. 2 is the GMM training flow;
Fig. 3 is the warping-factor computation flow during testing.
Embodiment
The present invention is described in detail below with reference to the drawings and embodiments.
Referring to Fig. 1, the training stage produces an acoustic model independent of vocal-tract length and the GMM models used to compute the warping factor quickly at test time.
1. Train a single-Gaussian acoustic model with the unwarped acoustic features;
2. Compute each speaker's warping factor on the single-Gaussian acoustic model, and extract acoustic features with the best warping factor;
According to the transcription of the training data, a speaker-dependent list is compiled. Each speaker's data, warped with the different factors, is force-aligned on the single-Gaussian acoustic model, and the factor with the maximum likelihood is chosen as that speaker's best warping factor.
α ranges from 0.80 to 1.20 with a step size of 0.02.
3. Train the acoustic model with the warped acoustic features.
4. Train the multi-class GMMs according to the different warping factors, following the GMM training flow of Fig. 2.
Before training the GMMs, possibly silent parts of the speech are removed according to energy. Because there is very little data with warping factors below 0.88 or above 1.12, only the range 0.88-1.12 is used as classes when training the GMMs.
Test stage
1) Speech endpoint detection and sentence segmentation;
According to change points of the acoustic environment, the audio stream is cut into acoustically homogeneous fragments, and a silence-tracking algorithm cuts the longer fragments into sentences suitable for recognition.
2) Initialize the warping factor to 1;
Since there is no prior information at the beginning, a warping factor of 1 is chosen, i.e. no warping at all.
3) Every 5 frames, decide whether the segment is silence or speech; if it is speech, accumulate the likelihood on the GMM models and update the current best warping factor;
Silence segments contain no information about the speaker's vocal-tract length and may even corrupt the computation of the best warping factor. Every n = 5 frames the segment is tested for silence; if it is not silence, its likelihood is accumulated on the GMM models, and the factor with the maximum cumulative likelihood is taken as the current warping factor. The purpose of segmentation is to eliminate the influence of inaccurately detected silence without splitting continuous speech frame by frame; in addition, different weights can be assigned according to the condition of each segment.
Furthermore, by choosing the number of frames n per segment (3 < n < 15), the real-time behavior of the system can be controlled.
4) For an offline system, the factor with the maximum final cumulative likelihood is taken as the warping factor; for an online system, once the accumulated speech exceeds a set length, the features are warped with the factor of maximum cumulative likelihood at that moment;
5) Decode with the warped acoustic features.
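The offline/online distinction in step 4 can be sketched as a small decision rule: offline waits for the end of the utterance and uses the final accumulator, while online commits as soon as enough speech has accumulated. The function name and the frame threshold are illustrative assumptions, not from the patent.

```python
def choose_factor(cum_loglik, speech_frames, online, min_frames=100):
    """Offline: always the final argmax of the cumulative log-likelihoods.
    Online: commit to the current argmax once at least min_frames speech
    frames have been accumulated; before that, keep the unwarped default."""
    if online and speech_frames < min_frames:
        return 1.0
    return max(cum_loglik, key=cum_loglik.get)

cum = {0.98: -42.0, 1.00: -40.5, 1.02: -39.1}
print(choose_factor(cum, speech_frames=250, online=True))   # -> 1.02
print(choose_factor(cum, speech_frames=30, online=True))    # -> 1.0
```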
Claims (3)
- 1. A vocal-tract length normalization method suitable for fast online application, comprising a training stage and a test stage, with the following concrete steps. The flow of the training stage is: 1) train a single-Gaussian acoustic model with the unwarped acoustic features, θ₀ = argmax_θ ∏_{r=1}^{R} p(X_r | W_r; θ), where θ₀ is the single-Gaussian acoustic model, r = 1, ..., R indexes the R speakers, X is the unwarped acoustic feature, W is the transcription of the corresponding spoken content, α is the warping factor, and θ is an acoustic model; 2) compute each speaker's warping factor on the single-Gaussian acoustic model and extract acoustic features with the best warping factor, the best factor for each speaker being α_r = argmax_α p(X_r^α | W_r; θ₀), where α_r is the best warping factor of speaker r, X_r^α is the acoustic feature of speaker r warped with factor α, and W_r is the transcription of the content spoken by speaker r; 3) train the acoustic model θ′ with the warped acoustic features, θ′ = argmax_θ ∏_{r=1}^{R} p(X_r^{α_r} | W_r; θ), where θ′ is the normalized acoustic model. In addition, the flow of the test stage is: 1) first, in training, classify the unwarped features according to their corresponding best warping factors and train a Gaussian mixture model for each class, λ_α = argmax_λ p(X_α | λ), where X_α is the unwarped acoustic feature whose corresponding best warping factor is α; 2) second, during recognition, take the warping factor whose Gaussian mixture model gives the maximum likelihood for the unwarped acoustic features as the best warping factor α′, α′ = argmax_α p(X | λ_α); 3) then decode the warped features, W = argmax_W p(X^{α′} | W; θ′), where W is the recognition result and X^{α′} is the feature warped with factor α′.
- 2. The vocal-tract length normalization method suitable for fast online application according to claim 1, characterized in that the range of the warping factor α is 0.80-1.20 with a step size of 0.02.
- 3. The vocal-tract length normalization method suitable for fast online application according to claim 1, characterized in that the range of the warping factor α is 0.88-1.12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2008100979810A CN101447182B (en) | 2007-11-28 | 2008-05-21 | Vocal-tract length normalization method capable of fast online application |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200710195420 | 2007-11-28 | ||
CN200710195420.X | 2007-11-28 | ||
CN2008100979810A CN101447182B (en) | 2007-11-28 | 2008-05-21 | Vocal-tract length normalization method capable of fast online application |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101447182A CN101447182A (en) | 2009-06-03 |
CN101447182B true CN101447182B (en) | 2011-11-09 |
Family
ID=40742822
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2008100979810A Expired - Fee Related CN101447182B (en) | 2007-11-28 | 2008-05-21 | Vocal-tract length normalization method capable of fast online application |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101447182B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102486922B (en) * | 2010-12-03 | 2014-12-03 | 株式会社理光 | Speaker recognition method, device and system |
CN102810311B (en) * | 2011-06-01 | 2014-12-03 | 株式会社理光 | Speaker estimation method and speaker estimation equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5696878A (en) * | 1993-09-17 | 1997-12-09 | Panasonic Technologies, Inc. | Speaker normalization using constrained spectra shifts in auditory filter domain |
US6823305B2 (en) * | 2000-12-21 | 2004-11-23 | International Business Machines Corporation | Apparatus and method for speaker normalization based on biometrics |
CN1591570A (en) * | 2003-08-13 | 2005-03-09 | 松下电器产业株式会社 | Bubble splitting for compact acoustic modeling |
US7003465B2 (en) * | 2000-10-12 | 2006-02-21 | Matsushita Electric Industrial Co., Ltd. | Method for speech recognition, apparatus for the same, and voice controller |
Non-Patent Citations (2)
Title |
---|
JP 2002-189491 A 2002.07.05 |
JP 2003-022088 A 2003.01.24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20111109 |
CF01 | Termination of patent right due to non-payment of annual fee |