CN101447182B - Vocal-tract length normalization method capable of fast online application - Google Patents

Vocal-tract length normalization method capable of fast online application

Info

Publication number
CN101447182B
CN101447182B (grant) · CN2008100979810A (application)
Authority
CN
China
Prior art keywords
normalization
alpha
factor
vocal
acoustic
Prior art date
Legal status
Expired - Fee Related
Application number
CN2008100979810A
Other languages
Chinese (zh)
Other versions
CN101447182A (en)
Inventor
颜永红
刘赵杰
赵庆卫
潘接林
Current Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Original Assignee
Institute of Acoustics CAS
Beijing Kexin Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Institute of Acoustics CAS, Beijing Kexin Technology Co Ltd
Priority to CN2008100979810A
Publication of CN101447182A
Application granted
Publication of CN101447182B


Abstract

The invention relates to a vocal-tract length normalization (VTLN) method suitable for fast online application. The method comprises the following steps: 1) during the training stage, train a normalized acoustic model that is independent of vocal tract length; 2) classify the training data according to their warping factors and train one GMM per class; 3) during testing, score the GMMs segment by segment and compute the warping factor quickly; 4) select the number of segments according to the real-time requirement of the recognition system and update the warping factor; 5) decode the normalized acoustic features with the vocal-tract-length-normalized acoustic model. With this method, the segment length of the test speech can be chosen to match the recognition system's real-time requirement, so that VTLN can be applied in an online system. Segmentation eliminates the influence of inaccurately detected silence without splitting continuous speech so finely, frame by frame, that the delta values of the dynamic acoustic features are affected; different weights can also be assigned according to the condition of each segment.

Description

A vocal-tract length normalization method suitable for fast online application
Technical field
The present invention relates to a speaker acoustic-feature normalization method in speech recognition technology, and more particularly to a speaker vocal-tract length normalization method suitable for fast online application.
Background technology
Speech is one of the natural attributes of human beings. Because of physiological differences in speakers' vocal organs and acquired behavioral differences, speaker-dependent systems outperform speaker-independent systems in speech recognition. Vocal tract length normalization (VTLN) is a common and effective method for reducing the performance loss that speaker differences cause in speaker-independent systems. VTLN is a model-based feature normalization technique that depends on a model of the speaker's vocal tract. H. Wakita, "Normalization of Vowels by Vocal-Tract Length and its Application to Vowel Identification," ICASSP 1977, first proposed removing the formant-frequency shift caused by speaker vocal tract length to improve the recognition rate of isolated vowels. The position and shape of the vocal tract determine the speech that is produced. E. Eide et al., "A Parametric Approach to Vocal Tract Length Normalization," ICASSP 1996, take the simplest model of the speaker's vocal tract to be a uniform tube from the glottis to the lips, closed at one end and open at the other, and report the influence of different warping functions on final recognition performance. Under this uniform-tube model, the effect of vocal tract length is that the center frequencies of the formants of the speech signal are proportional to the inverse of the vocal tract length. Speaker vocal tract lengths typically range from about 13 cm for females to more than 18 cm for males, and this variation is harmful to speech recognition. The idea of VTLN is to find a warping function that transforms both the training and the test data into a domain independent of the speaker's vocal tract length. Under the tube model, formant frequencies change linearly with vocal tract length, so in most cases the warping function depends only on a single warping factor. The concrete implementation is to find the best warping factor for each speaker and then remove the influence of differing vocal tract lengths by stretching or compressing the frequency axis with this factor. The principle of VTLN is simple, but an effective implementation is quite difficult; the biggest challenge is estimating the best warping factor effectively from limited data. The classical approach is a two-pass maximum-likelihood method: decode the unwarped acoustic features once to obtain the spoken content, force-align the features warped with different factors (usually traversed with a fixed step size) against this transcript on the acoustic model, and take the factor with the maximum likelihood as the speaker's best warping factor. This method achieves very good results but requires two decoding passes.
" Speaker Normalization using Efficient Frequency Warping Procedures, " ICASSP96 (1996) has proposed some comparatively successful method.For training data, they have proposed a kind of method of falling generation, with acoustic model of half training data training, take this acoustic model to estimate the consolidation factor of an other half data, on original acoustic model, estimate new acoustic model again with the data after the consolidation then.Test the time a kind of method of text-independent has been proposed, selected relevant GMM (the Gaussian Mixture Model) model of the consolidation factor for use, saved the first pass decode time.The above-mentioned consolidation factor method of asking all is that the speaker is relevant, document, S.Wegmann etal. " Speaker Normalization on Conversational Telephone Speech " ICASSP96 (1996), proposed the relevant vocal-tract length normalization method of a kind of sentence fast, vocal-tract length normalization method can be worked under half off-line provides possibility.Reported method has all obtained all well and good recognition effect now, but how many these methods have certain limitation, all need a certain amount of priori data, so can only be operated under the mode of off-line or half off-line, is difficult to be applied in the actual system.In the system of reality, particularly online system, speaker information and the content of speaking are unknown, and system can not allow long time-delay, be difficult in the existing method find a suitable solution, so be difficult to use sound channel length consolidation technology.
Summary of the invention
The object of the present invention is to overcome the defects of the prior art and provide a vocal-tract length normalization method suitable for fast online application, so that VTLN technology can be applied in an online speech recognition system.
This object of the present invention is achieved as follows:
The VTLN method of the present invention, suitable for fast online application, comprises a training stage and a test stage; the concrete steps are as follows:
1) during the training stage, train a normalized acoustic model independent of vocal tract length;
2) classify the training data according to their warping factors and train one GMM per class;
3) during testing, score the GMMs segment by segment and compute the warping factor quickly;
4) select the number of segments according to the real-time requirement of the recognition system and update the warping factor;
5) decode the normalized acoustic features with the vocal-tract-length-normalized acoustic model.
The flow of the VTLN method of the present invention, suitable for fast online application, is shown in Fig. 1.
In Fig. 1, the left side is the flow of the normalized acoustic model training part, and the right side is the flow of the test part.
Acoustic model training part: the purpose of applying VTLN in training is to train an acoustic model independent of the speaker's vocal tract length, thereby eliminating its influence. When training the acoustic model, the transcript is known, so the main problems are the unknown best warping factors and the unknown model parameters. Estimating the best warping factor by maximum likelihood requires the normalized acoustic model, which does not yet exist at this point. The usual approach assumes that the best warping factor can be computed in advance by some function, then computes the warped features with it and trains the acoustic model. In actual application, the present invention uses a single-Gaussian acoustic model in place of the normalized acoustic model to compute the best warping factor, on the grounds that a single-Gaussian model has somewhat weaker descriptive power than a Gaussian mixture model and therefore better reflects the original attributes of the speech signal. A single-Gaussian acoustic model is trained with the unwarped training data, and the features warped with the different warping factors are force-aligned against the transcript on this model. The warping factor is usually traversed over a certain range (0.80 to 1.20) with a certain step size (0.02).
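The stretching or compression of the frequency axis can be illustrated with a piecewise-linear warping of the filterbank center frequencies. This is a minimal sketch under assumed choices (the knee position and the frequency range are not specified by the patent, which only requires that the axis be scaled by the factor):

    import numpy as np

    def warp_frequency(f, alpha, f_max=8000.0, knee_ratio=0.7):
        """Piecewise-linear VTLN warping: scale frequencies by 1/alpha up
        to a knee frequency, then interpolate linearly so that f_max still
        maps to f_max. Typically applied to the mel filterbank center
        frequencies before feature extraction; f_max and knee_ratio are
        assumed values."""
        f = np.asarray(f, dtype=float)
        knee = knee_ratio * f_max
        slope = (f_max - knee / alpha) / (f_max - knee)  # upper segment
        return np.where(f <= knee, f / alpha, knee / alpha + slope * (f - knee))

    # alpha < 1 shifts frequencies up, alpha > 1 shifts them down.
    print(warp_frequency(np.array([500.0, 2000.0, 7000.0]), 0.92))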
Training in the method of the present invention is divided into three steps, as follows:
1) Train a single-Gaussian acoustic model with the unwarped acoustic features:
$$\theta_0 \cong \arg\max_{\theta}\left\{\max_{\alpha} P(X \mid W; \theta)\right\} \qquad (1.1)$$
where $\theta_0$ is the single-Gaussian acoustic model; $r = 1, \dots, R$ indexes the $R$ speakers; $X$ is the acoustic features before warping; and $W$ is the transcript of the corresponding spoken content.
2) Find the best warping factor for each speaker:
$$\alpha_r = \arg\max_{\alpha} p(X_r^{\alpha} \mid W_r; \theta_0) \qquad (1.2)$$
where $r = 1, \dots, R$ indexes the $R$ speakers; $\alpha_r$ is the best warping factor of speaker $r$; $X_r^{\alpha}$ is the acoustic features of speaker $r$ warped with factor $\alpha$; and $W_r$ is the transcript of speaker $r$'s spoken content.
3) Train the acoustic model $\theta'$ with the warped acoustic features:
$$\theta' = \arg\max_{\theta} \prod_{r=1}^{R} \max_{\alpha_r} P(X_r^{\alpha_r} \mid W_r; \theta) \qquad (1.3)$$
where $\theta'$ is the normalized acoustic model. (A combined sketch of these three steps is given below.)
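For illustration (not part of the original disclosure), the three training steps can be sketched as a loop in which the toolkit-specific operations are injected as callables; every name below is an assumption:

    import numpy as np

    def train_vtln_models(speakers, train_single_gaussian, align_loglik,
                          train_model, alphas=np.arange(0.80, 1.2001, 0.02)):
        """Sketch of Eqs. (1.1)-(1.3).

        speakers: list of dicts with 'id', 'transcript' and 'features',
          where features[a] holds the acoustic features warped with factor
          a (a = 1.0 meaning unwarped).
        train_single_gaussian / train_model: hypothetical callables that
          train an acoustic model on a list of feature arrays.
        align_loglik(model, feats, transcript): hypothetical forced-
          alignment log-likelihood."""
        # Eq. (1.1): single-Gaussian model on the unwarped features.
        theta0 = train_single_gaussian([s["features"][1.0] for s in speakers])
        # Eq. (1.2): per-speaker best factor by forced alignment.
        best = {}
        for s in speakers:
            scores = [align_loglik(theta0, s["features"][round(a, 2)],
                                   s["transcript"]) for a in alphas]
            best[s["id"]] = round(float(alphas[int(np.argmax(scores))]), 2)
        # Eq. (1.3): retrain on features warped with each speaker's factor.
        theta_prime = train_model([s["features"][best[s["id"]]]
                                   for s in speakers])
        return theta_prime, best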
Test flow in the method of the present invention:
Compared with the training part, the normalized acoustic model is available during testing, but the speaker identity, the spoken content, and the best warping factor are unknown. The conventional approach is to obtain the speaker identity by clustering and the spoken content by a first decoding pass, and then to compute each person's best warping factor by Eq. (1.2). In a practical online system, however, this approach is computationally heavy and introduces delay, which is basically unacceptable. The speaker identity is usually unknown and hard to obtain, so during testing the warping factor is generally computed per sentence. Because a speaker's vocal tract length has nothing to do with the specific spoken content, the warping factor can be obtained directly from the speaker's speech. In testing we therefore adopt a text-independent method for finding the best warping factor: it does not rely on what the speaker says, but estimates the best factor directly from the corresponding acoustic features.
First, in training, the unwarped features are classified according to their corresponding best warping factors, and a Gaussian mixture model (GMM) $\lambda_\alpha$ is then trained for each class; the concrete flow is shown in Fig. 2:

$$\lambda_\alpha = \arg\max_{\lambda} p(X^{\alpha} \mid \lambda) \qquad (1.4)$$

where $X^{\alpha}$ is the unwarped acoustic features whose corresponding best warping factor is $\alpha$.
Secondly, in recognition, the warping factor whose GMM gives the maximum likelihood for the unwarped acoustic features is taken as the best warping factor $\alpha'$:

$$\alpha' = \arg\max_{\alpha} p(X \mid \lambda_\alpha) \qquad (1.5)$$

$$p(X \mid \lambda_\alpha) = \prod_{t} \sum_{l=1}^{L} c_{l,\alpha}\, \mathcal{N}(x_t; \mu_{l,\alpha}, \sigma_{l,\alpha}^2), \qquad \sum_{l=1}^{L} c_{l,\alpha} = 1 \qquad (1.6)$$

where $c_{l,\alpha}$, $\mu_{l,\alpha}$ and $\sigma_{l,\alpha}^2$ are respectively the weights, means and variances of the Gaussian mixture model $\lambda_\alpha$.
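Equations (1.4)-(1.5) amount to training one GMM per warping-factor class and picking the class with the highest likelihood. A minimal sketch with scikit-learn (an assumed toolkit choice; the mixture size is likewise assumed, since the patent does not specify it):

    from sklearn.mixture import GaussianMixture

    def train_alpha_gmms(features_by_alpha, n_components=32, seed=0):
        """Eq. (1.4): one diagonal-covariance GMM per warping-factor class.
        features_by_alpha maps alpha -> (frames x dims) array of the
        *unwarped* features of speakers whose best factor is alpha."""
        return {a: GaussianMixture(n_components, covariance_type="diag",
                                   random_state=seed).fit(x)
                for a, x in features_by_alpha.items()}

    def select_alpha(gmms, frames):
        """Eq. (1.5): the factor whose GMM maximizes the likelihood of the
        unwarped test frames (GaussianMixture.score returns the mean
        per-frame log-likelihood)."""
        return max(gmms, key=lambda a: gmms[a].score(frames))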
Then, the warped features are decoded:

$$W = \arg\max_{W'}\left\{ p(W') \cdot \max_{\alpha} p(X^{\alpha} \mid W'; \theta') \right\} \qquad (1.7)$$

where $W$ is the recognition result and $X^{\alpha}$ is the features warped with factor $\alpha$.
Because silent segments contain no information about the speaker's vocal tract length, and may even corrupt the computation of the best warping factor, the silent segments are removed from the training data according to speech energy before the GMMs are trained. The computation of the warping factor during testing is shown in Fig. 3: initialize $\alpha = 1$; every $n = 5$ frames, judge whether the segment is silence; if it is not silence, accumulate its probability on each GMM, and take the factor with the maximum cumulative probability as the warping factor at that moment. By choosing the number of frames $n$ per segment, the delay and real-time behavior of the system can be controlled.
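The per-segment accumulation of Fig. 3 can be sketched as a small online estimator. The energy-based silence test and its threshold are assumptions; the patent requires a silence judgment every n frames but does not fix the detector:

    import numpy as np

    class OnlineWarpEstimator:
        """Sketch of Fig. 3: alpha starts at 1; every n frames, if the
        segment is judged to be speech, each class GMM's log-likelihood is
        added to that class's accumulator, and the current factor is the
        class with the largest accumulated score."""

        def __init__(self, gmms, n=5, energy_margin=2.0):
            self.gmms, self.n = gmms, n          # gmms: alpha -> fitted GMM
            self.energy_margin = energy_margin   # assumed silence threshold
            self.scores = {a: 0.0 for a in gmms}
            self.alpha = 1.0                     # no prior -> no warping
            self.buf = []
            self.mean_energy = None

        def push(self, frame, log_energy):
            """Feed one (feature frame, log energy) pair; returns the
            current best warping factor."""
            # Running mean of log energy as a crude speech/silence reference.
            self.mean_energy = (log_energy if self.mean_energy is None else
                                0.99 * self.mean_energy + 0.01 * log_energy)
            self.buf.append((frame, log_energy))
            if len(self.buf) == self.n:
                frames = np.array([f for f, _ in self.buf])
                seg_energy = np.mean([e for _, e in self.buf])
                self.buf = []
                if seg_energy > self.mean_energy - self.energy_margin:  # speech
                    for a, g in self.gmms.items():
                        self.scores[a] += g.score(frames) * self.n
                    self.alpha = max(self.scores, key=self.scores.get)
            return self.alpha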
The advantages of the invention are:
With the method of the present invention, the segment length of the test speech can be chosen according to the recognition system's real-time requirement, so that VTLN technology can be applied in an online system. The purpose of segmentation is to eliminate the influence of inaccurately detected silence without splitting continuous speech so finely, frame by frame, that the delta values of the dynamic acoustic features are affected; at the same time, different weights can be assigned according to the condition of each segment.
Description of the drawings
Fig. 1 is the architecture of the vocal tract length normalization system;
Fig. 2 is the GMM training flow;
Fig. 3 is the flow of the warping-factor computation during testing.
Embodiments
The present invention is described in detail below with reference to the drawings and embodiments.
Referring to Fig. 1, the training stage produces a vocal-tract-length-independent acoustic model and the GMMs used to compute the warping factor quickly during testing.
1. Train a single-Gaussian acoustic model with the unwarped acoustic features:
$$\theta_0 \cong \arg\max_{\theta}\left\{\max_{\alpha} P(X \mid W; \theta)\right\}$$
The concrete training flow is the same as for the original acoustic model, except that no Gaussian splitting is performed during the EM iterations, so the final model is a single-Gaussian model with the same states as the original one. A single-Gaussian acoustic model has somewhat weaker descriptive power than a Gaussian mixture model and therefore better reflects the original attributes of the speech signal. This single-Gaussian model is used to compute the best warping factor for every person in the training data.
2. Compute each person's warping factor on the single-Gaussian acoustic model, and extract acoustic features with the best warping factor.
According to the transcripts of the training data, compile a speaker-dependent list. Force-align each person's data warped with the different factors on the single-Gaussian acoustic model, and choose the factor with the maximum likelihood as this person's best warping factor: $\alpha_r = \arg\max_{\alpha} p(X_r^{\alpha} \mid W_r; \theta_0)$. The range of $\alpha$ is 0.80 to 1.20 with a step size of 0.02.
3. Train the acoustic model with the warped acoustic features:
$$\theta' = \arg\max_{\theta} \prod_{r=1}^{R} \max_{\alpha_r} P(X_r^{\alpha_r} \mid W_r; \theta)$$
The concrete training flow is the same as the training of the original acoustic model.
4. Train one GMM per warping-factor class, as in Eq. (1.4) and Fig. 2.
Before training the GMMs, the possibly silent parts of the speech are removed according to the energy of the speech. Because there is very little data with warping factors below 0.88 or above 1.12, only the 0.88–1.12 range is used for the classes when training the GMMs. (A sketch of both preprocessing steps follows below.)
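A minimal sketch of these two preprocessing choices, under assumed thresholds (the patent specifies only "according to the energy of the speech" and the 0.88-1.12 class range):

    import numpy as np

    def drop_silence(frames, log_energies, margin=2.0):
        """Remove likely-silent frames before GMM training: keep frames
        whose log energy is within `margin` of the utterance mean (the
        threshold is an assumed realization of the energy criterion)."""
        keep = np.asarray(log_energies) > np.mean(log_energies) - margin
        return np.asarray(frames)[keep]

    def restrict_classes(features_by_alpha, lo=0.88, hi=1.12):
        """Keep only the warping-factor classes in [lo, hi]; data outside
        0.88-1.12 is too scarce to train reliable GMMs."""
        return {a: x for a, x in features_by_alpha.items() if lo <= a <= hi}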
Test stage
1) Speech endpoint detection and sentence segmentation.
The audio stream is cut into acoustically homogeneous fragments at acoustic-environment change points, and a silence-tracking algorithm cuts the longer fragments into sentences of a size suitable for recognition.
2) Initialize the warping factor to 1.
Since there is no prior knowledge at the beginning, a warping factor of 1 is chosen, i.e., no warping at all.
3) Every 5 frames, judge silence or speech; if speech, accumulate the likelihood on the GMMs and update the current best warping factor.
Silent segments contain no information about the speaker's vocal tract length and may even corrupt the computation of the best warping factor. Every n = 5 frames, judge whether the segment is silence; if it is not, accumulate its probability on each GMM, and take the factor with the maximum cumulative probability as the warping factor at that moment. The purpose of segmentation is to eliminate the influence of inaccurately detected silence without splitting continuous speech frame by frame into overly scattered pieces; at the same time, different weights can be assigned according to the condition of each segment.
In addition, by choosing the number of frames n per segment (3 < n < 15), the real-time behavior of the system can be controlled.
4) For an offline system, the factor with the maximum final cumulative probability is taken as the sentence's warping factor; for an online system, once the accumulated speech exceeds a set length, warp with the factor of maximum cumulative probability at that moment.
5) Decode with the warped acoustic features. (A combined sketch of this test loop follows below.)
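Putting the test-stage steps together, a hypothetical driver loop (reusing the estimator sketched after Fig. 3 above; frames, log_energies and gmms are assumed to come from the front end and from GMM training):

    # Hypothetical glue code, not part of the original disclosure.
    est = OnlineWarpEstimator(gmms, n=5)
    alpha = 1.0
    for frame, e in zip(frames, log_energies):
        alpha = est.push(frame, e)
        # Online system: once enough speech has accumulated, warp the
        # features with the current alpha and begin decoding.
    # Offline system: use the final alpha for the whole sentence.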

Claims (3)

  1. A vocal-tract length normalization method suitable for fast online application, comprising a training stage and a test stage, with the following concrete steps:
    The flow of the training stage is as follows:
    1) train a single-Gaussian acoustic model with the unwarped acoustic features:
    $$\theta_0 \cong \arg\max_{\theta}\left\{\max_{\alpha} P(X \mid W; \theta)\right\} \qquad (1.1)$$
    where $\theta_0$ is the single-Gaussian acoustic model; $r = 1, \dots, R$ indexes the $R$ speakers; $X$ is the acoustic features before warping; $W$ is the transcript of the corresponding spoken content; $\alpha$ is the warping factor; and $\theta$ is an acoustic model;
    2) compute each person's warping factor on the single-Gaussian acoustic model and extract acoustic features with the best warping factor, the best warping factor of each speaker being:
    $$\alpha_r = \arg\max_{\alpha} p(X_r^{\alpha} \mid W_r; \theta_0) \qquad (1.2)$$
    where $r = 1, \dots, R$ indexes the $R$ speakers; $\alpha_r$ is the best warping factor of speaker $r$; $X_r^{\alpha}$ is the acoustic features of speaker $r$ warped with factor $\alpha$; and $W_r$ is the transcript of speaker $r$'s spoken content;
    3) train the acoustic model $\theta'$ with the warped acoustic features:
    $$\theta' = \arg\max_{\theta} \prod_{r=1}^{R} \max_{\alpha_r} P(X_r^{\alpha_r} \mid W_r; \theta) \qquad (1.3)$$
    where $\theta'$ is the normalized acoustic model;
    In addition, the flow of the test stage is as follows:
    1) first, in training, classify the unwarped features according to their corresponding best warping factors, then train a Gaussian mixture model $\lambda_\alpha$ for each class:
    $$\lambda_\alpha = \arg\max_{\lambda} p(X^{\alpha} \mid \lambda) \qquad (1.4)$$
    where $X^{\alpha}$ is the unwarped acoustic features whose corresponding best warping factor is $\alpha$;
    2) secondly, in recognition, take the warping factor whose mixture model gives the maximum likelihood for the unwarped acoustic features as the best warping factor $\alpha'$:
    $$\alpha' = \arg\max_{\alpha} p(X \mid \lambda_\alpha) \qquad (1.5)$$
    $$p(X \mid \lambda_\alpha) = \prod_{t} \sum_{l=1}^{L} c_{l,\alpha}\, \mathcal{N}(x_t; \mu_{l,\alpha}, \sigma_{l,\alpha}^2), \qquad \sum_{l=1}^{L} c_{l,\alpha} = 1 \qquad (1.6)$$
    where $c_{l,\alpha}$, $\mu_{l,\alpha}$ and $\sigma_{l,\alpha}^2$ are respectively the weights, means and variances of model $\lambda_\alpha$;
    3) then, decode the warped features:
    $$W = \arg\max_{W'}\left\{ p(W') \cdot \max_{\alpha} p(X^{\alpha} \mid W'; \theta') \right\} \qquad (1.7)$$
    where $W$ is the recognition result and $X^{\alpha}$ is the features warped with factor $\alpha$.
  2. The vocal-tract length normalization method suitable for fast online application of claim 1, characterized in that the range of the warping factor $\alpha$ is 0.80–1.20 with a step size of 0.02.
  3. The vocal-tract length normalization method suitable for fast online application of claim 1, characterized in that the range of the warping factor $\alpha$ is 0.88–1.12.
CN2008100979810A 2007-11-28 2008-05-21 Vocal-tract length normalization method capable of fast online application Expired - Fee Related CN101447182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008100979810A CN101447182B (en) 2007-11-28 2008-05-21 Vocal-tract length normalization method capable of fast online application

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN200710195420 2007-11-28
CN200710195420.X 2007-11-28
CN2008100979810A CN101447182B (en) 2007-11-28 2008-05-21 Vocal-tract length normalization method capable of fast online application

Publications (2)

Publication Number Publication Date
CN101447182A CN101447182A (en) 2009-06-03
CN101447182B true CN101447182B (en) 2011-11-09

Family

ID=40742822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008100979810A Expired - Fee Related CN101447182B (en) 2007-11-28 2008-05-21 Vocal-tract length normalization method capable of fast online application

Country Status (1)

Country Link
CN (1) CN101447182B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102486922B (en) * 2010-12-03 2014-12-03 株式会社理光 Speaker recognition method, device and system
CN102810311B (en) * 2011-06-01 2014-12-03 株式会社理光 Speaker estimation method and speaker estimation equipment


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5696878A (en) * 1993-09-17 1997-12-09 Panasonic Technologies, Inc. Speaker normalization using constrained spectra shifts in auditory filter domain
US7003465B2 (en) * 2000-10-12 2006-02-21 Matsushita Electric Industrial Co., Ltd. Method for speech recognition, apparatus for the same, and voice controller
US6823305B2 (en) * 2000-12-21 2004-11-23 International Business Machines Corporation Apparatus and method for speaker normalization based on biometrics
CN1591570A (en) * 2003-08-13 2005-03-09 松下电器产业株式会社 Bubble splitting for compact acoustic modeling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JP Laid-Open No. 2002-189491 A, 2002.07.05
JP Laid-Open No. 2003-022088 A, 2003.01.24

Also Published As

Publication number Publication date
CN101447182A (en) 2009-06-03


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20111109

CF01 Termination of patent right due to non-payment of annual fee