CN1229971A

CN1229971A - Method for recognizing speech

Info

Publication number: CN1229971A
Application number: CN98126606A
Authority: CN
Inventors: 张育贤
Original assignee: LG Information and Communications Ltd
Current assignee: Ericsson LG Co Ltd
Priority date: 1997-12-30
Filing date: 1998-12-29
Publication date: 1999-09-29
Anticipated expiration: 2018-12-29
Also published as: CN1112670C; KR19990059297A

Abstract

Disclosed is a device and method for recognizing speech, including the following steps: switching an initial model parameters, using the switched model parameters to perform a preliminary speech recognition, using the speech recognition to adjust the model parameters, and using the adjusted model parameters to perform a speech recognition operation. By switching the model parameters in the recognition process, the effective speech recognition can be performed in a plurality of noise environments.

Description

Audio recognition method

The present invention relates to speech recognition, more particularly it relates to and utilizes a kind of speech recognition that existing model parameter is carried out with the model parameter compensation technique that shakes down that is used to adjust.

Along with the fast development of the information and the communications field, speech recognition technology has also had bigger development in recent years.Speech recognition technology makes that people can be that speech comes operating electronic equipment by voice only.For example, can carry out call or computer operation easily by voice.Go because speech recognition can be integrated in any equipment that needs to control, this technology will develop with keeping.

Yet the detected sound of electronic equipment not only comprises voice but also comprise automobile usually, the ground unrest in subway and the aircraft.In office, exist on family and the street countless by machine, electrical equipment and the noise source that the crowd produced.Therefore, from sound, isolate speech or to dispose noise be most important aspect in the identification of efficient voice.In order from detected sound with ground unrest, to identify voice exactly and done continual research at this technical elements, and still in process.

Next will be example with the mobile phone describes the audio recognition method of prior art.A CPU of mobile phone can discern a kind of input speech and recognition result is submitted to suitable element to utilize.The recognizer that has noise robustness in the mobile phone can be realized that wherein a kind of is the model parameter compensation by multiple algorithm.Because the model parameter compensation is the antinoise environment very, therefore developing the algorithm of the multiple parametric compensation that uses a model.Typical algorithm based on the model parameter compensation is a kind of parallel model combination (PMC) and a kind of vectorial Taylor series (VTS).A kind of algorithm is classified as PMC and still is classified as VTS and decides according to its method of estimating implicit Markov model (HMM) parameter.

With reference to Fig. 1, a kind of PMC speech recognition algorithm of the prior art begins (S1) by a clear speech model parameter M and a noise model parameter N.Parameter M derives in the voice from silent environment, and parameter N is to have in the noise model that estimates in the environment of ground unrest that the noise statistics that utilizes the voice produces and existing noise from one to derive.The noise model parameter N is added to or is combined to model parameter M upward to produce new model of cognition parameter (S2).To utilize this new model of cognition parameter to begin a speech recognition (S3) subsequently, and recognition result will be used to carry out a kind of function (S4).

Yet the separation of identification and model parameter conversion process makes speech recognition algorithm very loaded down with trivial details.Specifically, before the beginning voice recognition processing, must prepare the new noise model parameter what a is used for a kind of environment in advance, and itself and parameter M are made up to produce a new model of cognition parameter.In addition, owing to carried out many being similar in the process of model parameter combined treatment, it makes and can not obtain the reliable model parameter accurately that reaches.And, if can not be ready to or estimate a new noise model that is used for a kind of new environment in advance, then same noise model at random may be applied to the speech recognition of any environment.Consequently,, also can carry out unnecessary adjustment to produce model parameter, reduce the performance of recognizer thus clear speech model parameter even for a kind of environment that does not have a lot of noises.

Figure 2 shows that a kind of VTS speech recognition algorithm of the prior art.Approximate environmental model that obtain begins (S11) to this VTS algorithm so that the VTS that utilizes initial environment parameter and voice signal and carry out is approximate.Then, once judge to determine whether to restrain (S12) by approximate environmental model.If determining this environmental model restrains.Then accept estimated environmental parameter and adjust (S13) with clear speech model parameter to recognizer.After this, utilize the HMM model parameter of being adjusted to begin speech recognition (S14), and its result is used to carry out a kind of function.If in step S12, determine not convergence of environmental model, then handle and to turn back to step S11.

The VTS speech recognition algorithm also will be discerned to handle with environmental model is approximate and separate, and iteration is carried out identification and approximate processing restrains up to environmental model.Because its required calculated amount is very huge, therefore can not utilize the VTS algorithm to implement an ONLINE RECOGNITION device.And, to compare with required calculated amount, iteration does not produce tangible improvement.In addition, utilize more a spot of data to adjust the covariance of the bright model of noise and the deficiency aspect the accurately approximate environmental parameter that produces has reduced the performance of identification.

The speech recognition that realization has a noise robustness needs the correct information about ground unrest.Multiple technologies have been developed or have developed to produce and to provide accurate ground unrest statistical information to speech recognition with noise robustness.The great majority of these recognizers to recent model parameter compensation technique, all provide the statistical property ready noise model that in advance utilize multiple noise in environment before on being applied to speech recognition device from early stage spectral subtraction technology.Although can make these speech recognition devices obtain accurate estimation to ground unrest in some environment, it is not suitable for these recognizers are applied to multiple noise circumstance, and the efficient of doing like this is also very low, especially for mobile phone.

Therefore, an object of the present invention is to solve at least the problems of the prior art and shortcoming.

An object of the present invention is to provide a kind of recognizer of noise of anti-any environment with noise robustness.

Other advantage of the present invention, purpose and characteristics partly will be illustrated in ensuing explanation, and partly can be understood by common technology skilful person or be understanded from the practice of the present invention after ensuing examination.As the ground that particularly points out in the additional claim, can realize and obtain objects and advantages of the present invention.

In order to realize purpose of the present invention and according to purposes of the present invention, as implemented herein and be described in detail, a kind of method that is used for speech recognition comprises the steps: to utilize an initial relevant speech model parameter of compensation factors, utilize the model parameter that is compensated to carry out the preliminary identification of voice, utilize the result of speech recognition operation to estimate the correlation factor of an optimum, utilize estimated optimum correlation factor adjustment model parameter, and utilize the model parameter of being adjusted to carry out speech recognition.

In another embodiment of the present invention, a kind of method that is used for recognizing voice comprises the steps: to utilize an initial environment model adjustment model parameter, utilize the initial model parameter of adjusting to carry out once preliminary speech recognition, utilize the result of speech recognition to readjust environmental parameter, environmental parameter that utilization is readjusted and clear speech model parameter be the estimation model parameter again, and utilizes the HMM model parameter of compensation again to carry out speech recognition operation.

Next with reference to the accompanying drawings the present invention is described in detail, the similar element of the digital indication of wherein similar sidenote, wherein:

Figure 1 shows that the process flow diagram that utilizes a kind of PMC algorithm identified voice in the prior art;

Figure 2 shows that the process flow diagram that utilizes a kind of VTS algorithm identified voice in the prior art;

Figure 3 shows that block scheme according to a preferred embodiment of a kind of speech recognition device of the present invention;

Figure 4 shows that the process flow diagram of the speech recognition that the model parameter transfer algorithm that utilizes a kind of state to be correlated with according to the present invention carries out; And

Figure 5 shows that the process flow diagram of the speech recognition that the model parameter transfer algorithm that utilizes a kind of environment for use parameter estimation techniques according to the present invention carries out.

Figure 3 shows that a preferred embodiment according to a speech recognition device of the present invention, it comprises a vocoder 30 that voice signal is converted into a kind of pulse code modulation (PCM) speech data; A vectorial extraction unit 31 that from the PCM voice signal of vocoder 30, extracts the series of features vector; One is utilized the proper vector extracted to estimate new model parameter reappraising the initial model parameter, and the CPU (central processing unit) (CPU) of utilizing the model parameter that reappraises to carry out speech recognition; Database with 16 PCM data storage initial model parameters and recognizer candidate, and transmit the memory cell 33 of the voice of being discerned according to result that CPU32 discerned; Reach the loudspeaker 34 of an output by the voice signal that CPU32 identified.

What the recognizer of CPU32 used is a kind of algorithm based on HMM, and HMM is the statistical model that processing takes place for human speech and speech.HMM by M.J.F Gales and S.Young at " An Improved Approach to the Hidden Markov Model Decompositionof speech and Noise ", Speech Communication, no.3, pp233-236, open and intactly incorporated in this instructions in 1992.Usually, one type HMM can utilize multiple different topological structure to come modeling, and each of these topological structures can have many kinds of states.And the observation probability of HMM state is made of multiple distribution.CPU32 of the present invention uses is one and has three kinds of states and every kind of from left to right continuous HMM that state has three kinds of Gaussian distribution to observe.

In the process that identification is handled, CPU32 uses a kind of beam search to come input voice and the word that is stored in the memory cell 33 are mated to utilize HMM.CPU32 utilizes phoneme to carry out speech recognition as the base unit of voice.Specifically, a given time period extracts the series of features vector from the voice of being imported in the section a preset time.Each continuous proper vector is calculated the logarithm similarity of every kind of HMM state, and select and have peaked three to five HMM states.These three to five HMM states are known as state of activation.Begin each the continuous proper vector in the preset time section is all calculated the logarithm similarity of every kind of HMM state from the activation HMM state of previous proper vector, and select and have peaked three to five HMM states.

In the given time period, each continuous proper vector is repeated the HMM state computation and selects to handle up to the input voice all to be disposed.This processing is well-known beam search.In case search finishes, just handle and carry out traceback according to a kind of Veterbi decoding.To have the highest logarithm similarity and be elected to be the result of identification corresponding to the HMM of certain voice.Next will describe the model parameter conversion processing.Usually, the model parameter conversion processing of speech recognition device is Gaussian Mixture by mean value and covariance representative.

What one embodiment of the present of invention were used is a kind of PMC method with the relevant correlation factor of a state.Basically, be used to set up the recognizer surrounding environment the ground unrest model be that 3～5 frames corresponding to the proper vector progression that is extracted by vectorial extraction unit 31 (are equivalent to 30～50ms) data.Ground unrest be modeled as one have the 1-Gaussian Mixture (if necessary its can be expanded be 2 or the 3HMM state) 1-state HMM.Utilizing estimated ground unrest model, is the new model parameter that is more suitable in current environment with a clear speech model Parameters Transformation.Used algorithm is the PMC method with the relevant correlation factor of state in the conversion process.Therefore, CPU32 utilizes a kind of model parameter transfer algorithm to estimate the model parameter of recognizer again, and the result who submit to utilize the model parameter that reappraises to discern.

Next an a kind of speech recognition device with relevant PMC algorithm of state of a correlation factor of use according to the present invention is described with Fig. 4 with reference to Fig. 3.Vocoder 30 will be imported the PCM data (ST1) that voice are converted into 16 with the sampling rate of 8KHz.Vector extraction unit 31 will be converted into a cepstrum from the voice signal of vocoder 30, promptly as the Mel-frequency cepstral coefficients (MFCC) of proper vector.

Specifically, vectorial extraction unit 31 is handled the signal of being changed by a wave filter with pre-correction factor of a 30ms window and one 0.97, and it produces a frequency range frequency spectrum.This frequency range is divided into the frequency band of 17 mel-scales, and the frequency spectrum in each frequency band is summed to produce a frequency band energy.To carry out an inverse discrete cosine transform (IDCT) to obtain MFCC to this frequency band energy.

In the PMC method with the relevant correlation factor of state, correlation factor is defined as determining the variation in the model parameter of every kind of state of a HMM, as shown in the formula expression:

b_{li} = \hat{μ_{i}} - μ_{i}, i = 0,1,2,3 . . ., M - 1 - - - - (1)

b_{2} = \frac{1}{M} Σ_{i = 0}^{M - 1} (\hat{μ_{i}} - μ_{i}) = E {b_{l}} - - - - (2)

b=(1-λ)b _li+λb ₂……………………………………………(3)

Wherein vectorial b _LiBe the variable quantity of the average vector of (i) individual mixing, constant M is the number of the Gaussian Mixture in a kind of state.

The average vector that expression is compensated by a predetermined noise model parameter, and μ _iThe average vector of representing the clear speech model parameter in i the Gaussian Mixture.Vector b is used for compensating clear voice and by the bias vector of the state of the average vector between the noise voice, the i.e. variable quantity of model parameter at every kind; λ is a correlation factor.

Correlation factor λ is a unknown constant, and it is used to bias vector b evaluation so that model parameter is compensated, to adapt to the environment of recognizer.After utilizing the once preliminary speech recognition of being undertaken by the model parameter of original bulk b compensation, can from formula (1) to (3), obtain the optimal value of correlation factor λ.For to original bulk b evaluation, correlation factor λ is set to a predetermined initial value.In case obtain optimum correlation factor λ, then it just is used as the initial value that is used for the original bulk b evaluation of new environment.

For application of formula (1) to (3), at first the PMC method of the prior art of being discussed by reference Fig. 1 compensates clear speech model parameter.Therefore utilize the original bulk b of predetermined noise model parameter and predetermined initial relevant factor lambda desired value that clear speech model parameter is adjusted (ST3) by one.Utilize the model parameter of original compensation, carry out an initial speech identification (ST4).The result of the voice of step ST4 being discerned by Veterbi decoding carries out segmentation, and it is divided into the words and expressions that will discern a plurality of unit (ST5) of phoneme.Utilize the phoneme information of being discerned, can estimate an optimum correlation factor λ.

In order to estimate optimum correlation factor, after the voice of being discerned by preliminary speech recognition are finished Veterbi decoding, extract and be used for the information that each is identified phoneme.Therefore, utilize each information extraction that is identified phoneme just can obtain HMM status switch and bias vector b.Make the correlation factor λ of the logarithm similarity maximum of gained HMM status switch be confirmed as optimum correlation factor (ST6).Although many methods that are used to find out optimum correlation factor are arranged, used in the present invention is a kind of expectation value maximization (EM) algorithm and a kind of steepest descending method.In this preferred embodiment, but use be only need iteration just the fastest decline of convergent drive PMC (EM drives PMC) so that calculation amount and recognition time minimum.

Optimum correlation factor is used to the bias vector evaluation to the variable quantity of an average vector of having determined model parameter.Variable quantity in the average vector of every kind of HMM state is different and different with the PMC algorithm.Therefore, the average vector of every kind of HMM state will be adjusted by a quantity b who utilizes formula (1) to (3) to calculate separately, and the variable quantity of b is added on the average vector of model parameter of the every kind of phoneme that is originally compensated.Also can change the covariance of original model parameter according to the PMC algorithm, perhaps the initial covariance of model parameter also can remain unchanged.Thereby, utilize estimated correlation factor to adjust the model parameter that is compensated, and carry out a speech recognition (ST7).The result of identification is used to carry out a kind of function (ST8).

Figure 5 shows that the process flow diagram that carries out recognizing voice according to another embodiment of the invention.In this second embodiment, what the model parameter transfer algorithm used is a kind of based on the approximate environmental parameter estimation technique of VTS.Next utilize a kind of speech recognition device to describe with reference to Fig. 3 to 5 pair according to of the present invention based on the approximate environmental parameter estimation technique of VTS.

Identical with the PMC method, vocoder 30 is converted into a PMC speech data (ST11) with an input speech signal, and vectorial extraction unit 31 extracts a proper vector (ST12) from vocoder 30.In addition, the recognizer of CPU32 is to operate with the same way as of being discussed with reference to above-mentioned PMC method.Yet, in a second embodiment, utilize a kind of predetermined any environmental model at first clear speech model parameter to be adjusted (ST13).

Should predetermined any environmental model can be one by the estimated existing environmental model of initial environment parameter or one by an environmental model that any constant is estimated.For the preparing environment model, use be that first 3～5 frame corresponding to the input data of the proper vector of being extracted by vectorial extraction unit 31 (is equivalent to 30～50ms).This environmental model has been represented a kind of additional noise and channel distortion.In addition, at first come the preparing environment model by the mean value of estimating the initial environment parameter.Normally a kind of noise vector of environmental parameter and frequency spectrum transposed vector.

Utilize the initial environment model of estimating for the first time to carry out once preliminary speech recognition (ST14).Be similar to first embodiment, in case obtained the initial environment model, the model that then reappraises out just is used as the initial environment model of discerning processing in the new environment.By a Veterbi decoding voice of being discerned are divided into base unit, i.e. phoneme (ST15).Identical with first embodiment, utilize the phoneme of being discerned to estimate environmental parameter.In order to estimate these parameters, can carry out a kind of (0) rank or first rank VTS approximate (ST16) of the EM of utilization algorithm.

In this preferred embodiment, only carry out iteration one time in order to reduce calculated amount.The environmental parameter that utilization reappraises is adjusted clear voice HMM model parameter, and carries out a speech recognition operation (ST17).In this algorithm, the variation size of the average vector of model parameter is that (0) rank VTS is approximate, is similar to covariance but the equal variation size of the covariance of average vector and model parameter is the first rank VTS.

According to the present invention, can accurately estimate the variable quantity that is used for corresponding to the model parameter of every kind of HMM state of different phonemes.Thereby, just no longer need to come all recognition objective words and expressions are carried out whole adjustment by an independent estimating noise model.Consequently, can avoid the phonological component that does not have because of the serious distortion of noise is carried out unnecessary adjustment.And, in setting up the process of environmental parameter, with a large amount of calculating that no longer need to produce owing to estimating repeatedly before the model convergence.However, owing to the preliminary identification processing of environmental parameter utilization is estimated, thereby still can obtain an estimation more accurately.

In a word, can in identifying, carry out, and needn't need prior statistics ground unrest according to model parameter method for transformation of the present invention.Therefore, even in a kind of environment of many noises, also can be applied to any device that needs speech recognition.

The foregoing description only is exemplary, and should not be construed as limitation of the present invention.Can at an easy rate this method be applied to the device of other type.Explanation of the present invention only plays the illustration effect, and does not limit the scope of claim.For those technology skilful persons, multiple replacement scheme is revised and change is conspicuous.

Claims

1. an audio recognition method is characterized in that comprising the steps:

(a) to implying compensation one model parameter of every kind of state in the Markov model;

(b) utilize the model parameter that is compensated to carry out once preliminary speech recognition;

(c) utilize the model parameter that adjustment as a result compensated of preliminary speech recognition; And

(d) utilize the model parameter of being adjusted to carry out speech recognition.

2. the method for claim 1 comprises the steps: in addition

According to one in the step (a) initial relevant compensation factors model parameter;

Utilize the result of preliminary speech recognition to estimate an optimum correlation factor; And

The model parameter that is compensated according to the optimum correlation factor adjustment in the step (c).

3. a method as claimed in claim 2 is characterized in that compensation rate is to utilize a clear speech model parameter, and a noise model parameter and initial correlation factor are tried to achieve.

4. a method as claimed in claim 3 is characterized in that the noise model parameter is to utilize the initial segment of input speech frame to produce.

5. a method above-mentioned as claim 2 is characterized in that the step of estimating an optimum correlation factor comprises the steps:

The tone decoding that to be discerned by preliminary speech recognition is the voice of predetermined base unit;

From each base unit, extract identifying information, and

The correlation factor of the logarithm similarity maximum of an identifying information that makes each base unit is defined as estimated optimum correlation factor.

6. method as claimed in claim 5, what it is characterized in that using in decoding step is a kind of Veterbi decoding method.

7. method as claimed in claim 5, the predetermined base unit that it is characterized in that voice is a phoneme.

8. method as claimed in claim 5 is characterized in that in the step of determining correlation factor, use be a kind of expectation value maximization algorithm or steepest descending method.

9. a method as claimed in claim 8 is characterized in that an expectation value maximization algorithm and a steepest descending method iteration once.

10. a method as claimed in claim 2 is characterized in that optimum correlation factor is used as the initial optimum correlation factor of carrying out speech recognition in the new environment.

11. the method for claim 1, it is characterized in that step (b1) and (d) in speech recognition undertaken by beam search.

12. the method for claim 1 is characterized in that comprising the steps: in addition

In step (a) according to utilizing an estimated initial environment model of initial environment parameter that model parameter is compensated;

Utilize the result of preliminary speech recognition to adjust the initial environment parameter;

Utilize the environmental parameter of being adjusted to reappraise the initial environment model; And

Adjust the model parameter that is compensated according to the environmental model that in step (c), is reappraised.

13. a method as claimed in claim 12 is characterized in that environmental model is to utilize an initial segment of input speech frame to produce.

14. a method as claimed in claim 12 is characterized in that the step of adjusting the initial environment parameter comprises the steps:

The tone decoding that preliminary speech recognition is discerned is the voice of predetermined base unit;

From each base unit, extract identifying information, and

Each base unit is estimated environmental parameter.

15. a method as claimed in claim 14 is characterized in that environmental parameter estimates by a kind of (0) rank or the first rank VTS approximation method.

16. one kind as method as described in the claim 15, it is characterized in that if what use is (0) rank VTS approximation method, then will change an average vector of environmental parameter, and if what use is the first rank approximation method, then will change an average vector and covariance of environmental parameter.

17. method as claimed in claim 15, what it is characterized in that using in VTS is approximate is a kind of expectation value algorithm that maximizes.

18. a method as claimed in claim 14, what it is characterized in that using in decoding step is a kind of Veterbi decoding method.

19. a method as claimed in claim 14, the predetermined base unit that it is characterized in that voice are phonemes.

20. a method as claimed in claim 12 is characterized in that the environmental parameter of being adjusted is used as the initial environment parameter of carrying out speech recognition in the new environment.

21. a speech recognition system is characterized in that comprising:

A vocoder that is used for input speech signal is converted into the PCM voice signal;

A vector extraction circuit that is used for extracting the series of features vector from the PCM voice signal;

A model parameter of utilizing the characteristic vector sequence that is extracted to reach and compensated by initial noise model parameter is carried out once preliminary speech recognition to the input voice, and from the result of preliminary speech recognition, estimate the noise model parameter, and utilize characteristic vector sequence of being extracted and the recognizer that carries out the secondary speech recognition by the adjustment model parameter that estimated noise model parameter compensates;

A storer that is used for storing initial noise model parameter and will be identified a database of words and expressions.