CN101727901A - Method for recognizing Chinese-English bilingual voice of embedded system - Google Patents
Method for recognizing Chinese-English bilingual voice of embedded system
- Publication number
- CN101727901A CN101727901A CN200910242406A CN200910242406A CN101727901A CN 101727901 A CN101727901 A CN 101727901A CN 200910242406 A CN200910242406 A CN 200910242406A CN 200910242406 A CN200910242406 A CN 200910242406A CN 101727901 A CN101727901 A CN 101727901A
- Authority
- CN
- China
- Prior art keywords
- model
- chinese
- english
- voice
- english bilingual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention belongs to the technical field of voice recognition, and in particular relates to a method for recognizing Chinese-English bilingual voice in an embedded system. The method comprises the following steps: A/D sampling; pre-emphasis of the sampled voice to boost the energy of high-frequency signals; windowing and framing; extraction of voice characteristic parameters; and matching recognition of voice commands according to a pre-established acoustic model. The acoustic model is established by determining a Chinese-English bilingual voice recognition initial model and then fusing and adjusting the non-native models of that initial model; the matching recognition of voice commands is specifically recognition of Chinese-English bilingual voice commands. The method overcomes the defect that conventional voice recognition systems can recognize only a single language.
Description
Technical field
The invention belongs to the technical field of speech recognition, and in particular relates to a method for recognizing Chinese-English bilingual voice in an embedded system.
Background art
In recent years, dedicated speech recognition chips have developed rapidly abroad. Many foreign voice-technology and semiconductor companies have invested substantial manpower and resources in developing such chips, and have patented the recognition algorithms for their own national languages. The recognition performance of these special-purpose (system) chips varies. A common speech recognition process is shown in Fig. 1: the input voice signal is first A/D sampled; spectrum-shaping, windowing, and pre-emphasis processing boost the high-frequency components; feature parameters, namely Mel-frequency cepstral coefficients (MFCC), are then extracted in real time; and recognition template training and template matching are carried out. To improve the chip's robustness in noisy environments, speech enhancement may also be applied. A dedicated chip generally comprises an 8-bit or 16-bit MCU controller or a 16-bit DSP microprocessor, together with automatic gain control (AGC), an audio preamplifier, a low-pass filter, an analog-to-digital (A/D) converter, a digital-to-analog (D/A) converter, an audio power amplifier, and read-only memory (ROM). Such dedicated (system) chips have begun to appear in intelligent voice toys and mobile communication terminals.
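The front-end chain described above (pre-emphasis to boost the high-frequency components, then windowing and framing before feature extraction) can be sketched as follows. This is a minimal illustration, not the patent's implementation; the 16 kHz rate, 25 ms/10 ms framing, and pre-emphasis coefficient 0.97 are conventional assumed values.

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    """Boost high-frequency energy: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_and_window(signal, frame_len=400, frame_shift=160):
    """Split the signal into overlapping frames and apply a Hamming window."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    return np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])

# 1 second of 16 kHz audio -> pre-emphasis, then 25 ms frames every 10 ms
x = np.random.default_rng(0).standard_normal(16000)
frames = frame_and_window(preemphasize(x))
print(frames.shape)  # (98, 400)
```

Each row of `frames` would then feed the MFCC computation (filter bank plus cepstral transform) that produces the feature parameters.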
However, existing high-performance medium-vocabulary recognition chips can recognize only a single language: the recognition task must consist of voice commands in one language, such as Chinese, English, or Japanese, and mixed bilingual commands such as Chinese-English are not supported.
Yet as internationalization deepens in economics, politics, culture, and academia, bilingual phenomena are increasingly common in people's daily lives, such as given names that mix Chinese and English. Recognition systems built for a single language, whether Chinese or English, therefore fall ever further short of the demands of the times. In particular, since Chinese and English have the largest numbers of users and the widest use in the world, building a system that can recognize mixed Chinese and English speech, and realizing it on portable devices such as dedicated chip systems, is extremely important.
Summary of the invention
The object of the invention is to overcome the limitation that existing chip systems can recognize only a single language by proposing a Chinese-English bilingual speech recognition method for embedded systems. The method is based on embedded Chinese-English bilingual speech recognition with phoneme-fusion modeling and an embedded speech enhancement method.
The technical scheme is a Chinese-English bilingual speech recognition method for an embedded system, comprising: A/D sampling; pre-emphasis of the sampled speech to boost the energy of high-frequency signals; windowing and framing; extraction of speech feature parameters; and matching recognition of voice commands according to a pre-established acoustic model. The method is characterized in that the acoustic model is established by building a Chinese-English bilingual speech recognition initial model and by fusion adjustment of the non-native models of that initial model, and the matching recognition of voice commands is specifically recognition of Chinese-English bilingual voice commands.
Building the Chinese-English bilingual speech recognition initial model comprises revising a Chinese speech recognition model, revising an English speech recognition model, merging the revised Chinese and English recognition models, and training the merged Chinese and English recognition models.
The fusion adjustment of the non-native models adopts a selectable model merging method to fuse the native and non-native models, and then applies minimum phone error discriminative training to the merged initial model to obtain the Chinese-English bilingual speech recognition model.
Recognition of a Chinese-English bilingual voice command extracts recognition features from the input speech signal, computes Gaussian scores of the bilingual recognition model, performs template matching against the bilingual vocabulary entries, and takes the entry with the highest matching score as the recognition result.
The method may further comprise a speech enhancement step.
Merging the revised Chinese and English recognition models specifically adopts a model-distance computation based on state time alignment: the pairwise distances between Chinese and English phonemes are computed, and the pair of phonemes with the minimum distance is merged.
Training the merged Chinese and English recognition models adopts the maximum likelihood estimation criterion with the expectation-maximization iterative estimation algorithm to obtain the Chinese-English bilingual speech recognition initial model.
Training of the merged Chinese and English recognition models is carried out on a PC.
Fusing the native and non-native models with the selectable model merging method comprises the following steps:
(11) Train a native model M1 on a purely native database.
(12) Adapt M1 with a small amount of non-native data using maximum likelihood linear regression to obtain model M2.
(13) Following the selectable merging strategy, linearly interpolate, for a native phoneme λ_i, the model S_b corresponding to λ_i in the Chinese-English bilingual speech recognition initial model, the native model S_ne corresponding to λ_i in model M1, the adapted model S_a corresponding to λ_i in model M2, and the adapted model γ_m of the confusable phoneme γ_j that corresponds to λ_i in the pronunciation dictionary obtained by the non-native confusable-phoneme variation method, to obtain the adjusted model S_f of the merged phoneme λ_i. The model interpolation formula is:
p(S_f) = λ_1 p(S_b) + λ_2 p(S_ne) + λ_3 p(S_a) + λ_4 p(γ_m)
where λ_1, λ_2, λ_3, and λ_4 are the interpolation factors of the corresponding models.
Minimum phone error discriminative training of the merged Chinese-English bilingual speech recognition initial model comprises: using a speech recognizer to obtain word-lattice information for the training utterances; training Chinese and English language models from the word-level annotations of the speech training corpus; and updating the model parameters on the obtained lattices with the Forward-Backward algorithm.
The speech enhancement step adopts an improved Wiener filtering algorithm comprising the following steps:
(21) using a segment of typical background noise as the initial noise estimate;
(22) using a sliding filter and a tri-state machine to perform robust noise detection on noisy speech at different input signal-to-noise ratios, comparing the filter output against a preset threshold, and deciding according to the decision condition whether the current frame is background noise; if so, executing step (23);
(23) estimating the a priori signal-to-noise ratio of the current frame with the decision-directed algorithm, and updating the noise estimate using history-frame information;
(24) applying two-stage inter-frame smoothing to improve the spectral continuity of the enhanced speech signal and reduce its distortion.
The a priori SNR estimate ξ_k(n) of the current frame is obtained by weighting the previous frame's a priori SNR estimate ξ_k(n − 1) with the estimate γ_k(n) of the current frame's a posteriori SNR:
ξ_k(n) = p · ξ_k(n − 1) + (1 − p) · max(γ_k(n) − a, 0)
where p is a feedback factor that controls the contributions of the previous and current frames to the a priori SNR estimate, and a is the convergence control factor.
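The decision-directed estimation described here can be sketched as a simple recursion over per-frame a posteriori SNR values. This is an illustrative sketch, not the patent's code; the symbol names and the values p = 0.98 and a = 1.0 are assumptions.

```python
import numpy as np

def decision_directed_snr(gamma, p=0.98, a=1.0, xi_init=1.0):
    """Recursively estimate the a priori SNR xi[n] from the a posteriori
    SNR gamma[n]: the previous a priori estimate is weighted against the
    floored instantaneous estimate max(gamma[n] - a, 0)."""
    xi = np.empty_like(gamma, dtype=float)
    prev = xi_init
    for n, g in enumerate(gamma):
        prev = p * prev + (1.0 - p) * max(g - a, 0.0)
        xi[n] = prev
    return xi

gammas = np.array([5.0, 4.0, 0.5, 0.2, 3.0])  # per-frame a posteriori SNRs
xi = decision_directed_snr(gammas)
print(xi.round(3))
```

With p close to 1 the estimate changes slowly, which is what suppresses musical-noise artifacts in the enhanced signal.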
The method provided by the invention overcomes the limitation that existing chip systems can recognize only a single language; it features low algorithmic complexity, high recognition accuracy, and robust recognition performance in noisy environments.
Description of drawings
Fig. 1 is a schematic diagram of a common speech recognition process;
Fig. 2 is a schematic flow diagram of the Chinese-English bilingual speech recognition method provided by the invention;
Fig. 3 is a table of confusable phoneme changes when Chinese speakers say English;
Fig. 4 is a schematic diagram of the time-segmentation information obtained by the phoneme merging method based on state time alignment.
Embodiment
Preferred embodiments are described in detail below with reference to the drawings. It should be emphasized that the following description is exemplary only, and is not intended to limit the scope or application of the invention.
Fig. 2 is a schematic flow diagram of the method. As shown in Fig. 2, the method comprises the following steps: A/D sampling and pre-emphasis of the sampled speech to boost the energy of high-frequency signals; windowing and framing; extraction of speech feature parameters; building the Chinese-English bilingual speech recognition initial model; fusion adjustment of the non-native models of the initial model; and recognition of Chinese-English bilingual voice commands. The sampling, pre-emphasis, windowing, framing, and feature extraction are existing techniques; building the bilingual initial model, the non-native model fusion adjustment, and the bilingual command recognition are new techniques proposed by the invention.
Building the Chinese-English bilingual speech recognition initial model comprises revising the Chinese speech recognition model, revising the English speech recognition model, merging the revised models, and training the merged Chinese and English recognition models.
The Chinese and English recognition models are first revised by trimming the pronunciation dictionary (i.e., the Chinese and English recognition models) according to the pronunciation differences that arise when Chinese speakers say English or foreigners speak Chinese. Two approaches exist: one based on expert knowledge and one based on data-driven analysis. The invention combines both strategies: under expert guidance it captures generally applicable pronunciation-variation rules with little dependence on the volume of non-native data, while the data-driven component ensures a good match with real data; the result is good matching with real data, little manual intervention, and easy generalization. In the data-driven method, the reference phoneme annotations of the training data are compared with the recognizer's output to obtain a confusion matrix of easily confused phonemes, and the final pronunciation-variation rules are then determined under expert guidance. Taking Chinese speakers saying English as an example, Fig. 3 lists the confusable phoneme changes; according to the variation rules finally determined, the English pronunciation dictionary is then revised.
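The data-driven part of this dictionary revision can be sketched as follows: count phoneme confusions between reference annotations and recognizer output, keep the frequent rules that expert review approves, and add variant pronunciations to the lexicon. The phoneme names, lexicon, threshold, and the `th` to `s` rule below are illustrative assumptions, not the patent's actual data.

```python
from collections import Counter

def confusion_pairs(ref_seqs, hyp_seqs):
    """Count (reference, recognized) phoneme confusions over aligned pairs."""
    counts = Counter()
    for ref, hyp in zip(ref_seqs, hyp_seqs):
        for r, h in zip(ref, hyp):
            if r != h:
                counts[(r, h)] += 1
    return counts

def revise_lexicon(lexicon, confusions, approved, min_count=2):
    """Add a variant pronunciation for each frequent, expert-approved rule."""
    revised = {w: list(prons) for w, prons in lexicon.items()}
    rules = {p for p, c in confusions.items()
             if c >= min_count and p in approved}
    for word, prons in lexicon.items():
        for pron in prons:
            for src, dst in rules:
                if src in pron:
                    variant = [dst if ph == src else ph for ph in pron]
                    if variant not in revised[word]:
                        revised[word].append(variant)
    return revised

refs = [["th", "ih", "s"], ["th", "ae", "ng", "k"]]
hyps = [["s", "ih", "s"], ["s", "ae", "ng", "k"]]  # a common L1-Chinese change
lex = {"this": [["th", "ih", "s"]]}
rev = revise_lexicon(lex, confusion_pairs(refs, hyps), {("th", "s")})
print(rev)
```

The expert-knowledge step corresponds to the `approved` set: only confusions confirmed by a human reviewer become dictionary rules.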
After the Chinese and English recognition models are revised, the two revised models are merged to obtain a unified model set of smaller scale. Obtaining a smaller recognition model requires merging the Chinese and English models, and to maintain a high recognition rate, only models that are sufficiently close in the acoustic model space are merged. The invention measures the distance between two models with a model-distance computation based on state time alignment. Taking a Chinese phoneme λ_i and an English phoneme γ_j as an example, the distance between the two models is computed as follows. Several speech segments of each phoneme are first prepared from manually annotated speech. Each speech segment of λ_i is then aligned by Viterbi state time alignment against both its own phoneme λ_i and the other phoneme γ_j, yielding the segmentation information shown in Fig. 4, where λ_i and γ_j denote the two models before merging. As the figure shows, five segmentation intervals are obtained; for each corresponding time segment m, the Bhattacharyya distance between the two models is computed and denoted D_m, and the segment distances are then weighted by segment length to give the directed distance:
D(λ_i, γ_j) = Σ_m (l_m / Σ_n l_n) · D_m
Conversely, each speech segment of γ_j is aligned against its own phoneme γ_j and against λ_i, and the same procedure gives D(γ_j, λ_i). The final distance between models λ_i and γ_j is:
D = (D(λ_i, γ_j) + D(γ_j, λ_i)) / 2
With these pairwise Chinese-English phoneme distances computed, the pair with the minimum distance is merged, and the merging loop repeats until the phoneme count drops to the required number. Using this state-time-alignment distance computation, 15 pairs of Chinese and English phonemes were merged in total, significantly reducing the scale of the phone set to fit the resource constraints of an embedded system.
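The greedy pairwise merging described above can be sketched as follows, assuming each phoneme model is summarized as a single diagonal Gaussian (the patent's models are HMMs whose segment distances are accumulated over aligned states); all model values here are synthetic.

```python
import numpy as np
from itertools import product

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal Gaussians."""
    v = (var1 + var2) / 2.0
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / v)
    term2 = 0.5 * np.sum(np.log(v / np.sqrt(var1 * var2)))
    return term1 + term2

def merge_closest(cn_models, en_models, n_merges):
    """Greedily merge the n_merges closest Chinese/English phoneme pairs."""
    merged = []
    cn, en = dict(cn_models), dict(en_models)
    for _ in range(n_merges):
        pair = min(product(cn, en),
                   key=lambda p: bhattacharyya(*cn[p[0]], *en[p[1]]))
        merged.append(pair)
        del cn[pair[0]], en[pair[1]]
    return merged

rng = np.random.default_rng(1)
cn = {f"cn{i}": (rng.normal(i, 1, 4), np.ones(4)) for i in range(3)}
en = {f"en{j}": (rng.normal(j + 0.1, 1, 4), np.ones(4)) for j in range(3)}
pairs = merge_closest(cn, en, 2)
print(pairs)
```

In the patent's procedure the loop would stop once the phone set reaches the target size (15 merges in the reported configuration).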
Next, the merged Chinese and English recognition models are trained on large Chinese and English speech databases. Training uses the MLE (maximum likelihood estimation) criterion with the EM (expectation-maximization) iterative estimation algorithm, yielding the Chinese-English bilingual speech recognition initial model. The whole training process is carried out on a PC.
Fusion adjustment of the non-native models of the initial model adopts a selectable model merging method to fuse the native and non-native models, and then applies minimum phone error discriminative training to the merged initial model to obtain the Chinese-English bilingual speech recognition model.
Non-native speakers often have a native-language accent or non-standard pronunciation, which causes the recognition system to make errors; model fusion must therefore be used to adjust the recognition initial model. The invention fuses the native and non-native models with the selectable model merging method, revising the parameters of the recognition templates, as follows:
(11) Train a native model M1 on a purely native database.
(12) Adapt M1 with a small amount of non-native data using maximum likelihood linear regression to obtain model M2.
(13) Following the selectable merging strategy, linearly interpolate, for a native phoneme λ_i, the model S_b corresponding to λ_i in the Chinese-English bilingual speech recognition initial model, the native model S_ne corresponding to λ_i in model M1, the adapted model S_a corresponding to λ_i in model M2, and the adapted model γ_m of the confusable phoneme γ_j that corresponds to λ_i in the pronunciation dictionary obtained by the non-native confusable-phoneme variation method, to obtain the adjusted model S_f of the merged phoneme λ_i. The model interpolation formula is:
p(S_f) = λ_1 p(S_b) + λ_2 p(S_ne) + λ_3 p(S_a) + λ_4 p(γ_m)
where λ_1, λ_2, λ_3, and λ_4 are the interpolation factors of the corresponding models.
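Under the interpolation formula above, the adjusted model's output probability is a convex combination of the four component models' probabilities. A minimal sketch, assuming univariate Gaussian component models and illustrative interpolation factors (the patent does not fix their values):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Density of a univariate Gaussian N(mu, var) at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def fused_prob(x, models, factors):
    """p(S_f) = sum_k lambda_k * p(S_k), with the factors summing to 1."""
    assert abs(sum(factors) - 1.0) < 1e-9
    return sum(lam * gauss_pdf(x, mu, var)
               for lam, (mu, var) in zip(factors, models))

# (mu, var) stand-ins for S_b (initial bilingual), S_ne (native),
# S_a (adapted), and gamma_m (confusable phoneme's adapted model)
models = [(0.0, 1.0), (0.2, 1.0), (0.1, 1.2), (0.5, 2.0)]
factors = [0.4, 0.3, 0.2, 0.1]  # lambda_1 .. lambda_4, assumed values
p = fused_prob(0.0, models, factors)
print(round(p, 4))
```

Because the factors sum to one, the fused density stays a proper mixture; tuning them trades off the native model against the accent-adapted models.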
To obtain a finer model, and in particular to further improve the recognition rate for non-native bilingual speech, the invention applies discriminative training in a bilingual setting for the first time. Following the MPE (minimum phone error) criterion, the bilingual recognition model obtained above is discriminatively trained: a speech recognizer first produces word-lattice information for the training utterances, and Chinese and English language models are trained from the word-level annotations of the training corpus; the model parameters are then updated on the obtained lattices with the Forward-Backward algorithm. After several iterations of parameter estimation, the model parameters are further adjusted and greater distinctiveness is maintained between the models. With the bilingual recognition model adjusted for non-native speech, the bilingual recognition rate for native speech does not decrease, while the recognition rate for non-native bilingual speech improves significantly; the final recognition rates for both native and non-native Chinese and English all exceed 98%.
Recognition of a Chinese-English bilingual voice command extracts recognition features from the input speech signal, computes Gaussian scores of the bilingual recognition model, performs template matching against the bilingual vocabulary entries, and takes the entry with the highest matching score as the recognition result. Feature extraction can use common speech feature parameter extraction methods. To improve both recognition speed and accuracy, the decision process is divided into a coarse recognition stage and a fine recognition stage. The coarse models have fewer than 200 parameters, so coarse recognition is fast; speech with non-standard or easily confused pronunciation is then recognized again with fine models of roughly 1000 parameters. Because very few candidates survive coarse recognition, fine recognition remains fast despite the larger models. Two-stage recognition thus improves both the average recognition speed and the recognition accuracy.
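The coarse-then-fine decision described above can be sketched as a two-pass scorer. The scoring functions, vocabulary entries, and candidate count below are toy assumptions; in the actual method the scores would be Gaussian acoustic scores from the small and large model sets.

```python
def two_stage_recognize(features, entries, coarse_score, fine_score, top_n=5):
    """Rank all entries with a cheap coarse score, then rescore only the
    best top_n candidates with the expensive fine score."""
    candidates = sorted(entries, key=lambda e: coarse_score(features, e),
                        reverse=True)[:top_n]
    return max(candidates, key=lambda e: fine_score(features, e))

# Toy scores: coarse = matching-character count, fine adds a length penalty
entries = ["zhang wei", "zhang wei phone", "john", "zhang"]
coarse = lambda f, e: sum(a == b for a, b in zip(f, e))
fine = lambda f, e: coarse(f, e) - abs(len(f) - len(e))
result = two_stage_recognize("zhang wei", entries, coarse, fine, top_n=3)
print(result)  # zhang wei
```

The design point is that the expensive scorer only ever sees `top_n` candidates, so its larger parameter count barely affects the average recognition time.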
To improve speech recognition performance in noisy environments, the invention may further include a speech enhancement step, which proceeds as follows:
(21) Use a segment of typical background noise as the initial noise estimate.
(22) Use a sliding filter and a tri-state machine to perform robust noise detection on noisy speech at different input signal-to-noise ratios: compare the filter output against a preset threshold and decide, according to the decision condition, whether the current frame is background noise; if so, execute step (23); otherwise, finish.
(23) Estimate the a priori signal-to-noise ratio of the current frame with the decision-directed algorithm, and update the noise estimate using history-frame information. The a priori SNR estimate ξ_k(n) of the current frame is obtained by weighting the previous frame's a priori SNR estimate ξ_k(n − 1) with the estimate γ_k(n) of the current frame's a posteriori SNR:
ξ_k(n) = p · ξ_k(n − 1) + (1 − p) · max(γ_k(n) − a, 0)
(24) Apply two-stage inter-frame smoothing to improve the spectral continuity of the enhanced speech signal and reduce its distortion.
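Step (24)'s two-stage inter-frame smoothing can be sketched as two successive first-order recursive averages over the per-frame spectral gains; the smoothing constants below are illustrative assumptions, not values given in the patent.

```python
import numpy as np

def smooth(frames_gain, beta):
    """First-order recursive average along the frame axis."""
    out = np.empty_like(frames_gain)
    out[0] = frames_gain[0]
    for n in range(1, len(frames_gain)):
        out[n] = beta * out[n - 1] + (1.0 - beta) * frames_gain[n]
    return out

def two_stage_smooth(gains, beta1=0.6, beta2=0.3):
    """Apply two smoothing passes to reduce frame-to-frame gain jumps."""
    return smooth(smooth(gains, beta1), beta2)

gains = np.array([[1.0], [0.0], [1.0], [0.0]])  # alternating gain: worst case
smoothed = two_stage_smooth(gains)
# The maximum frame-to-frame variation shrinks after smoothing
print(np.abs(np.diff(smoothed, axis=0)).max())
```

Reducing the frame-to-frame jumps in the gain trajectory is what improves spectral continuity and suppresses audible distortion in the enhanced speech.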
The Chinese-English bilingual speech recognition method provided by the invention realizes bilingual recognition while keeping the model scale no larger than that of a single-language recognition system and occupying fewer storage resources. While accommodating non-native speech, it maintains a high recognition rate for native speech and achieves high-performance non-native recognition; in addition, speech enhancement improves recognition accuracy in noisy environments. The method is well suited to embedded implementations of Chinese-English bilingual recognition.
The invention was tested on a real portable mobile phone platform running a bilingual Chinese and English name-dialing system, with a recognition task comprising 500 English names and 500 Chinese names. Experiments show that, in terms of memory, the bilingual recognition method of the invention requires resources close to those of a single-language recognition system. The system handles Chinese and English name recognition simultaneously; while accommodating non-native speech, it maintains a high native recognition rate and achieves high non-native performance, with final native and non-native bilingual recognition rates all above 98%. Speech enhancement further improves accuracy in noisy environments, making the method suitable for embedded implementation of Chinese-English bilingual recognition.
The above is only a preferred embodiment of the invention, and the protection scope of the invention is not limited thereto. Any variation or replacement readily conceivable by those skilled in the art within the technical scope disclosed by the invention shall be encompassed within the protection scope of the invention. The protection scope of the invention shall therefore be determined by the protection scope of the claims.
Claims (9)
1. A Chinese-English bilingual speech recognition method for an embedded system, comprising: A/D sampling; pre-emphasis of the sampled speech to boost the energy of high-frequency signals; windowing and framing; extraction of speech feature parameters; and matching recognition of voice commands according to a pre-established acoustic model; characterized in that the acoustic model is established by building a Chinese-English bilingual speech recognition initial model and by fusion adjustment of the non-native models of the initial model, and the matching recognition of voice commands is specifically recognition of Chinese-English bilingual voice commands;
wherein building the Chinese-English bilingual speech recognition initial model comprises revising a Chinese speech recognition model, revising an English speech recognition model, merging the revised Chinese and English recognition models, and training the merged Chinese and English recognition models;
the fusion adjustment of the non-native models adopts a selectable model merging method to fuse the native and non-native models, and applies minimum phone error discriminative training to the merged initial model to obtain the Chinese-English bilingual speech recognition model;
and the recognition of Chinese-English bilingual voice commands extracts recognition features from the input speech signal, computes Gaussian scores of the Chinese-English bilingual speech recognition model, performs template matching against the bilingual vocabulary entries, and takes the entry with the highest matching score as the recognition result.
2. The method according to claim 1, characterized in that the method further comprises a speech enhancement step.
3. The method according to claim 1 or 2, characterized in that merging the revised Chinese and English speech recognition models specifically adopts a model-distance computation based on state time alignment, computes the pairwise distances between Chinese and English phonemes, and merges the pair of phonemes with the minimum distance.
4. The method according to claim 1 or 2, characterized in that training the merged Chinese and English speech recognition models adopts the maximum likelihood estimation criterion and the expectation-maximization iterative estimation algorithm to obtain the Chinese-English bilingual speech recognition initial model.
5. The method according to claim 1 or 2, characterized in that training of the merged Chinese and English speech recognition models is carried out on a PC.
6. the method for recognizing Chinese-English bilingual voice of a kind of embedded system according to claim 1 and 2 is characterized in that the selectable model merging method of described employing merges mother tongue model and non-mother tongue model, comprises the following steps:
(11) the database training by pure mother tongue obtains a mother tongue model M 1;
(12) use the linear homing method of maximum likelihood to carry out self-adaptation with a spot of non-mother tongue database to model M 1, obtain model M 2;
(13) according to the selectable model merging strategy, linearly interpolating the model S_b corresponding to a given native-language phoneme λ_i in the initial Chinese-English bilingual speech recognition model, the native-language model S_ne corresponding to phoneme λ_i in model M1, the adaptive model S_a corresponding to λ_i in model M2, and the adaptive model γ_m of the confusable phoneme γ_j that corresponds to λ_i in the pronunciation dictionary obtained by the non-native confusable-phoneme method, to obtain the adjusted model S_f of the merged phoneme λ_i. The interpolation formula is:

p(S_f) = λ_1·p(S_b) + λ_2·p(S_ne) + λ_3·p(S_a) + λ_4·p(γ_m)

where λ_1, λ_2, λ_3 and λ_4 denote the interpolation factors of the corresponding models.
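The interpolation in step (13) can be sketched as a simple weighted sum. The component likelihoods and factor values below are invented for illustration; the factors are assumed to sum to one so the result remains a probability, which the claim does not state explicitly.

```python
def interpolate(p_b, p_ne, p_a, p_gamma, weights):
    """p(S_f) = λ1·p(S_b) + λ2·p(S_ne) + λ3·p(S_a) + λ4·p(γ_m)."""
    l1, l2, l3, l4 = weights
    # assumption: interpolation factors form a convex combination
    assert abs(l1 + l2 + l3 + l4 - 1.0) < 1e-9
    return l1 * p_b + l2 * p_ne + l3 * p_a + l4 * p_gamma

# hypothetical component likelihoods and interpolation factors
p_sf = interpolate(0.4, 0.3, 0.2, 0.1, weights=(0.5, 0.2, 0.2, 0.1))
```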
7. The method for recognizing Chinese-English bilingual voice of an embedded system according to claim 1 or 2, characterized in that the minimum phone error discriminative training of the merged initial Chinese-English bilingual speech recognition model comprises: using a speech recognizer to obtain word-lattice information for the training utterances; training Chinese and English language models from the word-level annotations of the speech training corpus; and running the forward-backward algorithm over the obtained word lattices to update the model parameters.
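The forward-backward pass over a word lattice can be sketched on a toy example. The lattice topology, labels, and arc weights below are invented for illustration, and real minimum-phone-error training would weight these arc posteriors by phone accuracy before updating model parameters.

```python
from collections import defaultdict

def arc_posteriors(arcs, start, end):
    """Compute the posterior probability of each lattice arc.

    `arcs` is a list of (src, dst, label, weight) with positive weights;
    posterior(arc) = forward(src) * weight * backward(dst) / total.
    Node ids are assumed to be in topological order.
    """
    fwd = defaultdict(float); fwd[start] = 1.0
    bwd = defaultdict(float); bwd[end] = 1.0
    for src, dst, label, w in sorted(arcs, key=lambda a: a[0]):   # forward pass
        fwd[dst] += fwd[src] * w
    for src, dst, label, w in sorted(arcs, key=lambda a: -a[1]):  # backward pass
        bwd[src] += w * bwd[dst]
    total = fwd[end]
    return {label: fwd[src] * w * bwd[dst] / total for src, dst, label, w in arcs}

# toy bilingual lattice: node 0 -> 1 via "open" (0.6) or "打开" (0.4), 1 -> 2 via "door"
arcs = [(0, 1, "open", 0.6), (0, 1, "打开", 0.4), (1, 2, "door", 1.0)]
post = arc_posteriors(arcs, 0, 2)
```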
8. The method for recognizing Chinese-English bilingual voice of an embedded system according to claim 1 or 2, characterized in that the speech enhancement step uses an improved Wiener filtering algorithm, comprising the following steps:
(21) using a segment of typical background noise as the initial noise estimate;
(22) performing robust endpoint detection with a sliding filter and a three-state state machine: for noisy speech signals at different input signal-to-noise ratios, comparing the filter output with a preset threshold and deciding, according to the decision condition, whether the current frame is background noise; if so, executing step (23); otherwise, ending;
(23) estimating the a priori signal-to-noise ratio of the current frame with the decision-directed algorithm, and updating the noise estimate using information from historical frames;
(24) applying two-stage inter-frame smoothing to improve the spectral continuity of the enhanced speech signal and reduce speech distortion.
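Per frame, the core of steps (21)–(23) can be sketched as a decision-directed a priori SNR estimate driving a Wiener gain. The smoothing factor `alpha`, the single-bin toy spectra, and the fixed noise floor are assumptions; the patent's three-state noise detector and two-stage smoothing are omitted from this sketch.

```python
def wiener_enhance(noisy_power, noise_power, alpha=0.98):
    """Enhance a sequence of per-frame spectral powers with a Wiener gain.

    a posteriori SNR:  gamma = noisy / noise
    a priori SNR (decision-directed):
        xi = alpha * prev_gain**2 * prev_gamma + (1 - alpha) * max(gamma - 1, 0)
    Wiener gain:       G = xi / (1 + xi)
    """
    enhanced, prev_gain, prev_gamma = [], 1.0, 1.0
    for p in noisy_power:
        gamma = p / noise_power
        xi = alpha * prev_gain ** 2 * prev_gamma + (1 - alpha) * max(gamma - 1.0, 0.0)
        gain = xi / (1.0 + xi)
        enhanced.append(gain * p)
        prev_gain, prev_gamma = gain, gamma
    return enhanced

# toy single-bin power track: noise floor 1.0, speech burst in the middle
out = wiener_enhance([1.0, 1.0, 9.0, 9.0, 1.0], noise_power=1.0)
```

Because the gain is always in (0, 1), each output frame is an attenuated copy of its input, with noise-only frames suppressed more than speech frames.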
9. The method for recognizing Chinese-English bilingual voice of an embedded system according to claim 2, characterized in that the estimate of the a priori signal-to-noise ratio of the current frame is obtained by weighting the a priori signal-to-noise ratio of the previous frame with the a posteriori signal-to-noise ratio estimate γ_k(n) of the current frame; the computation formula is:
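The claim's own formula is not reproduced in this extract. As a hedged reference, the standard decision-directed estimator, which weighting schemes of this kind typically follow, reads (the smoothing factor $\alpha$ and the previous-frame gain $G_k(n-1)$ are assumptions here, not taken from the patent):

$$\hat{\xi}_k(n) = \alpha\, G_k^2(n-1)\,\gamma_k(n-1) + (1-\alpha)\,\max\{\gamma_k(n) - 1,\ 0\}$$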
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910242406XA CN101727901B (en) | 2009-12-10 | 2009-12-10 | Method for recognizing Chinese-English bilingual voice of embedded system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101727901A true CN101727901A (en) | 2010-06-09 |
CN101727901B CN101727901B (en) | 2011-11-09 |
Family
ID=42448692
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200910242406XA Expired - Fee Related CN101727901B (en) | 2009-12-10 | 2009-12-10 | Method for recognizing Chinese-English bilingual voice of embedded system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101727901B (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103412856A (en) * | 2013-01-14 | 2013-11-27 | 刘恒 | Portable China and foreign language translation machine |
CN104167206A (en) * | 2013-05-17 | 2014-11-26 | 佳能株式会社 | Acoustic model combination method and device, and voice identification method and system |
CN104167206B (en) * | 2013-05-17 | 2017-05-31 | 佳能株式会社 | Acoustic model merging method and equipment and audio recognition method and system |
WO2016110068A1 (en) * | 2015-01-07 | 2016-07-14 | 中兴通讯股份有限公司 | Voice switching method and apparatus for voice recognition device |
CN105161092A (en) * | 2015-09-17 | 2015-12-16 | 百度在线网络技术(北京)有限公司 | Speech recognition method and device |
CN105161092B (en) * | 2015-09-17 | 2017-03-01 | 百度在线网络技术(北京)有限公司 | A kind of audio recognition method and device |
CN106448655A (en) * | 2016-10-18 | 2017-02-22 | 江西博瑞彤芸科技有限公司 | Speech identification method |
CN106878805A (en) * | 2017-02-06 | 2017-06-20 | 广东小天才科技有限公司 | Mixed language subtitle file generation method and device |
CN108630192A (en) * | 2017-03-16 | 2018-10-09 | 清华大学 | A kind of non-methods for mandarin speech recognition, system and its building method |
CN108630192B (en) * | 2017-03-16 | 2020-06-26 | 清华大学 | non-Chinese speech recognition method, system and construction method thereof |
CN107564527A (en) * | 2017-09-01 | 2018-01-09 | 平顶山学院 | The method for recognizing Chinese-English bilingual voice of embedded system |
CN108510978A (en) * | 2018-04-18 | 2018-09-07 | 中国人民解放军62315部队 | The modeling method and system of a kind of English acoustic model applied to languages identification |
CN108510978B (en) * | 2018-04-18 | 2020-08-21 | 中国人民解放军62315部队 | English acoustic model modeling method and system applied to language identification |
CN113692616A (en) * | 2019-05-03 | 2021-11-23 | 谷歌有限责任公司 | Phoneme-based contextualization for cross-language speech recognition in an end-to-end model |
CN113692616B (en) * | 2019-05-03 | 2024-01-05 | 谷歌有限责任公司 | Phoneme-based contextualization for cross-language speech recognition in an end-to-end model |
CN110634487A (en) * | 2019-10-24 | 2019-12-31 | 科大讯飞股份有限公司 | Bilingual mixed speech recognition method, device, equipment and storage medium |
CN111816169A (en) * | 2020-07-23 | 2020-10-23 | 苏州思必驰信息科技有限公司 | Method and device for training Chinese and English hybrid speech recognition model |
CN112071307A (en) * | 2020-09-15 | 2020-12-11 | 江苏慧明智能科技有限公司 | Intelligent incomplete voice recognition method for elderly people |
WO2022105235A1 (en) * | 2020-11-18 | 2022-05-27 | 华为技术有限公司 | Information recognition method and apparatus, and storage medium |
CN112652311A (en) * | 2020-12-01 | 2021-04-13 | 北京百度网讯科技有限公司 | Chinese and English mixed speech recognition method and device, electronic equipment and storage medium |
US11893977B2 (en) | 2020-12-01 | 2024-02-06 | Beijing Baidu Netcom Science Technology Co., Ltd. | Method for recognizing Chinese-English mixed speech, electronic device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN101727901B (en) | 2011-11-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101727901A (en) | Method for recognizing Chinese-English bilingual voice of embedded system | |
US11062699B2 (en) | Speech recognition with trained GMM-HMM and LSTM models | |
CN103971685B (en) | Method and system for recognizing voice commands | |
US8930196B2 (en) | System for detecting speech interval and recognizing continuous speech in a noisy environment through real-time recognition of call commands | |
EP1557822B1 (en) | Automatic speech recognition adaptation using user corrections | |
CN101118745B (en) | Confidence degree quick acquiring method in speech identification system | |
CN101246685B (en) | Pronunciation quality evaluation method of computer auxiliary language learning system | |
CN101645271B (en) | Rapid confidence-calculation method in pronunciation quality evaluation system | |
CN103077708B (en) | Method for improving rejection capability of speech recognition system | |
WO2008024148A1 (en) | Incrementally regulated discriminative margins in mce training for speech recognition | |
CN107093422B (en) | Voice recognition method and voice recognition system | |
CN104036774A (en) | Method and system for recognizing Tibetan dialects | |
CN102122506A (en) | Method for recognizing voice | |
CN112349289B (en) | Voice recognition method, device, equipment and storage medium | |
CN112233651B (en) | Dialect type determining method, device, equipment and storage medium | |
US11705116B2 (en) | Language and grammar model adaptation using model weight data | |
CN102982799A (en) | Speech recognition optimization decoding method integrating guide probability | |
CN102693723A (en) | Method and device for recognizing speaker-independent isolated word based on subspace | |
Adell et al. | Comparative study of automatic phone segmentation methods for TTS | |
CN103474062A (en) | Voice identification method | |
CN106887226A (en) | Speech recognition algorithm based on artificial intelligence recognition | |
CN111933121B (en) | Acoustic model training method and device | |
CN112863486B (en) | Voice-based spoken language evaluation method and device and electronic equipment | |
CN107564527A (en) | The method for recognizing Chinese-English bilingual voice of embedded system | |
Li et al. | English sentence recognition based on hmm and clustering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 20181121
Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030
Patentee after: Beijing Huacong Zhijia Technology Co., Ltd.
Address before: 100084 mailbox 100084-82, Beijing City
Patentee before: Tsinghua University
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20111109 Termination date: 20201210 |