CN107564527A - The method for recognizing Chinese-English bilingual voice of embedded system - Google Patents

The method for recognizing Chinese-English bilingual voice of embedded system

Info

Publication number
CN107564527A
CN107564527A (application CN201710793500.9A)
Authority
CN
China
Prior art keywords
model
chinese
english
voice
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710793500.9A
Other languages
Chinese (zh)
Inventor
李彩霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pingdingshan University
Original Assignee
Pingdingshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pingdingshan University filed Critical Pingdingshan University
Priority to CN201710793500.9A priority Critical patent/CN107564527A/en
Publication of CN107564527A publication Critical patent/CN107564527A/en
Withdrawn legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of speech recognition, and in particular relates to a Chinese-English bilingual speech recognition method for an embedded system. The method comprises A/D sampling, pre-emphasis of the sampled speech to raise the energy of the high-frequency components, windowing and framing, extraction of speech feature parameters, and matching of voice commands against a pre-established acoustic model. The acoustic model is built by establishing an initial Chinese-English bilingual speech recognition model and then adjusting the initial model through non-native model fusion; the matching of voice commands is specifically the recognition of Chinese-English bilingual voice commands. The invention overcomes the defect that existing speech recognition systems can only recognize a single language.

Description

The method for recognizing Chinese-English bilingual voice of embedded system
Technical field
The invention belongs to the technical field of speech recognition, and in particular relates to a Chinese-English bilingual speech recognition method for an embedded system.
Background art
In recent years, dedicated speech recognition ASICs have developed rapidly abroad. Several foreign speech-technology and semiconductor companies have invested heavily in developing such chips and have patented the recognition algorithms for their national languages. The recognition performance of these dedicated (system) chips varies. A typical speech recognition process is shown in Fig. 1: the input speech signal is first A/D sampled, then spectrally shaped by windowing and pre-emphasis to boost the high-frequency components; real-time feature extraction follows, the extracted parameters being Mel-frequency cepstral coefficients (MFCC); speech recognition template training and template matching are then performed, and speech enhancement may be added to improve the robustness of the chip in noisy environments. A dedicated chip usually integrates an 8- or 16-bit MCU or a 16-bit DSP together with automatic gain control (AGC), an audio preamplifier, a low-pass filter, an analog-to-digital (A/D) converter, a digital-to-analog (D/A) converter, an audio power amplifier and read-only memory (ROM). Such dedicated speech recognition (system) chips have begun to be used in intelligent voice toys and mobile communication terminals.
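As a concrete illustration of the conventional front end just described (not of any particular chip), the following Python sketch performs pre-emphasis, Hamming-windowed framing and MFCC extraction. The 16 kHz sampling rate, 25 ms/10 ms framing and 0.97 pre-emphasis coefficient are common defaults assumed here; the patent does not specify them, and librosa is used purely for convenience.

```python
# Minimal front-end sketch: A/D-sampled speech -> pre-emphasis (boost high frequencies)
# -> Hamming-windowed frames -> MFCC feature vectors. Parameter values are assumptions.
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=13, frame_ms=25, shift_ms=10, preemph=0.97):
    y, sr = librosa.load(path, sr=16000)            # load and resample to 16 kHz
    y = np.append(y[0], y[1:] - preemph * y[:-1])   # pre-emphasis: y[n] - 0.97*y[n-1]
    frame_len = int(sr * frame_ms / 1000)           # 400 samples per 25 ms frame
    hop_len = int(sr * shift_ms / 1000)             # 160-sample (10 ms) frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop_len,
                                window="hamming")
    return mfcc.T                                   # one 13-dimensional vector per frame
```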
However, existing high-performance medium-vocabulary speech recognition ASICs can only recognize a single language: the recognition task must consist of voice commands in a single language such as Chinese, English or Japanese, and commands that mix two languages (for example Chinese-English mixtures) are not supported.
Meanwhile, as internationalization deepens, bilingual usage in everyday economic, political, cultural and scientific life is becoming more and more common, for example given names that mix Chinese and English. Speech recognition systems built only for a single language such as Chinese or English therefore increasingly fail to meet the requirements of the times. In particular, for Chinese and English, the two most widely used and most popular languages in the world, it is extremely important to build a system that can recognize mixed Chinese and English speech and to realize it on dedicated chip systems and other portable devices.
Summary of the invention
The object of the present invention is to overcome the deficiency that existing chip systems can only recognize a single language and to propose a Chinese-English bilingual speech recognition method for an embedded system. The method is an embedded Chinese-English bilingual speech recognition method based on phoneme-fusion modeling, combined with an embedded speech enhancement method.
The technical scheme of the invention is a Chinese-English bilingual speech recognition method for an embedded system, comprising A/D sampling, pre-emphasis of the sampled speech to raise the energy of the high-frequency components, windowing and framing, extraction of speech feature parameters, and matching of voice commands against a pre-established acoustic model. It is characterized in that the acoustic model is built by establishing an initial Chinese-English bilingual speech recognition model and then adjusting the initial model through non-native model fusion, and that the matching of voice commands is specifically the recognition of Chinese-English bilingual voice commands;
wherein establishing the initial Chinese-English bilingual speech recognition model comprises revising the Chinese speech recognition model, revising the English speech recognition model, merging the revised Chinese and English speech recognition models, and training the merged Chinese and English speech recognition model;
the non-native model fusion adjustment of the initial Chinese-English bilingual speech recognition model fuses the native-language and non-native models with a selectable model merging method, and applies minimum phone error discriminative training to the fused initial model to obtain the Chinese-English bilingual speech recognition model;
the recognition of Chinese-English bilingual voice commands is performed by extracting the recognition features of the input speech signal, computing the Gaussian scores of the Chinese-English bilingual speech recognition model, performing template matching against the Chinese-English bilingual vocabulary entries, and taking the entry with the highest matching score as the recognition result.
The method also comprises a speech enhancement step.
Merging the revised Chinese and English speech recognition models specifically means computing the distance between every pair of Chinese and English phonemes using a model distance computation based on state-time alignment, and then merging the pair of phonemes with the smallest distance.
Training the merged Chinese and English speech recognition model uses the maximum likelihood estimation criterion and the expectation-maximization iterative estimation algorithm to obtain the initial Chinese-English bilingual speech recognition model.
Training of the merged Chinese and English speech recognition model is completed on a PC.
Fusing the native-language and non-native models with the selectable model merging method comprises the following steps:
(11) training a native-language model M1 on a database of purely native speech;
(12) adapting model M1 with a small amount of non-native data using maximum likelihood linear regression to obtain model M2;
(13) by the selectable model merging strategy, linearly interpolating the model Sb of a given native phoneme λi in the initial Chinese-English bilingual speech recognition model with the corresponding native model Sne of phoneme λi in model M1, the adapted model Sa corresponding to λi in model M2, and the adapted model γm of the confusable phoneme γj corresponding to λi in the pronunciation dictionary obtained by the non-native confusable-phoneme substitution method, to obtain the adjusted model Sf of phoneme λi after fusion; the model interpolation formula is as follows:
P(Sf) = λ1·p(Sb) + λ2·p(Sne) + λ3·p(Sa) + λ4·p(γm)
where λ1, λ2, λ3 and λ4 are the interpolation factors of the corresponding models.
Minimum phone error discriminative training of the fused initial Chinese-English bilingual speech recognition model comprises: generating word lattices for the training speech with a speech recognizer; training Chinese and English language models from the word-level annotations of the speech training corpus; and updating the model parameters on the resulting word lattices with the forward-backward algorithm.
The speech enhancement step uses an improved Wiener filtering algorithm and comprises the following steps:
(21) using a segment of typical background noise as the initial noise estimate;
(22) performing robust noise detection with a sliding filter and a tri-state state machine: for noisy speech of different input signal-to-noise ratios, comparing the filter output with a preset threshold and deciding, according to the decision condition, whether the current frame is background noise; if so, executing step (23);
(23) estimating the a priori SNR of the current frame with the decision-directed algorithm and updating the noise estimate using information from historical frames;
(24) applying two-stage inter-frame smoothing to improve the continuity of the enhanced speech spectrum and reduce distortion of the speech signal.
The a priori SNR estimate of the current frame is obtained as a weighted combination of the previous frame's a priori SNR estimate and the current frame's a posteriori SNR γk(n), where p is a feedback factor controlling the respective contributions of the previous frame and the current frame to the current a priori SNR estimate, and a is a convergence control factor.
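The patent's exact update (with its feedback factor p and convergence factor a) is given only in a formula not reproduced in the text above. As a rough illustration of the weighting just described, the sketch below implements the classical decision-directed estimator such schemes are based on: the current a priori SNR combines the previous frame's clean-speech estimate with the floored current a posteriori SNR. All variable names and the value p = 0.98 are assumptions.

```python
# Rough sketch of a decision-directed a priori SNR estimate (classical form, used as an
# illustration of the weighting described above; not the patent's exact formula).
import numpy as np

def decision_directed_snr(noisy_power, noise_power, p=0.98):
    """noisy_power, noise_power: (n_frames, n_bins) power spectra; returns a priori SNR."""
    n_frames, n_bins = noisy_power.shape
    xi = np.zeros_like(noisy_power)
    gain_prev = np.zeros(n_bins)                     # Wiener gain of the previous frame
    for n in range(n_frames):
        gamma = noisy_power[n] / np.maximum(noise_power[n], 1e-12)   # a posteriori SNR
        inst = np.maximum(gamma - 1.0, 0.0)                          # floored instantaneous estimate
        if n == 0:
            xi[n] = inst
        else:
            prev_clean = (gain_prev ** 2) * noisy_power[n - 1]       # previous clean-speech power estimate
            xi[n] = p * prev_clean / np.maximum(noise_power[n - 1], 1e-12) + (1.0 - p) * inst
        gain_prev = xi[n] / (1.0 + xi[n])            # Wiener gain used at the next frame
    return xi
```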
Method provided by the invention, which overcomes existing chip system, can only identify the deficiency of single language, have algorithm complex It is low, identify the characteristics of sane performance is good under accuracy of identification height and noise circumstance.
Brief description of the drawings
Fig. 1 is a schematic diagram of a currently used speech recognition process;
Fig. 2 is the confusable-phoneme substitution table for Chinese speakers pronouncing English;
Fig. 3 is a schematic diagram of the time segmentation information obtained by the phoneme merging method based on state-time alignment.
Embodiment
Preferred embodiments are described in detail below with reference to the accompanying drawings. It should be emphasized that the following description is merely exemplary and is not intended to limit the scope of the invention or its application.
The Chinese-English bilingual speech recognition method for an embedded system provided by the invention comprises the following steps: A/D sampling, pre-emphasis of the sampled speech to raise the energy of the high-frequency components, windowing and framing, extraction of speech feature parameters, establishment of the initial Chinese-English bilingual speech recognition model, non-native model fusion adjustment of the initial model, and recognition of Chinese-English bilingual voice commands. A/D sampling, pre-emphasis, windowing, framing and feature extraction are existing techniques; establishing the initial Chinese-English bilingual speech recognition model, its non-native model fusion adjustment and the recognition of Chinese-English bilingual voice commands are the new techniques proposed by the invention.
Establishing the initial Chinese-English bilingual speech recognition model comprises revising the Chinese speech recognition model, revising the English speech recognition model, merging the revised Chinese and English speech recognition models, and training the merged Chinese and English speech recognition model.
Revising the Chinese and English speech recognition models means first revising the pronunciation dictionaries (i.e. the Chinese and English speech recognition models) according to the pronunciation differences that arise when Chinese speakers speak English or foreign speakers speak Chinese. There are two main approaches: expert-knowledge-based and data-driven. The invention combines the two strategies, so that versatile pronunciation substitution rules requiring little non-native pronunciation data can be obtained under expert guidance while keeping the benefits of data-driven methods; this yields good agreement with real data, little manual intervention, and good generalization. In the data-driven part, the confusable-phoneme matrix is obtained by combining the original phoneme annotations of the training data with the labels produced by the recognizer, and the final pronunciation substitution rules are then determined under the guidance of expert knowledge. Taking Chinese speakers pronouncing English as an example, Fig. 2 shows the confusable-phoneme substitution table, and the English pronunciation dictionary is revised according to the phoneme substitution rules finally determined from it.
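A toy sketch of this dictionary revision step follows, under the assumption that the substitution rules of Fig. 2 have been reduced to a simple phoneme-to-phoneme mapping (the two rules shown are placeholders, not the patent's actual table): for each dictionary entry, the accented variant produced by the rules is added alongside the native pronunciation.

```python
# Toy pronunciation-dictionary revision: apply confusable-phoneme substitution rules and
# keep both the native and the accented pronunciation. The rules below are placeholders.
CONFUSION_RULES = {"th": "s", "v": "w"}   # e.g. Chinese-accented English; illustrative only

def revise_lexicon(lexicon):
    """lexicon: dict word -> phoneme list; returns dict word -> list of pronunciations."""
    revised = {}
    for word, phones in lexicon.items():
        variant = [CONFUSION_RULES.get(p, p) for p in phones]
        revised[word] = [phones] if variant == phones else [phones, variant]
    return revised
```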
After the Chinese and English speech recognition models have been revised, the two revised models are merged into a unified model set of smaller scale. Obtaining a recognition model of smaller scale necessarily requires merging the Chinese and English recognition models; to preserve a high recognition rate, only models that are sufficiently close in the acoustic model space are merged when merging is performed. The invention measures the distance between two models with a model distance computation based on state-time alignment. Taking a Chinese phoneme λi and an English phoneme γj as an example, the distance between the two models is computed as follows: several speech segments are first prepared for each of the two phonemes from manually annotated speech; each segment of λi is then aligned, by Viterbi state-time alignment, against both its own phoneme λi and the other phoneme γj, yielding the segmentation information shown in Fig. 3, where λi and γj denote the two models before merging. As can be seen, 5 segments are obtained; the Bhattacharyya distance between the two models is computed on each segment over the corresponding time span, denoted Dmn, and the segment durations are finally used as weights to obtain a distance:
D(λi, γj) = Σq=1..5 Δtq · Dmn
Conversely, each speech segment of γj is aligned by Viterbi state-time alignment against its own phoneme γj and the other phoneme λi, and D(γj, λi) is obtained in the same way. The final distance between models λi and γj is
D = ½ (D(λi, γj) + D(γj, λi))
With the above computation, the distance between every pair of Chinese and English phonemes is obtained, and the pair of phonemes with the smallest distance is merged. This merging loop is repeated until the number of phonemes drops to the required quantity. Using the distance computation based on state-time alignment described above, 15 pairs of Chinese and English phonemes are merged in total, which significantly reduces the size of the phone set and suits the resource constraints of an embedded system.
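The following sketch illustrates the duration-weighted distance and the greedy merging loop described above. It assumes the Viterbi state-time alignment has already been performed, so that each ordered phoneme pair comes with a list of aligned Gaussian segments and their durations; this data layout, and the use of a single diagonal Gaussian per aligned state, are simplifying assumptions for illustration only.

```python
# Sketch of duration-weighted Bhattacharyya distance and greedy phoneme merging.
import numpy as np

def bhattacharyya_gauss(mu1, var1, mu2, var2):
    """Closed-form Bhattacharyya distance between two diagonal Gaussians."""
    var = 0.5 * (var1 + var2)
    return (0.125 * np.sum((mu1 - mu2) ** 2 / var)
            + 0.5 * np.sum(np.log(var) - 0.5 * (np.log(var1) + np.log(var2))))

def directed_distance(aligned_segments):
    """D(a, b) = sum over segments of duration * per-segment distance (the Dmn above)."""
    return sum(dt * bhattacharyya_gauss(mu_a, var_a, mu_b, var_b)
               for (mu_a, var_a, mu_b, var_b, dt) in aligned_segments)

def merge_phones(phones, pair_segments, target_count):
    """pair_segments[(a, b)]: aligned segments of a's speech scored against model b.
    Greedily folds the closest pair together until target_count phones remain."""
    phones = list(phones)
    while len(phones) > target_count:
        dist = {(a, b): 0.5 * (directed_distance(pair_segments[(a, b)])
                               + directed_distance(pair_segments[(b, a)]))
                for i, a in enumerate(phones) for b in phones[i + 1:]}
        a, b = min(dist, key=dist.get)     # closest Chinese/English phoneme pair
        phones.remove(b)                   # b is merged into a; model combination itself not shown
    return phones
```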
The merged Chinese and English speech recognition models are then trained with a large amount of Chinese and English speech data, using the MLE (maximum likelihood estimation) criterion and the EM (expectation-maximization) iterative estimation algorithm, to obtain the initial Chinese-English bilingual speech recognition model. The entire training process is completed on a PC.
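As one way to realize this MLE/EM training step on a PC, each merged phoneme could be modelled by a small GMM-HMM and re-estimated with Baum-Welch, an instance of EM. The sketch below uses hmmlearn purely for illustration; the toolkit, the 3-state topology and the 8 Gaussian mixtures are assumptions, not choices stated in the patent.

```python
# Sketch: maximum-likelihood (Baum-Welch / EM) training of per-phoneme GMM-HMMs.
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_phone_models(segments_by_phone, n_states=3, n_mix=8, n_iter=20):
    """segments_by_phone: dict phone -> list of (frames x dims) MFCC arrays."""
    models = {}
    for phone, segs in segments_by_phone.items():
        X = np.vstack(segs)                  # all training frames for this phoneme
        lengths = [len(s) for s in segs]     # per-utterance frame counts for EM
        hmm = GMMHMM(n_components=n_states, n_mix=n_mix,
                     covariance_type="diag", n_iter=n_iter)
        hmm.fit(X, lengths)                  # EM re-estimation under the MLE criterion
        models[phone] = hmm
    return models
```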
The non-native model fusion adjustment of the initial Chinese-English bilingual speech recognition model fuses the native-language and non-native models with a selectable model merging method, and applies minimum phone error discriminative training to the fused initial model to obtain the Chinese-English bilingual speech recognition model.
Non-native speakers often speak with a native-language accent or non-standard pronunciation, which causes the recognition system to make errors, so the initial recognition model must be adjusted with model fusion techniques. The invention fuses the native-language and non-native models with a selectable model merging method and corrects the parameters of the recognition templates, as follows:
(11) training a native-language model M1 on a database of purely native speech;
(12) adapting model M1 with a small amount of non-native data using maximum likelihood linear regression to obtain model M2;
(13) by the selectable model merging strategy, linearly interpolating the model Sb of a given native phoneme λi in the initial Chinese-English bilingual speech recognition model with the corresponding native model Sne of phoneme λi in model M1, the adapted model Sa corresponding to λi in model M2, and the adapted model γm of the confusable phoneme γj corresponding to λi in the pronunciation dictionary obtained by the non-native confusable-phoneme substitution method, to obtain the adjusted model Sf of phoneme λi after fusion. The model interpolation formula is as follows:
P(Sf) = λ1·p(Sb) + λ2·p(Sne) + λ3·p(Sa) + λ4·p(γm)
where λ1, λ2, λ3 and λ4 are the interpolation factors of the corresponding models.
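Applied at the level of per-frame likelihoods, the interpolation above can be sketched as follows; the scoring interface and the example interpolation factors are illustrative assumptions, and in practice the factors would be tuned (and sum to 1).

```python
# Sketch of the linear interpolation P(Sf) = λ1·p(Sb) + λ2·p(Sne) + λ3·p(Sa) + λ4·p(γm).
def fused_likelihood(frame, Sb, Sne, Sa, gamma_m, lambdas=(0.4, 0.3, 0.2, 0.1)):
    """Each model exposes likelihood(frame); the lambda values here are placeholders."""
    l1, l2, l3, l4 = lambdas
    return (l1 * Sb.likelihood(frame) + l2 * Sne.likelihood(frame)
            + l3 * Sa.likelihood(frame) + l4 * gamma_m.likelihood(frame))
```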
To obtain a finer model and, in particular, to further improve the recognition rate of non-native Chinese-English bilingual speech, the invention applies discriminative training to the bilingual environment for the first time. According to the MPE (Minimum Phone Error) criterion, MPE discriminative training is applied to the Chinese-English bilingual recognition model obtained above: a speech recognizer first generates word lattices for the training speech, while Chinese and English language models are trained from the word-level annotations of the speech training corpus; the model parameters are finally updated on the resulting word lattices with the forward-backward algorithm. After several iterations of parameter estimation, the model parameters are further adjusted and greater separation and discrimination is maintained between models. With the bilingual recognition model adjusted for non-native speech, the bilingual recognition rate for native speech is not reduced, while the recognition rate for non-native bilingual speech is significantly improved; the final recognition rates for both native and non-native Chinese and English exceed 98%.
Recognition of Chinese-English bilingual voice commands is performed by extracting the recognition features of the input speech signal, computing the Gaussian scores of the Chinese-English bilingual speech recognition model, performing template matching against the Chinese-English bilingual vocabulary entries, and taking the entry with the highest matching score as the recognition result. The recognition features of the input speech can be extracted with conventional speech feature extraction methods. The Gaussian scores of the bilingual model are computed from the features, template matching is performed against the bilingual entries, and the entry with the highest matching score is taken as the result. To improve recognition speed and accuracy, the recognition decision is split into a coarse recognition stage and a fine recognition stage. The coarse stage uses small models with fewer than 200 parameters and is fast; utterances that are non-standard or easily confused are then re-recognized in the fine stage, whose models have roughly 1000 parameters. Because the coarse stage leaves only a few candidates, the fine stage remains fast despite its larger models. The two-stage scheme improves both the average recognition speed and the recognition accuracy.
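The two-stage decision can be sketched as below: a coarse pass scores every vocabulary entry with the small (<200-parameter) models and keeps only a few candidates, which the fine pass re-scores with the larger (~1000-parameter) models. The model/scoring interface and the candidate-list size are assumptions for illustration.

```python
# Sketch of coarse-then-fine template matching over the Chinese-English vocabulary.
def recognise(features, entries, coarse_models, fine_models, n_candidates=5):
    """entries: Chinese/English command words; returns the best-matching entry."""
    coarse = {e: coarse_models[e].score(features) for e in entries}           # fast first pass
    candidates = sorted(coarse, key=coarse.get, reverse=True)[:n_candidates]  # keep top few
    fine = {e: fine_models[e].score(features) for e in candidates}            # detailed re-scoring
    return max(fine, key=fine.get)
```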
The Chinese-English bilingual speech recognition method provided by the invention realizes Chinese-English bilingual recognition without enlarging the model scale compared with a single-language recognition system, so its storage footprint stays small. While accommodating non-native speech, it maintains a high native-language recognition rate and achieves high non-native recognition performance; in addition, speech enhancement improves recognition accuracy in noisy environments, making the method suitable for embedded implementation of Chinese-English bilingual recognition.
The invention was tested on a real portable mobile-phone Chinese-English bilingual name-dialing system, with a recognition task comprising 500 English names and 500 Chinese names. Experiments show that, in terms of storage, the bilingual recognition method requires resources close to those of a single-language recognition system, while recognizing both Chinese and English names. While accommodating non-native speech, a high native-language recognition rate is maintained and high non-native recognition performance is achieved: the final recognition rates for both native and non-native Chinese-English bilingual speech exceed 98%. Speech enhancement additionally improves recognition accuracy in noisy environments, making the method suitable for embedded implementation of Chinese-English bilingual recognition.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that can readily occur to a person skilled in the art within the technical scope disclosed by the invention shall be included within the protection scope of the invention. Therefore, the protection scope of the invention shall be defined by the scope of the claims.

Claims (6)

1. A Chinese-English bilingual speech recognition method for an embedded system, comprising A/D sampling, pre-emphasis of the sampled speech to raise the energy of the high-frequency components, windowing and framing, and extraction of speech feature parameters, and carrying out matching and recognition of voice commands according to a pre-established acoustic model, characterized in that the acoustic model is built by establishing an initial Chinese-English bilingual speech recognition model and adjusting the initial model through non-native model fusion, and that the matching and recognition of said voice commands is specifically the recognition of Chinese-English bilingual voice commands;
wherein establishing the initial Chinese-English bilingual speech recognition model comprises revising the Chinese speech recognition model, revising the English speech recognition model, merging the revised Chinese and English speech recognition models, and training the merged Chinese and English speech recognition model;
the non-native model fusion adjustment of the initial Chinese-English bilingual speech recognition model fuses the native-language and non-native models with a selectable model merging method, and applies minimum phone error discriminative training to the fused initial model to obtain the Chinese-English bilingual speech recognition model;
wherein fusing the native-language and non-native models with the selectable model merging method comprises the following steps:
(11) training a native-language model M1 on a database of purely native speech;
(12) adapting model M1 with a small amount of non-native data using maximum likelihood linear regression to obtain model M2;
(13) by the selectable model merging strategy, linearly interpolating the model Sb of a given native phoneme λi in the initial Chinese-English bilingual speech recognition model with the corresponding native model Sne of phoneme λi in model M1, the adapted model Sa corresponding to λi in model M2, and the adapted model γm of the confusable phoneme γj corresponding to λi in the pronunciation dictionary obtained by the non-native confusable-phoneme substitution method, to obtain the adjusted model Sf of phoneme λi after fusion; the interpolation formula is as follows:
P(Sf) = λ1·p(Sb) + λ2·p(Sne) + λ3·p(Sa) + λ4·p(γm)
where λ1, λ2, λ3 and λ4 are the interpolation factors of the corresponding models;
the recognition of Chinese-English bilingual voice commands is performed by extracting the recognition features of the input speech signal, computing the Gaussian scores of the Chinese-English bilingual speech recognition model, performing template matching against the Chinese-English bilingual vocabulary entries, and taking the entry with the highest matching score as the recognition result.
2. The Chinese-English bilingual speech recognition method for an embedded system according to claim 1, characterized in that the method further comprises a speech enhancement step.
3. The Chinese-English bilingual speech recognition method for an embedded system according to claim 1 or 2, characterized in that merging the revised Chinese and English speech recognition models specifically means computing the distance between every pair of Chinese and English phonemes with a model distance computation based on state-time alignment, and then merging the pair of phonemes with the smallest distance.
4. The Chinese-English bilingual speech recognition method for an embedded system according to claim 1 or 2, characterized in that training the merged Chinese and English speech recognition model uses the maximum likelihood estimation criterion and the expectation-maximization iterative estimation algorithm to obtain the initial Chinese-English bilingual speech recognition model.
5. The Chinese-English bilingual speech recognition method for an embedded system according to claim 1 or 2, characterized in that training of the merged Chinese and English speech recognition model is completed on a PC.
6. The Chinese-English bilingual speech recognition method for an embedded system according to claim 1 or 2, characterized in that minimum phone error discriminative training of the fused initial Chinese-English bilingual speech recognition model comprises: generating word lattices for the training speech with a speech recognizer; training Chinese and English language models from the word-level annotations of the speech training corpus; and updating the model parameters on the resulting word lattices with the forward-backward algorithm.
CN201710793500.9A 2017-09-01 2017-09-01 The method for recognizing Chinese-English bilingual voice of embedded system Withdrawn CN107564527A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710793500.9A CN107564527A (en) 2017-09-01 2017-09-01 The method for recognizing Chinese-English bilingual voice of embedded system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710793500.9A CN107564527A (en) 2017-09-01 2017-09-01 The method for recognizing Chinese-English bilingual voice of embedded system

Publications (1)

Publication Number Publication Date
CN107564527A true CN107564527A (en) 2018-01-09

Family

ID=60979474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710793500.9A Withdrawn CN107564527A (en) 2017-09-01 2017-09-01 The method for recognizing Chinese-English bilingual voice of embedded system

Country Status (1)

Country Link
CN (1) CN107564527A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004347761A (en) * 2003-05-21 2004-12-09 Internatl Business Mach Corp <Ibm> Voice recognition device, voice recognition method, computer executable program and storage medium for performing the voice recognition method to computer
CN101727901A (en) * 2009-12-10 2010-06-09 清华大学 Method for recognizing Chinese-English bilingual voice of embedded system

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108877782A (en) * 2018-07-04 2018-11-23 百度在线网络技术(北京)有限公司 Audio recognition method and device

Similar Documents

Publication Publication Date Title
CN101727901B (en) Method for recognizing Chinese-English bilingual voice of embedded system
CN105161093B A kind of method and system judging speaker's number
CN105632501B (en) A kind of automatic accent classification method and device based on depth learning technology
CN111816165A (en) Voice recognition method and device and electronic equipment
CN102122506B (en) Method for recognizing voice
CN111667818B (en) Method and device for training wake-up model
CN106847259B (en) Method for screening and optimizing audio keyword template
Thangthai et al. Improving lip-reading performance for robust audiovisual speech recognition using DNNs.
WO2015124006A1 (en) Audio detection and classification method with customized function
Li et al. Large margin HMMs for speech recognition
CN106340297A (en) Speech recognition method and system based on cloud computing and confidence calculation
AU2001273410A1 (en) Method and apparatus for constructing voice templates for a speaker-independent voice recognition system
CN103177721A (en) Voice recognition method and system
CN108735200A A kind of speaker's automatic marking method
CN110189746A (en) A kind of method for recognizing speech applied to earth-space communication
CN106023986A (en) Voice identification method based on sound effect mode detection
Gupta et al. Speech feature extraction and recognition using genetic algorithm
CN106887226A (en) Speech recognition algorithm based on artificial intelligence recognition
CN107564527A (en) The method for recognizing Chinese-English bilingual voice of embedded system
CN111429919B (en) Crosstalk prevention method based on conference real recording system, electronic device and storage medium
CN114530141A (en) Chinese and English mixed offline voice keyword recognition method under specific scene and system implementation thereof
Trabelsi et al. A multi level data fusion approach for speaker identification on telephone speech
Obuchi et al. Language identification using phonetic and prosodic HMMs with feature normalization
KR100586045B1 (en) Recursive Speaker Adaptation Automation Speech Recognition System and Method using EigenVoice Speaker Adaptation
Nakamura et al. Multi-modal temporal asynchronicity modeling by product HMMs for robust audio-visual speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20180109)