CN1293428A - Information check method based on speech recognition - Google Patents

Information check method based on speech recognition

Info

Publication number
CN1293428A
CN1293428A CN00130298A
Authority
CN
China
Prior art keywords
voice
speech recognition
model
speech
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN00130298A
Other languages
Chinese (zh)
Other versions
CN1123863C (en)
Inventor
刘加
单翼翔
刘润生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN00130298A priority Critical patent/CN1123863C/en
Publication of CN1293428A publication Critical patent/CN1293428A/en
Application granted granted Critical
Publication of CN1123863C publication Critical patent/CN1123863C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An information check method based on speech recognition includes pre-training of the speech recognition models, end-point detection of the speech signal, extraction of speech recognition parameters, Viterbi speech recognition, measurement of speech recognition confidence with a rejection model, speaker-adaptive learning, and voice prompting. Its advantages are a high recognition rate and good stability. The speech recognition system can be used for information inquiry, speech command recognition and production control systems.

Description

Information check method based on speech recognition
The invention belongs to the field of speech technology, and in particular relates to a method that applies large-vocabulary speaker-independent speech recognition to information checking, inquiry and command control.
In the postal system, mailbag information is at present checked manually, which consumes a great deal of labour. The checking procedure is: (1) mailbags are first sorted according to train number or routing direction; (2) the corresponding mailbag check list is printed from a computer; (3) the information on every mailbag is compared manually against the information on the check list. The checked information includes the originating post office, the destination post office, the mailbag number and the mailbag type. Checking ensures that no bags are lost or added during transport. A lost bag is one that appears on the check list but does not actually exist; an extra bag is one that is not on the check list but actually exists. Lost and extra bags require special handling according to the check result: a lost bag must be traced and recovered, while an extra bag must be verified against its own information to decide whether it was mis-routed or merely missing from the list; a mis-routed bag is returned to the previous dispatching station. Since a major postal hub dispatches and receives millions of mailbags every day, manual checking is heavy, tiring work and is prone to error.
Speech recognition technology has gradually matured and can be used for information checking, inquiry and control in industrial systems. Abroad, some seat reservation systems, information inquiry systems and telephone service systems have already adopted speech recognition. Speech recognition provides one of the most effective and most convenient tools for human-machine interaction.
The objective of the invention is to overcome the shortcomings of the prior art by proposing an information check method based on speech recognition. Applying speech recognition to an information check system gives high efficiency, high checking accuracy and low labour intensity.
The information check method based on speech recognition proposed by the present invention comprises the following parts: end-point detection of the speech signal and extraction of speech recognition parameters; pre-training of the speaker-independent speech recognition model; speaker-independent speech recognition; a speech recognition confidence measure and rejection model; speaker-adaptive learning for speaker-independent speech recognition; generation of speech recognition entries; and voice prompting. It is characterized in that each part specifically comprises the following steps:
A. End-point detection of the speech signal and extraction of speech recognition parameters:
(1) the speech signal is sampled by the A/D converter of the computer sound card into an original digital speech signal;
(2) the original digital speech signal is spectrally shaped and divided into windowed frames, to guarantee the quasi-stationarity of each speech frame;
(3) end-point detection is performed using the short-time energy and waveform-tendency characteristics of the speech signal, and silent frames are removed to guarantee the validity of the features of every frame;
(4) speech (recognition) features are extracted from the framed and windowed speech signal;
B. Pre-training of the speaker-independent speech recognition model:
(1) a large amount of speech data is collected in advance to build a training speech database; the collected speech is of the same language category as the speech to be recognized;
(2) speech feature parameters are extracted from the speech signals in the database and converted into the parameters of the recognition model by a prior learning process on a PC; the recognition model is a phoneme-based Hidden Markov Model (HMM), and training estimates the HMM parameters (including means and variances) according to the maximum-likelihood criterion;
C. Speaker-independent speech recognition:
(1) the speech features are matched against the speech recognition models by an N-best Viterbi frame-synchronous beam search algorithm, which extracts the three best recognition candidates in real time; all useful "keyword" information is kept during the search, so no back-tracking is needed;
(2) each time a piece of voice information is to be checked, the pronunciation templates of the corresponding entries are switched in automatically, which reduces the search space and improves the recognition speed and recognition accuracy of the checking process; the language model of the recognizer is a multi-subtree trigram word-pair grammar;
D. Speech recognition confidence measure and rejection model:
the confidence measure and the rejection model are computed within the Viterbi frame-synchronous beam search; the confidence of the recognized speech determines whether the recognition result is accepted or rejected, and irrelevant speech occurring during operation is rejected at the same time;
E. Speaker-adaptive learning for speaker-independent speech recognition:
a speaker adaptation method is used to adjust the recognition model; the adaptation method is maximum a posteriori estimation, which revises the recognition template parameters step by step through iteration;
F. Generation of speech recognition entries:
according to the text of the data to be checked, the pronunciation templates of the entries to be recognized are generated automatically from a pronunciation dictionary; the input voice information is compared with these pronunciation templates by the said speaker-independent speech recognition; the pronunciation dictionary consists of the Chinese characters of the recognition vocabulary and their corresponding pinyin, and is stored in the computer in advance;
G. Voice prompting:
speech synthesis is used for voice prompting; the speech synthesis model parameters are extracted by off-line analysis on the computer and stored on the hard disk of the computer for synthesis, the synthesis model being a code-excited speech coding model; voice prompting is used to play back the recognition result: if the played-back speech is consistent with the input speech, the recognition result is correct; if not, the user is asked to read in the voice command again and it is recognized anew.
The end-point detection of the speech signal and the speech recognition parameter extraction may use a detection method combining a speech/noise maximum-likelihood decision device with a waveform-tendency decision device; the speech recognition feature parameters are MFCC (Mel-Frequency Cepstrum Coefficients), a feature vector computed according to the auditory properties of the human ear.
The pre-training of the speaker-independent speech recognition model may use a three-step progressive refinement method for training the HMMs; the model parameters comprise means, covariance matrices and Gaussian mixture weights.
The speaker-independent speech recognition may use a frame-synchronous beam search over a multi-subtree trigram word-pair grammar. All useful word-string information is kept during the search, so no back-tracking is needed and the three best recognition candidates can be extracted in real time.
The speech recognition confidence measure and rejection model may use a whole-word confidence estimation method together with an online filler model as the rejection model for irrelevant speech, which improves the robustness of the recognition model and absorbs irrelevant speech and noise.
The speaker-adaptive learning of the speaker-independent speech recognition may use an adaptation method based on maximum a posteriori probability, adjusting the speech recognition parameters iteratively so that the models retain maximum discriminability.
The generation of the speech recognition entries may use a structure based on the multi-subtree trigram word-pair grammar: the pronunciation templates of the corresponding entries are generated from the text to be checked, each entry pronunciation template being a tree-shaped template built from phonemes as the elementary units.
The present invention proposes a method of checking information by voice using large-vocabulary, speaker-independent, robust continuous speech recognition. With this method an information check software system based on speech recognition can be built, and the check system can run in real time on a computer. The software modules of the system comprise speech data sampling through the sound card, end-point detection of the speech signal and speech recognition parameter extraction, speaker-independent speech recognition, the confidence measure and rejection model, speaker adaptation, and voice prompting. The check system outputs the three best recognition candidates, and both the operating procedure and the recognition results are accompanied by voice prompts.
The present invention has the following advantages:
(1) it is a large-vocabulary speaker-independent continuous speech recognition method running on a PC, with high recognition accuracy, good robustness and low system resource consumption;
(2) for practicality, a confidence measure, a rejection model and a speaker adaptation method are added to the recognizer;
(3) the phoneme-based recognition model allows new recognition entries to be added from text easily, without retraining the recognition system;
(4) end-point detection uses features such as the short-time energy and waveform tendency of the speech signal, which improves the accuracy of end-point detection;
(5) the multi-subtree trigram word-pair grammar, combined with pruning in the frame-synchronous beam search, guarantees a very high recognition rate;
(6) robust speech recognition methods are added to the model, so the recognition parameters can be adjusted automatically against channel distortion;
(7) the information check method based on speech recognition of the present invention can be used not only for checking mailbag information but also in information check and voice inquiry systems in railways, aviation, telecommunications, medicine and other fields, making it an indispensable tool in various information systems.
Brief description of the drawings:
Fig. 1 is a schematic diagram of the speech/noise maximum-likelihood decision estimation model of the embodiment of the invention.
Fig. 2 is a schematic diagram of the immunity of the end-point detection decision device of the embodiment to different noises.
Fig. 3 shows the topology of the speech recognition HMM of the embodiment.
Fig. 4 shows the structure of the multi-subtree trigram word-pair grammar of the embodiment.
Fig. 5 shows the tree-shaped word model structure of the recognition entries of the embodiment.
Fig. 6 is the overall block diagram of the system of the embodiment.
The present invention is described in detail with an embodiment of mailbag information checking based on speech recognition. The overall method of the embodiment can be divided into: (1) A/D sampling and pre-emphasis of the sampled speech to raise the energy of the high-frequency components, followed by windowing and frame division; (2) end-point detection to determine the valid speech parameters; (3) extraction of the speech feature parameters; (4) comparison against the recognition templates by a frame-synchronous Viterbi beam search with pruning, and output of the best speech recognition result. Each step is specified as follows.
1. End-point detection:
(1) The speech signal passes through a microphone to the sound card of the computer and is sampled by a 16-bit linear A/D converter into original digital speech. The sampling frequency is 16 kHz.
(2) The original digital speech signal is spectrally shaped and divided into frames with a Hamming window, to guarantee the quasi-stationarity of each frame. The frame length is 32 ms, the frame shift is 16 ms, and the pre-emphasis filter is H(z) = 1 - 0.98 z^{-1}.
(3) The end-point detection method consists of a speech/noise maximum-likelihood decision device and a waveform-tendency decision device, which in this embodiment are described in detail as follows:
A. Speech/noise maximum-likelihood decision device:
The working principle of the maximum-likelihood decision device is shown in Fig. 1, where s(n) is the clean input speech signal, h(n) is the distortion function introduced by the channel, d(n) is the additive input noise, and y(n) is the signal actually received. The decision is made according to formula (1):
\log(\sigma_{ey}) + \frac{(e_{ty} - \mu_{ey})^2}{2\sigma_{ey}^2} < \log(\sigma_{ed}) + \frac{(e_{ty} - \mu_{ed})^2}{2\sigma_{ed}^2}    (1)
If condition (1) is satisfied, the input signal is the sum of speech and noise; otherwise the input signal is noise. Formula (1) is the speech/noise maximum-likelihood decision device, where e_{ty} is the energy of the signal y(n) and \mu_{ed} is the mean of the noise energy, which is estimated from the first few frames of the input signal and continuously updated as further noise frames arrive:
\mu_{ed} = E\left[\frac{1}{K_s}\sum_{n=1}^{K_s} d_t(n)\,d_t(n)\right] = \frac{1}{K_s}\sum_{n=1}^{K_s} E[d_t^2(n)]    (2)
Similarly to the estimation of the mean, the variance of the noise energy \sigma_{ed}^2 is estimated as:
\sigma_{ed}^2 = D\left[\frac{1}{K_s}\sum_{n=1}^{K_s} d_t(n)\,d_t(n)\right] = \frac{1}{K_s^2}\sum_{n=1}^{K_s}\left(E[d_t^4(n)] - \left(E[d_t^2(n)]\right)^2\right)    (3)
B. Waveform-tendency decision device:
In order to improve the reliability of the end-point decision, the embodiment also uses the waveform characteristics of the speech signal. The motion of the human vocal tract has inertia, so any change in the speech signal is gradual and its waveform never resembles an impulse response; mechanical sounds or channel noise, on the other hand, often resemble an impulse response or lack such a gradual build-up. Without considering the waveform characteristics, it is difficult to tell them apart. In the end-point detection method the waveform tendency is therefore combined with the maximum-likelihood decision above, giving good detection results. If the energies of three consecutive frames (e_{t-2}, e_{t-1}, e_t) satisfy formula (1), the average energy of the five frames following frame t is computed:
e_5 = (e_{t+1} + e_{t+2} + e_{t+3} + e_{t+4} + e_{t+5}) / 5    (4)
If e_5 \geq e_{t-2} + e_{t-1} + e_t, the starting point of the speech signal is considered detected; otherwise detection of the starting point continues. This detection method is called the waveform-tendency (WT, Waveform Tendency) decision device.
With these two end-point detection methods, the two main kinds of interfering noise shown in Fig. 2 can be removed effectively, where (a) is stationary noise and (b) is burst noise.
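The framing and the two end-point decision devices of this section can be pictured with the following minimal Python sketch (an illustration only, not part of the patent text). The frame length (32 ms), frame shift (16 ms), pre-emphasis filter and the decision rules of formulas (1) and (4) follow the description above; the function names and the way the noise and speech statistics are passed in are assumptions made for the example.

import numpy as np

def frame_signal(x, fs=16000, frame_ms=32, shift_ms=16, pre_emph=0.98):
    """Pre-emphasis H(z) = 1 - 0.98 z^-1, then Hamming-windowed frames of 32 ms with a 16 ms shift."""
    x = np.append(x[0], x[1:] - pre_emph * x[:-1])
    flen, fshift = fs * frame_ms // 1000, fs * shift_ms // 1000
    n_frames = 1 + max(0, (len(x) - flen) // fshift)
    win = np.hamming(flen)
    return np.stack([x[i * fshift : i * fshift + flen] * win for i in range(n_frames)])

def frame_energy(frames):
    """Short-time energy of each frame."""
    return np.sum(frames ** 2, axis=1)

def is_speech(e_t, mu_ey, var_ey, mu_ed, var_ed):
    """Formula (1): the frame is speech plus noise if its energy fits the speech model better."""
    lhs = 0.5 * np.log(var_ey) + (e_t - mu_ey) ** 2 / (2.0 * var_ey)
    rhs = 0.5 * np.log(var_ed) + (e_t - mu_ed) ** 2 / (2.0 * var_ed)
    return lhs < rhs

def detect_start_point(energies, mu_ey, var_ey, mu_ed, var_ed):
    """Combine the maximum-likelihood decision with the waveform-tendency rule of formula (4)."""
    for t in range(2, len(energies) - 5):
        three = energies[t - 2 : t + 1]
        if all(is_speech(e, mu_ey, var_ey, mu_ed, var_ed) for e in three):
            e5 = np.mean(energies[t + 1 : t + 6])     # formula (4)
            if e5 >= np.sum(three):                   # waveform-tendency check
                return t - 2                          # first frame of the detected speech
    return None

In such a sketch the noise statistics mu_ed and var_ed would be estimated from the first few frames of the input and updated as further noise frames arrive, as stated above, while mu_ey and var_ey describe the energy of frames containing speech plus noise.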
2. Speech recognition feature extraction:
(1) The short-time frequency-domain characteristics of speech describe the variation of speech accurately. Mel-frequency cepstrum coefficients (MFCC) are feature vectors computed according to the auditory properties of the human ear, built on Fourier spectral analysis.
(2) The MFCC are computed as follows: the signal spectrum is first divided into several band-pass groups according to the Mel frequency scale, the band-pass frequency responses being triangular or sinusoidal; the signal energy of each filter group is then computed, and the corresponding cepstrum coefficients are obtained by a discrete cosine transform. The MFCC features mainly reflect the static characteristics of speech; the dynamic characteristics of the speech signal can be described with the first- and second-order differences of the static feature spectrum. The dynamic and static information complement each other and considerably improve recognition performance. The complete speech feature consists of the MFCC parameters, their difference coefficients, the normalized energy coefficient and its difference coefficients.
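As an illustration of the computation just described (triangular Mel filter bank energies followed by a discrete cosine transform, plus difference coefficients), a minimal Python sketch follows; the number of filters, the FFT length and the number of cepstrum coefficients are assumed values chosen for the example, not taken from the patent.

import numpy as np

def mel_filterbank(n_filters=24, n_fft=512, fs=16000):
    """Triangular band-pass filters spaced evenly on the Mel frequency scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fbank

def mfcc(frame, fbank, n_ceps=12, n_fft=512):
    """MFCC of one windowed frame: power spectrum -> filter bank log energies -> DCT."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    log_e = np.log(fbank @ spec + 1e-10)
    n = np.arange(len(log_e))
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), 2 * n + 1) / (2.0 * len(log_e)))
    return dct @ log_e

def delta(features):
    """First-order difference over time; applying it twice gives the second-order difference."""
    padded = np.vstack([features[:1], features, features[-1:]])
    return (padded[2:] - padded[:-2]) / 2.0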
3. Training of the speaker-independent speech recognition templates:
(1) The Hidden Markov Model (HMM) is at present the most mature and effective speech recognition algorithm. A left-to-right HMM state transition model describes the characteristics of speech production well. The embodiment uses a 3-state Hidden Markov Model, whose structure is shown in Fig. 3, where q_i denotes an HMM state, a_{ij} the transition probability of the HMM, and b_j(O_t) the multi-stream Gaussian-mixture output density of the HMM state, as in formula (5):
b_j(O_t) = \prod_{s=1}^{S}\left[\sum_{m=1}^{M_s} c_{jsm}\,N(O_{st}; \mu_{jsm}, \Phi_{jsm})\right]^{\gamma_s}    (5)
where S is the number of data streams, M_s is the number of Gaussian mixture components in each data stream, and N is the multi-dimensional Gaussian distribution:
N(O; \mu, \Phi) = \frac{1}{\sqrt{(2\pi)^n |\Phi|}}\, e^{-\frac{1}{2}(O-\mu)^T \Phi^{-1} (O-\mu)}    (6)
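To make formulas (5) and (6) concrete, the following sketch (an illustration only, single data stream, diagonal covariance matrices, invented variable names) evaluates the Gaussian-mixture output probability of one HMM state in the log domain:

import numpy as np

def gaussian_logpdf_diag(o, mu, var):
    """log N(o; mu, Phi) of formula (6), with Phi taken as a diagonal covariance matrix."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (o - mu) ** 2 / var)

def state_output_logprob(o, weights, means, variances):
    """log b_j(o) of formula (5) for one stream: b_j(o) = sum_m c_jm N(o; mu_jm, Phi_jm)."""
    log_terms = np.array([np.log(w) + gaussian_logpdf_diag(o, mu, var)
                          for w, mu, var in zip(weights, means, variances)])
    m = log_terms.max()
    return m + np.log(np.sum(np.exp(log_terms - m)))   # log-sum-exp for numerical stability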
(2) The HMM is trained with a three-step progressive refinement method:
A. First, isolated-word speech data and an improved segmental K-means algorithm are used to initialize the recognition models: the internal states are segmented preliminarily, and the segmentation is then adjusted iteratively with the Viterbi algorithm (a sketch of one such pass is given after step C below); about 10 iterations are usually sufficient.
B. Each initialized model is then re-estimated with the Baum-Welch algorithm, which yields accurate HMM parameters.
C. Embedded model refinement training: a large amount of speech data and composite models formed from the speech sub-models are used for refinement training according to the label files of the training sentences; after this step refined HMM parameters are obtained.
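As a rough illustration of step A, one pass of a segmental K-means style initialization (uniform segmentation, single-Gaussian state models, Viterbi re-segmentation) could look like the Python sketch below; the single-Gaussian simplification and all names are assumptions made for the example rather than details taken from the patent.

import numpy as np

def uniform_segmentation(n_frames, n_states=3):
    """Assign the frames of one utterance evenly to the left-to-right states."""
    return np.minimum(np.arange(n_frames) * n_states // n_frames, n_states - 1)

def reestimate_states(utterances, alignments, n_states=3):
    """Re-estimate one Gaussian mean and variance per state from the current alignment."""
    dim = utterances[0].shape[1]
    means, variances = np.zeros((n_states, dim)), np.ones((n_states, dim))
    for s in range(n_states):
        frames = np.vstack([u[a == s] for u, a in zip(utterances, alignments) if np.any(a == s)])
        means[s], variances[s] = frames.mean(axis=0), frames.var(axis=0) + 1e-6
    return means, variances

def viterbi_realign(utt, means, variances):
    """Left-to-right Viterbi alignment of one utterance to the current state models."""
    T, S = len(utt), len(means)
    ll = np.array([[-0.5 * np.sum(np.log(2.0 * np.pi * variances[s])
                                  + (utt[t] - means[s]) ** 2 / variances[s])
                    for s in range(S)] for t in range(T)])
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = ll[0, 0]                      # the model must start in the first state
    for t in range(1, T):
        for s in range(S):
            lo = max(s - 1, 0)                  # only self-loops and single forward steps
            best = lo + int(np.argmax(score[t - 1, lo:s + 1]))
            score[t, s] = score[t - 1, best] + ll[t, s]
            back[t, s] = best
    path = np.zeros(T, dtype=int)
    path[-1] = S - 1                            # the model must end in the last state
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

A few iterations of re-segmentation followed by re-estimation (about 10, as stated above) give the initial models that are then refined with the Baum-Welch algorithm in steps B and C.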
4. Speaker-independent speech recognition:
(1) The embodiment uses a frame-synchronous beam search over a multi-subtree trigram word-pair grammar, whose structure is shown in Fig. 4. The first and second subtrees contain the originating and destination place names of the mailbags to be recognized, and the third subtree the mailbag numbers. The search is a breadth-first algorithm: whenever a new frame is recognized, the matching distances of all possible candidate paths are compared and sorted, a number of the best paths are kept as active paths, and the other paths are cut off before recognition of the next frame proceeds; this is the so-called "pruning". The number of active paths kept, K_{DctBeam}, depends on the hardware of the computer (storage space, computing speed, etc.) and is generally between a few tens and a few hundreds, hence the name "beam search".
(2) Combined with the multi-subtree trigram word-pair grammar model, the recognition computation of the embodiment is:
\hat{R} = \arg\max_{(A,W)}\left[\log P(O|A) + \log P(A|W)\right]
        = \arg\max_{(A,W)}\left\{\sum_{m=1}^{M}\left[\sum_{t=d_{(m-1)v}+1}^{d_{mc}}\log P(O_t|C_m) + \sum_{t=d_{mc}+1}^{d_{mv}}\log P(O_t|V_m)\right] + \sum_{i=1}^{N_w}\sum_{m=1}^{M}\left[\log P(C_m|w_i) + \log P(V_m|w_i) + \log P(T_m|w_i)\right]\right\}    (7)
where P(\cdot) is a probability, O is the speech feature vector, A is the acoustic pronunciation model, i.e. the HMM, C_m is the initial (consonant) pronunciation model, V_m is the final (vowel) pronunciation model, T_m is the tone model, W is the word sequence, M is the number of full syllables (M is 408), N_w is the number of words to be recognized, and P(A|W) is the fuzzy pronunciation model.
(3) The search routine is as follows:
A. When the speech frame number nFrameNo = 0, initialize all path structures:
1) Initialization of the consonant-class paths CactBeam: since the search is launched from the dispatching-station subtree, CactBeam is initialized according to all consonant nodes of the first layer of the dispatching-station subtree, so the number of initialized consonant-class paths CactBeamNum is the number of consonant nodes in that layer. The concrete initialization is as follows:
for (BeamNo = 0; BeamNo < CactBeamNum; BeamNo++) {
    NodeNum is set to 1;
    WordList[0] is set to the corresponding consonant half-syllable index;
    WordState[0] is set to 0, i.e. the node belongs to the dispatching-station subtree;
    CurNode is set to the index of the corresponding node in the dispatching-station subtree;
    CheckSum is set to the corresponding consonant half-syllable index;
    the initial distance Dist[0] is computed by formula (5);
    the other structure members are meaningless at this point and are set to 0, -1 or infinity (in practice a sufficiently large number).
}
2) Initialization of the vowel-class paths VactBeam: since Chinese characters have a consonant-vowel structure and the search always starts from a consonant, the structure members of VactBeam are meaningless at this point and are set to 0, -1 or infinity (in practice a sufficiently large number) according to their respective meanings. The number of initialized vowel-class paths VactBeamNum is K_VTone = 1254.
B. Before recognition of the nFrameNo-th frame of speech begins, decide according to the pruning strategy whether to change the number of active paths, i.e. the values of CactBeamNum and VactBeamNum.
C. Perform the Viterbi match of all active paths in CactBeam and VactBeam against the t-th frame of speech; disallowed state jumps within a word are excluded.
D. Use the trigram word-pair grammar to test whether the jump paths HeadTail generated from the previous frame of speech are reasonable, applying the appropriate grammatical information according to the position of the jump:
1) if the jump occurs inside a subtree, whether to jump is decided mainly by the value of the counter on the corresponding jump arc: if the counter value is greater than 0, the jump is allowed, otherwise it is not;
2) if the jump occurs between the dispatching-station subtree and the receiving-station subtree, whether to jump is decided by the relevant information in the grammar relation array OutInRelation;
3) if the jump occurs between the receiving-station subtree and the mailbag-number subtree, whether to jump is decided by the relevant information in the grammar relation array Relation.
According to this decision, if the jump is allowed, go to step E; otherwise go to step G.
E. Path jump handling:
1) the half-syllable corresponding to CurNode enters WordList;
2) if CurNode is a leaf node of one of the subtrees (the dispatching-station, receiving-station or mailbag-number subtree), the entry index of the corresponding subtree enters OutInCodeNo;
3) the cumulative matching distance Dist of the path after the jump equals the cumulative distance of the path before the jump plus the distance computed in step (3) above;
4) the other structure members of the jump path are processed accordingly and a new path is generated;
5) the modified path structure is inserted into the queue:
a) if the path is already in the path structure queue, the one with the smaller distance is kept;
b) if the path is not in the queue, whether to insert it is decided by its cumulative distance and the current number of active paths.
F. Check whether the current active path can jump to a new unit, in preparation for processing the next frame of speech. The jump condition is whether the path has reached the last state of the half-syllable; concretely, check whether Dist[STATENUM] has been updated. If the jump is possible, store the path in the jump path structure HeadTail; otherwise go to step G.
G. If nFrameNo = FRAMENUM (the total number of frames of the input speech), go to step H; otherwise increment nFrameNo and go to step B.
H. Sort the active paths VactBeam that end in a vowel and output the best paths as the recognition result; once the recognition result is confirmed, modify the corresponding grammatical information to prepare for recognizing the next speech input.
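To show the overall shape of the frame-synchronous beam search with pruning described above, a greatly simplified Python sketch follows; it keeps only the best partial paths per frame, merges duplicate paths as in step E.5), and ignores the subtree grammar, jump counters and half-syllable details of the real routine, so all names and data structures here are illustrative assumptions.

def frame_sync_beam_search(frames, entries, acoustic_score, beam_width=100):
    """
    frames:         list of feature vectors O_1 .. O_T
    entries:        dict mapping an entry name to its sequence of unit models
    acoustic_score: function (frame, unit) -> log-likelihood of the frame under the unit model
    Returns the three best-scoring completed entries, as in the embodiment.
    """
    # A path is (entry name, index of the current unit, cumulative log score).
    paths = [(name, 0, 0.0) for name in entries]                  # initialization at frame 0
    for o in frames:
        expanded = []
        for name, idx, score in paths:
            units = entries[name]
            expanded.append((name, idx, score + acoustic_score(o, units[idx])))         # stay
            if idx + 1 < len(units):                                                     # or jump on
                expanded.append((name, idx + 1, score + acoustic_score(o, units[idx + 1])))
        best = {}
        for name, idx, score in expanded:                         # keep the better of duplicate paths
            if (name, idx) not in best or score > best[(name, idx)]:
                best[(name, idx)] = score
        ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
        paths = [(name, idx, score) for (name, idx), score in ranked[:beam_width]]       # pruning
    finished = [(name, score) for name, idx, score in paths if idx == len(entries[name]) - 1]
    return sorted(finished, key=lambda p: p[1], reverse=True)[:3]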
5. Speech recognition confidence measure and rejection model:
(1) Confidence estimation plays a very important role in speech recognition. The embodiment uses a word-level likelihood-ratio confidence estimation method; the rejection model is formed from an online filler model and is used for the confidence estimation. The confidence of the recognized speech determines whether the recognition result is accepted.
(2) The useful information contained in the N candidate words is exploited: an online filler model is built during recognition, and a mean of the per-frame likelihoods of the N candidate words is used as the likelihood of the online filler model. Let the speech segment be O = {O_1, ..., O_t, ..., O_T}, let the first candidate result be model W_1, and let the n-th candidate result be the model string {W_t^n}, t = 1, 2, ..., T; the t-th frame score of the n-th candidate is S_t^n = \log(p(O_t | W_t^n)). The likelihood ratio test is then:
LLR(O) = \log P(O|W_1) - \frac{1}{N-1}\log\sum_{n=2}^{N} P(O|W_n) \approx \sum_{t=1}^{T} S_t^1 - \frac{1}{N-1}\sum_{n=2}^{N}\sum_{t=1}^{T} S_t^n = LL(O, W_1) - \frac{1}{N-1}\sum_{n=2}^{N}\sum_{t=1}^{T} LL(O_t, W_t^n)
(3) In the embodiment N is 3. With the confidence measure and rejection model, the recognition model can reject 95% of irrelevant speech, noise and other disturbances.
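A minimal sketch of the N-best likelihood-ratio confidence described in this section (with N = 3 as in the embodiment, and an acceptance threshold that is an invented placeholder) might look like this in Python:

import numpy as np

def nbest_confidence(frame_scores, threshold=0.0):
    """
    frame_scores: array of shape (N, T); row n holds the per-frame log-likelihoods S_t^(n+1)
                  of the (n+1)-th candidate, so row 0 belongs to the best candidate W_1.
    Returns (LLR, accept): the result is accepted when the likelihood ratio exceeds the threshold.
    """
    first = frame_scores[0].sum()                                  # log-likelihood of W_1
    filler = frame_scores[1:].sum() / (frame_scores.shape[0] - 1)  # online filler: mean of the others
    llr = first - filler
    return llr, llr > threshold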
6. Adaptation of the speaker's speech recognition model:
(1) The embodiment uses a method based on maximum a posteriori probability (Maximum a posteriori, MAP). Using Bayesian learning theory, the speaker-independent recognition codebook is used as prior information and combined with the information of the adapting speaker to achieve the adaptation. The MAP algorithm is based on the following criterion:
\hat{\theta}_i = \arg\max_{\theta_i} P(\theta_i | x)    (9)
where x is the training sample, \theta_i is the parameter of the i-th speech model, and \hat{\theta}_i is the Bayesian estimate of the model parameter. The advantage of the MAP algorithm is that, being based on the maximum a posteriori criterion, it is theoretically optimal.
(2) From formula (9) the re-estimation formula for the HMM mean parameters is obtained:
\bar{\mu} = \frac{\sum_{t=1}^{T}\gamma(t)\,\bar{x}_t + \tau\,\bar{m}}{\sum_{t=1}^{T}\gamma(t) + \tau}    (10)
The value of \gamma(t) is obtained from the state sequence by re-estimating the distribution of the speech feature vectors. Since theoretical estimates of the prior parameters m and \tau are difficult to obtain, the invention sets the prior parameter m to the mean vector of the speaker-independent speech recognition model, and the prior parameter \tau = 4.0.
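Formula (10) can be illustrated with the following short Python sketch of the MAP update of one Gaussian mean, using the prior settings stated above (the prior mean m taken from the speaker-independent model and tau = 4.0); the state occupation probabilities gamma(t) are assumed to come from a Viterbi or forward-backward pass.

import numpy as np

def map_update_mean(prior_mean, frames, gammas, tau=4.0):
    """
    Formula (10): mu_bar = (sum_t gamma(t) x_t + tau * m) / (sum_t gamma(t) + tau)
    prior_mean: mean vector m of the speaker-independent model (the prior)
    frames:     adaptation feature vectors x_t, shape (T, dim)
    gammas:     state occupation probabilities gamma(t), shape (T,)
    """
    numerator = (gammas[:, None] * frames).sum(axis=0) + tau * prior_mean
    denominator = gammas.sum() + tau
    return numerator / denominator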
7. Formation of the speech recognition entries:
(1) The tree-shaped word model structure under each subtree of the check entries of the embodiment is shown in Fig. 5, where each circle represents a half-syllable speech recognition unit model; complete speech recognition entries are formed by cascading the syllables. The generation procedure of the speech recognition entries is as follows:
A. read the relevant document records from the database;
B. expand the abbreviated and merged data entries in the records and compute the total number of mailbag entries;
C. count the number of occurrences of every syllable according to the syllable sets of the dispatching-station set, the receiving-station set and the mailbag-number set;
D. generate the pinyin file, code file and tree file of the dispatching-station set;
E. generate the pinyin file, code file and tree file of the receiving-station set;
F. generate the pinyin file, code file and tree file of the mailbag-number set;
G. generate the pinyin file and code file of the complete mailbag entry set;
H. collect the grammatical constraint information between the parts of the mailbag entries and store it as arrays in the grammar information file.
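The construction of the tree-shaped entry templates from a pinyin pronunciation dictionary can be pictured with the following Python sketch; the splitting of each character into an initial and a final half-syllable, the dictionary layout and all names are assumptions made for illustration, not the concrete file formats of the embodiment.

from collections import Counter

def build_entry_tree(entries, pinyin_dict):
    """
    entries:     list of place-name (or number) strings taken from the route list
    pinyin_dict: dict mapping a Chinese character to its (initial, final) half-syllable pair
    Returns a prefix tree whose nodes are half-syllables, plus syllable occurrence counts.
    """
    tree, counts = {}, Counter()
    for entry in entries:
        node = tree
        for ch in entry:
            initial, final = pinyin_dict[ch]
            counts[initial] += 1
            counts[final] += 1
            for unit in (initial, final):
                node = node.setdefault(unit, {})   # shared prefixes give the tree-shaped template
        node["#entry"] = entry                     # leaf marker carrying the complete entry
    return tree, counts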
8. Voice prompt processing:
(1) A code-excited LPC speech coding model is used; the model parameters are processed, edited and compressed on the computer in advance. The speech encoding/decoding algorithm may use the standard ITU G.723.1 method.
(2) The speech to be compressed covers more than 4000 postal place names and digit strings; the stored speech is used to play back the recognition results.
In the embodiment the above steps are compiled into software processing modules, which together constitute a mailbag information check software system based on speech recognition. The main flow of the whole system, shown in Fig. 6, comprises: (1) the mailbag route list data are first loaded into the check system; (2) the system automatically converts the route list data into the speech entry templates to be recognized; (3) speech is input through the sound card, and the speech signal is windowed, end-point detected and its speech recognition features are extracted; (4) the system decides according to the preset function: if it is in the speaker-adaptation working state, the speech recognition model is updated automatically; if it is in the information-check working state, the corresponding speech recognition is performed; (5) during recognition, the rejection model judges the confidence of the recognition result so as to guarantee the reliability of the system's recognition result; (6) the voice information is compared with the recognition entries built from the route list data stored in the mailbag information check system; correctly recognized mailbag entries are ticked off, while mis-recognized mailbags can be spoken again or marked on the corresponding entries for later processing; (7) the recognition result is played back to the user with synthesized speech, and for the user's voice commands the system completes the checking task automatically.
The embodiment is a speech-recognition-based mailbag information check system developed with the above method. Using speech recognition greatly reduces the labour intensity of the present mailbag checking process, raises labour efficiency and checking accuracy, and realizes paperless operation. The speech the embodiment can recognize is standard Mandarin and the Sichuan dialect. The recognizable mailbag information covers more than 4000 postal place names nationwide, plus digit strings. For standard Mandarin the top-1 recognition rate is 97.7% and the top-3 rate is 99.5%; for the Sichuan dialect the top-1 rate is 98% and the top-3 rate is 99.9%.

Claims (7)

1. An information check method based on speech recognition, comprising the following parts: end-point detection of the speech signal and extraction of speech recognition parameters; pre-training of the speaker-independent speech recognition model; speaker-independent speech recognition; a speech recognition confidence measure and rejection model; speaker-adaptive learning for speaker-independent speech recognition; generation of speech recognition entries; and voice prompting; the method specifically comprising the following steps:
A. End-point detection of the speech signal and extraction of speech recognition parameters: (1) the speech signal is sampled by the A/D converter of the computer sound card into an original digital speech signal; (2) the original digital speech signal is spectrally shaped and divided into windowed frames, to guarantee the quasi-stationarity of each speech frame; (3) end-point detection is performed using the short-time energy and waveform-tendency characteristics of the speech signal, and silent frames are removed to guarantee the validity of the features of every frame; (4) speech (recognition) features are extracted from the framed and windowed speech signal;
B. Pre-training of the speaker-independent speech recognition model: (1) a large amount of speech data is collected in advance to build a training speech database, the collected speech being of the same language category as the speech to be recognized; (2) speech feature parameters are extracted from the speech signals in the database and converted into the parameters of the recognition model by a prior learning process on a PC; the recognition model is a phoneme-based Hidden Markov Model (HMM), and training estimates the HMM parameters (including means and variances) according to the maximum-likelihood criterion;
C. Speaker-independent speech recognition: (1) the speech features are matched against the speech recognition models by an N-best Viterbi frame-synchronous beam search algorithm, which extracts the three best recognition candidates in real time; all useful "keyword" information is kept during the search, so no back-tracking is needed; (2) each time a piece of voice information is to be checked, the pronunciation templates of the corresponding entries are switched in automatically, which reduces the search space and improves the recognition speed and recognition accuracy of the checking process; the language model of the recognizer is a multi-subtree trigram word-pair grammar;
D. Speech recognition confidence measure and rejection model:
the confidence measure and the rejection model are computed within the Viterbi frame-synchronous beam search; the confidence of the recognized speech determines whether the recognition result is accepted or rejected, and irrelevant speech occurring during operation is rejected at the same time;
E. Speaker-adaptive learning for speaker-independent speech recognition:
a speaker adaptation method is used to adjust the recognition model; the adaptation method is maximum a posteriori estimation, which revises the recognition template parameters step by step through iteration;
F. Generation of speech recognition entries:
according to the text of the data to be checked, the pronunciation templates of the entries to be recognized are generated automatically from a pronunciation dictionary; the input voice information is compared with these pronunciation templates by the aforementioned speaker-independent speech recognition; the pronunciation dictionary consists of the Chinese characters of the recognition vocabulary and their corresponding pinyin, and is stored in the computer in advance;
G. Voice prompting:
speech synthesis is used for voice prompting; the speech synthesis model parameters are extracted by off-line analysis on the computer and stored on the hard disk of the computer for synthesis, the synthesis model being a code-excited speech coding model; voice prompting is used to play back the recognition result: if the played-back speech is consistent with the input speech, the recognition result is correct; if not, the user is asked to read in the voice command again and it is recognized anew.
2. The information check method based on speech recognition as claimed in claim 1, characterized in that the end-point detection of the speech signal and the speech recognition parameter extraction use a detection method combining a speech/noise maximum-likelihood decision device with a waveform-tendency decision device, and the speech recognition feature parameters are MFCC feature vectors computed according to the auditory properties of the human ear.
3. The information check method based on speech recognition as claimed in claim 1, characterized in that the pre-training of the speaker-independent speech recognition model uses a three-step progressive refinement method for training the HMMs, the model parameters comprising means, covariance matrices and Gaussian mixture weights.
4. The information check method based on speech recognition as claimed in claim 1, characterized in that the speaker-independent speech recognition uses a frame-synchronous beam search over a multi-subtree trigram word-pair grammar, all useful word-string information is kept during the search so that no back-tracking is needed, and the three best recognition candidates can be extracted in real time.
5. The information check method based on speech recognition as claimed in claim 1, characterized in that the confidence estimation and rejection model use a whole-word confidence estimation method and an online filler model as the rejection model for irrelevant speech, improving the robustness of the recognition model and absorbing irrelevant speech and noise.
6. The information check method based on speech recognition as claimed in claim 1, characterized in that the speaker-adaptive learning of the speaker-independent speech recognition uses an adaptation method based on maximum a posteriori probability, adjusting the speech recognition parameters iteratively so that the models retain maximum discriminability.
7. The information check method based on speech recognition as claimed in claim 1, characterized in that the speech recognition entries use a structure based on the multi-subtree trigram word-pair grammar, the corresponding entry pronunciation templates being generated from the text to be checked, each entry pronunciation template being a tree-shaped template built from phonemes as the elementary units.
CN00130298A 2000-11-10 2000-11-10 Information check method based on speech recognition Expired - Fee Related CN1123863C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN00130298A CN1123863C (en) 2000-11-10 2000-11-10 Information check method based on speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN00130298A CN1123863C (en) 2000-11-10 2000-11-10 Information check method based on speech recognition

Publications (2)

Publication Number Publication Date
CN1293428A true CN1293428A (en) 2001-05-02
CN1123863C CN1123863C (en) 2003-10-08

Family

ID=4594094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN00130298A Expired - Fee Related CN1123863C (en) 2000-11-10 2000-11-10 Information check method based on speech recognition

Country Status (1)

Country Link
CN (1) CN1123863C (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436810A (en) * 2011-10-26 2012-05-02 华南理工大学 Record replay attack detection method and system based on channel mode noise
EP3834200A4 (en) 2018-09-12 2021-08-25 Shenzhen Voxtech Co., Ltd. Signal processing device having multiple acoustic-electric transducers

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1875400B (en) * 2003-11-07 2010-04-28 佳能株式会社 Information processing apparatus, information processing method
CN100514446C (en) * 2004-09-16 2009-07-15 北京中科信利技术有限公司 Pronunciation evaluating method based on voice identification and voice analysis
CN1783213B (en) * 2004-12-01 2010-06-09 纽昂斯通讯公司 Methods and apparatus for automatic speech recognition
US9502024B2 (en) 2004-12-01 2016-11-22 Nuance Communications, Inc. Methods, apparatus and computer programs for automatic speech recognition
CN100411011C (en) * 2005-11-18 2008-08-13 清华大学 Pronunciation quality evaluating method for language learning machine
CN1787070B (en) * 2005-12-09 2011-03-16 北京凌声芯语音科技有限公司 On-chip system for language learner
CN1835076B (en) * 2006-04-07 2010-05-12 安徽中科大讯飞信息科技有限公司 Speech evaluating method of integrally operating speech identification, phonetics knowledge and Chinese dialect analysis
US8275609B2 (en) 2007-06-07 2012-09-25 Huawei Technologies Co., Ltd. Voice activity detection
WO2008148323A1 (en) * 2007-06-07 2008-12-11 Huawei Technologies Co., Ltd. A voice activity detecting device and method
CN101452701B (en) * 2007-12-05 2011-09-07 株式会社东芝 Confidence degree estimation method and device based on inverse model
WO2009097738A1 (en) * 2008-01-30 2009-08-13 Institute Of Computing Technology, Chinese Academy Of Sciences Method and system for audio matching
CN102341843B (en) * 2009-03-03 2014-01-29 三菱电机株式会社 Voice recognition device
CN102341843A (en) * 2009-03-03 2012-02-01 三菱电机株式会社 Voice recognition device
CN101894108A (en) * 2009-05-19 2010-11-24 上海易狄欧电子科技有限公司 Method and system for searching for book source on network
CN101609672B (en) * 2009-07-21 2011-09-07 北京邮电大学 Speech recognition semantic confidence feature extraction method and device
CN102034474B (en) * 2009-09-25 2012-11-07 黎自奋 Method for identifying all languages by voice and inputting individual characters by voice
CN102097096A (en) * 2009-12-10 2011-06-15 通用汽车有限责任公司 Using pitch during speech recognition post-processing to improve recognition accuracy
CN102097096B (en) * 2009-12-10 2013-01-02 通用汽车有限责任公司 Using pitch during speech recognition post-processing to improve recognition accuracy
CN102074231A (en) * 2010-12-30 2011-05-25 万音达有限公司 Voice recognition method and system
CN102254558B (en) * 2011-07-01 2012-10-03 重庆邮电大学 Control method of intelligent wheel chair voice recognition based on end point detection
CN102254558A (en) * 2011-07-01 2011-11-23 重庆邮电大学 Control method of intelligent wheel chair voice recognition based on end point detection
CN102402984A (en) * 2011-09-21 2012-04-04 哈尔滨工业大学 Cutting method for keyword checkout system on basis of confidence
CN102760436B (en) * 2012-08-09 2014-06-11 河南省烟草公司开封市公司 Voice lexicon screening method
CN102760436A (en) * 2012-08-09 2012-10-31 河南省烟草公司开封市公司 Voice lexicon screening method
CN103165130A (en) * 2013-02-06 2013-06-19 湘潭安道致胜信息科技有限公司 Voice text matching cloud system
CN103165130B (en) * 2013-02-06 2015-07-29 程戈 Speech text coupling cloud system
CN103810998A (en) * 2013-12-05 2014-05-21 中国农业大学 Method for off-line speech recognition based on mobile terminal device and achieving method
CN103810998B (en) * 2013-12-05 2016-07-06 中国农业大学 Based on the off-line audio recognition method of mobile terminal device and realize method
CN106328152B (en) * 2015-06-30 2020-01-31 芋头科技(杭州)有限公司 automatic indoor noise pollution identification and monitoring system
CN106328152A (en) * 2015-06-30 2017-01-11 芋头科技(杭州)有限公司 Automatic identification and monitoring system for indoor noise pollution
CN105095185A (en) * 2015-07-21 2015-11-25 北京旷视科技有限公司 Author analysis method and author analysis system
CN105489222B (en) * 2015-12-11 2018-03-09 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN105489222A (en) * 2015-12-11 2016-04-13 百度在线网络技术(北京)有限公司 Speech recognition method and device
WO2017096778A1 (en) * 2015-12-11 2017-06-15 百度在线网络技术(北京)有限公司 Speech recognition method and device
US10685647B2 (en) 2015-12-11 2020-06-16 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method and device
CN107204190A (en) * 2016-03-15 2017-09-26 松下知识产权经营株式会社 Misrecognition correction method, misrecognition correct device and misrecognition corrects program
CN106611597A (en) * 2016-12-02 2017-05-03 百度在线网络技术(北京)有限公司 Voice wakeup method and voice wakeup device based on artificial intelligence
CN106611597B (en) * 2016-12-02 2019-11-08 百度在线网络技术(北京)有限公司 Voice awakening method and device based on artificial intelligence
CN106601230A (en) * 2016-12-19 2017-04-26 苏州金峰物联网技术有限公司 Logistics sorting place name speech recognition method, system and logistics sorting system based on continuous Gaussian mixture HMM
CN109255106A (en) * 2017-07-13 2019-01-22 Tcl集团股份有限公司 A kind of text handling method and terminal
CN108389576A (en) * 2018-01-10 2018-08-10 苏州思必驰信息科技有限公司 The optimization method and system of compressed speech recognition modeling
CN108389576B (en) * 2018-01-10 2020-09-01 苏州思必驰信息科技有限公司 Method and system for optimizing compressed speech recognition model
CN108898753A (en) * 2018-06-20 2018-11-27 南通大学 A kind of acoustic control locker and its open method
CN112151020A (en) * 2019-06-28 2020-12-29 北京声智科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110850998A (en) * 2019-11-04 2020-02-28 北京华宇信息技术有限公司 Intelligent word-forming calculation optimization method and device for Chinese input method
CN111583907A (en) * 2020-04-15 2020-08-25 北京小米松果电子有限公司 Information processing method, device and storage medium
CN111583907B (en) * 2020-04-15 2023-08-15 北京小米松果电子有限公司 Information processing method, device and storage medium
CN111862950A (en) * 2020-08-03 2020-10-30 深圳作为科技有限公司 Interactive multifunctional elderly care robot recognition system

Also Published As

Publication number Publication date
CN1123863C (en) 2003-10-08

Similar Documents

Publication Publication Date Title
CN1123863C (en) Information check method based on speech recognition
EP1936606B1 (en) Multi-stage speech recognition
CN1150515C (en) Speech recognition device
Kawahara et al. Flexible speech understanding based on combined key-phrase detection and verification
US7162423B2 (en) Method and apparatus for generating and displaying N-Best alternatives in a speech recognition system
US7231019B2 (en) Automatic identification of telephone callers based on voice characteristics
Zheng et al. Accent detection and speech recognition for Shanghai-accented Mandarin.
CN1236423C (en) Background learning of speaker voices
Sethy et al. Refined speech segmentation for concatenative speech synthesis
Wang et al. Unsupervised spoken term detection with acoustic segment model
EP1734509A1 (en) Method and system for speech recognition
EP2842124A1 (en) Negative example (anti-word) based performance improvement for speech recognition
CN105654947B (en) Method and system for acquiring road condition information in traffic broadcast voice
CN1692405A (en) Voice processing device and method, recording medium, and program
Juneja et al. A probabilistic framework for landmark detection based on phonetic features for automatic speech recognition
US20040024599A1 (en) Audio search conducted through statistical pattern matching
CN1499484A (en) Recognition system of Chinese continuous speech
Lecouteux et al. Combined low level and high level features for out-of-vocabulary word detection
Georgescu et al. Automatic annotation of speech corpora using complementary GMM and DNN acoustic models
McDermott et al. Minimum classification error training of landmark models for real-time continuous speech recognition
Vinyals et al. Discriminative pronounciation learning using phonetic decoder and minimum-classification-error criterion
Qian et al. Tone recognition in continuous Cantonese speech using supratone models
Lamel et al. Portability issues for speech recognition technologies
Wei et al. Exploiting prosodic and lexical features for tone modeling in a conditional random field framework
Zacharie et al. Keyword spotting on word lattices

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
C19 Lapse of patent right due to non-payment of the annual fee
CF01 Termination of patent right due to non-payment of annual fee