CN1293428A - Information check method based on speech recognition - Google Patents

Information check method based on speech recognition

Info

Publication number
CN1293428A
CN1293428A CN00130298A
Authority
CN
China
Prior art keywords
voice
speech recognition
model
speech
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN00130298A
Other languages
Chinese (zh)
Other versions
CN1123863C (en)
Inventor
刘加
单翼翔
刘润生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN00130298A priority Critical patent/CN1123863C/en
Publication of CN1293428A publication Critical patent/CN1293428A/en
Application granted granted Critical
Publication of CN1123863C publication Critical patent/CN1123863C/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An information check method based on speech recognition includes pre-training of the speech recognition models, end-point detection of the speech signal, extraction of speech recognition parameters, Viterbi speech recognition, measurement of speech recognition confidence with a rejection model, speaker-adaptive learning, and voice prompting. Its advantages are a high recognition rate and good stability. The speech recognition system can be used for information inquiry, speech command recognition and production control systems.

Description

Information check method based on speech recognition
The invention belongs to the field of speech technology, and in particular relates to a method that applies large-vocabulary speaker-independent speech recognition to information checking, inquiry and command control.
In the postal system, mailbag information is at present checked manually, which consumes a great deal of labour. The checking procedure is: (1) mailbags are first sorted according to train number or routing direction; (2) the corresponding mailbag check list is printed from a computer; (3) the information on every mailbag is compared manually against the information on the check list. The checked information includes the originating post office, the destination post office, the mailbag number and the mailbag type. Checking ensures that no bags are lost or added during transport. A lost bag is one that appears on the check list but does not actually exist; an extra bag is one that is not on the check list but actually exists. Lost and extra bags require special handling according to the check result: a lost bag must be traced and recovered, while an extra bag must be verified against its own information to decide whether it was mis-routed or merely missing from the list; a mis-routed bag is returned to the previous dispatching station. Since a major postal hub dispatches and receives millions of mailbags every day, manual checking is heavy, tiring work and is prone to error.
Speech recognition technology has gradually matured and can be used for information checking, inquiry and control in industrial systems. Abroad, some seat reservation systems, information inquiry systems and telephone service systems have already adopted speech recognition. Speech recognition provides one of the most effective and most convenient tools for human-machine interaction.
The objective of the invention is to overcome the shortcomings of the prior art by proposing an information check method based on speech recognition. Applying speech recognition to an information check system gives high efficiency, high checking accuracy and low labour intensity.
The information check method based on speech recognition proposed by the present invention comprises the following parts: end-point detection of the speech signal and extraction of speech recognition parameters; pre-training of the speaker-independent speech recognition model; speaker-independent speech recognition; a speech recognition confidence measure and rejection model; speaker-adaptive learning for speaker-independent speech recognition; generation of speech recognition entries; and voice prompting. It is characterized in that each part specifically comprises the following steps:
A. End-point detection of the speech signal and extraction of speech recognition parameters:
(1) the speech signal is sampled by the A/D converter of the computer sound card into an original digital speech signal;
(2) the original digital speech signal is spectrally shaped and divided into windowed frames, to guarantee the quasi-stationarity of each speech frame;
(3) end-point detection is performed using the short-time energy and waveform-tendency characteristics of the speech signal, and silent frames are removed to guarantee the validity of the features of every frame;
(4) speech (recognition) features are extracted from the framed and windowed speech signal;
B. Pre-training of the speaker-independent speech recognition model:
(1) a large amount of speech data is collected in advance to build a training speech database; the collected speech is of the same language category as the speech to be recognized;
(2) speech feature parameters are extracted from the speech signals in the database and converted into the parameters of the recognition model by a prior learning process on a PC; the recognition model is a phoneme-based Hidden Markov Model (HMM), and training estimates the HMM parameters (including means and variances) according to the maximum-likelihood criterion;
C. Speaker-independent speech recognition:
(1) the speech features are matched against the speech recognition models by an N-best Viterbi frame-synchronous beam search algorithm, which extracts the three best recognition candidates in real time; all useful "keyword" information is kept during the search, so no back-tracking is needed;
(2) each time a piece of voice information is to be checked, the pronunciation templates of the corresponding entries are switched in automatically, which reduces the search space and improves the recognition speed and recognition accuracy of the checking process; the language model of the recognizer is a multi-subtree trigram word-pair grammar;
D. Speech recognition confidence measure and rejection model:
the confidence measure and the rejection model are computed within the Viterbi frame-synchronous beam search; the confidence of the recognized speech determines whether the recognition result is accepted or rejected, and irrelevant speech occurring during operation is rejected at the same time;
E. Speaker-adaptive learning for speaker-independent speech recognition:
a speaker adaptation method is used to adjust the recognition model; the adaptation method is maximum a posteriori estimation, which revises the recognition template parameters step by step through iteration;
F. Generation of speech recognition entries:
according to the text of the data to be checked, the pronunciation templates of the entries to be recognized are generated automatically from a pronunciation dictionary; the input voice information is compared with these pronunciation templates by the said speaker-independent speech recognition; the pronunciation dictionary consists of the Chinese characters of the recognition vocabulary and their corresponding pinyin, and is stored in the computer in advance;
G. Voice prompting:
speech synthesis is used for voice prompting; the speech synthesis model parameters are extracted by off-line analysis on the computer and stored on the hard disk of the computer for synthesis, the synthesis model being a code-excited speech coding model; voice prompting is used to play back the recognition result: if the played-back speech is consistent with the input speech, the recognition result is correct; if not, the user is asked to read in the voice command again and it is recognized anew.
The end-point detection of the speech signal and the speech recognition parameter extraction may use a detection method combining a speech/noise maximum-likelihood decision device with a waveform-tendency decision device; the speech recognition feature parameters are MFCC (Mel-Frequency Cepstrum Coefficients), a feature vector computed according to the auditory properties of the human ear.
The pre-training of the speaker-independent speech recognition model may use a three-step progressive refinement method for training the HMMs; the model parameters comprise means, covariance matrices and Gaussian mixture weights.
The speaker-independent speech recognition may use a frame-synchronous beam search over a multi-subtree trigram word-pair grammar. All useful word-string information is kept during the search, so no back-tracking is needed and the three best recognition candidates can be extracted in real time.
The speech recognition confidence measure and rejection model may use a whole-word confidence estimation method together with an online filler model as the rejection model for irrelevant speech, which improves the robustness of the recognition model and absorbs irrelevant speech and noise.
The speaker-adaptive learning of the speaker-independent speech recognition may use an adaptation method based on maximum a posteriori probability, adjusting the speech recognition parameters iteratively so that the models retain maximum discriminability.
The generation of the speech recognition entries may use a structure based on the multi-subtree trigram word-pair grammar: the pronunciation templates of the corresponding entries are generated from the text to be checked, each entry pronunciation template being a tree-shaped template built from phonemes as the elementary units.
The present invention proposes a method of checking information by voice using large-vocabulary, speaker-independent, robust continuous speech recognition. With this method an information check software system based on speech recognition can be built, and the check system can run in real time on a computer. The software modules of the system comprise speech data sampling through the sound card, end-point detection of the speech signal and speech recognition parameter extraction, speaker-independent speech recognition, the confidence measure and rejection model, speaker adaptation, and voice prompting. The check system outputs the three best recognition candidates, and both the operating procedure and the recognition results are accompanied by voice prompts.
The present invention has the following advantages:
(1) it is a large-vocabulary speaker-independent continuous speech recognition method running on a PC, with high recognition accuracy, good robustness and low system resource consumption;
(2) for practicality, a confidence measure, a rejection model and a speaker adaptation method are added to the recognizer;
(3) the phoneme-based recognition model allows new recognition entries to be added from text easily, without retraining the recognition system;
(4) end-point detection uses features such as the short-time energy and waveform tendency of the speech signal, which improves the accuracy of end-point detection;
(5) the multi-subtree trigram word-pair grammar, combined with pruning in the frame-synchronous beam search, guarantees a very high recognition rate;
(6) robust speech recognition methods are added to the model, so the recognition parameters can be adjusted automatically against channel distortion;
(7) the information check method based on speech recognition of the present invention can be used not only for checking mailbag information but also in information check and voice inquiry systems in railways, aviation, telecommunications, medicine and other fields, making it an indispensable tool in various information systems.
Brief description of the drawings:
Fig. 1 is a schematic diagram of the speech/noise maximum-likelihood decision estimation model of the embodiment of the invention.
Fig. 2 is a schematic diagram of the immunity of the end-point detection decision device of the embodiment to different noises.
Fig. 3 shows the topology of the speech recognition HMM of the embodiment.
Fig. 4 shows the structure of the multi-subtree trigram word-pair grammar of the embodiment.
Fig. 5 shows the tree-shaped word model structure of the recognition entries of the embodiment.
Fig. 6 is the overall block diagram of the system of the embodiment.
The present invention is described in detail with an embodiment of mailbag information checking based on speech recognition. The overall method of the embodiment can be divided into: (1) A/D sampling and pre-emphasis of the sampled speech to raise the energy of the high-frequency components, followed by windowing and frame division; (2) end-point detection to determine the valid speech parameters; (3) extraction of the speech feature parameters; (4) comparison against the recognition templates by a frame-synchronous Viterbi beam search with pruning, and output of the best speech recognition result. Each step is specified as follows.
1. End-point detection:
(1) The speech signal passes through a microphone to the sound card of the computer and is sampled by a 16-bit linear A/D converter into original digital speech. The sampling frequency is 16 kHz.
(2) The original digital speech signal is spectrally shaped and divided into frames with a Hamming window, to guarantee the quasi-stationarity of each frame. The frame length is 32 ms, the frame shift is 16 ms, and the pre-emphasis filter is H(z) = 1 - 0.98 z^{-1}.
(3) The end-point detection method consists of a speech/noise maximum-likelihood decision device and a waveform-tendency decision device, which in this embodiment are described in detail as follows:
A. Speech/noise maximum-likelihood decision device:
The working principle of the maximum-likelihood decision device is shown in Fig. 1, where s(n) is the clean input speech signal, h(n) is the distortion function introduced by the channel, d(n) is the additive input noise, and y(n) is the signal actually received. The decision is made according to formula (1):
\log(\sigma_{ey}) + \frac{(e_{ty} - \mu_{ey})^2}{2\sigma_{ey}^2} < \log(\sigma_{ed}) + \frac{(e_{ty} - \mu_{ed})^2}{2\sigma_{ed}^2}    (1)
If condition (1) is satisfied, the input signal is the sum of speech and noise; otherwise the input signal is noise. Formula (1) is the speech/noise maximum-likelihood decision device, where e_{ty} is the energy of the signal y(n) and \mu_{ed} is the mean of the noise energy, which is estimated from the first few frames of the input signal and continuously updated as further noise frames arrive:
\mu_{ed} = E\left[\frac{1}{K_s}\sum_{n=1}^{K_s} d_t(n)\,d_t(n)\right] = \frac{1}{K_s}\sum_{n=1}^{K_s} E[d_t^2(n)]    (2)
Similarly to the estimation of the mean, the variance of the noise energy \sigma_{ed}^2 is estimated as:
\sigma_{ed}^2 = D\left[\frac{1}{K_s}\sum_{n=1}^{K_s} d_t(n)\,d_t(n)\right] = \frac{1}{K_s^2}\sum_{n=1}^{K_s}\left(E[d_t^4(n)] - \left(E[d_t^2(n)]\right)^2\right)    (3)
B. Waveform-tendency decision device:
In order to improve the reliability of the end-point decision, the embodiment also uses the waveform characteristics of the speech signal. The motion of the human vocal tract has inertia, so any change in the speech signal is gradual and its waveform never resembles an impulse response; mechanical sounds or channel noise, on the other hand, often resemble an impulse response or lack such a gradual build-up. Without considering the waveform characteristics, it is difficult to tell them apart. In the end-point detection method the waveform tendency is therefore combined with the maximum-likelihood decision above, giving good detection results. If the energies of three consecutive frames (e_{t-2}, e_{t-1}, e_t) satisfy formula (1), the average energy of the five frames following frame t is computed:
e_5 = (e_{t+1} + e_{t+2} + e_{t+3} + e_{t+4} + e_{t+5}) / 5    (4)
If e_5 \geq e_{t-2} + e_{t-1} + e_t, the starting point of the speech signal is considered detected; otherwise detection of the starting point continues. This detection method is called the waveform-tendency (WT, Waveform Tendency) decision device.
With these two end-point detection methods, the two main kinds of interfering noise shown in Fig. 2 can be removed effectively, where (a) is stationary noise and (b) is burst noise.
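The framing and the two end-point decision devices of this section can be pictured with the following minimal Python sketch (an illustration only, not part of the patent text). The frame length (32 ms), frame shift (16 ms), pre-emphasis filter and the decision rules of formulas (1) and (4) follow the description above; the function names and the way the noise and speech statistics are passed in are assumptions made for the example.

import numpy as np

def frame_signal(x, fs=16000, frame_ms=32, shift_ms=16, pre_emph=0.98):
    """Pre-emphasis H(z) = 1 - 0.98 z^-1, then Hamming-windowed frames of 32 ms with a 16 ms shift."""
    x = np.append(x[0], x[1:] - pre_emph * x[:-1])
    flen, fshift = fs * frame_ms // 1000, fs * shift_ms // 1000
    n_frames = 1 + max(0, (len(x) - flen) // fshift)
    win = np.hamming(flen)
    return np.stack([x[i * fshift : i * fshift + flen] * win for i in range(n_frames)])

def frame_energy(frames):
    """Short-time energy of each frame."""
    return np.sum(frames ** 2, axis=1)

def is_speech(e_t, mu_ey, var_ey, mu_ed, var_ed):
    """Formula (1): the frame is speech plus noise if its energy fits the speech model better."""
    lhs = 0.5 * np.log(var_ey) + (e_t - mu_ey) ** 2 / (2.0 * var_ey)
    rhs = 0.5 * np.log(var_ed) + (e_t - mu_ed) ** 2 / (2.0 * var_ed)
    return lhs < rhs

def detect_start_point(energies, mu_ey, var_ey, mu_ed, var_ed):
    """Combine the maximum-likelihood decision with the waveform-tendency rule of formula (4)."""
    for t in range(2, len(energies) - 5):
        three = energies[t - 2 : t + 1]
        if all(is_speech(e, mu_ey, var_ey, mu_ed, var_ed) for e in three):
            e5 = np.mean(energies[t + 1 : t + 6])     # formula (4)
            if e5 >= np.sum(three):                   # waveform-tendency check
                return t - 2                          # first frame of the detected speech
    return None

In such a sketch the noise statistics mu_ed and var_ed would be estimated from the first few frames of the input and updated as further noise frames arrive, as stated above, while mu_ey and var_ey describe the energy of frames containing speech plus noise.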
2. Speech recognition feature extraction:
(1) The short-time frequency-domain characteristics of speech describe the variation of speech accurately. Mel-frequency cepstrum coefficients (MFCC) are feature vectors computed according to the auditory properties of the human ear, built on Fourier spectral analysis.
(2) The MFCC are computed as follows: the signal spectrum is first divided into several band-pass groups according to the Mel frequency scale, the band-pass frequency responses being triangular or sinusoidal; the signal energy of each filter group is then computed, and the corresponding cepstrum coefficients are obtained by a discrete cosine transform. The MFCC features mainly reflect the static characteristics of speech; the dynamic characteristics of the speech signal can be described with the first- and second-order differences of the static feature spectrum. The dynamic and static information complement each other and considerably improve recognition performance. The complete speech feature consists of the MFCC parameters, their difference coefficients, the normalized energy coefficient and its difference coefficients.
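As an illustration of the computation just described (triangular Mel filter bank energies followed by a discrete cosine transform, plus difference coefficients), a minimal Python sketch follows; the number of filters, the FFT length and the number of cepstrum coefficients are assumed values chosen for the example, not taken from the patent.

import numpy as np

def mel_filterbank(n_filters=24, n_fft=512, fs=16000):
    """Triangular band-pass filters spaced evenly on the Mel frequency scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fbank

def mfcc(frame, fbank, n_ceps=12, n_fft=512):
    """MFCC of one windowed frame: power spectrum -> filter bank log energies -> DCT."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    log_e = np.log(fbank @ spec + 1e-10)
    n = np.arange(len(log_e))
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), 2 * n + 1) / (2.0 * len(log_e)))
    return dct @ log_e

def delta(features):
    """First-order difference over time; applying it twice gives the second-order difference."""
    padded = np.vstack([features[:1], features, features[-1:]])
    return (padded[2:] - padded[:-2]) / 2.0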
3. Training of the speaker-independent speech recognition templates:
(1) The Hidden Markov Model (HMM) is at present the most mature and effective speech recognition algorithm. A left-to-right HMM state transition model describes the characteristics of speech production well. The embodiment uses a 3-state Hidden Markov Model, whose structure is shown in Fig. 3, where q_i denotes an HMM state, a_{ij} the transition probability of the HMM, and b_j(O_t) the multi-stream Gaussian-mixture output density of the HMM state, as in formula (5):
b_j(O_t) = \prod_{s=1}^{S}\left[\sum_{m=1}^{M_s} c_{jsm}\,N(O_{st}; \mu_{jsm}, \Phi_{jsm})\right]^{\gamma_s}    (5)
where S is the number of data streams, M_s is the number of Gaussian mixture components in each data stream, and N is the multi-dimensional Gaussian distribution:
N(O; \mu, \Phi) = \frac{1}{\sqrt{(2\pi)^n |\Phi|}}\, e^{-\frac{1}{2}(O-\mu)^T \Phi^{-1} (O-\mu)}    (6)
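To make formulas (5) and (6) concrete, the following sketch (an illustration only, single data stream, diagonal covariance matrices, invented variable names) evaluates the Gaussian-mixture output probability of one HMM state in the log domain:

import numpy as np

def gaussian_logpdf_diag(o, mu, var):
    """log N(o; mu, Phi) of formula (6), with Phi taken as a diagonal covariance matrix."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (o - mu) ** 2 / var)

def state_output_logprob(o, weights, means, variances):
    """log b_j(o) of formula (5) for one stream: b_j(o) = sum_m c_jm N(o; mu_jm, Phi_jm)."""
    log_terms = np.array([np.log(w) + gaussian_logpdf_diag(o, mu, var)
                          for w, mu, var in zip(weights, means, variances)])
    m = log_terms.max()
    return m + np.log(np.sum(np.exp(log_terms - m)))   # log-sum-exp for numerical stability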
(2) The HMM is trained with a three-step progressive refinement method:
A. First, isolated-word speech data and an improved segmental K-means algorithm are used to initialize the recognition models: the internal states are segmented preliminarily, and the segmentation is then adjusted iteratively with the Viterbi algorithm (a sketch of one such pass is given after step C below); about 10 iterations are usually sufficient.
B. Each initialized model is then re-estimated with the Baum-Welch algorithm, which yields accurate HMM parameters.
C. Embedded model refinement training: a large amount of speech data and composite models formed from the speech sub-models are used for refinement training according to the label files of the training sentences; after this step refined HMM parameters are obtained.
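As a rough illustration of step A, one pass of a segmental K-means style initialization (uniform segmentation, single-Gaussian state models, Viterbi re-segmentation) could look like the Python sketch below; the single-Gaussian simplification and all names are assumptions made for the example rather than details taken from the patent.

import numpy as np

def uniform_segmentation(n_frames, n_states=3):
    """Assign the frames of one utterance evenly to the left-to-right states."""
    return np.minimum(np.arange(n_frames) * n_states // n_frames, n_states - 1)

def reestimate_states(utterances, alignments, n_states=3):
    """Re-estimate one Gaussian mean and variance per state from the current alignment."""
    dim = utterances[0].shape[1]
    means, variances = np.zeros((n_states, dim)), np.ones((n_states, dim))
    for s in range(n_states):
        frames = np.vstack([u[a == s] for u, a in zip(utterances, alignments) if np.any(a == s)])
        means[s], variances[s] = frames.mean(axis=0), frames.var(axis=0) + 1e-6
    return means, variances

def viterbi_realign(utt, means, variances):
    """Left-to-right Viterbi alignment of one utterance to the current state models."""
    T, S = len(utt), len(means)
    ll = np.array([[-0.5 * np.sum(np.log(2.0 * np.pi * variances[s])
                                  + (utt[t] - means[s]) ** 2 / variances[s])
                    for s in range(S)] for t in range(T)])
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = ll[0, 0]                      # the model must start in the first state
    for t in range(1, T):
        for s in range(S):
            lo = max(s - 1, 0)                  # only self-loops and single forward steps
            best = lo + int(np.argmax(score[t - 1, lo:s + 1]))
            score[t, s] = score[t - 1, best] + ll[t, s]
            back[t, s] = best
    path = np.zeros(T, dtype=int)
    path[-1] = S - 1                            # the model must end in the last state
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

A few iterations of re-segmentation followed by re-estimation (about 10, as stated above) give the initial models that are then refined with the Baum-Welch algorithm in steps B and C.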
4. Speaker-independent speech recognition:
(1) The embodiment uses a frame-synchronous beam search over a multi-subtree trigram word-pair grammar, whose structure is shown in Fig. 4. The first and second subtrees contain the originating and destination place names of the mailbags to be recognized, and the third subtree the mailbag numbers. The search is a breadth-first algorithm: whenever a new frame is recognized, the matching distances of all possible candidate paths are compared and sorted, a number of the best paths are kept as active paths, and the other paths are cut off before recognition of the next frame proceeds; this is the so-called "pruning". The number of active paths kept, K_{DctBeam}, depends on the hardware of the computer (storage space, computing speed, etc.) and is generally between a few tens and a few hundreds, hence the name "beam search".
(2) Combined with the multi-subtree trigram word-pair grammar model, the recognition computation of the embodiment is:
\hat{R} = \arg\max_{(A,W)}\left[\log P(O|A) + \log P(A|W)\right]
        = \arg\max_{(A,W)}\left\{\sum_{m=1}^{M}\left[\sum_{t=d_{(m-1)v}+1}^{d_{mc}}\log P(O_t|C_m) + \sum_{t=d_{mc}+1}^{d_{mv}}\log P(O_t|V_m)\right] + \sum_{i=1}^{N_w}\sum_{m=1}^{M}\left[\log P(C_m|w_i) + \log P(V_m|w_i) + \log P(T_m|w_i)\right]\right\}    (7)
where P(\cdot) is a probability, O is the speech feature vector, A is the acoustic pronunciation model, i.e. the HMM, C_m is the initial (consonant) pronunciation model, V_m is the final (vowel) pronunciation model, T_m is the tone model, W is the word sequence, M is the number of full syllables (M is 408), N_w is the number of words to be recognized, and P(A|W) is the fuzzy pronunciation model.
(3) The search routine is as follows:
A. When the speech frame number nFrameNo = 0, initialize all path structures:
1) Initialization of the consonant-class paths CactBeam: since the search is launched from the dispatching-station subtree, CactBeam is initialized according to all consonant nodes of the first layer of the dispatching-station subtree, so the number of initialized consonant-class paths CactBeamNum is the number of consonant nodes in that layer. The concrete initialization is as follows:
for (BeamNo = 0; BeamNo < CactBeamNum; BeamNo++) {
    NodeNum is set to 1;
    WordList[0] is set to the corresponding consonant half-syllable index;
    WordState[0] is set to 0, i.e. the node belongs to the dispatching-station subtree;
    CurNode is set to the index of the corresponding node in the dispatching-station subtree;
    CheckSum is set to the corresponding consonant half-syllable index;
    the initial distance Dist[0] is computed by formula (5);
    the other structure members are meaningless at this point and are set to 0, -1 or infinity (in practice a sufficiently large number).
}
2) Initialization of the vowel-class paths VactBeam: since Chinese characters have a consonant-vowel structure and the search always starts from a consonant, the structure members of VactBeam are meaningless at this point and are set to 0, -1 or infinity (in practice a sufficiently large number) according to their respective meanings. The number of initialized vowel-class paths VactBeamNum is K_VTone = 1254.
B. Before recognition of the nFrameNo-th frame of speech begins, decide according to the pruning strategy whether to change the number of active paths, i.e. the values of CactBeamNum and VactBeamNum.
C. Perform the Viterbi match of all active paths in CactBeam and VactBeam against the t-th frame of speech; disallowed state jumps within a word are excluded.
D. Use the trigram word-pair grammar to test whether the jump paths HeadTail generated from the previous frame of speech are reasonable, applying the appropriate grammatical information according to the position of the jump:
1) if the jump occurs inside a subtree, whether to jump is decided mainly by the value of the counter on the corresponding jump arc: if the counter value is greater than 0, the jump is allowed, otherwise it is not;
2) if the jump occurs between the dispatching-station subtree and the receiving-station subtree, whether to jump is decided by the relevant information in the grammar relation array OutInRelation;
3) if the jump occurs between the receiving-station subtree and the mailbag-number subtree, whether to jump is decided by the relevant information in the grammar relation array Relation.
According to this decision, if the jump is allowed, go to step E; otherwise go to step G.
E. Path jump handling:
1) the half-syllable corresponding to CurNode enters WordList;
2) if CurNode is a leaf node of one of the subtrees (the dispatching-station, receiving-station or mailbag-number subtree), the entry index of the corresponding subtree enters OutInCodeNo;
3) the cumulative matching distance Dist of the path after the jump equals the cumulative distance of the path before the jump plus the distance computed in step (3) above;
4) the other structure members of the jump path are processed accordingly and a new path is generated;
5) the modified path structure is inserted into the queue:
a) if the path is already in the path structure queue, the one with the smaller distance is kept;
b) if the path is not in the queue, whether to insert it is decided by its cumulative distance and the current number of active paths.
F. Check whether the current active path can jump to a new unit, in preparation for processing the next frame of speech. The jump condition is whether the path has reached the last state of the half-syllable; concretely, check whether Dist[STATENUM] has been updated. If the jump is possible, store the path in the jump path structure HeadTail; otherwise go to step G.
G. If nFrameNo = FRAMENUM (the total number of frames of the input speech), go to step H; otherwise increment nFrameNo and go to step B.
H. Sort the active paths VactBeam that end in a vowel and output the best paths as the recognition result; once the recognition result is confirmed, modify the corresponding grammatical information to prepare for recognizing the next speech input.
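To show the overall shape of the frame-synchronous beam search with pruning described above, a greatly simplified Python sketch follows; it keeps only the best partial paths per frame, merges duplicate paths as in step E.5), and ignores the subtree grammar, jump counters and half-syllable details of the real routine, so all names and data structures here are illustrative assumptions.

def frame_sync_beam_search(frames, entries, acoustic_score, beam_width=100):
    """
    frames:         list of feature vectors O_1 .. O_T
    entries:        dict mapping an entry name to its sequence of unit models
    acoustic_score: function (frame, unit) -> log-likelihood of the frame under the unit model
    Returns the three best-scoring completed entries, as in the embodiment.
    """
    # A path is (entry name, index of the current unit, cumulative log score).
    paths = [(name, 0, 0.0) for name in entries]                  # initialization at frame 0
    for o in frames:
        expanded = []
        for name, idx, score in paths:
            units = entries[name]
            expanded.append((name, idx, score + acoustic_score(o, units[idx])))         # stay
            if idx + 1 < len(units):                                                     # or jump on
                expanded.append((name, idx + 1, score + acoustic_score(o, units[idx + 1])))
        best = {}
        for name, idx, score in expanded:                         # keep the better of duplicate paths
            if (name, idx) not in best or score > best[(name, idx)]:
                best[(name, idx)] = score
        ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
        paths = [(name, idx, score) for (name, idx), score in ranked[:beam_width]]       # pruning
    finished = [(name, score) for name, idx, score in paths if idx == len(entries[name]) - 1]
    return sorted(finished, key=lambda p: p[1], reverse=True)[:3]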
5. Speech recognition confidence measure and rejection model:
(1) Confidence estimation plays a very important role in speech recognition. The embodiment uses a word-level likelihood-ratio confidence estimation method; the rejection model is formed from an online filler model and is used for the confidence estimation. The confidence of the recognized speech determines whether the recognition result is accepted.
(2) The useful information contained in the N candidate words is exploited: an online filler model is built during recognition, and a mean of the per-frame likelihoods of the N candidate words is used as the likelihood of the online filler model. Let the speech segment be O = {O_1, ..., O_t, ..., O_T}, let the first candidate result be model W_1, and let the n-th candidate result be the model string {W_t^n}, t = 1, 2, ..., T; the t-th frame score of the n-th candidate is S_t^n = \log(p(O_t | W_t^n)). The likelihood ratio test is then:
LLR(O) = \log P(O|W_1) - \frac{1}{N-1}\log\sum_{n=2}^{N} P(O|W_n) \approx \sum_{t=1}^{T} S_t^1 - \frac{1}{N-1}\sum_{n=2}^{N}\sum_{t=1}^{T} S_t^n = LL(O, W_1) - \frac{1}{N-1}\sum_{n=2}^{N}\sum_{t=1}^{T} LL(O_t, W_t^n)
(3) In the embodiment N is 3. With the confidence measure and rejection model, the recognition model can reject 95% of irrelevant speech, noise and other disturbances.
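A minimal sketch of the N-best likelihood-ratio confidence described in this section (with N = 3 as in the embodiment, and an acceptance threshold that is an invented placeholder) might look like this in Python:

import numpy as np

def nbest_confidence(frame_scores, threshold=0.0):
    """
    frame_scores: array of shape (N, T); row n holds the per-frame log-likelihoods S_t^(n+1)
                  of the (n+1)-th candidate, so row 0 belongs to the best candidate W_1.
    Returns (LLR, accept): the result is accepted when the likelihood ratio exceeds the threshold.
    """
    first = frame_scores[0].sum()                                  # log-likelihood of W_1
    filler = frame_scores[1:].sum() / (frame_scores.shape[0] - 1)  # online filler: mean of the others
    llr = first - filler
    return llr, llr > threshold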
6. Adaptation of the speaker's speech recognition model:
(1) The embodiment uses a method based on maximum a posteriori probability (Maximum a posteriori, MAP). Using Bayesian learning theory, the speaker-independent recognition codebook is used as prior information and combined with the information of the adapting speaker to achieve the adaptation. The MAP algorithm is based on the following criterion:
\hat{\theta}_i = \arg\max_{\theta_i} P(\theta_i | x)    (9)
where x is the training sample, \theta_i is the parameter of the i-th speech model, and \hat{\theta}_i is the Bayesian estimate of the model parameter. The advantage of the MAP algorithm is that, being based on the maximum a posteriori criterion, it is theoretically optimal.
(2) From formula (9) the re-estimation formula for the HMM mean parameters is obtained:
\bar{\mu} = \frac{\sum_{t=1}^{T}\gamma(t)\,\bar{x}_t + \tau\,\bar{m}}{\sum_{t=1}^{T}\gamma(t) + \tau}    (10)
The value of \gamma(t) is obtained from the state sequence by re-estimating the distribution of the speech feature vectors. Since theoretical estimates of the prior parameters m and \tau are difficult to obtain, the invention sets the prior parameter m to the mean vector of the speaker-independent speech recognition model, and the prior parameter \tau = 4.0.
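Formula (10) can be illustrated with the following short Python sketch of the MAP update of one Gaussian mean, using the prior settings stated above (the prior mean m taken from the speaker-independent model and tau = 4.0); the state occupation probabilities gamma(t) are assumed to come from a Viterbi or forward-backward pass.

import numpy as np

def map_update_mean(prior_mean, frames, gammas, tau=4.0):
    """
    Formula (10): mu_bar = (sum_t gamma(t) x_t + tau * m) / (sum_t gamma(t) + tau)
    prior_mean: mean vector m of the speaker-independent model (the prior)
    frames:     adaptation feature vectors x_t, shape (T, dim)
    gammas:     state occupation probabilities gamma(t), shape (T,)
    """
    numerator = (gammas[:, None] * frames).sum(axis=0) + tau * prior_mean
    denominator = gammas.sum() + tau
    return numerator / denominator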
7. Formation of the speech recognition entries:
(1) The tree-shaped word model structure under each subtree of the check entries of the embodiment is shown in Fig. 5, where each circle represents a half-syllable speech recognition unit model; complete speech recognition entries are formed by cascading the syllables. The generation procedure of the speech recognition entries is as follows:
A. read the relevant document records from the database;
B. expand the abbreviated and merged data entries in the records and compute the total number of mailbag entries;
C. count the number of occurrences of every syllable according to the syllable sets of the dispatching-station set, the receiving-station set and the mailbag-number set;
D. generate the pinyin file, code file and tree file of the dispatching-station set;
E. generate the pinyin file, code file and tree file of the receiving-station set;
F. generate the pinyin file, code file and tree file of the mailbag-number set;
G. generate the pinyin file and code file of the complete mailbag entry set;
H. collect the grammatical constraint information between the parts of the mailbag entries and store it as arrays in the grammar information file.
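The construction of the tree-shaped entry templates from a pinyin pronunciation dictionary can be pictured with the following Python sketch; the splitting of each character into an initial and a final half-syllable, the dictionary layout and all names are assumptions made for illustration, not the concrete file formats of the embodiment.

from collections import Counter

def build_entry_tree(entries, pinyin_dict):
    """
    entries:     list of place-name (or number) strings taken from the route list
    pinyin_dict: dict mapping a Chinese character to its (initial, final) half-syllable pair
    Returns a prefix tree whose nodes are half-syllables, plus syllable occurrence counts.
    """
    tree, counts = {}, Counter()
    for entry in entries:
        node = tree
        for ch in entry:
            initial, final = pinyin_dict[ch]
            counts[initial] += 1
            counts[final] += 1
            for unit in (initial, final):
                node = node.setdefault(unit, {})   # shared prefixes give the tree-shaped template
        node["#entry"] = entry                     # leaf marker carrying the complete entry
    return tree, counts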
8. Voice prompt processing:
(1) A code-excited LPC speech coding model is used; the model parameters are processed, edited and compressed on the computer in advance. The speech encoding/decoding algorithm may use the standard ITU G.723.1 method.
(2) The speech to be compressed covers more than 4000 postal place names and digit strings; the stored speech is used to play back the recognition results.
In the embodiment the above steps are compiled into software processing modules, which together constitute a mailbag information check software system based on speech recognition. The main flow of the whole system, shown in Fig. 6, comprises: (1) the mailbag route list data are first loaded into the check system; (2) the system automatically converts the route list data into the speech entry templates to be recognized; (3) speech is input through the sound card, and the speech signal is windowed, end-point detected and its speech recognition features are extracted; (4) the system decides according to the preset function: if it is in the speaker-adaptation working state, the speech recognition model is updated automatically; if it is in the information-check working state, the corresponding speech recognition is performed; (5) during recognition, the rejection model judges the confidence of the recognition result so as to guarantee the reliability of the system's recognition result; (6) the voice information is compared with the recognition entries built from the route list data stored in the mailbag information check system; correctly recognized mailbag entries are ticked off, while mis-recognized mailbags can be spoken again or marked on the corresponding entries for later processing; (7) the recognition result is played back to the user with synthesized speech, and for the user's voice commands the system completes the checking task automatically.
The embodiment is a speech-recognition-based mailbag information check system developed with the above method. Using speech recognition greatly reduces the labour intensity of the present mailbag checking process, raises labour efficiency and checking accuracy, and realizes paperless operation. The speech the embodiment can recognize is standard Mandarin and the Sichuan dialect. The recognizable mailbag information covers more than 4000 postal place names nationwide, plus digit strings. For standard Mandarin the top-1 recognition rate is 97.7% and the top-3 rate is 99.5%; for the Sichuan dialect the top-1 rate is 98% and the top-3 rate is 99.9%.

Claims (7)

1. An information check method based on speech recognition, comprising the following parts: end-point detection of the speech signal and extraction of speech recognition parameters; pre-training of the speaker-independent speech recognition model; speaker-independent speech recognition; a speech recognition confidence measure and rejection model; speaker-adaptive learning for speaker-independent speech recognition; generation of speech recognition entries; and voice prompting; the method specifically comprising the following steps:
A. End-point detection of the speech signal and extraction of speech recognition parameters: (1) the speech signal is sampled by the A/D converter of the computer sound card into an original digital speech signal; (2) the original digital speech signal is spectrally shaped and divided into windowed frames, to guarantee the quasi-stationarity of each speech frame; (3) end-point detection is performed using the short-time energy and waveform-tendency characteristics of the speech signal, and silent frames are removed to guarantee the validity of the features of every frame; (4) speech (recognition) features are extracted from the framed and windowed speech signal;
B. Pre-training of the speaker-independent speech recognition model: (1) a large amount of speech data is collected in advance to build a training speech database, the collected speech being of the same language category as the speech to be recognized; (2) speech feature parameters are extracted from the speech signals in the database and converted into the parameters of the recognition model by a prior learning process on a PC; the recognition model is a phoneme-based Hidden Markov Model (HMM), and training estimates the HMM parameters (including means and variances) according to the maximum-likelihood criterion;
C. Speaker-independent speech recognition: (1) the speech features are matched against the speech recognition models by an N-best Viterbi frame-synchronous beam search algorithm, which extracts the three best recognition candidates in real time; all useful "keyword" information is kept during the search, so no back-tracking is needed; (2) each time a piece of voice information is to be checked, the pronunciation templates of the corresponding entries are switched in automatically, which reduces the search space and improves the recognition speed and recognition accuracy of the checking process; the language model of the recognizer is a multi-subtree trigram word-pair grammar;
D. Speech recognition confidence measure and rejection model:
the confidence measure and the rejection model are computed within the Viterbi frame-synchronous beam search; the confidence of the recognized speech determines whether the recognition result is accepted or rejected, and irrelevant speech occurring during operation is rejected at the same time;
E. Speaker-adaptive learning for speaker-independent speech recognition:
a speaker adaptation method is used to adjust the recognition model; the adaptation method is maximum a posteriori estimation, which revises the recognition template parameters step by step through iteration;
F. Generation of speech recognition entries:
according to the text of the data to be checked, the pronunciation templates of the entries to be recognized are generated automatically from a pronunciation dictionary; the input voice information is compared with these pronunciation templates by the aforementioned speaker-independent speech recognition; the pronunciation dictionary consists of the Chinese characters of the recognition vocabulary and their corresponding pinyin, and is stored in the computer in advance;
G. Voice prompting:
speech synthesis is used for voice prompting; the speech synthesis model parameters are extracted by off-line analysis on the computer and stored on the hard disk of the computer for synthesis, the synthesis model being a code-excited speech coding model; voice prompting is used to play back the recognition result: if the played-back speech is consistent with the input speech, the recognition result is correct; if not, the user is asked to read in the voice command again and it is recognized anew.
2. The information check method based on speech recognition as claimed in claim 1, characterized in that the end-point detection of the speech signal and the speech recognition parameter extraction use a detection method combining a speech/noise maximum-likelihood decision device with a waveform-tendency decision device, and the speech recognition feature parameters are MFCC feature vectors computed according to the auditory properties of the human ear.
3. The information check method based on speech recognition as claimed in claim 1, characterized in that the pre-training of the speaker-independent speech recognition model uses a three-step progressive refinement method for training the HMMs, the model parameters comprising means, covariance matrices and Gaussian mixture weights.
4. The information check method based on speech recognition as claimed in claim 1, characterized in that the speaker-independent speech recognition uses a frame-synchronous beam search over a multi-subtree trigram word-pair grammar, all useful word-string information is kept during the search so that no back-tracking is needed, and the three best recognition candidates can be extracted in real time.
5. The information check method based on speech recognition as claimed in claim 1, characterized in that the confidence estimation and rejection model use a whole-word confidence estimation method and an online filler model as the rejection model for irrelevant speech, improving the robustness of the recognition model and absorbing irrelevant speech and noise.
6. The information check method based on speech recognition as claimed in claim 1, characterized in that the speaker-adaptive learning of the speaker-independent speech recognition uses an adaptation method based on maximum a posteriori probability, adjusting the speech recognition parameters iteratively so that the models retain maximum discriminability.
7. The information check method based on speech recognition as claimed in claim 1, characterized in that the speech recognition entries use a structure based on the multi-subtree trigram word-pair grammar, the corresponding entry pronunciation templates being generated from the text to be checked, each entry pronunciation template being a tree-shaped template built from phonemes as the elementary units.
CN00130298A 2000-11-10 2000-11-10 Information check method based on speech recognition Expired - Fee Related CN1123863C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN00130298A CN1123863C (en) 2000-11-10 2000-11-10 Information check method based on speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN00130298A CN1123863C (en) 2000-11-10 2000-11-10 Information check method based on speech recognition

Publications (2)

Publication Number Publication Date
CN1293428A true CN1293428A (en) 2001-05-02
CN1123863C CN1123863C (en) 2003-10-08

Family

ID=4594094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN00130298A Expired - Fee Related CN1123863C (en) 2000-11-10 2000-11-10 Information check method based on speech recognition

Country Status (1)

Country Link
CN (1) CN1123863C (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436810A (en) * 2011-10-26 2012-05-02 华南理工大学 Record replay attack detection method and system based on channel mode noise
EP3834200A4 (en) 2018-09-12 2021-08-25 Shenzhen Voxtech Co., Ltd. Signal processing device having multiple acoustic-electric transducers

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1875400B (en) * 2003-11-07 2010-04-28 佳能株式会社 Information processing apparatus, information processing method
CN100514446C (en) * 2004-09-16 2009-07-15 北京中科信利技术有限公司 Pronunciation evaluating method based on voice identification and voice analysis
CN1783213B (en) * 2004-12-01 2010-06-09 纽昂斯通讯公司 Methods and apparatus for automatic speech recognition
US9502024B2 (en) 2004-12-01 2016-11-22 Nuance Communications, Inc. Methods, apparatus and computer programs for automatic speech recognition
CN100411011C (en) * 2005-11-18 2008-08-13 清华大学 Pronunciation quality evaluating method for language learning machine
CN1787070B (en) * 2005-12-09 2011-03-16 北京凌声芯语音科技有限公司 On-chip system for language learner
CN1835076B (en) * 2006-04-07 2010-05-12 安徽中科大讯飞信息科技有限公司 Speech evaluating method of integrally operating speech identification, phonetics knowledge and Chinese dialect analysis
US8275609B2 (en) 2007-06-07 2012-09-25 Huawei Technologies Co., Ltd. Voice activity detection
WO2008148323A1 (en) * 2007-06-07 2008-12-11 Huawei Technologies Co., Ltd. A voice activity detecting device and method
CN101452701B (en) * 2007-12-05 2011-09-07 株式会社东芝 Confidence degree estimation method and device based on inverse model
WO2009097738A1 (en) * 2008-01-30 2009-08-13 Institute Of Computing Technology, Chinese Academy Of Sciences Method and system for audio matching
CN102341843B (en) * 2009-03-03 2014-01-29 三菱电机株式会社 Voice recognition device
CN102341843A (en) * 2009-03-03 2012-02-01 三菱电机株式会社 Voice recognition device
CN101894108A (en) * 2009-05-19 2010-11-24 上海易狄欧电子科技有限公司 Method and system for searching for book source on network
CN101609672B (en) * 2009-07-21 2011-09-07 北京邮电大学 Speech recognition semantic confidence feature extraction method and device
CN102034474B (en) * 2009-09-25 2012-11-07 黎自奋 Method for identifying all languages by voice and inputting individual characters by voice
CN102097096A (en) * 2009-12-10 2011-06-15 通用汽车有限责任公司 Using pitch during speech recognition post-processing to improve recognition accuracy
CN102097096B (en) * 2009-12-10 2013-01-02 通用汽车有限责任公司 Using pitch during speech recognition post-processing to improve recognition accuracy
CN102074231A (en) * 2010-12-30 2011-05-25 万音达有限公司 Voice recognition method and system
CN102254558B (en) * 2011-07-01 2012-10-03 重庆邮电大学 Control method of intelligent wheel chair voice recognition based on end point detection
CN102254558A (en) * 2011-07-01 2011-11-23 重庆邮电大学 Control method of intelligent wheel chair voice recognition based on end point detection
CN102402984A (en) * 2011-09-21 2012-04-04 哈尔滨工业大学 Cutting method for keyword checkout system on basis of confidence
CN102760436B (en) * 2012-08-09 2014-06-11 河南省烟草公司开封市公司 Voice lexicon screening method
CN102760436A (en) * 2012-08-09 2012-10-31 河南省烟草公司开封市公司 Voice lexicon screening method
CN103165130A (en) * 2013-02-06 2013-06-19 湘潭安道致胜信息科技有限公司 Voice text matching cloud system
CN103165130B (en) * 2013-02-06 2015-07-29 程戈 Speech text coupling cloud system
CN103810998A (en) * 2013-12-05 2014-05-21 中国农业大学 Method for off-line speech recognition based on mobile terminal device and achieving method
CN103810998B (en) * 2013-12-05 2016-07-06 中国农业大学 Based on the off-line audio recognition method of mobile terminal device and realize method
CN106328152B (en) * 2015-06-30 2020-01-31 芋头科技(杭州)有限公司 automatic indoor noise pollution identification and monitoring system
CN106328152A (en) * 2015-06-30 2017-01-11 芋头科技(杭州)有限公司 Automatic identification and monitoring system for indoor noise pollution
CN105095185A (en) * 2015-07-21 2015-11-25 北京旷视科技有限公司 Author analysis method and author analysis system
CN105489222B (en) * 2015-12-11 2018-03-09 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN105489222A (en) * 2015-12-11 2016-04-13 百度在线网络技术(北京)有限公司 Speech recognition method and device
WO2017096778A1 (en) * 2015-12-11 2017-06-15 百度在线网络技术(北京)有限公司 Speech recognition method and device
US10685647B2 (en) 2015-12-11 2020-06-16 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method and device
CN107204190A (en) * 2016-03-15 2017-09-26 松下知识产权经营株式会社 Misrecognition correction method, misrecognition correct device and misrecognition corrects program
CN106611597A (en) * 2016-12-02 2017-05-03 百度在线网络技术(北京)有限公司 Voice wakeup method and voice wakeup device based on artificial intelligence
CN106611597B (en) * 2016-12-02 2019-11-08 百度在线网络技术(北京)有限公司 Voice awakening method and device based on artificial intelligence
CN106601230A (en) * 2016-12-19 2017-04-26 苏州金峰物联网技术有限公司 Logistics sorting place name speech recognition method, system and logistics sorting system based on continuous Gaussian mixture HMM
CN109255106A (en) * 2017-07-13 2019-01-22 Tcl集团股份有限公司 A kind of text handling method and terminal
CN108389576A (en) * 2018-01-10 2018-08-10 苏州思必驰信息科技有限公司 The optimization method and system of compressed speech recognition modeling
CN108389576B (en) * 2018-01-10 2020-09-01 苏州思必驰信息科技有限公司 Method and system for optimizing compressed speech recognition model
CN108898753A (en) * 2018-06-20 2018-11-27 南通大学 A kind of acoustic control locker and its open method
CN112151020A (en) * 2019-06-28 2020-12-29 北京声智科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110850998A (en) * 2019-11-04 2020-02-28 北京华宇信息技术有限公司 Intelligent word-forming calculation optimization method and device for Chinese input method
CN111583907A (en) * 2020-04-15 2020-08-25 北京小米松果电子有限公司 Information processing method, device and storage medium
CN111583907B (en) * 2020-04-15 2023-08-15 北京小米松果电子有限公司 Information processing method, device and storage medium
CN111862950A (en) * 2020-08-03 2020-10-30 深圳作为科技有限公司 Interactive multifunctional elderly care robot recognition system

Also Published As

Publication number Publication date
CN1123863C (en) 2003-10-08

Similar Documents

Publication Publication Date Title
CN1123863C (en) Information check method based on speech recognition
EP1936606B1 (en) Multi-stage speech recognition
CN1150515C (en) Speech recognition device
Kawahara et al. Flexible speech understanding based on combined key-phrase detection and verification
US7162423B2 (en) Method and apparatus for generating and displaying N-Best alternatives in a speech recognition system
US7231019B2 (en) Automatic identification of telephone callers based on voice characteristics
Zheng et al. Accent detection and speech recognition for Shanghai-accented Mandarin.
CN1236423C (en) Background learning of speaker voices
Sethy et al. Refined speech segmentation for concatenative speech synthesis
Wang et al. Unsupervised spoken term detection with acoustic segment model
EP1734509A1 (en) Method and system for speech recognition
EP2842124A1 (en) Negative example (anti-word) based performance improvement for speech recognition
CN105654947B (en) Method and system for acquiring road condition information in traffic broadcast voice
CN1692405A (en) Voice processing device and method, recording medium, and program
Juneja et al. A probabilistic framework for landmark detection based on phonetic features for automatic speech recognition
US20040024599A1 (en) Audio search conducted through statistical pattern matching
CN1499484A (en) Recognition system of Chinese continuous speech
Lecouteux et al. Combined low level and high level features for out-of-vocabulary word detection
Georgescu et al. Automatic annotation of speech corpora using complementary GMM and DNN acoustic models
McDermott et al. Minimum classification error training of landmark models for real-time continuous speech recognition
Vinyals et al. Discriminative pronounciation learning using phonetic decoder and minimum-classification-error criterion
Qian et al. Tone recognition in continuous Cantonese speech using supratone models
Lamel et al. Portability issues for speech recognition technologies
Wei et al. Exploiting prosodic and lexical features for tone modeling in a conditional random field framework
Zacharie et al. Keyword spotting on word lattices

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
C19 Lapse of patent right due to non-payment of the annual fee
CF01 Termination of patent right due to non-payment of annual fee