CN102237083A - Portable interpretation system based on WinCE platform and language recognition method thereof - Google Patents



Publication number
CN102237083A
Authority
CN
China
Legal status: Pending
Application number
CN2010101605215A
Other languages
Chinese (zh)
Inventor
李心广
阳爱民
姚敏锋
张晶
马文华
陈永煊
林江豪
Current Assignee
Guangdong University of Foreign Studies
Original Assignee
Guangdong University of Foreign Studies
Priority date
Filing date
Publication date
Application filed by Guangdong University of Foreign Studies filed Critical Guangdong University of Foreign Studies
Priority to CN2010101605215A
Publication of CN102237083A


Abstract

The invention relates to a portable spoken-language translation system based on the WinCE platform. The system comprises a voice collector, a voice preprocessing module, a voice feature extraction and modeling module, a model base, a recognition module, a corpus base, and a translation and voice synthesis module, all of which run on an embedded platform. The voice collection module is connected to the voice preprocessing module, which in turn feeds the voice feature extraction and modeling module. The feature extraction and modeling module is connected either to the model base (when the training state is selected) or to the recognition module (when the recognition state is selected). The recognition module is connected to the translation and voice synthesis module, and the translation and voice synthesis module is connected to the corpus base. The system is characterized by efficient and accurate speech recognition, high equipment portability, and two-way interpretation.

Description

Portable spoken-language translation system based on the WinCE platform and speech recognition method thereof
Technical field
The present invention relates to the field of speech recognition technology, and in particular to a portable spoken-language translation system based on the WinCE platform that converts a recognized human voice signal into the corresponding translation result. The invention also relates to the speech recognition method of this translation system.
Background technology
Speech recognition technology enables a machine to identify and understand human speech, converting a spoken voice signal into the corresponding text or into a command; it is steadily becoming a key technology of the human-machine interface in information technology. In recent years, with the rapid development of embedded devices, consumer electronics products — portable and low in cost — have penetrated every area of daily life and found wide application, so embedded speech recognition systems have a very large consumer market. Traditional speech recognition systems, such as Microsoft Speech SDK 5.1 or the Cambridge HTK toolkit, are speech recognition engines built for PC operating systems and cannot be used on an embedded operating system.
Summary of the invention
The object of the present invention is to design a portable spoken-language translation system based on the WinCE platform that, despite the resource constraints of an embedded system, achieves large-vocabulary recognition with a high recognition rate and realizes two-way spoken translation from Chinese to English and from English to Chinese.
Another object of the present invention is to provide the speech recognition method of this translation system.
To achieve the above objects, the present invention comprises the following technical features. A portable spoken-language translation system based on the WinCE platform is characterized in that it comprises a voice collector, a voice preprocessing module, a voice feature extraction and modeling module, a model base, a recognition module, a corpus base, and a translation and voice synthesis module, all built on an embedded platform. The voice collection module is connected to the voice preprocessing module; the voice preprocessing module is connected to the voice feature extraction and modeling module; the feature extraction and modeling module is connected either to the model base (when the training state is selected) or to the recognition module (when the recognition state is selected); the recognition module is connected to the translation and voice synthesis module; and the translation and voice synthesis module is connected to the corpus base. After the recognition module obtains the optimal result through decision judgment, the translation and voice synthesis module translates it into text and outputs it in speech form. Through language selection, two-way spoken translation from Chinese to English or from English to Chinese is realized.
The voice preprocessing module comprises, connected in sequence, a pre-emphasis unit, a framing unit, a windowing unit, and an endpoint detection unit; the pre-emphasis unit is connected to the voice collector, and the endpoint detection unit is connected to the voice feature extraction and modeling module.
The pre-emphasis unit is a high-frequency-boost pre-emphasis digital filter.
The framing unit divides the signal into overlapping frames.
The windowing unit applies a Hamming window function.
The endpoint detection unit uses a double-threshold comparison with the short-time energy E and the short-time average zero-crossing rate Z as features, computing a zero-crossing-rate threshold Z_cT and high and low energy thresholds from the initial silent segment and using them as gates for endpoint detection.
The voice feature extraction and modeling module extracts MFCC voice features as the recognition features and adopts a hidden Markov model as the training and recognition model; the hidden Markov model consists of a Markov chain and a general stochastic process.
For this hidden Markov model, the forward-backward probability algorithm solves the evaluation problem, the Viterbi algorithm solves the decoding problem, and the Baum-Welch iterative algorithm solves the learning problem.
Specifically: the forward-backward probability algorithm solves, for a given hidden Markov model λ = (π, A, B) and an observation sequence O = O_1, O_2, …, O_T produced by the system, the problem of computing the likelihood P(O|λ).
The Viterbi algorithm solves, for a given model λ = (π, A, B) and an observation sequence O = O_1, O_2, …, O_T produced by the system, the problem of searching for the state sequence S = q_1, q_2, …, q_T that the system most likely passed through in producing this observation sequence.
For an unknown hidden Markov model, the Baum-Welch iterative algorithm is used to estimate the model parameters.
The present invention also comprises a speech recognition method for the portable spoken-language translation system based on the WinCE platform, characterized by the following steps:
(1) train the hidden Markov model to obtain the model parameters;
(2) take the voice features obtained by the feature extraction module as the observation sequence of the hidden Markov model; the voice units obtained by training form the state sequence, and the state transition sequence is solved by the Viterbi algorithm;
(3) apply decision judgment to obtain the state transition sequence of maximum probability;
(4) derive the candidate phonemes or syllables corresponding to the optimal state sequence, and finally form words and sentences through the language model.
In step (1), the hidden Markov model parameters are first initialized, and the Baum-Welch iterative algorithm is then used to estimate them.
Step (1) obtains its result by iterating the training algorithm repeatedly, so a termination condition must also be provided: when the relative change of the likelihood falls below ε, iteration terminates; in addition, a maximum iteration count N is set, and iteration also stops when the iteration count exceeds N. The Baum-Welch algorithm further adopts a scaling-factor method to correct the data underflow problem of the algorithm.
The present invention is a portable spoken-language translation system based on the WinCE platform and a speech recognition method thereof. Its hardware core is an embedded processor, and the embedded system has the fine qualities of low cost, low power consumption, high performance, and strong portability. The voice preprocessing module — comprising the pre-emphasis, framing, windowing, and endpoint detection units — preprocesses the collected voice signal so that later speech recognition on the embedded system is more efficient and more accurate. A hidden Markov model is adopted: the model base is trained and then used for model recognition, making the recognition process more precise and efficient. Compared with the prior art, the present invention has the advantages of two-way translation, low cost, low power consumption, high performance, and strong portability, and has a very large consumer market in the field of speech recognition systems.
Description of drawings
Fig. 1 is a schematic diagram of the composition of the hidden Markov model;
Fig. 2 is a schematic diagram of the forward-backward algorithm;
Fig. 3 is the flow chart of hidden Markov model parameter training;
Fig. 4 shows the left-to-right hidden Markov model structure without skips;
Fig. 5 shows the hidden Markov model recognition process;
Fig. 6 is the module schematic diagram of the present invention;
Fig. 7 shows the transition probability processing of the recognition module of the present invention;
Fig. 8 is the corpus structure diagram of the translation and voice synthesis module of the present invention.
Embodiment
The present invention is a portable spoken-language translation system based on the WinCE platform; the design realizes a speech recognition system based on WinCE. The embedded system has the fine qualities of low cost, low power consumption, and high performance, and its core is the embedded processor. At present, ARM microprocessors mainly comprise the ARM7, ARM9, ARM9E, ARM10E, and ARM11 series, with ever stronger capabilities. The present invention uses the embedded research platform UP-CPU 6410 with Samsung's latest S3C6410X (ARM11) embedded microprocessor, whose clock frequency reaches 633 MHz; it is a processor based on the ARM1176JZF-S core and the ARM v6 architecture.
The module schematic of the present invention is shown in Fig. 6. The voice signal is collected and input by the microphone of voice collector 1, and voice preprocessing module 2 applies pre-emphasis, framing, windowing, endpoint detection, and related processing to it; these functions are realized by pre-emphasis unit 21, framing unit 22, windowing unit 23, and endpoint detection unit 24. Voice feature extraction and modeling module 3 then performs feature extraction on the voice information and trains the voice model; module 3 is connected to model base 4 or recognition module 5. Translation and voice synthesis module 7 reads corpus 6, translates the result into text, and synthesizes speech output.
Each of the modules and units involved is described below:
1. Pre-emphasis unit 21
The average power spectrum of a voice signal is affected by glottal excitation and mouth-nose radiation: above roughly 800 Hz the high-frequency end falls off at about 6 dB/oct (octave), so the higher the frequency, the smaller the corresponding component. For this reason the high-frequency part must be boosted before the voice signal is analyzed. A 6 dB/oct high-frequency-boost pre-emphasis digital filter is therefore usually applied to the voice signal before analysis; boosting the high-frequency part flattens the signal spectrum, so the spectrum can be computed with the same signal-to-noise ratio over the whole band from low to high frequency. The filter response function is
H(z) = 1 − α·z^{−1}, 0.9 ≤ α ≤ 1.0
where α is the pre-emphasis factor, usually taken as 0.9375. The output of the pre-emphasis network is related to the input voice signal s(n) by the difference equation s̃(n) = s(n) − α·s(n − 1).
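The difference equation above amounts to one arithmetic operation per sample. As an illustrative sketch (not part of the original disclosure; the function name and sample values are invented for demonstration):

```python
# Pre-emphasis: y(n) = s(n) - alpha * s(n-1), with alpha = 0.9375 as in the text.
def pre_emphasis(signal, alpha=0.9375):
    """Apply the first-order high-frequency-boost pre-emphasis filter."""
    out = [signal[0]]  # the first sample has no predecessor
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out

# A constant (DC) signal is almost completely suppressed after the first sample,
# illustrating the high-pass character of the filter.
emphasized = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```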
2. Framing unit 22
A voice signal is time-varying, but within a short range its characteristics remain essentially unchanged, i.e. relatively stable; this property of the voice signal is called its "short-time characteristic", and the short interval is generally 10–30 ms. The analysis and processing of voice signals is therefore generally built on this short-time characteristic — "short-time analysis" — and the audio stream is processed frame by frame. The number of frames per second is generally
frames per second = 1/t, with 0.01 s < t < 0.03 s,
decided according to the actual situation. Framing can use either contiguous frames or overlapping frames; since consecutive voice samples are correlated, the present invention adopts overlapping frames.
In this way, for the voice signal as a whole, the analysis yields a time series of characteristic parameters composed of the parameters of each frame.
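Overlapping framing as described above can be sketched as follows (an illustrative sketch only; frame length and hop size are invented example values, not values fixed by the patent):

```python
def split_frames(signal, frame_len, hop):
    """Split a signal into frames of frame_len samples, advancing by hop.
    hop < frame_len gives overlapping frames, as the text prescribes."""
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames

# e.g. 25 ms frames with a 12.5 ms hop give 50% overlap; here a toy signal:
frames = split_frames(list(range(10)), frame_len=4, hop=2)
```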
3. Windowing unit 23
Because the voice signal is short-time stationary, the signal can be framed. To emphasize the speech waveform near sample n and attenuate the remainder of the waveform, each frame must also be windowed. Processing each short segment of the voice signal in fact means applying a certain transform or operation to each segment; the general expression is
Q_n = Σ_{m=−∞}^{∞} T[s(m)]·ω(n − m)
where T[·] denotes a transform, which may be linear or non-linear, s(n) is the input voice signal series, and Q_n is the time series obtained after all segments are processed.
The present invention selects the Hamming window
ω(n) = 0.54 − 0.46·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1.
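The Hamming window and its application to a frame can be sketched as follows (illustrative only; the dummy frame of ones simply exposes the window shape):

```python
import math

def hamming(N):
    """Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

w = hamming(11)
# Windowing a frame is element-wise multiplication by the window coefficients:
windowed_frame = [s * c for s, c in zip([1.0] * 11, w)]
```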
4. Endpoint detection unit 24
Endpoint detection in voice signal processing mainly serves to detect the starting point and end point of speech automatically. The present invention adopts the double-threshold comparison method for endpoint detection. The double-threshold method takes the short-time energy E and the short-time average zero-crossing rate Z as features; combining the advantages of Z and E makes detection more accurate, effectively reduces the processing time of the system, improves the real-time performance of system processing, and excludes the noise of unvoiced segments, thereby improving recognition performance.
In the double-threshold method, the short-time energy E and the short-time average zero-crossing rate Z are computed as follows.
(1) Short-time energy E
The short-time energy of the voice signal s(n) is defined as
E_n = Σ_{m=−∞}^{∞} [s(m)·ω(n − m)]²
where ω(n) is the Hamming window function.
If we let h(n) = ω²(n), then
E_n = Σ_{m=−∞}^{∞} s²(m)·h(n − m) = s²(n) * h(n)
This formula shows that the windowed short-time energy is equivalent to passing the squared signal through a linear filter whose unit-sample response is h(n).
(Block diagram of the short-time energy: s(n) → squaring → linear filter h(n) → E_n.)
For the frame of the voice signal indexed by n, the short-time average energy E_n is
E_n = Σ_{m=n−N+1}^{n} [s(m)·ω(n − m)]²
(2) Short-time average zero-crossing rate Z
The short-time average zero-crossing rate is defined as
Z_n = Σ_{m=−∞}^{∞} |sgn[s(m)] − sgn[s(m − 1)]|·ω(n − m) = |sgn[s(n)] − sgn[s(n − 1)]| * ω(n)
where sgn[·] is the sign function,
sgn[x] = 1 for x ≥ 0 and sgn[x] = −1 for x < 0,
s(n) is the voice signal, and ω(n) is a window function.
(Block diagram: s(n) → sign and first difference → |·| → linear filter ω(n) → Z_n.)
The short interval at which the voice signal begins is uniformly distributed background noise. When the double-threshold method is used for endpoint detection, the zero-crossing-rate threshold Z_cT and the energy thresholds ETL (low energy threshold) and ETU (high energy threshold) are computed from this initial "silent" segment and used as gates, so that the endpoints can be detected accurately.
The zero-crossing-rate threshold is Z_cT = min(IF, Z̄_c + 2·σ_Zc), where IF is an empirical value (the present invention takes IF = 25), and Z̄_c and σ_Zc are respectively the mean and the standard deviation of the zero-crossing rate of the initial "silent" segment.
For ETL and ETU, the short-time average energy of the "silent" segment is computed first; its maximum is denoted E_max and its minimum E_min. Let
I1 = 0.03·(E_max − E_min) + E_min
I2 = 4·E_min
Then
ETL = min(I1, I2)
ETU = 5·ETL
When Z_cT, ETL, and ETU are used as gates, let the start frame be N1; then the energy E_N1 and zero-crossing rate Z_N1 at frame N1 simultaneously satisfy ETU > E_N1 > ETL, E_{N1+1} > ETU, and Z_N1 > Z_cT. The energy E_N2 and zero-crossing rate Z_N2 at the end frame N2 simultaneously satisfy the corresponding energy condition (with adjustment coefficient k = 4) and Z_N2 < Z_cT.
Adopting the double-threshold method and taking neighboring frames into account effectively avoids the influence of noise and improves detection accuracy, making voice feature extraction efficient and benefiting the recognition rate.
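The per-frame features and silence-derived thresholds above can be sketched as follows (an illustrative sketch of the stated formulas; function names are invented, and the frame-level start/end decision logic is omitted):

```python
def short_time_energy(frame):
    """E_n for one frame (window assumed already applied to the frame)."""
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    """Z_n for one frame: sum of |sgn(s(m)) - sgn(s(m-1))| over the frame."""
    sgn = lambda x: 1 if x >= 0 else -1
    return sum(abs(sgn(frame[m]) - sgn(frame[m - 1])) for m in range(1, len(frame)))

def energy_thresholds(silence_energies):
    """ETL and ETU from the short-time energies of the initial silent segment:
    I1 = 0.03*(Emax - Emin) + Emin, I2 = 4*Emin, ETL = min(I1, I2), ETU = 5*ETL."""
    e_max, e_min = max(silence_energies), min(silence_energies)
    i1 = 0.03 * (e_max - e_min) + e_min
    i2 = 4 * e_min
    etl = min(i1, i2)
    return etl, 5 * etl
```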
5. Voice feature extraction and modeling module 3
The present invention extracts MFCC voice features, which are based on auditory characteristics, as the recognition features. The Mel-frequency cepstral coefficients (MFCC) are derived from the characteristics of the human auditory system and imitate the human ear's perception of speech at different frequencies. The human ear distinguishes sound frequencies in a roughly logarithmic fashion: on the Mel frequency scale, the perception of pitch is linear — if the Mel frequencies of two segments of speech differ by a factor of two, the perceived pitch also differs by a factor of two.
The MFCC algorithm of feature extraction module 3 proceeds as follows:
1. Fast Fourier Transform (FFT):
X[k] = Σ_{n=0}^{N−1} x[n]·e^{−j2πnk/N}, k = 0, 1, 2, …, N − 1
where x[n] (n = 0, 1, 2, …, N − 1) is a frame of the sampled discrete voice sequence and N is the frame length. X[k] is a complex series of N points; taking the modulus of X[k] gives the signal magnitude spectrum |X[k]|.
2. Convert the actual frequency scale to the Mel frequency scale:
Mel(f) = 2595·lg(1 + f/700)
where Mel(f) is the Mel frequency and f is the actual frequency in Hz.
3. Configure a triangular filter bank and compute the filtered output of each triangular filter on the signal magnitude spectrum |X[k]|:
F(l) = Σ_{k=f_o(l)}^{f_h(l)} w_l(k)·|X[k]|, l = 1, 2, …, L
where
w_l(k) = (k − f_o(l)) / (f_c(l) − f_o(l)) for f_o(l) ≤ k ≤ f_c(l)
w_l(k) = (f_h(l) − k) / (f_h(l) − f_c(l)) for f_c(l) ≤ k ≤ f_h(l)
and
f_o(l) = o(l) / (f_s/N), f_c(l) = c(l) / (f_s/N), f_h(l) = h(l) / (f_s/N)
Here w_l(k) is the filter coefficient of the corresponding filter; o(l), c(l), and h(l) are the lower cut-off, center, and upper cut-off frequencies of the corresponding filter on the actual-frequency axis; f_s is the sampling rate; L is the number of filters; and F(l) is the filter output.
4. Take the logarithm of all filter outputs and then apply the discrete cosine transform (DCT) to obtain the MFCC:
M(i) = (2/N)·Σ_{l=1}^{L} log F(l)·cos[(l − 1/2)·iπ/L], i = 1, 2, …, Q
where Q is the order of the MFCC parameters, generally 12, and M(i) are the resulting MFCC parameters.
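Steps 2 and 4 above can be sketched as follows (illustrative only: the FFT and filter-bank stages are assumed to have produced the filter outputs F(l), and a 2/L normalization is used in place of the document's 2/N, since L filter outputs enter the transform):

```python
import math

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f/700) -- step 2 of the MFCC procedure."""
    return 2595 * math.log10(1 + f / 700.0)

def mfcc_from_filterbank(F, Q=12):
    """Log + DCT over filter-bank outputs F(l), l = 1..L -- step 4 above.
    Returns Q cepstral coefficients M(1)..M(Q)."""
    L = len(F)
    return [(2.0 / L) * sum(math.log(F[l - 1]) * math.cos((l - 0.5) * i * math.pi / L)
                            for l in range(1, L + 1))
            for i in range(1, Q + 1)]
```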
The speech model of the present invention adopts the hidden Markov model. The hidden Markov model (HMM, Hidden Markov Model) is a statistical signal processing model, represented by probabilistic model parameters and used to describe the statistical characteristics of stochastic processes; it is developed from the Markov chain. An HMM has two components: a Markov chain, which describes the transfer of states in terms of transition probabilities, and a general stochastic process, which describes the relation between the states and the observation sequence in terms of observation probabilities. Its composition is shown in Fig. 1.
The HMM can be expressed as λ = (N, M, π, A, B), where:
N: the number of Markov chain states in the model. Denote the N states as θ_1, …, θ_N and the state of the Markov chain at time t as q_t; obviously q_t ∈ {θ_1, …, θ_N}.
M: the number of possible observed values for each state. Denote the M observed values as V_1, …, V_M and the observation vector observed at time t as O_t, where O_t ∈ {V_1, …, V_M}.
π: the initial state probability vector, π = (π_1, …, π_N), where π_i = P(q_1 = θ_i), 1 ≤ i ≤ N.
A: the state transition probability matrix, A = (a_ij)_{N×N}, where a_ij = P(q_{t+1} = θ_j | q_t = θ_i), 1 ≤ i, j ≤ N, is the probability of transitioning from state i to state j.
B: the output probability matrix, B = (b_ik)_{N×M}, where b_ik = P(O_t = V_k | q_t = θ_i), 1 ≤ i ≤ N, 1 ≤ k ≤ M, is the probability of producing output V_k when in state i.
Since a_ij, b_ik, and π_i are all probabilities, they must satisfy the normalization conditions a_ij ≥ 0, b_ik ≥ 0, π_i ≥ 0, and
Σ_{j=1}^{N} a_ij = 1 (1 ≤ i ≤ N), Σ_{k=1}^{M} b_ik = 1 (1 ≤ i ≤ N), Σ_{i=1}^{N} π_i = 1.
An HMM involves three problems:
1. The evaluation problem
Given an HMM λ = (π, A, B) and an observation sequence O = O_1, O_2, …, O_T produced by the system, compute the likelihood P(O|λ). The most basic theoretical calculation method is to add the probabilities over all possible state sequences S = q_1, q_2, …, q_T:
P(O|λ) = Σ_{all S} π_{q_1}·b_{q_1}(O_1)·a_{q_1 q_2}·b_{q_2}(O_2) ⋯ a_{q_{T−1} q_T}·b_{q_T}(O_T)
But the complexity of this method is on the order of T·N^T, so the amount of computation is very large; the forward-backward algorithm solves this evaluation problem effectively in recognition, with computation on the order of N²·T.
Define the forward variable α_t(i) = P(o_1 o_2 … o_t, q_t = θ_i | λ): under model λ, the probability of having observed the partial sequence o_1 … o_t and being in state θ_i at time t. The forward variable at the next time step is computed as
α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i)·a_ij]·b_j(o_{t+1}), with α_1(i) = π_i·b_i(o_1).
The schematic diagram of the forward-backward algorithm is shown in Fig. 2.
The definition back is to variable: β t(i)=P (o T+1o T+2... o T| q t=i, λ) T is (o to the observed events sequence of moment t+1 backward from stopping constantly in expression T+1o T+2... o T), and the state of t is the probability of i constantly.The back computing formula to variable of previous moment is:
Figure GSA00000109878500117
The back is similar to the synoptic diagram and the forward direction method of algorithm, and just direction is opposite.
When the forward and backward probabilities are used to solve the evaluation problem, the concrete computing formula is
P(O|λ) = Σ_{i=1}^{N} α_T(i) = Σ_{i=1}^{N} π_i·b_i(o_1)·β_1(i)
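The forward recursion described above can be sketched as follows (an illustrative sketch under the stated definitions; states and observations are integer indices, and no underflow scaling is applied):

```python
def forward(pi, A, B, obs):
    """Forward algorithm: returns P(O|lambda) = sum_i alpha_T(i).
    pi[i]: initial probabilities; A[i][j]: transition probabilities;
    B[i][k]: probability of emitting symbol k in state i; obs: symbol indices."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
    return sum(alpha)
```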
2. The decoding problem
Given an HMM λ = (π, A, B) and an observation sequence O = O_1, O_2, …, O_T produced by the system, search for the state sequence S = q_1, q_2, …, q_T that the system most likely passed through in producing this observation sequence, i.e. solve for the state sequence S that maximizes P(S|O, λ). Since
P(S|O, λ) = P(S, O|λ) / P(O|λ)
and P(O|λ) is the same for all S, the decoding problem is equivalent to solving for the state sequence S that maximizes P(S, O|λ). The decoding problem is solved with the Viterbi algorithm.
Define
δ_t(i) = max_{q_1, …, q_{t−1}} P(q_1 … q_{t−1}, q_t = θ_i, o_1 … o_t | λ)
which denotes, among the state sequences whose state at time t is θ_i, the maximum probability of the sequence formed by state θ_i and the preceding t − 1 states. The recursion formula of the algorithm is
δ_{t+1}(j) = [max_{1≤i≤N} δ_t(i)·a_ij]·b_j(o_{t+1})
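The Viterbi recursion with backtracking can be sketched as follows (an illustrative sketch under the stated definitions; states and observations are integer indices, and no log-domain scaling is applied):

```python
def viterbi(pi, A, B, obs):
    """Return the most likely state sequence for the observation sequence obs."""
    N = len(pi)
    # delta_1(i) = pi_i * b_i(o_1)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]
    back = []  # backpointers: best predecessor of each state at each step
    for t in range(1, len(obs)):
        new_delta, ptr = [], []
        for j in range(N):
            # delta_{t+1}(j) = [max_i delta_t(i) * a_ij] * b_j(o_{t+1})
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][obs[t]])
        delta = new_delta
        back.append(ptr)
    # Backtrack from the best final state to recover the state sequence.
    state = max(range(N), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))
```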
3. The learning problem
For an unknown HMM, determine from an observation sequence O = O_1, O_2, …, O_T produced by the system the model λ = (π, A, B), i.e. solve for the model parameters π, A, B that maximize the likelihood P(O|λ). The learning problem corresponds to the parameter training process of the HMM: only the observed data are available and a description of the states is missing, so maximum likelihood is usually chosen as the optimization target, and the Baum-Welch iterative algorithm, built on the expectation-maximization (EM) principle, is adopted to estimate the model parameters.
Let ξ_t(i, j) denote the probability that the state at time t is i and the state at time t + 1 is j:
ξ_t(i, j) = P(q_t = i, q_{t+1} = j | O, λ)
= P(q_t = i, q_{t+1} = j, O | λ) / P(O|λ)
= α_t(i)·a_ij·b_j(o_{t+1})·β_{t+1}(j) / P(O|λ)
= α_t(i)·a_ij·b_j(o_{t+1})·β_{t+1}(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i)·a_ij·b_j(o_{t+1})·β_{t+1}(j)
Let γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j) denote the probability that the state at time t is i; then Σ_{t=1}^{T−1} γ_t(i) is the expected number of transitions out of state i, and Σ_{t=1}^{T−1} ξ_t(i, j) is the expected number of transitions from state i to state j.
The computing formula of the state transition matrix is therefore
ā_ij = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i)
and the computing formula of the output probability matrix is
b̄_j(k) = Σ_{t=1, o_t=V_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
The HMM speech recognition process of the present invention is as follows:
In speech recognition, the MFCC voice features obtained by the feature extraction module serve as the observation sequence of the HMM, and the states are the voice units obtained by training. Therefore, before the HMM can be used for speech recognition, the model must be trained to obtain the HMM parameters; the training process of the present invention, shown in Fig. 3, achieved a good training effect.
In the training process, the HMM parameters are first initialized and the Baum-Welch iterative algorithm is then used to estimate them. In practical applications, the training algorithm must be iterated repeatedly before a result is obtained, so a termination condition must also be provided: when the relative change of the likelihood falls below ε, the iteration terminates; in addition, a maximum iteration count N is set, and iteration also stops once the iteration count exceeds N. The Baum-Welch algorithm further adopts a scaling-factor method to correct the data underflow problem of the algorithm. As shown in Fig. 4, the present invention adopts a left-to-right HMM structure without skips.
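The stopping rule described above (relative change of the likelihood below ε, capped at a maximum iteration count) can be sketched generically; `reestimate` and `log_likelihood` here are caller-supplied stand-ins for the actual Baum-Welch re-estimation step and likelihood evaluation, not the patent's implementation:

```python
def train_until_converged(reestimate, log_likelihood, lam0, eps=1e-4, max_iter=20):
    """Iterate a re-estimation step until the relative change in likelihood
    drops below eps, or until max_iter iterations have been performed."""
    lam, prev = lam0, log_likelihood(lam0)
    for _ in range(max_iter):
        lam = reestimate(lam)
        cur = log_likelihood(lam)
        if abs(cur - prev) < eps * abs(prev):  # relative-change criterion
            break
        prev = cur
    return lam
```

With a toy "model" whose re-estimation step halves its distance to 1, the loop converges well within the iteration cap.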
As shown in Fig. 5, after the HMM is trained, the MFCC features are used together with the Viterbi algorithm to solve for the state transition sequences P(O|λ_n) (n = 1, …, M); finally, decision judgment is applied to obtain the state transition sequence of maximum probability. The candidate syllables or initials and finals are then given according to the λ corresponding to the optimal state sequence, and words and sentences are finally formed through the language model.
The concrete modules are realized as described below:
6. Recognition module 5
As shown in Fig. 7, the recognition module adopts the HMM: it calls the trained voice models in the model base and matches them against the input voice. The output of matching against the HMM templates is the set of transition probability values P_i (i = 0, 1, …, i, where i is the number of templates); the P_i are compared, the maximum transition probability P is obtained, and the corresponding text information is output as the recognition result.
In a large-vocabulary speech recognition system, the many near-homophones and homophones reduce the system recognition rate. To overcome their influence, the system processes the transition probabilities produced by matching; the processing procedure is shown in Fig. 7. A threshold P_T is set for the transition probability: when P_i > P_T, the corresponding text is output; otherwise the result is discarded.
This transition probability threshold processing effectively improves the recognition rate of the system.
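The accept/reject decision described above is a few lines of code; as an illustrative sketch (the function name and values are invented, and the patent does not disclose how P_T itself is computed):

```python
def decide(probs, texts, p_threshold):
    """Output the text of the best-matching template only if its
    probability exceeds the threshold P_T; otherwise reject the result."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return texts[best] if probs[best] > p_threshold else None
```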
7. Translation and voice synthesis module
The translation and voice synthesis module mainly performs a matching query between the hidden states output by the recognition module and the corpus, translates the result into text, and, adopting TTS technology, outputs it in speech form.
Fig. 8 is the structure diagram of the corpus. The corpus is built with compound feature vectors. Define the phoneme feature vector V_phoneme as
V_phoneme = (No., Phoneme)
where No. is the phoneme number and Phoneme is the phoneme content.
Define the syllable feature vector V_syllable as
V_syllable = (No., Syllable, No._Word, G_P)
where No. is the syllable number, Syllable is the syllable content, No._Word is the word number, and G_P is the phoneme sequence set.
Define the word feature vector V_Word as
V_Word = (No., Word, Vector_W, Num_Phrase, No._Phrase)
where No. is the word number, Word is the word content, Vector_W is the part-of-speech feature vector with Vector_W = (n, v, num, pron, adj, adv), Num_Phrase is the number of phrases based on this word, and No._Phrase is the phrase number.
Define the translation vector V_Tran as
V_Tran = (No., Tran_n, Tran_v, Tran_num, Tran_pron, Tran_adj, Tran_adv)
where No. is the translation number, and Tran_n, Tran_v, Tran_num, Tran_pron, Tran_adj, and Tran_adv are respectively the translations for the parts of speech n, v, num, pron, adj, and adv.
In the corpus, certain association relations exist among features of these vectors; cross-level queries over the vectors can be made through the linked features, which improves search efficiency.
In the translation process, the syllable feature vector V_syllable associated with the phoneme feature vector V_phoneme is obtained first; the word feature vector V_word is then queried; and finally the translation vector V_Tran is taken as the result.
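The cross-level lookup through the linked feature vectors can be sketched as follows. The tables, record contents, and link fields are illustrative inventions for the example; only the field names follow the vector definitions above, with subscripts flattened to underscores.

```python
# Illustrative corpus tables, each keyed by its numbering field "No.".
phonemes  = {1: {"Phoneme": "n"}, 2: {"Phoneme": "i"}}
syllables = {10: {"Syllable": "ni", "No_Word": 100, "G_P": [1, 2]}}
words     = {100: {"Word": "你", "Vector_W": ("pron",), "No_Phrase": 500}}
trans     = {100: {"Tran_pron": "you"}}


def translate(phoneme_nos):
    """Follow the phoneme -> syllable -> word -> translation links."""
    # Find the syllable whose phoneme sequence set G_P matches the input.
    for syl in syllables.values():
        if syl["G_P"] == phoneme_nos:
            word_no = syl["No_Word"]        # cross-level link to the word
            word = words[word_no]
            # Pick the translation matching the word's part of speech;
            # the translation vector shares the word numbering here.
            pos = word["Vector_W"][0]
            return trans[word_no][f"Tran_{pos}"]
    return None


print(translate([1, 2]))  # follows the links from phonemes to a translation
```

Because every level carries the next level's number, each query is a direct key lookup rather than a scan, which is the search-efficiency gain the text describes.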
The main purpose of speech synthesis is to output the text obtained by translation in speech form. It has three main components: a text analysis module, a prosody generation module, and an acoustic module. The synthesis process is:

Text analysis → Prosody generation → Acoustic module
In view of the above description, compared with the prior art the present invention has the advantages of two-way translation, low cost, low power consumption, high performance, and strong portability, and has a very large consumer market in the field of speech recognition systems.

Claims (9)

1. A portable oral translation system based on the WinCE platform, characterized in that it comprises a voice collector, a voice preprocessing module, a voice feature extraction and modeling module, a model base, a recognition module, a corpus, and a translation and speech synthesis module, all modules being built on an embedded platform; the voice collection module is connected to the voice preprocessing module; the voice preprocessing module is connected to the voice feature extraction and modeling module; the voice feature extraction and modeling module is connected to either the model base or the recognition module: when the training state is selected it is connected to the model base, and when the recognition state is selected it is connected to the recognition module; the recognition module is connected to the translation and speech synthesis module; the translation and speech synthesis module is connected to the corpus; after the recognition module obtains the optimal result through decision judgment, the translation and speech synthesis module translates it into text and outputs it in speech form; through language selection, two-way spoken translation from Chinese to English or from English to Chinese is realized.
2. The portable oral translation system based on the WinCE platform according to claim 1, characterized in that the voice preprocessing module comprises, connected in sequence, a pre-emphasis unit, a frame-division processing unit, a windowing unit, and an endpoint detection unit; the pre-emphasis unit is connected to the voice collector, and the endpoint detection unit is connected to the voice feature extraction and modeling module;
the pre-emphasis unit is a high-frequency-boost pre-emphasis digital filter;
the frame-division processing unit performs frame division with overlapping frames;
the windowing unit applies a Hamming window function for windowing;
the endpoint detection unit uses the short-time energy E and the short-time average zero-crossing rate Z as features for double-threshold comparison, computes the zero-crossing-rate threshold ZcT and the high and low energy thresholds from the silent segment, and detects the endpoints accordingly.
3. The portable oral translation system based on the WinCE platform according to claim 2, characterized in that the voice feature extraction and modeling module extracts MFCC speech features as the recognition features; a hidden Markov model, composed of a Markov chain and a general stochastic process, is established as the training and recognition model;
the hidden Markov model uses the forward-backward probability algorithm to solve the evaluation problem, the Viterbi algorithm to solve the decoding problem, and the Baum-Welch iterative algorithm to solve the learning problem.
4. The portable oral translation system based on the WinCE platform according to claim 3, characterized in that:
the forward-backward probability algorithm is used to solve the problem of computing, for a given hidden Markov model λ = (π, A, B), the likelihood probability P(O|λ) of an observation sequence O = O_1, O_2, ..., O_T produced by the system.
5. The portable oral translation system based on the WinCE platform according to claim 3, characterized in that the Viterbi algorithm is used to solve the problem of finding, for a given hidden Markov model λ = (π, A, B) and an observation sequence O = O_1, O_2, ..., O_T produced by the system, the most likely state sequence S = q_1, q_2, ..., q_T through which the system produced this observation sequence.
6. The portable oral translation system based on the WinCE platform according to claim 3, characterized in that for an unknown hidden Markov model system, the Baum-Welch iterative algorithm is used to estimate the model parameters.
7. A speech recognition method of the portable oral translation system based on the WinCE platform according to claim 3, characterized by comprising the following steps:
(1) training the hidden Markov model to obtain the model parameters;
(2) taking the speech features obtained by the feature extraction module as the observation sequence of the hidden Markov model, taking the speech units obtained by training as the state sequence, and solving the state-transition sequence with the Viterbi algorithm;
(3) applying decision judgment to obtain the state-transition sequence of maximum probability;
(4) obtaining candidate syllables or initials and finals corresponding to the optimal state sequence, and finally forming words and sentences with a language model.
8. The speech recognition method of the portable oral translation system based on the WinCE platform according to claim 7, characterized in that step (1) first initializes the hidden Markov model parameters and then estimates the model parameters with the Baum-Welch iterative algorithm.
9. The speech recognition method of the portable oral translation system based on the WinCE platform according to claim 8, characterized in that step (1) iterates the training algorithm repeatedly to obtain the result, and a condition for ending the iteration is also provided: when the relative change of the probability is less than ε, the iteration process ends; in addition, a maximum iteration count N is set, and when the number of iterations exceeds N the iteration also stops; furthermore, the Baum-Welch algorithm is augmented with scaling factors to correct the algorithm's data underflow problem.
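The stopping rule of claim 9 can be sketched as follows. The training step here is a stub that merely moves a log-likelihood value toward a fixed point; in the real method it would be one Baum-Welch re-estimation pass (with per-frame scale factors keeping the forward variables from underflowing), and all numeric values below are invented for the example.

```python
def train_until_converged(step, init_loglik, eps=1e-4, max_iters=50):
    """Iterate a training step until the relative change of the
    log-likelihood falls below eps, or max_iters iterations have run."""
    loglik = init_loglik
    for i in range(1, max_iters + 1):
        new_loglik = step(loglik)
        # Relative-change stopping condition from claim 9.
        if abs(new_loglik - loglik) / abs(loglik) < eps:
            return new_loglik, i
        loglik = new_loglik
    # Iteration-count cap N, also from claim 9.
    return loglik, max_iters


# Stub step: each "re-estimation" closes half the gap to -100.
final, iters = train_until_converged(lambda ll: ll + (-100 - ll) / 2, -200.0)
print(final, iters)
```

The two exits correspond to the claim's two conditions: convergence in relative likelihood change, and the hard cap N that guarantees termination even when the likelihood keeps drifting.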
CN2010101605215A 2010-04-23 2010-04-23 Portable interpretation system based on WinCE platform and language recognition method thereof Pending CN102237083A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010101605215A CN102237083A (en) 2010-04-23 2010-04-23 Portable interpretation system based on WinCE platform and language recognition method thereof


Publications (1)

Publication Number Publication Date
CN102237083A true CN102237083A (en) 2011-11-09

Family

ID=44887672


Country Status (1)

Country Link
CN (1) CN102237083A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050131709A1 (en) * 2003-12-15 2005-06-16 International Business Machines Corporation Providing translations encoded within embedded digital information
CN101008942A (en) * 2006-01-25 2007-08-01 北京金远见电脑技术有限公司 Machine translation device and method thereof
CN101329667A (en) * 2008-08-04 2008-12-24 深圳市大正汉语软件有限公司 Intelligent translation apparatus of multi-language voice mutual translation and control method thereof


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SU MU et al.: "A Chinese-English Bidirectional Translation System Based on the Telephone", Proceedings of the 7th National Conference on Man-Machine Speech Communication (NCMMSC7) *
WEI LI: "Research on Embedded Speech Recognition Systems", China Doctoral and Master's Dissertations Full-text Database (Master) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663143A (en) * 2012-05-18 2012-09-12 徐信 System and method for audio and video speech processing and retrieval
CN102789779A (en) * 2012-07-12 2012-11-21 广东外语外贸大学 Speech recognition system and recognition method thereof
CN103811008A (en) * 2012-11-08 2014-05-21 中国移动通信集团上海有限公司 Audio frequency content identification method and device
CN104123934A (en) * 2014-07-23 2014-10-29 泰亿格电子(上海)有限公司 Speech composition recognition method and system
CN104834393A (en) * 2015-06-04 2015-08-12 携程计算机技术(上海)有限公司 Automatic testing device and system
CN107170453A (en) * 2017-05-18 2017-09-15 百度在线网络技术(北京)有限公司 Across languages phonetic transcription methods, equipment and computer-readable recording medium based on artificial intelligence
US10796700B2 (en) 2017-05-18 2020-10-06 Baidu Online Network Technology (Beijing) Co., Ltd. Artificial intelligence-based cross-language speech transcription method and apparatus, device and readable medium using Fbank40 acoustic feature format
CN108460027A (en) * 2018-02-14 2018-08-28 广东外语外贸大学 A kind of spoken language instant translation method and system
CN110765868A (en) * 2019-09-18 2020-02-07 平安科技(深圳)有限公司 Lip reading model generation method, device, equipment and storage medium
CN112329484A (en) * 2020-11-06 2021-02-05 中国联合网络通信集团有限公司 Translation method and device for natural language
CN114398468A (en) * 2021-12-09 2022-04-26 广东外语外贸大学 Multi-language identification method and system

Similar Documents

Publication Publication Date Title
CN102237083A (en) Portable interpretation system based on WinCE platform and language recognition method thereof
CN101944359B (en) Voice recognition method facing specific crowd
CN106228977B (en) Multi-mode fusion song emotion recognition method based on deep learning
CN103928023B (en) A kind of speech assessment method and system
CN103345923B (en) A kind of phrase sound method for distinguishing speek person based on rarefaction representation
CN104200804B (en) Various-information coupling emotion recognition method for human-computer interaction
CN102800316B (en) Optimal codebook design method for voiceprint recognition system based on nerve network
CN101930735B (en) Speech emotion recognition equipment and speech emotion recognition method
CN102509547B (en) Method and system for voiceprint recognition based on vector quantization based
Kumar et al. Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm
Dua et al. GFCC based discriminatively trained noise robust continuous ASR system for Hindi language
CN103065629A (en) Speech recognition system of humanoid robot
CN109192200B (en) Speech recognition method
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
Shanthi et al. Review of feature extraction techniques in automatic speech recognition
CN114566189B (en) Speech emotion recognition method and system based on three-dimensional depth feature fusion
Shanthi Therese et al. Review of feature extraction techniques in automatic speech recognition
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
CN114842878A (en) Speech emotion recognition method based on neural network
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
Barman et al. State of the art review of speech recognition using genetic algorithm
CN104240699A (en) Simple and effective phrase speech recognition method
Hu et al. Speaker Recognition Based on 3DCNN-LSTM.
Thalengala et al. Effect of time-domain windowing on isolated speech recognition system performance
CN113611285B (en) Language identification method based on stacked bidirectional time sequence pooling

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20111109