CN1055254A - Sound-controlled typewriter - Google Patents

Sound-controlled typewriter

Info

Publication number
CN1055254A
CN1055254A (application CN 90101666 / CN90101666A)
Authority
CN
China
Prior art keywords
sound
syllable
chinese
identification
typewriter
Prior art date
Legal status (assumed by Google Patents; not a legal conclusion)
Pending
Application number
CN 90101666
Other languages
Chinese (zh)
Inventor
曹洪 (Cao Hong)
谭政 (Tan Zheng)
潘接林 (Pan Jielin)
Current Assignee (the listed assignee may be inaccurate)
Individual
Original Assignee
Individual
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Individual
Priority to CN 90101666 priority Critical patent/CN1055254A/en
Publication of CN1055254A publication Critical patent/CN1055254A/en

Abstract

This sound-controlled typewriter solves the recognition and composition problems for the complete set of Chinese syllables. It adopts a layered recognition technique: the first level recognizes the initial and final (shengmu and yunmu) of a Chinese syllable, and the second level applies a hidden Markov model (HMM) method based on the length distribution of the syllable's steady segments. It can recognize, in real time, whole Chinese utterances entered word by word or syllable by syllable. The typewriter also has a language-understanding function: it can distinguish Chinese homophones using syntactic and lexical knowledge, so that through recognition and understanding speech can be converted into text and then stored, transmitted, or output (printed or read aloud). The typewriter can be used not only for Chinese-character information processing in office automation but also widely in various control applications. The system consists of an ordinary computer or Chinese-English typewriter plus a dedicated stand-alone hardware unit for speech recognition and synthesis.

Description

Sound-controlled typewriter
The present invention relates to a device that integrates recognition and synthesis of the complete set of Chinese syllables with Chinese speech-to-text conversion, editing, typesetting, and printing. The full-syllable recognition can be realized either as a speaker-dependent or as a speaker-independent recognition system.
Because Chinese is written in ideographic characters, it cannot be typed from a keyboard phoneme by phoneme the way alphabetic scripts can; characters must instead be converted into code sequences by various encoding methods before they can be entered into a computer from a keyboard. Since most Chinese character encoding schemes are hard to learn, hard to remember, and hard to operate for most people, they form the well-known "bottleneck" of Chinese computer input. The most effective and convenient solution to this problem is to enter characters by voice. Owing to technical limitations, however, most Chinese speech recognition systems have been confined to vocabularies of a few hundred to about two thousand words; systems for several thousand words have both low recognition rates and high cost and are hard to put into practice, while the Chinese vocabulary runs to hundreds of thousands of words. Under the prior art, therefore, a sound-controlled typewriter has been difficult to realize. On the other hand, modern Chinese uses nearly 400 base syllables, or about 1,200 tonal syllables once the four tones are distinguished. Entering characters syllable by syllable can satisfy the needs of any text, so syllable input is a good way to solve the Chinese input problem. But Chinese monosyllables are extremely difficult to recognize: many existing monosyllable recognition systems have low recognition rates, most results are non-real-time simulations, and no sound-controlled typewriter has been realized.
The object of the invention is to provide a low-cost, high-performance sound-controlled typewriter implemented on a microcomputer or on an ordinary Chinese-English electronic typewriter. It can recognize and synthesize all Chinese syllables, giving a Chinese text processor (whether a general-purpose microcomputer system or an electronic typewriter) the added functions of voice-controlled character input and reading a text manuscript aloud.
The object of the present invention is achieved like this:
1. Scheme overview
Before recognition, templates for the 460 base syllables or 1,200 tonal syllables are built by training (for example with the DTW or HMM algorithm), which requires the user to read each syllable once. During recognition, the input speech is compared with every stored template and the best-scoring template is taken as the result. In a large-vocabulary speech recognition system, however, comparing against all templates directly is impractical: in our system, 1,200 full-syllable templates would have to be compared, which not only lowers recognition accuracy but also demands more computation than a typical micro- or minicomputer can afford. To solve this problem, our recognition system adopts a two-layer scheme, shown in Fig. 1. The first layer is coarse syllable recognition: the tone, initial, and final of the input syllable are recognized separately, and the 6 highest-scoring initials and 6 highest-scoring finals are combined into 36 candidate syllables (tone recognition keeps a single result). The second layer is fine syllable recognition, which selects only among the 36 candidates. Since a Chinese syllable has a clear consonant-vowel (CV) structure, with the consonant corresponding to the initial and the vowel to the final, and there are only 22 initials and 38 finals, recognizing consonant, vowel, and tone separately is an efficient and reasonable way to realize coarse syllable recognition.
However, once coarse recognition makes an error, the next layer cannot recover it, so coarse recognition must reach very high accuracy. In this system's consonant recognition algorithm, consonants are divided into two broad classes, unvoiced and voiced-head phonemes, and recognized with a VQ algorithm; final recognition uses a multi-section VQ algorithm. These algorithms need very little computation, and the accuracy of coarse syllable recognition reaches 99.7%. The second-level fine recognition uses recognition based on syllable hidden Markov models. This two-level scheme can be used both for speaker-dependent and for speaker-independent recognition.
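As a minimal sketch (not the patent's implementation), the two-layer pruning can be expressed as follows; the score dictionaries stand in for the coarse VQ scores, `fine_score` stands in for the HMM-based fine recognizer, and all names are illustrative:

```python
def two_stage_recognize(initial_scores, final_scores, tone, fine_score, top=6):
    """Coarse layer: keep the `top` best-scoring initials and finals
    (6 x 6 = 36 candidate syllables; the tone is decided outright).
    Fine layer: rescore only those candidates and return the winner."""
    initials = sorted(initial_scores, key=initial_scores.get, reverse=True)[:top]
    finals = sorted(final_scores, key=final_scores.get, reverse=True)[:top]
    candidates = [(i, f, tone) for i in initials for f in finals]
    return max(candidates, key=fine_score)
```

The point of the structure is that the expensive fine scorer runs 36 times instead of 1,200 times.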
2. System profile
The block diagram of the Mandarin full-syllable recognition system is shown in Fig. 2. In the feature extraction stage, the speech is passed through an analog filter with a 100 Hz to 4.2 kHz passband and then A/D converted (sampling rate 10 kHz, quantization 12 bits). The digitized speech is divided into frames of 20 ms (200 samples) with a frame shift of 10 ms (100 samples). Features are extracted frame by frame; those used in this system are: total frame energy e(i); zero-crossing rate z(i); normalized first-order autocorrelation coefficient NR(i); the 12th-order autocorrelation coefficients R(k) and LPC coefficients a(k) (k = 0, 1, ..., 12) computed after Hamming windowing; and the normalized residual energy d(i). Here i is the frame index, determined as follows: the system sets an energy threshold T and checks the total energy e(i) frame by frame. If e(i) > T, speech is judged to have started, and the start frame i = 0 is set several frames before the threshold crossing, so that low-energy syllable onsets are included. After speech has been detected, a second threshold T' is set; if 6 consecutive frames satisfy e(i) < T', the speech is judged to have ended, and the last frame i = CE is the end frame. The working principles of the other parts of the system are described in turn below.
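The framing and energy-based endpoint rules above can be sketched as follows; the thresholds and the number of backtrack frames are illustrative assumptions, not the patent's values:

```python
import numpy as np

def frame_signal(x, frame_len=200, hop=100):
    """Split a signal into overlapping frames (20 ms frames, 10 ms hop at 10 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def endpoints(frames, start_thresh, end_thresh, backtrack=3, quiet_frames=6):
    """Energy-based start/end detection as sketched in the text:
    the start frame is pushed back a few frames to keep weak onsets;
    the utterance ends after `quiet_frames` consecutive low-energy frames."""
    e = (frames.astype(float) ** 2).sum(axis=1)        # per-frame energy e(i)
    above = np.nonzero(e > start_thresh)[0]
    if len(above) == 0:
        return None
    start = max(0, above[0] - backtrack)               # include low-energy onset
    quiet = 0
    end = len(e) - 1
    for i in range(above[0], len(e)):
        quiet = quiet + 1 if e[i] < end_thresh else 0
        if quiet >= quiet_frames:
            end = i - quiet_frames
            break
    return start, end
```

A real front end would of course compute e(i) alongside the other per-frame features listed above.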
3. Silence (S) / Unvoiced (U) / Voiced (V) three-class decision
A Mandarin monosyllable, cut out by the energy threshold above, can generally be divided into successive segments of silence (S), unvoiced sound (U), and voiced sound (V), as shown in Fig. 3, including the silent passages. To determine the true endpoints MB and MS of a syllable and the unvoiced/voiced boundary ME, every frame i = 0, ..., CE must be classified as S, U, or V.
The classification scheme adopted by this system follows reference [1]. For each frame (numbered i) we build a 5-dimensional feature vector X = [e(i), z(i), NR(i), d(i), a(1)]^T, where T denotes transposition. Let S be class 1, U class 2, and V class 3. Under each of the three classes, X can be treated as an approximately normally distributed random vector; from the training speech of many speakers we obtain their mean vectors M_k = E[X] and covariance matrices D_k = E[(X - M_k)(X - M_k)^T], k = 1, 2, 3. For any input frame X, the likelihood distance to each of the three classes is

d(k) = (X - M_k)^T D_k^(-1) (X - M_k), k = 1, 2, 3    formula (1)

If d(l) = min over k of d(k), frame i is judged to belong to class l.
In order to reject sporadic misjudgments, the system also applies the following smoothing and correction rules:
(1) If "U" appears at the tail of a syllable, it is changed to "S".
(2) If a segment of "S" shorter than 5 frames appears between two adjacent "U" segments, it is changed to "U".
(3) If a segment of "U" shorter than 5 frames appears between two adjacent "S" segments, it is changed to "S".
Because the template parameters of this algorithm are obtained from the training speech of many speakers, it is a speaker-independent algorithm; practice shows that it achieves very high classification accuracy.
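A compact sketch of formula (1) and the three smoothing rules, under the assumption of precomputed class means and inverse covariances (all names illustrative):

```python
import numpy as np

def classify_frames(X, means, inv_covs):
    """Label each frame S/U/V by the smallest likelihood distance
    d(k) = (x - M_k)^T D_k^{-1} (x - M_k), as in formula (1)."""
    labels = []
    for x in X:
        d = [(x - m) @ ic @ (x - m) for m, ic in zip(means, inv_covs)]
        labels.append("SUV"[int(np.argmin(d))])
    return labels

def smooth(labels, min_run=5):
    """The text's correction rules: a trailing 'U' run becomes 'S';
    a short 'S' run (< min_run frames) between two 'U' runs becomes 'U';
    a short 'U' run between two 'S' runs becomes 'S'."""
    lab = list(labels)
    i = len(lab) - 1
    while i >= 0 and lab[i] == "U":      # rule (1): tail 'U' -> 'S'
        lab[i] = "S"
        i -= 1
    runs = []                            # collect runs as [char, start, end)
    for j, c in enumerate(lab):
        if runs and runs[-1][0] == c:
            runs[-1][2] = j + 1
        else:
            runs.append([c, j, j + 1])
    for k in range(1, len(runs) - 1):    # rules (2)-(3): flip short sandwiched runs
        c, a, b = runs[k]
        left, right = runs[k - 1][0], runs[k + 1][0]
        if b - a < min_run and c in "SU" and left == right and left in "SU" and left != c:
            lab[a:b] = [left] * (b - a)
    return "".join(lab)
```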
4. Initial recognition and initial/final segmentation
Before codebooks for initials and finals can be built and used for recognition, the initial and final of a syllable must be separated. This is a rather difficult problem, because in many cases the boundary between the two is far from clear. To solve it, we divide the initials into two classes. The first, called the "unvoiced class", comprises the initials:
{p, t, k, h, j, q, x, z, c, s, ch, sh, g, zh, f}
The second, called the "voiced-head phoneme class", comprises:
{a, o, e, i, u, v, m, n, l, r, b, d, g, zh, f}
The distinguishing feature of the first class is that the initial part of the syllable corresponds clearly to its unvoiced segment, whose length is in most cases greater than 60 ms (6 frames). The second class is characterized by strong coarticulation between initial and final, which makes their boundary hard to determine; at the same time its unvoiced segment is very short, generally under 40 ms (4 frames). The second class also includes the vowel phonemes a, o, e, i, u, v, which in zero-initial syllables stand at the start of the syllable as the head of the final. Three common phonemes, g, zh, and f, appear in both classes, because their characteristics vary greatly and assigning each to only one class would often cause errors.
Based on this analysis, we adopt the following segmentation scheme at recognition time. For each test syllable, the syllable start point MB (the unvoiced start) and the unvoiced/voiced boundary ME (the voiced start) are first determined by the principles of Section 3. If ME - MB > 6, i.e. the unvoiced segment of this syllable exceeds 6 frames (60 ms), the initial is definitely judged to belong to the unvoiced class, and only that class is searched for the best match. If ME - MB < 4, i.e. the unvoiced segment is shorter than 4 frames (40 ms), the initial is judged to belong to the voiced-head phoneme class, and only that class is searched. If 4 <= ME - MB <= 6, no definite decision can be made, and both initial classes must be searched during recognition.
With this classification of initials, training and recognition algorithms suited to each class's characteristics can be set up.
(1) Training and recognition for unvoiced-class initials. For these initials the unvoiced segment (frames MB to ME-1) is taken as the initial segment. For each initial, the frames of the initial segments of the various syllables formed with that initial are pooled to train one VQ codebook. Codebooks are built with the LBG algorithm [2]; the feature vector is the 12 LPC coefficients (Hamming window; autocorrelation method, Durbin recursion), the distance between codewords is the Itakura measure, and the centroid of each cluster is obtained by averaging the normalized autocorrelation coefficients of the training frames in the cluster. Each codebook contains at most 10 codewords. At recognition time, the initial segment of the input test syllable is coded with each initial's VQ codebook, the initials are ranked by average coding distortion, and the best six are kept as candidates (the average coding distortion is the sum of the per-frame coding distortions of the initial segment divided by the number of frames in the segment).
(2) Training and recognition for voiced-head-class initials, in two cases:
[a] For m, n, l, r, the first 6 frames (MB to MB+5, a 60 ms interval) are taken as the initial segment. (The initial length of these sounds varies greatly and is influenced by the following final; analysis of a large amount of speech data shows their average initial length is about 60 ms, hence this choice.) Training and recognition are the same as in (1).
[b] For the other phonemes of the class, the first 3 frames (MB to MB+2) are taken as the initial segment (the initial segments of these sounds are confined to the first three frames). Training and recognition are the same as in (1).
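The codebook training and candidate ranking above can be sketched as follows. This is a plain LBG split-and-refine loop with squared-error distortion standing in for the Itakura measure (which needs LPC-specific quantities); codebook size and data are illustrative:

```python
import numpy as np

def lbg_codebook(vectors, size=8, n_iter=10, eps=0.01):
    """LBG codebook training: start from the global centroid, repeatedly
    split every codeword into a (1+eps)/(1-eps) pair, then refine by
    k-means iterations until the requested codebook size is reached."""
    book = vectors.mean(axis=0, keepdims=True)
    while len(book) < size:
        book = np.concatenate([book * (1 + eps), book * (1 - eps)])
        for _ in range(n_iter):
            d = ((vectors[:, None, :] - book[None]) ** 2).sum(-1)
            idx = d.argmin(axis=1)
            for c in range(len(book)):
                if (idx == c).any():
                    book[c] = vectors[idx == c].mean(axis=0)
    return book

def avg_distortion(segment, book):
    """Average coding distortion: per-frame minimum distortion to the
    codebook, summed over the segment and divided by its frame count."""
    d = ((segment[:, None, :] - book[None]) ** 2).sum(-1)
    return d.min(axis=1).mean()
```

Ranking initials then just sorts the per-initial codebooks by `avg_distortion` of the test syllable's initial segment and keeps the first six.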
5. Final recognition
Before final recognition the endpoints of the final within the syllable must be determined. In view of the characteristics of the different initial classes discussed in the previous section, the start of the final segment is fixed as follows:
[a] If in the test syllable ME - MB >= 4 (the unvoiced segment is at least 4 frames, in other words at least 40 ms), the final segment is judged to start at ME.
[b] If in the test syllable ME - MB <= 3 (the unvoiced segment is at most 3 frames, in other words at most 30 ms), the final segment is judged to start at MB+5.
The basis of this judgment is as follows. If ME - MB <= 3, then, as a large amount of experimental speech data shows, the initial of the test syllable certainly belongs to the voiced-head phoneme class. In that case (the zero-initial situation aside), cutting off the first 5 frames of the syllable and starting the final at the 6th frame essentially removes the initial's coarticulation with the final without losing the head of the final. In the zero-initial case, the voiced-head phoneme table contains all the possible head vowels a, i, u, o, e, v, so even if the head of the final segment is cut off, it can be recovered by the initial-recognition stage. Conversely, if ME - MB >= 4, the final segment coincides well with the voiced segment of the syllable, so the voiced start ME can be taken directly as the start of the final segment.
In all cases the syllable endpoint MS is taken as the end of the final segment.
Final recognition uses a 3-section VQ recognizer. During training, the final segment of each training syllable of the same final is divided into three sections, and a VQ codebook is built for each section; each codebook has at most 10 codewords, and the feature vectors are the same as in the VQ training above. At recognition time, the final segment of the test syllable is divided into three sections and coded with each final's three sectional codebooks; the finals are ranked by ascending distortion and the best six are kept.
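A hedged sketch of the 3-section VQ recognizer, again with squared-error distortion standing in for the distortion measure and single-codeword codebooks as toy data:

```python
import numpy as np

def section_vq_distortion(segment, books):
    """Multi-section VQ for final recognition: split the final segment
    into as many parts as there are sectional codebooks, code each part
    with its own codebook, and sum the per-section average distortions."""
    total = 0.0
    for part, book in zip(np.array_split(segment, len(books)), books):
        d = ((part[:, None, :] - book[None]) ** 2).sum(-1)
        total += d.min(axis=1).mean()
    return total

def rank_finals(segment, final_books, top=6):
    """Rank candidate finals by ascending distortion, keeping the best 6."""
    return sorted(final_books,
                  key=lambda f: section_vq_distortion(segment, final_books[f]))[:top]
```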
6. Speech recognition based on hidden Markov models
6.1 Definitions
A hidden Markov process is a stochastic process composed of two mechanisms: one is a hidden Markov chain with finitely many states; the other is a set of random observation functions, one attached to each state. Suppose the human vocal tract has a finite number of articulatory configurations and put them in correspondence with the states, with the speech signal as the observation emitted by a given vocal-tract configuration; such a sound can then be described by a hidden Markov process. The HMM parameter estimation method yields the parameters of a pronunciation process, and this parameter set is the HMM model. The model applied to speech recognition here is a discrete, linear hidden Markov model, defined as follows.
Let {ξ(n) | n = 0, 1, ...} be a discrete stochastic process with state space S = {1, 2, ..., N} satisfying

P{ξ(n+1) | ξ(0) = s_0, ..., ξ(n) = s_n} = P{ξ(n+1) | ξ(n)}    formula (2)

Let

a_ij(n) = P{ξ(n+1) = j | ξ(n) = i}    formula (3)

a_ij(n) is the one-step transition probability of ξ(n); it has the properties

a_ij(n) >= 0    formula (4)
Σ_j a_ij(n) = 1    formula (5)

If a_ij(n) is independent of n, the chain is called a homogeneous Markov model. In speech recognition the state is hidden; the observed speech features are described by a random function F(X) that depends on the state, and the form of the observation functions is tied to the concrete speech features.
In the HMM speech recognition method, to obtain the HMM model of a pronunciation, L samples of the same pronunciation are first VQ-coded, producing L VQ codeword sequences. These L sequences are regarded as generated by the same HMM, and the training criterion for constructing the model is that it maximize the probability of producing the L codeword sequences; when L is sufficiently large, the model has recorded the prior probability with which this pronunciation produces codeword strings. At recognition time, for the codeword sequence of an input pronunciation, the posterior probability that each trained model produced the sequence is computed, and by the maximum a posteriori criterion the model most likely to have produced it is identified, giving the recognition result. For the VQ/HMM method, then, the observation functions are in fact a set of discrete observation probabilities. If the label set of the VQ codebook is TN = {m | m = 0, 1, ..., M-1} and Q = (Q_1, Q_2, ..., Q_T) is the VQ codeword sequence of the speech, Q_t ∈ TN (t = 1, 2, ..., T), then the observation probabilities are defined, for all i ∈ S, j ∈ TN, as b_i(j) = P{Q_t = j | ξ(t) = i}. For a homogeneous HMM, b depends only on the state and the observed codeword; under the homogeneity assumption the model parameters can be estimated by a re-estimation algorithm, and the recognition algorithm is also very simple.
6.2 An HMM method for Chinese monosyllables
In classical HMM training the biggest problem is that the re-estimation algorithm (Baum-Welch) needs many iterations; every recomputation of the model parameters passes over all the training data again, so the amount of computation is unusually large. Moreover, because the states are hidden and have no explicit physical meaning, the best model can only be produced through repeated correction of the parameters. If we could determine the optimal state segmentation of the model once, the model parameters could be estimated in one pass, without repeated re-estimation, and both the computation and the data throughput would drop sharply. In fact, the states of an HMM can be put in correspondence with phonemic structure. A brand-new HMM training method built on this idea is described in detail below.
In a word-based HMM, the model states {q_0, ..., q_N} of a known word are given; a state q_i does not correspond to a single speech feature symbol, however, but to a set S_i = {V_i1, V_i2, ...} containing several feature symbols, and the symbol sets S_i, S_j of different states q_i, q_j have a non-empty intersection, S_i ∩ S_j ≠ ∅ (i ≠ j). These characteristics make the HMM particularly suitable for speaker-independent recognition. Another characteristic of the HMM is the randomness of its state transitions, which lets it automatically match the same word spoken at different speeds (acting much like DTW).
The states of an HMM can be defined (or interpreted) in various ways, but a good state definition should both reflect the variability of speech (each state corresponding to several symbols) and keep the intersections of different states' symbol sets (S_i ∩ S_j, i ≠ j) as small as possible; otherwise larger recognition errors are inevitable.
If the speech feature symbols are extracted frame by frame in time, then a natural interpretation of an HMM state is that it corresponds to a phoneme steady segment or to a transition segment between phonemes. This viewpoint also makes it easy to explain why the left-to-right state transition model of Fig. 4 adapts well to speech: the phonemes of a word are produced from left to right (in the direction of increasing time), and skip transitions between states reflect the "swallowed sound" phenomenon that may occur. In Chinese words, "swallowed sounds" rarely appear, so skip transitions can be ignored, and the HMM speech recognition model for Chinese should be described by the state transition diagram of Fig. 5.
If we view the state transition model of Fig. 5 from the angle of state dwell length, we obtain the state chain shown in Fig. 6, in which the dwell length of state q_i is a random variable with distribution P_i(m). If P_i(0) ≠ 0, state q_i may be skipped over; since for Chinese speech the "swallowed sound" phenomenon can be taken not to occur, we can always assume P_i(0) = 0. If (k-1) occurrences of state q_i have appeared by time t, the transition probabilities at time t+1 are

a_ij(k) = P_i(k <= m_i) / P_i(k <= m_i + 1),    j = i
a_ij(k) = P_i(k = m_i + 1) / P_i(k <= m_i + 1),    j = i + 1
a_ij(k) = 0,    other j;  i = 1, 2, ..., N    formula (6)

Because the a_ij(k) depend on k, the HMM based on the state length distribution of Chinese is a non-stationary Markov model. If the dwell-length distributions P_i(m_i) reflect the characteristics of the speech well, using this model for speech recognition greatly simplifies training. The complexity of the present Baum-Welch training algorithm for HMM speech recognition is nothing short of catastrophic for a large-vocabulary system, whereas the new model only needs the training word to be segmented to obtain the segment lengths m_1, ..., m_N; the relative frequency of m_i serves as an approximation of P_i(m_i), the state transition probabilities a_ij(k) then follow from formula (6), and each entry of the observation matrix {b_i(Q_k)} is obtained as the frequency with which the observation symbol Q_k appears in segment i.
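In duration terms, with d frames already spent in a state the chain stays with probability P(m >= d+1)/P(m >= d) and advances with P(m = d)/P(m >= d), which is formula (6) read with k = d + 1. A small sketch with an illustrative duration pmf:

```python
import numpy as np

def transition_probs(dur_pmf, d):
    """Stay/advance probabilities after d frames in a state, from the
    state-duration pmf (dur_pmf[m] = P(duration == m)); the two
    probabilities always sum to 1 while the duration is not exhausted."""
    p = np.asarray(dur_pmf, dtype=float)
    tail = p[d:].sum()                    # P(m >= d)
    if tail == 0.0:
        return 0.0, 1.0                   # duration exhausted: must advance
    stay = p[d + 1:].sum() / tail         # P(m >= d+1) / P(m >= d)
    advance = p[d] / tail                 # P(m == d)  / P(m >= d)
    return stay, advance
```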
6.3 HMM parameter estimation based on the state length distribution
The purpose of parameter estimation is to obtain the segment-length distribution P_i(m) of a speech unit or word and the observation matrix {b_i(Q_k)}; the main computational cost lies in the segmentation. Suppose a given word has L training sequences Q^l (l = 1, 2, ..., L), Q^l = (Q^l_1, Q^l_2, ..., Q^l_Tl), where T_l is the length of the l-th training sequence and each Q^l_t takes values, after vector quantization, in a feature space V (the VQ codebook). The optimal segmentation of Q^l is defined as the division that minimizes a within-segment distortion criterion; formulas (7)-(10), which define this criterion and the center of gravity of the i-th segment of sequence Q^l, appear only as images in the source and are not reproduced here. Minimizing (7) is then equivalent to solving a constrained optimization problem (11) over the set

Ω = {(x_1, ..., x_N) | 0 <= x_1 <= x_2 <= ... <= x_N = T_l}    formula (12)

For multi-extremal functionals of this type, whether a search algorithm converges to the minimum depends to a great extent on the choice of initial value. Here, because prior knowledge about x_1, ..., x_N is easy to obtain, the minimum point of (11) can be found; the resulting expression (13) likewise survives only as an image. This yields the segment-length estimates; when L is sufficiently large, by the law of large numbers,

P_i(m_i) = (number of the L training sequences whose i-th segment has length m_i) / L,
i = 1, 2, ..., N    formula (14)

and the probability of observing symbol Q_k in state q_i is

b_i(Q_k) = (total frames in the L training sequences at which Q_k appears in state q_i) / (total frames of state q_i in the L sequences),
i = 1, 2, ..., N;  k = 1, 2, ..., M    formula (15)

where M is the size of the VQ codebook V. Carrying out the above training steps for each word in the recognition vocabulary yields all the needed HMM parameters.
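Formulas (14) and (15) amount to simple counting once each training sequence has been segmented. A sketch under the assumption that the optimal segmentations are already given (the segmentation search itself is not shown):

```python
import numpy as np

def estimate_params(segmentations, sequences, n_states, n_symbols):
    """Estimate the duration pmf P_i(m) (formula (14)) and the discrete
    observation probabilities b_i(q) (formula (15)) by counting over
    segmented training sequences.  segmentations[l] lists the exclusive
    end frame of each state segment in sequence l."""
    max_len = max(len(q) for q in sequences)
    dur = np.zeros((n_states, max_len + 1))
    obs = np.zeros((n_states, n_symbols))
    for cuts, q in zip(segmentations, sequences):
        start = 0
        for i, end in enumerate(cuts):
            dur[i, end - start] += 1          # one duration sample per segment
            for sym in q[start:end]:
                obs[i, sym] += 1              # symbol count within state i
            start = end
    dur /= dur.sum(axis=1, keepdims=True)     # normalize to P_i(m)
    obs /= obs.sum(axis=1, keepdims=True)     # normalize to b_i(q)
    return dur, obs
```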
6.4 Recognition algorithm
Suppose that through training we have obtained the VQ codebook of the speech feature vectors, each state's segment-length distribution P_i(m) (i = 1, 2, ..., N), and the observation matrices {b_i(Q_k)} (i = 1, 2, ..., N; k = 1, 2, ..., M), and that the recognition vocabulary contains V words, W = {W_1, W_2, ..., W_V}. Let the feature-vector observation sequence of the utterance to be recognized be Q = (Q_1, ..., Q_T), Q_t ∈ V (t = 1, ..., T). First the optimal state segmentation S = (x_1, ..., x_N) of Q is found subject to (12). The likelihood is

P(Q/W_i) = P(Q, S/W_i) = P(S/W_i) P(Q/W_i, S)    formula (16)

The first equality in the formula holds because for a given Q the segmentation S is completely determined. The W_i that maximizes the likelihood defined by (16) is taken as the recognition result (formula (17), given only as an image in the source). Because P(Q/W_i ∩ S) is far more sensitive to the segmentation S than P(S/W_i), one can make the simplifying assumption of formula (18), likewise given only as an image.
This is a maximum-path problem in dynamic programming and can be solved with the Viterbi algorithm. The steps are:
(1) δ^(i)_1(1) = log b^(i)_1(Q_1)
(2) δ^(i)_t(j) = max[δ^(i)_(t-1)(j-1), δ^(i)_(t-1)(j)] + log b^(i)_j(Q_t),
j = 1, 2, ..., N;  t = 1, 2, ..., T    formula (19)
δ^(i)_T(N) = max log P(Q/W_i ∩ S)    formula (20)
(3) The optimal segmentation is obtained from steps (1) and (2) by recording and backtracking the decision points.
Compared with the classical HMM recognizer, one addition per step is saved in step (2), but the new recognizer must record the segmentation points during the computation and carry out step (3); its total operation count is essentially the same as that of the classical HMM Viterbi recognizer.
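Steps (1) and (2) of the recognizer can be sketched as follows (score only; the backtracking of step (3) is omitted, and `log_b[t][j]` stands for log b_j(Q_t)):

```python
import numpy as np

def viterbi_score(log_b, n_states):
    """Log-likelihood of the best left-to-right state path:
    delta_t(j) = max(delta_{t-1}(j-1), delta_{t-1}(j)) + log b_j(Q_t),
    starting in state 1 and ending in state N, as in formula (19)."""
    T = len(log_b)
    delta = np.full((T, n_states), -np.inf)
    delta[0, 0] = log_b[0][0]              # step (1): path starts in state 1
    for t in range(1, T):
        for j in range(n_states):
            best = delta[t - 1, j]         # stay in state j
            if j > 0:
                best = max(best, delta[t - 1, j - 1])  # or advance from j-1
            delta[t, j] = best + log_b[t][j]
    return delta[T - 1, n_states - 1]      # must end in the last state
```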
6.5 An adaptive HMM method
In the new HMM model, as can be seen, the model parameters P_i(m) and {b_i(j)} are all statistically averaged estimates, and such parameters are very easy to adapt. We therefore give a unified description of the adaptation of P_i(m) and {b_i(j)}.
Let Avg1 = SUM1/T1 and Avg2 = SUM2/T2 be the averages of two successive accumulation passes, and let Avg be the overall mean. Obviously the following hold:

Avg = (SUM1 + SUM2)/(T1 + T2) = Avg1·T1/(T1 + T2) + Avg2·T2/(T1 + T2)    formula (23)
Avg = (1 - fa)·Avg1 + fa·Avg2    formula (24)
fa = T2/(T1 + T2)    formula (25)

The overall mean can thus be estimated simply from the ratio of the amounts of training data in the two passes; fa is called the adaptation coefficient. The adaptation of P_i(m) and b_i(j) therefore takes an extremely simple form.
Adapting the VQ codebooks and the HMM parameters separately in this way in the experiments above yields a practical adaptive VQ/HMM speech recognition system.
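Formulas (23)-(25) reduce to a one-line pooled mean:

```python
def adapt(avg1, t1, avg2, t2):
    """Pooled mean via the adaptation coefficient fa = T2/(T1+T2):
    Avg = (1 - fa)*Avg1 + fa*Avg2, formulas (23)-(25)."""
    fa = t2 / (t1 + t2)
    return (1 - fa) * avg1 + fa * avg2
```

Applied elementwise to P_i(m) and b_i(j), this blends the old model (T1 frames of evidence) with a new speaker's data (T2 frames) without retraining from scratch.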
6.6 Conclusion
A brand-new recognition method for the full set of Chinese syllables has been proposed above; it fully exploits the characteristics of Chinese monosyllables. First, a layered initial/final VQ algorithm solves the full-syllable isolated-word recognition problem for Chinese. Second, an HMM model suited to the characteristics of Chinese syllables has been developed. Both parts are capable of speaker-independent recognition; the recognition rate for Chinese reaches 98%, the computational load of the algorithms is small, and speech recognition can be realized at low cost on a microcomputer, so the method has practical value.
Because the present invention can recognize all Chinese monosyllables, and the recognition method requires very little computation, a sound-controlled typewriter can be built on a microcomputer system, so that anyone can perform Chinese character information processing by operating the computer with speech, greatly advancing the modernization of computer-based Chinese character information processing.
The object of the invention is achieved as follows: a sound card is added to a microcomputer or to a Chinese-English electronic typewriter. The sound card recognizes all Chinese monosyllables and converts the speech into corresponding characters, which are sent to the host and stored; stored Chinese characters can in turn be converted back into speech output on the sound card, or output on the Chinese-English typewriter (or sent by the microcomputer to a printer). This is a speaker-independent method of full-syllable Chinese speech recognition; for a specific speaker its performance is even better.
The concrete structure of invention is provided by following examples and accompanying drawing thereof.
Fig. 7 is the sound-controlled typewriter block diagram of realizing on IBM-PC series microcomputer according to above-mentioned audio recognition method.
This sound-controlled typewriter performs all speech recognition with the speech recognition card (3) and synthesizes all speech with the speech synthesis card (4). When the user speaks into the microphone, the recognition card (3) converts the speech into Chinese character internal codes in the computer; all homophones corresponding to the speech can be displayed (6), and the correct character is stored either by the user's key selection or automatically by the microcomputer from phrase collocations in context. A finished manuscript can be typeset and printed by the computer (7), or read aloud through the synthesis card (4) so the user can proofread the text. When entering Chinese characters, the user may input speech either by phrases or syllable by syllable.
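The homophone-selection step described above can be sketched as follows. The homophone and phrase tables here are illustrative stand-ins, not the patent's actual data, and the fallback to the first candidate stands in for the user's key selection.

```python
# Hypothetical homophone table: one recognized pinyin syllable maps to
# several candidate characters (illustrative entries only).
HOMOPHONES = {"yin1": ["音", "因", "阴"], "yue4": ["月", "乐", "越"]}

# Known two-character phrase collocations (illustrative entries only).
PHRASES = {("音", "乐"), ("因", "果")}

def choose_by_context(prev_char, syllable):
    """Return the homophone that forms a known phrase with prev_char;
    otherwise fall back to the first candidate, standing in for the
    user's manual key selection."""
    candidates = HOMOPHONES[syllable]
    for ch in candidates:
        if (prev_char, ch) in PHRASES:
            return ch
    return candidates[0]
```

With these tables, the syllable "yue4" after "音" resolves to "乐" (forming the phrase 音乐), while with no matching collocation the first candidate is offered for manual confirmation.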

Claims (6)

1. A sound-controlled typewriter system: a device that realizes a sound-controlled typewriter by adding a speech recognition card (3) and a synthesis card (4) to an ordinary microcomputer or an ordinary Chinese-English typewriter. Speech recognition uses initial/final recognition as the first level and an HMM model based on the segment-length distribution of syllable steady sections as the second level; it recognizes speech in units of words or syllables and can synthesize syllables and phrases. The recognition method is speaker-independent, with still better performance for a specific speaker. The system also has a speech understanding function and can distinguish homophones through morphological and syntactic knowledge.
2. The device according to claim 1, characterized in that it can recognize all Chinese speech and can also synthesize all Chinese speech.
3. According to claims 1 and 2, the sound-controlled typewriter is a speaker-independent system.
4. According to claims 1, 2 and 3, the sound-controlled typewriter can recognize not only single syllables but also phrase pronunciations.
5. According to claims 1, 2 and 3, the sound-controlled typewriter has a language understanding function and can distinguish homophones according to morphological and syntactic knowledge.
6. According to claims 1, 2 and 3, the sound-controlled typewriter can also serve as the host of a control system, letting a person control machines by voice.
CN 90101666 1990-03-28 1990-03-28 Sound-controlled typewriter Pending CN1055254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 90101666 CN1055254A (en) 1990-03-28 1990-03-28 Sound-controlled typewriter

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 90101666 CN1055254A (en) 1990-03-28 1990-03-28 Sound-controlled typewriter

Publications (1)

Publication Number Publication Date
CN1055254A true CN1055254A (en) 1991-10-09

Family

ID=4877183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 90101666 Pending CN1055254A (en) 1990-03-28 1990-03-28 Sound-controlled typewriter

Country Status (1)

Country Link
CN (1) CN1055254A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100359507C (en) * 2002-06-28 2008-01-02 三星电子株式会社 Apparatus and method for executing probability calculating of observation
CN105679332A (en) * 2016-03-09 2016-06-15 四川大学 Cleft palate speech initial and final automatic segmentation method and system
CN105679332B (en) * 2016-03-09 2019-06-11 四川大学 A kind of cleft palate speech sound mother automatic segmentation method and system

Similar Documents

Publication Publication Date Title
CN1112669C (en) Method and system for speech recognition using continuous density hidden Markov models
CN1277248C (en) System and method for recognizing a tonal language
JP2003036093A (en) Speech input retrieval system
Bahl et al. Automatic phonetic baseform determination
US20100100379A1 (en) Voice recognition correlation rule learning system, voice recognition correlation rule learning program, and voice recognition correlation rule learning method
Chou et al. A minimum error rate pattern recognition approach to speech recognition
Egorova et al. Out-of-vocabulary word recovery using fst-based subword unit clustering in a hybrid asr system
Jiang et al. The ustc system for Blizzard Challenge 2010
CN1157711C (en) Adaptation of a speech recognizer for dialectal and linguistic domain variations
Hadwan et al. An End-to-End Transformer-Based Automatic Speech Recognition for Qur'an Reciters.
JP6718787B2 (en) Japanese speech recognition model learning device and program
Granell et al. Multimodal output combination for transcribing historical handwritten documents
Tjalve et al. Pronunciation variation modelling using accent features
CN111429886B (en) Voice recognition method and system
Liu et al. Pronunciation modeling for spontaneous Mandarin speech recognition
CN1055254A (en) Sound-controlled typewriter
Xiao et al. Information retrieval methods for automatic speech recognition
Chalamandaris et al. Rule-based grapheme-to-phoneme method for the Greek
EP3718107B1 (en) Speech signal processing and evaluation
Fosler-Lussier A tutorial on pronunciation modeling for large vocabulary speech recognition
Svendsen Pronunciation modeling for speech technology
Flemotomos et al. Role annotated speech recognition for conversational interactions
Huang et al. Phone set generation based on acoustic and contextual analysis for multilingual speech recognition
Müller Multilingual Modulation by Neural Language Codes
Hwang et al. Porting decipher from English to Mandarin

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C01 Deemed withdrawal of patent application (patent law 1993)
WD01 Invention patent application deemed withdrawn after publication
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication