CN1055254A - Sound-controlled typewriter - Google Patents

Sound-controlled typewriter

Info

Publication number
CN1055254A
CN1055254A (application CN 90101666 / CN90101666A)
Authority
CN
China
Prior art keywords
sound
syllable
chinese
identification
typewriter
Prior art date
Legal status (assumed by Google Patents; not a legal conclusion)
Pending
Application number
CN 90101666
Other languages
Chinese (zh)
Inventor
曹洪 (Cao Hong)
谭政 (Tan Zheng)
潘接林 (Pan Jielin)
Current Assignee (the listed assignee may be inaccurate)
Individual
Original Assignee
Individual
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Individual
Priority to CN 90101666 priority Critical patent/CN1055254A/en
Publication of CN1055254A publication Critical patent/CN1055254A/en

Abstract

This sound-controlled typewriter solves the recognition and composition problems for the complete set of Chinese syllables. It adopts a layered recognition technique: the first level recognizes the initial and final (shengmu and yunmu) of a Chinese syllable, and the second level applies a hidden Markov model (HMM) method based on the length distribution of the syllable's steady segments. It can recognize, in real time, whole Chinese utterances entered word by word or syllable by syllable. The typewriter also has a language-understanding function: it can distinguish Chinese homophones using syntactic and lexical knowledge, so that through recognition and understanding speech can be converted into text and then stored, transmitted, or output (printed or read aloud). The typewriter can be used not only for Chinese-character information processing in office automation but also widely in various control applications. The system consists of an ordinary computer or Chinese-English typewriter plus a dedicated stand-alone hardware unit for speech recognition and synthesis.

Description

Sound-controlled typewriter
The present invention relates to a device that integrates recognition and synthesis of the complete set of Chinese syllables with Chinese speech-to-text conversion, editing, typesetting, and printing. The full-syllable recognition can be realized either as a speaker-dependent or as a speaker-independent recognition system.
Because Chinese is written in ideographic characters, it cannot be typed from a keyboard phoneme by phoneme the way alphabetic scripts can; characters must instead be converted into code sequences by various encoding methods before they can be entered into a computer from a keyboard. Since most Chinese character encoding schemes are hard to learn, hard to remember, and hard to operate for most people, they form the well-known "bottleneck" of Chinese computer input. The most effective and convenient solution to this problem is to enter characters by voice. Owing to technical limitations, however, most Chinese speech recognition systems have been confined to vocabularies of a few hundred to about two thousand words; systems for several thousand words have both low recognition rates and high cost and are hard to put into practice, while the Chinese vocabulary runs to hundreds of thousands of words. Under the prior art, therefore, a sound-controlled typewriter has been difficult to realize. On the other hand, modern Chinese uses nearly 400 base syllables, or about 1,200 tonal syllables once the four tones are distinguished. Entering characters syllable by syllable can satisfy the needs of any text, so syllable input is a good way to solve the Chinese input problem. But Chinese monosyllables are extremely difficult to recognize: many existing monosyllable recognition systems have low recognition rates, most results are non-real-time simulations, and no sound-controlled typewriter has been realized.
The object of the invention is to provide a low-cost, high-performance sound-controlled typewriter implemented on a microcomputer or on an ordinary Chinese-English electronic typewriter. It can recognize and synthesize all Chinese syllables, giving a Chinese text processor (whether a general-purpose microcomputer system or an electronic typewriter) the added functions of voice-controlled character input and reading a text manuscript aloud.
The object of the present invention is achieved like this:
1. Scheme overview
Before recognition, templates for the 460 base syllables or 1,200 tonal syllables are built by training (for example with the DTW or HMM algorithm), which requires the user to read each syllable once. During recognition, the input speech is compared with every stored template and the best-scoring template is taken as the result. In a large-vocabulary speech recognition system, however, comparing against all templates directly is impractical: in our system, 1,200 full-syllable templates would have to be compared, which not only lowers recognition accuracy but also demands more computation than a typical micro- or minicomputer can afford. To solve this problem, our recognition system adopts a two-layer scheme, shown in Fig. 1. The first layer is coarse syllable recognition: the tone, initial, and final of the input syllable are recognized separately, and the 6 highest-scoring initials and 6 highest-scoring finals are combined into 36 candidate syllables (tone recognition keeps a single result). The second layer is fine syllable recognition, which selects only among the 36 candidates. Since a Chinese syllable has a clear consonant-vowel (CV) structure, with the consonant corresponding to the initial and the vowel to the final, and there are only 22 initials and 38 finals, recognizing consonant, vowel, and tone separately is an efficient and reasonable way to realize coarse syllable recognition.
However, once coarse recognition makes an error, the next layer cannot recover it, so coarse recognition must reach very high accuracy. In this system's consonant recognition algorithm, consonants are divided into two broad classes, unvoiced and voiced-head phonemes, and recognized with a VQ algorithm; final recognition uses a multi-section VQ algorithm. These algorithms need very little computation, and the accuracy of coarse syllable recognition reaches 99.7%. The second-level fine recognition uses recognition based on syllable hidden Markov models. This two-level scheme can be used both for speaker-dependent and for speaker-independent recognition.
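As a minimal sketch (not the patent's implementation), the two-layer pruning can be expressed as follows; the score dictionaries stand in for the coarse VQ scores, `fine_score` stands in for the HMM-based fine recognizer, and all names are illustrative:

```python
def two_stage_recognize(initial_scores, final_scores, tone, fine_score, top=6):
    """Coarse layer: keep the `top` best-scoring initials and finals
    (6 x 6 = 36 candidate syllables; the tone is decided outright).
    Fine layer: rescore only those candidates and return the winner."""
    initials = sorted(initial_scores, key=initial_scores.get, reverse=True)[:top]
    finals = sorted(final_scores, key=final_scores.get, reverse=True)[:top]
    candidates = [(i, f, tone) for i in initials for f in finals]
    return max(candidates, key=fine_score)
```

The point of the structure is that the expensive fine scorer runs 36 times instead of 1,200 times.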
2. System profile
The block diagram of the Mandarin full-syllable recognition system is shown in Fig. 2. In the feature extraction stage, the speech is passed through an analog filter with a 100 Hz to 4.2 kHz passband and then A/D converted (sampling rate 10 kHz, quantization 12 bits). The digitized speech is divided into frames of 20 ms (200 samples) with a frame shift of 10 ms (100 samples). Features are extracted frame by frame; those used in this system are: total frame energy e(i); zero-crossing rate z(i); normalized first-order autocorrelation coefficient NR(i); the 12th-order autocorrelation coefficients R(k) and LPC coefficients a(k) (k = 0, 1, ..., 12) computed after Hamming windowing; and the normalized residual energy d(i). Here i is the frame index, determined as follows: the system sets an energy threshold T and checks the total energy e(i) frame by frame. If e(i) > T, speech is judged to have started, and the start frame i = 0 is set several frames before the threshold crossing, so that low-energy syllable onsets are included. After speech has been detected, a second threshold T' is set; if 6 consecutive frames satisfy e(i) < T', the speech is judged to have ended, and the last frame i = CE is the end frame. The working principles of the other parts of the system are described in turn below.
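The framing and energy-based endpoint rules above can be sketched as follows; the thresholds and the number of backtrack frames are illustrative assumptions, not the patent's values:

```python
import numpy as np

def frame_signal(x, frame_len=200, hop=100):
    """Split a signal into overlapping frames (20 ms frames, 10 ms hop at 10 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def endpoints(frames, start_thresh, end_thresh, backtrack=3, quiet_frames=6):
    """Energy-based start/end detection as sketched in the text:
    the start frame is pushed back a few frames to keep weak onsets;
    the utterance ends after `quiet_frames` consecutive low-energy frames."""
    e = (frames.astype(float) ** 2).sum(axis=1)        # per-frame energy e(i)
    above = np.nonzero(e > start_thresh)[0]
    if len(above) == 0:
        return None
    start = max(0, above[0] - backtrack)               # include low-energy onset
    quiet = 0
    end = len(e) - 1
    for i in range(above[0], len(e)):
        quiet = quiet + 1 if e[i] < end_thresh else 0
        if quiet >= quiet_frames:
            end = i - quiet_frames
            break
    return start, end
```

A real front end would of course compute e(i) alongside the other per-frame features listed above.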
3. Silence (S) / Unvoiced (U) / Voiced (V) three-class decision
A Mandarin monosyllable, cut out by the energy threshold above, can generally be divided into successive segments of silence (S), unvoiced sound (U), and voiced sound (V), as shown in Fig. 3, including the silent passages. To determine the true endpoints MB and MS of a syllable and the unvoiced/voiced boundary ME, every frame i = 0, ..., CE must be classified as S, U, or V.
The classification scheme adopted by this system follows reference [1]. For each frame (numbered i) we build a 5-dimensional feature vector X = [e(i), z(i), NR(i), d(i), a(1)]^T, where T denotes transposition. Let S be class 1, U class 2, and V class 3. Under each of the three classes, X can be treated as an approximately normally distributed random vector; from the training speech of many speakers we obtain their mean vectors M_k = E[X] and covariance matrices D_k = E[(X - M_k)(X - M_k)^T], k = 1, 2, 3. For any input frame X, the likelihood distance to each of the three classes is

d(k) = (X - M_k)^T D_k^(-1) (X - M_k), k = 1, 2, 3    formula (1)

If d(l) = min over k of d(k), frame i is judged to belong to class l.
In order to reject sporadic misjudgments, the system also applies the following smoothing and correction rules:
(1) If "U" appears at the tail of a syllable, it is changed to "S".
(2) If a segment of "S" shorter than 5 frames appears between two adjacent "U" segments, it is changed to "U".
(3) If a segment of "U" shorter than 5 frames appears between two adjacent "S" segments, it is changed to "S".
Because the template parameters of this algorithm are obtained from the training speech of many speakers, it is a speaker-independent algorithm; practice shows that it achieves very high classification accuracy.
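A compact sketch of formula (1) and the three smoothing rules, under the assumption of precomputed class means and inverse covariances (all names illustrative):

```python
import numpy as np

def classify_frames(X, means, inv_covs):
    """Label each frame S/U/V by the smallest likelihood distance
    d(k) = (x - M_k)^T D_k^{-1} (x - M_k), as in formula (1)."""
    labels = []
    for x in X:
        d = [(x - m) @ ic @ (x - m) for m, ic in zip(means, inv_covs)]
        labels.append("SUV"[int(np.argmin(d))])
    return labels

def smooth(labels, min_run=5):
    """The text's correction rules: a trailing 'U' run becomes 'S';
    a short 'S' run (< min_run frames) between two 'U' runs becomes 'U';
    a short 'U' run between two 'S' runs becomes 'S'."""
    lab = list(labels)
    i = len(lab) - 1
    while i >= 0 and lab[i] == "U":      # rule (1): tail 'U' -> 'S'
        lab[i] = "S"
        i -= 1
    runs = []                            # collect runs as [char, start, end)
    for j, c in enumerate(lab):
        if runs and runs[-1][0] == c:
            runs[-1][2] = j + 1
        else:
            runs.append([c, j, j + 1])
    for k in range(1, len(runs) - 1):    # rules (2)-(3): flip short sandwiched runs
        c, a, b = runs[k]
        left, right = runs[k - 1][0], runs[k + 1][0]
        if b - a < min_run and c in "SU" and left == right and left in "SU" and left != c:
            lab[a:b] = [left] * (b - a)
    return "".join(lab)
```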
4. Initial recognition and initial/final segmentation
Before codebooks for initials and finals can be built and used for recognition, the initial and final of a syllable must be separated. This is a rather difficult problem, because in many cases the boundary between the two is far from clear. To solve it, we divide the initials into two classes. The first, called the "unvoiced class", comprises the initials:
{p, t, k, h, j, q, x, z, c, s, ch, sh, g, zh, f}
The second, called the "voiced-head phoneme class", comprises:
{a, o, e, i, u, v, m, n, l, r, b, d, g, zh, f}
The distinguishing feature of the first class is that the initial part of the syllable corresponds clearly to its unvoiced segment, whose length is in most cases greater than 60 ms (6 frames). The second class is characterized by strong coarticulation between initial and final, which makes their boundary hard to determine; at the same time its unvoiced segment is very short, generally under 40 ms (4 frames). The second class also includes the vowel phonemes a, o, e, i, u, v, which in zero-initial syllables stand at the start of the syllable as the head of the final. Three common phonemes, g, zh, and f, appear in both classes, because their characteristics vary greatly and assigning each to only one class would often cause errors.
Based on this analysis, we adopt the following segmentation scheme at recognition time. For each test syllable, the syllable start point MB (the unvoiced start) and the unvoiced/voiced boundary ME (the voiced start) are first determined by the principles of Section 3. If ME - MB > 6, i.e. the unvoiced segment of this syllable exceeds 6 frames (60 ms), the initial is definitely judged to belong to the unvoiced class, and only that class is searched for the best match. If ME - MB < 4, i.e. the unvoiced segment is shorter than 4 frames (40 ms), the initial is judged to belong to the voiced-head phoneme class, and only that class is searched. If 4 <= ME - MB <= 6, no definite decision can be made, and both initial classes must be searched during recognition.
With this classification of initials, training and recognition algorithms suited to each class's characteristics can be set up.
(1) Training and recognition for unvoiced-class initials. For these initials the unvoiced segment (frames MB to ME-1) is taken as the initial segment. For each initial, the frames of the initial segments of the various syllables formed with that initial are pooled to train one VQ codebook. Codebooks are built with the LBG algorithm [2]; the feature vector is the 12 LPC coefficients (Hamming window; autocorrelation method, Durbin recursion), the distance between codewords is the Itakura measure, and the centroid of each cluster is obtained by averaging the normalized autocorrelation coefficients of the training frames in the cluster. Each codebook contains at most 10 codewords. At recognition time, the initial segment of the input test syllable is coded with each initial's VQ codebook, the initials are ranked by average coding distortion, and the best six are kept as candidates (the average coding distortion is the sum of the per-frame coding distortions of the initial segment divided by the number of frames in the segment).
(2) Training and recognition for voiced-head-class initials, in two cases:
[a] For m, n, l, r, the first 6 frames (MB to MB+5, a 60 ms interval) are taken as the initial segment. (The initial length of these sounds varies greatly and is influenced by the following final; analysis of a large amount of speech data shows their average initial length is about 60 ms, hence this choice.) Training and recognition are the same as in (1).
[b] For the other phonemes of the class, the first 3 frames (MB to MB+2) are taken as the initial segment (the initial segments of these sounds are confined to the first three frames). Training and recognition are the same as in (1).
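The codebook training and candidate ranking above can be sketched as follows. This is a plain LBG split-and-refine loop with squared-error distortion standing in for the Itakura measure (which needs LPC-specific quantities); codebook size and data are illustrative:

```python
import numpy as np

def lbg_codebook(vectors, size=8, n_iter=10, eps=0.01):
    """LBG codebook training: start from the global centroid, repeatedly
    split every codeword into a (1+eps)/(1-eps) pair, then refine by
    k-means iterations until the requested codebook size is reached."""
    book = vectors.mean(axis=0, keepdims=True)
    while len(book) < size:
        book = np.concatenate([book * (1 + eps), book * (1 - eps)])
        for _ in range(n_iter):
            d = ((vectors[:, None, :] - book[None]) ** 2).sum(-1)
            idx = d.argmin(axis=1)
            for c in range(len(book)):
                if (idx == c).any():
                    book[c] = vectors[idx == c].mean(axis=0)
    return book

def avg_distortion(segment, book):
    """Average coding distortion: per-frame minimum distortion to the
    codebook, summed over the segment and divided by its frame count."""
    d = ((segment[:, None, :] - book[None]) ** 2).sum(-1)
    return d.min(axis=1).mean()
```

Ranking initials then just sorts the per-initial codebooks by `avg_distortion` of the test syllable's initial segment and keeps the first six.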
5. Final recognition
Before final recognition the endpoints of the final within the syllable must be determined. In view of the characteristics of the different initial classes discussed in the previous section, the start of the final segment is fixed as follows:
[a] If in the test syllable ME - MB >= 4 (the unvoiced segment is at least 4 frames, in other words at least 40 ms), the final segment is judged to start at ME.
[b] If in the test syllable ME - MB <= 3 (the unvoiced segment is at most 3 frames, in other words at most 30 ms), the final segment is judged to start at MB+5.
The basis of this judgment is as follows. If ME - MB <= 3, then, as a large amount of experimental speech data shows, the initial of the test syllable certainly belongs to the voiced-head phoneme class. In that case (the zero-initial situation aside), cutting off the first 5 frames of the syllable and starting the final at the 6th frame essentially removes the initial's coarticulation with the final without losing the head of the final. In the zero-initial case, the voiced-head phoneme table contains all the possible head vowels a, i, u, o, e, v, so even if the head of the final segment is cut off, it can be recovered by the initial-recognition stage. Conversely, if ME - MB >= 4, the final segment coincides well with the voiced segment of the syllable, so the voiced start ME can be taken directly as the start of the final segment.
In all cases the syllable endpoint MS is taken as the end of the final segment.
Final recognition uses a 3-section VQ recognizer. During training, the final segment of each training syllable of the same final is divided into three sections, and a VQ codebook is built for each section; each codebook has at most 10 codewords, and the feature vectors are the same as in the VQ training above. At recognition time, the final segment of the test syllable is divided into three sections and coded with each final's three sectional codebooks; the finals are ranked by ascending distortion and the best six are kept.
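A hedged sketch of the 3-section VQ recognizer, again with squared-error distortion standing in for the distortion measure and single-codeword codebooks as toy data:

```python
import numpy as np

def section_vq_distortion(segment, books):
    """Multi-section VQ for final recognition: split the final segment
    into as many parts as there are sectional codebooks, code each part
    with its own codebook, and sum the per-section average distortions."""
    total = 0.0
    for part, book in zip(np.array_split(segment, len(books)), books):
        d = ((part[:, None, :] - book[None]) ** 2).sum(-1)
        total += d.min(axis=1).mean()
    return total

def rank_finals(segment, final_books, top=6):
    """Rank candidate finals by ascending distortion, keeping the best 6."""
    return sorted(final_books,
                  key=lambda f: section_vq_distortion(segment, final_books[f]))[:top]
```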
6. Speech recognition based on hidden Markov models
6.1 Definitions
A hidden Markov process is a stochastic process composed of two mechanisms: one is a hidden Markov chain with finitely many states; the other is a set of random observation functions, one attached to each state. Suppose the human vocal tract has a finite number of articulatory configurations and put them in correspondence with the states, with the speech signal as the observation emitted by a given vocal-tract configuration; such a sound can then be described by a hidden Markov process. The HMM parameter estimation method yields the parameters of a pronunciation process, and this parameter set is the HMM model. The model applied to speech recognition here is a discrete, linear hidden Markov model, defined as follows.
Let {ξ(n) | n = 0, 1, ...} be a discrete stochastic process with state space S = {1, 2, ..., N} satisfying

P{ξ(n+1) | ξ(0) = s_0, ..., ξ(n) = s_n} = P{ξ(n+1) | ξ(n)}    formula (2)

Let

a_ij(n) = P{ξ(n+1) = j | ξ(n) = i}    formula (3)

a_ij(n) is the one-step transition probability of ξ(n); it has the properties

a_ij(n) >= 0    formula (4)
Σ_j a_ij(n) = 1    formula (5)

If a_ij(n) is independent of n, the chain is called a homogeneous Markov model. In speech recognition the state is hidden; the observed speech features are described by a random function F(X) that depends on the state, and the form of the observation functions is tied to the concrete speech features.
In the HMM speech recognition method, to obtain the HMM model of a pronunciation, L samples of the same pronunciation are first VQ-coded, producing L VQ codeword sequences. These L sequences are regarded as generated by the same HMM, and the training criterion for constructing the model is that it maximize the probability of producing the L codeword sequences; when L is sufficiently large, the model has recorded the prior probability with which this pronunciation produces codeword strings. At recognition time, for the codeword sequence of an input pronunciation, the posterior probability that each trained model produced the sequence is computed, and by the maximum a posteriori criterion the model most likely to have produced it is identified, giving the recognition result. For the VQ/HMM method, then, the observation functions are in fact a set of discrete observation probabilities. If the label set of the VQ codebook is TN = {m | m = 0, 1, ..., M-1} and Q = (Q_1, Q_2, ..., Q_T) is the VQ codeword sequence of the speech, Q_t ∈ TN (t = 1, 2, ..., T), then the observation probabilities are defined, for all i ∈ S, j ∈ TN, as b_i(j) = P{Q_t = j | ξ(t) = i}. For a homogeneous HMM, b depends only on the state and the observed codeword; under the homogeneity assumption the model parameters can be estimated by a re-estimation algorithm, and the recognition algorithm is also very simple.
6.2 An HMM method for Chinese monosyllables
In classical HMM training the biggest problem is that the re-estimation algorithm (Baum-Welch) needs many iterations; every recomputation of the model parameters passes over all the training data again, so the amount of computation is unusually large. Moreover, because the states are hidden and have no explicit physical meaning, the best model can only be produced through repeated correction of the parameters. If we could determine the optimal state segmentation of the model once, the model parameters could be estimated in one pass, without repeated re-estimation, and both the computation and the data throughput would drop sharply. In fact, the states of an HMM can be put in correspondence with phonemic structure. A brand-new HMM training method built on this idea is described in detail below.
In a word-based HMM, the model states {q_0, ..., q_N} of a known word are given; a state q_i does not correspond to a single speech feature symbol, however, but to a set S_i = {V_i1, V_i2, ...} containing several feature symbols, and the symbol sets S_i, S_j of different states q_i, q_j have a non-empty intersection, S_i ∩ S_j ≠ ∅ (i ≠ j). These characteristics make the HMM particularly suitable for speaker-independent recognition. Another characteristic of the HMM is the randomness of its state transitions, which lets it automatically match the same word spoken at different speeds (acting much like DTW).
The states of an HMM can be defined (or interpreted) in various ways, but a good state definition should both reflect the variability of speech (each state corresponding to several symbols) and keep the intersections of different states' symbol sets (S_i ∩ S_j, i ≠ j) as small as possible; otherwise larger recognition errors are inevitable.
If the speech feature symbols are extracted frame by frame in time, then a natural interpretation of an HMM state is that it corresponds to a phoneme steady segment or to a transition segment between phonemes. This viewpoint also makes it easy to explain why the left-to-right state transition model of Fig. 4 adapts well to speech: the phonemes of a word are produced from left to right (in the direction of increasing time), and skip transitions between states reflect the "swallowed sound" phenomenon that may occur. In Chinese words, "swallowed sounds" rarely appear, so skip transitions can be ignored, and the HMM speech recognition model for Chinese should be described by the state transition diagram of Fig. 5.
If we view the state transition model of Fig. 5 from the angle of state dwell length, we obtain the state chain shown in Fig. 6, in which the dwell length of state q_i is a random variable with distribution P_i(m). If P_i(0) ≠ 0, state q_i may be skipped over; since for Chinese speech the "swallowed sound" phenomenon can be taken not to occur, we can always assume P_i(0) = 0. If (k-1) occurrences of state q_i have appeared by time t, the transition probabilities at time t+1 are

a_ij(k) = P_i(k <= m_i) / P_i(k <= m_i + 1),    j = i
a_ij(k) = P_i(k = m_i + 1) / P_i(k <= m_i + 1),    j = i + 1
a_ij(k) = 0,    other j;  i = 1, 2, ..., N    formula (6)

Because the a_ij(k) depend on k, the HMM based on the state length distribution of Chinese is a non-stationary Markov model. If the dwell-length distributions P_i(m_i) reflect the characteristics of the speech well, using this model for speech recognition greatly simplifies training. The complexity of the present Baum-Welch training algorithm for HMM speech recognition is nothing short of catastrophic for a large-vocabulary system, whereas the new model only needs the training word to be segmented to obtain the segment lengths m_1, ..., m_N; the relative frequency of m_i serves as an approximation of P_i(m_i), the state transition probabilities a_ij(k) then follow from formula (6), and each entry of the observation matrix {b_i(Q_k)} is obtained as the frequency with which the observation symbol Q_k appears in segment i.
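In duration terms, with d frames already spent in a state the chain stays with probability P(m >= d+1)/P(m >= d) and advances with P(m = d)/P(m >= d), which is formula (6) read with k = d + 1. A small sketch with an illustrative duration pmf:

```python
import numpy as np

def transition_probs(dur_pmf, d):
    """Stay/advance probabilities after d frames in a state, from the
    state-duration pmf (dur_pmf[m] = P(duration == m)); the two
    probabilities always sum to 1 while the duration is not exhausted."""
    p = np.asarray(dur_pmf, dtype=float)
    tail = p[d:].sum()                    # P(m >= d)
    if tail == 0.0:
        return 0.0, 1.0                   # duration exhausted: must advance
    stay = p[d + 1:].sum() / tail         # P(m >= d+1) / P(m >= d)
    advance = p[d] / tail                 # P(m == d)  / P(m >= d)
    return stay, advance
```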
6.3 HMM parameter estimation based on the state length distribution
The purpose of parameter estimation is to obtain the segment-length distribution P_i(m) of a speech unit or word and the observation matrix {b_i(Q_k)}; the main computational cost lies in the segmentation. Suppose a given word has L training sequences Q^l (l = 1, 2, ..., L), Q^l = (Q^l_1, Q^l_2, ..., Q^l_Tl), where T_l is the length of the l-th training sequence and each Q^l_t takes values, after vector quantization, in a feature space V (the VQ codebook). The optimal segmentation of Q^l is defined as the division that minimizes a within-segment distortion criterion; formulas (7)-(10), which define this criterion and the center of gravity of the i-th segment of sequence Q^l, appear only as images in the source and are not reproduced here. Minimizing (7) is then equivalent to solving a constrained optimization problem (11) over the set

Ω = {(x_1, ..., x_N) | 0 <= x_1 <= x_2 <= ... <= x_N = T_l}    formula (12)

For multi-extremal functionals of this type, whether a search algorithm converges to the minimum depends to a great extent on the choice of initial value. Here, because prior knowledge about x_1, ..., x_N is easy to obtain, the minimum point of (11) can be found; the resulting expression (13) likewise survives only as an image. This yields the segment-length estimates; when L is sufficiently large, by the law of large numbers,

P_i(m_i) = (number of the L training sequences whose i-th segment has length m_i) / L,
i = 1, 2, ..., N    formula (14)

and the probability of observing symbol Q_k in state q_i is

b_i(Q_k) = (total frames in the L training sequences at which Q_k appears in state q_i) / (total frames of state q_i in the L sequences),
i = 1, 2, ..., N;  k = 1, 2, ..., M    formula (15)

where M is the size of the VQ codebook V. Carrying out the above training steps for each word in the recognition vocabulary yields all the needed HMM parameters.
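Formulas (14) and (15) amount to simple counting once each training sequence has been segmented. A sketch under the assumption that the optimal segmentations are already given (the segmentation search itself is not shown):

```python
import numpy as np

def estimate_params(segmentations, sequences, n_states, n_symbols):
    """Estimate the duration pmf P_i(m) (formula (14)) and the discrete
    observation probabilities b_i(q) (formula (15)) by counting over
    segmented training sequences.  segmentations[l] lists the exclusive
    end frame of each state segment in sequence l."""
    max_len = max(len(q) for q in sequences)
    dur = np.zeros((n_states, max_len + 1))
    obs = np.zeros((n_states, n_symbols))
    for cuts, q in zip(segmentations, sequences):
        start = 0
        for i, end in enumerate(cuts):
            dur[i, end - start] += 1          # one duration sample per segment
            for sym in q[start:end]:
                obs[i, sym] += 1              # symbol count within state i
            start = end
    dur /= dur.sum(axis=1, keepdims=True)     # normalize to P_i(m)
    obs /= obs.sum(axis=1, keepdims=True)     # normalize to b_i(q)
    return dur, obs
```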
6.4 Recognition algorithm
Suppose that through training we have obtained the VQ codebook of the speech feature vectors, each state's segment-length distribution P_i(m) (i = 1, 2, ..., N), and the observation matrices {b_i(Q_k)} (i = 1, 2, ..., N; k = 1, 2, ..., M), and that the recognition vocabulary contains V words, W = {W_1, W_2, ..., W_V}. Let the feature-vector observation sequence of the utterance to be recognized be Q = (Q_1, ..., Q_T), Q_t ∈ V (t = 1, ..., T). First the optimal state segmentation S = (x_1, ..., x_N) of Q is found subject to (12). The likelihood is

P(Q/W_i) = P(Q, S/W_i) = P(S/W_i) P(Q/W_i, S)    formula (16)

The first equality in the formula holds because for a given Q the segmentation S is completely determined. The W_i that maximizes the likelihood defined by (16) is taken as the recognition result (formula (17), given only as an image in the source). Because P(Q/W_i ∩ S) is far more sensitive to the segmentation S than P(S/W_i), one can make the simplifying assumption of formula (18), likewise given only as an image.
This is a maximum-path problem in dynamic programming and can be solved with the Viterbi algorithm. The steps are:
(1) δ^(i)_1(1) = log b^(i)_1(Q_1)
(2) δ^(i)_t(j) = max[δ^(i)_(t-1)(j-1), δ^(i)_(t-1)(j)] + log b^(i)_j(Q_t),
j = 1, 2, ..., N;  t = 1, 2, ..., T    formula (19)
δ^(i)_T(N) = max log P(Q/W_i ∩ S)    formula (20)
(3) The optimal segmentation is obtained from steps (1) and (2) by recording and backtracking the decision points.
Compared with the classical HMM recognizer, one addition per step is saved in step (2), but the new recognizer must record the segmentation points during the computation and carry out step (3); its total operation count is essentially the same as that of the classical HMM Viterbi recognizer.
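Steps (1) and (2) of the recognizer can be sketched as follows (score only; the backtracking of step (3) is omitted, and `log_b[t][j]` stands for log b_j(Q_t)):

```python
import numpy as np

def viterbi_score(log_b, n_states):
    """Log-likelihood of the best left-to-right state path:
    delta_t(j) = max(delta_{t-1}(j-1), delta_{t-1}(j)) + log b_j(Q_t),
    starting in state 1 and ending in state N, as in formula (19)."""
    T = len(log_b)
    delta = np.full((T, n_states), -np.inf)
    delta[0, 0] = log_b[0][0]              # step (1): path starts in state 1
    for t in range(1, T):
        for j in range(n_states):
            best = delta[t - 1, j]         # stay in state j
            if j > 0:
                best = max(best, delta[t - 1, j - 1])  # or advance from j-1
            delta[t, j] = best + log_b[t][j]
    return delta[T - 1, n_states - 1]      # must end in the last state
```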
6.5 An adaptive HMM method
In the new HMM model, as can be seen, the model parameters P_i(m) and {b_i(j)} are all statistically averaged estimates, and such parameters are very easy to adapt. We therefore give a unified description of the adaptation of P_i(m) and {b_i(j)}.
Let Avg1 = SUM1/T1 and Avg2 = SUM2/T2 be the averages of two successive accumulation passes, and let Avg be the overall mean. Obviously the following hold:

Avg = (SUM1 + SUM2)/(T1 + T2) = Avg1·T1/(T1 + T2) + Avg2·T2/(T1 + T2)    formula (23)
Avg = (1 - fa)·Avg1 + fa·Avg2    formula (24)
fa = T2/(T1 + T2)    formula (25)

The overall mean can thus be estimated simply from the ratio of the amounts of training data in the two passes; fa is called the adaptation coefficient. The adaptation of P_i(m) and b_i(j) therefore takes an extremely simple form.
Adapting the VQ codebooks and the HMM parameters separately in this way in the experiments above yields a practical adaptive VQ/HMM speech recognition system.
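Formulas (23)-(25) reduce to a one-line pooled mean:

```python
def adapt(avg1, t1, avg2, t2):
    """Pooled mean via the adaptation coefficient fa = T2/(T1+T2):
    Avg = (1 - fa)*Avg1 + fa*Avg2, formulas (23)-(25)."""
    fa = t2 / (t1 + t2)
    return (1 - fa) * avg1 + fa * avg2
```

Applied elementwise to P_i(m) and b_i(j), this blends the old model (T1 frames of evidence) with a new speaker's data (T2 frames) without retraining from scratch.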
6.6 Conclusion
A brand-new recognition method for the full set of Chinese syllables has been proposed above; it fully exploits the characteristics of Chinese monosyllables. First, a layered initial/final VQ algorithm solves the full-syllable isolated-word recognition problem for Chinese. Second, an HMM model suited to the characteristics of Chinese syllables has been developed. Both parts are capable of speaker-independent recognition; the recognition rate for Chinese reaches 98%, the computational load of the algorithms is small, and speech recognition can be realized at low cost on a microcomputer, so the method has practical value.
Because the present invention can recognize all Chinese monosyllables, and the recognition method requires very little computation, a sound-controlled typewriter can be built on a microcomputer system, so that anyone can perform Chinese character information processing by operating the computer with speech, greatly advancing the modernization of computer-based Chinese character information processing.
The object of the invention is achieved as follows: a sound card is added to a microcomputer or to a Chinese-English electronic typewriter. The sound card recognizes all Chinese monosyllables and converts the speech into corresponding characters, which are sent to the host and stored; stored Chinese characters can in turn be converted back into speech output on the sound card, or output on the Chinese-English typewriter (or sent by the microcomputer to a printer). This is a speaker-independent method of full-syllable Chinese speech recognition; for a specific speaker its performance is even better.
The concrete structure of invention is provided by following examples and accompanying drawing thereof.
Fig. 7 is the sound-controlled typewriter block diagram of realizing on IBM-PC series microcomputer according to above-mentioned audio recognition method.
This sound-controlled typewriter performs all speech recognition with the speech recognition card (3) and synthesizes all speech with the speech synthesis card (4). When the user speaks into the microphone, the recognition card (3) converts the speech into Chinese character internal codes in the computer; all homophones corresponding to the speech can be displayed (6), and the correct character is stored either by the user's key selection or automatically by the microcomputer from phrase collocations in context. A finished manuscript can be typeset and printed by the computer (7), or read aloud through the synthesis card (4) so the user can proofread the text. When entering Chinese characters, the user may input speech either by phrases or syllable by syllable.
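The homophone-selection step described above can be sketched as follows. The homophone and phrase tables here are illustrative stand-ins, not the patent's actual data, and the fallback to the first candidate stands in for the user's key selection.

```python
# Hypothetical homophone table: one recognized pinyin syllable maps to
# several candidate characters (illustrative entries only).
HOMOPHONES = {"yin1": ["音", "因", "阴"], "yue4": ["月", "乐", "越"]}

# Known two-character phrase collocations (illustrative entries only).
PHRASES = {("音", "乐"), ("因", "果")}

def choose_by_context(prev_char, syllable):
    """Return the homophone that forms a known phrase with prev_char;
    otherwise fall back to the first candidate, standing in for the
    user's manual key selection."""
    candidates = HOMOPHONES[syllable]
    for ch in candidates:
        if (prev_char, ch) in PHRASES:
            return ch
    return candidates[0]
```

With these tables, the syllable "yue4" after "音" resolves to "乐" (forming the phrase 音乐), while with no matching collocation the first candidate is offered for manual confirmation.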

Claims (6)

1. A sound-controlled typewriter system: a device that realizes a sound-controlled typewriter by adding a speech recognition card (3) and a synthesis card (4) to an ordinary microcomputer or an ordinary Chinese-English typewriter. Speech recognition uses initial/final recognition as the first level and an HMM model based on the segment-length distribution of syllable steady sections as the second level; it recognizes speech in units of words or syllables and can synthesize syllables and phrases. The recognition method is speaker-independent, with still better performance for a specific speaker. The system also has a speech understanding function and can distinguish homophones through morphological and syntactic knowledge.
2. The device according to claim 1, characterized in that it can recognize all Chinese speech and can also synthesize all Chinese speech.
3. According to claims 1 and 2, the sound-controlled typewriter is a speaker-independent system.
4. According to claims 1, 2 and 3, the sound-controlled typewriter can recognize not only single syllables but also phrase pronunciations.
5. According to claims 1, 2 and 3, the sound-controlled typewriter has a language understanding function and can distinguish homophones according to morphological and syntactic knowledge.
6. According to claims 1, 2 and 3, the sound-controlled typewriter can also serve as the host of a control system, letting a person control machines by voice.
CN 90101666 1990-03-28 1990-03-28 Sound-controlled typewriter Pending CN1055254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 90101666 CN1055254A (en) 1990-03-28 1990-03-28 Sound-controlled typewriter

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 90101666 CN1055254A (en) 1990-03-28 1990-03-28 Sound-controlled typewriter

Publications (1)

Publication Number Publication Date
CN1055254A true CN1055254A (en) 1991-10-09

Family

ID=4877183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 90101666 Pending CN1055254A (en) 1990-03-28 1990-03-28 Sound-controlled typewriter

Country Status (1)

Country Link
CN (1) CN1055254A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100359507C (en) * 2002-06-28 2008-01-02 三星电子株式会社 Apparatus and method for executing probability calculating of observation
CN105679332A (en) * 2016-03-09 2016-06-15 四川大学 Cleft palate speech initial and final automatic segmentation method and system
CN105679332B (en) * 2016-03-09 2019-06-11 四川大学 A kind of cleft palate speech sound mother automatic segmentation method and system

Similar Documents

Publication Publication Date Title
CN1112669C (en) Method and system for speech recognition using continuous density hidden Markov models
CN1277248C (en) System and method for recognizing a tonal language
JP2003036093A (en) Speech input retrieval system
Bahl et al. Automatic phonetic baseform determination
US20100100379A1 (en) Voice recognition correlation rule learning system, voice recognition correlation rule learning program, and voice recognition correlation rule learning method
Chou et al. A minimum error rate pattern recognition approach to speech recognition
Egorova et al. Out-of-vocabulary word recovery using fst-based subword unit clustering in a hybrid asr system
Jiang et al. The ustc system for Blizzard Challenge 2010
CN1157711C (en) Adaptation of a speech recognizer for dialectal and linguistic domain variations
Hadwan et al. An End-to-End Transformer-Based Automatic Speech Recognition for Qur'an Reciters.
JP6718787B2 (en) Japanese speech recognition model learning device and program
Granell et al. Multimodal output combination for transcribing historical handwritten documents
Tjalve et al. Pronunciation variation modelling using accent features
CN111429886B (en) Voice recognition method and system
Liu et al. Pronunciation modeling for spontaneous Mandarin speech recognition
CN1055254A (en) Sound-controlled typewriter
Xiao et al. Information retrieval methods for automatic speech recognition
Chalamandaris et al. Rule-based grapheme-to-phoneme method for the Greek
EP3718107B1 (en) Speech signal processing and evaluation
Fosler-Lussier A tutorial on pronunciation modeling for large vocabulary speech recognition
Svendsen Pronunciation modeling for speech technology
Flemotomos et al. Role annotated speech recognition for conversational interactions
Huang et al. Phone set generation based on acoustic and contextual analysis for multilingual speech recognition
Müller Multilingual Modulation by Neural Language Codes
Hwang et al. Porting decipher from English to Mandarin

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C01 Deemed withdrawal of patent application (patent law 1993)
WD01 Invention patent application deemed withdrawn after publication
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication