CN1163009A - Method and system for recognizing a boundary between sounds in continuous speech - Google Patents

Method and system for recognizing a boundary between sounds in continuous speech

Info

Publication number
CN1163009A
CN1163009A (Application No. CN 95195359 / CN95195359A)
Authority
CN
China
Prior art keywords
spoken sound
sound
speech
speech recognition
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 95195359
Other languages
Chinese (zh)
Inventor
Shay-Ping Thomas Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Motorola Solutions Inc
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc filed Critical Motorola Inc
Priority to CN 95195359 priority Critical patent/CN1163009A/en
Publication of CN1163009A publication Critical patent/CN1163009A/en
Pending legal-status Critical Current

Abstract

Boundaries of spoken sounds in continuous speech are identified by classifying delimitative sounds to provide improved performance in a speech-recognition system. Delimitative sounds, those portions of continuous speech that occur between two spoken sounds, are recognized by the same method used to recognize spoken sounds. Recognition of delimitative sounds is accomplished by training a learning machine to act as a classifier (208) which implements a discriminant function based on a polynomial expansion.

Description

Method and system for recognizing a boundary between sounds in continuous speech
The present invention is related to the following three inventions, which are assigned to the same assignee as the present invention.
(1) "Neural Network and Method of Using Same", Ser. No. 08/076,601, filed June 14, 1993.
(2) "Speech-Recognition System Utilizing Neural Networks and Method of Using Same", Ser. No. _______, filed _______.
(3) "System for Recognizing Spoken Sounds from Continuous Speech and Method of Using Same", Ser. No. _______, filed _______.
The subject matter of the above-identified related inventions is hereby incorporated by reference.
The present invention relates generally to speech-recognition systems and, more particularly, to a method and system for identifying boundaries between sounds in continuous speech.
For many years, scientists have attempted to find a means of simplifying the interface between man and machine. The most commonly used tools for implementing a man-machine interface today are input devices such as the keyboard, mouse, touch screen, and pen. However, a simpler and more natural interface may be the human voice. A device that can automatically recognize speech would provide such an interface.
Applications for an automated speech-recognition device include database queries using voice commands, voice input for quality control in a manufacturing process, voice-dialed cellular telephones that allow a driver to keep his attention on the road while dialing, and voice-operated prosthetic devices for the handicapped.
Unfortunately, automatic speech recognition is not an easy task. One reason is that speech varies considerably from one person to another. For instance, the same word uttered by several persons may sound significantly different because of differences in pitch, speaking rate, gender, or age. In addition to speaker variability, co-articulation effects, speaking modes (shout/whisper), and background noise present significant obstacles to speech-recognition devices.
Since the late 1960s, various methodologies have been introduced for automatic speech recognition. Some methods are based on extended knowledge with corresponding heuristic strategies, while others rely on speech databases and learning methodologies. The latter include dynamic time-warping (DTW) and hidden-Markov modeling (HMM). Both of these methods, as well as the use of time-delay neural networks (TDNNs), are discussed below.
Dynamic time-warping is a technique that uses an optimization principle to minimize the errors between an unknown spoken word and a stored template of a known word. Reported data show that the DTW technique is quite robust and produces good recognition. However, the DTW technique is computationally intensive; it is therefore currently impractical to implement DTW in real-world applications.
Rather than directly comparing an unknown spoken word to a template of a known word, the hidden-Markov-modeling technique uses stochastic models of known words and compares the probability that the unknown word was generated by each model. When an unknown word is uttered, the HMM technique checks the sequence (or states) of the word and finds the model that provides the best match. The HMM technique has been successfully used in many commercial applications; however, it has a number of drawbacks. These include an inability to differentiate acoustically similar words, susceptibility to noise, and computational intractability.
In recent years, neural network had been used to the extremely difficult conception of solution or insoluble problem, for example problem of speech recognition and so on already.Delayed Neural Networks (TDNN) is a kind of neural network type, and it has selected for use limited neurocyte to connect the temporary effect that embodies voice.For the identification of limited word, the effect that TDNN seems more a shade better than HMM method, but TDNN has two critical defects.
First, the training time for a TDNN is very long, on the order of several weeks. Second, the training algorithm for a TDNN often converges to a local minimum, which is not the globally optimal solution.
In summary, the drawbacks of existing known methods of automatic speech recognition (e.g., algorithms requiring impractical amounts of computation, limited tolerance to speaker variability and background noise, excessive training time, etc.) severely limit the acceptance and proliferation of speech-recognition devices in many potential areas of utility. There is thus a significant need for an automatic speech-recognition system that provides a high level of accuracy, is immune to background noise, does not require repetitive training or complex computation, and is insensitive to differences between speakers.
Accordingly, one advantage of the present invention is to provide a method of separating spoken sounds in continuous speech that maintains a high level of recognition accuracy. The method identifies the boundaries between spoken sounds, thereby significantly increasing the likelihood of correctly identifying each spoken sound.
Another advantage of the present invention is to provide a method and system for training a learning machine to identify spoken sounds in continuous speech that does not require repetitive or lengthy training sessions.
A further advantage of the present invention is to provide a system in which boundaries between spoken sounds in continuous speech are identified in the same manner as the spoken sounds themselves, thereby reducing the overall complexity of the system.
These and other advantages of the present invention are embodied, in a preferred embodiment, by providing a system for identifying a plurality of spoken sounds from continuous speech. The system includes a method of identifying a boundary between two spoken sounds in the continuous speech. The method comprises the following steps: first, receiving the continuous speech; second, defining a delimitative sound that contains the boundary as one of the spoken sounds; and finally, identifying the boundary by identifying the delimitative sound in the continuous speech.
In another embodiment of the present invention, a speech-recognition system is provided that includes a recognition means. The recognition means receives a plurality of features extracted from continuous speech and then uses a classifier to identify a delimitative sound in the continuous speech by classifying the plurality of features.
While the invention is particularly pointed out in the appended claims, other features of the invention will become clear, and the invention will be better understood, upon reading the detailed description below with reference to the following drawings.
Fig. 1 shows a general block diagram of a speech-recognition system used in one embodiment of the present invention.
Fig. 2 shows a flow diagram of a method of identifying a delimitative sound in continuous speech in accordance with a preferred embodiment of the present invention.
Fig. 3 shows a flow diagram of a method of using a learning machine to identify a spoken sound in continuous speech in accordance with an embodiment of the present invention.
Fig. 4 shows a classifier that receives a plurality of extracted features.
Fig. 5 shows a classifier that receives a sequence of extracted features.
Fig. 6 shows a block diagram of a classifier in accordance with a preferred embodiment of the present invention.
Fig. 7 shows a flow diagram of a method of training a speech-recognition system to identify spoken sounds in continuous speech in accordance with a further embodiment of the present invention.
Fig. 8 shows a block diagram of a system for training a learning machine in accordance with an embodiment of the present invention.
Fig. 9 shows a block diagram of a speech-recognition system that includes an embodiment of the present invention.
Fig. 1 shows a general block diagram of a speech-recognition system used in one embodiment of the present invention. Microphone 2, or an equivalent device, receives audio input in the form of speech and converts the sound into an electrical signal. Speech-recognition system 6 receives the signal from microphone 2 over transmission medium 4 and performs various tasks such as waveform sampling, analog-to-digital (A/D) conversion, feature extraction, and classification. Speech-recognition system 6 provides identifiers of the spoken sounds to computer 10 via bus 8. The method and system of the present invention are implemented in speech-recognition system 6. Computer 10 executes commands or programs using the data provided by speech-recognition system 6.
Those skilled in the art will recognize that speech-recognition system 6 may transmit spoken-sound identifiers not only to a computer but also to other devices that could be substituted for computer 10, such as a communications network, a data-storage system, or a transcription device.
The system shown in Fig. 1 is used to identify spoken sounds in continuous speech. Continuous speech occurs when a person speaking into the microphone does not deliberately pause between each spoken sound. That is, pauses occur only where speech naturally requires them, for example at the end of a sentence. Continuous speech can therefore be thought of as "natural" speech, such as occurs in ordinary conversation. Continuous speech includes at least one spoken sound, where a spoken sound may be a word, a syllable, or a phoneme. A phoneme is the smallest unit of speech sound that distinguishes one meaning from another. A syllable includes one or more phonemes, and a word includes one or more syllables.
Fig. 2 shows a flow diagram of a method of identifying a delimitative sound in continuous speech in accordance with a preferred embodiment of the present invention. In step 20, continuous speech, which includes at least one spoken sound, is received.
In step 22, a delimitative sound is defined. A delimitative sound is a portion of the continuous speech that occurs between two spoken sounds. In effect, a delimitative sound contains the boundary between two spoken sounds in the continuous speech. Next, in step 24, the boundary between the two spoken sounds is identified by identifying the delimitative sound in the continuous speech. Although a delimitative sound does not represent a meaningful spoken sound, in this embodiment of the present invention delimitative sounds are identified using the same means or method used to identify spoken sounds.
Fig. 3 shows a flow diagram of a method of using a learning machine to identify spoken sounds in continuous speech in accordance with an embodiment of the present invention. The learning machine may be a neural network or any other system that can be trained, by learning from examples, to classify a plurality of patterns.
In step 30, the learning machine is trained with a plurality of spoken-sound examples. In one embodiment of the present invention, the plurality of examples includes examples of delimitative sounds. A spoken-sound example is defined as a given set of inputs together with a desired output. For instance, a spoken-sound example may contain, as the given input, a set of features extracted from continuous speech, and, as the desired output, a binary code representing the corresponding spoken sound in ASCII.
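The pairing described above can be sketched as a small data structure. This is only an illustration of the example format (features plus desired ASCII output); the field names and values are assumptions, not taken from the patent.

```python
# A spoken-sound example: a given input (extracted features) paired with
# a desired output (here, the ASCII code of the sound's label).
from typing import NamedTuple, Tuple

class SpokenSoundExample(NamedTuple):
    features: Tuple[float, ...]   # e.g. cepstral coefficients for one frame
    desired_output: int           # ASCII code of the corresponding label

example = SpokenSoundExample(features=(0.12, -0.48, 0.05),
                             desired_output=ord("A"))
print(example.desired_output)  # 65
```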
Training the learning machine turns it into a device that can classify the data it receives; thus, in a preferred embodiment of the present invention, the trained learning machine can classify delimitative sounds.
In step 32, continuous speech is received. The continuous speech typically contains a plurality of unidentified spoken sounds separated by delimitative sounds.
In step 34, a plurality of features is extracted from the continuous speech by a feature extractor. The extracted features may include, for example, cepstral coefficients, prediction coefficients, or Fourier coefficients.
In step 36, at least one spoken sound is identified by classifying the features. When the continuous speech contains spoken sounds separated by delimitative sounds, the delimitative sounds may also be identified in order to determine the boundaries of each spoken sound. Determining the boundaries of spoken sounds increases the probability of correctly identifying the spoken sounds. A delimitative sound is identified in the same manner as a spoken sound; that is, it is identified by classifying its features.
Referring to Fig. 4, a classifier is shown that receives a plurality of extracted features. Classifier 80 receives the extracted features on its inputs 82, 84, 86, 88, 90, 92, 94, and 96. The features may be received concurrently on the respective inputs. In the example shown, the set of features received by classifier 80 comprises x1, x2, ..., x8, which may take the form of cepstral, linear-prediction, or Fourier coefficients.
In a preferred embodiment of the present invention, classifier 80 uses a parametric decision method to determine whether a set of features belongs to a certain class. A class may represent a spoken sound. Using the parametric decision method, classifier 80 implements a discriminant function y(X), where X = {x1, x2, ..., xi} is the set of features and i is an integer index. Upon receiving a set of features, classifier 80 computes the corresponding discriminant function and produces a result on output 98. The magnitude of the result generally indicates whether the set of features belongs to the class corresponding to the discriminant function. In a preferred embodiment of the present invention, the magnitude of the result is directly proportional to the likelihood that the set of features belongs to the corresponding class.
The discriminant function implemented by classifier 80 is based on a polynomial expansion and, in a loose sense, on functions such as sine, cosine, exponential/logarithmic functions, the Fourier transform, Legendre polynomials, nonlinear basis functions (such as Volterra functions or radial basis functions), or a combination of a polynomial expansion and orthogonal functions.
In a preferred embodiment of the present invention, a polynomial expansion is used, the general case of which is represented by Equation (1):

y = Σ(i=1 to m) w(i-1) · x1^g1i · x2^g2i · ... · xn^gni    (1)

where xi represents a classifier input and can be a function such as xi = fi(zj), in which zj is any arbitrary variable; the indices i, j, and m are any integers; y represents the classifier output; w(i-1) represents the coefficient of the i-th term; g1i, g2i, ..., gni represent the exponents of the i-th term and are integers; and n is the number of classifier inputs.
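A discriminant function of the form of Equation (1) can be evaluated in a few lines of code. The sketch below assumes a tiny, made-up expansion (three terms, two inputs); the coefficients and exponents are illustrative and are not the trained values of any classifier described in the patent.

```python
# Evaluate y = sum_i w[i] * x1**g[i][0] * x2**g[i][1] (Equation (1))
# for an assumed three-term, two-input expansion.

def discriminant(x, w, g):
    """x: inputs; w: term coefficients; g: per-term exponent tuples."""
    y = 0.0
    for wi, gi in zip(w, g):
        term = wi
        for xj, gji in zip(x, gi):
            term *= xj ** gji
        y += term
    return y

# Example expansion: y = 1.0 + 0.5*x1 + 0.25*x1*x2 (exponents chosen
# for illustration only).
w = [1.0, 0.5, 0.25]
g = [(0, 0), (1, 0), (1, 1)]
print(discriminant([2.0, 3.0], w, g))  # 1.0 + 1.0 + 1.5 = 3.5
```

In a real classifier the inputs x would be the extracted features, and a larger output value would indicate a greater likelihood that the features belong to the classifier's class.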
Fig. 5 shows a classifier that receives a sequence of extracted features. Each extracted feature is fed to classifier 100 on input 102. Classifier 100 performs essentially the same function as classifier 80 of Fig. 4 and provides the result of its operation on output 104. In the example shown, the set of features received by classifier 100 may include cepstral coefficients, linear-prediction coefficients, or Fourier coefficients.
Fig. 6 shows a block diagram of classifier 100 of Fig. 5 in accordance with a preferred embodiment of the present invention. Computer 110 implements classifier 100 of Fig. 5. Computer 110 contains a plurality of computing elements, of which computing elements 111, 113, and 115 are shown, as well as summing circuit 117.
The polynomial expansion is computed by computer 110 in the following manner. A plurality of data inputs x1, x2, ..., xn is fed into computer 110 over bus 119 and then distributed to the plurality of computing elements 111, 113, and 115. The data inputs are typically extracted features. Each computing element computes one term of the polynomial expansion, determining which data inputs it receives. After computing a term, a computing element passes the term to summing circuit 117, which sums the terms computed by the computing elements and places the sum on output 133.
For example, Fig. 6 shows the computation of a polynomial of the form

y = x1^g11 · x2^g21 + x1^g12 · x2^g22 + ... + xn^gnm

Computing element 111 computes the term x1^g11 · x2^g21 and passes it to summing circuit 117 over bus 127; computing element 113 computes the term x1^g12 · x2^g22 and passes it to summing circuit 117 over bus 129; and computing element 115 computes the term xn^gnm and passes it to summing circuit 117 over bus 131. Upon receiving the terms from the computing elements, summing circuit 117 adds them and places the resulting polynomial expansion on output 133.
Those of ordinary skill will understand that computer 110 can compute polynomials of the form given by Equation (1) that have a number of terms different from the above example, and whose terms are composed of data inputs different from those of the above example.
In one embodiment of the present invention, computer 110 is implemented by software running on a processor such as a microprocessor. However, those skilled in the art will understand that a programmable logic array, an ASIC, or other digital logic device could also be used to implement the functions performed by computer 110.
Fig. 7 shows a flow diagram of a method of training a speech-recognition system to identify spoken sounds in continuous speech in accordance with a further embodiment of the present invention. A speech-recognition system constructed in accordance with an embodiment of the present invention has, in principle, two modes of operation: (1) a training mode, in which examples of spoken sounds are used to train a learning machine, and (2) a recognition mode, in which spoken sounds in continuous speech are identified. Referring to Fig. 8, the user must generally train learning machine 176 by providing examples of all the spoken sounds and delimitative sounds that the system is to recognize. In a preferred embodiment of the present invention, the learning machine is trained to classify delimitative sounds using the same method used to train it to classify spoken sounds.
In one embodiment of the present invention, a learning machine is trained to behave as a classifier by tuning the coefficients of a discriminant function derived from a polynomial expansion of the form given by Equation (1). For the discriminant function to classify input data effectively, the coefficient w(i-1) of each term of the polynomial expansion must be determined. This is accomplished by the following training method.
In step 140, a plurality of spoken-sound examples is provided. A spoken-sound example contains two components: the first component is a set of samples of the spoken sound, or features extracted from them; the second component is the corresponding desired classifier output.
Next, in step 142, a trainer compares the number of spoken-sound examples with the number of polynomial coefficients in the discriminant function.
In step 144, a check is made to determine whether the number of polynomial coefficients equals the number of examples. If so, the method proceeds to step 146; if not, the method proceeds to step 148.
In step 146, a matrix-inversion technique is used to solve for the value of each polynomial coefficient.
In step 148, a least-squares estimation technique is used to solve for the value of each polynomial coefficient. Suitable least-squares estimation techniques include, for example, least squares, extended least squares, pseudo-inverse, Kalman filtering, maximum-likelihood algorithms, and Bayesian estimation.
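Steps 146 and 148 can be sketched together with the normal equations: when the number of examples equals the number of coefficients, solving the linear system reduces to plain matrix inversion; with more examples than coefficients, the same equations give the ordinary least-squares estimate. This is only one of the estimation techniques listed above, and the training data below are made up for illustration.

```python
# Solve for polynomial coefficients w from examples via the normal
# equations (X^T X) w = X^T y, using pure-Python Gaussian elimination.

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def least_squares(X, y):
    """Least-squares coefficients from example matrix X and targets y."""
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    return solve(XtX, Xty)

# Fit y = w0 + w1*x to four noiseless examples of y = 2 + 3x.
X = [[1.0, x] for x in (0.0, 1.0, 2.0, 3.0)]
y = [2.0 + 3.0 * x for x in (0.0, 1.0, 2.0, 3.0)]
print(least_squares(X, y))  # close to [2.0, 3.0]
```

In the patent's setting, each row of X would hold the polynomial terms evaluated on one spoken-sound example's features, and y would hold the desired classifier outputs.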
In implementing a classifier usable in an embodiment of the present invention, one generally selects a classifier in which the number of computing elements is equal to or less than the number of spoken-sound examples provided to the learning machine.
Fig. 8 shows a block diagram of a system for training a learning machine in accordance with an embodiment of the present invention. Speech samples are received by microphone 2 and passed to converter 162 over transmission medium 4. A speech sample corresponds to the first component of a spoken-sound example. Upon receiving a speech sample, converter 162 performs various functions on it, including waveform sampling, filtering, and analog-to-digital (A/D) conversion. Converter 162 produces as output a speech signal, which is passed to feature extractor 166 via bus 164.
Feature extractor 166 produces a plurality of features from the speech signal and passes them to trainer 172 via bus 168. In addition to the features, trainer 172 receives desired classifier outputs over bus 170. Each desired classifier output received by trainer 172 corresponds to the features of a speech sample provided over bus 168; together they constitute a spoken-sound example, which trainer 172 then uses to compute the polynomial coefficients. Trainer 172 computes the polynomial coefficients according to the method shown in Fig. 7. The coefficients are passed to learning machine 176 over bus 174. Learning machine 176 uses the polynomial coefficients received on bus 174 to construct a classifier, which can be used in recognition means 200 of Fig. 9. The output of learning machine 176 is provided on bus 180.
When a user utters continuous speech, microphone 2 generates a signal that represents the acoustic waveform of the speech. The signal from microphone 2 is typically an analog signal, which is fed to converter 162 for digitization. Converter 162 includes suitable means for A/D conversion. The A/D converter may sample the signal from microphone 2 several thousand times per second (e.g., between 8000 and 14,000 times per second in a preferred embodiment of the present invention, depending on the frequency components of the speech signal from the microphone). Each sample is then converted to a digital word, where the length of the word is between 12 and 32 bits.
Those skilled in the art will understand that the sampling rate and word length of the A/D converter may vary, and that the values given above do not limit the sampling rate or word length of an A/D converter included in an embodiment of the present invention.
The speech signal comprises one or more digital words, where each digital word represents a sample of the continuous speech taken at a successive instant in time. The speech signal is passed to feature extractor 166, where the digital words are grouped into data frames over intervals of time. In a preferred embodiment of the present invention, each data frame represents approximately 10 ms of the speech signal. However, those skilled in the art will recognize that other data-frame durations may be used, depending on factors such as the duration of the spoken sounds to be identified. The data frames are then subjected to cepstral analysis, a method of feature extraction performed by feature extractor 166.
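The framing step above can be sketched as follows. The 8 kHz sampling rate and 10 ms frame duration come from the ranges given in the text; the signal itself is dummy data, and the non-overlapping framing is an assumption for illustration.

```python
# Group digitized samples into consecutive, non-overlapping frames,
# as feature extractor 166 does before cepstral analysis.

SAMPLE_RATE = 8000                            # samples per second
FRAME_MS = 10                                 # frame duration in ms
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 80 samples per frame

def to_frames(samples):
    """Split a sample stream into full frames; a trailing partial frame is dropped."""
    return [samples[i:i + FRAME_LEN]
            for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN)]

signal = list(range(200))            # 25 ms of dummy samples
frames = to_frames(signal)
print(len(frames), len(frames[0]))   # 2 80
```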
Cepstral analysis, the feature extraction performed on the speech signal, produces a representation of the speech signal that characterizes the relevant features of the continuous speech over the time interval. It can be regarded as a data-reduction procedure that retains the vital characteristics of the speech signal and eliminates undesirable interference from irrelevant characteristics of the signal, thus easing the decision-making process for the plurality of classifiers.
Cepstral analysis is performed as follows. First, a P-th-order (typically P = 12 to 14) linear prediction analysis is applied to a set of digital words from the speech signal to yield P prediction coefficients. The prediction coefficients are then converted into cepstral coefficients using the following recursion:

c(n) = a(n) + Σ(k=1 to n-1) (1 - k/n) · a(k) · c(n-k)    (2)

where c(n) represents the n-th cepstral coefficient, a(n) represents the n-th prediction coefficient, 1 ≤ n ≤ P, P equals the number of cepstral coefficients, n is an integer, a(k) represents the k-th prediction coefficient, and c(n-k) represents the (n-k)-th cepstral coefficient.
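Equation (2) transcribes directly into code. The prediction coefficients below are made-up values used only to exercise the recursion; 1-based indexing mirrors the equation, with a[0] as unused padding.

```python
# Convert P prediction coefficients a(1..P) into P cepstral
# coefficients c(1..P) via Equation (2):
#   c(n) = a(n) + sum_{k=1}^{n-1} (1 - k/n) a(k) c(n-k)

def lpc_to_cepstrum(a):
    """a[1..P] -> c[1..P]; a[0] is unused padding."""
    P = len(a) - 1
    c = [0.0] * (P + 1)
    for n in range(1, P + 1):
        c[n] = a[n] + sum((1 - k / n) * a[k] * c[n - k] for k in range(1, n))
    return c

a = [0.0, 0.5, 0.2, 0.1]   # assumed a(1)=0.5, a(2)=0.2, a(3)=0.1
c = lpc_to_cepstrum(a)
print(c[1], c[2])           # 0.5 and 0.2 + 0.5*0.5*0.5 = 0.325
```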
The vector of cepstral coefficients is usually weighted by a sine window of the form

α(n) = 1 + (L/2) sin(πn/L)    (3)

where 1 ≤ n ≤ P and L is an integer constant, giving the weighted cepstral vector ĉ(n), where

ĉ(n) = c(n) α(n)    (4)
This weighting is commonly referred to as cepstral liftering. The effect of liftering is to smooth the spectral peaks in the spectrum of the speech signal. It has been found that cepstral liftering suppresses the variability of the low- and high-order cepstral coefficients, thereby significantly improving the performance of the speech-recognition system.
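The lifter of Equations (3) and (4) can be sketched as follows. The value L = 12 and the input coefficients are assumptions for illustration; the patent only states that L is an integer constant.

```python
# Apply the sinusoidal lifter alpha(n) = 1 + (L/2) sin(pi*n/L)
# to a cepstral vector c(1..P), per Equations (3) and (4).
import math

def lifter(c, L=12):
    """Return the liftered coefficients c(n) * alpha(n), n = 1..P."""
    return [cn * (1 + (L / 2) * math.sin(math.pi * n / L))
            for n, cn in enumerate(c, start=1)]

c = [0.5, 0.325, 0.24]
print(lifter(c)[0])   # 0.5 * (1 + 6*sin(pi/12)) ≈ 1.276
```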
The result of cepstral analysis is a smoothed log spectrum corresponding to the frequency components of the speech signal over the time interval. The significant characteristics of the speech signal are thus preserved in this spectrum. Feature extractor 166 generates a corresponding feature frame, which contains data points of the spectrum computed from the corresponding data frame; the feature frame is then passed to trainer 172.
In a preferred embodiment of the present invention, a feature frame contains twelve data points, where each data point represents the value of the cepstrally smoothed spectrum at a specific frequency over the time interval. The data points are 32-bit digital words. Those skilled in the art will understand that the present invention places no limit on the number of data points per feature frame or on the bit length of the data points; a feature frame may contain twelve data points or any other suitable number, and the data points may be 32 bits, 16 bits, or any other length.
In one embodiment of the present invention, the system shown in Fig. 8 is implemented by software running on a processor such as a microprocessor. However, those skilled in the art will recognize that a programmable logic array, an ASIC, or other digital logic device could also be used to implement the functions performed by the system of Fig. 8.
Fig. 9 shows a block diagram of a speech-recognition system that includes an embodiment of the present invention. The speech-recognition system comprises microphone 2, converter 162, feature extractor 166, and recognition means 200. Recognition means 200 in turn comprises a plurality of classifiers and a selector. Of the plurality of classifiers, letter classifiers 202, 204, and 206 are shown, together with delimitative-sound classifier 208. The classifiers are provided to recognition means 200 over bus 180, as shown in Fig. 8.
Continuous speech is received by microphone 2 and converted into an electrical signal, which is transmitted over medium 4 to converter 162. Converter 162 and feature extractor 166, connected by bus 164, perform substantially the same functions in substantially the same manner as described above with reference to FIG. 8. Feature extractor 166 generates a feature frame, which is then distributed over bus 198 to the plurality of classifiers contained in recognizer 200. In the example given in FIG. 9, four of the plurality of classifiers are shown.
Each classifier implements a different discriminant function. In the example shown, classifier 202 implements a discriminant function for the spoken sound representing the letter "A", classifier 204 implements a discriminant function for the spoken sound representing the letter "B", and classifier 206 implements a discriminant function for the spoken sound representing the letter "Z". Delimitative-sound classifier 208 implements a discriminant function for delimitative sounds. The discriminant function implemented by each classifier in recognizer 200 takes the form of the polynomial expansion given by Equation (1).
In this example, the result of the discriminant function implemented by classifier 202 is passed to selector 210 over bus 212, the result of the discriminant function implemented by classifier 204 over bus 214, and the result of the discriminant function implemented by classifier 206 over bus 216. In addition, the result of the discriminant function implemented by delimitative-sound classifier 208 is passed to selector 210 over bus 218.
Selector 210 determines which classifier output has the greatest magnitude and then produces, on output 220, a representation of the identity of the corresponding spoken sound. When continuous speech contains spoken sounds separated by delimitative sounds, the delimitative sounds must also be identified in order to determine the boundaries of each spoken sound. Determining the boundaries of a spoken sound increases the probability that the sound will be identified correctly. Delimitative sounds are identified in the same manner as spoken sounds, that is, by classifying their features.
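The recognizer just described can be sketched in a few lines. This is an illustrative toy, not the patented system: the coefficients, exponent matrices, and two-element feature frame are invented for the example. Each classifier evaluates the polynomial discriminant of Equation (1), y = Σᵢ w(i−1) · x₁^g1i · x₂^g2i · … · xₙ^gni, and the selector picks the label whose output has the greatest magnitude.

```python
import numpy as np

def discriminant(x, w, g):
    """Equation (1): w is the (m,) coefficient vector, g the (m, n) exponent
    matrix, x the (n,) feature frame; each term is w[i] * prod_j x[j]**g[i, j]."""
    return float(np.sum(w * np.prod(x ** g, axis=1)))

def select(outputs):
    """Selector 210: return the label with the largest-magnitude output."""
    return max(outputs, key=lambda label: abs(outputs[label]))

x = np.array([0.5, 1.2])  # a (hypothetical) two-point feature frame
classifiers = {
    "A": (np.array([1.0, 0.3]), np.array([[1, 0], [0, 1]])),
    "B": (np.array([0.2, 0.1]), np.array([[2, 0], [0, 2]])),
    "delimiter": (np.array([0.05]), np.array([[1, 1]])),
}
outputs = {k: discriminant(x, w, g) for k, (w, g) in classifiers.items()}
best = select(outputs)  # the recognized identity
```

Note that the delimitative-sound classifier participates on exactly the same footing as the letter classifiers, which is the point of the scheme: boundaries are recognized by classification, not by a separate segmentation stage.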
In one embodiment of the invention, the system shown in FIG. 9 is implemented by software running on a processor such as a microprocessor. However, those skilled in the art will recognize that a programmable logic array, an ASIC, or another digital logic device could also be used to implement the functions performed by the system of FIG. 9.
In summary, there have been described herein a concept and several embodiments, including a preferred embodiment, of a method and system for identifying boundaries between spoken sounds in continuous speech.
Because the embodiments of the method and system for identifying delimitative sounds in continuous speech do not require lengthy or repeated training periods, they are more acceptable to users.
In addition, the embodiments of the present invention described herein allow delimitative sounds in continuous speech to be identified in the same manner as spoken sounds, thereby reducing the overall complexity and cost of a speech-recognition system.
It will be apparent to those skilled in the art that the disclosed invention may be modified in numerous ways and may assume many embodiments other than the preferred form specifically set out above.
Accordingly, the appended claims are intended to cover all modifications of the invention that fall within its true spirit and scope.

Claims (10)

1. A speech-recognition system, characterized in that it comprises:
a recognizer that receives a plurality of features extracted from continuous speech, the recognizer identifying delimitative sounds in the continuous speech by applying the plurality of features to a classifier.
2. The speech-recognition system according to claim 1, characterized in that it further comprises:
a learning machine operably coupled to the recognizer;
a trainer, operably coupled to the learning machine, for training the learning machine to identify delimitative sounds by establishing the classifier, the trainer receiving a plurality of spoken-sound examples;
a converter for receiving the continuous speech and producing a speech signal; and
a frame extractor, responsive to the speech signal and operably coupled to the recognizer, for extracting the plurality of features from the continuous speech.
3. The speech-recognition system according to claim 2, characterized in that the trainer computes the coefficients of a polynomial expansion having at least one term, the trainer comprising:
comparison means for comparing the number of spoken-sound examples with the number of terms of the polynomial expansion;
generating means, operably coupled to the comparison means, for providing two techniques for computing the coefficients, wherein
(i) if the number of terms of the polynomial expansion equals the number of spoken-sound examples, a matrix-inversion technique is provided to solve for the coefficients; and
(ii) if the number of terms of the polynomial expansion is less than the number of spoken-sound examples, a least-squares estimation technique is provided to solve for the coefficients.
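The two coefficient-solving techniques of claim 3 can be sketched with standard linear algebra. This is an illustrative reading, not the patent's implementation: the function name, the example matrix, and the target outputs are invented. Each row of `A` holds the polynomial terms evaluated on one spoken-sound example, and `y` holds the desired classifier outputs.

```python
import numpy as np

def solve_coefficients(A, y):
    """Sketch of claim 3: pick matrix inversion when the number of examples
    equals the number of terms, least squares when there are more examples
    than terms (illustrative names and shapes)."""
    n_examples, n_terms = A.shape
    if n_examples == n_terms:
        return np.linalg.solve(A, y)                 # matrix-inversion technique
    # least-squares estimation technique
    return np.linalg.lstsq(A, y, rcond=None)[0]

# 3 spoken-sound examples, 2 polynomial terms (constant term and x):
A = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y = np.array([1.0, 2.0, 3.0])
w = solve_coefficients(A, y)  # least-squares branch
```

Here the data happen to fit the two-term model exactly, so the least-squares solution recovers the coefficients 0.5 and 1.0.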
4. The speech-recognition system according to claim 1, characterized in that the recognizer comprises a neural network.
5. The speech-recognition system according to claim 1, characterized in that the classifier implements a polynomial expansion.
6. The speech-recognition system according to claim 5, characterized in that the polynomial expansion has the form:
y = Σ_{i=1}^{m} w_{i-1} · x_1^{g_1i} · x_2^{g_2i} · … · x_n^{g_ni}
where y represents the output of the classifier;
i, m, and n are integers;
w_{i-1} represents the coefficient of the i-th term of the polynomial expansion;
x_1, x_2, …, x_n represent the inputs of the classifier; and
g_1i, g_2i, …, g_ni represent the exponents of the i-th term of the polynomial expansion, which are applied to the inputs of the classifier.
7. The speech-recognition system according to claim 1, characterized in that the plurality of features corresponds to the continuous speech over a time interval.
8. The speech-recognition system according to claim 1, characterized in that the plurality of features is selected from the group consisting of cepstral coefficients, predictive coefficients, and Fourier coefficients.
9. The speech-recognition system according to claim 1, characterized in that the speech-recognition system identifies a plurality of spoken sounds in the continuous speech, and further comprises:
a plurality of classifiers for classifying the plurality of spoken sounds, each of the plurality of classifiers having a corresponding class and implementing a discriminant function that is applied to a plurality of frames to produce a result indicating the likelihood that one of the plurality of spoken sounds belongs to the corresponding class; and
a selector, operably coupled to each of the plurality of classifiers, for identifying one of the plurality of spoken sounds by comparing the results from each of the plurality of classifiers.
10. The speech-recognition system according to claim 9, characterized in that at least one of the plurality of classifiers classifies delimitative sounds.
CN 95195359 — priority 1994-09-30, filed 1995-08-17 — Method and system for recognizing a boundary between sounds in continuous speech — Pending — CN1163009A (en)

Priority Applications (1)

- CN 95195359 (CN1163009A) — priority 1994-09-30, filed 1995-08-17 — Method and system for recognizing a boundary between sounds in continuous speech

Applications Claiming Priority (2)

- US 08/315,474 — filed 1994-09-30
- CN 95195359 (CN1163009A) — priority 1994-09-30, filed 1995-08-17 — Method and system for recognizing a boundary between sounds in continuous speech

Publications (1)

- CN1163009A — published 1997-10-22

Family

- ID=5082902

Family Applications (1)

- CN 95195359 — Method and system for recognizing a boundary between sounds in continuous speech — Pending

Country Status (1)

- CN: CN1163009A (en)

Cited By (4)

* Cited by examiner, † Cited by third party

- CN103345922A — priority 2013-07-05, published 2013-10-09 — Zhang Wei — Fully automatic segmentation method for long-duration speech
- CN103345922B — granted 2016-07-06 — Zhang Wei — Fully automatic segmentation method for long-duration speech
- CN106104674A — priority 2014-03-24, published 2016-11-09 — Microsoft Technology Licensing LLC — Mixed speech recognition
- CN106104674B — granted 2019-10-01 — Microsoft Technology Licensing LLC — Mixed speech recognition


Legal Events

- C06 / PB01 — Publication
- C10 / SE01 — Entry into substantive examination (request for substantive examination in force)
- C02 / WD01 — Patent application deemed withdrawn after publication (patent law 2001)