CN1120372A

CN1120372A - Speech processing

Info

Publication number: CN1120372A
Application number: CN94191652A
Authority: CN
Inventors: 塞缪尔·加文·史密斯
Original assignee: British Telecommunications PLC
Current assignee: Bt Levin Scott LLC; Cisco Levin Scott LLC; Cisco Technology Inc
Priority date: 1993-03-31
Filing date: 1994-03-31
Publication date: 1996-04-10
Anticipated expiration: 2014-03-31
Also published as: CN1196104C; NO953895L; KR100309205B1; DE69416670T2; HK1014390A1; DE69416670D1; CA2158064A1; NZ263223A; FI954572A0; WO1994023424A1; JPH08508350A; AU682177B2; SG47716A1; AU6382994A; NO953895D0; FI954572A; NO308756B1; CA2158064C

Abstract

A path link passing speech recognition system and method for recognising input connected speech, the recognition system having a plurality of vocabulary nodes (24) associated with word representation models, at least one of the vocabulary nodes (24) of the network being able to process more than one path link simultaneously, so allowing for more than one recognition result.

Description

Speech processes

The present invention relates to speech processes, relate more specifically to be used to handle the system of the alternative syntactic analysis of continuous speech.

Speech processes comprises that the spokesman discerns, and detects or verify spokesman's identity therein; And speech recognition, anyone can both use a system and need not through the recognizer training therein; And the so-called spokesman identification of being correlated with, the user who allows a system of operation therein is restricted and needs a training stage to come to obtain information from the user of each permission, usually speech data is input in the so-called front-end processor with digital form in identification is handled, its is derived from the input audio data stream, and a group of being called front end feature set or vector more compacts, more significant data on the perception.For example, voice are usually via microphone input, sampling, and digitizing is cut into the frame that length is 10-20ms (such as sampling) on 8KHz, and calculates one group of coefficient for each frame.In speech recognition, suppose that usually the spokesman says in one group of known word or expression.A kind of expression of storage that is called the word or expression of model or model is included in the situation of the irrelevant identification of spokesman in advance a reference characteristic matrix of this word of obtaining from a plurality of spokesmans.With input feature vector vector and a model contrast and a generation similarity measure between the two.

Speech recognition (no matter being the mankind or machine) is easy to generate mistake and may causes the wrong identification of word.If discerned a word or expression improperly, then speech recognition provides another time trial in identification, and this may be again correct or incorrect.

Proposed variously to be used for processed voice and to select to import the method for the best or alternative coupling between the voice model of voice and storage or the model.In isolated word identification system, the generation of alternative coupling is quite simple and clear: each word is in the transfer network of a word of indicating to discern one independently ' path ', and independently the word path only links to each other on the terminal point of network.Will be from all paths that network comes out just can provide best and alternative coupling by them and the similarity ordering of model of storage and so on.

Yet, in most of continuous recognition systems and some isolated-word-recognition system based on continuous recognition technology, all paths of always might not recombinating on the terminal point of network directly obtain optimum matching the information that can not obtain from the exit point of network and can not obtain alternative coupling.A kind of solution that produces the problem of optimum matching is discussed in S.J.Young, N.H.Russell in 1989 and JHSThornton " token transmission: a kind of simple concept model of continuous speech recognition system " to some extent, and this article relates to the packets of information that is called token by a transfer network transmission.One of comprising about the similarity degree between the network portion information in the part path of passing through and expression input and that handled till this moment of token accumulates score.

Described as people such as Young, at every turn with a frame phonetic entry to transfer network the time, any token transmission that just will appear on the input end of a node is advanced in this node, and with word model that these nodes are associated in the current speech frame of coupling.New then token appears on the output terminal of node (" advanced " passed through be associated with this node model).At this moment have only the token of best score to be passed on the input end of node of back.When (such as an external unit such as time-out detecting device) sends the end of signalisation voice, a single token will appear on finish node.The routing information that is included in the front in the token by utilization is recalled along this path, just can extract the entire path by network from this token, and the optimum matching to the input voice is provided.

The paper of S.C.Austin and F.Fallside " adopts a kind of unified direction mechanism of the automatic speech recognition of latent Markov model ", and (ICASSP 1989, volume 1, the 667-670 page or leaf) relate to be similar to a kind of continuous word pronunciation recognizer that the described modes of people such as above-mentioned Young are operated.The experience of relevant identification process by transfer network is upgraded when word model is come out.During end of identification, recognition result is to draw from the experience with best score of submitting to output terminal.Terminate in path on the finish node for each bar, it is possible having only a kind of experience.

This known configuration does not allow easily to go standby selection on the output terminal of network.

According to the present invention, a kind of speech recognition system that is used to discern the path link transmission of input continuous speech comprises: the device that is used for deriving from an input speech signal recognition feature data; Treating apparatus is used to constitute the model of the input voice of expectation, and is used for the expectation input voice of recognition feature data and component model are compared, and this treating apparatus has a plurality of lexical nodes that are associated with the word list representation model; And the device that is used to rely on the identification of comparative result indication input speech signal, it is characterized in that at least one lexical node can handle the path link of one or more simultaneously.

This configuration shows at given moment node can handle the access path link of one or more, thereby and can draw more than one recognition result.

The device of component model preferably includes the transfer network of a lexical node that has a plurality of noise nodes and be associated with the word list representation model.These nodes can the generation pass link, comprises in the path link: the field that is used to store the pointer that points to last paths link; The accumulation score of one paths; Point to the pointer of previous node; An and markers that is used for segmental information.Best, the lexical node that can handle one or more path link has the word list representation model of more than one identical association.

At least one lexical node has the word list representation model that is associated more than one and allows speech recognition device to handle mulitpath simultaneously in the regulation network, and therefore allows one or more path link to pass between each node on each incoming frame chain to propagate.As a result, the present invention has set up the many levels of a transfer network, can propagate some alternative paths along them.Best sub-path can be by first models treated of a node, suboptimum by second model, by that analogy, use until exhausted up to parallel model or the path that enters.

In general noun " network ", include to acyclic graph (DAG) with the tree.DAG is the network on band road not, and tree then is a network on the conceptive terminal point that just in time appears at network of unique point in path.

Noun ＂ word ＂ represents a basic recognition unit here, and it can be that a word can be a double-tone, phoneme, phoneme variant etc. too.Identification then be with the pronunciation an of the unknown and the matching process of a predefined transfer network, this network be designed to the user say possibly consistent.

In order to identify the phrase of having discerned, can comprise the device that is used for recalling path link in the system by network.

In addition, this system also can comprise and be used for distributing to a mark at least some has the device of the node of the word list representation model that is associated, and the mark that is used for each paths of comparison is to determine to have the device to the path of input voice optimum matching and the alternative coupling of suboptimum.

This configuration allow in nature with the inevitable different and different a kind of alternatives on cutting or noise coupling just of optimum matching.

The word list representation model can be latent Markov model (HMM), model, dynamic time coiling model or other any suitable word list representation model described in " the latent Markov model of automatic speech recognition: theoretical with use " of No. 2 the 105th page of Cox of Britain's telecommunication technique magazine (British Telecom Technology Journal) the 6th volume April in 1988 prevailingly.The processing of carrying out in the model and the present invention are irrelevant.

Not all node that comprises word model associated therewith all must have the mark of distributing to them.The structure that depends on transfer network is as long as it is just enough that mark is distributed to those nodes of the predicate node front that appears in the network.Here employed decision-point is meant the point of the access path that has one or more in the network.

Can check the part path on some decision-point in network, on these decision-points, be applied with some constraint, therefore the path that only meets these constraints is propagated as the applicant by name what submitted on March 31st, 1994 " continuous speech recognition " international patent application (from European application 933025397 and 933045031 proposition priority requests) described in, be combined in this by reference.Each decision-point is related with one group of significant notation, and abandons all and have the not path link of the mark in this group.

The accumulation mark can be used for identifying complete path, just can not determine the path body owing to need not move around on path link, and can not generate the part routing information of token fully, and obtain very high operating efficiency.In this case, tag field must be even as big as identifying all paths uniquely.

In order to operate efficiently according to system of the present invention, the signal Processing of path tag is preferably in once to be finished in the single operation to improve processing speed.

Here disclose and proposed the requirement of others of the present invention and preferred embodiment, after this its advantage will be conspicuous.

Just further describe the present invention with reference to the accompanying drawings in the mode of example, in the accompanying drawing:

Fig. 1 is shown schematically in and adopts in the telecommunications environment according to recognition processor of the present invention;

Fig. 2 is the block scheme of schematically illustrated function element according to recognition processor of the present invention;

Fig. 3 is the block scheme of parts of sorter that schematically shows the part of pie graph 2;

Fig. 4 is the block scheme of the structure of a sequence parser of the part among the embodiment that Fig. 2 schematically is shown;

Fig. 5 schematically shows including of a field in the storer of a part of pie graph 5;

Fig. 6 is the synoptic diagram of embodiment of a recognition network of processor that can be applicable to the sequence parser of Fig. 4;

Fig. 7 a illustrates a node of network, and Fig. 7 b illustrates a paths link that adopts according to the present invention;

Fig. 8 to 10 illustrates advancing of the network of path link by Fig. 6;

Figure 11 is the synoptic diagram according to second embodiment of a network of system of the present invention;

Figure 12 is the synoptic diagram according to the 3rd embodiment of a network of system of the present invention.

Referring to Fig. 1, the telecommunication system that comprises speech recognition generally includes: a microphone 1 constitutes a part of a telephone bandset usually; A telecommunications network (being generally a public telecommunication switching network (PSTN)) 2; A recognition processor 3 connects into and receives a voice signal that comes automatic network 2; And an application apparatus 4, being connected and being configured on the recognition processor 3 receive a speech recognition signal from it, whether indication has discerned a particular words or phrase, and takes action according to it.For example, application apparatus 4 can be a remote-operated banking terminal that is used to carry out banking business.

In many cases, application apparatus 4 will send audible replying to the spokesman, and this is a loudspeaker 5 that is transferred to the part of common formation user mobile phone by network 2.

In operation, the spokesman speaks to microphone 1, an analog voice signal then enters network 2 from microphone 1 transmission and arrives recognition processor 3, analyze this voice signal therein, and generated a sign particular words or a phrase whether signal and be transferred to application apparatus 4, in the situation of having discerned voice, it just takes suitable action then.

Usually, recognition processor need be gathered and compare the relevant speech data of confirming voice signal, and this data acquisition can be carried out by the recognition processor in second operator scheme, in this pattern, recognition processor 3 is not connected on the application apparatus 4, but constitutes the recognition data of this word or expression from microphone 1 received speech signal.Yet other method of gathering voice recognition data also is possible.

Usually, recognition processor 3 does not know to go to and the path of being got by network 2 from the signal of microphone 1; Any in the receiver mobile phone of type miscellaneous and quality.Equally, in network 2, can adopt any in the transmission path miscellaneous, comprising radio link, simulation and digital path or the like.Thereby, arrive the voice signal Y of recognition processor 3 corresponding with the voice signal S that on microphone 1, receives, wherein reeled microphone 1, to the link of network 2, by network 2 channel and to the transmission feature of the link of recognition processor 3, they can concentrate and appointment with a single transmission feature H.

Referring to Fig. 2, recognition processor 3 comprises: an input end 31 is used for receiving the voice of (from a digital network or from an analog to digital converter) digital form; A Frame Handler 32 is used for numeral sample in succession is divided into the frame in succession of the sample of adjacency; A feature extractor 33 is used for generating a characteristic of correspondence vector from a frame sample; A sorter 34, the eigenvector that receives is in succession also operated on each vector to generate recognition result with a plurality of model states; A sequencer 35 is configured to receive from the separation results of sorter 34 and determines that also the sorter output sequence shows the predetermined pronunciation of maximum comparability; And an output port 38, an identification signal of the voice that indication identified is provided thereon.Frame maker 32

Frame maker 32 is configured to receive speech samples with the speed such as 8,000 samples of per second, and comprises 256 frames in abutting connection with sample with the frame rate formation of every 16ms one frame.Best, each frame is to adopt (be near the frame edge sample be to multiply by a predetermined weighting constant) that be split into window such as the Hamming window to reduce the false artefact that is generated by the frame edge.In a preferred embodiment, frame is that overlapping (such as overlapping 50%) is so that improve the effect of window.Feature extractor 33

Feature extractor 33 receptions generate a stack features vector from the frame of frame maker 32 and in various situations.(for example can comprise in the feature such as the cepstra coefficient, LPC cepstra system or mark ear frequency cepstra coefficient described in chollet and the Gagnoulet " about the speech recognition of adopting comparison system and the evaluation of database " (2026 pages in 1982IEEE journal)), the perhaps difference value of these coefficients, wherein comprise poor between the coefficient of correspondence value in the vector of this coefficient and front for each coefficient, as " adopting instantaneous and transition spectrum information " about discerning the spokesman at soong and Rosenberg, IEEE acoustic journal in 1988, voice and signal Processing volume 36, the 6th, 871 pages, described in.Equally, also can adopt the mixing of some kinds of characteristic numbers.

Frame number of feature extractor 33 outputs, the frame number of the frame that each is follow-up increases by one.The output of feature extractor 33 also passes to an end indicator 36, and the output terminal of indicator is connected on the sorter 34.End indicator 36 detects the end of voice, and its all kinds are well-known in this area.

In the present embodiment, frame maker 32 is to provide with single suitably digital signal processor (DSP) equipment (such as DSP56000 of Motorola or the TMSX320 of Texas Instruments) or a similar devices of programming with feature extractor.Sorter 34

Referring to Fig. 3, in the present embodiment, sorter 34 comprises a sorting process device 341 and next status register 342.

Status register 342 comprises mode field 3421,3422, is used for each of multiple voice status.For example, each phoneme variant that recognition processor will be discerned comprises three kinds of states, thereby provides three mode fields for each phoneme variant in status register 342.

Sorting process device 34 is configured to read by turns each mode field in the access to memory 342, and uses current input feature vector coefficient sets to calculate input feature vector collection or vector and the corresponding probability of corresponding state for each mode field.

Correspondingly the sorting process device is output as a plurality of state probability P, and each state in a kind of probability corresponding states storer 342 indicates input feature vector vector and the corresponding likelihood of various state.

Sorting process device 341 can be suitably digital signal processing (DSP) equipment of programming, especially may be the digital signal processing appts identical with feature extractor 33.Sequencer 3.5

Referring to Fig. 4, the sequencer 35 in the present embodiment comprises a status switch storer 352, a syntactic analysis processor 351 and a sequencer output buffer 354.

Also be provided with a state probability storer 353, be used to the state probability of each treated frame storage sorting process device 341 output, status switch storer 352 comprises a plurality of status switch fields 3521,3522,, each is corresponding to the word or expression sequence that will discern that is made of a phoneme body.

Each status switch in the status switch storer 352 comprises some state P1 as shown in Figure 5, P2, PN (wherein N is 3 multiple), and to two kinds of probability of every kind of state a: recurrence probability (Pi1) and to an a kind of transition probability (Pi2) of state down.State in the sequence is a plurality of groups of three kinds of each states relevant with single phoneme body.Therefore, what observe can comprise that with series of frames associated state sequence the several times of each the state Pi in various status switch model 3521 grades repeat, for example: frame number 123456789 ... z z+1 dog attitude P1 P1 P1 P2 P2 P2 P2 P2 P2 ... Pn Pn

Syntactic analysis processor 351 is configured to read the state probability of sorting process device 341 outputs on each frame, and the state probability of earlier stored in state probability storer 353, and the most probable state path till arriving on computing time, and itself and each status switch of being stored in the status switch storer 352 compared.

Calculate and adopt the famous latent Markov modelling of being discussed in the Cox paper cited above (HMM).The HMM that syntactic analysis processor 351 is carried out handles and utilizes famous Viterbi algorithm easily.Syntactic analysis processor 351 can be the microprocessor such as Interi-486 (trade mark) microprocessor or Motorola (trade mark) 68000 microprocessors, also can be a DSP equipment (for example, with as the identical DSP equipment of any one processor of front).

Correspondingly for each status switch (corresponding to a word, phrase or other voice sequence that will discern), syntactic analysis processor 351 is probability score of output on each input speech frame.For example the state preface is to the name that can comprise in the telephone directory.When detecting the pronunciation end, indicate corresponding name, the word or expression that has identified to the label signal a of a most probable status switch of expression of output port 38 outputs from syntactic analysis processor 351.

Syntactic analysis processor 351 comprises that a special configuration is used for discerning the network such as particular phrase such as numeric string or word.

Fig. 6 illustrates a simple network that is used to discern a word strings, is one four word strings or one three word strings in this example.Each node 12 of network is to be associated with a word list representation model 13 such as HMM, and it is stored in the model table.Some nodes can be associated with each model, and each node then comprises a pointer (if can find out) that points to its model that is associated from Fig. 6 and 7a.In order to generate an optimum matching and a single alternative syntactic analysis, finish node 14 is associated with two models, thereby allows this node processing two paths.N syntactic analysis if desired, then the finish node 14 of network is related with n identical word model.

As shown in Fig. 7 b, a paths link 15 comprises and relates to a pointer that points to last paths link, a pointer and a markers of accumulating score, pointing to the node that comes out previously.In the beginning of pronunciation, the path link 15 ' of a sky is inserted first node 16, as shown in Figure 8.Therefore at this moment first node comprises a paths link, becomes actively, and remaining node then is sluggish.On each clock signal (promptly having the speech frame that each enters), any active node mark that all on their path link, adds up.

The minimum seven frame voice if first model can mate then have the score of this seven frame and Model Matching and point to a path link 15 of ingress path link and the pointer of the node that mated just now from first node output on the 7th time clock ".This path link is fed to the node 12 of all back, as shown in Figure 9.At this moment three nodes in front enliven.In the model related, mate the speech frame of input then and export new path link with live-vertex.

Generate more path link along with the continuous elongated portion of the Model Matching of first node pronunciation, and the node of back carries out and similarly calculate, this processing continues.

When not importing the treated finish node 18 to network of voice, come the path link of automatic network each ' branch ' just can submit to this node 18.If on any given time frame, there is a single path link (promptly only having finished in the parallel route), just get this path link as (and unique) coupling of the best and be subjected to the processing of finish node 18.Yet, if there are two paths links to submit to the processing that finish node 18 boths are subjected to this node, because finish node 18 can be handled the path of one or more.The outgoing route link constantly upgrades on each frame voice.Two paths links, 15 will be arranged on the output terminal of network when pronunciation is finished, (therefrom remove the path link of sensing front and the pointer of node for brevity) as shown in Figure 10.

The whole piece path can be by follow the tracks of pointing to the front the pointer of path link find, and the pointer of the node that can point to out by observation identifies the node (thereby and think and discerned the input voice) on the path of identification.

Figure 11 represents to be configured to discern second embodiment of a network of the string of three bit digital.Grey node 22 is the empty node in the network; White nodes is a live-vertex, and they can be divided into lexical node 24 that has the word list representation model (not shown) that is associated that is used to mate the voice that enter and the noise node 25 of representing random noise.

Each can handle three paths (being that each lexical node 24 is associated with three word list representation models) if comprise the 3rd empty node 22 ' and later live-vertex thereof 24,25, then will comprise in the output of this network with system in three path link that the top score path is relevant.Fig. 8 to 10 is described as reference, and this three paths can be found out by the pointer of following the tracks of the path link of pointing to the front for each paths.The pointer of the node that can point to out by observation identifies the node (thereby think and discerned the input voice) on the path.

In of the present invention further developing, can strengthen path link with the sign of the important node of representing network.For example, these important node can comprise all lexical nodes 24.In the embodiment of Figure 11, distribute to 24 1 marks of each lexical node, for example distribute to mark ' 1 ' of node of expression numeral 1, distribute to the node 24 of expression numeral 2 " mark ' 2 ', and by that analogy.

When syntactic analysis began, a single dead circuit footpath link was transmitted into Web portal node 26.Because this is an empty node, this path link is passed to a next node, a noise node 25.This incoming frame of coupling in the noise model (not shown) of this node, and the path link of a renewal of generation on output terminal.Then this path link is delivered to next live-vertex, promptly has first lexical node 24 of the model (not shown) of an association.Each lexical node 24 is handled this speech frame and is generated the path link of a renewal in the word model that it is associated.The tag field of this path link also is updated.At the end of each time frame, sort these after upgrading path link and keep three (n bar) top score paths with different tag fields.By the accumulation mark that adds is that a table with the score ordering is safeguarded in this unique additional constraint: if second path link that has same tag enters, then keep one preferable among both.Only comprise the different paths of best n bar in this table, and ignore remaining.

This n paths link propagates into the noise node 25 and the lexical node 24 of back by next empty node 22 ' ", the word list representation model parallel connection that respectively this node is identical with three.After this, carry out models treated, the renewal and the path that draw the path link table reach among farther node 24 , 25.Should be understood that after

empty node

22 or 25 processing of noise node, the tag field of path link is not upgraded, because never distribute to these vertex ticks.

Path link is along the propagated by all the other live-vertexs, and the relative score that on an output node 28, generates the path that expression gets by network and mark (for example 121) reach three paths links.These path link constantly are updated up to the end that has detected voice (for example by such as external unit such as a time-out detecting device or up to arriving a time-out).In this, the pointer of the path link on the check output node 28 or accumulation mark are to determine recognition result.

For example, suppose at a time following three paths links to have been submitted to output node 28:

Score mark A 10 122B 9 122C 7 132 path A, the top score path is an optimum matching.Though path B has inferior top score because its mark, thereby and think that the voice of having discerned are identical with path A, therefore refuse it as an alternative syntactic analysis.Therefore, path C will be kept as the suboptimum syntactic analysis.

If the string of identification has than more structural discussed above,, then mark need only be distributed to the node that is right after in the decision-point front and get final product such as the name of reading; Rather than on each lexical node.Figure 12 illustrates a network of the pronunciation that is used to discern name ＂ Phil ＂, ＂ Paul ＂ and ＂ Peter ＂.For simplicity, not shown noise.The place that Fang Jiedian 44 expressive notations should strengthen.

' PHI ' and ' PAU ' path can be distinguished at ' L ' node by this system, are different because be based upon the mark of the path link on the node of front.The node 47 of back can be distinguished all three independent paths, because the mark of square node 44 is different.Have only ' L ' node and the final noise node 48 need be related, make these models can handle the path of one or more with word model identical more than.

In all situations, each network of the voice that diagram is to be identified need be analyzed with definite which node is wanted distribute labels.In addition, this network be configured to the user say possibly consistent.

Can accomplish to save storer volume and processing speed by the mark that limits a node propagation, described in the international patent application (proposing priority requests from European application 933025397) of by name " continuous speech recognition " submitted on March 31st, 1994 as the applicant, be combined in this by introducing.For example, have only following four numerals if say only effective input voice: 111,112,121,211 to a recognizer of network with Fig. 6.Some node in the network is associated with one group of significant notation, and a paths is only just propagated by this ' constraint ' node when having submitted path link with one of these marks to.In order to reach this point, check enters the tag field of the path link of a restraint joint (such as the 3rd empty node 22 ').If tag field comprises the mark beyond 1 or 2, no longer propagate this path just abandon this path.If submitted the path link of a permission to, just it is delivered to next node.Next restraint joint is the empty node 22 of next lexical node back ".This sky joint constraint becomes only to propagate has the path link of mark 11,12 or 21.Empty node 22 of next lexical node back are constrained to only to propagate has the path link of mark 111,112,121 or 211.This configuration has reduced necessary processing significantly, and the memory span of energy economy system.Have only some node on the decision-point in the network to need constraint like this.In practice, one 32 mark has been proved to be the sequence that is applicable to up to 9 bit digital.64 marks are applicable to the alpha-numeric string of 12 characters.

Voice detection of end and various other speech recognition features relevant with the present invention more fully propose to be combined in this by introducing in the international patent application of " speech recognition " by name that the applicant submitted on March 25th, 1994 (proposing priority requests from european patent application 933025413).

In the above-described embodiments, described and be suitable for being coupled to a recognition process unit on the telecommunication exchange.Yet, in another embodiment, present invention can be implemented on the simple mechanism of a traditional subscriber station being connected on the telephone network (mobile or fixing); In this case, the analog telephone signal that analog-digital commutator comes digitizing to enter can be set.

Claims

The speech recognition systems that 1 one kinds of path link are transmitted, voice are read by the company that is used to discern input, and this recognition system comprises: the device that is used for deriving from an input speech signal recognition feature data; Treating apparatus, voice are imported in the expectation that is used for input voice component model that will expectation and is used for comparison recognition feature data and component model, and this treating apparatus has a plurality of lexical nodes related with the word list representation model; And the device that is used for indicating the identification of input speech signal according to comparative result; It is characterized in that at least one lexical node can handle the path link of one or more simultaneously.
The 2 a kind of speech recognition systems according to claim 1 is characterized in that this at least one lexical node is related with word list representation model identical more than.
The 3 a kind of speech recognition systems according to claim 2 is characterized in that these word list representation models are latent Markov model.
4 according to any one a kind of speech recognition system in the claim 1,2 or 3, it is characterized in that all lexical nodes all have the mark of distributing to them.
5 according to any one a kind of speech recognition system in the claim 1,2 or 3, it is characterized in that the lexical node that only appears at the decision-point front just has the mark of distributing to them.
The 6 a kind of speech recognition systems according to one of claim 4 and 5 is characterized in that comprising in this path link an accumulation mark.
7 according to any one a kind of speech recognition system in the claim 4,5 or 6, it is characterized in that some node suffers restraints at least only to propagate the path link with certain predetermined labels.
8 according to any one a kind of speech recognition system in the claim 4 to 7, it is characterized in that this identification indicating device comprises that the score that is used for the comparison path link and mark determine to have with input continuous speech optimum matching and have the device in the path of the alternative coupling of suboptimum.
The method of 9 one kinds of speech recognitions comprises: the model that constitutes the input voice of expectation; From an input speech signal, derive the recognition feature data; The input voice of characteristic and component model are compared, and according to comparative result indication speech recognition, the input voice of expectation are to constitute a network that comprises a plurality of lexical nodes that are associated with the word list representation model; It is characterized in that at least one lexical node can handle more than one input simultaneously.
The 10 a kind of methods according to claim 9 is characterized in that this at least one lexical node is related with word list representation model identical more than.
The 11 a kind of methods according to claim 10 is characterized in that this at least one lexical node is related with the several same word list representation model of the number that equals desired recognition result.
The 12 a kind of methods according to one of claim 10 or 11 is characterized in that on each decision-point of network the relatively score of these path link, have only n bar top score path link just to propagate into following node.
13 according to any one a kind of method in the claim 10,11 or 12, it is characterized in that mark is distributed to all lexical nodes.
14 according to any one a kind of method in the claim 10,11 or 12, it is characterized in that only mark being distributed to the lexical node that appears at decision-point front in the network.
The 15 a kind of methods according to one of claim 13 or 14 when dependent claims 12, is characterized in that also the relatively mark of path link, and only comprising not, the path link of isolabeling just propagates into following node.
16 according to any one a kind of method in the claim 13,14 or 15, it is characterized in that retraining the path link that has certain predetermined labels in the tag field that some node at least only is delivered in them,
17 according to any one a kind of method in the claim 10 to 16, it is characterized in that thinking that the input voice of having discerned determine by recall path link through network.
18 according to any one a kind of method in the claim 13 to 16, it is characterized in that thinking that the input speech signal of having discerned is that the accumulation mark of each path link is determined.
19 according to any one a kind of method in the claim 10 to 18, it is characterized in that best score path link is to be handled by the first word list representation model of a vocabulary point, suboptimum by second models treated, more than analogize, use until exhausted up to parallel model or the path link that enters.