WO2012076895A1 - Reconnaissance de formes - Google Patents

Reconnaissance de formes (Pattern recognition)

Info

Publication number
WO2012076895A1
WO2012076895A1 (PCT application no. PCT/GB2011/052436)
Authority
WO
WIPO (PCT)
Prior art keywords
tokens
cost
wfst
propagation
token
Prior art date
Application number
PCT/GB2011/052436
Other languages
English (en)
Inventor
John McAllister
Roger Woods
Richard Veitch
Paul McCourt
Louis-Marie Aubert
Scott Fischaber
Laurent Wojcieszak
Original Assignee
The Queen's University Of Belfast
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Queen's University Of Belfast
Publication of WO2012076895A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/083 Recognition networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/12 Speech classification or search using dynamic programming techniques, e.g. dynamic time warping [DTW]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/085 Methods for reducing search complexity, pruning

Definitions

  • the present invention relates to improvements in or relating to pattern recognition, and in particular to novel systems and methods for pattern recognition, and to devices incorporating such systems and methods.
  • Pattern recognition is an important field of computer science and engineering.
  • a candidate pattern is compared with a knowledge source, such as entries in a database, and a pattern recognition algorithm is applied to select the entry in the knowledge database that corresponds or that most closely corresponds to the candidate pattern.
  • the present invention has applicability for pattern recognition in many diverse fields; however, as an example, it is useful to consider one of the possible application areas to which the invention could be applied - that of speech recognition.
  • the candidate pattern will be a segment of speech
  • the knowledge sources will comprise information about phonemes, words and grammar.
  • communications devices such as mobile telephones, and other handheld devices.
  • the increasing algorithmic complexity needed to achieve high accuracy even in noisy environments in conjunction with the tight energy constraints of battery operated devices has proved a barrier to implementing sophisticated speech recognition on hand-held devices and other consumer electronic devices where power consumption, processor load, and memory use are key concerns.
  • a speech recognition system should deliver an accurate text transcription in real-time, be speaker independent (i.e. no initial training required) and be capable of handling natural continuous speech with a rich vocabulary.
  • the complexity of these tasks exceeds the processing capabilities currently forecast for embedded processor technology.
  • word recognition accuracy or system vocabulary have to be sacrificed.
  • a device such as a mobile telephone
  • the servers return the transcribed text to the device.
  • this network-dependent service relies on the availability of a data connection and creates unnecessary traffic with a potentially long delay, which in turn drastically reduces the range of possible applications.
  • An object of the present invention is to mitigate or solve some of the problems identified with the prior art and to provide pattern recognition systems and methods that are faster and/or more efficient than existing solutions.
  • a method of pattern recognition comprising performing a spectral analysis on a candidate pattern to obtain a set of observation vectors; decoding said observation vectors to produce a candidate match between said observation vectors and entries from a knowledge source; wherein said step of decoding the observation vectors comprises modelling the knowledge source as a weighted finite state transducer (WFST) and modelling the propagation of tokens through nodes of the WFST for observation vectors derived from successive frames of the candidate pattern, and wherein, for each successive propagation, a list of tokens is sorted in order of their associated cost, and a pruning step is applied to remove tokens from said sorted list which go beyond a pre-determined cost threshold, before performing the next iteration.
  • the sorting comprises binning a candidate token into one of a set of bins, each of which is associated with a range of cost values; and the pruning step comprises ignoring the tokens in the bin associated with the highest costs.
  • the pruning step comprises ignoring the tokens in a plurality of bins, being the set of bins associated with the highest costs.
  • the cost thresholds for each bin scale non-linearly, with lower cost bins being larger than higher cost bins.
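  • by way of illustration only, the binning and pruning just described might be sketched as follows; the Token structure, the 64-bin count, the beam width and the exponential bin scaling are assumptions for the sketch rather than details taken from the disclosure:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Token:
    node: int          # WFST node the token currently sits on
    cost: float        # accumulated path cost (lower is better)
    history: list = field(default_factory=list)  # traceback of outputs

def bin_index(cost, best_cost, beam_width, num_bins=64):
    """Map a token cost into one of num_bins bins covering the beam."""
    if cost < best_cost:
        return 0
    rel = (cost - best_cost) / beam_width       # 0.0 .. 1.0 inside the beam
    if rel >= 1.0:
        return None                             # outside the beam: prune immediately
    # exponential mapping: costs near the best cost fall into wide bins,
    # costs near the beam edge into progressively narrower ones
    return min(num_bins - 1, int(num_bins * math.expm1(rel) / (math.e - 1)))

def bin_and_prune(tokens, beam_width=10.0, num_bins=64, bins_to_drop=8):
    """Sort tokens into cost bins and ignore the highest-cost bins."""
    best_cost = min(t.cost for t in tokens)
    bins = [[] for _ in range(num_bins)]
    for t in tokens:
        idx = bin_index(t.cost, best_cost, beam_width, num_bins)
        if idx is not None:
            bins[idx].append(t)
    kept = bins[: num_bins - bins_to_drop]      # pruning: drop the worst bins
    # concatenating the surviving bins yields a coarsely sorted token list
    return [t for b in kept for t in b]

if __name__ == "__main__":
    toks = [Token(node=n, cost=c) for n, c in enumerate([0.5, 3.2, 9.9, 1.1, 12.0])]
    for t in bin_and_prune(toks):
        print(t.node, t.cost)
```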
  • the token propagation is carried out by a dynamic programming algorithm, most preferably using a Viterbi decoding algorithm, which implements the token propagation method.
  • transitions between nodes of the WFST correspond to Hidden Markov Model (HMM) states.
  • the propagation of tokens through the WFST comprises, for each transition, the calculation of a cost that includes a contribution from the coefficient of the determined observation vector and a fixed arc cost that represents the cost associated with a transition to a given HMM state.
  • a set of said arc costs is stored as part of the knowledge source.
  • said candidate pattern is a speech utterance
  • said knowledge source comprises one or more of a phoneme lexicon, a word lexicon and a grammar lexicon.
  • the knowledge source comprises all of a phoneme lexicon, a word lexicon and a grammar lexicon stored in a single memory structure.
  • token propagation is run on different threads in parallel.
  • a pattern recognition system comprising a spectral analyser for analysing a candidate pattern to obtain a set of observation vectors; a decoder for decoding said observation vectors to produce a candidate match between said observation vectors and entries from a knowledge source; and a memory resource having stored thereon a weighted finite state transducer (WFST) that embodies said knowledge source; and a processor arranged, for each successive propagation, to model the propagation of tokens through nodes of the WFST for observation vectors derived from successive frames of the candidate pattern, to sort a list of tokens in order of their associated cost; and to apply a pruning step to remove tokens from said sorted list which go beyond a pre-determined cost threshold.
  • the sorting comprises binning a candidate token into one of a set of bins, each of which is associated with a range of cost values; and the pruning step comprises ignoring the tokens in the bin associated with the highest costs.
  • the pruning step comprises ignoring the tokens in a plurality of bins, being the set of bins associated with the highest costs.
  • the cost thresholds for each bin scale non-linearly, with lower cost bins being larger than higher cost bins.
  • the token propagation is carried out by a dynamic programming algorithm, most preferably using a Viterbi decoding algorithm, which implements the token propagation method.
  • transitions between nodes of the WFST correspond to Hidden Markov Model (HMM) states.
  • the propagation of tokens through the WFST comprises, for each transition, the calculation of a cost that includes a contribution from the coefficient of the determined observation vector and a fixed arc cost that represents the cost associated with a transition to a given HMM state.
  • a set of said arc costs is stored as part of the knowledge source.
  • said candidate pattern is a speech utterance
  • said knowledge source comprises one or more of a phoneme lexicon, a word lexicon and a grammar lexicon.
  • the knowledge source comprises all of a phoneme lexicon, a word lexicon and a grammar lexicon stored in a single memory structure.
  • an embedded system comprising a pattern recognition system, comprising a spectral analyser for analysing a candidate pattern to obtain a set of observation vectors; a decoder for decoding said observation vectors to produce a candidate match between said observation vectors and entries from a knowledge source; and a memory resource having stored thereon a weighted finite state transducer (WFST) that embodies said knowledge source; and a processor arranged, for each successive propagation, to model the propagation of tokens through nodes of the WFST for observation vectors derived from successive frames of the candidate pattern, to sort a list of tokens in order of their associated cost, and to apply a pruning step to remove tokens from said sorted list which go beyond a pre-determined cost threshold.
  • a portable device comprising the embedded system of the third aspect.
  • the portable device may be a hand held mobile communications device, such as a mobile telephone, smart phone or netbook, for example.
  • a computer programme product comprising a WFST and instructions for carrying out the method of the first aspect.
  • Figure 1 shows general principles of pattern recognition
  • Figure 2 shows the principles of operation of a decoder which is part of the system of figure 1;
  • Figure 3 shows the grammatical representation of a phrase
  • Figure 4 shows an example representation of various words in a word lexicon that may be used in the decoder of figure 2
  • Figure 5 shows an example of how phonemes are defined
  • Figure 6 shows the grammatical representation of figure 3 with word lexicon data represented
  • Figure 7 shows the grammatical representation of figure 3 with each of the phoneme representations illustrated
  • Figure 8 shows an overview of a speech recognition system in accordance with the present invention, comprising a weighted finite state transducer (WFST);
  • Figure 9 illustrates a WFST graph with HMM states shown as the arcs of the graph;
  • Figure 10 illustrates the on-chip memory loading of WFST portions;
  • Figure 11 illustrates schematically the general principle of a token propagation algorithm with pruning
  • Figure 12 illustrates a prior art token propagation method which comprises both beam and histogram pruning
  • Figure 13 illustrates a token propagation method according to the disclosure which includes a sorting step
  • Figure 14 illustrates the general principle of a doubly-linked list for token sorting in beam subdivisions
  • Figure 15 illustrates a decision flowchart for a coarse token sorting method, showing the process for deciding whether a token is stored or rejected;
  • Figure 16 shows an example block diagram of architecture for a token propagation algorithm
  • Figure 17 shows an example of hardware implementation of a token propagation technique according to the present disclosure
  • Figure 18 shows a prior art parallel token propagation technique
  • Figure 19 shows a parallel and pipelined token propagation technique according to the disclosure.
  • Figure 1 shows the components of a pattern recognition system according to the invention.
  • the pattern recognition system is a speech recognition system.
  • a candidate pattern, in this case a speech waveform, 100 is recorded and encoded as a bit-stream, for example in a .wav file format or any other suitable format.
  • the speech wave form 100 is fed to a spectral analyser 102 which decomposes the wave form 100 into a plurality of acoustic vectors 106.
  • Each acoustic vector 106 is sampled over a time-slice 104, referred to as a frame.
  • a typical frame size might, for example be 10 milliseconds.
  • the spectral analyser 102 operates in a known manner using processes such as Fast Fourier Transforms (FFT), discrete cosine transforms (DCT) and other digital filtering techniques to break down each sample into the acoustic vectors 106 which form an input to a decoder 108.
  • the decoder 108 performs Gaussian calculations and graph searching (using, for example, Viterbi decoding) to produce a final transcription 110 that corresponds to the interpretation of the speech waveform 100.
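  • purely as an illustration of the frame-by-frame analysis described above, the following sketch splits a waveform into 10 ms frames and computes one acoustic vector per frame; it uses plain log-power FFT bins for brevity, whereas a production front end would typically compute MFCC or LPC coefficients, and the sample rate, window and feature dimensionality are assumed values:

```python
import numpy as np

def spectral_analyse(waveform, sample_rate=16000, frame_ms=10, n_coeffs=13):
    """Split a speech waveform into frames and compute one acoustic vector per frame.

    Each acoustic vector here is simply the first n_coeffs log-power FFT bins;
    a real front end would apply mel filterbanks and a DCT to obtain MFCCs,
    or use linear prediction coding (LPC).
    """
    frame_len = int(sample_rate * frame_ms / 1000)        # samples per 10 ms frame
    n_frames = len(waveform) // frame_len
    window = np.hamming(frame_len)
    vectors = []
    for i in range(n_frames):
        frame = waveform[i * frame_len:(i + 1) * frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2        # power spectrum of the frame
        vectors.append(np.log(spectrum[:n_coeffs] + 1e-10))
    return np.array(vectors)                              # shape: (n_frames, n_coeffs)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr                                # one second of audio
    fake_speech = 0.1 * np.sin(2 * np.pi * 220 * t)       # placeholder signal
    obs = spectral_analyse(fake_speech, sr)
    print(obs.shape)                                      # (100, 13): 100 frames of 10 ms
```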
  • the basic components of the decoder 108 of figure 1 are illustrated in figure 2.
  • the decoder 108 receives the acoustic vectors 106 and parses them using multiple processing layers.
  • the first layer is a phoneme lexicon 200.
  • the phoneme lexicon 200 may be specific to a given language for which the speech recognition system is configured, or it may contain a set of phonemes related to multiple languages.
  • in this description, phonemes of the English language shall be referred to, although it will be understood that the principles of the disclosure may be applied equally to phonemes of other languages, or to other atomic units corresponding to the pattern being recognised.
  • the phoneme lexicon 200 processes the acoustic vectors 106 and identifies a string 202 of phonemes from within the lexicon which are deemed present in the speech wave form 100.
  • the phonemes thus identified form the input to a word lexicon 204.
  • the word lexicon 204 analyses strings of phonemes and parses them to match with a database of words contained within the lexicon.
  • the detected words 206 form the input to a grammar module 208 which checks the order of the words before outputting the final transcription 110.
  • the present invention performs pattern recognition using a type of finite state machine known as a weighted finite state transducer (WFST).
  • a finite state transducer acts to convert input speech into an output, i.e. the final transcription.
  • a finite state machine is a behavioural model of an automated process composed of a finite number of states, transitions between those states, and outputs which are dependent on the state of the machine and optionally the inputs to the system. It has finite internal memory, an input feature that reads symbols in a sequence, one at a time without going backward, and an output feature, which may be in the form of a user interface, once the model is implemented.
  • the operation of an FSM begins from one of the states (called a start state), goes through transitions to different states depending on the input, and can end in any of the available states - however, only a certain set of states mark a successful flow of operation (called accept states).
  • the WFST is based on a token propagation model.
  • Components of patterns are modelled as a network of states, with transitions between states having an associated cost, the values of which provide the "weights" of the weighted FST.
  • Each possible sequence of states through the model represents one possible alignment of the model with the candidate pattern.
  • the cost of a sequence of states is derived from the combined cost of each transition along the path of the model.
  • the "cost" is an inversely proportional representation of the likelihood of candidate state transitions or the similarity of candidate pattern components. Progression through state transitions is implemented according to a token propagation algorithm.
  • the WFST can be modelled by a WFST graph, which shows the states and the transitions between them.
  • the principles of operation of the WFST are shown by the WFST graphs of figures 3 to 7. These graphs are for illustration purposes, showing how a speech recognition system according to various embodiments might operate to interpret a spoken command to either "load data" or to "save data". It will be appreciated that a fully-specified WFST for practical use will be very complex, but that it can be built using the principles set out herein.
  • Figure 3 shows an example WFST structure embodying grammar knowledge and forming part of the grammar module 208 shown in Figure 2.
  • the possible commands are stored as a grammar rule as shown by the possible paths in the graph of figure 3.
  • the final command 300 to either load or save data can be thought of as the propagation of a token 302 along a path.
  • the grammar of the phrase is defined by restricting the paths that are possible, so the word “data” in this example can only be preceded by the words “load” or “save”.
  • the grammar rules mean that incompatible strings such as "load load”, or "data save” will not be recognised.
  • more complex grammar rules can be constructed that will enable similar commands to be phrased in different manners. It will be appreciated that this is a simplified example for the purposes of illustration.
  • Figure 4 shows an example WFST structure embodying word knowledge and forming part of the word lexicon 204 shown in Figure 2.
  • the graph of Figure 4 illustrates an exemplary word lexicon for the words "load” 400, "save” 402 and "data” 404.
  • Each word 400, 402, 404 is defined as a combination of phonemes 406, which follow in a defined order.
  • the word "data" 404 is defined to take account of two alternative pronunciations of the word "data", the upper path representing "daetah" and the lower path representing "deytah".
  • Each of the phonemes 406 in figure 4 is in turn represented by a phoneme lexicon 200.
  • Figure 5 shows an example WFST structure embodying phoneme knowledge and forming part of the phoneme lexicon 200 shown in Figure 2.
  • the graph of Figure 5 shows the phonemes for "ah" 500 and "ae" 502.
  • the phonemes for "ah" and "ae" are formed from subcomponents ({ah1, ah2, ah3} and {ae1, ae2, ae3}), each subcomponent corresponding to a Hidden Markov Model (HMM) state.
  • HMM Hidden Markov Model
  • Each state is evaluated against one frame by a Gaussian probability calculation (a measure of the distance from an acoustic model to the speech frame).
  • States ah1, ae1 and ah3, ae3 can be regarded as the start and the end of the corresponding phonemes, respectively.
  • the middle states ah2, ae2 correspond to the steady state of the phoneme pronunciation.
  • Self-loops on each HMM state account for the variable duration of each state (the model can remain in the same state for several 10 ms frames using the self-loop arcs).
  • Each phoneme in the lexicon may be modelled by a Hidden Markov Model because states between the different components of
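  • the three-state phoneme structure with self-loops described above can be written down as a small left-to-right HMM, as in the sketch below; the transition probabilities and per-frame state scores are invented for illustration and merely stand in for the Gaussian calculations described above:

```python
import numpy as np

# A hypothetical left-to-right HMM for the phoneme "ah" with states ah1, ah2, ah3.
# Each state has a self-loop (staying in the same state for several 10 ms frames)
# and a forward transition to the next state; the probabilities are illustrative.
states = ["ah1", "ah2", "ah3"]
trans = np.array([
    [0.6, 0.4, 0.0],   # ah1 -> ah1 (self-loop) or ah1 -> ah2
    [0.0, 0.7, 0.3],   # ah2 -> ah2 (self-loop) or ah2 -> ah3
    [0.0, 0.0, 1.0],   # ah3 -> ah3 (self-loop; phoneme end)
])

def path_cost(state_sequence, frame_scores):
    """Accumulate the cost (negative log probability) of one alignment of frames to states.

    frame_scores[t][s] plays the role of the per-frame Gaussian score of state s.
    """
    cost = -np.log(frame_scores[0][state_sequence[0]])
    for t in range(1, len(state_sequence)):
        prev, cur = state_sequence[t - 1], state_sequence[t]
        cost += -np.log(trans[prev, cur]) - np.log(frame_scores[t][cur])
    return cost

if __name__ == "__main__":
    # Five frames aligned as ah1, ah1, ah2, ah2, ah3 (self-loops model the duration).
    frame_scores = np.full((5, 3), 0.2)
    frame_scores[[0, 1], 0] = 0.8   # early frames look like ah1
    frame_scores[[2, 3], 1] = 0.8   # middle frames look like the steady state ah2
    frame_scores[4, 2] = 0.8        # final frame looks like ah3
    print(round(path_cost([0, 0, 1, 1, 2], frame_scores), 3))
```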
  • the new WFST can be formed from a combination of some or all of the knowledge sources (i.e. grammar, word and phoneme structures).
  • Figure 6 shows the grammar rule of figure 3 implementing the word lexicon of figure 4, and figure 7 illustrates the same grammar rule with the composition of individual phonemes represented; and
  • Figure 7 shows a WFST that comprises a single network combining all the knowledge sources.
  • the decoder is agnostic to the various knowledge sources: if changes are made to the knowledge sources before the WFST network is compiled, the decoder does not need to know what the WFST graph is made of.
  • the decoding rules are embedded in the graph, not in the decoder (as in traditional approaches). This makes the decoding very regular and faster, because it requires less computation.
  • Figure 8 shows an overview of a speech recognition system in accordance with an embodiment of the present invention.
  • speech information 800 (i.e. information representing speech) forms the input to the system; the speech information 800 may comprise a data file in .wav or any other suitable audio file format.
  • the speech information is broken down into acoustic observation vectors by an acoustic vector computer 802, at a predetermined frame rate, for example 10 ms.
  • the information in each time frame comprises a small portion of the input speech information 800 converted into acoustic observation vectors, for example mel-frequency cepstral coefficients (MFCC) or linear prediction coding (LPC) coefficients.
  • An acoustic models store 804 is used to store information representing acoustic models of the smallest differentiable sounds (called phonemes) produced during speech. Each acoustic model is associated with a tag which identifies the acoustic model and a score, which will be described below in more detail.
  • the acoustic models may for example be Gaussian Mixture Models (GMM) which model speech phonemes as a combination of multidimensional Gaussian functions.
  • An acoustic comparator 806 scores each acoustic model based on a comparison with the acoustic observation vectors of the current frame. This comparison may be performed through resolving the mixtures (typically Gaussian) representing the models for the acoustic observation vectors of the current frame.
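  • the comparator's scoring of an acoustic model against the current frame might look roughly like the following sketch, in which each model is a diagonal-covariance Gaussian mixture and the score is returned as a cost (a negative log likelihood); the dimensionality, mixture count and parameters are placeholders rather than values from the disclosure:

```python
import numpy as np

def gmm_cost(observation, weights, means, variances):
    """Negative log likelihood of one acoustic observation vector under a diagonal GMM.

    weights:   (M,)    mixture weights summing to 1
    means:     (M, D)  per-mixture mean vectors
    variances: (M, D)  per-mixture diagonal variances
    """
    diff = observation - means                                   # (M, D)
    log_gauss = -0.5 * np.sum(diff * diff / variances
                              + np.log(2 * np.pi * variances), axis=1)
    log_mix = np.log(weights) + log_gauss                        # (M,)
    # log-sum-exp over mixtures, returned as a cost (lower = better match)
    return -(np.max(log_mix) + np.log(np.sum(np.exp(log_mix - np.max(log_mix)))))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, M = 13, 4                                   # 13-dimensional vectors, 4 mixtures
    model = dict(weights=np.full(M, 1.0 / M),
                 means=rng.normal(size=(M, D)),
                 variances=np.full((M, D), 1.0))
    frame = rng.normal(size=D)
    # one cost per acoustic model; the comparator would do this for every model in the store
    print(gmm_cost(frame, **model))
```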
  • the scored acoustic models are then converted into a number of transcribed utterances using a weighted finite state transducer (WFST) 808 and a WFST decoder 810.
  • WFST decoder 810 produces an output 812 that comprises a number of candidate transcription matches for the speech information 800. These candidate matches can be ranked in order of probability. In one embodiment, a single transcription match, corresponding to the most probable candidate, is output.
  • the WFST 808 comprises a precompiled network of all available knowledge sources stored in a memory structure.
  • the process of compiling and optimizing the different knowledge sources (the multiple WFST graphs representing the phoneme lexicon, the word lexicon and the grammar lexicon) into a single network is carried out in advance of decoding.
  • Speech recognition systems and the information stored in them require significant memory resources, and so efforts are made to either reduce the size in memory of the speech recognition systems as a whole, or to adjust the underlying architecture of the relevant components.
  • Different approaches to deal with the large size of WFSTs have been proposed, including, for example, two-pass algorithms and on-the-fly composition.
  • the first WFST graph is based on a simplified language model, a pass through which generates several hypotheses. These hypotheses in turn provide the input to a second WFST.
  • This second WFST comprises a grammar module that is compiled from a more complex language model only. Because it includes neither the phoneme lexicon nor the word lexicon, it is smaller in size, and so the second pass is usually much faster than the first.
  • the phoneme lexicon can be kept as a separate WFST graph from the main WFST graph comprising the word and grammar knowledge sources.
  • the phoneme lexicon needs to be composed with the word and grammar lexicons on-the-fly during the decoding.
  • the present disclosure allows for the on-chip storage of portions of the WFST 808, as shown in Figure 10.
  • the WFST 808 may be stored on a different chip than the WFST decoder 810.
  • Required nodes and arcs from the WFST are first selected and copied (at step 1100) and then stored (at step 1102) in a local memory.
  • the system of the present disclosure is flexible enough to be adapted to WFSTs of different natures.
  • the core architecture for a decoder in line with the present disclosure can work on WFSTs that are composed on- the-fly by another part of the system.
  • Block 1100 would select and copy nodes and arcs from both the phoneme lexicon and the word and grammar lexicons.
  • An additional block 1104 could be provided that would compose nodes and arcs from the first WFST graph comprising the phoneme lexicon with nodes and arcs from the second WFST graph comprising the word and grammar lexicons to form a local WFST sub-graph that contains nodes where tokens are currently located.
  • the WFST decoder 810 decodes the input by searching the WFST 808 to find the best hypotheses according to the input speech.
  • the method used is a time synchronous Viterbi decoding method which is implemented using a token propagation algorithm.
  • Figure 9 shows an example WFST graph. Each node 900 in the graph (labelled from 0 to 6) represents a WFST state.
  • the HMM states are represented by the arcs (or lines - the term "arc" will be used as a generalisation) between the nodes.
  • the arcs of the graphs are labelled with a notation identifying "HMM state":"output”/"cost", that is, for example, the arc labelled “i3:A/0.5” represents the HMM state "i3", the output "A", and a cost of 0.5.
  • the graph only allows subsequent transitions to a subset of HMM states.
  • in this example there are four available HMM states, i1, i2, i3 and i4. It can be seen from node 0, for example, that only i1 or i2 is allowed as an initial HMM state, and that if i1 occurs, it can then only be followed by i3. On the other hand, if i2 occurs, it can be followed by either i3 or i4.
  • the cost represented here is the fixed cost associated with the transition between the relevant nodes of the WFST, that is, in moving from a given HMM state to another.
  • the particular cost values of the transitions, and the availability of given transitions effectively embodies the knowledge sources. That is, in the example of a speech recognition, the words and grammar play a part in defining the sequences of HMM states that are allowable, and the relative probabilities of the states where there are a number of allowed transitions.
  • Outputs are then associated with various HMM states (or arcs in the graph). Each of the outputs corresponds to a sequence of HMM states, or sometimes a single HMM state. As mentioned above, in one speech recognition embodiment, a phoneme is actually comprised of three HMM states.
  • an acoustic observation vector is extracted by the spectral analyser and a Gaussian calculation is carried out to obtain a value for each HMM state, corresponding to the coefficients for each dimension of the vector.
  • the input Gaussian results are then fed to the token propagation algorithm.
  • the Gaussian result forms the basis of a Gaussian cost which is added to the fixed arc cost. The (additive) combination of these two costs gives the total cost for each arc, i.e. for transition to each of the permitted HMM states.
  • a first frame of data yields Gaussian results for HMM states i1-i4 of 1.7, 0.8, 1.5 and 0.3 respectively
  • the total cost for the transition between nodes zero and one is given by the combined Gaussian result (1.7) plus the fixed arc cost (0.2), giving a total cost of 1.9. That is, the cost of a transition representing a null output is 1.9.
  • the cost of a transition representing a "B" output is given by the Gaussian result for i2, 0.8, plus the fixed arc cost, 0.6, so totalling 1.4.
  • the WFST is telling us that the output of a "B" is more likely than a null output.
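  • the arc-cost arithmetic of this worked example can be reproduced in a few lines; the numbers below are simply the example values quoted above for the first frame of Figure 9:

```python
# Gaussian results for the first frame, as in the worked example above.
gaussian_result = {"i1": 1.7, "i2": 0.8, "i3": 1.5, "i4": 0.3}

# Outgoing arcs from node 0 of the example WFST graph: (HMM state, output, fixed arc cost).
arcs_from_node_0 = [
    ("i1", None, 0.2),   # transition to node 1, null output
    ("i2", "B", 0.6),    # transition to node 2, output "B"
]

for hmm_state, output, arc_cost in arcs_from_node_0:
    total = gaussian_result[hmm_state] + arc_cost   # Gaussian cost + fixed arc cost
    print(f"output {output!r}: total arc cost {total:.1f}")

# Prints 1.9 for the null output and 1.4 for "B", so for this frame the WFST
# indicates that outputting "B" is the more likely transition from node 0.
```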
  • the cost of each path is represented by the propagation of a token along the graph.
  • the token is a data structure that records the path taken, the cumulative cost of that path, and the previously output values as traceback entries.
  • for the next frame, a fresh set of Gaussian results is obtained, and the tokens that were propagated to nodes 1 and 2 are propagated further in the graph.
  • the cost associated with each token accumulates, and the new Gaussian portion of the cost is calculated according to the newly obtained Gaussian results for the HMM states.
  • as tokens propagate through the graph, a number of hypotheses of what information the input speech information 800 represents (corresponding to the different paths through the graph) are created using the WFST decoder 810.
  • Each hypothesis is associated with a token which is used to identify a hypothesis, store the running cost of the hypothesis, and store the location of the hypothesis in the WFST 808.
  • the token with the lowest overall cost represents the match that is most likely to be correct.
  • Pruning 1202 represents one of the main challenges of any token propagation method; performance in terms of recognition accuracy, speed and energy consumption heavily depends on its implementation.
  • Figure 12 shows a classic token propagation method.
  • the hypotheses represent the possible utterances pronounced, which may be sequences of words output during token propagation. So each token in the token list 1300 is also linked to a record of the word history associated with that token, so that the best possible transcription can be displayed at the end of the utterance.
  • Word histories are stored in the Word Link Record (WLR) memory 1302 and each token is linked to an entry in this memory.
  • An entry in the WLR memory 1302 corresponds to a token's hypothesis and the corresponding utterance to the hypothesis can be recovered by tracing the WLR back.
  • the token selection technique shown in figure 12 uses a combination of beam pruning 1304 and histogram pruning 1306.
  • Beam pruning 1304 is a breadth-first search process that discards each token having a cost higher than the current best cost plus a fixed beam width.
  • Histogram pruning 1306 limits the number of tokens propagated during the subsequent frame, to a fixed maximum number of tokens. For each frame, a histogram of the token costs is built. The maximum cost of the tokens which are selected to be propagated during the next frame is chosen so that the cumulative number of tokens does not exceed a predefined limit (for example, 4000 tokens). Pruning is then applied on the input tokens. Before being propagated, token costs are compared to the maximum cost determined in the previous step: tokens whose cost are higher than the threshold are pruned (ignored).
  • the technique shown in figure 12 has the drawback of requiring two passes through all the tokens. Moreover, the fact that more tokens/hypotheses are kept alive than are actually propagated in the subsequent frames creates new entries in the WLR memory 1302 that are unusable. This in turn causes the WLR memory 1302 to fill up with unusable entries. As a result, the WLR memory 1302 needs frequent 'garbage collection' 1308 (i.e. the WLR memory 1302 needs to be searched to identify and remove the unusable entries) to reallocate the memory slots of unusable entries to new entries.
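  • for comparison, the two-pass selection of figure 12 might be sketched as below: a beam-pruning pass against the best cost plus a fixed beam width, followed by a histogram-pruning pass that caps the number of surviving tokens; the beam width, bucket count and token limit are assumed values:

```python
from collections import namedtuple

Token = namedtuple("Token", "node cost")

def beam_prune(tokens, beam_width):
    """First pass: discard every token whose cost exceeds best cost + beam width."""
    best = min(t.cost for t in tokens)
    return [t for t in tokens if t.cost <= best + beam_width]

def histogram_prune(tokens, max_tokens, n_buckets=100):
    """Second pass: build a histogram of costs and pick a cost cutoff so that
    no more than roughly max_tokens tokens survive into the next frame."""
    if len(tokens) <= max_tokens:
        return tokens
    lo = min(t.cost for t in tokens)
    hi = max(t.cost for t in tokens)
    width = (hi - lo) / n_buckets or 1.0
    counts = [0] * n_buckets
    for t in tokens:
        counts[min(n_buckets - 1, int((t.cost - lo) / width))] += 1
    # find the bucket at which the cumulative count first exceeds the limit
    total, cutoff = 0, hi
    for b, c in enumerate(counts):
        total += c
        if total > max_tokens:
            cutoff = lo + b * width        # tokens at or above this cost are dropped
            break
    return [t for t in tokens if t.cost < cutoff]

if __name__ == "__main__":
    toks = [Token(n, cost) for n, cost in enumerate([1.0, 2.5, 2.6, 9.0, 4.2, 3.3])]
    survivors = histogram_prune(beam_prune(toks, beam_width=5.0), max_tokens=3)
    print(survivors)
```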
  • Figure 13 shows a token propagation method in accordance with the present disclosure.
  • the hypotheses represent the possible utterances pronounced, which are sequences of words output during token propagation. So each token in the token list 1400 is also linked to a record of the word history associated with that token, so that the best possible transcription can be displayed at the end of the utterance.
  • Word histories are stored in the Word Link Record (WLR) memory 1402 and each token is linked to an entry in this memory.
  • An entry in the WLR memory 1402 corresponds to a token's hypothesis and the corresponding utterance to the hypothesis can be recovered by tracing the WLR back.
  • for each frame, the tokens are sorted 1404 by cost and then propagated 1406 in the order they have been sorted, starting with the token having the lowest cost and ending with the token having the highest cost. As part of propagation, the cost of each token is recalculated. Beam pruning 430 then discards each token having a cost higher than the current best cost plus a fixed beam width. The tokens are then re-sorted 1404 and the process is continued for every frame until the end of the utterance, thereby obtaining the most likely results.
  • the propagation of the tokens having the lowest cost in advance of higher cost tokens guarantees optimum performance and makes histogram pruning straightforward.
  • the best token is the reference used for determining which tokens are pruned during beam pruning
  • identifying the best token at an early stage in the propagation improves the performance of the system.
  • the sorting approach enables the system to selectively store only the most relevant tokens (i.e. those having a cost below a certain threshold). This drastically reduces the number of unusable tokens / hypotheses stored in WLR memory 1402.
  • the frequency that garbage collection 1410 needs to be performed is in turn greatly reduced thereby improving the performance of the system.
  • tokens are propagated, sorted and discarded "live", which allows the system to be dimensioned to store only the number of tokens that are actually propagated during the next frame; as opposed to the necessity to propagate and store all tokens, and then apply the pruning (which needs a lot more memory).
  • sorting is a computationally expensive operation, and will require a lot of resource from the processor.
  • processors have to put the rest of the system on hold to deal with the sorting.
  • processors are designed for sequential tasks and complex arithmetic operations, but they handle tasks such as sorting very poorly indeed.
  • tokens are sorted into an ordering list 1500 that is divided up into fractions (for example, this may comprise 64 subdivisions).
  • the beam subdivisions are bins that divide the beam into fractions.
  • Such a scaling could be a non-linear scaling, for example a logarithmic scaling.
  • each entry in the doubly linked ordering list 1500 is associated with a pointer to the preceding value in the unordered token list 1502 as well as a pointer to the next value in the doubly linked list 1500.
  • a separate entry list (not shown) is maintained which contains a pointer to the last entry for each beam subdivision.
  • a binary search is used to determine the beam subdivision in which to place a new token. With this method, the number of comparisons made will only be log2 of the number of beam subdivisions. It is also possible to implement a solution based on arithmetic methods.
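  • as noted above, a binary search needs only about log2(number of subdivisions) comparisons per token; a minimal sketch using a precomputed table of subdivision boundaries (with an assumed linear spacing and the 64 subdivisions mentioned earlier) is given below:

```python
import bisect

NUM_SUBDIVISIONS = 64
BEAM_WIDTH = 10.0

def subdivision_boundaries(best_cost, beam_width=BEAM_WIDTH, n=NUM_SUBDIVISIONS):
    """Upper cost boundary of each beam subdivision (linear spacing for simplicity)."""
    return [best_cost + beam_width * (i + 1) / n for i in range(n)]

def find_subdivision(cost, boundaries):
    """Binary search: return the subdivision index for a token cost,
    or None if the cost falls outside the beam entirely."""
    idx = bisect.bisect_left(boundaries, cost)   # ~log2(64) = 6 comparisons
    return idx if idx < len(boundaries) else None

if __name__ == "__main__":
    bounds = subdivision_boundaries(best_cost=2.0)
    for cost in (2.1, 5.0, 11.9, 12.5):
        print(cost, "->", find_subdivision(cost, bounds))
```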
  • the next step is to decide whether to store a new hypothesis as a new token, as a replacement of a previously stored token, or drop it completely. This is based on how full the beam subdivisions are and the cost of the new token.
  • Figure 15 shows a flowchart for making this decision.
  • a particular bin, identified by a bin index f, is determined on the basis of the cost of the token that is input.
  • Each bin corresponds to a predetermined beam subdivision, that is, a range of costs.
  • the bin index, f, increments with the costs of the subdivisions, that is, a low index corresponds to a low cost subdivision.
  • the total number, N, of sorted tokens is checked against a predetermined maximum number, Nmax, of tokens that the system (as a whole) can store. If the number of tokens N is less than the maximum Nmax then the new token is inserted in the allocated bin at step 1604. If, however, the total number of tokens N is not less than the maximum number of tokens Nmax that the system can store, then the bin index, f, is compared at step 1606 with ftop, which is the bin index corresponding to the current sub-division whose cost boundaries are highest.
  • if the bin index f of the identified bin is less than ftop, then one token is removed from the ftop bin (at step 1608) and a token is then inserted into bin f (at step 1604). If, however, the identified bin index f is not less than ftop, then the token is discarded (at step 1610).
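  • the store/replace/discard decision of figure 15 might be coded as in the following sketch, where bins is assumed to be a list of per-subdivision token lists and n_max the system-wide token limit, mirroring the symbols f, ftop, N and Nmax used above:

```python
def insert_token(token, f, bins, n_max):
    """Decide whether to store, replace or drop a new token.

    token: the candidate token (anything carrying a cost)
    f:     bin index allocated from the token's cost (low index = low cost)
    bins:  list of lists, one per beam subdivision, holding stored tokens
    n_max: maximum number of tokens the system as a whole can store
    """
    n = sum(len(b) for b in bins)                 # total number of sorted tokens
    if n < n_max:
        bins[f].append(token)                     # step 1604: room left, just insert
        return "inserted"
    # list is full: find ftop, the highest-cost subdivision that currently holds tokens
    f_top = max(i for i, b in enumerate(bins) if b)
    if f < f_top:
        bins[f_top].pop()                         # step 1608: evict one worst-bin token
        bins[f].append(token)                     # step 1604: insert the better token
        return "replaced"
    return "discarded"                            # step 1610: not better than the worst bin

if __name__ == "__main__":
    bins = [[] for _ in range(8)]
    print(insert_token({"cost": 0.4}, 1, bins, n_max=2))   # inserted
    print(insert_token({"cost": 2.9}, 6, bins, n_max=2))   # inserted
    print(insert_token({"cost": 1.1}, 2, bins, n_max=2))   # replaced (evicts from bin 6)
    print(insert_token({"cost": 3.5}, 7, bins, n_max=2))   # discarded
```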
  • An example of an actual implementation of the token propagation algorithm including the token sorting can be represented by the block diagram of figure 16.
  • in step 1700, the outgoing arcs from the nodes where tokens are located in the WFST stored in memory 1702 are loaded into local memory 1704.
  • in step 1706, the cost of each token is incremented by the cost of propagation on every corresponding outgoing arc.
  • in step 1708, a beam subdivision, f, is determined for each propagated token based on the updated cost.
  • in step 1710, a decision is taken as to what is to be done with a particular token - i.e. is it to be inserted in the token list, is it to replace an existing token in the token list, or is it to be discarded? This decision is based on the allocated beam subdivision, f, and performed as previously described with reference to figure 15.
  • if a token is not discarded in step 1710, the symbol associated with the token is copied in step 1712 from memory 1704 into a word link record (WLR) memory 1714.
  • in step 1716, the token is stored in the token list 1718 in order of cost as calculated in step 1706.
  • Each token stored in the token list 1718 is associated with a node of the WFST graph and a traceback entry to the word link record 1714 as previously described with reference to figure 13.
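  • tying the blocks of figure 16 together, a single decoding frame might proceed roughly as in the sketch below: arcs are fetched for each active token in cost order (step 1700), costs are incremented (1706), a beam subdivision is chosen (1708), the insert/replace/discard decision is taken (1710), and surviving tokens record a traceback entry for the word link record (1712, 1716); the WFST sub-graph, Gaussian results and limits are all illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    node: int
    cost: float
    wlr: Optional[tuple] = None        # traceback entry: (output symbol, previous entry)

# Hypothetical local WFST sub-graph: node -> list of (destination, HMM state, output, arc cost)
ARCS = {
    0: [(1, "i1", None, 0.2), (2, "i2", "B", 0.6)],
    1: [(3, "i3", "A", 0.5)],
    2: [(3, "i3", "A", 0.1), (4, "i4", "C", 0.9)],
}

def decode_frame(active_tokens, gaussian, beam_width=5.0, n_bins=8, max_tokens=4):
    """One frame of token propagation with coarse sorting into cost bins."""
    best = min(t.cost for t in active_tokens)
    bins = [[] for _ in range(n_bins)]
    # step 1700: tokens are processed in order of cost (lowest first)
    for tok in sorted(active_tokens, key=lambda t: t.cost):
        for dest, hmm_state, output, arc_cost in ARCS.get(tok.node, []):
            cost = tok.cost + gaussian[hmm_state] + arc_cost     # step 1706
            rel = (cost - best) / beam_width
            if rel >= 1.0:
                continue                                         # beam pruning
            f = int(rel * n_bins)                                # step 1708
            # step 1710 (simplified): if full and f is not better than the worst bin, discard
            if sum(len(b) for b in bins) >= max_tokens and not any(bins[f + 1:]):
                continue
            if sum(len(b) for b in bins) >= max_tokens:
                next(b for b in reversed(bins) if b).pop()       # evict a worst-bin token
            wlr = (output, tok.wlr)                              # step 1712: word link record
            bins[f].append(Token(dest, cost, wlr))               # step 1716
    return [t for b in bins for t in b]                          # coarsely sorted token list

if __name__ == "__main__":
    gaussian = {"i1": 1.7, "i2": 0.8, "i3": 1.5, "i4": 0.3}
    tokens = [Token(node=0, cost=0.0)]
    for t in decode_frame(tokens, gaussian):
        print(t.node, round(t.cost, 2), t.wlr)
```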
  • figure 16 also includes the management of N-best hypotheses.
  • in step 1720, the N-best hypotheses for a particular node are checked.
  • the list of nodes for the N-best hypotheses is recorded in memory 1722; each entry in the list is associated with a list of the N best tokens located on the particular node.
  • This enables the decoder to associate up to N hypotheses per node in the WFST graph, each of them identified by a distinct token.
  • as tokens are propagated through the WFST network, two or more tokens can reach the same node at the same time through different paths, offering different transcription hypotheses.
  • the system described in the present disclosure allows the display of these different transcriptions.
  • the best tokens are read in order of cost in step 1730 and used by step 1700 in order of cost (i.e. lowest cost tokens first) to load outgoing arcs of the WFST into local memory 1704.
  • in step 1732, the N-best utterances are output. This is done by tracing back entries in the WLR memory 1714 and the Lexicon word link (WL) memory 1734. These different transcription hypotheses can also be fed to another layer of the system.
  • the N-best outputs of the speech recognition core engine can be used at a higher application level, for example, multiple hypotheses displayed to the user, or multiple hypotheses used as input to a more complex language model, to name a few uses. It is also worth noting the number and distribution of memory blocks 1702, 1704, 1714, 1718, 1722, 1726, and 1734, which highlights that this architecture is well suited to implementation on dedicated hardware with distributed local memories.
  • figure 17 depicts a non-trivial implementation on dedicated hardware of the technique outlined above.
  • the system of the present disclosure may also provide for parallel token propagation, as shown in figure 18.
  • Running the token propagation on different threads in parallel is an efficient way to accelerate the overall token propagation procedure.
  • sorting is scalable and can be distributed over parallel threads without the need for frequent synchronization. Moreover, the parallel system does not have to wait until all tokens are sorted. The first tokens, with the best costs, can be propagated without having to wait for the last one to be ready. This enables the system to be implementable with parallel pipelined processes on each thread. Moreover, the need for communication between threads is reduced and, as a result, the length of each pipeline process can be increased, which reduces the bandwidth and energy used by the system and also increases the speed of the system.
  • the use of threads to implement a pipeline process also offers a quasi-linear improvement in performance which increases with the number of threads used.
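  • the partitioning described above can be illustrated, very loosely, by giving each worker its own slice of the token list and its own local bins, so that sorting proceeds without per-token synchronization and results are merged only once per frame; a real implementation would use dedicated hardware or truly parallel threads, and Python's ThreadPoolExecutor with placeholder costs is used here purely to show the structure:

```python
from concurrent.futures import ThreadPoolExecutor

def propagate_partition(tokens, gaussian_cost, arc_cost, beam_width, best_cost, n_bins=16):
    """Worker: propagate one partition of tokens and sort them into local bins.

    No shared state is touched, so no locks or barriers are needed until the merge.
    gaussian_cost and arc_cost stand in for the per-arc cost lookups of the decoder.
    """
    local_bins = [[] for _ in range(n_bins)]
    for node, cost in tokens:
        new_cost = cost + gaussian_cost + arc_cost
        rel = (new_cost - best_cost) / beam_width
        if rel < 1.0:                                  # beam pruning inside the worker
            local_bins[int(rel * n_bins)].append((node, new_cost))
    return local_bins

def parallel_propagate(tokens, n_threads=4, beam_width=8.0, n_bins=16):
    best_cost = min(cost for _, cost in tokens)
    chunks = [tokens[i::n_threads] for i in range(n_threads)]      # partition the token list
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = pool.map(propagate_partition, chunks,
                           [1.0] * n_threads, [0.3] * n_threads,   # placeholder costs
                           [beam_width] * n_threads, [best_cost] * n_threads)
    # merge once per frame: concatenate matching bins from every worker
    merged = [[] for _ in range(n_bins)]
    for local_bins in results:
        for f, b in enumerate(local_bins):
            merged[f].extend(b)
    return [tok for b in merged for tok in b]                      # coarsely sorted

if __name__ == "__main__":
    toks = [(n, c) for n, c in enumerate([0.0, 2.1, 5.5, 1.2, 9.7, 3.3, 0.4, 7.0])]
    for tok in parallel_propagate(toks):
        print(tok)
```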
  • Dedicated hardware can be used to implement the system of the present disclosure for embedded applications thereby making this performance gain achievable even with a large number of parallel hardware blocks.
  • Performance evaluation of the parallel implementation using traditional pruning approaches shows that, as the number of parallel threads increases, the performance improvement tends to drop off. This is due to the need for all threads to synchronize and communicate data through shared memory. The requirement for frequent barriers to synchronize threads and the limited bandwidth of the shared memory limit the achievable speed.
  • Such systems can support a vocabulary large enough to be useful in a wide range of applications; for example, a vocabulary of around 50,000 words, or more, could be accommodated.
  • Products in the embedded system market which can use the system of the present disclosure include, for example, mobile battery-operated devices such as mobile phones, smartphones, multimedia devices, satellite navigation systems, low-cost mini laptops (netbooks), personal digital assistants (PDAs), mp3 players, videogame consoles, digital cameras, DVD players, and printers.
  • Medical equipment is continuing to advance with more embedded systems for vital signs monitoring, electronic stethoscopes for amplifying sounds, and various medical imaging modalities (PET, SPECT, CT, MRI) for non-invasive internal inspection, and such equipment can also incorporate the system of the present disclosure.

Abstract

The present invention relates to improvements in or relating to pattern recognition, and in particular to novel systems and methods for pattern recognition, and to devices incorporating such systems and methods. In particular, a spectral analysis is performed on a candidate pattern to obtain a set of observation vectors. The observation vectors are decoded to produce a candidate match between the observation vectors and entries from a knowledge source. The step of decoding the vectors comprises modelling the knowledge source as a weighted finite state transducer (WFST), and the step of decoding comprises modelling the propagation of tokens through nodes of the WFST for observation vectors derived from successive frames of the candidate pattern. For each successive propagation, a list of tokens is sorted in order of their associated cost, and a pruning step is applied to remove tokens from said sorted list which go beyond a predetermined cost threshold, before performing the next iteration.
PCT/GB2011/052436 2010-12-08 2011-12-08 Reconnaissance de formes WO2012076895A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB1020771.0A GB201020771D0 (en) 2010-12-08 2010-12-08 Improvements in or relating to pattern recognition
GB1020771.0 2010-12-08

Publications (1)

Publication Number Publication Date
WO2012076895A1 (fr) 2012-06-14

Family

ID=43531633

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2011/052436 WO2012076895A1 (fr) 2010-12-08 2011-12-08 Reconnaissance de formes

Country Status (2)

Country Link
GB (1) GB201020771D0 (fr)
WO (1) WO2012076895A1 (fr)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070143112A1 (en) * 2005-12-20 2007-06-21 Microsoft Corporation Time asynchronous decoding for long-span trajectory model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BILMES J: "Dynamic Graphical Models", IEEE SIGNAL PROCESSING MAGAZINE, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 27, no. 6, 1 November 2010 (2010-11-01), pages 29 - 42, XP011317699, ISSN: 1053-5888 *
TSUNEO KATO ET AL: "An efficient beam pruning with a reward considering the potential to reach various words on a lexical tree", ACOUSTICS SPEECH AND SIGNAL PROCESSING (ICASSP), 2010 IEEE INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 14 March 2010 (2010-03-14), pages 4930 - 4933, XP031697068, ISBN: 978-1-4244-4295-9 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016057151A1 (fr) * 2014-10-06 2016-04-14 Intel Corporation Système et procédé de reconnaissance automatique de la parole à l'aide de la génération de réseaux de mots sur le champ avec des historiques de mots
US9530404B2 (en) 2014-10-06 2016-12-27 Intel Corporation System and method of automatic speech recognition using on-the-fly word lattice generation with word histories
CN106663423A (zh) * 2014-10-06 2017-05-10 英特尔公司 使用具有词历史的实时词网格生成的自动语音识别的系统和方法
CN108417222A (zh) * 2017-02-10 2018-08-17 三星电子株式会社 加权有限状态变换器解码系统以及语音识别系统
CN108417222B (zh) * 2017-02-10 2024-01-02 三星电子株式会社 加权有限状态变换器解码系统以及语音识别系统
CN107123419A (zh) * 2017-05-18 2017-09-01 北京大生在线科技有限公司 Sphinx语速识别中背景降噪的优化方法
US11935517B2 (en) * 2018-12-14 2024-03-19 Tencent Technology (Shenzhen) Company Limited Speech decoding method and apparatus, computer device, and storage medium

Also Published As

Publication number Publication date
GB201020771D0 (en) 2011-01-19


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11805572

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11805572

Country of ref document: EP

Kind code of ref document: A1