EP0551374A1

EP0551374A1 - Boundary relaxation for speech pattern recognition

Info

Publication number: EP0551374A1
Application number: EP91917937A
Authority: EP
Inventors: Ilan D. Shallom; Raziel Haimi-Cohen
Original assignee: DSP Group Inc
Current assignee: DSP Group Inc
Priority date: 1990-10-02
Filing date: 1991-10-02
Publication date: 1993-07-21
Also published as: WO1992006469A1; EP0551374A4

Abstract

The speech recognition algorithm is implemented in a computer program by sending a voice input signal to an encoder (2) and processing it in a standard computer (4) using reference structures stored in memory (6). The algorithm uses the well-known technique of dynamic programming, extended to include weighting and normalization functions.

Description

BOUNDARY RELAXATION FOR SPEECH PATTERN RECOGNITION

FIELD OF THE INVENTION

The present invention relates to pattern recognition processing generally and more particularly to speech recognition using a dynamic programming algorithm, typically a modification of a standard Dynamic Time Warping (DTW) or similar algorithm (for example, a Hidden Markov Model based on Viterbi's algorithm).

BACKGROUND OF THE INVENTION

Conventional Dynamic Time Warping (DTW) algorithms assume a precise knowledge of the boundaries of both reference and test utterances. However, the output of practical boundary detectors is inaccurate, particularly so in a noisy environment. This results in a severe deterioration of the accuracy of isolated word recognition. This problem has been well described in several publications including an article by Wilpon, Rabiner and Martin, entitled "An Improved Word Detection Algorithm for Telephone Quality Speech Incorporating both Syntactic and Semantic Constraints", and published in the AT&T Bell Lab. Tech. Journal, Vol. 63(3), March 1984, pp. 479-498. Wilpon et al show the results of recognition experiments in which the actual endpoints are manually varied. Their work suggests that the accuracy of isolated word recognition decreases dramatically as a function of errors in boundary detection. The method outlined by Wilpon et al describes a way to improve speech recognition, based on a new boundary estimation algorithm which reduces boundary recognition errors.

The degradation in recognition accuracy due to mismatch in boundary determination can be reduced by various approaches. The method of Wilpon et al uses the approach of improving the accuracy in boundary determination to a certain degree of uncertainty. In addition, to overcome the remaining problem it is recommended that a procedure be developed that is immune to small endpoint errors.

Relaxing the requirement of exact knowledge of the boundaries gives a strong tool with which to measure the similarity between two speech events with uncertain endpoints (within a reasonable limit). This method for improved isolated word recognition is described in "Considerations in Dynamic Time Warping Algorithms for Discrete Word Recognition," by L.R. Rabiner, A.E. Rosenberg and S.E. Levinson, published in IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-23, Dec. 1978, pp. 575-582. The method of Rabiner et al attempts to improve speech recognition by relaxation of the boundary constraints and modification of the standard dynamic time warping algorithm, allowing the warping path to begin and end within a specified range with respect to the estimated boundaries.

According to this method, the accumulated distance of the final path is normalized by its length.

The method of Rabiner et al is enhanced by the algorithm described in "Dynamic Time Warping with Boundaries Constraint Relaxation", by I.D. Shallom, R. Haimi-Cohen and T. Golan, published in Proc. Conf. IEEE Israel, 1989, paper 3.1.3. The algorithm of Shallom et al also uses relaxation of boundary constraints. Their method uses a dynamic time warping algorithm in which a path length normalization factor is applied in the dynamic equation at each grid point. This improves the path optimization process.

However, this method ignores the "length" of the "future" part of the warping function. Ignoring the future "length" may lead to inaccuracies, especially near the beginning of the warping path. As a result, errors may occur in the overall similarity measure.

SUMMARY OF THE INVENTION

The present invention provides a method of improved pattern recognition which may be used for speech recognition by relaxation of boundary constraints so as to account for boundary detection errors. The dynamic programming algorithm is modified so that the known and predicted path lengths are taken into account when determining the optimal path to each grid point. Additionally, the present invention provides a method for improving the accuracy of the estimated boundaries of a tested pattern.

A method for determining the predicted path length and for utilizing it in a dynamic programming algorithm is outlined below.

There is therefore provided, in accordance with a preferred embodiment of the present invention, apparatus for pattern recognition including apparatus for providing a digital pattern to be inspected which contains a plurality of feature vectors, apparatus for providing at least one digital reference pattern containing a different plurality of parameter vectors and apparatus for comparing the digital pattern to be inspected with the at least one digital reference pattern. The apparatus for comparing includes apparatus for providing a search area including a grid with the feature vectors on a first axis and the parameter vectors on a second axis and apparatus for calculating a final normalized score which is the estimated minimum of a plurality of optimal normalized scores each associated with a corresponding feasible path, wherein each of the feasible paths is located in the search area. The apparatus for calculating includes, for each point in the search area, apparatus for computing an accumulated score for a plurality of feasible paths which contain the point, apparatus for computing an overall weight for each of the plurality of feasible paths which contain the point, apparatus for computing a normalized score, whereby the normalized score is the accumulated score for the point divided by the overall weight for the point, for each of the plurality of feasible paths which contain the point, and apparatus for selecting the normalized score which is least, from the plurality of normalized scores, as an optimal normalized score for the point.

Additionally, in accordance with a preferred embodiment of the present invention, the search area includes a plurality of path beginning points and a plurality of path ending points.

Moreover, in accordance with a preferred embodiment of the present invention, the apparatus for pattern recognition also includes an apparatus for determining beginning and ending points of that feasible path which is associated with the final normalized score thereby to determine beginning and ending points of the digital pattern. Furthermore, in accordance with a preferred embodiment of the present invention, the overall weight includes an accumulated weight and a predicted weight.

Still further, in accordance with a preferred embodiment of the present invention, the pattern to be inspected is a speech utterance and the reference pattern is based on a Hidden Markov Model. Alternatively, the pattern to be inspected is a speech utterance, the reference pattern is a reference template, and the feasible paths are calculated according to a Dynamic Time Warping algorithm.

Moreover, in accordance with a preferred embodiment of the present invention, the beginning and ending points of the feasible path which is associated with the final normalized score are used to estimate beginning and ending points of the pattern to be inspected. Additionally, in accordance with a preferred embodiment of the present invention, the digital pattern is derived from a speech signal.

There is further provided, in accordance with a preferred embodiment of the present invention, a method for producing a final normalized score which is the minimum of a plurality of optimal normalized scores each associated with a corresponding feasible path, wherein each of the feasible paths is located in a search area and wherein the search area includes a set of points characterized by a plurality of path beginning points and a plurality of path ending points. For each point in the search area, the method includes the steps of computing an accumulated score for a plurality of feasible paths which contain the point, computing an overall weight for each of the plurality of feasible paths which contain the point, computing a normalized score, whereby the normalized score is the accumulated score for the point divided by the overall weight for the point, for each of the plurality of feasible paths which contain the point, and selecting the normalized score which is least, from the plurality of normalized scores, as an optimal normalized score for the point.

Additionally, in accordance with a preferred embodiment of the present invention, the method also includes the step of determining beginning and ending points of that feasible path which is associated with the final normalized score.

Moreover, in accordance with a preferred embodiment of the present invention, the overall weight includes an accumulated weight and a predicted weight.

Furthermore, in accordance with a preferred embodiment of the present invention, the final normalized score indicates the similarity between a reference form and a pattern to be inspected. Preferably, the pattern to be inspected is a speech utterance and the reference form is based on a Hidden Markov Model. Alternatively, the pattern to be inspected is a speech utterance, the reference form is a reference template, and the feasible paths are calculated according to a Dynamic Time Warping algorithm.

Additionally, in accordance with a preferred embodiment of the present invention, the beginning and ending points of the feasible path which is associated with the final normalized score are used to estimate beginning and ending points of the pattern to be inspected.

Finally, there is provided, in accordance with a preferred embodiment of the present invention, a method for pattern recognition including the steps of providing a digital pattern to be inspected which contains a plurality of feature vectors, providing at least one digital reference pattern containing a different plurality of parameter vectors, and comparing the digital pattern to be inspected with the at least one digital reference pattern. The step of comparing includes the steps of providing a search area including a grid with the feature vectors on a first axis and the parameter vectors on a second axis, and calculating a final normalized score which is the minimum of a plurality of optimal normalized scores each associated with a corresponding feasible path, wherein each of the feasible paths is located in the search area. The step of calculating includes, for each point in the search area, the steps of computing an accumulated score for a plurality of feasible paths which contain the point, computing an overall weight for each of the plurality of feasible paths which contain the point, computing a normalized score, whereby the normalized score is the accumulated score for the point divided by the overall weight for the point, for each of the plurality of feasible paths which contain the point, and selecting the normalized score which is least, from the plurality of normalized scores, as an optimal normalized score for the point.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the drawings in which:

Fig. 1 is a schematic block diagram illustration of the architecture of a preferred embodiment of speech recognition apparatus constructed and operated in accordance with a preferred embodiment of the present invention;

Fig. 2 is a schematic block diagram illustration of a speech recognition system constructed and operated in accordance with the principles of a preferred embodiment of the present invention;

Fig. 3 is a graphical representation illustration of an optimization procedure of a preferred embodiment of the invention; and

Fig. 4 is a pseudo-code illustration of a scoring algorithm for pattern recognition in the speech recognition system of Fig. 2 in accordance with a dynamic programming technique of the invention.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

Reference is now made to Figs. 1 - 4. Fig. 1 shows a schematic block diagram of the architecture of a microprocessor-based speech recognition system operated in accordance with the principles of the present invention.

A user codec 2, such as an Intel 2913, from Intel Corporation, interfaces with digital signal processing circuitry 4, typically a TMS320C25 from Texas Instruments Corporation.

A memory storage area 6, which typically comprises a static random-access memory, such as a 32K by 8 bit memory with an access time of 100 nsec, is connected to the digital signal processing circuitry by means of a standard address data and read-write control bus.

Fig. 2 shows a schematic block diagram of a microprocessor-based speech recognition system operated in accordance with the principles of the present invention.

The algorithms of Fig. 2 are typically carried out by software run on digital signal processing circuitry 4, such as the digital signal processing circuitry of Fig. 1.

An analog signal 12, which may be obtained from a microphone or similar device, is typically provided to a standard sampling device 14. The sampling device 14, which may be codec 2 (Fig. 1), converts the analog signal to a digital signal 16.

The output of the sampling device, the digital signal 16, is then supplied to a voice activated detection device 18 which may be a device as described in U.S. Patent Application 07/151,740 to the same assignee, which is incorporated herein by reference. The output of the voice activated detection device 18 is a digital speech signal 20. The voice activated detection device may be implemented by digital signal processing circuitry 4 (Fig. 1).

After the digital speech signal 20 has been extracted from the input signal, the digital speech signal 20 is provided to a boundary detector 22 which typically determines the beginning and end points of an utterance that is found in the digital speech signal. The determination may be carried out by a standard boundary detector algorithm such as the type described by Wilpon et al.

The utterance is then conveyed to a feature extraction device 26 where spectral or other features are typically extracted, typically through LPC analysis. The feature extraction procedure transforms the utterance into a sequence of test feature vectors 28. Preferably, each test vector contains the features of a speech frame of approximately 30 msec. An overlap of typically 50% may be applied between adjacent speech frames.
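The framing described above (frames of roughly 30 msec with about 50% overlap) can be illustrated with a minimal sketch. The function name `split_into_frames`, the 8 kHz sampling rate and the choice to drop a partial trailing frame are assumptions made for illustration only; the source does not specify them, and the LPC analysis itself is not shown.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=30.0, overlap=0.5):
    """Split a sample sequence into overlapping frames.

    A partial frame at the end of the signal is dropped (an assumption).
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    # The hop is the distance between the starts of adjacent frames.
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


# Hypothetical input: 1200 samples at an assumed 8 kHz sampling rate.
frames = split_into_frames(list(range(1200)), 8000)
print(len(frames), len(frames[0]))
```

Each frame would then be passed to the feature extraction stage to yield one test feature vector.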

The sequence of test feature vectors 28 supplied by the feature extraction 26 is provided to a pattern recognition algorithm 30. The pattern recognition algorithm consists of two primary parts — a scoring algorithm 31 and a decision procedure 36. A set of reference templates 32 from a memory 34 is passed to the scoring algorithm 31 to serve as a reference. The memory storage area 34 is typically of the type depicted in Fig. 1.

Preferably, reference templates, consisting of sequences of parameter vectors, are stored in the memory 34 during a process called training (not shown) . Training typically consists of inputting signals of a certain class to the system according to the steps of voice detection through feature extraction described above. Following these steps, the input signals are processed, and reference templates 32 are generated and stored in the memory area 34.

The parameter vectors of the template provided by the training procedure represent characteristic features of the class of input signals. Typically, a template may represent utterances of a particular word or of a particular subword unit such as a syllable or a phoneme. Alternatively, the template may represent the voice of a particular person.

Typically each parameter vector is a feature vector of a reference utterance. Alternatively, the parameter vectors may include parameters defining a model for a feature sequence of a test utterance.

In accordance with the present invention, a novel approach to pattern recognition, using a modification of the dynamic programming method for the scoring procedure, is achieved based on a method of path estimation and normalization of an accumulated similarity score as described in detail hereinbelow. Preferably, the novel approach to pattern recognition uses a modified Dynamic Time Warping algorithm or alternatively, a Hidden Markov Model algorithm for the scoring algorithm 31. According to a further embodiment, any other suitable dynamic programming based algorithm may be used instead of the examples offered herein.

The output of the scoring algorithm 31 is a set of final similarity scores (as defined hereinbelow) , with each score indicating the similarity between the sequence of test vectors 28 and each of the reference templates 32.

The scoring algorithm output is typically provided to decision procedure 36 which may comprise a k-NN (k-Nearest Neighbor) rule for determination of the class of inputs to which the pattern between the beginning and endpoints in input signal 12 belongs.

The overall output of the pattern recognition procedure provides a code or index 40, which describes the class of inputs to which the pattern between the beginning and the endpoints in input signal 12 belongs. Typically, this code or index indicates the verbal contents of input signal 12. Alternatively, the code or index indicates the identity of the speaker who uttered the speech embodied in the input signal 12.

Reference is now made to Fig. 3 which shows a graphical representation of a preferred embodiment of a part of the sequence of the pattern recognition procedure of Fig. 2 in accordance with a preferred embodiment of the invention.

The graph representation shows a non-linear time warping function which may be used for scoring the similarity between a test utterance and a reference template.

The time warping function maps the time axis of a test feature sequence 50 to the time axis of a reference template 52. The mapping provides a time registration between the reference template 52, which is preferably provided by the memory storage area 34 (Fig. 2) and the test feature 50, which may be provided by the feature extraction device 26 (Fig. 2) . The reference template 52 comprises a sequence of M parameter vectors representing a word from a vocabulary recognizable by a speech recognition system such as the speech recognition system of Fig. 2. M may vary according to the particular reference template. The test feature sequence 50, comprises a sequence of N test feature vectors.

The graph comprises a grid in which each point (n,m) is associated with a local similarity score, where m refers to the m-th parameter vector of the reference template and n refers to the n-th test feature vector in the sequence of test feature vectors.

The skilled professional may determine the local similarity score associated with each pair of test feature vectors and reference parameter vectors according to his considerations.

It is assumed that the lower the local similarity score, the greater the similarity between the pair of items being compared.

Preferably, the local similarity scores may be determined by computing standard Euclidean or Mahalanobis distances between the test feature vector and the reference parameter vector.
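As an illustration of the two distances named above, the following sketch computes a local similarity score between one test feature vector and one reference parameter vector. The function names are hypothetical, and the Mahalanobis variant assumes that an inverse covariance matrix `inv_cov` is already available (for example from training); how that matrix is obtained is not stated in the source.

```python
import math


def euclidean_distance(x, y):
    """Standard Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mahalanobis_distance(x, y, inv_cov):
    """Mahalanobis distance, given the inverse covariance matrix (list of rows)."""
    diff = [a - b for a, b in zip(x, y)]
    dim = len(diff)
    # Quadratic form diff^T * inv_cov * diff.
    quad = sum(diff[i] * inv_cov[i][j] * diff[j]
               for i in range(dim) for j in range(dim))
    return math.sqrt(quad)


test_vector = [3.0, 4.0]
reference_vector = [0.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(euclidean_distance(test_vector, reference_vector))
print(mahalanobis_distance(test_vector, reference_vector, identity))
```

With the identity matrix as inverse covariance the two measures coincide; in both cases a lower value indicates greater similarity, as the text assumes.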

According to a further embodiment, the local similarity score may be determined by a speech specific distortion measure such as the likelihood ratio distortion measure proposed by Itakura in the article, "Minimum Prediction Residual Principle Applied to Speech Recognition", IEEE Trans. Acoustic. Speech and Signal Processing, Vol. ASSP-23, Feb. 1975, pp. 67-72. The article is incorporated herein by reference.

Alternatively, the local similarity score may be probabilistic. Typically, the probabilistic local similarity score could be computed using a parametric function of the test feature vector, which depends on the reference parameter vector. The function value provides a statistical estimate of the minus log of the likelihood of observing the test feature vector in a particular segment of the reference word.

A feasible warping path, 54, is a sequence of grid points which satisfy certain constraints. Specific constraints are determined by the skilled professional. A typical constraint requires the feasible warping path to map the beginning and ending feature vectors of the test to the beginning and ending parameter vectors of the reference, respectively. Another typical constraint is that the slope of the warping path will be within a specified limit, typically between 1:2 and 2:1.

A search area 56, in which all feasible warping paths (as defined above) are contained, is typically defined by:

$\{ (n,m) \mid g(n) < m < f(n) \}$

where f and g are defined as follows:

$f = \min \{ f_1, \ldots, f_p \}$

where $\{ f_1, \ldots, f_p \}$ is a set of linear searching area boundary functions defining the upper constraints, and

$g = \max \{ g_1, \ldots, g_q \}$

where $\{ g_1, \ldots, g_q \}$ is a set of linear searching area boundary functions defining the lower constraints.

In Fig. 3, p=3 and q=3. These figures are just offered as examples and should not be seen as limiting the possible definition for f or for g.
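A minimal sketch of the search area definition follows: the upper bound f is taken as the minimum of a set of linear functions and the lower bound g as the maximum of another set, and a grid point is kept when m lies strictly between them. The representation of each linear function as a `(slope, intercept)` pair and the particular boundary values are assumptions chosen for illustration (with three upper and three lower functions, as in Fig. 3); they are not taken from the figure.

```python
def search_area(num_test, num_ref, upper, lower):
    """Return the grid points (n, m), 1-based, that satisfy g(n) < m < f(n).

    upper and lower are lists of (slope, intercept) pairs describing the
    linear boundary functions (a hypothetical representation).
    """
    points = []
    for n in range(1, num_test + 1):
        f_n = min(a * n + b for a, b in upper)   # upper constraint f(n)
        g_n = max(a * n + b for a, b in lower)   # lower constraint g(n)
        for m in range(1, num_ref + 1):
            if g_n < m < f_n:
                points.append((n, m))
    return points


# Hypothetical boundary functions for a 6 x 6 grid.
upper = [(2.0, 1.5), (0.5, 4.5), (0.0, 6.5)]     # f1, f2, f3
lower = [(0.5, -1.5), (2.0, -7.5), (0.0, 0.5)]   # g1, g2, g3
area = search_area(6, 6, upper, lower)
print(len(area))
```

The sloped boundaries enforce the slope limits on the warping path, while the constant ones simply keep the area inside the grid.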

Fig. 4 shows a pseudo-code description of a scoring algorithm as part of the pattern recognition in the speech recognition system of Fig. 2 in accordance with a preferred embodiment of a dynamic programming technique of the invention.

The algorithm of Fig. 4 can be implemented by the digital processing circuitry 4 of Fig. 1. Alternatively, the algorithm can be implemented using other suitable computing hardware in accordance with state-of-the-art electronic design and programming techniques.

The scoring procedure, which is typically based on a Dynamic Time Warping algorithm, or alternatively, on a Hidden Markov Model algorithm, is preferably used to determine the similarity between a test utterance and reference word in speech recognition procedures.

The operation of the scoring algorithm of Fig. 4 is described according to the following steps:

STEP 1: INITIALIZE GRID

During this step, initial values are assigned to each point in search area 56, where the search area is as defined above. This step is independent of the content of the sequence of test feature vectors, and depends only on the number N of test feature vectors in a certain sequence and the number M of parameter vectors in a reference template.

Initial grid properties are defined as follows:

(1) A set of path beginning grid points and a set of path ending grid points are defined. A typical definition of the beginning set is:

$\{ (n,1);\ n = 1, \ldots, x_1 \} \cup \{ (1,m);\ m = 1, \ldots, x_1 \}$

A typical definition of the ending set is:

$\{ (n,M);\ n = N-x_2, \ldots, N \} \cup \{ (N,m);\ m = M-x_2, \ldots, M \}$

where $x_1$, $x_2$ are the maximum expected errors of the boundary detector at the beginning and at the end of the test word, respectively (assuming that the reference boundaries are sufficiently accurate).
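The typical beginning and ending sets can be sketched as follows. The sketch assumes the reading that the beginning set extends up to x1 grid points from (1,1) along both axes and that the ending set reaches back x2 grid points from (N,M) along both axes; the function names and the sample sizes are hypothetical.

```python
def beginning_points(x1):
    """Path beginning grid points: (n, 1) and (1, m) for indices 1..x1."""
    return ({(n, 1) for n in range(1, x1 + 1)}
            | {(1, m) for m in range(1, x1 + 1)})


def ending_points(num_test, num_ref, x2):
    """Path ending grid points: (n, M) for n = N-x2..N and (N, m) for m = M-x2..M."""
    return ({(n, num_ref) for n in range(num_test - x2, num_test + 1)}
            | {(num_test, m) for m in range(num_ref - x2, num_ref + 1)})


# Hypothetical sizes: N = 10 test vectors, M = 8 reference vectors.
begin_set = beginning_points(3)
end_set = ending_points(10, 8, 2)
print(sorted(begin_set))
print(sorted(end_set))
```

Larger values of x1 and x2 tolerate larger boundary detector errors at the cost of a wider set of candidate paths.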

(2) For each grid point in the search area, as defined hereinabove, a list of "access paths" is defined. An access path is a short path leading from a neighboring grid point to a given grid point. The access paths should be defined in such a way that a concatenation of access paths leading from a path beginning grid point to a path ending grid point constitutes a feasible path (as defined above). Additionally, any feasible path must be representable as a concatenation of access paths from a path beginning grid point to a path ending grid point.

When the scoring algorithm is based on the Dynamic Time Warping algorithm, access paths are preferably defined according to the symmetric p=1 rule of Sakoe and Chiba. The rule is described in the article, incorporated herein by reference, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", published in the IEEE Trans. Acoustic. Speech and Signal Processing, Vol. ASSP-26, Feb. 1978, pp. 43-49.
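For illustration, the access paths implied by the symmetric slope-constrained rule of Sakoe and Chiba, as it is commonly stated, can be listed as below. This is a sketch under that assumption rather than a transcription of Fig. 4; the function names and the set-based representation of the search area are hypothetical.

```python
def access_paths(n, m):
    """Candidate access paths into (n, m); each is a list of grid points.

    The first point of each path is its beginning point (p, q).
    """
    return [
        [(n - 1, m - 2), (n, m - 1), (n, m)],   # steep path, overall slope 2:1
        [(n - 1, m - 1), (n, m)],               # diagonal step
        [(n - 2, m - 1), (n - 1, m), (n, m)],   # shallow path, overall slope 1:2
    ]


def valid_access_paths(n, m, area):
    """Keep only the access paths whose points all lie in the search area."""
    return [path for path in access_paths(n, m)
            if all(point in area for point in path)]


# Hypothetical tiny search area.
area = {(1, 1), (2, 2), (2, 3), (3, 3)}
print(valid_access_paths(2, 3, area))
```

Concatenating such access paths keeps the overall slope of any feasible path between 1:2 and 2:1.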

Alternatively, an access path may be defined by a left to right finite state automaton where each reference parameter vector is represented by a state and each grid point (n,m) indicates that at time n, the automaton has reached state m. An access path to a grid point (n,m) is a two-point path of the form [(n-1,k), (n,m)] where there exists a transition leading from the state representing the k-th reference parameter vector to the state representing the m-th reference parameter vector. Such a definition is common in Hidden Markov Models.

STEP 2: LOOP ON GRID POINTS IN SEARCH AREA

For each grid point (n,m) in the search area 56, the following steps are performed to establish the optimal path which reaches that point, where the following definitions hold true:

For each grid point (n,m) along a feasible path, a local weight may be defined indicating the significance of the local similarity score at that point. A bias at the point (n,m) may be defined to indicate the a priori likelihood of the feasible path passing through that point.

The accumulated similarity score, D(n,m) of a feasible path containing the grid point (n,m) , is the sum of all biases along the path from the path beginning to the point (n,m) , plus the sum of all local similarity scores from the path beginning to the point (n,m) , where each local score is multiplied by a corresponding local weight. The local similarity score is calculated according to the methods outlined above and the bias and local weight are calculated as defined below.

The overall weight, W(n,m), of a path containing the point (n,m) is the sum of all local weights along that path from its beginning to its ending. The accumulated weight, B(n,m), of a path containing the point (n,m) is the sum of all local weights along the path, from the path beginning till the point (n,m).

The future weight, F(n,m) of a path containing the point (n,m) is the sum of all local weights along the path, from the point following (n,m) till the path end. For a given feasible path, the overall weight is the sum of the accumulated weight and the future weight.

The normalized similarity score with respect to a feasible path containing the grid point (n,m), A(n,m), is the accumulated similarity score divided by the overall weight, i.e. A(n,m) = D(n,m)/W(n,m).
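A small numeric sketch may help show why the overall weight W = B + F, rather than the accumulated weight B alone, is used for normalization. The numbers below are invented for illustration, and the function name is hypothetical.

```python
def normalized_score(accumulated_score, accumulated_weight, future_weight):
    """A = D / W, where the overall weight W is B + F."""
    overall_weight = accumulated_weight + future_weight
    return accumulated_score / overall_weight


# Two hypothetical partial paths reaching the same grid point.
# Path 1: D = 4, B = 4, F = 16.  Path 2: D = 5, B = 8, F = 12.
score_1 = normalized_score(4.0, 4.0, 16.0)
score_2 = normalized_score(5.0, 8.0, 12.0)
print(score_1, score_2)        # normalization by the overall weight
print(4.0 / 4.0, 5.0 / 8.0)    # normalization by the accumulated weight only
```

In this invented case, normalizing by the accumulated weight alone would favor the second path, whereas normalizing by the overall weight favors the first. The two criteria can therefore rank paths differently, most noticeably near the path beginning where B is small.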

The optimal normalized similarity score, A*(n,m), is the minimum of the normalized similarity scores A(n,m), taken over all feasible paths containing (n,m). The optimal feasible path through (n,m) is the path for which A(n,m) is minimal. If there is more than one such path, the choice of the optimal one is arbitrary.

The optimal overall weight W*(n,m), the optimal accumulated weight B*(n,m), the optimal future weight F*(n,m) and the optimal accumulated similarity score D*(n,m) are the overall weight W(n,m), the accumulated weight B(n,m), the future weight F(n,m) and the accumulated similarity score D(n,m) respectively, associated with the optimal feasible path through (n,m). The optimal path beginning grid point b*(n,m), and the optimal path ending grid point e*(n,m) are the beginning and ending points, respectively, of the optimal feasible path through (n,m) (in the original notation e and b are underlined to indicate that each represents a pair of coordinates).

During this step, the values of D*(n,m), W*(n,m), B*(n,m), F*(n,m), A*(n,m) and b*(n,m) are estimated for each grid point (n,m) in the search area. In addition, the access path leading to (n,m) in the optimal feasible path through (n,m) is also computed. The preferable method of performing this task is according to the steps that follow.

STEP 2.1: COMPUTE LOCAL SIMILARITY SCORE

The local similarity score d(n,m) at point (n,m) is computed according to the methods outlined above.

STEP 2.2: ESTIMATING THE FUTURE WEIGHT.

F*(n,m), the optimal future weight is predicted. Preferably, F*(n,m) is the average of the future weights from (n,m) to each of the path ending grid points which are accessible from (n,m) by a feasible path. Alternatively, F*(n,m) may be the median of those future weights.

Typically, all the definitions of local weights mentioned hereinabove share the property of path invariance, which means that all future weights F(n,m) of paths with the same ending grid point are equal. Therefore the future weight from (n,m) to each path ending grid point is uniquely defined.
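The prediction of the future weight can be sketched for the Dynamic Time Warping case. The sketch assumes local weights equal to the sum of the absolute coordinate differences between consecutive grid points, so that for any monotone path the future weight to an ending point (ne, me) is (ne - n) + (me - m) regardless of the route taken. It also approximates "accessible by a feasible path" by simple monotonicity, ignoring slope constraints. Both are simplifying assumptions, and the function names are hypothetical.

```python
def future_weight(point, end):
    """Path-invariant future weight from point to a given ending grid point."""
    n, m = point
    n_end, m_end = end
    return (n_end - n) + (m_end - m)


def predicted_future_weight(point, end_points):
    """Average future weight over the ending points reachable from point.

    Returns None when no ending point is reachable.
    """
    reachable = [e for e in end_points
                 if e[0] >= point[0] and e[1] >= point[1]]
    if not reachable:
        return None
    weights = [future_weight(point, e) for e in reachable]
    return sum(weights) / len(weights)


# Hypothetical ending set for N = 10, M = 8 with an end tolerance of 2.
end_points = {(8, 8), (9, 8), (10, 8), (10, 6), (10, 7)}
print(predicted_future_weight((8, 6), end_points))
```

Replacing the average with the median of the same list of weights gives the alternative estimate mentioned in the text.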

STEP 2.3: INITIALIZE SCORES FOR BEGINNING POINTS

During this step, initial estimates for the optimal scores of a grid point (n,m) are established, based on the assumption that the optimal path begins at that point.

If (n,m) does not belong to the set of path beginning grid points, the initial estimate of A*(n,m) is assigned the value of infinity, indicating the falseness of the assumption.

If (n,m) is in the set of path beginning grid points (as defined in step 1), the initial estimates are computed according to the following steps.

STEP 2.3.1: DEFINE LOCAL WEIGHT AND BIAS

A local weight and a bias are defined for a path beginning grid point (n,m). The skilled professional may determine these values according to his considerations.

In a dynamic time warping embodiment of the present invention, a typical value for the bias is 0 and a typical value for the local weight is 2.

Alternatively, in a Hidden Markov Model embodiment, a typical value for the bias may be minus log of the likelihood that the path begins at the given point (n,m) and the local weight may be set equal to 1. Typically, the value of the bias is estimated during the training procedure.

STEP 2.3.2: COMPUTE INITIAL SCORES

Using the local weight and bias calculated in Step 2.3.1, the initial estimate for the optimal scores and optimal path beginning grid point, under the assumption that the optimal path begins at (n,m), can be made as follows:

The optimal beginning point is set to be the same point: b*(n,m) = (n,m).

The optimal accumulated weight, B*(n,m) gets the value of the local weight.

The optimal overall weight W*(n,m) is the sum of optimal accumulated and future weights, B*(n,m)+F*(n,m).

The optimal accumulated similarity score, D*(n,m), is the bias for the point (n,m) plus the local similarity score of that same point multiplied by the local weight of the point.

The optimal normalized similarity score, A*(n,m), is the optimal accumulated similarity score divided by the optimal overall weight, i.e. A*(n,m) = D*(n,m)/W*(n,m).

STEP 2.4: LOOP ON LIST OF ACCESS PATHS LEADING TO (n,m)

In each execution of this loop, one of the access paths leading to a point (n,m) is checked for the hypothesis that the optimal path through (n,m) contains that particular access path. This is done by computing the normalized similarity score for a particular access path under this hypothesis and then comparing it to the current estimated value of the optimal normalized similarity score. If the computed value is smaller than the current estimate, all current estimates of optimal scores for that point (n,m) are replaced by the values computed under this hypothesis.
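The initialization of Step 2.3 and the comparison performed in each pass of this loop can be sketched together. The dictionary layout, the function names and the sample numbers are assumptions made for illustration; the sketch is not a transcription of the pseudo-code of Fig. 4.

```python
import math


def initial_estimate(point, is_beginning, local_score, local_weight, bias,
                     future_weight):
    """Initial estimates under the hypothesis that the optimal path begins at point."""
    if not is_beginning:
        # An infinite normalized score marks the hypothesis as false.
        return {"A": math.inf, "D": None, "B": None, "W": None, "begin": None}
    accumulated_weight = local_weight
    overall_weight = accumulated_weight + future_weight
    accumulated_score = bias + local_weight * local_score
    return {"A": accumulated_score / overall_weight, "D": accumulated_score,
            "B": accumulated_weight, "W": overall_weight, "begin": point}


def keep_better(current, candidate):
    """Replace all current estimates when the candidate's normalized score is smaller."""
    return candidate if candidate["A"] < current["A"] else current


# Hypothetical beginning point with local score 0.5, weight 2, bias 0, F* = 18.
best = initial_estimate((1, 1), True, 0.5, 2.0, 0.0, 18.0)
candidate = {"A": 0.04, "D": 0.8, "B": 2.0, "W": 20.0, "begin": (1, 2)}
best = keep_better(best, candidate)
print(best["A"], best["begin"])
```

Because the estimates are replaced as a group, the stored beginning point and weights always belong to the path that produced the stored normalized score.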

The following steps describe the operation of the loop for each given access path to (n,m). In the description, (p,q) will denote the beginning point of the access path to (n,m) under consideration.

STEP 2.4.1: SET WEIGHTS AND BIASES FOR GIVEN ACCESS PATH

Local weights and biases are defined for each point on the given access path except for the first point of the access path. Typically, in a Dynamic Time Warping embodiment of the present invention, the bias is 0 and the weight is the sum of the absolute values of the differences of corresponding coordinates in the current and previous grid points on the access path (i.e. if point (k,l) immediately precedes (n,m) on the access path, then the local weight of (n,m) equals |n-k| + |m-l|).
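The Dynamic Time Warping weight described above, the sum of the absolute coordinate differences between consecutive grid points, can be sketched as a one-line function. The name `dtw_local_weight` is an assumption made for illustration.

```python
def dtw_local_weight(prev_point, point):
    """Local weight of `point` when `prev_point` immediately precedes it."""
    k, l = prev_point
    n, m = point
    # Sum of absolute differences of corresponding coordinates.
    return abs(n - k) + abs(m - l)
```

Under this rule a diagonal step such as (3,4) to (4,5) has weight 2, while a purely horizontal or vertical unit step has weight 1.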

Alternatively, the bias may be minus log of the likelihood of moving to the current grid point from the preceding one (this likelihood may typically be determined during training) and the local weight is 1. This is the common case in Hidden Markov Model devices.

STEP 2.4.2: COMPUTE ACCUMULATED SIMILARITY SCORE FOR GIVEN ACCESS PATH

The accumulated similarity score D(n,m) is computed for a path which comprises the concatenation of the optimal path to (p,q) and the given access path. Therefore D(n,m) is calculated as D*(p,q) plus the sum of all biases along the given access path (except for the first point (p,q)) plus the sum of all local similarity scores along the access path (except for the first point (p,q)), each multiplied by the corresponding local weight.
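Written as a formula, the computation above may be expressed as follows. The indexing is an assumption introduced only for this illustration: the access path visits the points x_0 = (p,q), x_1, ..., x_J = (n,m), and d(x_j), w(x_j) and \beta(x_j) denote the local similarity score, local weight and bias of point x_j.

```latex
D(n,m) \;=\; D^{*}(p,q) \;+\; \sum_{j=1}^{J} \Bigl[ \beta(x_j) + w(x_j)\, d(x_j) \Bigr]
```

The sum starts at j = 1 because the contribution of the first point (p,q) is already contained in D*(p,q).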

STEP 2.4.3: COMPUTE ACCUMULATED AND OVERALL WEIGHT FOR GIVEN ACCESS PATH

The accumulated weight B(n,m) is computed for a path which contains the concatenation of the optimal path to (p,q) and the given access path. Therefore B(n,m) is calculated as B*(p,q) plus the sum of all local weights along the access path (except for the first point (p,q)).

The overall weight W(n,m) is computed by adding the accumulated weight B(n,m) to the estimated optimal future weight F*(n,m).

STEP 2.4.4: COMPUTE NORMALIZED SIMILARITY SCORE FOR GIVEN ACCESS PATH

The normalized similarity score A(n,m) is computed for a path which contains the concatenation of the optimal path to (p,q) and the given access path. Therefore A(n,m) is calculated as D(n,m) divided by W(n,m).
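Steps 2.4.2 to 2.4.4 can be sketched together as follows. This is a minimal sketch under stated assumptions: the function name `score_access_path` and the representation of the access path as a list of `(local_score, local_weight, bias)` tuples, one per point excluding the first point (p,q), are hypothetical.

```python
def score_access_path(D_star_pq, B_star_pq, F_star_nm, steps):
    """Return (D, B, W, A) for the optimal path to (p,q) extended by an access path.

    `steps` holds one (local_score, local_weight, bias) tuple for every
    point on the access path except its first point (p,q).
    """
    D = D_star_pq
    B = B_star_pq
    for local_score, local_weight, bias in steps:
        D += bias + local_weight * local_score  # Step 2.4.2
        B += local_weight                       # Step 2.4.3
    W = B + F_star_nm                           # overall weight
    A = D / W                                   # Step 2.4.4
    return D, B, W, A
```

For example, `score_access_path(1.0, 2.0, 6.0, [(0.5, 2.0, 0.0)])` extends a path with D* = 1 and B* = 2 by a single point of weight 2 and local score 0.5, giving D = 2, B = 4, W = 10 and A = 0.2.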

STEP 2.4.5: UPDATE OPTIMAL SCORES IF NECESSARY

If the normalized score for the given access path, A(n,m), is less than the current estimate of the optimal normalized similarity score, A*(n,m), the following step is performed:

STEP 2.4.5.1: ASSIGN NEW OPTIMAL VALUES

The current estimate for the optimal path through (n,m) is updated to be a path which contains the concatenation of the optimal path to point (p,q) and the given access path. Accordingly, the current values of D*(n,m), B*(n,m), W*(n,m), and A*(n,m) are replaced by the values corresponding to the updated optimal path, that is, D(n,m), B(n,m), W(n,m), and A(n,m), respectively.

In addition, the path beginning grid point b*(n,m) is set to be equal to b*(p,q), the optimal path beginning grid point of the beginning point of the given access path.

STEP 3: DETERMINE FINAL VALUES

After optimal scores have been estimated for all grid points in the search area, the final outputs of the algorithm are determined in the following steps:

STEP 3.1: DETERMINE FINAL NORMALIZED SIMILARITY SCORE

The minimal value of A*(n,m), over all the points in the set of path ending grid points (as defined in step 1) is the final normalized similarity score. The feasible path associated with the final normalized score is the final path.

The path ending grid point (n,m) of the final path is the final path ending grid point. The optimal path beginning grid point of the final path, b*(n,m) is the final path beginning grid point.
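The selection in Step 3.1 can be sketched as a minimum over the path ending grid points. The function name `select_final` and the use of dictionaries keyed by grid point for A* and b* are assumptions made for illustration.

```python
def select_final(ending_points, A_star, b_star):
    """Return (final score, final beginning point, final ending point).

    `A_star` and `b_star` map each grid point (n, m) to its optimal
    normalized similarity score and optimal path beginning grid point.
    """
    final_end = min(ending_points, key=lambda point: A_star[point])
    return A_star[final_end], b_star[final_end], final_end
```

Because each grid point carries its own b*, the final beginning point is read directly from the chosen ending point without backtracking along the path.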

STEP 3.2: DETERMINE FINAL BEGIN AND END ESTIMATES

The first coordinates of the final path beginning grid point and of the path ending grid point are the final estimates for the beginning and ending of a test utterance, respectively. The second coordinate of these grid points indicates the beginning and ending, respectively, of the part of a reference template sequence that was matched by the test utterance. If the second coordinate of the final beginning point or of the final ending point does not equal 1 or M, respectively, this indicates that the initial boundary estimate clipped the beginning or the ending, respectively, of the tested utterance.
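The interpretation of the final grid points in Step 3.2 can be sketched as follows. The function name `boundary_estimates` and the dictionary keys are hypothetical; M is assumed here to be the index of the last vector of the reference template, with indices starting at 1 as in the text.

```python
def boundary_estimates(final_begin, final_end, M):
    """Interpret the final path beginning and ending grid points."""
    (n_begin, m_begin), (n_end, m_end) = final_begin, final_end
    return {
        "test_begin": n_begin,          # first coordinate: test utterance
        "test_end": n_end,
        "template_begin": m_begin,      # second coordinate: reference template
        "template_end": m_end,
        "begin_clipped": m_begin != 1,  # template start was not matched
        "end_clipped": m_end != M,      # template end was not matched
    }
```

For instance, with M = 20, a final path from (3, 1) to (40, 18) matches the template from its first vector but stops short of its last, so only the ending is flagged as clipped.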

Having described the invention with regard to certain specific embodiments thereof, it is to be understood that the description is not meant as a limitation since further modifications may now suggest themselves to those skilled in the art and it is intended to cover such modifications as fall within the scope of the appended claims.

Claims

1. Apparatus for pattern recognition comprising: means for providing a digital pattern to be inspected, said pattern containing a plurality of feature vectors; means for providing at least one digital reference pattern containing a different plurality of parameter vectors; and means for comparing said digital pattern to be inspected with said at least one digital reference pattern, said means comprising: means for providing a search area comprising a grid with said feature vectors on a first axis and said parameter vectors on a second axis; and means for calculating a final normalized score which is the minimum of a plurality of optimal normalized scores each associated with a corresponding feasible path, wherein each of said feasible paths is located in said search area, said means comprising for each point in said search area: means for computing an accumulated score for a plurality of feasible paths which contain said point; means for computing an overall weight for each of said plurality of feasible paths which contain said point; means for computing a normalized score, whereby the normalized score is the accumulated score for said point divided by the overall weight for said point, for each of said plurality of feasible paths which contain said point; and means for selecting the normalized score which is least, from said plurality of normalized scores, as an optimal normalized score for said point.

2. Apparatus according to claim 1, and wherein said search area comprises a plurality of path beginning points and a plurality of path ending points.

3. Apparatus according to claim 1, and also comprising means for determining beginning and ending points of that feasible path which is associated with said final normalized score thereby to determine beginning and ending points of said digital pattern.

4. Apparatus according to claim 1, and wherein said overall weight comprises an accumulated weight and a predicted weight.

5. Apparatus according to claim 3, and wherein said overall weight comprises an accumulated weight and a predicted weight.

6. Apparatus according to claim 1, wherein said digital pattern to be inspected is a speech utterance and said digital reference pattern is based on a Hidden Markov Model.

7. Apparatus according to claim 1, wherein said pattern to be inspected is a speech utterance, said reference pattern is a reference template, and said feasible paths are calculated according to a Dynamic Time Warping algorithm.

8. Apparatus according to claim 1, wherein the beginning and ending points of said feasible path which is associated with the final normalized score are used to estimate beginning and ending points of said pattern to be inspected.

9. Apparatus according to claim 1, and wherein said digital pattern is derived from a speech signal.

10. A method for producing a final normalized score which is the minimum of a plurality of optimal normalized scores each associated with a corresponding feasible path, wherein each of said feasible paths is located in a search area and wherein said search area comprises a set of points characterized by a plurality of path beginning points and a plurality of path ending points, for each point in said search area said method comprising the steps of: computing an accumulated score for a plurality of feasible paths which contain said point; computing an overall weight for each of said plurality of feasible paths which contain said point; computing a normalized score, whereby the normalized score is the accumulated score for said point divided by the overall weight for said point, for each of said plurality of feasible paths which contain said point; and selecting the normalized score which is least, from said plurality of normalized scores, as an optimal normalized score for said point.

11. A method according to claim 10, which also comprises the step of determining beginning and ending points of that feasible path which is associated with said final normalized score.

12. A method according to claim 10, and wherein said overall weight comprises an accumulated weight and a predicted weight.

13. A method according to claim 11, and wherein said overall weight comprises an accumulated weight and a predicted weight.

14. A method for pattern recognition utilizing the method of claim 10, and wherein said final normalized score indicates the similarity between a reference form and a pattern to be inspected.

15. A method for pattern recognition utilizing the method of claim 12, and wherein said final normalized score indicates the similarity between a reference form and a pattern to be inspected.

16. A method according to claim 14, wherein said pattern to be inspected is a speech utterance and said reference form is based on a Hidden Markov Model.

17. A method according to claim 15, wherein said pattern to be inspected is a speech utterance, said reference form is a reference template, and said feasible paths are calculated according to a Dynamic Time Warping algorithm.

18. A method according to claim 10, wherein the beginning and ending points of said feasible path which is associated with the final normalized score are used to estimate beginning and ending points of said pattern to be inspected.

19. A method for pattern recognition comprising the steps of: providing a digital pattern to be inspected, said pattern containing a plurality of feature vectors; providing at least one digital reference pattern containing a different plurality of parameter vectors; and comparing said digital pattern to be inspected with said at least one digital reference pattern, said step of comparing comprising the steps of: providing a search area comprising a grid with said feature vectors on a first axis and said parameter vectors on a second axis; and calculating a final normalized score which is the minimum of a plurality of optimal normalized scores each associated with a corresponding feasible path, wherein each of said feasible paths is located in said search area, said step of calculating comprising, for each point in said search area, the steps of: computing an accumulated score for a plurality of feasible paths which contain said point; computing an overall weight for each of said plurality of feasible paths which contain said point; computing a normalized score, whereby the normalized score is the accumulated score for said point divided by the overall weight for said point, for each of said plurality of feasible paths which contain said point; and selecting the normalized score which is least, from said plurality of normalized scores, as an optimal normalized score for said point.

20. Apparatus according to claim 9, wherein said digital reference pattern indicates the verbal contents of said speech signal.

21. Apparatus according to claim 9, wherein said digital reference pattern indicates the identity of the speaker of said speech signal.

22. A method according to claim 16, wherein said reference form indicates the verbal contents of said speech utterance.

23. A method according to claim 17, wherein said reference form indicates the verbal contents of said speech utterance.

24. A method according to claim 16, wherein said reference form indicates the identity of the speaker of said speech utterance.

25. A method according to claim 17, wherein said reference form indicates the identity of the speaker of said speech utterance.
