WO1999053477A1

WO1999053477A1 - Speech recognition system using parallel microcomputer

Info

Publication number: WO1999053477A1
Application number: PCT/JP1998/001714
Authority: WO
Inventors: Motohito Nakagawa; Hideo Maejima
Original assignee: Hitachi, Ltd.
Priority date: 1998-04-15
Filing date: 1998-04-15
Publication date: 1999-10-21

Abstract

When a beam search is performed in speech recognition using a hidden Markov model (HMM), the speech signal is divided into frames and matched against recognition elements such as words or phonemes, and the beam width in each frame is controlled so as to be an integral multiple of the degree of parallel processing, that is, the number of arithmetic operations executed simultaneously by a parallel microcomputer. Specifically, blocks, each consisting of a run of consecutive nodes among the search paths (nodes on a trellis) whose number is an integral multiple of the degree of parallel processing, are managed by using a table, and in each frame the block to be searched in the next frame is determined so that the beam search is executed efficiently. In this case, the beam width may be adjusted, if necessary, within the range of integral multiples of the degree of parallel processing. Since search nodes are selected and data are managed in accordance with the characteristics of a parallel arithmetic unit, a speech recognition system is provided that uses a parallel computer and performs HMM speech recognition with a beam search of high parallel processing efficiency.

Description

Speech recognition system using a parallel microcomputer

The present invention relates to a speech recognition system using a parallel microcomputer, and more particularly to a speech recognition system in which speech recognition using a hidden Markov model (HMM) is efficiently realized by using a microcomputer having a parallel processing function (parallel MPU).

It should be noted that the present invention is applicable not only to the above-mentioned microcomputer but also to vector type supercomputers and shared memory type multiprocessor parallel computers. It does not, however, assume a so-called distributed memory type parallel computer.

Background art

1. First, before starting to explain conventional speech recognition systems, an overview of a parallel computer and a microcomputer system with parallel processing functions that are generally used in speech recognition systems will be given.

1.1 Parallel computer

Computers that perform parallel processing are known as follows, classified according to the differences between (1) the method of realizing parallel processing and (2) the flow of instructions and the flow of data.

(1) Classification by parallel processing implementation method

Parallel computing methods that have been realized include the instruction pipeline method, the multiple operation unit method (superscalar, VLIW (Very Long Instruction Word), etc.: the former is scheduled by hardware, the latter by software), the operation pipeline method (vector computer), and the multiprocessor method. The pipeline method divides a process into multiple steps and executes the steps in sequence. Since each step can be processed separately and simultaneously, the processing steps are in effect executed in parallel. The pipeline method includes pipeline processing at the instruction level (instruction pipeline method) and pipeline processing at the data level (operation pipeline method). The former is not a target of the speech recognition system of the present invention, because its programs are the same as those of a sequential computer. The latter is the so-called vector computer, and is a target of the speech recognition system of the present invention.

The multiple operation unit method has a plurality of operation units that can execute in parallel with one another, and is widely used in recent MPUs (Microprocessor Units). Of these, hardware-scheduled designs are called superscalar. A typical example of software scheduling is VLIW (Very Long Instruction Word). The speech recognition system of the present invention mainly targets the latter, but it can also be realized with the superscalar method.

The multiprocessor system is a parallel computer having a plurality of processor units (elements), and can be roughly classified into shared memory computers and distributed memory computers. The speech recognition system of the present invention can in principle be realized on either, but it is not suited to a distributed memory computer. However, when each processor element is itself a multiple operation unit type or vector type computer as described above, the speech recognition system of the present invention can be realized effectively.

(2) Classification by instruction flow and data flow

In the above methods, from the viewpoint of software (algorithm), what kind of processing these parallel methods can execute in parallel becomes a problem. From this point of view, the classification can be as follows, focusing on the flow of instructions and the flow of data.

① SISD (Single Instruction stream Single Data stream) type: A method in which single data is processed by a single instruction

② SIMD (Single Instruction stream Multiple Data stream) type: A method in which multiple data are processed simultaneously by a single instruction

③ MIMD (Multiple Instruction stream Multiple Data stream) type: A method in which multiple data are processed simultaneously by multiple instructions

Parallel algorithms do not target the SISD type described in ① above. SISD type software is essentially a sequential algorithm. Of the parallel processing methods described in (1) above, the instruction pipeline method is the only one that belongs to the SISD type. The speech recognition system of the present invention is assumed to run on a parallel computer of the SIMD type (② above) or the MIMD type (③ above). As long as the parallel computer is of the SIMD or MIMD type, its implementation method does not matter, except that distributed memory type multiprocessor parallel computers are not considered because adequate performance cannot be obtained with them.

1.2 Microcomputer system with parallel processing function

In recent years, high-performance microcomputers (MPUs) have been required for multimedia processing of images and sounds. However, on the other hand, there are limitations on the operating frequency and the devices to be implemented, and there is a limit to the performance improvement of conventional sequential (including instruction pipeline processing) MPUs.

In such an environment, in order to improve the processing capacity of a specific type, there are many cases where a multimedia processing unit of a parallel processing type is added.

Such a parallel MPU has, for example, a configuration as shown in FIG. 1. Figure 1 (a) shows a configuration in which a multimedia processing unit (MM unit) 110 capable of parallel processing is added to an ordinary microcomputer comprising the bus interface 101, instruction cache 102, data cache 103, floating-point unit 104, load unit 105, store unit 106, integer unit 107, instruction control unit 108, register file 109, and the like.

The multimedia processing unit 110 includes, for example, as shown in FIG. 1 (b), a function block 120 for executing a vector type operation and a function block 130 for executing a vector reduction type operation. Each of the multimedia registers 125 to 127 and 134 manages a plurality of data collectively. The multimedia processing unit 110 has a plurality of arithmetic units and can process n data at a time (four in the example of FIG. 1 (b)).

As described above, the parallel type MPU has a relatively simple configuration (a multiple operation unit structure) obtained by adding a parallel operation unit. Compared with a high-performance computer such as a supercomputer, however, the parallel MPU is subject to many restrictions on its implementation, for example on memory access.

2. Next, a brief description of a general HMM speech recognition system will be given.

2.1 Speech recognition system

A speech recognition system is configured, for example, as shown in FIG. 2. FIG. 2A shows an example of a system configuration using the microcomputer shown in FIG. 1. The details of FIGS. 2A and 2B will be described later in the embodiments.

Speech recognition is generally processed in the following sequence: feature analysis → recognition matching. In HMM type speech recognition, a procedure for calculating the output probability of each state in the Markov model is added (see Fig. 2 (b)).

In the feature analysis, the audio signal 220 is divided into short partial sections called frames (each frame often overlaps the preceding and following frames), and the features of each frame are analyzed on the assumption that the signal is stationary within the frame. For example, the frequency spectrum and LPC (Linear Predictive Coding) coefficients (including the cepstrum) are used as features.

In recognition matching, the feature vector sequence (observed vector sequence) obtained by the above feature analysis is associated with recognition elements (words, phonemes, etc.).

As these methods, DP (Dynamic Programming) matching, HMM (Hidden Markov Model), NN (Neural Networks), and the like have been proposed.

2.2 HMM speech recognition

2.2.1 Meaning of HMM

The Hidden Markov Model (HMM) is a state transition model represented by a Markov process (a stochastic process in which the state at time t + 1 is given only by the state at time t). In speech recognition, a left-to-right type HMM is often used, as shown in Fig. 3 (a), and this is briefly described as an example.

In HMM speech recognition, as shown in Fig. 3 (a), utterances are modeled with a left-to-right type HMM so that observation vectors can be associated with recognition elements (hereafter described as words). For example, consider expressing the word "ai" as a left-to-right type HMM. The state s1 represents "a", and the state s2 represents "i". The word "ai" can then be expressed by the state transition s1 → s2.

However, this alone can express only the case where "a" is uttered in frame 1 and "i" is uttered in frame 2. Therefore, in order to express various temporal variations, the transition from a state to itself and the transition to the neighboring state are represented stochastically. In this way, an utterance pattern in which "i" continues for m frames after "a" continues for n frames can be expressed stochastically (in the form of the generation probability of each pattern). The acoustic characteristics of the utterance "a" also differ greatly depending on age and gender. Therefore, by stochastically expressing the output pattern of the feature vector in the state s1, which represents the utterance of "a", on the basis of statistically observed patterns, the utterance patterns of various people can be modeled. Note that, in FIG. 3 (a), the first word "ai" and the second word "ao" are examples of modeled words stored in the dictionary.

In this way, the HMM is a probabilistic model of the human vocal process designed to represent the temporal and acoustic variations of words uttered by various people. Therefore, the evaluation must also be probabilistic. In other words, given a certain observation sequence (the analysis result of the input speech), the system evaluates, for the model expressing each word, the probability (likelihood) of obtaining that observation sequence, and outputs the model with the highest likelihood (that is, the word it represents) as a recognition candidate.

2.2.2 Viterbi search

In order to output the model with the highest likelihood as a recognition candidate in HMM speech recognition, it is necessary to calculate the likelihood for each model. There are usually multiple orders of state transitions (state transition sequences, hereinafter "paths") from which a given observation sequence can be obtained, so the sum of the probabilities over all of these paths must be calculated. The Baum-Welch method calculates this sum step by step. Specifically, it can be calculated by recursively evaluating the following equation (1).

$$\alpha_{t+1}(i) = \sum_{j} \alpha_t(j)\, a_{j,i}\, b_i(y_{t+1}) \tag{1}$$

where, in equation (1),

$a_{j,i}$ is the state transition probability from state j to state i,

$b_i(y_t)$ is the probability of outputting feature $y_t$ in state i, and

$\alpha_t(i)$ is the forward probability at time t in state i.
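As a minimal sketch of the recursion in equation (1), the following Python function computes forward probabilities for a toy two-state model. The function name `forward`, the initial-state vector `init`, the list-based data layout, and all numeric values are illustrative assumptions and do not come from the patent.

```python
def forward(init, trans, emit, observations):
    """Forward probabilities alpha_t(i) by the recursion of equation (1).

    init[i]     : probability of starting in state i (assumed input)
    trans[j][i] : state transition probability a_{j,i} from state j to state i
    emit[i][y]  : output probability b_i(y) of symbol y in state i
    """
    n = len(init)
    # Initialisation for the first frame.
    alpha = [init[i] * emit[i][observations[0]] for i in range(n)]
    # Recursion: sum over all predecessor states j, then apply the output probability.
    for y in observations[1:]:
        alpha = [
            sum(alpha[j] * trans[j][i] for j in range(n)) * emit[i][y]
            for i in range(n)
        ]
    return alpha


# Toy left-to-right model: s1 -> s1 or s2, s2 -> s2 (values are made up).
init = [1.0, 0.0]
trans = [[0.6, 0.4], [0.0, 1.0]]
emit = [{"a": 0.9, "i": 0.1}, {"a": 0.2, "i": 0.8}]

alpha = forward(init, trans, emit, ["a", "a", "i"])
likelihood = sum(alpha)  # sum over all paths that produce the observations
```

The final sum over states is what makes this a total over all paths, in contrast to the single best path used by the Viterbi method described next.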

On the other hand, the Viterbi method selects the path with the highest likelihood, that is, the optimal path, from among the multiple possible paths, and evaluates the model by the likelihood of that path. This calculation has a structure similar to that of the Baum-Welch method, as shown in the following equation (2).

$$\alpha_{t+1}(i) = \max\{\alpha_t(i-1)\, a_{i-1,i}\, b_i(y_{t+1}),\ \alpha_t(i)\, a_{i,i}\, b_i(y_{t+1})\} \tag{2}$$

Although the Viterbi search involves almost the same calculation as the Baum-Welch method, it allows the beam search described below to be used. It also has the advantage that it can be applied to two-stage DP (Dynamic Programming) matching.

Fig. 3 (b) shows the trellis for the first word "ai" and the second word "ao", with the state i on the vertical axis and the frame on the horizontal axis. In the figure, the black circles (●) indicate negative nodes and the white circles (〇) indicate active nodes. For the score of the first word "ai", based on the result of the feature analysis described in Fig. 2 (b), the larger of the path s1 → s1 → s2 (solid route) and the path s1 → s2 → s2 (dotted route) is chosen (i.e., the path with the highest score). Similarly, for the score of the second word "ao", the path with the highest score is selected. In Viterbi search, such processing is performed for all words, and the model with the highest score is output as a recognition candidate.
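The recursion of equation (2) and the two-path comparison for the word "ai" can be sketched as follows. This is only an illustration: the function name `viterbi_left_to_right`, the assumption that every path starts in the first state, and the probability values are hypothetical.

```python
def viterbi_left_to_right(self_p, next_p, emit, observations):
    """Best-path score by equation (2) for a left-to-right HMM.

    self_p[i]  : a_{i,i}   (self transition)
    next_p[i]  : a_{i,i+1} (transition to the next state; unused for the last state)
    emit[i][y] : b_i(y)
    Assumes the path starts in state 0 and returns the score of ending in the last state.
    """
    n = len(self_p)
    score = [0.0] * n
    score[0] = emit[0][observations[0]]
    for y in observations[1:]:
        new = [0.0] * n
        for i in range(n):
            stay = score[i] * self_p[i]
            move = score[i - 1] * next_p[i - 1] if i > 0 else 0.0
            new[i] = max(stay, move) * emit[i][y]  # keep only the better predecessor
        score = new
    return score[-1]


# Word "ai": state 0 models "a", state 1 models "i" (values are made up).
self_p = [0.6, 1.0]
next_p = [0.4, 0.0]
emit = [{"a": 0.9, "i": 0.1}, {"a": 0.2, "i": 0.8}]
score_ai = viterbi_left_to_right(self_p, next_p, emit, ["a", "a", "i"])
```

With these toy values the path s1 → s1 → s2 scores 0.15552 and the path s1 → s2 → s2 scores 0.0576, so the function returns the former.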

2.2.3 Beam search

In the above Viterbi search, the likelihood of each model was evaluated using the optimal path. When evaluating this optimal path, a path with a very small likelihood (hereinafter, score) up to that point in the course of the process has an empirically extremely small probability of becoming the optimal path. Therefore, the method of shortening the calculation time by terminating the path search in the middle is called the beam search method in general.
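One conventional way to terminate low-scoring paths is to keep only the nodes whose score lies within a margin of the best score. The sketch below assumes log-domain scores; the function name `prune_by_threshold` and the sample values are hypothetical.

```python
def prune_by_threshold(scores, margin):
    """Return the indices of nodes whose log score is within `margin` of the best score."""
    best = max(scores)
    return [i for i, s in enumerate(scores) if s >= best - margin]


# Log scores of seven neighbouring nodes in one frame (made-up values).
scores = [-2.0, -9.5, -3.1, -2.4, -12.0, -8.0, -2.9]
survivors = prune_by_threshold(scores, margin=2.0)  # -> [0, 2, 3, 6]
```

Note that the surviving nodes in this example are not contiguous and their count is not tied to any hardware width.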

Beam search methods that have been proposed include a method of searching within a fixed search window and a method of searching only paths whose score is equal to or greater than a certain threshold value.

Disclosure of the invention

In general, the parallel type MPU described above shows high performance when operating on a series of data stored in a contiguous area. However, if the area where the data are stored is discontinuous, the data must first be rearranged, which lowers the calculation efficiency. In addition, if the data to be handled do not fill the parallel width (the number of parallel operations), the operation is performed on only those data, and the benefit of parallel calculation cannot be obtained.

Some such parallel MPUs can mask off part of each operation group so that it is not executed, but this requires extra calculation. When the beam search method is applied to such a parallel MPU, the following problems occur, because the conventional method does not select the nodes to search in consideration of the processing efficiency of the parallel computer.

(1) Access to discontinuous data occurs frequently.

(2) The efficiency of end point processing decreases.

In short, even if the number of search nodes is reduced to cut the amount of computation, the parallel efficiency decreases and extra processing occurs, resulting in a large loss from the viewpoint of computation efficiency.

In contrast, the speech recognition system according to the present invention controls the beam width (that is, the number of states searched by the beam search, or the number of nodes on the trellis) in consideration of the processing efficiency of the parallel computer. Specifically, based on the number of operations that the parallel computer can perform simultaneously (hereinafter referred to as parallelism), the paths to be searched (corresponding to nodes on the trellis) are always selected so as to form a run of consecutive nodes whose number is an integer multiple of the parallelism. As one simple method for realizing this control, the nodes on the trellis are managed in groups of nodes (blocks) whose size is an integer multiple of the parallelism, and a beam search is performed by controlling these blocks appropriately. In addition to the score of each node, each block is evaluated as a whole, and this evaluation value is used to determine whether or not each block is searched.

Unlike a search window with a fixed search range (beam width), this block assumes that the beam width can be varied as needed. However, since this beam width change is always performed as a continuous node group of an integer multiple of the parallelism, the number of nodes equal to the parallelism is used as the unit of handling. Then, this block is evaluated as a whole, and the beam width is varied based on the evaluation.
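As a small illustration of keeping the beam width at an integer multiple of the parallelism, the helper below rounds a requested number of nodes up to the next multiple. Rounding up rather than down, the minimum of one unit, and the name `round_beam_width` are assumptions made for this sketch.

```python
def round_beam_width(wanted_nodes, parallelism):
    """Smallest multiple of `parallelism` that covers `wanted_nodes` (at least one unit)."""
    units = max(1, -(-wanted_nodes // parallelism))  # ceiling division
    return units * parallelism


widths = [round_beam_width(w, 4) for w in (1, 4, 5, 9)]  # -> [4, 4, 8, 12]
```

Every width produced this way can be processed in full groups of four parallel operations, with no partially filled group.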

In short, whereas processing has conventionally been handled node by node, the present invention treats a series of nodes as a unit. This unit is simply called a block for convenience. Each unit is evaluated, and the beam width is changed based on that evaluation.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a block diagram showing an example of an MPU having a parallel processing function,

Figure 2 is a hardware and software system configuration diagram showing an example of a speech recognition system.

Figure 3 illustrates the concept of HMM speech recognition.

FIG. 4 is a flowchart showing an overall view of the processing procedure of the first embodiment of the speech recognition according to the present invention.

FIG. 5 is a diagram for explaining the principle of block update of the first embodiment,

FIG. 6 is a diagram for explaining the concept of the procedure 403 shown in FIG. 4,

Figure 7 is a diagram for explaining the handling of a block whose number of nodes is n times the degree of parallelism (n = 2, 3, ...).

FIG. 8 is a diagram showing an example of a block management list used in the first embodiment,

FIG. 9 is a flowchart showing details of the procedure 402 shown in FIG. 4,

FIG. 10 is a diagram for explaining the concept of block reconstruction in the first embodiment,

FIG. 11 is a diagram for explaining the concept 2 of block reconstruction of the first embodiment,

FIG. 12 is a flowchart showing details of the procedure 406 shown in FIG. 4,

FIG. 13 is a flowchart showing an overall view of the processing procedure of the second embodiment of the speech recognition according to the present invention.

FIG. 14 is a flowchart showing the details of the procedure 1301.

BEST MODE FOR CARRYING OUT THE INVENTION

An embodiment of the speech recognition system according to the present invention will be described. In this embodiment, a microcomputer (MPU) as shown in FIG. 1 described in Background Art is used. FIG. 1A is a block diagram illustrating the overall system configuration of the MPU used in the present embodiment. This MPU can be connected to an external memory over an external bus 160 via a bus interface 101. The program is sent to the instruction control unit 108 via the instruction cache 102, and each processing unit of the computer is controlled based on the program. Data are transferred between the data cache 103 and the register file 109 via the load unit 105 or the store unit 106. The data sent to these registers are processed by the integer unit 107 for integer operations and by the floating-point unit 104 for floating-point numbers, and the results are stored again in the register file 109 as necessary. These are general MPU components, but the MPU used in the present embodiment is further characterized by having a multimedia processing unit (MM unit) 110 with a parallel processing function.

FIG. 1B shows a specific example of the MM unit 110. A SIMD type parallel mechanism is taken as the example here, but, as described above, the invention can equally be realized on other parallel computers, including VLIW type MPUs, with the exception of distributed memory type parallel computers. The number of operations that the parallel computer can process simultaneously (parallelism) is described here as 4, but the degree of parallelism can of course be set arbitrarily.

As shown in FIG. 1 (b), the MM unit 110 includes a function block 120 for executing a vector type operation and a function block 130 for executing a vector reduction type operation. Here, a plurality of data are collectively managed in each of the multimedia registers 125 to 127 and 134. The functional block 120 has a plurality of arithmetic units and can process n data at a time (four in the example of FIG. 1B).

In the function block 120, data are sent from the group of multimedia registers in the register file 109 to the multimedia register A 125 and the multimedia register B 126 via the internal bus 150. In the multimedia registers A and B, the data are divided into four parts, each of which represents an independent (separate) value. The same instruction is processed in parallel by the arithmetic units (ALUs) 121 to 124, and the results are stored in the multimedia register C 127.

Therefore, in the functional block 120, the operation (vector type operation) expressed by the following equation (3) can be executed in parallel.

$$\mathrm{Vector}[i] = f(x[i],\, y[i]), \quad i = 1, 2, \ldots, n \tag{3}$$

where n is the degree of parallelism and Vector[i] is an array element holding the result of the operation on the array elements x[i] and y[i].

The function block 130 is a function block for executing a vector reduction type operation. Data are sent from the group of multimedia registers in the register file 109 to the multimedia register D 134 via the internal bus 150. The ALUs 131 to 133 are arranged as a so-called tournament, which finally reduces the data to one scalar value that is stored in the scalar register 135. For example, the sum of the four data stored in the multimedia register D 134 can be obtained by having the ALUs 131 to 133 perform addition.

As described above, the functional block 130 can execute in parallel the operation (vector reduction type operation) expressed by the following expression (4).

$$\mathrm{scalar} = f(x[i]), \quad i = 1, 2, \ldots, n \tag{4}$$

Here, n is the degree of parallelism, and scalar is a scalar variable that stores the result of the calculation over the array elements x[i].
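The two kinds of operation in equations (3) and (4) can be mimicked in plain Python as below. This is a software sketch only, not a model of the hardware; the function names `vector_op` and `tournament_reduce` are hypothetical, and the pairwise reduction is one assumed reading of the tournament arrangement.

```python
import operator


def vector_op(f, x, y):
    """Equation (3): apply the same operation f to every pair of elements."""
    return [f(a, b) for a, b in zip(x, y)]


def tournament_reduce(f, x):
    """Equation (4): combine elements pairwise, round by round, until one scalar remains."""
    values = list(x)
    while len(values) > 1:
        paired = [f(values[k], values[k + 1]) for k in range(0, len(values) - 1, 2)]
        if len(values) % 2:  # odd count: carry the last value into the next round
            paired.append(values[-1])
        values = paired
    return values[0]


sums = vector_op(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])  # element-wise addition
total = tournament_reduce(operator.add, [1, 2, 3, 4])          # (1 + 2) + (3 + 4)
largest = tournament_reduce(max, [3, 9, 2, 5])
```

For four elements the reduction performs two operations in the first round and one in the second, three in total, which matches the three ALUs 131 to 133 described above.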

FIG. 2 shows an example of a speech recognition system using the MPU having the configuration shown in FIG. 1. FIG. 2A is a block diagram illustrating a hardware configuration. In this example, the speech recognition system is constituted by a speech recognition board 201, a microphone 202 and a monitor 203. Of these, the speech recognition board 201 can be realized entirely as a one-chip LSI. The monitor 203 is not always necessary, for example when the system is used as a voice input device.

The speech recognition board 201 is constituted by an AD converter 204, an MPU 205, a ROM 206, and a RAM 207. When the monitor 203 is added, a monitor interface (IF) 208 is further required.

The AD converter 204 converts the analog audio signal input from the microphone 202 into a digital signal. The MPU 205 is an MPU having the configuration shown in FIG. 1. The ROM 206 is a read-only memory, and stores the programs and necessary data (for example, dictionaries and HMM parameters) of the speech recognition system. The RAM 207 is a readable / writable memory.

Figure 2 (b) shows the schematic configuration of the software. When the speech recognition system is called (start 210), the features of the sample speech are first analyzed in step 211 (feature analysis).

In the feature analysis, the speech waveform is extracted in sections of fixed length at regular intervals (each such partial speech section is called a frame), and the sound properties are analyzed within each frame on the assumption that they do not change (stationarity). The sound properties can be analyzed, for example, by the frequency spectrum or the LPC coefficients. The frequency spectrum can be calculated by the fast Fourier transform (FFT), and the LPC coefficients can be calculated by the Levinson-Durbin recursion. Since these are generally represented by groups of multiple parameters, they are called feature vectors. By this feature analysis (step 211), the audio signal 220 is replaced with a feature vector for each frame. The feature vector sequence is stored in the table 221 and is called an observation vector sequence. The output probability is then calculated in step 212. This output probability calculation can also be performed in parallel with the recognition matching by the Viterbi search in step 213. This case will be described in the next embodiment (Example 2).

Here, the output probability means the probability that each state outputs a sound having a certain characteristic in the HMM. Therefore, the output probability is expressed as a function of a feature vector indicating “a feature”.

For HMM speech recognition, there are two methods: one in which the feature vector is vector-quantized and the output probability is given as a function of the quantized vector (discrete HMM), and one in which it is given as a probability function of the feature vector itself (continuous HMM). In this embodiment, either of them may be used. In the former case, a table is looked up using the quantized vector. In the latter case, the probability function of the feature vector is evaluated. The calculated output probabilities are stored in Table 222.

In the matching (Viterbi search) in step 213, the likelihood of each model is calculated from the observation vector sequence obtained in the feature analysis in step 211 and the output probabilities calculated in step 212. Here, the likelihood of the state transition sequence with the highest probability in each model (hereinafter, Viterbi score) is adopted as the likelihood of each model (Viterbi search); the result is stored in Table 22 as the recognition result, and the process terminates (step 214). The above likelihood (Viterbi score) can be obtained by recursively calculating the following equation (2).

$$\alpha_{t+1}(i) = \max\{\alpha_t(i-1)\, a_{i-1,i}\, b_i(y_{t+1}),\ \alpha_t(i)\, a_{i,i}\, b_i(y_{t+1})\} \tag{2}$$

To execute this Viterbi search efficiently, a beam search method is employed, in which the search of a path is terminated when its intermediate score is low. The aim of the present invention is to determine the paths to be searched in this beam search in consideration of the characteristics of the parallel computer that executes it.

In the present embodiment, in the above beam search method, several consecutive nodes in each model are managed collectively so that the number of search paths in each model, that is, the number of nodes, is an integer multiple of the degree of parallelism of the parallel computer that executes the search. Here, this series of nodes is called a block. In this embodiment, by appropriately managing these blocks, the number of search paths (nodes) in each model is always kept at an integer multiple of the parallelism of the parallel computer.

FIG. 4 shows a flowchart of a process using the beam search method in the present embodiment. Here, the parallelism is described as 4.

This process is an example of general frame-synchronous processing. When the processing is started in step 401, the search is executed in step 402 simultaneously for the nodes of each block against the observation vector of the first frame. That is, the Viterbi search is executed in parallel for the nodes of each block. Step 402 is executed repeatedly in a double repetition process (double loop) using blocks and words as control variables. When all blocks have been processed for a certain word, the end of the inner loop is determined in step 403, and processing of the next word is started. When the end of the processing of all the words is determined in step 404, the necessity of block reconstruction is determined in step 405, and if necessary, block reconstruction is performed in step 406. Then the next frame is processed.

In this way, when the processing for the frames of all the voice sections is completed, the end of the search is determined in step 407, the recognition candidates are determined and output in step 408, and the processing ends in step 409.

The block reconstruction in step 406 is executed at regular intervals, although it may also be performed every frame. Here, a method is assumed in which the procedure 406 is executed with a certain period, such as once every N frames. In this case, in step 405, the remainder of the frame variable (frame number) modulo N is taken, and whether to execute step 406 can be determined from its value.

In the above processing, the procedure 402 becomes a triple repetition process (over frames, words, and blocks) if the repetition over frames is included. Step 406, on the other hand, is only a single repetition process (over frames) executed at regular intervals. Therefore, in the present embodiment, the processing efficiency of the procedure 402 in effect determines the performance.
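The loop structure of FIG. 4, including the periodic decision of step 405, can be sketched as follows. The callbacks `search_block` and `rebuild_blocks`, the dictionary layout of `words`, and the choice of triggering reconstruction when the remainder equals N - 1 are all assumptions made for illustration.

```python
def frame_synchronous_search(num_frames, words, period, search_block, rebuild_blocks):
    """Skeleton of the frame-synchronous loop (steps 402 to 407 of FIG. 4)."""
    for frame in range(num_frames):               # loop closed by step 407
        for word, blocks in words.items():        # loop closed by step 404
            for block in blocks:                  # loop closed by step 403
                search_block(frame, word, block)  # step 402: search one block
        if frame % period == period - 1:          # step 405: remainder of frame number modulo N
            rebuild_blocks(frame, words)          # step 406: block reconstruction


# Record the calls to see how often each step runs (toy data).
searched, rebuilt = [], []
frame_synchronous_search(
    6,
    {"ai": ["block0", "block1"], "ao": ["block0"]},
    3,
    lambda frame, word, block: searched.append((frame, word, block)),
    lambda frame, words: rebuilt.append(frame),
)
```

In this toy run the block search is called 18 times while reconstruction runs only twice, which illustrates why the cost of step 402 dominates.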

The concept of block update in step 402 will be described with reference to FIG. FIG. 5 (a) is an explanatory diagram showing a block evaluation method in block movement in a beam search for a state (node) belonging to the block.

Active nodes, which are currently being searched, are represented by 〇, and negative nodes, which are not being searched, are represented by ●. In Fig. 5 (a), in the frame at time t = k-1, the nodes A to D belonging to the block are active, and the nodes around them (0 and E in the figure) are negative.

In Viterbi search, the score of the frame at time t = k is obtained from the score of the frame at time t = k-1. In the left-to-right type HMM, each node transitions after one step (frame) either to its own state or to the next state. Therefore, in the frame at time t = k, the nodes A' to E' become active, as shown in FIG. 5 (a).

In Viterbi search, the calculation of Eq. (2) is performed recursively, which means the selection of the path with the highest score among the paths reaching the node of interest. In the beam search, the path from the negative node can be ignored because the likelihood is low.

Therefore, at time t = k, the score of node A' can be evaluated by the score of A → A' (self transition), and the score of node E' can be evaluated by the score of D → E' (other transition). In addition, the scores of the nodes B' to D' can be evaluated by Equation (2).

In the first embodiment, the number of nodes in the block is controlled to be an integral multiple of the degree of parallelism of the parallel computer. The search in step 402 shown in FIG. 4 is therefore carried out without changing the number of nodes in the block, and any change in the number of nodes is made by the block reconstruction in step 406 if necessary. Therefore, in the case of Fig. 5 (a), four new nodes are selected from the five activated nodes (block update). This block can be updated in two ways, as shown on the right side of Fig. 5 (a). In this embodiment, the sum of the scores of all nodes (total score) is obtained for each candidate block, and the block with the larger total is selected.

FIG. 5 (b) is an explanatory diagram schematically showing the principle by which this block update is realized using a comparison based on the total score. There are two choices for the new block: the four nodes A' to D', or the four nodes B' to E'. Here, focusing on the score of the entire block, consider the case of choosing the block whose total score is larger.

Since the nodes B', C', and D' are common to both candidate blocks, comparing the total scores amounts to comparing node A' with node E'. In addition, node A' has no other transition (transition from node 0) and node E' has no self transition (transition from node E). The comparison therefore reduces to comparing the self transition A → A' with the other transition D → E'. In other words, the comparison of the total scores shown on the right side of Fig. 5 (a) is eventually replaced by a comparison of the scores of these two nodes. Since this has the same structure as the processing of equation (2), the whole series of operations can be processed efficiently by multiple data processing using the same instruction (SIMD).
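The equivalence between comparing total scores and comparing only the two end nodes can be checked with a few lines of Python. The function names and the sample scores are hypothetical, and resolving ties by keeping the current block is an assumption.

```python
def shift_by_end_nodes(scores):
    """scores: scores of the n + 1 activated nodes (A' to E' when n = 4).

    Returns 1 to move the block by one node and 0 to keep it,
    comparing only the first node (A') with the last node (E').
    """
    return 1 if scores[-1] > scores[0] else 0


def shift_by_total(scores):
    """The same decision made by comparing the total scores of the two candidate blocks."""
    keep_total = sum(scores[:-1])  # A' .. D'
    move_total = sum(scores[1:])   # B' .. E'
    return 1 if move_total > keep_total else 0


moving = [-7.0, -3.0, -2.0, -4.0, -5.0]   # E' scores better than A'
staying = [-1.0, -3.0, -2.0, -4.0, -5.0]  # A' scores better than E'
decisions = [(shift_by_end_nodes(s), shift_by_total(s)) for s in (moving, staying)]
```

Both functions agree because the shared nodes B' to D' contribute the same amount to each total and cancel out of the comparison.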

FIG. 6 is an explanatory diagram showing the processing principle when step 402 of FIG. 4 is executed on a parallel computer based on this concept.

In order to perform the above processing, the score of each node at time t = k−1, the state transition probabilities (self transition and other transitions) and output probabilities from each node are required. The former score can be loaded collectively if it is managed collectively in the block management list. If the state transition probability and output probability are managed as a continuous series of data, they can also be loaded collectively.

Therefore, here, the description will be made assuming that these data exist in the multimedia register.

In step 1, the self-transition score for nodes A to D can be calculated using the following equation (5).

$$S(j) = \alpha_t(j)\, a_{j,j}\, b_j(y_{t+1}), \quad j = i,\ i+1,\ i+2,\ i+3 \qquad (5)$$

where S(j) is the score of the self transition at node j. This equation (5) can be executed in parallel with a 4-parallel simultaneous addition (Vector-ADD); with log-likelihood scores, the products become sums. Similarly, the scores of the other transitions for nodes A to D can be calculated using the following equation (6).

$$N(j) = \alpha_t(j)\, a_{j,j+1}\, b_{j+1}(y_{t+1}), \quad j = i,\ i+1,\ i+2,\ i+3 \qquad (6)$$

where N(j) is the score of the other transition at node j. This equation (6) can also be executed in parallel with a 4-parallel simultaneous addition (Vector-ADD).

To select the optimal path (Viterbi path) from these, the (temporary) scores of the other transitions are rotated one word to the left so that each lines up with the self-transition score of its destination node.

In this way, in step 2, a 4-parallel simultaneous maximum operation (Vector-MAX) is performed using the following equation (7), and the Viterbi scores of the three nodes at the center, where the optimal path needs to be selected, can be calculated.

$$V(j) = \max\{S(j),\ N(j-1)\}, \quad j = i+1,\ i+2,\ i+3 \qquad (7)$$

where V(j) is the Viterbi score at node j.

In this case, at the left end of the block, the maximum operation is executed between the self transition Self 1 of node A and the other transition Next 4 of node D. This is nothing other than the condition judgment shown in Fig. 5 (b). Therefore, in this embodiment, based on this result, in step 3 the block is moved by one node if the other transition Next 4 from node D is the larger. To move a block by one node, a right rotation of one word is executed on the multimedia register in which the above result is stored. Otherwise, no rotation is required. The search is completed by storing the contents of this register all at once.
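Steps 1 to 3 above can be sketched in Python, with plain lists standing in for the multimedia registers. This is a sketch under stated assumptions, not the embodiment's implementation: scores are assumed to be log likelihoods (so products become sums), the function and parameter names are hypothetical, and list index 0 is assumed to hold the start node of the block, so the "left" and "right" rotations of the text appear here simply as rotations by one position.

```python
def viterbi_block_step(alpha, self_lp, next_lp):
    """One frame of the parallel Viterbi search for a block of n nodes.

    alpha[j]   : log score of node j at frame t
    self_lp[j] : log(a_{j,j}   * b_j(y_{t+1}))      (hypothetical input)
    next_lp[j] : log(a_{j,j+1} * b_{j+1}(y_{t+1}))  (hypothetical input)
    Returns (new_scores, moved), where moved is True when the block
    was shifted by one node.
    """
    # Step 1: self and other transitions, each one "Vector-ADD".
    S = [a + p for a, p in zip(alpha, self_lp)]   # equation (5)
    N = [a + p for a, p in zip(alpha, next_lp)]   # equation (6)

    # Data shaping: rotate so that N(j-1) lines up with S(j).
    # Slot 0 then holds Next of the last node (the candidate new node).
    R = N[-1:] + N[:-1]

    # Step 2: one "Vector-MAX" gives the Viterbi scores, equation (7);
    # slot 0 simultaneously performs the end-node comparison.
    M = [max(s, r) for s, r in zip(S, R)]

    # Step 3: move the block by one node if the new end node won.
    moved = R[0] > S[0]
    if moved:
        M = M[1:] + M[:1]   # block becomes nodes i+1 .. i+n
    return M, moved
```

When `moved` is true, the returned list holds the scores of nodes B' to E' in order; otherwise it holds the scores of nodes A' to D'.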

Next, the handling when the number of nodes is N times the parallelism of the computer (N = 2, 3, ...) is shown. Figure 7 shows an example of handling a block with twice the number of nodes as the degree of parallelism, but the same applies to other cases.

In the first embodiment, the operation is performed according to the same principle as that of FIG. That is, the block is updated from the comparison of the scores at both ends of the block.

Figure 7 (a) shows an example of likelihood calculation for each path. Since the block has twice as many nodes as the degree of parallelism, two vector operations are required to calculate the likelihood. This is executed for both the self transitions (Self) and the other transitions (Next). The scores of the other transitions must be data-shaped as in Figure 6. Rotation across multiple registers can be performed at high speed by using dedicated hardware.

However, the processing shown in Fig. 7 (a) can also be realized by combining general shift instructions, without using a dedicated processing device. As an example, the case of a degree of parallelism of 4 and two blocks is shown in Fig. 7 (b). Assume that the scores of the other transitions (Next1 to Next4) and (Next5 to Next8) are each stored in a multimedia register. For data shaping, consider recombining these into the data (Next8, Next1 to Next3) and (Next4 to Next7). The former can be easily realized by taking the logical sum (OR) of the multimedia register storing (Next1 to Next4) shifted right by one word and the multimedia register storing (Next5 to Next8) shifted left by three words. The same applies to the latter.
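The shift-and-OR data shaping for two 4-word registers can be illustrated as follows. This is a minimal sketch under assumptions that are not in the source: each register is modelled as a Python integer holding four 16-bit words, the first word is taken to be the most significant one, and all names are hypothetical.

```python
WORD_BITS = 16                      # assumed word size
LANES = 4                           # degree of parallelism
REG_MASK = (1 << (WORD_BITS * LANES)) - 1
WORD_MASK = (1 << WORD_BITS) - 1


def pack(words):
    """Pack four words into one register value (first word most significant)."""
    reg = 0
    for w in words:
        reg = (reg << WORD_BITS) | (w & WORD_MASK)
    return reg


def unpack(reg):
    """Inverse of pack: return the four words, first word first."""
    return [(reg >> (WORD_BITS * (LANES - 1 - k))) & WORD_MASK
            for k in range(LANES)]


def shift_right_words(reg, k):
    return reg >> (WORD_BITS * k)


def shift_left_words(reg, k):
    return (reg << (WORD_BITS * k)) & REG_MASK


def shape_pair(reg_a, reg_b):
    """(Next1..Next4), (Next5..Next8) -> (Next8, Next1..Next3), (Next4..Next7)."""
    first = shift_right_words(reg_a, 1) | shift_left_words(reg_b, 3)
    second = shift_right_words(reg_b, 1) | shift_left_words(reg_a, 3)
    return first, second
```

For example, packing the words 1 to 4 and 5 to 8 into two registers and calling `shape_pair` yields registers whose words are 8, 1, 2, 3 and 4, 5, 6, 7.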

Score data shaping can be realized by performing similar shift rotation (Fig. 7 (c)).

Next, a method of managing blocks will be described. Figure 8 shows a block management method using a block management list. FIG. 8 (a) shows a configuration example of the block management list. In this example, there is a header holding a pointer because a linked list is used, but a linked list is not always necessary. For example, an index list may be used. In that case, the management lists may be fixed in memory and defined contiguously.

The block management list requires the word being searched in the block (word k), the configuration of the nodes (only the starting point is needed, because the nodes are always contiguous), and the score of each node (node i). In addition, since the block score and the node with the highest score, that is, the maximum node (and the maximum score), are used as needed, there is an area for managing them, but this is not always necessary.
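The fields listed above could be laid out as in the following Python sketch. The class and field names are hypothetical; the maximum node and maximum score are computed on demand here rather than stored, which is one possible choice given that the text describes their storage area as optional.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlockManagementList:
    """One management list: as many contiguous nodes as the degree of parallelism."""
    word: int                 # word searched in this block (word k)
    start_node: int           # nodes are contiguous, so the start suffices
    scores: List[float]       # score of each managed node
    next: Optional["BlockManagementList"] = None   # header pointer of the linked list

    @property
    def max_score(self) -> float:
        """Maximum (representative) score of the block."""
        return max(self.scores)

    @property
    def max_node(self) -> int:
        """Node number of the node holding the maximum score."""
        return self.start_node + self.scores.index(self.max_score)

    def nodes(self) -> List[int]:
        """Node numbers managed by this list."""
        return list(range(self.start_node, self.start_node + len(self.scores)))
```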

Since the management list of this embodiment has a linked list structure, each management list itself manages the same number of nodes as the degree of parallelism of the computer. If the number of nodes in the block is N times the degree of parallelism (N = 2, 3, ...), the block is represented by linking management lists. An example of this configuration is shown in Fig. 8 (b). Horizontal links indicate the links between blocks. Vertical links indicate the links within a block when the number of nodes is N times the degree of parallelism (N = 2, 3, ...). In this way, the nodes can be processed four at a time in order from Head to Tail ((1) → (2) → (3) → (4) in FIG. 8 (b)).

FIG. 9 shows the above process as a specific flow. As shown in FIG. 4, step 402 is executed following step 401 (or on return from step 403, step 404, or step 407). In FIG. 9, step 402 is performed in the order of the management list. First, the likelihood calculation of the self transition of each node (Node 1 to 4) (steps 911 to 914) and the likelihood calculation of the other transition (steps 921 to 924) are performed in parallel. In step 901, the vertical link is checked, and if there is a link, processing proceeds to the linked list. In that case, steps 911 to 914 and steps 921 to 924 are repeated. If there is no further vertical link, data shaping is performed in step 102. This data shaping is performed by the method shown in Fig. 6 when the block has no vertical link, and by the method shown in Fig. 7 (b) when it has a vertical link.

Next, a Viterbi score is calculated by comparing the paths (the calculation takes the maximum value) (steps 931 to 934). At the node corresponding to the starting point of the block, however, the score used for block relocation is evaluated by an operation of the same structure as the Viterbi score calculation (taking the maximum value). This is repeated if there is a vertical link (step 904).

In step 905, the relocation condition is determined by the method shown in FIGS. 5, 6, and 7, and based on this determination, in step 906 the block is moved according to the score of the node corresponding to the start point of the block. Specifically, this can be done by updating the node data in the block list and performing the data shaping that accompanies block movement (data shaping by the method shown in Figs. 6 and 7).
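The flow of FIG. 9 for a block made of N vertically linked management lists can be sketched as below. This is a simplified illustration under assumptions that are not in the source: each management list is modelled as a plain Python list, scores are log likelihoods, the per-node log transition terms are supplied precomputed, and the function and parameter names are hypothetical.

```python
def search_linked_block(chunks, self_lp, next_lp):
    """One frame of search for a block of N linked lists of n nodes each.

    chunks[c][k]  : log score of the k-th node of the c-th linked list
    self_lp[c][k] : log(a_{j,j}   * b_j(y_{t+1}))     for that node
    next_lp[c][k] : log(a_{j,j+1} * b_{j+1}(y_{t+1})) for that node
    Returns (new_chunks, moved).
    """
    n = len(chunks[0])

    # Follow the vertical links: likelihoods of self and other transitions,
    # one vector operation per linked list.
    S = [[a + p for a, p in zip(ch, sp)] for ch, sp in zip(chunks, self_lp)]
    N = [[a + p for a, p in zip(ch, np_)] for ch, np_ in zip(chunks, next_lp)]

    # Data shaping across registers: rotate the Next scores by one word so
    # that each lines up with the Self score of its destination node.
    flat = [x for reg in N for x in reg]
    flat = flat[-1:] + flat[:-1]
    R = [flat[k:k + n] for k in range(0, len(flat), n)]

    # Viterbi scores: one vector maximum per linked list.
    M = [[max(s, r) for s, r in zip(sr, rr)] for sr, rr in zip(S, R)]

    # Relocation condition at the start node of the block.
    moved = R[0][0] > S[0][0]
    if moved:
        flat_m = [x for reg in M for x in reg]
        flat_m = flat_m[1:] + flat_m[:1]
        M = [flat_m[k:k + n] for k in range(0, len(flat_m), n)]
    return M, moved
```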

Next, details of the block reconfiguration in step 406 in FIG. 4 will be described. In this example, sorting is used together with it, but it can also be realized without sorting. Here, the highest score among the nodes belonging to a block is called the representative score (the maximum score shown in Fig. 8 (a)), and the node having it is called the representative node (the maximum node shown in Fig. 8 (a)). In this example, the representative score is regarded as the score of the block, and the blocks are reconstructed using the representative score of each block.

In this embodiment, two values are used: a threshold value (Θmin) for controlling the deletion and reduction of a block, and a threshold value (Θmax) for controlling the expansion of a block. Both are functions of time (frames). In this embodiment, Θmin is given by the following equation (8), and Θmax is given by the following equation (9).

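$$\Theta_{\min} = B_{\min} - \lambda_{\min} \qquad (8)$$

where $\lambda_{\min}$ is a constant, $B_{\min} = \min_k \{B(k)\}$, and

$$B(k) = \max_j \{V(j)\}, \quad j = i,\ i+1,\ i+2,\ i+3 \quad (\text{at Block } k).$$

$$\Theta_{\max} = B_{\max} + \lambda_{\max} \qquad (9)$$

where $\lambda_{\max}$ is a constant, $B_{\max} = \max_k \{B(k)\}$, and

$$B(k) = \max_j \{V(j)\}, \quad j = i,\ i+1,\ i+2,\ i+3 \quad (\text{at Block } k).$$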
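As a small worked illustration of equations (8) and (9), the two thresholds can be computed from the representative scores of the blocks as follows. The function name and the numeric values of the constants used with it are hypothetical.

```python
def block_thresholds(blocks, lambda_min, lambda_max):
    """Compute (theta_min, theta_max) from the Viterbi scores of each block.

    blocks     : list of blocks, each a list of Viterbi scores V(j)
    lambda_min : constant of equation (8)
    lambda_max : constant of equation (9)
    """
    representative = [max(v) for v in blocks]      # B(k) = max V(j)
    b_min = min(representative)
    b_max = max(representative)
    theta_min = b_min - lambda_min                 # equation (8)
    theta_max = b_max + lambda_max                 # equation (9)
    return theta_min, theta_max
```

For two blocks with scores (-3, -1, -4, -2) and (-8, -6, -7, -9), the representative scores are -1 and -6; with the hypothetical constants 2.0 and -0.5 this gives thresholds of -8.0 and -1.5.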

Figure 10 (a) shows the method of block extension. If the representative score of a block is larger than Θmax during block reconstruction, block expansion is performed. Here, Θmax is a function of the number of linked management lists (N when the number of nodes in the block is expressed as the degree of parallelism × N), so a different evaluation is applied to each block. This is one way of keeping extension from being concentrated on a specific block.

If block expansion is determined by the threshold, a new block list is created. It is linked so that the link extends in the vertical direction of the block to be expanded. Specifically, new memory is allocated, the pointer is updated, and the link relation is set. The necessary settings (such as the applicable word and the nodes to be managed) are also made. Since the corresponding word is the same as that of the block to be expanded, it is copied, and for the managed node, the number of the node adjacent to the block is written (the value obtained by adding 4 (= degree of parallelism) to that of the last list linked in the vertical direction).

Nodes in the new list are inactive nodes that have not been searched so far, so a score equivalent to 0 is written. In general, log likelihood is used in speech recognition. In this case, a value of −∞ (sometimes expressed by a substitute value) is set.

Figure 10 (b) shows the behavior of the search under block expansion when the nodes are initialized as described above. In essence, if the paths from a newly set node are given the score of a non-existent path, score 0 (−∞ in logarithmic value), then of the two paths reaching the same node the one with the lower evaluation is cut off, so the path from the newly set node is cut off and the path from the adjacent active node is selected.

Therefore, with a degree of parallelism of 4, all nodes in the new list become active after 4 frames (t = k → t = k + 4). No special treatment is required during this time. As for the calculation, even if the parallel calculation is performed as though all nodes were active, consistent results are obtained because of the scores set by the initialization.
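The effect of initializing the new nodes with −∞ can be reproduced with a short Python experiment. It is a sketch under simplifying assumptions that are not in the source: a left-to-right model with the same log transition scores at every node, output probabilities omitted, and hypothetical function names.

```python
NEG_INF = float("-inf")


def viterbi_frame(alpha, log_self, log_next):
    """One left-to-right Viterbi update over all nodes, treating every node
    as active (output probabilities are omitted for brevity)."""
    new = []
    for j in range(len(alpha)):
        stay = alpha[j] + log_self
        move = alpha[j - 1] + log_next if j > 0 else NEG_INF
        new.append(max(stay, move))   # a path scored -inf never wins
    return new


def frames_until_all_active(n=4):
    """Extend an active block of n nodes by n new nodes scored -inf and
    count the frames needed until every node has a finite score."""
    alpha = [0.0] * n + [NEG_INF] * n
    frames = 0
    while any(a == NEG_INF for a in alpha):
        alpha = viterbi_frame(alpha, -1.0, -0.5)
        frames += 1
    return frames
```

No masking is needed in the update: the −∞ entries propagate through the additions and always lose the maximum operation, so one more node becomes active in each frame.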

FIG. 11 illustrates block reduction. If the representative score of a block is smaller than Θmin during block reconstruction, block reduction (or deletion) is performed. Here, Θmin is a function of the number of linked management lists (N when the number of nodes in the block is expressed as the degree of parallelism × N), so a different evaluation is applied to each block. This is the same as for block expansion. Here, the management lists are reduced one at a time. Therefore, when N ≥ 2, block reduction is performed, and when N = 1, block deletion is performed.

If block reduction is determined by the threshold (N ≥ 2), a new block list is created. This block has N − 1 vertical links. Specifically, new memory is allocated, the pointer is updated, and the link relationship is set. Next, the nodes are selected. In the following example, the representative node is placed at the center of the new block. This can be easily calculated on any parallel computer that can perform vector reduction type parallel operations (VRmax and VRargmax). Even in the case of N > 2, this can be combined with the tournament type.

The node selection method used when a block is reduced is explained with reference to Fig. 11 (a). Here, the control for reducing two blocks to one block is shown. In this example, the maximum node is kept and placed at the center. This uses the vector reduction VRmax operation (s = VRmax {x1, x2, x3, x4}, where x1, x2, x3, x4 are stored in a multimedia register and s is stored in a scalar register).

The maximum node may be selected by performing the VRmax calculation as a tournament. That is, first, the maximum node is obtained for each block. If these are, for example, max1 (maximum node of block 1) and max2 (maximum node of block 2), each is copied in turn into another multimedia register (called a merge or pack), and 0 (−∞ in the logarithmic domain) is inserted into the unused area, that is, the area is masked. The maximum node is then obtained by another VRmax calculation. In this way, when the degree of parallelism is 4, the maximum value of 4 blocks (16 nodes) can be obtained by a two-stage VRmax calculation. If there are 5 or more blocks, the same process is repeated one more time. In this case, up to 16 blocks (64 nodes) require only three stages.

With respect to the maximum node obtained in this way, contiguous nodes are selected so that it is at the center, and the nodes constituting the new block are thereby arranged. For example, in Fig. 11 (a), if the third node from the bottom is the maximum, the 2nd to 5th nodes from the bottom are selected. This can be expressed by the following equation (10), where n is the degree of parallelism, i is the node number, and N is the multiplier when the number of nodes in the block is expressed as the degree of parallelism × N.

$$i = \begin{cases} 1 & (j_{\max} - n(N-1)/2 + 1 \le 0) \\ j_{\max} - n(N-1)/2 + 1 & (0 < j_{\max} - n(N-1)/2 + 1 < n+1) \\ n+1 & (n+1 \le j_{\max} - n(N-1)/2 + 1) \end{cases} \qquad (10)$$

where $j_{\max}$ is the node with the highest score in the block.

The nodes are selected according to this criterion, and the necessary settings (corresponding word, managed nodes, etc.) are made. That is, since the corresponding word is the same as that of the block to be reduced, it is copied, and for the managed node, the node number (i above) obtained by equation (10) is written. Since all nodes in the new list are active nodes, their data must be copied over from the old block list.
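The tournament-style maximum search and the node selection of equation (10) can be sketched together in Python. This is an illustrative sketch only: `max` over a slice stands in for the VRmax instruction, node numbers are 1-based as in equation (10), integer division is assumed for n(N−1)/2, and all function names are hypothetical.

```python
NEG_INF = float("-inf")


def tournament_max(scores, n=4):
    """Maximum of `scores` using only n-wide reductions.
    Returns (maximum, number_of_stages)."""
    if not scores:
        raise ValueError("scores must not be empty")
    level, stages = list(scores), 0
    while len(level) > 1 or stages == 0:
        # Mask the unused lanes with -inf so the register is full.
        level += [NEG_INF] * (-len(level) % n)
        # One VRmax per register; the results are packed for the next stage.
        level = [max(level[k:k + n]) for k in range(0, len(level), n)]
        stages += 1
    return level[0], stages


def reduced_block_start(j_max, n, N):
    """Equation (10): start node i of the reduced block, clamped to 1 .. n+1."""
    centered = j_max - n * (N - 1) // 2 + 1
    return min(max(centered, 1), n + 1)
```

With n = 4 and N = 2, a maximum at the third node gives a start node of 2, so the 2nd to 5th nodes form the new block, in agreement with the example of Fig. 11 (a).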

When N = 1, the block is deleted. In this case, simply delete the list.

The update of the link list in the block reduction described above will be described with reference to FIG. 11 (b). First, a new list is created, and the necessary data (the scores of the constituent nodes) is copied. This is done by copying from each of the two blocks (in the figure, the copy from the data stored in the first list is Copy 1, and the copy from the data stored in the second list is Copy 2).

The created list is inserted into the linked list (replacing the old one). This can be done by updating the pointers of the list being replaced. Specifically, the address of the list linked from the old block is copied to the new list (Copy 3 in the figure), and at the same time the pointer in the list that links to the old list (which holds the address of the old list) is updated, that is, rewritten to the address of the new list.

This series of processing requires a certain amount of calculation, but block reduction does not occur often, so its load on the calculation as a whole is small.

FIG. 12 shows a series of flows (step 406) of the above block reconstruction. If it is determined in step 405 shown in FIG. 4 that the frame undergoes block reconfiguration, step 406 is executed.

In step 1221, the first block (of the horizontal link list) is selected and processed. In this step 1221, a block score is calculated. This is done with a vector reduction type maximum operation. Step 1201 controls the repetition over the blocks by means of a control variable. When all the blocks have been processed, the end of the loop is determined in step 122, and the flow proceeds to the next step 122.

If the block representative score satisfies the condition of formula (9), it is determined in step 122 that the block is to be extended, and the block is extended in step 1203. In block expansion 1203, a new management list is created, the adjacent nodes are selected, and they are initialized with score 0 (−∞ in log value), as shown in Fig. 10 (a). When this is done, the procedure proceeds to step 1207.

On the other hand, if the condition is not satisfied in step 122, the procedure goes to step 1204 to determine whether the block is to be reduced or deleted. Here, if the block representative score satisfies the condition of equation (8), it is determined that block reduction / deletion is to be performed, and block reduction / deletion is performed in step 125. Here, if N ≥ 2, the block is reduced (a management list remains), so the procedure proceeds to step 1207. In step 1207, a new link list is constructed. Here, a method is assumed in which lists are taken one by one from the old list and connected to the new link list.

In the case of block deletion, since no management list remains, the procedure goes from step 122 to the end judgment of step 122. Processing then proceeds to the next unreconstructed block. When all the blocks have been processed, the end of the loop is determined in step 1209, and the procedure proceeds to step 407 shown in FIG. 4.
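The decision logic of this block reconstruction can be summarized in the following Python sketch. It only classifies each block and leaves the list manipulation aside. As a simplification that is not in the source, the two thresholds are passed in as fixed values rather than as functions of the number of linked lists, and all names are hypothetical.

```python
def plan_reconstruction(blocks, theta_min, theta_max):
    """Decide the action for each block.

    blocks : list of blocks; each block is a list of N management lists,
             each management list being a list of node scores.
    Returns a list of (block_index, action) with action in
    {"expand", "reduce", "delete", "keep"}.
    """
    plan = []
    for k, lists in enumerate(blocks):
        # Block score: representative (maximum) score over all nodes.
        representative = max(max(scores) for scores in lists)
        if representative > theta_max:
            action = "expand"        # add one management list
        elif representative < theta_min:
            # Lists are removed one at a time; a single list is deleted.
            action = "reduce" if len(lists) >= 2 else "delete"
        else:
            action = "keep"
        plan.append((k, action))
    return plan
```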

With the above processing, a series of recognition processing can be executed.

FIG. 13 is an overall view of a processing procedure showing another embodiment (Embodiment 2) of the speech recognition system according to the present invention, which differs slightly from the processing procedure of Embodiment 1 shown in FIG. 4. In this embodiment, the block score is evaluated at every search of each block, and reconstruction is executed when a certain condition is satisfied as a result.

Here, the block is controlled by a single threshold. This threshold is Θ(t). It can be calculated, for example, as shown in the following formula (11).

$$\Theta(t) = p(t) + q \qquad (11)$$

where p(t) is a function of t and q is a constant.

The above function is usually set so that p (t) decreases as t increases (for example, a linear function with a negative slope). Since the Viterbi score is a conditional probability value, the score (probability) tends to decrease as the search progresses. Therefore, the threshold value to be evaluated is also set to “decrease as t increases” so that those with relatively small scores can be excluded in each frame.

In some cases, the absolute value (also called distance) of the log probability (becomes negative) is used as the score. In such a case, since the score also increases as t increases, p (t) is set to increase as t increases.
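A threshold of the form of formula (11) with a linear p(t) might look as follows for the log-probability case. The slope, the constant q, and the rule that a block score below the threshold triggers reduction or deletion are assumptions made for illustration; for distance scores the slope would be positive and the comparison reversed.

```python
def theta(t, slope=-2.0, q=-10.0):
    """Formula (11) with p(t) = slope * t (hypothetical linear choice)."""
    return slope * t + q


def should_reduce(block_score, t):
    """Assumed condition: the block score has fallen below the threshold."""
    return block_score < theta(t)
```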

In this embodiment, when a block satisfies the above condition, that block is reduced or deleted and, at the same time, another block is expanded. Block reconstruction is thus performed by scrap & build so that the number of blocks is kept constant.

Step 1301 is executed after the parallel Viterbi search of step 402 shown in FIG. 4. FIG. 14 shows the flow of the processing of step 1301. In step 1401, a block score is calculated based on the results of the search. The representative score may be used here as in the first embodiment, but as another example, the sum (average value) of the scores is used here.

The sum of the scores is evaluated in step 1402, and when the condition of equation (11) is satisfied, the block is reconstructed. That is, the block is reduced or deleted in step 144. Then, in step 1403, in place of the block that has been deleted or reduced, the block with the largest score among the blocks that have not yet been expanded is expanded. In the present embodiment, the linked list is sorted (described in detail later). Therefore, it suffices to search the list in descending order for a block that has not been expanded.

After the search, unless the block has been deleted in step 144 (that is, unless N is 1 when the number of nodes is expressed as the degree of parallelism × N), the management block must be linked into the link list.

Step 1406 and step 1407 are processing for placing the list properly in the link list. Specifically, the link list is arranged in order using bubble sort. As a simple example, when D is to be linked into a list linked in the order A > B > C, the sort compares the sizes in order starting from A and inserts D when something smaller than D is found. For example, D is compared with A and B in order. If A > D and B > D, and finally C < D, then D is inserted between B and C. In step 1407, specifically, this magnitude comparison is made. If the result shows that the position is not appropriate (the condition is not satisfied), the procedure returns to step 1406 to access the next link list. By repeating this, the link list is sorted. Sorting has advantages such as enabling scrap & build and facilitating the output of recognition candidates in step 408 shown in FIG. 4.
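The insertion of a list into a link list kept in descending score order, as in the A > B > C example, can be sketched as follows. The class and function names are hypothetical, and the sketch walks a singly linked list, which is one possible reading of the comparison loop of steps 1406 and 1407.

```python
class ScoredList:
    """Minimal stand-in for a management list: a name, a score and a link."""

    def __init__(self, name, score, next=None):
        self.name = name
        self.score = score
        self.next = next


def insert_sorted(head, new):
    """Insert `new` into a list linked in descending score order.
    Returns the (possibly new) head of the list."""
    if head is None or new.score > head.score:
        new.next = head
        return new
    cur = head
    # Advance while the next list still has a score at least as large.
    while cur.next is not None and cur.next.score >= new.score:
        cur = cur.next
    new.next = cur.next
    cur.next = new
    return head


def names(head):
    """Names of the lists in link order, for inspection."""
    out = []
    while head is not None:
        out.append(head.name)
        head = head.next
    return out
```

Inserting D with score 5 into A (9) > B (7) > C (3) places D between B and C.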

The expanded block can be sorted with the new score at the time of the next search. Therefore, after step 1403, the procedure moves to step 400 shown in FIG.

Industrial applicability

In the speech recognition system according to the present invention, search nodes are selected and data is managed in accordance with the characteristics of the parallel computing unit in the parallel microcomputer used, so that a beam search with high parallel processing efficiency can be realized. Since the structure of a block (a group of contiguous nodes) corresponds to a register that stores vector variables, vector variables can be loaded and stored collectively.

Also, since there is no need for masking or end-point processing in the parallel operations, efficient execution is possible even on a simple multimedia microcomputer without such dedicated functions. Further, even when a parallel computer having mask and end-point processing functions is used, the method of the present invention is very efficient. The number of nodes handled as a block is an integer multiple of the degree of parallelism, and since this is the maximum number of nodes that can be processed in the same batch execution, the number of search nodes is the largest obtainable for the same amount of computation (excluding block management costs). Also, since the beam width can be varied for each recognition element such as a word or phoneme (although it is an integer multiple of the degree of parallelism), unnecessary searches can be omitted. For this reason, as in a conventional beam search of the variable beam width type (e.g., pruning based on a threshold), the search can be performed while suppressing the number of search nodes and without degrading recognition performance.

Claims

1. A speech recognition system that performs speech recognition processing using a parallel microcomputer having a parallel processing function of executing a plurality of instructions at the same time,

the system using an HMM speech recognition method in which, using a hidden Markov model (HMM), which is a stochastic process in which the current state is determined by the state one step before, the speech signal is divided frame by frame into partial sections of recognition elements such as words and phonemes, each of which is represented by a stationary-source HMM, the likelihood, which is the probability that each HMM generates the same acoustic features as those of the input recognition element, is calculated, and the model with the highest likelihood is taken as a recognition candidate,

wherein, when a search is performed that evaluates each model by the value of the state transition sequence (optimal path) that maximizes the likelihood obtained by the calculation, a beam search that limits the state transition sequences (paths) to be searched to some promising paths is applied to the speech recognition processing in the parallel microcomputer, and

the speech recognition processing is performed with control means for controlling the beam width, which is the number of search states for each frame in the beam search, to be an integral multiple of the degree of parallelism with which the parallel microcomputer can process data simultaneously.

2. The speech recognition system according to claim 1, wherein the control means for controlling the beam width to be an integral multiple of the degree of parallelism comprises: a table, set in order to manage blocks each extracted as a continuous series of nodes, an integer multiple of the degree of parallelism in number, from the group of paths (nodes on the trellis) to be searched in the model, the table indicating the nodes belonging to each block and an evaluation value of the block; and judging means for judging, in block units, whether to continue or abort the search for the nodes to be searched collectively, based on the result of evaluating the block as a whole from the nodes belonging to the block in the table.

3. The speech recognition system according to claim 2, wherein the determining means comprises: a first procedure of calculating, from a block of a frame that is a partial speech section, the score of each node belonging to the frame; and a second procedure of determining a search block in the next frame from the scores, and wherein, when a beam search that calculates scores and selects search nodes in each frame is performed, the first and second procedures control the beam width so that it is an integral multiple of the degree of parallelism.

4. The speech recognition system according to claim 3, wherein the second procedure for determining a search block in the next frame comprises block updating means having: a calculation procedure for calculating scores for the possible state transitions from each node belonging to the block; a selection procedure for selecting, from the set of nodes consisting of the results of the calculation procedure and the nodes belonging to the current block, the block combination having the largest sum of scores; and a recording procedure for recording the selected set of nodes in the list that manages the blocks, so that the set of nodes becomes the block in the next frame.

5. The speech recognition system according to claim 3, wherein the second procedure for determining a search block in the next frame comprises search block selection means having: a score evaluation procedure for evaluating a block score; a criterion determination procedure for determining a criterion for deciding whether each node corresponding to a search path continuing from the nodes belonging to the block is to be a search target in the next frame; and an exclusion procedure for excluding all nodes belonging to the block from the search targets when the block score falls below the criterion.

6. The speech recognition system according to claim 4, wherein the block updating means comprises: a score evaluation procedure for evaluating a block score; a criterion determination procedure for determining a criterion for expanding the beam width, which is the search range of a block; and a procedure for incorporating into the block, when the block score exceeds the criterion, adjacent nodes equal in number to the degree of parallelism in addition to the set of nodes belonging to the block.

7. The speech recognition system according to claim 4, wherein the block updating means comprises: a score evaluation procedure for evaluating a block score; a criterion determination procedure for determining a criterion for reducing the beam width, which is the search range, when the recognition element is searched with a plurality of blocks; and a node selection procedure for selecting, from the nodes belonging to the block, consecutive nodes fewer in number by the degree of parallelism when the block score falls below the criterion.

8. The speech recognition system according to claim 5 or 7, wherein the criterion determination procedure includes a procedure for obtaining, by the score evaluation procedure, a minimum block score, which is the minimum value of the block scores, and a procedure for calculating a criterion value from the minimum block score.

9. The speech recognition system according to claim 6, wherein the criterion determination procedure includes a procedure for obtaining, by the score evaluation procedure, a maximum block score, which is the maximum value of the block scores, and a procedure for calculating a criterion value from the maximum block score.

10. The speech recognition system according to any one of claims 5 to 7, wherein the criterion determination procedure includes: a data table storing data for setting a criterion corresponding to a frame number; a frame number evaluation procedure for evaluating the frame number; and a criterion value calculation procedure for calculating a criterion value corresponding to the frame number.

11. The speech recognition system according to claim 7, wherein the node selecting means includes: a maximum node evaluation procedure for evaluating the position of the maximum node, which is the node with the highest score among the nodes belonging to the block; and a continuous node selecting procedure for selecting consecutive nodes fewer in number by the degree of parallelism.