WO1993020552A1 - Speech recognition apparatus using neural network, and learning method therefor - Google Patents
Speech recognition apparatus using neural network, and learning method therefor
- Publication number
- WO1993020552A1 (PCT/JP1993/000373, JP9300373W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- value
- output
- neural network
- learning
- data
- Prior art date
Links
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 78
- 238000000034 method Methods 0.000 title claims description 75
- 238000013500 data storage Methods 0.000 claims description 25
- 238000000605 extraction Methods 0.000 claims description 13
- 238000001514 detection method Methods 0.000 claims description 7
- 230000001537 neural effect Effects 0.000 claims description 7
- 230000008859 change Effects 0.000 claims description 3
- 238000012937 correction Methods 0.000 claims description 3
- 230000009977 dual effect Effects 0.000 claims description 3
- 239000000284 extract Substances 0.000 claims 1
- 238000012545 processing Methods 0.000 description 34
- 238000010586 diagram Methods 0.000 description 33
- 230000006870 function Effects 0.000 description 19
- 230000000694 effects Effects 0.000 description 10
- 230000005540 biological transmission Effects 0.000 description 8
- 230000006835 compression Effects 0.000 description 8
- 238000007906 compression Methods 0.000 description 8
- 230000008569 process Effects 0.000 description 8
- 230000008878 coupling Effects 0.000 description 6
- 238000010168 coupling process Methods 0.000 description 6
- 238000005859 coupling reaction Methods 0.000 description 6
- 238000012549 training Methods 0.000 description 5
- 238000005516 engineering process Methods 0.000 description 4
- 230000004044 response Effects 0.000 description 4
- 238000007796 conventional method Methods 0.000 description 3
- 238000011156 evaluation Methods 0.000 description 3
- 230000036039 immunity Effects 0.000 description 3
- 210000005036 nerve Anatomy 0.000 description 3
- 230000003068 static effect Effects 0.000 description 3
- 230000002123 temporal effect Effects 0.000 description 3
- 230000007547 defect Effects 0.000 description 2
- 238000003780 insertion Methods 0.000 description 2
- 230000037431 insertion Effects 0.000 description 2
- 230000010354 integration Effects 0.000 description 2
- 238000012795 verification Methods 0.000 description 2
- 230000008901 benefit Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 230000003930 cognitive ability Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000011109 contamination Methods 0.000 description 1
- 230000006837 decompression Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000006866 deterioration Effects 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 210000002569 neuron Anatomy 0.000 description 1
- 238000003672 processing method Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000033764 rhythmic process Effects 0.000 description 1
- 230000036962 time dependent Effects 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 238000000844 transformation Methods 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Definitions
- Speech recognition device using neural network and learning method thereof
- the present invention relates to a speech recognition apparatus using a neural network and a learning method therefor.
- Instead of the conventional approach of giving the start and end of the input data and processing every possible start-end combination, the present invention configures the neuron-like element itself to hold the past history of the input data, so that time-series data such as speech can be processed with simple hardware and with high precision.
- The present invention also relates to a learning method for causing a neural network to perform such processing.
- Data recognition means that are in practical use for recognizing categories of time-series data by learning include the dynamic programming (DP) method, the hidden Markov model (HMM) method, and the multilayer perceptron (MLP) type neural network trained with the back-propagation learning method.
- A problem common to the DP method and the HMM method is that both the teacher data and the data to be recognized require a start point and an end point.
- When data belonging to a certain category is to be extracted from a pattern of length N, the start point may lie at any of roughly N positions and the end point at any of roughly N positions, so the number of possible start-end combinations is on the order of the square of N.
- Recognition processing must be performed for all of this large number of combinations, and that processing takes an enormous amount of time.
- The MLP method is basically a method for recognizing static data; to make it recognize time-series data, data covering a certain time range must be given as one input, and the time information must be handled within that input. This time range is fixed by the configuration of the MLP.
- The length of time-series data varies greatly between recognition units and even within the same unit. Taking phonemes in speech as an example, the average length of a long phoneme such as a vowel is much longer than that of a short phoneme such as a plosive, and even for the same phoneme the length in actual speech can fluctuate by roughly a factor of two. Therefore, if the input range is set to an average length, then when a short phoneme is recognized the input contains a great deal of data outside the recognition target, and when a long phoneme is recognized the input contains only part of the data to be recognized. Both of these reduce the recognition ability. Even if a different input length is set for each phoneme, the problem remains, because the length of each phoneme itself varies. The same holds for time-series information in general.
Disclosure of the Invention
- The conventional DP and HMM methods require the start and end of the data to be handled.
- The MLP method requires the start and end of the input range during learning.
- These cannot, in principle, be determined beforehand, and forcibly assuming start and end points reduces the recognition ability.
- Alternatively, processing all combinations of start and end points is required, which demands an enormous amount of processing.
- Each neuron-like element constituting the neural network includes internal state value storage means; internal state value updating means for updating the internal state value based on the internal state value recorded therein and the input value given to the neuron-like element; and output value generating means for converting the output of the internal state value storage means into an external output value.
- The internal state value updating means comprises weighted integrating means for weighting and summing the input values, and integrating means for integrating the value summed by the weighted integrating means; the output value generating means converts the value obtained by the integrating means into a value between a preset upper limit and a preset lower limit.
- the internal state value of the i-th neuron-like element constituting the neural network is Xi, τi is its time constant, and the weighted input values to the neuron-like element are integrated,
- the weighted input value Zj to the i-th neuron-like element includes a value obtained by applying a weight to the output of the i-th neuron-like element itself,
- the weighted input value Zj to the i-th neuron-like element includes values obtained by applying weights to the outputs of the other neuron-like elements constituting the neural network,
- the weighted input value Zj to the i-th neuron-like element includes data given from outside the neural network, and 7) in 1) to 6) above, the weighted input value Zj to the i-th neuron-like element includes a value obtained by applying a weight to a fixed value,
- the output value generation means has a symmetric output range.
- the neural network has at least two outputs, a positive output and a negative output
- the speech recognition device comprises a speech feature extraction unit that performs feature extraction on the input to be recognized and inputs the extracted feature values to the neural network;
- recognition result output means for converting the output values of the neural network into a recognition result; and internal state value initializing means for giving a preset initial value to the internal state value storage means of the neuron-like elements constituting the neural network.
- Background noise input means for inputting background noise to the neural network, and equilibrium state detection means that detects an equilibrium state from the output of the neural network and, based on the detection result, outputs a signal for changing the internal state value to the internal state initial value setting means, are further provided.
- the learning method of the speech recognition device using the neural network according to the present invention includes:
- The speech recognition apparatus of 10) or 11) above has a learning unit for training the neural network. The learning unit comprises input data storage means for storing learning input data; input data selection means for selecting learning input data from the input data storage means; output data storage means for storing learning output data; output data selection means for selecting the learning output data corresponding to the selected input data; and learning control means for inputting the selected learning input data to the feature extraction unit and controlling learning of the neural network.
- the input data storage means has a plurality of categories
- the output data storage means has categories corresponding to the respective categories of the input data storage means
- the input data selection means selects the multiple data items to be learned from the input data storage means.
- the output data selection means selects the learning output data corresponding to the learning input data selected by the input data selection means
- the learning control unit connects the plurality of data selected by the input data selection means into one.
- the learning unit has input data connection means for connecting the plural selected learning input data into one, and output data connection means for connecting the learning output data selected by the output data selection means into one; the learning unit inputs the connected learning input data to the speech feature extraction means and changes the connection weights of the neuron-like elements based on the output of the neural network and the output of the output data connection means.
- the learning unit has noise data storage means for storing noise data, and superimposes noise selected from the noise data storage means on the selected learning data.
- Through learning, the network self-organizes so that it can respond to phenomena on various time scales.
- FIG. 1 is a diagram showing a nerve cell-like element constituting the neural network of the present invention.
- FIG. 2 is a diagram in which the nerve cell-like element in FIG. 1 is represented by specific functions.
- FIG. 3 is an example in which the configuration of FIG. 2 is replaced with an electric circuit.
- FIG. 4 is a diagram showing a speech recognition apparatus using a neural network configured using the neural cell-like element of the present invention.
- FIG. 5 is a diagram of the neural network of FIG. 4 having three layers.
- FIG. 6 is a diagram in which the neural network of FIG. 5 is further multilayered.
- FIG. 7 is a diagram in which the network of FIG. 6 is divided at the transmission network.
- FIG. 8 is a diagram illustrating a neural network having an autoregressive loop.
- FIG. 9 is a diagram showing a random connection neural network.
- FIG. 10 is a diagram for explaining the noise resistance of the speech recognition device of the present invention.
- FIG. 11 is a diagram for explaining the effect of learning on the time scale of the speech recognition device of the present invention.
- FIG. 12 is a diagram showing a configuration of another voice recognition device using the nerve cell element of the present invention.
- FIG. 13 is a diagram illustrating an operation procedure of the speech recognition device in FIG. 12.
- FIG. 14 is a diagram showing a learning method of the speech recognition device using the neural network of the present invention.
- FIG. 15 is a diagram showing a learning procedure of the learning method of the present invention.
- FIG. 16 is a diagram showing connection of learning data according to the present invention.
- FIG. 17 is a diagram showing a configuration of the learning data of the present invention.
- FIG. 18 is another diagram showing a learning method of the speech recognition device using the neural network of the present invention.
- FIG. 19 is a diagram showing a speech word detection output by the speech recognition device of the present invention.
- FIG. 20 is a diagram showing another speech word detection output by the speech recognition device of the present invention.
- FIG. 21 is a diagram showing another configuration of the speech recognition device of the present invention.
- FIG. 22 is a diagram showing an operation procedure of the speech recognition device in FIG. 21.
- FIG. 23 is a diagram illustrating a learning method of the speech recognition device having the background noise superimposing means.
- FIG. 24 is a diagram showing how a noise component is added to the training data.
- FIG. 25 is a diagram showing a recognition result when an unknown word is given to the neural network trained by the learning method of the present invention.
- FIG. 26 is a diagram showing recognition results when the same processing as in FIG. 25 is performed for an unknown speaker.
- FIG. 27 is a diagram illustrating a recognition result obtained when the same processing as in FIG. 26 is performed with background noise.
- FIG. 28 is a diagram showing a conventional neuron-like element.
- FIG. 29 is a diagram in which the nerve cell-like element in FIG. 28 is replaced with a specific function.
- FIG. 30 is a diagram in which the configuration of FIG. 29 is replaced with an electric circuit.
- FIG. 1 schematically shows the function of a neural cell-like element (hereinafter referred to as a "node") constituting a neural network (hereinafter, NN) according to the present invention.
- reference numeral 104 denotes the entirety of one node
- 101 denotes the internal state value storage means
- 102 denotes internal state value updating means for updating the internal state value based on the internal state value stored in 101 and the input value given to the node
- and 103 denotes output value generating means for converting the internal state value into an external output.
- reference numeral 201 denotes data input means
- 202 denotes weighted integrating means for weighting and integrating the data input values obtained by 201
- 203 denotes integrating means for integrating the integrated data values
- 204 denotes output value limiting means for converting the value obtained as a result of integration into a value within a predetermined range; these are schematically shown.
- FIG. 3 is an example in which the configuration of FIG. 2 is replaced by an electronic circuit.
- reference numeral 301 denotes the data input means and the weighted integrating means of FIG. 2
- 302 denotes the integrating means
- 303 denotes the output value limiting means.
- FIG. 28 schematically shows the functions of the nodes constituting the NN by the conventional MLP method.
- reference numeral 2803 denotes an entire node
- 2801 denotes an internal state value calculating means for calculating an internal state value of the node
- 2802 denotes an output value generating means for converting the internal state value calculated by the 2801 to an external output.
- FIG. 29 specifically shows the function of the conventional node shown in FIG. 28.
- reference numeral 2901 denotes a data input means
- 2902 denotes weighted integrating means that weights and integrates the data input values obtained by 2901
- 2903 denotes an output value limiting means for converting the value of the integrated data into a value in a predetermined range.
- FIG. 30 shows an example in which the configuration of FIG. 29 is replaced by an electronic circuit.
- reference numeral 3001 designates the data input means and weighted integrating means of FIG. 29, and reference numeral 3002 designates the output value limiting means.
- the node of the present invention has an integrating means not provided in the conventional node.
- The node of the present invention converts the past history of the data input to the node into an integral value and holds it, and can be said to be dynamic in the sense that its output is determined by that value.
- An NN using such nodes can therefore process time-series data through the nodes themselves, regardless of the structure of the NN.
- In the NN of the present invention, since context information and the like are stored as integrated values inside each element, there is no need to give the NN a special structure. Therefore, the simplest input method, in which the data of each timing is simply input at that timing, is sufficient, and no special hardware or processing for handling time information is required.
- the internal state value of the node is X
- the output value is Y
- the current internal state value is Xcurr,
- the updated internal state value is Xnext,
- and during the update operation, let the input values to the node be Zi (where i runs from 0 to n, and n is the number of inputs to that node).
- Expressing the operation of the internal state value updating means formally by a function G, the updated internal state value Xnext is
- Xnext = G(Xcurr, Z0, ···, Zi, ···, Zn)  (1)
- The function in equation (1) can take various forms; for example, the following first-order differential equation (2) is also possible, where τi is a time constant: τi·dXi/dt = −Xi + Σj Zj  (2)
- Defining the input value Zj in more detail, it can be (1) the output of the node itself multiplied by a connection weight, (2) the output of another node multiplied by a connection weight, (3) a fixed value multiplied by a connection weight, which equivalently gives a bias to the internal state updating means, or (4) an external input given to the node from outside the NN. Let us therefore consider updating the internal state value of the i-th node with respect to such input values Zj.
- the internal state value is X i
- the output of any node is Y j
- the coupling strength of coupling the output of the j-th node to the input of the i-th node is W ij
- the bias value is θi,
- and, assuming that the external input value to the i-th node is Di, equation (2) can be written more specifically as: τi·dXi/dt = −Xi + Σj Wij·Yj + θi + Di  (3)
- When the operation of the output value generating means is formally expressed by a function F, the output value for the internal state of the node determined in this way at a given moment is Yi = F(Xi).
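As a concrete illustration of equations (1) to (3) and the output function F, the following sketch discretizes the differential equation with a simple Euler step. The step size dt, the choice of tanh as the symmetric bounded output function, and all variable names are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def update_internal_state(x, tau, w, y, theta, d, dt=1.0):
    """One Euler step of equation (3):
    tau_i * dX_i/dt = -X_i + sum_j W_ij * Y_j + theta_i + D_i
    x: internal state values X_i, tau: time constants tau_i,
    w: connection weights W_ij, y: node outputs Y_j,
    theta: bias values, d: external inputs D_i (all NumPy arrays)."""
    dx = (-x + w @ y + theta + d) / tau
    return x + dt * dx

def output_value(x):
    """Output value generating means F: maps the internal state into a
    bounded, symmetric range (here tanh, giving values in (-1, 1))."""
    return np.tanh(x)
```

Viewing the right-hand side of the update as a function of the current state and the weighted inputs Zj recovers the general form of equation (1).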
- FIG. 4 shows an example of a speech recognition apparatus using an NN composed of the nodes according to the present invention.
- reference numeral 401 denotes a voice feature extraction unit
- 402 denotes an NN constituted by the nodes of the present invention
- 403 denotes a recognition result output unit.
- The output extracted by the speech feature extraction means is input to two of the nodes. This NN is a fully connected NN in which each node is connected to all the other nodes. From the NN, two outputs are supplied to the recognition result output means.
- the output number can be set arbitrarily.
- a positive output and a negative output are provided, and the recognition result can be comprehensively determined from these outputs to improve the recognition accuracy.
- the number of inputs to and outputs from the NN is not limited to two as shown in FIG. 4, but may be any number.
- FIGS. 5 to 9 show other examples of the configuration of an NN composed of the nodes of the present invention.
- FIG. 5 shows an example in which only the configuration of the NN 402 in FIG. 4 is changed.
- The NN 402 is composed of an input layer 501, a hidden layer 502, and an output layer 503.
- This structure is apparently the same as that of the conventional MLP method.
- However, it is not a feedforward network in which, as in the prior art, the value of the input layer is determined first, then the value of the hidden layer that takes it as input, and so on, with the values of each layer up to the output layer determined sequentially.
- Because the node itself can hold internal state values, this NN recognizes time-series data without needing a context layer as in the conventional technology, and an effect equivalent to the conventional technology with a context layer can be obtained.
- Since the outputs of all layers are determined at the same time, more efficient parallel processing is possible than with the MLP method of the prior art.
- Figure 10a shows the correspondence between node input and output in the conventional simple MLP method.
- When a signal in which spike-like noise is superimposed on a square-wave input is given, an almost unchanged waveform appears in the output.
- This is because the node of the MLP method simply reflects its input in the output, and therefore passes the effect of the noise through as it is.
- In contrast, the node of the present invention records the time history as an internal state value, and the next internal state value and output value are determined as a function of the internal state value and the input. Therefore, even if spike-like noise similar to a) is superimposed on the input, the spike-like waveform is blunted as shown in FIG. 10 b), its effect is reduced, and good noise resistance is obtained.
- In a configuration where the history information of only some of the nodes constituting the NN is stored in external nodes of a special configuration, the noise resistance is lower than when the nodes of the present invention are used, in which all nodes hold their own history information as internal state values.
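A small numerical illustration of this blunting effect, comparing a memoryless MLP-style node with a single node updated by the Euler step of equation (2). The square-wave signal, the time constant, and the spike amplitude are arbitrary choices made for illustration.

```python
import numpy as np

# Square-wave input with one spike of noise superimposed (cf. FIG. 10 a).
t = np.arange(200)
signal = np.where((t // 50) % 2 == 0, 1.0, -1.0)
noisy = signal.copy()
noisy[75] += 3.0                      # spike-like noise

# Memoryless MLP-style node: the spike passes straight through to the output.
mlp_out = np.tanh(noisy)

# Node with an internal state: leaky integration of equation (2) blunts the spike.
tau, dt, x = 10.0, 1.0, 0.0
dyn_out = []
for z in noisy:
    x = x + dt * (-x + z) / tau       # tau * dX/dt = -X + Z
    dyn_out.append(np.tanh(x))

# The spike is strongly attenuated in the integrated output.
print(mlp_out[75], dyn_out[75])
```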
- The following example, shown in FIG. 6, is an hourglass-type network in which the NN configuration of FIG. 5 is further multi-layered.
- 601 indicates a feature extraction (or information compression) network
- 602 indicates a transmission network
- 603 indicates a recognition (or information expansion) network.
- the configuration of NN in FIG. 6 is apparently similar to the conventional MLP method. However, the operation is completely different as described above.
- Without impairing the effects of the present invention, the network can thus be divided into a feature extraction (or information compression) NN that incorporates time-series effects and a recognition (or information expansion) NN that incorporates time-series effects, as shown in FIG. 7.
- If the dashed line in FIG. 7 represents a spatial distance, the figure shows, for example, an audio compression transmission device; if it represents a temporal distance, it shows, for example, an audio compression recording device.
- the object to be compressed here is not limited to voice, but may be more general information.
- recognition processing is information compression processing in a broad sense.
- The configuration of FIG. 7 does not impair the effects of the present invention described above, for example the noise immunity described with reference to FIG. 10.
- the NN in Fig. 8 can handle phenomena in a wider temporal range by having an autoregressive loop.
- If the connection strength of the autoregressive loop included in the input value Z is W, having this autoregressive loop is approximately equivalent to replacing the time constant τ of the system by τ/(1 − W).
- FIG. 11 conceptually shows this effect. Assume a continuous square-wave input as shown in FIG. 11 a) is given; if the response time constant of the system is larger than the period of this square wave, the response to one input overlaps the response to the next, as in the output of a), and a correct recognition result cannot be obtained.
- In the present invention the time constant of the system is optimized by learning, so its response can be modified, for example, as shown in FIG. 11 b), and a good recognition rate can be obtained.
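A short derivation of the time-constant replacement mentioned above, written as a sketch under the assumption that the node output is approximately linear (Y ≈ X) near the operating point and that all non-loop inputs are collected into a single term Z_other; neither the linearization nor the symbol Z_other appears in the patent.

```latex
\begin{align*}
\tau \frac{dX}{dt} &= -X + W X + Z_{\mathrm{other}}
  && \text{(equation (2) with a self-loop of weight } W\text{)} \\
\tau \frac{dX}{dt} &= -(1 - W)\,X + Z_{\mathrm{other}} \\
\frac{\tau}{1 - W}\,\frac{dX}{dt} &= -X + \frac{Z_{\mathrm{other}}}{1 - W}
\end{align*}
```

The system therefore behaves as if its time constant were τ' = τ/(1 − W); as the autoregressive weight W approaches 1, the effective time constant grows, which is how the NN of FIG. 8 can cover a wider temporal range.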
- Fig. 9 shows an example in which the NN in Fig. 8 is a random combination NN.
- the random combination NN 902 is composed of two sub-networks, an input network 904 and an output network 905.
- the input network is a fully-coupled sub-network
- the output network is a randomly coupled sub-network
- the two sub-networks are unidirectionally connected.
- FIG. 12 is a diagram obtained by adding internal state initial value setting means 124 to the speech recognition apparatus of FIG. 4; the rest is the same as in FIG. 4. As shown in equation (2), the operation of the NN of the present invention is described by a first-order differential equation, so an initial value is needed to determine its operation.
- The internal state initial value setting means gives a predetermined initial value to all nodes so that the NN can operate. The operating procedure of this speech recognition device will be described based on FIG. 13.
- the output value Y is calculated based on the updated X value.
- the procedure is as follows.
- the recognition result is given to the recognition result output means as the output of the node assigned to the output.
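A minimal sketch of the frame-by-frame recognition procedure just described, assuming a fully connected NN, the Euler update of equation (3), and a tanh output function; the feature layout, the assignment of the last nodes as output nodes, and all names are illustrative assumptions.

```python
import numpy as np

def recognize(features, w, tau, theta, x_init, dt=1.0, n_out=2):
    """Run the NN over a time series of feature vectors.
    features: (T, n_in) frames from the speech feature extraction means.
    w: (n, n) connection weights; tau, theta, x_init: per-node parameters.
    The first n_in nodes receive the external input D; the outputs of the
    last n_out nodes are passed to the recognition result output means."""
    n = w.shape[0]
    x = x_init.copy()                  # internal state initial value setting means
    outputs = []
    for frame in features:             # data is simply given frame by frame
        d = np.zeros(n)
        d[:frame.shape[0]] = frame     # external input D_i to the input nodes
        x = x + dt * (-x + w @ np.tanh(x) + theta + d) / tau   # equation (3)
        outputs.append(np.tanh(x)[-n_out:])                    # e.g. positive/negative output
    return np.array(outputs)
```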
- The above describes the basic operating principle and configuration of a speech recognition device based on an NN using the nodes of the present invention. For such an NN to perform the desired processing, the NN must be trained; the learning method of the NN is described next.
- FIG. 14 is a configuration diagram showing a learning method of the speech recognition device of the present invention.
- reference numeral 1410 denotes a learning unit for learning NN1402.
- 1411 is input data storage means storing predetermined learning input data,
- 1413 is output data storage means storing model output data corresponding to each learning input data,
- 1412 is input data selection means for selecting the input data to be learned from the input data storage means,
- 1414 is output data selection means for selecting output data,
- and 1415 denotes learning control means for controlling the learning of the NN.
- The learning method of the speech recognition device by this learning unit will be described with reference to FIGS. 14 and 15.
- a preset initial state value X is set to all nodes.
- learning input data to be learned is selected by input data selecting means.
- the selected input data is sent to the learning control means.
- learning output data corresponding to the selected learning input data is selected by the output data selection means.
- the selected output data is also sent to the learning control means.
- the selected learning input data is input to the speech feature extraction means 1401, and the feature vector extracted there is input to the NN as an external input.
- the sum of the inputs Z is calculated for each node, and the internal state value X is updated according to equation (2).
- the output Y is obtained from the updated X.
- Before learning, the output value Y produced by the NN is an arbitrary value.
- Let T be the learning output data corresponding to the selected learning input data,
- and Y the output value actually obtained for that learning input data; the connection weights are then corrected so as to reduce the error between T and Y.
- It is clear that this learning rule is applicable not only to the fully connected neural network illustrated here, but also to more general randomly connected neural networks, which include layered connections and the like as special cases.
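The patent describes weight correction from the error between the learning output T and the NN output Y but does not spell out the update formula in this passage. The sketch below therefore uses a plain numerical gradient of the squared error over the unrolled network purely as an illustrative stand-in for the learning rule; the learning rate, perturbation size, and all names are assumptions.

```python
import numpy as np

def forward(features, w, tau, theta, x_init, dt=1.0, n_out=2):
    """Unroll the NN over the input frames and return the output-node values."""
    x = x_init.copy()
    ys = []
    for frame in features:
        d = np.zeros(w.shape[0]); d[:frame.shape[0]] = frame
        x = x + dt * (-x + w @ np.tanh(x) + theta + d) / tau   # equation (3)
        ys.append(np.tanh(x)[-n_out:])
    return np.array(ys)

def train_step(features, targets, w, tau, theta, x_init, lr=0.01, eps=1e-4):
    """One weight-correction step: reduce the squared error between the
    learning output data T (targets, shape (T, n_out)) and the NN output Y,
    using a finite-difference gradient over the connection weights."""
    def loss(wm):
        y = forward(features, wm, tau, theta, x_init)
        return np.sum((targets - y) ** 2)
    base = loss(w)
    grad = np.zeros_like(w)
    for i in range(w.shape[0]):
        for j in range(w.shape[1]):
            w_pert = w.copy(); w_pert[i, j] += eps
            grad[i, j] = (loss(w_pert) - base) / eps
    return w - lr * grad
```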
- NN has two outputs, a positive output and a negative output.
- a method is used in which two speech data items are given in succession, as shown in FIG. 17 a) to d), so that both the rise and the fall of the output are learned.
- negative data and positive data are input in succession to learn the rise of the positive output and the rise and fall of the negative output.
- positive data and negative data are successively input to learn the rise and fall of the positive output and the rise of the negative output.
- In FIG. 17 c), two sets of negative data are input in succession so that the learning of FIG. 17 a) does not give the NN the false impression that positive data always follows negative data.
- In FIG. 17 d), two sets of positive data are input in succession so that the learning of FIG. 17 b) does not give the NN the false impression that negative data always follows positive data.
- FIG. 18 is a configuration diagram of a speech recognition device for making the NN learn these two-item continuous inputs.
- The input data storage means described with reference to FIG. 14 is here composed of two categories, positive data and negative data.
- 1801 is the positive data storage means, a group of data of the word to be recognized collected under various conditions; 1802 is the negative data storage means, storing, as an example of another category, data of words other than the word to be recognized;
- and 1803 and 1804 are output data storage means for storing learning output data for each category.
- 1805 is input data selection means
- 1806 is output data selection means
- 1807 is input data connection means
- 1808 is output data connection means
- 1809 is learning control means
- and 1810 is the NN.
- The input data selection means selects two learning input data items from the positive data storage means and the negative data storage means; the combinations are as described with reference to FIG. 17.
- the two selected input data become one continuous data by the input data connection means.
- the continuous data is feature-extracted by the speech feature extraction means and input to NN.
- Within the NN, output values are calculated in chronological order according to the processing of FIG. 13.
- the output of NN is sent to the learning control means, the error with the learning output data selected in advance is calculated, and the weight of the coupling of each node is corrected, so that NN repeats learning.
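A minimal sketch of the input/output data connection step, assuming each learning item is stored as an array of feature frames together with a matching per-frame target sequence. The four pairing patterns follow FIG. 17 (negative→positive, positive→negative, negative→negative, positive→positive); all function and variable names are illustrative.

```python
import numpy as np

def connect_pair(first, second):
    """Input/output data connection means: join two learning items into one
    continuous sequence. Each item is (frames, targets), with
    frames: (T, n_feat) input data and targets: (T, n_out) learning output data."""
    frames = np.concatenate([first[0], second[0]], axis=0)
    targets = np.concatenate([first[1], second[1]], axis=0)
    return frames, targets

def make_training_pairs(positive_items, negative_items, rng):
    """Select one item from each category and build the four two-word chains
    of FIG. 17 so that both the rise and the fall of each output are learned."""
    pos = positive_items[rng.integers(len(positive_items))]
    neg = negative_items[rng.integers(len(negative_items))]
    return [connect_pair(neg, pos),   # a) negative then positive
            connect_pair(pos, neg),   # b) positive then negative
            connect_pair(neg, neg),   # c) negative then negative
            connect_pair(pos, pos)]   # d) positive then positive
```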
- the output of the NN is a positive output node and a negative output node
- the solid lines in 1803 and 1804 are the learning output of the positive output node corresponding to the positive data
- the broken lines are the learning output of the negative output node corresponding to the negative data. The following shows an example in which a speech recognition device composed of an NN of nodes having these features is trained by the learning method described above, and its recognition results are examined.
- A 20th-order LPC cepstrum was used as the output of the speech feature extraction means, and the NN was configured with a total of 32 nodes: 20 for input, 2 for output, and 10 others.
- Two types of NN output were used: a positive output corresponding to the positive data described above and a negative output corresponding to the negative data. The four output patterns described in FIG. 17 were used as the learning outputs.
- For the curved portions of the learning output, a sigmoid function of equation (5) was used, with its origin at the temporal midpoint of the data, the start of the data mapped to −10 and the end to +10, deformed to lie in the range 0.1 to 0.9, or the inversion thereof.
- The speakers for learning were MAU and FSU from the ATR Interpreting Telephony Research Laboratories Japanese speech database for research.
- For each single frame of input (in this case the 20th-order LPC cepstrum), one set of positive and negative output values was obtained; there is therefore no need to input data from a plurality of frames at once as in the related art.
- By learning with the above method, the NN of the speech recognition device according to the present invention can generate the desired output after several hundred to several thousand learning iterations.
- the output for learning can be uniquely determined without any trial and error.
- Fig. 25 shows the results of verifying the ability of NNs that have been trained in this way, including data containing unknown words that were not used in the learning.
- the total number of word types was 216 words, of which nine were used for learning. From these 216 words, various combinations of 2-word chain data were created and used for verification. In the verification, the total number of words appearing is 1290 words per speaker.
- The judgment of the recognition result is based on the combination of the positive and negative outputs: if the positive output is 0.75 or more and the negative output is 0.25 or less, the word is detected; if the positive output is 0.25 or less and the negative output is 0.75 or more, it is not detected; otherwise the result is counted as confused.
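The detection rule quoted above can be written directly as a small decision function; the thresholds 0.75 and 0.25 are taken from the text, while the function name, labels, and the way the peak outputs are obtained are illustrative assumptions.

```python
def judge(positive_output, negative_output):
    """Combine the positive and negative outputs into a recognition judgment."""
    if positive_output >= 0.75 and negative_output <= 0.25:
        return "detected"
    if positive_output <= 0.25 and negative_output >= 0.75:
        return "not detected"
    return "confused"
```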
- FIG. 26 shows the same experiment as in FIG. 25 performed on nine unknown speakers other than the speakers used for learning.
- A very good recognition rate can be obtained by learning only a small amount of data.
- FIG. 19 shows an example of detecting the word to be recognized from three or more consecutive words. In the figure, the solid line indicates the positive output and the broken line the negative output. As can be seen, the word "toriaezu" is recognized without giving the start and end as in the conventional example.
- FIG. 20 shows an example in which the recognition target unit "toriaezu" is detected from among unknown words.
- the solid line indicates a positive output and the dashed line indicates a negative output.
- The total length of the data given in FIG. 19 is 1049, so if recognition were done in the conventional way by giving a start and an end, combinations on the order of the square of 1049 would have to be examined.
- Since the data only needs to be input once, frame by frame, there is no need to store data over every range that could form a start-to-end interval as in the conventional case; only a small amount of data memory is needed, and the amount of calculation is greatly reduced.
- Since the output does not monotonically increase or decrease as in the conventional DP and HMM methods but shows a peak where needed, there is no need to normalize the output value by the length of the input data. In other words, the output always stays within a fixed range (between −1 and 1 in this example), and a given value carries the same weight everywhere in the recognition interval. This keeps the dynamic range of the values to be processed small, so integer data can give sufficient performance without using floating-point or logarithmic data during processing.
- Since recognition is based on a comprehensive judgment of the two outputs, positive and negative, even if, for example, the positive output rises at "purchase" in FIG. 20, the negative output does not decrease,
- so no erroneous recognition occurs and the accuracy of the speech recognition process is improved.
- the number of outputs is not limited to two, and any number may be provided as needed.
- the accuracy of the recognition result can be further improved.
- Alternatively, more than one NN may be used, and the NN that gives the optimal result can be selected.
- The recognition target unit is not limited to a word as illustrated, but may also be a syllable or a phoneme. In that case, the entire speech of a language can be covered with a relatively small number of NNs, which enables, for example, a dictation system.
- the recognition unit may be an abstract one that does not consider the correspondence with the above-mentioned languages. Use of such a recognition unit is particularly effective when the recognition device is used for information compression.
- FIG. 21 shows another embodiment of the present invention, in which background noise input means and equilibrium state detection means are added to the speech recognition apparatus shown in FIG. 12; the rest is the same as in FIG. 12.
- FIG. 22 shows the flow of processing for determining the internal state initial value in the configuration of FIG.
- The part of the figure related to generating background noise data may be omitted, and means for setting an appropriate initial value directly, means for generating a steady input, or means corresponding to the absence of input may be provided instead.
- FIG. 27 shows the results of training this device by the learning method of FIG. 18 and performing recognition, summarizing results corresponding to Tables 1 and 2 of Example 1. Here, the internal state values of the NN after it had reached equilibrium under about 3 seconds of background noise input were stored as initial values, and those values were used as the initial values of the differential equation (2) during recognition processing.
- the missing word error is improved as compared with the result of the first embodiment.
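A minimal sketch of this initial-value determination with background noise, assuming the Euler update used earlier and a simple change-based equilibrium test; the roughly 3-second figure comes from the text, while the tolerance, the sampling of noise frames, and all names are assumptions.

```python
import numpy as np

def equilibrium_initial_state(noise_frames, w, tau, theta, dt=1.0, tol=1e-4):
    """Background noise input means + equilibrium state detection means:
    feed background-noise feature frames into the NN until the internal state
    values stop changing, then return them as the initial state for recognition."""
    n = w.shape[0]
    x = np.zeros(n)
    for frame in noise_frames:         # e.g. about 3 seconds of background noise
        d = np.zeros(n); d[:frame.shape[0]] = frame
        x_new = x + dt * (-x + w @ np.tanh(x) + theta + d) / tau
        if np.max(np.abs(x_new - x)) < tol:   # equilibrium state detected
            return x_new
        x = x_new
    return x                            # fall back to the last state reached
```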
- FIG. 23 shows an example in which noise data storage means and noise data superimposing means are added to the learning unit of FIG. 14.
- the basic learning method is as described in Figure 14.
- a feature of the embodiment is that data in which a noise component is superimposed in advance is used as learning data.
- The weights between the NN units are adjusted by the learning control means so that recognition is performed as if the noise component contained in the learning data had been removed; in other words, the NN is trained so that the noise component contained in the training data can be clearly identified as noise.
- As a result, the NN can correctly recognize speech data on which non-stationary noise is superimposed, treating the noise portion as noise.
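A minimal sketch of the noise superimposition step, assuming both the learning speech and the stored noise are waveforms at the same sampling rate, that the noise recording is at least as long as the speech, and that mixing is a simple scaled addition; the scaling factor and all names are illustrative assumptions.

```python
import numpy as np

def superimpose_noise(speech, noise, noise_level=0.1, rng=None):
    """Noise data superimposing means: add a randomly chosen segment of the
    stored noise onto the selected learning speech. The learning output data
    stays the one for the clean speech, so the NN is trained to respond as if
    the noise component were absent."""
    if rng is None:
        rng = np.random.default_rng()
    start = rng.integers(0, len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]
    return speech + noise_level * segment
```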
- the speech recognition device and the learning method of the present invention are very effective not only for continuous speech recognition but also for isolated speech recognition.
- the present invention is not limited to speech recognition but is also effective in processing time-series information widely, and can process any type of time-series information as long as input data can correspond to output data.
- Possible uses include information compression, decompression, and waveform equalization.
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
- Character Discrimination (AREA)
Description
Claims
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP51729193A JP3521429B2 (ja) | 1992-03-30 | 1993-03-26 | ニューラルネットワークを用いた音声認識装置およびその学習方法 |
KR1019930703580A KR100292919B1 (ko) | 1992-03-30 | 1993-03-26 | 뉴럴 네트워크를 이용한 음성인식장치 및 그 학습방법 |
DE69327997T DE69327997T2 (de) | 1992-03-30 | 1993-03-26 | Gerät zur spracherkennung mit neuronalem netzwerk und lernverfahren dafür |
EP93906832A EP0586714B1 (en) | 1992-03-30 | 1993-03-26 | Speech recognition apparatus using neural network, and learning method therefor |
HK98115085A HK1013879A1 (en) | 1992-03-30 | 1998-12-23 | Speech recognition apparatus using neural network and learning method therefor |
Applications Claiming Priority (12)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP7381892 | 1992-03-30 | ||
JP4/73818 | 1992-03-30 | ||
JP8714692 | 1992-04-08 | ||
JP4/87146 | 1992-04-08 | ||
JP8878692 | 1992-04-09 | ||
JP4/88786 | 1992-04-09 | ||
JP4/159422 | 1992-06-18 | ||
JP15944192 | 1992-06-18 | ||
JP4/159441 | 1992-06-18 | ||
JP15942292 | 1992-06-18 | ||
JP16107592 | 1992-06-19 | ||
JP4/161075 | 1992-06-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1993020552A1 true WO1993020552A1 (en) | 1993-10-14 |
Family
ID=27551274
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP1993/000373 WO1993020552A1 (en) | 1992-03-30 | 1993-03-26 | Speech recognition apparatus using neural network, and learning method therefor |
Country Status (6)
Country | Link |
---|---|
EP (1) | EP0586714B1 (ja) |
JP (2) | JP3521429B2 (ja) |
KR (1) | KR100292919B1 (ja) |
DE (1) | DE69327997T2 (ja) |
HK (1) | HK1013879A1 (ja) |
WO (1) | WO1993020552A1 (ja) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011146147A1 (en) * | 2010-05-19 | 2011-11-24 | The Regents Of The University Of California | Neural processing unit |
US9082078B2 (en) | 2012-07-27 | 2015-07-14 | The Intellisis Corporation | Neural processing engine and architecture using the same |
US9185057B2 (en) | 2012-12-05 | 2015-11-10 | The Intellisis Corporation | Smart memory |
US9552327B2 (en) | 2015-01-29 | 2017-01-24 | Knuedge Incorporated | Memory controller for a network on a chip device |
CN108269569A (zh) * | 2017-01-04 | 2018-07-10 | 三星电子株式会社 | 语音识别方法和设备 |
US10027583B2 (en) | 2016-03-22 | 2018-07-17 | Knuedge Incorporated | Chained packet sequences in a network on a chip architecture |
US10061531B2 (en) | 2015-01-29 | 2018-08-28 | Knuedge Incorporated | Uniform system wide addressing for a computing system |
US10346049B2 (en) | 2016-04-29 | 2019-07-09 | Friday Harbor Llc | Distributed contiguous reads in a network on a chip architecture |
JP2021006889A (ja) * | 2019-06-27 | 2021-01-21 | バイドゥ オンライン ネットワーク テクノロジー (ベイジン) カンパニー リミテッド | ウェイクアップモデルの最適化方法、装置、デバイス及び記憶媒体 |
NL2029215A (en) * | 2021-09-21 | 2021-11-01 | Univ Dalian Tech | Speech keyword recognition method based on gated channel transformation sandglass residual neural network |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5715372A (en) * | 1995-01-10 | 1998-02-03 | Lucent Technologies Inc. | Method and apparatus for characterizing an input signal |
TW347503B (en) * | 1995-11-15 | 1998-12-11 | Hitachi Ltd | Character recognition translation system and voice recognition translation system |
KR100772373B1 (ko) | 2005-02-07 | 2007-11-01 | 삼성전자주식회사 | 복수개의 데이터 처리 장치를 이용한 데이터 처리 장치 및그 방법과, 이를 구현하기 위한 프로그램이 기록된 기록매체 |
KR102494139B1 (ko) * | 2015-11-06 | 2023-01-31 | 삼성전자주식회사 | 뉴럴 네트워크 학습 장치 및 방법과, 음성 인식 장치 및 방법 |
KR101991041B1 (ko) | 2018-12-31 | 2019-06-19 | 서울대학교산학협력단 | 아날로그 이진인공신경망 회로에서 활성도 조절을 통한 공정변이 보상방법 및 그 시스템 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0272398A (ja) * | 1988-09-07 | 1990-03-12 | Hitachi Ltd | 音声信号用前処理装置 |
JPH0281160A (ja) * | 1988-09-17 | 1990-03-22 | Sony Corp | 信号処理装置 |
JPH04295894A (ja) * | 1991-03-26 | 1992-10-20 | Sanyo Electric Co Ltd | 神経回路網モデルによる音声認識方法 |
JPH04295897A (ja) * | 1991-03-26 | 1992-10-20 | Sanyo Electric Co Ltd | 神経回路網モデルによる音声認識方法 |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2580826B2 (ja) * | 1990-03-14 | 1997-02-12 | 日本電気株式会社 | フィードバック神経細胞モデル |
-
1993
- 1993-03-26 WO PCT/JP1993/000373 patent/WO1993020552A1/ja active IP Right Grant
- 1993-03-26 KR KR1019930703580A patent/KR100292919B1/ko not_active IP Right Cessation
- 1993-03-26 DE DE69327997T patent/DE69327997T2/de not_active Expired - Lifetime
- 1993-03-26 EP EP93906832A patent/EP0586714B1/en not_active Expired - Lifetime
- 1993-03-26 JP JP51729193A patent/JP3521429B2/ja not_active Expired - Lifetime
-
1998
- 1998-12-23 HK HK98115085A patent/HK1013879A1/xx unknown
-
2000
- 2000-03-27 JP JP2000085618A patent/JP2000298663A/ja not_active Withdrawn
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0272398A (ja) * | 1988-09-07 | 1990-03-12 | Hitachi Ltd | 音声信号用前処理装置 |
JPH0281160A (ja) * | 1988-09-17 | 1990-03-22 | Sony Corp | 信号処理装置 |
JPH04295894A (ja) * | 1991-03-26 | 1992-10-20 | Sanyo Electric Co Ltd | 神経回路網モデルによる音声認識方法 |
JPH04295897A (ja) * | 1991-03-26 | 1992-10-20 | Sanyo Electric Co Ltd | 神経回路網モデルによる音声認識方法 |
Non-Patent Citations (6)
Title |
---|
See also references of EP0586714A4 * |
Technical Research Report by IEICE, NC91-10, (08.05.91), YOJI FUKUDA and another, "Phoneme Recognition Using Recurrent Neural Network", p. 71-78. * |
Technical Research Report by IEICE, SP92-125, (19.01.93), MITSUHIRO Inazumi and another, "Voice Recognition of Continuous Figures by Recurrent Neural Network", p. 17-24. * |
Technical Research Report by IEICE, SP92-25, (30.06.92), MITSUHIRO INAZUMI and another, "Voice Recognition of Continuous Words by Recurrent Neural Network", p. 9-16. * |
Technical Research Report by IEICE, SP92-80, (21.10.92), KENICHI FUNABASHI, "On Recurrent Neural Network", p. 51-58. * |
Theses by IEICE, Vol. J74D-II, No. 12, (25.12.91), TATSUMI WATANABE and two others, "Examination of Recurrent Neural Network on Every Learning Rule and Shape of Learning Curve", p. 1776-1787. *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8655815B2 (en) | 2010-05-19 | 2014-02-18 | The Regents Of The University Of California | Neural processing unit |
WO2011146147A1 (en) * | 2010-05-19 | 2011-11-24 | The Regents Of The University Of California | Neural processing unit |
US9558444B2 (en) | 2010-05-19 | 2017-01-31 | The Regents Of The University Of California | Neural processing unit |
US10083394B1 (en) | 2012-07-27 | 2018-09-25 | The Regents Of The University Of California | Neural processing engine and architecture using the same |
US9082078B2 (en) | 2012-07-27 | 2015-07-14 | The Intellisis Corporation | Neural processing engine and architecture using the same |
US9185057B2 (en) | 2012-12-05 | 2015-11-10 | The Intellisis Corporation | Smart memory |
US10445015B2 (en) | 2015-01-29 | 2019-10-15 | Friday Harbor Llc | Uniform system wide addressing for a computing system |
US10061531B2 (en) | 2015-01-29 | 2018-08-28 | Knuedge Incorporated | Uniform system wide addressing for a computing system |
US9858242B2 (en) | 2015-01-29 | 2018-01-02 | Knuedge Incorporated | Memory controller for a network on a chip device |
US9552327B2 (en) | 2015-01-29 | 2017-01-24 | Knuedge Incorporated | Memory controller for a network on a chip device |
US10027583B2 (en) | 2016-03-22 | 2018-07-17 | Knuedge Incorporated | Chained packet sequences in a network on a chip architecture |
US10346049B2 (en) | 2016-04-29 | 2019-07-09 | Friday Harbor Llc | Distributed contiguous reads in a network on a chip architecture |
CN108269569A (zh) * | 2017-01-04 | 2018-07-10 | 三星电子株式会社 | 语音识别方法和设备 |
CN108269569B (zh) * | 2017-01-04 | 2023-10-27 | 三星电子株式会社 | 语音识别方法和设备 |
JP2021006889A (ja) * | 2019-06-27 | 2021-01-21 | バイドゥ オンライン ネットワーク テクノロジー (ベイジン) カンパニー リミテッド | ウェイクアップモデルの最適化方法、装置、デバイス及び記憶媒体 |
US11189287B2 (en) | 2019-06-27 | 2021-11-30 | Baidu Online Network Technology (Beijing) Co., Ltd. | Optimization method, apparatus, device for wake-up model, and storage medium |
NL2029215A (en) * | 2021-09-21 | 2021-11-01 | Univ Dalian Tech | Speech keyword recognition method based on gated channel transformation sandglass residual neural network |
Also Published As
Publication number | Publication date |
---|---|
JP2000298663A (ja) | 2000-10-24 |
KR100292919B1 (ko) | 2001-06-15 |
JP3521429B2 (ja) | 2004-04-19 |
EP0586714B1 (en) | 2000-03-08 |
DE69327997D1 (de) | 2000-04-13 |
EP0586714A1 (en) | 1994-03-16 |
HK1013879A1 (en) | 1999-09-10 |
DE69327997T2 (de) | 2000-07-27 |
EP0586714A4 (en) | 1995-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP3168779B2 (ja) | 音声認識装置及び方法 | |
KR102494139B1 (ko) | 뉴럴 네트워크 학습 장치 및 방법과, 음성 인식 장치 및 방법 | |
US5212730A (en) | Voice recognition of proper names using text-derived recognition models | |
WO1993020552A1 (en) | Speech recognition apparatus using neural network, and learning method therefor | |
JP7070894B2 (ja) | 時系列情報の学習システム、方法およびニューラルネットワークモデル | |
US8838446B2 (en) | Method and apparatus of transforming speech feature vectors using an auto-associative neural network | |
GB2572020A (en) | A speech processing system and a method of processing a speech signal | |
El Choubassi et al. | Arabic speech recognition using recurrent neural networks | |
EP0574951A2 (en) | Speech recognition system | |
KR102406512B1 (ko) | 음성인식 방법 및 그 장치 | |
WO2016167779A1 (en) | Speech recognition device and rescoring device | |
US20050071161A1 (en) | Speech recognition method having relatively higher availability and correctiveness | |
US10741184B2 (en) | Arithmetic operation apparatus, arithmetic operation method, and computer program product | |
WO2023078370A1 (zh) | 对话情绪分析方法、装置和计算机可读存储介质 | |
US5809461A (en) | Speech recognition apparatus using neural network and learning method therefor | |
US5181256A (en) | Pattern recognition device using a neural network | |
KR100306848B1 (ko) | 신경회로망을 이용한 선택적 주의집중 방법 | |
US20230070000A1 (en) | Speech recognition method and apparatus, device, storage medium, and program product | |
US6151592A (en) | Recognition apparatus using neural network, and learning method therefor | |
CN113223504B (zh) | 声学模型的训练方法、装置、设备和存储介质 | |
JPH064097A (ja) | 話者認識方法 | |
JP3467556B2 (ja) | 音声認識装置 | |
KR102159988B1 (ko) | 음성 몽타주 생성 방법 및 시스템 | |
JP2000352994A (ja) | 神経細胞素子、ニューラルネットワークを用いた認識装置およびその学習方法 | |
JPH06119476A (ja) | 時系列データ処理装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): JP KR US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1019930703580 Country of ref document: KR |
|
ENP | Entry into the national phase |
Ref document number: 1993 150170 Country of ref document: US Date of ref document: 19931129 Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1993906832 Country of ref document: EP |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWP | Wipo information: published in national office |
Ref document number: 1993906832 Country of ref document: EP |
|
WWG | Wipo information: grant in national office |
Ref document number: 1993906832 Country of ref document: EP |