US5787393A - Speech recognition apparatus using neural network, and learning method therefor - Google Patents
Speech recognition apparatus using neural network, and learning method therefor
- Publication number
- US5787393A (application US08/485,134)
- Authority
- US
- United States
- Prior art keywords
- data
- neural network
- input
- output
- learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Definitions
- the present invention relates to a speech recognition apparatus using a neural network, and to a learning method therefor. Unlike the prior art, the present invention neither requires the start and end edges of input data when time series data such as a speech data sequence is to be processed, nor processes all the possible combinations of start and end edges. Rather, the present invention makes it possible to process time series data such as speech data precisely, using simplified hardware which comprises neuron elements capable of holding the past history of input data in themselves.
- the present invention also relates to a learning method for a neural network to do such a process.
- DP: Dynamic Programming
- HMM: Hidden Markov Model
- MLP: Multi-Layered Perceptron (with the Back Propagation learning rule)
- a problem common to the DP and HMM methods is that they require the start and end edges to be given in both the teacher data and the input data to be recognized.
- One technique for processing data apparently without depending on its start and end edges is to find the start and end edges providing the best result in a trial-and-error manner. When data parts belonging to a category are to be detected in an input data of length N, there are N possible start edges and N possible end edges; that is, on the order of N² combinations of start and end edge patterns are possible. Such a technique must therefore recognize and process this great number of combinations, which consumes huge processing time.
- Beyond the quantitative problem of the huge number of combinations, the aforementioned technique has a more essential problem arising from the assumption of start and end edges of the input data. More particularly, the start and end edges of the input data are self-evident if the input data contains only a single datum belonging to a category. However, they cannot easily and clearly be bounded if the input data includes successive data parts belonging to more than one category. In particular, time series data such as speech data does not have definite boundaries at the start and end edges, with data parts belonging to two adjacent categories connected to each other through an overlapping transition region. Accordingly, the assumption of start and end data edges raises a very large problem in accuracy.
- the MLP method, on the other hand, does not require such an assumption. Instead, the MLP method raises another problem with respect to the start and end edges of the input data in that the range of the input data must be specified.
- the MLP method is basically for recognizing static data.
- the MLP method can recognize time series data only when an input data within a fixed length of time is used, with the time information treated equivalently to spatial information. The length of time must be fixed due to the composition of the MLP method.
- the length of time series data, however, varies greatly from one category to another, and also within the same category.
- For example, the average length of vowels, which are long phonemes, is ten or more times longer than that of plosives, which are short phonemes.
- Even within the same phoneme, the length can fluctuate by a factor of two or more in actual speech.
- Even if the input range of data is set to the average length, the input data of a short phoneme to be recognized will include much data other than the data to be recognized, and the input data of a long phoneme will include only a part of the data to be recognized.
- Such effects reduce the recognition ability.
- Even if the input length is appropriately set for each phoneme, the problem is not solved, since the length of each phoneme itself varies. Such problems are found in time series information generally.
- As described, the DP and HMM methods require the start and end edges of the data to be handled, and the MLP method requires the start and end edges of the input range during learning.
- the start and end edges of time series information cannot definitely be bounded due to the nature of the information. Even if the start and end edges are forcibly assumed, the speech recognition ability is reduced. To apparently relieve this, all the combinations of start and end edges must be processed, resulting in a huge amount of processing.
- the present invention provides a speech recognition apparatus using a neural network comprising:
- Neuron elements, each of the neuron elements comprising internal state value storage means, internal state value updating means for updating the internal state value in said internal state value storage means on the basis of an internal state value stored in said internal state value storage means and an input value to said neuron element, and output value generating means for converting the output of said internal state value storage means into an external output value.
- the internal state value updating means may be formed as weighted accumulation means for performing the weighted accumulation of said input values and said internal state values.
- the internal state value storage means may be formed as integration means for integrating the values accumulated by said weighted accumulation means.
- the output value generating means may be formed as output value limiting means for converting a value obtained by said integrating means into a value between upper and lower preset limits.
- the internal state value updating means may be formed to update the internal state value into a value which satisfies the following formula: τ_i dX_i/dt = -X_i + Σ_{j=0}^{n} Z_j, where X_i is the internal state value of the i-th neuron element in said neural network, τ_i is a time constant, and Z_j (j ranges from 0 to n; n is 0 or a natural number) is a weighted input value to said neuron element.
- the weighted input value Z j to the i-th neuron element may include the weighted output of the i-th neuron element itself.
- the weighted input value Z j to the i-th neuron element may also include the weighted output of any other neuron element in said neural network.
- the weighted input value Z j to the i-th neuron element may also include any data provided from the outside of said neural network.
- the weighted input value Z j to the i-th neuron element may also include a weighted and fixed value.
- the output value generating means may be formed to have an output of symmetrical range in positive and negative directions.
- the neural network may be formed to have at least two outputs: positive and negative outputs.
- the speech recognition apparatus may comprise speech feature extracting means for extracting the feature of an input to be recognized and for providing the extracted value to said neural network, recognition result output means for converting the output value of said neural network into a recognition result, and internal state value initializing means for providing a preset initial value to the internal state value storage means of each neuron element comprised by said neural network.
- the speech recognition apparatus as defined in the item 10) characterized in comprising: background noise input means for inputting the background noise to said neural network, and stable state detecting means for detecting the stable state from the output of said neural network and for outputting a signal to change the preset initial internal state value to an initial internal state value setting means on the basis of said stable state detection.
- the speech recognition apparatus as defined in the item 10) or 11) including a learning section for causing said neural network to learn.
- the learning section comprises input data storage means for storing input learning data, input data selection means for selecting an input learning data from said input data storage means, output data storage means for storing output learning data, output data selection means for selecting an output learning data depending on the selected input data and chains of data including the selected data, and learning control means for inputting the selected input learning data to said feature extracting means and for controlling the learning in said neural network.
- the learning control means is formed to respond to the outputs of said neural network and output data selection means to change the weightings at the connections of said neuron elements.
- said input data storage means has a plurality of categories
- said output data storage means has a category corresponding to each of the categories in said input data storage means
- said input data selection means selects a plurality of data to be learned from the categories of said input data storage means
- said output data selection means selects an output learning data corresponding to the input learning data selected by said input data selection means.
- Said learning control means has input data connecting means for connecting the plurality of data selected by said input data selection means into a single input data, and output data connecting means for connecting the output learning data selected by said output data selection means into a single output data.
- the learning control means may be formed to input said connected input learning data to the speech feature extracting means and to change the weightings at the connections of said neuron elements on the basis of the outputs of said neural network and output data connecting means.
- the number of said categories can be two.
- the learning section comprises noise data storage means for storing noise data, and noise overlaying means for overlaying said selected learning data with the noise selected from said noise data storage means, the input data overlaid with the noise by said noise overlaying means being used to cause said neural network to learn.
- the learning may be repeated while shifting said background noise to different overlaying positions.
- the learning may be performed by first causing the neural network to learn an input data not overlaid with the background noise, and thereafter to learn the same data overlaid with the background noise.
- the speech recognition apparatus using a neural network, and the learning method therefor, have the following advantages:
- the processing speed may greatly be increased because data input is required only once, although the prior art required a processing time proportional to the square of length N of the speech input.
- a memory used to store input data may greatly be reduced in capacity.
- the speech recognition apparatus can self-organizingly treat phenomena with various time scales by causing the apparatus to learn.
- NN: neural network
- FIG. 1 is a view of a neuron element used to form a neural network according to the present invention.
- FIG. 2 is a view showing the neuron element of FIG. 1 in actual functional blocks.
- FIG. 3 is a view obtained by replacing blocks in the arrangement of FIG. 2 by electric circuits.
- FIG. 4 is a view of a speech recognition apparatus which uses a neural network constructed by neuron elements according to the present invention.
- FIG. 5 is a view showing the neural network of FIG. 4 formed into a three-layered schematic structure.
- FIG. 6 is a view showing the neural network of FIG. 5 formed with an increased number of layers.
- FIG. 7 is a view showing the division of the transmission network shown in FIG. 6.
- FIG. 8 is a view of a neural network having an autoregressive loop.
- FIG. 9 is a view of a random connecting type neural network.
- FIGS. 10a and 10b are views illustrating the noise-resistant property of the speech recognition apparatus of the present invention.
- FIGS. 11a and 11b are views illustrating the learning effect of the speech recognition apparatus of the present invention with respect to time scale.
- FIG. 12 is a view of another speech recognition apparatus using the neuron elements of the present invention.
- FIG. 13 is a view illustrating the operational procedure of the speech recognition apparatus shown in FIG. 12.
- FIG. 14 is a view illustrating a method of the present invention for causing the speech recognition apparatus using the neural network of the present invention to learn.
- FIG. 15 is a view illustrating the operational procedure in the learning method of the present invention.
- FIGS. 16a and 16b are views showing the connecting of learning data in the present invention.
- FIGS. 17a-17d are views showing forms of learning data in the present invention.
- FIG. 18 is a view showing another learning method of the present invention for the speech recognition apparatus using the neural network of the present invention.
- FIG. 19 is a view showing the speech word detection output of the speech recognition apparatus of the present invention.
- FIG. 20 is a view showing another speech word detection output of the speech recognition apparatus of the present invention.
- FIG. 21 is a view showing another arrangement of a speech recognition apparatus constructed in accordance with the present invention.
- FIG. 22 is a view illustrating the operational procedure of the speech recognition apparatus shown in FIG. 21.
- FIG. 23 is a view illustrating a method of causing a speech recognition apparatus having background noise overlaying means to learn.
- FIGS. 24a-24c are views illustrating the manner in which learning data is overlaid with noise components.
- FIG. 25 is a table showing results obtained when unknown words are provided to the neural network that has learned according to the learning method of the present invention.
- FIG. 26 is a table showing results obtained when the same processing as in FIG. 25 is carried out to an unknown speaker.
- FIG. 27 is a table showing results obtained when the same processing as in FIG. 26 is carried out with background noise.
- FIG. 28 is a view showing a neuron element of the prior art.
- FIG. 29 is a view showing the neuron element of FIG. 28 in actual functional blocks.
- FIG. 30 is a view obtained by replacing blocks in the arrangement of FIG. 29 by an electric circuit.
- In FIG. 1, a numeral 104 designates the node generally, 101 designates an internal state value storage means, 102 designates an internal state value updating means for updating the internal state responsive to the internal state value stored in 101 and an input value to the node, and 103 designates an output value generating means for converting the internal state value into an external output.
- FIG. 2 shows the detailed function of the node shown in FIG. 1.
- reference numeral 201 designates data input means
- 202 designates weighted accumulation means for weighting and accumulating data input values from the data input means 201
- 203 designates integration means for integrating the accumulated data values
- 204 designates output value limiting means for converting a value obtained by the integration into a value within a preset range, respectively.
- FIG. 3 shows an example of an electronic circuit in the arrangement of FIG. 2.
- reference numeral 301 designates both the input means and weighted accumulation means in FIG. 2;
- 302 designates the integration means;
- 303 designates the output value limiting means.
- FIG. 28 diagrammatically shows the function of a node used to form a NN (neural network) using the prior art MLP method.
- 2803 designates the node generally
- 2801 designates an internal state value computing means for computing the internal state value
- 2802 designates output value generating means for converting the internal state value computed by 2801 into an external output.
- FIG. 29 similarly shows the functional arrangement of the prior art node shown in FIG. 28.
- reference numeral 2901 designates a data input means
- 2902 designates weighted accumulation means for weighting and accumulating data input values from 2901
- 2903 designates output value limiting means for converting the accumulated data values into a value within a preset range.
- FIG. 30 shows an example of an electronic circuit for the arrangement of FIG. 29.
- reference numeral 3001 designates both the data input means and weighted accumulation means of FIG. 29; and 3002 designates the output value limiting means.
- the node of the present invention includes the integration means which can not be found in the node of the prior art.
- the node of the prior art is static in that its output depends only on the input at that time.
- the node of the present invention can be said to be dynamic in that the past history of data inputted into that node is converted into and held as an integrated value on which the output of the node depends.
- the neural network using the static nodes of the prior art must build the temporal structure of the data into the neural network structure itself if time series data is to be processed.
- the neural network using the dynamic nodes of the present invention can process time series data in the node itself without depending on the neural network structure.
- the processing of time series data by the neural network of the prior art requires any suitable manner of developing the temporal information into spatial information, such as a method of connecting data inputted at a plurality of timings into a single input data.
- it is required to provide a hardware and a process for storing and controlling the connected data.
- it may be required to provide a special context element for storing the aforementioned information depending on time. Any suitable hardware and a process for controlling this context is further required.
- the neural network of the present invention does not require any special structure because the context information and the like are stored as integrated values in the interior of each of the elements. Therefore, data can be fed in the simplest possible manner, in which the data of each timing is inputted at that timing.
- the present invention does not require any specific hardware or a process for processing the temporal information.
- it is assumed that the internal state value of the node is X and the output value thereof is Y. It is also assumed that, as the values X and Y change with time, the current internal state value is X_curr; the updated internal state value is X_next; and an input value to the node during the updating operation is Z_i (i ranges from 0 to n, where n is the number of inputs to that node).
- the operation of the internal state value updating means is formally represented by a function G.
- the updated internal state value X_next can then be represented by the formula (1): X_next = G(X_curr, Z_0, Z_1, ..., Z_n).
- the concrete form of this formula (1) can take any one of various forms; one possibility is the following formula (2) using a first-order differential equation: τ_i dX_i/dt = -X_i + Σ_{j=0}^{n} Z_j, where τ_i is a time constant.
- the input value Z j will be defined in more detail.
- the input values are considered to include: (1) the output of the node itself multiplied by a connecting weight; (2) the output of another node multiplied by a connecting weight; (3) a fixed output value multiplied by a connecting weight, which equivalently provides a bias to the internal state value updating means; and (4) an external input provided to the node from the outside of the neural network.
- the updating of the internal state value in the i-th node relative to such input values Z_j is considered.
- the formula (2) can then be rewritten into the more specific formula (3): τ_i dX_i/dt = -X_i + Σ_j W_ij Y_j + θ_i + D_i, where the internal state value is X_i; the output of the j-th node is Y_j; the connecting weight for connecting the output of the j-th node to the input of the i-th node is W_ij; the bias value is θ_i; and the external input value to the i-th node is D_i.
- the output Y of the node can be represented by the formula (4): Y_i = F(X_i).
- A specific form of the function F may be a sigmoid (logistic) function with a positive-negative symmetric output range, for example the formula (5): F(X) = (1 - e^{-X}) / (1 + e^{-X}), which maps X into the range (-1, 1).
- such a functional form is not essential, however; a simpler linear transform, a threshold function, etc. may also be used.
- the time series of the output Y of the neural network constructed in accordance with the present invention can be calculated.
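As a concrete illustration of how that time series could be computed, the following is a minimal sketch and not part of the patent: a small, fully connected network of such nodes is simulated with a forward-Euler discretization of formula (3); the network size, time constants, weights and input signal are all illustrative assumptions.

```python
import numpy as np

def symmetric_sigmoid(x):
    # Formula (5): a logistic function with a symmetric output range (-1, 1).
    return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))

def update_state(X, W, theta, D, tau, dt=1.0):
    """One forward-Euler step of formula (3):
    tau_i * dX_i/dt = -X_i + sum_j W_ij * Y_j + theta_i + D_i."""
    Y = symmetric_sigmoid(X)
    dX = (-X + W @ Y + theta + D) / tau
    return X + dt * dX

# Illustrative fully connected network of 5 nodes (sizes and values are assumptions).
rng = np.random.default_rng(0)
n = 5
W = rng.normal(scale=0.5, size=(n, n))   # connecting weights W_ij
theta = np.zeros(n)                      # bias values
tau = np.full(n, 5.0)                    # time constants tau_i
X = np.zeros(n)                          # initial internal state values

outputs = []
for t in range(100):
    D = np.zeros(n)
    D[0] = np.sin(0.1 * t)                # an external input fed to node 0 at each timing
    X = update_state(X, W, theta, D, tau)
    outputs.append(symmetric_sigmoid(X))  # the time series of the output Y
```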
- FIG. 4 shows one embodiment of a speech recognition apparatus using a neural network which is constructed by such nodes according to the present invention.
- 401 designates speech feature extracting means
- 402 designates a neural network constructed by nodes according to the present invention
- 403 designates recognition result output means.
- Outputs extracted by the speech feature extracting means are inputted into two nodes.
- the neural network is of an entire connecting type in which any one node is connected with all the other nodes.
- the neural network provides two outputs to the recognition result output means.
- the neural network of the present invention may have any number of outputs. If a word is to be recognized, therefore, two outputs, positive and negative, can be provided, and the recognition results of these outputs can be judged collectively to increase the recognition accuracy.
- the number of inputs and outputs of the neural network need not be limited to two each as in FIG. 4, but may be set to any number.
- FIGS. 5-9 show a variety of other neural network forms constructed by the nodes of the present invention.
- FIG. 5 shows a form in which only the neural network 402 shown in FIG. 4 is modified.
- a neural network 402 includes an input layer 501, a hidden layer 502 and an output layer 503.
- Such an arrangement is apparently the same as in the MLP method of the prior art.
- however, the neural network constructed with the present nodes differs from the prior art feed-forward type network, in which the value of the input layer is determined first, the value of the hidden layer is then determined using the input layer value as an input, and the values of the successive layers up to the output layer are determined in turn.
- the neural network using the nodes of the present invention can recognize the time series data to provide the same result as in the prior art, without need of such a context layer as in the prior art.
- the neural network of the present invention can also perform the parallel processing more efficiently than the MLP method of the prior art because the outputs of all the layers are simultaneously determined.
- FIG. 10(a) shows the correspondence between the input and output in the node according to the simple MLP method of the prior art.
- the node of the present invention stores the temporal history as an internal state value.
- the next internal state value and output value are determined as a function of the current internal state value and input. Even if the input is overlaid with such a spiked noise as in FIG. 10(a), the spiked waveform is dulled with reduction of its effect, as shown in FIG. 10(b). As a result, the present invention can provide an improved noise resistance.
- the noise resistance can be somewhat accomplished even by the prior art having the context layer.
- the prior art must be provided with separate nodes of a special structure, among the nodes used to form the neural network, for holding the past history information. Therefore, the noise resistance of the prior art is inferior to that of the present invention, in which every node holds its own past history information as an internal state value.
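A rough numerical illustration of the effect of FIGS. 10a and 10b, again an assumed single-node example rather than anything specified in the patent: a spiked input fed through the integrating node is dulled, while a static node (unit input weight assumed) reproduces the spike at full strength.

```python
import numpy as np

def symmetric_sigmoid(x):
    return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))

tau, dt = 10.0, 1.0
x = 0.0                           # internal state of one dynamic node
spiked_input = np.zeros(50)
spiked_input[10] = 5.0            # spiked noise at t = 10, as in FIG. 10(a)

dynamic_out, static_out = [], []
for d in spiked_input:
    x += dt * (-x + d) / tau                  # formula (2) with a single input Z = d
    dynamic_out.append(symmetric_sigmoid(x))  # the spike is integrated and dulled
    static_out.append(symmetric_sigmoid(d))   # prior-art static node: output depends on the input alone

print(max(static_out), max(dynamic_out))      # the dynamic node's peak is much smaller
```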
- FIG. 6 shows a multi-layer neural network obtained by increasing the number of layers in such a neural network as in FIG. 5 to form the neural network into a sandglass configuration.
- 601 designates a feature extracting (or information compressing) network
- 602 designates a transmission network
- 603 designates a recognizing (or information expanding) network.
- the neural network of FIG. 6 is also apparently similar to the MLP method of the prior art. However, its operation is entirely different from that of the prior art, as described.
- the functions of the feature extracting (or information compressing) NN (neural network) and recognizing (or information expanding) networks taking in the time series effect can be formed into modules to provide a speech recognition apparatus without losing advantages of the present invention.
- the transmission network 602 of FIG. 6 can be divided into an information transmitting function 702 and an information receiving function 703, as shown in FIG. 7.
- a wavy line between the functions 702 and 703 represents that these functions may be separated from each other in space and/or time. If the wavy line represents spatial distance such as a transmission line, the arrangement represents a speech compressing and transmitting device. If the wavy line represents a length of time, it represents a speech compressing and recording device. Of course, the object to be compressed here is not limited to speech, but may be more general information. It is needless to say that the recognizing process is a process of information compression in a broader sense.
- FIG. 7 has the same advantages as described hereinbefore.
- the noise resistance described with respect to FIG. 10 can also protect the neural network from mis-transmission and noise in the transmission line, or from defects or degradation of a recording medium.
- FIG. 8 shows a simplified modification of the neural network shown in FIG. 4.
- the neural network has an autoregressive loop, which allows it to handle events within a widened range of time. More particularly, the presence of the autoregressive loop approximately corresponds to replacing the time constant τ of the system according to the following formula:
- W is the connecting weight of the autoregressive loop portion in an input value Z.
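The formula itself is not reproduced in this text. Assuming a linear approximation of the node output (Y roughly proportional to X near the operating point) in formula (3), the replacement can be reconstructed as follows; this reconstruction is an assumption, not a quotation from the patent.

```latex
\tau \frac{dX}{dt} = -X + WX + \dots
  \;\Longrightarrow\;
  \tau \frac{dX}{dt} \approx -(1 - W)\,X + \dots
  \;\Longrightarrow\;
  \tau_{\mathrm{eff}} \approx \frac{\tau}{\,1 - W\,} \quad (0 \le W < 1),
```

so that a larger positive autoregressive weight W lengthens the effective time scale of the node.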
- the connecting weight W can be modified by a learning process, which will be described later, to optimize the time scale in the response of the system for learning data.
- the method of the prior art using the context layer cannot self-organizingly optimize the time scale by learning. Thus, the network must manually be set for time scale.
- FIGS. 11a and 11b show the concept of such an advantage in the present invention. It is now assumed that square waveforms such as those shown in FIG. 11(a) are continuously inputted into the system. If the time constant of the system response is larger than the input cycle of the square waveforms, the responses will be sequentially overlaid one on another, as shown in FIG. 11(a). This does not provide any proper recognition result.
- the time constant in the system having such an autoregressive loop as shown in FIG. 8 can be optimized by learning. Therefore, the response of this system can be modified as shown in FIG. 11(b), which provides an improved recognition result.
- the random connecting type neural network 902 shown in FIG. 9 comprises two sub-networks: an input network 904 and an output network 905.
- the input network is an entire connecting type sub-network while the output network is a random connecting type sub-network. These sub-networks are connected with each other only in one direction.
- Such an arrangement provides the following advantages in addition to the aforementioned advantages.
- functions such as the supplementation of input defects or the improvement of the noise resistance can be achieved.
- the one-direction connection can heuristically treat the flow of information to optimize various functions such as information compression, and information expansion.
- FIG. 12 shows the same arrangement as that of FIG. 4 except that the speech recognition apparatus additionally comprises initial internal state value setting means 1204.
- the operation of the neural network according to the present invention can be described by the first order differential equation.
- an initial value is required.
- the initial internal state value setting means provides preset initial values to all the nodes prior to actuation of the neural network.
- the operational procedure of the speech recognition apparatus will be described with reference to FIG. 13.
- the initial internal state value setting means sets a suitably selected initial internal state value X at all the nodes and sets an output Y corresponding to it.
- the sum of input values Z is determined in each of all the nodes.
- the input values Z were described.
- Speech feature value extracted by the speech feature extracting means constitutes a part of input values Z as an external input value.
- the internal state value X is updated on the basis of the sum of input values Z determined in step 3 and on the basis of the internal state value X itself.
- the output value Y is calculated from the updated value X.
- the recognition result is provided to the recognition result output means as an output from a node assigned for it.
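The per-frame procedure of FIG. 13 can be sketched as follows. This is a hypothetical outline under the same assumptions as the earlier sketch (forward-Euler step, symmetric sigmoid), with the feature dimensionality, weight matrix and output-node assignment chosen arbitrarily.

```python
import numpy as np

def symmetric_sigmoid(x):
    return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))

def recognize(features, W, theta, tau, x_init, output_nodes, dt=1.0):
    """features: array of shape (frames, n_in) from the speech feature extracting means.
    W, theta, tau: network parameters; x_init: preset initial internal state values.
    output_nodes: indices of the nodes assigned to the recognition result.
    Assumes n_in <= number of nodes, the first n_in nodes receiving the external inputs."""
    n = W.shape[0]
    X = x_init.copy()                       # step 1: set the initial internal state values
    results = []
    for frame in features:                  # step 2: each frame is inputted once, at its own timing
        D = np.zeros(n)
        D[:frame.shape[0]] = frame          # the extracted feature values act as external inputs
        Z = W @ symmetric_sigmoid(X) + theta + D   # step 3: sum of input values Z at every node
        X = X + dt * (-X + Z) / tau         # step 4: update X from Z and X itself (formula (2))
        Y = symmetric_sigmoid(X)            # step 5: calculate the output values Y
        results.append(Y[output_nodes])     # step 6: outputs of the assigned nodes give the result
    return np.array(results)
```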
- FIG. 14 is a block diagram illustrating a learning process for the speech recognition apparatus of the present invention.
- numeral 1410 designates a learning section for causing a neural network 1402 to learn
- 1411 designates input data storage means for storing given input learning data
- 1413 designates output data storage means for storing output data which are the models corresponding to each input learning data
- 1412 designates input data selection means for selecting input data to be learned from the input data storage means
- 1414 designates output data selection means for selecting output data in the same manner
- 1415 designates a learning control means for controlling the learning of the neural network.
- preset initial internal state values X are set at all the nodes.
- input learning data to be learned is selected by the input data selection means.
- the selected input data is fed to the learning control means.
- output learning data corresponding to the selected input learning data is selected by the output data selection means.
- the selected output data is similarly fed to the learning control means.
- the selected input learning data is received by the speech feature extracting means 1401, in which a feature vector is extracted and inputted to the neural network as an external input.
- the sum of inputs Z is determined and the internal state value X is updated according to the formula (2).
- an output Y is determined from the updated internal state value X.
- before learning, the connecting weights between the units in the neural network are random.
- the output value Y from the neural network is therefore also random.
- a learning evaluation value C is determined by the following formula: ##EQU5## where E is an error evaluation value.
- the time series of the learning evaluation values C are calculated along the procedure shown in FIG. 15, following the formula (7).
- the error evaluation value E can be written, using Kullback-Leibler distance as an error evaluation function, as follows:
- T is the output learning data corresponding to the selected input learning data
- Y is an output value corresponding to the input learning data.
- the formula (8) can be replaced by the following formula (9), which is substantially the same as the formula (8):
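Formulas (8) and (9) themselves are not reproduced in this text. A standard Kullback-Leibler style error between the teacher value T and the output Y, written under the additional assumption that both have been mapped into the range (0, 1), would take the form below; the patent's exact expressions may differ in scaling and in how the symmetric output range is handled.

```latex
E \;=\; \sum_{t} \sum_{k} \left[
    T_{k}(t)\,\ln\frac{T_{k}(t)}{Y_{k}(t)}
    \;+\; \bigl(1 - T_{k}(t)\bigr)\,\ln\frac{1 - T_{k}(t)}{1 - Y_{k}(t)}
\right]
```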
- Such a learning rule may apparently be applied not only to the entire connecting type neural network exemplified, but also to any random connecting type neural network which includes specific examples such as layered connection and the like and which can be used more generally in the art.
- Another method of causing the speech recognition apparatus to learn by continuously inputting two input data for learning will be described using the neural network with two outputs, positive and negative outputs, for example.
- the positive output cannot be lowered to low level once it has been shifted to high level.
- the negative output cannot be raised to high level once it has been shifted to low level.
- learning with input data one at a time either performs a learning in which, when input data to be recognized (hereinafter called "positive data") is provided, the positive output is raised to high level while the negative output remains at low level, as shown in FIG. 16(a), or performs another learning in which, when input data not to be recognized (hereinafter called "negative data") is provided, the negative output is raised to high level while the positive output remains at low level, as shown in FIG. 16(b).
- the present embodiment uses a learning method for both raising and lowering the output by continuously providing two speech data, as shown in FIGS. 17(a)-(d).
- in FIG. 17(a), negative and positive data are continuously inputted in this order to cause the neural network to learn the raising of the positive output and the raising and lowering of the negative output.
- in FIG. 17(b), positive and negative data are continuously inputted in this order to cause the neural network to learn the raising and lowering of the positive output and the raising of the negative output.
- in FIG. 17(c), two negative data are continuously inputted so that the neural network will not come to the wrong conclusion, through the learning of FIG. 17(a), that a positive data always follows a negative data.
- in FIG. 17(d), similarly, two positive data are continuously inputted so that the neural network will not come to the wrong conclusion, through the learning of FIG. 17(b), that a negative data always follows a positive data.
- a learning process using only a single input data is started only from one specific initial value.
- such a learning process is therefore guaranteed to show the expected ability only for that initial value.
- the neural network must instead be caused to learn to provide correct responses for a variety of initial values. Not all possible events, however, need to be considered as initial values.
- the number of possible initial value combinations for an object to be recognized is limited due to various restrictions.
- the use of a chain of two or more data in the learning process approximately covers such possible combinations of initial values. For this purpose, continuous data consisting of only two single data can already provide a satisfactory result. It is acceptable, of course, to use continuous data consisting of three or more single data.
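A data-preparation sketch for the four chainings of FIGS. 17(a)-(d) is given below. The segment lengths, the random feature sequences and the target levels of +1/-1 for high/low are illustrative assumptions; only the four combinations themselves come from the patent.

```python
import numpy as np

def chain(a, b):
    """Connect two single input data (feature sequences) into one continuous data."""
    return np.concatenate([a, b], axis=0)

def make_targets(kinds, seg_len):
    """kinds: 'pos'/'neg' label per concatenated segment.
    Returns targets of shape (frames, 2): column 0 drives the positive output node,
    column 1 drives the negative output node (high = +1, low = -1 assumed)."""
    pos = np.concatenate([np.full(seg_len, 1.0 if k == 'pos' else -1.0) for k in kinds])
    neg = np.concatenate([np.full(seg_len, 1.0 if k == 'neg' else -1.0) for k in kinds])
    return np.stack([pos, neg], axis=1)

# Hypothetical feature sequences (frames x feature order); real data would come
# from the positive/negative data storage means.
seg_len, order = 40, 20
positive_word = np.random.randn(seg_len, order)   # a word to be recognized
negative_word = np.random.randn(seg_len, order)   # any other word

training_pairs = [
    (chain(negative_word, positive_word), make_targets(['neg', 'pos'], seg_len)),  # FIG. 17(a)
    (chain(positive_word, negative_word), make_targets(['pos', 'neg'], seg_len)),  # FIG. 17(b)
    (chain(negative_word, negative_word), make_targets(['neg', 'neg'], seg_len)),  # FIG. 17(c)
    (chain(positive_word, positive_word), make_targets(['pos', 'pos'], seg_len)),  # FIG. 17(d)
]
```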
- FIG. 18 shows a speech recognition apparatus which can cause the neural network to learn continuous input data consisting of two single data.
- the input data storage means described in connection with FIG. 14 comprises means for storing data of two categories: positive and negative data.
- 1801 designates positive data storage means for storing positive data which is a group of words to be recognized collected under various conditions
- 1802 designates negative data storage means for storing negative data which is a group of words other than the words to be recognized
- 1803 and 1804 designate output data storage means for storing output learning data belonging to the respective categories. It is assumed herein that each of the categories includes three data.
- Reference numeral 1805 designates input data selection means
- 1806 designates output data selection means
- 1807 designates input data connecting means
- 1808 designates output data connecting means
- 1809 designates learning control means
- 1810 designates a neural network, respectively.
- the input data selection means selects two input learning data from the positive data storage means 1801 and negative data storage means 1802. Combinations of these data are as shown in FIGS. 17a-17d.
- the two selected input data are combined into a single continuous data by the input data connecting means.
- the continuous data is feature-extracted by the speech feature extracting means and then inputted into the neural network.
- the neural network then calculates the output value in time series according to the procedure of FIG. 13.
- the output of the neural network is fed to the learning control means wherein it is compared with a preselected output learning data to calculate an error, by which the connecting weight at each node will be modified. In such a manner, the neural network will repeatedly be caused to learn.
- the output of the neural network includes two nodes: positive and negative output nodes.
- Solid lines in the output data storage means 1803 and 1804 represent the learning output of the positive output node corresponding to the positive data, while broken lines represent the learning output of the negative output node corresponding to the negative data.
- the recognition results of the speech recognition apparatus, which comprises the neural network made of nodes having the features described above and which has been caused to learn according to the learning method described with reference to FIG. 18, are shown below. With 20th-order LPC cepstrum coefficients as the output of the speech feature extracting means, the neural network was actually constructed to include 32 nodes in total: 20 input nodes, 2 output nodes and 10 other nodes.
- the learning will first be described.
- the learning was carried out under such a condition that a word to be recognized (positive data) was “TORIAEZU” (FIRST OF ALL) and the other eight reference words (negative data) were “SHUUTEN” (TERMINAL), “UDEMAE” (SKILL), “KYOZETSU” (REJECTION), “CHOUETSU” (TRANSCENDENCE), “BUNRUI” (CLASSIFICATION), “ROKKAA” (LOCKER), "SANMYAKU” (MOUNTAIN RANGE) and "KAKURE PYURITAN” (HIDDEN PURITAN).
- the neural network was assumed to have two outputs, that is, a positive output corresponding to the positive data and a negative output corresponding to the negative data.
- the correspondence between the input and output was set such that when input data for one frame (in this case, the 20th-order LPC cepstrum) was inputted, a set of positive and negative outputs was obtained. It is therefore not required to input data for a plurality of frames as in the prior art.
- a "BP model with feedback connections" type neural network which is a modification of the prior art MLP method raised a problem in that it is difficult to converge the learning and also in that the learning outputs must be prepared in the trial-and-error manner.
- the neural network of the present invention can generate the desired outputs by causing it to learn several hundreds to several thousands times according to the speech learning method of the present invnention.
- the learning outputs can readily be determined as an only possible output without a trial-and-error aspect at all.
- FIG. 25 shows test results when data containing unknown words not used in the learning are given to the neural network after the above learning has been carried out.
- Words of 216 kinds were available, of which 9 kinds were used for learning.
- Tests were carried out using two-word chain data which were prepared by combining the 216 kinds of words into a variety of combinations. In the tests, the total number of appearing words was equal to 1290 for one speaker.
- the recognition result judgements were based on the combinations of positive output and negative output. If the positive output is equal to or more than 0.75 and the negative output is equal to or less than 0.25, it is judged that the detection is made. If the positive output is equal to or less than 0.25 and the negative output is equal to or more than 0.75, it is judged that the detection is not made.
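The judgement rule quoted above translates directly into a small helper function; the function name and the handling of the intermediate region are assumptions.

```python
def judge(positive_output, negative_output):
    """Recognition result judgement based on the combination of the two outputs."""
    if positive_output >= 0.75 and negative_output <= 0.25:
        return "detected"        # the word to be recognized is judged to be present
    if positive_output <= 0.25 and negative_output >= 0.75:
        return "not detected"    # the word is judged to be absent
    return "undecided"           # neither condition holds (handling of this case is assumed)
```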
- FIG. 26 shows results in the same tests as in FIG. 25 that were carried out for nine unknown speakers other than the speakers used for the learning.
- the speech recognizing method of the present invention can provide a greatly improved recognition rate even when only a small number of data are learned by the speech recognition apparatus.
- FIG. 19 shows the detection of words to be recognized from three or more successive words.
- a solid line shows positive outputs, while a broken line shows negative outputs.
- the speech recognition apparatus recognizes the word "TORIAEZU" (FIRST OF ALL) without being supplied with start and end edges as in the prior art.
- FIG. 20 shows the recognition of the word to be recognized, "TORIAEZU", among the unknown words. As in FIG. 19, a solid line shows positive outputs while a broken line shows negative outputs. It is thus found that the recognition method of the present invention has a sufficient generalizing ability.
- the prior art, which must perform the recognition with the start and end edges of the data, is required to check combinations on the order of the square of 1049.
- the present invention requires each of the 1049 data to be inputted only once.
- the process can therefore be carried out in less than one several-hundredth of the time required by the conventional process.
- the present invention does not require the storage of data within the ranges of possible start and end edges as in the prior art. As a result, both the amount of data memory and the amount of calculation can be reduced.
- the output value is not required to be normalized for the length of the input data. More particularly, the outputs always stay within a fixed range (in this case, between -1 and 1) and the weight of an output is invariable within a recognition section. This means that the dynamic range of the values to be processed is narrow, and that the speech recognition apparatus can achieve sufficient performance using integer data rather than floating-point or logarithmic data for processing.
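As a small illustration of the integer-processing remark, and purely as an assumed representation not specified in the patent, outputs confined to (-1, 1) can be held in 16-bit fixed point, for example Q15 format:

```python
def to_q15(y):
    """Map an output value in (-1, 1) onto a signed 16-bit Q15 integer."""
    q = int(round(y * 32768))
    return max(-32768, min(32767, q))

def from_q15(q):
    return q / 32768.0

assert abs(from_q15(to_q15(0.73)) - 0.73) < 1e-4
```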
- the recognition does not fail, since the negative output is not lowered even if the positive output begins to rise at the word "KOUNYU" (PURCHASE) in FIG. 20.
- the speech recognition can be improved in accuracy.
- the number of outputs is not limited to 2, but can be increased if necessary. For example, if an output is added which represents the degree of resemblance between the presently inputted data and the data used in the learning, the accuracy of the recognition result can be improved. If a plurality of such outputs are used, the neural network which provides the optimum results can be chosen.
- the present invention can recognize syllables or phonemes, rather than words as exemplified.
- a relatively small number of neural networks need to be used to recognize the entire language speech.
- the unit of recognition can be abstract ones which are not related to languages. This is particularly effective when the speech recognition apparatus is used to compress information.
- FIG. 21 shows another embodiment of the present invention which is different from the speech recognition apparatus of FIG. 12 in that background noise input means 2105 and stable state detection means 2106 are added to it.
- the other parts are similar to those of FIG. 12.
- FIG. 22 shows a flowchart of the process through which the initial internal state value is determined in the arrangement of FIG. 21.
- the step of preparing background noise data may comprise suitable initial value setting means and suitable constant input preparing means, or the step can be omitted, which corresponds to providing no input.
- FIG. 27 shows results of recognition obtained by causing the speech recognition apparatus to learn according to the learning method of FIG. 18, and corresponds to the result tables (FIGS. 25 and 26) of the first embodiment combined. The results are obtained by saving, as initial values, the internal state values of the neural network which became stable when background noise was inputted for about 3 seconds. On recognition, these values are used as initial values in the differential equation (2).
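The procedure of determining the initial values from background noise (FIG. 22) can be sketched as follows, reusing the assumptions of the earlier sketches; the stability test and its threshold are illustrative and not taken from the patent.

```python
import numpy as np

def symmetric_sigmoid(x):
    return (1.0 - np.exp(-x)) / (1.0 + np.exp(-x))

def settle_on_noise(noise_features, W, theta, tau, dt=1.0, tol=1e-4):
    """Feed background-noise feature frames (roughly 3 seconds' worth) into the network
    and return the internal state values once they stop changing appreciably,
    i.e. the stable state to be saved as the initial values."""
    n = W.shape[0]
    X = np.zeros(n)
    for frame in noise_features:
        D = np.zeros(n)
        D[:frame.shape[0]] = frame
        X_new = X + dt * (-X + W @ symmetric_sigmoid(X) + theta + D) / tau
        if np.max(np.abs(X_new - X)) < tol:   # stable state detected
            return X_new
        X = X_new
    return X                                   # fall back to the final state
```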
- the present embodiment reduces omission errors in comparison with the results of the first embodiment.
- the actual speech recognition systems of higher performance often use a language processing function in addition to a simple speech recognizing function.
- an insertion error can relatively easily be corrected or canceled by considering language restrictions, but an omission error is difficult to infer and restore using the same language restrictions. Therefore, the reduction in the rate of omission errors achieved by the present embodiment is important in realizing a speech recognition apparatus of higher performance.
- FIG. 23 shows still another embodiment, in which the learning section of FIG. 14 further comprises noise data storage means and noise data overlaying means.
- the basic learning method is as described in connection with FIG. 14.
- This embodiment is characterized in that the learning data is overlaid with noise components beforehand.
- the connection weightings of the units in the neural network are adjusted by the learning control means. In other words, the neural network is caused to learn so that the noise components contained in the learning data can definitely be differentiated.
- The overlaying of the learning data with the noise components is carried out at a plurality of locations, as shown in FIGS. 24a-24c.
- reference numeral 2401 designates the learning data
- reference numerals 2402 and 2403 designate the noise components.
- FIG. 24(b) shows an example of the learning data of FIG. 24(a) overlaid with the noise component 2402 at its forward portion
- FIG. 24(c) shows an example of the learning data overlaid with the noise component 2403 at its rearward portion.
- by causing the neural network to learn from learning data overlaid with noise components, the neural network can come to definitely differentiate the noise components and effectively remove them.
- the neural network can properly recognize nonconstant noises with which the speech data is overlaid.
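A sketch of the overlaying step of FIGS. 24a-24c; mixing by simple addition, the sampling rate and the chosen positions are assumptions.

```python
import numpy as np

def overlay_noise(speech, noise, position):
    """Add a noise segment onto a copy of the learning data, starting at `position` (in samples)."""
    out = speech.copy()
    end = min(position + len(noise), len(out))
    out[position:end] += noise[:end - position]
    return out

# Hypothetical signals: a 1-second learning word and a background-noise segment at 16 kHz.
speech = np.random.randn(16000)
noise = 0.1 * np.random.randn(4000)

# The learning is repeated while shifting the noise to different overlaying positions,
# e.g. the forward portion (FIG. 24(b)), the middle, and the rearward portion (FIG. 24(c)).
variants = [overlay_noise(speech, noise, p) for p in (0, 6000, 12000)]
```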
- the present invention provides the speech recognition apparatus and its learning method which are very effective not only in the continuous speech recognition but also in the discrete speech recognition.
- the present invention is effective not only in the speech recognition but also in any processing of time series information if the correspondence between input data and output data can be taken.
- the present invention is considered to be applicable to compression of information, expansion of information, waveform equivalence, and the like.
Abstract
A speech recognition apparatus using a neural network. A neuron-like element according to the present invention has a means for storing the value of its internal state, a means for updating the value of the internal state on the basis of an output from the neuron-like element itself, outputs from other neuron-like elements and an external input, and an output value generating means for converting the value of the internal state into an external output. Accordingly, the neuron-like element itself can retain the history of input data. This enables time series data, such as speech, to be processed without providing any special means in the neural network.
Description
This is a continuation of application Ser. No. 08/150,170 filed Nov. 29, 1993, abandoned.
1. Field of the Invention
The present invention relates to a speech recognition apparatus using a neural network, and to a learning method therefor. Unlike the prior art, the present invention neither requires the start and end edges of input data when time series data such as a speech data sequence is to be processed, nor processes all the possible combinations of start and end edges. Rather, the present invention makes it possible to process time series data such as speech data precisely, using simplified hardware which comprises neuron elements capable of holding the past history of input data in themselves.
The present invention also relates to a learning method for a neural network to do such a process.
2. Description of the Related Art
Several data recognition methods have practically been used particularly to learn and recognize the category of time series data. Such methods include Dynamic Programming (DP) Method, Hidden Markov Model (HMM) Method, and Back Propagation Learning Rule and Multi-Layered Perceptron (MLP) Neural Network Method. These methods are described, for example, in NAKAGAWA Seiichi, "Speech Recognition by Stochastic Model" published by the Institute of Electronics, Information and Communication Engineers, and in NAKAGAWA, SHIKANO and TOHKURA, "Speech, Auditory Perception and Neural Network Model" published by Ohm Co., Ltd.
A problem common to the DP and HMM methods is that they require the start and end edges to be given in both the teacher data and the input data to be recognized. One technique for processing data apparently without depending on its start and end edges is to find the start and end edges providing the best result in a trial-and-error manner. When data parts belonging to a category are to be detected in an input data of length N, there are N possible start edges and N possible end edges; that is, on the order of N² combinations of start and end edge patterns are possible. Such a technique must therefore recognize and process this great number of combinations, which consumes huge processing time.
Beyond the quantitative problem of the huge number of combinations, the aforementioned technique has a more essential problem arising from the assumption of start and end edges of the input data. More particularly, the start and end edges of the input data are self-evident if the input data contains only a single datum belonging to a category. However, they cannot easily and clearly be bounded if the input data includes successive data parts belonging to more than one category. In particular, time series data such as speech data does not have definite boundaries at the start and end edges, with data parts belonging to two adjacent categories connected to each other through an overlapping transition region. Accordingly, the assumption of start and end data edges raises a very large problem in accuracy.
On the other hand, the MLP method does not require such an assumption. Instead, the MLP method raises another problem with respect to the start and end edges of the input data in that the range of the input data must be specified. In other words, the MLP method is basically for recognizing static data. Thus, the MLP method can recognize time series data only when an input data within a fixed length of time is used, with the time information treated equivalently to spatial information. The length of time must be fixed due to the composition of the MLP method.
However, the length of time series data varies greatly from one category to another, and also within the same category. For example, the average length of vowels, which are long phonemes, is ten or more times longer than that of plosives, which are short phonemes. Even within the same phoneme, the length can fluctuate by a factor of two or more in actual speech. Even if the input range of data is set to the average length, the input data of a short phoneme to be recognized will include much data other than the data to be recognized, and the input data of a long phoneme will include only a part of the data to be recognized. Such effects reduce the recognition ability. Even if the input length is appropriately set for each phoneme, the problem is not solved, since the length of each phoneme itself varies. Such problems are found in time series information generally.
As described, the DP and HMM methods require the start and end edges of the data to be handled, and the MLP method requires the start and end edges of the input range during learning. However, the start and end edges of time series information cannot definitely be bounded due to the nature of the information. Even if the start and end edges are forcibly assumed, the speech recognition ability is reduced. To apparently relieve this, all the combinations of start and end edges must be processed, resulting in a huge amount of processing.
In contrast, the present invention provides a speech recognition apparatus using a neural network comprising:
1) Neuron elements, each of the neuron elements comprising internal state value storage means, internal state value updating means for updating the internal state value in said internal state value storage means on the basis of an internal state value stored in said internal state value storage means and an input value to said neuron element, and output value generating means for converting the output of said internal state value storage means into an external output value.
2) The internal state value updating means may be formed as weighted accumulation means for performing the weighted accumulation of said input values and said internal state values. The internal state value storage means may be formed as integration means for integrating the values accumulated by said weighted accumulation means. The output value generating means may be formed as output value limiting means for converting a value obtained by said integrating means into a value between upper and lower preset limits.
3) The internal state value updating means may be formed to update the internal state value into a value which satisfies the following formula:

τ.sub.i dX.sub.i /dt=-X.sub.i +Σ.sub.j Z.sub.j

where X.sub.i is the internal state value of the i-th neuron element in said neural network, τ.sub.i is a time constant, and Z.sub.j (j ranges between 0 and n, where n is 0 or a natural number) is said weighted input value to said neuron element.
4) In the speech recognition apparatus defined in any one of items 1) to 3), the weighted input value Zj to the i-th neuron element may include the weighted output of the i-th neuron element itself.
5) In the speech recognition apparatus defined in any one of items 1) to 4), the weighted input value Zj to the i-th neuron element may also include the weighted output of any other neuron element in said neural network.
6) In the speech recognition apparatus defined in any one of items 1) to 5), the weighted input value Zj to the i-th neuron element may also include any data provided from the outside of said neural network.
7) In the speech recognition apparatus defined in any one of items 1) to 6), the weighted input value Zj to the i-th neuron element may also include a weighted and fixed value.
8) In the speech recognition apparatus defined in any one of items 1) to 7), the output value generating means may be formed to have an output of symmetrical range in positive and negative directions.
9) In the speech recognition apparatus defined in any one of items 1) to 8), the neural network may be formed to have at least two outputs: positive and negative outputs.
10) In the speech recognition apparatus defined in any one of items 1) to 9), the speech recognition apparatus may comprise speech feature extracting means for extracting the feature of an input to be recognized and for providing the extracted value to said neural network, recognition result output means for converting the output value of said neural network into a recognition result, and internal state value initializing means for providing a preset initial value to the internal state value storage means of each neuron element comprised by said neural network.
11) The speech recognition apparatus as defined in the item 10) characterized in comprising: background noise input means for inputting the background noise to said neural network, and stable state detecting means for detecting the stable state from the output of said neural network and for outputting a signal to change the preset initial internal state value to an initial internal state value setting means on the basis of said stable state detection.
The learning method for the speech recognition apparatus using a neural network according to the present invention is characterized in that:
12) The speech recognition apparatus as defined in the item 10) or 11) including a learning section for causing said neural network to learn. The learning section comprises input data storage means for storing input learning data, input data selection means for selecting an input learning data from said input data storage means, output data storage means for storing output learning data, output data selection means for selecting an output learning data depending on the selected input data and chains of data including the selected data, and learning control means for inputting the selected input learning data to said feature extracting means and for controlling the learning in said neural network. The learning control means is formed to respond to the outputs of said neural network and output data selection means to change the weightings at the connections of said neuron elements.
13) In the item 12), said input data storage means has a plurality of categories, said output data storage means has a category corresponding to each of the categories in said input data storage means, said input data selection means selects a plurality of data to be learned from the categories of said input data storage means, and said output data selection means selects an output learning data corresponding to the input learning data selected by said input data selection means. Said learning control means has input data connecting means for connecting the plurality of data selected by said input data selection means into a single input data, and output data connecting means for connecting the output learning data selected by said output data selection means into a single output data. The learning control means may be formed to input said connected input learning data to the speech feature extracting means and to change the weightings at the connections of said neuron elements on the basis of the outputs of said neural network and output data connecting means.
14) In the item 13), the number of said categories can be two.
15) In the items 12) to 14), the learning section comprises noise data storage means for storing noise data, and noise overlaying means for overlaying said selected learning data with the noise selected from said noise data storage means, the input data overlaid with the noise by said noise overlaying means being used to cause said neural network to learn.
16) In the item 15), the learning may be repeated while shifting said background noise to different overlaying positions.
17) In the item 15), the learning may be performed by first causing the neural network to learn an input data not overlaid with the background noise, and thereafter to learn the same data overlaid with the background noise.
According to the present invention, the speech recognition apparatus using neural network, and learning method therefor have the following advantages:
1) The processing speed may greatly be increased because data input is required only once, although the prior art required a processing time proportional to the square of length N of the speech input.
2) A memory used to store input data may greatly be reduced in capacity.
3) No normalization of results is required.
4) The continuous processing can easily be carried out.
5) Sufficient accuracy can be obtained even with an integer data representation.
6) The recognition results can be obtained with very high accuracy by combining the positive and negative outputs with each other.
7) Any number of outputs, each carrying its own information, can be provided.
8) Various characteristics, such as noise resistance, can easily be improved.
9) The apparatus can handle phenomena on various time scales in a self-organizing manner through learning.
10) Organizations that make optimal use of the associative ability and the data compression/expansion ability of the NN (neural network) can easily be formed for the intended purposes.
11) Learning is made very easy, with a greatly reduced number of trial-and-error steps.
FIG. 1 is a view of a neuron element used to form a neural network according to the present invention.
FIG. 2 is a view showing the neuron element of FIG. 1 in actual functional blocks.
FIG. 3 is a view obtained by replacing blocks in the arrangement of FIG. 2 by electric circuits.
FIG. 4 is a view of a speech recognition apparatus which uses a neural network constructed by neuron elements according to the present invention.
FIG. 5 is a view showing the neural network of FIG. 4 formed into a three-layered schematic structure.
FIG. 6 is a view showing the neural network of FIG. 5 formed with an increased number of layers.
FIG. 7 is a view showing the division of the transmission network shown in FIG. 6.
FIG. 8 is a view of a neural network having an autoregressive loop.
FIG. 9 is a view of a random connecting type neural network.
FIGS. 10a and 10b are views illustrating the noise-resistant property of the speech recognition apparatus of the present invention.
FIGS. 11a and 11b are views illustrating the learning effect of the speech recognition apparatus of the present invention with respect to time scale.
FIG. 12 is a view of another speech recognition apparatus using the neuron elements of the present invention.
FIG. 13 is a view illustrating the operational procedure of the speech recognition apparatus shown in FIG. 12.
FIG. 14 is a view illustrating a method of the present invention for causing the speech recognition apparatus using the neural network of the present invention to learn.
FIG. 15 is a view illustrating the operational procedure in the learning method of the present invention.
FIGS. 16a and 16b are views showing the connecting of learning data in the present invention.
FIGS. 17a-17d are views showing forms of learning data in the present invention.
FIG. 18 is a view showing another learning method of the present invention for the speech recognition apparatus using the neural network of the present invention.
FIG. 19 is a view showing the speech word detection output of the speech recognition apparatus of the present invention.
FIG. 20 is a view showing another speech word detection output of the speech recognition apparatus of the present invention.
FIG. 21 is a view showing another arrangement of a speech recognition apparatus constructed in accordance with the present invention.
FIG. 22 is a view illustrating the operational procedure of the speech recognition apparatus shown in FIG. 21.
FIG. 23 is a view illustrating a method of causing a speech recognition apparatus having background noise overlaying means to learn.
FIGS. 24a-24c are views illustrating a manner of overlaying learning data with noise components.
FIG. 25 is a table showing results obtained when unknown words are provided to the neural network that has learned according to the learning method of the present invention.
FIG. 26 is a table showing results obtained when the same processing as in FIG. 25 is carried out to an unknown speaker.
FIG. 27 is a table showing results obtained when the same processing as in FIG. 26 is carried out with background noise.
FIG. 28 is a view showing a neuron element of the prior art.
FIG. 29 is a view showing the neuron element of FIG. 28 in actual functional blocks.
FIG. 30 is a view obtained by replacing blocks in the arrangement of FIG. 29 by electric circuits.
Referring to FIG. 1, there is diagrammatically shown the function of a neuron element (hereinafter referred to as a "node") which is used to form a NN (neural network) according to the present invention. In the figure, numeral 104 designates the node generally, 101 designates internal state value storage means, 102 designates internal state value updating means for updating the internal state on the basis of the internal state value stored in 101 and an input value to the node, and 103 designates output value generating means for converting the internal state value into an external output.
FIG. 2 shows the detailed function of the node shown in FIG. 1. In FIG. 2, reference numeral 201 designates data input means; 202 designates weighted accumulation means for weighting and accumulating data input values from the data input means 201; 203 designates integration means for integrating the accumulated data values; and 204 designates output value limiting means for converting a value obtained by the integration into a value within a preset range, respectively.
FIG. 3 shows an example of an electronic circuit in the arrangement of FIG. 2. In FIG. 3, reference numeral 301 designates both the input means and weighted accumulation means in FIG. 2; 302 designates the integration means; and 303 designates the output value limiting means.
On the other hand, FIG. 28 diagrammatically shows the function of a node used to form a NN (neural network) using the prior art MLP method. In the figure, 2803 designates the node generally, 2801 designates an internal state value computing means for computing the internal state value, and 2802 designates output value generating means for converting the internal state value computed by 2801 into an external output.
FIG. 29 similarly shows the functional arrangement of the prior art node shown in FIG. 28. In FIG. 29, reference numeral 2901 designates data input means; 2902 designates weighted accumulation means for weighting and accumulating data input values from 2901; and 2903 designates output value limiting means for converting the accumulated data values into a value within a preset range.
FIG. 30 shows an example of an electronic circuit for the arrangement of FIG. 29. In FIG. 30, reference numeral 3001 designates both the data input means and weighted accumulation means of FIG. 29; and 3002 designates the output value limiting means.
As will be apparent from FIGS. 1-3 and 28-30, the node of the present invention includes the integration means, which is not found in the node of the prior art. The prior art node is static in that its output depends only on the input at that time. By contrast, the node of the present invention can be said to be dynamic in that the past history of the data inputted into that node is converted into and held as an integrated value, on which the output of the node depends.
In other words, a neural network using the static nodes of the prior art must embed the temporal structure of the data in the network structure itself if time series data are to be processed. By contrast, the neural network using the dynamic nodes of the present invention can process time series data within the node itself, without depending on the network structure.
More concretely, the processing of time series data by the neural network of the prior art requires some way of unfolding the temporal information into spatial information, such as connecting the data inputted at a plurality of timings into a single input data. To this end, hardware and a process for storing and controlling the connected data must be provided. Alternatively, a special context element for storing the time-dependent information may be required, together with suitable hardware and a process for controlling this context.
By contrast, the neural network of the present invention does not require any special structure, because the context information and the like are stored as integrated values within each of the elements. Therefore, data can be input in the simplest possible manner, that is, the data of each timing is input at that timing. The present invention does not require any specific hardware or process for handling the temporal information.
The actual operations of the node, and of a neural network defined by a plurality of such nodes according to the present invention, will be described below. It is now assumed that the internal state value of the node is X and its output value is Y. It is also assumed that, as the values X and Y change with time, the current internal state value is X.sub.curr, the updated internal state value is X.sub.next, and an input value to the node during the updating operation is Z.sub.i (i ranges from zero to n, where n is the number of inputs to that node). When the operation of the internal state value updating means is formally represented by a function G, the updated internal state value X.sub.next can be represented by:
X.sub.next =G(X.sub.curr, Z.sub.0, . . . , Z.sub.i, . . . , Z.sub.n) (1).
The concrete form of this formula (1) may take any of various forms; for example, it may be the following formula (2) using a first order differential equation:

τ.sub.i dX.sub.i /dt=-X.sub.i +Σ.sub.j Z.sub.j (2)

where τ.sub.i is a time constant.
The input value Z.sub.j will now be defined in more detail. The input values are considered to include: (1) the output of the node itself multiplied by a connecting weight; (2) the output of another node multiplied by a connecting weight; (3) a fixed output value multiplied by a connecting weight, which equivalently provides a bias to the internal state value updating means; and (4) an external input provided to the node from the outside of the neural network. Considering the updating of the internal state value of the i-th node with respect to such input values Z.sub.j, the formula (2) can be rewritten in a more specific form as follows:

τ.sub.i dX.sub.i /dt=-X.sub.i +Σ.sub.j W.sub.ij Y.sub.j +θ.sub.i +D.sub.i (3)

where the internal state value is X.sub.i, the output of a node is Y.sub.j, the connecting weight from the output of the j-th node to the input of the i-th node is W.sub.ij, the bias value is θ.sub.i, and the external input value to the i-th node is D.sub.i.
When the internal state value of the node determined in this manner at a given moment is X, and the operation of the output value generating means is formally represented by a function F, the output Y of the node can be represented by:
Y=F(X) (4).
A specific form of the function F may be a sigmoid (logistic) function with a positive-negative symmetric output, as shown by the following formula:

Y=F(X)=(1-e.sup.-X)/(1+e.sup.-X) (5)

However, such a functional form is not essential; a simpler linear transform, a threshold function, etc. may also be used.
According to such formulae, the time series of the output Y of the neural network constructed in accordance with the present invention can be calculated.
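As an illustration of how these formulae combine into a single update step, the following Python sketch advances a vector of node states by one Euler-discretized step. It is only a sketch under stated assumptions: the patent defines the node by the differential equations, not by any program, and the function names, the step size `dt`, and the choice of the symmetric sigmoid as the output function F are illustrative choices rather than details taken from the patent.

```python
import numpy as np

def symmetric_sigmoid(X):
    # Output value generating means: symmetric output in (-1, 1), as formula (5) is read here.
    return (1.0 - np.exp(-X)) / (1.0 + np.exp(-X))

def node_step(X, W, theta, D, tau, dt=1.0):
    """One Euler-discretized update of formula (3):
    tau_i dX_i/dt = -X_i + sum_j W_ij Y_j + theta_i + D_i
    X:     (n,) current internal state values (one per node)
    W:     (n, n) connecting weights; W[i, j] feeds node j's output into node i
    theta: (n,) bias values
    D:     (n,) external inputs (e.g. extracted speech features), zero where absent
    tau:   (n,) time constants
    """
    Y = symmetric_sigmoid(X)                 # outputs from the current internal states
    dX = (-X + W @ Y + theta + D) / tau      # right-hand side of formula (3)
    X_next = X + dt * dX                     # integration means (Euler step)
    return X_next, symmetric_sigmoid(X_next) # updated state and corresponding output
```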
FIG. 4 shows one embodiment of a speech recognition apparatus using a neural network constructed from such nodes according to the present invention. In the figure, 401 designates speech feature extracting means, 402 designates a neural network constructed from nodes according to the present invention, and 403 designates recognition result output means. The outputs extracted by the speech feature extracting means are inputted into two nodes. The neural network is of an entire connecting type, in which any one node is connected with all the other nodes, and it provides two outputs to the recognition result output means. The neural network of the present invention may have any number of outputs. If a word is to be recognized, therefore, two outputs, a positive output and a negative output, can be provided, and the recognition results of these outputs can be judged collectively to increase the recognition accuracy. The number of inputs and outputs of the neural network need not be limited to two as in FIG. 4, but may be set at any number.
FIGS. 5-9 show a variety of other neural network forms constructed by the nodes of the present invention.
FIG. 5 shows a form in which only the neural network 402 of FIG. 4 is modified. This neural network 402 includes an input layer 501, a hidden layer 502 and an output layer 503. Such an arrangement is apparently the same as in the MLP method of the prior art. However, the neural network constructed from the nodes of the present invention differs from the feed-forward type network of the prior art, in which the value of the input layer is determined first, the value of the hidden layer is then determined using the input layer value as an input, and the values of the successive layers up to the output layer are determined one after another.
Because each node can hold its own internal state value, the neural network using the nodes of the present invention can recognize time series data and provide the same result as the prior art, without the need for a context layer as in the prior art. The neural network of the present invention can also perform parallel processing more efficiently than the MLP method of the prior art, because the outputs of all the layers are determined simultaneously.
Further, the neural network using the nodes of the present invention has improved noise resistance. FIG. 10(a) shows the correspondence between the input and output of a node according to the simple MLP method of the prior art. When a signal comprising a square-waveform input overlaid with a spiked noise is inputted, a waveform substantially equivalent to that of the input signal appears at the output, as will be apparent from FIG. 10(a). Thus, the node of the MLP method is directly affected by the noise, since the input is simply reflected in the output.
However, the node of the present invention stores the temporal history as an internal state value, and the next internal state value and output value are determined as a function of the current internal state value and the input. Even if the input is overlaid with a spiked noise as in FIG. 10(a), the spiked waveform is dulled and its effect is reduced, as shown in FIG. 10(b). As a result, the present invention provides improved noise resistance.
Some degree of noise resistance can be achieved even by the prior art having a context layer. However, the prior art must provide, among the nodes forming the neural network, nodes of a special structure dedicated to holding the past history information. Therefore, the noise resistance of the prior art is inferior to that of the present invention, in which every node holds its own past history information as an internal state value.
As a next example, FIG. 6 shows a multi-layer neural network obtained by increasing the number of layers of the neural network of FIG. 5 to form a sandglass configuration. In FIG. 6, 601 designates a feature extracting (or information compressing) network, 602 designates a transmission network and 603 designates a recognizing (or information expanding) network. The neural network of FIG. 6 is also apparently similar to the MLP method of the prior art, but its operation is entirely different, as described above. In such an arrangement, the feature extracting (or information compressing) and recognizing (or information expanding) functions, which take the time series effects into account, can be formed into modules to provide a speech recognition apparatus without losing the advantages of the present invention.
The transmission network 602 of FIG. 6 can be divided into an information transmitting function 702 and an information receiving function 703, as shown in FIG. 7. The wavy line between the functions 702 and 703 indicates that these functions may be separated from each other in space and/or time. If the wavy line represents a spatial distance such as a transmission line, the arrangement represents a speech compressing and transmitting device. If the wavy line represents a length of time, it represents a speech compressing and recording device. Of course, the object to be compressed here is not limited to speech, but may be more general information. It is needless to say that the recognizing process is itself a process of information compression in a broader sense.
The arrangement of FIG. 7 has the same advantages as described hereinbefore. For example, the noise resistance described with respect to FIG. 10 can also protect the neural network from mis-transmission and noise in the transmission line, or from defects or degradation of a recording medium.
FIG. 8 shows a simplified modification of the neural network shown in FIG. 4. The neural network has an autoregressive loop which can handle events within a widened range of time. More particularly, the presence of the autoregressive loop approximately corresponds to replacing the time constant τ in the system by the following formula:
τ/(1-W) (6)
where W is the connecting weight of the autoregressive loop portion in an input value Z.
The connecting weight W can be modified by a learning process, described later, so as to optimize the time scale of the system's response to the learning data. The prior art method using a context layer cannot optimize the time scale in a self-organizing manner by learning; the time scale of such a network must instead be set manually.
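The effect of formula (6) can be checked with a few numbers. The snippet below is only an illustrative calculation under the linearized-regime assumption; the values of τ and W are arbitrary.

```python
tau = 10.0                      # time constant of a node without a self-loop
for W in (0.0, 0.5, 0.9):       # autoregressive (self-loop) connecting weights
    print(f"W = {W:.1f} -> effective time constant = {tau / (1.0 - W):.1f}")
# W = 0.0 leaves tau unchanged, W = 0.5 doubles it, and W = 0.9 stretches it tenfold,
# so learning W lets the node adapt its own time scale to the data.
```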
FIGS. 11a and 11b show the concept of this advantage of the present invention. It is now assumed that square waveforms as shown in FIG. 11(a) are continuously inputted into the system. If the time constant of the system response is larger than the input cycle of the square waveforms, the responses will be sequentially overlaid one on another, as shown in FIG. 11(a). This does not provide a proper recognition result.
On the other hand, the time constant of a system having an autoregressive loop as shown in FIG. 8 can be optimized by learning. Therefore, the response of this system can be modified as shown in FIG. 11(b), which provides an improved recognition result.
By combining the learning function of such a system for time constant with an appropriate learning method, the noise resistance and the like in the systems of FIGS. 6 and 7 can be further improved.
Finally, FIG. 9 shows a neural network obtained by modifying the neural network of FIG. 8 into a random connecting type. The random connecting type neural network 902 comprises two sub-networks: an input network 904 and an output network 905. In this embodiment, the input network is an entire connecting type sub-network while the output network is a random connecting type sub-network. These sub-networks are connected with each other in only one direction.
Such an arrangement provides the following advantages in addition to those described above. By using the associative ability of the entire connecting type neural network, functions such as the compensation of input defects or the improvement of noise resistance can be achieved. Further, the one-direction connection allows the flow of information to be treated heuristically, so as to optimize functions such as information compression and information expansion.
Although various modifications of the neural network shown in FIG. 4 have been described, another arrangement of the speech recognition apparatus itself will be described now.
FIG. 12 shows the same arrangement as that of FIG. 4 except that the speech recognition apparatus additionally comprises initial internal state value setting means 1204. As shown by the formula (2), the operation of the neural network according to the present invention is described by a first order differential equation, so an initial value is required to determine the operation. The initial internal state value setting means provides preset initial values to all the nodes prior to actuation of the neural network. The operational procedure of the speech recognition apparatus will be described with reference to FIG. 13.
1. The initial internal state value setting means sets a suitably selected initial internal state value X at every node and sets the corresponding output Y.
2. The procedure terminates if the end step is reached.
3. The sum of the input values Z is determined for every node. The input values Z are as described above; the speech feature values extracted by the speech feature extracting means form part of the input values Z as external inputs.
4. For every node, the internal state value X is updated on the basis of the sum of the input values Z determined in step 3 and the internal state value X itself.
5. The output value Y is calculated from the updated value X.
6. The procedure returns to step 2.
The recognition result is provided to the recognition result output means as an output from a node assigned for it.
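The loop of FIG. 13 can be summarized by the following Python sketch, which reuses the `node_step` function from the earlier sketch. It is a hedged illustration only; the frame iterator, the choice of which nodes receive the external features, and the preset initial values are assumptions made for the sketch.

```python
import numpy as np

def recognize(frames, W, theta, tau, X0, n_input, dt=1.0):
    """Run the FIG. 13 procedure over a sequence of feature frames.
    frames:  iterable of feature vectors (external inputs from the feature extractor)
    X0:      preset initial internal state values (step 1)
    n_input: number of nodes that receive the external feature values
    Returns the time series of all node outputs Y.
    """
    X = X0.copy()                                  # step 1: initial internal states
    outputs = []
    for frame in frames:                           # step 2: stop at the end of the input
        D = np.zeros_like(X)
        D[:n_input] = frame                        # step 3: external part of the input sum Z
        X, Y = node_step(X, W, theta, D, tau, dt)  # steps 4-5: update X, then compute Y
        outputs.append(Y)                          # step 6: repeat for the next frame
    return np.array(outputs)
```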
The basic operational and structural concepts of the speech recognition apparatus having the neural network which uses the nodes constructed according to the present invention have been described above. In order to cause such a neural network to perform the desired processing, the neural network should be caused to learn. A method of causing the neural network to learn will be described below.
FIG. 14 is a block diagram illustrating a learning process for the speech recognition apparatus of the present invention. In FIG. 14, numeral 1410 designates a learning section for causing a neural network 1402 to learn, 1411 designates input data storage means for storing given input learning data, 1413 designates output data storage means for storing output data which are models corresponding to each input learning data, 1412 designates input data selection means for selecting the input data to be learned from the input data storage means, 1414 designates output data selection means for selecting output data in the same manner, and 1415 designates learning control means for controlling the learning of the neural network.
The manner in which the speech recognition apparatus is caused to learn by the learning section will be described with reference to FIGS. 13 and 14. First, preset initial internal state values X are set at all the nodes. Secondly, the input learning data to be learned is selected by the input data selection means and fed to the learning control means. At the same time, the output learning data corresponding to the selected input learning data is selected by the output data selection means and similarly fed to the learning control means. The selected input learning data is received by the speech feature extracting means 1401, in which a feature vector is extracted and inputted to the neural network as an external input. For every node, the sum of the inputs Z is determined and the internal state value X is updated according to the formula (2). An output Y is thus determined from the updated internal state value X.
In the initial step, the connecting weights between units in the neural network are random, so the output value Y of the neural network is also random.
The above procedure is repeated to the end of the input data time series. For the resulting time series of outputs Y, a learning evaluation value C is determined by the following formula: ##EQU5## where E is an error evaluation value. The time series of the learning evaluation value C are calculated along the procedure shown in FIG. 15, following the formula (7).
As a practical example of this procedure, the error evaluation value E can be written, using the Kullback-Leibler distance as the error evaluation function, as follows:
E(Y.sub.i,T.sub.i)=T.sub.i log(T.sub.i /Y.sub.i)+(1-T.sub.i)log[(1-T.sub.i)/(1-Y.sub.i)] (8)
where T is the output learning data corresponding to the selected input learning data; and Y is an output value corresponding to the input learning data. By using Kullback-Leibler distance, the learning speed can be increased due to various factors.
Where the output value generating means has symmetrical outputs, the formula (8) can be replaced by the following formula (9), which is substantially equivalent:
E(Y.sub.i,T.sub.i)=[(1+T.sub.i)/2]log[(1+T.sub.i)/(1+Y.sub.i)]+[(1-T.sub.i)/2]log[(1-T.sub.i)/(1-Y.sub.i)] (9)
By using these formulae, the formula (7) can more concretely be rewritten as formula (10): ##EQU6##
Thus, the modification rule of the connecting weight W is provided by:
ΔW.sub.ij =α∫C.sub.i Y.sub.j dt (11)
where α is a small positive constant. The connecting weights between units can thus be changed so as to provide the desired output. By repeatedly inputting the speech data to be recognized and changing the connecting weights little by little, the network comes to output the correct values. The number of repetitions necessary for the output to converge is on the order of several thousand.
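The error of formula (8) and the accumulation of formula (11) can be written compactly as below. This is a sketch under assumptions: the learning signal C.sub.i of formulas (7) and (10) is not reproduced in the text quoted here, so it is simply taken as a given time series, and the learning rate `alpha` and the clipping constant `eps` are illustrative values.

```python
import numpy as np

def kl_error(Y, T, eps=1e-7):
    """Formula (8): E(Y, T) = T*log(T/Y) + (1 - T)*log((1 - T)/(1 - Y)),
    for outputs and teacher values in (0, 1)."""
    Y = np.clip(Y, eps, 1.0 - eps)
    T = np.clip(T, eps, 1.0 - eps)
    return T * np.log(T / Y) + (1.0 - T) * np.log((1.0 - T) / (1.0 - Y))

def weight_update(C, Y, dt, alpha=0.01):
    """Formula (11): dW_ij = alpha * integral of C_i * Y_j dt, accumulated over a
    time series. C is the per-node learning signal (formulas (7)/(10), not derived
    here), Y the node outputs; both are arrays of shape (steps, nodes)."""
    return alpha * dt * np.einsum('ti,tj->ij', C, Y)
```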
Such a learning rule can apparently be applied not only to the entire connecting type neural network exemplified here, but also to any random connecting type neural network, including specific cases such as layered connections, which can be used more generally in the art.
Next, another method of causing the speech recognition apparatus to learn, by continuously inputting two input learning data, will be described, taking as an example the neural network with two outputs, a positive output and a negative output.
In the learning method that uses the input data one by one, the positive output cannot be lowered to the low level once it has been shifted to the high level; conversely, the negative output cannot be raised to the high level once it has been shifted to the low level. More particularly, such a learning method performs a learning in which, when input data to be recognized (hereinafter called "positive data") is provided, the positive output is raised to the high level while the negative output remains at the low level, as shown in FIG. 16(a); or it performs another learning in which, when input data not to be recognized (hereinafter called "negative data") is provided, the negative output is raised to the high level while the positive output remains at the low level, as shown in FIG. 16(b). However, through these learnings, a positive or negative output that has once been raised to the high level will never be lowered.
If a plurality of speech data containing both positive and negative data are continuously inputted to the system, then once the positive output has been raised to the high level by positive data, it will not be lowered to the low level even when negative data is subsequently inputted. The same applies to the negative output.
Therefore, the present embodiment uses a learning method which teaches both the raising and the lowering of the outputs by continuously providing two speech data, as shown in FIGS. 17(a)-(d). In FIG. 17(a), negative and positive data are inputted continuously in this order, to cause the neural network to learn the raising of the positive output and the raising and lowering of the negative output. In FIG. 17(b), positive and negative data are inputted continuously in this order, to cause the neural network to learn the raising and lowering of the positive output and the raising of the negative output. In FIG. 17(c), two negative data are inputted continuously so that the neural network will not wrongly learn, through the learning of FIG. 17(a), that positive data always follows negative data. In FIG. 17(d), similarly, two positive data are inputted continuously so that the neural network will not wrongly learn, through the learning of FIG. 17(b), that negative data always follows positive data.
In other words, this is a problem of initial value dependency in the operation of the neural network. A learning process using only a single input data starts only from one specific initial value, and is therefore effective only for that initial value. For general use of the neural network, it must be caused to learn to provide correct responses for a variety of initial values. Not all possible events need to be considered as initial values; in actual recognition, the number of possible initial value combinations for an object to be recognized is limited by various restrictions. The use of a chain of two or more data in the learning process approximately covers these possible combinations of initial values. For this purpose, continuous data consisting of only two single data can already provide a satisfactory result; continuous data consisting of three or more single data may of course also be used.
FIG. 18 shows a speech recognition apparatus which can cause the neural network to learn continuous input data consisting of two single data. The input data storage means described in connection with FIG. 14 comprises means for storing data of two categories: positive and negative data. In FIG. 18, 1801 designates positive data storage means for storing positive data which is a group of words to be recognized collected under various conditions, 1802 designates negative data storage means for storing negative data which is a group of words other than the words to be recognized, and 1803 and 1804 designate output data storage means for storing output learning data belonging to the respective categories. It is assumed herein that each of the categories includes three data. Reference numeral 1805 designates input data selection means, 1806 designates output data selection means, 1807 designates input data connecting means, 1808 designates output data connecting means, 1809 designates learning control means, and 1810 designates a neural network, respectively.
The input data selection means selects two input learning data from the positive data storage means 1801 and negative data storage means 1802. Combinations of these data are as shown in FIGS. 17a-17d. The two selected input data are combined into a single continuous data by the input data connecting means. Then, the continuous data is feature-extracted by the speech feature extracting means and then inputted into the neural network. The neural network then calculates the output value in time series according to the procedure of FIG. 13. The output of the neural network is fed to the learning control means wherein it is compared with a preselected output learning data to calculate an error, by which the connecting weight at each node will be modified. In such a manner, the neural network will repeatedly be caused to learn. In FIG. 18, the output of the neural network includes two nodes: positive and negative output nodes. Solid lines in the output data storage means 1803 and 1804 represent the learning output of the positive output node corresponding to the positive data, while broken lines represent the learning output of the negative output node corresponding to the negative data.
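A minimal sketch of this data connecting step is given below; it only illustrates the idea of chaining two selected data into one stream. The function name, the random choice among the four chain patterns of FIGS. 17(a)-(d), and the array representation of the data and learning outputs are assumptions for the sketch, not details from the patent.

```python
import itertools
import random
import numpy as np

def make_chained_pair(pos_data, pos_targets, neg_data, neg_targets):
    """Select two learning data (positive/negative) and connect them, together with
    their learning outputs, into single continuous streams (FIGS. 17(a)-(d))."""
    pools = {'pos': (pos_data, pos_targets), 'neg': (neg_data, neg_targets)}
    first, second = random.choice(list(itertools.product(('pos', 'neg'), repeat=2)))
    x1, t1 = random.choice(list(zip(*pools[first])))
    x2, t2 = random.choice(list(zip(*pools[second])))
    x = np.concatenate([x1, x2], axis=0)   # input data connecting means
    t = np.concatenate([t1, t2], axis=0)   # output data connecting means, kept aligned
    return x, t
```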
The recognition results of the speech recognition apparatus comprising a neural network made of nodes having the above features, caused to learn according to the learning method described with reference to FIG. 18, are shown below. With twentieth-order LPC cepstrum coefficients as the output of the speech feature extracting means, the neural network was actually constructed with 32 nodes in total: 20 input nodes, 2 output nodes and 10 other nodes.
The learning will be described first. The learning was carried out under the condition that the word to be recognized (positive data) was "TORIAEZU" (FIRST OF ALL) and the other eight reference words (negative data) were "SHUUTEN" (TERMINAL), "UDEMAE" (SKILL), "KYOZETSU" (REJECTION), "CHOUETSU" (TRANSCENDENCE), "BUNRUI" (CLASSIFICATION), "ROKKAA" (LOCKER), "SANMYAKU" (MOUNTAIN RANGE) and "KAKURE PYURITAN" (HIDDEN PURITAN). The neural network had two outputs, that is, a positive output corresponding to the positive data and a negative output corresponding to the negative data. Four categories of learning outputs were prepared, as described in connection with FIG. 17. For the curved part of each learning output, the sigmoid function of the formula (5) was used, with its origin at the temporal middle point of the curved part, the start edge of the curved part corresponding to -10 and the end edge to 10, scaled to lie between 0 and 0.9 (or the reverse, for a falling output). The speakers to be learned were MAU and FSU of the Japanese speech database prepared by ATR Interpreting Telephony Research Laboratories, Inc.
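One possible reading of the learning-output construction just described is sketched below: a flat segment, a sigmoid-shaped transition whose argument runs from -10 to 10 across the curved part, scaled to lie between 0 and 0.9, and a flat segment at the new level (reversed for a falling output). The frame counts and the exact scaling are assumptions made only for illustration.

```python
import numpy as np

def target_curve(n_before, n_curve, n_after, rising=True, top=0.9):
    """Learning output for one segment: flat, sigmoid-shaped transition, flat."""
    s = 1.0 / (1.0 + np.exp(-np.linspace(-10.0, 10.0, n_curve)))  # runs from ~0 to ~1
    curve = top * (s if rising else 1.0 - s)
    lo, hi = (0.0, top) if rising else (top, 0.0)
    return np.concatenate([np.full(n_before, lo), curve, np.full(n_after, hi)])
```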
The correspondence between input and output was set such that when the input data for one frame (in this case, twentieth-order LPC cepstrum coefficients) was inputted, one set of positive and negative outputs was obtained. It is therefore not necessary to input data for a plurality of frames as in the prior art.
A "BP model with feedback connections" type neural network which is a modification of the prior art MLP method raised a problem in that it is difficult to converge the learning and also in that the learning outputs must be prepared in the trial-and-error manner. Whereas, the neural network of the present invention can generate the desired outputs by causing it to learn several hundreds to several thousands times according to the speech learning method of the present invnention. The learning outputs can readily be determined as an only possible output without a trial-and-error aspect at all.
FIG. 25 shows test results obtained when data containing unknown words not used in the learning were given to the neural network after the above learning had been carried out. Words of 216 kinds were available, of which 9 kinds were used for the learning. Tests were carried out using two-word chain data prepared by combining the 216 kinds of words in a variety of combinations. In the tests, the total number of appearing words was 1290 per speaker. The recognition judgements were based on the combination of the positive and negative outputs: if the positive output is equal to or more than 0.75 and the negative output is equal to or less than 0.25, it is judged that a detection is made; if the positive output is equal to or less than 0.25 and the negative output is equal to or more than 0.75, it is judged that no detection is made; otherwise, the system is judged to be in a confused state. Under these conditions of judgement, an insertion error is counted if an output is detected at a position having no word to be detected, and an omission error is counted if no output is detected at a position having a word to be detected.
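The judgement rule described above maps directly onto a small decision function; a sketch (the function and label names are illustrative) is:

```python
def judge(positive_out, negative_out):
    """Combine the positive and negative outputs into one recognition judgement."""
    if positive_out >= 0.75 and negative_out <= 0.25:
        return "detected"        # the word to be recognized is judged present
    if positive_out <= 0.25 and negative_out >= 0.75:
        return "not detected"    # the word is judged absent
    return "confused"            # neither condition holds
```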
FIG. 26 shows the results of the same tests as in FIG. 25, carried out for nine unknown speakers other than the speakers used for the learning.
As is apparent from FIGS. 25 and 26, the speech recognizing method of the present invention provides a greatly improved recognition rate even though only a small amount of data was learned by the speech recognition apparatus.
FIG. 19 shows the detection of words to be recognized from three or more successive words. In FIG. 19, a solid line shows positive outputs, while a broken line shows negative outputs. As is apparent from FIG. 19, the speech recognition apparatus recognizes the word "TORIAEZU" (FIRST OF ALL) without being supplied with start and end edges as in the prior art.
FIG. 20 shows the recognition of the word to be recognized, "TORIAEZU", among the unknown words. As in FIG. 19, a solid line shows positive outputs while a broken line shows negative outputs. It is thus found that the recognition method of the present invention has sufficient generalizing ability.
Since the total length of the data given in FIG. 19 is 1049, the prior art, which must perform recognition over assumed start and end edges, would have to check combinations on the order of the square of 1049. The present invention, however, requires each of the 1049 data to be inputted only once, so the process can be carried out in one several-hundredth of the time required by the conventional process. Furthermore, since each data needs to be inputted only once, the present invention does not require the storage of data over the ranges of possible start and end edges as in the prior art. As a result, both the amount of data memory and the amount of calculation can be reduced.
Since the output has a peak at the necessary place, rather than increasing or decreasing monotonically as in the DP and HMM methods of the prior art, the output value does not need to be normalized for the length of the input data. More particularly, the outputs always lie within a fixed range (in this case, between -1 and 1), and the weight of an output is invariable within a recognition section. This means that the dynamic range of the values to be processed is narrower, and that the speech recognition apparatus can achieve sufficient performance using integer data rather than floating-point or logarithmic data in its processing.
Because the two outputs, positive and negative, are used collectively to make the recognition, the recognition does not fail even when the positive output begins to rise at the word "KOUNYU" (PURCHASE) in FIG. 20, since the negative output is not lowered there. The speech recognition can thus be improved in accuracy. Of course, the number of outputs is not limited to two, but can be increased if necessary. For example, if an output is added which represents the degree of resemblance between the presently inputted data and the data used in the learning, the accuracy of the recognition result can be improved further. If a plurality of such outputs are used, the neural network which provides the optimum results can be chosen.
In addition, the present invention can recognize syllables or phonemes, rather than words as exemplified. In such a case, a relatively small number of neural networks suffices to recognize the speech of an entire language, which enables, for example, a dictation system to be constructed. The units of recognition can also be abstract ones not related to any language; this is particularly effective when the speech recognition apparatus is used to compress information.
FIG. 21 shows another embodiment of the present invention which is different from the speech recognition apparatus of FIG. 12 in that background noise input means 2105 and stable state detection means 2106 are added to it. The other parts are similar to those of FIG. 12.
FIG. 22 shows a flowchart of the process through which the initial internal state value is determined in the arrangement of FIG. 21. In this flowchart, the step of preparing background noise data may comprise suitable initial value setting means and suitable constant input preparing means, or the step may be omitted, corresponding to no input. FIG. 27 shows the recognition results obtained by causing the speech recognition apparatus to learn according to the learning method of FIG. 18; it corresponds to Tables 1 and 2 of the first embodiment combined. The results were obtained by saving, as initial values, the internal state values at which the neural network became stable when background noise was inputted for about 3 seconds. On recognition, these values are used as the initial values of the differential equation (2).
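A sketch of how such background-noise-derived initial values might be obtained is shown below, reusing the `node_step` function from the earlier sketch. The stability test (a simple change threshold `tol`) stands in for the patent's stable state detecting means and is an assumption, as are the zero starting states.

```python
import numpy as np

def initial_state_from_noise(noise_frames, W, theta, tau, n_input, dt=1.0, tol=1e-4):
    """Feed background noise frames (roughly 3 seconds) to the network and return
    the internal state values once they stop changing appreciably."""
    X = np.zeros(W.shape[0])
    for frame in noise_frames:
        D = np.zeros_like(X)
        D[:n_input] = frame
        X_new, _ = node_step(X, W, theta, D, tau, dt)
        if np.max(np.abs(X_new - X)) < tol:   # stable state detected: save as initial values
            return X_new
        X = X_new
    return X                                   # fall back to the last state reached
```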
As is apparent from FIG. 27, the present embodiment reduces omission errors in comparison with the results of the first embodiment.
Practical speech recognition systems of higher performance often use a language processing function in addition to the simple speech recognizing function. In such a case, insertion errors can be corrected or cancelled relatively easily by considering language restrictions, but omission errors are difficult to infer and restore from the same language restrictions. Therefore, the reduction of the omission error rate by the present embodiment is important in realizing a speech recognition apparatus of higher performance.
FIG. 23 shows still another embodiment, in which the learning section of FIG. 14 further comprises noise data storage means and noise data overlaying means. The basic learning method is as described in connection with FIG. 14. This embodiment is characterized in that the learning data is overlaid with noise components beforehand. The connection weightings of the units in the neural network are adjusted by the learning control means so that the learning data is recognized as if the noise components had been removed. In other words, the neural network is caused to learn so that the noise components contained in the learning data can definitely be differentiated.
The overlaying of the learning data with the noise components is carried out at a plurality of positions, as shown in FIGS. 24a-24c. In these figures, reference numeral 2401 designates the learning data, and reference numerals 2402 and 2403 designate the noise components. FIG. 24(b) shows an example of the learning data of FIG. 24(a) overlaid with the noise component 2402 at its forward portion, while FIG. 24(c) shows the learning data overlaid with the noise component 2403 at its rearward portion. By causing the neural network to learn such data, overlaid with the noise components, as if the noise components had been removed, the neural network comes to differentiate the noise components definitely.
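A minimal sketch of the noise overlaying step is given below. It assumes the overlay is a simple addition on the learning data array (whether this is done on the waveform or on the feature sequence is not specified in the text quoted here), and the function name and offsets are illustrative.

```python
import numpy as np

def overlay_noise(learning_data, noise, position):
    """Overlay a noise segment onto the learning data at the given offset
    (e.g. the forward or rearward portion, as in FIGS. 24(b) and 24(c))."""
    out = learning_data.astype(float)
    end = min(position + len(noise), len(out))
    out[position:end] += noise[:end - position]
    return out

# Learning is repeated with the noise shifted to different overlaying positions:
# forward  = overlay_noise(word, noise, 0)
# rearward = overlay_noise(word, noise, len(word) - len(noise))
```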
Consequently, the neural network can properly recognize speech data overlaid with non-stationary noise.
The present invention thus provides a speech recognition apparatus and a learning method therefor which are very effective not only in continuous speech recognition but also in discrete speech recognition.
Further, the present invention is effective not only in speech recognition but also in any processing of time series information for which a correspondence between input data and output data can be established. The present invention is considered to be applicable to compression of information, expansion of information, waveform equivalence, and the like.
Claims (5)
1. A method for recognizing speech, comprising:
extracting values of an input to be recognized;
inputting the extracted values into a recurrent neural network;
storing input learning data of a plurality of continuous data streams within a plurality of categories;
selecting input learning data of a plurality of continuous data streams to be learned within a plurality of categories;
storing positive output learning data of a plurality of continuous data streams within a plurality of categories corresponding to an input learning data category;
storing negative output learning data of a plurality of continuous data streams within a plurality of categories corresponding to an input learning data category;
selecting output learning data of a plurality of continuous data streams to be learned, each of which corresponds to an input learning data category;
connecting the selected input learning data into a single continuous data stream;
connecting the selected output learning data into a single continuous data stream in correlation with the connection of said input learning data;
inputting said connected input learning data stream to the extraction step; and
changing weightings at connections of neuron elements on the basis of outputs of said recurrent neural network and said connected output learning data streams.
2. The method for recognizing speech as in claim 1, wherein the number of said input learning data categories is equal to two.
3. The method for recognizing speech as in claim 1, further comprising:
storing noise data;
overlaying said selected input learning data with noise data selected from the stored noise data, the selected input learning data overlaid with the noise data causing said recurrent neural network to learn so that said noise data contained in said input learning data can be differentiated.
4. The method for recognizing speech as in claim 3, further comprising shifting said noise data to different overlaying positions on said selected input learning data to repeat learning.
5. The method for recognizing speech as in claim 3, wherein selected input learning data not overlaid with the noise data and selected portions of the input learning data overlaid with the noise data are used for learning.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US08/485,134 US5787393A (en) | 1992-03-30 | 1995-06-07 | Speech recognition apparatus using neural network, and learning method therefor |
Applications Claiming Priority (14)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP4-73818 | 1992-03-30 | ||
JP7381892 | 1992-03-30 | ||
JP4-87146 | 1992-04-08 | ||
JP8714692 | 1992-04-08 | ||
JP4-88786 | 1992-04-09 | ||
JP8878692 | 1992-04-09 | ||
JP4-159441 | 1992-06-18 | ||
JP15944192 | 1992-06-18 | ||
JP15942292 | 1992-06-18 | ||
JP4-159422 | 1992-06-18 | ||
JP4-161075 | 1992-06-19 | ||
JP16107592 | 1992-06-19 | ||
US15017093A | 1993-11-29 | 1993-11-29 | |
US08/485,134 US5787393A (en) | 1992-03-30 | 1995-06-07 | Speech recognition apparatus using neural network, and learning method therefor |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15017093A Continuation | 1992-03-30 | 1993-11-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
US5787393A true US5787393A (en) | 1998-07-28 |
Family
ID=27565199
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US08/485,134 Expired - Lifetime US5787393A (en) | 1992-03-30 | 1995-06-07 | Speech recognition apparatus using neural network, and learning method therefor |
US08/486,617 Expired - Lifetime US5809461A (en) | 1992-03-30 | 1995-06-07 | Speech recognition apparatus using neural network and learning method therefor |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US08/486,617 Expired - Lifetime US5809461A (en) | 1992-03-30 | 1995-06-07 | Speech recognition apparatus using neural network and learning method therefor |
Country Status (1)
Country | Link |
---|---|
US (2) | US5787393A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6151592A (en) * | 1995-06-07 | 2000-11-21 | Seiko Epson Corporation | Recognition apparatus using neural network, and learning method therefor |
US6175818B1 (en) * | 1996-05-29 | 2001-01-16 | Domain Dynamics Limited | Signal verification using signal processing arrangement for time varying band limited input signal |
US6304865B1 (en) | 1998-10-27 | 2001-10-16 | Dell U.S.A., L.P. | Audio diagnostic system and method using frequency spectrum and neural network |
US20030088412A1 (en) * | 2001-07-24 | 2003-05-08 | Honeywell International Inc. | Pattern recognition using an observable operator model |
US8170873B1 (en) * | 2003-07-23 | 2012-05-01 | Nexidia Inc. | Comparing events in word spotting |
US20170301347A1 (en) * | 2016-04-13 | 2017-10-19 | Malaspina Labs (Barbados), Inc. | Phonotactic-Based Speech Recognition & Re-synthesis |
US9875440B1 (en) | 2010-10-26 | 2018-01-23 | Michael Lamport Commons | Intelligent control with hierarchical stacked neural networks |
US10510000B1 (en) | 2010-10-26 | 2019-12-17 | Michael Lamport Commons | Intelligent control with hierarchical stacked neural networks |
US20200035222A1 (en) * | 2018-07-27 | 2020-01-30 | Deepgram, Inc. | End-to-end neural networks for speech recognition and classification |
US10671908B2 (en) | 2016-11-23 | 2020-06-02 | Microsoft Technology Licensing, Llc | Differential recurrent neural network |
US11449737B2 (en) * | 2016-09-07 | 2022-09-20 | Robert Bosch Gmbh | Model calculation unit and control unit for calculating a multilayer perceptron model with feedforward and feedback |
US20240095447A1 (en) * | 2022-06-22 | 2024-03-21 | Nvidia Corporation | Neural network-based language restriction |
US12124954B1 (en) | 2022-11-28 | 2024-10-22 | Michael Lamport Commons | Intelligent control with hierarchical stacked neural networks |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6947890B1 (en) * | 1999-05-28 | 2005-09-20 | Tetsuro Kitazoe | Acoustic speech recognition method and system using stereo vision neural networks with competition and cooperation |
KR20000058531A (en) * | 2000-06-10 | 2000-10-05 | 김성석 | Toy with a capability of language learning and training using speech synthesis and speech recognition technologies |
EP1217610A1 (en) * | 2000-11-28 | 2002-06-26 | Siemens Aktiengesellschaft | Method and system for multilingual speech recognition |
KR102565274B1 (en) | 2016-07-07 | 2023-08-09 | 삼성전자주식회사 | Automatic interpretation method and apparatus, and machine translation method and apparatus |
KR102424514B1 (en) | 2017-12-04 | 2022-07-25 | 삼성전자주식회사 | Method and apparatus for processing language input |
1995
- 1995-06-07 US US08/485,134 patent/US5787393A/en not_active Expired - Lifetime
- 1995-06-07 US US08/486,617 patent/US5809461A/en not_active Expired - Lifetime
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0281160A1 (en) * | 1987-03-05 | 1988-09-07 | Canon Kabushiki Kaisha | Liquid crystal apparatus |
US5175794A (en) * | 1987-08-28 | 1992-12-29 | British Telecommunications Public Limited Company | Pattern recognition of temporally sequenced signal vectors |
EP0318858A2 (en) * | 1987-11-25 | 1989-06-07 | Nec Corporation | Connected word recognition system including neural networks arranged along a signal time axis |
JPH0272398A (en) * | 1988-09-07 | 1990-03-12 | Hitachi Ltd | Preprocessor for speech signal |
US5040215A (en) * | 1988-09-07 | 1991-08-13 | Hitachi, Ltd. | Speech recognition apparatus using neural network and fuzzy logic |
JPH0281160A (en) * | 1988-09-17 | 1990-03-22 | Sony Corp | Signal processor |
US5093899A (en) * | 1988-09-17 | 1992-03-03 | Sony Corporation | Neural network with normalized learning constant for high-speed stable learning |
US5185848A (en) * | 1988-12-14 | 1993-02-09 | Hitachi, Ltd. | Noise reduction system using neural network |
US5297237A (en) * | 1989-02-20 | 1994-03-22 | Fujitsu Limited | Learning system for a data processing apparatus |
US5119469A (en) * | 1989-05-17 | 1992-06-02 | United States Of America | Neural network with weight adjustment based on prior history of input signals |
US5150323A (en) * | 1989-08-11 | 1992-09-22 | Hughes Aircraft Company | Adaptive network for in-band signal separation |
JPH04295894A (en) * | 1991-03-26 | 1992-10-20 | Sanyo Electric Co Ltd | Voice recognition method by neural network model |
JPH04295897A (en) * | 1991-03-26 | 1992-10-20 | Sanyo Electric Co Ltd | Voice recognizing method by neural network model |
EP0510632A2 (en) * | 1991-04-24 | 1992-10-28 | Nec Corporation | Speech recognition by neural network adapted to reference pattern learning |
Non-Patent Citations (15)
Title |
---|
Fabio Greco et al., "A Recurrent Time-Delay Neural Network for Improved Phoneme Recognition", ICASSP, 1991, vol. 1, 14-17 May 1991, Toronto, Canada, pp. 81-84. |
H.-U. Bauer et al., "Nonlinear dynamics of feedback multilayer perceptrons", Physical Review A, vol. 42, No. 4, 15 Aug. 1990, US, pp. 2401-2408. |
Iooss, "From Lattices of Phoneme to Sentences: A Recurrent Neural Network Approach", International Joint Conference on Neural Nets, Jul. 8-14, 1991, vol. 2, pp. 883-838. |
John Hertz, "Introduction to the Theory of Neural Computation", Santa Fe Institute Studies in the Sciences of Complexity, Lecture Notes vol. I, Addison-Wesley, 1991. |
K. P. Li et al., "A Whole Word Recurrent Neural Network for Keyword Spotting", ICASSP, Mar. 1992, pp. II-81-II-84. |
Ken-ichi Funahashi, "On the Recurrent Neural Networks", Technical Research Report by IEICE, SP92-80, Oct. 21, 1992, pp. 51-58. |
Mitsuhiro Inazumi et al., "Connected Word Recognition by Recurrent Neural Networks", Technical Report of IEICE, SP92-25, Jun. 30, 1992, pp. 9-16. |
Mitsuhiro Inazumi et al., "Continuous Spoken Digit Recognition by Recurrent Neural Networks", Technical Report of IEICE, SP92-125, Jan. 19, 1993, pp. 17-24. |
N. Z. Hakim et al., "Cursive Script Online Character Recognition with a Recurrent Neural Network Model", IJCNN, Jun. 1992, pp. III-711-III-716. |
Patent Abstracts of Japan, vol. 16, No. 75 (p. 1316), 24 Feb. 1992, and JP-A-03 265077 (Shigeki), 26 Nov. 1991 (Abstract). |
Robinson, "A Real-Time Recurrent Error Propagation Network Word Recognition System", ICASSP-92, vol. 1, Mar. 23-26, 1992, pp. 617-620. |
Tatsumi Watanabe et al., "Study of Learning Methods and Shape of the Learning Surface for Recurrent Neural Networks", Theses by IEICE, vol. J74D-II, No. 12, Dec. 25, 1991, pp. 1776-1787. |
Thomas M. English et al., "Back-Propagation Training of a Neural Network for Word Spotting", ICASSP, vol. 2, 23-26 Mar. 1992, San Francisco, CA, US, pp. 357-360. |
Yohji Fukuda et al., "Phoneme Recognition using Recurrent Neural Networks", Technical Research Report by IEICE, NC91-10, May 8, 1991, pp. 71-78. |
Zuqiang Zhao, "Connectionist Training of Non-Linear Hidden Markov Models for Speech Recognition", IJCNN, 1991, vol. 2, 18-21 Nov. 1991, Singapore, SG, pp. 1647-1652. |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6151592A (en) * | 1995-06-07 | 2000-11-21 | Seiko Epson Corporation | Recognition apparatus using neural network, and learning method therefor |
US6175818B1 (en) * | 1996-05-29 | 2001-01-16 | Domain Dynamics Limited | Signal verification using signal processing arrangement for time varying band limited input signal |
US6304865B1 (en) | 1998-10-27 | 2001-10-16 | Dell U.S.A., L.P. | Audio diagnostic system and method using frequency spectrum and neural network |
US20030088412A1 (en) * | 2001-07-24 | 2003-05-08 | Honeywell International Inc. | Pattern recognition using an observable operator model |
US6845357B2 (en) | 2001-07-24 | 2005-01-18 | Honeywell International Inc. | Pattern recognition using an observable operator model |
US8170873B1 (en) * | 2003-07-23 | 2012-05-01 | Nexidia Inc. | Comparing events in word spotting |
US9875440B1 (en) | 2010-10-26 | 2018-01-23 | Michael Lamport Commons | Intelligent control with hierarchical stacked neural networks |
US10510000B1 (en) | 2010-10-26 | 2019-12-17 | Michael Lamport Commons | Intelligent control with hierarchical stacked neural networks |
US11514305B1 (en) | 2010-10-26 | 2022-11-29 | Michael Lamport Commons | Intelligent control with hierarchical stacked neural networks |
US20170301347A1 (en) * | 2016-04-13 | 2017-10-19 | Malaspina Labs (Barbados), Inc. | Phonotactic-Based Speech Recognition & Re-synthesis |
US10297247B2 (en) * | 2016-04-13 | 2019-05-21 | Malaspina Labs (Barbados), Inc. | Phonotactic-based speech recognition and re-synthesis |
US11449737B2 (en) * | 2016-09-07 | 2022-09-20 | Robert Bosch Gmbh | Model calculation unit and control unit for calculating a multilayer perceptron model with feedforward and feedback |
US10671908B2 (en) | 2016-11-23 | 2020-06-02 | Microsoft Technology Licensing, Llc | Differential recurrent neural network |
US10720151B2 (en) * | 2018-07-27 | 2020-07-21 | Deepgram, Inc. | End-to-end neural networks for speech recognition and classification |
US10847138B2 (en) | 2018-07-27 | 2020-11-24 | Deepgram, Inc. | Deep learning internal state index-based search and classification |
US11367433B2 (en) | 2018-07-27 | 2022-06-21 | Deepgram, Inc. | End-to-end neural networks for speech recognition and classification |
US20200035222A1 (en) * | 2018-07-27 | 2020-01-30 | Deepgram, Inc. | End-to-end neural networks for speech recognition and classification |
US11676579B2 (en) | 2018-07-27 | 2023-06-13 | Deepgram, Inc. | Deep learning internal state index-based search and classification |
US20240095447A1 (en) * | 2022-06-22 | 2024-03-21 | Nvidia Corporation | Neural network-based language restriction |
US12124954B1 (en) | 2022-11-28 | 2024-10-22 | Michael Lamport Commons | Intelligent control with hierarchical stacked neural networks |
Also Published As
Publication number | Publication date |
---|---|
US5809461A (en) | 1998-09-15 |
Similar Documents
Publication | Title |
---|---|
US5787393A (en) | Speech recognition apparatus using neural network, and learning method therefor |
US5404422A (en) | Speech recognition system with neural network |
US6151592A (en) | Recognition apparatus using neural network, and learning method therefor |
US5526466A (en) | Speech recognition apparatus |
US5150449A (en) | Speech recognition apparatus of speaker adaptation type |
EP0586714B1 (en) | Speech recognition apparatus using neural network, and learning method therefor |
US6041299A (en) | Apparatus for calculating a posterior probability of phoneme symbol, and speech recognition apparatus |
US8838446B2 (en) | Method and apparatus of transforming speech feature vectors using an auto-associative neural network |
US5185848A (en) | Noise reduction system using neural network |
US4811399A (en) | Apparatus and method for automatic speech recognition |
JP3037864B2 (en) | Audio coding apparatus and method |
CA2066952C (en) | Speech recognition by neural network adapted to reference pattern learning |
JP3168779B2 (en) | Speech recognition device and method |
EP0705473A1 (en) | Speech recognition method using a two-pass search |
JP2000099080A (en) | Voice recognizing method using evaluation of reliability scale |
EP0453649A2 (en) | Method and apparatus for modeling words with composite Markov models |
US5860062A (en) | Speech recognition apparatus and speech recognition method |
Sunny et al. | Feature extraction methods based on linear predictive coding and wavelet packet decomposition for recognizing spoken words in Malayalam |
JPH01204099A (en) | Speech recognition device |
JPH06266386A (en) | Word spotting method |
Wang et al. | Speaker verification and identification using gamma neural networks |
Selouani et al. | A hybrid learning vector quantization/time-delay neural networks system for the recognition of Arabic speech |
KR100211113B1 (en) | Learning method and speech recognition using chaotic recurrent neural networks |
JP2000352994A (en) | Nerve cell element, recognition using neural network, and its learning method |
Yip et al. | Optimal root cepstral analysis for speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| FEPP | Fee payment procedure | Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| FPAY | Fee payment | Year of fee payment: 4 |
| FPAY | Fee payment | Year of fee payment: 8 |
| FPAY | Fee payment | Year of fee payment: 12 |