WO2020017226A1 - Noise-tolerant voice recognition device and method, and computer program - Google Patents

Noise-tolerant voice recognition device and method, and computer program Download PDF

Info

Publication number
WO2020017226A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
speech
voice
speech recognition
feature
Prior art date
Application number
PCT/JP2019/024279
Other languages
French (fr)
Japanese (ja)
Inventor
Masakiyo Fujimoto
Hisashi Kawai
Original Assignee
National Institute of Information and Communications Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Institute of Information and Communications Technology
Publication of WO2020017226A1 publication Critical patent/WO2020017226A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Definitions

  • the present invention relates to speech recognition, and more particularly, to a noise-tolerant speech recognition apparatus and method, and a computer program that enable highly accurate speech recognition even for speech collected by a single microphone.
  • Conventionally, two broad approaches have been used for noise-tolerant speech recognition: speech enhancement (noise removal) and noise-added training.
  • Speech enhancement is a technology that improves the accuracy of speech recognition by removing noise from the speech signal to be recognized. Typically, speech recognition is performed after speech enhancement has been applied to the speech signal from the microphone.
  • Conventional speech enhancement techniques include the spectral subtraction method described in Non-Patent Document 1, the MMSE-STSA (minimum mean square error short-time spectral amplitude estimator) estimation method described in Non-Patent Document 2, a method using a Vector Taylor series described in Non-Patent Document 3, and the denoising autoencoder described in Non-Patent Document 4.
  • All of these methods perform speech enhancement as preprocessing for speech recognition on an acoustic signal obtained from a single microphone.
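  • As an illustration of single-channel enhancement used as preprocessing, the following is a minimal NumPy sketch of magnitude spectral subtraction; the frame length, overlap, spectral floor, and noise-estimation strategy are assumptions for illustration, not the parameters of Non-Patent Document 1.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=512, hop=256, noise_frames=10, floor=0.02):
    """Single-channel magnitude spectral subtraction (illustrative parameters)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Estimate the noise magnitude from the first few frames (assumed to be non-speech).
    noise_mag = mag[:noise_frames].mean(axis=0)

    # Subtract the noise estimate and apply a spectral floor to limit musical noise.
    enhanced_mag = np.maximum(mag - noise_mag, floor * noise_mag)

    # Resynthesize with the noisy phase by overlap-add (no gain normalization).
    enhanced = np.fft.irfft(enhanced_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i, f in enumerate(enhanced):
        out[i * hop:i * hop + frame_len] += f * window
    return out
```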
  • FIG. 1 shows a schematic configuration of a conventional speech recognition apparatus 100.
  • Referring to FIG. 1, the speech recognition apparatus 100 includes a speech enhancement unit 114 that receives a speech signal 112, which is noise-superimposed speech represented by a waveform 110 and output from a microphone (not shown), applies speech enhancement by any of the above-described methods, and outputs an enhanced speech signal 116; a feature extraction unit 118 that extracts a predetermined feature amount from the enhanced speech signal 116; and a speech recognition unit 120 that performs speech recognition on the feature amount and outputs text 122 corresponding to the speech represented by the waveform 110.
  • As the speech recognition unit 120, for example, the one disclosed in Patent Document 1 can be used.
  • the speech recognition apparatus 100 further includes an acoustic model 124, a pronunciation dictionary 126, and a language model 128 used when the speech recognition unit 120 performs speech recognition.
  • the acoustic model 124 is for estimating a corresponding phoneme based on the feature amount input from the feature extraction unit 118.
  • the pronunciation dictionary 126 is used to obtain a word corresponding to the phoneme sequence estimated by the acoustic model 124.
  • the language model 128 is used when calculating the probability of each of the utterance sentence candidates of the recognition result formed by the word string estimated using the pronunciation dictionary 126.
  • FIG. 2 shows a schematic configuration of the acoustic model 124.
  • the acoustic model 124 is formed of a so-called deep neural network and includes an input layer 150 that receives a feature amount, an output layer 162 that outputs information specifying the phoneme estimated from that feature amount, and a plurality of hidden layers 152, 154, 156, 158, and 160 provided in order between the input layer 150 and the output layer 162. Since the configuration and training method of the acoustic model 124 are well known, details are not repeated here. Clean speech containing no noise is used for training the acoustic model 124.
  • the information for specifying the estimated phoneme may be, for example, a form of a probability vector for each element of a set of phonemes.
  • outputting information for specifying a phoneme is simply referred to as “outputting a phoneme”.
  • Noise-added training, on the other hand, is a method that improves speech recognition accuracy for noisy speech by training a deep-neural-network acoustic model on speech signals containing noise. In this case, no preprocessing is applied to the speech signal, but the target of speech recognition is still a single speech signal.
  • In recent years, multi-channel speech enhancement using microphones of multiple channels (a microphone array) has been widely used as preprocessing for speech recognition, instead of speech enhancement of the signal obtained from a single-channel microphone.
  • a good example is a smart speaker. Smart speakers have been developed and sold by various companies, and are rapidly spreading especially in the United States and the like.
  • By using a microphone array, noise can be removed using spatial information about the sound source as well, so speech enhancement can be performed with high accuracy and low distortion.
  • When only a single microphone is available, as on a smartphone, one of the above-described single-channel speech enhancement processes is applied instead, but in that case a large increase in speech distortion is observed, and speech recognition accuracy deteriorates significantly.
  • an object of the present invention is to provide an acoustic model and a speech recognition device capable of improving the accuracy of speech recognition even when only a single-channel speech signal is available, and a computer program therefor.
  • A noise-tolerant speech recognition device according to a first aspect of the present invention includes a speech enhancement circuit that receives, as input, an acoustic signal in which a noise signal is superimposed on a target speech signal and outputs an enhanced speech signal in which the speech signal is emphasized, and a speech recognition unit that receives the enhanced speech signal and the acoustic signal and converts the utterance content of the speech signal into text.
  • Preferably, the speech enhancement circuit includes a first speech enhancement unit that performs a first type of speech enhancement processing on the acoustic signal and outputs a first enhanced speech signal, and a second speech enhancement unit that performs a second type of speech enhancement processing, different from the first, on the acoustic signal and outputs a second enhanced speech signal. The speech recognition unit receives the first and second enhanced speech signals and the acoustic signal and converts the utterance content of the speech signal into text.
  • More preferably, the speech recognition unit includes first feature extraction means for extracting a first feature amount from the acoustic signal, second feature extraction means for extracting a second feature amount from the enhanced speech signal, feature selection means for selecting or discarding each element of the second feature amount according to the first feature amount and the second feature amount, and speech recognition means for converting the utterance content of the speech signal into text using the second feature amount selected by the feature selection means.
  • Preferably, the noise-tolerant speech recognition apparatus further includes acoustic model storage means for storing an acoustic model used for speech recognition by the speech recognition means. The acoustic model is a deep neural network having a plurality of hidden layers and includes a first sub-network that receives the first feature amount as input, a second sub-network that receives the second feature amount as input, and a third sub-network that receives the output of the first sub-network and the output of the second sub-network and outputs the phoneme estimated from the first and second feature amounts.
  • A noise-tolerant speech recognition method includes a speech enhancement step in which a computer receives, as input, a single-channel acoustic signal in which a noise signal is superimposed on a target speech signal and outputs an enhanced speech signal in which the speech signal is emphasized, and a speech recognition step in which the computer receives the enhanced speech signal and the acoustic signal and converts the utterance content of the speech signal into text.
  • A computer program causes a computer to function as any of the noise-tolerant speech recognition devices described above.
  • FIG. 1 is a block diagram showing a schematic configuration of a speech recognition device that performs preprocessing on a single-channel speech signal by a conventional speech enhancement method to perform speech recognition.
  • FIG. 2 is a block diagram showing a configuration of an acoustic model based on a deep neural network used in the speech recognition device shown in FIG.
  • FIG. 3 is a block diagram showing a schematic configuration of the speech recognition device according to the first embodiment of the present invention.
  • FIG. 4 is a schematic block diagram showing a configuration of an acoustic model used in the speech recognition device shown in FIG.
  • FIG. 5 is a block diagram illustrating a configuration of an acoustic model used in the speech recognition device according to the second embodiment of the present invention.
  • FIG. 6 is a block diagram showing a schematic configuration of the speech recognition device according to the third embodiment of the present invention.
  • FIG. 7 is a block diagram showing a configuration of an acoustic model used in the speech recognition device shown in FIG.
  • FIG. 8 is a block diagram illustrating a schematic configuration of an acoustic model used in the speech recognition device according to the fourth embodiment of the present invention.
  • FIG. 9 is a block diagram illustrating a schematic configuration of an acoustic model used in the speech recognition device according to the fifth embodiment of the present invention.
  • FIG. 10 is a block diagram illustrating a schematic configuration of an acoustic model used in the speech recognition device according to the sixth embodiment of the present invention.
  • FIG. 11 is a block diagram illustrating a schematic configuration of an acoustic model used in the speech recognition device according to the seventh embodiment of the present invention.
  • FIG. 12 is a block diagram showing a schematic configuration of an acoustic model used in the speech recognition device according to the eighth embodiment of the present invention.
  • FIG. 13 is a diagram for explaining the function of the gate layer included in the acoustic model according to the fifth to eighth embodiments of the present invention.
  • FIG. 14 is a diagram showing, in a table form, a comparison between the word error rates of the conventional art and the speech recognition apparatuses according to the first to eighth embodiments of the present invention.
  • FIG. 15 is a hardware block diagram of a typical computer for realizing the speech recognition device according to the present invention.
  • FIG. 3 is a block diagram illustrating a schematic configuration of the speech recognition device 180 according to the first embodiment of the present invention.
  • Referring to FIG. 3, the speech recognition apparatus 180 includes a speech enhancement unit 202 that performs an existing speech enhancement process on the speech signal 112, which is noise-superimposed speech, and outputs an enhanced speech signal 203; an extended feature extraction unit 200 that receives both the speech signal 112 and the enhanced speech signal 203 as inputs and extracts feature amounts 210 and 212; and a speech recognition unit 204 that receives the feature amounts 210 and 212 output by the extended feature extraction unit 200, performs speech recognition, and outputs recognized text 208.
  • The speech signal 112 is an acoustic signal output from a microphone, in which a noise signal is superimposed on the signal representing the speech shown by the waveform 110.
  • As the speech recognition unit 204, the same unit as the speech recognition unit 120 shown in FIG. 1 can be used; however, the feature amounts used differ from those of the related art, as described later.
  • The speech recognition device 180 further includes an acoustic model 206, whose configuration differs from the conventional one shown in FIG. 2, and a pronunciation dictionary 126 and a language model 128 identical to those shown in FIG. 1.
  • These acoustic model 206, pronunciation dictionary 126, and language model 128 are all stored in a storage device such as a hard disk described later.
  • The extended feature extraction unit 200 includes a feature extraction unit 118, similar to the one illustrated in FIG. 1, that receives the noise-superimposed speech signal 112 as input and outputs a feature amount 210, and a feature extraction unit 220 that has the same function as the feature extraction unit 118 and extracts a feature amount 212 from the enhanced speech signal 203 output by the speech enhancement unit 202.
  • the feature extractor 118 and the feature extractor 220 have the same configuration, and the feature 210 and the feature 212 have the same meaning.
  • the values of the feature quantities 210 and 212 are different from each other because the two inputs are different.
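  • As a concrete illustration (the patent only speaks of a "predetermined feature amount", so the feature type, sampling rate, and frame settings below are assumptions), a common choice for deep-neural-network acoustic models is log-mel filterbank features, extracted identically from both signals:

```python
import numpy as np
import librosa  # assumed here purely for illustration

def extract_features(signal, sr=16000, n_mels=40):
    """Frame-level log-mel filterbank features (an assumed feature type)."""
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=512,
                                         hop_length=160, n_mels=n_mels)
    return np.log(mel + 1e-10).T  # shape: (n_frames, n_mels)

# The same extractor is applied to both inputs, so the feature amounts have the
# same meaning and dimensionality but generally different values:
# feat_210 = extract_features(noisy_signal)     # from the speech signal 112
# feat_212 = extract_features(enhanced_signal)  # from the enhanced speech signal 203
```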
  • Referring to FIG. 4, the acoustic model 206 shown in FIG. 3 includes an input layer 240 that receives both the feature amount 210 obtained from the noise-superimposed speech and the feature amount 212 obtained from the enhanced speech signal 203, an output layer 256 that outputs estimated phonemes, and a plurality of hidden layers 242 to 254 provided in order between the input layer 240 and the output layer 256. In the present embodiment, the number of hidden layers is seven.
  • the input layer 240 shown in FIG. 4 receives as many inputs as the sum of the number of elements of the feature quantities 210 and 212, both of which are vectors.
  • The feature extraction units 118 and 220 that output these feature amounts 210 and 212 have, in the present embodiment, the same configuration as the conventional feature extraction unit 118 shown in FIG. 1. Therefore, the number of feature values received by the acoustic model 206 is twice that of the conventional model shown in FIG. 2; half of them are features obtained from the noise-superimposed speech, and the other half are features obtained from the enhanced speech.
  • The operation of the speech recognition unit 204 is the same as that of the speech recognition device 100 shown in FIG. 1, except that the acoustic model 206 is used instead of the acoustic model 124 and that the acoustic features to be processed include those of the noise-superimposed speech in addition to those from the enhanced speech. Therefore, a detailed description is not repeated here.
  • By employing the acoustic model 206 having such a configuration, the speech recognition device 180 according to the present embodiment was able to perform speech recognition with higher accuracy than the conventional device shown in FIG. 1, as described later with reference to FIG. 14.
  • the learning of the acoustic model 206 can be performed by an error back propagation method similar to that of a normal deep neural network by preparing training data including a noise-superimposed voice and a text represented by the voice in advance. The same applies to learning in each of the embodiments described below.
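  • As a minimal sketch of such an acoustic model (PyTorch is an assumed framework, and the class name, layer sizes, activation functions, and number of phoneme classes are illustrative, not taken from the patent), the first-embodiment model can be viewed as a feed-forward network whose input is the concatenation of the noisy-speech feature amount 210 and the enhanced-speech feature amount 212:

```python
import torch
import torch.nn as nn

class AcousticModel206(nn.Module):
    """First embodiment: one input layer fed with the concatenated noisy and
    enhanced features, seven hidden layers, and an output layer giving
    phoneme posteriors."""
    def __init__(self, feat_dim=40, hidden_dim=1024, n_hidden=7, n_phonemes=43):
        super().__init__()
        layers, in_dim = [], 2 * feat_dim  # feature amounts 210 and 212 concatenated
        for _ in range(n_hidden):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        self.hidden = nn.Sequential(*layers)             # hidden layers 242 to 254
        self.output = nn.Linear(hidden_dim, n_phonemes)  # output layer 256

    def forward(self, noisy_feat, enhanced_feat):
        x = torch.cat([noisy_feat, enhanced_feat], dim=-1)  # input layer 240
        return self.output(self.hidden(x))                  # phoneme logits
```

Such a network would be trained by back-propagation with a frame-level cross-entropy loss against phoneme labels obtained from the noise-superimposed training speech and its transcript, in line with the training procedure described above.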
  • FIG. 5 shows a configuration of an acoustic model 280 according to the second embodiment of the present invention.
  • The speech recognition apparatus according to the second embodiment is the same as the speech recognition apparatus 180 according to the first embodiment, except that the acoustic model 280 shown in FIG. 5 is used instead of the acoustic model 206 shown in FIG. 4.
  • The acoustic model 280 includes a sub-network 300 for noise-superimposed speech that receives the feature amount 210 of the noise-superimposed speech, a sub-network 302 for enhanced speech that receives the feature amount 212 of the enhanced speech, an output-side sub-network 304 that receives the output of the sub-network 300 for noise-superimposed speech and the output of the sub-network 302 for enhanced speech, and an output layer 306 that receives the output of the output-side sub-network 304 and outputs phonemes.
  • The sub-network 300 for noise-superimposed speech includes an input layer 320 connected so as to receive the feature amount 210 of the noise-superimposed speech, and a plurality of hidden layers 322, 324, and 326 (three in this embodiment) connected in order between the input layer 320 and the input of the output-side sub-network 304.
  • The sub-network 302 for enhanced speech includes an input layer 330 connected so as to receive the feature amount 212 of the enhanced speech, and a plurality of hidden layers 332, 334, and 336 (three in this embodiment) connected in order between the input layer 330 and the input of the output-side sub-network 304.
  • The output-side sub-network 304 includes a hidden layer 350 connected so as to receive the outputs of the sub-network 300 for noise-superimposed speech and the sub-network 302 for enhanced speech, and hidden layers 352, 354, and 356 connected in sequence between the hidden layer 350 and the output layer 306.
  • The acoustic model 280 shown in FIG. 5 differs from the acoustic model 206 of the first embodiment in the following point. In the acoustic model 206, the input layer 240 receives both the feature amount 210 of the noise-superimposed speech and the feature amount 212 of the enhanced speech, and information from both is propagated to all of the subsequent hidden layers 242 to 254. In the acoustic model 280, on the other hand, only the information from the feature amount 210 of the noise-superimposed speech propagates through the input layer 320 and the hidden layers 322 to 326 constituting the sub-network 300 for noise-superimposed speech, and likewise only the information from the feature amount 212 of the enhanced speech propagates through the sub-network 302 for enhanced speech; the two are integrated only in the output-side sub-network 304.
  • the configuration of the speech recognition device employing the acoustic model 280 is the same as that of the speech recognition device 180 of the first embodiment.
  • The speech recognition device using the acoustic model 280 according to the second embodiment also achieved higher accuracy than the prior art, as shown in FIG. 14.
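  • A minimal sketch of this two-branch structure follows (again assuming PyTorch; the helper name branch, the layer widths, and the phoneme count are illustrative assumptions):

```python
import torch
import torch.nn as nn

def branch(in_dim, hidden_dim, n_layers=3):
    """An input layer followed by a stack of hidden layers (one sub-network)."""
    layers, d = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(d, hidden_dim), nn.ReLU()]
        d = hidden_dim
    return nn.Sequential(*layers)

class AcousticModel280(nn.Module):
    """Second embodiment: separate sub-networks for the noisy and enhanced
    features, merged by an output-side sub-network."""
    def __init__(self, feat_dim=40, hidden_dim=1024, n_phonemes=43):
        super().__init__()
        self.noisy_branch = branch(feat_dim, hidden_dim)      # sub-network 300
        self.enhanced_branch = branch(feat_dim, hidden_dim)   # sub-network 302
        self.merge = branch(2 * hidden_dim, hidden_dim, 4)    # output-side sub-network 304
        self.output = nn.Linear(hidden_dim, n_phonemes)       # output layer 306

    def forward(self, noisy_feat, enhanced_feat):
        h = torch.cat([self.noisy_branch(noisy_feat),
                       self.enhanced_branch(enhanced_feat)], dim=-1)
        return self.output(self.merge(h))
```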
  • FIG. 6 shows a block diagram of a voice recognition device 380 according to the third embodiment of the present invention.
  • Referring to FIG. 6, the speech recognition device 380 includes speech enhancement units 202, 392, 394, and 396 that apply existing first to fourth speech enhancement processes to the speech signal 112 output from the microphone for the speech represented by the waveform 110 and output enhanced speech signals 203, 393, 395, and 397, respectively; an extended feature extraction unit 390 that receives the speech signal 112 and the enhanced speech signals 203, 393, 395, and 397 as inputs and extracts feature amounts 210, 212, 430, 432, and 434; and a speech recognition unit 402 that receives the feature amounts 210, 212, 430, 432, and 434 output by the extended feature extraction unit 390 as input, performs speech recognition, and outputs recognized text 400.
  • The speech recognition device 380 further includes an acoustic model 398 used by the speech recognition unit 402 for speech recognition, and the same pronunciation dictionary 126 and language model 128 as those shown in FIG. 1.
  • The extended feature extraction unit 390 includes a feature extraction unit 118 that receives the noise-superimposed speech signal 112 and extracts the feature amount 210, a feature extraction unit 220 that receives the enhanced speech signal 203 from the speech enhancement unit 202 and outputs the feature amount 212 of the first enhanced speech, a feature extraction unit 410 that receives the enhanced speech signal 393 from the speech enhancement unit 392 and outputs the feature amount 430 of the second enhanced speech, a feature extraction unit 412 that receives the enhanced speech signal 395 from the speech enhancement unit 394 and outputs the feature amount 432 of the third enhanced speech, and a feature extraction unit 414 that receives the enhanced speech signal 397 from the speech enhancement unit 396 and outputs the feature amount 434 of the fourth enhanced speech.
  • the voice enhancement unit 202 performs voice enhancement by the method disclosed in Non-Patent Document 1.
  • the voice enhancement unit 392 performs voice enhancement by the method disclosed in Non-Patent Document 2.
  • the voice enhancement unit 394 performs voice enhancement by the method disclosed in Non-Patent Document 3.
  • the voice enhancement unit 396 performs voice enhancement by the method disclosed in Non-Patent Document 4.
  • FIG. 7 is a block diagram showing the configuration of the deep neural network forming the acoustic model 398.
  • the acoustic model 398 is obtained by expanding the acoustic model 206 according to the first embodiment shown in FIG. 4 so as to use feature amounts extracted from four emphasized voices.
  • The acoustic model 398 includes an input layer 450 that receives the feature amount 210 of the noise-superimposed speech, the feature amount 212 of the first enhanced speech, the feature amount 430 of the second enhanced speech, the feature amount 432 of the third enhanced speech, and the feature amount 434 of the fourth enhanced speech; an output layer 454 that outputs the phonemes estimated by the acoustic model 398; and an intermediate layer 452 composed of a plurality of hidden layers connected between the input layer 450 and the output layer 454.
  • The intermediate layer 452 includes a hidden layer 470 whose input is connected to the output of the input layer 450, and hidden layers 472, 474, 476, 478, 480, and 482, each having its input connected to the output of the preceding layer. The output of the hidden layer 482 is connected to the input of the output layer 454.
  • The speech recognition device 380 according to the third embodiment is thus obtained by expanding the speech recognition device 180 according to the first embodiment so as to use four speech enhancements.
  • the operation is basically the same as that of the first embodiment.
  • the accuracy of voice recognition can be increased as compared with the related art.
  • the feature amount 210 of the noise-superimposed voice and the feature amounts 212, 430, 432, and 434 of the first to fourth emphasized voices are all input to the input layer 450. This information is propagated to all the constituent hidden layers.
  • the present invention is not limited to such an embodiment.
  • The speech recognition apparatus according to the fourth embodiment has basically the same configuration as the speech recognition apparatus 380 shown in FIG. 6; the difference is that an acoustic model 500 having the configuration shown in FIG. 8 is used instead of the acoustic model 398 used by the speech recognition device 380.
  • Referring to FIG. 8, the acoustic model 500 includes a first sub-network 540 that receives the feature amount 210 of the noise-superimposed speech signal 112, a second sub-network 542 that receives the feature amount 212 of the first enhanced speech, a third sub-network 544 that receives the feature amount 430 of the second enhanced speech, a fourth sub-network 546 that receives the feature amount 432 of the third enhanced speech, a fifth sub-network 548 that receives the feature amount 434 of the fourth enhanced speech, an intermediate sub-network 550 that receives the outputs of the first sub-network 540, the second sub-network 542, the third sub-network 544, the fourth sub-network 546, and the fifth sub-network 548, and an output layer 552 that receives the output of the intermediate sub-network 550 and outputs phonemes.
  • The first sub-network 540 includes an input layer 570 that receives the feature amount 210 of the noise-superimposed speech, and hidden layers 572, 574, and 576 connected in sequence between the input layer 570 and the input of the intermediate sub-network 550.
  • The second sub-network 542 includes an input layer 580 that receives the feature amount 212 of the first enhanced speech, and hidden layers 582, 584, and 586 connected in sequence between the input layer 580 and the input of the intermediate sub-network 550.
  • The third sub-network 544 includes an input layer 590 that receives the feature amount 430 of the second enhanced speech, and hidden layers 592, 594, and 596 connected in sequence between the input layer 590 and the input of the intermediate sub-network 550.
  • The fourth sub-network 546 includes an input layer 600 that receives the feature amount 432 of the third enhanced speech, and hidden layers 602, 604, and 606 connected in sequence between the input layer 600 and the input of the intermediate sub-network 550.
  • The fifth sub-network 548 includes an input layer 610 that receives the feature amount 434 of the fourth enhanced speech, and hidden layers 612, 614, and 616 connected in sequence between the input layer 610 and the input of the intermediate sub-network 550.
  • The intermediate sub-network 550 includes a hidden layer 620 connected so as to receive the outputs of the first to fifth sub-networks 540, 542, 544, 546, and 548, and hidden layers 622, 624, and 626 connected in order between the hidden layer 620 and the output layer 552.
  • The configuration of the speech recognition apparatus according to this embodiment is otherwise the same as that shown in FIG. 6, except that the acoustic model 500 shown in FIG. 8 is used instead of the acoustic model 398.
  • In the acoustic model 398 of the third embodiment, information from the feature amount 210 of the noise-superimposed speech and the feature amounts 212, 430, 432, and 434 of the first to fourth enhanced speech is propagated to all the hidden layers. In the acoustic model 500, by contrast, the feature amount 210 of the noise-superimposed speech propagates only inside the first sub-network 540 before being input to the hidden layer 620, and the feature amounts 212, 430, 432, and 434 of the first to fourth enhanced speech propagate only through the second to fifth sub-networks 542, 544, 546, and 548, respectively, before being input to the hidden layer 620. Inside the intermediate sub-network 550 starting from the hidden layer 620, all the feature amounts are integrated and propagated through the hidden layers in order, and the phoneme estimation result is finally output from the output layer 552.
  • the speech recognition device using the acoustic model 500 according to the fourth embodiment was also able to perform speech recognition with higher accuracy than the conventional speech recognition device.
  • FIG. 9 shows a schematic configuration of an acoustic model 650 used in the speech recognition device according to the fifth embodiment. As can be seen from FIG. 9, this acoustic model 650 also consists of a deep neural network.
  • The acoustic model 650 shown in FIG. 9 differs from the acoustic model 206 shown in FIG. 4 in that a gate layer 682 is provided before the input layer 240, which receives both the feature amount 210 of the noise-superimposed speech and the feature amount 212 of the first enhanced speech. The gate layer 682 receives the feature amount 212 of the first enhanced speech, multiplies it element by element by a weight in the interval [0, 1], and inputs the result to the input layer 240. Thereafter, as in the case shown in FIG. 4, the information from these feature amounts is propagated in common from the hidden layer 242 to the output layer 256.
  • The gate weight g applied to an M-dimensional input vector x is computed as g = σ(Wx + b), and the output of the gate layer is the element-wise product g ⊙ x, where W is an M × M-dimensional weight matrix, b is an M-dimensional bias vector, and σ(·) is an arbitrary activation function whose range is the interval [0, 1]. Each element of the gate weight g is therefore a value in the interval [0, 1], as described above.
  • Each element of the weight matrix W and of the bias vector b is a learning target. These elements can be trained in the same manner as those of an ordinary deep neural network, except that the above-described constraint on the interval is obeyed.
  • Each layer called a gate layer below has the same function as the gate layer 1100 in FIG. 13, and all of its parameters can be learned in the same manner as the other parameters under the [0, 1] interval constraint described above.
  • this gate layer separately gates each element of the input vector. Therefore, it is possible to perform a gating process for each feature amount of the emphasized speech as to whether or not to use the feature during speech recognition.
  • The selection of each element of the input vector composed of the feature amounts is made according to the weight computed for that element. This selection is determined by the weight matrix W, the bias vector b, and the value of each element of the input vector; that is, each element is selected or suppressed according to the value of the input feature amount and then used for speech recognition.
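  • A minimal sketch of such a gate layer follows (PyTorch and the choice of the sigmoid as the [0, 1]-valued activation σ are assumptions for illustration):

```python
import torch
import torch.nn as nn

class GateLayer(nn.Module):
    """Gate layer: computes g = sigmoid(W x + b), with every element of g in
    [0, 1], and returns the element-wise product g * x, so that each feature
    element can be passed through or suppressed individually."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)  # W is dim x dim, b is dim-dimensional

    def forward(self, x):
        g = torch.sigmoid(self.linear(x))  # gate weights in the interval [0, 1]
        return g * x                       # gated (selected) features
```

In the fifth embodiment, such a gate would be applied to the enhanced-speech feature amount 212 before it is concatenated with the noisy-speech feature amount 210 at the input layer 240.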
  • the speech recognition device using the acoustic model 650 according to the fifth embodiment also achieved higher accuracy than the conventional technology.
  • FIG. 10 shows a schematic configuration of an acoustic model 750 used in the speech recognition device according to the sixth embodiment of the present invention.
  • The configuration of the speech recognition device itself according to this embodiment is the same as that shown in FIG. 3, except that the acoustic model 750 is used instead of the acoustic model 206.
  • the acoustic model 750 constitutes one deep neural network as a whole.
  • The acoustic model 750 includes a first sub-network 770 that receives the feature amount 210 of the noise-superimposed speech as input, a second sub-network 772 that receives the feature amount 212 of the first enhanced speech as input, a third sub-network 774, which is part of the deep neural network, connected so as to receive the output of the first sub-network 770 and the output of the second sub-network 772, and an output layer 776 that specifies the phonemes estimated by the acoustic model 750.
  • The first sub-network 770 includes an input layer 800 that receives the feature amount 210 of the noise-superimposed speech, and hidden layers 802, 804, and 806 connected in sequence between the input layer 800 and the input of the third sub-network 774.
  • The second sub-network 772 includes an input layer 810 that receives the feature amount 212 of the first enhanced speech, hidden layers 812, 814, and 816 connected in order after the input layer 810, and a gate layer 818 connected so as to receive the output of the hidden layer 816 and having the same function as the gate layer 682 of the fifth embodiment.
  • The third sub-network 774 includes a hidden layer 830 that receives the output of the first sub-network 770 and the output of the second sub-network 772, and hidden layers 832, 834, and 836 connected in order between the hidden layer 830 and the output layer 776.
  • The acoustic model 750 differs from the acoustic model shown in FIG. 9 in that, in its first half, the feature amount 210 of the noise-superimposed speech and the feature amount 212 of the first enhanced speech are separated into the first sub-network 770 and the second sub-network 772 and propagate inside each. The output of the first sub-network 770 is input directly to the third sub-network 774, whereas in the second sub-network 772 the output of the last hidden layer 816 is subjected to gate processing in the gate layer 818 before being input to the hidden layer 830.
  • As a result, when the feature amount 212 of the first enhanced speech is advantageous for speech recognition, it is used effectively; when it is not, the output of the second sub-network 772 takes small values and, as a result, is hardly used for speech recognition.
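  • A self-contained sketch of this gated two-branch model follows (PyTorch again assumed; the helper mlp, the layer counts and widths, and the phoneme count are illustrative):

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden_dim, n_layers):
    """A stack of fully connected layers with ReLU (one sub-network)."""
    layers, d = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(d, hidden_dim), nn.ReLU()]
        d = hidden_dim
    return nn.Sequential(*layers)

class AcousticModel750(nn.Module):
    """Sixth embodiment: the enhanced-speech branch ends in a gate, so its
    contribution can be scaled down element by element when the enhanced
    features are unreliable."""
    def __init__(self, feat_dim=40, hidden_dim=1024, n_phonemes=43):
        super().__init__()
        self.noisy_branch = mlp(feat_dim, hidden_dim, 4)     # first sub-network 770
        self.enhanced_branch = mlp(feat_dim, hidden_dim, 4)  # layers 810 to 816 of sub-network 772
        self.gate = nn.Linear(hidden_dim, hidden_dim)        # gate layer 818 (W and b)
        self.merge = mlp(2 * hidden_dim, hidden_dim, 4)      # third sub-network 774
        self.output = nn.Linear(hidden_dim, n_phonemes)      # output layer 776

    def forward(self, noisy_feat, enhanced_feat):
        h_noisy = self.noisy_branch(noisy_feat)
        h_enh = self.enhanced_branch(enhanced_feat)
        h_enh = torch.sigmoid(self.gate(h_enh)) * h_enh      # gating in [0, 1]
        return self.output(self.merge(torch.cat([h_noisy, h_enh], dim=-1)))
```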
  • FIG. 11 shows a schematic configuration of an acoustic model 850 used in the speech recognition device according to the seventh embodiment. As can be seen from FIG. 11, this acoustic model 850 is also composed of a deep neural network.
  • The speech recognition device according to the seventh embodiment has the same configuration as the speech recognition device 380 shown in FIG. 6, except that the acoustic model 850 is used instead of the acoustic model 398.
  • In addition to the components of the acoustic model 398 shown in FIG. 7, the acoustic model 850 includes, before the input layer 450, a gate layer 892 that receives the feature amount 212 of the first enhanced speech, multiplies it by a weight in the interval [0, 1], and inputs the result to the input layer 450; a gate layer 902 that does the same for the feature amount 430 of the second enhanced speech; a gate layer 912 that does the same for the feature amount 432 of the third enhanced speech; and a gate layer 922 that does the same for the feature amount 434 of the fourth enhanced speech.
  • In other respects, the acoustic model 850 is the same as the acoustic model 398 shown in FIG. 7.
  • Through the functions of the gate layers 892, 902, 912, and 922, whichever of the feature amounts 212, 430, 432, and 434 of the first to fourth enhanced speech is advantageous for speech recognition can be used effectively, and the other feature amounts can be prevented from being used. As a result, the accuracy of speech recognition using the acoustic model 850 can also be improved.
  • the speech recognition apparatus using the acoustic model 850 according to the present embodiment was able to perform speech recognition with higher accuracy than the conventional technique.
  • FIG. 12 shows a schematic configuration of an acoustic model 950 used in the speech recognition device according to the eighth embodiment of the present invention.
  • the acoustic model 950 also includes a deep neural network, like the acoustic model according to the other embodiments.
  • The acoustic model 950 includes a first input sub-network 960 that receives the feature amount 210 of the noise-superimposed speech, a second input sub-network 962 that receives the feature amount 212 of the first enhanced speech, a third input sub-network 964 that receives the feature amount 430 of the second enhanced speech, a fourth input sub-network 966 that receives the feature amount 432 of the third enhanced speech, a fifth input sub-network 968 that receives the feature amount 434 of the fourth enhanced speech, an intermediate sub-network 970 that receives the outputs of the first to fifth input sub-networks 960, 962, 964, 966, and 968, and an output layer 972 that receives the output of the intermediate sub-network 970 and outputs the phonemes estimated by the acoustic model 950.
  • The first input sub-network 960 includes an input layer 980 that receives the feature amount 210 of the noise-superimposed speech, and hidden layers 982, 984, and 986 connected in sequence between the input layer 980 and the intermediate sub-network 970.
  • The second input sub-network 962 includes an input layer 990 that receives the feature amount 212 of the first enhanced speech, hidden layers 992, 994, and 996 connected in sequence after the input layer 990, and a gate layer 998 inserted between the hidden layer 996 and the input of the intermediate sub-network 970.
  • The third input sub-network 964 includes an input layer 1000 that receives the feature amount 430 of the second enhanced speech, hidden layers 1002, 1004, and 1006 connected in order after the input layer 1000, and a gate layer 1008 inserted between the hidden layer 1006 and the input of the intermediate sub-network 970.
  • The fourth input sub-network 966 includes an input layer 1010 that receives the feature amount 432 of the third enhanced speech, hidden layers 1012, 1014, and 1016 connected in sequence after the input layer 1010, and a gate layer 1018 inserted between the hidden layer 1016 and the input of the intermediate sub-network 970.
  • The fifth input sub-network 968 includes an input layer 1020 that receives the feature amount 434 of the fourth enhanced speech, hidden layers 1022, 1024, and 1026 connected in sequence after the input layer 1020, and a gate layer 1028 inserted between the hidden layer 1026 and the input of the intermediate sub-network 970.
  • The intermediate sub-network 970 includes a hidden layer 1030 that receives the outputs of the first to fifth input sub-networks 960, 962, 964, 966, and 968, and hidden layers 1032, 1034, and 1036 connected in order between the hidden layer 1030 and the output layer 972.
  • the operation of the speech recognition device using this acoustic model 950 is the same as that of the speech recognition device 380 shown in FIG. 6 except that the acoustic model 950 is used as the acoustic model.
  • With the acoustic model 950, the phoneme can be estimated while weighting each element of the feature amounts obtained by the first to fourth speech enhancements with a coefficient taking a value in the interval [0, 1]. For each speech enhancement and for each feature element, features that are advantageous for speech recognition can thus be used effectively and disadvantageous features can be left unused. As a result, the accuracy of speech recognition can be increased.
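  • Generalizing the previous sketch to any number of enhancement methods, the eighth-embodiment structure can be sketched as follows (PyTorch assumed; the helper mlp, the depths and widths, and the phoneme count are illustrative assumptions):

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden_dim, n_layers):
    """A stack of fully connected layers with ReLU (one sub-network)."""
    layers, d = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(d, hidden_dim), nn.ReLU()]
        d = hidden_dim
    return nn.Sequential(*layers)

class AcousticModel950(nn.Module):
    """Eighth embodiment: one input sub-network for the noisy signal plus one
    gated input sub-network per enhanced signal, merged by an intermediate
    sub-network that feeds the phoneme output layer."""
    def __init__(self, feat_dim=40, hidden_dim=1024, n_enhanced=4, n_phonemes=43):
        super().__init__()
        self.noisy_branch = mlp(feat_dim, hidden_dim, 4)                     # input sub-network 960
        self.enhanced_branches = nn.ModuleList(
            [mlp(feat_dim, hidden_dim, 4) for _ in range(n_enhanced)])       # sub-networks 962 to 968
        self.gates = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(n_enhanced)])  # gate layers 998 to 1028
        self.merge = mlp((1 + n_enhanced) * hidden_dim, hidden_dim, 4)       # intermediate sub-network 970
        self.output = nn.Linear(hidden_dim, n_phonemes)                      # output layer 972

    def forward(self, noisy_feat, enhanced_feats):
        hs = [self.noisy_branch(noisy_feat)]
        for gate, br, feat in zip(self.gates, self.enhanced_branches, enhanced_feats):
            h = br(feat)
            hs.append(torch.sigmoid(gate(h)) * h)  # per-element gating in [0, 1]
        return self.output(self.merge(torch.cat(hs, dim=-1)))
```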
  • FIG. 14 shows, in the form of a table, the results of experiments (word error rates) performed on the above embodiments.
  • CHiME3 speech recorded outdoors using a tablet (Non-Patent Document 5) was used as the recognition target.
  • The speech enhancement processes used in this experiment are as follows.
  • Speech enhancement 1: the technology disclosed in Non-Patent Document 1
  • Speech enhancement 2: the technology disclosed in Non-Patent Document 2
  • Speech enhancement 3: the technology disclosed in Non-Patent Document 3
  • Speech enhancement 4: the technology disclosed in Non-Patent Document 4
  • The speech recognition accuracy was measured using the acoustic model of each embodiment. In the experiments on the third, fourth, seventh, and eighth embodiments, speech enhancements 1 to 4 were respectively adopted as the speech enhancement units 202, 392, 394, and 396 shown in FIG. 6.
  • the word error rate when speech recognition was performed on the same data without speech enhancement using a conventional speech recognition device was 22.64%.
  • In each of the embodiments, the word error rate was lower than when the conventional speech enhancement was used as preprocessing; that is, the accuracy of speech recognition was higher. In most cases the accuracy was also higher than that of conventional speech recognition without speech enhancement. In particular, the second embodiment achieved high accuracy with any of the speech enhancements, and the fourth and eighth embodiments achieved very high accuracy; the eighth embodiment in particular realized higher accuracy than the other embodiments.
  • Each functional unit of the speech recognition devices described above can be realized by computer hardware and a program executed on that hardware by a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit).
  • FIG. 15 shows computer hardware for realizing each of the above speech recognition devices.
  • the GPU is usually used for performing image processing, and such a technique of using the GPU for normal arithmetic processing instead of image processing is called GPGPU (General-purpose computing on graphics processing units).
  • the GPU can execute a plurality of operations of the same type simultaneously and in parallel.
  • Since a large amount of computation is required, especially at training time, a computer equipped with a GPU is suitable for training and inference of the neural networks constituting the speech recognizer and the acoustic model used in it.
  • a computer having a sufficiently high-speed CPU need not necessarily be equipped with a GPU.
  • the computer system 1130 includes a computer 1140 having a memory port 1152 and a DVD (Digital Versatile Disk) drive 1150, a keyboard 1146, a mouse 1148, and a monitor 1142.
  • The computer 1140 further includes a CPU 1156 and a GPU 1158, a bus 1166 connected to these as well as to the memory port 1152 and the DVD drive 1150, a ROM 1160 that is a read-only memory storing a boot program and the like, a random access memory (RAM) 1162 connected to the bus 1166 that is a computer-readable storage medium storing a system program, work data, and the like, and a hard disk 1154 that is a computer-readable nonvolatile storage medium.
  • The computer 1140 further includes a network interface (I/F) 1144 that is connected to the bus 1166 and provides a connection to the network 1168, and an audio I/F 1170 for inputting and outputting audio signals to and from the outside.
  • A program for causing the computer system 1130 to function as each functional unit of the speech recognition apparatuses according to the above-described embodiments and as the storage units of their acoustic models is stored in a DVD 1172 or a removable memory 1164, both of which are computer-readable storage media, mounted in the DVD drive 1150 or the memory port 1152, and is then transferred to the hard disk 1154. Alternatively, the program may be transmitted to the computer 1140 via the network 1168 and stored on the hard disk 1154. The program is loaded into the RAM 1162 at the time of execution. The program may also be loaded directly into the RAM 1162 from the DVD 1172, from the removable memory 1164, or via the network 1168.
  • Data required for the above processing is stored at a predetermined address such as a register in the hard disk 1154, the RAM 1162, the CPU 1156, or the GPU 1158, processed by the CPU 1156 or the GPU 1158, and stored at an address designated by the program.
  • The parameters of the finally trained acoustic model are stored, for example, on the hard disk 1154 together with a program implementing the training and inference algorithms of the acoustic model, or are stored on the DVD 1172 or the removable memory 1164 via the DVD drive 1150 and the memory port 1152, respectively. Alternatively, they are transmitted to another computer or storage device connected via the network I/F 1144.
  • This program includes an instruction sequence made up of a plurality of instructions for causing the computer 1140 to function as each device and system according to the above embodiments. The numerical processing in each of the above devices and systems is performed using the CPU 1156 and the GPU 1158; the CPU 1156 alone may be used, but using the GPU 1158 is faster. Some of the basic functions required to cause the computer 1140 to perform this operation are provided by an operating system or third-party programs running on the computer 1140, or by various dynamically linkable programming toolkits or program libraries installed on the computer 1140. Therefore, the program itself does not necessarily need to include all the functions necessary to realize the speech recognition device of the embodiments.
  • The program need only include instructions that realize the functions of the device or the method described above by dynamically calling, at run time and in a controlled manner, the appropriate functions or programs in a programming toolkit or program library so as to obtain the desired result.
  • the above-described speech recognition apparatus may be realized by loading a program incorporating all necessary functions into a computer by a static link.
  • In the above embodiments, the deep neural network constituting the acoustic model has seven hidden layers in total; in the embodiments in which the network is split into sub-networks, the first half of the deep neural network has three hidden layers and the latter half has four hidden layers.
  • the number of hidden layers may be six or less, or eight or more.
  • the number of hidden layers in the first half and the second half does not need to be three and four, respectively.
  • the present invention is applied to a single-channel audio signal.
  • the present invention is not limited to such an embodiment, and is applicable to audio signals of a plurality of channels.
  • the present invention can be used for improving an interface with a user in a computer, a home appliance, an industrial product, and the like, and can easily operate these devices with a natural language voice issued by the user.
  • 100, 180, 380 Speech recognition device, 110 Waveform, 112 Speech signal, 114, 202, 392, 394, 396 Speech enhancement unit, 116, 203, 393, 395, 397 Enhanced speech signal, 118, 220, 410, 412, 414 Feature extraction unit, 120, 204, 402 Speech recognition unit, 122, 208, 400 Text, 124, 206, 280, 398, 500, 650, 750, 850, 950 Acoustic model, 126 Pronunciation dictionary, 128 Language model, 200, 390 Extended feature extraction unit, 210 Feature amount of noise-superimposed speech, 212 Feature amount of first enhanced speech, 300 Sub-network for noise-superimposed speech, 302 Sub-network for enhanced speech, 304 Output-side sub-network, 430 Feature amount of second enhanced speech, 432 Feature amount of third enhanced speech, 434 Feature amount of fourth enhanced speech, 452 Intermediate layer, 540, 770 First sub-network, 542, 772 Second sub-network, 544, 774 Third sub-network, 546 Fourth sub-network, 548 Fifth sub-network, 550, 970 Intermediate sub-network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

This noise-tolerant speech recognition device 180, which achieves high speech recognition accuracy even when only a single-channel speech signal is available, includes: a speech enhancement unit 202 that receives, as input, a speech signal 112 in which a noise signal is superimposed on a target speech signal, and outputs an enhanced speech signal 203 obtained by enhancing the speech signal with a prescribed speech enhancement method; an extended feature extraction unit 200 that receives, as inputs, the enhanced speech signal 203 and the speech signal 112 and extracts respective feature amounts; and a speech recognition unit 204 that converts the utterance content of the speech signal into text by using those feature amounts and an acoustic model 206.

Description

Noise-tolerant speech recognition apparatus and method, and computer program
The present invention relates to speech recognition, and more particularly to a noise-tolerant speech recognition apparatus and method, and a computer program, that enable highly accurate speech recognition even for speech collected by a single microphone.
In recent years, with the advance of computer processing power and the development of computer science, the range of applications of speech recognition has expanded significantly. Apart from the fields in which speech recognition has long been used, speech recognition has been introduced into so-called home appliances, and the number of users of products such as smart speakers, which use speech recognition to provide previously unavailable functions, is growing rapidly. Along with this, the situations in which speech recognition is used have also become diverse.
On the other hand, what is essential for speech recognition is its accuracy. As the situations in which speech recognition is used become more diverse, noise becomes more prevalent and more varied, and it becomes difficult to keep the accuracy of speech recognition consistently high. Therefore, speech recognition that maintains high accuracy even in the presence of noise (noise-tolerant speech recognition) has become increasingly important.
Conventionally, two broad classes of techniques have been used for noise-tolerant speech recognition, namely the following two:
・Speech enhancement (noise removal)
・Noise-added training
Speech enhancement is a technology that improves the accuracy of speech recognition by removing noise from the speech signal to be recognized. Typically, speech recognition is performed after speech enhancement has been applied to the speech signal from the microphone.
Conventional speech enhancement techniques include the spectral subtraction method described in Non-Patent Document 1, the MMSE-STSA (minimum mean square error short-time spectral amplitude estimator) estimation method described in Non-Patent Document 2, a method using a Vector Taylor series described in Non-Patent Document 3, and the denoising autoencoder described in Non-Patent Document 4.
All of these methods perform speech enhancement as preprocessing for speech recognition on an acoustic signal obtained from a single microphone.
FIG. 1 shows a schematic configuration of a conventional speech recognition apparatus 100. Referring to FIG. 1, the speech recognition apparatus 100 includes a speech enhancement unit 114 that receives a speech signal 112, which is noise-superimposed speech represented by a waveform 110 and output from a microphone (not shown), applies speech enhancement by any of the above-described methods, and outputs an enhanced speech signal 116; a feature extraction unit 118 that extracts a predetermined feature amount from the enhanced speech signal 116; and a speech recognition unit 120 that performs speech recognition on the feature amount and outputs text 122 corresponding to the speech represented by the waveform 110. As the speech recognition unit 120, for example, the one disclosed in Patent Document 1 can be used.
The speech recognition apparatus 100 further includes an acoustic model 124, a pronunciation dictionary 126, and a language model 128 used when the speech recognition unit 120 performs speech recognition. The acoustic model 124 estimates the corresponding phoneme based on the feature amount input from the feature extraction unit 118. The pronunciation dictionary 126 is used to obtain the word corresponding to the phoneme sequence estimated with the acoustic model 124. The language model 128 is used to calculate the probability of each candidate utterance sentence of the recognition result formed by the word string estimated using the pronunciation dictionary 126.
FIG. 2 shows a schematic configuration of the acoustic model 124. As can be seen from FIG. 2, the acoustic model 124 is formed of a so-called deep neural network and includes an input layer 150 that receives a feature amount, an output layer 162 that outputs information specifying the phoneme estimated from that feature amount, and a plurality of hidden layers 152, 154, 156, 158, and 160 provided in order between the input layer 150 and the output layer 162. Since the configuration and training method of the acoustic model 124 are well known, details are not repeated here. Clean speech containing no noise is used for training the acoustic model 124. The information specifying the estimated phoneme may take the form of, for example, a probability vector over the set of phonemes. Hereinafter, to simplify the description, outputting information that specifies a phoneme is simply referred to as “outputting a phoneme”.
On the other hand, noise-added training is a method that attempts to improve speech recognition accuracy for noisy speech by training a deep-neural-network acoustic model on speech signals containing noise as training data. In this case, no preprocessing is applied to the speech signal, but the target of speech recognition is still a single speech signal.
In recent years, multi-channel speech enhancement using microphones of multiple channels (a microphone array) has been widely used as preprocessing for speech recognition, instead of speech enhancement of the signal obtained from a single-channel microphone. A good example is the smart speaker. Smart speakers have been developed and sold by various companies and are spreading rapidly, particularly in the United States.
By using a microphone array, noise can be removed using spatial information about the sound source as well, so speech enhancement can be performed with high accuracy and low distortion.
Patent Document 1: JP 2017-219769 A
 However, when multi-channel speech signals are used, special devices, namely a microphone array and a multi-channel microphone amplifier, are required, and the amount of processing and data transfer for the audio signals increases. Because of these problems, speech recognition based on multi-channel speech signals cannot be applied to devices such as so-called smartphones, which have only a single microphone and limited processing capacity.
 For this reason, one of the speech enhancement processes described above is applied on a smartphone; in that case, however, a large increase in speech distortion occurs and the speech recognition accuracy deteriorates significantly.
 It is therefore an object of the present invention to provide an acoustic model and a speech recognition apparatus capable of achieving high speech recognition accuracy even when only a single-channel speech signal is available, as well as a computer program therefor.
 A noise-tolerant speech recognition apparatus according to a first aspect of the present invention includes a speech enhancement circuit that receives as input an acoustic signal in which a noise signal is superimposed on a speech signal, which is the target signal, and outputs an enhanced speech signal in which the speech signal is enhanced, and a speech recognition unit that receives the enhanced speech signal and the acoustic signal and transcribes the utterance content of the speech signal into text.
 Preferably, the speech enhancement circuit includes a first speech enhancement unit that applies a first type of speech enhancement processing to the acoustic signal and outputs a first enhanced speech signal, and a second speech enhancement unit that applies a second type of speech enhancement processing, different from the first type, to the acoustic signal and outputs a second enhanced speech signal, and the speech recognition unit receives the first and second enhanced speech signals and the acoustic signal and transcribes the utterance content of the speech signal into text.
 More preferably, the speech recognition unit includes first feature extraction means for extracting first feature values from the acoustic signal, second feature extraction means for extracting second feature values from the enhanced speech signal, feature selection means for selecting or discarding each of the second feature values according to the first feature values and the second feature values, and speech recognition means for transcribing the utterance content of the speech signal into text using the second feature values selected by the feature selection means.
 Still more preferably, the noise-tolerant speech recognition apparatus further includes acoustic model storage means for storing an acoustic model used by the speech recognition means for speech recognition, the acoustic model being a deep neural network having a plurality of hidden layers and including a first sub-network that receives the first feature values as input, a second sub-network that receives the second feature values as input, and a third sub-network that receives the output of the first sub-network and the output of the second sub-network and outputs phonemes estimated from the first and second feature values.
 A noise-tolerant speech recognition method according to a second aspect of the present invention includes a step in which a computer receives as input a single-channel acoustic signal in which a noise signal is superimposed on a speech signal, which is the target signal, and outputs an enhanced speech signal in which the speech signal is enhanced, and a speech recognition step in which the computer receives the enhanced speech signal and the acoustic signal and transcribes the utterance content of the speech signal into text.
 A computer program according to a third aspect of the present invention causes a computer to function as any of the noise-tolerant speech recognition apparatuses described above.
 The problems solved by the present invention, its configuration, and its advantageous effects will become more apparent from the detailed description of the embodiments read with reference to the accompanying drawings.
FIG. 1 is a block diagram showing a schematic configuration of a speech recognition apparatus that performs speech recognition after preprocessing a single-channel speech signal with a conventional speech enhancement technique.
FIG. 2 is a block diagram showing the configuration of the deep-neural-network acoustic model used in the speech recognition apparatus shown in FIG. 1.
FIG. 3 is a block diagram showing a schematic configuration of the speech recognition apparatus according to the first embodiment of the present invention.
FIG. 4 is a schematic block diagram showing the configuration of the acoustic model used in the speech recognition apparatus shown in FIG. 3.
FIG. 5 is a block diagram showing the configuration of the acoustic model used in the speech recognition apparatus according to the second embodiment of the present invention.
FIG. 6 is a block diagram showing a schematic configuration of the speech recognition apparatus according to the third embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of the acoustic model used in the speech recognition apparatus shown in FIG. 6.
FIG. 8 is a block diagram showing a schematic configuration of the acoustic model used in the speech recognition apparatus according to the fourth embodiment of the present invention.
FIG. 9 is a block diagram showing a schematic configuration of the acoustic model used in the speech recognition apparatus according to the fifth embodiment of the present invention.
FIG. 10 is a block diagram showing a schematic configuration of the acoustic model used in the speech recognition apparatus according to the sixth embodiment of the present invention.
FIG. 11 is a block diagram showing a schematic configuration of the acoustic model used in the speech recognition apparatus according to the seventh embodiment of the present invention.
FIG. 12 is a block diagram showing a schematic configuration of the acoustic model used in the speech recognition apparatus according to the eighth embodiment of the present invention.
FIG. 13 is a diagram explaining the function of the gate layer included in the acoustic models according to the fifth to eighth embodiments of the present invention.
FIG. 14 is a table comparing the word error rates of the conventional technique and of the speech recognition apparatuses according to the first to eighth embodiments of the present invention.
FIG. 15 is a hardware block diagram of a typical computer that implements the speech recognition apparatus according to the present invention.
 In the following description and drawings, identical components are given identical reference numerals, and detailed descriptions of them are therefore not repeated.
 [First Embodiment]
 FIG. 3 is a block diagram showing a schematic configuration of a speech recognition apparatus 180 according to the first embodiment of the present invention. Referring to FIG. 3, the speech recognition apparatus 180 includes a speech enhancement unit 202 that applies an existing speech enhancement process to a speech signal 112, which is noise-superimposed speech, and outputs an enhanced speech signal 203; an extended feature extraction unit 200 that receives both the speech signal 112 and the enhanced speech signal 203 and extracts extended speech feature values 210 and 212; and a speech recognition unit 204 that receives the feature values 210 and 212 output by the extended feature extraction unit 200, performs speech recognition, and outputs recognized text 208. The speech signal 112 is the acoustic signal output by a microphone, in which a noise signal is superimposed on the signal representing the speech shown by the waveform 110. As the speech recognition unit 204, a unit similar to the speech recognition unit 120 shown in FIG. 1 can be used, although the feature values it uses differ from the conventional ones, as described later.
 The speech recognition apparatus 180 further includes an acoustic model 206 that the speech recognition unit 204 uses for speech recognition and whose configuration differs from the conventional one shown in FIG. 2, and a pronunciation dictionary 126 and a language model 128 that are the same as those shown in FIG. 1. The acoustic model 206, the pronunciation dictionary 126, and the language model 128 are all stored in a storage device such as the hard disk described later.
 The extended feature extraction unit 200 includes a feature extraction unit 118, similar to that shown in FIG. 1, that receives the speech signal 112, which is noise-superimposed speech, and outputs feature values 210, and a feature extraction unit 220, having the same function as the feature extraction unit 118, that extracts feature values 212 from the enhanced speech signal 203 output by the speech enhancement unit 202. In this embodiment the feature extraction units 118 and 220 have the same configuration, and the feature values 210 and 212 have the same meaning. In general, however, the values of the feature values 210 and 212 differ from each other because their inputs differ.
 Referring to FIG. 4, the acoustic model 206 shown in FIG. 3 includes an input layer 240 that receives as input both the feature values 210 obtained from the noise-superimposed speech and the feature values 212 obtained from the enhanced speech signal 203, an output layer 256 that outputs the estimated phoneme, and a plurality of hidden layers 242 to 254 arranged in order between the input layer 240 and the output layer 256. In this embodiment the number of hidden layers is seven.
 The input layer 240 shown in FIG. 4 receives a number of inputs equal to the sum of the numbers of elements of the feature vectors 210 and 212. In this embodiment the feature extraction units 118 and 220 that output these feature values 210 and 212 have the same configuration as the conventional feature extraction unit 118 shown in FIG. 1. The number of feature values received by the acoustic model 206 is therefore twice that of the conventional model shown in FIG. 1: half of them are feature values obtained from the noise-superimposed speech, and the other half are feature values obtained from the enhanced speech.
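 As one way to picture this architecture, the following is a minimal PyTorch-style sketch of an acoustic model that, like the acoustic model 206, concatenates the noisy-speech and enhanced-speech feature vectors and passes them through seven fully connected hidden layers to a phoneme output. The layer widths, the ReLU activation, and the output dimensionality are illustrative assumptions and are not specified in this description.

```python
import torch
import torch.nn as nn

class ConcatInputAcousticModel(nn.Module):
    """Sketch of an acoustic model fed jointly with noisy and enhanced features."""
    def __init__(self, feat_dim=40, hidden_dim=2048, num_hidden=7, num_phonemes=2000):
        super().__init__()
        layers = []
        in_dim = feat_dim * 2  # noisy features + enhanced features, concatenated
        for _ in range(num_hidden):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        self.hidden = nn.Sequential(*layers)
        self.output = nn.Linear(hidden_dim, num_phonemes)  # phoneme scores

    def forward(self, noisy_feat, enhanced_feat):
        x = torch.cat([noisy_feat, enhanced_feat], dim=-1)  # the input layer sees both
        return self.output(self.hidden(x))
```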
 The operation of the speech recognition unit 204 is the same as that of the speech recognition apparatus 100 shown in FIG. 1, except that the acoustic model 206 is used in place of the acoustic model 124 shown in FIG. 1 and that the acoustic feature values to be processed include those of the noise-superimposed speech in addition to those derived from the enhanced speech. A detailed description is therefore not repeated here.
 By adopting the acoustic model 206 with this configuration, the speech recognition apparatus 180 according to this embodiment was able to perform speech recognition with higher accuracy than the conventional apparatus shown in FIG. 1, as described later with reference to FIG. 14.
 The acoustic model 206 can be trained by the same error backpropagation method as an ordinary deep neural network, by preparing in advance training data consisting of noise-superimposed speech and the text that the speech represents. The same applies to the training in each of the embodiments described below.
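 The following is a minimal sketch of one such backpropagation step, assuming a model with the interface sketched above, an optimizer such as torch.optim.Adam, and frame-level phoneme labels (for example obtained by forced alignment); these details are assumptions for illustration only.

```python
import torch.nn.functional as F

def train_step(model, optimizer, noisy_feat, enhanced_feat, phoneme_targets):
    """One error-backpropagation step on a batch of frames.

    noisy_feat, enhanced_feat: (batch, feat_dim) feature vectors per frame
    phoneme_targets: (batch,) frame-level phoneme labels
    """
    optimizer.zero_grad()
    logits = model(noisy_feat, enhanced_feat)
    loss = F.cross_entropy(logits, phoneme_targets)
    loss.backward()   # backpropagate the error through the whole network
    optimizer.step()
    return loss.item()
```

 In practice the enhanced-speech features would be computed beforehand by running the chosen speech enhancement process on the noise-superimposed training data.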
 [Second Embodiment]
 FIG. 5 shows the configuration of an acoustic model 280 according to the second embodiment of the present invention. The speech recognition apparatus according to the second embodiment is the same as the speech recognition apparatus 180 according to the first embodiment, except that the acoustic model 280 shown in FIG. 5 is used in place of the acoustic model 206 shown in FIG. 3.
 The acoustic model 280 includes a sub-network 300 for noise-superimposed speech that receives the feature values 210 of the noise-superimposed speech, a sub-network 302 for enhanced speech that receives the feature values 212 of the enhanced speech, an output-side sub-network 304 that receives the output of the sub-network 300 for noise-superimposed speech and the output of the sub-network 302 for enhanced speech, and an output layer 306 that receives the output of the output-side sub-network 304 and outputs phonemes.
 The sub-network 300 for noise-superimposed speech includes an input layer 320 connected to receive the feature values 210 of the noise-superimposed speech, and a plurality of (three in this embodiment) hidden layers 322, 324, and 326 connected in order between the input layer 320 and the input of the output-side sub-network 304.
 The sub-network 302 for enhanced speech includes an input layer 330 connected to receive the feature values 212 of the enhanced speech, and a plurality of (three in this embodiment) hidden layers 332, 334, and 336 connected in order between the input layer 330 and the input of the output-side sub-network 304.
 The output-side sub-network 304 includes a hidden layer 350 connected to receive the outputs of the sub-network 300 for noise-superimposed speech and the sub-network 302 for enhanced speech, and hidden layers 352, 354, and 356 connected in order between the hidden layer 350 and the output layer 306.
 The acoustic model 280 shown in FIG. 5 differs from the acoustic model 206 of the first embodiment in the following respect. In the acoustic model 206, the input layer 240 receives both the feature values 210 of the noise-superimposed speech and the feature values 212 of the enhanced speech, and information from both is propagated to all of the subsequent hidden layers 242 to 254. In the acoustic model 280, by contrast, only information from the feature values 210 of the noise-superimposed speech propagates through the input layer 320 and the hidden layers 322 to 326 forming the sub-network 300 for noise-superimposed speech, and only information from the feature values 212 of the enhanced speech propagates through the input layer 330 and the hidden layers 332 to 336 of the sub-network 302 for enhanced speech. The two streams of information are integrated for the first time in the hidden layer 350 and then propagate through the hidden layers 352 to 356 to the output layer 306.
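 A minimal sketch of this two-branch arrangement, under the same illustrative assumptions about layer widths and activations as the earlier sketch, might look as follows: each feature stream passes through its own stack of three hidden layers, and the two branch outputs are concatenated only at the first layer of the output-side stack.

```python
import torch
import torch.nn as nn

def stack(in_dim, hidden_dim, depth):
    """A stack of fully connected layers with ReLU activations."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden_dim), nn.ReLU()]
        d = hidden_dim
    return nn.Sequential(*layers)

class TwoBranchAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=2048, num_phonemes=2000):
        super().__init__()
        self.noisy_branch = stack(feat_dim, hidden_dim, 3)     # sub-network for noisy speech
        self.enhanced_branch = stack(feat_dim, hidden_dim, 3)  # sub-network for enhanced speech
        self.merged = stack(hidden_dim * 2, hidden_dim, 4)     # output-side sub-network
        self.output = nn.Linear(hidden_dim, num_phonemes)

    def forward(self, noisy_feat, enhanced_feat):
        a = self.noisy_branch(noisy_feat)
        b = self.enhanced_branch(enhanced_feat)
        merged = self.merged(torch.cat([a, b], dim=-1))  # the streams meet here for the first time
        return self.output(merged)
```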
 The configuration of the speech recognition apparatus that employs the acoustic model 280 is the same as that of the speech recognition apparatus 180 of the first embodiment.
 The speech recognition apparatus using the acoustic model 280 according to the second embodiment also achieved higher accuracy than the conventional technique, as shown in FIG. 14.
 [Third Embodiment]
 FIG. 6 shows a block diagram of a speech recognition apparatus 380 according to the third embodiment of the present invention. The speech recognition apparatus 380 includes speech enhancement units 202, 392, 394, and 396 that apply first to fourth existing speech enhancement processes, respectively, to the speech signal 112 output by the microphone for the speech represented by the waveform 110 and output enhanced speech signals 203, 393, 395, and 397, respectively; an extended feature extraction unit 390 that receives the speech signal 112 and the enhanced speech signals 203, 393, 395, and 397 as input and extracts extended speech feature values 210, 212, 430, 432, and 434; and a speech recognition unit 402 that receives the feature values 210, 212, 430, 432, and 434 output by the extended feature extraction unit 390, performs speech recognition, and outputs recognized text 400.
 The speech recognition apparatus 380 further includes an acoustic model 398 that the speech recognition unit 402 uses for speech recognition, and a pronunciation dictionary 126 and a language model 128 that are the same as those shown in FIG. 1.
 The extended feature extraction unit 390 includes a feature extraction unit 118 that receives the noise-superimposed speech signal 112 and extracts feature values 210; a feature extraction unit 220 that receives the enhanced speech signal 203 from the speech enhancement unit 202 and extracts feature values 212 of the first enhanced speech; a feature extraction unit 410 that receives the enhanced speech signal 393 from the speech enhancement unit 392 and outputs feature values 430 of the second enhanced speech; a feature extraction unit 412 that receives the enhanced speech signal 395 from the speech enhancement unit 394 and outputs feature values 432 of the third enhanced speech; and a feature extraction unit 414 that receives the enhanced speech signal 397 from the speech enhancement unit 396 and outputs feature values 434 of the fourth enhanced speech.
 The speech enhancement unit 202 performs speech enhancement by the method disclosed in Non-Patent Document 1. The speech enhancement unit 392 performs speech enhancement by the method disclosed in Non-Patent Document 2. The speech enhancement unit 394 performs speech enhancement by the method disclosed in Non-Patent Document 3. The speech enhancement unit 396 performs speech enhancement by the method disclosed in Non-Patent Document 4.
 FIG. 7 shows, in block diagram form, the configuration of the deep neural network that forms the acoustic model 398. Referring to FIG. 7, the acoustic model 398 is an extension of the acoustic model 206 according to the first embodiment shown in FIG. 4 so that it uses the feature values extracted from the four enhanced speech signals.
 The acoustic model 398 includes an input layer 450 that receives the feature values 210 of the noise-superimposed speech, the feature values 212 of the first enhanced speech, the feature values 430 of the second enhanced speech, the feature values 432 of the third enhanced speech, and the feature values 434 of the fourth enhanced speech; an output layer 454 that outputs the phoneme estimated by the acoustic model 398; and an intermediate layer 452, consisting of a plurality of hidden layers, connected between the input layer 450 and the output layer 454.
 The intermediate layer 452 includes a hidden layer 470 whose input is connected to the output of the input layer 450, and hidden layers 472, 474, 476, 478, 480, and 482, each of whose inputs is connected to the output of the preceding layer. The output of the hidden layer 482 is connected to the input of the output layer 454.
 The speech recognition apparatus 380 according to the third embodiment extends the speech recognition apparatus 180 according to the first embodiment so that four kinds of speech enhancement are used. Its operation is basically the same as in the first embodiment.
 In the third embodiment as well, the accuracy of speech recognition could be made higher than with the conventional technique.
 [Fourth Embodiment]
 In the third embodiment, the feature values 210 of the noise-superimposed speech and the feature values 212, 430, 432, and 434 of the first to fourth enhanced speech are all input to the input layer 450, and this information is propagated to all of the hidden layers forming the intermediate layer 452. The present invention, however, is not limited to such an embodiment.
 The speech recognition apparatus according to the fourth embodiment has basically the same configuration as the speech recognition apparatus 380 shown in FIG. 6. The difference is that an acoustic model 500 configured as shown in FIG. 8 is used in place of the acoustic model 398 used by the speech recognition apparatus 380.
 Referring to FIG. 8, the acoustic model 500 includes a first sub-network 540 that receives the feature values 210 of the speech signal 112, which is noise-superimposed speech; a second sub-network 542 that receives the feature values 212 of the first enhanced speech; a third sub-network 544 that receives the feature values 430 of the second enhanced speech; a fourth sub-network 546 that receives the feature values 432 of the third enhanced speech; a fifth sub-network 548 that receives the feature values 434 of the fourth enhanced speech; an intermediate sub-network 550 connected to receive the outputs of the first sub-network 540, the second sub-network 542, the third sub-network 544, the fourth sub-network 546, and the fifth sub-network 548; and an output layer 552 whose input is connected to the output of the intermediate sub-network 550 and which outputs the phoneme estimation result that is the output of the acoustic model 500.
 The first sub-network 540 includes an input layer 570 that receives the feature values 210 of the noise-superimposed speech, and hidden layers 572, 574, and 576 connected in order between the input layer 570 and the input of the intermediate sub-network 550.
 The second sub-network 542 includes an input layer 580 that receives the feature values 212 of the first enhanced speech, and hidden layers 582, 584, and 586 connected in order between the input layer 580 and the input of the intermediate sub-network 550.
 The third sub-network 544 includes an input layer 590 that receives the feature values 430 of the second enhanced speech, and hidden layers 592, 594, and 596 connected in order between the input layer 590 and the input of the intermediate sub-network 550.
 The fourth sub-network 546 includes an input layer 600 that receives the feature values 432 of the third enhanced speech, and hidden layers 602, 604, and 606 connected in order between the input layer 600 and the input of the intermediate sub-network 550.
 The fifth sub-network 548 includes an input layer 610 that receives the feature values 434 of the fourth enhanced speech, and hidden layers 612, 614, and 616 connected in order between the input layer 610 and the input of the intermediate sub-network 550.
 The intermediate sub-network 550 includes a hidden layer 620 connected to receive the outputs of the first to fifth sub-networks 540, 542, 544, 546, and 548, and hidden layers 622, 624, and 626 connected in order between the hidden layer 620 and the output layer 552.
 The configuration of the speech recognition apparatus according to this embodiment is also the same as that shown in FIG. 6, the only difference being that the acoustic model 500 shown in FIG. 8 is used in place of the acoustic model 398 of FIG. 6.
 In the third embodiment, all of the hidden layers propagate the feature values 210 of the noise-superimposed speech and the feature values 212, 430, 432, and 434 of the first to fourth enhanced speech. In this embodiment, however, the feature values 210 of the noise-superimposed speech propagate only inside the first sub-network 540 before being input to the hidden layer 620. Similarly, the feature values 212, 430, 432, and 434 of the first to fourth enhanced speech propagate only inside the second to fifth sub-networks 542, 544, 546, and 548, respectively, before being input to the hidden layer 620. Inside the intermediate sub-network 550, which begins at the hidden layer 620, all of the feature information is integrated and propagates through the hidden layers in order, and finally the phoneme estimation result is output from the output layer 552.
 The speech recognition apparatus using the acoustic model 500 according to the fourth embodiment was also able to perform speech recognition with higher accuracy than the conventional speech recognition apparatus.
 [Fifth Embodiment]
 FIG. 9 shows a schematic configuration of an acoustic model 650 used in the speech recognition apparatus according to the fifth embodiment. As can be seen from FIG. 9, this acoustic model 650 also consists of a deep neural network.
 The acoustic model 650 shown in FIG. 9 is the acoustic model 206 shown in FIG. 4 with a gate layer 682 added in front of the input layer 240, which receives both the feature values 210 of the noise-superimposed speech and the feature values 212 of the first enhanced speech. The gate layer 682 receives the feature values 212 of the first enhanced speech, multiplies them by weights in the interval [0, 1], and feeds the result to the input layer 240. Thereafter, as in FIG. 4, information from these feature values propagates jointly from the hidden layer 242 through to the output layer 256.
 The gate layer 682 can also be regarded as a kind of hidden layer, but its function differs from that of an ordinary hidden layer. Referring to FIG. 13, where the gate layer 682 is represented generically as a gate layer 1100, the gate layer 1100 has a gate function that multiplies each element of an input vector x_t, element by element, by the gate weight g_t = σ(Wx_t + b) and outputs an output vector y_t, that is, y_t = g_t ⊙ x_t, where ⊙ denotes element-wise multiplication. Here, if the vector x_t is M-dimensional, W is an M×M weight matrix, b is an M-dimensional bias vector, and σ(·) is an arbitrary activation function whose range is the interval [0, 1]. As noted above, each element of the gate weight is a value in the interval [0, 1]. The elements of the weight matrix W and the bias vector b are all targets of training. Apart from obeying the range constraint described above, the elements of the weight matrix W and the bias vector b can be trained using the same methods as an ordinary deep neural network. In the description that follows, every layer called a gate layer has the same function as the gate layer 1100 of FIG. 13, and all of its parameters can be trained in the same way as the other parameters, subject to the [0, 1] range constraint described above.
 It should be noted that the gate layer applies the gating separately to each element of the input vector. It is therefore possible to gate, for each individual feature value of the enhanced speech, whether or not it is used during speech recognition.
 As a result, each element of the input vector formed from the feature values is selected or discarded according to the weight for that element. This selection is performed by the weight matrix W, the bias vector b, and the value of each element contained in each input vector. In other words, each element is selected or discarded according to the values of the input feature values and used for speech recognition accordingly.
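 The gating operation described above can be sketched as follows. This is a minimal illustration of a gate layer of the kind shown in FIG. 13; the use of the sigmoid as σ(·) and the layer dimensionality are assumptions chosen only for the example.

```python
import torch
import torch.nn as nn

class GateLayer(nn.Module):
    """Element-wise gate: y_t = sigmoid(W x_t + b) * x_t."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)  # W (dim x dim) and bias b are trained

    def forward(self, x):
        g = torch.sigmoid(self.linear(x))  # gate weights in [0, 1], one per element
        return g * x                       # pass through or suppress each feature element
```

 During training the gradient flows through both the gate weights and the gated features, so the network can learn, per element, which enhanced-speech feature values to let through.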
 The speech recognition apparatus using the acoustic model 650 according to the fifth embodiment also achieved higher accuracy than the conventional technique.
 [Sixth Embodiment]
 FIG. 10 shows a schematic configuration of an acoustic model 750 used in the speech recognition apparatus according to the sixth embodiment of the present invention. The configuration of the speech recognition apparatus itself according to this embodiment is the same as that shown in FIG. 3, except that the acoustic model 750 is used in place of the acoustic model 206 of FIG. 3.
 The acoustic model 750 as a whole constitutes a single deep neural network. The acoustic model 750 includes a first sub-network 770 that receives the feature values 210 of the noise-superimposed speech as input, a second sub-network 772 that receives the feature values 212 of the first enhanced speech as input, a third sub-network 774, part of the deep neural network, connected to receive the output of the first sub-network 770 and the output of the second sub-network 772, and an output layer 776 that receives the output of the third sub-network 774 and identifies the phoneme estimated by the acoustic model 750.
 The first sub-network 770 includes an input layer 800 that receives the feature values 210 of the noise-superimposed speech, and hidden layers 802, 804, and 806 connected in order between the input layer 800 and the input of the third sub-network 774.
 The second sub-network 772 includes an input layer 810 that receives the feature values 212 of the first enhanced speech, hidden layers 812, 814, and 816 connected in order after the input layer 810, and a gate layer 818 connected to receive the output of the hidden layer 816 and having the same function as the gate layer 682 of the fifth embodiment.
 The third sub-network 774 includes a hidden layer 830 that receives the output of the first sub-network 770 and the output of the second sub-network 772, and hidden layers 832, 834, and 836 connected in order between the hidden layer 830 and the output layer 776.
 Unlike the model shown in FIG. 9, in the acoustic model 750 the feature values 210 of the noise-superimposed speech and the feature values 212 of the first enhanced speech are kept separate in the first half of the model, propagating inside the first sub-network 770 and the second sub-network 772, respectively. The output of the first sub-network 770 is input to the third sub-network 774 as it is. In the second sub-network 772, however, the gating of the gate layer 818 is applied to the output of the last hidden layer 816, and the result is then input to the hidden layer 830.
 With this configuration, when it is advantageous to use the feature values 212 of the first enhanced speech, they are used effectively. When using the feature values 212 of the first enhanced speech would be disadvantageous, the output of the second sub-network 772 takes small values and, as a result, is not used for speech recognition.
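 A minimal sketch of this arrangement might look like the following; it reuses the stack() helper and the GateLayer class from the earlier sketches (assumed to be in scope) and the same assumed layer sizes, gating the enhanced-speech branch just before the two branches are concatenated and fed to the merged stack.

```python
import torch
import torch.nn as nn

class GatedTwoBranchAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=2048, num_phonemes=2000):
        super().__init__()
        self.noisy_branch = stack(feat_dim, hidden_dim, 3)     # sub-network for noisy features
        self.enhanced_branch = stack(feat_dim, hidden_dim, 3)  # sub-network for enhanced features
        self.gate = GateLayer(hidden_dim)                      # gates the enhanced branch output
        self.merged = stack(hidden_dim * 2, hidden_dim, 4)
        self.output = nn.Linear(hidden_dim, num_phonemes)

    def forward(self, noisy_feat, enhanced_feat):
        a = self.noisy_branch(noisy_feat)                       # passed on unchanged
        b = self.gate(self.enhanced_branch(enhanced_feat))      # suppressed when unhelpful
        return self.output(self.merged(torch.cat([a, b], dim=-1)))
```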
 Using the acoustic model 750 according to the sixth embodiment as well, speech could be recognized with higher accuracy than with the conventional technique.
 [Seventh Embodiment]
 FIG. 11 shows a schematic configuration of an acoustic model 850 used in the speech recognition apparatus according to the seventh embodiment. As can be seen from FIG. 11, this acoustic model 850 also consists of a deep neural network. The speech recognition apparatus according to the seventh embodiment is the same as the speech recognition apparatus 380 shown in FIG. 6, except that the acoustic model 850 is used in place of the acoustic model 398 of FIG. 7.
 Referring to FIG. 11, in addition to the components of the acoustic model 398 shown in FIG. 7, the acoustic model 850 includes, in front of the input layer 450, a gate layer 892 that receives the feature values 212 of the first enhanced speech, multiplies them by weights in the interval [0, 1], and feeds the result to the input layer 450; a gate layer 902 that does the same for the feature values 430 of the second enhanced speech; a gate layer 912 that does the same for the feature values 432 of the third enhanced speech; and a gate layer 922 that does the same for the feature values 434 of the fourth enhanced speech. In other respects the acoustic model 850 is identical to the acoustic model 398 shown in FIG. 7.
 In the acoustic model 850, thanks to the functions of the gate layers 892, 902, 912, and 922, whichever of the feature values 212, 430, 432, and 434 of the first to fourth enhanced speech are advantageous for speech recognition can be used effectively, while those that are not can be kept from being used. As a result, speech recognition using the acoustic model 850 can also achieve high accuracy.
 Indeed, as described later, the speech recognition apparatus using the acoustic model 850 of this embodiment was able to perform speech recognition with higher accuracy than the conventional technique.
 [Eighth Embodiment]
 FIG. 12 shows a schematic configuration of an acoustic model 950 used in the speech recognition apparatus according to the eighth embodiment of the present invention. Like the acoustic models of the other embodiments, the acoustic model 950 also consists of a deep neural network.
 The acoustic model 950 includes a first input sub-network 960 that receives the feature values 210 of the noise-superimposed speech, a second input sub-network 962 that receives the feature values 212 of the first enhanced speech, a third input sub-network 964 that receives the feature values 430 of the second enhanced speech, a fourth input sub-network 966 that receives the feature values 432 of the third enhanced speech, a fifth input sub-network 968 that receives the feature values 434 of the fourth enhanced speech, an intermediate sub-network 970 that receives the outputs of the first to fifth input sub-networks 960, 962, 964, 966, and 968, and an output layer 972 that receives the output of the intermediate sub-network 970 and outputs the phoneme estimated by the acoustic model 950.
 The first input sub-network 960 includes an input layer 980 that receives the feature values 210 of the noise-superimposed speech, and hidden layers 982, 984, and 986 connected in order between the input layer 980 and the intermediate sub-network 970.
 The second input sub-network 962 includes an input layer 990 that receives the feature values 212 of the first enhanced speech, hidden layers 992, 994, and 996 connected in order after the input layer 990, and a gate layer 998 inserted between the output of the hidden layer 996 and the input of the intermediate sub-network 970.
 The third input sub-network 964 includes an input layer 1000 that receives the feature values 430 of the second enhanced speech, hidden layers 1002, 1004, and 1006 connected in order after the input layer 1000, and a gate layer 1008 inserted between the output of the hidden layer 1006 and the input of the intermediate sub-network 970.
 The fourth input sub-network 966 includes an input layer 1010 that receives the feature values 432 of the third enhanced speech, hidden layers 1012, 1014, and 1016 connected in order after the input layer 1010, and a gate layer 1018 inserted between the output of the hidden layer 1016 and the input of the intermediate sub-network 970.
 The fifth input sub-network 968 includes an input layer 1020 that receives the feature values 434 of the fourth enhanced speech, hidden layers 1022, 1024, and 1026 connected in order after the input layer 1020, and a gate layer 1028 inserted between the output of the hidden layer 1026 and the input of the intermediate sub-network 970.
 The intermediate sub-network 970 includes a hidden layer 1030 that receives the outputs of the first input sub-network 960 and the second to fifth input sub-networks 962, 964, 966, and 968, and hidden layers 1032, 1034, and 1036 connected in order between the hidden layer 1030 and the output layer 972.
 The operation of the speech recognition apparatus using the acoustic model 950 is also the same as that of the speech recognition apparatus 380 shown in FIG. 6, except that the acoustic model 950 is used as the acoustic model.
 In this embodiment, phonemes can be estimated with each element of the feature values obtained by the first to fourth speech enhancements weighted by a coefficient taking a value in the interval [0, 1]. For each speech enhancement, and for each of its feature values, features that are advantageous for speech recognition can be used effectively while disadvantageous features can be kept from being used. As a result, the accuracy of speech recognition can be increased.
 As described later, this embodiment achieved higher accuracy not only than the conventional technique but also than any of the first to seventh embodiments described above.
 [Experimental Results]
 FIG. 14 shows, in tabular form, the results (word error rates) of experiments conducted on each of the embodiments described above. In these experiments, the CHiME3 data (speech recorded outdoors using a tablet) described in Non-Patent Document 5 was used as the recognition target. The speech enhancement processes used in the experiments are as follows.
 - Speech enhancement 1: the technique disclosed in Non-Patent Document 1
 - Speech enhancement 2: the technique disclosed in Non-Patent Document 2
 - Speech enhancement 3: the technique disclosed in Non-Patent Document 3
 - Speech enhancement 4: the technique disclosed in Non-Patent Document 4
 In the experiments on the first, second, fifth, and sixth embodiments, speech enhancements 1 to 4 above were each adopted in turn as, for example, the speech enhancement unit 202 shown in FIG. 3, and the speech recognition accuracy was measured using the acoustic model of each embodiment. In the experiments on the third, fourth, seventh, and eighth embodiments, speech enhancements 1 to 4 above were adopted as the speech enhancement units 202, 392, 394, and 396 shown in FIG. 6, respectively, and the speech recognition accuracy was measured using the acoustic model of each embodiment.
 Although not shown in FIG. 14, the word error rate when the conventional speech recognition apparatus performed speech recognition on the same data without speech enhancement was 22.64%.
 As is clear from FIG. 14, according to the first to eighth embodiments of the present invention, the word error rate was lower than when the conventional speech enhancement was used; in other words, the accuracy of speech recognition was higher. In most cases the accuracy was also higher than that of conventional speech recognition without speech enhancement. In particular, the second embodiment achieved high accuracy whichever speech enhancement was used. The accuracy of the fourth and eighth embodiments was very high, and the eighth embodiment in particular achieved even higher accuracy than the other embodiments.
 [Implementation by a Computer]
 Each functional unit of the speech recognition apparatuses according to the embodiments described above can be realized by computer hardware and a program executed on that hardware by a CPU (central processing unit) and a GPU (graphics processing unit). FIG. 15 shows computer hardware that implements each of the above speech recognition apparatuses. A GPU is normally used for image processing, and the technique of using a GPU for general computation rather than image processing in this way is called GPGPU (general-purpose computing on graphics processing units). A GPU can execute many operations of the same kind simultaneously and in parallel. A neural network, on the other hand, requires a large amount of computation, particularly during training, and those computations can be performed in a massively parallel manner. A computer equipped with a GPU is therefore well suited to the training and inference of the neural networks that constitute the speech recognition apparatus and the acoustic model used in it. When speech recognition is performed using an acoustic model whose training has been completed, a computer with a sufficiently fast CPU does not necessarily need a GPU.
 Referring to FIG. 15, the computer system 1130 includes a computer 1140 having a memory port 1152 and a DVD (Digital Versatile Disc) drive 1150, a keyboard 1146, a mouse 1148, and a monitor 1142.
 The computer 1140 further includes a CPU 1156 and a GPU 1158; a bus 1166 connected to these as well as to the memory port 1152 and the DVD drive 1150; a ROM 1160, a read-only memory that stores a boot program and the like; a random access memory (RAM) 1162, a computer-readable storage medium connected to the bus 1166 that stores program instructions, system programs, work data, and the like; and a hard disk 1154, a computer-readable non-volatile storage medium. The computer 1140 further includes a network interface (I/F) 1144, connected to the bus 1166, that provides a connection to a network 1168, and an audio I/F 1170 for inputting and outputting audio signals to and from the outside.
 A program for causing the computer system 1130 to function as each functional unit of the speech recognition apparatuses according to the embodiments described above and as the storage device for the acoustic model is stored on a DVD 1172 or a removable memory 1164, both computer-readable storage media loaded into the DVD drive 1150 or the memory port 1152, and is then transferred to the hard disk 1154. Alternatively, the program may be transmitted to the computer 1140 over the network 1168 and stored on the hard disk 1154. The program is loaded into the RAM 1162 at the time of execution. The program may also be loaded directly into the RAM 1162 from the DVD 1172, from the removable memory 1164, or via the network 1168. The data required for the above processing are stored at predetermined addresses such as the hard disk 1154, the RAM 1162, or registers in the CPU 1156 or the GPU 1158, are processed by the CPU 1156 or the GPU 1158, and are stored at the addresses designated by the program. The parameters of the acoustic model whose training has finally been completed are stored, for example, on the hard disk 1154 together with the program implementing the training and inference algorithms of the acoustic model, or are stored on the DVD 1172 or the removable memory 1164 via the DVD drive 1150 and the memory port 1152, respectively, or are transmitted to another computer or storage device connected via the network I/F 1144.
 This program includes an instruction sequence consisting of a plurality of instructions for causing the computer 1140 to function as each of the apparatuses and systems according to the embodiments described above. The numerical computation in each of the above apparatuses and systems is performed using the CPU 1156 and the GPU 1158. The CPU 1156 alone may be used, but using the GPU 1158 is faster. Some of the basic functions required to cause the computer 1140 to perform this operation are provided by the operating system or third-party programs running on the computer 1140, or by various dynamically linkable programming toolkits or program libraries installed on the computer 1140. The program itself therefore does not necessarily need to include all of the functions required to realize the speech recognition apparatus of this embodiment. The program need only include the instructions that realize the functions of the system, apparatus, or method described above by dynamically calling, at run time, the appropriate functions or the appropriate programs in a programming toolkit or program library in a controlled manner so that the desired result is obtained. Of course, the speech recognition apparatus described above may also be realized by loading into the computer a program that incorporates all of the necessary functions through static linking.
 [Modification]
 The third, fourth, seventh, and eighth embodiments described above use four types of speech enhancement processing. However, the present invention is not limited to such embodiments; two, three, or five or more types of speech enhancement processing may be used.
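 As a concrete illustration of what a single one of these speech enhancement processes can look like, the following is a minimal sketch of magnitude spectral subtraction in Python with NumPy. It is not the implementation used in the embodiments; the frame length, hop size, noise-estimation window, and spectral floor are assumptions chosen purely for illustration.

import numpy as np

def spectral_subtraction(noisy, frame_len=512, hop=256, noise_frames=10, floor=0.002):
    # Window the signal into overlapping frames and take the short-time spectrum.
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Assume the first few frames contain only noise and use them as the noise estimate.
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract the noise magnitude and floor the result to avoid negative values.
    enhanced_mag = np.maximum(mag - noise_mag, floor * mag)
    enhanced = np.fft.irfft(enhanced_mag * np.exp(1j * phase), n=frame_len, axis=1)
    # Overlap-add the enhanced frames back into a waveform.
    out = np.zeros(len(noisy))
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += enhanced[i] * window
    return out

 A second, different process (for example an MMSE-STSA estimator or a denoising autoencoder) would then be applied to the same acoustic signal, so that the recognizer receives several differently enhanced signals in addition to the original.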
 In the embodiments above, the deep neural network constituting the acoustic model has seven hidden layers in total, and in the third, fourth, seventh, and eighth embodiments three of those hidden layers are placed in the first half of the network and four in the second half. However, the present invention is not limited to such embodiments. The number of hidden layers may be six or fewer, or eight or more. Likewise, when building an acoustic model according to the third, fourth, seventh, and eighth embodiments, there is no need to set the numbers of hidden layers in the first and second halves to three and four, respectively. It is a fact, however, that in the experiments above the best results were obtained with three layers in the first half and four in the second half.
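 To make this front/back split concrete, the following is a minimal PyTorch-style sketch of an acoustic model whose two input-side branches each have three hidden layers and whose shared output side has four. The layer width, activation function, and number of output senones are assumptions for illustration only; as stated above, other depths and splits are equally possible.

import torch
import torch.nn as nn

class TwoStreamAcousticModel(nn.Module):
    # One branch for the features of the noise-superimposed speech, one branch for the
    # features of the enhanced speech, followed by a shared output-side sub-network.
    def __init__(self, feat_dim=40, hidden=1024, n_senones=3000):
        super().__init__()
        def stack(in_dim, n_layers):
            layers = []
            for i in range(n_layers):
                layers += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
            return nn.Sequential(*layers)
        self.noisy_branch = stack(feat_dim, 3)      # first half: three hidden layers
        self.enhanced_branch = stack(feat_dim, 3)   # first half: three hidden layers
        self.shared = stack(2 * hidden, 4)          # second half: four hidden layers
        self.out = nn.Linear(hidden, n_senones)

    def forward(self, noisy_feats, enhanced_feats):
        h = torch.cat([self.noisy_branch(noisy_feats), self.enhanced_branch(enhanced_feats)], dim=-1)
        return self.out(self.shared(h))

 Changing the two depth arguments in the constructor is all that is needed to experiment with the other layer counts mentioned above.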
 In the embodiments above, the present invention is applied to a single-channel audio signal. However, the present invention is not limited to such embodiments and is also applicable to multi-channel audio signals.
 The embodiments disclosed here are merely illustrative, and the present invention is not restricted to the embodiments described above. The scope of the present invention is indicated by each claim of the claims, taking into account the description in the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording recited therein.
 The present invention can be used to improve the interface with users of computers, home appliances, industrial products, and the like, allowing users to operate these devices easily by natural-language speech.
100, 180, 380 Speech recognition device
110 Waveform
112 Speech signal
114, 202, 392, 394, 396 Speech enhancement unit
116, 203, 393, 395, 397 Enhanced speech signal
118, 220, 410, 412, 414 Feature extraction unit
120, 204, 402 Speech recognition unit
122, 208, 400 Text
124, 206, 280, 398, 500, 650, 750, 850, 950 Acoustic model
126 Pronunciation dictionary
128 Language model
200, 390 Expanded feature extraction unit
210 Features of noise-superimposed speech
212 Features of first enhanced speech
300 Sub-network for noise-superimposed speech
302 Sub-network for enhanced speech
304 Output-side sub-network
430 Features of second enhanced speech
432 Features of third enhanced speech
434 Features of fourth enhanced speech
452 Intermediate layer
540, 770 First sub-network
542, 772 Second sub-network
544, 774 Third sub-network
546 Fourth sub-network
548 Fifth sub-network
550, 970 Intermediate sub-network
682, 818, 892, 902, 912, 922, 998, 1008, 1018, 1028, 1100 Gate layer
960 First input sub-network
962 Second input sub-network
964 Third input sub-network
966 Fourth input sub-network
968 Fifth input sub-network
1130 Computer system

Claims (6)

  1. A noise-tolerant speech recognition device comprising:
     a speech enhancement circuit that receives, as input, an acoustic signal in which a noise signal is superimposed on a speech signal that is a target signal, and outputs an enhanced speech signal in which the speech signal is enhanced; and
     a speech recognition unit that receives the enhanced speech signal and the acoustic signal and converts the utterance content of the speech signal into text.
  2. The noise-tolerant speech recognition device according to claim 1, wherein
     the speech enhancement circuit includes:
     a first speech enhancement unit that performs a first type of speech enhancement processing on the acoustic signal and outputs a first enhanced speech signal, and
     a second speech enhancement unit that performs a second type of speech enhancement processing, different from the first type, on the acoustic signal and outputs a second enhanced speech signal, and
     the speech recognition unit receives the first and second enhanced speech signals and the acoustic signal and converts the utterance content of the speech signal into text.
  3. The noise-tolerant speech recognition device according to claim 1, wherein the speech recognition unit includes:
     first feature extraction means for extracting first features from the acoustic signal;
     second feature extraction means for extracting second features from the enhanced speech signal;
     feature selection means for selecting or discarding each of the second features in accordance with the first features and the second features; and
     speech recognition means for converting the utterance content of the speech signal into text using the second features selected by the feature selection means.
  4. The noise-tolerant speech recognition device according to claim 3, further comprising acoustic model storage means for storing an acoustic model used by the speech recognition means for speech recognition, wherein
     the acoustic model is a deep neural network having a plurality of hidden layers, and
     the acoustic model includes:
     a first sub-network that receives the first features as input,
     a second sub-network that receives the second features as input, and
     a third sub-network that receives the output of the first sub-network and the output of the second sub-network and outputs phonemes estimated from the first features and the second features.
  5. A noise-tolerant speech recognition method comprising:
     a step in which a computer receives, as input, a single-channel acoustic signal in which a noise signal is superimposed on a speech signal that is a target signal, and outputs an enhanced speech signal in which the speech signal is enhanced; and
     a speech recognition step in which a computer receives the enhanced speech signal and the acoustic signal and converts the utterance content of the speech signal into text.
  6. A computer program that causes a computer to function as the noise-tolerant device according to any one of claims 1 to 4.
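 As a rough illustration of the feature selection means recited in claim 3, assuming it is realized as a gate layer inside the acoustic model (reference numerals 682, 818, and the like), the following Python sketch computes a sigmoid gate from both feature streams and weights the enhanced-speech features element by element. The gate form and the dimensions are assumptions for illustration, not the claimed implementation itself.

import torch
import torch.nn as nn

class GatedFeatureSelection(nn.Module):
    # Decide, element by element, how much of each enhanced-speech feature to keep,
    # based on both the noisy-speech features and the enhanced-speech features.
    def __init__(self, feat_dim=40):
        super().__init__()
        self.gate = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, noisy_feats, enhanced_feats):
        g = torch.sigmoid(self.gate(torch.cat([noisy_feats, enhanced_feats], dim=-1)))
        return g * enhanced_feats  # values near 1 keep a feature, values near 0 discard it

 The selected features would then be passed on to the speech recognition means, for example as the input to the second sub-network of the acoustic model of claim 4.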
PCT/JP2019/024279 2018-07-17 2019-06-19 Noise-tolerant voice recognition device and method, and computer program WO2020017226A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018133977A JP7231181B2 (en) 2018-07-17 2018-07-17 NOISE-RESISTANT SPEECH RECOGNITION APPARATUS AND METHOD, AND COMPUTER PROGRAM
JP2018-133977 2018-07-17

Publications (1)

Publication Number Publication Date
WO2020017226A1 true WO2020017226A1 (en) 2020-01-23

Family

ID=69164003

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/024279 WO2020017226A1 (en) 2018-07-17 2019-06-19 Noise-tolerant voice recognition device and method, and computer program

Country Status (2)

Country Link
JP (1) JP7231181B2 (en)
WO (1) WO2020017226A1 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6243858B2 (en) * 2015-02-05 2017-12-06 日本電信電話株式会社 Speech model learning method, noise suppression method, speech model learning device, noise suppression device, speech model learning program, and noise suppression program
JP6464005B2 (en) * 2015-03-24 2019-02-06 日本放送協会 Noise suppression speech recognition apparatus and program thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001092491A (en) * 1999-09-01 2001-04-06 Trw Inc System and method for reducing noise by using single microphone
JP2015102806A (en) * 2013-11-27 2015-06-04 国立研究開発法人情報通信研究機構 Statistical acoustic model adaptation method, acoustic model learning method suited for statistical acoustic model adaptation, storage medium storing parameters for constructing deep neural network, and computer program for statistical acoustic model adaptation
WO2017135148A1 (en) * 2016-02-02 2017-08-10 日本電信電話株式会社 Acoustic model learning method, voice recognition method, acoustic model learning device, voice recognition device, acoustic model learning program, and voice recognition program
US20170256254A1 (en) * 2016-03-04 2017-09-07 Microsoft Technology Licensing, Llc Modular deep learning model

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111508475A (en) * 2020-04-16 2020-08-07 五邑大学 Robot awakening voice keyword recognition method and device and storage medium
CN111508475B (en) * 2020-04-16 2022-08-09 五邑大学 Robot awakening voice keyword recognition method and device and storage medium

Also Published As

Publication number Publication date
JP2020012928A (en) 2020-01-23
JP7231181B2 (en) 2023-03-01

Similar Documents

Publication Publication Date Title
JP7023934B2 (en) Speech recognition method and equipment
CN106688034B (en) Text-to-speech conversion with emotional content
Erdogan et al. Deep recurrent networks for separation and recognition of single-channel speech in nonstationary background audio
JP4774100B2 (en) Reverberation removal apparatus, dereverberation removal method, dereverberation removal program, and recording medium
Lokesh et al. Speech recognition system using enhanced mel frequency cepstral coefficient with windowing and framing method
Ravanelli et al. Realistic multi-microphone data simulation for distant speech recognition
Wang et al. Recurrent deep stacking networks for supervised speech separation
CN113436643A (en) Method, device, equipment and storage medium for training and applying speech enhancement model
Saito et al. Voice conversion using input-to-output highway networks
Yuliani et al. Speech enhancement using deep learning methods: A review
Barker et al. The CHiME challenges: Robust speech recognition in everyday environments
JP2022031196A (en) Noise removal method and device
JP2017003622A (en) Vocal quality conversion method and vocal quality conversion device
JP6348427B2 (en) Noise removal apparatus and noise removal program
Grais et al. Discriminative enhancement for single channel audio source separation using deep neural networks
Saleem et al. Multi-objective long-short term memory recurrent neural networks for speech enhancement
Saleem et al. Unsupervised speech enhancement in low SNR environments via sparseness and temporal gradient regularization
EP3392882A1 (en) Method for processing an input audio signal and corresponding electronic device, non-transitory computer readable program product and computer readable storage medium
CN113327594B (en) Speech recognition model training method, device, equipment and storage medium
KR20200028852A (en) Method, apparatus for blind signal seperating and electronic device
Liu et al. Using bidirectional associative memories for joint spectral envelope modeling in voice conversion
WO2020017226A1 (en) Noise-tolerant voice recognition device and method, and computer program
Duong et al. Gaussian modeling-based multichannel audio source separation exploiting generic source spectral model
Shukla et al. A subspace projection approach for analysis of speech under stressed condition
Li et al. A fast convolutional self-attention based speech dereverberation method for robust speech recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19837658

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19837658

Country of ref document: EP

Kind code of ref document: A1