WO2013069187A1 - Speech recognition system and speech recognition method - Google Patents

Speech recognition system and speech recognition method Download PDF

Info

Publication number
WO2013069187A1
WO2013069187A1 (PCT/JP2012/005874)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
input
data
unit
speech
Prior art date
Application number
PCT/JP2012/005874
Other languages
French (fr)
Japanese (ja)
Inventor
Satoshi Tsukada
Eiji Takada
Takanori Tsujikawa
Original Assignee
NEC Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corporation
Publication of WO2013069187A1 publication Critical patent/WO2013069187A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating

Definitions

  • the present invention relates to a voice recognition system and a voice recognition method for performing voice recognition of voice transmitted using wireless communication.
  • Patent Document 1 describes a noise removing device that removes noise mixed in speech.
  • the noise removal device described in Patent Document 1 includes a first microphone that collects sound and a second microphone that collects ambient noise.
  • The noise removal device described in Patent Document 1 converts the sound input to each microphone into a time-series feature vector, and removes stationary and non-stationary noise based on each converted time-series vector.
  • Patent Document 2 describes a voice processing system using a headset with a wireless communication function.
  • the microphone provided in the headset detects voice, and the result of voice recognition performed by the voice recognition unit of the headset is transmitted to an external device by wireless communication.
  • Patent Document 3 describes a speech recognition system that performs speech recognition by compressing and expanding a speech signal.
  • a voice recognition device is provided on the headset side used by an operator for voice input.
  • an object of the present invention is to provide a voice recognition system and a voice recognition method capable of improving the accuracy of voice recognition while downsizing an apparatus used by a user who inputs voice.
  • A voice recognition system according to the present invention includes a voice input device that inputs a user's voice, and a voice recognition device that performs voice recognition of the voice input to the voice input device. The voice input device includes at least two input means for inputting the voice and the noise present when the user utters the voice, and wireless transmission means for wirelessly transmitting the voice data, including noise, input to each input means to the voice recognition device.
  • The voice recognition device includes voice extraction means for extracting voice data from which noise has been removed from the received voice data, and voice recognition means for performing voice recognition of the voice data extracted by the voice extraction means.
  • In the voice recognition method according to the present invention, a voice input device inputs the user's voice and the noise present when the user utters the voice, using two or more input means. The voice input device then wirelessly transmits the voice data, including the voice and noise input to each input means, to the voice recognition device. The voice recognition device extracts voice data from which the noise has been removed from the received voice data, and performs voice recognition of the extracted voice data.
  • According to the present invention, it is possible to improve the accuracy of voice recognition while miniaturizing the device used by the user who inputs voice.
  • FIG. 1 is a block diagram showing a configuration example of a first embodiment of a speech recognition system according to the present invention.
  • the voice recognition system according to the present embodiment includes a voice input / output unit 10 and a voice recognition response unit 20.
  • the voice input / output unit 10 includes a first microphone 11, a first input voice transmission unit 12, a second microphone 13, a second input voice transmission unit 14, and a control unit 15.
  • the audio input / output unit 10 may include an output unit (not shown) that outputs audio input from the first microphone 11 and the second microphone 13.
  • a case where the voice input / output unit 10 does not include an output unit will be described as an example.
  • the first microphone 11 and the second microphone 13 collect the user's voice and ambient noise when the user is uttering the voice.
  • a voice including noise may be simply referred to as a voice.
  • the first microphone 11 and the second microphone 13 are provided at physically separated positions.
  • For example, when the voice input / output unit 10 is realized as a headset, the first microphone 11 may be disposed near the user's mouth, and the second microphone 13 near the user's ear. By physically separating the microphones in this way, different sound is input to each microphone.
  • the first microphone 11 is used to collect user's voice
  • the second microphone 13 is used to collect ambient noise.
  • the type of sound collected by the first microphone 11 and the second microphone 13 is not particularly limited.
  • both the first microphone 11 and the second microphone 13 may collect voice in which the user's voice and ambient noise are mixed.
  • the first microphone 11 may be used particularly for collecting ambient noise
  • the second microphone 13 may be used particularly for collecting user's voice.
  • the voice input / output unit 10 includes two microphones.
  • the number of microphones included in the voice input / output unit 10 is not limited to two.
  • the voice input / output unit 10 may include three or more microphones. Even when three or more microphones are included in the voice input / output unit 10, the type of voice collected by each microphone is not particularly limited.
  • the first input voice transmission unit 12 transmits the voice input to the first microphone 11 to the voice recognition response unit 20 wirelessly.
  • the second input voice transmission unit 14 transmits the voice input to the second microphone 13 to the voice recognition response unit 20 wirelessly.
  • the voice input / output unit 10 may include one input voice transmission unit in which the functions of the first input voice transmission unit 12 and the second input voice transmission unit 14 are combined. Then, the input voice transmission unit may determine which microphone the collected voice is input to, and may transmit the voice to the voice recognition response unit 20.
  • The first input voice transmission unit 12 and the second input voice transmission unit 14 digitize the audio input to each microphone before wirelessly transmitting it.
  • The first input voice transmission unit 12 and the second input voice transmission unit 14 may transmit status information indicating the state of the voice input / output unit 10 to the voice recognition response unit 20 in response to an instruction from the control unit 15 described later.
  • The control unit 15 controls the state of the voice input / output unit 10 based on a control command received from another device (for example, the voice recognition response unit 20). As control commands, the control unit 15 receives, for example, the number of microphone channels, the compression method of the audio data to be transmitted, the sampling frequency of the audio data, microphone switch settings, operation mode settings (for example, the protocol to be used), microphone volume settings, and speaker volume settings.
  • The control unit 15 may instruct the first input voice transmission unit 12 and the second input voice transmission unit 14 to transmit status information indicating the state of the voice input / output unit 10.
  • Status information includes, for example, the number of microphone channels, the audio data sampling frequency, the microphone switch status, operation mode information, the transmission data block size, microphone volume, speaker volume, radio wave status, battery level, battery charge status, and time information.
  • the voice input / output unit 10 may transmit status information to another device and operate based on a control command from the other device. By doing so, it is not necessary to incorporate a determination process for performing an operation in the voice input / output unit 10, so that the voice input / output unit 10 can be further downsized.
  • the voice recognition response unit 20 includes a first input voice reception unit 21, a second input voice reception unit 22, a voice extraction unit 23, and a voice recognition unit 24.
  • The speech recognition response unit 20 may include a control unit (not shown) that synthesizes speech and reproduces the synthesized speech based on the result of speech recognition by the speech recognition unit 24 described later.
  • the first input voice reception unit 21 receives the voice data wirelessly transmitted by the first input voice transmission unit 12.
  • the second input voice reception unit 22 receives the voice data wirelessly transmitted by the second input voice transmission unit 14.
  • The voice extraction unit 23 extracts audio data from which ambient noise has been removed, based on the voice data received by the first input voice receiver 21 (hereinafter, first voice data) and the voice data received by the second input voice receiver 22 (hereinafter, second voice data).
  • the voice extraction unit 23 uses the received first voice data and second voice data to remove noise mixed in the voice uttered by the user.
  • the voice extraction unit 23 removes the voice data of the noise collected by the second microphone 13 from the voice data of the voice collected by the first microphone 11.
  • The voice extraction unit 23 may use, for example, the noise removal method of the noise removal device described in Patent Document 1. If the type of sound collected by each microphone can be specified in this way, the processing for extracting the sound can be sped up.
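Patent Document 1's exact algorithm is not reproduced in this publication; as a hedged sketch only, the two-microphone idea can be approximated by spectral subtraction, assuming the first microphone carries voice plus noise and the second carries mostly noise (the function name and frame size below are illustrative choices, not the patent's method):

```python
import numpy as np

def spectral_subtraction(voice_mic, noise_mic, frame=256):
    """Per-frame magnitude subtraction: remove the noise reference's
    spectrum from the voice microphone's spectrum (illustration only)."""
    voice_mic = np.asarray(voice_mic, dtype=float)
    noise_mic = np.asarray(noise_mic, dtype=float)
    out = np.zeros_like(voice_mic)
    for start in range(0, len(voice_mic) - frame + 1, frame):
        v = np.fft.rfft(voice_mic[start:start + frame])
        n = np.fft.rfft(noise_mic[start:start + frame])
        mag = np.maximum(np.abs(v) - np.abs(n), 0.0)  # floor negative bins at zero
        phase = np.angle(v)                           # reuse the voice mic's phase
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * phase), frame)
    return out
```

Because each output bin's magnitude never exceeds the input bin's magnitude, the extracted signal's energy is bounded by the noisy input's energy.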
  • both the first microphone 11 and the second microphone 13 collect sound in which the user's voice and ambient noise are mixed.
  • the voice extraction unit 23 may remove noise mixed in the voice using, for example, a microphone array technique.
  • the speech extraction unit 23 may use, for example, a beam forming method or a blind sound source separation method using ICA (Independent Component Analysis) as a microphone array method.
  • By using a technique that does not specify the type of voice collected by each microphone, the voice extraction unit 23 can extract the voice regardless of the manner in which the user uses the voice input / output unit 10.
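The publication only names beam forming and ICA without detailing them; as a minimal hedged illustration of the microphone-array idea (the function name is invented here), a delay-and-sum beamformer aligns the microphones on the target direction and averages, reinforcing the target source while averaging down uncorrelated noise:

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """Advance each microphone signal by its steering delay (in samples)
    and average, reinforcing sound arriving from the steered direction."""
    aligned = [np.roll(np.asarray(sig, dtype=float), -d)
               for sig, d in zip(mic_signals, delays)]
    return np.mean(aligned, axis=0)
```

With the correct per-microphone delays, a wave arriving from the steered direction adds coherently across microphones, while sound from other directions adds with mismatched phases and is attenuated.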
  • the voice extraction process performed by the voice extraction unit 23 will be described more specifically.
  • As methods for detecting a speech section, there are, for example, a method of simply determining the speech section from the loudness of the sound, and a method of distinguishing speech from noise using characteristics such as the frequency components of the sound. In other words, detecting a speech section can be regarded as removing the noise sections.
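The simpler of the two approaches, a loudness threshold, can be sketched as follows (a hedged illustration; the function name, frame length, and threshold are assumptions, not values from the publication):

```python
import numpy as np

def detect_speech_frames(signal, frame=160, threshold=0.01):
    """Mark frames whose mean energy exceeds a fixed threshold as speech.
    A practical detector would adapt the threshold and add spectral cues."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame
    return [float(np.mean(signal[i * frame:(i + 1) * frame] ** 2)) > threshold
            for i in range(n_frames)]
```

Frames flagged False are the noise sections; dropping them is exactly the "removing a noise segment" view described above.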
  • This noise component removal processing uses, for example, a technique that removes noise using voice data collected from a microphone that collects voice and a microphone that collects noise, or a microphone array technique.
  • The voice extraction unit 23 may detect the voice section and then perform noise removal processing on the detected section, or may perform voice detection on the voice after the noise removal processing has been performed.
  • the voice extraction unit 23 may extract voice by combining these processes.
  • the process of extracting the voice by the voice extraction unit 23 includes a process of detecting a voice section and a process of removing noise.
  • the voice recognition unit 24 performs voice recognition based on the voice data extracted by the voice extraction unit 23. That is, the voice recognition unit 24 performs voice recognition on the voice from which noise has been removed from the voice collected by the microphone.
  • the speech recognition unit 24 may perform speech recognition using a generally known method.
  • FIG. 2 is a flowchart showing an operation example of the speech recognition system of the present embodiment.
  • the first input voice transmission unit 12 transmits voice data indicating the voice input to the first microphone 11 to the voice recognition response unit 20 wirelessly.
  • the second input voice transmission unit 14 wirelessly transmits voice data indicating the voice input to the second microphone 13 to the voice recognition response unit 20 (step S1).
  • The voice extraction unit 23 of the voice recognition response unit 20 extracts voice data from which ambient noise has been removed, based on the first voice data received by the first input voice receiver 21 from the first input voice transmitter 12 and the second voice data received by the second input voice receiver 22 from the second input voice transmitter 14 (step S2). Then, the voice recognition unit 24 performs voice recognition based on the voice data extracted by the voice extraction unit 23 (step S3).
  • The first microphone 11 and the second microphone 13 of the voice input / output unit 10 collect the voice of the user and the noise present while the user is uttering the voice.
  • The first input voice transmission unit 12 and the second input voice transmission unit 14 wirelessly transmit the voice data, including the voice and noise input to each microphone, to the voice recognition response unit 20.
  • the voice extraction unit 23 of the voice recognition response unit 20 extracts voice data from which noise has been removed from the received voice data, and the voice recognition unit 24 performs voice recognition of the voice data extracted by the voice extraction unit 23.
  • Since the voice input / output unit 10 only needs two microphones and a function for transmitting the voice data collected from these microphones, the voice input / output unit 10 can be downsized. Therefore, the work efficiency of the user who uses the voice input / output unit 10 can be improved.
  • the voice recognition response unit 20 removes noise from the received voice data and performs voice recognition based on the voice data from which the noise has been removed.
  • The voice recognition response unit 20 may be provided in any place reachable by wireless communication from the voice input / output unit 10, and does not have to be carried together with the user (that is, with the voice input / output unit 10). Accordingly, since there is less need to reduce the size of the voice recognition response unit 20 than of the voice input / output unit 10, the voice recognition response unit 20 can be given many functions for improving the accuracy of speech recognition. Therefore, the accuracy of voice recognition can be increased.
  • FIG. 3 is a block diagram showing a configuration example of the second embodiment of the speech recognition system according to the present invention.
  • The same components as in FIG. 1 are given the same reference numerals, and their description is omitted.
  • the voice recognition system according to the present embodiment includes a voice input / output unit 30 and a voice recognition response unit 40.
  • the voice input / output unit 30 includes a first microphone 11, a second microphone 13, an input data integration unit 31, an input data transmission unit 32, and a control unit 15.
  • the audio input / output unit 30 may include an output unit (not shown) that outputs audio input from the first microphone 11 and the second microphone 13.
  • A case where the voice input / output unit 30 does not include an output unit will be described as an example.
  • The first microphone 11, the second microphone 13, and the control unit 15 are the same as those in the first embodiment.
  • the input data integration unit 31 integrates audio data indicating sound input to the first microphone 11 and audio data indicating sound input to the second microphone 13. When the input data is received as analog data by each microphone, the input data integration unit 31 converts the analog data into digital data and integrates the converted digital data.
  • FIG. 4 is an explanatory diagram showing an example of a method for integrating 2-channel audio into 1-channel audio.
  • FIG. 4 shows a method in which the channel-1 audio data input to the first microphone 11 and the channel-2 audio data input to the second microphone 13 are divided at regular intervals and arranged alternately.
  • the integrated data is transmitted by the input data transmission unit 32 described later.
  • The voice data integration method is not limited to the method illustrated in FIG. 4. Other integration methods may be used as long as a plurality of audio data streams can be transmitted over one channel.
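The regular-interval interleaving of FIG. 4 can be sketched as follows (a hedged illustration; the function name and block size are arbitrary assumptions, and the two channels are assumed to be of equal length):

```python
def interleave_channels(ch1, ch2, block=4):
    """Cut both equal-length channels into fixed-size blocks and emit
    them alternately, so two streams share one transmission channel."""
    out = []
    for i in range(0, len(ch1), block):
        out.extend(ch1[i:i + block])   # channel-1 block
        out.extend(ch2[i:i + block])   # channel-2 block
    return out
```

The resulting single stream carries samples from both microphones in a fixed alternating pattern, which is what allows the receiver to recover the two original channels.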
  • the voice input / output unit 30 may include three or more microphones.
  • the method of integrating audio data indicating the audio input to each microphone is the same as the method described above.
  • the input data transmission unit 32 transmits the voice data integrated by the input data integration unit 31 to the voice recognition response unit 40 wirelessly.
  • The input data transmission unit 32 may also transmit, to the voice recognition response unit 40, status information indicating the method by which the input data integration unit 31 integrated the voice data.
  • The voice recognition response unit 40 includes an input data reception unit 41, an input data division unit 42, a voice extraction unit 23, and a voice recognition unit 24. As in the first embodiment, the voice recognition response unit 40 may also include a control unit (not shown) that synthesizes voice and reproduces the synthesized voice from the result of voice recognition by the voice recognition unit 24. The voice extraction unit 23 and the voice recognition unit 24 are the same as those in the first embodiment.
  • the input data receiving unit 41 receives the voice data wirelessly transmitted by the input data transmitting unit 32.
  • The input data dividing unit 42 divides the audio data, into which the input data integration unit 31 integrated two or more audio data streams, back into the original audio data. Specifically, the input data dividing unit 42 divides the received audio data according to the method by which the input data integration unit 31 integrated it. For example, as illustrated in FIG. 4, when the input data integration unit 31 has interleaved two or more audio data streams divided at regular intervals, the input data dividing unit 42 may divide the received audio data at the same intervals and reassemble the divided pieces into the two or more original audio data streams.
  • the division method may be determined in advance between the voice input / output unit 30 and the voice recognition response unit 40. Further, the input data dividing unit 42 may specify the dividing method based on the status information indicating the integration method transmitted from the input data transmitting unit 32.
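Assuming the regular-interval interleaving illustrated in FIG. 4, the division can be sketched as the inverse operation (illustrative names; the receiver must know the block size in advance or learn it from the transmitted status information):

```python
def split_channels(stream, block=4):
    """Undo fixed-size block interleaving: route alternating blocks
    back to the two original channels."""
    ch1, ch2 = [], []
    for i in range(0, len(stream), 2 * block):
        ch1.extend(stream[i:i + block])
        ch2.extend(stream[i + block:i + 2 * block])
    return ch1, ch2
```

The key design point is that integration and division are agreed on by both sides: any interleaving scheme works as long as this inverse is unambiguous.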
  • FIG. 5 is a flowchart showing an operation example of the speech recognition system of the present embodiment.
  • the input data integration unit 31 integrates audio data indicating audio input to the first microphone 11 and audio data indicating audio input to the second microphone 13 (step S11).
  • the input data transmission unit 32 wirelessly transmits the voice data integrated by the input data integration unit 31 to the voice recognition response unit 40 (step S12).
  • The input data dividing unit 42 divides the audio data received by the input data receiving unit 41 (step S13). Specifically, the input data dividing unit 42 restores the audio data, into which the input data integration unit 31 integrated two or more audio data streams, to the original audio data.
  • the voice extraction unit 23 extracts voice data from which ambient noise has been removed based on the two or more voice data divided by the input data division unit 42 (step S14). Then, the voice recognition unit 24 performs voice recognition based on the voice data extracted by the voice extraction unit 23 (step S15).
  • The input data integration unit 31 integrates the voice data, including the voice and noise input to each microphone.
  • The input data transmission unit 32 wirelessly transmits the voice data integrated by the input data integration unit 31 to the voice recognition device.
  • the input data dividing unit 42 divides the received audio data into original audio data
  • The voice extraction unit 23 extracts audio data from which noise has been removed from each audio data stream divided by the input data dividing unit 42.
  • Such a configuration makes it possible to simultaneously transmit audio data input simultaneously to each microphone. Therefore, it is not necessary for the reception side (voice recognition response unit 40) to perform processing in consideration of reception timing, so that the processing on the reception side can be simplified.
  • FIG. 6 is a block diagram showing a configuration example of the third embodiment of the speech recognition system according to the present invention.
  • The same components as in FIG. 1 are given the same reference numerals, and their description is omitted.
  • the voice recognition system according to the present embodiment includes a voice input / output unit 50 and a voice recognition response unit 60.
  • The voice input / output unit 50 includes a first microphone 11, a first input voice transmission unit 12, a second microphone 13, a second input voice transmission unit 14, a control unit 15, a first input voice compression unit 51, and a second input voice compression unit 52. That is, the voice input / output unit 50 of the present embodiment differs from the voice input / output unit 10 of the first embodiment in that it further includes the first input voice compression unit 51 and the second input voice compression unit 52.
  • the first input sound compression unit 51 generates sound data obtained by compressing the sound input to the first microphone 11.
  • the first input voice transmission unit 12 transmits the compressed voice data to the voice recognition response unit 60 wirelessly.
  • the second input sound compression unit 52 generates sound data obtained by compressing the sound input to the second microphone 13.
  • the second input voice transmission unit 14 transmits the compressed voice data to the voice recognition response unit 60 wirelessly.
  • the voice input / output unit 50 may include one input voice compression unit that combines the functions of the first input voice compression unit 51 and the second input voice compression unit 52.
  • the input voice compression unit may determine which microphone the collected voice is input to, and compress the voice.
  • the first input audio compression unit 51 and the second input audio compression unit 52 generate audio data obtained by compressing audio using a generally known method.
  • The first input audio compression unit 51 and the second input audio compression unit 52 compress audio using, for example, μ-law coding as standardized in ITU-T Recommendation G.711, or ADPCM (Adaptive Differential Pulse Code Modulation).
  • The method used by the first input audio compression unit 51 and the second input audio compression unit 52 for audio compression is not limited to μ-law and ADPCM.
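As a hedged illustration of the companding idea behind G.711 μ-law (continuous curve only; the real codec also quantizes each sample to an 8-bit codeword, which is omitted here, and the function names are assumptions):

```python
import numpy as np

def mu_law_encode(x, mu=255.0):
    """Compress samples in [-1, 1] with the mu-law companding curve,
    giving small amplitudes more resolution than large ones."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_decode(y, mu=255.0):
    """Invert the companding curve to recover the original samples."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu
```

The curve expands quiet samples before quantization and the decoder applies the exact inverse, which is why companded 8-bit transmission preserves intelligibility at reduced data rates.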
  • The voice recognition response unit 60 includes a first input voice reception unit 21, a second input voice reception unit 22, a voice extraction unit 23, a voice recognition unit 24, a first input voice decompression unit 61, and a second input voice decompression unit 62. That is, the speech recognition response unit 60 of the present embodiment differs from the speech recognition response unit 20 of the first embodiment in that it further includes the first input voice decompression unit 61 and the second input voice decompression unit 62.
  • The voice recognition response unit 60 may also include a control unit (not shown) that synthesizes voice and reproduces the synthesized voice from the result of voice recognition by the voice recognition unit 24.
  • the first input voice decompression unit 61 decompresses the compressed voice data received by the first input voice reception unit 21 to the original voice data.
  • the second input voice decompression unit 62 decompresses the compressed voice data received by the second input voice reception unit 22 to the original voice data.
  • the first input voice decompression unit 61 and the second input voice decompression unit 62 are the methods used by the first input voice compression unit 51 and the second input voice compression unit 52 to compress the voice. In response, the received audio data is decompressed.
  • FIG. 7 is a flowchart showing an operation example of the speech recognition system of this embodiment.
  • the first input sound compression unit 51 generates sound data obtained by compressing the sound input to the first microphone 11.
  • The second input audio compression unit 52 generates audio data obtained by compressing the audio input to the second microphone 13 (step S21). Then, the first input voice transmission unit 12 and the second input voice transmission unit 14 wirelessly transmit the compressed voice data to the voice recognition response unit 60 (step S22).
  • the first input voice decompression unit 61 decompresses the compressed voice data received by the first input voice reception unit 21.
  • the second input voice decompression unit 62 decompresses the compressed voice data received by the second input voice reception unit 22 (step S23). Thereafter, the process of extracting the voice data from which the voice extraction unit 23 has removed noise and the process of performing the voice recognition by the voice recognition unit 24 are the same as steps S2 to S3 in FIG.
  • The first input audio compression unit 51 and the second input audio compression unit 52 generate audio data in which the audio and noise input to each microphone are compressed, and the first input voice transmission unit 12 and the second input voice transmission unit 14 wirelessly transmit each compressed voice data to the voice recognition response unit 60. Further, the first input voice decompression unit 61 and the second input voice decompression unit 62 decompress the compressed voice data to the original voice data, and the voice extraction unit 23 extracts voice data from which noise has been removed from each decompressed voice data stream.
  • the amount of data transmitted from the voice input / output unit 50 to the voice recognition response unit 60 can be reduced.
  • the voice input / output unit 50 may include an input data integration unit 31 that integrates voice data generated by each input voice compression unit.
  • the voice recognition response unit 60 of this embodiment may include an input data dividing unit 42 that divides the voice data integrated by the input data integration unit 31.
  • FIG. 8 is a block diagram showing a configuration example of the fourth embodiment of the speech recognition system according to the present invention.
  • The same components as in FIG. 1 are given the same reference numerals, and their description is omitted.
  • the voice recognition system according to the present embodiment includes a voice input / output unit 70 and a voice recognition response unit 80.
  • The voice recognition response unit 80 includes a first input voice reception unit 21, a second input voice reception unit 22, a voice extraction unit 23, a voice recognition unit 24, a response generation unit 81, and a response transmission unit 82. That is, the speech recognition response unit 80 of this embodiment differs from the speech recognition response unit 20 of the first embodiment in that it further includes the response generation unit 81 and the response transmission unit 82.
  • the response generation unit 81 synthesizes speech from the result of speech recognition by the speech recognition unit 24 and generates speech data indicating the synthesized speech. Note that the sound data generated in this way is generated as sound data in response to the sound data received from the sound input / output unit 70, and therefore this sound data may be referred to as response sound data.
  • As the method for the response generation unit 81 to synthesize speech from the speech recognition result, a generally known speech synthesis method, a method using previously recorded speech, or a combination of these may be used.
  • the response transmission unit 82 wirelessly transmits the generated response voice data to the voice input / output unit 70.
  • the voice input / output unit 70 includes a first microphone 11, a first input voice transmission unit 12, a second microphone 13, a second input voice transmission unit 14, a control unit 15, and a response reception unit 71. And a speaker 72. That is, the voice input / output unit 70 of the present embodiment is different from the voice input / output unit 10 of the first embodiment in that it further includes a response receiving unit 71 and a speaker 72.
  • the response receiving unit 71 receives the response voice data wirelessly transmitted by the response transmission unit 82 of the voice recognition response unit 80.
  • the speaker 72 outputs sound indicated by the response sound data received by the response receiving unit 71.
  • FIG. 9 is a flowchart showing an operation example of the speech recognition system of the present embodiment.
  • the process from when the voice input to the voice input / output unit 70 is transmitted to the voice recognition response unit 80 until voice recognition is performed is the same as steps S1 to S3 in FIG.
  • the response generation unit 81 generates response voice data from the result of the voice recognition unit 24 performing voice recognition (step S31).
  • the response transmitter 82 transmits the response voice data to the voice input / output unit 70 (step S32).
  • the response receiver 71 of the voice input / output unit 70 causes the speaker 72 to output the voice indicated by the received response voice data (step S33).
  • The response generation unit 81 generates response voice data from the result of voice recognition by the voice recognition unit 24, and the response transmission unit 82 transmits the response voice data to the voice input / output unit 70. Then, the response receiving unit 71 causes the speaker 72 to output the sound indicated by the received response voice data. Therefore, the voice recognition result produced by the voice recognition response unit 80 can be confirmed on the voice input / output unit 70 side.
  • FIG. 10 is a block diagram showing a configuration example of the fifth embodiment of the speech recognition system according to the present invention.
  • Components identical to those in FIG. 1, FIG. 3, FIG. 6, or FIG. 8 are given the same reference symbols.
  • the voice recognition system according to the present embodiment includes a voice input / output unit 90 and a voice recognition response unit 100.
  • the voice recognition response unit 100 includes a first input voice reception unit 21, a second input voice reception unit 22, a voice extraction unit 23, a voice recognition unit 24, a first input voice expansion unit 61, A second input voice decompression unit 62, a response generation unit 81, a response transmission unit 82, and a response compression unit 101 are included. That is, the speech recognition response unit 100 of the present embodiment is different from the speech recognition response unit described in the first to fourth embodiments in that a response compression unit 101 is newly included.
  • the response compression unit 101 compresses response audio data.
  • the method of compressing response audio data is the same as the method of compressing audio data by the input audio compression unit (first input audio compression unit 51, second input audio compression unit 52) of the third embodiment. Note that the method in which the response compression unit 101 compresses the response audio data and the method in which the input audio compression unit compresses the audio data may be the same or different.
  • the response transmission unit 82 transmits the compressed response voice data to the voice input / output unit 90.
  • The voice input / output unit 90 includes a first microphone 11, a first input voice transmission unit 12, a second microphone 13, a second input voice transmission unit 14, a control unit 15, a first input audio compression unit 51, a second input audio compression unit 52, a response reception unit 71, a speaker 72, and a response expansion unit 91. That is, the voice input / output unit 90 of this embodiment differs from the voice input / output units described in the first to fourth embodiments in that it newly includes the response expansion unit 91.
  • the response decompression unit 91 decompresses the compressed response voice data received by the response reception unit 71 to the original response voice data.
  • the method for expanding the response audio data is the same as the method in which the input audio expansion unit (the first input audio expansion unit 61 and the second input audio expansion unit 62) of the third embodiment expands the audio data. Note that the method in which the response decompression unit 91 decompresses the response voice data and the method in which the input voice decompression unit decompresses the voice data may be the same or different.
  • FIG. 11 is a flowchart showing an operation example of the speech recognition system of the present embodiment. The processing from when the voice input to the voice input / output unit 90 is compressed and transmitted to the voice recognition response unit 100 until the transmitted voice data is expanded and recognized is the same as the processing illustrated in the flowchart of FIG. 7.
  • the process in which the response generation unit 81 generates response voice data from the voice recognition result is the same as the process in step S31 illustrated in FIG.
  • the response compression unit 101 compresses the response voice data (step S41).
  • the response transmission unit 82 transmits the compressed response voice data to the voice input / output unit 90 (step S42).
  • the response decompression unit 91 of the speech input / output unit 90 decompresses the compressed response speech data received by the response reception unit 71 (step S43). Then, the speaker 72 outputs the voice indicated by the expanded response voice data (step S44).
  • the response compression unit 101 of the voice recognition response unit 100 compresses the response voice data, and the response transmission unit 82 transmits the compressed response voice data to the voice input / output unit 90. Then, the response decompression unit 91 of the speech input / output unit 90 decompresses the compressed response speech data, and the speaker 72 outputs the sound indicated by the decompressed response speech data.
  • the amount of data transmitted from the voice recognition response unit 100 to the voice input / output unit 90 can be reduced.
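To illustrate why the compression step reduces the transmitted data, the following is a minimal sketch of continuous μ-law companding (one of the methods mentioned later in the implementation example, alongside ADPCM). The 8-bit quantization and the [-1, 1] float sample representation are assumptions for illustration; this is not the embodiment's actual codec.

```python
import math

MU = 255.0  # standard mu-law companding constant

def mulaw_compress(samples):
    """Compress floats in [-1, 1] to one 8-bit code per sample
    using the continuous mu-law curve (half the size of 16-bit PCM)."""
    out = bytearray()
    for x in samples:
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
        out.append(int(round((y + 1.0) / 2.0 * 255)))  # quantize to 8 bits
    return bytes(out)

def mulaw_expand(codes):
    """Expand 8-bit codes back to floats in [-1, 1]."""
    out = []
    for c in codes:
        y = c / 255.0 * 2.0 - 1.0
        out.append(math.copysign((math.pow(1.0 + MU, abs(y)) - 1.0) / MU, y))
    return out

samples = [0.0, 0.5, -0.25, 1.0, -1.0]
codes = mulaw_compress(samples)      # 1 byte per sample vs 2 for 16-bit PCM
restored = mulaw_expand(codes)
assert all(abs(a - b) < 0.02 for a, b in zip(samples, restored))
```

The log-domain quantization keeps the round-trip error small for quiet samples, which is why μ-law is a common choice for speech links.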
  • FIG. 12 is a block diagram showing a modification of the speech recognition system according to the present invention.
  • Components identical to those in FIG. 1, FIG. 3, FIG. 6, FIG. 8, or FIG. 10 are given the same reference symbols.
  • the speech recognition system according to the present modification includes each component included in the first to fifth embodiments.
  • the voice recognition system according to this modification includes a voice input / output unit 110 and a voice recognition response unit 120.
  • The voice input / output unit 110 includes a first microphone 11 to an Nth microphone 15, a first input voice compression unit 51 to an Nth input voice compression unit 53, an input data integration unit 31, an input data transmission unit 32, the control unit 15, a first speaker 72 to an Nth speaker 75, a response data reception unit 77, a response data division unit 76, and a first response speech decompression unit 73 to an Nth response speech decompression unit 74.
  • response data receiving unit 77 corresponds to the response receiving unit 71 of the fourth embodiment.
  • the first response speech decompression unit 73 to the Nth response speech decompression unit 74 correspond to the response decompression unit 91 of the fifth embodiment.
  • the speech recognition response unit 120 includes an input data reception unit 41, an input data division unit 42, a speech extraction unit 23, a speech recognition unit 24, and a first input speech decompression unit 61 to an Nth input speech decompression unit 63.
  • response data transmission unit 124 corresponds to the response transmission unit 82 of the fifth embodiment.
  • The first response audio compression unit 121 to the Nth response audio compression unit 122 correspond to the response compression unit 101 of the fifth embodiment.
  • the speech recognition system can include a plurality of microphones and a plurality of speakers.
  • FIG. 13 is an explanatory diagram showing an embodiment of the speech recognition system of the present invention.
  • the voice recognition system according to this embodiment includes a headset 130 and a voice recognition device 140.
  • the headset 130 of this example corresponds to the voice input / output device of the above embodiment.
  • the speech recognition apparatus 140 of this example corresponds to the speech recognition response unit of the above embodiment.
  • the headset 130 includes a voice input microphone 131, a noise input microphone 132, and a speaker 133. As illustrated in FIG. 13, the voice input microphone 131 is disposed at the user's mouth, and the noise input microphone 132 is disposed at the user's ear. The speaker 133 is disposed in the vicinity of the noise input microphone 132.
  • the headset 130 generates voice data in which the voice input to the voice input microphone 131 and the noise input to the noise input microphone 132 are respectively compressed.
  • As the compression method, μ-law or ADPCM is used.
  • the headset 130 may generate audio data without compressing audio and noise (uncompressed).
  • the headset 130 integrates the generated 2-channel audio data into the 1-channel audio data. At this time, the headset 130 generates data in which status information indicating the data format and the like is integrated with the audio data. Then, the headset 130 wirelessly transmits data integrated into one channel to the voice recognition device 140.
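The implementation example above integrates two channels of audio plus status information into one stream. The frame layout sketched below (a small fixed header followed by byte-interleaved channels) is purely an assumed convention for illustration; the patent does not specify the headset's actual data format.

```python
import struct

def integrate(voice: bytes, noise: bytes, status: dict) -> bytes:
    """Pack a status header and two equal-length channel blocks into
    a single byte stream (hypothetical frame layout)."""
    assert len(voice) == len(noise)
    # Assumed header: channel count, format code, samples per channel.
    header = struct.pack("<BBH", status["channels"], status["format"], len(voice))
    # Interleave sample-by-sample so both channels share one stream.
    body = bytes(b for pair in zip(voice, noise) for b in pair)
    return header + body

def split(frame: bytes):
    """Recover the status fields and the two channels on the receiver side."""
    channels, fmt, _n = struct.unpack("<BBH", frame[:4])
    body = frame[4:]
    return {"channels": channels, "format": fmt}, body[0::2], body[1::2]

frame = integrate(b"\x01\x02\x03", b"\x11\x12\x13", {"channels": 2, "format": 1})
info, v, n = split(frame)
assert v == b"\x01\x02\x03" and n == b"\x11\x12\x13"
```

Because both channels travel in one frame, the receiver gets samples that were captured at the same instant without having to align two independent radio streams.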
  • Bluetooth (registered trademark) is used for the wireless transmission, and the Serial Port Profile is used as the communication protocol.
  • the voice recognition device 140 divides the received data into two-channel voice data and status information.
  • the voice recognition device 140 expands the two-channel voice data by a method corresponding to the compression method.
  • the voice recognition device 140 performs noise removal processing on two-channel voice data using the method described in Patent Document 1 described above. At that time, the voice recognition device 140 also detects a voice section from the voice data. Then, the voice recognition device 140 performs voice recognition using the voice data from which noise has been removed.
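Patent Document 1's actual method converts each microphone's input into time-series feature vectors and removes stationary and non-stationary noise; that algorithm is not reproduced here. As a loose stand-in, the toy sketch below shows only the general idea: use the noise-channel estimate to clean the voice channel, then mark voice sections with an (assumed) energy threshold.

```python
def remove_noise(voice_frames, noise_frames, floor=0.0):
    """Per-frame energy subtraction: subtract the noise channel's
    energy estimate from the voice channel.  A toy stand-in for the
    two-microphone noise removal of Patent Document 1."""
    return [max(v - n, floor) for v, n in zip(voice_frames, noise_frames)]

def detect_voice_sections(frame_energies, threshold):
    """Mark frames whose noise-removed energy exceeds a threshold."""
    return [e > threshold for e in frame_energies]

voice = [0.2, 0.9, 1.1, 0.3]   # per-frame energies, voice microphone
noise = [0.2, 0.2, 0.3, 0.25]  # per-frame energies, noise microphone
clean = remove_noise(voice, noise)
sections = detect_voice_sections(clean, threshold=0.4)
assert sections == [False, True, True, False]
```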
  • The voice recognition device 140 generates response voice data according to the result of voice recognition, compresses the generated response voice data, and transmits the compressed response voice data to the headset 130.
  • The voice recognition device 140 may make some response to the headset 130 regardless of the recognition result.
  • For example, the voice recognition device 140 may transmit to the headset 130 control information notifying it that the voice data has been received.
  • the headset 130 expands the received response sound data and outputs the sound indicated by the response sound data from the speaker 133.
  • FIG. 14 is a block diagram showing an example of the minimum configuration of the speech recognition system according to the present invention.
  • The voice recognition system according to the present invention includes a voice input device 180 (for example, the voice input / output unit 10) for inputting a user's voice, and a voice recognition device 190 (for example, the voice recognition response unit 20) for performing voice recognition of the voice input to the voice input device 180.
  • The voice input device 180 includes at least two input means 181 (for example, the first microphone 11 and the second microphone 13) for inputting the user's voice and the noise present while the user is speaking, and wireless transmission means 182 (for example, the first input voice transmission unit 12 and the second input voice transmission unit 14) for wirelessly transmitting voice data, containing the voice and noise input to each input means 181, to the voice recognition device 190.
  • The voice recognition device 190 includes voice extraction means 191 (for example, the voice extraction unit 23) for extracting noise-removed voice data from the received voice data, and voice recognition means 192 (for example, the voice recognition unit 24) for performing voice recognition of the voice data extracted by the voice extraction means 191.
  • The voice input device 180 (for example, the voice input / output unit 30) may include input data integration means (for example, the input data integration unit 31) that integrates the voice data, containing voice and noise, input to each input means 181.
  • the voice recognition device 190 (for example, the voice recognition response unit 40) may include input data dividing means (for example, the input data dividing unit 42) that divides the received voice data into original voice data.
  • The wireless transmission means 182 of the voice input device 180 may wirelessly transmit the voice data integrated by the input data integration means to the voice recognition device 190, and the voice extraction means 191 of the voice recognition device 190 may extract the noise-removed voice data from each piece of voice data divided by the input data dividing means.
  • Such a configuration makes it possible to transmit together the voice data input simultaneously to each microphone. Therefore, the receiving side (the voice recognition device 190) need not perform processing that accounts for reception timing, so processing on the receiving side can be simplified.
  • The voice input device 180 (for example, the voice input / output unit 50) may include input data compression means (for example, the first input voice compression unit 51 and the second input voice compression unit 52) that compresses the voice data input to each input means 181.
  • The voice recognition device 190 (for example, the voice recognition response unit 60) may include input data expansion means (for example, the first input voice expansion unit 61 and the second input voice expansion unit 62) that expands the compressed voice data to the original voice data.
  • The wireless transmission means 182 of the voice input device 180 may wirelessly transmit the voice data compressed by the input data compression means to the voice recognition device 190, and the voice extraction means 191 of the voice recognition device 190 may extract the noise-removed voice data from each piece of voice data expanded by the input data expansion means.
  • the amount of data transmitted from the voice input device 180 to the voice recognition device 190 can be reduced.
  • The voice recognition device 190 (for example, the voice recognition response unit 80) may include synthesized voice data generation means (for example, the response generation unit 81) that generates synthesized voice data from the result of voice recognition performed by the voice recognition means 192, and synthesized voice data transmission means (for example, the response transmission unit 82) that transmits the generated synthesized voice data to the voice input device 180.
  • The voice input device 180 (for example, the voice input / output unit 70) may include synthesized voice data reception means (for example, the response reception unit 71) that receives the synthesized voice data from the voice recognition device 190, and output means (for example, the speaker 72) that outputs the voice indicated by the received synthesized voice data.
  • the result of voice recognition by the voice recognition device 190 can be confirmed on the voice input device 180 side.
  • the voice recognition device 190 may include a synthesized voice data compression unit (for example, the response compression unit 101) that compresses the synthesized voice data.
  • the voice input device 180 (for example, the voice input / output unit 90) may include a synthesized voice data expansion unit (for example, a response expansion unit 91) that expands the compressed synthesized voice data.
  • The synthesized voice data transmission means of the voice recognition device 190 may transmit the synthesized voice data compressed by the synthesized voice data compression means to the voice input device 180, and the output means of the voice input device 180 may output the voice indicated by the synthesized voice data expanded by the synthesized voice data expansion means.
  • the amount of data transmitted from the speech recognition device 190 to the speech input device 180 can be reduced.
  • The voice input device 180 may include control means (for example, the control unit 15) that controls the state of the voice input device 180 itself based on a control command received from another device (for example, a voice recognition response unit) and transmits status information indicating the state of the voice input device 180.
  • the voice input device 180 can be further downsized.
  • the present invention is preferably applied to a speech recognition system that performs speech recognition of speech transmitted using wireless communication.


Abstract

Provided is a speech recognition system that can reduce the size of the device used by a user to input speech while increasing the accuracy of speech recognition. At least two input means (181) input a user's speech and the noise present while the user is speaking. A wireless transmission means (182) wirelessly transmits speech data containing the speech and noise input to the input means (181) to a speech recognition device (190). A speech extraction means (191) extracts noise-removed speech data from the received speech data. A speech recognition means (192) performs speech recognition of the speech data extracted by the speech extraction means (191).

Description

Speech recognition system and speech recognition method
The present invention relates to a voice recognition system and a voice recognition method for performing voice recognition of voice transmitted using wireless communication.
When performing speech recognition, it is known that the performance of speech recognition deteriorates due to ambient noise input together with the speech. For this reason, various methods are known for removing noise mixed into speech.
Patent Document 1 describes a noise removal device that removes noise mixed into speech. The noise removal device described in Patent Document 1 includes a first microphone that collects speech and a second microphone that collects ambient noise. It converts the sound input to each microphone into a time-series feature vector and removes stationary noise and non-stationary noise based on each converted time-series vector.
Patent Document 2 describes a voice processing system using a headset with a wireless communication function. In the voice processing system described in Patent Document 2, a microphone provided in the headset detects voice, and the result of voice recognition performed by the headset's voice recognition means is transmitted to an external device by wireless communication.
Note that Patent Document 3 describes a speech recognition system that performs speech recognition by compressing and expanding a speech signal.
Patent Document 1: Japanese Patent No. 2836271; Patent Document 2: JP 2003-202888 A; Patent Document 3: JP 2005-321748 A
In situations where an operator reports work status by voice while working, noise is often mixed into the voice input to the microphone, depending on the work and the surrounding environment. To improve the accuracy of voice recognition in such situations, the noise mixed into the voice must be removed appropriately.
In such situations, where voice is input while working, it is desirable to miniaturize the device used to input the voice. For example, if a voice input / output device such as a headset microphone is connected by wire to a voice recognition processing device, using these devices may hinder the work.
For example, noise can be removed by using the noise removal device described in Patent Document 1. However, in that device the two microphones and the noise removal means are connected and operate as one unit, so the device tends to be large. It is therefore desirable to miniaturize the device the operator uses to input voice while improving the accuracy of voice recognition through noise removal as described in Patent Document 1.
Also, as in the voice processing system described in Patent Document 2, using a headset with a wireless communication function can keep the operator's work from being hindered. However, in the voice processing system described in Patent Document 2, the voice recognition device is provided on the headset side, which the operator uses to input voice.
Improving the accuracy of voice recognition requires many resources. Moreover, the models and processing methods used for voice recognition are often changed. In the voice processing system of Patent Document 2, the voice recognition device in each headset must therefore be updated every time a model or processing method changes. In addition, miniaturizing the headset imposes resource constraints, so sufficiently accurate voice recognition cannot necessarily be achieved.
Accordingly, an object of the present invention is to provide a voice recognition system and a voice recognition method that can improve the accuracy of voice recognition while miniaturizing the device used by the user who inputs voice.
A voice recognition system according to the present invention includes a voice input device that inputs a user's voice and a voice recognition device that performs voice recognition of the voice input to the voice input device. The voice input device includes at least two input means for inputting the user's voice and the noise present while the user is speaking, and wireless transmission means for wirelessly transmitting voice data, containing the voice and noise input to each input means, to the voice recognition device. The voice recognition device includes voice extraction means for extracting noise-removed voice data from the received voice data, and voice recognition means for performing voice recognition of the voice data extracted by the voice extraction means.
In a voice recognition method according to the present invention, a voice input device that inputs a user's voice inputs, using two or more input means, the user's voice and the noise present while the user is speaking; the voice input device wirelessly transmits voice data containing the voice and noise input to each input means to a voice recognition device; the voice recognition device extracts noise-removed voice data from the received voice data; and the voice recognition device performs voice recognition of the extracted voice data.
According to the present invention, the accuracy of voice recognition can be improved while miniaturizing the device used by the user who inputs voice.
FIG. 1 is a block diagram showing a configuration example of the first embodiment of the speech recognition system according to the present invention. FIG. 2 is a flowchart showing an operation example of the speech recognition system of the first embodiment. FIG. 3 is a block diagram showing a configuration example of the second embodiment of the speech recognition system according to the present invention. FIG. 4 is an explanatory diagram showing an example of a method of integrating 2-channel audio into 1-channel audio. FIG. 5 is a flowchart showing an operation example of the speech recognition system of the second embodiment. FIG. 6 is a block diagram showing a configuration example of the third embodiment of the speech recognition system according to the present invention. FIG. 7 is a flowchart showing an operation example of the speech recognition system of the third embodiment. FIG. 8 is a block diagram showing a configuration example of the fourth embodiment of the speech recognition system according to the present invention. FIG. 9 is a flowchart showing an operation example of the speech recognition system of the fourth embodiment. FIG. 10 is a block diagram showing a configuration example of the fifth embodiment of the speech recognition system according to the present invention. FIG. 11 is a flowchart showing an operation example of the speech recognition system of the fifth embodiment. FIG. 12 is a block diagram showing a modification of the speech recognition system according to the present invention. FIG. 13 is an explanatory diagram showing an example of the speech recognition system of the present invention.
FIG. 14 is a block diagram showing an example of the minimum configuration of the speech recognition system according to the present invention.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram showing a configuration example of the first embodiment of the speech recognition system according to the present invention. The speech recognition system according to this embodiment includes a voice input / output unit 10 and a voice recognition response unit 20.
The voice input / output unit 10 includes a first microphone 11, a first input voice transmission unit 12, a second microphone 13, a second input voice transmission unit 14, and a control unit 15. The voice input / output unit 10 may also include an output unit (not shown) that outputs the voice input from the first microphone 11 or the second microphone 13. In this embodiment, a case where the voice input / output unit 10 does not include an output unit is described as an example.
The first microphone 11 and the second microphone 13 collect the user's voice and the ambient noise present while the user is speaking. In the following description, voice containing noise may be referred to simply as voice.
The first microphone 11 and the second microphone 13 are provided at physically separated positions. For example, when the voice input / output unit 10 is realized as a headset, the first microphone 11 may be placed at the user's mouth and the second microphone 13 at the user's ear. Because of this physical arrangement, different sound is input to each microphone. In this example, the first microphone 11 is used mainly to collect the user's voice, and the second microphone 13 mainly to collect the ambient noise.
However, the types of sound collected by the first microphone 11 and the second microphone 13 are not particularly limited. For example, both microphones may collect sound in which the user's voice and ambient noise are mixed. Alternatively, the first microphone 11 may be used mainly to collect ambient noise, and the second microphone 13 mainly to collect the user's voice.
This embodiment describes the case where the voice input / output unit 10 includes two microphones, but the number of microphones included in the voice input / output unit 10 is not limited to two; it may include three or more microphones. Even in that case, the type of sound each microphone collects is not particularly limited.
The first input voice transmission unit 12 wirelessly transmits the voice input to the first microphone 11 to the voice recognition response unit 20. Similarly, the second input voice transmission unit 14 wirelessly transmits the voice input to the second microphone 13 to the voice recognition response unit 20.
In this embodiment, the voice collected by each microphone is transmitted to the voice recognition response unit 20 by the input voice transmission unit corresponding to that microphone. However, the voice input / output unit 10 may instead include a single input voice transmission unit that combines the functions of the first input voice transmission unit 12 and the second input voice transmission unit 14. That input voice transmission unit would then determine which microphone the collected voice was input to and transmit the voice to the voice recognition response unit 20.
When the sound collected by a microphone is received as analog data, the first input voice transmission unit 12 and the second input voice transmission unit 14 digitize that sound. That is, the first input voice transmission unit 12 and the second input voice transmission unit 14 wirelessly transmit digitized voice data to the voice recognition response unit 20.
The first input voice transmission unit 12 and the second input voice transmission unit 14 may also transmit status information indicating the state of the voice input/output unit 10 to the voice recognition response unit 20, in response to an instruction from the control unit 15 described later.
The control unit 15 controls the state of the voice input/output unit 10 based on control commands received from another device (for example, the voice recognition response unit 20). The control commands received by the control unit 15 include, for example, settings for the number of microphone channels, the compression method for the transmitted voice data, the sampling frequency of the voice data, the microphone switches, the operation mode (for example, the protocol to be used), the microphone volume, and the speaker volume.
The control unit 15 may also instruct the first input voice transmission unit 12 and the second input voice transmission unit 14 to transmit status information indicating the state of the voice input/output unit 10. Examples of status information include the number of microphone channels, the sampling frequency of the voice data, the state of the microphone switches, operation mode information, the block size of the transmitted data, microphone volume information, speaker volume information, radio wave conditions, remaining battery level, battery charging state, and time information.
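As a purely illustrative sketch, the status information listed above might be packaged as a simple key-value message. All field names and values below are assumptions for illustration; this embodiment does not define a message format:

```python
# Hypothetical status message reported by the voice input/output unit.
# Every field name and value here is an illustrative assumption.
def build_status_message() -> dict:
    return {
        "mic_channels": 2,           # number of microphone channels
        "sampling_rate_hz": 16000,   # sampling frequency of the voice data
        "mic_switch_on": True,       # microphone switch state
        "operation_mode": "normal",  # operation mode information
        "block_size": 512,           # block size of the transmitted data
        "mic_volume": 8,             # microphone volume information
        "speaker_volume": 5,         # speaker volume information
        "battery_percent": 87,       # remaining battery level
    }
```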
As described above, the voice input/output unit 10 may transmit status information to another device and operate based on control commands from that device. This eliminates the need to build decision-making logic for its own operation into the voice input/output unit 10, so the voice input/output unit 10 can be made smaller.
The voice recognition response unit 20 includes a first input voice reception unit 21, a second input voice reception unit 22, a voice extraction unit 23, and a voice recognition unit 24. The voice recognition response unit 20 may also include a control unit (not shown) that synthesizes voice from the result of voice recognition by the voice recognition unit 24, described later, and reproduces the synthesized voice.
The first input voice reception unit 21 receives the voice data wirelessly transmitted by the first input voice transmission unit 12. Likewise, the second input voice reception unit 22 receives the voice data wirelessly transmitted by the second input voice transmission unit 14.
The voice extraction unit 23 extracts voice data from which ambient noise has been removed, based on the voice data received by the first input voice reception unit 21 (hereinafter referred to as the first voice data) and the voice data received by the second input voice reception unit 22 (hereinafter referred to as the second voice data).
That is, the first voice data and the second voice data each contain at least ambient noise. The voice extraction unit 23 therefore uses the received first voice data and second voice data to remove the noise mixed into the voice uttered by the user.
For example, suppose the first microphone 11 mainly collects the user's voice and the second microphone 13 mainly collects ambient noise. In this case, the voice extraction unit 23 removes the noise voice data collected by the second microphone 13 from the voice data collected by the first microphone 11. Here, the voice extraction unit 23 may use, for example, the noise removal method of the noise removal device described in Patent Document 1 mentioned above. When the type of sound each microphone collects can be identified in this way, the voice extraction process can be sped up.
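The exact method of Patent Document 1 is not reproduced here. As an illustrative stand-in for removing a noise reference captured by the second microphone from the first microphone's signal, the sketch below uses simple magnitude spectral subtraction; the frame length is an assumption:

```python
import numpy as np

def spectral_subtraction(voice_mic: np.ndarray, noise_mic: np.ndarray,
                         frame: int = 256) -> np.ndarray:
    """Illustrative noise removal: frame by frame, subtract the noise
    microphone's magnitude spectrum from the voice microphone's,
    keeping the voice microphone's phase."""
    out = np.zeros(len(voice_mic))
    for start in range(0, len(voice_mic) - frame + 1, frame):
        v = np.fft.rfft(voice_mic[start:start + frame])
        n = np.fft.rfft(noise_mic[start:start + frame])
        # Clip at zero so over-subtraction cannot produce negative magnitudes.
        mag = np.maximum(np.abs(v) - np.abs(n), 0.0)
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(v)), frame)
    return out
```

A real implementation would add windowing, overlap-add, and a noise floor, which are omitted here for brevity.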
Also suppose, for example, that the first microphone 11 and the second microphone 13 both collect sound in which the user's voice and ambient noise are mixed. In this case, since the first microphone 11 and the second microphone 13 are placed at physically separate positions, a phase difference arises when the signal from a sound source reaches each microphone. The voice extraction unit 23 may therefore remove the noise mixed into the voice using, for example, a microphone array technique.
As a microphone array technique, the voice extraction unit 23 may use, for example, a beamforming method or a blind source separation method based on ICA (Independent Component Analysis). Since these techniques are already well known, detailed description of them is omitted.
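As one concrete instance of the beamforming mentioned above, a minimal delay-and-sum sketch is shown below. The inter-microphone delay of the target source is assumed to be known here; in practice it would be estimated, for example by cross-correlating the two channels:

```python
import numpy as np

def delay_and_sum(mic1: np.ndarray, mic2: np.ndarray,
                  delay_samples: int) -> np.ndarray:
    """Delay-and-sum beamforming for two microphones: advance mic2 by
    the target source's inter-microphone delay so the target adds
    coherently, while sound from other directions partially cancels."""
    aligned = np.roll(mic2, -delay_samples)  # undo the propagation delay
    return 0.5 * (mic1 + aligned)
```

Note that `np.roll` wraps around at the signal edges; a real implementation would pad or trim instead.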
When the voice extraction unit 23 uses a technique that does not assume which type of sound each microphone collects in this way, voice can be extracted regardless of how the user uses the voice input/output unit 10.
Here, the voice extraction process performed by the voice extraction unit 23 will be described in more detail.
In general, ambient sound (noise) continues to enter a microphone even while the user is not speaking. It is therefore desirable that voice recognition be performed only after the section of sound to be recognized has been cut out; in other words, it is desirable that voice recognition processing not be applied to sections other than the input voice. Performing voice recognition only after cutting out the relevant sound in this way suppresses the erroneous recognition of noise as speech.
Methods for detecting the voice section include, for example, simply determining the voice section from the loudness of the sound, and distinguishing voice from noise using features such as the frequency components of the sound. In this sense, detecting the voice section amounts to removing the noise sections.
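The first method mentioned, deciding the voice section simply from loudness, can be sketched as a frame-energy detector. The frame length and threshold below are illustrative assumptions:

```python
import numpy as np

def detect_voice_frames(signal: np.ndarray, frame: int = 160,
                        threshold: float = 0.01) -> list:
    """Return (start, end) sample ranges of frames whose mean energy
    exceeds the threshold -- the 'loudness' criterion in the text."""
    voiced = []
    for start in range(0, len(signal) - frame + 1, frame):
        chunk = signal[start:start + frame]
        if np.mean(chunk ** 2) > threshold:
            voiced.append((start, start + frame))
    return voiced
```

Frequency-based methods would replace the energy test with a comparison of spectral features against a noise model.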
Meanwhile, ambient noise is also superimposed on the section cut out as voice (the voice section), so processing is performed to remove the noise component superimposed on the voice in that section. As described above, this noise removal uses either the technique of removing noise using voice data collected by a voice-collecting microphone and a noise-collecting microphone, or a microphone array technique.
When extracting voice, the voice extraction unit 23 may detect the voice section first and then perform noise removal on the detected section, or it may detect the voice section in the signal after noise removal has been performed. The voice extraction unit 23 may also combine these processes. In this way, the voice extraction performed by the voice extraction unit 23 includes both the process of detecting the voice section and the process of removing noise.
The voice recognition unit 24 performs voice recognition based on the voice data extracted by the voice extraction unit 23. That is, the voice recognition unit 24 performs voice recognition on voice from which the noise in the sound collected by the microphones has been removed. The voice recognition unit 24 may use any generally known voice recognition method.
Next, the operation of the voice recognition system of this embodiment will be described. FIG. 2 is a flowchart showing an operation example of the voice recognition system of this embodiment.
The first input voice transmission unit 12 wirelessly transmits voice data representing the voice input to the first microphone 11 to the voice recognition response unit 20. Similarly, the second input voice transmission unit 14 wirelessly transmits voice data representing the voice input to the second microphone 13 to the voice recognition response unit 20 (step S1).
The voice extraction unit 23 of the voice recognition response unit 20 extracts voice data from which ambient noise has been removed, based on the first voice data that the first input voice reception unit 21 received from the first input voice transmission unit 12 and the second voice data that the second input voice reception unit 22 received from the second input voice transmission unit 14 (step S2). The voice recognition unit 24 then performs voice recognition based on the voice data extracted by the voice extraction unit 23 (step S3).
As described above, according to this embodiment, the first microphone 11 and the second microphone 13 of the voice input/output unit 10 collect the user's voice and the noise present while the user is speaking, and the first input voice transmission unit 12 and the second input voice transmission unit 14 wirelessly transmit voice data containing the voice and noise input to each microphone to the voice recognition response unit 20. The voice extraction unit 23 of the voice recognition response unit 20 extracts voice data from which the noise has been removed from the received voice data, and the voice recognition unit 24 performs voice recognition on the voice data extracted by the voice extraction unit 23.
With such a configuration, the accuracy of voice recognition can be increased while the device used by the person inputting voice is kept small. That is, in this embodiment, the voice input/output unit 10 need only include the two microphones and the function of transmitting the voice data they collect, so the device implementing the voice input/output unit 10 can be made small. The work efficiency of the user using the voice input/output unit 10 can therefore be improved.
Furthermore, in this embodiment, the voice recognition response unit 20 removes noise from the received voice data and performs voice recognition based on the noise-removed voice data. The voice recognition response unit 20 need only be located somewhere within wireless range of the voice input/output unit 10; it does not have to be carried together with the user (that is, with the voice input/output unit 10) at all times. Accordingly, there is less need to miniaturize the voice recognition response unit 20 than the voice input/output unit 10, so the voice recognition response unit 20 can implement many functions for increasing the accuracy of voice recognition. The accuracy of voice recognition can therefore be increased.
Embodiment 2.
FIG. 3 is a block diagram showing a configuration example of the second embodiment of the voice recognition system according to the present invention. Components that are the same as in the first embodiment are given the same reference numerals as in FIG. 1, and their description is omitted. The voice recognition system according to this embodiment includes a voice input/output unit 30 and a voice recognition response unit 40.
The voice input/output unit 30 includes the first microphone 11, the second microphone 13, an input data integration unit 31, an input data transmission unit 32, and the control unit 15. The voice input/output unit 30 may also include an output unit (not shown) that outputs the voice input from the first microphone 11 and the second microphone 13. In this embodiment, a case where the voice input/output unit 30 does not include an output unit will be described as an example. The first microphone 11, the second microphone 13, and the control unit 15 are the same as in the first embodiment.
The input data integration unit 31 integrates the voice data representing the voice input to the first microphone 11 and the voice data representing the voice input to the second microphone 13. When the input data integration unit 31 receives the input from each microphone as analog data, it converts the analog data into digital data and integrates the converted digital data.
FIG. 4 is an explanatory diagram showing an example of a method for integrating two channels of voice into one channel. The example shown in FIG. 4 illustrates a method in which the channel-1 voice data input to the first microphone 11 and the channel-2 voice data input to the second microphone 13 are divided into fixed-length intervals and interleaved alternately. The integrated data is transmitted by the input data transmission unit 32, which is described later.
The method of integrating the voice data is not limited to the method illustrated in FIG. 4; any other integration method may be used as long as it allows multiple streams of voice data to be transmitted over one channel.
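The fixed-interval alternation of FIG. 4 can be sketched as follows. Treating the channels as byte streams and the block size of four bytes are illustrative assumptions:

```python
def interleave(ch1: bytes, ch2: bytes, block: int = 4) -> bytes:
    """Integrate two channels into one stream by alternating
    fixed-length blocks, as in the scheme of FIG. 4."""
    out = bytearray()
    for i in range(0, max(len(ch1), len(ch2)), block):
        out += ch1[i:i + block]  # next block of channel 1
        out += ch2[i:i + block]  # next block of channel 2
    return bytes(out)
```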
As in the first embodiment, the voice input/output unit 30 may include three or more microphones. In that case, the voice data representing the voice input to each microphone is integrated in the same way as described above.
The input data transmission unit 32 wirelessly transmits the voice data integrated by the input data integration unit 31 to the voice recognition response unit 40. The input data transmission unit 32 may also transmit, to the voice recognition response unit 40, status information indicating the method by which the input data integration unit 31 integrated the voice data.
The voice recognition response unit 40 includes an input data reception unit 41, an input data division unit 42, the voice extraction unit 23, and the voice recognition unit 24. As in the first embodiment, the voice recognition response unit 40 may also include a control unit (not shown) that synthesizes voice from the result of voice recognition by the voice recognition unit 24 and reproduces the synthesized voice. The voice extraction unit 23 and the voice recognition unit 24 are the same as in the first embodiment.
The input data reception unit 41 receives the voice data wirelessly transmitted by the input data transmission unit 32.
The input data division unit 42 divides the voice data into which the input data integration unit 31 integrated two or more streams of voice data back into the original streams. Specifically, the input data division unit 42 divides the received voice data according to the method by which the input data integration unit 31 integrated it. For example, when the input data integration unit 31 integrated two or more streams of voice data by interleaving them in fixed-length intervals as illustrated in FIG. 4, the input data division unit 42 divides the received voice data into those fixed-length intervals and reassembles the divided pieces into the original two or more streams of voice data.
The division method may be agreed in advance between the voice input/output unit 30 and the voice recognition response unit 40. Alternatively, the input data division unit 42 may identify the division method from the status information, transmitted by the input data transmission unit 32, that indicates the integration method.
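The division performed by the input data division unit 42, the inverse of the fixed-interval alternation of FIG. 4, can be sketched as follows. The block size is assumed to be agreed in advance between the two sides, as the text describes:

```python
def deinterleave(stream: bytes, block: int = 4) -> tuple:
    """Split a stream built by alternating fixed-length blocks of two
    channels back into the original channel-1 and channel-2 data."""
    ch1, ch2 = bytearray(), bytearray()
    for i in range(0, len(stream), 2 * block):
        ch1 += stream[i:i + block]              # channel-1 block
        ch2 += stream[i + block:i + 2 * block]  # channel-2 block
    return bytes(ch1), bytes(ch2)
```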
Next, the operation of the voice recognition system of this embodiment will be described. FIG. 5 is a flowchart showing an operation example of the voice recognition system of this embodiment.
The input data integration unit 31 integrates the voice data representing the voice input to the first microphone 11 and the voice data representing the voice input to the second microphone 13 (step S11). The input data transmission unit 32 wirelessly transmits the voice data integrated by the input data integration unit 31 to the voice recognition response unit 40 (step S12).
The input data division unit 42 divides the voice data received by the input data reception unit 41 (step S13). Specifically, the input data division unit 42 divides the voice data into which the input data integration unit 31 integrated two or more streams back into the original streams.
The voice extraction unit 23 extracts voice data from which ambient noise has been removed, based on the two or more streams of voice data produced by the input data division unit 42 (step S14). The voice recognition unit 24 then performs voice recognition based on the voice data extracted by the voice extraction unit 23 (step S15).
As described above, according to this embodiment, the input data integration unit 31 integrates the voice data, containing voice and noise, input to each microphone, and the input data transmission unit 32 wirelessly transmits the voice data integrated by the input data integration unit 31 to the voice recognition response unit 40. The input data division unit 42 then divides the received voice data back into the original streams, and the voice extraction unit 23 extracts noise-removed voice data from the streams produced by the input data division unit 42.
With such a configuration, voice data input to the microphones at the same time can be transmitted together. The receiving side (the voice recognition response unit 40) therefore does not need to perform processing that accounts for differing reception timings, which simplifies the processing on the receiving side.
Embodiment 3.
FIG. 6 is a block diagram showing a configuration example of the third embodiment of the voice recognition system according to the present invention. Components that are the same as in the first embodiment are given the same reference numerals as in FIG. 1, and their description is omitted. The voice recognition system according to this embodiment includes a voice input/output unit 50 and a voice recognition response unit 60.
The voice input/output unit 50 includes the first microphone 11, the first input voice transmission unit 12, the second microphone 13, the second input voice transmission unit 14, the control unit 15, a first input voice compression unit 51, and a second input voice compression unit 52. That is, the voice input/output unit 50 of this embodiment differs from the voice input/output unit 10 of the first embodiment in that it further includes the first input voice compression unit 51 and the second input voice compression unit 52.
The first input voice compression unit 51 generates voice data by compressing the voice input to the first microphone 11. In this case, the first input voice transmission unit 12 wirelessly transmits the compressed voice data to the voice recognition response unit 60.
Similarly, the second input voice compression unit 52 generates voice data by compressing the voice input to the second microphone 13. In this case, the second input voice transmission unit 14 wirelessly transmits the compressed voice data to the voice recognition response unit 60.
In this embodiment, a case where the input voice compression unit corresponding to each microphone compresses the voice collected by that microphone will be described as an example. However, the voice input/output unit 50 may instead include a single input voice compression unit that combines the functions of the first input voice compression unit 51 and the second input voice compression unit 52. In that case, this single input voice compression unit may determine which microphone the collected voice was input to and compress the voice accordingly.
The first input voice compression unit 51 and the second input voice compression unit 52 generate compressed voice data using a generally known method. For example, they may use μ-law, standardized as ITU-T Recommendation G.711, or ADPCM (Adaptive Differential Pulse Code Modulation), standardized as ITU-T Recommendation G.726. However, the compression method used by the first input voice compression unit 51 and the second input voice compression unit 52 is not limited to μ-law and ADPCM.
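The μ-law companding mentioned above maps each sample through a logarithmic curve so that it fits in 8 bits. For illustration, the sketch below implements the textbook continuous μ-law compander (μ = 255), not the segmented 8-bit table that G.711 itself standardizes; a conforming G.711 codec would use the standardized segment encoding:

```python
import math

MU = 255  # μ-law parameter used by G.711 (Japan / North America)

def mulaw_encode(sample: float) -> int:
    """Compress one sample in [-1.0, 1.0] to an 8-bit code (0..255)."""
    magnitude = math.log1p(MU * abs(sample)) / math.log1p(MU)
    signed = math.copysign(magnitude, sample)      # back to [-1, 1]
    return int(round((signed + 1.0) / 2.0 * 255))  # map to 0..255

def mulaw_decode(code: int) -> float:
    """Expand an 8-bit code back to a sample in [-1.0, 1.0]."""
    signed = code / 255.0 * 2.0 - 1.0
    magnitude = (math.pow(1 + MU, abs(signed)) - 1) / MU
    return math.copysign(magnitude, signed)
```

The logarithmic curve gives small samples finer quantization steps than large ones, which is why 8 bits suffice for telephone-quality speech.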
The voice recognition response unit 60 includes the first input voice reception unit 21, the second input voice reception unit 22, the voice extraction unit 23, the voice recognition unit 24, a first input voice decompression unit 61, and a second input voice decompression unit 62. That is, the voice recognition response unit 60 of this embodiment differs from the voice recognition response unit 20 of the first embodiment in that it further includes the first input voice decompression unit 61 and the second input voice decompression unit 62.
As in the first embodiment, the voice recognition response unit 60 may include a control unit (not shown) that synthesizes voice from the result of voice recognition by the voice recognition unit 24 and reproduces the synthesized voice.
The first input voice decompression unit 61 decompresses the compressed voice data received by the first input voice reception unit 21 into the original voice data. Likewise, the second input voice decompression unit 62 decompresses the compressed voice data received by the second input voice reception unit 22 into the original voice data. Specifically, the first input voice decompression unit 61 and the second input voice decompression unit 62 decompress the received voice data according to the method that the first input voice compression unit 51 and the second input voice compression unit 52 used for compression.
Next, the operation of the voice recognition system of this embodiment will be described. FIG. 7 is a flowchart showing an operation example of the voice recognition system of this embodiment.
The first input voice compression unit 51 generates voice data by compressing the voice input to the first microphone 11, and similarly, the second input voice compression unit 52 generates voice data by compressing the voice input to the second microphone 13 (step S21). The first input voice transmission unit 12 and the second input voice transmission unit 14 then wirelessly transmit the compressed voice data to the voice recognition response unit 60 (step S22).
The first input voice decompression unit 61 decompresses the compressed voice data received by the first input voice reception unit 21. Similarly, the second input voice decompression unit 62 decompresses the compressed voice data received by the second input voice reception unit 22 (step S23). The subsequent processing, in which the voice extraction unit 23 extracts noise-removed voice data and the voice recognition unit 24 performs voice recognition, is the same as steps S2 to S3 in FIG. 2.
As described above, according to this embodiment, the first input voice compression unit 51 and the second input voice compression unit 52 generate voice data by compressing the voice and noise input to each microphone, and the first input voice transmission unit 12 and the second input voice transmission unit 14 wirelessly transmit the compressed voice data to the voice recognition response unit 60. The first input voice decompression unit 61 and the second input voice decompression unit 62 decompress the compressed voice data into the original voice data, and the voice extraction unit 23 extracts noise-removed voice data from the decompressed voice data.
With such a configuration, the amount of data transmitted from the voice input/output unit 50 to the voice recognition response unit 60 can be reduced.
As in the second embodiment, the voice input/output unit 50 of this embodiment may further include the input data integration unit 31, which integrates the voice data generated by each input voice compression unit, and the voice recognition response unit 60 of this embodiment may further include the input data division unit 42, which divides the voice data integrated by the input data integration unit 31.
With such a configuration, the amount of data transmitted from the voice input/output unit 50 to the voice recognition response unit 60 can be reduced, and at the same time the processing on the receiving side can be simplified.
Embodiment 4.
 FIG. 8 is a block diagram showing a configuration example of the fourth embodiment of the speech recognition system according to the present invention. Components similar to those of the first embodiment are given the same reference numerals as in FIG. 1, and their description is omitted. The speech recognition system according to this embodiment includes a voice input/output unit 70 and a voice recognition response unit 80.
 The voice recognition response unit 80 includes a first input voice reception unit 21, a second input voice reception unit 22, a voice extraction unit 23, a voice recognition unit 24, a response generation unit 81, and a response transmission unit 82. That is, the voice recognition response unit 80 of this embodiment differs from the voice recognition response unit 20 of the first embodiment in that it further includes the response generation unit 81 and the response transmission unit 82.
 The response generation unit 81 synthesizes speech from the result of speech recognition by the voice recognition unit 24 and generates voice data representing the synthesized speech. Since the voice data generated in this way is produced in response to the voice data received from the voice input/output unit 70, it may also be referred to as response voice data. As the method by which the response generation unit 81 synthesizes speech from the recognition result, a method generally known as speech synthesis, a method using prerecorded speech, or a combination of the two may be used.
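 As a rough illustration of the choice described above (a prerecorded clip where one matches the recognition result, general speech synthesis otherwise), the following sketch uses hypothetical recognition strings and file names; none of them appear in this document, and the synthesis step is a stand-in for a real engine.

```python
# Hypothetical mapping from recognition results to prerecorded clips; the
# strings and file names below do not come from the patent.
PRERECORDED = {
    "play music": "responses/play_music.wav",
    "stop": "responses/stop.wav",
}

def synthesize(text: str) -> str:
    """Stand-in for a real speech-synthesis engine; returns a tag naming
    what would be synthesized."""
    return "tts:" + text

def generate_response(recognized_text: str) -> str:
    """Pick the response audio source: a prerecorded clip when one matches
    the recognition result, otherwise fall back to general synthesis."""
    key = recognized_text.strip().lower()
    if key in PRERECORDED:
        return PRERECORDED[key]
    return synthesize(key)

print(generate_response("Stop"))        # a prerecorded clip matches
print(generate_response("volume up"))   # no clip, so synthesis is used
```

 Combining the two sources in this way keeps frequently used replies cheap and natural-sounding while still covering arbitrary recognition results.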
 The response transmission unit 82 wirelessly transmits the generated response voice data to the voice input/output unit 70.
 The voice input/output unit 70 includes the first microphone 11, the first input voice transmission unit 12, the second microphone 13, the second input voice transmission unit 14, the control unit 15, a response reception unit 71, and a speaker 72. That is, the voice input/output unit 70 of this embodiment differs from the voice input/output unit 10 of the first embodiment in that it further includes the response reception unit 71 and the speaker 72.
 The response reception unit 71 receives the response voice data wirelessly transmitted by the response transmission unit 82 of the voice recognition response unit 80. The speaker 72 outputs the voice represented by the response voice data received by the response reception unit 71.
 Next, the operation of the speech recognition system of this embodiment will be described. FIG. 9 is a flowchart showing an operation example of the speech recognition system of this embodiment. The processing from when voice is input to the voice input/output unit 70 until it is transmitted to the voice recognition response unit 80 and recognized is the same as steps S1 to S3 in FIG. 2.
 The response generation unit 81 generates response voice data from the result of speech recognition by the voice recognition unit 24 (step S31). The response transmission unit 82 transmits the response voice data to the voice input/output unit 70 (step S32). The response reception unit 71 of the voice input/output unit 70 causes the speaker 72 to output the voice represented by the received response voice data (step S33).
 As described above, in this embodiment, in addition to the configuration of the first embodiment, the response generation unit 81 generates response voice data from the speech recognition result, and the response transmission unit 82 transmits the response voice data to the voice input/output unit 70. The response reception unit 71 then causes the speaker 72 to output the voice represented by the received response voice data. Therefore, the result of speech recognition by the voice recognition response unit 80 can be confirmed on the voice input/output unit 70 side.
Embodiment 5.
 FIG. 10 is a block diagram showing a configuration example of the fifth embodiment of the speech recognition system according to the present invention. Components similar to those of the first to fourth embodiments are given the same reference numerals as in FIG. 1, FIG. 3, FIG. 6, or FIG. 8, and their description is omitted. The speech recognition system according to this embodiment includes a voice input/output unit 90 and a voice recognition response unit 100.
 The voice recognition response unit 100 includes the first input voice reception unit 21, the second input voice reception unit 22, the voice extraction unit 23, the voice recognition unit 24, the first input voice decompression unit 61, the second input voice decompression unit 62, the response generation unit 81, the response transmission unit 82, and a response compression unit 101. That is, the voice recognition response unit 100 of this embodiment differs from the voice recognition response units described in the first to fourth embodiments in that it newly includes the response compression unit 101.
 The response compression unit 101 compresses the response voice data. The method for compressing the response voice data is the same as the method by which the input voice compression units (the first input voice compression unit 51 and the second input voice compression unit 52) of the third embodiment compress voice data. The method by which the response compression unit 101 compresses the response voice data and the method by which the input voice compression units compress voice data may be the same or different. In this case, the response transmission unit 82 transmits the compressed response voice data to the voice input/output unit 90.
 The voice input/output unit 90 includes the first microphone 11, the first input voice transmission unit 12, the second microphone 13, the second input voice transmission unit 14, the control unit 15, the first input voice compression unit 51, the second input voice compression unit 52, the response reception unit 71, the speaker 72, and a response decompression unit 91. That is, the voice input/output unit 90 of this embodiment differs from the voice input/output units described in the first to fourth embodiments in that it newly includes the response decompression unit 91.
 The response decompression unit 91 decompresses the compressed response voice data received by the response reception unit 71 back to the original response voice data. The method for decompressing the response voice data is the same as the method by which the input voice decompression units (the first input voice decompression unit 61 and the second input voice decompression unit 62) of the third embodiment decompress voice data. The method by which the response decompression unit 91 decompresses the response voice data and the method by which the input voice decompression units decompress voice data may be the same or different.
 Next, the operation of the speech recognition system of this embodiment will be described. FIG. 11 is a flowchart showing an operation example of the speech recognition system of this embodiment. The processing from when voice input to the voice input/output unit 90 is compressed and transmitted to the voice recognition response unit 100 until the transmitted voice data is decompressed and recognized is the same as the processing illustrated in the flowchart of FIG. 7. The processing in which the response generation unit 81 generates response voice data from the recognition result is the same as the processing of step S31 illustrated in FIG. 9.
 The response compression unit 101 compresses the response voice data (step S41). The response transmission unit 82 transmits the compressed response voice data to the voice input/output unit 90 (step S42). The response decompression unit 91 of the voice input/output unit 90 decompresses the compressed response voice data received by the response reception unit 71 (step S43). The speaker 72 then outputs the voice represented by the decompressed response voice data (step S44).
 As described above, in this embodiment, the response compression unit 101 of the voice recognition response unit 100 compresses the response voice data, and the response transmission unit 82 transmits the compressed response voice data to the voice input/output unit 90. The response decompression unit 91 of the voice input/output unit 90 then decompresses the compressed response voice data, and the speaker 72 outputs the voice represented by the decompressed response voice data.
 With such a configuration, the amount of data transmitted from the voice recognition response unit 100 to the voice input/output unit 90 can be reduced.
 Next, a modification of the first to fifth embodiments will be described. FIG. 12 is a block diagram showing a modification of the speech recognition system according to the present invention. Components similar to those of the first to fifth embodiments are given the same reference numerals as in FIG. 1, FIG. 3, FIG. 6, FIG. 8, or FIG. 10, and their description is omitted. The speech recognition system of this modification includes all of the components contained in the first to fifth embodiments. Specifically, the speech recognition system according to this modification includes a voice input/output unit 110 and a voice recognition response unit 120.
 The voice input/output unit 110 includes the first microphone 11 to an Nth microphone 15, the first input voice compression unit 51 to an Nth input voice compression unit 53, the input data integration unit 31, the input data transmission unit 32, the control unit 15, a first speaker 72 to an Nth speaker 75, a response data reception unit 77, a response data division unit 76, and a first response voice decompression unit 73 to an Nth response voice decompression unit 74.
 The response data reception unit 77 corresponds to the response reception unit 71 of the fourth embodiment. The first response voice decompression unit 73 to the Nth response voice decompression unit 74 correspond to the response decompression unit 91 of the fifth embodiment.
 The voice recognition response unit 120 includes the input data reception unit 41, the input data division unit 42, the voice extraction unit 23, the voice recognition unit 24, the first input voice decompression unit 61 to an Nth input voice decompression unit 63, the response generation unit 81, a first response voice compression unit 121 to an Nth response voice compression unit 122, a response data integration unit 123, and a response data transmission unit 124.
 The response data transmission unit 124 corresponds to the response transmission unit 82 of the fifth embodiment. The first response voice compression unit 121 to the Nth response voice compression unit 122 correspond to the response compression unit 101 of the fifth embodiment.
 The embodiments described above illustrate the case in which the voice input/output unit includes two microphones, and the fifth embodiment illustrates the case in which the voice input/output unit 90 includes one speaker. As illustrated in FIG. 12, however, the speech recognition system according to the present invention can include a plurality of microphones and a plurality of speakers.
 The present invention will now be described with reference to a specific example, but the scope of the present invention is not limited to the contents described below. FIG. 13 is an explanatory diagram showing an example of the speech recognition system of the present invention. The speech recognition system according to this example includes a headset 130 and a speech recognition device 140. The headset 130 of this example corresponds to the voice input/output unit of the above embodiments, and the speech recognition device 140 of this example corresponds to the voice recognition response unit of the above embodiments.
 The headset 130 includes a voice input microphone 131, a noise input microphone 132, and a speaker 133. As illustrated in FIG. 13, the voice input microphone 131 is placed near the user's mouth, and the noise input microphone 132 is placed near the user's ear. The speaker 133 is placed in the vicinity of the noise input microphone 132.
 The headset 130 generates voice data by compressing the voice input to the voice input microphone 131 and the noise input to the noise input microphone 132, respectively. As the compression method, μ-law or ADPCM is used. Alternatively, the headset 130 may generate voice data without compressing the voice and noise (uncompressed).
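 The μ-law companding mentioned above can be sketched as follows. This is the simplified continuous μ-law formula rather than the exact segmented ITU-T G.711 codec, and it is not necessarily the encoding the headset 130 actually uses; the frame size and sample rate are illustrative.

```python
import numpy as np

MU = 255.0  # companding parameter for 8-bit mu-law

def mulaw_compress(samples: np.ndarray) -> np.ndarray:
    """Compand float PCM samples in [-1, 1] down to 8-bit codes."""
    x = np.clip(samples, -1.0, 1.0)
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1.0) / 2.0 * 255.0).astype(np.uint8)

def mulaw_expand(codes: np.ndarray) -> np.ndarray:
    """Inverse companding: 8-bit codes back to float samples in [-1, 1]."""
    y = codes.astype(np.float64) / 255.0 * 2.0 - 1.0
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

# One 20 ms frame of a 440 Hz tone at 8 kHz: 160 samples.
pcm = np.sin(2 * np.pi * 440.0 * np.arange(160) / 8000.0)
codes = mulaw_compress(pcm)   # 1 byte per sample instead of 2 for 16-bit PCM
restored = mulaw_expand(codes)
print(codes.nbytes, float(np.max(np.abs(restored - pcm))))
```

 Relative to 16-bit linear PCM, this halves the number of bytes per channel while keeping the round-trip quantization error small, which is the point of compressing before wireless transmission.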
 The headset 130 integrates the generated two channels of voice data into one channel of voice data. At this time, the headset 130 generates data in which status information indicating the data format and the like is integrated with the voice data. The headset 130 then wirelessly transmits the data integrated into one channel to the speech recognition device 140.
 Bluetooth (registered trademark) is used for wireless communication between the headset 130 and the speech recognition device 140, and the Serial Port Profile is used as the communication protocol. In this way, the speech recognition system of the present invention can make use of common, already standardized schemes.
 The speech recognition device 140 divides the received data into the two channels of voice data and the status information. The speech recognition device 140 decompresses the two channels of voice data by a method corresponding to the compression method.
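 The integrate-then-split round trip can be sketched as follows. The one-byte status field and the [status][length][voice][noise] layout are illustrative assumptions; the document does not specify the actual frame format.

```python
import struct

STATUS_MULAW = 0x01  # hypothetical status code meaning "both channels mu-law"

def integrate_channels(status: int, voice: bytes, noise: bytes) -> bytes:
    """Pack status information plus two equal-length channel frames into one
    payload: 1 status byte, 2 length bytes, then the voice and noise frames."""
    if len(voice) != len(noise):
        raise ValueError("channel frames must be the same length")
    return struct.pack(">BH", status, len(voice)) + voice + noise

def split_channels(payload: bytes):
    """Inverse of integrate_channels: recover the status and both channels."""
    status, n = struct.unpack_from(">BH", payload)
    body = payload[3:]
    return status, body[:n], body[n:2 * n]

packed = integrate_channels(STATUS_MULAW, b"\x10\x11", b"\x20\x21")
status, voice, noise = split_channels(packed)
print(status, voice, noise)
```

 Because both channel frames travel in one payload, samples captured at the same instant by the two microphones arrive together, which is what lets the receiver skip any reception-timing alignment.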
 The speech recognition device 140 performs noise removal processing on the two channels of voice data using the method described in Patent Document 1 above. In doing so, the speech recognition device 140 also detects voice segments in the voice data. The speech recognition device 140 then performs speech recognition using the noise-removed voice data.
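 The actual method of Patent Document 1 is not reproduced here; as a stand-in, the following toy spectral-subtraction sketch shows the general idea of using the second (noise) channel to clean the voice channel.

```python
import numpy as np

def spectral_subtract(voice: np.ndarray, noise: np.ndarray,
                      floor: float = 0.02) -> np.ndarray:
    """Subtract the noise channel's magnitude spectrum from the voice
    channel's, keeping the voice channel's phase (with a small spectral
    floor so magnitudes never go negative)."""
    v = np.fft.rfft(voice)
    n = np.fft.rfft(noise)
    mag = np.maximum(np.abs(v) - np.abs(n), floor * np.abs(v))
    return np.fft.irfft(mag * np.exp(1j * np.angle(v)), n=len(voice))

rng = np.random.default_rng(0)
t = np.arange(512) / 8000.0
clean = np.sin(2 * np.pi * 440.0 * t)       # the user's "voice"
noise = 0.3 * rng.standard_normal(512)      # what the noise mic hears
mixed = clean + noise                       # what the voice mic hears
denoised = spectral_subtract(mixed, noise)
print(np.mean((mixed - clean) ** 2), np.mean((denoised - clean) ** 2))
```

 In this idealized setup the noise microphone observes the same noise that corrupts the voice microphone, so the subtraction removes most of the interference; real two-microphone arrangements differ and need the more careful processing the cited method provides.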
 The speech recognition device 140 generates response voice data according to the result of speech recognition, compresses the generated response voice data, and transmits it to the headset 130. The speech recognition device 140 may also make some response to the headset 130 regardless of the recognition result. For example, the speech recognition device 140 may transmit to the headset 130 control information notifying that voice data has been received.
 The headset 130 decompresses the received response voice data and outputs the voice represented by the response voice data from the speaker 133.
 Next, a minimum configuration example of the present invention will be described. FIG. 14 is a block diagram showing an example of the minimum configuration of the speech recognition system according to the present invention. The speech recognition system according to the present invention includes a voice input device 180 (for example, the voice input/output unit 10) that inputs a user's voice, and a speech recognition device 190 (for example, the voice recognition response unit 20) that performs speech recognition of the voice input to the voice input device 180.
 The voice input device 180 includes at least two input means 181 (for example, the first microphone 11 and the second microphone 13) that input the user's voice and the noise present while the user is uttering the voice, and wireless transmission means 182 (for example, the first input voice transmission unit 12 and the second input voice transmission unit 14) that wirelessly transmits voice data containing the voice and noise input to each input means 181 to the speech recognition device 190.
 The speech recognition device 190 includes voice extraction means 191 (for example, the voice extraction unit 23) that extracts noise-removed voice data from the received voice data, and voice recognition means 192 (for example, the voice recognition unit 24) that performs speech recognition on the voice data extracted by the voice extraction means 191.
 With such a configuration, the accuracy of speech recognition can be improved while miniaturizing the device used by the user who inputs the voice.
 The voice input device 180 (for example, the voice input/output unit 30) may include input data integration means (for example, the input data integration unit 31) that integrates the voice data containing the voice and noise input to each input means 181. The speech recognition device 190 (for example, the voice recognition response unit 40) may include input data division means (for example, the input data division unit 42) that divides the received voice data back into the original individual voice data.
 In this case, the wireless transmission means 182 of the voice input device 180 wirelessly transmits the voice data integrated by the input data integration means to the speech recognition device 190, and the voice extraction means 191 of the speech recognition device 190 may extract the noise-removed voice data from the individual voice data divided by the input data division means.
 With such a configuration, voice data input simultaneously to the microphones can be transmitted simultaneously. This eliminates the need for the receiving side (the speech recognition device 190) to perform processing that takes reception timing into account, so the processing on the receiving side can be simplified.
 The voice input device 180 (for example, the voice input/output unit 50) may include input data compression means (for example, the first input voice compression unit 51 and the second input voice compression unit 52) that generates voice data by compressing the voice and noise input to each input means 181. The speech recognition device 190 (for example, the voice recognition response unit 60) may include input data decompression means (for example, the first input voice decompression unit 61 and the second input voice decompression unit 62) that decompresses the compressed voice data back to the original voice data.
 In this case, the wireless transmission means 182 of the voice input device 180 wirelessly transmits the voice data compressed by the input data compression means to the speech recognition device 190, and the voice extraction means 191 of the speech recognition device 190 may extract the noise-removed voice data from the individual voice data decompressed to the original data by the input data decompression means.
 With such a configuration, the amount of data transmitted from the voice input device 180 to the speech recognition device 190 can be reduced.
 The speech recognition device 190 (for example, the voice recognition response unit 80) may include synthesized voice data generation means (for example, the response generation unit 81) that generates synthesized voice data from the result of speech recognition by the voice recognition means 192, and synthesized voice data transmission means (for example, the response transmission unit 82) that transmits the synthesized voice data generated by the synthesized voice data generation means to the voice input device 180. The voice input device 180 (for example, the voice input/output unit 70) may include synthesized voice data reception means (for example, the response reception unit 71) that receives the synthesized voice data from the speech recognition device 190, and output means (for example, the speaker 72) that outputs the voice represented by the synthesized voice data received by the synthesized voice data reception means.
 With such a configuration, the result of speech recognition by the speech recognition device 190 can be confirmed on the voice input device 180 side.
 The speech recognition device 190 (for example, the voice recognition response unit 100) may include synthesized voice data compression means (for example, the response compression unit 101) that compresses the synthesized voice data. The voice input device 180 (for example, the voice input/output unit 90) may include synthesized voice data decompression means (for example, the response decompression unit 91) that decompresses the compressed synthesized voice data.
 In this case, the synthesized voice data transmission means of the speech recognition device 190 transmits the synthesized voice data compressed by the synthesized voice data compression means to the voice input device, and the output means of the voice input device 180 may output the voice represented by the synthesized voice data decompressed by the synthesized voice data decompression means.
 With such a configuration, the amount of data transmitted from the speech recognition device 190 to the voice input device 180 can be reduced.
 The voice input device 180 may also include a control unit (for example, the control unit 15) that controls the state of the voice input device 180 based on control commands received from another device (for example, the voice recognition response unit) and transmits status information indicating the state of the voice input device 180 to that device. With such a configuration, the voice input device 180 can be made even smaller.
 The present invention has been described above with reference to embodiments and examples, but the present invention is not limited to the above embodiments and examples. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
 This application claims priority based on Japanese Patent Application No. 2011-245616 filed on November 9, 2011, the entire disclosure of which is incorporated herein.
 The present invention is suitably applied to a speech recognition system that performs speech recognition of voice transmitted using wireless communication.
 10, 30, 50, 70, 90 Voice input/output unit
 11 First microphone
 12 First input voice transmission unit
 13 Second microphone
 14 Second input voice transmission unit
 15 Control unit
 20, 40, 60, 80, 100 Voice recognition response unit
 21 First input voice reception unit
 22 Second input voice reception unit
 23 Voice extraction unit
 24 Voice recognition unit
 31 Input data integration unit
 32 Input data transmission unit
 41 Input data reception unit
 42 Input data division unit
 51 First input voice compression unit
 52 Second input voice compression unit
 61 First input voice decompression unit
 62 Second input voice decompression unit
 71 Response reception unit
 72 Speaker
 81 Response generation unit
 82 Response transmission unit
 91 Response decompression unit
 101 Response compression unit

Claims (7)

  1.  利用者の音声を入力する音声入力装置と、
     前記音声入力装置に入力された音声の音声認識を行う音声認識装置とを備え、
     前記音声入力装置は、
     利用者の音声と、当該利用者が音声を発声しているときの雑音とを入力する少なくとも2以上の入力手段と、
     各入力手段に入力された音声および雑音を含む音声データを前記音声認識装置に無線送信する無線送信手段とを含み、
     前記音声認識装置は、
     受信した音声データから、前記雑音を除去した音声データを抽出する音声抽出手段と、
     前記音声抽出手段が抽出した音声データの音声認識を行う音声認識手段とを含む
     ことを特徴とする音声認識システム。
    A voice input device for inputting a user's voice;
    A voice recognition device that performs voice recognition of the voice input to the voice input device;
    The voice input device includes:
    At least two or more input means for inputting user's voice and noise when the user is uttering voice;
    Wireless transmission means for wirelessly transmitting voice data including noise and noise input to each input means to the voice recognition device;
    The voice recognition device
    Voice extraction means for extracting the voice data from which the noise has been removed from the received voice data;
    And a voice recognition means for performing voice recognition of the voice data extracted by the voice extraction means.
  2.  The speech recognition system according to claim 1, wherein
     the voice input device includes input data integration means for integrating the voice data, containing voice and noise, input to each input means,
     the speech recognition device includes input data division means for dividing the received voice data back into the original individual voice data,
     the wireless transmission means of the voice input device wirelessly transmits the voice data integrated by the input data integration means to the speech recognition device, and
     the voice extraction means of the speech recognition device extracts noise-removed voice data from each voice data divided by the input data division means.
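The integration and division of claim 2 can be modeled as sample-interleaving the two PCM channels into one stream before transmission and de-interleaving them on reception. A minimal sketch under that assumption (function names and the 16-bit sample width are illustrative):

```python
def integrate(ch1: bytes, ch2: bytes, width: int = 2) -> bytes:
    """Interleave two equal-length PCM channels sample by sample,
    so both can travel over a single wireless link."""
    assert len(ch1) == len(ch2)
    out = bytearray()
    for i in range(0, len(ch1), width):
        out += ch1[i:i + width] + ch2[i:i + width]
    return bytes(out)

def divide(data: bytes, width: int = 2):
    """Split an interleaved stream back into the original two channels."""
    ch1, ch2 = bytearray(), bytearray()
    for i in range(0, len(data), 2 * width):
        ch1 += data[i:i + width]
        ch2 += data[i + width:i + 2 * width]
    return bytes(ch1), bytes(ch2)
```

Interleaving keeps the two channels sample-aligned after division, which matters for the noise subtraction on the recognizer side.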
  3.  The speech recognition system according to claim 1 or 2, wherein
     the voice input device includes input data compression means for generating compressed voice data from the voice and noise input to each input means,
     the speech recognition device includes input data decompression means for decompressing the compressed voice data into the original voice data,
     the wireless transmission means of the voice input device wirelessly transmits the voice data compressed by the input data compression means to the speech recognition device, and
     the voice extraction means of the speech recognition device extracts noise-removed voice data from each voice data restored by the input data decompression means.
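Claim 3 does not name a codec. As a stand-in, the compress-before-transmit, decompress-after-receive shape can be shown with a lossless round trip through `zlib`; a real device would more likely use a dedicated speech codec to cut radio bandwidth (function names are illustrative):

```python
import zlib

def compress_channel(pcm: bytes) -> bytes:
    # Lossless stand-in for the claimed input data compression means.
    # An actual headset would typically use a speech codec instead.
    return zlib.compress(pcm)

def decompress_channel(data: bytes) -> bytes:
    # Counterpart on the speech recognition device: restore the
    # original samples before voice extraction runs on them.
    return zlib.decompress(data)
```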
  4.  The speech recognition system according to any one of claims 1 to 3, wherein
     the speech recognition device includes:
     synthesized voice data generation means for generating synthesized voice data from the result of speech recognition by the speech recognition means; and
     synthesized voice data transmission means for transmitting the synthesized voice data generated by the synthesized voice data generation means to the voice input device, and
     the voice input device includes:
     synthesized voice data reception means for receiving the synthesized voice data from the speech recognition device; and
     output means for outputting the voice represented by the synthesized voice data received by the synthesized voice data reception means.
  5.  The speech recognition system according to claim 4, wherein
     the speech recognition device includes synthesized voice data compression means for compressing the synthesized voice data,
     the voice input device includes synthesized voice data decompression means for decompressing the compressed synthesized voice data,
     the synthesized voice data transmission means of the speech recognition device transmits the synthesized voice data compressed by the synthesized voice data compression means to the voice input device, and
     the output means of the voice input device outputs the voice represented by the synthesized voice data decompressed by the synthesized voice data decompression means.
  6.  The speech recognition system according to any one of claims 1 to 5, wherein the voice input device includes a control unit that controls the state of the voice input device itself based on a control command received from another device, and transmits status information indicating that state to the other device.
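The control unit of claim 6 can be sketched as a small command handler that updates the device's state and reports it back as status information. The command and status strings below are hypothetical; the patent does not define a command set:

```python
class VoiceInputControl:
    """Toy model of the claimed control unit: react to commands from
    another device and report the device's own state as status info."""

    def __init__(self):
        self.state = "idle"

    def handle_command(self, command: str) -> str:
        # Apply the received control command to the device state.
        if command == "start_capture":
            self.state = "capturing"
        elif command == "stop_capture":
            self.state = "idle"
        # Send status information back to the commanding device.
        return self.status()

    def status(self) -> str:
        return f"status:{self.state}"
```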
  7.  A speech recognition method comprising:
     inputting, by a voice input device that receives a user's voice, the user's voice and the noise present while the user is speaking, using two or more input means;
     wirelessly transmitting, by the voice input device, voice data containing the voice and noise input to each input means to a speech recognition device;
     extracting, by the speech recognition device, voice data from which the noise has been removed from the received voice data; and
     performing, by the speech recognition device, speech recognition on the extracted voice data.
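The four steps of the claim-7 method can be wired together in one function. Transmission is modeled as a direct call, extraction as the simplest possible operation (subtracting the noise-reference channel), and recognition as a pluggable callback; all names are illustrative:

```python
import numpy as np

def run_method(voice_plus_noise, noise_ref, recognize):
    """End-to-end sketch of the claimed method:
    1. two inputs capture voice+noise and a noise reference,
    2. both channels reach the recognizer side (modeled as this call),
    3. the recognizer side removes the noise estimate,
    4. speech recognition runs on the extracted data."""
    extracted = np.asarray(voice_plus_noise, float) - np.asarray(noise_ref, float)
    return recognize(extracted)
```

In practice step 3 would be a proper multi-channel enhancement algorithm and `recognize` a real ASR engine; the function only shows the claimed data flow.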
PCT/JP2012/005874 2011-11-09 2012-09-14 Speech recognition system and speech recognition method WO2013069187A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2011-245616 2011-11-09
JP2011245616 2011-11-09

Publications (1)

Publication Number Publication Date
WO2013069187A1 2013-05-16

Family

ID=48288978

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2012/005874 WO2013069187A1 (en) 2011-11-09 2012-09-14 Speech recognition system and speech recognition method

Country Status (1)

Country Link
WO (1) WO2013069187A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01115798U (en) * 1988-01-30 1989-08-03
JP2007140419A (en) * 2005-11-18 2007-06-07 Humanoid:Kk Interactive information transmission device with situation-adaptive intelligence


Similar Documents

Publication Publication Date Title
EP2211339A1 (en) Audio processing in a portable listening device
EP2312578A1 (en) Signal analyzing device, signal control device, and method and program therefor
JP2007318528A (en) Directional sound collector, directional sound collecting method, and computer program
JP2010530154A5 (en)
CN105976829B (en) Audio processing device and audio processing method
WO2017061023A1 (en) Audio signal processing method and device
JPH09116998A (en) Hearing aid
EP3737115A1 (en) A hearing apparatus with bone conduction sensor
EP2482566A1 (en) Method for generating an audio signal
JP2007034238A (en) On-site operation support system
WO2020017518A1 (en) Audio signal processing device
KR20170098761A (en) Apparatus and method for extending bandwidth of earset with in-ear microphone
US20200228849A1 (en) Information processing apparatus, information processing system, and program
WO2013069187A1 (en) Speech recognition system and speech recognition method
JP7284570B2 (en) Sound reproduction system and program
WO2014138758A3 (en) Method for increasing the comprehensibility of speech
JP2019516304A (en) Earset timbre compensator and method
KR101386883B1 (en) Mobile terminal and method for executing communication mode thereof
US20100255878A1 (en) Audio filter
JP6267860B2 (en) Audio signal transmitting apparatus, audio signal receiving apparatus and method thereof
JP4973376B2 (en) Apparatus for detecting basic period of speech and apparatus for converting speech speed using the basic period
WO2022137806A1 (en) Ear-mounted type device and reproduction method
WO2017116022A1 (en) Apparatus and method for extending bandwidth of earset having in-ear microphone
EP2434483A1 (en) Encoding device, decoding device, and methods therefor
KR101970589B1 (en) Speech signal transmitting apparatus, speech signal receiving apparatus and method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12848213

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12848213

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP