WO2013069187A1 - Speech recognition system and speech recognition method - Google Patents

Speech recognition system and speech recognition method Download PDF

Info

Publication number
WO2013069187A1
WO2013069187A1 (PCT/JP2012/005874)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
input
data
unit
speech
Prior art date
Application number
PCT/JP2012/005874
Other languages
French (fr)
Japanese (ja)
Inventor
Satoshi Tsukada
Eiji Takada
Takanori Tsujikawa
Original Assignee
NEC Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corporation
Publication of WO2013069187A1 publication Critical patent/WO2013069187A1/en

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating

Definitions

  • the present invention relates to a voice recognition system and a voice recognition method for performing voice recognition of voice transmitted using wireless communication.
  • Patent Document 1 describes a noise removing device that removes noise mixed in speech.
  • the noise removal device described in Patent Document 1 includes a first microphone that collects sound and a second microphone that collects ambient noise.
  • The noise removal device described in Patent Document 1 converts the sound input to each microphone into a time-series feature vector, and removes stationary and non-stationary noise based on each converted time-series vector.
  • Patent Document 2 describes a voice processing system using a headset with a wireless communication function.
  • the microphone provided in the headset detects voice, and the result of voice recognition performed by the voice recognition unit of the headset is transmitted to an external device by wireless communication.
  • Patent Document 3 describes a speech recognition system that performs speech recognition by compressing and expanding a speech signal.
  • a voice recognition device is provided on the headset side used by an operator for voice input.
  • an object of the present invention is to provide a voice recognition system and a voice recognition method capable of improving the accuracy of voice recognition while downsizing an apparatus used by a user who inputs voice.
  • A voice recognition system according to the present invention includes a voice input device that inputs a user's voice, and a voice recognition device that performs voice recognition of the voice input to the voice input device. The voice input device includes at least two input means for inputting the voice and the noise present when the user utters the voice, and wireless transmission means for wirelessly transmitting the voice data, including noise, input to each input means to the voice recognition device.
  • The voice recognition device includes voice extraction means for extracting voice data from which noise has been removed from the received voice data, and voice recognition means for performing voice recognition of the voice data extracted by the voice extraction means.
  • In the voice recognition method according to the present invention, a voice input device inputs the user's voice and the noise present when the user utters the voice, using two or more input means. The voice input device then wirelessly transmits the voice data, including the voice and noise input to each input means, to the voice recognition device. The voice recognition device extracts voice data from which the noise has been removed from the received voice data, and performs voice recognition of the extracted voice data.
  • According to the present invention, it is possible to improve the accuracy of voice recognition while miniaturizing the device used by the user who inputs voice.
  • FIG. 1 is a block diagram showing a configuration example of a first embodiment of a speech recognition system according to the present invention.
  • the voice recognition system according to the present embodiment includes a voice input / output unit 10 and a voice recognition response unit 20.
  • the voice input / output unit 10 includes a first microphone 11, a first input voice transmission unit 12, a second microphone 13, a second input voice transmission unit 14, and a control unit 15.
  • the audio input / output unit 10 may include an output unit (not shown) that outputs audio input from the first microphone 11 and the second microphone 13.
  • a case where the voice input / output unit 10 does not include an output unit will be described as an example.
  • the first microphone 11 and the second microphone 13 collect the user's voice and ambient noise when the user is uttering the voice.
  • a voice including noise may be simply referred to as a voice.
  • the first microphone 11 and the second microphone 13 are provided at physically separated positions.
  • For example, when the voice input / output unit 10 is realized as a headset, the first microphone 11 may be disposed near the user's mouth, and the second microphone 13 near the user's ear. By physically separating the microphones in this way, different sound is input to each microphone.
  • the first microphone 11 is used to collect user's voice
  • the second microphone 13 is used to collect ambient noise.
  • the type of sound collected by the first microphone 11 and the second microphone 13 is not particularly limited.
  • both the first microphone 11 and the second microphone 13 may collect voice in which the user's voice and ambient noise are mixed.
  • the first microphone 11 may be used particularly for collecting ambient noise
  • the second microphone 13 may be used particularly for collecting user's voice.
  • the voice input / output unit 10 includes two microphones.
  • the number of microphones included in the voice input / output unit 10 is not limited to two.
  • the voice input / output unit 10 may include three or more microphones. Even when three or more microphones are included in the voice input / output unit 10, the type of voice collected by each microphone is not particularly limited.
  • the first input voice transmission unit 12 transmits the voice input to the first microphone 11 to the voice recognition response unit 20 wirelessly.
  • the second input voice transmission unit 14 transmits the voice input to the second microphone 13 to the voice recognition response unit 20 wirelessly.
  • the voice input / output unit 10 may include one input voice transmission unit in which the functions of the first input voice transmission unit 12 and the second input voice transmission unit 14 are combined. Then, the input voice transmission unit may determine which microphone the collected voice is input to, and may transmit the voice to the voice recognition response unit 20.
  • The first input voice transmission unit 12 and the second input voice transmission unit 14 digitize the audio input to each microphone before wirelessly transmitting it.
  • The first input voice transmission unit 12 and the second input voice transmission unit 14 may transmit status information indicating the state of the voice input / output unit 10 to the voice recognition response unit 20 in response to an instruction from the control unit 15 described later.
  • The control unit 15 controls the state of the voice input / output unit 10 based on a control command received from another device (for example, the voice recognition response unit 20). As control commands, the control unit 15 receives, for example, the number of microphone channels, the compression method of the audio data to be transmitted, the sampling frequency of the audio data, microphone switch settings, operation mode settings (for example, the protocol to be used), microphone volume settings, and speaker volume settings.
  • The control unit 15 may instruct the first input voice transmission unit 12 and the second input voice transmission unit 14 to transmit status information indicating the state of the voice input / output unit 10.
  • Status information includes, for example, the number of microphone channels, the audio data sampling frequency, the microphone switch status, operation mode information, the transmission data block size, microphone volume, speaker volume, radio wave status, battery level, battery charge status, and time information.
  • the voice input / output unit 10 may transmit status information to another device and operate based on a control command from the other device. By doing so, it is not necessary to incorporate a determination process for performing an operation in the voice input / output unit 10, so that the voice input / output unit 10 can be further downsized.
  • the voice recognition response unit 20 includes a first input voice reception unit 21, a second input voice reception unit 22, a voice extraction unit 23, and a voice recognition unit 24.
  • The speech recognition response unit 20 may include a control unit (not shown) that synthesizes speech and reproduces the synthesized speech based on the result of speech recognition by the speech recognition unit 24 described later.
  • the first input voice reception unit 21 receives the voice data wirelessly transmitted by the first input voice transmission unit 12.
  • the second input voice reception unit 22 receives the voice data wirelessly transmitted by the second input voice transmission unit 14.
  • The voice extraction unit 23 extracts audio data from which ambient noise has been removed, based on the voice data received by the first input voice receiver 21 (hereinafter, first voice data) and the voice data received by the second input voice receiver 22 (hereinafter, second voice data).
  • the voice extraction unit 23 uses the received first voice data and second voice data to remove noise mixed in the voice uttered by the user.
  • the voice extraction unit 23 removes the voice data of the noise collected by the second microphone 13 from the voice data of the voice collected by the first microphone 11.
  • The voice extraction unit 23 may use, for example, the noise removal method of the noise removal device described in Patent Document 1. If the type of sound collected by each microphone can be specified in this way, the processing for extracting the sound can be sped up.
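Patent Document 1's exact algorithm is not reproduced in this publication; as a hedged sketch only, the two-microphone idea can be approximated by spectral subtraction, assuming the first microphone carries voice plus noise and the second carries mostly noise (the function name and frame size below are illustrative choices, not the patent's method):

```python
import numpy as np

def spectral_subtraction(voice_mic, noise_mic, frame=256):
    """Per-frame magnitude subtraction: remove the noise reference's
    spectrum from the voice microphone's spectrum (illustration only)."""
    voice_mic = np.asarray(voice_mic, dtype=float)
    noise_mic = np.asarray(noise_mic, dtype=float)
    out = np.zeros_like(voice_mic)
    for start in range(0, len(voice_mic) - frame + 1, frame):
        v = np.fft.rfft(voice_mic[start:start + frame])
        n = np.fft.rfft(noise_mic[start:start + frame])
        mag = np.maximum(np.abs(v) - np.abs(n), 0.0)  # floor negative bins at zero
        phase = np.angle(v)                           # reuse the voice mic's phase
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * phase), frame)
    return out
```

Because each output bin's magnitude never exceeds the input bin's magnitude, the extracted signal's energy is bounded by the noisy input's energy.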
  • both the first microphone 11 and the second microphone 13 collect sound in which the user's voice and ambient noise are mixed.
  • the voice extraction unit 23 may remove noise mixed in the voice using, for example, a microphone array technique.
  • the speech extraction unit 23 may use, for example, a beam forming method or a blind sound source separation method using ICA (Independent Component Analysis) as a microphone array method.
  • By using a technique that does not specify the type of voice collected by each microphone, the voice extraction unit 23 can extract the voice regardless of the manner in which the user uses the voice input / output unit 10.
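The publication only names beam forming and ICA without detailing them; as a minimal hedged illustration of the microphone-array idea (the function name is invented here), a delay-and-sum beamformer aligns the microphones on the target direction and averages, reinforcing the target source while averaging down uncorrelated noise:

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """Advance each microphone signal by its steering delay (in samples)
    and average, reinforcing sound arriving from the steered direction."""
    aligned = [np.roll(np.asarray(sig, dtype=float), -d)
               for sig, d in zip(mic_signals, delays)]
    return np.mean(aligned, axis=0)
```

With the correct per-microphone delays, a wave arriving from the steered direction adds coherently across microphones, while sound from other directions adds with mismatched phases and is attenuated.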
  • the voice extraction process performed by the voice extraction unit 23 will be described more specifically.
  • As methods for detecting a speech section, there are, for example, a method of simply determining the speech section from the loudness of the sound, and a method of distinguishing speech from noise using characteristics such as the frequency components of the sound. In other words, detecting a speech section can be regarded as removing the noise sections.
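The simpler of the two approaches, a loudness threshold, can be sketched as follows (a hedged illustration; the function name, frame length, and threshold are assumptions, not values from the publication):

```python
import numpy as np

def detect_speech_frames(signal, frame=160, threshold=0.01):
    """Mark frames whose mean energy exceeds a fixed threshold as speech.
    A practical detector would adapt the threshold and add spectral cues."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame
    return [float(np.mean(signal[i * frame:(i + 1) * frame] ** 2)) > threshold
            for i in range(n_frames)]
```

Frames flagged False are the noise sections; dropping them is exactly the "removing a noise segment" view described above.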
  • This noise component removal processing uses, for example, a technique that removes noise using voice data collected from a microphone that collects voice and a microphone that collects noise, or a microphone array technique.
  • The voice extraction unit 23 may detect the voice section and then perform noise removal processing on the detected section, or may perform voice detection on the voice after the noise removal processing has been performed.
  • the voice extraction unit 23 may extract voice by combining these processes.
  • the process of extracting the voice by the voice extraction unit 23 includes a process of detecting a voice section and a process of removing noise.
  • the voice recognition unit 24 performs voice recognition based on the voice data extracted by the voice extraction unit 23. That is, the voice recognition unit 24 performs voice recognition on the voice from which noise has been removed from the voice collected by the microphone.
  • the speech recognition unit 24 may perform speech recognition using a generally known method.
  • FIG. 2 is a flowchart showing an operation example of the speech recognition system of the present embodiment.
  • the first input voice transmission unit 12 transmits voice data indicating the voice input to the first microphone 11 to the voice recognition response unit 20 wirelessly.
  • the second input voice transmission unit 14 wirelessly transmits voice data indicating the voice input to the second microphone 13 to the voice recognition response unit 20 (step S1).
  • The voice extraction unit 23 of the voice recognition response unit 20 extracts voice data from which ambient noise has been removed, based on the first voice data received by the first input voice receiver 21 from the first input voice transmitter 12 and the second voice data received by the second input voice receiver 22 from the second input voice transmitter 14 (step S2). Then, the voice recognition unit 24 performs voice recognition based on the voice data extracted by the voice extraction unit 23 (step S3).
  • The first microphone 11 and the second microphone 13 of the voice input / output unit 10 collect the voice of the user and the noise present while the user is uttering the voice.
  • The first input voice transmission unit 12 and the second input voice transmission unit 14 wirelessly transmit the voice data, including the voice and noise input to each microphone, to the voice recognition response unit 20.
  • the voice extraction unit 23 of the voice recognition response unit 20 extracts voice data from which noise has been removed from the received voice data, and the voice recognition unit 24 performs voice recognition of the voice data extracted by the voice extraction unit 23.
  • Since the voice input / output unit 10 only needs two microphones and a function for transmitting the voice data collected from these microphones, the voice input / output unit 10 can be downsized. Therefore, the work efficiency of the user who uses the voice input / output unit 10 can be improved.
  • the voice recognition response unit 20 removes noise from the received voice data and performs voice recognition based on the voice data from which the noise has been removed.
  • The voice recognition response unit 20 may be provided in any place reachable by wireless communication from the voice input / output unit 10, and does not have to be carried together with the user (that is, with the voice input / output unit 10). Accordingly, since there is less need to reduce the size of the voice recognition response unit 20 than of the voice input / output unit 10, the voice recognition response unit 20 can be given many functions for improving the accuracy of speech recognition. Therefore, the accuracy of voice recognition can be increased.
  • FIG. 3 is a block diagram showing a configuration example of the second embodiment of the speech recognition system according to the present invention.
  • The same components as in FIG. 1 are given the same reference numerals, and their description is omitted.
  • the voice recognition system according to the present embodiment includes a voice input / output unit 30 and a voice recognition response unit 40.
  • the voice input / output unit 30 includes a first microphone 11, a second microphone 13, an input data integration unit 31, an input data transmission unit 32, and a control unit 15.
  • the audio input / output unit 30 may include an output unit (not shown) that outputs audio input from the first microphone 11 and the second microphone 13.
  • A case where the voice input / output unit 30 does not include an output unit will be described as an example.
  • The first microphone 11, the second microphone 13, and the control unit 15 are the same as those in the first embodiment.
  • the input data integration unit 31 integrates audio data indicating sound input to the first microphone 11 and audio data indicating sound input to the second microphone 13. When the input data is received as analog data by each microphone, the input data integration unit 31 converts the analog data into digital data and integrates the converted digital data.
  • FIG. 4 is an explanatory diagram showing an example of a method for integrating 2-channel audio into 1-channel audio.
  • FIG. 4 shows a method in which the channel-1 audio data input to the first microphone 11 and the channel-2 audio data input to the second microphone 13 are divided at regular intervals and arranged alternately.
  • the integrated data is transmitted by the input data transmission unit 32 described later.
  • The voice data integration method is not limited to the method illustrated in FIG. 4. Other integration methods may be used as long as a plurality of audio data streams can be transmitted over one channel.
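The regular-interval interleaving of FIG. 4 can be sketched as follows (a hedged illustration; the function name and block size are arbitrary assumptions, and the two channels are assumed to be of equal length):

```python
def interleave_channels(ch1, ch2, block=4):
    """Cut both equal-length channels into fixed-size blocks and emit
    them alternately, so two streams share one transmission channel."""
    out = []
    for i in range(0, len(ch1), block):
        out.extend(ch1[i:i + block])   # channel-1 block
        out.extend(ch2[i:i + block])   # channel-2 block
    return out
```

The resulting single stream carries samples from both microphones in a fixed alternating pattern, which is what allows the receiver to recover the two original channels.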
  • the voice input / output unit 30 may include three or more microphones.
  • the method of integrating audio data indicating the audio input to each microphone is the same as the method described above.
  • the input data transmission unit 32 transmits the voice data integrated by the input data integration unit 31 to the voice recognition response unit 40 wirelessly.
  • The input data transmission unit 32 may also transmit, to the voice recognition response unit 40, status information indicating the method by which the input data integration unit 31 integrated the voice data.
  • The voice recognition response unit 40 includes an input data reception unit 41, an input data division unit 42, a voice extraction unit 23, and a voice recognition unit 24. As in the first embodiment, the voice recognition response unit 40 may also include a control unit (not shown) that synthesizes voice and reproduces the synthesized voice from the result of voice recognition by the voice recognition unit 24. The voice extraction unit 23 and the voice recognition unit 24 are the same as those in the first embodiment.
  • the input data receiving unit 41 receives the voice data wirelessly transmitted by the input data transmitting unit 32.
  • The input data dividing unit 42 divides the audio data, into which the input data integration unit 31 integrated two or more audio data streams, back into the original audio data. Specifically, the input data dividing unit 42 divides the received audio data according to the method by which the input data integration unit 31 integrated it. For example, as illustrated in FIG. 4, when the input data integration unit 31 has interleaved two or more audio data streams divided at regular intervals, the input data dividing unit 42 may divide the received audio data at the same intervals and reassemble the divided pieces into the two or more original audio data streams.
  • the division method may be determined in advance between the voice input / output unit 30 and the voice recognition response unit 40. Further, the input data dividing unit 42 may specify the dividing method based on the status information indicating the integration method transmitted from the input data transmitting unit 32.
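Assuming the regular-interval interleaving illustrated in FIG. 4, the division can be sketched as the inverse operation (illustrative names; the receiver must know the block size in advance or learn it from the transmitted status information):

```python
def split_channels(stream, block=4):
    """Undo fixed-size block interleaving: route alternating blocks
    back to the two original channels."""
    ch1, ch2 = [], []
    for i in range(0, len(stream), 2 * block):
        ch1.extend(stream[i:i + block])
        ch2.extend(stream[i + block:i + 2 * block])
    return ch1, ch2
```

The key design point is that integration and division are agreed on by both sides: any interleaving scheme works as long as this inverse is unambiguous.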
  • FIG. 5 is a flowchart showing an operation example of the speech recognition system of the present embodiment.
  • the input data integration unit 31 integrates audio data indicating audio input to the first microphone 11 and audio data indicating audio input to the second microphone 13 (step S11).
  • the input data transmission unit 32 wirelessly transmits the voice data integrated by the input data integration unit 31 to the voice recognition response unit 40 (step S12).
  • The input data dividing unit 42 divides the audio data received by the input data receiving unit 41 (step S13). Specifically, the input data dividing unit 42 restores the audio data, into which the input data integration unit 31 integrated two or more audio data streams, to the original audio data.
  • the voice extraction unit 23 extracts voice data from which ambient noise has been removed based on the two or more voice data divided by the input data division unit 42 (step S14). Then, the voice recognition unit 24 performs voice recognition based on the voice data extracted by the voice extraction unit 23 (step S15).
  • The input data integration unit 31 integrates the voice data, including the voice and noise input to each microphone.
  • The input data transmission unit 32 wirelessly transmits the voice data integrated by the input data integration unit 31 to the voice recognition device.
  • the input data dividing unit 42 divides the received audio data into original audio data
  • The voice extraction unit 23 extracts audio data from which noise has been removed from each audio data stream divided by the input data dividing unit 42.
  • Such a configuration makes it possible to simultaneously transmit audio data input simultaneously to each microphone. Therefore, it is not necessary for the reception side (voice recognition response unit 40) to perform processing in consideration of reception timing, so that the processing on the reception side can be simplified.
  • FIG. 6 is a block diagram showing a configuration example of the third embodiment of the speech recognition system according to the present invention.
  • The same components as in FIG. 1 are given the same reference numerals, and their description is omitted.
  • the voice recognition system according to the present embodiment includes a voice input / output unit 50 and a voice recognition response unit 60.
  • The voice input / output unit 50 includes a first microphone 11, a first input voice transmission unit 12, a second microphone 13, a second input voice transmission unit 14, a control unit 15, a first input voice compression unit 51, and a second input voice compression unit 52. That is, the voice input / output unit 50 of the present embodiment differs from the voice input / output unit 10 of the first embodiment in that it further includes the first input voice compression unit 51 and the second input voice compression unit 52.
  • the first input sound compression unit 51 generates sound data obtained by compressing the sound input to the first microphone 11.
  • the first input voice transmission unit 12 transmits the compressed voice data to the voice recognition response unit 60 wirelessly.
  • the second input sound compression unit 52 generates sound data obtained by compressing the sound input to the second microphone 13.
  • the second input voice transmission unit 14 transmits the compressed voice data to the voice recognition response unit 60 wirelessly.
  • the voice input / output unit 50 may include one input voice compression unit that combines the functions of the first input voice compression unit 51 and the second input voice compression unit 52.
  • the input voice compression unit may determine which microphone the collected voice is input to, and compress the voice.
  • the first input audio compression unit 51 and the second input audio compression unit 52 generate audio data obtained by compressing audio using a generally known method.
  • The first input audio compression unit 51 and the second input audio compression unit 52 compress audio using, for example, μ-law coding as standardized in ITU-T Recommendation G.711, or ADPCM (Adaptive Differential Pulse Code Modulation).
  • The method used by the first input audio compression unit 51 and the second input audio compression unit 52 for audio compression is not limited to μ-law and ADPCM.
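As a hedged illustration of the companding idea behind G.711 μ-law (continuous curve only; the real codec also quantizes each sample to an 8-bit codeword, which is omitted here, and the function names are assumptions):

```python
import numpy as np

def mu_law_encode(x, mu=255.0):
    """Compress samples in [-1, 1] with the mu-law companding curve,
    giving small amplitudes more resolution than large ones."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_decode(y, mu=255.0):
    """Invert the companding curve to recover the original samples."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu
```

The curve expands quiet samples before quantization and the decoder applies the exact inverse, which is why companded 8-bit transmission preserves intelligibility at reduced data rates.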
  • The voice recognition response unit 60 includes a first input voice reception unit 21, a second input voice reception unit 22, a voice extraction unit 23, a voice recognition unit 24, a first input voice decompression unit 61, and a second input voice decompression unit 62. That is, the speech recognition response unit 60 of the present embodiment differs from the speech recognition response unit 20 of the first embodiment in that it further includes the first input voice decompression unit 61 and the second input voice decompression unit 62.
  • The voice recognition response unit 60 may also include a control unit (not shown) that synthesizes voice and reproduces the synthesized voice from the result of voice recognition by the voice recognition unit 24.
  • the first input voice decompression unit 61 decompresses the compressed voice data received by the first input voice reception unit 21 to the original voice data.
  • the second input voice decompression unit 62 decompresses the compressed voice data received by the second input voice reception unit 22 to the original voice data.
  • the first input voice decompression unit 61 and the second input voice decompression unit 62 are the methods used by the first input voice compression unit 51 and the second input voice compression unit 52 to compress the voice. In response, the received audio data is decompressed.
  • FIG. 7 is a flowchart showing an operation example of the speech recognition system of this embodiment.
  • the first input sound compression unit 51 generates sound data obtained by compressing the sound input to the first microphone 11.
  • The second input audio compression unit 52 generates audio data obtained by compressing the audio input to the second microphone 13 (step S21). Then, the first input voice transmission unit 12 and the second input voice transmission unit 14 wirelessly transmit the compressed voice data to the voice recognition response unit 60 (step S22).
  • the first input voice decompression unit 61 decompresses the compressed voice data received by the first input voice reception unit 21.
  • the second input voice decompression unit 62 decompresses the compressed voice data received by the second input voice reception unit 22 (step S23). Thereafter, the process of extracting the voice data from which the voice extraction unit 23 has removed noise and the process of performing the voice recognition by the voice recognition unit 24 are the same as steps S2 to S3 in FIG.
  • The first input audio compression unit 51 and the second input audio compression unit 52 generate audio data in which the audio and noise input to each microphone are compressed, and the first input voice transmission unit 12 and the second input voice transmission unit 14 wirelessly transmit each compressed voice data to the voice recognition response unit 60. Further, the first input voice decompression unit 61 and the second input voice decompression unit 62 decompress the compressed voice data to the original voice data, and the voice extraction unit 23 extracts voice data from which noise has been removed from each decompressed voice data stream.
  • the amount of data transmitted from the voice input / output unit 50 to the voice recognition response unit 60 can be reduced.
  • the voice input / output unit 50 may include an input data integration unit 31 that integrates voice data generated by each input voice compression unit.
  • the voice recognition response unit 60 of this embodiment may include an input data dividing unit 42 that divides the voice data integrated by the input data integration unit 31.
  • FIG. 8 is a block diagram showing a configuration example of the fourth embodiment of the speech recognition system according to the present invention.
  • The same components as in FIG. 1 are given the same reference numerals, and their description is omitted.
  • the voice recognition system according to the present embodiment includes a voice input / output unit 70 and a voice recognition response unit 80.
  • The voice recognition response unit 80 includes a first input voice reception unit 21, a second input voice reception unit 22, a voice extraction unit 23, a voice recognition unit 24, a response generation unit 81, and a response transmission unit 82. That is, the speech recognition response unit 80 of this embodiment differs from the speech recognition response unit 20 of the first embodiment in that it further includes the response generation unit 81 and the response transmission unit 82.
  • the response generation unit 81 synthesizes speech from the result of speech recognition by the speech recognition unit 24 and generates speech data indicating the synthesized speech. Note that the sound data generated in this way is generated as sound data in response to the sound data received from the sound input / output unit 70, and therefore this sound data may be referred to as response sound data.
  • As the method for the response generation unit 81 to synthesize speech from the speech recognition result, a generally known speech synthesis method, a method using previously recorded speech, or a combination of these may be used.
  • the response transmission unit 82 wirelessly transmits the generated response voice data to the voice input / output unit 70.
  • the voice input / output unit 70 includes a first microphone 11, a first input voice transmission unit 12, a second microphone 13, a second input voice transmission unit 14, a control unit 15, and a response reception unit 71. And a speaker 72. That is, the voice input / output unit 70 of the present embodiment is different from the voice input / output unit 10 of the first embodiment in that it further includes a response receiving unit 71 and a speaker 72.
  • the response receiving unit 71 receives the response voice data wirelessly transmitted by the response transmission unit 82 of the voice recognition response unit 80.
  • the speaker 72 outputs sound indicated by the response sound data received by the response receiving unit 71.
  • FIG. 9 is a flowchart showing an operation example of the speech recognition system of the present embodiment.
  • the process from when the voice input to the voice input / output unit 70 is transmitted to the voice recognition response unit 80 until voice recognition is performed is the same as steps S1 to S3 in FIG.
  • the response generation unit 81 generates response voice data from the result of the voice recognition unit 24 performing voice recognition (step S31).
  • the response transmitter 82 transmits the response voice data to the voice input / output unit 70 (step S32).
  • the response receiver 71 of the voice input / output unit 70 causes the speaker 72 to output the voice indicated by the received response voice data (step S33).
  • The response generation unit 81 generates response voice data from the result of voice recognition by the voice recognition unit 24, and the response transmission unit 82 transmits the response voice data to the voice input / output unit 70. Then, the response receiving unit 71 causes the speaker 72 to output the sound indicated by the received response voice data. Therefore, the voice recognition result produced by the voice recognition response unit 80 can be confirmed on the voice input / output unit 70 side.
  • FIG. 10 is a block diagram showing a configuration example of the fifth embodiment of the speech recognition system according to the present invention.
  • Components identical to those in FIG. 1, FIG. 3, FIG. 6, or FIG. 8 are given the same reference symbols.
  • the voice recognition system according to the present embodiment includes a voice input / output unit 90 and a voice recognition response unit 100.
  • the voice recognition response unit 100 includes a first input voice reception unit 21, a second input voice reception unit 22, a voice extraction unit 23, a voice recognition unit 24, a first input voice expansion unit 61, A second input voice decompression unit 62, a response generation unit 81, a response transmission unit 82, and a response compression unit 101 are included. That is, the speech recognition response unit 100 of the present embodiment is different from the speech recognition response unit described in the first to fourth embodiments in that a response compression unit 101 is newly included.
  • the response compression unit 101 compresses response audio data.
  • the method of compressing response audio data is the same as the method of compressing audio data by the input audio compression unit (first input audio compression unit 51, second input audio compression unit 52) of the third embodiment. Note that the method in which the response compression unit 101 compresses the response audio data and the method in which the input audio compression unit compresses the audio data may be the same or different.
  • the response transmission unit 82 transmits the compressed response voice data to the voice input / output unit 90.
  • The voice input / output unit 90 includes a first microphone 11, a first input voice transmission unit 12, a second microphone 13, a second input voice transmission unit 14, a control unit 15, a first input audio compression unit 51, a second input audio compression unit 52, a response reception unit 71, a speaker 72, and a response expansion unit 91. That is, the voice input / output unit 90 of this embodiment differs from the voice input / output units described in the first to fourth embodiments in that it newly includes the response expansion unit 91.
  • the response decompression unit 91 decompresses the compressed response voice data received by the response reception unit 71 to the original response voice data.
  • the method for expanding the response audio data is the same as the method in which the input audio expansion unit (the first input audio expansion unit 61 and the second input audio expansion unit 62) of the third embodiment expands the audio data. Note that the method in which the response decompression unit 91 decompresses the response voice data and the method in which the input voice decompression unit decompresses the voice data may be the same or different.
  • FIG. 11 is a flowchart showing an operation example of the speech recognition system of the present embodiment. The processing from when the voice input to the voice input / output unit 90 is compressed and transmitted to the voice recognition response unit 100 until the transmitted voice data is expanded and recognized is the same as the processing illustrated in the flowchart of FIG. 7.
  • the process in which the response generation unit 81 generates response voice data from the voice recognition result is the same as the process in step S31 illustrated in FIG.
  • the response compression unit 101 compresses the response voice data (step S41).
  • the response transmission unit 82 transmits the compressed response voice data to the voice input / output unit 90 (step S42).
  • the response decompression unit 91 of the speech input / output unit 90 decompresses the compressed response speech data received by the response reception unit 71 (step S43). Then, the speaker 72 outputs the voice indicated by the expanded response voice data (step S44).
  • the response compression unit 101 of the voice recognition response unit 100 compresses the response voice data, and the response transmission unit 82 transmits the compressed response voice data to the voice input / output unit 90. Then, the response decompression unit 91 of the speech input / output unit 90 decompresses the compressed response speech data, and the speaker 72 outputs the sound indicated by the decompressed response speech data.
  • the amount of data transmitted from the voice recognition response unit 100 to the voice input / output unit 90 can be reduced.
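To illustrate why the compression step reduces the transmitted data, the following is a minimal sketch of continuous μ-law companding (one of the methods mentioned later in the implementation example, alongside ADPCM). The 8-bit quantization and the [-1, 1] float sample representation are assumptions for illustration; this is not the embodiment's actual codec.

```python
import math

MU = 255.0  # standard mu-law companding constant

def mulaw_compress(samples):
    """Compress floats in [-1, 1] to one 8-bit code per sample
    using the continuous mu-law curve (half the size of 16-bit PCM)."""
    out = bytearray()
    for x in samples:
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
        out.append(int(round((y + 1.0) / 2.0 * 255)))  # quantize to 8 bits
    return bytes(out)

def mulaw_expand(codes):
    """Expand 8-bit codes back to floats in [-1, 1]."""
    out = []
    for c in codes:
        y = c / 255.0 * 2.0 - 1.0
        out.append(math.copysign((math.pow(1.0 + MU, abs(y)) - 1.0) / MU, y))
    return out

samples = [0.0, 0.5, -0.25, 1.0, -1.0]
codes = mulaw_compress(samples)      # 1 byte per sample vs 2 for 16-bit PCM
restored = mulaw_expand(codes)
assert all(abs(a - b) < 0.02 for a, b in zip(samples, restored))
```

The log-domain quantization keeps the round-trip error small for quiet samples, which is why μ-law is a common choice for speech links.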
  • FIG. 12 is a block diagram showing a modification of the speech recognition system according to the present invention.
  • Components identical to those in FIG. 1, FIG. 3, FIG. 6, FIG. 8, or FIG. 10 are given the same reference symbols.
  • the speech recognition system according to the present modification includes each component included in the first to fifth embodiments.
  • the voice recognition system according to this modification includes a voice input / output unit 110 and a voice recognition response unit 120.
  • The voice input / output unit 110 includes a first microphone 11 to an Nth microphone 15, a first input voice compression unit 51 to an Nth input voice compression unit 53, an input data integration unit 31, an input data transmission unit 32, the control unit 15, a first speaker 72 to an Nth speaker 75, a response data reception unit 77, a response data division unit 76, and a first response speech decompression unit 73 to an Nth response speech decompression unit 74.
  • response data receiving unit 77 corresponds to the response receiving unit 71 of the fourth embodiment.
  • the first response speech decompression unit 73 to the Nth response speech decompression unit 74 correspond to the response decompression unit 91 of the fifth embodiment.
  • the speech recognition response unit 120 includes an input data reception unit 41, an input data division unit 42, a speech extraction unit 23, a speech recognition unit 24, and a first input speech decompression unit 61 to an Nth input speech decompression unit 63.
  • response data transmission unit 124 corresponds to the response transmission unit 82 of the fifth embodiment.
  • The first response audio compression unit 121 to the Nth response audio compression unit 122 correspond to the response compression unit 101 of the fifth embodiment.
  • the speech recognition system can include a plurality of microphones and a plurality of speakers.
  • FIG. 13 is an explanatory diagram showing an embodiment of the speech recognition system of the present invention.
  • the voice recognition system according to this embodiment includes a headset 130 and a voice recognition device 140.
  • the headset 130 of this example corresponds to the voice input / output device of the above embodiment.
  • the speech recognition apparatus 140 of this example corresponds to the speech recognition response unit of the above embodiment.
  • the headset 130 includes a voice input microphone 131, a noise input microphone 132, and a speaker 133. As illustrated in FIG. 13, the voice input microphone 131 is disposed at the user's mouth, and the noise input microphone 132 is disposed at the user's ear. The speaker 133 is disposed in the vicinity of the noise input microphone 132.
  • the headset 130 generates voice data in which the voice input to the voice input microphone 131 and the noise input to the noise input microphone 132 are respectively compressed.
  • As the compression method, μ-law or ADPCM is used.
  • the headset 130 may generate audio data without compressing audio and noise (uncompressed).
  • the headset 130 integrates the generated 2-channel audio data into the 1-channel audio data. At this time, the headset 130 generates data in which status information indicating the data format and the like is integrated with the audio data. Then, the headset 130 wirelessly transmits data integrated into one channel to the voice recognition device 140.
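The implementation example above integrates two channels of audio plus status information into one stream. The frame layout sketched below (a small fixed header followed by byte-interleaved channels) is purely an assumed convention for illustration; the patent does not specify the headset's actual data format.

```python
import struct

def integrate(voice: bytes, noise: bytes, status: dict) -> bytes:
    """Pack a status header and two equal-length channel blocks into
    a single byte stream (hypothetical frame layout)."""
    assert len(voice) == len(noise)
    # Assumed header: channel count, format code, samples per channel.
    header = struct.pack("<BBH", status["channels"], status["format"], len(voice))
    # Interleave sample-by-sample so both channels share one stream.
    body = bytes(b for pair in zip(voice, noise) for b in pair)
    return header + body

def split(frame: bytes):
    """Recover the status fields and the two channels on the receiver side."""
    channels, fmt, _n = struct.unpack("<BBH", frame[:4])
    body = frame[4:]
    return {"channels": channels, "format": fmt}, body[0::2], body[1::2]

frame = integrate(b"\x01\x02\x03", b"\x11\x12\x13", {"channels": 2, "format": 1})
info, v, n = split(frame)
assert v == b"\x01\x02\x03" and n == b"\x11\x12\x13"
```

Because both channels travel in one frame, the receiver gets samples that were captured at the same instant without having to align two independent radio streams.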
  • Bluetooth (registered trademark) is used for the wireless transmission, and the Serial Port Profile is used as the communication protocol.
  • the voice recognition device 140 divides the received data into two-channel voice data and status information.
  • the voice recognition device 140 expands the two-channel voice data by a method corresponding to the compression method.
  • the voice recognition device 140 performs noise removal processing on two-channel voice data using the method described in Patent Document 1 described above. At that time, the voice recognition device 140 also detects a voice section from the voice data. Then, the voice recognition device 140 performs voice recognition using the voice data from which noise has been removed.
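Patent Document 1's actual method converts each microphone's input into time-series feature vectors and removes stationary and non-stationary noise; that algorithm is not reproduced here. As a loose stand-in, the toy sketch below shows only the general idea: use the noise-channel estimate to clean the voice channel, then mark voice sections with an (assumed) energy threshold.

```python
def remove_noise(voice_frames, noise_frames, floor=0.0):
    """Per-frame energy subtraction: subtract the noise channel's
    energy estimate from the voice channel.  A toy stand-in for the
    two-microphone noise removal of Patent Document 1."""
    return [max(v - n, floor) for v, n in zip(voice_frames, noise_frames)]

def detect_voice_sections(frame_energies, threshold):
    """Mark frames whose noise-removed energy exceeds a threshold."""
    return [e > threshold for e in frame_energies]

voice = [0.2, 0.9, 1.1, 0.3]   # per-frame energies, voice microphone
noise = [0.2, 0.2, 0.3, 0.25]  # per-frame energies, noise microphone
clean = remove_noise(voice, noise)
sections = detect_voice_sections(clean, threshold=0.4)
assert sections == [False, True, True, False]
```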
  • The voice recognition device 140 generates response voice data according to the result of voice recognition, compresses the generated response voice data, and transmits the compressed response voice data to the headset 130.
  • The voice recognition device 140 may make some response to the headset 130 regardless of the recognition result.
  • For example, the voice recognition device 140 may transmit to the headset 130 control information notifying it that the voice data has been received.
  • the headset 130 expands the received response sound data and outputs the sound indicated by the response sound data from the speaker 133.
  • FIG. 14 is a block diagram showing an example of the minimum configuration of the speech recognition system according to the present invention.
  • The voice recognition system according to the present invention includes a voice input device 180 (for example, the voice input / output unit 10) for inputting a user's voice, and a voice recognition device 190 (for example, the voice recognition response unit 20) for performing voice recognition of the voice input to the voice input device 180.
  • The voice input device 180 includes at least two input means 181 (for example, the first microphone 11 and the second microphone 13) for inputting the user's voice and the noise present while the user is speaking, and wireless transmission means 182 (for example, the first input voice transmission unit 12 and the second input voice transmission unit 14) for wirelessly transmitting voice data, containing the voice and noise input to each input means 181, to the voice recognition device 190.
  • The voice recognition device 190 includes voice extraction means 191 (for example, the voice extraction unit 23) for extracting noise-removed voice data from the received voice data, and voice recognition means 192 (for example, the voice recognition unit 24) for performing voice recognition of the voice data extracted by the voice extraction means 191.
  • The voice input device 180 (for example, the voice input / output unit 30) may include input data integration means (for example, the input data integration unit 31) that integrates the voice data, containing voice and noise, input to each input means 181.
  • the voice recognition device 190 (for example, the voice recognition response unit 40) may include input data dividing means (for example, the input data dividing unit 42) that divides the received voice data into original voice data.
  • The wireless transmission means 182 of the voice input device 180 may wirelessly transmit the voice data integrated by the input data integration means to the voice recognition device 190, and the voice extraction means 191 of the voice recognition device 190 may extract the noise-removed voice data from each piece of voice data divided by the input data dividing means.
  • Such a configuration makes it possible to transmit together the voice data input simultaneously to each microphone. Therefore, the receiving side (the voice recognition device 190) need not perform processing that accounts for reception timing, so processing on the receiving side can be simplified.
  • The voice input device 180 (for example, the voice input / output unit 50) may include input data compression means (for example, the first input voice compression unit 51 and the second input voice compression unit 52) that compresses the voice data input to each input means 181.
  • The voice recognition device 190 (for example, the voice recognition response unit 60) may include input data expansion means (for example, the first input voice expansion unit 61 and the second input voice expansion unit 62) that expands the compressed voice data to the original voice data.
  • The wireless transmission means 182 of the voice input device 180 may wirelessly transmit the voice data compressed by the input data compression means to the voice recognition device 190, and the voice extraction means 191 of the voice recognition device 190 may extract the noise-removed voice data from each piece of voice data expanded by the input data expansion means.
  • the amount of data transmitted from the voice input device 180 to the voice recognition device 190 can be reduced.
  • The voice recognition device 190 (for example, the voice recognition response unit 80) may include synthesized voice data generation means (for example, the response generation unit 81) that generates synthesized voice data from the result of voice recognition performed by the voice recognition means 192, and synthesized voice data transmission means (for example, the response transmission unit 82) that transmits the generated synthesized voice data to the voice input device 180.
  • The voice input device 180 (for example, the voice input / output unit 70) may include synthesized voice data reception means (for example, the response reception unit 71) that receives the synthesized voice data from the voice recognition device 190, and output means (for example, the speaker 72) that outputs the voice indicated by the received synthesized voice data.
  • the result of voice recognition by the voice recognition device 190 can be confirmed on the voice input device 180 side.
  • the voice recognition device 190 may include a synthesized voice data compression unit (for example, the response compression unit 101) that compresses the synthesized voice data.
  • the voice input device 180 (for example, the voice input / output unit 90) may include a synthesized voice data expansion unit (for example, a response expansion unit 91) that expands the compressed synthesized voice data.
  • The synthesized voice data transmission means of the voice recognition device 190 may transmit the synthesized voice data compressed by the synthesized voice data compression means to the voice input device 180, and the output means of the voice input device 180 may output the voice indicated by the synthesized voice data expanded by the synthesized voice data expansion means.
  • the amount of data transmitted from the speech recognition device 190 to the speech input device 180 can be reduced.
  • The voice input device 180 may include control means (for example, the control unit 15) that controls the state of the voice input device 180 itself based on a control command received from another device (for example, a voice recognition response unit) and transmits status information indicating the state of the voice input device 180.
  • the voice input device 180 can be further downsized.
  • the present invention is preferably applied to a speech recognition system that performs speech recognition of speech transmitted using wireless communication.


Abstract

Provided is a speech recognition system that can reduce the size of the device used by a user to input speech while increasing the accuracy of speech recognition. At least two input means (181) input a user's speech and the noise present while the user is speaking. A wireless transmission means (182) wirelessly transmits speech data containing the speech and noise input to the input means (181) to a speech recognition device (190). A speech extraction means (191) extracts noise-removed speech data from the received speech data. A speech recognition means (192) performs speech recognition of the speech data extracted by the speech extraction means (191).

Description

Speech recognition system and speech recognition method
The present invention relates to a voice recognition system and a voice recognition method for performing voice recognition of voice transmitted using wireless communication.
When performing speech recognition, it is known that the performance of speech recognition deteriorates due to ambient noise input together with the speech. For this reason, various methods are known for removing noise mixed into speech.
Patent Document 1 describes a noise removal device that removes noise mixed into speech. The noise removal device described in Patent Document 1 includes a first microphone that collects speech and a second microphone that collects ambient noise. It converts the sound input to each microphone into a time-series feature vector and removes stationary noise and non-stationary noise based on each converted time-series vector.
Patent Document 2 describes a voice processing system using a headset with a wireless communication function. In the voice processing system described in Patent Document 2, a microphone provided in the headset detects voice, and the result of voice recognition performed by the headset's voice recognition means is transmitted to an external device by wireless communication.
Note that Patent Document 3 describes a speech recognition system that performs speech recognition by compressing and expanding a speech signal.
Patent Document 1: Japanese Patent No. 2836271; Patent Document 2: JP 2003-202888 A; Patent Document 3: JP 2005-321748 A
In situations where an operator reports work status by voice while working, noise is often mixed into the voice input to the microphone, depending on the work and the surrounding environment. To improve the accuracy of voice recognition in such situations, the noise mixed into the voice must be removed appropriately.
In such situations, where voice is input while working, it is desirable to miniaturize the device used to input the voice. For example, if a voice input / output device such as a headset microphone is connected by wire to a voice recognition processing device, using these devices may hinder the work.
For example, noise can be removed by using the noise removal device described in Patent Document 1. However, in that device the two microphones and the noise removal means are connected and operate as one unit, so the device tends to be large. It is therefore desirable to miniaturize the device the operator uses to input voice while improving the accuracy of voice recognition through noise removal as described in Patent Document 1.
Also, as in the voice processing system described in Patent Document 2, using a headset with a wireless communication function can keep the operator's work from being hindered. However, in the voice processing system described in Patent Document 2, the voice recognition device is provided on the headset side, which the operator uses to input voice.
Improving the accuracy of voice recognition requires many resources. Moreover, the models and processing methods used for voice recognition are often changed. In the voice processing system of Patent Document 2, the voice recognition device in each headset must therefore be updated every time a model or processing method changes. In addition, miniaturizing the headset imposes resource constraints, so sufficiently accurate voice recognition cannot necessarily be achieved.
Accordingly, an object of the present invention is to provide a voice recognition system and a voice recognition method that can improve the accuracy of voice recognition while miniaturizing the device used by the user who inputs voice.
A voice recognition system according to the present invention includes a voice input device that inputs a user's voice and a voice recognition device that performs voice recognition of the voice input to the voice input device. The voice input device includes at least two input means for inputting the user's voice and the noise present while the user is speaking, and wireless transmission means for wirelessly transmitting voice data, containing the voice and noise input to each input means, to the voice recognition device. The voice recognition device includes voice extraction means for extracting noise-removed voice data from the received voice data, and voice recognition means for performing voice recognition of the voice data extracted by the voice extraction means.
In a voice recognition method according to the present invention, a voice input device that inputs a user's voice inputs, using two or more input means, the user's voice and the noise present while the user is speaking; the voice input device wirelessly transmits voice data containing the voice and noise input to each input means to a voice recognition device; the voice recognition device extracts noise-removed voice data from the received voice data; and the voice recognition device performs voice recognition of the extracted voice data.
According to the present invention, the accuracy of voice recognition can be improved while miniaturizing the device used by the user who inputs voice.
FIG. 1 is a block diagram showing a configuration example of the first embodiment of the speech recognition system according to the present invention. FIG. 2 is a flowchart showing an operation example of the speech recognition system of the first embodiment. FIG. 3 is a block diagram showing a configuration example of the second embodiment of the speech recognition system according to the present invention. FIG. 4 is an explanatory diagram showing an example of a method of integrating 2-channel audio into 1-channel audio. FIG. 5 is a flowchart showing an operation example of the speech recognition system of the second embodiment. FIG. 6 is a block diagram showing a configuration example of the third embodiment of the speech recognition system according to the present invention. FIG. 7 is a flowchart showing an operation example of the speech recognition system of the third embodiment. FIG. 8 is a block diagram showing a configuration example of the fourth embodiment of the speech recognition system according to the present invention. FIG. 9 is a flowchart showing an operation example of the speech recognition system of the fourth embodiment. FIG. 10 is a block diagram showing a configuration example of the fifth embodiment of the speech recognition system according to the present invention. FIG. 11 is a flowchart showing an operation example of the speech recognition system of the fifth embodiment. FIG. 12 is a block diagram showing a modification of the speech recognition system according to the present invention. FIG. 13 is an explanatory diagram showing an example of the speech recognition system of the present invention.
FIG. 14 is a block diagram showing an example of the minimum configuration of the speech recognition system according to the present invention.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram showing a configuration example of the first embodiment of the speech recognition system according to the present invention. The speech recognition system according to this embodiment includes a voice input / output unit 10 and a voice recognition response unit 20.
The voice input / output unit 10 includes a first microphone 11, a first input voice transmission unit 12, a second microphone 13, a second input voice transmission unit 14, and a control unit 15. The voice input / output unit 10 may also include an output unit (not shown) that outputs the voice input from the first microphone 11 or the second microphone 13. In this embodiment, a case where the voice input / output unit 10 does not include an output unit is described as an example.
The first microphone 11 and the second microphone 13 collect the user's voice and the ambient noise present while the user is speaking. In the following description, voice containing noise may be referred to simply as voice.
The first microphone 11 and the second microphone 13 are provided at physically separated positions. For example, when the voice input / output unit 10 is realized as a headset, the first microphone 11 may be placed at the user's mouth and the second microphone 13 at the user's ear. Because of this physical arrangement, different sound is input to each microphone. In this example, the first microphone 11 is used mainly to collect the user's voice, and the second microphone 13 mainly to collect the ambient noise.
However, the types of sound collected by the first microphone 11 and the second microphone 13 are not particularly limited. For example, both microphones may collect sound in which the user's voice and ambient noise are mixed. Alternatively, the first microphone 11 may be used mainly to collect ambient noise, and the second microphone 13 mainly to collect the user's voice.
This embodiment describes the case where the voice input / output unit 10 includes two microphones, but the number of microphones included in the voice input / output unit 10 is not limited to two; it may include three or more microphones. Even in that case, the type of sound each microphone collects is not particularly limited.
The first input voice transmission unit 12 wirelessly transmits the voice input to the first microphone 11 to the voice recognition response unit 20. Similarly, the second input voice transmission unit 14 wirelessly transmits the voice input to the second microphone 13 to the voice recognition response unit 20.
In this embodiment, the voice collected by each microphone is transmitted to the voice recognition response unit 20 by the input voice transmission unit corresponding to that microphone. However, the voice input / output unit 10 may instead include a single input voice transmission unit that combines the functions of the first input voice transmission unit 12 and the second input voice transmission unit 14. That input voice transmission unit would then determine which microphone the collected voice was input to and transmit the voice to the voice recognition response unit 20.
When the sound collected by a microphone is received as analog data, the first input voice transmission unit 12 and the second input voice transmission unit 14 digitize that sound. That is, the first input voice transmission unit 12 and the second input voice transmission unit 14 wirelessly transmit digitized voice data to the voice recognition response unit 20.
The first input voice transmission unit 12 and the second input voice transmission unit 14 may also transmit status information indicating the state of the voice input/output unit 10 to the voice recognition response unit 20, in response to an instruction from the control unit 15 described later.
The control unit 15 controls the state of the voice input/output unit 10 based on control commands received from another device (for example, the voice recognition response unit 20). The control commands received by the control unit 15 include, for example, settings for the number of microphone channels, the compression method for the transmitted voice data, the sampling frequency of the voice data, the microphone switches, the operation mode (for example, the protocol to be used), the microphone volume, and the speaker volume.
The control unit 15 may also instruct the first input voice transmission unit 12 and the second input voice transmission unit 14 to transmit status information indicating the state of the voice input/output unit 10. Examples of status information include the number of microphone channels, the sampling frequency of the voice data, the state of the microphone switches, operation mode information, the block size of the transmitted data, microphone volume information, speaker volume information, radio wave conditions, remaining battery level, battery charging state, and time information.
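As a purely illustrative sketch, the status information listed above might be packaged as a simple key-value message. All field names and values below are assumptions for illustration; this embodiment does not define a message format:

```python
# Hypothetical status message reported by the voice input/output unit.
# Every field name and value here is an illustrative assumption.
def build_status_message() -> dict:
    return {
        "mic_channels": 2,           # number of microphone channels
        "sampling_rate_hz": 16000,   # sampling frequency of the voice data
        "mic_switch_on": True,       # microphone switch state
        "operation_mode": "normal",  # operation mode information
        "block_size": 512,           # block size of the transmitted data
        "mic_volume": 8,             # microphone volume information
        "speaker_volume": 5,         # speaker volume information
        "battery_percent": 87,       # remaining battery level
    }
```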
As described above, the voice input/output unit 10 may transmit status information to another device and operate based on control commands from that device. This eliminates the need to build decision-making logic for its own operation into the voice input/output unit 10, so the voice input/output unit 10 can be made smaller.
The voice recognition response unit 20 includes a first input voice reception unit 21, a second input voice reception unit 22, a voice extraction unit 23, and a voice recognition unit 24. The voice recognition response unit 20 may also include a control unit (not shown) that synthesizes voice from the result of voice recognition by the voice recognition unit 24, described later, and reproduces the synthesized voice.
The first input voice reception unit 21 receives the voice data wirelessly transmitted by the first input voice transmission unit 12. Likewise, the second input voice reception unit 22 receives the voice data wirelessly transmitted by the second input voice transmission unit 14.
The voice extraction unit 23 extracts voice data from which ambient noise has been removed, based on the voice data received by the first input voice reception unit 21 (hereinafter referred to as the first voice data) and the voice data received by the second input voice reception unit 22 (hereinafter referred to as the second voice data).
That is, the first voice data and the second voice data each contain at least ambient noise. The voice extraction unit 23 therefore uses the received first voice data and second voice data to remove the noise mixed into the voice uttered by the user.
For example, suppose the first microphone 11 mainly collects the user's voice and the second microphone 13 mainly collects ambient noise. In this case, the voice extraction unit 23 removes the noise voice data collected by the second microphone 13 from the voice data collected by the first microphone 11. Here, the voice extraction unit 23 may use, for example, the noise removal method of the noise removal device described in Patent Document 1 mentioned above. When the type of sound each microphone collects can be identified in this way, the voice extraction process can be sped up.
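The exact method of Patent Document 1 is not reproduced here. As an illustrative stand-in for removing a noise reference captured by the second microphone from the first microphone's signal, the sketch below uses simple magnitude spectral subtraction; the frame length is an assumption:

```python
import numpy as np

def spectral_subtraction(voice_mic: np.ndarray, noise_mic: np.ndarray,
                         frame: int = 256) -> np.ndarray:
    """Illustrative noise removal: frame by frame, subtract the noise
    microphone's magnitude spectrum from the voice microphone's,
    keeping the voice microphone's phase."""
    out = np.zeros(len(voice_mic))
    for start in range(0, len(voice_mic) - frame + 1, frame):
        v = np.fft.rfft(voice_mic[start:start + frame])
        n = np.fft.rfft(noise_mic[start:start + frame])
        # Clip at zero so over-subtraction cannot produce negative magnitudes.
        mag = np.maximum(np.abs(v) - np.abs(n), 0.0)
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(v)), frame)
    return out
```

A real implementation would add windowing, overlap-add, and a noise floor, which are omitted here for brevity.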
Also suppose, for example, that the first microphone 11 and the second microphone 13 both collect sound in which the user's voice and ambient noise are mixed. In this case, since the first microphone 11 and the second microphone 13 are placed at physically separate positions, a phase difference arises when the signal from a sound source reaches each microphone. The voice extraction unit 23 may therefore remove the noise mixed into the voice using, for example, a microphone array technique.
As a microphone array technique, the voice extraction unit 23 may use, for example, a beamforming method or a blind source separation method based on ICA (Independent Component Analysis). Since these techniques are already well known, detailed description of them is omitted.
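As one concrete instance of the beamforming mentioned above, a minimal delay-and-sum sketch is shown below. The inter-microphone delay of the target source is assumed to be known here; in practice it would be estimated, for example by cross-correlating the two channels:

```python
import numpy as np

def delay_and_sum(mic1: np.ndarray, mic2: np.ndarray,
                  delay_samples: int) -> np.ndarray:
    """Delay-and-sum beamforming for two microphones: advance mic2 by
    the target source's inter-microphone delay so the target adds
    coherently, while sound from other directions partially cancels."""
    aligned = np.roll(mic2, -delay_samples)  # undo the propagation delay
    return 0.5 * (mic1 + aligned)
```

Note that `np.roll` wraps around at the signal edges; a real implementation would pad or trim instead.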
When the voice extraction unit 23 uses a technique that does not assume which type of sound each microphone collects in this way, voice can be extracted regardless of how the user uses the voice input/output unit 10.
Here, the voice extraction process performed by the voice extraction unit 23 will be described in more detail.
In general, ambient sound (noise) continues to enter a microphone even while the user is not speaking. It is therefore desirable that voice recognition be performed only after the section of sound to be recognized has been cut out; in other words, it is desirable that voice recognition processing not be applied to sections other than the input voice. Performing voice recognition only after cutting out the relevant sound in this way suppresses the erroneous recognition of noise as speech.
Methods for detecting the voice section include, for example, simply determining the voice section from the loudness of the sound, and distinguishing voice from noise using features such as the frequency components of the sound. In this sense, detecting the voice section amounts to removing the noise sections.
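The first method mentioned, deciding the voice section simply from loudness, can be sketched as a frame-energy detector. The frame length and threshold below are illustrative assumptions:

```python
import numpy as np

def detect_voice_frames(signal: np.ndarray, frame: int = 160,
                        threshold: float = 0.01) -> list:
    """Return (start, end) sample ranges of frames whose mean energy
    exceeds the threshold -- the 'loudness' criterion in the text."""
    voiced = []
    for start in range(0, len(signal) - frame + 1, frame):
        chunk = signal[start:start + frame]
        if np.mean(chunk ** 2) > threshold:
            voiced.append((start, start + frame))
    return voiced
```

Frequency-based methods would replace the energy test with a comparison of spectral features against a noise model.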
Meanwhile, ambient noise is also superimposed on the section cut out as voice (the voice section), so processing is performed to remove the noise component superimposed on the voice in that section. As described above, this noise removal uses either the technique of removing noise using voice data collected by a voice-collecting microphone and a noise-collecting microphone, or a microphone array technique.
When extracting voice, the voice extraction unit 23 may detect the voice section first and then perform noise removal on the detected section, or it may detect the voice section in the signal after noise removal has been performed. The voice extraction unit 23 may also combine these processes. In this way, the voice extraction performed by the voice extraction unit 23 includes both the process of detecting the voice section and the process of removing noise.
The voice recognition unit 24 performs voice recognition based on the voice data extracted by the voice extraction unit 23. That is, the voice recognition unit 24 performs voice recognition on voice from which the noise in the sound collected by the microphones has been removed. The voice recognition unit 24 may use any generally known voice recognition method.
Next, the operation of the voice recognition system of this embodiment will be described. FIG. 2 is a flowchart showing an operation example of the voice recognition system of this embodiment.
The first input voice transmission unit 12 wirelessly transmits voice data representing the voice input to the first microphone 11 to the voice recognition response unit 20. Similarly, the second input voice transmission unit 14 wirelessly transmits voice data representing the voice input to the second microphone 13 to the voice recognition response unit 20 (step S1).
The voice extraction unit 23 of the voice recognition response unit 20 extracts voice data from which ambient noise has been removed, based on the first voice data that the first input voice reception unit 21 received from the first input voice transmission unit 12 and the second voice data that the second input voice reception unit 22 received from the second input voice transmission unit 14 (step S2). The voice recognition unit 24 then performs voice recognition based on the voice data extracted by the voice extraction unit 23 (step S3).
As described above, according to this embodiment, the first microphone 11 and the second microphone 13 of the voice input/output unit 10 collect the user's voice and the noise present while the user is speaking, and the first input voice transmission unit 12 and the second input voice transmission unit 14 wirelessly transmit voice data containing the voice and noise input to each microphone to the voice recognition response unit 20. The voice extraction unit 23 of the voice recognition response unit 20 extracts voice data from which the noise has been removed from the received voice data, and the voice recognition unit 24 performs voice recognition on the voice data extracted by the voice extraction unit 23.
With such a configuration, the accuracy of voice recognition can be increased while the device used by the person inputting voice is kept small. That is, in this embodiment, the voice input/output unit 10 need only include the two microphones and the function of transmitting the voice data they collect, so the device implementing the voice input/output unit 10 can be made small. The work efficiency of the user using the voice input/output unit 10 can therefore be improved.
Furthermore, in this embodiment, the voice recognition response unit 20 removes noise from the received voice data and performs voice recognition based on the noise-removed voice data. The voice recognition response unit 20 need only be located somewhere within wireless range of the voice input/output unit 10; it does not have to be carried together with the user (that is, with the voice input/output unit 10) at all times. Accordingly, there is less need to miniaturize the voice recognition response unit 20 than the voice input/output unit 10, so the voice recognition response unit 20 can implement many functions for increasing the accuracy of voice recognition. The accuracy of voice recognition can therefore be increased.
Embodiment 2.
FIG. 3 is a block diagram showing a configuration example of the second embodiment of the voice recognition system according to the present invention. Components that are the same as in the first embodiment are given the same reference numerals as in FIG. 1, and their description is omitted. The voice recognition system according to this embodiment includes a voice input/output unit 30 and a voice recognition response unit 40.
The voice input/output unit 30 includes the first microphone 11, the second microphone 13, an input data integration unit 31, an input data transmission unit 32, and the control unit 15. The voice input/output unit 30 may also include an output unit (not shown) that outputs the voice input from the first microphone 11 and the second microphone 13. In this embodiment, a case where the voice input/output unit 30 does not include an output unit will be described as an example. The first microphone 11, the second microphone 13, and the control unit 15 are the same as in the first embodiment.
The input data integration unit 31 integrates the voice data representing the voice input to the first microphone 11 and the voice data representing the voice input to the second microphone 13. When the input data integration unit 31 receives the input from each microphone as analog data, it converts the analog data into digital data and integrates the converted digital data.
FIG. 4 is an explanatory diagram showing an example of a method for integrating two channels of voice into one channel. The example shown in FIG. 4 illustrates a method in which the channel-1 voice data input to the first microphone 11 and the channel-2 voice data input to the second microphone 13 are divided into fixed-length intervals and interleaved alternately. The integrated data is transmitted by the input data transmission unit 32, which is described later.
The method of integrating the voice data is not limited to the method illustrated in FIG. 4; any other integration method may be used as long as it allows multiple streams of voice data to be transmitted over one channel.
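The fixed-interval alternation of FIG. 4 can be sketched as follows. Treating the channels as byte streams and the block size of four bytes are illustrative assumptions:

```python
def interleave(ch1: bytes, ch2: bytes, block: int = 4) -> bytes:
    """Integrate two channels into one stream by alternating
    fixed-length blocks, as in the scheme of FIG. 4."""
    out = bytearray()
    for i in range(0, max(len(ch1), len(ch2)), block):
        out += ch1[i:i + block]  # next block of channel 1
        out += ch2[i:i + block]  # next block of channel 2
    return bytes(out)
```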
As in the first embodiment, the voice input/output unit 30 may include three or more microphones. In that case, the voice data representing the voice input to each microphone is integrated in the same way as described above.
The input data transmission unit 32 wirelessly transmits the voice data integrated by the input data integration unit 31 to the voice recognition response unit 40. The input data transmission unit 32 may also transmit, to the voice recognition response unit 40, status information indicating the method by which the input data integration unit 31 integrated the voice data.
The voice recognition response unit 40 includes an input data reception unit 41, an input data division unit 42, the voice extraction unit 23, and the voice recognition unit 24. As in the first embodiment, the voice recognition response unit 40 may also include a control unit (not shown) that synthesizes voice from the result of voice recognition by the voice recognition unit 24 and reproduces the synthesized voice. The voice extraction unit 23 and the voice recognition unit 24 are the same as in the first embodiment.
The input data reception unit 41 receives the voice data wirelessly transmitted by the input data transmission unit 32.
The input data division unit 42 divides the voice data into which the input data integration unit 31 integrated two or more streams of voice data back into the original streams. Specifically, the input data division unit 42 divides the received voice data according to the method by which the input data integration unit 31 integrated it. For example, when the input data integration unit 31 integrated two or more streams of voice data by interleaving them in fixed-length intervals as illustrated in FIG. 4, the input data division unit 42 divides the received voice data into those fixed-length intervals and reassembles the divided pieces into the original two or more streams of voice data.
The division method may be agreed in advance between the voice input/output unit 30 and the voice recognition response unit 40. Alternatively, the input data division unit 42 may identify the division method from the status information, transmitted by the input data transmission unit 32, that indicates the integration method.
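The division performed by the input data division unit 42, the inverse of the fixed-interval alternation of FIG. 4, can be sketched as follows. The block size is assumed to be agreed in advance between the two sides, as the text describes:

```python
def deinterleave(stream: bytes, block: int = 4) -> tuple:
    """Split a stream built by alternating fixed-length blocks of two
    channels back into the original channel-1 and channel-2 data."""
    ch1, ch2 = bytearray(), bytearray()
    for i in range(0, len(stream), 2 * block):
        ch1 += stream[i:i + block]              # channel-1 block
        ch2 += stream[i + block:i + 2 * block]  # channel-2 block
    return bytes(ch1), bytes(ch2)
```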
Next, the operation of the voice recognition system of this embodiment will be described. FIG. 5 is a flowchart showing an operation example of the voice recognition system of this embodiment.
The input data integration unit 31 integrates the voice data representing the voice input to the first microphone 11 and the voice data representing the voice input to the second microphone 13 (step S11). The input data transmission unit 32 wirelessly transmits the voice data integrated by the input data integration unit 31 to the voice recognition response unit 40 (step S12).
The input data division unit 42 divides the voice data received by the input data reception unit 41 (step S13). Specifically, the input data division unit 42 divides the voice data into which the input data integration unit 31 integrated two or more streams back into the original streams.
The voice extraction unit 23 extracts voice data from which ambient noise has been removed, based on the two or more streams of voice data produced by the input data division unit 42 (step S14). The voice recognition unit 24 then performs voice recognition based on the voice data extracted by the voice extraction unit 23 (step S15).
As described above, according to this embodiment, the input data integration unit 31 integrates the voice data, containing voice and noise, input to each microphone, and the input data transmission unit 32 wirelessly transmits the voice data integrated by the input data integration unit 31 to the voice recognition response unit 40. The input data division unit 42 then divides the received voice data back into the original streams, and the voice extraction unit 23 extracts noise-removed voice data from the streams produced by the input data division unit 42.
With such a configuration, voice data input to the microphones at the same time can be transmitted together. The receiving side (the voice recognition response unit 40) therefore does not need to perform processing that accounts for differing reception timings, which simplifies the processing on the receiving side.
Embodiment 3.
FIG. 6 is a block diagram showing a configuration example of the third embodiment of the voice recognition system according to the present invention. Components that are the same as in the first embodiment are given the same reference numerals as in FIG. 1, and their description is omitted. The voice recognition system according to this embodiment includes a voice input/output unit 50 and a voice recognition response unit 60.
The voice input/output unit 50 includes the first microphone 11, the first input voice transmission unit 12, the second microphone 13, the second input voice transmission unit 14, the control unit 15, a first input voice compression unit 51, and a second input voice compression unit 52. That is, the voice input/output unit 50 of this embodiment differs from the voice input/output unit 10 of the first embodiment in that it further includes the first input voice compression unit 51 and the second input voice compression unit 52.
The first input voice compression unit 51 generates voice data by compressing the voice input to the first microphone 11. In this case, the first input voice transmission unit 12 wirelessly transmits the compressed voice data to the voice recognition response unit 60.
Similarly, the second input voice compression unit 52 generates voice data by compressing the voice input to the second microphone 13. In this case, the second input voice transmission unit 14 wirelessly transmits the compressed voice data to the voice recognition response unit 60.
In this embodiment, a case where the input voice compression unit corresponding to each microphone compresses the voice collected by that microphone will be described as an example. However, the voice input/output unit 50 may instead include a single input voice compression unit that combines the functions of the first input voice compression unit 51 and the second input voice compression unit 52. In that case, this single input voice compression unit may determine which microphone the collected voice was input to and compress the voice accordingly.
The first input voice compression unit 51 and the second input voice compression unit 52 generate compressed voice data using a generally known method. For example, they may use μ-law, standardized as ITU-T Recommendation G.711, or ADPCM (Adaptive Differential Pulse Code Modulation), standardized as ITU-T Recommendation G.726. However, the compression method used by the first input voice compression unit 51 and the second input voice compression unit 52 is not limited to μ-law and ADPCM.
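The μ-law companding mentioned above maps each sample through a logarithmic curve so that it fits in 8 bits. For illustration, the sketch below implements the textbook continuous μ-law compander (μ = 255), not the segmented 8-bit table that G.711 itself standardizes; a conforming G.711 codec would use the standardized segment encoding:

```python
import math

MU = 255  # μ-law parameter used by G.711 (Japan / North America)

def mulaw_encode(sample: float) -> int:
    """Compress one sample in [-1.0, 1.0] to an 8-bit code (0..255)."""
    magnitude = math.log1p(MU * abs(sample)) / math.log1p(MU)
    signed = math.copysign(magnitude, sample)      # back to [-1, 1]
    return int(round((signed + 1.0) / 2.0 * 255))  # map to 0..255

def mulaw_decode(code: int) -> float:
    """Expand an 8-bit code back to a sample in [-1.0, 1.0]."""
    signed = code / 255.0 * 2.0 - 1.0
    magnitude = (math.pow(1 + MU, abs(signed)) - 1) / MU
    return math.copysign(magnitude, signed)
```

The logarithmic curve gives small samples finer quantization steps than large ones, which is why 8 bits suffice for telephone-quality speech.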
The voice recognition response unit 60 includes the first input voice reception unit 21, the second input voice reception unit 22, the voice extraction unit 23, the voice recognition unit 24, a first input voice decompression unit 61, and a second input voice decompression unit 62. That is, the voice recognition response unit 60 of this embodiment differs from the voice recognition response unit 20 of the first embodiment in that it further includes the first input voice decompression unit 61 and the second input voice decompression unit 62.
As in the first embodiment, the voice recognition response unit 60 may include a control unit (not shown) that synthesizes voice from the result of voice recognition by the voice recognition unit 24 and reproduces the synthesized voice.
The first input voice decompression unit 61 decompresses the compressed voice data received by the first input voice reception unit 21 into the original voice data. Likewise, the second input voice decompression unit 62 decompresses the compressed voice data received by the second input voice reception unit 22 into the original voice data. Specifically, the first input voice decompression unit 61 and the second input voice decompression unit 62 decompress the received voice data according to the method that the first input voice compression unit 51 and the second input voice compression unit 52 used for compression.
Next, the operation of the voice recognition system of this embodiment will be described. FIG. 7 is a flowchart showing an operation example of the voice recognition system of this embodiment.
The first input voice compression unit 51 generates voice data by compressing the voice input to the first microphone 11, and similarly, the second input voice compression unit 52 generates voice data by compressing the voice input to the second microphone 13 (step S21). The first input voice transmission unit 12 and the second input voice transmission unit 14 then wirelessly transmit the compressed voice data to the voice recognition response unit 60 (step S22).
The first input voice decompression unit 61 decompresses the compressed voice data received by the first input voice reception unit 21. Similarly, the second input voice decompression unit 62 decompresses the compressed voice data received by the second input voice reception unit 22 (step S23). The subsequent processing, in which the voice extraction unit 23 extracts noise-removed voice data and the voice recognition unit 24 performs voice recognition, is the same as steps S2 to S3 in FIG. 2.
As described above, according to this embodiment, the first input voice compression unit 51 and the second input voice compression unit 52 generate voice data by compressing the voice and noise input to each microphone, and the first input voice transmission unit 12 and the second input voice transmission unit 14 wirelessly transmit the compressed voice data to the voice recognition response unit 60. The first input voice decompression unit 61 and the second input voice decompression unit 62 decompress the compressed voice data into the original voice data, and the voice extraction unit 23 extracts noise-removed voice data from the decompressed voice data.
With such a configuration, the amount of data transmitted from the voice input/output unit 50 to the voice recognition response unit 60 can be reduced.
As in the second embodiment, the voice input/output unit 50 of this embodiment may further include the input data integration unit 31, which integrates the voice data generated by each input voice compression unit, and the voice recognition response unit 60 of this embodiment may further include the input data division unit 42, which divides the voice data integrated by the input data integration unit 31.
With such a configuration, the amount of data transmitted from the voice input/output unit 50 to the voice recognition response unit 60 can be reduced, and at the same time the processing on the receiving side can be simplified.
Embodiment 4.
 FIG. 8 is a block diagram showing a configuration example of the fourth embodiment of the speech recognition system according to the present invention. Components similar to those of the first embodiment are given the same reference numerals as in FIG. 1, and their description is omitted. The speech recognition system according to this embodiment includes a voice input/output unit 70 and a voice recognition response unit 80.
 The voice recognition response unit 80 includes a first input voice reception unit 21, a second input voice reception unit 22, a voice extraction unit 23, a voice recognition unit 24, a response generation unit 81, and a response transmission unit 82. That is, the voice recognition response unit 80 of this embodiment differs from the voice recognition response unit 20 of the first embodiment in that it further includes the response generation unit 81 and the response transmission unit 82.
 The response generation unit 81 synthesizes speech from the result of speech recognition by the voice recognition unit 24 and generates voice data representing the synthesized speech. Since the voice data generated in this way is produced in response to the voice data received from the voice input/output unit 70, it may also be referred to as response voice data. As the method by which the response generation unit 81 synthesizes speech from the recognition result, a method generally known as speech synthesis, a method using prerecorded speech, or a combination of the two may be used.
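 As a rough illustration of the choice described above (a prerecorded clip where one matches the recognition result, general speech synthesis otherwise), the following sketch uses hypothetical recognition strings and file names; none of them appear in this document, and the synthesis step is a stand-in for a real engine.

```python
# Hypothetical mapping from recognition results to prerecorded clips; the
# strings and file names below do not come from the patent.
PRERECORDED = {
    "play music": "responses/play_music.wav",
    "stop": "responses/stop.wav",
}

def synthesize(text: str) -> str:
    """Stand-in for a real speech-synthesis engine; returns a tag naming
    what would be synthesized."""
    return "tts:" + text

def generate_response(recognized_text: str) -> str:
    """Pick the response audio source: a prerecorded clip when one matches
    the recognition result, otherwise fall back to general synthesis."""
    key = recognized_text.strip().lower()
    if key in PRERECORDED:
        return PRERECORDED[key]
    return synthesize(key)

print(generate_response("Stop"))        # a prerecorded clip matches
print(generate_response("volume up"))   # no clip, so synthesis is used
```

 Combining the two sources in this way keeps frequently used replies cheap and natural-sounding while still covering arbitrary recognition results.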
 The response transmission unit 82 wirelessly transmits the generated response voice data to the voice input/output unit 70.
 The voice input/output unit 70 includes the first microphone 11, the first input voice transmission unit 12, the second microphone 13, the second input voice transmission unit 14, the control unit 15, a response reception unit 71, and a speaker 72. That is, the voice input/output unit 70 of this embodiment differs from the voice input/output unit 10 of the first embodiment in that it further includes the response reception unit 71 and the speaker 72.
 The response reception unit 71 receives the response voice data wirelessly transmitted by the response transmission unit 82 of the voice recognition response unit 80. The speaker 72 outputs the voice represented by the response voice data received by the response reception unit 71.
 Next, the operation of the speech recognition system of this embodiment will be described. FIG. 9 is a flowchart showing an operation example of the speech recognition system of this embodiment. The processing from when voice is input to the voice input/output unit 70 until it is transmitted to the voice recognition response unit 80 and recognized is the same as steps S1 to S3 in FIG. 2.
 The response generation unit 81 generates response voice data from the result of speech recognition by the voice recognition unit 24 (step S31). The response transmission unit 82 transmits the response voice data to the voice input/output unit 70 (step S32). The response reception unit 71 of the voice input/output unit 70 causes the speaker 72 to output the voice represented by the received response voice data (step S33).
 As described above, in this embodiment, in addition to the configuration of the first embodiment, the response generation unit 81 generates response voice data from the speech recognition result, and the response transmission unit 82 transmits the response voice data to the voice input/output unit 70. The response reception unit 71 then causes the speaker 72 to output the voice represented by the received response voice data. Therefore, the result of speech recognition by the voice recognition response unit 80 can be confirmed on the voice input/output unit 70 side.
Embodiment 5.
 FIG. 10 is a block diagram showing a configuration example of the fifth embodiment of the speech recognition system according to the present invention. Components similar to those of the first to fourth embodiments are given the same reference numerals as in FIG. 1, FIG. 3, FIG. 6, or FIG. 8, and their description is omitted. The speech recognition system according to this embodiment includes a voice input/output unit 90 and a voice recognition response unit 100.
 The voice recognition response unit 100 includes the first input voice reception unit 21, the second input voice reception unit 22, the voice extraction unit 23, the voice recognition unit 24, the first input voice decompression unit 61, the second input voice decompression unit 62, the response generation unit 81, the response transmission unit 82, and a response compression unit 101. That is, the voice recognition response unit 100 of this embodiment differs from the voice recognition response units described in the first to fourth embodiments in that it newly includes the response compression unit 101.
 The response compression unit 101 compresses the response voice data. The method for compressing the response voice data is the same as the method by which the input voice compression units (the first input voice compression unit 51 and the second input voice compression unit 52) of the third embodiment compress voice data. The method by which the response compression unit 101 compresses the response voice data and the method by which the input voice compression units compress voice data may be the same or different. In this case, the response transmission unit 82 transmits the compressed response voice data to the voice input/output unit 90.
 The voice input/output unit 90 includes the first microphone 11, the first input voice transmission unit 12, the second microphone 13, the second input voice transmission unit 14, the control unit 15, the first input voice compression unit 51, the second input voice compression unit 52, the response reception unit 71, the speaker 72, and a response decompression unit 91. That is, the voice input/output unit 90 of this embodiment differs from the voice input/output units described in the first to fourth embodiments in that it newly includes the response decompression unit 91.
 The response decompression unit 91 decompresses the compressed response voice data received by the response reception unit 71 back to the original response voice data. The method for decompressing the response voice data is the same as the method by which the input voice decompression units (the first input voice decompression unit 61 and the second input voice decompression unit 62) of the third embodiment decompress voice data. The method by which the response decompression unit 91 decompresses the response voice data and the method by which the input voice decompression units decompress voice data may be the same or different.
 Next, the operation of the speech recognition system of this embodiment will be described. FIG. 11 is a flowchart showing an operation example of the speech recognition system of this embodiment. The processing from when voice input to the voice input/output unit 90 is compressed and transmitted to the voice recognition response unit 100 until the transmitted voice data is decompressed and recognized is the same as the processing illustrated in the flowchart of FIG. 7. The processing in which the response generation unit 81 generates response voice data from the recognition result is the same as the processing of step S31 illustrated in FIG. 9.
 The response compression unit 101 compresses the response voice data (step S41). The response transmission unit 82 transmits the compressed response voice data to the voice input/output unit 90 (step S42). The response decompression unit 91 of the voice input/output unit 90 decompresses the compressed response voice data received by the response reception unit 71 (step S43). The speaker 72 then outputs the voice represented by the decompressed response voice data (step S44).
 As described above, in this embodiment, the response compression unit 101 of the voice recognition response unit 100 compresses the response voice data, and the response transmission unit 82 transmits the compressed response voice data to the voice input/output unit 90. The response decompression unit 91 of the voice input/output unit 90 then decompresses the compressed response voice data, and the speaker 72 outputs the voice represented by the decompressed response voice data.
 With such a configuration, the amount of data transmitted from the voice recognition response unit 100 to the voice input/output unit 90 can be reduced.
 Next, a modification of the first to fifth embodiments will be described. FIG. 12 is a block diagram showing a modification of the speech recognition system according to the present invention. Components similar to those of the first to fifth embodiments are given the same reference numerals as in FIG. 1, FIG. 3, FIG. 6, FIG. 8, or FIG. 10, and their description is omitted. The speech recognition system of this modification includes all of the components contained in the first to fifth embodiments. Specifically, the speech recognition system according to this modification includes a voice input/output unit 110 and a voice recognition response unit 120.
 The voice input/output unit 110 includes the first microphone 11 to an Nth microphone 15, the first input voice compression unit 51 to an Nth input voice compression unit 53, the input data integration unit 31, the input data transmission unit 32, the control unit 15, a first speaker 72 to an Nth speaker 75, a response data reception unit 77, a response data division unit 76, and a first response voice decompression unit 73 to an Nth response voice decompression unit 74.
 The response data reception unit 77 corresponds to the response reception unit 71 of the fourth embodiment. The first response voice decompression unit 73 to the Nth response voice decompression unit 74 correspond to the response decompression unit 91 of the fifth embodiment.
 The voice recognition response unit 120 includes the input data reception unit 41, the input data division unit 42, the voice extraction unit 23, the voice recognition unit 24, the first input voice decompression unit 61 to an Nth input voice decompression unit 63, the response generation unit 81, a first response voice compression unit 121 to an Nth response voice compression unit 122, a response data integration unit 123, and a response data transmission unit 124.
 The response data transmission unit 124 corresponds to the response transmission unit 82 of the fifth embodiment. The first response voice compression unit 121 to the Nth response voice compression unit 122 correspond to the response compression unit 101 of the fifth embodiment.
 The embodiments described above illustrate the case in which the voice input/output unit includes two microphones, and the fifth embodiment illustrates the case in which the voice input/output unit 90 includes one speaker. As illustrated in FIG. 12, however, the speech recognition system according to the present invention can include a plurality of microphones and a plurality of speakers.
 The present invention will now be described with reference to a specific example, but the scope of the present invention is not limited to the contents described below. FIG. 13 is an explanatory diagram showing an example of the speech recognition system of the present invention. The speech recognition system according to this example includes a headset 130 and a speech recognition device 140. The headset 130 of this example corresponds to the voice input/output unit of the above embodiments, and the speech recognition device 140 of this example corresponds to the voice recognition response unit of the above embodiments.
 The headset 130 includes a voice input microphone 131, a noise input microphone 132, and a speaker 133. As illustrated in FIG. 13, the voice input microphone 131 is placed near the user's mouth, and the noise input microphone 132 is placed near the user's ear. The speaker 133 is placed in the vicinity of the noise input microphone 132.
 The headset 130 generates voice data by compressing the voice input to the voice input microphone 131 and the noise input to the noise input microphone 132, respectively. As the compression method, μ-law or ADPCM is used. Alternatively, the headset 130 may generate voice data without compressing the voice and noise (uncompressed).
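 The μ-law companding mentioned above can be sketched as follows. This is the simplified continuous μ-law formula rather than the exact segmented ITU-T G.711 codec, and it is not necessarily the encoding the headset 130 actually uses; the frame size and sample rate are illustrative.

```python
import numpy as np

MU = 255.0  # companding parameter for 8-bit mu-law

def mulaw_compress(samples: np.ndarray) -> np.ndarray:
    """Compand float PCM samples in [-1, 1] down to 8-bit codes."""
    x = np.clip(samples, -1.0, 1.0)
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1.0) / 2.0 * 255.0).astype(np.uint8)

def mulaw_expand(codes: np.ndarray) -> np.ndarray:
    """Inverse companding: 8-bit codes back to float samples in [-1, 1]."""
    y = codes.astype(np.float64) / 255.0 * 2.0 - 1.0
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

# One 20 ms frame of a 440 Hz tone at 8 kHz: 160 samples.
pcm = np.sin(2 * np.pi * 440.0 * np.arange(160) / 8000.0)
codes = mulaw_compress(pcm)   # 1 byte per sample instead of 2 for 16-bit PCM
restored = mulaw_expand(codes)
print(codes.nbytes, float(np.max(np.abs(restored - pcm))))
```

 Relative to 16-bit linear PCM, this halves the number of bytes per channel while keeping the round-trip quantization error small, which is the point of compressing before wireless transmission.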
 The headset 130 integrates the generated two channels of voice data into one channel of voice data. At this time, the headset 130 generates data in which status information indicating the data format and the like is integrated with the voice data. The headset 130 then wirelessly transmits the data integrated into one channel to the speech recognition device 140.
 Bluetooth (registered trademark) is used for wireless communication between the headset 130 and the speech recognition device 140, and the Serial Port Profile is used as the communication protocol. In this way, the speech recognition system of the present invention can make use of common, already standardized schemes.
 The speech recognition device 140 divides the received data into the two channels of voice data and the status information. The speech recognition device 140 decompresses the two channels of voice data by a method corresponding to the compression method.
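 The integrate-then-split round trip can be sketched as follows. The one-byte status field and the [status][length][voice][noise] layout are illustrative assumptions; the document does not specify the actual frame format.

```python
import struct

STATUS_MULAW = 0x01  # hypothetical status code meaning "both channels mu-law"

def integrate_channels(status: int, voice: bytes, noise: bytes) -> bytes:
    """Pack status information plus two equal-length channel frames into one
    payload: 1 status byte, 2 length bytes, then the voice and noise frames."""
    if len(voice) != len(noise):
        raise ValueError("channel frames must be the same length")
    return struct.pack(">BH", status, len(voice)) + voice + noise

def split_channels(payload: bytes):
    """Inverse of integrate_channels: recover the status and both channels."""
    status, n = struct.unpack_from(">BH", payload)
    body = payload[3:]
    return status, body[:n], body[n:2 * n]

packed = integrate_channels(STATUS_MULAW, b"\x10\x11", b"\x20\x21")
status, voice, noise = split_channels(packed)
print(status, voice, noise)
```

 Because both channel frames travel in one payload, samples captured at the same instant by the two microphones arrive together, which is what lets the receiver skip any reception-timing alignment.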
 The speech recognition device 140 performs noise removal processing on the two channels of voice data using the method described in Patent Document 1 above. In doing so, the speech recognition device 140 also detects voice segments in the voice data. The speech recognition device 140 then performs speech recognition using the noise-removed voice data.
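 The actual method of Patent Document 1 is not reproduced here; as a stand-in, the following toy spectral-subtraction sketch shows the general idea of using the second (noise) channel to clean the voice channel.

```python
import numpy as np

def spectral_subtract(voice: np.ndarray, noise: np.ndarray,
                      floor: float = 0.02) -> np.ndarray:
    """Subtract the noise channel's magnitude spectrum from the voice
    channel's, keeping the voice channel's phase (with a small spectral
    floor so magnitudes never go negative)."""
    v = np.fft.rfft(voice)
    n = np.fft.rfft(noise)
    mag = np.maximum(np.abs(v) - np.abs(n), floor * np.abs(v))
    return np.fft.irfft(mag * np.exp(1j * np.angle(v)), n=len(voice))

rng = np.random.default_rng(0)
t = np.arange(512) / 8000.0
clean = np.sin(2 * np.pi * 440.0 * t)       # the user's "voice"
noise = 0.3 * rng.standard_normal(512)      # what the noise mic hears
mixed = clean + noise                       # what the voice mic hears
denoised = spectral_subtract(mixed, noise)
print(np.mean((mixed - clean) ** 2), np.mean((denoised - clean) ** 2))
```

 In this idealized setup the noise microphone observes the same noise that corrupts the voice microphone, so the subtraction removes most of the interference; real two-microphone arrangements differ and need the more careful processing the cited method provides.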
 The speech recognition device 140 generates response voice data according to the result of speech recognition, compresses the generated response voice data, and transmits it to the headset 130. The speech recognition device 140 may also make some response to the headset 130 regardless of the recognition result. For example, the speech recognition device 140 may transmit to the headset 130 control information notifying that voice data has been received.
 The headset 130 decompresses the received response voice data and outputs the voice represented by the response voice data from the speaker 133.
 Next, a minimum configuration example of the present invention will be described. FIG. 14 is a block diagram showing an example of the minimum configuration of the speech recognition system according to the present invention. The speech recognition system according to the present invention includes a voice input device 180 (for example, the voice input/output unit 10) that inputs a user's voice, and a speech recognition device 190 (for example, the voice recognition response unit 20) that performs speech recognition of the voice input to the voice input device 180.
 The voice input device 180 includes at least two input means 181 (for example, the first microphone 11 and the second microphone 13) that input the user's voice and the noise present while the user is uttering the voice, and wireless transmission means 182 (for example, the first input voice transmission unit 12 and the second input voice transmission unit 14) that wirelessly transmits voice data containing the voice and noise input to each input means 181 to the speech recognition device 190.
 The speech recognition device 190 includes voice extraction means 191 (for example, the voice extraction unit 23) that extracts noise-removed voice data from the received voice data, and voice recognition means 192 (for example, the voice recognition unit 24) that performs speech recognition on the voice data extracted by the voice extraction means 191.
 With such a configuration, the accuracy of speech recognition can be improved while miniaturizing the device used by the user who inputs the voice.
 The voice input device 180 (for example, the voice input/output unit 30) may include input data integration means (for example, the input data integration unit 31) that integrates the voice data containing the voice and noise input to each input means 181. The speech recognition device 190 (for example, the voice recognition response unit 40) may include input data division means (for example, the input data division unit 42) that divides the received voice data back into the original individual voice data.
 In this case, the wireless transmission means 182 of the voice input device 180 wirelessly transmits the voice data integrated by the input data integration means to the speech recognition device 190, and the voice extraction means 191 of the speech recognition device 190 may extract the noise-removed voice data from the individual voice data divided by the input data division means.
 With such a configuration, voice data input simultaneously to the microphones can be transmitted simultaneously. This eliminates the need for the receiving side (the speech recognition device 190) to perform processing that takes reception timing into account, so the processing on the receiving side can be simplified.
 The voice input device 180 (for example, the voice input/output unit 50) may include input data compression means (for example, the first input voice compression unit 51 and the second input voice compression unit 52) that generates voice data by compressing the voice and noise input to each input means 181. The speech recognition device 190 (for example, the voice recognition response unit 60) may include input data decompression means (for example, the first input voice decompression unit 61 and the second input voice decompression unit 62) that decompresses the compressed voice data back to the original voice data.
 In this case, the wireless transmission means 182 of the voice input device 180 wirelessly transmits the voice data compressed by the input data compression means to the speech recognition device 190, and the voice extraction means 191 of the speech recognition device 190 may extract the noise-removed voice data from the individual voice data decompressed to the original data by the input data decompression means.
 With such a configuration, the amount of data transmitted from the voice input device 180 to the speech recognition device 190 can be reduced.
 The speech recognition device 190 (for example, the voice recognition response unit 80) may include synthesized voice data generation means (for example, the response generation unit 81) that generates synthesized voice data from the result of speech recognition by the voice recognition means 192, and synthesized voice data transmission means (for example, the response transmission unit 82) that transmits the synthesized voice data generated by the synthesized voice data generation means to the voice input device 180. The voice input device 180 (for example, the voice input/output unit 70) may include synthesized voice data reception means (for example, the response reception unit 71) that receives the synthesized voice data from the speech recognition device 190, and output means (for example, the speaker 72) that outputs the voice represented by the synthesized voice data received by the synthesized voice data reception means.
 With such a configuration, the result of speech recognition by the speech recognition device 190 can be confirmed on the voice input device 180 side.
 The speech recognition device 190 (for example, the voice recognition response unit 100) may include synthesized voice data compression means (for example, the response compression unit 101) that compresses the synthesized voice data. The voice input device 180 (for example, the voice input/output unit 90) may include synthesized voice data decompression means (for example, the response decompression unit 91) that decompresses the compressed synthesized voice data.
 In this case, the synthesized voice data transmission means of the speech recognition device 190 transmits the synthesized voice data compressed by the synthesized voice data compression means to the voice input device, and the output means of the voice input device 180 may output the voice represented by the synthesized voice data decompressed by the synthesized voice data decompression means.
 With such a configuration, the amount of data transmitted from the speech recognition device 190 to the voice input device 180 can be reduced.
 The voice input device 180 may also include a control unit (for example, the control unit 15) that controls the state of the voice input device 180 based on control commands received from another device (for example, the voice recognition response unit) and transmits status information indicating the state of the voice input device 180 to that device. With such a configuration, the voice input device 180 can be made even smaller.
 The present invention has been described above with reference to embodiments and examples, but the present invention is not limited to the above embodiments and examples. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
 This application claims priority based on Japanese Patent Application No. 2011-245616 filed on November 9, 2011, the entire disclosure of which is incorporated herein.
 The present invention is suitably applied to a speech recognition system that performs speech recognition of voice transmitted using wireless communication.
 10, 30, 50, 70, 90 Voice input/output unit
 11 First microphone
 12 First input voice transmission unit
 13 Second microphone
 14 Second input voice transmission unit
 15 Control unit
 20, 40, 60, 80, 100 Voice recognition response unit
 21 First input voice reception unit
 22 Second input voice reception unit
 23 Voice extraction unit
 24 Voice recognition unit
 31 Input data integration unit
 32 Input data transmission unit
 41 Input data reception unit
 42 Input data division unit
 51 First input voice compression unit
 52 Second input voice compression unit
 61 First input voice decompression unit
 62 Second input voice decompression unit
 71 Response reception unit
 72 Speaker
 81 Response generation unit
 82 Response transmission unit
 91 Response decompression unit
 101 Response compression unit

Claims (7)

  1.  利用者の音声を入力する音声入力装置と、
     前記音声入力装置に入力された音声の音声認識を行う音声認識装置とを備え、
     前記音声入力装置は、
     利用者の音声と、当該利用者が音声を発声しているときの雑音とを入力する少なくとも2以上の入力手段と、
     各入力手段に入力された音声および雑音を含む音声データを前記音声認識装置に無線送信する無線送信手段とを含み、
     前記音声認識装置は、
     受信した音声データから、前記雑音を除去した音声データを抽出する音声抽出手段と、
     前記音声抽出手段が抽出した音声データの音声認識を行う音声認識手段とを含む
     ことを特徴とする音声認識システム。
    A voice input device for inputting a user's voice;
    A voice recognition device that performs voice recognition of the voice input to the voice input device;
    The voice input device includes:
    At least two or more input means for inputting user's voice and noise when the user is uttering voice;
    Wireless transmission means for wirelessly transmitting voice data including noise and noise input to each input means to the voice recognition device;
    The voice recognition device
    Voice extraction means for extracting the voice data from which the noise has been removed from the received voice data;
    And a voice recognition means for performing voice recognition of the voice data extracted by the voice extraction means.
  2.  The speech recognition system according to claim 1, wherein
     the voice input device includes input data integration means for integrating the voice data, containing voice and noise, input to each input means,
     the speech recognition device includes input data division means for dividing the received voice data back into the original individual voice data,
     the wireless transmission means of the voice input device wirelessly transmits the voice data integrated by the input data integration means to the speech recognition device, and
     the voice extraction means of the speech recognition device extracts noise-removed voice data from each voice data divided by the input data division means.
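The integration and division of claim 2 can be modeled as sample-interleaving the two PCM channels into one stream before transmission and de-interleaving them on reception. A minimal sketch under that assumption (function names and the 16-bit sample width are illustrative):

```python
def integrate(ch1: bytes, ch2: bytes, width: int = 2) -> bytes:
    """Interleave two equal-length PCM channels sample by sample,
    so both can travel over a single wireless link."""
    assert len(ch1) == len(ch2)
    out = bytearray()
    for i in range(0, len(ch1), width):
        out += ch1[i:i + width] + ch2[i:i + width]
    return bytes(out)

def divide(data: bytes, width: int = 2):
    """Split an interleaved stream back into the original two channels."""
    ch1, ch2 = bytearray(), bytearray()
    for i in range(0, len(data), 2 * width):
        ch1 += data[i:i + width]
        ch2 += data[i + width:i + 2 * width]
    return bytes(ch1), bytes(ch2)
```

Interleaving keeps the two channels sample-aligned after division, which matters for the noise subtraction on the recognizer side.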
  3.  The speech recognition system according to claim 1 or 2, wherein
     the voice input device includes input data compression means for generating compressed voice data from the voice and noise input to each input means,
     the speech recognition device includes input data decompression means for decompressing the compressed voice data into the original voice data,
     the wireless transmission means of the voice input device wirelessly transmits the voice data compressed by the input data compression means to the speech recognition device, and
     the voice extraction means of the speech recognition device extracts noise-removed voice data from each voice data restored by the input data decompression means.
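Claim 3 does not name a codec. As a stand-in, the compress-before-transmit, decompress-after-receive shape can be shown with a lossless round trip through `zlib`; a real device would more likely use a dedicated speech codec to cut radio bandwidth (function names are illustrative):

```python
import zlib

def compress_channel(pcm: bytes) -> bytes:
    # Lossless stand-in for the claimed input data compression means.
    # An actual headset would typically use a speech codec instead.
    return zlib.compress(pcm)

def decompress_channel(data: bytes) -> bytes:
    # Counterpart on the speech recognition device: restore the
    # original samples before voice extraction runs on them.
    return zlib.decompress(data)
```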
  4.  The speech recognition system according to any one of claims 1 to 3, wherein
     the speech recognition device includes:
     synthesized voice data generation means for generating synthesized voice data from the result of speech recognition by the speech recognition means; and
     synthesized voice data transmission means for transmitting the synthesized voice data generated by the synthesized voice data generation means to the voice input device, and
     the voice input device includes:
     synthesized voice data reception means for receiving the synthesized voice data from the speech recognition device; and
     output means for outputting the voice represented by the synthesized voice data received by the synthesized voice data reception means.
  5.  The speech recognition system according to claim 4, wherein
     the speech recognition device includes synthesized voice data compression means for compressing the synthesized voice data,
     the voice input device includes synthesized voice data decompression means for decompressing the compressed synthesized voice data,
     the synthesized voice data transmission means of the speech recognition device transmits the synthesized voice data compressed by the synthesized voice data compression means to the voice input device, and
     the output means of the voice input device outputs the voice represented by the synthesized voice data decompressed by the synthesized voice data decompression means.
  6.  The speech recognition system according to any one of claims 1 to 5, wherein the voice input device includes a control unit that controls the state of the voice input device itself based on a control command received from another device, and transmits status information indicating that state to the other device.
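The control unit of claim 6 can be sketched as a small command handler that updates the device's state and reports it back as status information. The command and status strings below are hypothetical; the patent does not define a command set:

```python
class VoiceInputControl:
    """Toy model of the claimed control unit: react to commands from
    another device and report the device's own state as status info."""

    def __init__(self):
        self.state = "idle"

    def handle_command(self, command: str) -> str:
        # Apply the received control command to the device state.
        if command == "start_capture":
            self.state = "capturing"
        elif command == "stop_capture":
            self.state = "idle"
        # Send status information back to the commanding device.
        return self.status()

    def status(self) -> str:
        return f"status:{self.state}"
```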
  7.  A speech recognition method comprising:
     inputting, by a voice input device that receives a user's voice, the user's voice and the noise present while the user is speaking, using two or more input means;
     wirelessly transmitting, by the voice input device, voice data containing the voice and noise input to each input means to a speech recognition device;
     extracting, by the speech recognition device, voice data from which the noise has been removed from the received voice data; and
     performing, by the speech recognition device, speech recognition on the extracted voice data.
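The four steps of the claim-7 method can be wired together in one function. Transmission is modeled as a direct call, extraction as the simplest possible operation (subtracting the noise-reference channel), and recognition as a pluggable callback; all names are illustrative:

```python
import numpy as np

def run_method(voice_plus_noise, noise_ref, recognize):
    """End-to-end sketch of the claimed method:
    1. two inputs capture voice+noise and a noise reference,
    2. both channels reach the recognizer side (modeled as this call),
    3. the recognizer side removes the noise estimate,
    4. speech recognition runs on the extracted data."""
    extracted = np.asarray(voice_plus_noise, float) - np.asarray(noise_ref, float)
    return recognize(extracted)
```

In practice step 3 would be a proper multi-channel enhancement algorithm and `recognize` a real ASR engine; the function only shows the claimed data flow.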
PCT/JP2012/005874 2011-11-09 2012-09-14 Speech recognition system and speech recognition method WO2013069187A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2011-245616 2011-11-09
JP2011245616 2011-11-09

Publications (1)

Publication Number Publication Date
WO2013069187A1 2013-05-16

Family

ID=48288978

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2012/005874 WO2013069187A1 (en) 2011-11-09 2012-09-14 Speech recognition system and speech recognition method

Country Status (1)

Country Link
WO (1) WO2013069187A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01115798U (en) * 1988-01-30 1989-08-03
JP2007140419A (en) * 2005-11-18 2007-06-07 Humanoid:Kk Interactive information transmission device with situation-adaptive intelligence


Similar Documents

Publication Publication Date Title
EP2211339A1 (en) Audio processing in a portable listening device
EP2312578A1 (en) Signal analyzing device, signal control device, and method and program therefor
JP2007318528A (en) Directional sound collector, directional sound collecting method, and computer program
JP2010530154A5 (en)
CN105976829B (en) Audio processing device and audio processing method
WO2017061023A1 (en) Audio signal processing method and device
JPH09116998A (en) Hearing aid
EP3737115A1 (en) A hearing apparatus with bone conduction sensor
EP2482566A1 (en) Method for generating an audio signal
JP2007034238A (en) On-site operation support system
WO2020017518A1 (en) Audio signal processing device
KR20170098761A (en) Apparatus and method for extending bandwidth of earset with in-ear microphone
US20200228849A1 (en) Information processing apparatus, information processing system, and program
WO2013069187A1 (en) Speech recognition system and speech recognition method
JP7284570B2 (en) Sound reproduction system and program
WO2014138758A3 (en) Method for increasing the comprehensibility of speech
JP2019516304A (en) Earset timbre compensator and method
KR101386883B1 (en) Mobile terminal and method for executing communication mode thereof
US20100255878A1 (en) Audio filter
JP6267860B2 (en) Audio signal transmitting apparatus, audio signal receiving apparatus and method thereof
JP4973376B2 (en) Apparatus for detecting basic period of speech and apparatus for converting speech speed using the basic period
WO2022137806A1 (en) Ear-mounted type device and reproduction method
WO2017116022A1 (en) Apparatus and method for extending bandwidth of earset having in-ear microphone
EP2434483A1 (en) Encoding device, decoding device, and methods therefor
KR101970589B1 (en) Speech signal transmitting apparatus, speech signal receiving apparatus and method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12848213

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12848213

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP