WO2022059245A1 - Voice processing system, voice processing device, and voice processing method - Google Patents

Info

Publication number
WO2022059245A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
microphone
speaker
specified
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2021/016088
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
智史 山梨
南生也 持木
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Intellectual Property Management Co Ltd
Original Assignee
Panasonic Intellectual Property Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Panasonic Intellectual Property Management Co Ltd
Priority to CN202180052503.1A (CN115917642B)
Priority to DE112021004878.3T (DE112021004878T5)
Priority to JP2022550338A (JP7702212B2)
Publication of WO2022059245A1
Priority to US18/160,857 (US12412594B2)
Anticipated expiration
Ceased (current legal status)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Definitions

  • This disclosure relates to a voice processing system, a voice processing device, and a voice processing method.
  • A voice processing system that processes voice recognition commands based on the voice spoken by a speaker is known.
  • Patent Document 1 discloses a voice processing system that processes a voice recognition command based on the position spoken by the speaker.
  • Patent Document 1 does not disclose control when the position of the speaker cannot be specified. If the speaker's position cannot be identified, the speech processing system may perform unintended processing.
  • The purpose of this disclosure is to enable a voice processing system to execute appropriate processing even when the position of the speaker cannot be specified.
  • The voice processing system includes an input unit, a determination unit, and a voice recognition unit.
  • The input unit receives a first voice, which is the voice spoken by a first speaker.
  • The determination unit determines whether or not the position of the first speaker can be specified.
  • The voice recognition unit outputs, to a target device, a voice command, which is a signal that is specified by voice and controls the target device. When the determination unit determines that the position of the first speaker cannot be specified, the voice recognition unit restricts the output of the utterance position command, which is the voice command related to the position of the speaker.
  • FIG. 1 is a diagram showing an example of a schematic configuration of an in-vehicle voice processing system according to the first embodiment.
  • FIG. 2 is a diagram showing an example of the hardware configuration of the voice processing system according to the first embodiment.
  • FIG. 3 is a block diagram showing an example of the configuration of the voice processing system according to the first embodiment.
  • FIG. 4 is a flowchart showing an example of the operation of the voice processing system according to the first embodiment.
  • FIG. 5 is a block diagram showing an example of the configuration of the voice processing system according to the second embodiment.
  • FIG. 6 is a flowchart showing an example of the operation of the voice processing system according to the second embodiment.
  • FIG. 1 is a diagram showing an example of a schematic configuration of the voice system 5 according to the first embodiment.
  • The voice system 5 is mounted on, for example, the vehicle 10. In the following, an example in which the voice system 5 is mounted on the vehicle 10 is described.
  • A plurality of seats are provided in the passenger compartment of the vehicle 10.
  • The plurality of seats are, for example, four seats: a driver's seat, a passenger seat, and left and right rear seats.
  • The number of seats is not limited to this.
  • In the following, the person seated in the driver's seat is referred to as occupant hm1, the person seated in the passenger seat as occupant hm2, the person seated on the right side of the rear seat as occupant hm3, and the person seated on the left side of the rear seat as occupant hm4.
  • The voice system 5 includes a microphone MC1, a microphone MC2, a microphone MC3, a microphone MC4, a voice processing system 20, and an electronic device 30.
  • The voice system 5 shown in FIG. 1 has as many microphones as seats, that is, four, but the number of microphones does not have to equal the number of seats.
  • The microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4 each output a voice signal to the voice processing system 20. The voice processing system 20 then outputs the voice recognition result to the electronic device 30. The electronic device 30 executes the process specified by the input voice recognition result.
  • The microphone MC1 is a microphone that collects the voice spoken by the occupant hm1. In other words, the microphone MC1 acquires a voice signal including a voice component spoken by the occupant hm1.
  • The microphone MC1 is arranged, for example, on the right side of the overhead console.
  • The microphone MC2 is a microphone that collects the voice spoken by the occupant hm2. In other words, the microphone MC2 acquires a voice signal including a voice component spoken by the occupant hm2.
  • The microphone MC2 is arranged, for example, on the left side of the overhead console. That is, the microphone MC1 and the microphone MC2 are arranged close to each other.
  • The microphone MC3 is a microphone that collects the voice spoken by the occupant hm3. In other words, the microphone MC3 acquires a voice signal including a voice component spoken by the occupant hm3.
  • The microphone MC3 is arranged, for example, on the right side of the center of the ceiling near the rear seats.
  • The microphone MC4 is a microphone that collects the voice spoken by the occupant hm4. In other words, the microphone MC4 acquires a voice signal including a voice component spoken by the occupant hm4.
  • The microphone MC4 is arranged, for example, on the left side of the center of the ceiling near the rear seats. That is, the microphone MC3 and the microphone MC4 are arranged close to each other.
  • The positions of the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4 shown in FIG. 1 are examples, and the microphones may be arranged at other positions.
  • Each microphone may be a directional microphone or an omnidirectional microphone.
  • Each microphone may be a small MEMS (Micro Electro Mechanical Systems) microphone or an ECM (Electret Condenser Microphone).
  • Each microphone may be a beamforming microphone.
  • Each microphone may be a microphone array that has directivity toward its seat and can pick up sound from that direction.
  • The voice system 5 shown in FIG. 1 includes a plurality of voice processing systems 20, one for each of the microphones.
  • Specifically, the voice system 5 includes a voice processing system 21, a voice processing system 22, a voice processing system 23, and a voice processing system 24.
  • The voice processing system 21 corresponds to the microphone MC1.
  • The voice processing system 22 corresponds to the microphone MC2.
  • The voice processing system 23 corresponds to the microphone MC3.
  • The voice processing system 24 corresponds to the microphone MC4.
  • The voice processing system 21, the voice processing system 22, the voice processing system 23, and the voice processing system 24 may be collectively referred to as the voice processing system 20.
  • A signal output from the voice processing system 20 is input to the electronic device 30.
  • The electronic device 30 executes a process corresponding to the signal output from the voice processing system 20.
  • The signal output from the voice processing system 20 is, for example, a voice command, which is a command input by voice.
  • A voice command is a signal that is specified by voice and controls a target device. That is, the electronic device 30 executes the process corresponding to the voice command output from the voice processing system 20.
  • For example, based on a voice command, the electronic device 30 executes a process of opening and closing a window, a process of driving the vehicle 10, a process of changing the temperature of an air conditioner, or a process of changing the volume of an audio device.
  • The electronic device 30 is an example of the target device.
  • FIG. 1 shows a case where four people are in the vehicle 10, but the number of people in the vehicle is not limited to this.
  • The number of passengers may be any number up to the maximum passenger capacity of the vehicle 10. For example, when the maximum passenger capacity of the vehicle 10 is 6, the number of passengers may be 6 or less.
  • FIG. 2 is a diagram showing an example of the hardware configuration of the voice processing system 20 in the first embodiment.
  • The voice processing system 20 includes a DSP (Digital Signal Processor) 2001, a RAM (Random Access Memory) 2002, a ROM (Read Only Memory) 2003, and an I/O (Input/Output) interface 2004.
  • The DSP 2001 is a processor capable of executing a computer program.
  • The type of processor included in the voice processing system 20 is not limited to the DSP 2001.
  • For example, the voice processing system 20 may include a CPU (Central Processing Unit) or other hardware.
  • The voice processing system 20 may include a plurality of processors.
  • The RAM 2002 is a volatile memory used as a cache or buffer.
  • The type of volatile memory included in the voice processing system 20 is not limited to the RAM 2002.
  • The voice processing system 20 may include a register instead of the RAM 2002. Further, the voice processing system 20 may include a plurality of volatile memories.
  • The ROM 2003 is a non-volatile memory that stores various information including computer programs.
  • The DSP 2001 realizes the function of the voice processing system 20 by reading a specific computer program from the ROM 2003 and executing it. The function of the voice processing system 20 will be described later.
  • The type of non-volatile memory included in the voice processing system 20 is not limited to the ROM 2003.
  • The voice processing system 20 may include a flash memory instead of the ROM 2003.
  • The voice processing system 20 may include a plurality of non-volatile memories.
  • The I/O interface 2004 is an interface device to which an external device is connected.
  • The external device is, for example, the microphone MC1, the microphone MC2, the microphone MC3, the microphone MC4, or the electronic device 30.
  • The voice processing system 20 may include a plurality of I/O interfaces 2004.
  • The voice processing system 20 thus includes a memory in which the computer program is stored and a processor capable of executing the computer program. That is, the voice processing system 20 can be regarded as a computer.
  • The number of computers required to realize the function of the voice processing system 20 is not limited to one.
  • The function of the voice processing system 20 may be realized by the cooperation of two or more computers.
  • FIG. 3 is a block diagram showing an example of the configuration of the voice processing system 20 in the first embodiment. Voice signals are input to the voice processing system 20 from the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4. The voice processing system 20 then outputs the voice recognition result to the electronic device 30.
  • The voice processing system 20 includes a voice input unit 210, a failure detection unit 220, and a voice processing device 230.
  • The microphone MC1 generates a voice signal by converting the picked-up voice into an electric signal. The microphone MC1 then outputs the voice signal to the voice input unit 210.
  • This voice signal includes the voice of the occupant hm1, the voice of persons other than the occupant hm1, and noise such as music emitted from an audio device and running noise.
  • The microphone MC2 generates a voice signal by converting the picked-up voice into an electric signal. The microphone MC2 then outputs the voice signal to the voice input unit 210.
  • This voice signal includes the voice of the occupant hm2, the voice of persons other than the occupant hm2, and noise such as music emitted from an audio device and running noise.
  • The microphone MC3 generates a voice signal by converting the picked-up voice into an electric signal. The microphone MC3 then outputs the voice signal to the voice input unit 210.
  • This voice signal includes the voice of the occupant hm3, the voice of persons other than the occupant hm3, and noise such as music emitted from an audio device and running noise.
  • The microphone MC4 generates a voice signal by converting the picked-up voice into an electric signal. The microphone MC4 then outputs the voice signal to the voice input unit 210.
  • This voice signal includes the voice of the occupant hm4, the voice of persons other than the occupant hm4, and noise such as music emitted from an audio device and running noise.
  • The voice input unit 210 receives the voice signals from each of the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4. That is, the voice input unit 210 receives the first voice, which is the voice spoken by the first speaker. In other words, the voice input unit 210 receives the voice spoken by the first speaker, who is any one of a plurality of speakers.
  • The voice input unit 210 is an example of an input unit. The voice input unit 210 outputs the voice signals to the failure detection unit 220.
  • The failure detection unit 220 detects a failure of each of the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4. Further, the failure detection unit 220 determines whether or not the position of the first speaker can be specified.
  • The failure detection unit 220 is an example of a determination unit.
  • The voice processing system 20 identifies the position of the speaker who uttered the voice included in each voice signal by comparing the voice signals output from the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4.
  • Consequently, the voice processing system 20 may be unable to identify the position of the speaker when any one of the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4 fails. The failure detection unit 220 therefore detects whether any of the plurality of microphones has failed and, based on the detection result, determines whether or not the position of the first speaker can be specified.
  • The microphone MC1 and the microphone MC2 are arranged close to each other. The sound pressure received by the microphone MC1 is therefore substantially the same as that received by the microphone MC2, and the levels of the voice signals output from the two microphones are also substantially the same. However, if one of the microphone MC1 and the microphone MC2 fails, that microphone cannot collect sound normally, and a difference arises between the levels of the voice signals output from the microphone MC1 and the microphone MC2.
  • When the difference between the level of the voice signal output from the microphone MC1 and the level of the voice signal output from the microphone MC2 is equal to or greater than a threshold value, the failure detection unit 220 determines that a failure has occurred in one of the microphone MC1 and the microphone MC2. For example, the failure detection unit 220 determines that, of the two microphones, the one outputting the lower-level voice signal has failed.
  • Similarly, when the difference between the level of the voice signal output from the microphone MC3 and the level of the voice signal output from the microphone MC4 is equal to or greater than the threshold value, the failure detection unit 220 determines that a failure has occurred in one of the microphone MC3 and the microphone MC4.
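The level comparison described above can be illustrated with a minimal Python sketch. It is not the implementation of the failure detection unit 220: the use of an RMS level in decibels, the 12 dB default threshold, and the function names `rms_level_db` and `detect_pair_failure` are all assumptions made for illustration, since the description only states that a level difference is compared with a threshold value.

```python
import math


def rms_level_db(samples):
    """Return the RMS level of a block of samples in dB (1.0 = full scale)."""
    if not samples:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    # Floor the value so that an all-zero block does not reach log10(0).
    return 10.0 * math.log10(max(mean_square, 1e-12))


def detect_pair_failure(samples_a, samples_b, threshold_db=12.0):
    """Compare the levels of two adjacent microphones.

    Returns None when the levels agree, otherwise "a" or "b" to name the
    microphone with the lower level, which is presumed to have failed.
    The threshold value is an assumed figure, not one taken from the text.
    """
    level_a = rms_level_db(samples_a)
    level_b = rms_level_db(samples_b)
    if abs(level_a - level_b) < threshold_db:
        return None
    return "a" if level_a < level_b else "b"
```

In this sketch, calling `detect_pair_failure` with the blocks from the microphone MC1 and the microphone MC2, and again with those from the microphone MC3 and the microphone MC4, corresponds to the two pairwise checks described above.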
  • The failure detection unit 220 outputs a failure detection signal indicating that a failure has been detected when a failure is detected in at least one of the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4. That is, the failure detection unit 220 outputs a failure detection signal indicating whether or not the position of the speaker who uttered the voice received by the voice input unit 210 can be specified.
  • The failure detection signal is an example of the first signal.
  • The failure detection unit 220 outputs the voice signals output from the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4 to the voice processing device 230.
  • The voice processing device 230 includes a signal receiving unit 231, a BF (Beam Forming) processing unit 232, an EC (Echo Canceller) processing unit 233, a CTC (Cross-Talk Canceller) processing unit 234, and a voice recognition unit 235.
  • The signal receiving unit 231 receives a failure detection signal indicating whether or not the position of the speaker who uttered the voice received by the voice input unit 210 can be specified.
  • The signal receiving unit 231 is an example of a receiving unit.
  • The signal receiving unit 231 receives the failure detection signal from the failure detection unit 220.
  • The signal receiving unit 231 transmits the failure detection signal to the BF processing unit 232, the EC processing unit 233, the CTC processing unit 234, and the voice recognition unit 235.
  • The BF processing unit 232 emphasizes the voice from the direction of the target seat by directivity control.
  • The operation of the BF processing unit 232 is described here using, as an example, the case where the voice from the direction of the driver's seat is emphasized in the voice signal output from the microphone MC1.
  • The microphone MC1 and the microphone MC2 are arranged in close proximity to each other. It is therefore assumed that the voice signal output from the microphone MC1 includes the voices of both the occupant hm1 in the driver's seat and the occupant hm2 in the passenger seat. Similarly, it is assumed that the voice signal output from the microphone MC2 includes the voices of both the occupant hm1 in the driver's seat and the occupant hm2 in the passenger seat.
  • The microphone MC1 is farther from the passenger seat than the microphone MC2. Therefore, when the passenger seat occupant hm2 speaks, the voice of the occupant hm2 included in the voice signal output from the microphone MC1 is delayed relative to the voice of the occupant hm2 included in the voice signal output from the microphone MC2. The BF processing unit 232 therefore emphasizes the voice from the direction of the target seat by, for example, applying time delay processing to the voice signals. The BF processing unit 232 then outputs the voice signal in which the voice from the direction of the target seat is emphasized to the EC processing unit 233. However, the method by which the BF processing unit 232 emphasizes the voice from the direction of the target seat is not limited to the above.
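As an illustration of the time delay processing mentioned above, the following Python sketch shows a two-microphone delay-and-sum arrangement. Delay-and-sum is used here only as one common way to realize such processing; the description does not say that the BF processing unit 232 uses it. The function names and the integer sample delay are assumptions.

```python
def delay(signal, n):
    """Delay a signal by n samples, zero-padding the start and keeping the length."""
    if n < 0:
        raise ValueError("n must be non-negative")
    kept = max(len(signal) - n, 0)
    return ([0.0] * n + list(signal[:kept]))[:len(signal)]


def delay_and_sum(near_mic, far_mic, delay_samples):
    """Emphasize the source that reaches far_mic delay_samples later than near_mic.

    The near channel is delayed so that the target voice lines up in both
    channels; averaging then keeps the target at full amplitude, while sound
    arriving with a different inter-microphone delay is partly cancelled.
    """
    aligned_near = delay(near_mic, delay_samples)
    return [(a + b) / 2.0 for a, b in zip(aligned_near, far_mic)]
```

In terms of the example above, `near_mic` would be the signal from the microphone closer to the target seat, and `delay_samples` would be the extra travel time to the other microphone expressed in samples.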
  • The EC processing unit 233 cancels, from the voice signal output from the BF processing unit 232, the audio components other than the voice spoken by the speaker.
  • The audio components other than the voice spoken by the speaker are, for example, music emitted by an audio device of the vehicle 10 and running noise.
  • In other words, the EC processing unit 233 executes echo canceling processing.
  • More specifically, the EC processing unit 233 cancels the audio component specified by a reference signal from the voice signal output from the BF processing unit 232. As a result, the EC processing unit 233 cancels the audio components other than the voice spoken by the speaker.
  • The reference signal is a signal indicating an audio component other than the voice spoken by the speaker.
  • For example, the reference signal is a signal indicating the audio component of music emitted by an audio device.
  • By canceling the audio component specified by the reference signal, the EC processing unit 233 can cancel the audio components other than the voice spoken by the speaker.
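The cancellation of a component specified by a reference signal can be sketched with a normalized least-mean-squares (NLMS) adaptive filter. NLMS is a technique chosen here purely for illustration; the description does not name the algorithm used by the EC processing unit 233. The tap count, step size, and the helper `simulate_echo` are likewise assumptions.

```python
import random


def simulate_echo(reference, path):
    """Hypothetical helper: filter the reference through a short echo path."""
    return [
        sum(c * reference[n - k] for k, c in enumerate(path) if n - k >= 0)
        for n in range(len(reference))
    ]


def nlms_echo_cancel(mic, reference, taps=8, step=0.5, eps=1e-8):
    """Remove from mic the component that can be predicted from reference."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for d, x in zip(mic, reference):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate  # what remains after removing the estimated echo
        norm = sum(h * h for h in history) + eps
        weights = [w + step * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual
```

Here `reference` plays the role of the reference signal, for example the music sent to the audio device, and the returned residual corresponds to the voice signal with that component reduced.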
  • The CTC processing unit 234 cancels the voice emitted from directions other than that of the target seat. In other words, the CTC processing unit 234 executes crosstalk canceling processing. The voice signals from all the microphones are input to the CTC processing unit 234 after undergoing echo canceling processing by the EC processing unit 233.
  • The CTC processing unit 234 cancels the audio components picked up from directions other than that of the target seat by using, as reference signals, the input voice signals output from the microphones other than the microphone for the target seat. That is, the CTC processing unit 234 cancels the audio components specified by the reference signals from the voice signal of the microphone for the target seat. The CTC processing unit 234 then outputs the voice signal after the crosstalk canceling processing to the voice recognition unit 235.
  • The voice recognition unit 235 outputs a voice command to the electronic device 30 based on the voice signal and the failure detection signal. More specifically, the voice recognition unit 235 identifies the voice command included in the voice signal by executing voice recognition processing on the voice signal output from the CTC processing unit 234. The voice commands include an utterance position command, which is a command related to the position of the speaker. The electronic device 30 executes a process corresponding to the utterance position command. Based on the utterance position command, the electronic device 30 executes, for example, a process of changing the temperature of the air conditioner, a process of changing the volume of the speaker, or a process of opening and closing a window.
  • The utterance position command is a command for which the process to be executed is determined according to the position of the speaker. For example, when the passenger seat occupant hm2 utters "Open the window", the voice recognition unit 235 determines that the command is an utterance position command indicating a process of opening the window on the left side of the passenger seat. Likewise, when the occupant hm3 in the right rear seat utters "Open the window", the voice recognition unit 235 determines that the command is an utterance position command indicating a process of opening the window on the right side of the rear seat.
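The way an utterance position command depends on the speaker's seat can be sketched as a lookup from microphone to seat to target window. The sketch below is illustrative only: the table names, the command tuple format, and the window assignments for the driver's seat and the left rear seat are assumptions, since the description gives only the passenger seat and right rear seat examples.

```python
# Microphone-to-seat mapping taken from the seating layout described for FIG. 1.
SEAT_BY_MIC = {
    "MC1": "driver",
    "MC2": "passenger",
    "MC3": "rear_right",
    "MC4": "rear_left",
}

# Seat-to-window mapping. The "driver" and "rear_left" entries are assumed
# by analogy with the two examples given in the text.
WINDOW_BY_SEAT = {
    "driver": "front_right",
    "passenger": "front_left",
    "rear_right": "rear_right",
    "rear_left": "rear_left",
}


def resolve_position_command(utterance, mic_id):
    """Return (action, target) for a recognized utterance, or None.

    None is returned when the microphone is unknown, so that no
    seat-dependent command is produced without a known position.
    """
    seat = SEAT_BY_MIC.get(mic_id)
    if seat is None:
        return None
    if utterance.strip().lower() == "open the window":
        return ("open_window", WINDOW_BY_SEAT[seat])
    return None
```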
  • The utterance position commands include a driving command related to driving.
  • The driving command is a command related to driving the vehicle 10.
  • The voice recognition unit 235 is configured to be able to distinguish between the driving command and other utterance position commands.
  • The driving command is, for example, a command for controlling the car navigation system, a command for controlling the vehicle speed by accelerator control, or a command for controlling the vehicle speed by brake control.
  • The voice recognition unit 235 determines the position from which a voice was uttered based on the position of the microphone from which the voice signal including the utterance position command was input.
  • For example, the voice recognition unit 235 determines that the voice signal based on the microphone MC1 is a voice uttered from the direction of the driver's seat.
  • The voice recognition unit 235 determines that the voice signal based on the microphone MC2 is a voice uttered from the direction of the passenger seat.
  • The voice recognition unit 235 determines that the voice signal based on the microphone MC3 is a voice uttered from the direction of the right side of the rear seat.
  • The voice recognition unit 235 determines that the voice signal based on the microphone MC4 is a voice uttered from the direction of the left side of the rear seat.
  • The voice recognition unit 235 determines whether or not the position of the speaker can be specified based on the failure detection signal output from the failure detection unit 220.
  • When a microphone has failed, the BF processing unit 232 and the CTC processing unit 234 may not be able to execute their processing normally.
  • For example, the microphone MC1 picks up both the voice spoken by the occupant hm1 in the driver's seat and the voice spoken by the occupant hm2 in the passenger seat. In this case, if the microphone MC2 is out of order, the BF processing unit 232 and the CTC processing unit 234 cannot execute their processing normally.
  • Specifically, the CTC processing unit 234 cannot cancel, from the voice signal output from the microphone MC1, the component of the voice that should have been picked up by the microphone MC2. The voice signal output from the microphone MC1 is therefore input to the voice recognition unit 235 while still including both the voice spoken by the occupant hm1 in the driver's seat and the voice spoken by the occupant hm2 in the passenger seat. In that case, the voice recognition unit 235 would handle the voice spoken by the passenger seat occupant hm2 included in the voice signal output from the microphone MC1 as a voice spoken by the driver's seat occupant hm1. For this reason, the voice recognition unit 235 determines whether or not the position of the speaker can be specified based on the failure detection signal.
  • When the voice recognition unit 235 determines that the voice command included in the voice signal is an utterance position command but cannot specify the position of the speaker who uttered it, the voice recognition unit 235 cannot determine which utterance position command should be output. For example, when the voice recognition unit 235 determines that the voice signal includes the utterance position command "Open the window" but cannot specify the position of the speaker, it cannot determine which window the output utterance position command should open.
  • The voice recognition unit 235 outputs, to the electronic device 30, a voice command, which is a signal that is specified by voice and controls the electronic device 30. The voice recognition unit 235 is an example of the voice recognition unit.
  • When the failure detection unit 220 determines that the position of the first speaker cannot be specified, that is, when the failure detection signal indicates this, the voice recognition unit 235 restricts the output of the utterance position command, which is the voice command related to the position of the speaker.
  • For example, the voice recognition unit 235 does not output the utterance position command when the failure detection unit 220 determines that the position of the speaker cannot be specified. As a result, the electronic device 30 does not execute the process corresponding to the utterance position command. The voice recognition unit 235 can thereby prevent the electronic device 30 from executing an unintended process.
  • When the voice recognition unit 235 determines that the position of the speaker cannot be specified because the failure detection unit 220 has detected a microphone failure, the voice recognition unit 235 restricts the output of the utterance position command specified by the voice input from a microphone associated with the failed microphone. In other words, when any microphone belonging to a group composed of a plurality of adjacent microphones fails, the voice recognition unit 235 does not output to the electronic device 30 the utterance position command specified by the voice input from the other microphones belonging to that group. On the other hand, the voice recognition unit 235 does not restrict the output of the utterance position command specified by the voice input from a microphone belonging to another group. That is, the voice recognition unit 235 outputs the utterance position command specified by the voice input from the microphones belonging to other groups.
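The group-based restriction described above can be sketched as a simple membership check. This is an illustrative sketch, not the logic of the voice recognition unit 235 itself: the names `MIC_GROUPS` and `may_output_position_command` are hypothetical, and treating the microphone MC3 and the microphone MC4 as a second group is an assumption based on their being arranged close to each other.

```python
# Groups of adjacent microphones. The second group is assumed by analogy
# with the first; it is not stated explicitly in the text.
MIC_GROUPS = (
    frozenset({"MC1", "MC2"}),
    frozenset({"MC3", "MC4"}),
)


def may_output_position_command(source_mic, failed_mics, groups=MIC_GROUPS):
    """Return True if an utterance position command from source_mic may be output.

    Output is restricted when any microphone in the same group as
    source_mic has failed; failures in other groups have no effect.
    """
    failed = set(failed_mics)
    for group in groups:
        if source_mic in group:
            return not (group & failed)
    raise ValueError(f"unknown microphone: {source_mic}")
```

For example, with the microphone MC2 failed, a command recognized from the microphone MC1 is withheld, while a command recognized from the microphone MC3 is still output.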
  • Microphone MC1 and microphone MC2 form a group.
  • The voice input unit 210 receives voice, including the first voice, output from a plurality of microphones including a first microphone and a second microphone associated with the first microphone.
  • The first microphone is, for example, the microphone MC2.
  • The second microphone is, for example, the microphone MC1.
  • The first voice is, for example, a voice uttered by the occupant hm2 in the passenger seat.
  • The microphone MC1 picks up both the voice uttered by the occupant hm1 in the driver's seat and the voice uttered by the occupant hm2 in the passenger seat. In this case, if the microphone MC2 has failed, the BF processing unit 232 and the CTC processing unit 234 cannot execute their processing normally.
  • The voice signal based on the microphone MC1 is then input to the voice recognition unit 235 while still containing both the voice uttered by the occupant hm1 in the driver's seat and the voice uttered by the occupant hm2 in the passenger seat. Therefore, the voice recognition unit 235 may erroneously determine that the voice uttered by the occupant hm2 in the passenger seat is the voice uttered by the occupant hm1 in the driver's seat. On the other hand, since the microphone MC3 and the microphone MC4 are separated from the occupant hm1 in the driver's seat and the occupant hm2 in the passenger seat, the voice spoken by the occupant hm1 in the driver's seat and the voice uttered by the occupant hm2 in the passenger seat are heard.
  • When the failure detection unit 220 detects the failure of the first microphone and determines that the position of the first speaker cannot be specified, the voice recognition unit 235 does not output, among the utterance position commands, the utterance position command specified by the voice input from the second microphone.
  • The first speaker is, for example, the occupant hm2 in the passenger seat.
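The group-based restriction described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the voice recognition unit 235: the `MIC_GROUPS` table and both function names are hypothetical, and pairing MC3 with MC4 is an assumption, since the text only states that MC1 and MC2 form a group.

```python
from typing import Dict, FrozenSet, Iterable, Set

# Hypothetical grouping of adjacent microphones. Only the MC1/MC2 group is
# stated in the text; the MC3/MC4 group is an assumption for this sketch.
MIC_GROUPS: Dict[str, FrozenSet[str]] = {
    "front": frozenset({"MC1", "MC2"}),
    "rear": frozenset({"MC3", "MC4"}),
}


def restricted_microphones(failed: Iterable[str]) -> Set[str]:
    """Return the working microphones whose utterance position commands are withheld."""
    failed_set = set(failed)
    restricted: Set[str] = set()
    for members in MIC_GROUPS.values():
        if members & failed_set:
            # A failure inside a group restricts the remaining members of that group.
            restricted |= members - failed_set
    return restricted


def may_output_position_command(source_mic: str, failed: Iterable[str]) -> bool:
    """True if an utterance position command picked up by source_mic may be output."""
    failed_set = set(failed)
    if source_mic in failed_set:
        return False
    return source_mic not in restricted_microphones(failed_set)
```

With MC2 failed, a command picked up by MC1 is withheld, while a command picked up by MC3 or MC4 is still passed on, which matches the behaviour described for microphones in other groups.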
  • When the failure detection unit 220 determines that the position of the first speaker cannot be specified, the voice recognition unit 235 changes the output priority of operation commands, which are the utterance position commands related to driving. For example, when the voice recognition unit 235 receives a plurality of utterance position commands, it assigns each utterance position command to one stage of a priority scale divided into a plurality of stages. Then, the voice recognition unit 235 outputs the utterance position commands having a priority higher than a threshold value to the electronic device 30. That is, the voice recognition unit 235 causes the electronic device 30 to preferentially execute those utterance position commands.
  • On the other hand, the voice recognition unit 235 does not output utterance position commands having a priority lower than the threshold value to the electronic device 30. In this way, the voice recognition unit 235 changes the output priority of the operation commands when the failure detection unit 220 determines that the position of the speaker cannot be specified.
  • For example, the voice recognition unit 235 raises the output priority of the operation commands when the failure detection unit 220 determines that the position of the first speaker cannot be specified. This prevents driving-related operations by voice from becoming impossible when any of the microphones fails.
  • Alternatively, the voice recognition unit 235 lowers the output priority of the operation commands when the failure detection unit 220 determines that the position of the speaker cannot be specified. This prevents driving-related operations from being performed by the voice of a person who is not originally involved in driving, such as the occupant hm4 in the rear seat, when any of the microphones fails.
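The priority handling described above can be illustrated with a short sketch. Everything concrete in it is an assumption: the three-stage scale, the value of `PRIORITY_THRESHOLD`, the command names, and the choice to withhold a command whose priority equals the threshold (the text only covers priorities higher or lower than the threshold).

```python
from dataclasses import dataclass
from typing import List

# Hypothetical threshold on an assumed three-stage scale (1 = low, 3 = high).
PRIORITY_THRESHOLD = 2


@dataclass(frozen=True)
class PositionCommand:
    name: str            # illustrative command name, not taken from the source
    priority: int        # stage assigned to the command
    is_operation: bool   # True for operation (driving-related) commands


def select_commands(commands: List[PositionCommand],
                    position_known: bool,
                    delta: int) -> List[str]:
    """Return the names of commands whose effective priority exceeds the threshold.

    delta is applied to operation commands only when the speaker position
    cannot be specified: +1 models raising the priority, -1 lowering it.
    """
    selected = []
    for cmd in commands:
        priority = cmd.priority
        if not position_known and cmd.is_operation:
            priority += delta
        # Assumption: a priority equal to the threshold is not output.
        if priority > PRIORITY_THRESHOLD:
            selected.append(cmd.name)
    return selected
```

Calling `select_commands(cmds, position_known=False, delta=1)` models the variant that keeps driving-related voice operation available after a failure, and `delta=-1` models the variant that suppresses it.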
  • FIG. 4 is a flowchart showing an example of the operation of the voice processing system 20 in the first embodiment.
  • The voice input unit 210 receives voice signals from the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4 (step S11).
  • The failure detection unit 220 determines, based on the voice signals output from the voice input unit 210, whether any of the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4 has failed (step S12).
  • The failure detection unit 220 outputs a failure detection signal, which indicates whether any of the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4 has failed, to the signal receiving unit 231 of the voice processing device 230 (step S13).
  • The signal receiving unit 231 transmits the failure detection signal, which indicates whether any of the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4 has failed, to the BF processing unit 232, the EC processing unit 233, the CTC processing unit 234, and the voice recognition unit 235 (step S14).
  • Based on the failure detection signal output from the signal receiving unit 231, the voice recognition unit 235 determines whether the position of the speaker who uttered the voice included in the voice signal input via the BF processing unit 232, the EC processing unit 233, and the CTC processing unit 234 can be specified (step S15).
  • When the position of the speaker can be specified (step S15; Yes), the voice recognition unit 235 outputs the voice command included in the voice signal to the electronic device 30 (step S16). As a result, the voice recognition unit 235 causes the electronic device 30 to execute the process specified by the voice command.
  • When the position of the speaker cannot be specified (step S15; No), the voice recognition unit 235 determines whether the voice command included in the voice signal is a command other than the utterance position command (step S17). When the command is other than the utterance position command (step S17; Yes), the voice recognition unit 235 proceeds to step S16.
  • When the command is the utterance position command (step S17; No), the voice recognition unit 235 limits the output of the utterance position command (step S18). That is, as shown in step S16, the voice recognition unit 235 outputs a voice command, which is a signal that is specified by voice and controls the target device, to the electronic device 30. However, when it is determined in step S15 that the position of the first speaker cannot be specified, the voice recognition unit 235 limits the output of the utterance position command, which is the voice command related to the position of the speaker. As a result, the voice recognition unit 235 limits the execution of the process specified by the voice command.
  • The voice processing system 20 then ends the processing.
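The branches of steps S15 to S18 can be condensed into a small decision function. This is a sketch for illustration only; the `Action` enum and the function and parameter names are hypothetical and do not appear in the source.

```python
from enum import Enum


class Action(Enum):
    OUTPUT = "output to electronic device 30"   # step S16
    RESTRICT = "restrict output"                # step S18


def decide_first_embodiment(position_specifiable: bool,
                            is_position_command: bool) -> Action:
    """Mirror the branches of steps S15 to S18 for one recognized voice command."""
    if position_specifiable:        # step S15; Yes
        return Action.OUTPUT
    if not is_position_command:     # step S17; Yes (not an utterance position command)
        return Action.OUTPUT
    return Action.RESTRICT          # step S17; No, so step S18
```

Only one of the four input combinations is restricted: an utterance position command received while the speaker position cannot be specified.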
  • The voice input unit 210 receives the first voice uttered by the first speaker, who is one of the plurality of speakers. The failure detection unit 220 determines, by detecting failures of the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4, whether the position of the first speaker who uttered the first voice received by the voice input unit 210 can be specified. The voice recognition unit 235 outputs to the electronic device 30 a voice command, which is a signal that is specified by voice and controls the target device; when it is determined that the position of the first speaker cannot be specified, it restricts the output of the utterance position command, which is the voice command related to the position of the speaker. Because the execution of unintended processing is thereby restricted, the voice processing system 20 can execute appropriate processing even when the position of the speaker cannot be specified.
  • FIG. 5 is a block diagram showing an example of the configuration of the voice processing system 20a according to the second embodiment.
  • The voice processing device 230a of the voice processing system 20a in the second embodiment differs from that of the voice processing system 20 in the first embodiment in that it includes a speaker recognition unit 236.
  • The speaker recognition unit 236 determines whether the first voice, which is the voice uttered by the first speaker who is any one of the plurality of speakers, is the voice of a pre-registered registrant.
  • The speaker recognition unit 236 is an example of the speaker determination unit. More specifically, the speaker recognition unit 236 compares the voice signal of the pre-registered registrant with the voice signal output from the CTC processing unit 234 to determine whether the voice included in the voice signal output from the CTC processing unit 234 is a voice uttered by the pre-registered registrant. For example, the speaker recognition unit 236 determines whether the voice included in the voice signal is the voice of the owner of the vehicle 10. Then, the speaker recognition unit 236 outputs, to the voice recognition unit 235a, a recognition result signal indicating whether the speaker who uttered the voice included in the voice signal can be determined to be a registrant.
  • The voice recognition unit 235a outputs an utterance position command on condition that the speaker recognition unit 236 determines that the first voice is uttered by the registrant. More specifically, when the failure detection unit 220 determines that the position of the speaker can be specified, the voice recognition unit 235a outputs the utterance position command regardless of whether the utterance is made by a pre-registered registrant. When the failure detection unit 220 determines that the position of the speaker cannot be specified, the voice recognition unit 235a outputs the utterance position command on condition that the speaker recognition unit 236 recognizes that the utterance is made by the registrant.
  • For example, the voice recognition unit 235a causes the processing of the utterance position command to be executed on condition that the voice is the voice of the pre-registered owner of the vehicle 10.
  • On the other hand, when the failure detection unit 220 determines that the position of the speaker cannot be specified and the speaker recognition unit 236 does not recognize the utterance as being made by the registrant, the voice recognition unit 235a limits the output of the utterance position command.
  • FIG. 6 is a flowchart showing an example of the operation of the voice processing system 20a in the second embodiment.
  • The voice input unit 210 receives voice signals from the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4 (step S21).
  • The failure detection unit 220 determines, based on the voice signals output from the voice input unit 210, whether any of the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4 has failed (step S22).
  • The failure detection unit 220 outputs a failure detection signal, which indicates whether any of the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4 has failed, to the signal receiving unit 231 of the voice processing device 230a (step S23).
  • The signal receiving unit 231 transmits the failure detection signal, which indicates whether any of the microphone MC1, the microphone MC2, the microphone MC3, and the microphone MC4 has failed, to the BF processing unit 232, the EC processing unit 233, the CTC processing unit 234, and the voice recognition unit 235a (step S24).
  • Based on the failure detection signal output from the signal receiving unit 231, the voice recognition unit 235a determines whether the position of the speaker who uttered the voice included in the voice signal input via the BF processing unit 232, the EC processing unit 233, and the CTC processing unit 234 can be specified (step S25).
  • When the position of the speaker can be specified (step S25; Yes), the voice recognition unit 235a outputs the voice command included in the voice signal to the electronic device 30 (step S26). As a result, the voice recognition unit 235a causes the electronic device 30 to execute the process specified by the voice command.
  • When the position of the speaker cannot be specified (step S25; No), the voice recognition unit 235a determines, based on the recognition result signal, whether the voice included in the voice signal was uttered by the registrant (step S27).
  • When the voice included in the voice signal was uttered by the registrant (step S27; Yes), the voice recognition unit 235a proceeds to step S26.
  • When the voice included in the voice signal was not uttered by the registrant (step S27; No), the voice recognition unit 235a determines whether the voice command included in the voice signal is a command other than the utterance position command (step S28). When the voice command included in the voice signal is a command other than the utterance position command (step S28; Yes), the voice recognition unit 235a proceeds to step S26.
  • When the voice command included in the voice signal is the utterance position command (step S28; No), the voice recognition unit 235a limits the output of the utterance position command (step S29). As a result, the voice recognition unit 235a limits the execution of the process specified by the voice command.
  • The voice processing system 20a then ends the processing.
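Compared with the first embodiment, the flow of steps S25 to S29 adds one check on the registrant before a command is restricted. A minimal sketch follows; the `Decision` enum and the function and parameter names are hypothetical and are used here only to make the branch order explicit.

```python
from enum import Enum


class Decision(Enum):
    OUTPUT = "output to electronic device 30"   # step S26
    RESTRICT = "restrict output"                # step S29


def decide_second_embodiment(position_specifiable: bool,
                             spoken_by_registrant: bool,
                             is_position_command: bool) -> Decision:
    """Mirror the branches of steps S25 to S29 for one recognized voice command."""
    if position_specifiable:        # step S25; Yes
        return Decision.OUTPUT
    if spoken_by_registrant:        # step S27; Yes
        return Decision.OUTPUT
    if not is_position_command:     # step S28; Yes (not an utterance position command)
        return Decision.OUTPUT
    return Decision.RESTRICT        # step S28; No, so step S29
```

The registrant check comes before the command-type check, so a registrant such as the vehicle owner can still issue utterance position commands while the speaker position cannot be specified.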
  • The speaker recognition unit 236 determines whether the first voice, which is uttered by the first speaker who is one of the plurality of speakers, is the voice of a pre-registered registrant. Then, the voice recognition unit 235a outputs the utterance position command to the electronic device 30 on condition that the speaker recognition unit 236 determines that the first voice is the voice of the registrant. As a result, the electronic device 30 executes the processing of the utterance position command on condition that the voice is uttered by a specific registrant such as the owner of the vehicle 10. On the other hand, the voice recognition unit 235a limits the output of the utterance position command when the voice is uttered by a person other than the registrant. Because the execution of unintended processing is thereby restricted, the voice processing system 20a can execute appropriate processing even when the position of the speaker cannot be specified.
  • The voice processing device 230 in the first embodiment and the voice processing device 230a in the second embodiment have a CTC processing unit 234. However, the voice processing device 230 and the voice processing device 230a do not have to have the CTC processing unit 234. Further, the voice processing device 230 shown in FIG. 3 and the voice processing device 230a shown in FIG. 5 include the EC processing unit 233 after the BF processing unit 232. However, the voice processing device 230 and the voice processing device 230a may instead include the BF processing unit 232 after the EC processing unit 233.
  • The voice processing device 230 in the first embodiment and the voice processing device 230a in the second embodiment may perform partial multi-zone sound collection using the microphones that have not failed. Specifically, when the microphone MC3 fails, the voice processing device 230 and the voice processing device 230a collect the voice in the rear seat using the microphone MC4. Alternatively, when the microphone MC4 fails, the voice processing device 230 and the voice processing device 230a collect the voice in the rear seat using the microphone MC3.
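The fallback for partial multi-zone sound collection can be sketched as a lookup of the first working microphone for a zone. The `ZONE_MICS` table, the zone name, and the function name are hypothetical; the text only states that MC3 and MC4 can each collect the rear-seat voice when the other has failed.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical zone table: only the rear-seat pairing of MC3 and MC4 is
# taken from the text.
ZONE_MICS: Dict[str, List[str]] = {"rear": ["MC3", "MC4"]}


def pick_zone_microphone(zone: str, failed: Iterable[str]) -> Optional[str]:
    """Return the first microphone of the zone that has not failed, or None."""
    failed_set = set(failed)
    for mic in ZONE_MICS.get(zone, []):
        if mic not in failed_set:
            return mic
    return None
```

When both microphones of the zone have failed the function returns `None`, a case the text does not address.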
  • A computer program for realizing the functions of the voice processing system 20 and the voice processing system 20a on a computer may be stored in the ROM 2003 in advance and provided.
  • The computer program for realizing the functions of the voice processing system 20 and the voice processing system 20a on the computer may be provided as a file in an installable format or an executable format recorded on a computer-readable recording medium such as a CD-ROM (Compact Disc Read Only Memory), a flexible disk (FD), a CD-R (Compact Disc Recordable), a DVD (Digital Versatile Disc), a USB (Universal Serial Bus) memory, or an SD (Secure Digital) card.
  • A computer program for realizing the functions of the voice processing system 20 and the voice processing system 20a on the computer may be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. Further, the computer program for realizing the functions of the voice processing system 20 and the voice processing system 20a on the computer may be provided or distributed via a network such as the Internet.
  • A part or all of the functions of the voice processing system 20 and the voice processing system 20a may be realized by a logic circuit.
  • A part or all of the functions of the voice processing system 20 and the voice processing system 20a may be realized by an analog circuit.
  • A part or all of the functions of the voice processing system 20 and the voice processing system 20a may be realized by an FPGA (Field-Programmable Gate Array), an ASIC (Application Specific Integrated Circuit), or the like.
  • Voice system
  • 10: Vehicle
  • 20, 20a, 21, 22, 23, 24: Voice processing system
  • 30: Electronic device
  • 210: Voice input unit
  • 220: Failure detection unit
  • 230, 230a: Voice processing device
  • 231: Signal receiving unit
  • 232: BF (Beam Forming) processing unit
  • 233: EC (Echo Canceller) processing unit
  • 234: CTC (Cross Talk Canceller) processing unit
  • 235, 235a: Voice recognition unit
  • 236: Speaker recognition unit
  • hm1, hm2, hm3, hm4: Occupant
  • MC1, MC2, MC3, MC4: Microphone
  • 2001: DSP (Digital Signal Processor)
  • 2002: RAM (Random Access Memory)
  • 2003: ROM (Read Only Memory)
  • 2004: I/O (Input/Output) interface

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Fittings On The Vehicle Exterior For Carrying Loads, And Devices For Holding Or Mounting Articles (AREA)
PCT/JP2021/016088 2020-09-18 2021-04-20 音声処理システム、音声処理装置、及び音声処理方法 Ceased WO2022059245A1 (ja)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN202180052503.1A CN115917642B (en) 2020-09-18 2021-04-20 Sound processing system, sound processing device, and sound processing method
DE112021004878.3T DE112021004878T5 (de) 2020-09-18 2021-04-20 Audioverarbeitungssystem, Audioverarbeitungsvorrichtung und Audioverarbeitungsverfahren
JP2022550338A JP7702212B2 (ja) 2020-09-18 2021-04-20 音声処理システム、音声処理装置、及び音声処理方法
US18/160,857 US12412594B2 (en) 2020-09-18 2023-01-27 Audio processing system, audio processing device, and audio processing method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020157743 2020-09-18
JP2020-157743 2020-09-18

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/160,857 Continuation US12412594B2 (en) 2020-09-18 2023-01-27 Audio processing system, audio processing device, and audio processing method

Publications (1)

Publication Number Publication Date
WO2022059245A1 (ja) 2022-03-24

Family

ID=80777411

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/016088 Ceased WO2022059245A1 (ja) 2020-09-18 2021-04-20 音声処理システム、音声処理装置、及び音声処理方法

Country Status (4)

Country Link
US (1) US12412594B2
JP (1) JP7702212B2
DE (1) DE112021004878T5
WO (1) WO2022059245A1

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12238486B2 (en) * 2022-08-10 2025-02-25 GM Global Technology Operations LLC Management of audio performance in a vehicle

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001034454A (ja) * 1999-07-26 2001-02-09 Sharp Corp 音声コマンド入力装置及び該装置を動作させるプログラムを記録した記録媒体
US20170161016A1 (en) * 2015-12-07 2017-06-08 Motorola Mobility Llc Methods and Systems for Controlling an Electronic Device in Response to Detected Social Cues
US20180047394A1 (en) * 2016-08-12 2018-02-15 Paypal, Inc. Location based voice association system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007090611A (ja) 2005-09-28 2007-04-12 Sekisui Plastics Co Ltd 展示パネル用台板及び展示パネル
US7747446B2 (en) * 2006-12-12 2010-06-29 Nuance Communications, Inc. Voice recognition interactive system with a confirmation capability
JP2017090611A (ja) 2015-11-09 2017-05-25 三菱自動車工業株式会社 音声認識制御システム

Also Published As

Publication number Publication date
US12412594B2 (en) 2025-09-09
DE112021004878T5 (de) 2023-08-10
US20230178093A1 (en) 2023-06-08
JP7702212B2 (ja) 2025-07-03
JPWO2022059245A1 2022-03-24
CN115917642A (zh) 2023-04-04

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21868932; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2022550338; Country of ref document: JP; Kind code of ref document: A)
122 EP: PCT application non-entry in European phase (Ref document number: 21868932; Country of ref document: EP; Kind code of ref document: A1)