WO2016103710A1 - Voice processing device - Google Patents

Voice processing device

Info

Publication number
WO2016103710A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
source
voice
processing unit
audio
Application number
PCT/JP2015/006448
Other languages
French (fr)
Japanese (ja)
Inventor
Sasha Vrazic
Hiroki Okada
Original Assignee
Aisin Seiki Co., Ltd.
Toyota Motor Corporation
Application filed by Aisin Seiki Co., Ltd. and Toyota Motor Corporation
Publication of WO2016103710A1


Classifications

    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60R: VEHICLES, VEHICLE FITTINGS, OR VEHICLE PARTS, NOT OTHERWISE PROVIDED FOR
    • B60R 16/00: Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for
    • B60R 16/02: Electric or fluid circuits specially adapted for vehicles and not otherwise provided for; Arrangement of elements of electric or fluid circuits specially adapted for vehicles and not otherwise provided for; electric constitutive elements
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16: Sound input; Sound output
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272: Voice signal separating
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination

Definitions

  • the present invention relates to a voice processing device.
  • Various devices are provided in vehicles such as automobiles. These devices are operated, for example, via operation buttons, operation panels, and the like.
  • Recently, operating such devices by voice recognition has also been proposed (Patent Documents 1 to 3).
  • An object of the present invention is to provide a speech processing apparatus capable of improving the reliability of speech recognition.
  • According to the present invention, there is provided a voice processing apparatus comprising a plurality of microphones arranged in a vehicle and a voice source direction determination unit that determines the direction of a voice source contained in the sound reception signals acquired by each of the plurality of microphones, the apparatus performing beamforming in the direction of the designated voice source.
  • According to the present invention, by performing a predetermined action, the voice source to be the target of voice recognition can be reliably specified. The present invention can therefore provide a voice processing apparatus that improves the reliability of voice recognition.
  • FIG. 1 is a schematic diagram showing a configuration of a vehicle.
  • A driver's seat 40 and a passenger's seat 44 are arranged at the front of the vehicle body (cabin) 46 of a vehicle (automobile).
  • the driver's seat 40 is located on the right side of the passenger compartment 46, for example.
  • a steering wheel (handle) 78 is disposed in front of the driver seat 40.
  • the passenger seat 44 is located on the left side of the passenger compartment 46, for example.
  • the driver seat 40 and the passenger seat 44 constitute a front seat.
  • An audio source 72a, i.e., the voice source when the driver speaks, is located at the driver's seat 40.
  • An audio source 72b, i.e., the voice source when the passenger speaks, is located at the passenger seat 44.
  • a rear seat 70 is disposed at the rear of the vehicle body 46.
  • reference numeral 72 is used when the description is made without distinguishing between the individual sound sources, and reference numerals 72a and 72b are used when the description is made with the individual sound sources distinguished.
  • a plurality of microphones 22 (22a to 22c), that is, microphone arrays are arranged in front of the front seats 40 and 44.
  • reference numeral 22 is used when the description is made without distinguishing the individual microphones, and reference numerals 22a to 22c are used when the description is made with the individual microphones distinguished.
  • the microphone 22 may be disposed on the dashboard 42 or may be disposed on a portion close to the roof.
  • the distance between the sound source 72 of the front seats 40 and 44 and the microphone 22 is often about several tens of centimeters. However, the distance between the microphone 22 and the audio source 72 can be less than a few tens of centimeters. Also, the distance between the microphone 22 and the audio source 72 can exceed 1 m.
  • a speaker (loud speaker) 76 constituting a speaker system of an on-vehicle acoustic device (car audio device) 84 (see FIG. 2) is arranged.
  • Music emitted from the speaker 76 can be noise when performing speech recognition.
  • the vehicle body 46 is provided with an engine 80 for driving the vehicle.
  • the sound emitted from the engine 80 can be noise when performing speech recognition.
  • The noise generated in the passenger compartment 46 by road-surface excitation while the vehicle is traveling can also be noise when performing voice recognition.
  • wind noise generated when the vehicle travels can also be a noise source in performing speech recognition.
  • the noise source 82 may exist outside the vehicle body 46. The sound emitted from the external noise source 82 can also be noise in performing speech recognition.
  • the user's voice instruction is recognized using, for example, an automatic voice recognition device 68 (see FIG. 2).
  • the speech processing apparatus contributes to improvement of speech recognition accuracy in the automatic speech recognition apparatus 68.
  • FIG. 2 is a block diagram showing a system configuration of the speech processing apparatus according to the present embodiment.
  • the speech processing apparatus includes a pre-processing unit 10, a processing unit 12, a post-processing unit 14, a speech source direction determination unit 16, an adaptive algorithm determination unit 18, and a noise model.
  • a determination unit 20 and a designated input processing unit 86 are included.
  • the voice processing device may further include an automatic voice recognition device 68, and the voice processing device according to the present embodiment and the automatic voice recognition device 68 may be separate devices.
  • a device including these components and the automatic speech recognition device 68 can be referred to as a speech processing device or an automatic speech recognition device.
  • a signal acquired by each of the plurality of microphones 22a to 22c, that is, a sound reception signal is input to the preprocessing unit 10.
  • As the microphone 22, for example, an omnidirectional microphone is used.
  • FIGS. 3A and 3B are schematic diagrams showing examples of microphone arrangement.
  • FIG. 3A shows a case where the number of microphones 22 is three.
  • FIG. 3B shows a case where the number of microphones 22 is two.
  • the plurality of microphones 22 are arranged so as to be positioned on a straight line.
  • The sound reaching the microphone 22 is treated as a plane wave, and the direction of the sound source 72, that is, the sound source direction (DOA: Direction Of Arrival), can thereby be determined.
  • When the sound source 72 is located in the near field, it is preferable to determine its direction by treating the sound reaching the microphone 22 as a spherical wave.
  • the distance L1 between the microphone 22a and the microphone 22b is set to be relatively long so as to be suitable for a relatively low frequency sound.
  • the distance L2 between the microphone 22b and the microphone 22c is set to be relatively short so as to be suitable for a relatively high frequency sound.
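The two spacings trade off coverage against spatial aliasing: a microphone pair localizes sound unambiguously only up to roughly f = c / (2d), so the wide pair serves low frequencies and the narrow pair high frequencies. A minimal sketch with illustrative spacings (the patent does not give numeric values):

```python
# Highest frequency a microphone pair can localize without spatial
# aliasing: the spacing must stay below half a wavelength, f_max = c / (2 * d).
def max_unambiguous_frequency(d_m: float, c: float = 340.0) -> float:
    """Return the spatial-aliasing limit in Hz for mic spacing d_m (metres)."""
    return c / (2.0 * d_m)

# A hypothetical wide pair (20 cm) covers low frequencies up to 850 Hz,
# while a hypothetical narrow pair (4 cm) extends coverage up to 4250 Hz.
wide = max_unambiguous_frequency(0.20)
narrow = max_unambiguous_frequency(0.04)
```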
  • sound reception signals acquired by the plurality of microphones 22 are input to the preprocessing unit 10.
  • sound field correction is performed.
  • tuning is performed in consideration of the acoustic characteristics of the vehicle compartment 46 that is an acoustic space.
  • When the sound reception signal acquired by the microphone 22 includes music, the preprocessing unit 10 removes the music from that signal.
  • a reference music signal (reference signal) is input to the preprocessing unit 10.
  • the preprocessing unit 10 removes music included in the sound reception signal acquired by the microphone 22 using the reference music signal.
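Reference-based music removal is commonly done with an adaptive filter, in the manner of echo cancellation. The sketch below uses a normalized LMS (NLMS) filter as one plausible realization; the patent does not name the algorithm:

```python
import numpy as np

def nlms_cancel(mic: np.ndarray, ref: np.ndarray, taps: int = 64,
                mu: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """Subtract the component of `mic` predictable from `ref` (NLMS).

    `ref` is the loudspeaker (music) reference signal; the residual
    that the filter cannot predict from it is returned as the
    music-free estimate of the microphone signal.
    """
    w = np.zeros(taps)
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = ref[n - taps:n][::-1]          # most recent reference samples
        e = mic[n] - w @ x                 # error = mic minus predicted music
        w += (mu / (x @ x + eps)) * e * x  # normalized LMS weight update
        out[n] = e
    return out
```

After convergence, a microphone signal dominated by (filtered) music is reduced to a small residual, while speech uncorrelated with the reference passes through.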
  • the sound source direction determination unit 16 determines the direction of the sound source.
  • Let the speed of sound be c [m/s], the distance between the microphones be d [m], and the arrival time difference be τ [seconds]. The direction θ [degrees] of the sound source 72 is then expressed by equation (1) in terms of c, d, and τ. The sound speed c is about 340 [m/s].
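Equation (1) follows from far-field geometry: the extra path length to the farther microphone is d·cos θ, so τ = d·cos θ / c. A minimal sketch, assuming the angle is measured from the array axis (the patent's exact sign and reference convention may differ):

```python
import math

def doa_degrees(tau_s: float, d_m: float, c: float = 340.0) -> float:
    """Far-field DOA from the inter-microphone arrival-time difference.

    Uses theta = acos(c * tau / d), with theta measured from the array
    axis. This is a standard formulation; the patent's equation (1)
    may use a different angle convention.
    """
    ratio = max(-1.0, min(1.0, c * tau_s / d_m))  # clamp rounding noise
    return math.degrees(math.acos(ratio))

# Zero delay -> broadside (90 degrees from the array axis);
# tau = d / c -> sound arriving along the axis (0 degrees).
```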
  • the output signal of the voice source direction determination unit 16, that is, the signal indicating the direction of the voice source 72 is input to the adaptive algorithm determination unit 18.
  • the adaptive algorithm determination unit 18 determines an adaptive algorithm based on the orientation of the audio source 72.
  • a signal indicating the adaptation algorithm determined by the adaptation algorithm determination unit 18 is input from the adaptation algorithm determination unit 18 to the processing unit 12.
  • the processing unit 12 performs adaptive beamforming, which is signal processing that adaptively forms directivity (adaptive beamformer).
  • the processing unit 12 not only functions as an adaptive beamformer that adaptively performs beamforming, but also controls the entire speech processing apparatus according to the present embodiment.
  • As the beamformer, for example, a Frost beamformer can be used. The beamformer is not limited to a Frost beamformer, however; various beamformers can be applied as appropriate.
  • The processing unit 12 performs beamforming based on the adaptive algorithm determined by the adaptive algorithm determination unit 18. In this embodiment, beamforming is performed in order to reduce the sensitivity in directions other than the arrival direction of the target sound while securing sensitivity in the arrival direction of the target sound.
  • the target sound is, for example, a sound emitted from the driver.
  • the position of the sound source 72a can change.
  • the arrival direction of the target sound changes according to the change in the position of the sound source 72a.
  • The beamformer is therefore sequentially updated so as to suppress sound arriving from outside the azimuth range that includes the azimuth of the sound source 72a.
  • When the voice source 72b to be subjected to voice recognition is located at the passenger seat 44, sound coming from outside the azimuth range that includes the azimuth of the passenger seat 44 may be suppressed.
  • FIG. 4 is a diagram showing a beamformer algorithm.
  • The received sound signals acquired by the microphones 22a to 22c are input, via the preprocessing unit 10 (see FIG. 2), to the window function / fast Fourier transform processing units 48a to 48c provided in the processing unit 12.
  • the window function / fast Fourier transform processing units 48a to 48c perform window function processing and fast Fourier transform processing. In this embodiment, the window function process and the fast Fourier transform process are performed because the calculation in the frequency domain is faster than the calculation in the time domain.
  • The output signal X1,k of the window function / fast Fourier transform processing unit 48a is multiplied by the beamformer weight W1,k* at the multiplication point 50a.
  • Likewise, the output signal X2,k of the unit 48b is multiplied by the beamformer weight W2,k* at the multiplication point 50b.
  • The output signal X3,k of the unit 48c is multiplied by the beamformer weight W3,k* at the multiplication point 50c.
  • the signals multiplied at the multiplication points 50 a to 50 c are added at the addition point 52.
  • the signal Y k added at the addition point 52 is input to an inverse fast Fourier transform / superimposition addition processing unit 54 provided in the processing unit 12.
  • The inverse fast Fourier transform / overlap-add processing unit 54 performs inverse fast Fourier transform processing and processing based on the overlap-add (OLA) method, which returns the frequency-domain signal to the time domain. The resulting signal is input from the unit 54 to the post-processing unit 14.
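The window/FFT, per-bin weight-and-sum (Yk = Σm Wm,k* Xm,k), and inverse-FFT/overlap-add pipeline described above can be sketched as follows. The weights here are placeholders; in the patent an adaptive algorithm (e.g., a Frost-type beamformer) supplies them:

```python
import numpy as np

def fd_beamform(chans: np.ndarray, weights: np.ndarray,
                frame: int = 256) -> np.ndarray:
    """Filter-and-sum beamformer in the frequency domain with 50% overlap-add.

    chans:   (n_mics, n_samples) time-domain microphone signals
    weights: (n_mics, frame // 2 + 1) per-bin complex weights W_m,k
             (placeholders here; an adaptive beamformer updates them)
    """
    hop = frame // 2
    win = np.hanning(frame)
    n_mics, n = chans.shape
    out = np.zeros(n + frame)
    for start in range(0, n - frame + 1, hop):
        seg = chans[:, start:start + frame] * win       # window each channel
        X = np.fft.rfft(seg, axis=1)                    # per-channel FFT
        Y = np.sum(np.conj(weights) * X, axis=0)        # sum_m W*_{m,k} X_{m,k}
        y = np.fft.irfft(Y, frame)                      # back to time domain
        out[start:start + frame] += y                   # overlap-add
    return out[:n]
```

With flat weights of 1/n_mics on identical channels this reduces to pass-through averaging, which makes the overlap-add reconstruction easy to check.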
  • FIG. 5 is a diagram showing the directivity of the beamformer and the angle characteristics of the audio source direction determination cancellation process.
  • the solid line indicates the directivity of the beamformer.
  • the alternate long and short dash line indicates the angle characteristic of the audio source direction determination cancellation process.
  • In the example shown in FIG. 5, the output signal power is minimized at the azimuth angles θ1 and θ2 and is sufficiently suppressed between θ1 and θ2. If a beamformer with the directivity shown in FIG. 5 is used, sound arriving from the passenger seat can be sufficiently suppressed, while voice coming from the driver's seat reaches the microphone 22 with almost no suppression.
  • When the sound reception signal acquired by the microphone 22 is sufficiently suppressed by the beamformer, the determination of the direction of the audio source 72 is suspended (voice source direction determination cancellation processing). For example, when the beamformer is set to acquire the voice from the driver and the voice from the passenger seat is louder than the voice from the driver, the estimation of the voice source direction is interrupted, because in this case the sound reception signal acquired by the microphone 22 is sufficiently suppressed. For example, when a voice arriving from a direction smaller than θ1 or larger than θ2 is louder than the voice from the driver, the voice source direction determination cancellation processing is performed.
  • Although the case where the beamformer is set to acquire the voice from the driver has been described as an example, the beamformer may instead be set to acquire the voice from the passenger. In that case, when the voice from the driver is louder than the voice from the passenger, the estimation of the voice source direction is interrupted.
  • a signal in which sound coming from an azimuth range other than the azimuth range including the azimuth of the audio source 72 is suppressed is output from the processing unit 12.
  • An output signal from the processing unit 12 is input to the post-processing unit 14.
  • In the post-processing unit 14, noise is removed. The noise includes engine noise, road noise, wind noise, and the like.
  • The noise model determination unit 20 generates a reference noise signal by performing noise modeling processing.
  • the reference noise signal output from the noise model determination unit 20 is a reference signal for removing noise from a signal including noise.
  • the reference engine noise signal is input to the post-processing unit 14.
  • the post-processing unit 14 uses the reference engine noise signal to remove noise from the signal including noise.
  • the post-processing unit 14 outputs a signal from which noise has been removed.
  • the post-processing unit 14 also performs distortion reduction processing. Note that noise removal is not performed only in the post-processing unit 14. Noise is removed from a sound acquired via the microphone 22 by a series of processes performed in the preprocessing unit 10, the processing unit 12, and the postprocessing unit 14.
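As an illustration of reference-based noise removal, the sketch below uses magnitude spectral subtraction; this is one of several possible techniques, and the patent does not specify which algorithm the post-processing unit 14 uses:

```python
import numpy as np

def spectral_subtract(noisy: np.ndarray, noise_ref: np.ndarray,
                      frame: int = 256, floor: float = 0.05) -> np.ndarray:
    """Subtract an estimated noise magnitude spectrum frame by frame.

    noise_ref supplies the reference noise (e.g. the modeled engine
    noise); its average magnitude spectrum is subtracted from each
    frame of `noisy`, with a spectral floor to limit musical noise.
    """
    hop = frame // 2
    win = np.hanning(frame)
    # Average magnitude spectrum of the reference noise.
    frames = [noise_ref[i:i + frame] * win
              for i in range(0, len(noise_ref) - frame + 1, hop)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in frames], axis=0)

    out = np.zeros(len(noisy) + frame)
    for start in range(0, len(noisy) - frame + 1, hop):
        X = np.fft.rfft(noisy[start:start + frame] * win)
        mag = np.maximum(np.abs(X) - noise_mag, floor * np.abs(X))
        Y = mag * np.exp(1j * np.angle(X))   # keep the noisy phase
        out[start:start + frame] += np.fft.irfft(Y, frame)
    return out[:len(noisy)]
```

On a signal that is statistically similar to the reference noise, the output power drops substantially; speech components not present in the reference are largely preserved.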
  • a signal that has been post-processed by the post-processing unit 14 is output to the automatic speech recognition device 68. Since a good target sound in which sounds other than the target sound are suppressed is input to the automatic speech recognition device 68, the automatic speech recognition device 68 can improve the accuracy of speech recognition. Based on the voice recognition result by the automatic voice recognition device 68, the operation on the device mounted on the vehicle is automatically performed.
  • the voice recognition result by the automatic voice recognition device 68 is also input to the designated input processing unit 86.
  • the designation input processing unit 86 is for the user to designate a voice source 72 that is a target of voice recognition when a user (occupant) performs a predetermined action. Examples of the predetermined action include utterance of a predetermined word. A user who has issued a predetermined word is designated as the voice source 72 to be subjected to voice recognition.
  • the sound source 72 designated by performing a predetermined action is referred to as a designated sound source.
  • The designation input processing unit 86 determines whether or not the predetermined word has been uttered based on the voice recognition result of the automatic voice recognition device 68.
  • a signal indicating whether or not a predetermined word has been issued is input from the designated input processing unit 86 to the processing unit 12.
  • the processing unit 12 performs beam forming so as to suppress sound coming from an azimuth range other than the azimuth range including the azimuth of the sound source 72 that issued the predetermined word. Note that the direction of the sound source 72 that has issued the predetermined word is determined by the sound source direction determination unit 16.
  • FIG. 6 is a flowchart showing the operation of the speech processing apparatus according to the present embodiment.
  • First, the voice processing apparatus is turned on (step S1).
  • When the user has uttered the predetermined word (YES in step S2), the audio source 72 that uttered the word is designated as the designated audio source (step S3). If the predetermined word has not been uttered (NO in step S2), step S2 is repeated.
  • the designated voice source is a voice source 72 that is a target of voice recognition. Since the direction of the sound source 72 that has issued the predetermined word is determined by the sound source direction determination unit 16, it is possible to determine which seat the user has issued the predetermined word from. In this way, the sound source 72 that has issued the predetermined word is determined, and the designated sound source 72 to be subjected to speech recognition is designated.
  • Next, the direction of the designated audio source 72 is determined (step S4).
  • the direction of the designated audio source 72 is determined by the audio source direction determining unit 16.
  • the directivity of the beamformer is set according to the direction of the designated audio source 72 (step S5).
  • the setting of the beamformer directivity is performed by the adaptive algorithm determination unit 18, the processing unit 12, and the like as described above.
  • When the magnitude of sound coming from outside the predetermined azimuth range including the azimuth of the designated voice source 72 is equal to or greater than the magnitude of the voice coming from the designated voice source 72 (YES in step S6), the determination of the direction of the voice source 72 is interrupted (step S7).
  • When the magnitude of the sound coming from outside the predetermined azimuth range including the azimuth of the voice source 72 is not greater than the magnitude of the voice coming from the voice source 72 (NO in step S6), steps S4 and S5 are repeated.
  • the beamformer is adaptively set according to the change in the position of the designated sound source 72, and the sound other than the sound from the designated sound source 72, that is, the sound other than the target sound is surely suppressed.
  • As described above, according to the present embodiment, the voice source 72 to be subjected to voice recognition can be reliably specified by uttering a predetermined word. The present embodiment can therefore provide a voice processing apparatus that improves the reliability of voice recognition.
  • FIG. 7 is a block diagram showing the system configuration of the speech processing apparatus according to the present embodiment.
  • the same components as those of the speech processing apparatus according to the first embodiment shown in FIGS. 1 to 6 are denoted by the same reference numerals, and description thereof is omitted or simplified.
  • In the present embodiment, the predetermined action by which the user designates the voice source 72 to be the target of voice recognition is an operation of the switches 90 and 92 or a gesture.
  • The speech processing apparatus according to the present embodiment includes a pre-processing unit 10, a processing unit 12, a post-processing unit 14, a voice source direction determination unit 16, an adaptive algorithm determination unit 18, and a noise model determination unit 20. It also includes a learning processing unit 88, a driver seat side switch 90, a passenger seat side switch 92, a camera 94, a switch designation input processing unit 96, and an image designation input processing unit 98.
  • a driver's seat side switch 90 is arranged in the vicinity of the driver's seat 40.
  • a passenger seat side switch 92 is disposed in the vicinity of the passenger seat 44.
  • the driver seat side switch 90 and the passenger seat side switch 92 are connected to the switch designation input processing unit 96.
  • the switch designation input processing unit 96 is for the user to designate the voice source 72 that is the target of voice recognition by the user operating the switches 90 and 92.
  • When the driver's seat side switch 90 is operated, the voice source 72a located at the driver's seat is designated as the designated voice source that is the target of voice recognition.
  • When the passenger seat side switch 92 is operated, the voice source 72b located at the passenger seat is designated as the designated voice source that is the target of voice recognition.
  • a signal indicating that the driver's seat side switch 90 has been operated is input from the switch designation input processing unit 96 to the processing unit 12.
  • In that case, the processing unit 12 performs beamforming so as to suppress sound coming from outside the azimuth range that includes the azimuth of the sound source 72a located at the driver's seat 40.
  • a signal indicating that the passenger seat side switch 92 has been operated is input from the switch designation input processing unit 96 to the processing unit 12.
  • In that case, the processing unit 12 performs beamforming so as to suppress sound coming from outside the azimuth range that includes the azimuth of the sound source 72b located at the passenger seat 44.
  • A camera 94 is disposed in the vehicle body 46.
  • An image acquired by the camera 94 is input to the image designation input processing unit 98.
  • the image designation input processing unit 98 is for the user to designate a voice source 72 that is a target of voice recognition when a user (occupant) performs a predetermined action. Examples of the predetermined action include a predetermined gesture (gesture, pose).
  • a user who has performed a predetermined gesture is designated as a voice source (designated voice source) 72 to be a target of voice recognition.
  • the image designation input processing unit 98 determines whether a predetermined gesture has been performed based on the image acquired by the camera 94. A signal indicating whether or not a predetermined gesture has been performed is input from the image designation input processing unit 98 to the processing unit 12.
  • the processing unit 12 performs beam forming so as to suppress sound coming from an azimuth range other than the azimuth range including the azimuth of the audio source 72a located at the driver's seat 40.
  • the processing unit 12 performs beam forming so as to suppress sound coming from an azimuth range other than the azimuth range including the azimuth of the audio source 72b located at the passenger seat 44 when a predetermined gesture is performed by the passenger. I do.
  • a learning processing unit 88 is connected to the processing unit 12.
  • the learning processing unit 88 learns beam forming suitable for each of the sound sources 72a and 72b for each of the sound sources 72a and 72b.
  • The learning processing unit 88 is provided for the following reason. In the present embodiment, the predetermined action by which the user designates the voice source 72 to be the target of voice recognition is an operation of the switches 90 and 92 or a gesture; that is, the voice source 72 is designated by means other than voice. For this reason, at the time the voice source 72 to be subjected to speech recognition is designated, the voice from the designated voice source 72 has not necessarily been acquired via the microphone 22.
  • Therefore, it is preferable that beamforming suitable for each sound source 72 be learned in advance and applied when that sound source 72 is designated. The learning processing unit 88 is provided for this purpose.
  • the learning processing unit 88 learns beam forming suitable for acquiring the sound from the sound source 72a when the sound is emitted from the sound source 72a.
  • the learning processing unit 88 learns beamforming suitable for acquiring the sound from the sound source 72b when the sound is emitted from the sound source 72b.
  • the beam forming learned as the beam forming suitable for the sound source 72a located in the driver's seat 40 is applied.
  • the beam forming learned as the beam forming suitable for the sound source 72b located in the passenger seat 44 is applied.
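The learn-then-apply flow can be summarized as a per-source store of learned beamformer weights (a hypothetical structure; the patent leaves the learning algorithm itself unspecified):

```python
class BeamformerStore:
    """Cache learned beamformer weights per audio source (e.g. driver,
    passenger), so that a source designated by switch or gesture can be
    served with suitable beamforming before it has uttered any speech."""

    def __init__(self):
        self._weights = {}

    def learn(self, source_id: str, weights):
        # Called whenever speech from `source_id` lets us adapt weights.
        self._weights[source_id] = weights

    def weights_for(self, source_id: str, default=None):
        # On designation, apply the learned weights if available.
        return self._weights.get(source_id, default)
```

For example, weights learned while the driver spoke are stored under a driver key and retrieved the moment the driver's seat side switch is pressed.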
  • the signal that has been post-processed by the post-processing unit 14 is output as an audio output.
  • FIG. 8 is a flowchart showing the operation of the speech processing apparatus according to the present embodiment.
  • First, the voice processing apparatus is turned on (step S10).
  • Next, beamforming learning is performed (step S11).
  • the learning processing unit 88 learns beamforming suitable for the sound source 72a located at the driver's seat 40.
  • the learning processing unit 88 learns beamforming suitable for the sound source 72b located in the passenger seat 44.
  • When the driver's seat side switch 90 is operated, specifically when it is turned on (YES in step S12), the beamforming learned by the learning processing unit 88 as suitable for the audio source 72a located at the driver's seat 40 is applied (step S13).
  • If the driver's seat side switch 90 has not been operated (NO in step S12), it is confirmed whether or not the passenger's seat side switch 92 has been operated (step S14).
  • When the passenger seat side switch 92 is operated, specifically when it is turned on (YES in step S14), the beamforming learned by the learning processing unit 88 as suitable for the sound source 72b located at the passenger seat 44 is applied (step S15).
  • If the passenger seat side switch 92 has not been operated (NO in step S14), it is confirmed whether or not the predetermined gesture has been performed by the driver (step S16).
  • When the predetermined gesture is performed by the driver (YES in step S16), the beamforming learned by the learning processing unit 88 as suitable for the sound source 72a located at the driver's seat 40 is applied (step S17).
  • If the predetermined gesture has not been performed by the driver (NO in step S16), it is confirmed whether or not the predetermined gesture has been performed by the passenger (step S18).
  • When the predetermined gesture is performed by the passenger (YES in step S18), the beamforming learned by the learning processing unit 88 as suitable for the sound source 72b located at the passenger seat 44 is applied (step S19).
  • When sound is emitted from the designated sound source 72, the direction of the designated sound source 72 is determined (step S21).
  • The direction of the designated audio source 72 is determined by the audio source direction determination unit 16, as described above.
  • the directivity of the beamformer is set according to the direction of the designated audio source 72 (step S22).
  • the setting of the beamformer directivity is performed by the adaptive algorithm determination unit 18, the processing unit 12, and the like as described above.
  • When the magnitude of sound coming from outside the predetermined azimuth range including the azimuth of the voice source 72 is not greater than the magnitude of the voice coming from the voice source 72 (NO in step S23), steps S21 and S22 are repeated.
  • the beamformer is adaptively set according to the change in the position of the designated sound source 72, and the sound other than the sound from the designated sound source 72, that is, the sound other than the target sound is surely suppressed.
  • the predetermined action for the user to specify the voice source 72 to be subjected to voice recognition may be an operation of the switches 90 and 92, a gesture, or the like.
  • the case where the number of the microphones 22 is three has been described as an example, but the number of the microphones 22 is not limited to three, and may be four or more. If many microphones 22 are used, the direction of the sound source 72 can be determined with higher accuracy.
  • the case where the voice source 72 is located in the driver seat 40 or the passenger seat 44 has been described as an example.
  • the position of the voice source 72 is not limited to the driver seat 40 or the passenger seat 44.
  • the present invention is also applicable when the voice source 72 is located in the rear seat 70.
  • a learning processing unit 88 may be further provided.
  • the case where the output of the voice processing device according to the present embodiment is input to the automatic voice recognition device 68, i.e., the case where the output is used for voice recognition, has been described as an example.
  • the present invention is not limited to this.
  • the output of the voice processing device according to the present embodiment need not be used for automatic voice recognition.
  • the voice processing device according to the present embodiment may be applied to voice processing in a telephone conversation.
  • the voice processing device according to the present embodiment may be used to suppress sounds other than the target sound and to transmit good voice. If the voice processing device according to the present embodiment is applied to telephone conversation, a good voice conversation can be realized.
  • whether or not a predetermined gesture has been performed is determined based on an image acquired by the camera 94, but the present invention is not limited to this.
  • a motion sensor or the like may be used to determine whether a predetermined gesture has been performed.
  • the case where a plurality of microphones 22 are arranged linearly has been described as an example.
  • the arrangement of three or more microphones 22 is not limited to this.
  • the plurality of microphones 22 may be arranged on the same plane, or the plurality of microphones 22 may be arranged three-dimensionally.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mechanical Engineering (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Fittings On The Vehicle Exterior For Carrying Loads, And Devices For Holding Or Mounting Articles (AREA)

Abstract

This voice processing device comprises: a plurality of microphones 22 arranged in a vehicle; a voice source direction determination unit that determines the direction of a voice source which is the source of a voice included in a sound reception signal acquired by each of the microphones; and a beamforming processing unit that performs beamforming to suppress sounds arriving from direction ranges outside the direction range including the direction of the voice source. The beamforming processing unit performs beamforming in the direction of the voice source designated by a predetermined action.

Description

Voice processing device
The present invention relates to a voice processing device.
Various devices are provided in vehicles such as automobiles. These devices are operated, for example, by means of operation buttons, operation panels, and the like.
Meanwhile, voice recognition technology has also been proposed in recent years (Patent Documents 1 to 3).
Patent Document 1: JP 2012-215606 A
Patent Document 2: JP 2012-189906 A
Patent Document 3: JP 2012-42465 A
However, various noises exist in a vehicle. For this reason, it has not been easy to recognize voice uttered in the vehicle.
An object of the present invention is to provide a good voice processing device capable of improving the reliability of voice recognition.
According to one aspect of the present invention, there is provided a voice processing device comprising: a plurality of microphones arranged in a vehicle; a voice source direction determination unit that determines the direction of a voice source that is the source of voice included in the sound reception signals acquired by the respective microphones; and a beamforming processing unit that performs beamforming to suppress sound arriving from direction ranges other than the direction range including the direction of the voice source, wherein the beamforming processing unit performs the beamforming toward the direction of the voice source designated by a predetermined action.
According to the present invention, by performing a predetermined action, the voice source to be subjected to voice recognition can be reliably designated. Therefore, the present invention can provide a good voice processing device capable of improving the reliability of voice recognition.
FIG. 1 is a schematic diagram showing the configuration of a vehicle.
FIG. 2 is a block diagram showing the system configuration of the voice processing device according to the first embodiment of the present invention.
FIG. 3A is a schematic diagram showing an example of the microphone arrangement when the number of microphones is three.
FIG. 3B is a schematic diagram showing an example of the microphone arrangement when the number of microphones is two.
FIG. 4 is a diagram showing the beamformer algorithm.
FIG. 5 is a diagram showing the directivity of the beamformer and the angle characteristics of the voice source direction determination cancellation process.
FIG. 6 is a flowchart showing the operation of the voice processing device according to the first embodiment of the present invention.
FIG. 7 is a block diagram showing the system configuration of the voice processing device according to the second embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the voice processing device according to the second embodiment of the present invention.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. The present invention is not limited to the following embodiments and can be modified as appropriate without departing from the gist thereof. In the drawings described below, components having the same function are denoted by the same reference numerals, and their description may be omitted or simplified.
[First Embodiment]
A voice processing device according to a first embodiment of the present invention will be described with reference to FIGS. 1 to 6.
Prior to describing the voice processing device according to the present embodiment, the configuration of the vehicle will be described with reference to FIG. 1. FIG. 1 is a schematic diagram showing the configuration of a vehicle.
As shown in FIG. 1, a driver seat 40, which is the seat for the driver, and a passenger seat 44, which is the seat for the front passenger, are arranged at the front of the body (cabin) 46 of a vehicle (automobile). The driver seat 40 is located, for example, on the right side of the cabin 46. A steering wheel 78 is arranged in front of the driver seat 40. The passenger seat 44 is located, for example, on the left side of the cabin 46. The driver seat 40 and the passenger seat 44 constitute the front seats. A voice source 72a, for the case where the driver utters voice, is located in the vicinity of the driver seat 40. A voice source 72b, for the case where the front passenger utters voice, is located in the vicinity of the passenger seat 44. Since both the driver and the front passenger can move their upper bodies while seated in the seats 40 and 44, the positions of the voice sources 72 can change. A rear seat 70 is arranged at the rear of the body 46. Here, reference numeral 72 is used when the individual voice sources are not distinguished, and reference numerals 72a and 72b are used when they are distinguished.
A plurality of microphones 22 (22a to 22c), i.e., a microphone array, is arranged in front of the front seats 40 and 44. Here, reference numeral 22 is used when the individual microphones are not distinguished, and reference numerals 22a to 22c are used when they are distinguished. The microphones 22 may be arranged on the dashboard 42 or at a position close to the roof.
The distance between the voice source 72 of the front seats 40, 44 and the microphones 22 is often about several tens of centimeters. However, this distance can also be smaller than several tens of centimeters, and it can also exceed 1 m.
A speaker (loudspeaker) 76 constituting the speaker system of an on-vehicle audio device (car audio device) 84 (see FIG. 2) is arranged inside the body 46. Music emitted from the speaker 76 can be noise when performing voice recognition.
The body 46 is provided with an engine 80 for driving the vehicle. The sound emitted from the engine 80 can be noise when performing voice recognition.
Road noise, i.e., the noise generated in the cabin 46 by road-surface excitation while the vehicle is traveling, can also be noise when performing voice recognition. Wind noise generated when the vehicle travels can likewise be a noise source. Furthermore, a noise source 82 may exist outside the body 46, and sound emitted from the external noise source 82 can also be noise when performing voice recognition.
It would be convenient if operations on the various devices arranged in the body 46 could be performed by the user's voice instructions. The user's voice instruction is recognized, for example, using an automatic voice recognition device 68 (see FIG. 2). The voice processing device according to the present embodiment contributes to improving the accuracy of voice recognition in the automatic voice recognition device 68.
FIG. 2 is a block diagram showing the system configuration of the voice processing device according to the present embodiment.
As shown in FIG. 2, the voice processing device according to the present embodiment includes a preprocessing unit 10, a processing unit 12, a post-processing unit 14, a voice source direction determination unit 16, an adaptive algorithm determination unit 18, a noise model determination unit 20, and a designation input processing unit 86.
The voice processing device according to the present embodiment may further include the automatic voice recognition device 68, or the voice processing device and the automatic voice recognition device 68 may be separate devices. A device including these components and the automatic voice recognition device 68 can be referred to either as a voice processing device or as an automatic voice recognition device.
Signals acquired by the respective microphones 22a to 22c, i.e., sound reception signals, are input to the preprocessing unit 10. As the microphones 22, for example, omnidirectional microphones are used.
FIGS. 3A and 3B are schematic diagrams showing examples of microphone arrangement. FIG. 3A shows the case where the number of microphones 22 is three. FIG. 3B shows the case where the number of microphones 22 is two. The plurality of microphones 22 are arranged so as to lie on a straight line.
When the voice source 72 is located in the far field, the voice reaching the microphones 22 can be handled as a plane wave to determine the direction of the voice source 72, i.e., the direction of arrival (DOA).
When the voice source 72 is located in the near field, it is preferable to determine the direction of the voice source 72 by treating the voice reaching the microphones 22 as a spherical wave.
The distance L1 between the microphone 22a and the microphone 22b is set relatively long so as to be suitable for relatively low-frequency voice. The distance L2 between the microphone 22b and the microphone 22c is set relatively short so as to be suitable for relatively high-frequency voice.
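To illustrate the trade-off behind the two spacings, the usable frequency range of a microphone pair can be estimated from the half-wavelength (spatial aliasing) condition d ≤ λ/2. The following sketch is illustrative only and is not part of the disclosure; the spacing values are assumed:

```python
# Illustrative sketch (not part of the patent disclosure): the highest frequency
# a microphone pair can capture without spatial aliasing is f_max = c / (2 * d).
c = 340.0  # speed of sound [m/s], as in the description

def max_unaliased_frequency(d):
    """Highest frequency [Hz] free of spatial aliasing for a pair spaced d [m] apart."""
    return c / (2.0 * d)

# Hypothetical spacings: a wide pair (L1) suited to low frequencies,
# a narrow pair (L2) suited to high frequencies.
L1, L2 = 0.20, 0.05  # [m], assumed values for illustration
print(max_unaliased_frequency(L1))  # about 850 Hz
print(max_unaliased_frequency(L2))  # about 3400 Hz
```

This is why the wide pair 22a-22b serves the low band and the narrow pair 22b-22c serves the high band.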
As shown in FIG. 2, the sound reception signals acquired by the plurality of microphones 22 are input to the preprocessing unit 10. The preprocessing unit 10 performs sound field correction, in which tuning is performed in consideration of the acoustic characteristics of the cabin 46, which is the acoustic space.
When the sound reception signals acquired by the microphones 22 include music, the preprocessing unit 10 removes the music from them. A reference music signal (reference signal) is input to the preprocessing unit 10, and the preprocessing unit 10 removes the music included in the sound reception signals using this reference music signal.
The voice source direction determination unit 16 determines the direction of the voice source.
Letting the speed of sound be c [m/s], the distance between microphones be d [m], and the arrival time difference be τ [seconds], the direction θ [degrees] of the voice source 72 is expressed by the following equation (1). The speed of sound c is about 340 [m/s].
θ = sin⁻¹(cτ / d)   (1)
The position of the voice source 72 can be specified based on the arrival time difference τ.
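As an illustration of equation (1) and of estimating the arrival time difference τ, the following sketch (not part of the disclosure) computes τ from the cross-correlation of two received signals and converts it to a direction under the far-field plane-wave model; the sampling rate, spacing, and signals are assumed:

```python
import numpy as np

C = 340.0  # speed of sound [m/s]

def doa_from_delay(tau, d):
    """Equation (1): direction theta [degrees] of the voice source from the
    arrival time difference tau [s] for a microphone pair spaced d [m] apart
    (far-field plane-wave model)."""
    return np.degrees(np.arcsin(np.clip(C * tau / d, -1.0, 1.0)))

def delay_by_cross_correlation(a, b, fs):
    """Estimate the arrival time difference of b relative to a by locating the
    peak of their cross-correlation (a simple time-domain estimator)."""
    corr = np.correlate(a, b, mode="full")
    lag = np.argmax(corr) - (len(b) - 1)
    return lag / fs

# Simulated example with assumed values: spacing 0.2 m, 16 kHz sampling,
# an impulse reaching the second microphone 3 samples after the first.
fs, d = 16000, 0.2
x1 = np.zeros(64); x1[10] = 1.0
x2 = np.zeros(64); x2[13] = 1.0
tau = delay_by_cross_correlation(x2, x1, fs)  # positive: x2 lags x1
print(round(tau * fs))                        # 3 samples of delay
print(round(doa_from_delay(tau, d), 1))       # about 18.6 degrees
```

In practice more robust estimators (e.g. generalized cross-correlation) are commonly used, but the conversion from τ to θ is the same.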
The output signal of the voice source direction determination unit 16, i.e., the signal indicating the direction of the voice source 72, is input to the adaptive algorithm determination unit 18. The adaptive algorithm determination unit 18 determines an adaptive algorithm based on the direction of the voice source 72. A signal indicating the adaptive algorithm determined by the adaptive algorithm determination unit 18 is input from the adaptive algorithm determination unit 18 to the processing unit 12.
The processing unit 12 performs adaptive beamforming, i.e., signal processing that adaptively forms directivity (adaptive beamformer). The processing unit 12 not only functions as an adaptive beamformer but also controls the entire voice processing device according to the present embodiment. As the beamformer, for example, a Frost beamformer can be used; however, the beamforming is not limited to the Frost beamformer, and various beamformers can be applied as appropriate. The processing unit 12 performs beamforming based on the adaptive algorithm determined by the adaptive algorithm determination unit 18. In the present embodiment, beamforming is performed in order to reduce the sensitivity in directions other than the arrival direction of the target sound while securing the sensitivity in the arrival direction of the target sound. The target sound is, for example, the voice uttered by the driver. Since the driver can move the upper body while seated in the driver seat 40, the position of the voice source 72a can change, and the arrival direction of the target sound changes accordingly. In order to perform good voice recognition, it is preferable to reliably reduce the sensitivity in directions other than the arrival direction of the target sound. Therefore, in the present embodiment, based on the direction of the voice source 72 determined as described above, the beamformer is sequentially updated so as to suppress sound from direction ranges other than the direction range including that direction.
When the voice source 72a to be subjected to voice recognition is located in the driver seat 40, sound arriving from direction ranges other than the direction range including the direction of the driver seat 40 is suppressed.
When the voice source 72b to be subjected to voice recognition is located in the passenger seat 44, sound arriving from direction ranges other than the direction range including the direction of the passenger seat 44 may similarly be suppressed.
FIG. 4 is a diagram showing the beamformer algorithm. The sound reception signals acquired by the microphones 22a to 22c are input, via the preprocessing unit 10 (see FIG. 2), to window function / fast Fourier transform processing units 48a to 48c provided in the processing unit 12. The window function / fast Fourier transform processing units 48a to 48c perform window function processing and fast Fourier transform processing. In the present embodiment, window function processing and fast Fourier transform processing are performed because computation in the frequency domain is faster than computation in the time domain. The output signal X1,k of the window function / fast Fourier transform processing unit 48a is multiplied by the beamformer weight W1,k* at multiplication point 50a. The output signal X2,k of the window function / fast Fourier transform processing unit 48b is multiplied by the beamformer weight W2,k* at multiplication point 50b. The output signal X3,k of the window function / fast Fourier transform processing unit 48c is multiplied by the beamformer weight W3,k* at multiplication point 50c. The signals multiplied at the multiplication points 50a to 50c are added at addition point 52. The signal Yk obtained at addition point 52 is input to an inverse fast Fourier transform / overlap-add processing unit 54 provided in the processing unit 12. The inverse fast Fourier transform / overlap-add processing unit 54 performs inverse fast Fourier transform processing and processing by the overlap-add (OLA) method, which returns the frequency-domain signal to the time domain. The signal thus processed is input from the inverse fast Fourier transform / overlap-add processing unit 54 to the post-processing unit 14.
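The weighted-sum structure of FIG. 4 can be sketched as follows. This is an illustrative sketch only: fixed averaging weights stand in for the adaptively updated Frost weights, and the frame length and window are assumed:

```python
import numpy as np

def beamform_block(frames, weights):
    """One block of the FIG. 4 structure: FFT each windowed microphone frame,
    multiply by the conjugate beamformer weights W*_{m,k}, sum over microphones,
    and return the time-domain block via inverse FFT.
    frames:  (M, N) array, one windowed frame per microphone
    weights: (M, N//2+1) array of complex weights per microphone and bin
    """
    X = np.fft.rfft(frames, axis=1)            # X_{m,k}
    Y = np.sum(np.conj(weights) * X, axis=0)   # Y_k = sum_m W*_{m,k} X_{m,k}
    return np.fft.irfft(Y, n=frames.shape[1])

def overlap_add(blocks, hop):
    """Reassemble overlapping time-domain blocks into one signal (OLA method)."""
    block_len = len(blocks[0])
    out = np.zeros(hop * (len(blocks) - 1) + block_len)
    for i, b in enumerate(blocks):
        out[i * hop : i * hop + block_len] += b
    return out

# Minimal demo with assumed parameters: 3 microphones, 256-sample frames,
# and uniform averaging weights standing in for the adaptive Frost weights.
M, N, hop = 3, 256, 128
rng = np.random.default_rng(0)
x = rng.standard_normal((M, N))
w = np.ones((M, N // 2 + 1), dtype=complex) / M   # average over microphones
y = beamform_block(np.hanning(N) * x, w)
print(y.shape)  # (256,)
```

With these uniform weights the block reduces to a plain average of the windowed microphone signals; the adaptive algorithm of the embodiment would instead update the weights per frequency bin to steer the directivity.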
FIG. 5 is a diagram showing the directivity of the beamformer and the angle characteristics of the voice source direction determination cancellation process. The solid line indicates the directivity of the beamformer, and the dash-dotted line indicates the angle characteristics of the voice source direction determination cancellation process. As can be seen from FIG. 5, the output signal power reaches minima at, for example, azimuth angles β1 and β2, and is sufficiently suppressed between β1 and β2 as well. If a directional beamformer as shown in FIG. 5 is used, sound arriving from the passenger seat can be sufficiently suppressed, while voice arriving from the driver seat reaches the microphones 22 with almost no suppression. In the present embodiment, when the sound arriving from direction ranges other than the direction range including the direction of the voice source 72 is louder than the voice arriving from the voice source 72, the determination of the direction of the voice source 72 is suspended (voice source direction determination cancellation process). For example, when the beamformer is set to acquire the voice of the driver and the voice of the front passenger is louder than the voice of the driver, the estimation of the direction of the voice source is suspended. In this case, the sound reception signals acquired by the microphones 22 are sufficiently suppressed. For example, when voice arriving from a direction smaller than γ1 or larger than γ2 is louder than the voice of the driver, the voice source direction determination cancellation process is performed. Although the case where the beamformer is set to acquire the voice of the driver has been described here as an example, the beamformer may instead be set to acquire the voice of the front passenger; in that case, when the voice of the driver is louder than the voice of the front passenger, the estimation of the direction of the voice source is suspended.
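The decision logic of the voice source direction determination cancellation process can be sketched schematically as follows (illustrative only; the power estimates are assumed to be computed elsewhere, e.g. from the angle ranges of FIG. 5):

```python
def should_cancel(in_range_power, out_of_range_power):
    """Suspend the direction update when sound from outside the direction
    range containing the designated voice source is at least as loud as the
    voice from the source itself."""
    return out_of_range_power >= in_range_power

def update_direction(current_doa, frame_doa, in_power, out_power):
    """Keep the previous direction estimate while cancellation is active;
    otherwise follow the newly determined direction."""
    if should_cancel(in_power, out_power):
        return current_doa          # direction determination suspended
    return frame_doa                # direction determination proceeds

# Example: driver at 20 degrees; a louder front-passenger voice does not
# drag the estimate away from the driver.
print(update_direction(20.0, -35.0, in_power=0.2, out_power=0.9))  # 20.0
print(update_direction(20.0, 18.0, in_power=0.8, out_power=0.1))   # 18.0
```

The effect is that a louder interfering talker cannot re-steer the beam away from the designated voice source.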
In this way, a signal in which sound arriving from direction ranges other than the direction range including the direction of the voice source 72 has been suppressed is output from the processing unit 12. The output signal of the processing unit 12 is input to the post-processing unit 14.
The post-processing unit (post-processing adaptive filter) 14 removes noise such as engine noise, road noise, and wind noise. The noise model determination unit 20 generates a reference noise signal by performing noise modeling processing. The reference noise signal output from the noise model determination unit 20 serves as a reference signal for removing noise from a noise-containing signal. The reference noise signal is input to the post-processing unit 14, which uses it to remove noise from the noise-containing signal and outputs the noise-removed signal. The post-processing unit 14 also performs distortion reduction processing. Note that noise removal is not performed only in the post-processing unit 14: noise is removed from the sound acquired via the microphones 22 by the series of processes performed in the preprocessing unit 10, the processing unit 12, and the post-processing unit 14.
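The removal of noise using a reference noise signal can be illustrated with a normalized-LMS adaptive filter, a common technique for this kind of reference-based cancellation. This sketch is illustrative and not the disclosed implementation; the filter length, step size, and signals are assumed:

```python
import numpy as np

def nlms_cancel(primary, reference, taps=32, mu=0.5, eps=1e-8):
    """Subtract the component of `primary` that is correlated with the noise
    `reference`, using a normalized-LMS adaptive filter. Returns the
    noise-reduced signal."""
    w = np.zeros(taps)
    buf = np.zeros(taps)
    out = np.zeros_like(primary)
    for n in range(len(primary)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        y = w @ buf                  # estimated noise component
        e = primary[n] - y           # error = cleaned sample
        w += mu * e * buf / (buf @ buf + eps)
        out[n] = e
    return out

# Demo with synthetic, assumed data: a tone standing in for voice, plus a
# reference noise signal passed through an unknown short acoustic path.
rng = np.random.default_rng(1)
fs = 8000
t = np.arange(fs) / fs
speech = 0.5 * np.sin(2 * np.pi * 440 * t)
ref = rng.standard_normal(len(t))                           # reference noise
noise_in_mic = np.convolve(ref, [0.6, 0.3, 0.1])[: len(t)]  # unknown path
cleaned = nlms_cancel(speech + noise_in_mic, ref)
# After convergence the residual noise power is far below the original.
print(np.var(noise_in_mic) > np.var(cleaned[4000:] - speech[4000:]))  # True
```

The same reference-signal idea underlies both the music removal in the preprocessing unit 10 and the noise removal in the post-processing unit 14.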
The signal post-processed by the post-processing unit 14 is output to the automatic voice recognition device 68. Since a good target sound in which sounds other than the target sound have been suppressed is input to the automatic voice recognition device 68, the automatic voice recognition device 68 can improve the accuracy of voice recognition. Based on the voice recognition result of the automatic voice recognition device 68, operations on devices mounted on the vehicle are performed automatically.
The voice recognition result of the automatic voice recognition device 68 is also input to the designation input processing unit 86. The designation input processing unit 86 allows the user (occupant) to designate, by performing a predetermined action, the voice source 72 to be subjected to voice recognition. An example of the predetermined action is the utterance of a predetermined word: the user who has uttered the predetermined word is designated as the voice source 72 to be subjected to voice recognition. The voice source 72 designated by performing the predetermined action is referred to as the designated voice source.
The designation input processing unit 86 determines, based on the voice recognition result of the automatic voice recognition device 68, whether or not the predetermined word has been uttered. A signal indicating whether or not the predetermined word has been uttered is input from the designation input processing unit 86 to the processing unit 12. When the predetermined word has been uttered, the processing unit 12 performs beamforming so as to suppress sound arriving from direction ranges other than the direction range including the direction of the voice source 72 that uttered the predetermined word. The direction of the voice source 72 that uttered the predetermined word is determined by the voice source direction determination unit 16.
Next, the operation of the voice processing device according to the present embodiment will be described with reference to FIG. 6. FIG. 6 is a flowchart showing the operation of the voice processing device according to the present embodiment.
First, the voice processing device is powered on (step S1).
Next, when the user utters the predetermined word (YES in step S2), the voice source 72 that uttered the predetermined word is designated as the designated voice source (step S3). When the predetermined word is not uttered (NO in step S2), step S2 is repeated. The designated voice source is the voice source 72 to be subjected to voice recognition. Since the direction of the voice source 72 that uttered the predetermined word is determined by the voice source direction determination unit 16, it is possible to determine from which seat the user uttered the predetermined word. In this way, the voice source 72 that uttered the predetermined word is determined, and the designated voice source 72 to be subjected to voice recognition is designated.
 次に、指定音声源72の方位が判定される(ステップS4)。指定音声源72の方位の判定は、音声源方位判定部16によって行われる。 Next, the orientation of the designated audio source 72 is determined (step S4). The direction of the designated audio source 72 is determined by the audio source direction determining unit 16.
 次に、指定音声源72の方位に応じて、ビームフォーマの指向性を設定する(ステップS5)。ビームフォーマの指向性の設定は、上述したように、適応アルゴリズム決定部18、処理部12等によって行われる。 Next, the directivity of the beamformer is set according to the direction of the designated audio source 72 (step S5). The setting of the beamformer directivity is performed by the adaptive algorithm determination unit 18, the processing unit 12, and the like as described above.
 指定音声源72の方位を含む所定の方位範囲以外の方位範囲から到来する音の大きさが、指定音声源72から到来する音声の大きさ以上である場合には(ステップS5においてYES)、音声源72の判定を中断する(ステップS7)。 When the volume of sound coming from an azimuth range other than the predetermined azimuth range including the azimuth of designated voice source 72 is equal to or greater than the magnitude of voice coming from designated voice source 72 (YES in step S5), the voice The determination of the source 72 is interrupted (step S7).
 一方、音声源72の方位を含む所定の方位範囲以外の方位範囲から到来する音の大きさが、音声源72から到来する音声の大きさ以上でない場合には(ステップS6においてNO)、ステップS4、S5が繰り返し行われる。 On the other hand, when the magnitude of the sound coming from the azimuth range other than the predetermined azimuth range including the azimuth of the voice source 72 is not greater than the magnitude of the voice coming from the voice source 72 (NO in step S6), step S4 , S5 is repeated.
 こうして、指定音声源72の位置の変化に応じて、ビームフォーマが適応的に設定され、指定音声源72からの音声以外の音、即ち、目的音以外の音が確実に抑制される。 Thus, the beamformer is adaptively set according to the change in the position of the designated sound source 72, and the sound other than the sound from the designated sound source 72, that is, the sound other than the target sound is surely suppressed.
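The loop of steps S2 through S7 above can be condensed into the following sketch. The event dictionaries and level comparison are stand-ins for the actual voice source direction determination unit 16 and processing unit 12, introduced only for illustration.

```python
# Condensed sketch of the FIG. 6 flow (steps S2-S7). Each event is a
# hypothetical observation frame; field names are assumptions.

def run_session(events):
    """events: dicts with keys 'word', 'azimuth', 'target_level',
    'other_level'. Returns a trace of the actions taken."""
    trace = []
    designated = False
    for ev in events:
        if not designated:
            if ev.get("word"):                 # S2: predetermined word uttered?
                designated = True              # S3: designate that source
                trace.append("designated")
            continue                           # otherwise S2 repeats
        azimuth = ev["azimuth"]                # S4: judge source direction
        trace.append(f"beamform@{azimuth}")    # S5: steer the beamformer
        if ev["other_level"] >= ev["target_level"]:  # S6: interference too loud?
            trace.append("suspended")          # S7: suspend determination
            break
    return trace

trace = run_session([
    {"word": False}, {"word": True},
    {"azimuth": -30, "target_level": 60, "other_level": 40},
    {"azimuth": -28, "target_level": 55, "other_level": 70},
])
print(trace)
```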
 As described above, according to the present embodiment, the voice source 72 to be subjected to speech recognition can be reliably designated by uttering the predetermined word. The present embodiment can therefore provide a good speech processing apparatus capable of improving the reliability of speech recognition.
 [Second Embodiment]
 A speech processing apparatus according to a second embodiment of the present invention will be described with reference to FIGS. 7 and 8. FIG. 7 is a block diagram showing the system configuration of the speech processing apparatus according to the present embodiment. Components identical to those of the speech processing apparatus according to the first embodiment shown in FIGS. 1 to 6 are denoted by the same reference numerals, and their description is omitted or simplified.
 In the speech processing apparatus according to the present embodiment, the predetermined action by which a user designates the voice source 72 to be subjected to speech recognition is an operation of the switch 90 or 92, or a gesture.
 As shown in FIG. 7, the speech processing apparatus according to the present embodiment includes the pre-processing unit 10, the processing unit 12, the post-processing unit 14, the voice source direction determination unit 16, the adaptive algorithm determination unit 18, and the engine noise model determination unit 20. The speech processing apparatus according to the present embodiment further includes a learning processing unit 88, a driver-seat-side switch 90, a passenger-seat-side switch 92, a camera 94, a switch designation input processing unit 96, and an image designation input processing unit 98.
 The driver-seat-side switch 90 is arranged near the driver's seat 40, and the passenger-seat-side switch 92 is arranged near the passenger seat 44. Both switches are connected to the switch designation input processing unit 96.
 The switch designation input processing unit 96 allows a user to designate the voice source 72 to be subjected to speech recognition by operating the switch 90 or 92. When the driver-seat-side switch 90 is operated, the voice source 72a located at the driver's seat is designated as the designated voice source to be subjected to speech recognition. When the passenger-seat-side switch 92 is operated, the voice source 72b located at the passenger seat is designated as the designated voice source.
 When the driver-seat-side switch 90 is operated, a signal indicating that it has been operated is input from the switch designation input processing unit 96 to the processing unit 12, and the processing unit 12 performs beamforming so as to suppress sound arriving from azimuth ranges other than the azimuth range including the direction of the voice source 72a located at the driver's seat 40.
 When the passenger-seat-side switch 92 is operated, a signal indicating that it has been operated is input from the switch designation input processing unit 96 to the processing unit 12, and the processing unit 12 performs beamforming so as to suppress sound arriving from azimuth ranges other than the azimuth range including the direction of the voice source 72b located at the passenger seat 44.
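The switch-to-beam mapping just described can be sketched as a simple lookup. The azimuth angles below are invented for the sketch; the patent does not specify them, and the switch identifiers are illustrative assumptions.

```python
# Illustrative mapping from the seat switches (90, 92) to the azimuth
# range the processing unit should pass. Angles are made-up examples.

SEAT_RANGES = {
    "driver": (-45.0, -15.0),     # assumed range covering driver's seat 40
    "passenger": (15.0, 45.0),    # assumed range covering passenger seat 44
}

def on_switch(switch_id):
    """Return the azimuth pass range selected by a switch press."""
    if switch_id == "driver_switch_90":
        return SEAT_RANGES["driver"]
    if switch_id == "passenger_switch_92":
        return SEAT_RANGES["passenger"]
    raise ValueError(f"unknown switch: {switch_id}")

print(on_switch("driver_switch_90"))
print(on_switch("passenger_switch_92"))
```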
 A camera 94 is also arranged in the vehicle 46. An image acquired by the camera 94 is input to the image designation input processing unit 98. The image designation input processing unit 98 allows a user (occupant) to designate the voice source 72 to be subjected to speech recognition by performing a predetermined action, for example a predetermined gesture (a body movement or pose). The user who performed the predetermined gesture is designated as the voice source (designated voice source) 72 to be subjected to speech recognition.
 The image designation input processing unit 98 determines, based on the image acquired by the camera 94, whether or not the predetermined gesture has been performed. A signal indicating whether or not the predetermined gesture has been performed is input from the image designation input processing unit 98 to the processing unit 12. When the driver performs the predetermined gesture, the processing unit 12 performs beamforming so as to suppress sound arriving from azimuth ranges other than the azimuth range including the direction of the voice source 72a located at the driver's seat 40. When the front passenger performs the predetermined gesture, the processing unit 12 performs beamforming so as to suppress sound arriving from azimuth ranges other than the azimuth range including the direction of the voice source 72b located at the passenger seat 44.
 The learning processing unit 88 is connected to the processing unit 12. The learning processing unit 88 learns, for each of the voice sources 72a and 72b, the beamforming suited to that voice source. The learning processing unit 88 is provided in the present embodiment for the following reason. In the present embodiment, the predetermined action by which a user designates the voice source 72 to be subjected to speech recognition is an operation of the switch 90 or 92, or a gesture; that is, the voice source 72 is designated by means other than voice. Therefore, at the moment the voice source 72 to be subjected to speech recognition is designated, voice from the designated voice source 72 is not necessarily being obtained via the microphones 22. In order to reliably process the voice from the designated voice source 72 after it has been designated, it is preferable to learn in advance the beamforming suited to the designated voice source 72 and to apply that beamforming. For this reason, the learning processing unit 88 is provided in the present embodiment. When voice is emitted from the voice source 72a, the learning processing unit 88 learns the beamforming suited to acquiring the voice from the voice source 72a; likewise, when voice is emitted from the voice source 72b, it learns the beamforming suited to acquiring the voice from the voice source 72b.
 When the voice source 72a located at the driver's seat 40 is designated as the designated voice source, the beamforming learned as suited to the voice source 72a located at the driver's seat 40 is applied. When the voice source 72b located at the passenger seat 44 is designated as the designated voice source, the beamforming learned as suited to the voice source 72b located at the passenger seat 44 is applied.
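The learn-then-apply behaviour of the learning processing unit 88 can be sketched as below. The "learning" shown is a simple delay-and-sum fit under an assumed three-microphone linear geometry; the patent does not specify the learning algorithm, array layout, or these numeric values.

```python
import numpy as np

# Sketch of a per-seat beamformer store: while a seat's occupant speaks,
# fit and store weights suited to that seat; re-apply them when the seat
# is later designated by switch or gesture. Geometry is an assumption.

C = 343.0          # speed of sound, m/s
FS = 16000         # sample rate, Hz
MIC_X = np.array([-0.05, 0.0, 0.05])  # assumed 3-mic linear array positions (m)

def steering_delays(azimuth_deg):
    """Per-microphone delays (in samples) for a far-field source."""
    return MIC_X * np.sin(np.radians(azimuth_deg)) / C * FS

class LearningUnit:
    def __init__(self):
        self.learned = {}

    def learn(self, seat, azimuth_deg):
        """Store delays fitted while this seat's occupant is speaking."""
        self.learned[seat] = steering_delays(azimuth_deg)

    def apply(self, seat):
        """Retrieve the beamformer learned for the designated seat."""
        if seat not in self.learned:
            raise KeyError(f"no beamformer learned for {seat}")
        return self.learned[seat]

unit = LearningUnit()
unit.learn("driver", -30.0)     # learned while voice source 72a speaks
unit.learn("passenger", 30.0)   # learned while voice source 72b speaks
d = unit.apply("driver")
print(d.shape, bool(d[0] > 0 and d[2] < 0))
```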
 The signal post-processed by the post-processing unit 14 is output as the voice output.
 Next, the operation of the speech processing apparatus according to the present embodiment will be described with reference to FIG. 8. FIG. 8 is a flowchart showing the operation of the speech processing apparatus according to the present embodiment.
 First, the speech processing apparatus is powered on (step S10).
 Next, beamforming learning is performed (step S11). When voice is emitted from the voice source 72a located at the driver's seat 40, the learning processing unit 88 learns the beamforming suited to the voice source 72a located at the driver's seat 40. When voice is emitted from the voice source 72b located at the passenger seat 44, the learning processing unit 88 learns the beamforming suited to the voice source 72b located at the passenger seat 44.
 When the driver-seat-side switch 90 is operated, specifically, when the driver-seat-side switch 90 is turned on (YES in step S12), the beamforming learned by the learning processing unit 88 as suited to the voice source 72a located at the driver's seat 40 is applied (step S13).
 When the driver-seat-side switch 90 has not been operated (NO in step S12), whether or not the passenger-seat-side switch 92 has been operated is checked (step S14). When the passenger-seat-side switch 92 is operated, specifically, when the passenger-seat-side switch 92 is turned on (YES in step S14), the beamforming learned by the learning processing unit 88 as suited to the voice source 72b located at the passenger seat 44 is applied (step S15).
 When the passenger-seat-side switch 92 has not been operated (NO in step S14), whether or not the driver has performed the predetermined gesture is checked (step S16). When the driver has performed the predetermined gesture (YES in step S16), the beamforming learned by the learning processing unit 88 as suited to the voice source 72a located at the driver's seat 40 is applied (step S17).
 When the driver has not performed the predetermined gesture (NO in step S16), whether or not the front passenger has performed the predetermined gesture is checked (step S18). When the front passenger has performed the predetermined gesture (YES in step S18), the beamforming learned by the learning processing unit 88 as suited to the voice source 72b located at the passenger seat 44 is applied (step S19).
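The check order of steps S12 through S19 can be condensed into the following sketch: switches are checked before gestures, and the driver's seat before the passenger seat. The boolean inputs are illustrative assumptions.

```python
# Condensed sketch of FIG. 8 steps S12-S19: the first designation input
# that fires selects which learned beamformer to apply.

def choose_designation(driver_sw, passenger_sw, driver_gesture, passenger_gesture):
    if driver_sw:           # S12 -> S13
        return "driver"
    if passenger_sw:        # S14 -> S15
        return "passenger"
    if driver_gesture:      # S16 -> S17
        return "driver"
    if passenger_gesture:   # S18 -> S19
        return "passenger"
    return None             # no designation yet: keep checking

print(choose_designation(False, True, True, False))   # switch checked before gesture
print(choose_designation(False, False, False, True))
```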
 Next, when voice is emitted from the designated voice source 72, the direction of the designated voice source 72 is determined (step S21). As described above, the direction of the designated voice source 72 is determined by the voice source direction determination unit 16.
 Next, the directivity of the beamformer is set according to the direction of the designated voice source 72 (step S22). As described above, the beamformer directivity is set by the adaptive algorithm determination unit 18, the processing unit 12, and so on.
 When the loudness of sound arriving from azimuth ranges other than the predetermined azimuth range including the direction of the designated voice source 72 is equal to or greater than the loudness of the voice arriving from the designated voice source 72 (YES in step S23), the determination of the voice source 72 is suspended (step S24).
 On the other hand, when the loudness of sound arriving from azimuth ranges other than the predetermined azimuth range including the direction of the voice source 72 is not equal to or greater than the loudness of the voice arriving from the voice source 72 (NO in step S23), steps S21 and S22 are repeated.
 In this way, the beamformer is adaptively set in accordance with changes in the position of the designated voice source 72, and sounds other than the voice from the designated voice source 72, i.e., sounds other than the target sound, are reliably suppressed.
 As described above, the predetermined action by which a user designates the voice source 72 to be subjected to speech recognition may be an operation of the switch 90 or 92, a gesture, or the like.
 [Modified Embodiments]
 The present invention is not limited to the above embodiments, and various modifications are possible.
 For example, in the above embodiments, the case where the number of microphones 22 is three has been described as an example, but the number of microphones 22 is not limited to three and may be four or more. Using more microphones 22 allows the direction of the voice source 72 to be determined with higher accuracy.
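One reason more microphones improve accuracy is sketched below: the azimuth follows from the time difference of arrival (TDOA) between microphone pairs, and extra microphones provide more independent pair estimates to average. The geometry, noise factors, and function names are assumptions for illustration.

```python
import numpy as np

# Sketch: far-field azimuth from one pair's TDOA, and averaging over
# noisy estimates from multiple pairs. Values are illustrative.

C = 343.0  # speed of sound, m/s

def azimuth_from_tdoa(tdoa_s, mic_spacing_m):
    """Far-field azimuth (degrees) implied by one microphone pair's TDOA."""
    s = np.clip(tdoa_s * C / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))

# Simulate a source at 20 degrees observed by two pairs with noisy TDOAs.
true_az = 20.0
spacing = 0.1
tdoa = spacing * np.sin(np.radians(true_az)) / C
estimates = [azimuth_from_tdoa(tdoa * e, spacing) for e in (0.97, 1.03)]
avg = sum(estimates) / len(estimates)
print(abs(avg - true_az) < 1.5)  # averaging pairs cancels much of the noise
```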
 In the above embodiments, the case where the voice source 72 is located at the driver's seat 40 or the passenger seat 44 has been described as an example, but the position of the voice source 72 is not limited to the driver's seat 40 or the passenger seat 44. For example, the present invention is also applicable when the voice source 72 is located at the rear seat 70.
 The learning processing unit 88 may also be provided in the first embodiment.
 In the above embodiments, the case where the output of the speech processing apparatus is input to the automatic speech recognition device 68, i.e., the case where the output of the speech processing apparatus is used for speech recognition, has been described as an example, but the invention is not limited to this. The output of the speech processing apparatus need not be used for automatic speech recognition. For example, the speech processing apparatus may be applied to voice processing in telephone conversation. Specifically, the speech processing apparatus may be used to suppress sounds other than the target sound and transmit good-quality voice. Applying the speech processing apparatus to telephone conversation makes it possible to realize calls with good voice quality.
 In the second embodiment, whether or not the predetermined gesture has been performed is determined based on the image acquired by the camera 94, but the invention is not limited to this. For example, whether or not the predetermined gesture has been performed may be determined using a motion sensor or the like.
 In the above embodiments, the case where the plurality of microphones 22 are arranged in a straight line has been described as an example, but the arrangement of the three or more microphones 22 is not limited to this. For example, the plurality of microphones 22 may be arranged on the same plane, or may be arranged three-dimensionally.
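A planar or three-dimensional layout needs no special handling in a beamformer: the far-field steering vector depends only on each microphone's position projected onto the arrival direction, so one formula covers linear, planar, and 3-D arrays. The positions, frequency, and function names below are assumptions for the sketch.

```python
import numpy as np

# Sketch: a narrowband far-field steering vector for an arbitrary
# microphone geometry. Illustrative values only.

C = 343.0  # speed of sound, m/s

def steering_vector(mic_pos, azimuth_deg, elevation_deg, freq_hz):
    """Complex steering vector for microphones at mic_pos (N x 3, metres)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    direction = np.array([np.cos(el) * np.cos(az),
                          np.cos(el) * np.sin(az),
                          np.sin(el)])
    delays = mic_pos @ direction / C             # arrival delay per mic (s)
    return np.exp(-2j * np.pi * freq_hz * delays)

# A 4-microphone planar (square) array handled by the same code path.
mics = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                 [0.0, 0.1, 0.0], [0.1, 0.1, 0.0]])
v = steering_vector(mics, 30.0, 0.0, 1000.0)
print(v.shape, bool(np.allclose(np.abs(v), 1.0)))
```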
 This application claims priority from Japanese Patent Application No. 2014-263921 filed on December 26, 2014, the contents of which are incorporated herein by reference.
22, 22a to 22c … microphone
40 … driver's seat
42 … dashboard
44 … passenger seat
46 … vehicle body
72, 72a, 72b … voice source
76 … speaker
78 … steering wheel
80 … engine
82 … external noise source
84 … in-vehicle acoustic equipment

Claims (5)

  1.  A voice processing apparatus comprising:
      a plurality of microphones arranged in a vehicle;
      a voice source direction determination unit that determines the direction of a voice source that is the source of sound included in received sound signals acquired by each of the plurality of microphones; and
      a beamforming processing unit that performs beamforming to suppress sound arriving from azimuth ranges other than the azimuth range including the direction of the voice source,
      wherein the beamforming processing unit performs the beamforming toward the direction of the voice source designated by a predetermined action.
  2.  The voice processing apparatus according to claim 1, wherein the predetermined action is utterance of a predetermined word.
  3.  The voice processing apparatus according to claim 1, wherein the predetermined action is an operation of a predetermined switch.
  4.  The voice processing apparatus according to claim 1, wherein the predetermined action is a predetermined gesture.
  5.  The voice processing apparatus according to any one of claims 1 to 4, further comprising a learning processing unit that learns, for each voice source, the beamforming suited to that voice source,
      wherein the beamforming learned by the learning processing unit is applied when the voice source is designated by the predetermined action.
PCT/JP2015/006448 2014-12-26 2015-12-24 Voice processing device WO2016103710A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2014263921A JP2016126022A (en) 2014-12-26 2014-12-26 Speech processing unit
JP2014-263921 2014-12-26

Publications (1)

Publication Number Publication Date
WO2016103710A1 true WO2016103710A1 (en) 2016-06-30

Family

ID=56149768

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/006448 WO2016103710A1 (en) 2014-12-26 2015-12-24 Voice processing device

Country Status (2)

Country Link
JP (1) JP2016126022A (en)
WO (1) WO2016103710A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108674344A (en) * 2018-03-30 2018-10-19 斑马网络技术有限公司 Speech processing system based on steering wheel and its application
CN112911465A (en) * 2021-02-01 2021-06-04 杭州海康威视数字技术股份有限公司 Signal sending method and device and electronic equipment

Families Citing this family (7)

Publication number Priority date Publication date Assignee Title
JP6643720B2 (en) 2016-06-24 2020-02-12 ミツミ電機株式会社 Lens driving device, camera module and camera mounting device
JP6755843B2 (en) 2017-09-14 2020-09-16 株式会社東芝 Sound processing device, voice recognition device, sound processing method, voice recognition method, sound processing program and voice recognition program
JP6872710B2 (en) * 2017-10-26 2021-05-19 パナソニックIpマネジメント株式会社 Directivity control device and directivity control method
CN108597507A (en) * 2018-03-14 2018-09-28 百度在线网络技术(北京)有限公司 Far field phonetic function implementation method, equipment, system and storage medium
JP7223561B2 (en) * 2018-03-29 2023-02-16 パナソニックホールディングス株式会社 Speech translation device, speech translation method and its program
KR102208536B1 (en) * 2019-05-07 2021-01-27 서강대학교산학협력단 Speech recognition device and operating method thereof
JP6888851B1 (en) * 2020-04-06 2021-06-16 山内 和博 Self-driving car

Citations (4)

Publication number Priority date Publication date Assignee Title
JP2001296891A (en) * 2000-04-14 2001-10-26 Mitsubishi Electric Corp Method and device for voice recognition
JP2004109361A (en) * 2002-09-17 2004-04-08 Toshiba Corp Device, method, and program for setting directivity
JP2014153663A (en) * 2013-02-13 2014-08-25 Sony Corp Voice recognition device, voice recognition method and program
JP2014203031A (en) * 2013-04-09 2014-10-27 小島プレス工業株式会社 Speech recognition control device

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
JP3484112B2 (en) * 1999-09-27 2004-01-06 株式会社東芝 Noise component suppression processing apparatus and noise component suppression processing method
JP4097219B2 (en) * 2004-10-25 2008-06-11 本田技研工業株式会社 Voice recognition device and vehicle equipped with the same
CN101238511B (en) * 2005-08-11 2011-09-07 旭化成株式会社 Sound source separating device, speech recognizing device, portable telephone, and sound source separating method, and program
GB0906269D0 (en) * 2009-04-09 2009-05-20 Ntnu Technology Transfer As Optimal modal beamformer for sensor arrays
JP5962038B2 (en) * 2012-02-03 2016-08-03 ソニー株式会社 Signal processing apparatus, signal processing method, program, signal processing system, and communication terminal

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
JP2001296891A (en) * 2000-04-14 2001-10-26 Mitsubishi Electric Corp Method and device for voice recognition
JP2004109361A (en) * 2002-09-17 2004-04-08 Toshiba Corp Device, method, and program for setting directivity
JP2014153663A (en) * 2013-02-13 2014-08-25 Sony Corp Voice recognition device, voice recognition method and program
JP2014203031A (en) * 2013-04-09 2014-10-27 小島プレス工業株式会社 Speech recognition control device

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN108674344A (en) * 2018-03-30 2018-10-19 斑马网络技术有限公司 Speech processing system based on steering wheel and its application
CN108674344B (en) * 2018-03-30 2024-04-02 斑马网络技术有限公司 Voice processing system based on steering wheel and application thereof
CN112911465A (en) * 2021-02-01 2021-06-04 杭州海康威视数字技术股份有限公司 Signal sending method and device and electronic equipment

Also Published As

Publication number Publication date
JP2016126022A (en) 2016-07-11

Similar Documents

Publication Publication Date Title
WO2016103710A1 (en) Voice processing device
WO2016103709A1 (en) Voice processing device
WO2016143340A1 (en) Speech processing device and control device
CN110691299B (en) Audio processing system, method, apparatus, device and storage medium
JP5913340B2 (en) Multi-beam acoustic system
JP4779748B2 (en) Voice input / output device for vehicle and program for voice input / output device
CN105592384B (en) System and method for controlling internal car noise
US9953641B2 (en) Speech collector in car cabin
WO2017081960A1 (en) Voice recognition control system
CN105635501A (en) System and method for echo cancellation
JP6635394B1 (en) Audio processing device and audio processing method
JP2007180896A (en) Voice signal processor and voice signal processing method
JP2024026716A (en) Signal processor and signal processing method
JP2002351488A (en) Noise canceller and on-vehicle system
GB2560498A (en) System and method for noise cancellation
US20220189450A1 (en) Audio processing system and audio processing device
JP2009073417A (en) Apparatus and method for controlling noise
JP6332072B2 (en) Dialogue device
JP6606921B2 (en) Voice direction identification device
JP4660740B2 (en) Voice input device for electric wheelchair
JP4508147B2 (en) In-vehicle hands-free device
JP6573657B2 (en) Volume control device, volume control method, and volume control program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15872281

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15872281

Country of ref document: EP

Kind code of ref document: A1