WO2016103710A1

WO2016103710A1 - Voice processing device

Info

Publication number: WO2016103710A1
Application number: PCT/JP2015/006448
Authority: WO
Inventors: サシャヴラジック; 岡田　広毅
Original assignee: アイシン精機株式会社; トヨタ自動車株式会社
Priority date: 2014-12-26
Filing date: 2015-12-24
Publication date: 2016-06-30
Also published as: JP2016126022A

Abstract

This voice processing device comprises: a plurality of microphones 22 arranged in a vehicle; a voice source direction determination unit that determines the direction of a voice source which is the source of a voice included in a sound reception signal acquired by each of the microphones; and a beamforming processing unit that performs beamforming to suppress sounds arriving from direction ranges outside the direction range including the direction of the voice source. The beamforming processing unit performs beamforming in the direction of the voice source designated by a predetermined action.

Description

Audio processing device

The present invention relates to a voice processing device.

Various devices are provided in vehicles such as automobiles. Operations on these various devices are performed, for example, by operating operation buttons, operation panels, and the like.

On the other hand, recently, voice recognition technology has also been proposed (Patent Documents 1 to 3).

JP 2012-215606 A JP 2012-189906 A JP 2012-42465 A

However, various noises exist in the vehicle. For this reason, it is not easy to recognize the voice generated in the vehicle.

An object of the present invention is to provide a good speech processing apparatus capable of improving the certainty of speech recognition.

According to one aspect of the present invention, a plurality of microphones arranged in a vehicle, and a sound source that determines an orientation of a sound source that is a sound source included in a sound reception signal acquired by each of the plurality of microphones An azimuth determining unit, and a beam forming processing unit that performs beam forming to suppress sound coming from an azimuth range other than the azimuth range including the azimuth range of the sound source, and the beam forming processing unit An audio processing apparatus is provided that performs the beamforming in the direction of the specified audio source.

According to the present invention, by performing a predetermined action, it is possible to reliably specify a voice source to be a target of voice recognition. For this reason, according to this invention, the favorable audio processing apparatus which can improve the reliability of audio | voice recognition can be provided.

It is the schematic which shows the structure of a vehicle. It is a block diagram which shows the system configuration | structure of the audio processing apparatus by 1st Embodiment of this invention. It is the schematic which shows the example of arrangement | positioning of a microphone in case the number of microphones is three. It is the schematic which shows the example of arrangement | positioning of a microphone in case the number of microphones is two. It is a figure which shows the algorithm of a beam former. It is a figure which shows the angle characteristic of the directivity of a beam former, and an audio source direction determination cancellation process. It is a flowchart which shows operation | movement of the audio processing apparatus by 1st Embodiment of this invention. It is a block diagram which shows the system configuration | structure of the audio processing apparatus by 2nd Embodiment of this invention. It is a flowchart which shows operation | movement of the audio processing apparatus by 2nd Embodiment of this invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In addition, this invention is not limited to the following embodiment, In the range which does not deviate from the summary, it can change suitably. In the drawings described below, components having the same function are denoted by the same reference numerals, and the description thereof may be omitted or simplified.

[First Embodiment]
A speech processing apparatus according to a first embodiment of the present invention will be described with reference to FIGS.

Prior to describing the speech processing apparatus according to the present embodiment, the configuration of the vehicle will be described with reference to FIG. FIG. 1 is a schematic diagram showing a configuration of a vehicle.

As shown in FIG. 1, a driver's seat 40 that is a driver's seat and a passenger's seat 44 that is a passenger's seat are arranged at the front of a vehicle body (cabinet) 46 of a vehicle (automobile). ing. The driver's seat 40 is located on the right side of the passenger compartment 46, for example. A steering wheel (handle) 78 is disposed in front of the driver seat 40. The passenger seat 44 is located on the left side of the passenger compartment 46, for example. The driver seat 40 and the passenger seat 44 constitute a front seat. In the vicinity of the driver's seat 40, an audio source 72a when the driver emits audio is located. In the vicinity of the passenger seat 44, an audio source 72b when the passenger seat makes a sound is located. Since both the driver and the front passenger can move the upper body while seated in the

seats

40 and 44, the position of the sound source 72 can change. A rear seat 70 is disposed at the rear of the vehicle body 46. Here, reference numeral 72 is used when the description is made without distinguishing between the individual sound sources, and

reference numerals

72a and 72b are used when the description is made with the individual sound sources distinguished.

A plurality of microphones 22 (22a to 22c), that is, microphone arrays are arranged in front of the

front seats

40 and 44. Here, reference numeral 22 is used when the description is made without distinguishing the individual microphones, and reference numerals 22a to 22c are used when the description is made with the individual microphones distinguished. The microphone 22 may be disposed on the dashboard 42 or may be disposed on a portion close to the roof.

The distance between the sound source 72 of the

front seats

40 and 44 and the microphone 22 is often about several tens of centimeters. However, the distance between the microphone 22 and the audio source 72 can be less than a few tens of centimeters. Also, the distance between the microphone 22 and the audio source 72 can exceed 1 m.

Inside the vehicle body 46, a speaker (loud speaker) 76 constituting a speaker system of an on-vehicle acoustic device (car audio device) 84 (see FIG. 2) is arranged. Music (music) emitted from the speaker 76 can be noise when performing speech recognition.

The vehicle body 46 is provided with an engine 80 for driving the vehicle. The sound emitted from the engine 80 can be noise when performing speech recognition.

The noise generated in the passenger compartment 46 by the road surface stimulus during the traveling of the vehicle, that is, the road noise can also be a noise when performing voice recognition. In addition, wind noise generated when the vehicle travels can also be a noise source in performing speech recognition. Further, the noise source 82 may exist outside the vehicle body 46. The sound emitted from the external noise source 82 can also be noise in performing speech recognition.

It would be convenient if operations on various devices arranged on the vehicle body 46 could be performed by user voice instructions. The user's voice instruction is recognized using, for example, an automatic voice recognition device 68 (see FIG. 2). The speech processing apparatus according to the present embodiment contributes to improvement of speech recognition accuracy in the automatic speech recognition apparatus 68.

FIG. 2 is a block diagram showing a system configuration of the speech processing apparatus according to the present embodiment.

As shown in FIG. 2, the speech processing apparatus according to the present embodiment includes a pre-processing unit 10, a processing unit 12, a post-processing unit 14, a speech source direction determination unit 16, an adaptive algorithm determination unit 18, and a noise model. A determination unit 20 and a designated input processing unit 86 are included.

The voice processing device according to the present embodiment may further include an automatic voice recognition device 68, and the voice processing device according to the present embodiment and the automatic voice recognition device 68 may be separate devices. A device including these components and the automatic speech recognition device 68 can be referred to as a speech processing device or an automatic speech recognition device.

A signal acquired by each of the plurality of microphones 22a to 22c, that is, a sound reception signal is input to the preprocessing unit 10. As the microphone 22, for example, an omnidirectional microphone is used.

3A and 3B are schematic diagrams showing examples of microphone arrangement. FIG. 3A shows a case where the number of microphones 22 is three. FIG. 3B shows a case where the number of microphones 22 is two. The plurality of microphones 22 are arranged so as to be positioned on a straight line.

When the sound source 72 is located in the far field, the sound reaching the microphone 22 is handled as a plane wave, and the direction (direction) of the sound source 72, that is, the sound source direction (DOA: DirectionDirectOf Arrival) is determined. it can.

When the sound source 72 is located in the near field, it is preferable to determine the direction of the sound source 72 by treating the sound reaching the microphone 22 as a spherical wave.

The distance L1 between the microphone 22a and the microphone 22b is set to be relatively long so as to be suitable for a relatively low frequency sound. The distance L2 between the microphone 22b and the microphone 22c is set to be relatively short so as to be suitable for a relatively high frequency sound.

As shown in FIG. 2, sound reception signals acquired by the plurality of microphones 22 are input to the preprocessing unit 10. In the preprocessing unit 10, sound field correction is performed. In the sound field correction, tuning is performed in consideration of the acoustic characteristics of the vehicle compartment 46 that is an acoustic space.

When the sound reception signal acquired by the microphone 22 includes music, the preprocessing unit 10 removes the music from the sound reception signal acquired by the microphone 22. A reference music signal (reference signal) is input to the preprocessing unit 10. The preprocessing unit 10 removes music included in the sound reception signal acquired by the microphone 22 using the reference music signal.

The sound source direction determination unit 16 determines the direction of the sound source.

Assuming that the speed of sound is c [m / s], the distance between microphones is d [m], and the arrival time difference is τ [seconds], the direction θ [degree] of the sound source 72 is expressed by the following equation (1). Represented by The sound speed c is about 340 [m / s].

It is possible to specify the position of the sound source 72 based on the arrival time difference τ.

The output signal of the voice source direction determination unit 16, that is, the signal indicating the direction of the voice source 72 is input to the adaptive algorithm determination unit 18. The adaptive algorithm determination unit 18 determines an adaptive algorithm based on the orientation of the audio source 72. A signal indicating the adaptation algorithm determined by the adaptation algorithm determination unit 18 is input from the adaptation algorithm determination unit 18 to the processing unit 12.

The processing unit 12 performs adaptive beamforming, which is signal processing that adaptively forms directivity (adaptive beamformer). The processing unit 12 not only functions as an adaptive beamformer that adaptively performs beamforming, but also controls the entire speech processing apparatus according to the present embodiment. As the beam former, for example, a Frost beam former or the like can be used. The beam forming is not limited to the Frost beamformer, and various beamformers can be applied as appropriate. The processing unit 12 performs beam forming based on the adaptive algorithm determined by the adaptive algorithm determination unit 18. In this embodiment, the beam forming is performed in order to reduce the sensitivity other than the arrival direction of the target sound while securing the sensitivity to the arrival direction of the target sound. The target sound is, for example, a sound emitted from the driver. Since the driver can move the upper body while sitting in the driver's seat 40, the position of the sound source 72a can change. The arrival direction of the target sound changes according to the change in the position of the sound source 72a. In order to perform good speech recognition, it is preferable to reliably reduce the sensitivity other than the arrival direction of the target sound. Therefore, in the present embodiment, based on the direction of the sound source 72 determined as described above, the beam former is sequentially updated so as to suppress sound from an azimuth range other than the azimuth range including the azimuth.

When the voice source 72a to be subjected to voice recognition is located in the driver's seat 40, sound coming from an azimuth range other than the azimuth range including the azimuth of the driver's seat 40 is suppressed.

If the voice source 72b to be subjected to voice recognition is located in the passenger seat 44, sound coming from an azimuth range other than the azimuth range including the azimuth of the passenger seat 44 is suppressed. Good.

FIG. 4 is a diagram showing a beamformer algorithm. The received sound signals acquired by the microphones 22a to 22c are input to the window function / fast Fourier transform processing units 48a to 48c provided in the processing unit 12 via the preprocessing unit 10 (see FIG. 2). It is like that. The window function / fast Fourier transform processing units 48a to 48c perform window function processing and fast Fourier transform processing. In this embodiment, the window function process and the fast Fourier transform process are performed because the calculation in the frequency domain is faster than the calculation in the time domain. The output signal X1 _{, k of the} window function / fast Fourier transform processing unit 48a and the beamformer weight tensor W1 _{, k} ^* are multiplied at the multiplication point 50a. The output signal X2 _{, k of the} window function / fast Fourier transform processor 48b and the beamformer weight tensor W2 _{, k} ^* are multiplied at the multiplication point 50b. The output signal X _{3, k of the} window function / fast Fourier transform processing unit 48c and the beamformer weight tensor W _{3, k} ^* are multiplied at the multiplication point 50c. The signals multiplied at the multiplication points 50 a to 50 c are added at the addition point 52. The signal Y _k added at the addition point 52 is input to an inverse fast Fourier transform / superimposition addition processing unit 54 provided in the processing unit 12. The inverse fast Fourier transform / superimposition addition processing unit 54 performs an inverse fast Fourier transform process and a process based on an overlay addition (OLA: OverLap-Add) method. By performing processing by the superposition addition method, the frequency domain signal is returned to the time domain signal. A signal subjected to the inverse fast Fourier transform process and the superposition addition method is input from the inverse fast Fourier transform / superimposition addition processing unit 54 to the post-processing unit 14.

FIG. 5 is a diagram showing the directivity of the beamformer and the angle characteristics of the audio source direction determination cancellation process. The solid line indicates the directivity of the beamformer. The alternate long and short dash line indicates the angle characteristic of the audio source direction determination cancellation process. As can be seen from FIG. 5, for example, the output signal power becomes minimum at the azimuth angle β1 degree and the azimuth angle β2. It is sufficiently suppressed between the azimuth angle β1 and the azimuth angle β2. If a directional beamformer as shown in FIG. 5 is used, the sound arriving from the passenger seat can be sufficiently suppressed. On the other hand, the voice coming from the driver's seat reaches the microphone 22 with almost no suppression. In the present embodiment, when the sound coming from an azimuth range other than the azimuth range including the azimuth of the audio source 72 is larger than the magnitude of the audio coming from the audio source 72, the direction of the audio source 72 is determined. Suspend (voice source direction determination cancellation process). For example, when the beamformer is set to acquire the voice from the driver, if the voice from the passenger seat is larger than the voice from the driver, the direction of the voice source is estimated. Interrupt. In this case, the sound reception signal acquired by the microphone 22 is sufficiently suppressed. For example, when a voice arriving from a direction smaller than γ1 or a voice arriving from a direction larger than γ2, for example, is larger than the voice from the driver, a voice source direction determination canceling process is performed. Here, the case where the beamformer is set so as to acquire the voice from the driver has been described as an example, but the beamformer may be set so as to acquire the voice from the passenger. . In this case, when the voice from the driver is louder than the voice from the passenger, the estimation of the direction of the voice source is interrupted.

Thus, a signal in which sound coming from an azimuth range other than the azimuth range including the azimuth of the audio source 72 is suppressed is output from the processing unit 12. An output signal from the processing unit 12 is input to the post-processing unit 14.

In the post-processing unit (post-processing adaptive filter) 14, noise is removed. Such noise includes engine noise, road noise, wind noise, and the like. The engine noise model determination unit 20 generates a reference noise signal by performing noise modeling processing. The reference noise signal output from the noise model determination unit 20 is a reference signal for removing noise from a signal including noise. The reference engine noise signal is input to the post-processing unit 14. The post-processing unit 14 uses the reference engine noise signal to remove noise from the signal including noise. The post-processing unit 14 outputs a signal from which noise has been removed. The post-processing unit 14 also performs distortion reduction processing. Note that noise removal is not performed only in the post-processing unit 14. Noise is removed from a sound acquired via the microphone 22 by a series of processes performed in the preprocessing unit 10, the processing unit 12, and the postprocessing unit 14.

In this way, a signal that has been post-processed by the post-processing unit 14 is output to the automatic speech recognition device 68. Since a good target sound in which sounds other than the target sound are suppressed is input to the automatic speech recognition device 68, the automatic speech recognition device 68 can improve the accuracy of speech recognition. Based on the voice recognition result by the automatic voice recognition device 68, the operation on the device mounted on the vehicle is automatically performed.

The voice recognition result by the automatic voice recognition device 68 is also input to the designated input processing unit 86. The designation input processing unit 86 is for the user to designate a voice source 72 that is a target of voice recognition when a user (occupant) performs a predetermined action. Examples of the predetermined action include utterance of a predetermined word. A user who has issued a predetermined word is designated as the voice source 72 to be subjected to voice recognition. The sound source 72 designated by performing a predetermined action is referred to as a designated sound source.

The designated input processing unit 68 determines whether or not a predetermined word has been issued based on the voice recognition result by the automatic voice recognition device 68. A signal indicating whether or not a predetermined word has been issued is input from the designated input processing unit 86 to the processing unit 12. When a predetermined word is emitted, the processing unit 12 performs beam forming so as to suppress sound coming from an azimuth range other than the azimuth range including the azimuth of the sound source 72 that issued the predetermined word. Note that the direction of the sound source 72 that has issued the predetermined word is determined by the sound source direction determination unit 16.

Next, the operation of the speech processing apparatus according to the present embodiment will be described with reference to FIG. FIG. 6 is a flowchart showing the operation of the speech processing apparatus according to the present embodiment.

First, the sound processor is turned on (step S1).

Next, when the user has issued a predetermined word (YES in step S2), the audio source 72 that has issued the predetermined word is designated as the designated audio source (step S3). If the predetermined word is not issued (NO in step S2), step S2 is repeated. The designated voice source is a voice source 72 that is a target of voice recognition. Since the direction of the sound source 72 that has issued the predetermined word is determined by the sound source direction determination unit 16, it is possible to determine which seat the user has issued the predetermined word from. In this way, the sound source 72 that has issued the predetermined word is determined, and the designated sound source 72 to be subjected to speech recognition is designated.

Next, the orientation of the designated audio source 72 is determined (step S4). The direction of the designated audio source 72 is determined by the audio source direction determining unit 16.

Next, the directivity of the beamformer is set according to the direction of the designated audio source 72 (step S5). The setting of the beamformer directivity is performed by the adaptive algorithm determination unit 18, the processing unit 12, and the like as described above.

When the volume of sound coming from an azimuth range other than the predetermined azimuth range including the azimuth of designated voice source 72 is equal to or greater than the magnitude of voice coming from designated voice source 72 (YES in step S5), the voice The determination of the source 72 is interrupted (step S7).

On the other hand, when the magnitude of the sound coming from the azimuth range other than the predetermined azimuth range including the azimuth of the voice source 72 is not greater than the magnitude of the voice coming from the voice source 72 (NO in step S6), step S4 , S5 is repeated.

Thus, the beamformer is adaptively set according to the change in the position of the designated sound source 72, and the sound other than the sound from the designated sound source 72, that is, the sound other than the target sound is surely suppressed.

As described above, according to the present embodiment, it is possible to reliably specify the voice source 72 to be subjected to voice recognition by issuing a predetermined word. For this reason, according to the present embodiment, it is possible to provide a good speech processing apparatus that can improve the certainty of speech recognition.

[Second Embodiment]
A speech processing apparatus according to a second embodiment of the present invention will be described with reference to FIGS. FIG. 7 is a block diagram showing the system configuration of the speech processing apparatus according to the present embodiment. The same components as those of the speech processing apparatus according to the first embodiment shown in FIGS. 1 to 6 are denoted by the same reference numerals, and description thereof is omitted or simplified.

In the voice recognition apparatus according to the present embodiment, the predetermined action for the user to specify the voice source 72 that is the target of voice recognition is an operation or gesture of the

switches

90 and 92.

As shown in FIG. 7, the speech processing apparatus according to the present embodiment includes a pre-processing unit 10, a processing unit 12, a post-processing unit 14, a speech source direction determination unit 16, an adaptive algorithm determination unit 18, engine noise, and the like. A model determining unit 20. The speech processing apparatus according to the present embodiment also includes a learning processing unit 88, a driver seat side switch 90, a passenger seat side switch 92, a camera 94, a switch designation input processing unit 96, and an image designation input processing unit. 98.

In the vicinity of the driver's seat 40, a driver's seat side switch 90 is arranged. A passenger seat side switch 92 is disposed in the vicinity of the passenger seat 44. The driver seat side switch 90 and the passenger seat side switch 92 are connected to the switch designation input processing unit 96.

The switch designation input processing unit 96 is for the user to designate the voice source 72 that is the target of voice recognition by the user operating the

switches

90 and 92. When the driver's seat side switch 90 arranged on the driver's seat side is operated, the voice source 72a located in the driver's seat is designated as the designated voice source that is the target of voice recognition. On the other hand, when the passenger seat side switch 92 arranged on the passenger seat side is operated, the voice source 72b located in the passenger seat is designated as the designated voice source to be recognized.

When the driver's seat side switch 90 is operated, a signal indicating that the driver's seat side switch 90 has been operated is input from the switch designation input processing unit 96 to the processing unit 12. When the driver's seat side switch 90 is operated, the processing unit 12 performs beam forming so as to suppress sound coming from an azimuth range other than the azimuth range including the azimuth of the sound source 72a located at the driver's seat 40. .

When the passenger seat side switch 92 is operated, a signal indicating that the passenger seat side switch 92 has been operated is input from the switch designation input processing unit 96 to the processing unit 12. When the passenger seat side switch 92 is operated, the processing unit 12 performs beam forming so as to suppress sound coming from an azimuth range other than the azimuth range including the azimuth of the sound source 72b located at the passenger seat 44. .

In addition, a camera 94 is disposed on the vehicle 46. An image acquired by the camera 94 is input to the image designation input processing unit 98. The image designation input processing unit 98 is for the user to designate a voice source 72 that is a target of voice recognition when a user (occupant) performs a predetermined action. Examples of the predetermined action include a predetermined gesture (gesture, pose). A user who has performed a predetermined gesture is designated as a voice source (designated voice source) 72 to be a target of voice recognition.

The image designation input processing unit 98 determines whether a predetermined gesture has been performed based on the image acquired by the camera 94. A signal indicating whether or not a predetermined gesture has been performed is input from the image designation input processing unit 98 to the processing unit 12. When a predetermined gesture is performed by the driver, the processing unit 12 performs beam forming so as to suppress sound coming from an azimuth range other than the azimuth range including the azimuth of the audio source 72a located at the driver's seat 40. Do. The processing unit 12 performs beam forming so as to suppress sound coming from an azimuth range other than the azimuth range including the azimuth of the audio source 72b located at the passenger seat 44 when a predetermined gesture is performed by the passenger. I do.

A learning processing unit 88 is connected to the processing unit 12. The learning processing unit 88 learns beam forming suitable for each of the

sound sources

72a and 72b for each of the

sound sources

72a and 72b. In the present embodiment, the learning processing unit 88 is provided for the following reason. That is, in the present embodiment, the predetermined action for the user to specify the voice source 72 that is the target of voice recognition is an operation or gesture of the

switches

90 and 92. That is, in the present embodiment, the voice source 72 that is the target of voice recognition is designated by means other than voice. For this reason, when the sound source 72 to be subjected to speech recognition is designated, the sound from the designated sound source 72 is not necessarily obtained via the microphone 22. In order to reliably process the sound from the designated sound source 72 after the sound source 72 to be subjected to speech recognition is designated, beam forming suitable for the designated sound source 72 is learned in advance, and the designated sound source 72 Preferably, beam forming suitable for 72 is applied. For this reason, in this embodiment, a learning processing unit 88 is provided. The learning processing unit 88 learns beam forming suitable for acquiring the sound from the sound source 72a when the sound is emitted from the sound source 72a. The learning processing unit 88 learns beamforming suitable for acquiring the sound from the sound source 72b when the sound is emitted from the sound source 72b.

When the sound source 72a located in the driver's seat 40 is designated as the designated sound source, the beam forming learned as the beam forming suitable for the sound source 72a located in the driver's seat 40 is applied. On the other hand, when the sound source 72b located in the passenger seat 40 is designated as the designated sound source, the beam forming learned as the beam forming suitable for the sound source 72b located in the passenger seat 44 is applied.

The signal that has been post-processed by the post-processing unit 14 is output as an audio output.

Next, the operation of the speech processing apparatus according to the present embodiment will be described with reference to FIG. FIG. 8 is a flowchart showing the operation of the speech processing apparatus according to the present embodiment.

First, the sound processor is turned on (step S10).

Next, beam forming learning is performed (step S11). When a sound is emitted from the sound source 72a located at the driver's seat 40, the learning processing unit 88 learns beamforming suitable for the sound source 72a located at the driver's seat 40. When a sound is emitted from the sound source 72b located in the passenger seat 44, the learning processing unit 88 learns beamforming suitable for the sound source 72b located in the passenger seat 44.

When driver's seat side switch 90 is operated, specifically, when driver's seat side switch 90 is turned on (YES in step S12), beam forming suitable for audio source 72a located in driver's seat 40 is performed. The beam forming learned by the learning processing unit 88 is applied (step S13).

If the driver's seat side switch 90 has not been operated (NO in step S12), it is confirmed whether or not the passenger's seat side switch 92 has been operated (step S14). When the passenger seat side switch is operated, specifically, when the passenger seat side switch 92 is turned on (YES in step S14), beam forming suitable for the sound source 72b located in the passenger seat 44 is performed. Beam forming learned by the learning processing unit 88 is applied (step S15).

If the passenger seat side switch 92 is not operated (NO in step S14), it is confirmed whether or not a predetermined gesture is performed by the driver (step S16). When a predetermined gesture is performed by the driver (YES in step S16), the beam forming learned by the learning processing unit 88 is applied as the beam forming suitable for the sound source 72a located in the driver seat 40 ( Step S17).

If the predetermined gesture is not performed by the driver (NO in step S16), it is confirmed whether or not the predetermined gesture is performed by the passenger seat (step S18). When a predetermined gesture is performed by the passenger (YES in step S18), the beamforming learned by the learning processing unit 88 is applied as the beamforming suitable for the sound source 72b located in the passenger seat 44. (Step S19).

Next, when sound is emitted from the designated sound source 72, the direction of the designated sound source 72 is determined (step S21). The orientation of the designated audio source 72 is performed by the audio source orientation determining unit 16 as described above.

Next, the directivity of the beamformer is set according to the direction of the designated audio source 72 (step S22). The setting of the beamformer directivity is performed by the adaptive algorithm determination unit 18, the processing unit 12, and the like as described above.

When the volume of sound coming from an azimuth range other than the predetermined azimuth range including the azimuth of designated audio source 72 is greater than or equal to the audio coming from designated audio source 72 (YES in step S23), the audio The determination of the source 72 is interrupted (step 24).

On the other hand, when the magnitude of sound coming from an azimuth range other than the predetermined azimuth range including the azimuth of voice source 72 is not greater than the magnitude of voice coming from voice source 72 (NO in step S23), step S21 , S22 is repeated.

As described above, the predetermined action for the user to specify the voice source 72 to be subjected to voice recognition may be an operation of the

switches

90 and 92, a gesture, or the like.

[Modified Embodiment]
The present invention is not limited to the above embodiment, and various modifications are possible.

For example, in the above embodiment, the case where the number of the microphones 22 is three has been described as an example, but the number of the microphones 22 is not limited to three, and may be four or more. If many microphones 22 are used, the direction of the sound source 72 can be determined with higher accuracy.

In the above embodiment, the case where the sound source 72 is located in the driver seat 40 or the passenger seat 44 has been described as an example. However, the position of the sound source 72 is not limited to the driver seat 40 or the passenger seat 44. . For example, the present invention is also applicable when the audio source 72 is located in the rear seat 70.

In the first embodiment, a learning processing unit 88 may be further provided.

In the above embodiment, the case where the output of the speech processing apparatus according to the present embodiment is input to the automatic speech recognition apparatus 68, that is, the case where the output of the speech processing apparatus according to the present embodiment is used for speech recognition will be described as an example. However, the present invention is not limited to this. The output of the speech processing apparatus according to the present embodiment may not be used for automatic speech recognition. For example, the voice processing device according to the present embodiment may be applied to voice processing in a telephone conversation. Specifically, the sound processing apparatus according to the present embodiment may be used to suppress sounds other than the target sound and transmit good sound. If the voice processing device according to the present embodiment is applied to telephone conversation, it is possible to realize a voice conversation.

In the second embodiment, whether or not a predetermined gesture has been performed is determined based on an image acquired by the camera 94, but the present invention is not limited to this. For example, a motion sensor or the like may be used to determine whether a predetermined gesture has been performed.

In the above-described embodiment, the case where a plurality of microphones 22 are arranged linearly has been described as an example. However, the arrangement of three or more microphones 22 is not limited to this. For example, the plurality of microphones 22 may be arranged on the same plane, or the plurality of microphones 22 may be arranged three-dimensionally.

This application claims the priority from Japanese Patent Application No. 2014-263921 filed on Dec. 26, 2014, the contents of which are incorporated herein by reference.

22, 22a to 22c ... Microphone 40 ... Driver's seat 42 ... Dashboard 44 ... Passenger seat 46 ...

Car body

72, 72a, 72b ... Audio source 76 ... Speaker 78 ... Steering wheel 80 ... Engine 82 ... External noise source 84 ... In-vehicle acoustic equipment

Claims

A plurality of microphones arranged in the vehicle;
A sound source direction determination unit that determines a direction of a sound source that is a sound source included in a sound reception signal acquired by each of the plurality of microphones;
A beam forming processing unit that performs beam forming to suppress sound coming from an azimuth range other than the azimuth range including the azimuth of the audio source;
The said beam forming process part performs the said beam forming in the said direction of the said audio | voice source designated by the predetermined | prescribed action. The audio processing apparatus characterized by the above-mentioned.
The speech processing apparatus according to claim 1, wherein the predetermined action is utterance of a predetermined word.
The audio processing apparatus according to claim 1, wherein the predetermined action is an operation of a predetermined switch.
The speech processing apparatus according to claim 1, wherein the predetermined action is a predetermined gesture.
A learning processing unit for learning the beam forming suitable for each of the sound sources for each of the sound sources;
5. The audio processing according to claim 1, wherein the beamforming learned by the learning processing unit is applied when the audio source is designated by the predetermined action. 6. apparatus.