WO2016143340A1

WO2016143340A1 - Speech processing device and control device

Info

Publication number: WO2016143340A1
Application number: PCT/JP2016/001290
Authority: WO
Inventors: 一平菅江; 丹羽　栄二; 孝之中所
Original assignee: アイシン精機株式会社
Priority date: 2015-03-09
Filing date: 2016-03-09
Publication date: 2016-09-15
Also published as: JP2016167645A

Abstract

Provided is a speech processing device and control device that can accurately perform speech processing of speech generated inside and outside a vehicle while meeting demand for lower costs. The present invention comprises a speech source direction determination unit 16 that determines the direction of speech sources that are the sources of speech included in received sound signals obtained via each of multiple microphones 22 provided within a vehicle, a beam forming processing unit 12 that forms beams which suppress sound incoming from direction ranges other than direction ranges that include the directions of the speech sources and a noise removal processing unit 14 that performs removal processing on noise mixed into the received sound signals, wherein on the basis of a first signal that indicates whether or not passengers are present in the vehicle, beam forming by the beam forming processing unit is set on or off.

Description

Voice processing apparatus and control apparatus

The present invention relates to a voice processing device and a control device.

Various devices are provided in vehicles such as automobiles. Operations on these various devices are performed, for example, by operating operation buttons, operation panels, and the like.

On the other hand, recently, it has also been proposed to control a vehicle using a voice recognition technology (Patent Documents 1 and 2).

JP 2006-189394 A JP 2007-308887 A

However, sound can be emitted not only inside the vehicle but also outside the vehicle. If microphones are arranged at various locations in order to reliably detect sound that can be uttered at various locations, this is contrary to the demand for cost reduction.

An object of the present invention is to provide a voice processing device that can accurately perform voice processing on voice that can be emitted inside and outside a vehicle while satisfying a demand for cost reduction, and a control device using the voice processing device. It is in.

According to one aspect of the present invention, a sound source direction determination unit that determines a direction of a sound source that is a sound source included in a sound reception signal acquired by each of a plurality of microphones arranged in a vehicle; A beam forming processing unit that performs beam forming to suppress sound coming from an azimuth range other than the azimuth range including the azimuth range of the sound source; and a noise removal processing unit that performs processing to remove noise mixed in the received sound signal. There is provided a speech processing apparatus in which on / off of the beamforming by the beamforming processing unit is set based on a first signal indicating whether or not an occupant is present in the vehicle. The

According to the present invention, on / off of the beam forming is set based on the first signal indicating whether or not an occupant is present in the vehicle. For this reason, even when the occupant is located outside the vehicle, it is possible to reliably detect the sound emitted by the occupant using the microphone disposed in the vehicle. Since it is not necessary to provide a microphone for acquiring sound emitted outside the vehicle separately from the microphone arranged in the vehicle, it is possible to contribute to cost reduction.

It is the schematic which shows the structure of a vehicle. It is a block diagram which shows the system configuration | structure of the control apparatus by one Embodiment of this invention. It is a top view which shows the vehicle by one Embodiment of this invention. It is a block diagram which shows the system configuration | structure of the audio processing apparatus by one Embodiment of this invention. It is the schematic which shows the example of arrangement | positioning of a microphone in case the number of microphones is three. It is the schematic which shows the example of arrangement | positioning of a microphone in case the number of microphones is two. It is a figure which shows the case where an audio source is located in a far field. It is a figure which shows the case where a sound source is located in a near field. It is the schematic which shows the algorithm of a music removal. It is a figure which shows the signal waveform before the removal of music. It is a figure which shows the signal waveform after the removal of music. It is a figure which shows the algorithm of determination of the azimuth | direction of an audio source. It is a figure which shows an adaptive filter coefficient. It is a figure which shows the azimuth | direction angle of an audio source. It is a figure which shows the amplitude of an audio | voice signal. It is a figure which shows the directivity of a beam former notionally. It is a figure which shows the algorithm of a beam former. It is a graph which shows the example of the directivity obtained by a beam former. It is a figure which shows the angle characteristic at the time of combining a beam former and an audio source direction determination cancellation process. It is a graph which shows the example of the directivity obtained by a beam former. It is a figure which shows the algorithm of noise removal. It is a figure which shows the signal waveform before noise removal. It is a figure which shows the signal waveform after noise removal. It is a flowchart which shows operation | movement of the audio processing apparatus by one Embodiment of this invention. It is a flowchart which shows the operation | movement in the 1st operation mode in the speech processing unit by one Embodiment of this invention. It is a flowchart which shows the operation | movement in the 2nd operation mode in the speech processing unit by one Embodiment of this invention. It is a flowchart which shows the operation | movement in the 2nd operation mode in the audio processing apparatus by the modification of one Embodiment of this invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In addition, this invention is not limited to the following embodiment, In the range which does not deviate from the summary, it can change suitably. In the drawings described below, components having the same function are denoted by the same reference numerals, and the description thereof may be omitted or simplified.

[One Embodiment]
A voice processing device and a control device using the voice processing device according to an embodiment of the present invention will be described with reference to FIGS.

Prior to describing the voice processing device and the control device according to the present embodiment, the configuration of the vehicle will be described with reference to FIG. FIG. 1 is a schematic diagram showing a configuration of a vehicle.

As shown in FIG. 1, a driver's seat 40 that is a driver's seat and a passenger's seat 44 that is a passenger's seat are arranged at the front of a vehicle body (cabinet) 46 of a vehicle (automobile) 136. Has been. The driver's seat 40 is located on the right side of the passenger compartment 46, for example. A steering wheel (handle) 78 is disposed in front of the driver seat 40. The passenger seat 44 is located on the left side of the passenger compartment 46, for example. The driver seat 40 and the passenger seat 44 constitute a front seat. In the vicinity of the driver's seat 40, an audio source 72a when the driver emits audio is located. In the vicinity of the passenger seat 44, an audio source 72b when the passenger seat makes a sound is located. Since both the driver and the front passenger can move the upper body while seated in the

seats

40 and 44, the position of the sound source 72 can change. A rear seat 70 is disposed at the rear of the vehicle body 46. Here, reference numeral 72 is used when the description is made without distinguishing between the individual sound sources, and

reference numerals

72a and 72b are used when the description is made with the individual sound sources distinguished.

A plurality of microphones 22 (22a to 22c), that is, microphone arrays are arranged in front of the

front seats

40 and 44. Here, reference numeral 22 is used when the description is made without distinguishing the individual microphones, and reference numerals 22a to 22c are used when the description is made with the individual microphones distinguished. The microphone 22 may be disposed on the dashboard 42 or may be disposed on a portion close to the roof.

The distance between the sound source 72 of the

front seats

40 and 44 and the microphone 22 is often about several tens of centimeters. However, the distance between the microphone 22 and the audio source 72 can be less than a few tens of centimeters. Also, the distance between the microphone 22 and the audio source 72 can exceed 1 m.

Inside the vehicle body 46, a speaker (loud speaker) 76 constituting a speaker system of an on-vehicle acoustic device (car audio device) 84 (see FIG. 2) is arranged. Music (music) emitted from the speaker 76 can be noise when performing speech recognition.

The vehicle body 46 is provided with an engine 80 for driving the vehicle 136. The sound emitted from the engine 80 can be noise when performing speech recognition.

The noise generated in the passenger compartment 46 due to road surface stimulation while the vehicle 136 is traveling, that is, road noise, can also be noise when performing voice recognition. In addition, wind noise generated when the vehicle 136 travels can also be a noise source in performing voice recognition. Further, the noise source 82 may exist outside the vehicle body 46. The sound emitted from the external noise source 82 can also be noise in performing speech recognition.

It is convenient if operations on various devices arranged on the vehicle body 46 can be performed by voice instructions. The voice instruction is recognized using, for example, an automatic voice recognition device 168 (see FIG. 2). The speech processing apparatus 102 according to the present embodiment contributes to improvement of speech recognition accuracy.

FIG. 2 is a block diagram showing the control device according to the present embodiment.

As shown in FIG. 2, the control device 100 according to the present embodiment includes a speech processing device 102, an automatic speech recognition device 168, an input unit 114, a control unit (CPU: Central Processing Unit) 116, a memory 118, and an output unit 120. have. The voice processing device 102, the automatic voice recognition device 168, the input unit 114, the control unit 116, the memory 118, and the output unit 120 can input / output signals to / from each other via the bus line 122.

The voice processing device 102 and the automatic voice recognition device 168 may be separate devices, or the voice processing device (voice processing unit) 102 and the automatic voice recognition device (voice recognition unit) 168 are integrated. May be. A device in which the speech processing device 102 and the automatic speech recognition device 168 are integrated can be called a speech processing device or an automatic speech recognition device.

A signal acquired by each of the plurality of microphones 22a to 22c is input to the voice processing device 102. In addition, a signal from the in-vehicle acoustic device 84 is input to the audio processing device 102.

The voice signal processed by the voice processing apparatus 102 is output to the automatic voice recognition apparatus (voice recognition apparatus) 168 as a voice output.

A signal from the proximity detection unit (proximity detection means) 126 is input to the input unit 114. A signal indicating whether or not a passenger is approaching the vehicle 136, that is, a proximity detection signal is input from the proximity detection unit 126 to the input unit 114. As the proximity detection unit 126, for example, a reception unit (reception unit) that can receive a wireless signal emitted from the smart key (authentication key) 146 can be used. For example, the proximity detection unit 126 may also serve as a reception unit for the smart key system, or may be provided separately from the reception unit for the smart key system. FIG. 3 is a plan view showing the vehicle according to the present embodiment. As shown in FIG. 3, a communication area 148 of the smart key 146 is formed in the vicinity of the opening / closing bodies 134a to 134c. When the smart key 146 is located in the communication area 148 of the smart key system, a signal indicating that the smart key 146 is located in the communication area 148 is input from the proximity detection unit 126 to the input unit 114. The

Note that, here, the presence / absence of the occupant's proximity to the vehicle 136 is determined based on the reception of the wireless signal emitted from the smart key 146 by the proximity detection unit 126, but the present invention is not limited to this. That is, the occupant-side device used to determine whether or not the occupant is approaching the vehicle 136 is not limited to the smart key 146, and may be any portable device that can perform ID authentication. It is possible to appropriately determine the proximity of an occupant to the vehicle 136 based on whether or not communication between various portable devices capable of ID authentication and in-vehicle devices is established.

A signal indicating whether or not there is an occupant in the vehicle 136, that is, an occupant presence / absence detection signal is input to the input unit 114 from the occupant detection unit 142 to the input unit 114. As the occupant detection unit 142, for example, a driver monitor, a weight detection sensor, or the like can be used. The driver monitor can detect the presence or absence of an occupant based on an image taken by a camera (not shown). For example, the weight detection sensor is arranged in the driver's seat 40 and can detect the presence or absence of an occupant based on the weight detected by the weight detection sensor.

The control unit 116 controls the entire control device 100. The control unit 116 reads a proximity detection signal input from the proximity detection unit 126 via the input unit 114. Based on the proximity detection signal, the control unit 116 can determine whether or not the occupant carrying the smart key 146 is in a state of being close to the vehicle 136. Further, the control unit 116 reads an occupant presence / absence detection signal input from the occupant detection unit 142 via the input unit 114. The control unit 116 can determine whether there is an occupant in the vehicle 136 based on the occupant presence / absence detection signal. Further, the control unit 116 reads output information from the automatic speech recognition device 168, that is, a speech recognition result. The control unit 116 can recognize an occupant instruction by voice based on the voice recognition result by the automatic voice recognition device 168.

The control unit 116 controls various devices mounted on the vehicle 136 based on the voice recognition result by the automatic voice recognition device 168.

For example, the control unit 116 controls the opening / closing body 134. Specifically, the control unit 116 outputs a control signal for controlling the opening / closing body driving device 132 to the opening / closing body driving device 132 via the output unit 120. The opening / closing body driving device 132 is for driving an opening / closing body 134 which is a structure having an opening / closing mechanism. The control unit 116 automatically opens and closes the opening / closing body 134 via the opening / closing body driving device 132. The vehicle 136 is provided with various opening / closing bodies such as the

side doors

134a and 134b and the back door 134c. In FIG. 2, the individual opening / closing bodies are not distinguished, and one of the plurality of opening / closing bodies is illustrated by reference numeral 134.

In addition, the control unit 116 controls the brake 140. Specifically, the control unit 116 outputs a control signal for controlling the brake control device 138 to the brake control device 138 via the output unit 120. The brake control device 138 is for controlling the brake 140. The control unit 116 controls the brake 140 via the brake control device 138.

FIG. 4 is a block diagram showing a system configuration of the speech processing apparatus according to the present embodiment. As shown in FIG. 4, the speech processing apparatus 102 according to the present embodiment includes a preprocessing unit 10, a processing unit 12, a post-processing unit 14, a speech source direction determination unit 16, an adaptive algorithm determination unit 18, noise A model determining unit 20.

A signal acquired by each of the plurality of microphones 22a to 22c, that is, a sound reception signal is input to the preprocessing unit 10. As the microphone 22, for example, an omnidirectional microphone is used.

FIG. 5A and FIG. 5B are schematic diagrams showing examples of microphone arrangement. FIG. 5A shows a case where the number of microphones 22 is three. FIG. 5B shows a case where the number of microphones 22 is two. The plurality of microphones 22 are arranged so as to be positioned on a straight line.

6A and 6B are diagrams showing a case where the sound source is located in the far field and a case where the sound source is located in the near field. FIG. 6A shows a case where the audio source 72 is located in the far field, and FIG. 6B shows a case where the audio source 72 is located in the near field. d indicates a difference in distance from the sound source 72 to the microphone 22. θ represents the direction of the audio source 72.

As shown in FIG. 6A, when the sound source 72 is located in the far field, the sound reaching the microphone 22 can be regarded as a plane wave. For this reason, in this embodiment, when the sound source 72 is located in the far field, the sound reaching the microphone 22 is handled as a plane wave, and the direction (direction) of the sound source 72, that is, the sound source direction (DOA: Direction) Of Arrival). Since the sound reaching the microphone 22 can be handled as a plane wave, when the sound source 72 is located in the far field, the direction of the sound source 72 can be determined using the two microphones 22. Depending on the position of the sound source 72 and the arrangement of the microphones 22, the orientation of the sound source 72 located in the near field can be determined even when the number of the microphones 22 is two.

As shown in FIG. 6B, when the sound source 72 is located in the near field, the sound reaching the microphone 22 can be regarded as a spherical wave. For this reason, in this embodiment, when the sound source 72 is located in the near field, the sound reaching the microphone 22 is treated as a spherical wave, and the direction of the sound source 72 is determined. Since the sound reaching the microphone 22 needs to be handled as a spherical wave, when the sound source 72 is located in the near field, the orientation of the sound source 72 is determined using at least three microphones 22. Here, for simplification of description, a case where the number of microphones 22 is three will be described as an example.

The distance L1 between the microphone 22a and the microphone 22b is set to be relatively long. The distance L2 between the microphone 22b and the microphone 22c is set to be relatively short.

The reason why the distance L1 and the distance L2 are different in the present embodiment is as follows. That is, in this embodiment, the direction of the sound source 72 is specified based on the sound (arrival time difference (TDOA: Time Delay Of Arrival) of the received sound signal) reaching each microphone 22. Since the wavelength is relatively long, it is preferable to set the distance between the microphones 22 to be relatively large in order to cope with the sound having a relatively low frequency.For this reason, in this embodiment, the microphone 22a and the microphone 22b The distance L1 between the microphones 22 is set to be relatively long, while the sound having a relatively high frequency has a relatively short wavelength, so that the distance between the microphones 22 is relatively small in order to correspond to the sound having a relatively high frequency. Therefore, in this embodiment, the distance L2 between the microphone 22b and the microphone 22c is relatively short. It is constant.

The distance L1 between the microphone 22a and the microphone 22b is, for example, about 5 cm so as to be suitable for sound having a frequency of 3400 Hz or less. The distance L2 between the microphone 22b and the microphone 22c is, for example, about 2.5 cm so as to be suitable for sound having a frequency exceeding 3400 Hz. The distances L1 and L2 are not limited to these, and can be set as appropriate.

In the present embodiment, when the sound source 72 is located in the far field, the sound reaching the microphone 22 is handled as a plane wave when the sound is handled as a plane wave than when the sound is handled as a spherical wave. This is because the process for determining the direction of the audio source 72 is simple. For this reason, in this embodiment, when the sound source 72 is located in the far field, the sound reaching the microphone 22 is treated as a plane wave. Since the sound reaching the microphone 22 is handled as a plane wave, when determining the direction of the sound source 72 located in the far field, the processing load for determining the direction of the sound source 72 can be reduced.

In addition, although the addition of the process for determining the azimuth | direction of the audio | voice source 72 becomes heavy, when the audio | voice source 72 is located in a near field, the audio | voice which reaches | attains the microphone 22 is handled as a spherical wave. This is because when the sound source 72 is located in the near field, the direction of the sound source 72 cannot be accurately determined unless the sound reaching the microphone 22 is treated as a spherical wave.

As described above, in the present embodiment, when the sound source 72 is located in the far field, the direction of the sound source 72 is determined by treating the sound as a plane wave, and when the sound source 72 is located in the near field, The direction of the sound source 72 is determined by treating the sound as a spherical wave.

As shown in FIG. 4, sound reception signals acquired by the plurality of microphones 22 are input to the preprocessing unit 10. In the preprocessing unit 10, sound field correction is performed. In the sound field correction, tuning is performed in consideration of the acoustic characteristics of the vehicle compartment 46 that is an acoustic space.

When the sound reception signal acquired by the microphone 22 includes music, the preprocessing unit 10 removes the music from the sound reception signal acquired by the microphone 22. A reference music signal (reference signal) is input to the preprocessing unit 10. The preprocessing unit 10 removes music included in the sound reception signal acquired by the microphone 22 using the reference music signal.

FIG. 7 is a schematic diagram showing a music removal algorithm. When music is played back by the in-vehicle acoustic device 84, the sound reception signal acquired by the microphone 22 includes music. A received sound signal including music acquired by the microphone 22 is input to a music removal processing unit 24 provided in the preprocessing unit 10. A reference music signal is input to the music removal processing unit 24. The reference music signal can be obtained, for example, by acquiring music output from the speaker 76 of the in-vehicle acoustic device 84 using the

microphones

26a and 26b. Alternatively, the music source signal before being converted into sound by the speaker 76 may be input to the music removal processing unit 24 as a reference music signal.

The output signal from the music removal processing unit 24 is input to a step size determination unit 28 provided in the preprocessing unit 10. The step size determination unit 28 determines the step size of the output signal of the music removal processing unit 24. The step size determined by the step size determination unit 28 is fed back to the music removal processing unit 24. The music removal processing unit 24 uses the reference music signal and, based on the step size determined by the step size determination unit 28, the frequency domain normalized least square method (NLMS: Normalized Least-Mean Square) algorithm Remove music from signals that contain. The music removal process is performed with a sufficient number of processing steps to sufficiently remove the reverberant component of the music in the passenger compartment 46.

8A and 8B are diagrams showing signal waveforms before and after music removal. The horizontal axis indicates time, and the vertical axis indicates amplitude. FIG. 8A shows before music removal, and FIG. 8B shows after music removal. As can be seen from FIGS. 8A and 8B, the music is reliably removed.

The signal from which music has been removed in this way is output from the music removal processing unit 24 of the preprocessing unit 10 and input to the processing unit 12. If the pre-processing unit 10 cannot sufficiently remove music, the post-processing unit 14 may also perform music removal processing.

The sound source direction determination unit 16 determines the direction of the sound source. FIG. 9 is a diagram showing an algorithm for determining the direction of the sound source. A signal from a certain microphone 22 among the plurality of microphones 22 is input to a delay unit 30 provided in the sound source direction determination unit 16. A signal from another microphone 22 among the plurality of microphones 22 is input to an adaptive filter 32 provided in the sound source direction determination unit 16. The output signal of the delay unit 30 and the output signal of the adaptive filter 32 are input to the subtraction point 34. At the subtraction point 34, the output signal of the adaptive filter 34 is subtracted from the output signal of the delay unit 30. The adaptive filter 32 is adjusted based on the signal subjected to the subtraction process at the subtraction point 34. The output from the adaptive filter 32 is input to the peak detector 36. The peak detector 36 detects a peak (maximum value) of the adaptive filter coefficient. The arrival time difference τ corresponding to the peak of the adaptive filter coefficient is the arrival time difference τ corresponding to the arrival direction of the target sound. Therefore, it is possible to determine the direction of the voice source 72, that is, the direction of arrival of the target sound, based on the arrival time difference τ thus obtained.

Assuming that the speed of sound is c [m / s], the distance between microphones is d [m], and the arrival time difference is τ [seconds], the direction θ [degree] of the sound source 72 is expressed by the following equation (1). Represented by The sound speed c is about 340 [m / s].

Θ = (180 / π) × arccos (τ · c / d) (1)

FIG. 10A is a diagram showing adaptive filter coefficients. FIG. 10B is a diagram showing the azimuth angle of the audio source. FIG. 10C is a diagram illustrating the amplitude of the audio signal. In FIG. 10A, the portion where the adaptive filter coefficient has a peak is hatched. FIG. 10B shows the orientation of the audio source 72 determined based on the arrival time difference τ. FIG. 10C shows the amplitude of the audio signal. FIGS. 10A to 10C show a case where the driver and the passenger are uttered alternately. Here, the direction of the sound source 72a when the driver emits sound is α1. The direction of the sound source 72b when the passenger seat utters sound is α2.

As shown in FIG. 10A, the arrival time difference τ can be detected based on the peak of the adaptive filter coefficient w (t, τ). When the driver utters voice, the arrival time difference τ corresponding to the peak of the adaptive filter coefficient is, for example, about −t1. When the azimuth angle of the audio source 72a is determined based on the arrival time difference τ, the azimuth angle of the audio source 72a is determined to be about α1, for example. On the other hand, when the passenger seat utters a voice, the arrival time difference τ corresponding to the peak of the adaptive filter coefficient is, for example, about t2. When the azimuth angle of the audio source 72b is determined based on the arrival time difference τ, the azimuth angle of the audio source 72b is determined to be about α2 degrees, for example. Here, the case where the driver is located in the direction of α1 and the passenger seat is located in the direction of α2 has been described as an example, but the present invention is not limited to this. Whether the audio source 72 is located in the near field or the audio source 72 is located in the far field, the position of the audio source 72 can be specified based on the arrival time difference τ. . However, when the sound source 72 is located in the near field, as described above, three or more microphones 22 are required, so that the processing load for obtaining the direction of the sound source 72 becomes heavy.

The output signal of the voice source direction determination unit 16, that is, the signal indicating the direction of the voice source 72 is input to the adaptive algorithm determination unit 18. The adaptive algorithm determination unit 18 determines an adaptive algorithm based on the orientation of the audio source 72. A signal indicating the adaptation algorithm determined by the adaptation algorithm determination unit 18 is input from the adaptation algorithm determination unit 18 to the processing unit 12.

The processing unit 12 performs adaptive beamforming, which is signal processing that adaptively forms directivity (adaptive beamformer, beamforming processing unit). For example, a Frost beamformer can be used as the beamformer. The beam forming is not limited to the Frost beamformer, and various beamformers can be applied as appropriate. The processing unit 12 performs beam forming based on the adaptive algorithm determined by the adaptive algorithm determination unit 18. In this embodiment, the beam forming is performed in order to reduce the sensitivity other than the arrival direction of the target sound while securing the sensitivity to the arrival direction of the target sound. The target sound is, for example, a sound emitted from the driver. Since the driver can move the upper body while sitting in the driver's seat 40, the position of the sound source 72a can change. The arrival direction of the target sound changes according to the change in the position of the sound source 72a. In order to perform good speech recognition, it is preferable to reliably reduce the sensitivity other than the arrival direction of the target sound. Therefore, in the present embodiment, based on the direction of the sound source 72 determined as described above, the beam former is sequentially updated so as to suppress sound from an azimuth range other than the azimuth range including the azimuth.

FIG. 11 is a diagram conceptually showing the directivity of the beamformer. FIG. 11 conceptually shows the directivity of the beamformer when the voice source 72a to be subjected to voice recognition is located in the driver's seat 40. The hatching in FIG. 11 indicates the azimuth range in which the incoming sound is suppressed (suppressed or reduced). As shown in FIG. 11, sound coming from an azimuth range other than the azimuth range including the azimuth of the driver's seat 40 is suppressed.

If the voice source 72b to be subjected to voice recognition is located in the passenger seat 44, sound coming from an azimuth range other than the azimuth range including the azimuth of the passenger seat 44 is suppressed. Good.

FIG. 12 is a diagram showing a beamformer algorithm. The received sound signals acquired by the microphones 22a to 22c are input to the window function / fast Fourier transform processing units 48a to 48c provided in the processing unit 12 via the preprocessing unit 10 (see FIG. 4). It is like that. The window function / fast Fourier transform processing units 48a to 48c perform window function processing and fast Fourier transform processing. In this embodiment, the window function process and the fast Fourier transform process are performed because the calculation in the frequency domain is faster than the calculation in the time domain. The output signal X1 _{, k of the} window function / fast Fourier transform processing unit 48a and the beamformer weight tensor W1 _{, k} ^* are multiplied at the multiplication point 50a. The output signal X2 _{, k of the} window function / fast Fourier transform processor 48b and the beamformer weight tensor W2 _{, k} ^* are multiplied at the multiplication point 50b. The output signal X3 _{, k of the} window function / fast Fourier transform processing unit 48b and the beamformer weight tensor W3 _{, k} ^* are multiplied at the multiplication point 50c. The signals multiplied at the multiplication points 50 a to 50 c are added at the addition point 52. The signal Y _k added at the addition point 52 is input to an inverse fast Fourier transform / superimposition addition processing unit 54 provided in the processing unit 12. The inverse fast Fourier transform / superimposition addition processing unit 54 performs an inverse fast Fourier transform process and a process based on an overlay addition (OLA: OverLap-Add) method. By performing processing by the superposition addition method, the frequency domain signal is returned to the time domain signal. A signal subjected to the inverse fast Fourier transform process and the superposition addition method is input from the inverse fast Fourier transform / superimposition addition processing unit 54 to the post-processing unit 14.

FIG. 13 is a diagram showing the directivity (angle characteristic) obtained by the beamformer. The horizontal axis indicates the azimuth angle, and the vertical axis indicates the output signal power. As can be seen from FIG. 13, for example, the output signal power is minimized at the azimuth angle β1 and the azimuth angle β2. Sufficient suppression is also performed between the azimuth angle β1 and the azimuth angle β2. If a directional beamformer as shown in FIG. 13 is used, the sound coming from the passenger seat can be sufficiently suppressed. On the other hand, the voice coming from the driver's seat reaches the microphone 22 with almost no suppression.

In the present embodiment, when the sound coming from an azimuth range other than the azimuth range including the azimuth of the audio source 72 is larger than the magnitude of the audio coming from the audio source 72, the direction of the audio source 72 is determined. Suspend (voice source direction determination cancellation process). For example, when the beamformer is set to acquire the voice from the driver, if the voice from the passenger seat is larger than the voice from the driver, the direction of the voice source is estimated. Interrupt. In this case, the sound reception signal acquired by the microphone 22 is sufficiently suppressed. FIG. 14 is a diagram showing the directivity (angle characteristic) when the beamformer and the sound source direction determination cancellation processing are combined. The solid line indicates the directivity of the beamformer. The alternate long and short dash line indicates the angle characteristic of the audio source direction determination cancellation process. For example, when a voice arriving from a direction smaller than γ1 or a voice arriving from a direction larger than γ2, for example, is larger than the voice from the driver, a voice source direction determination canceling process is performed. Here, the case where the beamformer is set so as to acquire the voice from the driver has been described as an example, but the beamformer may be set so as to acquire the voice from the passenger. . In this case, when the voice from the driver is louder than the voice from the passenger, the estimation of the direction of the voice source is interrupted.

FIG. 15 is a graph showing the directivity obtained by the beamformer when there are two microphones. The horizontal axis is the azimuth angle, and the vertical axis is the output signal power. Since there are two microphones 22, there is only one angle at which the minimum value is obtained. As can be seen from FIG. 15, for example, significant suppression is possible at the azimuth angle β1, but the robustness to the change in the azimuth of the audio source 72 is not so high.

Thus, a signal in which sound coming from an azimuth range other than the azimuth range including the azimuth of the audio source 72 is suppressed is output from the processing unit 12. An output signal from the processing unit 12 is input to the post-processing unit 14.

In the post-processing unit (post-processing adaptive filter) 14, noise is removed. Examples of such noise include engine noise, road noise, and wind noise. FIG. 16 is a diagram illustrating an algorithm for noise removal. A fundamental wave of noise is determined by a fundamental wave determination unit 56 provided in the noise model determination unit 20. The fundamental wave determination unit 56 outputs a sine wave based on the fundamental wave of noise. The sine wave output from the fundamental wave determination unit 56 is input to a modeling processing unit 58 provided in the noise model determination unit 20. The modeling processing unit 58 includes a non-linear mapping processing unit 60, a linear filter 62, and a non-linear mapping processing unit 64. The modeling processing unit 58 performs modeling processing using a Hammerstein-Wiener nonlinear model. The modeling processing unit 58 includes a non-linear mapping processing unit 60, a linear filter 62, and a non-linear mapping processing unit 64. The modeling processing unit 58 generates a reference noise signal by performing a modeling process on the sine wave output from the fundamental wave determination unit 56. The reference noise signal output from the modeling processing unit 58 is a reference signal for removing noise from a signal including noise. The reference noise signal is input to a noise removal processing unit 66 provided in the post-processing unit 14. A signal including noise from the processing unit 12 is also input to the noise removal processing unit 66. The noise removal processing unit 66 removes noise from a signal including noise by using a reference noise signal and using a normalization least squares algorithm. The noise removal processing unit 66 outputs a signal from which noise has been removed.

FIG. 17A and FIG. 17B are diagrams showing signal waveforms before and after noise removal. The horizontal axis indicates time, and the vertical axis indicates amplitude. FIG. 17A shows before noise removal, and FIG. 17B shows after noise removal. As can be seen from FIGS. 17A and 17B, noise is reliably removed.

In the post-processing unit 14, distortion reduction processing is also performed. Note that noise removal is not performed only in the post-processing unit 14. Noise is removed from a sound acquired via the microphone 22 by a series of processes performed in the preprocessing unit 10, the processing unit 12, and the postprocessing unit 14.

Thus, the signal that has been post-processed by the post-processing unit 14 is output to the automatic speech recognition apparatus 168 as a voice output. Since a good target sound in which sounds other than the target sound are suppressed is input to the automatic speech recognition device 168, the automatic speech recognition device 168 can improve the accuracy of speech recognition. Based on the voice recognition result by the automatic voice recognition device 168, the operation on the device mounted on the vehicle 136 is automatically performed.

Next, operations of the voice processing device according to the present embodiment and the control device using the voice processing device will be described with reference to FIGS. FIG. 18 is a flowchart showing the operation of the speech processing apparatus according to this embodiment.

First, as shown in FIG. 18, it is determined whether or not an occupant is present in the vehicle 136 (step S1). Whether an occupant is present in the vehicle 136 can be determined based on, for example, an occupant presence / absence detection signal from the occupant detection unit 142.

If a passenger is present in the vehicle 136 (YES in step S1), the voice processing device 102 is operated in the first operation mode. The first operation mode is an operation mode on the assumption that an occupant is present in the vehicle 136. In the first operation mode, sound source direction determination, beam forming processing, noise removal processing, music removal processing, and the like are performed.

The operation of the speech processing apparatus in the first operation mode will be described with reference to FIG. FIG. 19 is a flowchart showing the operation in the first operation mode in the speech processing apparatus according to the present embodiment.

First, noise removal processing and music removal processing are started (step S10). That is, the noise removal process and the music removal process are set on. Thereafter, the noise removal process and the music removal process are continuously performed. Note that the music removal process may not be performed when the in-vehicle acoustic device 84 is not outputting music or when the volume of the music is extremely low. As described above, noise is removed by a series of processes performed in the pre-processing unit 10, the processing unit 12, and the post-processing unit 14. Further, as described above, the music removal processing is performed by the music removal processing unit 24 provided in the preprocessing unit 10 or the like.

Before the call by the occupant is made to the sound processing apparatus 102 (NO in step S11), noise removal processing, music removal processing, and the like are performed, but sound source direction determination, beam forming, and the like are not performed.

When the call by the occupant is made to the voice processing device 102 (YES in step S11), the voice source direction determination process and the beam forming process are set to ON, and the direction of the voice source 72 that issued the call is determined ( Step S12). The determination of the direction of the sound source 72 is performed by the sound source direction determination unit 16 and the like as described above. The call is made by the driver, for example. The call may not be performed by the driver. For example, a passenger seat may make a call. The call may be a specific word or a simple utterance.

Next, the directivity of the beamformer is set according to the direction of the sound source 72 (step S13). The setting of the beamformer directivity is performed by the adaptive algorithm determination unit 18, the processing unit 12, and the like as described above.

When the magnitude of sound coming from an azimuth range other than the predetermined azimuth range including the azimuth of voice source 72 is greater than or equal to the magnitude of voice coming from voice source 72 (YES in step S14), voice source 72 Is interrupted (step S15).

On the other hand, when the magnitude of sound coming from an azimuth range other than the predetermined azimuth range including the azimuth of voice source 72 is not greater than the magnitude of voice coming from voice source 72 (NO in step S14), step S12 , S13 is repeated.

Thus, the beamformer is adaptively set according to the change of the position of the sound source 72, and the sound other than the target sound is surely suppressed. Since a good target sound, in which noise removal processing, music removal processing, and the like are performed and a sound other than the target sound is suppressed, is input to the automatic speech recognition device 168, the automatic speech recognition device 168 has a voice recognition accuracy. Can be improved. Based on the voice recognition result by the automatic voice recognition device 168, an operation on a device mounted on the vehicle 136, for example, an operation on a door, a window, a wiper, a winker, or the like is automatically performed.

On the other hand, when no occupant is present in the vehicle 136 (NO in step S1), the voice processing device 102 is operated in the second operation mode. The second operation mode is an operation mode on the assumption that no occupant is present in the vehicle 136. In the second operation mode, noise removal processing, music removal processing, and the like are performed, but audio source direction determination, beam forming processing, and the like are not performed.

The operation of the speech processing apparatus in the second operation mode will be described with reference to FIG. FIG. 20 is a flowchart showing the operation in the second operation mode in the speech processing apparatus according to the present embodiment.

First, noise removal processing and music removal processing are started (step S20). Thereafter, the noise removal process and the music removal process are continuously performed. In the second operation mode, sound source direction determination, beam forming, and the like are not performed. That is, in the second operation mode, the sound source direction determination process and the beam forming process are set to off. As described above, the music removal process may not be performed when the in-vehicle audio device 84 does not output music or when the volume of music is extremely low. In addition, as described above, noise is removed by a series of processes performed in the preprocessing unit 10, the processing unit 12, and the postprocessing unit 14. Further, as described above, the music removal processing is performed by the music removal processing unit 24 provided in the preprocessing unit 10 or the like.

In the second operation mode, a sound signal that has been subjected to noise removal processing, music removal processing, and the like is output from the sound processing apparatus 102. In the second operation mode, the sound source direction determination process and the beam forming process are set to OFF for the following reason. That is, when an occupant is present outside the vehicle 136, it is not always easy to accurately and reliably specify the position of the occupant outside the vehicle 136. For this reason, it is also conceivable that beam forming is performed in the wrong direction. When sound is emitted from the occupant while beamforming is performed in the wrong direction, the sound emitted from the occupant is suppressed, and the sound emitted from the occupant cannot be acquired. There is a fear. Therefore, in this embodiment, beam forming is not performed in the second operation mode. Since beam forming is not performed, in this embodiment, sound source direction determination necessary for performing beam forming is not performed. Since a good audio signal that has been subjected to noise removal processing, music removal processing, and the like is input to the automatic speech recognition device 168, the automatic speech recognition device 168 can perform speech recognition with high accuracy. Based on the voice recognition result by the automatic voice recognition device 168, the operation on the device mounted on the vehicle 136 is automatically performed.

In the second operation mode, when a predetermined word such as “open” or “closed” is detected by the automatic sound processing device 168, an occupant located outside the vehicle 136 opens the opening / closing body 134. It seems that he wants to operate and close. When the proximity detection signal from the proximity detection unit 126 is input to the input unit 114, it is considered that the occupant has issued the predetermined word. Therefore, in the state where the occupant presence / absence detection signal indicates that no occupant is present in the vehicle 136 and the proximity detection signal from the proximity detection unit 126 is input to the input unit 114, “open” When the automatic speech processing device 168 detects a predetermined word such as “closed” or “closed”, the control unit 116 performs control for opening or closing the opening / closing body 134. Specifically, the control unit 116 opens or closes the opening / closing body 134 by controlling the opening / closing body driving device 132 via the output unit 120.

In the second operation mode, for example, when a predetermined word “stop” is detected by the automatic sound processing device 168, it is considered that an occupant located outside the vehicle 136 wants the vehicle 136 to stop. . For example, when the vehicle 136 stopped on a slope starts to move, an occupant located outside the vehicle 136 wants the vehicle 136 to stop. For this reason, when an occupant presence / absence detection signal indicates that no occupant is present in the vehicle 136 and the vehicle 136 is moving, a predetermined word such as “stop” is detected. The control unit 116 performs control for stopping the vehicle 136. Specifically, the control unit 116 controls the brake control device 138 via the output unit 120 to operate the brake 140, thereby stopping the vehicle 136.

Thus, according to the present embodiment, on / off of beamforming is set based on the occupant presence / absence detection signal indicating whether or not an occupant is present in the vehicle 136. For this reason, even when the occupant is located outside the vehicle 136, the sound emitted by the occupant can be reliably detected using the microphone 22 disposed in the vehicle 136. Since it is not necessary to provide a microphone for acquiring sound emitted outside the vehicle 136 separately from the microphone 22 disposed in the vehicle 136, it is possible to contribute to cost reduction. Therefore, according to the present embodiment, a voice processing device that can accurately perform voice processing on voice that can be emitted inside and outside the vehicle while satisfying a demand for cost reduction, and a control device using the voice processing device are provided. Can be provided.

(Modification)
Next, a voice processing device according to a modification of the present embodiment and a control device using the voice processing device will be described with reference to FIG. 18, FIG. 19, and FIG. FIG. 21 is a flowchart showing the operation in the second operation mode in the speech processing apparatus according to this modification.

The sound processing apparatus according to this modification is configured to perform beam forming on an occupant after the occupant located outside the vehicle 136 has issued a predetermined word.

First, it is determined whether or not an occupant is present in the vehicle 136 in the same manner as in the sound processing apparatus according to the embodiment described above with reference to FIG. 18 (step S1).

If a passenger is present in the vehicle 136 (YES in step S1), the voice processing device 102 is operated in the first operation mode. The operation of the speech processing apparatus in the first operation mode is the same as the operation in the first operation mode of the speech processing apparatus according to the embodiment described above with reference to FIG.

On the other hand, when no occupant is present in the vehicle 136 (NO in step S1), the voice processing device 102 is operated in the second operation mode. As described above, the second operation mode is an operation mode based on the assumption that no occupant is present in the vehicle 136. In the second operation mode, noise removal processing, music removal processing, and the like are performed before a predetermined word is detected by the automatic speech recognition device 168, but sound source direction determination, beam forming, and the like are not performed. . That is, before the predetermined word is detected by the automatic speech recognition device 168, the voice source direction determination process and the beam forming process are set to off.

The operation of the speech processing apparatus in the second operation mode will be described with reference to FIG. FIG. 21 is a flowchart showing the operation in the second operation mode in the speech processing apparatus according to this modification.

First, noise removal processing and music removal processing are started (step S30). Thereafter, the noise removal process and the music removal process are continuously performed. In the second operation mode, sound source direction determination, beam forming, and the like are not performed. As described above, the music removal process may not be performed when the in-vehicle acoustic device 84 does not output music, or when the volume of music is extremely low. In addition, as described above, noise is removed by a series of processes performed in the preprocessing unit 10, the processing unit 12, and the postprocessing unit 14. Further, as described above, the music removal processing is performed by the music removal processing unit 24 provided in the preprocessing unit 10 or the like.

When the predetermined word is detected by the automatic speech recognition device 168 (YES in step S31), the voice source direction determination process and the beam forming process are set to ON, and the direction of the voice source 72 that issued the predetermined word is determined. (Step S32). The determination of the direction of the sound source 72 is performed by the sound source direction determination unit 16 and the like as described above. As the predetermined word, for example, “A” which is a voice uttered when surprised can be cited. When such a predetermined word is issued, it is considered that an occupant located outside the vehicle 136 is surprised. For this reason, when the predetermined word is detected by the automatic speech recognition device 168 (YES in step S31), the operation after step S32 is performed in order to acquire more reliably the sound emitted from the occupant who has issued the predetermined word. Done.

Next, the directivity of the beamformer is set according to the direction of the sound source 72 (step S33). The setting of the beamformer directivity is performed by the adaptive algorithm determination unit 18, the processing unit 12, and the like as described above.

When the magnitude of sound coming from an azimuth range other than the predetermined azimuth range including the azimuth of voice source 72 is greater than or equal to the magnitude of voice coming from voice source 72 (YES in step S34), voice source 72 Is interrupted (step S35).

On the other hand, when the magnitude of sound coming from an azimuth range other than the predetermined azimuth range including the azimuth of voice source 72 is not greater than the magnitude of voice coming from voice source 72 (NO in step S34), step S32 , S33 is repeated.

As described above, according to the present modification, after the predetermined word is detected, the sound source direction determination process, the beam forming process, and the like are set on, so that the sound other than the target sound is suppressed. Voice signal is input to the automatic voice recognition device 168. For this reason, according to the present modification, the accuracy of voice recognition can be further improved, and operations on devices and the like mounted on the vehicle 136 can be performed more accurately and reliably.

[Modified Embodiment]
The present invention is not limited to the above embodiment, and various modifications are possible.

For example, in the above embodiment, the case where the number of the microphones 22 is three has been described as an example, but the number of the microphones 22 is not limited to three, and may be four or more. If many microphones 22 are used, the direction of the sound source 72 can be determined with higher accuracy.

This application claims priority from Japanese Patent Application No. 2015-045408 filed on March 9, 2015, the contents of which are incorporated herein by reference.

22, 22a to 22c, 26a, 26b ... microphone 40 ... driver's seat 42 ... dashboard 44 ... passenger seat 46 ... car body,

passenger compartment

72, 72a, 72b ... sound source 76 ... speaker 78 ... steering wheel 80 ... engine 82 ... external Noise source 84 ... in-vehicle acoustic device 100 ... control device 102 ...

voice processing devices

134, 134a to 134c ... opening / closing body 136 ... vehicle 148 ... communication area

Claims

A sound source direction determination unit that determines a direction of a sound source that is a sound source included in a sound reception signal acquired by each of a plurality of microphones arranged in the vehicle;
A beam forming processing unit that performs beam forming to suppress sound coming from an azimuth range other than the azimuth range including the azimuth of the audio source;
A noise removal processing unit for removing noise mixed in the received sound signal;
An audio processing apparatus in which on / off of the beamforming by the beamforming processing unit is set based on a first signal indicating whether or not an occupant is present in the vehicle.
The speech processing apparatus according to claim 1, wherein the noise removal processing unit performs the noise removal processing regardless of the first signal.
The sound according to claim 1 or 2, wherein when the first signal indicates that the occupant is not present in the vehicle, the beamforming by the beamforming processing unit is set to off. Processing equipment.
If the first signal indicates that the occupant is not present in the vehicle, the beamforming is set off before the predetermined word is detected, and the predetermined word The sound processing apparatus according to claim 1, wherein the beamforming is set to be on after the detection of.
A music removal processing unit that removes the music signal mixed in the received sound signal using a reference music signal obtained from an audio device;
5. The audio processing device according to claim 1, wherein the music signal is removed by the music removal processing unit regardless of the first signal. 6.
An audio source direction determination unit that determines the direction of a sound source that is a sound source included in a sound reception signal acquired by each of a plurality of microphones arranged in the vehicle, and a direction that includes the direction of the sound source A voice processing unit including a beam forming processing unit that performs beam forming to suppress sound coming from a azimuth range other than the range; and a noise removal processing unit that performs processing to remove noise mixed in the received sound signal;
A control unit that performs control based on a voice recognition result acquired using the voice processing unit,
The said control part sets the on / off of the said beam forming by the said beam forming process part based on the 1st signal which shows whether the passenger | crew exists in the said vehicle.
In a state where the first signal indicates that the occupant is not present in the vehicle, if a predetermined word is detected, the control unit performs control based on the predetermined word. The control device according to claim 6, which is performed.
When the first signal indicates that the occupant is not present in the vehicle, and the predetermined word is detected in a state where the vehicle is moving, the control unit The control device according to claim 7, wherein the vehicle is stopped.
In the state where the first signal indicates that the occupant is not present in the vehicle and the second signal indicates that the occupant is approaching the vehicle, the predetermined signal The control device according to claim 7, wherein when the word is detected, the control unit opens or closes an opening / closing body provided in the vehicle.