JP2016167645A

JP2016167645A - Voice processing device and control device

Info

Publication number: JP2016167645A
Application number: JP2015045408A
Authority: JP
Inventors: Ippei Sugae; Eiji Niwa; Takayuki Nakadokoro
Original assignee: Aisin Seiki Co Ltd
Current assignee: Aisin Corp
Priority date: 2015-03-09
Filing date: 2015-03-09
Publication date: 2016-09-15
Also published as: WO2016143340A1

Abstract

PROBLEM TO BE SOLVED: To provide a voice processing device and a control device capable of accurately performing voice processing on voices which may be uttered inside and outside a vehicle while satisfying a requirement for reducing the cost.

SOLUTION: A voice processing device includes a voice source direction determination unit 16 for determining the azimuth of a voice source as an utterance source of voices contained in a reception sound signal acquired by each of a plurality of microphones 22 arranged in a vehicle, a beam forming processor 12 for performing beam forming to suppress sounds coming from an azimuth range other than an azimuth range including the azimuth of the voice source, and a noise removing processor 14 for performing removal processing of noise mixed in the reception sound signal. ON/OFF of the beam forming by the beam forming processor is set on the basis of a first signal indicating whether an occupant is present in the vehicle.

SELECTED DRAWING: Figure 4

Description

The present invention relates to a voice processing device and a control device.

Vehicles such as automobiles are equipped with various devices. These devices are operated, for example, by means of operation buttons, operation panels, and the like.

Recently, it has also been proposed to control a vehicle using voice recognition technology (Patent Documents 1 and 2).

Patent Document 1: JP 2006-189394 A
Patent Document 2: JP 2007-308887 A

However, voices may be uttered not only inside the vehicle but also outside it. Arranging microphones at many locations in order to reliably detect voices that may be uttered at those locations runs counter to the demand for cost reduction.

An object of the present invention is to provide a voice processing device that can accurately perform voice processing on voices that may be uttered inside and outside a vehicle while satisfying the demand for cost reduction, and a control device using the voice processing device.

According to one aspect of the present invention, there is provided a voice processing device including: a voice source direction determination unit that determines the azimuth of a voice source, which is the source of a voice contained in the received sound signal acquired by each of a plurality of microphones arranged in a vehicle; a beamforming processing unit that performs beamforming to suppress sound arriving from azimuth ranges other than the azimuth range that includes the azimuth of the voice source; and a noise removal processing unit that removes noise mixed into the received sound signal, wherein the beamforming by the beamforming processing unit is set on or off based on a first signal indicating whether or not an occupant is present in the vehicle.

According to the present invention, beamforming is set on or off based on the first signal indicating whether or not an occupant is present in the vehicle. Therefore, even when the occupant is outside the vehicle, the voice uttered by that occupant can be reliably detected using the microphones arranged in the vehicle. Because a microphone for picking up voices uttered outside the vehicle need not be provided separately from the microphones arranged in the vehicle, the invention can contribute to cost reduction.

A schematic diagram showing the configuration of a vehicle.
A block diagram showing the system configuration of a control device according to an embodiment of the present invention.
A plan view showing a vehicle according to an embodiment of the present invention.
A block diagram showing the system configuration of a voice processing device according to an embodiment of the present invention.
A schematic diagram showing examples of microphone arrangement.
A diagram showing the case where the voice source is located in the far field and the case where it is located in the near field.
A schematic diagram showing the music removal algorithm.
A diagram showing signal waveforms before and after music removal.
A diagram showing the algorithm for determining the azimuth of the voice source.
A diagram showing the adaptive filter coefficients, the azimuth angle of the voice source, and the amplitude of the voice signal.
A diagram conceptually showing the directivity of the beamformer.
A diagram showing the beamformer algorithm.
A graph showing an example of the directivity obtained by the beamformer.
A diagram showing the angle characteristics when the beamformer is combined with voice source direction determination cancellation processing.
A graph showing an example of the directivity obtained by the beamformer.
A diagram showing the noise removal algorithm.
A diagram showing signal waveforms before and after noise removal.
A flowchart showing the operation of the voice processing device according to an embodiment of the present invention.
A flowchart showing the operation in the first operation mode of the voice processing device according to an embodiment of the present invention.
A flowchart showing the operation in the second operation mode of the voice processing device according to an embodiment of the present invention.
A flowchart showing the operation in the second operation mode of a voice processing device according to a modification of an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. The present invention is not limited to the following embodiment and can be modified as appropriate without departing from its gist. In the drawings described below, components having the same function are denoted by the same reference numerals, and their description may be omitted or simplified.

[One Embodiment]
A voice processing device according to an embodiment of the present invention and a control device using the voice processing device will be described with reference to FIGS. 1 to 19.

Before describing the voice processing device and the control device according to the present embodiment, the configuration of the vehicle will be described with reference to FIG. 1. FIG. 1 is a schematic diagram showing the configuration of the vehicle.

As shown in FIG. 1, a driver's seat 40 and a front passenger seat 44 are arranged in the front part of the vehicle body (passenger compartment) 46 of a vehicle (automobile) 136. The driver's seat 40 is located, for example, on the right side of the passenger compartment 46. A steering wheel 78 is disposed in front of the driver's seat 40. The front passenger seat 44 is located, for example, on the left side of the passenger compartment 46. The driver's seat 40 and the front passenger seat 44 constitute the front seats. A voice source 72a, the source when the driver speaks, is located near the driver's seat 40. A voice source 72b, the source when the front passenger speaks, is located near the front passenger seat 44. Because both the driver and the front passenger can move their upper bodies while seated in the seats 40 and 44, the position of the voice source 72 can change. A rear seat 70 is disposed in the rear part of the vehicle body 46. In this description, reference numeral 72 is used when the individual voice sources are not distinguished, and reference numerals 72a and 72b are used when they are distinguished.

A plurality of microphones 22 (22a to 22c), that is, a microphone array, is arranged in front of the front seats 40 and 44. In this description, reference numeral 22 is used when the individual microphones are not distinguished, and reference numerals 22a to 22c are used when they are distinguished. The microphones 22 may be disposed on the dashboard 42 or at a location close to the roof.

The distance between the voice source 72 at the front seats 40 and 44 and the microphones 22 is often on the order of several tens of centimeters. However, the distance between the microphones 22 and the voice source 72 may be smaller than several tens of centimeters, and it may also exceed 1 m.

Inside the vehicle body 46, a speaker (loudspeaker) 76 constituting the speaker system of an in-vehicle audio device (car audio device) 84 (see FIG. 2) is arranged. Music emitted from the speaker 76 can act as noise in voice recognition.

The vehicle body 46 is provided with an engine 80 for driving the vehicle 136. Sound emitted from the engine 80 can act as noise in voice recognition.

Noise generated in the passenger compartment 46 by excitation from the road surface while the vehicle 136 is traveling, that is, road noise, can also act as noise in voice recognition. Wind noise generated when the vehicle 136 travels can likewise act as noise in voice recognition. A noise source 82 may also exist outside the vehicle body 46. Sound emitted from the external noise source 82 can also act as noise in voice recognition.

It is convenient if the various devices arranged in the vehicle body 46 can be operated by voice instructions. Voice instructions are recognized using, for example, an automatic speech recognition device 168 (see FIG. 2). The voice processing device 102 according to the present embodiment contributes to improving the accuracy of voice recognition.

FIG. 2 is a block diagram showing the control device according to the present embodiment.

As shown in FIG. 2, the control device 100 according to the present embodiment includes the voice processing device 102, the automatic speech recognition device 168, an input unit 114, a control unit (CPU: Central Processing Unit) 116, a memory 118, and an output unit 120. The voice processing device 102, the automatic speech recognition device 168, the input unit 114, the control unit 116, the memory 118, and the output unit 120 can exchange signals with one another via a bus line 122.

The voice processing device 102 and the automatic speech recognition device 168 may be separate devices, or the voice processing device (voice processing unit) 102 and the automatic speech recognition device (speech recognition unit) 168 may be integrated. A device in which the voice processing device 102 and the automatic speech recognition device 168 are integrated can be called either a voice processing device or an automatic speech recognition device.

The signals acquired by each of the plurality of microphones 22a to 22c are input to the voice processing device 102. A signal from the in-vehicle audio device 84 is also input to the voice processing device 102.

The voice signal processed by the voice processing device 102 is output, as the voice output, to the automatic speech recognition device (speech recognition device) 168.

A signal from a proximity detection unit (proximity detection means) 126 is input to the input unit 114. Specifically, a signal indicating whether or not an occupant is near the vehicle 136, that is, a proximity detection signal, is input from the proximity detection unit 126 to the input unit 114. As the proximity detection unit 126, for example, a receiving unit (receiving means) capable of receiving a wireless signal emitted from a smart key (authentication key) 146 can be used. The proximity detection unit 126 may, for example, double as the receiving unit of the smart key system, or may be provided separately from it. FIG. 3 is a plan view showing the vehicle according to the present embodiment. As shown in FIG. 3, a communication area 148 for the smart key 146 is formed in the vicinity of the opening/closing bodies 134a to 134c. When the smart key 146 is located within the communication area 148 of the smart key system, a signal indicating that the smart key 146 is located within the communication area 148 is input from the proximity detection unit 126 to the input unit 114.

Here, whether or not an occupant is near the vehicle 136 is determined based on the proximity detection unit 126 receiving the wireless signal emitted from the smart key 146, but the present invention is not limited to this. That is, the occupant-side device used to determine whether or not an occupant is near the vehicle 136 is not limited to the smart key 146 and may be any portable device capable of ID authentication. Whether or not an occupant is near the vehicle 136 can be determined as appropriate based on whether communication is established between such a portable device and an in-vehicle device.

A signal indicating whether or not an occupant is present in the vehicle 136, that is, an occupant presence detection signal, is input from an occupant detection unit 142 to the input unit 114. As the occupant detection unit 142, for example, a driver monitor or a weight detection sensor can be used. The driver monitor can detect the presence or absence of an occupant based on an image captured by a camera (not shown). The weight detection sensor is disposed, for example, in the driver's seat 40 and can detect the presence or absence of an occupant based on the weight it detects.

The control unit 116 governs the overall control of the control device 100. The control unit 116 reads the proximity detection signal input from the proximity detection unit 126 via the input unit 114. Based on the proximity detection signal, the control unit 116 can determine whether or not an occupant carrying the smart key 146 is near the vehicle 136. The control unit 116 also reads the occupant presence detection signal input from the occupant detection unit 142 via the input unit 114. Based on the occupant presence detection signal, the control unit 116 can determine whether or not an occupant is present in the vehicle 136. In addition, the control unit 116 reads the output information from the automatic speech recognition device 168, that is, the voice recognition result. Based on the voice recognition result from the automatic speech recognition device 168, the control unit 116 can recognize the occupant's voice instruction.

Based on the voice recognition result from the automatic speech recognition device 168, the control unit 116 controls various devices and the like mounted on the vehicle 136.

For example, the control unit 116 controls an opening/closing body 134. Specifically, the control unit 116 outputs a control signal for controlling an opening/closing body driving device 132 to the opening/closing body driving device 132 via the output unit 120. The opening/closing body driving device 132 drives the opening/closing body 134, which is a structure having an opening/closing mechanism. The control unit 116 automatically performs operations such as opening the opening/closing body 134 via the opening/closing body driving device 132. The vehicle 136 is provided with various opening/closing bodies such as side doors 134a and 134b and a back door 134c. In FIG. 2, the individual opening/closing bodies are not distinguished, and one of the plurality of opening/closing bodies is shown with reference numeral 134.

The control unit 116 also controls a brake 140. Specifically, the control unit 116 outputs a control signal for controlling a brake control device 138 to the brake control device 138 via the output unit 120. The brake control device 138 controls the brake 140. The control unit 116 controls the brake 140 via the brake control device 138.

FIG. 4 is a block diagram showing the system configuration of the voice processing device according to the present embodiment. As shown in FIG. 4, the voice processing device 102 according to the present embodiment includes a preprocessing unit 10, a processing unit 12, a post-processing unit 14, a voice source direction determination unit 16, an adaptive algorithm determination unit 18, and a noise model determination unit 20.

The signals acquired by each of the plurality of microphones 22a to 22c, that is, the received sound signals, are input to the preprocessing unit 10. As the microphones 22, for example, omnidirectional microphones are used.

FIG. 5 is a schematic diagram showing examples of microphone arrangement. FIG. 5(a) shows the case where there are three microphones 22. FIG. 5(b) shows the case where there are two microphones 22. The plurality of microphones 22 are arranged so as to lie on a straight line.

FIG. 6 is a diagram showing the case where the voice source is located in the far field and the case where it is located in the near field. FIG. 6(a) shows the case where the voice source 72 is located in the far field, and FIG. 6(b) shows the case where the voice source 72 is located in the near field. The symbol d denotes the difference in distance from the voice source 72 to the microphones 22. The symbol θ denotes the azimuth of the voice source 72.

As shown in FIG. 6(a), when the voice source 72 is located in the far field, the voice reaching the microphones 22 can be regarded as a plane wave. For this reason, in the present embodiment, when the voice source 72 is located in the far field, the voice reaching the microphones 22 is treated as a plane wave to determine the azimuth (direction) of the voice source 72, that is, the direction of arrival (DOA). Because the voice reaching the microphones 22 can be treated as a plane wave, the azimuth of a voice source 72 located in the far field can be determined using two microphones 22. Depending on the position of the voice source 72 and the arrangement of the microphones 22, the azimuth of a voice source 72 located in the near field can also be determined even with only two microphones 22.
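The plane-wave case can be illustrated with a minimal sketch. It assumes that the angle θ is measured from the array broadside, so that the path-length difference between two microphones spaced L apart is d = L sin θ; this convention, the function name `far_field_doa`, and the speed-of-sound value are assumptions made for illustration and are not taken from the description.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 degrees C


def far_field_doa(tdoa_s, mic_spacing_m, c=SPEED_OF_SOUND):
    """Estimate the arrival angle (radians, from broadside) of a plane wave.

    tdoa_s is the arrival-time difference between the two microphones and
    mic_spacing_m is their spacing. Under the plane-wave assumption the
    path-length difference is d = c * tdoa = spacing * sin(theta).
    """
    ratio = c * tdoa_s / mic_spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1].
    ratio = max(-1.0, min(1.0, ratio))
    return math.asin(ratio)
```

A single pair of microphones suffices here because the plane-wave model has only one unknown, the angle θ.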

As shown in FIG. 6(b), when the voice source 72 is located in the near field, the voice reaching the microphones 22 can be regarded as a spherical wave. For this reason, in the present embodiment, when the voice source 72 is located in the near field, the voice reaching the microphones 22 is treated as a spherical wave to determine the azimuth of the voice source 72. Because the voice reaching the microphones 22 must be treated as a spherical wave, at least three microphones 22 are used to determine the azimuth of a voice source 72 located in the near field. Here, to simplify the description, the case where there are three microphones 22 is described as an example.
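One way to see why three microphones are needed for a spherical wave is the following sketch, which is not the method of the description but a standard closed-form calculation under stated assumptions: the three microphones lie on a line, the middle microphone is at the origin, and the two range differences relative to it are known (arrival-time difference multiplied by the speed of sound). With the source at range r and lateral offset x, each outer microphone at position x_i gives (r + d_i)^2 = r^2 - 2 x x_i + x_i^2, which is linear in r and x. The function name and coordinate convention are hypothetical.

```python
import math


def near_field_position(x_a, x_c, d_a, d_c):
    """Locate a source from range differences at three collinear microphones.

    The middle microphone is at the origin; the outer microphones are at
    x_a and x_c on the same line (metres). d_a and d_c are the range
    differences r_a - r_b and r_c - r_b (metres).
    Returns (range from the middle microphone, angle from broadside in rad).
    """
    # Linear system in the unknowns r and x:
    #   2*d_i*r + 2*x_i*x = x_i**2 - d_i**2   for i in {a, c}
    a11, a12, b1 = 2.0 * d_a, 2.0 * x_a, x_a ** 2 - d_a ** 2
    a21, a22, b2 = 2.0 * d_c, 2.0 * x_c, x_c ** 2 - d_c ** 2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        # The wavefront is effectively planar; range cannot be resolved.
        raise ValueError("degenerate geometry: use the far-field model")
    r = (b1 * a22 - a12 * b2) / det      # Cramer's rule
    x = (a11 * b2 - a21 * b1) / det
    theta = math.asin(max(-1.0, min(1.0, x / r)))
    return r, theta
```

The determinant tends to zero as the source moves away, which matches the observation that a distant source is adequately described by the plane-wave model with two microphones.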

The distance L1 between the microphone 22a and the microphone 22b is set relatively long. The distance L2 between the microphone 22b and the microphone 22c is set relatively short.

The distances L1 and L2 differ in the present embodiment for the following reason. In the present embodiment, the azimuth of the voice source 72 is identified based on the difference in arrival time (TDOA: Time Delay Of Arrival) of the voice (received sound signal) reaching each microphone 22. A voice with a relatively low frequency has a relatively long wavelength, so to handle such a voice it is preferable to set the distance between the microphones 22 relatively large. For this reason, in the present embodiment, the distance L1 between the microphone 22a and the microphone 22b is set relatively long. Conversely, a voice with a relatively high frequency has a relatively short wavelength, so to handle such a voice it is preferable to set the distance between the microphones 22 relatively small. For this reason, in the present embodiment, the distance L2 between the microphone 22b and the microphone 22c is set relatively short.

The distance L1 between the microphone 22a and the microphone 22b is set to, for example, about 5 cm so as to suit voices at frequencies of, for example, 3400 Hz or less. The distance L2 between the microphone 22b and the microphone 22c is set to, for example, about 2.5 cm so as to suit voices at frequencies exceeding, for example, 3400 Hz. The distances L1 and L2 are not limited to these values and can be set as appropriate.
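The example values are consistent with the common half-wavelength rule for microphone pairs, under which phase differences remain unambiguous as long as the spacing does not exceed half the wavelength. The description does not state that this rule was used; the sketch below simply applies it, with an assumed speed of sound of 343 m/s and a hypothetical function name, to show how the 5 cm and 2.5 cm spacings relate to the 3400 Hz boundary.

```python
def max_unambiguous_frequency(spacing_m, c=343.0):
    """Highest frequency (Hz) whose half wavelength is at least spacing_m.

    Applies the half-wavelength criterion spacing <= c / (2 * f); the
    speed of sound c is an assumed value.
    """
    return c / (2.0 * spacing_m)


# Spacings taken from the example values in the text.
f_l1 = max_unambiguous_frequency(0.05)    # pair 22a-22b, L1 = 5 cm
f_l2 = max_unambiguous_frequency(0.025)   # pair 22b-22c, L2 = 2.5 cm
```

Under these assumptions the 5 cm pair is usable up to roughly 3.4 kHz and the 2.5 cm pair up to roughly twice that, which agrees with the division of the frequency range described above.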

In the present embodiment, the voice reaching the microphones 22 is treated as a plane wave when the voice source 72 is located in the far field because the processing for determining the azimuth of the voice source 72 is simpler when the voice is treated as a plane wave than when it is treated as a spherical wave. Treating the voice as a plane wave therefore lightens the processing load when determining the azimuth of a voice source 72 located in the far field.

When the voice source 72 is located in the near field, the voice reaching the microphones 22 is treated as a spherical wave, even though the processing load for determining the azimuth of the voice source 72 becomes heavier. This is because the azimuth of a voice source 72 located in the near field cannot be determined accurately unless the voice reaching the microphones 22 is treated as a spherical wave.

Thus, in the present embodiment, when the voice source 72 is located in the far field, the voice is treated as a plane wave to determine the azimuth of the voice source 72, and when the voice source 72 is located in the near field, the voice is treated as a spherical wave to determine the azimuth of the voice source 72.

As shown in FIG. 4, the received sound signals acquired by the plurality of microphones 22 are input to the preprocessing unit 10. The preprocessing unit 10 performs sound field correction. In the sound field correction, tuning is performed in consideration of the acoustic characteristics of the passenger compartment 46, which is the acoustic space.

When the received sound signal acquired by the microphones 22 contains music, the preprocessing unit 10 removes the music from the received sound signal. A reference music signal (reference signal) is input to the preprocessing unit 10. The preprocessing unit 10 uses the reference music signal to remove the music contained in the received sound signal acquired by the microphones 22.

FIG. 7 is a schematic diagram showing the music removal algorithm. While music is being played by the in-vehicle audio device 84, the received sound signal acquired by the microphones 22 contains music. The received sound signal containing music is input to a music removal processing unit 24 provided in the preprocessing unit 10. The reference music signal is also input to the music removal processing unit 24. The reference music signal can be obtained, for example, by picking up the music output from the speaker 76 of the in-vehicle audio device 84 with microphones 26a and 26b. Alternatively, the music source signal before it is converted into sound by the speaker 76 may be input to the music removal processing unit 24 as the reference music signal.

音楽除去処理部２４からの出力信号は、前処理部１０内に設けられたステップサイズ判定部２８に入力されるようになっている。ステップサイズ判定部２８は、音楽除去処理部２４の出力信号のステップサイズの判定を行うものである。ステップサイズ判定部２８によって判定されたステップサイズは、音楽除去処理部２４にフィードバックされるようになっている。音楽除去処理部２４は、参照用音楽信号を用い、ステップサイズ判定部２８により判定されたステップサイズに基づき、周波数領域の正規化最小二乗法（ＮＬＭＳ：Normalized Least-Mean Square）のアルゴリズムによって、音楽を含む信号から音楽を除去する。車室４６内における音楽の反響成分をも十分に除去すべく、十分な処理段数で音楽の除去の処理が行われる。 An output signal from the music removal processing unit 24 is input to a step size determination unit 28 provided in the preprocessing unit 10. The step size determination unit 28 determines the step size for the output signal of the music removal processing unit 24. The step size determined by the step size determination unit 28 is fed back to the music removal processing unit 24. Using the reference music signal and the step size determined by the step size determination unit 28, the music removal processing unit 24 removes the music from the signal containing music by a frequency-domain normalized least-mean-squares (NLMS) algorithm. The music removal is performed with a sufficient number of processing stages so that the reverberant components of the music in the vehicle compartment 46 are also sufficiently removed.
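The reference-based music removal described above can be illustrated with a minimal sketch. It is not the implementation of the music removal processing unit 24: for brevity it runs NLMS in the time domain rather than the frequency domain, it uses a fixed step size `mu` instead of one fed back from the step size determination unit 28, and the function name `nlms_cancel`, the tap count, and the two-path cabin response are all hypothetical.

```python
import random

def nlms_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Remove the part of `mic` that is linearly predictable from `ref` (time-domain NLMS)."""
    w = [0.0] * taps            # adaptive filter coefficients
    buf = [0.0] * taps          # most recent reference samples, newest first
    out = []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))   # estimated music at the microphone
        e = d - y                                     # residual, i.e. the music-removed output
        norm = sum(b * b for b in buf) + eps          # input power used for normalization
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        out.append(e)
    return out, w

# Assumed test signal: white "music" and a cabin response made of a direct
# path (gain 0.8) plus one echo two samples later (gain 0.3). No speech is present.
rng = random.Random(0)
music = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
mic = [0.8 * music[n] + (0.3 * music[n - 2] if n >= 2 else 0.0) for n in range(3000)]
residual, w = nlms_cancel(mic, music)
```

In this sketch the filter length plays the role of the "number of processing stages": it has to span the echo delays that are to be removed.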

図８は、音楽の除去前と除去後の信号波形を示す図である。横軸は時間を示しており、縦軸は振幅を示している。図８（ａ）は音楽の除去前を示しており、図８（ｂ）は音楽の除去後を示している。図８から分かるように、音楽が確実に除去されている。 FIG. 8 is a diagram illustrating signal waveforms before and after music removal. The horizontal axis indicates time, and the vertical axis indicates amplitude. FIG. 8A shows before music removal, and FIG. 8B shows after music removal. As can be seen from FIG. 8, the music has been reliably removed.

このようにして音楽が除去された信号が、前処理部１０の音楽除去処理部２４から出力され、処理部１２に入力される。なお、前処理部１０において音楽を十分に除去し得ない場合には、後処理部１４においても、音楽の除去の処理を行うようにしてもよい。 The signal from which music has been removed in this manner is output from the music removal processing unit 24 of the preprocessing unit 10 and input to the processing unit 12. If the pre-processing unit 10 cannot sufficiently remove music, the post-processing unit 14 may also perform music removal processing.

音声源方位判定部１６では、音声源の方位の判定が行われる。図９は、音声源の方位の判定のアルゴリズムを示す図である。複数のマイクロフォン２２のうちのあるマイクロフォン２２からの信号が、音声源方位判定部１６内に設けられた遅延部３０に入力されるようになっている。複数のマイクロフォン２２のうちの他のマイクロフォン２２からの信号が、音声源方位判定部１６内に設けられた適応フィルタ３２に入力されるようになっている。遅延部３０の出力信号と適応フィルタ３２の出力信号とが、減算点３４に入力されるようになっている。減算点３４においては、遅延部３０の出力信号から適応フィルタ３４の出力信号が減算される。減算点３４において減算処理が行われた信号に基づいて、適応フィルタ３２が調整される。適応フィルタ３２からの出力は、ピーク検出部３６に入力されるようになっている。ピーク検出部３６は、適応フィルタ係数のピーク（最大値）を検出するものである。適応フィルタ係数のピークに対応する到来時間差τが、目的音の到来方位に対応する到来時間差τである。従って、こうして求められた到来時間差τに基づいて、音声源７２の方位、即ち、目的音の到来方位を判定することが可能となる。 The sound source direction determination unit 16 determines the direction of the sound source. FIG. 9 is a diagram showing the algorithm for determining the direction of the sound source. A signal from one of the plurality of microphones 22 is input to a delay unit 30 provided in the sound source direction determination unit 16. A signal from another of the plurality of microphones 22 is input to an adaptive filter 32 provided in the sound source direction determination unit 16. The output signal of the delay unit 30 and the output signal of the adaptive filter 32 are input to a subtraction point 34. At the subtraction point 34, the output signal of the adaptive filter 32 is subtracted from the output signal of the delay unit 30. The adaptive filter 32 is adjusted based on the signal resulting from the subtraction at the subtraction point 34. The output from the adaptive filter 32 is input to a peak detection unit 36. The peak detection unit 36 detects the peak (maximum value) of the adaptive filter coefficients. The arrival time difference τ corresponding to the peak of the adaptive filter coefficients is the arrival time difference τ corresponding to the arrival direction of the target sound. Therefore, the direction of the sound source 72, that is, the arrival direction of the target sound, can be determined based on the arrival time difference τ thus obtained.
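The delay unit, adaptive filter, subtraction point, and peak detection described above can be sketched as follows. This is only an illustration under stated assumptions: the adaptation rule (time-domain NLMS), the function name `estimate_delay`, the tap count, and the sign convention (a positive result means the second microphone receives the sound later) are not taken from the source.

```python
import random

def estimate_delay(sig_a, sig_b, taps=9, mu=0.5, eps=1e-8):
    """Estimate, in samples, how much later sig_b receives the sound than sig_a."""
    center = taps // 2                  # fixed delay applied to microphone A (delay unit)
    w = [0.0] * taps                    # adaptive filter coefficients
    buf = [0.0] * taps                  # recent samples of microphone B, newest first
    for n, x in enumerate(sig_b):
        buf = [x] + buf[:-1]
        d = sig_a[n - center] if n >= center else 0.0     # output of the delay unit
        e = d - sum(wi * bi for wi, bi in zip(w, buf))    # subtraction point
        norm = sum(b * b for b in buf) + eps
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
    peak = max(range(taps), key=lambda i: w[i])           # peak detection
    return center - peak, w

# Assumed test signal: the same white source reaches microphone B two samples
# later than microphone A.
rng = random.Random(1)
source = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
mic_a = source
mic_b = [0.0, 0.0] + source[:-2]
delay_samples, coeffs = estimate_delay(mic_a, mic_b)
```

Dividing `delay_samples` by the sampling rate gives the arrival time difference τ in seconds. The fixed delay on microphone A lets the filter represent both positive and negative time differences.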

音の速度をｃ［ｍ／ｓ］、マイクロフォン間の距離をｄ［ｍ］、到来時間差をτ［秒］とすると、音声源７２の方向θ［度］は、以下のような式（１）によって表される。なお、音速ｃは、３４０［ｍ／ｓ］程度である。 Letting c [m/s] be the speed of sound, d [m] the distance between the microphones, and τ [s] the arrival time difference, the direction θ [degrees] of the sound source 72 is expressed by the following equation (1). The speed of sound c is about 340 [m/s].

$$\theta = \frac{180}{\pi}\,\arccos\!\left(\frac{\tau\, c}{d}\right) \qquad (1)$$
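Equation (1) can be evaluated directly. The sketch below is illustrative only; the function name `source_direction_deg` and the clamping of the arccos argument to [-1, 1] (to guard against measurement noise or rounding pushing τ·c/d slightly out of range) are assumptions that do not appear in the source.

```python
import math

def source_direction_deg(tau, d, c=340.0):
    """Direction theta in degrees from equation (1): theta = (180/pi) * arccos(tau * c / d)."""
    ratio = tau * c / d
    ratio = max(-1.0, min(1.0, ratio))   # assumed safeguard, not part of equation (1)
    return math.degrees(math.acos(ratio))

# Example with an assumed microphone spacing of 0.1 m.
broadside = source_direction_deg(0.0, 0.1)            # no time difference
oblique = source_direction_deg(0.05 / 340.0, 0.1)     # tau * c / d = 0.5
```

With no arrival time difference the source lies at 90 degrees, that is, broadside to the microphone pair; a ratio τ·c/d of 0.5 corresponds to 60 degrees.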

図１０は、適応フィルタ係数、音声源の方位角、及び、音声信号の振幅を示す図である。図１０（ａ）では、適応フィルタ係数がピークとなる部分にハッチングを付している。図１０（ｂ）は、到来時間差τに基づいて判定された音声源７２の方位を示している。図１０（ｃ）は、音声信号の振幅を示している。なお、図１０は、運転者と助手席者とで交互に音声を発した場合を示している。ここでは、運転者が音声を発する場合の音声源７２ａの方位は、α１とした。助手席者が音声を発する場合の音声源７２ｂの方位は、α２とした。 FIG. 10 is a diagram showing the adaptive filter coefficients, the azimuth angle of the sound source, and the amplitude of the voice signal. In FIG. 10A, the portions where the adaptive filter coefficient peaks are hatched. FIG. 10B shows the direction of the sound source 72 determined based on the arrival time difference τ. FIG. 10C shows the amplitude of the voice signal. FIG. 10 shows a case where the driver and the front passenger speak alternately. Here, the direction of the sound source 72a when the driver speaks is α1, and the direction of the sound source 72b when the front passenger speaks is α2.

図１０（ａ）に示すように、適応フィルタ係数ｗ（ｔ，τ）のピークに基づいて、到来時間差τを検出することが可能である。運転者が音声を発した場合には、適応フィルタ係数のピークに対応する到来時間差τは、例えば−ｔ１程度となる。そして、到来時間差τに基づいて音声源７２ａの方位角を判定すると、音声源７２ａの方位角は例えばα１程度と判定される。一方、助手席者が音声を発した場合には、適応フィルタ係数のピークに対応する到来時間差τは、例えばｔ２程度となる。そして、到来時間差τに基づいて音声源７２ｂの方位角を判定すると、音声源７２ｂの方位角は例えばα２度程度と判定される。なお、ここでは、α１の方位に運転者が位置しており、α２の方位に助手席者が位置している場合を例に説明したが、これに限定されるものではない。音声源７２が近傍界に位置する場合であっても、音声源７２が遠方界に位置する場合であっても、到来時間差τに基づいて、音声源７２の位置を特定することが可能である。但し、音声源７２が近傍界に位置する場合には、上述したように、マイクロフォン２２が３個以上必要であるため、音声源７２の方位を求めるための処理の負荷は重くなる。 As shown in FIG. 10A, the arrival time difference τ can be detected based on the peak of the adaptive filter coefficient w(t, τ). When the driver speaks, the arrival time difference τ corresponding to the peak of the adaptive filter coefficient is, for example, about −t1. When the azimuth angle of the sound source 72a is determined based on this arrival time difference τ, it is determined to be, for example, about α1. On the other hand, when the front passenger speaks, the arrival time difference τ corresponding to the peak of the adaptive filter coefficient is, for example, about t2. When the azimuth angle of the sound source 72b is determined based on this arrival time difference τ, it is determined to be, for example, about α2 degrees. Although the case where the driver is located in the direction α1 and the front passenger is located in the direction α2 has been described here as an example, the present invention is not limited to this. Whether the sound source 72 is located in the near field or in the far field, the position of the sound source 72 can be specified based on the arrival time difference τ. However, when the sound source 72 is located in the near field, three or more microphones 22 are required, as described above, so the processing load for obtaining the direction of the sound source 72 becomes heavier.

音声源方位判定部１６の出力信号、即ち、音声源７２の方位を示す信号が、適応アルゴリズム決定部１８に入力されるようになっている。適応アルゴリズム決定部１８は、音声源７２の方位に基づいて適応アルゴリズムを決定するものである。適応アルゴリズム決定部１８によって決定された適応アルゴリズムを示す信号が、適応アルゴリズム決定部１８から処理部１２に入力されるようになっている。 An output signal from the sound source direction determination unit 16, that is, a signal indicating the direction of the sound source 72 is input to the adaptive algorithm determination unit 18. The adaptive algorithm determination unit 18 determines an adaptive algorithm based on the orientation of the audio source 72. A signal indicating the adaptation algorithm determined by the adaptation algorithm determination unit 18 is input from the adaptation algorithm determination unit 18 to the processing unit 12.

処理部１２は、適応的に指向性を形成する信号処理である適応ビームフォーミングを行うものである（適応ビームフォーマ、ビームフォーミング処理部）。ビームフォーマとしては、例えばＦｒｏｓｔビームフォーマを用いることができる。なお、ビームフォーミングは、Ｆｒｏｓｔビームフォーマに限定されるものではなく、様々なビームフォーマを適宜適用することができる。処理部１２は、適応アルゴリズム決定部１８によって決定された適応アルゴリズムに基づいて、ビームフォーミングを行う。本実施形態において、ビームフォーミングを行うのは、目的音の到来方位に対しての感度を確保しつつ、目的音の到来方向以外の感度を低下させるためである。目的音は、例えば運転者から発せられる音声である。運転者は運転席４０に着座した状態で上半身を動かし得るため、音声源７２ａの位置は変化し得る。音声源７２ａの位置の変化に応じて、目的音の到来方位は変化する。良好な音声認識を行うためには、目的音の到来方向以外の感度を確実に低下させることが好ましい。そこで、本実施形態では、上記のようにして判定される音声源７２の方位に基づいて、当該方位を含む方位範囲以外の方位範囲からの音声を抑圧すべく、ビームフォーマを順次更新する。 The processing unit 12 performs adaptive beamforming, which is signal processing that adaptively forms directivity (adaptive beamformer, beamforming processing unit). For example, a Frost beamformer can be used as the beamformer. The beam forming is not limited to the Frost beamformer, and various beamformers can be applied as appropriate. The processing unit 12 performs beam forming based on the adaptive algorithm determined by the adaptive algorithm determination unit 18. In this embodiment, the beam forming is performed in order to reduce the sensitivity other than the arrival direction of the target sound while securing the sensitivity to the arrival direction of the target sound. The target sound is, for example, a sound emitted from the driver. Since the driver can move the upper body while sitting in the driver's seat 40, the position of the sound source 72a can change. The arrival direction of the target sound changes according to the change in the position of the sound source 72a. In order to perform good speech recognition, it is preferable to reliably reduce the sensitivity other than the arrival direction of the target sound. Therefore, in the present embodiment, based on the direction of the sound source 72 determined as described above, the beam former is sequentially updated so as to suppress sound from an azimuth range other than the azimuth range including the azimuth.

図１１は、ビームフォーマの指向性を概念的に示す図である。図１１は、音声認識の対象とすべき音声源７２ａが運転席４０に位置している場合のビームフォーマの指向性を概念的に示している。図１１におけるハッチングは、到来音が抑圧（抑制、低減）される方位範囲を示している。図１１に示すように、運転席４０の方位を含む方位範囲以外の方位範囲から到来する音が抑圧される。 FIG. 11 is a diagram conceptually showing the directivity of the beamformer. FIG. 11 conceptually shows the directivity of the beamformer when the voice source 72a to be subjected to voice recognition is located in the driver's seat 40. The hatching in FIG. 11 indicates the azimuth range in which the incoming sound is suppressed (suppressed or reduced). As shown in FIG. 11, sound coming from an azimuth range other than the azimuth range including the azimuth of the driver's seat 40 is suppressed.

なお、音声認識の対象とすべき音声源７２ｂが助手席４４に位置している場合には、助手席４４の方位を含む方位範囲以外の方位範囲から到来する音が抑圧されるようにすればよい。 Note that when the sound source 72b to be subjected to speech recognition is located in the passenger seat 44, sound arriving from azimuth ranges other than the azimuth range including the azimuth of the passenger seat 44 may be suppressed.

図１２は、ビームフォーマのアルゴリズムを示す図である。マイクロフォン２２ａ〜２２ｃによって取得される受音信号が、前処理部１０（図４参照）を介して、処理部１２内に設けられた窓関数／高速フーリエ変換処理部４８ａ〜４８ｃにそれぞれ入力されるようになっている。窓関数／高速フーリエ変換処理部４８ａ〜４８ｃは、窓関数処理及び高速フーリエ変換処理を行うものである。本実施形態において、窓関数処理及び高速フーリエ変換処理を行うのは、周波数領域での計算は時間領域での計算より速いためである。窓関数／高速フーリエ変換処理部４８ａの出力信号Ｘ_１，ｋとビームフォーマの重みテンソルＷ_１，ｋ ^＊とが、乗算点５０ａにおいて乗算されるようになっている。窓関数／高速フーリエ変換処理部４８ｂの出力信号Ｘ_２，ｋとビームフォーマの重みテンソルＷ_２，ｋ ^＊とが、乗算点５０ｂにおいて乗算されるようになっている。窓関数／高速フーリエ変換処理部４８ｂの出力信号Ｘ_３，ｋとビームフォーマの重みテンソルＷ_３，ｋ ^＊とが、乗算点５０ｃにおいて乗算されるようになっている。乗算点５０ａ〜５０ｃにおいてそれぞれ乗算処理された信号が、加算点５２において加算されるようになっている。加算点５２において加算処理された信号Ｙ_ｋは、処理部１２内に設けられた逆高速フーリエ変換／重畳加算処理部５４に入力されるようになっている。逆高速フーリエ変換／重畳加算処理部５４は、逆高速フーリエ変換処理及び重畳加算（ＯＬＡ：OverLap-Add）法による処理を行うものである。重畳加算法による処理を行うことにより、周波数領域の信号が時間領域の信号に戻される。逆高速フーリエ変換処理及び重畳加算法による処理が行われた信号が、逆高速フーリエ変換／重畳加算処理部５４から後処理部１４に入力されるようになっている。 FIG. 12 is a diagram showing the beamformer algorithm. The sound reception signals acquired by the microphones 22a to 22c are input, via the preprocessing unit 10 (see FIG. 4), to window function / fast Fourier transform processing units 48a to 48c provided in the processing unit 12, respectively. The window function / fast Fourier transform processing units 48a to 48c perform window function processing and fast Fourier transform processing. In this embodiment, window function processing and fast Fourier transform processing are performed because calculation in the frequency domain is faster than calculation in the time domain. The output signal X_{1,k} of the window function / fast Fourier transform processing unit 48a and the beamformer weight tensor W_{1,k}^* are multiplied at the multiplication point 50a. The output signal X_{2,k} of the window function / fast Fourier transform processing unit 48b and the beamformer weight tensor W_{2,k}^* are multiplied at the multiplication point 50b. The output signal X_{3,k} of the window function / fast Fourier transform processing unit 48c and the beamformer weight tensor W_{3,k}^* are multiplied at the multiplication point 50c. The signals multiplied at the multiplication points 50a to 50c are added at the addition point 52. The signal Y_k obtained at the addition point 52 is input to an inverse fast Fourier transform / overlap-add processing unit 54 provided in the processing unit 12. The inverse fast Fourier transform / overlap-add processing unit 54 performs inverse fast Fourier transform processing and processing by the overlap-add (OLA) method. The overlap-add processing returns the frequency-domain signal to a time-domain signal. The signal that has undergone the inverse fast Fourier transform processing and the overlap-add processing is input from the inverse fast Fourier transform / overlap-add processing unit 54 to the post-processing unit 14.
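The multiply-and-add structure at the multiplication points 50a to 50c and the addition point 52 amounts to Y_k = Σ_m W_{m,k}^* X_{m,k} for each frequency bin k. The sketch below shows that step for a single bin. The weights are fixed delay-and-sum weights chosen purely for illustration, not the adaptive Frost weights mentioned in the source, and the function names, the linear microphone layout, and the sign convention of the steering vector are assumptions.

```python
import cmath
import math

def beamform_bin(x_bins, weights):
    """Y_k = sum over microphones m of conj(W_{m,k}) * X_{m,k}, for one frequency bin k."""
    return sum(w.conjugate() * x for w, x in zip(weights, x_bins))

def steering_vector(freq_hz, theta_deg, mic_positions_m, c=340.0):
    """Relative phase of a far-field plane wave at each microphone of a linear array."""
    cos_t = math.cos(math.radians(theta_deg))
    return [cmath.exp(2j * math.pi * freq_hz * p * cos_t / c) for p in mic_positions_m]

# Assumed geometry: three microphones 5 cm apart, one bin at 1 kHz, target at 60 degrees.
positions = [0.0, 0.05, 0.10]
target = steering_vector(1000.0, 60.0, positions)
weights = [a / len(positions) for a in target]          # delay-and-sum weights

on_target = beamform_bin(target, weights)               # unit-amplitude wave from 60 degrees
off_target = beamform_bin(steering_vector(1000.0, 150.0, positions), weights)
```

A wave from the steered direction passes with unit gain, while a wave from 150 degrees is attenuated. An adaptive beamformer would go further and update the weights so that nulls fall on the interfering directions.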

図１３は、ビームフォーマにより得られた指向性（角度特性）を示す図である。横軸は方位角を示しており、縦軸は出力信号パワーを示している。図１３から分かるように、例えば方位角β１と方位角β２とにおいて出力信号パワーが極小となる。方位角β１と方位角β２との間においても、十分な抑圧が行われている。図１３に示すような指向性のビームフォーマを用いれば、助手席から到来する音を十分に抑圧することができる。一方、運転席から到来する音声は、殆ど抑圧されることなくマイクロフォン２２に到達する。 FIG. 13 is a diagram showing the directivity (angle characteristic) obtained by the beamformer. The horizontal axis indicates the azimuth angle, and the vertical axis indicates the output signal power. As can be seen from FIG. 13, for example, the output signal power is minimized at the azimuth angle β1 and the azimuth angle β2. Sufficient suppression is also performed between the azimuth angle β1 and the azimuth angle β2. If a directional beamformer as shown in FIG. 13 is used, the sound coming from the passenger seat can be sufficiently suppressed. On the other hand, the voice coming from the driver's seat reaches the microphone 22 with almost no suppression.

本実施形態では、音声源７２から到来する音声の大きさよりも、音声源７２の方位を含む方位範囲以外の方位範囲から到来する音の方が大きい場合には、音声源７２の方位の判定を中断する（音声源方位判定キャンセル処理）。例えば、運転者からの音声を取得するようにビームフォーマが設定されている場合において、運転者からの音声よりも助手席者からの音声の方が大きい場合には、音声源の方位の推定を中断する。この場合、マイクロフォン２２によって取得される受音信号を十分に抑圧する。図１４は、ビームフォーマと音声源方位判定キャンセル処理とを組み合わせた場合の指向性（角度特性）を示す図である。実線は、ビームフォーマの指向性を示している。一点鎖線は、音声源方位判定キャンセル処理の角度特性を示している。例えばγ１より小さい方位から到来する音声、又は、例えばγ２より大きい方位から到来する音声が、運転者からの音声よりも大きい場合には、音声源方位判定キャンセル処理が行われる。なお、ここでは、運転者からの音声を取得するようにビームフォーマが設定されている場合を例に説明したが、助手席者からの音声を取得するようにビームフォーマが設定されていてもよい。この場合には、助手席者からの音声よりも運転者からの音声の方が大きい場合には、音声源の方位の推定を中断する。 In the present embodiment, when sound arriving from an azimuth range other than the azimuth range including the azimuth of the sound source 72 is louder than the voice arriving from the sound source 72, the determination of the direction of the sound source 72 is suspended (sound source direction determination cancellation process). For example, when the beamformer is set to acquire the voice from the driver and the voice from the front passenger is louder than the voice from the driver, the estimation of the direction of the sound source is suspended. In this case, the sound reception signal acquired by the microphone 22 is sufficiently suppressed. FIG. 14 is a diagram showing the directivity (angle characteristic) when the beamformer and the sound source direction determination cancellation process are combined. The solid line indicates the directivity of the beamformer. The alternate long and short dash line indicates the angle characteristic of the sound source direction determination cancellation process. For example, when a voice arriving from a direction smaller than γ1 or a voice arriving from a direction larger than γ2 is louder than the voice from the driver, the sound source direction determination cancellation process is performed. Although the case where the beamformer is set to acquire the voice from the driver has been described here as an example, the beamformer may instead be set to acquire the voice from the front passenger. In that case, when the voice from the driver is louder than the voice from the front passenger, the estimation of the direction of the sound source is suspended.

図１５は、マイクロフォンが２個の場合におけるビームフォーマにより得られる指向性を示すグラフである。横軸は方位角であり、縦軸は出力信号パワーである。マイクロフォン２２が２個であるため、極小値となる角度が１箇所のみである。図１５から分かるように、例えば方位角β１においては著しい抑圧が可能であるが、音声源７２の方位の変化に対するロバスト性はあまり高くない。 FIG. 15 is a graph showing the directivity obtained by the beamformer when two microphones are used. The horizontal axis is the azimuth angle, and the vertical axis is the output signal power. Since there are two microphones 22, there is only one angle at which the minimum value is obtained. As can be seen from FIG. 15, for example, significant suppression is possible at the azimuth angle β1, but the robustness to the change in the azimuth of the audio source 72 is not so high.

こうして、音声源７２の方位を含む方位範囲以外の方位範囲から到来する音が抑圧された信号が、処理部１２から出力される。処理部１２からの出力信号は、後処理部１４に入力されるようになっている。 In this way, the processing unit 12 outputs a signal in which sound coming from an azimuth range other than the azimuth range including the azimuth of the audio source 72 is suppressed. An output signal from the processing unit 12 is input to the post-processing unit 14.

後処理部（後処理適応フィルタ）１４においては、ノイズの除去が行われる。かかるノイズとしては、例えばエンジンノイズ、ロードノイズ、風切り音等が挙げられる。図１６は、ノイズの除去のアルゴリズムを示す図である。ノイズモデル決定部２０内に設けられた基本波判定部５６によって、ノイズの基本波が判定される。基本波判定部５６は、ノイズの基本波に基づいた正弦波を出力する。基本波判定部５６から出力される正弦波は、ノイズモデル決定部２０内に設けられたモデリング処理部５８に入力されるようになっている。モデリング処理部５８は、非線形マッピング処理部６０と、線形フィルタ６２と、非線形マッピング処理部６４とを有している。モデリング処理部５８は、Hammerstein-Wiener非線形モデルによるモデリング処理を行うものである。モデリング処理部５８には、非線形マッピング処理部６０、線形フィルタ６２及び非線形マッピング処理部６４が設けられている。モデリング処理部５８は、基本波判定部５６から出力される正弦波に対してモデリング処理を行うことにより、参照用ノイズ信号を生成する。モデリング処理部５８から出力される参照用ノイズ信号は、ノイズが含まれた信号からノイズを除去するための参照信号となる。参照用ノイズ信号は、後処理部１４内に設けられたノイズ除去処理部６６に入力されるようになっている。ノイズ除去処理部６６には、処理部１２からのノイズを含む信号も入力されるようになっている。ノイズ除去処理部６６は、参照用ノイズ信号を用い、正規化最小二乗法のアルゴリズムによって、ノイズを含む信号からノイズを除去する。ノイズ除去処理部６６からは、ノイズが除去された信号が出力される。 The post-processing unit (post-processing adaptive filter) 14 removes noise. Examples of such noise include engine noise, road noise, and wind noise. FIG. 16 is a diagram showing the noise removal algorithm. The fundamental wave of the noise is determined by a fundamental wave determination unit 56 provided in the noise model determination unit 20. The fundamental wave determination unit 56 outputs a sine wave based on the fundamental wave of the noise. The sine wave output from the fundamental wave determination unit 56 is input to a modeling processing unit 58 provided in the noise model determination unit 20. The modeling processing unit 58 includes a non-linear mapping processing unit 60, a linear filter 62, and a non-linear mapping processing unit 64. The modeling processing unit 58 performs modeling processing using a Hammerstein-Wiener nonlinear model. The modeling processing unit 58 is provided with the non-linear mapping processing unit 60, the linear filter 62, and the non-linear mapping processing unit 64. The modeling processing unit 58 generates a reference noise signal by performing the modeling processing on the sine wave output from the fundamental wave determination unit 56. The reference noise signal output from the modeling processing unit 58 serves as a reference signal for removing noise from a signal containing noise. The reference noise signal is input to a noise removal processing unit 66 provided in the post-processing unit 14. The signal containing noise from the processing unit 12 is also input to the noise removal processing unit 66. Using the reference noise signal, the noise removal processing unit 66 removes the noise from the signal containing noise by a normalized least-mean-squares (NLMS) algorithm. The noise removal processing unit 66 outputs the signal from which the noise has been removed.
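The cascade in the modeling processing unit 58 (non-linear mapping, linear filter, non-linear mapping) has the general shape of a Hammerstein-Wiener model. The sketch below shows only that structure. The particular nonlinearities, the FIR coefficients, the 50 Hz fundamental, and the 8 kHz sampling rate are hypothetical placeholders, since the source does not specify them.

```python
import math

def hammerstein_wiener(u, f_in, fir, f_out):
    """Static input nonlinearity -> FIR linear filter -> static output nonlinearity."""
    v = [f_in(x) for x in u]                                    # cf. non-linear mapping 60
    filtered = []
    for n in range(len(v)):                                     # cf. linear filter 62
        filtered.append(sum(h * v[n - k] for k, h in enumerate(fir) if n - k >= 0))
    return [f_out(s) for s in filtered]                         # cf. non-linear mapping 64

# Assumed example: a sine wave at the noise fundamental shaped into a reference noise signal.
fs = 8000.0
fundamental = [math.sin(2.0 * math.pi * 50.0 * n / fs) for n in range(400)]
reference_noise = hammerstein_wiener(
    fundamental,
    math.tanh,                       # placeholder input nonlinearity
    [0.6, 0.3, 0.1],                 # placeholder FIR coefficients
    lambda s: s + 0.2 * s ** 3,      # placeholder output nonlinearity
)

# Sanity check of the structure with identity nonlinearities and a two-tap average.
averaged = hammerstein_wiener([2.0, 4.0, 6.0], lambda x: x, [0.5, 0.5], lambda x: x)
```

The nonlinear stages let a pure sine wave acquire harmonics, which is one plausible reason for using such a model to approximate engine-related noise; the resulting reference would then feed an NLMS canceller.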

図１７は、ノイズの除去前と除去後の信号波形を示す図である。横軸は時間を示しており、縦軸は振幅を示している。図１７（ａ）はノイズ除去前を示しており、図１７（ｂ）はノイズ除去後を示している。図１７から分かるように、ノイズが確実に除去されている。 FIG. 17 is a diagram illustrating signal waveforms before and after noise removal. The horizontal axis indicates time, and the vertical axis indicates amplitude. FIG. 17A shows before noise removal, and FIG. 17B shows after noise removal. As can be seen from FIG. 17, noise is reliably removed.

後処理部１４においては、歪低減処理も行われる。なお、ノイズの除去は、後処理部１４においてのみ行われるわけではない。マイクロフォン２２を介して取得された音に対して、前処理部１０、処理部１２及び後処理部１４において行われる一連の処理によって、ノイズの除去が行われる。 The post-processing unit 14 also performs distortion reduction processing. Note that noise removal is not performed only in the post-processing unit 14. Noise is removed from a sound acquired via the microphone 22 by a series of processes performed in the preprocessing unit 10, the processing unit 12, and the postprocessing unit 14.

こうして、後処理部１４によって後処理が行われた信号が、自動音声認識装置１６８に音声出力として出力される。目的音以外の音が抑圧された良好な目的音が自動音声認識装置１６８に入力されるため、自動音声認識装置１６８は、音声認識の精度を向上することができる。自動音声認識装置１６８による音声認識結果に基づいて、車両１３６に搭載されている機器等に対しての操作が自動で行われる。 In this way, the signal post-processed by the post-processing unit 14 is output to the automatic speech recognition apparatus 168 as a voice output. Since a good target sound in which sounds other than the target sound are suppressed is input to the automatic speech recognition device 168, the automatic speech recognition device 168 can improve the accuracy of speech recognition. Based on the voice recognition result by the automatic voice recognition device 168, the operation on the device mounted on the vehicle 136 is automatically performed.

次に、本実施形態による音声処理装置及びその音声処理装置を用いた制御装置の動作について図１８乃至図２０を用いて説明する。図１８は、本実施形態による音声処理装置の動作を示すフローチャートである。 Next, the operations of the voice processing device according to the present embodiment and of the control device using the voice processing device will be described with reference to FIGS. 18 to 20. FIG. 18 is a flowchart showing the operation of the voice processing device according to the present embodiment.

まず、図１８に示すように、車両１３６内に乗員が存在するか否かを判定する（ステップＳ１）。車両１３６内に乗員が存在するか否かは、例えば、乗員検出部１４２からの乗員有無検知信号に基づいて判断し得る。 First, as shown in FIG. 18, it is determined whether or not an occupant is present in the vehicle 136 (step S1). Whether an occupant is present in the vehicle 136 can be determined based on, for example, an occupant presence / absence detection signal from the occupant detection unit 142.

車両１３６内に乗員が存在する場合には（ステップＳ１においてＹＥＳ）、音声処理装置１０２を第１の動作モードで動作させる。第１の動作モードは、車両１３６内に乗員が存在していることを前提とした動作モードである。第１の動作モードにおいては、音声源方位判定、ビームフォーミング処理、ノイズ除去処理、音楽除去処理等が行われる。 If an occupant is present in vehicle 136 (YES in step S1), speech processing apparatus 102 is operated in the first operation mode. The first operation mode is an operation mode on the assumption that an occupant is present in the vehicle 136. In the first operation mode, sound source direction determination, beam forming processing, noise removal processing, music removal processing, and the like are performed.

第１の動作モードにおける音声処理装置の動作を、図１９を用いて説明する。図１９は、本実施形態による音声処理装置における第１の動作モードでの動作を示すフローチャートである。 The operation of the voice processing device in the first operation mode will be described with reference to FIG. 19. FIG. 19 is a flowchart showing the operation of the voice processing device according to the present embodiment in the first operation mode.

まず、ノイズ除去処理及び音楽除去処理が開始される（ステップＳ１０）。即ち、ノイズ除去処理及び音楽除去処理がオンに設定される。ノイズ除去処理及び音楽除去処理は、この後、継続して行われる。なお、車載音響機器８４が音楽を出力していない場合や、音楽の音量が極めて小さい場合等には、音楽除去処理を行わなくてもよい。上述したように、前処理部１０、処理部１２及び後処理部１４において行われる一連の処理によって、ノイズの除去が行われる。また、上述したように、音楽除去処理は、前処理部１０に設けられた音楽除去処理部２４等によって行われる。 First, noise removal processing and music removal processing are started (step S10). That is, the noise removal process and the music removal process are set on. Thereafter, the noise removal process and the music removal process are continuously performed. Note that the music removal process may not be performed when the in-vehicle acoustic device 84 is not outputting music or when the volume of the music is extremely low. As described above, noise is removed by a series of processes performed in the pre-processing unit 10, the processing unit 12, and the post-processing unit 14. Further, as described above, the music removal processing is performed by the music removal processing unit 24 provided in the preprocessing unit 10 or the like.

乗員による呼びかけが音声処理装置１０２に対して行われる前においては（ステップＳ１１においてＮＯ）、ノイズ除去処理、音楽除去処理等は行われるが、音声源方位判定、ビームフォーミング等は行われない。 Before the call by the occupant is made to the voice processing apparatus 102 (NO in step S11), noise removal processing, music removal processing, and the like are performed, but voice source direction determination, beam forming, and the like are not performed.

乗員による呼びかけが音声処理装置１０２に対して行われると（ステップＳ１１においてＹＥＳ）、音声源方位判定処理及びビームフォーミング処理がオンに設定され、呼びかけを発した音声源７２の方位が判定される（ステップＳ１２）。音声源７２の方位の判定は、上述したように、音声源方位判定部１６等によって行われる。呼びかけは、例えば、運転者によって行われる。なお、呼びかけは、運転者が行わなくてもよい。例えば、助手席者が呼びかけを行ってもよい。また、呼びかけは、特定の言葉であってもよいし、単なる発声であってもよい。 When an occupant calls out to the voice processing device 102 (YES in step S11), the sound source direction determination process and the beamforming process are set to on, and the direction of the sound source 72 that issued the call is determined (step S12). As described above, the direction of the sound source 72 is determined by the sound source direction determination unit 16 and the like. The call is made, for example, by the driver. The call need not be made by the driver; for example, the front passenger may make the call. The call may be a specific word or a mere utterance.

次に、音声源７２の方位に応じて、ビームフォーマの指向性が設定される（ステップＳ１３）。ビームフォーマの指向性の設定は、上述したように、適応アルゴリズム決定部１８、処理部１２等によって行われる。 Next, the directivity of the beamformer is set according to the direction of the sound source 72 (step S13). The setting of the beamformer directivity is performed by the adaptive algorithm determination unit 18, the processing unit 12, and the like as described above.

音声源７２の方位を含む所定の方位範囲以外の方位範囲から到来する音の大きさが、音声源７２から到来する音声の大きさ以上である場合には（ステップＳ１４においてＹＥＳ）、音声源７２の方位の判定を中断する（ステップＳ１５）。 When the loudness of sound arriving from an azimuth range other than the predetermined azimuth range including the azimuth of the sound source 72 is greater than or equal to the loudness of the voice arriving from the sound source 72 (YES in step S14), the determination of the direction of the sound source 72 is suspended (step S15).

一方、音声源７２の方位を含む所定の方位範囲以外の方位範囲から到来する音の大きさが、音声源７２から到来する音声の大きさ以上でない場合には（ステップＳ１４においてＮＯ）、ステップＳ１２、Ｓ１３を繰り返し行う。 On the other hand, when the loudness of sound arriving from an azimuth range other than the predetermined azimuth range including the azimuth of the sound source 72 is not greater than or equal to the loudness of the voice arriving from the sound source 72 (NO in step S14), steps S12 and S13 are repeated.
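The branch at step S14 can be written as a small decision function. This is only a sketch of the comparison described above; the function name `tracking_decision`, the string return values, and the idea of passing the two loudness values as plain numbers are assumptions, since the source does not say how the loudness is measured.

```python
def tracking_decision(target_level, off_target_level):
    """Step S14: compare loudness inside and outside the target azimuth range."""
    if off_target_level >= target_level:
        return "suspend"    # YES in S14: suspend direction determination (step S15)
    return "track"          # NO in S14: repeat steps S12 and S13

louder_interferer = tracking_decision(target_level=0.2, off_target_level=0.5)
equal_levels = tracking_decision(target_level=0.3, off_target_level=0.3)
clear_target = tracking_decision(target_level=0.6, off_target_level=0.1)
```

Note that equal levels also lead to suspension, because the condition in step S14 is "greater than or equal to".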

こうして、音声源７２の位置の変化に応じて、ビームフォーマが適応的に設定され、目的音以外の音が確実に抑制される。ノイズ除去処理や音楽除去処理等が行われ、且つ、目的音以外の音が抑圧された、良好な目的音が自動音声認識装置１６８に入力されるため、自動音声認識装置１６８は音声認識の精度を向上することができる。自動音声認識装置１６８による音声認識結果に基づいて、車両１３６に搭載されている機器等に対しての操作、例えば、ドア、ウィンドウ、ワイパー、ウインカー等に対しての操作が自動で行われる。 In this way, the beamformer is set adaptively according to changes in the position of the sound source 72, and sounds other than the target sound are reliably suppressed. Because a good target sound, which has undergone noise removal processing, music removal processing, and the like and in which sounds other than the target sound are suppressed, is input to the automatic speech recognition device 168, the automatic speech recognition device 168 can improve the accuracy of speech recognition. Based on the speech recognition result from the automatic speech recognition device 168, operations on devices mounted on the vehicle 136, for example, operations on doors, windows, wipers, turn signals, and the like, are performed automatically.

一方、車両１３６内に乗員が存在しない場合には（ステップＳ１においてＮＯ）、音声処理装置１０２を第２の動作モードで動作させる。第２の動作モードは、車両１３６内に乗員が存在しないことを前提とした動作モードである。第２の動作モードにおいては、ノイズ除去処理、音楽除去処理等は行われるが、音声源方位判定やビームフォーミング処理等は行われない。 On the other hand, if no occupant is present in vehicle 136 (NO in step S1), speech processing apparatus 102 is operated in the second operation mode. The second operation mode is an operation mode on the assumption that no occupant is present in the vehicle 136. In the second operation mode, noise removal processing, music removal processing, and the like are performed, but audio source direction determination, beam forming processing, and the like are not performed.

第２の動作モードにおける音声処理装置の動作を、図２０を用いて説明する。図２０は、本実施形態による音声処理装置における第２の動作モードでの動作を示すフローチャートである。 The operation of the voice processing device in the second operation mode will be described with reference to FIG. 20. FIG. 20 is a flowchart showing the operation of the voice processing device according to the present embodiment in the second operation mode.

まず、ノイズ除去処理及び音楽除去処理が開始される（ステップＳ２０）。ノイズ除去処理及び音楽除去処理は、この後、継続して行われる。第２の動作モードにおいては、音声源方位判定、ビームフォーミング等は行われない。即ち、第２の動作モードにおいては、音声源方位判定処理やビームフォーミング処理が、オフに設定される。なお、上述したように、車載音響機器８４が音楽を出力していない場合や、音楽の音量が極めて小さい場合等には、音楽除去処理を行わなくてもよい。また、上述したように、前処理部１０、処理部１２及び後処理部１４において行われる一連の処理によって、ノイズの除去が行われる。また、上述したように、音楽除去処理は、前処理部１０に設けられた音楽除去処理部２４等によって行われる。 First, noise removal processing and music removal processing are started (step S20). Thereafter, the noise removal process and the music removal process are continuously performed. In the second operation mode, sound source direction determination, beam forming, and the like are not performed. That is, in the second operation mode, the sound source direction determination process and the beam forming process are set to off. As described above, the music removal process may not be performed when the in-vehicle audio device 84 does not output music or when the volume of music is extremely low. In addition, as described above, noise is removed by a series of processes performed in the preprocessing unit 10, the processing unit 12, and the postprocessing unit 14. Further, as described above, the music removal processing is performed by the music removal processing unit 24 provided in the preprocessing unit 10 or the like.

In the second operation mode, a good voice signal that has undergone noise removal processing, music removal processing, and the like is output from the voice processing device 102. The voice source direction determination processing and the beamforming processing are set to off in the second operation mode for the following reason. When the occupant is outside the vehicle 136, it is not always easy to identify the occupant's position outside the vehicle 136 accurately and reliably, so beamforming may be directed in the wrong direction. If the occupant speaks while beamforming is directed in the wrong direction, the occupant's voice is suppressed and may not be acquired. For this reason, in this embodiment, beamforming is not performed in the second operation mode. Because beamforming is not performed, the voice source direction determination that beamforming would require is not performed either. Since a good voice signal that has undergone noise removal processing, music removal processing, and the like is input to the automatic speech recognition device 168, the automatic speech recognition device 168 can perform speech recognition with high accuracy. Based on the speech recognition result from the automatic speech recognition device 168, devices and the like mounted on the vehicle 136 are operated automatically.
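The relationship between the occupant presence signal and the enabled processing stages can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the names `StageConfig` and `select_stages` and the `music_playing` flag are hypothetical, and the sketch assumes that direction determination and beamforming are both enabled whenever an occupant is in the vehicle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Which processing stages are enabled (hypothetical structure)."""
    noise_removal: bool
    music_removal: bool
    direction_determination: bool
    beamforming: bool


def select_stages(occupant_in_vehicle: bool, music_playing: bool = True) -> StageConfig:
    """Choose the enabled stages from the occupant presence signal.

    Beamforming needs a voice source direction, so the two stages are
    switched on and off together in this sketch.
    """
    return StageConfig(
        noise_removal=True,                           # performed in either mode
        music_removal=music_playing,                  # may be skipped when no music is output
        direction_determination=occupant_in_vehicle,  # off in the second operation mode
        beamforming=occupant_in_vehicle,              # off in the second operation mode
    )


# Second operation mode: no occupant in the vehicle.
print(select_stages(occupant_in_vehicle=False))
```

Keeping noise removal unconditional while gating only the direction-dependent stages reflects the reasoning above: a mis-steered beam can suppress the very voice the device is trying to capture.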

In the second operation mode, when a predetermined word such as "open" or "close" is detected by the automatic speech recognition device 168, it is considered that an occupant located outside the vehicle 136 wants the opening/closing body 134 to be opened or closed. In addition, when the proximity detection signal from the proximity detection unit 126 is being input to the input unit 114, it is considered that the predetermined word was uttered by the occupant. Therefore, when a predetermined word such as "open" or "close" is detected by the automatic speech recognition device 168 in a state where the occupant presence/absence detection signal indicates that no occupant is present in the vehicle 136 and the proximity detection signal from the proximity detection unit 126 is being input to the input unit 114, the control unit 116 performs control for opening or closing the opening/closing body 134. Specifically, the control unit 116 opens or closes the opening/closing body 134 by controlling the opening/closing body driving device 132 via the output unit 120.
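The combination of conditions described above can be written as a small decision function. This is a sketch under stated assumptions: the function name, the word sets, and the string return values are hypothetical and only illustrate the logic, not the actual interface of the control unit 116.

```python
from typing import Optional

# Hypothetical word lists; the text gives "open" and "close" only as examples.
OPEN_WORDS = {"open"}
CLOSE_WORDS = {"close"}


def opening_body_action(occupant_in_vehicle: bool,
                        proximity_detected: bool,
                        recognized_word: Optional[str]) -> Optional[str]:
    """Return "open", "close", or None when no action should be taken."""
    # Act only when nobody is inside the vehicle and the proximity
    # detection signal indicates that the occupant is near the vehicle.
    if occupant_in_vehicle or not proximity_detected:
        return None
    if recognized_word in OPEN_WORDS:
        return "open"
    if recognized_word in CLOSE_WORDS:
        return "close"
    return None


print(opening_body_action(False, True, "open"))
```

Requiring the proximity signal in addition to the recognized word is what ties the utterance to the occupant rather than to an arbitrary passer-by.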

In the second operation mode, when a predetermined word such as "stop" is detected by the automatic speech recognition device 168, it is considered that an occupant located outside the vehicle 136 wants the vehicle 136 to stop. For example, if the vehicle 136 has been parked on a slope and starts to move, the occupant outside the vehicle 136 will want it to stop. Therefore, when a predetermined word such as "stop" is detected in a state where the occupant presence/absence detection signal indicates that no occupant is present in the vehicle 136 and the vehicle 136 is moving, the control unit 116 performs control for stopping the vehicle 136. Specifically, the control unit 116 operates the brake 140 by controlling the brake control device 138 via the output unit 120, thereby stopping the vehicle 136.
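The stop condition can be sketched in the same way. The function name and the `STOP_WORDS` set are hypothetical; how the moving state of the vehicle is obtained is not modeled here and is passed in as a plain flag.

```python
from typing import Optional

# Hypothetical word list; the text gives "stop" only as an example.
STOP_WORDS = {"stop"}


def should_apply_brake(occupant_in_vehicle: bool,
                       vehicle_moving: bool,
                       recognized_word: Optional[str]) -> bool:
    """True when the brake should be operated in response to a spoken word."""
    return (
        not occupant_in_vehicle          # nobody is inside the vehicle
        and vehicle_moving               # e.g. the vehicle started rolling on a slope
        and recognized_word in STOP_WORDS
    )


print(should_apply_brake(False, True, "stop"))
```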

As described above, according to the present embodiment, on/off of beamforming is set based on the occupant presence/absence detection signal, which indicates whether an occupant is present in the vehicle 136. Therefore, even when the occupant is outside the vehicle 136, the voice uttered by the occupant can be reliably detected using the microphones 22 arranged in the vehicle 136. Because a microphone for acquiring voices uttered outside the vehicle 136 need not be provided separately from the microphones 22 arranged in the vehicle 136, this contributes to cost reduction. Thus, the present embodiment can provide a voice processing device that accurately processes voices that may be uttered inside or outside the vehicle while satisfying the demand for cost reduction, and a control device using that voice processing device.

(Modification)
Next, a voice processing device according to a modification of the present embodiment, and a control device using that voice processing device, will be described with reference to FIGS. 18, 19, and 21. FIG. 21 is a flowchart showing the operation in the second operation mode of the voice processing device according to this modification.

In the voice processing device according to this modification, after an occupant located outside the vehicle 136 utters a predetermined word, beamforming is performed toward that occupant.

First, as in the voice processing device according to the embodiment described above with reference to FIG. 18, it is determined whether an occupant is present in the vehicle 136 (step S1).

If an occupant is present in the vehicle 136 (YES in step S1), the voice processing device 102 is operated in the first operation mode. The operation of the voice processing device in the first operation mode is the same as the first-mode operation of the voice processing device according to the embodiment described above with reference to FIG. 19, and its description is therefore omitted.

On the other hand, if no occupant is present in the vehicle 136 (NO in step S1), the voice processing device 102 is operated in the second operation mode. As described above, the second operation mode presupposes that no occupant is present in the vehicle 136. In the second operation mode, before a predetermined word is detected by the automatic speech recognition device 168, noise removal processing, music removal processing, and the like are performed, but voice source direction determination, beamforming, and the like are not. That is, before the predetermined word is detected by the automatic speech recognition device 168, the voice source direction determination processing and the beamforming processing are set to off.

The operation of the voice processing device in the second operation mode will be described with reference to FIG. 21. FIG. 21 is a flowchart showing the operation in the second operation mode of the voice processing device according to this modification.

First, noise removal processing and music removal processing are started (step S30) and are performed continuously thereafter. At this point in the second operation mode, voice source direction determination, beamforming, and the like are not performed. As described above, the music removal processing may be omitted when the in-vehicle audio device 84 is not outputting music or when the music volume is extremely low. Also as described above, noise is removed by the series of processes performed in the preprocessing unit 10, the processing unit 12, and the post-processing unit 14, and the music removal processing is performed by the music removal processing unit 24 and the like provided in the preprocessing unit 10.

When the predetermined word is detected by the automatic speech recognition device 168 (YES in step S31), the voice source direction determination processing and the beamforming processing are set to on, and the direction of the voice source 72 that uttered the predetermined word is determined (step S32). As described above, the direction of the voice source 72 is determined by the voice source direction determination unit 16 and the like. An example of the predetermined word is "Ah" (「あ」), an exclamation uttered in surprise. When such a word is uttered, the occupant located outside the vehicle 136 is considered to be surprised. Therefore, when the predetermined word is detected by the automatic speech recognition device 168 (YES in step S31), the operations from step S32 onward are performed so that the voice of the occupant who uttered the word can be acquired more reliably.

Next, the directivity of the beamformer is set according to the direction of the voice source 72 (step S33). As described above, the directivity of the beamformer is set by the adaptive algorithm determination unit 18, the processing unit 12, and the like.

If the level of sound arriving from azimuth ranges other than the predetermined azimuth range including the direction of the voice source 72 is equal to or greater than the level of the voice arriving from the voice source 72 (YES in step S34), the determination of the direction of the voice source 72 is suspended (step S35).

On the other hand, if the level of sound arriving from azimuth ranges other than the predetermined azimuth range including the direction of the voice source 72 is less than the level of the voice arriving from the voice source 72 (NO in step S34), steps S32 and S33 are repeated.
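The flow of steps S31 to S35 can be illustrated with a small per-frame state update. This is only a sketch under several assumptions: the `Frame` and `BeamState` structures, the `step` function, and the trigger word spelling `"ah"` are hypothetical; the levels and directions are supplied as ready-made numbers rather than computed from microphone signals; and what happens after the suspension in step S35 is not modeled, so the state simply stays suspended.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    word: Optional[str]      # word reported by the recognizer, if any
    direction_deg: float     # estimated direction of the voice source
    target_level: float      # level of the voice arriving from the voice source
    off_target_level: float  # level of sound arriving from outside the target range


@dataclass
class BeamState:
    beamforming_on: bool = False
    steer_deg: Optional[float] = None
    tracking_suspended: bool = False


def step(state: BeamState, frame: Frame, trigger_word: str = "ah") -> BeamState:
    """Advance the second-mode flow by one frame."""
    if not state.beamforming_on:
        # S31: beamforming stays off until the predetermined word is detected.
        if frame.word != trigger_word:
            return state
        state.beamforming_on = True
    if state.tracking_suspended:
        return state  # S35 already reached; resumption is not modeled here
    # S32 and S33: determine the direction and steer the beamformer toward it.
    state.steer_deg = frame.direction_deg
    # S34: suspend direction determination if off-target sound is at least as loud.
    if frame.off_target_level >= frame.target_level:
        state.tracking_suspended = True  # S35
    return state


state = BeamState()
for frame in [
    Frame(None, 10.0, 1.0, 0.2),   # before the trigger word: nothing changes
    Frame("ah", 40.0, 1.0, 0.2),   # trigger word: beamforming turns on
    Frame(None, 44.0, 0.5, 0.8),   # loud off-target sound: tracking suspended
]:
    state = step(state, frame)
print(state)
```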

As described above, according to this modification, the voice source direction determination processing, the beamforming processing, and the like are set to on after the predetermined word is detected, so a better voice signal in which sounds other than the target sound are suppressed is input to the automatic speech recognition device 168. This modification therefore further improves the accuracy of speech recognition and allows devices and the like mounted on the vehicle 136 to be operated more accurately and reliably.

[Modified Embodiments]
The present invention is not limited to the above embodiment, and various modifications are possible.

For example, although the above embodiment describes the case where three microphones 22 are used, the number of microphones 22 is not limited to three and may be four or more. Using more microphones 22 allows the direction of the voice source 72 to be determined with higher accuracy.

22, 22a to 22c, 26a, 26b ... microphone
40 ... driver's seat
42 ... dashboard
44 ... passenger seat
46 ... vehicle body, passenger compartment
72, 72a, 72b ... voice source
76 ... speaker
78 ... steering wheel
80 ... engine
82 ... external noise source
84 ... in-vehicle audio device
100 ... control device
102 ... voice processing device
134, 134a to 134c ... opening/closing body
136 ... vehicle
148 ... communication area

Claims

A voice processing device comprising:
a voice source direction determination unit that determines the direction of a voice source that is the source of a voice contained in a received sound signal acquired by each of a plurality of microphones arranged in a vehicle;
a beamforming processing unit that performs beamforming to suppress sound arriving from azimuth ranges other than the azimuth range including the direction of the voice source; and
a noise removal processing unit that removes noise mixed in the received sound signal,
wherein on/off of the beamforming by the beamforming processing unit is set based on a first signal indicating whether an occupant is present in the vehicle.

The voice processing device according to claim 1, wherein the noise removal by the noise removal processing unit is performed regardless of the first signal.

The voice processing device according to claim 1 or 2, wherein the beamforming by the beamforming processing unit is set to off when the first signal indicates that no occupant is present in the vehicle.

The voice processing device according to claim 1, wherein, when the first signal indicates that no occupant is present in the vehicle, the beamforming is set to off before a predetermined word is detected and is set to on after the predetermined word is detected.

The voice processing device according to claim 1, further comprising a music removal processing unit that removes a music signal mixed in the received sound signal by using a reference music signal acquired from an audio device, wherein the music signal is removed by the music removal processing unit regardless of the first signal.

A control device comprising:
a voice processing unit including a voice source direction determination unit that determines the direction of a voice source that is the source of a voice contained in a received sound signal acquired by each of a plurality of microphones arranged in a vehicle, a beamforming processing unit that performs beamforming to suppress sound arriving from azimuth ranges other than the azimuth range including the direction of the voice source, and a noise removal processing unit that removes noise mixed in the received sound signal; and
a control unit that performs control based on a voice recognition result acquired using the voice processing unit,
wherein the control unit sets on/off of the beamforming by the beamforming processing unit based on a first signal indicating whether an occupant is present in the vehicle.

The control device according to claim 6, wherein, when a predetermined word is detected in a state where the first signal indicates that no occupant is present in the vehicle, the control unit performs control based on the predetermined word.

The control device according to claim 7, wherein, when the predetermined word is detected in a state where the first signal indicates that no occupant is present in the vehicle and the vehicle is moving, the control unit stops the vehicle.

The control device according to claim 7, wherein, when the predetermined word is detected in a state where the first signal indicates that no occupant is present in the vehicle and a second signal indicates that the occupant is close to the vehicle, the control unit opens or closes an opening/closing body provided in the vehicle.