JP2020134566A

JP2020134566A - Voice processing system, voice processing device and voice processing method

Info

Publication number: JP2020134566A
Application number: JP2019023942A
Authority: JP
Inventors: Tomohito Yamanashi; Naoya Mochiki; Yutaka Banba
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2019-02-13
Filing date: 2019-02-13
Publication date: 2020-08-31

Abstract

To provide a voice processing system capable of improving the accuracy of suppressing a noise component of a voice signal even when the number or positions of noise sources change.

SOLUTION: The voice processing system includes: a first voice acquisition unit which acquires a voice signal including a first voice component and a voice component other than the first voice component; a plurality of second voice acquisition units which acquire a plurality of reference signals including a second voice component and a voice component other than the second voice component; a first adaptive filter which passes two or more reference signals among the plurality of reference signals and generates a first pass signal; a plurality of second adaptive filters, each of which passes a different single reference signal among the plurality of reference signals, and which generate a plurality of second pass signals; and a control unit which determines an adaptive filter to be controlled from among the first adaptive filter and the plurality of second adaptive filters based on the voice signal, the first pass signal, and the plurality of second pass signals, and controls the filter coefficient of the adaptive filter to be controlled.

SELECTED DRAWING: Figure 1

Description

The present disclosure relates to a voice processing system, a voice processing device, and a voice processing method.

An echo canceller suitable for use in in-vehicle voice recognition devices and hands-free telephones is known (see Patent Document 1). This echo canceller switches the number of adaptive filters operating in the echo cancellation process, and their number of taps, according to the number of sound sources.

A signal processing device that removes noise, interfering signals, echoes, and the like mixed into a signal is also known (see Patent Document 2). In an environment where a plurality of sound sources exist, this signal processing device adjusts the update amount of the corresponding adaptive filter according to the magnitude of the voice signal of each sound source.

Patent Document 1: Japanese Patent No. 4889810
Patent Document 2: Japanese Patent No. 6363324

However, the echo canceller of Patent Document 1 can adjust the number of adaptive filters and the number of taps only because the positions of the plurality of sound sources are known. In a vehicle interior, the position of a speaker who acts as a noise source changes, so this approach is difficult to apply. In the signal processing device of Patent Document 2, when the number of sound sources acting as noise sources changes, the adaptive filter needs a certain amount of time to converge, and the sound quality may deteriorate in the meantime.

The present disclosure has been made in view of the above circumstances, and provides a voice processing system, a voice processing device, and a voice processing method capable of improving the accuracy of suppressing the noise component of the voice signal to be acquired even when the number or positions of noise sources change.

One aspect of the present disclosure is a voice processing system including: a first voice acquisition unit that acquires a voice signal including a first voice component and a voice component other than the first voice component; a plurality of second voice acquisition units that acquire a plurality of reference signals including a second voice component and a voice component other than the second voice component; a first adaptive filter that passes two or more reference signals among the plurality of reference signals and generates a first pass signal; a plurality of second adaptive filters, each of which passes a different single reference signal among the plurality of reference signals, and which generate a plurality of second pass signals; and a control unit that determines an adaptive filter to be controlled from among the first adaptive filter and the plurality of second adaptive filters based on the voice signal, the first pass signal, and the plurality of second pass signals, and controls the filter coefficient of the adaptive filter to be controlled.

One aspect of the present disclosure is a voice processing device including: a control unit that acquires a voice signal including a first voice component and a voice component other than the first voice component; and a plurality of adaptive filters that acquire a plurality of reference signals including a second voice component and a voice component other than the second voice component. The plurality of adaptive filters include a first adaptive filter that passes two or more reference signals among the plurality of reference signals and generates a first pass signal, and a plurality of second adaptive filters, each of which passes a different single reference signal among the plurality of reference signals, and which generate a plurality of second pass signals. The control unit determines an adaptive filter to be controlled from among the first adaptive filter and the plurality of second adaptive filters based on the voice signal, the first pass signal, and the plurality of second pass signals, and controls the filter coefficient of the adaptive filter to be controlled.

One aspect of the present disclosure is a voice processing method including: acquiring a voice signal including a first voice component and a voice component other than the first voice component; acquiring a plurality of reference signals including a second voice component and a voice component other than the second voice component; generating a first pass signal by passing two or more reference signals among the plurality of reference signals through a first adaptive filter; generating a plurality of second pass signals by passing a different single reference signal among the plurality of reference signals through each of a plurality of second adaptive filters; determining an adaptive filter to be controlled from among the first adaptive filter and the plurality of second adaptive filters based on the voice signal, the first pass signal, and the plurality of second pass signals; and controlling the filter coefficient of the adaptive filter to be controlled.

According to the present disclosure, the accuracy of suppressing the noise component of the voice signal to be acquired can be improved even when the number or positions of noise sources change.

A diagram showing an example of the schematic configuration of the voice processing system in the first embodiment
A block diagram showing the hardware configuration of the voice processing device
A flowchart showing the operation procedure of the voice processing device
A diagram showing the hardware configuration of the voice processing device in the second embodiment

Hereinafter, an embodiment that specifically discloses the voice processing system, the voice processing device, and the voice processing method according to the present disclosure will be described in detail with reference to the drawings as appropriate. However, unnecessarily detailed description may be omitted. For example, detailed descriptions of well-known matters and duplicate descriptions of substantially identical configurations may be omitted. This is to keep the following description from becoming unnecessarily redundant and to make it easier for those skilled in the art to understand. The accompanying drawings and the following description are provided so that those skilled in the art can fully understand the present disclosure, and are not intended to limit the subject matter described in the claims.

(First Embodiment)
FIG. 1 is a diagram showing an example of the schematic configuration of the voice processing system 5 according to the first embodiment. The voice processing system 5 is mounted on a vehicle 10. The vehicle interior of the vehicle 10 is provided with, for example, a driver's seat, a passenger seat, and left and right rear seats. The voice processing system 5 includes a plurality of microphones MC1 to MC4 and a voice processing device 20. The output of the voice processing device 20 is input to a voice recognition engine 40. The voice recognition result from the voice recognition engine 40 is input to, for example, a car navigation device 50 and can be used as an operation signal for the car navigation device 50. The number of seats and the number of microphones are not limited to these.

A microphone MC1 that picks up the voice uttered by the driver hm1 may be arranged in front of the driver's seat (for example, on the front right side of the dashboard). A microphone MC2 that picks up the voice uttered by the occupant hm2 may be arranged in front of the passenger seat (for example, on the front left side of the dashboard). A microphone MC3 that picks up the voice uttered by the occupant hm3 sitting in the left rear seat may be arranged in the backrest of the passenger seat. A microphone MC4 that picks up the voice uttered by the occupant hm4 sitting in the right rear seat may be arranged in the backrest of the driver's seat. The positions of the microphones MC1 to MC4 are not limited to these.

The microphones MC1 to MC4 may be either directional or omnidirectional microphones. Small MEMS (micro electro mechanical systems) microphones or electret condenser microphones (ECM) may be used as the microphones MC1 to MC4. The microphones MC1 to MC4 may also be microphones capable of beamforming, for example, a microphone array that forms directivity toward each seat and picks up sound from that direction.

The car navigation device 50 may be arranged on the dashboard of the vehicle 10. The voice processing device 20 and the voice recognition engine 40 may be housed inside the dashboard or inside a seat. The voice processing device 20 may be provided for each microphone and for each seat. For example, the voice processing device 20 may be voice processing devices 21, 22, 23, and 24 corresponding to the microphones MC1, MC2, MC3, and MC4, respectively.

Although FIG. 1 illustrates the voice processing devices 21, 22, 23, and 24 as separate units, they may be configured as a single voice processing device 20. That is, a plurality of voice processing devices 20 each consisting of one voice processing unit may be provided, one voice processing device 20 consisting of a plurality of voice processing units may be provided, or a plurality of voice processing devices 20 each consisting of a plurality of voice processing units may be provided. The voice processing system 5 may therefore include one voice processing device 20 or a plurality of voice processing devices 20. The voice processing devices 20 (21, 22, 23, 24) may be implemented on different hardware or on one common piece of hardware.

Each voice processing device 20 may be arranged, for example, in any seat in the vehicle interior. Each voice processing device 20 may be arranged in the seat corresponding to its microphone. Each voice processing device 20 may also be arranged in the dashboard or the like.

The voice recognition engine 40 recognizes the voice contained in the output signal from at least one voice processing device 20 and outputs a voice recognition result. The voice recognition engine 40 generates the voice recognition result and a signal based on it (for example, an operation signal for the car navigation device 50). The voice recognition engine 40 may be a device separate from the voice processing device 20 or may be integrated into the voice processing device 20.

The car navigation device 50 is an example of an output destination of the voice processing system 5. The car navigation device 50 receives the operation signal output from the voice recognition engine 40 and performs the operation corresponding to that signal. For example, the car navigation device 50 displays map data on a display and provides navigation that guides the vehicle along its route.

The output destination of the voice processing system 5 is not limited to the car navigation device 50 and may be another electronic device such as a panel meter, a television, or a mobile phone.

Although FIG. 1 shows four people in the vehicle, the number of occupants is not limited to four. The number of occupants is capped at the maximum seating capacity of the vehicle and may be, for example, four, six, or nine within that limit.

FIG. 2 is a block diagram showing the hardware configuration of the voice processing device 21 as an example of the voice processing device 20. The voice processing devices 21, 22, 23, and 24 all have the same configuration and functions. The description here mainly uses the voice processing device 21. The voice processing device 21 treats the voice uttered by the driver hm1 sitting in the driver's seat as the target (the voice signal to be acquired), and outputs, as its output signal, a voice signal in which the crosstalk component has been suppressed from the voice signal picked up by the microphone MC1.

The voice processing device 21 includes a voice input unit 29 that receives the voice signal picked up by the microphone MC1, a plurality of (for example, four) adaptive filters F2, F3, F4, and F5, an adder 27, and an adaptive filter control unit 28.

The voice input unit 29 receives the voice signal picked up by the microphone MC1 arranged in front of the driver hm1. This voice signal contains the voice of the driver hm1 (the target component) and noise that includes the voices of occupants other than the driver hm1 (the crosstalk component).

The adaptive filter F2 includes a plurality of (for example, three) adaptive filters F2A, F2B, and F2C. To suppress the crosstalk components other than the voice of the driver hm1 contained in the sound picked up by the microphone MC1, the adaptive filters F2A, F2B, and F2C receive the voice signals picked up by the microphones MC2, MC3, and MC4, respectively, as reference signals, and each extracts the pass signal that has passed through it. The adaptive filter F2 adds together the pass signals extracted by the adaptive filters F2A, F2B, and F2C and outputs the sum. The adaptive filters F2A, F2B, and F2C may be physically separate.

To suppress the crosstalk components other than the voice component of the driver hm1 contained in the voice signal picked up by the microphone MC1, the adaptive filter F3 receives the voice signal picked up by the microphone MC2 as a reference signal and outputs the pass signal that has passed through the adaptive filter F3.

To suppress the crosstalk components other than the voice component of the driver hm1 contained in the voice signal picked up by the microphone MC1, the adaptive filter F4 receives the voice signal picked up by the microphone MC3 as a reference signal and outputs the pass signal that has passed through the adaptive filter F4.

To suppress the crosstalk components other than the voice component of the driver hm1 contained in the voice signal picked up by the microphone MC1, the adaptive filter F5 receives the voice signal picked up by the microphone MC4 as a reference signal and outputs the pass signal that has passed through the adaptive filter F5.

An outline of the operation of an adaptive filter is given here. An adaptive filter is a filter that minimizes a cost function defined by the mean square of the error signal. An FIR (finite impulse response) filter is used here as an example of the adaptive filter, but another type of adaptive filter may be used.

With the adaptive filter, the output signal of the voice processing device 21, that is, the subtraction signal e(n), is expressed, for example, by equation (1). Each delay block in equation (1) is called a tap. An FIR filter adapts to various filter characteristics by changing the tap weights and the number of taps (tap length). The tap weights and the number of taps (tap length) are examples of filter coefficients.

Here, n denotes time, d(n) is the (target) voice signal to be acquired, and x(n) is a reference signal. A reference signal is one of the voice signals other than the target voice signal. w_i is a filter coefficient (tap weight), and I is the tap length.
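Equation (1) itself is not reproduced in this text. The following minimal Python sketch assumes the conventional FIR form that matches the symbol definitions, e(n) = d(n) - sum over i = 0 .. I-1 of w_i * x(n - i), with samples before time 0 treated as zero. The function names `fir_output` and `error_signal` are hypothetical and chosen only for illustration.

```python
def fir_output(w, x, n):
    """Pass signal at time n: sum of w[i] * x[n - i] over the I taps."""
    total = 0.0
    for i, wi in enumerate(w):
        if n - i >= 0:  # assumption: samples before time 0 are zero
            total += wi * x[n - i]
    return total


def error_signal(d, x, w):
    """Subtraction signal e(n) = d(n) - pass signal(n) for every n."""
    return [d[n] - fir_output(w, x, n) for n in range(len(d))]


# Toy data: d is an exact one-sample-delayed, half-gain copy of x,
# so a two-tap filter with weights [0.0, 0.5] cancels it completely.
x = [1.0, 2.0, 3.0, 4.0]
d = [0.0, 0.5, 1.0, 1.5]
w = [0.0, 0.5]
e = error_signal(d, x, w)
```

With these toy values every entry of `e` is zero, which corresponds to the case where the reference signal fully explains what the target microphone picked up.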

The update of the filter coefficients in the LMS (least mean square) algorithm is expressed by equation (2).

Here, α is a correction coefficient for the filter coefficients and corresponds to the update width (update amount).
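Equation (2) is likewise not reproduced in this text. As a hedged illustration, the sketch below assumes the textbook LMS rule w_i <- w_i + α * e(n) * x(n - i), which is consistent with α acting as the update width. The helper name `lms_step`, the toy signals, and the value α = 0.1 are assumptions made only for this example.

```python
import random


def lms_step(w, x, d, n, alpha):
    """One LMS iteration at time n; returns (updated weights, error e(n))."""
    # Tap inputs x(n - i); samples before time 0 are assumed to be zero.
    taps = [x[n - i] if n - i >= 0 else 0.0 for i in range(len(w))]
    e = d[n] - sum(wi * xi for wi, xi in zip(w, taps))
    # Assumed update rule: w_i <- w_i + alpha * e(n) * x(n - i)
    return [wi + alpha * e * xi for wi, xi in zip(w, taps)], e


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]  # reference signal
d = [0.0] + [0.5 * v for v in x[:-1]]              # d(n) = 0.5 * x(n - 1)

w = [0.0, 0.0]
for n in range(len(d)):
    w, e = lms_step(w, x, d, n, alpha=0.1)
```

In this noise-free toy case the weights move toward [0, 0.5] and the error shrinks toward zero; a larger α speeds up adaptation at the cost of stability.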

LMS is used here as an example of the algorithm for updating the filter coefficients, but other algorithms, such as ICA (independent component analysis) or NLMS (normalized least mean square), may be used.
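For comparison, a minimal sketch of one NLMS step is shown below. It uses the commonly taught form in which the LMS step is divided by the power of the current tap inputs; the document does not give its own NLMS formula, so the function name `nlms_step` and the regularization constant `eps` are assumptions.

```python
def nlms_step(w, taps, d_n, alpha, eps=1e-8):
    """One NLMS iteration; `taps` holds x(n), x(n-1), ... for the current time."""
    e = d_n - sum(wi * xi for wi, xi in zip(w, taps))
    power = sum(xi * xi for xi in taps)
    scale = alpha / (eps + power)  # step size normalized by input power
    return [wi + scale * e * xi for wi, xi in zip(w, taps)], e


# Toy step: with alpha = 1 the updated filter reproduces d_n for these taps.
w_new, e = nlms_step([0.0, 0.0], [2.0, 0.0], d_n=1.0, alpha=1.0)
```

The normalization makes the effective step size less sensitive to how loud the reference signal is, which is one reason NLMS is often chosen for speech.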

The adder 27 subtracts the pass signal output from the adaptive filter F2 from the target voice signal output from the voice input unit 29 (that is, adds it subtractively) and outputs the resulting subtraction signal as an error signal. In the same way, the adder 27 subtracts the pass signal output from each of the adaptive filters F3, F4, and F5 from the target voice signal output from the voice input unit 29 and outputs each subtraction signal as an error signal.

The adaptive filter control unit 28 selects, from the plurality of (for example, four) subtraction signals (error signals) output from the adder 27, the error signal with the lowest signal level, and outputs that error signal as the output signal. The output signal of the adaptive filter control unit 28 is input to the voice recognition engine 40. The voice recognition engine 40 is an example of an output destination of the adaptive filter control unit 28. The output destination of the adaptive filter control unit 28 may instead be a speaker or the like that emits sound. In that case, the adaptive filter control unit 28 may output the output signal to a mobile terminal via a wireless communication network or the like, and the output signal may be output as sound from a speaker or the like of the mobile terminal.

From the plurality of (for example, four) adaptive filters F2 to F5, the adaptive filter control unit 28 selects the adaptive filter whose pass signal produced the error signal with the lowest signal level, and updates the filter coefficients of the selected adaptive filter so that the error signal approaches 0. When the adaptive filter control unit 28 updates the filter coefficients of any of the adaptive filters F2 to F5, it may also update the filter coefficients of the corresponding adaptive filter among the three adaptive filters F2A, F2B, and F2C included in the adaptive filter F2. For example, when updating the adaptive filter F4, the adaptive filter control unit 28 may also update the adaptive filter F2B, which receives the same reference signal C.
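The pairing between single-reference filters and the sub-filters of F2 can be written down as a small lookup, sketched here in Python. The pairing itself follows the reference signals named in the description (B for F3 and F2A, C for F4 and F2B, D for F5 and F2C); the function name `filters_to_update`, and the choice to treat an update of F2 as an update of all three of its sub-filters, are assumptions for illustration.

```python
# Single-reference filter -> (its reference signal, F2 sub-filter fed by the same signal)
SHARED_REFERENCE = {
    "F3": ("B", "F2A"),
    "F4": ("C", "F2B"),
    "F5": ("D", "F2C"),
}


def filters_to_update(selected):
    """Return the filters whose coefficients are updated when `selected` is chosen."""
    if selected == "F2":
        # Assumption: updating F2 means updating all of its sub-filters.
        return ["F2A", "F2B", "F2C"]
    _, sub_filter = SHARED_REFERENCE[selected]
    # The joint update of the sub-filter is optional in the description.
    return [selected, sub_filter]


targets = filters_to_update("F4")
```

Here `targets` lists F4 together with F2B, matching the example in the text.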

The above describes the adaptive filter control unit 28 selecting the adaptive filter that corresponds to the reference signal whose error signal has the lowest signal level, but the adaptive filter may be selected using a criterion other than the lowest signal level. For example, the adaptive filter control unit 28 may select any adaptive filter whose error signal has a signal level equal to or less than a threshold th1.
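Both selection criteria can be sketched in a few lines of Python. The description does not define how the signal level is measured, so the mean square over a frame is assumed here; the function names and the toy error frames are hypothetical.

```python
def signal_level(frame):
    """Assumed level measure: mean square of the samples in a frame."""
    return sum(v * v for v in frame) / len(frame)


def select_min_level(errors):
    """Name of the filter whose error frame has the lowest level."""
    return min(errors, key=lambda name: signal_level(errors[name]))


def select_below_threshold(errors, th1):
    """First filter whose error level is at or below th1, or None if there is none."""
    for name, frame in errors.items():
        if signal_level(frame) <= th1:
            return name
    return None


errors = {
    "F2": [0.1, -0.1],
    "F3": [0.5, 0.5],
    "F4": [0.01, 0.0],
    "F5": [1.0, -1.0],
}
best = select_min_level(errors)
first_ok = select_below_threshold(errors, th1=0.02)
```

With these toy frames the minimum-level rule picks F4, while the threshold rule stops at F2 because it is the first candidate that meets th1. The order in which candidates are examined is a design choice that the text leaves open.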

The adaptive filter control unit 28 realizes its various functions by, for example, a processor (not shown) executing a program held in a memory (not shown).

In FIG. 2, the voice signal A is the signal of the sound picked up by the microphone MC1 at the driver's seat (mainly containing the target component). The reference signal B is the signal of the sound picked up by the microphone MC2 at the passenger seat (sound containing non-target components and noise) and is input to the adaptive filters F2 and F3. The reference signal C is the signal of the sound picked up by the microphone MC3 for the left rear seat and is input to the adaptive filters F2 and F4. The reference signal D is the signal of the sound picked up by the microphone MC4 for the right rear seat and is input to the adaptive filters F2 and F5.

The pass signals B', C', and D' are the signals obtained by passing the reference signals B, C, and D of the sound picked up by the microphones MC2, MC3, and MC4 through the adaptive filters F2A, F2B, and F2C, respectively. Of the sound picked up by the microphone MC1, the crosstalk components other than the target component correspond to noise.

For example, in the voice processing device 21, when there is no voice signal A from the target seat (here, the driver's seat) and there are reference signals B to D produced by utterances from the passenger seat and the rear seats, the voice signal picked up by the microphone MC1 contains crosstalk components (leakage components). The voice processing device 21 may update the adaptive filter so as to minimize the error signal. In this case there is no utterance at the driver's seat, so the ideal error signal is a silent signal. When there is an utterance at the driver's seat, the utterance contained in the voice signal A basically arrives earlier in time than the leaked sound contained in the reference signals B to D, so the voice processing device 21 cannot cancel the utterance contained in the voice signal A with the adaptive filter (causality). Therefore, whether or not the target voice signal is present, the voice processing device 21 can reduce the crosstalk component in the voice signal A as far as possible by updating the adaptive filter so as to minimize the error signal.

The adder 27 outputs a subtraction signal obtained by subtracting the pass signal E' of the adaptive filter F2 from the voice signal A. The pass signal E' is the signal B' + C' + D' obtained by adding together the pass signals of the adaptive filters F2A, F2B, and F2C. The adder 27 also outputs the subtraction signals obtained by subtracting from the voice signal A the pass signal B' of the adaptive filter F3, the pass signal C' of the adaptive filter F4, and the pass signal D' of the adaptive filter F5.
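The four subtractions performed by the adder can be sketched as follows. The sketch follows the notation of the text, in which E' is written as B' + C' + D'; in an implementation where F2A to F2C hold different coefficients from F3 to F5, the composite term would be built from the sub-filter outputs instead. The function name and the sample values are hypothetical.

```python
def subtraction_signals(a, b_pass, c_pass, d_pass):
    """Return the four candidate error signals A-E', A-B', A-C', A-D'."""
    # Composite pass signal E' = B' + C' + D', sample by sample.
    e_pass = [b + c + d for b, c, d in zip(b_pass, c_pass, d_pass)]

    def minus(pass_signal):
        return [x - y for x, y in zip(a, pass_signal)]

    return {
        "A-E'": minus(e_pass),
        "A-B'": minus(b_pass),
        "A-C'": minus(c_pass),
        "A-D'": minus(d_pass),
    }


signals = subtraction_signals(
    a=[1.0, 2.0],
    b_pass=[0.5, 0.5],
    c_pass=[0.25, 0.0],
    d_pass=[0.25, 0.5],
)
```

Each entry of `signals` is one candidate output that the adaptive filter control unit 28 can compare by signal level.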

FIG. 3 is a flowchart showing the operation procedure of the voice processing device 21. The voice input unit 29 of the voice processing device 21 receives the voice signal A picked up by the microphone MC1 arranged at the driver's seat (a signal containing the voice of the driver hm1 as the target component and a crosstalk component) (S1). The voice signal A may contain the voice component of the driver hm1 as the target component and the voice components of the occupants hm2 to hm4 as crosstalk components. The voice processing device 21 acquires the reference signals B, C, and D picked up by the microphones MC2, MC3, and MC4, respectively (S2). For example, the reference signals B, C, and D may contain the voice component of the occupant hm2 as a main component other than the target component, and the voice components of the driver hm1 and the occupants hm3 and hm4 as components other than the main component.

音声処理装置２１は、参照信号Ｂ，Ｃ，Ｄを用いて、適応フィルタを通過させた通過信号を生成する（Ｓ３）。適応フィルタＦ２は、参照信号Ｂを適応フィルタＦ２Ａに通過させ、参照信号Ｃを適応フィルタＦ２Ｂに通過させ、参照信号Ｄを適応フィルタＦ２Ｃに通過させ、各通過後の信号を足し合わせて通過信号Ｅ’を生成する。適応フィルタＦ３は、参照信号Ｂを通過させて通過信号Ｂ’を生成する。適応フィルタＦ４は、参照信号Ｃを通過させて通過信号Ｃ’を生成する。適応フィルタＦ５は、参照信号Ｄを通過させて通過信号Ｄ’を生成する。 The voice processing device 21 passes the reference signals B, C, and D through the adaptive filters to generate passing signals (S3). The adaptive filter F2 passes the reference signal B through the adaptive filter F2A, the reference signal C through the adaptive filter F2B, and the reference signal D through the adaptive filter F2C, and sums the filtered signals to generate the passing signal E'. The adaptive filter F3 passes the reference signal B to generate the passing signal B'. The adaptive filter F4 passes the reference signal C to generate the passing signal C'. The adaptive filter F5 passes the reference signal D to generate the passing signal D'.
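The generation of the passing signals in S3 can be sketched as follows. This is only an illustration: the description does not specify the filter structure, so a causal FIR filter is assumed here, and the names `fir_filter`, `pass_signals`, `multi_coeffs`, and `single_coeffs` are hypothetical.

```python
def fir_filter(coeffs, signal):
    """Causal FIR filter (assumed structure): y[n] = sum_k coeffs[k] * signal[n - k]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * signal[n - k]
        out.append(acc)
    return out


def pass_signals(refs, multi_coeffs, single_coeffs):
    """Generate the passing signals of S3.

    refs:          reference signals keyed by name, e.g. {"B": [...], "C": [...], "D": [...]}
    multi_coeffs:  coefficients of the sub-filters F2A/F2B/F2C, keyed by reference name
    single_coeffs: coefficients of F3/F4/F5, keyed by reference name
    """
    # F2: each reference goes through its own sub-filter, then the results are summed -> E'
    parts = [fir_filter(multi_coeffs[name], refs[name]) for name in refs]
    e_prime = [sum(vals) for vals in zip(*parts)]
    # F3, F4, F5: each filter passes a single reference -> B', C', D'
    singles = {name: fir_filter(single_coeffs[name], refs[name]) for name in refs}
    return e_prime, singles
```

Keeping the coefficients of F2A to F2C separate from those of F3 to F5 mirrors the text, in which the multi-reference filter and the single-reference filters are distinct filters even though they read the same reference signals.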

加算器２７は、音声信号Ａから各通過信号Ｅ’，Ｂ’，Ｃ’，Ｄ’を減算し、各減算信号Ａ−Ｅ’，Ａ−Ｂ’，Ａ−Ｃ’，Ａ−Ｄ’を生成する（Ｓ４）。適応フィルタ制御部２８は、減算信号Ａ−Ｅ’，Ａ−Ｂ’，Ａ−Ｃ’，Ａ−Ｄ’に基づいて、出力信号を選択する。例えば、適応フィルタ制御部２８は、減算信号Ａ−Ｅ’，Ａ−Ｂ’，Ａ−Ｃ’，Ａ−Ｄ’のうち、ターゲット成分の割合が最大となる、つまり信号レベルが最小となる減算信号（誤差信号）を出力信号として選択する（Ｓ５）。 The adder 27 subtracts each of the passing signals E', B', C', and D' from the audio signal A to generate the subtraction signals A-E', A-B', A-C', and A-D' (S4). The adaptive filter control unit 28 selects an output signal based on the subtraction signals A-E', A-B', A-C', and A-D'. For example, among the subtraction signals A-E', A-B', A-C', and A-D', the adaptive filter control unit 28 selects as the output signal the subtraction signal (error signal) in which the proportion of the target component is largest, that is, the one whose signal level is lowest (S5).

適応フィルタ制御部２８は、この出力信号に対応する適応フィルタのフィルタ係数を更新する（Ｓ６）。その後、適応フィルタ制御部２８から出力される出力信号は、更新されたフィルタ係数が反映されたものとなる。 The adaptive filter control unit 28 updates the filter coefficient of the adaptive filter corresponding to this output signal (S6). After that, the output signal output from the adaptive filter control unit 28 reflects the updated filter coefficient.
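Steps S4 to S6 can be sketched as follows. The description states only that the lowest-level subtraction signal is selected and that the corresponding filter's coefficients are updated. The mean-square level measure and the NLMS update rule below are assumptions chosen for illustration, and `level`, `select_output`, and `nlms_step` are hypothetical names.

```python
def level(signal):
    """Mean squared amplitude, used here as the 'signal level' (an assumption)."""
    return sum(x * x for x in signal) / len(signal)


def select_output(a, candidates):
    """S4 and S5: subtract each passing signal from A and keep the lowest-level result.

    a:          audio signal A
    candidates: passing signals keyed by filter name, e.g. {"F2": E', "F3": B', ...}
    Returns (name of the selected filter, its subtraction signal).
    """
    errors = {name: [x - y for x, y in zip(a, p)] for name, p in candidates.items()}
    best = min(errors, key=lambda name: level(errors[name]))
    return best, errors[best]


def nlms_step(coeffs, ref_window, err, mu=0.5, eps=1e-8):
    """S6: one normalized-LMS coefficient update (assumed update rule).

    ref_window[k] holds the reference sample ref[n - k]; err is the error sample e[n].
    """
    norm = sum(x * x for x in ref_window) + eps
    return [c + mu * err * x / norm for c, x in zip(coeffs, ref_window)]
```

Only the filter returned by `select_output` would be passed to `nlms_step`; the coefficients of the other filters are left untouched, which is the behavior the text describes.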

適応フィルタ制御部２８は、出力信号を音声認識エンジン４０に出力する（Ｓ７）。 The adaptive filter control unit 28 outputs an output signal to the voice recognition engine 40 (S7).

音声認識エンジン４０は、出力信号に含まれる音声を認識し、その認識結果に基づく音声指示を出力先の一例であるカーナビゲーション装置５０に送信する。カーナビゲーション装置５０は、受信した音声指示に従い、例えば行き先、地図、ナビルート等の操作を実行する。なお、ここでは、出力信号は、音声認識エンジン４０に出力されたが、その他の装置（例えばスピーカ）に出力されてもよい。スピーカは、出力信号に対応する音声を発音する。この後、音声処理装置２１は、Ｓ１の処理に戻る。なお、適応フィルタ制御部２８からの出力信号は、有線接続するその他の装置へ出力されてもよいし、また無線接続するその他の装置へ出力されるとしてもよい。 The voice recognition engine 40 recognizes the voice included in the output signal, and transmits a voice instruction based on the recognition result to the car navigation device 50 which is an example of the output destination. The car navigation device 50 executes operations such as a destination, a map, and a navigation route according to the received voice instruction. Although the output signal is output to the voice recognition engine 40 here, it may be output to another device (for example, a speaker). The speaker produces a voice corresponding to the output signal. After this, the voice processing device 21 returns to the processing of S1. The output signal from the adaptive filter control unit 28 may be output to another device connected by wire, or may be output to another device connected wirelessly.

このように、第１の実施形態における音声処理システム５では、車室内に複数のマイクや複数のノイズ源である、周囲の雑音、動作音、スピーカ、話者（発話する乗員）等が存在する場合、ノイズ源の数に対応する数の異なる適応フィルタが用いられる。また、音声処理システム５では、適応フィルタ制御部２８は、ノイズ源が１つか複数かに応じて、更新する適応フィルタを切り替える。また、車室内で発生するノイズ源の数や位置が変化する場合、適応フィルタ制御部２８は、ノイズ源の位置に対応した適応フィルタを特定し、特定された適応フィルタのフィルタ係数を更新する。適応フィルタ制御部２８は、誤差信号の信号レベルに応じて、適応フィルタの更新量及びタップ長を更新する際、更新する適応フィルタ自体を切り替える。これにより、ノイズ源の数や位置が変化した場合でも、各ノイズ源に対応するマイク信号（例えば音声信号Ａ）のＳ／Ｎ比が改善する。 As described above, in the voice processing system 5 of the first embodiment, when the vehicle interior contains a plurality of microphones and a plurality of noise sources, such as ambient noise, operating sounds, loudspeakers, and speakers (speaking occupants), different adaptive filters are used in a number corresponding to the number of noise sources. Further, in the voice processing system 5, the adaptive filter control unit 28 switches the adaptive filter to be updated depending on whether there is one noise source or several. Further, when the number or positions of the noise sources in the vehicle interior change, the adaptive filter control unit 28 identifies the adaptive filter corresponding to the position of the noise source and updates the filter coefficient of the identified adaptive filter. When updating the update amount and tap length of an adaptive filter according to the signal level of the error signal, the adaptive filter control unit 28 switches the adaptive filter itself that is to be updated. As a result, even if the number or positions of the noise sources change, the S/N ratio of the microphone signal (for example, the audio signal A) corresponding to each noise source is improved.

なお、車室内に設置されるマイクがマイクアレイである場合、ノイズ抑圧処理の前段階で、マイクアレイがノイズ源に向けて指向性を形成してその音声を収音する（ビームフォーミングを行う）ことで、各マイクに入力される音声信号のＳ／Ｎ比を改善してもよい。これにより、音声処理システム５は、後段のノイズ抑圧処理を高めることができる。つまり、マイクは、ノイズ源で発生するノイズ音を参照信号として効率良く収音でき、適応フィルタ制御部は、誤差信号が最小となるように適応フィルタを更新できる。適応フィルタは、対象の音声信号からノイズ音を良く打ち消す、抑圧処理を効果的に行うことができる。 If the microphone installed in the vehicle interior is a microphone array, the microphone array forms directivity toward the noise source and collects the sound (beamforming) before the noise suppression process. Therefore, the S / N ratio of the audio signal input to each microphone may be improved. As a result, the voice processing system 5 can enhance the noise suppression processing in the subsequent stage. That is, the microphone can efficiently pick up the noise sound generated by the noise source as a reference signal, and the adaptive filter control unit can update the adaptive filter so that the error signal is minimized. The adaptive filter can effectively perform the suppression process that cancels the noise sound well from the target audio signal.

また、本実施形態では、音声処理装置２０が、話者がＮ人（例えば４人）存在することが想定される状況において、Ｎ人が発する音声の参照信号を話者毎に別々に入力する適応フィルタと、Ｎ人が発する音声の参照信号をＮ人分まとめて入力する適応フィルタと、を設けることを例示した。この場合、音声処理装置２０は、なるべく少ない数の適応フィルタを用いて、ターゲットとなる音声信号から、ターゲット成分以外のクロストーク成分等のノイズ成分を効率良く低減できる。また、音声処理装置２０は、Ｎ人以下の人数毎に、適応フィルタを設けてもよい。例えば、４人のうち、２人の音声の参照信号を入力する２名用の適応フィルタと、３人の音声の参照信号を入力する３名用の適応フィルタと、４人の音声の参照信号を入力する４名用の適応フィルタと、を設けてよい。この場合、音声処理装置は、実際の話者の位置や人数に応じて最適な適応フィルタを選択し、フィルタ係数を更新でき、クロストーク成分の抑圧性能を更に向上できる。 Further, in the present embodiment, the voice processing device 20 separately inputs the reference signal of the voice emitted by the N people for each speaker in the situation where it is assumed that there are N speakers (for example, 4 people). It is illustrated that an adaptive filter and an adaptive filter for inputting a reference signal of a voice emitted by N people collectively for N people are provided. In this case, the audio processing device 20 can efficiently reduce noise components such as crosstalk components other than the target component from the target voice signal by using as few adaptive filters as possible. Further, the voice processing device 20 may be provided with an adaptive filter for each number of N or less people. For example, an adaptive filter for 2 people who inputs a reference signal of the voice of 2 people out of 4 people, an adaptive filter for 3 people who inputs a reference signal of the voice of 3 people, and a reference signal of the voice of 4 people. An adaptive filter for 4 people may be provided to input. In this case, the voice processing device can select the optimum adaptive filter according to the actual position and number of speakers, update the filter coefficient, and further improve the suppression performance of the crosstalk component.

以上のように、音声処理システム５は、ターゲット成分（第１の音声成分の一例）及びクロストーク成分（第１の音声成分以外の音声成分の一例）を含む音声信号Ａを取得するマイクＭＣ１（第１の音声取得部の一例）を備えてよい。音声処理システム５は、ターゲット成分以外の主成分（第２の音声成分の一例）及びこの主成分以外の成分（第２の音声成分以外の音声成分の一例）を含む複数の参照信号Ｂ，Ｃ，Ｄを取得する複数のマイクＭＣ２，ＭＣ３，ＭＣ４（第２の音声取得部の一例）を備えてよい。音声処理システム５の音声処理装置２１は、複数の参照信号Ｂ，Ｃ，Ｄのうち２つ以上の参照信号を通過させ、通過信号Ｅ’（第１の通過信号の一例）を生成する適応フィルタＦ２（第１の適応フィルタの一例）を備えてよい。音声処理装置２１は、複数の参照信号Ｂ，Ｃ，Ｄのうち異なる単一の参照信号を通過させ、複数の通過信号Ｂ’，Ｃ’，Ｄ’（第２の通過信号の一例）を生成する複数の適応フィルタＦ３，Ｆ４，Ｆ５（第２の適応フィルタの一例）を備えてよい。音声処理装置２１は、音声信号Ａと通過信号Ｅ’と複数の通過信号Ｂ’，Ｃ’，Ｄ’とに基づいて、適応フィルタＦ２及び複数の適応フィルタＦ３，Ｆ４，Ｆ５のうち制御対象の適応フィルタを決定し、制御対象の適応フィルタのフィルタ係数を制御する適応フィルタ制御部２８（制御部の一例）を備えてよい。 As described above, the voice processing system 5 may include the microphone MC1 (an example of the first voice acquisition unit), which acquires the voice signal A including the target component (an example of the first voice component) and the crosstalk component (an example of a voice component other than the first voice component). The voice processing system 5 may include a plurality of microphones MC2, MC3, and MC4 (an example of the second voice acquisition units), which acquire a plurality of reference signals B, C, and D each including a main component other than the target component (an example of the second voice component) and components other than this main component (an example of a voice component other than the second voice component). The voice processing device 21 of the voice processing system 5 may include the adaptive filter F2 (an example of the first adaptive filter), which passes two or more of the reference signals B, C, and D and generates the passing signal E' (an example of the first passing signal). The voice processing device 21 may include a plurality of adaptive filters F3, F4, and F5 (an example of the second adaptive filters), each of which passes a different single reference signal among the reference signals B, C, and D, and which generate a plurality of passing signals B', C', and D' (an example of the second passing signals). The voice processing device 21 may include the adaptive filter control unit 28 (an example of the control unit), which determines the adaptive filter to be controlled from among the adaptive filter F2 and the adaptive filters F3, F4, and F5 based on the voice signal A, the passing signal E', and the passing signals B', C', and D', and controls the filter coefficient of the adaptive filter to be controlled.

これにより、音声処理システム５の音声処理装置２１は、取得目的の音声信号、この音声信号以外の参照信号が複数信号用の適応フィルタを通過した通過信号、この音声信号以外の参照信号が単一信号用の適応フィルタを通過した通過信号を基に、各信号の状態を考慮して適応フィルタを決定し、この適応フィルタのフィルタ係数を制御できる。よって、音声処理装置２１は、例えば、話者（ノイズ源の一例）の位置の変化や話者の人数の変化によって、関連性の高い適応フィルタについては、更新により好適なフィルタ係数を維持でき、関連性の低い適応フィルタについては、過去の学習結果が不要に更新され、適応フィルタのフィルタ効率が低下することを抑制できる。また、複数信号用の適応フィルタや単一信号用の適応フィルタが制御対象の適応フィルタに決定されることで、音声処理装置２１は、例えば、話者が１人の場合には、各第２の適応フィルタを用いて、第１の音声成分以外の音声成分を効率良く除去できる。また、音声処理装置２１は、話者が複数人の場合には、第１の適応フィルタを用いて、第１の音声成分以外の音声成分を効率良く除去できる。 As a result, the voice processing device 21 of the voice processing system 5 can determine an adaptive filter in consideration of the state of each signal, based on the voice signal to be acquired, the passing signal obtained by passing reference signals other than this voice signal through the adaptive filter for multiple signals, and the passing signals obtained by passing a reference signal other than this voice signal through an adaptive filter for a single signal, and can control the filter coefficient of the determined adaptive filter. Therefore, when, for example, the position of a speaker (an example of a noise source) or the number of speakers changes, the voice processing device 21 can keep suitable filter coefficients for a highly relevant adaptive filter by updating it and, for a less relevant adaptive filter, can prevent past learning results from being updated unnecessarily and the filter's performance from degrading. Further, because either the adaptive filter for multiple signals or an adaptive filter for a single signal is determined as the adaptive filter to be controlled, the voice processing device 21 can, for example, efficiently remove voice components other than the first voice component by using one of the second adaptive filters when there is one speaker. When there are a plurality of speakers, the voice processing device 21 can efficiently remove voice components other than the first voice component by using the first adaptive filter.

また、加算器２７（制御部の一例）は、音声信号Ａから通過信号Ｅ’を減算して、減算信号Ａ−Ｅ’（第１の減算信号の一例）を生成してよい。加算器２７は、音声信号Ａから異なる通過信号Ｂ’，Ｃ’，Ｄ’を減算して、複数の減算信号Ａ−Ｂ’，Ａ−Ｃ’，Ａ−Ｄ’（第２の減算信号の一例）を生成してよい。適応フィルタ制御部２８は、減算信号Ａ−Ｅ’及び各減算信号Ａ−Ｂ’，Ａ−Ｃ’，Ａ−Ｄ’の信号レベルに基づいて、制御対象の適応フィルタを決定してよい。なお、制御部は、加算器２７及び適応フィルタ制御部２８が別体として構成されてもよいし、適応フィルタ制御部２８が加算器２７の機能を含んで構成されてもよい。 Further, the adder 27 (an example of the control unit) may subtract the passing signal E' from the audio signal A to generate the subtraction signal A-E' (an example of the first subtraction signal). The adder 27 may subtract each of the different passing signals B', C', and D' from the audio signal A to generate a plurality of subtraction signals A-B', A-C', and A-D' (an example of the second subtraction signals). The adaptive filter control unit 28 may determine the adaptive filter to be controlled based on the signal levels of the subtraction signal A-E' and the subtraction signals A-B', A-C', and A-D'. In the control unit, the adder 27 and the adaptive filter control unit 28 may be configured as separate bodies, or the adaptive filter control unit 28 may be configured to include the function of the adder 27.

これにより、音声処理装置２１は、減算信号の信号レベルに応じて、ターゲット成分以外の除去効率を加味して、除去効率の高い適応フィルタを決定し、この適応フィルタのフィルタ係数を制御し、これ以外の適応フィルタのフィルタ係数を制御しないことができる。よって、音声処理装置２１は、ノイズ源の数や位置が変化した場合でも、取得対象の音声信号の雑音成分の抑圧精度を向上できる。 As a result, the audio processing device 21 determines an adaptive filter having high removal efficiency in consideration of the removal efficiency other than the target component according to the signal level of the subtraction signal, controls the filter coefficient of the adaptive filter, and controls this. It is possible not to control the filter coefficient of the adaptive filter other than. Therefore, the voice processing device 21 can improve the suppression accuracy of the noise component of the voice signal to be acquired even when the number and position of the noise sources change.

また、適応フィルタ制御部２８は、減算信号Ａ−Ｅ’及び各減算信号Ａ−Ｂ’，Ａ−Ｃ’，Ａ−Ｄ’の信号レベルに基づいて、減算信号Ａ−Ｅ’及び各減算信号Ａ−Ｂ’，Ａ−Ｃ’，Ａ−Ｄ’のいずれかを出力信号として決定してよい。適応フィルタ制御部２８は、出力信号に対応する適応フィルタＦ２及び複数の適応フィルタＦ３，Ｆ４，Ｆ５のうちのいずれかの適応フィルタを、制御対象の適応フィルタを決定してよい。 Further, based on the signal levels of the subtraction signal A-E' and the subtraction signals A-B', A-C', and A-D', the adaptive filter control unit 28 may determine one of the subtraction signal A-E' and the subtraction signals A-B', A-C', and A-D' as the output signal. The adaptive filter control unit 28 may determine, as the adaptive filter to be controlled, whichever of the adaptive filter F2 and the adaptive filters F3, F4, and F5 corresponds to the output signal.

これにより、音声処理装置２１は、音声処理装置２１の後段の処理に用いる、クロストーク成分が小さい出力信号に対応する適応フィルタのフィルタ係数を制御することで、全ての適応フィルタのフィルタ係数の制御を行う必要なく、効率良くフィルタ係数を制御できる。また、制御されたフィルタ係数により、後においても出力信号として選択される可能性が高い減算信号のクロストーク成分が一層小さくなることが期待できる。 As a result, by controlling the filter coefficient of the adaptive filter corresponding to the output signal, which has a small crosstalk component and is used in the processing downstream of the voice processing device 21, the voice processing device 21 can control filter coefficients efficiently without having to control the filter coefficients of all the adaptive filters. In addition, the controlled filter coefficient can be expected to further reduce the crosstalk component of the subtraction signal that is likely to be selected as the output signal again later.

また、適応フィルタ制御部２８は、減算信号Ａ−Ｅ’及び各減算信号Ａ−Ｂ’，Ａ−Ｃ’，Ａ−Ｄ’のうち、信号レベルが最小である信号でよい。 Further, the signal that the adaptive filter control unit 28 determines as the output signal may be the signal with the lowest signal level among the subtraction signal A-E' and the subtraction signals A-B', A-C', and A-D'.

これにより、音声処理装置２１は、フィルタ係数の制御対象となる適応フィルタとして、クロストーク成分を最も抑圧可能な適応フィルタを選択できる。 As a result, the voice processing device 21 can select an adaptive filter that can suppress the crosstalk component most as an adaptive filter for which the filter coefficient is controlled.

また、適応フィルタ制御部２８は、減算信号Ａ−Ｅ’及び複数の減算信号Ａ−Ｂ’，Ａ−Ｃ’，Ａ−Ｄ’のうち、信号レベルが閾値ｔｈ１（第１の閾値の一例）以下である信号でよい。 Further, the signal that the adaptive filter control unit 28 determines as the output signal may be a signal whose signal level is equal to or lower than the threshold value th1 (an example of the first threshold value) among the subtraction signal A-E' and the subtraction signals A-B', A-C', and A-D'.

これにより、音声処理装置２１は、フィルタ係数の制御対象となる適応フィルタとして、クロストーク成分を所望の基準以上に抑圧可能な適応フィルタを選択できる。 As a result, the voice processing device 21 can select an adaptive filter capable of suppressing the crosstalk component more than a desired reference as the adaptive filter to be controlled by the filter coefficient.
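The threshold-based variant can be sketched as follows. The description does not say what happens when several subtraction signals fall at or below th1, or when none does, so this hypothetical helper simply returns every qualifying candidate, ordered from lowest to highest level, and leaves that choice to the caller.

```python
def candidates_at_or_below(levels, th1):
    """Return the names of subtraction signals whose level is <= th1, lowest level first.

    levels: signal level of each subtraction signal, keyed by name
    th1:    first threshold from the text; its value is application-specific
    """
    qualifying = [name for name, lv in levels.items() if lv <= th1]
    return sorted(qualifying, key=lambda name: levels[name])
```

An empty list means no filter met the desired suppression criterion in this frame.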

また、マイクＭＣ２，ＭＣ３，ＭＣ４は、それぞれ参照信号Ｂ，Ｃ，Ｄを取得するために、音声信号Ａのクロストーク成分を発する話者としての乗員ｈｍ２〜ｈｍ４（音源の一例）の方向に指向性を有してよい。 Further, the microphones MC2, MC3, and MC4 are directed in the direction of the occupant hm2 to hm4 (an example of a sound source) as a speaker who emits the crosstalk component of the audio signal A in order to acquire the reference signals B, C, and D, respectively. May have sex.

これにより、音声処理装置２１は、特定の方向に指向性を有することにより、マイクＭＣ２，ＭＣ３，ＭＣ４は、例えば、特定の話者が発する音声成分を多くし、特定の話者以外が発する音声成分を少なくして取得できる。よって、音声処理装置２１は、特定の話者以外の音声が漏れ入ることが少なくなることで、適応フィルタＦ３，Ｆ４，Ｆ５を特定の方向に位置する特定の話者専用として使用できるようになる。よって、音声処理装置２１は、適応フィルタの学習時の揺らぎを小さくでき、音声信号から特定の話者の音声成分を効率良く抑制できる。 As a result, because the microphones MC2, MC3, and MC4 have directivity in a specific direction, they can, for example, pick up more of the voice component emitted by a specific speaker and less of the voice components emitted by anyone else. Because less voice from other speakers leaks in, the voice processing device 21 can use the adaptive filters F3, F4, and F5 as filters dedicated to a specific speaker located in a specific direction. The voice processing device 21 can therefore reduce fluctuation while the adaptive filters are learning and can efficiently suppress the voice component of a specific speaker in the voice signal.

また、適応フィルタＦ２、複数の適応フィルタＦ３〜Ｆ５、及び適応フィルタ制御部２８を備える音声処理装置２０（２１〜２５）、を複数備えてよい。複数の音声処理装置２０における各適応フィルタ制御部２８が取得する各音声信号（音声信号に含まれる各ターゲット成分）は、それぞれ異なってよい。また、複数の音声処理装置２０における各適応フィルタＦ２及び各適応フィルタＦ３〜Ｆ５が取得する各参照信号（参照信号に含まれる各ターゲット成分以外の主成分）の組み合わせは、それぞれ異なってよい。 Further, a plurality of voice processing devices 20 (21 to 25) including an adaptive filter F2, a plurality of adaptive filters F3 to F5, and an adaptive filter control unit 28 may be provided. Each voice signal (each target component included in the voice signal) acquired by each adaptive filter control unit 28 in the plurality of voice processing devices 20 may be different. Further, the combination of each adaptive filter F2 and each reference signal (main component other than each target component included in the reference signal) acquired by each adaptive filter F3 to F5 in the plurality of audio processing devices 20 may be different.

例えば、音声処理装置２１の適応フィルタ制御部２８が取得する音声信号は、マイクＭＣ１で収音された音声信号でよく、そのターゲット成分は、乗員ｈｍ１の音声でよい。音声処理装置２２の適応フィルタ制御部２８が取得する音声信号は、マイクＭＣ２で収音された音声信号でよく、そのターゲット成分は、乗員ｈｍ２の音声でよい。音声処理装置２３の適応フィルタ制御部２８が取得する音声信号は、マイクＭＣ３で収音された音声信号でよく、そのターゲット成分は、乗員ｈｍ３の音声でよい。音声処理装置２４の適応フィルタ制御部２８が取得する音声信号は、マイクＭＣ４で収音された音声信号でよく、そのターゲット成分は、乗員ｈｍ４の音声でよい。 For example, the voice signal acquired by the adaptive filter control unit 28 of the voice processing device 21 may be a voice signal picked up by the microphone MC1, and the target component thereof may be the voice of the occupant hm1. The voice signal acquired by the adaptive filter control unit 28 of the voice processing device 22 may be a voice signal picked up by the microphone MC2, and the target component thereof may be the voice of the occupant hm2. The voice signal acquired by the adaptive filter control unit 28 of the voice processing device 23 may be a voice signal picked up by the microphone MC3, and the target component thereof may be the voice of the occupant hm3. The voice signal acquired by the adaptive filter control unit 28 of the voice processing device 24 may be a voice signal picked up by the microphone MC4, and the target component thereof may be the voice of the occupant hm4.

例えば、音声処理装置２１の各適応フィルタＦ２〜Ｆ５が取得する参照信号は、マイクＭＣ２，ＭＣ３，ＭＣ４で収音された信号でよく、ターゲット成分以外の主成分の組み合わせは、乗員ｈｍ２，ｈｍ３，ｈｍ４の音声でよい。音声処理装置２２の各適応フィルタＦ２〜Ｆ５が取得する参照信号は、マイクＭＣ３，ＭＣ４，ＭＣ１で収音された信号でよく、ターゲット成分以外の主成分の組み合わせは、乗員ｈｍ３，ｈｍ４及び運転者ｈｍ１の音声でよい。音声処理装置２３の各適応フィルタＦ２〜Ｆ５が取得する参照信号は、マイクＭＣ４，ＭＣ１，ＭＣ２で収音された信号でよく、ターゲット成分以外の主成分は、乗員ｈｍ４、運転者ｈｍ１、及び乗員ｈｍ２の音声でよい。音声処理装置２４の各適応フィルタＦ２〜Ｆ５が取得する参照信号は、マイクＭＣ１，ＭＣ２，ＭＣ３で収音された信号でよく、ターゲット成分以外の主成分は、運転者ｈｍ１及び乗員ｈｍ２，ｈｍ３の音声でよい。 For example, the reference signals acquired by the adaptive filters F2 to F5 of the voice processing device 21 may be the signals picked up by the microphones MC2, MC3, and MC4, and the combination of main components other than the target component may be the voices of the occupants hm2, hm3, and hm4. The reference signals acquired by the adaptive filters F2 to F5 of the voice processing device 22 may be the signals picked up by the microphones MC3, MC4, and MC1, and the combination of main components other than the target component may be the voices of the occupants hm3 and hm4 and the driver hm1. The reference signals acquired by the adaptive filters F2 to F5 of the voice processing device 23 may be the signals picked up by the microphones MC4, MC1, and MC2, and the main components other than the target component may be the voices of the occupant hm4, the driver hm1, and the occupant hm2. The reference signals acquired by the adaptive filters F2 to F5 of the voice processing device 24 may be the signals picked up by the microphones MC1, MC2, and MC3, and the main components other than the target component may be the voices of the driver hm1 and the occupants hm2 and hm3.

つまり、音声処理システム５は、Ｎ個（Ｎは自然数）のマイクと、Ｎ個の適応フィルタと、適応フィルタ制御部２８と、をそれぞれ含むＮ個の音声処理装置２０と、を備えてよい。ｍ（ｍ：１〜Ｎの任意の整数）番目の音声処理装置２０は、Ｎ個のマイクのうち、ｍ番目のマイクで入力された信号をターゲットとなる音声信号とし、ｍを除く１〜Ｎ番目のマイクで入力された信号を参照信号としてよい。 That is, the voice processing system 5 may include N microphones (N is a natural number) and N voice processing devices 20, each of which includes N adaptive filters and an adaptive filter control unit 28. The m-th voice processing device 20 (m: any integer from 1 to N) may use the signal input from the m-th of the N microphones as the target voice signal and the signals input from the 1st to N-th microphones other than the m-th as reference signals.
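The assignment of target and reference microphones to the N voice processing devices can be sketched as follows. The function name `assign_signals` and the dictionary layout are hypothetical; microphones and devices are numbered from 1 to N as in the text.

```python
def assign_signals(num_mics):
    """For each device m in 1..N, the m-th microphone supplies the target signal
    and all other microphones supply the reference signals."""
    return {
        m: {
            "target": m,
            "references": [i for i in range(1, num_mics + 1) if i != m],
        }
        for m in range(1, num_mics + 1)
    }
```

For N = 4, device 2 takes MC2 as its target and MC1, MC3, and MC4 as references. This is the same set the description lists for the voice processing device 22, although the description lists it in the order MC3, MC4, MC1.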

これにより、音声処理システム５は、複数のマイクで収音される音声それぞれに対し、クロストーク成分を抑圧でき、各マイクで収音された音声信号のクロストーク成分等のノイズ抑圧精度を向上できる。 As a result, the voice processing system 5 can suppress the crosstalk component in each of the voices picked up by the plurality of microphones, and can improve the accuracy with which noise such as the crosstalk component is suppressed in the voice signal picked up by each microphone.

また、マイクＭＣ１〜ＭＣ４は、車室内に配置されてよい。 Further, the microphones MC1 to MC4 may be arranged in the vehicle interior.

これにより、音声処理装置２１は、車室内の例えば狭い空間に複数の乗員がいる状況でも、話者が発話する音声に含まれるクロストーク成分を抑圧できる。 As a result, the voice processing device 21 can suppress the cross talk component included in the voice spoken by the speaker even in a situation where a plurality of occupants are present in a narrow space, for example, in the vehicle interior.

また、音声処理システム５は、出力信号に対して音声認識処理を行う音声認識エンジン４０（音声認識処理部の一例）を備えてよい。 Further, the voice processing system 5 may include a voice recognition engine 40 (an example of a voice recognition processing unit) that performs voice recognition processing on an output signal.

これにより、音声処理システム５は、適応フィルタのフィルタ係数の更新について追従性が向上するので、フィルタ係数の変更後（例えば車室内での話者の変化後）の初期段階における出力信号に基づく音声の音声認識精度が向上する。よって、話者が音声を発した直後から音声認識精度が向上し、音声認識を用いたアプリケーションにおける操作情報の認識精度が向上する。したがって、音声処理システム５は、例えば音声認識による操作可能なアプリケーションに対する指示をスムーズに行うことができる。 As a result, the voice processing system 5 improves the followability with respect to the update of the filter coefficient of the adaptive filter, so that the voice based on the output signal in the initial stage after the filter coefficient is changed (for example, after the speaker is changed in the vehicle interior). Voice recognition accuracy is improved. Therefore, the voice recognition accuracy is improved immediately after the speaker emits the voice, and the recognition accuracy of the operation information in the application using the voice recognition is improved. Therefore, the voice processing system 5 can smoothly give an instruction to an application that can be operated by voice recognition, for example.

（第２の実施形態） (Second Embodiment)
第２の実施形態では、車室内の乗員のうち発話している話者を検知する話者検知を行い、話者検知結果を、クロストーク成分の抑圧処理に補助的に利用する場合を示す。 The second embodiment describes a case where speaker detection is performed to detect which of the occupants in the vehicle interior is speaking, and the speaker detection result is used to assist the crosstalk suppression processing.

第２の実施形態では、第１の実施形態で説明した構成や動作と同一の構成や動作については、同一の符号を用いることで、その説明を省略又は簡略化する。 In the second embodiment, configurations and operations identical to those described in the first embodiment are given the same reference numerals, and their description is omitted or simplified.

図４は、第２の実施形態における音声処理装置２１Ａのハードウェア構成を示す図である。第２の実施形態の音声処理システム５Ａは、第１の実施形態と同様、車両１０の車室内に配置された、複数（例えば４つ）のマイクＭＣ１〜ＭＣ４と、音声処理装置２０Ａと、音声認識エンジン４０と、カーナビゲーション装置５０と、を含む構成を有する。また、音声処理装置２０Ａは、例えば音声処理装置２１Ａ，２２Ａ，２３Ａ，２４Ａでよい。音声処理装置２１Ａ，２２Ａ，２３Ａ，２４Ａは、いずれも同一の構成および機能を有する。ここでは、音声処理装置２１Ａを主に用いて説明する。 FIG. 4 is a diagram showing a hardware configuration of the voice processing device 21A according to the second embodiment. Similar to the first embodiment, the voice processing system 5A of the second embodiment includes a plurality of (for example, four) microphones MC1 to MC4, a voice processing device 20A, and voice arranged in the vehicle interior of the vehicle 10. It has a configuration including a recognition engine 40 and a car navigation device 50. Further, the voice processing device 20A may be, for example, voice processing devices 21A, 22A, 23A, 24A. The voice processing devices 21A, 22A, 23A, and 24A all have the same configuration and function. Here, the voice processing device 21A will be mainly described.

音声処理装置２１Ａは、音声入力部２９と、複数（例えば４つ）の適応フィルタＦ２，Ｆ３，Ｆ４，Ｆ５と、加算器２７と、適応フィルタ制御部２８Ａと、記憶部２８Ｂと、信号検知部３０と、を含む。 The voice processing device 21A includes a voice input unit 29, a plurality of (for example, four) adaptive filters F2, F3, F4, F5, an adder 27, an adaptive filter control unit 28A, a storage unit 28B, and a signal detection unit. 30 and.

信号検知部３０は、マイクＭＣ１からの音声信号Ａ、及びマイクＭＣ２，ＭＣ３，ＭＣ４からの各参照信号Ｂ，Ｃ，Ｄを入力し、これらの信号の音圧レベル（信号レベル）を基に、話者の位置を検知する。例えば、マイクＭＣ１〜ＭＣ４で収音される音声の音声信号の音圧レベルが閾値ｔｈ２より高い場合、そのマイクに向かって発話している乗員がいると判断し、話者の位置を特定してよい。信号検知部３０は、話者位置の検知結果を適応フィルタ制御部２８Ａに通知してよい。なお、信号検知部３０は、音声処理装置２１Ａ〜２４Ａ毎に設けられてもよいし、音声処理システム５Ａ全体で１つ設けられてもよい。 The signal detection unit 30 receives the audio signal A from the microphone MC1 and the reference signals B, C, and D from the microphones MC2, MC3, and MC4, and detects the position of the speaker based on the sound pressure levels (signal levels) of these signals. For example, when the sound pressure level of the voice signal picked up by one of the microphones MC1 to MC4 is higher than the threshold value th2, the signal detection unit 30 may determine that an occupant is speaking toward that microphone and identify the position of the speaker. The signal detection unit 30 may notify the adaptive filter control unit 28A of the detection result of the speaker position. The signal detection unit 30 may be provided for each of the voice processing devices 21A to 24A, or a single unit may be provided for the entire voice processing system 5A.
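The level-based detection can be sketched as follows. The function name `detect_speakers` and the dictionary of per-microphone levels are hypothetical, and the value of th2 is application-specific; the description does not state how the sound pressure level is measured.

```python
def detect_speakers(levels, th2):
    """Return the microphones whose sound pressure level is higher than th2.

    levels: sound pressure level per microphone, e.g. {"MC1": ..., "MC2": ...}
    Each returned microphone is taken to have an occupant speaking toward it.
    """
    return {mic for mic, lv in levels.items() if lv > th2}
```

Because each microphone is associated with one seat, the returned set doubles as the set of detected speaker positions.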

適応フィルタ制御部２８Ａは、信号検知部３０で検知された話者位置に対応する適応フィルタを更新の対象（制御対象）とし、その適応フィルタのフィルタ係数を更新してよい。一方、適応フィルタ制御部２８Ａは、検知された話者位置以外に対応する適応フィルタを更新の対象とせず、その適応フィルタのフィルタ係数を更新しなくてよい。 The adaptive filter control unit 28A may treat the adaptive filter corresponding to the speaker position detected by the signal detection unit 30 as the update target (control target) and update the filter coefficient of that adaptive filter. Conversely, the adaptive filter control unit 28A need not treat the adaptive filters corresponding to positions other than the detected speaker position as update targets, and need not update their filter coefficients.

例えば、助手席で発話が検知された場合、適応フィルタ制御部２８Ａは、適応フィルタＦ３のフィルタ係数を更新してよい。また、左側の後部座席と右側の後部座席の両方で発話が検知された場合、適応フィルタ制御部２８Ａは、話者が複数であるとして、適応フィルタＦ２のフィルタ係数を更新してよい。 For example, when an utterance is detected in the passenger seat, the adaptive filter control unit 28A may update the filter coefficient of the adaptive filter F3. Further, when the utterance is detected in both the left rear seat and the right rear seat, the adaptive filter control unit 28A may update the filter coefficient of the adaptive filter F2, assuming that there are a plurality of speakers.

信号検知部３０によって検知される話者位置の確度が低い場合、適応フィルタ制御部２８Ａは、適応フィルタＦ２（３つの適応フィルタＦ２Ａ，Ｆ２Ｂ，Ｆ２Ｃを含む）を更新し、適応フィルタＦ２の通過信号Ｅ’を基に、出力信号を得てよい。これにより、複数人の話者位置を正確に検知することは困難であるが、話者位置の確度が低く、話者位置が推定困難な場合でも、音声処理装置２１Ａは、大きな音質劣化を抑制できる。 When the accuracy of the speaker position detected by the signal detection unit 30 is low, the adaptive filter control unit 28A may update the adaptive filter F2 (which includes the three adaptive filters F2A, F2B, and F2C) and obtain the output signal based on the passing signal E' of the adaptive filter F2. Accurately detecting the positions of a plurality of speakers is difficult, but in this way the voice processing device 21A can avoid a large deterioration in sound quality even when the accuracy of the speaker position is low and the speaker position is hard to estimate.

一方、信号検知部３０によって検知される話者位置の確度が高い場合、適応フィルタ制御部２８Ａは、この話者位置に対応するいずれかの適応フィルタを更新し、更新される適応フィルタを通過する通過信号を基に、出力信号を得てよい。これにより、話者位置の確度が高く、話者位置が高精度に特定可能である場合、音声処理装置２１Ａは、音声信号Ａに含まれるクロストーク成分（例えば話者）を十分に抑圧できる。 On the other hand, when the accuracy of the speaker position detected by the signal detection unit 30 is high, the adaptive filter control unit 28A updates any of the adaptive filters corresponding to the speaker position and passes through the updated adaptive filter. An output signal may be obtained based on the passing signal. As a result, when the accuracy of the speaker position is high and the speaker position can be specified with high accuracy, the voice processing device 21A can sufficiently suppress the crosstalk component (for example, the speaker) contained in the voice signal A.
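The choice of the filter to update in the voice processing device 21A can be sketched as follows. The microphone-to-filter table follows the first embodiment (F3, F4, and F5 pass the reference signals from MC2, MC3, and MC4). Returning `None` when nobody is detected is an assumption, since the description does not cover that case, and `filter_to_update` is a hypothetical name.

```python
# Single-reference filter for each reference microphone of device 21A
SINGLE_FILTER_FOR_MIC = {"MC2": "F3", "MC3": "F4", "MC4": "F5"}


def filter_to_update(active_mics, confident):
    """Pick the adaptive filter whose coefficients should be updated.

    active_mics: set of reference microphones at which speech was detected
    confident:   True when the detected speaker position is considered accurate
    """
    if not active_mics:
        return None  # assumption: nothing is updated when no speech is detected
    if not confident or len(active_mics) > 1:
        return "F2"  # several speakers, or an unreliable position estimate
    (mic,) = active_mics
    return SINGLE_FILTER_FOR_MIC[mic]
```

With this rule, speech detected only at the passenger seat microphone MC2 selects F3, while speech at both rear seats selects the multi-reference filter F2, matching the examples in the text.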

このように、音声処理装置２１Ａは、話者位置検知を行って話者位置を推定することで、例えば、全ての適応フィルタＦ２〜Ｆ５に関する演算の少なくとも一部を省略できる。適応フィルタＦ２〜Ｆ５に関する演算は、各通過信号の生成に係る演算、各減算信号の生成に係る演算、等を含んでよい。このように、信号検知部３０の機能を補助的に用いることで、音声処理装置２１Ａの処理負荷を低減できる。 In this way, the voice processing device 21A can omit at least a part of the operations related to all the adaptive filters F2 to F5 by estimating the speaker position by detecting the speaker position. The operations related to the adaptive filters F2 to F5 may include operations related to the generation of each passing signal, operations related to the generation of each subtraction signal, and the like. In this way, by using the function of the signal detection unit 30 as an auxiliary, the processing load of the voice processing device 21A can be reduced.

ここで、話者位置及び話者位置の確度の導出例について説明する。 Here, an example of deriving the speaker position and the accuracy of the speaker position will be described.

信号検知部３０は、話者位置の確度を様々な方法で導出してよい。例えば、信号検知部３０は、マイクＭＣ１〜ＭＣ４でそれぞれ収音される音声の音圧レベル（信号レベル）が閾値ｔｈ３（＞閾値ｔｈ２）を超えるか否かに応じて、話者位置の確度を決定してよい。また、信号検知部３０は、カメラを含んでもよい。この場合、信号検知部３０は、例えば乗員の口元付近を撮像し、この撮像画像を解析して、乗員が発話しているか否かを判断してもよい。信号検知部３０は、音圧レベルにより検知された発話者と、カメラによる撮像画像を基に解析された発話者とが一致した場合、話者位置検知の確度が高いと判断してよい。 The signal detection unit 30 may derive the accuracy of the speaker position by various methods. For example, the signal detection unit 30 may determine the accuracy of the speaker position according to whether or not the sound pressure level (signal level) of the voice picked up by each of the microphones MC1 to MC4 exceeds the threshold value th3 (> threshold value th2). The signal detection unit 30 may also include a camera. In this case, the signal detection unit 30 may, for example, capture an image of the area around an occupant's mouth and analyze the captured image to determine whether or not the occupant is speaking. When the speaker detected from the sound pressure level matches the speaker identified by analyzing the image captured by the camera, the signal detection unit 30 may determine that the accuracy of speaker position detection is high.
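The two thresholds and the optional camera check can be sketched as follows. The description presents the level test and the camera test as separate options; combining them with a logical OR, the three labels "none", "low", and "high", and the name `position_confidence` are assumptions made for illustration.

```python
def position_confidence(level, th2, th3, camera_agrees=False):
    """Classify the accuracy of a detected speaker position.

    level:         sound pressure level at the microphone for that seat
    th2:           detection threshold (speech is present above it)
    th3:           accuracy threshold, required to be greater than th2
    camera_agrees: True if image analysis found the same occupant speaking
    """
    if th3 <= th2:
        raise ValueError("th3 must be greater than th2")
    if level <= th2:
        return "none"  # no speech detected at this position
    if level > th3 or camera_agrees:
        return "high"
    return "low"
```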

When the people who ride in the vehicle 10 often sit in the same seats, the signal detection unit 30 may detect the speaker position using voiceprints. In this case, the voiceprint of each occupant is registered in advance in the storage unit 28B, and when an utterance occurs, the signal detection unit 30 may refer to the voiceprints registered in the storage unit 28B and determine whether the voiceprint of the voice picked up by the microphone corresponding to a seat matches a registered voiceprint. If they match, the signal detection unit 30 may determine that the seating position of the occupant corresponding to that voiceprint is highly likely to be the speaker position.

The signal detection unit 30 may also judge whether the accuracy of the speaker position detection is high according to the distance between the installation position of each of the microphones MC1 to MC4 and the driver hm1 or the occupant hm2 to hm4 whose voice that microphone mainly picks up. The signal detection unit 30 may estimate this distance based on the delay component of the voice signal picked up by each of the microphones MC1 to MC4. For example, when the delay component of the voice signal is large, it can be determined that the distance between the microphone and the speaker is long and that the accuracy of the speaker position detection is low. When the delay component of the voice signal is small, it can be determined that the distance between the microphone and the speaker is short and that the accuracy of the speaker position detection is high.
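The relationship between delay and distance can be sketched as below, assuming that the delay component is a propagation delay expressed in samples. The function names and the 0.5 m limit are hypothetical, and how the delay itself is measured is outside this sketch.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at about 20 degrees C


def delay_to_distance_m(delay_samples, sample_rate_hz):
    """Convert a propagation delay in samples into a distance in metres."""
    return SPEED_OF_SOUND_M_S * delay_samples / sample_rate_hz


def accuracy_from_delay(delay_samples, sample_rate_hz, max_distance_m=0.5):
    """Judge the detection accuracy as "high" when the speaker is close.

    max_distance_m is a hypothetical limit chosen only for illustration.
    """
    distance = delay_to_distance_m(delay_samples, sample_rate_hz)
    return "high" if distance <= max_distance_m else "low"
```

With a sampling rate of 48 kHz, a delay of 48 samples is 1 ms, which corresponds to roughly 0.34 m and would be judged as high accuracy under the assumed limit.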

The adaptive filter control unit 28A may calculate a score from the signal level of each subtraction signal (error signal), calculate a score from the speaker position detection result, and determine the output signal based on these scores. In this case, the adaptive filter control unit 28A may associate each speaker position with the subtraction signal corresponding to the pass signal of the adaptive filter that corresponds to the detected speaker position. For example, the adaptive filter control unit 28A may determine, as the output signal, the subtraction signal for which the sum of the two scores is the highest (or the lowest).
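One possible way to combine the two scores is sketched below. The scoring convention (a lower error level gives a higher score, and the position score is an additive bonus) and the names `choose_output`, `error_levels_db`, and `position_scores` are assumptions for illustration; the embodiment only states that the sum of the two scores is used.

```python
def choose_output(error_levels_db, position_scores):
    """Pick the subtraction signal with the highest combined score.

    error_levels_db: dict mapping a filter name to the level (dB) of its
        subtraction signal; a lower level means better suppression.
    position_scores: dict mapping a filter name to a bonus derived from the
        speaker position detection result (missing entries count as 0).
    Returns (name of the selected filter, dict of total scores).
    """
    totals = {
        name: -level + position_scores.get(name, 0.0)
        for name, level in error_levels_db.items()
    }
    best = max(totals, key=totals.get)
    return best, totals
```

With error levels of -30 dB for F2 and -32 dB for F3, the level score alone selects F3, while a position bonus of 5 for F2 shifts the choice to F2. The subtraction signal of the selected filter would then be used as the output signal.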

Therefore, by detecting the speaker position, the voice processing system 5A can update the adaptive filters more accurately and obtain a high-quality voice signal in which the crosstalk component is further suppressed.

When the state in which the reliability (accuracy) of the speaker position detection result is high continues for a period equal to or longer than a threshold th4, that is, when an utterance from a certain seat continues for a long time, the adaptive filter control unit 28A may save, in the storage unit 28B, the filter coefficients that were continuously updated while the reliability was high, as the filter coefficients of the adaptive filter corresponding to that seat. The adaptive filter control unit 28A may also update the saved filter coefficients at regular intervals. When the speaker position estimated by the signal detection unit 30 or from the signal level of the error signal is determined to be the position of a seat corresponding to an adaptive filter whose filter coefficients have been saved in the storage unit 28B, the adaptive filter control unit 28A may read out the saved filter coefficients and set them in the adaptive filter. This allows the voice processing device 21A to effectively suppress the voice component (crosstalk component) of the speaker in that seat, for example immediately after the position corresponding to the adaptive filter whose coefficients were saved becomes the speaker position.
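The saving and restoring of filter coefficients described above can be sketched as a small store keyed by seat. The class name `CoefficientStore`, the frame-based interpretation of the threshold th4, and the method names are hypothetical and chosen only for illustration.

```python
class CoefficientStore:
    """Keeps the latest trusted filter coefficients for each seat."""

    def __init__(self, th4_frames):
        self.th4_frames = th4_frames  # th4 expressed as a number of frames (assumption)
        self._saved = {}              # seat -> copy of saved coefficients
        self._run_seat = None         # seat of the current high-confidence run
        self._run_len = 0             # length of that run in frames

    def observe(self, seat, confidence_high, coeffs):
        """Record one frame; save coeffs once the run reaches th4_frames."""
        if confidence_high and seat == self._run_seat:
            self._run_len += 1
        elif confidence_high:
            self._run_seat, self._run_len = seat, 1
        else:
            self._run_seat, self._run_len = None, 0
        if self._run_seat is not None and self._run_len >= self.th4_frames:
            self._saved[seat] = list(coeffs)  # overwritten on every later frame of the run

    def restore(self, seat):
        """Return a copy of the coefficients saved for seat, or None."""
        saved = self._saved.get(seat)
        return None if saved is None else list(saved)
```

When the same seat is later detected as the speaker position, the result of `restore(seat)`, if not `None`, could be loaded into the corresponding adaptive filter before normal updating resumes.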

As described above, the voice processing device 21A may include the signal detection unit 30, which detects the position of a speaker (the driver hm1 or one of the occupants hm2 to hm4) who utters the target component (the main component of the voice signal A) or a main component other than the target component (the main component of the reference signal B, C, or D). The adaptive filter control unit 28A may determine the adaptive filter to be controlled based on the detected speaker position.

This allows the voice processing device 21A to improve the accuracy of selecting the adaptive filter to be updated by using the speaker position detection result in an auxiliary manner.

The voice processing device 21A may further include the storage unit 28B. The adaptive filter control unit 28A may derive (for example, calculate) the accuracy of the speaker position detected by the signal detection unit 30. When this accuracy is equal to or higher than the threshold th4 (an example of a second threshold), the storage unit 28B may store the speaker position detected by the signal detection unit 30 at a time t1 (an example of a first time) in association with the filter coefficients of the adaptive filter corresponding to that speaker position. When the signal detection unit 30 detects, at a time t2 (an example of a second time) later than the time t1, the same speaker position as at the time t1, the adaptive filter control unit 28A may update the filter coefficients of the adaptive filter corresponding to that speaker with the filter coefficients stored in the storage unit 28B in association with the speaker position.

As a result, when the same speaker (or the same combination of speakers) is speaking at the times t1 and t2, the voice processing device 21A can be expected to improve the accuracy of suppressing the noise component of the voice signal, as it did in the past, by reusing filter coefficient values that were updated in the past and have a proven record. In addition, although identifying the speaker is difficult, the voice processing device 21A can update the filter stably to proven filter coefficients by limiting this operation to cases in which the accuracy of the speaker position detection is equal to or higher than the threshold th4.

Although various embodiments have been described above with reference to the drawings, it goes without saying that the present disclosure is not limited to these examples. It is clear that a person skilled in the art could conceive of various changes or modifications within the scope of the claims, and it is understood that these naturally belong to the technical scope of the present disclosure.

In the above embodiments, the processor may be physically configured in any way. If a programmable processor is used, the processing content can be changed by changing the program, which increases the degree of freedom in designing the processor. The processor may be composed of one semiconductor chip or may be physically composed of a plurality of semiconductor chips. When the processor is composed of a plurality of semiconductor chips, each control of the above embodiments may be realized by a separate semiconductor chip. In this case, the plurality of semiconductor chips can be regarded as constituting one processor. The processor may also be composed of a semiconductor chip and a member having a different function (such as a capacitor). One semiconductor chip may be configured to realize both the functions of the processor and other functions. A plurality of processors may also be implemented as a single processor.

The above embodiments mainly illustrate the case in which the voice signal and the reference signals contain the voices of speakers, but the voice signal and the reference signals may broadly contain various kinds of sound. For example, the voice signal and the reference signals may contain music, environmental sounds, mechanical sounds, and other sounds.

In the above embodiments, each threshold may be a fixed value or a variable value. Each threshold may be a predetermined value or a value input via an operation unit provided in the voice processing system.

The present disclosure is useful for a voice processing system, a voice processing device, a voice processing method, and the like that can improve the accuracy of suppressing the noise component of the voice signal to be acquired even when the number or positions of noise sources change.

5 Voice processing system
10 Vehicle
20, 21, 21A, 22, 23, 24 Voice processing device
27 Adder
28, 28A Adaptive filter control unit
28B Storage unit
29 Voice input unit
30 Signal detection unit
40 Voice recognition engine
50 Car navigation device
F2, F2A, F2B, F2C, F3, F4, F5 Adaptive filter
hm1 Driver
hm2, hm3, hm4 Occupant
MC1, MC2, MC3, MC4 Microphone

Claims

A voice processing system comprising:
a first voice acquisition unit that acquires a voice signal including a first voice component and a voice component other than the first voice component;
a plurality of second voice acquisition units that acquire a plurality of reference signals including a second voice component and a voice component other than the second voice component;
a first adaptive filter that passes two or more reference signals among the plurality of reference signals and generates a first pass signal;
a plurality of second adaptive filters that each pass a different single reference signal among the plurality of reference signals and generate a plurality of second pass signals; and
a control unit that determines, based on the voice signal, the first pass signal, and the plurality of second pass signals, an adaptive filter to be controlled among the first adaptive filter and the plurality of second adaptive filters, and controls a filter coefficient of the adaptive filter to be controlled.

The voice processing system according to claim 1,
wherein the control unit:
generates a first subtraction signal by subtracting the first pass signal from the voice signal;
generates a plurality of second subtraction signals by subtracting each of the different second pass signals from the voice signal; and
determines the adaptive filter to be controlled based on the signal levels of the first subtraction signal and the plurality of second subtraction signals.

The voice processing system according to claim 2,
wherein the control unit:
determines, based on the signal levels of the first subtraction signal and the plurality of second subtraction signals, any one of the first subtraction signal and the plurality of second subtraction signals as an output signal; and
determines, as the adaptive filter to be controlled, the one adaptive filter among the first adaptive filter and the plurality of second adaptive filters that corresponds to the output signal.

The voice processing system according to claim 2 or 3,
wherein the control unit determines, as the adaptive filter to be controlled, the adaptive filter corresponding to the subtraction signal having the minimum signal level among the first subtraction signal and the plurality of second subtraction signals.

The voice processing system according to claim 2 or 3,
wherein the control unit determines, as the adaptive filter to be controlled, the adaptive filter corresponding to a subtraction signal whose signal level is equal to or lower than a first threshold among the first subtraction signal and the plurality of second subtraction signals.

The voice processing system according to any one of claims 1 to 5,
further comprising a detection unit that detects the position of a speaker who utters the first voice component or the second voice component,
wherein the control unit determines the adaptive filter to be controlled based on the position of the speaker.

The voice processing system according to claim 6,
further comprising a storage unit,
wherein the control unit derives the accuracy of the position of the speaker detected by the detection unit,
wherein, when the accuracy is equal to or higher than a second threshold, the storage unit stores the position of the speaker detected by the detection unit at a first time in association with the filter coefficient of the adaptive filter corresponding to the position of the speaker, and
wherein, when the detection unit detects, at a second time later than the first time, the same speaker position as at the first time, the control unit updates the filter coefficient of the adaptive filter corresponding to the speaker with the filter coefficient that is stored in the storage unit in association with the speaker position detected by the detection unit.

The voice processing system according to any one of claims 1 to 7,
wherein each second voice acquisition unit has directivity in the direction of the sound source that emits the second voice component in order to acquire the reference signal.

The voice processing system according to any one of claims 1 to 8,
comprising a plurality of voice processing devices each including the first adaptive filter, the plurality of second adaptive filters, and the control unit,
wherein the voice signals acquired by the control units of the plurality of voice processing devices differ from one another, and
wherein the combinations of reference signals acquired by the first adaptive filter and the second adaptive filters of the plurality of voice processing devices differ from one another.

The voice processing system according to any one of claims 1 to 9,
wherein the first voice acquisition unit and the plurality of second voice acquisition units are arranged in a vehicle interior.

The voice processing system according to claim 3,
further comprising a voice recognition processing unit that performs voice recognition processing on the output signal.

A voice processing device comprising:
a control unit that acquires a voice signal including a first voice component and a voice component other than the first voice component; and
a plurality of adaptive filters that acquire a plurality of reference signals including a second voice component and a voice component other than the second voice component,
wherein the plurality of adaptive filters include:
a first adaptive filter that passes two or more reference signals among the plurality of reference signals and generates a first pass signal; and
a plurality of second adaptive filters that each pass a different single reference signal among the plurality of reference signals and generate a plurality of second pass signals, and
wherein the control unit determines, based on the voice signal, the first pass signal, and the plurality of second pass signals, an adaptive filter to be controlled among the first adaptive filter and the plurality of second adaptive filters,
and controls a filter coefficient of the adaptive filter to be controlled.

A voice processing method comprising:
acquiring a voice signal including a first voice component and a voice component other than the first voice component;
acquiring a plurality of reference signals including a second voice component and a voice component other than the second voice component;
generating a first pass signal by passing two or more reference signals among the plurality of reference signals through a first adaptive filter;
generating a plurality of second pass signals by passing a different single reference signal among the plurality of reference signals through each of a plurality of second adaptive filters; and
determining, based on the voice signal, the first pass signal, and the plurality of second pass signals, an adaptive filter to be controlled among the first adaptive filter and the plurality of second adaptive filters, and controlling a filter coefficient of the adaptive filter to be controlled.