JP2011002535A - Voice interaction system, voice interaction method, and program

Info

Publication number: JP2011002535A
Application number: JP2009143979A
Authority: JP
Inventors: Tomoya Takatani; 智哉高谷
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2009-06-17
Filing date: 2009-06-17
Publication date: 2011-01-06

Abstract

PROBLEM TO BE SOLVED: To improve the adaptation speed of a digital filter that suppresses the system voice component included in the observation signals from a microphone array in a barge-in free voice interaction system.

SOLUTION: A system voice suppression filter section 51 and an ICA (Independent Component Analysis) section 52 perform filter processing, based on an ICA algorithm, that suppresses the system voice included in the observation signals X_1(f, t) to X_K(f, t) from the microphone array 1, and generate a first signal group Y_1(f, t) to Y_K(f, t). A user voice enhancement filter section 61 and an ICA section 62 receive the first signal group Y_1(f, t) to Y_K(f, t) and separate the user voice Z_1(f, t) based on an ICA algorithm. An interaction management section 71 performs recognition processing of the user voice based on the separated signal Z_1(f, t).

Description

The present invention relates to a barge-in free voice dialogue system that can recognize a talker's voice regardless of when the system voice (the system's own speech output) is produced.

Voice dialogue systems that interact with a user by voice are known (see Patent Document 1 and Non-Patent Document 1). Such a system emits voice to the user (hereinafter, system voice) and recognizes the user's voice (hereinafter, user voice) from the signal observed by a microphone. When the system voice is output while the user is speaking, the microphone picks up a mixture of the two.

Non-Patent Document 1 discloses a system that limits the drop in the user voice recognition rate by removing the system voice from the mixture of user voice and system voice. A system that, like the one in Non-Patent Document 1, allows the user to speak while the system voice is being output is generally called a "barge-in free voice dialogue system".

More specifically, the voice dialogue system disclosed in Non-Patent Document 1 uses blind source separation (BSS) based on independent component analysis (ICA) to separate and suppress the system voice. ICA-based BSS is a signal processing technique that separates a desired signal (such as the user voice) arriving from an unknown sound source (blind source) from the observation signals collected by a microphone array.

Non-Patent Document 2 and Patent Document 2 disclose examples in which ICA-based BSS similar to that of Non-Patent Document 1 is applied to an acoustic echo canceller. They use ICA-based BSS to remove, from the signal observed by a microphone, the acoustic echo that arises when the output of a loudspeaker feeds back into the microphone. The signal observed by the microphone is a mixture of the speech of the near-end speaker (user) and the sound output from the loudspeaker (the received signal from the far-end speaker). In Non-Patent Document 2 and Patent Document 2, the signal observed by the microphone and the received signal from the far-end speaker are input to a digital filter, and the speech of the near-end speaker (user) included in the mixture is suppressed.

Patent Document 1: JP 2007-248534 A
Patent Document 2: JP 2004-48253 A

Non-Patent Document 1: Sigeyuki Miyabe et al., "Barge-in- and noise-free spoken dialogue interface based on sound field control and semi-blind source separation", Proc. 15th European Signal Processing Conference (EUSIPCO 2007), pp. 232-236, Sep. 2007
Non-Patent Document 2: Ted Wada et al., "Use of decorrelation procedure for source and echo suppression", Proc. International Workshop for Acoustic Echo and Noise Control (IWAENC'08), #9086 in Online Proceedings or CD-ROM, Sep. 2008

FIG. 4 shows a configuration example of a barge-in free voice dialogue system 800 to which the technique disclosed in Non-Patent Document 1 is applied. In FIG. 4, the microphone array 1 consists of K microphone elements and observes a mixture of user voice, background noise, and system voice.

The speaker 2 outputs the system voice. The sound source signal for the system voice is generated by the speech synthesis unit 72, converted to an analog signal by the DA converter (DAC) 4, and then supplied to the speaker 2.

The K AD converters (ADCs) 31 to 3K sample the K observation signals X_j(t) (j = 1, 2, ..., K) from the microphone array 1. The ADC 30 samples the sound source signal X_0(t) supplied to the speaker 2 and generates a sample sequence of the system voice.

The system 800 performs ICA in the frequency domain. The sample sequences of the observation signals X_j(t) (j = 1, 2, ..., K) are therefore converted by a short-time DFT (Discrete Fourier Transform) into time-frequency domain signals X_j(f, t) (j = 1, 2, ..., K). Likewise, the sample sequence of the system voice X_0(t) is converted by a short-time DFT into the time-frequency domain signal X_0(f, t).
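The short-time DFT step can be illustrated with a minimal pure-Python sketch. The function name `stft`, the frame length, the hop size, and the Hann window are all assumptions made for illustration; the source does not specify any of them.

```python
import cmath
import math


def stft(samples, frame_len=8, hop=4):
    """Short-time DFT: returns X[t][f] for frame index t and bin index f."""
    # Hann window (an assumption; the source does not name a window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        seg = [samples[start + n] * window[n] for n in range(frame_len)]
        # Direct O(N^2) DFT of one windowed frame.
        spectrum = [
            sum(seg[n] * cmath.exp(-2j * math.pi * f * n / frame_len)
                for n in range(frame_len))
            for f in range(frame_len)
        ]
        frames.append(spectrum)
    return frames


# Hypothetical usage: 16 samples give 3 overlapping frames of 8 bins each.
frames = stft([1.0] * 16)
print(len(frames), len(frames[0]))
```

Each microphone signal X_j(t) and the system voice X_0(t) would be transformed the same way, so that the later per-bin processing sees aligned frames. A practical implementation would use an FFT rather than the direct sum shown here.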

The adaptive filter unit 81 receives the observation signals X_j(f, t) and the system voice X_0(f, t) and generates three separated signals Z_1(f, t), Z_2(f, t), and Z_3(f, t) from them. Here, Z_1(f, t) is the separated signal estimated to be the user voice, Z_2(f, t) is the separated signal estimated to be the background noise, and Z_3(f, t) is the separated signal estimated to be the system voice.

The ICA-based signal separation performed in the adaptive filter unit 81 can be expressed by equation (1), where W_11(f), W_21(f), W_1K(f), W_2K(f), W_10(f), W_20(f), and W_30(f) are the separation filters in the adaptive filter unit 81.
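Equation (1) is not reproduced in this text. A form consistent with the separation filters named above, offered as a reconstruction under that assumption rather than as the original expression, is:

```latex
\begin{bmatrix} Z_1(f,t) \\ Z_2(f,t) \\ Z_3(f,t) \end{bmatrix}
=
\begin{bmatrix}
W_{11}(f) & \cdots & W_{1K}(f) & W_{10}(f) \\
W_{21}(f) & \cdots & W_{2K}(f) & W_{20}(f) \\
0         & \cdots & 0         & W_{30}(f)
\end{bmatrix}
\begin{bmatrix} X_1(f,t) \\ \vdots \\ X_K(f,t) \\ X_0(f,t) \end{bmatrix}
```

In this form the third row depends only on X_0(f, t), which matches Z_3(f, t) being the estimate of the system voice, while W_10(f) and W_20(f) remove the system voice from the other two outputs.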

Based on the ICA algorithm, the ICA unit 82 updates the filter coefficients of the separation filters W_11(f), W_21(f), W_1K(f), W_2K(f), W_10(f), W_20(f), and W_30(f) so that the separated signals (separated signal vector) Z_1(f, t), Z_2(f, t), and Z_3(f, t) become mutually independent.

The dialogue management unit 71 performs speech recognition based on the separated signal Z_1(f, t) corresponding to the user voice, or on the signal Z_1(t) obtained by transforming it back to the time domain. The dialogue management unit 71 also executes information processing corresponding to the recognized content of the user's utterance and controls the speech synthesis unit 72 to generate a response message to the user.

As described above, the barge-in free voice dialogue system disclosed in Non-Patent Document 1 separates the user voice, the background noise, and the system voice with a single adaptive algorithm, that is, a single ICA algorithm. In this case, however, the filter length (the number of filter coefficients, or taps) of the separation filters W_10(f), W_20(f), and W_30(f) for suppressing the system voice must be the same as that of the user voice enhancement filters such as W_11(f) and W_1K(f). The number of filter coefficients to be optimized therefore becomes large, which causes two problems: (a) the time required to optimize the filters increases, and (b) local minima are difficult to avoid during optimization.

The present invention is based on the above findings by the inventor. Its object is to improve the adaptation speed of a digital filter that suppresses the system voice component included in the observation signals from a microphone array, in a barge-in free voice dialogue system that can recognize a talker's voice regardless of when the system voice is output.

A first aspect of the present invention is a barge-in free voice dialogue system capable of recognizing user voice while system voice is being output from a speaker. The system includes a first filter unit, a first ICA unit, a second filter unit, a second ICA unit, and a dialogue management unit.
The first filter unit filters the group of observation signals from a microphone array and generates a filtered first signal group.
The first ICA unit optimizes the adaptive filter group included in the first filter unit, based on an independent component analysis algorithm, so that the sound source signal supplied to the speaker for outputting the system voice and each signal of the first signal group become mutually independent.
The second filter unit filters the first signal group and generates a filtered second signal group.
The second ICA unit optimizes the adaptive filter group included in the second filter unit, based on an independent component analysis algorithm, so that the signals of the second signal group become mutually independent.
The dialogue management unit performs dialogue management that includes recognition processing of the user voice based on a signal included in the second signal group and generation processing of new system voice based on the recognition result.

In the first aspect described above, the filter coefficients of the first filter unit, which suppresses the system voice, are optimized by an ICA algorithm that is separate from and independent of the one used for the second filter unit, which enhances the user voice. As a result, the adaptive filters in the first filter unit can be made shorter than those in the second filter unit. The first aspect therefore allows the filter coefficients of the first filter unit to be optimized quickly, and it lowers the probability of being trapped in a local minimum during optimization.

According to the first aspect described above, the adaptation speed of the digital filter that suppresses the system voice component included in the observation signals from the microphone array can be improved in a barge-in free voice dialogue system.

FIG. 1 is a block diagram showing the overall configuration of a voice dialogue system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration example of the system voice suppression filter unit of the voice dialogue system shown in FIG. 1.
FIG. 3 is a block diagram showing a configuration example of the user voice enhancement filter unit of the voice dialogue system shown in FIG. 1.
FIG. 4 is a block diagram showing the configuration of a barge-in free voice dialogue system according to the background art.

Specific embodiments to which the present invention is applied are described in detail below with reference to the drawings. In the drawings, identical elements carry identical reference numerals, and redundant description is omitted where appropriate for clarity.

<Embodiment 1 of the Invention>
FIG. 1 is a block diagram showing the overall configuration of a barge-in free voice dialogue system 100 according to the present embodiment. Components of the voice dialogue system 100 that are common to the voice dialogue system 800 shown in FIG. 4 carry the same reference numerals as in FIG. 4, and their description is not repeated here.

In FIG. 1, the system voice suppression filter unit 51 and the ICA unit 52 perform ICA-based suppression of the system voice on the observation signals X_1 to X_K obtained by the microphone array 1. The user voice enhancement filter unit 61 and the ICA unit 62 then perform ICA-based enhancement of the user voice on the signals Y_1 to Y_K obtained after system voice suppression. In other words, the voice dialogue system 100 performs "system voice suppression" and "user voice enhancement" with two independent ICA algorithms.

The number of filter coefficients, which dominates the computation time of adaptive learning in a digital filter, depends largely on the length (that is, the time length) of the spatial transfer function to be estimated. Empirically, the length of the spatial transfer function between the microphone array 1 and a sound source has the following two properties:
(1) The sound source closest to the microphone array 1 is the speaker 2 of the system 100.
(2) The distances from the microphone array 1 to the user and to the background noise sources are relatively short, but longer than the distance from the microphone array 1 to the speaker 2.
The adaptive filter for estimating the system voice can therefore inherently be shorter than the adaptive filters for estimating the user voice and the background noise.

Consequently, in the voice dialogue system 100, which performs "system voice suppression" and "user voice enhancement" with two independent ICA algorithms, the adaptive filters in the filter unit 51 for system voice suppression can be made shorter than those in the filter unit 61 for user voice enhancement. The voice dialogue system 100 can therefore optimize the filter coefficients of the system voice suppression filter unit 51 quickly, and the probability of being trapped in a local minimum during optimization is reduced.

Specific configuration examples of the filter unit 51, the ICA unit 52, the filter unit 61, and the ICA unit 62 are described below for the case of frequency-domain ICA.

FIG. 2 is a block diagram showing a configuration example of the filter unit 51 and the ICA unit 52. The filter unit 51 shown in FIG. 2 includes K adaptive filters B_10(f) to B_K0(f). The filters B_10(f) to B_K0(f), together with the ICA units 521 to 52K described later, estimate the transfer functions of the propagation paths along which the system voice travels from the speaker 2 to the microphone elements 11 to 1K.

The K adaptive filters B_10(f) to B_K0(f) are arranged in correspondence with the K observation signals X_1(f, t) to X_K(f, t). Each of the filters B_10(f) to B_K0(f) receives the sound source signal X_0(f, t) corresponding to the system voice and changes its frequency characteristics. The outputs of the filters B_10(f) to B_K0(f) are supplied to the adders (subtractors) 511 to 51K, respectively.

The adders 511 to 51K subtract the output signals of the filters B_10(f) to B_K0(f) from the observation signals X_1(f, t) to X_K(f, t) to generate the signals Y_1(f, t) to Y_K(f, t). The signals Y_1(f, t) to Y_K(f, t) are supplied to the subsequent user voice enhancement filter unit 61 as the observation signals after system voice suppression.

The filtering performed by the filter unit 51 shown in FIG. 2 can be expressed by equation (2).
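Equation (2) is not reproduced in this text. From the description of the filters B_10(f) to B_K0(f) and the adders 511 to 51K, it would take the following form (a reconstruction from that description, not the original expression):

```latex
Y_j(f,t) = X_j(f,t) - B_{j0}(f)\, X_0(f,t), \qquad j = 1, 2, \ldots, K
```

Each output Y_j(f, t) depends on a single coefficient B_j0(f) per frequency bin, so the K filters can be adapted independently of one another.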

The ICA unit 52 updates the filter coefficients of the adaptive filters B_10(f) to B_K0(f) based on the ICA algorithm. In the example of FIG. 2, the ICA unit 52 includes K ICA units 521 to 52K corresponding to the K adaptive filters B_10(f) to B_K0(f).

For example, the ICA unit 521 updates the filter coefficients of the adaptive filter B_10(f) according to the ICA algorithm so that the sound source signal X_0(f, t) corresponding to the system voice and the filtered output signal Y_1(f, t) become mutually independent. Similarly, the ICA unit 52K updates the filter coefficients of the adaptive filter B_K0(f) according to the ICA algorithm so that the sound source signal X_0(f, t) and the filtered output signal Y_K(f, t) become mutually independent.

Mutual information (Kullback-Leibler divergence), higher-order statistics (kurtosis), and the like are used as independence criteria in ICA. The filter coefficients in the ICA units 521 to 52K may be updated by ICA using mutual information or higher-order statistics. As an example, equation (3) shows a filter coefficient update formula that applies the method based on maximizing mutual information known as the Infomax method. In equation (3), the function φ(Y) is the probability density function of the sound source signal, α is the update coefficient (learning rate), <A>_t denotes the time average of A, [I] denotes the number of filter coefficient updates, and j is an integer from 1 to K. When speech signals are handled, as in the present embodiment, φ(Y) may be approximated by a sigmoid function.
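Equation (3) is not reproduced in this text. The sketch below assumes one plausible form built from the symbols defined above, B_j0(f) <- B_j0(f) + α <φ(Y_j(f, t)) X_0*(f, t)>_t, for a single frequency bin. The function names, the tanh nonlinearity applied separately to the real and imaginary parts, the learning rate, and the iteration count are illustrative assumptions, not the patent's stated method.

```python
import math
import random


def phi(y):
    """Sigmoid-type nonlinearity on real and imaginary parts (assumed form)."""
    return complex(math.tanh(y.real), math.tanh(y.imag))


def update_suppression_filter(x_j, x_0, alpha=0.1, n_iter=200):
    """Adapt one coefficient B_j0(f) for a single frequency bin f.

    x_j: observed samples X_j(f, t) over frames t
    x_0: system-voice samples X_0(f, t) over the same frames
    """
    b = 0j
    for _ in range(n_iter):
        # Subtract the filtered system voice: Y_j = X_j - B_j0 * X_0.
        y = [xj - b * x0 for xj, x0 in zip(x_j, x_0)]
        # Time average <phi(Y_j) X_0^*>_t, used here as the update direction.
        grad = sum(phi(yt) * x0.conjugate() for yt, x0 in zip(y, x_0)) / len(y)
        b += alpha * grad
    return b


# Hypothetical check: with no user speech, B_j0 should approach the
# echo-path gain h that maps X_0 onto X_j in this bin.
rng = random.Random(0)
x_0 = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
h = 0.6 - 0.3j
x_j = [h * x0 for x0 in x_0]
b = update_suppression_filter(x_j, x_0)
print(abs(b - h))
```

Because each bin and each microphone has a single complex coefficient, the update involves no matrix operations, which is in line with the short filter length argued for the system voice suppression stage.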

In frequency-domain ICA, the adaptive learning (that is, the filter coefficient update) of the adaptive filters B_10(f) to B_K0(f) uses frequency-domain data obtained by dividing the sample sequences of the observation signals X_1(t) to X_K(t) and of X_0(t) stored in memory into short-time frames and applying the DFT to each frame. At least one frame of samples must therefore be accumulated in advance. For real-time speech recognition, the system voice in a new sample sequence can be suppressed using adaptive filters B_10(f) to B_K0(f) whose coefficients were updated from the sample sequence of one or several frames earlier.

Next, a configuration example of the user voice enhancement filter unit 61 and the ICA unit 62 is described with reference to FIG. 3. The filter unit 61 shown in FIG. 3 includes 2K adaptive filters V_11(f), V_21(f) to V_1K(f), V_2K(f). These filters are arranged in correspondence with the K output signals Y_1(f, t) to Y_K(f, t) from the filter unit 51.

For example, the adaptive filters V_11(f) and V_21(f) receive the signal Y_1(f, t) and change its frequency characteristics. The adaptive filters V_1K(f) and V_2K(f) change the frequency characteristics of the signal Y_K(f, t).

The adder 611 sums the K signals output from the filters V_11(f) to V_1K(f) to generate the separated signal Z_1(f, t). The adder 612 sums the K signals output from the filters V_21(f) to V_2K(f) to generate the separated signal Z_2(f, t). The separated signal Z_1(f, t) is the estimate of the user voice, and the separated signal Z_2(f, t) is the estimate of the background noise.

The filtering performed by the filter unit 61 shown in FIG. 3 can be expressed by equation (4).
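Equation (4) is not reproduced in this text. From the description of the filters V_11(f) to V_2K(f) and the adders 611 and 612, it would take the following form (a reconstruction from that description, not the original expression):

```latex
Z_i(f,t) = \sum_{j=1}^{K} V_{ij}(f)\, Y_j(f,t), \qquad i = 1, 2
```

Equivalently, Z(f, t) = V(f) Y(f, t) with a 2 x K matrix V(f), which is an instantaneous (non-convolutive) mixture model in each frequency bin.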

The ICA unit 62 updates the filter coefficients of the 2K adaptive filters V_11(f), V_21(f) to V_1K(f), V_2K(f) according to the ICA algorithm so that the separated signals Z_1(f, t) and Z_2(f, t) become mutually independent. As an example, equation (5) shows a filter coefficient update formula that applies the Infomax method. The function φ(Y) in equation (5) may be approximated by a sigmoid function.
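Equation (5) is not reproduced in this text. For orientation only, a commonly used natural-gradient form of the Infomax update is shown below, written with V(f) as the 2 x K matrix of the filters V_11(f) to V_2K(f) and Z(f, t) = [Z_1(f, t), Z_2(f, t)]^T. This is an assumption about the general shape of such an update; the original equation may differ, for example in how the diagonal terms are treated.

```latex
V^{[i+1]}(f) = V^{[i]}(f)
  + \alpha \left[ I - \left\langle \varphi\big(Z(f,t)\big)\, Z^{\mathsf{H}}(f,t) \right\rangle_t \right] V^{[i]}(f)
```

Here I is the 2 x 2 identity matrix, the superscript H denotes the conjugate transpose, and the update stops changing V(f) once the time-averaged cross terms between φ(Z_1) and Z_2 (and vice versa) vanish.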

As shown in FIGS. 2 and 3, performing "system voice suppression" and "user voice enhancement" with frequency-domain ICA reduces the amount of computation. The ICA-based acoustic echo canceller disclosed in Patent Document 2 trains its adaptive filter with time-domain ICA, so that method must handle a mixing problem that includes time delays, that is, convolutive mixing. In contrast, the specific example described with reference to FIGS. 2 and 3 uses frequency-domain ICA, so it only needs to solve an instantaneous mixing problem by applying ICA to each frequency bin. Compared with the method of Patent Document 2, which uses time-domain ICA, the amount of computation can be reduced significantly and the convergence speed of the adaptive filters can be improved.

The filter processing by the filter units 51 and 61 and the filter coefficient update processing by the ICA units 52 and 62 may be implemented with a semiconductor processing device such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The filter processing and the filter coefficient update processing may also be implemented by causing a computer system that includes a DSP (Digital Signal Processor), a microprocessor, or the like to execute a program.

A program containing instructions that cause a computer system to perform the filter processing and the filter coefficient update processing can be stored in various types of storage media and can be transmitted via a communication medium. The storage media include, for example, a flexible disk, a hard disk, a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD, a ROM cartridge, a battery-backed RAM memory cartridge, a flash memory cartridge, and a nonvolatile RAM cartridge. The communication media include wired communication media such as telephone lines, wireless communication media such as microwave links, and the Internet.

The present invention is not limited to the embodiment described above, and various modifications are of course possible without departing from the gist of the invention already described.

100 voice dialogue system
1 microphone array
11 to 1K microphone elements
30, 31 to 3K AD converters (ADC)
4 DA converter (DAC)
51 system voice suppression filter unit
52 ICA unit
61 user voice enhancement filter unit
62 ICA unit
71 dialogue management unit
72 speech synthesis unit
511 to 51K adders
521 to 52K ICA units
611, 612 adders
B_10 to B_K0 adaptive filters
V_11(f), V_21(f) to V_1K(f), V_2K(f) adaptive filters

Claims

1. A barge-in free voice dialogue system capable of recognizing user voice while system voice is being output from a speaker, the system comprising:
a first filter unit that filters a group of observation signals from a microphone array and generates a filtered first signal group;
a first ICA unit that optimizes an adaptive filter group included in the first filter unit, based on an independent component analysis algorithm, so that a sound source signal supplied to the speaker for outputting the system voice and each signal of the first signal group become mutually independent;
a second filter unit that filters the first signal group and generates a filtered second signal group;
a second ICA unit that optimizes an adaptive filter group included in the second filter unit, based on an independent component analysis algorithm, so that the signals of the second signal group become mutually independent; and
a dialogue management unit that performs dialogue management including recognition processing of the user voice based on a signal included in the second signal group and generation processing of new system voice based on a result of the recognition processing.

2. The barge-in free voice dialogue system according to claim 1, wherein the first and second ICA units optimize the adaptive filter groups by independent component analysis in a frequency domain.

3. A voice dialogue method comprising:
a first filter step of filtering a group of observation signals from a microphone array to generate a filtered first signal group;
a first ICA step of optimizing an adaptive filter group used in the first filter step, based on an independent component analysis algorithm, so that a sound source signal supplied to a speaker for outputting system voice and each signal of the first signal group become mutually independent;
a second filter step of filtering the first signal group to generate a filtered second signal group;
a second ICA step of optimizing an adaptive filter group used in the second filter step, based on an independent component analysis algorithm, so that the signals of the second signal group become mutually independent; and
a dialogue management step of performing dialogue management including recognition processing of user voice based on a signal included in the second signal group and generation processing of new system voice based on a result of the recognition processing.

4. The method according to claim 3, wherein the adaptive filter groups used in the first and second filter steps are optimized by independent component analysis in a frequency domain.

5. A program for causing a computer to execute information processing related to a barge-in free voice dialogue system capable of recognizing user voice while system voice is being output from a speaker, the information processing comprising:
a first filter step of filtering a group of observation signals from a microphone array to generate a filtered first signal group;
a first ICA step of optimizing an adaptive filter group used in the first filter step, based on an independent component analysis algorithm, so that a sound source signal supplied to the speaker for outputting the system voice and each signal of the first signal group become mutually independent;
a second filter step of filtering the first signal group to generate a filtered second signal group;
a second ICA step of optimizing an adaptive filter group used in the second filter step, based on an independent component analysis algorithm, so that the signals of the second signal group become mutually independent; and
a dialogue management step of performing dialogue management including recognition processing of the user voice based on a signal included in the second signal group and generation processing of new system voice based on a result of the recognition processing.

6. The program according to claim 5, wherein the adaptive filter groups used in the first and second filter steps are optimized by independent component analysis in a frequency domain.