JP2005031524A - Speech signal extracting method and speech recognition device

Info

Publication number: JP2005031524A
Application number: JP2003272569A
Authority: JP
Inventors: Shinichi Tamura; 震一田村
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2003-07-09
Filing date: 2003-07-09
Publication date: 2005-02-03
Anticipated expiration: 2023-07-09
Also published as: JP4107192B2

Abstract

PROBLEM TO BE SOLVED: To provide a method of extracting a speech signal from a single input signal in which the speech is mixed with other signals such as noise.

SOLUTION: A signal storage unit 11 stores a finite number of past values of a single input signal. A multiple filter determination unit 12 determines a plurality of filters by independent component analysis based on those stored values, and applies the filters to the stored values to obtain a plurality of filter outputs. A filter output selection unit 13 selects, from the plurality of filter outputs, the filter outputs corresponding to the speech signal component. A speech signal synthesis unit 14 synthesizes a speech signal from the selected filter outputs.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a speech signal extraction method that, in an environment where noise is present, extracts only the speech signal from a single signal containing both a speech signal component and a noise component, and to a speech recognition apparatus that uses this method.

One signal extraction approach that uses multiple input signals from multiple microphones is based on independent component analysis, as described for example in Non-Patent Document 1. This approach exploits the fact that speech and noise are statistically independent, and extracts the speech signal by independent component analysis.

Non-Patent Document 1: T-W. Lee, A.J. Bell and R. Orglmeister, “Blind Source Separation of Real World Signals”, Proceedings of IEEE International Conference Neural Networks, June 97, Houston, pp 2129-2135.

Speech signal extraction using independent component analysis is expected to achieve high extraction accuracy. However, the prior art has the following problems. First, it requires multiple input signals: specifically, "number of noise components + 1 (the speech signal to be extracted)" input signals. Because the number of noise components changes from moment to moment, this requirement is unrealistic. Second, processing multiple input signals makes the hardware more complex.

The present invention has been made in view of the above points. Its object is to provide a speech signal extraction method capable of extracting a speech signal from a single input signal containing a speech signal component and a noise component, and a speech recognition apparatus using that method.

To achieve this object, the speech signal extraction method according to claim 1 exploits the statistical independence of the speech signal from the other (noise) signals, and uses a plurality of filters to decompose a single input signal into signal components (filter outputs) that are statistically independent of one another. Because speech and noise can be regarded as statistically independent, none of the decomposed signal components contains a mixture of speech and noise. The speech components are therefore selected, and the speech signal is obtained from the selected components.

The plurality of filters can be determined using independent component analysis, as set out in claim 2. This makes the outputs of the determined filters statistically independent. As specific examples of the filters, digital FIR (Finite Impulse Response) filters can be used as set out in claim 3, or digital IIR (Infinite Impulse Response) filters as set out in claim 4.

The speech signal extraction method according to claim 5 is characterized in that N (N ≥ 1) filter outputs are selected for acquiring the speech signal, in order of how far each filter output departs from a Gaussian distribution. Real-world noise generally has an amplitude distribution close to Gaussian, whereas a speech signal has an amplitude distribution far from Gaussian. The filter outputs corresponding to the speech signal can therefore be selected by picking a specific number of signal components (filter outputs), starting from the one whose distribution is farthest from Gaussian. Alternatively, independently of the Gaussian distribution, the filter outputs corresponding to the speech signal can be selected on the basis of a speech feature quantity of each filter output, as set out in claim 6. When more than one filter output is selected in this way, the speech signal can be synthesized by summing the selected filter outputs, as set out in claim 7.

Claims 8 to 14 describe a speech recognition apparatus that uses the speech signal extraction method described above. That is, the apparatus includes a speech recognition unit that recognizes the speech signal acquired by that method. Using a speech signal acquired in this way for speech recognition can improve recognition accuracy.

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus 20 according to this embodiment. The speech recognition apparatus 20 extracts a speech signal using the speech signal extraction method described below and recognizes the extracted speech signal.

In FIG. 1, reference numeral 10 denotes a microphone, which generates a signal containing ambient noise together with speech. Reference numeral 11 denotes a signal storage unit that takes the signal generated by the microphone 10 as its input signal and stores the input signal from the current time back to one second earlier. The input signal is sampled at a sampling frequency of 10000 Hz, converted into a digital signal, and stored in the signal storage unit 11. The signal storage unit 11 therefore holds 10000 digital samples as the input signal from the current time back to one second earlier.
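The one-second store described above behaves like a fixed-length ring buffer. The following is a minimal Python sketch under that reading; the class name `SignalStore` and its methods are hypothetical, since the description specifies only the 10000 Hz sampling frequency and the one-second window.

```python
from collections import deque

SAMPLE_RATE_HZ = 10000   # sampling frequency given in the description
WINDOW_SECONDS = 1       # the store covers the most recent one second


class SignalStore:
    """Keeps the most recent window of samples (hypothetical helper)."""

    def __init__(self, rate=SAMPLE_RATE_HZ, seconds=WINDOW_SECONDS):
        self.capacity = rate * seconds
        # A deque with maxlen discards the oldest sample once it is full.
        self._buf = deque(maxlen=self.capacity)

    def push(self, sample):
        """Store one newly digitized sample."""
        self._buf.append(sample)

    def is_full(self):
        return len(self._buf) == self.capacity

    def snapshot(self):
        """Return the stored samples, oldest first, as the array mm(u)."""
        return list(self._buf)
```

With the default arguments the store holds 10000 samples, matching the count stated in the description.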

These 10000 digital samples are sent to the multiple filter determination unit 12 once every second. The multiple filter determination unit 12 determines the coefficients of three FIR (Finite Impulse Response) filters of length 3 and, using those coefficients, computes three filter outputs with the 10000 digital samples from the signal storage unit 11 as input. The multiple filter determination unit 12 sends the three computed FIR filter outputs and the FIR filter coefficients to the filter output selection unit 13. The filter output selection unit 13 selects the two filter outputs whose distributions are farthest from the Gaussian distribution, on the basis of their dissimilarity to the Gaussian distribution that characterizes noise components.
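As an illustration of the filtering step, the sketch below applies three length-3 FIR filters to one block of samples in Python. The function name `fir_bank` and the coefficient values are illustrative assumptions, not values from the patent. Each output needs two past samples, so outputs start at index 2 of the block.

```python
def fir_bank(mm, W):
    """Apply each row of W as an FIR filter to the sample list mm.

    Row i holds the taps of filter i, so that
    y_i(u) = sum_j W[i][j] * mm[u - j].
    Outputs start at the first index where every tap has a valid sample.
    """
    taps = len(W[0])
    outputs = []
    for row in W:
        y = [sum(row[j] * mm[u - j] for j in range(taps))
             for u in range(taps - 1, len(mm))]
        outputs.append(y)
    return outputs


# Illustrative coefficients only (not taken from the patent).
W = [[1.0, 0.0, 0.0],     # passes the current sample through
     [1.0, -1.0, 0.0],    # first difference
     [1.0, -2.0, 1.0]]    # second difference
mm = [0.0, 1.0, 4.0, 9.0, 16.0]
y0, y1, y2 = fir_bank(mm, W)
```

For a 10000-sample block this yields three output sequences of 9998 values each.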

The speech signal synthesis unit 14 receives the two filter outputs selected by the filter output selection unit 13, together with two coefficients that specify their amplitudes, and synthesizes a speech signal. In this way, extracted speech signals one second long are synthesized one after another.

In other words, the speech signal extraction method, which extracts a speech signal from a single input signal containing a speech signal component and a noise component, is realized by the entire processing applied to the input signal from the signal storage unit 11 through the speech signal synthesis unit 14.

The speech recognition unit 15 receives the speech signal synthesized by the speech signal synthesis unit 14 and performs speech recognition on it. The recognition result of the speech recognition unit 15 is output to a processing unit that uses the recognized speech.

The processing of each block is described in detail below. In the following equations and in FIG. 2, the symbol N denotes 10000.

The signal storage unit 11 creates an array mm(u) (u = 0, 1, ..., 10000-1) from the input signal. From this array mm(u), the multiple filter determination unit 12 creates the signal vector x(u) given by Equation 1.

Next, with the coefficients of the three FIR filters denoted W_ij (i: filter number, i = 0, 1, 2; j = 0, 1, 2), the matrix W is formed as shown in Equation 2.

Here, let y(u) be the vector whose elements are the three filter outputs y₀(u), y₁(u), y₂(u) obtained when mm(u) is input. Then y(u) is expressed by Equation 3.
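The equation images for Equations 1 to 3 are not reproduced in this text. A form consistent with the surrounding description (three FIR filters of length 3, a first element x₀(u) equal to the input signal itself, and an index u starting at 2) would be the following. The ordering of the delayed samples within x(u) is an assumption.

```latex
\begin{align}
x(u) &= \begin{pmatrix} mm(u) \\ mm(u-1) \\ mm(u-2) \end{pmatrix},
        \qquad u = 2, 3, \ldots, N-1 \tag{1} \\
W &= \begin{pmatrix}
       W_{00} & W_{01} & W_{02} \\
       W_{10} & W_{11} & W_{12} \\
       W_{20} & W_{21} & W_{22}
     \end{pmatrix} \tag{2} \\
y(u) &= \begin{pmatrix} y_0(u) \\ y_1(u) \\ y_2(u) \end{pmatrix} = W\,x(u) \tag{3}
\end{align}
```

Under this reading, row i of W holds the three taps of filter i, so that y_i(u) = W_i0 mm(u) + W_i1 mm(u-1) + W_i2 mm(u-2).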

Determining the filter coefficients so that the outputs are statistically independent of one another is equivalent to determining the matrix W so that the elements of the vector y(u) (u = 2, 3, ..., 10000-1) in Equation 3 are statistically independent of one another. W is determined using a standard independent component analysis method such as the Infomax algorithm (see Bell A.J. and Sejnowski T.J. 1995. “An information maximisation approach to blind separation and blind deconvolution”, Neural Computation, 7, 6, pp.1129-1159). FIG. 2 shows an example of a procedure for determining the matrix W using the Infomax algorithm. The filter outputs y₀(u), y₁(u), y₂(u) are then computed from the matrix W determined in this way.
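FIG. 2 is not reproduced in this text, so the following Python code is only a rough sketch of how W might be estimated. It uses the natural-gradient form of the Infomax update with a tanh nonlinearity, W ← W + η(I − E[tanh(y)yᵀ])W, which is a common variant and not necessarily the procedure of FIG. 2. The function names, the learning rate, the iteration count, and the normalization of the input block are all assumptions.

```python
import math


def infomax_step(X, W, lr=0.1):
    """One batch update: W <- W + lr * (I - mean(tanh(y) y^T)) W."""
    n = len(W)
    C = [[0.0] * n for _ in range(n)]
    for x in X:
        y = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        t = [math.tanh(v) for v in y]
        for i in range(n):
            for j in range(n):
                C[i][j] += t[i] * y[j]
    m = float(len(X))
    # G = I - mean(tanh(y) y^T); it shrinks toward zero as outputs decouple.
    G = [[(1.0 if i == j else 0.0) - C[i][j] / m for j in range(n)]
         for i in range(n)]
    GW = [[sum(G[i][k] * W[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[W[i][j] + lr * GW[i][j] for j in range(n)] for i in range(n)]


def fit_filters(mm, taps=3, iterations=30, lr=0.1):
    """Estimate a taps x taps matrix W from one block of samples mm.

    The block is normalized to zero mean and unit variance first, so the
    returned W applies to the normalized signal (an assumption of this sketch).
    """
    mean = sum(mm) / len(mm)
    std = math.sqrt(sum((v - mean) ** 2 for v in mm) / len(mm)) or 1.0
    z = [(v - mean) / std for v in mm]
    # Delay-embedded vectors x(u) = [z(u), z(u-1), ..., z(u-taps+1)].
    X = [[z[u - j] for j in range(taps)] for u in range(taps - 1, len(z))]
    W = [[1.0 if i == j else 0.0 for j in range(taps)] for i in range(taps)]
    for _ in range(iterations):
        W = infomax_step(X, W, lr)
    return W
```

A fixed iteration count is used here for simplicity; a practical implementation would more likely stop when the change in W falls below a tolerance.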

The filter output selection unit 13 selects, from the filter outputs y₀(u), y₁(u), y₂(u) obtained by the multiple filter determination unit 12, the filter outputs to be used for speech synthesis. First, the filter outputs y₀(u), y₁(u), y₂(u) are normalized to zero mean and unit variance. Then an index g_i (i = 0, 1, 2) representing how far each filter output departs from a Gaussian distribution is computed by Equation 4 (see A. Hyvarinen. “New Approximations of Differential Entropy for Independent Component Analysis and Projection Pursuit”, In Advances in Neural Information Processing Systems 10 (NIPS*97), pp. 273-279, MIT Press, 1998).

The index g_i takes a positive value; the larger the value, the farther the distribution is from Gaussian. Of the three filter outputs, the filter output selection unit 13 sends the two with the largest and second-largest values of g_i to the speech signal synthesis unit 14.
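Equation 4 itself is not reproduced in this text. As a hedged illustration, the Python sketch below scores each output with one of the negentropy approximations associated with the cited Hyvarinen paper, using the constants k1 = 36/(8√3 − 9) and k2 = 24/(16√3 − 27) as commonly quoted for it. Whether Equation 4 uses this exact form is an assumption, and the function names are hypothetical. Both terms vanish for a standard Gaussian, so the score is near zero for Gaussian-like outputs.

```python
import math

K1 = 36.0 / (8.0 * math.sqrt(3.0) - 9.0)
K2 = 24.0 / (16.0 * math.sqrt(3.0) - 27.0)


def normalize(y):
    """Scale a sequence to zero mean and unit variance."""
    mean = sum(y) / len(y)
    std = math.sqrt(sum((v - mean) ** 2 for v in y) / len(y)) or 1.0
    return [(v - mean) / std for v in y]


def non_gaussianity(y):
    """Non-negative index g: larger means farther from Gaussian."""
    z = normalize(y)
    n = len(z)
    odd = sum(v * math.exp(-v * v / 2.0) for v in z) / n    # 0 for a Gaussian
    even = sum(math.exp(-v * v / 2.0) for v in z) / n       # 1/sqrt(2) for a Gaussian
    return K1 * odd ** 2 + K2 * (even - math.sqrt(0.5)) ** 2


def select_outputs(outputs, count=2):
    """Return the indices of the `count` highest-scoring outputs, and all scores."""
    scores = [non_gaussianity(y) for y in outputs]
    order = sorted(range(len(outputs)), key=lambda i: scores[i], reverse=True)
    return order[:count], scores
```

Calling `select_outputs([y0, y1, y2], count=2)` corresponds to the embodiment, where two of the three filter outputs are passed on to synthesis.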

The speech signal synthesis unit 14 synthesizes speech using the filter outputs selected by the filter output selection unit 13 and the matrix W obtained by the multiple filter determination unit 12. Let the selected filter outputs be y₀(u) and y₁(u). Furthermore, letting A be the inverse of the matrix W, the signal vector x(u) is given by Equation 5.

Focusing on the first element of the signal vector x(u) in Equation 5, Equation 6 holds.

Since x₀(u) in Equation 6 is the original input signal itself, the input signal has been decomposed into the sum of three terms: A₀₀y₀(u), A₀₁y₁(u), and A₀₂y₂(u). The speech signal is therefore synthesized by summing A₀₀y₀(u) and A₀₁y₁(u), as shown in Equation 7.
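The synthesis step can be sketched in Python as follows. With A the inverse of W, the first row of x(u) = A y(u) reads x₀(u) = A₀₀y₀(u) + A₀₁y₁(u) + A₀₂y₂(u), and the extracted signal keeps only the terms of the selected outputs. The function names `invert3` and `synthesize` are hypothetical.

```python
def invert3(W):
    """Inverse of a 3x3 matrix via the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = W
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("W is singular and cannot be inverted")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


def synthesize(outputs, W, selected):
    """Sum A[0][k] * y_k(u) over the selected filter indices k.

    `outputs` is [y0, y1, y2]; `selected` lists the indices judged to be speech.
    """
    A = invert3(W)
    length = len(outputs[0])
    return [sum(A[0][k] * outputs[k][u] for k in selected)
            for u in range(length)]
```

Selecting all three indices reproduces the input signal x₀(u), which is a convenient sanity check; selecting two of them drops the component judged to be noise.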

Through the processing of the blocks described above, extracted speech signals one second long are synthesized one after another and sent to the speech recognition unit 15. Thus, the speech signal extraction method of this embodiment makes it possible to extract a speech signal from a single input signal by independent component analysis using a plurality of filters.

The present invention is not limited to the embodiment described above and can be implemented with various modifications without departing from the spirit of the invention.

For example, in the above embodiment the filter output selection unit 13 computes how far each filter output departs from a Gaussian distribution and selects the filter outputs with the largest departure. Alternatively, a feature quantity representing how speech-like each filter output y₀(u), y₁(u), y₂(u) is may be computed, and the filter outputs that appear most speech-like may be selected.

In the above embodiment the filter output selection unit 13 selects two filter outputs, but it may select only one. When there are four or more filter outputs, three or more may be selected.

Furthermore, in the above embodiment the multiple filter determination unit 12 uses digital FIR filters, but other filters, for example digital IIR (Infinite Impulse Response) filters, may be used instead.

FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing an example of a procedure for determining the matrix W using the Infomax algorithm.

Explanation of symbols

10 Microphone
11 Signal storage unit
12 Multiple filter determination unit
13 Filter output selection unit
14 Speech signal synthesis unit
15 Speech recognition unit
20 Speech recognition apparatus

Claims

1. A speech signal extraction method comprising:
a step of continuously storing a single signal, containing a speech signal component and a noise component, from the current time back to a finite past time T;
a step of determining a plurality of filters such that, when the stored signal is input, their outputs are statistically independent of one another;
a step of selecting, from the outputs of the plurality of filters, a filter output corresponding to the speech signal component; and
a step of acquiring a speech signal from the selected filter output.

2. The speech signal extraction method according to claim 1, wherein the step of determining the plurality of filters determines the plurality of filters using independent component analysis.

3. The speech signal extraction method according to claim 1, wherein the step of determining the plurality of filters determines digital FIR (Finite Impulse Response) filters as the plurality of filters.

4. The speech signal extraction method according to claim 1, wherein the step of determining the plurality of filters determines digital IIR (Infinite Impulse Response) filters as the plurality of filters.

5. The speech signal extraction method according to claim 1, wherein the step of selecting the filter output selects N (N ≥ 1) filter outputs in order of how far the filter outputs depart from a Gaussian distribution.

6. The speech signal extraction method according to claim 1, wherein the step of selecting the filter output selects N (N ≥ 1) filter outputs according to a speech feature quantity of each filter output.

7. The speech signal extraction method according to claim 5 or 6, wherein the step of selecting the filter output selects a plurality of filter outputs, and the step of acquiring the speech signal synthesizes the speech signal by taking the sum of the plurality of filter outputs.

8. A speech recognition apparatus comprising:
a signal input unit that receives a single signal containing a speech signal component and a noise component;
a signal storage unit that continuously stores the single input signal from the signal input unit from the current time back to a finite past time T;
a multiple filter determination unit that determines a plurality of filters such that, when the signal stored in the signal storage unit is input, their outputs are statistically independent of one another;
a filter output selection unit that selects, from the outputs of the plurality of filters determined by the multiple filter determination unit, a filter output corresponding to the speech signal component;
a speech signal acquisition unit that acquires a speech signal from the filter output selected by the filter output selection unit; and
a speech recognition unit that recognizes the speech signal acquired by the speech signal acquisition unit.

9. The speech recognition apparatus according to claim 8, wherein the multiple filter determination unit determines the plurality of filters using independent component analysis.

10. The speech recognition apparatus according to claim 8, wherein the filters determined by the multiple filter determination unit are digital FIR (Finite Impulse Response) filters.

11. The speech recognition apparatus according to claim 8, wherein the filters determined by the multiple filter determination unit are digital IIR (Infinite Impulse Response) filters.

12. The speech recognition apparatus according to claim 8, wherein the filter output selection unit selects N (N ≥ 1) filter outputs in order of how far the filter outputs depart from a Gaussian distribution.

13. The speech recognition apparatus according to claim 8, wherein the filter output selection unit selects N (N ≥ 1) filter outputs according to a speech feature quantity of each filter output.

14. The speech recognition apparatus according to claim 12, wherein the filter output selection unit selects a plurality of filter outputs, and the speech signal acquisition unit synthesizes a speech signal from the plurality of filter outputs.