JP2011107603A

JP2011107603A - Speech recognition device, speech recognition method and program

Info

Publication number: JP2011107603A
Application number: JP2009265076A
Authority: JP
Inventors: Satoshi Asakawa (朝川 智); Atsuo Hiroe (廣江 厚夫); Hiroaki Ogawa (小川 浩明); Hitoshi Honda (本田 等); Tsutomu Sawada (澤田 務)
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2009-11-20
Filing date: 2009-11-20
Publication date: 2011-06-02
Also published as: CN102074230A; CN102074230B; US20110125496A1

Abstract

PROBLEM TO BE SOLVED: To provide a device and a method that perform sound source separation and speech recognition on a mixed signal from a plurality of sound sources and efficiently obtain the necessary recognition result.

SOLUTION: Separation signals are generated by applying independent component analysis (ICA) to an observation signal consisting of a mixed signal in which the outputs of a plurality of sound sources are mixed, and speech recognition processing is performed on each separation signal. Additional information is then generated as evaluation information for each speech recognition result. The additional information comprises a recognition reliability of the speech recognition result and an in-task utterance degree, which indicates whether the speech recognition result relates to the task assumed by the speech recognition device. A score is calculated for the speech recognition result of each channel by applying the additional information, and the recognition result with a high score is selected and output.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a speech recognition device, a speech recognition method, and a program. More specifically, it relates to a speech recognition device, a speech recognition method, and a program that separate a mixed signal of a plurality of speech signals using independent component analysis (ICA) and perform speech recognition.

Independent component analysis (ICA) is known as a process for separating a mixed signal of a plurality of speech signals. Applying speech recognition to the separation result obtained by ICA means that recognition is performed after the target sound has been separated from other sounds, which makes it possible to recognize speech from the target sound source with high accuracy.

Several systems that combine ICA-based sound source separation with speech recognition already exist. Conventional systems, however, select a target channel (sound source) from the plurality of output channels that ICA produces, one per sound source, and feed only that channel to speech recognition.

First, an outline of independent component analysis (ICA) is given as background art of the present invention.
ICA is a type of multivariate analysis that separates multidimensional signals by using the statistical properties of the signals. For details of ICA itself, see, for example, Non-Patent Document 1 ["Introduction to Independent Component Analysis" (Noboru Murata, Tokyo Denki University Press)].

The following describes ICA of sound signals, in particular ICA in the time-frequency domain.
As shown in FIG. 1, consider a situation in which N sound sources emit different sounds and these are observed by n microphones. The sound emitted by a source (the original signal) undergoes time delays, reflections, and so on before it reaches a microphone. The signal observed by microphone k (the observation signal) can therefore be expressed, as in Equation [1.1], as the convolution of each original signal with a transfer function, summed over all sound sources. This mixture is referred to below as a "convolutive mixture".
The observation signal of microphone n is denoted x_n(t); the observation signals of microphone 1 and microphone 2 are thus x_1(t) and x_2(t), respectively.
Writing the observation signals of all microphones as a single equation gives Equation [1.2].
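The equation images for [1.1] and [1.2] are not reproduced in this text. As a reading aid only, a form consistent with the surrounding definitions (coefficients a^[l]_kj, matrices A^[l], column vectors x(t) and s(t)) might be written as follows. The upper summation limit L, standing for the length of the transfer function, is an assumption introduced here and does not appear in the text.

```latex
% Hedged reconstruction from the surrounding prose; L is an assumed filter length.
\begin{align}
x_k(t) &= \sum_{j=1}^{N} \sum_{l=0}^{L} a^{[l]}_{kj}\, s_j(t-l) \tag{1.1} \\
\mathbf{x}(t) &= \sum_{l=0}^{L} A^{[l]}\, \mathbf{s}(t-l) \tag{1.2}
\end{align}
```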

Here, x(t) and s(t) are column vectors whose elements are x_k(t) and s_k(t), respectively, and A^[l] is an n × N matrix whose elements are a^[l]_kj. In the following, n = N is assumed.
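The convolutive mixture described above can be illustrated with a small sketch. It is not taken from the patent: the function name `convolutive_mix`, the two-tap filters, and the impulse-like source signals are all hypothetical values chosen so the result is easy to check by hand. Samples before t = 0 are assumed to be silence.

```python
from typing import List


def convolutive_mix(sources: List[List[float]],
                    filters: List[List[List[float]]]) -> List[List[float]]:
    """Return x_k(t) = sum_j sum_l filters[k][j][l] * s_j(t - l) for every microphone k."""
    n_mics = len(filters)
    n_samples = len(sources[0])
    observations = []
    for k in range(n_mics):
        x_k = [0.0] * n_samples
        for j, s_j in enumerate(sources):
            taps = filters[k][j]  # impulse response from source j to microphone k
            for t in range(n_samples):
                for l, a in enumerate(taps):
                    if t - l >= 0:  # samples before t = 0 are treated as silence
                        x_k[t] += a * s_j[t - l]
        observations.append(x_k)
    return observations


# Two sources, two microphones (n = N = 2); all numbers are illustrative.
s = [[1.0, 0.0, 0.0, 0.0],   # source 1: impulse at t = 0
     [0.0, 1.0, 0.0, 0.0]]   # source 2: impulse at t = 1
a = [[[1.0, 0.5], [0.3, 0.0]],   # filters into microphone 1
     [[0.2, 0.0], [1.0, 0.4]]]   # filters into microphone 2
x = convolutive_mix(s, a)
```

Each microphone signal in `x` contains delayed and scaled copies of both sources, which is the situation that the separation processing has to undo.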

A convolutive mixture in the time domain is known to be expressed as an instantaneous mixture in the time-frequency domain, and ICA in the time-frequency domain exploits this property.

For time-frequency domain ICA itself, see Non-Patent Document 2 ["19.2.4. Fourier transform method" in "Detailed Explanation of Independent Component Analysis"] and Patent Document 1 (Japanese Patent Laid-Open No. 2006-238409, "Audio Signal Separation Device, Noise Removal Device, and Method").

Applying a short-time Fourier transform to both sides of Equation [1.2] above yields Equation [2.1].
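The image for Equation [2.1] is likewise not reproduced here. Under the usual assumption that the analysis window is long compared with the transfer functions, a form consistent with the text would be the following, where X(ω, t) and S(ω, t) denote the short-time Fourier transforms of x(t) and s(t), and A(ω) the frequency response of the mixing filters. These symbol names are assumptions made for this sketch.

```latex
% Hedged reconstruction; X, S and A(\omega) are assumed symbol names.
\begin{align}
X(\omega, t) &= A(\omega)\, S(\omega, t) \tag{2.1} \\
X(\omega, t) &= \bigl[ X_1(\omega, t), \dots, X_n(\omega, t) \bigr]^{\mathsf{T}} \notag
\end{align}
```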

In Equation [2.1],
ω is the frequency bin number, and
t is the frame number.

When ω is fixed, this equation can be regarded as an instantaneous mixture (a mixture with no time delay). To separate the observation signals, Equation [2.5] for calculating the separation result Y is therefore prepared, and the separation matrix W(ω) is determined so that the components of the separation result Y(ω, t) are as mutually independent as possible. Separated signals are obtained from the mixed speech signal by this processing.
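As a rough illustration of what the separation matrix does at one fixed frequency bin, the sketch below mixes two complex source spectra with a hypothetical matrix and then recovers them with Y(ω, t) = W(ω) X(ω, t). Note the simplification: W is computed here as the inverse of a known mixing matrix, whereas ICA does not know the mixing matrix and instead estimates W by maximizing the independence of the outputs. All values and helper names are assumptions.

```python
from typing import List

Matrix = List[List[complex]]


def matvec(m: Matrix, v: List[complex]) -> List[complex]:
    """Multiply a matrix by a column vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def inverse_2x2(m: Matrix) -> Matrix:
    """Closed-form inverse of a 2 x 2 complex matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical mixing matrix for one fixed frequency bin.
A_w: Matrix = [[1 + 0j, 0.5 - 0.2j],
               [0.3 + 0.1j, 1 + 0j]]
# Source spectra for three frames t = 0, 1, 2 (illustrative values).
S_frames = [[1 + 1j, 0j], [0j, 2 - 1j], [0.5j, 0.5 + 0j]]

# Observation per frame: an instantaneous mixture, with no time delay.
X_frames = [matvec(A_w, s) for s in S_frames]

# Separation per frame. Here W is the inverse of the known mixing matrix;
# real ICA would estimate W from the observations alone.
W_w = inverse_2x2(A_w)
Y_frames = [matvec(W_w, x) for x in X_frames]
```

With a W estimated by ICA, the outputs would match the sources only up to ordering and scaling, which leads to the channel selection problem discussed later in the text.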

By inputting the separated signals obtained by ICA into a speech recognition system, a recognition result corresponding to each sound source can be obtained with high accuracy. FIG. 2 shows an example of a typical system that combines ICA-based sound source separation with a speech recognition unit.

Sound is collected by a plurality of microphones 101-1 to 101-N, and the input waveforms corresponding to the sound signals acquired by the microphones 101-1 to 101-N are sent to the sound source separation unit 102. The sound source separation unit 102 uses the ICA described above to separate the mixed sound of the plurality of sound sources into individual sounds, one per source. When the channel selection unit 103 selects a channel on the basis of sound source direction, the sound source separation unit 102 also estimates the sound source directions at the same time.

The sound source separation unit 102 outputs separated waveforms, each representing the sound signal of an individual sound source, together with sound source direction information, and these are input to the channel selection unit 103.
The channel selection unit 103 selects, from the separated waveforms input from the sound source separation unit 102, the channel that contains the target sound. The selection is made, for example, by user designation. The single selected separated waveform is output to the speech recognition unit 104.

The speech recognition unit 104 performs speech recognition on the separated waveform input from the channel selection unit 103, which represents the sound signal of one sound source, and outputs the speech recognition result for that specific sound source (the target sound).

A system that combines ICA-based sound source separation with speech recognition obtains the recognition result of the target sound source through the processing above. Such a system, however, has problems relating to the indeterminacy of the ICA output and to the channel selection that picks out the target speech. These problems are described below.

First, the indeterminacy of the ICA output and the channel selection methods for selecting the target speech are described.

(Indeterminacy of the ICA output)
In ICA, it is indeterminate which output channel carries the separated sound of each original sound source, so the channel containing the target sound must be selected in some way. The indeterminacy of the ICA output is described, for example, in Patent Document 2 (Japanese Patent Laid-Open No. 2009-53088).

(Channel selection methods for selecting the target speech)
When the ICA output is passed to a subsequent processing stage, it is necessary to determine which channel carries the separated sound of each original sound source. For example, when speech recognition is the subsequent stage, it is necessary to determine which channel carries the speech to be recognized.
In ICA, N microphones, for example, give N input channels, and N channels of separation results are output. The number of sound sources, however, varies. When the number of sound sources is smaller than the number of input channels, the output consists of channels that correspond to sound sources (sound source channels) and channels in which a reverberation-like sound that corresponds to no sound source is observed (reverberation channels).

When ICA is combined with speech recognition, the ICA output channels can be classified as follows.
(1) Sound source channels, which correspond to actual sound sources
(2) Reverberation channels, which correspond to no sound source

Furthermore, (1) sound source channels can be subdivided as follows.
(1-1) Speech channels
(1-1-1) Utterance channels whose content is assumed as input by the speech recognition system (in-task utterances)
(1-1-2) Utterance channels whose content is not assumed as input by the speech recognition system (out-of-task utterances)
(1-2) Non-speech channels (including, for example, chatter between people that is not input to the system)
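The channel classification above can be summarized as a small decision tree. The sketch below is only one way to encode it; the enum name `ChannelKind`, the function `classify`, and its three boolean inputs are hypothetical, and the text does not say that a system would obtain these flags directly.

```python
from enum import Enum


class ChannelKind(Enum):
    IN_TASK_UTTERANCE = "1-1-1"      # speech the recognizer is designed to accept
    OUT_OF_TASK_UTTERANCE = "1-1-2"  # speech outside the assumed task
    NON_SPEECH_SOURCE = "1-2"        # a real source that is not speech
    REVERBERATION = "2"              # corresponds to no sound source


def classify(has_source: bool, is_speech: bool, in_task: bool) -> ChannelKind:
    """Map three yes/no properties of an ICA output channel to its class."""
    if not has_source:
        return ChannelKind.REVERBERATION
    if not is_speech:
        return ChannelKind.NON_SPEECH_SOURCE
    if in_task:
        return ChannelKind.IN_TASK_UTTERANCE
    return ChannelKind.OUT_OF_TASK_UTTERANCE


target = classify(has_source=True, is_speech=True, in_task=True)
```

Only channels of the kind held in `target` carry what the recognizer is meant to output.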

For a system that performs speech recognition on the result of ICA-based sound source separation, what matters within the above classification is recognizing the speech of the following channel:
(1-1-1) Utterance channels whose content is assumed as input by the speech recognition system (in-task utterances)

Methods for selecting the channel that corresponds to such a target sound source include, for example, the following.
(a) Selection based on power (volume).
Whether each channel is a sound source channel or a reverberation channel is determined from the power of its output, and the channel with the maximum power is selected.
(b) Estimating the sound source direction and selecting the source closest to the front.
ICA is performed while the direction of arrival of each sound source is estimated at the same time, and the channel that outputs the source closest to the frontal direction is selected as the target sound.
(c) Selection by speech/non-speech discrimination and comparison with past data.
For example, it is determined whether the sound of each channel is a human speech signal, and each channel judged to carry human speech is compared with stored past frequency features to identify the speech of a specific person. This method is described, for example, in Patent Document 3 (Japanese Patent Laid-Open No. 2007-279517).
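Methods (a) and (b) above are simple enough to sketch in a few lines. The code below is a minimal illustration, assuming that power is the mean squared amplitude of a channel and that the frontal direction is 0 degrees; the function names and sample values are hypothetical.

```python
from typing import List, Sequence


def mean_power(waveform: Sequence[float]) -> float:
    """Mean squared amplitude of one channel."""
    return sum(v * v for v in waveform) / len(waveform)


def select_by_power(channels: List[Sequence[float]]) -> int:
    """Method (a): index of the channel with the largest power."""
    powers = [mean_power(ch) for ch in channels]
    return max(range(len(channels)), key=powers.__getitem__)


def select_by_direction(directions_deg: Sequence[float]) -> int:
    """Method (b): index of the channel whose source is closest to the front (0 degrees)."""
    return min(range(len(directions_deg)), key=lambda i: abs(directions_deg[i]))


channels = [[0.1, -0.1, 0.1],     # quiet source
            [0.9, -0.8, 0.7],     # loud source, for example music
            [0.0, 0.05, 0.0]]     # near-silent reverberation-like channel
directions = [40.0, -25.0, 5.0]   # estimated arrival directions in degrees

loudest = select_by_power(channels)
most_frontal = select_by_direction(directions)
```

In this made-up scene the two methods pick different channels, and neither looks at what was actually said.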

Summary of problems in conventional systems
For example, in the system shown in FIG. 1 that combines ICA-based sound source separation with speech recognition, the indeterminacy of the ICA output described above exists, and how to select the target speech from the plurality of channels generated by ICA is a problem.

The problems in conventional systems can be summarized as follows.
(A) Problems of applying speech recognition after channel selection
(A1) When only one channel is selected:
If several sounds are present, a sound other than the target sound may be selected.
(A2) When multiple channels are selected:
Multiple speech recognition results are obtained, and a further selection must be made among them.

(B) Problems of the conventional channel selection methods
The problems of the three conventional methods described above are as follows.
(a) Problem of channel selection based on power
With power alone, a sound source other than speech may be selected by mistake. For example, sound source channels can be distinguished from reverberation channels, but speech cannot be distinguished from non-speech.
(b) Problem of estimating the sound source direction and selecting the source closest to the front
The target speech does not necessarily arrive from the front.
(c) Problem of selection by a combination of speech/non-speech discrimination and comparison with past data
Speech/non-speech discrimination cannot determine whether an utterance has the content of the task assumed by the speech recognition system.
Speech signals can be distinguished from other signals, but in-task utterances cannot be distinguished from out-of-task utterances.
As described above, the conventional channel selection methods have various problems.

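Patent Document 1: Japanese Patent Laid-Open No. 2006-238409 (JP 2006-238409 A)
Patent Document 2: Japanese Patent Laid-Open No. 2009-53088 (JP 2009-53088 A)
Patent Document 3: Japanese Patent Laid-Open No. 2007-279517 (JP 2007-279517 A)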

Non-Patent Document 1: 『入門・独立成分分析』 ("Introduction to Independent Component Analysis"), Noboru Murata, Tokyo Denki University Press
Non-Patent Document 2: "19.2.4. Fourier transform method" in 『詳解独立成分分析』 ("Detailed Explanation of Independent Component Analysis")

The present invention has been made in view of these circumstances, and its object is to provide a speech recognition device, a speech recognition method, and a program that separate the signal of each sound source using independent component analysis (ICA) and then perform speech recognition of the target sound.

A first aspect of the present invention is a speech recognition device including:
a sound source separation unit that separates a mixed signal of the outputs of a plurality of sound sources into signals corresponding to the respective sound sources and generates separated signals of a plurality of channels;
a speech recognition unit that receives the separated signals of the plurality of channels generated by the sound source separation unit, performs speech recognition processing, generates a speech recognition result for each channel, and generates additional information serving as evaluation information for the speech recognition result of each channel; and
a channel selection unit that receives the speech recognition results and the additional information, calculates a score for the speech recognition result of each channel by applying the additional information, and selects and outputs a speech recognition result with a high score.
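The ordering in this first aspect, where every channel is recognized first and selection happens afterwards on the recognition results, can be sketched as follows. This is a structural illustration only: `ChannelResult`, `recognize_then_select`, and the toy recognizer with its canned outputs are hypothetical stand-ins, and the sound source separation step is assumed to have already produced `separated_signals`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class ChannelResult:
    channel: int
    text: str                          # speech recognition result
    additional_info: Dict[str, float]  # evaluation information for the result


Recognizer = Callable[[int, Sequence[float]], ChannelResult]
Scorer = Callable[[ChannelResult], float]


def recognize_then_select(separated: List[Sequence[float]],
                          recognize: Recognizer,
                          score: Scorer) -> ChannelResult:
    """Recognize every separated channel, then keep the highest-scoring result."""
    results = [recognize(ch, wave) for ch, wave in enumerate(separated)]
    return max(results, key=score)


# Toy stand-ins, purely illustrative: a real recognizer would decode the waveform.
def toy_recognize(channel: int, wave: Sequence[float]) -> ChannelResult:
    canned = {0: ("", 0.05), 1: ("turn on the light", 0.92), 2: ("la la la", 0.40)}
    text, confidence = canned[channel]
    return ChannelResult(channel, text, {"confidence": confidence})


def toy_score(result: ChannelResult) -> float:
    return result.additional_info["confidence"]


separated_signals = [[0.0, 0.01], [0.3, -0.2], [0.5, 0.4]]
best = recognize_then_select(separated_signals, toy_recognize, toy_score)
```

Because selection uses information produced by recognition itself, the louder channel 2 does not win merely by being loud.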

Furthermore, in one embodiment of the speech recognition device of the present invention, the speech recognition unit calculates a recognition reliability of the speech recognition result as the additional information, and the channel selection unit calculates the score of the speech recognition result of each channel by applying the recognition reliability.

Furthermore, in one embodiment of the speech recognition device of the present invention, the speech recognition unit calculates, as the additional information, an in-task utterance degree indicating whether the speech recognition result relates to the task assumed by the speech recognition device, and the channel selection unit calculates the score of the speech recognition result of each channel by applying the in-task utterance degree.

Furthermore, in one embodiment of the speech recognition device of the present invention, the channel selection unit applies, as score calculation data, at least one of the recognition reliability of the speech recognition result and the in-task utterance degree indicating whether the speech recognition result relates to the task assumed by the speech recognition device, and calculates the score by combining this with at least one of speech power and sound source direction information.
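This passage says that the score combines the additional information with speech power and sound source direction, but it does not say how. One simple possibility, shown here purely as an assumption, is a weighted sum of values normalized to the range 0 to 1. The weights, the normalization of direction into a "frontness" term, and the class name `ChannelEvidence` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChannelEvidence:
    confidence: float     # recognition reliability, assumed to lie in [0, 1]
    in_task: float        # in-task utterance degree, assumed to lie in [0, 1]
    power: float          # speech power, assumed normalized to [0, 1]
    direction_deg: float  # estimated source direction, 0 = front


def score(e: ChannelEvidence,
          w_conf: float = 0.4, w_task: float = 0.4,
          w_power: float = 0.1, w_dir: float = 0.1) -> float:
    """Weighted sum of the four cues (an assumed form, not the patent's formula)."""
    # Map direction to 1.0 at the front and 0.0 at 90 degrees or beyond.
    frontness = 1.0 - min(abs(e.direction_deg), 90.0) / 90.0
    return (w_conf * e.confidence + w_task * e.in_task
            + w_power * e.power + w_dir * frontness)


# A loud television directly in front versus a quieter user command from the side.
television = ChannelEvidence(confidence=0.3, in_task=0.1, power=1.0, direction_deg=0.0)
user_command = ChannelEvidence(confidence=0.9, in_task=0.95, power=0.4, direction_deg=45.0)

tv_score = score(television)
user_score = score(user_command)
```

With these illustrative weights the user command scores higher even though the television is both louder and more frontal, which is the behavior that power-only or direction-only selection cannot provide.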

Furthermore, in one embodiment of the speech recognition device of the present invention, the speech recognition unit is made up of a plurality of speech recognition units equal in number to the channels of the separated signals generated by the sound source separation unit. Each of the plurality of speech recognition units receives the separated signal of one channel generated by the sound source separation unit, and the units perform speech recognition processing in parallel.
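Running one recognizer per channel in parallel can be sketched with the Python standard library. The sketch assumes one worker thread per channel and a non-empty channel list; `recognize_all` and the toy power-threshold recognizer are hypothetical, and a real system might use separate processes or hardware units instead of threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def recognize_all(separated: List[Sequence[float]],
                  recognize: Callable[[Sequence[float]], str]) -> List[str]:
    """Run one recognizer call per channel concurrently; results keep channel order."""
    with ThreadPoolExecutor(max_workers=len(separated)) as pool:
        return list(pool.map(recognize, separated))


# Toy recognizer, illustrative only: labels a channel by its mean power.
def toy_recognize(wave: Sequence[float]) -> str:
    power = sum(v * v for v in wave) / len(wave)
    return "speech-like" if power > 0.01 else "silence"


channels = [[0.0, 0.01, 0.0], [0.4, -0.3, 0.5], [0.2, 0.2, -0.2]]
labels = recognize_all(channels, toy_recognize)
```

`Executor.map` returns results in input order, so entry i of `labels` always belongs to channel i regardless of which worker finishes first.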

Furthermore, a second aspect of the present invention is a speech recognition method executed in a speech recognition device, the method including:
a sound source separation step in which a sound source separation unit separates a mixed signal of the outputs of a plurality of sound sources into signals corresponding to the respective sound sources and generates separated signals of a plurality of channels;
a speech recognition step in which a speech recognition unit receives the separated signals of the plurality of channels generated by the sound source separation unit, performs speech recognition processing, generates a speech recognition result for each channel, and generates additional information serving as evaluation information for the speech recognition result of each channel; and
a channel selection step in which a channel selection unit receives the speech recognition results and the additional information, calculates a score for the speech recognition result of each channel by applying the additional information, and selects and outputs a speech recognition result with a high score.

Furthermore, a third aspect of the present invention is a program that causes a speech recognition device to execute speech recognition processing, the program causing execution of:
a sound source separation step of causing a sound source separation unit to separate a mixed signal of the outputs of a plurality of sound sources into signals corresponding to the respective sound sources and to generate separated signals of a plurality of channels;
a speech recognition step of causing a speech recognition unit to receive the separated signals of the plurality of channels generated by the sound source separation unit, perform speech recognition processing, generate a speech recognition result for each channel, and generate additional information serving as evaluation information for the speech recognition result of each channel; and
a channel selection step of causing a channel selection unit to receive the speech recognition results and the additional information, calculate a score for the speech recognition result of each channel by applying the additional information, and select and output a speech recognition result with a high score.

The program of the present invention can be provided, for example, by a storage medium or a communication medium that supplies it in a computer-readable format to an image processing device or a computer system capable of executing various program codes. Providing such a program in a computer-readable format causes processing corresponding to the program to be carried out on the information processing device or the computer system.

Other objects, features, and advantages of the present invention will become apparent from the more detailed description based on the embodiments of the present invention described later and the accompanying drawings. In this specification, a system is a logical collection of a plurality of devices, and the constituent devices are not necessarily in the same housing.

According to the configuration of one embodiment of the present invention, separated signals are generated by applying independent component analysis (ICA) to an observation signal consisting of a mixture of the outputs of a plurality of sound sources, and speech recognition processing is performed on each separated signal. Additional information is also generated as evaluation information for each speech recognition result. As the additional information, a recognition reliability of the speech recognition result and an in-task utterance degree, which indicates whether the speech recognition result relates to the task assumed by the speech recognition device, are calculated. A score is calculated for the speech recognition result of each channel by applying this additional information, and a recognition result with a high score is selected and output. This processing achieves sound source separation and speech recognition for mixed signals from a plurality of sound sources, so that the required recognition result can be obtained more reliably.

FIG. 1 is a diagram illustrating a situation in which N sound sources emit different sounds and these are observed by n microphones.
FIG. 2 is a diagram illustrating an example of a typical system that combines ICA-based sound source separation with a speech recognition unit.
FIG. 3 is a diagram illustrating the overall configuration and an outline of the processing of the speech recognition device of the present invention.
FIG. 4 is a diagram illustrating the detailed configuration of the sound source separation unit 202 and a specific example of its processing.
FIG. 5 is a diagram showing the configuration of one of the speech recognition units 203-1 to 203-N provided for the respective channels.
FIG. 6 is a diagram illustrating the detailed configuration of the channel selection unit 204 and a specific example of its processing.
FIG. 7 is a flowchart showing the overall flow of the processing executed by the speech recognition device of the present invention.
FIG. 8 is a flowchart showing the details of the speech recognition processing of step S103 in the flow shown in FIG. 7.
FIG. 9 is a flowchart showing the details of the channel selection processing of step S104 in the flow shown in FIG. 7.

The speech recognition device, speech recognition method, and program of the present invention are described in detail below with reference to the drawings. The description covers the following items.
1. Overall configuration example and processing outline of the speech recognition device of the present invention
2. Detailed configuration of the sound source separation unit and a specific example of its processing
3. Detailed configuration of the speech recognition unit and a specific example of its processing
4. Detailed configuration of the channel selection unit and a specific example of its processing
5. Sequence of processing executed by the speech recognition device

[1. Overall configuration example and processing outline of the speech recognition device of the present invention]
First, the overall configuration and an outline of the processing of the speech recognition device of the present invention are described with reference to FIG. 3. The speech recognition device of the present invention receives a mixed signal of sounds output by a plurality of sound sources, performs sound source separation, and then performs speech recognition processing using the separation result.
FIG. 3 shows a configuration example of a speech recognition device 200 according to one embodiment of the present invention.

Sound is collected by a plurality of microphones 201-1 to 201-N, and the input waveforms corresponding to the sound signals acquired by the microphones 201-1 to 201-N are sent to the sound source separation unit 202. The sound source separation unit 202 separates the mixed sound of the plurality of sound sources into individual sounds, one per source, by, for example, independent component analysis (ICA). This separation processing generates and outputs, for example, a separated waveform of the speech originating from each sound source.
Along with this sound source separation, the sound source separation unit 202 also estimates the direction of the sound source from which the sound of each separated waveform arrives.

The separation processing by independent component analysis (ICA) executed by the sound source separation unit 202 generates N separated waveforms, corresponding to the number of inputs (N). Here, the number of separated waveforms (N) is referred to as the number of channels. The sound source separation unit 202 generates separated waveforms for N channels, channel 1 to channel N. However, the number of sound sources is not necessarily equal to N. In some cases, only some of the N channels output a separated speech waveform corresponding to a specific sound source, and the remaining channels contain only noise.

Each of the separated waveforms generated by the sound source separation unit 202 is output individually to the channel selection unit 204 and is also input to the one of the speech recognition units 203-1 to 203-N that is provided for that separated waveform.
The pieces of sound source direction information generated by the sound source separation unit 202 for the respective sound sources are also output individually to the channel selection unit 204.

Each of the speech recognition units 203-1 to 203-N executes speech recognition processing on the corresponding separated waveform output from the sound source separation unit 202. Each of the speech recognition units 203-1 to 203-N outputs the speech recognition result to the channel selection unit 204 together with additional information, namely the reliability of the recognition result and the degree to which the utterance is an in-task utterance (in-task utterance degree).

The in-task utterance degree indicates the degree to which an utterance belongs to the task assumed by the speech recognition apparatus 200. For example, when the device equipped with the speech recognition apparatus 200 is a television and the speech recognition result contains an operation request for the television, such as a request to change the volume or the channel, the utterance is likely to be an in-task utterance, and information in which the in-task utterance degree is set high is output. This determination uses a statistical language model held in a memory in the speech recognition apparatus 200. The statistical language model is data in which, for various words, an index value indicating whether the word is related to the task is set in advance.

The channel selection unit 204 receives the separated waveform corresponding to each sound source from the sound source separation unit 202, and further receives the following information from each of the speech recognition units 203-1 to 203-N:
the speech recognition result corresponding to each separated waveform, and
the additional information (reliability of the recognition result and in-task utterance degree).

The channel selection unit 204 uses this input information to select and output the speech recognition result of the channel that contains the target sound.

The processing of each component shown in FIG. 3 is executed under the control of a control unit not shown in FIG. 3. The control unit is constituted by a CPU or the like and executes a program stored in a storage unit (not shown) to control the processing of each component shown in FIG. 3.
The detailed configuration of each component shown in FIG. 3 and specific examples of the processing to be executed are described below with reference to FIG. 4 and subsequent figures.

[2. Detailed configuration of the sound source separation unit and specific example of processing]
First, the detailed configuration of the sound source separation unit 202 and a specific example of its processing will be described with reference to FIG. 4.
As shown in FIG. 4, the sound source separation unit 202 includes an A/D conversion unit 301, a short-time Fourier transform (FT) unit 302, a signal separation unit 303, an inverse Fourier transform (FT) unit 304, a D/A conversion unit 305, and a sound source direction estimation unit 306.

The individual input waveforms from the microphones 201-1 to 201-N are converted into digital observation signals by the A/D conversion unit 301 and input to the short-time Fourier transform (FT) unit 302.

The short-time Fourier transform (FT) unit 302 performs short-time Fourier transform (FT) processing on the input signal that has been converted into a digital signal, converts it into a spectrogram, and inputs the spectrogram to the signal separation unit 303. The spectrogram of each observation signal obtained by this processing is the signal of equation [2.1] described above, that is, X(ω, t).

The signal separation unit 303 receives the spectrograms of the observation signals generated by the short-time Fourier transform (FT) unit 302 and executes the independent component analysis (ICA) described above to generate the separation result Y. The separation result consists of N separation results, one for each of the N channels. The separation result Y is input to the inverse Fourier transform (FT) unit 304.

The inverse Fourier transform (FT) unit 304 applies inverse Fourier transform processing to the spectrogram corresponding to each sound source signal, converting the spectrogram into a time-domain signal, and thereby generates separated signals that are each estimated to correspond to one sound source. As many separated signals are generated as there are channels, that is, N signals.

These N separated signals are input to the D/A conversion unit 305 and converted by D/A conversion into N separated waveforms as analog signals. Each of the N separated waveforms is output to the one of the speech recognition units 203-1 to 203-N that corresponds to its channel, and to the channel selection unit 204.

The sound source direction estimation unit 306 estimates the arrival direction of each independent signal using part of the estimation result of the signal separation unit 303. This estimation information likewise consists of N pieces of sound source direction information, one per channel. The N pieces of sound source direction information generated by the sound source direction estimation unit 306 are output to the channel selection unit 204.
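As a rough illustration of the signal separation stage, the following Python sketch applies one separation matrix per frequency bin to the observation spectrogram X(ω, t) to obtain a separation result Y(ω, t), in the usual frequency-domain ICA form Y(ω, t) = W(ω) X(ω, t). The nested-list data layout, the function name `apply_separation`, and the toy mixing matrix are assumptions made for illustration only; learning W by ICA, the Fourier transforms, and the direction estimation are not shown.

```python
def apply_separation(W, X):
    """Apply per-frequency separation matrices to an observation spectrogram.

    W[f]    : N x N separation matrix (list of rows) for frequency bin f (assumed layout)
    X[f][t] : N-dimensional complex observation vector at bin f, frame t
    Returns Y with the same layout as X, where Y[f][t] = W[f] * X[f][t].
    """
    Y = []
    for W_f, X_f in zip(W, X):
        Y_f = []
        for x in X_f:
            # One matrix-vector product per time frame
            Y_f.append([sum(w * xi for w, xi in zip(row, x)) for row in W_f])
        Y.append(Y_f)
    return Y

# Toy example with one frequency bin, one frame, and N = 2 channels.
# The source vector was mixed with A = [[1, 0.5], [0.5, 1]] (made-up values),
# so the exact inverse of A is used here in place of a learned W.
source = [1 + 1j, 2 - 1j]
X = [[[2 + 0.5j, 2.5 - 0.5j]]]            # A applied to `source`
W = [[[4 / 3, -2 / 3], [-2 / 3, 4 / 3]]]  # inverse of A
Y = apply_separation(W, X)
```

In this toy case `Y[0][0]` recovers `source` up to rounding, because W is the exact inverse of the mixing matrix. A separation matrix learned by ICA generally recovers the sources only up to scaling and ordering.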

[3. Detailed configuration of the speech recognition unit and specific example of processing]
Next, the detailed configuration of the speech recognition units 203-1 to 203-N and a specific example of their processing will be described with reference to FIG. 5. FIG. 5 shows the configuration of one of the speech recognition units 203-1 to 203-N provided for the respective channels. Each of the N speech recognition units 203-1 to 203-N has the configuration shown in FIG. 5.

As shown in FIG. 5, the speech recognition unit 203 includes an A/D conversion unit 401, a feature extraction unit 402, a speech recognition processing unit 403, and an additional information calculation unit 407. The additional information calculation unit 407 includes a recognition reliability calculation unit 408 and an in-task utterance degree calculation unit 409.
The speech recognition unit 203 also stores an acoustic model 404, an in-task statistical language model 405, and an out-of-task statistical language model 406, and executes processing using these data.

The input to the speech recognition unit 203 shown in FIG. 5 is one separated waveform, corresponding to one channel k (k = 1 to N) among the N channels separated by the sound source separation unit 202. Each of the speech recognition units 203-1 to 203-N receives the separated waveform of its channel k (k = 1 to N), and the units execute speech recognition processing on the separated waveforms of their respective channels in parallel.

In this way, the speech recognition units 203-1 to 203-N process the N separated waveforms of the N channels in parallel. The processing of the separated waveform of one channel is described below with reference to FIG. 5.

The separated waveform of one channel is first input to the A/D conversion unit 401.
The A/D conversion unit 401 converts the separated waveform, which is an analog signal, into a digital observation signal. The digital observation signal is input to the feature extraction unit 402.

The feature extraction unit 402 receives the digital observation signal from the A/D conversion unit 401 and extracts from it the features used for speech recognition. The feature extraction can be performed in accordance with an existing speech recognition algorithm. The extracted features are input to the speech recognition processing unit 403.

The speech recognition processing unit 403 executes speech recognition processing using the features input from the feature extraction unit 402.
The speech recognition processing unit 403 executes a plurality of recognition processes that use the acoustic model 404 together with different language models: speech recognition processing using the in-task statistical language model 405 and speech recognition processing using the out-of-task statistical language model 406.

For example, the words registered in the in-task statistical language model 405 are compared with the words obtained as the speech recognition processing result, matching words are selected to obtain a recognition result, and a score corresponding to the degree of matching is calculated.
Likewise, the words registered in the out-of-task statistical language model 406 are compared with the words obtained as the speech recognition processing result, matching words are selected to obtain a recognition result, and a score corresponding to the degree of matching is calculated.
From the plurality of recognition results obtained with these different models, the result with the highest recognition score is selected and output as the speech recognition result.
A plurality of different models can be used as the in-task statistical language model 405 and as the out-of-task statistical language model 406.

The speech recognition result generated by the speech recognition processing unit 403 is output to the channel selection unit 204 and also to the additional information calculation unit 407 in the speech recognition unit 203. The information output to the additional information calculation unit 407 includes the score information described above.

The additional information calculation unit 407 includes the recognition reliability calculation unit 408 and the in-task utterance degree calculation unit 409.
The recognition reliability calculation unit 408 calculates the recognition reliability of the speech recognition result generated by the speech recognition processing unit 403. The recognition reliability is obtained, for example, by evaluating the validity of the recognized word sequence using evaluation reference data stored in a memory in advance. Specifically, the recognition reliability can be calculated by applying, for example, the configuration described in JP-A-2005-275348.

The in-task utterance degree calculation unit 409 calculates the in-task utterance degree of the speech recognition result generated by the speech recognition processing unit 403. As described above, the in-task utterance degree indicates the degree to which an utterance belongs to the task assumed by the speech recognition apparatus 200. For example, when the device equipped with the speech recognition apparatus 200 is a television and the words contained in the speech recognition result generated by the speech recognition processing unit 403 express an operation request for the television, such as a request to change the volume or the channel, the utterance is likely to be an in-task utterance and the in-task utterance degree becomes high. When the speech recognition result contains many words unrelated to the task, the in-task utterance degree is set low.

Specifically, the in-task utterance degree can be calculated using the scores obtained in the processing of the speech recognition processing unit 403 described above.
That is, the following two scores are compared:
a first score corresponding to the degree of matching between the words obtained as the speech recognition processing result and the words registered in the in-task statistical language model 405, and
a second score corresponding to the degree of matching between the words obtained as the speech recognition processing result and the words registered in the out-of-task statistical language model 406.
When the first score is higher than the second score, the in-task utterance degree is set high; when the second score is higher than the first score, the in-task utterance degree is set low.
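The comparison of the two scores can be sketched in Python as follows. The text only states that the in-task utterance degree is set high when the first score exceeds the second and low otherwise. The logistic mapping of the score difference, the `scale` parameter, and the function name are assumptions chosen for illustration, not the method fixed by the description.

```python
import math

def in_task_utterance_degree(first_score, second_score, scale=1.0):
    """Map the two language-model scores to a degree between 0 and 1.

    first_score  : score obtained with the in-task statistical language model
    second_score : score obtained with the out-of-task statistical language model
    The logistic function and `scale` are illustrative assumptions.
    """
    diff = (first_score - second_score) / scale
    # Clamp the difference so math.exp cannot overflow on extreme inputs
    diff = max(-50.0, min(50.0, diff))
    return 1.0 / (1.0 + math.exp(-diff))

# Above 0.5 when the in-task model scores higher, below 0.5 otherwise
q_high = in_task_utterance_degree(-10.0, -20.0)
q_low = in_task_utterance_degree(-20.0, -10.0)
```

A continuous value is used here so that it can later be weighted together with the other evaluation values. A simple two-level (high or low) setting would equally satisfy the description.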

The additional information calculation unit 407 outputs the recognition reliability calculated by the recognition reliability calculation unit 408 and the in-task utterance degree calculated by the in-task utterance degree calculation unit 409 to the channel selection unit 204 as additional information corresponding to the speech recognition result.

[4. Detailed configuration of the channel selection unit and specific example of processing]
Next, the detailed configuration of the channel selection unit 204 and a specific example of its processing will be described with reference to FIG. 6.
As shown in FIG. 6, the channel selection unit 204 includes channel score calculation units 501-1 to 501-N and a selected channel determination unit 502.

The channel score calculation units 501-1 to 501-N are provided for the respective channels 1 to N. Each of the channel score calculation units 501-1 to 501-N receives the following as channel correspondence information:
the speech recognition result and the additional information (recognition reliability and in-task utterance degree) from the speech recognition unit 203, and
the separated waveform and the sound source direction information from the sound source separation unit 202.

The channel score calculation units 501-1 to 501-N calculate the score of the speech recognition result of each channel using this channel correspondence information.
For example, let
recognition reliability = p,
in-task utterance degree = q, and
power of the separated waveform = r.
Here,
the value of p is larger the higher the recognition reliability,
the value of q is larger the more likely the utterance is an in-task utterance, and
the value of r is larger the greater the power (volume) of the separated waveform.

In this case, the score Sk of channel k is calculated as
Sk = ap + bq + cr
where a, b, and c are preset coefficients (weighting coefficients).

Furthermore, the sound source direction may also be taken into account. Using
sound source direction evaluation value = h,
an evaluation value that is higher the closer the sound source direction is to the front of the device, the score may be calculated as
Sk = ap + bq + cr + dh
where a, b, c, and d are preset coefficients (weighting coefficients).
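The two score formulas can be written as a short Python sketch. The weighted sums follow the text, but the concrete definitions of r (mean-square power) and h (cosine of the angle from the front), the function names, and all numeric values are assumptions introduced for illustration.

```python
import math

def waveform_power(samples):
    """Mean-square power of a non-empty separated waveform (assumed definition of r)."""
    return sum(x * x for x in samples) / len(samples)

def direction_value(angle_deg):
    """Assumed evaluation value h: 1.0 directly in front (0 degrees), smaller to the sides."""
    return math.cos(math.radians(angle_deg))

def channel_score(p, q, r, h=None, a=1.0, b=1.0, c=1.0, d=1.0):
    """Sk = ap + bq + cr, or Sk = ap + bq + cr + dh when h is given."""
    score = a * p + b * q + c * r
    if h is not None:
        score += d * h
    return score

# Illustrative (made-up) values and weights
s_basic = channel_score(p=0.8, q=0.6, r=0.5, a=2.0, b=1.0, c=0.5)
s_with_direction = channel_score(p=0.8, q=0.6, r=0.5, h=direction_value(0.0),
                                 a=2.0, b=1.0, c=0.5, d=0.5)
```

Because p, q, r, and h may be on quite different numeric ranges, the weights a to d would in practice also have to absorb any normalization between them.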

These per-channel scores Sk (k = 1 to N) are calculated by the respective channel score calculation units 501-1 to 501-N and input to the selected channel determination unit 502.

The selected channel determination unit 502 receives the scores S1 to SN corresponding to the N channels from the channel score calculation units 501-1 to 501-N, compares these scores, and selects and outputs the speech recognition results of the channels with high scores as the recognition result.

The selected channel determination unit 502 outputs a preset number M of recognition results, taken from the channels with the highest scores. The output number M can be set, for example, by the user from outside.

The selected channel determination unit 502 outputs the recognition results of the top M channels by score as the selected recognition results. The number of selected channels M is set according to the usage pattern. For example, when there is a single user, only one utterance is expected at a time, so
M = 1
is set. When several people may speak at the same time, a value greater than 1 is set.
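Selecting the top M channels by score can be sketched as follows. The function name, the tuple layout of the return value, and the example recognition results and scores are hypothetical. Only the rule "output the recognition results of the M highest-scoring channels" comes from the text.

```python
import heapq

def select_top_m(results, scores, m=1):
    """Return (channel index, result, score) for the m highest-scoring channels, best first.

    results[k] and scores[k] belong to the same channel (0-based index here).
    """
    ranked = heapq.nlargest(m, zip(scores, range(len(scores)), results),
                            key=lambda item: item[0])
    return [(k, result, score) for score, k, result in ranked]

# Made-up example: three channels, one of which contains only noise
results = ["volume up", "", "what is for dinner"]
scores = [2.4, 0.3, 1.1]
single_user = select_top_m(results, scores, m=1)
two_speakers = select_top_m(results, scores, m=2)
```

With M = 1 only the result of the first channel is returned. With M = 2 the third channel follows it in the output.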

[5. Sequence of processing executed by the speech recognition apparatus]
Next, the sequence of processing executed by the speech recognition apparatus of the present invention will be described with reference to the flowcharts in FIG. 7 and subsequent figures.

The flowchart shown in FIG. 7 shows the overall flow of the processing executed by the speech recognition apparatus of the present invention.
FIG. 8 is a flowchart showing details of the speech recognition processing of step S103 in the flow shown in FIG. 7.
FIG. 9 is a flowchart showing details of the channel selection processing of step S104 in the flow shown in FIG. 7.

The processing according to the flowcharts shown in FIGS. 7 to 9 is executed, as described above, under the control of a control unit constituted by a CPU or the like. By executing a program stored in the storage unit, the control unit outputs commands and the like as appropriate to the components described with reference to FIGS. 3 to 5, controls their processing, and causes the processing according to the flowcharts shown in FIGS. 7 to 9 to be executed.

First, the overall flow of the processing executed by the speech recognition apparatus of the present invention will be described with reference to the flowchart shown in FIG. 7. Each processing step is described in correspondence with the configuration diagram of FIG. 3.

In step S101, sound is input from the microphones 201-1 to 201-N. Sound is collected and input using N microphones arranged at various positions. With N microphones, input waveforms for N channels are obtained.

In step S102, sound source separation processing is executed. This is the processing of the sound source separation unit 202 shown in FIG. 3 and corresponds to the processing described with reference to FIG. 3. As described there, the sound source separation unit 202 executes sound source separation processing by ICA on the input waveforms of the N channels to generate separated waveforms for N channels. In this processing, sound source direction information corresponding to the separated waveform of each channel may also be acquired.

The next step, S103, is speech recognition processing. This is the processing executed in the speech recognition units 203-1 to 203-N shown in FIG. 3 and corresponds to the processing described with reference to FIG. 4. The speech recognition processing of step S103 generates, for each channel, a speech recognition result together with the recognition reliability and the in-task utterance degree as additional information.
Details of the speech recognition processing of step S103 are described later with reference to the flowchart of FIG. 8.

The next step, S104, is channel selection processing. This is the processing performed in the channel selection unit 204 shown in FIG. 3 and corresponds to the processing described with reference to FIG. 6. In the channel selection processing of step S104, a score for each channel is calculated from the speech recognition result, the additional information, and the like, and the results with high scores are selected preferentially.
Details of the channel selection processing of step S104 are described later with reference to the flowchart of FIG. 9.

The next step, S105, is recognition result output processing. This is also processing performed in the channel selection unit 204 shown in FIG. 3 and corresponds to the processing described with reference to FIG. 6. In the recognition result output processing of step S105, M speech recognition results are output in descending order of the per-channel scores calculated in step S104, where M is the preset output number.
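The overall flow of steps S102 to S105 can be summarized in a small Python skeleton. The three callables passed in are hypothetical stand-ins for the sound source separation unit 202, a speech recognition unit 203, and a channel score calculation unit 501. Their signatures and the function name `run_recognition` are assumptions, and the sketch shows only the order of the steps, not their internals.

```python
def run_recognition(input_waveforms, separate, recognize, score, m=1):
    """Run steps S102 to S105 on the N input waveforms obtained in step S101.

    separate(input_waveforms) -> list of N separated waveforms
    recognize(waveform)       -> recognition result with additional information
    score(result, waveform)   -> numeric score for that channel
    """
    # S102: sound source separation, N inputs to N separated waveforms
    separated = separate(input_waveforms)
    # S103: speech recognition for each channel
    recognized = [recognize(w) for w in separated]
    # S104: score calculation for each channel
    scores = [score(rec, w) for rec, w in zip(recognized, separated)]
    # S105: output the M results with the highest scores, best first
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return [recognized[k] for k in order[:m]]
```

Passing the processing stages in as arguments keeps the skeleton independent of any particular separation or recognition implementation.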

Next, the detailed sequence of the speech recognition processing of step S103 in the flowchart of FIG. 7 will be described with reference to the flowchart shown in FIG. 8. This speech recognition processing is executed in the speech recognition units 203-1 to 203-N shown in FIG. 3 and corresponds to the processing described with reference to FIG. 5.

Here, the processing for channel k among channels 1 to N (the processing of the speech recognition unit 203-k) is described. Because the speech recognition processing has no dependencies between channels, the channels can be processed either sequentially or in parallel.
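Because the channels are independent, the choice between sequential and parallel processing does not change the results. The following Python sketch illustrates this with a thread pool from the standard library. The `recognize_channel` callable is a hypothetical stand-in for the processing of one speech recognition unit 203-k, and using threads is only one possible way to run the channels in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_all(separated_waveforms, recognize_channel, parallel=True):
    """Run per-channel recognition on every separated waveform.

    Results are returned in channel order in both modes.
    """
    if not parallel:
        # Sequential processing, one channel after another
        return [recognize_channel(w) for w in separated_waveforms]
    # Parallel processing: map() preserves the input (channel) order
    with ThreadPoolExecutor() as pool:
        return list(pool.map(recognize_channel, separated_waveforms))
```

For CPU-bound recognition in Python, a process pool may be a better fit than threads. That choice is outside what the description specifies.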

In step S201, the data of output channel k, which is the separation processing result of the sound source separation unit 202, is received.
In step S202, feature extraction processing is executed. This is the processing of the feature extraction unit 402 shown in FIG. 5. The feature extraction unit 402 extracts the features used for speech recognition from the observation signal.

Next, in step S203, speech recognition processing is executed. This is the processing of the speech recognition processing unit 403 shown in FIG. 5. As described above, the speech recognition processing unit 403 executes a plurality of recognition processes that use the acoustic model 404 together with different language models: speech recognition processing using the in-task statistical language model 405 and speech recognition processing using the out-of-task statistical language model 406.

Next, in step S204, reliability calculation processing is executed. This is the processing executed by the recognition reliability calculation unit 408 of the additional information calculation unit 407 shown in FIG. 5.

The recognition reliability calculation unit 408 calculates the recognition reliability of the speech recognition result generated by the speech recognition processing unit 403. For example, the recognition reliability is calculated by evaluating the validity of the recognized word sequence using evaluation reference data stored in a memory in advance.

Next, in step S205, in-task utterance degree calculation processing is executed. This is the processing executed by the in-task utterance degree calculation unit 409 of the additional information calculation unit 407 shown in FIG. 5.

The in-task utterance degree calculation unit 409 calculates the in-task utterance degree of the speech recognition result generated by the speech recognition processing unit 403. When the speech recognition result generated by the speech recognition processing unit 403 contains many words related to the task, the utterance is likely to be an in-task utterance and the in-task utterance degree becomes high. When the speech recognition result contains many words unrelated to the task, the in-task utterance degree is set low.

In accordance with the flowchart shown in FIG. 8, the speech recognition unit 203 generates the following data for each channel and supplies them to the channel selection unit 204:
the speech recognition result, and
the additional information (recognition reliability and in-task utterance degree).

Next, the detailed sequence of the channel selection processing of step S104 in the flowchart of FIG. 7 will be described with reference to the flowchart shown in FIG. 9. This channel selection processing is executed in the channel selection unit 204 shown in FIG. 3 and corresponds to the processing described with reference to FIG. 6.

In step S301, the output list is initialized, that is, the list is reset. The output list is a list in which the recognition results of channels 1 to N are arranged in descending order of score. In accordance with this output list, the selected channel determination unit 502 shown in FIG. 6 selects and outputs a predetermined number (M) of recognition results, starting from the recognition results with the highest scores.

The next steps S302 to S304 form a loop that is executed repeatedly for the data of each channel k = 1 to N.
In step S303, the score corresponding to channel k is calculated.
As described earlier, the score is calculated, for example, with
recognition reliability = p
in-task utterance degree = q
power of the separated waveform = r
so that the score Sk of channel k is
Sk = ap + bq + cr
where a, b, and c are preset coefficients (weighting coefficients).
Alternatively, the sound source direction is also taken into account, using
sound source direction evaluation value = h
so that the score is
Sk = ap + bq + cr + dh
where d is likewise a preset weighting coefficient.
The score of channel k is calculated by this processing.
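The weighted-sum score can be sketched as follows. This is a minimal illustration, assuming all four quantities are already available as floating-point values for each channel. The `ChannelInfo` class, the `channel_score` function, and the weight values in the example are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChannelInfo:
    """Per-channel inputs to the score (field names are illustrative)."""
    recognition_reliability: float            # p
    in_task_utterance_degree: float           # q
    separated_power: float                    # r
    direction_value: Optional[float] = None   # h, omitted when direction is not used


def channel_score(info, a=1.0, b=1.0, c=1.0, d=1.0):
    """Compute Sk = ap + bq + cr, adding dh only when h is provided."""
    score = (a * info.recognition_reliability
             + b * info.in_task_utterance_degree
             + c * info.separated_power)
    if info.direction_value is not None:
        score += d * info.direction_value
    return score


# Example with assumed weights a=0.5, b=0.3, c=0.2, d=0.1.
without_direction = channel_score(ChannelInfo(0.9, 0.8, 0.5), a=0.5, b=0.3, c=0.2)
with_direction = channel_score(ChannelInfo(0.9, 0.8, 0.5, 1.0), a=0.5, b=0.3, c=0.2, d=0.1)
```

Leaving `direction_value` unset corresponds to the three-term form of the score. Supplying it switches to the four-term form that also considers the sound source direction.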

In steps S302 to S304, N scores S1 to SN are calculated for the speech recognition results of the N channels 1 to N.

Finally, in step S305, the predetermined number (M) of recognition results are selected in descending order of channel score and output. This is the process performed by the selected channel determination unit 502 shown in FIG. 6.
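The flow of steps S301 to S305 can be sketched as below: reset the output list, fill it with one entry per channel, sort by score, and keep the top M. This is an illustrative sketch only. The function name `select_channels` and the input format (a mapping from channel number to an already computed score) are assumptions, and the example scores are made up.

```python
def select_channels(scores, m):
    """Return the top-m (channel, score) pairs in descending order of score.

    scores: dict mapping channel number k to its precomputed score Sk.
    m: predetermined number of recognition results to output.
    """
    output_list = []                               # S301: initialize (reset) the output list
    for channel, score in scores.items():          # S302-S304: loop over channels 1..N
        output_list.append((channel, score))
    # Arrange the entries in descending order of score.
    output_list.sort(key=lambda item: item[1], reverse=True)
    return output_list[:m]                         # S305: output the top M results


# Example with three channels and M = 2 (scores are illustrative values).
selected = select_channels({1: 0.79, 2: 0.35, 3: 0.62}, m=2)
```

Here `selected` holds the entries for channels 1 and 3, in that order. If M is at least N, every channel is returned.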

Thus, in the speech recognition apparatus of the present invention, speech recognition is applied to each output channel of the ICA-based sound source separation, and the channel corresponding to the target sound is selected based on the results. Each speech recognition result is given its reliability and information on whether the utterance falls within the task assumed by the speech recognition apparatus, and channel selection is performed based on this additional information. This resolves the problem of erroneous ICA output channel selection.

The processing executed by the speech recognition apparatus of the present invention provides, for example, the following effects.
(a) Using the reliability of speech recognition resolves the problem of a channel other than the target speech being selected by mistake.
(b) In a configuration that does not use sound source direction information, channel selection that does not depend on the direction of arrival of the target speech becomes possible.
(c) Using information on whether the utterance content is in-task makes it possible to reject interfering speech that the speech recognition system does not expect as input.

The present invention has been described in detail above with reference to specific embodiments. However, it is obvious that those skilled in the art can modify or substitute the embodiments without departing from the gist of the present invention. In other words, the present invention has been disclosed by way of example and should not be interpreted restrictively. To determine the gist of the present invention, the claims should be taken into consideration.

The series of processes described in the specification can be executed by hardware, by software, or by a combination of both. When the processes are executed by software, a program recording the processing sequence can be installed in a memory of a computer built into dedicated hardware and executed, or the program can be installed and executed on a general-purpose computer capable of executing various processes. For example, the program can be recorded on a recording medium in advance. Besides being installed on a computer from a recording medium, the program can be received via a network such as a LAN (Local Area Network) or the Internet and installed on a recording medium such as a built-in hard disk.

Note that the various processes described in the specification need not be executed only in time series in the order described. They may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as necessary. Further, in this specification, a system is a logical set of a plurality of apparatuses, and the constituent apparatuses are not limited to being in the same casing.

As described above, according to the configuration of one embodiment of the present invention, separated signals are generated by applying independent component analysis (ICA) to an observation signal consisting of a mixed signal in which the outputs from a plurality of sound sources are mixed, and speech recognition processing is performed on each separated signal. In addition, additional information is generated as evaluation information for the speech recognition results. As the additional information, the recognition reliability of the speech recognition result and the in-task utterance degree, which indicates whether the speech recognition result is related to the task assumed in the speech recognition apparatus, are calculated. The score of the speech recognition result of each channel is calculated by applying this additional information, and recognition results with high scores are selected and output. These processes realize sound source separation and speech recognition for mixed signals from a plurality of sound sources, and make it possible to obtain the required recognition results more reliably.

Description of Symbols
101 Microphone
102 Sound source separation unit
103 Channel selection unit
104 Speech recognition unit
200 Speech recognition apparatus
201 Microphone
202 Sound source separation unit
203 Speech recognition unit
204 Channel selection unit
301 A/D conversion unit
302 Short-time Fourier transform (FT) unit
303 Signal separation unit
304 Inverse Fourier transform (FT) unit
305 D/A conversion unit
306 Sound source direction estimation unit
401 A/D conversion unit
402 Feature extraction unit
403 Speech recognition processing unit
404 Acoustic model
405 In-task statistical language model
406 Out-of-task statistical language model
407 Additional information calculation unit
408 Recognition reliability calculation unit
409 In-task utterance degree calculation unit
501 Channel score calculation unit
502 Selected channel determination unit

Claims

A speech recognition apparatus comprising:
a sound source separation unit that separates a mixed signal, in which outputs from a plurality of sound sources are mixed, into signals corresponding to the respective sound sources to generate separated signals of a plurality of channels;
a speech recognition unit that receives the separated signals of the plurality of channels generated by the sound source separation unit, performs speech recognition processing, generates a speech recognition result for each channel, and generates additional information serving as evaluation information for the speech recognition result of each channel; and
a channel selection unit that receives the speech recognition results and the additional information, calculates a score of the speech recognition result of each channel by applying the additional information, and selects and outputs speech recognition results having high scores.

The speech recognition apparatus according to claim 1, wherein
the speech recognition unit calculates a recognition reliability of the speech recognition result as the additional information, and
the channel selection unit calculates the score of the speech recognition result of each channel by applying the recognition reliability.

The speech recognition apparatus according to claim 1, wherein
the speech recognition unit calculates, as the additional information, an in-task utterance degree indicating whether the speech recognition result is a recognition result related to a task assumed in the speech recognition apparatus, and
the channel selection unit calculates the score of the speech recognition result of each channel by applying the in-task utterance degree.

The speech recognition apparatus according to claim 1, wherein
the channel selection unit applies, as score calculation data, at least one of the recognition reliability of the speech recognition result and the in-task utterance degree indicating whether the speech recognition result is a recognition result related to the task assumed in the speech recognition apparatus, and calculates the score by further combining at least one of speech power and sound source direction information.

The speech recognition apparatus according to any one of claims 1 to 4, wherein
the speech recognition unit comprises a plurality of speech recognition units equal in number to the channels of the separated signals generated by the sound source separation unit, and
the plurality of speech recognition units each receive the separated signal of the corresponding channel among the separated signals of the plurality of channels generated by the sound source separation unit, and perform speech recognition processing in parallel.

A speech recognition method executed in a speech recognition apparatus, the method comprising:
a sound source separation step in which a sound source separation unit separates a mixed signal, in which outputs from a plurality of sound sources are mixed, into signals corresponding to the respective sound sources to generate separated signals of a plurality of channels;
a speech recognition step in which a speech recognition unit receives the separated signals of the plurality of channels generated by the sound source separation unit, performs speech recognition processing, generates a speech recognition result for each channel, and generates additional information serving as evaluation information for the speech recognition result of each channel; and
a channel selection step in which a channel selection unit receives the speech recognition results and the additional information, calculates a score of the speech recognition result of each channel by applying the additional information, and selects and outputs speech recognition results having high scores.

A program for causing a speech recognition apparatus to execute speech recognition processing, the program causing execution of:
a sound source separation step of causing a sound source separation unit to separate a mixed signal, in which outputs from a plurality of sound sources are mixed, into signals corresponding to the respective sound sources to generate separated signals of a plurality of channels;
a speech recognition step of causing a speech recognition unit to receive the separated signals of the plurality of channels generated by the sound source separation unit, perform speech recognition processing, generate a speech recognition result for each channel, and generate additional information serving as evaluation information for the speech recognition result of each channel; and
a channel selection step of causing a channel selection unit to receive the speech recognition results and the additional information, calculate a score of the speech recognition result of each channel by applying the additional information, and select and output speech recognition results having high scores.