JP2011059186A

JP2011059186A - Speech section detecting device and speech recognition device, program and recording medium

Info

Publication number: JP2011059186A
Application number: JP2009205990A
Authority: JP
Inventors: Tetsutsugu Tamura; 哲嗣田村; Shinichi Takeuchi; 伸一竹内; Satoru Hayamizu; 悟速水
Original assignee: Gifu University NUC
Current assignee: Gifu University NUC
Priority date: 2009-09-07
Filing date: 2009-09-07
Publication date: 2011-03-24

Abstract

PROBLEM TO BE SOLVED: To provide a speech section detecting device capable of suppressing the influence of acoustic noise on speech section detection by means of multi-modal speech section detection that comprehensively uses voice information and image information.

SOLUTION: The speech section detecting device 100 includes: a first multi-modal VAD section 131, which creates a sound and image feature amount by combining a sound feature amount and an image feature amount, and determines a speech section based on the sound and image feature amount; a speech uni-modal VAD section 132, which determines the speech section using only the sound feature amount; an image uni-modal VAD section 133, which determines the speech section using only the image feature amount; a second multi-modal VAD section 134, which determines the speech section by combining the determinations of the speech uni-modal VAD section 132 and the image uni-modal VAD section 133; and a third multi-modal VAD section 135, which determines the speech section by combining the determinations of the first multi-modal VAD section 131 and the second multi-modal VAD section 134 by a majority decision rule.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a speech section detection device that detects speech utterance sections, a speech recognition device that converts speech into text, a program, and a recording medium.

Speech recognition is a technique in which an input speech signal is converted into time-series acoustic characteristics by acoustic processing and acoustic analysis, and is then converted into text by pattern matching or the like using these acoustic characteristics, that is, feature amounts. In speech recognition, speech section detection is often applied before acoustic processing and analysis: the input speech is divided into appropriate sections, and each section is labeled as a speech section or a non-speech section. In this case, only the speech signals labeled as speech sections are passed to the subsequent speech recognition process.

Speech section detection methods are roughly divided into two types: model-based methods and non-model-based methods. In a model-based method, speech and non-speech models are built in advance. For an input, both the speech model and the non-speech model are used to calculate whether the input is closer to speech or to non-speech, and labeling is performed based on the result.

In a non-model-based method, a score is first calculated from the input signal based on features such as power. If the score exceeds a certain threshold, the section is treated as a speech section; otherwise, it is treated as a non-speech section. For example, in Non-Patent Document 1, the input signal is decomposed into periodic and aperiodic components, and the power ratio between the two is used as the score to identify whether or not a section is a speech section.
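The score-and-threshold idea can be illustrated with a minimal sketch. It assumes frame-wise log power as the score (the simplest of the "features such as power" mentioned above, not the periodic/aperiodic ratio of Non-Patent Document 1); the function names, frame length, hop size, and threshold are hypothetical choices made only for illustration.

```python
import numpy as np

def frame_log_power(signal, frame_len, hop):
    """Return the log power of each frame of a 1-D signal.

    Assumes len(signal) >= frame_len; frame_len and hop are in samples.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    powers = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        # A small constant avoids log(0) on silent frames.
        powers[i] = np.log(np.mean(frame ** 2) + 1e-10)
    return powers

def threshold_vad(scores, threshold):
    """Label a frame as speech (True) when its score exceeds the threshold."""
    return np.asarray(scores) > threshold
```

With a signal made of silence followed by a tone, the silent frames fall below a threshold such as -10 and the tone frames rise above it. In practice the threshold would have to be tuned to the recording conditions.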

Meanwhile, one approach to speech recognition is multimodal speech recognition, which uses not only the speech signal but also a moving image of the lips during utterance. In multimodal speech recognition, the input moving image is converted into time-series image feature amounts, and the image feature amounts are concatenated with the acoustic feature amounts to generate acoustic image feature amounts. Speech recognition is then performed using the acoustic image feature amounts.

As an example of multimodal speech recognition, in Non-Patent Document 2 an input image is subjected to principal component analysis using principal component vectors prepared in advance, and the resulting principal component coefficients are used as image feature amounts. For recognition, a multi-stream hidden Markov model (HMM) is used, and speech recognition performance is improved by appropriately weighting speech and images.

Methods that likewise use image information have also been proposed for speech section detection. For example, in Patent Document 1, a lip shape is obtained from an input image, and a motion shape is calculated by comparing it with a previously extracted lip shape. The motion shape is wavelet-transformed, and speech sections are detected by thresholding the values in the high-frequency region.

In Patent Document 2, the speech signal is divided into frames at fixed time intervals, the power and zero-crossing rate are calculated for each frame, and frames that satisfy given conditions are treated as speech section candidates. A motion region is then detected from the input image, the similarity between the features of the motion region and features prepared in advance is obtained, and a lip motion signal is generated by thresholding. A speech section candidate in which a lip motion signal is detected is determined to be a speech section.

A search of the prior art found that the invention of Patent Document 3 has been proposed as a speech section detection device. The speech section detection device of Patent Document 3 uses the speaker's speech waveform and lip image information as information sources for speech recognition.

Patent Document 1: JP-A-6-301393
Patent Document 2: JP 2007-156493 A
Patent Document 3: JP 59-147398 A

Non-Patent Document 1: Ishizuka and Nakatani, "Evaluation of noise-resistant speech section detection using the ratio of periodic and aperiodic components of the signal," Proceedings of the Acoustical Society of Japan 2006 Spring Meeting, 3-9-11.
Non-Patent Document 2: Miyajima, Tokuda, and Kitamura, "Bimodal speech recognition based on minimum error learning," Proceedings of the Acoustical Society of Japan 2000 Spring Meeting, 1-Q-14.

Conventional speech recognition technology has the problem that recognition performance deteriorates markedly in environments with background noise.
As one solution to this problem, speech recognition methods that include speech section detection as preprocessing have been proposed. Speech section detection has the advantage of being effective in suppressing misrecognition in non-speech sections, and is widely used. However, speech section detection itself suffers from an unavoidable drop in detection performance due to noise. As long as detection depends on the speech signal, this problem is difficult to solve.

Multimodal speech recognition is one technique for suppressing the degradation of speech recognition performance. Because multimodal speech recognition uses image information, which is unaffected by acoustic noise, together with the speech signal, degradation in recognition performance can be suppressed.

Even in multimodal speech recognition, however, the problem of misrecognition in non-speech sections remains because of the influence of the noise-corrupted speech signal, and addressing it has been an open issue. In addition, improving recognition performance requires effective use of both the information obtained from the speech signal and the information obtained from the images, but the conventional multimodal speech recognition framework is insufficient in this respect.

Note that Patent Document 3 proposes nothing more than combining the speech waveform and lip image information as information sources for speech recognition. With this configuration alone, it cannot be expected that the influence of acoustic noise will be suppressed and the accuracy of speech section detection improved.

An object of the present invention is to provide a speech section detection device capable of suppressing the influence of acoustic noise on speech section detection by means of multimodal speech section detection that uses speech information and image information comprehensively.

Another object of the present invention is to provide a speech recognition device that retains the advantage of conventional multimodal speech recognition using a speech signal and a lip moving image signal, namely robust speech recognition even under noise, while suppressing misrecognition in non-speech sections by including a speech section detection device as preprocessing.

A further object of the present invention is to provide a program that causes a computer to function as a speech section detection device capable of suppressing the influence of acoustic noise on speech section detection by means of multimodal speech section detection that uses speech information and image information comprehensively.

A further object of the present invention is to provide a recording medium storing a program that causes a computer to function as a speech section detection device capable of suppressing the influence of acoustic noise on speech section detection by means of multimodal speech section detection that uses speech information and image information comprehensively.

To achieve the above objects, the invention according to claim 1 is a speech section detection device comprising: speech input means for receiving a speaker's speech signal and converting it into a digital signal; image input means for receiving a lip moving image of the speaker and converting it into a time series of still images (hereinafter referred to as image frames); acoustic feature amount extraction means for extracting an acoustic feature amount for speech section detection from the digitized speech signal output by the speech input means; image feature amount extraction means for extracting an image feature amount for speech section detection from the image frames; and speech section determination means for determining speech sections based on the acoustic feature amount for speech section detection and the image feature amount for speech section detection, wherein the speech section determination means includes: first determination means for generating an acoustic image feature amount by combining the acoustic feature amount and the image feature amount, and determining speech sections based on the acoustic image feature amount; second determination means for determining speech sections using only the acoustic feature amount; third determination means for determining speech sections using only the image feature amount; fourth determination means for determining speech sections by integrating the determinations of the second determination means and the third determination means; and fifth determination means for determining speech sections by integrating, by majority rule, the determination results of at least the first and fourth determination means among the first to fourth determination means.

The invention of claim 2 is characterized in that, in claim 1, the acoustic feature amount extraction means and the image feature amount extraction means extract the acoustic feature amount and the image feature amount, respectively, by model-based and non-model-based methods, and the first to fourth determination means determine speech sections based on the feature amounts extracted by the model-based and non-model-based methods.

The invention of claim 3 is a speech recognition device comprising: acoustic feature amount calculation means for cutting out speech sections from the speech signal output by the speech input means, based on the speech section determination made by the speech section detection device according to claim 1 or claim 2, and calculating an acoustic feature amount for speech recognition from the speech signal within each cut-out speech section; image feature amount calculation means for calculating an image feature amount for speech recognition from the image frames within the speech sections determined by the speech section detection device; feature amount generation means for generating an acoustic image feature amount for speech recognition using the acoustic feature amount for speech recognition and the image feature amount for speech recognition; and multimodal speech recognition means for performing speech recognition based on the generated acoustic image feature amount for speech recognition.

The invention of claim 4 is a program for causing a computer to function as: speech input means for receiving a speaker's speech signal and converting it into a digital signal; image input means for receiving a lip moving image of the speaker and converting it into a time series of still images (hereinafter referred to as image frames); acoustic feature amount extraction means for extracting an acoustic feature amount for speech section detection from the digitized speech signal output by the speech input means; image feature amount extraction means for extracting an image feature amount for speech section detection from the image frames; and speech section determination means for determining speech sections based on the acoustic feature amount for speech section detection and the image feature amount for speech section detection, wherein the speech section determination means includes: first determination means for generating an acoustic image feature amount by combining the acoustic feature amount and the image feature amount, and determining speech sections based on the acoustic image feature amount; second determination means for determining speech sections using only the acoustic feature amount; third determination means for determining speech sections using only the image feature amount; fourth determination means for determining speech sections by integrating the determinations of the second determination means and the third determination means; and fifth determination means for determining speech sections by integrating, by majority rule, the determination results of at least the first and fourth determination means among the first to fourth determination means.

The invention of claim 5 is a computer-readable recording medium storing a program for causing a computer to function as: speech input means for receiving a speaker's speech signal and converting it into a digital signal; image input means for receiving a lip moving image of the speaker and converting it into a time series of still images (hereinafter referred to as image frames); acoustic feature amount extraction means for extracting an acoustic feature amount for speech section detection from the digitized speech signal output by the speech input means; image feature amount extraction means for extracting an image feature amount for speech section detection from the image frames; and speech section determination means for determining speech sections based on the acoustic feature amount for speech section detection and the image feature amount for speech section detection, wherein the speech section determination means includes: first determination means for generating an acoustic image feature amount by combining the acoustic feature amount and the image feature amount, and determining speech sections based on the acoustic image feature amount; second determination means for determining speech sections using only the acoustic feature amount; third determination means for determining speech sections using only the image feature amount; fourth determination means for determining speech sections by integrating the determinations of the second determination means and the third determination means; and fifth determination means for determining speech sections by integrating, by majority rule, the determination results of at least the first and fourth determination means among the first to fourth determination means.

According to the invention of claim 1, it is possible to provide a speech section detection device capable of suppressing the influence of acoustic noise on speech section detection by means of multimodal speech section detection that uses speech information and image information comprehensively. That is, by using not only the speech signal but also the lip moving image, the influence of acoustic noise on speech section detection can be suppressed, and speech sections can be detected with high accuracy even in a noisy environment.

According to the invention of claim 2, the acoustic feature amount extraction means and the image feature amount extraction means use acoustic feature amounts and image feature amounts extracted by model-based and non-model-based methods. Speech sections can therefore be detected on the basis of diverse information, and can be detected with high accuracy even in a noisy environment.

According to the invention of claim 3, the speech recognition device retains the advantage of conventional multimodal speech recognition using a speech signal and a lip moving image, namely robust speech recognition even under noise, and, by including a speech section detection device that performs preprocessing, can suppress misrecognition in non-speech sections. As a result, high speech recognition performance can be achieved even in a noisy environment.

According to the invention of claim 4, a computer can easily be made to function as the speech section detection device according to claim 1 by executing the program.
According to the invention of claim 5, a computer can easily be made to function as the speech section detection device according to claim 1 by having the computer read the recording medium.

FIG. 1: Functional block diagram of a speech section detection device and a speech recognition device according to one embodiment.
FIG. 2: Schematic diagram of a computer.
FIG. 3: Explanatory diagram of optical flow.
FIG. 4: Explanatory diagram of an example of generating acoustic image feature amounts.
FIG. 5: Explanatory diagram of an output example of speech section detection.
FIG. 6: Explanatory diagram of an example of final-integration speech section detection.
FIG. 7: Explanatory diagram of an example of compensating the result of speech section detection.
FIG. 8: Explanatory diagram of the window used for calculating image feature amounts for speech recognition.

An embodiment of a speech section detection device and a speech recognition device embodying the present invention is described below with reference to FIGS. 1 to 8.
As shown in FIG. 1, the speech section detection device 100 and the speech recognition device 200 are implemented on a common computer 10. As shown in FIG. 2, the computer 10 includes a CPU 20, a ROM 30, a RAM 40, and a storage device 50 such as a hard disk. The ROM 30 stores a speech section detection program and a speech recognition program. A microphone 60 and imaging means 70 are connected to the computer 10 so that the speaker's speech and lip moving image can be input. The ROM 30 corresponds to the recording medium. When the speech section detection program is stored in the RAM 40, the RAM 40 corresponds to the recording medium.

When the computer 10 executes the speech section detection program, the speech section detection device 100 realizes the functions of the following units. As shown in FIG. 1, the speech section detection device 100 includes a speech input unit 101, an acoustic feature amount extraction unit 102, an image input unit 111, an image feature amount extraction unit 112, an acoustic image feature amount generation unit 121, an initial-integration speech section detection unit (hereinafter referred to as the first multimodal VAD unit) 131, a speech unimodal speech section detection unit (hereinafter referred to as the speech unimodal VAD unit) 132, an image unimodal speech section detection unit (hereinafter referred to as the image unimodal VAD unit) 133, a result-integration speech section detection unit (hereinafter referred to as the second multimodal VAD unit) 134, and a final-integration speech section detection unit (hereinafter referred to as the third multimodal VAD unit) 135. VAD stands for Voice Activity Detection.
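How the decisions of these units might be combined frame by frame can be sketched as follows. This is an illustration under stated assumptions, not the embodiment's actual procedure: the passage above does not say how the second multimodal VAD unit 134 integrates the two unimodal results, so a logical AND/OR is assumed, and the tie-breaking rule in the majority vote is likewise an assumption. The function names `integrate_results` and `majority_vote` are hypothetical.

```python
import numpy as np

def integrate_results(audio_vad, image_vad, mode="and"):
    """Combine per-frame unimodal decisions (assumed rule: logical AND or OR)."""
    audio_vad = np.asarray(audio_vad, dtype=bool)
    image_vad = np.asarray(image_vad, dtype=bool)
    if mode == "and":
        return audio_vad & image_vad
    if mode == "or":
        return audio_vad | image_vad
    raise ValueError("mode must be 'and' or 'or'")

def majority_vote(decisions, tie_is_speech=False):
    """Per-frame majority decision over several detectors.

    decisions: sequence of equal-length per-frame boolean sequences.
    tie_is_speech: how to label a frame when the votes are split evenly
    (an assumption; relevant only with an even number of detectors).
    """
    votes = np.sum(np.asarray(decisions, dtype=bool), axis=0)
    n = len(decisions)
    result = votes * 2 > n
    if tie_is_speech:
        result |= votes * 2 == n
    return result
```

For example, with a first multimodal decision `[1, 1, 0, 0]`, an audio decision `[1, 0, 1, 0]`, and an image decision `[1, 1, 1, 0]`, `integrate_results` with `mode="and"` gives `[1, 0, 1, 0]`, and a vote over the first multimodal, audio, and integrated decisions labels frames 0 and 2 as speech. Which detectors take part in the vote is left to the caller.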

Likewise, when the computer 10 executes the speech recognition program, the speech recognition device 200 realizes the functions of the following units.
As shown in FIG. 1, the speech recognition device 200 includes a speech section detection compensation unit 201, a speech cut-out unit 301, an acoustic feature amount extraction unit 302 for speech recognition, an image cut-out unit 311, an image feature amount extraction unit 312 for speech recognition, an acoustic image feature amount generation unit 321 for speech recognition, and a multimodal speech recognition unit 331.

The operation of the speech section detection device 100 and the speech recognition device 200 is described below.
The speech input unit 101 of the speech section detection device 100 receives a speech signal (an analog signal) obtained when the microphone 60 converts the speaker's speech into an electric signal. It samples the speech signal so that the original signal can be reconstructed according to the sampling theorem, quantizes it with an appropriate quantization step, and thereby converts it into a digital signal. The speech input unit 101 corresponds to the speech input means.

The acoustic feature amount extraction unit 102 calculates (that is, extracts) acoustic feature amounts from the digital signal. For example, the acoustic feature amount extraction unit 102 extracts speech frames of a fixed duration at fixed time intervals and, for each extracted frame, obtains the logarithmic power of the speech signal and the mel-frequency cepstral coefficients (MFCC). It then calculates the first-order and second-order differential coefficients of both the logarithmic power and the MFCC. A frame number (ID) is assigned to each speech frame.

In the present embodiment, one or more of the acoustic feature amounts calculated by the acoustic feature amount extraction unit 102 are used as the acoustic feature amounts for speech section detection.
That is, the acoustic feature amount extraction unit 102 calculates the acoustic feature amounts used in the model-based method and the non-model-based method described later.

In the non-model-based method, only the logarithmic power is used. In the model-based method, all of the above acoustic feature amounts are used. That is, in the model-based method of the present embodiment, the acoustic feature amount has 39 dimensions in total: 12 MFCC dimensions and the logarithmic power (13 static dimensions), together with the first-order and second-order differential coefficients that represent the dynamic characteristics of the 12 MFCC dimensions and the logarithmic power (13 dimensions each). The acoustic feature amount extraction unit 102 corresponds to the acoustic feature amount extraction means for speech section detection.
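The 39-dimensional layout can be made concrete with a short sketch. It assumes the 13 static values per frame (12 MFCC plus log power) have already been computed by some front end, and it approximates the differential coefficients with simple finite differences via `numpy.gradient`. The text does not specify how the derivatives are computed, so that choice and the function names are assumptions.

```python
import numpy as np

def delta(features):
    """Finite-difference approximation of the time derivative.

    features: array of shape (n_frames, n_dims), with n_frames >= 2.
    """
    return np.gradient(features, axis=0)

def stack_dynamic_features(static):
    """Append first- and second-order differences to the static features.

    static: array of shape (n_frames, 13), e.g. 12 MFCC + log power.
    Returns an array of shape (n_frames, 39).
    """
    static = np.asarray(static, dtype=float)
    d1 = delta(static)   # first-order differential coefficients
    d2 = delta(d1)       # second-order differential coefficients
    return np.hstack([static, d1, d2])
```

Passing a `(n_frames, 13)` array yields a `(n_frames, 39)` array whose first 13 columns are the unchanged static features.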

The image input unit 111 receives a lip moving image of the speaker captured by the imaging means 70, such as a video camera or a web camera, and converts it into a time series of still images with an appropriate frame rate, width, and height. Each of these still images is hereinafter referred to as an image frame. An image frame consists of W (number of horizontal pixels) × H (number of vertical pixels) pixels. The image input unit 111 corresponds to the image input means.

As shown in FIG. 3, the image feature amount extraction unit 112 calculates the optical flow using the image frame at a given time and the immediately preceding image frame. The optical flow is the set of motion vectors of the pixels in the image frame. The image feature amount extraction unit 112 then calculates the mean and variance of the vertical and horizontal components of the optical flow over the entire image frame.

Here, the following is an example of calculating the mean and variance of the vertical and horizontal components.

In this calculation, (u(x, y), v(x, y)) is the optical flow vector obtained at point (x, y), W is the width of the image frame, and H is its height.
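The formulas themselves are not reproduced in the text. A plausible form, assuming the ordinary population mean and variance taken over all W × H pixels, is sketched below. The symbols $\mu_u$, $\mu_v$, $\sigma_u^2$, and $\sigma_v^2$ are introduced here for illustration only, and the reading of u as the horizontal component and v as the vertical component is likewise an assumption.

```latex
\mu_u = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H} u(x,y), \qquad
\mu_v = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H} v(x,y)

\sigma_u^2 = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H} \bigl(u(x,y)-\mu_u\bigr)^2, \qquad
\sigma_v^2 = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H} \bigl(v(x,y)-\mu_v\bigr)^2
```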

That is, the image feature amount extraction unit 112 obtains a four-dimensional image feature amount from the entire image frame, consisting of the mean and the variance of the optical flow in each of the vertical and horizontal directions (two dimensions each).

When the speaker speaks, the movement of the mouth generates flow vectors, so the mean value within the image region increases. In addition, because flow vectors arise in some places and not in others as the mouth moves, the variance of the flow vectors also increases. These values are therefore used as the image feature amounts.
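The effect described above can be illustrated with a small sketch. It assumes the flow field of one frame is already available as two H × W nested lists (the function name `flow_features` and the toy values are hypothetical, and the optical flow estimation itself is not shown).

```python
from statistics import fmean, pvariance

def flow_features(u, v):
    """Return (mean_u, mean_v, var_u, var_v) for one image frame.

    u and v are H x W nested lists holding the horizontal and vertical
    optical-flow components of every pixel (an assumed data layout).
    """
    flat_u = [val for row in u for val in row]
    flat_v = [val for row in v for val in row]
    return (fmean(flat_u), fmean(flat_v),
            pvariance(flat_u), pvariance(flat_v))

# Toy 2 x 3 frames: a still face, and a frame in which two "lip" pixels move.
still = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
moving_v = [[0.0, 0.0, 0.0], [0.0, 3.0, 3.0]]

print(flow_features(still, still))     # (0.0, 0.0, 0.0, 0.0)
print(flow_features(still, moving_v))  # (0.0, 1.0, 0.0, 2.0)
```

In the second frame only a third of the pixels move, so both the vertical mean and the vertical variance rise above zero, which is the cue the text relies on.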

In each of the model-based method and the non-model-based method described later, one or more of the image feature amounts obtained above are selected and used as the image feature amounts for speech section detection.

For example, the model-based method uses all of the above image feature amounts, while the non-model-based method uses the variance in the vertical direction. This exploits the fact that when the speaker's mouth is not moving, only optical flow with small absolute values is observed, so the variance is small, whereas when the mouth is moving, regions with little motion, such as the cheeks, coexist with regions with large motion, such as the lips, so the variance is large. The image feature amount extraction unit 112 corresponds to the image feature amount extraction means for speech section detection.

The acoustic image feature amount generation unit 121 simply concatenates the acoustic feature amount obtained by the acoustic feature amount extraction unit 102 with the image feature amount obtained by the image feature amount extraction unit 112 for the corresponding frame number, thereby generating (that is, integrating) the acoustic image feature amount for speech section detection. As shown in FIG. 4, the acoustic feature amount and the image feature amount may have different frame rates. In that case, the acoustic image feature amount generation unit 121 adjusts the frame rate (frame rate adjustment processing). For example, it interpolates the feature amount with the lower frame rate in the time direction using a third-order (cubic) spline function, raising its frame rate to match the higher frame rate of the other feature amount. The adjusted frames are assigned frame numbers (IDs) that are synchronized with, that is, identical to, the frame numbers assigned by the acoustic feature amount extraction unit 102.

In the example of FIG. 4, the acoustic image feature amount generation unit 121 interpolates the image feature amount, whose frame rate is 30 Hz, with a third-order (cubic) spline function to obtain an image feature amount with a frame rate of 100 Hz, and then concatenates it with the acoustic feature amount, whose frame rate is 100 Hz, to generate an acoustic image feature amount with a frame rate of 100 Hz.
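The 30 Hz to 100 Hz adjustment followed by concatenation can be sketched as below. This is a minimal pure-Python illustration, not the embodiment's implementation: the natural boundary condition of the spline, the function names `natural_cubic_spline` and `resample`, and all sample values are assumptions made here, and only a one-dimensional image feature is resampled.

```python
from bisect import bisect_right

def natural_cubic_spline(xs, ys):
    """Return f(t) interpolating the knots (xs, ys); needs len(xs) >= 2."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    # Tridiagonal system for the second derivatives m[i] at interior knots.
    diag = [0.0] * n
    rhs = [0.0] * n
    for i in range(1, n - 1):
        diag[i] = 2.0 * (h[i - 1] + h[i])
        rhs[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    for i in range(2, n - 1):            # forward elimination
        w = h[i - 1] / diag[i - 1]
        diag[i] -= w * h[i - 1]
        rhs[i] -= w * rhs[i - 1]
    m = [0.0] * n                        # natural ends: m[0] = m[n-1] = 0
    for i in range(n - 2, 0, -1):        # back substitution
        m[i] = (rhs[i] - h[i] * m[i + 1]) / diag[i]

    def evaluate(t):
        # Find the interval [xs[i], xs[i+1]] that contains t (clamped).
        i = min(max(bisect_right(xs, t) - 1, 0), n - 2)
        a, b = xs[i + 1] - t, t - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate

def resample(values, src_hz=30.0, dst_hz=100.0):
    """Interpolate a feature track from src_hz up to dst_hz."""
    src_times = [k / src_hz for k in range(len(values))]
    spline = natural_cubic_spline(src_times, values)
    n_out = int(src_times[-1] * dst_hz + 1e-9) + 1
    return [spline(j / dst_hz) for j in range(n_out)]

image_30hz = [0.0, 0.4, 1.3, 0.9]        # hypothetical vertical flow variance
image_100hz = resample(image_30hz)
# Hypothetical one-dimensional acoustic frames (log power) at 100 Hz.
acoustic_100hz = [[-5.0 + 0.1 * j] for j in range(len(image_100hz))]
# Simple concatenation per frame number gives the acoustic image feature.
audio_visual = [a + [v] for a, v in zip(acoustic_100hz, image_100hz)]
print(len(audio_visual), len(audio_visual[0]))
```

Because both tracks share the 100 Hz frame index after resampling, the list position serves as the common frame number.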

The first multimodal VAD unit 131 uses the acoustic image feature amount obtained by the acoustic image feature amount generation unit 121 to execute both the model-based method and the non-model-based method, performing speech section detection by early integration.

Specifically, in the model-based method, a multi-stream HMM, which is a type of hidden Markov model, is created in advance. The first multimodal VAD unit 131 matches the acoustic image feature amount against this hidden Markov model (multi-stream HMM) using the Viterbi algorithm and outputs, as the result, the time series of speech and non-speech sections judged to have the highest similarity. The multi-stream HMM is stored in the storage device 50 in advance.

Here, in the time series of speech and non-speech sections, that is, among the frames arranged in order, each frame determined to belong to a speech section is a speech section candidate.

An output example is shown in FIG. 5.

In FIG. 5, α and β denote frame numbers (IDs) of the acoustic image feature amount. For example, "0" is the start frame number of a non-speech section (non-speech), and "44" is the end frame number of that non-speech section. Likewise, "45" is the start frame number of a speech section (speech), and "60" is the end frame number of that speech section. Here, frames "45" to "60" are speech section candidates. The same applies to the subsequent entries.
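An output of this kind can be derived from per-frame labels by collapsing runs of identical labels. The sketch below is illustrative only: the function name `labels_to_sections` is hypothetical, and the third section (frames 61 to 70) is invented to complete the example.

```python
from itertools import groupby

def labels_to_sections(labels):
    """Collapse per-frame labels into (start, end, label) runs, end inclusive."""
    sections = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        sections.append((start, start + length - 1, label))
        start += length
    return sections

# Frames 0-44 non-speech and 45-60 speech, as in the FIG. 5 description.
labels = ["non-speech"] * 45 + ["speech"] * 16 + ["non-speech"] * 10
for start, end, label in labels_to_sections(labels):
    print(start, end, label)
# 0 44 non-speech
# 45 60 speech
# 61 70 non-speech
```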

The multi-stream HMM is obtained by supervised training of a speech HMM and a non-speech HMM using the various feature amounts described above, extracted from the images and the audio. In this embodiment, the multi-stream HMM forms a state transition model that alternates between the HMM for the speech state (speech HMM) and the HMM for the non-speech state (non-speech HMM). The first multimodal VAD unit 131 matches the acoustic image feature amount against the speech HMM and the non-speech HMM and discriminates between the speech and non-speech states based on the log likelihood of each.

In this embodiment, when a multi-stream HMM is used for early integration, the stream weights can be adjusted as described below. Therefore, even if one of the feature amounts performs poorly, adjusting the stream weights allows the other feature amount to compensate.

That is, the output log likelihood of the multi-stream HMM can be expressed as $b_{AV}$ in Equation (1), where $O_A$ and $O_V$ denote the acoustic feature amount and the image feature amount, respectively, and $b_A(O_A)$ and $b_V(O_V)$ denote the corresponding log likelihoods.

$$b_{AV} = \lambda_A b_A(O_A) + \lambda_V b_V(O_V) \tag{1}$$

Here, $\lambda_A$ and $\lambda_V$ are the stream weights of the acoustic feature amount and the image feature amount, respectively, and satisfy Equation (2).

$$\lambda_A + \lambda_V = 1 \quad (0 \le \lambda_A,\ \lambda_V \le 1) \tag{2}$$

In the non-model-based method, on the other hand, the first multimodal VAD unit 131 converts the acoustic feature amount and the image feature amount into a score by linear combination and applies threshold processing (that is, selects the frames whose values are equal to or greater than the threshold; the same applies hereinafter) to output the time series of speech and non-speech sections. In the linear combination, each feature amount is multiplied by a parameter that weights the audio and the image before the two are summed.
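The role of the stream weights in Equations (1) and (2) can be illustrated as follows. This sketch simplifies the model-based method to a per-frame comparison of the speech and non-speech log likelihoods; the embodiment instead decodes whole sequences with the Viterbi algorithm. The function names and the log-likelihood values are hypothetical.

```python
def multistream_log_likelihood(b_a, b_v, lambda_a):
    """Equation (1), with lambda_v = 1 - lambda_a as required by Equation (2)."""
    if not 0.0 <= lambda_a <= 1.0:
        raise ValueError("lambda_a must lie in [0, 1]")
    lambda_v = 1.0 - lambda_a
    return lambda_a * b_a + lambda_v * b_v

def classify_frame(speech_ll, nonspeech_ll, lambda_a):
    """Pick the state whose weighted log likelihood is larger.

    speech_ll and nonspeech_ll are (b_A, b_V) pairs for one frame.
    """
    speech = multistream_log_likelihood(*speech_ll, lambda_a)
    nonspeech = multistream_log_likelihood(*nonspeech_ll, lambda_a)
    return "speech" if speech > nonspeech else "non-speech"

# Hypothetical noisy frame: the audio stream favours non-speech,
# while the image stream clearly favours speech.
speech_ll = (-12.0, -4.0)
nonspeech_ll = (-10.0, -9.0)
print(classify_frame(speech_ll, nonspeech_ll, lambda_a=0.8))  # non-speech
print(classify_frame(speech_ll, nonspeech_ll, lambda_a=0.3))  # speech
```

Lowering the acoustic weight lets the image stream override an audio stream that noise has made unreliable, which is the compensation effect the text describes.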

In both the model-based and the non-model-based methods, there are parameters that weight the audio and the image (namely, $\lambda_A$ and $\lambda_V$, and the parameters used in the linear combination). These parameters are either set in advance by testing so as to give the best discrimination result, or set according to conditions such as the noise level of each modality.

The acoustic image feature amount generation unit 121 and the first multimodal VAD unit 131 correspond to the first determination means.

The speech unimodal VAD unit 132 performs speech section detection by the model-based method and by the non-model-based method, using only the acoustic feature amount extracted by the acoustic feature amount extraction unit 102. The speech unimodal VAD unit 132 corresponds to the second determination means.

That is, in the model-based method, the speech unimodal VAD unit 132 uses an HMM created in advance and stored in the storage device 50, or a Gaussian mixture model (GMM), matches the HMM or the GMM against the acoustic feature amount, and outputs speech section candidates based only on the acoustic feature amount.

In the non-model-based method, the speech unimodal VAD unit 132 calculates an acoustic score from the logarithmic power (acoustic feature amount) by a known method and applies threshold processing to output speech section candidates.

When outputting a speech section candidate, the speech unimodal VAD unit 132 outputs its start frame number and end frame number together with a reliability score that indicates how likely the candidate is to be a speech section. An example of calculating the reliability score in the model-based method is described later.

In the non-model-based method, the acoustic score can serve as the reliability score. A higher acoustic score means higher reliability as a speech section. That is, in the non-model-based method, the logarithmic power of each frame is taken as the acoustic score, and the resulting acoustic score (reliability score) is used in the same way as in the model-based method.

The image unimodal VAD unit 133 performs speech section detection by the model-based method and by the non-model-based method, using only the image feature amount extracted by the image feature amount extraction unit 112. The image unimodal VAD unit 133 corresponds to the third determination means.

That is, in the model-based method, the image unimodal VAD unit 133 uses an HMM created in advance and stored in the storage device 50, or a Gaussian mixture model (GMM), matches the HMM or the GMM against the image feature amount, outputs speech section candidates (the start frame number and end frame number of each candidate; the same applies hereinafter) based only on the image feature amount, and assigns a reliability score to each.

In the non-model-based method, the image unimodal VAD unit 133 applies threshold processing to the image feature amount (the variance in the vertical direction) to determine speech section candidates from the image information, outputs those candidates, and assigns a reliability score to each.

The reliability score represents how likely a candidate is to be a speech section. An example of calculating the reliability score in the model-based method is described later.

As noted above, the acoustic feature amount and the image feature amount may have different frame rates. In that case, the image unimodal VAD unit 133 adjusts the frame rate of the image (frame rate adjustment processing) in the same way as the acoustic image feature amount generation unit 121. For example, the image unimodal VAD unit 133 interpolates an image feature amount with the lower frame rate in the time direction using a third-order (cubic) spline function, raising its frame rate to match the higher frame rate of the acoustic feature amount, and then performs speech section detection by the model-based and non-model-based methods described above.

Next, the integration processing of the second multimodal VAD unit 134 will be described.

The speech section detection in the second multimodal VAD unit 134 may use the reliability scores, may use logical operations without the reliability scores, or may do both.

The second multimodal VAD unit 134 of this embodiment does both and outputs the speech section candidates obtained in each case. The second multimodal VAD unit 134 corresponds to the fourth determination means.

(Example of calculating the reliability score)

Here, an example of calculating the reliability score in the model-based method is described.

The speech unimodal VAD unit 132 outputs, as the speech reliability score $C_a(t)$, the value of the log likelihood $L_a(t)$ at frame t output by the non-speech model, or the value obtained by multiplying its slope by a constant.

Similarly, the image unimodal VAD unit 133 outputs, as the image reliability score $C_v(t)$, the value of the per-frame log likelihood $L_v(t)$ output by the non-speech model, or the value obtained by multiplying its slope by a constant.

When these reliability scores have a positive value, the reliability as a non-speech section is high, and when they have a negative value, the reliability as a non-speech section is low.

When these reliability scores have a positive value, the reliability as a speech section is high, and when they have a negative value, the reliability as a speech section is low.

Next, the integration processing of the second multimodal VAD unit 134 will be described.

(When the reliability scores are used)

The second multimodal VAD unit 134 integrates the speech section candidates obtained by the speech unimodal VAD unit 132 and the image unimodal VAD unit 133 based on the reliability scores and outputs the resulting speech sections.

For example, the second multimodal VAD unit 134 normalizes the speech and image reliability scores, multiplies each normalized reliability score by a weight parameter λ, linearly combines them, and outputs only the speech sections for which the combined result exceeds a preset threshold. The weight parameter is set in advance according to conditions such as the noise level of each modality.

The following is an example of calculating the reliability score C(t).

$$C(t) = C_a(t) + \lambda C_v(t)$$

Here, λ is a scaling coefficient (weight parameter), $C_a(t)$ is the normalized speech reliability score, and $C_v(t)$ is the normalized image reliability score.

Here, when at least one of the speech section candidate output by the speech unimodal VAD unit 132 and the speech section candidate output by the image unimodal VAD unit 133 indicates a speech section, the second multimodal VAD unit 134 outputs the frame as a speech section candidate if C(t) is positive and as a non-speech section candidate if C(t) is negative.
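The score-based integration can be sketched per frame as follows. The function name and all values are hypothetical. Two cases that the text leaves open are handled by assumption here: frames that neither unimodal detector marks as speech are output as non-speech, and C(t) = 0 is treated as non-speech.

```python
def fuse_with_confidence(audio_is_speech, image_is_speech, c_a, c_v, lam):
    """Per-frame decision: True for speech, False for non-speech.

    c_a and c_v are the normalized reliability scores C_a(t) and C_v(t).
    """
    decisions = []
    for a, v, ca, cv in zip(audio_is_speech, image_is_speech, c_a, c_v):
        if not (a or v):
            # Assumption: no unimodal candidate means non-speech.
            decisions.append(False)
            continue
        # C(t) = C_a(t) + lambda * C_v(t); a positive value keeps the candidate.
        decisions.append(ca + lam * cv > 0.0)
    return decisions

audio = [False, True, True, False]
image = [False, True, False, True]
c_a = [-0.5, 0.6, 0.2, -0.4]
c_v = [-0.2, 0.3, -0.8, 0.9]
print(fuse_with_confidence(audio, image, c_a, c_v, lam=0.5))
# [False, True, False, True]
```

In the third frame the audio candidate is rejected because the image score is strongly negative, and in the fourth frame the image candidate survives despite a negative audio score.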

(When the reliability scores are not used)

The second multimodal VAD unit 134 applies AND integration and OR integration, using per-frame logical operations, to the speech section candidates obtained by the model-based method in the speech unimodal VAD unit 132 and the image unimodal VAD unit 133.

AND integration of the speech section candidates obtained by the model-based method labels a frame as a speech section only when the model-based detection results of both the speech unimodal VAD unit 132 and the image unimodal VAD unit 133 indicate a speech section for that frame.

OR integration of the speech section candidates obtained by the model-based method labels a frame as a speech section when the model-based detection result of either the speech unimodal VAD unit 132 or the image unimodal VAD unit 133 indicates a speech section for that frame.

Furthermore, the second multimodal VAD unit 134 applies AND integration and OR integration, using logical operations, to the speech section candidates obtained by the non-model-based method in the speech unimodal VAD unit 132 and the image unimodal VAD unit 133. That is, AND integration of the candidates obtained by the non-model-based method labels a frame as a speech section only when the non-model-based detection results of both the speech unimodal VAD unit 132 and the image unimodal VAD unit 133 indicate a speech section for that frame. OR integration of the candidates obtained by the non-model-based method labels a frame as a speech section when the non-model-based detection result of either unit indicates a speech section for that frame.
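The two logical integrations reduce to element-wise boolean operations over the frame sequence. A minimal sketch follows, with hypothetical function names and input values; the same functions apply to either the model-based or the non-model-based candidates.

```python
def and_integration(audio, image):
    """Speech only where both unimodal results indicate speech."""
    return [a and v for a, v in zip(audio, image)]

def or_integration(audio, image):
    """Speech where at least one unimodal result indicates speech."""
    return [a or v for a, v in zip(audio, image)]

audio = [False, True, True, False, True]
image = [False, True, False, True, True]
print(and_integration(audio, image))  # [False, True, False, False, True]
print(or_integration(audio, image))   # [False, True, True, True, True]
```

AND integration is the more conservative choice, since it rejects any frame that one modality misses, while OR integration accepts any frame that either modality detects.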

The third multimodal VAD unit 135 performs the final integration of the speech section results, using the model-based and non-model-based speech section candidates output by the first multimodal VAD unit 131 and the model-based and non-model-based speech section candidates output by the second multimodal VAD unit 134.

As shown in FIG. 6, this integration processing determines whether a given time frame (that is, frame number) is a speech section by the majority (majority vote) among the speech section detection results, that is, among all the speech section candidates input to the third multimodal VAD unit 135 (the majority rule).
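The majority rule can be sketched as a per-frame vote over all input detection results. The function name and the three example inputs are hypothetical, and the handling of ties (treated here as non-speech when the number of inputs is even) is an assumption, since the text does not specify it.

```python
def majority_vote(candidates):
    """candidates: one per-frame boolean list per detection result.

    A frame is speech when strictly more than half of the inputs say so.
    """
    n = len(candidates)
    return [2 * sum(frame) > n for frame in zip(*candidates)]

results = [
    [False, True, True, False],   # e.g. an early-integration result
    [False, True, False, False],  # e.g. an AND-integration result
    [True, True, True, False],    # e.g. an OR-integration result
]
print(majority_vote(results))  # [False, True, True, False]
```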

In this way, the third multimodal VAD unit 135 outputs the speech sections determined by majority vote to the speech recognition device 200.

Thus, the speech section candidates detected by early-integration multimodal speech section detection and by result-integration multimodal speech section detection are finally combined by the majority rule in the third multimodal VAD unit 135 to determine the speech sections, which makes it possible to suppress the influence of acoustic noise on speech section detection.

The third multimodal VAD unit 135 corresponds to the fifth determination means. The first multimodal VAD unit 131, the speech unimodal VAD unit 132, the image unimodal VAD unit 133, the second multimodal VAD unit 134, and the third multimodal VAD unit 135 together correspond to the speech section determination means.

The speech section detection compensation unit 201 processes the speech sections determined by the third multimodal VAD unit 135 to compensate for speech section detection errors, in a manner specialized for improving speech recognition. Specifically, as shown in FIG. 7, when a non-speech section a that is shorter than a fixed time (threshold) is sandwiched between speech sections, the speech section detection compensation unit 201 judges this non-speech section a to be a detection error and merges it into the speech section.
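The compensation step can be sketched as a single pass over the per-frame decisions. The function name is hypothetical, and expressing the threshold as a number of frames (`min_gap`) rather than as a duration is an assumption made for this example.

```python
def fill_short_gaps(is_speech, min_gap):
    """Relabel non-speech runs shorter than min_gap frames as speech,
    but only when the run has speech on both sides."""
    out = list(is_speech)
    n = len(out)
    i = 0
    while i < n:
        if out[i]:
            i += 1
            continue
        j = i
        while j < n and not out[j]:
            j += 1
        # The non-speech run covers frames i .. j-1.
        if 0 < i and j < n and (j - i) < min_gap:
            out[i:j] = [True] * (j - i)
        i = j
    return out

T, F = True, False
frames = [T, T, F, F, T, T, F, F, F, F, F, T, F]
print(fill_short_gaps(frames, min_gap=3))
# The 2-frame gap is filled; the 5-frame gap and the trailing gap are kept.
```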

Based on the speech section detection result corrected by the speech section detection compensation unit 201, the speech cutout unit 301 cuts out only the speech signal corresponding to the time sections labeled as speech sections and outputs the cut-out speech signal to the acoustic feature amount extraction unit 302.

The acoustic feature amount extraction unit 302 calculates the acoustic feature amount used for speech recognition for the sections cut out by the speech cutout unit 301. That is, for each frame of the acoustic image feature amount, it calculates the logarithmic power and the MFCCs, together with their first-order and second-order differential coefficients. The acoustic feature amount extraction unit 302 corresponds to the acoustic feature amount calculation means.

Like the speech cutout unit 301, the image cutout unit 311 outputs the image frames corresponding to the speech sections obtained from the speech section detection compensation unit 201 to the image feature amount extraction unit 312.

The image feature amount extraction unit 312 extracts the image feature amount used for speech recognition from the image frames obtained from the image cutout unit 311. The image feature amount extraction unit 312 corresponds to the image feature amount calculation means.

Specifically, the image feature amount extraction unit 312 first identifies the lips in the image frame and calculates, by a known technique, the width and height of the lips and information based on the number of detected tooth pixels as lip shape information.

Next, the image feature amount extraction unit 312 calculates the optical flow as motion information, obtains the averages of the horizontal and vertical components of the optical flow vectors in a plurality of windows set around the lips (for example, regions A, B, and C shown in FIG. 8), and calculates two parameters $m_1$ and $m_2$ from these averages as shown in Equations (3) and (4).

The parameter $m_1$ is motion information (a parameter) for the X component of the flow vectors, and the parameter $m_2$ is motion information (a parameter) for the Y component of the flow vectors. The image feature amount extraction unit 312 then concatenates and integrates (that is, linearly combines) the shape information (three dimensions) and the motion information (the two dimensions $m_1$ and $m_2$) to obtain a five-dimensional basic image feature amount. Next, the image feature amount extraction unit 312 orthogonalizes the basic image feature amount using principal component analysis to obtain principal component scores, and extracts the principal component scores obtained by this orthogonalization as the image feature amount.

The image feature amount extraction method described here is only an example, and other known methods may be used.

The acoustic image feature amount generation unit 321 simply concatenates (linearly combines) the acoustic feature amount obtained by the acoustic feature amount extraction unit 302 and the image feature amount obtained by the image feature amount extraction unit 312 to generate the acoustic image feature amount for speech recognition. The acoustic image feature amount generation unit 321 corresponds to the feature amount generation means.

If the frame rates of the acoustic image feature amount and the image feature amount differ, the acoustic image feature amount generation unit 321 performs frame rate adjustment processing before concatenation, in the same way as the acoustic image feature amount generation unit 121.

The multimodal speech recognition unit 331 performs speech recognition using the acoustic image feature amount generated by the acoustic image feature amount generation unit 321. It uses a multi-stream HMM as the model, matches the feature amount against the model with the Viterbi algorithm, and outputs the word hypothesis candidate with the highest similarity as the recognition result. The stream weight coefficients, which are parameters of the multi-stream HMM, are assumed to be set appropriately in advance. The multi-stream HMM used as the model is stored in the storage device 50 in advance.

The speech section detection device 100, the speech recognition device 200, the speech section detection program, and the ROM 30 described above have the following features.

(1) The speech section detection device 100 of this embodiment includes: the first multimodal VAD unit 131 (first determination means), which generates an acoustic image feature amount combining the acoustic feature amount and the image feature amount and determines speech sections based on that acoustic image feature amount; the speech unimodal VAD unit 132 (second determination means), which determines speech sections using only the acoustic feature amount; the image unimodal VAD unit 133 (third determination means), which determines speech sections using only the image feature amount; the second multimodal VAD unit 134 (fourth determination means), which determines speech sections by integrating the determinations of the speech unimodal VAD unit 132 and the image unimodal VAD unit 133; and the third multimodal VAD unit 135 (fifth determination means), which determines speech sections by integrating the determination results of the first multimodal VAD unit 131 and the second multimodal VAD unit 134 by the majority rule. As a result, the speech section detection device 100 can suppress the influence of acoustic noise on speech section detection, because the third multimodal VAD unit 135 performs multimodal speech section detection by the majority rule, using the speech information and the image information comprehensively. That is, by using a lip moving image in addition to the speech signal, the speech section detection device 100 of this embodiment can suppress the influence of acoustic noise on speech section detection and can detect speech sections with high accuracy even in a noisy environment.

(2) In the speech section detection device 100 of the present embodiment, the acoustic feature amount extraction unit 102 (acoustic feature amount extraction means) and the image feature amount extraction unit 112 (image feature amount extraction means) extract the acoustic feature amount and the image feature amount, respectively, by model-based and non-model-based methods. The first multimodal VAD unit 131, the speech unimodal VAD unit 132, the image unimodal VAD unit 133, and the second multimodal VAD unit 134 determine speech sections based on the feature amounts extracted by the model-based and non-model-based methods.

As a result, because the acoustic feature amounts and image feature amounts extracted by both the model-based and non-model-based methods are used, speech sections can be detected from diverse information, and can be detected with high accuracy even in a noisy environment.

(3) The speech recognition device 200 of the present embodiment includes: an acoustic feature amount extraction unit 302 (acoustic feature amount calculation means) that cuts out the speech section of the speech signal output by the speech input unit 101 (speech input means) based on the speech section determination made by the speech section detection device 100, and calculates an acoustic feature amount for speech recognition from the speech signal in the cut-out speech section; an image feature amount extraction unit 312 (image feature amount calculation means) that calculates an image feature amount for speech recognition from the image frames in the speech section determined by the speech section detection device 100; an acoustic image feature amount generation unit 321 (feature amount generation means) that generates an acoustic image feature amount for speech recognition from the acoustic feature amount and the image feature amount for speech recognition; and a multimodal speech recognition unit 331 (multimodal speech recognition means) that performs speech recognition based on the generated acoustic image feature amount for speech recognition.
As a result, the speech recognition device 200 of the present embodiment retains the advantage of conventional multimodal speech recognition using a speech signal and a lip moving image, namely robust recognition even under noise, while the speech section detection device that performs the preprocessing suppresses erroneous recognition in non-speech sections. High speech recognition performance can therefore be achieved even in a noisy environment.
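The cut-out step described in (3) can be illustrated with a minimal sketch. It assumes that the speech section detection result arrives as one speech/non-speech label per analysis frame and that each frame advances by a fixed number of audio samples; the names `frames_to_sections`, `cut_out`, and `hop` are hypothetical and do not appear in the embodiment.

```python
from typing import List, Tuple


def frames_to_sections(labels: List[bool]) -> List[Tuple[int, int]]:
    """Convert per-frame speech labels into [start, end) frame ranges."""
    sections = []
    start = None
    for i, is_speech in enumerate(labels):
        if is_speech and start is None:
            start = i                      # a speech section begins
        elif not is_speech and start is not None:
            sections.append((start, i))    # the section ended at frame i
            start = None
    if start is not None:                  # section still open at the end
        sections.append((start, len(labels)))
    return sections


def cut_out(samples: List[float], labels: List[bool], hop: int) -> List[List[float]]:
    """Slice the audio samples covered by each detected speech section.

    hop is the assumed number of samples per frame shift.
    """
    return [samples[s * hop:e * hop] for s, e in frames_to_sections(labels)]
```

Only the slices returned by `cut_out` would then be passed on to feature calculation, so non-speech frames never reach the recognizer.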

(4) The speech section detection program of the present embodiment causes the computer 10 to function as a speech input unit 101 (speech input means) that receives the speaker's speech signal and converts it into a digital signal, and as an image input unit 111 (image input means) that receives the speaker's lip moving image and converts it into image frames. The program also causes the computer 10 to function as an acoustic feature amount extraction unit 102 (acoustic feature amount extraction means) that extracts an acoustic feature amount for speech section detection from the digitized speech signal output by the speech input unit 101, as an image feature amount extraction unit 112 (image feature amount extraction means) that extracts an image feature amount for speech section detection from the image frames, and as speech section determination means that determines speech sections based on the acoustic feature amount and the image feature amount for speech section detection.

Furthermore, when the computer 10 functions as the speech section determination means, the program causes it to function as: a first multimodal VAD unit 131 (first determination means) that generates an acoustic image feature amount combining the acoustic feature amount and the image feature amount and determines a speech section based on that acoustic image feature amount; a speech unimodal VAD unit 132 (second determination means) that determines a speech section using only the acoustic feature amount; an image unimodal VAD unit 133 (third determination means) that determines a speech section using only the image feature amount; a second multimodal VAD unit 134 (fourth determination means) that determines a speech section by integrating the determinations of the speech unimodal VAD unit 132 and the image unimodal VAD unit 133; and a third multimodal VAD unit 135 (fifth determination means) that determines a speech section by integrating the determination results of the first multimodal VAD unit 131 and the second multimodal VAD unit 134 by the majority rule.

As a result, executing the speech section detection program of the present embodiment allows a computer to be readily implemented as the speech section detection device described in (1) above.

(5) The ROM 30, which serves as the recording medium of the present embodiment, stores the speech section detection program described in (4) above and is readable by the computer 10. As a result, having the computer 10 read the speech section detection program stored in the ROM 30 allows the computer to be readily implemented as the speech section detection device described in (1) above.

Embodiments of the present invention are not limited to the above embodiment, which may be modified within a range that does not depart from the gist of the invention.

In the above embodiment, the speech section detection device 100 and the speech recognition device 200 are implemented on a single computer, but they may instead be implemented on separate computers.

In the acoustic feature amount extraction unit 102 of the speech section detection device 100 of the above embodiment, the acoustic feature amount has 39 dimensions in total: 12 MFCC dimensions and log power, together with their first-order and second-order derivative coefficients. BCF (Block Cepstrum Flux) may additionally be included in the acoustic feature amount. BCF is the average of the distances between cepstrum vectors over a fixed number of frames. In a speech section the spectrum fluctuates strongly and the BCF value becomes large accordingly, so BCF can be used as an acoustic feature amount for section detection.
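The BCF description above can be sketched as follows. The text states only that BCF averages distances between cepstrum vectors over a fixed number of frames; the choice of Euclidean distance, the use of consecutive frame pairs, the sliding block, and the names `block_cepstrum_flux` and `block` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def block_cepstrum_flux(cepstra: Sequence[Sequence[float]], block: int = 5) -> List[float]:
    """Average distance between consecutive cepstrum vectors in each block.

    cepstra: one cepstrum vector per frame.
    block:   assumed number of frames per block (must be at least 2).
    Returns one flux value per block start position.
    """
    if block < 2:
        raise ValueError("block must span at least two frames")

    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Euclidean distance; the source does not name the distance measure.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    flux = []
    for start in range(0, len(cepstra) - block + 1):
        d = [dist(cepstra[t], cepstra[t + 1]) for t in range(start, start + block - 1)]
        flux.append(sum(d) / len(d))
    return flux
```

A stationary signal gives a flux near zero, while rapidly changing spectra (as in speech) give larger values, which is the property the paragraph relies on.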

As described above, MFCC, ΔMFCC, ΔΔMFCC, log power, Δ log power, and the like are used as the acoustic feature amounts extracted by the acoustic feature amount extraction unit 102, and combinations of these form an acoustic feature vector of roughly 10 to 100 dimensions. As a representative example other than the 39 dimensions described in the above embodiment, a 25-dimensional vector may be used that consists of 12 MFCC dimensions, 12 ΔMFCC dimensions, and 1 dimension for the first-order derivative of log power. The acoustic feature amount extraction unit 102 may thus extract various acoustic feature amounts and is not limited to those of the above embodiment.
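The dimension counts above (39 = (12 + 1) × 3 and 25 = 12 + 12 + 1) can be checked with a small sketch. It assumes the 12 MFCC values and the log power of each frame are already available, and it approximates the derivative coefficients with a simple symmetric difference; the embodiment does not specify the derivative formula, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]


def delta(seq: List[Vector]) -> List[Vector]:
    """Symmetric difference (x[t+1] - x[t-1]) / 2, repeating the edge frames."""
    n = len(seq)
    out = []
    for t in range(n):
        prev, nxt = seq[max(t - 1, 0)], seq[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def features_39(mfcc: List[Vector], log_power: List[float]) -> List[Vector]:
    """12 MFCC + log power (13 dims), plus first and second derivatives."""
    static = [m + [p] for m, p in zip(mfcc, log_power)]
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


def features_25(mfcc: List[Vector], log_power: List[float]) -> List[Vector]:
    """12 MFCC + 12 delta-MFCC + 1 delta log power."""
    d_mfcc = delta(mfcc)
    d_pow = delta([[p] for p in log_power])
    return [m + dm + dp for m, dm, dp in zip(mfcc, d_mfcc, d_pow)]
```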

In the image feature amount extraction unit 112 of the speech section detection device 100 of the above embodiment, the image feature amount in the non-model-based method uses only the vertical variance out of the means and variances of the vertical and horizontal components of the optical flow. However, any one of the other values, or several of them together, may be used instead.
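As a minimal sketch of the four candidate statistics, the following assumes the optical flow between two lip-image frames has already been computed and is given as a list of (horizontal, vertical) displacement pairs. Population variance is used here as an assumption; the names `flow_statistics` and `image_feature` are hypothetical.

```python
from statistics import fmean, pvariance
from typing import Dict, List, Sequence, Tuple


def flow_statistics(flow: List[Tuple[float, float]]) -> Dict[str, float]:
    """Mean and variance of the horizontal (x) and vertical (y) flow components."""
    xs = [dx for dx, _ in flow]
    ys = [dy for _, dy in flow]
    return {
        "mean_x": fmean(xs), "var_x": pvariance(xs),
        "mean_y": fmean(ys), "var_y": pvariance(ys),
    }


def image_feature(flow: List[Tuple[float, float]],
                  keys: Sequence[str] = ("var_y",)) -> List[float]:
    """Select the statistics to use; the default mirrors the vertical-variance-only case."""
    stats = flow_statistics(flow)
    return [stats[k] for k in keys]
```

Passing a different `keys` tuple, for example `("var_x", "var_y")`, corresponds to the variation in which other values or several values are used.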

In the above embodiment, the second multimodal VAD unit 134 normalizes the speech and image confidence scores, multiplies each normalized confidence score by the weight parameter λ, linearly combines them, and outputs only those speech sections for which the linear combination exceeds a preset threshold. Alternatively, the second multimodal VAD unit 134 may skip the normalization, multiply each confidence score by the weight parameter λ, linearly combine them, and output only those speech sections for which the linear combination exceeds a preset threshold. In this case, setting the weight parameter λ appropriately yields the same result as in the above embodiment.
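The weighted combination can be sketched as below. Several details are assumptions not fixed by the text: min-max normalization, the weighting λ for speech and 1 − λ for image, and per-frame thresholding. The names `min_max` and `fuse` are hypothetical.

```python
from typing import List


def min_max(scores: List[float]) -> List[float]:
    """Scale scores to [0, 1]; one possible normalization (assumed)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(audio: List[float], image: List[float], lam: float,
         threshold: float, normalize: bool = True) -> List[bool]:
    """Linearly combine per-frame confidence scores and threshold the result.

    lam weights the speech score and (1 - lam) the image score (assumed form).
    normalize=False corresponds to the variation without normalization.
    """
    if normalize:
        audio, image = min_max(audio), min_max(image)
    combined = [lam * a + (1.0 - lam) * v for a, v in zip(audio, image)]
    return [c > threshold for c in combined]
```

Without normalization the two score ranges may differ widely, so `lam` and `threshold` have to absorb the scale difference, which is why the text notes that λ must be set appropriately.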

In the above embodiment, the third multimodal VAD unit 135 performs the final integration using the speech section candidates output by the first multimodal VAD unit 131 and the second multimodal VAD unit 134. Alternatively, the third multimodal VAD unit 135 may decide by the majority rule among the speech section candidates output by the first multimodal VAD unit 131, the second multimodal VAD unit 134, the speech unimodal VAD unit 132, and the image unimodal VAD unit 133.

In the above embodiment, the speech section detection processing in the second multimodal VAD unit 134 performs both a method that uses confidence scores and a method that uses logical operations without confidence scores, and outputs speech section candidates for each. Alternatively, the second multimodal VAD unit 134 may use only the confidence-score method, or only the logical-operation method without confidence scores, and output the resulting speech section candidates to the third multimodal VAD unit 135.

In this case, the third multimodal VAD unit 135 uses the speech section candidates output by the second multimodal VAD unit 134 and those output by the first multimodal VAD unit 131 to make the final decision by the majority rule. Even in this configuration, multimodal speech section detection that uses speech information and image information comprehensively can suppress the influence of acoustic noise on speech section detection.
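The logical-operation method mentioned above could look like the following sketch. The text does not say which logical operations are used; AND and OR over per-frame unimodal decisions are assumed here purely for illustration, and `combine_logical` is a hypothetical name.

```python
from typing import List


def combine_logical(audio: List[bool], image: List[bool], op: str = "and") -> List[bool]:
    """Combine per-frame unimodal decisions without confidence scores.

    op="and": speech only where both modalities agree (stricter).
    op="or":  speech where either modality detects it (more permissive).
    """
    if op == "and":
        return [a and v for a, v in zip(audio, image)]
    if op == "or":
        return [a or v for a, v in zip(audio, image)]
    raise ValueError("op must be 'and' or 'or'")
```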

In the above embodiment, the speech section candidates detected by the speech unimodal VAD unit 132 and those detected by the image unimodal VAD unit 133 may also be input to the third multimodal VAD unit 135. In this case, the third multimodal VAD unit 135 outputs the final speech section by the majority rule from among the speech section candidates detected by the first multimodal VAD unit 131, the speech unimodal VAD unit 132, the image unimodal VAD unit 133, and the second multimodal VAD unit 134.
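A frame-wise majority decision over any number of candidates can be sketched as follows. Treating each candidate as a per-frame boolean sequence and resolving ties as non-speech are assumptions; the text specifies only that the majority rule is used. `majority_vote` is a hypothetical name.

```python
from typing import List


def majority_vote(candidates: List[List[bool]]) -> List[bool]:
    """Mark a frame as speech when more than half of the candidates say so.

    candidates: one per-frame decision sequence per VAD output.
    Ties (possible with an even number of candidates) count as non-speech,
    which is an assumed tie-breaking policy.
    """
    n = len(candidates)
    return [sum(frame) * 2 > n for frame in zip(*candidates)]
```

With four inputs (units 131 to 134) a 2-2 split is possible, so some tie-breaking policy is needed; an odd number of candidates avoids the issue.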

The third multimodal VAD unit 135 of the above embodiment uses the model-based and non-model-based speech section candidates output by the first multimodal VAD unit 131, together with the model-based and non-model-based speech section candidates output by the second multimodal VAD unit 134. Each of these inputs to the third multimodal VAD unit 135 may consist of a single candidate or of several. To generate several speech section candidates, the model parameters may be set to different values in the model-based case, or the threshold may be varied in the non-model-based case.
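For the non-model-based case, generating several candidates by varying the threshold can be sketched as below. The per-frame score input and the name `candidates_from_thresholds` are hypothetical; the text says only that the threshold is changed.

```python
from typing import List


def candidates_from_thresholds(scores: List[float],
                               thresholds: List[float]) -> List[List[bool]]:
    """Produce one per-frame speech/non-speech candidate per threshold value."""
    return [[s > th for s in scores] for th in thresholds]
```

Each returned sequence can then be handed to the final integration stage as a separate speech section candidate.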

Similarly, when the speech section candidates detected by the speech unimodal VAD unit 132 and by the image unimodal VAD unit 133 are input to the third multimodal VAD unit 135 of the above embodiment, each input may consist of a single candidate or of several. To generate several speech section candidates, the parameters used for classification may be changed in the model-based case, or the threshold may be varied in the non-model-based case.

The speech recognition device 200 is provided with the speech section detection compensation unit 201, but a speech recognition device that omits the speech section detection compensation unit 201 may also be used.

In the above embodiment, the speech section detection program is stored in the ROM 30 as the recording medium, but any other computer-readable recording medium may be used. Examples of such recording media include a hard disk, a flexible disk (registered trademark), an MO, a CD, a DVD, a Blu-ray disc (registered trademark), a flash memory (registered trademark), and a USB memory.

100 ... speech section detection device,
101 ... speech input unit (speech input means),
102 ... acoustic feature amount extraction unit (acoustic feature amount extraction means for speech section detection),
111 ... image input unit (image input means),
112 ... image feature amount extraction unit (image feature amount extraction means for speech section detection),
121 ... acoustic image feature amount generation unit,
131 ... first multimodal VAD unit (constitutes the first determination means together with the acoustic image feature amount generation unit),
132 ... speech unimodal VAD unit (second determination means),
133 ... image unimodal VAD unit (third determination means),
134 ... second multimodal VAD unit (fourth determination means),
135 ... third multimodal VAD unit (fifth determination means; constitutes the speech section determination means together with the first to fourth determination means),
200 ... speech recognition device,
201 ... speech section detection compensation unit,
301 ... speech cut-out unit,
302 ... acoustic feature amount extraction unit (acoustic feature amount calculation means),
311 ... image cut-out unit,
312 ... image feature amount extraction unit (image feature amount calculation means),
321 ... acoustic image feature amount generation unit (feature amount generation means),
331 ... multimodal speech recognition unit (multimodal speech recognition means).

Claims

Voice input means for inputting a voice signal of a speaker and converting it into a digital signal;
An image input means for inputting the lip moving image of the speaker and converting it into a still image time series (hereinafter referred to as an image frame);
An acoustic feature quantity extraction means for extracting an acoustic feature quantity for voice section detection from a digitized voice signal output by the voice input means;
Image feature amount extraction means for extracting an image feature amount for voice segment detection from the image frame;
In a speech section detection device comprising speech section determination means for performing speech section determination based on the acoustic feature amount for speech section detection and the image feature amount for speech section detection,
The voice segment determination means includes
A first determination unit configured to generate an acoustic image feature amount obtained by combining the acoustic feature amount and the image feature amount, and to determine a voice section based on the acoustic image feature amount;
Second determination means for determining a speech section using only the acoustic feature amount;
Third determination means for determining a speech section using only the image feature amount;
A fourth determination unit that integrates the determinations of the second determination unit and the third determination unit to determine a speech section;
A speech section detection device comprising: fifth determination means for determining a speech section by integrating, based on a majority rule, at least the determination results of the first and fourth determination means among the first to fourth determination means.

The acoustic feature amount extraction unit and the image feature amount extraction unit extract the acoustic feature amount and the image feature amount by a model-based and non-model-based method,
The speech section detection device according to claim 1, wherein the first to fourth determination units perform speech section determination based on feature amounts extracted by the model-based and non-model-based techniques.

Acoustic feature amount calculating means for cutting out the voice section of the voice signal output by the voice input means, based on the voice section determination made by the speech section detection device according to claim 1 or 2, and calculating an acoustic feature amount for voice recognition from the voice signal in the cut-out voice section;
Image feature amount calculating means for calculating an image feature amount for speech recognition from an image frame in the speech section determined by the speech section detecting device;
Feature quantity generating means for generating an acoustic image feature quantity for voice recognition using the acoustic feature quantity for voice recognition and the image feature quantity for voice recognition;
A speech recognition apparatus comprising multimodal speech recognition means for performing speech recognition based on a generated acoustic image feature quantity for speech recognition.

On the computer,
Voice input means for inputting a voice signal of a speaker and converting it into a digital signal;
An image input means for inputting the lip moving image of the speaker and converting it into a still image time series (hereinafter referred to as an image frame);
An acoustic feature quantity extraction means for extracting an acoustic feature quantity for voice section detection from a digitized voice signal output by the voice input means;
Image feature amount extraction means for extracting an image feature amount for voice segment detection from the image frame;
A program for functioning as a voice section determination unit that performs voice section determination based on the acoustic feature quantity for voice section detection and the image feature quantity for voice section detection,
The voice segment determination means includes
A first determination unit configured to generate an acoustic image feature amount obtained by combining the acoustic feature amount and the image feature amount, and to determine a voice section based on the acoustic image feature amount;
Second determination means for determining a speech section using only the acoustic feature amount;
Third determination means for determining a speech section using only the image feature amount;
A fourth determination unit that integrates the determinations of the second determination unit and the third determination unit to determine a speech section;
A program comprising: fifth determination means for determining a speech section by integrating at least the determination results of the first and fourth determination means based on the majority rule among the first to fourth determination means.

On the computer,
Voice input means for inputting a voice signal of a speaker and converting it into a digital signal;
An image input means for inputting the lip moving image of the speaker and converting it into a still image time series (hereinafter referred to as an image frame);
An acoustic feature quantity extraction means for extracting an acoustic feature quantity for voice section detection from a digitized voice signal output by the voice input means;
Image feature amount extraction means for extracting an image feature amount for voice segment detection from the image frame;
A computer-readable recording medium storing a program for functioning as voice section determination means for performing voice section determination based on the acoustic feature quantity for voice section detection and the image feature quantity for voice section detection,
The voice segment determination means includes
A first determination unit configured to generate an acoustic image feature amount obtained by combining the acoustic feature amount and the image feature amount, and to determine a voice section based on the acoustic image feature amount;
Second determination means for determining a speech section using only the acoustic feature amount;
Third determination means for determining a speech section using only the image feature amount;
A fourth determination unit that integrates the determinations of the second determination unit and the third determination unit to determine a speech section;
A computer-readable recording medium comprising: fifth determination means for determining a speech section by integrating, based on the majority rule, at least the determination results of the first and fourth determination means among the first to fourth determination means.