JP2015155982A - Voice section detection device, speech recognition device, method thereof, and program

Publication number: JP2015155982A
Application number: JP2014031276A
Authority: JP (Japan)
Prior art keywords: voice, speech, noise, reverberation, signal
Legal status: Granted; Active
Other versions: JP6106618B2 (granted publication)
Language: Japanese (ja)
Inventors: Noriyoshi Kamado (鎌土 記良), Masakiyo Fujimoto (藤本 雅清), Keisuke Kinoshita (木下 慶介), Yuji Aono (青野 裕司)
Assignee: Nippon Telegraph and Telephone Corp
Application filed by Nippon Telegraph and Telephone Corp

Abstract

PROBLEM TO BE SOLVED: To provide a speech segment detection technique that can recognize the voice of a main speaker with high accuracy, not only in a quiet environment but also in a noisy one, from the speech of a plurality of speakers mixed into a single microphone in a real environment.
SOLUTION: A speech segment detection device 10 includes: a noise suppression unit 110 that uses a speech model to suppress the noise contained in a speech digital signal comprising speech, noise, and reverberation, yielding a noise-suppressed speech digital signal; a reverberation estimation unit 120 that estimates the reverberation component contained in the noise-suppressed speech digital signal, yielding a reverberation signal; a main speaker speech feature enhancement unit 140 that obtains a noise- and reverberation-suppressed speech digital signal as the difference between the noise-suppressed speech digital signal and the reverberation signal; and a main speaker identification unit 150 that uses the speech model to identify, from the noise- and reverberation-suppressed speech digital signal, the main-speaker speech segment, i.e., the segment in which the main speaker is talking.
COPYRIGHT: (C)2015,JPO&INPIT

Description

The present invention relates to a technique for detecting speech segments in a speech digital signal, and to a speech recognition technique applied to the detected speech segments.

Patent Document 1 is known as a noise removal technique in which information such as parameters is shared closely between speech-segment estimation and noise removal; by handling the two in an integrated manner, it enables highly accurate speech-segment estimation and noise removal. However, Patent Document 1 takes no account of main-speaker speech versus other-speaker speech. Here, the main speaker is the person of interest, and main-speaker speech is that person's speech; an other speaker is any person other than the main speaker, and other-speaker speech is such a person's speech.

Non-Patent Document 1 is known as a technique that, in a mobile environment, focuses on the energy ratio between the direct sound and the reflected sound of other-speaker speech, and detects the main-speaker speech segment from reverberation components estimated by multi-step linear prediction.

Patent Document 1: JP 2009-210647 A

Non-Patent Document 1: Noriyoshi Kamado, Satoshi Kobashikawa, Keisuke Kinoshita, Hirokazu Masataki, Satoshi Takahashi, "Application of a dereverberation method to main-speaker speech segment detection in mobile speech recognition," Proceedings of the Acoustical Society of Japan Meeting, 2013, pp. 145-146.

However, although Patent Document 1 can distinguish speech from non-speech sound, it cannot separate other-speaker speech from main-speaker speech and judges all speech as "speech segments"; consequently, it cannot estimate the main-speaker speech segment.

In Non-Patent Document 1, when the input signal contains noise, the accuracy of reverberation estimation degrades under the influence of that noise, and the extraction accuracy for main-speaker speech can degrade as a result. A configuration that performs noise suppression before reverberation estimation is also conceivable, but because noise suppression, reverberation suppression, and speech segment detection would then be carried out independently, the noise may be suppressed too strongly, degrading the reverberation estimate and, in turn, the detection accuracy for main-speaker speech.

An object of the present invention is to provide a speech segment detection technique that can recognize main-speaker speech with high accuracy, not only in a quiet environment but also in a high-noise environment, from multi-speaker speech (other-speaker speech in addition to main-speaker speech) mixed into a single microphone in a real environment.

To solve the above problem, according to one aspect of the present invention, a speech segment detection device includes: a noise suppression unit that uses a speech model to suppress the noise contained in a speech digital signal comprising speech, noise, and reverberation, obtaining a noise-suppressed speech digital signal; a reverberation estimation unit that estimates the reverberation component contained in the noise-suppressed speech digital signal, obtaining a reverberation signal; a main speaker speech feature enhancement unit that obtains a noise- and reverberation-suppressed speech digital signal as the difference between the noise-suppressed speech digital signal and the reverberation signal; and a main speaker identification unit that uses the speech model to identify, from the noise- and reverberation-suppressed speech digital signal, the main-speaker speech segment, i.e., the segment in which the main speaker is talking.
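The four units above form a simple processing chain. The sketch below illustrates only the data flow; every function body is a hypothetical stand-in (energy thresholding, a toy delayed echo, mean removal), not the patent's model-based processing:

```python
import numpy as np

def suppress_noise(x):
    """Stand-in for the noise suppression unit 110 (s110)."""
    return x - np.mean(x)  # hypothetical placeholder for model-based suppression

def estimate_reverberation(x, delay=160):
    """Stand-in for the reverberation estimation unit 120 (s120)."""
    d = np.zeros_like(x)
    d[delay:] = 0.3 * x[:-delay]  # toy "late reverberation" estimate
    return d

def emphasize_main_speaker(x, d, gain=0.9):
    """Feature enhancement unit 140 (s140): difference between the
    noise-suppressed signal and the gain-adjusted reverberation signal."""
    return x - gain * d

def identify_main_speaker(y, frame=160, threshold=0.01):
    """Stand-in for the main speaker identification unit 150 (s150):
    per-frame energy thresholding replaces model-based likelihoods."""
    frames = y[: len(y) // frame * frame].reshape(-1, frame)
    return (frames ** 2).mean(axis=1) > threshold

x = np.random.randn(16000) * 0.1     # 1 s of synthetic input at 16 kHz
x_ns = suppress_noise(x)             # s110
d = estimate_reverberation(x_ns)     # s120
y = emphasize_main_speaker(x_ns, d)  # s130 + s140
segments = identify_main_speaker(y)  # s150: one boolean per 10 ms frame
```

The chain returns a per-frame boolean mask; in the actual device, the identification step reuses the speech model rather than a fixed energy threshold.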

To solve the above problem, according to another aspect of the present invention, a speech segment detection method includes: a noise suppression step in which a noise suppression unit uses a speech model to suppress the noise contained in a speech digital signal comprising speech, noise, and reverberation, obtaining a noise-suppressed speech digital signal; a reverberation estimation step in which a reverberation estimation unit estimates the reverberation component contained in the noise-suppressed speech digital signal, obtaining a reverberation signal; a main speaker speech feature enhancement step in which a main speaker speech feature enhancement unit obtains a noise- and reverberation-suppressed speech digital signal as the difference between the noise-suppressed speech digital signal and the reverberation signal; and a main speaker identification step in which a main speaker identification unit uses the speech model to identify, from the noise- and reverberation-suppressed speech digital signal, the main-speaker speech segment, i.e., the segment in which the main speaker is talking.

For multi-speaker speech mixed into a single microphone in a real environment, main-speaker speech can be recognized with high accuracy not only in a quiet environment but also in a high-noise environment.

FIG. 1: Functional block diagram of the speech segment detection device according to the first embodiment.
FIG. 2: Example of the processing flow of the speech segment detection device according to the first embodiment.
FIG. 3: Functional block diagram of the speech-segment-detection noise suppression unit according to the first embodiment.
FIG. 4: Example of the processing flow of the speech-segment-detection noise suppression unit according to the first embodiment.
FIG. 5A: Image of the speech analog signal corresponding to the main speaker's voice; FIG. 5B: image of the speech analog signal corresponding to non-main-speaker sound.
FIG. 6A: Vector image of the reverberation signal corresponding to the main speaker's voice; FIG. 6B: vector image of the reverberation signal corresponding to non-main-speaker sound.
FIG. 7A: Vector image of the difference corresponding to the main speaker's voice; FIG. 7B: vector image of the difference corresponding to non-main-speaker sound.
FIG. 8: Diagram explaining the arrangement of the speech segment detection device according to the first embodiment and the speech recognition device according to the second embodiment.

Embodiments of the present invention are described below. In the drawings used in the following description, components with the same function and steps performing the same processing are given the same reference numerals, and duplicate explanation is omitted. Unless otherwise noted, processing described per element of a vector or matrix applies to every element of that vector or matrix.

<Points of the first embodiment>
The main-speaker speech contained in the noise-suppressed signal produced by the processing of Patent Document 1 is emphasized using a reverberation estimation method, for example that of Reference 1.
[Reference 1] International Publication No. WO 2007/100137
Reference 1 exploits the fact that, in a mobile environment, the reverberation components contained in main-speaker speech and in other-speaker speech recorded with a single microphone differ substantially, so the main speaker's voice can be emphasized with high accuracy. Moreover, because the input is not the raw single-microphone signal but a signal in which the noise component has been suppressed, reverberation estimation is more accurate than when the original signal is processed as-is.

Furthermore, in this embodiment, a signal in which both the noise and the reverberation of the original signal have been suppressed is input to the noise removal device of Patent Document 1, and the speech-segment decision is performed once more.

With this configuration, speech segments can be computed statistically with a speech model, based on the speech features of a signal in which the main speaker's voice has been accurately emphasized; this permits main-speaker speech segment detection with higher accuracy than thresholding the reverberation signal (a log signal obtained by smoothing its power), as in Non-Patent Document 1.

<Speech segment detection device according to the first embodiment>
FIG. 1 is a functional block diagram of the speech segment detection device 10, and FIG. 2 shows an example of its processing flow. The speech segment detection device 10 can recognize the main speaker's voice with high accuracy by removing other-speaker speech, silence, and noise segments from the input speech (hereinafter also called the "speech analog signal") used for speech recognition.

First, the main speaker's speech segment contained in the signal input to a single microphone is extracted from the reverberation components contained in that input signal. The main-speaker speech is then converted into speech features, and with these as input a speech likelihood is computed together with a speech model for speech segment detection, extracting the main speaker's speech segment with high accuracy in a statistical framework. Using this speech segment in turn enables highly accurate recognition of the main-speaker speech. Ordinarily, speech and noise from sources other than the main speaker are not judged as non-speech; they are fed into recognition, and the recognition result wells up as misrecognition. Extracting only the main speaker's speech segment with high accuracy therefore reduces the adverse effect of out-of-target speech and noise on the speech recognition system.

The speech segment detection device 10 includes a speech signal acquisition unit 100, a speech-segment-detection noise suppression unit 110, a reverberation estimation unit 120, a gain adjustment unit 130, a main speaker speech feature enhancement unit 140, and a main speaker speech segment extraction unit 160; the speech-segment-detection noise suppression unit 110 includes a main speaker identification unit 150. The device receives the speech analog signal picked up by the microphone 90 and outputs a speech digital signal that corresponds to the main speaker's utterance segments and in which noise components are suppressed (hereinafter also called the "noise-suppressed speech digital signal"). Each unit is described in detail below.

<Speech signal acquisition unit 100>
Input: speech analog signal
Output: speech digital signal
The speech signal acquisition unit 100 receives an analog speech signal (speech analog signal), converts it into a digital speech signal (speech digital signal) (s100), and outputs it.

<Speech-segment-detection noise suppression unit 110>
Input: speech digital signal
Output: noise-suppressed speech digital signal
The speech-segment-detection noise suppression unit 110 receives the speech digital signal, suppresses the noise it contains using a speech model to obtain a noise-suppressed speech digital signal (s110), and outputs it to the main speaker speech feature enhancement unit 140, the reverberation estimation unit 120, and the main speaker speech segment extraction unit 160.

For example, the speech-segment-detection noise suppression unit 110 performs noise suppression and speech segment detection simultaneously and is realized by the noise removal device of Patent Document 1. An overview of its processing follows.

FIG. 3 is a functional block diagram of the speech-segment-detection noise suppression unit 110, and FIG. 4 shows an example of its processing flow.

The speech-segment-detection noise suppression unit 110 includes an acoustic signal analysis unit 111, a model parameter storage unit 112, a forward estimation unit 113, a backward estimation unit 114, a parameter storage unit 115, a state probability ratio calculation unit 116, a speech signal section estimation unit 117, and a noise removal unit 118.

The acoustic signal analysis unit 111 receives the speech digital signal, extracts its speech features frame by frame (a frame being a fixed time interval), and outputs them (s111).

Prior to use, the model parameter storage unit 112 stores, as the speech model mentioned above, the parameters of probability models that express the output probabilities of a clean speech signal and of a silence signal, each as a Gaussian mixture containing multiple normal distributions.

The forward estimation unit 113 receives the speech features and the probability model parameters of the clean speech signal and the silence signal stored in the model parameter storage unit 112, sequentially estimates the noise model parameters of the current frame, from past frames toward the current frame, with parallel non-linear Kalman filters, and outputs the result (s113).

The backward estimation unit 114 receives the noise model parameters output by the forward estimation unit 113 and the probability model parameters of the clean speech signal and the silence signal stored in the model parameter storage unit 112. It sequentially estimates the noise model parameters of the current frame backward, from future frames toward the current frame, with a parallel Kalman smoother; based on these backward-estimated noise model parameters, it sequentially estimates the parameters of probability models that express the output probabilities of the speech (noise + clean speech) signal and the non-speech (noise + silence) signal, each as a Gaussian mixture, and computes and outputs the output probabilities of the speech signal and the non-speech signal (s114).

The parameter storage unit 115 stores the intermediate results produced in the processing of the forward estimation unit 113 and the backward estimation unit 114 (s115).

The state probability ratio calculation unit 116 receives the output probabilities of the speech and non-speech signals, computes the speech state probability, the non-speech state probability, and the ratio of the speech state probability to the non-speech state probability, and outputs them (s116).

The speech signal section estimation unit 117 receives the state probability ratio, compares it with a threshold for each frame, and outputs a decision indicating whether each frame belongs to the speech state or the non-speech state (s117).
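The per-frame decision of s117 reduces to comparing the speech/non-speech state probability ratio with a threshold. A minimal sketch of that decision rule; the threshold value and the epsilon guard are illustrative assumptions:

```python
import numpy as np

def vad_decision(p_speech, p_nonspeech, threshold=1.0, eps=1e-12):
    """Label a frame as speech (True) when the ratio of the speech-state
    probability to the non-speech-state probability exceeds the threshold."""
    p_speech = np.asarray(p_speech, dtype=float)
    p_nonspeech = np.asarray(p_nonspeech, dtype=float)
    ratio = p_speech / (p_nonspeech + eps)  # guard against division by zero
    return ratio > threshold

# three frames: clearly speech, clearly non-speech, weakly speech
flags = vad_decision([0.9, 0.2, 0.6], [0.1, 0.8, 0.4])
```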

The noise removal unit 118 receives the per-distribution means that are the probability model parameters of the speech and non-speech signals, the per-distribution means that are the probability model parameters of the clean speech and silence signals, and the speech and non-speech state probabilities. It takes the relative values of the clean-speech and silence means with respect to the speech and non-speech means, forms their weighted average using the speech and non-speech state probabilities to generate a frequency response filter that removes the noise signal, converts the frequency response filter into an impulse response filter, and convolves this impulse response filter with the speech digital signal to generate and output the noise-suppressed speech digital signal (s118).
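The filter construction in s118 can be sketched as follows: a per-frequency gain is the weighted average, by state probability, of the ratios of the clean/silence model means to the speech/non-speech model means; the gain is converted to an impulse response and convolved with the signal. All numeric values, shapes, and the flat spectra below are illustrative assumptions, not the patent's model parameters:

```python
import numpy as np

def build_and_apply_filter(x, mu_clean, mu_speech, mu_silence, mu_nonspeech,
                           p_speech, p_nonspeech):
    """Sketch of s118: probability-weighted per-bin gains, applied via the
    corresponding impulse response."""
    gain = p_speech * (mu_clean / mu_speech) + p_nonspeech * (mu_silence / mu_nonspeech)
    h = np.fft.irfft(gain)            # frequency response -> impulse response
    # full convolution truncated to the input length keeps alignment with x
    return np.convolve(x, h)[:len(x)]

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)         # stand-in speech digital signal
bins = 129                            # e.g. a 256-point FFT
mu_clean = np.full(bins, 1.0)
mu_speech = np.full(bins, 2.0)        # noisy-speech mean exceeds clean mean
mu_silence = np.full(bins, 0.1)
mu_nonspeech = np.full(bins, 1.0)
y = build_and_apply_filter(x, mu_clean, mu_speech, mu_silence, mu_nonspeech,
                           p_speech=0.8, p_nonspeech=0.2)
```

With the flat spectra above the gain is 0.8·0.5 + 0.2·0.1 = 0.42 in every bin, so the filter reduces to a simple attenuation; with frequency-dependent means it shapes the spectrum per bin.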

In Patent Document 1, noise removal is applied only to frames belonging to the speech state, based on the decision of the speech signal section estimation unit 117, so the processing of that unit is required; when noise removal is applied to all frames, the processing of the speech signal section estimation unit 117 may be omitted.

The speech-segment-detection noise suppression unit 110 includes the main speaker identification unit 150, whose processing is described later.

<Reverberation estimation unit 120>
Input: noise-suppressed speech digital signal
Output: reverberation signal
The reverberation estimation unit 120 estimates the reverberation component contained in the noise-suppressed speech digital signal (s120) and obtains the reverberation signal. An outline of the estimation method follows.

As shown in Equation (1), the original speech signal s(z) is a white signal u(z) passed through a short auto-regressive (AR) process. Let the Z-transform of the AR process be v(z) = 1/(1-b(z)), with 1-b(z) a polynomial.

s(z) = v(z) u(z) = u(z) / (1 - b(z))   (1)

This original speech signal s(z) propagates through the room, and the signal x(z) observed at the microphone is expressed from Equation (1) as follows.

x(z) = h(z) s(z) = h(z) v(z) u(z)   (2)

Here, h(z) denotes the room transfer function from the sound source to the microphone. The speech signal has a strong short-term correlation governed by v(z). By applying the pre-whitening process of Equation (3), a linear prediction that removes this short-term correlation, v(z) can be regarded as a nearly white signal, so v(z) ≈ 1 holds.

x~(n) = x(n) - Σ_{p=1}^{P} b(p) x(n-p)   (3)
(where P is the short-term prediction order)

Here, b(p) is a linear prediction coefficient that effectively suppresses v(z), obtained from Equation (4).

Σ_{p=1}^{P} b(p) r(|i-p|) = r(i),   i = 1, ..., P   (4)

Here, r(i) denotes the autocorrelation coefficient of the observed signal x(z) at a lag of i samples. This linear prediction is performed with a filter length of 30 ms and is expected to remove the speech's short-term correlation and the early-reflection components contained within 30 ms.
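The pre-whitening step can be sketched with NumPy: estimate the short-term coefficients b(p) from the autocorrelation of the observed signal by solving the normal equations directly, then subtract the predicted part. The prediction order below is an illustrative assumption (30 ms at 16 kHz would correspond to an order of several hundred; a small order keeps the example light):

```python
import numpy as np

def prewhiten(x, order=16):
    """Remove short-term correlation: x~(n) = x(n) - sum_p b(p) x(n-p)."""
    # autocorrelation r(0), ..., r(order)
    r = np.array([np.dot(x[: len(x) - i], x[i:]) for i in range(order + 1)])
    # normal equations R b = r(1..order), with R[i][j] = r(|i-j|)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    b = np.linalg.solve(R, r[1 : order + 1])
    # predict x(n) from its immediate past, then keep the whitened residual
    pred = np.zeros_like(x)
    for p in range(1, order + 1):
        pred[p:] += b[p - 1] * x[:-p]
    return x - pred

rng = np.random.default_rng(1)
# a strongly correlated AR(1) signal: whitening should shrink lag-1 correlation
x = np.zeros(4000)
for n in range(1, 4000):
    x[n] = 0.9 * x[n - 1] + rng.standard_normal()
xw = prewhiten(x)
```

For the AR(1) input, the solved coefficients are close to (0.9, 0, ..., 0) and the residual is nearly white.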

With D the step size (delay) and L the filter length, the reverberation signal d(n) can be formulated as follows.

d(n) = Σ_{l=0}^{L-1} a(l) x~(n - D - l)   (5)

Here, a(l) (Roman letter l) is a linear prediction coefficient, and x~(n) is the pre-whitened observation obtained by Equation (3). The z-transform a(z) of a(l) is given by Equation (6).

[Equation (6), given as an image in the original publication, expresses a(z) in terms of h(z), its minimum-phase component h_min(z), and min[h_max(z)].]

Here, h_min(z) and h_max(z) denote the minimum-phase component of h(z) (the component corresponding to zeros inside the unit circle of the Z-plane) and the maximum-phase component (the component corresponding to zeros outside the unit circle), respectively, and min[h_max(z)] denotes a function that converts h_max(z) to minimum phase.

In general, D is set to a value corresponding to 10-200 ms, and L to a value corresponding to 100-500 ms. This method is described in detail in, for example, Reference 1.
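Equation (5) is a delayed ("multi-step") linear prediction: the current sample is predicted only from samples at least D in the past, so the prediction captures late reverberation while the direct sound, which cannot be predicted across the gap, remains in the residual. A small least-squares sketch, with D and L far below the 10-200 ms / 100-500 ms values quoted above so the example stays light:

```python
import numpy as np

def multistep_lp_reverb(x, D=32, L=64):
    """Estimate d(n) = sum_{l=0}^{L-1} a(l) x(n - D - l), fitting a(l)
    by least squares over the available samples."""
    N = len(x)
    rows = range(D + L - 1, N)
    # data matrix: column l holds the delayed samples x(n - D - l)
    X = np.array([[x[n - D - l] for l in range(L)] for n in rows])
    y = x[D + L - 1:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    d = np.zeros(N)
    d[D + L - 1:] = X @ a      # reverberation estimate, aligned with x
    return d

rng = np.random.default_rng(3)
u = rng.standard_normal(4000)  # "direct" white excitation
x = u.copy()
x[48:] += 0.6 * u[:-48]        # one late echo at lag 48, inside [D, D+L-1]
d = multistep_lp_reverb(x)
residual = x - d               # reverberation-suppressed signal
```

Subtracting the predicted d(n) removes most of the echo energy while the unpredictable direct component survives in the residual.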

Using the method above or another existing reverberation estimation technique, the reverberation estimation unit 120 estimates the reverberation component contained in the noise-suppressed speech digital signal x(n) and obtains the reverberation signal d(n).

<Gain adjustment unit 130>
Input: reverberation signal
Output: gain-adjusted reverberation signal
The gain adjustment unit 130 receives the reverberation signal, multiplies it by a gain G (s130), and outputs the gain-adjusted reverberation signal. G is a value larger than 0 and smaller than 1, for example 0.8 to 1.0. This reduces the distortion that arises when the main speaker speech feature enhancement unit 140, described below, takes the difference between the noise-suppressed speech digital signal and the reverberation signal.
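The gain adjustment and the subsequent difference in the feature enhancement unit amount to y(n) = x(n) - G·d(n) with 0 < G < 1. A minimal sketch, with G = 0.9 as an example value in the quoted range:

```python
import numpy as np

def emphasize(x, d, G=0.9):
    """Subtract the gain-adjusted reverberation estimate from the
    noise-suppressed signal; G < 1 limits over-subtraction distortion."""
    if not 0.0 < G < 1.0:
        raise ValueError("G must lie strictly between 0 and 1")
    return x - G * d

x = np.array([1.0, 0.5, 0.25])   # toy noise-suppressed samples
d = np.array([0.2, 0.2, 0.2])    # toy reverberation estimate
y = emphasize(x, d)
```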

<Main speaker speech feature enhancement unit 140>
Input: noise-suppressed speech digital signal, gain-adjusted reverberation signal
Output: noise- and reverberation-suppressed speech digital signal
The main speaker speech feature enhancement unit 140 receives the noise-suppressed speech digital signal and the gain-adjusted reverberation signal, computes their difference (s140), and outputs it as the noise- and reverberation-suppressed speech digital signal. This signal may also be described as a speech digital signal in which the main-speaker speech has been emphasized.

FIG. 5A shows an image of the speech analog signal corresponding to the main speaker's voice, and FIG. 5B an image of the signal corresponding to another speaker's voice. As FIG. 5A shows, the signal for the main speaker's voice has a large direct sound D and a small reflected sound R (reverberation component); as FIG. 5B shows, the signal for another speaker's voice has a small direct sound D and a large reflected sound R (reverberation component). FIGS. 6A and 6B show vector images of the reverberation signal corresponding to the main speaker's voice and to another speaker's voice, respectively, and FIGS. 7A and 7B show vector images of the corresponding differences. The small arrows in FIG. 7 represent the reverberation component R' that could not be fully removed.

The main speaker speech feature enhancement unit 140 performs reverberation suppression by calculating the difference between the gain-adjusted reverberation signal and the noise-suppressed speech digital signal. As a result of this subtraction, the features of the main speaker's speech are emphasized (see FIG. 7), and the feature values of the main speaker's speech can be extracted from the speech digital signal with high accuracy.

Here, the characteristic of the main speaker's speech is that the difference between the mel-spectral features of the signal before and after dereverberation is small. By contrast, the features of another speaker's speech either change greatly under dereverberation or, when the direct wave is unclear, cannot be dereverberated at all because of the characteristics of the dereverberation method, in which case the output is silenced. Consequently, in the mel-spectral features of the dereverberated signal, the difference between the main speaker and other speakers is emphasized. Since the mel spectrum is a well-known technique, its description is omitted here. Put differently, for another speaker's speech the mel-spectral feature difference between the signal before and after dereverberation either becomes large (because dereverberation changes it greatly) or becomes 0 (because the reverberation cannot be estimated, the estimated reverberation is silenced and the signals before and after dereverberation are exactly the same). A small difference (main speaker) is clearly distinct from a large or zero difference (another speaker), so the difference between the main speaker and other speakers is emphasized.
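The discrimination just described can be illustrated with a toy frame classifier. This is only a sketch: the `small` and `large` thresholds and the use of a mean absolute mel-feature difference are illustrative assumptions, not values from the patent.

```python
import numpy as np

def classify_speaker(feat_before, feat_after, small=1e-6, large=0.1):
    """Classify a frame from the mel-feature difference before/after
    dereverberation: a zero difference (frame unchanged because reverberation
    could not be estimated) or a large difference means another speaker,
    while a small nonzero difference means the main speaker."""
    diff = float(np.mean(np.abs(np.asarray(feat_after) - np.asarray(feat_before))))
    if diff <= small:  # signals before/after dereverberation are identical
        return "other"
    return "main" if diff < large else "other"
```

With hypothetical feature vectors, an unchanged frame or a strongly changed frame maps to "other", and a slightly changed frame maps to "main".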

The processing of the gain adjustment unit 130 and the main speaker speech feature enhancement unit 140 can be realized jointly by a known technique called the spectral subtraction method (see Reference 2).
[Reference 2] Boll, S. F., "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Trans. Acoust., Speech, Signal Processing, 1979, vol. ASSP-27, pp. 113-120

<Main speaker identification unit 150>
Input: noise-reverberation-suppressed speech digital signal
Output: section information
The main speaker identification unit 150 receives the noise-reverberation-suppressed speech digital signal, identifies the main speaker speech sections from it using a speech model (s150), and outputs the identification result as section information. For example, the main speaker identification unit 150 includes an acoustic signal analysis unit 111, a model parameter storage unit 112, a forward estimation unit 113, a backward estimation unit 114, a parameter storage unit 115, a state probability ratio calculation unit 116, and a speech signal section estimation unit 117. Steps s111 to s117 are performed using the noise-reverberation-suppressed speech digital signal instead of the speech digital signal (see FIG. 4); the speech signal section estimation unit 117 in the main speaker identification unit 150 receives the state probability ratio, compares it with a threshold for each frame, and outputs, as section information, a determination result indicating whether each frame belongs to the speech state or the non-speech state (s117).
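The per-frame thresholding of the state probability ratio in s117, and the collapsing of the per-frame decisions into section information, might be sketched as follows. The function names and the default threshold are hypothetical.

```python
def label_frames(state_prob_ratios, threshold=1.0):
    """Per-frame decision: speech (1) when the speech/non-speech state
    probability ratio exceeds the threshold, non-speech (0) otherwise."""
    return [1 if r > threshold else 0 for r in state_prob_ratios]

def frames_to_sections(labels):
    """Collapse per-frame 0/1 labels into (start_frame, end_frame) pairs,
    end exclusive, as a simple form of section information."""
    sections, start = [], None
    for i, v in enumerate(labels):
        if v and start is None:
            start = i
        elif not v and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:  # speech ran to the final frame
        sections.append((start, len(labels)))
    return sections
```

For example, ratios `[0.5, 2.0, 3.0, 0.2, 4.0]` would yield two speech sections, frames 1-2 and frame 4.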

Since the main speaker's speech is emphasized in the noise-reverberation-suppressed speech digital signal, only frames corresponding to the main speaker's speech tend to be judged as belonging to the speech state, while frames corresponding to non-main-speaker sounds, including other speakers' speech, tend to be judged as belonging to the non-speech state. Here, non-main-speaker sound means any sound other than the main speaker's speech, including other speakers' speech, noise, and silence.

As a result, the noise-reverberation-suppressed speech digital signal, in which the features of the main speaker's speech are emphasized, is input to the noise removal device of Patent Document 1 (more specifically, to the acoustic signal analysis unit 111). Based on speech features in which the main speaker's speech is accurately emphasized, the speech sections can be statistically recomputed using the speech model, enabling main speaker speech section detection with higher accuracy than simple threshold calculation.

Since a delay arises in the reverberation calculation of the reverberation estimation unit 120, the times of the output speech sections are rewound by this delay to compensate for it.
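This delay compensation amounts to shifting every detected section back by the known delay. A sketch, with times in arbitrary samples and the clamping at zero being an illustrative assumption:

```python
def rewind_sections(sections, delay):
    """Shift detected (start, end) section times back by the
    reverberation-estimation delay, clamping at time zero."""
    return [(max(0, s - delay), max(0, e - delay)) for s, e in sections]
```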

In this way, by using a model similar to the speech model used in noise suppression, the speech signal section estimation technique and the noise suppression technique are handled in an integrated manner, achieving highly accurate speech section estimation and noise suppression.

<Main speaker speech section extraction unit 160>
Input: noise-suppressed speech digital signal, section information
Output: noise-suppressed speech digital signal corresponding to the voice of the main speaker
The main speaker speech section extraction unit 160 receives the noise-suppressed speech digital signal and the section information, extracts the portion corresponding to the main speaker's voice from the noise-suppressed speech digital signal using the section information (s160), and outputs it as the output value of the speech section detection device 10.

For example, when a start time and an end time are used as the section information, a value of 1 is assigned to the samples between the start time and the end time. Further, to secure a margin around the start and end times, margin processing is performed that additionally assigns 1 to the N samples (a sample length corresponding to 0.1 to 0.4 ms) immediately before the start time and to the M samples (a sample length corresponding to 0.1 to 0.4 ms) immediately after the end time at which the value switches from 1 to 0. The main speaker's speech can then be extracted by multiplying the noise-suppressed speech digital signal, sample by sample, by this margin-processed main speaker speech section (that is, a time-sample sequence that is 1 from N samples before the start time to M samples after the end time and 0 elsewhere).
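The margin processing and sample-wise multiplication described above can be sketched as below; the function name is illustrative, and `start`, `end`, and the margins are given in samples.

```python
import numpy as np

def extract_main_speaker(signal, start, end, n_margin, m_margin):
    """Build a 0/1 time-sample mask that is 1 from n_margin samples before
    `start` to m_margin samples after `end` (clamped to the signal bounds),
    and multiply it into the noise-suppressed signal sample by sample."""
    signal = np.asarray(signal, dtype=float)
    mask = np.zeros(len(signal))
    lo = max(0, start - n_margin)
    hi = min(len(signal), end + m_margin)
    mask[lo:hi] = 1.0
    return signal * mask
```

Everything outside the margin-extended section is zeroed, leaving only the main speaker's portion of the signal.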

When a signal in which main-speaker-speech-section flags are attached to the noise-suppressed speech digital signal is used as the section information, margin processing is applied to that signal (that is, the main-speaker-speech-section flag is also attached to the N samples and M samples of the noise-suppressed speech digital signal at the beginning and end of the section, respectively), and the noise-suppressed speech digital signal corresponding to the flagged portions is extracted. When a signal in which non-main-speaker-sound-section flags are attached to the noise-suppressed speech digital signal is used as the section information, margin processing is applied to the unflagged portions of the noise-suppressed speech digital signal, and the noise-suppressed speech digital signal corresponding to the portions without the non-main-speaker-sound-section flag is extracted.
<Effect>
With respect to the speech of multiple speakers mixed into a single microphone in a real environment, the main speaker's speech can be recognized with high accuracy not only in quiet environments but also in high-noise environments. As a result, the number of microphones can be reduced, and the hardware configuration can be made lighter.
<Modification>
The main speaker speech section extraction unit 160 may use the original speech digital signal or the noise-reverberation-suppressed speech digital signal instead of the noise-suppressed speech digital signal. Even in that case, the main speaker's speech can be extracted. However, when speech recognition is performed in a later stage, the recognition accuracy is considered to be highest when the noise-suppressed speech digital signal is used.

When the speech section detection device 10 receives a speech digital signal as its input signal, it does not necessarily need to include the speech signal acquisition unit 100.

The speech section detection device 10 does not necessarily need to include the gain adjustment unit 130. In that case, the main speaker speech feature enhancement unit 140 uses the reverberation signal as-is, without gain adjustment.

The speech section detection device 10 does not necessarily need to include the main speaker speech section extraction unit 160. In that case, the output value (section information) of the main speaker identification unit 150 is output as the output value of the speech section detection device 10.

The main speaker identification unit 150 need not be a part of the speech section detection noise suppression unit 110. The essential point is that, by using a model similar to the speech model used in noise suppression, the speech signal section estimation technique and the noise suppression technique can be handled in an integrated manner.

<Second embodiment>
FIG. 8 is a diagram for explaining the arrangement of the speech section detection device 10 and the speech recognition device 800. The speech section detection device 10 is placed in front of the speech recognition device 800.

The speech recognition device 800 takes a speech signal as input and performs speech recognition using the signal obtained by the speech section detection device 10 described above. Here, "speech signal" is a concept that includes the speech analog signal, the speech digital signal, the noise-suppressed speech digital signal, and the noise-reverberation-suppressed speech digital signal.

For example, the speech recognition device 800 receives the noise-suppressed speech digital signal corresponding to the main speaker's voice obtained by the speech section detection device 10 described above and outputs the speech recognition result.

Alternatively, for example, the section information may be output as the output value of the speech section detection device 10 (see the modification of the first embodiment). When the start and end times of at least one of the main speaker speech section and the non-main-speaker sound section are used as the section information, speech recognition is performed on the noise-suppressed speech digital signal corresponding to the section information, and the speech recognition result is output.

Also, for example, when the section information is output as the output value of the speech section detection device 10 (see the modification of the first embodiment) and a signal in which flags for at least one of the main speaker speech section and the non-main-speaker sound section are attached to the speech digital signal is used as the section information, speech recognition is performed on the noise-suppressed speech digital signal flagged as a main speaker speech section, and the speech recognition result is output.
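The overall arrangement of the second embodiment, a detector placed in front of a recognizer, can be sketched as below. Here `detect_sections` and `recognize` stand in for the speech section detection device 10 and the recognition back end; both are purely hypothetical placeholders, not APIs from the patent.

```python
def recognize_main_speaker(signal, detect_sections, recognize):
    """Second-embodiment pipeline sketch: run the speech section detector
    first, then apply the recognizer only to the detected main speaker
    sections of the (noise-suppressed) signal."""
    results = []
    for start, end in detect_sections(signal):
        results.append(recognize(signal[start:end]))
    return results
```

Because the recognizer only ever sees the detected main speaker sections, non-main-speaker sound, silence, and noise never reach it.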

In this way, by using the speech digital signal corresponding to the main speaker's voice (the noise-suppressed speech digital signal or the noise-reverberation-suppressed speech digital signal) or the section information obtained by the speech section detection device 10, non-main-speaker sound, silence, noise, and the like can be removed from the input speech (speech signal) used for speech recognition, and speech recognition can be performed only on the main speaker's speech, improving its accuracy. Normally, non-main-speaker sound and noise are not judged as non-speech and are passed to the recognizer, so spurious recognition results emerge as misrecognitions; by detecting only the main speaker speech sections with high accuracy, the speech section detection device 10 can reduce the adverse effects of out-of-target speech and noise on the speech recognition system.
<Other modifications>
The present invention is not limited to the above embodiments and modifications. For example, the various processes described above may be executed not only in time series as described, but also in parallel or individually according to the processing capability of the device executing them or as needed. Other changes can be made as appropriate without departing from the spirit of the present invention.
<Program and recording medium>
The various processing functions of each device described in the above embodiments and modifications may also be realized by a computer. In that case, the processing content of the functions each device should have is described by a program, and by executing this program on a computer, the various processing functions of each device are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Furthermore, the program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or transferred from a server computer in its own storage unit. When executing a process, the computer reads the program stored in its own storage unit and executes the process according to the read program. As another embodiment of the program, the computer may read the program directly from the portable recording medium and execute processes according to it. Furthermore, each time a program is transferred from the server computer to this computer, a process according to the received program may be executed successively. Alternatively, the above processes may be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized only through execution instructions and result acquisition. The program here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In addition, although each device is configured by executing a predetermined program on a computer, at least part of these processing contents may be realized in hardware.

Claims (4)

1. A speech section detection device comprising:
a noise suppression unit that, using a speech model, suppresses noise contained in a speech digital signal containing speech, noise, and reverberation to obtain a noise-suppressed speech digital signal;
a reverberation estimation unit that estimates a reverberation component contained in the noise-suppressed speech digital signal to obtain a reverberation signal;
a main speaker speech feature enhancement unit that obtains a noise-reverberation-suppressed speech digital signal, which is the difference between the noise-suppressed speech digital signal and the reverberation signal; and
a main speaker identification unit that, using the speech model, identifies from the noise-reverberation-suppressed speech digital signal a main speaker speech section, which is a section in which the main speaker is speaking.

2. A speech recognition device that receives a speech signal as input and performs speech recognition on the speech signal using a signal output from the speech section detection device according to claim 1.

3. A speech section detection method comprising:
a noise suppression step in which a noise suppression unit, using a speech model, suppresses noise contained in a speech digital signal containing speech, noise, and reverberation to obtain a noise-suppressed speech digital signal;
a reverberation estimation step in which a reverberation estimation unit estimates a reverberation component contained in the noise-suppressed speech digital signal to obtain a reverberation signal;
a main speaker speech feature enhancement step in which a main speaker speech feature enhancement unit obtains a noise-reverberation-suppressed speech digital signal, which is the difference between the noise-suppressed speech digital signal and the reverberation signal; and
a main speaker identification step in which a main speaker identification unit, using the speech model, identifies from the noise-reverberation-suppressed speech digital signal a main speaker speech section, which is a section in which the main speaker is speaking.

4. A program for causing a computer to function as the speech section detection device according to claim 1 or the speech recognition device according to claim 2.
JP2014031276A 2014-02-21 2014-02-21 Speech section detection device, speech recognition device, method thereof, and program Active JP6106618B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2014031276A JP6106618B2 (en) 2014-02-21 2014-02-21 Speech section detection device, speech recognition device, method thereof, and program

Publications (2)

Publication Number Publication Date
JP2015155982A true JP2015155982A (en) 2015-08-27
JP6106618B2 JP6106618B2 (en) 2017-04-05

Family

ID=54775315

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2014031276A Active JP6106618B2 (en) 2014-02-21 2014-02-21 Speech section detection device, speech recognition device, method thereof, and program

Country Status (1)

Country Link
JP (1) JP6106618B2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007100137A1 (en) * 2006-03-03 2007-09-07 Nippon Telegraph And Telephone Corporation Reverberation removal device, reverberation removal method, reverberation removal program, and recording medium
JP2009210647A (en) * 2008-02-29 2009-09-17 Nippon Telegr & Teleph Corp <Ntt> Noise canceler, method thereof, program thereof and recording medium
US20130218560A1 (en) * 2012-02-22 2013-08-22 Htc Corporation Method and apparatus for audio intelligibility enhancement and computing apparatus

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017067862A (en) * 2015-09-28 2017-04-06 富士通株式会社 Voice signal processor, voice signal processing method and program
CN110853622A (en) * 2019-10-22 2020-02-28 深圳市本牛科技有限责任公司 Method and system for sentence segmentation by voice
CN110853622B (en) * 2019-10-22 2024-01-12 深圳市本牛科技有限责任公司 Voice sentence breaking method and system

Also Published As

Publication number Publication date
JP6106618B2 (en) 2017-04-05

Similar Documents

Publication Publication Date Title
JP6553111B2 (en) Speech recognition apparatus, speech recognition method and speech recognition program
CN107910011B (en) Voice noise reduction method and device, server and storage medium
JP4842583B2 (en) Method and apparatus for multisensory speech enhancement
EP2381702B1 (en) Systems and methods for own voice recognition with adaptations for noise robustness
KR20170060108A (en) Neural network voice activity detection employing running range normalization
CN111370014A (en) Multi-stream target-speech detection and channel fusion
JPH09212196A (en) Noise suppressor
JP4975025B2 (en) Multisensory speech enhancement using clean speech prior distribution
JP2014502468A (en) Audio signal generation system and method
JP2011191423A (en) Device and method for recognition of speech
JP6464005B2 (en) Noise suppression speech recognition apparatus and program thereof
JP6151236B2 (en) Noise suppression device, method and program thereof
JP6374936B2 (en) Speech recognition method, speech recognition apparatus, and program
JP2015019124A (en) Sound processing device, sound processing method, and sound processing program
JP2011203700A (en) Sound discrimination device
CN110992967A (en) Voice signal processing method and device, hearing aid and storage medium
CN112309417A (en) Wind noise suppression audio signal processing method, device, system and readable medium
KR20220022286A (en) Method and apparatus for extracting reverberant environment embedding using dereverberation autoencoder
JP2006349723A (en) Acoustic model creating device, method, and program, speech recognition device, method, and program, and recording medium
JP6265903B2 (en) Signal noise attenuation
JP6106618B2 (en) Speech section detection device, speech recognition device, method thereof, and program
JP4891805B2 (en) Reverberation removal apparatus, dereverberation method, dereverberation program, recording medium
JP2005258158A (en) Noise removing device
JP4098647B2 (en) Acoustic signal dereverberation method and apparatus, acoustic signal dereverberation program, and recording medium recording the program
KR101610708B1 (en) Voice recognition apparatus and method

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20160222

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20170223

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20170228

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20170306

R150 Certificate of patent or registration of utility model

Ref document number: 6106618

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150