JP2007017620A

JP2007017620A - Utterance section detecting device, and computer program and recording medium therefor

Info

Publication number: JP2007017620A
Application number: JP2005197804A
Authority: JP
Inventors: Tatsuya Kawahara; 達也河原; Yusuke Kida; 祐介木田
Original assignee: Kyoto University NUC
Current assignee: Kyoto University NUC
Priority date: 2005-07-06
Filing date: 2005-07-06
Publication date: 2007-01-25

Abstract

PROBLEM TO BE SOLVED: To provide an utterance section detecting device capable of showing a constant performance level, regardless of the noise conditions.

SOLUTION: The utterance section detecting device 80 includes a feature value calculation part 102 for calculating two or more kinds of feature values for each frame of speech data; a feature value integrating part 106 which weights the two or more kinds of calculated feature values with respective predetermined weights to calculate an integrated score; an utterance section discriminating part 108, which discriminates between an utterance section and a non-utterance section for each frame of the speech data; a reference data storage part 126 and a label file creating part 122, which prepare data 124 with a label indicating the utterance section or the non-utterance section for each frame; and an initialization control part 130 and a weighting-updating part 128, which use the labelled data 124 as learning data and learn the weighting of the two or more kinds of feature values in the feature value integrating part 106 so as to optimize discrimination errors in the utterance section discriminating part 108.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to speech recognition technology, and in particular to an utterance section detecting device for accurately detecting utterance sections in a speech signal prior to speech recognition.

One of the most important challenges in current speech recognition technology is achieving robust recognition in noisy environments. Known approaches to this problem include noise suppression methods such as spectral subtraction and Wiener filtering, and model adaptation to noise by MLLR (maximum likelihood linear regression) or PMC (parallel model combination).

In addition to these, utterance section detection is a very important component technology for speech recognition in noisy environments.

FIG. 1 shows the concept and role of utterance section detection. Referring to the left side of FIG. 1, in speech recognition processing, utterance section detection processing 62 is performed on speech input 60. Speech recognition processing 64 is then performed on the speech data contained in the sections determined to be utterance sections.

Referring to the right side of FIG. 1, suppose for example that speech data 30 includes utterance sections 42 and 46 and other sections 40, 44 and 48. Large waveform excursions are seen in utterance sections 42 and 46; these are produced by the speaker's utterances. Sections 40, 44 and 48 also show some waveform activity, but this is mainly noise. Utterance section detection processing 62 cuts out of speech data 30 only the speech waveforms 50 and 52 contained in utterance sections 42 and 46. More specifically, utterance section detection processing 62 determines the start time and the end time of each utterance section. Speech recognition results 54 and 56 are then obtained by performing speech recognition on the speech data lying between the start time and the end time of each utterance section.

As is clear from FIG. 1, if utterance sections are not detected correctly, the subsequent recognition processing is very unlikely to succeed, because speech recognition is then performed on data that includes noise portions. A technique for determining utterance sections as accurately as possible is therefore needed. Utterance section detection has been actively studied in recent years, and various processing methods have been proposed.

Patent Document 1 discloses an utterance section detecting device that detects utterance sections by measuring the energy of speech data. In the technique disclosed in Patent Document 1, the speech energy threshold used for utterance section detection is varied so as to follow changes in the environmental noise contained in the speech data.
Patent Document 1: JP 2005-31632 A

As described above, various methods for utterance section detection have been proposed. In particular, the technique described in Patent Document 1 is expected to enable utterance section detection that is robust against changes in the noise environment. Nevertheless, whether with the technique of Patent Document 1 or with other techniques, there is still room for improvement in the accuracy of utterance section detection. A particular problem is that the performance of many of these methods depends heavily on the noise conditions (for example, the type of noise). Speech recognition technology is expected to be used in a wide variety of environments in the future. A technique for utterance section detection that shows constant performance under any noise condition is therefore needed.

Accordingly, one object of the present invention is to provide an utterance section detecting device capable of showing constant performance regardless of the noise conditions.

Another object of the present invention is to provide an utterance section detecting device capable of showing higher performance than the conventional techniques regardless of the noise conditions.

Still another object of the present invention is to provide an utterance section detecting device capable of detecting utterance sections with higher accuracy than the conventional techniques regardless of the noise conditions.

An utterance section detecting device according to a first aspect of the present invention is a device for detecting utterance sections in speech data. It includes: feature value calculation means for calculating a plurality of predetermined kinds of feature values for each frame of the speech data; feature value integration means for applying, for each frame of the speech data, a predetermined weight to each of the plurality of kinds of feature values calculated by the feature value calculation means, integrating them, and calculating an integrated score; and utterance section identification means for discriminating between an utterance section and a non-utterance section for each frame of the speech data based on the integrated score calculated by the feature value integration means. The device further includes: labeled data preparation means for preparing labeled data in which each frame carries a label indicating an utterance section or a non-utterance section; and weight learning means for learning, using the labeled data prepared by the labeled data preparation means as learning data, the weights applied to the plurality of kinds of feature values in the feature value integration means so that the identification error in the utterance section identification means satisfies a predetermined criterion.

The plurality of kinds of feature values are weighted, based on the learning data, so that the identification error satisfies a predetermined criterion, and are then integrated to obtain an integrated score. This integrated score is used to discriminate between utterance sections and non-utterance sections of the speech data. Because the weights for the plurality of kinds of feature values are obtained by learning, an appropriate weight is calculated for each feature value according to the noise environment, and utterance and non-utterance sections can be discriminated with constant accuracy regardless of the noise environment.

Preferably, the plurality of kinds of feature values are selected from the group consisting of the amplitude level of the speech signal in each frame, the number of zero crossings of the speech signal in each frame, the spectral information of the speech signal in each frame, and the GMM log likelihood in each frame.

Appropriate weights obtained by learning are applied to a plurality of kinds of feature values selected from among these existing feature values. As a result, the utterance section detecting device according to the present invention can, in most cases, detect utterance sections with higher accuracy than when any of these existing feature values is used alone. This result was also confirmed by experiment.

More preferably, the labeled data preparation means includes: speech data acquisition means for acquiring, during operation of the utterance section detecting device, speech data corresponding to a given reference utterance; means for preparing in advance an acoustic model for the given reference utterance; and means for labeling each frame of the speech data acquired by the speech data acquisition means as an utterance section or a non-utterance section by performing forced alignment of that speech data with the acoustic model for the given reference utterance.

Speech data corresponding to the reference utterance is acquired during operation of the utterance section detecting device, and labeled data is prepared by forced alignment using the acoustic model for the reference utterance. Forced alignment against a reference utterance whose content is known in advance can be performed relatively accurately. As a result, the learning data can be labeled accurately while the utterance section detecting device is actually operating, so that accurate weights suited to the actual noise environment can be calculated.

Still more preferably, the feature value integration means includes: means for integrating the plurality of kinds of feature values and calculating the integrated score by applying, for each frame of the speech data, a predetermined weight to each of the plurality of kinds of feature values calculated by the feature value calculation means and adding the results; and weight storage means for storing the weights used for the predetermined weighting. The weight learning means includes weight update means for updating the weights stored in the weight storage means according to a predetermined correction criterion, using the labeled data prepared by the labeled data preparation means as learning data, so that the identification error in the utterance section identification means becomes smaller.

The weight update means may include means for updating the weights stored in the weight storage means by minimum classification error training with respect to the identification error in the utterance section identification means, using the labeled data prepared by the labeled data preparation means as learning data.

Experiments showed that, when the weights are learned by minimum classification error training, good accuracy is obtained even with a small number of reference utterances.

A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to operate as any of the utterance section detecting devices described above.

A recording medium according to a third aspect of the present invention is a computer-readable recording medium on which this computer program is recorded.

The configuration and operation of an utterance section detecting device according to one embodiment of the present invention, and its implementation by a computer, are described below. The device according to this embodiment has two characteristic features: it assigns a weight to each of four kinds of feature values and detects utterance sections using the value obtained by integrating the weighted feature values, and it optimizes the weights assigned to the feature values immediately after operation starts.

In the following description and drawings, the same parts are denoted by the same reference numbers. Their functions and names are also the same, so detailed descriptions of them are not repeated. The following description treats the utterance section detecting device as part of a desktop voice response system. Needless to say, however, the device according to the present invention is not limited to such a use and is applicable to any system that needs to detect utterance sections, such as general speech recognition processing. In this specification, "optimization" does not necessarily mean setting the device to the most preferable condition; it also covers setting the device to a state that is at least somewhat more preferable than the initial state.

The following embodiment uses four kinds of feature values. The present invention is not limited to such an embodiment, however, and any configuration that uses a plurality of kinds of feature values may be adopted. In the following embodiment, the initial values of the weights assigned to these feature values are all the same. Again the present invention is not limited to this: specific values determined in advance by experiment may be used as the initial values, or the initial values may be set at random.

In the following description, two frame lengths, 100 msec and 25 msec, are used as the unit of speech processing, and the choice between them depends on the kind of feature value. Frames of about this length are used because the speech data can be regarded as constant, without change, while each feature value is calculated. The frame length is therefore not limited to these values and can be chosen as appropriate within a range that does not interfere with the processing. The frame shift is 10 msec, and this shift can likewise be chosen as appropriate within a range that does not interfere with the processing.

<Configuration>
FIG. 2 shows, in block diagram form, the configuration of an utterance section detecting device 80 according to one embodiment of the present invention. The utterance section detecting device 80 according to this embodiment forms part of, for example, a voice response system having a speech recognition device and a speech synthesis function, and shares components with other functional parts of the voice response system.

Referring to FIG. 2, the utterance section detecting device 80 includes an utterance section detection processing unit 90 and a weight optimization unit 92. The utterance section detection processing unit 90 detects utterance sections from the output of an A/D conversion processing unit 86, which samples, quantizes and digitizes the analog signal output by a microphone 82 and outputs it as a digital speech signal. The weight optimization unit 92 performs processing to optimize the weights used in the utterance section detection processing unit 90 when the utterance section detecting device 80 is powered on. For this weight optimization processing, the weight optimization unit 92 outputs speech data for generating a message that prompts the user to speak a predetermined reference utterance, and the message is output via a speech synthesizer 94 and a speaker 84.

The utterance section detection processing unit 90 includes a first framing processing unit 87 for dividing the digital speech signal output by the A/D conversion processing unit 86 into frames 100 msec long with a shift of 10 msec, and a second framing processing unit 88 for dividing the speech signal output by the A/D conversion processing unit 86 into frames 25 msec long with a shift of 10 msec.
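The two framing operations can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment: the function name `frame_signal` and the 16 kHz sampling rate are assumptions introduced here, since the source does not state a sampling rate.

```python
def frame_signal(samples, sample_rate, frame_ms, shift_ms=10):
    """Split samples into overlapping frames of frame_ms, advancing shift_ms each time."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


signal = [0.0] * 16000                          # one second at an assumed 16 kHz
long_frames = frame_signal(signal, 16000, 100)  # role of the first framing unit 87
short_frames = frame_signal(signal, 16000, 25)  # role of the second framing unit 88
```

Because both framings use the same 10 msec shift, frame t of one stream starts at the same sample as frame t of the other, although the 25 msec stream yields a few more frames at the end of the signal.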

The utterance section detection processing unit 90 further includes the following. A feature value calculation unit 102 calculates and outputs two kinds of feature values from the frame data output by the first framing processing unit 87 and two more kinds of feature values from the frame data output by the second framing processing unit 88. A selection unit 100 has a first input 132 that receives the feature values of sample data supplied from the weight optimization unit 92 and a second input 134 that receives the feature values output by the feature value calculation unit 102, and selects and outputs one of the first input 132 and the second input 134 according to a control signal supplied from the weight optimization unit 92. A weight storage unit 104 stores the weights assigned to the four kinds of feature values. A feature value integration unit 106 integrates the four kinds of feature values supplied from the selection unit 100, using the weights stored in the weight storage unit 104, to calculate an integrated score. An utterance section identification unit 108 compares the integrated score obtained from the feature value integration unit 106 with a threshold learned in advance, thereby identifying each frame as an utterance section or a non-utterance section, and outputs each frame with a label attached.
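The per-frame labels output by the utterance section identification unit can be turned into the start and end times of utterance sections mentioned in the background description. The following Python sketch is one simple way to do this; the function name `labels_to_sections` and the representation of labels as booleans are assumptions made here for illustration, and the source does not describe any smoothing of the frame labels.

```python
def labels_to_sections(labels, shift_ms=10):
    """Convert per-frame speech/non-speech labels into (start_ms, end_ms) sections."""
    sections = []
    start = None
    for i, is_speech in enumerate(labels):
        if is_speech and start is None:
            start = i                                   # an utterance section begins
        elif not is_speech and start is not None:
            sections.append((start * shift_ms, i * shift_ms))
            start = None                                # the section has ended
    if start is not None:                               # section runs to the end of the data
        sections.append((start * shift_ms, len(labels) * shift_ms))
    return sections
```

For example, with the 10 msec frame shift of the embodiment, the labels `[False, True, True, False, True]` give the sections 10 to 30 msec and 40 to 50 msec.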

The feature value calculation unit 102 includes: an amplitude level feature calculation unit 140 for calculating the speech waveform amplitude level feature f^(1) based on the amplitude level of the frame data output by the first framing processing unit 87; a zero-crossing count feature calculation unit 142 for calculating the zero-crossing count feature f^(2) based on the number of zero crossings in the frame data output by the first framing processing unit 87; a spectral information feature calculation unit 144 for calculating the spectral information feature f^(3) based on the spectral information of the frame data output by the second framing processing unit 88; and a GMM log likelihood feature calculation unit 146 for calculating the GMM log likelihood feature f^(4) based on the GMM log likelihood of the frame data output by the second framing processing unit 88. The feature values calculated by each part of the feature value calculation unit 102 are described below.

(1) Amplitude level of the speech waveform
The amplitude level of the speech waveform is the most basic feature used for utterance section detection and is implemented in a variety of speech recognition systems. The amplitude level E_t for the t-th frame is obtained by the following equation.

Here, s_n is the value obtained by applying a Hamming window of length N to the sample signal in the frame.

In this embodiment, the feature values in the noise section are assumed to be known, and for the amplitude level the ratio to the noise level is used. That is, the feature value f_t^(1) calculated by the amplitude level feature calculation unit 140 is as follows.

Here, E_n is the amplitude level in the noise section. This feature value is calculated using data with a frame length of 100 msec.
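The amplitude level feature can be illustrated in Python as below. The equations themselves are not reproduced in this text, so the sketch rests on stated assumptions: E_t is taken to be the base-10 logarithm of the summed squares of the Hamming-windowed samples, which is a common definition but is not confirmed by the source, and the feature is the ratio E_t / E_n as the text describes. The function names are hypothetical.

```python
import math


def hamming(n_samples):
    """Standard Hamming window of the given length."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def amplitude_level(frame):
    """Assumed form of E_t: log10 of the energy of the Hamming-windowed frame."""
    window = hamming(len(frame))
    energy = sum((w * x) ** 2 for w, x in zip(window, frame))
    return math.log10(energy + 1e-12)   # small floor avoids log10(0) on silence


def amplitude_feature(frame, noise_level):
    """f_t^(1): ratio of the frame's amplitude level to the noise-section level E_n."""
    return amplitude_level(frame) / noise_level
```

Under this assumed definition the ratio is meaningful only when the levels are positive, which holds for samples on an integer PCM scale; with samples normalized to the range -1 to 1 a different normalization would be needed.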

(2) Number of zero crossings (ZCR)
The number of zero crossings is the number of times the signal level crosses zero within a fixed period. This value becomes large in speech sections, and this phenomenon can therefore be used to detect utterance sections. In practice, however, a constant bias value is generally set in place of zero, and crossings that stay within the range of the bias are not counted. Denoting this feature value by f_t^(2), the feature value f_t^(2) calculated by the zero-crossing count feature calculation unit 142, like the amplitude level, uses the ratio to the value in the noise section and is expressed as follows.

Here, Z_t is the number of zero crossings in the input frame, and Z_n is the number of zero crossings in the noise section. This feature value is also calculated using data with a frame length of 100 msec.
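A minimal Python sketch of the zero-crossing feature follows. The text does not say exactly how the bias is applied; the sketch assumes a dead band of plus or minus `bias` around zero, so that a crossing is counted only when the signal moves from one side of the band to the other. The function names and this interpretation of the bias are assumptions.

```python
def zero_crossings(frame, bias=0.0):
    """Count sign changes, ignoring samples inside the dead band [-bias, +bias]."""
    count = 0
    state = 0                      # +1 above the band, -1 below it, 0 not yet known
    for x in frame:
        if x > bias:
            new_state = 1
        elif x < -bias:
            new_state = -1
        else:
            continue               # inside the band: not a crossing
        if state != 0 and new_state != state:
            count += 1
        state = new_state
    return count


def zcr_feature(frame, noise_crossings, bias=0.0):
    """f_t^(2): ratio of the frame's crossing count Z_t to the noise-section count Z_n."""
    return zero_crossings(frame, bias) / noise_crossings
```

With a bias of 1, the frame `[5, 0.1, -0.1, 0.1, -5]` counts as a single crossing, because the small oscillation around zero never leaves the dead band.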

(3) Spectral information
Techniques that extract features from the spectrum and use them for utterance section detection have been actively studied in recent years. FIG. 3 shows example spectra of speech and noise. In the example shown in FIG. 3, speech 150 has more components distributed in the low-frequency region than noise 152, while in the high-frequency region the two are almost the same. The spectrum of course varies, for both speech and noise, with its type and with the season. In this embodiment, the frequency domain is divided into several channels, and the S/N ratio is calculated for each channel. The average of the S/N ratios calculated in this way is used as the feature value based on spectral information.

The feature value f_t^(3) calculated by the spectral information feature calculation unit 144 is expressed by the following equation.

Here, B is the number of channels, S_bt is the average power of channel b in the input frame, and N_b is the average power of channel b in the noise section. This feature value is calculated using data with a frame length of 25 msec.
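The channel-wise S/N averaging can be sketched in Python as follows. The exact equation is not reproduced in this text, so the sketch makes two assumptions: channels are formed by splitting a power spectrum into B equal-width bands, and the feature is the plain arithmetic mean of the linear ratios S_bt / N_b. The original equation may instead use a logarithmic scale or a floor on each ratio. The function names are hypothetical.

```python
def channel_powers(power_spectrum, num_channels):
    """Average power in each of num_channels equal-width bands of a power spectrum."""
    size = len(power_spectrum) // num_channels
    return [sum(power_spectrum[b * size:(b + 1) * size]) / size
            for b in range(num_channels)]


def spectral_feature(frame_channel_power, noise_channel_power):
    """f_t^(3) under the stated assumption: mean over channels of S_bt / N_b."""
    ratios = [s / n for s, n in zip(frame_channel_power, noise_channel_power)]
    return sum(ratios) / len(ratios)
```

For instance, a frame whose four channel powers are 4, 2, 1 and 1 against a flat noise power of 1 in every channel yields a feature value of 2.0, dominated by the low-frequency channels, in line with the speech spectrum described for FIG. 3.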

(4) GMM log likelihood
Gaussian mixture models (GMMs) have been widely used for utterance section detection in recent years because they are easy to train statistically. Here, the log likelihood ratio between a speech GMM and a noise GMM is used as the feature value. The feature value f_t^(4) calculated by the GMM log likelihood feature calculation unit 146 is given by the following equation.

Here, Θ_s and Θ_n are the model parameter sets of the speech GMM and the noise GMM, respectively. This feature value is also calculated using data with a frame length of 25 msec.
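A log likelihood ratio between two GMMs can be computed as in the following Python sketch. It assumes diagonal-covariance GMMs represented as lists of (weight, means, variances) tuples, and it takes an arbitrary acoustic feature vector x as input; the source specifies neither the covariance structure nor the acoustic features fed to the GMMs, so both are assumptions, as are the function names.

```python
import math


def gmm_log_likelihood(x, gmm):
    """log p(x | gmm) for a diagonal-covariance GMM given as (weight, means, variances)."""
    log_terms = []
    for weight, means, variances in gmm:
        log_p = math.log(weight)
        for xi, mu, var in zip(x, means, variances):
            log_p += -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
        log_terms.append(log_p)
    peak = max(log_terms)          # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def gmm_feature(x, speech_gmm, noise_gmm):
    """f_t^(4): log likelihood under the speech GMM minus that under the noise GMM."""
    return gmm_log_likelihood(x, speech_gmm) - gmm_log_likelihood(x, noise_gmm)
```

A positive value indicates that the frame is better explained by the speech model than by the noise model, and a negative value indicates the opposite.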

Referring again to FIG. 2, the weight optimization unit 92 includes the following. A reference data storage unit 126 stores in advance a message prompting the speaker to make a predetermined reference utterance when the weights stored in the weight storage unit 104 are optimized, together with a speech model for that reference utterance, and supplies the message to the speech synthesizer 94. A sample data storage unit 120 receives from the feature value calculation unit 102, and stores, the feature values obtained from the sample speech data produced when the user speaks in response to the message. A label file creation unit 122 performs forced alignment between the feature values obtained from the sample speech data stored in the sample data storage unit 120 and the speech model of the reference utterance stored in the reference data storage unit 126, thereby identifying each frame of the sample data as an utterance section or a non-utterance section, and creates a label file storing these utterance/non-utterance labels. A label file storage unit 124 stores the label file created by the label file creation unit 122.

The weight optimization unit 92 further includes a weight update unit 128 and an initialization control unit 130. The weight update unit 128 uses the label file stored in the label file storage unit 124 and the labels that the utterance section identification unit 108 assigns to each frame based on the score obtained when the feature value integration unit 106 integrates the feature values of the sample data. It recalculates the integration weights so as to minimize the difference between the two (that is, the identification error of the utterance section identification unit 108) and updates the weights stored in the weight storage unit 104. The initialization control unit 130 responds to the utterance section detecting device 80 being powered on and a reset signal being supplied by controlling each part of the weight optimization unit 92, as well as the selection unit 100 and the weight storage unit 104 in the utterance section detection processing unit 90, so that the weights stored in the weight storage unit 104 are optimized by minimum classification error (MCE) training.

-Weight optimization using MCE training-
The four feature values f_t^(1), f_t^(2), f_t^(3) and f_t^(4) described above are integrated with weights w_1, w_2, w_3 and w_4, respectively. The integrated score F(x_t) for the input frame x_t at time t is expressed by the following equation.

The weights w_k satisfy the following constraints.

In this embodiment, all weights are initialized to the same value, 0.25.

To adapt the utterance section detection device 80 to the noise environment, the weight optimization unit 92 optimizes the integration weights w_k by MCE training. The generalized probabilistic descent (GPD) method is used for the discriminative training.

・Definition of the loss function
The misclassification measure for a training sample x_t is expressed as follows.

Here, k is the correct class, which is either speech (s) or non-speech (n), and m is the incorrect class. A negative value of d_k(x_t) indicates that x_t was classified correctly.

Next, a sigmoid function, which approximates the 0/1 step function, is applied to the misclassification measure, and the loss is defined by the following equation.

Here, γ is the slope of the sigmoid function. The goal of discriminative training is to minimize this loss function by probabilistic descent.
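The two definitions above can be illustrated with a minimal Python sketch. It assumes the usual two-class MCE form, in which the misclassification measure is the incorrect-class discriminant minus the correct-class discriminant, and a logistic sigmoid with slope γ. The function names are hypothetical and the exact form in the original equations may differ.

```python
import math


def misclassification(g_correct: float, g_wrong: float) -> float:
    """Assumed form d_k(x_t) = -g_k(x_t) + g_m(x_t).

    A negative value means the correct class scored higher,
    i.e. the frame was classified correctly.
    """
    return -g_correct + g_wrong


def sigmoid_loss(d: float, gamma: float = 1.0) -> float:
    """Smooth approximation of the 0/1 step loss with slope gamma."""
    return 1.0 / (1.0 + math.exp(-gamma * d))
```

A larger γ makes the loss closer to a hard error count, at the cost of smaller gradients for frames far from the decision boundary.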

・Weight optimization
The weights are updated as follows. As stated above, w₁, w₂, w₃, and w₄ are the weights for the feature values derived from the amplitude level, the zero-crossing count, the spectral information, and the GMM log-likelihood, respectively. This embodiment imposes the constraint that these weights must always be greater than 0. To guarantee that the constraint holds throughout the MCE updates, the weights w = {w₁, w₂, w₃, w₄} are transformed into the following new set ~w. Note that the "~" used in the running text of this specification appears in the equations directly above the character that follows it.

~w is updated as follows each time a training sample is input. Here, ε_t is the learning step size, which decreases monotonically each time a sample is input.

The last term on the right-hand side of Equation (11) expands as follows.

Here,

and the elements of the final factor in Equation (12) are expressed as follows.

~w is trained in this way.

Once the training of ~w is complete, ~w is transformed back into w by the following equation.

Here, Equation (16) also includes the normalization needed to satisfy constraint (2).
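The update procedure can be sketched in Python under several labeled assumptions that are not confirmed by the text: the transform is taken to be ~w_k = log w_k, the inverse transform is an exponential followed by normalization (a softmax), the discriminant functions are g_s = F − θ and g_n = θ − F, and the step size decays as ε₀ / (1 + 0.01·t). All function and parameter names are hypothetical.

```python
import math


def softmax(v):
    """Assumed inverse transform: exponentiate, then normalize to sum to 1."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def gpd_step(w_tilde, feats, is_speech, theta, gamma, eps):
    """One GPD update of the transformed weights for one labelled frame."""
    w = softmax(w_tilde)
    score = sum(wk * fk for wk, fk in zip(w, feats))  # integrated score F(x_t)
    # With g_s = F - theta and g_n = theta - F, d = -g_k + g_m gives
    # d = -2 (F - theta) for a speech frame and +2 (F - theta) otherwise.
    sign = -2.0 if is_speech else 2.0
    d = sign * (score - theta)
    loss = 1.0 / (1.0 + math.exp(-gamma * d))
    dl_dd = gamma * loss * (1.0 - loss)  # derivative of the sigmoid loss
    # For softmax-normalized weights, dF/d(w_tilde_k) = w_k (f_k - F).
    grad = [dl_dd * sign * wk * (fk - score) for wk, fk in zip(w, feats)]
    return [wt - eps * g for wt, g in zip(w_tilde, grad)]


def train(frames, labels, theta=0.0, gamma=1.0, eps0=0.5, epochs=20):
    """Run GPD over labelled frames, starting from uniform weights of 0.25."""
    w_tilde = [math.log(0.25)] * 4
    t = 0
    for _ in range(epochs):
        for feats, is_speech in zip(frames, labels):
            eps = eps0 / (1.0 + 0.01 * t)  # monotonically decreasing step size
            w_tilde = gpd_step(w_tilde, feats, is_speech, theta, gamma, eps)
            t += 1
    return softmax(w_tilde)
```

Because the weights are recovered through an exponential and a normalization, they stay positive and sum to 1 after every update, which is the purpose of the transform described above.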

・Identification of utterance sections
After the weights w₁, w₂, w₃, and w₄ have been optimized in this way, the integrated score F(x_t) is computed, and the following two discriminant functions are used to decide between speech (utterance section) and non-speech (non-utterance section). Two discriminant functions are used because MCE training requires one discriminant function for each class to be identified.

Here, θ is a predetermined threshold on the integrated score. If g_s(x_t) is greater than g_n(x_t), x_t is judged to be a speech frame; otherwise it is judged to be a non-speech frame. This decision is made for each frame.
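The per-frame decision can be sketched as follows. The discriminant functions are assumed here to be g_s = F − θ and g_n = θ − F, which makes the comparison equivalent to thresholding the integrated score at θ. This form and the function names are illustrative assumptions.

```python
def integrated_score(feats, weights):
    """Weighted sum of the feature values of one frame."""
    return sum(w * f for w, f in zip(weights, feats))


def is_speech_frame(feats, weights, theta):
    """Return True when the speech discriminant exceeds the non-speech one."""
    score = integrated_score(feats, weights)
    g_s = score - theta  # assumed speech discriminant
    g_n = theta - score  # assumed non-speech discriminant
    return g_s > g_n
```

Under this assumed form, a frame whose score equals θ exactly is judged non-speech, because the comparison is strict.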

FIG. 4 shows, in flowchart form, the control structure of a computer program that implements the weight optimization unit 92 according to this embodiment. Referring to FIG. 4, when power is turned on, step 160 initializes (clears) each part of the utterance section detection device 80. Step 162 then sets the utterance section detection device 80 to its initial state. That is, the initialization control unit 130 instructs the selection unit 100 to set its connection so as to receive the output of the sample data storage unit 120.

In step 164, the reference data is speech-synthesized. That is, the initialization control unit 130 issues a command to the speech synthesizer 94, causing it to read from the reference data storage unit 126 the text of the message that prompts the reference utterance. The speech synthesizer 94 synthesizes speech for this message text and supplies the speech signal to the speaker 84, which converts it into sound. The message is, for example, "Please repeat 'Konnichiwa' (hello) three times."

In step 166, the reference utterance spoken by the user in response to the message is received through the microphone 82 and the A/D conversion processing unit 86, and in step 168 the four types of feature values are calculated. These feature values are stored in the sample data storage unit 120.

In step 170, forced alignment is performed between the feature values of the sample data stored in the sample data storage unit 120 and the acoustic model stored in the reference data storage unit 126, and each frame of the sample data is identified as belonging to an utterance section or a non-utterance section. Smoothing is applied at this point so that utterance sections and non-utterance sections each form large contiguous blocks. Based on the result, a label file consisting of a label for each frame, specifying utterance section or non-utterance section, is created and stored in the label file storage unit 124.

In step 172, the likelihood of the alignment is used to judge whether the result of the forced alignment is valid. If the result is not valid, control returns to step 160 and the above processing is repeated. If it is valid, control proceeds to step 174.

In step 174, the weights are calculated by MCE training. Specifically, the initialization control unit 130 controls the selection unit 100 so that the feature values supplied to the first input 132 are used. Following the MCE training described above, it then compares the labels in the label file stored in the label file storage unit 124 with the utterance section/non-utterance section identification results (labels) obtained for each frame from the integrated score computed from the feature values supplied by the selection unit 100, and controls the weight storage unit 104, the feature value integration unit 106, the utterance section identification unit 108, and the weight update unit 128 so that the disagreement between the two is minimized.

When the weight calculation is finished, the final weights are stored again in the weight storage unit 104 in step 176. In step 178, the utterance section detection device 80 is set to its normal state. That is, the selection unit 100 is set to select the feature values from the second input 134. From then on, the utterance section detection processing unit 90 detects utterance sections in real time from the feature values produced by the microphone 82, the A/D conversion processing unit 86, the first framing processing unit 87, the second framing processing unit 88, and the feature value calculation unit 102. At the same time, the initialization control unit 130 stops the operation of the sample data storage unit 120, the label file creation unit 122, and the weight update unit 128. The weights stored in the weight storage unit 104 are therefore no longer updated in subsequent processing.
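The control flow of steps 160 to 178 can be sketched as a retry loop. The callables passed in are hypothetical stand-ins for the hardware and processing units, and the attempt limit is an assumption added here so that the sketch terminates; the flowchart as described simply returns to step 160.

```python
def run_initialization(prompt_user, record_features, forced_align,
                       alignment_ok, train_weights, max_attempts=5):
    """Sketch of the power-on weight optimization sequence."""
    for _ in range(max_attempts):
        # Steps 160-162: reset state and route sample data to the integrator.
        prompt_user()                              # step 164: play the prompt
        feats = record_features()                  # steps 166-168: record, extract
        labels, likelihood = forced_align(feats)   # step 170: label each frame
        if alignment_ok(likelihood):               # step 172: validity check
            # Steps 174-176: MCE training; the caller stores the result and
            # switches to normal operation (step 178).
            return train_weights(feats, labels)
    raise RuntimeError("forced alignment never produced a valid result")
```

Passing the units in as callables keeps the sequencing separate from the signal processing, which mirrors the role of the initialization control unit 130.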

<Operation>
The operation of the utterance section detection device 80 configured as described above is as follows. It is assumed that the reference data storage unit 126 already stores the text data of a message prompting the reference utterance and a speech model used for forced alignment against the reference utterance. In actual operation, when the utterance section detection device 80 is powered on, the initialization control unit 130 shown in FIG. 2 initializes each part of the device. The initialization control unit 130 then sets the selection unit 100 of the utterance section detection processing unit 90 to its initial state, that is, to select the input at the first input 132.

The initialization control unit 130 then instructs the speech synthesizer 94 to synthesize the message that prompts the user to speak the reference utterance stored in the reference data storage unit 126. In response, the speech synthesizer 94 reads the message text from the reference data storage unit 126, synthesizes speech, and supplies the resulting speech signal to the speaker 84, which converts it into sound. Prompted by this sound, the user makes the specified utterance.

This speech is converted into an analog speech signal by the microphone 82 and supplied to the A/D conversion processing unit 86. The A/D conversion processing unit 86 samples and quantizes the speech signal, converting it to digital form, and supplies it to the first framing processing unit 87 and the second framing processing unit 88.

The first framing processing unit 87 divides the input speech data into frames of 100 msec and supplies them to the amplitude level feature value calculation unit 140 and the zero-crossing count feature value calculation unit 142. The second framing processing unit 88 divides the input speech into frames of 25 msec and supplies them to the spectral information feature value calculation unit 144 and the GMM log-likelihood feature value calculation unit 146. The frame shift is 10 msec.
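The two framing operations can be illustrated with a short Python sketch. The 16 kHz sampling rate used in the note below is taken from the experimental conditions of this description. How the 100 msec and 25 msec frames are aligned to a common frame index is not specified, so this sketch simply starts both at sample 0. The function name is hypothetical.

```python
def frame_signal(samples, sample_rate, frame_ms, shift_ms=10):
    """Split a sample sequence into overlapping fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    shift = sample_rate * shift_ms // 1000      # samples between frame starts
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, shift)]
```

At 16 kHz, a 100 msec frame holds 1600 samples, a 25 msec frame holds 400 samples, and consecutive frames start 160 samples apart.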

The amplitude level feature value calculation unit 140, the zero-crossing count feature value calculation unit 142, the spectral information feature value calculation unit 144, and the GMM log-likelihood feature value calculation unit 146 compute, from the frame data supplied to them, the amplitude level feature value (power), the zero-crossing count, the spectral information, and the GMM log-likelihood ratio, respectively, and supply them to the sample data storage unit 120. The sample data storage unit 120 stores these feature values frame by frame.

The initialization control unit 130 then controls the label file creation unit 122 to create a label file. That is, the label file creation unit 122 performs forced alignment against the acoustic model for initialization stored in the reference data storage unit 126, using the feature values stored in the sample data storage unit 120, and identifies the utterance start point and utterance end point in the sample data. Smoothing is applied so that utterance and non-utterance frames are intermixed as little as possible at the boundaries between utterance sections and non-utterance sections. Forced alignment of this kind can be regarded as a form of speech recognition, but it is easy to carry out when the content of the utterance is known in advance, as it is during initialization in this embodiment.

In this way, the label file creation unit 122 assigns an utterance section/non-utterance section label to each frame of the sample data. The labels are arranged in frame order to form the label file, which is stored in the label file storage unit 124 shown in FIG. 2.

When creation of the label file is complete, the initialization control unit 130 controls the weight storage unit 104, the feature value integration unit 106, the utterance section identification unit 108, and the weight update unit 128 to execute the MCE training that optimizes the weights. When the MCE training finishes, the resulting weights are stored anew in the weight storage unit 104. The initialization control unit 130 then sets the selection unit 100 to select the input at the second input 134 and stops the operation of the sample data storage unit 120, the label file creation unit 122, and the weight update unit 128.

Thereafter, the speech signal digitized by the microphone 82 and the A/D conversion processing unit 86 is framed by the first framing processing unit 87 and the second framing processing unit 88 and supplied to the feature value calculation unit 102, which computes the four types of feature values described above. This time the feature values bypass the sample data storage unit 120 and are supplied to the feature value integration unit 106 through the selection unit 100. The feature value integration unit 106 integrates the four types of feature values using the weights stored in the weight storage unit 104 and supplies the resulting integrated score to the utterance section identification unit 108. The utterance section identification unit 108 compares the integrated score with the threshold, judges frames whose integrated score is at or above the threshold to be utterance sections and all other frames to be non-utterance sections, attaches a label indicating utterance section or non-utterance section to each frame, and passes the result to a speech recognition device (not shown). Smoothing is applied to the utterance section decisions at this stage.
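The text does not say how the decisions are smoothed. One simple possibility, shown here purely as an assumption, is a majority vote over a sliding window of neighbouring frames, which removes isolated flips in the label sequence. The function name and window size are hypothetical.

```python
def smooth_labels(labels, half_window=2):
    """Majority vote over a window of up to 2 * half_window + 1 frames."""
    n = len(labels)
    out = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = labels[lo:hi]
        # A frame is speech when more than half of its window is speech.
        out.append(sum(window) * 2 > len(window))
    return out
```

A wider window gives larger contiguous sections but also blurs the true start and end points of an utterance.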

As described above, the utterance section detection device 80 according to this embodiment uses four types of feature values and identifies utterance sections and non-utterance sections from the integrated score obtained by combining them. The integration weights are optimized using the reference data when the utterance section detection device 80 is powered on, and this optimization determines weight values suited to the noise environment. The weights for the four types of feature values are thus adjusted according to the type of noise environment so that utterance sections and non-utterance sections can be identified accurately. As a result, a consistent level of performance is obtained regardless of the type of noise. Moreover, as the experimental results below show, the device achieves the best accuracy under almost all conditions when compared with devices that use each of the four feature values alone. It can therefore outperform the conventional techniques regardless of the type of noise.

<Experiments and evaluation>
- Task and experimental conditions -
To evaluate the effectiveness of the utterance section detection device 80 according to this embodiment, utterance detection experiments were conducted under noisy conditions. The speech data consisted of utterances by 10 speakers recorded in a soundproof room (16 kHz, 16 bits). Each speaker produced 10 utterances, each lasting about 1 to 3 seconds, with a pause of about 3 seconds inserted between utterances. Three types of noise were prepared (sensor room, machine tool, and background speech), and test data were created by superimposing them on the speech data. The sensor room noise is relatively quiet, with the air conditioner just audible. The machine tool noise contains many relatively high-frequency components, like the sound of something being cut. The background speech noise contains the voices of people talking in the background and has many components in the frequency band that overlaps the utterances to be detected.

For each of the three noise types, data were created with the S/N ratio at superposition set to 10 dB and to 15 dB. The test data therefore comprise 600 utterances in total (3 noise types × 2 S/N ratios × 10 speakers × 10 utterances). The data used for weight training are a separate set of 10 utterances by the same speakers as the test data.

In this embodiment, the noise feature values must be computed in Equations (2), (3), and (4). In these experiments, they were computed from the first second of the test data, which contains no speech. The label files for training were created by hand rather than by forced alignment.

Next, the features used for utterance section detection are described. The frame length was 100 msec for the amplitude level and the zero-crossing count, and 25 msec for the GMM log-likelihood and the spectral information. The frame period was 10 msec for every feature. The spectral information used 20 channels. The GMMs used 32 mixtures of Gaussians with diagonal covariance matrices, and their input was a 25-dimensional vector consisting of a 12-dimensional mel-cepstrum, its first-order differences (Δ), and Δ power. The speech GMM was trained on about 32,000 utterances by 304 speakers from an existing corpus of read newspaper articles, and the noise GMM was trained on three types of noise (sensor room, office, and corridor; about 20 minutes each). Of these, only the sensor room noise is also used in the evaluation data for utterance section detection. The bias value for the zero-crossing count was 300.

Utterance section detection was evaluated using the frame-based false alarm rate (FAR) and false rejection rate (FRR). FAR is the proportion of all non-speech frames that were wrongly recognized as speech, and FRR is the proportion of all speech frames that were wrongly recognized as non-speech.
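The two measures follow directly from their definitions. A minimal sketch, assuming per-frame reference and predicted labels given as booleans (True for speech), with a hypothetical function name:

```python
def far_frr(reference, predicted):
    """Frame-based false alarm rate and false rejection rate."""
    non_speech = [p for r, p in zip(reference, predicted) if not r]
    speech = [p for r, p in zip(reference, predicted) if r]
    if not non_speech or not speech:
        raise ValueError("need at least one speech and one non-speech frame")
    far = sum(1 for p in non_speech if p) / len(non_speech)
    frr = sum(1 for p in speech if not p) / len(speech)
    return far, frr
```

For example, with four speech frames of which one is missed and four non-speech frames of which one is accepted, both rates are 0.25.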

- Experimental results -
The experimental results for the six noise conditions are shown in FIGS. 5 to 10. FIG. 5 shows the results for the sensor room (10 dB), FIG. 6 for the sensor room (15 dB), FIG. 7 for the machine tool (10 dB), FIG. 8 for the machine tool (15 dB), FIG. 9 for background speech (10 dB), and FIG. 10 for background speech (15 dB).

Each figure overlays the results of utterance section detection using each feature alone with the results of this embodiment. Lines 200, 210, 220, 230, 240, and 250, plotted with "■", all show the utterance section detection results of this embodiment. "Amplitude", "Spectrum", "Zero-crossing count", and "GMM" show the results obtained when the amplitude level, the spectral information, the zero-crossing count, and the GMM log-likelihood, respectively, were used alone. In every figure, the horizontal axis is FAR and the vertical axis is FRR. In FIGS. 7 and 8, the results for the zero-crossing count lie outside the plotted range and do not appear. Each plotted point corresponds to one value of the threshold of the discriminant function (utterance section detection in the utterance section identification unit 108); the operating curves shown were obtained by repeating the experiment while varying the threshold.

First, consider the individual features. Because the sensor room noise was used to build the noise GMM, the GMM log-likelihood was expected to give the best result there, but in fact the zero-crossing count performed best. For the machine tool noise, the spectral information gave the best utterance section detection performance, and for background speech, the GMM log-likelihood did.

These results show that the best feature differs depending on the noise environment. In contrast, the utterance section detection device 80 according to this embodiment outperformed every individual feature in all noise environments. This demonstrates the effectiveness of the proposed method.

Next, a so-called closed experiment was conducted, in which the speech data used for weight training were themselves evaluated. The result for the sensor room (S/N ratio: 10 dB) is shown by the operating curve labeled "Closed" in FIG. 5. The figure shows that the "Closed" result and the result of this embodiment are almost identical. Similar results were obtained for the other noise conditions. This indicates that utterance section detection according to this embodiment is robust to variation in the utterances.

To confirm the effectiveness of weight adaptation, results obtained with the utterance section detection device 80 before weight optimization (that is, with all weights equal) were compared with results obtained after optimization. At the same time, the number of utterances used for adaptation was varied among 1, 5, and 10 to examine the resulting change in performance.

Table 1 shows the experimental results, as equal error rates (EER), for the test data in which each noise was superimposed at 10 dB. The EER is the value at the point where FAR and FRR are equal.
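On a finite set of frames, FAR and FRR are rarely exactly equal at any threshold, so the EER has to be approximated. The sketch below, offered as one possible approach rather than the procedure used in the experiments, sweeps the threshold over the observed integrated scores and reports the mean of FAR and FRR at the threshold where they are closest. A frame is treated as speech when its score is strictly greater than the threshold. The function name is hypothetical.

```python
def equal_error_rate(scores, reference):
    """Approximate EER by sweeping the threshold over the observed scores."""
    n_speech = sum(1 for r in reference if r)
    n_non = len(reference) - n_speech
    if n_speech == 0 or n_non == 0:
        raise ValueError("need at least one speech and one non-speech frame")
    best = None
    for theta in sorted(set(scores)):
        far = sum(1 for s, r in zip(scores, reference)
                  if not r and s > theta) / n_non
        frr = sum(1 for s, r in zip(scores, reference)
                  if r and s <= theta) / n_speech
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, theta)
    return best[1], best[2]  # (approximate EER, threshold where it occurs)
```

Collecting the (FAR, FRR) pairs from the same sweep would give an operating curve of the kind plotted in FIGS. 5 to 10.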

As Table 1 shows, performance in the sensor room hardly changed before and after weight adaptation. For the machine tool and background speech noise, on the other hand, detection performance improved. Overall, adaptation to the noise environment is effective. Comparing the number of utterances used for adaptation, 10 utterances gave slightly better performance than 1 utterance, but the difference was not large. This shows that weight training is sufficiently effective even with a single utterance.

<Implementation by computer>
Within the utterance section detection device 80 according to this embodiment, the weight optimization unit 92 and the utterance section detection processing unit 90 in particular can be implemented by computer hardware and a computer program executed on that hardware. As one example of such an implementation, FIG. 11 shows the external appearance of a computer system 330, and FIG. 12 shows its internal configuration.

Referring to FIG. 11, the computer system 330 includes a computer 340 having an FD (flexible disk) drive 352 and a CD-ROM (compact disc read-only memory) drive 350, a keyboard 346, a mouse 348, a monitor 342, a microphone 370 (corresponding to the microphone 82 shown in FIG. 2), and a speaker 372 (corresponding to the speaker 84 shown in FIG. 2).

Referring to FIG. 12, the computer 340 includes, in addition to the FD drive 352 and the CD-ROM drive 350, a CPU (central processing unit) 356; a bus 366 connected to the CPU 356, the FD drive 352, and the CD-ROM drive 350; a read-only memory (ROM) 358 that stores a boot-up program and the like; a random access memory (RAM) 360 that is connected to the bus 366 and stores program instructions, a system program, working data, and the like; and a sound board 368 connected to the microphone 370, the speaker 372, and the bus 366. The computer system 330 may further include a printer (not shown).

Although not shown here, the computer 340 may further include a network adapter board that provides a connection to a local area network (LAN).

The computer program that causes the computer system 330 to operate as the utterance section detection device 80 is stored on a CD-ROM 362 or an FD 364 inserted into the CD-ROM drive 350 or the FD drive 352, and is then transferred to a hard disk 354. Alternatively, the program may be transmitted to the computer 340 over a network (not shown) and stored on the hard disk 354. The program is loaded into the RAM 360 when it is executed. The program may also be loaded into the RAM 360 directly from the CD-ROM 362, from the FD 364, or over the network.

This program includes a plurality of instructions that cause the computer 340 to operate as the utterance section detection device 80 according to this embodiment. Some of the basic functions needed for this operation are provided by the operating system (OS) running on the computer 340, by third-party programs, or by modules of various toolkits installed on the computer 340. The program therefore need not include all of the functions required to implement the system and method of this embodiment. It need only include those instructions that carry out the operation of the utterance section detection device 80 described above by calling the appropriate functions or "tools" in a controlled manner so as to obtain the desired result. The operation of the computer system 330 is well known and is not repeated here.

Note that, for example, the label file storage unit 124 and the reference data storage unit 126 shown in FIG. 2 are implemented using the hard disk 354, while the sample data storage unit 120 and the weight storage unit 104 are implemented by the RAM 360. The function of the A/D conversion processing unit 86 shown in FIG. 2 is provided by the sound board 368.

The embodiment disclosed here is merely illustrative, and the present invention is not limited to the embodiment described above. The scope of the present invention is defined by each claim, read in light of the detailed description of the invention, and includes all modifications within the meaning and range of equivalency of the wording of the claims.

FIG. 1 is a diagram showing the concept of utterance section detection.
FIG. 2 is a block diagram showing the configuration of the utterance section detection device 80 according to an embodiment of the present invention.
FIG. 3 is a graph showing an example of the spectra of noise and speech.
FIG. 4 is a flowchart showing the control structure of a computer program that implements the weight optimization process in the utterance section detection device 80 according to an embodiment of the present invention.
FIG. 5 is a diagram showing experimental results when sensor room noise is superimposed on speech at an S/N ratio of 10 dB.
FIG. 6 is a diagram showing experimental results when sensor room noise is superimposed on speech at an S/N ratio of 15 dB.
FIG. 7 is a diagram showing experimental results when machine tool noise is superimposed on speech at an S/N ratio of 10 dB.
FIG. 8 is a diagram showing experimental results when machine tool noise is superimposed on speech at an S/N ratio of 15 dB.
FIG. 9 is a diagram showing experimental results when noise from talking voices is superimposed on speech at an S/N ratio of 10 dB.
FIG. 10 is a diagram showing experimental results when noise from talking voices is superimposed on speech at an S/N ratio of 15 dB.
FIG. 11 is an external view of a computer system that implements the utterance section detection device 80 according to an embodiment of the present invention.
FIG. 12 is a block diagram of the computer shown in FIG. 11.

Explanation of symbols

80 Utterance section detection device
82 Microphone
84 Speaker
86 A/D conversion processing unit
87 First framing processing unit
88 Second framing processing unit
90 Utterance section detection processing unit
92 Weight optimization unit
94 Speech synthesizer
100 Selection unit
102 Feature amount calculation unit
104 Weight storage unit
106 Feature amount integration unit
108 Utterance section identification unit
120 Sample data storage unit
122 Label file creation unit
124 Label file storage unit
126 Reference data storage unit
128 Weight update unit
130 Initialization control unit
140 Amplitude level feature amount calculation unit
142 Zero-crossing count feature amount calculation unit
144 Spectrum information feature amount calculation unit
146 GMM log likelihood feature amount calculation unit

Claims

An utterance section detection device for detecting an utterance section in speech data, comprising:
feature amount calculation means for calculating a plurality of predetermined types of feature amounts for each frame of the speech data;
feature amount integration means for calculating an integrated score for each frame of the speech data by weighting each of the plurality of types of feature amounts calculated by the feature amount calculation means and integrating the weighted feature amounts;
utterance section identification means for identifying, based on the integrated score calculated by the feature amount integration means, whether each frame of the speech data belongs to an utterance section or a non-utterance section;
labeled data preparation means for preparing labeled data in which each frame is given a label indicating an utterance section or a non-utterance section; and
weight learning means for learning, using the labeled data prepared by the labeled data preparation means as learning data, the weights that the feature amount integration means applies to the plurality of types of feature amounts, so that the identification error in the utterance section identification means satisfies a predetermined criterion.
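
The integration and identification steps recited above can be illustrated with a minimal sketch. It assumes a linear weighted sum of the per-frame feature amounts and a fixed decision threshold; the function names `integrate_scores` and `classify_frames`, the threshold value, and the toy numbers are hypothetical and are not taken from the specification.

```python
import numpy as np

def integrate_scores(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Return one integrated score per frame.

    features: array of shape (T, K), K feature amounts for each of T frames.
    weights:  array of shape (K,), one weight per feature type (assumed linear).
    """
    return features @ weights

def classify_frames(features: np.ndarray, weights: np.ndarray,
                    threshold: float = 0.0) -> np.ndarray:
    """Label a frame as utterance (True) when its score exceeds the threshold."""
    return integrate_scores(features, weights) > threshold

# Toy example: three frames, two feature types (values are illustrative only).
features = np.array([[0.9, 0.8],
                     [0.1, 0.2],
                     [0.7, 0.1]])
weights = np.array([0.5, 0.5])
labels = classify_frames(features, weights, threshold=0.5)
```

With equal weights the third frame falls below the threshold even though one of its features is high, which is the kind of trade-off that learning the weights is meant to tune.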

The utterance section detection device according to claim 1, wherein the plurality of types of feature amounts are selected from the group consisting of the amplitude level of the speech signal in each frame, the number of zero crossings of the speech signal in each frame, spectrum information of the speech signal in each frame, and the GMM log likelihood in each frame.
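
Two of the listed feature amounts are simple enough to sketch directly. The example below assumes the amplitude level is measured as the log of the mean frame energy and the zero-crossing count as the number of sign changes between adjacent samples; these definitions, the helper names, and the frame length are assumptions, and the specification may define the features differently.

```python
import numpy as np

def frame_signal(signal: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Split a 1-D signal (len >= frame_len) into frames of shape (T, frame_len)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

def log_energy(frames: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Amplitude level per frame, taken here as log mean energy (assumption)."""
    return np.log(np.mean(frames ** 2, axis=1) + eps)

def zero_crossings(frames: np.ndarray) -> np.ndarray:
    """Count sign changes between adjacent samples in each frame."""
    signs = np.signbit(frames)
    return np.sum(signs[:, 1:] != signs[:, :-1], axis=1)

# Toy signal that alternates sign on every sample.
sig = np.array([1.0, -1.0] * 4)
frames = frame_signal(sig, frame_len=4, hop=4)
zc = zero_crossings(frames)
level = log_energy(frames)
```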

The utterance section detection device according to claim 1, wherein the labeled data preparation means includes:
speech data acquisition means for acquiring, during operation of the utterance section detection device, speech data corresponding to a given reference utterance;
means for preparing in advance an acoustic model for the given reference utterance; and
means for labeling each frame of the speech data acquired by the speech data acquisition means as an utterance section or a non-utterance section by performing forced alignment of that speech data with the acoustic model for the given reference utterance.
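
Once forced alignment has produced time-stamped segments, turning them into per-frame labels is a small bookkeeping step. The sketch below assumes a hypothetical alignment output of `(start_seconds, end_seconds, unit)` tuples, a 10 ms frame shift, and silence units named `sil` and `sp`; none of these details come from the specification, and the alignment itself is not shown.

```python
def segments_to_frame_labels(segments, n_frames, frame_shift_s=0.01,
                             silence_units=("sil", "sp")):
    """Convert aligned segments into per-frame labels.

    Returns a list of length n_frames with 1 for utterance frames and
    0 for non-utterance frames. Frames not covered by any segment stay 0.
    """
    labels = [0] * n_frames
    for start_s, end_s, unit in segments:
        if unit in silence_units:
            continue  # silence segments remain labeled as non-utterance
        first = max(0, int(round(start_s / frame_shift_s)))
        last = min(n_frames, int(round(end_s / frame_shift_s)))
        for t in range(first, last):
            labels[t] = 1
    return labels

# Hypothetical alignment: silence, one speech unit, silence.
segments = [(0.00, 0.02, "sil"), (0.02, 0.05, "a"), (0.05, 0.06, "sil")]
frame_labels = segments_to_frame_labels(segments, n_frames=6)
```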

The utterance section detection device according to claim 1, wherein the feature amount integration means includes:
means for calculating the integrated score for each frame of the speech data by applying a predetermined weight to each of the plurality of types of feature amounts calculated by the feature amount calculation means and adding the weighted feature amounts together; and
weight storage means for storing the predetermined weights, and
wherein the weight learning means includes weight update means for updating, using the labeled data prepared by the labeled data preparation means as learning data, the weights stored in the weight storage means in accordance with a predetermined correction criterion so that identification errors in the feature amount integration means are reduced.

The utterance section detection device according to claim 4, wherein the weight update means includes means for updating the weights stored in the weight storage means by minimum classification error learning with respect to identification errors in the utterance section identification means, using the labeled data prepared by the labeled data preparation means as learning data.
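
As a rough illustration of minimum classification error learning applied to the weights, the sketch below uses a common textbook formulation: a misclassification measure passed through a sigmoid, followed by a per-frame gradient step. The loss shape, the learnable threshold `theta`, the parameters `gamma` and `lr`, and all function names are assumptions for illustration; the exact update rule of the specification is not reproduced here.

```python
import numpy as np

def mce_update(w, theta, x, y, gamma=1.0, lr=0.1):
    """One gradient step on a sigmoid-smoothed classification error.

    w: weights (K,), theta: decision threshold, x: feature amounts (K,),
    y: +1 for an utterance frame, -1 for a non-utterance frame.
    Returns the updated weights, updated threshold, and the loss before the step.
    """
    d = -y * (np.dot(w, x) - theta)            # d > 0 means the frame is misclassified
    loss = 1.0 / (1.0 + np.exp(-gamma * d))    # smoothed 0/1 error
    g = gamma * loss * (1.0 - loss)            # derivative of loss with respect to d
    w_new = w + lr * g * y * x                 # descent step, since dd/dw = -y * x
    theta_new = theta - lr * g * y             # descent step, since dd/dtheta = y
    return w_new, theta_new, loss

def train(X, Y, w, theta, epochs=200, **kwargs):
    """Sweep the labeled frames repeatedly, updating after every frame."""
    for _ in range(epochs):
        for x, y in zip(X, Y):
            w, theta, _ = mce_update(w, theta, x, y, **kwargs)
    return w, theta

# Toy labeled frames: two utterance frames and two non-utterance frames.
X = np.array([[1.0, 0.9], [0.9, 1.0], [0.1, 0.0], [0.0, 0.2]])
Y = np.array([1, 1, -1, -1])
w_trained, theta_trained = train(X, Y, np.array([0.5, 0.5]), 0.0, lr=0.5)
```

Because the sigmoid is differentiable, the error count becomes something a gradient method can reduce directly, which is the main reason this family of methods is used for tuning classifier parameters.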

A computer program that, when executed by a computer, causes the computer to operate as the utterance section detection device according to any one of claims 1 to 5.

A computer-readable recording medium on which the computer program according to claim 6 is recorded.