JP2007233148A

JP2007233148A - Device and program for utterance section detection

Info

Publication number: JP2007233148A
Application number: JP2006056234A
Authority: JP
Inventors: Toru Imai; 亨今井; Shoe Sato; 庄衛佐藤
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2006-03-02
Filing date: 2006-03-02
Publication date: 2007-09-13
Anticipated expiration: 2026-03-02
Also published as: JP4791857B2

Abstract

PROBLEM TO BE SOLVED: To provide a device and a program for utterance section detection that detect an utterance section at high speed and with high accuracy.
SOLUTION: The device for detecting an utterance section in input speech comprises: an acoustic analysis means for converting the input speech into acoustic features; a continuous speech recognition means for successively calculating the cumulative likelihood of each subword in synchronization with the input speech, using the acoustic features obtained by the acoustic analysis means and a subword network composed of an acoustic model and a language model established beforehand; and an utterance section detection means for successively detecting an utterance start point and an utterance end point in the input speech from the cumulative likelihood of each subword.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to an utterance section detection device and an utterance section detection program, and more particularly to a device and a program for quickly and efficiently detecting utterance sections in speech.

In speech recognition used for producing captions and metadata for broadcast programs, it is important to improve utterance detection in noisy environments and in conversations, and to improve recognition of speech in which male and female speakers are mixed. Various methods for detecting utterance sections from words, speech, and the like have therefore been proposed. Known methods include a method using short-time power (see, for example, Non-Patent Document 1 and Patent Document 1), a method based on phoneme recognition results (see, for example, Non-Patent Document 2), a method using the likelihood obtained during recognition (see, for example, Patent Document 2), and a method based on a local speech/non-speech likelihood ratio (see, for example, Patent Document 3).

The method using short-time power sets one short-time power threshold for speech and another for non-speech. When the short-time power of the input speech exceeds the speech threshold, a point shortly before that time is taken as the utterance start; when the short-time power falls below the non-speech threshold, that time is taken as the utterance end. The two thresholds are changed dynamically to follow fluctuations in the short-time power of the input speech, with the aim of reducing the influence of noise and the like.
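The two-threshold rule described above can be sketched as follows. This is a minimal illustration of the conventional method, not code from the cited documents: the function name, the `lookback` parameter (standing in for "a point shortly before"), and the use of fixed rather than dynamically adapted thresholds are all assumptions made for brevity.

```python
def detect_by_power(power, speech_thr, nonspeech_thr, lookback):
    """Return (start, end) frame pairs found by a two-threshold power rule.

    power         : per-frame short-time power values
    speech_thr    : power above this opens an utterance (assumed fixed here)
    nonspeech_thr : power below this closes the utterance (assumed fixed here)
    lookback      : hypothetical number of frames to step back for the start
    """
    segments = []
    start = None
    for t, p in enumerate(power):
        if start is None:
            if p > speech_thr:
                # The start is placed a little before the threshold crossing.
                start = max(0, t - lookback)
        elif p < nonspeech_thr:
            segments.append((start, t))
            start = None
    if start is not None:
        # Input ended while still inside an utterance.
        segments.append((start, len(power)))
    return segments
```

A word whose onset stays below `speech_thr` for longer than `lookback` frames loses its beginning, which is the weakness the description attributes to this method.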

The method based on phoneme recognition results performs continuous speech recognition in units of phonemes and identifies the portions recognized as non-speech as utterance starts and ends. The method using the likelihood obtained during recognition detects utterance sections by detecting pauses during the utterance. The method based on a local speech/non-speech likelihood ratio judges speech or non-speech independently in each short interval.

Non-Patent Document 1: P. Renevey, et al., "Entropy Based Voice Activity Detection in Very Noisy Conditions", Eurospeech-2001, pp. 1887-1890, 2001.
Patent Document 1: JP 2005-31632 A
Non-Patent Document 2: F. Kubala, et al., "The 1996 BBN Byblos HUB-4 Transcription System", DARPA Speech Recognition Workshop, pp. 90-93, 1997.
Patent Document 2: Japanese Patent Laid-Open No. 9-258765
Patent Document 3: Japanese Patent No. 3105465

However, among the utterance detection methods described above, the method using short-time power is very simple and widely used, but even when the speech contains no noise it often misses the beginnings of words such as "日本" (Japan) and "北海道" (Hokkaido), whose power does not rise sufficiently at the start of the utterance. Its detection performance for such low-S/N speech is not sufficient for practical use.

The method based on phoneme recognition results poses no problem for offline processing, but it is not suited to online processing because obtaining the phoneme recognition result involves a large delay relative to the input speech.

With the method using the likelihood obtained during recognition, the utterance end poses no problem because it is the pause itself. The utterance start, however, is not determined until a pause during or at the end of the utterance is detected, so the delay relative to the input speech becomes a problem for utterances in which pauses rarely occur, such as reading a script aloud.

Furthermore, the method based on a local speech/non-speech likelihood ratio judges speech or non-speech independently in each short interval, so the judgments vary when viewed over a long interval. Empirical smoothing such as averaging is therefore required, and it is not easy to optimize utterance section detection across various acoustic environments.

The present invention has been made in view of the above problems, and its object is to provide an utterance section detection device and an utterance section detection program that detect utterance sections quickly and with high accuracy.

To solve the above problems, the present invention employs means having the following features.

The invention of claim 1 is an utterance section detection device for detecting an utterance section in input speech, comprising: an acoustic analysis means for converting the input speech into acoustic features; a continuous speech recognition means for successively calculating the cumulative likelihood of each subword in synchronization with the input speech, using the acoustic features obtained by the acoustic analysis means and a subword network composed of a preset acoustic model and a preset language model; and an utterance section detection means for successively detecting an utterance start and an utterance end in the input speech from the cumulative likelihood of each subword.

According to the invention of claim 1, utterance sections can be detected quickly and with high accuracy. Non-speech sections that are unnecessary for speech recognition, such as silence, noise, and music, can therefore be removed automatically from the input speech so that only the speech sections to be recognized are extracted. This reduces the processing load of speech recognition and improves recognition performance.

The invention of claim 2 is characterized by having a subword network integration means for integrating the subword network using a subword acoustic model, which has one or more speaker clusters and represents the acoustic characteristics of speech and of sounds other than speech, and a subword language model, which represents the transitions between subword acoustic models.

According to the invention of claim 2, a highly accurate subword network matched to the content of the input speech can be generated.

The invention of claim 3 is characterized in that the subword network integration means configures a subword network that permits at least one of the following transitions: a transition from the utterance section detection start state to the acoustic models corresponding to non-speech of all speaker clusters; a transition, according to the subword language model, from the non-speech acoustic model to the acoustic model corresponding to speech of each speaker cluster; a transition from the non-speech acoustic model back to the utterance section detection start state in order to absorb non-speech lasting a fixed length of time; a transition, according to the subword language model, between the acoustic models corresponding to speech within each speaker cluster; a penalized transition from the acoustic model corresponding to speech of one speaker cluster to the acoustic model corresponding to speech of a different speaker cluster; a transition, according to the subword language model, from the acoustic model corresponding to speech of each speaker cluster to its acoustic model corresponding to non-speech; a transition to the utterance section detection end state according to the utterance end detection condition; and a transition from the utterance section detection end state to the utterance section detection start state.

According to the invention of claim 3, the accuracy of the subwords can be improved by performing the respective state transitions.

The invention of claim 4 is characterized in that, when the utterance start detection condition has not been satisfied for a fixed length of time after a transition from the utterance section detection start state of the subword network to an acoustic model corresponding to non-speech or speech, the continuous speech recognition means returns from the non-speech acoustic model to the utterance section detection start state, simultaneously clears intermediate recognition results such as the cumulative likelihoods of all acoustic models, updates the utterance section detection start time, and restarts continuous speech recognition in units of subwords.

According to the invention of claim 4, long stretches of non-speech that precede the utterance start can be absorbed. The utterance start can therefore be detected with high accuracy.

The invention of claim 5 is characterized in that, when detecting the utterance start, the utterance section detection means successively calculates, in synchronization with the input speech and over all input speech from the utterance section detection start time to the current time, the ratio between the maximum cumulative likelihood among the acoustic models corresponding to speech of all speaker clusters and the cumulative likelihood of the acoustic model corresponding to non-speech of the same speaker cluster that follows the utterance section detection start state, and, based on the calculated ratio and a preset threshold, detects as the utterance start the time a fixed length before the end time of the non-speech acoustic model at the head of the subword sequence showing the maximum cumulative likelihood.

According to the invention of claim 5, the utterance start can be detected quickly and with high accuracy.
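The start-point rule of claim 5 can be sketched in the log domain, where the likelihood ratio becomes a difference. This is only an illustrative sketch under stated assumptions: the function and argument names are hypothetical, the decision is assumed to fire when the log ratio exceeds the threshold (the claim says only that the ratio and a threshold are used), and the per-frame scores and the end frame of the head non-speech model are assumed to be supplied by a recognizer that is not shown.

```python
def detect_start(log_speech_best, log_sil_head, sil_end_frames, threshold, margin):
    """Return the detected utterance-start frame, or None if none is found.

    log_speech_best : per-frame log cumulative likelihood of the best
                      speech-model hypothesis (over all speaker clusters)
    log_sil_head    : per-frame log cumulative likelihood of the head
                      non-speech model of the same speaker cluster
    sil_end_frames  : per-frame end frame of the head non-speech model in
                      the current best subword sequence (from traceback)
    threshold       : assumed decision threshold on the log ratio
    margin          : fixed number of frames to step back from that end frame
    """
    for t, (ls, ln) in enumerate(zip(log_speech_best, log_sil_head)):
        if ls - ln > threshold:
            # Step back a fixed margin from where the head non-speech ended.
            return max(0, sil_end_frames[t] - margin)
    return None
```

Because the test runs frame by frame as scores arrive, the start can be reported as soon as the speech hypothesis pulls ahead, without waiting for a pause.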

The invention of claim 6 is characterized in that, when detecting the utterance end, the utterance section detection means successively calculates, in synchronization with the input speech and over all input speech from the utterance section detection start time to the current time, the ratio between the maximum cumulative likelihood among the acoustic models corresponding to non-speech that follow the acoustic models corresponding to speech of all speaker clusters and the maximum cumulative likelihood of the acoustic model corresponding to speech of the same speaker cluster, and, when the calculated ratio has exceeded a preset threshold for at least a fixed length of time, detects as the utterance end the time a fixed length before the time at which the ratio began to exceed the threshold.

According to the invention of claim 6, the utterance end can be detected quickly and with high accuracy.
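The end-point rule of claim 6 adds a duration condition to the threshold test. The sketch below is a hypothetical illustration in the log domain: the names, the frame-based units, and the input score lists are assumptions, and the recognizer that would produce the scores is not shown.

```python
def detect_end(log_sil_tail, log_speech_best, threshold, min_frames, margin):
    """Return the detected utterance-end frame, or None if none is found.

    log_sil_tail    : per-frame log cumulative likelihood of the best
                      non-speech model that follows a speech model
    log_speech_best : per-frame log cumulative likelihood of the best
                      speech model of the same speaker cluster
    threshold       : threshold on the log ratio
    min_frames      : the ratio must stay above the threshold this long
    margin          : fixed number of frames to step back from the frame
                      at which the ratio first exceeded the threshold
    """
    run_start = None
    for t, (ln, ls) in enumerate(zip(log_sil_tail, log_speech_best)):
        if ln - ls > threshold:
            if run_start is None:
                run_start = t
            if t - run_start + 1 >= min_frames:
                return max(0, run_start - margin)
        else:
            # The run was interrupted, e.g. by a short pause inside speech.
            run_start = None
    return None
```

The duration condition is what keeps short pauses inside an utterance from being reported as its end.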

The invention of claim 7 is characterized in that the utterance section detection means outputs the speech of the utterance section from the input speech based on the time information of the utterance start and the utterance end.

According to the invention of claim 7, the speech of the utterance section can be output quickly and with high accuracy based on the time information of the utterance start and the utterance end.

The invention of claim 8 is an utterance section detection program for causing a computer to execute utterance section detection processing that detects an utterance section in input speech, the program causing the computer to execute: acoustic analysis processing for converting the input speech into acoustic features; continuous speech recognition processing for successively calculating the cumulative likelihood of each subword in synchronization with the input speech, using the acoustic features obtained by the acoustic analysis processing and a subword network composed of a preset acoustic model and a preset language model; and utterance section detection processing for successively detecting the utterance start and the utterance end from the cumulative likelihood of each subword.

According to the invention of claim 8, utterance sections can be detected quickly and with high accuracy. In addition, utterance sections can be detected easily by installing the program on a computer.

According to the present invention, utterance sections can be detected quickly and with high accuracy.

<Outline of the present invention>
The present invention relates to an utterance section detection technique that automatically and quickly detects, online, the utterance sections of human voices spoken under various acoustic environments. Specifically, the subword acoustic models of a plurality of speaker clusters and a subword language model are integrated to form a subword network. While continuous speech recognition of the input speech is being performed in units of subwords (for example, phonemes, syllables, or triphones), the cumulative likelihoods of the subwords corresponding to speech and to non-speech are calculated and compared in synchronization with the input speech, so that the utterance start and the utterance end are detected with high accuracy and a small delay.

Preferred embodiments of the utterance section detection device and the utterance section detection program of the present invention having the above features are described in detail below with reference to the drawings.

<Utterance section detection device: device configuration>
FIG. 1 is a diagram showing a configuration example of the utterance section detection device of the present invention. The utterance section detection device 10 shown in FIG. 1 comprises a subword network integration means 11, an acoustic analysis means 12, a continuous speech recognition means 13, and an utterance section detection means 14.

The subword network integration means 11 generates a subword network 23 using the subword acoustic model 21 of one or more speaker clusters and a preset subword language model 22, and outputs it to the continuous speech recognition means 13.

The subword acoustic model 21 can be configured freely. For example, with two speaker clusters, speaker cluster A may be male and speaker cluster B female, or speaker cluster A may be wideband speech and speaker cluster B narrowband speech, and the subwords may be phonemes or syllables that are either dependent on or independent of the acoustic context. The number of speaker clusters in the subword acoustic model may be three or more, or may be one.

The subword language model 22 may be any existing chain probability model, such as a phoneme chain probability model or a syllable chain probability model. The subword network 23 is described later.

The acoustic analysis means 12 receives the input speech 24 on which utterance detection is to be performed, converts it into acoustic features 25, and outputs them. The acoustic features 25 have the same composition as the acoustic features used to train the subword acoustic model 21 and may be, for example, cepstra representing the frequency characteristics, short-time power, and their dynamic features. In the following description, the sequence of acoustic features 25 from the utterance start detection start time τ to the current time t is denoted x_τ^t.
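As a rough illustration of two of the feature types named above, the following sketch computes short-time log power per frame and a first-order dynamic feature. It is an assumption-laden simplification: the function names, the framing parameters, the small floor constant, and the central-difference delta are all choices made here for illustration, and cepstral analysis is omitted entirely.

```python
import math


def short_time_log_power(samples, frame_len, hop):
    """Log of the mean squared amplitude of each frame."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        # A small floor avoids log(0) on digital silence.
        frames.append(math.log(energy + 1e-10))
    return frames


def delta(values):
    """First-order dynamic feature as a central difference with edge clamping."""
    n = len(values)
    return [
        (values[min(t + 1, n - 1)] - values[max(t - 1, 0)]) / 2.0
        for t in range(n)
    ]
```

Whatever features are used at detection time, the description requires them to match the features used to train the subword acoustic model.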

In synchronization with the input of the acoustic features 25, the continuous speech recognition means 13 performs state transitions according to the subword network 23 and successively obtains the plurality of subword sequences that may correspond to the sequence x_τ^t of acoustic features 25 from the utterance start detection start time τ to the current time t, together with their cumulative likelihoods 26. For this it uses, for example, a time-synchronous beam search speech recognition method based on hidden Markov models (see, for example, Seiichi Nakagawa, "Speech Recognition by Probability Models", Institute of Electronics, Information and Communication Engineers, pp. 44-46, 1988). The method by which the continuous speech recognition means 13 obtains the subword sequences and their cumulative likelihoods 26 is described later.

The utterance section detection means 14 detects one or more utterance starts and utterance ends in the input speech 24 based on the subword cumulative likelihoods 26 obtained by the continuous speech recognition means 13. Specifically, it outputs an utterance start time 27 and an utterance end time 28 corresponding to the times (time records) attached to the input speech 24. It may also output the utterance section speech 29 corresponding to the utterance start time 27 and the utterance end time 28. With the configuration of the utterance section detection device 10 described above, utterance sections can be detected quickly and with high accuracy.

In the utterance section detection device 10 described above, the subword network integration means 11 generates the subword network 23 from the subword acoustic model 21 of the speaker clusters and the subword language model 22. The present invention is not limited to this; the subword network 23 may be generated in advance and stored in the continuous speech recognition means 13 or in another storage means (not shown).

<Subword network 23>
The subword network is now described in detail. FIG. 2 is a diagram showing an example of a subword network for the case of two speaker clusters.

The subword network 23 for two speaker clusters shown in FIG. 2 can be configured to have an utterance detection start state 31, a non-speech acoustic model 32 of speaker cluster A corresponding to the utterance start, a speech acoustic model 33 of speaker cluster A, a non-speech acoustic model 34 of speaker cluster A corresponding to the utterance end, a non-speech acoustic model 35 of speaker cluster B corresponding to the utterance start, a speech acoustic model 36 of speaker cluster B, a non-speech acoustic model 37 of speaker cluster B corresponding to the utterance end, and an utterance detection end state 38.

Hidden Markov models, for example, can be used as the acoustic models. The non-speech acoustic models are trained in advance on audio other than speech, such as silence, noise, and music, and the speech acoustic models are trained in advance on speech audio in units of subwords, such as phonemes (vowels, consonants, and so on) or syllables.

In FIG. 2, a transition from the utterance detection start state 31 to the non-speech acoustic model 32 of speaker cluster A and to the non-speech acoustic model 35 of speaker cluster B can be made without restriction immediately after utterance section detection starts (arrow *1 in FIG. 2).

Transitions between the non-speech acoustic models 32 and 34 of speaker cluster A and the speech acoustic model 33 of speaker cluster A can be made according to the subword language model 22 (arrow *2 in FIG. 2).

Similarly, transitions between the non-speech acoustic models 35 and 37 of speaker cluster B and the speech acoustic model 36 of speaker cluster B can be made according to the subword language model 22 (arrow *2 in FIG. 2).

A transition from the non-speech acoustic model 32 of speaker cluster A or the non-speech acoustic model 35 of speaker cluster B back to the utterance detection start state 31 can be made when the utterance start detection condition has not been satisfied for a preset fixed length of time (arrow *3 in FIG. 2).

Between the speech acoustic model 33 of speaker cluster A and the speech acoustic model 36 of speaker cluster B, a transition to the other speaker cluster can be made with a predetermined penalty (arrow *4 in FIG. 2).

A transition from the non-speech acoustic model 34 of speaker cluster A or the non-speech acoustic model 37 of speaker cluster B to the utterance detection end state 38 can be made according to the utterance end detection condition (arrow *5 in FIG. 2). Furthermore, a transition from the utterance detection end state 38 to the utterance detection start state 31 can be made without restriction immediately after the utterance end is detected, in preparation for the next utterance (arrow *6 in FIG. 2).

The non-speech acoustic model 32 of speaker cluster A and the non-speech acoustic model 35 of speaker cluster B may be merged into a single non-speech acoustic model. Similarly, the non-speech acoustic model 34 of speaker cluster A and the non-speech acoustic model 37 of speaker cluster B may be merged into a single non-speech acoustic model.

The non-speech acoustic models 32 and 34 of speaker cluster A are drawn as different states, but their statistical properties may be identical. Similarly, the non-speech acoustic models 35 and 37 of speaker cluster B are drawn as different states, but their statistical properties may be identical.

The subword network integration means 11 of the present invention can build the subword network 23 for one or more speaker clusters using at least one of the transitions described above.
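The topology described for FIG. 2 can be written down as a plain list of arcs. The sketch below is a hypothetical data structure, not the patent's implementation: the state names, the `Arc` type, and the `kind` labels are invented here, each speech block is collapsed into a single node instead of a set of subword models, and only the transition directions listed in claim 3 are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    src: str   # source state name (hypothetical naming)
    dst: str   # destination state name
    kind: str  # what governs the transition


def build_network(clusters=("A", "B")):
    """Build the arcs of a FIG. 2 style network for the given speaker clusters."""
    arcs = []
    for s in clusters:
        arcs.append(Arc("start", f"sil_head_{s}", "free"))          # arrow *1
        arcs.append(Arc(f"sil_head_{s}", f"speech_{s}", "lm"))      # arrow *2
        arcs.append(Arc(f"speech_{s}", f"speech_{s}", "lm"))        # arrow *2
        arcs.append(Arc(f"speech_{s}", f"sil_tail_{s}", "lm"))      # arrow *2
        arcs.append(Arc(f"sil_head_{s}", "start", "idle_timeout"))  # arrow *3
        arcs.append(Arc(f"sil_tail_{s}", "end", "end_condition"))   # arrow *5
        for other in clusters:
            if other != s:
                # Cross-cluster speech transitions carry a penalty (arrow *4).
                arcs.append(Arc(f"speech_{s}", f"speech_{other}", "penalty"))
    arcs.append(Arc("end", "start", "free"))                        # arrow *6
    return arcs
```

With a single cluster the penalty arcs disappear, which matches the statement that the number of speaker clusters may be one.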

<Subword sequences and their cumulative likelihoods 26>
Next, the method by which the continuous speech recognition means 13 obtains the subword sequences and their cumulative likelihoods 26 is described in detail. FIG. 3 is a diagram showing an example of speech recognition at the utterance start. FIG. 4 is a diagram showing an example of speech recognition at the utterance end.

For example, suppose that the subword acoustic model 21 has two speaker clusters and that time-synchronous beam search speech recognition is performed, with sil_S denoting the non-speech acoustic model of speaker cluster S ∈ {A, B} and ph_{S,i} denoting the speech acoustic model of speaker cluster S, where i is the index of a subword such as a phoneme. At the utterance start, for the plurality of subword sequences that may correspond to the acoustic features 25 as shown in FIG. 3, the logarithm of the cumulative likelihood of the maximum-likelihood subword sequence is obtained successively by equation (1).

In addition, the logarithm of the cumulative likelihood of the non-speech acoustic model at the utterance start is obtained successively by equation (2).

At the utterance end, for the plurality of subword sequences that may correspond to the sequence x_τ^t of acoustic features 25 from the utterance start detection start time τ to the current time t as shown in FIG. 4, the logarithm of the maximum cumulative likelihood among the acoustic models corresponding to non-speech that follow the acoustic models corresponding to speech of all speaker clusters is obtained successively by equation (3).

更に、同じ話者クラスタのスピーチに対応する音響モデルの最大の累積尤度の対数値を以下に示す（４）式により逐次求める。

Further, the logarithmic value of the maximum cumulative likelihood of the acoustic model corresponding to the speech of the same speaker cluster is sequentially obtained by the following equation (4).

なお、連続音声認識中は、話者クラスタ間のサブワード音響モデルの遷移を許可するものとし、話者クラスタ間のサブワード音響モデルの遷移を許可する場合、一定のペナルティのスコアをサブワード累積尤度の対数値に付加する。上述した処理を行うことで、連続音声認識手段１３は高精度なサブワード累積尤度２６を出力することができる。

During continuous speech recognition, transitions between the subword acoustic models of different speaker clusters are permitted; when such a transition is taken, a fixed penalty score is added to the logarithm of the subword cumulative likelihood. Through the processing described above, the continuous speech recognition means 13 can output the subword cumulative likelihood 26 with high accuracy.
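The penalized cross-cluster transition can be illustrated with a minimal sketch of one frame-synchronous update in the log domain. This is not the patent's implementation: the function name `viterbi_step`, the dictionary layout, and the choice to express the penalty as a positive constant that is subtracted are all assumptions made for illustration, and beam pruning is omitted.

```python
NEG_INF = float("-inf")

def viterbi_step(prev, transitions, emit_logp, cluster_of, penalty):
    """One frame-synchronous update of cumulative log-likelihoods (hypothetical helper).

    prev:        dict state -> cumulative log-likelihood at the previous frame
    transitions: dict state -> list of (next_state, log transition score)
    emit_logp:   dict state -> log-likelihood of the current frame in that state
    cluster_of:  dict state -> speaker cluster label
    penalty:     positive constant subtracted when a transition changes cluster
    """
    cur = {state: NEG_INF for state in emit_logp}
    for src, score in prev.items():
        if score == NEG_INF:
            continue  # inactive hypothesis, nothing to extend
        for dst, log_a in transitions.get(src, []):
            cand = score + log_a + emit_logp[dst]
            if cluster_of[src] != cluster_of[dst]:
                cand -= penalty  # assumed form of the fixed cross-cluster penalty
            if cand > cur[dst]:
                cur[dst] = cand  # keep only the best-scoring path into dst
    return cur

# Example: staying in cluster A is free, moving to cluster B costs the penalty.
prev = {"ph_A": 0.0, "ph_B": NEG_INF}
transitions = {"ph_A": [("ph_A", 0.0), ("ph_B", 0.0)]}
emit = {"ph_A": -1.0, "ph_B": -1.0}
clusters = {"ph_A": "A", "ph_B": "B"}
scores = viterbi_step(prev, transitions, emit, clusters, penalty=2.0)
```

With equal acoustic scores, the hypothesis that switches cluster ends up lower by exactly the penalty, which discourages spurious cluster changes within an utterance.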

なお、連続音声認識手段１３は、サブワード・ネットワーク２３における発話区間検出開始状態から非スピーチあるいはスピーチに対応する音響モデルに遷移した後、一定の時間長ｔ_ｉｄｌｅにわたって継続して予め設定された後述する発話始端検出条件が満たされなかった場合に、非スピーチ音響モデルから発話区間検出開始状態に戻ると同時に、全ての音響モデルにおける累積尤度等の音声認識の途中結果をクリア（リセット）し、発話区間検出開始時刻τを現時刻ｔに更新して再度サブワード単位の連続音声認識を開始する。これにより、発話始端を検出するまでの長い非音声を吸収することができる。したがって、高精度に発話始端を検出することができる。 If, after the transition from the utterance section detection start state in the subword network 23 to an acoustic model corresponding to non-speech or speech, the preset utterance start detection condition described later remains unsatisfied continuously for a fixed time length t_idle, the continuous speech recognition means 13 returns from the non-speech acoustic model to the utterance section detection start state. At the same time, it clears (resets) the intermediate results of speech recognition, such as the cumulative likelihoods in all acoustic models, updates the utterance section detection start time τ to the current time t, and restarts continuous speech recognition in subword units. This absorbs long stretches of non-speech that precede the utterance start, so the utterance start can be detected with high accuracy.
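The idle reset described above can be sketched as a small decision function. The name `maybe_reset`, the use of frame counts as the time unit, and the return convention are assumptions for illustration; the caller is assumed to clear its cumulative likelihoods whenever the reset flag is true.

```python
def maybe_reset(t, tau, start_detected, t_idle):
    """Decide whether to return to the detection start state (hypothetical helper).

    t:              current time, in frames
    tau:            utterance section detection start time, in frames
    start_detected: True once the utterance start detection condition has been met
    t_idle:         time length after which an unsuccessful search is abandoned

    Returns (new_tau, reset). When reset is True the caller should clear all
    intermediate recognition results and restart recognition from new_tau.
    """
    if not start_detected and (t - tau) >= t_idle:
        return t, True   # tau is updated to the current time t
    return tau, False    # keep accumulating from the existing tau

# Example: 100 frames without a detected start triggers a reset at t = 100.
new_tau, reset = maybe_reset(t=100, tau=0, start_detected=False, t_idle=100)
```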

＜発話区間検出手段１４＞
次に、発話区間検出手段１４について具体的に説明する。発話区間検出手段１４は、発話始端では、最尤サブワード列の累積尤度の対数値Ｌ_１と、始端の非スピーチ音響モデルの累積尤度の対数値Ｌ_２の差が一定の閾値θ_{ｓｔａｒｔ}を超えた時、すなわち（Ｌ_１−Ｌ_２）＞θ_{ｓｔａｒｔ}となる時、これを発話始端検出条件として、図３に示すように最大の累積尤度を示すサブワード列の始端の非スピーチ音響モデルの終端時刻から、所定の時間長ｔ_{ｓｔａｒｔ}遡った時刻を発話始端時刻２７とする。 <Speech section detection means 14>
Next, the utterance section detection means 14 will be described in detail. At the utterance start, when the difference between the logarithm L_1 of the cumulative likelihood of the maximum-likelihood subword string and the logarithm L_2 of the cumulative likelihood of the non-speech acoustic model at the start exceeds a fixed threshold θ_start, that is, when (L_1 − L_2) > θ_start, the utterance section detection means 14 treats this as the utterance start detection condition. As shown in FIG. 3, it then sets the utterance start time 27 to the time that is a predetermined time length t_start before the end time of the non-speech acoustic model at the start of the subword string with the largest cumulative likelihood.

なお、時間長ｔ_{ｓｔａｒｔ}は、例えばニュース原稿を読み上げるような一般的な音声速度の場合、約２００ｍｓｅｃ程度が好ましいが、本発明においてはこれに限定されない。 Note that the time length t _start is preferably about 200 msec in the case of a general voice speed for reading a news manuscript, for example, but is not limited to this in the present invention.
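The utterance start detection condition can be written as a short sketch. The function name `detect_start`, the millisecond units, and the clamping of the result at zero are assumptions for illustration; the 200 msec default simply reuses the value the text gives as preferable for news-reading speech.

```python
def detect_start(l1, l2, theta_start, sil_end_ms, t_start_ms=200):
    """Return the utterance start time in ms, or None if the condition is not met.

    l1:          log cumulative likelihood of the maximum-likelihood subword string
    l2:          log cumulative likelihood of the non-speech model at the start
    theta_start: fixed threshold on the difference l1 - l2
    sil_end_ms:  end time of the initial non-speech model on the best path
    t_start_ms:  time length to go back from sil_end_ms
    """
    if l1 - l2 > theta_start:
        # Clamping at zero is an added safeguard, not stated in the text.
        return max(0, sil_end_ms - t_start_ms)
    return None

# Example: the difference of 20 exceeds the threshold of 10.
start_ms = detect_start(l1=-100.0, l2=-120.0, theta_start=10.0, sil_end_ms=1500)
```

Going back by t_start from the end of the initial non-speech model leaves a short margin before the first speech subword, so the onset of the utterance is less likely to be clipped.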

一方、発話終端では、終端が非スピーチ音響モデルとなる最尤サブワード列のうち最大の累積尤度の対数値Ｌ_３と、同話者クラスタのスピーチ音響モデルを終端とする最尤サブワード列の累積尤度の対数値Ｌ_４との差が、一定の閾値θ_ｅｎｄを時間長ｔ_ｅｎｄ１継続して超えた場合、すなわちｔ_ｅｎｄ１継続して（Ｌ_３−Ｌ_４）＞θ_ｅｎｄとなる時、これを発話終端検出条件として、図４に示すように、現時刻ｔから時間長ｔ_ｅｎｄ１を基準とした所定の時間長ｔ_ｅｎｄ２（ｔ_ｅｎｄ２＜ｔ_ｅｎｄ１）分遡った時刻を発話終端時刻２８とする。 At the utterance end, on the other hand, let L_3 be the logarithm of the largest cumulative likelihood among the maximum-likelihood subword strings that end in a non-speech acoustic model, and let L_4 be the logarithm of the cumulative likelihood of the maximum-likelihood subword string that ends in a speech acoustic model of the same speaker cluster. When the difference between them exceeds a fixed threshold θ_end continuously for a time length t_end1, that is, when (L_3 − L_4) > θ_end holds continuously for t_end1, this is treated as the utterance end detection condition. As shown in FIG. 4, the utterance end time 28 is then set to the time that is a predetermined time length t_end2 (t_end2 < t_end1), chosen with reference to t_end1, before the current time t.

なお、時間長ｔ_ｅｎｄ１は、発話終端検出条件の基準であるため、実際の発話終端時刻よりも長くなってしまう。そこで、よりもｔ_ｅｎｄ２＜ｔ_ｅｎｄ１の関係を満たす時間長ｔ_ｅｎｄ２を設定することで、より発話終端部に近い時刻を検出することができる。ここで、時間長ｔ_ｅｎｄ２は、例えばニュース原稿を読み上げるような一般的な音声速度の場合、約２００ｍｓｅｃ程度が好ましいが、本発明においてはこれに限定されない。 Because the time length t_end1 serves as the criterion for the utterance end detection condition, the time at which the condition is met is later than the actual utterance end. Setting a time length t_end2 that satisfies t_end2 < t_end1 therefore allows a time closer to the actual utterance end to be detected. For speech at a typical rate, such as reading a news manuscript aloud, t_end2 is preferably about 200 msec, although the present invention is not limited to this value.
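The utterance end detection condition adds a duration requirement, which can be sketched with a small counter. The class name `EndDetector`, the use of frame counts for t_end1 and t_end2, and the per-frame `update` interface are assumptions for illustration rather than details taken from the patent.

```python
class EndDetector:
    """Track how long (L_3 - L_4) > theta_end has held continuously (hypothetical)."""

    def __init__(self, theta_end, t_end1, t_end2):
        if not t_end2 < t_end1:
            raise ValueError("t_end2 must be shorter than t_end1")
        self.theta_end = theta_end
        self.t_end1 = t_end1  # frames the condition must hold continuously
        self.t_end2 = t_end2  # frames to go back from the current time
        self.run = 0          # consecutive frames for which the condition has held

    def update(self, t, l3, l4):
        """Feed one frame; return the utterance end time in frames, or None."""
        if l3 - l4 > self.theta_end:
            self.run += 1
        else:
            self.run = 0      # any interruption restarts the count
        if self.run >= self.t_end1:
            return t - self.t_end2
        return None

# Example: the condition holds at frames 10, 11 and 12, so the end is 12 - 2 = 10.
detector = EndDetector(theta_end=5.0, t_end1=3, t_end2=2)
results = [detector.update(t, l3=-50.0, l4=-60.0) for t in (10, 11, 12)]
```

Requiring the condition to persist for t_end1 keeps short pauses inside an utterance from being mistaken for its end, while going back only t_end2 moves the reported end time closer to where the speech actually stopped.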

これにより、音声認識の処理量を削減することができる。また、認識性能の向上を図ることができる。したがって、入力された音声の中から発話区間を迅速且つ高精度に検出することができる。 Thereby, the processing amount of voice recognition can be reduced. Also, the recognition performance can be improved. Therefore, it is possible to quickly and accurately detect the utterance section from the input voice.

＜実行プログラム＞
ここで、上述した発話区間検出装置１０は、上述した専用の装置構成等を用いて本発明における発話区間検出処理を行うこともできるが、各構成における処理をコンピュータに実行させることができる実行プログラムを生成し、例えば、汎用のパーソナルコンピュータ、サーバ等にそのプログラムをインストールすることにより、本発明に係る発話区間検出処理を実現することができる。 <Execution program>
The utterance section detection device 10 described above can perform the utterance section detection processing of the present invention using the dedicated device configuration described above. Alternatively, the processing can be realized by generating an execution program that causes a computer to execute the processing of each component and installing that program on, for example, a general-purpose personal computer or server.

＜ハードウェア構成＞
ここで、本発明における発話区間検出処理が実行可能なコンピュータのハードウェア構成例について図を用いて説明する。図５は、本発明における発話区間検出処理が実現可能なハードウェア構成の一例を示す図である。 <Hardware configuration>
Here, a hardware configuration example of a computer capable of executing the speech section detection processing according to the present invention will be described with reference to the drawings. FIG. 5 is a diagram illustrating an example of a hardware configuration capable of realizing the speech segment detection processing according to the present invention.

図５におけるコンピュータ本体には、入力装置４１と、出力装置４２と、ドライブ装置４３と、補助記憶装置４４と、メモリ装置４５と、各種制御を行うＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）４６と、ネットワーク接続装置４７とを有するよう構成されており、これらはシステムバスＢで相互に接続されている。 The computer main body in FIG. 5 includes an input device 41, an output device 42, a drive device 43, an auxiliary storage device 44, a memory device 45, a CPU (Central Processing Unit) 46 that performs various controls, and a network connection device 47, which are connected to one another by a system bus B.

入力装置４１は、ユーザが操作するキーボード及びマウス等のポインティングデバイスや音声入力デバイス等を有しており、ユーザからのプログラムの実行指示等、各種操作信号、音声信号を入力する。出力装置４２は、本発明における処理を行うためのコンピュータ本体を操作するのに必要な各種ウィンドウやデータ等を表示するディスプレイやスピーカ等を有し、ＣＰＵ４６が有する制御プログラムにより実行経過や結果等を表示又は音声出力することができる。 The input device 41 includes a keyboard, a pointing device such as a mouse, a voice input device, and the like operated by the user, and inputs various operation signals, such as program execution instructions from the user, as well as audio signals. The output device 42 has a display, a speaker, and the like that present the various windows and data needed to operate the computer main body that performs the processing of the present invention, and it can display, or output as audio, the progress and results of execution under the control program of the CPU 46.

ここで、本発明において、コンピュータ本体にインストールされる実行プログラムは、例えば、ＣＤ−ＲＯＭ等の記録媒体４８等により提供される。プログラムを記録した記録媒体４８は、ドライブ装置４３にセット可能であり、記録媒体４８に含まれる実行プログラムが、記録媒体４８からドライブ装置４３を介して補助記憶装置４４にインストールされる。 Here, in the present invention, the execution program installed in the computer main body is provided by, for example, the recording medium 48 such as a CD-ROM. The recording medium 48 on which the program is recorded can be set in the drive device 43, and the execution program included in the recording medium 48 is installed in the auxiliary storage device 44 from the recording medium 48 via the drive device 43.

また、ドライブ装置４３は、本発明に係る実行プログラムを記録媒体４８に記録することができる。これにより、その記録媒体４８を用いて、他の複数のコンピュータに容易にインストールすることができ、容易に発話区間検出処理を実現することができる。 Further, the drive device 43 can record the execution program according to the present invention on the recording medium 48. Thereby, using the recording medium 48, it can be easily installed in a plurality of other computers, and the speech segment detection processing can be easily realized.

補助記憶装置４４は、ハードディスク等のストレージ手段であり、本発明における実行プログラムや、コンピュータに設けられた制御プログラム等を蓄積し必要に応じて入出力を行うことができる。また、補助記憶装置４４は、上述したサブワード音響モデル２１やサブワード言語モデル２２、サブワード・ネットワーク２３、入力音声２４、音響特徴量２５、サブワード累積尤度２６、発話始端時刻２７、発話終端時刻２８、及び発話区間音声２９等を蓄積する蓄積手段として用いることもできる。 The auxiliary storage device 44 is a storage means such as a hard disk, and can store an execution program according to the present invention, a control program provided in a computer, and the like, and can perform input / output as necessary. The auxiliary storage device 44 also includes the subword acoustic model 21, the subword language model 22, the subword network 23, the input speech 24, the acoustic feature 25, the subword cumulative likelihood 26, the utterance start time 27, the utterance end time 28, Also, it can be used as a storage means for storing the speech section voice 29 and the like.

ＣＰＵ４６は、ＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）等の制御プログラム、及び補助記憶装置４４から読み出されメモリ装置４５に格納されている実行プログラムに基づいて、各種演算や各ハードウェア構成部とのデータの入出力等、コンピュータ全体の処理を制御して、発話区間検出処理における各処理を実現することができる。また、プログラムの実行中に必要な各種情報等は、補助記憶装置４４から取得することができ、また格納することもできる。 The CPU 46 performs various calculations and data input / output with each hardware component based on a control program such as an OS (Operating System) and an execution program read from the auxiliary storage device 44 and stored in the memory device 45. Each process in the utterance section detection process can be realized by controlling the process of the entire computer. Various information necessary during the execution of the program can be acquired from the auxiliary storage device 44 and can also be stored.

ネットワーク接続装置４７は、電話回線やＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）ケーブル等の通信ネットワーク等と接続することにより、実行プログラムを通信ネットワークに接続されている他の端末等から取得したり、プログラムを実行することで得られた実行結果又は本発明における実行プログラムを他の端末等に提供することができる。 The network connection device 47 obtains an execution program from another terminal connected to the communication network or executes the program by connecting to a communication network such as a telephone line or a LAN (Local Area Network) cable. The execution result obtained in this way or the execution program in the present invention can be provided to other terminals or the like.

上述したようなハードウェア構成により、特別な装置構成を必要とせず、低コストで上述した発話区間検出処理を実現することができる。また、プログラムをインストールすることにより、容易に発話区間検出処理を実現することができる。 With the hardware configuration as described above, the above-described speech segment detection processing can be realized at low cost without requiring a special device configuration. Further, by installing the program, it is possible to easily realize the speech segment detection process.

＜発話区間検出処理手順＞
次に、本発明における実行プログラム（発話区間検出プログラム）を用いた発話区間検出処理手順についてフローチャートを用いて説明する。図６は、発話区間検出処理手順の一例を示すフローチャートである。なお、図６に示す発話区間検出処理手順では、検出対象が発話始端であるか又は発話終端であるかを明確にするために検出対象パラメータを設けている。また、以下の説明では、検出対象のパラメータには、“始端”又は“終端”の何れかがセットされているものとして説明するが、本発明においてはこれに限定されるものではない。 <Speech section detection processing procedure>
Next, a speech segment detection processing procedure using the execution program (speech segment detection program) according to the present invention will be described with reference to a flowchart. FIG. 6 is a flowchart illustrating an example of an utterance section detection processing procedure. In the utterance section detection processing procedure shown in FIG. 6, a detection target parameter is provided to clarify whether the detection target is the utterance start end or the utterance end. Further, in the following description, it is assumed that either “starting end” or “end” is set as the detection target parameter, but the present invention is not limited to this.

図６において、まずプログラム開始直後、複数の話者クラスタのサブワード音響モデルとサブワード言語モデルとを利用して、サブワード・ネットワークを統合し（Ｓ０１）、検索対象のパラメータには初期状態として“始端”とセットする（Ｓ０２）。なお、ここまでの処理は、前処理として予め処理されていてもよい。 In FIG. 6, immediately after the program starts, the subword network is integrated using the subword acoustic models and the subword language model of the plurality of speaker clusters (S01), and the detection target parameter is set to “starting end” as its initial state (S02). The processing up to this point may be carried out in advance as preprocessing.

次に、音声入力があるか否かを判断し（Ｓ０３）、音声が入力された場合（Ｓ０３において、ＹＥＳ）、１フレーム分の音響特徴量の算出に必要な、例えば２５ミリ秒程度の短い区間の音声をデジタル入力し（Ｓ０４）、入力した音声の音響分析を行う（Ｓ０５）。次に、Ｓ０４の処理にて得られた音響特徴量について、Ｓ０１の処理にて得られたサブワード・ネットワーク上で各累積尤度を算出する（Ｓ０６）。 Next, it is determined whether there is speech input (S03). If speech is input (YES in S03), a short segment of speech, for example about 25 milliseconds, needed to calculate the acoustic features for one frame is input digitally (S04), and acoustic analysis is performed on the input speech (S05). Then, for the acoustic features obtained from the speech input in S04, each cumulative likelihood is calculated on the subword network obtained in S01 (S06).

ここで、検出対象として予め設定されたパラメータに“始端”とセットされているか否かを判断し（Ｓ０７）、“始端”がセットされている場合（Ｓ０７において、ＹＥＳ）、発話始端時刻を出力し（Ｓ０８）、また音声の出力を開始する（Ｓ０９）。また、検出対象のパラメータに“終端”をセットし（Ｓ１０）、Ｓ０３に戻り、以後同様の処理を継続する。 Here, it is determined whether or not “starting end” is set to a parameter set in advance as a detection target (S07), and when “starting end” is set (YES in S07), the utterance start end time is output. (S08), and voice output is started (S09). Further, “end” is set in the parameter to be detected (S10), the process returns to S03, and the same processing is continued thereafter.

また、Ｓ０７の処理において、検出対象パラメータに“始端”がセットされていない場合（Ｓ０７において、ＮＯ）、検出対象が“終端”であると判断し、発話終端の時刻を出力し（Ｓ１１）、また音声の出力を停止する（Ｓ１２）。 Further, in the process of S07, when “starting end” is not set in the detection target parameter (NO in S07), it is determined that the detection target is “end”, and the time of the utterance end is output (S11). Also, the output of the voice is stopped (S12).

次に、発話区間検出処理を継続するか否かを判断し（Ｓ１３）、継続する場合（Ｓ１３において、ＹＥＳ）、検出対象のパラメータに“始端”をセットし（Ｓ１４）、Ｓ０３に戻り、以後同様の処理を継続する。 Next, it is determined whether or not to continue the utterance section detection process (S13). If it is continued (YES in S13), “starting end” is set as the parameter to be detected (S14), and the process returns to S03. The same process is continued.

また、Ｓ０３の処理において、音声入力がない場合（Ｓ０３において、ＮＯ）、又はＳ１３の処理において、発話区間検出処理を継続しない場合（Ｓ１３において、ＮＯ）、処理を終了する。 If there is no voice input in the process of S03 (NO in S03), or if the speech section detection process is not continued in the process of S13 (NO in S13), the process ends.
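The alternation between the “starting end” and “end” detection targets in FIG. 6 can be sketched as a small state machine. The function name `detect_sections` and the two predicate arguments are assumptions for illustration; in particular, the per-frame checks that a detection condition has actually been met stand in for the threshold tests described earlier, and each per-frame score stands in for the cumulative likelihoods of S06.

```python
def detect_sections(frame_scores, start_met, end_met):
    """Alternate between start and end detection over a stream of frames.

    frame_scores: iterable of per-frame values (placeholder for S04-S06 output)
    start_met:    predicate, True when the utterance start condition holds
    end_met:      predicate, True when the utterance end condition holds
    Returns a list of ("start", t) and ("end", t) events.
    """
    target = "start"                         # S02: initial detection target
    events = []
    for t, score in enumerate(frame_scores):  # S03: loop while input remains
        if target == "start":                 # S07: YES branch
            if start_met(score):
                events.append(("start", t))   # S08, S09: output time, begin audio
                target = "end"                # S10
        else:                                 # S07: NO branch
            if end_met(score):
                events.append(("end", t))     # S11, S12: output time, stop audio
                target = "start"              # S14
    return events                             # input exhausted: processing ends

# Example with a toy score of 1 for speech-like frames and 0 otherwise.
events = detect_sections(
    [0, 0, 1, 1, 1, 0, 0, 1, 0],
    start_met=lambda s: s == 1,
    end_met=lambda s: s == 0,
)
```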

上述したように、発話区間検出プログラムを用いた発話区間検出処理により、迅速且つ高精度に音声に対する発話区間を検出することができる。また、プログラムをインストールすることにより、容易に発話区間検出処理を実現することができる。 As described above, an utterance section for speech can be detected quickly and with high accuracy by the utterance section detection process using the utterance section detection program. Further, by installing the program, it is possible to easily realize the speech segment detection process.

なお、発話区間検出処理においては、発話始端時刻及び発話終端時刻を出力し（Ｓ０８、Ｓ１１）、更に発話区間の音声を出力したが（Ｓ０９、Ｓ１２）本発明においてはこの限りではなく、例えば、発話始端時刻、発話終端時刻、及び発話区間の音声のうち、少なくとも１つを出力させてもよい。 In the utterance section detection processing, the utterance start time and utterance end time are output (S08, S11), and the voice of the utterance section is further output (S09, S12). At least one of the speech start time, speech end time, and speech in the speech section may be output.

上述したように本発明によれば、迅速且つ高精度に音声に対する発話区間を検出することができる。具体的には、本発明は、短時間パワーと周波数特性及びそれらの動的特徴量で構成される音響特徴量に対して、複数の話者クラスタのサブワード音響モデルとサブワード言語モデルを統合して高精度且つ簡易なサブワード・ネットワークを構成し、入力音声に対するサブワード単位の連続音声認識の実行中に、スピーチと非スピーチに対応する各音響モデルにおける累積尤度を入力音声に同期して算出及び比較することで、背景雑音が存在する様々な音響環境のもとでも高精度に、オンライン且つ少ない遅れ時間で、入力音声中の人間の声の発話区間を自動検出することが可能になる。 As described above, according to the present invention, it is possible to detect an utterance section for speech quickly and with high accuracy. Specifically, the present invention integrates a subword acoustic model and a subword language model of a plurality of speaker clusters for an acoustic feature amount composed of short-time power and frequency characteristics and dynamic feature amounts thereof. A highly accurate and simple subword network is constructed, and the cumulative likelihood in each acoustic model corresponding to speech and non-speech is calculated and compared in synchronization with the input speech during execution of continuous speech recognition in units of subwords for the input speech. By doing so, it becomes possible to automatically detect an utterance section of a human voice in the input voice with high accuracy and with a small delay time even in various acoustic environments where background noise exists.

したがって、本発明を音声認識の前処理に利用することで、無音や雑音や音楽等、音声認識に不要な非スピーチ区間を入力音声から自動的に除去し、認識すべきスピーチ区間だけを取り出すことができる。これにより、音声認識の処理量の削減と認識性能の向上が図られる。 Therefore, by using the present invention as preprocessing for speech recognition, non-speech sections that are unnecessary for speech recognition, such as silence, noise, and music, can be removed automatically from the input speech so that only the speech sections to be recognized are extracted. This reduces the processing load of speech recognition and improves recognition performance.

また、本発明を音声圧縮の前処理に利用することで、スピーチ区間と非スピーチ区間それぞれに最適な圧縮方式を選択的に適用することが可能となり、圧縮効率を高めることができる。また、本発明を音声データベースの自動ラベリングに利用することで、スピーチ区間と非スピーチ区間のラベリング及びファイルへの分割を自動化でき、作業効率を高めることができる。また、本発明を音声の書き起こしテキスト作成支援に利用することで、スピーチ区間だけを音声から取り出すと共に、音声中の各発話の時刻情報を自動的に付与することができ、作業効率を高めることができる。 Further, by using the present invention as preprocessing for speech compression, the compression method best suited to each of the speech sections and the non-speech sections can be applied selectively, which improves compression efficiency. By using the present invention for automatic labeling of a speech database, the labeling of speech and non-speech sections and the division into files can be automated, which improves work efficiency. By using the present invention to support the creation of speech transcripts, only the speech sections can be extracted from the audio and time information can be attached automatically to each utterance, which likewise improves work efficiency.

更に、本発明を録音装置に利用することで、スピーチ区間だけを録音することができ、テープやメモリ等の録音媒体の節約が可能となる。 Furthermore, by using the present invention for a recording apparatus, it is possible to record only a speech section, and it is possible to save a recording medium such as a tape or a memory.

つまり、本発明は、放送番組の字幕制作、音声対話システム、音声ワープロ、会議の議事録の自動作成、声による機器の制御等、音声認識や言語処理を利用した様々な分野の技術に適用することができる。 In other words, the present invention can be applied to technologies in various fields that use speech recognition and language processing, such as subtitle production for broadcast programs, spoken dialogue systems, voice word processors, automatic creation of meeting minutes, and voice control of devices.

以上本発明の好ましい実施例について詳述したが、本発明は係る特定の実施形態に限定されるものではなく、特許請求の範囲に記載された本発明の要旨の範囲内において、種々の変形、変更が可能である。 The preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.

FIG. 1: 本発明における発話区間検出装置の一構成例を示す図である。 A diagram showing a configuration example of the utterance section detection device according to the present invention.
FIG. 2: 話者クラスタ数を２とした場合のサブワード・ネットワークの一例を示す図である。 A diagram showing an example of the subword network when the number of speaker clusters is two.
FIG. 3: 発話始端における音声認識の一例を示す図である。 A diagram showing an example of speech recognition at the utterance start.
FIG. 4: 発話終端における音声認識の一例を示す図である。 A diagram showing an example of speech recognition at the utterance end.
FIG. 5: 本発明における発話区間検出処理が実現可能なハードウェア構成の一例を示す図である。 A diagram showing an example of a hardware configuration capable of realizing the utterance section detection processing according to the present invention.
FIG. 6: 発話区間検出処理手順の一例を示すフローチャートである。 A flowchart showing an example of the utterance section detection processing procedure.

Explanation of symbols

１０発話区間検出装置
１１サブワード・ネットワーク統合手段
１２音響分析手段
１３連続音声認識手段
１４発話区間検出手段
２１サブワード音響モデル
２２サブワード言語モデル
２３サブワード・ネットワーク
２４入力音声
２５音響特徴量
２６サブワードの列及びそれらの累積尤度
２７発話始端時刻
２８発話終端時刻
２９発話区間音声
３１発話検出開始状態
３２発話始端に相当する話者クラスタＡの非スピーチ音響モデル
３３話者クラスタＡのスピーチ音響モデル
３４発話終端に相当する話者クラスタＡの非スピーチ音響モデル
３５発話始端に相当する話者クラスタＢの非スピーチ音響モデル
３６話者クラスタＢのスピーチ音響モデル
３７発話終端に相当する話者クラスタＢの非スピーチ音響モデル
３８発話検出終了状態
４１入力装置
４２出力装置
４３ドライブ装置
４４補助記憶装置
４５メモリ装置
４６ＣＰＵ
４７ネットワーク接続装置
４８記録媒体

10 Utterance section detection device
11 Subword network integration means
12 Acoustic analysis means
13 Continuous speech recognition means
14 Utterance section detection means
21 Subword acoustic model
22 Subword language model
23 Subword network
24 Input speech
25 Acoustic features
26 Subword strings and their cumulative likelihoods
27 Utterance start time
28 Utterance end time
29 Utterance section speech
31 Utterance detection start state
32 Non-speech acoustic model of speaker cluster A corresponding to the utterance start
33 Speech acoustic model of speaker cluster A
34 Non-speech acoustic model of speaker cluster A corresponding to the utterance end
35 Non-speech acoustic model of speaker cluster B corresponding to the utterance start
36 Speech acoustic model of speaker cluster B
37 Non-speech acoustic model of speaker cluster B corresponding to the utterance end
38 Utterance detection end state
41 Input device
42 Output device
43 Drive device
44 Auxiliary storage device
45 Memory device
46 CPU
47 Network connection device
48 Recording medium

Claims

An utterance section detection device for detecting an utterance section from input speech, comprising:
acoustic analysis means for converting the input speech into acoustic features;
continuous speech recognition means for sequentially calculating the cumulative likelihood of each subword in synchronization with the input speech, using the acoustic features obtained by the acoustic analysis means and a subword network composed of a preset acoustic model and language model; and
utterance section detection means for sequentially detecting an utterance start and an utterance end in the input speech from the cumulative likelihood of each subword.

The utterance section detection device according to claim 1, further comprising subword network integration means for integrating the subword network using a subword acoustic model that has one or more speaker clusters and expresses the acoustic features of speech and non-speech sounds, and a subword language model that expresses the transitions between the subword acoustic models.

The utterance section detection device according to claim 2, wherein the subword network integration means forms a subword network that enables at least one of: a transition from the utterance section detection start state to the acoustic models corresponding to non-speech of all speaker clusters; a transition, according to the subword language model, from a non-speech acoustic model to the acoustic model corresponding to speech of each speaker cluster; a transition from a non-speech acoustic model back to the utterance section detection start state in order to absorb non-speech lasting a certain length of time; a transition, according to the subword language model, between the acoustic models corresponding to speech of each speaker cluster; a penalized transition from the acoustic model corresponding to speech of each speaker cluster to the acoustic model corresponding to speech of a different speaker cluster; a transition, according to the subword language model, from the acoustic model corresponding to speech of each speaker cluster to the acoustic model corresponding to each non-speech; a transition to the utterance section detection end state according to the utterance end detection condition; and a transition from the utterance section detection end state to the utterance section detection start state.

The utterance section detection device according to any one of claims 1 to 3, wherein, when the utterance start detection condition is not satisfied for a certain length of time after the transition from the utterance section detection start state in the subword network to an acoustic model corresponding to non-speech or speech, the continuous speech recognition means returns from the non-speech acoustic model to the utterance section detection start state and, at the same time, clears the intermediate results of speech recognition, such as the cumulative likelihoods in all acoustic models, updates the utterance section detection start time, and restarts continuous speech recognition in subword units.

The utterance section detection device according to any one of claims 1 to 4, wherein, when detecting the utterance start, the utterance section detection means sequentially calculates, in synchronization with the input speech and for all input speech from the utterance section detection start time to the current time, the ratio between the largest cumulative likelihood among the acoustic models corresponding to speech of all speaker clusters and the cumulative likelihood of the acoustic model corresponding to non-speech of the same speaker cluster that follows the utterance section detection start state, and, based on the calculated ratio and a preset threshold, detects as the utterance start a time that is a certain length of time before the end time of the non-speech acoustic model at the start of the subword string with the largest cumulative likelihood.

The utterance section detection device according to any one of claims 1 to 5, wherein, when detecting the utterance end, the utterance section detection means sequentially calculates, in synchronization with the input speech and for all input speech from the utterance section detection start time to the current time, the ratio between the largest cumulative likelihood among the acoustic models corresponding to non-speech that follow the acoustic models corresponding to speech of all speaker clusters and the largest cumulative likelihood of the acoustic model corresponding to speech of the same speaker cluster, and, when the calculated ratio exceeds a preset threshold over a certain length of time, detects as the utterance end a time that is a certain length of time after the time at which the threshold began to be exceeded.

The utterance section detection device according to any one of claims 1 to 6, wherein the utterance section detection means outputs the speech of the utterance section from the input speech based on the time information of the utterance start and the utterance end.

An utterance section detection program for causing a computer to execute utterance section detection processing that detects an utterance section from input speech, the program causing the computer to execute:
acoustic analysis processing for converting the input speech into acoustic features;
continuous speech recognition processing for sequentially calculating the cumulative likelihood of each subword in synchronization with the input speech, using the acoustic features obtained by the acoustic analysis processing and a subword network composed of a preset acoustic model and language model; and
utterance section detection processing for sequentially detecting an utterance start and an utterance end from the cumulative likelihood of each subword.