JP2007233149A

JP2007233149A - Voice recognition device and voice recognition program

Info

Publication number: JP2007233149A
Application number: JP2006056235A
Authority: JP
Inventors: Toru Imai (今井亨); Shoei Sato (佐藤庄衛)
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2006-03-02
Filing date: 2006-03-02
Publication date: 2007-09-13
Anticipated expiration: 2026-03-02
Also published as: JP4700522B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device and a speech recognition program that achieve highly accurate speech recognition.

SOLUTION: The speech recognition device, which recognizes input speech using one or more speaker clusters with mutually different acoustic characteristics, comprises: acoustic analysis means for converting the input speech into acoustic features; identification means for identifying, from the acoustic features obtained by the acoustic analysis means, an attribute associated with a speaker cluster; and continuous speech recognition means for performing continuous speech recognition based on a search network for correct-word search, which is generated from preset acoustic models of the speaker clusters and a language model, and on constraint information on the speaker cluster attribute for the input speech.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speech recognition device and a speech recognition program, and more particularly to a speech recognition device and a speech recognition program for achieving highly accurate speech recognition.

In speech recognition used to produce subtitles and metadata for broadcast programs, it is important to improve recognition performance on speech in which male and female speakers are mixed.

Known methods for recognizing speech from multiple speaker clusters with different acoustic characteristics include a method that uses a single acoustic model that does not depend on gender or the like, and a method that determines speaker cluster attributes such as gender in advance, as pre-processing for continuous speech recognition (see, for example, Patent Document 1 and Non-Patent Document 1). Another known method runs the acoustic models of multiple speaker clusters, such as male and female, in parallel on word dictionaries but does not allow transitions between the word dictionaries of different speaker clusters within one utterance (see, for example, Patent Document 2 and Non-Patent Document 2).

Patent Document 1: JP 2003-99083 A
Non-Patent Document 1: F. Kubala, et al., "The 1996 BBN Byblos HUB-4 Transcription System", DARPA Speech Recognition Workshop, pp. 90-93, 1997.
Patent Document 2: JP 2005-345772 A
Non-Patent Document 2: Hiroshi Yamamoto et al., "Structure and Control Method of the Speech Recognition Part in the Japanese-English Speech Translation System ATR-MATRIX", Proceedings of the Acoustical Society of Japan, 2-Q-21, pp. 161-162, March 1998.

Among the conventional methods described above, the method using a single acoustic model that does not depend on gender or the like trains one global acoustic model in advance, ignoring all speaker clusters in the training speech, and performs recognition using only that model. Speaker clusters therefore need not be considered at recognition time, and the method is very simple to implement. However, its recognition rate is generally lower than that of methods that take speaker clusters into account, such as gender-dependent acoustic models, which is a problem in practice.

The method that determines speaker cluster attributes such as gender in advance, as pre-processing for continuous speech recognition, takes two forms. One compares the likelihoods of acoustic models that each represent the acoustic characteristics of an entire speaker cluster. The other performs phoneme recognition using the acoustic models of multiple speaker clusters.

In the former, the speaker cluster attribute is determined from the speech segment at the beginning of each utterance and is then assumed to hold throughout that utterance. The recognition rate therefore drops for speech such as dialogues, in which male and female voices are easily mixed within one utterance, and whenever an incorrect speaker cluster attribute is assumed because of background noise or similar effects.

In the latter, the final phoneme recognition result and speaker cluster attribute are determined only after the entire input speech of one utterance has been obtained, so recognition cannot begin as soon as the speaker starts talking. This is a practical problem for applications that require online processing with little delay, such as subtitling live broadcast programs.

Furthermore, in the method that runs the acoustic models of multiple speaker clusters, such as male and female, in parallel on word dictionaries but does not allow transitions between the word dictionaries of different speaker clusters within one utterance, multiple speaker clusters are considered only at the beginning of each utterance. As recognition proceeds, speaker clusters that do not match the input speech are gradually excluded from the search, so the recognition rate drops for speech such as dialogues, in which male and female voices are easily mixed within one utterance.

The present invention has been made in view of the above problems, and its object is to provide a speech recognition device and a speech recognition program that achieve highly accurate speech recognition.

To solve the above problems, the present invention employs means having the following features.

The invention described in claim 1 is a speech recognition device that recognizes input speech using one or more speaker clusters having different acoustic characteristics, comprising: acoustic analysis means for converting the input speech into acoustic features; speaker cluster attribute identification means for identifying a speaker cluster attribute from the acoustic features obtained by the acoustic analysis means; and continuous speech recognition means for performing continuous speech recognition based on a search network for correct-word search, generated from preset acoustic models of the speaker clusters and a language model, and on constraint information on the speaker cluster attribute for the input speech.

According to the invention of claim 1, speech recognition can be achieved with high accuracy.

The invention described in claim 2 has acoustic models of one or more speaker clusters that represent the acoustic characteristics of speech, and a language model that represents preset transitions between words, and comprises network expansion means for expanding the speaker cluster acoustic models into the search network according to the language model and a word dictionary preset for each speaker cluster.

According to the invention of claim 2, a highly accurate search network matched to the content of the input speech can be generated, and the input speech can be recognized with high accuracy using this search network.

The invention described in claim 3 is characterized in that the network expansion means constructs a search network that permits at least one of the following transitions: a transition from the recognition start state at the start of an utterance to the word dictionary start of every speaker cluster; a transition between word dictionaries of the same speaker cluster according to the language model; a transition from the word dictionary end of each speaker cluster to the word dictionary start of a different speaker cluster, in response to a change in the speaker cluster attribute; a transition from the word dictionary end of each speaker cluster to the recognition end state at the end of an utterance; and a transition from the recognition end state to the recognition start state for the next utterance.

According to the invention of claim 3, performing each of these state transitions improves the accuracy of the search network.

The invention described in claim 4 is characterized in that the speaker cluster attribute identification means outputs, for the input speech, time information on when the speaker cluster changed and/or attribute information on the speaker cluster after the change.

According to the invention of claim 4, changes of speaker cluster are known before continuous speech recognition is performed, so recognition can be carried out with higher accuracy.

The invention described in claim 5 is characterized in that, when the speaker cluster attribute changes, the continuous speech recognition means permits a transition in the search network from the word dictionary end of the speaker cluster before the change to the word dictionary start of the speaker cluster after the change, and continues to search the word dictionary of the speaker cluster before the change.

According to the invention of claim 5, each voice from multiple speaker clusters with different acoustic characteristics can be recognized with high accuracy. As a result, even when voices from multiple speaker clusters are mixed within one utterance, highly accurate recognition can be achieved with less computation and less delay than before.

The invention described in claim 6 is a speech recognition program that causes a computer to execute speech recognition processing that recognizes input speech using one or more speaker clusters having different acoustic characteristics, the program causing the computer to execute: acoustic analysis processing for converting the input speech into acoustic features; speaker cluster attribute identification processing for identifying a speaker cluster attribute from the acoustic features obtained by the acoustic analysis processing; and continuous speech recognition processing for performing continuous speech recognition based on a search network for correct-word search, generated from preset acoustic models of the speaker clusters and a language model, and on constraint information on the speaker cluster attribute for the input speech.

According to the invention of claim 6, speech recognition can be achieved with high accuracy. In addition, the speech recognition processing can be implemented easily by installing the program on a computer.

According to the present invention, speech recognition can be achieved with high accuracy.

<Outline of the present invention>
The present invention uses the acoustic models of one or more speaker clusters while permitting transitions between words of different speaker clusters within one utterance. Even for speech such as dialogues, in which male and female voices are easily mixed within one utterance, it achieves highly accurate speech recognition with less computation and less delay than before.

Specifically, the acoustic models of multiple speaker clusters, such as men and women, or elderly people, adults, and children, are expanded in parallel into a search network according to a language model and word dictionaries. When, for example, the speaker cluster attribute of the input speech changes from A to B, a transition is permitted in the search network from the word dictionary end of speaker cluster A to the word dictionary start of speaker cluster B, while the word dictionary of speaker cluster A continues to be searched. This achieves flexible parallel recognition of multiple speaker clusters mixed within one utterance.

Preferred embodiments of the speech recognition device and speech recognition program of the present invention, which have the features described above, are explained in detail below with reference to the drawings.

<Speech recognition device: device configuration>
FIG. 1 shows a configuration example of the speech recognition device of the present invention. The speech recognition device 10 shown in FIG. 1 comprises network expansion means 11, acoustic analysis means 12, speaker cluster attribute identification means 13, and continuous speech recognition means 14.

The network expansion means 11 uses the acoustic models 21 of one or more speaker clusters and the language model 22 to expand a search network 23 built from the word dictionary of each speaker cluster, and outputs the expanded search network 23 to the continuous speech recognition means 14.

With two speaker clusters (A and B), for example, the speaker cluster acoustic models 21 can be assigned freely: speaker cluster A may be male and B female, A adult and B child, or A wideband speech and B narrowband speech. The present invention is not limited to this; for example, the speaker clusters above may be combined to give three or more clusters, or a single cluster may be used.

The unit of the acoustic model can be set freely, for example phonemes or syllables, context-dependent or context-independent. A probabilistic statistical model typified by the hidden Markov model can be used (see, for example, Seiichi Nakagawa, "Speech Recognition by Probability Models", The Institute of Electronics, Information and Communication Engineers, 1988). The language model 22 can likewise be set freely, for example as a word-level N-gram model. The search network 23 produced by the network expansion means 11 is described in detail later.

The acoustic analysis means 12 converts the input speech 24 to be recognized into acoustic features 25 and outputs them to the speaker cluster attribute identification means 13 and the continuous speech recognition means 14. The acoustic features 25 have the same composition as the features used to train the acoustic model 21 of each speaker cluster, for example a cepstrum representing the frequency characteristics, the short-time power, and their dynamic features.
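As an illustration of two of the feature types named above, the following minimal sketch computes a short-time log power and a regression-based dynamic (delta) feature. The function names, the frame parameters, the edge clamping, and the regression formula are assumptions chosen for illustration; the source does not specify how the features are computed, and the cepstrum is omitted here.

```python
import math
from typing import List


def short_time_log_power(samples: List[float], frame_len: int, hop: int) -> List[float]:
    """Log of the mean squared amplitude of each frame (hypothetical helper)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        powers.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return powers


def delta(values: List[float], width: int = 2) -> List[float]:
    """Dynamic feature: linear-regression slope over +/- width frames.

    Frames beyond either edge are replaced by the nearest valid frame
    (an assumption made for this sketch).
    """
    n = len(values)
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        num = 0.0
        for k in range(1, width + 1):
            ahead = values[min(t + k, n - 1)]
            behind = values[max(t - k, 0)]
            num += k * (ahead - behind)
        out.append(num / denom)
    return out
```

The same `delta` function can be applied to each cepstral coefficient track as well as to the power track, so that the static and dynamic features share one frame rate.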

The speaker cluster attribute identification means 13 determines, online and in real time, the speaker cluster attribute of each segment of the input speech based on time information and the like. It outputs speaker cluster change information for the input speech 24, such as the time at which the speaker cluster changed and/or the attribute of the speaker cluster after the change, to the continuous speech recognition means 14 as the speaker cluster attribute 26. Because changes of speaker cluster are thus known before continuous speech recognition is performed, recognition can be carried out with higher accuracy. The attribute identification method used by the speaker cluster attribute identification means 13 is described in detail later.
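The change information described above (change time plus the attribute after the change) can be sketched as follows. This is only an illustration under stated assumptions: the per-frame label list, the `frame_shift_s` parameter, and the function name are hypothetical, and the first event simply reports the initial attribute at time zero.

```python
from typing import List, Tuple


def cluster_change_events(frame_labels: List[str],
                          frame_shift_s: float = 0.01) -> List[Tuple[float, str]]:
    """Turn per-frame cluster labels into (change time, new attribute) pairs."""
    events: List[Tuple[float, str]] = []
    previous = None
    for index, label in enumerate(frame_labels):
        if label != previous:
            # Report when the attribute changed and what it changed to.
            events.append((index * frame_shift_s, label))
            previous = label
    return events
```

Because each event depends only on frames seen so far, events of this kind can be emitted incrementally while the input is still arriving, which matches the online, real-time operation described above.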

The continuous speech recognition means 14 receives the search network 23, the acoustic features 25, and the speaker cluster attribute 26, and successively computes the likelihood of the acoustic features 25 using the acoustic models linked to the phonemes (or other units) of each word in the search network 23.

For transitions in the search network 23 that follow the language model 22, the continuous speech recognition means 14 also applies a linguistic score. It outputs the word string that best matches the utterance content of the input speech 24 as the speech recognition result 27.

At the start of an utterance, the continuous speech recognition means 14 begins computing likelihoods at every word start in the word dictionaries of all speaker clusters. As likelihood computation proceeds over more of the input speech, the likelihoods of the acoustic models for words of speaker clusters that do not match the input speech become low, and those clusters are gradually excluded from the search.

However, when the speaker cluster attribute 26 indicates that the speaker cluster has changed within one utterance, the continuous speech recognition means 14 permits a transition to the word dictionary start of the new speaker cluster while retaining all search paths of the speaker clusters still in the search. Speech from multiple speaker clusters within one utterance can therefore be recognized with high accuracy.

In other words, when the speaker cluster attribute 26 specifies a speaker cluster, the present invention does not keep only the specified word dictionary and abandon the search of the other speaker clusters. Instead, the word dictionaries of both the possible new speaker cluster and the speaker clusters already being searched remain search targets. The possibility that multiple speaker clusters are mixed within one utterance can thus be evaluated by continuous speech recognition, which is more detailed than speaker cluster identification, while the increase in computation is kept small.
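The search-target policy described above can be sketched at the level of whole speaker clusters. This is a simplified illustration, not the decoder itself: the function names are hypothetical, and the score-based beam pruning is an assumed stand-in for the unspecified mechanism by which poorly matching clusters drop out of the search.

```python
from typing import Dict, Optional, Set


def prune(scores: Dict[str, float], beam: float) -> Set[str]:
    """Keep clusters whose best path score is within `beam` of the overall best."""
    best = max(scores.values())
    return {cluster for cluster, score in scores.items() if score >= best - beam}


def update_active_clusters(active: Set[str], identified: Optional[str]) -> Set[str]:
    """Apply a speaker cluster attribute notification.

    The newly identified cluster is added as a search target, and every
    cluster that is still being searched is kept rather than dropped.
    """
    if identified is None:
        return set(active)
    return set(active) | {identified}
```

For example, if only cluster A survives pruning and the identification stage then reports cluster B, the search continues over both A and B, so an identification error does not by itself remove the correct cluster from the search.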

With the configuration of the speech recognition device 10 described above, transitions between words of multiple speaker clusters are possible within one utterance while the highly accurate acoustic models of the individual speaker clusters are used. Even for speech such as dialogues, in which male and female voices are easily mixed within one utterance, highly accurate recognition can be achieved with less computation and less delay than before.

In the speech recognition device 10 described above, the network expansion means 11 expands the search network 23 from the speaker cluster acoustic models 21 and the language model 22. The present invention is not limited to this; for example, a search network 23 expanded in advance may be stored in the continuous speech recognition means 14 or in other storage means (not shown).

<Search network 23>
The search network 23 expanded by the network expansion means 11 is described next. FIG. 2 shows an example of a search network for two speaker clusters. FIG. 3 shows an example of the internal structure of the word dictionary of each speaker cluster.

The search network 23 output from the network expansion means 11 can be configured, for example as shown in FIG. 2, to have a recognition start state 31, a word dictionary 32 linked to the acoustic model of speaker cluster A, a word dictionary 33 linked to the acoustic model of speaker cluster B, and a recognition end state 34.

Each word dictionary is, for example, a network in which the pronunciation symbols (phonemes or the like) of every word in the recognition vocabulary are expanded. During the correct-word search in continuous speech recognition, the likelihood of the acoustic features of the input speech is computed with the acoustic model corresponding to each phoneme.

A word dictionary can be organized as a tree-structured dictionary, in which words share the phonemes at their beginnings, or as a linear dictionary, in which the phonemes of each word are independent of those of other words. In a tree-structured dictionary, as shown in FIG. 3, the phoneme-level pronunciation symbols of each word are expanded into a tree on the network (for example, the word "aka" (red, /a/k/a/) and the word "aki" (autumn, /a/k/i/) share the phonemes /a/ and /k/), and each phoneme is linked to its corresponding acoustic model.
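The prefix sharing of a tree-structured dictionary can be sketched with a small trie built from the two example words above. The class and function names are hypothetical, and the link from each phoneme node to an acoustic model is left out of this sketch.

```python
from typing import Dict, List, Optional


class LexNode:
    """One phoneme arc of the tree; `word` is set where a pronunciation ends."""

    def __init__(self) -> None:
        self.children: Dict[str, "LexNode"] = {}
        self.word: Optional[str] = None


def build_tree_lexicon(pronunciations: Dict[str, List[str]]) -> LexNode:
    """Insert each pronunciation so that words share their leading phonemes."""
    root = LexNode()
    for word, phonemes in pronunciations.items():
        node = root
        for phoneme in phonemes:
            node = node.children.setdefault(phoneme, LexNode())
        node.word = word
    return root


def count_nodes(node: LexNode) -> int:
    """Number of phoneme nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


lexicon = build_tree_lexicon({"aka": ["a", "k", "a"], "aki": ["a", "k", "i"]})
```

Here the tree holds four phoneme nodes, whereas a linear dictionary for the same two words would hold six, which is the saving that prefix sharing provides.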

From the recognition start state 31, a transition can be made without restriction, immediately after recognition begins at the start of an utterance, to the first phoneme of every word in the word dictionaries 32 and 33 of all speaker clusters (arrow *1 in FIG. 2). From the end of the word dictionary 32 of speaker cluster A, a transition can be made according to the language model 22 to the start of the word dictionary 32 of the same speaker cluster A in order to recognize the following word (arrow *2 in FIG. 2).

From the end of the word dictionary 32 of speaker cluster A, a transition can also be made to the start of the word dictionary 33 of the different speaker cluster B, in response to a change in the speaker cluster attribute (arrow *3 in FIG. 2).

Similarly, from the end of the word dictionary 33 of speaker cluster B, a transition can be made according to the language model 22 to the start of the word dictionary 33 of the same speaker cluster B in order to recognize the following word (arrow *2 in FIG. 2). From the end of the word dictionary 33 of speaker cluster B, a transition can also be made to the start of the word dictionary 32 of the different speaker cluster A, in response to a change in the speaker cluster attribute (arrow *3 in FIG. 2).

At the end of an utterance, a transition can be made from the end of the word dictionary 32 or 33 of each speaker cluster to the recognition end state 34 (arrow *4 in FIG. 2). From the recognition end state 34, a transition can be made to the recognition start state 31 in order to recognize the next utterance (arrow *5 in FIG. 2). The start and end of an utterance can be detected quickly and accurately using cumulative phoneme likelihoods.

In this way, the network expansion means 11 constructs a search network that permits at least one of the following transitions: from the recognition start state at the start of an utterance to the word dictionary start of every speaker cluster; between word dictionaries of the same speaker cluster according to the language model; from the word dictionary end of each speaker cluster to the word dictionary start of a different speaker cluster, in response to a change in the speaker cluster attribute; from the word dictionary end of each speaker cluster to the recognition end state at the end of an utterance; and from the recognition end state to the recognition start state for the next utterance. Performing these state transitions improves the accuracy of the search network, and highly accurate speech recognition can be achieved using this search network 23.
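The five kinds of transition listed above can be summarized in a small state-transition sketch. It treats each speaker cluster's word dictionary as a single state reached at a word end, which is a deliberate simplification; the state names, the function signature, and the `changed_to` and `utterance_end` flags are hypothetical, and language model scoring is not modelled.

```python
from typing import List, Optional

START, END = "start", "end"


def next_states(state: str,
                clusters: List[str],
                changed_to: Optional[str] = None,
                utterance_end: bool = False) -> List[str]:
    """Return the states reachable from `state` at a word boundary."""
    if state == START:
        return list(clusters)            # arrow *1: to every cluster's dictionary
    if state == END:
        return [START]                   # arrow *5: ready for the next utterance
    if utterance_end:
        return [END]                     # arrow *4: word end to recognition end
    targets = [state]                    # arrow *2: stay in the same cluster
    if changed_to is not None and changed_to != state and changed_to in clusters:
        targets.append(changed_to)       # arrow *3: cross to the changed cluster
    return targets
```

Note that the cross-cluster arc is added alongside the same-cluster arc rather than replacing it, which reflects the description that the cluster before the change remains a search target.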

<Speaker cluster attribute identification method>
The speaker cluster attribute identification method used by the speaker cluster attribute identification means 13 is described next. With two speaker clusters (male and female), for example, one approach is gender-parallel phoneme recognition: male and female phoneme recognition run in parallel using sequential determination processing (see, for example, JP 2001-92496 A), and the better of the two confirmed results (scores) is output. Such a method performs sub-word-level (for example, phoneme, syllable, or triphone) recognition for multiple speaker clusters with little delay, and can identify which portion of the input speech belongs to which speaker cluster.

Another approach performs gender-parallel phoneme recognition that permits transitions between the male and female models and prunes them jointly, uses cumulative phoneme likelihoods to detect the start and end of each utterance quickly, and identifies the speaker cluster attribute from the result. This approach is explained below with reference to the drawings. The following example uses phoneme recognition with parallel male and female gender-dependent acoustic models, as one example of a cluster attribute, to detect utterance segments with little delay from the input speech.

＜発話区間検出＞ <Utterance section detection>
図４は、男女並列音素認識のネットワークの一例を示す図である。図４に示すように、男女間遷移が可能で枝刈り共通の男女並列音素認識を行い、累積音素尤度を利用して発話の始端と終端を迅速に検出する。 FIG. 4 is a diagram showing an example of a gender-parallel phoneme recognition network. As shown in FIG. 4, gender-parallel phoneme recognition is performed with male-female transitions permitted and pruning shared, and the start and end of an utterance are quickly detected using cumulative phoneme likelihoods.

具体的には、入力音声の特徴ベクトルをケプストラムと短時間パワー及びそれらの動的特徴量として、様々な音響環境の男性話者音声から学習した音素環境非依存音響モデルと、同様に学習した女性の音響モデルから、音素バイグラムを利用して、図４に示すような音素ネットワークを構成する。ここで、性別ｇ∈｛０（男性），１（女性）｝毎の非音声モデルをｓｉｌ_ｇとし、それ以外の音素モデルをｐｈ_ｇ，ｉとする。 Specifically, the feature vector of the input speech consists of the cepstrum, the short-time power, and their dynamic features. A phoneme network as shown in FIG. 4 is constructed, using phoneme bigrams, from a context-independent phoneme acoustic model trained on male speaker speech in various acoustic environments and a female acoustic model trained in the same way. Here, the non-speech model for each gender g ∈ {0 (male), 1 (female)} is denoted sil_g, and the other phoneme models are denoted ph_{g,i}.

ここで、発話の始端検出開始時刻τから現時刻ｔまでの入力音声の特徴ベクトル列をｘ_τ ^ｔとし、最尤音素列及び始端の非音声モデルの累積音素対数尤度をそれぞれ以下に示す（１）、（２）式で表す。 Here, let x_τ^t denote the feature vector sequence of the input speech from the start-detection start time τ to the current time t. The cumulative phoneme log-likelihoods of the maximum likelihood phoneme sequence and of the initial non-speech model are expressed by equations (1) and (2) below, respectively.

ここで、発話始端では最尤音素列の累積尤度の対数値Ｌ_１と、始端の非スピーチ音響モデルの累積尤度の対数値Ｌ_２との差が一定の閾値θ_{ｓｔａｒｔ}を超えた時、すなわち（Ｌ_１−Ｌ_２）＞θ_{ｓｔａｒｔ}となる時、最尤音素列始端の非音声モデルの終端から一定の時間長ｔ_{ｓｔａｒｔ}遡った時刻を発話始端とする。

Here, at the utterance start, when the difference between the log cumulative likelihood L_1 of the maximum likelihood phoneme sequence and the log cumulative likelihood L_2 of the initial non-speech acoustic model exceeds a fixed threshold θ_start, that is, when (L_1 − L_2) > θ_start, the utterance start is set to the time that precedes the end of the initial non-speech model of the maximum likelihood phoneme sequence by a fixed time length t_start.
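The start condition can be written as a short check. The following is a minimal sketch, assuming times are counted in frames; the function name, its parameters, and the clamp to frame 0 are illustrative assumptions and do not come from the patent.

```python
from typing import Optional


def detect_utterance_start(
    l1: float,            # log cumulative likelihood of the best phoneme sequence
    l2: float,            # log cumulative likelihood of the initial non-speech model
    sil_end_frame: int,   # frame where the initial non-speech model of the best sequence ends
    theta_start: float,   # detection threshold
    t_start: int,         # number of frames to step back from sil_end_frame
) -> Optional[int]:
    """Return the utterance start frame, or None if the condition is not met."""
    if l1 - l2 > theta_start:
        # Assumption: never report a start before the first frame.
        return max(0, sil_end_frame - t_start)
    return None
```

For example, with l1 = -100, l2 = -130, and θ_start = 20, the difference of 30 exceeds the threshold, so the start is placed t_start frames before the end of the initial non-speech model.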

なお、発話始端を検出するまでの長い非音声を吸収するため、始端検出条件が時間長ｔ_ｉｄｌｅ継続して満たされなかった場合、音素認識をリセットし、始端検出開始時刻τを更新する。また、音素認識中は、男女の音素モデル間の遷移を許可し、そのときのペナルティ（重み）を累積音素対数尤度に加える。 Note that, to absorb long non-speech before the utterance start is detected, when the start detection condition has not been satisfied continuously for the time length t_idle, phoneme recognition is reset and the start-detection start time τ is updated. During phoneme recognition, transitions between the male and female phoneme models are permitted, and the penalty (weight) for such a transition is added to the cumulative phoneme log-likelihood.
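The idle reset and the cross-gender penalty can be sketched as follows. The class and function names are hypothetical, times are assumed to be frame indices, and setting τ to the current frame on reset is an assumption, since the text only says that τ is updated.

```python
class StartDetector:
    """Tracks the start condition and resets after t_idle unsatisfied frames."""

    def __init__(self, theta_start: float, t_idle: int) -> None:
        self.theta_start = theta_start
        self.t_idle = t_idle
        self.tau = 0          # start-detection start time (frame index)
        self.unsatisfied = 0  # consecutive frames without the start condition

    def step(self, t: int, l1: float, l2: float) -> bool:
        """Process frame t; return True when the start condition holds."""
        if l1 - l2 > self.theta_start:
            self.unsatisfied = 0
            return True
        self.unsatisfied += 1
        if self.unsatisfied >= self.t_idle:
            # Reset point: a real system would also clear the phoneme
            # recogniser's hypotheses here (omitted in this sketch).
            self.tau = t
            self.unsatisfied = 0
        return False


def apply_cross_gender_penalty(cumulative_log_likelihood: float, penalty: float) -> float:
    """Add the transition penalty (typically negative) to the cumulative score."""
    return cumulative_log_likelihood + penalty
```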

次に、終端の非音声モデルと最尤音素列の累積音素対数尤度を、それぞれ以下に示す（３）、（４）式で表す。 Next, the cumulative phoneme logarithmic likelihood of the terminal non-speech model and the maximum likelihood phoneme sequence is represented by the following equations (3) and (4), respectively.

このとき、終端が非スピーチ音響モデルとなる最尤音素列のうち、最大の累積尤度の対数値Ｌ_３と、同話者クラスタのスピーチ音響モデルを終端とする最尤サブワード列の累積尤度の対数値Ｌ_４との差が一定の閾値θ_ｅｎｄを越えた状態で、更に時間長ｔ_ｅｎｄ１を継続して超えた場合、すなわちｔ_ｅｎｄ１継続して（Ｌ_３−Ｌ_４）＞θ_ｅｎｄが一定の時間長ｔ_ｅｎｄ１継続して満たされた場合、現時刻ｔから時間長ｔ_ｅｎｄ２（ｔ_ｅｎｄ２＜ｔ_ｅｎｄ１）遡った時刻を発話終端とする。

At this time, let L_3 be the largest log cumulative likelihood among the maximum likelihood phoneme sequences that end in a non-speech acoustic model, and let L_4 be the log cumulative likelihood of the maximum likelihood subword sequence that ends in a speech acoustic model of the same speaker cluster. When the difference between them exceeds a fixed threshold θ_end continuously for the time length t_end1, that is, when (L_3 − L_4) > θ_end remains satisfied for the fixed time length t_end1, the utterance end is set to the time that precedes the current time t by the time length t_end2 (t_end2 < t_end1).
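The end condition differs from the start condition in that it must hold continuously. A minimal sketch, assuming per-frame values of L_3 and L_4 are available as sequences and that t_end1 and t_end2 are counted in frames (the function name and signature are illustrative, not from the patent):

```python
from typing import Optional, Sequence


def detect_utterance_end(
    l3_per_frame: Sequence[float],  # best score of sequences ending in non-speech
    l4_per_frame: Sequence[float],  # best score of sequences ending in speech
    theta_end: float,
    t_end1: int,  # frames the condition must hold without interruption
    t_end2: int,  # frames to step back from the current frame (t_end2 < t_end1)
) -> Optional[int]:
    """Return the utterance end frame, or None if no end is detected."""
    run = 0  # length of the current uninterrupted run of satisfied frames
    for t, (l3, l4) in enumerate(zip(l3_per_frame, l4_per_frame)):
        run = run + 1 if l3 - l4 > theta_end else 0
        if run >= t_end1:
            return t - t_end2
    return None
```

Requiring an uninterrupted run avoids declaring an end on a short pause, and stepping back by t_end2 places the end closer to where the non-speech actually began.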

これにより、発話始端及び発話終端の時刻及び／又はその区間のクラスタ属性を検出することができる。なお、上述したｔ_{ｓｔａｒｔ}及びｔ_ｅｎｄ２は、０以上で任意に設定することができる。もちろん、この他にも音声パワーを用いた手法等の発話区間検出手法を用いることができる。 As a result, the times of the utterance start and utterance end and/or the cluster attribute of that section can be detected. The values t_start and t_end2 described above can be set arbitrarily to 0 or greater. Other utterance section detection methods, such as a method based on speech power, can of course also be used.

このようにして得られた発話始端及び発話終端の時刻情報や属性情報に基づいて、音響特徴量２５に対する上述した話者クラスタ属性２６を同定することができる。ここで、図５は、話者クラスタ属性の一例を示す図である。 Based on the time information and attribute information of the utterance start end and utterance end obtained in this way, the above-described speaker cluster attribute 26 for the acoustic feature amount 25 can be identified. Here, FIG. 5 is a diagram illustrating an example of speaker cluster attributes.

例えば、図５に示すように話者クラスタ属性は、入力音声の音響特徴量Ｘの列Ｘ_１，…，Ｘ_Ｔに対して、話者クラスタが男性と女性の場合、Ｘ_１からＸ_ｔまでは男性、Ｘ_ｔ＋１からＸ_ｋまでは女性、Ｘ_ｋ＋１からＸ_Ｔまでは男性といった話者クラスタの属性情報や各音響特徴量における時間情報により構成される。 For example, as shown in FIG. 5, for the sequence X_1, …, X_T of acoustic features X of the input speech, when the speaker clusters are male and female, the speaker cluster attribute consists of speaker cluster attribute information, such as X_1 to X_t being male, X_{t+1} to X_k being female, and X_{k+1} to X_T being male, together with time information for each acoustic feature.
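One way to hold this attribute information is as a list of labelled frame ranges. The sketch below is an assumed representation, not one specified in the patent: `ClusterSegment` and `segments_from_labels` are hypothetical names, and frames are numbered from 1 to match X_1, …, X_T.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ClusterSegment:
    start: int    # first frame of the segment (1-based, inclusive)
    end: int      # last frame of the segment (1-based, inclusive)
    cluster: str  # speaker cluster label, e.g. "male" or "female"


def segments_from_labels(labels: Sequence[str]) -> List[ClusterSegment]:
    """Collapse per-frame cluster labels into contiguous segments."""
    segments: List[ClusterSegment] = []
    for i, label in enumerate(labels, start=1):
        if segments and segments[-1].cluster == label:
            # Same cluster as the previous frame: extend the open segment.
            last = segments[-1]
            segments[-1] = ClusterSegment(last.start, i, label)
        else:
            # Cluster changed (or first frame): open a new segment.
            segments.append(ClusterSegment(i, i, label))
    return segments
```

Each segment boundary corresponds to a time at which the speaker cluster changes, which is the information the recogniser needs to decide when a cross-cluster transition may be enabled.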

なお、上述した話者クラスタ属性は、本発明においては男性、女性に限定されず、例えば高齢者と成人と子供とを属性分けしてもよく、これらを男性と女性とで組み合わせて、詳細な属性分けを行ってもよい。 Note that the speaker cluster attribute described above is not limited to male and female in the present invention. For example, the elderly, adults, and children may be treated as separate attributes, and these may be combined with male and female for a finer attribute division.

これにより、複数の話者クラスタの高精度な音響モデルを利用しつつ、１発話中での複数の話者クラスタの単語間の遷移を可能とし、１発話中に男女の音声が混在しやすい対談等の音声においても、従来よりも少ない演算量且つ少ない遅れ時間で高精度な音声認識を実現することができる。 This enables transitions between words of multiple speaker clusters within one utterance while using high-accuracy acoustic models of multiple speaker clusters. Even for speech such as dialogues, in which male and female voices tend to be mixed within one utterance, highly accurate speech recognition can be realized with less computation and less delay than before.

＜実行プログラム＞ <Execution program>
ここで、上述した音声認識装置１０は、上述した専用の装置構成等を用いて本発明における音声認識処理を行うこともできるが、各構成における処理をコンピュータに実行させることができる実行プログラム（音声認識プログラム）を生成し、例えば、汎用のパーソナルコンピュータ、サーバ等にそのプログラムをインストールすることにより、本発明に係る音声認識処理を実現することができる。 The speech recognition apparatus 10 described above can perform the speech recognition processing of the present invention using the dedicated apparatus configuration described above. Alternatively, the speech recognition processing according to the present invention can be realized by generating an execution program (speech recognition program) that causes a computer to execute the processing of each component and installing that program on, for example, a general-purpose personal computer or server.

＜ハードウェア構成＞ <Hardware configuration>
ここで、本発明における音声認識処理が実行可能なコンピュータのハードウェア構成例について図を用いて説明する。図６は、本発明における音声認識処理が実現可能なハードウェア構成の一例を示す図である。 Here, an example of the hardware configuration of a computer capable of executing the speech recognition processing of the present invention will be described with reference to the drawings. FIG. 6 is a diagram showing an example of a hardware configuration capable of realizing the speech recognition processing of the present invention.

図６におけるコンピュータ本体には、入力装置４１と、出力装置４２と、ドライブ装置４３と、補助記憶装置４４と、メモリ装置４５と、各種制御を行うＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）４６と、ネットワーク接続装置４７とを有するよう構成されており、これらはシステムバスＢで相互に接続されている。 The computer main body in FIG. 6 includes an input device 41, an output device 42, a drive device 43, an auxiliary storage device 44, a memory device 45, a CPU (Central Processing Unit) 46 that performs various controls, and a network connection device 47, which are interconnected by a system bus B.

入力装置４１は、ユーザが操作するキーボード及びマウス等のポインティングデバイス及び音声入力デバイスを有しており、ユーザからのプログラムの実行指示等、各種操作信号、音声信号を入力する。出力装置４２は、本発明における処理を行うためのコンピュータ本体を操作するのに必要な各種ウィンドウやデータ等を表示するディスプレイや音声を出力するスピーカ等を有し、ＣＰＵ４６が有する制御プログラムにより実行経過や結果等を表示又は音声出力することができる。 The input device 41 has a keyboard, a pointing device such as a mouse, and a voice input device operated by the user, and inputs various operation signals, such as program execution instructions from the user, and voice signals. The output device 42 has a display that shows the various windows and data needed to operate the computer main body for the processing of the present invention, a speaker that outputs sound, and the like, and can display or audibly output execution progress and results under the control program of the CPU 46.

ここで、本発明において、コンピュータ本体にインストールされる実行プログラムは、例えば、ＣＤ−ＲＯＭ等の記録媒体４８等により提供される。プログラムを記録した記録媒体４８は、ドライブ装置４３にセット可能であり、記録媒体４８に含まれる実行プログラムが、記録媒体４８からドライブ装置４３を介して補助記憶装置４４にインストールされる。 Here, in the present invention, the execution program installed in the computer main body is provided by, for example, the recording medium 48 such as a CD-ROM. The recording medium 48 on which the program is recorded can be set in the drive device 43, and the execution program included in the recording medium 48 is installed in the auxiliary storage device 44 from the recording medium 48 via the drive device 43.

また、ドライブ装置４３は、本発明に係る実行プログラムを記録媒体４８に記録することができる。これにより、その記録媒体４８を用いて、他の複数のコンピュータに容易にインストールすることができ、容易に音声認識処理を実現することができる。 Further, the drive device 43 can record the execution program according to the present invention on the recording medium 48. Thereby, using the recording medium 48, it can be easily installed in a plurality of other computers, and voice recognition processing can be easily realized.

補助記憶装置４４は、ハードディスク等のストレージ手段であり、本発明における実行プログラムや、コンピュータに設けられた制御プログラム等を蓄積し必要に応じて入出力を行うことができる。また、補助記憶装置４４は、上述した話者クラスタの音響モデル２１や言語モデル２２、探索ネットワーク２３、入力音声２４、音響特徴量２５、話者クラスタ属性２６、及び音声認識結果２７等を蓄積する蓄積手段として用いることもできる。 The auxiliary storage device 44 is a storage means such as a hard disk, and can store an execution program according to the present invention, a control program provided in a computer, and the like, and can perform input / output as necessary. The auxiliary storage device 44 stores the speaker cluster acoustic model 21 and language model 22, search network 23, input speech 24, acoustic feature 25, speaker cluster attribute 26, speech recognition result 27, and the like. It can also be used as storage means.

ＣＰＵ４６は、ＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）等の制御プログラム、及び補助記憶装置４４から読み出されメモリ装置４５に格納されている実行プログラムに基づいて、各種演算や各ハードウェア構成部とのデータの入出力等、コンピュータ全体の処理を制御して、音声認識処理における各処理を実現することができる。また、プログラムの実行中に必要な各種情報等は、補助記憶装置４４から取得することができ、また格納することもできる。 The CPU 46 performs various calculations and data input / output with each hardware component based on a control program such as an OS (Operating System) and an execution program read from the auxiliary storage device 44 and stored in the memory device 45. Each processing in the voice recognition processing can be realized by controlling the processing of the entire computer. Various information necessary during the execution of the program can be acquired from the auxiliary storage device 44 and can also be stored.

ネットワーク接続装置４７は、電話回線やＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）ケーブル等の通信ネットワーク等と接続することにより、実行プログラムを通信ネットワークに接続されている他の端末等から取得したり、プログラムを実行することで得られた実行結果又は本発明における実行プログラムを他の端末等に提供することができる。 The network connection device 47 obtains an execution program from another terminal connected to the communication network or executes the program by connecting to a communication network such as a telephone line or a LAN (Local Area Network) cable. The execution result obtained in this way or the execution program in the present invention can be provided to other terminals or the like.

上述したようなハードウェア構成により、特別な装置構成を必要とせず、低コストで上述した音声認識処理を実現することができる。また、プログラムをインストールすることにより、容易に音声認識処理を実現することができる。 With the hardware configuration described above, the above-described voice recognition processing can be realized at a low cost without requiring a special device configuration. In addition, voice recognition processing can be easily realized by installing a program.

＜音声認識処理手順＞ <Speech recognition processing procedure>
次に、本発明における実行プログラム（音声認識プログラム）を用いた音声認識処理手順についてフローチャートを用いて説明する。図７は、音声認識処理手順の一例を示すフローチャートである。 Next, the speech recognition processing procedure using the execution program (speech recognition program) of the present invention will be described with reference to a flowchart. FIG. 7 is a flowchart showing an example of the speech recognition processing procedure.

図７において、まずプログラム開始直後に複数の話者クラスタの音響モデルと言語モデルを利用して、探索ネットワークを展開する（Ｓ０１）。なお、ここまでの処理は、前処理として予め処理されていてもよい。 In FIG. 7, first, immediately after the start of the program, a search network is developed using acoustic models and language models of a plurality of speaker clusters (S01). In addition, the process so far may be processed previously as pre-processing.

次に、音声入力があるか否かを判断し（Ｓ０２）、音声が入力された場合（Ｓ０２において、ＹＥＳ）、１フレーム分の音響特徴量の算出に必要な例えば２５ミリ秒程度の短い区間の音声をデジタル入力し（Ｓ０３）、音響分析を行う（Ｓ０４）。次に、入力音声の各区間の話者クラスタ属性を同定する（Ｓ０５）。 Next, it is determined whether there is speech input (S02). When speech is input (YES in S02), a short section of speech, for example about 25 milliseconds, needed to calculate the acoustic features for one frame is digitally input (S03), and acoustic analysis is performed (S04). Next, the speaker cluster attribute of each section of the input speech is identified (S05).

ここで、Ｓ０５により同定した話者クラスタ属性において、属性に変化があったか否かを判断し（Ｓ０６）、属性変化があった場合（Ｓ０６において、ＹＥＳ）、変化した異なる他の話者クラスタの単語辞書への遷移を許可する（Ｓ０７）。また、属性変化がなかった場合（Ｓ０６において、ＮＯ）、又はＳ０７の処理が終了後、探索ネットワーク上の各単語の音響モデルで尤度を逐次算出する（Ｓ０８）。なお、Ｓ０８の処理においては、言語モデルによるスコアも加味して尤度を算出する。 Here, it is determined whether the speaker cluster attribute identified in S05 has changed (S06). If the attribute has changed (YES in S06), the transition to the word dictionary of the other speaker cluster to which it has changed is permitted (S07). If the attribute has not changed (NO in S06), or after S07 has finished, the likelihood is calculated sequentially with the acoustic model of each word on the search network (S08). In S08, the likelihood is calculated taking the language model score into account as well.

ここで、音声認識処理を終了するか否かを判断し（Ｓ０９）、終了しない場合（Ｓ０９のおいて、ＮＯ）、Ｓ０２に戻り、以後同様の処理を継続し連続した音声認識処理を実行する。また、Ｓ０２の処理において、音声入力がない場合（Ｓ０２において、ＮＯ）、つまり発話終了である場合、又はＳ０９の処理において、音声認識処理を終了する場合（Ｓ０９においてＹＥＳ）、認識結果の単語列を出力し（Ｓ１０）、処理を終了する。もちろん、上述した逐次確定処理により、発話終了前に、認識の途中結果の単語列を逐次出力することも可能である。 Here, it is determined whether to end the speech recognition processing (S09). If it is not to be ended (NO in S09), the procedure returns to S02 and the same processing is repeated to perform continuous speech recognition. If there is no speech input in S02 (NO in S02), that is, the utterance has ended, or if the speech recognition processing is to be ended in S09 (YES in S09), the word sequence of the recognition result is output (S10) and the processing ends. The sequential decision process described above also makes it possible to output the word sequence of intermediate recognition results successively before the utterance ends.
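The per-frame loop of steps S02 to S10 can be sketched as below. This is an illustrative outline only: the `recognize` function and its callable parameters (`analyze`, `identify_cluster`, `score_frame`, `best_words`) are hypothetical stand-ins for the acoustic analysis, attribute identification, and search components; network expansion (S01) is assumed to have been done beforehand, and the explicit termination check (S09) is folded into exhaustion of the input.

```python
from typing import Any, Callable, Iterable, List, Optional


def recognize(
    frames: Iterable[Any],
    analyze: Callable[[Any], Any],                   # S04: frame -> acoustic features
    identify_cluster: Callable[[Any], str],          # S05: features -> cluster label
    score_frame: Callable[[Any, Optional[str]], None],  # S08: update search scores
    best_words: Callable[[], List[str]],             # S10: read out the result
) -> List[str]:
    """Frame-synchronous recognition loop following the flowchart steps."""
    previous: Optional[str] = None
    for frame in frames:                      # S02/S03: next short speech section
        features = analyze(frame)             # S04: acoustic analysis
        cluster = identify_cluster(features)  # S05: speaker cluster attribute
        changed = previous is not None and cluster != previous  # S06
        # S07: name the cluster whose word dictionary may now be entered.
        permitted_target = cluster if changed else None
        score_frame(features, permitted_target)  # S08: acoustic + language scores
        previous = cluster
    return best_words()                       # S10: output the word sequence
```

Passing the permitted target into the scoring step keeps the attribute constraint separate from the search itself, so the same loop would work with a different identification method.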

上述したように、音声認識プログラムを用いた音声認識処理により、迅速且つ高精度に音声認識を実現することができる。また、プログラムをインストールすることにより、容易に音声認識処理を実現することができる。 As described above, voice recognition can be performed quickly and with high accuracy by voice recognition processing using a voice recognition program. In addition, voice recognition processing can be easily realized by installing a program.

＜音声認識結果の比較例＞ <Comparison example of speech recognition results>
ここで、従来手法と本発明手法とにおける音声認識結果の比較例について説明する。一例として男女の音声が混在するニュース番組の対談等の音声認識を行った結果、従来手法である性別に依存しない唯一の音響モデルを利用した場合の単語誤認識率は１２．２％（入力音声の時間長に対する認識処理時間の比＝認識処理実時間比０．８１倍）であった。また、男女別々の音響モデルを並列に動作させると、１発話中の男女の単語間の遷移を許さなかった場合の単語誤認識率は１１．９％（認識処理実時間比０．９３倍）、同様にして男女の単語間の遷移を制約なしで常に許す場合の単語誤認識率は１１．３％（認識処理実時間比１．２８倍）であった。 Here, a comparison of speech recognition results between the conventional method and the method of the present invention will be described. As an example, speech such as dialogues in news programs with mixed male and female voices was recognized. With the conventional method, which uses a single gender-independent acoustic model, the word error rate was 12.2% (ratio of recognition processing time to the duration of the input speech, that is, the real-time ratio of recognition processing, 0.81). When separate male and female acoustic models were run in parallel, the word error rate was 11.9% (real-time ratio 0.93) when transitions between male and female words within one utterance were not allowed, and 11.3% (real-time ratio 1.28) when such transitions were always allowed without constraint.

これに対し、本発明である男女の単語間の遷移を性別属性の制約にしたがって許可／不許可の設定をした場合の単語誤認識率は１１．１％であり、認識処理実時間比は０．９３倍にまで改善し、特に認識率と処理時間において本発明の効果が示された。 In contrast, with the present invention, in which transitions between male and female words are permitted or prohibited according to the gender attribute constraint, the word error rate was 11.1% and the real-time ratio of recognition processing improved to 0.93, demonstrating the effect of the present invention in both recognition rate and processing time.

すなわち、男女音声の認識では、性別非依存の音響モデルよりも性別依存音響モデルを男女並列に動作させ、発話中の男女間遷移を許可する方が誤認識率は低く、更に音素認識による性別属性の制約により誤認識率と認識時間が改善した。 That is, in recognizing male and female speech, running gender-dependent acoustic models in parallel and permitting male-female transitions within an utterance gives a lower error rate than a gender-independent acoustic model, and the gender attribute constraint obtained by phoneme recognition further improves both the error rate and the recognition time.

上述したように本発明によれば、高精度に音声認識を実現することができる。具体的には、本発明は、複数の話者クラスタの音響モデルを言語モデル及び単語辞書にしたがって並列に探索ネットワークへ展開し、入力音声の話者クラスタ属性の制約を利用して、１発話中での異なる話者クラスタの単語間の遷移を可能とすることにより、１発話中に複数の話者クラスタの音声が混在した場合でも、従来よりも少ない演算量且つ少ない遅れ時間で高精度な音声認識を実現することができる。 As described above, according to the present invention, speech recognition can be realized with high accuracy. Specifically, the present invention expands the acoustic models of a plurality of speaker clusters in parallel into a search network according to a language model and word dictionaries, and uses the speaker cluster attribute constraint of the input speech to enable transitions between words of different speaker clusters within one utterance. As a result, even when the voices of multiple speaker clusters are mixed within one utterance, highly accurate speech recognition can be realized with less computation and less delay than before.

これにより、例えば男性と女性、高齢者と成人と子供等、話者の声の音響的特徴が複数想定され、予め話者の声の音響的特徴を限定することのできない状況や、複数の音響的特徴の音声が１発話中に混在する状況、あるいは生放送番組の字幕制作等のオンライン処理且つ少ない時間遅れが要求される音声認識で、本発明を適用することができる。 Accordingly, the present invention can be applied in situations where multiple acoustic characteristics of speakers' voices are expected (for example, men and women, or the elderly, adults, and children) and cannot be restricted in advance, in situations where speech with multiple acoustic characteristics is mixed within one utterance, and in speech recognition that requires online processing with little delay, such as caption production for live broadcast programs.

また、本発明は、放送番組の字幕制作、音声対話システム、音声ワープロ、会議の議事録の自動作成、声による機器の制御等、音声認識や言語処理を利用した様々な分野の技術に適用することができる。 In addition, the present invention is applied to technologies in various fields using speech recognition and language processing, such as subtitle production for broadcast programs, voice dialogue systems, voice word processors, automatic creation of meeting minutes, and control of devices by voice. be able to.

以上本発明の好ましい実施例について詳述したが、本発明は係る特定の実施形態に限定されるものではなく、特許請求の範囲に記載された本発明の要旨の範囲内において、種々の変形、変更が可能である。 The preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to such specific embodiments, and various modifications, within the scope of the gist of the present invention described in the claims, It can be changed.

本発明における音声認識装置の一構成例を示す図である。 A diagram showing a configuration example of the speech recognition apparatus of the present invention.
話者クラスタ数が２の場合の探索ネットワークの一例を示す図である。 A diagram showing an example of the search network when the number of speaker clusters is two.
各話者クラスタの単語辞書の内部構造の一例を示す図である。 A diagram showing an example of the internal structure of the word dictionary of each speaker cluster.
男女並列音素認識のネットワークの一例を示す図である。 A diagram showing an example of the gender-parallel phoneme recognition network.
話者クラスタ属性の一例を示す図である。 A diagram showing an example of speaker cluster attributes.
本発明における音声認識処理が実現可能なハードウェア構成の一例を示す図である。 A diagram showing an example of a hardware configuration capable of realizing the speech recognition processing of the present invention.
音声認識処理手順の一例を示すフローチャートである。 A flowchart showing an example of the speech recognition processing procedure.

Explanation of symbols

１０音声認識装置 10 Speech recognition apparatus
１１ネットワーク展開手段 11 Network expansion means
１２音響分析手段 12 Acoustic analysis means
１３話者クラスタ属性同定手段 13 Speaker cluster attribute identification means
１４連続音声認識手段 14 Continuous speech recognition means
２１音響モデル 21 Acoustic model
２２言語モデル 22 Language model
２３探索ネットワーク 23 Search network
２４入力音声 24 Input speech
２５音響特徴量 25 Acoustic feature
２６話者クラスタ属性 26 Speaker cluster attribute
２７音声認識結果 27 Speech recognition result
３１認識開始状態 31 Recognition start state
３２、３３単語辞書 32, 33 Word dictionary
３４認識終了状態 34 Recognition end state
４１入力装置 41 Input device
４２出力装置 42 Output device
４３ドライブ装置 43 Drive device
４４補助記憶装置 44 Auxiliary storage device
４５メモリ装置 45 Memory device
４６ＣＰＵ 46 CPU
４７ネットワーク接続装置 47 Network connection device
４８記録媒体 48 Recording medium

Claims

A speech recognition apparatus that performs speech recognition on input speech using one or a plurality of speaker clusters having mutually different acoustic characteristics, comprising:
acoustic analysis means for converting the input speech into acoustic features;
speaker cluster attribute identification means for identifying a speaker cluster attribute from the acoustic features obtained by the acoustic analysis means; and
continuous speech recognition means for performing continuous speech recognition based on a search network for searching for correct words, generated from a preset acoustic model and language model of the speaker clusters, and on constraint information of the speaker cluster attribute for the input speech.

The speech recognition apparatus according to claim 1, further comprising network expansion means having an acoustic model composed of one or a plurality of speaker clusters expressing the acoustic characteristics of speech and a language model expressing preset transitions between words,
the network expansion means expanding the acoustic model of the speaker clusters into a search network according to the language model and a word dictionary set in advance for each speaker cluster.

The speech recognition apparatus according to claim 2, wherein the network expansion means
constructs a search network that enables at least one of the following transitions: a transition from the recognition start state at the utterance start to the word dictionary start of every speaker cluster; a transition, according to the language model, between word dictionaries of the same speaker cluster; a transition, in response to a change in the speaker cluster attribute, from the word dictionary end of each speaker cluster to the word dictionary start of a different speaker cluster; a transition at the utterance end from the word dictionary end of each speaker cluster to the recognition end state; and a transition from the recognition end state to the recognition start state for the next utterance.

The speech recognition apparatus according to claim 1, wherein the speaker cluster attribute identification means
outputs, for the input speech, time information at which the speaker cluster changes and/or attribute information of the speaker cluster after the change.

The speech recognition apparatus according to claim 2, wherein the continuous speech recognition means,
when the speaker cluster attribute changes, enables a transition in the search network from the word dictionary end of the speaker cluster before the change to the word dictionary start of the speaker cluster after the change, and continues searching the word dictionary of the speaker cluster before the change.

A speech recognition program for causing a computer to execute speech recognition processing that performs speech recognition on input speech using one or a plurality of speaker clusters having mutually different acoustic characteristics, the program causing the computer to execute:
acoustic analysis processing for converting the input speech into acoustic features;
speaker cluster attribute identification processing for identifying a speaker cluster attribute from the acoustic features obtained by the acoustic analysis processing; and
continuous speech recognition processing for performing continuous speech recognition based on a search network for searching for correct words, generated from a preset acoustic model and language model of the speaker clusters, and on constraint information of the speaker cluster attribute for the input speech.