JP5235187B2

JP5235187B2 - Speech recognition apparatus, speech recognition method, and speech recognition program

Info

Publication number: JP5235187B2
Application number: JP2009260836A
Authority: JP
Inventors: 済央野本; 浩和政瀧; 敏高橋; 理吉岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-11-16
Filing date: 2009-11-16
Publication date: 2013-07-10
Anticipated expiration: 2029-11-16
Also published as: JP2011107314A

Description

The present invention relates to a speech recognition apparatus, a speech recognition method, and a speech recognition program that extract what a person says as text data.

In speech recognition and statistical machine translation, for example, a language model is used as a linguistic constraint to improve recognition performance. When the intended use (task) of speech recognition or the like is limited, it is generally held that recognition accuracy can be improved by using a language model built specifically for that use.

The statistical N-gram language model, which has been widely used in recent years, must be trained on a large amount of data to achieve high performance. When the intended use is limited, it is generally difficult to collect a large amount of text data for that use. To solve this problem, language model adaptation methods have been proposed in which a language model trained on a large amount of text data, including out-of-domain text, is adapted using target-domain text. Patent Document 1 is known as a language model generation apparatus that creates a language model suited to the target use (an adapted language model) without requiring text data that matches the target use.

Patent Document 1: JP 2007-249050 A

However, performing speech recognition with the conventional language model generation technique requires evaluation data (speech data and transcripts of the speech) to be created. In addition, before speech recognition, a large number of cluster language models, synthesized cluster language models, and the like must be created and each evaluated using the evaluation data before the language model can be built. The advance preparation and the amount of computation therefore become enormous.

To solve the above problem, the speech recognition technique according to the present invention recognizes conversational speech as follows. Feature amounts are extracted from the speech signals. Speech recognition is performed using the feature amounts obtained from a speech signal containing the utterances of a predetermined speaker A, an acoustic model, and a pre-adaptation language model, yielding a recognition result A'. An adapted language model is obtained using only the recognition result A' and the pre-adaptation language model. Speech recognition is then performed using the feature amounts obtained from a speech signal containing the utterances of a speaker B other than the predetermined speaker, the acoustic model, and the adapted language model, yielding a recognition result B'.

By imposing language constraints that exploit the characteristics of conversation, the present invention improves the performance of the language model and the recognition rate without creating evaluation data and without requiring enormous preparation or computation.

FIG. 1 is a diagram showing a configuration example of the speech recognition apparatus 100.
FIG. 2 is a diagram showing an example of the processing flow of the speech recognition apparatus 100.
FIG. 3 is a diagram showing a configuration example of the language model adaptation unit 121.
FIG. 4 is a diagram showing a configuration example of the speech recognition apparatus 100'.
FIG. 5 is a block diagram illustrating the hardware configuration of the speech recognition apparatus 100.
FIG. 6 is a diagram showing a configuration example of the speech recognition apparatus 200.
FIG. 7 is a diagram showing an example of the processing flow of the speech recognition apparatus 200.
FIG. 8 is a diagram for explaining the selection method of the adaptive utterance selection unit 225.
FIG. 9 is a diagram showing a configuration example of the speech recognition apparatus 200'.

Embodiments of the present invention are described in detail below.

<Speech recognition apparatus 100>
FIG. 1 shows a configuration example of the speech recognition apparatus 100, and FIG. 2 shows an example of its processing flow. The speech recognition apparatus 100 according to the first embodiment will be described with reference to FIGS. 1 and 2.

The speech recognition apparatus 100 includes a storage unit 103, a control unit 105, speech signal input terminals 107A and 107B, speech signal acquisition units 109A and 109B, feature amount analysis units 113A and 113B, recognition processing units 115A and 115B, a language model storage unit 117, an acoustic model storage unit 119, a language model adaptation unit 121, and a post-adaptation language model storage unit 123.

The speech recognition apparatus 100 recognizes conversational speech. A conversation is communication in which two or more speakers exchange a common topic through spoken language, and conversational speech is the speech information of that communication. A conversation between two speakers is called a dialogue; to simplify the explanation, this embodiment describes the case of recognizing dialogue speech.

In conventional speech recognition, when dialogue speech between two speakers is recognized, each speaker's speech is recognized independently. A language model expressed as an N-gram has generally been used to provide language constraints, but there has been no framework for imposing constraints that take the characteristics of dialogue into account.

The present invention imposes language constraints that take the characteristics of dialogue into account, thereby improving the performance of the language model and the recognition rate.

Here, the characteristic of dialogue is that the utterances of the two speakers are strongly related to each other: a keyword uttered by one speaker is likely to be taken up and uttered by the other speaker as well. In the present invention, therefore, when a dialogue is recognized, the utterances of one speaker are used to impose a language constraint on the utterances of the other speaker. Specifically, imposing a language constraint means adapting the language model to the content of the dialogue using a speech signal containing the utterances of the predetermined speaker, and then using the adapted language model to recognize a speech signal containing the utterances of a speaker other than the predetermined speaker.

Here, the predetermined speaker is a speaker for whom a high recognition rate is expected (even with the pre-adaptation language model), for example because the speaker talks carefully and the recording conditions are good. A speaker other than the predetermined speaker is a speaker for whom a low recognition rate is expected (with the pre-adaptation language model), for example because the speaker talks casually and the recording conditions are poor. For example, when recognizing a dialogue between an operator and a customer at a call center, the operator, who is expected to speak carefully under good recording conditions, is treated as the predetermined speaker, and the customer, who is expected to speak casually under poor recording conditions, is treated as the speaker other than the predetermined speaker.

The processing performed by each unit is described below.

<Storage unit 103 and control unit 105>
The storage unit 103 stores and reads out, as needed, the input and output data and the intermediate data of each computation, which allows each computation to proceed. The data need not always be stored in the storage unit 103, however; data may be passed directly between units. The language model storage unit 117, the acoustic model storage unit 119, and the post-adaptation language model storage unit 123 described later may be part of the storage unit 103.

The control unit 105 controls each process.

<Speech signal input terminals 107A and 107B, speech signal acquisition units 109A and 109B>
The speech signal acquisition units 109A and 109B acquire, via the speech signal input terminals 107A and 107B respectively, the analog speech signals A2 and B2 of the predetermined speaker A (for example, an operator) and of the speaker B other than the predetermined speaker (for example, a customer), convert them into digital speech signals A3 and B3, and output them (s109A, s109B).

<Feature amount analysis units 113A and 113B>
The feature amount analysis units 113A and 113B extract (acoustic) feature amounts A4 and B4 from the digital speech signals A3 and B3 of the predetermined speaker A and of the speaker B other than the predetermined speaker, respectively, and output them (s113A, s113B).

The extracted feature amounts are, for example, dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficient), dynamic parameters such as ΔMFCC (the change in the MFCC), power, and Δpower. CMN (cepstral mean normalization) may also be applied. The feature amounts are not limited to MFCC and power; other parameters used for speech recognition may be used. The specific feature extraction method is well known and is not described here (see, for example, Reference 1: Sadaoki Furui, "音響・音声工学" (Acoustic and Speech Engineering), Kindai Kagaku Sha, September 1992).
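As a rough illustration of the CMN and Δ-parameter steps mentioned above, the following sketch assumes that static features (for example, 12 MFCC values per frame) have already been computed elsewhere. The function names and the two-frame difference used for the Δ values are illustrative assumptions, not the method specified in the patent or in Reference 1.

```python
from typing import List

Frame = List[float]  # one frame of static features, e.g. 12 MFCC values


def cepstral_mean_normalize(frames: List[Frame]) -> List[Frame]:
    """CMN: subtract the per-dimension mean taken over all frames."""
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def delta(frames: List[Frame]) -> List[Frame]:
    """Assumed delta: half the difference between the next and the previous
    frame, with the first and last frames replicated at the edges."""
    n = len(frames)
    out: List[Frame] = []
    for t in range(n):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_f, next_f)])
    return out


def add_dynamic_features(frames: List[Frame]) -> List[Frame]:
    """Apply CMN, then append the delta values to each static frame."""
    normalized = cepstral_mean_normalize(frames)
    return [s + d for s, d in zip(normalized, delta(normalized))]
```

Production front ends usually compute Δ values by regression over a wider window; the two-frame difference here only keeps the sketch short.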

<Language model storage unit 117 and acoustic model storage unit 119>
The language model storage unit 117 and the acoustic model storage unit 119 store a language model L and an acoustic model K, respectively, in advance. The language model L may be a general-purpose language model or a language model built specifically for call centers. If a specialized language model is used as the pre-adaptation language model, the recognition rate of the recognition result A' becomes higher. Because the language model is then adapted based on a more accurate recognition result, the recognition rate of the recognition result B' obtained with the adapted language model is also expected to be higher.

<Recognition processing unit 115A>
The recognition processing unit 115A performs speech recognition using the feature amount A4 extracted from the digital speech signal A3 containing the utterances of the predetermined speaker A, the acoustic model K, and the pre-adaptation language model L (s115A). That is, it receives the feature amount A4, obtains the recognition result A' using the acoustic model K and the language model L as in the prior art, and outputs it. It also outputs the language model L' that was used in the recognition. The specific recognition method is well known (see, for example, Reference 1) and is not described here.

The above processing (s109 to s115) is repeated until the dialogue (the call, in the case of a call center) ends, after which the following processing is performed.

Note that the speech signal acquisition (s109B), feature amount analysis (s113B), and so on for the speech signal B2 containing the utterances of the speaker B other than the predetermined speaker may be performed after the call ends, with the speech signal B2 stored in the storage unit 103 or the like in the meantime. The speech recognition (s115B) is performed by the recognition processing unit 115B after the call ends and after the language model adaptation (s121) described below.

<Language model adaptation unit 121>
The language model adaptation unit 121 obtains the adapted language model L'' using only the speech recognition result A' of the speech signal A2 containing the utterances of the predetermined speaker A (hereinafter, "recognition result A'") together with the language model L (s121). Here, "only the recognition result A'" means that the recognition result of the speech signal B2 containing the utterances of speaker B is not used. In other words, the recognition result A' of the predetermined speaker A is used to adapt the language model L to the content of the dialogue, yielding the adapted language model L''.

One adaptation method, for example, is weighted adaptation, a technique that mixes N-grams trained on several text corpora of different sizes. When mixing, each corpus is weighted according to its size and importance. In this embodiment, the N-gram of the pre-adaptation language model L and the N-gram trained on the recognition result A' of the predetermined speaker A are mixed taking a weight w into account, and the mixed N-gram is used for the adaptation. For example, let Pa(x) be the frequency of occurrence of a word x learned from a text corpus A with a total of m words, let Pb(x) be the frequency of occurrence of x learned from a text corpus B with a total of n words, and let w be the weight of corpus B in the mixture. The frequency of occurrence P(x) of the word x learned by weighted adaptation of corpora A and B is then expressed by Equation (1).
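Equation (1) itself is not reproduced in this text. One form consistent with the definitions above (corpus sizes m and n, the weight w applied to corpus B, and the product wn that the weighting unit 121a computes) is a count-merging mixture. It is given here as an assumption, not as the patent's exact formula:

```latex
P(x) = \frac{m\,P_a(x) + w\,n\,P_b(x)}{m + w\,n} \tag{1}
```

Under this assumed form, w = 0 leaves the pre-adaptation frequencies Pa(x) unchanged, and a larger w shifts P(x) toward the frequencies Pb(x) observed in speaker A's recognized utterances.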

When adapting, utterances that occur regardless of topic, such as "hai" ("yes") and "ee" ("uh"), may be excluded. An appropriate value for the weight w is determined in advance, for example by experiment.

FIG. 3 shows a configuration example of the language model adaptation unit 121. The language model adaptation unit 121 includes a weighting unit 121a and an adaptation unit 121b.

The weighting unit 121a creates the corpus B using the speech recognition result A' of the predetermined speaker A and the language model L' used for the recognition, and obtains its total word count n. It then multiplies n by the predetermined weight w to obtain wn.

The adaptation unit 121b obtains the frequency of occurrence Pb(x) of the word x from the corpus B, obtains the total word count m and the frequency of occurrence Pa(x) of the word x from the pre-adaptation language model L and its corpus A, and calculates the learned frequency of occurrence P(x) of the word x using Equation (1). It then adapts the language model L using P(x) to obtain the adapted language model L''.
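The work of the weighting unit 121a and the adaptation unit 121b can be sketched for the unigram case as follows. This is a minimal sketch under stated assumptions: the mixing rule is the count-merging form P(x) = (m·Pa(x) + wn·Pb(x)) / (m + wn), which is consistent with the definitions in the text but is not confirmed as the patent's Equation (1); the filler list, the whitespace tokenization, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set

# Hypothetical filler words that appear regardless of topic.
FILLERS: Set[str] = {"yes", "uh"}


def build_corpus_b(recognized: Iterable[str], fillers: Set[str] = FILLERS) -> List[str]:
    """Collect words from speaker A's recognition results, skipping
    utterances that consist only of fillers (or are empty)."""
    words: List[str] = []
    for utterance in recognized:
        tokens = utterance.split()
        if all(t in fillers for t in tokens):
            continue
        words.extend(tokens)
    return words


def relative_frequencies(words: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each word in a corpus."""
    counts = Counter(words)
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()} if total else {}


def mix_frequencies(pa: Dict[str, float], m: int,
                    pb: Dict[str, float], n: int, w: float) -> Dict[str, float]:
    """Assumed count-merging mixture of corpus A (size m) and corpus B (size n)."""
    wn = w * n  # the quantity computed by the weighting unit
    vocab = set(pa) | set(pb)
    return {x: (m * pa.get(x, 0.0) + wn * pb.get(x, 0.0)) / (m + wn) for x in vocab}
```

For example, with corpus A of m = 100 words and four words recognized from speaker A, a weight of w = 5 makes the recognized words count as if they were 20 words of corpus A. A full implementation would apply the same mixing to higher-order N-gram counts and then re-estimate smoothing, which this sketch omits.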

<Post-adaptation language model storage unit 123>
The post-adaptation language model storage unit 123 stores the adapted language model L''. It is stored separately from the pre-adaptation language model L. The pre-adaptation language model L is not changed from call to call, but because the recognition result A' differs for each call, the adapted language model L'' also differs for each call.

<Recognition processing unit 115B>
The recognition processing unit 115B performs speech recognition using the feature amount B4 extracted from the speech signal B2 containing the utterances of the speaker B other than the predetermined speaker, the acoustic model K, and the adapted language model L'' (s115B). That is, it receives the feature amount B4 extracted from the digital speech signal B3 containing the utterances of speaker B, obtains the recognition result B' using the acoustic model K and the adapted language model L'' with the same recognition procedure as in the prior art, and outputs it.
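The overall flow of the first embodiment (s115A, s121, s115B) can be summarized in a short sketch. The recognizer and the adaptation routine are passed in as callables because the patent leaves both to known methods; `recognize`, `adapt`, and the type names are hypothetical, and the acoustic model K is assumed to be captured inside `recognize`.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

F = TypeVar("F")    # features of one utterance (A4 or B4)
LM = TypeVar("LM")  # language model representation


def recognize_dialogue(
    features_a: Sequence[F],
    features_b: Sequence[F],
    recognize: Callable[[F, LM], str],
    adapt: Callable[[LM, List[str]], LM],
    base_lm: LM,
) -> Tuple[List[str], List[str]]:
    """Two-pass recognition of one dialogue."""
    # s115A: recognize speaker A with the pre-adaptation model L.
    results_a = [recognize(f, base_lm) for f in features_a]
    # s121: adapt L using only speaker A's results, giving L''.
    adapted_lm = adapt(base_lm, results_a)
    # s115B: recognize speaker B with the adapted model L''.
    results_b = [recognize(f, adapted_lm) for f in features_b]
    return results_a, results_b
```

Because `adapt` returns a new model rather than modifying `base_lm`, the pre-adaptation model can be reused unchanged for the next call, matching the separate storage of L and L'' described above.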

<Effect>
In this embodiment, the language model is adapted using all of the utterances with a high recognition rate (reliability) across the whole dialogue, and the utterances with a low recognition rate (reliability) are then recognized. With this configuration, the performance of the language model and the recognition rate of speech recognition can be improved without creating evaluation data and without requiring enormous preparation or computation.

In particular, when customer speech at a call center is recognized, the customer-side speech signal is recorded under poor conditions and the acoustic model cannot be expected to be effective, so the recognition rate of conventional speech recognition is low. With the present invention, an improvement in the recognition rate can be expected.

<Others>
When the speech recognition apparatus 100 receives the digital speech signals A3 and B3 rather than the analog speech signals A2 and B2, or receives the digital speech signals A3 and B3 from the storage unit 103, a storage medium (not shown), or a communication device, the speech signal input terminals 107A and 107B and the speech signal acquisition units 109A and 109B need not be provided.

This embodiment describes recognizing the speech of a call (dialogue) between an operator and a customer at a call center, but other dialogue speech, or conversational speech more generally, may be recognized. When there are three or more speakers, the speakers for whom a high recognition rate is expected with the pre-adaptation language model L (for example, speakers in a good sound-collection environment, or whose speaking rate, vocabulary, and grammar are appropriate) form group A, and the speakers for whom a low recognition rate is expected with the pre-adaptation language model L (for example, speakers in a noisy sound-collection environment, who speak fast, or whose vocabulary or grammar contains errors) form group B. Taking the characteristics of conversation into account, the language model can then be adapted to the content of the conversation in the same way as in this embodiment.

This embodiment assumes that the signal received at the speech signal input terminal 107A contains the utterances of a speaker for whom a high recognition rate is expected with the pre-adaptation language model L, and that the signal received at the speech signal input terminal 107B contains the utterances of a speaker for whom a low recognition rate is expected. Alternatively, whether each speech signal comes from a speaker with a high or a low recognition rate may be determined from the amount of noise in the signal or the speaking rate.
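The patent does not say how noise and speaking rate would be combined to make this determination. The sketch below shows one hypothetical rule: each channel gets a score equal to its estimated signal-to-noise ratio minus a penalty for speaking faster than a nominal rate, and the higher-scoring channel is treated as speaker A. The class, the scoring formula, and the default constants are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ChannelStats:
    name: str
    snr_db: float         # estimated signal-to-noise ratio (higher means less noise)
    speaking_rate: float  # e.g. syllables per second


def expected_reliability(stats: ChannelStats,
                         nominal_rate: float = 5.0,
                         rate_penalty: float = 1.0) -> float:
    """Hypothetical score: SNR minus a penalty for speech faster than nominal."""
    excess = max(0.0, stats.speaking_rate - nominal_rate)
    return stats.snr_db - rate_penalty * excess


def assign_roles(ch1: ChannelStats, ch2: ChannelStats) -> Tuple[ChannelStats, ChannelStats]:
    """Return (speaker A channel, speaker B channel); A is the channel
    expected to be recognized more reliably with the base model."""
    if expected_reliability(ch1) >= expected_reliability(ch2):
        return ch1, ch2
    return ch2, ch1
```

Any such thresholds would need to be tuned on real recordings; the point is only that the A/B roles can be decided per call instead of being fixed by input terminal.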

The speech recognition apparatus 100 need not output the recognition result A'. For example, when a call center wishes to record only the customer's utterances as text data, the apparatus may be configured to output and store only the recognition result B'.

In this embodiment the language model is adapted to the conversation after the call ends, but the call need not have ended. For example, the language model may be adapted from the recognition result A' obtained within a predetermined time, and that language model may be used to recognize the speech signal containing the utterances of speaker B within the same predetermined time.

[Modification 1]
In the speech recognition apparatus 100 of the first embodiment, the speech signals containing the utterances of the predetermined speaker A and of the speaker B other than the predetermined speaker are input from separate speech signal input terminals and processed separately. This modification describes a speech recognition apparatus 100' in which the speech signals containing the utterances of the predetermined speaker A and of the speaker B other than the predetermined speaker are input from the same speech signal input terminal.

<Speech recognition apparatus 100'>
FIG. 4 shows a configuration example of the speech recognition apparatus 100'. The speech recognition apparatus 100' according to Modification 1 will be described with reference to FIG. 4.

The speech recognition apparatus 100' includes a storage unit 103, a control unit 105, a speech signal input terminal 107, a speech signal acquisition unit 109, a speaker determination unit 111, a feature amount analysis unit 113, a recognition processing unit 115, a language model storage unit 117, an acoustic model storage unit 119, a language model adaptation unit 121, and a post-adaptation language model storage unit 123. Only the parts that differ from the first embodiment are described.

<Speech signal input terminal 107 and speech signal acquisition unit 109>
The speech signal acquisition unit 109 acquires, via the speech signal input terminal 107, an analog speech signal containing the utterances of speaker A and speaker B, converts it into a digital speech signal, and outputs it.

<Speaker determination unit 111>
The speaker determination unit 111 uses the digital speech signal to determine which speaker produced each utterance contained in it, and outputs the result as speaker information. The specific speaker determination method is well known (see, for example, Reference 1) and is not described here.

<Feature amount analysis unit 113>
The feature amount analysis unit 113 extracts (acoustic) feature amounts from the digital speech signal containing the utterances of speaker A and speaker B, attaches the speaker information to each feature amount, and outputs them.

<Recognition processing unit 115>
The recognition processing unit 115 determines from the speaker information which speaker each feature amount belongs to, and performs speech recognition using the feature amounts extracted from the digital speech signal containing the utterances of the predetermined speaker A, the acoustic model K, and the pre-adaptation language model L. It then outputs the recognition result A' and the language model L' that was used. The feature amounts extracted from the digital speech signal containing the utterances of speaker B are stored in the storage unit 103 or the like.

The above processing is repeated until the dialogue (the call, in the case of a call center) ends. After the dialogue ends, the speech recognition apparatus 100' performs the language model adaptation (s121) in the language model adaptation unit 121 as in the first embodiment and obtains the adapted language model L''.

The recognition processing unit 115 then receives, from the storage unit 103 or the like, the feature amounts extracted from the speech signal containing the utterances of speaker B, performs speech recognition using the acoustic model K and the adapted language model L'', and outputs the recognition result B'.
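The single-channel flow of Modification 1 can be sketched as follows: features arrive in time order with speaker information attached, speaker A's features are recognized immediately with the base model, and speaker B's features are buffered until the adapted model exists. The "A"/"B" labels and the `recognize` and `adapt` callables are hypothetical stand-ins for the speaker determination unit 111, the recognition processing unit 115, and the language model adaptation unit 121.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

F = TypeVar("F")    # features of one utterance
LM = TypeVar("LM")  # language model representation


def recognize_single_channel(
    segments: Iterable[Tuple[str, F]],
    recognize: Callable[[F, LM], str],
    adapt: Callable[[LM, List[str]], LM],
    base_lm: LM,
) -> Tuple[List[str], List[str]]:
    """segments: (speaker label, features) pairs in time order."""
    results_a: List[str] = []
    buffered_b: List[F] = []
    for speaker, features in segments:
        if speaker == "A":
            # Predetermined speaker: recognize now with the base model L.
            results_a.append(recognize(features, base_lm))
        else:
            # Other speaker: hold the features until L'' is available.
            buffered_b.append(features)
    # After the dialogue ends: adapt using only speaker A's results.
    adapted_lm = adapt(base_lm, results_a)
    results_b = [recognize(f, adapted_lm) for f in buffered_b]
    return results_a, results_b
```

Buffering features rather than raw audio mirrors the text, which stores speaker B's extracted feature amounts in the storage unit 103 until adaptation has finished.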

The apparatus may be configured to switch models, using the pre-adaptation language model L at the start of the dialogue and the adapted language model L'' at the end of the dialogue (after language model adaptation).

This configuration provides the same effect as the first embodiment. Accordingly, the units (the speech signal acquisition unit, the feature amount analysis unit, the recognition processing unit, and so on) may each be shared between the speakers or provided separately.

<Hardware configuration>
FIG. 5 is a block diagram illustrating the hardware configuration of the speech recognition apparatus 100 in this embodiment. As illustrated in FIG. 5, the speech recognition apparatus 100 of this example has a CPU (Central Processing Unit) 11, an input unit 12, an output unit 13, an auxiliary storage device 14, a ROM (Read Only Memory) 15, a RAM (Random Access Memory) 16, and a bus 17.

The CPU 11 of this example has a control unit 11a, an arithmetic unit 11b, and a register 11c, and executes various arithmetic operations according to the programs read into the register 11c. The input unit 12 is an input interface through which data is input, a keyboard, a mouse, or the like, and the output unit 13 is an output interface through which data is output, or the like. The auxiliary storage device 14 is, for example, a hard disk or a semiconductor memory, and stores the programs and data that cause a computer to function as the speech recognition apparatus 100. These programs and data are loaded into the RAM 16 and used by the CPU 11 and other components. The bus 17 connects the CPU 11, the input unit 12, the output unit 13, the auxiliary storage device 14, the ROM 15, and the RAM 16 so that they can communicate. Concrete examples of such hardware include personal computers, server machines, and workstations.

＜プログラム構成＞
上述のように、補助記憶装置１４には、本実施例の音声認識装置１００の各処理を実行するための各プログラムが格納される。音声認識プログラムを構成する各プログラムは、単一のプログラム列として記載されていてもよく、また、少なくとも一部のプログラムが別個のモジュールとしてライブラリに格納されていてもよい。 <Program structure>
As described above, each program for executing each process of the speech recognition apparatus 100 according to the present embodiment is stored in the auxiliary storage device 14. Each program constituting the speech recognition program may be described as a single program sequence, or at least a part of the program may be stored in the library as a separate module.

＜ハードウェアとプログラムとの協働＞
ＣＰＵ１１は、読み込まれたＯＳプログラムに従い、補助記憶装置１４に格納されている上述のプログラムや各種データをＲＡＭ１６に展開する。そして、このプログラムやデータが書き込まれたＲＡＭ１６上のアドレスがＣＰＵ１１のレジスタ１１ｃに格納される。ＣＰＵ１１の制御部１１ａは、レジスタ１１ｃに格納されたこれらのアドレスを順次読み出し、読み出したアドレスが示すＲＡＭ１６上の領域からプログラムやデータを読み出し、そのプログラムが示す演算を演算部１１ｂに順次実行させ、その演算結果をレジスタ１１ｃに格納していく。 <Cooperation between hardware and program>
The CPU 11 loads the above-described programs and various data stored in the auxiliary storage device 14 into the RAM 16 in accordance with the OS program that has been read. The addresses in the RAM 16 at which the programs and data are written are stored in the register 11c of the CPU 11. The control unit 11a of the CPU 11 sequentially reads these addresses from the register 11c, reads the programs and data from the areas of the RAM 16 indicated by the read addresses, causes the calculation unit 11b to sequentially execute the operations indicated by the programs, and stores the calculation results in the register 11c.

図１は、このようにＣＰＵ１１に上述のプログラムが読み込まれて実行されることにより構成される音声認識装置１００の機能構成を例示したブロック図である。 FIG. 1 is a block diagram illustrating a functional configuration of the speech recognition apparatus 100 configured by reading and executing the above-described program in the CPU 11 as described above.

ここで、記憶部１０３、言語モデル記憶部１１７、音響モデル記憶部１１９及び適応後言語モデル１２３は、補助記憶装置１４、ＲＡＭ１６、レジスタ１１ｃ、その他のバッファメモリやキャッシュメモリ等の何れか、あるいはこれらを併用した記憶領域に相当する。また、音声信号取得部１０９Ａ及び１０９Ｂ、話者判定部１１１、特徴量分析部１１３Ａ及びＢ、認識処理部１１５Ａ及び１１５Ｂ、言語モデル適応部１２１は、ＣＰＵ１１に音声認識プログラムを実行させることにより構成されるものである。 Here, the storage unit 103, the language model storage unit 117, the acoustic model storage unit 119, and the post-adaptation language model storage unit 123 each correspond to any one of the auxiliary storage device 14, the RAM 16, the register 11c, and other buffer or cache memories, or to a storage area that combines them. The speech signal acquisition units 109A and 109B, the speaker determination unit 111, the feature amount analysis units 113A and 113B, the recognition processing units 115A and 115B, and the language model adaptation unit 121 are implemented by causing the CPU 11 to execute the speech recognition program.

＜音声認識装置２００＞
図６は音声認識装置２００の構成例を、図７は音声認識装置２００の処理フロー例を示す。実施例１と異なる部分について、図６及び７を用いて実施例２に係る音声認識装置２００を説明する。 <Voice recognition apparatus 200>
FIG. 6 shows a configuration example of the speech recognition apparatus 200, and FIG. 7 shows an example of its processing flow. The speech recognition apparatus 200 according to the second embodiment is described with reference to FIGS. 6 and 7, focusing on the parts that differ from the first embodiment.

音声認識装置２００は、記憶部１０３、制御部１０５、音声信号入力端子１０７Ａ及び１０７Ｂ、音声信号取得部１０９Ａ及び１０９Ｂ、特徴量分析部１１３Ａ及び１１３Ｂ、認識処理部１１５Ａ及び１１５Ｂ、言語モデル記憶部１１７、音響モデル記憶部１１９、言語モデル適応部１２１、適応後言語モデル記憶部１２３に加え、適応発話選択部２２５及び発話区間判定部２２３を有する。 The speech recognition apparatus 200 includes a storage unit 103, a control unit 105, speech signal input terminals 107A and 107B, speech signal acquisition units 109A and 109B, feature amount analysis units 113A and 113B, recognition processing units 115A and 115B, and a language model storage unit 117. In addition to the acoustic model storage unit 119, the language model adaptation unit 121, and the post-adaptation language model storage unit 123, an adaptive utterance selection unit 225 and an utterance section determination unit 223 are included.

＜発話区間判定部２２３＞
発話区間判定部２２３は、音声信号取得部１０９Ｂから所定の話者以外の話者Ｂの発話内容を含むディジタル音声信号Ｂ３を受け取り、これを用いて、所定の話者以外の話者Ｂの発話区間を判定し、発話区間情報を求め、出力する（ｓ２２３）。発話区間情報とは、例えば、発話開始時間と終了時間の組み合わせである。具体的な発話区間判定方法は、公知のもの（例えば、参考文献１等）によるから説明を略する。 <Speech section determination unit 223>
The utterance section determination unit 223 receives, from the speech signal acquisition unit 109B, the digital speech signal B3 containing the utterance content of the speaker B other than the predetermined speaker, uses it to determine the utterance sections of the speaker B, and obtains and outputs utterance section information (s223). The utterance section information is, for example, a pair of an utterance start time and an utterance end time. The specific determination method follows a known technique (for example, Reference 1), so its description is omitted.

＜適応発話選択部２２５＞
適応発話選択部２２５は、認識処理部１１５Ａから認識結果Ａ’と言語モデルＬ’を受け取り、発話区間判定部２２３から発話区間情報を受け取る。 <Adaptive utterance selection unit 225>
The adaptive utterance selection unit 225 receives the recognition result A′ and the language model L′ from the recognition processing unit 115A, and receives the utterance section information from the utterance section determination unit 223.

適応発話選択部２２５は、所定の話者以外の話者Ｂの発話区間情報を用いて、その発話区間の前後ｎ個の所定の話者Ａの発話内容を含む音声信号Ａ３の認識結果Ａ’を選択する（ｓ２２５）。なお、ｎは任意の自然数であり、例えば、１や２等である。 Using the utterance section information of the speaker B other than the predetermined speaker, the adaptive utterance selection unit 225 selects the n recognition results A′ before and the n recognition results A′ after that utterance section, each being a recognition result of the speech signal A3 containing the utterance content of the predetermined speaker A (s225). Here, n is an arbitrary natural number, for example 1 or 2.

図８は、適応発話選択部２２５の選択方法を説明するための図である。例えば、［ｔ］番目の顧客Ｂの発話区間情報から、ｎ＝１の場合には［ｔ−１］番目、［ｔ＋１］番目のオペレータＡの認識結果Ａ’を選択し、ｎ＝２の場合には［ｔ−１］番目、［ｔ＋１］番目に加え、［ｔ−３］番目、［ｔ＋３］番目のオペレータＡの認識結果Ａ’を選択する。但し、会話の開始時または終了時には、オペレータＡの認識結果が顧客Ｂの発話区間より前にｎ個のオペレータＡの認識結果Ａ’が存在しない場合、または、後にｎ個のオペレータＡの認識結果画Ａ’が存在しない場合があるが、その場合には、存在する認識結果Ａ’だけを選択してもよい。例えば、［ｔ−３］番目のオペレータＡの認識結果Ａ’から会話が開始し、かつ、ｎ＝２の場合に［ｔ−２］番目の顧客Ｂの発話区間の前には、［ｔ−３］番目のオペレータＡの認識結果Ａ’しかないが、存在する［ｔ−３］番目、［ｔ−１］番目、［ｔ＋１］番目の３個のオペレータＡの認識結果Ａ’を選択する。 FIG. 8 is a diagram for explaining the selection method of the adaptive utterance selection unit 225. For example, based on the utterance section information of the [t]-th utterance, spoken by customer B, the [t−1]-th and [t+1]-th recognition results A′ of operator A are selected when n = 1. When n = 2, the [t−3]-th and [t+3]-th recognition results A′ of operator A are selected in addition to the [t−1]-th and [t+1]-th. At the start or end of a conversation, however, there may be fewer than n recognition results A′ of operator A before the utterance section of customer B, or fewer than n after it. In that case, only the recognition results A′ that exist may be selected. For example, when the conversation starts with the [t−3]-th recognition result A′ of operator A and n = 2, only the [t−3]-th recognition result A′ precedes the [t−2]-th utterance section of customer B, so the three existing recognition results A′ of operator A, namely the [t−3]-th, [t−1]-th, and [t+1]-th, are selected.
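
The selection rule described with FIG. 8 can be sketched in Python as follows. This is a minimal illustration, not the implementation of the patent: the `Recognized` record, the function name `select_adaptation_results`, and the use of start and end times to decide what lies before or after customer B's utterance section are all assumptions made for this example. Operator utterances that overlap the section are simply skipped here, a case the text does not address.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Recognized:
    """One recognition result A' of operator A (hypothetical record)."""
    start: float  # utterance start time in seconds
    end: float    # utterance end time in seconds
    text: str     # recognized text


def select_adaptation_results(
    results_a: List[Recognized],
    section_b: Tuple[float, float],
    n: int = 1,
) -> List[Recognized]:
    """Pick up to n results of A before and up to n after B's utterance section."""
    if n < 1:
        raise ValueError("n must be a natural number (1, 2, ...)")
    b_start, b_end = section_b
    # Results of A that finished before B started, oldest first.
    before = sorted((r for r in results_a if r.end <= b_start), key=lambda r: r.end)
    # Results of A that started after B finished, earliest first.
    after = sorted((r for r in results_a if r.start >= b_end), key=lambda r: r.start)
    # Slicing returns fewer than n items near the start or end of the
    # conversation, which matches "select only the results that exist".
    return before[-n:] + after[:n]
```

With operator results at the [t−3]-th, [t−1]-th, and [t+1]-th positions and n = 2, calling the function for the [t−2]-th customer section returns all three results, as in the example above.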

＜言語モデル適応部１２１＞
言語モデル適応部１２１は、適応発話選択部２２５で選択された前後ｎ個の認識結果Ａ’のみと言語モデルＬを用いて、適応後の言語モデルＬ”を求める（ｓ１２１）。実施例１では、対話全体の所定の話者Ａの認識結果Ａ’を用いていたのに対し、本実施例では、認識をしようとする音声信号Ｂ２の前後ｎ個の認識結果Ａ’しか用いない点が異なる。なお、適応方法自体は実施例１と同様である。 <Language model adaptation unit 121>
The language model adaptation unit 121 obtains the adapted language model L″ using the language model L and only the n preceding and n following recognition results A′ selected by the adaptive utterance selection unit 225 (s121). Whereas the first embodiment uses the recognition results A′ of the predetermined speaker A for the entire dialogue, this embodiment uses only the n recognition results A′ before and after the speech signal B2 to be recognized. The adaptation method itself is the same as in the first embodiment.
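
This section does not restate the adaptation method of the first embodiment. Purely as an illustration of adapting a model with only the selected recognition results, the sketch below linearly interpolates a unigram model with word frequencies taken from those results. The interpolation technique, the unigram simplification, whitespace tokenization, the `weight` parameter, and the function name `adapt_unigram` are all assumptions for this example, and the method actually used in the patent may differ.

```python
from collections import Counter
from typing import Dict, List


def adapt_unigram(
    base_probs: Dict[str, float],
    selected_texts: List[str],
    weight: float = 0.3,
) -> Dict[str, float]:
    """Interpolate a base unigram model with counts from the selected results.

    base_probs     -- unigram probabilities of the model before adaptation
    selected_texts -- recognized texts A' chosen for adaptation
    weight         -- share given to the selected texts (assumed value)
    """
    # Whitespace tokenization is a simplification; Japanese text would
    # need a morphological analyzer instead.
    counts = Counter(w for text in selected_texts for w in text.split())
    total = sum(counts.values())
    if total == 0:
        # Nothing to adapt to, so keep the base model unchanged.
        return dict(base_probs)
    vocab = set(base_probs) | set(counts)
    return {
        w: (1.0 - weight) * base_probs.get(w, 0.0) + weight * counts[w] / total
        for w in vocab
    }
```

If the base probabilities sum to 1, the interpolated probabilities also sum to 1, and words that appear only in the selected results receive non-zero probability.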

＜認識処理部１１５Ｂ＞
認識処理部１１５Ｂは、所定の話者以外の話者Ｂの発話内容を含む（発話区間に対応する）音声信号Ｂ２から抽出した特徴量Ｂ４と音響モデルＫとその発話区間に対応する適応後の言語モデルＬ”を用いて音声認識を行う（ｓ１１５Ｂ）。なお、実施例１では、対話全体において同じ適応後の言語モデルを用いるが、本実施例では、所定の話者以外の話者Ｂの発話区間毎に適応後の言語モデルが更新されるため、発話区間毎に異なる適応後の言語モデルを用いて、音声認識処理が行われる。 <Recognition processing unit 115B>
The recognition processing unit 115B performs speech recognition using the feature amount B4 extracted from the speech signal B2 that contains the utterance content of the speaker B other than the predetermined speaker (and corresponds to the utterance section), the acoustic model K, and the adapted language model L″ corresponding to that utterance section (s115B). In the first embodiment, the same adapted language model is used for the entire dialogue. In this embodiment, the adapted language model is updated for each utterance section of the speaker B, so speech recognition is performed with a different adapted language model for each utterance section.

音声認識装置２００は、対話が終了するまで上記処理を繰り返す（ｓ２２８）。 The speech recognition apparatus 200 repeats the above process until the dialogue is finished (s228).
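
The repeated processing (s225, s121, s115B) can be pictured as a loop over the utterance sections of speaker B. The sketch below shows only the control flow: `select`, `adapt`, and `recognize` are hypothetical callables standing in for the adaptive utterance selection unit 225, the language model adaptation unit 121, and the recognition processing unit 115B, and their signatures are assumptions for this example.

```python
from typing import Any, Callable, List, Sequence


def recognize_dialogue(
    sections_b: Sequence[Any],
    select: Callable[[Any], List[Any]],
    adapt: Callable[[List[Any]], Any],
    recognize: Callable[[Any, Any], Any],
) -> List[Any]:
    """Recognize each utterance section of B with its own adapted model."""
    results_b = []
    for section in sections_b:
        # s225: choose the recognition results A' around this section.
        selected = select(section)
        # s121: build an adapted language model from those results only.
        adapted_lm = adapt(selected)
        # s115B: recognize this section with the section-specific model.
        results_b.append(recognize(section, adapted_lm))
    return results_b
```

Because the adapted model is rebuilt inside the loop, each utterance section of B is recognized with a model derived from its own neighbouring utterances of A.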

＜効果＞
本実施例では、認識率の低い発話の前後の認識率の高い発話を用いて、言語モデルの適応を行い、認識率の低い発話の認識を行う。このような構成とすることによって、評価用データを作成せず、かつ、膨大な準備や計算を必要とせずに、言語モデルの性能を向上させ、認識率を向上させることができる。 <Effect>
In this embodiment, utterances with a high recognition rate before and after utterances with a low recognition rate are used to adapt the language model to recognize utterances with a low recognition rate. With such a configuration, it is possible to improve the performance of the language model and improve the recognition rate without creating evaluation data and without requiring enormous preparations and calculations.

また本実施例の構成の場合、片方の発話内容全てを用いて適応するのではなく、認識したい発話に対し時間的に隣接している発話のみを用いて適応することで、会話の中で局所的に現れる話題に即した適応ができる。よって、会話の話題が時々刻々と変わっていくような場合に有効である。また、オペレータが顧客の発話内容を繰り返すこと（オウム返し）が多いコールセンタの対話等においても有効である。オウム返しが顧客の発話後すぐに行われる場合には、ｎ＝１でも十分な効果が得られ、計算量を少なくすることができる。また、Ｂの発話後、Ａの１発話が完了した後に、そのＢの発話について音声認識処理を開始することができる。 In addition, with the configuration of this embodiment, adaptation uses only the utterances that are temporally adjacent to the utterance to be recognized, rather than all of one party's utterances, so the language model can be adapted to topics that appear locally in the conversation. This is effective when the topic of conversation changes from moment to moment. It is also effective for call center dialogues and the like, in which the operator often repeats what the customer said (parroting). When the parroting occurs immediately after the customer's utterance, n = 1 is sufficient to obtain the effect, and the amount of calculation can be reduced. Furthermore, speech recognition of an utterance by B can start once one utterance by A following it has been completed.

なお、ｎの値を２，３…と大きくすることで、適用範囲を広げることができるが、ｎが大きくなるほど、計算量が多くなり、音声認識処理の開始が遅くなるため、予め実験等により、言語モデルを会話に適応させるために適切なｎを求めておいてもよい。 Increasing n to 2, 3, and so on widens the range of application, but the larger n is, the greater the amount of calculation and the later speech recognition starts. An appropriate n for adapting the language model to the conversation may therefore be determined in advance by experiment or the like.

＜その他＞
本実施例も、実施例１の変形例１と同様、所定の話者Ａと所定の話者以外の話者Ｂの発話内容を含む音声信号が同一の音声信号入力端子から入力される場合に変形できる。その場合の音声認識装置２００’の構成例を図９に示す。 <Others>
As with Modification 1 of the first embodiment, this embodiment can also be modified for the case where speech signals containing the utterance content of the predetermined speaker A and of the speaker B other than the predetermined speaker are input from the same speech signal input terminal. FIG. 9 shows a configuration example of the speech recognition apparatus 200′ in that case.

この場合、音声認識装置２００’は、発話区間判定部２２３を必要とせず、適応発話選択部２２５は、発話区間情報に代えて、話者情報を受け取る。 In this case, the speech recognition apparatus 200 ′ does not require the utterance section determination unit 223, and the adaptive utterance selection unit 225 receives speaker information instead of the utterance section information.

つまり、適応発話選択部２２５は、話者判定部１１１から話者情報を受け取り、認識処理部から認識結果Ａ’及び言語モデルＬ’を受け取る。適応発話選択部２２５は、所定の話者以外の話者Ｂの話者情報を用いて、その前後ｎ個の所定の話者Ａの発話内容を含む音声信号Ａ３の認識結果Ａ’を選択する。 That is, the adaptive utterance selection unit 225 receives speaker information from the speaker determination unit 111 and receives the recognition result A′ and the language model L′ from the recognition processing unit. Using the speaker information of the speaker B other than the predetermined speaker, the adaptive utterance selection unit 225 selects the n recognition results A′ before and the n after the utterance of the speaker B, each being a recognition result of the speech signal A3 containing the utterance content of the predetermined speaker A.
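
For this single-terminal variant, selection can be driven by speaker labels instead of utterance section times. The sketch below assumes the dialogue is available as a time-ordered list of `(speaker, text)` pairs, where the label comes from something like the speaker determination unit 111. The data layout, the labels `"A"` and `"B"`, and the function name `select_by_speaker` are assumptions for this example.

```python
from typing import List, Tuple


def select_by_speaker(
    turns: List[Tuple[str, str]],
    b_index: int,
    n: int = 1,
) -> List[str]:
    """Pick up to n results of A before and after the utterance of B at b_index."""
    if n < 1:
        raise ValueError("n must be a natural number (1, 2, ...)")
    if turns[b_index][0] != "B":
        raise ValueError("turns[b_index] must be an utterance of speaker B")
    # Recognized texts of A that precede the utterance of B, in time order.
    before = [text for speaker, text in turns[:b_index] if speaker == "A"]
    # Recognized texts of A that follow the utterance of B, in time order.
    after = [text for speaker, text in turns[b_index + 1:] if speaker == "A"]
    # Fewer than n items are returned where the dialogue starts or ends.
    return before[-n:] + after[:n]
```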

１００、１００’、２００、２００’ 音声認識装置
１０３記憶部
１０５制御部
１０９Ａ，１０９Ｂ，１０９音声信号取得部
１１１話者判定部
１１３Ａ，１１３Ｂ，１１３特徴量分析部
１１５Ａ，１１５Ｂ，１１５認識処理部
１１７言語モデル記憶部
１１９音響モデル記憶部
１２３適応後言語モデル記憶部
２２５適応発話選択部
２２３発話区間判定部
100, 100′, 200, 200′ Speech recognition device
103 Storage unit
105 Control unit
109A, 109B, 109 Speech signal acquisition unit
111 Speaker determination unit
113A, 113B, 113 Feature quantity analysis unit
115A, 115B, 115 Recognition processing unit
117 Language model storage unit
119 Acoustic model storage unit
123 Post-adaptation language model storage unit
225 Adaptive utterance selection unit
223 Utterance section determination unit

Claims

A speech recognition apparatus that recognizes conversational speech, comprising:
a storage unit that stores an acoustic model and a language model;
a feature amount analysis unit that extracts a feature amount from a speech signal;
a recognition processing unit that performs speech recognition using a feature amount obtained from a speech signal containing the utterance content of a predetermined speaker A, the acoustic model, and a language model before adaptation to obtain a recognition result A′, and that performs speech recognition using a feature amount obtained from a speech signal containing the utterance content of a speaker B other than the predetermined speaker, the acoustic model, and a language model after adaptation to obtain a recognition result B′;
a language model adaptation unit that obtains the language model after adaptation using only the recognition result A′ and the language model before adaptation; and
an adaptive utterance selection unit that, using an utterance section of the speaker B, selects n recognition results A′ before and after the utterance section,
wherein the recognition result A′ used in the language model adaptation unit is the one selected by the adaptive utterance selection unit, and
the recognition processing unit performs speech recognition using a feature amount obtained from the speech signal containing the utterance content of the speaker B in the utterance section, the acoustic model, and the language model after adaptation corresponding to the utterance section to obtain the recognition result B′.

A speech recognition method for recognizing conversational speech, comprising:
a feature amount analysis step of extracting a feature amount from a speech signal;
a recognition processing step A of performing speech recognition using a feature amount obtained from a speech signal containing the utterance content of a predetermined speaker A, an acoustic model, and a language model before adaptation to obtain a recognition result A′;
a language model adaptation step of obtaining a language model after adaptation using only the recognition result A′ and the language model before adaptation;
a recognition processing step B of performing speech recognition using a feature amount obtained from a speech signal containing the utterance content of a speaker B other than the predetermined speaker, the acoustic model, and the language model after adaptation to obtain a recognition result B′; and
an adaptive utterance selection step of, using an utterance section of the speaker B, selecting n recognition results A′ before and after the utterance section,
wherein the recognition result A′ used in the language model adaptation step is the one selected in the adaptive utterance selection step, and
in the recognition processing step B, speech recognition is performed using a feature amount obtained from the speech signal containing the utterance content of the speaker B in the utterance section, the acoustic model, and the language model after adaptation corresponding to the utterance section to obtain the recognition result B′.

A program for causing a computer to function as the speech recognition apparatus according to claim 1.