JP2008241970A

JP2008241970A - Speaker adaptation device, speaker adaptation method and speaker adaptation program

Info

Publication number: JP2008241970A
Application number: JP2007080806A
Authority: JP
Inventors: Kengo Fujita; 顕吾藤田; Tsuneo Kato; 恒夫加藤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2007-03-27
Filing date: 2007-03-27
Publication date: 2008-10-09

Abstract

PROBLEM TO BE SOLVED: To perform adaptation to a speaker's speech fully automatically, without forcing the user to make a selection.

SOLUTION: A sound analysis section 11 extracts a sound feature amount of input speech. A sound model database 13 and a language model database 14 store a statistical sound model and a language model for performing speech recognition based on the extracted sound feature amount. A search processing section 12 performs search processing by applying the statistical models to the sound feature amount, and outputs a speech recognition result. The recognition result text output by the search processing is corrected by the user in a recognition result correction section 15. An adaptation utilization determining section 17 calculates a reliability score for the corrected text, and makes a threshold decision on whether or not the corrected text is to be used for speaker adaptation. A speaker adaptation section 18 performs speaker adaptation when it is determined that the corrected text is to be used for speaker adaptation.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a speaker adaptation device, a speaker adaptation method, and a speaker adaptation program for speech recognition used for voice input on a computer or a mobile terminal.

In general, as shown in FIG. 3, a speech recognition device consists of two main elements: an acoustic analysis unit 101 and a search processing unit 102.

In FIG. 3, the input speech signal is sent to the acoustic analysis unit 101, which extracts acoustic features from the input speech.

As shown in FIG. 4, within a speech section the acoustic analysis unit 101 cuts a frame of length T out of the input speech and extracts, from the speech signal in that frame, an n-dimensional acoustic feature vector representing its characteristics. The frame position is then advanced in shifts of ΔT, as shown in FIG. 4, and feature extraction is carried out from the start to the end of the speech section.
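The framing step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `split_into_frames` and `frame_energy` are hypothetical, T and ΔT are expressed in samples, and the one-dimensional energy value merely stands in for the n-dimensional feature, whose type the text does not specify.

```python
def split_into_frames(samples, frame_len, frame_shift):
    """Cut frames of frame_len samples (T), advancing frame_shift samples (delta T) each time."""
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    frames = []
    start = 0
    # Stop when a full frame no longer fits before the end of the speech section.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames


def frame_energy(frame):
    """Toy stand-in for a feature extractor (assumption): mean squared amplitude."""
    return sum(x * x for x in frame) / len(frame)


frames = split_into_frames(list(range(10)), frame_len=4, frame_shift=2)
features = [frame_energy(f) for f in frames]
```

Because the shift is smaller than the frame length here, consecutive frames overlap, as FIG. 4 is described as showing.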

In FIG. 3, the acoustic features extracted by the acoustic analysis unit 101 are sent to the search processing unit 102. The acoustic model database 103 and the language model database 104 store a statistical acoustic model and a language model, respectively, created in advance.

Based on these acoustic features, the search processing unit 102 performs a search according to the statistical acoustic model and language model in the acoustic model database 103 and the language model database 104, and outputs a speech recognition result derived from the search result.

The search processing unit 102 searches for which of the word strings permitted by the language model is the most likely for the input speech. The language model is either a fixed grammar model, in which the word transition patterns are defined in advance, or a probabilistic grammar model, in which the words that can follow are determined probabilistically from the word string fixed up to a given time.

FIG. 5 shows an example of a fixed grammar model. In this example, four words can follow the initial silence state “Sil”: “Ito” (伊藤), “Itoi” (糸井), “Imai” (今井), and “Doi” (土井). Each is followed by the only permitted next word, “desu” (です, “is”), after which the word string finally returns to the silence state “Sil”. In this case, the search therefore looks for the maximum-likelihood word string among “[Sil] {Ito | Itoi | Imai | Doi} desu [Sil]”.
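The fixed grammar of FIG. 5 can be written down as a list of slots, each holding the words allowed at that position. The sketch below is only an illustration; the list-of-lists representation and the name `word_strings` are assumptions, not something the patent specifies.

```python
from itertools import product

# Each slot lists the words that may appear at that position (romanized for readability).
FIXED_GRAMMAR = [
    ["Sil"],
    ["Ito", "Itoi", "Imai", "Doi"],
    ["desu"],
    ["Sil"],
]


def word_strings(grammar):
    """Enumerate every word string the fixed grammar allows."""
    return [list(path) for path in product(*grammar)]


candidates = word_strings(FIXED_GRAMMAR)
for candidate in candidates:
    print(" ".join(candidate))
```

The search then has to pick the most likely of these four candidates for the input speech.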

Whichever of the fixed grammar model and the probabilistic grammar model is used, the search over per-frame acoustic features proceeds in units of syllables or phonemes, which are finer subdivisions of words. Each word is represented by concatenating the HMM (hidden Markov model) state sequences of its phonemes.

FIG. 6 shows the HMM state sequence of the word “Imai”. The phoneme representation of “Imai” is “i/m/a/i”, but to improve search performance, HMM state sequences that depend on the preceding and following phonemes, as shown in FIG. 6, are generally used. Here, “Sil-i+m” denotes the HMM state sequence for the phoneme “i” when the preceding phoneme is “Sil” and the following phoneme is “m”. Each HMM state is allowed a transition to itself (self-transition) and a transition to the HMM state immediately to its right (LR transition), and the self-transition and LR transition probabilities are described in the acoustic model. The acoustic model also describes the output probability distributions used to compute how likely the acoustic features obtained for each frame are for each HMM state (the acoustic likelihood).
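The context-dependent labels can be generated mechanically from the phoneme sequence. In the sketch below, the function name `context_dependent_labels` is hypothetical. Only the label “Sil-i+m” is confirmed by the text; the remaining labels are produced on the assumption that they follow the same “previous-current+next” notation and that the word is bounded by silence on both sides.

```python
def context_dependent_labels(phonemes, boundary="Sil"):
    """Build 'previous-current+next' labels for each phoneme of a word."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


labels = context_dependent_labels(["i", "m", "a", "i"])
print(labels)
```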

In FIG. 3, for every frame and for every HMM state to be considered in that frame, the search processing unit 102 computes the sum of the transition probability and the acoustic likelihood (the cumulative likelihood) for both the self-transition and the LR transition, and keeps the more plausible transition, that is, the one with the higher cumulative likelihood. Repeating this frame by frame, it finally determines the HMM state sequence with the highest cumulative likelihood. An algorithm that searches for the maximum-likelihood path in this way is called the Viterbi algorithm.
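A minimal sketch of this search for a single left-to-right state chain is given below. It assumes that all probabilities are held as logarithms (so that adding them is meaningful), that the path starts in the first state and ends in the last, and that the names `viterbi_left_to_right`, `log_emit`, `log_self`, and `log_next` are hypothetical. A real recognizer would search many chains at once and prune unlikely states.

```python
import math


def viterbi_left_to_right(log_emit, log_self, log_next):
    """Best path through a left-to-right HMM.

    log_emit[t][s]: acoustic log-likelihood of frame t in state s.
    log_self[s]:    log-probability of the self-transition of state s.
    log_next[s]:    log-probability of the LR transition from state s to s + 1.
    Returns (cumulative log-likelihood, state sequence).
    """
    n_states = len(log_self)
    score = [-math.inf] * n_states
    score[0] = log_emit[0][0]  # the path must start in the first state
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score = [-math.inf] * n_states
        pointer = [0] * n_states
        for s in range(n_states):
            stay = score[s] + log_self[s]
            move = score[s - 1] + log_next[s - 1] if s > 0 else -math.inf
            # Keep whichever transition gives the higher cumulative likelihood.
            if move > stay:
                best, pointer[s] = move, s - 1
            else:
                best, pointer[s] = stay, s
            new_score[s] = best + log_emit[t][s]
        score = new_score
        backpointers.append(pointer)
    # Trace back from the last state to recover the state sequence.
    state = n_states - 1
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return score[-1], path


best_score, best_path = viterbi_left_to_right(
    log_emit=[[0.0, -10.0], [-10.0, 0.0], [-10.0, 0.0]],
    log_self=[-1.0, -1.0],
    log_next=[-1.0, -1.0],
)
```

If there are fewer frames than states, no complete path exists and the returned score is negative infinity.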

In such a speech recognition device, a general-purpose speech recognition application whose user is not fixed uses a speaker-independent model stored in the acoustic model database 103 as its acoustic model. Parameters described in the speaker-independent model, such as the transition probabilities and output probability distributions, are trained on utterance data collected in advance from many speakers. Using a speaker-independent model gives a certain level of recognition performance for any user's speech.

By contrast, using a speaker-adapted model, in which the parameters of the speaker-independent model have been updated with utterance data collected from one particular speaker, gives better recognition performance for that speaker's speech than the speaker-independent model does. Such a parameter update is called speaker adaptation, and techniques such as MAP (Maximum A Posteriori) and MLLR (Maximum Likelihood Linear Regression) are commonly used. Speaker-adapted models are effective, for example, for speech recognition applications on mobile phones, where the user is very likely to be fixed.
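The text names MAP but gives no update formula. As a rough illustration only, the sketch below applies the commonly used MAP re-estimate of a single Gaussian mean, which interpolates between the speaker-independent mean and the occupancy-weighted average of the adaptation frames. The function name `map_adapt_mean`, the prior weight `tau`, and the per-frame occupancies are assumptions, and nothing here is claimed to be the update used in this patent.

```python
def map_adapt_mean(prior_mean, frames, occupancies, tau=10.0):
    """MAP re-estimate of a Gaussian mean.

    prior_mean:  speaker-independent mean vector.
    frames:      adaptation feature vectors aligned to this Gaussian.
    occupancies: weight of each frame for this Gaussian (0..1).
    tau:         prior weight (assumption); larger values trust the prior more.
    """
    total = sum(occupancies)
    dim = len(prior_mean)
    weighted_sum = [
        sum(g * x[d] for g, x in zip(occupancies, frames)) for d in range(dim)
    ]
    return [
        (tau * prior_mean[d] + weighted_sum[d]) / (tau + total) for d in range(dim)
    ]


adapted = map_adapt_mean([0.0], [[2.0], [2.0]], [1.0, 1.0], tau=2.0)
```

With little adaptation data the result stays close to the speaker-independent mean, and it moves toward the speaker's own data as more frames are observed.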

Speaker adaptation requires pairs of utterance data and text representing the content of the utterance, called a label. This is because the parameters of the HMM states contained in the label are updated on the basis of the utterance data.

For example, by having the user read aloud words or sentences specified in advance by the speech recognition application (enrollment), pairs in which the utterance content and the label match exactly can be obtained. Speaker adaptation carried out with such pairs is called supervised adaptation.

However, enrollment, which requires reading aloud multiple words or sentences, is a nuisance for users and may discourage them from using speaker adaptation or even the speech recognition application itself. In addition, typical enrollment, performed only once before the speech recognition application is first used, cannot follow changes in the user's voice over time. That is, even after the user has enrolled and a speaker-adapted model has been created, the user's voice changes as time passes, so it gradually drifts away from the parameters described in the acoustic model and recognition performance degrades.

To address these problems, it has been proposed to use the recognition result text produced by the speech recognition device as the label, so that the pairs of utterance data and labels needed for speaker adaptation can be obtained without forcing the user to enroll (see, for example, Patent Document 1). Such speaker adaptation is called unsupervised adaptation. Because adaptation can be run every time the speech recognition application is used, it can also follow changes in the user's voice over time.

However, unsupervised speaker adaptation has the problem that, when the recognition result contains errors, it degrades the performance of the acoustic model instead of improving it. A method that uses corrected text as the label has therefore been proposed. FIG. 7 shows an example of a conventional speech recognition device that performs unsupervised speaker adaptation using corrected text as the label.

As in the general speech recognition device described above, the acoustic analysis unit 101 extracts the acoustic features of the input speech and the search processing unit 102 produces the recognition result text. The recognition result text from the search processing unit 102 is sent to the recognition result correction unit 105. The acoustic features from the acoustic analysis unit 101 are stored in the feature database 106 as utterance data.

The user checks the speech recognition result and, if it is wrong, corrects it through an input device 109 such as a keyboard or numeric keypad; the recognition result correction unit 105 applies the correction to the recognition result text. Then, only corrected text that the user has selected, in the adaptation use selection unit 107, for use in speaker adaptation is combined with the corresponding acoustic features stored in the feature database 106 and used for speaker adaptation in the speaker adaptation unit 108.

Patent Document 1: JP 2003-241787 A

However, in the unsupervised adaptation described above, speaker adaptation uses as the label the corrected text obtained when the user corrects the recognition result text through the input device 109 such as a keyboard or numeric keypad, and this corrected text does not necessarily match the utterance data.

For example, consider entering text on a mobile phone with a speech recognition application. The user is trying to accomplish some purpose (search, e-mail, and so on) with the text being entered, so if the recognition result is wrong, in most cases the user will correct the recognition result text, for example with the numeric keypad. The user should have spoken what they wanted to enter, so the corrected text is likely to match the content of the utterance data.

The two do not always match, however. For example, if the user says “Mr. Ito” (伊藤さん) but the speech recognition result is “Mr. Itoi” (糸井さん), the user corrects it to “Mr. Ito” with the numeric keypad or the like. At this point the user may go on to change it to “Taro Ito” (伊藤太郎くん). In that case the utterance data is “Mr. Ito” whereas the corrected text is “Taro Ito”, so the corrected text no longer matches the utterance data.

Because the corrected text and the utterance data do not always match in this way, using every corrected text as a label for speaker adaptation may degrade the performance of the speaker-adapted model.

To deal with this, the conventional method prompts the user, every time a recognition result is corrected, to choose whether the corrected text should be used for speaker adaptation, and runs speaker adaptation with the corrected text as the label only when the user chooses to use it.

However, the advantage of unsupervised adaptation is that speaker adaptation runs automatically as the speech recognition application is used, so the user is not inconvenienced; forcing the user to choose whether to run speaker adaptation defeats this advantage. Moreover, if the user's memory of what was uttered is vague, the corrected text may be used for speaker adaptation even though it does not match the utterance content, which may degrade the performance of the speaker-adapted model.

The present invention has been made in view of the problems described above. Its object is to provide a speaker adaptation device, a speaker adaptation method, and a speaker adaptation program that can perform speaker adaptation fully automatically by accurately determining, without forcing the user to make a selection, which pairs of corrected text and utterance data to use for speaker adaptation.

To solve the above problems, the present invention proposes the following.
(1) The present invention proposes a speaker adaptation device comprising: an acoustic analysis unit that extracts acoustic features of input speech; a statistical model database storing a statistical model for performing speech recognition based on the extracted acoustic features; a search processing unit that applies the statistical model to the acoustic features and performs a search; a recognition result correction unit that applies the user's corrections to the recognition result text output by the search; an adaptation use determination unit that calculates a reliability score for the corrected text and decides by thresholding whether to use the corrected text for speaker adaptation; and a speaker adaptation unit that performs speaker adaptation when it is determined that the corrected text is to be used for speaker adaptation.

(2) The present invention proposes the speaker adaptation device of (1), wherein the adaptation use determination unit creates a statistical model for determination whose recognition target is the corrected text, and calculates the reliability score from the cumulative likelihood for the corrected text obtained by applying the statistical model for determination to the acoustic features obtained from the input speech and performing a search.

(3) The present invention proposes the speaker adaptation device of (2), wherein the adaptation use determination unit uses as the reliability score the frame-normalized likelihood obtained by dividing the cumulative likelihood for the corrected text by the number of processed frames.

(4) The present invention proposes the speaker adaptation device of (2), wherein the adaptation use determination unit calculates the reliability score from the cumulative likelihood for the corrected text and the cumulative likelihood for the recognition result text obtained by the search processing unit.

(5) The present invention proposes the speaker adaptation device of (4), wherein the adaptation use determination unit calculates the reliability score from the frame-normalized likelihood for the corrected text and the frame-normalized likelihood for the recognition result text, the latter obtained by dividing the cumulative likelihood for the recognition result text obtained by the search processing unit by its number of processed frames.

(6) The present invention proposes the speaker adaptation device of (5), wherein the adaptation use determination unit calculates the reliability score from the difference between the frame-normalized likelihoods of the corrected text and the recognition result text.

(7) The present invention proposes a speaker adaptation method comprising: a first step of extracting acoustic features of input speech; a second step of applying a statistical model to the acoustic features and performing a search; a third step of determining whether the user has corrected the recognition result text output by the search; a fourth step of calculating a reliability score for the corrected text and deciding by thresholding whether to use the corrected text for speaker adaptation; and a fifth step of performing speaker adaptation when it is determined that the corrected text is to be used for speaker adaptation.

(8) The present invention proposes a speaker adaptation program for causing a computer to execute: a first step of extracting acoustic features of input speech; a second step of applying a statistical model to the acoustic features and performing a search; a third step of determining whether the user has corrected the recognition result text output by the search; a fourth step of calculating a reliability score for the corrected text and deciding by thresholding whether to use the corrected text for speaker adaptation; and a fifth step of performing speaker adaptation when it is determined that the corrected text is to be used for speaker adaptation.

According to the present invention, a reliability score is calculated for the corrected text, a threshold decision is made on whether to use the corrected text for speaker adaptation, and speaker adaptation is performed when it is decided that the corrected text is to be used. This makes it possible to determine the pairs of corrected text and utterance data to use for speaker adaptation without forcing the user to make a selection, so speaker adaptation runs fully automatically without inconveniencing the user. In addition, because the decision on whether to use a pair for speaker adaptation does not depend on the user's memory of what was uttered, pairs in which the corrected text matches the utterance content are selected more accurately, and using them for speaker adaptation improves the performance of the speaker-adapted model.

Embodiments of the present invention are described below with reference to the drawings.
The components of this embodiment can be replaced with existing components and the like as appropriate, and many variations are possible, including combinations with other existing components. The description of this embodiment therefore does not limit the content of the invention set out in the claims.

FIG. 1 shows the functional blocks of this embodiment.
In FIG. 1, the input speech signal is sent to the acoustic analysis unit 11, which extracts acoustic features from the input speech.

The acoustic features extracted by the acoustic analysis unit 11 are sent to the search processing unit 12. They are also stored in the feature database 16 as utterance data for the speaker adaptation process.

The acoustic model database 13 and the language model database 14 store a statistical acoustic model and a language model, respectively, created in advance. Based on the acoustic features, the search processing unit 12 performs a search according to the statistical acoustic model and language model in the acoustic model database 13 and the language model database 14, and outputs a speech recognition result derived from the search result. The acoustic analysis and search processes are the same as in a general speech recognition device.

The user checks the speech recognition result and, if it is wrong, corrects it through an input device 19 such as a keyboard or numeric keypad. The recognition result correction unit 15 applies the user's correction to the recognition result text, and the corrected text is sent to the adaptation use determination unit 17.

The adaptation use determination unit 17 runs a search over the acoustic features of the input speech obtained by the acoustic analysis unit 11, using a language model for determination created from the corrected text obtained by the recognition result correction unit 15, and calculates the cumulative likelihood of the corrected text. It calculates the reliability score of the corrected text from this value and, if the reliability score satisfies a condition defined by a predetermined threshold, determines that the corrected text matches the content of the input speech and can be used for speaker adaptation.

More specifically, the adaptation use determination unit 17 first creates, from the corrected text, a language model whose only recognition target is the content of that text (the language model for determination). Because it is built from the corrected text alone, it may contain a word or word string that the language model used by the search processing unit 12 does not cover. Next, a search over the corresponding acoustic features is performed using the acoustic model and the language model for determination held in the acoustic model database 13 and the language model database 14, and the cumulative likelihood for the corrected text is calculated.

Since the language model for determination has only one word or word string as its recognition target, the recognition result always equals the corrected text. The reliability score is calculated from the resulting cumulative likelihood and, if it satisfies the condition defined by the predetermined threshold, the corrected text is judged to match the utterance content and to be usable for speaker adaptation.
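A language model for determination can be pictured as a fixed grammar with exactly one path. The sketch below is only an illustration: the slot-list representation, the function names `determination_grammar` and `count_paths`, and the assumption that the corrected text has already been split into words are all hypothetical.

```python
def determination_grammar(corrected_words, silence="Sil"):
    """Build a single-path grammar whose only word string is the corrected text."""
    return [[silence]] + [[word] for word in corrected_words] + [[silence]]


def count_paths(grammar):
    """Number of distinct word strings the grammar allows."""
    total = 1
    for alternatives in grammar:
        total *= len(alternatives)
    return total


grammar = determination_grammar(["Ito", "san"])
n_paths = count_paths(grammar)
```

Because only one path exists, the search cannot return anything other than the corrected text; what it contributes is the cumulative likelihood of that text against the stored acoustic features.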

The processing in the adaptation use determination unit 17 can be applied either to the whole corrected text or to only part of it.

As the reliability score, one option is the frame-normalized likelihood, obtained by dividing the calculated cumulative likelihood by the number of frames processed in the search. This is because the cumulative likelihood itself increases with the number of processed frames and is therefore unsuitable for a threshold decision.
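The normalization and threshold decision might look like the sketch below. The function names and the numeric values are hypothetical, and the comparison "score greater than or equal to threshold" is an assumption: the text only says that the score must satisfy a condition defined by a predetermined threshold.

```python
def frame_normalized_likelihood(cumulative_likelihood, n_frames):
    """Divide the cumulative likelihood by the number of processed frames."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    return cumulative_likelihood / n_frames


def usable_for_adaptation(score, threshold):
    """Assumed decision rule: accept when the score reaches the threshold."""
    return score >= threshold


# Two utterances of different length with the same per-frame quality
# receive the same score after normalization.
short_score = frame_normalized_likelihood(-300.0, 50)
long_score = frame_normalized_likelihood(-600.0, 100)
```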

It is also possible to calculate the reliability score using not only the cumulative likelihood of the corrected text but also the cumulative likelihood computed when the search processing unit 12 obtains the recognition result. For example, the reliability score can be calculated from the frame-normalized likelihood for the corrected text and the frame-normalized likelihood for the recognition result text, the latter obtained by dividing the cumulative likelihood for the recognition result text obtained by the search processing unit 12 by its number of processed frames.
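One way to combine the two frame-normalized likelihoods is to take their difference, as item (6) above suggests. In the sketch below, the function name `difference_score` and the sign convention (corrected text minus recognition result) are assumptions; the text states only that the score is calculated from the difference.

```python
def difference_score(corrected_cumulative, corrected_frames,
                     recognized_cumulative, recognized_frames):
    """Frame-normalized likelihood of the corrected text minus that of the
    recognition result text (sign convention is an assumption)."""
    if corrected_frames <= 0 or recognized_frames <= 0:
        raise ValueError("frame counts must be positive")
    corrected = corrected_cumulative / corrected_frames
    recognized = recognized_cumulative / recognized_frames
    return corrected - recognized


score = difference_score(-500.0, 100, -600.0, 100)
```

Under this convention, a score near or above zero means the corrected text explains the stored acoustic features about as well as the original recognition result did, and a strongly negative score suggests that the corrected text does not match what was spoken.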

Next, when the adaptation use determination unit 17 determines that the reliability score satisfies the condition defined by the predetermined threshold, the speaker adaptation unit 18 performs the speaker adaptation process. In this process, the corrected text is used as the label and combined with the corresponding acoustic features stored in the feature database 16 as utterance data, and the acoustic model in the acoustic model database 13 is adapted to the speaker.

As described above, in this embodiment the adaptation use determination unit 17 runs a search over the acoustic features of the input speech obtained by the acoustic analysis unit 11, using the language model for determination created from the corrected text obtained by the recognition result correction unit 15, calculates the cumulative likelihood of the corrected text, and calculates the reliability score of the corrected text from it. When the reliability score satisfies the condition defined by the predetermined threshold, the speaker adaptation unit 18 performs the speaker adaptation process. The pairs of corrected text and utterance data to use for speaker adaptation can therefore be determined correctly.

For example, suppose the user says "Ito-san" (伊藤さん) but the speech recognition result is "Itoi-san" (糸井さん). The user corrects the result to "Ito-san" in the recognition result correction unit 15 using the input device 19. The user might, however, go further and correct it to "Ito Taro-kun" (伊藤太郎くん). When the correction is "Ito-san", the corrected text matches what was spoken, the cumulative likelihood of the corrected text exceeds the threshold, and speaker adaptation processing is executed. When the correction is "Ito Taro-kun", by contrast, the corrected text no longer matches the utterance, the cumulative likelihood of the corrected text falls below the threshold, and speaker adaptation processing is not executed.

In this way, the embodiment determines the pairs of corrected text and utterance data to be used for speaker adaptation without forcing the user to make a selection, so speaker adaptation runs fully automatically without burdening the user. Moreover, because the decision on whether to use a pair for speaker adaptation does not depend on the user's memory of what was said, pairs in which the corrected text matches the content of the utterance data are selected more accurately, and using them for speaker adaptation improves the performance of the speaker-adapted model.

The present invention can be implemented in hardware that performs processing equivalent to that shown in FIG. 1, or in equivalent software. FIG. 2 is a flowchart of the embodiment of the present invention.

In FIG. 2, when speech is input (step S1), the acoustic features of the input speech are extracted (step S2). The extracted acoustic features are also stored as utterance data (step S3). Search processing is then performed on the acoustic features according to a statistical acoustic model and a language model (step S4), and the speech recognition result is output from the search result (step S5).

Next, it is determined whether the user has corrected the recognition result text (step S6). If no correction has been made, the process returns to step S1. If a correction has been made, a determination language model is created from the corrected text (step S7). Search processing is then executed on the acoustic features of the input speech using the determination language model, and the reliability score of the corrected text is calculated from the cumulative likelihood of the corrected text (step S8). It is determined whether this reliability score is larger than a predetermined threshold (step S9). If the reliability score is smaller than the threshold, the process returns to step S1. If the reliability score is larger than the threshold, speaker adaptation processing is executed (step S10), and the process returns to step S1.
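The control flow of steps S1 to S10 can be sketched as a single function that handles one utterance. This is an illustrative outline under stated assumptions: `process_utterance` and all of its parameters are hypothetical names, each processing stage is supplied by the caller as a callable, and a score exactly equal to the threshold is treated as a rejection because the text only covers the larger and smaller cases.

```python
from typing import Callable, List, Optional

Features = List[List[float]]  # one acoustic feature vector per frame


def process_utterance(
    audio: List[float],
    extract_features: Callable[[List[float]], Features],
    recognize: Callable[[Features], str],
    get_correction: Callable[[str], Optional[str]],
    score_correction: Callable[[str, Features], float],
    adapt: Callable[[str, Features], None],
    feature_store: List[Features],
    threshold: float,
) -> str:
    """Run steps S1 to S10 for one utterance and report the outcome."""
    features = extract_features(audio)            # S1, S2: input speech, extract features
    feature_store.append(features)                # S3: keep features as utterance data
    recognized_text = recognize(features)         # S4, S5: search and output the result
    corrected_text = get_correction(recognized_text)  # S6: None means no correction
    if corrected_text is None:
        return "no_correction"                    # back to S1
    # S7, S8: build the determination model from the corrected text and score it
    score = score_correction(corrected_text, features)
    if score <= threshold:                        # S9 (equality treated as rejection)
        return "rejected"                         # back to S1
    adapt(corrected_text, features)               # S10: speaker adaptation
    return "adapted"                              # back to S1
```

A caller would invoke this function once per utterance in a loop, which corresponds to the return to step S1 in the flowchart.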

The source program is provided on a computer-readable recording medium, for example a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM. The source program may also be transmitted from one computer system to another via a transmission medium or by transmission waves in a transmission medium. Here, the "transmission medium" that carries the program refers to a medium capable of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The source program may implement only part of the functions described above. It may also be a so-called difference file (difference program), which implements those functions in combination with a program already recorded in the computer system.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the present invention.

FIG. 1 is a block diagram of an embodiment of the present invention.
FIG. 2 is a flowchart used to describe an embodiment of the present invention.
FIG. 3 is a block diagram of the basic configuration of a speech recognition device.
FIG. 4 is an explanatory diagram of frame processing in acoustic analysis.
FIG. 5 is an explanatory diagram of a fixed grammar model.
FIG. 6 is an explanatory diagram of HMM states that depend on the preceding and following phoneme context.
FIG. 7 is a block diagram of an example of a conventional speech recognition device that performs unsupervised speaker adaptation.

Explanation of symbols

11: acoustic analysis unit
12: search processing unit
13: acoustic model database
14: language model database
15: recognition result correction unit
16: feature database
17: adaptive use determination unit
18: speaker adaptation unit
19: input device

Claims

1. A speaker adaptation device comprising:
an acoustic analysis unit that extracts acoustic features of input speech;
a statistical model database that stores a statistical model for executing speech recognition based on the extracted acoustic features;
a search processing unit that executes search processing by applying the statistical model to the acoustic features;
a recognition result correction unit that accepts corrections from a user to the recognition result text output by the search processing;
an adaptive use determination unit that calculates a reliability score for the corrected text and determines whether or not to use the corrected text for speaker adaptation; and
a speaker adaptation unit that performs speaker adaptation when it is determined that the corrected text is to be used for speaker adaptation.

2. The speaker adaptation device according to claim 1, wherein the adaptive use determination unit creates a statistical model for determination that takes the corrected text as its recognition target, and calculates the reliability score based on the cumulative likelihood for the corrected text obtained by executing search processing in which the statistical model for determination is applied to the acoustic features obtained from the input speech.

3. The speaker adaptation device according to claim 2, wherein the adaptive use determination unit uses, as the reliability score, a frame-normalized likelihood obtained by dividing the cumulative likelihood for the corrected text by the number of processed frames.

4. The speaker adaptation device according to claim 2, wherein the adaptive use determination unit calculates the reliability score based on the cumulative likelihood for the corrected text and the cumulative likelihood for the recognition result text obtained by the search processing unit.

5. The speaker adaptation device according to claim 4, wherein the adaptive use determination unit calculates the reliability score based on the frame-normalized likelihood for the corrected text and a frame-normalized likelihood for the recognition result text, the latter being obtained by dividing the cumulative likelihood for the recognition result text obtained by the search processing unit by the number of processed frames.

6. The speaker adaptation device according to claim 5, wherein the adaptive use determination unit calculates the reliability score from the difference between the frame-normalized likelihoods of the corrected text and the recognition result text.

7. A speaker adaptation method comprising:
a first step of extracting acoustic features of input speech;
a second step of executing search processing by applying a statistical model to the acoustic features;
a third step of determining whether or not a user has corrected the recognition result text output by the search processing;
a fourth step of calculating a reliability score for the corrected text and determining whether or not to use the corrected text for speaker adaptation; and
a fifth step of performing speaker adaptation when it is determined that the corrected text is to be used for speaker adaptation.

8. A speaker adaptation program that causes a computer to execute:
a first step of extracting acoustic features of input speech;
a second step of executing search processing by applying a statistical model to the acoustic features;
a third step of determining whether or not a user has corrected the recognition result text output by the search processing;
a fourth step of calculating a reliability score for the corrected text and determining whether or not to use the corrected text for speaker adaptation; and
a fifth step of performing speaker adaptation when it is determined that the corrected text is to be used for speaker adaptation.