JP2007248730A

JP2007248730A - Sound model adaptive apparatus, method, and program, and recording medium

Info

Publication number: JP2007248730A
Application number: JP2006070961A
Authority: JP
Inventors: Yuichi Nakazawa (裕一中澤); Satoru Kobashigawa (哲小橋川); Atsunori Ogawa (厚徳小川); Hirokazu Masataki (浩和政瀧)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2006-03-15
Filing date: 2006-03-15
Publication date: 2007-09-27
Anticipated expiration: 2026-03-15
Also published as: JP4594885B2

Abstract

PROBLEM TO BE SOLVED: To construct a highly accurate acoustic model by easily selecting highly accurate speech recognition results suitable for unsupervised adaptation of the acoustic model and using the selected results for adaptation.

SOLUTION: A reliability assigning unit 150 uses a speech recognition result to calculate, for each utterance sequence obtained by dividing the word sequence of the speech recognition result, a reliability that is an estimate of the recognition rate. An utterance selection unit 160 selects the utterance sequences to be used for adaptation of the acoustic model, using the recognition rate of the acoustic model and the reliability of each utterance sequence. An acoustic model adaptation unit 170 adapts the acoustic model using the utterance sequences selected by the utterance selection unit 160 and the feature values corresponding to those utterance sequences.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a technique for adapting an acoustic model, and more particularly to a technique for performing unsupervised adaptation of an acoustic model using the reliability of speech recognition results.

In general, in speech recognition, an acoustic model is adapted using, as training data, speech files and correct text representing the utterance content of those speech files. Here, “adaptation of the acoustic model” means optimizing the parameters of the acoustic model through a learning process so that as many of the cases in the training data as possible hold. Adaptation of an acoustic model is broadly divided into supervised adaptation, which uses as training data correct text created by, for example, a human transcribing the reading corresponding to a speech file, and unsupervised adaptation, which uses training data in which speech recognition results or the like serve as the correct text.

When an acoustic model is adapted by unsupervised adaptation, speech recognition results with high recognition accuracy must be used as the correct text. If speech recognition results with low recognition accuracy are used as the correct text, incorrect adaptation may degrade the accuracy of the acoustic model.
To address this problem, one conceivable approach is to assign a reliability to each speech recognition result, select speech recognition results according to their reliability, and adapt the acoustic model using the selected results. This avoids the situation in which speech recognition results with low recognition accuracy are used as the correct text and the accuracy of the acoustic model is degraded.

For example, Non-Patent Document 1 discloses a technique that assigns to each speech recognition result a reliability based on phoneme posterior probabilities and adapts the acoustic model using the speech recognition results whose reliability is equal to or higher than a threshold. In this technique, the threshold is set within the range from 0 to 1, multiple data selection models with different threshold values are prepared, and the acoustic model is adapted and evaluated.
Non-Patent Document 1: Jun Ogata, Yasuo Ariki (緒方淳，有木康雄), “Unsupervised adaptation of acoustic models using reliability based on phoneme posterior probabilities” (「音素事後確率に基づく信頼度を用いた音響モデルの教師なし適応化」), IEICE Technical Report NLC2001-70, pp. 19-24

However, with existing techniques such as that of Non-Patent Document 1, it is very difficult to determine which value should be set as the threshold for selecting the speech recognition results used for adaptation of the acoustic model.
The present invention has been made in view of this point, and an object thereof is to provide a technique that can easily select highly accurate speech recognition results suitable for unsupervised adaptation of an acoustic model and construct a highly accurate acoustic model using the selected speech recognition results.

To solve the above problem, in the present invention, a reliability assigning unit uses a speech recognition result to calculate, for each utterance sequence obtained by dividing the word sequence of the speech recognition result, a reliability that is an estimate of the recognition rate; an utterance selection unit selects the utterance sequences to be used for adaptation of the acoustic model, using the recognition rate of the acoustic model and the reliability of each utterance sequence; and an acoustic model adaptation unit adapts the acoustic model using the utterance sequences selected by the utterance selection unit and the feature values corresponding to those utterance sequences. An “utterance sequence” means each of the sequences obtained by dividing the word sequence (the word sequence of readings) of a speech recognition result according to a predetermined criterion, and consists of one or more words. The “reliability” is an estimate of the recognition rate; this concept covers not only a value that estimates the recognition rate itself (for example, α when the recognition rate is estimated to be α%) but also values that estimate the range to which the recognition rate belongs (for example, α when the recognition rate is estimated to be α% or more, or α and β when the recognition rate is estimated to be at least α% and less than β%).

In the present invention, the reliability of each utterance sequence is evaluated against the recognition rate of the acoustic model, and the utterance sequences to be used for adaptation are selected accordingly. This prevents the selection of utterance sequences that would lower the recognition rate of the acoustic model through adaptation. Moreover, because the selection is based on the recognition rate of the acoustic model, the trial and error conventionally needed to set an appropriate threshold is unnecessary.
Preferably, in the present invention, the utterance selection unit compares the reliability of each utterance sequence with a reference value set to a value equal to or higher than the recognition rate of the acoustic model, and selects either the utterance sequences whose reliability is equal to or higher than the reference value or the utterance sequences whose reliability exceeds the reference value.

Selecting utterance sequences in this way prevents the selection of utterance sequences that would lower the recognition rate of the acoustic model through adaptation.
Preferably, in the present invention, supervised correct text is input to an adaptation data input unit, and the acoustic model adaptation unit adapts the acoustic model using the utterance sequences selected by the utterance selection unit and their corresponding feature values, together with the supervised correct text input to the adaptation data input unit and its corresponding feature values. “Supervised correct text” means correct text created or corrected by a human transcribing the reading corresponding to a speech file. Preferably, a correct text selection unit selects supervised correct text corresponding to at least some of the utterance sequences that the utterance selection unit did not select, and a correct text output unit outputs the selected supervised correct text. Also preferably, the supervised correct text input to the adaptation data input unit is the supervised correct text output from the correct text output unit. By replacing utterance sequences with low reliability with supervised correct text in this way before performing model adaptation, the accuracy of the acoustic model can be further improved while retaining the advantages of unsupervised adaptation.

More preferably, the correct text selection unit selects supervised correct text corresponding to utterance sequences that the utterance selection unit did not select and whose reliability is nevertheless good enough to satisfy a predetermined criterion. This prevents speech files whose reliability is extremely low, and whose data may itself be problematic, from being used for adaptation and adversely affecting the accuracy of the acoustic model.
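The three-way split described above (use as unsupervised text, send for human transcription, or leave out) can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `Utterance`, `partition`, `model_rate`, and `floor` are hypothetical, the reference value is taken to be the acoustic model's recognition rate itself, and the "predetermined criterion" is assumed to be a simple lower bound on the reliability.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str           # recognized word sequence (reading)
    reliability: float  # estimated recognition rate, in percent


def partition(utterances, model_rate, floor):
    """Split utterances into (unsupervised, to_transcribe, discarded).

    model_rate: recognition rate of the current acoustic model (percent),
                used here directly as the reference value (an assumption).
    floor:      hypothetical lower bound standing in for the
                "predetermined criterion" on reliability.
    """
    unsupervised, to_transcribe, discarded = [], [], []
    for u in utterances:
        if u.reliability >= model_rate:
            # Reliable enough to serve as unsupervised correct text.
            unsupervised.append(u)
        elif u.reliability >= floor:
            # Not selected, but good enough to be worth transcribing by hand.
            to_transcribe.append(u)
        else:
            # Extremely low reliability: the data itself may be problematic.
            discarded.append(u)
    return unsupervised, to_transcribe, discarded
```

Only the middle group would be handed to a transcriber, so the manual effort is spent on utterances that are neither already reliable nor likely to be unusable.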

As described above, the present invention makes it possible to easily select highly accurate speech recognition results suitable for unsupervised adaptation of an acoustic model and to construct a highly accurate acoustic model using the selected speech recognition results.

The best mode for carrying out the present invention will be described below with reference to the drawings.
[First Embodiment]
<Hardware configuration>
FIG. 1 is a block diagram illustrating the hardware configuration of the acoustic model adaptation apparatus 1 according to the first embodiment.
As illustrated in FIG. 1, the acoustic model adaptation apparatus 1 of this example includes a CPU (Central Processing Unit) 11, an input unit 12, an output unit 13, an auxiliary storage device 14, a ROM (Read Only Memory) 15, a RAM (Random Access Memory) 16, and a bus 17.

The CPU 11 of this example includes a control unit 11a, an arithmetic unit 11b, and a register 11c, and executes various arithmetic operations according to the programs read into the register 11c. The input unit 12 of this example is an input port to which data is input, a keyboard, a mouse, or the like, and the output unit 13 is an output port that outputs data, a display, or the like. The auxiliary storage device 14 is, for example, a hard disk, an MO (Magneto-Optical disc), or a semiconductor memory, and has a program area 14a that stores the program for executing the processing of this embodiment and a data area 14b that stores various data such as tag output information. The RAM 16 is, for example, an SRAM (Static Random Access Memory) or a DRAM (Dynamic Random Access Memory), and has a program area 16a into which the above program is written and a data area 16b into which various data are written. The bus 17 of this example connects the CPU 11, the input unit 12, the output unit 13, the auxiliary storage device 14, the ROM 15, and the RAM 16 so that they can exchange data.

<Cooperation between hardware and software>
In accordance with the loaded OS (Operating System) program, the CPU 11 of this example writes the program stored in the program area 14a of the auxiliary storage device 14 into the program area 16a of the RAM 16. Similarly, the CPU 11 writes the various data stored in the data area 14b of the auxiliary storage device 14 into the data area 16b of the RAM 16. The CPU 11 also stores in the register 11c the addresses on the RAM 16 at which the program and the data have been written. The control unit 11a of the CPU 11 then sequentially reads these addresses from the register 11c, reads the program and data from the areas on the RAM 16 indicated by the addresses, causes the arithmetic unit 11b to sequentially execute the operations indicated by the program, and stores the results in the register 11c.

FIG. 2 is an example of a block diagram of the acoustic model adaptation apparatus 1 configured by loading the program into the CPU 11 in this way. The arrows in FIG. 2 indicate the flow of data, but the flow of data into and out of the control unit 190 is omitted.
As shown in FIG. 2, the acoustic model adaptation apparatus 1 of this embodiment includes a memory 110, a speech recognition result input unit 130, an information conversion unit 140, a reliability assigning unit 150, an utterance selection unit 160, an acoustic model adaptation unit 170, a temporary memory 180, and a control unit 190. The memory 110 has storage units 111 to 119 that store various data. The reliability assigning unit 150 has a feature vector generation unit 151 and a feature vector evaluation unit 152. The memory 110 and the temporary memory 180 correspond to, for example, the register 11c, the auxiliary storage device 14, or the RAM 16 shown in FIG. 1, or to a storage area combining at least parts of these. The information conversion unit 140, the reliability assigning unit 150, the utterance selection unit 160, the acoustic model adaptation unit 170, and the control unit 190 are configured, for example, by loading the program into the CPU 11 shown in FIG. 1. The speech recognition result input unit 130 is, for example, the input unit 12 operating under the control of the CPU 11 into which the program has been loaded. The acoustic model adaptation apparatus 1 executes each process under the control of the control unit 190. Unless otherwise specified, the data of each process is written to and read from the temporary memory 180 each time.

<Processing>
Next, the processing of the acoustic model adaptation apparatus 1 of this embodiment will be described.
FIG. 3 is a flowchart for explaining the processing of the acoustic model adaptation apparatus 1 according to the first embodiment. FIG. 4 is a flowchart for explaining the details of step S3 in FIG. 3. The processing of this embodiment is described below with reference to these figures.
[Preprocessing]
As preprocessing, an identification model is stored in the storage unit 114 of the memory 110, speech files in the storage unit 118, an acoustic model in the storage unit 119, and the recognition rate of that acoustic model in the storage unit 116 (corresponding to the “recognition rate storage unit”). The identification model is a model for obtaining an estimate of the recognition rate (the reliability) from features obtained from a speech recognition result (details are given later). The acoustic model is a model that represents the statistical properties of speech; a hidden Markov model (HMM) is one example. The recognition rate of the acoustic model is calculated by actually performing speech recognition on evaluation data using the acoustic model.

[Acoustic model adaptation processing]
The acoustic model adaptation processing is executed on the premise of the preprocessing described above.
First, a speech recognition unit (not shown) performs speech recognition on the speech files stored in the storage unit 118, using the acoustic model stored in the storage unit 119 of the memory 110. The speech recognition results are input to the speech recognition result input unit 130, associated with the corresponding speech files, and stored in the storage unit 111 of the memory 110 (step S1). Each speech recognition result includes the word sequence of readings obtained by speech recognition and the additional information assigned to each word by speech recognition (for example, each word's part-of-speech information, acoustic likelihood score, language likelihood score, word likelihood score, word duration, number of phonemes, and phoneme duration).

Next, the information conversion unit 140 reads the speech recognition results from the storage unit 111 of the memory 110, divides the word sequence of each speech recognition result into utterance sequences according to a fixed criterion, and stores each resulting word sequence in the storage unit 112 of the memory 110 in association with the corresponding speech file and the additional information of the speech recognition result (step S2). The definition of “utterance sequence” is as given above. Examples of criteria for dividing a word sequence include the length of the silent interval between words and the part-of-speech information of words. Specific examples of utterance sequences are as follows.
『その辺ではかなり収益も上がるんじゃないかなと思います。』 (“I think profits will go up quite a bit around there.”)
『なるほどね。』 (“I see.”)
『今、あの韓国に買い物行くツアーとか、そういうのが非常に流行ってるんですが、』 (“Right now, um, tours to go shopping in Korea and things like that are very popular, but,”)
『んー』 (“Hmm”)
Next, the reliability assigning unit 150 uses the speech recognition results to calculate, for each utterance sequence, a reliability that is an estimate of the recognition rate. Each calculated reliability is associated with the corresponding utterance sequence and stored in the storage unit 115 of the memory 110 (step S3). As noted above, the “reliability” is an estimate of the recognition rate; this concept covers not only a value that estimates the recognition rate itself (for example, α when the recognition rate is estimated to be α%) but also values that estimate the range to which the recognition rate belongs (for example, α when the recognition rate is estimated to be α% or more, or α and β when the recognition rate is estimated to be at least α% and less than β%). Details of this processing are described later.
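As a rough illustration of the division into utterance sequences in step S2, the sketch below splits a recognized word sequence wherever the silent interval between two consecutive words is long enough. The text only names the silent-interval length as one possible criterion; the `Word` structure, the per-word time stamps, and the `min_pause` threshold of 0.5 seconds are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    surface: str  # recognized word (reading)
    start: float  # start time in seconds (assumed to be available)
    end: float    # end time in seconds (assumed to be available)


def split_into_utterances(words, min_pause=0.5):
    """Group words into utterance sequences, cutting at long silences.

    min_pause is a hypothetical threshold: a gap of at least this many
    seconds between two consecutive words starts a new utterance sequence.
    """
    utterances = []
    current = []
    for w in words:
        if current and w.start - current[-1].end >= min_pause:
            utterances.append(current)
            current = []
        current.append(w)
    if current:
        utterances.append(current)
    return utterances
```

Each returned group contains at least one word, which matches the definition of an utterance sequence as consisting of one or more words.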

Next, the utterance selection unit 160 reads the reliability of each utterance sequence from the storage unit 115 of the memory 110 and the recognition rate of the acoustic model from the storage unit 116. Using these, the utterance selection unit 160 selects the utterance sequences to be used for adaptation of the acoustic model and stores selection information indicating the selection in the storage unit 117 (step S4). Preferably, the utterance selection unit 160 compares the reliability of each utterance sequence with a reference value set on the basis of the recognition rate of the acoustic model, and selects either the utterance sequences whose reliability is equal to or higher than the reference value or those whose reliability exceeds it. More preferably, the reference value is set to a value equal to or higher than the recognition rate of the acoustic model. Specifically, utterance sequences are selected, for example, as follows.

[Example 1]
The reference value is the recognition rate of the acoustic model itself. Either the utterance sequences whose reliability is equal to or higher than the recognition rate of the acoustic model are selected, or those whose reliability exceeds it.
[Example 2]
The reference value is the recognition rate of the acoustic model with a constant added to it or multiplied by a constant. Either the utterance sequences whose reliability is equal to or higher than the reference value are selected, or those whose reliability exceeds it.
[Example 3]
The reference value is the recognition rate of the acoustic model minus a constant. Either the utterance sequences whose reliability is equal to or higher than the reference value are selected, or those whose reliability exceeds it.
[Example 4]
The reference value is the value of a predetermined function evaluated at the recognition rate of the acoustic model. Either the utterance sequences whose reliability is equal to or higher than the reference value are selected, or those whose reliability exceeds it.
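The four examples differ only in how the reference value is derived from the recognition rate of the acoustic model. A minimal sketch of step S4 under that reading is shown below; the function names, the `mode` strings, and the `strict` flag are hypothetical, and reliabilities are assumed to be single numbers in percent rather than ranges.

```python
def reference_value(model_rate, mode="identity", constant=0.0, func=None):
    """Derive the reference value from the acoustic model's recognition rate."""
    if mode == "identity":   # Example 1: the recognition rate itself
        return model_rate
    if mode == "add":        # Example 2: add a constant
        return model_rate + constant
    if mode == "multiply":   # Example 2: multiply by a constant
        return model_rate * constant
    if mode == "subtract":   # Example 3: subtract a constant
        return model_rate - constant
    if mode == "function":   # Example 4: apply a predetermined function
        return func(model_rate)
    raise ValueError(f"unknown mode: {mode}")


def select_utterances(reliabilities, reference, strict=False):
    """Return the indices of the utterance sequences selected for adaptation.

    strict=False keeps reliabilities equal to or higher than the reference
    value; strict=True keeps only those that exceed it.
    """
    if strict:
        return [i for i, r in enumerate(reliabilities) if r > reference]
    return [i for i, r in enumerate(reliabilities) if r >= reference]
```

For instance, `select_utterances([70, 85, 90], reference_value(80.0, "add", 5.0))` keeps the second and third utterance sequences, while passing `strict=True` keeps only the third.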

Next, the acoustic model adaptation unit 170 reads the selection information from the storage unit 117 of the memory 110 and uses it to identify the utterance sequences selected by the utterance selection unit 160. The acoustic model adaptation unit 170 then reads the identified utterance sequences from the storage unit 112 and the speech files corresponding to them from the storage unit 118. Using the features of the speech files and the utterance sequences, the acoustic model adaptation unit 170 adapts the acoustic model with an existing acoustic model adaptation method (step S5). Here, the utterance sequences serve as unsupervised correct text. The acoustic model adaptation method is not limited; for example, the Baum-Welch algorithm may be used, and adaptation accuracy can be improved by choosing the method best suited to the amount of data. The adapted acoustic model is stored in the storage unit 119 of the memory 110.

[Details of step S3]
Next, the details of step S3 described above will be explained.
First, the feature vector generation unit 151 of the reliability assigning unit 150 reads one utterance sequence from the storage unit 112 of the memory 110 and stores it in the temporary memory 180 (step S11). Next, the feature vector generation unit 151 reads that utterance sequence from the temporary memory 180 and reads the additional information associated with it from the storage unit 111. The feature vector generation unit 151 then uses the additional information to generate a feature vector for the utterance sequence and stores it in the storage unit 113 in association with the utterance sequence (step S12). The elements of the feature vector are those items of the additional information that help the feature vector evaluation unit 152 estimate the recognition rate. For example, all or some of the part-of-speech information, acoustic likelihood score, language likelihood score, word likelihood score, word duration, number of phonemes, and phoneme duration of the words in the utterance sequence are used as elements of the feature vector.

FIG. 5 is a conceptual diagram illustrating the structure of a feature vector 200 generated in this way.
The feature vector 200 in the example of FIG. 5 consists of part-of-speech information 210, an acoustic likelihood score 220, ..., and a phoneme duration 230. The part-of-speech information 210 is a feature that represents the multiple words contained in the utterance sequence as a single symbol. In the example of FIG. 5, it consists of m elements (each 0 or 1) corresponding to the parts of speech 211-1 to 211-m: an element is 1 if its part of speech is that of a word contained in the utterance sequence, and 0 otherwise. The acoustic likelihood score 220, ..., and phoneme duration 230 in the example of FIG. 5 consist of the values (S1 to S4, ..., S5 to S8) obtained by normalizing to the range 0 to 1 the statistics (in this example, the means 221 and 231, variances 222 and 232, maxima 223 and 233, and minima 224 and 234) of the acoustic likelihood scores, ..., and phoneme durations assigned to the words in the utterance sequence. For example, if the feature vector is made up of part-of-speech information covering 37 parts of speech (m = 37) and the per-utterance-sequence mean, variance, maximum, and minimum of each of the acoustic likelihood score, language likelihood score, word likelihood score, word duration, number of phonemes, and phoneme duration, the feature vector has 61 (= 37 + 6 × 4) dimensions. The feature vector may be any information obtained by converting word-level information into utterance-sequence-level information and is not limited to the structure illustrated in FIG. 5.
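The layout of the 61-dimensional vector can be made concrete with a small sketch. The code below builds the part-of-speech indicator block followed by four normalized statistics for each of the six per-word quantities. The text does not say how the normalization to the range 0 to 1 is done, so min-max scaling with externally supplied bounds (`ranges`) is an assumption, as are the use of the population variance and all names (`SCORE_KEYS`, `STAT_NAMES`, `utterance_features`, the dictionary keys).

```python
from statistics import mean, pvariance

# Hypothetical keys for the six per-word quantities named in the text.
SCORE_KEYS = ["acoustic", "language", "word",
              "word_duration", "num_phonemes", "phoneme_duration"]
STAT_NAMES = ["mean", "var", "max", "min"]


def normalize(x, lo, hi):
    """Min-max scale x into [0, 1], clipping values outside [lo, hi]."""
    if hi <= lo:
        return 0.0
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def utterance_features(words, pos_inventory, ranges):
    """Convert word-level information into one utterance-level vector.

    words:         list of dicts with a 'pos' entry and one entry per SCORE_KEYS
    pos_inventory: ordered list of the m parts of speech
    ranges:        {(key, stat): (lo, hi)} assumed normalization bounds
    """
    present = {w["pos"] for w in words}
    # Part-of-speech block: 1 if any word in the utterance has that POS.
    vec = [1.0 if p in present else 0.0 for p in pos_inventory]
    for key in SCORE_KEYS:
        values = [w[key] for w in words]
        stats = {"mean": mean(values), "var": pvariance(values),
                 "max": max(values), "min": min(values)}
        for name in STAT_NAMES:
            lo, hi = ranges[(key, name)]
            vec.append(normalize(stats[name], lo, hi))
    return vec
```

With a 37-entry `pos_inventory`, the returned list has 37 + 6 × 4 = 61 elements, matching the dimensionality given above.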

Next, the feature vector evaluation unit 152 reads the feature vector from the storage unit 113 of the memory 110 and the identification model from the storage unit 114. The feature vector evaluation unit 152 then performs a statistical evaluation using the feature vector and the identification model, and calculates the reliability (the estimate of the recognition rate) of the utterance sequence corresponding to the feature vector. The calculated reliability is stored in the storage unit 115 of the memory 110 in association with the corresponding utterance sequence (step S13). The details of step S13 are illustrated below.
[Details of step S13]
First, the identification model is described. The identification model of this embodiment is a model for obtaining the reliability of an utterance sequence from its feature vector. That is, by substituting the elements of the feature vector into the identification model, information for determining the reliability of the corresponding utterance sequence can be calculated. Such an identification model is generated using training data (consisting of feature vectors and information for determining the reliability of the utterance sequences). That is, the identification model is constructed by setting its model parameters through learning so that as many of the cases in the training data as possible hold. Examples of such identification models include those based on machine learning such as SVMs (support vector machines) and boosting, those based on probabilistic models such as maximum likelihood estimation and the maximum entropy method, and those based on neural networks.

In general, when the feature vector has a very large number of dimensions, training a statistical identification model requires a large amount of training data, and overfitting often occurs when the training data is scarce. In contrast, an SVM applies the margin maximization criterion and automatically builds its decision surface from only the small number of training samples near that surface, so it achieves relatively good classification performance even with little training data. For this reason, an SVM is well suited to the present invention.
The SVM-based identification model performs two-class pattern recognition: whether or not the recognition rate for an input feature vector is at or above a threshold (n%). Such a model is generated by preparing training data (feature vectors whose class membership is known) in advance and learning from them the probabilistic correspondence between feature vectors and classes. An SVM-based identification model can estimate only whether or not the recognition rate for a feature vector is at or above the threshold (n%). Such models are therefore created at the required density over the range 0 ≤ n ≤ 100. For example, when the range to which the estimated recognition rate belongs is needed at 10% granularity (for example, that the estimated recognition rate is 70 to 80%), 11 identification models (n = 0, 10, ..., 100) must be created. On the other hand, when it suffices to know only whether the estimated recognition rate is n% or higher (for example, whether it is 70% or higher), only one identification model (n = 70) needs to be created (end of the description of [Details of processing in step S13]).
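As a rough illustration of how a bank of threshold models could yield a range estimate, the sketch below treats each identification model as an opaque callable that answers "is the recognition rate n% or higher?". The function names and the rule of taking the highest threshold that passes are assumptions for illustration; the text does not say how the outputs of the individual models are combined.

```python
def thresholds(step):
    """Thresholds n (in percent) from 0 to 100 inclusive at the given spacing."""
    return list(range(0, 101, step))


def estimate_range(feature_vector, classifiers, step):
    """Estimate the band of the recognition rate from two-class models.

    classifiers: dict mapping threshold n -> callable(feature_vector) -> bool,
                 where True means "recognition rate estimated to be n% or higher".
                 Each callable stands in for one trained identification model.
    Returns (low, high) in percent. Assumed combination rule: the highest
    threshold that passes is the lower bound of the band.
    """
    passed = [n for n in sorted(classifiers) if classifiers[n](feature_vector)]
    low = max(passed) if passed else 0
    return low, min(low + step, 100)
```

With a 10% spacing, `thresholds(10)` contains the 11 values 0, 10, ..., 100, matching the count given in the text.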

Next, the control unit 190 refers to the utterance sequences and reliabilities stored in the storage units 112 and 115 of the memory 110, and determines whether the reliability has been calculated for all utterance sequences (step S14). If the reliability has not yet been calculated for all utterance sequences, the control unit 190 returns the processing to step S11. If the reliability has been calculated for all utterance sequences, the control unit 190 ends the processing of step S3 (end of the description of [Details of processing in step S3]).
[Second Embodiment]
Next, a second embodiment of the present invention will be described.

The second embodiment is a modification of the first embodiment in which acoustic model adaptation uses supervised correct text for utterance sequences with low reliability. The following description focuses on the differences from the first embodiment and omits matters common to the first embodiment.
<Configuration>
FIG. 6 is an example of a block diagram of an acoustic model adaptation apparatus 301 configured by loading a predetermined program into a known computer, as in the first embodiment. The arrows in FIG. 6 indicate the flow of data, but the flow of data into and out of the control unit 190 is omitted. Parts of FIG. 6 that are common to FIG. 2 are given the same reference numerals as in FIG. 2, and their description is simplified.

As shown in FIG. 6, the acoustic model adaptation apparatus 301 of this embodiment includes a memory 110, a speech recognition result input unit 130, an information conversion unit 140, a reliability assignment unit 150, an utterance selection unit 160, an acoustic model adaptation unit 170, a temporary memory 180, a control unit 190, a correct text selection unit 330, a correct text output unit 340, and an adaptive data input unit 350. Here, the memory 110 includes storage units 311 and 312 in addition to the storage units 111 to 119 that store various data. The correct text selection unit 330 is configured by loading a program into the CPU 11 of FIG. 1. The correct text output unit 340 and the adaptive data input unit 350 are, for example, configured by loading a program into the CPU 11 of FIG. 1, or are the output unit 13 and the input unit 12 operating under the control of the CPU 11 into which the program has been loaded. The acoustic model adaptation apparatus 301 executes each process under the control of the control unit 190. Unless otherwise specified, the data of each process is read from and written to the temporary memory 180 as needed.

<Processing>
Next, the processing of the acoustic model adaptation apparatus 301 of this embodiment will be described.
FIG. 7 is a flowchart for explaining the processing of the acoustic model adaptation apparatus 301 in the second embodiment. The processing of this embodiment is described below with reference to this figure.
[Preprocessing]
As preprocessing, an identification model is stored in the storage unit 114 of the memory 110, audio files in the storage unit 118, an acoustic model in the storage unit 119, and the recognition rate of that acoustic model in the storage unit 116. In addition, a supervised correct text file, which is the set of supervised correct texts corresponding to the audio files stored in the storage unit 118, is stored in the storage unit 311.

[Acoustic model adaptation processing]
The acoustic model adaptation processing is executed on the premise of the preprocessing described above.
Steps S21 to S24 are the same as steps S1 to S4 of the first embodiment. That is, first, a speech recognition result is input to the speech recognition result input unit 130, associated with each corresponding audio file, and stored in the storage unit 111 of the memory 110 (step S21). Next, the information conversion unit 140 divides the word sequence of the speech recognition result into utterance sequences based on a fixed criterion, associates each resulting word sequence with the corresponding audio file and with the additional information of the speech recognition result, and stores it in the storage unit 112 of the memory 110 (step S22). The reliability assignment unit 150 then uses the speech recognition result to calculate, for each utterance sequence, a reliability that is an estimated value of the recognition rate, and stores each calculated reliability in the storage unit 115 of the memory 110 in association with the corresponding utterance sequence (step S23). Next, the utterance selection unit 160 selects the utterance sequences to be used for adaptation of the acoustic model using the reliability of each utterance sequence and the recognition rate of the acoustic model, and stores selection information indicating the selection in the storage unit 117 (step S24).
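The selection in step S24 can be sketched as follows, using the rule stated in the claims of this document: a reference value set at or above the recognition rate of the acoustic model, and selection of the utterance sequences whose reliability is at or above that reference value. The function name, the `margin` parameter, and the dictionary layout are hypothetical.

```python
def select_utterances(reliabilities, model_recognition_rate, margin=0.0):
    """Pick utterance sequences to use as unsupervised adaptation data.

    reliabilities: dict mapping utterance id -> estimated recognition rate (percent).
    model_recognition_rate: recognition rate of the current acoustic model (percent).
    margin: hypothetical non-negative offset, so that the reference value is
            never below the recognition rate of the acoustic model.
    """
    if margin < 0:
        raise ValueError("margin must be non-negative")
    reference = model_recognition_rate + margin
    # Keep the utterance sequences whose reliability is at or above the reference.
    return [uid for uid, r in reliabilities.items() if r >= reference]
```

The intuition is that adapting on utterances the recognizer handles no better than its average would tend to reinforce existing errors, which is why the reference value is tied to the recognition rate of the acoustic model.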

Next, the correct text selection unit 330 reads the selection information from the storage unit 117 of the memory 110 and reads the reliabilities from the storage unit 115. The correct text selection unit 330 then selects, from the supervised correct text file in the storage unit 311, the supervised correct texts corresponding to utterance sequences that the utterance selection unit 160 did not select and whose reliability is good enough to satisfy a predetermined criterion (step S25). Such utterance sequences are selected, for example, as follows.
[Example 1]
The utterance sequences not selected by the utterance selection unit 160 are sorted in descending order of reliability, and a predetermined number of utterance sequences are selected starting from the highest reliability.

[Example 2]
A value smaller than the reference value used by the utterance selection unit 160 is taken as a threshold, and the utterance sequences whose reliability is greater than this threshold are selected.
The selected supervised correct texts are output from the correct text output unit 340 and stored in the storage unit 312 of the memory 110. Next, the supervised correct texts stored in the storage unit 312 are input to the adaptive data input unit 350 and sent to the acoustic model adaptation unit 170. The acoustic model adaptation unit 170 reads the audio files corresponding to the received supervised correct texts from the storage unit 118.
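The two selection rules of Example 1 and Example 2 can be sketched as below. The function names and the dictionary layout are hypothetical; both functions return only the ids of the utterance sequences for which supervised correct text would be looked up.

```python
def select_top_k(reliabilities, selected, k):
    """Example 1: among the unselected utterances, take the k most reliable.

    reliabilities: dict mapping utterance id -> reliability (percent).
    selected: ids already chosen by the utterance selection step.
    """
    rest = {u: r for u, r in reliabilities.items() if u not in selected}
    return sorted(rest, key=rest.get, reverse=True)[:k]


def select_above_threshold(reliabilities, selected, threshold):
    """Example 2: among the unselected utterances, take those above a threshold.

    threshold is assumed to be smaller than the reference value used in
    the utterance selection step.
    """
    return [u for u, r in reliabilities.items()
            if u not in selected and r > threshold]
```

Example 1 bounds the amount of supervised text that has to be prepared, while Example 2 bounds the quality of the utterances that receive it; which one fits better depends on how costly transcription is.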

Furthermore, the acoustic model adaptation unit 170 reads the selection information from the storage unit 117 of the memory 110, uses it to identify the utterance sequences selected by the utterance selection unit 160, reads the identified utterance sequences from the storage unit 112, and reads the audio files corresponding to those utterance sequences from the storage unit 118. The acoustic model adaptation unit 170 then adapts the acoustic model using the feature amounts of the audio files it has read, the utterance sequences, and the supervised correct texts (that is, using the utterance sequences selected by the utterance selection unit 160 and the feature amounts corresponding to them, together with the supervised correct texts input to the adaptive data input unit 350 and the feature amounts corresponding to them) (step S26). The acoustic model adapted in this way is stored in the storage unit 119 of the memory 110.
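The way step S26 combines the two kinds of adaptation data can be sketched as follows. This only assembles the data; the adaptation algorithm itself is not specified in this passage and is left out. All names and the tuple layout are assumptions for illustration.

```python
def build_adaptation_set(selected_ids, hypotheses, supervised_texts, features):
    """Assemble (feature amount, transcript, source) triples for adaptation.

    selected_ids:     ids chosen by the utterance selection step.
    hypotheses:       dict id -> recognized utterance sequence (unsupervised label).
    supervised_texts: dict id -> supervised correct text.
    features:         dict id -> feature amount extracted from the audio file.
    """
    data = []
    # Unsupervised part: recognition results that were judged reliable.
    for uid in selected_ids:
        data.append((features[uid], hypotheses[uid], "unsupervised"))
    # Supervised part: correct texts supplied through the adaptive data input.
    for uid, text in supervised_texts.items():
        data.append((features[uid], text, "supervised"))
    return data
```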

[Modifications, etc.]
The present invention is not limited to the embodiments described above. For example, in the embodiments described above the acoustic model adaptation apparatus is configured by loading a program into a single computer, but the functions of the acoustic model adaptation apparatus may instead be distributed across a plurality of computers or CPUs. For example, the correct text selection unit 330 of the second embodiment may be realized by a separate computer (a separate apparatus), or a plurality of correct text selection units 330, each configured on a different computer, may be used. In that case, the supervised correct text selected by the separate apparatus is input from the adaptive data input unit 350 (here corresponding to the input unit 12 operating under the control of the CPU 11 into which the program has been loaded).

In each of the embodiments described above, audio files are stored in the storage unit 118, and the acoustic model adaptation unit 170 extracts feature amounts from the audio files and adapts the acoustic model. However, the feature amounts themselves may be stored in the storage unit 118, and the acoustic model adaptation unit 170 may directly use the feature amounts read from the storage unit 118.
In the second embodiment described above, the correct text selection unit 330 selects the supervised correct texts corresponding to utterance sequences that the utterance selection unit 160 did not select and whose reliability is good enough to satisfy a predetermined criterion. However, the correct text selection unit 330 may instead arbitrarily select supervised correct texts corresponding to at least some of the utterance sequences that the utterance selection unit 160 did not select. Furthermore, the correct text selection unit 330 may arbitrarily select supervised correct texts regardless of what the utterance selection unit 160 selected.

In addition, speech recognition may be performed using an acoustic model adapted as in the embodiments described above, and the resulting speech recognition result may again be supplied to the speech recognition result input unit 130 so that the same processing is repeated. This enables highly accurate model adaptation.
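The repeated procedure, in which recognition results from an adapted acoustic model are fed back in as new input, can be sketched as a simple loop. The callables `recognize` and `run_adaptation` and the fixed number of rounds are assumptions; the text does not state a stopping condition.

```python
def iterate_adaptation(model, audio, recognize, run_adaptation, rounds=3):
    """Alternate recognition and adaptation for a fixed number of rounds.

    recognize(model, audio)         -> speech recognition results (hypothetical).
    run_adaptation(model, results)  -> adapted model, standing in for the
                                       reliability, selection, and adaptation steps.
    """
    for _ in range(rounds):
        results = recognize(model, audio)       # recognize with the current model
        model = run_adaptation(model, results)  # adapt using those results
    return model
```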
The various processes described above need not be executed in the chronological order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as otherwise required. Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.
When the configuration described above is realized by a computer, the processing contents of the functions that each apparatus should have are described by a program. Executing this program on the computer realizes the processing functions on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.
As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or the computer may execute processing according to the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct instruction to the computer but has the property of defining the processing of the computer).

In each embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.

One example of an industrial field of application of the present invention is a spoken dialogue system in which a computer and a person communicate through spoken dialogue. In such a system, the computer collects, selects, and learns from speech while conversing with the person, and adapts autonomously in a sequential manner. With the present invention, a highly accurate acoustic model can be built easily and efficiently from a small amount of adaptation data, that is, in a short adaptation time, so a highly accurate spoken dialogue system can be configured easily.

FIG. 1 is a block diagram illustrating the hardware configuration of the acoustic model adaptation apparatus according to the first embodiment.
FIG. 2 is an example of a block diagram of the acoustic model adaptation apparatus according to the first embodiment.
FIG. 3 is a flowchart for explaining the processing of the acoustic model adaptation apparatus according to the first embodiment.
FIG. 4 is a flowchart for explaining the details of the processing in step S3 in FIG. 3.
FIG. 5 is a conceptual diagram illustrating the configuration of the feature vector.
FIG. 6 is an example of a block diagram of the acoustic model adaptation apparatus according to the second embodiment.
FIG. 7 is a flowchart for explaining the processing of the acoustic model adaptation apparatus according to the second embodiment.

Explanation of symbols

1, 301: Acoustic model adaptation device

Claims

An acoustic model adaptation device for adapting an acoustic model, comprising:
a recognition rate storage unit that stores the recognition rate of the acoustic model;
a speech recognition result input unit to which a speech recognition result obtained using the acoustic model is input;
a reliability assignment unit that uses the speech recognition result to calculate, for each utterance sequence obtained by dividing the word sequence of the speech recognition result, a reliability that is an estimated value of the recognition rate;
an utterance selection unit that selects an utterance sequence to be used for adaptation of the acoustic model, using the recognition rate of the acoustic model and the reliability of each utterance sequence; and
an acoustic model adaptation unit that adapts the acoustic model using the utterance sequence selected by the utterance selection unit and a feature amount corresponding to that utterance sequence.

The acoustic model adaptation device according to claim 1, wherein
the utterance selection unit
compares a reference value, set to a value equal to or higher than the recognition rate of the acoustic model, with the reliability of each utterance sequence, and selects the utterance sequences whose reliability is equal to or higher than the reference value, or the utterance sequences whose reliability exceeds the reference value.

The acoustic model adaptation device according to claim 1, further comprising
an adaptive data input unit to which a supervised correct text is input, wherein
the acoustic model adaptation unit
adapts the acoustic model using the utterance sequence selected by the utterance selection unit and the feature amount corresponding to that utterance sequence, together with the supervised correct text input to the adaptive data input unit and the feature amount corresponding to that supervised correct text.

The acoustic model adaptation device according to claim 3, further comprising:
a correct text selection unit that selects a supervised correct text corresponding to at least part of the utterance sequences not selected by the utterance selection unit; and
a correct text output unit that outputs the supervised correct text.

The acoustic model adaptation device according to claim 4, wherein
the correct text selection unit
selects the supervised correct text corresponding to an utterance sequence that was not selected by the utterance selection unit and whose reliability is good enough to satisfy a predetermined criterion.

The acoustic model adaptation device according to claim 4, wherein
the supervised correct text input to the adaptive data input unit
is the supervised correct text output from the correct text output unit.

An acoustic model adaptation method for adapting an acoustic model, comprising:
a step in which a speech recognition result obtained using the acoustic model is input to a speech recognition result input unit;
a step in which a reliability assignment unit uses the speech recognition result to calculate, for each utterance sequence obtained by dividing the word sequence of the speech recognition result, a reliability that is an estimated value of the recognition rate;
a step in which an utterance selection unit selects an utterance sequence to be used for adaptation of the acoustic model, using the recognition rate of the acoustic model and the reliability of each utterance sequence; and
a step in which an acoustic model adaptation unit adapts the acoustic model using the utterance sequence selected by the utterance selection unit and the feature amount corresponding to that utterance sequence.

An acoustic model adaptation program for causing a computer to function as the acoustic model adaptation device according to any one of claims 1 to 6.

A computer-readable recording medium storing the acoustic model adaptation program according to claim 8.