JP4408665B2

JP4408665B2 - Utterance data collection device for speech recognition, utterance data collection method for speech recognition, and computer program

Info

Publication number: JP4408665B2
Application number: JP2003291441A
Authority: JP
Inventors: 信之鷲尾; 拓郎池田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2003-08-11
Filing date: 2003-08-11
Publication date: 2010-02-03
Anticipated expiration: 2023-08-11
Also published as: JP2005062398A

Abstract

PROBLEM TO BE SOLVED: To provide a device and a method for collecting utterance data for speech recognition, and a computer program, that can collect utterance data efficiently while keeping speech recognition accuracy high.

SOLUTION: The device comprises: a voice dialogue device including means for storing dialogue scenario information that describes the procedure by which a dialogue progresses, means for accepting an input utterance, means for performing speech recognition on the input utterance, means for advancing the dialogue according to the speech recognition result and the dialogue scenario information, and means for outputting a response to the utterance; means for storing a state transition history of the dialogue based on the dialogue scenario information; means for judging, from the speech recognition result and the state transition history, whether the input utterance was correctly recognized; and means for storing the speech recognition result and the input utterance in association with each other when the utterance is judged to have been correctly recognized.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to an utterance data collection device for speech recognition, an utterance data collection method for speech recognition, and a computer program, each of which collects and stores utterance data used for speech recognition.

In recent years, voice dialogue systems (IVR: Interactive Voice Response), such as voice portals that use a speech recognition system (ASR: Auto Speech Recognition), have begun to spread. The usability of such a voice dialogue system depends heavily on the recognition performance of the speech recognition system that recognizes the utterances. That performance is in turn strongly affected by the accuracy of the acoustic model, which is modeled by an HMM (Hidden Markov Model) or the like.

In general, developing an accurate acoustic model requires a large amount of utterance data whose utterance content is known when the model is trained. When a statistical model such as an HMM is used, the amount of utterance data is directly tied to the performance of the acoustic model. Collecting large numbers of utterances annotated with their content, easily and in bulk, is therefore one of the important problems in developing a speech recognition system. Here, "utterance data" means the whole of the data used for speech recognition, including the utterance made by the speaker and the associated information about its content.

It is also desirable that the environment in which the utterances used for training the acoustic model are input be the same as the environment in which speakers use the speech recognition system. For example, when training utterance data is obtained by preparing a script and recording a speaker reading it aloud, the utterance data has a read-aloud tone. In actual use of a speech recognition system, by contrast, colloquial utterances close to free speech are more common than when reading a script. As a result, the divergence in speech features and the like between the training utterance data and actual utterances grows, and recognition accuracy deteriorates.

Moreover, collecting utterance data by the above method requires dedicated staff to operate the equipment that records the script being read, which makes it difficult to collect utterance data in large quantities from the standpoint of collection cost and efficiency. Furthermore, utterance data read from a script includes misreadings, and such data, which is unsuitable for training an acoustic model, has to be excluded.

To address these problems, a method is used that generates utterance data by associating the recognition result of a speech recognition system with the utterance. For example, associating the recognition result with the utterance as its identification information makes the content of the utterance clear. However, this does not remove the need to exclude misread data.

Therefore, as disclosed in Patent Document 1, the recognition result of the speech recognition system is associated with the utterance to generate and store utterance data, and, as disclosed in Patent Document 2, a means is provided for excluding misread data that the speaker has noticed. FIG. 10 is a functional block diagram of an utterance data collection device for speech recognition that generates and stores utterance data by associating the recognition result of a speech recognition device with the utterance.

As shown in FIG. 10, the speaker's voice is input from the utterance input unit 1 and sent to the speech recognition unit 2. Each time the speaker speaks, the speech recognition unit 2 outputs a speech recognition result corresponding to the input voice, and the recognition result and the utterance are stored in the recognition result storage unit 3 and the utterance storage unit 4, respectively.

When the speaker judges that the speech recognition result is incorrect and a recognition result cancellation request signal is input from the utterance cancellation input unit 7, the utterance data generation unit 5 deletes the corresponding stored speech recognition result and input utterance from the recognition result storage unit 3 and the utterance storage unit 4, respectively. When no recognition result cancellation request signal is input from the utterance cancellation input unit 7, the utterance data generation unit 5 associates the utterance and the speech recognition result stored in the utterance storage unit 4 and the recognition result storage unit 3 as a pair of data and stores the pair in the utterance data storage unit 6. In this way, the utterance data stored in the utterance data storage unit 6 contains nothing that was misrecognized by the speech recognition unit 2, so an acoustic model with high recognition performance can be constructed.

JP 2003-150185 A
JP 2002-189496 A
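The prior-art flow of FIG. 10 can be illustrated with a minimal Python sketch. The class and method names are hypothetical and chosen only for illustration; the list stands in for the utterance data storage unit 6, and the `cancel_requested` flag stands in for the signal from the utterance cancellation input unit 7.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PriorArtCollector:
    """Hypothetical stand-in for the FIG. 10 device (not the patent's code)."""

    # Plays the role of the utterance data storage unit 6.
    utterance_data: List[Tuple[bytes, str]] = field(default_factory=list)

    def process(self, utterance: bytes, recognized_text: str,
                cancel_requested: bool) -> bool:
        """Store the (utterance, result) pair unless the speaker cancelled it."""
        if cancel_requested:
            # The speaker judged the result wrong: discard both items.
            return False
        self.utterance_data.append((utterance, recognized_text))
        return True
```

Note that the filter depends entirely on the speaker noticing and cancelling a wrong result, which is the limitation the following paragraphs address.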

However, in the utterance data collection device shown in FIG. 10, the input utterances are still utterances read from a script. It is therefore practically difficult to match the environment in which training utterances are input with the environment in which users use a speech recognition device whose acoustic model was trained on that utterance data.

In addition, many speakers must be gathered in order to collect the utterances used for training the acoustic model. Paying the speakers who cooperate makes it relatively easy to gather them, but it increases the financial burden of training the acoustic model.

Furthermore, the recognition performance of the acoustic model is also affected by the gender ratio, age distribution, and other characteristics of the speakers gathered to build it. Ideally, these should be close to those of the speakers who actually use the speech recognition system.

The present invention has been made in view of these circumstances. Its object is to provide an utterance data collection device for speech recognition, an utterance data collection method for speech recognition, and a computer program that can collect utterance data efficiently while keeping speech recognition accuracy high.

To achieve the above object, an utterance data collection device for speech recognition according to a first invention comprises: a voice dialogue device including means for storing dialogue scenario information that includes a description for executing, when an utterance that follows the dialogue procedure is input, the processing to be performed next according to the content of the utterance, utterance data accumulation means in which speech recognition results and input utterances are stored in association with each other as utterance data for speech recognition, means for accepting an input utterance, means for performing speech recognition on the input utterance using the utterance data accumulated in the utterance data accumulation means, means for advancing the dialogue based on the speech recognition result and the dialogue scenario information, detection means for detecting execution of the processing, and means for outputting a response to the input utterance; means for storing a state transition history of the dialogue based on the dialogue scenario information; determination means for determining, when the detection means detects execution of the processing, that the input utterance is an utterance corresponding to the dialogue procedure described in the dialogue scenario information, and for determining, when the detection means does not detect execution of the processing, whether the input utterance was correctly recognized based on the speech recognition result and the state transition history; and means for causing the utterance data accumulation means to accumulate the speech recognition result and the input utterance in association with each other when the determination means determines that the utterance was correctly recognized.

Further, an utterance data collection device for speech recognition according to the present invention comprises: a voice dialogue device including means for storing dialogue scenario information describing the dialogue procedure, means for accepting an input utterance, means for performing speech recognition on the input utterance, means for advancing the dialogue based on the speech recognition result and the dialogue scenario information, and means for outputting a response to the input utterance; means for storing a state transition history of the dialogue based on the dialogue scenario information; means for determining, based on the speech recognition result and the state transition history, whether the input utterance was correctly recognized; means for storing the speech recognition result and the input utterance in association with each other when that means determines that the utterance was correctly recognized; means for receiving and storing a calling number on a telephone line; and means for storing the calling number in association with the speech recognition result and the input utterance.

The utterance data collection device according to the first invention evaluates, within the dialogue conducted by the voice dialogue device, whether each utterance was correctly recognized. When it was, the device associates the speech recognition result with the utterance and stores the pair as utterance data.

An utterance data collection device for speech recognition according to a second invention is characterized in that, in the first invention, the utterance stored in association with the speech recognition result is either speech waveform data or utterance features obtained by acoustic analysis of the utterance.

The utterance data collection device according to the second invention associates the speech waveform data, or the utterance features obtained by acoustic analysis of the utterance, with the speech recognition result and accumulates them as utterance data.

An utterance data collection device for speech recognition according to a third invention is characterized in that, in the first or second invention, it further comprises means for receiving and storing a calling number on a telephone line, and means for storing the calling number in association with the speech recognition result and the input utterance.

The utterance data collection device according to the third invention can store the calling number unique to each telephone line in association with the utterance data, and can thereby generate an acoustic model adapted to the noise, filtering, modulation, and the like specific to each telephone line.
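As a rough illustration of why the calling number is stored with each record, the following Python sketch groups collected utterance data by calling number so that line-specific data could later be used separately. The record layout and the function name `group_by_caller` are assumptions made for this example, not part of the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (calling_number, utterance, recognized_text)
Record = Tuple[str, bytes, str]


def group_by_caller(records: Iterable[Record]) -> Dict[str, List[Tuple[bytes, str]]]:
    """Group (utterance, recognition result) pairs by calling number."""
    groups: Dict[str, List[Tuple[bytes, str]]] = defaultdict(list)
    for calling_number, utterance, text in records:
        groups[calling_number].append((utterance, text))
    return dict(groups)
```

Each group could then feed training of a model tuned to that line's characteristics; how such training is done is outside the scope of this sketch.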

An utterance data collection method for speech recognition according to a fourth invention uses a computer and a voice dialogue method that includes: a step of storing dialogue scenario information that includes a description for executing, when an utterance that follows the dialogue procedure is input, the processing to be performed next according to the content of the utterance; an utterance data accumulation step in which speech recognition results and input utterances are stored in association with each other as utterance data for speech recognition; a step of accepting an input utterance; a step of performing speech recognition on the input utterance using the utterance data accumulated in the utterance data accumulation step; a step of advancing the dialogue based on the speech recognition result and the dialogue scenario information; a detection step of detecting execution of the processing; and a step of outputting a response to the input utterance. The method executes: a step of storing a state transition history of the dialogue based on the dialogue scenario information; a determination step of determining, when the detection step detects execution of the processing, that the input utterance is an utterance corresponding to the dialogue procedure described in the dialogue scenario information, and of determining, when the detection step does not detect execution of the processing, whether the input utterance was correctly recognized based on the speech recognition result and the state transition history; and a step of accumulating, when the determination step determines that the utterance was correctly recognized, the speech recognition result and the input utterance in association with each other as utterance data.

The utterance data collection method according to the fourth invention evaluates, within the dialogue conducted by the voice dialogue device, whether each utterance was correctly recognized. When it was, the method associates the speech recognition result with the utterance and stores the pair as utterance data.

A computer program according to a fifth invention causes a computer to function as: a voice dialogue device including means for storing dialogue scenario information that includes a description for executing, when an utterance that follows the dialogue procedure is input, the processing to be performed next according to the content of the utterance, utterance data accumulation means in which speech recognition results and input utterances are stored in association with each other as utterance data for speech recognition, means for accepting an input utterance, means for performing speech recognition on the input utterance using the utterance data accumulated in the utterance data accumulation means, means for advancing the dialogue based on the speech recognition result and the dialogue scenario information, detection means for detecting execution of the processing, and means for outputting a response to the input utterance; means for storing a state transition history of the dialogue based on the dialogue scenario information; determination means for determining, when the detection means detects execution of the processing, that the input utterance is an utterance corresponding to the dialogue procedure described in the dialogue scenario information, and for determining, when the detection means does not detect execution of the processing, whether the input utterance was correctly recognized based on the speech recognition result and the state transition history; and means for causing the utterance data accumulation means to accumulate the speech recognition result and the input utterance in association with each other when the determination means determines that the utterance was correctly recognized.

When the computer program according to the fifth invention is installed on a computer, the computer evaluates, within the dialogue conducted by the voice dialogue device, whether each utterance was correctly recognized. When it was, the computer associates the speech recognition result with the utterance and stores the pair as utterance data.

According to the utterance data collection device of the first invention, utterances associated with correct speech recognition results can be collected based on whether the voice dialogue succeeded, so utterance data for generating an acoustic model with high recognition accuracy can be collected efficiently.

According to the utterance data collection device of the second invention, utterances associated with correct speech recognition results can likewise be collected based on whether the voice dialogue succeeded, so utterance data for generating an acoustic model with high recognition accuracy can be collected efficiently.

According to the utterance data collection device of the third invention, for user utterances made over a telephone line, it becomes easy to match the environment in which training utterances are input with the environment in which speakers use a speech recognition device whose acoustic model was trained on that utterance data, so utterance data for generating an acoustic model with high recognition accuracy can be collected efficiently.

According to the utterance data collection method of the fourth invention, utterances associated with correct speech recognition results can be collected based on whether the voice dialogue succeeded, so utterance data for generating an acoustic model with high recognition accuracy can be collected efficiently.

According to the computer program of the fifth invention, utterances associated with correct speech recognition results can be collected based on whether the voice dialogue succeeded, so utterance data for generating an acoustic model with high recognition accuracy can be collected efficiently.

The present invention is described in detail below with reference to the drawings that show its embodiments.

(Embodiment 1)
The utterance data collection device for speech recognition according to Embodiment 1 of the present invention is described in detail below with reference to the drawings. Embodiment 1 describes the case where the device is implemented on a single computer. The utterance data and other data used for speech recognition may of course be stored in a storage device of another computer connected via communication means, or on a portable recording medium such as a DVD. The communication means is likewise not particularly limited.

FIG. 1 is a schematic configuration diagram of a computer that implements the utterance data collection device for speech recognition according to Embodiment 1 of the present invention. As shown in FIG. 1, the computer comprises at least a CPU (central processing unit) 11, storage means 12, a RAM (memory) 13, communication means 14 for connecting to external communication means, input means 15 such as a mouse and keyboard, output means 16 such as a monitor, and auxiliary storage means 17.

The auxiliary storage means 17 is a portable recording medium 18, such as a DVD or CD-ROM, on which the program used by the computer that implements the utterance data collection device is recorded. It also includes a portable recording medium 18 or the like on which data used by the device, such as the utterance data used for speech recognition, is recorded.

The computer that implements the utterance data collection device according to Embodiment 1 incorporates a voice dialogue device 20. The voice dialogue device 20 also operates using the computer's CPU (central processing unit) 11, storage means 12, RAM (memory) 13, communication means 14 for connecting to external communication means, input means 15 such as a mouse and keyboard, output means 16 such as a monitor, and auxiliary storage means 17.

First, to prompt the speaker to speak, the computer outputs voice from the output means 16 on a command from the CPU 11, following the dialogue scenario information stored in the storage means 12. For example, it outputs a spoken question that restricts what the speaker can say next, such as "Which of the following is your request: ○○, ××, ...?" The output from the output means 16 is not limited to voice output and may be a display on a screen.

The dialogue scenario information is written in a scenario description language such as VoiceXML so that utterances in the dialogue can be accepted. That is, the dialogue scenario information describes the content of the output from the computer, the transitions of the dialogue according to the utterance, the processing to be performed next according to the content of the utterance, and so on.
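To make the three kinds of information in a dialogue scenario concrete (system output, transitions by utterance, and the processing to run next), here is a minimal sketch that uses a plain Python data structure rather than VoiceXML. All state names, prompts, and action names are hypothetical examples invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ScenarioState:
    prompt: str                                                # output from the computer
    transitions: Dict[str, str] = field(default_factory=dict)  # recognized utterance -> next state
    action: Optional[str] = None                               # processing to perform next (hypothetical name)


# Hypothetical scenario: a menu question that limits the next utterance.
SCENARIO: Dict[str, ScenarioState] = {
    "ask_request": ScenarioState(
        prompt="Which would you like: balance or transfer?",
        transitions={"balance": "do_balance", "transfer": "do_transfer"},
    ),
    "do_balance": ScenarioState(prompt="Reading your balance.", action="read_balance"),
    "do_transfer": ScenarioState(prompt="Starting a transfer.", action="start_transfer"),
}


def next_state(current: str, recognized: str) -> str:
    """Follow the scenario; stay in the current state if the utterance is not accepted."""
    return SCENARIO[current].transitions.get(recognized, current)
```

In this sketch an unaccepted utterance simply leaves the dialogue in the same state; a real scenario could instead re-prompt or jump elsewhere.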

When an utterance is input from the input means 15 in response to the output voice, the utterance is stored in the storage means 12 and the RAM 13 as speech waveform data or as data representing utterance features obtained by acoustic analysis of the utterance. On a command from the CPU 11, speech recognition is performed on the utterance stored in the RAM 13. The speech recognition engine used is not particularly limited; any commonly used engine may be used. The speech recognition result is stored in the storage means 12 and the RAM 13.

The storage means 12 is not limited to a built-in hard disk. Any recording medium capable of storing a large volume of data may be used, such as a hard disk built into another computer connected via the communication means 14.

The CPU 11 determines whether the utterance was correctly recognized based on the speech recognition result stored in the RAM 13. Various methods can be used for this determination. Specific examples are described below.

One method stores the state transition history of the dialogue, based on the dialogue scenario information, in the storage means 12 or the RAM 13, and determines whether the input utterance was correctly recognized from the stored speech recognition results and the state transition history. FIG. 2 shows a state transition diagram for a dialogue scenario that confirms a name. As shown in FIG. 2, the scenario starts in state 1, the system utterance "Your name, please" is output, and the dialogue transitions to state 2.

In state 2, speech recognition is performed on the input utterance and the result is stored in the RAM 13. If the stored recognition result is "○○", the scenario outputs the system utterance "You are ○○, is that right?" and the dialogue transitions to state 3.

In state 3, speech recognition is performed on the input utterance and the result is stored in the RAM 13. Because the recognition result in state 3 can be assumed to be one of two choices, "yes" or "no", its reliability is high. If the stored recognition result is "yes", the dialogue transitions to state 4 and the scenario ends, and the recognition result from state 2 can be judged correct.
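The name-confirmation logic of FIG. 2 can be sketched in Python as follows. This is a simplified illustration, not the patent's implementation: the history format, the function name `collect_confirmed`, the English "yes" label, and the example names are all assumptions. A state-2 result is kept only when the following state-3 result is "yes".

```python
from typing import List, Tuple

# Hypothetical history entry: (state, recognized_text, utterance)
Entry = Tuple[int, str, bytes]


def collect_confirmed(history: List[Entry]) -> List[Tuple[bytes, str]]:
    """Return (utterance, result) pairs from state 2 that state 3 confirmed with "yes"."""
    confirmed: List[Tuple[bytes, str]] = []
    pending = None  # the most recent state-2 pair awaiting confirmation
    for state, text, utterance in history:
        if state == 2:
            pending = (utterance, text)
        elif state == 3:
            if text == "yes" and pending is not None:
                confirmed.append(pending)
            pending = None  # a "no" (or anything else) discards the pair
    return confirmed
```

For example, if the speaker rejects a first recognized name and confirms a second, only the second pair is collected as utterance data.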

Whether the state transitions contain feedback can also be used as the determination method. FIG. 3 shows a state transition diagram for a dialogue scenario for purchasing a ticket. As shown in FIG. 3, the dialogue scenario starts in state 1, the system utterance "Please state your destination station" is output, and the dialogue transitions to state 2.

In state 2, the input utterance is speech-recognized, the speech recognition result is stored in the RAM 13, and the dialogue transitions to state 1a. When the stored result is "XX Station", the system utterances "XX Station, correct?" and "Adult or child?" are output in this dialogue scenario, and the dialogue transitions to state 2a.

In state 2a, the input utterance is speech-recognized and the speech recognition result is stored in the RAM 13. When the result is "△△", which is neither "adult" nor "child", the dialogue transitions back (feeds back) to state 1. When the state transitions include a transition that runs against the dialogue scenario information in this way, the speech recognition result in state 2 or state 2a can be judged incorrect. The criterion can also be changed, for example so that the result is judged incorrect only when such backward transitions occur consecutively at the same point.
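A minimal sketch of the feedback check, under stated assumptions: the forward order of states in `SCENARIO_ORDER` is inferred from the description of FIG. 3, and the function names and the `min_repeats` parameter (which models the stricter "consecutive at the same point" criterion) are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed forward order of the FIG. 3 states; not given explicitly in the source.
SCENARIO_ORDER = ["1", "2", "1a", "2a", "1b", "2b", "3"]


def backward_transitions(visited: Sequence[str],
                         order: Sequence[str] = SCENARIO_ORDER) -> List[Tuple[str, str]]:
    """Return every transition that moves to an earlier state in the scenario."""
    rank = {state: i for i, state in enumerate(order)}
    return [(a, b) for a, b in zip(visited, visited[1:]) if rank[b] < rank[a]]


def is_misrecognized(visited: Sequence[str], min_repeats: int = 1,
                     order: Sequence[str] = SCENARIO_ORDER) -> bool:
    """Judge the result incorrect when the same backward transition occurs
    at least `min_repeats` times in a row among the backward transitions."""
    longest = run = 0
    previous = None
    for transition in backward_transitions(visited, order):
        run = run + 1 if transition == previous else 1
        longest = max(longest, run)
        previous = transition
    return longest >= min_repeats
```

With `min_repeats=1`, a single return from state 2a to state 1 is enough to reject the result; with `min_repeats=2`, the same return must happen twice in succession.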

Another method accumulates, on the basis of the state transition history, the number of times the speech recognition result has been corrected, and judges whether the result is correct according to that count. In FIG. 3, when the speech recognition result in state 2a is "adult" or "child", the dialogue transitions to state 1b, the system utterance "Adult, correct?" or "Child, correct?" is output, followed by the system utterance "How many tickets?", and the dialogue then transitions to state 2b.

In state 2b, the input utterance is speech-recognized and the speech recognition result is stored in the RAM 13. When the result is "◎ tickets", the system utterance "◎ tickets, correct?" is output and the dialogue transitions to state 3.

In state 3, the input utterance is speech-recognized and the speech recognition result is stored in the RAM 13. Because the result in state 3 can be treated as a binary choice between "yes" and "no", its reliability is high. When the stored result is "no", the dialogue transitions to state 1b and the speaker utters the number of tickets again, thereby correcting the speech recognition result.

The number of such corrections is accumulated, and when the accumulated count is equal to or less than a predetermined number, the speech recognition result is judged correct. In other words, if the speaker has corrected recognition errors only a few times, the speech recognition engine can be judged to be outputting correct results.
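The correction-count criterion can be sketched as below. The history format, the function names, the choice of "no" in state 3 as the marker of a correction, and the default limit of one correction are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical history format: (state in which recognition ran, recognition result).
History = List[Tuple[str, str]]


def count_corrections(history: History, confirm_state: str = "3",
                      reject_word: str = "no") -> int:
    """Count how often the speaker rejected a result in the confirmation state."""
    return sum(1 for state, result in history
               if state == confirm_state and result == reject_word)


def results_acceptable(history: History, max_corrections: int = 1) -> bool:
    """Judge the results correct when corrections do not exceed the limit."""
    return count_corrections(history) <= max_corrections
```

The limit corresponds to the "predetermined number" in the text; its value would have to be tuned for a real dialogue system.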

FIG. 4 is another state transition diagram for the ticket purchase dialogue scenario. As shown in FIG. 4, correctness can instead be checked only once, at the end; when the speech recognition result is judged correct at that point, the speech recognition results in all states passed through so far are judged correct. In FIG. 4, the dialogue scenario starts in state 1, the system utterance "Please state your destination station" is output, and the dialogue transitions to state 2.

In state 2a, the input utterance is speech-recognized and the speech recognition result is stored in the RAM 13. Whatever the speech recognition result "△△" is, whether or not it is "adult" or "child", the dialogue transitions to state 1b. In state 1b, the system utterance "How many tickets?" is output, and the dialogue then transitions to state 2b.

In state 2b, the input utterance is speech-recognized and the speech recognition result is stored in the RAM 13. When the result is "◎ tickets", a system utterance summarizing the speech recognition results obtained in the preceding states is output. For example, the system utterance "◎ △△ tickets to XX Station, correct? If anything is wrong, please say which item to correct: station name, type, or number of tickets." is output, and the dialogue transitions to state 5.

In state 5, the input utterance is speech-recognized and the speech recognition result is stored in the RAM 13. In state 5, the dialogue transitions to state 1 when the result is "station name", to state 1a when it is "type", and to state 1b when it is "number of tickets". When the result is "yes", the dialogue transitions to state 6 and processing ends. That is, the speech recognition result in state 5 is regarded as highly reliable, and when it is "yes", the speech recognition results from all the states passed through so far (state 1, state 1a, state 1b, and state 5) are judged correct.
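The single end-of-dialogue confirmation of FIG. 4 might be sketched as follows. The history format, the function name `accepted_results`, and the rule of keeping only the latest result per state (so that a re-entered item replaces the earlier one) are assumptions introduced for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical history format: (state in which recognition ran, recognition result).
History = List[Tuple[str, str]]


def accepted_results(history: History, final_state: str = "5",
                     accept_word: str = "yes") -> Dict[str, str]:
    """Return the latest result per state once the final confirmation is "yes".

    An empty dict means the dialogue has not (yet) been confirmed, so no
    result should be stored as utterance data.
    """
    if not history or history[-1] != (final_state, accept_word):
        return {}
    latest: Dict[str, str] = {}
    for state, result in history[:-1]:
        if state != final_state:      # skip "station name"-style correction requests
            latest[state] = result    # a later result overwrites an earlier one
    return latest
```

In this sketch, a station name re-entered after a correction request replaces the earlier, rejected station name before everything is accepted together.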

Furthermore, a recognition evaluation value output by the speech recognition engine can be used in combination. In this case, the speech recognition result is judged correct when its evaluation value, computed per sentence or per word, is higher than a predetermined threshold. This judgment relies only on the evaluation value of the speech recognition and not on the content of the dialogue. Combining it with the methods described above therefore further improves the accuracy of determining whether the speech recognition result is correct.
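Combining the dialogue-based judgment with the engine's evaluation values could look like the sketch below. The score range, the default threshold of 0.8, and the function name are hypothetical; the source only states that the value must be higher than a predetermined threshold.

```python
from typing import Sequence


def judge_with_score(dialogue_ok: bool, scores: Sequence[float],
                     threshold: float = 0.8) -> bool:
    """Accept a result only if the dialogue-based judgment passed and every
    per-word (or per-sentence) evaluation value exceeds the threshold."""
    if not scores:
        return False  # no evaluation value available, so do not accept
    return dialogue_ok and all(score > threshold for score in scores)
```

Passing a single-element sequence models the per-sentence case, and a longer sequence models the per-word case.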

When the speech recognition result is judged correct, it is stored in the utterance data storage unit 121 of the storage means 12 as utterance data associated with the stored utterance. When the speech recognition result is judged incorrect, the result and the utterance are deleted from the storage means 12. Here, "utterance data" means the whole of the data used for speech recognition, including the utterance and the information on the utterance content associated with it.

By using the voice dialogue device in this way and judging from the dialogue between the speaker and the computer whether the speech recognition result is correct, only utterances whose speech recognition results are correct can be selected and collected as utterance data in the utterance data storage unit 121.

In addition, because the speaker converses naturally with the voice dialogue device, the unnaturalness that arises when reading a script aloud is eliminated, and utterance data for speech recognition that is close to natural conversation can be collected without the speaker being conscious of it. An acoustic model with a high speech recognition rate that matches the ordinary dialogue environment can therefore be built easily. Mismatches between the collected utterance data and the environment in which the speech recognition device is used can also be avoided in every respect, including utterance style, the characteristics of the input system, and the age distribution of the speakers.

Because the state transition history of the dialogue is stored in the storage means 12 or the RAM 13, the collection of utterance data by the processing described above need not be performed in real time; it may be performed after the voice dialogue following the dialogue scenario has ended.

FIG. 5 illustrates an example of the data structure stored in the utterance data storage unit 121. FIG. 5 covers the case where "コンピュータを" (konpyūta o) is uttered, and this utterance is used as the example through FIG. 7. In the example of FIG. 5, the utterance is stored as time-series data, and the phoneme symbols obtained as the recognition result are attached together with start points and end points. The start and end points are expressed as cumulative sample counts, which depend on the sampling frequency.

FIG. 6 illustrates the relationship between phonemes and time points in the data structure stored as utterance data. As shown in FIG. 6, a start point and an end point are stored for each phoneme, indicating which part of the utterance shown in FIG. 5 corresponds to each phoneme.

Start and end points may instead be stored per syllable, as shown in FIG. 7(a), or per phrase, as shown in FIG. 7(b). When the user utterance is a single utterance unit, only the phonemes, syllables, or the like obtained as the recognition result may be associated with the user utterance cut out by the speech recognition device, including the silent intervals before and after the speech, without specifying start and end points.

In Embodiment 1, the start and end points are expressed as sample counts that depend on the sampling frequency, but this is not a limitation; time units such as seconds or milliseconds may be used.
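The utterance data layout described for FIGS. 5 to 7 can be pictured with the following sketch. The class and field names are hypothetical, and the 8000 Hz sampling rate in the usage note is only an example value, not a figure from the source.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    label: str   # phoneme, syllable, or phrase label from the recognition result
    start: int   # start point as a cumulative sample count
    end: int     # end point as a cumulative sample count


@dataclass
class UtteranceData:
    samples: List[int]        # waveform stored as time-series data
    sampling_rate_hz: int     # needed to interpret the sample counts
    segments: List[Segment]   # labels aligned to the waveform

    def segment_ms(self, segment: Segment) -> Tuple[float, float]:
        """Convert a segment's sample-count boundaries to milliseconds."""
        scale = 1000.0 / self.sampling_rate_hz
        return (segment.start * scale, segment.end * scale)
```

At an assumed 8000 Hz, a segment spanning samples 800 to 1600 corresponds to the interval from 100 ms to 200 ms, which shows why either unit can be used interchangeably once the sampling frequency is known.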

Waveform data is used as the utterance that constitutes the utterance data, but this is not a limitation; speech features used for speech recognition, such as the speech spectrum or MFCCs (Mel-Frequency Cepstral Coefficients), may be used instead.

Furthermore, the utterance data generated and stored for speech recognition need not be in units of whole utterances. As in word spotting, the result of recognizing only part of an utterance may be stored in association with the speech data of the corresponding interval.

Next, the processing of the computer program that implements the utterance data collection device for speech recognition according to Embodiment 1 of the present invention is described. FIG. 8 is a flowchart of that computer program.

In FIG. 8, a message prompting the speaker to speak is first output in accordance with the dialogue scenario information (step S801). An utterance input in response to the message is then accepted (step S802).

Next, speech recognition is performed on the input utterance (step S803), and whether the input utterance was recognized correctly is determined on the basis of the speech recognition result and the state transition history of the dialogue based on the dialogue scenario information (step S804).

When the utterance is judged to have been recognized correctly (step S804: YES), the input utterance and the speech recognition result are associated with each other and stored as one item of utterance data (step S805).

This processing is repeated until the dialogue scenario ends (step S806); when the dialogue scenario ends (step S806: YES), the collection of utterance data ends.
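The flow of steps S801 to S806 can be sketched as a simple loop. This is a simplification under stated assumptions: the scenario is reduced to a linear list of prompts (a real scenario branches on the recognition result), and `receive`, `recognize`, `judge`, and `output` are hypothetical callables standing in for the input means, the speech recognition engine, the correctness judgment, and the output means.

```python
from typing import Callable, List, Sequence, Tuple


def collect_utterance_data(
    prompts: Sequence[str],
    receive: Callable[[], bytes],
    recognize: Callable[[bytes], str],
    judge: Callable[[str, List[Tuple[str, str]]], bool],
    output: Callable[[str], None] = print,
) -> List[Tuple[bytes, str]]:
    """Run the dialogue and keep only utterances judged correctly recognized."""
    collected: List[Tuple[bytes, str]] = []
    history: List[Tuple[str, str]] = []       # simplified state transition history
    for prompt in prompts:                    # loop until the scenario ends (S806)
        output(prompt)                        # S801: prompt the speaker
        audio = receive()                     # S802: accept the utterance
        result = recognize(audio)             # S803: speech recognition
        history.append((prompt, result))
        if judge(result, history):            # S804: recognized correctly?
            collected.append((audio, result)) # S805: store utterance with result
    return collected
```

Because the judgment receives the whole history, any of the dialogue-based criteria described earlier could be plugged in as `judge`.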

In Embodiment 1, utterance data is collected from a single voice dialogue device, but the number of devices is not limited to one; utterance data accumulated by multiple voice dialogue devices can also be aggregated. Collecting with multiple devices yields a larger amount of utterance data, making it possible to build an acoustic model with high speech recognition accuracy.

In Embodiment 1, correctness is judged by focusing on state transitions, but the method is not limited to this; any method that assures the validity of the speech recognition result may be used. For example, a storage slot may be provided in the storage means 12 for each item entered by utterance in the dialogue scenario, and each slot value may be stored in association with information identifying the utterance that was recognized.

FIG. 9 illustrates slot management in the storage means 12. In FIG. 9, a slot is assigned to each input item of the ticket purchase dialogue scenario, such as "station name", "type", and "number of tickets". The result of recognizing the utterance is stored as the slot value, and an utterance ID is stored as the information identifying the recognized utterance. In this way, when, for example, the recognition of the station name is judged incorrect and the utterance is recognized again, the slot value is corrected to the correct result "Kochi" and the corresponding utterance ID is updated as well. Consequently, only the results that were finally recognized correctly remain stored in the storage means 12.
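A minimal sketch of the slot management of FIG. 9, assuming a simple in-memory store. The class name, method names, slot names, utterance ID format, and the misrecognized value "Kobe" in the usage note are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class SlotStore:
    """Keeps, per input item, the latest slot value and the ID of its utterance."""

    def __init__(self, names: List[str]) -> None:
        self.slots: Dict[str, Optional[Tuple[str, str]]] = {n: None for n in names}

    def fill(self, name: str, value: str, utterance_id: str) -> None:
        """Set or overwrite a slot; a correction replaces both value and ID."""
        if name not in self.slots:
            raise KeyError(name)
        self.slots[name] = (value, utterance_id)

    def final_pairs(self) -> List[Tuple[str, str]]:
        """Return (utterance_id, value) for every filled slot."""
        return [(uid, value) for value, uid in
                (entry for entry in self.slots.values() if entry is not None)]
```

If the station slot is first filled with a misrecognized "Kobe" and later refilled with "Kochi", only the pair for "Kochi" survives, which mirrors how the text says only the finally correct results remain stored.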

(Embodiment 2)
The schematic configuration of the computer that implements the utterance data collection device for speech recognition according to Embodiment 2 of the present invention is the same as in Embodiment 1. Embodiment 2 differs in that it includes speaker recognition means using speech and can verify the speaker on the basis of the utterance.

Information about speakers is stored in advance in the speaker information storage unit 122 of the storage means 12. The stored information is personal information that includes at least information identifying the speaker, together with other items such as gender, age, area of residence, and nationality. The information need not be stored in advance; information that becomes known during the dialogue scenario of the voice dialogue device may be added as needed. The speaker information storage unit 122 also need not reside in the storage means 12; it may be provided on any recording medium capable of storing a large volume of data, such as a hard disk built into another computer connected via the communication means 14.

When an utterance by a speaker is input, it is speech-recognized and, to determine which of the registered speakers produced it, information identifying the corresponding speaker, for example a user ID, is obtained. The information about the speaker associated with that user ID is then extracted from the speaker information storage unit 122 of the storage means 12, and when the utterance data is generated, the utterance, the speech recognition result, and the extracted speaker information are stored together as one set of utterance data in the utterance data storage unit 121.

In this way, the utterance data stored for speech recognition can be collected by condition, such as gender, age, or area of residence, and acoustic models matched to the speakers' conditions can be created. Utterance data that can be expected to improve speech recognition accuracy under specific conditions can therefore be collected efficiently.

(Embodiment 3)
The schematic configuration of the computer that implements the utterance data collection device for speech recognition according to Embodiment 3 of the present invention is the same as in Embodiment 1. Embodiment 3 differs in that user utterances are input over various kinds of telephone lines.

That is, the input means 15 is connected to telephone lines such as fixed-line telephone, mobile phone, PHS, and IP telephone lines, and each input utterance arrives with the speech processing specific to its line already applied. Attaching, at collection time, information identifying which telephone line carried the utterance data therefore makes it possible to improve speech recognition accuracy.

To connect to the voice dialogue device, the speaker calls a predetermined telephone number for the device from a telephone such as a home fixed-line phone or the speaker's own mobile phone. On receiving the call, the computer connects the line and at the same time stores the caller number information for the call (whether a caller number is presented and, if so, the presented number) in the storage device 12 or the memory 13. A voice message is output in accordance with the dialogue scenario information, and the speaker's utterance data is then accumulated as in Embodiment 1.

When the caller number is presented at the time of the call, the CODEC type of the line or the carrier (telephone operator) can be estimated from it. For example, when the first three digits of the caller number are "090", the line can be presumed to be a mobile phone. The utterance data is therefore generated by associating the caller number information with the utterance and the speech recognition result.
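Estimating the line type from the caller number could be sketched as a prefix lookup. Only the "090" entry comes from the text; the "050" and "070" entries reflect general Japanese numbering conventions and are assumptions here, as are the function name and the label strings.

```python
from typing import Optional

# Only "090" is taken from the text; the other prefixes are illustrative assumptions.
PREFIX_TO_LINE = {
    "090": "mobile",
    "050": "ip_phone",
    "070": "phs",
}


def estimate_line_type(caller_number: Optional[str]) -> str:
    """Guess the line type from the first three digits of the caller number.

    Returns "unknown" when the number is withheld or the prefix is not listed.
    """
    if not caller_number:
        return "unknown"  # caller number not presented
    digits = "".join(ch for ch in caller_number if ch.isdigit())
    return PREFIX_TO_LINE.get(digits[:3], "unknown")
```

The returned label would be stored alongside the utterance and its recognition result so that the data can later be grouped by line type.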

Because the CODEC type of the line or the carrier can be identified from the caller number in this way, utterance data can be classified by CODEC type or carrier. An acoustic model that accounts for the CODEC distortion arising on each line can therefore be generated, and improved speech recognition accuracy over telephone lines can be expected.

When the caller number is not presented at the time of the call, acoustic models for line determination are stored in the storage means 12 beforehand, and determining which of these acoustic models the input utterance matches makes it possible to determine whether the line used was a fixed-line telephone, a mobile phone (PDC, W-CDMA, etc.), PHS, an IP telephone, or the like.

In this way, the acoustic model corresponding to each CODEC type or carrier can be identified, and by using an acoustic model that accounts for the CODEC distortion arising on each line, improved speech recognition accuracy over telephone lines can be expected.

As described above, according to Embodiment 3, utterance data classified by the line information obtainable for each dialogue can be collected, and using an acoustic model suited to the line in use makes it possible to further improve speech recognition accuracy.

In addition to associating the caller number information with the utterance and the speech recognition result, it is desirable to associate the time at which the utterance was recorded. For lines such as mobile phones, where the CODEC type changes over the years, old utterances should not be used for training or improving the acoustic model, and the recording time is useful information for selecting the utterances to exclude from the training data.

If an acoustic model is trained on a small amount of utterance data, it is difficult to produce a model that is effective for unspecified speakers. Means for detecting the amount of utterance data accumulated in the utterance data storage unit 121 is therefore provided, and training of the acoustic model is started when the accumulated amount exceeds a predetermined threshold.

The means for detecting the amount of accumulated utterance data is not limited to one that detects the total amount of stored data; for example, the acoustic model may instead be regenerated at fixed time intervals.
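The two triggers for acoustic model training, by accumulated amount or by elapsed time, can be sketched in one function. The function name, the use of an utterance count as the measure of data amount, and the parameter names are assumptions; the source does not specify units or values.

```python
from typing import Optional


def should_retrain(stored_count: int, count_threshold: int,
                   seconds_since_last: Optional[float] = None,
                   interval_seconds: Optional[float] = None) -> bool:
    """Decide whether acoustic model training should start.

    Training starts when the stored amount exceeds the threshold or, if an
    interval is configured, when that much time has passed since the last run.
    """
    if stored_count > count_threshold:
        return True
    if interval_seconds is not None and seconds_since_last is not None:
        return seconds_since_last >= interval_seconds
    return False
```

Leaving the interval arguments unset gives the purely quantity-based behavior, while setting them adds the periodic regeneration described as an alternative.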

In this way, the acoustic model can be updated on a quantity basis or periodically, and an acoustic model based on the latest utterance data can be generated, so that the model can reflect variations in speech due to the speaker's physical condition or aging. Utterance data that contributes further to improving speech recognition accuracy can therefore be collected.

(Appendix 1)
An utterance data collection device for speech recognition, comprising:
a voice dialogue device including means for storing dialogue scenario information describing the procedure by which a dialogue proceeds, means for accepting an input utterance, means for speech-recognizing the input utterance, means for advancing the dialogue on the basis of the speech recognition result and the dialogue scenario information, and means for outputting a response to the input utterance;
means for storing a state transition history of the dialogue based on the dialogue scenario information;
means for determining, on the basis of the speech recognition result and the state transition history, whether the input utterance has been recognized correctly; and
means for storing the speech recognition result and the input utterance in association with each other when the determining means determines that the utterance has been recognized correctly.

(Appendix 2)
The utterance data collection device for speech recognition according to Appendix 1, wherein the utterance stored in association with the speech recognition result is speech waveform data or an utterance feature quantity obtained by acoustic analysis of the utterance.

(Appendix 3)
The utterance data collection device for speech recognition according to Appendix 1 or 2, comprising:
means for correcting the speech recognition result;
means for accumulating the number of times the speech recognition result has been corrected; and
means for determining, when the accumulated number of corrections is equal to or less than a predetermined number, that the input utterance is an utterance corresponding to the dialogue procedure described in the dialogue scenario information.

(Appendix 4)
The utterance data collection device for speech recognition according to Appendix 1 or 2, wherein:
the dialogue scenario information includes a description for executing a predetermined task when an utterance following the described dialogue procedure is input;
the device comprises means for detecting execution of the task; and
the means for determining whether the input utterance has been recognized correctly determines, when the means for detecting execution of the task detects execution of the task, that the input utterance is an utterance corresponding to the dialogue procedure described in the dialogue scenario information.

(Appendix 5)
The utterance data collection device for speech recognition according to any one of Appendices 1 to 4, comprising:
means for storing information about speakers, including information for identifying each speaker;
means for identifying the speaker on the basis of the input utterance and the information about speakers; and
means for storing information associated with the identified speaker in association with the speech recognition result and the input utterance.

(Appendix 6)
The utterance data collection device for speech recognition according to any one of Appendices 1 to 5, comprising:
means for receiving a caller number on a telephone line; and
means for storing the caller number in association with the speech recognition result and the input utterance.

(Appendix 7)
The utterance data collection device for speech recognition according to Appendix 6, comprising:
means for determining a line type or a carrier on the basis of the caller number; and
means for storing the determined line type or carrier in association with the speech recognition result and the input utterance.

(Appendix 8)
The speech recognition utterance data collection device according to any one of appendices 1 to 7, further comprising means for storing information about the time at which the utterance was input in association with the speech recognition result and the input utterance.
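
Appendices 5 to 8 attach speaker information, the calling number, the line type or carrier, and the input time to each stored pair of speech recognition result and utterance. A minimal sketch of such a record follows. The names `UtteranceRecord`, `classify_line`, and `make_record` are hypothetical, and the prefix table is an illustrative assumption rather than something the source specifies.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Illustrative prefix table (an assumption, not taken from the source).
LINE_TYPE_BY_PREFIX = {
    "090": "mobile",
    "080": "mobile",
    "070": "mobile_or_phs",
    "050": "ip_phone",
}


def classify_line(calling_number: str) -> str:
    """Guess the line type from the first three digits of the calling number."""
    digits = "".join(ch for ch in calling_number if ch.isdigit())
    return LINE_TYPE_BY_PREFIX.get(digits[:3], "fixed_line")


@dataclass
class UtteranceRecord:
    """A stored utterance together with the metadata of appendices 5 to 8."""
    recognized_text: str                    # speech recognition result
    utterance: bytes                        # waveform or feature data, kept opaque
    speaker_info: Optional[dict] = None     # appendix 5
    calling_number: Optional[str] = None    # appendix 6
    line_type: Optional[str] = None         # appendix 7
    input_time: Optional[datetime] = None   # appendix 8


def make_record(recognized_text: str, utterance: bytes,
                calling_number: Optional[str] = None,
                speaker_info: Optional[dict] = None,
                input_time: Optional[datetime] = None) -> UtteranceRecord:
    """Build a record, deriving the line type when a calling number is known."""
    line_type = classify_line(calling_number) if calling_number else None
    return UtteranceRecord(recognized_text, utterance, speaker_info,
                           calling_number, line_type, input_time)
```

Keeping the metadata optional lets the same record serve a device that implements only some of the appendices.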

(Appendix 9)
A speech recognition utterance data collection method that uses a voice dialogue method of
storing dialogue scenario information describing a dialogue progress procedure,
accepting an input utterance,
performing speech recognition on the input utterance,
advancing the dialogue based on the speech recognition result and the dialogue scenario information, and
outputting a response to the input utterance,
the method comprising:
storing a state transition history of the dialogue based on the dialogue scenario information;
determining, based on the speech recognition result and the state transition history, whether or not the input utterance has been correctly recognized; and
storing the speech recognition result and the input utterance in association with each other when it is determined that the input utterance has been correctly recognized.
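
The method of appendix 9 can be sketched as follows. The appendix does not state how the state transition history is evaluated, so the rule in `judged_correct` is a hypothetical one chosen for illustration: a turn counts as correctly recognized only if the dialogue advanced and the next turn did not land back in the state the turn started from. The class and field names are likewise assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Turn:
    """One dialogue turn: the utterance, its recognition result, and the transition."""
    utterance: bytes        # waveform or feature data, kept opaque here
    recognized_text: str
    state_before: str
    state_after: str


@dataclass
class Collector:
    history: List[Turn] = field(default_factory=list)               # state transition history
    stored: List[Tuple[str, bytes]] = field(default_factory=list)   # collected pairs

    def record(self, turn: Turn) -> None:
        """Append a turn to the state transition history."""
        self.history.append(turn)

    def judged_correct(self, index: int) -> bool:
        """Hypothetical rule based only on the state transition history."""
        turn = self.history[index]
        if turn.state_after == turn.state_before:
            return False    # dialogue did not advance, e.g. a re-prompt
        if index + 1 < len(self.history):
            nxt = self.history[index + 1]
            if nxt.state_after == turn.state_before:
                return False    # dialogue fell back, suggesting a misrecognition
        return True

    def collect(self) -> None:
        """Store result and utterance together for every turn judged correct."""
        self.stored = [
            (t.recognized_text, t.utterance)
            for i, t in enumerate(self.history)
            if self.judged_correct(i)
        ]
```

This rule is deliberately conservative: when a user rejects a confirmation and the dialogue returns to the earlier prompt, both the rejected turn and the rejection itself are left out of the collected data.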

(Appendix 10)
A computer program that causes a computer to function as:
a voice dialogue device including means for storing dialogue scenario information describing a dialogue progress procedure, means for accepting an input utterance, means for performing speech recognition on the input utterance, means for advancing the dialogue based on the speech recognition result and the dialogue scenario information, and means for outputting a response to the utterance;
means for storing a state transition history of the dialogue based on the dialogue scenario information;
means for determining, based on the speech recognition result and the state transition history, whether or not the input utterance has been correctly recognized; and
means for storing the speech recognition result and the input utterance in association with each other when the determining means determines that the input utterance has been correctly recognized.

Schematic block diagram of a computer embodying the speech recognition utterance data collection device according to Embodiment 1 of the present invention.
State transition diagram for a dialogue scenario that confirms a name.
State transition diagram for a dialogue scenario that purchases a ticket.
Another state transition diagram for a dialogue scenario that purchases a ticket.
Example of the data structure stored in the utterance data storage unit.
Example showing the relationship between phonemes and time points.
Example showing the relationship between syllables or words and time points.
Flowchart of the program used in the speech recognition utterance data collection device according to Embodiment 1 of the present invention.
Explanatory diagram of slot management in the storage means.
Functional block diagram showing the schematic configuration of a conventional speech recognition utterance data collection device.

Explanation of symbols

11 CPU
12 Storage means
13 RAM
14 Communication means
15 Input means
16 Output means
17 Auxiliary storage means
18 Portable recording medium
121 Utterance data storage unit
122 Speaker information storage unit

Claims

A speech recognition utterance data collection device comprising:
a voice dialogue device including
means for storing dialogue scenario information including a description for executing, when an utterance in accordance with the dialogue progress procedure is input, a process to be performed next in accordance with the content of the utterance,
utterance data storage means in which a speech recognition result and the input utterance are associated with each other and stored as utterance data for speech recognition,
means for accepting the input utterance,
means for recognizing the input utterance using the utterance data stored in the utterance data storage means,
means for advancing the dialogue based on the speech recognition result and the dialogue scenario information,
detection means for detecting execution of the process, and
means for outputting a response to the input utterance;
means for storing a state transition history of the dialogue based on the dialogue scenario information;
determination means for determining, when the detection means detects execution of the process, that the input utterance is an utterance corresponding to the dialogue progress procedure described in the dialogue scenario information, and for determining, when the detection means does not detect execution of the process, whether or not the input utterance has been correctly recognized based on the speech recognition result and the state transition history; and
means for storing the speech recognition result and the input utterance in association with each other when the determination means determines that the input utterance has been correctly recognized.
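
The determination in the claim above has two branches: detected execution of the process is decisive, and the state transition history is consulted only otherwise. A minimal sketch of that ordering, assuming a hypothetical `should_store` function and a caller-supplied `history_check` callable:

```python
from typing import Callable


def should_store(task_executed: bool, history_check: Callable[[], bool]) -> bool:
    """Return True when the recognition result and utterance should be stored."""
    if task_executed:
        # Execution of the scenario's process is taken as evidence that the
        # utterance followed the dialogue progress procedure.
        return True
    # Otherwise fall back to the judgment based on the recognition result
    # and the state transition history (supplied by the caller).
    return history_check()
```

Passing the history check as a callable means it is evaluated only when no execution of the process was detected.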

2. The speech recognition utterance data collection device according to claim 1, wherein the utterance stored in association with the speech recognition result is speech waveform data or an utterance feature amount obtained by acoustic analysis of the utterance.

The speech recognition utterance data collection device according to claim 1, further comprising:
means for receiving and storing a calling number on a telephone line; and
means for storing the calling number in association with the speech recognition result and the input utterance.

A speech recognition utterance data collection method that causes a computer to execute:
a voice dialogue method including
a step of storing dialogue scenario information including a description for executing, when an utterance in accordance with the dialogue progress procedure is input, a process to be performed next in accordance with the content of the utterance,
an utterance data storage step of associating a speech recognition result with the input utterance and storing them as utterance data for speech recognition,
a step of accepting the input utterance,
a step of recognizing the input utterance using the utterance data stored in the utterance data storage step,
a step of advancing the dialogue based on the speech recognition result and the dialogue scenario information,
a detection step of detecting execution of the process, and
a step of outputting a response to the input utterance;
a step of storing a state transition history of the dialogue based on the dialogue scenario information;
a determination step of determining, when execution of the process is detected in the detection step, that the input utterance is an utterance corresponding to the dialogue progress procedure described in the dialogue scenario information, and of determining, when execution of the process is not detected in the detection step, whether or not the input utterance has been correctly recognized based on the speech recognition result and the state transition history; and
a step of storing, in the utterance data storage step, the speech recognition result and the input utterance in association with each other when the determination step determines that the input utterance has been correctly recognized.

A computer program that causes a computer to function as:
a voice dialogue device including
means for storing dialogue scenario information including a description for executing, when an utterance in accordance with the dialogue progress procedure is input, a process to be performed next in accordance with the content of the utterance,
utterance data storage means in which a speech recognition result and the input utterance are associated with each other and stored as utterance data for speech recognition,
means for accepting the input utterance,
means for recognizing the input utterance using the utterance data stored in the utterance data storage means,
means for advancing the dialogue based on the speech recognition result and the dialogue scenario information,
detection means for detecting execution of the process, and
means for outputting a response to the input utterance;
means for storing a state transition history of the dialogue based on the dialogue scenario information;
determination means for determining, when the detection means detects execution of the process, that the input utterance is an utterance corresponding to the dialogue progress procedure described in the dialogue scenario information, and for determining, when the detection means does not detect execution of the process, whether or not the input utterance has been correctly recognized based on the speech recognition result and the state transition history; and
means for storing the speech recognition result and the input utterance in association with each other when the determination means determines that the input utterance has been correctly recognized.

A speech recognition utterance data collection device comprising:
a voice dialogue device including means for storing dialogue scenario information describing a dialogue progress procedure, means for accepting an input utterance, means for performing speech recognition on the input utterance, means for advancing the dialogue based on the speech recognition result and the dialogue scenario information, and means for outputting a response to the input utterance;
means for storing a state transition history of the dialogue based on the dialogue scenario information;
means for determining, based on the speech recognition result and the state transition history, whether or not the input utterance has been correctly recognized;
means for storing the speech recognition result and the input utterance in association with each other when the determining means determines that the input utterance has been correctly recognized;
means for receiving and storing a calling number on a telephone line; and
means for storing the calling number in association with the speech recognition result and the input utterance.