JP2005257917A

JP2005257917A - Phonetic interpretation method, phonetic interpreting device, and phonetic interpretation program

Info

Publication number: JP2005257917A
Application number: JP2004067729A
Authority: JP
Inventors: Katsuto Sudo; 克仁須藤; Mikio Nakano; 幹生中野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-03-10
Filing date: 2004-03-10
Publication date: 2005-09-22

Abstract

PROBLEM TO BE SOLVED: To improve the reliability and flexibility of phonetic interpretation in a voice interaction system that uses a computer or the like.

SOLUTION: In a phonetic interpretation device, a reliability information adding means 14 successively adds reliability information to the recognition results of voice recognition. Finite-state transducers of different forms are generated by a rule type interpretation means 18a, a specific word extracting means 18b, and an example type interpretation means 18c. An interpretation result selecting means 15 selects, from among the combinations of these finite-state transducers, the combination with the smallest weight, that is, the highest reliability, and outputs it as the phonetic interpretation result.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a speech interpretation method, a speech interpretation apparatus, and a speech interpretation program for a speech interpretation system that accepts human requests by voice.

Consider a speech interpretation system composed of speech input means for inputting speech to the system, speech recognition means for recognizing the speech as a sequence of words, speech recognition reliability evaluation means for evaluating the recognition reliability of each recognized word, and speech interpretation means for interpreting the input speech based on the recognized word sequence and the recognition reliability of each word. In such a system, speech from the speech input means is recognized as a word sequence by the speech recognition means, and an interpretation result for the input speech is identified from that word sequence. One conventional technique for this converts the word sequence into an intermediate representation and expresses the co-occurrence probabilities of the intermediate representation as a statistical model, so that speech recognition errors are less likely to cause errors in the interpretation result (Non-Patent Document 1).
Non-Patent Document 1: Hirokazu Masataki (政瀧浩和), Koichi Tanigaki (谷垣宏一), Yoshinori Sagisaka (匂坂芳典), "Spoken language understanding using conversion from input sentence to intermediate representation by statistical processing," IEICE Transactions D-II, Vol. J82-D-II, No. 2, pp. 169-177, February 1999

In the conventional technique, speech interpretation uses only the word sequence obtained by the speech recognition means, so it cannot take into account which words in the recognized sequence are recognition errors. As a result, whether the speech recognition means obtained a word sequence with high confidence or with almost no confidence, the same interpretation result is derived as long as the word sequence itself is the same, and the adverse effects of speech recognition errors are not sufficiently reduced. In addition, the statistical model used in the conventional technique requires a sufficiently large database of human utterances for training. Collecting that data requires either using another speech interpretation system or recording human-to-human dialogues that simulate dialogue between a human and a computer system.

Claim 1 of the present invention proposes a speech interpretation method for a speech interpretation system that includes a speech input process for inputting speech to the system, a speech recognition process for recognizing the speech as a sequence of words, a speech recognition reliability evaluation process for evaluating the recognition reliability of each recognized word, and a speech interpretation process for interpreting the input speech based on the recognized word sequence and the recognition reliability of each word. In this method, the speech (A) input through the speech input process is recognized as a word sequence (B) by the speech recognition process; speech recognition reliability evaluation values (C0), (C1), ..., (Cn) are calculated by the speech recognition reliability evaluation process for the words (B0), (B1), ..., (Bn) contained in the word sequence; and a speech interpretation result (D) for the input speech (A) is identified from the words (B0), ..., (Bn) and the reliability evaluation values (C0), (C1), ..., (Cn).

Claim 2 of the present invention concerns how to realize a speech interpretation method in which the speech (A) input through the speech input process is recognized as a word sequence (B) by the speech recognition process, and a speech interpretation result (D) for the input speech (A) is identified from the words (B0), ..., (Bn) contained in the word sequence and the speech recognition reliability evaluation values (C0), (C1), ..., (Cn) calculated by the speech recognition reliability evaluation process. The claim relies on three processes. The rule-type interpretation process outputs the corresponding sequence of interpretation results (F0), (F1), ..., (Fj) when a word sequence containing a registered specific word sequence (E0), (E1), ..., (Ei) is input. The specific-type word extraction process outputs (G) when a word sequence containing a word (G) of a certain specific type is input. The case-type interpretation process uses pairs of a word sequence (H0), (H1), ..., (Hk) and the corresponding sequence of interpretation results (I0), (I1), ..., (Il) stored as cases in a database, and outputs (I0), ..., (Il) when a word sequence containing (H0), ..., (Hk) is input. The claim proposes a speech interpretation method that identifies the best speech interpretation result (D) through an interpretation result selection process using a finite-state transducer that combines the rule-type interpretation process, the specific-type word extraction process, and the case-type interpretation process.
Claim 3 of the present invention proposes the speech interpretation method according to claim 2 in which pairs of word sequences (Ji), (Li) whose speech recognition reliability evaluation value (Mi) is larger than a certain value are registered as cases to be stored in the database, and the registered word sequences (Ji), (Li) are used as cases in the case-type interpretation process.

Claim 4 of the present invention proposes the speech interpretation method according to claim 3 in which the speech recognition reliability evaluation value (Mi) used in the case-type interpretation process is obtained as follows. A word sequence (P) estimated to have actually been uttered is extracted from dialogue information recorded in dialogue information recording means, and the content of the dialogue information interpreted from the extracted word sequence (P) is identified as (P)'. The words contained in the word sequence (P) are extracted as (Q0), (Q1), (Q2), ..., (Qn). The correct answer for the actually uttered word sequence corresponding to the word sequence (P) is identified as (R), the correct answer for the content of the dialogue information interpreted from (R) is identified as (R)', and the actually uttered word of the dialogue information corresponding to each word (Qi) is identified as (Si). In addition to the indices calculated in order to estimate the extracted word sequence (P) as a result candidate, indices obtained from the dialogue information recorded in the dialogue information recording means are used to calculate utterance-level index values and word-level index values. A reliability measure for evaluating reliability at the utterance level and the word level is then created by determining the relationship between these index values and a binary (0 or 1) indicator of utterance-level and word-level correctness. That indicator is obtained by comparing the word sequence that the system recognized from the dialogue information read from the dialogue information recording means with the correct answer (R) for the actually uttered word sequence (P) recorded by the dialogue information recording means. Finally, the index values of the speech recognition result whose reliability is to be evaluated are calculated and applied to the evaluation formula of the created reliability measure, and the result is used as the reliability evaluation value.

According to the present invention, speech interpretation is made possible with simple definitions by using rules defined for speech interpretation together with specific types of words of interest. The system can be extended to interpret a wider variety of expressions flexibly by collecting cases. It also incorporates a framework that rejects low-reliability words during interpretation, which makes speech interpretation robust against speech recognition errors. Furthermore, cases for case-type interpretation need not be created only by collecting data in advance, transcribing the utterances, and labeling them with the corresponding interpretation results; they can also be created automatically by collecting high-reliability cases from dialogue records gathered by the system.

FIG. 1 shows the best mode for carrying out the present invention. In FIG. 1, reference numeral 10 denotes a speech interpretation apparatus according to the present invention, and 20 denotes a database creation unit that creates the database used by the speech interpretation apparatus 10.
The speech interpretation apparatus 10 comprises: speech input means 12 for capturing input speech 11; speech recognition means 13 for recognizing the content of the input speech; reliability information adding means 14 for adding reliability information to the recognition result produced by the speech recognition means 13; a finite-state transducer generation unit 18 that, from the word sequence to which reliability information has been added, outputs interpretation results in the modes assigned according to the data registered in a database 17 and generates finite-state transducers; and interpretation result selection means 15 that selects the best interpretation using the finite-state transducers output by the finite-state transducer generation unit 18. The interpretation result selection means 15 outputs a speech interpretation result 16, which is presented as speech or text.

The finite-state transducer generation unit 18 consists of three means, each of which operates on the word sequence to which the reliability information adding means 14 has added reliability information. The rule-type interpretation means 18A extracts specific word sequences registered in the database 17 and outputs the corresponding interpretation results. The specific-type word extraction means 18B extracts and outputs words of the specific types registered in the database 17. The case-type interpretation means 18C extracts word sequences stored as cases in the database 17 and outputs the corresponding interpretation results by referring to the database 17.
The database creation unit 20 can be configured in two ways. In one configuration, cases are registered in the database 17 from a case file 21 through case registration means 22. In the other, case reliability information adding means 24 adds speech recognition reliability information to the dialogue information from dialogue information recording means 23, case selection means 25 selects the highly reliable information from the dialogue information carrying that reliability information, and the selected information is written to the database 17 through the case registration means 22.

The operation of each part is described below, following the steps shown in FIG. 1.
Step 1: In the database 17, a file (system description file) is prepared that describes the speech interpretation rules used by the rule-type interpretation means 18A, the specific-type words used by the specific-type word extraction means 18B, and the cases in the database 17 used by the case-type interpretation means 18C.
Step 2: Based on the speech interpretation rules defined in the system description file, the rule-type interpretation means 18A, which interprets the word sequence of the recognition result produced by the speech recognition means 13, is started. Specifically, it is expressed as a weighted finite-state transducer that takes a word sequence defined by a rule as input and produces the corresponding interpretation result as output. A constant weight is added for each application of a rule.
Step 3: The specific-type word extraction means 18B is started. When the words recognized by the speech recognition means 13 include a word of a type that represents specific content defined in the system description file, this means treats outputting that word as the interpretation. Specifically, it is expressed as a weighted finite-state transducer that, when a word of a specific type is input, outputs the word at a constant weight while marking it explicitly as a word of that type, and outputs any other word unchanged at a larger weight.
Step 4: The case-type interpretation means 18C is started. Based on the cases described in the system description file, that is, the word sequences and corresponding interpretation results stored in the database 17, it performs speech interpretation using the interpretations found in past cases. Its operation is almost the same as that of the rule-type interpretation means 18A; only the way weights are assigned in the weighted finite-state transducer differs. The weight for applying an interpretation from a case is the logarithm of the occurrence probability of the case in the database 17 with its sign inverted.
Step 5: The three weighted finite-state transducers generated by the rule-type interpretation means 18A, the specific-type word extraction means 18B, and the case-type interpretation means 18C are combined and input to the interpretation result selection means 15.
Step 6: The speech input means 12 receives the input speech 11.
Step 7: The speech recognition means 13 performs speech recognition.
Step 8: The reliability information adding means 14 converts the word sequence obtained by speech recognition into a finite-state transducer to which speech recognition reliability information is added. So that highly reliable words are preferred, a word with a higher reliability evaluation value is given a smaller weight.
Step 9: The finite-state transducer carrying the speech recognition reliability information is composed with the weighted finite-state transducer that combines the rule-type interpretation means, the specific-type word extraction means, and the case-type interpretation means. From the composed weighted finite-state transducer (the interpretation result selection means), the path with the smallest weight is selected, and its output is output as the speech interpretation result 16.

The database creation unit 20 can proceed in either of two ways. In the first:
Step A1: A file (case file 21) is prepared that records the word sequences actually uttered by humans together with information describing the corresponding interpretation results.
Step A2: The case registration means 22 registers the pairs of word sequence and interpretation result in the case file 21 as cases in the database 17.
In the second:
Step B1: The case reliability information adding means 24 calculates an utterance-level reliability evaluation value for the word sequence of each speech recognition result in the dialogue record and adds this utterance-level reliability information to the word sequence. The "speech recognition reliability evaluation method" proposed in the prior application, Japanese Patent Application No. 2003-27926, can be used to calculate the utterance-level reliability evaluation value. The calculation method of the prior application is described in detail later.
Step B2: The case selection means 25 selects the speech recognition results whose reliability evaluation value is larger than a separately specified threshold and saves them as a file (selected case file).
Step B3: The case registration means 22 registers the word sequences of the highly reliable speech recognition results written to the selected case file, together with the corresponding interpretation results, as cases in the database 17.
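The two registration routes (Steps A1-A2 and Steps B1-B3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `select_cases`, `register_cases`, and `occurrence_probability` are hypothetical, the confidence values and the 0.5 threshold are invented sample data, and computing the occurrence probability as a relative frequency is an assumption, since the text does not say how it is estimated.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Case = Tuple[Tuple[str, ...], str]      # (word sequence, interpretation result)
Record = Tuple[List[str], str, float]   # (words, interpretation, utterance-level confidence)


def select_cases(records: Iterable[Record], threshold: float) -> List[Case]:
    """Step B2: keep only recognition results whose confidence exceeds the threshold."""
    return [(tuple(words), interp) for words, interp, conf in records if conf > threshold]


def register_cases(database: Counter, cases: Iterable[Case]) -> None:
    """Steps A2 / B3: store each (word sequence, interpretation) pair, counting repeats."""
    database.update(cases)


def occurrence_probability(database: Counter, case: Case) -> float:
    """Relative frequency of a case in the database (an assumed estimator)."""
    total = sum(database.values())
    return database[case] / total if total else 0.0


database: Counter = Counter()

# Route A: a hand-transcribed, hand-labelled case file.
bus_center_case: Case = (("bus_center", "to"), "stopto=(busstop=(bus_center))")
register_cases(database, [bus_center_case])

# Route B: raw dialogue log with utterance-level confidence (invented sample values).
dialogue_log: List[Record] = [
    (["bus_center", "to"], "stopto=(busstop=(bus_center))", 0.92),
    (["lab_front", "from"], "stopfrom=(busstop=(lab_front))", 0.35),
]
register_cases(database, select_cases(dialogue_log, threshold=0.5))
```

Only the first log entry passes the threshold, so the low-confidence recognition result never becomes a case.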

Examples of each part are described below. As an example, consider a speech interpretation system for asking a computer system about bus timetables. In this system, a person can convey five items of information to the system: "boarding bus stop", "alighting bus stop", "via point of the bus", "departure or arrival time", and "day of the week".
Reliability information adding means 14
The word sequence recognized by the speech recognition means 13 is converted into a weighted finite-state transducer to which information on the recognition reliability of each word is added. The technique described in Reference 1, for example, can be used as the reliability evaluation method.
(Reference 1)
"Recognition confidence scoring and its use in speech understanding systems", Timothy J. Hazen, Stephanie Seneff and Joseph Polifroni, Computer Speech and Language 2002 vol.16 pp.46-67
So that highly reliable words are preferred, a word with a higher reliability evaluation value is given a smaller weight. In this embodiment, the reliability evaluation value with its sign inverted is used as the weight. In addition, so that a misrecognized word can be rejected when a speech recognition error occurs, a framework is introduced that rejects a word at a constant weight in a state transition of the finite-state transducer.

As an example, let the weight of a state transition that rejects a word be +3.0. Suppose the first candidate of the speech recognition result is "バスセンター (bus center; reliability evaluation value +2.0) から (from; reliability evaluation value +1.2)" and the second candidate is "バスセンター (reliability evaluation value +2.0) から (reliability evaluation value +1.0) は (topic particle; reliability evaluation value -0.8)". The weighted finite-state transducer is then expressed as shown below. Each row represents a transition rule: column 1 is the current state number, column 2 is the destination state number, column 3 is the input symbol of the transition, column 4 is the output symbol of the transition, and column 5 is the weight of the transition (0 if blank). State number 0 is the start state, and a row with an entry only in column 1 marks that state number as a final state. epsilon indicates a transition in which nothing is input or nothing is output. Rejection of a word is realized by a transition whose output is epsilon.
========
Column 1  Column 2  Column 3  Column 4  Column 5
0  1  epsilon  epsilon
1  2  バスセンター  バスセンター  -2.0
1  2  バスセンター  epsilon  3.0
2  3  から  バスセンター  -1.2
2  3  から  epsilon  3.0
3
0  4  バスセンター  バスセンター  -2.0
0  4  バスセンター  epsilon  3.0
4  5  から  バスセンター  -1.0
4  5  から  epsilon  3.0
5  6  は  は  0.8
5  6  は  epsilon  3.0
========
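The weighting scheme described above (accept a word at the negated reliability value, or reject it at a constant weight with an epsilon output) can be sketched as follows. This is an illustrative sketch only: `hypothesis_to_arcs` is a hypothetical helper, the tokens are English stand-ins for the Japanese words, the state numbering is simplified relative to the table (no initial epsilon arc), and the sketch assumes that an accepted word is passed through unchanged as the output symbol.

```python
from typing import List, Tuple

EPS = "epsilon"
REJECT_WEIGHT = 3.0  # constant rejection weight used in the example in the text

# (current state, next state, input symbol, output symbol, weight)
Arc = Tuple[int, int, str, str, float]


def hypothesis_to_arcs(words: List[Tuple[str, float]], start: int, first: int) -> List[Arc]:
    """Turn one recognition hypothesis [(word, confidence), ...] into a chain of arcs.

    The chain leaves state `start`; newly created states are numbered from `first`.
    Each word contributes two parallel arcs: accept (pass the word through) and reject.
    """
    arcs: List[Arc] = []
    src = start
    for offset, (word, confidence) in enumerate(words):
        dst = first + offset
        arcs.append((src, dst, word, word, -confidence))   # accept: weight is the negated confidence
        arcs.append((src, dst, word, EPS, REJECT_WEIGHT))  # reject: consume the word, emit nothing
        src = dst
    return arcs


# First candidate from the example, with English stand-in tokens.
first_candidate = [("bus_center", 2.0), ("from", 1.2)]
arcs = hypothesis_to_arcs(first_candidate, start=0, first=1)
for arc in arcs:
    print(*arc)
```

With these weights, a confidently recognized word is cheaper to keep than to reject, while a word whose reliability value is low or negative (such as the -0.8 word in the second candidate) becomes cheaper to drop once downstream weights are added.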
Rule-type interpretation means 18A
The word sequence of the speech recognition result is interpreted based on predefined speech interpretation rules. Specifically, each rule is expressed as a weighted finite-state transducer that takes the word sequence defined by the rule as input and produces the corresponding interpretation result as output. A constant weight is added for each application of a rule.

As an example, let the weight for applying a rule be +1.0. A weighted finite-state transducer applying a speech interpretation rule that outputs the interpretation result "stopfrom=(busstop=(研究所前)),stopto=(busstop=(バスセンター))" when the word sequence "研究所前 から バスセンター まで" ("from 研究所前 [in front of the laboratory] to the bus center") is input is as follows.
========
0  1  epsilon  stopfrom=(busstop=(研究所前)),stopto=(busstop=(バスセンター))  1.0
1  2  研究所前  epsilon
2  3  から  epsilon
3  4  バスセンター  epsilon
4  5  まで  epsilon
5
========
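The construction above follows a regular pattern: one arc that emits the whole interpretation at the rule weight, followed by one arc per word that consumes the word and emits nothing. A minimal sketch, assuming a hypothetical helper `rule_to_arcs` and English stand-in tokens:

```python
from typing import List, Sequence, Tuple

EPS = "epsilon"

# (current state, next state, input symbol, output symbol, weight)
Arc = Tuple[int, int, str, str, float]


def rule_to_arcs(pattern: Sequence[str], interpretation: str,
                 rule_weight: float = 1.0) -> Tuple[List[Arc], int]:
    """Build the arcs for one interpretation rule and return them with the final state."""
    # The first arc reads nothing and emits the interpretation, carrying the rule weight.
    arcs: List[Arc] = [(0, 1, EPS, interpretation, rule_weight)]
    # Each following arc consumes one word of the pattern at zero weight.
    for i, word in enumerate(pattern, start=1):
        arcs.append((i, i + 1, word, EPS, 0.0))
    return arcs, len(pattern) + 1


pattern = ["lab_front", "from", "bus_center", "to"]
result = "stopfrom=(busstop=(lab_front)),stopto=(busstop=(bus_center))"
rule_arcs, final_state = rule_to_arcs(pattern, result)
```

For the four-word pattern this yields five arcs and final state 5, the same shape as the table.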
Specific-type word extraction means 18B
For words of predefined types that represent specific content, this means treats outputting the word as the interpretation. Specifically, it is expressed as a weighted finite-state transducer that, when a word of a specific type is input, outputs the word at a constant weight while marking it explicitly as a word of that type, and outputs any other word unchanged at a larger weight.

As an example, let the weight for outputting a specific-type word be +2.0, and let the weight for outputting a word that is not explicitly defined as a specific-type word but belongs to the vocabulary the system can handle be +3.0. A weighted finite-state transducer that handles the word "バスセンター" (bus center) of type busstop, which represents the content "the bus stop name is the bus center", the word "8時" (8 o'clock) of type hour, which represents the content "the time is 8 o'clock", and the word "です" (the copula "is"), whose type is not explicitly defined, is as follows.
========
0  1  バスセンター  busstop=(バスセンター)  2.0
0  1  8時  hour=(8)  3.0
0  1  です  です  3.0
1
========
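A sketch of how such a single-step transducer could be generated from a typed-word dictionary and a vocabulary list is given below. The function `word_extractor_arcs` and its arguments are hypothetical, the tokens are English or romanized stand-ins, and the sketch applies the +2.0 and +3.0 weights exactly as stated in the prose (typed words at the smaller weight, all other vocabulary words at the larger one).

```python
from typing import Dict, List, Sequence, Tuple

# (current state, next state, input symbol, output symbol, weight)
Arc = Tuple[int, int, str, str, float]


def word_extractor_arcs(typed_words: Dict[str, Tuple[str, str]],
                        vocabulary: Sequence[str],
                        typed_weight: float = 2.0,
                        other_weight: float = 3.0) -> List[Arc]:
    """Build one arc per vocabulary word from state 0 to the final state 1."""
    arcs: List[Arc] = []
    for word in vocabulary:
        if word in typed_words:
            kind, value = typed_words[word]
            # A typed word is tagged with its type and gets the smaller weight.
            arcs.append((0, 1, word, f"{kind}=({value})", typed_weight))
        else:
            # Any other known word is passed through unchanged at the larger weight.
            arcs.append((0, 1, word, word, other_weight))
    return arcs


typed = {"bus_center": ("busstop", "bus_center"), "8_oclock": ("hour", "8")}
extractor_arcs = word_extractor_arcs(typed, ["bus_center", "8_oclock", "desu"])
```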
Case-type interpretation means 18C
Based on the cases stored in the database 17, that is, word sequences and their corresponding interpretation results, speech interpretation is performed using the interpretations found in past cases. The construction is almost the same as for the rule-type interpretation means 18A; only the way weights are assigned in the weighted finite-state transducer differs. The weight for applying an interpretation from a case is the logarithm of the occurrence probability of the case in the database with its sign inverted.

As an example, suppose the pair consisting of the word sequence "研究所前 10時 発" ("departing 研究所前 at 10 o'clock") and the interpretation result "stopfrom=(busstop=(研究所前)),departtime=(hour=(8))" is stored in the database, and its occurrence probability in the database is 0.001. The weighted finite-state transducer that applies this case is as follows.
========
0  1  epsilon  stopfrom=(busstop=(研究所前)),departtime=(hour=(8))  -log(0.001)
1  2  研究所前  epsilon
2  3  10時  epsilon
3  4  発  epsilon
4
========
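The case weight -log(0.001) can be evaluated as in the short sketch below. The helper name `case_weight` is hypothetical, and the use of the natural logarithm is an assumption, because the text does not state the base.

```python
import math


def case_weight(probability: float) -> float:
    """Weight for applying a case: the negated logarithm of its occurrence probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability)  # natural log is an assumption; the base is not stated


weight = case_weight(0.001)
print(round(weight, 3))  # about 6.908 with the natural logarithm
```

Because the weight grows as the probability shrinks, a rarely observed case costs more to apply than a frequent one, so frequent cases are preferred when the smallest-weight path is chosen.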
Interpretation result selection means 15
Each of the rule-type interpretation means 18A, the specific-type word extraction means 18B, and the case-type interpretation means 18C can by itself give an interpretation result for a word sequence. If they operated independently, however, there would be no way to determine which of the three produced the valid interpretation. The weighted finite-state transducers built by the three means are therefore combined in parallel, and among the word sequences accepted by the combined weighted finite-state transducer, the one with the smallest weight is selected and adopted as the interpretation result. To handle word sequences of arbitrary length, and any number of repetitions of the word sequences in the speech understanding rules, the specific-type words, and the word sequences in the database cases, the closure of the parallel combination is constructed (a path is added from each final state of the transducer to the start state, so that the transducer network can be traversed any number of times). This closure is used as the means for selecting the interpretation result.
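The selection step can be illustrated with a simplified stand-in that avoids building real transducers. The sketch below handles a single recognition hypothesis with dynamic programming: at each word it either rejects the word at the constant weight or applies one "unit" (a rule, a typed word, or a case) that matches the next words, and units may follow one another any number of times, which plays the role of the closure. All names are hypothetical, the tokens and weights are sample values, and two simplifications are assumed: a word cannot be rejected in the middle of a unit's word sequence, and N-best input would be handled by running the function per hypothesis and keeping the smallest result.

```python
from typing import List, Tuple

REJECT_WEIGHT = 3.0  # constant rejection weight, as in the example in the text

# (input words, output symbol, weight): one entry per rule, typed word, or case
Unit = Tuple[Tuple[str, ...], str, float]


def best_interpretation(hyp: List[Tuple[str, float]],
                        units: List[Unit]) -> Tuple[float, List[str]]:
    """Return (smallest total weight, outputs) for one hypothesis [(word, confidence), ...]."""
    n = len(hyp)
    # best[i] holds the cheapest way found so far to account for the first i words.
    best: List[Tuple[float, List[str]]] = [(float("inf"), [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        weight_i, outputs_i = best[i]
        # Option 1: reject word i (epsilon output at the constant weight).
        if weight_i + REJECT_WEIGHT < best[i + 1][0]:
            best[i + 1] = (weight_i + REJECT_WEIGHT, outputs_i)
        # Option 2: accept the next words and let one unit consume them.
        for pattern, output, unit_weight in units:
            j = i + len(pattern)
            if pattern and j <= n and tuple(w for w, _ in hyp[i:j]) == pattern:
                # Accepted words contribute their negated confidence.
                total = weight_i - sum(c for _, c in hyp[i:j]) + unit_weight
                if total < best[j][0]:
                    best[j] = (total, outputs_i + [output])
    return best[n]


RULE_OUTPUT = "stopfrom=(busstop=(lab_front)),stopto=(busstop=(bus_center))"
units: List[Unit] = [
    (("lab_front", "from", "bus_center", "to"), RULE_OUTPUT, 1.0),  # rule
    (("bus_center",), "busstop=(bus_center)", 2.0),                 # typed word
    (("from",), "from", 3.0),                                       # untyped words
    (("wa",), "wa", 3.0),
]

noisy = [("bus_center", 2.0), ("from", 1.0), ("wa", -0.8)]
weight, outputs = best_interpretation(noisy, units)
print(weight, outputs)
```

In the noisy hypothesis the last word has a negative reliability value, so keeping it would cost 0.8 + 3.0 while rejecting it costs 3.0, and it is dropped from the output.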
Case registration means 22

This means registers pairs of a word sequence and its corresponding interpretation result as cases in the database used by the case-type interpretation means 18C. It takes as input a file listing such pairs and registers each pair in the database as a case. There are two kinds of input file. The first is a case file, written from records of dialogues between humans and the computer system, containing the word sequences actually uttered (transcriptions) labeled with their interpretation results. The second is a selected case file created by the case selection means 25.
Case selection means 25
This means makes it possible to register cases in the database 17 used by the case-type interpretation means 18C from dialogue records that the system merely logged, rather than from curated dialogue records that humans have transcribed and labeled with semantic content. From the pairs of a speech-recognition word sequence and its corresponding interpretation result, to which reliability information has been added by the case reliability information adding means 24, it selects those whose reliability evaluation value exceeds a fixed value and writes them to a file (the selected case file) held in the database 17.
Case reliability information adding means 24
The case reliability information adding means 24 calculates an utterance-level reliability evaluation value for each word sequence in the speech recognition results of the dialogue records, attaches this utterance-level reliability information to the word sequence, and passes it to the case selection means 25. The speech recognition reliability evaluation method proposed in Japanese Patent Application No. 2003-27926 can be used as the reliability evaluation method.

[Processing flow of the embodiment]
First, the operation of the speech interpretation apparatus 10 is described. To keep the description simple, the following example covers only the part of a speech interpretation system for bus timetables that interprets information about the waypoint ("via"). The parts that interpret other content are processed in the same way.
The speech interpretation rules are recorded in an XML file (A.xml, the system description file) in the following format.
<class name="specify_via" type="Action">
<entry>観音坂経由</entry>
<entry>経由地は船子</entry>
</class>
This example defines an utterance of the type "specify a waypoint" (specify_via). The two entries read "via Kannonzaka" (観音坂経由) and "the waypoint is Funako" (経由地は船子).

The specific-type words are defined in the same XML file (A.xml) in the following format.
<class name="via" type="Key">
<entry>観音坂</entry>
<entry>船子</entry>
<entry>広町橋</entry>
</class>
This example defines words of the type "waypoint" (via): Kannonzaka (観音坂), Funako (船子), and Hiromachibashi (広町橋).

The database referenced by the case-type interpretation means 18C, on the other hand, is recorded in a system description file held in the database 17. Each case pairs a word sequence with its corresponding interpretation result in the following format.
観音坂を経由：specify_via via=(観音坂)
船子を通る：specify_via via=(広町橋)
Here 観音坂を経由 reads "via Kannonzaka" and 船子を通る reads "passing through Funako".
A file containing such cases is created in one of two ways. In the first, pairs of word sequences actually uttered by humans and their interpretation results are collected in advance, the utterances are transcribed, and the corresponding interpretation labels are attached, which gives a case file. In the second, records of dialogues between humans and the system are consulted, and an utterance-level reliability evaluation value is calculated using the speech recognition reliability evaluation method of Japanese Patent Application No. 2003-027926 (described later). A file in the format below is created, in which each speech recognition result carries its utterance-level reliability evaluation value, and the entries whose reliability evaluation value exceeds a fixed threshold (for example, greater than 0) are selected, which gives a selected case file.
20021114-02-02_08,0.861485:船子を通る[CONCEPT]specify_via via=(船子)
20021114-02-02_21,-1.65177:１時まで[CONCEPT]specify_time hour=(1)
From left to right, each line holds the utterance ID, the utterance-level reliability evaluation value, the word sequence of the speech recognition result, and, after [CONCEPT], the speech interpretation result corresponding to that word sequence. Here １時まで reads "until 1 o'clock".
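The selection of cases from such a file can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `parse_line` and `select_cases` are hypothetical, the field layout (utterance ID, comma, reliability value, colon, word sequence, [CONCEPT], interpretation) is inferred from the two sample lines, and full-width brackets are normalized to ASCII before matching.

```python
import re

# Assumed layout: <utterance id>,<reliability>:<word sequence>[CONCEPT]<interpretation>
LINE_RE = re.compile(
    r"^(?P<utt_id>[^,]+),\s*(?P<score>[-+]?\d+(?:\.\d+)?):"
    r"\s*(?P<words>.*?)\s*\[CONCEPT\]\s*(?P<concept>.*)$"
)


def parse_line(line):
    """Split one log line into its four fields."""
    normalized = line.strip().replace("［", "[").replace("］", "]")
    match = LINE_RE.match(normalized)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return {
        "utt_id": match["utt_id"],
        "score": float(match["score"]),
        "words": match["words"],
        "concept": match["concept"],
    }


def select_cases(lines, threshold=0.0):
    """Keep (word sequence, interpretation) pairs whose reliability exceeds the threshold."""
    selected = []
    for line in lines:
        if not line.strip():
            continue
        record = parse_line(line)
        if record["score"] > threshold:
            selected.append((record["words"], record["concept"]))
    return selected


log_lines = [
    "20021114-02-02_08,0.861485:船子を通る［CONCEPT］specify_via via=(船子)",
    "20021114-02-02_21,-1.65177:１時まで［CONCEPT］specify_time hour=(1)",
]
selected_cases = select_cases(log_lines)
print(selected_cases)
```

With the threshold of 0 mentioned above, only the first sample line is kept, because the second has a negative reliability evaluation value.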

The cases stored in the database are then extracted and recorded in the system description file (A.xml) in the following format.
<example>
<entry acttype="specify_via" concept="via=(観音坂)" prob="0.01">観音坂を経由</entry>
<entry acttype="specify_via" concept="via=(広町橋)" prob="0.001">船子を通る</entry>
</example>
In this example, the word sequence 観音坂を経由 ("via Kannonzaka"), which specifies the waypoint Kannonzaka, appears among the cases in the database with probability 0.01, and the word sequence "passing through Hiromachibashi" (広町橋を通る), which specifies the waypoint Hiromachibashi, appears with probability 0.001.

The system description file (A.xml) containing the content above is processed, and the rule-type interpretation means 18A, the specific-type word extraction means 18B, and the case-type interpretation means 18C each create their own weighted finite-state transducer. When a word sequence defined for the rule-type interpretation means 18A, or a case word sequence used by the case-type interpretation means 18C, contains a specific-type word, it is modified so that the same rule still applies when that word is replaced by another word of the same type. The weight for applying a rule in the rule-type interpretation means 18A is +1.0. The weight for extracting a specific-type word in the specific-type word extraction means 18B (in this example 観音坂 "Kannonzaka", 船子 "Funako", and 広町橋 "Hiromachibashi") is +2.0. The weight for accepting a word that is not defined as a specific-type word (in this example えーと "um", バス "bus", です "is", and so on) is +3.0. FIG. 2 shows an example of the weighted finite-state transducer obtained by combining these in parallel and taking the closure.

In the figure, the initial state is 0, each state is drawn as a circle, and the number inside the circle is the state number. A double circle marks a final state. Each state-transition arrow is labeled in the format "input symbol, output symbol / weight", giving the input symbol, output symbol, and weight of the transition. An input or output symbol of epsilon means that the transition has no input symbol or no output symbol, respectively.
The speech interpretation result is obtained by finding the state-transition sequence through this weighted finite-state transducer that minimizes the weight.

Based on the speech interpretation rules and specific-type word definitions above, word sequences of the forms "(word of the waypoint type) 経由" and "経由地は (word of the waypoint type)" can be interpreted. For example, the interpretation result for the word sequence 経由地は広町橋 ("the waypoint is Hiromachibashi") is specify_via via=(広町橋), meaning that the waypoint Hiromachibashi is specified. The state-transition sequence is 0->1->6-(input 経由地)->7-(input は)->8-(input 広町橋)->9->10 (final state), and the weight is 1. The output of the speech interpretation apparatus 10 is then
1.0 specify_via via=(広町橋)
that is, the pair of the interpretation weight and the interpretation result.

A word sequence such as えーと観音坂 ("um, Kannonzaka"), which matches no speech interpretation rule but contains a word of a specific type (waypoint), yields the interpretation result via=(観音坂), meaning that the waypoint Kannonzaka was uttered. The state-transition sequence is 0->1->11-(input えーと)->12->13->24->0->1->11-(input 観音坂)->12->13 (final state), and the weight is 5. The output of the speech interpretation apparatus 10 is then
5.0 via=(観音坂)
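The minimum-weight search over the closure of the parallel combination can be illustrated with a small dynamic program over word positions. The sketch below is not the transducer implementation of the patent: it covers only rule-type interpretation and specific-type word extraction, the `<via>` slot notation and all function names are assumptions, and case-type arcs (with weight -log of the case probability) would be added as further segment candidates in the same way.

```python
RULE_WEIGHT, KEY_WEIGHT, OTHER_WEIGHT = 1.0, 2.0, 3.0

# Specific-type words and their type, as in the "via" class definition.
KEY_WORDS = {"観音坂": "via", "船子": "via", "広町橋": "via"}

# Rules generalized so that any word of the "via" type fills the slot.
RULES = [
    (["<via>", "経由"], "specify_via"),
    (["経由地", "は", "<via>"], "specify_via"),
]


def match_rule(pattern, words):
    """Return the slot outputs if `words` matches `pattern`, else None."""
    if len(pattern) != len(words):
        return None
    slots = []
    for pat, word in zip(pattern, words):
        if pat.startswith("<") and pat.endswith(">"):
            kind = pat[1:-1]
            if KEY_WORDS.get(word) != kind:
                return None
            slots.append(f"{kind}=({word})")
        elif pat != word:
            return None
    return slots


def interpret(words):
    """Return (weight, interpretation) of the minimum-weight segmentation."""
    n = len(words)
    best = [None] * (n + 1)          # best[i]: cheapest way to consume words[:i]
    best[0] = (0.0, [])
    for i in range(n):
        weight, outputs = best[i]
        word = words[i]
        if word in KEY_WORDS:        # extract a specific-type word
            candidates = [(i + 1, KEY_WEIGHT, [f"{KEY_WORDS[word]}=({word})"])]
        else:                        # accept any other word without output
            candidates = [(i + 1, OTHER_WEIGHT, [])]
        for pattern, acttype in RULES:
            j = i + len(pattern)
            if j > n:
                continue
            slots = match_rule(pattern, words[i:j])
            if slots is not None:    # apply a rule to the span words[i:j]
                candidates.append((j, RULE_WEIGHT, [acttype] + slots))
        for j, step_weight, out in candidates:
            total = weight + step_weight
            if best[j] is None or total < best[j][0]:
                best[j] = (total, outputs + out)
    total, outputs = best[n]
    return total, " ".join(outputs)


print(interpret(["経由地", "は", "広町橋"]))
print(interpret(["えーと", "観音坂"]))
```

Returning to position 0 after each segment plays the role of the closure: every segment ends in a final state and the search restarts from the start state for the remaining words.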

Furthermore, when the word sequence 船子を通るバスです ("it is the bus that passes through Funako") is input, the part 船子を通る is contained in the cases of the database and is interpreted by the case-type interpretation means 18C. The words バス and です, which are not defined as specific-type words, are handled by the specific-type word extraction means 18B as words of no specific type. The interpretation result specify_via via=(船子) is therefore obtained. The state-transition sequence is 0->1->19-(input 船子)->20-(input を)->21-(input 通る)->22->23->24->0->1->11-(input バス)->12->13->24->0->1->11-(input です)->12->13 (final state), and the weight is 8 (-log(0.001)=3). The output of the speech interpretation apparatus 10 is then
8.0 specify_via via=(船子)
that is, the pair of the interpretation weight and the interpretation result.

Suppose the following speech recognition results are obtained:
観音坂 (reliability evaluation value +1.5) は (reliability evaluation value -0.5) 経由 (reliability evaluation value +1.0)
観音坂 (reliability evaluation value +1.5) が (reliability evaluation value -1.0) 経由 (reliability evaluation value +1.0)
With the weight for rejecting a recognized word set to +3.0, the weighted finite-state transducer to which the reliability information adding means 14 has attached reliability information (using the sign-inverted reliability evaluation value as the weight) is as shown in FIG. 3. The two weighted finite-state transducers are composed so that the output symbols of the transducer in FIG. 3, which represents the word sequences of the speech recognition result, serve as the input symbols of the transducer in FIG. 2, which represents the speech interpretation means. Finding the state-transition sequence with minimum weight on the composed transducer gives the optimal speech interpretation result. In the example of FIGS. 2 and 3, the weight is smallest when the state-transition sequence on the FIG. 3 side is 0->1->2-(input 観音坂, output 観音坂)->3-(input は, は rejected)->4-(input 経由, output 経由)->5 (final state), which outputs 観音坂経由, and the sequence on the FIG. 2 side is 0->1->2-(input 観音坂)->3-(input 経由)->4->5 (final state). The weight on the FIG. 3 side is then 0.5 and the weight on the FIG. 2 side is 1. The final interpretation result is
1.5 specify_via via=(観音坂)
If there were no mechanism for rejecting words as in FIG. 3, and the word sequence 観音坂は経由 were applied to the state transitions of the weighted finite-state transducer of FIG. 3, neither rule-type nor case-type interpretation could handle it, so the result would be
8.0 via=(観音坂)
and the utterance type "specify a waypoint" could not be recognized.
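The arithmetic of this example can be checked with a short Python sketch. It is a simplification under stated assumptions: instead of composing two transducers, it enumerates every keep/reject choice for a single recognition hypothesis, and `interpretation_weight` is a hypothetical stand-in for the FIG. 2 side that knows only the "(waypoint) 経由" rule (weight 1.0), specific-type words (2.0), and other words (3.0).

```python
from itertools import product

REJECT_WEIGHT = 3.0
KEY_WORDS = {"観音坂", "船子", "広町橋"}


def interpretation_weight(words):
    """Toy stand-in for the interpretation side (FIG. 2)."""
    if len(words) == 2 and words[0] in KEY_WORDS and words[1] == "経由":
        return 1.0, f"specify_via via=({words[0]})"
    weight, outputs = 0.0, []
    for word in words:
        if word in KEY_WORDS:
            weight += 2.0
            outputs.append(f"via=({word})")
        else:
            weight += 3.0
    return weight, " ".join(outputs)


def best_interpretation(hypothesis):
    """hypothesis: list of (word, reliability). Returns (total, result, kept words)."""
    best = None
    for keep_flags in product([True, False], repeat=len(hypothesis)):
        recog_weight, kept = 0.0, []
        for (word, reliability), keep in zip(hypothesis, keep_flags):
            if keep:
                recog_weight += -reliability   # sign-inverted reliability value
                kept.append(word)
            else:
                recog_weight += REJECT_WEIGHT  # cost of rejecting the word
        interp_weight, result = interpretation_weight(kept)
        total = recog_weight + interp_weight
        if best is None or total < best[0]:
            best = (total, result, kept)
    return best


total, result, kept = best_interpretation(
    [("観音坂", 1.5), ("は", -0.5), ("経由", 1.0)]
)
print(total, result, kept)
```

Rejecting は costs -1.5 + 3.0 - 1.0 = 0.5 on the recognition side, and the remaining words 観音坂 経由 match the rule at weight 1.0, which reproduces the total of 1.5 given in the text.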

The speech recognition reliability evaluation method proposed in the prior Japanese Patent Application No. 2003-27926 is described below. That application proposes a speech recognition reliability evaluation method and apparatus attached to a spoken dialogue system. The description here is confined, as far as possible, to the reliability evaluation method and apparatus.
FIG. 4 shows an example of the procedure for executing the speech recognition reliability evaluation method used in the spoken dialogue system of the prior application. The description assumes that the reliability evaluation process starts after the dialogue has ended.

In step SP1 shown in FIG. 4, dialogue information is read from the dialogue information storage means.
In step SP2, the word sequence (P) that the user is estimated to have uttered to request information is extracted.
In step SP3, the user's information request interpreted from the estimated word sequence (P) is identified as (P)′.
In step SP4, the words contained in the word sequence (P) extracted in step SP2 are identified as (Q0), (Q1), (Q2), ..., (Qn).
In step SP5, the correct word sequence that the user actually uttered to request information, corresponding to the word sequence (P), is identified as (R).
In step SP6, the user's correct information request, interpreted from the correct word sequence (R), is identified as (R)′.
In step SP7, the word that the user actually uttered to request information, corresponding to the word string (Qi), is identified as (Si).

In step SP8 shown in FIG. 5, utterance-level index values and word-level index values are calculated per utterance and per word, respectively. They comprise the indices that the speech recognition means computed at each point in the exchange between user and system, when it recognized the user's speech and estimated the candidate word sequences (P) that the user may have uttered, together with indices that become available once the exchange between user and system has ended.
In step SP9, measures for evaluating reliability at the utterance level and at the word level are created. This is done by relating the calculated utterance-level and word-level index values to a binary (0 or 1) indicator of correctness at the utterance level and at the word level. The indicator is obtained by comparing the word sequence that the system recognized from the user's utterance with the correct, actually uttered word sequence recorded by the dialogue information recording means.
In step SP10, the word sequence (P) is extracted as in step SP2. In step SP11, the user's information request interpreted from the word sequence (P) is identified as (P)′, as in step SP3.

In step SP12, the word strings (Q0), (Q1), (Q2), ..., (Qn) are identified as in step SP4.
In step SP13, index values are calculated per utterance and per word as in step SP8.
In step SP14, the calculated utterance-level and word-level index values are applied to the utterance-level and word-level reliability measures created in step SP9, and the utterance-level and word-level reliability evaluation values are calculated.

FIG. 6 shows an embodiment of a spoken dialogue system realized on a computer, together with a speech recognition reliability evaluation apparatus that evaluates the speech recognition reliability of that system.
As is well known, the computer comprises a CPU 31 that decodes and executes programs, a read-only memory (ROM) 32, a RAM 33 that stores programs and the like for execution, an input port 34, an output port 35, and so on. The figure shows the case in which a loudspeaker device 41 is connected to the output port 35 and drives a speaker, so that responses from the system are output as speech.

The RAM 33 stores programs that constitute the information request input means 33A, the dialogue information recording means 33B, the format conversion means 33C, the speech recognition means 33D, the dialogue end determination means 33E, and the information providing means 33F. These programs and the CPU 31 together constitute the spoken dialogue system 100.
In addition to the configuration of the spoken dialogue system 100, the speech recognition reliability evaluation apparatus 200 stores in the RAM 33 programs that constitute the word sequence (P) extraction means 33G, the information request content (P)′ identification means 33H, the word string Q0, Q1, Q2, ... extraction means 33I, the correct word identification means 33J, the correct information content identification means 33K, the uttered word identification means 33L, the utterance-level and word-level index value calculation means 33M, the utterance-level and word-level reliability measure creation means 33N, and the utterance-level and word-level reliability evaluation value calculation means 33P. The CPU 31 executes these programs together with the programs constituting the spoken dialogue system 100, which constitutes the speech recognition reliability evaluation apparatus 200 and carries out the speech recognition reliability evaluation method.

Embodiments of each part are described below, using a bus timetable guidance system based on spoken dialogue as the example. In this system the user gives the system six items of information: the bus stop to board at, the bus stop to get off at, the waypoint of the bus, the time, whether the specified time is the departure time or the arrival time, and the day of the week. The system then tells the user the departure times of the matching buses.
The dialogue information recording means, the index value calculation means, and the reliability measure creation means of this system are detailed below.

Dialogue information recording means
The spoken dialogue system understands the user's information request as pairs of an attribute and a value. Each such pair is called a slot. The system holds the user's information request in a data structure made up of several slots and uses it to advance the dialogue. This data structure is called the dialogue state. In the bus timetable guidance system, the dialogue state consists of the following six slots.
(STOP_FROM.value): the bus stop to board at
(STOP_TO.value): the bus stop to get off at
(VIA.value): the waypoint of the bus
(TIME.value): the time
(TIME_TYPE.value): whether the specified time is the departure time or the arrival time
(DAY.value): the day of the week
A semantic representation contained in the user's utterance and obtained when the system interprets it, such as "fill slot (S) with value (s)", is called a dialogue act. For example, the dialogue acts for the user utterance "from ○○ Bus Center to ※※ Gakuin University" are expressed as follows.
(SET-STOP_FROM (○○ Bus Center)) (SET-STOP_TO (※※ Gakuin University))
As the dialogue record, the system records the user's speech at each point in the dialogue and, in the form of a dialogue state, the user's information request as settled when the dialogue ends.
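The dialogue state and the effect of dialogue acts on it can be sketched in Python as follows. The dictionary representation and the function names `new_dialogue_state` and `apply_dialogue_acts` are assumptions made for illustration; only the slot names and the SET- prefix come from the description.

```python
SLOTS = ("STOP_FROM", "STOP_TO", "VIA", "TIME", "TIME_TYPE", "DAY")


def new_dialogue_state():
    """Return an empty dialogue state with all six slots unfilled."""
    return {slot: None for slot in SLOTS}


def apply_dialogue_acts(state, acts):
    """Return a copy of `state` updated by acts such as ("SET-STOP_FROM", value)."""
    new_state = dict(state)
    for act, value in acts:
        if not act.startswith("SET-"):
            raise ValueError(f"unsupported dialogue act: {act}")
        slot = act[len("SET-"):]
        if slot not in new_state:
            raise KeyError(f"unknown slot: {slot}")
        new_state[slot] = value
    return new_state


state = apply_dialogue_acts(
    new_dialogue_state(),
    [("SET-STOP_FROM", "○○ Bus Center"), ("SET-STOP_TO", "※※ Gakuin University")],
)
print(state)
```

Returning a fresh dictionary rather than modifying the argument keeps the state at each point of the dialogue available for recording.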

Utterance-level index value calculation means
Each recorded user utterance is recognized with the same speech recognition means that was used when the dialogue was recorded, and up to five candidate word sequences uttered by the user are estimated. For each recognition candidate, the utterance-level index values U1 to U19 shown in FIG. 7 are calculated from the candidate's word sequence and the scores used during recognition.
Utterance-level reliability measure creation means
Let (P) be the word sequence obtained by recognizing the user's utterance and (Q) the correct word sequence obtained when a human transcribes the same utterance. Based on the index values calculated by the utterance-level index value calculation means, a measure is created that gives the reliability with which the dialogue act obtained by interpreting (P) matches the dialogue act obtained by interpreting (Q). Reference: Timothy J. Hazen, Stephanie Seneff, and Joseph Polifroni, "Recognition confidence scoring and its use in speech understanding systems", Computer Speech and Language, 2002, vol. 16, pp. 46-67.

Using the column vector f^ of indices calculated by the utterance-level index value calculation means and a column vector p^ of the same dimension, the reliability with which the dialogue act obtained by interpreting (P) matches the dialogue act obtained by interpreting (Q) is evaluated through a one-dimensional value r, according to Equations 1 and 2.

In Equation 2, t is a threshold. p(r|correct) and p(r|incorrect) are the Gaussian density functions (Equations 3 and 4) of r when it is an index value based on a correct utterance recognition result and on an incorrect one, respectively. P(correct) and P(incorrect) are the posterior probabilities of observing a correct and an incorrect utterance recognition result, respectively. Let μ(correct) and σ²(correct) be the mean and variance of r when the utterance recognition result is correct, and μ(incorrect) and σ²(incorrect) the mean and variance of r when it is incorrect. Then p(r|correct) and p(r|incorrect) can be calculated by Equations 3 and 4, respectively.
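The equations themselves are not reproduced in this text. One form that is consistent with the symbols defined above, and with the linear-projection confidence scoring of the cited Hazen et al. paper, is sketched below. It is an assumption rather than a reproduction of the original Equations 1 to 4; in particular the symbol c for the resulting score is introduced here for illustration.

```latex
\begin{align}
r &= \hat{p}^{\top}\hat{f} \tag{1}\\
c &= \log\frac{p(r\mid\mathrm{correct})\,P(\mathrm{correct})}
              {p(r\mid\mathrm{incorrect})\,P(\mathrm{incorrect})} - t \tag{2}\\
p(r\mid\mathrm{correct}) &= \frac{1}{\sqrt{2\pi\sigma^{2}(\mathrm{correct})}}
  \exp\!\left(-\frac{\bigl(r-\mu(\mathrm{correct})\bigr)^{2}}{2\sigma^{2}(\mathrm{correct})}\right) \tag{3}\\
p(r\mid\mathrm{incorrect}) &= \frac{1}{\sqrt{2\pi\sigma^{2}(\mathrm{incorrect})}}
  \exp\!\left(-\frac{\bigl(r-\mu(\mathrm{incorrect})\bigr)^{2}}{2\sigma^{2}(\mathrm{incorrect})}\right) \tag{4}
\end{align}
```

Under this form, a positive c means the recognition result is judged more likely correct than incorrect after the threshold t is applied.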

The utterance-level reliability measure creation means uses a small set of records for which transcriptions of the utterances exist, serving as the correct answers for the utterance recognition results. From the relationship between each utterance recognition result and a binary indicator of whether the dialogue act obtained by interpreting it is correct, it determines the optimal value of each element of the vector p^ needed to apply Equations (1) to (4), and the corresponding t, μ(correct), σ²(correct), μ(incorrect), and σ²(incorrect). The vector p^ is initialized by Fisher's linear discriminant analysis, and the value of each element is then updated repeatedly by hill climbing so that the correct/incorrect classification error is minimized.
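How a reliability evaluation value could be computed once these parameters are known is sketched below in Python. The sketch assumes a log-likelihood-ratio form for Equation 2 (projection r = p·f, two Gaussian densities weighted by P(correct) and P(incorrect), minus the threshold t), which matches the symbols defined in the text but is not confirmed by it. The function names are hypothetical, and the training step (Fisher initialization and hill climbing of p^) is not shown.

```python
import math


def gaussian_pdf(r, mean, variance):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((r - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def fit_gaussian(values):
    """Estimate mean and variance of r from labeled training examples."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def confidence_score(f, p, correct, incorrect, t):
    """Assumed form of the measure.

    f, p      : index vector and projection vector of the same length
    correct   : (mean, variance, prior) of r for correct recognition results
    incorrect : (mean, variance, prior) of r for incorrect recognition results
    t         : threshold
    """
    r = sum(pi * fi for pi, fi in zip(p, f))
    mean_c, var_c, prior_c = correct
    mean_i, var_i, prior_i = incorrect
    numerator = gaussian_pdf(r, mean_c, var_c) * prior_c
    denominator = gaussian_pdf(r, mean_i, var_i) * prior_i
    return math.log(numerator / denominator) - t


score = confidence_score(
    f=[2.0, 2.0],
    p=[1.0, 0.5],
    correct=(3.0, 1.0, 0.5),
    incorrect=(0.0, 1.0, 0.5),
    t=0.0,
)
print(score)
```

In this toy setting r equals 3.0, which sits exactly at the mean of the "correct" Gaussian, so the score is positive; a value of r near the "incorrect" mean would give a negative score.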

Utterance-level reliability evaluation value calculation means
The utterance-level index values calculated for speech recognition results in dialogue records other than those used to create the reliability measure are applied to the utterance-level reliability measure (Equation 2) to calculate the utterance-level reliability evaluation value.
Word-level index value calculation means
Using the same recognition results of the user's utterances as the utterance-level index value calculation means, the word-level index values W1 to W13 shown in FIG. 8 are calculated for each word in each recognition result, based on the word sequence of each recognition candidate and the scores used during recognition. The "utterance score" in FIG. 8 is calculated with the measure created by the utterance-level reliability measure creation means.

Word-level reliability measure creation means
Let (R) be a word obtained by recognizing the user's utterance and (S) the correct word obtained when a human transcribes the same utterance. Applying the same processing as in the utterance-level reliability measure creation means yields a word-level reliability measure.
Word-level reliability evaluation value calculation means
The word-level index values calculated for speech recognition results in dialogue records other than those used to create the reliability measure are applied to the word-level reliability measure (Equation 2) to calculate the word-level reliability evaluation value.

音声対話システムを用いてデータ収集を行なう。対話例を以下に示す。対話例中、Ｓはシステムの発話を、Ｕはユーザの発話をシステムが認識した結果得られた単語の系列を示す。また、括弧内は実際のユーザの発話を書き起こしたものである。
Ｕ：○○バスセンター８時は※※学院大学へ１時
（○○バスセンター発、※※学院大行き）
Ｓ：はい
Ｕ：１１時台出発のバス午後７時の３時
（１１台出発のバスを教えてください）
Ｓ：はい
Ｓ：出発地はどちらですか？
Ｕ：○○バスセンターです
（○○バスセンターです）
Ｓ：はい
Ｓ：どちらのバス停までですか？
Ｕ：※※学院大学で
（※※学院大学です）
Ｓ：はい
Ｓ：経由地を指定しますか？
Ｕ：しません
（しません）
Ｓ：はい
Ｓ：平日、土曜、日曜の、どの時刻表ですか？
Ｕ：休日です
（休日です）
Ｓ：はい○○バスセンターから※※学院大学まで経由地指定なし休日１９時までに到着の時刻表ということでよろしいでしょうか？
Ｕ：１２時台に出発の時刻表です
（１１時台に出発の時刻表です）
Ｓ：１２時台に出発デスネ？
Ｕ：１１時発車で
（１１時発です）
Ｓ：１１時台に出発デスネ？
Ｕ：はい
（はい）
Ｓ：○○バスセンターから※※学院大学まで経由地指定なし休日１１時台に出発の時刻表ということでよろしいでしょうか？
Ｕ：はい
（はい）
Ｓ：かしこまりました○○バスセンター発○○３６系統△△経由※※学院大学方面※※学院大学行き休日１１時台のバスの発車時刻は５分５５分です。
Data is collected using a spoken dialogue system. An example dialogue is shown below. In the example, S denotes a system utterance and U denotes the word sequence the system obtained by recognizing the user's utterance. The text in parentheses is a transcription of what the user actually said.
U: ○○ Bus Center 8 o'clock is to ※※ Gakuin University 1 o'clock
(Departing from ○○ Bus Center, bound for ※※ Gakuin University)
S: Yes
U: Bus departing in the 11 o'clock hour 7 p.m. 3 o'clock
(Please tell me the buses departing in the 11 o'clock hour)
S: Yes
S: Where are you departing from?
U: It is ○○ Bus Center
(It is ○○ Bus Center)
S: Yes
S: Which bus stop are you going to?
U: At ※※ Gakuin University
(It is ※※ Gakuin University)
S: Yes
S: Do you want to specify a via point?
U: No
(No)
S: Yes
S: Which timetable: weekday, Saturday, or Sunday?
U: Holiday
(Holiday)
S: Yes. From ○○ Bus Center to ※※ Gakuin University, no via point specified, holiday, timetable for arrival by 19:00; is that correct?
U: It is the timetable for departure in the 12 o'clock hour
(It is the timetable for departure in the 11 o'clock hour)
S: Departing in the 12 o'clock hour?
U: Departing at 11 o'clock
(It leaves at 11 o'clock)
S: Departing in the 11 o'clock hour?
U: Yes
(Yes)
S: From ○○ Bus Center to ※※ Gakuin University, no via point specified, holiday, timetable for departure in the 11 o'clock hour; is that correct?
U: Yes
(Yes)
S: Certainly. From ○○ Bus Center, ○○ route 36 via △△, toward ※※ Gakuin University, bound for ※※ Gakuin University: on holidays, buses in the 11 o'clock hour depart at 5 and 55 minutes past the hour.

対話情報記録手段２３は、対話の各時点におけるユーザの発話音声及び、対話終了時に確定したユーザの情報要求内容（以下に例示）を記録する。
（TIME_TYPE.DEPARTURE）（TIME.11）（DAY.休日）
（STOP_TO.※※学院大学）（VIA.ARBITRARY）（STOP_FROM.○○バスセンター）
次に、対話の各時点において記録したユーザの発話音声を、データを収集したときと同じ音声認識手段（図１に示す音声認識手段１３に相当する）を用いて認識し、最大５個の認識結果候補及び、音声認識スコアファイル１に出力させる。ファイル１の内容は以下のようになる（以下は最尤候補のみを抜粋）。 The dialogue information recording means 23 records the user's utterance voice at each point in the dialogue and the user's information request content (explained below) determined at the end of the dialogue.
(TIME_TYPE.DEPARTURE) (TIME.11) (DAY. Holiday)
(STOP_TO. ** Gakuin University) (VIA.ARBITRARY) (STOP_FROM.XX Bus Center)
Next, the user's speech recorded at each point in the dialogue is recognized using the same speech recognition means as was used when the data was collected (corresponding to the speech recognition means 13 shown in FIG. 1), and up to five recognition result candidates and their speech recognition scores are output to file 1. The contents of file 1 are as follows (only the maximum-likelihood candidate is excerpted below).

sentence1:11時発車で
wseq1:＜s＞11時発車で＜/s＞
phseq1:silB| j u: i ch i j i | h a q sh a | d e | silE
score1:−6708.832031
===word alignment begin===
id:from to n_score CM−mean CM−var CM−min CM−max
applied HMMs (logical [physical ]or[pseudo])
――――――――――――――――――――――――――――――――――
(0:0 52 −22.2466−1.0495 0.7000−2.8045 0.0000 silB)
(1:53 166−26.9416−2.204 1.8629−9.2010 0.0000 j＋u:[j＋u] j−u:＋i[y−u:＋i] u:−i＋ch[u−i＋ch] i−ch＋i ch−i＋j i−j＋i j−i＋h[v−i＋h])
(2:167 232−27.1568−2.9200 1.8325−7.3065 0.0000 i−h＋a h−a＋q a−q＋sh q−sh＋a sh−a＋d[y−a＋d])
(3:233 250 −27.1151−2.9540 2.4125−8.7062 −0.4162 a−d＋e d−e)
(4:251 254 −28.4927−6.0016 1.6088−8.5824 −4.3282 silE)
re−computed AM score:−6645.894531
(log−likelihood−ratio:−576.595320 (normalized to−2.261158))
Acoustic−score :−6645.894531(normalized to−26.062331)
===Word alignment end===
ファイル１の内容で、
re−computed AM scoreを単語数で除算したものが、図７の指標Ｕ６に相当する。
sentence1: departing at 11 o'clock (11時発車で)
wseq1: <s> 11時 発車 で </s>
phseq1: silB | j u: i ch i j i | h a q sh a | d e | silE
score1: −6708.832031
=== word alignment begin ===
id: from to n_score CM−mean CM−var CM−min CM−max
applied HMMs (logical [physical] or [pseudo])
――――――――――――――――――――――――――――――――――
(0: 0 52 −22.2466 −1.0495 0.7000 −2.8045 0.0000 silB)
(1: 53 166 −26.9416 −2.204 1.8629 −9.2010 0.0000 j+u:[j+u] j−u:+i[y−u:+i] u:−i+ch[u−i+ch] i−ch+i ch−i+j i−j+i j−i+h[v−i+h])
(2: 167 232 −27.1568 −2.9200 1.8325 −7.3065 0.0000 i−h+a h−a+q a−q+sh q−sh+a sh−a+d[y−a+d])
(3: 233 250 −27.1151 −2.9540 2.4125 −8.7062 −0.4162 a−d+e d−e)
(4: 251 254 −28.4927 −6.0016 1.6088 −8.5824 −4.3282 silE)
re−computed AM score: −6645.894531
(log−likelihood−ratio: −576.595320 (normalized to −2.261158))
Acoustic−score: −6645.894531 (normalized to −26.062331)
=== word alignment end ===
In the contents of file 1, the value obtained by dividing the re-computed AM score by the number of words corresponds to the index U6 in FIG. 7.
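The arithmetic behind the normalized score in the file 1 excerpt can be checked with a short sketch. The frame boundaries and the score are copied from the excerpt; the function names are hypothetical. One observation from the numbers, not something the text states: the printed normalized value (−26.062331) equals the score divided by 255, the number of frames spanned by the alignment (frames 0 through 254), so the sketch uses that frame count as the divisor.

```python
# (label, from_frame, to_frame) copied from the word alignment in file 1.
ALIGNMENT = [
    ("silB", 0, 52),
    ("11時", 53, 166),
    ("発車", 167, 232),
    ("で", 233, 250),
    ("silE", 251, 254),
]
AM_SCORE = -6645.894531  # re-computed AM score from the excerpt


def frame_count(alignment):
    """Number of frames covered, from the first segment's start to the last segment's end."""
    return alignment[-1][2] - alignment[0][1] + 1


def normalized_score(score, alignment):
    """Score divided by the frame count (assumed divisor, see the note above)."""
    return score / frame_count(alignment)


def frame_spans(alignment):
    """Difference between the 'to' and 'from' frame of each segment."""
    return {label: to - start for label, start, to in alignment}
```

The per-segment differences returned by `frame_spans` (113, 65, and 17 for the three words) are the same values that appear in the word-level index file later in this example.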

各単語のtoとfromの差が、図８の指標Ｗ７に相当する。
各単語のn−scoreが、図８の指標Ｗ１２に相当する。
各単語のCM−meanが、図８の指標Ｗ５に相当する。
各単語のCM−varの平方根が、図８の指標Ｗ４に相当する。
各単語のCM−minが、図８の指標Ｗ３に相当する。
また、認識結果候補の数が、図８の指標Ｗ９に相当する。
そして、ファイル１に出力された認識結果候補から、図７の指標Ｕ１０、Ｕ１２と図８の指標Ｕ１０、Ｕ１２と図８の指標Ｗ１１を得るために必要な、認識結果候補での単語出現頻度を計算する。以下に示すように、認識結果候補間で一致する単語の位置を合わせをして、認識結果候補中のある単語が、他の認識結果で同じ場所に現れる頻度を計算する。以下の例では、「発車」という単語の出現頻度は100％、「１時」という単語の出現頻度は60％である。
The difference between to and from of each word corresponds to the index W7 in FIG. 8.
The n-score of each word corresponds to the index W12 in FIG. 8.
The CM-mean of each word corresponds to the index W5 in FIG. 8.
The square root of the CM-var of each word corresponds to the index W4 in FIG. 8.
The CM-min of each word corresponds to the index W3 in FIG. 8.
Further, the number of recognition result candidates corresponds to the index W9 in FIG. 8.
Then, from the recognition result candidates output to file 1, the word appearance frequencies among the candidates are calculated; these are needed to obtain the indices U10 and U12 in FIG. 7, the indices U10 and U12 in FIG. 8, and the index W11 in FIG. 8. As shown below, matching words are aligned across the recognition result candidates, and the frequency with which a word in one candidate appears at the same position in the other candidates is calculated. In the following example, the appearance frequency of the word 発車 ("departure") is 100%, and that of the word 1時 ("1 o'clock") is 60%.
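The frequency count described above can be sketched as follows, using the five candidates of this example. This is a simplification under stated assumptions: the candidates are assumed to be already split into words, and a word is counted once per candidate that contains it, without the position alignment the text describes. The name `word_frequency` is hypothetical.

```python
def word_frequency(word, candidates):
    """Fraction of recognition result candidates that contain the word."""
    hits = sum(1 for words in candidates if word in words)
    return hits / len(candidates)


# The five candidates from the example, split into words (segmentation assumed).
CANDIDATES = [
    ["11時", "発車", "で"],
    ["10時", "1時", "発車", "で"],
    ["15時", "1時", "発車", "で"],
    ["14時", "1時", "発車", "で"],
    ["12時", "で", "発車", "まで"],
]
```

With these candidates, 発車 occurs in all five and 1時 in three of five, which matches the 100% and 60% figures quoted in the text. A position-aware version would first align the candidates, for example with an edit-distance alignment, and count matches per aligned slot.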

最尤候補:11時発車で
第2候補:10時1時発車で
第3候補:15時1時発車で
第4候補:14時1時発車で
第5候補:12時で発車まで
また、システムが単語の系列を構文解析するために用いる語彙定義及び文節定義（以下に抜粋）を用いて、
(1時いちじ時間 nil 1)
(11時じゅーいちじ時間 nil 11)
(12時じゅーにじ時間 nil 12)
(14時じゅーよじ時間 nil 14)
(15時じゅーごじ時間 nil 15)
(でで助詞デ＊＊)
(までまで助詞マデ :made ＊)
(発車発車普通名詞_出発:departure)
(普通名詞_出発文節
(普通名詞_出発(opt(or助詞ハ提示する語助詞デ))))
(時間文節
(時間(opt 普通名詞_台))
(opt(or
助詞ノ助詞デ助詞ニ (助詞マデ(opt 助詞ニ))
提示する語終助詞の間投詞的用法))))
各認識結果候補を構文解析した結果（以下に抜粋）から正しく構文解析できなかった単語の系列（未知文節と呼ぶ）を探し、図７の指標Ｕ１９及び図８の指標Ｗ１３を計算する。
Maximum-likelihood candidate: 11時発車で
2nd candidate: 10時1時発車で
3rd candidate: 15時1時発車で
4th candidate: 14時1時発車で
5th candidate: 12時で発車まで
In addition, using the vocabulary definitions and phrase definitions (excerpted below) that the system uses to parse word sequences,
(1時 ichiji time nil 1)
(11時 juuichiji time nil 11)
(12時 juuniji time nil 12)
(14時 juuyoji time nil 14)
(15時 juugoji time nil 15)
(で de particle-DE * *)
(まで made particle-MADE :made *)
(発車 hassha common-noun_departure :departure)
(common-noun_departure phrase
(common-noun_departure (opt (or particle-HA presenting-word particle-DE))))
(time phrase
(time (opt common-noun_dai))
(opt (or
particle-NO particle-DE particle-NI (particle-MADE (opt particle-NI))
presenting-word interjectional-use-of-sentence-final-particle))))
word sequences that could not be parsed correctly (called unknown phrases) are searched for in the results of parsing each recognition result candidate (excerpted below), and the index U19 in FIG. 7 and the index W13 in FIG. 8 are calculated.

(11時)時間文節|(発車で)普通名詞_出発文節
(10時)時間文節|(1時)時間文節|(発車で) 普通名詞_出発文節
(15時)時間文節|(1時)時間文節|(発車で) 普通名詞_出発文節
(14時)時間文節|(1時)時間文節|(発車で) 普通名詞_出発文節
(12時で)時間文節|(発車) 普通名詞_出発文節|(まで)未知文節
そして、対話中の各ユーザ発話に対する各認識結果候補システムが解釈した結果得られる対話行為（以下に抜粋）を求め、図７の指標Ｕ１８を、さらに、対話終了時に確定したユーザの情報要求内容を比較し、図７のＵ１６、Ｕ１７を計算する。
(11時) time phrase | (発車で) common-noun_departure phrase
(10時) time phrase | (1時) time phrase | (発車で) common-noun_departure phrase
(15時) time phrase | (1時) time phrase | (発車で) common-noun_departure phrase
(14時) time phrase | (1時) time phrase | (発車で) common-noun_departure phrase
(12時で) time phrase | (発車) common-noun_departure phrase | (まで) unknown phrase
Then, the dialogue acts (excerpted below) that the system obtains by interpreting each recognition result candidate for each user utterance in the dialogue are determined, and the index U18 in FIG. 7 is calculated. Furthermore, these dialogue acts are compared with the user's information request content determined at the end of the dialogue, and U16 and U17 in FIG. 7 are calculated.

(SET−TIME(11))(SET_TYPE(DEPARTURE))
(SET−TIME(10))(SET_TYPE(DEPARTURE))
(SET−TIME(15))(SET_TYPE(DEPARTURE))
(SET−TIME(14))(SET_TYPE(DEPARTURE))
(SET−TIME(12))(SET_TYPE(DEPARTURE))
また、各ユーザ発話を人間が書き起こした単語の系列と、各認識結果候補の各単語一致していいるかどうかが、書き起こした単語の系列から得られる対話行為と、認識結果候補から得られる対話行為がすべて一致しているかどうかを比較して、その結果を０と１の二値で表現する。 (SET−TIME (11)) (SET_TYPE (DEPARTURE))
(SET−TIME (10)) (SET_TYPE (DEPARTURE))
(SET−TIME (15)) (SET_TYPE (DEPARTURE))
(SET−TIME (14)) (SET_TYPE (DEPARTURE))
(SET−TIME (12)) (SET_TYPE (DEPARTURE))
In addition, the word sequence transcribed by a human for each user utterance is compared with each recognition result candidate to determine whether each word matches, and the dialogue acts obtained from the transcribed word sequence are compared with those obtained from the recognition result candidate to determine whether they all match. Each result is expressed as a binary value of 0 or 1.
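The two binary labels described above can be sketched as follows. This is an illustrative simplification: dialogue acts are modeled as (name, value) pairs with the act names normalized to one spelling, the word label tests only whether the recognized word occurs anywhere in the transcription, and both function names are hypothetical.

```python
def utterance_label(reference_acts, candidate_acts):
    """1 if the candidate yields exactly the dialogue acts of the transcription, else 0."""
    return 1 if set(reference_acts) == set(candidate_acts) else 0


def word_labels(reference_words, candidate_words):
    """Pair each recognized word with 1 if it occurs in the transcription, else 0."""
    reference = set(reference_words)
    return [(word, 1 if word in reference else 0) for word in candidate_words]


# Transcription 「11時発です」 versus the top candidate 「11時発車で」.
REFERENCE_ACTS = [("SET-TIME", "11"), ("SET-TIME_TYPE", "DEPARTURE")]
TOP_CANDIDATE_ACTS = [("SET-TIME", "11"), ("SET-TIME_TYPE", "DEPARTURE")]
SECOND_CANDIDATE_ACTS = [("SET-TIME", "10"), ("SET-TIME_TYPE", "DEPARTURE")]
```

For this utterance the top candidate receives the utterance label 1 even though two of its three words are labeled 0, because the misrecognized words still lead to the same dialogue acts.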

書き起こした単語の系列及びそれらから得られる対話行為
11時発です(＊「発」は認識語彙に含まれていない)
(SET−TIME(11))(SET−TIME_TYPE(DEPARTURE))
以上から得た指標値をまとめ、発話単位の指標をファイル２に、単語単位の指標をファイル３に、それぞれ書き出す。図８の指標Ｗ１１はまだ計算できないため、記号で表現してある。
ファイル２の内容（抜粋）
ID correct U6 U10 U12 U16 U17 U18 U19
20021125−02−04_10 1 −26.062331 84 76.5517241379311 0.5 0.5 0.2 0
ファイル３の内容（抜粋）
ID correct W3 W4 W5 W7 W9 W10 W11 W12 W13
11時1 −9.2010 1.8629 −2.2024 113 0.04 5 [UTTERANCE_SCORE]
−2.69416 0
発車 0 −7.3065 1.8325 −2.9200 65 1 5 [UTTERANCE_SCORE]−27.1568
0
で 0 −8.7062 2.4125 −2.9540 17 1 5 [UTTERANCE_SCORE]−27.1151
0
複数の対話記録のユーザ発話に対して以上の処理を行ない、発話単位の指標をまとめたファイル２を用いて、発話単位信頼性尺度を作成する。作成された信頼性尺度の各パラメータを以下に示す。各指標名に対応する値が、式１のベクトルpの各要素であり、CORRECT_GAUSSIAN、INCORRECT_GAUSSIANに対応する値が、左からそれぞれ平均μ、分散σ²、事後確率Pである。
Transcribed word sequence and the dialogue acts obtained from it
11時発です (* 「発」 is not included in the recognition vocabulary)
(SET−TIME(11))(SET−TIME_TYPE(DEPARTURE))
The index values obtained above are collected; the utterance unit indices are written to file 2 and the word unit indices to file 3. The index W11 in FIG. 8 cannot be calculated yet, so it is represented by a placeholder symbol.
Contents of file 2 (excerpt)
ID correct U6 U10 U12 U16 U17 U18 U19
20021125−02−04_10 1 −26.062331 84 76.5517241379311 0.5 0.5 0.2 0
Contents of file 3 (excerpt)
ID correct W3 W4 W5 W7 W9 W10 W11 W12 W13
11 o'clock 1 −9.2010 1.8629 −2.2024 113 0.04 5 [UTTERANCE_SCORE] −2.69416 0
Departure 0 −7.3065 1.8325 −2.9200 65 1 5 [UTTERANCE_SCORE] −27.1568 0
de 0 −8.7062 2.4125 −2.9540 17 1 5 [UTTERANCE_SCORE] −27.1151 0
The above processing is performed on the user utterances of multiple dialogue records, and the utterance unit reliability measure is created using file 2, which collects the utterance unit indices. The parameters of the resulting reliability measure are shown below. The value next to each index name is the corresponding element of the vector p in Equation 1, and the values next to CORRECT_GAUSSIAN and INCORRECT_GAUSSIAN are, from left to right, the mean μ, the variance σ², and the posterior probability P.

top_choice_average_acoustic_score 0.4071058
top_choice_average_nbest_purity 0.0590583
average_nbest_purity 0.0049878
top_choice_average_consistent_concept_rate 2.2632635
top_choice_inconsistent_concept_rate −8.8873672
top_choice_average_concept_frepuency 6.3129411
top_choice_unparsed_bunsetsu_rate 0.0453356
CORRECT_GAUSSIAN: −25.7632580 1.4265214 0.47988507
INCORRECT_GAUSSIAN:−30.7077694 4.5412450 0.52011490
発話単位の信頼性評価値は、上記のパラメータを持つ尺度に対し、発話単位指標値を当てはめて式２の計算をすることで得られる。ファイル３の記号[UTTERANCE_SCORE]を、対応する発話の発話単位信頼性評価値を計算し、その値に置き換える。 top_choice_average_acoustic_score 0.4071058
top_choice_average_nbest_purity 0.0590583
average_nbest_purity 0.0049878
top_choice_average_consistent_concept_rate 2.2632635
top_choice_inconsistent_concept_rate −8.8873672
top_choice_average_concept_frepuency 6.3129411
top_choice_unparsed_bunsetsu_rate 0.0453356
CORRECT_GAUSSIAN: −25.7632580 1.4265214 0.47988507
INCORRECT_GAUSSIAN: −30.7077694 4.5412450 0.52011490
The utterance unit reliability evaluation value is obtained by applying the utterance unit index values to the measure with the above parameters and evaluating Equation 2. For each row of file 3, the utterance unit reliability evaluation value of the corresponding utterance is calculated and substituted for the symbol [UTTERANCE_SCORE].
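How a measure with weights and two Gaussians might turn index values into a score can be sketched as follows. Equation 2 is not reproduced in this section, so the form used here is an assumption: the index values are projected onto the weight vector, and the score is the log posterior ratio between the "correct" and "incorrect" Gaussians. The function names and the toy parameters are hypothetical, and the sketch is not expected to reproduce the evaluation values printed in this example.

```python
import math


def gaussian_log_pdf(x, mean, variance):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)


def reliability_score(index_values, weights, correct, incorrect):
    """Assumed score: log posterior ratio of 'correct' versus 'incorrect'.

    correct and incorrect are (mean, variance, prior) triples, mirroring the
    three values listed for CORRECT_GAUSSIAN and INCORRECT_GAUSSIAN.
    """
    z = sum(w * v for w, v in zip(weights, index_values))  # projection p . x
    mean_c, var_c, prior_c = correct
    mean_i, var_i, prior_i = incorrect
    log_c = math.log(prior_c) + gaussian_log_pdf(z, mean_c, var_c)
    log_i = math.log(prior_i) + gaussian_log_pdf(z, mean_i, var_i)
    return log_c - log_i


# Hypothetical toy measure with a single index.
TOY_WEIGHTS = [1.0]
TOY_CORRECT = (1.0, 1.0, 0.5)
TOY_INCORRECT = (-1.0, 1.0, 0.5)
```

Under this assumed form a positive score favors "correct" and a negative one favors "incorrect"; the same function could serve both the utterance unit and the word unit measure, since they share the parameter layout.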

更新されたファイル３の内容（抜粋）
11時1−9.2010 1.8629−2.2024 113 0.04 5−2.07607612356411−26.9416 0
発車 0−7.3065 1.8325−2.9540 17 1 5 −2.07607613256411−27.1151 0
で 0−8.7062 2.4125−2.9540 17 1 5 −2.07607613256411−27.1151 0
更新されたファイル３を用いて、単語単位信頼性尺度を作成された信頼性尺度のパラメータを以下に示す。
minimum_acoustic_score 0.1656796
acoustic_score_standard_deviation −0.0581996
mean_difference_from_maximum_score 2.0266259
number_of_acoustic_observations 0.0187305
square_nbest_purity 1.6731033
number_of_nbest −0.1339422
utterance_score 0.1814701
noumalized_score −0.1636039
unparsed_bunsetsu_violation −0.5617342
CORRECT_GAUSSIAN: 1.3205144 2.4545729 0.58425194
INCORRECT_GAUSSIAN:−3.6866648 4.1660976 0.41574803
単語単位の信頼性評価値は、上記のパラメータを持つ尺度に対し、単語単位指標値を当てはめて式２の計算をすることで得られる。 Updated file 3 contents (excerpt)
11 o'clock 1 −9.2010 1.8629 −2.2024 113 0.04 5 −2.07607612356411 −26.9416 0
Departure 0 −7.3065 1.8325 −2.9540 17 1 5 −2.07607613256411 −27.1151 0
de 0 −8.7062 2.4125 −2.9540 17 1 5 −2.07607613256411 −27.1151 0
The word unit reliability measure is created using the updated file 3; its parameters are shown below.
minimum_acoustic_score 0.1656796
acoustic_score_standard_deviation −0.0581996
mean_difference_from_maximum_score 2.0266259
number_of_acoustic_observations 0.0187305
square_nbest_purity 1.6731033
number_of_nbest −0.1339422
utterance_score 0.1814701
noumalized_score −0.1636039
unparsed_bunsetsu_violation −0.5617342
CORRECT_GAUSSIAN: 1.3205144 2.4545729 0.58425194
INCORRECT_GAUSSIAN: −3.6866648 4.1660976 0.41574803
The word unit reliability evaluation value is obtained by calculating the formula 2 by applying the word unit index value to the scale having the above parameters.

ここで、信頼性を評価したい、別の対話記録中の発話について、発話単位指標値をファイル４に、単語単位指標値をファイル５に、それぞれ書き出す。
ファイル４の内容（抜粋）
20021127-01-03_06-24.188667 100 90 1 0 0.157894736842105 0
ファイル５の内容（抜粋）
○○○○駅 -6.5979 1.2797-1.4625 123 1 4[UTTERANCE_SCORE] -24.1581 0
出発 1-6.3325 1.4448-1.9726 67 1 4[UTTERANCE_SCORE] -25.0312 0
ファイル４の発話単位指標値と、発話単位信頼性尺度を用いて、発話単位信頼性評価値を計算する。
20021127-01-03_06 1.84255589061485
また、この発話単位信頼性評価値でファイル５の記号[UTTERANCE_SCORE]を置き換え、更新されたファイル５の内容（抜粋）
○○○○駅 -6.5979 1.2797-1.4625 123 1 4 1.84255589061485-24.1581 0
出発 1-6.3325 1.4448-1.9726 67 1 4 1.84255589061485-25.0312 0
更新されたファイル５の発話単位信頼性評価値と、単語単位信頼性尺度を用いて、単語単位信頼性評価値を計算する。
○○○○駅 2.24738419391751
出発 1.19691584999685
以上説明した音声認識評価方法によれば発話に対する評価値の信頼性が高い。従って例えば音声解読装置或は対話システム等に適用することにより、発話に対する理解度が向上し、人との対話を円滑に実行できることとなる。 Here, the utterance unit index value is written in the file 4 and the word unit index value is written in the file 5 for the utterance in another dialogue recording whose reliability is to be evaluated.
Contents of file 4 (excerpt)
20021127-01-03_06 -24.188667 100 90 1 0 0.157894736842105 0
Contents of file 5 (excerpt)
○○○○ Station -6.5979 1.2797 -1.4625 123 1 4 [UTTERANCE_SCORE] -24.1581 0
Departure 1 -6.3325 1.4448 -1.9726 67 1 4 [UTTERANCE_SCORE] -25.0312 0
The utterance unit reliability evaluation value is calculated using the utterance unit index values in file 4 and the utterance unit reliability measure.
20021127-01-03_06 1.84255589061485
This utterance unit reliability evaluation value then replaces the symbol [UTTERANCE_SCORE] in file 5. Contents of the updated file 5 (excerpt):
○○○○ Station -6.5979 1.2797 -1.4625 123 1 4 1.84255589061485 -24.1581 0
Departure 1 -6.3325 1.4448 -1.9726 67 1 4 1.84255589061485 -25.0312 0
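The placeholder substitution that produces the updated file 5 can be sketched as follows. The rows are assumed to be already split into fields, the score is kept as a string so that it is written back exactly as printed, and the name `fill_utterance_score` is hypothetical.

```python
PLACEHOLDER = "[UTTERANCE_SCORE]"


def fill_utterance_score(rows, score):
    """Return copies of the rows with the placeholder replaced by the score string."""
    return [[score if field == PLACEHOLDER else field for field in row] for row in rows]


# The second row of file 5 from the excerpt, split into fields.
FILE5_ROWS = [
    ["Departure", "1", "-6.3325", "1.4448", "-1.9726", "67", "1", "4",
     PLACEHOLDER, "-25.0312", "0"],
]
UPDATED_ROWS = fill_utterance_score(FILE5_ROWS, "1.84255589061485")
```

Every word of an utterance shares the same utterance unit score, so a single call per utterance fills all of its rows.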
The word unit reliability evaluation value is calculated using the utterance unit reliability evaluation value in the updated file 5 and the word unit reliability measure.
○○○○ Station 2.24738419391751
Departure 1.19691584999685
According to the speech recognition evaluation method described above, the reliability of the evaluation value for speech is high. Therefore, for example, by applying it to a speech decoding device or a dialogue system, the degree of understanding of utterances can be improved, and dialogue with people can be executed smoothly.

以上説明したこの発明による音声解釈方法及び装置はコンピュータが解読可能な符号列で記述された音声解釈プログラムをコンピュータに備えられているＣＰＵに解読させ、実行させることにより実現される。音声解釈プログラムはコンピュータが読み取り可能な記録媒体に記録されてコンピュータにインストールされて実行されるか、又は通信回線を通じてコンピュータにインストールされる場合もある。 The speech interpretation method and apparatus according to the present invention described above is realized by causing a CPU provided in a computer to decode and execute a speech interpretation program described by a computer-readable code string. The speech interpretation program may be recorded on a computer-readable recording medium and installed in the computer for execution, or may be installed in the computer through a communication line.

この発明による音声解釈方法及び装置は例えば音声対話システム、自動案内システム等に活用することができる。 The speech interpretation method and apparatus according to the present invention can be used in, for example, a speech dialogue system, an automatic guidance system, and the like.

この発明による音声解釈装置の一実施例を説明するためのブロック図。 A block diagram for explaining an embodiment of the speech interpretation apparatus according to the present invention.
この発明による音声解釈方法の動作を説明するためのフローチャート。 A flowchart for explaining the operation of the speech interpretation method according to the present invention.
図２と同様のフローチャート。 A flowchart similar to FIG. 2.
先に提案した音声認識評価方法の手順を説明するためのフローチャート。 A flowchart for explaining the procedure of the previously proposed speech recognition evaluation method.
図４の続きを説明するためのフローチャート。 A flowchart for explaining the continuation of FIG. 4.
先に提案した音声認識評価装置の構成を説明するためのブロック図。 A block diagram for explaining the configuration of the previously proposed speech recognition evaluation apparatus.
図６に示した音声認識評価装置の動作を説明するための図。 A diagram for explaining the operation of the speech recognition evaluation apparatus shown in FIG. 6.
図７と同様の図。 A diagram similar to FIG. 7.

Explanation of symbols

１０音声解釈装置
１１入力音声
１２音声入力手段
１３音声認識手段
１４信頼性情報付加手段
１５解釈結果選別手段
１６音声解釈結果
１７データベース
１８有限状態トランスデューサ生成部
１８Ａ規則型解釈手段
１８Ｂ特性種単語抽出手段
１８Ｃ事例型解釈手段
２１事例ファイル
２２事例登録手段
２３対話情報記録手段
２４事例信頼性情報付加手段
２５事例選択手段
10 Speech interpretation apparatus
11 Input speech
12 Speech input means
13 Speech recognition means
14 Reliability information adding means
15 Interpretation result selecting means
16 Speech interpretation result
17 Database
18 Finite-state transducer generation unit
18A Rule type interpretation means
18B Specific word extracting means
18C Example type interpretation means
21 Case file
22 Case registration means
23 Dialogue information recording means
24 Case reliability information adding means
25 Case selection means

Claims

A speech interpretation method of a speech interpretation system that causes a computer to execute: speech input processing for inputting speech information to the system; speech recognition processing for recognizing speech as a sequence of words; speech recognition reliability evaluation processing for sequentially evaluating the recognition reliability of each recognized word; and speech interpretation processing for interpreting the input speech information based on the recognized word sequence and the recognition reliability of each word,
wherein the speech information (A) obtained by the speech input processing is recognized as a word sequence (B) by the speech recognition processing; speech recognition reliability evaluation values (C0), (C1), ..., (Cn) are calculated by the speech recognition reliability evaluation processing for the words (B0), (B1), ..., (Bn) included in the word sequence; and a speech interpretation result (D) for the input speech (A) is identified from the words (B0), ..., (Bn) and the reliability evaluation values (C0), (C1), ..., (Cn).

The speech interpretation method according to claim 1,
When a sequence of words including a registered sequence of specific words (E0), (E1),..., (Ei) is input, a sequence of interpretation results (F0), (F1),. Fj) to output a regular interpretation process;
A specific word extraction process for outputting the specific type of word (G) when a series of words including a specific type of word (G) is input;
There are pairs of word sequences (H0), (H1),..., (Hk) stored as examples in the database and corresponding interpretation result sequences (I0), (I1),. A case type interpretation process for outputting a series (I0), (I1),..., (Ii) of the interpretation result when a series of words including the word series (H0),. The speech that identifies the best speech interpretation result (D) by executing interpretation result selection processing using a finite state transducer that combines the above-mentioned regular type interpretation processing, specific type word extraction processing, and case type interpretation processing based on Interpretation method.

The speech interpretation method according to claim 2,
As a case of accumulating in the database, a set of word sequences (Ji), (Li) having a speech recognition reliability evaluation value (Mi) larger than a certain value is registered, and the registered word sequences (Ji), (Li) Is used as a case in the case type interpretation process described above.

The speech interpretation method according to claim 3,
The speech recognition reliability evaluation value (Mi) used in the case type interpretation process is
Extracting a sequence (P) of words estimated to be actually spoken from the dialogue information recorded in the dialogue information recording means,
The content of the dialogue information interpreted from the extracted word sequence (P) is specified as (P) ′,
Extract the words included in the word series (P) as (Q0), (Q1), (Q2), ..., (Qn),
The correct answer of the word sequence actually spoken corresponding to the word sequence (P) is identified as (R),
The correct answer of the content of the dialogue information interpreted from the correct answer (R) of the word sequence is specified as (R) ′,
The word actually spoken in the dialogue information corresponding to the word string (Qi) is identified as (Si),
In addition to the index calculated to estimate the extracted word sequence (P) as a candidate for the result, the index obtained from the dialog information recorded in the dialog information recording means is expressed in utterance units and word units, respectively. Utterance unit index value and word unit index value for the calculated utterance unit index value and word unit index value, a series of words that the system has recognized the dialogue information read from the dialogue information recording means, The utterance unit obtained by comparing the correct answer (R) of the actually spoken word sequence (P) recorded by the dialog information recording means and the correctness in word units are shown as binary values of 0 and 1. The reliability measure for evaluating the reliability of the utterance unit and the word unit is created by obtaining the relationship with the index, and the reliability of the speech recognition result whose reliability is to be evaluated by the created reliability measure. Audio interpretation, characterized in that the target value is calculated, a reliability evaluation value obtained by applying the evaluation formula confidence measures.

Speech input means for inputting speech information to the system; speech recognition means for recognizing the speech information as a sequence of words; speech recognition reliability evaluation means for evaluating the recognition reliability of each recognized word; A speech interpretation system comprising a sequence of recognized words and speech interpretation means for interpreting speech input based on the reliability of recognition of each word,
The voice (A) input from the voice input means is recognized as a word series (B) by the voice recognition means, and the words (B0), (B1),..., (Bn) included in the word series are recognized. Voice recognition reliability evaluation values (C0), (C1),..., (Cn) are calculated for the respective voice recognition reliability evaluation means, and the word (B0),. A speech interpretation apparatus that identifies a speech interpretation result (D) for an input speech (A) from evaluation values (C0), (C1),..., (Cn).

The speech interpretation apparatus according to claim 5,
When a series of words including registered specific word series (E0), (E1),..., (Ei) is input, a series of interpretation results (F0), (F1),. ), A specific type word extracting unit for outputting the specific type word (G) when a series of words including a specific type of word (G) is input, and a database There are pairs of word sequences (H0), (H1),..., (Hk) accumulated as examples and corresponding interpretation result sequences (I0), (I1),. Based on the case type interpretation means for outputting the interpretation result series (I0),..., (Il) when a series of words including the series (H0),. , A finite state transaction that combines specific word extraction means and case type interpretation means The interpretation result sorting means utilizing inducer, speech interpreter to identify the best speech interpretation results (D).

The speech interpretation apparatus according to claim 6,
As examples stored in the database, the speech recognition reliability evaluation value (Mi) includes case registration means for registering a set of word sequences (Ji), (Li) larger than a certain value, and this registered word sequence ( A speech interpretation apparatus in which Ji) and (Li) are used as examples by the case type interpretation means.

8. The speech interpretation apparatus according to claim 7, wherein the speech recognition reliability evaluation value (Mi) used in the case type interpretation means is a sequence of words estimated to be actually uttered from dialogue information recorded in the dialogue information recording means (P ) To extract word series,
Information content specifying means for specifying (P) ′ as the content of the information interpreted from the word sequence (P) extracted by the word sequence extracting means;
Word extraction means for extracting the words included in the word series (P) as (Q0), (Q1), (Q2),..., (Qn);
Correct word specifying means for specifying, as (R), the correct answer of the series of words actually spoken corresponding to the word series (P);
Correct information specifying means for specifying (R) ′ as the correct answer of the information content interpreted from the correct answer (R) of the series of words specified by the correct word specifying means;
An utterance word specifying means for specifying the actually spoken word corresponding to the word string (Qi) as (Si);
In addition to the index calculated to estimate the extracted word sequence (P) as a candidate for the result, the index obtained from the dialog information recorded in the dialog information recording means is expressed in utterance units and word units. An utterance unit index value calculating means and a word unit index value calculating means for calculating an utterance unit index value and a word unit index value for each;
The calculated utterance unit index value and word unit index value, a series of words in which the system recognizes the dialog information read from the dialog information recording means, and the actual utterance recorded by the dialog information recording means The reliability of the utterance unit and the word unit is obtained by obtaining the relationship between the utterance unit obtained by comparing the correct word sequence and the index indicating the correctness in the word unit by binary values of 0 and 1. An utterance unit reliability scale creation means and a word unit reliability scale creation means for creating a scale for evaluation;
An utterance unit reliability evaluation value calculation means and a word for calculating an index value of a speech recognition result whose reliability should be evaluated by the created reliability measure, and obtaining the reliability evaluation value by applying the evaluation value of the reliability scale A speech interpretation apparatus characterized by being calculated by unit reliability evaluation value calculation means.

A speech interpretation program written in a computer-readable programming language that causes a computer to execute the speech interpretation method according to any one of claims 1 to 4.