JP2009047920A - Device and method for interacting with user by speech

Info

Publication number: JP2009047920A
Application number: JP2007213828A
Authority: JP
Inventors: Kentaro Furihata (降幡 建太郎); Tetsuro Chino (知野 哲朗); Satoshi Kamatani (釜谷 聡史)
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2007-08-20
Filing date: 2007-08-20
Publication date: 2009-03-05
Anticipated expiration: 2027-08-20
Also published as: JP4987623B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech interaction device that allows an erroneous part to be corrected easily without interrupting the interaction.

SOLUTION: The speech interaction device comprises: a candidate generation section 112 that recognizes speech and generates response candidates together with a likelihood indicating the certainty of each response candidate; a response sentence generation section 113 that generates a response sentence including phrases expressing the content of the most likely response candidate; an output section 102 that outputs synthesized speech of the response sentence; a correction phrase generation section 114 that analyzes the recognition result for speech uttered by the user while the synthesized speech is being output and generates at least one correction phrase corresponding to a phrase included in the response sentence; a selection section 115 that obtains, from the generated response candidates, those that include a phrase with the same meaning as the generated correction phrase and selects the most likely of the obtained candidates; and an update section 116 that updates the response sentence with the phrases expressing the content of the selected response candidate. The output section 102 then outputs synthesized speech of the updated response sentence.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to an apparatus and a method for interacting with a user by executing an operation according to input speech.

In recent years, research on elemental technologies such as speech recognition, speech synthesis, and dialogue understanding has advanced. By combining these technologies, voice interaction interfaces that let a user operate a machine by speaking in natural language, without complicated button operations or command input, are being put into practical use.

In addition, as the performance of digital home appliances and car navigation systems improves, it is becoming possible to implement voice interaction interfaces, which require higher processing performance than conventional user interfaces.

However, many technical problems remain in each of these elemental technologies. It is extremely difficult to realize a system accurate enough to always interpret the user's input speech correctly and to execute an operation, or output a response, that satisfies the user's request.

For example, in order to interpret the user's intended request from speech, linguistic information must first be extracted from the speech waveform by speech recognition processing. However, even this speech recognition processing does not always produce a correct result. For example, recognition accuracy drops significantly in a noisy environment.

Further, morphological and syntactic information must be extracted from the recognized linguistic information (text), and the utterance intention must then be analyzed, but an error can occur at any of these stages. In particular, dialogue understanding that extracts the utterance intention requires very advanced language processing that takes context and the like into account. For this reason, it is very difficult for a voice interaction system that accepts free utterances from the user to always interpret those utterances correctly and avoid ambiguity.

Therefore, along with improvements to the elemental technology at each processing stage, measures are taken to let the user correct ambiguities and errors in the system's interpretation through a human interface (HI).

However, depending on how the system's interpretation result is fed back to the user, the procedure may become complicated, or the time required for the whole correction sequence (user input, system response with the interpretation result, user correction input, system correction of the interpretation, and execution of the system operation) may increase, which can cause stress for the user.

For example, consider a method in which, when there are multiple interpretation candidates for the user's utterance, each candidate is fed back to the user by voice and the user selects the correct one. Because this method cannot display the candidates as a text list, the read-out speech for each candidate must be output one after another. Output therefore takes time, and the burden on the user of listening to and checking each piece of speech also increases.

One way to avoid this is for the system to output only the top-ranked interpretation candidate and accept correction input from the user. However, a scheme that simply follows the sequence of response output, correction input, and confirmation response output makes the correction process cumbersome.

A text display interface that presents the feedback as a text list instead of by voice is also conceivable. However, when the display unit is small, operations such as scrolling are required, so the correction process can become cumbersome in the same way.

As described above, a voice interactive HI needs to be designed so that the dialogue between the person (user) and the machine can proceed smoothly.

For example, Patent Document 1 proposes a technique for realizing a dialogue interface that allows smooth correction: during the process of recognizing the user's utterance, a phrase in which a recognition error has occurred is detected automatically, and only the detected portion is presented to the source-language speaker, as text or speech, for correction. With this method only the erroneous phrase is presented to the speaker, so the whole sentence does not have to be confirmed or re-entered, and the time required for correction can be shortened.

Patent Document 1: JP 2000-29492 A

However, just as misrecognition can occur in speech recognition, errors can also occur in identifying where a recognition error lies, so the method of Patent Document 1 has the problem that a misrecognized portion sometimes cannot be corrected properly. It also has the problem that phrases other than the identified erroneous phrase cannot be corrected.

To solve these problems and realize a smooth dialogue, it is desirable that the user can confirm by voice not only the error location but the entire interpretation result, and correct it by voice. Even in this case, however, the usual confirmation and correction method, in which a correction utterance is accepted only after the speech for the entire interpretation result has been output, can hinder the progress of the dialogue.

The present invention has been made in view of the above, and its object is to provide an apparatus and a method that allow an error location to be corrected easily without hindering the dialogue.

In order to solve the above problems and achieve the object, the present invention comprises: a recognition unit that recognizes input speech and generates a plurality of recognition result candidates; a candidate generation unit that analyzes a plurality of first recognition result candidates for a first speech and generates a response candidate corresponding to each of the first recognition result candidates, together with a likelihood representing the certainty of each response candidate; a response sentence generation unit that selects the response candidate for the first candidate of the first recognition result, that is, the one for which the likelihood is maximum, and generates a response sentence including a phrase representing the selected response candidate; an output unit that outputs synthesized speech obtained by converting the response sentence into a speech signal; a corrected phrase generation unit that, when a second speech is input during output of the synthesized speech, analyzes the second recognition result candidates generated for the second speech by the candidate generation unit and generates a corrected phrase that corrects a phrase included in the response sentence; a selection unit that acquires, from the response candidates for the plurality of first recognition result candidates, the response candidates for other candidates of the first recognition result that include the same phrase as the corrected phrase, and selects from among them the response candidate with the maximum likelihood; and an update unit that updates the response sentence with the phrases of the selected response candidate. When the response sentence has been updated, the output unit outputs synthesized speech of the updated response sentence in place of the synthesized speech of the response sentence before the update.

The present invention is also a method that can be executed by the above apparatus.

According to the present invention, an error location can be corrected easily without hindering the dialogue.

Exemplary embodiments of an apparatus and a method according to the present invention are described below in detail with reference to the accompanying drawings.

The voice interactive apparatus according to the present embodiment interprets the user's input speech and outputs, as speech, a response sentence corresponding to the interpretation result. It also uses correction speech, which the user inputs during output of the response sentence in order to correct it, to update the interpretation result and the response sentence at the same time, and then outputs the updated response sentence.

The following describes an example in which the voice interactive apparatus is realized as a video recording/playback apparatus that can record and play back broadcast programs and the like, such as a hard disk recorder or a multimedia personal computer. The applicable apparatus is not limited to a video recording/playback apparatus; the invention can be applied to any apparatus that outputs a response corresponding to the user's input speech.

FIG. 1 is a block diagram showing the configuration of a video recording/playback apparatus 100 according to the present embodiment. As shown in FIG. 1, the video recording/playback apparatus 100 includes, as its main hardware components, a microphone 131, a speaker 132, and a storage unit 120. As its main software components, it includes a reception unit 101, a dialogue processing unit 110, an output unit 102, and a recording/playback unit 103.

The microphone 131 inputs the speech uttered by the user. The speaker 132 converts a digital speech signal, such as synthesized speech of a response, into an analog speech signal (D/A conversion) and outputs it.

The storage unit 120 records various data generated by the dialogue processing unit 110, such as action candidate groups, action fragments, and response phrase lists (described in detail later). The storage unit 120 can be configured from any commonly used storage medium, such as an HDD (hard disk drive), an optical disk, a memory card, or RAM (random access memory).

The reception unit 101 samples the analog speech signal input from the microphone 131, converts it into a digital signal in a format such as PCM (pulse-code modulation), and outputs it. Conventional A/D conversion techniques can be applied to the processing of the reception unit 101.

The dialogue processing unit 110 carries out dialogue processing with the user by generating and outputting a response corresponding to the speech input by the user and a response sentence representing the content of that response. Specifically, the dialogue processing unit 110 first performs speech recognition on the digital signal and interprets the user's request. Next, it generates response candidates according to the interpretation result. Finally, it generates a response sentence corresponding to the most likely candidate.

The detailed functions and configuration of the dialogue processing unit 110 are described below. As shown in FIG. 1, the dialogue processing unit 110 includes a recognition unit 111, a candidate generation unit 112, a response sentence generation unit 113, a corrected phrase generation unit 114, a selection unit 115, and an update unit 116.

The recognition unit 111 performs speech recognition on the digital speech signal output by the reception unit 101 and generates recognition result candidates representing the user's request. Specifically, the recognition unit 111 recognizes the input digital signal and generates a recognition candidate group consisting of at least one recognition candidate text. Any commonly used speech recognition method can be applied to the speech recognition processing of the recognition unit 111, including methods that use LPC analysis, hidden Markov models (HMM), dynamic programming, neural networks, or N-gram language models.

FIG. 2 is an explanatory diagram illustrating an example of a speech recognition result. It shows the recognition result for speech I0, a Japanese utterance meaning "Record the English course on MHK in the morning" (pronounced "emu-ecchi-kē de asa eigo kōza o totte ne"). The recognition result in FIG. 2 is expressed in lattice form.

In this example, the interpretation is ambiguous between node 201 ("朝", asa, "morning") and node 202 ("あさって", asatte, "the day after tomorrow"), and between node 203 ("英語講座を", eigo kōza o, "English course") and node 204 ("囲碁講座を", igo kōza o, "Go course").

The numerical values attached to the links between lattice nodes represent costs calculated, during lattice generation, from statistical co-occurrence frequencies and the like. In the figure, for example, the cost between node 205 ("ＭＨＫで", "on MHK") and node 201 ("朝", "morning") is 2, and the cost between node 202 ("あさって", "the day after tomorrow") and node 203 ("英語講座を", "English course") is 4.

Based on the lattice representation and costs of such a recognition result, the recognition unit 111 generates a recognition candidate group containing a predetermined number of top-ranked candidates, ordered by a likelihood that represents their certainty. FIG. 3 is an explanatory diagram showing an example of the generated recognition candidate sentences. It shows the first to fourth candidates ranked according to the likelihood corresponding to the total cost from the start node to the end node in FIG. 2.
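The ranking step can be illustrated with a minimal Python sketch. It is not the patent's implementation: the lattice below is hypothetical, romanized node labels stand in for the nodes of FIG. 2, and only the two edge costs quoted in the text (2 and 4) come from the source. All other costs are invented, and the rule "lower total cost means higher rank" is an assumption about how cost maps to likelihood.

```python
# Edge costs of a hypothetical lattice. Only ("MHK de" -> "asa") = 2 and
# ("asatte" -> "eigo koza o") = 4 are quoted in the text; the rest are invented.
EDGES = {
    "<s>": {"MHK de": 1},
    "MHK de": {"asa": 2, "asatte": 1},
    "asa": {"eigo koza o": 4, "igo koza o": 5},
    "asatte": {"eigo koza o": 4, "igo koza o": 1},
    "eigo koza o": {"totte ne": 1},
    "igo koza o": {"totte ne": 1},
    "totte ne": {"</s>": 0},
}


def enumerate_paths(node="<s>", cost=0, words=()):
    """Yield (total_cost, words) for every path from the start node to the end node."""
    if node == "</s>":
        yield cost, words
        return
    for nxt, edge_cost in EDGES[node].items():
        nxt_words = words if nxt == "</s>" else words + (nxt,)
        yield from enumerate_paths(nxt, cost + edge_cost, nxt_words)


def n_best(n):
    """Return the n lowest-cost paths (assumed here to be the most likely)."""
    return sorted(enumerate_paths())[:n]


for rank, (total, words) in enumerate(n_best(4), start=1):
    print(rank, total, " ".join(words))
```

With these invented costs the intended reading ("asa" plus "eigo koza o") comes out third, mirroring the situation described for FIG. 3, where the correct result is not the top candidate.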

As shown in FIG. 3, the recognition unit 111 generates recognition candidates, each of which associates a candidate number identifying the candidate, candidate text representing its content, and a likelihood. In the example of FIG. 3, the correct recognition result for the user's request is the third candidate. Thus, in speech recognition processing, even when the first candidate is wrong, the correct recognition result may be included among the other candidates.

Returning to FIG. 1, the candidate generation unit 112 takes this situation into account: rather than generating a response only for the top candidate, it generates a corresponding response candidate for each recognition result candidate. Here, a response means the processing executed, or the content output, in reaction to the user's input speech. Because the present embodiment is an example of a video recording/playback apparatus, processing such as playing back or recording a television program is a response. In the following, a response is called an action, and a response candidate is called an action candidate.

FIG. 4 is an explanatory diagram illustrating examples of actions. As shown in FIG. 4, an action has four attributes (hereinafter, action attributes): "operation", "date/time", "channel", and "program name". Each row of the table in FIG. 4 from the second row onward corresponds to one action.

For example, the second row represents the system operation of recording (operation) the "English course" (program name) on "MHK" (channel) in the "morning" (date/time). The third row represents the operation of playing back "recorded data 1". Because "playback" is an operation performed immediately upon the user's request, the "date/time" value is empty (shown as "−"). The "channel" value is also empty.

As these examples show, the representation of an action is not fixed; it is enough that the processing to be executed or the content to be output can be expressed by at least one word or phrase. In the example of FIG. 4, the content of an action can be specified as long as at least "operation" is set.
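One way to picture the action representation is as a small record type in which only the operation is mandatory. The following Python sketch is illustrative only: the class name, field names, and English attribute values are assumptions, and placing "recorded data 1" in the program name field is inferred from the statement that only the date/time and channel values are empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    """One row of an action table; only `operation` is required."""
    operation: str
    date_time: Optional[str] = None   # None plays the role of the empty value
    channel: Optional[str] = None
    program_name: Optional[str] = None


# The two example rows described for FIG. 4 (English values are paraphrases).
record_action = Action("record", date_time="morning", channel="MHK",
                       program_name="English course")
play_action = Action("play", program_name="recorded data 1")

print(record_action)
print(play_action)
```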

The candidate generation unit 112 generates an action candidate group corresponding to the user's request by applying language analysis techniques such as morphological analysis, syntactic analysis, and semantic analysis to the recognition candidate group. In doing so, it calculates a likelihood for each action candidate from the likelihood of each recognition candidate computed in the speech recognition processing and from the confidence of the language analysis processing, and ranks the candidates accordingly.

FIG. 5 is an explanatory diagram illustrating an example of an action candidate group. It shows the action candidates for the recognition candidates of FIG. 3. As shown in FIG. 5, an action candidate consists of an identifier ("candidate"), the same "operation", "date/time", "channel", and "program name" attributes as in FIG. 4, and a "likelihood". Each row of the table in FIG. 5 from the second row onward corresponds to one action, and the rows are listed in rank order starting from the first candidate, CAct1. For simplicity, the example of FIG. 5 assumes that the language processing is performed correctly and uses the likelihood values of the recognition candidates in FIG. 3 directly as the likelihood values of the action candidates.

Returning to FIG. 1, the response sentence generation unit 113 generates a response sentence that asks the user to confirm whether the action candidate with the maximum likelihood satisfies the user's request. Specifically, the response sentence generation unit 113 generates the response sentence using a template written in terms of action attributes.

FIG. 6 is an explanatory diagram illustrating an example of a template. As shown in FIG. 6, a template T contains variable parts, marked with the symbol "{}", and fixed parts. Specifying an action attribute inside "{}" indicates that the value of that attribute in each action candidate is to be substituted there. The template T is also divided by the symbol "/" into phrase units, each containing one action attribute. The template is divided into phrases in advance so that the output unit 102, described later, can output the response sentence one phrase at a time. In the following, a response sentence divided into phrase units is called a response phrase list and is written P = {P1, ..., PN} (where N is the number of phrases).

The method of generating the response sentence is not limited to templates; any conventional method can be applied, such as generating sentences with grammar rules or generation rules.

FIG. 7 is an explanatory diagram illustrating an example of a response phrase list generated with the template. It shows the response phrase list produced by applying the action candidate CAct1 of FIG. 5 to the template of FIG. 6. The response phrases P1 to P4 are output as speech from the output unit 102 in this order.
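Template filling of this kind can be sketched in a few lines of Python. The template wording below is hypothetical (the actual template of FIG. 6 is not reproduced in the text), and the attribute values given for CAct1 are inferred from the lattice alternatives rather than quoted from FIG. 5.

```python
# Hypothetical template: "{...}" marks a variable part, "/" separates phrases,
# and each phrase contains exactly one action attribute.
TEMPLATE = ("From {date_time}, / on {channel}, / "
            "the program {program_name} / will be set to {operation}.")


def build_response_phrases(template, action):
    """Split the template into phrases and substitute the attribute values."""
    return [part.strip().format(**action) for part in template.split("/")]


# Assumed content of the top-ranked candidate CAct1.
cact1 = {
    "operation": "record",
    "date_time": "the day after tomorrow",
    "channel": "MHK",
    "program_name": "Go course",
}

phrases = build_response_phrases(TEMPLATE, cact1)
for index, phrase in enumerate(phrases, start=1):
    print(f"P{index}: {phrase}")
```

Keeping the phrases as a list, rather than joining them into one string, is what lets a later stage replace a single phrase or resume speech output from a given phrase.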

Returning to FIG. 1, the corrected phrase generation unit 114 generates a corrected phrase representing the correction that the user has uttered in reaction to the response sentence output by the output unit 102, described later. Specifically, based on the candidates that the recognition unit 111 produces for the speech uttered as a correction, the corrected phrase generation unit 114 generates, as the corrected phrase, an action fragment containing the attribute value for at least one of the action attributes that make up an action.

When correcting a response sentence, the user may utter only the part to be corrected instead of repeating the whole sentence. In other words, the user's utterance may not contain all of the action attributes (operation, date/time, channel, program name). Even in this case, the corrected phrase generation unit 114 can extract at least some of the action attributes from the recognition result candidates. Because the attribute values extracted in this way represent the correction requested by the user, the corrected phrase generation unit 114 generates them as the corrected phrase.

FIG. 8 is an explanatory diagram illustrating another example of recognition candidate sentences generated by the recognition unit 111. It shows the recognition result for input speech I1 ("asa da yo", Japanese for "It's the morning"), which the user uttered to request a correction to the response sentence containing the response phrases of FIG. 7, specifically to correct the "date/time" action attribute. FIG. 8 also shows that only one recognition result candidate ("朝だよ") was generated.

From such a recognition result, the corrected phrase generation unit 114 extracts, as an action fragment, the information that the value of the action attribute "date/time" is "morning". FIG. 9 is an explanatory diagram showing an example of an action fragment generated in this way; it is the action fragment generated from the input speech I1 described above.
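As a rough illustration, the extraction of an action fragment can be imitated with a keyword lookup. This is a deliberate simplification: the text describes morphological, syntactic, and semantic analysis, whereas the sketch below uses a hypothetical lexicon over whitespace-separated romanized tokens, and all names in it are assumptions.

```python
# Hypothetical lexicon mapping a romanized token to (action attribute, value).
LEXICON = {
    "asa": ("date_time", "morning"),
    "asatte": ("date_time", "day after tomorrow"),
    "eigo": ("program_name", "English course"),
    "igo": ("program_name", "Go course"),
    "mhk": ("channel", "MHK"),
}


def extract_fragment(recognized_text):
    """Return the subset of action attributes mentioned in the utterance."""
    fragment = {}
    for token in recognized_text.lower().split():
        if token in LEXICON:
            attribute, value = LEXICON[token]
            fragment[attribute] = value
    return fragment


# The correction utterance I1, written as romanized tokens.
print(extract_fragment("asa da yo"))
```

The result holds only the attributes the user actually mentioned, which is what distinguishes an action fragment from a complete action candidate.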

The corrected phrase generation unit 114 and the candidate generation unit 112 differ only in whether they generate an action fragment containing some of the action attributes or an action candidate containing all of them. The procedure for interpreting the user's request, which applies language analysis techniques such as morphological analysis, syntactic analysis, and semantic analysis to the recognition result, is common to both. Either unit may therefore be integrated into the other.

The selection unit 115 selects, from the action candidate group, the action candidates that contain all of the attribute values of the action fragment, and then selects the candidate with the highest likelihood among them as the new first candidate.

For example, suppose that the action candidate group shown in FIG. 5 has been generated and that the action fragment shown in FIG. 9 (hereinafter, action fragment SEG1) has then been generated. In this case, the selection unit 115 searches the action candidate group of FIG. 5 for action candidates whose "date/time" attribute matches that of action fragment SEG1 ("(same day) morning"). In the example of FIG. 5, the selection unit 115 obtains CAct3 and CAct4. Next, the selection unit 115 selects whichever of CAct3 and CAct4 has the higher likelihood as the new first candidate. In this example, CAct3 is selected because its likelihood (0.2) is greater than that of CAct4 (0.1).
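The selection step amounts to a filter followed by an arg-max, as in the following Python sketch. The likelihoods of CAct3 (0.2) and CAct4 (0.1) come from the text; the likelihoods of CAct1 and CAct2 and the attribute values of all four candidates are assumptions made for illustration.

```python
# Assumed action candidate group. Only the likelihoods 0.2 and 0.1 are quoted
# in the text; the other values are illustrative.
CANDIDATES = [
    {"id": "CAct1", "operation": "record", "date_time": "day after tomorrow",
     "channel": "MHK", "program_name": "Go course", "likelihood": 0.4},
    {"id": "CAct2", "operation": "record", "date_time": "day after tomorrow",
     "channel": "MHK", "program_name": "English course", "likelihood": 0.3},
    {"id": "CAct3", "operation": "record", "date_time": "morning",
     "channel": "MHK", "program_name": "English course", "likelihood": 0.2},
    {"id": "CAct4", "operation": "record", "date_time": "morning",
     "channel": "MHK", "program_name": "Go course", "likelihood": 0.1},
]


def select_candidate(candidates, fragment):
    """Pick the most likely candidate that contains every value in the fragment."""
    matching = [c for c in candidates
                if all(c.get(attr) == value for attr, value in fragment.items())]
    if not matching:
        return None  # no candidate is consistent with the correction
    return max(matching, key=lambda c: c["likelihood"])


seg1 = {"date_time": "morning"}
print(select_candidate(CANDIDATES, seg1)["id"])
```

How the apparatus behaves when no candidate matches is not covered in this passage; returning `None` is simply a placeholder choice in the sketch.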

更新部１１６は、選択部１１５により選択されたアクション候補を元に応答フレーズリストを更新するものである。具体的には、更新部１１６は、まず、選択部１１５が新たに選択したアクション候補（以下、新候補という）と、選択前の第１位のアクション候補（以下、旧候補という）との間で、すべてのアクション属性値を比較する。そして、更新部１１６は、不一致部分に対応する新候補のアクション属性を抽出する。 The update unit 116 updates the response phrase list based on the action candidates selected by the selection unit 115. Specifically, the update unit 116 first determines between the action candidate newly selected by the selection unit 115 (hereinafter referred to as a new candidate) and the first action candidate before selection (hereinafter referred to as an old candidate). Compare all action attribute values. Then, the update unit 116 extracts a new candidate action attribute corresponding to the mismatched portion.

図１０は、旧候補の一例を示す説明図である。また、図１１は、新候補の一例を示す説明図である。図１０および図１１の例では、アクション属性「日時」および「番組名」が相違しているため、更新部１１６は、これらのアクション属性を抽出する。 FIG. 10 is an explanatory diagram illustrating an example of an old candidate. FIG. 11 is an explanatory diagram showing an example of a new candidate. In the example of FIGS. 10 and 11, the action attributes “date and time” and “program name” are different, and the update unit 116 extracts these action attributes.

次に、更新部１１６は、旧候補から生成した応答フレーズリストのうち、抽出したアクション属性に対応する応答フレーズを、新たな属性値で更新する。図１１の例では、更新部１１６は、属性値１１０１（（当日）朝）および属性値１１０２（英語講座）を新たな属性値として取得する。そして、更新部１１６は、生成済みの応答フレーズリストの対応する応答フレーズの内容を新たな属性値で変更する。 Next, the update unit 116 updates the response phrase corresponding to the extracted action attribute in the response phrase list generated from the old candidate with the new attribute value. In the example of FIG. 11, the update unit 116 acquires the attribute value 1101 ((the day) morning) and the attribute value 1102 (English course) as new attribute values. Then, the update unit 116 changes the content of the corresponding response phrase in the generated response phrase list with a new attribute value.

図１２は、更新された後の応答フレーズリストの一例を示す説明図である。図１２は、図７の応答フレーズリストを、図１１に示すようなアクション候補の属性を用いて更新した後の応答フレーズリストを表している。 FIG. 12 is an explanatory diagram showing an example of the response phrase list after being updated. FIG. 12 shows the response phrase list after the response phrase list of FIG. 7 is updated using the action candidate attributes as shown in FIG.
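
The comparison and replacement carried out by the update unit 116 might look like the sketch below. The function names, the attribute keys, and the English phrase patterns are assumptions made for illustration; the phrase list is a hypothetical English rendering of FIG. 7, not its actual contents.

```python
from typing import Dict, List


def mismatched_attributes(old: Dict[str, str],
                          new: Dict[str, str]) -> Dict[str, str]:
    """Return the attributes whose values differ, with the new candidate's values."""
    return {k: v for k, v in new.items() if old.get(k) != v}


def update_phrases(phrases: List[dict], mismatches: Dict[str, str]) -> List[dict]:
    """Return a copy of the phrase list in which each phrase tied to a
    mismatched attribute carries the new attribute value."""
    return [dict(p, value=mismatches.get(p["attribute"], p["value"]))
            for p in phrases]


def render(phrase: dict) -> str:
    return phrase["pattern"].format(phrase["value"])


# Old candidate (FIG. 10) and new candidate (FIG. 11), as described in the text.
old = {"operation": "record", "channel": "MHK",
       "date_time": "day after tomorrow", "program": "Go course"}
new = {"operation": "record", "channel": "MHK",
       "date_time": "(same day) morning", "program": "English course"}

# Hypothetical English rendering of the response phrase list of FIG. 7.
phrases = [
    {"attribute": "channel", "pattern": "on {}", "value": "MHK"},
    {"attribute": "date_time", "pattern": "broadcast {}", "value": "day after tomorrow"},
    {"attribute": "program", "pattern": "the {}", "value": "Go course"},
    {"attribute": "operation", "pattern": "{}, right?", "value": "record"},
]

diff = mismatched_attributes(old, new)
updated = [render(p) for p in update_phrases(phrases, diff)]
print(updated)
```

Only the phrases tied to the mismatched attributes change; the channel and operation phrases are carried over untouched, which is what later allows already-output phrases to be skipped.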

なお、上述のように、候補生成部１１２は、事前にすべての認識結果の候補に対応するアクション候補を生成している。このため、アクションを修正する場合は、選択部１１５が、ユーザの修正発話に応じて、生成済みのアクション候補から、より適切なアクション候補を選択するだけでよい。すなわち、応答文に対するユーザの修正発話に応じて、応答文（応答フレーズリスト）だけでなくアクション候補を同時に修正することが可能となる。 As described above, the candidate generating unit 112 generates action candidates corresponding to all recognition result candidates in advance. For this reason, when the action is corrected, the selection unit 115 only needs to select a more appropriate action candidate from the generated action candidates according to the user's corrected utterance. That is, it is possible to simultaneously correct not only the response sentence (response phrase list) but also the action candidate according to the user's correction utterance for the response sentence.

出力部１０２は、応答文生成部１１３によって生成された応答文、または更新部１１６によって更新された応答文を音声信号に変換した合成音声を生成し、合成音声をスピーカ１３２に出力するものである。 The output unit 102 generates synthesized speech by converting the response sentence generated by the response sentence generation unit 113, or the response sentence updated by the update unit 116, into a speech signal, and outputs the synthesized speech to the speaker 132.

具体的には、出力部１０２は、まず、応答文を構成する各文字列を音声信号に変換する音声合成処理を行う。出力部１０２による音声合成処理は、音声素片編集音声合成、フォルマント音声合成、音声コーパスベースの音声合成などの一般的に利用されているあらゆる方法を適用することができる。そして、出力部１０２は、生成した音声信号をＤＡ変換してスピーカ１３２に出力する。 Specifically, the output unit 102 first performs speech synthesis processing for converting each character string constituting the response sentence into a speech signal. For the speech synthesis processing by the output unit 102, any generally used method such as speech segment editing speech synthesis, formant speech synthesis, speech corpus-based speech synthesis, or the like can be applied. Then, the output unit 102 performs DA conversion on the generated audio signal and outputs it to the speaker 132.

また、出力部１０２は、応答文が更新された場合、更新後の応答文をいずれの部分から出力するかを特定する。具体的には、出力部１０２は、更新前の応答文で出力されていない応答フレーズを特定し、特定した応答フレーズから更新後の応答文の合成音声を出力する。 Further, when the response sentence is updated, the output unit 102 specifies from which part the updated response sentence is output. Specifically, the output unit 102 identifies a response phrase that has not been output in the response sentence before the update, and outputs a synthesized speech of the response sentence after the update from the identified response phrase.

録画再生部１０３は、決定されたアクション、すなわち、尤度が最大のアクション候補を実行するものである。例えば、録画再生部１０３は、図５のＣＡｃｔ３が最尤のアクション候補として選択された場合、ＣＡｃｔ３の各アクション属性に従い、指定された日時に、指定されたチャンネルの指定された番組名の番組を録画するアクションを実行する。 The recording/playback unit 103 executes the determined action, that is, the action candidate having the maximum likelihood. For example, when CAct3 in FIG. 5 is selected as the most likely action candidate, the recording/playback unit 103 executes, in accordance with the action attributes of CAct3, the action of recording the program with the specified program name on the specified channel at the specified date and time.

なお、録画再生部１０３などのような実際のアクションを実行する構成部を外部装置に備えるように構成してもよい。この場合は、決定したアクションに関する情報を音声対話装置から外部装置に出力し、外部装置はこの情報を参照してアクションを実行するように構成する。 Note that a configuration unit that executes an actual action, such as the recording / playback unit 103, may be provided in the external device. In this case, information regarding the determined action is output from the voice interaction apparatus to the external apparatus, and the external apparatus is configured to execute the action with reference to this information.

次に、このように構成された本実施の形態にかかるビデオ録画再生装置１００による音声対話処理について図１３を用いて説明する。図１３は、本実施の形態における音声対話処理の全体の流れを示すフローチャートである。 Next, a voice interaction process performed by the video recording / playback apparatus 100 according to the present embodiment configured as described above will be described with reference to FIG. FIG. 13 is a flowchart showing the overall flow of the voice interaction process in the present embodiment.

まず、受付部１０１は、マイク１３１から入力音声Ｉ０が入力されたか否かを判断する（ステップＳ１３０１）。入力音声Ｉ０が入力されていない場合は（ステップＳ１３０１：ＮＯ）、入力されるまで処理を繰り返す。 First, the reception unit 101 determines whether or not the input voice I0 is input from the microphone 131 (step S1301). If the input voice I0 has not been input (step S1301: NO), the process is repeated until it is input.

入力音声Ｉ０が入力された場合（ステップＳ１３０１：ＹＥＳ）、認識部１１１は、入力音声Ｉ０を音声認識し、認識候補群を生成する（ステップＳ１３０２）。次に、候補生成部１１２が、認識候補群の各候補について、対応するアクション候補を求め、アクション候補群ＣＡｃｔ｛ＣＡｃｔ１〜ＣＡｃｔＭ｝（Ｍはアクション候補の個数）を生成する（ステップＳ１３０３）。 When the input voice I0 is input (step S1301: YES), the recognition unit 111 recognizes the input voice I0 and generates a recognition candidate group (step S1302). Next, the candidate generation unit 112 obtains a corresponding action candidate for each candidate of the recognition candidate group, and generates an action candidate group CAct {CAct1 to CActM} (M is the number of action candidates) (step S1303).

次に、応答文生成部１１３が、尤度が最大のアクション候補ＡＣＴを決定する（ステップＳ１３０４）。次に、応答文生成部１１３は、アクション候補ＡＣＴに対応する応答フレーズリストＰ｛Ｐ１〜ＰＮ｝（Ｎはフレーズ数）を生成する（ステップＳ１３０５）。具体的には、応答文生成部１１３は、図６に示すようなテンプレートを参照し、テンプレートの変数部に、アクション候補ＡＣＴの対応するアクション属性の属性値をそれぞれ当てはめることにより、応答フレーズリストＰを生成する。 Next, the response sentence generation unit 113 determines the action candidate ACT having the maximum likelihood (step S1304). Next, the response sentence generation unit 113 generates a response phrase list P {P1 to PN} (N is the number of phrases) corresponding to the action candidate ACT (step S1305). Specifically, the response sentence generation unit 113 refers to a template as shown in FIG. 6 and generates the response phrase list P by filling each variable part of the template with the attribute value of the corresponding action attribute of the action candidate ACT.
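
Step S1305 amounts to filling a template slot by slot. The following sketch assumes a template represented as an ordered list of (attribute, pattern) pairs; the English patterns are a hypothetical rendering of the Japanese template of FIG. 6, and the names `TEMPLATE` and `build_phrase_list` are illustrative only.

```python
from typing import Dict, List, Tuple

# Hypothetical English rendering of the template of FIG. 6:
# one (action attribute, phrase pattern) pair per response phrase, in output order.
TEMPLATE: List[Tuple[str, str]] = [
    ("channel", "on {}"),
    ("date_time", "broadcast {}"),
    ("program", "the {}"),
    ("operation", "{}, right?"),
]


def build_phrase_list(template: List[Tuple[str, str]],
                      action: Dict[str, str]) -> List[str]:
    """Fill each variable part of the template with the attribute value
    of the corresponding action attribute (step S1305)."""
    return [pattern.format(action[attribute]) for attribute, pattern in template]


# Most likely action candidate ACT (CAct1 in the example of the text).
ACT = {"operation": "record", "channel": "MHK",
       "date_time": "the day after tomorrow", "program": "Go course"}

P = build_phrase_list(TEMPLATE, ACT)
print(P)
```

Keeping one phrase per attribute is what makes phrase-level correction possible later: a changed attribute maps to exactly one phrase position.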

次に、出力部１０２が、生成された応答フレーズリストＰから順次応答フレーズＰｉ（i＝１〜Ｎ）を取得し、音声合成した合成音声を出力する（ステップＳ１３０６）。なお、ｉは応答フレーズの出力順を表すカウンタ値である。 Next, the output unit 102 sequentially obtains response phrases Pi (i = 1 to N) from the generated response phrase list P, and outputs synthesized speech obtained by speech synthesis (step S1306). Note that i is a counter value indicating the output order of response phrases.

次に、受付部１０１は、マイク１３１から入力音声Ｉｉが入力されたか否かを判断する（ステップＳ１３０７）。なお、入力音声Ｉｉは、ｉ番目の応答フレーズＰｉの出力中に入力された音声であることを意味するが、応答フレーズＰｉの修正内容を表す音声であるとは限らない。すなわち、応答フレーズＰｉの前に出力された応答フレーズＰ１〜Ｐｉ−１のいずれかの修正内容を表す場合もある。また、未出力の応答フレーズＰｉ＋１〜ＰＮをユーザが推測して発話した場合であれば、入力音声Ｉｉが応答フレーズＰｉ＋１〜ＰＮの修正内容を表す場合もある。 Next, the reception unit 101 determines whether or not the input voice Ii is input from the microphone 131 (step S1307). The input voice Ii means that the voice is input during the output of the i-th response phrase Pi, but it is not necessarily a voice that indicates the correction content of the response phrase Pi. That is, the correction contents of any of the response phrases P1 to Pi-1 output before the response phrase Pi may be represented. In addition, if the user guesses unspoken response phrases Pi + 1 to PN and speaks, the input voice Ii may represent the correction contents of the response phrases Pi + 1 to PN.

入力音声Ｉｉが入力された場合は（ステップＳ１３０７：ＹＥＳ）、入力音声Ｉｉの内容にしたがって最尤のアクション候補および対応する応答文を更新する候補更新処理が実行される（ステップＳ１３０８）。候補更新処理の詳細については後述する。 When the input voice Ii is input (step S1307: YES), candidate update processing for updating the most likely action candidate and the corresponding response sentence is executed according to the contents of the input voice Ii (step S1308). Details of the candidate update process will be described later.

候補更新処理の後、またはステップＳ１３０７で入力音声Ｉｉが入力されていない場合（ステップＳ１３０７：ＮＯ）、出力部１０２は、すべての応答フレーズを処理したか否かを判断する（ステップＳ１３０９）。 After the candidate update process or when the input voice Ii is not input in step S1307 (step S1307: NO), the output unit 102 determines whether all response phrases have been processed (step S1309).

すべての応答フレーズを処理していない場合は（ステップＳ１３０９：ＮＯ）、出力部１０２は、次の応答フレーズに対して出力処理を繰り返す（ステップＳ１３０６）。なお、後述するように、候補更新処理でアクション候補が変更された場合は、変更後のアクション候補に対応して応答文（応答フレーズリスト）が更新されるため、出力部１０２は、更新後の応答フレーズリストから、次の応答フレーズを取得して出力する。 If all response phrases have not been processed (step S1309: NO), the output unit 102 repeats output processing for the next response phrase (step S1306). As will be described later, when the action candidate is changed in the candidate update process, the response sentence (response phrase list) is updated corresponding to the action candidate after the change. Obtain the next response phrase from the response phrase list and output it.

すべての応答フレーズを処理した場合は（ステップＳ１３０９：ＹＥＳ）、録画再生部１０３が、最尤のアクション候補ＡＣＴに対応するアクションを実行する（ステップＳ１３１０）。 When all the response phrases have been processed (step S1309: YES), the recording / playback unit 103 executes an action corresponding to the most likely action candidate ACT (step S1310).

このようにして、ユーザの要求に対する応答であるアクションの内容を確認するための応答文を生成し、応答文の出力中に修正のための音声が入力された場合は、この音声にしたがってアクションおよび応答文を同時に変更することができる。これにより、音声によって容易に誤り箇所を修正可能としつつ、ユーザとの対話を円滑に進めることができる。 In this way, a response sentence is generated for confirming the content of the action that responds to the user's request, and when speech for correction is input while the response sentence is being output, the action and the response sentence can be changed simultaneously in accordance with that speech. This makes it possible to correct erroneous parts easily by voice while keeping the dialogue with the user flowing smoothly.
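
The output loop of steps S1306 to S1309 can be simulated as below. This is a sketch under assumptions rather than the flowchart of FIG. 13 itself: the function `run_confirmation` is hypothetical, corrections are supplied as a precomputed mapping from the output step at which the user interrupts to the replaced phrases, and the counter is assumed to advance only when a phrase finishes without a correction, so that an interrupted phrase is spoken again in its updated form.

```python
from typing import Dict, List


def run_confirmation(phrases: List[str],
                     corrections: Dict[int, Dict[int, str]]) -> List[str]:
    """Simulate phrase-by-phrase output with barge-in corrections.

    phrases:     response phrase list P (position 1 is phrases[0]).
    corrections: maps an output step (1-based count of spoken phrases) to
                 {phrase position: replacement text} applied at that step.
    Returns every phrase in the order it was spoken.
    """
    phrases = list(phrases)       # work on a copy
    spoken: List[str] = []
    i, step = 1, 0
    while i <= len(phrases):      # S1309: stop once all phrases are processed
        step += 1
        spoken.append(phrases[i - 1])          # S1306: output phrase Pi
        fix = corrections.get(step)            # S1307: speech during output?
        if fix:                                # S1308: candidate update
            for position, text in fix.items():
                phrases[position - 1] = text
            j = min(fix)                       # earliest replaced position
            i = j if j < i else i              # resume there if already output
        else:
            i += 1
    return spoken


P = ["on MHK", "broadcast the day after tomorrow", "the Go course", "record, right?"]
# The user interrupts during the second output step; phrases 2 and 3 are replaced.
spoken = run_confirmation(P, {2: {2: "broadcast this morning", 3: "the English course"}})
print(spoken)
```

In this run the first phrase is never repeated, the interrupted second phrase is spoken again with the corrected value, and the action would be executed only after the final phrase has been output.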

次に、ステップＳ１３０８の候補更新処理の詳細について図１４を用いて説明する。図１４は、本実施の形態における候補更新処理の全体の流れを示すフローチャートである。 Next, details of the candidate update process in step S1308 will be described with reference to FIG. FIG. 14 is a flowchart showing the overall flow of candidate update processing in the present embodiment.

まず、認識部１１１は、入力音声Ｉｉを音声認識し、認識結果を出力する（ステップＳ１４０１）。次に、修正語句生成部１１４は、認識結果を解析して少なくとも１つのアクション属性の属性値を含むアクション断片群ＳＥＧ｛ＳＥＧ１〜ＳＥＧＫ｝（Ｋはアクション断片の個数）を生成する（ステップＳ１４０２）。 First, the recognition unit 111 recognizes the input voice Ii and outputs a recognition result (step S1401). Next, the corrected phrase generation unit 114 analyzes the recognition result and generates an action fragment group SEG {SEG1 to SEGK} (K is the number of action fragments) including an attribute value of at least one action attribute (step S1402). .

次に、選択部１１５は、アクション断片群ＳＥＧが存在するか否かを判断し（ステップＳ１４０３）、存在する場合は（ステップＳ１４０３：ＹＥＳ）、アクション断片群ＳＥＧの要素と同じアクション属性に対応する属性値が、すべての要素について一致するアクション候補を選択する。そして、選択したアクション候補のうち、尤度が最大のアクション候補ＣＡｃｔｋを選択する（ステップＳ１４０４）。 Next, the selection unit 115 determines whether or not the action fragment group SEG exists (step S1403). If it exists (step S1403: YES), the selection unit 115 selects the action candidates whose attribute values, for the same action attributes as the elements of the action fragment group SEG, match every element. Then, from the selected action candidates, it selects the action candidate CActk having the maximum likelihood (step S1404).

次に、選択部１１５は、アクション候補ＣＡｃｔｋが存在するか否かを判断する（ステップＳ１４０５）。アクション候補ＣＡｃｔｋが存在する場合は（ステップＳ１４０５：ＹＥＳ）、更新部１１６が、アクション候補ＣＡｃｔｋ（新候補）と、現在の最尤のアクション候補ＡＣＴ（旧候補）とを比較する。そして、更新部１１６は、不一致部分に対応する新候補のアクション属性（以下、不一致属性という）を含む不一致属性群Ａｔｔ{Ａｔｔ１〜ＡｔｔＬ}（Ｌは不一致属性の個数）を生成する（ステップＳ１４０６）。 Next, the selection unit 115 determines whether or not there is an action candidate CActk (step S1405). When the action candidate CActk exists (step S1405: YES), the update unit 116 compares the action candidate CActk (new candidate) with the current maximum likelihood action candidate ACT (old candidate). Then, the update unit 116 generates a mismatch attribute group Att {Att1 to AttL} (L is the number of mismatch attributes) including a new candidate action attribute (hereinafter referred to as a mismatch attribute) corresponding to the mismatch part (step S1406). .

次に、選択部１１５は、不一致属性群Ａｔｔが存在するか否かを判断し（ステップＳ１４０７）、存在する場合は（ステップＳ１４０７：ＹＥＳ）、アクション候補ＣＡｃｔｋを最尤のアクション候補ＡＣＴとして設定する（ステップＳ１４０８）。 Next, the selection unit 115 determines whether or not the mismatch attribute group Att exists (step S1407), and if it exists (step S1407: YES), sets the action candidate CActk as the most likely action candidate ACT. (Step S1408).

次に、更新部１１６は、応答フレーズリストＰのうち、不一致属性群Ａｔｔに含まれるアクション属性に対応する応答フレーズを、不一致属性群Ａｔｔの属性値で置換する（ステップＳ１４０９）。 Next, the update unit 116 replaces the response phrase corresponding to the action attribute included in the mismatch attribute group Att in the response phrase list P with the attribute value of the mismatch attribute group Att (step S1409).

続いて、更新後の応答フレーズリストＰを、いずれの応答フレーズから出力するかを特定するため、出力部１０２が以下の処理を実行する（ステップＳ１４１０〜ステップＳ１４１２）。 Subsequently, the output unit 102 executes the following processing to specify from which response phrase the updated response phrase list P is to be output (steps S1410 to S1412).

まず、出力部１０２は、置換した属性値のうち、最も文頭に近い属性値の文頭からの位置ｊを取得する（ステップＳ１４１０）。次に、出力部１０２は、取得した属性値の位置ｊが、更新前の応答フレーズリストＰで出力済みの応答フレーズの位置ｉより前か否かを判断する（ステップＳ１４１１）。 First, the output unit 102 acquires the position j, counted from the beginning of the sentence, of the replaced attribute value that is closest to the beginning of the sentence (step S1410). Next, the output unit 102 determines whether the acquired position j is before the position i of the response phrase already output from the response phrase list P before the update (step S1411).

通常は、出力済みの応答フレーズに対する修正内容が発話され、対応する属性値が置換されるため、ｊはｉより小さくなる。しかし、上述のようにユーザが応答フレーズを推測して未出力の応答フレーズに対する修正内容が発話された場合などには、ｊがｉより小さくならない場合がある。 Normally, correction contents for the output response phrase are spoken and the corresponding attribute value is replaced, so j is smaller than i. However, j may not be smaller than i, for example, when the user guesses a response phrase as described above and correction contents for an unoutput response phrase are spoken.

位置ｊが位置ｉより前の場合は（ステップＳ１４１１：ＹＥＳ）、出力部１０２は、置換した属性値の位置ｊを、次の出力位置に設定する（ステップＳ１４１２）。すなわち、出力部１０２は、ｊをｉに代入する。 When the position j is before the position i (step S1411: YES), the output unit 102 sets the position j of the replaced attribute value as the next output position (step S1412). That is, the output unit 102 substitutes j for i.
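
Steps S1410 to S1412 reduce to a small comparison, sketched here. The function name `next_output_position` and the use of 1-based phrase positions are assumptions for illustration.

```python
from typing import Iterable


def next_output_position(i: int, replaced_positions: Iterable[int]) -> int:
    """Return the phrase position from which output continues.

    i:                  current output position (counter i).
    replaced_positions: positions of the response phrases whose attribute
                        values were replaced by the update.
    """
    positions = list(replaced_positions)
    if not positions:
        return i                   # nothing was replaced, counter unchanged
    j = min(positions)             # S1410: replaced position closest to the start
    return j if j < i else i       # S1411 / S1412: rewind only if j precedes i


print(next_output_position(3, [2, 3]))  # 2: phrase 2 was already output, rewind
print(next_output_position(2, [2, 3]))  # 2: j equals i, counter unchanged
print(next_output_position(1, [2, 3]))  # 1: correction concerns unspoken phrases
```

The counter only ever moves backward, so phrases before the earliest replaced one are never repeated.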

ステップＳ１４０３でアクション断片群ＳＥＧが存在しないと判断された場合（ステップＳ１４０３：ＮＯ）、ステップＳ１４０５でアクション候補ＣＡｃｔｋが存在しないと判断された場合（ステップＳ１４０５：ＮＯ）、ステップＳ１４０７で不一致属性群Ａｔｔが存在しないと判断された場合（ステップＳ１４０７：ＮＯ）、または、ステップＳ１４１１で位置ｊが位置ｉより前でないと判断された場合は（ステップＳ１４１１：ＮＯ）、候補更新処理を終了する。 The candidate update process ends when it is determined in step S1403 that the action fragment group SEG does not exist (step S1403: NO), when it is determined in step S1405 that the action candidate CActk does not exist (step S1405: NO), when it is determined in step S1407 that the mismatch attribute group Att does not exist (step S1407: NO), or when it is determined in step S1411 that the position j is not before the position i (step S1411: NO).

次に、本実施の形態のかかるビデオ録画再生装置１００による音声対話処理の具体例について説明する。 Next, a specific example of voice dialogue processing by the video recording / playback apparatus 100 according to the present embodiment will be described.

まず、ユーザが、当日の朝、「ＭＨＫ」というチャンネルの、「英語講座」という名称の番組の録画予約をセットする目的で、「ＭＨＫで朝、英語講座を録ってね」を意味する日本語の入力音声Ｉ０（えむえっちけーであさえいごこうざをとってね）を入力する（ステップＳ１３０１）。続いて、認識部１１１が、入力音声Ｉ０を音声認識し、図３に示すような認識候補群を生成する（ステップＳ１３０２）。さらに、候補生成部１１２が、この認識候補群から図５に示すアクション候補群ＣＡｃｔを生成する（ステップＳ１３０３）。 First, in order to schedule a recording of the program named “English course”, broadcast on the channel “MHK” on the morning of the same day, the user inputs the Japanese input speech I0 (emuecchikē de asa eigo kōza o totte ne), which means “Record the English course on MHK in the morning” (step S1301). Subsequently, the recognition unit 111 recognizes the input speech I0 and generates a recognition candidate group as shown in FIG. 3 (step S1302). Further, the candidate generation unit 112 generates the action candidate group CAct shown in FIG. 5 from this recognition candidate group (step S1303).

なお、上述のように、図３の例では、ユーザの要求に適ったアクション候補は第３位候補であることに注意されたい。 Note that, as described above, in the example of FIG. 3, the action candidate that meets the user's request is the third candidate.

アクション候補群ＣＡｃｔ中、最も尤度が大きい候補は、尤度０．４のＣＡｃｔ１であるため、ＣＡｃｔ１をＡＣＴに設定する（ステップＳ１３０４）。次に、応答文生成部１１３が、図６に示すようなテンプレートＴ（｛チャンネル｝で/｛日時｝放送される/｛番組名｝を/｛操作｝しますね？）の変数部に対応するアクション属性のそれぞれに、ＣＡｃｔ１の対応するアクション属性の属性値を挿入し、応答フレーズリストＰを生成する（ステップＳ１３０５）。図７は、このときに生成される応答フレーズリストＰを表している。 In the action candidate group CAct, the candidate having the highest likelihood is CAct1, with a likelihood of 0.4, and therefore CAct1 is set as ACT (step S1304). Next, the response sentence generation unit 113 inserts the attribute value of the corresponding action attribute of CAct1 into each variable part of the template T shown in FIG. 6 (on {channel} / broadcast {date and time} / {program name} / {operation}, right?) and generates the response phrase list P (step S1305). FIG. 7 shows the response phrase list P generated at this time.

次に、出力部１０２が、カウンタｉ（＝１）に対応する応答フレーズＰ１（ＭＨＫで）を音声合成して出力する（ステップＳ１３０６）。ここでは、応答フレーズＰ１の出力処理中には、ユーザから入力音声Ｉ１が入力されなかったと仮定する（ステップＳ１３０７：ＮＯ）。続いて、出力部１０２が、次のカウンタｉ（＝２）に対応する応答フレーズＰ２（明後日放送される）を音声合成して出力する（ステップＳ１３０６）。 Next, the output unit 102 synthesizes and outputs the response phrase P1 (on MHK) corresponding to the counter i (= 1) (step S1306). Here, it is assumed that no input speech I1 is input by the user while the response phrase P1 is being output (step S1307: NO). Subsequently, the output unit 102 synthesizes and outputs the response phrase P2 (broadcast the day after tomorrow) corresponding to the next counter i (= 2) (step S1306).

ここで、応答フレーズＰ２の音声出力中、ユーザが最初の入力音声Ｉ０の日時の指定（（今日の）朝）が、誤って解釈されていることに気づいたと仮定する。そして、ユーザが、録画する日時を朝に修正するために、「朝だよ」を意味する日本語の入力音声Ｉ２（あさだよ）を入力したと仮定する（ステップＳ１３０７：ＹＥＳ）。 Here, assume that during the speech output of the response phrase P2, the user notices that the date and time specified in the first input speech I0 ((today's) morning) has been misinterpreted. Assume further that, in order to correct the recording date and time to the morning, the user inputs the Japanese input speech I2 (asa da yo), which means “It's the morning” (step S1307: YES).

この場合は、入力音声Ｉ２を元に最尤のアクション候補ＡＣＴおよび応答フレーズリストＰを更新する候補更新処理が実行される（ステップＳ１３０８）。 In this case, candidate update processing for updating the most likely action candidate ACT and the response phrase list P based on the input speech I2 is executed (step S1308).

候補更新処理では、まず、認識部１１１が、入力音声Ｉ２を音声認識し、図８に示すような認識候補群を生成する（ステップＳ１４０１）。さらに、修正語句生成部１１４が、認識候補群に対応するアクション断片群ＳＥＧを生成する（ステップＳ１４０２）。ここでは、アクション候補の属性「日時」の情報のみが抽出されるため、アクション断片群ＳＥＧ｛ＳＥＧ１｝が得られる。 In the candidate update process, first, the recognition unit 111 recognizes the input voice I2 and generates a recognition candidate group as shown in FIG. 8 (step S1401). Further, the corrected phrase generation unit 114 generates an action fragment group SEG corresponding to the recognition candidate group (step S1402). Here, since only the information of the attribute “date and time” of the action candidate is extracted, the action fragment group SEG {SEG1} is obtained.

続いて、選択部１１５が、アクション断片群ＳＥＧの要素（ここではＳＥＧ１のみ）の属性「日時」の値が「（当日）朝」であるアクション候補群をアクション候補群ＣＡｃｔから選択する。この例では、選択部１１５は、図５のＣＡｃｔ３およびＣＡｃｔ４を選択する。そして、選択部１１５は、これら候補のうち、最も尤度の大きいＣＡｃｔ３（尤度０．３）を最尤候補ＣＡｃｔｋとする（ステップＳ１４０４）。 Subsequently, the selection unit 115 selects, from the action candidate group CAct, the action candidates whose attribute “date and time” has the value “(same day) morning”, which is the value held by the element of the action fragment group SEG (here, only SEG1). In this example, the selection unit 115 selects CAct3 and CAct4 in FIG. 5. The selection unit 115 then sets CAct3 (likelihood 0.3), which has the highest likelihood among these candidates, as the maximum likelihood candidate CActk (step S1404).

最尤候補ＣＡｃｔｋが見つかったため（ステップＳ１４０５：ＹＥＳ）、更新部１１６は、ＣＡｃｔ３とＡＣＴ（＝ＣＡｃｔ１）の各属性値を比較し、不一致属性群Ａｔｔを生成する（ステップＳ１４０６）。この例では、図１１に示すように、属性値１１０１に対応するアクション属性「日時」と、属性値１１０２に対応するアクション属性「番組名」とが不一致属性群Ａｔｔに含まれる。 Since the maximum likelihood candidate CActk is found (step S1405: YES), the updating unit 116 compares the attribute values of CAct3 and ACT (= CAct1) to generate a mismatch attribute group Att (step S1406). In this example, as shown in FIG. 11, the action attribute “date” corresponding to the attribute value 1101 and the action attribute “program name” corresponding to the attribute value 1102 are included in the mismatch attribute group Att.

そこで、更新部１１６は、応答フレーズリストＰ（｛ＭＨＫ｝で/｛明後日｝放送される/｛囲碁講座｝を/｛録画｝しますね？｝）の対応する属性値（｛明後日｝および｛囲碁講座｝）を、ＣＡｃｔ３の属性値（「朝」および「英語講座」）で置き換える（ステップＳ１４０９）。図１２は、このようにして更新された応答フレーズリストＰを表している。 Therefore, the update unit 116 replaces the corresponding attribute values ({day after tomorrow} and {Go course}) in the response phrase list P (on {MHK} / broadcast {day after tomorrow} / {Go course} / {record}, right?) with the attribute values of CAct3 (“morning” and “English course”) (step S1409). FIG. 12 shows the response phrase list P updated in this way.

ここまでの処理によって、応答文に対応してユーザが発話した入力音声をフィードバックして、アクションおよびアクションに対応する応答フレーズも修正することができている。 Through the processing so far, the input speech uttered by the user in response to the response sentence is fed back, and the action and the response phrase corresponding to the action can also be corrected.

しかし、応答フレーズを修正した場合に、途中まで出力した応答文（応答フレーズリスト）を再度、最初から出力するか、修正箇所だけ出力するか、といった出力の仕方によってユーザの利便性が大きく異なる。 However, when the response phrase is corrected, the user's convenience varies greatly depending on whether the response sentence (response phrase list) output halfway is output again from the beginning or only the corrected portion is output.

そこで、本実施の形態では、上述のように、応答文のうち既に出力済みの部分は可能な限り再出力をさけつつ、変更箇所については必ず出力するように構成している。すなわち、更新した応答フレーズのうち、最も文頭に近い応答フレーズＰｊ（最も添え字ｊが小さい応答フレーズ）が既に出力済みであれば、出力部１０２は、応答フレーズＰｊから出力を再開する。また、応答フレーズＰｊが未出力であれば、出力部１０２は、現在の出力位置を表すカウンタｉが示す応答フレーズＰｉから続けて出力する。 Therefore, as described above, the present embodiment is configured so that re-output of the parts of the response sentence that have already been output is avoided as far as possible, while changed parts are always output. That is, if the updated response phrase closest to the beginning of the sentence, Pj (the response phrase with the smallest subscript j), has already been output, the output unit 102 resumes output from the response phrase Pj. If the response phrase Pj has not yet been output, the output unit 102 continues output from the response phrase Pi indicated by the counter i, which represents the current output position.

上述の例では、最も文頭に近い更新された応答フレーズはＰ２（｛朝｝放送される）である。すなわち、更新された応答フレーズの添え字うち最も小さい添え字ｊは２であり、現在のカウンタｉ＝２と一致するため、カウンタｉは更新しない（ステップＳ１４１１：ＮＯ）。 In the above example, the updated response phrase closest to the beginning of the sentence is P2 (broadcast {morning}). That is, the smallest subscript j among the subscripts of the updated response phrase is 2, which matches the current counter i = 2, so the counter i is not updated (step S1411: NO).

この後、出力部１０２は、更新後の応答フレーズＰ２（｛朝｝放送される）の合成音声を出力する（ステップＳ１３０６）。ここで、ユーザが合成音声を聞くことにより入力音声Ｉ２が正しく解釈されたことを確認し、修正のための発話を行わなかったと仮定する。 Thereafter, the output unit 102 outputs the synthesized speech of the updated response phrase P2 (broadcasted in {morning}) (step S1306). Here, it is assumed that the user confirms that the input speech I2 has been correctly interpreted by listening to the synthesized speech, and has not made an utterance for correction.

以降、同様に、応答フレーズＰ３（｛英語講座｝を）、および応答フレーズＰ４（｛録画｝しますね？）が順次出力される。その間、ユーザからの応答発話が検出されなかったとすると、応答文の出力後、録画再生部１０３によって、確定されたアクションが実行される（ステップＳ１３１０）。その後、ユーザからの入力受付状態にもどる（ステップＳ１３０１）。 Thereafter, similarly, the response phrase P3 ({English course}) and the response phrase P4 ({record}?) Are sequentially output. If a response utterance from the user is not detected during that time, the confirmed action is executed by the recording / playback unit 103 after the response text is output (step S1310). Thereafter, the process returns to the state of accepting input from the user (step S1301).

このように、本実施の形態にかかる音声対話装置では、ユーザの要求発話に応じた応答フレーズを順次出力し、ユーザからの修正のための応答があった場合は、アクション候補と応答フレーズリストを同時に修正することができる。また、修正箇所から応答フレーズの発話を続行するため、更新前で出力済みの部分は出力を省略することができる。これにより、余分な手順を踏んで対話を阻害することなく、容易に修正可能な音声対話装置を実現することができる。 Thus, in the voice interaction apparatus according to the present embodiment, response phrases corresponding to the user's requested utterance are sequentially output, and when there is a response for correction from the user, action candidates and response phrase lists are displayed. It can be corrected at the same time. Moreover, since the utterance of the response phrase is continued from the corrected part, the output of the part that has been output before the update can be omitted. As a result, it is possible to realize a voice dialogue device that can be easily corrected without obstructing the dialogue by taking extra steps.

また、応答文の音声を聞いたユーザが、まだ出力されていない部分についての誤りを推測して言い直した場合であっても、修正箇所を特定し、適切な候補を選択しなおすことができる。これにより、ユーザの利便性を向上させ、対話をより円滑に進めることが可能となる。 Even if the user who has heard the voice of the response sentence guesses the error about the part that has not been output yet and rephrases it, the correction part can be identified and an appropriate candidate can be selected again. . As a result, the convenience of the user can be improved and the conversation can proceed more smoothly.

（変形例） (Modification)

上記実施の形態では、図６に示したような固定のテンプレートにしたがって応答フレーズを生成し、生成した応答フレーズを順次出力していた。 In the above embodiment, response phrases are generated according to a fixed template as shown in FIG. 6, and the generated response phrases are sequentially output.

しかし、文の先頭に近い応答フレーズが誤っているような場合、誤った応答フレーズが出力された時点までに出力される情報が少ないため、その情報のみから、応答フレーズが誤っているか否かを適切に判断できない場合が生じうる。 However, if the response phrase near the beginning of the sentence is incorrect, there is little information that is output up to the point in time when the incorrect response phrase is output. There may be cases where it cannot be judged properly.

例えば、図７の応答フレーズリストの最初の応答フレーズＰ１（｛ＭＨＫ｝で）のチャンネル名である「ＭＨＫ」が「ＬＨＫ」の誤りであったとする。しかし、応答フレーズＰ１が出力された時点で、その断片的な情報のみから、その応答フレーズがチャンネル名に相当する箇所に対する応答フレーズであると、ユーザが瞬時に判別できるとは限らない。 For example, assume that the channel name “MHK” in the first response phrase P1 (on {MHK}) of the response phrase list in FIG. 7 is an error for “LHK”. At the moment the response phrase P1 is output, however, the user cannot always tell instantly, from that fragmentary information alone, that the phrase corresponds to the channel name.

そこで、本変形例では、より解釈の曖昧性の少ない応答フレーズを先に出力することにより、このような問題を軽減する。ただし、単純に曖昧性の少ない順に応答フレーズを並べ替えただけでは、言語的な制約によって、不自然な意味の応答文や、文法的に不適格な応答文が生成されるおそれがある。 Therefore, in this modification, such a problem is reduced by outputting a response phrase with less ambiguity of interpretation first. However, if response phrases are simply rearranged in the order of less ambiguity, a response sentence with an unnatural meaning or a grammatically inappropriate response sentence may be generated due to linguistic restrictions.

例えば、図７に対応する応答文を「明後日放送される/ＭＨＫで/囲碁講座を/録画しますね？」のように並べ替えた場合、「放送される」が「ＭＨＫ」に係り、意味的に誤った応答文となる。 For example, if the response sentence corresponding to FIG. 7 is rearranged as “broadcast the day after tomorrow / on MHK / the Go course / record, right?”, the phrase “broadcast” comes to modify “MHK”, and the response sentence becomes semantically incorrect.

そこで、並べ替えのための制約規則を構築し、その規則にしたがって応答フレーズリストを生成する。例えば、並べ替え可能なパターンを網羅した複数のテンプレートを予め用意し、最適なテンプレートを選択して応答文を生成するように構成する。具体的には、応答文生成部１１３が、このようなテンプレートから、曖昧性に応じて最適なテンプレートを選択して最尤のアクション候補の属性値を当てはめて応答文を生成する。 Therefore, constraint rules for reordering are constructed, and the response phrase list is generated according to those rules. For example, a plurality of templates covering the permissible orderings are prepared in advance, and the optimum template is selected to generate the response sentence. Specifically, the response sentence generation unit 113 selects the optimum template from among these templates according to ambiguity and generates the response sentence by filling in the attribute values of the most likely action candidate.

図１５は、本変形例で利用するテンプレートの一例を示す説明図である。図１５では、応答フレーズの出力順が異なる４つのテンプレートの例が示されている。 FIG. 15 is an explanatory diagram showing an example of a template used in this modification. FIG. 15 shows an example of four templates with different response phrase output orders.

例えば、図５のアクション候補群が生成され、最尤のアクション候補ＣＡｃｔ１の応答文を生成する場合、まず、応答文生成部１１３は、アクション候補のアクション属性それぞれの曖昧性を判断する。図５の例では、アクション属性「操作」および「チャンネル」は、ただ１通りの属性値を有するため、曖昧性は低いと判断される。アクション属性「日時」および「番組名」は、それぞれ２通りの属性値を有するため曖昧性が高いと判断される。 For example, when the action candidate group of FIG. 5 is generated and a response sentence of the maximum likelihood action candidate CAct1 is generated, first, the response sentence generation unit 113 determines the ambiguity of each action attribute of the action candidate. In the example of FIG. 5, since the action attributes “operation” and “channel” have only one attribute value, it is determined that the ambiguity is low. The action attributes “date and time” and “program name” have two attribute values, respectively, and thus are determined to have high ambiguity.

そこで、応答文生成部１１３は、アクション属性「操作」および「チャンネル」が先に出現するテンプレートを優先して選択する。図１５の例では、応答文生成部１１３は、テンプレートＴ２（｛操作｝しますね？/｛チャンネル｝で/｛日時｝放送される/｛番組名｝を/）を選択する。そして、この場合、応答文生成部１１３は、応答フレーズリストとして、「｛録画｝しますね？/｛ＭＨＫ｝で/｛明後日｝放送される/｛囲碁番組｝を/」を生成する。 Therefore, the response sentence generation unit 113 preferentially selects a template in which the action attributes “operation” and “channel” appear first. In the example of FIG. 15, the response sentence generation unit 113 selects the template T2 ({operation}, right? / on {channel} / broadcast {date and time} / {program name} /). In this case, the response sentence generation unit 113 generates “{record}, right? / on {MHK} / broadcast {day after tomorrow} / {Go program} /” as the response phrase list.
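
One way to realize this template selection is sketched below. The text does not specify how ambiguity is scored, so the scoring rule here (count the distinct values of each attribute across the action candidate group, then prefer the template that places high-count attributes latest) is an assumption. The candidate data stand in for FIG. 5, and of the four attribute orders only T1 (the order of FIG. 6) and T2 follow the text; T3 and T4 are invented for illustration.

```python
from typing import Dict, List


def ambiguity(candidates: List[Dict[str, str]]) -> Dict[str, int]:
    """Count the distinct values each action attribute takes in the group."""
    return {a: len({c[a] for c in candidates}) for a in candidates[0]}


def choose_template(templates: Dict[str, List[str]],
                    amb: Dict[str, int]) -> str:
    """Pick the template whose attribute order pushes ambiguous attributes
    toward the end (assumed scoring: sum of position * ambiguity)."""
    def score(order: List[str]) -> int:
        return sum(pos * amb[a] for pos, a in enumerate(order))
    return max(templates, key=lambda name: score(templates[name]))


# Hypothetical stand-in for the action candidate group of FIG. 5.
candidates = [
    {"operation": "record", "channel": "MHK", "date_time": "day after tomorrow", "program": "Go course"},
    {"operation": "record", "channel": "MHK", "date_time": "day after tomorrow", "program": "English course"},
    {"operation": "record", "channel": "MHK", "date_time": "(same day) morning", "program": "English course"},
    {"operation": "record", "channel": "MHK", "date_time": "(same day) morning", "program": "Go course"},
]

# Attribute orders of four templates; T3 and T4 are invented.
templates = {
    "T1": ["channel", "date_time", "program", "operation"],
    "T2": ["operation", "channel", "date_time", "program"],
    "T3": ["date_time", "channel", "program", "operation"],
    "T4": ["operation", "date_time", "channel", "program"],
}

amb = ambiguity(candidates)
print(amb)
print(choose_template(templates, amb))  # T2
```

Because the choice is made among prepared templates rather than by freely permuting phrases, every selectable order is one that was checked for grammatical and semantic validity in advance.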

このように、事前に定められたテンプレートにしたがい応答文を生成しているため、文法的に誤った応答文が生成されることはない。また、曖昧性の少ない応答フレーズから順に出力するため、誤って認識された応答フレーズが出力されるまでに、多くの情報（応答フレーズ）が出力される可能性が高くなる。これにより、情報量が少ないことにより応答フレーズの適否を適切に判断できなくなるという上述の問題を解消することが可能となる。 In this way, since the response sentence is generated according to a predetermined template, a grammatically incorrect response sentence is not generated. Further, since the response phrases are output in order from the less ambiguous response phrases, there is a high possibility that a lot of information (response phrases) will be output before the erroneously recognized response phrases are output. As a result, it is possible to solve the above-described problem that the suitability of the response phrase cannot be properly determined due to the small amount of information.

次に、本実施の形態にかかる音声対話装置のハードウェア構成について図１６を用いて説明する。図１６は、本実施の形態にかかる音声対話装置のハードウェア構成を示す説明図である。 Next, the hardware configuration of the voice interaction apparatus according to the present embodiment will be described with reference to FIG. FIG. 16 is an explanatory diagram showing a hardware configuration of the voice interaction apparatus according to the present embodiment.

The voice interaction apparatus according to the present embodiment includes a control device such as a CPU (Central Processing Unit) 51, storage devices such as a ROM (Read Only Memory) 52 and a RAM 53, a communication I/F 54 that connects to a network and performs communication, and a bus 61 that connects these components.

The voice interaction program executed by the voice interaction apparatus according to the present embodiment is provided by being incorporated in advance in the ROM 52 or the like.

The voice interaction program executed by the voice interaction apparatus according to the present embodiment may instead be provided as a file in an installable or executable format recorded on a computer-readable recording medium such as a CD-ROM (Compact Disk Read Only Memory), a flexible disk (FD), a CD-R (Compact Disk Recordable), or a DVD (Digital Versatile Disk).

Furthermore, the voice interaction program executed by the voice interaction apparatus according to the present embodiment may be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The voice interaction program may also be provided or distributed via a network such as the Internet.

The voice interaction program executed by the voice interaction apparatus according to the present embodiment has a module configuration including the units described above (the reception unit, the dialogue processing unit, the output unit, and the recording/playback unit). As actual hardware, the CPU 51 reads the voice interaction program from the ROM 52 and executes it, whereby each of the above units is loaded onto, and generated on, the main storage device.

As described above, the apparatus and method according to the present invention are suitable for video recording/playback apparatuses, car navigation systems, game machines, and other devices that operate in response to requests input by voice.

FIG. 1 is a block diagram showing the configuration of the video recording/playback apparatus according to the present embodiment.
FIG. 2 is an explanatory diagram showing an example of a speech recognition result.
FIG. 3 is an explanatory diagram showing an example of recognition candidate sentences.
FIG. 4 is an explanatory diagram showing an example of an action.
FIG. 5 is an explanatory diagram showing an example of an action candidate group.
FIG. 6 is an explanatory diagram showing an example of a template.
FIG. 7 is an explanatory diagram showing an example of a response phrase list.
FIG. 8 is an explanatory diagram showing another example of recognition candidate sentences.
FIG. 9 is an explanatory diagram showing an example of an action fragment.
FIG. 10 is an explanatory diagram showing an example of old candidates.
FIG. 11 is an explanatory diagram showing an example of new candidates.
FIG. 12 is an explanatory diagram showing an example of the response phrase list after updating.
FIG. 13 is a flowchart showing the overall flow of the voice interaction processing in the present embodiment.
FIG. 14 is a flowchart showing the overall flow of the candidate update processing in the present embodiment.
FIG. 15 is an explanatory diagram showing an example of the templates used in the modification.
FIG. 16 is an explanatory diagram showing the hardware configuration of the voice interaction apparatus according to the present embodiment.

Explanation of symbols

51 CPU
52 ROM
53 RAM
54 Communication I/F
61 Bus
100 Video recording/playback apparatus
101 Reception unit
102 Output unit
103 Recording/playback unit
110 Dialogue processing unit
111 Recognition unit
112 Candidate generation unit
113 Response sentence generation unit
114 Corrected phrase generation unit
115 Selection unit
116 Update unit
120 Storage unit
131 Microphone
132 Speaker
201-205 Nodes
1101, 1102 Attribute values

Claims

A recognition unit that recognizes input speech and generates a plurality of recognition result candidates;
A candidate generation unit that analyzes a plurality of first recognition result candidates for a first speech and generates response candidates corresponding to each of the plurality of first recognition result candidates, together with a likelihood representing the probability of the response candidate for each first recognition result candidate;
A response sentence generation unit that selects the response candidate for a first candidate of the first recognition result, that response candidate having the maximum likelihood, and generates a response sentence for the first candidate of the first recognition result, the response sentence including a phrase representing the selected response candidate;
An output unit that outputs synthesized speech obtained by converting the response sentence for the first candidate of the first recognition result into a speech signal;
A corrected phrase generation unit that, when a second speech is input during output of the synthesized speech, analyzes second recognition result candidates for the second speech generated by the candidate generation unit and generates a corrected phrase that corrects a phrase included in the response sentence for the first candidate of the first recognition result;
A selection unit that acquires, from the response candidates for the plurality of first recognition result candidates, response candidates for other candidates of the first recognition result that include the same phrase as the corrected phrase, and selects, from among the acquired response candidates, the response candidate having the maximum likelihood; and
An update unit that updates the response sentence with a phrase of the selected response candidate for the other candidate of the first recognition result,
wherein, when the response sentence is updated, the output unit outputs the synthesized speech of the updated response sentence in place of the synthesized speech of the response sentence before the update.
A voice interaction apparatus characterized by comprising the above units.

When the response sentence is updated, the output unit outputs the synthesized speech of the updated response sentence starting from the phrase that corresponds to a phrase not yet output in the response sentence before the update.
The voice interaction apparatus according to claim 1.

When a phrase already output in the response sentence before the update is included among the phrases from the updated phrase closest to the beginning of the sentence through the end of the sentence, the output unit outputs the synthesized speech of the updated response sentence starting from the updated phrase closest to the beginning of the sentence.
The voice interaction apparatus according to claim 2.

When a phrase already output in the response sentence before the update is included among the phrases from the updated phrase closest to the beginning of the sentence through the end of the sentence, the output unit outputs the synthesized speech of the updated response sentence starting from the phrase that follows the already output phrase toward the end of the sentence.
The voice interaction apparatus according to claim 2.

The candidate generation unit further generates a second likelihood representing the probability of each phrase that represents the response candidate,
and the response sentence generation unit generates the response sentence so that the phrases representing the response candidate are arranged from the beginning of the sentence in order of the second likelihood, highest first.
The voice interaction apparatus according to claim 1.

A recognition step in which a recognition unit recognizes input speech and generates a plurality of recognition result candidates;
A candidate generation step in which a candidate generation unit analyzes a plurality of first recognition result candidates for a first speech and generates response candidates corresponding to each of the plurality of first recognition result candidates, together with a likelihood representing the probability of the response candidate for each first recognition result candidate;
A response sentence generation step in which a response sentence generation unit selects the response candidate for a first candidate of the first recognition result, that response candidate having the maximum likelihood, and generates a response sentence for the first candidate of the first recognition result, the response sentence including a phrase representing the selected response candidate;
A first output step in which an output unit outputs synthesized speech obtained by converting the response sentence for the first candidate of the first recognition result into a speech signal;
A corrected phrase generation step in which a corrected phrase generation unit, when a second speech is input during output of the synthesized speech, analyzes second recognition result candidates for the second speech generated in the candidate generation step and generates a corrected phrase that corrects a phrase included in the response sentence for the first candidate of the first recognition result;
A selection step in which a selection unit acquires, from the response candidates for the plurality of first recognition result candidates, response candidates for other candidates of the first recognition result that include the same phrase as the corrected phrase, and selects, from among the acquired response candidates, the response candidate having the maximum likelihood;
An update step in which an update unit updates the response sentence with a phrase of the selected response candidate for the other candidate of the first recognition result; and
A second output step in which the output unit, when the response sentence is updated, outputs the synthesized speech of the updated response sentence in place of the synthesized speech of the response sentence before the update.
A voice interaction method characterized by comprising the above steps.
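
For illustration only, the selection step, the update step, and the point from which output resumes can be sketched as follows. This is not the claimed implementation. The likelihood values, the second and third candidates, and the function names are hypothetical. The sketch compares phrases by exact string equality, whereas the abstract speaks of phrases having the same meaning, and the resume rule reflects one possible reading of the dependent claims.

```python
# Slot order of the response phrase list (template T2 of the description).
SLOT_ORDER = ["operation", "channel", "date and time", "program name"]

# Response candidates with hypothetical likelihoods. Only the first entry
# (CAct1) uses attribute values taken from the description.
CANDIDATES = [
    {"likelihood": 0.8, "attrs": {
        "operation": "record", "channel": "MHK",
        "date and time": "the day after tomorrow", "program name": "Go program"}},
    {"likelihood": 0.6, "attrs": {
        "operation": "record", "channel": "MHK",
        "date and time": "tomorrow", "program name": "Go program"}},
    {"likelihood": 0.5, "attrs": {
        "operation": "record", "channel": "MHK",
        "date and time": "tomorrow", "program name": "cooking program"}},
]


def select_alternative(candidates, current, corrected_phrase):
    """Among the other candidates containing the corrected phrase, return the
    one with the maximum likelihood, or None if no candidate contains it."""
    matching = [c for c in candidates
                if c is not current and corrected_phrase in c["attrs"].values()]
    return max(matching, key=lambda c: c["likelihood"], default=None)


def resume_index(slot_order, old, new, phrases_output):
    """Return the index of the phrase from which output should resume.

    If the earliest updated phrase was already spoken, restart from it;
    otherwise continue with the first phrase that has not been spoken yet.
    """
    changed = [i for i, attribute in enumerate(slot_order)
               if old["attrs"][attribute] != new["attrs"][attribute]]
    if not changed:
        return phrases_output
    return min(changed[0], phrases_output)


current = CANDIDATES[0]
# The user interrupts with "tomorrow" while the response is being spoken.
updated = select_alternative(CANDIDATES, current, "tomorrow")
start = resume_index(SLOT_ORDER, current, updated, phrases_output=3)
print([updated["attrs"][a] for a in SLOT_ORDER[start:]])
```

In this sketch the user interrupts after three phrases have been spoken, so output restarts from the corrected "date and time" phrase rather than from the beginning of the sentence.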