JP2008046570A - Voice input system

Info

Publication number: JP2008046570A
Application number: JP2006224689A
Authority: JP
Inventors: Yasuo Sudo; 康夫須藤
Original assignee: Aioi Insurance Co Ltd
Current assignee: Aioi Insurance Co Ltd
Priority date: 2006-08-21
Filing date: 2006-08-21
Publication date: 2008-02-28

Abstract

PROBLEM TO BE SOLVED: To give instructions simply and accurately by voice input.

SOLUTION: A voice recognition processing unit 104 recognizes voice data input by a voice input unit 102 in either a normal mode or a confirmation mode. An instruction receiving unit 106 estimates the content of the user's instruction from the result of voice recognition performed in the normal mode, and shifts the voice recognition processing unit 104 to the confirmation mode. When voice data is input from the voice input unit 102 while the confirmation mode is set, the voice recognition processing unit 104 recognizes that voice data in the confirmation mode. Based on the result of that recognition, the instruction receiving unit 106 detects whether the user has confirmed that the estimate is correct and, if so, finalizes the instruction content and accepts it as an instruction.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a voice input system.

Control systems that accept operation commands by voice have been developed. Patent Document 1 (Japanese Unexamined Patent Application Publication No. 2006-33795) describes a remote control system that uses voice input. In the technique described there, when voice is input from a remote control terminal, a voice recognition unit recognizes the voice and extracts a keyword. Control items that can serve as selection candidates are retrieved based on this keyword and displayed on the television screen. Using the remote control terminal as a pointing device, the user can select a desired control item from among those displayed. When the user presses the selection key among the operation keys, the control code of that control item is acquired and transmitted to the television. The intended operation command can thus be input with a simple operation.

Patent Document 1: JP 2006-33795 A

However, in the technique described in Patent Document 1, the user must operate an operation key or the like on the remote control terminal to select an operation command after the voice input. Consequently, a user who can only use voice input, for example because of a physical disability, cannot complete the operation.

The present invention has been made in view of the above circumstances, and its object is to provide a technique that allows a user to give instructions simply and accurately by voice input.

According to the present invention, there is provided a voice input system including:
a voice input unit that inputs voice data of a user's voice;
a voice recognition processing unit that recognizes the voice data input by the voice input unit in at least two modes, a normal mode and a confirmation mode;
an instruction receiving unit that, based on the result of voice recognition performed by the voice recognition processing unit in the normal mode, estimates the content of an instruction from the user and shifts the voice recognition processing unit to the confirmation mode; and
a presentation processing unit that presents the estimate made by the instruction receiving unit and, so that the user can state by voice whether the estimate is correct, presents the content the user is to utter,
wherein, when voice data is input from the voice input unit while the confirmation mode is set, the voice recognition processing unit performs voice recognition by comparing the voice data with the content presented by the presentation processing unit, and
wherein the instruction receiving unit detects, based on the result of voice recognition performed in the confirmation mode, whether the user has confirmed that the estimate is correct, and, when the estimate is confirmed to be correct, finalizes the instruction content and accepts it as an instruction.

With this configuration, an instruction is estimated from the result of recognizing the user's voice, the estimate is presented, and the user is asked to state by a second voice input whether it is correct. Because recognition accuracy is higher for this second input, the user's instruction can be accepted accurately by voice alone. Since instructions can be given by voice input only, the user can request a desired process simply by speaking the necessary information.

The voice input system of the present invention may further include a storage unit that stores the content presented by the presentation processing unit. In the confirmation mode, the voice recognition processing unit can refer to this storage unit and compare the voice data input from the voice input unit with the stored content.

In the voice input system of the present invention, the instruction receiving unit can estimate the content of the user's instruction based on the result of voice recognition performed in the normal mode and determine a general term that expresses the result of the estimate. The presentation processing unit can then present the general term determined by the instruction receiving unit as the content the user is to utter.

Any combination of the above constituent elements, and any conversion of the expression of the present invention between a method, an apparatus, a system, a recording medium, a computer program, and the like, is also effective as an aspect of the present invention.

According to the present invention, a user can give instructions simply and accurately by voice input.

Embodiments of the present invention are described below with reference to the drawings. In all the drawings, like components are given like reference numerals, and their description is omitted where appropriate.

(First Embodiment)
FIG. 1 is a block diagram showing the configuration of a voice input system 100 according to the present embodiment.
The voice input system 100 includes a voice input unit 102, a voice recognition processing unit 104, an instruction receiving unit 106, a presentation processing unit 112, a processing unit 114, a mode storage unit 116, a presentation content storage unit 118, a voice data storage unit 120, a processing content storage unit 122, a microphone 130, a speaker 132, and a display 134. The voice input system 100 can be implemented on, for example, a personal computer (hereinafter, PC). The microphone 130, the speaker 132, and the display 134 may be built into the PC or connected to it by wire or wirelessly.

In the present embodiment, the voice input system 100 accepts instructions from the user by voice. The processing unit 114 performs various processes in accordance with the instructions that the instruction receiving unit 106 accepts from the user. The processing performed by the processing unit 114 is not particularly limited; for example, it can support insurance contracts such as automobile insurance. The processing content storage unit 122 stores the candidates for processing performed by the processing unit 114. Its specific configuration is described later.

The microphone 130 is placed where it can pick up the user's voice. The voice input unit 102 inputs the user's spoken instructions via the microphone 130. In the present embodiment, the voice input unit 102 digitizes the voice input from the microphone 130. The voice data storage unit 120 stores the digitized voice data input by the voice input unit 102 (hereinafter simply called voice data).

The voice recognition processing unit 104 recognizes the voice data input by the voice input unit 102 in at least two modes, a normal mode and a confirmation mode. Although not shown, the voice recognition processing unit 104 has the dictionaries and similar resources found in a general voice recognition program. The normal mode is voice recognition processing of the same kind as that performed by a general voice recognition program. The confirmation mode is described later.

The mode storage unit 116 stores the mode setting of the voice recognition processing unit 104. In the initial state, the mode storage unit 116 is set to the normal mode.

The instruction receiving unit 106 estimates the content of the user's instruction based on the result of voice recognition performed by the voice recognition processing unit 104 in the normal mode. The processing content storage unit 122 stores candidates for the instruction content that the user may give. Based on the normal-mode recognition result, the instruction receiving unit 106 refers to the processing content storage unit 122 and estimates the user's instruction content from among these candidates.
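The estimation step can be pictured with a minimal sketch. The source does not say how the estimate is computed, so the keyword matching below is purely an assumption, as are the `CANDIDATES` table, its keywords, and the function name `estimate_instruction`. Recognized text stands in for the output of the voice recognition processing unit 104.

```python
# Hypothetical stand-in for the instruction candidates held in the
# processing content storage unit 122. The keywords are illustrative only.
CANDIDATES = {
    "contract processing start": ["contract"],
    "re-input": ["again", "re-enter"],
}


def estimate_instruction(recognized_text, candidates=CANDIDATES):
    """Return the candidate matching the most keywords, or None if none match.

    Keyword matching is an assumed technique; the source only states that
    the estimate is chosen from the stored candidates.
    """
    text = recognized_text.lower()
    best, best_hits = None, 0
    for instruction, keywords in candidates.items():
        hits = sum(1 for kw in keywords if kw in text)
        if hits > best_hits:
            best, best_hits = instruction, hits
    return best
```

Returning `None` when nothing matches is also an assumption; the source does not describe what happens when no candidate fits.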

In addition to estimating the user's instruction content, the instruction receiving unit 106 shifts the voice recognition processing unit 104 to the confirmation mode. Specifically, it does so by rewriting the setting in the mode storage unit 116 from the normal mode to the confirmation mode.

The presentation processing unit 112 presents the estimate made by the instruction receiving unit 106 and, so that the user can state by voice whether the estimate is correct, presents the content the user is to utter. It refers to the presentation content storage unit 118 and presents these items to the user through the speaker 132 or the display 134.

FIG. 2 shows an example of the internal configuration of the processing content storage unit 122. The processing content storage unit 122 includes an instruction content column and a presentation content column. The presentation content column in turn includes a confirmation content column, an affirmative column, and a negative column.

As shown in FIG. 2(a), the instruction content column stores entries such as "contract processing start" and "re-input". In association with the instruction content "contract processing start", the unit stores "Contract processing will start." as the confirmation content, "Yes" as the content the user is to utter to affirm the estimate, and "No" as the content the user is to utter to deny it. The processing content storage unit 122 can store further instruction contents as well.
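The table of FIG. 2(a) could be modeled roughly as follows. This is only a sketch: the names `PresentationContent` and `PROCESSING_CONTENT` are hypothetical, and only the "contract processing start" row is filled in because the source does not give the presentation content for "re-input".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresentationContent:
    """One row of the presentation content column (hypothetical layout)."""
    confirmation: str  # what the system announces as its estimate
    affirmative: str   # what the user says to affirm the estimate
    negative: str      # what the user says to deny the estimate


# Keyed by instruction content, mirroring the two columns of FIG. 2(a).
PROCESSING_CONTENT = {
    "contract processing start": PresentationContent(
        confirmation="Contract processing will start.",
        affirmative="Yes",
        negative="No",
    ),
}
```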

Returning to FIG. 1, so that the user can state by voice whether the estimated instruction content is correct, the instruction receiving unit 106 stores the content the user is to utter, together with the estimated instruction content, in the presentation content storage unit 118. In other words, the presentation content storage unit 118 holds the same content as the presentation content column shown in FIG. 2(a).

When voice data is input from the voice input unit 102, the voice recognition processing unit 104 refers to the mode storage unit 116 to check the currently set mode. If voice data is input while the confirmation mode is set, the voice recognition processing unit 104 performs voice recognition by comparing that voice data with the content presented by the presentation processing unit 112. Specifically, it refers to the presentation content storage unit 118 and compares the voice data input from the voice input unit 102 with the content stored there. In the confirmation mode, what the user will say is already known to some extent, so voice recognition can be performed accurately.
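As a rough illustration of why a restricted comparison is easier than open recognition, the sketch below classifies an utterance against just the two presented phrases. It works on text rather than audio, and the use of `difflib.SequenceMatcher`, the 0.8 threshold, and the function name are all assumptions, not the method of the source.

```python
from difflib import SequenceMatcher


def recognize_in_confirmation_mode(heard, affirmative, negative, threshold=0.8):
    """Classify `heard` as 'affirm', 'deny', or None.

    Only the two phrases stored for presentation are considered, which is
    the point of the confirmation mode. The similarity measure and the
    threshold value are illustrative assumptions.
    """
    def score(phrase):
        a = heard.strip().lower()
        b = phrase.strip().lower()
        return SequenceMatcher(None, a, b).ratio()

    affirm_score = score(affirmative)
    deny_score = score(negative)
    if max(affirm_score, deny_score) < threshold:
        return None  # neither phrase matched well enough
    return "affirm" if affirm_score >= deny_score else "deny"
```

A `None` result corresponds to the case where the utterance could not be recognized in the confirmation mode.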

Based on the result of voice recognition performed in the confirmation mode, the instruction receiving unit 106 detects whether the user has confirmed that the estimate is correct. If the estimate is confirmed to be correct, it finalizes the instruction content and accepts it as an instruction. Once the instruction receiving unit 106 has finalized the instruction content, the processing unit 114 performs processing based on that instruction.

FIG. 3 is a flowchart showing the processing procedure of the voice input system 100 according to the present embodiment.

When the voice input unit 102 inputs voice data (YES in S100), the voice recognition processing unit 104 refers to the mode storage unit 116 to check whether the normal mode or the confirmation mode is set (S102). If the normal mode is set (YES in S102), the voice recognition processing unit 104 performs voice recognition in the normal mode (S104).

Based on the recognition result from the voice recognition processing unit 104, the instruction receiving unit 106 refers to the processing content storage unit 122 and estimates the user's instruction content (S106). It stores the estimated instruction content, together with the terms for affirming or denying the estimate, in the presentation content storage unit 118 (S108). It also rewrites the setting in the mode storage unit 116 from the normal mode to the confirmation mode, which shifts the voice recognition processing of the voice recognition processing unit 104 to the confirmation mode (S110).

The presentation processing unit 112 refers to the presentation content storage unit 118 and presents, through the speaker 132 or the display 134, the instruction content estimated by the instruction receiving unit 106 and the terms for affirming or denying the estimate (S112).

When the voice input unit 102 subsequently receives voice input (YES in S114), the process proceeds to step S102. This time the confirmation mode is found to be set (NO in S102), and the voice recognition processing unit 104 performs voice recognition in the confirmation mode (S120). Based on the recognition result, the instruction receiving unit 106 determines whether the user has affirmed the instruction estimated in step S106 (S122). If the estimate is affirmed (YES in S122), the instruction receiving unit 106 finalizes the estimated instruction and accepts it (S124). The processing unit 114 then starts processing based on the finalized instruction (S125).

If, on the other hand, the estimate is not affirmed in step S122 (NO in S122), the instruction receiving unit 106 determines whether the estimate has been denied (S126). If it has been denied (YES in S126), cancel processing is performed (S130). Cancel processing returns the system to the state it was in before the voice input of step S100. The instruction receiving unit 106 also rewrites the setting in the mode storage unit 116 from the confirmation mode to the normal mode. At this point, it may additionally erase the estimated instruction content stored in the presentation content storage unit 118.

If the estimate is not denied in step S126 (NO in S126), that is, if the voice input in step S114 could not be recognized correctly, the instruction receiving unit 106 rewrites the setting in the mode storage unit 116 from the confirmation mode to the normal mode (S128). At this point, it may additionally erase the estimated instruction content stored in the presentation content storage unit 118. The process then returns to step S102, and the voice input in step S114 is recognized in the normal mode (S104). For this, the voice recognition processing unit 104 recognizes the voice data stored in the voice data storage unit 120.

If no voice input arrives within a predetermined time after step S112 (NO in S114, YES in S116), the same cancel processing as in step S130 is performed (S118). This completes the processing.
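The flow of FIG. 3 can be sketched as a small state machine. This is an illustrative reading of the flowchart, not the implementation of the source: text stands in for voice data, keyword matching and exact phrase comparison stand in for recognition, and the class name, the return strings, the "unrecognized" outcome, and the return to the normal mode after acceptance are all assumptions.

```python
NORMAL, CONFIRM = "normal", "confirm"


class VoiceInputSession:
    """Text-based sketch of the S100-S130 flow (hypothetical names)."""

    def __init__(self, candidates):
        # candidates: instruction -> (keyword, affirmative, negative)
        self.candidates = candidates
        self.mode = NORMAL       # plays the role of mode storage unit 116
        self.presented = None    # plays the role of storage unit 118
        self.accepted = []       # instructions passed to processing unit 114

    def _estimate(self, text):
        for instruction, (keyword, _, _) in self.candidates.items():
            if keyword in text.lower():
                return instruction
        return None

    def _reset(self):
        # Back to the normal mode, erasing the stored estimate.
        self.mode, self.presented = NORMAL, None

    def on_timeout(self):
        # S116 YES -> S118: same cancel processing as S130.
        self._reset()
        return "cancelled"

    def on_voice(self, text):
        if self.mode == CONFIRM:                 # S102 NO -> S120
            instruction, yes, no = self.presented
            heard = text.strip().lower()
            if heard == yes.lower():             # S122 YES -> S124, S125
                self.accepted.append(instruction)
                self._reset()  # assumption: not stated in the source
                return "accepted"
            if heard == no.lower():              # S126 YES -> S130
                self._reset()
                return "cancelled"
            self._reset()                        # S126 NO -> S128
            # Fall through: the same input is re-recognized in normal mode.
        instruction = self._estimate(text)       # S104, S106
        if instruction is None:
            return "unrecognized"                # assumption: not in FIG. 3
        _, yes, no = self.candidates[instruction]
        self.presented = (instruction, yes, no)  # S108
        self.mode = CONFIRM                      # S110
        return f"confirm: {instruction}"         # S112
```

For example, feeding "I will make a contract" and then "yes" to a session configured with the "contract processing start" candidate yields a confirmation request followed by acceptance.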

A specific example follows. Assume that the mode storage unit 116 is set to the normal mode, and that a user who wants to carry out some contract procedure says "I will make a contract". The voice recognition processing unit 104 recognizes the voice data for "I will make a contract" in the normal mode.

Based on the recognition result from the voice recognition processing unit 104, the instruction receiving unit 106 refers to the processing content storage unit 122 and estimates that the user's instruction content is "contract processing start". It then reads from the processing content storage unit 122 the content to be presented by the presentation processing unit 112. With the settings shown in FIG. 2(a), it stores the following in the presentation content storage unit 118: "contract processing start" as the instruction content, "Contract processing will start." as the confirmation content, "Yes" as the content the user is to utter to affirm the estimate, and "No" as the content the user is to utter to deny it. The instruction receiving unit 106 also rewrites the setting in the mode storage unit 116 from the normal mode to the confirmation mode.

The presentation processing unit 112 refers to the presentation content storage unit 118 and presents to the user, through the speaker 132 or the display 134, the message: "Contract processing will start. If this is correct, say 'Yes'; if not, say 'No'."
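Composing such a message from the stored fields might look like the sketch below. The function name `build_prompt` and the exact English wording of the template are assumptions chosen for illustration.

```python
def build_prompt(confirmation, affirmative, negative):
    """Combine the confirmation content with the two phrases to be uttered.

    The template wording is illustrative; the source only gives the
    resulting message for the example entries.
    """
    return (
        f'{confirmation} If this is correct, say "{affirmative}"; '
        f'if not, say "{negative}".'
    )
```

Because the phrases are parameters, the same helper would serve any row of the stored presentation content.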

If the user then says "Yes", for example, the voice recognition processing unit 104 performs voice recognition in the confirmation mode. Because the user's input at this point is very likely to be either "Yes" or "No", the voice recognition processing unit 104 can recognize it accurately. Based on the recognition result, the instruction receiving unit 106 detects that the user has confirmed the instruction "contract processing start" and accepts that instruction. If, on the other hand, "Yes" is not detected and "No" is detected, cancel processing is performed. If the content of the user's utterance cannot be detected at all, voice recognition is performed again in the normal mode. The same processing is then repeated according to the recognition result.

The instruction receiving unit 106 detects whether the user has said that the estimate presented by the presentation processing unit 112 is correct and, when such an utterance is detected, finalizes the user's instruction. Contract processing then starts.

Another specific example follows. The instruction receiving unit 106 can estimate the user's instruction content based on the result of voice recognition performed in the normal mode and determine a general term that expresses the result of the estimate. The presentation processing unit 112 can present that general term as the content the user is to utter. FIG. 2(b) shows an example of the internal configuration of the processing content storage unit 122 for this mode of operation. Here too, the instruction content column stores, for example, "contract processing start". As the general term for the instruction content "contract processing start", the phrase "Please start contract processing." is stored. That is, in association with this instruction content, the unit stores "Contract processing will start." as the confirmation content, "Please start contract processing." as the affirmative, and "That is wrong." as the negative.

Assume first that the mode storage unit 116 is set to the normal mode, and that a user who wants to carry out some contract procedure says "I will make a contract". The voice recognition processing unit 104 recognizes the voice data for "I will make a contract" in the normal mode.

Based on the recognition result from the voice recognition processing unit 104, the instruction receiving unit 106 estimates that the user is giving the instruction "contract processing start". It then reads from the processing content storage unit 122 the content to be presented by the presentation processing unit 112. With the settings shown in FIG. 2(b), it stores the following in the presentation content storage unit 118: "contract processing start" as the instruction content, "Contract processing will start." as the confirmation content, "Please start contract processing." as the content the user is to utter to affirm the estimate, and "That is wrong." as the content the user is to utter to deny it. The instruction receiving unit 106 also rewrites the setting in the mode storage unit 116 from the normal mode to the confirmation mode.

The presentation processing unit 112 refers to the presentation content storage unit 118 and presents to the user, through the speaker 132 or the display 134, the message: "Contract processing will start. If this is correct, say 'Please start contract processing.'; if not, say 'That is wrong.'"

If the user then says "Please start contract processing.", for example, the voice recognition processing unit 104 performs voice recognition in the confirmation mode. Because the user's input at this point is very likely to be either "Please start contract processing." or "That is wrong.", the voice recognition processing unit 104 can recognize it accurately. Based on the recognition result, the instruction receiving unit 106 detects that the user has confirmed the instruction "contract processing start" and accepts that instruction.

If, on the other hand, "Please start contract processing." is not detected and "That is wrong." is detected, cancel processing is performed. If the content of the user's utterance cannot be detected at all, voice recognition is performed again in the normal mode. The same processing is then repeated according to the recognition result.

With the voice input system 100 of the present embodiment, instructions can be given by voice input alone, so the user can request a desired process simply by speaking the necessary information.

(Second Embodiment)
The present embodiment differs from the first embodiment in that the microphone is provided in a remote control terminal.

FIG. 4 shows the configuration of the voice input system 100 according to the present embodiment.
In the present embodiment, the voice input system 100 includes a system main body 101 and a remote control terminal 200. The system main body 101 has the same configuration as the voice input system 100 of the first embodiment, except that it has a receiving unit 150 instead of the microphone 130.

The remote control terminal 200 includes a microphone 202, a voice conversion unit 204, and a transmission unit 206. The voice conversion unit 204 converts the voice input from the microphone 202 into digital form. The transmission unit 206 transmits the voice data converted by the voice conversion unit 204 to the receiving unit 150 of the system main body 101. The voice input unit 102 inputs the voice data received by the receiving unit 150. Communication between the receiving unit 150 and the transmission unit 206 can be carried out over various kinds of networks; for example, infrared can be used.
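The data path from the microphone 202 to the voice input unit 102 can be illustrated with a minimal sketch. The source does not specify a sample format or a transport, so the 16-bit little-endian PCM encoding, the in-memory `Link` class, and the function names `digitize` and `voice_input` are assumptions made purely for illustration.

```python
import struct
from collections import deque


def digitize(samples):
    """Conversion unit 204 (sketch): clamp to [-1, 1] and quantize to 16-bit PCM bytes."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


class Link:
    """Hypothetical in-memory stand-in for the link between units 206 and 150."""

    def __init__(self):
        self._frames = deque()

    def send(self, payload: bytes) -> None:
        # Role of the transmission unit 206 on the remote control terminal.
        self._frames.append(payload)

    def receive(self) -> bytes:
        # Role of the receiving unit 150 on the system main body.
        return self._frames.popleft()


def voice_input(payload: bytes):
    """Voice input unit 102 (sketch): decode received bytes back into PCM integers."""
    return list(struct.unpack(f"<{len(payload) // 2}h", payload))
```

In a real system the `Link` would be replaced by whatever transport is chosen, such as the infrared link mentioned above; the rest of the pipeline is unaffected by that choice.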

Although FIG. 4 shows a configuration in which the system main body 101 includes the speaker 132 and the display 134, these may be provided separately from the system main body 101. Either or both of the speaker 132 and the display 134 may be provided in the remote control terminal 200. Furthermore, the other components of the system main body 101 likewise need not be housed in a single casing; they may be distributed across a plurality of terminals and exchange data via a network or the like.

Each component enclosed by a broken line in the voice input system 100 shown in FIGS. 1 and 4 represents a functional block, not a hardware unit. Each such component is realized by an arbitrary combination of hardware and software, centered on the CPU and memory of an arbitrary computer, a program loaded into the memory that realizes the components shown in the figures, a storage unit such as a hard disk that stores the program, and a network connection interface. Those skilled in the art will understand that there are various modifications of the realization method and apparatus.

Embodiments of the present invention have been described above with reference to the drawings. These are examples of the present invention, and various configurations other than the above may also be adopted.

In the above embodiments, the voice input system 100 has been described as having the presentation content storage unit 118 and the processing content storage unit 122, but these may be formed as a single unit. In this case, the instruction receiving unit 106 can distinguish, by marking or the like, the instruction content that it estimates the user to have given from among the plurality of instruction contents stored in the processing content storage unit 122.
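The combined storage with marking described above might look like the following sketch. The class names `InstructionEntry` and `CombinedStore`, the field names, and the rule that only one entry is marked at a time are assumptions for illustration; the source only states that the estimated instruction content is distinguished "by marking or the like".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InstructionEntry:
    process: str          # processing content associated with the instruction
    utterance: str        # content the user is asked to utter for confirmation
    marked: bool = False  # True while this entry is the current estimate


@dataclass
class CombinedStore:
    """Single store playing the roles of units 118 and 122 (sketch)."""
    entries: Dict[str, InstructionEntry] = field(default_factory=dict)

    def register(self, key: str, process: str, utterance: str) -> None:
        self.entries[key] = InstructionEntry(process, utterance)

    def mark_estimated(self, key: str) -> None:
        """Mark the estimated instruction and clear any earlier mark."""
        if key not in self.entries:
            raise KeyError(key)
        for k, entry in self.entries.items():
            entry.marked = (k == key)

    def presented_utterance(self) -> Optional[str]:
        """Return the utterance of the marked entry, or None if nothing is marked."""
        for entry in self.entries.values():
            if entry.marked:
                return entry.utterance
        return None
```

With this layout, the confirmation mode only needs to look up the marked entry to know which content was presented to the user.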

In the above embodiments, the voice input system 100 has been shown as including the speaker 132 and the display 134, but it may instead include only one of the two.

FIG. 1 is a block diagram showing the configuration of a voice input system according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the internal configuration of the processing content storage unit.
FIG. 3 is a flowchart showing the processing procedure of the voice input system according to an embodiment of the present invention.
FIG. 4 is a block diagram showing the configuration of a voice input system according to an embodiment of the present invention.

Explanation of symbols

100 Voice input system
101 System main body
102 Voice input unit
104 Voice recognition processing unit
106 Instruction receiving unit
112 Presentation processing unit
114 Processing unit
116 Mode storage unit
118 Presentation content storage unit
120 Voice data storage unit
122 Processing content storage unit
130 Microphone
132 Speaker
134 Display
150 Receiving unit
200 Remote control terminal
202 Microphone
204 Voice conversion unit
206 Transmission unit

Claims

A voice input system comprising:
a voice input unit that inputs voice data of a user's voice;
a voice recognition processing unit that performs voice recognition on the voice data input by the voice input unit in at least two modes, a normal mode and a confirmation mode;
an instruction receiving unit that, based on a result of voice recognition performed by the voice recognition processing unit in the normal mode, estimates instruction content from the user and shifts the voice recognition processing unit to the confirmation mode; and
a presentation processing unit that presents the estimation made by the instruction receiving unit and, in order to have the user utter whether the estimation is correct, presents content to be uttered by the user,
wherein, when voice data is input from the voice input unit while the voice recognition processing unit is in the confirmation mode, the voice recognition processing unit performs voice recognition by comparing the voice data with the content presented by the presentation processing unit, and
the instruction receiving unit detects, based on a result of voice recognition performed by the voice recognition processing unit in the confirmation mode, whether the user has confirmed that the estimation is correct, and accepts the instruction content as an instruction when it has been confirmed that the estimation is correct.

The voice input system according to claim 1,
further comprising a storage unit that stores the content presented by the presentation processing unit,
wherein, in the confirmation mode, the voice recognition processing unit refers to the storage unit and compares the voice data input from the voice input unit with the content.

The voice input system according to claim 1 or 2,
wherein the instruction receiving unit estimates the instruction content from the user based on a result of voice recognition performed by the voice recognition processing unit in the normal mode and determines a general term indicating a result of the estimation, and
the presentation processing unit presents the general term determined by the instruction receiving unit as the content to be uttered by the user.