JP2004341293A - Device and method for converting speech into character data, and program and interaction device therefor

Info

Publication number: JP2004341293A
Application number: JP2003138606A
Authority: JP
Inventors: Ryo Murakami; 涼村上
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2003-05-16
Filing date: 2003-05-16
Publication date: 2004-12-02

Abstract

PROBLEM TO BE SOLVED: To provide a technique that can be expected to convert speech uttered by a user accurately into the corresponding character data.
SOLUTION: Disclosed is a device that converts speech uttered by a user into character data. The character data conversion device comprises means for inputting speech; means for converting the input speech into character data; correction means that allows the user to correct the character data produced by the conversion means; means for storing the character data produced by the conversion means in association with the character data to which it was corrected; and means for specifying the "corrected character data" by searching the storage means using the character data produced by the conversion means as a key.
COPYRIGHT: (C) 2005, JPO & NCIPI

Description

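[0001]
[Technical Field of the Invention] The present invention relates to a technique for converting voice uttered by a human into character data.
[0002]
[Prior Art] In recent years, devices that can converse with a human (user) and devices that operate in response to voice uttered by a human (voice-responsive robots) have been developed. These devices take in voice uttered by a human, convert the input voice into character data, and execute predetermined processing according to the converted character data. For example, some interactive devices output a reply corresponding to the converted character data, and some voice-responsive robots perform an action corresponding to the converted character data.
Such interactive devices and voice-responsive robots need to convert the voice uttered by the user into character data accurately. Patent Document 1 discloses one technique for converting voice uttered by a user into character data. With the technique of Patent Document 1, words that the user utters slowly, separating each sound, can be converted into character data.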
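[0003]
[Patent Document 1]
JP-A-2000-242295
[0004]
[Problem to Be Solved by the Invention] The technique of Patent Document 1 presupposes that each sound uttered by the user can be converted into character data accurately. In practice, however, the voice is sometimes not converted into the character data that corresponds to what the user uttered. For example, the user may utter the word X and yet have it converted into the character data Y. One likely cause is that voice quality, pronunciation, and the like differ from person to person, which makes it difficult to generalize a method of converting input voice into character data so that it works for every user.
[0005]
The present invention has been made in view of these circumstances, and its object is to provide a technique that can be expected to convert voice uttered by a user accurately into the corresponding character data.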
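[0006]
[Means for Solving the Problem, Operation, and Effects] The invention of claim 1, created to solve the above problem, is a device that converts voice uttered by a user into character data. This character data conversion device comprises: means for inputting voice; means for converting the input voice into character data; correction means that allows the user to correct the character data converted by the conversion means; means for storing the character data converted by the conversion means in association with the character data to which it was corrected; and means for specifying the "corrected character data" by searching the storage means using the character data converted by the conversion means as a key.
With the character data conversion device of claim 1, when the conversion means converts the user's voice into incorrect character data, the incorrectly converted character data can be corrected with the correction means. Any correction method may be used: for example, the user may enter character data with a keyboard, or the user may utter the word again. For example, if the user utters X but the conversion means mistakenly converts it into the character data Y, the user can correct the converted character data Y to X. Information associating X with Y is then stored in the storage means. Once this information is stored, if X uttered by the user is again mistakenly converted to Y, X can be specified by searching the storage means with Y as the key. With the present invention, even when the conversion means produces character data that differs from what the user uttered, the error can be compensated for and the character data corresponding to the user's voice can be specified.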
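[0007]
Because voice quality, pronunciation, and the like differ from user to user, the following situation can arise. For example, the conversion means may convert user A's utterance of Y into the character data X, while also converting user B's utterance of Z into the character data X. If both are corrected, the storage means stores X (converted character data) in association with Y (corrected character data), and also stores X (converted character data) in association with Z (corrected character data). In that case, when the conversion means again converts user A's utterance of Y into the character data X, Z might be specified. To address this problem, the character data conversion device described above may be configured as follows. That is, user specifying means for specifying the user is added to the character data conversion device. In this case, the storage means stores, in association with one another, the user specified by the user specifying means, the character data converted by the conversion means, and the character data to which it was corrected. The specifying means then searches the storage means using the user specified by the user specifying means and the character data converted by the conversion means as keys, and specifies the "corrected character data".
With this configuration, in the example above, user A, X (converted character data), and Y (corrected character data) can be stored in association with one another, and user B, X (converted character data), and Z (corrected character data) can likewise be stored in association with one another. Then, when the conversion means again converts user A's utterance of Y into the character data X, the storage means is searched using user A and X as keys, and Y is specified. The present invention can thus be expected to convert the voice uttered by each individual user accurately into the corresponding character data.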
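[0008]
The storage means may be capable of storing a plurality of pieces of "corrected character data", each with a weight, in association with the character data converted by the conversion means. In this case, when a plurality of pieces of "corrected character data" are stored in association with the "character data converted by the conversion means" used as the key, the specifying means specifies the "corrected character data" with the largest weight.
The "weight" here means, for example, the ratio of the number of times the user uttered that "corrected character data" to the total number of times the conversion means produced the "character data converted by the conversion means" associated with it.
The invention is explained using the following events as an example. (1) The user uttered X, but the conversion means converted it to Z, and the user corrected Z to X. (2) The user uttered X, but the conversion means again converted it to Z; the storage means was searched using Z, and X was specified. (3) The user uttered Y, but the conversion means converted it to Z, and the user corrected Z to Y.
When events (1) to (3) have occurred, the storage means stores Z (converted character data), X (corrected character data), and 2/3 (weight) in association with one another, and stores Z (converted character data), Y (corrected character data), and 1/3 (weight) in association with one another. The denominator of each weight is the total number of times the conversion means converted an utterance to Z. The numerator is the number of times the user uttered X or Y, respectively. Suppose the following event then occurs. (4) The user uttered X, but the conversion means converted it to Z. When event (4) occurs, the specifying means searches the storage means with Z as the key and specifies X, which has the larger weight. In this way, the present invention selects the "corrected character data" that is most likely to have been misconverted, so it copes well even when one piece of character data has been corrected to several different pieces of character data.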
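[0009]
The present invention can be embodied as a device that converses with a user by outputting, as voice, a reply corresponding to the voice uttered by the user. This interactive device comprises: means for inputting voice; means for converting the input voice into character data; correction means that allows the user to correct the character data converted by the conversion means; first storage means for storing the character data converted by the conversion means in association with the character data to which it was corrected; second storage means that stores a plurality of pieces of character data and a reply for each piece of character data; means for specifying the "corrected character data" by searching the first storage means using the "character data converted by the conversion means" as a key when that character data is not stored in the second storage means; and means for outputting, as voice, the reply corresponding to the character data converted by the conversion means or to the "corrected character data" specified by the specifying means, when that character data is stored in the second storage means.
With this interactive device, a reply corresponding to the voice uttered by the user can be expected to be output even when the voice is converted into character data that differs from what the user uttered.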
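[0010]
The present invention can also be defined as a method by which a computer converts voice uttered by a user into character data. This character data conversion method comprises: a step of inputting voice; a step of converting the input voice into character data; a step of storing, when the user has corrected the character data converted in the conversion step, that "character data converted in the conversion step" in association with the character data to which it was corrected; and a step of specifying the "corrected character data" by searching the contents stored in the storing step using the character data converted in the conversion step as a key.
With this method, the computer can be expected to convert the voice uttered by the user accurately into character data.
[0011]
The present invention can also be defined as a program that causes a computer to convert voice uttered by a user into character data. This program causes the computer to execute the following processes: a process of inputting voice; a process of converting the input voice into character data; a process of storing, when the user has corrected the character data converted in the conversion process, that "character data converted in the conversion process" in association with the character data to which it was corrected; and a process of specifying the "corrected character data" by searching the contents stored in the storing process using the character data converted in the conversion process as a key.
With this program, the computer can be expected to convert the voice uttered by the user accurately into character data.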
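[0012]
[Embodiments of the Invention] The invention described in each of the above claims can be suitably implemented in the following modes.
(Mode 1) The character data conversion device has a microphone for inputting the voice (words) uttered by the user. The microphone converts the input sound waves into an electrical signal. The conversion means converts the voice, converted into an electrical signal by the microphone, into character data in text format.
(Mode 2) The character data conversion device has a keyboard. The user can correct the character data converted by the conversion means by typing characters on the keyboard.
(Mode 3) The second storage means stores questions for the user. It also stores a plurality of answers to each question and a reply for each answer.
(Mode 4) The user specifying means specifies the user from a user name entered with the keyboard.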
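[0013]
[Embodiment] (First Embodiment) An embodiment of the present invention is described with reference to the drawings. FIG. 1 shows the schematic configuration of an interactive device (interactive computer) 10 according to this embodiment. The interactive device 10 asks the user a question, specifies the answer the user utters in response, and replies to the specified answer.
The interactive device 10 comprises a microphone 20, a voice identification unit 30, a control unit 40, a voice synthesis unit 50, a speaker 60, a display 70, a keyboard 80, a first database 90, a second database 100, a user name-user ID storage unit 110, and so on.
The microphone 20 takes in the words (sound waves) uttered by the user, converts them into an electrical signal, and sends the signal to the voice identification unit 30.
The voice identification unit 30 converts the words, now in the form of an electrical signal, into character data in text format and sends the character data to the control unit 40.
The control unit 40 is connected to the voice identification unit 30, the voice synthesis unit 50, the display 70, the keyboard 80, the first database 90, the second database 100, and the user name-user ID storage unit 110. The control unit 40 receives the character data sent from the voice identification unit 30 and executes processing to output a reply based on that character data. The control unit 40 also executes processing to display information on the display 70, processing to change the stored contents of the second database 100 and the user name-user ID storage unit 110, and so on. Each process executed by the control unit 40 is described in detail later.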
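[0014]
The display 70 displays various images, which are described later. The user can enter character data with the keyboard 80. Character data entered on the keyboard 80 is sent to the control unit 40.
The user name-user ID storage unit 110 stores each user's name in association with an ID that identifies that user. FIG. 2 shows an example of the stored contents of the user name-user ID storage unit 110. A user of the interactive device 10 registers his or her name in the interactive device 10 in advance by typing the name on the keyboard 80. The entered user name is sent to the control unit 40. On receiving a user name, the control unit 40 obtains a user ID and stores the user name and user ID in association with each other in the user name-user ID storage unit 110.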
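[0015]
The first database 90 stores, in association with one another, questions for the user, a group of expected answers to each question, and a reply to each answer. FIG. 3 shows an example of the stored contents of the first database 90. For example, answers (apple, pear, mandarin orange, and so on) are stored in association with the question "What is your favorite fruit?". A reply is also stored for each answer. For apple, for example, the reply "Apples are delicious, aren't they?" is stored.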
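[0016]
FIG. 4 shows an example of the stored contents of the second database 100. The second database 100 stores user IDs, specific words, true words, and weights in association with one another.
A "specific word" is a word (converted character data) specified by the voice identification unit 30 that is not stored in the first database 90. For example, when an utterance is converted into a word different from the one the user uttered, the converted word may become a specific word. In this specification, a "word" means a single character or a group of characters, which is a broader concept than a word in the usual sense. As described in detail later, when the voice identification unit 30 converts an utterance into a word different from the one the user uttered, the user can correct the misconverted word (by entering a word with the keyboard 80).
A "true word" is a word that the user entered with the keyboard 80. One example is the word to which the user corrected a word that the voice identification unit 30 had converted into character data different from what the user uttered (put simply, the corrected form of a word that the voice identification unit 30 specified incorrectly). In FIG. 4, the true word "nashi" (なし, pear) is stored in association with the specific word "yashi" (やし, palm). This means that the user uttered "nashi" but the voice identification unit 30 converted it to "yashi", and that "yashi" was corrected to "nashi". The true word "yashi" is also stored in association with the specific word "yashi". This means that the user uttered "yashi", the voice identification unit 30 converted it to "yashi", and, because "yashi" is not stored in the first database 90, the user was asked to enter the word with the keyboard 80 and entered "yashi".
A "weight" is the ratio (in percent) of the number of times the user uttered the true word associated with that weight to the total number of times the voice identification unit 30 produced the specific word associated with that weight. In FIG. 4, the specific word "yashi", the true word "nashi", and the weight "80" are stored in association with one another, which means that the user had uttered "nashi" on 80% of the occasions on which the voice identification unit 30 produced "yashi". A fraction is stored together with each weight (shown in parentheses in the weight column of FIG. 4): its denominator is the total number of times the voice identification unit 30 produced the specific word associated with that weight, and its numerator is the number of times the user uttered the true word associated with that weight. In FIG. 4, for example, "16/20" appears in parentheses in the weight for the specific word "yashi" and the true word "nashi", which means that the voice identification unit 30 produced the specific word "yashi" 20 times and that on 16 of those occasions the user had uttered "nashi". Likewise, "1/1" appears in parentheses in the weight for the specific word "ashi" (あし) and the true word "nashi", which means that the voice identification unit 30 produced the specific word "ashi" only once and that it was corrected to "nashi" (that is, the user had uttered "nashi").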
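[0017]
The voice synthesis unit 50 shown in FIG. 1 converts the character data output from the control unit 40 into an electrical signal and sends that signal to the speaker 60, which then outputs it as voice.
[0018]
Next, the processing executed by the interactive device 10 is described. FIGS. 5 to 7 show flowcharts of this processing.
Before using the interactive device 10, the user enters his or her name with the keyboard 80. The interactive device 10 reads the entered user name (step S2). It then searches the user name-user ID storage unit 110 with that user name and specifies the user ID (step S4).
Next, the interactive device 10 asks the user a question (step S6). Specifically, the control unit 40 randomly selects one of the questions stored in the first database 90 and sends the character data of the selected question to the voice synthesis unit 50. The voice synthesis unit 50 converts the received character data (the question) into an electrical signal and sends it to the speaker 60, so that the question is output as voice from the speaker 60.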
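[0019]
After asking the question, the device specifies the user's answer to it (step S8). This is done by having the voice identification unit 30 convert the user's answer, input through the microphone 20, into character data. The voice identification unit 30 sends the specified answer (the converted character data) to the control unit 40. Note that, in response to the question "What is your favorite fruit?", the user might answer "リンゴです" ("It is apple") or "私はリンゴが好きです" ("I like apples"), or might answer with just the word "リンゴ" ("apple"). In step S8, the control unit 40 removes the phrases attached before and after the answer word ("私は", "が好きです", "です", and so on) and specifies only the answer word ("リンゴ" in this example).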
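The phrase removal in step S8 can be pictured with a small Python sketch. The description does not say how the control unit 40 strips the surrounding phrases, so the fixed prefix and suffix lists and the function name `extract_answer_word` below are assumptions made purely for illustration; only the three example phrases come from the text.

```python
# Hypothetical affix lists: the description names these phrases only as examples.
PREFIXES = ("私は",)
SUFFIXES = ("が好きです", "です")


def extract_answer_word(utterance: str) -> str:
    """Return the answer word with known leading/trailing phrases removed."""
    text = utterance.strip().rstrip("。")
    for prefix in PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    # Try longer suffixes first so "が好きです" is removed as a whole
    # instead of only its trailing "です".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            text = text[: -len(suffix)]
            break
    return text
```

All three answer styles mentioned above ("リンゴです", "私はリンゴが好きです", "リンゴ") reduce to "リンゴ" under this sketch. A real recognizer would likely need morphological analysis rather than fixed strings.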
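Next, the device determines whether the specified answer is stored in the first database 90 (step S10). This is done by checking whether the answer specified in step S8 is among the group of answers stored in association with the question selected in step S6. For example, because "nashi" (梨, pear) is stored in association with the question "What is your favorite fruit?", step S10 yields YES if the answer specified in step S8 is "nashi". In contrast, because "yashi" is not stored in association with that question, step S10 yields NO if the answer specified in step S8 is "yashi".
If step S10 yields YES, the device outputs the reply associated with the answer specified in step S8 (step S12). This is done as follows. First, the first database 90 is searched with the answer specified in step S8 to specify the reply associated with it. The specified reply is then sent to the voice synthesis unit 50. The voice synthesis unit 50 converts the reply from character data into an electrical signal and sends it to the speaker 60, so that the reply is output as voice.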
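[0020]
FIG. 8 shows examples of conversations between the interactive device 10 and the user. When processing follows steps S6, S8, S10, and S12, the conversation looks like Pattern 1 in FIG. 8. In Pattern 1, the device asks "What is your favorite fruit?", the user answers "nashi" (pear), and the voice identification unit 30 specifies "nashi". The device then replies "The crisp texture is delicious, isn't it?".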
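[0021]
If step S10 yields NO, processing proceeds to step S20 in FIG. 6. Step S20 determines whether the answer specified in step S8 is stored as a specific word in the second database 100. This is done by searching the second database 100 with the answer specified in step S8. The search does not cover every specific word stored in the second database 100, however; it determines whether the answer specified in step S8 is among the specific words associated with the user ID specified in step S4. For example, if the user ID is "XXX1" and the answer specified in step S8 is "yashi", step S20 yields YES because the specific word "yashi" is stored in association with the user ID "XXX1" (see FIG. 4). If step S20 yields YES, processing proceeds to step S22; if it yields NO, processing proceeds to step S40 in FIG. 7.
Step S22 specifies the true word with the largest weight. For example, if the user ID is "XXX1" and the answer specified in step S8 is "yashi", the true word "nashi", which has the largest weight, is specified.
Once a true word has been specified in step S22, the device determines whether that true word is stored in the first database 90 (step S24). If step S24 yields YES, processing proceeds to step S26; if it yields NO, processing proceeds to step S34.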
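[0022]
In step S34, the device outputs by voice "I do not understand your words. I will change the question." It then returns to step S6 and outputs another question. In other words, because the conversation cannot continue any further, the device moves on to another question.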
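[0023]
Step S26 asks the user to confirm whether he or she uttered the true word specified in step S22. For example, if the true word specified in step S22 is "nashi", the device outputs by voice "Did you say nashi (pear)?". At the same time, it displays "Please answer yes or no" on the screen of the display 70. This display is produced by the control unit 40 sending display data to the display 70.
Next, the device monitors whether the user utters "yes" or "no" (step S28). This is done by the voice identification unit 30 specifying "yes" or "no" and the control unit 40 receiving that result. If the user utters "yes", processing proceeds to step S30; if the user utters "no", processing proceeds to step S40 in FIG. 7.
Step S30 outputs by voice the reply corresponding to the true word specified in step S22. For example, if that true word is "nashi", the device outputs "The crisp texture is delicious, isn't it?", which is stored in association with "nashi" in the first database 90.
After step S30, the device changes the stored contents of the second database 100 (step S32). For example, if the specific word is "yashi", the true word "nashi" has been specified, and processing has run through step S30, the weight associated with the specific word "yashi" and the true word "nashi" is increased, and the weights associated with the specific word "yashi" and the other true words (for example "yashi" and "yagi" (やぎ)) are decreased. In the example of FIG. 4, the weight "80 (16/20)" for the specific word "yashi" and the true word "nashi" is changed to "81 (17/21)". The weight "15 (3/20)" for the specific word "yashi" and the true word "yashi" is changed to "14 (3/21)", and the weight "5 (1/20)" for the specific word "yashi" and the true word "yagi" is changed to "5 (1/21)". Similarly, if the word specified in step S8 is "ashi", the true word "nashi" is specified in step S22, and processing runs through step S30, the weight "100 (1/1)" is changed to "100 (2/2)".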
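The weight update in step S32 amounts to incrementing one numerator and the shared denominator, then recomputing every percentage. The following Python sketch reproduces the figures quoted from FIG. 4. The function name `record_utterance`, the dictionary layout, and the use of rounding to the nearest integer are assumptions; the description does not state a rounding rule, although nearest-integer rounding matches every value it quotes.

```python
def record_utterance(counts: dict, total: int, true_word: str):
    """Update the tallies for one specific word after the user's true word is known.

    counts maps each true word to the number of times the user uttered it
    when the recognizer produced this specific word (the numerators).
    total is the number of times the recognizer produced the specific word
    (the shared denominator).
    """
    new_counts = dict(counts)
    new_counts[true_word] = new_counts.get(true_word, 0) + 1
    new_total = total + 1
    # Weight in percent, rounded to the nearest integer (assumed rule).
    weights = {word: round(100 * n / new_total) for word, n in new_counts.items()}
    return new_counts, new_total, weights


# Specific word "yashi" before the update: 16/20, 3/20, 1/20.
before = {"nashi": 16, "yashi": 3, "yagi": 1}
counts, total, weights = record_utterance(before, 20, "nashi")
print(counts, total, weights)
```

With the tallies above, confirming "nashi" gives weights of 81, 14, and 5 over a denominator of 21, which agrees with the values in the text. Because all true words for one specific word share a denominator, comparing raw counts would select the same true word as comparing weights.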
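[0024]
When processing follows steps S6 to S10 and S20 to S32, the conversation looks like Pattern 2 in FIG. 8. In Pattern 2, the user answers "nashi" (pear) to the question "What is your favorite fruit?", but the voice identification unit 30 mistakenly specifies "yashi". This "yashi" is not in the first database 90 but is stored in the second database 100. Because the weight associated with the specific word "yashi" and the true word "nashi" is the largest, the device asks the user "Did you say nashi (pear)?". The user answers "yes", so the device replies "The crisp texture is delicious, isn't it?". Finally, the weights in the second database 100 are changed.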
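[0025]
Next, the processing from step S40 is described with reference to FIG. 7. In step S40, the device outputs by voice "Please type the word you said on the keyboard". It then monitors whether the user types on the keyboard 80 (step S42). The result is YES if input is made on the keyboard 80 within 30 seconds of the voice output in step S40, and NO otherwise. If the result is YES, processing proceeds to step S44; if NO, to step S50. By entering character data on the keyboard 80, the user can correct character data that the voice identification unit 30 converted incorrectly. For example, suppose no specific word is stored in association with the user ID "XXX1". If the user utters "nashi" but step S8 specifies "yashi", the user can type "nashi" on the keyboard 80, whereby "yashi" is corrected to "nashi".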
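[0026]
In step S50, the device outputs by voice "I do not understand your words. I will change the question." In this case it returns to step S6 (FIG. 5) and outputs another question.
In step S44, on the other hand, the device determines whether the character data typed on the keyboard is stored in the first database 90, that is, whether the character data typed in step S42 is among the group of answers associated with the question selected in step S6. If the result is YES, processing proceeds to step S46; if NO, to step S48.
In step S48, the device outputs by voice "I do not understand your words. I will change the question." In this case it returns to step S6 (FIG. 5) and outputs another question. In step S46, the device outputs the reply corresponding to the typed word. After step S46 or step S48, the stored contents of the second database 100 are changed (step S52). How the contents are changed in step S52 is described in detail next.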
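The branching from step S10 through step S52 can be summarized in one function. This is a rough Python sketch under several assumptions: the first database is reduced to a dictionary of answer-to-reply pairs for the current question, the second database is a dictionary keyed by (user ID, specific word) holding per-true-word counts, and the spoken yes/no confirmation and the 30-second keyboard wait are replaced by the hypothetical callbacks `confirm` and `typed_input`. None of these names appear in the description.

```python
from typing import Callable, Optional

UNKNOWN = "I do not understand your words. I will change the question."


def handle_answer(
    answer: str,
    user_id: str,
    replies: dict,
    corrections: dict,
    confirm: Callable[[str], bool],
    typed_input: Callable[[], Optional[str]],
) -> str:
    """Return the reply for one recognized answer, updating corrections in place."""
    if answer in replies:                                   # S10
        return replies[answer]                              # S12
    candidates = corrections.get((user_id, answer))         # S20
    if candidates:
        # Counts share one denominator, so the largest count is the largest weight.
        best = max(candidates, key=candidates.get)          # S22
        if best not in replies:                             # S24
            return UNKNOWN                                  # S34
        if confirm(best):                                   # S26, S28
            candidates[best] += 1                           # S32
            return replies[best]                            # S30
    typed = typed_input()                                   # S40, S42 (None = timeout)
    if typed is None:
        return UNKNOWN                                      # S50
    entry = corrections.setdefault((user_id, answer), {})   # S52
    entry[typed] = entry.get(typed, 0) + 1
    return replies.get(typed, UNKNOWN)                      # S46 or S48
```

Calling the function with `answer="yashi"`, an empty `corrections` dictionary, and a `typed_input` that returns "nashi" follows the S20-NO path and leaves a first correction record for that user; a later call with the same misrecognition then takes the S20-YES path and asks for confirmation.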
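[0027]
When step S20 yields NO and processing then follows steps S40, S42, S44, S46, and S52, the conversation looks like Pattern 3 in FIG. 8. In Pattern 3, the user answers "nashi" (pear) to the question "What is your favorite fruit?", but the voice identification unit 30 mistakenly specifies "yashi". The specified "yashi" is not stored in the first database 90, nor is it stored as a specific word in the second database 100. When prompted for keyboard input, the user types "nashi" (that is, "yashi" is corrected to "nashi"). The device then replies "The crisp texture is delicious, isn't it?". In this case, the specific word "yashi" and the true word "nashi" are stored in association with each other in the second database 100, with a weight of 100 (1/1).
When step S20 yields NO and processing then follows steps S40, S42, S44, S48, and S52, the conversation looks like Pattern 4 in FIG. 9. In Pattern 4, the user answers "yashi" to the question "What is your favorite fruit?", and the voice identification unit 30 specifies "yashi". The specified "yashi" is not stored in the first database 90, nor is it stored as a specific word in the second database 100. When prompted for keyboard input, the user types "yashi" on the keyboard 80. The device then outputs by voice "I do not understand your words. I will change the question." In this case, the specific word "yashi" and the true word "yashi" are stored in association with each other in the second database 100, with a weight of 100 (1/1).
When step S28 yields NO and processing then follows steps S40, S42, S44, S48, and S52, the conversation looks like Pattern 5 in FIG. 9. In Pattern 5, the user answers "yashi" to the question "What is your favorite fruit?", and the voice identification unit 30 specifies "yashi". This "yashi" is not stored in the first database 90 but is stored in the second database 100. Because the weight associated with the specific word "yashi" and the true word "nashi" is the largest, the device asks the user "Did you say nashi (pear)?". The user answers "no", so the device prompts for keyboard input, and the user types "yashi" on the keyboard 80. The device then outputs by voice "I do not understand your words. I will change the question." In this case, all the weights associated with the specific word "yashi" are changed. In the example of FIG. 4, the weight "15 (3/20)" for the specific word "yashi" and the true word "yashi" is changed to "19 (4/21)", the weight "80 (16/20)" for the specific word "yashi" and the true word "nashi" is changed to "76 (16/21)", and the weight "5 (1/20)" for the specific word "yashi" and the true word "yagi" is changed to "5 (1/21)".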
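[0028]
With the interactive device 10 of this embodiment, even when the voice identification unit 30 mistakenly converts an utterance into character data that differs from the words the user uttered, the error can be compensated for and the words the user uttered can be specified accurately. In addition, because the second database 100 stores specific words and true words for each user, true words can be specified to suit individual users whose pronunciation, voice quality, and the like differ.
[0029]
Specific examples of the present invention have been described in detail above, but they are merely illustrations and do not limit the scope of the claims. The technology described in the claims includes various modifications and alterations of the specific examples given above.
In the embodiment above, the user is specified by entering a name with the keyboard 80 (step S2), but the user can also be specified by, for example, having the user utter his or her name. The user can also be specified by photographing the user with a camera. When a camera is used, there is no need to register a user name; it suffices to store the features of the user's face in association with a user ID.
In the embodiment above, the user corrects a specific word to a true word by typing the true word on the keyboard 80, but the following methods can also be used. For example, the user can write the true word, which is then photographed with a camera or the like and read. The user can also correct the word by uttering it again.
The technical elements described in this specification or the drawings exhibit technical utility alone or in various combinations, and are not limited to the combinations recited in the claims as filed. The technology illustrated in this specification or the drawings achieves a plurality of objects simultaneously, and achieving any one of those objects has technical utility in itself.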
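[Brief Description of the Drawings]
[FIG. 1] Schematic configuration of the interactive device according to the embodiment.
[FIG. 2] An example of the stored contents of the user name-user ID storage unit.
[FIG. 3] An example of the stored contents of the first database.
[FIG. 4] An example of the stored contents of the second database.
[FIG. 5] Flowchart of the processing executed by the interactive device.
[FIG. 6] Flowchart of the processing executed by the interactive device (continued from FIG. 5).
[FIG. 7] Flowchart of the processing executed by the interactive device (continued from FIG. 6).
[FIG. 8] Examples of the processing executed by the interactive device (Patterns 1 to 3).
[FIG. 9] Examples of the processing executed by the interactive device (Patterns 4 and 5).
[Description of Reference Numerals]
10: interactive device
20: microphone
30: voice identification unit
40: control unit
50: voice synthesis unit
60: speaker
70: display
80: keyboard
90: first database
100: second database
110: user name-user ID storage unit
[0001]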
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a technique for converting voice uttered by a human into character data.
[0002]
2. Description of the Related Art
In recent years, devices that can converse with humans (users) and devices that operate in response to voice uttered by humans (voice-responsive robots) have been developed. These devices take in voice uttered by a human, convert the input voice into character data, and execute predetermined processing according to the converted character data. For example, some interactive devices output a reply corresponding to the converted character data, and some voice-responsive robots perform an action corresponding to the converted character data.
Such interactive devices and voice-responsive robots need to convert the voice uttered by the user into character data accurately. Patent Document 1 discloses one technique for converting voice uttered by a user into character data. With the technique of Patent Document 1, words that the user utters slowly, separating each sound, can be converted into character data.
[0003]
[Patent Document 1]
JP-A-2000-242295
[0004]
The technique disclosed in Patent Document 1 presupposes that each sound uttered by the user can be accurately converted into character data. In practice, however, the voice is sometimes not converted into the character data corresponding to what the user uttered. For example, even though the user utters the word X, it may be converted into the character data Y. One likely reason is that voice quality, pronunciation, and the like differ between individuals, which makes it difficult to generalize a method of converting input voice into character data so that it works for all users.
[0005]
The present invention has been made in view of the above circumstances, and its object is to provide a technique that can be expected to convert voice uttered by a user accurately into the corresponding character data.
[0006]
Means for Solving the Problems, Functions and Effects
The invention of claim 1, created to solve the above problem, is a device that converts voice uttered by a user into character data. The character data conversion device includes means for inputting voice, means for converting the input voice into character data, correction means that allows the user to correct the character data converted by the conversion means, means for storing the character data converted by the conversion means in association with the character data to which it was corrected, and means for specifying the "corrected character data" by searching the storage means using the character data converted by the conversion means as a key.
According to the character data conversion device of claim 1, when the voice uttered by the user is converted into erroneous character data by the conversion means, the erroneously converted character data can be corrected with the correction means. Any correction method may be adopted: for example, the user may enter character data with a keyboard, or the user may utter the word again. For example, if the user utters X but the conversion means erroneously converts it into the character data Y, the user can correct the converted character data Y to X. At this time, information associating X with Y is stored in the storage means. Once this information is stored, if X uttered by the user is again erroneously converted to Y by the conversion means, X can be specified by searching the storage means with that Y as the key. According to the present invention, even when the conversion means produces character data different from the voice uttered by the user, the error can be compensated for and the character data corresponding to the voice uttered by the user can be specified.
[0007]
The following situation may occur because voice quality, pronunciation, and the like differ among users. For example, the conversion means may convert user A's utterance of Y into the character data X, while also converting user B's utterance of Z into the character data X. When both are corrected, the storage means stores X (converted character data) in association with Y (corrected character data), and also stores X (converted character data) in association with Z (corrected character data). In this case, when the conversion means again converts user A's utterance of Y into the character data X, Z might be specified. To deal with this problem, the character data conversion device described above may have the following configuration. That is, user specifying means for specifying the user is added to the character data conversion device. In this case, the storage means stores the user specified by the user specifying means, the character data converted by the conversion means, and the character data to which it was corrected in association with one another. The specifying means searches the storage means using the user specified by the user specifying means and the character data converted by the conversion means as keys, and specifies the "corrected character data".
With such a configuration, in the above example, user A, X (converted character data), and Y (corrected character data) can be stored in association with one another, and user B, X (converted character data), and Z (corrected character data) can be stored in association with one another. In this case, when the conversion means again converts user A's utterance of Y into the character data X, the storage means is searched using user A and X as keys, and Y is specified. The present invention can thus be expected to convert the voice uttered by each individual user accurately into the corresponding character data.
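The per-user lookup in this paragraph can be sketched in a few lines of Python. The in-memory dictionary, the function names `store_correction` and `resolve`, and the choice to return the converted text unchanged when no correction is stored are all assumptions made for illustration; the specification describes the means only in functional terms.

```python
# Maps (user, converted character data) to the corrected character data.
corrections: dict = {}


def store_correction(user: str, converted: str, corrected: str) -> None:
    """Remember that this user's utterance, converted as `converted`, meant `corrected`."""
    corrections[(user, converted)] = corrected


def resolve(user: str, converted: str) -> str:
    """Look up the corrected text using the user and the converted text as keys."""
    # Falling back to the converted text is an assumption, not stated in the source.
    return corrections.get((user, converted), converted)


store_correction("A", "X", "Y")   # user A said Y, recognizer produced X
store_correction("B", "X", "Z")   # user B said Z, recognizer produced X
print(resolve("A", "X"), resolve("B", "X"))
```

Keying on the pair keeps user A's correction from being returned for user B, which is the collision the paragraph describes for a store keyed on the converted text alone.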
[0008]
The storage means may be capable of storing a plurality of pieces of "corrected character data", each with a weight, in association with the character data converted by the conversion means. In this case, when a plurality of pieces of "corrected character data" are stored in association with the "character data converted by the conversion means" used as the key, the specifying means specifies the "corrected character data" with the largest weight.
The "weight" mentioned above means, for example, the ratio of the number of times the user uttered that "corrected character data" to the total number of times the conversion means produced the "character data converted by the conversion means" associated with it.
The present invention is explained using the following events as an example. (1) The user uttered X, but the conversion means converted it to Z, and the user corrected Z to X. (2) The user uttered X, but the conversion means again converted it to Z; the storage means was searched using Z, and X was specified. (3) The user uttered Y, but the conversion means converted it to Z, and the user corrected Z to Y.
When events (1) to (3) have occurred, the storage means stores Z (converted character data), X (corrected character data), and 2/3 (weight) in association with one another, and stores Z (converted character data), Y (corrected character data), and 1/3 (weight) in association with one another. The denominator of each weight is the total number of times the conversion means converted an utterance to Z. The numerator is the number of times the user uttered X or Y, respectively. Suppose the following event then occurs. (4) The user uttered X, but the conversion means converted it to Z. When event (4) occurs, the specifying means searches the storage means using Z as the key and specifies X, which has the larger weight. In this way, the present invention selects the "corrected character data" that is most likely to have been misconverted, so it copes well even when one piece of character data has been corrected to several different pieces of character data.
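The arithmetic behind events (1) to (4) can be checked with exact fractions. In this Python sketch the function names `weights` and `most_likely` and the list-of-utterances input are illustrative assumptions; the input lists what the user actually uttered on each occasion the conversion means produced Z.

```python
from collections import Counter
from fractions import Fraction


def weights(uttered_words: list) -> dict:
    """Weight of each uttered word: its count over the total number of conversions."""
    counts = Counter(uttered_words)
    total = sum(counts.values())
    return {word: Fraction(n, total) for word, n in counts.items()}


def most_likely(uttered_words: list) -> str:
    """Return the uttered word with the largest weight."""
    w = weights(uttered_words)
    return max(w, key=w.get)


# Events (1) to (3): the recognizer produced Z three times; the user had said X, X, Y.
history_for_z = ["X", "X", "Y"]
print(weights(history_for_z), most_likely(history_for_z))
```

For this history the weights are 2/3 for X and 1/3 for Y, so X is selected when event (4) occurs.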
[0009]
The present invention can be embodied as a device that converses with a user by outputting, as voice, a reply corresponding to the voice uttered by the user. This interactive device includes means for inputting voice; means for converting the input voice into character data; correction means that allows the user to correct the character data converted by the conversion means; first storage means that stores the character data converted by the conversion means in association with the character data to which it was corrected; second storage means that stores a plurality of pieces of character data and a reply for each piece of character data; means for specifying the "corrected character data" by searching the first storage means using the "character data converted by the conversion means" as a key when that character data is not stored in the second storage means; and means for outputting, as voice, the reply corresponding to the character data converted by the conversion means or to the "corrected character data" specified by the specifying means, when that character data is stored in the second storage means.
According to this interactive device, a reply corresponding to the voice uttered by the user can be expected to be output even when the voice is converted into character data different from what the user uttered.
[0010]
Further, the present invention can be defined as a method in which a computer converts voice uttered by a user into character data. This character data conversion method includes a step of inputting voice; a step of converting the input voice into character data; a step of storing, when the user has corrected the character data converted in the conversion step, that "character data converted in the conversion step" in association with the character data to which it was corrected; and a step of specifying the "corrected character data" by searching the contents stored in the storing step using the character data converted in the conversion step as a key.
By using this method, it can be expected that the computer can accurately convert the voice uttered by the user into character data.
[0011]
Further, the present invention can be defined as a program for causing a computer to convert voice uttered by a user into character data. This program causes the computer to execute the following processes: a process of inputting voice; a process of converting the input voice into character data; a process of storing, when the user has corrected the character data converted in the conversion process, that "character data converted in the conversion process" in association with the character data to which it was corrected; and a process of specifying the "corrected character data" by searching the contents stored in the storing process using the character data converted in the conversion process as a key.
By using this program, it can be expected that the computer can accurately convert the voice uttered by the user into character data.
[0012]
DESCRIPTION OF THE PREFERRED EMBODIMENTS
The invention described in each of the above claims can be suitably implemented in the following modes.
(Mode 1) The character data conversion device has a microphone for inputting a voice (word) uttered by a user. The microphone converts the input sound wave into an electric signal. The conversion means converts the sound converted into an electric signal by the microphone into character data in a text format.
(Mode 2) The character data conversion device has a keyboard. The user can correct the character data converted by the conversion means by inputting characters on the keyboard.
(Mode 3) The above-mentioned second storage means stores a question for the user. Further, a plurality of answers to the question are stored, and a reply is stored for each answer.
(Mode 4) The user specifying means specifies a user by inputting a user name using a keyboard.
[0013]
Embodiment (First Embodiment)
An embodiment of the present invention will be described with reference to the drawings. FIG. 1 shows a schematic configuration of an interactive device (interactive computer) 10 according to the present embodiment. The interactive device 10 asks the user a question, specifies the answer the user utters in response to the question, and replies to the specified answer.
The interactive device 10 comprises a microphone 20, a voice identification unit 30, a control unit 40, a voice synthesis unit 50, a speaker 60, a display 70, a keyboard 80, a first database 90, a second database 100, a user name-user ID storage unit 110, and the like.
The microphone 20 takes in the words (sound waves) uttered by the user. It then converts the input words into an electrical signal and sends the signal to the voice identification unit 30.
The voice identification unit 30 converts the words, now in the form of an electrical signal, into character data in text format. The voice identification unit 30 sends the character data to the control unit 40.
The control unit 40 is connected to the voice identification unit 30, the voice synthesis unit 50, the display 70, the keyboard 80, the first database 90, the second database 100, and the user name-user ID storage unit 110. The control unit 40 receives the character data sent from the voice identification unit 30 and executes processing to output a reply based on that character data. The control unit 40 also executes processing to display information on the display 70, processing to change the stored contents of the second database 100 and the user name-user ID storage unit 110, and the like. Each process executed by the control unit 40 will be described later in detail.
[0014]
The display 70 displays various images. The displayed image will be described later. The user can input character data using the keyboard 80. The character data input to the keyboard 80 is sent to the control unit 40.
The user name-user ID storage unit 110 stores each user's name in association with an ID that identifies that user. FIG. 2 shows an example of the storage contents of the user name-user ID storage unit 110. A user of the interactive device 10 registers his or her name in the interactive device 10 in advance by entering the name with the keyboard 80. The entered user name is sent to the control unit 40. On receiving a user name, the control unit 40 obtains a user ID and stores the user name and the user ID in association with each other in the user name-user ID storage unit 110.
[0015]
In the first database 90, a question to the user, an answer group assumed as an answer to the question, and a response to each answer are stored in association with each other. FIG. 3 shows an example of the storage contents of the first database 90. For example, answers (apples, pears, tangerines, etc.) are stored in association with the question "What is your favorite fruit?" Further, a reply is stored for each answer of the user. For example, in the case of an apple, a reply that "apples are delicious" is stored.
[0016]
FIG. 4 shows an example of the storage contents of the second database 100. The second database 100 stores user IDs, specific words, true words, and weights in association with each other.
A “specific word” is a word (converted character data) that was specified by the voice identification unit 30 and is not stored in the first database 90. For example, when an utterance is converted into a word different from the word the user spoke, the converted word may become a specific word. Note that a “word” in this specification means a single character or a group of characters, and is a broader concept than the ordinary notion of a word. As described in detail later, when the voice identification unit 30 converts an utterance into a word different from the word the user uttered, the user can correct the erroneously converted word by entering the intended word on the keyboard 80.
A “true word” is a word input by the user using the keyboard 80. Typically it is the word to which the user corrected a word that the voice identification unit 30 had converted into character data different from what the user spoke (in short, the correction of a word incorrectly specified by the voice identification unit 30). In the example of FIG. 4, the true word “nashi” (pear) is stored in association with the specific word “yashi” (palm). This means that the voice identification unit 30 converted the utterance into “yashi” although the user said “nashi”, and that “yashi” was then corrected to “nashi”. The true word “yashi” is also stored in association with the specific word “yashi”. This means that the voice identification unit 30 correctly converted the user's utterance “yashi” into “yashi”, but because “yashi” is not stored in the first database 90, the user was asked to input a word on the keyboard 80 and input “yashi”.
“Weight” is the ratio, in percent, of the number of times the user actually uttered the associated true word to the total number of times the voice identification unit 30 produced the associated specific word. In the example of FIG. 4, the specific word “yashi”, the true word “nashi”, and the weight “80” are stored in association with each other. This means that, of all the conversions to “yashi”, the user had actually uttered “nashi” in 80% of the cases. A fraction is stored together with the weight (shown in parentheses in the weight column of FIG. 4). Its denominator is the total number of times the voice identification unit 30 produced the specific word, and its numerator is the number of times the user uttered the true word. In the example of FIG. 4, 16/20 is shown in parentheses for the weight associated with the specific word “yashi” and the true word “nashi”. This means that there were 20 conversions to “yashi” and that the user had uttered “nashi” in 16 of them. Similarly, the 1/1 shown in parentheses for the specific word “ashi” (foot) and the true word “nashi” means that the voice identification unit 30 has produced “ashi” only once, and that on that occasion it was corrected to “nashi” (i.e., the user had uttered “nashi”).
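The relationship between the stored fraction and the weight can be illustrated with a minimal Python sketch. The record layout and the names `Association` and `weight_percent` are assumptions made for illustration and do not appear in the specification. Rounding to the nearest integer is likewise an assumption, chosen because it is consistent with the values given for FIG. 4.

```python
from dataclasses import dataclass


@dataclass
class Association:
    """One row of the second database 100 (hypothetical layout)."""
    user_id: str
    specific_word: str  # word output by the voice identification unit 30
    true_word: str      # word the user actually uttered
    uttered: int        # numerator: times the true word was uttered
    converted: int      # denominator: times the specific word was produced

    def weight_percent(self) -> int:
        # Rounding to the nearest integer is an assumption.
        return round(100 * self.uttered / self.converted)


# "yashi" (palm) was produced 20 times; the user had said "nashi" (pear) 16 times.
row = Association("XXX1", "yashi", "nashi", uttered=16, converted=20)
print(row.weight_percent(), f"({row.uttered}/{row.converted})")
```

Keeping the two counts rather than only the percentage is what allows the weight to be recomputed exactly each time a further conversion is recorded.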
[0017]
The speech synthesis unit 50 shown in FIG. 1 converts the character data output from the control unit 40 into an electric signal and sends the electric signal to the speaker 60. Sound is thereby output from the speaker 60.
[0018]
Next, the processing executed by the interactive device 10 will be described. FIGS. 5 to 7 show flowcharts of this processing.
Before using the interactive device 10, the user enters a user name using the keyboard 80. The interactive device 10 reads the input user name (step S2). It then specifies the user ID by searching the user name-user ID storage unit 110 using the read user name as a key (step S4).
Subsequently, the interactive device 10 asks the user a question (step S6). Specifically, this processing is executed as follows. First, the control unit 40 randomly selects one question from the plurality of questions stored in the first database 90. The character data of the selected question is then sent to the speech synthesis unit 50. The speech synthesis unit 50 converts the received character data (the question) into an electric signal and sends the electric signal to the speaker 60. The question is thereby output from the speaker 60 as voice.
[0019]
After the question has been put to the user, the user's answer to the question is specified (step S8). In this processing, the voice identification unit 30 converts the user's answer input to the microphone 20 into character data and sends the specified answer (converted character data) to the control unit 40. In reply to the question "What is your favorite fruit?", for example, the user may say "It's apple" or "I like apple", or may say only the word "apple". In step S8, the control unit 40 removes the phrases ("I like", "It's", etc.) attached before and after the answer word ("apple" in the above example) and specifies only the answer word.
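The phrase removal in step S8 can be sketched as follows. This is only an illustration in Python: the specification does not say how the control unit 40 removes the surrounding phrases, and the function name `extract_answer_word` and both phrase lists are hypothetical.

```python
# Hypothetical phrase lists; the specification names only "I like", "Is", etc.
LEADING_PHRASES = ("i like ", "it's ", "it is ")
TRAILING_PHRASES = (" is my favorite",)


def extract_answer_word(utterance: str) -> str:
    """Strip one leading and one trailing carrier phrase, if present."""
    text = utterance.strip().lower().rstrip(".!?")
    for phrase in LEADING_PHRASES:
        if text.startswith(phrase):
            text = text[len(phrase):]
            break
    for phrase in TRAILING_PHRASES:
        if text.endswith(phrase):
            text = text[:-len(phrase)]
            break
    return text.strip()


print(extract_answer_word("I like apple."))
print(extract_answer_word("apple"))
```

Both calls yield the bare answer word, which is what the later lookup in the first database 90 expects as its key.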
Next, it is determined whether the specified answer is stored in the first database 90 (step S10). This processing is executed by checking whether the answer group stored in association with the question selected in step S6 includes the answer specified in step S8. For example, since “nashi” (pear) is stored in association with the question "What is your favorite fruit?", YES is determined in step S10 if the answer specified in step S8 is “nashi”. On the other hand, since “yashi” (palm) is not stored in association with that question, NO is determined in step S10 if the answer specified in step S8 is “yashi”.
If YES is determined in step S10, the reply associated with the answer specified in step S8 is output (step S12). This processing is performed as follows. First, the first database 90 is searched using the answer specified in step S8 as a key, and the reply associated with that answer is specified. Next, the specified reply is sent to the speech synthesis unit 50. The speech synthesis unit 50 converts the character data of the reply into an electric signal and sends the electric signal to the speaker 60, whereby the reply is output as voice.
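Steps S10 and S12 amount to a keyed lookup, as in the following minimal Python sketch. The nested-dictionary layout and the names `FIRST_DB` and `find_reply` are assumptions for illustration. The answer words are written in romanized Japanese (“ringo” for apple, “nashi” for pear), and only the two replies quoted in this description are included.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the first database 90 (FIG. 3):
# question -> {answer word -> reply}.
FIRST_DB = {
    "What is your favorite fruit?": {
        "ringo": "Apples are delicious.",              # apple
        "nashi": "The slippery feeling is delicious.",  # pear
    },
}


def find_reply(question: str, answer: str) -> Optional[str]:
    """Return the stored reply (step S10 YES, then S12), or None (step S10 NO)."""
    return FIRST_DB.get(question, {}).get(answer)


question = "What is your favorite fruit?"
print(find_reply(question, "nashi"))  # stored answer: a reply is found
print(find_reply(question, "yashi"))  # "yashi" (palm) is not stored: None
```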
[0020]
FIG. 8 shows examples of conversations between the interactive device 10 and a user. When the processing follows steps S6, S8, S10, and S12 described above, the conversation is as shown in pattern 1 of FIG. 8. In pattern 1, the interactive device 10 asks "What is your favorite fruit?", the user answers “nashi” (pear), and the voice identification unit 30 correctly specifies “nashi”. The interactive device 10 therefore replies, "It's delicious."
[0021]
On the other hand, if NO is determined in step S10, the process proceeds to step S20. In step S20, it is determined whether the answer specified in step S8 is stored as a specific word in the second database 100. This is done by searching the second database 100 using the answer specified in step S8 as a key. The search does not cover every specific word stored in the second database 100; it checks only whether the answer specified in step S8 is among the specific words associated with the user ID specified in step S4. For example, if the user ID is “XXX1” and the answer specified in step S8 is “yashi” (palm), YES is determined in step S20 because the specific word “yashi” is stored in association with the user ID “XXX1” (see FIG. 4). If YES is determined in step S20, the process proceeds to step S22. If NO is determined in step S20, the process proceeds to step S40.
In step S22, the true word having the largest weight is specified. For example, when the user ID is “XXX1” and the answer specified in step S8 is “yashi”, the true word “nashi” (pear), which has the largest weight, is specified.
When the true word is specified in step S22, it is determined whether or not the true word is stored in the first database 90 (step S24). If YES is determined in this step S24, the process proceeds to step S26, and if NO is determined, the process proceeds to step S34.
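Steps S20 and S22 can be sketched in Python as a per-user lookup followed by a maximum over the weights. The dictionary layout and the names `SECOND_DB` and `best_true_word` are hypothetical, and the association of the “ashi” entry with user ID “XXX1” is assumed for the example.

```python
from typing import Optional

# Hypothetical stand-in for the second database 100 (FIG. 4):
# (user ID, specific word) -> {true word: (numerator, denominator)}.
SECOND_DB = {
    ("XXX1", "yashi"): {"nashi": (16, 20), "yashi": (3, 20), "yagi": (1, 20)},
    ("XXX1", "ashi"): {"nashi": (1, 1)},
}


def best_true_word(user_id: str, specific_word: str) -> Optional[str]:
    """Step S20: search only this user's entries. Step S22: pick the largest weight."""
    candidates = SECOND_DB.get((user_id, specific_word))
    if not candidates:
        return None  # step S20 NO: continue with keyboard input (step S40)
    return max(candidates, key=lambda w: candidates[w][0] / candidates[w][1])


print(best_true_word("XXX1", "yashi"))  # "nashi" has the largest weight
print(best_true_word("XXX2", "yashi"))  # nothing stored for this user: None
```

Keying the table on the user ID as well as the specific word reflects the point made above that the search is restricted to the specific words of the user specified in step S4.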
[0022]
In step S34, a voice is output as "I do not understand your word. I will change the question." Then, the process returns to step S6 to output another question. That is, since it is impossible to interact any more, another question is asked.
[0023]
In step S26, the user is asked to confirm whether the true word specified in step S22 is the word the user uttered. For example, when the true word specified in step S22 is “nashi” (pear), the voice output is "Did you say nashi?" At the same time as the voice output, “Please answer yes or no” is displayed on the screen of the display 70. This display processing is performed by the control unit 40 sending display data to the display 70.
Subsequently, the device monitors whether the user utters “Yes” or “No” (step S28). In this processing, the voice identification unit 30 specifies “Yes” or “No” and passes the result to the control unit 40. When the user utters “Yes”, the process proceeds to step S30. When the user utters “No”, the process proceeds to step S40.
In step S30, the reply corresponding to the true word specified in step S22 is output as voice. For example, when the true word specified in step S22 is “nashi”, the reply “slippery feeling is delicious” stored in the first database 90 in association with “nashi” is output.
After step S30 is completed, processing for changing the storage contents of the second database 100 is executed (step S32). For example, when the specific word is “yashi” (palm), the true word “nashi” is specified, and the processing up to step S30 is executed, the weight associated with the specific word “yashi” and the true word “nashi” is increased, and the weights associated with the specific word “yashi” and the other true words (for example, “yashi” or “yagi”) are reduced. In the example of FIG. 4, the weight “80 (16/20)” associated with the specific word “yashi” and the true word “nashi” is changed to “81 (17/21)”, the weight “15 (3/20)” associated with the specific word “yashi” and the true word “yashi” is changed to “14 (3/21)”, and the weight “5 (1/20)” associated with the specific word “yashi” and the true word “yagi” is changed to “5 (1/21)”. Likewise, if the word specified in step S8 is “ashi” (foot), the true word “nashi” is specified in step S22, and the processing up to step S30 is executed, the weight “100 (1/1)” is changed to “100 (2/2)”.
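The arithmetic of step S32 can be checked with a short Python sketch. It assumes that the denominator shared by all true words of one specific word equals the sum of their numerators, which holds for the FIG. 4 values quoted above (16 + 3 + 1 = 20). The function names and the nearest-integer rounding are illustrative assumptions.

```python
def record_confirmation(counts: dict, true_word: str) -> None:
    """Count one more conversion whose confirmed true word is `true_word`."""
    counts[true_word] = counts.get(true_word, 0) + 1


def weights(counts: dict) -> dict:
    """Return {true word: (percent, numerator, denominator)}."""
    total = sum(counts.values())
    return {w: (round(100 * n / total), n, total) for w, n in counts.items()}


# Numerators stored for the specific word "yashi" before step S32.
yashi_counts = {"nashi": 16, "yashi": 3, "yagi": 1}
record_confirmation(yashi_counts, "nashi")  # the user confirmed "nashi"
for word, (percent, n, total) in weights(yashi_counts).items():
    print(f"{word}: {percent} ({n}/{total})")
```

Only one numerator is incremented, yet every weight for the specific word changes because the shared denominator grows from 20 to 21.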
[0024]
When the processing follows steps S6 to S10 and S20 to S32 described above, the conversation is as shown in pattern 2 of FIG. 8. In pattern 2, the user answers “nashi” (pear) to the question "What is your favorite fruit?", but the voice identification unit 30 incorrectly specifies “yashi” (palm). “Yashi” is not in the first database 90 but is stored in the second database 100. Because the weight associated with the specific word “yashi” and the true word “nashi” is the largest, the user is asked "Did you say nashi?" The user answers "Yes", so the interactive device 10 replies, "It's delicious." Finally, the weights in the second database 100 are changed.
[0025]
Next, the processing from step S40 onward will be described. In step S40, the voice message "Please enter the spoken word into the keyboard" is output. Next, the device monitors whether the user makes an input on the keyboard 80 (step S42). If an input is made on the keyboard 80 within 30 seconds of the voice output in step S40, the determination is YES; if no input is made within 30 seconds, the determination is NO. If the determination is YES, the process proceeds to step S44, and if it is NO, the process proceeds to step S50. By inputting character data on the keyboard 80, the user can correct character data erroneously converted by the voice identification unit 30. For example, if no specific word is stored in association with the user ID “XXX1” and the user's utterance “nashi” (pear) is specified as “yashi” (palm) in step S8, the user operates the keyboard 80 to enter “nashi”. As a result, “yashi” is corrected to “nashi”.
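One possible way to realize the 30-second wait of step S42 is sketched below in Python. It assumes that keyboard input arrives on a thread-safe queue filled by some other part of the program; that arrangement and the name `wait_for_keyboard` are hypothetical and are not described in the specification.

```python
import queue
from typing import Optional


def wait_for_keyboard(events: "queue.Queue[str]", timeout_s: float = 30.0) -> Optional[str]:
    """Return the typed word (step S42 YES) or None if the time limit expires (NO)."""
    try:
        return events.get(timeout=timeout_s)
    except queue.Empty:
        return None


# Usage with a short timeout so that the example finishes quickly.
typed = queue.Queue()
typed.put("nashi")
print(wait_for_keyboard(typed, timeout_s=0.05))          # input present: the word
print(wait_for_keyboard(queue.Queue(), timeout_s=0.05))  # no input: None
```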
[0026]
In step S50, a voice is output as "I do not understand your word. I will change the question." In this case, the process returns to step S6 (FIG. 5) to output another question.
On the other hand, in step S44, it is determined whether or not the character data input from the keyboard is stored in first database 90. That is, it is determined whether or not there is character data input by the keyboard in step S42 in the answer group associated with the question selected in step S6. If the determination is YES here, the process proceeds to step S46, and if the determination is NO, the process proceeds to step S48.
In step S48, a voice is output as "I do not understand your word. I will change the question." In this case, the process returns to step S6 (FIG. 5) to output another question. In step S46, a response corresponding to the word input from the keyboard is output. When the processing in step S46 or S48 is completed, the storage contents of the second database 100 are changed (step S52). How the stored contents are changed in step S52 will be described in detail below.
[0027]
If NO is determined in step S20 and the processing then follows steps S40, S42, S44, S46, and S52, the conversation is as shown in pattern 3 of FIG. 8. In pattern 3, the user answers “nashi” (pear) to the question "What is your favorite fruit?", but the voice identification unit 30 incorrectly specifies “yashi” (palm). The specified “yashi” is stored neither in the first database 90 nor, as a specific word, in the second database 100. When prompted for keyboard input, the user enters “nashi” (that is, “yashi” is corrected to “nashi”), and the interactive device 10 replies, "It's delicious." In this case, the specific word “yashi” and the true word “nashi” are stored in the second database 100 in association with each other, with a weight of 100 (1/1).
If NO is determined in step S20 and the processing follows steps S40, S42, S44, S48, and S52, the conversation is as shown in pattern 4 of FIG. 9. In pattern 4, the user answers “yashi” to the question "What is your favorite fruit?", and the voice identification unit 30 correctly specifies “yashi”. The specified “yashi” is stored neither in the first database 90 nor, as a specific word, in the second database 100. When prompted for keyboard input, the user enters “yashi” on the keyboard 80, and the interactive device 10 replies, "I do not understand your word. I will change the question." In this case, the specific word “yashi” and the true word “yashi” are stored in the second database 100 in association with each other, with a weight of 100 (1/1).
Further, if NO is determined in step S28 and the processing follows steps S40, S42, S44, S48, and S52, the conversation is as shown in pattern 5 of FIG. 9. In pattern 5, the user answers “yashi” to the question "What is your favorite fruit?", and the voice identification unit 30 correctly specifies “yashi”. “Yashi” is not stored in the first database 90 but is stored in the second database 100. Because the weight associated with the specific word “yashi” and the true word “nashi” is the largest, the user is asked "Did you say nashi?" The user answers "No" and, when prompted for keyboard input, enters “yashi” on the keyboard 80. The interactive device 10 then replies, "I do not understand your word. I will change the question." In this case, all the weights associated with the specific word “yashi” are changed. In the example of FIG. 4, the weight “15 (3/20)” associated with the specific word “yashi” and the true word “yashi” is changed to “19 (4/21)”, the weight “80 (16/20)” associated with the specific word “yashi” and the true word “nashi” is changed to “76 (16/21)”, and the weight “5 (1/20)” associated with the specific word “yashi” and the true word “yagi” is changed to “5 (1/21)”.
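The branching of steps S10 to S52 that produces patterns 1 to 5 can be summarized in one Python sketch. It is a simplification under stated assumptions, not the implementation of the embodiment: the name `handle_answer`, the callback parameters, and the use of plain counts in place of stored weights are all hypothetical. Because every true word of one specific word shares the same denominator, the largest count is also the largest weight. A return value of None stands for "I do not understand your word. I will change the question."

```python
from typing import Callable, Dict, Optional


def handle_answer(
    recognized: str,
    answers: Dict[str, str],                # first database: answer -> reply
    user_words: Dict[str, Dict[str, int]],  # second database for one user
    confirm: Callable[[str], bool],         # "Did you say ...?" (S26/S28)
    ask_keyboard: Callable[[], Optional[str]],  # typed word or None (S40/S42)
) -> Optional[str]:
    if recognized in answers:                 # S10 YES
        return answers[recognized]            # S12
    counts = user_words.get(recognized)       # S20
    if counts:
        best = max(counts, key=counts.get)    # S22
        if best not in answers:               # S24 NO
            return None                       # S34
        if confirm(best):                     # S26/S28 "Yes"
            counts[best] += 1                 # S32
            return answers[best]              # S30
    typed = ask_keyboard()                    # S40/S42
    if typed is None:
        return None                           # S50
    entry = user_words.setdefault(recognized, {})
    entry[typed] = entry.get(typed, 0) + 1    # S52
    return answers.get(typed)                 # S46 reply, or None for S48


answers = {"nashi": "It's delicious."}
words: Dict[str, Dict[str, int]] = {}
# Pattern 3: "nashi" is misrecognized as "yashi" and corrected on the keyboard.
print(handle_answer("yashi", answers, words, lambda w: True, lambda: "nashi"))
print(words)  # the correction is now stored for this user
```

On the next misrecognition of the same kind, the stored entry is found in step S20, so the sketch takes the confirmation path of pattern 2 instead of asking for keyboard input.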
[0028]
With the interactive device 10 of the present embodiment described above, even when the voice identification unit 30 erroneously converts an utterance into character data different from the word the user uttered, the error can be compensated for and the word uttered by the user can be identified accurately. Further, since the specific words and the true words are stored for each user in the second database 100, the true word can be specified for each user, even though users differ in pronunciation, voice quality, and the like.
[0029]
Specific examples of the present invention have been described in detail above, but they are merely illustrative and do not limit the scope of the claims. The technology described in the claims includes various modifications and alterations of the specific examples illustrated above.
In the above-described embodiment, the user is specified by inputting a name using the keyboard 80 (step S2), but other methods can also be adopted. For example, the user can be specified by photographing the user with a camera. When a camera is used, there is no need to register a user name; it suffices to store the facial features of the user and the user ID in association with each other.
In the above-described embodiment, the user corrects a specific word to a true word by inputting the true word on the keyboard 80, but the following methods may also be adopted. For example, the user may write the true word by hand, and the correction may be made by photographing the writing with a camera or the like and reading the image. The correction may also be made by having the user speak the word again.
Further, the technical elements described in this specification or the drawings exhibit technical utility singly or in various combinations, and are not limited to the combinations described in the claims as filed. The technology illustrated in this specification or the drawings achieves a plurality of objects at the same time, and has technical utility by achieving any one of those objects.
[Brief description of the drawings]
FIG. 1 shows a schematic configuration of the interactive device according to an embodiment.
FIG. 2 shows an example of contents stored in a user name-user ID storage unit.
FIG. 3 shows an example of storage contents of a first database.
FIG. 4 shows an example of storage contents of a second database.
FIG. 5 is a flowchart of a process executed by the interactive device.
FIG. 6 is a flowchart (continuation of FIG. 5) of a process executed by the interactive device.
FIG. 7 is a flowchart of a process executed by the interactive device (continuation of FIG. 6).
FIG. 8 shows an example of a process executed by the interactive device (patterns 1 to 3).
FIG. 9 shows an example of a process executed by the interactive device (patterns 4 and 5).
[Explanation of symbols]
10: Interactive device
20: Microphone
30: Voice identification unit
40: Control unit
50: Speech synthesis unit
60: Speaker
70: Display
80: Keyboard
90: First database
100: Second database
110: User name-user ID storage unit

Claims

A character data conversion device that converts voice uttered by a user into character data, comprising:
means for inputting voice;
conversion means for converting the input voice into character data;
correction means by which the user can correct the character data converted by the conversion means;
storage means for storing the character data converted by the conversion means and the character data corrected from that character data in association with each other; and
specifying means for searching the storage means using the character data converted by the conversion means as a key to specify the "corrected character data".

The character data conversion device according to claim 1, further comprising user specifying means for specifying a user, wherein
the storage means stores the user specified by the user specifying means, the character data converted by the conversion means, and the character data corrected from that character data in association with one another, and
the specifying means searches the storage means using the user specified by the user specifying means and the character data converted by the conversion means as keys to specify the "corrected character data".

The character data conversion device according to claim 1, wherein
the storage means is capable of storing a plurality of pieces of "corrected character data", each weighted, in association with the character data converted by the conversion means, and
when a plurality of pieces of "corrected character data" are stored in association with the "character data converted by the conversion means" used as the key, the specifying means specifies the "corrected character data" having the largest weight.

An interactive device that interacts with a user by outputting a reply corresponding to voice uttered by the user, comprising:
means for inputting voice;
conversion means for converting the input voice into character data;
correction means by which the user can correct the character data converted by the conversion means;
first storage means for storing the character data converted by the conversion means and the character data corrected from that character data in association with each other;
second storage means for storing a plurality of pieces of character data and a reply for each piece of character data;
specifying means for, when the character data converted by the conversion means is not stored in the second storage means, searching the first storage means using the "character data converted by the conversion means" as a key to specify the "corrected character data"; and
means for, when the character data converted by the conversion means or the "corrected character data" specified by the specifying means is stored in the second storage means, outputting by voice the reply corresponding to the character data converted by the conversion means or to the "corrected character data" specified by the specifying means.

A method by which a computer converts voice uttered by a user into character data, comprising:
an input step of inputting voice;
a conversion step of converting the input voice into character data;
a storage step of, when the user has corrected the character data converted in the conversion step, storing the "character data converted in the conversion step" and the character data corrected from that character data in association with each other; and
a step of searching the contents stored in the storage step using the character data converted in the conversion step as a key to specify the "corrected character data".

A program for causing a computer to convert voice uttered by a user into character data,
the program causing the computer to execute:
an input process of inputting voice;
a conversion process of converting the input voice into character data;
a storage process of, when the user has corrected the character data converted in the conversion process, storing the "character data converted in the conversion process" and the character data corrected from that character data in association with each other; and
a process of searching the contents stored in the storage process using the character data converted in the conversion process as a key to specify the "corrected character data".