JP2004029804A

JP2004029804A - Voice recognition conversation device and voice recognition conversation processing method

Info

Publication number: JP2004029804A
Application number: JP2003170940A
Authority: JP
Inventors: Yasunaga Miyazawa; 宮澤　康永; Mitsuhiro Inazumi; 稲積　満広; Hiroshi Hasegawa; 長谷川　浩; Isanaka Edatsune; 枝常　伊佐央; Osamu Urano; 浦野　治
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2003-06-16
Filing date: 2003-06-16
Publication date: 2004-01-29

Abstract

PROBLEM TO BE SOLVED: To improve the voice recognition rate across differences in age and gender, and to realize conversations suited to various ages and genders, by making the ROM portion that stores the contents preset for recognizing voice and for outputting responses to recognized voice a cartridge type.

SOLUTION: In a voice recognition conversation device, input voice is analyzed to generate voice feature data, the voice feature data is compared with previously registered standard voice feature data to output word detection data, and, on receiving the word detection data, the meaning of the input voice is understood and a response content corresponding to that meaning is decided and output. Storage means, which stores the contents preset for recognizing voice and the contents preset for outputting a response to the recognized voice, is provided on a cartridge 20 that can be freely loaded into and unloaded from a device main body 1. When the cartridge 20 is loaded into the device main body 1, it is connected to a voice recognition response processing part 10 arranged on the device main body 1 side, and the response content for the input voice is output based on the data stored in the cartridge 20.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【産業上の利用分野】
本発明は、音声を認識し、その認識結果に対応した応答や特定動作を行う音声認識対話装置および音声認識対話処理方法に関する。
【０００２】
【従来の技術】
この種の音声認識装置においては、特定話者のみの音声を認識可能な特定話者音声認識装置と不特定話者の音声を認識可能な不特定話者音声認識装置がある。
【０００３】
特定話者音声認識装置は、或る特定の話者が認識可能な単語を一単語ずつ所定の手順に従って入力することによって、その特定話者の標準的な音声信号パターンを登録しておき、登録終了後、特定話者が登録した単語を話すと、その入力音声を分析した特徴パターンと登録された特徴パターンとを比較して音声認識を行うものである。この種の音声認識対話装置の一例として音声認識玩具がある。たとえば、音声指令となる複数の命令語として、「おはよう」、「おやすみ」、「こんにちわ」などの言葉を１０単語程度、その玩具を使用する子どもが予め登録しておき、話者がたとえば「おはよう」というと、その音声信号と、登録されている「おはよう」の音声信号を比較して、両音声信号が一致したとき、音声指令に対する定められた電気信号を出力し、これに基づいて玩具に特定動作を行わせるものである。
【０００４】
このような特定話者音声認識装置は、特定話者かあるいはそれに近い音声パターンを有する音声しか認識されず、また、初期設定として、認識させたい単語を一単語ずつすべて登録させる必要がありその手間は極めて面倒であった。
【０００５】
これに対して、不特定話者音声認識装置は、多数（たとえば、２００人程度）の話者が発話した音声を用いて、前記したような認識対象単語の標準音声特徴データを予め作成して記憶（登録）させておき、これら予め登録された認識可能な単語に対して、不特定な話者の発する音声を認識可能としたものである。
【０００６】
【発明が解決しようとする課題】
この不特定話者の音声を認識可能な不特定音声認識装置は、確かに、標準的な音声に対しては比較的高い認識率が確保されるが、殆どの全ての音声に対しても高い認識率が得られるとは限られない。たとえば、幼児の声、大人の声、女性の声、男性の声などのように、年齢や性別によって音声の特徴が大きく異なり、大人の問いかけに対してはきわめて高い認識率が得られても、幼児の問いかけに対しては殆ど認識されないという問題も生じてくる。
【０００７】
また、この種の音声認識装置を、ぬいぐるみなどの玩具に適用した場合、そのぬいぐるみで遊ぶ子どもの年代や性別などによって、対話内容も変化するのが普通である。たとえば、幼児と小学校の高学年、男子と女子では、求める対話内容はそれぞれ異なるのが一般的である。
【０００８】
しかしながら、この種の音声認識装置にあっては、認識可能な登録単語も限られており、それに対する応答内容も或る程度は限られた内容のものであるのが一般的である。したがって、この種の玩具は短期間のうちに飽きがくるのが通例であり、また、前記したように、年齢や性別などによる音声の特徴により、認識率の良し悪しにも問題があった。これは、玩具のみならず音声認識を利用する電子機器すべてについても同様である。
【０００９】
本発明はこれらの課題を解決するためになされたもので、標準音声特徴データを記憶する標準音声特徴データ記憶手段や応答内容を記憶する応答内容記憶手段などのＲＯＭの部分をカートリッジ式とし、そのカートリッジを適宜選択して装着可能とすることで、年齢や性別などの違いによる音声の特徴に対応した認識が可能となり、認識率の向上を図るとともに、様々な年齢や性別に対応した会話を可能とすることを目的としている。
【００１０】
【課題を解決するための手段】
本発明の音声認識対話装置は、入力された音声を分析して音声特徴データを発生し、この音声特徴データと予め登録された認識可能な単語の標準音声特徴データとを比較して単語検出データを出力し、この単語検出データを受けて、入力音声の意味を理解し、それに対応した応答内容を決定して出力する音声認識対話装置において、音声認識を行うために予め設定された記憶内容、認識された音声に対する応答出力を行うために予め設定された記憶内容などを記憶する記憶手段を、装置本体に対して着脱自在に装着可能なカートリッジ側に設け、このカートリッジが装置本体に装着されることにより、装置本体側に設けられた音声認識応答処理部に接続され、カートリッジ内に記憶されたデータを基に、この音声認識応答処理部が入力音声に対する応答内容を発生することを特徴としている。
【００１１】
そして、前記カートリッジ側に設けられる記憶手段は、予め登録された認識可能な単語に対する標準音声特徴データを記憶する標準音声特徴データ記憶手段であって、前記装置本体側には、音声を入力する音声入力手段と、この音声入力手段により入力された音声を分析して音声特徴データを発生する音声分析手段と、この音声分析手段からの音声特徴データを入力し、前記カートリッジ側に設けられた標準音声特徴データ記憶手段の記憶内容を基に、入力音声に対する単語検出データを出力する単語検出手段と、この単語検出手段からの単語検出データを受けて、入力音声の意味を理解し、それに対応した応答内容を決定する音声理解会話制御手段と、この音声理解会話制御手段によって決定された応答内容に基づいた音声合成出力を発生させるための応答データ指示内容記憶手段と、前記音声理解会話制御手段により決定された応答内容に対し、前記応答データ指示内容記憶手段の指示に基づいた音声合成出力を発生する音声合成手段と、この音声合成手段からの音声合成出力を外部に出力する音声出力手段とを有した構成とする。
【００１２】
また、前記カートリッジ側に設けられる記憶手段は、予め登録された認識可能な単語に対応する応答内容を記憶する会話内容記憶手段と、どのような音声合成出力を発生するかを指示するための指示内容を記憶する応答データ指示内容記憶手段であって、前記装置本体側には、音声を入力する音声入力手段と、この音声入力手段により入力された音声を分析して音声特徴データを発生する音声分析手段と、予め登録された認識可能な単語の標準音声特徴データを記憶する標準音声特徴データ記憶手段と、この標準音声特徴データ記憶手段の記憶内容を基に、入力音声に対する単語検出データを出力する単語検出手段と、この単語検出手段からの単語検出データを受けて入力音声の意味を理解し、それに対応した応答内容を、前記カートリッジ側に設けられた会話内容記憶手段を参照して決定する音声理解会話制御手段と、この音声理解会話制御手段によって決定された応答内容に対し、前記カートリッジ側に設けられた応答データ指示内容記憶手段の指示に基づいた音声合成出力を発生する音声合成手段と、この音声合成手段からの音声合成出力を外部に出力する音声出力手段とを有した構成とする。
【００１３】
また、前記カートリッジ側に設けられる記憶手段は、どのような音声合成出力を発生するかを指示するための指示内容を記憶する応答データ指示内容記憶手段であって、前記装置本体側には、音声を入力する音声入力手段と、この音声入力手段により入力された音声を分析して音声特徴データを発生する音声分析手段と、予め登録された認識可能な単語の標準音声特徴データを記憶する標準音声特徴データ記憶手段と、この標準音声特徴データ記憶手段の記憶内容を基に、入力音声に対する単語検出データを出力する単語検出手段と、この単語検出手段からの単語検出データを受けて入力音声の意味を理解し、それに対応した応答内容を決定する音声理解会話制御手段と、この音声理解会話制御手段によって決定された応答内容に対し、前記カートリッジ側に設けられた応答データ指示内容記手段の指示に基づいた音声合成出力を発生する音声合成手段と、この音声合成手段からの音声合成出力を外部に出力する音声出力手段とを有した構成とする。
【００１４】
また、前記カートリッジ側に設けられる記憶手段は、予め登録された認識可能な単語に対する標準音声特徴データを記憶する標準音声特徴データ記憶手段、前記登録された認識可能な単語に対応する応答内容を記憶する会話内容記憶手段、どのような音声合成出力を発生するかを指示する応答データ指示内容記憶手段であって、前記装置本体側には、音声を入力する音声入力手段と、この音声入力手段により入力された音声を分析して音声特徴データを発生する音声分析手段と、この音声分析手段からの音声特徴データを入力し、前記カートリッジ側に設けられた標準音声特徴データ記憶手段の記憶内容を基に、入力音声に対する単語検出データを出力する単語検出手段と、この単語検出手段からの単語検出データを受けて入力音声の意味を理解し、それに対応した応答内容を、前記カートリッジ側に設けられた会話内容記憶部を参照して決定する音声理解会話制御手段と、この音声理解会話制御手段によって決定された応答内容に対し、前記カートリッジ側に設けられた応答データ指示内容記憶手段の指示に基づいた音声合成出力を発生する音声合成手段と、この音声合成手段からの音声合成出力を外部に出力する音声出力手段とを有した構成とする。
【００１５】
また、本発明の音声認識対話処理方法は、入力された音声を分析して音声特徴データを発生し、この音声特徴データと予め登録された認識可能な単語の標準音声特徴データとを比較して単語検出データを出力し、この単語検出データを受けて、入力音声の意味を理解し、それに対応した応答内容を決定して出力する音声認識対話処理方法において、音声認識を行うために予め設定された記憶内容、認識された音声に対する応答出力を行うために予め設定された記憶内容などを、装置本体に対して着脱自在に装着可能なカートリッジ側の記憶手段に書き込み、このカートリッジが前記装置本体に装着されることにより、そのカートリッジ内に記憶されたデータを基に、入力音声に対する応答内容を発生することを特徴としている。
【００１６】
そして、前記カートリッジ側に記憶される記憶内容は、予め登録された認識可能な単語に対する標準音声特徴データであって、前記装置本体側には、音声入力手段により入力された音声を分析して音声特徴データを発生する音声分析工程と、この音声分析工程からの音声特徴データを入力し、前記カートリッジ側に記憶された標準音声特徴データを基に、入力音声に対する単語検出データを出力する単語検出工程と、この単語検出工程からの単語検出データを受けて入力音声の意味を理解し、それに対応した応答内容を決定する音声理解会話制御工程と、この音声理解会話制御工程によって決定された応答内容に対し、どのような音声合成出力とするかを指示する応答データ指示内容に基づいて音声合成出力を発生する音声合成工程と、この音声合成工程からの音声合成出力を外部に出力する音声出力工程とを有した構成とする。
【００１７】
また、前記カートリッジ側に記憶される記憶内容は、予め登録された認識可能な単語に対応する応答内容と、どのような音声合成出力を発生するかを指示する応答データ指示内容であって、前記装置本体側には、音声入力手段により入力された音声を分析して音声特徴データを発生する音声分析工程と、予め登録された認識可能な単語の標準音声特徴データを基に、入力音声に対する単語検出データを出力する単語検出工程と、この単語検出工程からの単語検出データを受けて入力音声の意味を理解し、それに対応した応答内容を、前記カートリッジ側に記憶された会話内容を参照して決定する音声理解会話制御工程と、この音声理解会話制御工程によって決定された応答内容に対し、前記カートリッジ側に記憶された応答データ指示内容に基づいた音声合成出力を発生する音声合成工程と、この音声合成工程からの音声合成出力を外部に出力する音声出力工程とを有した構成とする。
【００１８】
また、前記カートリッジ側に記憶される記憶内容は、どのような音声合成出力を発生するかを指示する指示内容であって、前記装置本体側には、音声入力手段により入力された音声を分析して音声特徴データを発生する音声分析工程と、予め登録された認識可能な単語の標準音声特徴データを基に、入力音声に対する単語検出データを出力する単語検出工程と、この単語検出工程からの単語検出データを受けて入力音声の意味を理解し、それに対応した応答内容を決定する音声理解会話制御工程と、この音声理解会話制御工程によって決定された応答内容に対し、前記カートリッジ側に記憶された応答データ指示内容に基づいた音声合成出力を発生する音声合成工程と、この音声合成工程からの音声合成出力を外部に出力する音声出力工程とを有した構成とする。
【００１９】
また、前記カートリッジ側に記憶される記憶内容は、予め登録された認識可能な単語に対する標準音声特徴データ、前記登録された認識可能な単語に対応する応答内容、どのような音声合成出力を発生するかを指示する応答データ指示内容であって、前記装置本体側には、音声入力手段により入力された音声を分析して音声特徴データを発生する音声分析工程と、この音声分析工程からの音声特徴データを入力し、前記カートリッジ側に記憶された標準音声特徴データを基に、入力音声に対する単語検出データを出力する単語検出工程と、この単語検出工程からの単語検出データを受けて入力音声の意味を理解し、それに対応した応答内容を、前記カートリッジ側に記憶された会話内容を参照して決定する音声理解会話制御工程と、この音声理解会話制御工程によって決定された応答内容に対し、前記カートリッジ側に記憶された応答データ指示内容に基づいた音声合成出力を発生する音声合成工程と、この音声合成工程からの音声合成出力を外部に出力する音声出力工程とを有した構成とする。
【００２０】
【作用】
本発明は、音声認識を行うために予め設定された記憶内容、認識された音声に対する応答出力を行うために予め設定された記憶内容などを、装置本体に対して着脱自在に装着可能なカートリッジ側に記憶させ、このカートリッジが装置本体に装着されることにより、そのカートリッジ内に記憶されたデータを基に、入力音声に対する応答内容を出力するようにしたので、装置本体は１台であっても、カートリッジを変えることにより、様々な年代あるいは性別などに応じた対話が可能となる。したがって、ユーザに応じたカートリッジを選択することができ、また、認識可能な単語、応答内容もカートリッジ単位で選択できることから、対話内容などに幅広いバリエーションを持たせることができ、それに対応して様々な動作をさせることが可能となる
【００２１】
【実施例】
以下、本発明の実施例を図面を参照して説明する。なお、この実施例では、本発明を玩具に適用した場合を例にとり、ここでは、犬などのぬいぐるみに適用した場合について説明する。また、不特定話者の音声を認識可能な不特定話者音声認識装置に本発明を適用した例について説明する。
【００２２】
（第１の実施例）
図１は本発明の全体的な概略構成を説明する図であり、概略的には、犬のぬいぐるみ（装置本体）１内に収納された音声認識応答処理部１０（詳細は後述する）と、犬のぬいぐるみ１の所定の部分に着脱自在に装着可能なカートリッジ部２０（詳細は後述する）から構成されている。
【００２３】
図２はこの第１の実施例による音声認識応答処理部１０およびカートリッジ部２０の構成を説明するブロック図である。この第１の実施例では、標準音声特徴データ記憶部２１、会話内容記憶部２２、応答データ指示内容記憶部２３の３つのＲＯＭをカートリッジ側に設けた例について説明する。
【００２４】
音声認識応答処理部１０は、音声入力部１１、音声分析部１２、単語検出部１３、音声理解会話制御部１４、音声合成部１５、音声出力部１６などから構成されている。なお、これらの構成要素のうち、音声分析部１２、単語検出部１３、音声理解会話制御部１４、音声合成部１５などは、ぬいぐるみ１のたとえば腹部付近に収納され、音声入力部（マイクロホン）１１はぬいぐるみ１のたとえば耳の部分、音声出力部（スピーカ）１６はたとえば口の部分に設けられる。
【００２５】
一方、カートリッジ部２０は、標準音声特徴データ記憶部２１、会話内容記憶部２２、応答データ指示内容記憶部２３により構成され、ぬいぐるみ１のたとえば腹部付近に設けられたカートリッジ装着部（図示せず）に外部から容易に着脱可能となっている。そして、このカートリッジ部２０がカートリッジ装着部に正しく装着されると、前記音声認識応答処理部１０の各部に接続され、信号の授受が可能となる。具体的には、標準音声特徴データ記憶部２１は前記単語検出部１３に接続され、会話内容記憶部２２は前記音声理解会話制御部１４に接続され、応答データ指示内容記憶部２３は前記音声理解会話制御部１４および音声合成部１５に接続されるようになっている。
【００２６】
前記標準音声特徴データ記憶部２１は、１つ１つの単語に対し多数（たとえば、２００人程度）の話者が発話した音声を用いて予め作成した認識可能な単語（登録単語という）の標準パターンを記憶（登録）しているＲＯＭである。ここでは、ぬいぐるみを例にしているので、登録単語は１０単語程度とし、その単語としては、たとえば、「おはよう」、「おやすみ」、「こんにちは」、「明日」、「天気」など挨拶に用いる言葉が多いが、これに限定されるものではなく、色々な単語を登録することができ、登録単語数も１０単語に限られるものではない。また、前記会話内容記憶部２２は、どのような単語が登録単語となっているか、そして、それぞれの登録単語に対してどのような応答をするかというような内容を記憶している。この会話内容記憶部２２は、本来、音声理解会話制御部１４内に設けられているものであるが、登録単語などがカートリッジにより変わる可能性もあるためカートリッジ部２０側に設けられている。また、応答データ指示内容記憶部２３は、それぞれの登録単語に対応してどのような音声合成出力とするかを指示する内容が記憶されており、同じ応答内容であってもたとえば男の子の話し方による音声合成出力、あるいは女の子の話し方による音声合成出力とするというように、主に声の質を指示する内容が予め設定されている。
【００２７】
ところで、このカートリッジ部２０は、前記したように標準音声特徴データや、登録単語、さらには、それぞれの登録単語に対する応答内容、声の質などが様々の種類用意されているもので、ぬいぐるみ１を使用するユーザが任意に選べるようになっている。たとえば、幼児用のカートリッジには、幼児の音声を認識しやすいような標準音声特徴データによる複数の登録単語が標準音声特徴データ記憶部２１に記憶され、そして、どのような単語が登録単語であるか、それぞれの登録単語に対してどのような応答を行うかが会話内容記憶部２２に記憶され、さらに、それぞれの登録単語に対し、どのような音声合成出力とするかというような指示が応答データ指示内容記憶部２３に記憶されている。
【００２８】
次に、以上の各部におけるそれぞれの機能、さらには全体的な処理などについて以下に順次説明する。
【００２９】
音声認識応答処理部１０の音声入力部１１は図示されていないがマイクロホン、増幅器、ローパスフィルタ、Ａ／Ｄ変換器などから構成され、マイクロホンから入力された音声を、増幅器、ローパスフィルタを通して適当な音声波形としたのち、Ａ／Ｄ変換器によりディジタル信号（たとえば、１２ＫＨｚ．１６ｂｉｔ）に変換して出力し、その出力を音声分析部１２に送る。音声分析部１２では、音声入力部１１から送られてきた音声波形信号を、演算器（ＣＰＵ）を用いて短時間毎に周波数分析を行い、周波数の特徴を表す数次元の特徴ベクトルを抽出（ＬＰＣーＣＥＰＳＴＲＵＭ係数が一般的）し、この特徴ベクトルの時系列（以下、音声特徴ベクトル列という）を出力する。
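As a rough illustration of the short-time analysis described in this paragraph, the following Python sketch shows only how a 12 kHz sample stream might be cut into overlapping frames before one feature vector is computed per frame. The 20 ms frame length, the 10 ms hop, and the function name `split_into_frames` are assumptions made for illustration and do not appear in the source. The LPC cepstrum computation itself is omitted.

```python
SAMPLE_RATE = 12_000  # 12 kHz, as given in the text


def split_into_frames(samples, frame_len=240, hop=120):
    """Cut a sample sequence into overlapping short-time frames.

    frame_len=240 (20 ms) and hop=120 (10 ms) are assumed values;
    the source only says the analysis is done "at short time intervals".
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


# One second of silence stands in for the A/D converter output.
one_second = [0] * SAMPLE_RATE
frames = split_into_frames(one_second)
```

Each frame would then be reduced to a feature vector of several dimensions, and the sequence of those vectors is what the text calls the voice feature vector sequence.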
【００３０】
単語検出部１３は図示されていないが主に演算器（ＣＰＵ）と処理プログラムを記憶しているＲＯＭから構成され、カートリッジ部２０に設けられた標準音声特徴データ記憶部２１に登録されている単語が、入力音声中のどの部分にどれくらいの確かさで存在するかを検出するものである。この単語検出部１３としては、隠れマルコフモデル（ＨＭＭ）方式やＤＰマッチング方式などを用いることも可能であるが、ここでは、ＤＲＮＮ（ダイナミック　リカレント　ニューラル　ネットワーク）方式によるキーワードスポッティング処理技術（この技術に関しては、本出願人が特開平６ー４０９７、特開平６ー１１９４７６により、すでに特許出願済みである。）を用いて、不特定話者による連続音声認識に近い音声認識を可能とするための単語検出データを出力するものであるとする。
【００３１】
この単語検出部１３の具体的な処理について、図３を参照しながら簡単に説明する。単語検出部１３は、標準音声特徴データ記憶部２１に登録されている単語が、入力音声中のどの部分にどれくらいの確かさで存在するかを検出するものである。今、話者から「明日の天気は、・・・」というような音声が入力され、図３（ａ）に示すような音声信号が出力されたとする。この「明日の天気は、・・・」の文節のうち、「明日」と「天気」がこの場合のキーワードとなり、これらは、予め登録されている１０単語程度の登録単語の１つとして、標準音声特徴データ記憶部２１にそのパターンが記憶されている。そして、これら登録単語をたとえば１０単語としたとき、これら１０単語（これを、単語１、単語２、単語３、・・・とする）に対応して各単語を検出するための信号が出力されていて、その検出信号の値などの情報から、入力音声中にどの程度の確かさで対応する単語が存在するかを検出する。つまり、「天気」という単語（単語１）が入力音声中に存在したときに、その「天気」という信号を待っている検出信号が、同図（ｂ）の如く、入力音声の「天気」の部分で立ち上がる。同様に、「明日」という単語（単語２）が入力音声中に存在したときに、その「明日」という信号を待っている検出信号が、同図（ｃ）の如く、入力音声の「明日」の部分で立ち上がる。同図（ｂ），（ｃ）において、０．９あるいは０．８といった数値は、確からしさ（近似度）を示す数値であり、０．９や０．８といった高い数値であれば、その高い確からしさを持った登録単語は、入力された音声に対する認識候補であるということができる。つまり、「明日」という登録単語は、同図（ｃ）に示すように、入力音声信号の時間軸上のｗ１の部分に０．８という確からしさで存在し、「天気」という登録単語は、同図（ｂ）に示すように、入力音声信号の時間軸上のｗ２の部分に０．９という確からしさで存在することがわかる。
【００３２】
また、この図３の例では、「天気」という入力に対して、同図（ｄ）に示すように、単語３（この単語３は「何時」という登録単語であるとする）を待つ信号も、時間軸上のｗ２の部分に、ある程度の確からしさ（その数値は０．６程度）を有して立ち上がっている。このように、入力音声信号に対して同一時刻上に、２つ以上の登録単語が認識候補として存在する場合には、最も近似度（確からしさを示す数値）の高い単語を認識単語として選定する方法、各単語間の相関規則を表した相関表を予め作成しておき、この相関表により、いずれか１つの単語を認識単語として選定する方法などを用いて、或る１つの認識候補単語を決定する。たとえば、前者の方法で認識候補を決定するとすれば、この場合は、時間軸上のｗ２の部分に対応する近似度は、「天気」を検出する検出信号の近似度が最も高いことから、その部分の入力音声に対する認識候補は「天気」であるとの判定を行う。なお、これらの近似度を基に入力音声の認識は音声理解会話制御部１４にて行う。
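The first selection method described above (where detections coincide in time, pick the candidate with the highest closeness value) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Detection` class, the time values chosen for spans w1 and w2, and the 0.5 threshold are hypothetical. Only the three words and the values 0.8, 0.9, and 0.6 come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    word: str          # registered word the detector is waiting for
    confidence: float  # closeness value between 0.0 and 1.0
    start: float       # start of the detected span (seconds, assumed unit)
    end: float         # end of the detected span


def overlaps(a, b):
    """True when two detections share part of the time axis."""
    return a.start < b.end and b.start < a.end


def select_candidates(detections, threshold=0.5):
    """Keep only the most confident detection in each overlapping group."""
    chosen = []
    # Visit detections from most to least confident.
    for det in sorted(detections, key=lambda d: d.confidence, reverse=True):
        if det.confidence < threshold:
            continue
        if not any(overlaps(det, kept) for kept in chosen):
            chosen.append(det)
    return sorted(chosen, key=lambda d: d.start)


detections = [
    Detection("明日", 0.8, 0.0, 0.4),  # span w1
    Detection("天気", 0.9, 0.5, 0.9),  # span w2
    Detection("何時", 0.6, 0.5, 0.9),  # also fires on w2, lower confidence
]
recognized = [d.word for d in select_candidates(detections)]
```

The second method mentioned in the text, which uses a correlation table between words, is not shown here because the source does not describe the table's contents.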
【００３３】
音声理解会話制御部１４は、主に演算器（ＣＰＵ）と処理プログラムを記憶しているＲＯＭから構成され、単語検出部１３からの単語検出データを入力して、その単語検出データを基に、音声を認識し（入力音声全体の意味を理解し）、カートリッジ部２０に設けられた会話内容記憶部２２を参照して、入力音声の意味に応じた応答内容を決定するとともに、応答データ指示内容記憶部２３を参照して、どのような音声合成出力とするかを示す信号を音声合成部（主にＣＰＵとＲＯＭで構成される）１５に送る。
【００３４】
たとえば、単語検出部１３からの図３（ｂ）〜（ｅ）に示すような検出データ（これをワードラティスという。このワードラティスは、登録単語名、近似度、単語の始点ｓと終点ｅを示す信号などが含まれる）が入力されると、まず、そのワードラティスを基に、入力音声の中のキーワードとしての単語を１つまたは複数個決定する。この例では、入力音声は「明日の天気は・・・」であるので、「明日」と「天気」が検出されることになり、この「明日」と「天気」のキーワードから「明日の天気は・・・」という連続的な入力音声の内容を理解し、それに対応した応答内容を選んで出力する。なお、この場合、応答内容としては、「明日の天気は晴れだよ」というような応答内容となるが、これは、ここでは図示されていない状態検出手段（温度検出部、気圧検出部、カレンダ部、計時部など）が設けられていて、たとえば、天気に関する情報であれば、気圧検出部からの気圧の変化の状況を基に天気の変化を判断し、気圧が上昇傾向であればそれに対応した応答内容を応答データ指示内容記憶部２３から読み出すようにする。同様に、気温、時間、日付などに関する応答も可能となる。
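The step from a word lattice to a response can be pictured with the short Python sketch below. It assumes a lattice entry is a tuple of word name, closeness value, start point, and end point, as listed in the text. The function names, the 0.7 threshold, the sign convention for the pressure trend, and the reply used for falling pressure are hypothetical. Only the keywords and the reply for rising pressure are taken from the example in this paragraph.

```python
def extract_keywords(lattice, threshold=0.7):
    """Return the words whose closeness value reaches the threshold, in time order.

    lattice: list of (word, confidence, start, end) tuples.
    """
    hits = [entry for entry in lattice if entry[1] >= threshold]
    hits.sort(key=lambda entry: entry[2])
    return [entry[0] for entry in hits]


def decide_response(keywords, pressure_trend):
    """Pick a reply from the keywords and a barometric pressure trend.

    pressure_trend > 0 means rising pressure (assumed convention).
    """
    if "明日" in keywords and "天気" in keywords:
        if pressure_trend > 0:
            return "明日の天気は晴れだよ"
        return "明日の天気は雨かもしれないよ"  # hypothetical reply, not in the source
    if "おはよう" in keywords:
        return "おはよう"
    return None  # no registered keyword was understood


lattice = [
    ("明日", 0.8, 0.0, 0.4),
    ("天気", 0.9, 0.5, 0.9),
    ("何時", 0.6, 0.5, 0.9),
]
keywords = extract_keywords(lattice)
reply = decide_response(keywords, pressure_trend=1.5)
```

In the device described here, the pressure trend would come from the state detection means (pressure detection unit) rather than from a fixed argument.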
【００３５】
また、以上説明したキーワードスポッティング処理による連続音声認識に近い音声認識処理は、日本語だけでなく他の言語においても適用可能である。たとえば、使用する言語が英語であるとすれば、登録されている認識可能な単語は、たとえば、“ｇｏｏｄ　ｍｏｒｎｉｎｇ”、“ｔｉｍｅ”、“ｔｏｍｏｒｒｏｗ”、“ｇｏｏｄ　ｎｉｇｈｔ”などが一例として挙げられ、これら認識可能な登録単語の特徴データが、標準音声特徴データ記憶部２１に記憶されている。そして今、話者が「ｗｈａｔ　ｔｉｍｅ　ｉｓ　ｉｔ　ｎｏｗ」と問いかけた場合、この「ｗｈａｔ　ｔｉｍｅ　ｉｓ　ｉｔ　ｎｏｗ」の文節のうち、単語「ｔｉｍｅ」がこの場合のキーワードとなり、「ｔｉｍｅ」という単語が入力音声の中に存在したときに、その「ｔｉｍｅ」の音声信号を待っている検出信号が、入力音声の「ｔｉｍｅ」の部分で立ち上がる。そして、単語検出部１３からの検出データ（ワードラティス）が入力されると、まず、そのワードラティスを基に、入力音声の中のキーワードとしての単語を１つまたは複数個決定する。この例では、入力音声は、「ｗｈａｔ　ｔｉｍｅ　ｉｓ　ｉｔ　ｎｏｗ」であるので、「ｔｉｍｅ」がキーワードとして検出されることになり、このキーワードを基に、「ｗｈａｔ　ｔｉｍｅ　ｉｓ　ｉｔ　ｎｏｗ」という連続的な入力音声の内容を理解する。
【００３６】
なお、前記した音声分析、単語検出、音声理解会話制御、音声合成などの制御を行うＣＰＵはそれぞれに設けてもよいが、これら全ての処理を行う１台のメインのＣＰＵを設け、この１台のＣＰＵで本発明の全体の処理を行うようにしてもよい。
【００３７】
このような構成において、たとえば、装着されているカートリッジが、幼児を対象としたものであるとすれば、幼児が「おはよう」と問いかければ、その入力音声は音声分析部１２で分析されたのち、単語検出部１３に送られる。そして、単語検出部１３では、標準音声特徴データ記憶部２１に記憶されている特徴データをもとに、単語検出部１３により前記したような処理を行い、入力音声に対する単語検出データ（ワードラティス）を出力する。なお、このとき、カートリッジ部２０は幼児向けのものであるから、標準音声特徴データ記憶部２１の内容は、幼児の音声の特徴を基に得られた標準音声特徴データであるため、高い認識率での認識が可能となる。
【００３８】
そして、単語検出部１３からのワードラティスを受けた音声理解会話制御部１４では、そのワードラティスをもとにカートリッジ部２０の会話内容記憶部２２を参照して、入力音声が「おはよう」であることを理解するとともに、それに対する応答内容を得たのち、応答データ指示内容記憶部２３を参照する。これにより、音声合成部１５では、応答データ指示内容記憶部２３から得た情報を基に、音声合成出力を出し、音声出力部１６から応答内容として出力される。この場合の、応答内容としては、幼児に対する応答であるので、たとえば、幼児向けの話し方で「おはよう」と応答する。
【００３９】
このようにして、ユーザが任意に選択したカートリッジ部２０をぬいぐるみ１に装着することにより、カートリッジ部２０の内容に応じた対話が可能となる。たとえば、前記したような幼児向けのカートリッジを装着すれば幼児向けの対話が行え、小学生向けのカートリッジを装着すれば、それに応じた対話が可能となる。なお、このカートリッジは、様々な年齢、あるいは性別に応じて種々用意しておくことが可能である。具体的には、男子幼児用、女子幼児用、小学校の低学年の男子用、女子用などきめ細かい分類も可能である。
【００４０】
これにより、装置本体（この場合はぬいぐるみ）は１台であっても、カートリッジを変えることにより、様々な年代あるいは性別などに応じた対話が可能となる。また、この場合は、標準音声特徴データ記憶部２１、会話内容記憶部２２、応答データ指示内容記憶部２３がカートリッジ部２０に組み込まれているので、認識可能な単語やその標準音声特徴データをカートリッジ毎に異なったものとすることができ、また、その認識可能な単語に対する応答内容および音声の質をカートリッジ毎に異なったものとすることができる。したがって、カートリッジの種類のバリエーションを増やすことにより、様々な年代あるいは性別などに応じた対話が可能となる。
【００４１】
（第２の実施例）
以上説明した第１の実施例では、標準音声特徴データ記憶部２１、会話内容記憶部２２、応答データ指示内容記憶部２３をカートリッジ部２０に設けた例について説明したが、カートリッジ部２０としては、これら３つの要素を全て持たずに、たとえば、標準音声特徴データ記憶部２１のみをカートリッジ部２０に設けるようにしてもよく、また、応答データ指示内容記憶部２３のみをカートリッジ部２０に設けるようにしてもよく、その組み合わせは種々考えられる。
【００４２】
ここでは、カートリッジ部２０として、標準音声特徴データ記憶部２１のみをカートリッジ部２０内に設けた場合、会話内容記憶部２２と応答データ指示内容記憶部２３をカートリッジ部２０内に設けた場合、応答データ指示内容記憶部２３のみをカートリッジ部２０内に設けた場合についてそれぞれ説明する。
【００４３】
まず、標準音声特徴データ記憶部２１のみをカートリッジ部２０内に設けた場合について説明する。図４は、その構成を示すブロック図であり、図２と同一部分には同一符号が付されている。すなわち、この場合は、会話内容記憶部２２と応答データ指示内容記憶部２３は音声認識応答処理部１０側に設けられている。なお、会話内容記憶部２２は音声理解会話制御部１４内に設けてもよいが、ここでは、別個に設けた場合が示されている。
【００４４】
このように標準音声特徴データ記憶部２１のみをカートリッジ部２０側に設け、会話内容記憶部２２と応答データ指示内容記憶部２３が装置（ぬいぐるみ１）側にある場合は、それぞれの登録単語に対してどのような応答をするかなどは、装置側にて予め決められている。したがって、この場合は、登録単語は、装置側で予め決められた単語（たとえば、前記したように「おはよう」、「こんにちは」、「おやすみ」などというような１０単語程度）のみであるが、それぞれの登録単語における年代や性別に応じた標準音声特徴データをカートリッジ毎に様々用意することができる。たとえば、それぞれの登録単語に対して幼児の標準音声特徴データが記憶されたカートリッジ、それぞれの登録単語に対して小学生の標準音声特徴データが記憶されたカートリッジ、それぞれの登録単語に対して大人の女性の標準音声特徴データが記憶されたカートリッジ、それぞれの登録単語に対して大人の男性の標準音声特徴データが記憶されたカートリッジというように、それぞれの年代や性別ごとに、それぞれの登録単語に対する標準音声特徴データが記憶されたカートリッジを用意しておく。
【００４５】
このようにして、たとえば、幼児が使用する場合は、幼児用の音声特徴データが記憶されたカートリッジを選択して、それを装置本体に装着することにより、幼児の音声の特徴を基に得られた音声特徴データとの比較が行えることから高い認識率で認識することができる。そして、認識された単語に対して、予め決められた応答内容を音声合成して出力する。このように、年代や性別に応じて標準音声特徴データのカートリッジを選択することにより、認識率を大幅に向上させることができる。
【００４６】
次に、会話内容記憶部２２と応答データ指示内容記憶部２３をカートリッジ部２０内に設けた場合について説明する。図５は、その構成を示すブロック図であり、図２と同一部分には同一符号が付されている。すなわち、この場合は、標準音声特徴データ記憶部２１は音声認識応答処理部１０側に設けられ、会話内容記憶部２２と応答データ指示内容記憶部２３はカートリッジ部２０側に設けられている。
【００４７】
このように会話内容記憶部２２と応答データ指示内容記憶部２３のみをカートリッジ部２０側に設け、標準音声特徴データ記憶部２１を音声認識応答処理部１０側に設けた場合は、登録単語は装置側で予め決められた単語（たとえば、前記したように「おはよう」、「こんにちは」、「おやすみ」などというような１０単語程度）のみであるが、それぞれの登録単語に対する応答内容およびそれぞれの応答内容に対する音声合成出力（声の質など）をカートリッジ毎に様々用意することができる。たとえば、「おはよう」という単語に対してはどのような応答内容とするか、さらには、その応答内容をどのような音声合成出力とするかなどを予め何通りか決めておき、それらをカートリッジ毎に会話内容記憶部２２および応答データ指示内容記憶部２３に記憶させておく。具体的には、幼児向けのカートリッジは、「おはよう」、「おやすみ」などの登録単語に対しては、幼児向けの応答内容で、かつ、幼児向けの音質での応答を行い、小学生向けのカートリッジとしては、登録単語に対する応答を、たとえば、テレビアニメのキャラクタの話し方を真似て、しかも小学生向けの応答内容での応答を行うなどというように、それぞれの年齢や性別に合わせた応答内容と声の質による応答を行うカートリッジを種々用意しておく。
【００４８】
このようにして、たとえば、小学生が使用する場合に、前記したような小学生向けのカートリッジを選択して、それを装置に装着することにより、小学生が何らかの登録単語を話しかけると、前記したようなテレビアニメのキャラクタの話し方を真似た声の質で、かつ、そのカートリッジに設定された応答内容が返ってくるというようなことが可能となる。
【００４９】
次に、応答データ指示内容記憶部２３のみをカートリッジ部２０側に設けた場合について説明する。図６は、その構成を示すブロック図であり、図２と同一部分には同一符号が付されている。すなわち、この場合は、標準音声特徴データ記憶部２１および会話内容記憶部２２は装置本体の音声認識応答処理部１０側に設けられ、応答データ指示内容記憶部２３のみがカートリッジ部２０側に設けられている。
【００５０】
このように応答データ指示内容記憶部２３のみをカートリッジ部２０側に設け、会話内容記憶部２２と標準音声特徴データ記憶部２１を音声認識応答処理部１０側に設けた場合は、登録単語は装置側で予め決められた単語（たとえば、前記したように「おはよう」、「こんにちは」、「おやすみ」などというような１０単語程度）のみであり、また、これらの登録単語に対する応答内容は基本的には装置側で予め設定された内容となるが、その応答内容に対してどのような音声合成出力とするかをカートリッジ毎に様々設定することができる。たとえば、「おはよう」という単語に対する応答内容を、どのような声の質で出力とするかを予め何通りか決めておき、それらをカートリッジ毎に応答データ指示内容記憶部２３に記憶させておく。具体的には、幼児向けのカートリッジの場合は、「おはよう」、「おやすみ」などの種々の登録単語に対しては、それらの登録単語に対して、母親のような声での応答を行うような指示内容、あるいは幼児向けテレビアニメのキャラクタの声に似せた応答を行うような指示内容が記憶され、小学生向けのカートリッジとしては、登録単語に対する応答を小学生向けのテレビアニメのキャラクタの話し方を真似て応答を行うなどの指示内容が記憶されるというように、それぞれの年齢や性別に合わせて、登録単語毎に、どのような音声合成出力とするかを指示する内容が記憶されたカートリッジを種々用意しておく。なお、このとき、それぞれの登録単語に対するするそれぞれの応答内容は基本的には予め設定された内容である。
【００５１】
このようにして、たとえば、小学生が使用する場合は、前記したような小学生向けのカートリッジを選択して、それを装置に装着することにより、小学生が何らかの登録単語を話しかけると、応答内容は基本的には予め設定された内容であるが、前記したような小学生向けのテレビアニメのキャラクタの話し方を真似た声での応答が返ってくるというようなことが可能となる。
【００５２】
以上説明したように本発明では、標準音声特徴データ記憶部２１、会話内容記憶部２２、応答データ指示内容記憶部２３などのＲＯＭの部分をカートリッジ式とし、ユーザの年齢や性別などに応じた標準音声特徴データ、応答内容などを有するカートリッジを種々用意しておき、ユーザが任意に選択できるようにしている。したがって、認識応答装置そのものは１台であっても、カートリッジを取り替えることにより、幅広い人が利用でき、ユーザに応じた音声認識および対話が行える。
【００５３】
なお、以上の各実施例では、本発明を玩具としてぬいぐるみに適用した例を説明したが、ぬいぐるみに限られるものではなく、他の玩具にも適用できることは勿論であり、さらに、玩具だけではなく、ゲーム機や、日常使われる様々な電子機器などにも適用でき、その適用範囲は極めて広いものと考えられる。
【００５４】
【発明の効果】
以上説明したように、本発明の音声認識対話装置は、請求項１によれば、音声認識および認識された音声に対する応答を行うために予め設定された記憶内容を記憶する記憶手段を、装置本体に対して着脱自在に装着可能なカートリッジ側に設け、このカートリッジが装置本体側に装着されることにより、そのカートリッジ内に記憶されたデータを基に、入力音声に対する応答内容を出力するようにしたので、装置本体は１台であっても、カートリッジを変えることにより、様々な年代あるいは性別などに応じた対話が可能となる。したがって、本発明をたとえば、玩具などに適用した場合には、子どもの成長に合わせたカートリッジを選択することができ、また、認識可能な単語、応答内容もカートリッジを選択することにより色々選ぶことができるため、１台の玩具でも途中で飽きてしまうことが少なく、長い期間使用することができ、また、年代や性別にとらわれることなく幅広く使用可能となる。さらに、玩具だけでなく、電子機器などの適用した場合にも、ユーザに適応したカートリッジを選択することにより、対話内容などに幅広いバリエーションを持たせることができ、それに対応して様々な動作をさせることが可能となるなど、その効果はきわめて大きいものとなる。
【００５５】
また、請求項２によれば、予め登録された認識可能な単語の標準音声特徴データを記憶する標準音声特徴データ記憶手段をカートリッジ側に設けるようにしたので、それぞれの登録単語における年代や性別に応じた標準音声特徴データをカートリッジ毎に様々用意することができ、ユーザに適応したカートリッジを選択して使用することにより、認識率の大幅な向上を図ることができる。
【００５６】
また、請求項３によれば、予め登録された認識可能な単語に対応する応答内容を記憶する会話内容記憶手段と、どのような音声合成出力を発生するかを指示する指示内容を記憶する応答データ指示内容記憶手段をカートリッジ側に設けるようにしたので、それぞれの年齢や性別に合わせた応答内容を持ったカートリッジを種々用意することができ、たとえば、子供向けのカートリッジを選択すれば、何らかの登録単語を話しかけると、子供向けの応答内容で、かつ、子供向けの音声で応答するというようなことが可能となる。
【００５７】
また、請求項４によれば、どのような音声合成出力を発生するかを指示するための指示内容を記憶する応答データ指示内容記憶手段をカートリッジ側に設けるようにしたので、それぞれの年齢や性別に合わせて、登録単語毎に、どのような音声合成出力とするかを指示する内容が記憶されたカートリッジを種々用意することができ、たとえば、小学生向けのカートリッジを選択すれば、何らかの登録単語を話しかけると、応答内容は基本的には予め設定された内容であるが、前記したような小学生向けのテレビアニメのキャラクタの話し方を真似た声での応答が返ってくるというようなことが可能となる。
【００５８】
また、請求項５によれば、予め登録された認識可能な単語の標準音声特徴データを記憶する標準音声特徴データ記憶手段と、前記登録された認識可能な単語に対応する応答内容を記憶する会話内容記憶手段と、どのような音声合成出力を発生するかを指示する応答データ指示内容記憶手段を、カートリッジ側に設けるようにしたので、それぞれの登録単語における年代や性別に応じた標準音声特徴データをカートリッジ毎に様々用意することができ、ユーザに適応したカートリッジを選択して使用することにより、認識率の大幅な向上を図ることができ、また、認識可能な登録単語もカートリッジ単位で設定でき、会話のバリエーションを大幅に増やすことができる。さらに、それぞれの年齢や性別に合わせた応答内容および音声合成出力を持ったカートリッジを種々用意することができる。これにより、１台の装置本体であっても、カートリッジを変えることにより、様々な年代あるいは性別などに応じたバリエーションの豊富な対話が可能となる。
【００５９】
また、本発明の音声認識対話処理方法は、請求項６によれば、音声認識および認識された音声に対する応答を行うために予め設定された記憶内容を、装置本体に対して着脱自在に装着可能なカートリッジ側に設け、このカートリッジが装置本体側に装着されることにより、そのカートリッジ内に記憶されたデータを基に、入力音声に対する応答内容を発生するようにしたので、装置本体は１台であっても、カートリッジを変えることにより、様々な年代あるいは性別などに応じた対話が可能となる。したがって、本発明をたとえば、玩具などに適用した場合には、子どもの成長に合わせたカートリッジを選択することができ、また、認識可能な単語、応答内容もカートリッジを選択することにより色々選ぶことができるため、１台の玩具でも途中で飽きてしまうことが少なく、長い期間使用することができ、また、年代や性別にとらわれることなく幅広く使用可能となる。さらに、玩具だけでなく、電子機器などの適用した場合にも、ユーザに適応したカートリッジを選択することにより、対話内容などに幅広いバリエーションを持たせることができ、それに対応して様々な動作をさせることが可能となるなど、その適用範囲はきわめて広いものとなる。
【００６０】
また、請求項７によれば、予め登録された認識可能な単語の標準音声特徴データをカートリッジ側に記憶させるようにしたので、それぞれの登録単語における年代や性別に応じた標準音声特徴データをカートリッジ毎に様々用意することができ、ユーザに適応したカートリッジを選択して使用することにより、認識率の大幅な向上を図ることができる。
【００６１】
また、請求項８によれば、予め登録された認識可能な単語に対応する応答内容およびどのような音声合成出力を発生するかを指示する指示内容を、カートリッジ側に記憶させるようにしたので、それぞれの年齢や性別に合わせた応答内容を持ったカートリッジを種々用意することができ、たとえば、子供向けのカートリッジを選択すれば、何らかの登録単語を話しかけると、子供向けの応答内容で、かつ、子供向けの音声で応答するというようなことが可能となる。
【００６２】
また、請求項９によれば、どのような音声合成出力を発生するかを指示するための指示内容を、カートリッジ側に記憶させるようにようにしたので、それぞれの年齢や性別に合わせて、登録単語毎に、どのような音声合成出力とするかを指示する内容が記憶されたカートリッジを種々用意することができ、たとえば、小学生向けのカートリッジを選択すれば、何らかの登録単語を話しかけると、応答内容は基本的には予め設定された内容であるが、前記したような小学生向けのテレビアニメのキャラクタの話し方を真似た声での応答が返ってくるというようなことが可能となる。
【００６３】
また、請求項１０によれば、予め登録された認識可能な単語の標準音声特徴データ、前記登録された認識可能な単語に対応する応答内容、どのような音声合成出力を発生するかを指示する応答データ指示内容を、カートリッジ側に記憶させるようにしたので、それぞれの登録単語における年代や性別に応じた標準音声特徴データをカートリッジ毎に様々用意することができ、ユーザに適応したカートリッジを選択して使用することにより、認識率の大幅な向上を図ることができ、また、認識可能な登録単語もカートリッジ単位で設定でき、会話のバリエーションを大幅に増やすことができる。さらに、それぞれの年齢や性別に合わせた応答内容および音声合成出力を持ったカートリッジを種々用意することができる。これにより、１台の装置本体であっても、カートリッジを変えることにより、様々な年代あるいは性別などに応じたバリエーションの豊富な対話が可能となる。
【図面の簡単な説明】
【図１】本発明の概略を説明する図。
【図２】本発明の第１の実施例を説明するブロック図。
【図３】単語検出部による単語検出処理および音声理解会話制御部による音声認識処理を説明する図。
【図４】本発明の第２の実施例（その１）を説明するブロック図。
【図５】本発明の第２の実施例（その２）を説明するブロック図。
【図６】本発明の第２の実施例（その３）を説明するブロック図。
【符号の説明】
１・・・ぬいぐるみ（装置本体）　１０・・・音声認識応答処理部
１１・・・音声入力部　　　　　　１２・・・音声分析部
１３・・・単語検出部　　　　　　１４・・・音声理解会話制御部
１５・・・音声合成部　　　　　　１６・・・音声出力部
２０・・・カートリッジ部　　　　２１・・・標準音声特徴データ記憶部
２２・・・会話内容記憶部　　　　２３・・・応答データ指示内容記憶部
[0001]
[Industrial applications]
The present invention relates to a voice recognition dialogue apparatus and a voice recognition dialogue processing method for recognizing voice and performing a response or a specific operation corresponding to the recognition result.
[0002]
[Prior art]
Speech recognition apparatuses of this type include specific-speaker speech recognition apparatuses, which can recognize only the speech of a specific speaker, and unspecified-speaker speech recognition apparatuses, which can recognize the speech of unspecified speakers.
[0003]
In a specific-speaker voice recognition device, a particular speaker first registers his or her standard voice signal patterns by inputting the recognizable words one at a time according to a predetermined procedure. After registration is complete, when that speaker utters a registered word, the device performs speech recognition by comparing the feature pattern obtained by analyzing the input speech with the registered feature pattern. One example of this type of speech recognition dialogue device is a speech recognition toy. For example, the child who uses the toy registers in advance about 10 words, such as "good morning", "good night", and "hello", as command words that serve as voice commands. When the speaker then says, for example, "good morning", the device compares that voice signal with the registered "good morning" voice signal and, when the two match, outputs a predetermined electrical signal for the voice command, which causes the toy to perform a specific operation.
[0004]
Such a specific-speaker voice recognition device recognizes only the specific speaker's voice or voices with a similar voice pattern. In addition, every word to be recognized must be registered one by one as an initial setting, which is extremely troublesome.
[0005]
In contrast, in an unspecified-speaker voice recognition device, standard voice feature data for the recognition target words described above is created in advance from voices uttered by a large number of speakers (for example, about 200) and is stored (registered), so that the device can recognize these pre-registered words when they are spoken by an unspecified speaker.
[0006]
[Problems to be solved by the invention]
An unspecified-speaker speech recognition device of this kind does secure a relatively high recognition rate for standard speech, but it does not necessarily achieve a high recognition rate for nearly all speech. Voice characteristics differ greatly with age and gender, as with the voices of young children, adults, women, and men. As a result, even if an extremely high recognition rate is obtained when an adult speaks to the device, speech from a young child may hardly be recognized at all.
[0007]
Also, when this type of speech recognition device is applied to a toy such as a stuffed toy, the content of the dialogue usually changes depending on the age, gender, and so on of the child playing with it. For example, young children and upper-grade elementary school children, or boys and girls, generally want different dialogue content.
[0008]
However, in this type of speech recognition apparatus, the recognizable registered words are limited, and the responses to them are generally also limited to some extent. Consequently, children usually tire of this kind of toy within a short period, and, as described above, the recognition rate also varies with the voice characteristics associated with age and gender. The same applies not only to toys but to all electronic devices that use voice recognition.
[0009]
The present invention has been made to solve these problems. The ROM portions, such as the standard voice feature data storage means that stores the standard voice feature data and the response content storage means that stores the response content, are made cartridge-type, and a suitably selected cartridge can be mounted. The object is thereby to enable recognition adapted to the voice characteristics that differ with age, gender, and so on, improving the recognition rate, and to enable conversations suited to various ages and genders.
[0010]
[Means for Solving the Problems]
The voice recognition dialogue apparatus of the present invention analyzes input voice to generate voice feature data, compares this voice feature data with standard voice feature data of pre-registered recognizable words to output word detection data, and, on receiving the word detection data, understands the meaning of the input voice and determines and outputs the corresponding response content. The apparatus is characterized in that storage means, which stores contents preset for performing voice recognition, contents preset for outputting a response to the recognized voice, and the like, is provided on a cartridge that can be detachably mounted on the apparatus main body. When the cartridge is mounted on the apparatus main body, it is connected to a voice recognition response processing unit provided on the apparatus main body side, and the voice recognition response processing unit generates the response content for the input voice based on the data stored in the cartridge.
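The division of roles between the cartridge and the apparatus main body can be pictured with the following Python sketch. It is a toy model under stated assumptions, not the apparatus itself: the class names `Cartridge` and `MainBody`, the two-element feature vectors, the example replies, and the nearest-reference matching (used here as a simple stand-in for word detection) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cartridge:
    """Hypothetical model of the contents stored on the removable cartridge."""
    standard_features: dict  # registered word -> reference feature vector
    responses: dict          # registered word -> response content
    voice_styles: dict       # registered word -> speech synthesis instruction


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class MainBody:
    """Apparatus main body; it can respond only while a cartridge is mounted."""

    def __init__(self):
        self.cartridge = None

    def mount(self, cartridge):
        self.cartridge = cartridge

    def respond(self, features):
        if self.cartridge is None:
            return None
        refs = self.cartridge.standard_features
        # Stand-in for word detection: pick the closest registered word.
        word = min(refs, key=lambda w: squared_distance(features, refs[w]))
        return self.cartridge.responses[word], self.cartridge.voice_styles[word]


child_cartridge = Cartridge(
    standard_features={"good morning": [0.9, 0.2], "good night": [0.1, 0.8]},
    responses={"good morning": "Good morning!", "good night": "Sleep tight!"},
    voice_styles={"good morning": "child-friendly voice",
                  "good night": "child-friendly voice"},
)
toy = MainBody()
toy.mount(child_cartridge)
reply = toy.respond([0.8, 0.3])
```

Mounting a different `Cartridge` instance on the same `MainBody` changes the reference data, the replies, and the voice style without changing the main body's code, which is the point of making the ROM portion removable.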
[0011]
The storage means provided on the cartridge side is standard voice feature data storage means for storing standard voice feature data for pre-registered recognizable words. Input means; voice analysis means for analyzing voice input by the voice input means to generate voice feature data; and inputting voice feature data from the voice analysis means to provide a standard voice provided on the cartridge side. Word detection means for outputting word detection data for the input voice based on the storage content of the characteristic data storage means; receiving word detection data from the word detection means, understanding the meaning of the input voice, and responding accordingly. Speech comprehension conversation control means for determining the content, and speech synthesis output based on the response content determined by the speech comprehension conversation control means Response data instruction content storage means for causing the speech data to be output, and voice synthesis means for generating a speech synthesis output based on the instruction of the response data instruction content storage means in response to the response content determined by the voice understanding conversation control means; And a voice output unit for outputting a voice synthesis output from the voice synthesis unit to the outside.
[0012]
The storage means provided on the cartridge side includes a conversation content storage means for storing a response content corresponding to a recognizable word registered in advance, and an instruction for instructing what kind of speech synthesis output is to be generated. Response data instruction content storage means for storing the content, wherein the apparatus main body includes a voice input means for inputting voice, and a voice for generating voice feature data by analyzing the voice input by the voice input means. Analysis means; standard voice feature data storage means for storing standard voice feature data of pre-registered recognizable words; and word detection data for the input voice based on the stored contents of the standard voice feature data storage means. Receiving the word detection data from the word detecting means, understand the meaning of the input voice, and send the corresponding response content to the cartridge. And a response data instruction content storage means provided on the cartridge side for the response content determined by the voice understanding conversation control means. Speech synthesis means for generating a speech synthesis output based on the instruction, and speech output means for outputting the speech synthesis output from the speech synthesis means to the outside.
[0013]
The storage means provided on the cartridge side is response data instruction content storage means for storing instruction contents for instructing what kind of speech synthesis output is to be generated. Voice input means, a voice analysis means for analyzing voice input by the voice input means to generate voice feature data, and a standard voice for storing standard voice feature data of recognizable words registered in advance. Feature data storage means, word detection means for outputting word detection data for the input speech based on the storage contents of the standard speech feature data storage means, and receiving the word detection data from the word detection means, And a voice understanding conversation control means for determining a response content corresponding thereto, and the response content determined by the voice understanding conversation control means, A configuration comprising: voice synthesis means for generating a voice synthesis output based on the instruction of the response data instruction content recording means provided on the cartridge side; and voice output means for outputting the voice synthesis output from the voice synthesis means to the outside. And
[0014]
The storage means provided on the cartridge side stores standard voice feature data storage means for storing standard voice feature data for pre-registered recognizable words, and stores response contents corresponding to the registered recognizable words. A conversation content storage means, a response data instruction content storage means for instructing what kind of speech synthesis output is to be generated, and a voice input means for inputting a voice on the apparatus main body side; Voice analysis means for analyzing input voice and generating voice feature data; and inputting voice feature data from the voice analysis means, based on storage contents of standard voice feature data storage means provided on the cartridge side. A word detecting means for outputting word detection data for the input voice, and receiving the word detection data from the word detecting means to determine the meaning of the input voice. Comprehending and determining the corresponding response content with reference to the conversation content storage unit provided on the cartridge side, and the response content determined by the voice understanding conversation control means, Speech synthesis means for generating a speech synthesis output based on the instruction of the response data instruction content storage means provided on the cartridge side, and speech output means for outputting the speech synthesis output from the speech synthesis means to the outside And
[0015]
Further, the voice recognition conversation processing method of the present invention is a method that analyzes input voice to generate voice feature data, compares the voice feature data with standard voice feature data of pre-registered recognizable words to output word detection data, receives the word detection data, understands the meaning of the input voice, and determines and outputs the corresponding response content. The method is characterized in that storage contents set in advance for performing voice recognition and storage contents set in advance for outputting a response to the recognized voice are written in storage means on the side of a cartridge that can be detachably mounted on the apparatus main body, and in that, when the cartridge is mounted on the apparatus main body, a response to the input voice is generated based on the data stored in the cartridge.
[0016]
The storage content stored on the cartridge side is standard voice feature data of pre-registered recognizable words. The apparatus main body side has: a voice analysis step of analyzing the voice input by the voice input means to generate voice feature data; a word detection step of receiving the voice feature data from the voice analysis step and outputting word detection data for the input voice based on the standard voice feature data stored on the cartridge side; a voice understanding conversation control step of receiving the word detection data from the word detection step, understanding the meaning of the input voice, and determining the corresponding response content; a speech synthesis step of generating a speech synthesis output for the response content determined by the voice understanding conversation control step, based on response data instruction content that specifies what kind of speech synthesis output is to be generated; and a voice output step of outputting the speech synthesis output from the speech synthesis step to the outside.
[0017]
The storage contents stored on the cartridge side are response contents corresponding to pre-registered recognizable words and response data instruction contents that specify what kind of speech synthesis output is to be generated. The apparatus main body side has: a voice analysis step of analyzing the voice input by the voice input means to generate voice feature data; a word detection step of outputting word detection data for the input voice based on standard voice feature data of pre-registered recognizable words; a voice understanding conversation control step of receiving the word detection data from the word detection step, understanding the meaning of the input voice, and determining the corresponding response content with reference to the conversation content stored on the cartridge side; a speech synthesis step of generating a speech synthesis output for the response content determined by the voice understanding conversation control step, based on the response data instruction content stored on the cartridge side; and a voice output step of outputting the speech synthesis output from the speech synthesis step to the outside.
[0018]
Further, the storage content stored on the cartridge side is instruction content that specifies what kind of speech synthesis output is to be generated. The apparatus main body side has: a voice analysis step of analyzing the voice input by the voice input means to generate voice feature data; a word detection step of outputting word detection data for the input voice based on standard voice feature data of pre-registered recognizable words; a voice understanding conversation control step of receiving the word detection data from the word detection step, understanding the meaning of the input voice, and determining the corresponding response content; a speech synthesis step of generating a speech synthesis output for the response content determined by the voice understanding conversation control step, based on the response data instruction content stored on the cartridge side; and a voice output step of outputting the speech synthesis output from the speech synthesis step to the outside.
[0019]
The storage contents stored on the cartridge side include standard voice feature data of pre-registered recognizable words, response contents corresponding to the registered recognizable words, and response data instruction contents that specify what kind of speech synthesis output is to be generated. The apparatus main body side has: a voice analysis step of analyzing the voice input by the voice input means to generate voice feature data; a word detection step of receiving the voice feature data from the voice analysis step and outputting word detection data for the input voice based on the standard voice feature data stored on the cartridge side; a voice understanding conversation control step of receiving the word detection data from the word detection step, understanding the meaning of the input voice, and determining the corresponding response content with reference to the conversation content stored on the cartridge side; a speech synthesis step of generating a speech synthesis output for the response content determined by the voice understanding conversation control step, based on the response data instruction content stored on the cartridge side; and a voice output step of outputting the speech synthesis output from the speech synthesis step to the outside.
[0020]
[Operation]
In the present invention, the storage contents set in advance for performing voice recognition and the storage contents set in advance for outputting a response to the recognized voice are held in a cartridge that can be detachably mounted on the apparatus main body. When the cartridge is mounted on the apparatus main body, the response content for the input voice is output based on the data stored in the cartridge. Changing the cartridge therefore enables conversation suited to various ages or genders. A cartridge can be selected to suit the user, and the recognizable words and response contents can also be selected cartridge by cartridge, so that a wide variety of conversation contents can be provided.
[0021]
[Embodiments]
Hereinafter, embodiments of the present invention will be described with reference to the drawings. In these embodiments, the present invention is applied to a toy, specifically a stuffed animal such as a dog. The example described is one in which the present invention is applied to an unspecified-speaker speech recognition device, that is, a device capable of recognizing the speech of unspecified speakers.
[0022]
(First embodiment)
FIG. 1 is a view for explaining the overall schematic configuration of the present invention. Broadly, the device consists of a voice recognition response processing unit 10 (described in detail later) housed in a stuffed dog (apparatus main body) 1, and a cartridge unit 20 (described in detail later) that can be detachably attached to a predetermined portion of the stuffed dog 1.
[0023]
FIG. 2 is a block diagram illustrating a configuration of the voice recognition response processing unit 10 and the cartridge unit 20 according to the first embodiment. In the first embodiment, an example will be described in which three ROMs of a standard voice feature data storage unit 21, a conversation content storage unit 22, and a response data instruction content storage unit 23 are provided on the cartridge side.
[0024]
The voice recognition response processing unit 10 includes a voice input unit 11, a voice analysis unit 12, a word detection unit 13, a voice understanding conversation control unit 14, a voice synthesis unit 15, a voice output unit 16, and the like. Among these components, the voice analysis unit 12, the word detection unit 13, the voice understanding conversation control unit 14, the voice synthesis unit 15, and the like are housed, for example, near the abdomen of the stuffed toy 1; the voice input unit (microphone) 11 is provided, for example, at the ear of the stuffed toy 1; and the voice output unit (speaker) 16 is provided, for example, at the mouth.
[0025]
On the other hand, the cartridge unit 20 includes a standard voice feature data storage unit 21, a conversation content storage unit 22, and a response data instruction content storage unit 23, and can easily be attached to and removed from, from the outside, a cartridge mounting unit (not shown) provided, for example, near the abdomen of the stuffed toy 1. When the cartridge unit 20 is correctly mounted on the cartridge mounting unit, it is connected to the relevant parts of the voice recognition response processing unit 10 so that signals can be exchanged. Specifically, the standard voice feature data storage unit 21 is connected to the word detection unit 13, the conversation content storage unit 22 is connected to the voice understanding conversation control unit 14, and the response data instruction content storage unit 23 is connected to the voice understanding conversation control unit 14 and the voice synthesis unit 15.
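The connection arrangement described above can be sketched as a small data model. This is only an illustrative sketch: the class names `Cartridge` and `MainBody`, the field names, and the use of dictionaries to stand in for ROM contents are all assumptions and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cartridge:
    """Hypothetical model of cartridge unit 20 and its three ROMs."""
    standard_voice_features: Dict[str, List[float]]  # unit 21: word -> standard pattern
    conversation_content: Dict[str, str]             # unit 22: word -> response content
    response_instructions: Dict[str, str]            # unit 23: word -> synthesis instruction


@dataclass
class MainBody:
    """Hypothetical model of voice recognition response processing unit 10."""
    cartridge: Optional[Cartridge] = None

    def mount(self, cartridge: Cartridge) -> None:
        self.cartridge = cartridge

    def unmount(self) -> None:
        self.cartridge = None

    def connections(self) -> Dict[str, List[str]]:
        """Which processing unit reads which cartridge storage unit once mounted."""
        if self.cartridge is None:
            return {}
        return {
            "word_detection_unit_13": ["storage_unit_21"],
            "voice_understanding_conversation_control_unit_14": [
                "storage_unit_22",
                "storage_unit_23",
            ],
            "voice_synthesis_unit_15": ["storage_unit_23"],
        }
```

With no cartridge mounted, `connections()` returns an empty mapping; after `mount(...)` it lists the three connections named in the paragraph above.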
[0026]
The standard voice feature data storage unit 21 is a ROM in which standard patterns of recognizable words (referred to as registered words) are stored (registered); the patterns are created in advance, for each word, from speech uttered by a large number of speakers (for example, about 200). Because a stuffed animal is used as the example here, the number of registered words is about 10. However, the present invention is not limited to this; various words can be registered, and the number of registered words is not limited to 10. The conversation content storage unit 22 stores contents such as which words are registered words and how to respond to each registered word. The conversation content storage unit 22 would ordinarily be provided in the voice understanding conversation control unit 14, but it is provided on the cartridge unit 20 side because the registered words and the like may change from cartridge to cartridge. The response data instruction content storage unit 23 stores contents that specify what kind of speech synthesis output is to be produced for each registered word. These contents, set in advance, mainly specify the voice quality, for example a speech synthesis output that follows the way a girl speaks.
[0027]
As described above, the cartridge unit 20 holds the standard voice feature data, the registered words, the response contents for each registered word, the voice quality, and so on. Various kinds of cartridges are prepared, and the user can select one arbitrarily. For example, in a cartridge for infants, registered words based on standard voice feature data prepared so that an infant's voice is easily recognized are stored in the standard voice feature data storage unit 21; which words are registered words and what kind of response is made to each registered word are stored in the conversation content storage unit 22; and, for each registered word, the instruction specifying what kind of speech synthesis output is to be produced is stored in the response data instruction content storage unit 23.
[0028]
Next, the functions of the above-described units and the overall processing will be sequentially described below.
[0029]
The voice input unit 11 of the voice recognition response processing unit 10 includes a microphone, an amplifier, a low-pass filter, an A/D converter, and the like (not shown). The voice input from the microphone is shaped into an appropriate voice waveform by the amplifier and the low-pass filter, converted into a digital signal (for example, 12 kHz, 16 bits) by the A/D converter, and sent to the voice analysis unit 12. The voice analysis unit 12 uses an arithmetic unit (CPU) to perform frequency analysis of the voice waveform signal sent from the voice input unit 11 at short time intervals, extracts a feature vector of several dimensions representing the frequency characteristics (LPC cepstrum coefficients are generally used), and outputs a time series of these feature vectors (hereinafter referred to as a voice feature vector sequence).
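As an illustration of the short-interval analysis described above, the following sketch computes LPC cepstrum coefficients frame by frame using the standard autocorrelation method, the Levinson-Durbin recursion, and the usual LPC-to-cepstrum recursion. The specification names only the LPC cepstrum as the feature; the frame length of 240 samples (20 ms at 12 kHz), the hop of 120 samples, the analysis order of 10, the absence of a window function, and all function names are assumptions made for this example.

```python
def autocorrelation(frame, order):
    """Autocorrelation r[0..order] of one frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def lpc_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1/A(z)."""
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = sum(k * c[k] * a[n - k] for k in range(1, n) if n - k <= p)
        c[n] = -(a[n] if n <= p else 0.0) - acc / n
    return c[1:]


def feature_vector_sequence(samples, frame_len=240, hop=120, order=10, n_ceps=10):
    """Time series of LPC cepstrum vectors, one per short frame."""
    seq = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        r = autocorrelation(samples[start:start + frame_len], order)
        a, _ = levinson_durbin(r, order)
        seq.append(lpc_cepstrum(a, n_ceps))
    return seq
```

For a first-order model with autocorrelation `[1.0, 0.5, 0.25]`, the recursion yields `a[1] = -0.5` and cepstrum values 0.5 and 0.125, which matches the closed form c_n = 0.5^n / n.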
[0030]
Although not shown, the word detection unit 13 mainly consists of an arithmetic unit (CPU) and a ROM that stores a processing program. It detects in which part of the input voice, and with what degree of certainty, each word registered in the standard voice feature data storage unit 21 of the cartridge unit 20 exists. A hidden Markov model (HMM) method, a DP matching method, or the like can be used for the word detection unit 13. Here, however, it is assumed that the unit uses a keyword spotting technique based on the DRNN (dynamic recurrent neural network) method (for which the present applicant has already filed patent applications, Japanese Patent Application Laid-Open Nos. Hei 6-4097 and Hei 6-119476) and outputs word detection data that enables speech recognition close to continuous speech recognition for unspecified speakers.
[0031]
The specific processing of the word detection unit 13 will be briefly described with reference to FIG. 3. The word detection unit 13 detects in which part of the input voice, and with what degree of certainty, each word registered in the standard voice feature data storage unit 21 exists. Suppose that the speaker inputs a voice such as "Tomorrow's weather is ..." and that a voice signal as shown in FIG. 3A is output. In the phrase "Tomorrow's weather is ...", "tomorrow" and "weather" are the keywords; they are among the roughly 10 words registered in advance, and their standard patterns are stored in the standard voice feature data storage unit 21. If there are, for example, 10 registered words, a signal for detecting each word is output for each of these 10 words (hereinafter referred to as word 1, word 2, word 3, ...). The degree of certainty that the corresponding word exists in the input voice is then determined from information such as the value of the detection signal. That is, when the word "weather" (word 1) is present in the input voice, the detection signal waiting for "weather" rises at the "weather" portion of the input voice, as shown in FIG. 3B. Similarly, when the word "tomorrow" (word 2) is present in the input voice, the detection signal waiting for "tomorrow" rises at the "tomorrow" portion of the input voice, as shown in FIG. 3C. In FIGS. 3B and 3C, numerical values such as 0.9 and 0.8 indicate certainty (degree of approximation), and a registered word with a value as high as 0.9 or 0.8 can be regarded as a recognition candidate for the input voice. That is, the registered word "tomorrow" exists with a certainty of 0.8 in the w1 portion on the time axis of the input voice signal, as shown in FIG. 3C, and the registered word "weather" exists with a certainty of 0.9 in the w2 portion on the time axis, as shown in FIG. 3B.
[0032]
Further, in the example of FIG. 3, the signal waiting for word 3 (the registered word "what time") also rises with some degree of certainty (a value of about 0.6) in the w2 portion on the time axis, as shown in FIG. 3D. When two or more registered words exist as recognition candidates at the same time in the input voice signal in this way, the recognized word is determined by one of two methods: selecting the word with the highest degree of approximation (the numerical value indicating certainty) as the recognized word, or preparing in advance a correlation table expressing the correlation rules between words and selecting one of the words as the recognized word based on that table. For example, if the recognized word is determined by the former method, the degree of approximation for the w2 portion on the time axis is highest in the detection signal for "weather", so the recognition candidate for that portion of the input voice is determined to be "weather". The input voice is recognized by the voice understanding conversation control unit 14 based on these degrees of approximation.
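The first of the two selection methods (taking the candidate with the highest degree of approximation when candidates coincide in time) can be sketched as follows. The degrees of approximation 0.8, 0.9, and 0.6 come from the description of FIG. 3; the `Detection` record, the greedy selection procedure, and the start and end times assigned to w1 and w2 are assumptions for illustration only. The correlation-table method is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    """One entry of word detection data: word, degree of approximation, time span."""
    word: str
    approx: float
    start: float
    end: float


def overlaps(a: Detection, b: Detection) -> bool:
    """True when the two detections share some portion of the time axis."""
    return a.start < b.end and b.start < a.end


def select_recognized(detections: List[Detection]) -> List[Detection]:
    """Keep, for each time portion, the candidate with the highest approximation."""
    chosen: List[Detection] = []
    for d in sorted(detections, key=lambda d: d.approx, reverse=True):
        if all(not overlaps(d, c) for c in chosen):
            chosen.append(d)
    return sorted(chosen, key=lambda d: d.start)


# Hypothetical time spans: w1 = 0.0-0.4 s, w2 = 0.5-0.9 s.
lattice = [
    Detection("tomorrow", 0.8, 0.0, 0.4),
    Detection("weather", 0.9, 0.5, 0.9),
    Detection("what time", 0.6, 0.5, 0.9),
]
recognized = [d.word for d in select_recognized(lattice)]
```

Here "what time" is discarded because it competes with "weather" in the w2 portion and has the lower degree of approximation, leaving "tomorrow" and "weather" as the recognized keywords.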
[0033]
The voice understanding conversation control unit 14 mainly consists of an arithmetic unit (CPU) and a ROM that stores a processing program. It receives the word detection data from the word detection unit 13, recognizes the voice based on that data (that is, understands the meaning of the input voice as a whole), refers to the conversation content storage unit 22 of the cartridge unit 20 to determine the response content corresponding to the meaning of the input voice, and, referring to the response data instruction content storage unit 23, sends a signal specifying what kind of speech synthesis output is to be produced to the voice synthesis unit 15 (which mainly consists of a CPU and a ROM).
[0034]
For example, detection data (referred to as a word lattice) as shown in FIGS. 3B to 3E is input from the word detection unit 13. The word lattice contains, for each entry, the registered word name, the degree of approximation, and the start point s and end point e of the word. When the word lattice is input, one or more words serving as keywords in the input voice are first determined based on it. In this example, the input voice is "Tomorrow's weather is ...", so "tomorrow" and "weather" are detected, and from the keywords "tomorrow" and "weather" the content of the continuous input voice "Tomorrow's weather is ..." is understood. The response content corresponding to it is then selected and output. In this case the response is something like "Tomorrow's weather is fine". This is determined using state detection means not shown here (a temperature detection unit, an atmospheric pressure detection unit, a calendar unit, a timekeeping unit, and the like). For example, for information about the weather, the change in weather is judged from the change in atmospheric pressure reported by the atmospheric pressure detection unit; if the pressure is rising, the corresponding response content is read from the response data instruction content storage unit 23. Responses regarding temperature, time, date, and the like are possible in the same way.
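The weather example above can be sketched as a small decision function. The specification states only that a rising atmospheric pressure leads to a response such as "Tomorrow's weather is fine"; the function name, the pressure units, the comparison of first and last readings, and the two fallback responses are assumptions added to make the example complete.

```python
from typing import List, Optional


def decide_weather_response(keywords: List[str],
                            pressure_hpa: List[float]) -> Optional[str]:
    """Decide a response for a 'tomorrow' + 'weather' query from pressure readings."""
    if not {"tomorrow", "weather"} <= set(keywords):
        return None  # not a weather query; other handlers would take over
    if len(pressure_hpa) < 2:
        # Assumed fallback: no trend can be judged from fewer than two readings.
        return "I am not sure about tomorrow's weather"
    if pressure_hpa[-1] > pressure_hpa[0]:
        # From the specification: rising pressure selects the "fine" response.
        return "Tomorrow's weather is fine"
    # Assumed response for steady or falling pressure (not in the specification).
    return "Tomorrow's weather may get worse"
```

For instance, `decide_weather_response(["tomorrow", "weather"], [1008.0, 1012.0])` selects the "fine" response because the pressure has risen.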
[0035]
Further, the speech recognition processing described above, which approximates continuous speech recognition by means of keyword spotting, can be applied not only to Japanese but also to other languages. For example, if the language used is English, the registered recognizable words might be “good morning”, “time”, “tomorrow”, “good night”, and so on, and the feature data of these registered words is stored in the standard voice feature data storage unit 21. When the speaker asks “what time is it now”, the word “time” is the keyword in that phrase; if the word “time” is present in the input voice, the detection signal waiting for the voice signal of “time” rises at the “time” portion of the input voice. When the detection data (word lattice) is input from the word detection unit 13, one or more words serving as keywords in the input voice are first determined based on the word lattice. In this example, the input voice is “what time is it now”, so “time” is detected as the keyword, and based on this keyword the content of the continuous input voice “what time is it now” is understood.
[0036]
A separate CPU may be provided for each of the speech analysis, word detection, voice understanding conversation control, speech synthesis, and other functions described above. Alternatively, a single main CPU may be provided to perform all of these processes, so that the one main CPU carries out the entire processing of the present invention.
[0037]
In such a configuration, suppose for example that the mounted cartridge is intended for infants and that an infant says “good morning”. The input voice is analyzed by the voice analysis unit 12 and sent to the word detection unit 13. The word detection unit 13 then performs the processing described above based on the feature data stored in the standard voice feature data storage unit 21 and outputs word detection data (a word lattice) for the input voice. Because the cartridge unit 20 is for infants, the contents of the standard voice feature data storage unit 21 are standard voice feature data obtained from the voice characteristics of infants, so the infant's voice can be recognized at a high rate.
[0038]
The voice understanding conversation control unit 14, on receiving the word lattice from the word detection unit 13, refers to the conversation content storage unit 22 of the cartridge unit 20 based on the word lattice, understands that the input voice is “good morning”, obtains the corresponding response content, and then refers to the response data instruction content storage unit 23. The voice synthesis unit 15 then generates a speech synthesis output based on the information obtained from the response data instruction content storage unit 23, and the voice output unit 16 outputs it as the response. Because the response in this case is directed at an infant, the device answers, for example, “good morning” in a manner of speaking suited to infants.
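The lookup sequence just described (conversation content first, then the synthesis instruction) can be sketched as below. The function name `respond`, the dictionary representation of storage units 22 and 23, and the instruction string are hypothetical; the specification does not define a data format for the cartridge contents.

```python
from typing import Dict, List, Optional, Tuple


def respond(detected_words: List[str],
            conversation_content: Dict[str, str],
            response_instructions: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (response content, synthesis instruction) for the first known word."""
    for word in detected_words:
        if word in conversation_content:             # look up storage unit 22
            content = conversation_content[word]
            instruction = response_instructions.get(  # look up storage unit 23
                word, "default voice")
            return content, instruction
    return None  # no registered word was detected


# Hypothetical contents of an infant cartridge.
infant_conversation = {"good morning": "good morning"}
infant_instructions = {"good morning": "infant-directed speaking style"}

result = respond(["good morning"], infant_conversation, infant_instructions)
```

Swapping in dictionaries from a different cartridge changes both the response content and the voice instruction without touching `respond`, which mirrors how the main body stays the same while the cartridge varies.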
[0039]
In this way, by attaching a cartridge unit 20 arbitrarily selected by the user to the stuffed animal 1, conversation according to the contents of that cartridge unit 20 becomes possible. For example, if a cartridge for infants as described above is attached, conversation suited to infants can be carried out, and if a cartridge for elementary school students is attached, conversation suited to them can be carried out. A variety of such cartridges can be prepared for different ages and genders. Finer classification is also possible, for example cartridges for infant boys, for infant girls, for boys in the lower grades of elementary school, and for girls in the lower grades of elementary school.
[0040]
Thus, even with only one apparatus main body (in this case, a stuffed toy), changing the cartridge makes it possible to carry out conversation suited to various ages or genders. In this case, because the standard voice feature data storage unit 21, the conversation content storage unit 22, and the response data instruction content storage unit 23 are all incorporated in the cartridge unit 20, the recognizable words and their standard voice feature data, the response contents for the recognizable words, and the voice quality can all differ from cartridge to cartridge. Increasing the variety of cartridges therefore allows conversation tailored to a correspondingly wide range of ages and genders.
[0041]
(Second embodiment)
In the first embodiment described above, the standard voice feature data storage unit 21, the conversation content storage unit 22, and the response data instruction content storage unit 23 are all provided in the cartridge unit 20. However, the cartridge need not hold all three of these elements. For example, only the standard voice feature data storage unit 21 may be provided in the cartridge unit 20, or only the response data instruction content storage unit 23 may be provided in the cartridge unit 20; various combinations are conceivable.
[0042]
Three cases are described here: the case where only the standard voice feature data storage unit 21 is provided in the cartridge unit 20, the case where the conversation content storage unit 22 and the response data instruction content storage unit 23 are provided in the cartridge unit 20, and the case where only the response data instruction content storage unit 23 is provided in the cartridge unit 20.
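The three partial configurations listed above differ only in which storage units come from the cartridge and which are built into the main body. One way to picture this is a resolver that takes each storage unit from the cartridge when the cartridge supplies it and otherwise from the main body. This is a sketch under assumptions: the key names, the dictionary representation, and the rule that a cartridge-supplied unit takes precedence are not stated in the specification, where each configuration simply places each unit on one side only.

```python
from typing import Any, Dict, Tuple

# Hypothetical keys for storage units 21, 22, and 23.
STORAGE_UNITS = ("features_21", "conversation_22", "instructions_23")


def resolve_storage(main_body: Dict[str, Any],
                    cartridge: Dict[str, Any]) -> Dict[str, Tuple[str, Any]]:
    """Map each storage unit to (where it lives, its contents)."""
    resolved = {}
    for unit in STORAGE_UNITS:
        if unit in cartridge:
            resolved[unit] = ("cartridge", cartridge[unit])
        elif unit in main_body:
            resolved[unit] = ("main body", main_body[unit])
        else:
            raise KeyError(f"storage unit {unit} is provided on neither side")
    return resolved


# Configuration in which only the standard voice feature data is on the cartridge.
main_body = {"conversation_22": {"good morning": "good morning"},
             "instructions_23": {"good morning": "default voice"}}
infant_cartridge = {"features_21": {"good morning": [0.2, 0.1]}}

layout = {unit: place for unit, (place, _) in
          resolve_storage(main_body, infant_cartridge).items()}
```

In this configuration `layout` shows the feature data coming from the cartridge and the other two units coming from the main body; the other two cases follow by moving keys between the two dictionaries.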
[0043]
First, the case where only the standard voice feature data storage unit 21 is provided in the cartridge unit 20 will be described. FIG. 4 is a block diagram showing this configuration, and the same parts as those in FIG. 2 are denoted by the same reference numerals. In this case, the conversation content storage unit 22 and the response data instruction content storage unit 23 are provided on the voice recognition response processing unit 10 side. The conversation content storage unit 22 may be provided inside the voice understanding conversation control unit 14, but here it is shown as a separate unit.
[0044]
When only the standard voice feature data storage unit 21 is provided on the cartridge unit 20 side in this way, and the conversation content storage unit 22 and the response data instruction content storage unit 23 are on the device (stuffed toy 1) side, the kind of response made to each registered word is determined in advance on the device side. In this case, therefore, the registered words are limited to the words predetermined on the device side (for example, about 10 words such as "good morning", "hello", and "good night", as described above), but various standard voice feature data for each of those registered words, matched to age and gender, can be prepared cartridge by cartridge. For example, cartridges are prepared that store, for each registered word, standard voice feature data for a particular age and gender: a cartridge storing the standard voice feature data of infants, a cartridge storing that of elementary school students, a cartridge storing that of adult women, a cartridge storing that of adult men, and so on.
[0045]
With this arrangement, when the device is used by an infant, for example, the cartridge storing the voice feature data for infants is selected and mounted on the apparatus main body. The input voice can then be compared with standard voice feature data obtained from the voice characteristics of infants, so it is recognized at a high recognition rate. A predetermined response content is then voice-synthesized and output for the recognized word. Selecting the standard voice feature data cartridge according to age and gender in this way can significantly improve the recognition rate.
[0046]
Next, the case where the conversation content storage unit 22 and the response data instruction content storage unit 23 are provided in the cartridge unit 20 will be described. FIG. 5 is a block diagram showing this configuration, and the same parts as those in FIG. 2 are denoted by the same reference numerals. In this case, the standard voice feature data storage unit 21 is provided on the voice recognition response processing unit 10 side, and the conversation content storage unit 22 and the response data instruction content storage unit 23 are provided on the cartridge unit 20 side.
[0047]
When only the conversation content storage unit 22 and the response data instruction content storage unit 23 are provided on the cartridge unit 20 side in this way, and the standard voice feature data storage unit 21 is provided on the voice recognition response processing unit 10 side, the registered words are limited to the words predetermined on the device side (for example, about 10 words such as "good morning", "hello", and "good night", as described above), but the response content for each registered word and the speech synthesis output (voice quality and the like) for each response content can be prepared in various forms, cartridge by cartridge. For example, what kind of response is given to the word "good morning", and what kind of speech synthesis output is used for that response, are determined in advance and stored, cartridge by cartridge, in the conversation content storage unit 22 and the response data instruction content storage unit 23. Specifically, a cartridge for infants responds to registered words such as "good morning" and "good night" with response contents and a voice quality suited to infants, while a cartridge for elementary school students responds to the registered words with response contents suited to elementary school students, for example imitating the way a TV anime character speaks. In this way, a variety of cartridges are prepared that respond with response contents and voice quality matched to each age and gender.
[0048]
With this arrangement, when an elementary school student uses the device, for example, the cartridge for elementary school students described above is selected and attached to the device. When the student then speaks one of the registered words, the device can return the response content set in that cartridge, in a voice quality imitating the way a TV anime character speaks as described above.
[0049]
Next, the case where only the response data instruction content storage unit 23 is provided on the cartridge unit 20 side will be described. FIG. 6 is a block diagram showing this configuration, and the same parts as those in FIG. 2 are denoted by the same reference numerals. In this case, the standard voice feature data storage unit 21 and the conversation content storage unit 22 are provided on the voice recognition response processing unit 10 side of the apparatus main body, and only the response data instruction content storage unit 23 is provided on the cartridge unit 20 side.
[0050]
When only the response data instruction content storage unit 23 is provided on the cartridge unit 20 side, and the conversation content storage unit 22 and the standard voice feature data storage unit 21 are provided on the voice recognition response processing unit 10 side, the registered words are limited to the words predetermined on the device side (for example, about 10 words such as "good morning", "hello", and "good night"), and the response content for these registered words is basically the content preset on the device side. However, the speech synthesis output used for the response content can be set in various ways, cartridge by cartridge. For example, the voice quality with which the response to the word "good morning" is output is determined in advance and stored in the response data instruction content storage unit 23 of each cartridge. Specifically, a cartridge for infants stores, for registered words such as "good morning" and "good night", instructions to respond in a voice like a mother's, or instructions to respond in a voice resembling a TV anime character for infants. A cartridge for elementary school students stores instructions to respond to the registered words in a voice imitating the way a TV anime character for elementary school students speaks. In this way, various cartridges are prepared that store, for each registered word, contents specifying what kind of speech synthesis output is to be produced for each age and gender. In all of these cases, the content of the response to each registered word is basically the content set in advance.
[0051]
In this way, for example, when used by elementary school children, by selecting a cartridge for elementary school children as described above and attaching it to the device, if the elementary school student speaks any registered word, the response content is basically Although the content is set in advance, it is possible to return a response in a voice imitating the manner of speaking a TV anime character for elementary school children as described above.
[0052]
As described above, according to the present invention, ROM parts such as the standard voice feature data storage unit 21, the conversation content storage unit 22, and the response data instruction content storage unit 23 are provided in cartridge form, and various cartridges holding standard voice feature data, response contents, and the like suited to the age and gender of the user are prepared, so that the user can select a cartridge arbitrarily. Therefore, even with only one recognition response device, replacing the cartridge allows the device to be used by a wide range of people, and voice recognition and conversation suited to the user can be performed.
[0053]
In each of the above embodiments, the present invention has been described as applied to a stuffed toy, but it is not limited to stuffed toys. The present invention can of course be applied not only to toys but also to game machines, various everyday electronic devices, and the like, and its range of application is considered to be extremely wide.
[0054]
[Effects of the Invention]
As described above, according to the speech recognition dialogue device of the first aspect of the present invention, storage means that stores content set in advance for performing voice recognition and for outputting a response to the recognized voice is provided on a cartridge that is detachably mounted on the device main body, and when the cartridge is mounted on the device main body, the response content for the input voice is output based on the data stored in the cartridge. Therefore, even with a single device main body, changing the cartridge enables dialogue suited to various ages or genders. For example, when the present invention is applied to a toy, a cartridge can be selected according to the child's growth, and the recognizable words and the response content can be varied by the choice of cartridge. As a result, the child is less likely to grow tired of even a single toy, the toy can be used over a long period, and it can be used widely regardless of age or gender. The invention is applicable not only to toys but also to electronic devices; by selecting a cartridge suited to the user, a wide variety of conversations become possible and various operations can be performed accordingly, so the effect is extremely large.
[0055]
According to the second aspect, the standard voice feature data storage means, which stores the standard voice feature data of pre-registered recognizable words, is provided on the cartridge side. Various sets of standard voice feature data can therefore be prepared, one for each cartridge, and the recognition rate can be significantly improved by selecting and using a cartridge suited to the user.
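The effect of matching the standard voice feature data to the user may be illustrated with the following Python sketch. The matching method here (nearest template by Euclidean distance with a rejection threshold) is an assumption chosen for brevity and is not stated in this section; the function detect_word, the two-dimensional feature vectors, and the threshold value are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def detect_word(
    features: Sequence[float],
    templates: Dict[str, Sequence[float]],
    max_distance: float = 1.0,
) -> Optional[str]:
    """Return the registered word whose template is nearest, or None if none is close enough."""
    best_word: Optional[str] = None
    best_dist = math.inf
    for word, template in templates.items():
        dist = math.dist(features, template)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word if best_dist <= max_distance else None


# Toy templates standing in for standard voice feature data on two cartridges.
child_templates = {"good morning": [1.0, 0.2], "good night": [0.1, 0.9]}
adult_templates = {"good morning": [3.0, 2.0], "good night": [2.0, 3.5]}

# Toy feature vector standing in for a child's utterance of "good morning".
child_utterance = [0.9, 0.2]
```

In this toy setting, the child's utterance is detected with the child cartridge's templates but rejected with the adult cartridge's templates, which is the kind of mismatch that selecting a cartridge suited to the user is meant to avoid.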
[0056]
According to the third aspect, the conversation content storage means, which stores response content corresponding to pre-registered recognizable words, and the response data instruction content storage means, which stores instruction content specifying what kind of speech synthesis output is to be generated, are provided on the cartridge side. Various cartridges having response content suited to each age and gender can therefore be prepared. For example, if a cartridge for children is selected, then when a registered word is spoken, the device can respond with response content for children and in a voice for children.
[0057]
According to the fourth aspect, the response data instruction content storage means, which stores instruction content specifying what kind of speech synthesis output is to be generated, is provided on the cartridge side. Various cartridges can therefore be prepared, each storing instructions on what kind of speech synthesis output is to be produced for each registered word according to age and gender. For example, if a cartridge for elementary school students is selected, then when a registered word is spoken, the response content is basically the preset content, but the response can be returned in a voice imitating the way a TV anime character for elementary school students speaks, as described above.
[0058]
According to the fifth aspect, the standard voice feature data storage means, which stores standard voice feature data of pre-registered recognizable words, the conversation content storage means, which stores response content corresponding to the registered recognizable words, and the response data instruction content storage means, which stores instructions on what kind of speech synthesis output is to be generated, are all provided on the cartridge side. Standard voice feature data for each registered word, suited to age and gender, can therefore be prepared for each cartridge, and the recognition rate can be significantly improved by selecting and using a cartridge suited to the user. In addition, the recognizable registered words can be set for each cartridge, so the variety of conversations can be greatly increased. Furthermore, various cartridges having response content and speech synthesis output suited to each age and gender can be prepared. As a result, even with a single device main body, changing the cartridge enables a wide variety of dialogues suited to various ages or genders.
[0059]
According to the speech recognition dialogue processing method of the present invention, content set in advance for performing speech recognition and for outputting a response to the recognized speech is stored on a cartridge that can be detachably attached to the device main body, and when the cartridge is attached to the device main body, the response content for the input voice is generated based on the data stored in the cartridge. Therefore, even with a single device main body, changing the cartridge enables dialogue suited to various ages or genders. For example, when the present invention is applied to a toy, a cartridge can be selected according to the child's growth, and the recognizable words and the response content can be varied by the choice of cartridge. As a result, the child is less likely to grow tired of even a single toy, the toy can be used over a long period, and it can be used widely regardless of age or gender. The method is applicable not only to toys but also to electronic devices; by selecting a cartridge suited to the user, a wide variety of conversations become possible and various operations can be performed accordingly, so the scope of application is extremely wide.
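The sequence of steps in the processing method (voice analysis, word detection, determination of the response content, and speech synthesis) may be sketched in Python as a chain of functions. This is a toy sketch under stated assumptions: the feature extraction in analyze, the nearest-template rule in detect, the dictionary layout of the cartridge data, and all names are hypothetical stand-ins rather than the disclosed processing.

```python
from typing import Dict, Sequence, Tuple

Features = Tuple[float, float]


def analyze(samples: Sequence[float]) -> Features:
    """Voice analysis step: a toy feature pair (mean magnitude, peak magnitude)."""
    mags = [abs(s) for s in samples]
    return (sum(mags) / len(mags), max(mags))


def detect(features: Features, standard_features: Dict[str, Features]) -> str:
    """Word detection step: choose the registered word with the closest template."""
    def squared_distance(word: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, standard_features[word]))
    return min(standard_features, key=squared_distance)


def decide_response(word: str, conversation: Dict[str, str]) -> str:
    """Conversation control step: look up the response content for the detected word."""
    return conversation[word]


def synthesize(text: str, instruction: str) -> str:
    """Speech synthesis step: tag the text with the voice instruction instead of producing audio."""
    return f"<{instruction}> {text}"


def process(samples: Sequence[float], cartridge: dict) -> str:
    """Run all steps in order, reading every table from the attached cartridge."""
    features = analyze(samples)
    word = detect(features, cartridge["standard_features"])
    text = decide_response(word, cartridge["conversation"])
    return synthesize(text, cartridge["voice_instruction"])


cartridge = {
    "standard_features": {"good morning": (0.5, 1.0), "good night": (0.1, 0.2)},
    "conversation": {"good morning": "Good morning!", "good night": "Sleep well!"},
    "voice_instruction": "child voice",
}
```

Because process reads all three tables from the cartridge argument, passing a different cartridge changes the vocabulary, the replies, and the voice without changing any of the step functions.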
[0060]
According to the seventh aspect, the standard voice feature data of pre-registered recognizable words is stored on the cartridge side. Standard voice feature data for each registered word, suited to age and gender, can therefore be prepared for each cartridge, and the recognition rate can be significantly improved by selecting and using a cartridge suited to the user.
[0061]
According to the eighth aspect, the response content corresponding to pre-registered recognizable words and the instruction content specifying what kind of speech synthesis output is to be generated are stored on the cartridge side. Various cartridges having response content suited to each age and gender can therefore be prepared. For example, if a cartridge for children is selected, then when a registered word is spoken, the response can be given with response content for children and in a voice for children.
[0062]
According to the ninth aspect, the instruction content specifying what kind of speech synthesis output is to be generated is stored on the cartridge side. Various cartridges can therefore be prepared, each storing instructions on what kind of speech synthesis output is to be produced for each registered word according to age and gender. For example, if a cartridge for elementary school students is selected, then when a registered word is spoken, the response content is basically the preset content, but the response can be returned in a voice imitating the way a TV anime character for elementary school students speaks, as described above.
[0063]
According to the tenth aspect, the standard voice feature data of pre-registered recognizable words, the response content corresponding to the registered recognizable words, and the response data instruction content specifying what kind of speech synthesis output is to be generated are all stored on the cartridge side. Standard voice feature data for each registered word, suited to age and gender, can therefore be prepared for each cartridge, and the recognition rate can be greatly improved by selecting and using a cartridge suited to the user. In addition, the recognizable registered words can be set for each cartridge, so the variety of conversations can be greatly increased. Furthermore, various cartridges having response content and speech synthesis output suited to each age and gender can be prepared. As a result, even with a single device main body, changing the cartridge enables a wide variety of dialogues suited to various ages or genders.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating an outline of the present invention.
FIG. 2 is a block diagram illustrating a first embodiment of the present invention.
FIG. 3 is a diagram illustrating word detection processing by a word detection unit and speech recognition processing by a speech understanding conversation control unit.
FIG. 4 is a block diagram illustrating a second embodiment (part 1) of the present invention.
FIG. 5 is a block diagram illustrating a second embodiment (part 2) of the present invention.
FIG. 6 is a block diagram illustrating a second embodiment (part 3) of the present invention.
[Explanation of symbols]
1: stuffed toy (device main body)
10: voice recognition response processing unit
11: voice input unit
12: voice analysis unit
13: word detection unit
14: voice understanding conversation control unit
15: voice synthesis unit
16: voice output unit
20: cartridge unit
21: standard voice feature data storage unit
22: conversation content storage unit
23: response data instruction content storage unit

Claims

A speech recognition dialogue device that analyzes an input voice to generate voice feature data, compares the voice feature data with standard voice feature data of pre-registered recognizable words to output word detection data, receives the word detection data to understand the meaning of the input voice, and determines and outputs response content corresponding to that meaning,
characterized in that storage means for storing data set in advance for performing voice recognition, data set in advance for outputting a response to the recognized voice, and the like is provided on a cartridge that is detachably mountable on the device main body, and in that, when the cartridge is mounted on the device main body, the cartridge is connected to a voice recognition response processing unit provided on the device main body side, and response content for the input voice is output based on the data stored in the cartridge.

The speech recognition dialogue device according to claim 1, characterized in that the storage means provided on the cartridge side is standard voice feature data storage means for storing standard voice feature data for pre-registered recognizable words,
and in that the device main body side comprises: voice input means for inputting voice; voice analysis means for analyzing the voice input by the voice input means to generate voice feature data; word detection means for receiving the voice feature data from the voice analysis means and outputting word detection data for the input voice based on the content stored in the standard voice feature data storage means provided on the cartridge side; speech understanding conversation control means for receiving the word detection data from the word detection means, understanding the meaning of the input voice, and determining the corresponding response content; response data instruction content storage means for storing instruction content used to generate a speech synthesis output for the response content determined by the speech understanding conversation control means; speech synthesis means for generating a speech synthesis output for the response content determined by the speech understanding conversation control means, based on the content of the response data instruction content storage means; and voice output means for outputting the speech synthesis output from the speech synthesis means to the outside.

The speech recognition dialogue device according to claim 1, characterized in that the storage means provided on the cartridge side comprises conversation content storage means for storing response content corresponding to pre-registered recognizable words, and response data instruction content storage means for storing instruction content specifying what kind of speech synthesis output is to be generated,
and in that the device main body side comprises: voice input means for inputting voice; voice analysis means for analyzing the voice input by the voice input means to generate voice feature data; standard voice feature data storage means for storing standard voice feature data for pre-registered recognizable words; word detection means for outputting word detection data for the input voice based on the content stored in the standard voice feature data storage means; speech understanding conversation control means for receiving the word detection data from the word detection means, understanding the meaning of the input voice, and determining the corresponding response content with reference to the conversation content storage means provided on the cartridge side; speech synthesis means for generating a speech synthesis output for the determined response content, based on the content of the response data instruction content storage means provided on the cartridge side; and voice output means for outputting the speech synthesis output from the speech synthesis means to the outside.

The speech recognition dialogue device according to claim 1, characterized in that the storage means provided on the cartridge side is response data instruction content storage means for storing instruction content specifying what kind of speech synthesis output is to be generated,
and in that the device main body side comprises: voice input means for inputting voice; voice analysis means for analyzing the voice input by the voice input means to generate voice feature data; standard voice feature data storage means for storing standard voice feature data for pre-registered recognizable words; word detection means for outputting word detection data for the input voice based on the content stored in the standard voice feature data storage means; speech understanding conversation control means for receiving the word detection data from the word detection means, understanding the meaning of the input voice, and determining the corresponding response content; speech synthesis means for generating a speech synthesis output for the response content determined by the speech understanding conversation control means, based on the content of the response data instruction content storage means provided on the cartridge side; and voice output means for outputting the speech synthesis output from the speech synthesis means to the outside.

The speech recognition dialogue device according to claim 1, characterized in that the storage means provided on the cartridge side comprises standard voice feature data storage means for storing standard voice feature data for pre-registered recognizable words, conversation content storage means for storing response content corresponding to the registered recognizable words, and response data instruction content storage means for storing instruction content specifying what kind of speech synthesis output is to be generated,
and in that the device main body side comprises: voice input means for inputting voice; voice analysis means for analyzing the voice input by the voice input means to generate voice feature data; word detection means for receiving the voice feature data from the voice analysis means and outputting word detection data for the input voice based on the content stored in the standard voice feature data storage means provided on the cartridge side; speech understanding conversation control means for receiving the word detection data from the word detection means, understanding the meaning of the input voice, and determining the corresponding response content with reference to the conversation content storage means provided on the cartridge side; speech synthesis means for generating a speech synthesis output for the response content determined by the speech understanding conversation control means, based on the content of the response data instruction content storage means provided on the cartridge side; and voice output means for outputting the speech synthesis output from the speech synthesis means to the outside.

A speech recognition dialogue processing method in which an input voice is analyzed to generate voice feature data, the voice feature data is compared with standard voice feature data of pre-registered recognizable words to output word detection data, the word detection data is received to understand the meaning of the input voice, and a response corresponding to that meaning is determined and output,
characterized in that data set in advance for performing voice recognition, data set in advance for outputting a response to the recognized voice, and the like are stored in storage means on a cartridge that can be detachably attached to the device main body, and in that, when the cartridge is mounted on the device main body, a response to the input voice is output based on the data stored in the cartridge.

The speech recognition dialogue processing method according to claim 6, characterized in that the content stored on the cartridge side is standard voice feature data for pre-registered recognizable words,
and in that the device main body side performs: a voice analysis step of analyzing voice input by voice input means to generate voice feature data; a word detection step of receiving the voice feature data from the voice analysis step and outputting word detection data for the input voice based on the standard voice feature data stored on the cartridge side; a speech understanding conversation control step of receiving the word detection data from the word detection step, understanding the meaning of the input voice, and determining the corresponding response content; a speech synthesis step of performing speech synthesis for the response content determined in the speech understanding conversation control step, based on response data specifying what kind of speech synthesis output is to be produced; and a voice output step of outputting the speech synthesis output from the speech synthesis step to the outside.

The speech recognition dialogue processing method according to claim 6, characterized in that the content stored on the cartridge side is response content corresponding to pre-registered recognizable words and response data instruction content specifying what kind of speech synthesis output is to be generated,
and in that the device main body side performs: a voice analysis step of analyzing voice input by voice input means to generate voice feature data; a word detection step of outputting word detection data for the input voice based on standard voice feature data of pre-registered recognizable words; a speech understanding conversation control step of receiving the word detection data from the word detection step, understanding the meaning of the input voice, and determining the corresponding response content with reference to the conversation content stored on the cartridge side; a speech synthesis step of generating a speech synthesis output for the response content determined in the speech understanding conversation control step, based on the response data instruction content stored on the cartridge side; and a voice output step of outputting the speech synthesis output from the speech synthesis step to the outside.

The speech recognition dialogue processing method according to claim 6, characterized in that the content stored on the cartridge side is instruction content specifying what kind of speech synthesis output is to be generated,
and in that the device main body side performs: a voice analysis step of analyzing voice input by voice input means to generate voice feature data; a word detection step of outputting word detection data for the input voice based on standard voice feature data of pre-registered recognizable words; a speech understanding conversation control step of receiving the word detection data from the word detection step, understanding the meaning of the input voice, and determining the corresponding response content; a speech synthesis step of generating a speech synthesis output for the response content determined in the speech understanding conversation control step, based on the response data instruction content stored on the cartridge side; and a voice output step of outputting the speech synthesis output from the speech synthesis step to the outside.

The speech recognition dialogue processing method according to claim 6, characterized in that the content stored on the cartridge side is standard voice feature data for pre-registered recognizable words, response content corresponding to the registered recognizable words, and response data instruction content specifying what kind of speech synthesis output is to be generated,
and in that the device main body side performs: a voice analysis step of analyzing voice input by voice input means to generate voice feature data; a word detection step of receiving the voice feature data from the voice analysis step and outputting word detection data for the input voice based on the standard voice feature data stored on the cartridge side; a speech understanding conversation control step of receiving the word detection data from the word detection step, understanding the meaning of the input voice, and determining the corresponding response content with reference to the conversation content stored on the cartridge side; a speech synthesis step of generating a speech synthesis output for the response content determined in the speech understanding conversation control step, based on the response data instruction content stored on the cartridge side; and a voice output step of outputting the speech synthesis output from the speech synthesis step to the outside.