JP2004145237A

JP2004145237A - Interactive doll with translating function

Info

Publication number: JP2004145237A
Application number: JP2002349128A
Authority: JP
Inventors: W Miles Paul Jr; ポール　ダブリュ．マイルズ，ジュニア; Masao Nakamura; 中村　政生; Koichi Naka; 仲　晃一
Original assignee: ICD KK
Current assignee: ICD KK
Priority date: 2002-10-25
Filing date: 2002-10-25
Publication date: 2004-05-20

Abstract

PROBLEM TO BE SOLVED: To provide an interactive doll with a translating function that has a speech translation system and can be used to learn basic foreign-language conversation by converting speech uttered by a user into speech in a different language and voicing it.

SOLUTION: The interactive doll with the translating function recognizes speech in a 1st language uttered by the user and converts it into speech in a 2nd language. It then interacts with the user by voicing either the converted speech in the 2nd language or previously stored speech corresponding to the speech in the 1st language.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、対話式人形に関し、より詳しくは、人形内に音声翻訳システムを設け、使用者の発した音声を異なる言語に基づく音声に変換して発声することにより、使用者の遊び心を満足させ、かつ初歩的な外国語会話の学習に供することのできる翻訳機能付対話式人形に関する。すなわち、本発明は、従来にはなかった翻訳機能付対話式人形に関する。
【０００２】
【従来の技術】
一般的に、子供は興味ある遊びやおもちゃにより生活教育を学習する傾向があり、そのおもちゃとの親密な触れ合いは、実社会へ導く模倣学習を実行するものである。このような模倣学習の大部分は、おもちゃ人形を通じて行われ、例えば、子供は自分で模倣学習のシナリオを作り、そのシナリオにしたがってそのおもちゃ人形に適切な反応を誘導する。すなわち、子供は、適切な音声表現および動作行為を双方向対話型に面白く進行することにより、その模倣学習に没頭するのである。
【０００３】
このようなおもちゃ人形による教育は、昔から子供に密着したものとして引き継がれており、最近では、その教育的効果を期待して発声人形の研究が活発になり、より進歩的なおもちゃ人形の製作が試みられている。このような従来の技術によるおもちゃ人形のほとんどにおいては、タッチセンサが所定の位置に設けられており、子供がこのタッチセンサを動作させると、磁気記録媒体（磁気テープ）や半導体記録媒体（ＩＣメモリ）に記録させた簡単な文章、例えば「こんにちは。」、「私は○○○です。」、「あなたは誰ですか。」、および「あなたは何が好きですか。」などの音声が発声される。例えば、特許文献１には、子供に興味を持たすことができ、さらには会話に対する興味を持たすことのできる発声人形が開示されている。
【０００４】
【特許文献１】
登録実用新案第２５６２４３９号公報
【０００５】
具体的には、上記特許文献１には、前記目的を達成するため、人形の胴体部に頭部を取り外し可能に取着するとともに、外部からの操作を検出する検出手段、複数の語彙を記憶する記憶手段、複数の語彙から任意の語彙を選択する選択手段、選択された語彙から音声を合成する音声合成手段及び合成された音声を発声させる発声手段を収容した筐体を上記胴体部内に配置するとともに詰め物で安定させ、筐体の上部に突出して形成した電池収容部を胴体部から上方に突出させるとともに、該電池収容部に上記頭部を嵌合させた発声人形が開示されている。
【０００６】
上記特許文献１記載のおもちゃ人形は、単発的で、簡単な文章を話す人形であり、タッチセンサの動作によって、シナリオのない単純な文章が録音された音声を聞かせるため、子供の好奇心を一時的に誘発することはできる。しかし、直ぐに子供は飽きてしまい、実際にこのようなおもちゃ人形と遊ぶ期間が短くなるため、教育的な効果が低いという問題がある。また、従来のおもちゃ人形が話す音声文章は、対話型のシナリオでなく不連続的な文章の羅列であり、現実味に乏しい。このことからも、その教育的効果が低いという問題がある。
【０００７】
これに対し、例えば特許文献２においては、かかる問題を解決するため、話題に応じた音声出力を可能にし、子供が行う可能性のある行動パターンをシナリオに作成して記録させ、任意に設定された状況に応じて人形と双方向の対話を可能とするおもちゃ人形が開示されている。例えば、子供と対話する状況で、多様なシナリオに導くため、音声圧縮用ソフトウェアで音声を圧縮した後、メモリ部に記録させ、必要時に速やかに取り出し、一つの話題においても、選択可能な状況に応じて直ちに質疑応答が可能であるとされている。具体的に、上記特許文献２には、人と動物の形態が混合した形状に形成された人形本体に、多数の文章のデジタル音声信号ストリームが所定の圧縮率で圧縮された音声圧縮データを記録している第１メモリ部と、外部から入力された使用者の音声信号を認識するための演算エリアが備えられている第２メモリ部とを備えた音声認識対話型人形おもちゃが開示されている。
【０００８】
【特許文献２】
特許第３１６４３４６号明細書
【０００９】
確かに、上記特許文献２記載の人形おもちゃは、使用者の会話に応じて音声を発声するものである。しかし、使用者の発声する音声の言語、ならびに発声人形および人形おもちゃの発声する音声の言語の種類については一切触れられていない。これは、上記特許文献１も同じである。そして、上記特許文献２に係る発明の課題および効果からすると、日本人の幼児が日本語で発声し、それを認識した人形おもちゃが日本語で音声を発声する場合を想定しているものと考えられる。
【００１０】
ここで、昨今の社会の国際化にともなって、我が国においては外国語教育の重要性が盛んに論じられているにもかかわらず、他の国に遅れをとっている。これは、学校教育はさることながら、一般家庭での幼少期における生活環境に問題があるものと考えられる。そうすると、上述のような発声人形や人形おもちゃに、外国語教育用のアイテムとしての機能があれば、外国語教育の進歩発展にも役立つことは明らかである。従来技術においては、かかる機能を有する人形おもちゃは提供されていない。
【００１１】
【発明が解決しようとする課題】
そこで、本発明は、人形内に音声翻訳システムを設け、使用者の発した音声を異なる言語に基づく音声に変換して発声することにより、使用者の遊び心を満足させ、かつ外国語会話の学習に供することのできる翻訳機能付対話式人形を提供することを目的とする。すなわち、本発明は、従来にはなかった翻訳機能付対話式人形を提供することを目的とする。
【００１２】
【課題を解決するための手段】
上記課題を解決すべく、本発明は、使用者による第一の言語に基づく第一の音声を受信して第二の言語に基づく第二の音声に変換し、ついで当該変換された第二の音声、または前記第一の音声に対応してあらかじめ記憶した第二の言語に基づく第三の音声を発声して前記使用者と対話することを特徴とする翻訳機能付対話式人形を提供する。
【００１３】
前記翻訳機能付対話式人形は、
前記第一の音声を受信する音声受信手段を有する耳部分、
前記第一の音声を前記第二の音声に変換する音声変換手段を有する部分、および前記第二の音声または前記第三の音声を発生する音声発声手段を有する口部分を具備するのが好ましい。
また、前記第一の音声および前記第二の音声が単語であるのが好ましい。
【００１４】
また、前記音声変換手段は、
（ａ）前記第一の音声を認識する音声認識手段、
（ｂ）認識された前記第一の音声を前記第二の言語に翻訳する音声翻訳手段、および
（ｃ）前記翻訳の結果に基づいて前記音声を合成する音声合成手段を具備するのが好ましい。
【００１５】
前記音声認識手段（ａ）は、前記第一の音声を音素列として認識するのが好ましい。
具体的には、前記音声認識手段（ａ）は、
前記第一の音声の音声信号を受信し、前記音声信号を対応する電気信号に変換するオーディオプロセッサ手段、
前記電気信号を所定のサンプリングレートでデジタル化し、デジタル化された音声信号を形成するアナログ／デジタル変換器手段、および
前記デジタル化された音声信号の細分化された複数部分に対する時間領域分析を行い、前記音声信号の複数の時間領域特性を識別する手段と、所定の高域および低域カットオフ周波数を有する複数のフィルタ帯域を用いて、前記細分化された各部分をフィルタリングし、前記細分化された各部分の少なくとも１つの周波数領域特性を識別する手段と、前記時間領域特性および周波数領域特性を処理して前記音声信号に含まれる音素を識別する手段とを含む音声音素識別手段を具備するのが好ましい。
【００１６】
また、前記音声翻訳手段（ｂ）は、認識された前記音素列を第二の言語に基づく語彙列に翻訳するのが好ましい。
また、前記音声合成手段（ｃ）は、前記語彙列をコンピュータ処理することにより前記第二の音声を合成するのが好ましい。
【００１７】
具体的には、前記音声合成手段（ｃ）は、
前記第二の言語に基づく語彙列を受信し、前記語彙列を第一の音素列に変換する音声変換サブシステム、
変形規則を受信して前記第一の音素列に適用し、第二の音素列を形成する音声変形器、
所定の基準に基づいて前記第二の音素列に含まれる音素に順位付けを行う評価器、および
前記第二の音素列を受信し、前記順位付けを用いて前記第二の音素列に含まれる音素を音節に分解する音節分解器を具備するのが好ましい。
【００１８】
さらに、前記翻訳機能付対話式人形においては、
前記音声受信手段が前記使用者による第一の言語に基づくキーワードを受信し、前記音声認識手段（ａ）が前記キーワードを認識し、前記音声発声手段が前記キーワードに対応してあらかじめ記憶した第二の言語に基づく質問を発声し、
その後、前記使用者による前記第一の音声を前記第二の音声に変換し、当該第二の音声、または前記第三の音声を発声して前記使用者と対話するのが好ましい。
【００１９】
このような翻訳機能付対話式人形においては、
前記音声認識手段（ａ）が、前記第一の音声の特定部分を認識し、
前記音声翻訳手段（ｂ）が、前記特定部分を第二の言語に基づく音声に翻訳し、
前記音声合成手段（ｃ）が、前記翻訳の結果を前記質問に対応してあらかじめ記憶した第二の言語に基づく音声回答パターンにあてはめ、前記第二の音声を合成することもできる。
【００２０】
この場合、前記音声翻訳手段（ｂ）が、
（ｂ−１）前記音声回答パターンと、前記質問に対応して前記特定部分を構成する語彙として予想される第二の言語に基づく語彙複数個とを記憶する記憶手段、および
（ｂ−２）前記音声認識手段（ａ）が認識した前記特定部分を構成する語彙に対応して、第二の言語に基づく語彙を選択する選択手段を具備し、
前記音声合成手段が（ｃ）が、選択された前記第二の言語に基づく語彙の音声を合成し、前記音声回答パターンにあてはめ、前記第二の音声を合成するのが好ましい。
【００２１】
また、前記翻訳機能付対話式人形は、さらに、前記第一の言語および前記第二の言語を特定する言語特定手段（ｄ）を具備するのが好ましい。
さらに、前記翻訳機能付対話式人形は、外部からの操作を検出して、前記音声認識手段（ａ）、前記音声変換手段（ｂ）、前記音声発声手段（ｃ）および前記言語特定手段（ｄ）よりなる群から選択される少なくとも１種の手段を制御する制御手段（ｅ）を具備するのが好ましい。
【００２２】
前記翻訳機能付対話式人形においては、前記制御手段（ｅ）が、前記音声発声手段（ｃ）に前記第二の音声または前記第三の音声を複数回発声させる機能を有するのが好ましい。
また、前記制御手段（ｅ）が、前記言語特定手段（ｄ）に前記第二の言語として複数の言語を特定し、前記音声発声手段（ｃ）に前記第二の音声または前記第三の音声を複数の言語に基づいて連続して発声させる機能を有するのが好ましい。
【００２３】
【発明の実施の形態】
本発明は、使用者による第一の言語に基づく第一の音声を受信して第二の言語に基づく第二の音声に変換し、ついで当該変換された第二の音声、または前記第一の音声に対応してあらかじめ記憶した第二の言語に基づく第三の音声を発声して前記使用者と対話することを特徴とする翻訳機能付対話式人形に関する。以下に、図面を参照しながら、本発明に係る翻訳機能付対話式人形を説明する。
【００２４】
図１は、本発明に係る翻訳機能付対話式人形の機能を概略的に説明するための図である。図１に示すように、本発明に係る翻訳機能付対話式人形１は、使用者の発声する第一の言語に基づく音声、例えば日本語による「私の名前は桜です。」という音声を耳部１ａに設けられた音声受信手段により受信し、人形の一部分に内蔵され、本発明を実現する音声変換手段１ｂを含む翻訳システムモジュールにより、この音声を第二の言語に基づく音声、例えば英語による「マイ　ネームイズ　サクラ。」またはドイツ語による「マイン　ナーメ　イスト　サクラ。」に変換し、この第二の音声を口部分１ｃに設けられた音声発声手段より発声する。
【００２５】
また、図１には示していないが、第一の音声として、例えば「歌。」と言った場合には、当該第一の音声に対応してあらかじめ記憶された「ハ〜ッピ　バ〜スデ〜ィ　トゥ〜　ユ〜。」という歌を第三の音声として発声させてもよい。
もっとも、前記第一の音声および前記第二の音声が単語であることが好ましい。なぜなら、幼少期の子供は文章を話すのではなく単語を羅列して発するだけであり、これに対して本発明に係る翻訳機能付対話式人形がｗｏｒｄ　ｔｏ　ｗｏｒｄで翻訳をすることができれば、初歩的な外国語教育、すなわち外国語教育への導入にとってに資するところが大きいからである。
【００２６】
次に、本発明に係る翻訳機能付対話式人形の第一の態様において用いられる音声変換手段（翻訳システムモジュール）について説明する。図２は、前記翻訳システムモジュールの構成を示す図である。図２に示すように、本発明における翻訳システムモジュールは、音声受信手段２、音声変換手段３および音声発声手段４を含む。そして、音声変換手段３は、音声認識手段３ａ、音声翻訳手段３ｂおよび音声合成手段３ｃを含む。音声変換手段３のみを翻訳システムモジュールとしてもよいが、当該翻訳システムモジュールは、音声受信手段２および音声発声手段４を含む概念であってもよい。
【００２７】
図１に示した例を用いて説明すると、「私の名前は桜です。」との第一の音声が、まず、前記音声受信手段２（例えばマイク、録音機、無線マイクなど）によって受信される。受信された第一の音声は、音声認識変換手段３に送られ、音声認識手段３ａで認識されるとともに、音声翻訳手段３ｂによって「Ｍｙ　ｎａｍｅ　ｉｓ　Ｓａｋｕｒａ．」に翻訳され、ついで、音声合成手段３ｃにより、「マイ　ネイム　イズ　サクラ。」という第二の音声に合成される。そして、この第二の音声が音声発生手段（例えばスピーカなど）から発声される。
【００２８】
ここで、音声認識、音声翻訳および音声合成については、それぞれ個別に従来から種々の研究開発がなされており、本発明においては、かかる従来技術に基づく音声認識手段、音声翻訳手段および音声合成手段を組み合わせて用いることもできる。もっとも、第一の言語に基づく第一の音声を第二の言語に基づく第二の音声に変換するというコンセプトは、本発明によって新規に見出されたものである。
【００２９】
一例を示すと、「私の名前は桜です。」との第一の音声は、音声受信手段２によって音声信号として受信されて、音声変換手段３に送信される。音声変換手段３においては、音声認識手段３ａが前記音声信号を電気信号に変換し、例えばこれをテキスト（語彙）化する。ついで、テキスト化された第一の音声（第一のテキスト）が、音声翻訳手段３ｂに送信される。
【００３０】
そして、図３に示すように、音声翻訳手段３ｂに記憶手段３ｂ−１よび選択手段３ｂ−２を具備させる。記憶手段３ｂ−１には、メモリーまたは辞書とも言うことができ、前記第一の音声を構成する語彙に対応する第二の言語に基づく語彙（および／または音声）複数個が記憶されている。例えば、英語、ドイツ語、フランス語、スペイン語およびポルトガル語などの複数の言語ごとに、複数の語彙（および／または音声）を記憶させてもよい。
【００３１】
ここで、図１に示した例を用いて説明すると、第一の言語による「私の名前は桜です。」という第一のテキストを構成する語彙である「私の」、「名前は」、「桜」および「です」に対応して、例えば英語のグループとして、「Ｍｙ」、「ｎａｍｅ」、「ｉｓ」および「Ｓａｋｕｒａ」という語彙ならびに／または「マイ」、「ネイム」、「イズ」および「サクラ」という音声を記憶手段３ｂ−１に記憶させる。また、ドイツ語のグループとしては、「Ｍｉｎｅ」、「ｎａｍｅ」、「ｉｓｔ」および「Ｓａｋｕｒａ」という語彙ならびに／または「マイン」、「ナーメ」、「イスト」および「サクラ」という音声を記憶させる。
【００３２】
そして、前記複数の語彙および／または音声から、選択手段３ｂ−２が、前記第一の音声を構成する語彙に対応する前記第二の言語に基づく語彙および／または音声を選択する。図１の例で説明すると、「私の」に対応して「Ｍｙ」を選択し、「名前は」に対応して「ｎａｍｅ」を選択する。そして、「です」に対応して「ｉｓ」を選択し、「桜」に対応して「Ｓａｋｕｒａ」を選択する。
【００３３】
ついで、音声合成手段３ｃが、選択された語彙から前記第二の音声を合成し、個々の語彙に相当する音声をつなぎ合わせて第二の音声を合成し、合成された第二の音声は音声発声手段４から発声される。選択手段３ｂ−１が個々の語彙に相当する音声を選択する場合は、音声合成手段３ｃはその個々の音声をつなぎ合わせて第二の音声を合成し、合成した第二の音声を音声発声手段４から発声させればよい。
【００３４】
以上のような機能を有する音声認識手段、音声変換手段、および音声合成手段は、当業者であれば、本願明細書における本発明の技術的意義に鑑み、従来のものを改良して得ることができるが、以下に、より好ましい音声認識手段について説明する。
【００３５】
本発明に係る翻訳機能付対話式人形においては、前記音声認識手段が、前記第一の音声を音素列として認識するものであるのが好ましい。従来の音声認識手段によれば、使用者（話者）の音調、話し方およびイントネーションなどの癖が多様であるため、使用者の違いによって音声認識の程度が左右されてその精度が低くなってしまうという問題がある。したがって、従来の音声認識手段では、特定の使用者の癖を音声認識手段に覚えさせるトレーニングが必要とされているものが多い。これに対し、音声を音素の列として認識する方法を採用すれば、使用者が違っても、より精度良くその音声を認識することができる。
【００３６】
具体的には、前記音声認識手段は、
前記第一の音声の音声信号を受信し、前記音声信号を対応する電気信号に変換するオーディオプロセッサ手段、
前記電気信号を所定のサンプリングレートでデジタル化し、デジタル化された音声信号を形成するアナログ／デジタル変換器手段、および
前記デジタル化された音声信号の細分化された複数部分に対する時間領域分析を行い、前記音声信号の複数の時間領域特性を識別する手段と、所定の高域および低域カットオフ周波数を有する複数のフィルタ帯域を用いて、前記細分化された各部分をフィルタリングし、前記細分化された各部分の少なくとも１つの周波数領域特性を識別する手段と、前記時間領域特性および周波数領域特性を処理して前記音声信号に含まれる音素を識別する手段とを具備し、前記音声信号に含まれる音素の種類を識別する音声音素識別手段を含むのが好ましい。
【００３７】
ここで、図４に、前記音声音素識別手段を含む音声認識手段（システム）の構成を示す。
図４に示す音声認識システム１０は、音声受信手段で受信した第一の音声の音声信号を、前記音声信号を対応する電気信号に変換するオーディオプロセッサ回路１４を具備する。そして、前記電気信号をデジタルサンプリングに適した電気的状態にするために、前記電気信号を所定のサンプリングレートでデジタル化し、デジタル化された音声信号を形成するアナログ／デジタル変換回路３４を具備する。アナログ／デジタル変換回路３４は、前記電気信号をアナログ形式で受信し、デジタル形式に変換して送信する。
【００３８】
デジタル化された音声信号は、ついで、音声識別回路１６に送信される。音声識別回路１６は、デジタル化された音声信号を、プログラム化して分析し、その音声信号の音声特性を抽出する。そして、必要な音声特性を得た場合に、前記音声信号に含まれる特定の音素を識別することができる。この音素の識別は、個々の使用者（話者）の特徴に依存せずに行うことができ、かつ、使用者が通常の会話速度で話してもリアルタイムで行うことができる。
【００３９】
音声識別回路１６は２つの方法で必要な音声特性を取得する。まず、前記デジタル化された音声信号の細分化された複数部分に対する時間領域分析を行い、前記音声信号の複数の時間領域特性を識別して、前記音声信号に含まれる音素の種類を識別する。音声信号に含まれる音素の種類を識別するパラメータとしては、例えば音声が“有声音”か、“無声音”か、または“静寂”かなどを含む。
【００４０】
つぎに、音声識別回路１６は、所定の高域および低域カットオフ周波数を有する複数のフィルタ帯域を用いて、前記細分化された各部分をフィルタリングする。これにより、複雑な波形を有する第一の音声の音声信号から、細分化された多数の信号であって、前記音声信号の成分である個々の信号の波形を表す多数の信号が生成される。そして、音声識別回路１６は、細分化された各部分を測定し、少なくとも１つの周波数領域特性、例えば、前記信号の周波数および振幅を含む種々の周波数領域データを抽出する。
【００４１】
このようにして得られた周波数領域特性および時間領域特性は、前記音声信号に含まれる音素を識別するために充分な情報を含む。したがって、音声識別回路１６は、最後に、前記時間領域特性および周波数領域特性を処理して前記音声信号に含まれる音素を識別する。
【００４２】
以上のようにして認識された第一の音声は、ついで、音声識別回路１６に内臓させた音声翻訳手段および音声合成手段によって翻訳し、第二の音声に合成される。この場合、上述のように認識された音素の列を第二の言語に基づく語彙の列に翻訳させればよい。例えば、従来技術による言語処理プログラムを用いることにより、かかる翻訳および音声合成を行うことが可能である。
【００４３】
そして、これらの処理は、例えば、音声識別回路１６に接続され、データの入力、記憶および／または制御をすることのできるホストコンピュータまたはＣＰＵなどの制御デバイス２２によって制御すればよい。かかる制御デバイス２２としては、従来のものを用いることができ、音声識別回路１６に内蔵されているのが好ましい。もっとも、音声識別回路１６の構成によっては省略することもできる。
【００４４】
ここで、図５に、さらに詳細な前記音声音素識別手段（システム）の構成を示す。図５に示す音声認識システム１０では、図４の場合と同様に、音声受信手段１２によって受信された第一の音声が、オーディオプロセッサ回路１４で調整される。オーディオプロセッサ回路１４においては、第一の音声の音声信号を電気信号に変え、つづくアナログ／デジタル変換器３４に送信する。
【００４５】
オーディオプロセッサ回路１４では、まず増幅回路２６などの信号増幅手段によって、電気信号が好適なレベルに増幅され、制限増幅回路２８によって、その出力レベルが制限される。そして、フィルタ回路３０によって、高周波数が除去される。これら、増幅回路２６、制限増幅回路２８およびフィルタ回路３０としては、種々のものを用いることができる。ついで、アナログ／デジタル変換回路３４は、前記電気信号をアナログ形式で受信し、デジタル形式に変換して送信する。
【００４６】
つぎに、図４に示す音声認識システム１０は、デジタル音声プロセッサ回路１８およびホスト音声プロセッサ回路２０を含む。これらは図１に示す音声識別回路１６に含まれるものであり、プログラム化できるデバイスを用いる同等の回路で構成することができる。
【００４７】
まず、デジタル音声プロセッサ回路１８は、デジタル化された音声信号を受信し、プログラムに基づいて操作し、種々の音声特性を抽出する。具体的には、まず時間領域においてデジタル化された音声信号を分析し、その分析結果に基づいて少なくとも１種の時間領域音声特性を抽出する。この特性は、音声信号が“有声的な”、“無声的な”または“静寂な”音素を含むか否かを決定するために有利に役立つ。
【００４８】
また、デジタル音声プロセッサ回路１８は、デジタル化された音声信号をさらに操作し、音声信号に関する種々の周波数領域情報を取得する。これは、音声信号を、無数のフィルタ帯でフィルタリングし、対応する無数のフィルタされた信号を生成することにより行うことができる。デジタル音声プロセッサ回路１８は、個々の波形によって発現される種々の特性を測定し、少なくとも１種の周波数領域音声特性を抽出する。この周波数領域音声特性は、フィルタリング工程によって得られた信号成分の周波数、振幅および勾配などを含む。これらの特性は、蓄積ないし記憶され、音声信号に含まれる音素の種類を決定するために用いられる。
【００４９】
図５に示すように、デジタル音声プロセッサ回路１８は、デジタル音声プロセッサ３６などの、プログラム制御のもとでデジタル化された音声信号を分析するプログラム化可能な手段を含む。このデジタル音声プロセッサ回路３６としては、モトローラＤＳＰ５６００１などのプログラム可能な２４ビット汎用デジタル信号プロセッサを好適に用いることができる。もちろん、他の上市されたデジタル信号プロセッサを用いることもできる。
【００５０】
また、デジタル音声プロセッサ３６は、バスタイプの標準アドレス、データおよび制御配列３８を介して、種々の構成要素と接続される。これら構成要素は、例えば、ＤＳＰプログラムメモリー４０などの、ＤＳＰ３６によって実行される一連のプログラムを記憶するプログラムメモリー手段、ＤＳＰデータメモリー４２などの、ＤＳＰ３６によって用いられるデータを記憶するデータメモリー手段、ならびにアドレスおよびデータのゲーティングおよびマッピングなどの標準時間制御機能を実行する制御ロジック４４を含む。
【００５１】
つぎに、ホスト音声プロセッサ回路２０について説明する。ホスト音声プロセッサ回路２０は、適切なホストインターフェイス５２を介してデジタル音声プロセッサ回路１８に接続される。概して、ホスト音声プロセッサ回路２０が、ホストインターフェイス５２を介して、デジタル音声プロセッサ回路１８で生成された種々の音声信号特性情報を受信する。
【００５２】
このホスト音声プロセッサ回路２０は、この情報を分析し、前記信号特性を代表的な使用者（話者）をテストすることによって集めた音声標準音声データと比較することによって、前記音声信号に含まれる音素の種類を識別する。音素を識別した後、ホスト音声プロセッサ回路２０は、種々の言語処理技術を使用し、音素を第一の言語や第二の言語に基づく語彙やフレーズに翻訳する。
【００５３】
前記ホスト音声プロセッサ回路２０は、好ましくは、ホスト音声プロセッサ５４などの、プログラム制御のもとでデジタル化された音声信号の特性を分析する第二のプログラム化可能な手段を有する。ホスト音声プロセッサ５４は、例えばモトローラ６８ＥＣ０３０などのプログラム化可能な３２ビット汎用性ＣＰＵ素子であればよい。
【００５４】
また、ホスト音声プロセッサ５４は、標準アドレス、データおよび制御バスタイプ配列５６を介して、種々の構成要素と接続される。これら構成要素は、例えば、ホストプログラムメモリー５８などの、ホスト音声プロセッサ５４によって実行される一連のプログラムを記憶するプログラムメモリー手段、ホストデータメモリー６０などの、ホスト音声プロセッサ５４によって用いられるデータを記憶するデータメモリー手段、ならびにアドレスおよびデータのゲーティングおよびマッピングなどの標準時間制御機能を実行する制御ロジック６４を含む。
【００５５】
制御デバイス２２については、図４において説明したものと同様である。制御デバイス２２は、ＲＳ−２３２インターフェイス回路などのインターフェイス手段６６およびケーブル２４を介して、ホスト音声プロセッサ回路２０に接続すればよい。もちろん、デジタル音声プロセッサ回路１８およびホスト音声プロセッサ回路２０の構成によっては、制御デバイス２２を省略することも可能である。なお、ホスト音声プロセッサ回路２０には、さらに辞書機能を有するメモリー６２やディスプレイ６８を接続することも可能である。
【００５６】
以上のように、音声認識手段（ａ）が音素で第一の音声を認識する場合、前記音声翻訳手段（ｂ）が、認識された前記音素列を第二の言語に基づく語彙列に翻訳し、前記音声合成手段（ｃ）が、前記語彙列をコンピュータ処理することにより前記第二の音声を合成するのが有効である。もっとも、図４および５に示したような音声認識システムを用いれば、音声認識手段（ａ）に音声翻訳手段（ｂ）および音声合成手段（ｃ）の機能を持たせることが可能である。
【００５７】
ここで、音声合成手段（ｃ）としては従来のものを用いることができるが、従来の音声合成手段によれば、電気的および機械的に音声を合成するため、語彙と語彙との間の間隔やイントネーションなどが完全ではなく、発声される第二の音声が人間の発する声に対して違和感が生じる場合がある。そこで、本発明においては、音声合成手段に以下のものを用いるのが好ましい。
【００５８】
すなわち、前記音声合成手段（ｃ）は、前記第二の言語に基づく語彙列を受信し、前記語彙列を第一の音素列に変換する音声変換サブシステム、変形規則を受信して前記第一の音素列に適用し、第二の音素列を形成する音声変形器、所定の基準に基づいて前記第二の音素列に含まれる音素に順位付けを行う評価器、および前記第二の音素列を受信し、前記順位付けを用いて前記第二の音素列に含まれる音素を音節に分解する音節分解器を具備するのが好ましい。
【００５９】
さらに、本発明に係る翻訳機能付対話式人形には、前記音声受信手段が前記使用者による第一の言語に基づくキーワードを受信し、前記音声認識手段（ａ）が前記キーワードを認識し、前記音声発声手段が前記キーワードに対応してあらかじめ記憶した第二の言語に基づく質問を発声し、その後、前記使用者による前記第一の音声を前記第二の音声に変換し、当該第二の音声、または前記第三の音声を発声して前記使用者と対話させる機能を持たせることが好ましい。
このような機能は、当業者であれば適宜プログラムを作成して、上記音声認識手段、音声翻訳手段および音声合成手段に組み込ませることが可能である。
【００６０】
また、音声認識手段（ａ）に、前記第一の音声の特定部分を認識させ、音声翻訳手段（ｂ）に、前記特定部分を第二の言語に基づく音声に翻訳させ、音声合成手段（ｃ）に、前記翻訳の結果を前記質問に対応してあらかじめ記憶した第二の言語に基づく音声回答パターンにあてはめ、前記第二の音声を合成させることも有効である。
【００６１】
この構成をとれば、前記音声変換手段は、いわゆるパターン翻訳法に基づいて、第一の言語による第一の音声を構成する第一のテキストを、第二の言語による第二のテキストに変換することができる。パターン翻訳法は、長文を翻訳するためには不充分なものであるが、短文を処理するためには有効である。したがって、初歩的な外国語教育にとって重要な時期である幼少期の子供にとっては、有効である。
【００６２】
この場合、音声翻訳手段（ｂ）が、（ｂ−１）前記音声回答パターンと、前記質問に対応して前記特定部分を構成する語彙として予想される第二の言語に基づく語彙複数個とを記憶する記憶手段、および（ｂ−２）前記音声認識手段（ａ）が認識した前記特定部分を構成する語彙に対応して、第二の言語に基づく語彙を選択する選択手段を具備し、音声合成手段が（ｃ）が、選択された前記第二の言語に基づく語彙の音声を合成し、前記音声回答パターンにあてはめ、前記第二の音声を合成すればよい。
【００６３】
【発明の効果】
本発明によれば、人形内に音声翻訳システムを設け、使用者の発した音声を異なる言語に基づく音声に変換して発声することにより、使用者の遊び心を満足させ、かつ外国語会話の学習に供することのできる翻訳機能付対話式人形を提供することができる。すなわち、本発明は、従来にはなかった翻訳機能付対話式人形を提供することができる。
【図面の簡単な説明】
【図１】本発明に係る翻訳機能付対話式人形の機能を概略的に説明するための図である。
【図２】本発明において用いられる音声変換手段（翻訳システムモジュール）の構成を示す図である。
【図３】本発明において用いられる別の音声変換手段（翻訳システムモジュール）の構成を示す図である。
【図４】本発明において用いられる音声音素識別手段を含む音声認識手段（システム）の構成を示す図である。
【図５】本発明において用いられる音声音素識別手段を含む音声認識手段（システム）の構成をさらに詳細に示す図である。
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to an interactive doll, and more particularly to an interactive doll with a translation function that has a voice translation system built into the doll and converts a voice uttered by a user into a voice in a different language and utters it, thereby satisfying the user's playfulness while also serving for learning elementary foreign-language conversation. That is, the present invention relates to an interactive doll with a translation function that has not existed before.
[0002]
[Prior art]
Generally, children tend to learn about everyday life through play and toys that interest them, and close contact with such toys amounts to imitation learning that leads them toward the real world. Most of this imitation learning takes place through toy dolls: for example, a child makes up an imitation-learning scenario and, following that scenario, elicits appropriate reactions from the toy doll. In other words, the child becomes immersed in the imitation learning by playing out appropriate speech and actions in an enjoyable, two-way interactive manner.
[0003]
Education through toy dolls of this kind has long been passed down as something close to children. Recently, research on talking dolls has become active in anticipation of their educational effect, and attempts have been made to produce more advanced toy dolls. In most such conventional toy dolls, a touch sensor is provided at a predetermined position, and when a child operates the touch sensor, the doll utters simple sentences recorded on a magnetic recording medium (magnetic tape) or a semiconductor recording medium (IC memory), for example "Hello.", "I am ○○○.", "Who are you?", and "What do you like?". For example, Patent Document 1 discloses a talking doll that can arouse a child's interest and, further, an interest in conversation.
[0004]
[Patent Document 1]
Registered Utility Model No. 2562439
[0005]
Specifically, to achieve the above object, Patent Document 1 discloses a talking doll in which a head is detachably attached to the body of the doll; a housing that accommodates detecting means for detecting an operation from outside, storage means for storing a plurality of vocabulary items, selecting means for selecting an arbitrary vocabulary item from among them, voice synthesizing means for synthesizing a voice from the selected vocabulary item, and utterance means for uttering the synthesized voice is placed inside the body and stabilized with stuffing; a battery compartment formed so as to protrude from the top of the housing projects upward from the body; and the head is fitted onto the battery compartment.
[0006]
The toy doll described in Patent Document 1 speaks only isolated, simple sentences. Because operating the touch sensor merely plays back recorded simple sentences that follow no scenario, the doll can arouse a child's curiosity only temporarily. The child soon gets bored, and the period during which the child actually plays with such a toy doll becomes short, so the educational effect is low. In addition, the sentences spoken by conventional toy dolls are not an interactive scenario but a string of disconnected sentences, and so lack realism. For this reason as well, the educational effect is low.
[0007]
In contrast, Patent Document 2, for example, addresses this problem by disclosing a toy doll that can produce voice output suited to the topic: behavior patterns that a child is likely to follow are written as scenarios and recorded, so that two-way dialogue with the doll is possible according to an arbitrarily set situation. For example, so that a dialogue with a child can lead into a variety of scenarios, voices are compressed with voice compression software, recorded in a memory unit, and retrieved quickly when needed, and it is said that even within a single topic, questions and answers can be exchanged immediately according to the selectable situation. Specifically, Patent Document 2 discloses a voice-recognition interactive doll toy in which a doll body, shaped as a mixture of human and animal forms, is provided with a first memory unit that stores compressed voice data obtained by compressing digital voice signal streams of many sentences at a predetermined compression rate, and a second memory unit that has a computation area for recognizing a user's voice signal input from outside.
[0008]
[Patent Document 2]
Patent No. 3164346
[0009]
Admittedly, the doll toy described in Patent Document 2 utters a voice in response to the user's conversation. However, it says nothing about the language of the voice uttered by the user or the language of the voice uttered by the talking doll or doll toy. The same is true of Patent Document 1. Judging from the problem and effects of the invention of Patent Document 2, it appears to assume a case in which a Japanese infant speaks in Japanese and the doll toy, having recognized the speech, utters a voice in Japanese.
[0010]
Here, with the recent internationalization of society, the importance of foreign language education is being actively debated in Japan, yet Japan lags behind other countries. This is thought to stem not only from school education but also from the living environment in ordinary households during early childhood. It is therefore clear that if talking dolls and doll toys such as those described above functioned as items for foreign language education, they would contribute to the advancement of foreign language education. The prior art provides no doll toy having such a function.
[0011]
[Problems to be solved by the invention]
Accordingly, an object of the present invention is to provide an interactive doll with a translation function that has a voice translation system built into the doll and converts a voice uttered by a user into a voice in a different language and utters it, thereby satisfying the user's playfulness while also serving for learning foreign-language conversation. That is, an object of the present invention is to provide an interactive doll with a translation function that has not existed before.
[0012]
[Means for Solving the Problems]
To solve the above problem, the present invention provides an interactive doll with a translation function, characterized in that it receives a first voice in a first language from a user, converts it into a second voice in a second language, and then interacts with the user by uttering either the converted second voice or a third voice in the second language stored in advance in correspondence with the first voice.
[0013]
The interactive doll with a translation function preferably comprises:
an ear portion having voice receiving means for receiving the first voice;
a portion having voice conversion means for converting the first voice into the second voice; and a mouth portion having voice utterance means for uttering the second voice or the third voice.
It is also preferable that the first voice and the second voice are words.
[0014]
Further, the voice conversion means preferably comprises:
(a) voice recognition means for recognizing the first voice;
(b) voice translation means for translating the recognized first voice into the second language; and
(c) voice synthesis means for synthesizing the voice based on the result of the translation.
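The three-stage structure of the voice conversion means, (a) recognition, (b) translation and (c) synthesis, can be pictured with a minimal Python sketch. The class name `VoiceConverter` and the three callables are hypothetical, and the stand-in stages below pass text around as bytes instead of processing real audio; they only illustrate the order in which the stages are chained.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceConverter:
    """Hypothetical model of the voice conversion means: (a) -> (b) -> (c)."""
    recognize: Callable[[bytes], str]    # (a) first voice -> recognized text
    translate: Callable[[str], str]      # (b) recognized text -> second-language text
    synthesize: Callable[[str], bytes]   # (c) second-language text -> second voice

    def convert(self, first_voice: bytes) -> bytes:
        recognized = self.recognize(first_voice)
        translated = self.translate(recognized)
        return self.synthesize(translated)


# Toy stand-ins so the sketch runs without audio hardware (assumptions only).
converter = VoiceConverter(
    recognize=lambda audio: audio.decode("utf-8"),
    translate=lambda text: {"こんにちは": "Hello"}.get(text, text),
    synthesize=lambda text: text.encode("utf-8"),
)
second_voice = converter.convert("こんにちは".encode("utf-8"))
```

Keeping each stage behind its own callable mirrors the statement that conventional recognition, translation and synthesis means can be combined.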
[0015]
It is preferable that the voice recognition means (a) recognizes the first voice as a phoneme sequence.
Specifically, the voice recognition means (a) preferably comprises:
audio processor means for receiving a voice signal of the first voice and converting the voice signal into a corresponding electrical signal;
analog/digital converter means for digitizing the electrical signal at a predetermined sampling rate to form a digitized voice signal; and
voice phoneme identification means including: means for performing time-domain analysis on a plurality of subdivided portions of the digitized voice signal to identify a plurality of time-domain characteristics of the voice signal; means for filtering each of the subdivided portions using a plurality of filter bands having predetermined high and low cutoff frequencies to identify at least one frequency-domain characteristic of each subdivided portion; and means for processing the time-domain and frequency-domain characteristics to identify the phonemes contained in the voice signal.
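As a rough illustration of the two kinds of characteristics named above, the following Python sketch computes simple time-domain features for one subdivided portion (frame) and the energy inside one frequency band. The specific features (mean energy and zero-crossing rate), the thresholds, and all function names are assumptions chosen for illustration; the specification does not name them. The band energy is computed with a naive discrete Fourier transform, which stands in for the bank of band-pass filters with high and low cutoff frequencies.

```python
import math


def frame_features(frame):
    """Time-domain characteristics of one frame: mean energy and zero-crossing rate."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return energy, crossings / (len(frame) - 1)


def classify_frame(frame, silence_energy=1e-4, unvoiced_zcr=0.3):
    """Label a frame 'silence', 'unvoiced' or 'voiced' (thresholds are illustrative guesses)."""
    energy, zcr = frame_features(frame)
    if energy < silence_energy:
        return "silence"
    return "unvoiced" if zcr > unvoiced_zcr else "voiced"


def band_energy(frame, sample_rate, low_hz, high_hz):
    """Frequency-domain characteristic: spectral energy between two cutoff frequencies."""
    n = len(frame)
    total = 0.0
    for k in range(n // 2 + 1):
        if low_hz <= k * sample_rate / n <= high_hz:
            re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(frame))
            im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(frame))
            total += re * re + im * im
    return total


SAMPLE_RATE = 8000  # assumed sampling rate in Hz
low_tone = [math.sin(2 * math.pi * 100 * i / SAMPLE_RATE) for i in range(160)]
high_tone = [math.sin(2 * math.pi * 3000 * i / SAMPLE_RATE) for i in range(160)]
labels = [classify_frame(f) for f in (low_tone, high_tone, [0.0] * 160)]
```

A real phoneme identifier would combine many such per-frame and per-band measurements and compare them with reference data; the sketch stops at extracting the features.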
[0016]
Further, it is preferable that the voice translation means (b) translates the recognized phoneme sequence into a vocabulary sequence based on a second language.
Further, it is preferable that the voice synthesis means (c) synthesizes the second voice by computer processing of the vocabulary sequence.
[0017]
Specifically, the voice synthesis means (c) preferably comprises:
a voice conversion subsystem that receives the vocabulary sequence in the second language and converts the vocabulary sequence into a first phoneme sequence;
a voice transformer that receives transformation rules and applies them to the first phoneme sequence to form a second phoneme sequence;
an evaluator that ranks the phonemes included in the second phoneme sequence based on a predetermined criterion; and
a syllable decomposer that receives the second phoneme sequence and uses the ranking to decompose the phonemes included in the second phoneme sequence into syllables.
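The four components listed above can be sketched in Python as four small functions. Everything concrete here is an assumption made for illustration: the tiny pronunciation lexicon, the phoneme symbols, the substitution rule, and the ranking criterion (vowels ranked above consonants). The specification only states that rules are applied, phonemes are ranked by a predetermined criterion, and the ranking guides the split into syllables.

```python
# Hypothetical pronunciation lexicon: word -> phoneme symbols (illustrative only).
LEXICON = {
    "sakura": ["s", "a", "k", "u", "r", "a"],
    "name": ["n", "ey", "m"],
}
VOWELS = {"a", "u", "uw", "ey"}


def to_phonemes(word):
    """Conversion subsystem: vocabulary item -> first phoneme sequence."""
    return list(LEXICON[word.lower()])


def apply_rules(phonemes, rules):
    """Transformer: apply substitution rules to form the second phoneme sequence."""
    return [rules.get(p, p) for p in phonemes]


def rank(phonemes):
    """Evaluator: rank vowels (2) above consonants (1); a stand-in criterion."""
    return [2 if p in VOWELS else 1 for p in phonemes]


def syllabify(phonemes, ranks):
    """Decomposer: one top-ranked phoneme per syllable, preceding consonants as onset."""
    syllables, pending = [], []
    for phoneme, r in zip(phonemes, ranks):
        pending.append(phoneme)
        if r == 2:  # a top-ranked phoneme closes the syllable being built
            syllables.append(pending)
            pending = []
    if pending:  # trailing consonants join the last syllable
        if syllables:
            syllables[-1].extend(pending)
        else:
            syllables.append(pending)
    return syllables


def decompose(word, rules):
    """Run the four steps in order for a single word."""
    second = apply_rules(to_phonemes(word), rules)
    return syllabify(second, rank(second))
```

For example, `decompose("Sakura", {"u": "uw"})` yields three consonant-vowel syllables, and `decompose("name", {})` yields a single syllable with the final consonant attached.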
[0018]
Further, in the interactive doll with a translation function, it is preferable that
the voice receiving means receives a keyword in the first language from the user, the voice recognition means (a) recognizes the keyword, and the voice utterance means utters a question in the second language stored in advance in correspondence with the keyword,
and that thereafter the first voice from the user is converted into the second voice, and the second voice or the third voice is uttered to interact with the user.
[0019]
In such an interactive doll with a translation function, it is also possible that
the voice recognition means (a) recognizes a specific portion of the first voice,
the voice translation means (b) translates the specific portion into a voice in the second language, and
the voice synthesis means (c) synthesizes the second voice by fitting the result of the translation into a voice answer pattern in the second language stored in advance in correspondence with the question.
[0020]
In this case, it is preferable that the voice translation means (b) comprises:
(b-1) storage means for storing the voice answer pattern and a plurality of second-language vocabulary items expected, in response to the question, as the vocabulary making up the specific portion; and
(b-2) selecting means for selecting a second-language vocabulary item corresponding to the vocabulary making up the specific portion recognized by the voice recognition means (a),
and that the voice synthesis means (c) synthesizes the voice of the selected second-language vocabulary item and fits it into the voice answer pattern to synthesize the second voice.
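The keyword-triggered question and the answer pattern described in paragraphs [0018] to [0020] can be sketched in Python as table lookups. The keyword "名前", the question text, the pattern string and the table names are all hypothetical examples; text strings stand in for recognized and synthesized voices.

```python
# Keyword (first language) -> stored question (second language). Hypothetical data.
QUESTIONS = {"名前": "What is your name?"}

# (b-1) storage: answer pattern per question, plus the expected vocabulary
# for the specific portion, mapped to its second-language counterpart.
ANSWER_PATTERNS = {"What is your name?": "My name is {}."}
EXPECTED_WORDS = {"What is your name?": {"桜": "Sakura"}}


def ask(keyword):
    """Return the stored second-language question for a recognized keyword."""
    return QUESTIONS.get(keyword)


def answer(question, recognized_text):
    """(b-2) select the matching word, then fit it into the answer pattern."""
    pattern = ANSWER_PATTERNS[question]
    for source_word, target_word in EXPECTED_WORDS[question].items():
        if source_word in recognized_text:  # the "specific portion" was found
            return pattern.format(target_word)
    return None  # no expected word recognized


question = ask("名前")
reply = answer(question, "私の名前は桜です。")
```

Because only the specific portion has to be recognized and translated, the rest of the sentence comes from the stored pattern, which is why this approach suits short utterances.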
[0021]
Further, it is preferable that the interactive doll with a translation function further includes language specifying means (d) for specifying the first language and the second language.
Further, the interactive doll with a translation function preferably includes control means (e) that detects an operation from outside and controls at least one means selected from the group consisting of the voice recognition means (a), the voice conversion means (b), the voice utterance means (c), and the language specifying means (d).
[0022]
In the interactive doll with a translation function, it is preferable that the control means (e) has a function of causing the voice utterance means (c) to utter the second voice or the third voice a plurality of times.
It is also preferable that the control means (e) has a function of causing the language specifying means (d) to specify a plurality of languages as the second language and causing the voice utterance means (c) to utter the second voice or the third voice successively in the plurality of languages.
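The two control functions in paragraph [0022], repeating an utterance and uttering it in several languages in succession, amount to building an ordered utterance list. The Python sketch below assumes a hypothetical function `plan_utterances` and language codes such as "en" and "de"; neither appears in the specification.

```python
def plan_utterances(phrases_by_language, languages, repeat=1):
    """Order the utterances: each selected language in turn, repeated `repeat` times."""
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    return [phrases_by_language[lang] for lang in languages for _ in range(repeat)]


# Hypothetical second voices already produced for two second languages.
phrases = {"en": "My name is Sakura.", "de": "Mein Name ist Sakura."}
plan = plan_utterances(phrases, ["en", "de"], repeat=2)
```

Here `plan` lists the English phrase twice followed by the German phrase twice; a real control means would pass each entry to the voice utterance means in that order.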
[0023]
BEST MODE FOR CARRYING OUT THE INVENTION
The present invention relates to an interactive doll with a translation function, characterized in that it receives a first voice in a first language from a user, converts it into a second voice in a second language, and then interacts with the user by uttering either the converted second voice or a third voice in the second language stored in advance in correspondence with the first voice. The interactive doll with a translation function according to the present invention is described below with reference to the drawings.
[0024]
FIG. 1 is a diagram schematically illustrating the functions of the interactive doll with a translation function according to the present invention. As shown in FIG. 1, the interactive doll 1 receives a voice in the first language uttered by the user, for example the Japanese utterance "私の名前は桜です。" ("My name is Sakura."), through voice receiving means provided in the ear portion 1a. A translation system module, which is built into part of the doll, includes the voice conversion means 1b and implements the present invention, converts this voice into a voice in the second language, for example "My name is Sakura." in English or "Mein Name ist Sakura." in German, and the second voice is uttered from voice utterance means provided in the mouth portion 1c.
[0025]
Although not shown in FIG. 1, when the first voice is, for example, "歌。" ("Song."), a song stored in advance in correspondence with that first voice, such as "Happy Birthday to You", may be uttered as the third voice.
That said, it is preferable that the first voice and the second voice are words. This is because young children do not speak in sentences but merely utter strings of words, and if the interactive doll with a translation function according to the present invention can translate these word for word, it contributes greatly to elementary foreign language education, that is, to an introduction to foreign language education.
[0026]
Next, the voice conversion means (translation system module) used in the first embodiment of the interactive doll with a translation function according to the present invention will be described. FIG. 2 is a diagram showing the configuration of the translation system module. As shown in FIG. 2, the translation system module includes voice receiving means 2, voice conversion means 3, and voice utterance means 4. The voice conversion means 3 includes voice recognition means 3a, voice translation means 3b, and voice synthesis means 3c. The voice conversion means 3 alone may be regarded as the translation system module, or the translation system module may be understood as a concept that also includes the voice receiving means 2 and the voice utterance means 4.
[0027]
Using the example shown in FIG. 1, the first voice "私の名前は桜です。" ("My name is Sakura.") is first received by the voice receiving means 2 (for example, a microphone, a recorder, or a wireless microphone). The received first voice is sent to the voice conversion means 3, where it is recognized by the voice recognition means 3a, translated into the text "My name is Sakura." by the voice translation means 3b, and then synthesized by the voice synthesis means 3c into the spoken second voice "My name is Sakura." This second voice is then uttered from the voice utterance means (for example, a speaker).
[0028]
Voice recognition, voice translation, and voice synthesis have each been the subject of extensive research and development in their own right, and in the present invention, voice recognition means, voice translation means, and voice synthesis means based on such conventional techniques can be used in combination. However, the concept of converting a first voice in a first language into a second voice in a second language is newly presented by the present invention.
[0029]
As an example, the first voice "私の名前は桜です。" ("My name is Sakura.") is received as a voice signal by the voice receiving means 2 and transmitted to the voice conversion means 3. In the voice conversion means 3, the voice recognition means 3a converts the voice signal into an electrical signal and, for example, converts it into text (vocabulary). The first voice converted into text (the first text) is then transmitted to the voice translation means 3b.
[0030]
Then, as shown in FIG. 3, the speech translation unit 3b is provided with a storage unit 3b-1 and a selection unit 3b-2. The storage unit 3b-1 can be called a memory or a dictionary, and stores a plurality of vocabularies (and / or voices) based on the second language corresponding to the vocabulary configuring the first voice. For example, a plurality of vocabularies (and / or sounds) may be stored for each of a plurality of languages such as English, German, French, Spanish, and Portuguese.
[0031]
Using the example shown in FIG. 1, the storage unit 3b-1 stores second-language entries for each vocabulary item of the first text (the first-language words meaning "my", "name", "Sakura", and "is"). For example, as the English group, the vocabulary "My", "name", "is", and "Sakura" and/or the voices "My", "Naim", "Iz", and "Sakura" are stored. Further, as the German group, the vocabulary "Mein", "Name", "ist", and "Sakura" and/or the voices "Main", "name", "ist", and "Sakura" are stored.
[0032]
Then, the selecting unit 3b-2 selects, from the stored vocabularies and/or voices, the second-language vocabulary and/or voice corresponding to each vocabulary item of the first voice. In the example of FIG. 1, "My" is selected for the first-language word meaning "my", "name" for the word meaning "name", "is" for the word meaning "is", and "Sakura" for the word "Sakura".
[0033]
Next, the speech synthesizer 3c combines the voices corresponding to the selected vocabulary to synthesize the second voice, and the synthesized second voice is uttered from the voice uttering means 4. When the selecting unit 3b-2 has already selected a voice for each vocabulary item, the voice synthesizing unit 3c has only to connect the individual voices to synthesize the second voice and have it uttered from the voice uttering means 4.
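The storage unit 3b-1, the selecting unit 3b-2 and the concatenating synthesizer 3c can be sketched as a per-language lookup table followed by a join. The sketch below is illustrative only: the English glosses used as keys stand in for the first-language vocabulary, the input is assumed to be already segmented and ordered for the second language, and all names are hypothetical.

```python
# Storage unit 3b-1 (sketch): one vocabulary group per second language.
# Keys are English glosses standing in for the first-language words.
STORAGE = {
    "english": {"my": "My", "name": "name", "is": "is", "sakura": "Sakura"},
    "german": {"my": "Mein", "name": "Name", "is": "ist", "sakura": "Sakura"},
}

def select(first_vocab, language):
    """Selecting unit 3b-2: pick the second-language entry for each word."""
    group = STORAGE[language]
    return [group[word] for word in first_vocab]

def synthesize(selected):
    """Synthesizer 3c (sketch): connect the selected entries into one utterance."""
    return " ".join(selected) + "."

first_text = ["my", "name", "is", "sakura"]
english = synthesize(select(first_text, "english"))
german = synthesize(select(first_text, "german"))
```

A word-for-word lookup of this kind does not reorder words, which is one reason it suits only short, fixed sentence shapes.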
[0034]
A person skilled in the art can, in view of the technical significance of the present invention described in this specification, improve conventional speech recognition means, speech conversion means, and speech synthesis means having the above functions. A more preferable voice recognition means is described below.
[0035]
In the interactive doll with a translation function according to the present invention, it is preferable that the voice recognition means recognizes the first voice as a phoneme sequence. With conventional voice recognition means, users (speakers) differ in habits such as tone, manner of speaking, and intonation, so recognition is affected by differences between users and accuracy is reduced. Many conventional voice recognition means therefore require training so that they learn the habits of a specific user. In contrast, if the voice is recognized as a sequence of phonemes, it can be recognized more accurately even when the user changes.
[0036]
Specifically, the voice recognition means preferably includes:
Audio processor means for receiving an audio signal of the first voice and converting the audio signal into a corresponding electrical signal;
Analog / digital converter means for digitizing the electrical signal at a predetermined sampling rate to form a digitized audio signal; and
Voice phoneme identification means for identifying the type of the phoneme included in the audio signal, the voice phoneme identification means including:
Means for performing time domain analysis on subdivided portions of the digitized audio signal to identify a plurality of time domain characteristics of the audio signal;
Means for filtering each of the subdivided portions using a plurality of filter bands having predetermined high and low cutoff frequencies and identifying at least one frequency domain characteristic of each of the subdivided portions; and
Means for processing the time domain characteristics and the frequency domain characteristics to identify a phoneme included in the audio signal.
[0037]
Here, FIG. 4 shows a configuration of a voice recognition means (system) including the voice phoneme identification means.
The voice recognition system 10 shown in FIG. 4 includes an audio processor circuit 14 that converts the voice signal of the first voice received by the voice receiving unit into a corresponding electrical signal and brings that signal into an electrical state suitable for digital sampling. An analog / digital conversion circuit 34 is provided for digitizing the electrical signal at a predetermined sampling rate to form a digitized audio signal. The analog / digital conversion circuit 34 receives the electrical signal in an analog format, converts it into a digital format, and transmits the digital signal.
[0038]
The digitized audio signal is then transmitted to the voice identification circuit 16. The voice identification circuit 16 analyzes the digitized audio signal under program control and extracts the voice characteristics of the signal. Once the necessary voice characteristics are obtained, the specific phonemes included in the audio signal can be identified. This identification does not depend on the characteristics of each user (speaker) and can be performed in real time even when the user speaks at normal conversational speed.
[0039]
The audio identification circuit 16 acquires necessary audio characteristics in two ways. First, a time domain analysis is performed on a plurality of subdivided portions of the digitized audio signal, a plurality of time domain characteristics of the audio signal are identified, and types of phonemes included in the audio signal are identified. The parameter for identifying the type of phoneme included in the audio signal includes, for example, whether the audio is “voiced sound”, “unvoiced sound”, or “silence”.
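As a rough illustration of a time domain analysis that separates "voiced sound", "unvoiced sound" and "silence", the sketch below uses short-time energy and zero-crossing rate. The specification does not name these features; they are a common textbook choice assumed here, and the thresholds and function name are hypothetical.

```python
import math

def classify_frame(frame, silence_energy=1e-4, zcr_threshold=0.25):
    """Label one subdivided portion as 'silence', 'unvoiced' or 'voiced'.

    Assumed features: mean energy and zero-crossing rate of the frame.
    """
    energy = sum(s * s for s in frame) / len(frame)
    if energy < silence_energy:
        return "silence"
    # Count sign changes between neighbouring samples.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1)
    # Noise-like (unvoiced) sounds change sign far more often than voiced ones.
    return "unvoiced" if zcr > zcr_threshold else "voiced"

RATE = 8000   # assumed sampling rate in Hz
N = 160       # 20 ms frame at that rate
voiced_frame = [0.5 * math.sin(2 * math.pi * 150 * n / RATE) for n in range(N)]
unvoiced_frame = [0.1 if n % 2 == 0 else -0.1 for n in range(N)]
silent_frame = [0.0] * N
```

The three synthetic frames are stand-ins for real speech: a 150 Hz tone, a rapidly alternating signal, and digital silence.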
[0040]
Next, the voice identification circuit 16 filters each of the subdivided portions using a plurality of filter bands having predetermined high and low cutoff frequencies. As a result, the audio signal of the first voice, which has a complicated waveform, is separated into a large number of filtered signals, each representing the waveform of an individual component of the audio signal. The voice identification circuit 16 then measures each subdivided portion and extracts at least one frequency domain characteristic, for example, frequency domain data including the frequency and amplitude of the signal.
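The idea of extracting a frequency and an amplitude per band can be illustrated as follows. Note the substitution: instead of the band-pass filters the text describes, this sketch takes a discrete Fourier transform of one frame and reports the strongest component between each pair of cutoff frequencies. The band edges, sampling rate and function names are assumptions.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Amplitude of each DFT bin, scaled so a sine of amplitude A reads as A."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) * 2 / n
        for k in range(n // 2)
    ]

def band_features(frame, rate, bands):
    """For each (low, high) cutoff pair, return (peak frequency, peak amplitude).

    Each band is assumed to contain at least one DFT bin.
    """
    mags = dft_magnitudes(frame)
    bin_hz = rate / len(frame)
    features = []
    for low, high in bands:
        bins = [k for k in range(len(mags)) if low <= k * bin_hz < high]
        peak = max(bins, key=lambda k: mags[k])
        features.append((peak * bin_hz, mags[peak]))
    return features

RATE = 8000
frame = [
    math.sin(2 * math.pi * 400 * t / RATE) + 0.5 * math.sin(2 * math.pi * 2000 * t / RATE)
    for t in range(80)
]
features = band_features(frame, RATE, [(0, 1000), (1000, 3000)])
```

With an 80-sample frame at 8000 Hz, each bin is 100 Hz wide, so both test tones fall exactly on a bin.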
[0041]
The frequency domain characteristics and the time domain characteristics obtained in this way contain sufficient information to identify phonemes contained in the audio signal. Therefore, the speech identification circuit 16 finally processes the time domain characteristics and the frequency domain characteristics to identify phonemes included in the audio signal.
[0042]
The first speech recognized as described above is then translated by a speech translation unit and a speech synthesis unit incorporated in the speech identification circuit 16 and synthesized into a second speech. In this case, the sequence of phonemes recognized as described above may be translated into a sequence of vocabulary based on the second language. For example, it is possible to perform such translation and speech synthesis by using a language processing program according to the related art.
[0043]
These processes may be controlled by, for example, a control device 22 such as a host computer or a CPU that is connected to the voice identification circuit 16 and that can input, store, and / or control data. As the control device 22, a conventional device can be used, and it is preferable that the control device 22 is built in the voice identification circuit 16. However, it may be omitted depending on the configuration of the voice identification circuit 16.
[0044]
Here, FIG. 5 shows a more detailed configuration of the speech phoneme identification means (system). In the speech recognition system 10 shown in FIG. 5, the first speech received by the speech receiving means 12 is adjusted by the audio processor circuit 14, as in the case of FIG. In the audio processor circuit 14, the audio signal of the first audio is converted into an electric signal and transmitted to the analog / digital converter 34.
[0045]
In the audio processor circuit 14, first, the electric signal is amplified to a suitable level by a signal amplifying means such as an amplifier circuit 26, and the output level is limited by a limiting amplifier circuit 28. Then, the high frequency is removed by the filter circuit 30. Various circuits can be used as the amplifier circuit 26, the limiting amplifier circuit 28, and the filter circuit 30. Next, the analog / digital conversion circuit 34 receives the electric signal in an analog form, converts it into a digital form, and transmits it.
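The order of operations in the audio processor circuit 14 and the analog / digital conversion circuit 34 can be mimicked in software. This is only a numerical sketch: the gain, ceiling, one-pole low-pass filter and 8-bit quantizer are assumed values and techniques, not those of the circuits 26, 28, 30 and 34.

```python
def amplify(samples, gain):
    """Amplifier circuit 26 (sketch): scale the signal to a suitable level."""
    return [gain * s for s in samples]

def limit(samples, ceiling):
    """Limiting amplifier circuit 28 (sketch): clamp the output level."""
    return [max(-ceiling, min(ceiling, s)) for s in samples]

def low_pass(samples, alpha):
    """Filter circuit 30 (sketch): one-pole low-pass that attenuates high frequencies."""
    out, prev = [], 0.0
    for s in samples:
        prev = prev + alpha * (s - prev)
        out.append(prev)
    return out

def digitize(samples, bits=8, full_scale=1.0):
    """Conversion circuit 34 (sketch): quantize to signed integer codes."""
    levels = 2 ** (bits - 1) - 1
    return [round(max(-full_scale, min(full_scale, s)) / full_scale * levels) for s in samples]

raw = [0.0, 0.1, 0.5, -0.5]
conditioned = low_pass(limit(amplify(raw, gain=4.0), ceiling=1.0), alpha=0.5)
codes = digitize(conditioned)
```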
[0046]
Next, the speech recognition system 10 shown in FIG. 5 includes a digital speech processor circuit 18 and a host speech processor circuit 20. These are included in the speech identification circuit 16 shown in FIG. 4 and can be configured with equivalent circuits using programmable devices.
[0047]
First, the digital audio processor circuit 18 receives the digitized audio signal, operates under program control, and extracts various audio characteristics. Specifically, it first analyzes the digitized audio signal in the time domain and extracts at least one time domain audio characteristic based on the analysis result. This characteristic is useful for determining whether the audio signal contains "voiced", "unvoiced", or "silence" phonemes.
[0048]
The digital audio processor circuit 18 further operates on the digitized audio signal to obtain various frequency domain information related to the audio signal. This can be done by filtering the audio signal with a large number of filter bands to generate a corresponding number of filtered signals. The digital audio processor circuit 18 measures various characteristics exhibited by the individual waveforms and extracts at least one frequency domain audio characteristic. The frequency domain audio characteristics include the frequency, amplitude, gradient, and the like of the signal components obtained by the filtering process. These characteristics are stored and are used to determine the type of phoneme contained in the audio signal.
[0049]
As shown in FIG. 5, the digital audio processor circuit 18 includes programmable means for analyzing the digitized audio signal under program control, such as a digital audio processor 36. As the digital audio processor 36, a programmable 24-bit general-purpose digital signal processor such as a Motorola DSP56001 can be suitably used. Of course, other commercially available digital signal processors can also be used.
[0050]
The digital audio processor 36 is also connected to various components via a standard address, data and control bus type arrangement 38. These components include, for example, program memory means for storing a series of programs executed by the DSP 36, such as the DSP program memory 40; data memory means for storing data used by the DSP 36, such as the DSP data memory 42; and control logic 44 that performs standard timing control functions such as address and data gating and mapping.
[0051]
Next, the host audio processor circuit 20 will be described. The host audio processor circuit 20 is connected to the digital audio processor circuit 18 via a suitable host interface 52. Generally, host audio processor circuit 20 receives various audio signal characteristic information generated by digital audio processor circuit 18 via host interface 52.
[0052]
The host audio processor circuit 20 analyzes this information and compares the signal characteristics with standard audio data collected by testing representative users (speakers), thereby identifying the types of phonemes included in the audio signal. After identifying the phonemes, the host audio processor circuit 20 uses various language processing techniques to translate the phonemes into vocabularies and phrases based on the first and second languages.
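Comparing extracted characteristics with standard data can be sketched as a nearest-neighbour lookup. The specification does not state how the comparison is made, so the Euclidean distance, the two-value feature vector and the reference numbers below are all assumptions chosen for illustration, not measured data.

```python
import math

# Hypothetical reference table: phoneme -> (first peak Hz, second peak Hz).
# The numbers are illustrative placeholders.
REFERENCE = {
    "a": (730.0, 1090.0),
    "i": (270.0, 2290.0),
    "u": (300.0, 870.0),
}

def identify(features):
    """Return the reference phoneme whose features are closest to the input."""
    return min(REFERENCE, key=lambda p: math.dist(REFERENCE[p], features))

first = identify((700.0, 1100.0))
second = identify((280.0, 2250.0))
```

A practical system would combine many more characteristics per frame, including the time domain class, before deciding on a phoneme.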
[0053]
The host audio processor circuit 20 preferably includes a second programmable means for analyzing the characteristics of the digitized audio signal under program control, such as the host audio processor 54. The host audio processor 54 may be a programmable 32-bit general-purpose CPU element such as, for example, a Motorola 68EC030.
[0054]
The host audio processor 54 is also connected to various components via a standard address, data and control bus type arrangement 56. These components include, for example, program memory means for storing a series of programs executed by the host audio processor 54, such as the host program memory 58; data memory means for storing data used by the host audio processor 54, such as the host data memory 60; and control logic 64 that performs standard timing control functions such as address and data gating and mapping.
[0055]
The control device 22 is the same as that described in FIG. The control device 22 may be connected to the host audio processor circuit 20 via an interface means 66 such as an RS-232 interface circuit and the cable 24. Of course, depending on the configuration of the digital audio processor circuit 18 and the host audio processor circuit 20, the control device 22 can be omitted. Note that a memory 62 having a dictionary function and a display 68 can be further connected to the host audio processor circuit 20.
[0056]
As described above, when the speech recognition means (a) recognizes the first speech as a phoneme sequence, it is effective that the speech translation means (b) translates the recognized phoneme string into a vocabulary string based on the second language and that the speech synthesis means (c) synthesizes the second speech by computer processing of the vocabulary string. However, if the speech recognition system shown in FIGS. 4 and 5 is used, the speech recognition means (a) can also have the functions of the speech translation means (b) and the speech synthesis means (c).
[0057]
Here, a conventional voice synthesizing means (c) can be used. However, because conventional voice synthesizing means synthesize the voice electrically and mechanically, the intervals between vocabulary words and the intonation may be imperfect, and the uttered second voice may sound unnatural compared with a voice uttered by a human. Therefore, in the present invention, it is preferable to use the following as the speech synthesis means.
[0058]
That is, the voice synthesis means (c) includes:
A voice conversion subsystem that receives a vocabulary sequence based on the second language and converts the vocabulary sequence into a first phoneme sequence;
A speech deformer that receives a transformation rule and applies it to the first phoneme sequence to form a second phoneme sequence;
An evaluator that ranks the phonemes included in the second phoneme sequence based on a predetermined criterion; and
A syllable decomposer that receives the second phoneme sequence and decomposes the phonemes included in the second phoneme sequence into syllables using the ranking.
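The four components can be illustrated with a toy front end. Everything specific here is an assumption: the lexicon, the single voicing rule, the use of sonority as the "predetermined criterion", and the simple rule that a consonant directly before a vowel opens a new syllable. The specification does not disclose these details.

```python
# Toy lexicon for the voice conversion subsystem (hypothetical entries).
LEXICON = {"is": ["i", "s"], "sakura": ["s", "a", "k", "u", "r", "a"]}
# Hypothetical transformation rule: "s" after "i" becomes "z".
RULES = {("i", "s"): ("i", "z")}
VOWELS, SONORANTS = set("aeiou"), set("mnrlwy")
VOWEL, SONORANT, OBSTRUENT = 3, 2, 1  # assumed ranking criterion: sonority

def to_phonemes(word):
    """Voice conversion subsystem: vocabulary item -> first phoneme sequence."""
    return list(LEXICON[word.lower()])

def deform(phonemes, rules):
    """Speech deformer: apply rewrite rules to form the second phoneme sequence."""
    out, i = [], 0
    while i < len(phonemes):
        for pattern, replacement in rules.items():
            if tuple(phonemes[i:i + len(pattern)]) == pattern:
                out.extend(replacement)
                i += len(pattern)
                break
        else:
            out.append(phonemes[i])
            i += 1
    return out

def rank(phonemes):
    """Evaluator: rank each phoneme by the assumed sonority scale."""
    return [VOWEL if p in VOWELS else SONORANT if p in SONORANTS else OBSTRUENT
            for p in phonemes]

def syllabify(phonemes, ranks):
    """Syllable decomposer: split where a new onset or nucleus begins."""
    syllables, current, has_nucleus = [], [], False
    for i, (p, r) in enumerate(zip(phonemes, ranks)):
        next_is_vowel = i + 1 < len(ranks) and ranks[i + 1] == VOWEL
        if has_nucleus and (r == VOWEL or next_is_vowel):
            syllables.append(current)
            current, has_nucleus = [], False
        current.append(p)
        if r == VOWEL:
            has_nucleus = True
    if current:
        syllables.append(current)
    return syllables

def front_end(vocabulary):
    """Run all four stages on each vocabulary item in turn."""
    result = []
    for word in vocabulary:
        second = deform(to_phonemes(word), RULES)
        result.append(syllabify(second, rank(second)))
    return result

syllables = front_end(["is", "Sakura"])
```

The syllable boundaries produced this way could then guide timing and intonation, which is the weakness of conventional synthesis noted above.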
[0059]
Further, the interactive doll with a translation function according to the present invention preferably has the following function: the voice receiving means receives a keyword spoken by the user in the first language, the voice recognition means (a) recognizes the keyword, and the voice uttering means utters a question in the second language stored in advance in correspondence with the keyword; thereafter, the first voice by the user is converted into the second voice, and the second voice or the third voice is uttered to interact with the user.
Those skilled in the art can appropriately create a program for such a function and incorporate the program into the speech recognition unit, the speech translation unit, and the speech synthesis unit.
[0060]
Further, it is also effective that the voice recognition means (a) recognizes a specific part of the first voice, the voice translation means (b) translates the specific part into voice based on the second language, and the voice synthesis means (c) applies the result of the translation to a voice answer pattern based on the second language, stored in advance in correspondence with the question, and synthesizes the second voice.
[0061]
With this configuration, the voice conversion unit can convert the first text constituting the first voice in the first language into the second text in the second language by a so-called pattern translation method. The pattern translation method is insufficient for translating long sentences but is effective for processing short ones. It is therefore well suited to young children, for whom this is an important period for elementary foreign language education.
[0062]
In this case, the speech translating means (b) may include (b-1) storage means for storing the speech answer pattern and a plurality of vocabularies based on the second language that are expected as the vocabulary constituting the specific part in response to the question, and (b-2) selecting means for selecting the vocabulary based on the second language that corresponds to the vocabulary constituting the specific part recognized by the voice recognizing means (a). The speech synthesis means (c) may then synthesize speech of the selected second-language vocabulary, apply the speech to the speech answer pattern, and synthesize the second speech.
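The keyword, question and answer-pattern flow can be sketched with a small table. The keyword, the question text, the answer pattern, the candidate vocabulary and the romanized input words are all hypothetical examples; only the overall structure (stored pattern plus a selected vocabulary item) follows the description above.

```python
# Storage means (b-1), sketched: for each keyword, a stored second-language
# question, a second-language answer pattern with one slot, and the
# vocabulary expected to fill that slot. All entries are illustrative.
PATTERNS = {
    "name": {
        "question": "What is your name?",
        "answer": "My name is {}.",
        "slot_vocab": {"sakura": "Sakura", "taro": "Taro"},
    },
}

def ask(keyword):
    """Utter the stored second-language question for a recognized keyword."""
    return PATTERNS[keyword]["question"]

def answer(keyword, first_voice_words):
    """Selecting means (b-2) plus synthesis: fill the answer pattern.

    The 'specific part' is taken to be the first recognized word that
    appears in the expected vocabulary; None is returned if there is none.
    """
    entry = PATTERNS[keyword]
    for word in first_voice_words:
        if word in entry["slot_vocab"]:
            return entry["answer"].format(entry["slot_vocab"][word])
    return None

question = ask("name")
reply = answer("name", ["watashi", "no", "namae", "wa", "sakura", "desu"])
```

Because only the slot word is translated, the rest of the user's sentence does not need to be parsed, which is why the approach suits short sentences.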
[0063]
[Effect of the invention]
According to the present invention, a voice translation system is provided in a doll, and a voice uttered by a user is converted into a voice based on a different language and uttered. This makes it possible to provide an interactive doll with a translation function that satisfies the user's playfulness and can be used for learning elementary foreign language conversation. That is, the present invention can provide an interactive doll with a translation function that has not existed in the past.
[Brief description of the drawings]
FIG. 1 is a diagram for schematically explaining functions of an interactive doll with a translation function according to the present invention.
FIG. 2 is a diagram showing a configuration of a speech conversion means (translation system module) used in the present invention.
FIG. 3 is a diagram showing a configuration of another speech conversion means (translation system module) used in the present invention.
FIG. 4 is a diagram showing a configuration of a voice recognition means (system) including a voice phoneme identification means used in the present invention.
FIG. 5 is a diagram showing in more detail the configuration of a speech recognition means (system) including a speech phoneme identification means used in the present invention.

Claims

1. An interactive doll with a translation function, characterized by receiving a first voice based on a first language from a user, converting the first voice into a second voice based on a second language, and then uttering the converted second voice, or a third voice based on the second language stored in advance in correspondence with the first voice, to interact with the user.

2. The interactive doll with a translation function according to claim 1, comprising:
An ear portion having voice receiving means for receiving the first voice,
A portion having voice conversion means for converting the first voice into the second voice, and
A mouth portion having voice uttering means for uttering the second voice or the third voice.

3. The interactive doll with a translation function according to claim 1, wherein the first voice and the second voice are words.

4. The interactive doll with a translation function according to claim 2,
wherein the voice conversion means includes:
(A) voice recognition means for recognizing the first voice,
(B) voice translation means for translating the recognized first voice into the second language, and
(C) voice synthesis means for synthesizing the second voice based on a result of the translation.

5. The interactive doll with a translation function according to claim 4,
wherein the voice recognition means (a) recognizes the first voice as a phoneme sequence.

6. The interactive doll with a translation function according to claim 5,
wherein the voice recognition means (a) includes:
Audio processor means for receiving an audio signal of the first voice and converting the audio signal into a corresponding electrical signal;
Analog-to-digital converter means for digitizing the electrical signal at a predetermined sampling rate to form a digitized audio signal; and
Voice phoneme identification means including means for performing time domain analysis on subdivided portions of the digitized audio signal to identify a plurality of time domain characteristics of the audio signal, means for filtering each of the subdivided portions using a plurality of filter bands having predetermined high and low cutoff frequencies and identifying at least one frequency domain characteristic of each of the subdivided portions, and means for processing the time domain characteristics and the frequency domain characteristics to identify phonemes contained in the audio signal.

7. The interactive doll with a translation function according to claim 5 or 6,
wherein the voice translation means (b) translates the recognized phoneme sequence into a vocabulary sequence based on the second language.

8. The interactive doll with a translation function according to claim 7,
wherein the speech synthesis means (c) synthesizes the second voice by computer processing of the vocabulary sequence.

9. The interactive doll with a translation function according to claim 8,
wherein the voice synthesis means (c) comprises:
A speech conversion subsystem that receives a vocabulary sequence based on the second language and converts the vocabulary sequence into a first phoneme sequence;
A speech deformer that receives a transformation rule and applies it to the first phoneme sequence to form a second phoneme sequence;
An evaluator that ranks the phonemes included in the second phoneme sequence based on a predetermined criterion; and
A syllable decomposer that receives the second phoneme sequence and decomposes the phonemes included in the second phoneme sequence into syllables using the ranking.

10. The interactive doll with a translation function according to any one of claims 1 to 9,
wherein the voice receiving means receives a keyword spoken by the user in the first language, the voice recognition means (a) recognizes the keyword, and the voice uttering means utters a question based on the second language stored in advance in correspondence with the keyword, and
thereafter, the first voice by the user is converted into the second voice, and the second voice or the third voice is uttered to interact with the user.

11. The interactive doll with a translation function according to claim 10, wherein:
The voice recognition means (a) recognizes a specific part of the first voice,
The voice translation means (b) translates the specific part into voice based on the second language, and
The speech synthesis means (c) applies the result of the translation to a speech answer pattern based on the second language, stored in advance in correspondence with the question, and synthesizes the second voice.

12. The interactive doll with a translation function according to claim 11, wherein:
The voice translation means (b) includes
(B-1) storage means for storing the speech answer pattern and a plurality of vocabularies based on the second language that are expected as the vocabulary constituting the specific part in response to the question, and
(B-2) selecting means for selecting the vocabulary based on the second language that corresponds to the vocabulary constituting the specific part recognized by the voice recognition means (a); and
The speech synthesis means (c) synthesizes speech of the selected second-language vocabulary, applies the speech to the speech answer pattern, and synthesizes the second voice.