JP2004271620A

JP2004271620A - Mobile terminal

Info

Publication number: JP2004271620A
Application number: JP2003058638A
Authority: JP
Inventors: Shigeo Ota; 滋雄太田
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2003-03-05
Filing date: 2003-03-05
Publication date: 2004-09-30

Abstract

PROBLEM TO BE SOLVED: To provide a mobile terminal that lets a user casually enjoy adding a completely different type of voice to a moving image of a person speaking.

SOLUTION: The mobile terminal comprises: an extracting means for extracting a change in the shape of the lips of a person or animal from a moving image in which the lips are photographed; a character information storage means for storing the lip shape information extracted by the extracting means in association with character information; a character information generating means for generating, according to the character information storage means, character information from the lip shape information extracted by the extracting means; a speech information storage means for storing the character information in association with speech information; a speech data generating means for generating, according to the speech information storage means, speech data of an arbitrary timbre from the character information generated by the character information generating means; and a pronouncing means for pronouncing the speech data as a voice.

COPYRIGHT: (C)2004,JPO&NCIPI

Description

[0001]
[Technical field of the invention]
The present invention relates to a mobile terminal that extracts pronunciation information from moving image data.
[0002]
[Prior art]
In recent years, with the development of communication technology, mobile terminals such as mobile phones and PDAs (Personal Digital Assistants) have become widespread, especially among young people. In response, various applications are provided in addition to the basic call function. Recently, with the emergence of mobile terminals that can handle moving images, applications related to moving images have attracted particular attention.
[0003]
One conceivable example is a playful application that replaces only the audio portion of a moving image, for example to make a man speak a woman's words or to change the intonation or accent. However, making such changes by processing the audio data, as in the conventional approach, places a heavy load on signal processing and the like. In addition, changing a human voice into an animal's voice is difficult in practice.
[0004]
Meanwhile, a technique has been proposed in which the shape of a photographed person's lips is recognized from the image, voice data is extracted from a database that stores lip shape patterns in association with voice data in advance, and the extracted voice data is transmitted as character data (see, for example, Patent Document 1).
[0005]
[Patent Document 1]
JP-A-2000-68882 (pages 2-4, FIG. 1)
[0006]
[Problems to be solved by the invention]
However, that technique photographs the shape of the user's lips in real time when the user makes a call in a public place or other setting where speaking aloud is not possible, extracts voice data from the captured lip shape pattern, and outputs it as character data. It does not add a voice different from that of the photographed person and sound it.
[0007]
The present invention has therefore been made in view of the problems described above, and its object is to provide a mobile terminal that lets a user casually enjoy adding a completely different type of voice to an image of a person speaking.
[0008]
[Means for Solving the Problems]
In order to solve the above problems, the present invention proposes the following means.
The invention according to claim 1 proposes a mobile terminal comprising: extracting means for extracting a change in the shape of the lips from a moving image showing the lips of a person or an animal; character information storage means for storing the lip shape information extracted by the extracting means in association with character information; character information generating means for generating, based on the character information storage means, character information from the lip shape information extracted by the extracting means; voice information storage means for storing the character information in association with voice information; voice data generating means for generating, based on the voice information storage means, voice data of an arbitrary timbre from the character information generated by the character information generating means; and sounding means for sounding the voice data as voice.
[0009]
According to this invention, the extracting means extracts the change in the shape of the lips from the captured image data of the lips of the person or animal. The extracted lip shape is converted into character information by the character information storage means and the character information generating means. The converted character information is converted into voice information by the voice information storage means and the voice information generating means, and the voice information is sounded as voice by the sounding means.
[0010]
[Embodiments of the invention]
Hereinafter, a mobile terminal according to an embodiment of the present invention will be described in detail with reference to FIGS. 1 to 7.
As shown in FIG. 1, the mobile terminal according to the embodiment of the present invention includes a CPU (Central Processing Unit) 1, a communication unit 2, an antenna 3, a microphone 4, an ear speaker 5, an audio processing unit 6, a sound source 7 with voice synthesis function, a speaker 8, an operation key unit 9, a camera 10, a RAM (Random Access Memory) 11, a ROM (Read Only Memory) 12, a display unit 13, and a vibrator 14.
[0011]
The CPU 1 controls the operation of each unit of the mobile terminal by executing various controls and processes related to the system of the mobile terminal. In the present embodiment, the CPU 1 extracts a change in the shape of the lips from the moving image captured by the camera 10 and converts it into character information using a data table in the RAM 11 that stores lip shape information in association with character information. It then converts the resulting character information into voice information using a data table in the RAM 11 that stores character information in association with voice information, and outputs this voice information to the sound source 7 with voice synthesis function.
[0012]
The communication unit 2 receives communication information including voices, characters, images, and the like, converts these into electrical signals, and converts electrical signals, such as voices, characters, and images, into communication information and transmits them. The antenna 3 receives communication information transmitted from an external base station, outputs the communication information to the communication unit 2, and transmits the communication information input from the communication unit 2 to the external base station. The microphone 4 is a voice input unit for inputting a voice or the like emitted by a speaker. The ear speaker 5 is an audio output unit that outputs audio transmitted from an external base station.
[0013]
The audio processing unit 6 converts an audio-related electric signal input from the communication unit 2 into an analog signal and supplies it to the ear speaker 5, and converts an analog audio signal input from the microphone 4 into a digital audio signal and outputs it to the communication unit 2. The sound source 7 with voice synthesis function interprets the voice information input from the CPU 1, reproduces sounds such as "a", "i", ..., "n" in an arbitrary timbre (for example, a male voice, a female voice, or a child's voice), and outputs them to the speaker 8. When an input button of the operation key unit 9 is operated, the sound source outputs a specific confirmation sound or the like corresponding to that button. When a call arrives, it reproduces music information registered in advance and outputs the ringtone melody via the speaker 8. The details of the speech synthesis will be described later.
[0014]
When external communication information is received, the speaker 8 outputs a ringtone preset by the user according to a signal from the sound source 7 with voice synthesis function. The operation key unit 9 is an input unit for entering characters, numerals, and the like. In the present embodiment, it is also used to select the moving image from which voice information is obtained and to select the timbre to be added. The camera 10 is a device for capturing still images and moving images. In the present embodiment, in addition to its general functions, it is used to create the lip shape database. The RAM 11 is a rewritable storage unit that temporarily stores information such as downloaded music data and received e-mails. In the present embodiment, it stores the data table that associates lip shape information with character information and the data table that associates character information with voice information.
[0015]
The ROM 12 is a non-rewritable storage device that stores programs executed by the CPU 1, such as a control program for transmitting and receiving communication information, a control program for reproducing music data, and a control program for voice synthesis. The display unit 13 is an output unit that displays data input from the operation key unit 9, incoming call information, received character information, image information, and the like. The vibrator 14 notifies the user of an incoming call or an incoming e-mail by vibration. These components are connected by a data bus, over which signals are exchanged between them.
[0016]
Next, the relationship between the shape of the lip and the character information will be described.
Two main methods can be considered for converting the shape of the lips into character information. The first method photographs the lip shape of a particular person one character at a time and builds a database of the relationship between lip shape and character information. If the image shows the user himself or herself, this method can convert to character information relatively accurately, but it lacks versatility. The second method converts a general lip shape or lip movement into character information. This method is less accurate than the first method but is highly versatile.
[0017]
In the second method, as shown in FIG. 2, character information is associated with general lip shapes. For example, a shape in which the mouth is opened wide corresponds to "a", a horizontally elongated shape to "i", a small pursed shape to "u", and a shape that opens wide from a closed state to "ma". Such a change in lip shape can be extracted from the image data by obtaining the difference value between the image data of temporally adjacent frames. This works because, in an image of a person speaking, the difference between adjacent frames consists almost entirely of the lip shape (see FIG. 3). Even for an image of an animal, the movement of the animal opening and closing its mouth can be compared against the database of human lips and converted into character information or voice information.
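The frame-difference idea in this paragraph can be sketched as follows. This is only an illustrative sketch, not the method disclosed in the application: frames are assumed to be small grayscale grids (lists of rows), and the function names, the threshold of 10, and the size rules in `classify_shape` are hypothetical choices made for the example.

```python
def frame_difference(prev, curr):
    """Per-pixel absolute difference of two equally sized grayscale frames."""
    return [[abs(c - p) for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)]


def changed_bounding_box(diff, threshold=10):
    """Return (top, left, bottom, right) of pixels above threshold, or None."""
    coords = [(y, x)
              for y, row in enumerate(diff)
              for x, value in enumerate(row)
              if value > threshold]
    if not coords:
        return None  # no change between the two frames
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return (min(ys), min(xs), max(ys), max(xs))


def classify_shape(box):
    """Map the changed region to a character (hypothetical size rules)."""
    top, left, bottom, right = box
    height = bottom - top + 1
    width = right - left + 1
    if height >= 3 and width >= 3:
        return "a"  # mouth opened wide
    if width >= 3:
        return "i"  # horizontally elongated
    return "u"      # small, pursed


# Usage: a 5x5 frame in which only a thin horizontal strip changes.
previous = [[0] * 5 for _ in range(5)]
current = [row[:] for row in previous]
current[2][1:4] = [200, 200, 200]
box = changed_bounding_box(frame_difference(previous, current))
character = classify_shape(box)
```

A real implementation would compare the difference image against stored patterns rather than a bounding box, as the later flow description suggests.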
[0018]
Next, details of speech synthesis in the sound source with speech synthesis function 7 will be described with reference to FIGS. 4 to 7.
As shown in FIG. 4, the sound source 7 with voice synthesis function includes an input/output I/F (interface) 21, a FIFO (First-In First-Out memory) 22, a sequencer 23, an FM sound source 24, a WT sound source 25, a waveform memory 26, and an adder 27. The input/output I/F 21 is an interface circuit that receives voice information, related commands, and various voice synthesis parameters from the CPU 1 via the data bus, and outputs notifications such as the status of the FIFO 22, described later, to the CPU 1.
[0019]
The FIFO 22 is a circuit including a storage device. It temporarily holds the sequence data it is given and supplies the held data to the sequencer 23 in order. The sequencer 23 responds to commands from the CPU 1 such as sounding start and sounding end. When starting sounding, it interprets the voice information received from the FIFO 22 and, at predetermined timings, supplies various parameters and control signals to the FM sound source 24 or the WT sound source 25, described later, to drive them.
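As a rough software analogy of the FIFO-to-sequencer hand-off, the sketch below buffers events and drains them in arrival order. The class and function names, the capacity limit, and the `drive` callback are assumptions made for illustration; the actual hardware interface is not described at this level in the text.

```python
from collections import deque


class Fifo:
    """Bounded first-in first-out buffer standing in for the FIFO 22."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()

    def push(self, item):
        """Store an item; return False when the buffer is full."""
        if len(self._items) >= self.capacity:
            return False
        self._items.append(item)
        return True

    def pop(self):
        """Return the oldest item, or None when the buffer is empty."""
        return self._items.popleft() if self._items else None


def run_sequencer(fifo, drive):
    """Drain the FIFO in order, passing each event to a sound-source callback."""
    handled = 0
    while (event := fifo.pop()) is not None:
        drive(event)
        handled += 1
    return handled


# Usage: two events fit, the third is rejected because the buffer is full.
fifo = Fifo(capacity=2)
accepted = [fifo.push("a"), fifo.push("i"), fifo.push("u")]
played = []
handled = run_sequencer(fifo, played.append)
```

The rejected push is the kind of condition the status notification from the input/output I/F would report to the CPU.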
[0020]
As is well known, the WT sound source 25 faithfully reproduces voices and the like by reading out, once or repeatedly, waveform data that was obtained by digitally recording various instrument sounds, voices, and the like and stored in advance in the waveform memory 26. The outputs of the FM sound source 24 and the WT sound source 25 are added by the adder 27, and the sum is converted into an analog signal by a digital/analog converter (not shown) and supplied to the speaker 8. In a sound source device, each sound source is normally driven via the FIFO 22 and the sequencer 23. For sound effects and the like that require a real-time response, however, the CPU 1 drives the FM sound source 24 and the WT sound source 25 directly, bypassing the FIFO 22 and the sequencer 23. The waveform memory 26 is a ROM.
[0021]
Next, the FM sound source 24 will be described.
The FM sound source 24 is configured as shown in FIG. 6 by combining a plurality of the operators 30 shown in FIG. 5 with adders. As shown in FIG. 5, one operator 30 includes a SIN waveform table 31, a phase generator (PG) 32, an adder 33, an envelope generator (EG) 34, and a multiplier 35. The SIN waveform table 31 is a data table that stores each phase point of the SIN waveform (sine wave) in association with the amplitude value of the waveform at that phase point. The phase generator (PG) 32 receives a frequency parameter from the sequencer 23 or the CPU 1 and, based on it, generates a phase address signal that controls the frequency and phase of the SIN waveform data output from the SIN waveform table 31.
[0022]
The adder 33 adds the input signal of the operator 30 and the phase address signal, and supplies the sum to the SIN waveform table 31. The envelope generator (EG) 34 receives an amplitude parameter from the sequencer 23 or the CPU 1, generates an envelope signal (amplitude coefficient) that controls the amplitude of the waveform output from the operator 30, and outputs it to the multiplier 35. The multiplier 35 multiplies the output of the SIN waveform table 31 by the output of the envelope generator (EG) 34. In the operator 30 configured in this way, the amplitude values of the SIN waveform stored in the SIN waveform table 31 are read out sequentially according to a signal, supplied via the adder 33, that includes the phase address signal. The operator 30 can therefore change the pitch by changing the speed at which the amplitude values stored in the SIN waveform table 31 are read, that is, by appropriately controlling the phase address signal supplied to the SIN waveform table 31.
[0023]
For example, a slower reading speed produces a lower sound, and a faster reading speed produces a higher sound. On receiving a reset signal from the CPU 1, the phase generator (PG) 32 returns the address read from the SIN waveform table 31 to its initial value, thereby resetting the phase address signal it outputs. The FM sound source 24 generates a variety of sounds by combining a plurality of operators 30 and adders in various ways, for example by connecting several operators 30 in cascade as shown in FIG. 6(a), or by additionally using an adder to sum the outputs of operators 30 as shown in FIG. 6(b).
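The operator described above (sine table, phase generator, adder, envelope, multiplier) can be modeled in a few lines. This is a minimal sketch under stated assumptions: the table size of 1024, the constant envelope, the scaling of the modulation input, and the names `Operator`, `cascade`, and `parallel` are all hypothetical and not taken from the application.

```python
import math

TABLE_SIZE = 1024
# SIN waveform table: one sine cycle sampled at TABLE_SIZE phase points.
SIN_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]


class Operator:
    """One FM operator with a constant envelope (a simplifying assumption)."""

    def __init__(self, phase_step, amplitude):
        self.phase_step = phase_step  # table entries advanced per sample
        self.amplitude = amplitude    # envelope value fed to the multiplier
        self.phase = 0.0

    def reset(self):
        """Return the read address to its initial value."""
        self.phase = 0.0

    def next_sample(self, modulation=0.0):
        # Adder: operator input plus phase address selects the table entry.
        address = int(self.phase + modulation * TABLE_SIZE) % TABLE_SIZE
        # Multiplier: table output times envelope.
        sample = SIN_TABLE[address] * self.amplitude
        self.phase = (self.phase + self.phase_step) % TABLE_SIZE
        return sample


def cascade(modulator, carrier):
    """Cascade connection: the modulator output feeds the carrier input."""
    return carrier.next_sample(modulator.next_sample())


def parallel(operators):
    """Parallel connection: operator outputs are summed by an adder."""
    return sum(op.next_sample() for op in operators)


# Usage: a larger phase_step reads the table faster and gives a higher pitch.
high = Operator(phase_step=256, amplitude=0.5)   # 4 samples per cycle
high_samples = [high.next_sample() for _ in range(4)]
```

Halving `phase_step` doubles the number of samples per cycle, which corresponds to the lower sound obtained with a slower reading speed.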
[0024]
Next, a method of realizing CSM (composite sinusoidal model) speech synthesis using the hardware of the FM sound source according to the present embodiment will be described. Before that, the principle of CSM speech synthesis is explained.
In general, speech can be regarded as nearly stationary over a short time span. CSM speech synthesis therefore treats the speech spectrum as constant over a short span. Specifically, a short segment of speech lasting several ms to several tens of ms is regarded as stationary and is represented as the sum of a few sine waves. In discrete time, the speech time series {X_t} is expressed as
$$X_t = A_1 \sin \omega_1 t + \cdots + A_n \sin \omega_n t \qquad (1)$$
where t is an integer representing discrete time, n is the number of sine wave components (usually 4 to 6), ω_i is the angular frequency of the i-th sine wave component (0 ≤ ω_i ≤ π), and A_i is the amplitude of the i-th sine wave component.
[0025]
In this CSM speech synthesis, the parameters {ω_1 ... ω_n, A_1 ... A_n} are given to the model of equation (1), and the synthesized speech sequence {X_t} is obtained from equation (1) for each time t. For voiced sounds (vowels, voiced consonants, and the like), which are periodic, the time t in equation (1) is reset to zero at every period (pitch period) to initialize the phase. Unvoiced sounds are not periodic, so a random period is given instead: the time t is reset at random intervals and the phase is thereby initialized at random. The time series of the speech signal synthesized in this way is close to a human voice.
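Equation (1) with the phase-reset rule can be written out directly. The sketch below is illustrative only: the function name, the range of random reset intervals (20 to 80 samples), and the fixed seed are assumptions, and the parameter values in the usage lines are arbitrary rather than measured speech data.

```python
import math
import random


def csm_synthesize(omegas, amps, num_samples, pitch_period=None,
                   random_period_range=(20, 80), seed=0):
    """Evaluate X_t = sum_i A_i * sin(omega_i * t) with periodic phase resets.

    pitch_period given -> voiced: t is reset to zero every pitch_period samples.
    pitch_period None  -> unvoiced: t is reset at random intervals.
    """
    rng = random.Random(seed)

    def next_period():
        if pitch_period is not None:
            return pitch_period
        return rng.randint(*random_period_range)

    period = next_period()
    t = 0
    samples = []
    for _ in range(num_samples):
        if t >= period:
            t = 0                    # reset time, which initializes the phase
            period = next_period()
        samples.append(sum(a * math.sin(w * t) for w, a in zip(omegas, amps)))
        t += 1
    return samples


# Usage: one component, voiced with a pitch period of 4 samples.
voiced = csm_synthesize([0.5], [1.0], num_samples=8, pitch_period=4)
# Unvoiced: same model, but the reset interval varies at random.
unvoiced = csm_synthesize([0.5, 1.2], [1.0, 0.3], num_samples=50)
```

With a pitch period of 4, the waveform repeats every four samples, so `voiced[1]` and `voiced[5]` are the same value.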
[0026]
Next, a method of realizing this CSM speech synthesis using the hardware of the FM sound source 24 will be described with reference to FIG. 7.
Each sine wave component of equation (1) can be generated using the operator 30 described above. The input signal of each operator 30 is set to zero, and a frequency parameter for reading the sine waveform from the SIN waveform table 31 is given to the phase generator (PG) 32, so that the SIN waveform table 31 outputs the sine wave for that component as a time series. The multiplier 35 in the next stage then applies the amplitude supplied by the envelope generator (EG) 34, so that each operator 30 outputs the signal of one sine wave component of equation (1).
[0027]
Adding these outputs with the adder 50 yields the synthesized speech signal sequence {X_t}. In CSM speech synthesis, the time t is reset to zero at every pitch period for voiced sounds and at random intervals for unvoiced sounds in order to initialize the phase. This phase initialization is performed by giving a reset signal to the phase generator (PG) 32 at the respective intervals.
[0028]
As described above, in CSM speech synthesis using the FM sound source 24, a formant sound is synthesized from three elements: the frequency parameter and the reset signal given to the phase generator (PG) 32, and the amplitude parameter given to the envelope generator (EG) 34. Combining a plurality of such formant sounds determines a phoneme, and speech is synthesized from the phonemes. For example, to synthesize "sakura", multiple sets of these three elements are set every several ms to several tens of ms, so that the six phonemes /S/ → /A/ → /K/ → /U/ → /R/ → /A/ are synthesized and sounded.
[0029]
The three elements given to each operator 30 are defined in advance for each phoneme and registered in the ROM 12. Information on the phonemes that make up each character, for example that the character "sa" consists of the phonemes /S/ and /A/, is likewise registered in the ROM 12.
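The two registered tables (character to phonemes, phoneme to operator parameters) amount to a pair of lookups. In the sketch below only the "sa" to /S/, /A/ mapping and the "sakura" phoneme sequence come from the text; the table layout, the function names, and every numeric value are placeholders invented for illustration and are not real formant data.

```python
# Character -> phonemes ("sa" -> /S/, /A/ is the example given in the text).
CHAR_TO_PHONEMES = {
    "sa": ["S", "A"],
    "ku": ["K", "U"],
    "ra": ["R", "A"],
}

# Phoneme -> the three elements per operator: frequency parameters,
# reset behavior, and amplitude parameters. All numbers are placeholders.
PHONEME_PARAMS = {
    "S": {"reset": "random", "freqs": [0.9, 1.4], "amps": [0.2, 0.1]},
    "K": {"reset": "random", "freqs": [0.7, 1.1], "amps": [0.3, 0.1]},
    "R": {"reset": "pitch", "freqs": [0.2, 0.6], "amps": [0.5, 0.2]},
    "A": {"reset": "pitch", "freqs": [0.3, 0.5], "amps": [1.0, 0.6]},
    "U": {"reset": "pitch", "freqs": [0.1, 0.4], "amps": [0.8, 0.3]},
}


def characters_to_phonemes(characters):
    """Expand each character into its registered phoneme sequence."""
    phonemes = []
    for character in characters:
        phonemes.extend(CHAR_TO_PHONEMES[character])
    return phonemes


def phonemes_to_parameters(phonemes):
    """Look up the operator parameter set registered for each phoneme."""
    return [PHONEME_PARAMS[phoneme] for phoneme in phonemes]


# Usage: "sakura" as three characters becomes six phonemes.
sakura_phonemes = characters_to_phonemes(["sa", "ku", "ra"])
sakura_parameters = phonemes_to_parameters(sakura_phonemes)
```

Keeping the two tables separate mirrors the text: the character table can be edited without touching the per-phoneme synthesis parameters.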
[0030]
In the present embodiment, CSM speech synthesis is performed using the FM sound source 24, but speech synthesis using the WT sound source 25 is of course also possible. For example, to synthesize "sakura", the syllables "sa", "ku", and "ra" may be digitally recorded, stored in memory, and played back. However, performing CSM speech synthesis with the FM sound source 24 is more advantageous because it requires fewer parameters.
[0031]
Next, the procedure for generating voice data from image data in the mobile terminal according to the present embodiment will be described with reference to FIG. 8.
First, for the image data of each frame captured by the camera 10, the CPU 1 obtains the difference value between the preceding and following frames and thereby determines whether the image data has changed (step 101). If it determines that the image data has not changed, it continues detecting the difference value between consecutive frames until a change is found.
[0032]
If it determines that the image data has changed, it searches for the difference value data (pattern data) in a pattern database that stores pattern data in association with character information (step 102). If the search finds pattern data matching the difference value data (step 103), the character information corresponding to it is determined (step 104). If no matching pattern data is found, the difference value between consecutive frames is detected again to obtain the next difference value.
[0033]
Once the character information is determined, the voice data corresponding to this character information is retrieved from the data table stored in the RAM 11, the preset timbre is added to it, and the result is output to the sound source 7 with voice synthesis function (step 105). Next, it is checked whether the image data has ended (step 106). If the image data has ended, the process terminates. If image data still remains, detection of difference value data continues for the remaining image data.
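Steps 101 to 106 can be condensed into one loop. This sketch makes several simplifying assumptions that are not in the text: frames are flat tuples of integers, the pattern database is a dictionary looked up by exact match on the difference tuple, and the names `frames_to_speech`, `pattern_db`, `voice_table`, and `emit` are hypothetical.

```python
def frames_to_speech(frames, pattern_db, voice_table, timbre, emit):
    """Walk the frames once and emit (voice data, timbre) per recognized shape."""
    previous = None
    for frame in frames:                      # step 106: repeat until frames end
        if previous is not None:
            difference = tuple(c - p for p, c in zip(previous, frame))
            if any(difference):               # step 101: did the image change?
                character = pattern_db.get(difference)   # steps 102 and 103
                if character is not None:     # step 104: character determined
                    # step 105: look up voice data and attach the timbre
                    emit((voice_table[character], timbre))
        previous = frame


# Usage with toy two-pixel frames.
frames = [(0, 0), (1, 0), (1, 0), (1, 1), (5, 5)]
pattern_db = {(1, 0): "a", (0, 1): "i"}
voice_table = {"a": "voice-A", "i": "voice-I"}
output = []
frames_to_speech(frames, pattern_db, voice_table, "female", output.append)
```

In this run the third frame produces no change and the last difference has no matching pattern, so only two voice events are emitted.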
[0034]
In the processing flow described above, the processing from pattern data comparison to sounding must be completed in a short time so that the voice is sounded in real time. Alternatively, the character information alone may first be obtained for the entire image data and stored in memory, appropriate voice data may then be generated from that character information, and the voice data may be added to the corresponding image frames.
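The non-real-time alternative separates recognition from audio generation. The following sketch shows one way this two-pass arrangement might look, again assuming toy frames as flat tuples and an exact-match pattern dictionary; `collect_characters` and `attach_audio` are hypothetical names.

```python
def collect_characters(frames, pattern_db):
    """First pass: record (frame index, character) for every recognized change."""
    recognized = []
    for index in range(1, len(frames)):
        difference = tuple(c - p for p, c in zip(frames[index - 1], frames[index]))
        character = pattern_db.get(difference) if any(difference) else None
        if character is not None:
            recognized.append((index, character))
    return recognized


def attach_audio(recognized, voice_table, timbre):
    """Second pass: map each recognized frame index to its voice data and timbre."""
    return {index: (voice_table[character], timbre)
            for index, character in recognized}


# Usage: recognition is stored first, and audio is generated afterwards.
frames = [(0, 0), (1, 0), (1, 0), (1, 1)]
pattern_db = {(1, 0): "a", (0, 1): "i"}
stored = collect_characters(frames, pattern_db)
audio_by_frame = attach_audio(stored, {"a": "voice-A", "i": "voice-I"}, "child")
```

Because the stored character list is independent of the timbre, the same recognition result can be re-voiced with a different timbre without re-scanning the frames.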
[0035]
The embodiments of the present invention have been described in detail above with reference to the drawings. However, the specific configuration is not limited to these embodiments, and design changes and the like that do not depart from the gist of the present invention are also included. For example, the present embodiment describes a method that converts the change in lip shape into character information and then converts this character information into voice information, but a method that converts the change in lip shape directly into voice information may also be used.
[0036]
[Effects of the invention]
As described above, according to the present invention, character information for pronunciation is assigned from the image data of a moving image, so that during voice synthesis the voice can be reproduced in an arbitrary timbre such as a male voice or a female voice. In addition, character information can be extracted from the movement of an animal's mouth and sounded as a human voice.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a mobile terminal according to an embodiment.
FIG. 2 is a diagram showing the relationship between lip shape and character information according to the embodiment.
FIG. 3 is a diagram illustrating a method for extracting character information from image information according to the embodiment.
FIG. 4 is a structural diagram of a ring tone sound source according to the embodiment.
FIG. 5 is a structural diagram of an FM sound source according to the embodiment.
FIG. 6 is a diagram illustrating examples of operator combinations in the FM sound source according to the embodiment.
FIG. 7 is a configuration diagram of an FM sound source when performing speech synthesis by CSM speech synthesis according to the present embodiment.
FIG. 8 is a flowchart illustrating a procedure for generating audio data from image data according to the embodiment.
[Explanation of symbols]
1 ... CPU, 2 ... communication unit, 3 ... antenna, 4 ... microphone, 5 ... ear speaker, 6 ... audio processing unit, 7 ... sound source with voice synthesis function, 8 ... speaker, 9 ... operation key unit, 10 ... camera, 11 ... RAM, 12 ... ROM, 13 ... display unit, 14 ... vibrator

Claims

A mobile terminal comprising: extracting means for extracting a change in the shape of the lips from a moving image showing the lips of a person or an animal; character information storage means for storing the lip shape information extracted by the extracting means in association with character information; character information generating means for generating, based on the character information storage means, character information from the lip shape information extracted by the extracting means; voice information storage means for storing the character information in association with voice information; voice data generating means for generating, based on the voice information storage means, voice data of an arbitrary timbre from the character information generated by the character information generating means; and sounding means for sounding the voice data as voice.