JP3634687B2 - Information communication system

Info

Publication number: JP3634687B2
Application number: JP25764799A
Authority: JP
Inventors: 雅一西本; 俊和金子
Original assignee: MegaChips Corp
Current assignee: MegaChips Corp
Priority date: 1999-09-10
Filing date: 1999-09-10
Publication date: 2005-03-30
Anticipated expiration: 2019-09-10
Also published as: JP2001086497A

Description

【０００１】
【発明の属する技術分野】
この発明は、情報通信システムに関する。
【０００２】
【従来の技術】
近年の通信技術の発展に伴って、通信による様々な情報の提供サービスが試みられている。例えば一般公衆電話回線等を利用したインターネットを通じたディジタルコンテンツの提供もそのひとつである。
【０００３】
このようなディジタルコンテンツには、文字情報の他、音声情報や静止画及び動画を含む画像情報があり、例えば動画像である画像情報に合わせて音声情報が変化するような様々な組み合わせにより多様な情報提供を行うことが可能となっている。
【０００４】
【発明が解決しようとする課題】
ところで、インターネット等の一般公衆電話回線等のネットワークを通じて上記の多様な情報を提供する場合、ネットワークのデータ伝送速度の制限により、送受信される画像情報の画質及び音声情報の音質の向上は制約を受けてしまう。
【０００５】
通信によりリアルタイムで動画像を送受信する技術例としては、例えばＩＳＯ／ＩＥＣＪＴＣ１／ＳＣ２９／ＷＧ１１によって国際標準化が進められている超低ビットレートに適したＭＰＥＧ−４と呼ばれるディジタル動画像符号化方式が提案されている。このＭＰＥＧ−４は、６４ｋｂｐｓ以下の低ビットレートの動画像圧縮に適しており、ファイルサイズを抑え、オンラインでのストリーム再生や動画ファイルの転送を容易にできるものである。したがって、ＩＳＤＮ程度の回線速度があれば、これを受信した側でコマ落ちの少ない動画像として再生することが可能である。
【０００６】
しかしながら、このＭＰＥＧ−４は、非可逆の圧縮方式であり、その圧縮率を極めて高くすることによって低ビットレート向きのデータに圧縮しているため、これを復号化して再生したときには相当な画質の劣化が生じる。具体的には、動画像の再生画質は、ＭＰＥＧ−１及びＭＰＥＧ−２に比べると大幅に劣化してしまう。また、音声情報の音質劣化についても、これを非可逆の圧縮方式で送受信する限り、同様の問題が発生する。例えば、ディジタルコンテンツにおいて人間が聴いて十分に高音質と感じる音声は、サンプリング周波数が４４．１ｋＨｚまたは音声の再生周波数帯域の上限が２０ｋＨｚ前後で約６万５千階調（１６ビット）以上であると言われている。しかしながら、現行の非可逆の圧縮方式でリアルタイムに音声情報を送信する技術では、このような高音質のリアルタイム再生を行うことはできない。
【０００７】
あるいは、これらの情報を、ＣＤ−ＲＯＭやＩＣカード等の所定の記録メディア（記録媒体）を通じて送信することも考えられるが、この場合であっても、その記録メディアのデータ容量には限界があり、かかるデータ容量の制限内でしか情報を送信することができないという問題がある。
【０００８】
そこで、この発明の課題は、動画像情報や音声情報を含むディジタルコンテンツを送受信する場合に、受信先での再生時に画質及び音質を高い水準に保持できる情報通信システムを提供することにある。
【０００９】
【課題を解決するための手段】
上記課題を解決すべく、請求項１に記載の発明は、送信装置により所望のコンテンツを所定のネットワークを通じて受信再生装置に送信しまたは所定の記録媒体を介して運搬する情報通信システムであって、前記送信装置は、前記コンテンツに関する元音声情報、または、前記コンテンツに関する元動画像情報及び元音声情報の双方を個別の要素に分解する手段であって、元音声情報については音源ごとに個別の要素に分解する分解手段と、前記分解手段で分解された個別の要素の種別を認識する認識手段と、前記認識手段での認識結果としての要素の種別を記号化して送信データを生成する送信データ生成部と、前記送信データ生成部で生成された前記送信データを所定のネットワークを通じて前記受信再生装置へ送信する送信側データ通信部、及び／または前記送信データ生成部で生成された前記送信データを所定の記録媒体に記録するデータ記録手段とを備え、前記受信再生装置は、前記送信データを前記ネットワークを通じて受信しまたは前記送信装置から運搬された前記記録媒体から読み出し、当該送信データに基づいて、所定のデータベース内に予め格納された代替素材を読み出し、これらの代替素材を合成して音声、または、動画像及び音声の双方を再生するようにされたものである。
【００１０】
請求項２に記載の発明は、送信装置により所望のコンテンツを所定のネットワークを通じて受信再生装置に送信しまたは所定の記録媒体を介して運搬する情報通信システムであって、前記送信装置は、前記コンテンツに関する元動画像情報を単数または複数の背景画像及び単数または複数のオブジェクトに画像分解する画像分解部と、前記画像分解部で画像分解された前記背景画像及び前記オブジェクトのそれぞれについて画像認識を行う画像認識部と、前記コンテンツに関する元音声情報のなかから少なくとも主要な音声を抽出分解する音声分解部と、前記音声分解部で音声分解された個々の音声データについて音声認識する音声認識部と、前記画像認識部での認識結果及び前記音声認識部での音声認識結果に基づいて前記各背景画像、前記各オブジェクト及び前記各音声データを個別に記号化して送信データを生成する送信データ生成部と、前記送信データ生成部で生成された前記送信データを所定のネットワークを通じて前記受信再生装置へ送信する送信側データ通信部、及び／または前記送信データ生成部で生成された前記送信データを所定の記録媒体に記録するデータ記録手段とを備え、前記受信再生装置は、前記送信データを前記ネットワークを通じて受信しまたは前記送信装置から運搬された前記記録媒体から読み出し、当該送信データに基づいて、所定の画像音声素材データベース内に予め格納された代替素材を読み出し、これらの代替素材を合成して動画像及び音声を再生するようにされたものである。
【００１１】
請求項３に記載の発明は、前記送信装置は、前記送信データ生成部で生成された送信データを編集する編集部をさらに備えるものである。
【００１２】
請求項４に記載の発明は、前記受信再生装置の前記画像音声素材データベースは、前記送信データ中の前記背景画像、前記オブジェクト及び前記音声データの少なくとも一部について同一の前記送信データの情報に対応してそれぞれ複数種類の前記代替素材が用意され、前記受信再生装置は、当該受信再生装置で動画像及び音声を再生する際に、複数種類の前記代替素材の中から選択して前記背景画像、前記オブジェクト及び前記音声データの前記少なくとも一部については、前記画像認識部での認識結果あるいは前記音声認識部での音声認識結果と意味内容は同じでありながら、前記元画像情報あるいは前記元音声情報とは表現態様の異なる前記代替素材を用いることが可能とされたものである。
【００１３】
請求項５に記載の発明は、前記情報通信システムは、前記送信装置により所望のコンテンツを前記所定のネットワークを通じて前記受信再生装置に送信するシステムであり、前記受信再生装置は、当該受信再生装置で動画像及び音声を再生する際に、前記背景画像及び前記オブジェクトの少なくとも一部を変更または削除するための要求を所定の入力装置を通じて入力することが可能とされ、前記送信装置の前記編集部は、前記受信再生装置の前記入力装置で入力された要求を前記所定のネットワークを通じて受信し、当該要求を反映して前記送信データ生成部で前記送信データを生成するようにしたものである。
【００１４】
請求項６に記載の発明は、前記音声データの少なくとも一部は、前記背景画像または前記オブジェクトに対応付けられたリンク音声として認識されて、前記送信データ生成部で記号化される際にその対応付けについてのデータが含められて前記送信データが生成され、前記受信再生装置は、前記リンク音声の再生時に、前記背景画像または前記オブジェクトの再生に対応づけられるようにされたものである。
【００１５】
尚、この明細書において、「コンテンツ」とは、画像及び／または音声等の種々の情報が制作者の意図をもって所定のレイアウトまたは所定のタイミングで編集された編集物をいい、また、「モチーフ」とは、各画像の形状、模様及び色彩や音声の音色、音程、和音、音声出現タイミング及びリズム等の種々の形態をいうものとする。
【００１６】
【発明の実施の形態】
＜原理＞
ディジタル情報の提供サービスのなかには、送信元に保有している動画像情報及び音声情報の元情報と意味内容が同等のものであれば、受信先で再生を行った際にその動画像及び音声のモチーフが異なっていても差し支えない場合がある。むしろ、受信先のユーザの好みにより、元情報の動画像の画風等と異なる画風等に変更したいという場合さえある。このような場合は、元情報（動画像情報及び音声情報）のモチーフの完全同一性は重視されずにこの同一性の擬制が許容される一方、画質及び音質は高水準にしたいという要望がある。
【００１７】
例えば、図１のように、海を題材とするイメージ映像を送受信したい場合に、仮に元情報としての動画像（以下「元動画像」と称す）が実写映像であり、この元動画像内に含まれる空、海及び浜辺の風景を第一背景画像１とし、海の水平線の上部に位置する夕日を第二背景画像２とし、浜辺を左から右に向かって走る男性を第一オブジェクト３とし、空を右から左に向かって飛ぶカモメを第二オブジェクト４とした場合、これらの背景画像１，２及びオブジェクト３，４の位置関係及びその動きが受信側で高画質に再生されるならば、これらのモチーフが図２のように変更されても差し支えない場合がある。
【００１８】
また、音声情報についても同様である。例えば、図１に示した動画像情報の第一背景画像１（空、海及び浜辺の風景）に対応して波の音（第一背景音声）と風の音（第二背景音声）が聞こえ、また第一オブジェクト３（走る男性）の動きに沿って（リンクして）口笛（第一リンク音声）の音が左から右に移動し、さらに第二オブジェクト４（飛ぶカモメ）の動きに沿って（リンクして）カモメの鳴き声（第二リンク音声）の音が右から左に移動するような場合、これらの音声のそれぞれのモチーフの完全同一性は重視されずその擬制が許容される一方、音質については高音質の再生を行いたい場合がある。
【００１９】
このような場合において、この実施の形態の情報通信システムでは、動画像情報及び音声情報を含む元情報の各背景画像１，２中の移動物としての各オブジェクト３，４及びこれに付随する各リンク音声を所定の動き検出等の画像分解技術及び音声分解技術で抽出し、ここで抽出された各オブジェクト３，４及びリンク音声と、残りの各背景画像１，２及び各背景音声とをそれぞれ画像認識及び音声認識して記号化した後、この記号化された記号データのみをネットワークを通じて送受信するようにし、受信側では、受信した記号データに基づいて、情報格納装置内に予め用意しておいたモチーフの画像情報及び音声情報を呼び出し、これらを合成して再生（図２の例を参照）するようにしている。
【００２０】
以下、この情報通信システムを詳述する。
【００２１】
＜構成＞
この情報通信システムは、図３の如く、ディジタルコンテンツを所定のネットワーク１０から受信して動画像情報１３及び音声１４として再生する受信再生装置１１と、ディジタルコンテンツとしての元動画像情報１５及び元音声情報１６を準備し且つネットワーク１０を通じて受信再生装置１１へ送信する送信装置１２とを備える。
【００２２】
送信装置１２は、図４の如く、元動画像情報１５を背景画像及びオブジェクトに画像分解する画像分解部２１と、画像分解部２１で画像分解された背景画像及びオブジェクトのそれぞれについて画像認識を行う画像認識部２２と、元音声情報１６のなかから主要な音声を抽出して分解する音声分解部２３と、この音声分解部２３で音声分解された個々の音声データについて音声認識する音声認識部２４と、画像認識部２２での認識結果及び音声認識部２４での音声認識結果に基づいて送信データを生成する送信データ生成部２５と、送信データ生成部２５で生成された送信データを編集する編集部２６と、送信データ生成部２５で生成され編集部２６で編集された送信データを含む種々の情報についてネットワーク１０を通じて通信を行うデータ通信部２７と、このデータ通信部２７及びネットワーク１０を通じて受信再生装置１１から所望の要求があった際にその要求を認識する要求認識部２８とを備える。
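[0023]
The image decomposition unit 21 has a region extraction function, which extracts the edges of each part of an image using a high-pass filter in the image's frequency space and thereby extracts regions, and a motion extraction (motion detection) function, which extracts the motion of each part of the image as it moves over time using the motion compensation method employed in MPEG and similar schemes. It extracts the objects that are moving in the moving image, treats what remains after removing the extracted objects as the background, and further decomposes that background into several background images. Because it can be difficult to extract individual objects and background images from a composite image in which several images have been combined, the original moving image information 15 may instead be prepared with the objects and background images divided among several channels, and the image decomposition unit 21 may then recognize the objects and background images of each channel as elements.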
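The separation of moving objects from the background can be illustrated with a minimal sketch. It assumes simple frame differencing as a stand-in for the MPEG-style motion compensation and high-pass edge filtering named in the text; the function name `split_moving_pixels`, the threshold value, and the tiny grayscale frames are hypothetical.

```python
def split_moving_pixels(prev_frame, curr_frame, threshold=30):
    """Split a grayscale frame into a motion mask and a background.

    Frames are lists of rows of pixel intensities. A pixel counts as part of
    a moving object when it changed by more than `threshold` between frames
    (an assumed criterion, not the patent's motion compensation).
    """
    mask = [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]
    # Background keeps unchanged pixels; moving pixels are blanked out (None).
    background = [
        [None if moving else c for moving, c in zip(mask_row, curr_row)]
        for mask_row, curr_row in zip(mask, curr_frame)
    ]
    return mask, background


# A bright "object" moves one pixel to the right in the second row.
prev_frame = [[10, 10, 10, 10],
              [10, 200, 10, 10]]
curr_frame = [[10, 10, 10, 10],
              [10, 10, 200, 10]]
mask, background = split_moving_pixels(prev_frame, curr_frame)
```

Both the pixel the object left and the pixel it entered are flagged, which is why a real system would follow this step with region extraction to recover the object's outline.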
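[0024]
As shown in FIG. 5, the image recognition unit 22 performs feature extraction 31 on each part (background image or object) decomposed by the image decomposition unit 21, performs pattern matching 32 against predetermined standard patterns, and outputs the result to the transmission data generation unit 25. When pattern matching 32 is performed, it is desirable to associate the result with the type of audio recognized by the audio recognition unit 24, so this association is carried out in the situation reference 33 block.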
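[0025]
As shown in FIG. 6, the audio decomposition unit 23 comprises an envelope follower 41 that detects the envelope of the original audio information 16, a peak detection unit 42 that divides the output signal of the envelope follower 41 into bands and detects the peak of each, a pitch extraction block 43 that extracts pitch from the peak detection results of the peak detection unit 42, and a digital filter 44 whose passband is the one extracted by the pitch extraction block 43. In particular, if the band of the audio to be extracted (for example, the human voice) is set as the band in which the peak detection unit 42 detects peaks, desired sounds such as waves, wind, a seagull's cry, and whistling can be extracted individually. Because it can be difficult to extract individual sounds from composite audio in which several sounds have been combined, the original audio information 16 may instead be prepared with the audio data divided among several channels, and the audio decomposition unit 23 may then decompose and recognize the audio of each channel as a distinct sound.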
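As a rough illustration of the first two stages, the sketch below implements one common form of envelope follower (full-wave rectification with an instant attack and exponential release) and a trivial peak search. The patent does not specify the follower's design, so the `release` parameter and both function names are assumptions.

```python
def envelope_follower(samples, release=0.5):
    """Track the amplitude envelope of a signal.

    The level jumps up immediately when a louder sample arrives and
    otherwise decays by the factor `release` per sample.
    """
    envelope = []
    level = 0.0
    for sample in samples:
        rectified = abs(sample)
        level = rectified if rectified > level else level * release
        envelope.append(level)
    return envelope


def peak_index(envelope):
    """Return the position of the largest envelope value."""
    return max(range(len(envelope)), key=envelope.__getitem__)


envelope = envelope_follower([0.0, 0.5, -1.0, 0.2, 0.0])
peak = peak_index(envelope)
```

In the arrangement described above this would run once per frequency band, with the per-band peaks feeding the pitch extraction block.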
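[0026]
As shown in FIG. 7, the audio recognition unit 24 comprises a data preprocessing block 51 that performs preprocessing for audio recognition and a recognition unit 52 that performs audio recognition on the preprocessed data.
[0027]
The data preprocessing block 51 comprises an envelope follower 55 that detects the envelope of each piece of audio data decomposed and extracted by the audio decomposition unit 23, and a level adjustment unit 56 that flattens the pitch of the output signal from the envelope follower 55 and adjusts the volume level. It conditions the data in advance so as to improve the recognition accuracy of the recognition unit 52.
[0028]
The recognition unit 52 comprises a conversion unit 57 that performs predetermined conversion processing such as cepstrum conversion or hidden Markov conversion, and a phoneme segment extraction unit 58 that, for the data converted by the conversion unit 57, identifies the type of audio (waves, wind, and so on) and, when the audio is human spoken language, converts it into a character string. A phoneme segment database (DB) 59 is connected to the phoneme segment extraction unit 58, and the type of audio is identified or the character string generated on the basis of this phoneme segment DB 59. The recognition unit 52 also has a function for recognizing the movement of recognized audio over time. On the basis of this movement information it associates the audio with a background image or object on which the image recognition unit 22 has performed pattern matching 32, and recognizes the audio as linked audio. For example, when the recognition unit 52 finds audio that is moving, it reports that fact to the situation reference 33 block of the image recognition unit 22, which allows the audio to be associated with an object in the moving image information; when the audio is stationary, it can be associated with a background image in the moving image information.
[0029]
When the audio data supplied from the data preprocessing block 51 is music such as background music (BGM), the recognition unit 52 also has a function for detecting its melody, harmony, and rhythm and converting the music into MIDI data, as digital data, on the basis of the detection results. The MIDI data is thus transmitted together with the encoded data. Because it can be difficult to detect melody, harmony, and rhythm in music, the original audio information 16 may instead be prepared with the MIDI data of each timbre divided among several channels and used as is.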
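[0030]
The transmission data generation unit 25 generates transmission data on the basis of the recognition results of the image recognition unit 22 and the audio recognition unit 24. It encodes each background image, object, linked audio, and background audio while referring to data in a predetermined database 61. In particular, when an object or background image in the moving image information is associated with an individual piece of audio data in the audio information, the unit recognizes them together as a single piece of situation data. An example of the generated data is shown in Table 1 below.
[0031]
[Table 1]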

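[0032]
In Table 1, the time column gives the time at which each background image or object appears, or begins a new action, in the original moving image information. The image column gives a type code indicating the type of background image or object: for example, "SUN000" means a setting sun, "BOY008" a running boy, and "BRD425" a flying seagull. The left-hand action column gives an action code indicating the motion of that background image or object: "ACT000" indicates that "SUN000" (the setting sun) descends near the center of the screen while taking on a reddish tinge, "MOV104" that "BOY008" (the running boy) moves from left to right in the lower part of the screen, and "MOV203" that "BRD425" (the flying seagull) moves from right to left across the upper part of the screen. The audio column gives the type of audio: "BGM010" means refreshing music serving as background music (BGM), "SNG408" whistling, and "BRD443" a seagull's cry. The right-hand action column gives the movement of the audio named in the audio column: "0" means stationary, "LINK-C3" means linked audio tied to the movement of "BOY008" (the running boy) in the third row from the top of Table 1, and "LINK-C4" means linked audio tied to the movement of "BRD425" (the flying seagull) in the fourth row from the top.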
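The column layout just described maps naturally onto a small record type. The sketch below is illustrative only: the codes for the sun, boy, seagull, and their audio come from the description, but the time values, the first row ("SEA001", "ACT001"), and the pairing of "BGM010" with that row are hypothetical, since the table itself is not reproduced here. The names `Row` and `linked_image` are likewise assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    time: str                            # when the element appears or starts a new action
    image: str                           # type code of the background image or object
    image_action: str                    # action code for the image
    audio: Optional[str] = None          # type code of the audio, if any
    audio_action: Optional[str] = None   # "0" (stationary) or "LINK-Cn"


rows = [
    Row("00:00", "SEA001", "ACT001", "BGM010", "0"),        # hypothetical row 1
    Row("00:00", "SUN000", "ACT000"),
    Row("00:02", "BOY008", "MOV104", "SNG408", "LINK-C3"),
    Row("00:05", "BRD425", "MOV203", "BRD443", "LINK-C4"),
]


def linked_image(rows, audio_action):
    """Resolve 'LINK-Cn' to the image code in row n (counted from 1).

    Returns None for stationary audio ("0") or a missing action.
    """
    prefix = "LINK-C"
    if not audio_action or not audio_action.startswith(prefix):
        return None
    return rows[int(audio_action[len(prefix):]) - 1].image


whistle_target = linked_image(rows, rows[2].audio_action)
```

Resolving the link by row number keeps each record to a handful of short codes, which is the property the system relies on to keep the transmission small.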
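[0033]
Table 1 expresses images and audio mainly through type codes such as "SUN000", but more detailed information may be needed when making associations or when the reception/playback device 11 (described later) selects a substitute image. In such cases, state information about the element identified by a type code (for example, "shape", "appearance", and "situation") can be provided along with the type code as a subset (hierarchical information) of that code, so that the reception/playback device 11 can carry out playback in a form better suited to the purpose. For example, when "SUN000" identifies the sun and coded information is generated specifying "round" as the "shape", "red" as the "appearance", and "low position" as the "situation", fine-grained changes become easy, such as changing the "SUN000" information from the sun to the moon and the "appearance" from "red" to "yellow". As a rule, the contents of these codes are the same as those stored in the reception/playback device 11. If, however, new information that does not exist in the reception/playback device 11 needs to be sent, that information can be transmitted together with its code. In that case, the new information and code may be sent either inside the transmission data or outside it in association with it.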
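The sun-to-moon change can be sketched with an immutable record whose fields mirror the three kinds of state information named in the text. The class name `Element` and the moon's type code "MON000" are hypothetical; the description gives no code for the moon.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Element:
    type_code: str
    shape: str
    appearance: str
    situation: str


sun = Element("SUN000", shape="round", appearance="red", situation="low position")

# Swap the element and its colour while keeping shape and position unchanged.
moon = replace(sun, type_code="MON000", appearance="yellow")
```

Because only the changed fields are named, the unchanged state (round, low position) carries over, which matches the fine-grained substitution described above.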
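[0034]
When MIDI data is supplied from the audio recognition unit 24, the transmission data generation unit 25 also has a function for including this MIDI data in the transmission data in addition to the encoded data described above.
[0035]
The transmission data generation unit 25 further has a function for changing the encoded transmission data according to the editing results of the editing unit 26, described later, and a function for changing it according to a signal from the request recognition unit 28. This allows individual adjustments, such as omitting the background music (BGM) represented by "BGM010" above, on the basis of the editing results of the editing unit 26 or a signal from the request recognition unit 28.
[0036]
The editing unit 26 actually reproduces the moving image and audio on a predetermined monitor 62 on the basis of the transmission data encoded and generated by the transmission data generation unit 25. While watching this, an operator uses a predetermined input unit 63, such as a keyboard and mouse, to change or delete transmission data.
[0037]
The data communication unit 27 has a function for transmitting various information, including the transmission data generated by the transmission data generation unit 25 and edited by the editing unit 26, to the reception/playback device 11 through the network 10, and a function for receiving request signals sent from the reception/playback device 11 through the network 10.
[0038]
The request recognition unit 28 recognizes a request when a request signal arrives from the reception/playback device 11 through the data communication unit 27 and the network 10, and conveys the request, for example to omit the background music (BGM), to the transmission data generation unit 25.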
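[0039]
The reception/playback device 11 receives transmission data encoded as shown in Table 1 above and, on that basis, reads out substitute materials stored in advance in a content database (DB) 71, an action rule database (DB) 72, and an image/audio material database (DB) 73, and combines these substitute materials to reproduce them as the moving image 13 and audio 14.
[0040]
Specifically, the reception/playback device 11 comprises: a data communication unit 75 that communicates with the transmission device 12 through the network 10; a replacement processing unit 76 that, on the basis of the transmission data (see Table 1) supplied from the transmission device 12 through the network 10 and the data communication unit 75, identifies the type of content stored in the content database 71 (image, audio, text, music (MIDI data), and so on) and, referring to the information stored in the action rule database 72, replaces the motion of each object and the like (for example, moving from left to right); a data decompression unit 77 that decompresses any data requiring decompression on the basis of the replacement results of the replacement processing unit 76; an action calculation unit 78 that calculates the motion of each object and the like on the basis of the signal from the replacement processing unit 76; a data synthesis unit 79 that checks the signal from the action calculation unit 78 against the substitute materials in the image/audio material database 73, determines the material for each image and sound, and combines them as layers; a control timer 80 serving as timekeeping means; a sequencer 81 that synchronizes the output images and output audio from the data synthesis unit 79 on the basis of the timing of the control timer 80; a video playback control device 82 that reproduces the moving image on the basis of the output of the sequencer 81; a video synthesizer 83 that performs final composition of the video output from the video playback control device 82; a synthesizer 84 that synthesizes audio waveforms on the basis of the output of the sequencer 81; an audio mixer 85 that mixes audio on the basis of the output of the synthesizer 84; a central control unit 87 that controls the reception/playback device 11 as a whole; and a request unit 88 that issues requests to the transmission device 12 on the basis of signals from the central control unit 87. The synthesizer 84 mainly reproduces MIDI data; when Wave data, MP3 data, or the like is used as audio information, that data bypasses the synthesizer 84 and is sent directly from the sequencer 81 to the audio mixer 85.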
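The audio routing rule in the last sentence can be written down in a few lines. This is a sketch only: the function name `audio_route` and the use of a format string to select the path are assumptions, and the stage labels simply reuse the reference numerals from the description.

```python
def audio_route(data_format):
    """Return the stages an audio item passes through, in order.

    MIDI data goes through the synthesizer; sampled formats such as
    Wave or MP3 bypass it and go straight to the mixer.
    """
    stages = ["sequencer 81"]
    if data_format.upper() == "MIDI":
        stages.append("synthesizer 84")
    stages.append("audio mixer 85")
    return stages


midi_path = audio_route("MIDI")
mp3_path = audio_route("MP3")
```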
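[0041]
Here, the central control unit 87 recognizes the wishes of the user of the reception/playback device 11 from input on a predetermined input device 89 and has a function for changing various settings, such as the style used by the data synthesis unit 79 (for example, live-action or cartoon style, classical or contemporary pop style) or whether background music (BGM) is wanted. This information is sent to the transmission device 12 through the request unit 88. The central control unit 87 also has a function for sending various other information to the transmission device 12 through the request unit 88, such as the purpose for which the digital content is received and played (for example, business or private use) and the usage environment (for example, the communication speed of the modem or ISDN line in use).
[0042]
The image materials in the material database 73 may be still images but, to allow for objects that should be shown in motion, may also be animation images that display a moving picture by showing several still images in sequence. Compressed image materials such as JPEG or animated GIF may be used. Likewise, audio materials may be in any data format, such as MIDI data, Wave data, or MP3 data. When, for example, MP3 data is used, it is desirable to provide a decompression circuit (not shown) inside or outside the material database 73 to decompress the compressed data.
[0043]
For the substitute materials (image and audio materials) in the material database 73, not just one type but many motifs, for example live-action, cartoon, classical, and contemporary pop styles, are prepared for each encoded item, in view of the diversity of users' tastes.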
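A lookup keyed on both the type code and the requested style is one simple way to model a database that holds several motifs per code. In the sketch below the file names, the `MATERIALS` mapping, and the fallback to a default style are all assumptions made for illustration; the description does not say what happens when a style is missing.

```python
# (type code, style) -> stored substitute material; entries are hypothetical.
MATERIALS = {
    ("BOY008", "live-action"): "boy_photo.jpg",
    ("BOY008", "cartoon"): "boy_cartoon.gif",
    ("BRD425", "live-action"): "gull_photo.jpg",
}


def pick_material(type_code, style, default_style="live-action"):
    """Return the material for the requested style, else the default style.

    Returns None when the type code has no material at all.
    """
    return (MATERIALS.get((type_code, style))
            or MATERIALS.get((type_code, default_style)))


cartoon_boy = pick_material("BOY008", "cartoon")
fallback_gull = pick_material("BRD425", "cartoon")
```

The transmitted code stays the same whichever style is chosen; only the receiver's lookup changes, which is why restyling adds no communication load.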
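[0044]
Each of the components described above may be implemented in hardware as a dedicated circuit, or as a functional element that operates according to a predetermined software program using a CPU.
[0045]
<Operation>
The operation of the information communication system configured as above will now be described.
[0046]
First, on the transmission device 12 side, as shown in FIG. 4, the image decomposition unit 21 decomposes the original moving image information 15 into background images and objects. In doing so, it extracts regions by extracting the edges of each part of the image with a high-pass filter in the image's frequency space, extracts the motion of each part of the moving image with its motion extraction (motion detection) function to extract the objects, treats what remains after removing the extracted objects as the background, and further decomposes that background into several background images by edge detection. If the objects and background images have been prepared in advance on separate channels, the image decomposition unit 21 need only recognize each element.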
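[0047]
For example, suppose, as in FIG. 1, that the original moving image information 15 is an image video of the sun setting into the sea, in which a boy runs along the beach from the left edge toward the right while whistling and a seagull flies across the sky from the right edge toward the left. The image decomposition unit 21 then takes the scenery of sky, sea, and beach in this original moving image as the first background image 1, the setting sun just above the sea horizon as the second background image 2, the man running from left to right along the beach as the first object 3, and the seagull flying from right to left across the sky as the second object 4.
[0048]
Next, the image recognition unit 22 performs image recognition on each of the background images 1, 2 and objects 3, 4 decomposed by the image decomposition unit 21. As shown in FIG. 5, the image recognition unit 22 performs feature extraction 31 on each part (background images 1, 2 and objects 3, 4), performs pattern matching 32 against predetermined standard patterns, and outputs the result to the transmission data generation unit 25.
[0049]
As a result, for the original moving image information 15 of FIG. 1, for example, it is recognized that the first background image 1 is scenery of sky, sea, and beach, the second background image 2 is a setting sun just above the sea horizon, the first object 3 is a boy running from left to right along the beach, and the second object 4 is a seagull flying from right to left across the sky.
[0050]
During this image recognition, the association with the type of audio recognized by the audio recognition unit 24, described below, is made in the situation reference 33 block.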
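[0051]
In parallel with the image decomposition and image recognition, the audio decomposition unit 23 extracts and decomposes the main sounds from the original audio information 16. As shown in FIG. 6, the envelope follower 41 detects the envelope of the original audio information 16, and the peak detection unit 42 divides the output signal of the envelope follower 41 into bands and detects the peak of each. The pitch extraction block 43 then extracts pitch from the peak detection results, and the signal is passed through the digital filter 44. The band of the audio to be extracted (for example, the human voice) is set in advance as the band in which the peak detection unit 42 detects peaks, and desired sounds such as waves, wind, a seagull's cry, and whistling are extracted individually. If the audio elements have been prepared in advance on separate channels, the audio decomposition unit 23 need only recognize each element and separate them.
[0052]
The individual pieces of audio data decomposed by the audio decomposition unit 23 are recognized by the audio recognition unit 24.
[0053]
In the data preprocessing block 51 of the audio recognition unit 24, as shown in FIG. 7, the envelope follower 55 detects the envelope of each piece of audio data decomposed and extracted by the audio decomposition unit 23, and the level adjustment unit 56 flattens the pitch of the output signal from the envelope follower 55 and adjusts the volume level. The conversion unit 57 of the recognition unit 52 then performs predetermined conversion processing such as cepstrum conversion or hidden Markov conversion, after which the phoneme segment extraction unit 58, on the basis of the phoneme segment DB 59, identifies the type of audio (waves, wind, and so on) in the converted data and, when the audio is human spoken language, generates a character string in that language.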
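[0054]
In addition, when, for example, the sound of whistling moves from left to right, the recognition unit 52 recognizes this movement of the recognized audio over time and, on the basis of the movement information, associates the audio with a background image or object on which the image recognition unit 22 has performed pattern matching 32, recognizing it as linked audio.
[0055]
When the audio data supplied from the data preprocessing block 51 is music such as background music (BGM), the recognition unit 52 detects its melody, harmony, and rhythm and converts the music into MIDI data, as digital data, on the basis of the detection results. This makes it possible to transmit MIDI data together with the encoded data.
[0056]
The recognition results of the image recognition unit 22 and of the audio recognition unit 24 are then sent to the transmission data generation unit 25, which generates transmission data from them. That is, the transmission data generation unit 25 encodes each background image, object, linked audio, and background audio while referring to data in the predetermined database 61, and generates transmission data as a data sequence like Table 1 above. In particular, when an object or background image in the moving image information is associated with an individual piece of audio data in the audio information, they are recognized together as a single piece of situation data. The meaning of each item of data in Table 1 is as described above.
[0057]
When MIDI data is supplied from the audio recognition unit 24, the transmission data generation unit 25 includes it in the transmission data in addition to the encoded data described above.
[0058]
If an operator changes or deletes transmission data in the editing unit 26, the transmission data is changed accordingly. When a request signal arrives from the reception/playback device 11 through the data communication unit 27 and the network 10, the request recognition unit 28 recognizes the request and transmission data reflecting it is generated. In this way, individual adjustments are made in response to requests; for example, the background music (BGM) represented by "BGM010" in Table 1 is omitted in response to a request to omit background music.
[0059]
The transmission data generated in this way is sent from the data communication unit 27 through the network 10 to the reception/playback device 11.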
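[0060]
In the reception/playback device 11, the data communication unit 75 receives the transmission data encoded as shown in Table 1 above. The replacement processing unit 76 then reads the received transmission data, identifies the type of content stored in the content database 71 (image, audio, text, music (MIDI data), and so on), and, referring to the information stored in the action rule database 72, replaces the motion of each object and the like (for example, moving from left to right).
[0061]
On the basis of the replacement results of the replacement processing unit 76, the data decompression unit 77 decompresses any data requiring decompression. The action calculation unit 78 calculates the motion of each object and the like from the signal from the replacement processing unit 76. The data synthesis unit 79 checks the signal from the action calculation unit 78 against the substitute materials in the image/audio material database 73, determines the material for each image and sound, and combines them as layers.
[0062]
The sequencer 81 then synchronizes the output images and output audio from the data synthesis unit 79 on the basis of the timing of the control timer 80, while the video playback control device 82 reproduces the moving image and the video synthesizer 83 performs final layer composition of the images and outputs the result to a predetermined monitor device (not shown). For MIDI data, the synthesizer 84 synthesizes the audio waveform, and the audio mixer 85 mixes the audio and outputs it. When Wave data, MP3 data, or the like is used as audio information, the data bypasses the synthesizer 84 and is sent directly from the sequencer 81 to the audio mixer 85 for playback.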
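[0063]
If the user of the reception/playback device 11 enters a wish on the predetermined input device 89, the central control unit 87 recognizes it and reflects it in the synthesis of the moving image 13 and audio 14.
[0064]
For example, if image materials with the same motif as the original moving image information 15 are stored in advance in the image/audio material database (DB) 73 and the user specifies that the motif be kept as is, the reception/playback device 11 can reproduce a moving image 13 almost identical to the original moving image information 15 prepared by the transmission device 12, as indicated by reference sign 18 in FIG. 3. The same holds for the audio 14: if the user specifies that the audio motif be kept as is, the reception/playback device 11 can reproduce audio 14 almost identical to the original audio information 16 prepared by the transmission device 12.
[0065]
If, as in FIG. 1, the original moving image information 15 is in live-action style and the user wants cartoon-style playback on the reception/playback device 11, the user conveys this to the central control unit 87 through the input device 89. The data synthesis unit 79 then, following instructions from the central control unit 87, selects and reads cartoon-style rather than live-action-style image materials from the image/audio material database 73. The result of combining these image materials in the data synthesis unit 79 is as indicated by reference sign 19 in FIG. 3 and shown in FIG. 2. In this case, as FIG. 2 shows, the first background image 1 (scenery of sky, sea, and beach), the second background image 2 (the setting sun just above the sea horizon), the first object 3 (the boy running from left to right along the beach), and the second object 4 (the seagull flying from right to left across the sky) all keep the same meaning but have motifs entirely different from the original moving image information 15, so the style can easily be changed to the one the user wants. Furthermore, if, for example, a live-action image of the user is stored in the image/audio material database 73, the user can enjoy a moving image 13 with himself or herself as the motif when it is played on the reception/playback device 11.
[0066]
When some data is linked to other data, as in the example of Table 1 (for example, when linked audio, as audio data, is made to correspond to an object), playback takes place with a very natural correspondence, such as the moving image and audio moving in synchrony, so a high-level digital content service can be provided.
[0067]
Likewise, the audio 14 can be enjoyed with various changes of motif. For example, instead of using the MIDI data supplied by the transmission device 12 as background music, entirely different music stored in advance in the image/audio material database 73 may be used. The user can make such changes simply by entering them on the input device 89. Digital content suited to the user's tastes can therefore be provided easily.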
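[0068]
In the operation described above, the data supplied from the transmission device 12 to the reception/playback device 11 through the network 10 consists mainly of encoded transmission data as shown in Table 1, and there is no need to communicate the moving images themselves, such as MPEG, or the audio data itself, such as WAVE or MP3 data. Therefore, even when communicating over the Internet using, for example, a general public telephone line with a modem or ISDN, the deterioration in response at the reception/playback device 11 caused by limited line capacity can be greatly reduced, and a digital content service with very good communication response can be provided. Specifically, when the original moving image information 15 and original audio information 16 amount to several megabytes, sending them as is through the network 10 takes so long that real-time communication is impossible. In this embodiment, by contrast, those several megabytes can be replaced by as little as a few tens of bytes, a compression of several hundred thousand to one or more, so communication is possible without loading the network 10.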
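A short calculation shows the scale of the saving. The figures are assumed for illustration: 6 MB for the original information and 20 bytes for the encoded data (both within the ranges stated above), and a 64 kbps line, the rate quoted for ISDN-class connections in the background section. The function name `transfer_seconds` is hypothetical.

```python
def transfer_seconds(size_bytes, line_bps):
    """Ideal transfer time, ignoring protocol overhead and latency."""
    return size_bytes * 8 / line_bps


original_bytes = 6_000_000   # assumed size of the raw video and audio
encoded_bytes = 20           # assumed size of the encoded records
line_bps = 64_000            # assumed line speed

ratio = original_bytes / encoded_bytes
raw_time = transfer_seconds(original_bytes, line_bps)
encoded_time = transfer_seconds(encoded_bytes, line_bps)
```

Under these assumptions the raw data would take 750 seconds to send while the encoded data takes 2.5 milliseconds, a ratio of 300,000 to 1.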
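[0069]
Moreover, if high-resolution image materials and high-quality audio materials are prepared as substitute materials in the image/audio material database 73 of the reception/playback device 11, digital content can be reproduced with very good image quality and excellent sound quality without the deterioration in response described above.
[0070]
Digital content can also be edited in the editing unit 26 of the transmission device 12, as in FIG. 4. So when, for example, a regular user always asks for background images or background music to be omitted, the editing unit 26 can be used to omit them for that user in advance.
[0071]
Background images and background music are not always essential to the content. If the request recognition unit 28 recognizes that a user is using a slow modem, the editing unit 26 may therefore automatically instruct the transmission data generation unit 25 to omit background images or background music.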
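The automatic omission could look something like the following sketch. It handles only background music, identified by the "BGM" prefix used in the description; the speed threshold `slow_bps` and the function name `trim_for_link` are assumptions, since the text does not define what counts as a slow modem.

```python
def trim_for_link(codes, line_bps, slow_bps=33_600):
    """Drop background-music codes when the reported line speed is slow.

    `slow_bps` is an assumed cut-off; faster links receive every code.
    """
    if line_bps > slow_bps:
        return list(codes)
    return [code for code in codes if not code.startswith("BGM")]


codes = ["SUN000", "BGM010", "SNG408"]
slow_result = trim_for_link(codes, 28_800)
fast_result = trim_for_link(codes, 64_000)
```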
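[0072]
Furthermore, the editing unit 26 in the transmission device 12 or the central control unit 87 in the reception/playback device 11 may perform color adjustment; for example, when a classical style is to be expressed, the colors may be unified into deep brownish tones overall.
[0073]
In the embodiment above, different motifs are prepared in advance in the image/audio material database 73 of the reception/playback device 11 for all background images, all objects, and all audio data, and all of these can be changed according to the user's wishes. However, it is also acceptable to allow only some of the background images, objects, and audio data to be changed. In that case, only one or two of the three categories (background images, objects, and audio data) may be changeable, or only some of the items within each category may be changeable.
[0074]
Also, in the embodiment above, transmission data is supplied from the transmission device 12 to the reception/playback device 11 through a network 10 such as a general public telephone line, but it may instead be supplied through recording media such as a CD-ROM or IC card. In that case, needless to say, the transmission device 12 needs a recording device (recording means) for writing the transmission data to the recording medium, and the reception/playback device 11 needs a drive for reading the transmission data from it. Even when a recording medium is used, there is the advantage that, compared with sending the original moving image information 15, original audio information 16, and so on as they are, the transmission data is not constrained by the data capacity of the recording medium. Because it is difficult to send the user's wishes from the reception/playback device 11 to the transmission device 12 when recording media are used, it is desirable to combine their use with communication through the network 10.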
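[0075]
[Effects of the invention]
According to the inventions of claims 1 and 2, content is decomposed into elements by predetermined decomposing means, for example the image decomposition unit of the transmission device decomposing the original moving image information into individual background images and objects and the audio decomposition unit extracting and decomposing at least the main sounds of the original audio information. Image recognition and audio recognition are performed on the respective elements, which are then encoded according to their content to generate transmission data that is sent to the reception/playback device. The reception/playback device reads out, on the basis of the received transmission data, substitute materials stored in advance in, for example, a predetermined image/audio material database and combines them to reproduce a moving image and audio. The encoded transmission data supplied from the transmission device to the reception/playback device through a network or recording medium is therefore far smaller than the moving images themselves, such as MPEG, or the audio data itself, such as WAVE or MP3 data. Even when communicating over the Internet using, for example, a general public telephone line with a modem or ISDN, the deterioration in response at the reception/playback device caused by limited line capacity can be greatly reduced, and a content service with very good communication response can be provided. And when transmission data is delivered through a recording medium, there is the advantage of not being constrained by the data capacity of that medium.
[0076]
According to the invention of claim 3, the editing unit of the transmission device can edit the transmission data to be sent to the reception/playback device, so the content can be changed on the transmission device side according to, for example, the wishes of the user of the reception/playback device or various conditions such as the communication speed of the network.
[0077]
According to the invention of claim 4, a moving image and audio that have the same meaning as the original moving image information and original audio information but a different mode of expression can be reproduced, and some background images, objects, and audio can be omitted, easily and as the user wishes. Content suited to the user's wishes can thus be provided easily without increasing the communication load.
[0078]
According to the invention of claim 5, the user's wishes can be received via the network and reflected in the transmission data sent from the transmission device, so unwanted data that the user does not need is kept out of the transmission data flowing through the network and efficient communication is possible.
[0079]
According to the invention of claim 6, at least part of the audio data is recognized as linked audio associated with a background image or object, data about that association is included when the transmission data generation unit encodes it, and the reception/playback device associates playback of the linked audio with playback of the background image or object. Playback therefore takes place with a very natural correspondence, for example an object in the moving image and its linked audio moving in synchrony, and a high-level content service can be provided.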
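[Brief description of the drawings]
[FIG. 1] A diagram showing an example of original moving image information prepared in the transmission device.
[FIG. 2] A diagram showing an example of a moving image reproduced by the reception/playback device.
[FIG. 3] A block diagram showing the overall configuration of an information communication system according to one embodiment of the invention.
[FIG. 4] A block diagram showing the transmission device of the information communication system according to one embodiment of the invention.
[FIG. 5] A block diagram showing the image recognition unit.
[FIG. 6] A block diagram showing the audio decomposition unit.
[FIG. 7] A block diagram showing the audio recognition unit.
[FIG. 8] A block diagram showing the reception/playback device of the information communication system according to one embodiment of the invention.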
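[Explanation of reference signs]
1, 2 Background images
3, 4 Objects
10 Network
11 Reception/playback device
12 Transmission device
13 Moving image
14 Audio
15 Original moving image information
16 Original audio information
21 Image decomposition unit
22 Image recognition unit
23 Audio decomposition unit
24 Audio recognition unit
25 Transmission data generation unit
26 Editing unit
27 Data communication unit
28 Request recognition unit
31 Feature extraction
32 Pattern matching
33 Situation reference
41 Envelope follower
42 Peak detection unit
43 Pitch extraction block
44 Digital filter
51 Data preprocessing block
52 Recognition unit
55 Envelope follower
56 Level adjustment unit
57 Conversion unit
58 Phoneme segment extraction unit
62 Monitor
63 Input unit
73 Image/audio material database
75 Data communication unit
76 Replacement processing unit
77 Data decompression unit
78 Action calculation unit
79 Data synthesis unit
80 Control timer
81 Sequencer
82 Video playback control device
83 Video synthesizer
84 Synthesizer
85 Audio mixer
87 Central control unit
88 Request unit
89 Input device
[0001]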
[Technical field of the invention]
The present invention relates to an information communication system.
[0002]
[Prior art]
With the development of communication technology in recent years, various information providing services through communication have been attempted. For example, the provision of digital contents through the Internet using a general public telephone line is one of them.
[0003]
Such digital content includes not only text information but also audio information and image information, the latter covering both still and moving images. A wide variety of information can be provided through various combinations of these, for example audio information that changes in step with moving image information.
[0004]
[Problems to be solved by the invention]
When the varied information described above is provided through a network such as the Internet over a general public telephone line, the limited data transmission speed of the network constrains any improvement in the image quality of the image information and the sound quality of the audio information that is sent and received.
[0005]
One example of technology for sending and receiving moving images in real time is the digital moving image coding scheme called MPEG-4, which is suited to very low bit rates and is being standardized internationally by ISO/IEC JTC1/SC29/WG11. MPEG-4 is suited to compressing moving images at low bit rates of 64 kbps or less; it keeps file sizes down and makes online stream playback and the transfer of moving image files easier. With a line speed on the order of ISDN, the receiving side can therefore play the result as a moving image with few dropped frames.
[0006]
However, MPEG-4 is a lossy compression scheme that produces data suited to low bit rates by using a very high compression ratio, so considerable degradation in image quality occurs when the data is decoded and played back. Specifically, the playback quality of moving images is significantly worse than with MPEG-1 and MPEG-2. The same problem arises for the sound quality of audio information as long as it is sent and received using a lossy compression scheme. For example, audio in digital content that a human listener perceives as sufficiently high in quality is said to require a sampling frequency of 44.1 kHz, or an upper limit of the reproduction frequency band of around 20 kHz, with about 65,000 levels (16 bits) or more. Technology that transmits audio information in real time using current lossy compression schemes cannot achieve real-time playback of such high quality.
[0007]
Alternatively, such information could be delivered through predetermined recording media such as a CD-ROM or IC card. Even then, the data capacity of the recording medium is limited, and information can be delivered only within that limit.
[0008]
Accordingly, an object of the present invention is to provide an information communication system capable of maintaining high image quality and sound quality at the time of reproduction at a receiving destination when digital content including moving image information and audio information is transmitted and received.
[0009]
[Means for Solving the Problems]
To solve the above problem, the invention of claim 1 is an information communication system in which a transmission device sends desired content to a reception/playback device through a predetermined network or conveys it via a predetermined recording medium. The transmission device comprises: decomposing means that decomposes original audio information relating to the content, or both original moving image information and original audio information relating to the content, into individual elements, the original audio information being decomposed into individual elements by sound source; recognizing means that recognizes the type of each element produced by the decomposing means; a transmission data generation unit that encodes the element types obtained as recognition results by the recognizing means and generates transmission data; and a transmission-side data communication unit that sends the transmission data generated by the transmission data generation unit to the reception/playback device through the predetermined network, and/or data recording means that records that transmission data on the predetermined recording medium. The reception/playback device receives the transmission data through the network, or reads it from the recording medium conveyed from the transmission device, reads out substitute materials stored in advance in a predetermined database on the basis of the transmission data, and combines these substitute materials to reproduce audio, or both a moving image and audio.
[0010]
The invention of claim 2 is an information communication system in which a transmission device sends desired content to a reception/playback device through a predetermined network or conveys it via a predetermined recording medium. The transmission device comprises: an image decomposition unit that decomposes original moving image information relating to the content into one or more background images and one or more objects; an image recognition unit that performs image recognition on each of the background images and objects produced by the image decomposition unit; an audio decomposition unit that extracts and decomposes at least the main sounds from original audio information relating to the content; an audio recognition unit that performs audio recognition on the individual pieces of audio data produced by the audio decomposition unit; a transmission data generation unit that individually encodes each background image, each object, and each piece of audio data on the basis of the recognition results of the image recognition unit and the audio recognition unit and generates transmission data; and a transmission-side data communication unit that sends the transmission data generated by the transmission data generation unit to the reception/playback device through the predetermined network, and/or data recording means that records that transmission data on the predetermined recording medium. The reception/playback device receives the transmission data through the network, or reads it from the recording medium conveyed from the transmission device, reads out substitute materials stored in advance in a predetermined image/audio material database on the basis of the transmission data, and combines these substitute materials to reproduce a moving image and audio.
[0011]
According to a third aspect of the present invention, the transmission device further includes an editing unit that edits transmission data generated by the transmission data generation unit.
[0012]
In the invention of claim 4, the image/audio material database of the reception/playback device holds, for at least some of the background images, objects, and audio data in the transmission data, a plurality of types of substitute material corresponding to the same item of transmission data information. When the reception/playback device reproduces a moving image and audio, it can select from among those types and, for that portion of the background images, objects, and audio data, use substitute material that has the same meaning as the recognition result of the image recognition unit or the audio recognition unit but a different mode of expression from the original image information or original audio information.
[0013]
In the invention of claim 5, the information communication system is a system in which the transmission device sends desired content to the reception/playback device through the predetermined network. When reproducing a moving image and audio, the reception/playback device accepts, through a predetermined input device, a request to change or delete at least some of the background images and objects. The editing unit of the transmission device receives the request entered on the input device of the reception/playback device through the predetermined network and has the transmission data generation unit generate transmission data reflecting it.
[0014]
In the invention of claim 6, at least part of the audio data is recognized as linked audio associated with a background image or object, and data about that association is included when the transmission data generation unit encodes it to generate the transmission data. When playing the linked audio, the reception/playback device associates it with playback of the background image or object.
[0015]
In this specification, "content" means an edited work in which various information such as images and/or audio has been arranged in a predetermined layout or with predetermined timing according to the producer's intent, and "motif" means the various forms an element may take, such as the shape, pattern, and color of an image or the timbre, pitch, chords, timing of appearance, and rhythm of audio.
[0016]
[Embodiments of the invention]
<Principle>
In some digital information services, it does not matter if the motifs of the moving image and audio reproduced at the receiving end differ from the original moving image information and audio information held at the source, as long as the meaning is equivalent. Indeed, users at the receiving end may even wish to change the style of the original moving image to a different one according to their own tastes. In such cases, exact identity with the motifs of the original information (moving image information and audio information) is not important and an approximation treated as identical is acceptable, while high image and sound quality is still desired.
[0017]
For example, as shown in FIG. 1, suppose one wishes to send and receive an image video on the theme of the sea, and that the moving image serving as original information (hereinafter the "original moving image") is live-action footage. If the scenery of sky, sea, and beach in this original moving image is taken as the first background image 1, the setting sun just above the sea horizon as the second background image 2, the man running from left to right along the beach as the first object 3, and the seagull flying from right to left across the sky as the second object 4, then, provided that the positional relationships and movements of these background images 1, 2 and objects 3, 4 are reproduced with high image quality on the receiving side, it may be acceptable for their motifs to be changed as in FIG. 2.
[0018]
The same applies to audio information. Suppose, for example, that the sound of waves (first background audio) and the sound of wind (second background audio) are heard in correspondence with the first background image 1 (scenery of sky, sea, and beach) of the moving image information shown in FIG. 1, that the sound of whistling (first linked audio) moves from left to right along with (linked to) the movement of the first object 3 (the running man), and that a seagull's cry (second linked audio) moves from right to left along with (linked to) the movement of the second object 4 (the flying seagull). Here too, exact identity with the motif of each sound may be unimportant and an approximation acceptable, while playback with high sound quality is still desired.
[0019]
In such a case, the information communication system of this embodiment extracts, from the original information including moving image information and audio information, the objects 3 and 4 that move within the background images 1 and 2 and the link sounds associated with these objects, using image decomposition and sound decomposition techniques such as predetermined motion detection. The extracted objects 3 and 4 and link sounds, together with the remaining background images 1 and 2 and background sounds, are each symbolized by image recognition and voice recognition. Only the resulting symbol data is transmitted and received through the network. On the receiving side, image information and audio information of motifs prepared in advance in an information storage device are called up based on the received symbol data, and these are combined and reproduced (see the example in FIG. 2).
[0020]
Hereinafter, this information communication system will be described in detail.
[0021]
<Configuration>
As shown in FIG. 3, this information communication system includes a reception/playback apparatus 11 that receives digital content from a predetermined network 10 and reproduces it as a moving image 13 and audio 14, and a transmission device 12 that prepares original moving image information 15 and original audio information 16 as digital content and transmits them to the reception/playback apparatus 11 through the network 10.
[0022]
As shown in FIG. 4, the transmission device 12 includes: an image decomposition unit 21 that decomposes the original moving image information 15 into background images and objects; an image recognition unit 22 that performs image recognition on each background image and object decomposed by the image decomposition unit 21; a voice decomposition unit 23 that extracts the main sounds from the original audio information 16 and decomposes them; a voice recognition unit 24 that performs voice recognition on each piece of audio data decomposed by the voice decomposition unit 23; a transmission data generation unit 25 that generates transmission data based on the recognition result of the image recognition unit 22 and the recognition result of the voice recognition unit 24; an editing unit 26 that edits the transmission data generated by the transmission data generation unit 25; a data communication unit 27 that communicates various information, including the transmission data generated by the transmission data generation unit 25 and edited by the editing unit 26, through the network 10; and a request recognition unit 28 that recognizes any request received from the reception/playback apparatus 11 through the network 10 and the data communication unit 27.
[0023]
The image decomposition unit 21 has a region extraction function that extracts regions by detecting the edges of each part of the image using a high-pass filter in the frequency space of the image, and a motion extraction function that, when parts of the moving image move over time, extracts the motion of each part using a motion compensation method such as that used in MPEG. It extracts the moving objects in the moving image, treats the portion remaining after the extracted objects are removed as the background, and further decomposes the background into several background images. Note that it may be difficult to extract individual objects and background images from a composite image. In such a case, the objects and background images may be prepared in advance as original moving image information 15 divided into several channels, and the image decomposition unit 21 may recognize the objects and background images of these channels as separate elements.
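The motion extraction function can be illustrated by exhaustive block matching with a sum-of-absolute-differences cost, which is one common basis for MPEG-style motion compensation. The following Python sketch is only an illustration under stated assumptions: the function names, the 2x2 block size, the search range and the toy 6x6 frames are hypothetical and are not specified in this description.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(frame_a[ay + j][ax + i] - frame_b[by + j][bx + i])
               for j in range(size) for i in range(size))

def motion_vector(prev, curr, x, y, size=2, search=2):
    """Displacement (dx, dy) that best maps the block at (x, y) in prev onto curr."""
    h, w = len(curr), len(curr[0])
    best = (0, 0)
    best_cost = sad(prev, curr, x, y, x, y, size)  # cost of "no movement"
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            nx, ny = x + dx, y + dy
            # only consider candidate blocks that lie fully inside the frame
            if 0 <= nx <= w - size and 0 <= ny <= h - size:
                cost = sad(prev, curr, x, y, nx, ny, size)
                if cost < best_cost:
                    best, best_cost = (dx, dy), cost
    return best

def make_frame(bx, by, w=6, h=6):
    """Toy frame: a bright 2x2 'object' at (bx, by) on a dark background."""
    return [[9 if bx <= i < bx + 2 and by <= j < by + 2 else 0
             for i in range(w)] for j in range(h)]

prev = make_frame(1, 2)   # object near the left
curr = make_frame(3, 2)   # same object two pixels to the right

object_motion = motion_vector(prev, curr, 1, 2)      # block covering the object
background_motion = motion_vector(prev, curr, 4, 4)  # block covering background only
```

In this sketch, a block with a non-zero vector would be treated as part of a moving object, and blocks with a zero vector as background.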
[0024]
As shown in FIG. 5, the image recognition unit 22 performs feature extraction 31 on each part (background image and object) decomposed by the image decomposition unit 21, performs pattern matching 32 against predetermined standard patterns, and outputs the result to the transmission data generation unit 25. When pattern matching 32 is performed, it is desirable to associate the result with the type of sound recognized by the voice recognition unit 24. This association is performed in the situation reference 33 block.
[0025]
As shown in FIG. 6, the voice decomposition unit 23 includes an envelope follower 41 that detects the envelope of the original audio information 16, a peak detection unit 42 that divides the output signal of the envelope follower 41 into bands and detects the peaks in each band, a pitch extraction block 43 that extracts the pitch from the peak detection result of the peak detection unit 42, and a digital filter 44 whose pass band is the band extracted by the pitch extraction block 43. In particular, if the band of the sound to be extracted (for example, the human voice) is set as the band in which the peak detection unit 42 performs peak detection, desired sounds such as the sound of waves, the sound of wind, the cry of a seagull and whistling can be extracted individually. It may be difficult to extract individual sounds from a composite sound in which several sounds are mixed. In such a case, the audio data may be prepared as original audio information 16 divided into several channels, and the voice decomposition unit 23 may decompose and recognize the sounds of these channels as separate sounds.
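As a rough illustration of the first two stages (envelope follower 41 and peak detection unit 42), the following Python sketch rectifies a signal, smooths it with a one-pole attack/release filter, and picks local maxima of the result. The function names, the attack and release coefficients, and the threshold are assumptions made for illustration; this description does not specify them, and the division into bands, pitch extraction and digital filter stages are omitted.

```python
import math

def envelope_follower(samples, attack=0.5, release=0.01):
    """Track the amplitude envelope of a signal (hypothetical one-pole design)."""
    env = 0.0
    out = []
    for x in samples:
        level = abs(x)                              # full-wave rectification
        coeff = attack if level > env else release  # rise quickly, decay slowly
        env += coeff * (level - env)
        out.append(env)
    return out

def detect_peaks(envelope, threshold=0.1):
    """Indices of local maxima of the envelope that exceed the threshold."""
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i] > threshold
            and envelope[i] > envelope[i - 1]
            and envelope[i] >= envelope[i + 1]]

# A 440 Hz tone burst surrounded by silence, sampled at 8 kHz.
rate = 8000
burst = [math.sin(2 * math.pi * 440 * n / rate) for n in range(400)]
signal = [0.0] * 200 + burst + [0.0] * 200

env = envelope_follower(signal)
peaks = detect_peaks(env)
```

With this input, every detected peak falls inside the burst, since the envelope is zero before it and decays monotonically after it.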
[0026]
As shown in FIG. 7, the speech recognition unit 24 includes a data preprocessing block 51 that performs preprocessing for speech recognition, and a recognition unit 52 that performs speech recognition based on the preprocessed data.
[0027]
The data preprocessing block 51 includes an envelope follower 55 that detects the envelope of each piece of audio data decomposed and extracted by the voice decomposition unit 23, and a level adjustment unit 56 that adjusts the volume level by flattening the pitch of the output signal from the envelope follower 55. These perform preliminary adjustment so as to improve the voice recognition accuracy of the recognition unit 52.
[0028]
The recognition unit 52 includes a conversion unit 57 that performs predetermined conversion processing such as cepstrum conversion and hidden Markov conversion, and a phoneme segment extraction unit 58 that identifies the type of sound (sound of waves, sound of wind, etc.) from the data converted by the conversion unit 57 and, when the sound is a language spoken by a human, converts it into a character string. A phoneme segment database (DB) 59 is connected to the phoneme segment extraction unit 58, and the identification of the sound type or the generation of the character string is performed based on the phoneme segment DB 59. The recognition unit 52 also has a function of recognizing the time-series movement of a recognized sound and, based on this movement information, associates the sound with a background image or object that has undergone pattern matching 32 in the image recognition unit 22, recognizing it as a link sound. For example, when the recognition unit 52 finds a moving sound, it notifies the situation reference 33 block of the image recognition unit 22 that the sound is moving, so that the sound is associated with an object in the moving image information. A stationary sound can be associated with a background image in the moving image information.
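The association between sounds and image elements might be sketched as follows in Python. This is a minimal sketch under assumptions: sound position is represented as a stereo pan value between -1.0 (left) and 1.0 (right), object position as a horizontal coordinate on the same scale, and the matching rule (closest horizontal displacement) and the threshold are hypothetical; this description does not state how the situation reference 33 block makes the decision.

```python
def is_moving(pan_track, threshold=0.2):
    """Treat a sound as moving if its position shifts by more than the threshold."""
    return abs(pan_track[-1] - pan_track[0]) > threshold

def associate(sound_pan, objects, backgrounds):
    """Choose the image element that a sound should be associated with.

    sound_pan   : positions of the sound over time, -1.0 (left) to 1.0 (right)
    objects     : dict mapping object name to horizontal positions over time
    backgrounds : list of background image names (stationary elements)
    """
    if not is_moving(sound_pan):
        # a stationary sound goes with a background image
        return ("background", backgrounds[0])
    sound_shift = sound_pan[-1] - sound_pan[0]
    # a moving sound goes with the object whose displacement matches best
    best = min(objects,
               key=lambda name: abs((objects[name][-1] - objects[name][0]) - sound_shift))
    return ("link", best)

objects = {
    "running boy": [-1.0, 0.0, 1.0],     # moves left to right
    "flying seagull": [1.0, 0.0, -1.0],  # moves right to left
}
backgrounds = ["sky, sea and beach"]

whistle = associate([-0.9, 0.0, 0.9], objects, backgrounds)
gull_cry = associate([0.8, 0.1, -0.7], objects, backgrounds)
waves = associate([0.0, 0.05, 0.0], objects, backgrounds)
```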
[0029]
When the audio data supplied from the data preprocessing block 51 is music such as background music (BGM), the recognition unit 52 detects its melody, harmony and rhythm and, based on the detection result, converts the music into MIDI data as digital data, so that the MIDI data can be transmitted together with the symbolized data. In some cases it may be difficult to detect the melody, harmony and rhythm of the music. In such cases, MIDI data for each timbre may be prepared in advance as several channels of the original audio information 16 and used as is.
[0030]
The transmission data generation unit 25 generates transmission data based on the recognition result of the image recognition unit 22 and the recognition result of the voice recognition unit 24, symbolizing each background image, each object, each link sound and each background sound while referring to data in a predetermined database 61. In particular, when an object or background image in the moving image information is associated with individual audio data in the audio information, these are handled together as one item of situation data. An example of the generated data is shown in Table 1 below.
[0031]
[Table 1]

[0032]
The time column in Table 1 indicates the time at which each background image or object appears in the original moving image information or begins a new action. The image column contains a type code indicating the type of background image or object: for example, “SUN000” means a sunset, “BOY008” a running boy, and “BRD425” a flying seagull. The action column on the left contains an action code indicating the action of the background image or object: “ACT000” indicates that “SUN000 (sunset)” descends while glowing red near the center of the screen, “MOV104” that “BOY008 (running boy)” moves from left to right at the bottom of the screen, and “MOV203” that “BRD425 (flying seagull)” moves from right to left at the top of the screen. The voice column indicates the type of sound: “BGM010” means refreshing music as background music (BGM), “SNG408” whistling, and “BRD443” the cry of a seagull. The action column on the right indicates the movement of the sound listed in the voice column: “0” indicates a stationary state, “LINK-C3” indicates a link sound linked to the movement of the element on the third line from the top of Table 1, and “LINK-C4” indicates a link sound linked to the movement of “BRD425 (flying seagull)” on the fourth line from the top.
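The column structure described above can be sketched as a small Python data structure. The codes are those quoted in the text, but the rest is assumed for illustration: the time values are placeholders because the actual entries of Table 1 are not reproduced here, the line numbering assumes that the header occupies the first line, the third line is assumed to hold “BOY008”, and the pairing of “BGM010” with the “SUN000” row is a guess. The names `Row`, `linked_line` and `link_target` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Row:
    line: int           # line number within the table, as referenced by LINK-Cn
    time: str           # placeholder; real times from Table 1 are not known here
    image: str          # type code of the background image or object
    image_action: str   # action code of the image element
    voice: str          # type code of the sound
    voice_action: str   # "0" (stationary) or "LINK-Cn"

def linked_line(voice_action: str) -> Optional[int]:
    """Table line that a link sound follows, or None for a stationary sound."""
    prefix = "LINK-C"
    if voice_action.startswith(prefix):
        return int(voice_action[len(prefix):])
    return None

# Assumed layout: line 1 is the header, so data rows start at line 2.
rows = [
    Row(2, "t0", "SUN000", "ACT000", "BGM010", "0"),
    Row(3, "t1", "BOY008", "MOV104", "SNG408", "LINK-C3"),
    Row(4, "t2", "BRD425", "MOV203", "BRD443", "LINK-C4"),
]
by_line = {r.line: r for r in rows}

def link_target(row: Row) -> Optional[str]:
    """Type code of the image element whose movement the row's sound follows."""
    n = linked_line(row.voice_action)
    return by_line[n].image if n is not None else None
```

Resolving the link on the receiving side in this way lets the whistle follow whichever substitute material is chosen for the boy, without the transmission data repeating the movement.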
[0033]
In Table 1, images and sounds are expressed mainly by type codes such as “SUN000”. However, more detailed information may be needed when the reception/playback apparatus 11 actually associates and selects a substitute image (described later). In such a case, state information about the element specified by the type code (for example, “shape”, “description” and “situation”) can be attached to the type code as a subset of the type code (hierarchical information), which allows the reception/playback apparatus 11 to perform reproduction in a form better suited to the purpose. For example, when the sun, “SUN000”, is specified as the type code, codes can specify “round” as the “shape”, “red” as the “description” and “low position” as the “situation”. With this information available, it becomes possible at reproduction time to change “SUN000” from the sun to the moon, for example, and to change the “description” from “red” to “yellow”. In principle, the contents of these codes are the same as those stored in the reception/playback apparatus 11. However, when new information that does not exist in the reception/playback apparatus 11 needs to be transmitted, the information can be transmitted together with its code. In this case, the new information and code may be transmitted together with the transmission data or in association with it.
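One way to picture a type code with attached state information is the following Python sketch. The class name `Element`, the function `restyle` and the use of a dictionary for the state information are assumptions for illustration; this description does not define how the hierarchical information is encoded.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    type_code: str                             # e.g. "SUN000"
    state: dict = field(default_factory=dict)  # subset (hierarchical) information

def restyle(element, **changes):
    """Return a copy whose state information is partly overridden at playback."""
    return Element(element.type_code, {**element.state, **changes})

sun = Element("SUN000", {
    "shape": "round",
    "description": "red",
    "situation": "low position",
})

# At reproduction time, change the description from red to yellow.
yellow_variant = restyle(sun, description="yellow")
```

Because `restyle` returns a copy, the received data stays intact and several differently styled variants can be derived from the same transmission.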
[0034]
The transmission data generation unit 25 has a function of including the MIDI data in the transmission data in addition to the symbolized data when the MIDI data is given from the voice recognition unit 24.
[0035]
The transmission data generation unit 25 also has a function of changing the symbolized transmission data according to the editing result of the editing unit 26 (described later), and a function of changing the symbolized transmission data based on a signal from the request recognition unit 28. This makes individual adjustments possible, such as omitting the background music (BGM) represented by “BGM010” above, based on the editing result of the editing unit 26 or the signal from the request recognition unit 28.
[0036]
The editing unit 26 actually reproduces the moving image and audio on a predetermined monitor 62 based on the transmission data symbolized by the transmission data generation unit 25, and allows an operator, while checking this reproduction, to change or delete transmission data using a predetermined input unit 63 such as a keyboard and mouse.
[0037]
The data communication unit 27 has a function of transmitting various information, including the transmission data generated by the transmission data generation unit 25 and edited by the editing unit 26, to the reception/playback apparatus 11 through the network 10, and a function of receiving request signals sent from the reception/playback apparatus 11 through the network 10.
[0038]
When a request signal is received from the reception/playback apparatus 11 through the network 10 and the data communication unit 27, the request recognition unit 28 recognizes the request, for example a request to omit the background music (BGM), and passes it to the transmission data generation unit 25.
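How a recognized request might change the transmission data can be sketched in Python as follows. The request name `"omit_bgm"`, the record layout and the rule that background music codes begin with “BGM” (inferred only from the example code “BGM010”) are all assumptions for illustration.

```python
def apply_requests(records, requests):
    """Return transmission records adjusted according to receiver requests.

    records  : list of dicts with "image" and "voice" type codes
    requests : set of request names; "omit_bgm" is a hypothetical name
    """
    result = []
    for rec in records:
        rec = dict(rec)  # copy so the original transmission data is left intact
        voice = rec.get("voice") or ""
        if "omit_bgm" in requests and voice.startswith("BGM"):
            rec["voice"] = None  # drop the background music entry only
        result.append(rec)
    return result

records = [
    {"image": "SUN000", "voice": "BGM010"},
    {"image": "BOY008", "voice": "SNG408"},
]

trimmed = apply_requests(records, {"omit_bgm"})
unchanged = apply_requests(records, set())
```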
[0039]
The reception/playback apparatus 11 receives the transmission data symbolized as shown in Table 1 above, reads out alternative materials stored in advance based on this data while referring to a content database (DB) 71, an action rule database (DB) 72 and an image/sound material database (DB) 73, and combines these alternative materials to reproduce them as the moving image 13 and audio 14.
[0040]
Specifically, the reception/playback apparatus 11 includes: a data communication unit 75 that communicates with the transmission device 12 through the network 10; a replacement processing unit 76 that reads the transmission data (see Table 1) supplied from the transmission device 12 through the network 10 and the data communication unit 75, identifies the type of content (image, sound, text, music (MIDI data), etc.) by referring to the content database 71, and replaces the movement of each object or the like (for example, moving from left to right) by referring to the information stored in the action rule database 72; a data decompression unit 77 that, based on the result of the replacement processing unit 76, decompresses any data that requires decompression; an action calculation unit 78 that calculates the movement of each object or the like based on the signal from the replacement processing unit 76; a data synthesis unit 79 that collates the signal from the action calculation unit 78 with the alternative materials in the image/sound material database 73, determines the material for each image and sound, and combines them as layers; a control timer 80 serving as time measuring means; a sequencer 81 that synchronizes the output image and output sound from the data synthesis unit 79 based on the time measured by the control timer 80; a video playback control device 82 that plays back the moving image based on the output of the sequencer 81; a video synthesizer 83 that performs final synthesis of the video output from the video playback control device 82; a synthesizer 84 that synthesizes audio waveforms based on the output of the sequencer 81; an audio mixer 85 that mixes audio based on the output of the synthesizer 84; a central control unit 87 that controls the entire reception/playback apparatus 11; and a request unit 88 that issues requests to the transmission device 12 based on signals from the central control unit 87. The synthesizer 84 mainly reproduces MIDI data and the like. When Wave data, MP3 data or the like is used as audio information, the data bypasses the synthesizer 84 and is sent directly from the sequencer 81 to the audio mixer 85.
[0041]
The central control unit 87 recognizes the preferences of the user of the reception/playback apparatus 11 based on input from a predetermined input device 89, specifies to the data synthesis unit 79 the style to be used (for example, live-action style, comic style, classic style or modern pop style), and recognizes requests such as that background music (BGM) is not required, transmitting this information to the transmission device 12 through the request unit 88. The central control unit 87 also has a function of transmitting to the transmission device 12, through the request unit 88, various information such as the purpose for which the digital content is received and played back (for example, business use or private use) and the usage environment (for example, the communication speed of the modem used, or ISDN).
[0042]
The image materials in the material database 73 may be still images. However, since it may be desirable to show motion in the representation of each object, they may also be animation images in which a plurality of still images are displayed in sequence to form a moving image. In this case, compressed image materials such as JPEG or animated GIF may be used. The audio materials may be in any data format, such as MIDI data, Wave data or MP3 data. When compressed data such as MP3 data is used, it is desirable to provide a decompression circuit (not shown) for expanding the compressed data, inside or outside the material database 73.
[0043]
The alternative materials (image materials and audio materials) in the material database 73 are not limited to one per symbolized content item. To accommodate diverse user preferences, a large number of motifs, such as live-action style, comic style, classic style and modern pop style, are prepared for each symbolized content item.
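A lookup of alternative materials by type code and style might look like the following Python sketch. The table contents, the file names and the fallback to a default style when the requested style is missing are assumptions for illustration; this description does not say what happens when no material exists for the requested style.

```python
# Keys pair a type code from the transmission data with a style name.
# File names are placeholders, not taken from this description.
MATERIAL_DB = {
    ("BOY008", "live-action"): "boy_live.jpg",
    ("BOY008", "comic"): "boy_comic.gif",
    ("BRD425", "live-action"): "gull_live.jpg",
}

def select_material(db, type_code, style, fallback_style="live-action"):
    """Return the material for (type_code, style), else the fallback style's."""
    if (type_code, style) in db:
        return db[(type_code, style)]
    return db.get((type_code, fallback_style))  # None if nothing is stored

boy = select_material(MATERIAL_DB, "BOY008", "comic")
gull = select_material(MATERIAL_DB, "BRD425", "comic")    # no comic gull stored
missing = select_material(MATERIAL_DB, "SUN000", "comic")  # no material at all
```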
[0044]
Each component described above may be configured as hardware with a dedicated circuit configuration, or may be realized as a functional component that operates according to a predetermined software program using a CPU.
[0045]
<Operation>
The operation of the information communication system having the above configuration will be described.
[0046]
First, on the transmission device 12 side, as shown in FIG. 4, the image decomposition unit 21 decomposes the original moving image information 15 into background images and objects. The image decomposition unit 21 extracts regions by detecting the edges of each part of the image using a high-pass filter in the frequency space of the image, extracts the movement of each part of the moving image using its motion extraction function (motion detection) to extract the objects, treats the portion remaining after the extracted objects are removed as the background, and further decomposes the background into several background images by edge detection. If the objects and background images have been divided into separate channels in advance, the image decomposition unit 21 may simply recognize each element as such.
[0047]
For example, suppose that the original moving image information 15 is, as shown in FIG. 1, a scene of a sunset over the sea in which a boy runs along the beach from the left edge toward the right while whistling and a seagull flies away from the right edge toward the left. The image decomposition unit 21 sets the landscape of sky, sea and beach contained in the original moving image as the first background image 1, the sunset located above the sea horizon as the second background image 2, the boy running from left to right on the beach as the first object 3, and the seagull flying from right to left in the sky as the second object 4.
[0048]
Next, the image recognition unit 22 performs image recognition on each of the background images 1 and 2 and the objects 3 and 4 decomposed by the image decomposition unit 21. As shown in FIG. 5, the image recognition unit 22 performs feature extraction 31 on each part (background images 1 and 2 and objects 3 and 4) decomposed by the image decomposition unit 21, performs pattern matching 32 against predetermined standard patterns, and outputs the result to the transmission data generation unit 25.
[0049]
As a result, in the original moving image information 15 of FIG. 1, for example, the first background image 1 is recognized as a landscape of sky, sea and beach, the second background image 2 as a sunset located above the sea horizon, the first object 3 as a boy running from left to right on the beach, and the second object 4 as a seagull flying from right to left in the sky.
[0050]
At the time of this image recognition, the association with the type of voice recognized by the voice recognition unit 24 described later is performed in the block of the situation reference 33.
[0051]
In parallel with this image decomposition and image recognition, the voice decomposition unit 23 extracts the main sounds from the original audio information 16 and decomposes them. As shown in FIG. 6, the voice decomposition unit 23 detects the envelope of the original audio information 16 with the envelope follower 41, and the peak detection unit 42 divides the output signal of the envelope follower 41 into bands and detects the peaks in each band. The pitch extraction block 43 then extracts the pitch from the peak detection result of the peak detection unit 42, and the signal is passed through the digital filter 44. Here, the band of the sound to be extracted (for example, the human voice) is set as the band detected by the peak detection unit 42, and desired sounds such as the sound of waves, the sound of wind, the cry of a seagull and whistling are extracted individually. If the sound elements have been divided into separate channels in advance, the voice decomposition unit 23 may simply recognize each element and decompose them accordingly.
[0052]
The individual voice data subjected to the voice decomposition by the voice decomposition unit 23 is recognized by the voice recognition unit 24.
[0053]
In the data preprocessing block 51 of the voice recognition unit 24, as shown in FIG. 7, the envelope follower 55 detects the envelope of each piece of audio data decomposed and extracted by the voice decomposition unit 23, and the level adjustment unit 56 adjusts the volume level of the output signal from the envelope follower 55. The conversion unit 57 of the recognition unit 52 then performs predetermined conversion processing such as cepstrum conversion and hidden Markov conversion. Using the phoneme segment DB 59, the phoneme segment extraction unit 58 identifies the type of sound (sound of waves, sound of wind, etc.) from the data converted by the conversion unit 57 and, when the sound is a language spoken by a human, generates a character string for that language.
[0054]
The recognition unit 52 also recognizes the time-series movement of a recognized sound, for example when the whistling moves from left to right, and based on this movement information associates the sound with a background image or object that has undergone pattern matching 32 in the image recognition unit 22, recognizing it as a link sound.
[0055]
When the audio data supplied from the data preprocessing block 51 is music such as background music (BGM), the recognition unit 52 detects its melody, harmony and rhythm and, based on the detection result, converts the music into MIDI data as digital data. As a result, the MIDI data can be transmitted together with the symbolized data.
[0056]
The recognition result of the image recognition unit 22 and the recognition result of the voice recognition unit 24 are then passed to the transmission data generation unit 25, which generates transmission data based on them. That is, the transmission data generation unit 25 symbolizes each background image, each object, each link sound and each background sound while referring to data in the predetermined database 61, and generates transmission data in the form of a data string as shown in Table 1 above. In particular, when an object or background image in the moving image information is associated with individual audio data in the audio information, these are handled together as one item of situation data. The meaning of each data item in Table 1 is as described above.
[0057]
In addition, when the MIDI data is given from the voice recognition unit 24, the transmission data generation unit 25 includes the MIDI data in the transmission data in addition to the symbolized data.
[0058]
Here, when the operator changes or deletes the transmission data in the editing unit 26, the transmission data is changed accordingly. Further, when a request signal is given from the reception / playback apparatus 11 through the data communication unit 27 and the network 10, the request recognition unit 28 recognizes the request, and transmission data reflecting the request is generated. Thus, individual adjustments such as omitting the background music (BGM) represented by “BGM010” in Table 1 are performed in response to a request to omit the background music (BGM), for example.
[0059]
The transmission data generated in this way is transmitted from the data communication unit 27 to the reception / playback apparatus 11 through the network 10.
[0060]
In the reception/playback apparatus 11, the data communication unit 75 receives the transmission data symbolized as shown in Table 1 above. The replacement processing unit 76 reads the transmission data received by the data communication unit 75, identifies the type of content (image, sound, text, music (MIDI data), etc.) by referring to the content database 71, and replaces the movement of each object or the like (for example, moving from left to right) by referring to the information stored in the action rule database 72.
[0061]
Based on the result replaced by the replacement processing unit 76, the data decompression unit 77 performs data decompression if there is data to be decompressed. Then, the action calculation unit 78 calculates the movement of each object or the like based on the signal from the replacement processing unit 76. The data composition unit 79 collates the signal from the action calculation unit 78 with the alternative material in the image / sound material database 73 to determine each image and sound material and synthesizes them as a layer.
[0062]
The sequencer 81 then synchronizes the output image and output sound from the data synthesis unit 79 based on the time measured by the control timer 80. The moving image is played back by the video playback control device 82, and the video synthesizer 83 performs final synthesis of the video and outputs it. For MIDI data, the synthesizer 84 synthesizes the audio waveform, and the audio mixer 85 mixes the audio and outputs it. When Wave data, MP3 data or the like is used as audio information, the data bypasses the synthesizer 84, is sent directly from the sequencer 81 to the audio mixer 85, and is reproduced.
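The routing of audio data after the sequencer 81 can be summarized in a short Python sketch. The function name `audio_path` and the format strings are hypothetical; only the two routes themselves (MIDI through the synthesizer 84, Wave and MP3 directly to the audio mixer 85) come from the description.

```python
def audio_path(data_format):
    """Blocks that an audio item passes through after leaving the sequencer 81."""
    fmt = data_format.upper()
    if fmt == "MIDI":
        # MIDI data is turned into a waveform first, then mixed
        return ["synthesizer 84", "audio mixer 85"]
    if fmt in ("WAVE", "MP3"):
        # waveform data bypasses the synthesizer
        return ["audio mixer 85"]
    raise ValueError(f"unsupported audio format: {data_format}")

midi_route = audio_path("MIDI")
mp3_route = audio_path("mp3")
```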
[0063]
Here, when the user on the reception / playback apparatus 11 side inputs the request through the predetermined input device 89, the central control unit 87 recognizes the user's request and reflects it in the synthesis of the moving image 13 and the audio 14.
[0064]
For example, when image materials with the same motif as the original moving image information 15 are stored in advance in the image/sound material database (DB) 73 and the user specifies that the motif be kept as is, the reception/playback apparatus 11 can reproduce a moving image 13 almost identical to the original moving image information 15 prepared by the transmission device 12, as indicated by reference numeral 18 in FIG. 3. The same applies to the audio 14: when the user specifies that the motif of the audio be kept as is, the reception/playback apparatus 11 can reproduce audio 14 almost identical to the original audio information 16 prepared by the transmission device 12.
[0065]
When the original moving image information 15 is a live-action image as shown in FIG. 1 and the user wishes to have it reproduced in comic style by the reception/playback apparatus 11, the user conveys this to the central control unit 87 through the input device 89. In accordance with a command from the central control unit 87, the data synthesis unit 79 then selects and reads out comic-style image materials from the image/sound material database 73 instead of live-action image materials. The result of combining these image materials in the data synthesis unit 79 is as indicated by reference numeral 19 in FIG. 3 and FIG. In this case, as shown in FIG. 2, the first background image 1 (landscape of sky, sea and beach), the second background image 2 (sunset located above the sea horizon), the first object 3 (boy running from left to right on the beach) and the second object 4 (seagull flying from right to left in the sky) retain the same meaning as in the original moving image information 15, while the style can easily be changed to the one the user desires. Furthermore, if a photograph of the user, for example, is prepared in the image/sound material database 73, the user can enjoy a moving image 13 featuring himself or herself as a motif when it is reproduced by the reception/playback apparatus 11.
[0066]
As in the example of Table 1, when some data is linked to other data (for example, when a link sound in the audio data is associated with an object), the moving image and the audio are reproduced in a very natural correspondence, moving in synchronization with each other, so that a high-level digital content providing service can be offered.
[0067]
Similarly, the voice 14 can be enjoyed by changing various motifs. For example, instead of using MIDI data provided from the transmission device 12 as background music, completely different music stored in advance in the image / audio material database 73 may be used as background music. These can be easily changed simply by the user inputting to the input device 89. Therefore, it is possible to easily provide digital contents according to user preferences.
[0068]
In the operation described above, the data sent from the transmission device 12 to the reception/playback apparatus 11 through the network 10 consists mainly of the transmission data symbolized as shown in Table 1, and there is no need to transmit moving image data itself, such as MPEG, or audio data itself, such as Wave or MP3. Therefore, even when communication takes place over the Internet via a general public telephone line using a modem or ISDN, where line capacity is limited, deterioration in the response with which information transmitted from the transmission device 12 reaches the reception/playback apparatus 11 can be greatly reduced, and a digital content providing service with very good communication response can be realized. Specifically, when the original moving image information 15 and original audio information 16 amount to several megabytes, transmitting them as they are through the network 10 would take an extremely long time and make real-time communication difficult. In this embodiment, by contrast, several megabytes of information can be replaced with several tens of bytes, giving a compression effect of 1/100,000 or better, so that communication can be performed without placing a load on the network 10.
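The order of magnitude of this reduction can be checked with a short Python calculation. The concrete figures are assumptions chosen to match the wording: 3 MB stands in for “several megabytes”, 30 bytes for “several tens of bytes”, and 64 kbps for an ISDN-class line rate; protocol overhead is ignored.

```python
LINE_RATE_BPS = 64_000             # assumed ISDN-class line rate, bits per second
original_bytes = 3 * 1024 * 1024   # "several megabytes": 3 MB as an example
symbol_bytes = 30                  # "several tens of bytes": 30 B as an example

def transfer_seconds(num_bytes, rate_bps=LINE_RATE_BPS):
    """Idealized transfer time, ignoring protocol overhead."""
    return num_bytes * 8 / rate_bps

ratio = symbol_bytes / original_bytes       # fraction of the original size
t_original = transfer_seconds(original_bytes)
t_symbols = transfer_seconds(symbol_bytes)
```

Under these assumptions the symbolized data is less than 1/100,000 of the original size, and the transfer time falls from about 393 seconds to a few milliseconds.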
[0069]
In addition, if high-resolution image materials and high-quality sound materials are prepared as alternative materials in the image / audio material database 73 of the reception / playback apparatus 11, digital content can be played back with excellent image quality and sound quality, without the deterioration in response described above.
[0070]
Since the digital content can be edited by the editing unit 26 in the transmission device 12 as shown in FIG. 4, when a so-called regular user always requests that background images and background music be omitted, the editing unit 26 can also edit the content in advance so that the background image and background music are omitted for that user.
[0071]
In addition, since the background image and the background music may not be essential to the content, when the request recognition unit 28 recognizes that a given user is using a low-speed modem, the editing unit 26 may, on that basis, instruct the transmission data generation unit 25 to automatically omit the background image and the background music.
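The omission rule described above can be sketched as a simple filter over the encoded elements. This is only an illustration under stated assumptions: the `Element` type, the category names, and the `LOW_SPEED_BPS` threshold are hypothetical, since the text says only "low-speed modem" and does not specify a data layout.

```python
from dataclasses import dataclass

# Hypothetical threshold; the source does not define what counts as "low-speed".
LOW_SPEED_BPS = 33_600

@dataclass(frozen=True)
class Element:
    category: str  # hypothetical names: "background_image", "object", "background_music", "link_audio"
    code: str      # symbol assigned to the element after recognition

# Categories the text describes as possibly non-essential to the content.
OPTIONAL_CATEGORIES = {"background_image", "background_music"}

def edit_for_line_speed(elements, line_bps):
    """Drop optional elements when the user's line is at or below the threshold."""
    if line_bps > LOW_SPEED_BPS:
        return list(elements)
    return [e for e in elements if e.category not in OPTIONAL_CATEGORIES]

scene = [
    Element("background_image", "sea"),
    Element("object", "dog"),
    Element("link_audio", "bark"),
    Element("background_music", "bgm01"),
]
slow = edit_for_line_speed(scene, 28_800)
fast = edit_for_line_speed(scene, 64_000)
```

With the assumed values, `slow` keeps only the object and its link audio, while `fast` keeps all four elements.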
[0072]
Further, when a classic style is to be expressed, for example, the editing unit 26 in the transmission device 12 or the central control unit 87 in the reception / playback device 11 may perform color adjustment, such as unifying the colors to a deep brownish tone.
[0073]
In the above-described embodiment, different motifs are prepared in advance in the image / audio material database 73 of the reception / playback apparatus 11 for all background images, all objects, and all audio data, and all of these are changed as desired by the user. However, only some of the background images, objects, and audio data may be changed according to the user's request. In this case, only one or two of the three categories (background image, object, and audio data) may be changed, or only some of the items within each category may be changed.
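Partial motif changes of this kind can be sketched as a per-category lookup. The database contents, file names, category names, and the `select_materials` function below are all hypothetical; the sketch only shows how a requested motif can apply to one category while the others keep their default material.

```python
# Hypothetical material database: (category, code) -> {motif name: material}.
MATERIAL_DB = {
    ("background_image", "sea"): {"default": "sea_photo.png", "classic": "sea_sepia.png"},
    ("object", "dog"): {"default": "dog_photo.png", "cartoon": "dog_cartoon.png"},
    ("audio", "bark"): {"default": "bark.wav"},
}

def select_materials(transmission_data, motif_by_category):
    """Pick one alternative material per encoded element.

    A category absent from motif_by_category, or a motif that is not
    prepared for an element, falls back to the "default" material.
    """
    selected = []
    for category, code in transmission_data:
        variants = MATERIAL_DB[(category, code)]
        motif = motif_by_category.get(category, "default")
        selected.append(variants.get(motif, variants["default"]))
    return selected

data = [("background_image", "sea"), ("object", "dog"), ("audio", "bark")]
only_objects_changed = select_materials(data, {"object": "cartoon"})
```

Here only the object category is changed; the background image and the audio are reproduced from their default materials.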
[0074]
In the above embodiment, the transmission data is given from the transmission device 12 to the reception / playback apparatus 11 through the network 10, such as a general public telephone line, but it may instead be given to the reception / playback apparatus 11 through a recording medium such as a CD-ROM or an IC card. In this case, needless to say, the transmission device 12 must be provided with a recording device (recording means) for recording the transmission data on the recording medium, and the reception / playback apparatus 11 requires a drive device for reading the transmission data from the recording medium. Even when a recording medium is used in this way, there is the advantage that the transmission data is far less constrained by the data capacity of the recording medium than when the original moving image information 15 and the original audio information 16 are conveyed as they are. When a predetermined recording medium is used, it is difficult to convey the user's request from the reception / playback apparatus 11 side to the transmission device 12, so it is desirable to use the recording medium together with communication through the network 10.
[0075]
[Effects of the invention]
According to the inventions of claims 1 and 2, the content is decomposed by predetermined decomposition means (for example, the image decomposition unit of the transmission device decomposes the original moving image information into individual background images and objects, and the audio decomposition unit extracts and decomposes at least the main audio from the original audio information). Image recognition and voice recognition are then performed, the results are encoded according to their content to generate transmission data, and the transmission data is sent to the reception / playback apparatus. The reception / playback apparatus reads alternative materials stored in advance in, for example, a predetermined image / audio material database based on the received transmission data, and synthesizes these alternative materials to reproduce a moving image and sound. The encoded transmission data given from the transmission device to the reception / playback apparatus through the network or the recording medium is therefore far smaller than the moving image itself, such as MPEG data, or audio data such as WAVE or MP3 data. For example, even when communication is performed over the Internet through a general public telephone line using a modem or ISDN, the deterioration in response on the reception / playback apparatus side caused by the limited line capacity can be greatly reduced, and a content providing service with a very good communication response can be implemented. Likewise, when the transmission data is conveyed through a recording medium, there is the advantage that it is hardly constrained by the data capacity of the recording medium.
[0076]
According to the invention of claim 3, the transmission data to be transmitted to the reception / playback apparatus can be edited in the editing unit of the transmission device, so the content can be changed on the transmission device side according to various circumstances, such as the communication speed of the network.
[0077]
According to the invention of claim 4, in response to a user's request, it is possible to easily reproduce moving images and sounds whose semantic content is the same as that of the original moving image information and original audio information but whose form of expression differs, or to omit some of the background images, objects, and sounds. Content matching the user's wishes can thus be provided easily without increasing the communication load.
[0078]
According to the invention of claim 5, the user's wishes, received over the network, can be reflected in the transmission data transmitted from the transmission device. Unnecessary data not desired by the user is thus kept out of the transmission data flowing through the network, and efficient communication can be performed.
[0079]
According to the invention of claim 6, when at least part of the audio data is recognized as link audio associated with the background image or the object and is symbolized by the transmission data generation unit, the transmission data is generated so as to include data about that association, and the reception / playback apparatus plays back the link audio in association with the playback of the background image or the object. For example, an object in the moving image and the link audio are then played back in a very natural state, such as moving in synchronization, so a high-level content providing service can be provided.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating an example of original moving image information prepared in a transmission apparatus.
FIG. 2 is a diagram illustrating an example of a moving image to be played back by a reception / playback apparatus.
FIG. 3 is a block diagram showing an overall configuration of an information communication system according to an embodiment of the present invention.
FIG. 4 is a block diagram showing a transmission device of an information communication system according to an embodiment of the present invention.
FIG. 5 is a block diagram illustrating an image recognition unit.
FIG. 6 is a block diagram showing a speech decomposition unit.
FIG. 7 is a block diagram showing a voice recognition unit.
FIG. 8 is a block diagram showing a reception / playback apparatus for an information communication system according to an embodiment of the present invention.
[Explanation of symbols]
1, 2 Background image
3, 4 Object
10 Network
11 Reception / playback device
12 Transmission device
13 Moving image
14 Audio
15 Original moving image information
16 Original audio information
21 Image decomposition unit
22 Image recognition unit
23 Audio decomposition unit
24 Voice recognition unit
25 Transmission data generation unit
26 Editing unit
27 Data communication unit
28 Request recognition unit
31 Feature extraction
32 Pattern matching
33 Situation reference
41 Envelope follower
42 Peak detector
43 Pitch extraction block
44 Digital filter
51 Data preprocessing block
52 Recognition unit
55 Envelope follower
56 Level adjuster
57 Converter
58 Phoneme segment extraction unit
62 Monitor
63 Input unit
73 Image / audio material database
75 Data communication unit
76 Replacement processing unit
77 Data decompression unit
78 Action calculator
79 Data composition unit
80 Control timer
81 Sequencer
82 Video playback control device
83 Video synthesizer
84 Synthesizer
85 Audio mixer
87 Central control unit
88 Request unit
89 Input device
89 Input device

Claims

An information communication system in which a transmission device transmits desired content to a reception / playback device through a predetermined network or conveys it via a predetermined recording medium, wherein
the transmission device comprises:
decomposing means for decomposing the original audio information related to the content, or both the original moving image information and the original audio information related to the content, into individual elements, the original audio information being decomposed into individual elements for each sound source;
recognizing means for recognizing the type of each individual element decomposed by the decomposing means;
a transmission data generation unit that generates transmission data by encoding the element types obtained as recognition results by the recognizing means; and
a transmission-side data communication unit that transmits the transmission data generated by the transmission data generation unit to the reception / playback device through a predetermined network, and / or data recording means for recording the transmission data generated by the transmission data generation unit on a predetermined recording medium; and
the reception / playback device receives the transmission data through the network or reads it from the recording medium conveyed from the transmission device, reads, based on the transmission data, alternative materials stored in advance in a predetermined database, and synthesizes these alternative materials to reproduce audio, or both moving images and audio.

An information communication system in which a transmission device transmits desired content to a reception / playback device through a predetermined network or conveys it via a predetermined recording medium, wherein
the transmission device comprises:
an image decomposition unit that decomposes the original moving image information related to the content into one or more background images and one or more objects;
an image recognition unit that performs image recognition on each background image and each object obtained by the image decomposition unit;
an audio decomposition unit that extracts and decomposes at least the main audio from the original audio information related to the content;
a voice recognition unit that performs voice recognition on each piece of audio data obtained by the audio decomposition unit;
a transmission data generation unit that generates transmission data by individually symbolizing each background image, each object, and each piece of audio data based on the image recognition result of the image recognition unit and the voice recognition result of the voice recognition unit; and
a transmission-side data communication unit that transmits the transmission data generated by the transmission data generation unit to the reception / playback device through a predetermined network, and / or data recording means for recording the transmission data generated by the transmission data generation unit on a predetermined recording medium; and
the reception / playback device receives the transmission data through the network or reads it from the recording medium conveyed from the transmission device, reads, based on the transmission data, alternative materials stored in advance in a predetermined video / audio material database, and synthesizes these alternative materials to reproduce a moving image and audio.

The information communication system according to claim 2, wherein
the transmission device further includes an editing unit that edits the transmission data generated by the transmission data generation unit.

The information communication system according to claim 2 or claim 3, wherein
in the video / audio material database of the reception / playback device, a plurality of types of the alternative materials corresponding to the same transmission data information are prepared for at least some of the background images, objects, and audio data in the transmission data, and
when the reception / playback device plays back the moving image and audio, the plurality of types of the alternative materials can be selected for at least some of the background images, objects, and audio data, so that alternative materials whose semantic content is the same as the recognition result of the image recognition unit or the voice recognition result of the voice recognition unit, but whose form of expression differs from the original moving image information or the original audio information, can be used.

The information communication system according to claim 3, being a system in which the transmission device transmits desired content to the reception / playback device through the predetermined network, wherein
the reception / playback device allows a request to change or delete at least some of the background images and objects to be input through a predetermined input device when the reception / playback device plays back the moving image and audio, and
the editing unit of the transmission device receives, through the predetermined network, the request input through the input device of the reception / playback device, and causes the transmission data generation unit to generate the transmission data so as to reflect the request.

The information communication system according to any one of claims 2 to 5, wherein
at least part of the audio data is recognized as link audio associated with the background image or the object, and when it is symbolized by the transmission data generation unit, the transmission data is generated so as to include data about the association, and
the reception / playback device plays back the link audio in association with the playback of the background image or the object.