JP4725918B2

JP4725918B2 - Program image distribution system, program image distribution method, and program

Info

Publication number: JP4725918B2
Application number: JP2010174677A
Authority: JP
Inventors: ひろ美古川; 寛之佐藤
Original assignee: BOND CO Ltd
Current assignee: BOND CO Ltd
Priority date: 2009-08-06
Filing date: 2010-08-03
Publication date: 2011-07-13
Anticipated expiration: 2030-08-03
Also published as: JP2011055483A

Description

The present invention relates to a program image distribution system, a program image distribution method, and a program. More particularly, it relates to a program image distribution system and the like in which a plurality of remote reproduction processing devices each reproduce an input voice while displaying a program image generated by creating a character image corresponding to that voice.

Conventionally, some programs (content) provided on television or the Internet use live-action footage for the people and background while inserting and compositing a computer graphics (CG) animated character into part of that footage (see Patent Document 1).

It is also known for a plurality of game terminals to display a single character in step with the input of voice data, without displaying the face of the person speaking (see Patent Document 2).

Patent Document 1: JP 7-178240 A

Patent Document 2: JP 2003-248837 A

However, conventional programs were created by incorporating materials (photographs, video, audio, music, text, and so on) produced by specialists on a substantial budget. Once completed, such a program cannot be changed. Recipients of the program therefore merely received and displayed a single piece of content created solely to suit the distributor.

Moreover, Patent Document 2 states, for example, that the person speaking fully takes on the role of the character shown on the display screen (see paragraph 0251 of its specification). In the field of games, many participants take part in a single game, and each character must maintain a unified image tied to its participant. Thus, even if each character is regarded as content, the arrangement is the same as conventional program distribution in that the game terminals share a single piece of content defined with reference to each participant, who is the sender of the information.

The present invention was made in view of these problems in the prior art. Its object is to provide a program image distribution system and the like that can generate, in real time and in accordance with the circumstances of each recipient of the distribution, images in which various characters appear, for example, to be conversing with performers at the shooting location.

To solve this problem, a program image distribution system according to the present invention is a system in which a plurality of remote reproduction processing devices each reproduce an input voice while displaying a program image generated by creating a character image corresponding to the input voice. The system comprises a voice input terminal having voice input means to which the voice is input, and distribution management means that transmits the input voice to each remote reproduction processing device. The plurality of remote reproduction processing devices include a first remote reproduction processing device that creates a first character image corresponding to the input voice, and a second remote reproduction processing device that creates, corresponding to the input voice, a second character image different from the first character image. The distribution management means has: voice quantization means that divides the input voice and extracts part or all of it as voice quanta; voice quantum transmission means that transmits the voice quanta to each remote reproduction processing device; control command storage means that stores control commands for controlling the motion of a character; and control command transmission means that transmits the control commands to each remote reproduction processing device. Each remote reproduction processing device has receiving means that receives the transmitted voice quanta, and terminal program image generation means that, while reproducing the received voice quanta, creates the character image from character element images in accordance with the control commands and the received voice quanta and displays the program image. There are two or more kinds of character element images, and the terminal program image generation means of the second remote reproduction processing device creates the second character image from character element images of a kind different from those used in the first remote reproduction processing device.
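The distribution flow described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the fixed-size split in `quantize`, the `Terminal` class, the `distribute` function, and the element-set names are all assumptions introduced here, since the description only says that the input voice is divided and that part or all of it is extracted as voice quanta.

```python
from dataclasses import dataclass, field


@dataclass
class Terminal:
    """Hypothetical stand-in for a remote reproduction processing device."""
    name: str
    element_set: str                      # kind of character element images held locally
    frames: list = field(default_factory=list)

    def receive(self, quantum, command):
        # Each terminal builds its frame from its OWN element set, so the same
        # quantum and command yield a different character on each terminal.
        self.frames.append((self.element_set, command, len(quantum)))


def quantize(samples, quantum_size):
    """Split the input voice into voice quanta (fixed size is an assumption)."""
    return [samples[i:i + quantum_size]
            for i in range(0, len(samples), quantum_size)]


def distribute(samples, quantum_size, command, terminals):
    """Send every voice quantum and the control command to every terminal."""
    for quantum in quantize(samples, quantum_size):
        for terminal in terminals:
            terminal.receive(quantum, command)
```

In this sketch the audio and the control command are common to all terminals, and only the locally held element set differs, which mirrors the distinction the description draws between the first and second remote reproduction processing devices.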

In the present invention, it is desirable that the terminal program image generation means of each remote reproduction processing device have: terminal voice synchronization means that synchronizes the voice quanta with background audio data representing sound different from the input voice; voice reproduction means that reproduces the synchronized voice quanta and background audio data; image generation means that detects a feature of the voice quantum to be reproduced after the voice quantum currently being reproduced by the voice reproduction means, creates the character image from the character element images in accordance with the control commands and the detected feature, and creates the program image by compositing the character image with live-action data obtained by image capture; and synchronization means that detects a feature of the voice quantum currently being reproduced by the voice reproduction means and synchronizes the creation of the program image by the image generation means with the reproduction of the voice quanta by the voice reproduction means.
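The look-ahead arrangement in the preceding paragraph, in which the image for the next voice quantum is prepared while the current one is being reproduced, might be sketched as below. The peak-amplitude feature, the two mouth states, and the names `peak_amplitude`, `mouth_state`, and `schedule_frames` are assumptions made for illustration; the description does not fix which feature is detected.

```python
def peak_amplitude(quantum):
    """Feature of one voice quantum: its largest absolute sample value."""
    return max((abs(s) for s in quantum), default=0)


def mouth_state(quantum, threshold):
    """Assumed two-state mapping from the feature to a mouth shape."""
    return "open" if peak_amplitude(quantum) >= threshold else "closed"


def schedule_frames(quanta, threshold):
    """Pair each quantum index with the frame prepared for it in advance."""
    schedule = []
    if not quanta:
        return schedule
    prepared = mouth_state(quanta[0], threshold)   # prepared before playback starts
    for index in range(len(quanta)):
        frame = prepared                           # already ready when playback reaches `index`
        if index + 1 < len(quanta):
            # While quantum `index` is playing, inspect the quantum that follows it.
            prepared = mouth_state(quanta[index + 1], threshold)
        schedule.append((index, frame))
    return schedule
```

Preparing the frame one quantum ahead means the image is available the moment the matching audio starts, which is one plausible way to keep image creation and audio reproduction in step.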

Further, in the present invention, it is desirable that the distribution management means have storage means that stores the character element images, the live-action data, and the background audio data, and data transmission means that transmits, as needed, the first character element image or the second character element image, the live-action data, and the background audio data to some or all of the plurality of remote reproduction processing devices; that the system further comprise a remote reproduction device that displays moving image data, and moving image generation means that creates the character image from the character element images in accordance with the control commands and the received voice quanta, combines it with the voice quanta to create moving image data, and transmits the moving image data to the remote reproduction device; and that the remote reproduction device reproduce the received moving image data.

Further, in the present invention, it is desirable that some or all of the voice input terminal, the plurality of remote reproduction processing devices, and the remote reproduction device be designated to the moving image generation means as destinations of the generated moving image data; and that, when some or all of the plurality of remote reproduction processing devices are designated as destinations of the generated moving image data, the voice quantum transmission means of the distribution management means not transmit the voice quanta to the designated remote reproduction processing devices, and the designated remote reproduction processing devices reproduce the received moving image data.
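The routing rule in the preceding paragraph can be illustrated with a short sketch. The function `plan_delivery` and the device names are hypothetical; the only behavior taken from the description is that a remote reproduction processing device designated as a destination of the moving image data receives the moving image data and is not sent voice quanta.

```python
def plan_delivery(processing_devices, video_destinations):
    """Decide what each device receives: finished video or raw voice quanta."""
    # Every designated destination gets the pre-rendered moving image data.
    plan = {destination: "video" for destination in video_destinations}
    for device in processing_devices:
        if device not in video_destinations:
            # Not designated: the device renders locally from voice quanta.
            plan[device] = "quanta"
    return plan
```

For example, `plan_delivery(["pc1", "pc2"], {"monitor1", "pc2"})` assigns video to `monitor1` and `pc2` and voice quanta to `pc1`.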

Further, a program image distribution method according to the present invention is a method in which a plurality of remote reproduction processing devices each reproduce an input voice while displaying a program image generated by creating a character image corresponding to the input voice. The plurality of remote reproduction processing devices include a first remote reproduction processing device that creates a first character image corresponding to the input voice, and a second remote reproduction processing device that, corresponding to the input voice, can not only generate the first character image but also create, in place of the first character image, a second character image different from the first character image. The method includes: a voice input step in which the voice is input to voice input means; a voice quantization step in which voice quantization means of distribution management means divides the input voice and extracts part or all of it as voice quanta; a voice quantum transmission step in which voice quantum transmission means of the distribution management means transmits the voice quanta to each remote reproduction processing device; and a program image display step. In the program image display step, terminal program image generation means of the first remote reproduction processing device, while reproducing the received voice quanta, creates the first character image from character element images in accordance with control commands for controlling the motion of a character and the received voice quanta, and displays the program image; and terminal program image reproduction means of the second remote reproduction device, while reproducing the received voice quanta and in accordance with the control commands and the received voice quanta, either creates the second character image from character element images different from those of the first remote reproduction processing device or creates the first character image from the same character element images as those of the first remote reproduction processing device, and displays the program image.

Further, a program according to the present invention causes a computer to function as the terminal program image reproduction means recited in claim 5.

The present invention may also be understood as a character image generation device suited to creating moving images, such as programs, in which a live-action image captured and provided in substantially real time and a character image created by computer graphics are composited and then transmitted or output, the device comprising: character image recording means for recording a plurality of kinds of character images; voice input means for inputting voice representing a character's lines; voice feature detection means for obtaining a feature of the voice signal input by the voice input means from its amplitude or frequency components per unit time; character image generation means for generating a character image for each unit time in substantially real time based on the output of the voice feature detection means and a character image from the character image recording means; and character image output means for outputting, in synchronization with each other for each unit time, the character image generated by the character image generation means and the voice representing the character's lines input by the voice input means.

The present invention may further be understood as a character image generation device suited to creating moving images, such as programs, in which a live-action image captured and provided in substantially real time and a character image created by computer graphics are composited and then transmitted or output, the device comprising: character image recording means for recording a plurality of kinds of character images; mouth image recording means for recording in advance a plurality of kinds of mouth images that differ in the shape of the character's lips; voice input means for inputting voice representing a character's lines; voice feature detection means for obtaining a feature of the voice signal input by the voice input means from its amplitude or frequency components per unit time; mouth image determination means for determining, based on the output of the voice feature detection means, which kind of mouth image among those recorded in the mouth image recording means corresponds to the detection result for each unit time; character image generation means for generating a character image for each unit time in substantially real time based on the character image and the mouth image whose kind was determined by the mouth image determination means; and character image output means for outputting, in synchronization with each other for each unit time, the character image generated by the character image generation means and the voice representing the character's lines input by the voice input means.

The present invention may further be understood as a character image generation device suited to creating moving images, such as programs, in which a live-action image captured and provided in substantially real time and a character image created by computer graphics are composited and then transmitted or output, the device comprising: character image recording means for recording a plurality of kinds of character images; voice input means for inputting voice representing a character's lines; amplitude detection means for detecting the amplitude per unit time of the voice signal input by the voice input means; character image generation means for generating a character image for each unit time in substantially real time based on the output of the amplitude detection means and a character image from the character image recording means; and character image output means for outputting, in synchronization with each other for each unit time, the character image generated by the character image generation means and the voice representing the character's lines input by the voice input means.

The present invention may further be understood as a character image generation device suited to creating moving images, such as programs, in which a live-action image captured and provided in substantially real time and a character image created by computer graphics are composited and then transmitted or output, the device comprising: character image recording means for recording a plurality of kinds of character images; mouth image recording means for recording in advance a plurality of kinds of mouth images that differ in the shape of the character's lips; voice input means for inputting voice representing a character's lines; vowel determination means for determining the vowel or phoneme of the voice for each unit time from the frequency components per unit time of the voice signal input by the voice input means; mouth image determination means for determining, based on the output of the vowel determination means, which kind of mouth image among those recorded in the mouth image recording means corresponds to the detection result for each unit time; character image generation means for generating a character image for each unit time in substantially real time based on the character image and the mouth image whose kind was determined by the mouth image determination means; and character image output means for outputting, in synchronization with each other for each unit time, the character image generated by the character image generation means and the voice representing the character's lines input by the voice input means.

The present invention may further be understood as a program image generation device that generates a program image containing a scene in which a real performer and a character converse, by compositing a live-action image that includes the real performer and is captured and provided in substantially real time with a character image created by computer graphics, the device comprising: imaging means for capturing at least the performers of the program; character image recording means for recording a plurality of kinds of character images; voice input means for inputting at least voice representing the lines spoken by the character and voice spoken by the real performer; voice feature detection means for obtaining a feature of the voice signal input by the voice input means from its amplitude or frequency components per unit time; character image generation means for generating, in substantially real time, the character image for each unit time to be included in the program image, based on the output of the voice feature detection means and a character image from the character image recording means; program image generation means for compositing the character image generated by the character image generation means with the live-action image captured by the imaging means to generate a program image containing a scene in which the real performer and the character converse; and program image output means for outputting, in synchronization with each other for each unit time, the program image generated by the program image generation means and the program audio input by the voice input means.

The present invention may further be understood as a program image generation device that generates a program image containing a scene in which a real performer and a character converse, by compositing a live-action image that includes the real performer and is captured and provided in substantially real time with a character image created by computer graphics, the device comprising: imaging means for capturing at least the performers of the program; character image recording means for recording a plurality of kinds of character images; mouth image recording means for recording in advance a plurality of kinds of mouth images that differ in the shape of the character's lips; voice input means for inputting at least voice representing the lines spoken by the character and voice spoken by the real performer; voice feature detection means for obtaining a feature of the voice signal input by the voice input means from its amplitude or frequency components per unit time; mouth image determination means for determining, based on the output of the voice feature detection means, which kind of mouth image among those recorded in the mouth image recording means corresponds to the detection result for each unit time; character image generation means for generating, in substantially real time, the character image for each unit time to be displayed in the program, based on the character image and the mouth image whose kind was determined by the mouth image determination means; program image generation means for compositing the character image generated by the character image generation means with the live-action image captured by the imaging means to generate a program image containing a scene in which the real performer and the character converse; and program image output means for outputting, in synchronization with each other for each unit time, the program image generated by the program image generation means and the program audio input by the voice input means.

Further, in the present invention, it is desirable that the voice input means input voice from a microphone, voice reproduced from a recording medium on which a movie or the like is recorded, or synthesized voice obtained by converting text representing the lines of a program into speech.

With this arrangement, a voice feature (volume, or a vowel or phoneme) is detected from the amplitude or frequency components of the input voice signal. Based on the detected feature, a character image containing one of several mouth images with differing mouth shapes is generated in substantially real time, for example by rendering, and is output or transmitted in synchronization with the voice. According to the present invention, therefore, a character image whose mouth image changes every unit time in synchronization with the output voice can be composited with an image containing the performers at the shooting location and displayed in synchronization with the output voice. This makes it possible to generate in real time, and transmit or output, a program image in which the performers at the shooting location and the character appear to be conversing naturally.

In addition, a voice feature (volume, or a vowel or phoneme) is detected from the amplitude or frequency components of the input voice signal. Based on the detected feature, the mouth image corresponding to that feature is selected from a database of a plurality of kinds of mouth images with differing mouth shapes, a character image is generated in substantially real time from the selected mouth image, and the result is output or transmitted in synchronization with the voice. According to the present invention, therefore, an image containing the performers at the shooting location and a character image having the mouth image can be composited in substantially real time, and the character image, whose mouth image changes every unit time in synchronization with the output voice, can be displayed in synchronization with the output voice. This makes it possible to generate in real time, and transmit or output, a program image in which the performers at the shooting location and the character appear to be conversing naturally.

In addition, the amplitude of the input voice signal is detected. Based on the detected amplitude, character images that differ in how far the character's lips are open are generated in substantially real time and are output or transmitted in synchronization with the voice. According to the present invention, therefore, an image containing the performers at the shooting location and a character image having the mouth image can be composited in substantially real time, and the character image, whose mouth image changes every unit time in synchronization with the output voice, can be displayed in synchronization with the output voice. This makes it possible to generate in real time, and transmit or output, a program image in which the performers at the shooting location and the character appear to be conversing naturally.
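As a rough illustration of the amplitude-based approach, the sketch below maps the peak amplitude of each unit-time window to a lip-opening level. The window length, the number of levels, the linear mapping, and the name `mouth_openness` are assumptions chosen for the example; the description states only that the degree of lip opening follows the detected amplitude.

```python
def mouth_openness(samples, unit, levels, full_scale):
    """Return one lip-opening level (0 = closed) per unit-time window.

    samples    -- integer PCM samples of the character's voice
    unit       -- samples per unit time (assumed fixed)
    levels     -- number of distinct lip-opening images available
    full_scale -- amplitude treated as fully open (e.g. 32768 for 16-bit audio)
    """
    result = []
    for start in range(0, len(samples), unit):
        window = samples[start:start + unit]
        peak = max(abs(s) for s in window)
        # Linear mapping of peak amplitude onto the available levels,
        # clamped so that a full-scale peak selects the widest opening.
        level = min(levels - 1, peak * levels // full_scale)
        result.append(level)
    return result
```

Each returned level would index one of the pre-drawn lip-opening images, so a silent window selects the closed mouth and a loud window selects the widest opening.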

Further, the frequency components of the input voice signal are extracted and analyzed to identify a vowel or phoneme. Based on the identified vowel or phoneme, the mouth image corresponding to the analysis result is selected from a plurality of kinds of mouth images that differ in the shape of the character's lips, a character image is generated in substantially real time from the selected mouth image, and the result is output or transmitted in synchronization with the voice. According to the present invention, therefore, an image containing the performers in the captured footage and a character image having the mouth image can be composited in substantially real time, and the character image, whose mouth image changes every unit time in synchronization with the output voice, can be displayed in synchronization with the output voice. This makes it possible to generate in real time, and transmit or output, a program image in which the performers at the shooting location and the character appear to be conversing naturally.
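The frequency-based selection can be sketched in a deliberately simplified form. The code below is a toy stand-in for vowel determination: it measures signal energy at one probe frequency per vowel with the Goertzel algorithm and picks the strongest. The probe frequencies in `PROBES`, the file names in `MOUTH_IMAGES`, and all function names are hypothetical; real vowel recognition would analyze formant structure rather than a single frequency per vowel.

```python
import math

# Hypothetical probe frequency (Hz) per vowel and hypothetical image names.
PROBES = {"a": 800.0, "i": 300.0, "u": 500.0}
MOUTH_IMAGES = {"a": "mouth_a.png", "i": "mouth_i.png", "u": "mouth_u.png"}
CLOSED_IMAGE = "mouth_closed.png"


def goertzel_power(samples, sample_rate, freq):
    """Energy of `samples` at the DFT bin nearest `freq` (Goertzel algorithm)."""
    n = len(samples)
    if n == 0:
        return 0.0
    k = round(n * freq / sample_rate)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def classify_vowel(samples, sample_rate=8000, min_power=1.0):
    """Return the vowel whose probe frequency is strongest, or None for silence."""
    powers = {v: goertzel_power(samples, sample_rate, f) for v, f in PROBES.items()}
    best = max(powers, key=powers.get)
    return best if powers[best] >= min_power else None


def mouth_image_for(samples, sample_rate=8000):
    """Select the mouth image for one unit-time window of voice samples."""
    vowel = classify_vowel(samples, sample_rate)
    return MOUTH_IMAGES.get(vowel, CLOSED_IMAGE)


def tone(freq, sample_rate=8000, n=400):
    """Test helper: a pure sine tone lasting one 50 ms unit-time window."""
    return [math.sin(2.0 * math.pi * freq * i / sample_rate) for i in range(n)]
```

Calling `mouth_image_for` once per unit-time window yields the sequence of mouth images to composite onto the character, with silent windows falling back to the closed-mouth image.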

According to the invention of each claim of the present application, a program (content) can easily be created from input voice. The input voice can therefore be distributed in common to a plurality of remote reproduction processing devices, while each remote reproduction processing device reproduces the common voice and displays video using images of a different character created from that common voice.

Thus, by using voice, which is information that is easy to input, a common voice can be distributed to multiple locations as information suited to each location. This enables real-time content presentation and consumer-participation content creation, which can stimulate the market. With remote operation and real-time distribution, the system can also be used for events and emergency broadcasts. This improves attention, topicality, recognition, and the sense of presence.

Furthermore, as in the invention of claim 2 of the present application, combining the character with live-action footage or the like enables pseudo-conversation. Compositing with live-action footage of each location also allows the system to be used to guide people to adjacent areas.

Furthermore, as in the invention of claim 3 of the present application, distributing information as moving image data to a remote reproduction device that plays back moving image data (for example, a mere monitor) makes program distribution possible even where such devices are mixed with remote reproduction processing devices (for example, devices with the functions of a personal computer (PC)). Also, as in the invention of claim 4 of the present application, when moving image data is created, for example when the same content is to be displayed on the remote reproduction processing devices at the same time, the moving image data may be distributed not only to the remote reproduction device but also to the remote reproduction processing devices. This reduces the processing load on the remote reproduction processing devices.

FIG. 1 is a conceptual block diagram for explaining the configuration and operation of a program image generation system 51 according to Embodiment 1 of the present invention.
FIG. 2 is a conceptual block diagram for explaining the configuration and operation of the terminal program image generation unit 83 of FIG. 1.
FIG. 3 is a conceptual block diagram for explaining the configuration and operation of the moving image data creation unit 93 of FIG. 1.
FIG. 4 is a conceptual block diagram for explaining a program image generation and transmission device according to Embodiment 2 of the present invention.
FIG. 5 is a diagram showing examples of face images within the character images generated by the character image generation unit in Embodiment 2.
FIG. 6 is a flowchart showing the operation of Embodiment 2.
FIG. 7 is a diagram showing an example of a program image generated and sent out by Embodiment 2.
FIG. 8 is a conceptual block diagram for explaining a program image generation and transmission device according to Embodiment 3 of the present invention.
FIG. 9 is a diagram showing examples of the mouth images recorded in the mouth image database in Embodiment 3.
FIG. 10 is a flowchart showing the operation of Embodiment 3.

Hereinafter, modes for carrying out the present invention are described with reference to the drawings. The present invention is not limited to the following embodiments.

FIG. 1 is a conceptual block diagram for explaining the configuration and operation of a program image generation system 51 according to Embodiment 1 of the present invention. The program image generation system 51 includes two remote reproduction processing devices 53₁ and 53₂ (an example of the "remote reproduction processing device" in the claims) and a remote reproduction device 55 (an example of the "remote reproduction device" in the claims). Hereinafter, subscripts are omitted when several such devices are referred to collectively. There may be any number of remote reproduction processing devices 53 and remote reproduction devices 55. A remote reproduction processing device 53 can perform a certain amount of information processing, like a device with the functions of a personal computer (PC). It is suited to distribution in which the terminal itself processes the input voice. This makes it possible to deliver content, such as a combination of live-action footage and a CG character, according to the time and place in which each terminal is located. The remote reproduction device 55, by contrast, is a monitor or the like with only a display function. It can play back moving images, but distributing the input voice alone is not enough to deliver content to it. Thus, depending on the nature of each terminal, distribution requires not only playback with information processing on the terminal side but also playback of moving image data. Embodiment 1 therefore describes a case that includes moving-image operation.

The program image generation system 51 includes: a voice input terminal 57 (an example of the "voice input terminal" in the claims) having a voice input unit 59 (an example of the "voice input means" in the claims) into which the voice of the CG character (the voice spoken by the voice actor playing the character) is input; a distribution management device 61 (an example of the "distribution management means" in the claims) that transmits the input voice to the remote reproduction processing devices 53 and the remote reproduction device 55; and a moving image generation unit 91 (an example of the "moving image generation means" in the claims) that creates moving image data from the input voice and transmits it.

The distribution management device 61 includes a voice quantization unit 63 (an example of the "voice quantization means" in the claims), a voice quantum storage unit 65, and a voice quantum transmission unit 67 (an example of the "voice quantum transmission means" in the claims). The voice quantization unit 63 divides the CG character's voice input to the voice input unit 59, extracts some or all of it as voice elements, and quantizes them to generate quantized data (hereinafter, "voice quanta"). For example, the CG character's line "Hisashiburi ne" ("Long time no see") is quantized on the basis of individual sounds and silences, as "hi", "sa", "shi", "bu", "ri", "ne". The voice quantum storage unit 65 stores each generated voice quantum. The voice quantum transmission unit 67 transmits each voice quantum to each remote reproduction processing device 53.
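As a rough illustration of dividing input voice at silences, the following minimal sketch groups a loudness envelope into alternating "voice" and "silence" runs. It covers only the silence boundary, not the per-syllable division in the example above. The function name, the envelope representation (one non-negative loudness value per short analysis frame), and the threshold value are assumptions made for illustration and do not come from the specification.

```python
from itertools import groupby
from typing import Dict, List


def split_into_voice_quanta(envelope: List[float],
                            silence_threshold: float = 0.05) -> List[Dict]:
    """Group consecutive envelope values into 'voice' and 'silence' runs.

    `envelope` is assumed to hold one non-negative loudness value per short
    analysis frame; `silence_threshold` is a hypothetical tuning parameter.
    """
    quanta = []
    # groupby yields one run per stretch of frames on the same side of the threshold.
    for is_voiced, run in groupby(envelope, key=lambda v: v >= silence_threshold):
        quanta.append({
            "kind": "voice" if is_voiced else "silence",
            "frames": list(run),
        })
    return quanta


quanta = split_into_voice_quanta([0.0, 0.0, 0.4, 0.5, 0.3, 0.0, 0.01, 0.6, 0.7])
print([(q["kind"], len(q["frames"])) for q in quanta])
```

Each resulting run could then be stored and transmitted as one unit; a practical system would probably also merge very short silences, which this sketch does not attempt.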

The distribution management device 61 also includes a control command storage unit 69 (an example of the "control command storage means" in the claims) that stores control commands for controlling the character's motion, and a control command transmission unit 71 (an example of the "control command transmission means" in the claims) that transmits the control commands to each remote reproduction processing device. The distribution management device 61 further includes a storage device 73 (an example of the "storage means" in the claims) and a data transmission unit 75 (an example of the "data transmission means" in the claims). The storage device 73 stores element image data, such as mouth images showing the shape of the character's mouth and character background images other than the mouth images; live-action data obtained by imaging; and background audio data representing audio other than the input voice. For example, when the CG character converses with a real performer at the shooting site, the background audio data includes that performer's voice; it also includes audio not directly related to the live-action images, such as BGM and other music data. The data transmission unit 75 transmits the element image data, live-action data, and background audio data to each remote reproduction processing device 53. The data transmission unit 75 may skip any remote reproduction processing device 53 that already holds its own element image data or the like, and transmit to the others as each requires.

The remote reproduction processing device 53 includes a reception unit 81 (an example of the "reception means" in the claims) that receives the information transmitted from the voice quantum transmission unit 67, the control command transmission unit 71, and the data transmission unit 75, and a terminal program image generation unit 83 (an example of the "terminal program image generation means" in the claims). The terminal program image generation unit 83 causes a speaker 86 to play back each received voice quantum while causing a monitor 85 to display a program image, creating a character image from the element image data in accordance with the control commands and each received voice quantum.

The program image generation system 51 also includes the moving image generation unit 91. The moving image generation unit 91 has a moving image data creation unit 93 that creates moving image data and a moving image data transmission unit 95 that transmits it. The remote reproduction device 55 includes a moving image data reception unit 97 that receives the moving image data and a monitor 99 that plays back the received moving image data.

FIG. 2 is a conceptual block diagram for explaining the configuration and operation of the terminal program image generation unit 83 of FIG. 1. The terminal program image generation unit 83 includes a control command storage unit 101 that stores received control commands, a background audio storage unit 103 that stores background audio data, an element image storage unit 105 that stores element image data, and a live-action storage unit 107 that stores live-action data.

With respect to element image data, at least one remote reproduction processing device is made to differ from the others. For example, a special type of element image data (say, a panda) is transmitted to the remote reproduction processing device 53₁ but not to the other devices such as 53₂, which are sent a different type (say, a cat). This may be done by having the user of the voice input terminal 57 make the designation, so that the distribution management device 61 transmits the special element image data to the designated device and not to the others. Alternatively, the user of a remote reproduction processing device may make the designation and have the special element image data transmitted. As a result, the remote reproduction processing devices 53₁ and 53₂ can reproduce the same voice quanta with different characters. Whereas program distribution by voice quanta is a one-way flow of information, this allows the program images to be tailored to the installation location, playback time, and so on of each remote reproduction processing device. The special element image data may also be prepared locally at each remote reproduction processing device.
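The per-device choice of element image data can be pictured as a small delivery plan. The sketch below is a minimal illustration under assumed names: the device identifiers, the set names "panda" and "cat", and the function itself are hypothetical. It also reflects the option that a device holding its own element images is sent nothing.

```python
from typing import Dict, Optional


def plan_element_image_delivery(has_local_images: Dict[str, bool],
                                default_set: str,
                                special_sets: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Decide which element image set, if any, to send to each device.

    has_local_images: device id -> True if the device already holds its own images.
    special_sets: device id -> specially designated set for that device.
    """
    plan: Dict[str, Optional[str]] = {}
    for device_id, has_local in has_local_images.items():
        if has_local:
            plan[device_id] = None  # the device uses its own images; nothing is sent
        else:
            plan[device_id] = special_sets.get(device_id, default_set)
    return plan


plan = plan_element_image_delivery(
    {"53_1": False, "53_2": False, "53_3": True},
    default_set="cat",
    special_sets={"53_1": "panda"},
)
print(plan)
```

The same voice quanta would then be sent to every device, and only the element image set differs from device to device.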

First, audio playback by the terminal program image generation unit 83 is described. The terminal program image generation unit 83 includes a terminal audio synchronization unit 109 (an example of the "terminal audio synchronization means" in the claims) that synchronizes each received voice quantum with the background audio data, an audio playback unit 111 (an example of the "audio playback means" in the claims) that plays back the synchronized voice quanta and background audio data, and an audio sending unit 113 that sends the audio to be played to the speaker 86.

Next, image display by the image generation unit 115 (an example of the "image generation means" in the claims) of the terminal program image generation unit 83 is described. The image generation unit 115 includes a voice feature detection unit 121 that detects features of each received voice quantum; a character image generation unit 137 that creates a character image from the element image data by 3D vector data processing, in accordance with the control commands and the detected features of the voice quantum; and a program image generation unit 139 that composites the created character image with the live-action data to create a program image.

The voice feature detection unit 121 includes an image frequency analysis unit 123 that analyzes the frequency of a voice quantum and a volume analysis unit 125 that analyzes its volume. The character image generation unit 137, for example, uses the frequency analysis by the image frequency analysis unit 123 to determine whether the sound is a vowel, the syllabic "n", silence, or the like, and so decides the shape of the mouth image. It uses the volume analysis by the volume analysis unit 125 to decide the degree of opening, then selects one of the mouth images and processes it to create the character's mouth image. It also determines the character's posture and the like from the control commands (for example, motions such as standing upright or bowing, and camera positions such as an upper-body shot), and combines the two to generate a character image from the element image data (see FIG. 9). The program image generation unit 139 composites the character image with the live-action data stored in the live-action storage unit 107 to generate a program image (see FIG. 7). The control commands may include compositing timing and the like, which may be taken into account when creating the program image. The generated program image is sent by an image sending unit 117 to a display device such as a monitor.
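The selection just described, with the mouth shape taken from the sound class and the degree of opening taken from the volume, might be sketched as follows. This is only an illustration: the image names, the normalized volume range, and the 0-to-100 opening scale applied here are assumptions, and the frequency analysis that would produce the vowel label is not shown.

```python
from typing import Optional, Tuple

# Hypothetical image names, one per vowel mouth shape.
MOUTH_IMAGES = {"a": "mouth_a", "i": "mouth_i", "u": "mouth_u",
                "e": "mouth_e", "o": "mouth_o"}
CLOSED = "mouth_closed"


def select_mouth_image(vowel: Optional[str], volume: float,
                       max_volume: float = 1.0) -> Tuple[str, int]:
    """Return (mouth image name, opening on an assumed 0..100 scale).

    `vowel` is the result of the frequency analysis: one of the five vowels,
    "n" for the syllabic n, or None for silence.
    """
    if vowel not in MOUTH_IMAGES:
        return CLOSED, 0  # "n" and silence both map to a closed mouth
    clipped = max(0.0, min(volume, max_volume))
    opening = round(100 * clipped / max_volume)
    return MOUTH_IMAGES[vowel], opening


print(select_mouth_image("a", 0.4))
print(select_mouth_image("n", 0.9))
```

The posture decided from the control commands would be combined with this result before the character image is rendered.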

For a given voice quantum, the image is displayed and the audio is played back only after the image generation unit 115 has finished creating the program image. As a result, the voice quantum being played back by the audio playback unit 111 differs from the voice quantum on which the image generation unit 115 is currently basing its program image creation. The terminal program image generation unit 83 therefore includes a synchronization unit 119 (an example of the "synchronization means" in the claims) that synchronizes playback by the audio playback unit 111 with display of the program image by the image generation unit 115. The synchronization unit 119 includes an audio frequency analysis unit 131, which analyzes the frequency of the voice quantum to be played by the audio playback unit 111 and detects playback time data, and a timing control unit 133, which compares that playback time data with the time the image generation unit 115 needs to generate the program image and controls the playback timing of the audio playback unit 111.
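One way to picture this timing control is as a scheduler that starts each voice quantum only when its program image is ready and the previous quantum has finished playing. The sketch below is a simplified model under assumptions that are not in the specification: images are generated one after another, all times are integer milliseconds, and the function name is hypothetical.

```python
from typing import List


def schedule_playback(durations_ms: List[int], render_times_ms: List[int]) -> List[int]:
    """Return the playback start time (ms) of each voice quantum.

    durations_ms: playback time of each quantum.
    render_times_ms: time needed to generate the program image for each quantum.
    """
    starts = []
    render_done = 0  # time at which the image for the current quantum is ready
    audio_free = 0   # time at which the previous quantum finishes playing
    for duration, render_time in zip(durations_ms, render_times_ms):
        render_done += render_time
        start = max(render_done, audio_free)  # wait for both the image and the speaker
        starts.append(start)
        audio_free = start + duration
    return starts


print(schedule_playback([200, 200, 200], [100, 50, 300]))
```

With these figures the second and third quanta wait for the audio of the preceding quantum rather than for image generation; if image generation were the slower side, the start times would follow `render_done` instead.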

FIG. 3 is a conceptual block diagram for explaining the configuration and operation of the moving image data creation unit 93 of FIG. 1. The moving image data creation unit 93 creates moving image data using the voice quanta stored in the voice quantum storage unit 65, the control commands stored in the control command storage unit 69, and the element image data, live-action data, and background audio data stored in the storage device 73. The moving image data creation unit 93 includes an audio synchronization unit 159 that synchronizes the background audio data with each voice quantum; an image generation unit 161 that generates program images (compare the image generation unit 115 of FIG. 2); a 2D vector quantization unit 163 that applies 2D vector processing to the generated program images to produce a sequence of images, image 1, ..., image n; a continuous image storage unit 165 that stores the image sequence; a compositing unit 167 that combines the audio synchronized by the audio synchronization unit 159 with the image sequence to generate moving image data; and a moving image data storage unit 169 that stores the generated moving image data.

Note that the remote reproduction processing device 53 is also capable of playing back moving image data. For this reason, on an instruction from the user of the voice input terminal 57 or of a remote reproduction processing device 53, the moving image data may be transmitted not only to the remote reproduction device 55 but also to the remote reproduction processing devices 53. This reduces the processing load on the remote reproduction processing devices 53, for example when the user of the voice input terminal 57 wants several terminals to show the same display at the same time. Moving image data, however, can involve a large amount of transmitted data. Therefore, even when the user of the voice input terminal 57 has instructed that several terminals show the same display simultaneously, the voice quantum transmission unit 67 and the moving image data transmission unit 95 may automatically choose, according to the data volume and the communication conditions, whether to transmit voice quanta or moving image data.
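The specification does not say how the automatic choice between voice quanta and moving image data is made. As one hypothetical rule, the sketch below sends moving image data whenever the terminal cannot process voice quanta, or when the moving image data can be transferred within an assumed deadline, and otherwise falls back to voice quanta. The parameter names and the deadline criterion are assumptions for illustration only.

```python
def choose_payload(can_process_quanta: bool, video_bytes: int,
                   bandwidth_bits_per_s: float, deadline_s: float) -> str:
    """Pick what to transmit to one terminal (hypothetical decision rule)."""
    if not can_process_quanta:
        # A display-only terminal can only play back moving image data.
        return "moving_image_data"
    transfer_time_s = video_bytes * 8 / bandwidth_bits_per_s
    if transfer_time_s <= deadline_s:
        # The link is fast enough, so sending video relieves the terminal of rendering.
        return "moving_image_data"
    return "voice_quanta"


print(choose_payload(True, 5_000_000, 10_000_000, 2.0))
print(choose_payload(True, 5_000_000, 100_000_000, 2.0))
```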

Also, on an instruction from the user of the voice input terminal 57, the moving image data creation unit 93 may be made to transmit the created moving image data to the voice input terminal 57. This lets the user of the voice input terminal 57 easily obtain the moving image data that is to be played on the remote reproduction device and the like, and verify or otherwise process it.

Furthermore, voice quanta may be quantized with reference to the input voice itself; for example, the unit may be a continuous utterance such as "Hisashiburi ne", running from the start of speech until the voice temporarily breaks off. Voice quanta may also be quantized with reference to a fixed time interval (for example, 1/30 second).

Furthermore, the voice feature detection unit 121 of FIG. 2 may detect the amplitude of the input voice, and the character image generation unit 137 may determine the degree of mouth opening from this detection result and generate the character image accordingly (see FIG. 5).

With remote reproduction of this kind, when a person is in front of a remote reproduction processing device 53 or the like, for example, inputting "Hisashiburi ne" ("Long time no see") as the CG character's voice causes the CG character on that device to be displayed as if saying it. This enables not only real-time content presentation but also content in which consumers take part. Attention, topicality, and recognition increase, and because the system ties in with live-action footage, the sense of presence improves as well. Pseudo-conversation also becomes possible, so content suited to the situation, such as guiding people, can be realized.

FIG. 4 is a conceptual block diagram for explaining a program image generation and transmission device according to Embodiment 2 of the present invention. In FIG. 4, reference numeral 1 denotes a microphone for inputting the voice of a real performer at the shooting site and the voice of the CG character who converses with that performer (the voice spoken by the voice actor delivering the character's lines); 2 denotes a camera for imaging the performers and others at the shooting site; 3 denotes a buffer that temporarily stores the audio from the microphone 1 and outputs it a predetermined time later to the program audio sending unit 10 described below (as explained later, this synchronizes the sending of the character's voice with the sending of the character's mouth image); 4 denotes an amplitude detection unit that samples the character's portion of the audio input from the microphone 1 at a predetermined unit time (for example, every 1/30 second when the program video is produced at 30 frames per second), detects the amplitude of the audio in each sampled unit time (each frame), and converts it to a numerical value (digital data); 6 denotes a character image database in which multiple character images are recorded in advance; 7 denotes a character operation unit, consisting of a personal computer installed near the shooting site (with software for manipulating the CG character image installed), with which an operator, while watching the performers and others at the shooting site, inputs operation signals (commands) for moving the CG character in real time; 8 denotes a character image generation unit that, by rendering or similar processing, generates a CG character image having a mouth shape and posture corresponding to the audio in each unit time, based on the per-unit-time amplitude values from the amplitude detection unit 4, the CG character image from the character image database 6, and the character operation signals from the character operation unit 7; 9 denotes a program image generation unit that composites the live-action image from the camera 2 with the CG character image from the character image generation unit 8; 10 denotes a program audio sending unit that transmits or outputs the audio from the buffer 3; and 11 denotes a program image sending unit that transmits or outputs the image from the program image generation unit 9 (in synchronization, as described later, with the transmission or output of the character's voice).

As noted above, the character image generation unit 8 of FIG. 4 generates in real time, by rendering or similar processing, a mouth shape corresponding to the amplitude value of the audio in each unit time from the amplitude detection unit 4, for example mouth shapes that differ in lip opening on a scale from 0 to 100. FIG. 5 illustrates three of the mouth images generated in this way and distinguished by lip opening on the 0-to-100 scale. In FIG. 5, (a) shows the mouth shape when the person speaking the CG character's lines is making no sound (silence) and the lips are closed (or when the sound "n" is being produced); (b) shows the lip shape when that person is speaking in a relatively quiet voice (small amplitude, low volume), for example a lip opening of 40; and (c) shows the lip shape when that person is speaking in a relatively loud voice (large amplitude, high volume), for example a lip opening of 80.

Also in FIG. 4, the amplitude detection unit 4 converts the amplitude of the audio in each unit time, sampled from the microphone 1, into digital data representing the amplitude on a scale of, for example, 0 to 100. For each unit time, the character image generation unit 8 generates a mouth image corresponding to the amplitude data from the amplitude detection unit 4 (for example, an image with a lip opening anywhere from 0 to 100 for amplitude data on the 0-to-100 scale). In this way, the character image generation unit 8 performs various geometry processing and rendering (drawing) in real time, based on the character image from the character image database 6, the amplitude value data, and the operation signals from the character operation unit 7, to generate a three-dimensional CG character image with the mouth shape and posture corresponding to each sound. The CG character image generated by the character image generation unit 8 is composited with the live-action image from the camera 2 in the program image generation unit 9 and is transmitted or output from the program image sending unit 11. Through the action of the buffer 3, the program image sending unit 11 transmits or outputs the image in synchronization with the transmission or output of audio by the program audio sending unit 10.
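The conversion of sampled audio into a per-frame amplitude value on the 0-to-100 scale can be sketched as follows. Taking the peak absolute sample of each frame as "the amplitude" is an assumption, as are the function name and the `full_scale` normalization parameter; the specification does not state how the amplitude of a unit time is measured.

```python
from typing import List


def frame_amplitude_levels(samples: List[float], sample_rate: int,
                           frames_per_second: int = 30,
                           full_scale: float = 1.0) -> List[int]:
    """Return one amplitude level (0..100) per unit time (one video frame)."""
    frame_len = max(1, sample_rate // frames_per_second)
    levels = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        peak = max(abs(s) for s in frame)  # assumed measure of the frame's amplitude
        levels.append(min(100, round(100 * peak / full_scale)))
    return levels


# Toy input: 60 samples per second at 30 frames per second gives 2 samples per frame.
levels = frame_amplitude_levels([0.0, 0.0, 0.2, -0.4, 0.8, 0.5, 1.5, 0.0], sample_rate=60)
print(levels)
```

Each level could then be used directly as the lip opening of the mouth image for that frame, in the manner of the 0, 40, and 80 examples of FIG. 5.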

Next, the operation of the first embodiment when a program containing a scene in which a real performer converses with a CG character is produced in real time and broadcast live will be described with reference to the flowchart of FIG. 6. First, a live-action image from the camera 2 capturing the shooting site is input (step S1), audio from the microphone 1 is input (step S2), and an operation signal from the character operation unit 7 is input (step S3). The audio input from the microphone 1 is temporarily stored in the buffer 3 (step S4). The amplitude detection unit 4 samples the audio from the buffer 3 and, for each unit time, detects the amplitude of the audio and converts it to a numerical value (step S5). The character image generation unit 8 generates a CG character image in real time, by rendering or similar processing, based on the amplitude data obtained in step S5 and the image data from the character image database 6 (step S6). Next, the program image generation unit 9 combines the CG character image generated in step S6 with the live-action image from the camera 2 to generate a program image containing, for example, a scene in which the real performer and the character converse with each other (step S7). The generated program image is then sent out in synchronization with the audio from the buffer 3 (step S8). As a result, the mouth shape of the CG character in the program image is output or transmitted so that it corresponds to, and is synchronized with, the voice of the CG character in the program audio. The above program image generation operation is repeated every unit time (for example, if 30 frames per second are sent out as the program image, the unit time is 1/30 second).
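Steps S5 and S6 can be illustrated with a minimal Python sketch. The description does not say how the amplitude is measured or how many mouth openings exist, so the RMS measure, the four opening levels, and the names `frame_amplitudes` and `mouth_level` are all assumptions made for illustration only.

```python
import math

def frame_amplitudes(samples, sample_rate, fps=30):
    """Split PCM samples into one chunk per video frame (the unit time)
    and return the RMS amplitude of each complete chunk.
    RMS is an assumed amplitude measure; the source only says 'amplitude'."""
    frame_len = sample_rate // fps
    amplitudes = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        chunk = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in chunk) / frame_len)
        amplitudes.append(rms)
    return amplitudes

def mouth_level(amplitude, full_scale=1.0, levels=4):
    """Map an amplitude to a discrete mouth-opening level 0..levels-1.
    The number of levels is a hypothetical choice."""
    ratio = min(max(amplitude / full_scale, 0.0), 1.0)
    return min(int(ratio * levels), levels - 1)
```

With one amplitude value per 1/30 second, the character image generation step would pick the mouth opening for each frame from the corresponding level.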

FIG. 7 shows an example of a program image sent out in this way. As shown in FIG. 7, according to the first embodiment, a scene in which the real performer 12 and the fictitious CG character 13 appear to be talking with each other can be generated in real time and sent out in synchronization with the audio. In this case, the mouth shape 13a of the CG character 13 is displayed in synchronization with the voice of the person speaking the CG character's lines (that is, the character's voice).

Next, FIG. 8 is a conceptual block diagram for explaining a program image generation and transmission apparatus according to a second embodiment of the present invention. In FIG. 8, reference numeral 21 denotes a microphone for inputting the voice of a real performer at the shooting site and the voice of a CG character that converses with the performer (the voice spoken by the voice actor playing the character); 22 denotes a camera for imaging the performers and others at the shooting site; 23 denotes a character operation unit, consisting of a personal computer installed near the shooting site (a personal computer on which software for manipulating the CG character image is installed), with which an operator inputs character operation signals (commands) for moving the CG character in real time while watching the performers and others at the shooting site; 24 denotes a voice feature database that records the correspondence between features of the frequency components of speech and each vowel (or each phoneme); 25 denotes a mouth image database that records, for each character, the mouth image corresponding to each vowel (see reference numeral 20a in FIGS. 9(a) to 9(f)); and 26 denotes a character image database in which a plurality of character images (see reference numeral 20 in FIG. 9) are recorded in advance.

FIG. 9 shows examples of a plurality of mouth images for one character recorded in the mouth image database 25. In FIG. 9, reference numeral 20 denotes a face image showing the face portion of the character image, and 20a denotes the mouth image within that face image. In FIG. 9, reference numeral 20a in (a) shows the mouth shape when a sound containing the vowel "a" is uttered, 20a in (b) the mouth shape for a sound containing the vowel "i", 20a in (c) the mouth shape for a sound containing the vowel "u", 20a in (d) the mouth shape for a sound containing the vowel "e", 20a in (e) the mouth shape for a sound containing the vowel "o", and 20a in (f) the mouth shape for "silence" (or for "n", when the mouth is closed). Each mouth image 20a in FIG. 9 is recorded in the mouth image database 25. The face image 20 of the character and the image of the character's body (not shown) are recorded in the character image database 26 (the mouth images, face image, and body image may instead be recorded in a single database).

In FIG. 8, reference numeral 27 denotes a buffer that temporarily stores the audio from the microphone 1 and outputs it, after a predetermined time, to the program audio transmission unit 32 described later (as described later, the buffer serves to synchronize the sending of the character's voice with the sending of the character's mouth image); 28 denotes a frequency component extraction unit that samples the character's voice portion of the audio input from the microphone 21 at every predetermined unit time and extracts the frequency components of the sampled audio for each unit time (each frame); 29 denotes a mouth image determination unit that compares the frequency components from the frequency component extraction unit 28 with the features of each vowel from the voice feature database 24, determines the vowel of the audio for each unit time, and thereby selects the mouth image corresponding to that vowel; 30 denotes a character image generation unit that generates, by predetermined rendering or similar processing, a three-dimensional CG character image having a mouth shape and posture corresponding to each sound, based on the mouth image extracted from the mouth image database 25 according to the mouth image data from the mouth image determination unit 29 (for example, if the vowel of the sound is "a", the mouth image 20a of FIG. 9(a) corresponding to "a"), the character image from the character image database 26, and the character operation signal from the character operation unit 23; 31 denotes a program image generation unit that combines the live-action image from the camera 22 with the CG character image from the character image generation unit 30; 32 denotes a program audio transmission unit that transmits or outputs the audio from the buffer 27; and 33 denotes a program image transmission unit that transmits or outputs the image from the program image generation unit 31 (in synchronization with the transmission or output of audio by the program audio transmission unit 32, as described later).

As described above, the character image generation unit 30 in FIG. 8 generates a CG character image in real time based on, among other inputs, the mouth image from the mouth image determination unit 29 that indicates the vowel corresponding to the audio features for each unit time. The mouth image determination unit 29 in FIG. 8 determines and identifies the mouth image for speaking the vowel that corresponds to the features of the audio for each unit time sampled from the buffer 27. Based on the mouth image data corresponding to the vowel of the audio for each unit time from the mouth image determination unit 29, the character image generation unit 30 generates, for each unit time, a three-dimensional CG character image including that mouth image (the mouth image determination unit 29 corresponds to the part that implements the functions of both the "vowel etc. determination means" and the "mouth image determination means" of the present invention). Further, in FIG. 8, the CG character image generated by the character image generation unit 30 is combined with the live-action image from the camera 22 by the program image generation unit 31 and is transmitted or output from the program image transmission unit 33. Owing to the action of the buffer 27, the transmission or output of the image from the program image transmission unit 33 is performed in synchronization with the transmission or output of audio by the program audio transmission unit 32.
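The role of the buffer 27, which holds the audio for a predetermined time so that the matching image is ready when the audio is sent out, can be sketched as a fixed-length delay line. This is only an illustrative Python sketch: the class name `DelayBuffer`, the frame-based delay, and the use of `None` while the buffer fills are assumptions, not details taken from the description.

```python
from collections import deque

class DelayBuffer:
    """Delay line that releases each audio frame a fixed number of
    frames after it was stored (hypothetical stand-in for buffer 27)."""

    def __init__(self, delay_frames):
        self._queue = deque()
        self._delay = delay_frames

    def push(self, audio_frame):
        """Store a frame and return the frame stored delay_frames pushes
        earlier, or None while the buffer is still filling."""
        self._queue.append(audio_frame)
        if len(self._queue) > self._delay:
            return self._queue.popleft()
        return None
```

If image generation for a frame takes up to two unit times, a `DelayBuffer(2)` would release the audio for frame n at the moment the image for frame n is ready, so both can be sent out together.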

Next, the operation of the second embodiment when a program showing a scene in which a real performer converses with a CG character is produced in real time and broadcast live will be described with reference to the flowchart of FIG. 40. First, a live-action image from the camera 22 capturing the shooting site is input (step S11), audio from the microphone 21 is input (step S12), and a character operation signal from the character operation unit 23 is input (step S13). The audio input from the microphone 21 is temporarily stored in the buffer 27 (step S14). For the audio from the buffer 27, the vowel of the sampled audio, and hence the mouth image, is determined and identified by comparing the features of the frequency components extracted by the frequency component extraction unit 28 with the data from the voice feature database (step S15). The character image generation unit 30 generates a CG character image in real time, by rendering or similar processing, based on the data indicating the mouth image determined and identified in step S15, the corresponding mouth image from the mouth image database 25, the image data from the character image database 26, and so on (step S16). Next, the program image generation unit 31 combines the CG character image generated in step S16 with the live-action image from the camera 22 to generate a program image (step S17). The generated program image is then sent out in synchronization with the audio from the buffer 27 (step S18). As a result, the mouth image of the CG character in the program image is output or transmitted so that it corresponds to, and is synchronized with, the voice of the CG character in the program audio. The above program image generation operation is repeated every unit time (for example, if 30 frames per second are sent out as the program image, the unit time is 1/30 second).
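The matching in step S15 can be sketched as a nearest-neighbor lookup against a small feature table. The description does not specify which frequency features are stored or how they are compared, so everything here is an assumption: the (F1, F2) formant pair as the feature, the placeholder values in `VOICE_FEATURES` (not measured data), the Euclidean distance, the energy threshold for silence, and the function names.

```python
import math

# Hypothetical voice feature database: vowel -> (F1, F2) in Hz.
# Placeholder values for illustration only, not measured data.
VOICE_FEATURES = {
    "a": (800.0, 1300.0),
    "i": (300.0, 2300.0),
    "u": (350.0, 1400.0),
    "e": (500.0, 1900.0),
    "o": (500.0, 900.0),
}

# Hypothetical mouth image database, mirroring FIG. 9(a) to 9(f).
MOUTH_IMAGES = {
    "a": "fig9_a", "i": "fig9_b", "u": "fig9_c",
    "e": "fig9_d", "o": "fig9_e", "silence": "fig9_f",
}

def classify_vowel(features, energy, silence_threshold=0.01):
    """Return the vowel whose stored features are closest to the
    extracted features, or 'silence' when the frame energy is low."""
    if energy < silence_threshold:
        return "silence"
    return min(VOICE_FEATURES,
               key=lambda v: math.dist(features, VOICE_FEATURES[v]))

def select_mouth_image(features, energy):
    """Pick the mouth image for one unit time of audio."""
    return MOUTH_IMAGES[classify_vowel(features, energy)]
```

Calling `select_mouth_image` once per unit time yields the mouth image that the character image generation step would then combine with the character's face and body images.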

The embodiments of the present invention have been described above, but the present invention is not limited to those embodiments, and various modifications and changes are possible. For example, in the first and second embodiments, the audio input to the buffers 3 and 27 is in each case what performers or voice actors speak into the microphone 1; however, the present invention is not limited to this. The audio may be, for example, audio obtained by playing back data recorded on a DVD or hard disk, or synthesized speech obtained by converting a text string containing the character's lines with text-to-speech software (for example, a staff member near the shooting site may type ad-lib lines on a personal computer keyboard in real time while watching the mood on site, and the lines may be converted into synthesized speech in real time and input to the buffer 3). In the second embodiment, only six mouth shapes are prepared for a character, namely the five vowels plus silence (see FIGS. 9(a) to 9(f)); in the present invention, however, a larger number of mouth images, for example eleven ("ten types plus silence") or more, may be prepared in advance in a database or the like, and one of them may be determined and identified by phoneme analysis of the input audio. Furthermore, in the first and second embodiments, one of a plurality of mouth images (mouth shapes) is selected and determined based on, respectively, the degree of mouth opening derived from the volume (amplitude) of the input audio for each unit time, or the vowel (or phoneme) obtained by analyzing the input audio for each unit time; in the present invention, one of the plurality of mouth images (mouth shapes) may instead be selected and determined based on both the volume (amplitude) and the phoneme of the input audio.
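The last variation, selecting a mouth image from both the volume and the phoneme, could look like the following Python sketch. The description gives no concrete scheme, so the (vowel, opening level) key, the three opening levels, and the name `select_mouth_variant` are purely hypothetical.

```python
def select_mouth_variant(vowel, amplitude, levels=3, full_scale=1.0):
    """Return a (vowel, opening_level) key into a hypothetical table of
    mouth images that has several opening levels for each vowel."""
    if vowel == "silence" or amplitude <= 0.0:
        return ("silence", 0)
    ratio = min(amplitude / full_scale, 1.0)
    level = min(int(ratio * levels), levels - 1)
    return (vowel, level)
```

Under this scheme a loud "a" and a quiet "a" would map to different images of the same vowel shape, while silence always maps to the closed mouth.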

51 program image generation system; 53₁, 53₂ remote reproduction processing device; 55 remote playback device; 57 audio input terminal; 59 audio input unit; 61 distribution management device; 63 audio quantization unit; 67 audio quantum transmission unit; 69 control command storage unit; 71 control command transmission unit; 73 storage device; 75 data transmission unit; 81₁, 81₂ receiving unit; 83₁, 83₂ terminal program image generation unit; 91 moving image generation unit; 109 terminal audio synchronization unit; 111 audio reproduction unit; 115 image generation unit; 119 synchronization unit

Claims

A program image distribution system in which a plurality of remote reproduction processing devices, while reproducing input audio, display a program image generated by creating a character image corresponding to the input audio, the system comprising:
an audio input terminal having audio input means for inputting the audio; and
distribution management means for transmitting the input audio to each remote reproduction processing device,
wherein the plurality of remote reproduction processing devices include a first remote reproduction processing device that creates a first character image corresponding to the input audio, and a second remote reproduction processing device that creates a second character image, different from the first character image, corresponding to the input audio,
the distribution management means includes:
audio quantization means for dividing the input audio and extracting part or all of it as audio quanta;
audio quantum transmission means for transmitting the audio quanta to each remote reproduction processing device;
control command storage means for storing a control command for controlling the movement of the character; and
control command transmission means for transmitting the control command to each remote reproduction processing device,
each remote reproduction processing device includes:
receiving means for receiving each transmitted audio quantum; and
terminal program image generation means for generating the character image from a character element image corresponding to the control command and each received audio quantum, and displaying the program image while reproducing each received audio quantum, and
there are two or more types of character element image, and the terminal program image generation means of the second remote reproduction processing device creates the second character image from a character element image of a different type from the character element image used in the first remote reproduction processing device.

The program image distribution system according to claim 1, wherein the terminal program image generation means of each remote reproduction processing device comprises:
terminal audio synchronization means for synchronizing each audio quantum with background audio data representing audio different from the input audio;
audio reproduction means for reproducing each synchronized audio quantum and the background audio data;
image generation means for detecting a feature of the audio quantum to be reproduced after the audio quantum currently being reproduced by the audio reproduction means, creating the character image from a character element image corresponding to the control command and the detected feature of that audio quantum, and creating the program image by combining the character image with live-action data obtained by imaging; and
synchronization means for detecting a feature of the audio quantum being reproduced by the audio reproduction means and synchronizing the program image creation processing by the image generation means with the reproduction processing of each audio quantum by the audio reproduction means.

The program image distribution system according to claim 2, wherein
the distribution management means includes:
storage means for storing the character element images, the live-action data, and the background audio data; and
data transmission means for transmitting, as necessary, the first character element image or the second character element image, the live-action data, and the background audio data to some or all of the plurality of remote reproduction processing devices,
the system further comprises:
a remote playback device for displaying moving image data; and
moving image generation means for generating the character image from the character element image corresponding to the control command and each received audio quantum, combining it with the audio quanta to generate moving image data, and transmitting the moving image data to the remote playback device, and
the remote playback device plays back the received moving image data.

The program image distribution system according to claim 3, wherein
some or all of the audio input terminal, the plurality of remote reproduction processing devices, and the remote playback device are designated to the moving image generation means as transmission destinations of the generated moving image data, and
when some or all of the plurality of remote reproduction processing devices are designated as transmission destinations of the generated moving image data,
the audio quantum transmission means of the distribution management means does not transmit the audio quanta to the designated remote reproduction processing devices, and
the designated remote reproduction processing devices play back the received moving image data.

A program image distribution method in which a plurality of remote reproduction processing devices, while reproducing input audio, display a program image generated by creating a character image corresponding to the input audio,
wherein the plurality of remote reproduction processing devices include a first remote reproduction processing device that creates a first character image corresponding to the input audio, and a second remote reproduction processing device capable not only of generating the first character image corresponding to the input audio but also of creating, in place of the first character image, a second character image different from the first character image,
the method comprising:
an audio input step in which the audio is input to audio input means;
an audio quantization step in which audio quantization means of distribution management means divides the input audio and extracts part or all of it as audio quanta;
an audio quantum transmission step in which audio quantum transmission means of the distribution management means transmits the audio quanta to each remote reproduction processing device; and
a program image display step in which terminal program image generation means of the first remote reproduction processing device, while reproducing each received audio quantum, creates the first character image from a character element image corresponding to a control command for controlling the movement of the character and to each received audio quantum, and displays the program image, and in which terminal program image reproduction means of the second remote reproduction processing device, while reproducing each received audio quantum, either creates the second character image from a character element image that corresponds to the control command and each received audio quantum and differs from the character element image in the first remote reproduction processing device, or creates the first character image from the same character element image as that in the first remote reproduction processing device, and displays the program image.

A program for causing a computer to function as the terminal program image reproduction means according to claim 5.