JP2005159592A - Contents transmission apparatus and contents receiving apparatus

Info

Publication number: JP2005159592A
Application number: JP2003393278A
Authority: JP
Inventors: Seiichi Goshi; 清一合志; Toshihiko Misu; 俊彦三須; Masaki Takahashi; 正樹高橋; Tetsuo Kuge; 哲郎久下; Shinichi Sakaida; 慎一境田; Kazuhisa Iguchi; 和久井口; Kaoru Watanabe; 馨渡辺
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2003-11-25
Filing date: 2003-11-25
Publication date: 2005-06-16

Abstract

PROBLEM TO BE SOLVED: To provide a content transmission apparatus and a content receiving apparatus that offer a mechanism for making effective use of discarded program material and for satisfying the varied preferences and desires of each viewer.

SOLUTION: The content transmission apparatus 1 transmits, via a network and in synchronization with the video data and audio data output as a broadcast wave, at least one of multi-view video data captured or generated from one or more viewpoints different from the viewpoint at which the video data were captured, and multi-listening-point audio data recorded or generated at one or more listening points different from the listening point at which the audio data were recorded. The content transmission apparatus 1 comprises a broadcast metadata adding means 5, a communication metadata adding means 9, a transmission means 13, and a presentation information storage means 17.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a content transmission apparatus that transmits content, and a content receiving apparatus that receives content, by means of broadcast waves and data transmission over a network.

Conventionally, a television program consists solely of the result edited by the broadcasting station (production company) that produced it, and the video data and audio data recorded during production (hereinafter, program material) are discarded without limit. Moreover, the total duration of the program material is several times the broadcast time of the television program.

In current television broadcasting, part of the program material not used in the original television program is reused, for example in dramas, as a collection of outtakes (an "NG collection") in a separate television program. Such a collection, however, does not satisfy viewers' preferences: it too is edited by the broadcasting station and is not interactive.

In sports broadcast programs and the like, the movements of multiple athletes must be recorded (shot and sound-collected) continuously using multiple cameras and multiple sound-collecting microphones. Naturally, far more hours of program material go unused than the portion actually broadcast as the sports program.

A baseball broadcast is described here in more detail as an example of a sports broadcast program. For a baseball broadcast, multiple cameras and multiple sound-collecting microphones are brought into the ballpark where the game is played. The program material acquired by these cameras and microphones is transmitted continuously to an outside-broadcast van parked near the ballpark.

In the outside-broadcast van, specific video data (moving images) and specific audio data produced by mixing are selected from the multiple video data and multiple audio data that make up the program material, and are sent out as the sports broadcast program (see, for example, Patent Document 1).
JP 2000-012345 A (paragraphs 0005 to 0007, FIG. 3)

However, viewers of sports broadcasts and the like have latent desires to see the athlete being broadcast from a different angle, or to see athletes other than the one being broadcast. Despite this, they can see only the athletes chosen by the broadcasting station, and these latent desires remain unmet.

Furthermore, each viewer who watches a baseball broadcast on a home television has different preferences: the team they support, their favorite players, and their favorite managers all differ. For example, some viewers find that the scene being broadcast does not show their favorite player, and wish they could watch that player at all times in scenes other than the one being broadcast. That wish would be satisfied by watching the game at the ballpark, but current television (that is, television broadcasting) gives viewers watching at home no right to select scenes according to their own preferences.

Viewers who watch a baseball broadcast on a home television also always want sharp, detailed pictures. For example, they want to predict the next play from a player's facial expression or from the state of the player's uniform (for example, fine wrinkles), and so enjoy the game more deeply. Even at current high-definition (Hi-Vision) resolution, however, it is difficult to render subtle changes in a player's expression or subtle differences in the state of a uniform. Moreover, current television systems cannot adequately convey how sharply a breaking ball thrown by the pitcher actually broke.

The above has described, taking a baseball broadcast as an example, what viewers of sports broadcasts want. As noted earlier, however, effective use of program material that would otherwise be discarded is desired for television programs in general, not only for sports broadcasts. If television camera systems with 4000 scanning lines come into practical use, the amount of information in the video data obtained from the cameras, and in the discarded program material, will increase dramatically.

If a broadband network, or channels other than the broadcast waves of ordinary television broadcasting, can be used effectively, it also becomes possible to transmit the high-resolution components of video data, that is, of program material shot with cameras compatible with a 4000-scanning-line television camera system.

In conventional television programs (television broadcasting), even when video data shot from a different angle, or audio data collected at a different location, from the broadcast program (program material) could be obtained, the transmission path for such material, that is, information and communication infrastructure such as networks, was undeveloped (insufficient), so delivering it was impractical. In recent years, however, broadband networks have rapidly gained bandwidth and have reached a level at which they can serve adequately as a transmission path for program material that used to be discarded. In other words, as computers and the like have become more powerful, program material (video data and audio data) can be extracted and composited at high speed, and broadband networks are becoming able to supplement television programs transmitted as broadcast waves.

The present invention therefore aims to solve the above problems by providing a content transmission apparatus and a content receiving apparatus that offer a mechanism for making effective use of program material that used to be discarded, and that satisfy the varied preferences and desires of each viewer, for example by providing video data shot from a different angle, and audio data collected at a different location, from those of the television program being broadcast.

To solve the above problems, the content transmission apparatus according to claim 1 transmits, via a network and in synchronization with video data and audio data being output as broadcast waves, at least one of (a) multi-view video data captured or generated from at least one viewpoint different from the viewpoint at which the video data was captured and (b) multi-listening-point audio data recorded or generated at at least one listening point different from the listening point at which the audio data was recorded. The apparatus comprises metadata adding means, presentation information storage means, and transmission means.

With this configuration, the metadata adding means attaches, to the video data and the multi-view video data, metadata describing the relationship between them, and attaches, to the audio data and the multi-listening-point audio data, metadata describing the relationship between them. For video data, this metadata is written so as to correspond to each scene. The transmission means transmits, via the network, the presentation information stored in the presentation information storage means together with at least one of the multi-view video data and the multi-listening-point audio data. The presentation information specifies in advance, in association with the metadata, how at least one of the multi-view video data and the multi-listening-point audio data is to be presented on the receiving side; for example, it sets the layout of the display screen on which the video data and the multi-view video data are presented (displayed).

The content transmission apparatus according to claim 2 is the content transmission apparatus according to claim 1, wherein the presentation information includes at least: presentation content information indicating the content to be presented on a device provided on the receiving side; operation content information indicating how that device is to be operated; multimedia content information designating the multi-view video data and the multi-listening-point audio data; layout information designating the layout of the video data and the multi-view video data presented on that device; and a presentation information generation template, preset in association with the metadata, for generating the presentation information.

With this configuration, the presentation information stored in the presentation information storage means includes presentation content information, operation content information, multimedia content information, layout information, and a presentation information generation template. With these, the transmitting side can specify in detail how the video data, multi-view video data, audio data, and multi-listening-point audio data are displayed and output on the receiving side. On the receiving side, the viewer can, within the range so set, watch the content (television program) by blending (switching between) the video data and multi-view video data, and the audio data and multi-listening-point audio data, according to personal preference.
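As a rough illustration of the five items that claim 2 lists, the presentation information could be modeled as a simple record. The sketch below is only an assumption about shape: the class names, field names, stream IDs, and the `select_stream` helper are hypothetical and do not come from the patent, which does not fix a data format here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LayoutRegion:
    """Screen rectangle assigned to one video stream (hypothetical shape)."""
    stream_id: str
    x: int
    y: int
    width: int
    height: int


@dataclass
class PresentationInfo:
    """The five items listed for claim 2; all field names are illustrative."""
    presentation_content: str          # what the receiving-side device presents
    operation_content: Dict[str, str]  # operating-device key -> action on the device
    multimedia_content: List[str]      # IDs of multi-view / multi-listening-point streams
    layout: List[LayoutRegion]         # placement of broadcast and multi-view video
    generation_template: str           # template preset in association with the metadata


def select_stream(info: PresentationInfo, requested: str, current: str) -> str:
    """Switch streams only within the range the transmitting side has set."""
    return requested if requested in info.multimedia_content else current


info = PresentationInfo(
    presentation_content="baseball broadcast with alternate camera",
    operation_content={"red_button": "switch_view"},
    multimedia_content=["cam2", "cam3"],
    layout=[LayoutRegion("broadcast", 0, 0, 1440, 1080),
            LayoutRegion("cam2", 1440, 0, 480, 270)],
    generation_template="main=$main; sub=$sub",
)
```

The `select_stream` helper reflects the statement that the viewer chooses freely, but only within the range set on the transmitting side: a request for a stream that is not designated in the multimedia content information leaves the current stream unchanged.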

The content transmission apparatus according to claim 3 is the content transmission apparatus according to claim 1 or claim 2, wherein the multi-view video data and the multi-listening-point audio data are sports-related multi-view video data and sports-related multi-listening-point audio data, and the apparatus comprises subject candidate extraction means, projection means, candidate narrowing means, centroid detection means, back-projection means, and motion prediction means.

With this configuration, the subject candidate extraction means extracts, from the captured video data, partial images that are candidates for the subject. For example, when the video data (content) is a baseball broadcast and the subject is the baseball, each still frame image making up the video data is binarized according to luminance and the high-luminance portions are extracted as partial images, or regions in motion are identified from the difference between consecutive frame images and extracted as partial images.
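The two extraction routes described above (luminance binarization and inter-frame differencing) can be sketched in a few lines. This is a minimal illustration on tiny grayscale frames held as nested lists; the function names and the threshold values are assumptions, not values given in the patent.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale frame, pixel values 0..255


def binarize(frame: Frame, threshold: int) -> Frame:
    """Mark pixels whose luminance is at or above the threshold."""
    return [[1 if p >= threshold else 0 for p in row] for row in frame]


def motion_mask(prev: Frame, curr: Frame, min_diff: int) -> Frame:
    """Mark pixels whose luminance changed by at least min_diff between frames."""
    return [
        [1 if abs(c - p) >= min_diff else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def candidate_pixels(mask: Frame) -> List[Tuple[int, int]]:
    """List the (x, y) positions of marked pixels, in row-major order."""
    return [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


prev_frame = [[10, 10, 10],
              [10, 10, 10],
              [10, 10, 10]]
curr_frame = [[10, 10, 10],
              [10, 250, 10],  # a bright, newly appeared pixel (e.g. the ball)
              [10, 10, 10]]

bright = candidate_pixels(binarize(curr_frame, threshold=200))
moving = candidate_pixels(motion_mask(prev_frame, curr_frame, min_diff=50))
```

In this toy input both routes mark the same single pixel. On real footage each route would typically yield several regions, which is why the later narrowing step is needed.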

Next, the projection means generates predicted image coordinates by projecting predicted three-dimensional coordinates obtained by predicting the motion of the subject. The predicted three-dimensional coordinates are a prediction of the subject's motion made from estimated three-dimensional coordinates, which are in turn obtained by back-projecting the image coordinates of the extracted subject. The candidate narrowing means then narrows down the subject candidates among the partial images extracted by the subject candidate extraction means, on the basis of the predicted image coordinates generated by the projection means, and the centroid detection means detects the centroid of each narrowed-down candidate. The candidate narrowing means also refers to subject information, in which information about the subject is set in advance. In other words, the candidate narrowing means uses the predicted image coordinates and the subject information as clues, that is, as feature quantities that characterize the subject, to narrow down the candidates, and the centroid detection means detects a candidate's centroid in order to pin down its exact position.
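The projection, narrowing, and centroid steps can be sketched as follows. The sketch assumes a simple pinhole camera at the origin looking along +Z with a focal length in pixels, and it stands in for the "subject information" with an expected pixel-area range; both are assumptions made for illustration, since this passage does not specify the camera model or the contents of the subject information.

```python
import math
from typing import List, Tuple

Pixel = Tuple[int, int]


def project(point3d: Tuple[float, float, float],
            focal_px: float, cx: float, cy: float) -> Tuple[float, float]:
    """Project a 3D point to image coordinates with an assumed pinhole model."""
    x, y, z = point3d
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def centroid(pixels: List[Pixel]) -> Tuple[float, float]:
    """Mean position of the pixels of one candidate region."""
    n = len(pixels)
    return (sum(px for px, _ in pixels) / n, sum(py for _, py in pixels) / n)


def narrow_candidates(candidates: List[List[Pixel]],
                      predicted_uv: Tuple[float, float],
                      max_dist: float,
                      min_area: int, max_area: int) -> List[List[Pixel]]:
    """Keep regions of plausible size that lie near the predicted image position."""
    kept = []
    for pixels in candidates:
        if not (min_area <= len(pixels) <= max_area):
            continue  # size inconsistent with the assumed subject information
        cu, cv = centroid(pixels)
        if math.hypot(cu - predicted_uv[0], cv - predicted_uv[1]) <= max_dist:
            kept.append(pixels)
    return kept


predicted_uv = project((1.0, 0.5, 10.0), focal_px=1000.0, cx=320.0, cy=240.0)
ball = [(419, 289), (420, 289), (419, 290), (420, 290)]
glare = [(10, 10), (11, 10)]  # bright but far from the prediction
kept = narrow_candidates([ball, glare], predicted_uv,
                         max_dist=5.0, min_area=2, max_area=10)
```

Here the predicted image position is (420.0, 290.0), so the region around it survives and the distant bright region is rejected, even though both passed the extraction step.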

The back-projection means then back-projects the image coordinates of the candidate whose centroid has been detected to obtain estimated three-dimensional coordinates, and the motion prediction means obtains predicted three-dimensional coordinates from these estimated three-dimensional coordinates. The transmission means outputs the sports-related multi-view video data together with the estimated three-dimensional coordinates. The sports-related multi-listening-point audio data pertains to the sports-related multi-view video data and is output here as an accompaniment to it.
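A minimal sketch of the back-projection and motion-prediction steps follows. Two simplifications are assumed and are not taken from the patent: a single image point only defines a ray, so the depth along the ray is treated here as already known (in practice it might come from a second camera or from scene constraints), and the motion model is plain constant velocity between frames.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def back_project(uv: Tuple[float, float], depth: float,
                 focal_px: float, cx: float, cy: float) -> Vec3:
    """Invert an assumed pinhole projection for a point at a known depth."""
    u, v = uv
    return ((u - cx) * depth / focal_px, (v - cy) * depth / focal_px, depth)


def predict_next(prev: Vec3, curr: Vec3) -> Vec3:
    """Constant-velocity prediction one frame ahead (assumed motion model)."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1], 2 * curr[2] - prev[2])


# Estimated 3D position from the detected centroid of the current frame.
estimated = back_project((420.0, 290.0), depth=10.0,
                         focal_px=1000.0, cx=320.0, cy=240.0)

# Predicted 3D position for the next frame, given the previous estimate.
predicted = predict_next((0.0, 0.0, 10.0), estimated)
```

The predicted position would then be projected back into the image to guide candidate narrowing for the next frame, which closes the loop described in the text. A ball in flight would call for a model that includes gravity; constant velocity is used only to keep the sketch short.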

The content receiving apparatus according to claim 4 receives video data and audio data output as broadcast waves and receives, via a network, at least one of (a) multi-view video data synchronized with them and captured or generated from at least one viewpoint different from the viewpoint at which the video data was captured and (b) multi-listening-point audio data recorded or generated at at least one listening point different from the listening point at which the audio data was recorded. The apparatus comprises broadcast wave receiving means, receiving means, operation signal receiving means, and control means.

With this configuration, the broadcast wave receiving means receives the video data and audio data transmitted from the transmitting side as broadcast waves. The receiving means then receives, via the network, the presentation information and at least one of the multi-view video data and the multi-listening-point audio data. The presentation information specifies how at least one of the multi-view video data and the multi-listening-point audio data is to be presented (displayed and output) on the receiving-side device, and is set in association with metadata describing the relationship between the video data and the multi-view video data and the relationship between the audio data and the multi-listening-point audio data.

The operation signal receiving means then receives an operation signal output from an operating device operated by the user (for example, the supplied remote control). After that, the control means controls, on the basis of the operation signal and the presentation information, the processing of the video data and audio data and of at least one of the multi-view video data and the multi-listening-point audio data, and outputs the result. In other words, the display output is controlled according to the user's operation of the operating device (the operation signal), within the range set in advance on the transmitting side (an expression that follows the presentation method set by the presentation information). This content receiving apparatus is thus a so-called PUSH-type apparatus: it operates on the basis of the video data, audio data, multi-view video data, multi-listening-point audio data, and presentation information sent from the transmitting side, together with operation signals generated by user operation.

The content receiving apparatus according to claim 5 receives video data and audio data output as broadcast waves and receives, via a network, at least one of (a) multi-view video data synchronized with them and captured or generated from at least one viewpoint different from the viewpoint at which the video data was captured and (b) multi-listening-point audio data recorded or generated at at least one listening point different from the listening point at which the audio data was recorded. The apparatus comprises broadcast wave receiving means, operation signal receiving means, data requesting means, receiving means, and control means.

With this configuration, the broadcast wave receiving means receives the video data and audio data transmitted from the transmitting side as broadcast waves. The operation signal receiving means then receives an operation signal output from the operating device operated by the user. The data requesting means then requests, on the basis of the operation signal, the multi-view video data, multi-listening-point audio data, and presentation information from the transmitting side, or from another service-providing site or the like, via the network.

The receiving means then receives, via the network, the presentation information and at least one of the requested multi-view video data and multi-listening-point audio data, and the control means controls, on the basis of an operation signal (one that operates the apparatus, other than the data request signal) and the presentation information, the processing of the video data and audio data and of at least one of the multi-view video data and the multi-listening-point audio data, and outputs the result. This content receiving apparatus is thus a so-called PULL-type apparatus: it requests the multi-view video data, multi-listening-point audio data, and presentation information it needs from the transmitting side, another service-providing site, or the like, and operates once it has obtained them.
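The PULL-type flow (operation signal, data request, reception, then control) can be sketched with the network replaced by an injected function. Everything named here is hypothetical: the `PullReceiver` class, the signal dictionary keys, and the in-memory `catalog` that stands in for the transmitting side or a service-providing site.

```python
from typing import Callable, Dict, Optional


class PullReceiver:
    """Toy PULL-type receiver: fetches extra streams only when the user asks."""

    def __init__(self, fetch: Callable[[str], Optional[dict]]) -> None:
        self._fetch = fetch                   # stands in for the network request
        self.received: Dict[str, dict] = {}   # streams obtained so far

    def on_operation_signal(self, signal: dict) -> str:
        """Route a signal either to a data request or to ordinary control."""
        if signal.get("type") == "data_request":
            stream_id = signal["stream_id"]
            payload = self._fetch(stream_id)  # request to the sender or another site
            if payload is None:
                return "unavailable"
            self.received[stream_id] = payload
            return "received"
        return "control"                      # any other signal drives presentation


# In-memory stand-in for what the transmitting side could offer.
catalog = {"cam2": {"video": "third-base camera", "presentation": "pip"}}
receiver = PullReceiver(catalog.get)

first = receiver.on_operation_signal({"type": "data_request", "stream_id": "cam2"})
second = receiver.on_operation_signal({"type": "data_request", "stream_id": "cam9"})
third = receiver.on_operation_signal({"type": "volume_up"})
```

A PUSH-type receiver would differ only in that the streams and presentation information arrive without a request, so every operation signal would take the control path.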

The content receiving apparatus according to claim 6 receives video data and audio data output as broadcast waves and receives, via a network, at least one of (a) multi-view video data synchronized with them and captured or generated from at least one viewpoint different from the viewpoint at which the video data was captured and (b) multi-listening-point audio data recorded or generated at at least one listening point different from the listening point at which the audio data was recorded. The apparatus comprises broadcast wave receiving means, receiving means, presentation information generating means, operation signal receiving means, and control means.

With this configuration, the broadcast wave receiving means receives the video data and audio data transmitted from the transmitting side as broadcast waves, and the receiving means receives, via the network, at least one of the multi-view video data and the multi-listening-point audio data.

The presentation information generating means then generates presentation information on the basis of a presentation information generation template. This template is set in association with metadata describing the relationship between the video data and the multi-view video data and the relationship between the audio data and the multi-listening-point audio data, and it accompanies that metadata.

After that, the operation signal receiving means receives an operation signal output from the operating device operated by the user, and the control means controls, on the basis of the operation signal and the generated presentation information, the processing of the video data and audio data and of at least one of the multi-view video data and the multi-listening-point audio data, and outputs the result. This content receiving apparatus is thus a generation-type apparatus that generates the presentation information itself: it operates on the basis of the presentation information generation template that accompanies the metadata contained in at least one of the received multi-view video data and multi-listening-point audio data.
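Generating presentation information from a template that travels with the metadata can be illustrated with Python's standard `string.Template`. The template syntax, the placeholder names, and the metadata keys below are all assumptions made for this sketch; the patent does not define the template format in this passage.

```python
from string import Template
from typing import Dict


def generate_presentation_info(template_text: str, metadata: Dict[str, str]) -> str:
    """Fill the template's placeholders with values taken from the metadata."""
    return Template(template_text).substitute(metadata)


# Hypothetical template accompanying the metadata of a received stream.
template_text = "main=$main_stream; sub=$sub_stream; layout=$layout"

# Hypothetical metadata relating the broadcast video to one multi-view stream.
metadata = {
    "main_stream": "broadcast",
    "sub_stream": "cam2",
    "layout": "picture-in-picture",
}

presentation_info = generate_presentation_info(template_text, metadata)
```

`substitute` raises `KeyError` when the metadata lacks a placeholder the template needs, which is a reasonable way for a receiver to notice that a template and its metadata do not match.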

The content receiving apparatus according to claim 7 is the content receiving apparatus according to any one of claims 4 to 6, further comprising storage means for storing the multi-view video data and the multi-listening-point audio data.

With this configuration, because the content receiving apparatus has storage means for the multi-view video data and the multi-listening-point audio data, it can store these data once and then operate on them. Compared with acquiring the multi-view video data and multi-listening-point audio data piece by piece over the network through the receiving means, the user can enjoy the content without perceiving delays (with less stress).

The content receiving apparatus according to claim 8 is the content receiving apparatus according to any one of claims 4 to 7, wherein the multi-view video data and the multi-listening-point audio data are sports-related multi-view video data and sports-related multi-listening-point audio data, and the apparatus comprises composite video data generating means and composite audio data generating means.

With this configuration, the composite video data generating means generates composite video data for the sports-related multi-view video data on the basis of the received estimated three-dimensional coordinates, and the composite audio data generating means generates composite audio data for the sports-related multi-listening-point audio data. In other words, the composite video data and composite audio data are generated in the content receiving apparatus on the receiving side, on the basis of the estimated three-dimensional coordinates generated on the transmitting side.

According to the invention of claim 1, for the video data and audio data broadcast as the main broadcast, and for the multi-view video data shot and the multi-listening-point audio data recorded in synchronization with them, the relationship between the video data and the multi-view video data and the relationship between the audio data and the multi-listening-point audio data are described in metadata and attached, and presentation information that sets, in association with that metadata, how the data are to be presented on the receiving side is transmitted. As a result, at least one of the multi-view video data and the multi-listening-point audio data, program material that would otherwise have been discarded, can be put to effective use. In addition, by using the presentation information to define many kinds of presentation methods for the receiving side, the varied preferences and desires of each viewer can be satisfied.

According to the invention of claim 2, the presentation information allows the transmitting side to specify in detail how the video data, multi-view video data, audio data, and multi-listening-point audio data are displayed and output on the receiving side. On the receiving side, the viewer can, within the range so set, watch the content by blending the video data and multi-view video data, and the audio data and multi-listening-point audio data, according to personal preference.

According to the invention of claim 3, partial images that are candidates for the subject contained in the captured video data are extracted, and predicted image coordinates are generated by projecting the predicted three-dimensional coordinates obtained by predicting the motion of the subject. The subject candidates among the extracted partial images are narrowed down based on the predicted image coordinates, and the centroid of the narrowed-down subject candidate is detected. The image coordinates of the candidate whose centroid was detected are back-projected to obtain estimated three-dimensional coordinates, and predicted three-dimensional coordinates are obtained from these estimated three-dimensional coordinates. As a result, for example, the trajectory of the ball, which viewers have asked to see in sports broadcasts, can easily be drawn from the estimated three-dimensional coordinates, and the viewers' request can be met.

According to the invention of claim 4, the video data and audio data transmitted from the transmitting side as a broadcast wave are received, and at least one of the multi-view video data and the multi-listening-point audio data is received, together with the presentation information, via the network. An operation signal output from the operating device operated by the user is then received, and the processing of the video data and audio data and of at least one of the multi-view video data and the multi-listening-point audio data is controlled, based on the operation signal and the presentation information, and the result is output. Consequently, at least one of the multi-view video data and the multi-listening-point audio data, which are program material that would otherwise have been discarded, can be put to effective use. Furthermore, the many kinds of presentation methods defined by the presentation information set on the transmitting side make it possible to satisfy each viewer's varied preferences and desires regarding the content (television program).

According to the invention of claim 5, the video data and audio data transmitted from the transmitting side as a broadcast wave are received, and an operation signal output from the operating device operated by the user is received. Based on the operation signal, the multi-view video data, multi-listening-point audio data, and presentation information are requested via the network from the transmitting side or from another service-providing site or the like, and at least one of the requested multi-view video data and multi-listening-point audio data is received together with the presentation information. The control means then controls the processing of the video data and audio data and of at least one of the multi-view video data and the multi-listening-point audio data, based on the operation signal (an operation signal for operating the device, other than the data request signal) and the presentation information, and outputs the result. Consequently, at least one of the multi-view video data and the multi-listening-point audio data, which are program material that would otherwise have been discarded, can be put to effective use. Furthermore, the many kinds of presentation methods defined by the presentation information set on the transmitting side make it possible to satisfy each viewer's varied preferences and desires regarding the content (television program).

According to the invention of claim 6, the video data and audio data transmitted from the transmitting side as a broadcast wave are received, and at least one of the multi-view video data and the multi-listening-point audio data is received via the network. Presentation information is then generated based on the presentation information generation template, an operation signal output from the operating device operated by the user is received, and the processing of the video data and audio data and of at least one of the multi-view video data and the multi-listening-point audio data is controlled, based on the operation signal and the generated presentation information, and the result is output. Consequently, at least one of the multi-view video data and the multi-listening-point audio data, which are program material that would otherwise have been discarded, can be put to effective use. Furthermore, because the presentation information is generated on the receiving side, the viewer can watch the content (television program) with a certain degree of freedom, and each viewer's varied preferences and desires regarding the content (television program) can be satisfied.

According to the invention of claim 7, the device can operate after the multi-view video data and multi-listening-point audio data have been stored. Compared with acquiring the multi-view video data and multi-listening-point audio data piece by piece over the network, the user can therefore enjoy the content (television program) without perceiving delays (that is, with less stress).

According to the invention of claim 8, synthesized video data for the sports-related multi-view video data and synthesized audio data for the sports-related multi-listening-point audio data are generated, so viewers' varied preferences and desires regarding the content (sports broadcasts in particular) can be satisfied.

Next, embodiments of the present invention will be described in detail with reference to the drawings as appropriate.
(Content transmission/reception system)
FIG. 1 shows a block diagram of the content transmission/reception system. As shown in FIG. 1, the content transmission/reception system A consists of a content transmission device 1 and a content receiving device 3. It outputs content (a television program) consisting of video data and audio data by broadcast wave and, via a network, transmits multi-view video data shot from viewpoints different from that of the video data and multi-listening-point audio data collected at locations different from that of the audio data, synchronized with the video data and the audio data.

In this content transmission/reception system A, when the television program is, for example, a sports broadcast (a baseball broadcast), the content transmission device 1 at the broadcasting station obtains video data and audio data, via a relay vehicle, from multiple cameras and sound-collecting microphones placed in the baseball stadium. It broadcasts this data by broadcast wave as an ordinary baseball broadcast (for example, video data shot from behind the pitcher with the pitcher, catcher, and batter in a single frame, and audio data collected by a microphone placed in the cheering section of the team at bat). In addition, it transmits via the network video data shot by cameras other than the one used for that video data (for example, video data of the areas around first, second, and third base) and audio data collected by microphones other than the one used for that audio data (for example, audio data collected by a microphone placed in the cheering section of the fielding team).

The viewer on the receiving side (content receiving device 3) can then select and view whatever they wish from the multi-view video data and multi-listening-point audio data transmitted via the network. In other words, the content transmission/reception system A transmits via the network multi-view video data and multi-listening-point audio data (program material) that would normally be discarded (not selected by the switcher and therefore never broadcast) and leaves the choice of what to view to the viewer, thereby making active and effective use of that program material.

Note that it would be possible to send the multi-view video data and multi-listening-point audio data to the content receiving device 3 on the receiving side over other digital broadcasting channels instead of distributing them over the network. Even in that case, however, a route for obtaining viewers' requests in real time, namely a network, would still be required, so the present invention is configured to use the network.

(Configuration of the content transmission device)
The content transmission device 1 outputs a television program as the broadcast wave of the main broadcast and transmits, via the network, the editing material video data (multi-view video data) and editing material audio data (multi-listening-point audio data) that were not broadcast. It comprises broadcast metadata provision means 5, editing material processing means 7, communication metadata provision means 9, broadcast wave output means 11, transmission means 13, metadata information storage means 15, and presentation information storage means 17.

The broadcast metadata provision means 5 attaches metadata to the main broadcast video data and main broadcast audio data input as the television program for the main broadcast. The main broadcast video data and main broadcast audio data are digital signals that have already been compressed with MPEG-2 when they are input to the content transmission device 1. The metadata attached here describes the relationship between the main broadcast video data and the editing material video data (broadcast-wave video metadata) and the relationship between the main broadcast audio data and the editing material audio data (broadcast-wave audio metadata).

For example, the broadcast-wave video metadata describes the title (program name) of the video data, the broadcast time, the number of multi-view video data streams distributed via the network, identification IDs, the presentation information generation template, and so on. The broadcast-wave audio metadata likewise describes the title (program name) of the audio data, the broadcast time, the number of multi-listening-point audio data streams distributed via the network, identification IDs, the presentation information generation template, and so on. Note that metadata in general is data (information) that describes other data; for example, it refers to a record of the history of extracting data from a database or the like (history information), information about the structure of the data, or information about the architecture and contents of the database. A time code indicating elapsed time is added in advance to the main broadcast video data and main broadcast audio data.
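The fields listed above could be held in a simple record. The following is a minimal sketch only: the class name, field names, and value formats are hypothetical and are not taken from the patent, which does not specify an encoding.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BroadcastVideoMetadata:
    """Hypothetical container for broadcast-wave video metadata."""
    title: str                       # title (program name) of the video data
    broadcast_time: str              # broadcast time; the string format is an assumption
    multiview_stream_ids: List[str] = field(default_factory=list)  # IDs of network-delivered streams
    presentation_template: str = ""  # presentation information generation template

    @property
    def multiview_count(self) -> int:
        # Number of multi-view video data streams distributed via the network
        return len(self.multiview_stream_ids)


meta = BroadcastVideoMetadata(
    title="Baseball broadcast",
    broadcast_time="19:00-21:00",
    multiview_stream_ids=["cam-first-base", "cam-second-base", "cam-third-base"],
)
print(meta.multiview_count)
```

Deriving the stream count from the ID list, rather than storing it separately, is a design choice of this sketch that keeps the two values from drifting apart.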

The editing material processing means 7 processes the input editing material video data and editing material audio data in order to respond to the wide variety of preferences (viewing desires) of viewers on the receiving side; its details are shown in FIG. 2. As shown in FIG. 2, the editing material processing means 7 comprises video data processing control means 7a, audio data extraction control means 7b, tracking candidate region extraction means 7c (subject candidate extraction means), candidate narrowing means 7d, centroid detection means 7e, back projection means 7f, motion prediction means 7g, projection means 7h, processed video data output means 7i, audio data switching means 7j, audio data processing means 7k, and audio data output means 7l (refer to FIG. 1 as appropriate during the description of FIG. 2).

The editing material video data input to the content transmission device 1 is, for example, multi-view video data shot from various angles by multiple cameras for a sports broadcast but not adopted for the main broadcast, or footage of a performer shot from various viewpoints with multiple cameras during the filming of a drama or the like but not used in the main broadcast.

The video data processing control means 7a controls whether the editing material video data is output as is or processed before output, based on an operation signal entered by the operator of the content transmission device 1 through input means (not shown) or on a request signal transmitted from the content receiving device 3 on the receiving side. Because editing material video data from a wide variety of fields is input to the content transmission device 1, it is difficult to determine automatically which data should be processed and which should not; the choice is therefore controlled under the operator's supervision.

The audio data extraction control means 7b controls whether the editing material audio data is output as is, switched before output, or filtered so that only specific audio data within it is extracted and output, based on an operation signal entered by the operator of the content transmission device 1 through input means (not shown).

In this embodiment, the portions containing a search keyword requested by the user (viewer) of the content receiving device 3 on the receiving side are extracted from all of the editing material audio data (multi-listening-point audio data) and output. Alternatively, the portion accompanying the editing material video data may simply be output.

The tracking candidate region extraction means 7c extracts, from the frame images that are the smallest units making up the editing material video data, regions (image regions, that is, partial images that are subject candidates) that are candidates for a preset tracking target, for example a fast-moving ball or an athlete in a sports broadcast. It extracts the image regions containing tracking target candidates based on the differences in color and luminance between those regions and the background. That is, when the tracking target is a ball, the image regions containing tracking target candidates (non-background regions) are extracted as "silhouettes" (silhouette regions) based on the difference between the white of the ball and the green of the grass or the dark brown of the ground in the background. The silhouette region is a binary image.
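As an illustration of extracting a binary silhouette from a color difference against the background, here is a minimal sketch. It assumes a frame stored as nested lists of RGB tuples, a single known background color, and a sum-of-absolute-differences threshold; the function name, the threshold, and the sample colors are hypothetical, and the patent does not specify the actual comparison.

```python
def extract_silhouette(frame, background_rgb, threshold):
    """Return a binary mask: 1 where a pixel differs strongly from the background colour."""
    mask = []
    for row in frame:
        mask_row = []
        for (r, g, b) in row:
            # Sum of absolute per-channel differences (an assumed distance measure)
            dist = (abs(r - background_rgb[0])
                    + abs(g - background_rgb[1])
                    + abs(b - background_rgb[2]))
            mask_row.append(1 if dist > threshold else 0)
        mask.append(mask_row)
    return mask


GREEN = (30, 140, 40)     # stand-in for the grass
WHITE = (250, 250, 250)   # stand-in for the ball
frame = [
    [GREEN, GREEN, GREEN, GREEN],
    [GREEN, WHITE, WHITE, GREEN],
    [GREEN, WHITE, WHITE, GREEN],
    [GREEN, GREEN, GREEN, GREEN],
]
mask = extract_silhouette(frame, background_rgb=GREEN, threshold=150)
```

A real broadcast frame has a non-uniform background, so a per-pixel background model would be needed; the single-color background here only keeps the example short.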

The candidate narrowing means 7d narrows the silhouette regions extracted by the tracking candidate region extraction means 7c down to the single silhouette region most likely to be the tracking target. To do so, it refers to the predicted image coordinates output from the projection means 7h described later and evaluates the shape of the tracking target, its area, the color of the corresponding partial-region video, and its predicted position. Although not shown in the figure, the candidate narrowing means 7d includes tracking target information storage means in which tracking target information such as the shape, area, and color of the tracking target is stored in advance for use in narrowing down the candidates.

For example, when the tracking target is a ball, the shape is evaluated by matching against the tracking target information stored in the tracking target information storage means to determine whether the shape is round and how closely it approximates a perfect circle. The area is evaluated by matching against the stored tracking target information to determine whether the area is a plausible size for a ball. The color of the corresponding partial-region video is evaluated by matching against the stored tracking target information to determine whether it agrees with the color pattern of the ball in question (for example, a baseball is almost uniformly white, whereas a soccer ball has other colors mixed in on a white ground). Finally, the predicted position is evaluated by determining whether the tracking target is located at the position predicted from the frame image immediately preceding the frame image extracted by the tracking candidate region extraction means 7c.
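The four evaluations (shape, area, color, predicted position) can be combined into a single score. The sketch below is an assumption-laden illustration, not the patent's method: the circularity measure 4πA/P², the product of the four terms, and all names and sample values are hypothetical choices.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    area: float        # silhouette area in pixels
    perimeter: float   # silhouette perimeter in pixels
    mean_rgb: tuple    # mean colour of the corresponding partial image
    centroid: tuple    # (x, y) image coordinates


def circularity(c):
    # 4*pi*A / P^2 is 1.0 for a perfect circle and smaller for elongated shapes
    return 4.0 * math.pi * c.area / (c.perimeter ** 2)


def score(c, target_area, target_rgb, predicted_xy):
    shape = circularity(c)
    area = min(c.area, target_area) / max(c.area, target_area)
    colour = 1.0 - sum(abs(a - b) for a, b in zip(c.mean_rgb, target_rgb)) / (3 * 255)
    dist = math.hypot(c.centroid[0] - predicted_xy[0], c.centroid[1] - predicted_xy[1])
    position = 1.0 / (1.0 + dist)   # closer to the predicted image coordinates scores higher
    return shape * area * colour * position


def narrow_down(candidates, target_area, target_rgb, predicted_xy):
    """Pick the single candidate most likely to be the tracking target."""
    return max(candidates, key=lambda c: score(c, target_area, target_rgb, predicted_xy))


ball = Candidate(area=78.5, perimeter=31.4, mean_rgb=(245, 245, 245), centroid=(100, 50))
line = Candidate(area=80.0, perimeter=84.0, mean_rgb=(250, 250, 250), centroid=(20, 200))
best = narrow_down([ball, line], target_area=78.5,
                   target_rgb=(255, 255, 255), predicted_xy=(102, 51))
```

In this example the elongated white region loses on both circularity and distance from the predicted position, so the round region near the prediction is selected.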

The centroid detection means 7e obtains, for each frame image, the area centroid of the silhouette region that the candidate narrowing means 7d has narrowed down and identified as the tracking target. The area centroids of the silhouette regions for the individual frame images are output to the back projection means 7f as an image coordinate group.
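For a binary silhouette, the area centroid is the mean of the coordinates of the pixels that are set. A minimal sketch, assuming the mask is a nested list of 0/1 values (the function name is hypothetical):

```python
def area_centroid(mask):
    """Return the (x, y) centroid of the pixels set to 1, or None for an empty mask."""
    total = sum_x = sum_y = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                total += 1
                sum_x += x
                sum_y += y
    if total == 0:
        return None
    return (sum_x / total, sum_y / total)


silhouette = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
centroid = area_centroid(silhouette)
```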

The back projection means 7f maps the image coordinate group output from the centroid detection means 7e to three-dimensional coordinates (the estimated three-dimensional coordinates), using, for example, the filtering estimation equations of an extended Kalman filter together with information on the camera position and camera angle of the camera that shot the editing material video data (multi-view video data).

The motion prediction means 7g predicts the three-dimensional coordinates of the tracking target (the predicted three-dimensional coordinates) using the estimated three-dimensional coordinates output from the back projection means 7f and a pre-recorded motion model such as a constant-acceleration motion model.
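Under a constant-acceleration model, the position one time step ahead follows p' = p + v·dt + ½·a·dt². The sketch below assumes that the velocity is already known (for example, estimated from successive estimated coordinates) and that gravity is the only acceleration; the function name and the numbers are illustrative, not from the patent.

```python
def predict_position(p, v, a, dt):
    """Constant-acceleration prediction: p + v*dt + 0.5*a*dt^2, per axis."""
    return tuple(pi + vi * dt + 0.5 * ai * dt * dt for pi, vi, ai in zip(p, v, a))


GRAVITY = (0.0, 0.0, -9.8)        # m/s^2, z axis pointing up (an assumed convention)
position = (0.0, 0.0, 2.0)        # estimated three-dimensional coordinates, in metres
velocity = (30.0, 0.0, 1.0)       # m/s, assumed to be known
predicted = predict_position(position, velocity, GRAVITY, dt=0.5)
```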

The projection means 7h maps the predicted three-dimensional coordinates output from the motion prediction means 7g to image coordinates (that is, it outputs the predicted image coordinates), using information on the camera position and camera angle of the camera that shot the editing material video data (multi-view video data).
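To make the role of camera position and camera angle concrete, the following sketch uses an ideal pinhole camera with a single pan (yaw) angle about the vertical axis. This is a simplifying assumption: the patent does not give its camera model, and its back projection is described as using extended Kalman filter equations, whereas the inverse shown here simply assumes the depth along the optical axis is known. All names and parameters are hypothetical.

```python
import math


def world_to_camera(p, cam_pos, yaw):
    """Express world point p in the frame of a camera at cam_pos panned by yaw (radians)."""
    dx, dy, dz = p[0] - cam_pos[0], p[1] - cam_pos[1], p[2] - cam_pos[2]
    xc = math.cos(yaw) * dx - math.sin(yaw) * dz
    zc = math.sin(yaw) * dx + math.cos(yaw) * dz   # depth along the optical axis
    return (xc, dy, zc)


def project(p, cam_pos, yaw, focal, centre):
    """Pinhole projection of a three-dimensional point to image coordinates (u, v)."""
    xc, yc, zc = world_to_camera(p, cam_pos, yaw)
    if zc <= 0:
        raise ValueError("point is behind the camera")
    return (focal * xc / zc + centre[0], focal * yc / zc + centre[1])


def back_project(uv, depth, cam_pos, yaw, focal, centre):
    """Inverse of project() when the depth along the optical axis is known."""
    xc = (uv[0] - centre[0]) * depth / focal
    yc = (uv[1] - centre[1]) * depth / focal
    dx = math.cos(yaw) * xc + math.sin(yaw) * depth
    dz = -math.sin(yaw) * xc + math.cos(yaw) * depth
    return (dx + cam_pos[0], yc + cam_pos[1], dz + cam_pos[2])


uv = project((1.0, 0.5, 10.0), (0.0, 0.0, 0.0), 0.0, 1000.0, (640.0, 360.0))
```

A single image point only fixes a ray, so without a known depth some additional constraint (a second camera or a motion model, as in a filter-based estimator) is needed to recover the three-dimensional position.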

The processed video data output means 7i draws the trajectory of the tracking target based on the estimated three-dimensional coordinates obtained by the back projection means 7f, superimposes it on the editing material video data, and outputs the result. The trajectory is produced by using a tracking target (for example, a ball) created in advance with CG or the like and superimposing multiple copies of it onto a single frame image, one after another, at a speed the eye can follow. The processed video data output means 7i can also output the estimated three-dimensional coordinates attached to the editing material video data without drawing the trajectory. In that case, the trajectory of the tracking target is drawn by the synthesized video/audio data generating means 27 (described later) of the content receiving device 3 on the receiving side.
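A rough sketch of the superimposition step is given below. It assumes the trajectory points have already been converted to image coordinates and that a frame is a nested list of pixel values; the function name and the single-pixel marker standing in for the CG ball are hypothetical.

```python
def overlay_trajectory(frame, points, marker):
    """Return a copy of frame with marker written at each in-bounds trajectory point."""
    out = [row[:] for row in frame]          # leave the input frame untouched
    height, width = len(out), len(out[0])
    for (x, y) in points:
        xi, yi = round(x), round(y)
        if 0 <= yi < height and 0 <= xi < width:
            out[yi][xi] = marker
    return out


blank = [[0, 0, 0, 0] for _ in range(3)]
trajectory = [(0, 0), (1, 1), (2, 2), (9, 9)]   # the last point lies outside the frame
drawn = overlay_trajectory(blank, trajectory, marker=1)
```

Calling the function with a growing prefix of the trajectory on successive frames would give the effect of markers appearing one after another.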

The audio data switching means 7j selects (switches to) the desired editing material audio data from among all of the editing material audio data (the recorded editing material audio data) output under the control of the audio data extraction control means 7b, based on the data request signal transmitted from the content receiving device 3 on the receiving side, and outputs it to the audio data processing means 7k or the audio data output means 7l.

The audio data processing means 7k filters the editing material audio data selected and output by the audio data switching means 7j, and extracts and outputs only the desired editing material audio data. For example, if the editing material audio data was recorded at a baseball stadium (for a baseball broadcast) with a microphone placed near the catcher, only the sound of the mitt catching the ball thrown by the pitcher can be extracted.

The audio data output means 7l outputs the editing material audio data output from the audio data switching means 7j, or the editing material audio data extracted by the audio data processing means 7k, either as is or multiplexed with the editing material audio data output from the audio data extraction control means 7b.

Returning to FIG. 1, the description of the components of the content transmission device 1 continues.
The communication metadata provision means 9 attaches metadata to the input editing material video data (multi-view video data) and editing material audio data (multi-listening-point audio data), or to the processed video data and extracted audio data processed (extracted) by the editing material processing means 7. The metadata attached here describes the relationship between the editing material video data and the main broadcast video data (communication video metadata) and the relationship between the editing material audio data and the main broadcast audio data (communication audio metadata).

For example, the communication video metadata describes the title (program name) of the editing material video data, the broadcast time, the identification ID of the main broadcast video data output by broadcast wave, the presentation information generation template, and so on. The communication audio metadata describes the title (program name) of the editing material audio data, the broadcast time, the identification ID of the main broadcast audio data output by broadcast wave, the presentation information generation template, and so on.

The broadcast wave output means 11 outputs, as a broadcast wave, the main broadcast video data and main broadcast audio data to which the broadcast-wave video metadata and broadcast-wave audio metadata have respectively been attached. It applies OFDM modulation to the main broadcast video data and main broadcast audio data, which are digital signals, and outputs the result as a broadcast wave (OFDM wave).

The transmission means 13 adds the presentation information to at least one of the editing material video data and the editing material audio data, to which the communication video metadata and communication audio metadata have respectively been attached, and transmits the result to the network as streaming data.

The metadata information storage means 15 consists of a typical large-capacity hard disk, memory, or the like and stores the broadcast-wave video metadata, broadcast-wave audio metadata, communication video metadata, and communication audio metadata that it outputs to the broadcast metadata provision means 5 and the communication metadata provision means 9.

The presentation information storage means 17 consists of a typical large-capacity hard disk, memory, or the like and stores the presentation information. The presentation information includes the following: presentation content information, which is information on what is presented on the display output device (described later) connected to the content receiving device 3 on the receiving side (the relationship between the main broadcast video data and the editing material video data, and between the main broadcast audio data and the editing material audio data), that is, information on the presentation method; operation content information, which is information on the operations performed when the data is processed and output by the content receiving device 3; multimedia content information, which lists multimedia content to be referenced from various sites via the network; layout information, which determines the layout used when displaying the main broadcast video data and editing material video data on the display screen of the display output device (described later); and a presentation information generation template, which serves as a template for generating presentation information in the content receiving device 3 on the receiving side.
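The five components of the presentation information could be grouped as follows. This is a hypothetical sketch: the class name, field types, key names, and the rectangle convention for the layout are assumptions made for illustration, since the patent does not define a concrete format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PresentationInfo:
    """Hypothetical grouping of the five kinds of presentation information."""
    presentation_content: Dict[str, str]   # how main broadcast and editing material data relate
    operation_content: Dict[str, str]      # receiver behaviour when processing and outputting
    multimedia_content: List[str] = field(default_factory=list)   # content to reference over the network
    layout: Dict[str, Tuple[int, int, int, int]] = field(default_factory=dict)  # id -> (x, y, width, height)
    generation_template: str = ""          # template for generating presentation information on the receiver


info = PresentationInfo(
    presentation_content={"cam-first-base": "alternate view of main video"},
    operation_content={"on_select": "switch"},
    layout={
        "main": (0, 0, 1920, 1080),             # main broadcast fills the screen
        "cam-first-base": (1440, 0, 480, 270),  # editing material shown as an inset
    },
)
inset_width = info.layout["cam-first-base"][2]
```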

With the content transmission device 1, the broadcast metadata provision means 5 and the communication metadata provision means 9 attach metadata describing the relationship between the video data and the multi-view video data (broadcast-wave video metadata and communication video metadata) to that video data and multi-view video data, and attach metadata describing the relationship between the audio data and the multi-listening-point audio data (broadcast-wave audio metadata and communication audio metadata) to that audio data and multi-listening-point audio data. The broadcast wave output means 11 outputs the video data and audio data as a broadcast wave (OFDM wave), and the transmission means 13 transmits the presentation information and at least one of the multi-view video data and the multi-listening-point audio data to the network. Consequently, at least one of the multi-view video data and the multi-listening-point audio data, which are program material that would otherwise have been discarded, can be put to effective use. Furthermore, by setting many kinds of presentation methods for the receiving side through the presentation information, the varied preferences and desires of individual viewers can be satisfied.

Furthermore, according to the content transmission device 1, the presentation information stored in the presentation information storage means 17 includes the presentation content information, the operation content information, the multimedia content information, the layout information, and the presentation information generation template. With this information, the transmitting side can specify in detail how the video data, multi-view video data, audio data, and multi-listening-point audio data are displayed and output on the receiving side. On the receiving side, the viewer can, within the range set by the transmitting side, watch the content (television program) while blending (switching between) the video data and the multi-view video data, and the audio data and the multi-listening-point audio data, according to personal preference.

Moreover, according to the content transmission device 1, the tracking candidate area extraction means 7c extracts image areas (partial images) that are candidates for the tracking target (subject) contained in the captured multi-view video data. The projection means 7h generates predicted image coordinates by projecting the predicted three-dimensional coordinates obtained by predicting the motion of the tracking target. The candidate narrowing means 7d narrows down the candidates for the tracking target from among the image areas extracted by the tracking candidate area extraction means 7c, based on the predicted image coordinates generated by the projection means 7h, and the center-of-gravity detection means 7e detects the center of gravity of each remaining candidate. The back projection means 7f then back-projects the image coordinates of the candidates whose centers of gravity were detected to obtain estimated three-dimensional coordinates, and the motion prediction means 7g obtains predicted three-dimensional coordinates from these estimated three-dimensional coordinates. Consequently, if, for example, a user (viewer) of the content receiving device 3 on the receiving side asks to see the trajectory of the ball (tracking target) during a sports broadcast (baseball broadcast), the trajectory can easily be drawn from the estimated three-dimensional coordinates, and the viewer's request can be met.
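As a rough illustration of how a trajectory might be drawn from a sequence of estimated three-dimensional coordinates, the following Python sketch projects each point onto the image plane with a simple pinhole camera model. The camera model, the function names, and the assumption that the points are already expressed in the camera coordinate frame are illustrative only and are not taken from this specification.

```python
def project_point(point, focal_length, principal_point=(0.0, 0.0)):
    """Project a camera-frame 3D point (x, y, z) to image coordinates.

    Assumes an ideal pinhole camera (hypothetical; the specification does
    not state which camera model is used).
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    cx, cy = principal_point
    return (focal_length * x / z + cx, focal_length * y / z + cy)


def trajectory_polyline(estimated_points, focal_length, principal_point=(0.0, 0.0)):
    """Convert time-ordered estimated 3D positions into a 2D polyline."""
    return [project_point(p, focal_length, principal_point) for p in estimated_points]
```

A renderer could then connect consecutive points of the returned polyline to overlay the ball's path on the video frame.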

(Configuration of the content receiving device)
The content receiving device 3 receives, as a broadcast wave, the main broadcast video data and main broadcast audio data output from the content transmission device 1 on the transmitting side, receives the editing material video data (multi-view video data) and editing material audio data (multi-listening-point audio data) transmitted via the network, and processes these data for display and output. It comprises broadcast wave receiving means 19, receiving means 21, control means 23, presentation information generation means 25, synthesized video/audio data generation means 27, storage means 29, operation signal receiving means 31, and data request means 33. An operation device 2 operated by the user and a display output device 4 that outputs the control result are provided with the content receiving device 3.

The broadcast wave receiving means 19 receives and demodulates the broadcast wave (OFDM wave) output from the content transmission device 1 on the transmitting side. That is, it demodulates the OFDM wave and outputs the main broadcast video data and main broadcast audio data superimposed on it to the control means 23.

The receiving means 21 receives, via the network, the multi-view video data, multi-listening-point audio data, and presentation information transmitted from the content transmission device 1 on the transmitting side. The received multi-view video data, multi-listening-point audio data, and presentation information are output to the control means 23 for processing and are also output to the storage means 29 for storage.

The control means 23 governs the control of the content receiving device 3. Based on the control signal output from the operation signal receiving means 31, it controls the layout used when the video data and the editing material video data are displayed on the display output device 4, the assignment of output channels for the audio data and the editing material audio data, the output timing, and so on, and causes the result to be displayed and output. Its details are shown in FIG. 3. As shown in FIG. 3, the control means 23 comprises broadcast wave separation means 23a, synthesized data separation means 23b, presentation information interpretation means 23c, screen composition means 23d, audio data output assignment means 23e, and output means 23f.

The broadcast wave separation means 23a separates the main broadcast video data and the main broadcast audio data contained in the broadcast wave (OFDM wave). The separated main broadcast video data is output to the screen composition means 23d, and the separated main broadcast audio data is output to the audio data output assignment means 23e.

The synthesized data separation means 23b separates the synthesized video data and the synthesized audio data generated by the synthesized video/audio data generation means 27. The separated synthesized video data is output to the screen composition means 23d, and the separated synthesized audio data is output to the audio data output assignment means 23e.

The presentation information interpretation means 23c interprets either the presentation information received by the receiving means 21 or the presentation information (presentation content information, operation content information, layout information) generated by the presentation information generation means 25 from the presentation information generation template. It outputs the results of the interpretation, namely display content data, operation content data, and layout data in a form the content receiving device 3 can process, to the screen composition means 23d, and outputs the operation content data to the audio data output assignment means 23e.

The screen composition means 23d composes the screen on which the video data separated by the broadcast wave separation means 23a, the synthesized video data separated by the synthesized data separation means 23b, and the editing material video data received by the receiving means 21 are output, based on the display content data, the operation content data, and the layout data.

For example, when the main broadcast video data and main broadcast audio data are a baseball broadcast, the display screen of the display output device 4 is divided into a main screen and a sub screen. The main screen shows the main broadcast video data (for example, a shot of the catcher and batter taken from behind the pitcher), and the sub screen shows editing material video data (for example, a shot of the area around first base). A selection icon for switching (selecting) the editing material video data shown on the sub screen is also displayed on part of the display screen. When the user (viewer) of the content receiving device 3 inputs an operation signal with the operation device 2 and clicks the selection icon, the editing material video data shown on the sub screen switches to other editing material video data (for example, a shot of the area around third base).

Likewise, when the main broadcast video data and main broadcast audio data are a baseball broadcast and the synthesized video data depicts the trajectory of the baseball, a selection icon for "ball trajectory display" is displayed on part of the display screen. When the user (viewer) of the content receiving device 3 inputs an operation signal with the operation device 2 and clicks the "ball trajectory display" selection icon, the synthesized video data depicting the trajectory of the baseball is displayed on the sub screen.
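The main/sub screen behavior in the baseball example can be pictured with a small state model. The following Python sketch is only an illustration under assumed names: the class `ScreenComposer`, the icon identifiers, and the source labels are hypothetical and do not appear in this specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScreenComposer:
    """Toy model of a main screen plus a switchable sub screen."""
    main_source: str                          # main broadcast video
    sub_sources: List[str]                    # editing material videos
    synthesized_source: Optional[str] = None  # e.g. ball-trajectory video
    sub_index: int = 0
    showing_synthesized: bool = False

    def layout(self) -> dict:
        # The sub screen shows either the synthesized video or the
        # currently selected editing material video.
        if self.showing_synthesized and self.synthesized_source is not None:
            sub = self.synthesized_source
        else:
            sub = self.sub_sources[self.sub_index]
        return {"main": self.main_source, "sub": sub}

    def click(self, icon: str) -> dict:
        if icon == "switch_sub":
            # Advance to the next editing material video, wrapping around.
            self.showing_synthesized = False
            self.sub_index = (self.sub_index + 1) % len(self.sub_sources)
        elif icon == "ball_trajectory" and self.synthesized_source is not None:
            self.showing_synthesized = True
        return self.layout()
```

Here each click on the hypothetical `switch_sub` icon cycles through the available editing material videos, while `ball_trajectory` puts the synthesized video on the sub screen when one is available.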

The audio data output assignment means 23e assigns outputs for the audio data separated by the broadcast wave separation means 23a, the synthesized audio data separated by the synthesized data separation means 23b, and the editing material audio data received by the receiving means 21, based on the operation content data. That is, it sets which channel of the display output device 4 outputs each of the audio data, the synthesized audio data, and the editing material audio data.
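One way to picture this channel assignment is as a lookup driven by the operation content data. The sketch below is a minimal Python illustration; the `audio_routing` key, the stream names, and the channel names are assumptions made for the example and are not defined in this specification.

```python
def assign_audio_outputs(operation_content, channels):
    """Return a {channel: stream} mapping from operation content data.

    operation_content is assumed (hypothetically) to carry an
    "audio_routing" dict of the form {stream_name: channel_name}.
    """
    routing = operation_content.get("audio_routing", {})
    assignment = {}
    for stream, channel in routing.items():
        if channel not in channels:
            raise ValueError(f"unknown output channel: {channel}")
        if channel in assignment:
            raise ValueError(f"channel assigned twice: {channel}")
        assignment[channel] = stream
    return assignment
```

Rejecting unknown or doubly assigned channels is a design choice of this sketch; a real receiver might instead mix several streams into one channel.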

The output means 23f serves as an interface that causes the display output device 4 to display and output the screen composed by the screen composition means 23d (video data, synthesized video data, editing material video data) and the sound assigned by the audio data output assignment means 23e (audio data, synthesized audio data, editing material audio data).

Returning to FIG. 1, the description of the components of the content receiving device 3 continues.
The presentation information generation means 25 generates presentation information based on the presentation information generation template that accompanies (is attached to) the editing material video data or editing material audio data. That is, in the manner of a commercially available VTR recording and playback device (for example, the SONY (registered trademark) "Cocoon"), it refers to pre-registered user information (viewer information) such as the user's age, hobbies, and preferences, and generates presentation information from the template according to the user's (viewer's) tastes.
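To make the idea of template-based generation more concrete, the following Python sketch fills a template using registered viewer information. The template fields (`layout`, `main_view`, `sub_view_options`, `default_sub_view`), the `interests` key, and the tag-matching rule are all hypothetical; the specification does not define the template format.

```python
def generate_presentation_info(template, user_info):
    """Build presentation information from a template and viewer info.

    Picks the first sub-view option whose tags overlap the viewer's
    registered interests, falling back to the template's default.
    """
    interests = set(user_info.get("interests", []))
    sub_view = template["default_sub_view"]
    for option in template["sub_view_options"]:
        if interests & set(option["tags"]):
            sub_view = option["view"]
            break
    return {
        "layout": template["layout"],
        "main_view": template["main_view"],
        "sub_view": sub_view,
    }
```

A viewer registered with an interest in pitching would, under this rule, get whichever sub view the template tags with pitching, while a viewer with no matching interests gets the default.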

The synthesized video/audio data generation means 27 draws the trajectory of the tracking target based on the editing material video data received by the receiving means 21 and the estimated three-dimensional coordinates output from the editing material processing means 7 of the content transmission device 1 on the transmitting side, and generates synthesized video data in which the trajectory is superimposed on the editing material video data. It also generates synthesized audio data by applying various acoustic effects to the editing material audio data received by the receiving means 21. The synthesized video/audio data generation means 27 can also read out the main broadcast video data, main broadcast audio data, editing material video data, and editing material audio data stored in the storage means 29 and generate synthesized video data and synthesized audio data from them.

The storage means 29 consists of a large-capacity hard disk or the like and stores the main broadcast video data and main broadcast audio data received and demodulated by the broadcast wave receiving means 19, as well as the editing material video data and editing material audio data received by the receiving means 21. The stored main broadcast video data, main broadcast audio data, editing material video data, and editing material audio data are output to the control means 23 or to the synthesized video/audio data generation means 27 based on the control signal output from the operation signal receiving means 31.

The operation signal receiving means 31 converts the operation signal transmitted from the operation device 2 into a control signal for controlling each means and outputs it. The control signal is output to the control means 23, the storage means 29, and the data request means 33.

Based on the control signal output from the operation signal receiving means 31, the data request means 33 requests data such as editing material video data, editing material audio data, and presentation information from the content transmission device 1 on the transmitting side.

According to the content receiving device 3, the broadcast wave receiving means 19 receives the main broadcast video data and main broadcast audio data output (transmitted) as a broadcast wave from the content transmission device 1 on the transmitting side, and the receiving means 21 receives, via the network, the presentation information and at least one of the editing material video data (multi-view video data) and the editing material audio data (multi-listening-point audio data). The operation signal receiving means 31 receives the operation signal output from the operation device 2 operated by the user (viewer), and the control means 23 controls, based on the operation signal and the presentation information, the processing of the main broadcast video data and main broadcast audio data and of at least one of the editing material video data and the editing material audio data, and outputs the result to the display output device 4. As a result, at least one of the editing material video data and the editing material audio data, which are program material that would otherwise have been discarded, can be put to effective use. In addition, the many presentation methods defined by the presentation information set by the operator of the content transmission device 1 on the transmitting side make it possible to satisfy each viewer's varied preferences and desires regarding the content (television program).

Also according to the content receiving device 3, the broadcast wave receiving means 19 receives the main broadcast video data and main broadcast audio data output (transmitted) as a broadcast wave (OFDM wave) from the content transmission device 1 on the transmitting side, the operation signal receiving means 31 receives the operation signal output from the operation device 2 operated by the user (viewer), and the data request means 33 requests, based on the operation signal, the editing material video data (multi-view video data), editing material audio data (multi-listening-point audio data), and presentation information from the transmitting side or from another service providing site via the network. The receiving means 21 then receives, via the network, the presentation information and at least one of the requested editing material video data and editing material audio data, and the control means 23 controls, based on the control signal and the presentation information, the processing of the main broadcast video data and main broadcast audio data and of at least one of the editing material video data and the editing material audio data, and outputs the result. As a result, at least one of the editing material video data and the editing material audio data, which are program material that would otherwise have been discarded, can be put to effective use in accordance with the user's (viewer's) request. In addition, the many presentation methods defined by the presentation information set by the operator of the content transmission device 1 on the transmitting side make it possible to satisfy each viewer's varied preferences and desires regarding the content (television program).

Further, according to the content receiving device 3, the broadcast wave receiving means 19 receives the main broadcast video data and main broadcast audio data output (transmitted) as a broadcast wave (OFDM wave) from the content transmission device 1 on the transmitting side, and the receiving means 21 receives, via the network, at least one of the editing material video data (multi-view video data) and the editing material audio data (multi-listening-point audio data). The presentation information generation means 25 then generates presentation information based on the presentation information generation template and the user information (viewer information). After that, the operation signal receiving means 31 receives the operation signal output from the operation device 2 operated by the user (viewer), and the control means 23 controls, based on the control signal and the generated presentation information, the processing of the main broadcast video data and main broadcast audio data and of at least one of the editing material video data and the editing material audio data, and outputs the result. As a result, at least one of the editing material video data and the editing material audio data, which are program material that would otherwise have been discarded, can be put to effective use. In addition, generating and using presentation information based on the presentation information generation template and the user information (viewer information) makes it possible to satisfy each viewer's varied preferences and desires regarding the content (television program).

Still further, because the content receiving device 3 includes the storage means 29 for storing the editing material video data (multi-view video data) and the editing material audio data (multi-listening-point audio data), it can operate after these data have first been stored. Compared with having the receiving means 21 acquire the editing material video data and editing material audio data piece by piece over the network, the user (viewer) can watch the content (television program) without perceiving delays and therefore with less stress.

In addition, according to the content receiving device 3, the synthesized video/audio data generation means 27 generates synthesized video data, such as the trajectory of the tracking target, based on the received estimated three-dimensional coordinates, and generates synthesized audio data to which various acoustic effects have been added. This makes it possible to satisfy viewers' varied preferences and desires regarding content (sports broadcasts in particular), such as wanting to see the trajectory of a ball thrown by the pitcher.

(Operation of the content transmission device)
Next, the operation of the content transmission device 1 is described with reference to the flowchart shown in FIG. 4 (see also FIG. 1 as appropriate).
First, the main broadcast video data and main broadcast audio data are input to the broadcast metadata adding means 5 of the content transmission device 1, which attaches metadata (broadcast wave video metadata and broadcast wave audio metadata) to each (S1). The broadcast wave output means 11 of the content transmission device 1 then outputs the main broadcast video data and main broadcast audio data, with the metadata attached, as a broadcast wave (OFDM wave) (S2).

Meanwhile, the editing material video data and editing material audio data are input to the editing material processing means 7 of the content transmission device 1, which either processes them (extraction) or passes them through unchanged, and outputs them to the communication metadata adding means 9 (S3). The communication metadata adding means 9 then attaches metadata (communication video metadata and communication audio metadata) to the editing material video data and the editing material audio data, respectively (S4). Finally, the transmission means 13 of the content transmission device 1 adds (attaches) the presentation information stored in the presentation information storage means 17 and transmits the data (S5).

(Processing in the editing material processing means)
Next, the processing in the editing material processing means 7 of the content transmission device 1 (the details of S3 in FIG. 4) is described with reference to the flowchart shown in FIG. 5 (see also FIG. 2 as appropriate).
The editing material processing means 7 first receives the editing material video data and an operation signal as input (S11), and the video data processing control means 7a determines, based on the operation signal, whether to process the editing material video data (S12). The video data processing control means 7a also makes this determination based on a data request signal sent from the content receiving device 3 on the receiving side.

If it is determined that the data is not to be processed (S12, No), the editing material video data is output as is. If it is determined that the data is to be processed (S12, Yes), the tracking candidate area extraction means 7c of the editing material processing means 7 extracts image areas (partial images) that are candidates containing the tracking target (S13). The image areas (partial images) are extracted as binary images. The candidate narrowing means 7d then narrows down the candidates containing the tracking target (S14). Predicted image coordinates are output from the candidate narrowing means 7d.

Next, the center-of-gravity detection means 7e detects the center of gravity of the tracking target (S15) and outputs a group of image coordinates. The back projection means 7f maps the group of image coordinates to three-dimensional coordinates (S16) and outputs estimated three-dimensional coordinates. The motion prediction means 7g predicts the motion of the tracking target based on the estimated three-dimensional coordinates (S17) and outputs predicted three-dimensional coordinates. The projection means 7h maps the predicted three-dimensional coordinates to image coordinates, based on the position and angle of the camera that shot the editing material video data (S18), and outputs predicted image coordinates. Finally, the processed video data output means 7i outputs the result as processed video data (S19).
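Three of the steps in this tracking loop (narrowing the candidates, detecting the center of gravity, and predicting the motion) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method of the specification: the candidates are represented here by their centroids, narrowing uses a simple distance threshold around the predicted image coordinates, and the motion model is constant velocity. All function names are hypothetical.

```python
import math


def centroid(mask):
    """Centroid (x, y) of the non-zero pixels of a binary image.

    mask is a list of rows; returns None if no pixel is set.
    """
    sum_x = sum_y = count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)


def narrow_candidates(candidates, predicted_xy, max_distance):
    """Keep candidates within max_distance of the predicted image position."""
    return [c for c in candidates if math.dist(c, predicted_xy) <= max_distance]


def predict_next(previous_xyz, current_xyz):
    """Constant-velocity prediction: next = current + (current - previous)."""
    return tuple(2 * c - p for p, c in zip(previous_xyz, current_xyz))
```

A fuller model for a thrown ball would probably account for gravity and drag; constant velocity is used here only to keep the example short.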

The editing material processing means 7 next receives the editing material audio data as input (S20), and the audio data extraction control means 7b determines, based on the operation signal, whether to process the editing material audio data (extract from it) (S21). If it is determined that the data is not to be processed (S21, No), the editing material audio data is output as is. If it is determined that the data is to be processed (S21, Yes), a data request signal transmitted from the receiving side is first received (S22), and the audio data switching means 7j of the editing material processing means 7 selects (switches) the editing material audio data based on this data request signal (S23).

The audio data processing means 7k then processes (filters) the editing material audio data selected by the audio data switching means 7j and outputs the required frequency range to the audio data output means 7l (S24). The audio data output means 7l outputs this as extracted audio data. The processing from S11 to S19 and the processing from S20 to S25 are normally performed in parallel; for convenience of explanation, this processing flow describes the processing of the editing material video data first.
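The selection and filtering of audio described above might look like the following Python sketch. The specification does not say which filter is used; a first-order low-pass filter is swapped in here purely as an example of extracting a frequency range, and the `listening_point` key and function names are hypothetical.

```python
def select_stream(streams, data_request):
    """Pick the audio stream named by a (hypothetical) data request."""
    return streams[data_request["listening_point"]]


def low_pass(samples, alpha):
    """First-order low-pass filter: y[n] = y[n-1] + alpha * (x[n] - y[n-1]).

    alpha in (0, 1]; smaller values attenuate high frequencies more.
    The first output sample is initialized to the first input sample.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    output = []
    previous = None
    for sample in samples:
        previous = sample if previous is None else previous + alpha * (sample - previous)
        output.append(previous)
    return output
```

With `alpha = 0.5`, a rapidly alternating input is damped while a constant input passes through unchanged, which is the low-pass behavior the sketch is meant to show.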

(Operation of the content receiving device)
Next, the operation of the content receiving device 3 is described with reference to the flowchart shown in FIG. 6 (see also FIG. 1 as appropriate).
First, the broadcast wave receiving means 19 of the content receiving device 3 receives the broadcast wave on which the main broadcast video data and main broadcast audio data are superimposed (S31). The receiving means 21 then receives the editing material video data, which includes the presentation information, and the editing material audio data (S32).

Next, the control means 23 determines, based on the control signal output from the operation signal receiving means 31, whether or not to newly generate presentation information (S33). If it is determined that presentation information is to be generated (S33, Yes), the presentation information generating means 25 generates the presentation information based on the presentation information generation template and the user information (viewer information) (S34). If it is not determined that presentation information is to be generated (S33, No), the control means 23 determines whether or not to generate synthesized video data and synthesized audio data (S35).

If it is determined that synthesized video data and synthesized audio data are to be generated (S35, Yes), the synthesized video/audio data generating means 27 generates the synthesized video data based on the editing material video data and the estimated three-dimensional coordinates, and generates the synthesized audio data by adding an acoustic effect to the editing material audio data (S36).

If it is not determined that synthesized video data and synthesized audio data are to be generated (S35, No), the data request means 33 determines, based on the operation signal input to the operation signal receiving means 31, whether or not to request additional data (other editing material video data or editing material audio data) (S37). If it is determined that additional data is to be requested (S37, Yes), a data request signal is output (S38). If it is not determined that additional data is to be requested (S37, No), the content whose screen has been composed and whose output has been assigned by the control means 23 is displayed and output on the display output device 4 (S39).
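The branch structure of S33 to S39 described above can be summarized as a small decision function. This is only a sketch: the function name and the three boolean parameters are hypothetical stand-ins for the judgments made by the control means 23 and the data request means 33, and the returned strings are simply the step labels from the flowchart.

```python
def next_receiver_step(generate_presentation: bool,
                       generate_synthesis: bool,
                       request_more_data: bool) -> str:
    """Return the step label that the receiver flow of FIG. 6 reaches next."""
    if generate_presentation:   # S33, Yes
        return "S34"            # generate presentation information
    if generate_synthesis:      # S35, Yes
        return "S36"            # generate synthesized video and audio data
    if request_more_data:       # S37, Yes
        return "S38"            # output a data request signal
    return "S39"                # display and output the composed content
```

For example, `next_receiver_step(False, False, True)` follows the S33 No, S35 No, S37 Yes path and returns `"S38"`.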

(Processing in the control means)
Next, the processing in the control means 23 of the content receiving device 3 (the details of S39 in FIG. 6) will be described with reference to the flowchart shown in FIG. 7 (see FIG. 3 as appropriate).
First, the control means 23 of the content receiving device 3 determines whether to have presentation information generated or to have synthesized video data and synthesized audio data generated. Subsequently, the broadcast wave separating means 23a of the control means 23 separates the broadcast wave (S41). If synthesized video data and synthesized audio data have been generated, the synthesized video data separating means 23b separates the synthesized video data from the synthesized audio data (S42).

Then, the presentation information is input to the presentation information interpreting means 23c of the control means 23, which interprets the presentation information (S43), outputs the display content data, the operation content data, and the layout data to the screen composing means 23d, and outputs the operation content data to the audio data output assigning means 23e.

The screen composing means 23d of the control means 23 then composes the display screen (S44). Subsequently, the audio data output assigning means 23e assigns the main broadcast video data, the synthesized audio data, and the editing material audio data to be output (S45). The output means 23f of the control means 23 then outputs the result as content to the display output device 4 (S46).
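As a rough illustration of S43 to S46, the sketch below interprets presentation information, composes a screen, and assigns an audio output. The dictionary layout of the presentation information (the `"display"`, `"operation"`, and `"layout"` keys), the `Interpreted` class, and the fallback rule for audio are assumptions for illustration only; the embodiment does not define a concrete data format here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height), hypothetical layout unit


@dataclass
class Interpreted:
    display_content: List[str]         # which streams to show
    operation_content: Dict[str, str]  # e.g. which audio stream to play
    layout: Dict[str, Rect]            # stream name -> rectangle on screen


def interpret(presentation_info: dict) -> Interpreted:
    """S43 (sketch): split presentation information into its three parts."""
    return Interpreted(
        display_content=list(presentation_info["display"]),
        operation_content=dict(presentation_info["operation"]),
        layout={k: tuple(v) for k, v in presentation_info["layout"].items()},
    )


def compose_screen(info: Interpreted) -> List[dict]:
    """S44 (sketch): place each displayed stream at its layout rectangle."""
    return [{"stream": s, "rect": info.layout[s]} for s in info.display_content]


def assign_audio(info: Interpreted, available: List[str]) -> str:
    """S45 (sketch): use the requested audio stream, else fall back to the first one."""
    wanted = info.operation_content.get("audio", available[0])
    return wanted if wanted in available else available[0]


def control_output(presentation_info: dict, available_audio: List[str]) -> dict:
    """S43 to S46 (sketch): build the content handed to the display output device."""
    info = interpret(presentation_info)
    return {"screen": compose_screen(info),
            "audio": assign_audio(info, available_audio)}
```

Keeping interpretation separate from screen composition and audio assignment mirrors the split between means 23c, 23d, and 23e in the description.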

(Configuration example of the content transmission/reception system)
Next, a configuration example of the content transmission/reception system A will be described with reference to the schematic diagram shown in FIG. 8.
In this configuration example of the content transmission/reception system A, two display output devices 4 (4A, 4B) are connected to the content receiving device 3 on the receiving side (not shown in FIG. 8), and the user (viewer) is watching a live baseball broadcast as the television program.

The content transmission device 1 on the transmitting side (not shown in FIG. 8) is installed in an outside broadcast van and transmits to the broadcasting station the video data (editing material video data) captured by a plurality of cameras installed in the baseball stadium and the audio data (editing material audio data) picked up by a plurality of microphones. The broadcasting station outputs the broadcast wave to general viewers via a broadcasting satellite (communications satellite).

The content transmission device 1 installed in the outside broadcast van adds moving image metadata (communication video metadata) to the editing material video data and transmits it to the content receiving device 3 via the Internet (network). The user (viewer) watching the display output device 4 (4A) can then display the main broadcast video data (video showing the catcher and the batter from behind the pitcher) on the main screen of the display while displaying the editing material video data (video of the area around first base and the runner on first) on the sub screen.

The user (viewer) watching the display output device 4 (4B) can view the trajectory of the ball (the trajectory of the tracked object), rendered from the synthesized video data, superimposed on the editing material video data captured from the catcher's side.
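Superimposing a ball trajectory on a camera view requires mapping three-dimensional positions to image coordinates. The sketch below uses a simple pinhole camera model for this, which is an assumption: the names `project` and `trajectory_overlay`, the parameters `focal_px`, `cx`, and `cy`, and the convention that points are already expressed in the camera's coordinate frame are all hypothetical and are not taken from the embodiment.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]  # (x, y, z) in the camera frame, z pointing forward
Pixel = Tuple[float, float]          # (u, v) image coordinates in pixels


def project(point: Point3, focal_px: float, cx: float, cy: float) -> Pixel:
    """Pinhole projection: u = cx + f * x / z, v = cy + f * y / z."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (cx + focal_px * x / z, cy + focal_px * y / z)


def trajectory_overlay(points: List[Point3], focal_px: float,
                       cx: float, cy: float) -> List[Pixel]:
    """Map a sequence of estimated 3D positions to the pixels where the trajectory is drawn."""
    return [project(p, focal_px, cx, cy) for p in points]
```

A renderer would then draw a polyline through the returned pixels on top of the editing material video frame; lens distortion and the camera's pose are ignored in this sketch.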

An embodiment of the present invention has been described above, but the present invention is not limited to this embodiment. For example, the processing of each component of the content transmission device 1 and of the content receiving device 3 can be regarded as a content transmission program and a content reception program written in a general-purpose computer language. These programs provide the same effects as the content transmission device 1 and the content receiving device 3.

FIG. 1 is a block diagram of the content transmission/reception system according to the embodiment.
FIG. 2 is a block diagram of the editing material processing means of the content transmission device shown in FIG. 1.
FIG. 3 is a block diagram of the control means of the content receiving device shown in FIG. 1.
FIG. 4 is a flowchart explaining the operation of the content transmission device.
FIG. 5 is a flowchart explaining the processing in the editing material processing means.
FIG. 6 is a flowchart explaining the operation of the content receiving device.
FIG. 7 is a flowchart explaining the processing in the control means.
FIG. 8 is a schematic diagram explaining a configuration example of the content transmission/reception system.

Explanation of symbols

1 Content transmission device
3 Content receiving device
5 Broadcast metadata giving means (metadata giving means)
7 Editing material processing means
7c Tracking candidate extraction means (subject candidate extraction means)
7d Candidate narrowing means
7e Centroid detecting means
7f Back projection means
7g Motion prediction means
7h Projecting means
9 Communication metadata giving means (metadata giving means)
11 Broadcast wave output means
13 Transmission means
15 Metadata information storage means
17 Presentation information storage means
19 Broadcast wave receiving means
21 Receiving means
23 Control means
25 Presentation information generating means
27 Synthesized video/audio data generating means (synthesized video data generating means, synthesized audio data generating means)
29 Storage means
31 Operation signal receiving means
33 Data request means

Claims

A content transmission device that transmits, via a network and in synchronization with video data and audio data output as a broadcast wave, at least one of multi-view video data captured or generated from at least one other viewpoint different from the viewpoint from which the video data was captured, and multi-point audio data recorded or generated at at least one other listening point different from the listening point at which the audio data was recorded, the content transmission device comprising:
metadata giving means for adding, to the video data and the multi-view video data, metadata describing the relationship between the video data and the multi-view video data, and for adding, to the audio data and the multi-point audio data, metadata describing the relationship between the audio data and the multi-point audio data;
presentation information storage means for storing presentation information that is set in advance in association with the metadata and that specifies a presentation method for presenting at least one of the multi-view video data and the multi-point audio data on the receiving side; and
transmission means for transmitting, via the network, the presentation information stored in the presentation information storage means and at least one of the multi-view video data and the multi-point audio data.

The content transmission device according to claim 1, wherein the presentation information includes at least presentation content information indicating content to be presented on a device provided on the receiving side, operation content information indicating how the device is to be operated, layout information specifying the layout in which the video data and the multi-view video data are presented on the device, and a presentation information generation template set in advance in association with the metadata for generating the presentation information.

The content transmission device according to claim 1, wherein the multi-view video data and the multi-point audio data are sports-related multi-view video data and sports-related multi-point audio data, the content transmission device further comprising:
subject candidate extraction means for extracting partial images that are candidates for a subject included in the captured video data;
projecting means for generating predicted image coordinates by projecting predicted three-dimensional coordinates obtained by predicting the motion of the subject;
candidate narrowing means for narrowing down the subject candidates from the partial images extracted by the subject candidate extraction means, based on the predicted image coordinates generated by the projecting means;
centroid detecting means for detecting the centroid of the subject candidate narrowed down by the candidate narrowing means;
back projection means for back-projecting the image coordinates of the subject candidate whose centroid has been detected by the centroid detecting means to obtain estimated three-dimensional coordinates; and
motion prediction means for obtaining the predicted three-dimensional coordinates from the estimated three-dimensional coordinates obtained by the back projection means,
wherein the estimated three-dimensional coordinates are transmitted by the transmission means.

A content receiving device that receives video data and audio data output as a broadcast wave and receives, via a network, at least one of multi-view video data captured or generated from at least one other viewpoint different from the viewpoint from which the video data was captured, and multi-point audio data recorded or generated at at least one other listening point different from the listening point at which the audio data was recorded, the multi-view video data and the multi-point audio data being synchronized with the video data and the audio data, the content receiving device comprising:
broadcast wave receiving means for receiving the video data and the audio data;
receiving means for receiving, via the network, at least one of the multi-view video data and the multi-point audio data, together with presentation information that specifies a presentation method for presenting at least one of the multi-view video data and the multi-point audio data and that is set in advance in association with metadata describing at least one of the relationship between the video data and the multi-view video data and the relationship between the audio data and the multi-point audio data;
operation signal receiving means for receiving an operation signal output from an operation device for operating the content receiving device as a result of a user operating that operation device; and
control means for controlling, based on the operation signal received by the operation signal receiving means and the presentation information, whether or not to process the video data and the audio data and at least one of the multi-view video data and the multi-point audio data, and for outputting the result.

A content receiving device that receives video data and audio data output as a broadcast wave and receives, via a network, at least one of multi-view video data captured or generated from at least one other viewpoint different from the viewpoint from which the video data was captured, and multi-point audio data recorded or generated at at least one other listening point different from the listening point at which the audio data was recorded, the multi-view video data and the multi-point audio data being synchronized with the video data and the audio data, the content receiving device comprising:
broadcast wave receiving means for receiving the video data and the audio data;
operation signal receiving means for receiving an operation signal output from an operation device for operating the content receiving device as a result of a user operating that operation device;
data request means for requesting, via the network and based on the operation signal received by the operation signal receiving means, at least one of the multi-view video data and the multi-point audio data, together with presentation information that specifies a presentation method for presenting at least one of the multi-view video data and the multi-point audio data and that is set in advance in association with metadata describing the relationship between the video data and the multi-view video data and the relationship between the audio data and the multi-point audio data;
receiving means for receiving the presentation information and the at least one of the multi-view video data and the multi-point audio data requested by the data request means; and
control means for controlling, based on the operation signal and the presentation information, whether or not to process the video data and the audio data and at least one of the multi-view video data and the multi-point audio data, and for outputting the result.

A content receiving device that receives video data and audio data output as a broadcast wave and receives, via a network, at least one of multi-view video data captured or generated from at least one other viewpoint different from the viewpoint from which the video data was captured, and multi-point audio data recorded or generated at at least one other listening point different from the listening point at which the audio data was recorded, the multi-view video data and the multi-point audio data being synchronized with the video data and the audio data, the content receiving device comprising:
broadcast wave receiving means for receiving the video data and the audio data;
receiving means for receiving at least one of the multi-view video data and the multi-point audio data via the network;
presentation information generating means for generating presentation information, which specifies a presentation method for presenting at least one of the multi-view video data and the multi-point audio data, based on a presentation information generation template set in advance in association with metadata describing at least one of the relationship between the video data and the multi-view video data and the relationship between the audio data and the multi-point audio data;
operation signal receiving means for receiving an operation signal output from an operation device for operating the content receiving device as a result of a user operating that operation device; and
control means for controlling, based on the operation signal received by the operation signal receiving means and the presentation information, whether or not to process the video data and the audio data and at least one of the multi-view video data and the multi-point audio data, and for outputting the result.

The content receiving device according to any one of claims 4 to 6, further comprising storage means for storing the multi-view video data and the multi-point audio data.

The content receiving device according to any one of claims 4 to 7, wherein the multi-view video data and the multi-point audio data are sports-related multi-view video data and sports-related multi-point audio data, the content receiving device further comprising:
synthesized video data generating means for generating, based on the estimated three-dimensional coordinates transmitted from the content transmission device according to claim 3, synthesized video data related to the sports-related multi-view video data; and
synthesized audio data generating means for generating, based on the estimated three-dimensional coordinates, synthesized audio data related to the sports-related multi-point audio data.