JP6860178B1 - Video processing equipment and video processing method

Info

Publication number: JP6860178B1
Application number: JP2019238399A
Authority: JP
Inventors: 慎平藤田
Original assignee: NEC Platforms Ltd
Current assignee: NEC Platforms Ltd
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2021-04-14
Anticipated expiration: 2039-12-27
Also published as: JP2021108411A

Abstract

PROBLEM TO BE SOLVED: To automatically generate video focused on the person of interest. SOLUTION: A video processing apparatus generates, in video data of persons captured by a camera, a person determination block corresponding to each person; determines which person determination block lies in the line-of-sight direction of each person detected from the video data; generates a line-of-sight histogram indicating the number of lines of sight for each person determination block; generates, based on audio data of the persons, a sound distribution map indicating the person determination blocks of the sound sources; determines the effectiveness of the line-of-sight histogram based on the number of person determination blocks that have lines of sight; determines the effectiveness of the sound distribution map based on the number of sound-source person determination blocks; and generates output video from the video data based on the line-of-sight histogram or the sound distribution map according to the respective effectiveness. SELECTED DRAWING: FIG. 1

Description

The present invention relates to a video processing apparatus and a video processing method.

At meetings and events, video captured by a camera is sometimes distributed to other sites, but a fixed camera may show the scene from only one direction. When an operator handles the camera, a certain level of skill is required, and depending on that skill the video may fail to focus on the person who actually deserves attention.
To address these problems, there are techniques that process video based on the line-of-sight direction, voice, and the like of the persons captured by a camera (see, for example, Patent Document 1). As a video recording means using a camera and a microphone, there is also the tracking camera, which automatically points the camera toward the speaker.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2004-248125

However, the techniques described above assume that no sound other than the speaker's voice is picked up. In a meeting held in an open space, for example, voices of people who are not meeting members are picked up, and the focus may land on someone other than the central person. In addition, at events such as an athletic meet, when the person attracting attention is being cheered on, it is difficult to identify that person.
That is, with the techniques described above, it is difficult to show the central person who truly deserves attention in the video without camera-operating or editing skills.

Accordingly, an object of the present invention is to provide a video processing apparatus and a video processing method that solve the above problems.

According to a first aspect of the present invention, a video processing apparatus includes: a person determination block generation unit that generates, in video data of persons captured by a camera, a person determination block corresponding to each of the persons; a line-of-sight histogram generation unit that determines the person determination block lying in the line-of-sight direction of each person detected from the video data and generates a line-of-sight histogram indicating the number of lines of sight for each person determination block; a sound distribution map generation unit that generates, based on audio data of the persons, a sound distribution map indicating the person determination block of each sound source; and an output video generation unit that determines the effectiveness of the line-of-sight histogram based on the number of person determination blocks that have lines of sight, determines the effectiveness of the sound distribution map based on the number of sound-source person determination blocks, and generates output video from the video data based on the line-of-sight histogram or the sound distribution map according to the respective effectiveness.

According to a second aspect of the present invention, in a video processing method: a person determination block generation unit generates, in video data of persons captured by a camera, a person determination block corresponding to each of the persons; a line-of-sight histogram generation unit determines the person determination block lying in the line-of-sight direction of each person detected from the video data and generates a line-of-sight histogram indicating the number of lines of sight for each person determination block; a sound distribution map generation unit generates, based on audio data of the persons, a sound distribution map indicating the person determination block of each sound source; and an output video generation unit determines the effectiveness of the line-of-sight histogram based on the number of person determination blocks that have lines of sight, determines the effectiveness of the sound distribution map based on the number of sound-source person determination blocks, and generates output video from the video data based on the line-of-sight histogram or the sound distribution map according to the respective effectiveness.

According to the present invention, video focused on the person of interest can be generated automatically.

A schematic diagram showing the configuration of a video processing system according to an embodiment of the present invention.
An operation diagram for explaining the operation of the video processing apparatus according to the embodiment of the present invention.
A diagram for explaining the output video that the video processing apparatus according to the embodiment generates in the first case.
A diagram for explaining the output video that the video processing apparatus according to the embodiment generates in the second case.
A diagram for explaining the output video that the video processing apparatus according to the embodiment generates in the third case.
A diagram for explaining the output video that the video processing apparatus according to the embodiment generates in the fourth case.
A diagram for explaining the output video that the video processing apparatus according to the embodiment generates when the number of persons shown is limited.
A flowchart showing the procedure of the video processing executed by the video processing apparatus according to the embodiment.
A diagram showing the minimum configuration of the video processing apparatus of the present invention.

Hereinafter, a video processing apparatus, a video processing method, and a program according to an embodiment of the present invention will be described with reference to the drawings.

<First Embodiment>
First, a first embodiment will be described.
FIG. 1 is a schematic diagram showing the configuration of a video processing system according to the present embodiment.
The video processing system 100 includes a video processing apparatus 1, an omnidirectional camera 2, and a 3D microphone 3. The omnidirectional camera 2 and the 3D microphone 3 are installed together at an arbitrary common location. The omnidirectional camera 2 and the 3D microphone 3 are each connected to the video processing apparatus 1 by wired or wireless communication.
The omnidirectional camera 2 captures video in all directions, 360 degrees around its installed location. The 3D microphone 3 picks up sound from all directions, 360 degrees around its installed location.

The video processing apparatus 1 generates and outputs video focused on the person of interest in the captured video, based on the video captured by the omnidirectional camera 2 and the sound picked up by the 3D microphone 3. As illustrated, the video processing apparatus 1 includes a video data acquisition unit 11, an audio data acquisition unit 12, a person determination block generation unit 13, a line-of-sight histogram generation unit 14, a sound distribution map generation unit 15, an output video generation unit 16, and an output unit 17.

The video data acquisition unit 11 acquires video data representing the captured video from the omnidirectional camera 2.
The audio data acquisition unit 12 acquires audio data representing the sound corresponding to the captured video from the 3D microphone 3.

The person determination block generation unit 13 generates a panoramic image from the video data of the persons acquired from the omnidirectional camera 2, and generates, in the panoramic image, a person determination block corresponding to each person.

The line-of-sight histogram generation unit 14 determines the person determination block lying in the line-of-sight direction of each person detected from the video data, and generates a line-of-sight histogram indicating the number of valid lines of sight for each person determination block. The line-of-sight histogram shows the lines of sight gathered on each person determination block. The line-of-sight histogram generation unit 14 determines that a line of sight is valid when the person's line of sight falls in another person's person determination block. Conversely, it determines that a line of sight is invalid when the person's line of sight falls in that person's own person determination block or outside every person determination block.
The sound distribution map generation unit 15 generates a sound distribution map indicating the person determination block of each sound source, based on the audio data of the persons acquired from the 3D microphone 3.

The output video generation unit 16 determines the effectiveness of the line-of-sight histogram based on the number of person determination blocks that have valid lines of sight, determines the effectiveness of the sound distribution map based on the number of sound-source person determination blocks, and, according to the respective effectiveness, generates from the video data an output video focused on the central person based on the line-of-sight histogram or the sound distribution map. For example, the output video generation unit 16 determines that the effectiveness of the sound distribution map is low when the number of sound-source person determination blocks exceeds a predetermined threshold. It determines that the effectiveness of the line-of-sight histogram is low when the number of person determination blocks that have valid lines of sight exceeds a predetermined threshold, and likewise when the number of invalid lines of sight exceeds a predetermined threshold. When the effectiveness of both the line-of-sight histogram and the sound distribution map is high, the output video generation unit 16 compares the number of person determination blocks that have lines of sight with the number of sound-source person determination blocks, and preferentially shows the persons in whichever set of blocks is smaller. When the effectiveness of both is low, it shows the persons in the person determination blocks that have lines of sight and in the sound-source person determination blocks. When the number of candidate persons to be shown in the output video exceeds a maximum-number threshold, it excludes persons in ascending order of the number of lines of sight they receive. The output video generation unit 16 also combines the audio data acquired from the 3D microphone 3 with the output video and passes the result to the output unit 17.
The output unit 17 outputs the output video, with the audio data combined, to an external computer or display device.
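The selection rules of the output video generation unit 16 described above can be sketched as follows. This is a minimal illustration and not the patented implementation: the function name `select_people`, its parameters, the tie-breaking rules, and the choice to show only the smaller set when both sources are effective (the text says those persons are shown "preferentially") are all assumptions made for this example.

```python
from typing import Dict, Iterable, List


def select_people(
    gaze_counts: Dict[str, int],
    speakers: Iterable[str],
    gaze_effective: bool,
    sound_effective: bool,
    max_people: int,
) -> List[str]:
    """Return the persons to show, ordered by received lines of sight."""
    gazed = {p for p, n in gaze_counts.items() if n > 0}
    speaking = set(speakers)

    if gaze_effective and sound_effective:
        # Both effective: prefer whichever source names fewer persons.
        # Resolving a tie toward the gaze set is an assumption of this sketch.
        candidates = gazed if len(gazed) <= len(speaking) else speaking
    elif gaze_effective:
        candidates = gazed
    elif sound_effective:
        candidates = speaking
    else:
        # Neither effective: show everyone named by either source.
        candidates = gazed | speaking

    # Sort by descending gaze count (name as a deterministic tie-breaker),
    # then keep at most max_people, which drops the least-watched first.
    ranked = sorted(candidates, key=lambda p: (-gaze_counts.get(p, 0), p))
    return ranked[:max_people]
```

For instance, with gaze counts `{"A": 4, "B": 2}`, an effective histogram, and an ineffective sound distribution map, the sketch returns A and B in that order.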

Next, the operation of the video processing apparatus 1 will be described.
FIG. 2 is an operation diagram for explaining the operation of the video processing apparatus according to the present embodiment.
This figure illustrates a case where persons A to F are holding a meeting or the like while seated in a circle around the omnidirectional camera 2 and the 3D microphone 3. The omnidirectional camera 2 and the 3D microphone 3 are installed at the same location. In this example, person C is speaking; the lines of sight of persons A, B, D, and E are directed at person C; the line of sight of person C is directed at person B; and the line of sight of person F is directed at person E.

First, the person determination block generation unit 13 generates a panoramic image 201 by unwrapping the 360-degree omnidirectional video data captured by the omnidirectional camera 2. The panoramic image 201 spans the full 360 degrees, from -180 degrees to 180 degrees. Next, the person determination block generation unit 13 detects the faces of one or more persons in the panoramic image 201 and calculates the center position of each face. Because the panoramic image 201 covers 360 degrees, the unit uses the image width to determine the angle at which each face center lies. The person determination block generation unit 13 then divides the panoramic image 201 by the number of persons whose faces were detected and, based on the angle of each face center, generates person determination blocks 202, one for identifying each person. In the illustrated example, the panoramic image 201 is divided into block A for person A, block B for person B, block C for person C, block D for person D, block E for person E, and block F for person F. Because the blocks would become too large when only a few persons are present, the person determination block generation unit 13 sets a maximum-angle threshold for the person determination block; when the angle obtained by division exceeds this threshold, the block angle is set to the threshold.
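The angle calculation and block sizing described above can be sketched as follows. The text does not specify how each block is positioned relative to the face, so centering the block on the face-center angle is an assumption of this sketch, as are the function names and the handling of the ±180-degree seam (ignored here for brevity).

```python
from typing import Dict, Tuple


def face_angle(x_center: float, panorama_width: float) -> float:
    """Map a horizontal pixel position in the panorama to an angle in [-180, 180)."""
    return (x_center / panorama_width) * 360.0 - 180.0


def person_blocks(
    face_angles: Dict[str, float], max_block_angle: float
) -> Dict[str, Tuple[float, float]]:
    """Give each person a block of equal angular width.

    The width is 360 degrees divided by the number of detected persons,
    capped at max_block_angle. Centering on the face is an assumption.
    """
    width = min(360.0 / len(face_angles), max_block_angle)
    half = width / 2.0
    return {name: (angle - half, angle + half) for name, angle in face_angles.items()}
```

With two detected persons and a 90-degree cap, the division would give 180 degrees per block, so the cap applies and each block is 90 degrees wide.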

Next, the line-of-sight histogram generation unit 14 detects the line-of-sight direction of each person in the captured video using a remote gaze estimation technique. For example, it uses face recognition and facial feature point detection to precisely locate the feature points around the eyes that are needed for gaze detection, such as the inner and outer corners of the eyes and the pupils, and thereby detects the line-of-sight direction. The line-of-sight histogram generation unit 14 then determines the person determination block lying in each person's line-of-sight direction. It plots each line of sight on the person determination blocks (a plotted line of sight is hereinafter called a "line-of-sight plot") and generates a line-of-sight histogram 203 showing the line-of-sight plots and the number of valid line-of-sight plots for each person determination block. The line-of-sight histogram generation unit 14 determines that a line-of-sight plot is valid when the person's line of sight falls in another person's person determination block. Conversely, it determines that the line-of-sight plot is invalid when the person's line of sight falls in that person's own person determination block or outside every person determination block.
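The valid/invalid counting rule can be sketched as follows, assuming gaze estimation has already resolved each person's line of sight to a block name (or `None` when it falls outside every block). The function name and the input representation are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def line_of_sight_histogram(
    gaze_targets: Dict[str, Optional[str]]
) -> Tuple[Counter, int]:
    """Count valid line-of-sight plots per block and the number of invalid plots.

    gaze_targets maps each person to the block their gaze falls in,
    or None when the gaze lands outside every person determination block.
    """
    histogram: Counter = Counter()
    invalid = 0
    for person, target in gaze_targets.items():
        if target is None or target == person:
            invalid += 1  # outside all blocks, or in the person's own block
        else:
            histogram[target] += 1  # valid: lands in another person's block
    return histogram, invalid


# The FIG. 2 example: A, B, D, and E look at C; C looks at B; F looks at E.
fig2 = {"A": "C", "B": "C", "C": "B", "D": "C", "E": "C", "F": "E"}
hist, invalid = line_of_sight_histogram(fig2)
```

In this example block C receives four valid plots, blocks B and E receive one each, and no plot is invalid.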

Next, the sound distribution map generation unit 15 uses a sound direction determination technique to determine the direction of each sound source from the audio data acquired from the 3D microphone 3, and generates a sound distribution map 204 indicating the sound-source person determination blocks. The sound distribution map 204 visualizes, from the audio data acquired with the 3D microphone 3, the position of each sound source in the surrounding 360-degree space. For example, the sound distribution map generation unit 15 determines the sound intensity together with the azimuth and elevation angles of each sound source, and generates a sound intensity distribution map on a sphere centered on the position of the 3D microphone 3.
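Once sound directions are known, assigning them to person determination blocks might look like the sketch below. It assumes a direction-of-arrival step has already produced (azimuth, intensity) pairs; the intensity cutoff `min_intensity`, the function name, and the block representation as (start, end) angle pairs are assumptions for illustration, and elevation and the ±180-degree seam are ignored.

```python
from typing import Dict, Iterable, Set, Tuple


def sound_source_blocks(
    sources: Iterable[Tuple[float, float]],
    blocks: Dict[str, Tuple[float, float]],
    min_intensity: float,
) -> Set[str]:
    """Return the names of blocks that contain a sufficiently strong sound source.

    sources holds (azimuth_degrees, intensity) pairs; blocks maps a person's
    name to the (start, end) angles of that person's determination block.
    """
    active: Set[str] = set()
    for azimuth, intensity in sources:
        if intensity < min_intensity:
            continue  # treat weak sources as background noise (assumed cutoff)
        for name, (start, end) in blocks.items():
            if start <= azimuth < end:
                active.add(name)
    return active
```

A source whose azimuth falls outside every block, such as a voice from someone not captured as a person, is simply not attributed to anyone in this sketch.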

The output video generation unit 16 then generates an output video 205 based on the line-of-sight histogram 203 and the sound distribution map 204. In the example shown in this figure, persons B, C, and E are attracting lines of sight (they have valid line-of-sight plots), so the output video generation unit 16 generates an output video 205 showing persons B, C, and E. Because person C receives the most lines of sight and is also speaking, the output video generation unit 16 determines that person C is the central person, shows person C largest and in the center, and shows persons B and E smaller than person C.

For each of the line-of-sight histogram and the sound distribution map, the output video generation unit 16 also decides the criterion for determining the central person to show according to how effective that information is. The criteria for determining the central person based on the effectiveness of the line-of-sight histogram and the sound distribution map are described in detail below using concrete examples.

(First case)
First, the first case is described, in which lines of sight are concentrated on specific persons and voices are dispersed. At an athletic meet, for example, lines of sight gather on the central person while voices are dispersed because of cheering. The persons who are sound sources are not the ones deserving attention, so the person on whom lines of sight are concentrated should be determined to be the central person. The video processing apparatus 1 therefore determines the effectiveness of both the sound distribution map and the line-of-sight histogram, and identifies the central person by giving priority to the information with the higher effectiveness.

FIG. 3 is a diagram for explaining the output video that the video processing apparatus according to the present embodiment generates in the first case.
In the illustrated example, persons A to I appear in the captured video. In this example, voices are dispersed, the lines of sight of persons C to G are directed at person A, and the lines of sight of persons H and I are directed at person B.

Persons A to I appear in the panoramic image 301 obtained by unwrapping the captured video. In the line-of-sight histogram 302, person A has the most line-of-sight plots, four; person B has the next most, two; and persons C to I have none. That is, lines of sight are concentrated on persons A and B. In the sound distribution map 303, on the other hand, persons A and B have no audio data, while persons C to I do. Here, "having audio data" means being a sound source in the sound distribution map.

The output video generation unit 16 uses an arbitrary fixed period (hereinafter, the "determination time") to prevent the central person in the video from changing too frequently. When, within the determination time, the number of persons having audio data in the sound distribution map exceeds an arbitrary fixed ratio of the total number of persons (hereinafter, the "first threshold ratio"), the output video generation unit 16 regards the voices as dispersed and determines that the effectiveness of the sound distribution map is low. The total number of persons is the number of all persons appearing in the captured video. Conversely, when the number of persons having audio data in the sound distribution map within the determination time is at or below the first threshold ratio, it determines that the effectiveness of the sound distribution map is high. In the example shown in this figure, the persons having audio data exceed the first threshold ratio, so the output video generation unit 16 determines that the effectiveness of the sound distribution map 303 is low.

The output video generation unit 16 determines the effectiveness of the line-of-sight histogram at the same time as that of the sound distribution map. When, within the determination time, the number of persons having valid line-of-sight plots in the line-of-sight histogram exceeds an arbitrary fixed ratio of the total number of persons (hereinafter, the "second threshold ratio"), it regards the lines of sight as dispersed and determines that the effectiveness is low. It also regards the lines of sight as dispersed and determines that the effectiveness is low when, within the determination time, the number of invalid line-of-sight plots in the line-of-sight histogram exceeds an arbitrary fixed ratio of the total number of persons (hereinafter, the "third threshold ratio"). Conversely, when, within the determination time, the number of persons having valid line-of-sight plots is at or below the second threshold ratio and the number of invalid line-of-sight plots is at or below the third threshold ratio, it determines that the effectiveness of the line-of-sight histogram is high. The first, second, and third threshold ratios may all be the same or may differ from one another. In the example shown in this figure, the number of persons having valid line-of-sight plots in the line-of-sight histogram 302 is at or below the second threshold ratio and the invalid line-of-sight plots are at or below the third threshold ratio, so the output video generation unit 16 determines that the effectiveness of the line-of-sight histogram 302 is high.
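The three threshold checks can be sketched as below. The text leaves the threshold ratios arbitrary, so the value 0.5 used in the examples is purely illustrative, and the function and parameter names are hypothetical. Aggregation over the determination time is omitted; the sketch evaluates a single snapshot.

```python
def sound_map_is_effective(
    num_speakers: int, total_people: int, first_ratio: float
) -> bool:
    """High effectiveness when sound sources do not exceed the first threshold ratio."""
    return num_speakers / total_people <= first_ratio


def histogram_is_effective(
    num_gazed_people: int,
    num_invalid_plots: int,
    total_people: int,
    second_ratio: float,
    third_ratio: float,
) -> bool:
    """High effectiveness only when both the gazed-at persons and the
    invalid plots stay at or below their respective threshold ratios."""
    return (
        num_gazed_people / total_people <= second_ratio
        and num_invalid_plots / total_people <= third_ratio
    )


# First case (FIG. 3): 9 persons, 7 sound sources (C to I), 2 gazed-at persons.
# The number of invalid plots is not stated in the text; 2 is assumed here.
case1_sound = sound_map_is_effective(7, 9, 0.5)
case1_gaze = histogram_is_effective(2, 2, 9, 0.5, 0.5)
```

Under these illustrative ratios, the first case yields a low-effectiveness sound distribution map and a high-effectiveness line-of-sight histogram, matching the outcome described above.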

In this example, the effectiveness of the line-of-sight histogram 302 is high and that of the sound distribution map 303 is low, so the output video generation unit 16 determines that the line-of-sight directions of the persons are reliable and generates an output video 304 that features persons in descending order of the number of lines of sight they attract. In the output video 304 of this example, person A, who attracts the most lines of sight, is shown largest, and person B, who attracts the next most, is shown smaller than person A.

(Second case)
Next, the second case is described, in which lines of sight are dispersed and the voice is concentrated on a specific person. In a meeting, for example, when each person listens to the speaker while reading the materials at hand, each person is looking down at their own materials, so the lines of sight are dispersed while the voice is concentrated on the speaker. The speaker should therefore be determined to be the central person.

FIG. 4 is a diagram for explaining the output video that the video processing device according to the present embodiment generates in the second case.
In the illustrated example, persons A to F appear in the captured video. The lines of sight of the persons are dispersed, and only person C is speaking.

In the line-of-sight histogram 402 of this example, only person B has a single valid line-of-sight plot, and the other line-of-sight plots are invalid. In the sound distribution diagram 403, only person C has audio data.

Therefore, in the example shown in this figure, the number of invalid line-of-sight plots exceeds the third threshold ratio, so the output video generation unit 16 determines that the effectiveness of the line-of-sight histogram 402 is low. The number of persons having audio data is at or below the first threshold ratio, so it determines that the effectiveness of the sound distribution diagram 403 is high. Because the effectiveness of the line-of-sight histogram 402 is low and that of the sound distribution diagram 403 is high, the output video generation unit 16 determines that the audio direction is highly reliable and generates an output video 404 that focuses on the person who is speaking. In the output video 404 of this example, the speaking person C is zoomed in on and shown large.

(Third case)
Next, the third case, in which both the lines of sight and the audio are concentrated, is described. The first and second cases covered situations in which the effectiveness of the line-of-sight histogram and that of the sound distribution diagram differ; the third case covers the situation in which both are high.

FIG. 5 is a diagram for explaining the output video that the video processing device according to the present embodiment generates in the third case.
Persons A to F appear in the panoramic image 501 of this example. The lines of sight of persons A and F are directed at person B, and those of persons B, D, and E are directed at person C. Person E is speaking.

Accordingly, in the line-of-sight histogram 502 of this example, person B has two valid line-of-sight plots and person C has three. In the sound distribution diagram 503, only person E has audio data.

The number of persons having a valid line-of-sight plot is at or below the second threshold ratio and the number of invalid line-of-sight plots is at or below the third threshold ratio, so the output video generation unit 16 determines that the effectiveness of the line-of-sight histogram 502 is high. The number of persons having audio data is at or below the first threshold ratio, so it also determines that the effectiveness of the sound distribution diagram 503 is high. When the effectiveness of both the line-of-sight histogram 502 and the sound distribution diagram 503 is high, the output video generation unit 16 compares the number of persons having a line-of-sight plot with the number of persons having audio data, regards the source with the smaller number as having the higher information density and therefore as better capturing the central person, and uses that information preferentially.

In this example, two persons (B and C) have line-of-sight plots and one person (E) has audio data, so the output video generation unit 16 determines that the sound distribution diagram 503 has the higher information density. It then generates an output video 504 in which person E, who has audio data in the sound distribution diagram 503, is treated as the most central person and shown large in the center of the screen. It also shows persons C and B in the output video 504, smaller than person E, prioritized in descending order of the number of valid line-of-sight plots in the line-of-sight histogram 502. In the output video 504 of this example, the speaker, person E, is shown large in the center; person C, who has the most line-of-sight plots, is shown next largest; and person B, who has the next most, is shown smallest. When the number of persons having a line-of-sight plot equals the number of persons having audio data, the output video generation unit 16 may show each person without prioritizing either the line-of-sight or the audio information, for example by placing the persons having line-of-sight plots on the left side of the screen and the persons having audio data on the right side.
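The count comparison used in the third case can be sketched as follows. This is an illustrative Python sketch under assumptions, not the device's actual logic: the function name `rank_when_both_effective`, the string labels for the prioritized source, and the ordering of speakers among themselves (input order) are choices made here. The description only gives the example in which the audio side has fewer persons; the reverse branch is written by symmetry.

```python
from typing import Dict, List, Tuple


def rank_when_both_effective(gaze_counts: Dict[str, int],
                             speakers: List[str]) -> Tuple[str, List[str]]:
    """Return (prioritized source, persons in descending display priority)."""
    # Persons with at least one valid plot, most-watched first.
    watched = sorted((p for p, n in gaze_counts.items() if n > 0),
                     key=lambda p: gaze_counts[p], reverse=True)
    if len(speakers) < len(watched):
        priority, ordered = "audio", list(speakers) + watched
    elif len(watched) < len(speakers):
        priority, ordered = "gaze", watched + list(speakers)
    else:
        # Equal counts: no source is prioritized over the other.
        priority, ordered = "none", watched + list(speakers)
    # A person may appear in both sources; keep only the first occurrence.
    seen = set()
    result = []
    for person in ordered:
        if person not in seen:
            seen.add(person)
            result.append(person)
    return priority, result


# Values from the third-case example: B has 2 plots, C has 3, E speaks.
print(rank_when_both_effective({"B": 2, "C": 3}, ["E"]))
# ('audio', ['E', 'C', 'B'])
```

The returned order matches the example output video: person E first, then person C, then person B.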

(Fourth case)
Next, the fourth case, in which both the lines of sight and the audio are dispersed, is described. The first and second cases covered situations in which the effectiveness of the line-of-sight histogram and that of the sound distribution diagram differ; the fourth case covers the situation in which both are low.

FIG. 6 is a diagram for explaining the output video that the video processing device according to the present embodiment generates in the fourth case.
Persons A to L appear in the panoramic image 601 of this example. In the line-of-sight histogram 602 of this example, person B has one valid line-of-sight plot, person C has two, person D has three, person E has three, person F has one, person H has one, and person I has one. In the sound distribution diagram 603, persons D to J have audio data.

The number of persons having a valid line-of-sight plot exceeds the second threshold ratio, so the output video generation unit 16 determines that the effectiveness of the line-of-sight histogram 602 is low. Similarly, the number of persons having audio data exceeds the first threshold ratio, so it determines that the effectiveness of the sound distribution diagram 603 is low. When the effectiveness of both the line-of-sight histogram 602 and the sound distribution diagram 603 is low, the output video generation unit 16 determines that the space has no settled central person and generates an output video 604 that shows every person who has a valid line-of-sight plot or audio data. In the illustrated output video 604, all of persons B to J, who have valid line-of-sight plots or audio data, are shown.
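The four cases reduce to a small decision table over the two effectiveness judgments. The Python sketch below is one hedged way to express it; the function names `choose_mode` and `people_for_show_all` and the mode strings are hypothetical labels introduced here, and the alphabetical ordering of the result is an assumption, since the description does not say how persons are arranged in the fourth case.

```python
from typing import Dict, List


def choose_mode(gaze_effective: bool, audio_effective: bool) -> str:
    """Map the two effectiveness judgments onto the four described cases."""
    if gaze_effective and audio_effective:
        return "compare_counts"  # third case: prefer the source with fewer persons
    if gaze_effective:
        return "gaze"            # first case: follow the line-of-sight histogram
    if audio_effective:
        return "audio"           # second case: follow the sound distribution diagram
    return "show_all"            # fourth case: no settled central person


def people_for_show_all(gaze_counts: Dict[str, int], speakers: List[str]) -> List[str]:
    """Fourth case: everyone with a valid line-of-sight plot or audio data."""
    watched = {p for p, n in gaze_counts.items() if n > 0}
    return sorted(watched | set(speakers))


# Values from the fourth-case example (histogram 602, diagram 603).
gaze_602 = {"B": 1, "C": 2, "D": 3, "E": 3, "F": 1, "H": 1, "I": 1}
speakers_603 = ["D", "E", "F", "G", "H", "I", "J"]
print(choose_mode(False, False))                    # show_all
print(people_for_show_all(gaze_602, speakers_603))
# ['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
```

The union contains persons B to J, which agrees with the persons shown in the example output video 604.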

When many persons have valid line-of-sight plots or audio data, the output video generation unit 16 may limit the number of persons shown by means of a threshold for the maximum number of persons.

FIG. 7 is a diagram for explaining the output video that the video processing device according to the present embodiment generates when limiting the number of persons shown.
Persons A to L appear in the panoramic image 701 of this example. In the line-of-sight histogram 702 of this example, person C has two valid line-of-sight plots, person D has four, and person E has six. In the sound distribution diagram 703, persons H to J have audio data.

The number of persons having a valid line-of-sight plot is at or below the second threshold ratio and the number of invalid line-of-sight plots is at or below the third threshold ratio, so the output video generation unit 16 determines that the effectiveness of the line-of-sight histogram 702 is high. The number of persons having audio data is at or below the first threshold ratio, so it also determines that the effectiveness of the sound distribution diagram 703 is high. Three persons have valid line-of-sight plots in the line-of-sight histogram 702 and three persons have audio data in the sound distribution diagram 703, so six persons are candidates to be shown. When the effectiveness of both the line-of-sight histogram and the sound distribution diagram is high and the number of candidates exceeds the threshold for the maximum number of persons, the output video generation unit 16 excludes candidates in ascending order of the number of valid line-of-sight plots. In this example the threshold for the maximum number of persons is five, so the output video generation unit 16 excludes person C, who has the fewest valid line-of-sight plots, and limits the persons shown to five. The output video 704 of this example shows five persons: D, E, H, I, and J. Persons E and D, who have valid line-of-sight plots, are shown on the left side of the screen, and persons H, I, and J, who have audio data, are shown on the right side. In the output video 704, person E, who has the most valid line-of-sight plots, is shown larger than person D.

Although this example describes a threshold of five for the maximum number of persons, the threshold is not limited to this value and may be any number of one or more. This example also describes the situation in which the effectiveness of both the line-of-sight histogram and the sound distribution diagram is high, but the same applies when both are low: if the number of candidates exceeds the threshold for the maximum number of persons, the output video generation unit 16 may exclude candidates in ascending order of the number of valid line-of-sight plots.
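The candidate limit can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name `cap_candidates` and the return shape (persons for the left side, persons for the right side) are introduced here. Following the FIG. 7 example, only persons with line-of-sight plots are removed; the description does not say what happens if the speakers alone still exceed the limit, so this sketch leaves them untouched in that situation.

```python
from typing import Dict, List, Tuple


def cap_candidates(gaze_counts: Dict[str, int], speakers: List[str],
                   max_people: int) -> Tuple[List[str], List[str]]:
    """Drop the least-watched persons until the candidate count fits the limit."""
    if max_people < 1:
        raise ValueError("max_people must be at least 1")
    # Most-watched first, so the least-watched person is at the end of the list.
    watched = sorted((p for p, n in gaze_counts.items() if n > 0),
                     key=lambda p: gaze_counts[p], reverse=True)
    # Count a person who is both watched and speaking only once.
    speakers_only = [s for s in speakers if s not in watched]
    while watched and len(watched) + len(speakers_only) > max_people:
        watched.pop()  # exclude the person with the fewest valid plots
    return watched, speakers_only


# Values from the FIG. 7 example with a limit of five persons.
print(cap_candidates({"C": 2, "D": 4, "E": 6}, ["H", "I", "J"], 5))
# (['E', 'D'], ['H', 'I', 'J'])
```

With the example values, person C is excluded and persons E and D remain on the line-of-sight side, which agrees with the output video 704.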

FIG. 8 is a flowchart showing the procedure of the video processing executed by the video processing device according to the present embodiment.
First, the person determination block generation unit 13 unwraps the captured video acquired from the omnidirectional camera 2 into a panorama and generates a panoramic image (step S101). The person determination block generation unit 13 then divides the panoramic image and generates a person determination block for identifying each person (step S102).

Next, the line-of-sight histogram generation unit 14 detects the line-of-sight direction of each person in the captured video, determines the person determination block located in each detected line-of-sight direction, and generates a line-of-sight histogram showing the line-of-sight plots for each person determination block (step S103).
Next, the sound distribution diagram generation unit 15 determines the direction of each sound source based on the audio data acquired from the 3D microphone 3 and generates a sound distribution diagram indicating the person determination blocks of the sound sources (step S104).

Next, the output video generation unit 16 determines the effectiveness of the line-of-sight histogram and of the sound distribution diagram, preferentially uses the information with the higher effectiveness, and edits the video data to generate an output video (step S105). The output video generation unit 16 also combines the corresponding audio data with the output video.
Next, the output unit 17 outputs the output video with the audio data combined (step S106). The processing then ends.

As described above, according to the present embodiment, the video processing device 1 includes: a person determination block generation unit 13 that generates a panoramic image from video data of persons captured by a camera and generates, in the panoramic image, a person determination block corresponding to each person; a line-of-sight histogram generation unit 14 that determines the person determination block located in the line-of-sight direction of each person detected from the video data and generates a line-of-sight histogram indicating the number of lines of sight for each person determination block; a sound distribution diagram generation unit 15 that generates, based on audio data of the persons, a sound distribution diagram indicating the person determination blocks of the sound sources; and an output video generation unit 16 that determines the effectiveness of the line-of-sight histogram based on the number of person determination blocks having a line of sight, determines the effectiveness of the sound distribution diagram based on the number of person determination blocks of sound sources, and generates an output video from the video data based on the line-of-sight histogram or the sound distribution diagram according to each effectiveness.

With this configuration, the central person in the captured video can be determined accurately by using both the lines of sight and the audio of the persons appearing in it. The central person can be determined even when only line-of-sight data or only audio data is available. Because persons are determined automatically from the acquired video and audio and the video is edited accordingly, the user needs no video editing skills. In other words, a video focused on the person of interest is generated automatically from the video and audio, so even someone without shooting or video editing skills can record and distribute video of consistent quality.

In addition, because the output video is generated based on the priority of the line-of-sight histogram and the sound distribution diagram, the central person appropriate to each situation can be determined accurately even under differing conditions, such as school events (for example, sports days), meetings, concert halls, and event venues. The video processing device 1 according to the present invention can therefore be used in a variety of scenes, such as recording school events, meetings, concert halls, and event venues.

The output video generation unit 16 determines that the effectiveness of the sound distribution diagram is low when the number of person determination blocks of sound sources exceeds a predetermined threshold. With this configuration, the line-of-sight histogram data is used preferentially when the audio is dispersed, so the central person can be determined accurately even when, for example at a sports day, the lines of sight gather on the central person while the audio is dispersed by cheering.
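The audio-side check is a single ratio comparison and can be sketched as follows. This Python sketch is illustrative only: the function name `sound_diagram_is_effective` and the example ratio of 0.5 are assumptions, and the threshold is expressed as a ratio of the total number of persons in line with the first threshold ratio used in the case descriptions.

```python
from typing import Set


def sound_diagram_is_effective(source_blocks: Set[str], total_people: int,
                               first_ratio: float) -> bool:
    """Return True when the number of sound-source blocks is within the threshold."""
    if total_people <= 0:
        raise ValueError("total_people must be positive")
    # Many simultaneous sources (for example cheering) means the audio is dispersed.
    return len(source_blocks) / total_people <= first_ratio


# One speaker among six persons: audio is concentrated.
print(sound_diagram_is_effective({"C"}, 6, 0.5))                       # True
# Seven sources among twelve persons: audio is dispersed.
print(sound_diagram_is_effective(set("DEFGHIJ"), 12, 0.5))             # False
```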

The output video generation unit 16 determines that the effectiveness of the line-of-sight histogram is low when the number of person determination blocks having a line of sight exceeds a predetermined threshold. In addition, the line-of-sight histogram generation unit 14 determines that a line of sight is invalid when it falls within the person's own person determination block or outside any person determination block, and the output video generation unit 16 determines that the effectiveness of the line-of-sight histogram is low when the number of invalid lines of sight exceeds a predetermined threshold. With this configuration, the sound distribution diagram data is used preferentially when the lines of sight are dispersed, so the central person can be determined accurately even when, for example in a meeting, each person listens to the speaker while reading the materials at hand.
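The validity rule for individual lines of sight can be sketched as a small histogram builder. This is a hedged Python illustration, not the line-of-sight histogram generation unit 14 itself: the function name `build_gaze_histogram` is hypothetical, each person determination block is identified by its person's label, and `None` stands for a line of sight that lands outside every block. Gaze detection from the video is outside the scope of the sketch.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def build_gaze_histogram(
    gaze_targets: Dict[str, Optional[str]],
) -> Tuple[Dict[str, int], int]:
    """Return (valid plots per watched person, number of invalid plots)."""
    blocks = set(gaze_targets)  # assumption: one block per person, keyed by label
    valid = Counter()
    invalid = 0
    for person, target in gaze_targets.items():
        # Invalid: outside every block, or inside the person's own block.
        if target is None or target not in blocks or target == person:
            invalid += 1
        else:
            valid[target] += 1
    return dict(valid), invalid


# Lines of sight from the third-case example; where person C looks is not
# stated in the description, so None is assumed here for illustration.
targets = {"A": "B", "B": "C", "C": None, "D": "C", "E": "C", "F": "B"}
print(build_gaze_histogram(targets))
# ({'B': 2, 'C': 3}, 1)
```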

When the effectiveness of both the line-of-sight histogram and the sound distribution diagram is high, the output video generation unit 16 compares the number of person determination blocks having a line of sight with the number of person determination blocks of sound sources and preferentially shows the persons in the blocks of whichever is fewer. With this configuration, the denser information is used preferentially, so the central person can be determined accurately.

When the effectiveness of both the line-of-sight histogram and the sound distribution diagram is low, the output video generation unit 16 shows the persons in the person determination blocks having a line of sight and in the person determination blocks of sound sources. With this configuration, when the space has no settled central person, all of the persons attracting lines of sight and all of the speakers can be shown.

When the number of candidate persons to be shown in the output video exceeds the threshold for the maximum number of persons, the output video generation unit 16 excludes candidates in ascending order of the number of lines of sight. With this configuration, when there are many candidates, the number of persons shown can be limited so that the more central persons are shown preferentially.

<Second embodiment>
Next, the second embodiment is described.
FIG. 9 is a diagram showing the minimum configuration of the video processing device.
The video processing device 1 need only include at least the person determination block generation unit 13, the line-of-sight histogram generation unit 14, the sound distribution diagram generation unit 15, and the output video generation unit 16.
The person determination block generation unit 13 generates, in video data of persons captured by a camera, a person determination block corresponding to each person.
The line-of-sight histogram generation unit 14 determines the person determination block located in the line-of-sight direction of each person detected from the video data and generates a line-of-sight histogram indicating the number of lines of sight for each person determination block.
The sound distribution diagram generation unit 15 generates, based on audio data of the persons, a sound distribution diagram indicating the person determination blocks of the sound sources.
The output video generation unit 16 determines the effectiveness of the line-of-sight histogram based on the number of person determination blocks having a line of sight, determines the effectiveness of the sound distribution diagram based on the number of person determination blocks of sound sources, and generates an output video from the video data based on the line-of-sight histogram or the sound distribution diagram according to each effectiveness.
According to the present embodiment, a video focused on the person of interest can be generated automatically from video and audio.

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment, and various modifications can be made without departing from the spirit of the invention.

A program for realizing the functions of the processing units of the video processing device 1 described above may be recorded on a computer-readable recording medium, and the processing described above may be performed by having a computer system read and execute the program recorded on that medium. The term "computer system" here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a web page providing environment (or display environment). The term "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. The "computer-readable recording medium" further includes media that hold the program for a certain period of time, such as the volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted over a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium, or by transmission waves in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may realize only some of the functions described above. It may also be a so-called difference file (difference program), which realizes the functions described above in combination with a program already recorded in the computer system.

1 ... Video processing device
11 ... Video data acquisition unit
12 ... Audio data acquisition unit
13 ... Person determination block generation unit
14 ... Line-of-sight histogram generation unit
15 ... Sound distribution diagram generation unit
16 ... Output video generation unit
17 ... Output unit
2 ... Omnidirectional camera
3 ... 3D microphone
100 ... Video processing system

Claims

1. A video processing device comprising:
a person determination block generation unit that generates, in video data of persons captured by a camera, a person determination block corresponding to each person;
a line-of-sight histogram generation unit that determines the person determination block located in the line-of-sight direction of each person detected from the video data and generates a line-of-sight histogram indicating the number of lines of sight for each person determination block;
a sound distribution diagram generation unit that generates, based on audio data of the persons, a sound distribution diagram indicating the person determination blocks of sound sources; and
an output video generation unit that determines the effectiveness of the line-of-sight histogram based on the number of person determination blocks having a line of sight, determines the effectiveness of the sound distribution diagram based on the number of person determination blocks of sound sources, and generates an output video from the video data based on the line-of-sight histogram or the sound distribution diagram according to each effectiveness.

2. The video processing device according to claim 1, wherein the output video generation unit determines that the effectiveness of the sound distribution diagram is low when the number of person determination blocks of sound sources exceeds a predetermined threshold.

3. The video processing device according to claim 1 or 2, wherein the output video generation unit determines that the effectiveness of the line-of-sight histogram is low when the number of person determination blocks having a line of sight exceeds a predetermined threshold.

4. The video processing device according to any one of claims 1 to 3, wherein the line-of-sight histogram generation unit determines that a line of sight is invalid when the line of sight of a person falls within that person's own person determination block or outside any person determination block, and
the output video generation unit determines that the effectiveness of the line-of-sight histogram is low when the number of invalid lines of sight exceeds a predetermined threshold.

5. The video processing device according to any one of claims 1 to 4, wherein, when the effectiveness of both the line-of-sight histogram and the sound distribution diagram is high, the output video generation unit compares the number of person determination blocks having a line of sight with the number of person determination blocks of sound sources and preferentially shows the persons in the person determination blocks of whichever is fewer.

6. The video processing device according to any one of claims 1 to 5, wherein, when the effectiveness of both the line-of-sight histogram and the sound distribution diagram is low, the output video generation unit shows the persons in the person determination blocks having a line of sight and in the person determination blocks of sound sources.

7. The video processing device according to any one of claims 1 to 6, wherein, when the number of candidate persons to be shown in the output video exceeds a threshold for the maximum number of persons, the output video generation unit excludes candidates in ascending order of the number of lines of sight.

8. A video processing method in which:
a person determination block generation unit generates, in video data of persons captured by a camera, a person determination block corresponding to each person;
a line-of-sight histogram generation unit determines the person determination block located in the line-of-sight direction of each person detected from the video data and generates a line-of-sight histogram indicating the number of lines of sight for each person determination block;
a sound distribution diagram generation unit generates, based on audio data of the persons, a sound distribution diagram indicating the person determination blocks of sound sources; and
an output video generation unit determines the effectiveness of the line-of-sight histogram based on the number of person determination blocks having a line of sight, determines the effectiveness of the sound distribution diagram based on the number of person determination blocks of sound sources, and generates an output video from the video data based on the line-of-sight histogram or the sound distribution diagram according to each effectiveness.