JP4863937B2

JP4863937B2 - Encoding processing apparatus and encoding processing method

Info

Publication number: JP4863937B2
Application number: JP2007166203A
Authority: JP
Inventors: 哲也山本; 大三長原
Original assignee: Sony Interactive Entertainment Inc; Sony Computer Entertainment Inc
Current assignee: Sony Interactive Entertainment Inc
Priority date: 2007-06-25
Filing date: 2007-06-25
Publication date: 2012-01-25
Anticipated expiration: 2027-06-25
Also published as: JP2009005239A

Description

The present invention relates to an encoding processing apparatus and an encoding processing method for encoding video.

With the spread of broadband networks, distributing video and audio streams over the Internet has become increasingly common. Stream distribution is used for Internet services such as Internet telephony, remote video conferencing, and chat. In a chat system that uses video and audio streams, the face images and voices of users at remote locations are transmitted to one another over the network, and the video and audio are played back on each user's terminal, so that the users can chat as if they were together in the same place.

Video stream distribution is realized by compressing and encoding the frame images of a moving image with a moving image encoding scheme such as MPEG (Moving Picture Experts Group), storing the resulting video stream in IP (Internet Protocol) packets, transferring the packets over the Internet, and having the user's communication terminal receive them. Because the Internet transfers packets on a best-effort basis, packets may be discarded or delayed when the network is congested. Data may then be lost, and frame images may not be received correctly.

For this reason, the bit rate of the encoded video or audio stream is adjusted according to the network bandwidth. In addition, a region of interest (ROI) is set in the image, sufficient bits are allocated to the region of interest, and fewer bits are allocated to the non-interest region during encoding. This reduces the network bandwidth used, avoids congestion, and ensures that playback quality is maintained at least for the region of interest even during congestion.

For example, Patent Document 1 discloses an image encoding method that encodes the region of interest and the remaining region at different compression ratios.
Patent Document 1: JP 2005-295379 A

In applications such as chat that use the users' face images, how good the communication partner's face image looks is an important factor in user satisfaction. One approach is therefore to automatically detect the user's face in the image, set the detected face region as the region of interest, and generate a video stream in which that region of interest is encoded at high image quality, thereby ensuring the playback quality of the face region. However, when the camera used for the chat has low performance or the user moves too fast, the face region may not be detected correctly, and the face image may not be provided to the receiver at sufficient quality. In addition, a region that is not actually a face may be erroneously detected as a face region and set as the region of interest, which is undesirable.

The present invention has been made in view of these problems, and its object is to provide a moving image encoding technique for appropriately encoding a region of interest of a moving image.

To solve the above problems, an encoding processing apparatus according to one aspect of the present invention includes: a detection unit that detects a face region in a frame of a moving image; a recording unit in which a history of detected face regions is recorded frame by frame; a tracking unit that corrects the face region detection result of the detection unit by referring to the frame-by-frame face region detection history recorded in the recording unit and tracking the face region detected by the detection unit over a plurality of consecutive frames; a region-of-interest determination unit that determines a region of interest according to a predetermined criterion, based on the face region corrected by the tracking unit; and an encoding unit that encodes the region of interest with an image quality different from that of other regions to generate an encoded stream of the moving image.

Another aspect of the present invention is a program. This program causes a computer to realize: a detection function that detects a face region in a frame of a moving image; a recording function that records a history of detected face regions frame by frame; a tracking function that corrects the face region detection result by referring to the recorded frame-by-frame face region detection history and tracking the detected face region over a plurality of consecutive frames; a region-of-interest determination function that determines a region of interest according to a predetermined criterion, based on the corrected face region; and an encoding function that encodes the region of interest with an image quality different from that of other regions to generate an encoded stream of the moving image.

This program may be provided as part of firmware that is built into a device to perform basic control of hardware resources such as video and audio decoders. The firmware is stored, for example, in a semiconductor memory in the device such as a ROM or a flash memory. To provide the firmware, or to update part of it, a computer-readable recording medium on which the program is recorded may be provided, or the program may be transmitted over a communication line.

Any combination of the above components, and any conversion of the expression of the present invention between a method, an apparatus, a system, a computer program, a data structure, a recording medium, and the like, is also valid as an aspect of the present invention.

According to the present invention, a region of interest of a moving image can be set appropriately and encoded.

FIG. 1 is a configuration diagram of a chat system according to an embodiment. Microphones 230a to 230c, cameras 240a to 240c, speakers 250a to 250c, and displays 260a to 260c are connected to a plurality of (here, three) information processing apparatuses 100a to 100c, respectively. The information processing apparatuses 100a to 100c are connected to a network 300. A plurality of (here, three) users A to C use their respective information processing apparatuses 100a to 100c to send their face images and voices to one another in real time over the network 300, and to exchange text data entered from a keyboard, thereby communicating using audio and video (so-called chat).

In the following, when components such as the users' information processing apparatuses 100a to 100c are referred to collectively, the suffixes a to c are omitted and the components are denoted simply by reference numeral 100 and the like.

FIG. 2 is a configuration diagram of the information processing apparatus 100. The components related to chat are omitted here, and the components related to encoding and decoding of audio and video are shown.

The information processing apparatus 100 includes an encoding processing block 200, a decoding processing block 220, and a communication unit 270. The information processing apparatus 100 may be, for example, a personal computer, a portable terminal, or a multiprocessor system. When the information processing apparatus 100 is a personal computer, the encoding processing block 200 and the decoding processing block 220 may be realized by separately installing a dedicated circuit with image encoding and decoding functions in the personal computer. When the information processing apparatus 100 is a multiprocessor system, the high computing power of the multiprocessor is available, so the encoding processing block 200 and the decoding processing block 220 may be realized in software.

The encoding processing block 200 compresses and encodes the audio input to the microphone 230 and the moving image captured by the camera 240 to generate an encoded audio stream and an encoded video stream. The encoded audio stream and the encoded video stream can also be multiplexed into a single stream. The encoded audio stream and encoded video stream generated by the encoding processing block 200 are packetized by the communication unit 270 and transmitted to the chat partner over the network 300.

The communication unit 270 receives packets of the encoded audio stream and the encoded video stream from the chat partner over the network 300 and supplies them to the decoding processing block 220. The decoding processing block 220 decodes the received encoded audio stream and encoded video stream to reproduce the audio and video, and outputs them to the speaker 250 and the display 260, respectively.

FIG. 3 is a configuration diagram of the encoding processing block 200. The figure is a block diagram that focuses on functions, and these functional blocks can be realized in various forms by hardware only, by software only, or by a combination of the two.

Each frame of the moving image captured by the camera 240 is stored in the frame memory 10. The display control unit 14 reads frames from the frame memory 10 in synchronization with the vertical synchronization signal of the display 260 and displays them on the display 260.

The face region detection unit 12 detects regions in which a human face appears in a frame stored in the frame memory 10. Existing techniques are used for this face region detection. Several feature patterns of human faces are prepared in advance, and a face region is detected by searching the frame for a region whose features resemble a feature pattern. Facial features are obtained, for example, by using edge extraction to extract the contour of the face and the shapes and positions of characteristic parts such as the eyes, nose, and mouth.

More than one face region may be detected. The face region detection unit 12 generates position information for each face region. If the face region is rectangular, its position information is expressed, for example, by the coordinates of a representative point (the left corner point) and the vertical and horizontal size of the region. The face region detection unit 12 passes the number and position information of the face regions detected in this way by a known face detection algorithm to the ROI specifying unit 18.
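The position information described above can be pictured as a small record per face region. The following Python sketch is illustrative only: the class and field names (`FaceRegion`, `DetectionResult`, `frame_index`) are hypothetical and do not appear in the specification, and the representative point is assumed to be the top-left corner.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FaceRegion:
    """Rectangular face region: representative corner point plus size."""
    x: int       # horizontal coordinate of the corner (assumed top-left)
    y: int       # vertical coordinate of the corner (assumed top-left)
    width: int   # horizontal size in pixels
    height: int  # vertical size in pixels

    def center(self) -> Tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


@dataclass(frozen=True)
class DetectionResult:
    """What the detector hands over for one frame: count and positions."""
    frame_index: int
    faces: Tuple[FaceRegion, ...]

    @property
    def count(self) -> int:
        return len(self.faces)
```

A frame with no detected face is then simply a `DetectionResult` whose `faces` tuple is empty.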

The ROI specifying unit 18 first verifies the validity of the face region detection result produced by the face region detection unit 12. Even when a face appears in the frame, the face detection algorithm does not necessarily detect it as a face region, and a region may be erroneously judged to be a face region even though no face appears there. This is because the processing performance and shooting resolution of the camera 240 are limited, so the camera cannot follow fast movements of the user, and because facial features can no longer be picked up by the face detection process when the user turns sideways. To improve the accuracy of face detection, a score indicating face likelihood is evaluated to reduce false detections, and the face region detection history is used to prevent missed detections. These measures are described later.

Next, based on the face region whose validity has been verified, the ROI specifying unit 18 specifies the region whose visual quality is to be improved as the region of interest. A region centered on the face region and including its surroundings may be set as the region of interest. For example, for a detected face region, a region containing the face and the upper body is set as the region of interest.
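As a rough illustration of extending a face region to a face-plus-upper-body region of interest, the sketch below enlarges a rectangle sideways and downward and clamps it to the frame. The function name and the expansion factors `side` and `below` are assumptions made for this example; the specification does not give concrete values.

```python
def expand_to_upper_body(face, frame_w, frame_h, side=0.5, below=1.5):
    """Grow a face rectangle (x, y, w, h) into a face + upper-body rectangle.

    side  : extra width added on each side, as a fraction of the face width
    below : extra height added below the face, as a multiple of face height
    Both factors are hypothetical tuning values.
    """
    x, y, w, h = face
    left = max(0, x - int(w * side))
    right = min(frame_w, x + w + int(w * side))
    top = max(0, y)
    bottom = min(frame_h, y + h + int(h * below))
    return (left, top, right - left, bottom - top)


# Example: a 40x40 face in a 320x240 frame.
roi = expand_to_upper_body((100, 50, 40, 40), 320, 240)
```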

The region of interest is not limited to a rectangle and may have any shape. The shape of the region of interest is specified by mask information. For example, when the mask information specifies a heart shape, the ROI specifying unit 18 specifies a heart-shaped region centered on the face region detected by the face region detection unit 12 as the region of interest.

The ROI specifying unit 18 generates ROI information containing the number and position information of the regions of interest and passes it to the non-ROI filter 22 and the video encoder 24. When the ROI information is to be included in the multiplexed stream, the ROI specifying unit 18 also passes the ROI information to the multiplexing unit 32. Including the ROI information in the multiplexed stream is optional; it may be included, for example, when the receiving side wants to use it.

When the video encoder 24 encodes the moving image, the non-interest region is a low-bit-allocation region and the region of interest is a high-bit-allocation region. That is, more bits are allocated to the region of interest than to the non-interest region, so that the quality of the region of interest is higher than that of the non-interest region. For this purpose, the ROI specifying unit 18 determines a bit allocation strength β, which is the ratio of the number of bits allocated to the region of interest to the number of bits allocated to the non-interest region, and passes β to the video encoder 24. The bit allocation strength β takes a value of 1 or more. When β is 1, the non-interest region and the region of interest receive the same bit allocation. When β is greater than 1, the bit allocation of the non-interest region is cut according to the magnitude of β, which relatively increases the bit allocation of the region of interest.
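One way to read the bit allocation strength β is as a per-block ratio under a fixed frame budget. The sketch below makes that interpretation explicit; it is an assumption for illustration, not the encoder's actual rate control, and the function name and the notion of "blocks" (for example, macroblocks) are hypothetical.

```python
def split_bit_budget(total_bits, roi_blocks, non_roi_blocks, beta):
    """Return (bits per ROI block, bits per non-ROI block).

    Assumes beta is the ratio of bits per ROI block to bits per non-ROI
    block, and that the frame budget total_bits is held constant, so a
    larger beta takes bits away from the non-ROI blocks.
    """
    if beta < 1:
        raise ValueError("beta is defined to be 1 or more")
    weight = non_roi_blocks + beta * roi_blocks
    if weight == 0:
        return 0.0, 0.0
    non_roi_per_block = total_bits / weight
    return beta * non_roi_per_block, non_roi_per_block


# beta = 1: equal allocation; beta = 4: ROI blocks get four times as much.
equal = split_bit_budget(12000, 100, 200, 1.0)
skewed = split_bit_budget(12000, 100, 200, 4.0)
```

With a budget of 12000 bits, 100 ROI blocks, and 200 non-ROI blocks, β = 4 gives 80 bits per ROI block and 20 bits per non-ROI block, which still sums to the same budget.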

Further, the ROI specifying unit 18 determines a blur strength γ, which indicates how much the non-interest region is blurred relative to the region of interest, and passes γ to the non-ROI filter 22. Based on the blur strength γ, the non-ROI filter 22 applies filtering that removes high-frequency components to the non-interest region, which visually blurs that region. The blur strength γ takes a value of 1 or more. When γ is 1, no blurring is performed. When γ is greater than 1, the degree of blurring is increased according to the magnitude of γ.
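The non-ROI filtering can be sketched as a low-pass filter applied only outside the region of interest. The example below uses a simple box blur on a grayscale image held as a list of rows. The mapping from γ to a blur radius (`radius = round(γ - 1)`) is purely an assumption for illustration; the specification only states that γ = 1 means no blurring and larger γ means stronger blurring.

```python
def blur_outside_roi(image, roi, gamma):
    """Box-blur every pixel outside roi = (x, y, w, h); leave the ROI sharp.

    image is a list of rows of grayscale values. The radius mapping from
    gamma is a hypothetical choice made for this sketch.
    """
    radius = int(round(gamma - 1))
    out = [row[:] for row in image]
    if radius <= 0:
        return out  # gamma == 1: no blurring
    h, w = len(image), len(image[0])
    rx, ry, rw, rh = roi
    for y in range(h):
        for x in range(w):
            if rx <= x < rx + rw and ry <= y < ry + rh:
                continue  # pixel belongs to the region of interest
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) / len(window)  # mean = low-pass
    return out
```

Averaging over a neighborhood suppresses high-frequency detail, so the blurred area also tends to cost fewer bits when encoded.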

The band information acquisition unit 20 acquires band information, such as the bit rate and congestion state of the communication path, from the communication unit 270 and passes it to the ROI specifying unit 18 and the video encoder 24. The ROI specifying unit 18 refers to the band information to raise or lower the bit allocation strength β and the blur strength γ. The video encoder 24 refers to the band information to adaptively adjust the bit rate of the video stream.

The video encoder 24 receives the filtered image from the non-ROI filter 22 and compresses and encodes the video data, for example in accordance with the MPEG standard, to generate an encoded video stream. The video encoder 24 refers to the ROI information received from the ROI specifying unit 18 to identify the region of interest, encodes the non-interest region and the region of interest at qualities based on the bit allocation strength β, and passes the encoded video stream to the video packetization unit 26.

The audio encoder 28 compresses and encodes the audio data input from the microphone 230, for example in accordance with a standard such as MPEG audio, to generate an encoded audio stream, and passes it to the audio packetization unit 30.

The streams encoded by the video encoder 24 and the audio encoder 28 are called elementary streams (ES). For multiplexing, the video and audio streams are each packetized.

The video packetization unit 26 packetizes the encoded video stream output from the video encoder 24 into, for example, RTP (Real-time Transport Protocol) packets. Similarly, the audio packetization unit 30 packetizes the encoded audio stream output from the audio encoder 28 into RTP packets. RTP is a transport protocol for streaming video and audio. The encoded video and audio streams may instead be packetized into PES (Packetized Elementary Stream) packets.
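For orientation, the sketch below builds a single RTP packet with the 12-byte fixed header defined in RFC 3550 (version 2, no padding, no extension, no CSRC entries). The function name and the default payload type 96 (a value from the dynamic range) are choices made for this example; fragmentation of large frames and payload-specific headers are omitted.

```python
import struct


def build_rtp_packet(payload, seq, timestamp, ssrc,
                     payload_type=96, marker=False):
    """Prepend an RFC 3550 fixed header (12 bytes) to payload bytes."""
    byte0 = 2 << 6  # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack(
        "!BBHII",               # network byte order
        byte0,
        byte1,
        seq & 0xFFFF,           # 16-bit sequence number
        timestamp & 0xFFFFFFFF, # 32-bit media timestamp
        ssrc & 0xFFFFFFFF,      # 32-bit synchronization source id
    )
    return header + payload


packet = build_rtp_packet(b"\x01\x02", seq=7, timestamp=90000,
                          ssrc=0x1234, marker=True)
```

The sequence number lets the receiver detect lost or reordered packets, which matters on a best-effort network.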

The multiplexing unit 32 multiplexes the video and audio RTP packets to generate a multiplexed stream. The generated multiplexed stream is sent out to the network 300 by the communication unit 270.

FIG. 4 is a functional configuration diagram of the ROI specifying unit 18. For each face region detected frame by frame by the face region detection unit 12, the face verification unit 40 receives the face position, the face size, and a face-likeness score from the face region detection unit 12. The face-likeness score indicates how likely it is that an image having the facial features extracted by the face detection algorithm really is a face. The face verification unit 40 verifies the validity of each face region detected by the face region detection unit 12 based on the face position, the face size, and the face-likeness score. The face verification unit 40 records information on the face regions judged valid in the face verification process, frame by frame, as a history in the face region history storage unit 44, and passes information on the face regions that passed the face verification process to the ROI determination processing unit 46.
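A minimal sketch of such a verification step is shown below. The thresholds (`min_score`, `min_frac`, `max_frac`), the 0-to-1 score range, and the specific checks are assumptions for illustration; the specification states only that position, size, and face-likeness score are used.

```python
def verify_face(face, frame_w, frame_h,
                min_score=0.6, min_frac=0.05, max_frac=0.9):
    """Return True if a detection (x, y, w, h, score) looks plausible.

    Hypothetical checks: the rectangle lies inside the frame, its height
    is a sensible fraction of the frame height, and the face-likeness
    score (assumed to be in 0..1) reaches a threshold.
    """
    x, y, w, h, score = face
    inside = x >= 0 and y >= 0 and x + w <= frame_w and y + h <= frame_h
    size_ok = min_frac * frame_h <= h <= max_frac * frame_h
    return inside and size_ok and score >= min_score


detections = [(100, 50, 60, 60, 0.9), (10, 10, 5, 5, 0.95), (40, 40, 60, 60, 0.3)]
valid = [d for d in detections if verify_face(d, 320, 240)]
```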

To eliminate false detections and missed detections by the face region detection unit 12, the tracking unit 42 corrects the face region detection result by referring to the frame-by-frame face region detection history recorded in the face region history storage unit 44 and tracking the detected face region over a plurality of consecutive frames.

Even when a face region is detected in a certain frame, if that face region is not then detected continuously over at least a predetermined number of consecutive frames following that frame, the tracking unit 42 judges the detection history of that face region to be invalid and deletes it from the face region history storage unit 44. As a result, information on a face region is retained in the face region history storage unit 44 as a valid detection history only when the face region has been detected continuously over at least the predetermined number of consecutive frames.

Conversely, for a face region that has a detection history in the face region history storage unit 44, if the face region remains undetected over at least a predetermined number of consecutive frames, it is certain that the face region no longer exists. The tracking unit 42 therefore judges the detection history of that face region to be unnecessary and deletes it from the face region history storage unit 44.
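The two deletion rules above (a new detection is kept only if it persists, and an established one is dropped after a long absence) can be sketched as a small per-face state holder. The class name, attribute names, and the frame counts `confirm_after` and `drop_after` are hypothetical; the specification refers only to "a predetermined number" of frames.

```python
class TrackHistory:
    """Per-face bookkeeping for the keep/delete decisions on the history."""

    def __init__(self, confirm_after=3, drop_after=5):
        self.confirm_after = confirm_after  # frames needed to become valid
        self.drop_after = drop_after        # missed frames before deletion
        self.hits = 0
        self.misses = 0
        self.confirmed = False
        self.alive = True  # False means: delete from the history store

    def update(self, detected):
        """Feed one frame's outcome for this face (True = detected)."""
        if not self.alive:
            return
        if detected:
            self.hits += 1
            self.misses = 0
            if self.hits >= self.confirm_after:
                self.confirmed = True
        else:
            self.misses += 1
            if not self.confirmed:
                # Never detected long enough in a row: treat as invalid.
                self.alive = False
            elif self.misses >= self.drop_after:
                # Established face absent for too long: no longer needed.
                self.alive = False
```

In this sketch a brief miss does not remove an established face, which is what allows gaps to be filled from the history.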

Even for a face region that passed the face verification process of the face verification unit 40, the tracking unit 42 judges the detection to be false if it is inconsistent with the face region detection history of past frames. For example, the face region is judged inconsistent when its position or size differs markedly from the position or size of the face region in past frames.

Rather than judging whether a face region detected in a certain frame is a false detection from the detection history of past frames alone, the tracking unit 42 may also refer to information on the face regions detected in frames after the frame being judged. If the history of a falsely detected face region remained in place, it would affect the face region judgments in later frames, so it is preferable to invalidate the detection history retroactively.

Specifically, the tracking unit 42 compares the face region detection result for a certain frame against the face region detection history for a predetermined number of frames before and after that frame, and judges whether the face region detected in that frame is a false detection. For example, the tracking unit 42 compares the face region detected in a certain frame with the face regions detected in the preceding and following frames in terms of face position and size, and judges the face region detected in that frame to be a false detection when the position or size differs by a predetermined threshold or more.
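The comparison against neighboring frames might look like the sketch below. The thresholds, the use of center distance for position and area ratio for size, and the choice to flag a detection as soon as it disagrees with any one neighboring frame are all assumptions for this example.

```python
import math


def is_false_detection(face, neighbours, max_shift=40.0, max_scale=1.5):
    """Compare a face rectangle (x, y, w, h) with those of nearby frames.

    neighbours holds the rectangles of the same face in the frames before
    and after. All rectangles are assumed to have positive width and
    height. Thresholds are hypothetical tuning values.
    """
    x, y, w, h = face
    cx, cy = x + w / 2, y + h / 2
    for nx, ny, nw, nh in neighbours:
        shift = math.hypot(cx - (nx + nw / 2), cy - (ny + nh / 2))
        scale = max(w * h, nw * nh) / min(w * h, nw * nh)
        if shift >= max_shift or scale >= max_scale:
            return True  # position or size jumped too much
    return False


history = [(102, 98, 50, 50), (105, 101, 52, 52)]
suspicious = is_false_detection((250, 20, 50, 50), history)
```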

The tracking unit 42 deletes the detection history of a face region judged to be a false detection from the face region history storage unit 44 and warns the face verification unit 40 of the erroneous judgment. On receiving this warning from the tracking unit 42, the face verification unit 40 discards the face region, even though it passed the face verification process, and does not pass it to the ROI determination processing unit 46.

Further, even when the face region detection unit 12 has not detected a face region in the current frame, if a valid face region detection history for past frames exists in the face region history storage unit 44, that is, if a face region was detected in past frames, the tracking unit 42 judges that a face region was missed in the current frame. The tracking unit 42 interpolates the face region information for the current frame by reusing the position and size of the face region in a past frame as the position and size of the face region in the current frame. The position and size of the face region in the current frame may also be determined from the positions and sizes of the face region over the past several frames. The tracking unit 42 records the interpolated face region information for the current frame in the face region history storage unit 44 and warns the face verification unit 40 of the missed detection. On receiving this warning from the tracking unit 42, the face verification unit 40 reads the interpolated face region information for the current frame from the face region history storage unit 44 and passes it to the ROI determination processing unit 46.
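A simple way to fill in a missed detection from the history is sketched below. With one past entry the rectangle is reused as-is; with two or more, the sketch linearly extrapolates from the last two frames. The extrapolation is one possible reading of "determined from the past several frames" and is an assumption, as is the function name.

```python
def fill_missed_detection(history):
    """Estimate the current (x, y, w, h) from past rectangles, oldest first.

    Returns None when there is no history to draw on.
    """
    if not history:
        return None
    if len(history) == 1:
        return history[-1]  # reuse the previous frame's region unchanged
    (px, py, pw, ph), (lx, ly, lw, lh) = history[-2], history[-1]
    # Continue the most recent motion; keep the size at least 1 pixel.
    return (2 * lx - px, 2 * ly - py,
            max(1, 2 * lw - pw), max(1, 2 * lh - ph))


estimate = fill_missed_detection([(10, 10, 40, 40), (14, 12, 40, 40)])
```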

For example, when the user temporarily turns sideways or turns around, an existing face detection algorithm may fail to detect the face region in some frames. Even in such cases, for frames in which the face region detection unit 12 missed the face region, the tracking unit 42 fills the gap by reusing the face region detection results of past frames, which prevents the region of interest from going unset.

Further, an imaging control unit may control the pan and tilt of the camera 240 according to the position of the face region tracked by the tracking unit 42, and may control the zoom of the camera 240 according to the size of that face region. Even when the user moves, the camera 240 can keep the user's face in view by panning and tilting. Even when the distance between the camera and the user changes, zooming in and out can keep the user's face at a constant size on the screen.

Based on the criteria stored in the criterion storage unit 48, the ROI determination processing unit 46 determines the final region of interest from the information on the face regions that have been verified by the face verification unit 40. The region of interest is determined according to the application and the use case. Once the ROI determination processing unit 46 has determined the region of interest in a certain frame, it may continue to use the same region of interest for a while, without determining and updating it anew. For example, instead of redetermining the region of interest for every frame, the same region of interest may be used for each GOP (group of pictures) and reset at GOP boundaries. This reduces the load of the ROI determination process and means that ROI information needs to be generated only once per GOP.
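Holding the region of interest constant within a GOP can be sketched as follows. The class name and the fixed `gop_size` are assumptions for this example; a real encoder would signal GOP boundaries itself rather than rely on a fixed frame count.

```python
class GopRoiHolder:
    """Refresh the region of interest only at the first frame of each GOP."""

    def __init__(self, gop_size):
        self.gop_size = gop_size
        self.current = None

    def roi_for_frame(self, frame_index, candidate):
        """candidate is the ROI that would be chosen for this frame."""
        if frame_index % self.gop_size == 0:
            self.current = candidate  # GOP boundary: adopt the new ROI
        return self.current           # otherwise keep the held ROI


holder = GopRoiHolder(gop_size=3)
used = [holder.roi_for_frame(i, c) for i, c in enumerate("abcdef")]
```

Here the candidates for frames 1, 2, 4, and 5 are ignored, so the ROI information changes only at frames 0 and 3.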

The ROI determination processing unit 46 supplies ROI information, including the number and positions of the finally determined regions of interest, to the non-ROI filter 22, the video encoder 24, and the multiplexing unit 32. If no suitable region of interest can be determined, neither filtering by the non-ROI filter 22 nor ROI encoding by the video encoder 24 is performed, and conventional video encoding is performed instead.

The ROI parameter adjustment unit 50 determines ROI parameters, such as the bit allocation strength β and the blur strength γ, for the region of interest finally determined by the ROI determination processing unit 46. When there are multiple regions of interest, it may rank them and determine the number of bits allocated to each region according to its priority.

The bit allocation strength β and the blur strength γ may be determined according to the size of the region of interest. If the region of interest is large, making the bit allocation strength β too large raises the bit rate of the video stream. The bit allocation strength β is therefore reduced for a large region of interest to optimize the bit rate of the video stream. In addition, emphasizing an extremely small or extremely large face region may not be effective, so in such cases the bit allocation strength β and the blur strength γ may be reduced.

The bit allocation strength β and the blur strength γ may also be determined according to the position of the region of interest. For example, when the region of interest is at the edge of the image, emphasizing it may have little effect, so the bit allocation strength β and the blur strength γ are reduced. When the region of interest is near the center of the image, emphasizing it can be expected to be effective, so the bit allocation strength β and the blur strength γ are increased.

Furthermore, the bit allocation strength β and the blur strength γ may be determined according to the face-likeness score. When the face-likeness score is high, emphasizing the face region can be expected to be effective, so the bit allocation strength β and the blur strength γ are increased. When the face-likeness score is low, emphasis may be counterproductive, so the bit allocation strength β and the blur strength γ are reduced.
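The three heuristics above (size, position, face-likeness score) can be combined in many ways; the description does not give a formula. The following is a minimal sketch under stated assumptions: the function name, the cut-off values, the multiplicative combination, and the convention that 1.0 means "no emphasis" are all hypothetical choices made for illustration.

```python
def adjust_roi_strengths(face_w, face_h, center_x, center_y, score,
                         img_w, img_h, beta_max=2.0, gamma_max=2.0):
    """Return (beta, gamma). 1.0 means no emphasis; larger means stronger.

    All thresholds and the way the factors are combined are assumptions.
    """
    # Size: extremely small or large faces get little emphasis;
    # otherwise a larger ROI gets a smaller factor to limit the bit rate.
    area_ratio = (face_w * face_h) / float(img_w * img_h)
    if area_ratio < 0.01 or area_ratio > 0.5:  # hypothetical cut-offs
        size_factor = 0.2
    else:
        size_factor = 1.0 - area_ratio

    # Position: 1.0 at the image center, falling to 0.0 at the border.
    dx = abs(center_x - img_w / 2.0) / (img_w / 2.0)
    dy = abs(center_y - img_h / 2.0) / (img_h / 2.0)
    position_factor = max(0.0, 1.0 - max(dx, dy))

    # Score: clamp the face-likeness score to [0, 1] (assumed range).
    score_factor = min(max(score, 0.0), 1.0)

    strength = size_factor * position_factor * score_factor
    beta = 1.0 + (beta_max - 1.0) * strength
    gamma = 1.0 + (gamma_max - 1.0) * strength
    return beta, gamma
```

With these assumptions, a confidently detected face near the image center receives larger β and γ than the same face near the border or with a low score.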

The ROI parameter adjustment unit 50 can also raise or lower the bit allocation strength β and the blur strength γ based on the bandwidth information received from the band information acquisition unit 20. When a sufficient bit rate is guaranteed for the frame size and frame rate of the video, for example because the network bandwidth is inherently large or because the network is not congested and ample bandwidth is available, there is no need to reduce the bit allocation of the non-interest region, and the entire image may be encoded as a high-bit-allocation region regardless of the distinction between the region of interest and the non-interest region. In that case, the bit allocation strength β is set to 1 and the blur strength γ is set to 1.

Conversely, when a sufficient bit rate cannot be guaranteed for the frame size and frame rate of the video, for example because the network bandwidth is limited or because congestion has reduced the available bandwidth, the bit allocation strength β and the blur strength γ are adjusted to larger values to reduce the bandwidth used.
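The bandwidth rule can be sketched as a small function. This is only an illustration: the text fixes β = 1 and γ = 1 for the sufficient-bandwidth case, but the linear scaling with the bandwidth shortfall, the maximum values, and the parameter names are assumptions.

```python
def strengths_for_bandwidth(available_kbps, required_kbps,
                            beta_max=2.0, gamma_max=2.0):
    """Return (beta, gamma) from available vs. required bit rate.

    required_kbps is the (assumed known, positive) bit rate needed to encode
    the whole frame at high quality for the current frame size and rate.
    """
    if available_kbps >= required_kbps:
        # Enough bandwidth: no need to favor the ROI over the rest.
        return 1.0, 1.0
    # Shortfall in (0, 1]: the larger it is, the stronger the emphasis.
    shortfall = 1.0 - available_kbps / float(required_kbps)
    beta = 1.0 + (beta_max - 1.0) * shortfall
    gamma = 1.0 + (gamma_max - 1.0) * shortfall
    return beta, gamma
```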

The ROI parameter adjustment unit 50 supplies the bit allocation strength β to the video encoder 24 and the blur strength γ to the non-ROI filter 22.

Next, the filter processing by the non-ROI filter 22 of FIG. 3 is described in detail. The non-ROI filter 22 applies low-pass filtering to the non-interest region, blurring it and thereby making the region of interest stand out in relative terms. In general, when an image is compression-coded in the frequency domain, block noise increases as the bit rate decreases. In the video encoder 24, the non-interest region is encoded with fewer allocated bits than the region of interest, so block noise is more likely to occur there. By removing high-frequency components from the non-interest region, the non-ROI filter 22 reduces this block noise. Thus, in addition to visually blurring everything outside the region of interest, filtering by the non-ROI filter 22 has the secondary effect of reducing block noise.

In addition, because non-ROI filtering removes high-frequency components from the non-interest region, it also increases the number of bits that can be allocated to the region of interest at a constant bit rate.

The region of interest and the non-interest region are mutually exclusive and do not overlap. When the non-ROI filter 22 blurs the non-interest region 440, the image quality changes discontinuously at the boundary between the region of interest 420 and the non-interest region 440, so the region of interest 420 may stand out more than necessary and give an unnatural impression. Measures are therefore taken to eliminate the discontinuity at the boundary between the region of interest and the non-interest region.

FIGS. 5(a) to 5(c) illustrate methods for eliminating the discontinuity at the boundary between the region of interest and the non-interest region. As shown in FIG. 5(a), the area enclosed by the thick line at the center of the image 400 is the region of interest 420, and the rest of the image is the non-interest region 440. A peripheral region 430 (the hatched area) is set along the outer edge of the region of interest 420. The peripheral region 430 lies within the non-interest region 440.

The region of interest 420 is ROI-encoded at high image quality by the video encoder 24 with the bit allocation strength β. The non-interest region 440, on the other hand, has its high-frequency components cut by the non-ROI filter 22 with the blur strength γ. Because the peripheral region 430 along the outer edge of the region of interest 420 lies within the non-interest region 440, its high-frequency components are cut by the non-ROI filter 22 with the blur strength γ, but ROI encoding with the bit allocation strength β is also applied to it. In other words, the blurring process and the quality-raising process overlap in the peripheral region 430. Since the peripheral region 430 is both ROI-encoded and stripped of high-frequency components, its image quality falls between that of the region of interest and that of the non-interest region. Because the peripheral region 430 near the boundary has this intermediate image quality, the transition between the region of interest and the non-interest region looks less unnatural.

As another method, the non-ROI filter 22 may vary the image quality continuously by filtering the peripheral region 430 while increasing the blur strength γ in steps. For this purpose, a non-uniform filter may be used that takes a weighted average in which neighboring pixels close to the target pixel are given large weights and neighboring pixels far from the target pixel are given small weights; a Gaussian filter is one example.
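As a minimal sketch of this idea, the following pure-Python code builds normalized Gaussian weights and blurs a single row of pixel values with a blur strength that ramps up across the peripheral region. The function names, the linear ramp, the use of the Gaussian standard deviation as the blur strength, and the border clamping are assumptions for illustration, not details given in the description.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian weights: near pixels weigh more than far ones."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def ramp_sigma(distance_from_roi, peripheral_width, sigma_max):
    """Blur grows linearly from 0 at the ROI edge to sigma_max (assumed ramp)."""
    t = min(max(distance_from_roi / float(peripheral_width), 0.0), 1.0)
    return sigma_max * t

def blur_row(row, sigma_at):
    """Blur one row; sigma_at(x) gives the blur strength for pixel x."""
    out = []
    last = len(row) - 1
    for x, value in enumerate(row):
        sigma = sigma_at(x)
        if sigma <= 0.0:
            out.append(float(value))  # inside the ROI: leave untouched
            continue
        kernel = gaussian_kernel(sigma)
        radius = len(kernel) // 2
        acc = 0.0
        for j, w in enumerate(kernel):
            xi = min(max(x + j - radius, 0), last)  # clamp at the borders
            acc += w * row[xi]
        out.append(acc)
    return out
```

For example, `blur_row(row, lambda x: ramp_sigma(x - roi_right, 4, 1.5))` leaves pixels up to a hypothetical ROI edge `roi_right` unchanged and blurs progressively more strongly over the next four pixels.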

As shown in FIG. 5(b), the peripheral region 430 may instead be provided along the inner edge of the region of interest 420. In this case, the peripheral region 430 lies within the region of interest 420 and is therefore ROI-encoded with the bit allocation strength β, but its high-frequency components are also cut with the blur strength γ. Alternatively, as shown in FIG. 5(c), the peripheral region 430 may straddle both the outer and inner edges of the region of interest 420, with the quality-raising process and the blurring process both applied in the peripheral region 430.

FIG. 6 is a flowchart illustrating the ROI encoding procedure performed by the encoding processing block 200.

The face region detection unit 12 performs face region detection on the current frame (S10). If the face region detection unit 12 detects a face (Y in S12), the process proceeds to the face verification process of step S14. If no face is detected (N in S12), the process proceeds to the tracking-based face interpolation process of step S18.

In step S14, the face verification unit 40 verifies whether each face region detected by the face region detection unit 12 is valid. The tracking unit 42 then performs tracking-based false detection determination on the face regions verified by the face verification unit 40 (S15), and verified face regions found to be false detections are discarded. If no face has passed verification by the face verification unit 40 (N in S16), the process proceeds to the tracking-based face interpolation process of step S18. If one or more faces have passed verification (Y in S16), the process proceeds to the ROI determination process of step S24.

In step S18, the tracking unit 42 refers to the face region information of past frames and performs tracking-based face interpolation, filling in the face region information missing from the current frame. If one or more faces are interpolated (Y in S20), the process proceeds to the ROI determination process of step S24. If no face is interpolated (N in S20), the process proceeds to step S22, and no region of interest is set.

In step S24, the ROI determination processing unit 46 determines the final region of interest from the face regions that passed the verification of step S14 or were interpolated by tracking. In step S26, the ROI parameter adjustment unit 50 adjusts the ROI parameters that differentiate the image quality of the region of interest from that of the non-interest region.

When the next frame is input (Y in S28), the process returns to step S10 and the sequence is repeated. When no further frame is input (N in S28), the process ends.
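The per-frame branching of FIG. 6 (steps S12 to S22) can be summarized in a short sketch. The function below is hypothetical: verification, false-detection rejection, and interpolation are passed in as callables standing in for the face verification unit 40 and the tracking unit 42, and the return value is the list of face regions handed to ROI determination (S24), or `None` when no region of interest is set (S22).

```python
def faces_for_roi(detected, verify, reject_false, interpolate):
    """Return the face regions to use for ROI determination, or None."""
    if detected:                                        # S12: Y
        verified = [f for f in detected if verify(f)]   # S14
        verified = reject_false(verified)               # S15
        if verified:                                    # S16: Y
            return verified                             # go to S24
    interpolated = interpolate()                        # S18
    if interpolated:                                    # S20: Y
        return interpolated                             # go to S24
    return None                                         # S22: no ROI
```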

FIG. 7 is a flowchart illustrating the face verification procedure of step S14 in FIG. 6. Steps S32 to S38 are repeated until all face regions detected by the face region detection unit 12 have been verified (N in S30). When all detected face regions have been verified (Y in S30), the face verification process ends and the process proceeds to step S15.

The face verification unit 40 tests whether the size of the face in the face region is valid (S32), whether the position of the face is valid (S34), and whether the face-likeness score exceeds a threshold (S36). If any test fails (N in S32, N in S34, or N in S36), the process returns to step S30. If all tests pass (Y in S32, Y in S34, and Y in S36), the information is saved in the face region history storage unit 44 as a face region that has passed verification (S38).

The face size test determines whether the face size falls within the expected range. For example, a face that is too large or too small is not adopted as a face region. The face position test determines whether the face position falls within the expected range. For example, a face near the edge of the image is not adopted as a face region. The face-likeness score test determines whether the score falls within the expected range; a face whose score is unexpectedly low is not adopted as a face region.
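A minimal sketch of the three tests (S32, S34, S36) follows. The description does not give concrete ranges, so every threshold here, the use of relative face height as the size measure, and the dictionary layout of a face are assumptions for illustration.

```python
def verify_face(face, img_w, img_h, min_size=0.05, max_size=0.8,
                margin=0.1, min_score=0.5):
    """face: dict with x, y, w, h in pixels and a face-likeness score.

    Returns True only if the size, position and score tests all pass.
    """
    # S32: size test, using face height relative to image height.
    rel_h = face["h"] / float(img_h)
    if not (min_size <= rel_h <= max_size):
        return False

    # S34: position test, rejecting faces centered in the border band.
    cx = face["x"] + face["w"] / 2.0
    cy = face["y"] + face["h"] / 2.0
    if not (margin * img_w <= cx <= (1.0 - margin) * img_w):
        return False
    if not (margin * img_h <= cy <= (1.0 - margin) * img_h):
        return False

    # S36: score test.
    return face["score"] > min_score
```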

FIG. 8 is a flowchart illustrating the tracking-based false detection determination procedure of step S15 in FIG. 6. Steps S52 to S58 are repeated until all face regions verified by the face verification unit 40 have been checked (N in S50). When all verified face regions have been checked (Y in S50), the false detection determination process ends and the process proceeds to step S16.

The tracking unit 42 checks whether the verified face region has been detected continuously over at least a predetermined number of consecutive frames (S52). If so (Y in S52), the process proceeds to step S56. If not (N in S52), the detection history of that face region is judged invalid and is deleted from the face region history storage unit 44 (S54).

Next, the tracking unit 42 checks whether the verified face region is consistent with the detection history of face regions in past frames (S56). If it is consistent (Y in S56), the process returns to step S50. If it is not (N in S56), the face region is judged to be a false detection, and its detection history is deleted from the face region history storage unit 44 (S58).
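The two checks (S52 and S56) might be sketched as follows. The description does not define what "consistent with the detection history" means, so the sketch assumes it means the face center has not jumped more than a fixed distance since the previous frame; the track representation, both thresholds, the returned labels, and the early return after S54 are likewise hypothetical.

```python
import math

def check_against_history(track, current_center,
                          min_consecutive=3, max_jump=40.0):
    """track: past face centers (x, y), oldest first; None marks a missed frame.

    Returns "invalid_history" (S54), "false_detection" (S58) or "keep".
    """
    recent = track[-min_consecutive:]
    # S52: require detections in the last min_consecutive frames.
    if len(recent) < min_consecutive or any(c is None for c in recent):
        return "invalid_history"
    # S56: assumed consistency test, a bounded jump from the last position.
    px, py = recent[-1]
    jump = math.hypot(current_center[0] - px, current_center[1] - py)
    if jump > max_jump:
        return "false_detection"
    return "keep"
```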

FIG. 9 is a flowchart illustrating the tracking-based face interpolation procedure of step S18 in FIG. 6. Steps S42 to S46 are repeated until all face information recorded as history in the face region history storage unit 44 has been checked (N in S40). When all face information in the history has been checked (Y in S40), the face interpolation process ends and the process proceeds to step S20.

The tracking unit 42 checks the number of frames in which the face has not been detected (S42). If that number is equal to or less than a threshold (N in S42), the process proceeds to the face interpolation of step S46. The threshold is determined experimentally, for example according to the frame rate. For example, at a frame rate of 30 frames per second, the threshold is set to 1 to 3 frames, using 1/10 of the frame rate as a guide.

In step S46, when a face was detected in past frames but not in the current frame, the miss is judged to be a detection failure. The face is assumed to be at the same position in the current frame as in the past frame, and the position and size information of the face from the past frame is reused to interpolate the face information of the current frame, which is then saved as the face information of the current frame (S46). Motion vector information of the video may also be used to interpolate the face position in the current frame along the time axis from the face position in past frames.

If the number of frames in which the face has not been detected exceeds the threshold (Y in S42), the face has remained undetected for some time, so the face region is judged to no longer exist and the detection history of the face information is deleted (S44). For example, at a frame rate of 30 frames per second, with 3 frames (1/10 of the frame rate) as the reference, the information on a face region is deleted from the history when the face goes undetected for 3 frames or more. Because the threshold is set to 3 frames, the history information of a face region is not deleted by mistake when detection happens to fail in a single frame.
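The interpolation loop of FIG. 9 can be sketched as below. This is an illustration only: the history layout (a dictionary keyed by a hypothetical face identifier, holding the last box and a miss counter) is an assumption, and the sketch follows the flowchart wording by deleting an entry once its miss count exceeds the threshold. It simply reuses the last box and does not attempt motion-vector interpolation.

```python
def interpolate_faces(history, max_missed=3):
    """Call for a frame in which no face was detected.

    history: dict mapping face id -> {"box": (x, y, w, h), "missed": int}.
    Returns the boxes reused for the current frame; stale entries are removed.
    """
    interpolated = []
    for face_id in list(history):          # S40: visit every history entry
        entry = history[face_id]
        entry["missed"] += 1               # S42: count undetected frames
        if entry["missed"] > max_missed:   # S42: Y
            del history[face_id]           # S44: face assumed gone
        else:
            interpolated.append(entry["box"])  # S46: reuse last position/size
    return interpolated
```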

Next, the method by which the ROI determination processing unit 46 determines the region of interest is described in detail with examples.

FIG. 10 illustrates how a region of interest is set in the image 400. Suppose the user is using video chat in his or her own room. A first face region 420a and a second face region 420b are detected in the image 400. The first face region 420a contains the user's face 410a and should therefore be set as the region of interest. The second face region 420b, however, results from mistakenly detecting the face of a person on a poster on the wall of the room, and should not be set as the region of interest.

A criterion may therefore be provided that a face region showing no motion is not selected as the region of interest. This prevents the face of a person on a wall poster, or a face in a photograph in a frame on a desk, from being mistakenly selected as the region of interest. To identify motionless face regions, the tracking unit 42 may examine the history of face information held in the face region history storage unit 44 and detect whether the position of the face region has changed relative to past frames, or whether parts of the face such as the eyes, nose, and mouth have changed in the image. The face verification unit 40 notifies the ROI determination processing unit 46 of the tracking unit 42's motion determination for each face region, and the ROI determination processing unit 46 does not set motionless face regions as the region of interest. Alternatively, the face verification unit 40 may discard face regions judged motionless by the tracking unit 42 at the outset and not supply them to the ROI determination processing unit 46.
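One simple way to test for motion is to look at how far the face center has moved over the stored history. The sketch below assumes that history is available as a non-empty list of center coordinates and that a fixed pixel threshold separates a live face from a static picture; both are illustrative assumptions, and the check on facial parts mentioned above is not covered.

```python
def has_moved(centers, min_shift=2.0):
    """centers: non-empty list of (x, y) face centers from recent frames.

    Returns True if the center moved by at least min_shift pixels
    horizontally or vertically over the history.
    """
    xs = [c[0] for c in centers]
    ys = [c[1] for c in centers]
    return (max(xs) - min(xs)) >= min_shift or (max(ys) - min(ys)) >= min_shift
```

A face on a poster would yield an almost constant center and be rejected, whereas even a seated user typically shifts by a few pixels between frames.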

FIGS. 11(a) and 11(b) illustrate how the region of interest is determined when the user leaves the seat. As shown in FIG. 11(a), when the user is seated in front of the camera, the face region 420 containing the user's face 410 is determined as the region of interest. If the ROI determination processing unit 46 manages the region of interest per GOP, for example, the same region of interest is used for the other frames in the same GOP. As shown in FIG. 11(b), in the frames immediately after the user leaves the seat, the same region of interest 420 continues to be used, so the part of the room that was hidden while the user was present ends up being shown in high quality on the communication partner's display.

Therefore, when the face region detection unit 12 and the face verification unit 40 no longer detect a face region, an interrupt signal is sent to the ROI determination processing unit 46 so that the region of interest is forcibly reset even in the middle of a GOP. This eliminates the problem of the room being shown in detail after the user has left.
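The combination of per-GOP reuse and the forced reset can be sketched with a small holder class. The class name, the frame-index-based GOP boundary test, and the `decide` callable (standing in for the ROI determination itself) are hypothetical; the sketch also assumes that after a forced reset the region of interest is only re-established at the next GOP boundary, which the description does not specify.

```python
class RoiHolder:
    """Keeps one ROI per GOP, but drops it at once when no face is found."""

    def __init__(self, gop_size):
        self.gop_size = gop_size
        self.roi = None

    def update(self, frame_index, faces, decide):
        """faces: verified face regions for this frame; decide(faces) -> ROI."""
        if not faces:
            # Interrupt: the face disappeared, so reset even mid-GOP.
            self.roi = None
        elif frame_index % self.gop_size == 0:
            # GOP boundary: determine the ROI again.
            self.roi = decide(faces)
        # Otherwise keep using the ROI chosen earlier in this GOP.
        return self.roi
```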

As described later, the user may be allowed to set a prohibited region for an area that the user does not want to show to the communication partner. When the face region detection unit 12 and the face verification unit 40 detect a face region within the prohibited region, the ROI determination processing unit 46 sets the face region as the region of interest even though it lies within the prohibited region; however, once the face region is no longer detected within the prohibited region, the region of interest set in the prohibited region may be released immediately.

FIGS. 12(a) and 12(b) illustrate how the region of interest is determined when the user moves around the room. As shown in FIG. 12(a), while the user is seated, the region 420a containing the user's detected face 410a is set as the region of interest. As shown in FIG. 12(b), when the user temporarily leaves the seat and moves around the room, setting the region of interest to follow the detected face 410c would cause the user's room to be shown clearly as the user moves. A criterion may therefore be provided that a face region 420c detected away from the center of the screen is not selected as the region of interest.

In addition, with a criterion that face regions detected away from the center of the screen are not made the region of interest, there is no risk of a family member's face being set as the region of interest when someone else enters the room, which also helps protect the user's privacy.

FIGS. 13(a) and 13(b) illustrate how to designate regions in which setting a region of interest is permitted or prohibited. As shown in FIG. 13(a), the user may set in advance a region 450 in which setting a region of interest is permitted (here, the central area of the screen), and a criterion may be provided that a face region detected within the user-set permitted region 450 is made the region of interest, whereas a face region detected outside the permitted region 450 is not. Also, as shown in FIG. 13(b), the user may be allowed to set a region 460 in which setting a region of interest is prohibited. Here, the area on the desk is set as the prohibited region 460 so that documents and the like on the desk are not shown in high quality.

Providing a criterion that a face region detected within the permitted region 450 is made the region of interest, or a criterion that a face region is not made the region of interest even when it is detected within the prohibited region 460, protects the user's privacy and ensures security.
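A sketch of the permitted/prohibited test is given below. Rectangles are `(x, y, w, h)` tuples, and a face region is treated as lying in a rectangle when its center does; that containment rule and the function names are assumptions for illustration.

```python
def center_in(box, region):
    """True if the center of box lies inside region; both are (x, y, w, h)."""
    cx = box[0] + box[2] / 2.0
    cy = box[1] + box[3] / 2.0
    rx, ry, rw, rh = region
    return rx <= cx <= rx + rw and ry <= cy <= ry + rh

def may_be_roi(face_box, permitted=None, prohibited=()):
    """Apply a permitted region (if any) and a list of prohibited regions."""
    if permitted is not None and not center_in(face_box, permitted):
        return False
    return not any(center_in(face_box, region) for region in prohibited)
```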

Next, the criteria for determining the region of interest when multiple face regions are detected are described. For example, a face region that satisfies at least one of the following criteria is determined as the region of interest.

(1) the face region with the largest area,
(2) a face region located near the center of the image,
(3) the face region with the highest face-likeness score, or
(4) the face region for which the weighted sum of its normalized area, position, and score is largest.
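Criterion (4) can be sketched as follows. The description only says that area, position, and score are normalized and combined in a weighted sum; the particular normalizations (relative to the largest area and highest score, and closeness to the image center for position), the default weights, and the dictionary layout are assumptions. The sketch also assumes a non-empty list of faces with positive areas and scores.

```python
import math

def pick_primary_face(faces, img_w, img_h, weights=(1.0, 1.0, 1.0)):
    """faces: non-empty list of dicts with x, y, w, h and score.

    Returns the face with the largest weighted sum of normalized
    area, centrality and score.
    """
    max_area = max(f["w"] * f["h"] for f in faces)
    max_score = max(f["score"] for f in faces)
    half_diag = math.hypot(img_w, img_h) / 2.0

    def total(f):
        area = f["w"] * f["h"] / float(max_area)
        cx = f["x"] + f["w"] / 2.0
        cy = f["y"] + f["h"] / 2.0
        # 1.0 at the image center, 0.0 at a corner.
        centrality = 1.0 - math.hypot(cx - img_w / 2.0,
                                      cy - img_h / 2.0) / half_diag
        score = f["score"] / float(max_score)
        return weights[0] * area + weights[1] * centrality + weights[2] * score

    return max(faces, key=total)
```

Setting two of the weights to zero reduces this to criteria (1), (2), or (3) respectively.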

As another criterion, the region of interest may be determined as follows.
(5) Adopt each face region as a separate region of interest,
(6) adopt a region that encompasses all the face regions as the region of interest, or
(7) combine face regions that are close to one another into a single region of interest.

First, with reference to FIGS. 14(a) and 14(b), the method of determining the region of interest when only one face region is detected is described. Suppose, as in FIG. 14(a), that a face 410 is detected in the image 400 and a rectangular face region 420 containing it is detected. This rectangular face region 420 may be used as the region of interest as it is, or, as in FIG. 14(b), a region 420 containing both the face 410 and the upper body 412 may be used as the region of interest.

FIGS. 15(a) and 15(b) illustrate how the region of interest is determined when multiple face regions are detected. As shown in FIG. 15(a), when the detected faces 410a to 410c are close together, that is, when the detected face regions are within a specified distance of one another, a region 420 encompassing those face regions is set as the region of interest. As shown in FIG. 15(b), when the detected face regions are far apart, the regions 420a to 420c containing the respective detected faces 410a to 410c are set as separate regions of interest. Even when multiple face regions are detected, a region containing the face and upper body may be used as the region of interest, as in FIG. 14(b).
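The grouping of nearby faces can be sketched as a simple clustering step. In this illustration, boxes are `(x, y, w, h)` tuples, "close" is taken to mean that the distance between face centers is at most `max_gap`, and faces are chained together transitively; the distance measure, the chaining, and the names are assumptions rather than details from the description.

```python
import math

def center_distance(a, b):
    """Distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    return math.hypot(ax - bx, ay - by)

def bounding_box(boxes):
    """Smallest (x, y, w, h) rectangle containing all boxes."""
    x0 = min(b[0] for b in boxes)
    y0 = min(b[1] for b in boxes)
    x1 = max(b[0] + b[2] for b in boxes)
    y1 = max(b[1] + b[3] for b in boxes)
    return (x0, y0, x1 - x0, y1 - y0)

def merge_close_faces(boxes, max_gap):
    """Group faces within max_gap of each other; return one ROI per group."""
    groups = []
    for box in boxes:
        merged, rest = [box], []
        for group in groups:
            if any(center_distance(box, other) <= max_gap for other in group):
                merged.extend(group)   # box links this group to the others
            else:
                rest.append(group)
        groups = rest + [merged]
    return [bounding_box(group) for group in groups]
```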

FIGS. 16(a) and 16(b) illustrate how the region of interest is determined when face regions of different sizes are detected. As shown in FIG. 16(a), when a large face 410c and small faces 410a and 410b are detected, the region 420 containing the larger face 410c is set as the region of interest. As shown in FIG. 16(b), when two large faces 410a and 410b and three small faces 410c, 410d, and 410e are detected, the regions 420a and 420b containing the two large faces 410a and 410b, respectively, are set as regions of interest. As one example of a criterion, when the ratio of the largest to the smallest size among the detected face regions exceeds a predetermined threshold, the largest face region is set as the region of interest. The second-largest and subsequent face regions may also be set as regions of interest.
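The size-ratio criterion can be sketched as follows. The sketch measures size by box area, keeps every face whose area is at least a fraction of the largest one (so that a second large face, as in FIG. 16(b), is also kept), and returns all faces unchanged when the sizes are similar. The threshold values and that fallback behavior are assumptions; the description does not specify them.

```python
def select_by_size(boxes, ratio_threshold=2.0, keep_fraction=0.8):
    """boxes: non-empty list of (x, y, w, h). Returns the boxes to use as ROIs."""
    areas = [w * h for (_, _, w, h) in boxes]
    largest, smallest = max(areas), min(areas)
    if largest <= ratio_threshold * smallest:
        # Sizes are similar: this criterion does not narrow the selection.
        return list(boxes)
    # Keep the largest face and any face of comparable size.
    return [b for b, a in zip(boxes, areas) if a >= keep_fraction * largest]
```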

The present invention has been described above on the basis of an embodiment. The embodiment is illustrative, and those skilled in the art will understand that various modifications can be made to the combinations of its constituent elements and processes, and that such modifications also fall within the scope of the present invention.

FIG. 1 is a configuration diagram of a chat system according to the embodiment.
FIG. 2 is a configuration diagram of the information processing apparatus of FIG. 1.
FIG. 3 is a configuration diagram of the encoding processing block of FIG. 2.
FIG. 4 is a functional configuration diagram of the ROI specifying unit of FIG. 3.
FIGS. 5(a) to 5(c) illustrate a method of eliminating discontinuity at the boundary between the region of interest and the non-region of interest.
FIG. 6 is a flowchart of the ROI encoding procedure performed by the encoding processing block of FIG. 3.
FIG. 7 is a flowchart of the face verification process of FIG. 6.
FIG. 8 is a flowchart of the tracking-based false detection determination process of FIG. 6.
FIG. 9 is a flowchart of the tracking-based face interpolation process of FIG. 6.
FIG. 10 illustrates how a region of interest is set within an image.
FIGS. 11(a) and 11(b) illustrate how the region of interest is determined when the user leaves the seat.
FIGS. 12(a) and 12(b) illustrate how the region of interest is determined when the user moves around the room.
FIGS. 13(a) and 13(b) illustrate a method of designating areas in which setting a region of interest is permitted or prohibited.
FIGS. 14(a) and 14(b) illustrate how the region of interest is determined when only one face region is detected.
FIGS. 15(a) and 15(b) illustrate how the region of interest is determined when a plurality of face regions are detected.
FIGS. 16(a) and 16(b) illustrate how the region of interest is determined when face regions of different sizes are detected.

Explanation of symbols

10 frame memory, 12 face region detection unit, 14 display control unit, 18 ROI specifying unit, 20 band information acquisition unit, 22 non-ROI filter, 24 video encoder, 26 video packetization unit, 28 audio encoder, 30 audio packetization unit, 32 multiplexing unit, 40 face verification unit, 42 tracking unit, 44 face region history storage unit, 46 ROI determination processing unit, 48 decision criterion storage unit, 50 ROI parameter adjustment unit, 100 information processing apparatus, 200 encoding processing block, 220 decoding processing block, 230 microphone, 240 camera, 250 speaker, 260 display, 270 communication unit, 300 network.

Claims

An encoding processing apparatus comprising:
a detection unit that detects a face region in a frame of a moving image;
a recording unit in which a history of the detected face region is recorded frame by frame;
a tracking unit that corrects the detection result of the face region produced by the detection unit by tracking the face region detected by the detection unit over a plurality of consecutive frames, with reference to the frame-by-frame detection history of the face region recorded in the recording unit;
a region-of-interest determination unit that determines a region of interest according to a predetermined criterion, based on the face region corrected by the tracking unit; and
an encoding unit that encodes the region of interest and the remaining region at different image qualities to generate an encoded video stream,
wherein, in a case where the region-of-interest determination unit has set, as the region of interest, a region including the face region detected by the detection unit while the user is seated, when the tracking unit detects that the user has moved away from the seat, the region-of-interest determination unit does not set a detected face region as the region of interest even if the detection unit detects a face region.

The encoding processing apparatus according to claim 1, wherein, even when a face region is detected in a given frame, if the face region is not continuously detected in a predetermined number of consecutive frames after that frame, the tracking unit determines the detection history of the face region detected in that frame to be invalid and deletes it from the recording unit.

The encoding processing apparatus according to claim 1, wherein, for a face region whose detection history exists in the recording unit, if the detection unit continues to fail to detect that face region in a predetermined number of consecutive frames, the tracking unit determines the detection history to be unnecessary and deletes it from the recording unit.

The encoding processing apparatus according to claim 1, wherein the tracking unit determines whether the face region detected in a given frame is a false detection by checking the detection result for that frame against the detection history of face regions in the frames before and after it, and, when the detection is determined to be false, deletes the history of the face region detected in that frame.

The encoding processing apparatus according to claim 4, wherein the tracking unit determines whether the face region detected in a given frame is a false detection by comparing it with the face regions detected in the frames before and after that frame with respect to at least one of the position and the size of the face.
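For illustration only, a comparison of this kind could be sketched as below. The claim does not fix the decision rule, so the rule used here (a detection is treated as false when it resembles no face region in either neighboring frame) is an assumption, as are the `(x, y, w, h)` box representation, the thresholds `max_shift` and `max_scale`, and both function names. The sketch checks position and size together, whereas the claim requires only at least one of them.

```python
from math import hypot

def similar(a, b, max_shift, max_scale):
    """True if boxes a and b, given as (x, y, w, h), are close in position and size."""
    ax, ay = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx, by = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    shift_ok = hypot(ax - bx, ay - by) <= max_shift
    big = max(a[2] * a[3], b[2] * b[3])
    small = min(a[2] * a[3], b[2] * b[3])
    scale_ok = small > 0 and big / small <= max_scale
    return shift_ok and scale_ok

def is_false_detection(face, prev_faces, next_faces, max_shift=20.0, max_scale=1.5):
    """Assumed rule: false detection if no neighboring-frame face resembles `face`."""
    neighbours = list(prev_faces) + list(next_faces)
    return not any(similar(face, other, max_shift, max_scale) for other in neighbours)
```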

The encoding processing apparatus according to claim 1, wherein, even if the detection unit does not detect a face region in the current frame, the tracking unit determines that the face region was missed in the current frame when a detection history of the face region exists in a past frame, and interpolates the face region information of the current frame determined to be missed by reusing the position information of the face region in the past frame as the position information of the face region in the current frame.
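A rough sketch of this interpolation, under stated assumptions, is given below. The class name `FaceInterpolator`, the single-face simplification, and the `max_missed` limit are hypothetical; the limit stands in for the predetermined number of consecutive frames after which the detection history is deleted, as recited in the claim on deleting unnecessary history.

```python
class FaceInterpolator:
    """Reuses the last known face region when detection misses a frame."""

    def __init__(self, max_missed=5):
        self.max_missed = max_missed  # hypothetical limit on consecutive misses
        self.last_face = None         # most recent face region (x, y, w, h)
        self.missed = 0               # consecutive frames without a detection

    def update(self, detected):
        """`detected` is a face region or None; returns the region to use for this frame."""
        if detected is not None:
            self.last_face = detected
            self.missed = 0
            return detected
        if self.last_face is None:
            return None  # no history, so nothing to interpolate from
        self.missed += 1
        if self.missed >= self.max_missed:
            # Too many consecutive misses: drop the history instead of interpolating.
            self.last_face = None
            self.missed = 0
            return None
        return self.last_face
```

Calling `update` once per frame with the detector's result yields either the detected region, the reused past region, or `None` once the history has been discarded.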

The encoding processing apparatus according to any one of claims 1 to 6, further comprising an imaging control unit that controls the direction of a camera capturing the face in accordance with the position of the face region tracked by the tracking unit.

The encoding processing apparatus according to any one of claims 1 to 6, further comprising an imaging control unit that controls the zoom of a camera capturing the face in accordance with the size of the face region tracked by the tracking unit.

An encoding processing method comprising:
a step of detecting a face region in a frame of a moving image;
a step of recording a history of the detected face region frame by frame;
a tracking step of correcting the detection result of the face region by tracking the detected face region over a plurality of consecutive frames, with reference to the recorded frame-by-frame detection history of the face region;
a step of determining a region of interest according to a predetermined criterion, based on the corrected face region; and
a step of encoding the region of interest and the remaining region at different image qualities to generate an encoded video stream,
wherein, in a case where the step of determining the region of interest has set, as the region of interest, a region including the face region detected by the detecting step while the user is seated, when the tracking step detects that the user has moved away from the seat, the step of determining the region of interest does not set a detected face region as the region of interest even if a face region is detected by the detecting step.

A program causing a computer to implement:
a detection function of detecting a face region in a frame of a moving image;
a recording function of recording a history of the detected face region frame by frame;
a tracking function of correcting the detection result of the face region by tracking the detected face region over a plurality of consecutive frames, with reference to the recorded frame-by-frame detection history of the face region;
a region-of-interest determination function of determining a region of interest according to a predetermined criterion, based on the corrected face region; and
an encoding function of encoding the region of interest and the remaining region at different image qualities to generate an encoded video stream,
wherein, in a case where the region-of-interest determination function has set, as the region of interest, a region including the face region detected by the detection function while the user is seated, when the tracking function detects that the user has moved away from the seat, the region-of-interest determination function does not set a detected face region as the region of interest even if a face region is detected by the detection function.