JP2010081241A - Device and method for encoding moving image

Info

Publication number: JP2010081241A
Application number: JP2008246595A
Authority: JP
Inventors: Daisuke Sakamoto; 大輔坂本
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2008-09-25
Filing date: 2008-09-25
Publication date: 2010-04-08
Anticipated expiration: 2028-09-25
Also published as: JP5274181B2

Abstract

PROBLEM TO BE SOLVED: To perform inter-frame prediction using motion compensation efficiently and with a small amount of computation when prediction among a plurality of reference frames is used.

SOLUTION: Face detection is performed on an image frame of moving image data, face parts such as the eyes, nose, and mouth are further detected from the detected face, and a state is determined for each detected face part. The determined states of the face parts are stored in association with information identifying the image frame on which face detection was performed. The states of the face parts detected from the image frame to be encoded are then compared with the stored states, and an image frame whose face-part state matches is retrieved and used as the reference frame for that face part. Each face part is then predictively encoded in units of motion compensation blocks using the reference frame retrieved for it.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a moving image encoding device and a moving image encoding method, and in particular to a device and method that encode moving image data using inter-frame prediction based on motion compensation.

In recent years, the resolution of moving image data has increased. In place of the conventional 720 × 480 pixel video, video of 1920 × 1080 pixels, known as full high-definition video, is now widely used, for example in digital terrestrial broadcasting. Because such high-resolution moving image data involves an enormous amount of data transmitted per unit time, compression coding techniques more efficient than conventional ones are required.

In response to these demands, ITU-T SG16 and ISO/IEC JTC1/SC29/WG11 have been standardizing compression coding schemes that use inter-frame prediction, which exploits the correlation between images. Among these, H.264/MPEG-4 Part 10 (AVC) (hereinafter referred to as H.264) is said to achieve the most efficient coding at present. The H.264 encoding and decoding specifications are described in, for example, Patent Document 1.

One of the techniques newly introduced in H.264 is the selection of the reference image used for inter-frame prediction from among a plurality of images (hereinafter referred to as multi-reference inter-frame prediction). Compared with the conventional MPEG-1 and MPEG-2 schemes, H.264 suppresses the accumulation of errors by performing the orthogonal transform with a Hadamard transform and an integer-precision DCT. It also combines intra-frame predictive coding with inter-frame predictive coding using motion compensation to achieve more accurate predictive coding.

In conventional coding schemes such as MPEG-1 and MPEG-2 (hereinafter referred to as MPEG coding schemes), forward prediction and backward prediction can be used for motion prediction. Forward prediction predicts a temporally later image frame from a temporally earlier one. Backward prediction predicts a temporally earlier image frame from a temporally later one. With backward prediction, an earlier image frame whose encoding was skipped can be predicted from the current image frame. Backward prediction is used together with forward prediction (this is called bidirectional prediction) to achieve a higher compression ratio for the image frame being encoded.

In the MPEG coding schemes, an image frame encoded by forward prediction is called a P picture, and an image frame encoded by bidirectional prediction is called a B picture. An image frame that is encoded entirely from its own data, without inter-frame prediction, is called an I picture.

In forward prediction and bidirectional prediction in the MPEG coding schemes, the reference frame referred to during motion prediction is determined in advance for each image frame to be processed. As an example, when encoding is performed in units of a GOP consisting of one I picture, four P pictures, and ten B pictures, and the subscripts denote the input (display) order of the image frames, the pictures are encoded in the following order.
I₃ B₁ B₂ P₆ B₄ B₅ P₉ B₇ B₈ P₁₂ B₁₀ B₁₁ P₁₅ B₁₃ B₁₄

In this case, the P₆ picture is predictively encoded using the I₃ picture as the reference frame, and the B₄ and B₅ pictures are predictively encoded using the I₃ and P₆ pictures as reference frames. Similarly, the P₉ picture is predictively encoded using the P₆ picture as the reference frame, and the B₇ and B₈ pictures are predictively encoded using the P₆ and P₉ pictures as reference frames.

In forward prediction and backward prediction in the MPEG coding schemes, an image frame temporally close to the image frame being processed is usually used as the reference frame for motion prediction. For example, as described above, a P picture is predictively encoded using the immediately preceding I picture or P picture as the reference frame. A B picture is predictively encoded using the immediately preceding and following I and P pictures, or the immediately preceding and following P pictures, as reference frames. This is because, in most cases, the image frame being processed is highly correlated with temporally neighboring image frames.

However, in these MPEG coding schemes, the advantage of inter-frame prediction using motion compensation may be lost when the image changes abruptly between image frames. When the image changes abruptly, even a temporally neighboring image frame has low correlation with the image frame being encoded. For example, when shooting a moving image that captures a person's facial expression, if the subject blinks or suddenly opens the mouth wide, for instance when laughing, the image changes within a short time, inter-frame prediction using motion compensation loses its advantage, and compression efficiency may fall.

H.264 addresses this problem by introducing multi-reference inter-frame prediction, in which inter-frame prediction for one image frame to be encoded uses a plurality of reference frames. With multi-reference inter-frame prediction, the reference frame can be selected flexibly for each block of the image frame being processed. For a P picture, for example, the encoder can go back as far as 15 P pictures and select the optimal picture for each motion compensation block as its reference frame.

In this way, H.264 can perform inter-frame prediction using motion compensation by selecting, from among a plurality of already encoded images, the image with the smallest error relative to the input image and using it as the reference frame. As a result, moving image data can be compressed and encoded efficiently even when, as described above, the image frame being encoded has low correlation with a temporally close reference frame.
Patent Document 1: JP 2005-167720 A

However, if the encoder always computes, for every block, which of the already encoded image frames has the smallest error relative to the input image frame, the amount of computation grows in proportion to the number of image frames referred to. As a result, encoding takes an enormous amount of time. In particular, in devices that must encode in real time during shooting, such as digital video cameras, the computation may not finish in time.
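The cost described above can be illustrated with a minimal Python sketch. It is not taken from the specification: the sum of absolute differences (SAD) is assumed as the error measure, only one candidate block per reference frame is compared (a real encoder would also search over motion vectors), and the names `sad` and `select_reference_exhaustive` are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def select_reference_exhaustive(target_block, candidate_blocks):
    """Compare the target block against every candidate reference block.

    Returns (best_index, best_cost, evaluations). The number of evaluations
    equals the number of candidates, so the work grows linearly with the
    number of reference frames.
    """
    best_index, best_cost = None, None
    evaluations = 0
    for index, candidate in enumerate(candidate_blocks):
        cost = sad(target_block, candidate)
        evaluations += 1
        if best_cost is None or cost < best_cost:
            best_index, best_cost = index, cost
    return best_index, best_cost, evaluations


# Illustrative 2x2 blocks: one block of the input frame and the co-located
# block from each of three already encoded frames.
target = [[10, 12], [14, 16]]
candidates = [
    [[0, 0], [0, 0]],
    [[10, 12], [14, 17]],
    [[11, 13], [15, 17]],
]
best_index, best_cost, evaluations = select_reference_exhaustive(target, candidates)
```

Every block of every frame repeats this loop over all candidates, which is the per-block cost that becomes a problem for real-time encoding.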

In addition, in devices designed for portable use, such as digital video cameras, a higher computational load increases battery consumption, so its effect on the available shooting time cannot be ignored.

Thus, conventionally, it has been difficult to perform inter-frame prediction using motion compensation efficiently and with a small amount of computation when multi-reference inter-frame prediction is used.

An object of the present invention is to provide a moving image encoding device and a moving image encoding method that can perform inter-frame prediction using motion compensation efficiently and with a small amount of computation, particularly when moving image data containing a human face is encoded using multi-reference inter-frame prediction.

To solve the above problem, the present invention is a moving image encoding device that performs motion-compensated inter-frame predictive encoding using the result of motion detection performed against a reference frame in units of blocks obtained by dividing an image frame to be encoded, the reference frame being selectable from a plurality of reference frame candidates, the device comprising: input image frame storage means for temporarily storing input image frames; reference candidate frame storage means for storing a plurality of reference candidate frames generated from the image frames stored in the input image frame storage means; determination means for detecting a face region in an image frame output from the input image frame storage means, further detecting face parts included in the face region, and determining the state of each face part; face part information storage means for storing the state of each face part determined by the determination means in association with information identifying the image frame that contains the face part; search means for searching the face part information storage means for information identifying an image frame whose face-part state matches the state that the determination means detected and determined from the encoding target frame output from the input image frame storage means, and which the encoding target frame can refer to; and reference frame determination means for determining, from among the plurality of reference candidate frames stored in the reference candidate frame storage means, the reference candidate frame corresponding to the information found by the search means as the reference frame for the blocks corresponding to the face part, wherein motion-compensated inter-frame predictive encoding of the encoding target frame is performed using the reference frame determined by the reference frame determination means.

The present invention is also a moving image encoding method that performs motion-compensated inter-frame predictive encoding using the result of motion detection performed against a reference frame in units of blocks obtained by dividing an image frame to be encoded, the reference frame being selectable from a plurality of reference frame candidates, the method comprising: an input image frame storage step of temporarily storing input image frames in input image frame storage means; a reference candidate frame storage step of storing, in reference candidate frame storage means, a plurality of reference candidate frames generated from the image frames stored in the input image frame storage means; a determination step of detecting a face region in an image frame output from the input image frame storage means, further detecting face parts included in the face region, and determining the state of each face part; a face part information storage step of storing, in face part information storage means, the state of each face part determined in the determination step in association with information identifying the image frame that contains the face part; a search step of searching the face part information storage means for information identifying an image frame whose face-part state matches the state detected and determined in the determination step from the encoding target frame output from the input image frame storage means, and which the encoding target frame can refer to; and a reference frame determination step of determining, from among the plurality of reference candidate frames stored in the reference candidate frame storage means, the reference candidate frame corresponding to the information found in the search step as the reference frame for the blocks corresponding to the face part, wherein motion-compensated inter-frame predictive encoding of the encoding target frame is performed using the reference frame determined in the reference frame determination step.

With the above configuration, the present invention can perform inter-frame prediction using motion compensation efficiently and with a small amount of computation, particularly when moving image data containing a human face is encoded using multi-reference inter-frame prediction.

Embodiments of the present invention are described below. A moving image encoding device to which the present invention is applied performs motion-compensated inter-frame predictive encoding using the result of motion detection performed against a reference frame in units of blocks obtained by dividing the image frame to be encoded. The reference frame can be selected from a plurality of reference frame candidates.

In such a moving image encoding device, the present invention performs face detection on the image frames of the moving image data, further detects face parts such as the eyes, nose, and mouth from each detected face, and determines the state of each detected face part. The determined states are stored in association with information identifying the image frame on which face detection was performed. The state of each face part detected in the image frame to be encoded is then compared with the stored states, the image frame corresponding to a matching state is retrieved, and that frame is used as the reference frame for the face part. Each face part is then encoded by motion-compensated inter-frame predictive encoding, in units of motion compensation blocks, using the reference frame retrieved for that face part.

As a result, optimal predictive encoding can be performed for each face part even when the face parts of the subject's face move abruptly. In addition, the reference frame search uses the face-part states that were stored when face parts were detected in the image frames being encoded. Inter-frame predictive encoding using motion compensation can therefore be performed efficiently and with a small amount of computation.

FIG. 1 shows an example configuration of an encoding device 100 according to an embodiment of the present invention. The encoding device 100 performs motion detection on supplied baseband moving image data in units of blocks obtained by dividing one screen into blocks of a predetermined size, and performs inter-frame predictive encoding using motion compensation. Encoding combines an orthogonal transform using a Hadamard transform and an integer-precision DCT, quantization of the transform coefficients, intra-frame predictive encoding, and inter-frame predictive encoding using motion compensation, followed by entropy coding.

In the following, the orthogonal transform using the Hadamard transform and the integer-precision DCT is called the integer transform, and intra-frame predictive encoding and inter-frame predictive encoding are called intra coding and inter coding, respectively.

Inter coding produces P pictures, in which each motion compensation block (the unit of motion compensation) is predicted from a temporally earlier reference frame. Inter coding also produces B pictures, in which each motion compensation block is predicted from up to two reference frames that precede and/or follow it in time. Intra coding produces I pictures. Inter coding and intra coding thus produce several picture types with different temporal reference relationships. In inter-frame predictive encoding, moving image data is encoded as data with a GOP structure in which I, P, and B pictures are arranged in a predetermined pattern.

For example, when the encoding device 100 forms one GOP from 15 frames consisting of one I picture, four P pictures, and ten B pictures, picture types are assigned to the frames input to the encoding device 100 in the following order. The subscripts indicate the input order or display order.
B₁ B₂ I₃ B₄ B₅ P₆ B₇ B₈ P₉ B₁₀ B₁₁ P₁₂ B₁₃ B₁₄ P₁₅

Because a B picture can be predictively encoded using both a past picture and a future picture in time series, encoding is performed with the B pictures reordered relative to the I and P pictures, for example in the following order. The B₁ and B₂ pictures that follow the I₃ picture are predictively encoded using the I₃ picture and the P₁₅ picture of the immediately preceding GOP.
I₃ B₁ B₂ P₆ B₄ B₅ P₉ B₇ B₈ P₁₂ B₁₀ B₁₁ P₁₅ B₁₃ B₁₄
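The reordering from input (display) order to encoding order can be sketched in Python as follows. This is an illustration under simple assumptions rather than the device's actual implementation: pictures are represented as strings such as "B1", and the function name `to_coding_order` is hypothetical.

```python
def to_coding_order(display_order):
    """Reorder pictures so each I or P picture precedes the B pictures
    that come before it in display order."""
    coding_order = []
    pending_b = []
    for picture in display_order:
        if picture[0] == "B":
            # A B picture must wait until the following anchor is encoded.
            pending_b.append(picture)
        else:
            # An anchor (I or P) is encoded first, then the waiting B pictures.
            coding_order.append(picture)
            coding_order.extend(pending_b)
            pending_b = []
    # Trailing B pictures would wait for the next GOP's anchor; this sketch
    # simply appends them.
    coding_order.extend(pending_b)
    return coding_order


display = ["B1", "B2", "I3", "B4", "B5", "P6", "B7", "B8", "P9",
           "B10", "B11", "P12", "B13", "B14", "P15"]
coding = to_coding_order(display)
print(" ".join(coding))
```

For the 15-frame GOP of the example, the printed sequence matches the encoding order listed above.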

The encoding device 100 is controlled by a CPU (not shown) according to a predetermined program. The CPU may be dedicated to controlling the encoding device 100, or it may control a higher-level system in which the encoding device 100 is incorporated. The CPU has a ROM and a RAM (not shown), operates according to a program stored in advance in the ROM using the RAM as work memory, and controls each unit of the encoding device 100.

Baseband moving image data 50 is input to the encoding device 100 frame by frame in the input order described above and is temporarily stored in a current frame storage unit 10, which consists of a frame memory and serves as the input image frame storage means. The image frames stored in the current frame storage unit 10 are rearranged into the encoding order described above and, for encoding, are divided into macroblocks of a predetermined size (for example, 16 × 16 pixels) and read out. The macroblocks are read out by scanning, for example, horizontally from the left edge to the right edge of the screen and repeating this scan in the vertical direction. Coordinate information within the image frame is defined for each macroblock, for example according to the scan order.
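The raster scan of macroblocks and the coordinate information derived from it can be sketched as follows. The sketch assumes 16 × 16 macroblocks, a frame whose height is rounded up to a whole number of macroblock rows, and coordinates expressed as the top-left pixel of each macroblock; none of these details is fixed by the description, and the function name `macroblock_origins` is hypothetical.

```python
MACROBLOCK_SIZE = 16  # pixels per side, as in the example above


def macroblock_origins(frame_width, frame_height, size=MACROBLOCK_SIZE):
    """Yield (index, x, y) for each macroblock in raster-scan order.

    x and y are the top-left pixel coordinates of the macroblock. Partial
    macroblocks at the right or bottom edge are counted as whole ones.
    """
    blocks_per_row = (frame_width + size - 1) // size
    blocks_per_column = (frame_height + size - 1) // size
    for index in range(blocks_per_row * blocks_per_column):
        x = (index % blocks_per_row) * size   # left to right within a row
        y = (index // blocks_per_row) * size  # then top to bottom
        yield index, x, y


blocks = list(macroblock_origins(1920, 1080))
```

With these assumptions a 1920 × 1080 frame has 120 macroblocks per row and 68 rows, and the macroblock index alone is enough to recover its position in the frame.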

In addition, the image frame of the moving image data 50 that corresponds to the image data read out in macroblock units is read from the current frame storage unit 10 and supplied to a face detection unit 32. Hereinafter, the image frame corresponding to the image data read from the current frame storage unit 10 in macroblock units for encoding is called the encoding target frame.

The face detection unit 32 detects face regions containing a human face in the encoding target frame supplied from the current frame storage unit 10. Information indicating each detected face region is supplied to a facial expression recognition unit 33 together with identification information for the encoding target frame on which face detection was performed. Based on the face region information supplied from the face detection unit 32, the facial expression recognition unit 33 determines the state of each part of the face (hereinafter called a face part). Here, face parts are taken to be the parts of the face in which movement is considered to occur frequently, such as the left eye, the right eye, and the mouth. For example, the facial expression recognition unit 33 determines the state of at least one of the left eye, the right eye, and the mouth. The face detection unit 32 and the facial expression recognition unit 33 constitute the determination means.

The determination result for each face part is associated with the coordinate information of the macroblocks containing that face part and with the identification information of the encoding target frame, and is stored cumulatively in a memory (not shown) of the facial expression recognition unit 33. The facial expression recognition unit 33 thus also constitutes the face part information storage means. The processing in the facial expression recognition unit 33 is described in detail later.
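One possible shape for the cumulatively stored information is sketched below. The record layout, the state labels such as "open" and "closed", and the names `FacePartRecord` and `FacePartStore` are assumptions made for illustration; the description only requires that each state be associated with macroblock coordinates and a frame identifier.

```python
from collections import namedtuple

# One stored determination result: which frame, which face part, its state,
# and the macroblock coordinates that contain the part.
FacePartRecord = namedtuple(
    "FacePartRecord", ["frame_id", "part", "state", "macroblocks"])


class FacePartStore:
    """Accumulates face-part states across frames."""

    def __init__(self):
        self._records = []

    def add(self, frame_id, part, state, macroblocks):
        """Append one determination result; earlier results are kept."""
        self._records.append(
            FacePartRecord(frame_id, part, state, tuple(macroblocks)))

    def records_for(self, part):
        """Return every stored record for the given face part."""
        return [record for record in self._records if record.part == part]


store = FacePartStore()
store.add(frame_id=3, part="left_eye", state="open", macroblocks=[(5, 4)])
store.add(frame_id=3, part="mouth", state="closed", macroblocks=[(5, 6), (6, 6)])
store.add(frame_id=6, part="mouth", state="open", macroblocks=[(5, 6), (6, 6)])
```

Keeping one record per face part per frame lets a later search compare states without touching any pixel data.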

Various methods can be used by the face detection unit 32 to detect face regions; for example, the method described in Japanese Patent Laid-Open No. 2001-309225 can be used. In this method, the image data is first searched for a central region that is likely to contain skin, judged by color and shape, and for a peripheral region that is likely to contain hair, also judged by color and shape. Based on the result, a first face candidate detection algorithm uses a pattern recognition operator to find regions likely to contain a face. A second algorithm then confirms by pattern matching that a face is present in each face candidate region found by the first algorithm, and the two algorithms are used together to detect faces.

The facial expression recognition unit 33 can analyze the state of each face part in the face region as follows, for example. First, the image is binarized, with the skin-colored area of the face set to "0" and everything else set to "1". The center of gravity of the face is then found from the skin-colored area, and the positions of the holes diagonally above the center of gravity are taken to be the eye regions. If a hole cannot be detected, the corresponding eye is judged to be closed. Based on the general structure of the human body, a predetermined position below the center of gravity of the face region, on the perpendicular bisector of the line between the right eye and the left eye, is taken to be the mouth region. If the mouth region occupies at least a predetermined proportion of the face region, the mouth is judged to be open.
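A heavily simplified Python sketch of this analysis is given below. It assumes the face region has already been cropped and binarized into a small mask (0 for skin, 1 for non-skin), treats any non-skin pixel diagonally above the center of gravity as an eye hole, and treats all non-skin pixels below the center of gravity as the mouth region. The 5% threshold, the use of image-left and image-right for the two eyes, and the function names are assumptions, not values from the specification.

```python
def skin_centroid(mask):
    """Return the (row, column) center of gravity of skin pixels (value 0).

    Assumes the mask contains at least one skin pixel.
    """
    rows = cols = count = 0
    for r, line in enumerate(mask):
        for c, value in enumerate(line):
            if value == 0:
                rows += r
                cols += c
                count += 1
    return rows / count, cols / count


def face_part_states(mask, mouth_ratio_threshold=0.05):
    """Classify both eyes and the mouth as "open" or "closed"."""
    centre_row, centre_col = skin_centroid(mask)
    face_area = len(mask) * len(mask[0])
    upper_left = upper_right = lower = 0
    for r, line in enumerate(mask):
        for c, value in enumerate(line):
            if value != 1:
                continue
            if r < centre_row and c < centre_col:
                upper_left += 1    # hole up and to the left of the centroid
            elif r < centre_row and c > centre_col:
                upper_right += 1   # hole up and to the right of the centroid
            elif r > centre_row:
                lower += 1         # non-skin pixels below the centroid
    return {
        "left_eye": "open" if upper_left else "closed",
        "right_eye": "open" if upper_right else "closed",
        "mouth": "open" if lower / face_area >= mouth_ratio_threshold else "closed",
    }


face = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],  # two eye holes
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],  # mouth
    [0, 0, 0, 0, 0, 0, 0],
]
states = face_part_states(face)

wink = [row[:] for row in face]
wink[1][5] = 0  # the hole on the right disappears when that eye closes
wink_states = face_part_states(wink)
```

A practical implementation would detect holes as connected regions and locate the mouth on the bisector between the eyes, as the description states; the sketch only shows how a binarized mask can yield discrete states.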

Meanwhile, the image data read from the current frame storage unit 10 in macroblock units is input to the minuend input of a subtractor 11 and is also supplied to a motion detection unit 23. The motion detection unit 23 detects motion vectors in the image data supplied from the current frame storage unit 10 and outputs the detected motion vector information to an inter prediction unit 22 and an entropy encoding unit 16.

The subtractor 11 subtracts the predicted image data output from a switch 26, described later, from the image data supplied to its minuend input to generate image residual data. An orthogonal transform unit 12 converts the image residual data into DCT coefficients by orthogonal transform processing such as the Hadamard transform and the integer-precision DCT.
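As an illustration of an integer-precision transform, the sketch below applies the 4 × 4 forward core transform used in H.264 to a residual block, computing Y = C·X·Cᵀ with integer arithmetic only. It is assumed here, not stated in the description, that the orthogonal transform unit 12 uses this particular matrix; the scaling that turns the result into true DCT coefficients is normally folded into quantization and is omitted. The helper names are hypothetical.

```python
# 4x4 forward core transform matrix of H.264 (integer approximation of the DCT).
CORE_4X4 = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*matrix)]


def forward_core_transform(residual):
    """Compute C * X * C^T for a 4x4 residual block X."""
    return matmul(matmul(CORE_4X4, residual), transpose(CORE_4X4))


# A flat residual block: all of its energy should land in the DC coefficient.
residual = [[3, 3, 3, 3] for _ in range(4)]
coefficients = forward_core_transform(residual)
```

Because every operation is on integers, the encoder and decoder compute identical results, which is how an integer transform avoids the accumulation of mismatch errors.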

The DCT coefficients are quantized by a quantization unit 13 using a predetermined quantization parameter. The quantization parameter has a predetermined relationship to the quantization step used when quantizing the DCT coefficients; for example, it is defined so that the quantization parameter is proportional to the logarithm of the quantization step. The quantized values output from the quantization unit 13 are supplied to the entropy encoding unit 16.
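One concrete instance of such a logarithmic relationship is the H.264 mapping, in which the quantization step doubles each time the quantization parameter (QP) increases by 6. The sketch below assumes that mapping and a plain round-to-nearest quantizer; the description does not say that the quantization unit 13 works exactly this way, and the function names are hypothetical.

```python
# Quantization step sizes for QP 0..5 in H.264; the step doubles every 6 QP.
BASE_QSTEP = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)


def quantization_step(qp):
    """Return the quantization step for a QP in the H.264 range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return BASE_QSTEP[qp % 6] * (2 ** (qp // 6))


def quantize(coefficient, qp):
    """Divide by the step and round half away from zero."""
    magnitude = int(abs(coefficient) / quantization_step(qp) + 0.5)
    return magnitude if coefficient >= 0 else -magnitude


def dequantize(level, qp):
    """Reconstruct an approximate coefficient from a quantized level."""
    return level * quantization_step(qp)


level = quantize(48, 28)               # the step at QP 28 is 16.0
reconstructed = dequantize(level, 28)
```

Since log2 of the step grows linearly with QP, a fixed change in QP always scales the step by the same factor, which is the proportionality the paragraph above describes.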

量子化部１３から出力された量子化値は、逆量子化部１７にも供給される。量子化値は、逆量子化部１７で逆量子化され、逆直交変換部１８で逆直交変換され、ローカルデコード画像データとされる。ローカルデコード画像データは、スイッチ２６から出力される予測画像データが加算器１９で加算され、復元画像データが形成される。復元画像データは、フレームメモリ２４に格納されると共に、デブロッキングフィルタ２０で符号化歪みを軽減されてフレームメモリからなる復元画像フレーム保存部３０に格納される。参照候補フレーム保存手段としての復元画像フレーム保存部３０は、複数フレーム分の復元画像データを格納可能とされている。 The quantized value output from the quantization unit 13 is also supplied to the inverse quantization unit 17. The quantized value is inversely quantized by the inverse quantization unit 17 and inversely orthogonally transformed by the inverse orthogonal transform unit 18 to be locally decoded image data. The predicted image data output from the switch 26 is added to the local decoded image data by the adder 19 to form restored image data. The restored image data is stored in the frame memory 24 and is also stored in the restored image frame storage unit 30 including a frame memory after the coding distortion is reduced by the deblocking filter 20. The restored image frame storage unit 30 as reference candidate frame storage means can store restored image data for a plurality of frames.

検索手段および参照フレーム決定手段としての参照フレーム決定部３１は、参照フレームとして用いるデータを選択および決定する。本発明の実施形態においては、参照フレーム決定部３１は、顔表情認識部３３における顔パーツ状態の判定結果に基づき、復元画像フレーム保存部３０に格納された復元画像データの中から、参照フレームを選択し決定することができる。 The reference frame determination unit 31, which serves as the search means and the reference frame determination means, selects and determines the data to be used as a reference frame. In the embodiment of the present invention, the reference frame determination unit 31 can select and determine a reference frame from the restored image data stored in the restored image frame storage unit 30, based on the determination result of the face part states in the facial expression recognition unit 33.

すなわち、参照フレーム決定部３１は、顔表情認識部３３における符号化対象フレームに対する顔パーツ状態の判定結果と、顔表情認識部３３に保存された顔パーツ状態の判定結果とを比較する。比較の結果、顔表情認識部３３に保存された顔パーツ状態のうち、符号化対象フレームに対する顔パーツ状態の判定結果と一致するものを検索する。そして、復元画像フレーム保存部３０に格納された復元画像フレームのうち、検索結果として得られた顔パーツ状態に対応する復元画像フレームを、参照フレームに決定し、参照フレーム保存部２１に保存する。 That is, the reference frame determination unit 31 compares the determination result of the facial part state for the encoding target frame in the facial expression recognition unit 33 with the determination result of the facial part state stored in the facial expression recognition unit 33. As a result of the comparison, the face part states stored in the facial expression recognizing unit 33 are searched for those that match the determination result of the face part state for the encoding target frame. Then, among the restored image frames stored in the restored image frame storage unit 30, a restored image frame corresponding to the face part state obtained as a search result is determined as a reference frame and stored in the reference frame storage unit 21.

なお、参照フレーム決定部３１における処理は、各顔パーツのそれぞれについて行われる。つまり、各顔パーツのそれぞれについて、参照フレームを決定することができる。なお、参照フレーム決定部３１における処理の詳細は、後述する。 The process in the reference frame determination unit 31 is performed for each face part. That is, the reference frame can be determined for each face part. Details of processing in the reference frame determination unit 31 will be described later.

イントラ予測部２５は、フレームメモリ２４に格納された復元画像データを用いてフレーム内予測処理を行い、予測画像データを生成する。イントラ予測部２５から出力されたイントラ予測画像データは、スイッチ２６の入力端２６Ａに供給される。 The intra prediction unit 25 performs an intra-frame prediction process using the restored image data stored in the frame memory 24, and generates predicted image data. The intra prediction image data output from the intra prediction unit 25 is supplied to the input terminal 26A of the switch 26.

動き検出部２３は、参照フレーム決定部３１で決定された参照フレームを用いて、現在フレーム保存部１０からマクロブロック単位で供給された画像データの動き検出を行う。インター予測部２２は、参照フレーム保存部２１に格納された復元画像データと、動き検出部２３により検出された動きベクトルとに基づきフレーム間予測処理を行い、インター予測画像データを生成する。インター予測画像データは、スイッチ２６の入力端２６Ｂに供給される。 The motion detection unit 23 performs motion detection of the image data supplied from the current frame storage unit 10 in units of macroblocks using the reference frame determined by the reference frame determination unit 31. The inter prediction unit 22 performs inter-frame prediction processing based on the restored image data stored in the reference frame storage unit 21 and the motion vector detected by the motion detection unit 23, and generates inter prediction image data. The inter prediction image data is supplied to the input terminal 26B of the switch 26.

スイッチ２６は、イントラ予測およびインター予測の何方を用いるかを選択する。イントラ予測部２５から出力されたイントラ予測画像データと、インター予測部２２から出力されたインター予測画像データとのうち一方を選択し、選択された予測画像データを減算器１１の減算入力に供給すると共に、加算器１９に供給する。 The switch 26 selects which of intra prediction and inter prediction is used. One of the intra prediction image data output from the intra prediction unit 25 and the inter prediction image data output from the inter prediction unit 22 is selected, and the selected prediction image data is supplied to the subtraction input of the subtractor 11. At the same time, it is supplied to the adder 19.

エントロピー符号化部１６は、量子化部１３から供給された量子化パラメータおよび動き検出部２３から出力された動きベクトル情報をエントロピー符号化する。また、エントロピー符号化部１６は、イントラ符号化およびインター符号化の何れを行ったかを示す情報（マクロブロックタイプ）や、インター予測の際に用いた参照フレームを、マクロブロック単位で示す情報をさらにエントロピー符号化する。エントロピー符号化部１６の出力は、例えば画面の並び順に従って符号か配列された符号化ストリームとして、符号化装置１００から出力される。 The entropy encoding unit 16 entropy-encodes the quantization parameter supplied from the quantization unit 13 and the motion vector information output from the motion detection unit 23. The entropy encoding unit 16 also entropy-encodes information indicating whether intra coding or inter coding was performed (the macroblock type) and information indicating, in units of macroblocks, the reference frame used for inter prediction. The output of the entropy encoding unit 16 is output from the encoding device 100 as an encoded stream in which the codes are arranged, for example, according to the order of the pictures.

次に、参照フレーム決定部３１による参照フレーム決定処理について、より詳細に説明する。図２は、本発明の実施形態による参照フレーム決定の一例の処理を示すフローチャートである。図２の各ステップは、例えば符号化装置１００の全体を制御する図示されないＣＰＵにより実行および／または制御される。 Next, reference frame determination processing by the reference frame determination unit 31 will be described in more detail. FIG. 2 is a flowchart showing an example of reference frame determination processing according to the embodiment of the present invention. Each step of FIG. 2 is executed and / or controlled by a CPU (not shown) that controls the entire encoding apparatus 100, for example.

ステップＳ１０で、顔検出部３２により、符号化対象フレームにおける顔領域が検出される。次のステップＳ１１で、顔表情認識部３３で、顔検出部３２で検出された顔領域に含まれる顔パーツを検出すると共に、検出された各顔パーツの状態を判定する。 In step S10, the face detection unit 32 detects a face area in the encoding target frame. In the next step S11, the facial expression recognition unit 33 detects the face parts included in the face area detected by the face detection unit 32, and determines the state of each detected face part.

一例として、図３（ａ）に示されるような符号化対象フレーム２００に対して顔検出を行い、検出された顔領域から左目、右目および口の各顔パーツを検出する。なお、図３（ａ）および以下の同様の図において符号化対象フレーム２００に格子で示されるブロックは、マクロブロックであるものとし、左上隅のブロックのブロック座標を（０，０）とする。 As an example, face detection is performed on the encoding target frame 200 as shown in FIG. 3A, and the left eye, right eye, and mouth face parts are detected from the detected face area. In FIG. 3A and the following similar figures, a block indicated by a lattice in the encoding target frame 200 is assumed to be a macro block, and the block coordinates of the block at the upper left corner are (0, 0).

本実施形態では、各顔パーツにおける顔表情の一例として、左目、右目および口の各顔パーツについて、各々、開いている場合を状態情報「１」、閉じている場合を状態情報「０」として解析結果を保存しておくものとする。図３（ａ）の例では、符号化対象フレーム２００から検出された顔領域中の各顔パーツについて、左目２１０および右目２１１が開いており、口２１２が閉じていることが、顔表情認識部３３において判定される。したがって、図３（ｂ）に例示されるように、左目２１０および右目２１１の状態情報が「１」、口２１２の状態情報が「０」になる。 In the present embodiment, as an example of the facial expression of each face part, the analysis result for each of the left eye, right eye, and mouth is stored as state information “1” when the part is open and state information “0” when it is closed. In the example of FIG. 3A, the facial expression recognition unit 33 determines that, among the face parts in the face area detected from the encoding target frame 200, the left eye 210 and the right eye 211 are open and the mouth 212 is closed. Therefore, as illustrated in FIG. 3B, the state information of the left eye 210 and the right eye 211 is “1”, and the state information of the mouth 212 is “0”.

また、左目２１０がブロック座標（３，３）および（４，４）で対角座標を示される矩形領域、右目２１１がブロック座標（５，３）および（６，４）で対角座標を示される矩形領域に含まれる。また、口２１２がブロック座標（４，５）、（７，５）で対角座標を示される矩形領域に含まれる。 The left eye 210 is contained in the rectangular area whose diagonal corners are given by block coordinates (3, 3) and (4, 4), and the right eye 211 is contained in the rectangular area whose diagonal corners are given by block coordinates (5, 3) and (6, 4). The mouth 212 is contained in the rectangular area whose diagonal corners are given by block coordinates (4, 5) and (7, 5).
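
One way to hold this per-part information is a small record that pairs the state with the macroblock rectangle. The sketch below is illustrative only: the `FacePart` class and its field names are hypothetical, and the state is written as the string "open" or "closed" rather than as the numeric state information used in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FacePart:
    """State and macroblock rectangle of one detected face part."""
    name: str
    state: str            # "open" or "closed" (stored numerically in the text)
    top_left: tuple       # (x, y) block coordinates of one corner
    bottom_right: tuple   # (x, y) block coordinates of the diagonal corner

    def macroblocks(self):
        """List every (x, y) block coordinate inside the rectangle, row by row."""
        (x0, y0), (x1, y1) = self.top_left, self.bottom_right
        return [(x, y)
                for y in range(y0, y1 + 1)
                for x in range(x0, x1 + 1)]


# Face parts of the encoding target frame 200 as described above.
left_eye = FacePart("left_eye", "open", (3, 3), (4, 4))
right_eye = FacePart("right_eye", "open", (5, 3), (6, 4))
mouth = FacePart("mouth", "closed", (4, 5), (7, 5))
```

Here `left_eye.macroblocks()` enumerates the four blocks (3, 3), (4, 3), (3, 4), and (4, 4), which are the blocks that would be predicted from the reference frame chosen for the left eye.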

説明は図２のフローチャートに戻り、ステップＳ１１で顔表情認識部３３において各顔パーツの状態が判定されると、処理はステップＳ１２に移行される。ステップＳ１２以下では、各顔パーツについて、参照フレームを決定するための処理が順次行われる。ここでは、顔パーツについて、左目、右目、口の順に処理を行うものとする。 Returning to the flowchart of FIG. 2, when the facial expression recognition unit 33 determines the state of each facial part in step S11, the process proceeds to step S12. In step S12 and subsequent steps, processing for determining a reference frame is sequentially performed for each face part. Here, the face parts are processed in the order of left eye, right eye, and mouth.

ステップＳ１２では、参照フレーム決定部３１において、符号化対象フレームにおける判定対象の顔パーツの状態と、顔表情認識部３３に記憶される顔パーツの状態とが比較される。判定対象の顔パーツに対応する顔パーツは、例えば顔パーツの座標情報や、顔領域における顔パーツの位置関係などに基づき判断することが考えられる。参照フレーム決定部３１は、顔表情認識部３３に記憶されている顔パーツの状態情報のうち、符号化対象フレームに対して時間的に直近に位置するフレームに対応する顔パーツの状態情報を取得する。 In step S12, the reference frame determination unit 31 compares the state of the face part under determination in the encoding target frame with the face part states stored in the facial expression recognition unit 33. The stored face part that corresponds to the face part under determination can be identified based on, for example, the coordinate information of the face part or the positional relationship of the face parts within the face area. From the face part state information stored in the facial expression recognition unit 33, the reference frame determination unit 31 acquires the state information of the face part for the frame that is temporally closest to the encoding target frame.

なお、以下では、復元画像フレーム保存部３０に格納される復元画像フレームを参照候補フレームと呼ぶ。すなわち、参照フレーム決定部３１は、符号化対象フレームについて判定された顔パーツの状態情報と、顔表情認識部３３に記憶されている顔パーツの状態情報とを比較した結果に基づき、復元画像フレーム保存部３０から復元画像フレームを読み出す。この復元画像フレームを参照フレームとして、動き検出部２３による動き検出と、インター予測部２２におけるインター予測とを行う。 Hereinafter, a restored image frame stored in the restored image frame storage unit 30 is referred to as a reference candidate frame. That is, the reference frame determination unit 31 reads a restored image frame from the restored image frame storage unit 30 based on the result of comparing the face part state information determined for the encoding target frame with the face part state information stored in the facial expression recognition unit 33. Using this restored image frame as the reference frame, motion detection by the motion detection unit 23 and inter prediction by the inter prediction unit 22 are performed.

また、符号化対象フレームに対して時間的に直近とは、当該符号化対象フレームのピクチャタイプに基づき、当該符号化対象フレームが参照可能な参照候補フレームのうち、当該符号化対象フレームに時間的に最も近いことをいう。例えば、符号化対象フレームのピクチャタイプがＰピクチャであれば、参照候補フレームは、当該符号化対象フレームに時間的に最も近い位置にある過去のＰまたはＩピクチャを指す。また例えば、符号化対象フレームのピクチャタイプがＢピクチャであれば、参照候補フレームは、当該符号化対象フレームに時間的に最も近い位置にある過去のＢ、ＰまたはＩピクチャを指す。 Here, “temporally closest to the encoding target frame” means closest in time to the encoding target frame among the reference candidate frames that the encoding target frame can refer to, as determined by its picture type. For example, if the encoding target frame is a P picture, the reference candidate frame is the past P or I picture that is temporally closest to it. If the encoding target frame is a B picture, the reference candidate frame is the past B, P, or I picture that is temporally closest to it.
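
The ordering implied by this definition can be sketched as below. It is a simplified illustration that follows only the rule stated in the text (a P picture may refer to past I or P pictures, a B picture to past I, P, or B pictures); the function name, the table `REFERABLE_TYPES`, and the use of integer frame indices in temporal order are assumptions.

```python
# Picture types that a frame of each type may refer to, as stated in the text.
REFERABLE_TYPES = {"P": {"I", "P"}, "B": {"I", "P", "B"}}


def candidates_nearest_first(target_index, target_type, stored_frames):
    """Return indices of referable past frames, temporally nearest first.

    stored_frames maps a frame index (assumed to increase with time) to the
    picture type of the restored frame stored for that index.
    """
    allowed = REFERABLE_TYPES[target_type]
    past = [index for index, picture_type in stored_frames.items()
            if index < target_index and picture_type in allowed]
    return sorted(past, key=lambda index: target_index - index)
```

The first element of the returned list is the "temporally closest" reference candidate frame; the remaining elements give the order in which further candidates would be examined.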

ステップＳ１２の比較の結果に基づき、符号化対象フレームと参照候補フレームとで、対応する顔パーツの状態情報が一致するか否かがステップＳ１３で判定される。若し、一致すると判定されれば、処理はステップＳ１４に移行され、参照フレーム決定部３１は、判定対象の顔パーツについて、参照候補フレームを参照フレームに決定する。そして、処理はステップＳ１５に移行される。 Based on the result of the comparison in step S12, it is determined in step S13 whether or not the state information of the corresponding face part matches between the encoding target frame and the reference candidate frame. If it is determined that they match, the process proceeds to step S14, and the reference frame determination unit 31 determines a reference candidate frame as a reference frame for the face part to be determined. Then, the process proceeds to step S15.

一方、ステップＳ１３で、顔パーツの状態情報が符号化対象フレームと参照候補フレームとで一致しないと判定されたら、処理はステップＳ１６に移行される。ステップＳ１６では、ステップＳ１２で比較が行われた参照候補フレームが最後の参照候補フレームであるか否かが判定される。ここで、最後の参照候補フレームとは、参照候補フレームとして用いるように設定された、符号化対象フレームに対して最も時間的に遠い位置にある参照候補フレームを指す。若し、最後の参照候補フレームではないと判定されたら、処理はステップＳ１２に戻され、次に時間的に近い位置にある参照候補フレームについて、判定対象の顔パーツに対する処理が行われる。 On the other hand, if it is determined in step S13 that the state information of the face part does not match between the encoding target frame and the reference candidate frame, the process proceeds to step S16. In step S16, it is determined whether or not the reference candidate frame compared in step S12 is the last reference candidate frame. Here, the last reference candidate frame refers to a reference candidate frame that is set to be used as a reference candidate frame and that is farthest in time from the encoding target frame. If it is determined that the frame is not the last reference candidate frame, the process returns to step S12, and the process for the face part to be determined is performed on the reference candidate frame that is next in time.

一方、ステップＳ１６で、判定対象の顔パーツについて、最後の参照候補フレームに対する処理が終了したと判断されたら、処理はステップＳ１７に移行される。すなわち、この場合、当該判定対象の顔パーツに対して状態情報が一致する顔パーツが、参照候補フレームとして用いるように設定された全ての参照候補フレームに存在しなかったことになる。この場合、ステップＳ１７で、符号化対象フレームに対して時間的に直近に位置する参照候補フレームを参照フレームに決定する。そして、処理はステップＳ１５に移行される。 On the other hand, if it is determined in step S16 that processing of the last reference candidate frame has finished for the face part under determination, the process proceeds to step S17. In this case, none of the reference candidate frames set to be used contained a face part whose state information matches that of the face part under determination. Therefore, in step S17, the reference candidate frame temporally closest to the encoding target frame is determined as the reference frame. The process then proceeds to step S15.

ステップＳ１５では、上述のステップＳ１１において符号化対象フレームで検出された全ての顔パーツについて判定が終了したか否かが判断される。若し、判定が終了していないと判断されたら、処理はステップＳ１２に戻され、次の顔パーツについて処理がなされる。一方、判定が終了したと判断されたら、当該符号化対象フレームに対する一連の処理が完了される。 In step S15, it is determined whether or not the determination has been completed for all the face parts detected in the encoding target frame in step S11 described above. If it is determined that the determination is not completed, the process returns to step S12, and the next face part is processed. On the other hand, when it is determined that the determination has been completed, a series of processes for the encoding target frame is completed.
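
Steps S12 to S17 amount to a nested search: for each face part, scan the reference candidate frames from the temporally nearest one and take the first whose state matches, falling back to the nearest frame when none matches. The following is a minimal sketch of that control flow; the function name and the dictionary-based data layout are assumptions made for illustration.

```python
def choose_reference_frames(target_states, candidates):
    """Pick a reference frame for each face part of the encoding target frame.

    target_states maps a face part name to its state in the target frame.
    candidates is a list of (frame_id, states) pairs ordered from the
    temporally nearest reference candidate frame to the farthest.
    """
    if not candidates:
        raise ValueError("at least one reference candidate frame is required")
    chosen = {}
    for part, state in target_states.items():      # repeat until S15 says done
        for frame_id, states in candidates:        # S12, repeated via S16
            if states.get(part) == state:          # S13: do the states match?
                chosen[part] = frame_id            # S14: use this candidate
                break
        else:
            # S17: no candidate matched, so use the temporally nearest one.
            chosen[part] = candidates[0][0]
    return chosen
```

Because the candidates are scanned nearest first, a match found early also tends to be the frame with the least overall change from the encoding target frame.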

図４〜図６を用いて、図２のフローチャートの処理について、より具体的に説明する。一例として、図４に例示されるように、符号化対象フレーム２００に対して、２枚の参照候補フレーム２０１および２０２を用いるように設定されているものとする。参照候補フレーム２０１は、符号化対象フレーム２００に対して時間的に直前のフレームであるものとする。また、参照候補フレーム２０２は、符号化対象フレーム２００に対して参照候補フレーム２０１よりも時間的に遠いフレームであるものとする。 The process of the flowchart of FIG. 2 will be described more specifically with reference to FIGS. As an example, as illustrated in FIG. 4, it is assumed that two reference candidate frames 201 and 202 are set for the encoding target frame 200. The reference candidate frame 201 is assumed to be a frame immediately before the encoding target frame 200 in terms of time. The reference candidate frame 202 is assumed to be a frame farther in time than the reference candidate frame 201 with respect to the encoding target frame 200.

参照候補フレーム２０１は、図５（ａ）に例示されるように、顔領域中の各顔パーツにおいて、左目２１０、右目２１１および口２１２が閉じた状態となっている。したがって、図５（ｂ）に例示されるように、左目２１０、右目２１１および口２１２の状態情報がそれぞれ「０」とされる。また、左目２１０がブロック座標（３，３）、（４，３）で対角座標を示される矩形領域、右目２１１がブロック座標（４，３）、（５，３）で対角座標を示される矩形領域に含まれる。また、口２１２がブロック座標（３，５）、（５，５）で対角座標を示される矩形領域に含まれる。 In the reference candidate frame 201, as illustrated in FIG. 5A, the left eye 210, the right eye 211, and the mouth 212 are closed in each face part in the face area. Therefore, as illustrated in FIG. 5B, the state information of the left eye 210, the right eye 211, and the mouth 212 is set to “0”. In addition, the left eye 210 indicates a rectangular area where diagonal coordinates are indicated by block coordinates (3, 3) and (4, 3), and the right eye 211 indicates diagonal coordinates by block coordinates (4, 3) and (5, 3). Included in the rectangular area. Further, the mouth 212 is included in a rectangular area whose diagonal coordinates are indicated by block coordinates (3, 5) and (5, 5).

一方、参照候補フレーム２０２は、図６（ａ）に例示されるように、顔領域中の各顔パーツについて、左目２１０、右目２１１および口２１２が開いた状態となっている。したがって、図６（ｂ）に例示されるように、左目２１０、右目２１１および口２１２の状態情報がそれぞれ「１」とされる。また、左目２１０がブロック座標（２，３）、（３，４）で対角座標を示される矩形領域、右目２１１がブロック座標（４，３）、（５，３）で対角座標を示される矩形領域に含まれる。また、口２１２がブロック座標（３，５）、（５，５）で対角座標を示される矩形領域に含まれる。 On the other hand, as illustrated in FIG. 6A, the reference candidate frame 202 is in a state where the left eye 210, the right eye 211, and the mouth 212 are open for each face part in the face area. Therefore, as illustrated in FIG. 6B, the state information of the left eye 210, the right eye 211, and the mouth 212 is set to “1”. In addition, the left eye 210 indicates a rectangular area where diagonal coordinates are indicated by block coordinates (2, 3) and (3,4), and the right eye 211 indicates diagonal coordinates by block coordinates (4, 3) and (5, 3). Included in the rectangular area. Further, the mouth 212 is included in a rectangular area whose diagonal coordinates are indicated by block coordinates (3, 5) and (5, 5).

判定対象の顔パーツが左目２１０である場合を例に挙げて説明する。符号化対象フレーム２００において、左目２１０が開いており状態情報は「１」である（図３（ａ）および図３（ｂ）参照）。これに対して、符号化対象フレーム２００に対して時間的に直前の参照候補フレーム２０１の左目２１０は閉じており、状態情報は「０」である（図５（ａ）および図５（ｂ）参照）。したがって、上述のステップＳ１２の比較の結果、両者の状態情報が一致しないと判定される（ステップＳ１３）。そのため、参照フレーム決定部３１は、左目２１０について、参照候補フレーム２０１を参照フレームとすることを保留し、処理がステップＳ１６に移行される。 A case where the face part under determination is the left eye 210 is described as an example. In the encoding target frame 200, the left eye 210 is open and its state information is “1” (see FIGS. 3A and 3B). In contrast, the left eye 210 in the reference candidate frame 201, which immediately precedes the encoding target frame 200, is closed and its state information is “0” (see FIGS. 5A and 5B). Therefore, the comparison in step S12 finds that the two pieces of state information do not match (step S13). The reference frame determination unit 31 thus defers using the reference candidate frame 201 as the reference frame for the left eye 210, and the process proceeds to step S16.

ステップＳ１６では、参照候補フレーム２０１が最後の参照候補フレームではないと判定される。そして、処理がステップＳ１６からステップＳ１２に戻され、次の参照候補フレームである参照候補フレーム２０２について、左目２１０の状態情報が符号化対象フレーム２００と比較される。参照候補フレーム２０２の左目２１０は開いており、状態情報は「１」である（図６（ａ）および図６（ｂ）参照）。したがって、上述のステップＳ１２の比較の結果、両者の状態情報が一致すると判定され（ステップＳ１３）、参照候補フレーム２０２が符号化対象フレーム２００の参照フレームに決定される（ステップＳ１４）。 In step S16, it is determined that the reference candidate frame 201 is not the last reference candidate frame. Then, the process returns from step S16 to step S12, and the state information of the left eye 210 is compared with the encoding target frame 200 for the reference candidate frame 202 which is the next reference candidate frame. The left eye 210 of the reference candidate frame 202 is open, and the state information is “1” (see FIG. 6A and FIG. 6B). Therefore, as a result of the comparison in step S12 described above, it is determined that the state information of both coincides (step S13), and the reference candidate frame 202 is determined as the reference frame of the encoding target frame 200 (step S14).
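
The same walk-through can be replayed for all three face parts. The sketch below encodes the open and closed states described for frames 200, 201, and 202 as strings; the helper `first_match` and the variable names are hypothetical, and the results for the right eye and the mouth are derived by applying the same procedure rather than stated in the text.

```python
# States described for the encoding target frame 200 (FIG. 3).
FRAME_200 = {"left_eye": "open", "right_eye": "open", "mouth": "closed"}

# Reference candidate frames, temporally nearest first (FIGS. 5 and 6).
CANDIDATES = [
    (201, {"left_eye": "closed", "right_eye": "closed", "mouth": "closed"}),
    (202, {"left_eye": "open", "right_eye": "open", "mouth": "open"}),
]


def first_match(part, state, candidates):
    """Return the nearest candidate whose state for part matches, else the nearest."""
    for frame_id, states in candidates:
        if states[part] == state:
            return frame_id
    return candidates[0][0]


reference_for = {part: first_match(part, state, CANDIDATES)
                 for part, state in FRAME_200.items()}
```

Under these states, both eyes are predicted from frame 202, while the mouth, which is closed in frames 200 and 201, is predicted from frame 201.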

このように、本発明の実施形態では、顔の表情が符号化対象フレームに近い参照候補フレームを、参照フレームとして決定することが可能になる。そのため、動きベクトル検出の際に参照できる参照フレーム数が制限される符号化装置においても、高い符号化効率を実現することが可能になる。 As described above, in the embodiment of the present invention, it is possible to determine a reference candidate frame whose facial expression is close to the encoding target frame as a reference frame. Therefore, high encoding efficiency can be realized even in an encoding device in which the number of reference frames that can be referred to in motion vector detection is limited.

また、顔の表情が符号化対象フレームに近い参照候補フレームが存在しない場合には、符号化対象フレームとの変化が最も少ないと考えられる、符号化対象フレームに対して時間的に最も近い参照候補フレームを参照フレームとして決定することができる。これにより、顔の表情が符号化対象フレームに近い参照候補フレームが存在しない場合でも、符号化時の画質の劣化を抑えられる可能性が大きくなる。 When no reference candidate frame has a facial expression close to that of the encoding target frame, the reference candidate frame temporally closest to the encoding target frame, which is considered to differ least from it, can be determined as the reference frame. This increases the likelihood that degradation of image quality during encoding can be suppressed even in that case.

なお、上述したように、本実施形態では、顔パーツとしての左目、右目および口の状態を、開いているか閉じているかの２状態に分類しているが、これはこの例に限定されない。例えば、左目、右目および口の開き具合によってさらに状態数を増やしてもよい。これにより、符号化対象フレームと参照候補フレームとの間での顔パーツ状態の比較を、より詳細に行うことができる。 As described above, in the present embodiment, the states of the left eye, the right eye, and the mouth as the facial parts are classified into two states of open or closed, but this is not limited to this example. For example, the number of states may be further increased depending on how the left eye, the right eye, and the mouth open. Thereby, the comparison of the face part state between the encoding target frame and the reference candidate frame can be performed in more detail.

この場合、符号化対象フレームと参照候補フレームとの間で顔パーツ状態が必ずしも一致しなくても、符号化対象フレームに対して顔パーツ状態が所定以上近い参照候補フレームを、参照フレームとして決定するようにしてもよい。符号化対象フレームに対して顔パーツ状態が所定以上近い参照候補フレームが存在しない場合には、符号化対象フレームに対して時間的に直近の参照候補フレームが参照フレームとして決定される。 In this case, even if the face part states of the encoding target frame and a reference candidate frame do not match exactly, a reference candidate frame whose face part state is close to that of the encoding target frame to at least a predetermined degree may be determined as the reference frame. When no such reference candidate frame exists, the reference candidate frame temporally closest to the encoding target frame is determined as the reference frame.
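
With more than two states, the exact-match test becomes a closeness test. The sketch below assumes that the degree of opening is stored as an integer level and that closeness is measured by the absolute difference of levels; both choices, along with the function name and the default `max_distance`, are assumptions, since the text only requires a "predetermined" degree of closeness.

```python
def choose_by_closeness(target_level, candidate_levels, max_distance=1):
    """Pick a reference frame for one face part using multi-level states.

    candidate_levels is a list of (frame_id, level) pairs ordered from the
    temporally nearest reference candidate frame to the farthest. The first
    candidate whose level differs from target_level by at most max_distance
    is chosen; if none qualifies, the temporally nearest candidate is used.
    """
    if not candidate_levels:
        raise ValueError("at least one reference candidate frame is required")
    for frame_id, level in candidate_levels:
        if abs(level - target_level) <= max_distance:
            return frame_id
    return candidate_levels[0][0]
```

Setting `max_distance=0` reduces this to the exact-match behavior of the two-state case.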

また、本実施形態では、顔パーツを左目、右目および口とし、この３つの顔パーツについて、状態が一致しているか否かを判定しているが、これはこの例に限定されない。例えば、顔パーツとして、鼻、眉など顔の他の部分をさらに用いて状態一致の判定を行うことも考えられる。鼻の位置は、左目、右目および口の位置関係と、左目および右目と口との間の２つのホールや影の位置に基づき特定することが考えられる。眉の位置は、左目および右目の位置から特定可能である。 In this embodiment, the face parts are the left eye, the right eye, and the mouth, and it is determined whether or not the states of these three face parts match. However, this is not limited to this example. For example, it is also conceivable to use other parts of the face such as the nose and eyebrows as face parts to determine the state match. The position of the nose can be specified based on the positional relationship between the left eye, the right eye, and the mouth, and the positions of two holes and shadows between the left eye, the right eye, and the mouth. The position of the eyebrows can be specified from the positions of the left eye and the right eye.

なお、上述の図２のフローチャートでは、顔パーツ毎に参照候補フレームの決定処理を行っているが、これはこの例に限定されない。例えば、参照候補フレーム毎に各顔パーツの判定を行うようにしてもよい。より具体的には、先ず、符号化対象フレームに対して時間的に直近の参照候補フレームについて、各顔パーツに対する判定処理を行う。全ての顔パーツについて参照フレームが決定しなければ、符号化対象フレームに対して時間的に次に近い参照候補フレームについて、参照フレームが決定していない顔パーツに対して判定処理を行う。この処理を、各顔パーツ全てに参照フレームが決定するまで繰り返す。 In the flowchart of FIG. 2 described above, the reference candidate frame determination process is performed for each face part, but this is not limited to this example. For example, each face part may be determined for each reference candidate frame. More specifically, first, determination processing for each face part is performed on the reference candidate frame that is temporally closest to the encoding target frame. If the reference frames are not determined for all the facial parts, a determination process is performed on the facial parts for which the reference frame is not determined for the reference candidate frame temporally next to the encoding target frame. This process is repeated until a reference frame is determined for each face part.
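
The alternative ordering described here swaps the two loops: the outer loop walks the reference candidate frames from nearest to farthest, and the inner loop checks only the face parts that are still undecided. A minimal sketch follows; the function name and data layout are hypothetical, and the fallback to the temporally nearest candidate for parts that never match is carried over from step S17 as an assumption.

```python
def choose_reference_frames_by_frame(target_states, candidates):
    """Frame-by-frame variant of the reference frame decision.

    target_states maps a face part name to its state in the target frame.
    candidates is a list of (frame_id, states) pairs, temporally nearest first.
    """
    if not candidates:
        raise ValueError("at least one reference candidate frame is required")
    undecided = dict(target_states)
    chosen = {}
    for frame_id, states in candidates:
        for part in list(undecided):          # only parts without a decision
            if states.get(part) == undecided[part]:
                chosen[part] = frame_id
                del undecided[part]
        if not undecided:                     # every part has a reference frame
            break
    for part in undecided:                    # assumed fallback, as in step S17
        chosen[part] = candidates[0][0]
    return chosen
```

Both orderings select the same frames; the frame-by-frame form can stop reading candidates as soon as every face part has a reference frame.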

＜他の実施形態＞ <Other embodiments>

上述の実施形態は、システムまたは装置のコンピュータ（あるいはＣＰＵ、ＭＰＵなど）によりソフトウェア的に実現することも可能である。 The above-described embodiments can also be realized in software by a computer (or CPU, MPU, etc.) of a system or apparatus.

従って、上述の実施形態をコンピュータで実現するために、該コンピュータに供給されるコンピュータプログラム自体も本発明を実現するものである。つまり、上述の実施形態の機能を実現するためのコンピュータプログラム自体も本発明の一つである。 Therefore, the computer program itself supplied to the computer in order to implement the above-described embodiment by the computer also realizes the present invention. That is, the computer program itself for realizing the functions of the above-described embodiments is also one aspect of the present invention.

なお、上述の実施形態を実現するためのコンピュータプログラムは、コンピュータで読み取り可能であれば、どのような形態であってもよい。例えば、オブジェクトコード、インタプリタにより実行されるプログラム、ＯＳに供給するスクリプトデータ等で構成することができるが、これらに限るものではない。 The computer program for realizing the above-described embodiment may be in any form as long as it can be read by a computer. For example, it can be composed of object code, a program executed by an interpreter, script data supplied to the OS, but is not limited thereto.

上述の実施形態を実現するためのコンピュータプログラムは、記憶媒体又は有線／無線通信によりコンピュータに供給される。プログラムを供給するための記憶媒体としては、例えば、フレキシブルディスク、ハードディスク、磁気テープ等の磁気記憶媒体、ＭＯ、ＣＤ、ＤＶＤ等の光／光磁気記憶媒体、不揮発性の半導体メモリなどがある。 A computer program for realizing the above-described embodiment is supplied to a computer via a storage medium or wired / wireless communication. Examples of the storage medium for supplying the program include a magnetic storage medium such as a flexible disk, a hard disk, and a magnetic tape, an optical / magneto-optical storage medium such as an MO, CD, and DVD, and a nonvolatile semiconductor memory.

有線／無線通信を用いたコンピュータプログラムの供給方法としては、コンピュータネットワーク上のサーバを利用する方法がある。この場合、本発明を形成するコンピュータプログラムとなりうるデータファイル（プログラムファイル）をサーバに記憶しておく。プログラムファイルとしては、実行形式のものであっても、ソースコードであっても良い。 As a computer program supply method using wired / wireless communication, there is a method of using a server on a computer network. In this case, a data file (program file) that can be a computer program forming the present invention is stored in the server. The program file may be an executable format or a source code.

そして、このサーバにアクセスしたクライアントコンピュータに、プログラムファイルをダウンロードすることによって供給する。この場合、プログラムファイルを複数のセグメントファイルに分割し、セグメントファイルを異なるサーバに分散して配置することも可能である。 The program file is supplied by downloading to a client computer that has accessed the server. In this case, the program file can be divided into a plurality of segment files, and the segment files can be distributed and arranged on different servers.

つまり、上述の実施形態を実現するためのプログラムファイルをクライアントコンピュータに提供するサーバ装置も本発明の一つである。 That is, a server apparatus that provides a client computer with a program file for realizing the above-described embodiment is also one aspect of the present invention.

また、上述の実施形態を実現するためのコンピュータプログラムを暗号化して格納した記憶媒体を配布し、所定の条件を満たしたユーザに、暗号化を解く鍵情報を供給し、ユーザの有するコンピュータへのインストールを許可してもよい。鍵情報は、例えばインターネットを介してホームページからダウンロードさせることによって供給することができる。 Alternatively, a storage medium storing an encrypted computer program for realizing the above-described embodiment may be distributed, key information for decryption may be supplied to users who satisfy predetermined conditions, and installation on the user's computer may then be permitted. The key information can be supplied, for example, by download from a website via the Internet.

また、上述の実施形態を実現するためのコンピュータプログラムは、既にコンピュータ上で稼働するＯＳの機能を利用するものであってもよい。 In addition, the computer program for realizing the above-described embodiment may use an OS function already running on the computer.

さらに、上述の実施形態を実現するためのコンピュータプログラムは、その一部をコンピュータに装着される拡張ボードなどのファームウェアで構成してもよいし、拡張ボードなどが備えるＣＰＵで実行するようにしてもよい。 Furthermore, part of the computer program for realizing the above-described embodiment may be implemented as firmware on an expansion board or the like attached to the computer, or may be executed by a CPU provided on the expansion board or the like.

本発明の実施形態による符号化装置の一例の構成を示すブロック図である。 A block diagram showing an example configuration of the encoding device according to the embodiment of the present invention.
本発明の実施形態による参照フレーム決定の一例の処理を示すフローチャートである。 A flowchart showing an example of reference frame determination processing according to the embodiment of the present invention.
符号化対象フレームの顔パーツ状態を説明するための図である。 A diagram for explaining the face part states of the encoding target frame.
符号化対象フレームと参照候補フレームとを説明するための図である。 A diagram for explaining the encoding target frame and the reference candidate frames.
参照候補フレームの顔パーツ状態を説明するための図である。 A diagram for explaining the face part states of a reference candidate frame.
参照候補フレームの顔パーツ状態を説明するための図である。 A diagram for explaining the face part states of a reference candidate frame.

Explanation of symbols

１０現在フレーム保存部 10 current frame storage unit
２１参照フレーム保存部 21 reference frame storage unit
２２インター予測部 22 inter prediction unit
２３動き検出部 23 motion detection unit
３０復元画像フレーム保存部 30 restored image frame storage unit
３１参照フレーム決定部 31 reference frame determination unit
３２顔検出部 32 face detection unit
３３顔表情認識部 33 facial expression recognition unit
１００符号化装置 100 encoding device
２００符号化対象フレーム 200 encoding target frame
２０１，２０２参照候補フレーム 201, 202 reference candidate frames

Claims

A moving image encoding apparatus that performs motion-compensated interframe predictive encoding, in units of blocks obtained by dividing an image frame to be encoded, using the result of motion detection performed against a reference frame, and that selects the reference frame from a plurality of reference candidate frames, the apparatus comprising:
input image frame storage means for temporarily storing an input image frame;
reference candidate frame storage means for storing a plurality of reference candidate frames generated based on the image frames stored in the input image frame storage means;
determination means for detecting a face area from the image frame output from the input image frame storage means, detecting a face part included in the face area, and determining a state of the face part;
face part information storage means for storing the state of the face part determined by the determination means in association with information indicating the image frame that includes the face part;
search means for searching the face part information storage means for information indicating an image frame for which the state of the face part matches the state of the face part detected and determined by the determination means from the encoding target frame output from the input image frame storage means, and to which the encoding target frame can refer; and
reference frame determining means for determining, from among the plurality of reference candidate frames stored in the reference candidate frame storage means, the reference candidate frame corresponding to the information indicating the image frame found by the search means as the reference frame for the blocks corresponding to the face part,
wherein motion-compensated interframe predictive encoding is performed on the encoding target frame using the reference frame determined by the reference frame determining means.

The moving image encoding apparatus according to claim 1, wherein the determination unit detects at least one of eyes or a mouth included in the face region as a face part.

The moving image encoding apparatus according to claim 2, wherein the determination unit determines the degree of opening of the eyes or mouth as the state of the face part.

4. The moving image encoding apparatus according to claim 2, wherein the determination unit determines whether the eyes or mouth are open or closed as the state of the face part.

The moving image encoding apparatus according to any one of the preceding claims, wherein the search means performs the search for each of the face parts detected and determined by the determination means from one encoding target frame output from the input image frame storage means, and
the reference frame determining means determines the corresponding reference candidate frame for each piece of information indicating the image frame found for each face part by the search means.

The moving image encoding apparatus according to claim 1, wherein the search means searches the information indicating image frames stored in the face part information storage means in order, starting from the image frame temporally closest to the encoding target frame.

The moving image encoding apparatus according to claim 1, wherein, if no information indicating an image frame with a matching face part state is found, the search means uses as the search result the information, among the information indicating image frames stored in the face part information storage means, that indicates the image frame to which the encoding target frame can refer and that is temporally closest to the encoding target frame.

The moving image encoding apparatus according to claim 1, wherein, if no information indicating an image frame with a matching face part state is found, the search means uses as the search result the information, among the information indicating image frames stored in the face part information storage means, that indicates the image frame to which the encoding target frame can refer and whose face part state is closest to that of the encoding target frame.
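
The search order and the two fallback behaviours described in the preceding claims can be sketched as follows. This is a minimal illustration under stated assumptions: each stored entry is a (frame number, degree of opening) pair for a single face part, temporal distance is the absolute difference of frame numbers, and the names search_with_fallback, tolerance and fallback are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

# (frame_no, opening_degree) for one face part; the numeric state is assumed.
Record = Tuple[int, float]


def search_with_fallback(records: Iterable[Record], target_frame: int,
                         target_state: float, referable: Set[int],
                         tolerance: float = 0.0,
                         fallback: str = "nearest_time") -> Optional[int]:
    """Search referable frames, temporally closest first, for a matching state.
    If none matches, fall back to the temporally closest frame or to the frame
    with the closest state, depending on `fallback`."""
    usable = [r for r in records if r[0] in referable]
    if not usable:
        return None
    # Visit stored frames starting with the one closest in time to the target.
    usable.sort(key=lambda r: abs(r[0] - target_frame))
    for frame_no, state in usable:
        if abs(state - target_state) <= tolerance:
            return frame_no
    if fallback == "nearest_time":
        return usable[0][0]
    if fallback == "nearest_state":
        return min(usable, key=lambda r: abs(r[1] - target_state))[0]
    raise ValueError("unknown fallback: " + fallback)
```

Searching from the temporally closest frame means that, when several stored frames match, the one nearest in time to the encoding target frame is returned first.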

The moving image encoding apparatus according to any one of the preceding claims, further comprising:
Motion detection means for performing motion detection against a reference frame on image data read from the input image frame storage means in units of blocks obtained by dividing the image frame; and
Motion compensation means for performing motion-compensated interframe prediction on the image data in units of blocks based on the result of the motion detection by the motion detection means,
wherein the reference candidate frame is generated based on the image data read out in units of blocks from the input image frame storage means and the image data subjected to motion-compensated interframe prediction by the motion compensation means.

A moving image encoding method that performs motion-compensated interframe predictive encoding using the result of motion detection performed against a reference frame in units of blocks obtained by dividing an image frame to be encoded, and that selects the reference frame from a plurality of reference candidate frames, the method comprising:
An input image frame storage step of temporarily storing the input image frame in the input image frame storage means;
A reference candidate frame storage step of storing, in reference candidate frame storage means, a plurality of reference candidate frames generated based on the image frames stored in the input image frame storage means;
A determination step of detecting a face area from the image frame output from the input image frame storage means, further detecting a face part included in the face area, and determining a state of the face part;
A face part information storage step of storing, in face part information storage means, the state of the face part determined in the determination step in association with information indicating the image frame including the face part;
A search step of searching the face part information storage means for information indicating an image frame whose stored face part state matches the state of the face part detected and determined in the determination step from the encoding target frame output from the input image frame storage means, and to which the encoding target frame can refer;
A reference frame determining step of determining, from among the plurality of reference candidate frames stored in the reference candidate frame storage means, the reference candidate frame corresponding to the information indicating the image frame found in the search step as the reference frame for the block corresponding to the face part;
A moving image encoding method, wherein motion-compensated interframe predictive encoding is performed on the encoding target frame using the reference frame determined in the reference frame determining step.

A program that causes a computer to function as the moving image encoding apparatus according to any one of claims 1 to 9.