JP2012213207A

JP2012213207A - Multi-viewpoint image encoding method and decoding method, encoder, decoder, encoding program, decoding program and computer readable recording medium

Info

Publication number: JP2012213207A
Application number: JP2012136560A
Authority: JP
Inventors: Shinya Shimizu; 信哉志水; Hideaki Kimata; 英明木全; Kazuto Kamikura; 一人上倉; Yoshiyuki Yashima; 由幸八島; Masayuki Tanimoto; 正幸谷本; Toshiaki Fujii; 俊彰藤井
Original assignee: Nagoya University NUC; Nippon Telegraph and Telephone Corp
Current assignee: Nagoya University NUC; Nippon Telegraph and Telephone Corp
Priority date: 2012-06-18
Filing date: 2012-06-18
Publication date: 2012-11-01
Anticipated expiration: 2028-07-11
Also published as: JP5531282B2

Abstract

PROBLEM TO BE SOLVED: To provide an efficient encoding system that reduces the memory capacity required for multi-viewpoint image encoding and reduces the amount of computation on the decoding side. SOLUTION: Rather than directly generating and storing a prediction signal from the distance information given for the reference frames, distance information for one processing target frame is generated and stored from the distance information given for each of a plurality of reference frames (reference camera distance information input section 205 through processing frame distance information memory 210). A viewpoint synthesis image generation section 211 generates, for each processing block, the prediction signal for inter-camera image prediction using the stored distance information for the processing target frame. An image encoding section 212 performs predictive encoding of each processing block using that prediction signal.

Description

The present invention relates to a method for generating distance information and a video signal for another viewpoint from a known video signal and distance information, in multi-view images and multi-view video. It also relates to techniques for encoding and decoding multi-view images and multi-view video using that method.

A multi-view image is a set of images of the same subject and background captured by a plurality of cameras, and a multi-view video is the moving-image counterpart. Hereinafter, a moving image captured by one camera is called a "two-dimensional video", and a group of two-dimensional videos of the same subject and background is called a multi-view video.

A two-dimensional video is highly correlated in the time direction, and coding efficiency is improved by exploiting that correlation. In multi-view images and multi-view video, when the cameras are synchronized, the images from the cameras at a given time show the subject and background in exactly the same state from different positions, so there is a high correlation between cameras. In encoding multi-view images and multi-view video, coding efficiency can be increased by exploiting this correlation.

First, conventional techniques for encoding two-dimensional video are described. Many conventional two-dimensional video coding schemes, including the international coding standards H.264, MPEG-2, and MPEG-4, achieve highly efficient coding using motion compensation, orthogonal transform, quantization, and entropy coding. Motion compensation is the technique that exploits temporal correlation between frames.

The motion compensation used in H.264 is described in detail in Non-Patent Document 1 below; an outline follows. In H.264 motion compensation, the encoding target frame is divided into blocks of various sizes and each block can have its own motion vector, which achieves high coding efficiency even for local changes in the video. In addition, a plurality of already-encoded past or future frames are prepared as reference frame candidates for the encoding target frame, and each block can use a different reference frame. This achieves high coding efficiency even for video in which occlusion arises over time.

Next, conventional encoding schemes for multi-view images and multi-view video are described. The difference between encoding multi-view images and encoding multi-view video is that multi-view video has correlation in the time direction in addition to correlation between cameras. However, the same method can be used in either case to exploit the correlation between cameras. Therefore, the methods used in encoding multi-view video are described here.

For multi-view video, there is a conventional scheme that encodes efficiently by "disparity compensation", which applies motion compensation to images captured at the same time by cameras placed at different viewpoints. Here, disparity is the difference between the positions at which the same point on the subject is projected on the image planes of cameras placed at different positions.

FIG. 17 is a conceptual diagram of the disparity that arises between cameras. The diagram looks vertically down on the image planes of cameras whose optical axes are parallel. The positions at which the same point on the subject is projected on the image planes of different cameras are generally called corresponding points. Disparity compensation predicts each pixel value of the encoding target frame from the reference frame on the basis of this correspondence, and encodes the prediction residual together with disparity information indicating the correspondence.

Many methods express disparity as a vector on the image plane. For example, Non-Patent Document 2 performs disparity compensation in units of blocks and expresses the disparity of each block as a two-dimensional vector, that is, with two parameters (an x component and a y component). In other words, this method encodes disparity information consisting of two parameters together with the prediction residual.

The method described in Non-Patent Document 3, on the other hand, uses camera parameters for encoding and expresses the disparity vector as one-dimensional information on the basis of the epipolar geometry constraint, thereby encoding the prediction information efficiently.

FIG. 18 is a conceptual diagram of the epipolar geometry constraint. According to this constraint, for two cameras (camera 1 and camera 2), the point in one image that corresponds to a point in the other image is constrained to lie on a straight line called the epipolar line. This method uses a single parameter, the distance from the camera to the subject, to indicate the position on the epipolar line.

Under the epipolar geometry constraint, the information conveyed by the distance is the three-dimensional position of the subject. Because that position does not depend on the camera, encoding multiple distances for the same point on the subject is redundant and lowers coding efficiency. The method of Non-Patent Document 3 encodes, for each encoding target frame, the distance from the camera that captured that frame to the subject, so distance information for the same subject is encoded several times and efficient encoding cannot be achieved.

To solve this problem, the method of Patent Document 1 encodes, for the reference frame, the distance from the camera capturing that frame to the subject. This enables disparity compensation for every encoding target frame that uses the same reference frame, regardless of the number of cameras, and achieves efficient encoding.

Examples of conventional encoding and decoding processes that use a disparity-compensated image as the predicted image are described with reference to the flowcharts shown in FIGS. 19 to 22.

FIG. 19 shows the overall flow of the conventional encoding process. First, the encoding target frame, the reference frame, and the reference distance information are input [X1]. The reference distance information indicates, for each pixel of the reference frame, the distance from the subject to the camera. Next, the input reference distance information is used to calculate, for each pixel of the reference frame, the disparity vector with respect to the encoding target frame [X2]. Once the disparity vectors have been calculated, the video signal of each pixel of the reference frame is copied to the pixel indicated by its disparity vector to generate a disparity-compensated image [X3]. The encoding target frame is then encoded block by block, using the co-located part of the disparity-compensated image as the predicted image [X4].
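Steps X2 and X3 can be sketched for a single scan line as follows. This is only an illustration under assumptions that are not in the source: rectified cameras with parallel optical axes, a disparity of focal length times baseline divided by distance, a shift toward smaller x, and a nearest-surface rule when two reference pixels land on the same target pixel. The names `warp_row`, `focal_px`, and `baseline` are hypothetical.

```python
def warp_row(ref_row, ref_depth, focal_px, baseline):
    """Forward-warp one reference scan line into the target view.

    Assumes rectified, parallel cameras (an assumption of this sketch),
    so the disparity of a pixel at distance z is focal_px * baseline / z.
    """
    width = len(ref_row)
    synth = [None] * width              # None marks pixels nothing was copied to
    nearest = [float("inf")] * width    # distance of the sample currently stored
    for x in range(width):
        z = ref_depth[x]
        disparity = round(focal_px * baseline / z)   # X2: disparity vector (1-D here)
        tx = x - disparity                           # assumed shift direction
        # X3: copy the reference sample; keep the nearer surface on collisions
        if 0 <= tx < width and z < nearest[tx]:
            synth[tx] = ref_row[x]
            nearest[tx] = z
    return synth
```

Because the mapping runs from reference pixels to target pixels, some target pixels may receive no sample at all, which is why the whole frame has to be warped before any block of the target frame can be predicted.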

FIG. 20 shows the detailed flow of the signal encoding process X4 of FIG. 19. To encode the encoding target frame block by block, the block index blk is first initialized to 0 [X401]. Next, for the block indicated by blk, the input encoding target frame Org[blk] is encoded using the disparity-compensated image Synth[blk] generated in process X3 of FIG. 19 as a predicted image candidate [X402]. Then 1 is added to blk [X403], and the process returns to [X402] and repeats until blk reaches the number of blocks in the frame, numBlks [X404], that is, until all blocks in the frame have been encoded.
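The loop X401 to X404 can be written out as a short sketch. The block coder itself is not specified at this point in the text, so `residual_coder` is a hypothetical stand-in that simply returns the prediction residual; `encode_frame` and its parameters are likewise illustrative names.

```python
def encode_frame(org_blocks, synth_blocks, encode_block):
    """Encode every block, offering the co-located block of the
    disparity-compensated image as a prediction candidate."""
    num_blks = len(org_blocks)
    coded = []
    blk = 0                                   # X401: initialise the block index
    while blk < num_blks:                     # X404: stop after the last block
        coded.append(encode_block(org_blocks[blk], synth_blocks[blk]))  # X402
        blk += 1                              # X403: advance to the next block
    return coded


def residual_coder(org, pred):
    """Hypothetical stand-in for a real block coder: the plain residual."""
    return [o - p for o, p in zip(org, pred)]
```

The point of the sketch is that `synth_blocks` must already hold the entire disparity-compensated frame before the loop starts.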

FIG. 21 shows the overall flow of the conventional decoding process. First, the encoded data of the decoding target frame, the reference frame, and the reference distance information are input [Y1]. Next, the input reference distance information is used to calculate, for each pixel of the reference frame, the disparity vector with respect to the decoding target frame [Y2]. Once the disparity vectors have been calculated, the video signal of each pixel of the reference frame is copied to the pixel indicated by its disparity vector to generate a disparity-compensated image [Y3]. The decoding target frame is then decoded block by block, using the co-located part of the disparity-compensated image as the predicted image [Y4].

FIG. 22 shows the detailed flow of the signal decoding process Y4 of FIG. 21. To decode the decoding target frame block by block, the block index blk is first initialized to 0 [Y401]. Next, for the block indicated by blk, the input decoding target frame Dec[blk] is decoded using the disparity-compensated image Synth[blk] generated in process Y3 of FIG. 21 as the prediction signal for inter-camera prediction [Y402]. Then 1 is added to blk [Y403], and the process returns to [Y402] and repeats until blk reaches the number of blocks in the frame, numBlks [Y404], that is, until all blocks in the frame have been decoded.

Non-Patent Document 1: "Editor's Proposed Draft Text Modifications for Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), Draft 7", Document JVT-E022d7, September 2002 (pp. 10-13, pp. 62-73).
Non-Patent Document 2: Hideaki Kimata and Masaki Kitahara, "Preliminary results on multiple view video coding (3DAV)", document M10976, MPEG Redmond Meeting, July 2004.
Non-Patent Document 3: Sehoon Yea, Jongdae Oh, Serdar Ince, Emin Martinian, and Anthony Vetro, "Report on Core Experiment CE3 of Multiview Coding", Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, Doc. JVT-T106, July 2006.

Patent Document 1: JP 2007-036800 A

According to the conventional multi-view video encoding method, when the camera parameters are known, the epipolar geometry constraint makes it possible to realize disparity compensation for the encoding target frames of all cameras merely by encoding, for the reference frame, the one-dimensional information of the distance from the camera to the subject, regardless of the number of cameras. Multi-view video can therefore be encoded efficiently.

However, because the conventional method provides the information indicating disparity for each region of the reference frame, the disparity for a given region of the encoding target frame cannot be obtained directly. Consequently, before starting to encode a frame, the conventional method must generate a predicted video signal by disparity-compensated prediction for the entire encoding target frame from the video signal of the reference frame and the distance information given for that frame, regardless of whether each region of the encoding target frame uses disparity-compensated prediction, and must store that predicted video signal temporarily. The same applies on the decoding side: before starting to decode a frame, the predicted video signal by disparity compensation must be generated for the entire decoding target frame, regardless of whether each region of the decoding target frame was encoded using disparity-compensated prediction, and must be stored temporarily.

Furthermore, when a plurality of reference frames are used for disparity-compensated prediction of a processing target frame in order to limit the loss of prediction quality caused by occlusion, noise, and luminance changes, memory capable of temporarily storing as many frames of predicted video signal as there are reference frames must be provided.

Encoding and decoding multi-view video requires processing as many two-dimensional videos as there are viewpoints at the same time. If the amount of memory required per viewpoint is large, a very large memory is needed overall. This is a serious problem in terms of cost, power consumption, data transfer time, and miniaturization.

Moreover, when a block's video signal is encoded by selecting one of several video prediction methods, as in the schemes that H.264/AVC and others adopt to improve coding efficiency, there are many blocks for which inter-camera prediction is not performed. Nevertheless, the conventional method must generate the disparity-compensated prediction signal for the entire frame before processing that frame. This means that unnecessary computation is performed at decoding time, which is a serious problem for real-time processing and for reducing power consumption.

In addition, because the conventional method finds the corresponding region on the processing frame for each region of the reference frame, a plurality of disparity vectors end up being used within a block of the processing target frame. In that case the prediction signal contains discontinuities, called block noise, that do not occur in natural images. When such a discontinuous prediction signal is used, the prediction residual has frequency components different from the usual ones, and coding efficiency falls.

The present invention has been made in view of these circumstances. Its object is to realize efficient multi-view image coding that reduces the required memory capacity and the amount of computation on the decoding side. When distance information is given for the reference frames used by a processing target frame (an encoding target frame or a decoding target frame), the predicted video signal for the entire processing target frame is not generated before processing starts; instead, distance information for the processing target frame is generated before processing starts, and the predicted video signal is generated block by block using that distance information.

To solve the problems described above, the present invention does not directly generate and store a predicted video signal from the distance information given for the reference frames. Instead, it generates and stores distance information for one processing target frame from the distance information given for each of a plurality of reference frames and, as needed, generates only the predicted video signal for each processing block using the stored distance information for the processing target frame.
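The "as needed" part of this approach can be illustrated with a small sketch in which view synthesis runs only for blocks that actually use inter-camera prediction, as a decoder could do once it has read each block's prediction mode. The function and parameter names (`process_blocks`, `needs_view_synthesis`, `synthesize_block`) are hypothetical and do not come from the source.

```python
def process_blocks(num_blks, needs_view_synthesis, synthesize_block):
    """Return one prediction per block, synthesising only where requested.

    needs_view_synthesis(blk) -> bool : does this block use inter-camera prediction?
    synthesize_block(blk)             : builds the prediction for that block from
                                        the stored distance information.
    """
    predictions = []
    for blk in range(num_blks):
        if needs_view_synthesis(blk):
            predictions.append(synthesize_block(blk))
        else:
            predictions.append(None)   # block is predicted by some other mode
    return predictions
```

In this arrangement only the distance information for the processing target frame has to be kept for the whole frame; the synthesized samples exist one block at a time.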

As a result, only the distance information for the entire frame needs to be stored, not the predicted video signal for the entire frame. A video signal normally comprises three channels of data, whereas distance information comprises one channel, equivalent to a monochrome image, so the required memory capacity can be reduced. In particular, when a plurality of reference frames are used, the distance information does not vary with the reference frame, so storing one frame of data per processing target frame is sufficient. This clearly requires less memory than the conventional method, which must store as many frames of data as there are reference frames.
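The saving can be made concrete with a rough calculation. The sketch assumes 8-bit samples, three full-resolution video channels (no chroma subsampling), and a distance map stored at the same resolution and bit depth; these figures and the function names are illustrative assumptions, not values from the source.

```python
def prediction_buffer_bytes(width, height, num_refs, channels=3, bytes_per_sample=1):
    """Conventional approach: one full predicted frame per reference frame."""
    return width * height * channels * bytes_per_sample * num_refs


def distance_buffer_bytes(width, height, bytes_per_sample=1):
    """Proposed approach: a single one-channel distance map per target frame."""
    return width * height * bytes_per_sample
```

For a 1024x768 frame with two reference frames, these assumptions give 4,718,592 bytes for the predicted frames against 786,432 bytes for the distance map, a factor of six.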

Distance information for the processing target frame is generated from the distance information given for a reference frame as follows. First, the three-dimensional position of the subject is restored according to the physical process by which the camera that captured the reference frame images the subject. Next, the distance from the processing target frame to the subject is restored according to the physical process by which the camera that captured the processing target frame images a subject at the restored three-dimensional position. Because this processing follows the physical process of image capture, it can be performed with very high accuracy provided that the camera's projective transformation can be modeled adequately.
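The two steps, back-projection and re-projection, might look like the sketch below. It assumes a simple pinhole model in which a world point X maps to camera coordinates as R X + t, distance is the z component in camera coordinates, and the intrinsics are (fx, fy, cx, cy) with no skew or lens distortion. The patent's own camera parameter convention may differ, so the formulas would need adapting; all names here are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def transpose(m):
    return [[m[c][r] for c in range(3)] for r in range(3)]


def back_project(u, v, depth, cam):
    """Reference pixel (u, v) with its distance -> world point.
    Assumed convention: x_cam = R @ X_world + t."""
    fx, fy, cx, cy = cam["intrinsics"]
    x_cam = [(u - cx) * depth / fx, (v - cy) * depth / fy, depth]
    diff = [x_cam[i] - cam["t"][i] for i in range(3)]
    return mat_vec(transpose(cam["R"]), diff)   # R is a rotation, so R^-1 = R^T


def re_project(world, cam):
    """World point -> (u, v, distance) as seen by the target camera."""
    fx, fy, cx, cy = cam["intrinsics"]
    x_cam = [a + b for a, b in zip(mat_vec(cam["R"], world), cam["t"])]
    depth = x_cam[2]
    return fx * x_cam[0] / depth + cx, fy * x_cam[1] / depth + cy, depth
```

Applying `back_project` with the reference camera and then `re_project` with the target camera yields both where the sample lands in the target frame and the distance value to store there.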

When a plurality of reference frames are used, generating distance information for one processing target frame from the distance information given for each single reference frame yields as many frames of distance information as there are reference frames. Storing all of them while one frame is encoded or decoded requires a large amount of memory, although less than storing the predicted video signals. Therefore, in addition to the above processing, the present invention includes means for generating one frame of distance information for the processing target frame from the generated frames, and for using it in common for all reference frames.

Specifically, after distance information for the processing target frame has been generated from the distance information given for each reference frame, the multiple distance values obtained for the same region are reduced to a single value by filtering or similar processing, which reduces the amount of memory.
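The text does not name a particular filter at this point. As one possible illustration, the sketch below takes the per-pixel median of the candidate distance maps and skips pixels for which a reference frame produced no value (marked `None`). Both the choice of a median and the `None` convention are assumptions of this example.

```python
from statistics import median


def merge_distance_maps(candidate_maps):
    """Collapse several candidate distance maps (one per reference frame)
    into a single map by taking the median of the available values."""
    merged = []
    for samples in zip(*candidate_maps):        # values for one pixel position
        valid = [s for s in samples if s is not None]
        merged.append(median(valid) if valid else None)
    return merged
```

A median is robust to a single outlying candidate, but a mean, a minimum, or a confidence-weighted choice would fit the same interface.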

Generating distance information that is independent of the reference frame, however, makes it impossible to detect the occurrence of occlusion. For each region, the generated distance information gives the correspondence between the region on the processing target frame and the region on a reference frame for which a correspondence exists, but it carries no information on whether a correspondence exists. Therefore, in addition to the above processing, the present invention includes means for detecting occlusion by comparing the distance information of the processing block with the distance information of the corresponding block on the reference frame.
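One way to picture this comparison is sketched below. It assumes that the target block's distance has already been re-expressed in each reference camera's coordinates, so that it can be compared directly with the distance stored for the matching reference block, and that a fixed tolerance decides whether the two agree. The tolerance value and the function names are hypothetical.

```python
def visible_in_reference(expected_ref_depth, stored_ref_depth, tolerance=1.0):
    """True when the reference frame appears to show the same surface,
    i.e. the two distance values agree within the tolerance."""
    return abs(expected_ref_depth - stored_ref_depth) <= tolerance


def usable_references(expected_depths, stored_depths, tolerance=1.0):
    """Indices of the reference frames in which the block is not occluded."""
    return [
        i
        for i, (expected, stored) in enumerate(zip(expected_depths, stored_depths))
        if visible_in_reference(expected, stored, tolerance)
    ]
```

A reference frame whose stored distance disagrees with the expected one is likely showing a different surface at that position, so it can be left out when the prediction for the block is synthesized.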

According to the present invention, multi-view images and multi-view video can be encoded and decoded using only a small amount of memory, regardless of the number of reference frames used.

FIG. 1 is a diagram showing a configuration example of the multi-view video encoding apparatus of Reference Example 1.
FIG. 2 is a processing flowchart of the multi-view video encoding apparatus of Reference Example 1.
FIG. 3 is a flowchart showing in detail the process of generating processing frame distance information from reference distance information when there is one reference camera.
FIG. 4 is a flowchart showing in detail the process of encoding the encoding target frame while generating a viewpoint synthesis image for each block on a pixel basis.
FIG. 5 is a flowchart showing in detail the process of encoding the encoding target frame while generating a viewpoint synthesis image for each block on a block basis.
FIG. 6 is a diagram showing a configuration example of the multi-view video encoding apparatus of Embodiment 1.
FIG. 7 is a processing flowchart of the multi-view video encoding apparatus of Embodiment 1.
FIG. 8 is a diagram showing a configuration example of a multi-view video encoding apparatus that is a variant of the multi-view video encoding apparatus of Embodiment 1.
FIG. 9 is a flowchart showing in detail the process of generating a viewpoint synthesis image while taking occlusion into account.
FIG. 10 is a diagram showing a configuration example of the multi-view video decoding apparatus of Reference Example 2.
FIG. 11 is a processing flowchart of the multi-view video decoding apparatus of Reference Example 2.
FIG. 12 is a flowchart showing in detail the process of decoding the decoding target frame block by block while generating a viewpoint synthesis image on a pixel basis as needed.
FIG. 13 is a flowchart showing in detail the process of decoding the decoding target frame block by block while generating a viewpoint synthesis image on a block basis as needed.
FIG. 14 is a diagram showing a configuration example of the multi-view video decoding apparatus of Embodiment 2.
FIG. 15 is a processing flowchart of the multi-view video decoding apparatus of Embodiment 2.
FIG. 16 is a diagram showing a configuration example of a multi-view video decoding apparatus that is a variant of the multi-view video decoding apparatus of Embodiment 2.
FIG. 17 is a diagram conceptually showing the disparity that arises between cameras.
FIG. 18 is a diagram conceptually showing the epipolar geometry constraint that exists between cameras.
FIG. 19 is a flowchart of the overall conventional encoding process.
FIG. 20 is a flowchart showing in detail the signal encoding process X4 of FIG. 19.
FIG. 21 is a flowchart of the overall conventional decoding process.
FIG. 22 is a flowchart showing in detail the signal decoding process Y4 of FIG. 21.

The present invention is described in detail below according to its embodiments. In the following description, position-identifying information enclosed in brackets [ ] (a coordinate value, or an index that can be mapped to a coordinate value) is appended to a video or to distance information to denote the video signal or distance information sampled by the pixel at that position. The distance information is assumed to take larger values the farther a point is from the camera, and to give a distance in the three-dimensional coordinate system in which the camera parameters are defined. However, any information that can be converted into distance information satisfying this condition (for example, distances expressed by means of an index) can be used by converting it as needed, and the effect of applying the method of the present invention can still be obtained.

In the following description, the camera parameters of each camera are assumed to have been obtained already. With the camera number denoted view, the camera's intrinsic parameter matrix is written A_view, its rotation matrix R_view, and its translation vector t_view. Because camera parameters can be expressed in various ways, the formulas used below must be adapted to the definition of the camera parameters in use. In this embodiment, the camera parameter representation is assumed to be one in which the correspondence between image coordinates m and world coordinates M is given by the following equation.

The tilde denotes homogeneous coordinates, which are defined up to an arbitrary scalar multiple.

[Multi-view video encoding apparatus (first reference example)]
First, a first reference example (hereinafter, Reference Example 1) for explaining the embodiments of the present invention is described. Reference Example 1 assumes that a multi-view video captured by two cameras is encoded, and describes a method of encoding the video captured by camera 2 using the video captured by camera 1 as a reference image.

FIG. 1 shows the configuration of the video encoding apparatus according to Reference Example 1. As shown in FIG. 1, the multi-view video encoding apparatus 100 includes: an encoding target image input unit 101 that inputs the frame of camera 2 to be encoded; an encoding target image memory 102 that stores the input encoding target frame; a reference camera image input unit 103 that inputs the frame of camera 1 serving as the reference frame; a reference camera image memory 104 that stores the reference frame; a reference camera distance information input unit 105 that inputs distance information for the reference frame; a distance information backprojection unit 106 that backprojects the distance information into three-dimensional space using the camera projection model; a distance information reprojection unit 107 that reprojects three-dimensional points onto camera 2 according to the camera projection model; a processing frame distance information memory 108 that stores distance information for the encoding target frame; a viewpoint composite image generation unit 109 that generates a composite image of the encoding target frame from the video signal of the reference frame and the distance information for the encoding target frame; and an image encoding unit 110 that encodes the encoding target frame using the generated composite image as a predicted image.

FIG. 2 shows the processing flow executed by the multi-view video encoding apparatus 100 configured in this way. The processing executed by the multi-view video encoding apparatus 100 of Reference Example 1 is described in detail below according to this flow.

First, the reference distance information RefDepth is input from the reference camera distance information input unit 105 [A1]. The reference distance information is the distance information for the reference frame. The reference distance information input here is assumed to have been encoded once with an arbitrary encoding method and then decoded from the resulting encoded data. Using the same information that is available at the decoding apparatus suppresses the coding noise known as drift distortion. If drift distortion is acceptable, however, the original information before encoding may be input instead.

Next, the processing frame distance information CurDepth, which is the distance information for the encoding target frame, is generated from the reference distance information and stored in the processing frame distance information memory 108 [A2]. This processing is described in detail later.

After that, the encoding target frame Org is input from the encoding target image input unit 101, and the image of camera 1 serving as the reference frame Ref is input from the reference camera image input unit 103; they are stored in the encoding target image memory 102 and the reference camera image memory 104, respectively [A3]. The reference frame Ref input at the reference camera image input unit 103 is assumed to have been encoded once with an arbitrary encoding method and then decoded from the resulting encoded data. Using the same information that is available at the decoding apparatus suppresses the coding noise known as drift distortion. If drift distortion is acceptable, however, the original information before encoding may be input instead.

Next, the image encoding unit 110 encodes the encoding target frame while the viewpoint composite image generation unit 109 generates a viewpoint composite image for each coding unit block from the reference frame and the processing frame distance information [A4]. This processing is described in detail later.

FIG. 3 shows the detailed flow of processing A2 of FIG. 2, in which the processing frame distance information is generated from the reference distance information.

First, the processing frame distance information CurDepth is initialized [A201]. In this initialization, the processing frame distance information at every position is set to the maximum value it can take.

Next, the processing frame distance information is built up incrementally, one pixel of the reference distance information at a time [A202-A208]. That is, with pix denoting the pixel index of the reference distance information and numPixs the number of pixels, pix is initialized to 0 [A202], and the following processing [A203-A206] is repeated while 1 is added to pix [A207] until pix reaches numPixs [A208].

In the processing repeated for each pixel of the reference distance information RefDepth, the distance information backprojection unit 106 first backprojects the pixel pix into the three-dimensional coordinate system to obtain a three-dimensional point g [A203]. Specifically, this processing is performed using (Equation 1). Here, (u_pix, v_pix) denotes the coordinates of the pixel pix on the image plane.

Next, the distance information reprojection unit 107 reprojects the three-dimensional point g onto the encoding target frame to obtain the position p at which g is projected on the encoding target frame and the distance d from camera 2 to g [A204]. Specifically, this processing is performed using (Equation 2). Note that p is expressed as (x, y).

Once the reprojected position p and its distance d have been obtained, the processing frame distance information at position p in the processing frame distance information memory 108 is updated. Because occlusion arises when the camera position or orientation changes, the processing frame distance information CurDepth[p] already held for position p is compared with the newly obtained distance d [A205], and CurDepth[p] is updated to d only when d indicates a point closer to the camera [A206].
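By way of a non-limiting illustration, steps A201 to A206 may be sketched in Python as follows. The text of (Equation 1) and (Equation 2) is not relied on here; the sketch instead assumes the common pinhole convention x_cam = R M + t with the last row of A equal to (0, 0, 1), so that the distance is the depth along the optical axis. The function name warp_depth, rounding to the nearest pixel, and equal frame sizes for both cameras are likewise assumptions made only for this sketch.

```python
import numpy as np

def warp_depth(ref_depth, A_ref, R_ref, t_ref, A_cur, R_cur, t_cur, max_d):
    """Warp a reference-camera depth map onto the target camera (A201-A206).

    Assumed convention (not taken from the source): x_cam = R @ M + t and
    m~ = A @ x_cam, with depth measured along the camera's optical axis.
    """
    h, w = ref_depth.shape
    # A201: every position starts at the largest representable distance.
    cur_depth = np.full((h, w), max_d, dtype=np.float64)
    A_ref_inv = np.linalg.inv(A_ref)
    for pix in range(h * w):                       # A202, A207, A208
        v, u = divmod(pix, w)
        # A203: back-project pixel (u, v) and its depth to a 3-D point g.
        x_cam = A_ref_inv @ np.array([u, v, 1.0]) * ref_depth[v, u]
        g = R_ref.T @ (x_cam - t_ref)
        # A204: re-project g into the target camera; d is its depth there.
        q = A_cur @ (R_cur @ g + t_cur)
        d = q[2]
        if d <= 0:
            continue                               # behind the target camera
        x, y = int(round(q[0] / d)), int(round(q[1] / d))
        if not (0 <= x < w and 0 <= y < h):
            continue                               # projects outside the frame
        # A205/A206: keep the new distance only if it is nearer to the camera.
        if d < cur_depth[y, x]:
            cur_depth[y, x] = d
    return cur_depth
```

Positions that receive no projected point keep the value max_d, which matches the initialization of A201 and lets later steps recognize them as undefined.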

After the above processing has been completed for all pixels of the reference distance information, the resulting processing frame distance information CurDepth is corrected in view of continuity in real space [A209]. Specifically, a spatial noise-removal filter such as a median filter or a bilateral filter is applied. This processing corrects erroneous distance values that arise because the distance information is sampled discretely and the source and destination of the conversion are not sampled on the same grid. Because this noise is attributable to the sampling interval, it does not occur in large spatial clusters. Applying a noise filter such as a median filter or a bilateral filter therefore corrects spatially isolated distance values to values similar to those of the surrounding distance information. This processing can be omitted in cases where sampling noise does not occur, for example when camera 1 and camera 2 are very close to each other. However, if the decoding apparatus performs this processing, omitting it in the encoding apparatus causes drift distortion.
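As one possible illustration of the correction in step A209, the following Python sketch applies a 3x3 median filter to a distance map. The window size, the edge-replication padding and the function name median_filter3x3 are assumptions made for this sketch; the description itself only requires some spatial noise-removal filter.

```python
import numpy as np

def median_filter3x3(depth):
    """Apply a 3x3 median filter with edge replication (a sketch of A209)."""
    padded = np.pad(depth, 1, mode="edge")
    h, w = depth.shape
    # Stack the nine shifted views of the padded map, then take the
    # per-position median across them.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    return np.median(windows, axis=0)
```

An isolated outlier, such as a single position left at the maximum distance inside an otherwise flat region, is replaced by the value of its neighbours, while a uniform region is left unchanged.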

FIG. 4 shows a first detailed flow of processing A4 of FIG. 2, in which the encoding target frame is encoded while a viewpoint composite image is generated for each coding unit block.

This processing is performed for each coding unit block. That is, with blk denoting the coding unit block index and numBlks the total number of coding unit blocks, blk is initialized to 0 [A411], and the following is repeated while 1 is added to blk [A418] until blk reaches numBlks [A419]: the processing [A412-A416] that generates the viewpoint composite image Synth, followed by the processing [A417] that encodes the encoding target frame Org[blk] using the generated Synth as a predicted image candidate.

Any method that performs video prediction can be used for the image encoding that encodes the encoding target frame with the viewpoint composite image as a predicted image candidate. For example, as in H.264/AVC, encoding can be performed by generating the difference signal between the input image and the predicted image and applying an orthogonal transform, quantization, and entropy coding to that difference signal. Predicted images other than the viewpoint composite image may also be prepared as candidates, and the prediction method that gives the best coding efficiency may be selected adaptively for each block. For example, disparity-compensated prediction can be performed with the reference frame by separately encoding a corresponding-point vector, and motion-compensated prediction can be performed with the decoded image of an already encoded frame of camera 2.
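The per-block choice among predicted image candidates can be illustrated by the following Python sketch. It uses the sum of absolute differences (SAD) of the residual as a simple stand-in for coding efficiency; this cost measure, the function name choose_prediction and the candidate labels are assumptions made for the sketch, and a practical encoder would typically use a rate-distortion cost instead.

```python
import numpy as np

def choose_prediction(org_block, candidates):
    """Pick the candidate predicted block with the smallest residual SAD.

    candidates maps a label (hypothetical names) to a predicted block of the
    same shape as org_block. Returns the chosen label and its residual, i.e.
    the difference signal that would then be transformed and quantized.
    """
    org = np.asarray(org_block, dtype=np.int64)
    best_name, best_residual, best_cost = None, None, None
    for name, pred in candidates.items():
        residual = org - np.asarray(pred, dtype=np.int64)
        cost = int(np.abs(residual).sum())
        if best_cost is None or cost < best_cost:
            best_name, best_residual, best_cost = name, residual, cost
    return best_name, best_residual
```

For instance, the candidates might be the viewpoint composite image of the block and a motion-compensated block from an already encoded frame of camera 2; the label of the chosen candidate would be signalled to the decoder.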

The viewpoint composite image is generated by finding, for each pixel in the coding unit block blk, the corresponding pixel on the reference frame and using the video signal at that corresponding pixel as the viewpoint composite image signal of the pixel. That is, with pel denoting the pixel index and numPels_blk the number of pixels in the coding unit block blk, pel is initialized to 0 [A412], and the following is repeated while 1 is added to pel [A415] until pel reaches numPels_blk [A416]: the corresponding pixel cp on the reference frame is obtained for the pixel blk_pel on the encoding target frame according to (Equation 3) [A413], and the pixel value Ref[cp] of the pixel cp is set as the viewpoint composite image Synth[pel] at the pixel blk_pel [A414].

Here, (u_blk,pel, v_blk,pel) is the position of the pixel blk_pel on the encoding target frame, (cp_x, cp_y) is the position of the corresponding pixel on the reference frame, and s is a scalar value.

Here, Synth only needs to hold the viewpoint composite image for the coding unit block currently being processed; it does not need to hold the viewpoint composite image of the entire encoding target frame. In other words, by alternating the generation of the viewpoint composite image Synth and the image encoding for each coding unit block, at most one coding unit block's worth of Synth has to exist at any one time.
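The per-block synthesis of steps A412 to A416 might be sketched in Python as below. The sketch assumes the pinhole convention x_cam = R M + t rather than the exact form of (Equation 3), rounds the corresponding position to the nearest pixel, and clamps it to the frame; these choices and the name synthesize_block are illustrative assumptions. Only the samples of one block are allocated, which reflects the memory saving described above.

```python
import numpy as np

def synthesize_block(ref, cur_depth, block_pixels,
                     A_cur, R_cur, t_cur, A_ref, R_ref, t_ref):
    """Return Synth for one coding unit block (a sketch of A412-A416).

    block_pixels lists the (u, v) positions of the block's pixels.
    Assumed convention (not from the source): x_cam = R @ M + t, m~ = A @ x_cam.
    """
    h, w = ref.shape[:2]
    A_cur_inv = np.linalg.inv(A_cur)
    synth = np.zeros(len(block_pixels), dtype=ref.dtype)
    for pel, (u, v) in enumerate(block_pixels):    # A412, A415, A416
        # A413: back-project the target pixel, then project into the reference.
        x_cam = A_cur_inv @ np.array([u, v, 1.0]) * cur_depth[v, u]
        M = R_cur.T @ (x_cam - t_cur)
        q = A_ref @ (R_ref @ M + t_ref)            # q = s * (cp_x, cp_y, 1)
        cp_x = int(np.clip(round(q[0] / q[2]), 0, w - 1))
        cp_y = int(np.clip(round(q[1] / q[2]), 0, h - 1))
        # A414: the reference sample becomes the synthesized sample.
        synth[pel] = ref[cp_y, cp_x]
    return synth
```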

FIG. 5 shows a second detailed flow of processing A4 of FIG. 2, in which the encoding target frame is encoded while a viewpoint composite image is generated for each coding unit block.

This processing flow differs from the flow described with reference to FIG. 4 only in that the processing [A412-A416] that generates a viewpoint composite image for each coding unit block is replaced by the processing [A422-A423]. Accordingly, only the processing [A422-A423] is described below.

In the viewpoint composite image generation performed here, a single piece of representative distance information dep is generated for the block blk from the processing frame distance information CurDepth[blk] of the coding unit block blk (a set of distance values, since each pixel has one) [A422]; the block blk' on the reference frame that corresponds to the block blk of the encoding target frame is obtained using dep [A423]; and the image signal of the reference frame in that block is used as the viewpoint composite image.

Because the representative distance information dep is a representative value of CurDepth[blk] in the block blk, it can be given by, for example, the mean, the median, or the most frequent value of CurDepth[blk].
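The three choices of representative value named above can be illustrated with a short Python sketch. The function name representative_depth, the mode keyword, and the tie-breaking rule for the most frequent value are assumptions made for this illustration.

```python
import numpy as np

def representative_depth(block_depth, mode="median"):
    """Reduce the distance values of one block to a single representative."""
    values = np.asarray(block_depth, dtype=np.float64).ravel()
    if mode == "mean":
        return float(values.mean())
    if mode == "median":
        return float(np.median(values))
    if mode == "mode":
        uniq, counts = np.unique(values, return_counts=True)
        # argmax returns the first maximum, so ties resolve to the
        # smallest (nearest) distance value.
        return float(uniq[np.argmax(counts)])
    raise ValueError(f"unknown mode: {mode}")
```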

The block blk' on the reference frame corresponding to the block blk can be obtained by taking the coordinates of the center or of a corner of the block blk as blk_pel and using the cp obtained from (Equation 3) above as the coordinates of the center or of the same corner of the block blk'.

Generating a prediction signal from multiple pieces of distance information within a block is equivalent to generating a prediction signal by finding corresponding points with multiple vectors within the block. In general, finding corresponding points with multiple vectors when generating a prediction signal introduces block noise into the prediction signal and lowers the efficiency with which the prediction residual is coded in the frequency domain. By generating one piece of representative distance information per block, as in the processing here, the corresponding points are in effect found with a single vector within the block, so no block noise arises in the prediction signal and the loss of coding efficiency for the prediction residual is avoided. Although one piece of representative distance information is generated per block here, the block may instead be divided into multiple sub-blocks with one piece of representative distance information generated per sub-block; this suppresses block noise in the prediction signal while limiting the loss of prediction efficiency.

[Multi-view video encoding apparatus (first embodiment)]
Next, a first embodiment of the present invention (hereinafter, Embodiment 1) is described. Embodiment 1 assumes that a multi-view video captured by N cameras is encoded, and describes a method of encoding the video captured by camera N using the videos captured by cameras 1 to N-1 as reference images. Unless otherwise noted, the symbols used in Reference Example 1 have the same meaning in this embodiment.

FIG. 6 shows the configuration of the video encoding apparatus according to Embodiment 1. As shown in FIG. 6, the multi-view video encoding apparatus 200 includes: an encoding target image input unit 201 that inputs the frame of camera N to be encoded; an encoding target image memory 202 that stores the input encoding target frame; a reference camera image input unit 203 that inputs the frames of cameras 1 to N-1 serving as reference frames; a reference camera image memory 204 that stores the group of reference frames; a reference camera distance information input unit 205 that inputs distance information for each reference frame; a distance information backprojection unit 206 that backprojects the distance information into three-dimensional space using the camera projection model; a distance information reprojection unit 207 that reprojects three-dimensional points onto camera N according to the camera projection model; a processing frame distance information candidate memory 208 that temporarily stores the group of distance information obtained by converting each piece of reference distance information; a distance information integration unit 209 that integrates that group into a single piece of distance information for the encoding target frame; a processing frame distance information memory 210 that stores the generated distance information for the encoding target frame; a viewpoint composite image generation unit 211 that generates a composite image of the encoding target frame from the video signals of the reference frames and the distance information for the encoding target frame; and an image encoding unit 212 that encodes the encoding target frame using the generated composite image as a predicted image.

FIG. 7 shows the processing flow executed by the multi-view video encoding apparatus 200 configured in this way. The processing executed by the multi-view video encoding apparatus 200 is described in detail below according to this flow.

First, the reference distance information is input, and each piece of reference distance information is converted into a candidate for the distance information of the encoding target frame. That is, with ref denoting the index of the referenced camera, ref is initialized to 0 [B1], and the following is repeated while 1 is added to ref [B4] until ref reaches N-1 [B5]: the reference distance information RefDepth_ref for camera ref is input from the reference camera distance information input unit 205 [B2], and the processing frame distance information candidate TempDepth_ref, which is a candidate for the distance information of the encoding target frame, is generated from RefDepth_ref and temporarily stored in the processing frame distance information candidate memory 208 [B3].

Processing B3 is the same as processing A2 of Reference Example 1 above, with the following substitutions: the reference distance information RefDepth in processing A2 is read as the reference distance information RefDepth_ref, the processing frame distance information CurDepth as the processing frame distance information candidate TempDepth_ref, camera 1 as camera ref, camera 2 as camera N, the distance information backprojection unit 106 as the distance information backprojection unit 206, and the distance information reprojection unit 107 as the distance information reprojection unit 207.

When the processing for all the reference distance information has been completed, the distance information integration unit 209 generates the processing frame distance information CurDepth from the group of processing frame distance information candidates and stores it in the processing frame distance information memory 210 [B6]. For each pixel position of the processing frame distance information, this processing obtains a single distance from the set of distance values that the candidates hold for that pixel position, and it is repeated for every pixel. Accordingly, for a pixel position pp in the processing frame distance information, the processing can be expressed by the following (Equation 4).

Here, δ() with an overbar is a function that returns 0 when its argument is 0 and 1 otherwise, that is, the delta function with 0 and 1 swapped, and max_d is the maximum value that the distance information can take.

(Equation 4) above takes the mean of the processing frame distance information candidates as the processing frame distance information. Other options include using the median (Equation 5) or using the value that indicates the point closest to the camera (Equation 6).

Here, median() is a function that returns the median and min() is a function that returns the minimum. TEMP_DEPTH is the set of processing frame distance information candidates at the pixel position pp and can be expressed by the following (Equation 7). To account for noise components in the input reference distance information, outliers may be removed from TEMP_DEPTH beforehand and the above processing applied afterwards.
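The three integration rules described for processing B6 can be illustrated by the following Python sketch. It assumes, as the role of the inverted delta function and of max_d suggests, that a candidate equal to max_d marks a position where no point was projected and is therefore left out of the set; this reading, the function name integrate_candidates, and the choice to keep max_d where no candidate is valid are assumptions of the sketch rather than statements of (Equation 4) to (Equation 7).

```python
import numpy as np

def integrate_candidates(temp_depths, max_d, method="mean"):
    """Merge per-reference depth candidates into one depth map (a sketch of B6).

    temp_depths has shape (num_refs, H, W). Candidates equal to max_d are
    treated as undefined (assumption) and excluded from the statistic.
    """
    stack = np.asarray(temp_depths, dtype=np.float64)
    valid = stack != max_d                 # 1 where a candidate is defined
    count = valid.sum(axis=0)
    has = count > 0
    out = np.full(stack.shape[1:], float(max_d))
    if not has.any():
        return out
    if method == "mean":
        out[has] = np.where(valid, stack, 0.0).sum(axis=0)[has] / count[has]
    elif method == "median":
        out[has] = np.nanmedian(np.where(valid, stack, np.nan)[:, has], axis=0)
    elif method == "min":
        out[has] = np.where(valid, stack, np.inf).min(axis=0)[has]
    else:
        raise ValueError(f"unknown method: {method}")
    return out
```

The "min" rule keeps the candidate nearest to the camera, which is the conservative choice when the candidates disagree because of occlusion.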

After processing B6 has finished, the processing frame distance information candidates stored in the processing frame distance information candidate memory 208 may be released.

Once the processing frame distance information has been generated, the image of camera ref serving as the reference frame Ref_ref is input from the reference camera image input unit 203 and stored in the reference camera image memory 204, and the encoding target frame Org is input from the encoding target image input unit 201 and stored in the encoding target image memory 202 [B7].

After that, the image encoding unit 212 encodes the encoding target frame while the viewpoint composite image generation unit 211 generates a viewpoint composite image for each coding unit block from the reference frames and the processing frame distance information [B8]. This processing is the same as processing A4 of Reference Example 1. Because Embodiment 1 uses multiple reference frames, however, the corresponding pixels or corresponding block are obtained in each reference frame for each coding unit block, the video signal of that region is taken as one viewpoint composite image, and the multiple viewpoint composite images generated in this way are each used as predicted image candidates for encoding. In addition, as in H.264/AVC bi-prediction, the average of the viewpoint composite images generated from multiple reference frames may be added to the predicted image candidates as a further viewpoint composite image.

FIG. 8 shows a variant of the multi-view video encoding apparatus 200 shown in FIG. 6. As in the multi-view video encoding apparatus 200' shown in FIG. 8, a reference camera distance information memory 213 that stores the input group of reference distance information may be added to the multi-view video encoding apparatus 200 of FIG. 6, which makes it possible to generate viewpoint composite images that take occlusion into account.

The multi-view video encoding flow in this case is the same as the flow of Embodiment 1 described above, except that the part of processing B8 that generates the viewpoint composite image follows the flow shown in FIG. 9. This flow generates a viewpoint composite image for the block blk in the encoding target frame using a given reference frame ref. The processing for generating the viewpoint composite image is described below according to this flow.

Because the viewpoint composite image is generated pixel by pixel within the block, the pixel index pel is first initialized to 0 [B801], and the following is repeated while 1 is added to pel [B806] until pel reaches the number of pixels in the block, numPels_blk [B807]. The corresponding pixel cp on the reference frame ref and the restoration distance information rd are obtained for the pixel blk_pel on the encoding target frame according to (Equation 8) [B802], and the difference between the reference distance information RefDepth_ref[cp] at the corresponding pixel and the restoration distance information rd is compared with a predetermined threshold th [B803]. If the difference is smaller than the threshold, the pixel value Ref_ref[cp] at the corresponding pixel of the reference frame ref is set as Synth_ref[pel], the viewpoint composite image at the pixel blk_pel when the reference frame ref is used [B804]; otherwise, Synth_ref[pel] is set to NON_DEF, which indicates that the viewpoint composite image cannot be defined [B805]. Here, the restoration distance information is the distance from the camera that captured the reference frame to the subject captured at the pixel blk_pel on the encoding target frame.
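The occlusion test of steps B803 to B805 can be illustrated by the following Python sketch. It assumes that the corresponding pixel and the restoration distance of each pixel have already been obtained (step B802) and are passed in as tuples; the function name synthesize_checked, the tuple layout, the use of the absolute difference, and the sentinel value -1 for NON_DEF are assumptions made for the sketch.

```python
import numpy as np

NON_DEF = -1  # hypothetical sentinel: "viewpoint composite image not defined"

def synthesize_checked(ref, ref_depth, correspondences, th):
    """Copy reference samples only where the depths agree (a sketch of B803-B805).

    correspondences holds one (cp_x, cp_y, rd) tuple per pixel of the block,
    assumed to come from step B802.
    """
    synth = np.full(len(correspondences), NON_DEF, dtype=np.int64)
    for pel, (cp_x, cp_y, rd) in enumerate(correspondences):  # B801, B806, B807
        # B803: the reference camera sees the same surface only if the depth
        # stored at cp agrees with the restoration distance rd.
        if abs(ref_depth[cp_y, cp_x] - rd) < th:
            synth[pel] = ref[cp_y, cp_x]                      # B804
        # Otherwise the sample stays NON_DEF (B805): the point is occluded
        # in the reference view.
    return synth
```

A reference depth that is much smaller than rd means another object lies in front of the point in the reference view, so copying that sample would produce a wrong prediction.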

Although the viewpoint composite image is generated here by obtaining the corresponding pixel for each pixel, it is also possible, as described with reference to FIG. 5, to obtain representative distance information for each block and find the corresponding block. In that case, representative reference distance information is derived from the reference distance information in the corresponding block, and it is determined whether the difference between the representative distance information and the representative reference distance information is smaller than a predetermined threshold. If the difference is smaller than the threshold, the video signal of the reference frame ref in the corresponding block is used as the viewpoint composite image; otherwise, the viewpoint composite image is treated as not generable for the entire block.

When information indicating which reference frame was used to generate the viewpoint composite image is encoded, the code amount can be reduced further by assigning no syntax to a reference frame for which the viewpoint composite image cannot be generated for the entire block. When the syntax is arithmetically coded, coding efficiency can also be improved by setting a low occurrence probability for the syntax that indicates such a reference frame. In addition, when the average of predicted images generated by multiple methods is used as the actual predicted image, as in H.264/AVC bi-prediction, taking the average after excluding pixels for which the viewpoint composite image cannot be defined avoids a drop in overall prediction efficiency.
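The averaging with undefined pixels excluded can be illustrated with a short sketch. The function name and the flat per-block list layout are assumptions made for illustration; `None` plays the role of NON_DEF.

```python
from typing import List, Optional, Sequence

NON_DEF = None  # marker for "viewpoint composite image not defined at this pixel"


def average_predictions(
    preds: Sequence[Sequence[Optional[float]]],
) -> List[Optional[float]]:
    """Per-pixel mean over the predictions that are defined at that pixel."""
    out: List[Optional[float]] = []
    for values in zip(*preds):  # one tuple of candidate values per pixel
        defined = [v for v in values if v is not NON_DEF]
        # If no prediction is defined, the averaged prediction stays undefined.
        out.append(sum(defined) / len(defined) if defined else NON_DEF)
    return out
```

With two predictions, a pixel defined in only one of them simply takes that prediction's value, so an undefined sample never drags the average toward an arbitrary fill value.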

[Multi-view video decoding device (second reference example)]
Next, a second reference example (hereinafter, Reference Example 2) for describing the embodiments of the present invention will be described. Reference Example 2 assumes that data obtained by encoding a multi-view video captured by two cameras is decoded, and describes a method of decoding the video captured by camera 2 using the video captured by camera 1 as a reference image.

FIG. 10 shows the configuration of the video decoding device according to Reference Example 2. As shown in FIG. 10, the multi-view video decoding device 300 includes: an encoded data input unit 301 that inputs encoded data of the frame of camera 2 to be decoded; an encoded data memory 302 that stores the input encoded data; a reference camera image input unit 303 that inputs the frame of camera 1 serving as the reference frame; a reference camera image memory 304 that stores the reference frame; a reference camera distance information input unit 305 that inputs distance information for the reference frame; a distance information backprojection unit 306 that backprojects the distance information into three-dimensional space using the camera projection model; a distance information reprojection unit 307 that reprojects the three-dimensional points onto camera 2 according to the camera projection model; a processing frame distance information memory 308 that stores distance information for the decoding target frame; a viewpoint composite image generation unit 309 that generates a composite image of the decoding target frame from the video signal of the reference frame and the distance information for the decoding target frame; and an image decoding unit 310 that decodes the decoding target frame while using the generated composite image as a predicted image.

FIG. 11 shows the processing flow executed by the multi-view video decoding device 300 configured in this way. The processing executed by the multi-view video decoding device 300 of Reference Example 2 is described in detail below according to this flow.

First, reference distance information RefDepth is input from the reference camera distance information input unit 305 [C1]. The reference distance information is the distance information for the reference frame. The reference distance information input here is assumed to have been decoded from encoded data produced by some arbitrary encoding method. If the original information can be obtained by some means, the original may be input instead. However, drift distortion will occur unless the input is identical to what was used at encoding time.

Next, processing frame distance information CurDepth, which is the distance information for the decoding target frame, is generated from the reference distance information and stored in the processing frame distance information memory 308 [C2]. This processing is the same as process A2 of Reference Example 1, with "encoding target frame" read as "decoding target frame".

Thereafter, encoded data is input from the encoded data input unit 301, and the image of camera 1 serving as the reference frame Ref is input from the reference camera image input unit 303; they are stored in the encoded data memory 302 and the reference camera image memory 304, respectively [C3]. The reference frame input here is assumed to have been decoded from encoded data produced by some arbitrary encoding method. If the original information can be obtained by some means, the original may be input instead. However, drift distortion will occur unless the input is identical to what was used at encoding time.

Then, the image decoding unit 310 decodes the decoding target frame Dec while the viewpoint composite image generation unit 309 generates a viewpoint composite image for each decoding unit block as needed, using the reference frame and the processing frame distance information [C4]. This processing is described in detail later.

FIG. 12 shows a first detailed flow of the processing performed in process C4 of FIG. 11, in which the decoding target frame is decoded while a viewpoint composite image is generated as needed for each decoding unit block.

This processing is performed for each decoding unit block. That is, with blk denoting the decoding unit block index and numBlks the total number of decoding unit blocks, blk is initialized to 0 [C4101], and the following is repeated while blk is incremented by 1 [C4110] until blk reaches numBlks [C4111]. It is checked whether block blk was encoded using the viewpoint composite image Synth [C4102]. If the viewpoint composite image was not used, the block is decoded without generating a viewpoint composite image [C4103]. If it was used, only the viewpoint composite image Synth for that block is generated [C4104-C4108], and the decoding target frame Dec[blk] of the block is decoded using the generated Synth [C4109].
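The block loop C4101-C4111 can be sketched as below. The three callbacks (`uses_synth`, `synthesize`, `decode_block`) are hypothetical placeholders for the bitstream check, the view synthesis, and the block decoder; none of these names come from the source.

```python
from typing import Callable, List, Optional


def decode_frame(
    num_blks: int,
    uses_synth: Callable[[int], bool],                  # hypothetical: reads the block's mode
    synthesize: Callable[[int], List[int]],             # hypothetical: C4104-C4108 for one block
    decode_block: Callable[[int, Optional[List[int]]], List[int]],  # hypothetical block decoder
) -> List[List[int]]:
    """Decode all blocks, synthesizing a prediction only where it is used."""
    dec: List[List[int]] = []
    for blk in range(num_blks):                 # C4101, C4110, C4111
        if uses_synth(blk):                     # C4102
            synth = synthesize(blk)             # Synth for this block only
            dec.append(decode_block(blk, synth))    # C4109
        else:
            dec.append(decode_block(blk, None))     # C4103: no synthesis at all
    return dec
```

Because `synth` is a local that is overwritten on each iteration, the sketch never holds more than one block's worth of synthesized samples, and blocks that do not use the viewpoint composite image incur no synthesis cost.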

Note that the method used to decode the decoding target frame with the viewpoint composite image as the predicted image must be the decoding method that corresponds to the encoding method used. For example, when encoding was performed as in H.264/AVC, by generating a difference signal between the input image and the predicted image and applying orthogonal transformation, quantization, and entropy coding to it, decoding is performed by applying entropy decoding, inverse quantization, and inverse orthogonal transformation to the bit string contained in the encoded data and adding the viewpoint composite image, which is the predicted image, to the resulting signal.
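The final step of such a decoder, adding the prediction to the decoded residual, might look like the sketch below. It assumes that entropy decoding, inverse quantization, and the inverse transform have already produced the residual, and it clips to the sample range, which is common codec practice but is not stated in the source; the function name and the 8-bit default are illustrative.

```python
from typing import List, Sequence


def reconstruct_block(
    synth: Sequence[int],      # predicted samples (the viewpoint composite image)
    residual: Sequence[int],   # decoded difference signal, same length as synth
    bit_depth: int = 8,        # assumed sample bit depth
) -> List[int]:
    """Add the residual to the prediction and clip to the valid sample range."""
    max_val = (1 << bit_depth) - 1
    return [min(max(p + r, 0), max_val) for p, r in zip(synth, residual)]
```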

The viewpoint composite image is generated by finding, for each pixel in the decoding unit block blk, the corresponding pixel on the reference frame and using the video signal at that pixel as the viewpoint composite image signal for the pixel. That is, with pel denoting the pixel index and numPels_blk the number of pixels in the decoding unit block blk, pel is initialized to 0 [C4104], and the following is repeated while pel is incremented by 1 [C4107] until pel reaches numPels_blk [C4108]: the corresponding pixel cp on the reference frame is obtained for the pixel blk_pel on the decoding target frame according to (Equation 3) described earlier [C4105], and the pixel value Ref[cp] of pixel cp is used as the viewpoint composite image Synth[pel] at pixel blk_pel [C4106].

Here, it is sufficient for Synth to hold the viewpoint composite image for the decoding unit block being processed; it does not need to hold the viewpoint composite image for the entire decoding target frame. In other words, by alternating the generation of the viewpoint composite image Synth and the image decoding for each decoding unit block, at most one decoding unit block's worth of Synth needs to exist at any one time.

FIG. 13 shows a second detailed flow of the processing performed in process C4 of FIG. 11, in which the decoding target frame is decoded while a viewpoint composite image is generated as needed for each decoding unit block.

The only difference between the processing flow described with reference to FIG. 12 and this flow is that the per-block viewpoint composite image generation [C4104-C4108] is replaced by processes [C4204-C4206]. The following description therefore covers only processes [C4204-C4206].

The viewpoint composite image generation performed here uses the processing frame distance information CurDepth[blk] for the decoding unit block blk (a set of distance values, since there is one per pixel) to generate a single piece of representative distance information dep for block blk [C4204], uses dep to find the block blk′ on the reference frame that corresponds to block blk of the decoding target frame [C4205], and takes the image signal of the reference frame in that block as the viewpoint composite image.

The block blk′ on the reference frame corresponding to block blk can be found by taking the coordinates of the center or a corner of block blk as blk_pel and using the cp obtained from (Equation 3) above as the coordinates of the center or the corresponding corner of block blk′.
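The block-level variant [C4204-C4206] can be sketched as follows. Several choices here are assumptions rather than the source's method: the median is used as the representative distance, the top-left corner is the anchor point, the `correspond` callback is a hypothetical stand-in for (Equation 3), and a block that maps outside the reference frame is reported as undefined.

```python
from statistics import median
from typing import Callable, List, Optional, Sequence, Tuple

Pixel = Tuple[int, int]  # (x, y)


def synthesize_block_by_representative(
    cur_depth_blk: Sequence[float],                   # CurDepth[blk], one value per pixel
    top_left: Pixel,                                  # corner of block blk used as blk_pel
    size: Tuple[int, int],                            # (width, height) of the block
    correspond: Callable[[Pixel, float], Pixel],      # hypothetical stand-in for (Equation 3)
    ref_image: Sequence[Sequence[int]],
) -> Optional[List[int]]:
    """Copy the reference-frame block blk' that corresponds to block blk."""
    dep = median(cur_depth_blk)                       # C4204 (median is an assumption)
    w, h = size
    cx, cy = correspond(top_left, dep)                # C4205: corner of blk'
    rows, cols = len(ref_image), len(ref_image[0])
    if cx < 0 or cy < 0 or cx + w > cols or cy + h > rows:
        return None                                   # blk' leaves the frame (sketch-only rule)
    # Copy one block of the reference frame, row by row.
    return [ref_image[cy + j][cx + i] for j in range(h) for i in range(w)]
```

Since every pixel of the block shares one displacement, the copied prediction has no internal discontinuities, at the cost of ignoring depth variation inside the block.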

[Multi-view video decoding device (second embodiment)]
Next, a second embodiment of the present invention (hereinafter, Embodiment 2) will be described. Embodiment 2 assumes that encoded data of a multi-view video captured by N cameras is decoded, and describes a method of decoding the video captured by camera N using the videos captured by cameras 1 to N-1 as reference images. Unless otherwise noted in this embodiment, the symbols used in Reference Example 2 have the same meaning here.

FIG. 14 shows the configuration of the multi-view video decoding device according to Embodiment 2. As shown in FIG. 14, the multi-view video decoding device 400 includes: an encoded data input unit 401 that inputs encoded data of the frame of camera N to be decoded; an encoded data memory 402 that stores the input encoded data; a reference camera image input unit 403 that inputs the frames of cameras 1 to N-1 serving as reference frames; a reference camera image memory 404 that stores the group of reference frames; a reference camera distance information input unit 405 that inputs distance information for each reference frame; a distance information backprojection unit 406 that backprojects the distance information into three-dimensional space using the camera projection model; a distance information reprojection unit 407 that reprojects the three-dimensional points onto camera N according to the camera projection model; a processing frame distance information candidate memory 408 that temporarily stores the group of distance information obtained by converting each piece of reference distance information; a distance information integration unit 409 that integrates that group of distance information into the distance information for the single decoding target frame; a processing frame distance information memory 410 that stores the generated distance information for the decoding target frame; a viewpoint composite image generation unit 411 that generates, as needed, a composite image of the decoding target frame from the video signal of the reference frames and the processing frame distance information; and an image decoding unit 412 that decodes the decoding target frame while using the generated composite image as a predicted image.

FIG. 15 shows the processing flow executed by the multi-view video decoding device 400 configured in this way. The processing executed by the multi-view video decoding device 400 of Embodiment 2 is described in detail below according to this flow.

First, the reference distance information is input, and each piece of reference distance information is converted into a candidate for the distance information of the decoding target frame. That is, with ref denoting the index of the camera to be referenced, ref is initialized to 0 [D1], and the following is repeated while ref is incremented by 1 [D4] until ref reaches N-1 [D5]: reference distance information RefDepth_ref for camera ref is input from the reference camera distance information input unit 405 [D2], and a processing frame distance information candidate TempDepth_ref, which is a candidate for the distance information of the decoding target frame, is generated from RefDepth_ref and temporarily stored in the processing frame distance information candidate memory 408 [D3]. Process D3 is the same as process C2 of Reference Example 2, with the following substitutions: reference distance information RefDepth becomes RefDepth_ref, processing frame distance information CurDepth becomes the processing frame distance information candidate TempDepth_ref, camera 1 becomes camera ref, camera 2 becomes camera N, the distance information backprojection unit 306 becomes the distance information backprojection unit 406, and the distance information reprojection unit 307 becomes the distance information reprojection unit 407.

Once all pieces of reference distance information have been processed, the distance information integration unit 409 generates the processing frame distance information CurDepth from the group of processing frame distance information candidates and stores it in the processing frame distance information memory 410 [D6]. This processing is the same as process B6 of Embodiment 1. As in Embodiment 1, once process D6 has finished, the processing frame distance information candidates held in the processing frame distance information candidate memory 408 may be released.
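Processes D1-D6 can be outlined as below. This is only a sketch under stated assumptions: `warp_to_target` is a hypothetical callback standing in for the backprojection and reprojection of process D3, and the per-pixel median used for integration is an illustrative choice, since the integration rule of process B6 is not reproduced in this section.

```python
from statistics import median
from typing import Callable, List, Optional, Sequence

Depth = List[Optional[float]]  # one value per pixel; None means "no candidate"


def build_cur_depth(
    ref_depths: Sequence[Depth],                       # RefDepth_ref for each reference camera
    warp_to_target: Callable[[int, Depth], Depth],     # hypothetical stand-in for process D3
) -> Depth:
    """Generate CurDepth for the target view from all reference depth maps."""
    # D1-D5: one candidate map TempDepth_ref per reference camera.
    temp_depth = [warp_to_target(ref, d) for ref, d in enumerate(ref_depths)]
    # D6: merge the candidates pixel by pixel (median is an assumption).
    cur_depth: Depth = []
    for candidates in zip(*temp_depth):
        defined = [c for c in candidates if c is not None]
        cur_depth.append(median(defined) if defined else None)
    # temp_depth goes out of scope here, mirroring the release of the candidates.
    return cur_depth
```

Only the single merged map survives the call, which matches the point that the candidate memory is needed only until integration completes.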

Once the processing frame distance information has been generated, the image of camera ref serving as reference frame Ref_ref is input from the reference camera image input unit 403 and stored in the reference camera image memory 404, and the encoded data is input from the encoded data input unit 401 and stored in the encoded data memory 402 [D7].

Then, the image decoding unit 412 decodes the decoding target frame from the encoded data while the viewpoint composite image generation unit 411 generates a viewpoint composite image for each decoding unit block as needed, using the reference frames and the processing frame distance information [D8]. This processing is the same as process C4 of Reference Example 2. In Embodiment 2, however, there are multiple reference frames, so for each decoding unit block the information specifying the reference frame, which is contained in the encoded data, is decoded, the corresponding pixels or corresponding block in that reference frame are found, and the video signal of that region is used as the viewpoint composite image. In addition, when the encoded data contains data indicating that multiple reference frames were used for encoding, as in H.264/AVC bi-prediction, a viewpoint composite image may be generated for each specified reference frame and their average used as a new viewpoint composite image.

FIG. 16 shows a variant of the multi-view video decoding device 400 shown in FIG. 14. As in the multi-view video decoding device 400′ shown in FIG. 16, adding a reference camera distance information memory 413 for storing the input group of reference distance information to the multi-view video decoding device 400 described as Embodiment 2 makes it possible to generate a viewpoint composite image that takes occlusion into account.

The multi-view video decoding flow in this case is the same as the flow of Embodiment 2 described above, except that the part of process D8 that generates the viewpoint composite image is the same as the processing described with reference to FIG. 9 for Embodiment 1, with "encoding target" read as "decoding target".

The processing described above can also be realized by a computer and a software program. The program can be provided recorded on a computer-readable recording medium or provided over a network.

The above embodiments have been described mainly in terms of the multi-view video encoding device and the multi-view video decoding device. The multi-view image encoding method and multi-view image decoding method of the present invention can be realized by steps corresponding to the operation of each unit of these devices.

The operation and effects of the above embodiments are described below.

(1) Reduction of memory usage
In this embodiment, instead of generating and storing a disparity-compensated image for the processing target frame, distance information for the processing viewpoint (processing frame distance information) is generated and stored. A video signal generally needs three channels of storage, whereas distance information needs only one. The memory required for disparity compensation can therefore be reduced to about one third. This results, for example, from processes A2 and A4 shown in FIG. 2 at encoding time and from processes C2 and C4 shown in FIG. 11 at decoding time. The prediction signal is generated by calculating a disparity vector from the distance information of the processing block within the processing frame distance information and copying one block's worth of video signal from the reference frame.
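The one-third figure can be checked with simple arithmetic. The sketch below assumes a hypothetical 1920x1080 frame, one byte per sample, and three full-resolution video channels (no chroma subsampling); none of these figures come from the source.

```python
def buffer_bytes(width: int, height: int, channels: int, bytes_per_sample: int = 1) -> int:
    """Size of a frame-sized buffer with the given number of channels."""
    return width * height * channels * bytes_per_sample


# Hypothetical frame size, chosen only for illustration.
video = buffer_bytes(1920, 1080, channels=3)   # stored disparity-compensated image
depth = buffer_bytes(1920, 1080, channels=1)   # stored processing frame distance information
ratio = depth / video                          # fraction of memory still needed
```

With chroma-subsampled video the stored image would be smaller than three full channels, so the saving would be correspondingly less than the two thirds obtained under these assumptions.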

(2) Faster decoding of blocks that do not use disparity compensation
In the conventional technique, the prediction signal for disparity-compensated prediction is generated for the entire processing target frame, so it is not possible to judge the need for disparity-compensated prediction block by block and skip generating the prediction signal. In this embodiment, by contrast, the prediction signal for disparity-compensated prediction is generated per processing block, so generating the inter-camera prediction signal only for the blocks that need it makes it possible to skip that generation for the blocks that do not. This is realized, for example, by the determination in process C4102 of FIG. 12, which selects between executing process C4103 and executing processes [C4104-C4109].

(3) Improved coding efficiency
In this embodiment, block noise in the prediction signal can be suppressed. In the conventional technique, the disparity-compensated image is generated using disparity vectors set for each pixel of the reference frame, so the number of disparity vectors used within a processing block cannot be controlled. When multiple disparity vectors are used within a processing block, the prediction signal becomes discontinuous, which lowers the efficiency of residual coding in the frequency domain. In this embodiment, by contrast, discontinuities within a processing block can be eliminated by controlling the number of disparity vectors used for each processing block, for example in process A4 of FIG. 2 and process C4 of FIG. 11.

The embodiments of the present invention have been described above with reference to the drawings. These embodiments are merely examples of the present invention, and the present invention is clearly not limited to them. Accordingly, components may be added, omitted, replaced, or otherwise modified without departing from the spirit and scope of the present invention.

100, 200 Multi-view video encoding device
101, 201 Encoding target image input unit
102, 202 Encoding target image memory
103, 203, 303, 403 Reference camera image input unit
104, 204, 304, 404 Reference camera image memory
105, 205, 305, 405 Reference camera distance information input unit
106, 206, 306, 406 Distance information backprojection unit
107, 207, 307, 407 Distance information reprojection unit
108, 210, 308, 410 Processing frame distance information memory
109, 211, 309, 411 Viewpoint composite image generation unit
110, 212 Image encoding unit
208, 408 Processing frame distance information candidate memory
209, 409 Distance information integration unit
213, 413 Reference camera distance information memory
300, 400 Multi-view video decoding device
301, 401 Encoded data input unit
302, 402 Encoded data memory
310, 412 Image decoding unit

Claims

In a multi-view image encoding method that, when encoding a multi-view image, encodes the multi-view image while predicting images between cameras using a reference distance information group indicating the distance from the camera to the subject for a plurality of reference images captured by different cameras:
A reference camera image setting step for setting a reference camera image obtained by decoding an already encoded camera image, which is used for prediction of an image signal between cameras when encoding an encoding target image;
A distance information candidate generation step for generating, for each reference image, a processing frame distance information candidate representing a distance candidate from the camera to the subject with respect to the encoding target image from the reference distance information;
For each pixel of the encoding target image, the processing frame distance information candidate obtained for each reference image with respect to that pixel is used to represent the distance from the camera that captured the encoding target image at that pixel to the subject. A distance information integration step for generating processing frame distance information;
For each pixel of the image to be encoded, the corresponding pixel on the reference camera image is identified using the processing frame distance information at the pixel, and the viewpoint composite image at the pixel is determined using the image signal at the corresponding pixel. A viewpoint composite image generation step to be generated;
An image encoding step for encoding an encoding target image using the viewpoint synthesized image generated in the viewpoint synthesized image generating step as a predicted image candidate;
A multi-view image encoding method characterized by comprising:

The multi-view image encoding method according to claim 1,
wherein, in the viewpoint composite image generation step, for each pixel of the encoding target image, the processing frame distance information at the pixel is used to obtain the corresponding pixel on the reference camera image and restoration distance information representing the distance from the camera that captured the reference camera image to the subject at the corresponding pixel, and the viewpoint composite image at the pixel is generated only when the difference between the restoration distance information and the reference distance information given to the corresponding pixel is within a predetermined range.

In the multi-view image encoding method according to claim 1 or 2,
In the viewpoint composite image generation step, for each block of the encoding target image, one piece of representative distance information is generated from the processing frame distance information for the pixels included in the block, and the corresponding pixels for the pixels of the encoding target image are identified by using the representative distance information to identify the corresponding region on the reference camera image.

In the multi-view image encoding method according to any one of claims 1 to 3,
The method further comprises a distance information correction step of correcting the processing frame distance information using a correction method that assumes continuity in real space;
In the viewpoint composite image generation step, the processing frame distance information corrected in the distance information correction step is used.
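
One possible correction that relies on continuity in real space is a small median filter over neighbouring distance values. The claim does not name a specific correction method, so the three-tap row-wise median below, like the function name, is only an assumed stand-in.

```python
from statistics import median


def smooth_depth_rows(depth):
    """Replace each interior value with the median of itself and its two row neighbours."""
    out = [row[:] for row in depth]          # border pixels are kept unchanged
    for y, row in enumerate(depth):
        for x in range(1, len(row) - 1):
            # Read from the original row so earlier corrections do not propagate.
            out[y][x] = median(row[x - 1:x + 2])
    return out


corrected = smooth_depth_rows([[2.0, 2.0, 9.0, 2.0, 2.0]])
```

An isolated outlier such as the `9.0` above is inconsistent with a continuous surface and is replaced by the value of its neighbours.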

The multi-view image encoding method according to any one of claims 1 to 4, wherein:
The viewpoint composite image generation step and the image encoding step are performed alternately for each encoding unit block, so that at any one time a viewpoint composite image is generated for only one encoding unit block.
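
The alternation of the two steps can be sketched as a loop in which the prediction for one block is generated, consumed by the encoder, and discarded before the next block is processed. The callables `synthesize_block` and `encode_block` are hypothetical placeholders for the two steps and are not defined by the claims.

```python
def encode_blockwise(block_ids, synthesize_block, encode_block):
    """Alternate synthesis and encoding so only one block's prediction is held."""
    bitstream = []
    for block_id in block_ids:
        predicted = synthesize_block(block_id)      # prediction for this block only
        bitstream.append(encode_block(block_id, predicted))
        # 'predicted' is overwritten on the next iteration, so no full-frame
        # viewpoint composite image is ever stored.
    return bitstream


calls = []
out = encode_blockwise(
    [0, 1],
    synthesize_block=lambda b: calls.append(("synth", b)) or b * 10,
    encode_block=lambda b, p: calls.append(("encode", b)) or p + 1,
)
```

Compared with synthesizing the whole frame first, this ordering keeps the memory needed for the viewpoint composite image at the size of one encoding unit block.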

In a multi-view image decoding method that, when decoding encoded data of a multi-view image, decodes the multi-view image while predicting images between cameras using reference distance information, given for each of a plurality of reference images used for inter-camera prediction, that indicates the distance from the camera that captured the reference image to the subject,
A reference camera image setting step for setting an already decoded reference camera image, which is used for prediction of an image signal between cameras in decoding a decoding target image;
A distance information candidate generation step for generating, for each reference image, a processing frame distance information candidate representing a candidate distance from the camera to the subject with respect to the decoding target image from the reference distance information;
A distance information integration step of generating, for each pixel of the decoding target image, processing frame distance information representing the distance from the camera that captured the decoding target image to the subject at that pixel, using the processing frame distance information candidates obtained for that pixel from each reference image;
A viewpoint composite image generation step of identifying, for each pixel of the decoding target image and only when that pixel was encoded using inter-camera image prediction, the corresponding pixel on the reference camera image using the processing frame distance information at that pixel, and generating the viewpoint composite image at that pixel using the image signal at the corresponding pixel;
An image decoding step for decoding a decoding target image using the viewpoint synthesized image generated in the viewpoint synthesized image generating step as a predicted image;
A multi-viewpoint image decoding method comprising:
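
The distance information integration step merges the per-reference candidates into a single distance map for the target frame. The claim does not fix the merging rule; the sketch below assumes that the smallest candidate is kept, on the reasoning that the surface nearest the camera hides anything behind it. The function name and the use of `None` for missing candidates are also assumptions.

```python
def integrate_candidates(candidate_maps):
    """Merge per-reference distance candidates into one map for the target frame."""
    height, width = len(candidate_maps[0]), len(candidate_maps[0][0])
    merged = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            values = [m[y][x] for m in candidate_maps if m[y][x] is not None]
            if values:
                # Assumption: keep the nearest surface among the candidates.
                merged[y][x] = min(values)
    return merged


from_ref_a = [[1.0, None, 3.0]]
from_ref_b = [[2.0, None, None]]
frame_distance = integrate_candidates([from_ref_a, from_ref_b])
```

Only the single merged map needs to be stored for the target frame, regardless of how many reference images contributed candidates.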

The multi-viewpoint image decoding method according to claim 6,
In the viewpoint composite image generation step, for each pixel of the decoding target image, only when that pixel was encoded using inter-camera image prediction, the processing frame distance information at that pixel is used to obtain the corresponding pixel on the reference camera image and restored distance information representing the distance from the camera that captured the reference camera image to the subject at the corresponding pixel, and the viewpoint composite image at that pixel is generated only when the difference between the restored distance information and the reference distance information given to the corresponding pixel is within a predetermined range.

In the multi-viewpoint image decoding method according to claim 6 or 7,
In the viewpoint composite image generation step, for each block of the decoding target image, only when that block was encoded using inter-camera image prediction, one piece of representative distance information is generated from the processing frame distance information of the pixels included in the block, and the representative distance information is used to identify the corresponding region on the reference camera image, thereby identifying the corresponding pixels on the reference camera image for the pixels in the block.

The multi-view image decoding method according to any one of claims 6 to 8,
The method further comprises a distance information correction step of correcting the processing frame distance information using a correction method that assumes continuity in real space;
In the viewpoint composite image generation step, the processing frame distance information corrected in the distance information correction step is used.

The multi-view image decoding method according to any one of claims 6 to 9,
The viewpoint composite image generation step and the image decoding step are performed alternately for each decoding unit block, so that at any one time a viewpoint composite image is generated for only one decoding unit block.
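
On the decoding side, the block-wise alternation combines naturally with the condition that synthesis is performed only for blocks encoded with inter-camera prediction. The sketch below is a simplified illustration: each block is reduced to a single number, the mode label `"inter_camera"` is hypothetical, and `0` stands in for whatever other prediction the decoder would use.

```python
def decode_blockwise(block_modes, residuals, synthesize_block):
    """Decode block by block, synthesizing only for inter-camera predicted blocks."""
    decoded = []
    synthesized = 0
    for block_id, mode in enumerate(block_modes):
        if mode == "inter_camera":
            prediction = synthesize_block(block_id)   # generated on demand
            synthesized += 1
        else:
            prediction = 0    # placeholder for any other prediction mode
        decoded.append(prediction + residuals[block_id])
    return decoded, synthesized


decoded, synthesized = decode_blockwise(
    ["inter_camera", "intra", "inter_camera"],
    [1, 2, 3],
    synthesize_block=lambda b: 100 + b,
)
```

Blocks that use another prediction mode skip the synthesis entirely, which is where the reduction in decoder-side computation comes from.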

In a multi-view image encoding device that, when encoding a multi-view image, encodes the multi-view image while predicting images between cameras using a reference distance information group indicating, for a plurality of reference images captured by different cameras, the distance from each camera to the subject,
A reference camera image setting means for setting a reference camera image obtained by decoding an already encoded camera image, which is used for prediction of an image signal between cameras when encoding an encoding target image;
A distance information candidate generating unit that generates, for each reference image, a processing frame distance information candidate that represents a candidate distance from the camera to the subject with respect to the encoding target image from the reference distance information;
Distance information integration means for generating, for each pixel of the encoding target image, processing frame distance information representing the distance from the camera that captured the encoding target image to the subject at that pixel, using the processing frame distance information candidates obtained for that pixel from each reference image;
Viewpoint composite image generation means for identifying, for each pixel of the encoding target image, the corresponding pixel on the reference camera image using the processing frame distance information at that pixel, and generating the viewpoint composite image at that pixel using the image signal at the corresponding pixel;
Image encoding means for encoding an encoding target image using the viewpoint synthesized image generated by the viewpoint synthesized image generating means as a predicted image candidate;
A multi-view image encoding apparatus comprising:

The multi-view image encoding device according to claim 11,
The viewpoint composite image generation means, for each pixel of the encoding target image, uses the processing frame distance information at that pixel to obtain the corresponding pixel on the reference camera image and restored distance information representing the distance from the camera that captured the reference camera image to the subject at the corresponding pixel, and generates the viewpoint composite image at that pixel only when the difference between the restored distance information and the reference distance information given to the corresponding pixel is within a predetermined range.

The multi-view image encoding device according to claim 11 or 12,
The viewpoint composite image generation means generates, for each block of the encoding target image, one piece of representative distance information from the processing frame distance information of the pixels included in the block, and identifies the corresponding pixels for the pixels of the encoding target image by using the representative distance information to identify the corresponding region on the reference camera image.

The multi-view image encoding device according to any one of claims 11 to 13,
The device further comprises distance information correction means for correcting the processing frame distance information using a correction method that assumes continuity in real space;
The viewpoint composite image generation means uses the processing frame distance information corrected by the distance information correction means.

The multi-view image encoding device according to any one of claims 11 to 14,
The viewpoint composite image generation means and the image encoding means operate alternately for each encoding unit block, so that at any one time a viewpoint composite image is generated for only one encoding unit block.

In a multi-view image decoding device that, when decoding encoded data of a multi-view image, decodes the multi-view image while predicting images between cameras using reference distance information, given for each of a plurality of reference images used for inter-camera prediction, that indicates the distance from the camera that captured the reference image to the subject,
A reference camera image setting means for setting an already decoded reference camera image, which is used for prediction of an image signal between cameras in decoding a decoding target image;
Distance information candidate generating means for generating a processing frame distance information candidate representing a distance candidate from the camera to the subject with respect to the decoding target image for each reference image from the reference distance information;
Distance information integration means for generating, for each pixel of the decoding target image, processing frame distance information representing the distance from the camera that captured the decoding target image to the subject at that pixel, using the processing frame distance information candidates obtained for that pixel from each reference image;
Viewpoint composite image generation means for identifying, for each pixel of the decoding target image and only when that pixel was encoded using inter-camera image prediction, the corresponding pixel on the reference camera image using the processing frame distance information at that pixel, and generating the viewpoint composite image at that pixel using the image signal at the corresponding pixel;
Image decoding means for decoding a decoding target image using the viewpoint synthesized image generated by the viewpoint synthesized image generating means as a predicted image;
A multi-viewpoint image decoding apparatus comprising:

The multi-view image decoding device according to claim 16,
The viewpoint composite image generation means, for each pixel of the decoding target image and only when that pixel was encoded using inter-camera image prediction, uses the processing frame distance information at that pixel to obtain the corresponding pixel on the reference camera image and restored distance information representing the distance from the camera that captured the reference camera image to the subject at the corresponding pixel, and generates the viewpoint composite image at that pixel only when the difference between the restored distance information and the reference distance information given to the corresponding pixel is within a predetermined range.

The multi-viewpoint image decoding device according to claim 16 or 17,
The viewpoint composite image generation means, for each block of the decoding target image and only when that block was encoded using inter-camera image prediction, generates one piece of representative distance information from the processing frame distance information of the pixels included in the block, and uses the representative distance information to identify the corresponding region on the reference camera image, thereby identifying the corresponding pixels on the reference camera image for the pixels in the block.

The multi-viewpoint image decoding device according to any one of claims 16 to 18,
The device further comprises distance information correction means for correcting the processing frame distance information using a correction method that assumes continuity in real space;
The viewpoint composite image generation means uses the processing frame distance information corrected by the distance information correction means.

The multi-view image decoding device according to any one of claims 16 to 19,
The viewpoint composite image generation means and the image decoding means operate alternately for each decoding unit block, so that at any one time a viewpoint composite image is generated for only one decoding unit block.

A multi-view image encoding program for causing a computer to execute the multi-view image encoding method according to any one of claims 1 to 5.

A computer-readable recording medium on which a multi-view image encoding program for causing a computer to execute the multi-view image encoding method according to any one of claims 1 to 5 is recorded.

A multi-view image decoding program for causing a computer to execute the multi-view image decoding method according to any one of claims 6 to 10.

A computer-readable recording medium on which a multi-view image decoding program for causing a computer to execute the multi-view image decoding method according to any one of claims 6 to 10 is recorded.