JP7322235B2

JP7322235B2 - Image processing device, image processing method, and program

Info

Publication number: JP7322235B2
Application number: JP2022069954A
Authority: JP
Inventors: 香織田谷
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2018-05-02
Filing date: 2022-04-21
Publication date: 2023-08-07
Anticipated expiration: 2038-05-02
Also published as: JP2022097541A

Description

The present invention relates to an image processing apparatus, an image processing method, and a program, and is particularly suitable for generating a virtual viewpoint image.

Techniques are known for reconstructing (generating) the image that would be obtained by observing a subject (for example, an object such as a person) from a virtual viewpoint (an arbitrary viewpoint, including one at which no imaging device actually exists), based on images of the subject captured by a plurality of imaging devices. Such an image is called a virtual viewpoint image. Patent Document 1 discloses the following method. First, a three-dimensional model of the subject is generated using images of the subject captured by a plurality of cameras and position information of the cameras. Next, a texture image (blended texture image) for each position on the three-dimensional model is generated by blending the texture images appearing in the plurality of captured images. Finally, the image from the virtual viewpoint is reconstructed by texture-mapping the blended texture image onto the three-dimensional model.

Patent Document 1: Japanese Patent No. 5011224

Non-Patent Document 1: W. Matusik, C. Buehler, R. Raskar, S. Gortler, L. McMillan, "Image-Based Visual Hulls", ACM SIGGRAPH 2000, pp. 369-374, 2000.

However, with the technique described in Patent Document 1, if the subject cannot be appropriately extracted from a captured image, the subject may be rendered as a large shape that differs from its actual shape. That is, the technique of Patent Document 1 has the problem that it is not easy to appropriately generate a virtual viewpoint image from captured images.
SUMMARY OF THE INVENTION
The present invention has been made in view of this problem, and its object is to make it possible to appropriately generate a virtual viewpoint image from captured images.

An image processing apparatus of the present invention has: first acquisition means for acquiring shape information, which is information for specifying the three-dimensional shape of a subject located in an imaging region, based on a subject region extracted by a first extraction method from an image captured by one of a plurality of imaging devices that capture the imaging region from a plurality of directions, and a subject region extracted from that same captured image by a second extraction method different from the first extraction method; second acquisition means for acquiring viewpoint information, which is information for specifying a virtual viewpoint; and generation means for generating a virtual viewpoint image based on the shape information and the viewpoint information. The first extraction method includes a method of extracting the subject based on the captured image and a background image corresponding to the captured image. The second extraction method includes a method of extracting the subject based on the captured image and another image captured by the same imaging device at a timing different from the timing at which the captured image was captured.

According to the present invention, a virtual viewpoint image can be appropriately generated from captured images.

FIG. 1 is a diagram showing the configuration of an image processing system.
FIG. 2 is a diagram showing the hardware configuration of an image processing apparatus.
FIG. 3 is a diagram showing a first example of the functional configuration of the image processing apparatus.
FIG. 4 is a flowchart illustrating a first example of an image processing method.
FIG. 5 is a diagram illustrating a first example of the content of image processing.
FIG. 6 is a diagram showing a second example of the functional configuration of the image processing apparatus.
FIG. 7 is a flowchart illustrating a second example of the image processing method.
FIG. 8 is a diagram illustrating a second example of the content of image processing.
FIG. 9 is a diagram showing a third example of the functional configuration of the image processing apparatus.
FIG. 10 is a flowchart illustrating a third example of the image processing method.

BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
<First Embodiment>
In the present embodiment, motion information of a subject (for example, an object such as a person) is acquired from captured images, and, based on the speed of the motion and the imaging conditions, regions of the captured image in which blurring occurs due to the movement of the subject are distinguished from regions in which it does not. Then, an image of the subject as observed from a virtual viewpoint is generated (reconstructed) from a plurality of captured images such that the blurred regions appear translucent. The image processing system of this embodiment can be applied to a plurality of pieces of image data obtained by imaging the same subject from different viewpoints. In the following description, an image of the subject observed from a virtual viewpoint is referred to as a virtual viewpoint image as necessary. Blurring that occurs in at least part of the subject's region in a captured image because of the subject's movement is referred to as motion blur as necessary.

FIG. 1 is a schematic diagram showing an example of the configuration of the image processing system. The image processing system includes a plurality of cameras 101, an image processing apparatus 102, a display device 103, and an input device 104. The cameras 101 capture images of a subject 105, located in a region on a roughly planar surface, from a plurality of viewpoints surrounding the subject 105. The display device 103 and the input device 104 are connected to the image processing apparatus 102. A user performs input operations on the image processing apparatus 102 using the display device 103 and the input device 104. Through these operations, the user sets imaging conditions, checks the results of processing the image data captured by the cameras 101, and so on.

FIG. 2 is a block diagram showing an example of the hardware configuration of the image processing apparatus 102. The image processing apparatus 102 includes a CPU 201, a RAM 202, a ROM 203, a storage unit 204, an input interface 205, an output interface 206, and a system bus 207. An external memory 208 is connected to the input interface 205, and the display device 103 is connected to the output interface 206.
The CPU 201 is a processor that controls the components of the image processing apparatus 102 as a whole. The RAM 202 is a memory that functions as the main memory and work area of the CPU 201. The ROM 203 is a memory that stores programs and the like used for processing in the image processing apparatus 102. The CPU 201 executes the programs stored in the ROM 203, using the RAM 202 as a work area, to perform the various processes described later.

The storage unit 204 is a storage device that stores image data used for processing in the image processing apparatus 102, parameters (that is, setting values) for that processing, and the like. An HDD, an optical disk drive, a flash memory, or the like can be used as the storage unit 204.
The input interface 205 is, for example, a serial bus interface such as USB or IEEE 1394. The image processing apparatus 102 can acquire image data to be processed from the external memory 208 (for example, a hard disk, memory card, CF card, SD card, or USB memory) via the input interface 205. The output interface 206 is, for example, a video output terminal such as DVI or HDMI (registered trademark). The image processing apparatus 102 can output image data it has processed to the display device 103 (an image display device such as a liquid crystal display) via the output interface 206. The image processing apparatus 102 may include components other than those described above, but since they are not the focus of the present invention, their detailed description is omitted.

An example of image processing in the image processing apparatus 102 of this embodiment will be described below with reference to FIGS. 3, 4, and 5. FIG. 3 is a block diagram showing an example of the functional configuration of the image processing apparatus 102 of this embodiment. FIG. 4 is a flowchart illustrating an example of the image processing method of this embodiment. FIG. 5 is a schematic diagram illustrating an example of the content of the image processing of this embodiment.
In this embodiment, the CPU 201 executes the program stored in the ROM 203 to function as each block shown in FIG. 3 and to execute the processing of the flowchart of FIG. 4. The CPU 201 need not execute all the functions of the image processing apparatus 102; processing circuits corresponding to individual functions may be provided in the image processing apparatus 102 and execute those functions.

FIG. 5 illustrates the image processing using, as an example, the capture of a subject who is moving their left hand quickly. Images representing the subject's actual motion are obtained in the order image 501A, image 501B, image 501C. However, when the exposure time (Tv) of the camera 101 is long, the camera 101 captures an image such as image 502, in which the left hand is motion-blurred.
In S401, the moving image data acquisition unit 301 acquires a plurality of pieces of moving image data from the external memory 208 via the input interface 205 and stores them in the RAM 202. The pieces of moving image data are obtained by the cameras 101 imaging the same subject from mutually different viewpoints; that is, they represent the same subject from different viewpoints. In FIG. 5, image 502 represents an example of the moving image data.

Next, in S402, the background image acquisition unit 302 acquires, from the external memory 208 via the input interface 205, a plurality of pieces of background image data corresponding to the moving image data acquired in S401, and stores them in the RAM 202. In FIG. 5, image 503 represents an example of background image data (captured by one of the cameras 101). The background images are assumed to have been captured by the respective cameras 101 with the subject 105 absent and stored in the external memory 208 in advance. The position and orientation of each camera 101 when capturing its background image are preferably the same as when the subject 105 is present.

Next, in S403, the first foreground/background separation unit 303 separates the moving image data into a foreground image and a background image based on the difference between the moving image data and the background image data stored in the RAM 202, and stores each in the RAM 202. For example, the first foreground/background separation unit 303 generates, as foreground/background image data, binary image data of the same size as the moving image data and the background image data. For each pixel, it assigns white (1) if the absolute difference between the corresponding pixel values of the moving image data and the background image data exceeds a threshold, and black (0) otherwise. The region assigned white (1) is then the foreground region, and the region assigned black (0) is the background region. The first foreground/background separation unit 303 stores this foreground/background image data in the RAM 202 together with the moving image data and the background image data. In FIG. 5, image 504 represents an example of the foreground/background image data.
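The per-pixel thresholding in S403 can be sketched as follows. This is a minimal illustration, assuming single-channel images stored as nested Python lists; the function name `separate_foreground` and the threshold value are hypothetical, and color images would need a further assumption, such as thresholding a per-channel or Euclidean color difference.

```python
from typing import List

Image = List[List[float]]  # single-channel image, indexed [row][col]
Mask = List[List[int]]     # 1 = foreground (white), 0 = background (black)


def separate_foreground(frame: Image, background: Image, threshold: float) -> Mask:
    """Binary foreground/background map from per-pixel absolute differences.

    A pixel is foreground (1) only when |frame - background| is strictly
    greater than the threshold; otherwise it is background (0).
    """
    if len(frame) != len(background) or any(
        len(f_row) != len(b_row) for f_row, b_row in zip(frame, background)
    ):
        raise ValueError("frame and background must have the same size")
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(frame, background)
    ]


frame = [[10, 10, 200], [10, 90, 10]]
background = [[10, 12, 10], [10, 10, 10]]
print(separate_foreground(frame, background, threshold=20))  # [[0, 0, 1], [0, 1, 0]]
```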

Next, in S404, the moving object map calculation unit 304 calculates a moving object map from the moving image data stored in the RAM 202 and stores it in the RAM 202. The moving object map stores, for each pixel, the amount of movement in the x and y coordinates of the subject in each frame image relative to the previous or next frame image. To reduce the amount of computation, the moving object map calculation unit 304 may calculate the movement only within the foreground region stored in the RAM 202, thereby producing a moving object map for the foreground region only.
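The text does not say how the per-pixel movement in S404 is computed. As one possible stand-in (not necessarily the method of this embodiment), the sketch below uses exhaustive block matching with a sum-of-absolute-differences cost against the previous frame; `motion_map`, the patch radius, and the search range are all hypothetical. The optional mask mirrors the idea of restricting the calculation to the foreground region.

```python
from typing import List, Optional, Tuple

Image = List[List[float]]
Motion = Optional[Tuple[int, int]]  # (dx, dy) in pixels/frame, None = not computed


def motion_map(prev: Image, cur: Image, mask: Optional[List[List[int]]] = None,
               radius: int = 1, search: int = 2) -> List[List[Motion]]:
    """Per-pixel (dx, dy) such that cur[y][x] best matches prev[y - dy][x - dx].

    A (2*radius+1)^2 patch around each pixel is compared (sum of absolute
    differences) against every shift within +/-search pixels; ties prefer the
    smaller shift. If mask is given, only pixels with mask == 1 are processed.
    """
    h, w = len(cur), len(cur[0])

    def at(img: Image, x: int, y: int) -> float:
        # Clamp to the border so patches near the edge stay defined.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    result: List[List[Motion]] = [[None] * w for _ in range(h)]
    offsets = range(-radius, radius + 1)
    for y in range(h):
        for x in range(w):
            if mask is not None and not mask[y][x]:
                continue
            best_key, best_shift = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    cost = sum(abs(at(cur, x + i, y + j) - at(prev, x + i - dx, y + j - dy))
                               for j in offsets for i in offsets)
                    key = (cost, abs(dx) + abs(dy))
                    if best_key is None or key < best_key:
                        best_key, best_shift = key, (dx, dy)
            result[y][x] = best_shift
    return result


blob = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
prev = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
for j in range(3):
    for i in range(3):
        prev[3 + j][1 + i] = blob[j][i]  # textured blob at columns 1..3
        cur[3 + j][3 + i] = blob[j][i]   # same blob moved 2 px to the right
print(motion_map(prev, cur)[4][4])  # (2, 0)
```

Exhaustive search is chosen here only for clarity; a practical system would more likely use a dedicated optical-flow method.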

Next, in S405, the motion blur amount calculation unit 305 calculates the amount of motion blur based on the moving object map, the exposure time Tv [sec], and the frequency [fps (frames/sec)], and stores it in the RAM 202. The frequency here is the frame rate, that is, the frequency corresponding to the imaging period. In this embodiment, the motion blur amount expresses the size of the subject's blur in the image as a number of pixels. For example, when the movement at each pixel of the moving object map is (x, y) [pixel/frame], the motion blur amount calculation unit 305 can calculate the motion blur amount [pixel] by equation (1).

The motion blur amount calculation unit 305 may, for example, store in the RAM 202, for each camera 101, the motion blur amount (a numerical value) assigned to each pixel of the image in map form. Hereinafter, the motion blur amounts stored in this map form are referred to as a motion blur amount map as necessary. In the motion blur amount map 505 of FIG. 5, the values corresponding to the fast-moving left hand indicate a large amount of motion blur.
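Equation (1) itself is not reproduced in this text. A minimal sketch, assuming that the blur length equals the per-frame displacement magnitude multiplied by the number of frame intervals covered by the exposure (speed [pixel/frame] × frame rate [frame/sec] × Tv [sec], which yields pixels), builds a motion blur amount map as follows. The exact form of equation (1) may differ, and `motion_blur_map` is a hypothetical name.

```python
import math
from typing import List, Optional, Tuple

Motion = Optional[Tuple[float, float]]  # (x, y) movement in pixels/frame, None = not computed


def motion_blur_map(motion: List[List[Motion]], exposure_time_s: float,
                    frame_rate_fps: float) -> List[List[float]]:
    """Blur length in pixels for each pixel (0.0 where no motion was computed).

    Assumed model: |(x, y)| [pixel/frame] * frame rate [frame/s] * Tv [s],
    i.e. the distance travelled while the shutter is open.
    """
    frames_exposed = exposure_time_s * frame_rate_fps  # fraction of a frame interval
    return [[0.0 if m is None else math.hypot(m[0], m[1]) * frames_exposed
             for m in row] for row in motion]


# 5 px/frame of motion at 30 fps with Tv = 1/60 s spreads over about 2.5 px.
print(motion_blur_map([[(3, 4), None]], exposure_time_s=1 / 60, frame_rate_fps=30))
```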

Next, in S406, the second foreground/background separation unit 306 extracts a non-motion-blurred foreground region from the foreground/background image data and stores it in the RAM 202. Specifically, the second foreground/background separation unit 306 may, for example, take as the non-motion-blurred foreground region the region that is foreground (white) in the foreground/background image data (image 504) and whose corresponding value in the motion blur amount map 505 is at or below a certain threshold (the black region of the map). In FIG. 5, image 506 represents an example of the non-motion-blurred foreground region. The non-motion-blurred foreground region is thus the part of the foreground region in which no motion blur occurs. The rest of the foreground region is where motion blur occurs: a region in which foreground and background are mixed and cannot be distinguished (motion-blurred region).
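The region split in S406 amounts to a logical AND of the foreground mask with a threshold test on the motion blur amount map. A minimal sketch with a hypothetical `split_foreground` helper and an arbitrary threshold, which also returns the complementary motion-blurred region:

```python
from typing import List, Tuple

Mask = List[List[int]]


def split_foreground(fg_mask: Mask, blur_map: List[List[float]],
                     blur_threshold: float) -> Tuple[Mask, Mask]:
    """Split the foreground into (non-motion-blurred, motion-blurred) masks.

    A foreground pixel is non-motion-blurred when its blur amount is at or
    below blur_threshold; every other foreground pixel is motion-blurred.
    """
    sharp = [[1 if m and b <= blur_threshold else 0 for m, b in zip(m_row, b_row)]
             for m_row, b_row in zip(fg_mask, blur_map)]
    blurred = [[1 if m and not s else 0 for m, s in zip(m_row, s_row)]
               for m_row, s_row in zip(fg_mask, sharp)]
    return sharp, blurred


sharp, blurred = split_foreground([[1, 1, 0]], [[0.2, 5.0, 9.0]], blur_threshold=1.0)
print(sharp, blurred)  # [[1, 0, 0]] [[0, 1, 0]]
```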

Next, in S407, the first shape estimation unit 307 estimates the shape of the foreground region, and the second shape estimation unit 308 estimates the shape of the non-motion-blurred foreground region. Shape estimation uses, for example, camera position/orientation parameters, which include information indicating the position and orientation of each camera 101. One shape estimation method is the Visual Hull method described in Non-Patent Document 1. For example, using the Visual Hull method, the first shape estimation unit 307 projects the silhouettes of the foreground region into real space and estimates the part where the silhouettes overlap as the shape of the foreground. The first shape estimation unit 307 and the second shape estimation unit 308 may, for example, store in the RAM 202, for each camera 101, a distance (as information representing the shape) assigned to each pixel of the image in map form. Here, the distance is the distance from the output viewpoint to the subject appearing in the pixel of interest. In the following description, the distances stored in this map form are referred to as a distance map as necessary. The output viewpoint is the virtual viewpoint.
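To illustrate the Visual Hull idea used in S407, the following sketch carves a voxel grid: a voxel is kept only if it projects inside the silhouette in every view. It assumes each camera is described by a 3x4 projection matrix and a binary silhouette mask, and it treats voxels that fall outside an image as carved away; these conventions, and the names `project` and `visual_hull`, are assumptions rather than details given in the text. Converting the kept voxels into a distance map would be a further step not shown here.

```python
from typing import List, Optional, Sequence, Tuple

Point3 = Tuple[float, float, float]
Matrix34 = Sequence[Sequence[float]]  # 3x4 projection matrix (perspective or affine)
Mask = List[List[int]]                # silhouette: 1 inside, 0 outside


def project(P: Matrix34, X: Point3) -> Optional[Tuple[float, float]]:
    """Project a world point to (u, v) pixel coordinates; None if behind the camera."""
    x, y, z = X
    u, v, w = (P[r][0] * x + P[r][1] * y + P[r][2] * z + P[r][3] for r in range(3))
    if w <= 0:
        return None
    return u / w, v / w


def visual_hull(voxels: Sequence[Point3],
                cameras: Sequence[Tuple[Matrix34, Mask]]) -> List[Point3]:
    """Keep the voxels whose projection falls inside the silhouette in every view."""
    kept = []
    for X in voxels:
        for P, mask in cameras:
            uv = project(P, X)
            if uv is None:
                break
            col, row = round(uv[0]), round(uv[1])
            if not (0 <= row < len(mask) and 0 <= col < len(mask[0])) or not mask[row][col]:
                break  # carved away: outside the image or outside the silhouette
        else:
            kept.append(X)  # no view rejected this voxel
    return kept


# Two hypothetical orthographic views: A sees (u, v) = (x, y), B sees (u, v) = (z, y).
P_a = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
P_b = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
silhouette = [[0, 1, 0] for _ in range(3)]  # only the middle column is foreground
grid = [(x, y, z) for x in range(3) for y in range(3) for z in range(3)]
print(visual_hull(grid, [(P_a, silhouette), (P_b, silhouette)]))
# [(1, 0, 1), (1, 1, 1), (1, 2, 1)]
```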

Methods of generating a distance map from images of a subject obtained by a plurality of cameras 101 are well known, and any of them can be adopted. For example, a three-dimensional model of the subject can be generated using the visual volume intersection method or the stereo matching method described in Patent Document 1. Then, based on the relationship between the virtual viewpoint and the three-dimensional model of the subject, the distance from the virtual viewpoint to the corresponding subject is derived for each pixel of the virtual viewpoint image and stored in the distance map. The distance map need not be generated from the captured images of the subject; a three-dimensional model of the subject may instead be generated using a tracker or the like, and the distance map generated from that model. Alternatively, the distance from the virtual viewpoint to the corresponding subject may be measured in advance with a range sensor or the like to obtain the distance map.

Next, in S408, the first rendering unit 309 renders the shape of the foreground region to generate a foreground virtual viewpoint image, and the second rendering unit 310 renders the shape of the non-motion-blurred foreground region to generate a non-motion-blurred virtual viewpoint image. Rendering uses, for example, virtual viewpoint parameters including the position of the virtual viewpoint and the direction of its line of sight.
An outline of the processing of the first rendering unit 309 and the second rendering unit 310 is described below.
The processing performed by the first rendering unit 309 and the second rendering unit 310 amounts to identifying, based on the distance map, the position of the subject lying in the direction of interest, and extracting that subject's color information from the captured images. In other words, for a pixel of interest in the virtual viewpoint image, the first rendering unit 309 and the second rendering unit 310 identify the position of the subject appearing in that pixel based on the distance map, and extract the color information of that subject from the captured images. Specifically, they identify the pixel in a captured image that corresponds to the subject lying in the direction of interest, based on the distance from the virtual viewpoint to that subject and on the relative positions and orientations of the virtual viewpoint and the camera 101. They then acquire the color information of the identified pixel as the color information of the subject seen from the virtual viewpoint in the direction of interest.

This processing can be performed, for example, as follows. In the following description, the coordinates of the pixel of interest in the virtual viewpoint image are (u₀, v₀). The position of the subject appearing in the pixel of interest can be expressed in the camera coordinate system of the output viewpoint according to equation (2).

In equation (2), (x₀, y₀, z₀) are the coordinates of the subject in the camera coordinate system. d₀(u₀, v₀) is the distance, given by the distance map, from the output viewpoint to the subject appearing in the pixel of interest. f₀ is the focal length of the output viewpoint, and c_x0 and c_y0 are the principal point coordinates of the output viewpoint.
Next, the camera-coordinate-system coordinates of the subject appearing in the pixel of interest can be converted into world-coordinate-system coordinates according to equation (3).

In equation (3), (X₀, Y₀, Z₀) are the coordinates of the subject in the world coordinate system. R₀ represents the optical axis direction of the output viewpoint. (X_output, Y_output, Z_output) are the coordinates of the output viewpoint in the world coordinate system.
Next, the coordinates in the image captured from an input viewpoint at which the subject located at world coordinates (X₀, Y₀, Z₀) appears can be calculated according to equation (5). An input viewpoint is the viewpoint of a camera 101.

In equation (4), R_i represents the optical axis direction of input viewpoint i (the i-th of the plurality of input viewpoints). (X_cam,i, Y_cam,i, Z_cam,i) are the world coordinates of the camera 101 at input viewpoint i. f_i is the focal length of input viewpoint i, and c_xi and c_yi are its principal point coordinates. t is a constant. Solving equation (4) for (u_i, v_i) yields equation (5).

Using equation (5), the constant t is calculated first, and (u_i, v_i) can then be calculated using t. In this way, the coordinates (u₀, v₀) of the pixel of interest in the virtual viewpoint image can be converted into the coordinates (u_i, v_i) of a pixel in the captured image. The pixel at (u₀, v₀) in the virtual viewpoint image and the pixel at (u_i, v_i) in the captured image are highly likely to correspond to the same subject. Therefore, the pixel value (color information) at (u_i, v_i) in the captured image can be used as the pixel value (color information) of the pixel of interest at (u₀, v₀) in the virtual viewpoint image.
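Equations (2) to (5) are not reproduced in this text. The sketch below walks through the same chain (virtual-view pixel to output-camera coordinates, then to world coordinates, then to an input-camera pixel) under a standard pinhole model. It assumes that the distance-map value is depth along the optical axis, that each rotation matrix maps world coordinates to camera coordinates, and that the perspective division by depth takes the place of the constant t; the function names are hypothetical, and the patent's own equations may use different conventions.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]  # rotation, world -> camera (assumed convention)


def pixel_to_camera(u: float, v: float, depth: float,
                    f: float, cx: float, cy: float) -> Vec3:
    """Back-project a pixel, treating the distance-map value as depth along the optical axis."""
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)


def camera_to_world(p: Vec3, R: Mat3, center: Vec3) -> Vec3:
    """World = R^T p + C, where C is the camera position in world coordinates."""
    return tuple(sum(R[k][i] * p[k] for k in range(3)) + center[i] for i in range(3))


def world_to_pixel(X: Vec3, R: Mat3, center: Vec3,
                   f: float, cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Project a world point into a camera; None if it lies behind the camera."""
    d = [X[k] - center[k] for k in range(3)]
    xc, yc, zc = (sum(R[i][k] * d[k] for k in range(3)) for i in range(3))
    if zc <= 0:
        return None
    # Perspective division by the depth zc.
    return f * xc / zc + cx, f * yc / zc + cy


identity = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
p_cam = pixel_to_camera(100, 50, depth=2.0, f=500, cx=320, cy=240)  # virtual view
X_world = camera_to_world(p_cam, identity, (0.0, 0.0, 0.0))
# Input camera shifted 0.5 units along -x: approximately (225.0, 50.0)
print(world_to_pixel(X_world, identity, (-0.5, 0.0, 0.0), f=500, cx=320, cy=240))
```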

However, because of differences in line-of-sight direction, the pixel at (u₀, v₀) in the virtual viewpoint image and the pixel at (u_i, v_i) in the captured image do not necessarily correspond to the same subject. Moreover, even when they do, their colors may differ between captured images because of factors such as the direction of the light source. Therefore, in this embodiment, the first rendering unit 309 and the second rendering unit 310 identify, in each of the plurality of captured images, the pixel coordinates (u_i, v_i) (i = 1 to N, where N is the number of cameras 101) corresponding to the coordinates (u₀, v₀) of the pixel of interest in the virtual viewpoint image, and compute a weighted combination of the pixel values of the identified pixels. Captured images in which the subject corresponding to the pixel of interest does not appear, for example because it lies outside the imaging range, can be excluded from the combination. The pixel value obtained by this weighted combination is used as the pixel value at (u₀, v₀) in the virtual viewpoint image.
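The weighting used in this per-pixel combination is not specified in the text. A minimal sketch, assuming hypothetical inverse-angle weights (cameras whose viewing direction is closer to the virtual line of sight count more), combines the samples found at (u_i, v_i) and skips cameras that do not see the point. The same routine could combine motion blur amounts from the real viewpoints instead of colors; `blend_views`, `angle_between`, and `eps` are illustrative names.

```python
import math
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def angle_between(a: Vec3, b: Vec3) -> float:
    """Angle in radians between two (non-zero) direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def blend_views(samples: Sequence[Optional[float]], camera_dirs: Sequence[Vec3],
                virtual_dir: Vec3, eps: float = 1e-3) -> Optional[float]:
    """Weighted average of per-camera samples for one virtual-view pixel.

    samples[i] is the value read at (u_i, v_i) in camera i, or None when camera i
    does not see the point (it is then excluded). Returns None if no camera sees it.
    """
    total, weight_sum = 0.0, 0.0
    for value, direction in zip(samples, camera_dirs):
        if value is None:
            continue
        w = 1.0 / (angle_between(direction, virtual_dir) + eps)  # hypothetical weighting
        total += w * value
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else None


dirs = [(1.0, 0.0, 1.0), (-1.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
# Both visible cameras are 45 degrees off-axis, so this is the plain mean of 10 and 20.
print(blend_views([10.0, 20.0, None], dirs, virtual_dir=(0.0, 0.0, 1.0)))
```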

At the same time, the value of the motion blur amount map at the coordinates (u₀, v₀) of the pixel of interest in the virtual viewpoint image can likewise be generated, in the same way as the pixel value, by a weighted combination of the motion blur amount maps of the real viewpoints.
In FIG. 5, image 507 is an example of the rendering result of the first rendering unit 309 (the foreground virtual viewpoint image), and image 508 is an example of the rendering result of the second rendering unit 310 (the non-motion-blurred virtual viewpoint image). In image 507, the fast-moving left hand is rendered as a large opaque blob. In image 508, the fast-moving left hand has disappeared. What should actually be seen from the virtual viewpoint is a picture in which the left hand is motion-blurred and appears semi-transparent. Thus, in image 507 (the foreground virtual viewpoint image), the blurring caused by the subject's motion in at least part of the subject's region is larger than in image 508 (the non-motion-blurred virtual viewpoint image).

Returning to FIG. 4, in S409 the α-blending unit 311 α-blends the foreground virtual viewpoint image and the non-motion-blurred virtual viewpoint image according to the amount of motion blur. This generates a motion-blur mixed virtual viewpoint image. In FIG. 5, image 509 shows an example of the motion-blur mixed virtual viewpoint image. It is obtained by α-blending images 507 (foreground virtual viewpoint image) and 508 (non-motion-blurred virtual viewpoint image) according to the motion blur amount map 505. In this way, an image 509 in which the fast-moving left hand appears semi-transparent can be generated.

α is an example of a parameter that determines the ratio at which the values of mutually corresponding pixels in the foreground virtual viewpoint image and the non-motion-blurred virtual viewpoint image are combined. For example, when the value of the motion blur amount map 505 is x [pixel], α is expressed by Equation (6) below. Let [R1, G1, B1] be the RGB values obtained as the rendering result of the first rendering unit 309 (foreground virtual viewpoint image). Let [R2, G2, B2] be the RGB values obtained as the rendering result of the second rendering unit 310 (non-motion-blurred virtual viewpoint image). The RGB values of the output image can then be determined by combining [R1, G1, B1] and [R2, G2, B2] using α, as in Equation (7) below.
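Equations (6) and (7) are referenced above but are not reproduced in this text. The sketch below therefore rests on two labeled assumptions.

- **Stand-in for Equation (6):** `alpha_from_blur` uses a hypothetical linear ramp. α falls from 1 at zero blur to 0 at `x_max` pixels of blur. This direction is consistent with the later statement that a larger motion blur amount lowers the blend ratio of the foreground virtual viewpoint image.
- **Assumed form of Equation (7):** `alpha_blend` takes the output to be the usual convex combination α·[R1, G1, B1] + (1 − α)·[R2, G2, B2].

```python
from typing import Sequence, Tuple


def alpha_from_blur(x: float, x_max: float = 20.0) -> float:
    """Hypothetical stand-in for Equation (6).

    x is the motion blur amount map value in pixels. alpha falls linearly
    from 1 (no blur) to 0 at x_max pixels and is clamped to [0, 1].
    """
    if x_max <= 0:
        raise ValueError("x_max must be positive")
    return min(1.0, max(0.0, 1.0 - x / x_max))


def alpha_blend(
    rgb1: Sequence[float], rgb2: Sequence[float], alpha: float
) -> Tuple[float, ...]:
    """Assumed form of Equation (7): alpha * rgb1 + (1 - alpha) * rgb2.

    rgb1 is the foreground virtual viewpoint pixel [R1, G1, B1], and rgb2 is
    the non-motion-blurred virtual viewpoint pixel [R2, G2, B2].
    """
    return tuple(alpha * c1 + (1.0 - alpha) * c2 for c1, c2 in zip(rgb1, rgb2))
```

With `x_max = 20`, a pixel with 10 pixels of blur gets α = 0.5. Blending (200, 100, 0) with (0, 100, 200) at that α gives (100.0, 100.0, 100.0).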

As described above, in the present embodiment, the image processing apparatus 102 acquires motion information of the subject from the captured images. It divides the subject region of the captured images into a motion-blurred region and a non-motion-blurred foreground region, renders each, and α-blends the results. This makes it possible to generate a virtual viewpoint image in which a foreground region affected by motion blur appears naturally semi-transparent. Consequently, a virtual viewpoint image can be generated appropriately even when motion blur occurs in the captured images.

The present embodiment has described an example in which rendering is performed separately for two regions: the motion-blurred region and the non-motion-blurred foreground region. However, shape estimation and rendering may instead be performed by dividing the subject into three or more regions according to the magnitude of the motion blur.

Also, to reduce computational resources, the second shape estimation unit 308 and the second rendering unit 310 may skip shape estimation and rendering of the foreground and render only the background shape. In this case, [R2, G2, B2] in Equation (7) is the rendering result of the background image. Only the transparency of the foreground region ([R1, G1, B1]) is then varied, according to its blur amount (the value of the motion blur amount map 505).
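The variant with three or more regions can be illustrated by bucketing each pixel of a motion blur amount map by threshold. This is a minimal sketch. The thresholds are hypothetical tuning values, and the function name `split_by_blur` is illustrative. The patent does not say how the regions are delimited.

```python
from bisect import bisect_right
from typing import List, Sequence


def split_by_blur(
    blur_map: Sequence[Sequence[float]], thresholds: Sequence[float]
) -> List[List[int]]:
    """Label each pixel with a region index according to its blur amount.

    Region 0 holds pixels with blur below thresholds[0]. Region k holds
    pixels with thresholds[k-1] <= blur < thresholds[k]. Pixels at or above
    the largest threshold fall in the last region, for a total of
    len(thresholds) + 1 regions.
    """
    t = sorted(thresholds)
    # bisect_right counts how many thresholds are <= the blur value.
    return [[bisect_right(t, b) for b in row] for row in blur_map]
```

With thresholds `[2.0, 10.0]`, blur values `0.0, 2.0, 5.0, 12.0` map to regions `0, 1, 1, 2`. Shape estimation and rendering would then be run once per region.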

<Second Embodiment>
Next, a second embodiment will be described. In this embodiment, an image is created by combining cameras that perform short exposure with cameras that perform long exposure. Here, short exposure and long exposure refer to moving images captured with relatively short and relatively long exposure times, respectively. For example, for a 60 [fps] moving image, the long exposure time is 1/100 [sec] and the short exposure time is 1/1000 [sec]. These cameras are assumed to have the same frame rate [fps], and their capture timings are synchronized. It is also assumed that a plurality of cameras 101 are designated in advance as short-exposure cameras 101, and a plurality as long-exposure cameras 101.

As described later, one virtual viewpoint image is generated from the images captured by the short-exposure cameras 101, and another from the images captured by the long-exposure cameras 101. So that each virtual viewpoint image is generated appropriately, the short-exposure and long-exposure cameras 101 are preferably distributed among each other. For example, in FIG. 1, the two kinds of camera 101 can alternate, with every other camera using short exposure. The number of short-exposure cameras 101 and the number of long-exposure cameras 101 may be the same or different.
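The example exposure settings above can be checked with simple arithmetic. At 60 fps the frame interval is 1/60 s (about 16.7 ms), so the 1/100 s (10 ms) long exposure fits within one synchronized frame. The long exposure is also 10 times the 1/1000 s short exposure. The sketch below validates these constraints and assigns exposures to every other camera, as in the alternating arrangement described. The function name and the alternating assignment order are illustrative assumptions.

```python
from fractions import Fraction
from typing import List


def assign_exposures(
    num_cameras: int,
    fps: int = 60,
    short_tv: Fraction = Fraction(1, 1000),
    long_tv: Fraction = Fraction(1, 100),
) -> List[Fraction]:
    """Assign short and long exposure times to every other camera.

    Even-indexed cameras get short_tv and odd-indexed cameras get long_tv
    (one possible alternating layout). All cameras share the same frame
    rate, so both exposures must fit within one frame interval.
    """
    frame_period = Fraction(1, fps)
    if not (0 < short_tv < long_tv <= frame_period):
        raise ValueError("need 0 < short_tv < long_tv <= 1/fps")
    return [short_tv if i % 2 == 0 else long_tv for i in range(num_cameras)]
```

`assign_exposures(4)` yields short, long, short, long. Raising the frame rate to 120 fps raises `ValueError`, because a 1/100 s exposure no longer fits in a 1/120 s frame.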

In the first embodiment described above, the region in which motion blur occurs is determined by estimating a moving-object map from the time series of images from a single camera. Calculating the moving-object map in this way takes a relatively long time. The present embodiment therefore generates a virtual viewpoint image of a scene with motion blur without calculating a moving-object map. It uses both a group of short-exposure images with relatively little motion blur and a group of long-exposure images with relatively large motion blur. Thus, the present embodiment differs from the first embodiment mainly in the processing for determining the region in which motion blur occurs. In the description of the present embodiment, parts identical to those of the first embodiment are given the same reference numerals as in FIGS. 1 to 5, and their detailed description is omitted.

An example of image processing in the image processing apparatus 102 of the present embodiment will be described below with reference to FIGS. 6, 7, and 8. FIG. 6 is a block diagram showing an example of the functional configuration of the image processing apparatus 102 of the present embodiment. FIG. 7 is a flowchart illustrating an example of the image processing method of the present embodiment. FIG. 8 is a schematic diagram illustrating an example of the image processing of the present embodiment.

In this embodiment as well, the CPU 201 functions as each block shown in FIG. 6 by executing the program stored in the ROM 203, and executes the processing of the flowchart of FIG. 7. The CPU 201 need not execute all the functions of the image processing apparatus 102. Instead, processing circuits corresponding to the respective functions may be provided in the image processing apparatus 102 and execute those functions.

FIG. 8 illustrates the image processing using, as an example, the capture of a subject who is moving the left hand quickly. The actual movement of the subject is represented by images 801A, 801B, and 801C, in that order. When the exposure time (Tv) of the camera 101 is long, the camera 101 captures an image such as image 802, in which the left hand is motion-blurred. When the exposure time (Tv) of the camera 101 is short, the camera 101 captures an image such as image 803, in which the left hand is not motion-blurred.

Hereinafter, an image captured by a camera 101 with a relatively long exposure time (Tv) is referred to, where necessary, as a long-Tv image. An image captured by a camera 101 with a relatively short exposure time (Tv) is referred to as a short-Tv image.

In S701, the long-Tv image acquisition unit 601 acquires long-Tv image data, and the short-Tv image acquisition unit 602 acquires short-Tv image data. For example, suppose a person quickly waves the left hand as shown in images 801A → 801B → 801C. The long-Tv image then looks like image 802, and the short-Tv image looks like image 803.

Next, in S702, the long-Tv background image acquisition unit 603 acquires long-Tv background image data, and the short-Tv background image acquisition unit 604 acquires short-Tv background image data. The long-Tv background image is captured by each camera 101 with a relatively long exposure time (Tv) while the subject 105 is absent. The short-Tv background image is captured by each camera 101 with a relatively short exposure time (Tv) while the subject 105 is absent. Both are stored in advance in the external memory 208.

The exposure time for the long-Tv images is preferably the same as that for the long-Tv background image. Similarly, the exposure time for the short-Tv images is preferably the same as that for the short-Tv background image. The position and orientation of each camera 101 when capturing the background images are also preferably the same as when the subject 105 is present. In FIG. 8, for example, a background image such as image 804 is acquired.

Next, in S703, the first foreground image separation unit 605 separates the long-Tv image data into a long-Tv foreground region and a long-Tv background region. For example, the unit compares corresponding pixels of image 802 (long-Tv image) and image 804 (background image). It determines whether the absolute value of the difference in at least one of color and texture exceeds a threshold. Pixels where this absolute value exceeds the threshold belong to the foreground region and are assigned white (1). All other pixels belong to the background region and are assigned black (0). The foreground region obtained in this way is the long-Tv foreground region, and the background region is the long-Tv background region. As a result, an image such as image 805 shown in FIG. 8 is obtained.

Next, in S704, the second foreground image separation unit 606 separates the short-Tv image data into a short-Tv foreground region and a short-Tv background region. For example, the unit compares corresponding pixels of image 803 (short-Tv image) and image 804 (background image). It determines whether the absolute value of the difference in at least one of color and texture exceeds a threshold. Pixels where this absolute value exceeds the threshold belong to the foreground region and are assigned white (1). All other pixels belong to the background region and are assigned black (0). The foreground region obtained in this way is the short-Tv foreground region, and the background region is the short-Tv background region. As a result, an image such as image 806 shown in FIG. 8 is obtained.
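The thresholded background subtraction used in S703 and S704 can be sketched as follows. The source allows a color and/or texture difference. This sketch compares color only, using the largest per-channel absolute difference as the measure, which is an assumption. The function name `foreground_mask` and the nested-list image layout are also illustrative.

```python
from typing import List, Sequence

Pixel = Sequence[int]


def foreground_mask(
    image: Sequence[Sequence[Pixel]],
    background: Sequence[Sequence[Pixel]],
    threshold: float,
) -> List[List[int]]:
    """Return a binary mask: 1 (white, foreground) or 0 (black, background).

    A pixel is foreground when the absolute color difference from the
    background image exceeds the threshold in at least one channel.
    """
    mask: List[List[int]] = []
    for row_img, row_bg in zip(image, background):
        mask_row: List[int] = []
        for px, bg in zip(row_img, row_bg):
            diff = max(abs(a - b) for a, b in zip(px, bg))
            mask_row.append(1 if diff > threshold else 0)
        mask.append(mask_row)
    return mask
```

The same function serves both steps. It is called with a long-Tv image to obtain the long-Tv foreground region and with a short-Tv image to obtain the short-Tv foreground region, each against its matching background image.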

Next, in S705, the first shape estimation unit 607 estimates the shape of the foreground region as captured with a relatively long exposure time. It does so from the overlapping region of the multi-viewpoint long-Tv foreground regions (the long-Tv foreground regions obtained by the respective cameras 101). In the following description, this shape is referred to, where necessary, as the long-Tv shape.

Next, in S706, the second shape estimation unit 608 estimates the shape of the foreground region as captured with a relatively short exposure time. It does so from the overlapping region of the multi-viewpoint short-Tv foreground regions (the short-Tv foreground regions obtained by the respective cameras 101 at the same timing). In the following description, this shape is referred to, where necessary, as the short-Tv shape.
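Estimating a shape from the overlap of multi-view foreground regions is commonly done by silhouette intersection (visual hull carving). In this approach, a voxel is kept only if it projects into the foreground of every view. The sketch below is a minimal illustration of that general technique, not necessarily the patent's exact method.

- Each camera's projection is passed in as a hypothetical callable.
- A voxel that projects outside a camera's image is carved away, which is a simplifying assumption.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple

Voxel = Tuple[int, int, int]
Projection = Callable[[Voxel], Tuple[int, int]]


def carve_visual_hull(
    voxels: Iterable[Voxel],
    views: Sequence[Tuple[Projection, List[List[int]]]],
) -> Set[Voxel]:
    """Keep the voxels that fall inside the foreground mask of every view.

    views is a list of (project, mask) pairs. project maps a voxel to the
    (row, col) pixel it lands on in that camera's image, and mask is the
    camera's binary foreground mask (1 = foreground).
    """
    hull: Set[Voxel] = set()
    for voxel in voxels:
        inside_all = True
        for project, mask in views:
            row, col = project(voxel)
            in_image = 0 <= row < len(mask) and 0 <= col < len(mask[0])
            if not in_image or mask[row][col] != 1:
                inside_all = False  # background (or unseen) in this view: carve away
                break
        if inside_all:
            hull.add(voxel)
    return hull
```

Running it once with the long-Tv masks and once with the short-Tv masks yields the long-Tv shape and the short-Tv shape, respectively. Because a fast-moving hand smears across a larger area in long-exposure silhouettes, its carved long-Tv shape would be correspondingly larger.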

Next, in S707, the first rendering unit 609 renders the long-Tv shape to generate the virtual viewpoint image that would be obtained if an image were captured from the virtual viewpoint with a relatively long exposure time. In the following description, this virtual viewpoint image is referred to, where necessary, as the long-Tv virtual viewpoint image. The second rendering unit 610 renders the short-Tv shape to generate the virtual viewpoint image that would be obtained if an image were captured from the virtual viewpoint with a relatively short exposure time. In the following description, this virtual viewpoint image is referred to, where necessary, as the short-Tv virtual viewpoint image.

The textures used to generate the long-Tv and short-Tv virtual viewpoint images need not come from the input images (long-Tv images, short-Tv images) that were used to obtain the respective foreground regions. For example, the color tone may change when the exposure time (Tv) differs. Therefore, only the long-Tv images may be used as the texture for generating both virtual viewpoint images.

In FIG. 8, for example, image 807 is the long-Tv virtual viewpoint image, in which the moving part appears as a large opaque mass. Image 808 is the short-Tv virtual viewpoint image, which shows the shape of the hand at one frozen instant. Thus, image 807 (long-Tv virtual viewpoint image) exhibits more blur than image 808 (short-Tv virtual viewpoint image). This blur is caused by the motion of the subject and appears in at least part of the subject's region.

Next, in S708, the motion blur amount calculation unit 611 calculates the amount of motion blur. It uses the magnitude of the absolute difference between the pixel values of corresponding pixels in the long-Tv and short-Tv virtual viewpoint images. The long-Tv shape and the short-Tv shape may be used for this calculation instead of the two virtual viewpoint images.

Next, in S709, the α-blending unit 612 α-blends the long-Tv virtual viewpoint image and the short-Tv virtual viewpoint image according to the amount of motion blur. This generates a motion-blur mixed virtual viewpoint image. For example, the α-blending unit 612 can combine the two images according to Equation (7). The RGB values of the long-Tv virtual viewpoint image are taken as [R1, G1, B1], and those of the short-Tv virtual viewpoint image as [R2, G2, B2]. The α-blending unit 612 makes the α-blend value (= α) of the long-Tv virtual viewpoint image smaller as the amount of motion blur increases. In other words, it lowers the blend ratio of the long-Tv virtual viewpoint image.

In FIG. 8, for example, the α-blending result is an image such as image 809. In it, the part blurred by the moving hand is semi-transparent, so the image looks as if it had actually been captured from the virtual viewpoint with a long exposure time (Tv).
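S708 and S709 can be sketched per pixel as follows. Three choices here are assumptions:

- **Blur measure:** the mean per-channel absolute difference between the long-Tv and short-Tv pixel.
- **Mapping to α:** a hypothetical linear mapping with normalization constant `d_max`.
- **Blend formula:** the assumed convex form α·long + (1 − α)·short for Equation (7), which is not reproduced in this text.

The code does follow the stated rule that a larger blur gives a smaller weight to the long-Tv image.

```python
from typing import Sequence, Tuple


def blend_long_short(
    long_px: Sequence[float],
    short_px: Sequence[float],
    d_max: float = 255.0,
) -> Tuple[Tuple[float, ...], float]:
    """Return (blended RGB, alpha) for one pixel of the two virtual viewpoint images.

    long_px is [R1, G1, B1] from the long-Tv virtual viewpoint image and
    short_px is [R2, G2, B2] from the short-Tv virtual viewpoint image.
    """
    if d_max <= 0:
        raise ValueError("d_max must be positive")
    # S708 (assumed measure): mean absolute difference over the channels.
    blur = sum(abs(a - b) for a, b in zip(long_px, short_px)) / len(long_px)
    # S709: larger blur -> smaller alpha for the long-Tv image (hypothetical linear map).
    alpha = 1.0 - min(1.0, blur / d_max)
    blended = tuple(alpha * a + (1.0 - alpha) * b for a, b in zip(long_px, short_px))
    return blended, alpha
```

Where the two renders agree (no motion), α stays at 1 and the long-Tv pixel passes through unchanged. Where they disagree, the output moves toward the short-Tv pixel in proportion to the difference.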

As described above, in the present embodiment, the image processing apparatus 102 renders separately from a group of short-exposure images with small motion blur and a group of long-exposure images with large motion blur. It then α-blends the two renderings. A virtual viewpoint image of a scene with motion blur can therefore be generated without calculating a moving-object map. Consequently, in addition to the effects described in the first embodiment, the processing time can be reduced.

<Third Embodiment>
Next, a third embodiment will be described. In this embodiment, the α-blending ratio is switched, or the processing is simplified, according to the difference between the virtual viewpoint and the real (actual) viewpoint of a camera.

In the first and second embodiments, a problem arises when the virtual viewpoint is far from the real viewpoint of the camera. The problem concerns the virtual viewpoint image of a part that becomes semi-transparent due to motion blur. When the two viewpoints are sufficiently close, the camera's image is close to what would be seen from the virtual viewpoint. A natural picture then results even if the camera's actual image is applied as a texture to the shape of the motion-blurred part.

The present embodiment therefore shows an example that acts according to the closeness between the virtual viewpoint and the real viewpoint of the camera. It switches between performing and not performing α-blending, and it controls the α-blend value (= α) when α-blending is performed. Thus, the present embodiment differs from the first and second embodiments mainly in the processing related to α-blending. In the description of the present embodiment, parts identical to those of the first and second embodiments are given the same reference numerals as in FIGS. 1 to 8, and their detailed description is omitted.

An example of image processing in the image processing apparatus 102 of the present embodiment will be described below with reference to FIGS. 9 and 10. FIG. 9 is a block diagram showing an example of the functional configuration of the image processing apparatus 102 of the present embodiment. FIG. 10 is a flowchart illustrating an example of the image processing method of the present embodiment.

The image processing apparatus 102 shown in FIG. 9 adds a viewpoint-dependent processing setting unit 912 to the configuration of the image processing apparatus 102 shown in FIG. 3. Blocks 901 to 911 in FIG. 9 are the same as blocks 301 to 311 in FIG. 3, respectively. However, the moving object map calculation unit 904 and the first foreground/background separation unit 903 of the present embodiment change their processing according to the processing switching setting output by the viewpoint-dependent processing setting unit 912.

When the virtual viewpoint and the real viewpoint of the camera 101 are sufficiently close, the moving image data acquisition unit 901 does not send moving image data to the moving object map calculation unit 904. In this case, the image processing apparatus 102 does not execute the processing of the moving object map calculation unit 904 and the subsequent processing blocks. Similarly, the first foreground/background separation unit 903 does not send foreground image data to the second foreground/background separation unit 906. In this case, the image processing apparatus 102 does not execute the processing of the second foreground/background separation unit 906 and the subsequent processing blocks.

In this embodiment as well, the CPU 201 functions as each block shown in FIG. 9 by executing the program stored in the ROM 203, and executes the processing of the flowchart of FIG. 10. The CPU 201 need not execute all the functions of the image processing apparatus 102. Instead, processing circuits corresponding to the respective functions may be provided in the image processing apparatus 102 and execute those functions.

S1001 to S1003 in FIG. 10 are the same as S401 to S403 in FIG. 4, so their detailed description is omitted.

In S1004, the viewpoint-dependent processing setting unit 912 determines whether the virtual viewpoint and the real viewpoint of the camera 101 are sufficiently close. The real viewpoint used in this determination is that of a camera 101 representing the plurality of cameras 101 in generating the virtual viewpoint image. For example, the real viewpoint of the camera 101 whose image is used as the texture for the virtual viewpoint image can be adopted. Alternatively, the real viewpoint of the camera 101 closest to the virtual viewpoint may be adopted. If the two viewpoints are sufficiently close, the process proceeds to S1011; otherwise, it proceeds to S1005. The index for evaluating the closeness of viewpoints includes, for example, at least one of the following:

- the position of each viewpoint;
- the orientation of each viewpoint, that is, the angle between a virtual line connecting the viewpoint and the subject and a reference line (for example, a horizontal plane).

The closer the direction from the input viewpoint (the real viewpoint of the camera 101) to the subject is to the direction from the output viewpoint (the virtual viewpoint) to the subject, the closer the subject image in the captured image is considered to be to the subject image seen from the virtual viewpoint. The closeness of viewpoints can therefore be evaluated by comparing these two directions. Specifically, take a direction vector (of arbitrary magnitude) from the input viewpoint to the subject and a direction vector (of arbitrary magnitude) from the output viewpoint to the subject. The viewpoints are evaluated as close when the angle between the two vectors is smaller than a threshold.
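The angle criterion above, together with choosing the representative camera as the one closest to the virtual viewpoint, can be sketched as follows. This is a minimal illustration under stated assumptions.

- Positions are 3D tuples in a shared world frame.
- "Closest" is measured by the same direction angle.
- The 10-degree threshold is a hypothetical value. The patent gives no number.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def _direction(from_pos: Vec, to_pos: Vec) -> List[float]:
    """Direction vector (arbitrary magnitude) from one point to another."""
    return [t - f for f, t in zip(from_pos, to_pos)]


def angle_deg(u: Vec, v: Vec) -> float:
    """Angle in degrees between two direction vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        raise ValueError("viewpoint coincides with the subject")
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))  # clamp rounding error


def nearest_camera(
    camera_positions: Sequence[Vec], virtual_pos: Vec, subject_pos: Vec
) -> Tuple[int, float]:
    """Index and angle of the camera whose direction to the subject best matches the virtual viewpoint's."""
    virt_dir = _direction(virtual_pos, subject_pos)
    angles = [angle_deg(_direction(c, subject_pos), virt_dir) for c in camera_positions]
    best = min(range(len(angles)), key=angles.__getitem__)
    return best, angles[best]


def is_sufficiently_close(angle: float, threshold_deg: float = 10.0) -> bool:
    """S1004 branch test: True -> S1011 (no alpha blending), False -> S1005."""
    return angle < threshold_deg
```

For a subject at the origin and a virtual viewpoint at (0, 0, 10), a camera at (1, 0, 10) differs by atan(0.1), about 5.7 degrees. A camera at (10, 0, 0) differs by 90 degrees. The first camera is selected and judged sufficiently close under the assumed threshold.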

In addition to these directions, the evaluation may also consider the position, within the field of view of the camera 101, of the subject located in the direction of interest. For example, the viewpoint difference may be evaluated as larger as the subject's position approaches the edge of the field of view of the camera 101. In this case, the direction from the input viewpoint (the real viewpoint of the camera 101) to the subject may be close to the direction from the output viewpoint (the virtual viewpoint) to the subject. Even so, if the subject is not included in the field of view of that camera 101, the two viewpoints can be evaluated as not close. Thus, the index for evaluating the closeness of viewpoints includes, for example, the field of view of each viewpoint. In the following description, the degree to which the virtual viewpoint and the real viewpoint of the camera 101 differ (the opposite of their closeness) is referred to, where necessary, as the virtual viewpoint difference.

If, as described above, the virtual viewpoint difference is determined in S1004 to be large, the process proceeds to S1005. The processing of S1005 to S1009 is the same as that of S404 to S408 in FIG. 4, so its detailed description is omitted.
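The field-of-view refinement of the closeness test described above can be sketched by checking the subject's visibility before comparing directions. This is a minimal illustration with two simplifying assumptions.

- Each camera's field of view is a cone of half-angle `half_fov_deg` around its optical axis, whereas a real camera's view is a rectangular frustum.
- The test is binary: a subject outside the view makes the viewpoints "not close". The graded variant, in which the difference grows toward the edge of the view, is not modeled.

Parameter names and the default threshold are hypothetical.

```python
import math
from typing import Sequence

Vec = Sequence[float]


def _angle_deg(u: Vec, v: Vec) -> float:
    """Angle in degrees between two vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        raise ValueError("zero-length vector")
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def viewpoints_close(
    camera_pos: Vec,
    optical_axis: Vec,
    half_fov_deg: float,
    virtual_pos: Vec,
    subject_pos: Vec,
    max_angle_deg: float = 10.0,
) -> bool:
    """Closeness test combining the subject direction and the camera's field of view."""
    cam_to_subject = [s - c for c, s in zip(camera_pos, subject_pos)]
    # Field-of-view check (cone approximation around the optical axis).
    if _angle_deg(optical_axis, cam_to_subject) > half_fov_deg:
        return False  # subject not in this camera's view: treat as not close
    virt_to_subject = [s - v for v, s in zip(virtual_pos, subject_pos)]
    return _angle_deg(cam_to_subject, virt_to_subject) < max_angle_deg
```

A camera at (1, 0, 10) whose optical axis points toward −z sees a subject at the origin and is judged close to a virtual viewpoint at (0, 0, 10). The same camera pointed toward +z is judged not close, even though its direction to the subject is unchanged.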

The process then proceeds to S1010. In S1010, the α-blending unit 911 α-blends the foreground virtual viewpoint image and the non-motion-blurred virtual viewpoint image according to the amount of motion blur and the virtual viewpoint difference. This generates a motion-blur mixed virtual viewpoint image. The larger the amount of motion blur, the smaller the blend ratio of the foreground virtual viewpoint image (the α-blend value (= α) used in α-blending) is made. Conversely, the smaller the virtual viewpoint difference, the larger the blend ratio of the foreground virtual viewpoint image (= α) is made. This completes the processing for the case where the virtual viewpoint difference is large.
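The patent states only the direction in which α moves with each factor. The formula below is therefore a hypothetical example that satisfies both rules.

- **Blur term:** `alpha_blur` falls linearly with the blur amount.
- **Closeness term:** `closeness` falls linearly with the viewpoint angle.
- **Combination:** α = alpha_blur + (1 − alpha_blur) · closeness.

The combined α decreases as blur grows and increases as the viewpoint difference shrinks. At zero viewpoint difference α becomes 1, so only the foreground (motion-blurred) render is used. That matches the near-viewpoint branch (S1011 and S1012). The constants `blur_max` and `angle_max` are assumptions.

```python
def blend_alpha(
    blur_px: float,
    angle_deg: float,
    blur_max: float = 20.0,
    angle_max: float = 30.0,
) -> float:
    """Hypothetical alpha for S1010 from the blur amount and the viewpoint difference.

    blur_px: motion blur amount map value in pixels.
    angle_deg: virtual viewpoint difference expressed as a direction angle.
    Returns the blend ratio of the foreground virtual viewpoint image.
    """
    if blur_max <= 0 or angle_max <= 0:
        raise ValueError("blur_max and angle_max must be positive")
    alpha_blur = max(0.0, 1.0 - blur_px / blur_max)     # more blur -> smaller
    closeness = max(0.0, 1.0 - angle_deg / angle_max)   # smaller difference -> larger
    # Interpolate between the blur-only alpha and 1 according to closeness.
    return alpha_blur + (1.0 - alpha_blur) * closeness
```

With `blur_max = 20` and `angle_max = 40`, a pixel with 10 pixels of blur gets α = 1.0 when the viewpoints coincide and 0.5 when they differ by 40 degrees or more. At a 20-degree difference it gets 0.75.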

On the other hand, if it is determined in S1004 that the virtual viewpoint difference is small, the process proceeds to S1011. In S1011, the first shape estimation unit 907 estimates the shape of the foreground region. Since this processing is the same as that of S407, its detailed description is omitted.
Next, in S1012, the first rendering unit 909 renders the shape of the foreground region to generate a virtual viewpoint image. Since this processing is the same as that of S408, its detailed description is omitted. Here, the output virtual viewpoint image is not α-blended. It is created solely by rendering the shape that includes the motion blur. Even without α-blending, an image with natural motion blur can be rendered when the virtual viewpoint is sufficiently close to the real viewpoint of the camera 101.
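The branch at S1004 can be summarized as a small dispatcher. This sketch assumes a fixed threshold `threshold_deg` on the virtual viewpoint difference; the source does not state how "small" and "large" are decided. It only lists which of the steps described above would run, and the function name is hypothetical.

```python
import math

# Steps run when the virtual viewpoint difference is large:
# S1005-S1009 (same as S404-S408 in FIG. 4), then alpha blending in S1010.
LARGE_DIFF_STEPS = ("S1005", "S1006", "S1007", "S1008", "S1009", "S1010")
# Steps run when it is small: shape estimation (S1011) and rendering (S1012)
# only, so no motion blur amount map or non-motion-blurred image is generated.
SMALL_DIFF_STEPS = ("S1011", "S1012")


def steps_after_s1004(view_diff_deg, threshold_deg=10.0):
    """Hypothetical S1004 decision: which processing steps follow."""
    if view_diff_deg < threshold_deg:
        return SMALL_DIFF_STEPS
    return LARGE_DIFF_STEPS


print(steps_after_s1004(3.0))            # ('S1011', 'S1012')
print(len(steps_after_s1004(math.inf)))  # 6
```

The saving in processing time comes from the short path. When the viewpoints are close, two steps replace six.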

As described above, in the present embodiment, the image processing apparatus 102 switches whether to generate the motion blur amount map and the non-motion-blurred virtual viewpoint image depending on whether the virtual viewpoint is sufficiently close to the real viewpoint of the camera 101. In addition, when the non-motion-blurred virtual viewpoint image is generated, the image processing apparatus 102 controls the blending ratio used in α-blending according to the closeness between the virtual viewpoint and the real viewpoint of the camera 101. It is therefore possible to render an image with natural motion blur while reducing processing time.
The technique of this embodiment can also be applied to the second embodiment. In this case, for example, when the virtual viewpoint and the real viewpoint of the camera 101 are sufficiently close, the processing for generating the short-Tv virtual viewpoint image is omitted.

It should be noted that each of the above-described embodiments merely shows a specific example of carrying out the present invention, and the technical scope of the present invention should not be interpreted restrictively on the basis of them. That is, the present invention can be embodied in various forms without departing from its technical idea or main features.

<Other Examples>
The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be implemented by a circuit (for example, an ASIC) that implements one or more functions.

101: camera, 102: image processing device, 103: display device, 104: input device, 105: subject

Claims

1. An image processing apparatus comprising:
a first acquisition means for acquiring shape information, which is information for specifying a three-dimensional shape of a subject located within an imaging region, based on a region of the subject extracted by a first extraction method from an image captured by any one of a plurality of imaging devices that capture images of the imaging region from a plurality of directions, and a region of the subject extracted from the captured image by a second extraction method different from the first extraction method;
a second acquisition means for acquiring viewpoint information, which is information for specifying a virtual viewpoint; and
a generating means for generating a virtual viewpoint image based on the shape information and the viewpoint information,
wherein the first extraction method includes a method of extracting the subject based on the captured image and a background image corresponding to the captured image, and
the second extraction method includes a method of extracting the subject based on the captured image and another captured image captured by the imaging device at a timing different from the timing at which the captured image was captured.

2. The image processing apparatus according to claim 1, wherein the imaging device captures a moving image including a plurality of frames, and the captured image and the other captured image captured at a timing different from the timing at which the captured image was captured are each included in the moving image as one frame.

3. The image processing apparatus according to claim 1, wherein the shape information includes first shape information for specifying the three-dimensional shape of the subject based on the region of the subject extracted from the captured image by the first extraction method, and second shape information for specifying the three-dimensional shape of the subject based on the region of the subject extracted from the captured image by the second extraction method.

4. The image processing apparatus according to claim 3, wherein the generating means generates a first virtual viewpoint image based on the viewpoint information and the first shape information, generates a second virtual viewpoint image based on the viewpoint information and the second shape information, and generates the virtual viewpoint image by synthesizing the first virtual viewpoint image and the second virtual viewpoint image.

5. The image processing apparatus according to claim 4, wherein the generating means generates the virtual viewpoint image by synthesizing the first virtual viewpoint image and the second virtual viewpoint image at a synthesis ratio determined based on an amount of movement of the subject located within the imaging region.

6. The image processing apparatus according to claim 4, wherein the generating means generates the virtual viewpoint image by synthesizing the first virtual viewpoint image and the second virtual viewpoint image at a synthesis ratio determined based on the position of one of the plurality of imaging devices and the position of the virtual viewpoint specified based on the viewpoint information.

7. The image processing apparatus according to …, wherein the generating means generates the virtual viewpoint image by alpha-blending the first virtual viewpoint image and the second virtual viewpoint image.

8. The image processing apparatus according to any one of claims 4 to 7, further comprising:
a determination means for determining whether to generate the second virtual viewpoint image based on the position of one of the plurality of imaging devices and the position of the virtual viewpoint specified based on the viewpoint information; and
an output means for outputting the first virtual viewpoint image when the determination means determines not to generate the second virtual viewpoint image, and for outputting the virtual viewpoint image generated by the generating means by synthesizing the first virtual viewpoint image and the second virtual viewpoint image when the determination means determines to generate the second virtual viewpoint image.

9. The image processing apparatus according to claim 3, wherein the first shape information and the second shape information are information representing the three-dimensional shape of the same subject.

10. The image processing apparatus according to any one of claims 1 to 9, wherein the subject is a person or a part of a person.

11. An image processing method comprising:
a first acquisition step of acquiring shape information, which is information for specifying a three-dimensional shape of a subject located within an imaging region, based on a region of the subject extracted by a first extraction method from an image captured by any one of a plurality of imaging devices that capture images of the imaging region from a plurality of directions, and a region of the subject extracted from the captured image by a second extraction method different from the first extraction method;
a second acquisition step of acquiring viewpoint information, which is information for specifying a virtual viewpoint; and
a generating step of generating a virtual viewpoint image based on the shape information and the viewpoint information,
wherein the first extraction method includes a method of extracting the subject based on the captured image and a background image corresponding to the captured image, and
the second extraction method includes a method of extracting the subject based on the captured image and another captured image captured by the imaging device at a timing different from the timing at which the captured image was captured.

12. The image processing method according to claim 11, wherein the imaging device captures a moving image including a plurality of frames, and the captured image and the other captured image captured at a timing different from the timing at which the captured image was captured are each included in the moving image as one frame.

13. A program for causing a computer to function as each means of the image processing apparatus according to any one of claims 1 to 10.
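The two extraction methods recited in the claims (comparison with a background image, and comparison with another frame captured by the same imaging device at a different timing) can be illustrated with simple per-pixel differencing. This is a minimal sketch. It assumes grayscale frames given as lists of rows and a fixed absolute-difference threshold `thresh`. The source does not specify how either method is implemented, and the function names are hypothetical.

```python
def _difference_mask(image, reference, thresh):
    """True where |image - reference| exceeds thresh (same-sized images)."""
    return [[abs(p - r) > thresh for p, r in zip(img_row, ref_row)]
            for img_row, ref_row in zip(image, reference)]


def extract_by_background(frame, background, thresh=30):
    """First extraction method (sketch): compare with a background image."""
    return _difference_mask(frame, background, thresh)


def extract_by_frame_difference(frame, other_frame, thresh=30):
    """Second extraction method (sketch): compare with a frame captured by
    the same camera at a different time."""
    return _difference_mask(frame, other_frame, thresh)


background = [[0, 0, 0, 0]]
previous = [[200, 0, 0, 0]]   # static part of the subject at x=0
current = [[200, 0, 180, 0]]  # a moving part has appeared at x=2
print(extract_by_background(current, background))      # [[True, False, True, False]]
print(extract_by_frame_difference(current, previous))  # [[False, False, True, False]]
```

In this toy example, the stationary part of the subject is extracted against the background image but not against the earlier frame. The moving part is extracted by both methods.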