WO2012042998A1 - Image processing device, image processing method, program, and recording medium - Google Patents

Image processing device, image processing method, program, and recording medium Download PDF

Info

Publication number
WO2012042998A1
Authority
WO
WIPO (PCT)
Prior art keywords
virtual viewpoint
image
images
viewpoint
viewpoints
Prior art date
Application number
PCT/JP2011/065003
Other languages
French (fr)
Japanese (ja)
Inventor
大津 誠
敦稔 〆野
Original Assignee
シャープ株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by シャープ株式会社 (Sharp Corporation)
Publication of WO2012042998A1 publication Critical patent/WO2012042998A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10 Geometric effects
    • G06T15/20 Perspective computation
    • G06T15/205 Image-based rendering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/111 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation

Definitions

  • The present invention relates to an image processing apparatus, an image processing method, a program, and a recording medium. More specifically, it relates to an image processing apparatus and an image processing method that create an image of a viewpoint that was not actually captured by applying signal processing to images captured at a plurality of different viewpoints, and to a program and a recording medium that realize the image processing function.
  • Stereoscopic televisions that simulate stereoscopic viewing by presenting different images to the left and right eyes (hereinafter referred to as 3D televisions) convey a strong sense of depth that conventional two-dimensional images cannot express, and thereby heighten the sense of presence. The left and right eyes of a human are located in different places, so when actually viewing an object each eye sees it from a slightly different angle; this difference between the left and right views (parallax) is believed to produce the sensation of depth. A 3D television exploits this human visual characteristic to realize stereoscopic vision by presenting images of different angles to the left and right eyes.
  • As a system different from the 3D television, there is the autostereoscopic multi-view display (hereinafter referred to as a multi-view display). In a multi-view display, a lenticular lens consisting of small semi-cylindrical lenses attached to the front of the display presents images at slightly different angles in multiple directions; when the display is viewed, images of two different angles enter the right eye and the left eye, enabling stereoscopic viewing. With this method, when the head is moved, the two images corresponding to the next position enter the right and left eyes, so a more natural stereoscopic view is obtained. The change in the angle at which an object is seen as the viewpoint position moves is called motion parallax, and together with binocular parallax it is an element necessary for natural stereoscopic vision.
  • Regarding viewpoint synthesis technology, there is a method for improving synthesis quality by using videos captured from a plurality of different viewpoints. Specifically, the video of the desired viewpoint is first generated intermediately for each viewpoint; then, pixel by pixel, the intermediate generation result that can be assumed to have the highest quality is selected, or the intermediate generation results are blended with weights according to their assumed quality, which makes it possible to raise the quality of the final synthesized video. There are several prior-art documents that describe how to calculate such intermediate synthesized videos from each viewpoint and how to select and blend them appropriately.
  • For example, Patent Document 1 discloses an example in which, when synthesizing the video of a viewpoint between left and right camera images, the method for calculating the pixel value of the synthesized viewpoint is switched according to three conditions. Points corresponding to the two neighboring pixels that straddle the pixel of the synthesized viewpoint in the horizontal direction are obtained in the left and right source images, and the switching is based on the difference in length between those two corresponding points. Specifically, when the length between the two corresponding points in the left image exceeds the length between the two corresponding points in the right image by more than a predetermined condition, the video shot by the left camera is used for synthesis. Conversely, when the length between the two corresponding points in the right image exceeds that in the left image by more than the predetermined condition, the video shot by the right camera is used. If neither of these two conditions is satisfied, the composition ratio is fixed according to the position of the viewpoint to be synthesized, and blending is performed based on that ratio.
  • Patent Document 2 discloses a method in which, when calculating the pixels of a virtual viewpoint video, a reliability is computed both for pixel values synthesized from different viewpoints and for pixel values synthesized from different times at the corresponding viewpoint, and the synthesis ratio is set higher for the pixel value with the higher reliability. That is, the reliability of an image synthesized from a viewpoint different from the desired viewpoint (scheme 1) and of an image synthesized from the same viewpoint but at a different time (scheme 2) is calculated, and viewpoint synthesis is performed so that the scheme with the higher reliability receives the larger synthesis ratio. The determination compares the feature amount indicating the reliability of scheme 1 with that of scheme 2. The feature amount for scheme 1 is the value obtained by finding corresponding blocks in the left and right camera images and summing the pixel-value differences between them (average inter-parallax error). The feature amount for scheme 2 is the value obtained by finding corresponding blocks at the times before and after the time of interest and summing the pixel-value differences within those blocks (average temporal error). In both schemes the center of the block is the position of the pixel being processed for viewpoint synthesis.
  • In the viewpoint synthesis method described in Patent Document 1, the pixel selected as the final synthesis result is fixedly determined by the distance between the pixels in the left and right camera images that correspond to the two points sandwiching the target pixel. For example, when the length between the two corresponding points in the left camera image is longer than that in the right camera image, the desired pixel value is calculated using the left camera image; conversely, when the length in the right camera image is longer, the right camera image is used. However, in either case the synthesis is performed by interpolation between the two points, so even if the corresponding span is long, the sampling position for synthesis may not coincide with the position of the pixel to be obtained, which enlarges the conversion error. When many such conversion errors occur, the synthesis quality deteriorates.
  • The viewpoint synthesis method described in Patent Document 2 compares pixels synthesized from videos of different viewpoints with pixels synthesized from images of the same viewpoint but captured at different times, judges the one with the smaller average error within the corresponding block to be the more reliable, and synthesizes with it. In occlusion regions, where an area is invisible in one of the images from a different viewpoint or a different time, the error between blocks becomes large and the reliabilities differ clearly, so the synthesis scheme can be selected correctly. In non-occlusion regions, however, as pointed out for Patent Document 1, even if the reliability is high, the sampling position used for synthesis may not match the pixel position to be obtained.
  • This problem is common to Patent Document 1: in order to judge synthesis results obtained under different conditions, the decision is based on secondary, situational criteria that only indirectly affect the result (in Patent Document 1, the assumption that the wider span between corresponding pixels is better suited to synthesis; in Patent Document 2, the assumption that the smaller of the average error between corresponding blocks of different viewpoints and the error between corresponding blocks at different times is better suited to synthesis). The error that actually arises during the synthesis conversion is therefore not included in the judgment criteria.
  • In view of the above problems, the present invention does not judge by such situational criteria; instead, it uses the once-synthesized result itself and judges by the continuity of the synthesized signal. This makes it possible to accurately and appropriately select, or appropriately weight, in units of pixels, the intermediate synthesized viewpoint videos obtained by synthesizing the target viewpoint under different conditions, thereby improving the synthesis quality. An object of the present invention is to provide an image processing apparatus, an image processing method, a program, and a recording medium that achieve this.
  • A first technical means for solving the above problem is an image processing apparatus that synthesizes a virtual viewpoint image located between a plurality of viewpoints using camera videos of the plurality of viewpoints, comprising: a virtual viewpoint synthesis unit that generates a plurality of intermediate virtual viewpoint images, each based on one of the camera videos of the plurality of viewpoints, using information indicating corresponding points between the images of the plurality of viewpoints; a continuity calculation unit that, for each of the intermediate virtual viewpoint images synthesized by the virtual viewpoint synthesis unit, calculates a feature amount indicating the local continuity of that virtual viewpoint image; a synthesis ratio calculation unit that calculates, based on the feature amounts calculated by the continuity calculation unit, a ratio for combining the plurality of intermediate virtual viewpoint images; and a synthesis unit that combines the plurality of intermediate virtual viewpoint images according to the ratio calculated by the synthesis ratio calculation unit to form the final virtual viewpoint image.
  • A second technical means is, in the first technical means, characterized in that the feature amount is the entropy whose events are the edge amounts within a mask centered on the processing target pixel of the intermediate virtual viewpoint image.
  • the third technical means is characterized in that, in the first or second technical means, the plurality of viewpoints are two or more viewpoints.
  • the fourth technical means is characterized in that in any one of the first to third technical means, information indicating the corresponding points is input from the outside.
  • A fifth technical means is, in any one of the first to third technical means, characterized in that the virtual viewpoint synthesis unit calculates information indicating the correspondence between the images of the plurality of viewpoints and, based on that information, generates the plurality of intermediate virtual viewpoint images, each based on one of the camera videos of the plurality of viewpoints, by interpolating mutually corresponding pixels.
  • A sixth technical means is an image processing method for synthesizing a virtual viewpoint image located between a plurality of viewpoints using camera videos of the plurality of viewpoints, comprising: a virtual viewpoint synthesis step of generating a plurality of intermediate virtual viewpoint images, each based on one of the camera videos of the plurality of viewpoints, using information indicating corresponding points between the images of the plurality of viewpoints; a continuity calculation step of calculating, for each of the intermediate virtual viewpoint images synthesized in the virtual viewpoint synthesis step, a feature amount indicating the local continuity of that virtual viewpoint image; a synthesis ratio calculation step of calculating, based on the feature amounts calculated in the continuity calculation step, a ratio for combining the plurality of intermediate virtual viewpoint images; and a synthesis step of combining the plurality of intermediate virtual viewpoint images according to the ratio calculated in the synthesis ratio calculation step to synthesize the final virtual viewpoint image.
  • A seventh technical means is, in the sixth technical means, characterized in that the feature amount is the entropy whose events are the edge amounts within a mask centered on the processing target pixel of the intermediate virtual viewpoint image.
  • An eighth technical means is an image processing program that causes a computer to execute: a virtual viewpoint synthesis step of generating a plurality of intermediate virtual viewpoint images, each based on one of the camera videos of a plurality of viewpoints, using information indicating corresponding points between the images of the plurality of viewpoints acquired from those camera videos; a continuity calculation step of calculating, for each of the intermediate virtual viewpoint images synthesized in the virtual viewpoint synthesis step, a feature amount indicating the local continuity of that virtual viewpoint image; a synthesis ratio calculation step of calculating, based on the feature amounts calculated in the continuity calculation step, a ratio for combining the plurality of intermediate virtual viewpoint images; and a synthesis step of combining the plurality of intermediate virtual viewpoint images according to the calculated ratio to synthesize the final virtual viewpoint image.
  • A ninth technical means is, in the eighth technical means, characterized in that the feature amount is the entropy whose events are the edge amounts within a mask centered on the processing target pixel of the intermediate virtual viewpoint image.
  • the tenth technical means is a computer-readable recording medium on which the program of the eighth or ninth technical means is recorded.
  • According to the present invention, by using the once-synthesized result itself and judging by the continuity of the synthesized signal, the intermediate synthesized viewpoint videos obtained by synthesizing the target viewpoint under different conditions can be selected accurately and appropriately in units of pixels, and the synthesis quality can be improved. In particular, in viewpoint synthesis technology that creates the video of a viewpoint where no physical camera exists from videos captured at a plurality of different viewpoints, selecting and blending the intermediate synthesized viewpoint videos generated from the different viewpoints accurately per pixel improves the quality of the composite video. Furthermore, by generating arbitrary-viewpoint videos with the present invention and displaying them on a stereoscopic display, a dense multi-view video can be generated in a pseudo manner even from videos of few viewpoints, enabling high-quality multi-view stereoscopic viewing.
  • FIG. 1 is a block diagram showing an embodiment of an image processing apparatus of the present invention.
  • As shown in FIG. 1, the image processing apparatus of the present invention includes frame buffers 1, 2, 3, 4, 7, and 8, virtual viewpoint synthesis units 5 and 6, mask formation units 9 and 10, continuity feature amount calculation units 11 and 12, a synthesis ratio calculation unit 13, and a synthesis unit 14.
  • the frame buffers 1 and 2 are frame buffers for temporarily holding a frame (image) at a certain time extracted from a video shot at a predetermined viewpoint.
  • The frame buffer 3 stores corresponding point calculation information, that is, information indicating corresponding points between images of different viewpoints, for the same viewpoint as the video held in the frame buffer 1.
  • information indicating corresponding points is input from the outside.
  • the frame buffer 4 stores corresponding point calculation information corresponding to the viewpoint of the frame buffer 2.
  • Corresponding point calculation information indicating corresponding points is depth information indicating the distance to the subject, for example, and will be described in detail later.
  • The virtual viewpoint synthesis unit 5 (6) receives the viewpoint image at a specific time held in the frame buffer 1 (2) and the corresponding point calculation information at the same time held in the frame buffer 3 (4), and converts the input viewpoint image to the desired viewpoint using the corresponding point calculation information. The method of converting a captured viewpoint image to a desired viewpoint using the corresponding point calculation information is described later. The image converted in the virtual viewpoint synthesis unit 5 (6) is temporarily stored in another frame buffer 7 (8) and is divided into local blocks in the mask formation unit 9 (10), which outputs a partial image while shifting the position pixel by pixel.
  • The continuity feature amount calculation unit 11 (12) calculates a feature amount indicating continuity using the block image input from the mask formation unit 9 (10), and outputs the result to the synthesis ratio calculation unit 13. The synthesis ratio calculation unit 13 calculates, according to the continuity feature amounts input from the continuity feature amount calculation units 11 and 12, a synthesis ratio for the intermediately generated synthesis result of each viewpoint, and outputs it to the synthesis unit 14. The synthesis unit 14 reads the converted images from the frame buffers 7 and 8 according to the synthesis ratio input from the synthesis ratio calculation unit 13, combines the intermediate viewpoint synthesized images of the respective viewpoints, and generates and outputs the final synthesized viewpoint image.
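As a structural sketch of this data flow (not the patent's implementation), the last stages of FIG. 1 can be written as follows. Here `warp_fn` and `entropy_fn` are hypothetical placeholders for the virtual viewpoint synthesis units and the continuity feature amount calculation units described later, and the inverse-entropy weighting is only one plausible reading of the synthesis ratio calculation (the patent's exact formula, Equation (8), is not reproduced in this extraction).

```python
import numpy as np

def blend_two_views(img_a, depth_a, cam_a, img_b, depth_b, cam_b,
                    cam_virtual, warp_fn, entropy_fn, eps=1e-6):
    """Sketch of the FIG. 1 flow for two real cameras.

    warp_fn(img, depth, cam_src, cam_dst) -> HxWx3 intermediate virtual viewpoint image
    entropy_fn(img)                       -> HxW map of local continuity features

    Both callables stand in for the processing described in the text.
    """
    # Virtual viewpoint synthesis units 5 and 6 (results kept in frame buffers 7 and 8).
    inter_a = np.asarray(warp_fn(img_a, depth_a, cam_a, cam_virtual), dtype=np.float64)
    inter_b = np.asarray(warp_fn(img_b, depth_b, cam_b, cam_virtual), dtype=np.float64)
    # Continuity feature amount calculation units 11 and 12.
    h_a = entropy_fn(inter_a)
    h_b = entropy_fn(inter_b)
    # Synthesis ratio calculation unit 13: smaller entropy -> larger weight
    # (one plausible weighting; not necessarily the patent's Equation (8)).
    w_a = (h_b + eps) / (h_a + h_b + 2 * eps)
    # Synthesis unit 14: per-pixel blend of the two intermediate images.
    return w_a[..., None] * inter_a + (1.0 - w_a)[..., None] * inter_b
```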
  • FIG. 2 shows a state where the subject is photographed from a plurality of different positions.
  • 21, 22, 23, 24, and 25 indicate real cameras for photographing a subject and their positional relationships.
  • Reference numerals 26, 27, 28, and 29 between the cameras indicate areas that cannot be physically installed due to, for example, the size of the camera casing, or gaps due to sparse cameras.
  • a position where a real camera exists is referred to as a real camera position.
  • The viewpoints i-2, i-1, i, i+1, and i+2 are defined so as to correspond to the real cameras 21, 22, 23, 24, and 25 in FIG. 2, and the videos captured by the cameras are denoted V(i-2), V(i-1), V(i), V(i+1), and V(i+2). Since the actual processing is performed on an image at a specific time extracted from these videos, the following notation is used to simplify handling: the images (frames) at time t extracted from the videos shot at the respective viewpoints are denoted I(i-2, t), I(i-1, t), I(i, t), I(i+1, t), and I(i+2, t).
  • When the images are arranged with respect to viewpoint and time for explanation, they can be laid out two-dimensionally as shown in FIG. 3. In FIG. 3, a horizontal row of images surrounded by a dotted line represents the sequence of images of one viewpoint, that is, one video, and a vertical column of images not enclosed by a dotted line is the set of images of the different viewpoints at one time.
  • In the following, the case where the desired virtual viewpoint position lies between the camera 23 and the camera 24 is taken as an example; the case of calculating a synthesized viewpoint at a different position can be processed in the same manner. Hereinafter, a desired viewpoint calculated by synthesis is referred to as a virtual viewpoint, and the synthesized video at that viewpoint is referred to as a synthesized video.
  • the basic part of the viewpoint synthesis handled in this embodiment can be realized by using the technique described in Non-Patent Document 1, for example.
  • To perform viewpoint synthesis, it is assumed that the external and internal parameters of each camera that acquires a video are known, and that distance information corresponding to each viewpoint (hereinafter referred to as depth information) is available. Here, the external parameters of a camera form a matrix indicating the three-dimensional position and orientation of the camera, and the internal parameters form a matrix indicating the focal length, lens distortion, and inclination of the projection plane.
  • Depth information used as the corresponding point calculation information is defined by the Moving Picture Experts Group (MPEG), a working group of the international standardization bodies ISO/IEC, and is expressed in 256 levels, that is, as 8-bit luminance values. The distance information is therefore an 8-bit gray-scale image. Higher luminance values are assigned to shorter distances, so nearer subjects appear whiter and farther ones blacker. To decode this distance information into actual distances, the distance corresponding to the largest (white) value and the distance corresponding to the smallest (black) value are defined separately, and intermediate values are assigned linearly between them, so the actual distance can be obtained.
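Following the description above literally (8-bit values assigned linearly between a separately defined nearest and farthest distance, with larger values meaning nearer), decoding a depth sample could look like the sketch below. This is an assumption-laden illustration: the MPEG depth format also has a widely used inverse-depth (disparity-linear) mapping, and the function name and parameters here are hypothetical.

```python
def decode_depth(value_8bit, z_near, z_far):
    """Map an 8-bit depth sample to a metric distance.

    Follows the text literally: 255 (white) -> z_near (closest),
    0 (black) -> z_far (farthest), linear in between.
    """
    if not 0 <= value_8bit <= 255:
        raise ValueError("depth sample must be an 8-bit value")
    t = value_8bit / 255.0               # 0.0 = farthest, 1.0 = nearest
    return z_far + t * (z_near - z_far)  # linear assignment between the two limits

# Example: with z_near = 2 m and z_far = 10 m, a mid-grey sample maps to about 6 m.
print(decode_depth(128, z_near=2.0, z_far=10.0))  # ~5.98
```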
  • In the method of Non-Patent Document 1, the two nearest real cameras are selected so that the desired virtual viewpoint lies between them.
  • the selected real cameras are the camera 23 (viewpoint i) and the camera 24 (viewpoint i + 1).
  • a virtual viewpoint video is created using the two camera videos.
  • Specifically, an image for one frame is extracted from each video, and a virtual viewpoint image is synthesized from it. From each of the two real camera images an intermediate virtual viewpoint image is created, and finally, for the whole screen, either the image from the camera closer to the virtual viewpoint position is selected, or the two are blended according to a composition ratio determined by the position, to produce the virtual viewpoint image.
  • Here, the method described in Non-Patent Document 1 for calculating the video of a desired viewpoint from camera videos with known external and internal parameters and depth information will be described. The viewpoint synthesis technique of Non-Patent Document 1 uses 3D warping. In 3D warping, using an image acquired by a camera with known characteristics together with its depth information, the position in three-dimensional space corresponding one-to-one to each pixel of the image is determined and then projected onto the projection plane of the virtual viewpoint video, which yields the correspondence between each pixel of the real camera and the corresponding pixel of the virtual viewpoint. Based on this correspondence, the texture (pixel value) of the real-camera pixel is acquired and assigned to the corresponding pixel of the virtual viewpoint image, so that a synthesized image can be created.
  • the above is the basic concept of viewpoint synthesis.
  • In Non-Patent Document 1, the criterion for this selection or composition ratio is fixed by the relationship between the position of the virtual viewpoint and the positions of the selected real source cameras. In contrast, the present invention compares the stationarity of the local signals of the intermediately obtained virtual viewpoint images and selects the intermediate synthesis result with the higher stationarity, or increases its weight in the synthesis ratio, to create the synthesized image, thereby improving the final synthesis quality. The stationarity (local continuity) of a signal is the degree to which the synthesized signal is concentrated on a characteristic signal within a local region extracted from the intermediate virtual viewpoint image obtained for each of the plural viewpoints. Being concentrated on a characteristic signal means, for example, that the edge amounts, computed as the absolute differences from adjacent pixels, are concentrated around a specific magnitude. If the synthesis quality is high, the local region concentrates on a specific edge amount; if the synthesis quality is low, conversion noise mixed in during the conversion process adds a conversion error to the inherent edge amount, and the resulting distribution of edge amounts is dispersed.
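In symbols, one way to write the edge amount referred to here is given below; this uses hypothetical notation (e for the edge amount, I for the gray-scale intermediate virtual viewpoint image) and assumes the horizontally adjacent pixel is used, which the text does not fix.

```latex
e(x, y) \;=\; \bigl|\, I(x, y) - I(x - 1, y) \,\bigr|
```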
  • FIG. 4 is a diagram explaining the relationship between the distribution of edge-amount occurrence probabilities in a local region and the continuity of an intermediately generated virtual viewpoint image. The horizontal axis indicates the edge amount, and the vertical axis indicates the occurrence probability of that edge amount within the local region. FIG. 4(A) shows a case where a specific edge amount e has a peak and the occurrence probabilities are concentrated in its vicinity; FIG. 4(B) shows a case where the edge amount e also has a peak, but the concentration is low and the distribution is broad overall. Because FIG. 4(A) is concentrated on the specific edge amount e (high stationarity), the reliability of the signal obtained by synthesis can be considered high, so selecting the virtual viewpoint image with the characteristics of FIG. 4(A) gives the better result. By making this judgment based on stationarity over the entire image while shifting the target pixel, and selecting or blending accordingly, an optimal synthesized image can be created.
  • a method for generating a virtual viewpoint video according to the present invention will be specifically described with reference to a block diagram (FIG. 1) and a flowchart (FIG. 5).
  • the real cameras for photographing the subject are cameras 21, 22, 23, 24, and 25, and an example in which the viewpoint between the real camera positions of the camera 23 and the camera 24 is synthesized will be described.
  • In step S1-1, the real cameras to be used for synthesizing the virtual viewpoint video are selected; the two cameras nearest to, and sandwiching, the virtual viewpoint position to be synthesized are chosen.
  • Depth information can be obtained in various ways.
  • measurement is performed using a distance measuring device that can irradiate an object with infrared rays, measure the time until the light is reflected and returned, and obtain the distance to the object.
  • The distance d0 to the object can be calculated by the following equation, where V_IR is the speed at which the infrared light travels and t_tof is the time from when the infrared light is emitted until it returns to the distance measuring device.
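The equation itself is not reproduced in this extraction of the patent; from the definitions given (round-trip time t_tof at propagation speed V_IR), the standard time-of-flight relation would be:

```latex
d_0 \;=\; \frac{V_{IR} \cdot t_{tof}}{2}
```

The factor of 2 accounts for the light travelling to the object and back.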
  • This processing is performed at the same resolution as the captured image, and a depth image (depth information) is obtained.
  • Using the method of Non-Patent Document 1, a virtual viewpoint image can be created by the following equation (Equation (3)). This process is called 3D warping and corresponds to steps S1-6 and S1-7 performed in the virtual viewpoint synthesis units 5 and 6.
  • Here, d0 and d0' are the distance information at the real camera position and at the virtual viewpoint position, respectively. A, R, and t are parts of the internal and external parameters of the real camera and represent the internal parameter matrix, the camera rotation, and the three-dimensional position of the camera, respectively; A', R', and t' are the corresponding internal parameter matrix, rotation, and three-dimensional position of the virtual viewpoint camera. R^-1 and A^-1 are the inverses of the corresponding matrices. c and c' denote the coordinates in the real camera image and in the virtual camera image in a homogeneous coordinate system, in which one dimension is added to the ordinary two-dimensional coordinates; for example, the two-dimensional coordinate (x, y) is expressed as (x, y, 1), with 1 assigned to the added dimension.
  • The correspondence between the coordinate c of the real camera and the coordinate c' of the virtual viewpoint is obtained by Equation (3), and by extracting the real-camera pixel values corresponding to all pixels of the virtual viewpoint and pasting them in, the virtual viewpoint image can be created.
  • the generated virtual viewpoint image is temporarily stored in the frame buffers 7 and 8 for each viewpoint.
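The warping of Equation (3) (not reproduced in this extraction) follows the standard pinhole back-projection/re-projection chain. The sketch below is a common formulation under the convention that R and t map world coordinates to camera coordinates (x_cam = R·X + t); the patent describes t as the camera's three-dimensional position, so signs and conventions may differ from its exact Equation (3). All names here are illustrative.

```python
import numpy as np

def warp_pixel(u, v, depth, A, R, t, A_v, R_v, t_v):
    """Map one real-camera pixel (u, v) with depth `depth` to virtual-camera
    pixel coordinates, assuming x_cam = R @ X_world + t for both cameras.

    A, R, t       : intrinsics, rotation, translation of the real camera
    A_v, R_v, t_v : the same for the virtual viewpoint camera
    """
    c = np.array([u, v, 1.0])                 # homogeneous pixel coordinate
    x_cam = np.linalg.inv(A) @ c * depth      # back-project into camera space
    X_world = np.linalg.inv(R) @ (x_cam - t)  # camera space -> world space
    x_virt = R_v @ X_world + t_v              # world space -> virtual camera space
    c_virt = A_v @ x_virt                     # project with the virtual intrinsics
    return c_virt[0] / c_virt[2], c_virt[1] / c_virt[2], c_virt[2]  # (u', v', depth')

# A full warp loops over all pixels, keeps the nearest sample when several map to
# the same target pixel (z-buffering), and assigns the real-camera texture there.
```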
  • By performing the above for each of the two selected real cameras, two intermediate synthesized viewpoint images I_i(i', t) and I_(i+1)(i', t), as shown in FIG. 6, are obtained.
  • In FIG. 6, 61 and 62 are the real cameras, 63 is the virtual viewpoint camera, and I_i and I_(i+1) denote the virtual viewpoint images synthesized from viewpoints i and i+1, respectively; the synthesized viewpoint is i'. The two generated intermediate virtual viewpoint images I_i(i', t) and I_(i+1)(i', t) are two-dimensional planes, each addressed by an x coordinate and a y coordinate.
  • A 7 × 7 mask is formed as follows, with the processing target pixel (x, y) at its center (S1-8, S1-9).
  • the continuity feature amount calculation units 11 and 12 will be described.
  • In the present embodiment, entropy, the average amount of information handled in information theory, is applied to the determination of continuity. The amount of information is a measure of how unlikely a given event is when a plurality of events can occur, and the average (expected value) of the information amounts of all events is called entropy. For example, comparing the event of the peak value e in FIGS. 4(A) and 4(B), its occurrence probability is higher in FIG. 4(A) than in FIG. 4(B); in the case of FIG. 4(A), even if information about the event of peak value e is obtained, its information amount is small, because the event is easy to predict. Accordingly, the average information amount (entropy) over all events is smaller in FIG. 4(A). That is, the entropy value decreases when the occurrence probabilities are biased so that the occurring event can be estimated with high probability. Therefore, it can be determined that the smaller the entropy value, the higher the continuity of the obtained signal.
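For reference, with p_k the occurrence probability of event k (here, of each edge amount inside the mask), n_k its occurrence frequency, and numM the number of pixels in the mask, the quantity in question is the standard Shannon entropy; the patent's Equations (5)-(7) build exactly this from the edge-amount histogram. The log base is assumed to be 2 here; the base only scales the value and does not change the comparison.

```latex
H \;=\; -\sum_{k} p_k \log_2 p_k, \qquad p_k \;=\; \frac{n_k}{\mathrm{numM}}
```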
  • For each pixel in the mask defined by Equation (4), the absolute value of its difference from an adjacent pixel (the edge amount) is obtained, and this is used as the event for calculating the entropy.
  • the frequency of occurrence of each event can be calculated by the following equation.
  • Note that the pixel values of the image being processed are generally composed of three components, such as RGB or YCbCr values; to simplify the explanation, they are converted to gray-scale values by the following conversion.
  • the occurrence probability of each event can be obtained by dividing the occurrence frequency of equation (5) by the number of pixels in the mask, and can be obtained by the following equation.
  • numM is a constant and is the number of pixels in the mask of Equation (4).
  • The entropy calculation (S1-10, S1-11) performed by the continuity feature amount calculation units 11 and 12 is given by the following equation.
  • the entropy is calculated for each intermediate virtual viewpoint image generated from each viewpoint.
  • two entropy values are calculated for each pixel by equation (7).
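A minimal per-pixel sketch of steps S1-8 through S1-11 (mask formation and entropy calculation) might look like this. It assumes a gray-scale image, uses the horizontally adjacent pixel for the edge amount, and bins edge amounts at integer resolution; all three choices are assumptions where the text leaves details open, and the function name is hypothetical.

```python
import numpy as np

def edge_entropy_map(gray, mask_size=7):
    """Per-pixel entropy of edge amounts within a mask_size x mask_size window.

    gray : 2-D float array (gray-scale intermediate virtual viewpoint image).
    Returns an array of the same shape; smaller values indicate higher continuity.
    """
    h, w = gray.shape
    # Edge amount: absolute difference from the horizontally adjacent pixel.
    edges = np.abs(np.diff(gray, axis=1, prepend=gray[:, :1]))
    edges = np.rint(edges).astype(np.int64)        # integer bins as histogram events
    r = mask_size // 2
    out = np.zeros_like(gray, dtype=np.float64)
    for y in range(h):
        for x in range(w):
            win = edges[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            counts = np.bincount(win.ravel())       # occurrence frequency (Eq. (5))
            p = counts[counts > 0] / win.size       # occurrence probability (Eq. (6))
            out[y, x] = -np.sum(p * np.log2(p))     # entropy (Eq. (7))
    return out
```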
  • the composition ratio can be determined by the following equation (S1-12). This process is performed by the synthesis ratio calculation unit 13.
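Equation (8) is not reproduced in this extraction. One plausible form consistent with the stated rule (the intermediate image with the smaller entropy, i.e. higher continuity, receives the larger weight) is shown below, with H_i and H_(i+1) the per-pixel entropies of the two intermediate images and α the weight given to I_i; this is an illustrative assumption, not the patent's verified formula.

```latex
\alpha(x, y) \;=\; \frac{H_{i+1}(x, y)}{H_{i}(x, y) + H_{i+1}(x, y)}, \qquad
I_{i'}(x, y) \;=\; \alpha(x, y)\, I_{i}(x, y) \;+\; \bigl(1 - \alpha(x, y)\bigr)\, I_{i+1}(x, y)
```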
  • The above describes an example in which the synthesis unit 14 appropriately weights and combines, in units of pixels, synthesized viewpoint images generated intermediately from two different viewpoints, but the invention also supports multi-view configurations with three or more viewpoints. In FIG. 1, the configuration corresponding to one viewpoint consists of the frame buffer 1, the frame buffer 3, the virtual viewpoint synthesis unit 5, the frame buffer 7, the mask formation unit 9, and the continuity feature amount calculation unit 11; by adding such a set per viewpoint, the number of real cameras used for synthesis can be increased. Increasing the number of viewpoints makes it possible to further improve the synthesis quality by appropriately using subject information from a wider variety of viewpoints.
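Extending the same per-pixel weighting to three or more viewpoints is straightforward; a sketch is given below, again assuming the inverse-entropy weighting used earlier rather than the patent's exact formula, and reusing the hypothetical edge_entropy_map() sketch above.

```python
import numpy as np

def blend_n_views(intermediates, entropies, eps=1e-6):
    """Blend N intermediate virtual viewpoint images with weights that grow as
    the per-pixel entropy (lower = more stationary) shrinks.

    intermediates : list of HxWx3 arrays, one per real camera
    entropies     : list of HxW arrays, e.g. from edge_entropy_map()
    """
    inv = np.stack([1.0 / (h + eps) for h in entropies])    # N x H x W
    weights = inv / inv.sum(axis=0, keepdims=True)           # normalize per pixel
    stack = np.stack([np.asarray(im, dtype=np.float64) for im in intermediates])
    return np.sum(weights[..., None] * stack, axis=0)        # H x W x 3 result
```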
  • FIG. 7 is a block diagram showing the second embodiment of the present invention.
  • Blocks that perform the same processing as in the first embodiment are given the same numbers. The difference between the first and second embodiments is whether the information indicating the pixel-wise correspondence between images of different viewpoints is input from the outside or created internally; therefore, the second embodiment has no frame buffers 3 and 4 for storing the corresponding viewpoint information of the first embodiment. The block added in the second embodiment is a disparity vector calculation unit 71. The virtual viewpoint synthesis units 72 and 73 differ in processing from the virtual viewpoint synthesis units 5 and 6 because the content of the corresponding viewpoint information they receive is different, so their numbers are changed from those of the first embodiment (FIG. 1).
  • An image at time t to be processed is extracted from the video captured by the selected real camera (S2-2, S2-3), and a disparity vector is calculated from these images (S2-4).
  • As a method of internally generating the corresponding viewpoint information and synthesizing the viewpoint in between, the viewpoint synthesis method described in Patent Document 1 can be used. In this method, the correspondence between the images is computed by the disparity vector calculation unit 71 and is obtained by finding the parallax amount p that minimizes the following evaluation value E(p).
  • W represents a local mask for performing the matching, for example a mask with a size of 7 × 7.
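A minimal block-matching sketch of the disparity vector calculation (S2-4) is given below. It assumes horizontal-only disparity, a sum of absolute differences as E(p) over the 7 × 7 mask W, and a bounded search range; the patent only states that the parallax amount p minimizing E(p) is found, so the cost function, search range, and reference-image choice here are assumptions.

```python
import numpy as np

def disparity_map(left, right, max_disp=64, mask_size=7):
    """For each pixel of `left`, find the horizontal shift p in [0, max_disp]
    that minimizes the SAD over a mask_size x mask_size window (E(p))."""
    h, w = left.shape
    r = mask_size // 2
    disp = np.zeros((h, w), dtype=np.int64)
    for y in range(r, h - r):
        for x in range(r, w - r):
            block = left[y - r:y + r + 1, x - r:x + r + 1]
            best_cost, best_p = np.inf, 0
            for p in range(0, min(max_disp, x - r) + 1):
                cand = right[y - r:y + r + 1, x - p - r:x - p + r + 1]
                cost = np.abs(block - cand).sum()   # E(p): sum of absolute differences
                if cost < best_cost:
                    best_cost, best_p = cost, p
            disp[y, x] = best_p
    return disp
```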
  • the virtual viewpoint camera 83 is located at a distance DL from the real camera 81 and a distance DR from the real camera 82.
  • Once the correspondence is obtained for each pixel, it is assumed that corresponding points also exist along the horizontal line connecting two corresponding points. If the correspondence shown in FIG. 9 is obtained, the pixels of the viewpoint in between can be obtained by the following equation with the camera 81 (viewpoint i) as the reference (S2-5); this process is performed by the virtual viewpoint synthesis unit 72.
  • Similarly, with the other real camera as the reference, the pixels of the viewpoint in between can be obtained from the following equation (S2-6); this process is performed by the virtual viewpoint synthesis unit 73.
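Equations (11) and (12) are not reproduced in this extraction. A plausible reading, given that the virtual camera 83 lies at distances DL and DR from the real cameras 81 and 82, is that each pixel is shifted by the fraction of its disparity corresponding to the relative distance. The sketch below follows that assumption for the camera-81-referenced intermediate image (S2-5); the camera-82-referenced image (S2-6) would be obtained symmetrically with the complementary fraction. Shift direction, hole handling, and the function name are all assumptions.

```python
import numpy as np

def warp_by_disparity(src, disp, fraction):
    """Shift each pixel of `src` by `fraction` of its disparity (forward warping).

    src      : HxW gray-scale source image (e.g. viewpoint i)
    disp     : HxW disparity toward the other real camera
    fraction : e.g. DL / (DL + DR) for the camera-81-referenced intermediate image
    """
    h, w = src.shape
    out = np.zeros_like(src)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            xt = int(round(x - fraction * disp[y, x]))   # assumed shift direction
            if 0 <= xt < w:
                out[y, xt] = src[y, x]
                filled[y, xt] = True
    # Un-filled pixels (holes) would still need in-painting or the other view.
    return out, filled
```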
  • In Patent Document 1, a virtual viewpoint image is obtained by adaptively switching between Equation (11) and Equation (12) depending on the distance between the two corresponding points, whereas in the present embodiment both synthesis results are first calculated and then combined using the continuity of the local signal.
  • Subsequent processing after calculating a plurality of intermediate synthesis results is the same as in the first embodiment.
  • The present invention can also be implemented as a computer-readable recording medium on which a program to be executed by a computer is recorded: the method of generating intermediate synthesized viewpoint videos for the different viewpoints, calculating the synthesis ratio based on the local continuity of the obtained intermediate synthesized videos, and synthesizing the virtual viewpoint video can be recorded as software processing.
  • The recording medium may be a program medium such as a memory (not shown), for example a ROM, because the processing is performed by a microcomputer, or it may be a program medium that is readable by inserting it into a program reading device provided as an external storage device (not shown). In either case, the stored program may be configured to be accessed and executed directly by the microprocessor, or the program may be read out, downloaded into a program storage area (not shown) of the microcomputer, and then executed; in the latter case, the program for downloading is assumed to be stored in the main device in advance.
  • Here, the program medium is a recording medium configured to be separable from the main body, and may be a medium that carries the program in a fixed manner, including tape systems such as magnetic tape and cassette tape; disk systems such as magnetic disks including floppy (registered trademark) disks and hard disks, and optical disks such as CD-ROM/MO/MD/DVD; card systems such as IC cards (including memory cards) and optical cards; and semiconductor memories such as mask ROM, EPROM (Erasable Programmable Read Only Memory), EEPROM (Electrically Erasable Programmable Read Only Memory), and flash ROM.
  • the medium may be a medium that dynamically carries the program so as to download the program from the communication network.
  • the download program may be stored in the main device in advance or installed from another recording medium.
  • the recording medium is read by a program reading device provided in a digital color image forming apparatus or a computer system, whereby the above-described image processing method is executed.
  • The computer system includes an image input device such as a general-purpose WEB camera, a computer that performs various kinds of processing, including the above image processing method, by loading a predetermined program, an image display device such as a display or liquid crystal display that shows the processing results of the computer, and a network card, modem, or the like as communication means for connecting to a server or the like via a network.

Abstract

The present invention increases synthesis quality by precisely and appropriately selecting, in units of pixels, among intermediate synthesized viewpoint images that synthesize a video of a desired viewpoint from differing conditions. A video processing device generates a video from any given viewpoint from a subject captured from a plurality of different positions. Virtual viewpoint generating units (5, 6) generate intermediate synthesized images using a plurality of camera videos selected for generating the video of the desired virtual viewpoint. A continuity feature-quantity calculation unit calculates feature quantities indicating local continuity from each intermediate synthesized image. A synthesis ratio calculation unit (13) appropriately selects an intermediate synthesized image on the basis of the calculated feature quantities, or calculates a synthesis ratio for blending. The feature quantity is the entropy (average amount of information) of the edge amounts in a local region, and the intermediate synthesized image with the smallest entropy (the greatest continuity) is selected, or its weight is increased.

Description

Image processing apparatus, image processing method, program, and recording medium
The present invention relates to an image processing apparatus, an image processing method, a program, and a recording medium. More specifically, it relates to an image processing apparatus and an image processing method that create an image of a viewpoint that was not actually captured by applying signal processing to images captured at a plurality of different viewpoints, and to a program and a recording medium that realize the image processing function.
Stereoscopic televisions that simulate stereoscopic viewing by presenting different images to the left and right eyes (hereinafter referred to as 3D televisions) convey a strong sense of depth that conventional two-dimensional images cannot express, and thereby heighten the sense of presence. The left and right eyes of a human are located in different places, so when actually viewing an object each eye sees it from a slightly different angle; this difference between the left and right views (parallax) is believed to produce the sensation of depth. A 3D television exploits this human visual characteristic to realize stereoscopic vision by presenting images of different angles to the left and right eyes.
As a system different from the 3D television, there is the autostereoscopic multi-view display (hereinafter referred to as a multi-view display). In a multi-view display, a lenticular lens consisting of small semi-cylindrical lenses attached to the front of the display presents images at slightly different angles in multiple directions; when the display is viewed, images of two different angles enter the right eye and the left eye, enabling stereoscopic viewing. With this method, when the head is moved, the two images corresponding to the next position enter the right and left eyes, so a more natural stereoscopic view is obtained. The change in the angle at which an object is seen as the viewpoint position moves is called motion parallax, and together with binocular parallax it is an element necessary for natural stereoscopic vision.
However, it is difficult, for various reasons, to capture and transmit the videos of all the viewpoints handled by a multi-view display. In particular, the greater the number of viewpoints and the closer their spacing, the more difficult this becomes. Typical reasons are that there is a physical limit to the camera installation interval due to the size of the camera housing and of the image sensor itself, and that, even if installation were possible, transmitting all the videos of the many viewpoints would increase the transmission capacity in proportion to the number of viewpoints.
To solve these problems, methods have been proposed that introduce viewpoint synthesis technology to create the videos of many intermediate viewpoints from the videos of a few viewpoints. With viewpoint synthesis technology, videos between sparsely installed cameras can be generated by interpolation, so videos of dense viewpoints can easily be created. Regarding the number of viewpoints, the transmission capacity can also be reduced by, for example, transmitting videos of only a few viewpoints and creating the videos of the viewpoints in between on the receiving side.
Regarding viewpoint synthesis technology, there is a method for improving synthesis quality by using videos captured from a plurality of different viewpoints. Specifically, the video of the desired viewpoint is first generated intermediately for each viewpoint; then, pixel by pixel, the intermediate generation result that can be assumed to have the highest quality is selected, or the intermediate generation results are blended with weights according to their assumed quality, which makes it possible to raise the quality of the final synthesized video. There are several prior-art documents that describe how to calculate such intermediate synthesized videos from each viewpoint and how to select and blend them appropriately.
For example, Patent Document 1 discloses an example in which, when synthesizing the video of a viewpoint between left and right camera images, the method for calculating the pixel value of the synthesized viewpoint is switched according to three conditions. Points corresponding to the two neighboring pixels that straddle the pixel of the synthesized viewpoint in the horizontal direction are obtained in the left and right source images, and the switching is based on the difference in length between those two corresponding points. Specifically, when the length between the two corresponding points in the left image exceeds the length between the two corresponding points in the right image by more than a predetermined condition, the video shot by the left camera is used for synthesis. Conversely, when the length between the two corresponding points in the right image exceeds that in the left image by more than the predetermined condition, the video shot by the right camera is used. If neither of these two conditions is satisfied, the composition ratio is fixed according to the position of the viewpoint to be synthesized, and blending is performed based on that ratio.
Patent Document 2 discloses a method in which, when calculating the pixels of a virtual viewpoint video, a reliability is computed both for pixel values synthesized from different viewpoints and for pixel values synthesized from different times at the corresponding viewpoint, and the synthesis ratio is set higher for the pixel value with the higher reliability. In Patent Document 2, the reliability of an image synthesized from a viewpoint different from the desired viewpoint (scheme 1) and of an image synthesized from the same viewpoint but at a different time (scheme 2) is calculated, and viewpoint synthesis is performed so that the scheme with the higher reliability receives the larger synthesis ratio.
The determination compares the feature amount indicating the reliability of scheme 1 with that of scheme 2. The feature amount for scheme 1 is the value obtained by finding corresponding blocks in the left and right camera images and summing the pixel-value differences between them (average inter-parallax error). The feature amount for scheme 2 is the value obtained by finding corresponding blocks at the times before and after the time of interest and summing the pixel-value differences within those blocks (average temporal error). In both schemes the center of the block is the position of the pixel being processed for viewpoint synthesis.
Patent Document 1: Japanese Patent Laid-Open No. 8-201941. Patent Document 2: Japanese Patent Laid-Open No. 2009-3507 (JP 2009-3507 A).
In the viewpoint synthesis method described in Patent Document 1, the pixel selected as the final synthesis result is fixedly determined by the distance between the pixels in the left and right camera images that correspond to the two points sandwiching the target pixel. For example, when the length between the two corresponding points in the left camera image is longer than that in the right camera image, the desired pixel value is calculated using the left camera image; conversely, when the length in the right camera image is longer, the right camera image is used.
However, in the case of Patent Document 1, the synthesis is in any case performed by interpolation between the two points, so even if the length between the corresponding points is long, the sampling position for synthesis may not coincide with the position of the pixel to be obtained, which enlarges the conversion error. When many such conversion errors occur, there is the problem that the synthesis quality deteriorates.
The viewpoint synthesis method described in Patent Document 2 compares pixels synthesized from videos of different viewpoints with pixels synthesized from images of the same viewpoint but captured at different times, judges the one with the smaller average error within the corresponding block to be the more reliable, and synthesizes with it. In occlusion regions, where an area is invisible in one of the images from a different viewpoint or a different time, the error between blocks becomes large and the reliabilities differ clearly, so the synthesis scheme can be selected correctly; in non-occlusion regions, however, as pointed out for Patent Document 1, even if the reliability is high, the sampling position used for synthesis may not match the pixel position to be obtained.
This problem is common to Patent Document 1: in order to judge synthesis results obtained under different conditions, the decision is based on secondary, situational criteria that only indirectly affect the result (in Patent Document 1, the assumption that the wider span between corresponding pixels is better suited to synthesis; in Patent Document 2, the assumption that the smaller of the average error between corresponding blocks of different viewpoints and the error between corresponding blocks at different times is better suited to synthesis). The error that actually arises during the synthesis conversion is therefore not included in the judgment criteria.
In view of the above problems, an object of the present invention is to provide an image processing apparatus, an image processing method, a program, and a recording medium that do not judge by such situational criteria but use the once-synthesized result itself, judging by the continuity of the synthesized signal, so that the intermediate synthesized viewpoint videos obtained by synthesizing the target viewpoint under different conditions can be accurately and appropriately selected or appropriately weighted in units of pixels, improving the synthesis quality.
A first technical means for solving the above problem is an image processing apparatus that synthesizes a virtual viewpoint image located between a plurality of viewpoints using camera videos of the plurality of viewpoints, comprising: a virtual viewpoint synthesis unit that generates a plurality of intermediate virtual viewpoint images, each based on one of the camera videos of the plurality of viewpoints, using information indicating corresponding points between the images of the plurality of viewpoints; a continuity calculation unit that, for each of the intermediate virtual viewpoint images synthesized by the virtual viewpoint synthesis unit, calculates a feature amount indicating the local continuity of that virtual viewpoint image; a synthesis ratio calculation unit that calculates, based on the feature amounts calculated by the continuity calculation unit, a ratio for combining the plurality of intermediate virtual viewpoint images; and a synthesis unit that combines the plurality of intermediate virtual viewpoint images according to the ratio calculated by the synthesis ratio calculation unit to form the final virtual viewpoint image.
A second technical means is, in the first technical means, characterized in that the feature amount is the entropy whose events are the edge amounts within a mask centered on the processing target pixel of the intermediate virtual viewpoint image.
A third technical means is, in the first or second technical means, characterized in that the plurality of viewpoints are two or more viewpoints.
A fourth technical means is, in any one of the first to third technical means, characterized in that the information indicating the corresponding points is input from the outside.
A fifth technical means is, in any one of the first to third technical means, characterized in that the virtual viewpoint synthesis unit calculates information indicating the correspondence between the images of the plurality of viewpoints and, based on that information, generates the plurality of intermediate virtual viewpoint images, each based on one of the camera videos of the plurality of viewpoints, by interpolating mutually corresponding pixels.
A sixth technical means is an image processing method for synthesizing a virtual viewpoint image located between a plurality of viewpoints using camera videos of the plurality of viewpoints, comprising: a virtual viewpoint synthesis step of generating a plurality of intermediate virtual viewpoint images, each based on one of the camera videos of the plurality of viewpoints, using information indicating corresponding points between the images of the plurality of viewpoints; a continuity calculation step of calculating, for each of the intermediate virtual viewpoint images synthesized in the virtual viewpoint synthesis step, a feature amount indicating the local continuity of that virtual viewpoint image; a synthesis ratio calculation step of calculating, based on the feature amounts calculated in the continuity calculation step, a ratio for combining the plurality of intermediate virtual viewpoint images; and a synthesis step of combining the plurality of intermediate virtual viewpoint images according to the ratio calculated in the synthesis ratio calculation step to synthesize the final virtual viewpoint image.
 第7の技術手段は、第6の技術手段において、前記特徴量が、前記中間的な仮想視点画像の処理対象画素を中心とするマスクにおけるエッジ量を事象とする、エントロピーであることを特徴としたものである。 According to a seventh technical means, in the sixth technical means, the feature amount is entropy having an event of an edge amount in a mask centered on a processing target pixel of the intermediate virtual viewpoint image. It is a thing.
 第8の技術手段は、コンピュータに、複数の視点のカメラ映像から取得した複数の視点の画像間における対応点を示す情報を使用して、前記複数の視点のカメラ映像のそれぞれを基準とした中間的な前記仮想視点画像を複数生成する仮想視点合成ステップと、該仮想視点合成ステップで合成した前記中間的な仮想視点画像のそれぞれについて、各該仮想視点画像の局所的な定常性を示す特徴量を算出する定常性算出ステップと、該定常性算出ステップで算出した前記特徴量に基づいて、複数の前記中間的な仮想視点画像を合成する比率を算出する合成比率算出ステップと、該合成比率算出ステップで算出した比率に応じて前記複数の中間的な仮想視点画像を合成し、最終の仮想視点画像を合成する合成ステップと、を実行させるための画像処理プログラムである。 The eighth technical means uses the information indicating the corresponding points between the images of the plurality of viewpoints acquired from the camera images of the plurality of viewpoints to the computer, and uses the information indicating the corresponding points between the images of the plurality of viewpoints as an intermediate A feature value indicating local continuity of each virtual viewpoint image for each of the virtual viewpoint synthesis step for generating a plurality of virtual viewpoint images and the intermediate virtual viewpoint image synthesized in the virtual viewpoint synthesis step A continuity calculating step for calculating the ratio, a combining ratio calculating step for calculating a ratio for combining the plurality of intermediate virtual viewpoint images based on the feature amount calculated in the continuity calculating step, and the combining ratio calculation An image for executing the combining step of combining the plurality of intermediate virtual viewpoint images according to the ratio calculated in the step and combining the final virtual viewpoint image Is a management program.
According to a ninth technical means, in the eighth technical means, the feature amount is an entropy whose events are the edge amounts within a mask centered on the pixel to be processed in the intermediate virtual viewpoint image.
A tenth technical means is a computer-readable recording medium on which the program of the eighth or ninth technical means is recorded.
According to the present invention, by using the intermediate synthesis results themselves and judging them with the continuity of the synthesized signal as the criterion, the intermediate synthesized viewpoint images, each synthesized for the target viewpoint under different conditions, can be selected accurately and appropriately on a per-pixel basis, and the synthesis quality can therefore be improved.
In particular, regarding viewpoint synthesis technology that creates an image of a viewpoint at which no camera physically exists from images captured at a plurality of different viewpoints, the quality of the synthesized image can be improved by accurately and appropriately selecting and blending, pixel by pixel, the intermediate synthesized viewpoint images generated from the different viewpoints. Furthermore, by generating images of arbitrary viewpoints according to the present invention and displaying them on a stereoscopic display, a dense multi-view image can be generated in a pseudo manner even from images of only a few viewpoints, enabling high-quality multi-view stereoscopic viewing.
FIG. 1 is a block diagram corresponding to the first embodiment of the present invention.
FIG. 2 is an overview of a subject being photographed with a plurality of cameras.
FIG. 3 shows a two-dimensional arrangement, in the time direction and the viewpoint direction, of images extracted from videos of a plurality of viewpoints.
FIG. 4 is a diagram for explaining continuity.
FIG. 5 is a processing flowchart corresponding to the first embodiment.
FIG. 6 is a conceptual diagram of generating a video (image sequence) of a viewpoint between two real cameras.
FIG. 7 is a block diagram corresponding to the second embodiment of the present invention.
FIG. 8 shows the relationship between the two cameras of the second embodiment and the position at which viewpoint synthesis is performed.
FIG. 9 relates to a method of synthesizing an in-between viewpoint in the second embodiment.
FIG. 10 is a processing flowchart corresponding to the second embodiment.
(First embodiment)
<Configuration>
A first embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing an embodiment of the image processing apparatus of the present invention. As shown in FIG. 1, the image processing apparatus of the present invention comprises frame buffers 1, 2, 3, 4, 7 and 8, virtual viewpoint synthesis units 5 and 6, mask formation units 9 and 10, continuity feature amount calculation units 11 and 12, a synthesis ratio calculation unit 13, and a synthesis unit 14.
The frame buffers 1 and 2 temporarily hold frames (images) at a certain time extracted from videos captured at predetermined viewpoints. The frame buffer 3 stores corresponding point calculation information for the same viewpoint as the video held in the frame buffer 1; this information indicates corresponding points between images of different viewpoints. In this embodiment, the information indicating the corresponding points is input from the outside. The frame buffer 4 stores the corresponding point calculation information for the viewpoint of the frame buffer 2. The corresponding point calculation information is, for example, depth information representing the distance to the subject, and will be described in detail later.
The virtual viewpoint synthesis unit 5 (or 6) receives the viewpoint image at a specific time held in the frame buffer 1 (or 2) and the corresponding point calculation information at the same time held in the frame buffer 3 (or 4), and converts the input viewpoint image to a desired viewpoint using the input corresponding point calculation information. A method of converting a captured viewpoint image to a desired viewpoint using the corresponding point calculation information will be described later.
The image converted by the virtual viewpoint synthesis unit 5 (or 6) is temporarily stored in another frame buffer 7 (or 8), is divided into local blocks by the mask formation unit 9 (or 10), and parts of the image are output while the position is shifted pixel by pixel.
Next, the continuity feature amount calculation unit 11 (or 12) calculates a feature amount indicating continuity using the block image input from the mask formation unit 9 (or 10), and outputs the result to the synthesis ratio calculation unit 13. The synthesis ratio calculation unit 13 calculates, according to the continuity feature amounts input from the continuity feature amount calculation units 11 and 12, synthesis ratios for the intermediately generated synthesis results of the respective viewpoints, and outputs them to the synthesis unit 14. The synthesis unit 14 reads the converted images from the frame buffers 7 and 8 in accordance with the synthesis ratios input from the synthesis ratio calculation unit 13, multiplies the intermediate viewpoint-synthesized image of each viewpoint by its synthesis ratio, and generates and outputs the synthesized viewpoint image.
<Concept>
Next, the concept of the viewpoint synthesis processing of the present invention will be described with reference to FIGS. 2 to 4.
FIG. 2 shows a subject being photographed from a plurality of different positions. In FIG. 2, reference numerals 21, 22, 23, 24 and 25 denote the real cameras photographing the subject and their positional relationship. Reference numerals 26, 27, 28 and 29 between the cameras denote areas where a camera cannot physically be installed, for example because of the size of the camera housing, or gaps caused by placing the cameras sparsely. Hereinafter, a position at which a real camera exists is referred to as a real camera position.
Viewpoints i-2, i-1, i, i+1 and i+2 are defined so as to correspond to the real cameras 21, 22, 23, 24 and 25 in FIG. 2, and the videos captured by the respective cameras are denoted Vi-2, Vi-1, Vi, Vi+1 and Vi+2. Since the actual processing is performed on images extracted from these videos at a specific time, the following notation is introduced for ease of handling. The images (frames) at time t extracted from the videos captured at the respective viewpoints are denoted I(i-2,t), I(i-1,t), I(i,t), I(i+1,t) and I(i+2,t). Arranging the images by viewpoint and time for explanation gives the two-dimensional arrangement shown in FIG. 3. In FIG. 3, a horizontal row of images enclosed by a dotted line represents the image sequence of one viewpoint, that is, one video, and a vertical column of images (not marked by a dotted line in the figure) is the set of images of the different viewpoints at one time.
To explain viewpoint synthesis, the case where the desired virtual viewpoint position lies between the camera 23 and the camera 24 is taken as an example. The case of calculating a synthesized viewpoint at a different position can be processed in the same way as in the following description. Hereinafter, the desired viewpoint calculated by synthesis is referred to as the virtual viewpoint, and the synthesized video of that viewpoint is referred to as the synthesized video.
The basic part of the viewpoint synthesis handled in this embodiment can be realized using, for example, the technique described in Non-Patent Document 1. According to this method, the external parameters and internal parameters of the cameras that acquire the videos must be known, and distance information corresponding to each viewpoint (hereinafter referred to as depth information) is required to perform viewpoint synthesis. Here, the external parameters of a camera are a matrix representing the three-dimensional position and orientation of the camera, and the internal parameters are a matrix representing the focal length, the lens distortion, and the inclination of the projection plane.
The depth information used as the corresponding point calculation information indicating corresponding points is specified by, for example, the Moving Picture Experts Group (MPEG), a working group of the International Organization for Standardization / International Electrotechnical Commission (ISO/IEC), and expresses the depth in 256 levels, that is, as an 8-bit luminance value. As a result, the distance information is an 8-bit gray scale. Since a higher luminance value is assigned to a shorter distance, a subject nearer to the camera appears whiter and a subject farther away appears blacker. In addition, in order to decode this distance information into an actual distance, the distance corresponding to the largest (white) value and the distance corresponding to the smallest (black) value are specified separately, and the actual distance can be obtained by linearly mapping the depth values onto this distance range.
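For illustration only (this sketch is not part of the original disclosure), the linear decoding of an 8-bit depth value into a metric distance described above can be written as follows; the names z_near and z_far for the separately specified nearest (white) and farthest (black) distances are assumptions.

    import numpy as np

    def decode_depth(depth_8bit, z_near, z_far):
        """Map 8-bit depth values (255 = nearest/white, 0 = farthest/black)
        linearly onto the metric range [z_near, z_far]."""
        d = depth_8bit.astype(np.float64) / 255.0      # 1.0 = nearest, 0.0 = farthest
        return z_near * d + z_far * (1.0 - d)          # linear assignment between the two limits

    # Example: a mid-gray depth value of 128 with z_near = 1 m and z_far = 10 m
    print(decode_depth(np.array([128], dtype=np.uint8), 1.0, 10.0))  # about 5.48 m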
According to Non-Patent Document 1, the two real cameras nearest to the desired virtual viewpoint, one on each side, are selected first. When the virtual viewpoint lies between the camera 23 and the camera 24 as described above, the selected real cameras are the camera 23 (viewpoint i) and the camera 24 (viewpoint i+1). A virtual viewpoint video is created using these two camera videos. In practice, an image of one frame is extracted from each video, and a virtual viewpoint image is synthesized using these images. An intermediate virtual viewpoint image is created for each selected viewpoint (here, from the two viewpoint images), and finally, depending on which camera the virtual viewpoint camera position is closer to, either the image from the nearer camera is selected for the whole frame, or the images are blended according to a synthesis ratio determined by that position, to produce the virtual viewpoint image.
The method described in Non-Patent Document 1 for calculating an image of a desired viewpoint from camera images with known external and internal parameters and from depth information will now be described specifically. The viewpoint synthesis technique described in Non-Patent Document 1 uses a 3D warping technique. In 3D warping, the image acquired by a camera with known characteristics and its depth information determine, for each pixel of the image, a one-to-one corresponding position in three-dimensional space; by further projecting that point in three-dimensional space onto the projection plane of the virtual viewpoint video, the correspondence between a pixel of the real camera and the corresponding pixel of the virtual viewpoint can be obtained. Based on this correspondence, the texture (pixel value) of the corresponding pixel of the real camera is acquired and assigned to the corresponding pixel of the virtual viewpoint image, whereby a synthesized image can be created. The above is the basic concept of viewpoint synthesis.
To improve the synthesis quality, there is a method of creating an intermediate virtual viewpoint image for each of two or more different viewpoints and then synthesizing the final image by appropriately selecting among them or by determining synthesis ratios. In Non-Patent Document 1, the criterion for this selection, or for calculating the synthesis ratios, is determined deterministically by the relationship between the position of the virtual viewpoint and the real camera positions of the selected synthesis sources.
The present invention is characterized in that the final synthesis quality is improved by comparing the local signal continuity of the intermediately obtained virtual viewpoint images and by appropriately selecting the intermediate synthesis result with the higher continuity, or giving it a larger weight in the synthesis ratio, when creating the synthesized image. Local signal continuity is the degree to which, in a local region extracted from the intermediate virtual viewpoint image obtained for each viewpoint, the synthesized signal is concentrated on a certain characteristic signal. Being concentrated on a characteristic signal means, for example, that the edge amounts obtained by calculating the absolute differences from neighboring pixels are concentrated at a specific magnitude. If the synthesis quality is high, the local region is concentrated on a specific edge amount.
On the other hand, if the synthesis quality is low, conversion errors are added to the originally present specific edge amount because of the conversion noise introduced in the conversion process, and the resulting distribution of edge amounts is dispersed.
FIG. 4 is a diagram for explaining the relationship between continuity and the difference in the occurrence probability of edge amounts in a local region of an intermediately generated virtual viewpoint image. The horizontal axis indicates the edge amount, and the vertical axis indicates the occurrence probability of that edge amount in the local region. FIG. 4(A) shows a distribution that has a peak at a specific edge amount e and whose occurrence probabilities are concentrated around it. FIG. 4(B), like FIG. 4(A), has a peak at the edge amount e, but the concentration is low and the distribution is broad overall.
Compared with FIG. 4(B), FIG. 4(A) is concentrated on the specific edge amount e (its continuity is high), so the signal obtained by synthesis can be said to be more reliable. Therefore, selecting the virtual viewpoint image having the characteristics of FIG. 4(A) is more likely to improve the synthesis quality. By performing the selection, or the blending, based on this continuity judgment over the entire image while shifting the pixel position, an optimal synthesized image can be created.
<Processing content>
The method of generating the virtual viewpoint video of the present invention will be described specifically with reference to the block diagram (FIG. 1) and the flowchart (FIG. 5).
The real cameras photographing the subject are the cameras 21, 22, 23, 24 and 25, and an example of synthesizing a viewpoint between the real camera positions of the camera 23 and the camera 24 is described. First, in S1-1, the real cameras to be used for synthesizing the virtual viewpoint video are selected. For this selection, the two cameras nearest to the virtual viewpoint position to be synthesized, one on each side, are chosen. That is, when the position of the desired virtual viewpoint is Pv' and the position of each real camera is Pvi (i = -2, -1, 0, 1, 2), the two cameras satisfying the following relation are selected. Since the camera positions P are arranged one-dimensionally as shown in FIG. 2, it is assumed that the positions can be ordered by their magnitude.
[Equation (1) is presented as an image in the original publication: the pair of adjacent real cameras i, i+1 whose positions Pvi and Pvi+1 straddle the virtual viewpoint position Pv' is selected.]
According to the above assumption about the virtual viewpoint position, the virtual viewpoint lies between the camera 23 and the camera 24, so Pvi and Pvi+1 in equation (1) correspond to Pv0 and Pv1, respectively.
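A minimal sketch (not part of the original disclosure) of the camera selection of S1-1 / equation (1): given the one-dimensionally ordered real camera positions, the adjacent pair straddling the virtual viewpoint position is chosen.

    def select_camera_pair(camera_positions, virtual_position):
        """Return the indices (i, i+1) of the adjacent real cameras whose
        positions straddle the virtual viewpoint position.
        camera_positions must be sorted in increasing order."""
        for i in range(len(camera_positions) - 1):
            if camera_positions[i] <= virtual_position <= camera_positions[i + 1]:
                return i, i + 1
        raise ValueError("virtual viewpoint lies outside the camera array")

    # Cameras 21..25 at positions 0,1,2,3,4; a virtual viewpoint at 2.4 selects
    # the cameras at indices 2 and 3 (cameras 23 and 24).
    print(select_camera_pair([0.0, 1.0, 2.0, 3.0, 4.0], 2.4))  # (2, 3)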
Next, the images at the processing target time t are extracted from the videos captured by the selected real cameras (S1-2, S1-3); the images (frames) at time t of the real camera videos Vi and Vi+1 are I(i,t) and I(i+1,t). The extracted images are temporarily stored in the frame buffers 1 and 2. At the same time, the images (frames) D(i,t) and D(i+1,t) at the same time are also extracted from the depth information (distance information) of the corresponding viewpoints (S1-4, S1-5) and stored in the frame buffers 3 and 4.
The depth information can be acquired in various ways. Here, it is assumed that it is measured with a ranging device that irradiates an object with infrared light, measures the time until the light is reflected and returns, and thereby obtains the distance to the object. When the propagation speed of the infrared light is VIR and the time from emitting the infrared light until it returns to the ranging device is ttof, the distance d0 to the object can be calculated by the following equation. This processing is performed at the same resolution as the captured image to obtain a depth image (depth information).
[Equation (2) is presented as an image in the original publication: d0 = VIR · ttof / 2, the factor 1/2 accounting for ttof being a round-trip time.]
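As a small worked example of the time-of-flight relation above (illustrative only, not part of the original disclosure; the numbers are arbitrary), a round-trip time of 20 ns at the speed of light corresponds to a distance of about 3 m.

    V_IR = 2.998e8          # propagation speed of the infrared light [m/s] (speed of light)
    t_tof = 20e-9           # measured round-trip time [s]
    d0 = V_IR * t_tof / 2   # distance to the object [m]; divide by 2 for the round trip
    print(d0)               # about 3.0 m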
According to Non-Patent Document 1, a virtual viewpoint image can be created by the following equation. This processing is called 3D warping and corresponds to the processing S1-6 and S1-7 carried out in the virtual viewpoint synthesis units 5 and 6.
[Equation (3), the 3D warping relation, is presented as an image in the original publication: it relates the homogeneous real-camera coordinate c with depth d0 to the homogeneous virtual-viewpoint coordinate c' with depth d0' through A, R, t and A', R', t'.]
Here, d0 and d0' are the distance information at the real camera position and at the virtual viewpoint position, respectively. A, R and t represent the internal parameters of the real camera and, as part of its external parameters, the rotation of the camera and the three-dimensional position of the camera. A', R' and t' represent the internal parameters of the virtual viewpoint camera and, as part of its external parameters, its rotation and three-dimensional position. R^-1 and A^-1 denote the inverses of the corresponding matrices. Further, c and c' denote the image coordinates of the real camera and of the virtual camera in a homogeneous coordinate system in which one dimension is added to the usual two-dimensional coordinates. For example, a two-dimensional coordinate (x, y) is expressed in the homogeneous coordinate system as (x, y, 1), that is, by increasing the number of dimensions by one and substituting 1 into the added dimension.
Equation (3) gives the correspondence between the coordinates c of the real camera and the coordinates c' of the virtual viewpoint, and by extracting the pixel values of the real camera corresponding to all the pixels of the virtual viewpoint and pasting them, the image of the virtual viewpoint can be created. The generated virtual viewpoint images are temporarily stored in the frame buffers 7 and 8, one for each viewpoint.
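For illustration only (not part of the original disclosure), the following sketch shows one common form of per-pixel 3D warping consistent with the description above. The convention assumed here is that a world point X maps to camera coordinates as R·X + t and projects through the internal parameter matrix A; lens distortion is ignored.

    import numpy as np

    def warp_to_virtual_view(image, depth, A, R, t, A_v, R_v, t_v):
        """Forward-warp a real-camera image into a virtual view.
        image: (H, W, 3) uint8, depth: (H, W) metric distances (d0).
        A, R, t: intrinsics, rotation and translation of the real camera.
        A_v, R_v, t_v: the same for the virtual viewpoint camera."""
        H, W = depth.shape
        out = np.zeros_like(image)
        A_inv, R_inv = np.linalg.inv(A), np.linalg.inv(R)
        for y in range(H):
            for x in range(W):
                c = np.array([x, y, 1.0])            # homogeneous pixel coordinate
                p_cam = depth[y, x] * (A_inv @ c)    # point in real-camera coordinates
                X = R_inv @ (p_cam - t)              # back-project to world coordinates
                p_v = A_v @ (R_v @ X + t_v)          # project into the virtual camera
                if p_v[2] <= 0:
                    continue
                c_v = p_v / p_v[2]                   # c' = (x', y', 1) after dividing by d0'
                xv, yv = int(round(c_v[0])), int(round(c_v[1]))
                if 0 <= xv < W and 0 <= yv < H:
                    out[yv, xv] = image[y, x]        # paste the real-camera texture
        return out

In practice, z-buffering against d0' and filling of holes (virtual-view pixels that receive no value) are also needed; they are omitted here for brevity.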
By performing the above processing on the real camera images I(i,t) and I(i+1,t), two intermediate synthesized viewpoint images Ii(i',t) and Ii+1(i',t) are obtained as shown in FIG. 6. Here, 61 and 62 are the real cameras, 63 is the virtual viewpoint camera, and Ii and Ii+1 denote the virtual viewpoint images synthesized from the viewpoints i and i+1, respectively. The synthesized viewpoint is denoted i'.
The two generated intermediate virtual viewpoint images Ii(i',t) and Ii+1(i',t) are two-dimensional planes, and to indicate the positions of the x and y coordinates they are written as Ii(i',t,x,y) and Ii+1(i',t,x,y), respectively. In the mask formation units 9 and 10, a mask of size 7×7 centered on the processing target pixel (x, y) is formed as follows (S1-8, S1-9).
[Equation (4) is presented as an image in the original publication: the 7×7 mask of pixels centered on the processing target pixel (x, y).]
Next, the continuity feature amount calculation units 11 and 12 will be described. In the present invention, entropy (the average amount of information) as treated in information theory is applied to the judgment of continuity. First, the amount of information is a measure representing, when a plurality of events can occur, how unlikely a given event was when it occurs. The average value (expected value) of the information amounts of all events is called the entropy.
For example, comparing the event of the peak value e in FIG. 4(A) and FIG. 4(B), the occurrence probability of the event of the peak value e is higher in FIG. 4(A) than in FIG. 4(B); therefore, in the case of FIG. 4(A), even if information about the event of the peak value e is obtained, its information amount is not large, because the event can easily be predicted.
As for the average information amount over all events, the value of the average information amount (entropy) is smaller for the biased distribution of FIG. 4(A). In other words, the entropy becomes small when the occurrence probabilities are biased so that the occurring event can be estimated with high probability. Therefore, the smaller the entropy value, the higher the continuity of the obtained signal can be judged to be.
For each pixel in the mask obtained by equation (4), the absolute value of the difference from the neighboring pixel is obtained and used as an event for calculating the entropy. The occurrence frequency of each event can be calculated by the following equation.
[Equation (5) is presented as an image in the original publication: the occurrence frequency of each event, i.e. of each absolute difference between neighboring pixels, within the mask.]
The pixel values of the images handled here are generally composed of three values such as RGB or YCbCr values, but to simplify the explanation, gray scale values obtained by the following conversion are used.
[The gray scale conversion is presented as an image in the original publication: the conversion from the three-component pixel values to a single gray scale value.]
Further, the occurrence probability of each event can be obtained by dividing the occurrence frequency of equation (5) by the number of pixels in the mask, as in the following equation.
[The occurrence probability equation is presented as an image in the original publication: the occurrence frequency of each event divided by numM.]
Here, numM is a constant, the number of pixels in the mask of equation (4).
The entropy calculation (S1-10, S1-11) performed by the continuity feature amount calculation units 11 and 12 is carried out by the following equation.
[Equation (7) is presented as an image in the original publication: the entropy computed from the occurrence probabilities of the edge-amount events within the mask.]
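For illustration only (not part of the original disclosure), the following sketch follows the steps described above for one processing target pixel: a 7×7 mask is taken, the absolute differences to a horizontally adjacent pixel are used as the edge-amount events, their occurrence probabilities are estimated from the histogram over the mask, and the entropy is computed. The choice of the horizontal neighbor and the number of histogram bins are assumptions.

    import numpy as np

    def local_entropy(gray, x, y, half=3):
        """Entropy of the edge amounts (absolute differences to the right neighbor)
        inside the (2*half+1) x (2*half+1) mask centered on (x, y).
        gray: 2-D array of gray scale values."""
        mask = gray[y - half:y + half + 1, x - half:x + half + 2].astype(np.float64)
        edges = np.abs(np.diff(mask, axis=1)).ravel()            # one edge amount per mask pixel
        hist, _ = np.histogram(edges, bins=32, range=(0, 256))   # occurrence frequency of each event
        p = hist / edges.size                                    # occurrence probability (divide by numM)
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))                    # entropy (average information amount)

    # A flat region gives low entropy; a noisy region gives higher entropy.
    rng = np.random.default_rng(0)
    flat = np.full((32, 32), 128.0)
    noisy = rng.integers(0, 256, size=(32, 32)).astype(np.float64)
    print(local_entropy(flat, 16, 16), local_entropy(noisy, 16, 16))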
The above entropy is calculated for each intermediate virtual viewpoint image generated from each viewpoint. In FIG. 6, two real cameras are selected, so two entropy values are calculated for each pixel by equation (7).
When the obtained entropy values are Ei and Ei+1, the synthesis ratios can be determined by the following equation (S1-12). This processing is performed by the synthesis ratio calculation unit 13.
[Equation (8) is presented as an image in the original publication: the synthesis ratio, obtained by subtracting from 1.0 the proportion that the camera's entropy occupies in the sum of the entropies of the selected cameras.]
Since a smaller entropy value indicates higher continuity and a more reliable synthesis result, the synthesis ratio must be made larger. In equation (8), the second term calculates the proportion that the entropy of the given camera occupies in the entropies of the selected cameras. Because the synthesis ratio must be larger for a smaller entropy value, the second term is subtracted from 1.0 to obtain the synthesis ratio.
In the synthesis unit 14, the synthesis processing is finally realized by the following equation (S1-13).
[Equation (9) is presented as an image in the original publication: the final virtual viewpoint image, obtained as the per-pixel weighted sum of the intermediate virtual viewpoint images Ii and Ii+1 using the synthesis ratios of equation (8).]
By repeating the above processing until all the pixels have been processed (S1-14), the synthesized viewpoint image can be generated.
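For illustration only (not part of the original disclosure), the following sketch combines steps S1-12 and S1-13 for the two-viewpoint case: the synthesis ratios are derived from the two entropy maps as 1.0 minus each entropy's share of their sum, as described for equation (8), and the two intermediate virtual viewpoint images are blended pixel by pixel with those ratios.

    import numpy as np

    def blend_by_entropy(img_i, img_i1, ent_i, ent_i1, eps=1e-12):
        """Per-pixel blend of two intermediate virtual viewpoint images.
        img_i, img_i1: (H, W, 3) float arrays warped from viewpoints i and i+1.
        ent_i, ent_i1: (H, W) per-pixel entropy maps of the two images."""
        total = ent_i + ent_i1 + eps
        w_i = 1.0 - ent_i / total       # smaller entropy (higher continuity) -> larger ratio
        w_i1 = 1.0 - ent_i1 / total     # the two ratios sum to 1.0
        return img_i * w_i[..., None] + img_i1 * w_i1[..., None]

    # Toy example: where ent_i is much smaller than ent_i1, the result follows img_i.
    img_i = np.full((2, 2, 3), 100.0)
    img_i1 = np.full((2, 2, 3), 200.0)
    ent_i = np.array([[0.1, 2.0], [0.1, 2.0]])
    ent_i1 = np.array([[2.0, 0.1], [2.0, 0.1]])
    print(blend_by_entropy(img_i, img_i1, ent_i, ent_i1)[..., 0])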
In this embodiment, an example has been shown in which the synthesis unit 14 synthesizes, with appropriate per-pixel weighting, the synthesized viewpoint images obtained intermediately from two different viewpoints; however, the calculation formulas used in the respective processing units also support three or more viewpoints. Therefore, by adding, for each additional viewpoint, the configuration in FIG. 1 corresponding to one viewpoint, namely the frame buffer 1, the frame buffer 3, the virtual viewpoint synthesis unit 5, the frame buffer 7, the mask formation unit 9 and the continuity feature amount calculation unit 11, the number of real cameras used for the synthesis can be increased. By increasing the number of viewpoints, information about the subject from more directions can be used appropriately, and the synthesis quality can be improved further.
Further, although an example has been shown in which the synthesis ratios are calculated from the entropy values by equation (8) and the result is applied to equation (9) for blending, it is also possible to generate the synthesized image by selection rather than by blending, by setting only the synthesis ratio of the camera with the minimum entropy to 1.0 and setting the others to 0.
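The selection-based variant mentioned above can be sketched in the same setting (illustrative only, not part of the original disclosure): instead of blending, each pixel is taken entirely from the intermediate image whose entropy is the minimum.

    import numpy as np

    def select_by_entropy(img_i, img_i1, ent_i, ent_i1):
        """Hard selection: take each pixel from the image with the smaller entropy,
        i.e. a synthesis ratio of 1.0 for the minimum-entropy camera and 0 for the other."""
        take_i = (ent_i <= ent_i1)[..., None]
        return np.where(take_i, img_i, img_i1)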
(Second embodiment)
FIG. 7 is a block diagram showing a second embodiment of the present invention. Blocks common to the first embodiment are given the same reference numbers, and only the correspondence is shown.
The difference between the first embodiment and the second embodiment is whether the information indicating the per-pixel correspondence between the images of different viewpoints is input from the outside or is created internally. Therefore, in the second embodiment, the frame buffers 3 and 4 that store the corresponding viewpoint information in the first embodiment do not exist. The block added in the second embodiment is the disparity vector calculation unit 71.
The virtual viewpoint synthesis units 72 and 73 differ in processing from the virtual viewpoint synthesis units 5 and 6 because the contents of the corresponding viewpoint information they receive are different, and their reference numbers are therefore changed from those of the first embodiment (FIG. 1). The parts different from the first embodiment are described below together with the flowchart of FIG. 10. The images at the processing target time t are extracted from the videos captured by the selected real cameras (S2-2, S2-3), and disparity vectors are calculated from these images (S2-4).
As a method of internally generating the corresponding viewpoint information and creating the synthesized viewpoint, the viewpoint synthesis method described in Patent Document 1 can be used. According to this method, the correspondence between the images is computed by the disparity vector calculation unit 71, and can be obtained by calculating the disparity amount p that minimizes the following expression E(p).
[Equation (10) is presented as an image in the original publication: the matching cost E(p), accumulated over the local mask W, between the image of viewpoint i and the image of viewpoint i+1 displaced by the disparity p.]
Here, the images are the same as those described in the first embodiment, and the two viewpoints i and i+1 nearest to the viewpoint to be generated by viewpoint synthesis are used.
W denotes the local mask used for matching, for example a mask of size 7×7. By performing the above processing for all pixels, the correspondence of all pixels can be obtained.
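For illustration only (not part of the original disclosure), the following sketch shows a simple horizontal block matching of the kind performed by the disparity vector calculation unit 71: for each pixel, the disparity p minimizing a matching cost E(p) over a 7×7 mask W is searched. The use of squared differences, the search range and the search direction are assumptions, since the exact form of E(p) is given only as an equation image in the original.

    import numpy as np

    def disparity_map(gray_i, gray_i1, max_disp=32, half=3):
        """For each pixel of viewpoint i, find the horizontal disparity p that
        minimizes E(p), here a sum of squared differences over the
        (2*half+1) x (2*half+1) mask W.
        gray_i, gray_i1: 2-D float arrays of the two viewpoint images."""
        H, W_ = gray_i.shape
        disp = np.zeros((H, W_), dtype=np.int32)
        for y in range(half, H - half):
            for x in range(half, W_ - half):
                block = gray_i[y - half:y + half + 1, x - half:x + half + 1]
                best_p, best_e = 0, np.inf
                for p in range(0, min(max_disp, x - half) + 1):
                    cand = gray_i1[y - half:y + half + 1, x - p - half:x - p + half + 1]
                    e = float(np.sum((block - cand) ** 2))   # E(p) for this candidate disparity
                    if e < best_e:
                        best_e, best_p = e, p
                disp[y, x] = best_p
        return disp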
Next, a method of synthesizing an intermediate image using the two images for which the correspondence has been obtained will be described with reference to FIG. 8. As shown in FIG. 8, when the distance between the real cameras 81 and 82 is L, the virtual viewpoint camera 83 is located at a distance DL from the real camera 81 and a distance DR from the real camera 82. To obtain the per-pixel correspondence, it is assumed that the in-between point also lies on the horizontal line segment connecting the two corresponding points. Assuming that the correspondence shown in FIG. 9 is obtained, a pixel of the in-between viewpoint can be obtained from the following equation with the camera 81 (viewpoint i) as the reference (S2-5). This processing is performed by the virtual viewpoint synthesis unit 72.
[Equation (11) is presented as an image in the original publication: the pixel of the in-between viewpoint computed with the camera 81 (viewpoint i) as the reference, using the per-pixel correspondence and the positional relationship DL, DR and L.]
Similarly, with the camera 82 (viewpoint i+1) as the reference, the pixel of the in-between viewpoint can be obtained from the following equation (S2-6). This processing is performed by the virtual viewpoint synthesis unit 73.
[Equation (12) is presented as an image in the original publication: the pixel of the in-between viewpoint computed with the camera 82 (viewpoint i+1) as the reference.]
In the method described in Patent Document 1, the virtual viewpoint image is obtained by adaptively switching between equation (11) and equation (12) according to the distance between the two corresponding points; in the present invention, however, the intermediate synthesis results of both cameras are calculated once and are then combined using the local signal continuity.
The processing after calculating the plurality of intermediate synthesis results (continuity calculation, synthesis ratio calculation and blending) is the same as in the first embodiment.
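For illustration only (not part of the original disclosure), the following sketch shows one common way of realizing an interpolation in the spirit of equations (11) and (12) for a purely horizontal camera arrangement: with camera 81 (viewpoint i) as the reference, a pixel with disparity p is shifted by the fraction DL/L of p toward the virtual position; with camera 82 (viewpoint i+1) as the reference, the shift fraction is DR/L in the opposite direction. Since equations (11) and (12) are given only as images in the original, this formulation is an assumption; in the present invention the two intermediate results would then be combined per pixel by the continuity-based ratios, as in the first embodiment.

    import numpy as np

    def interpolate_from_reference(gray_ref, disp_ref, shift_fraction):
        """Create an in-between view from one reference image.
        gray_ref: 2-D float image of the reference camera.
        disp_ref: per-pixel horizontal disparity toward the other camera.
        shift_fraction: signed fraction of the disparity by which each reference
        pixel is moved (e.g. DL/L for viewpoint i, -DR/L for viewpoint i+1,
        depending on the disparity sign convention)."""
        H, W = gray_ref.shape
        out = np.zeros_like(gray_ref)
        for y in range(H):
            for x in range(W):
                xv = int(round(x + shift_fraction * disp_ref[y, x]))  # position in the in-between view
                if 0 <= xv < W:
                    out[y, xv] = gray_ref[y, x]
        return out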
(Third embodiment) Program
The present invention can also be recorded as software processing on a computer-readable recording medium storing a program to be executed by a computer, the program implementing the method shown in the first or second embodiment: videos of a plurality of different viewpoints are input, an intermediate synthesized viewpoint video is created for each of the different viewpoints, the synthesis ratios are calculated based on the local continuity of the obtained intermediate synthesized videos, and the virtual viewpoint video is generated by synthesizing them.
As a result, the synthesis quality of the virtual viewpoint image can be improved. Since the processing is performed by a microcomputer, the recording medium may be a memory (not shown), for example a program medium such as a ROM, or it may be a program medium that becomes readable when the recording medium is inserted into a program reading device provided as an external storage device (not shown). In either case, the stored program may be configured to be accessed and executed directly by the microprocessor, or the program may be read out, downloaded into a program storage area (not shown) of the microcomputer, and then executed. In the latter case, the program for downloading is assumed to be stored in the main device in advance.
Here, the above program medium is a recording medium configured to be separable from the main body, and may be a medium that carries the program in a fixed manner, including tape media such as magnetic tapes and cassette tapes; disk media such as magnetic disks including floppy disks (registered trademark) and hard disks, and optical discs such as CD-ROM/MO/MD/DVD; card media such as IC cards (including memory cards) and optical cards; and semiconductor memories such as mask ROM, EPROM (Erasable Programmable Read Only Memory), EEPROM (Electrically Erasable Programmable Read Only Memory) and flash ROM.
Further, in this case, since the system configuration allows connection to a communication network including the Internet, the medium may also be one that carries the program in a fluid manner so that the program is downloaded from the communication network. When the program is downloaded from the communication network in this way, the program for downloading may be stored in the main device in advance or may be installed from another recording medium. The above-described image processing method is executed when the recording medium is read by a program reading device provided in a digital color image forming apparatus or a computer system. The computer system comprises a general-purpose image input device such as a web camera, a computer that performs various kinds of processing, including the above image processing method, by loading a predetermined program, and an image display device, such as a display or a liquid crystal display, that displays the processing results of the computer. Further, a network card, a modem or the like is provided as communication means for connecting to a server or the like via a network.
1, 2, 3, 4, 7, 8: frame buffer; 5, 6: virtual viewpoint synthesis unit; 9, 10: mask formation unit; 11, 12: continuity feature amount calculation unit; 13: synthesis ratio calculation unit; 14: synthesis unit; 23, 24: camera; 71: disparity vector calculation unit; 72, 73: virtual viewpoint synthesis unit; 81, 82: real camera; 83: virtual viewpoint camera.

Claims (10)

1. An image processing apparatus that synthesizes a virtual viewpoint image positioned between a plurality of viewpoints using camera images of the plurality of viewpoints, comprising:
a virtual viewpoint synthesis unit that generates, using information indicating corresponding points between the images of the plurality of viewpoints, a plurality of intermediate virtual viewpoint images each referenced to one of the camera images of the plurality of viewpoints;
a continuity calculation unit that calculates, for each of the intermediate virtual viewpoint images synthesized by the virtual viewpoint synthesis unit, a feature amount indicating the local continuity of that virtual viewpoint image;
a synthesis ratio calculation unit that calculates, based on the feature amounts calculated by the continuity calculation unit, ratios for combining the plurality of intermediate virtual viewpoint images; and
a synthesis unit that combines the plurality of intermediate virtual viewpoint images according to the ratios calculated by the synthesis ratio calculation unit to synthesize a final virtual viewpoint image.
2. The image processing apparatus according to claim 1, wherein the feature amount is an entropy whose events are the edge amounts within a mask centered on the pixel to be processed in the intermediate virtual viewpoint image.
3. The image processing apparatus according to claim 1 or 2, wherein the plurality of viewpoints are two or more viewpoints.
4. The image processing apparatus according to any one of claims 1 to 3, wherein the information indicating the corresponding points is input from the outside.
5. The image processing apparatus according to any one of claims 1 to 3, wherein the virtual viewpoint synthesis unit calculates information indicating correspondences between the images of the plurality of viewpoints and, based on that information, generates the plurality of intermediate virtual viewpoint images, each referenced to one of the camera images of the plurality of viewpoints, by interpolating mutually corresponding pixels.
6. An image processing method for synthesizing a virtual viewpoint image positioned between a plurality of viewpoints using camera images of the plurality of viewpoints, comprising:
a virtual viewpoint synthesis step of generating, using information indicating corresponding points between the images of the plurality of viewpoints, a plurality of intermediate virtual viewpoint images each referenced to one of the camera images of the plurality of viewpoints;
a continuity calculation step of calculating, for each of the intermediate virtual viewpoint images synthesized in the virtual viewpoint synthesis step, a feature amount indicating the local continuity of that virtual viewpoint image;
a synthesis ratio calculation step of calculating, based on the feature amounts calculated in the continuity calculation step, ratios for combining the plurality of intermediate virtual viewpoint images; and
a synthesis step of combining the plurality of intermediate virtual viewpoint images according to the ratios calculated in the synthesis ratio calculation step to synthesize a final virtual viewpoint image.
7. The image processing method according to claim 6, wherein the feature amount is an entropy whose events are the edge amounts within a mask centered on the pixel to be processed in the intermediate virtual viewpoint image.
8. An image processing program for causing a computer to execute:
a virtual viewpoint synthesis step of generating, using information indicating corresponding points between images of a plurality of viewpoints acquired from camera images of the plurality of viewpoints, a plurality of intermediate virtual viewpoint images each referenced to one of the camera images of the plurality of viewpoints;
a continuity calculation step of calculating, for each of the intermediate virtual viewpoint images synthesized in the virtual viewpoint synthesis step, a feature amount indicating the local continuity of that virtual viewpoint image;
a synthesis ratio calculation step of calculating, based on the feature amounts calculated in the continuity calculation step, ratios for combining the plurality of intermediate virtual viewpoint images; and
a synthesis step of combining the plurality of intermediate virtual viewpoint images according to the ratios calculated in the synthesis ratio calculation step to synthesize a final virtual viewpoint image.
9. The image processing program according to claim 8, wherein the feature amount is an entropy whose events are the edge amounts within a mask centered on the pixel to be processed in the intermediate virtual viewpoint image.
10. A computer-readable recording medium on which the program according to claim 8 or 9 is recorded.
PCT/JP2011/065003 2010-09-28 2011-06-30 Image processing device, image processing method, program, and recording medium WO2012042998A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2010-216385 2010-09-28
JP2010216385A JP4939639B2 (en) 2010-09-28 2010-09-28 Image processing apparatus, image processing method, program, and recording medium

Publications (1)

Publication Number Publication Date
WO2012042998A1 true WO2012042998A1 (en) 2012-04-05

Family

ID=45892478

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2011/065003 WO2012042998A1 (en) 2010-09-28 2011-06-30 Image processing device, image processing method, program, and recording medium

Country Status (2)

Country Link
JP (1) JP4939639B2 (en)
WO (1) WO2012042998A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2954675A1 (en) * 2013-02-06 2015-12-16 Koninklijke Philips N.V. System for generating intermediate view images
EP2954674B1 (en) 2013-02-06 2017-03-08 Koninklijke Philips N.V. System for generating an intermediate view image
CN110476415A (en) * 2017-04-05 2019-11-19 夏普株式会社 Image data generating means, picture reproducer, image data generation method, control program and recording medium
CN113112613A (en) * 2021-04-22 2021-07-13 北京房江湖科技有限公司 Model display method and device, electronic equipment and storage medium
JP7469084B2 (en) 2019-03-19 2024-04-16 株式会社ソニー・インタラクティブエンタテインメント Method and system for generating an image - Patents.com

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5764097B2 (en) * 2012-07-03 2015-08-12 日本電信電話株式会社 Image processing apparatus, image processing method, and image processing program
WO2014083752A1 (en) 2012-11-30 2014-06-05 パナソニック株式会社 Alternate viewpoint image generating device and alternate viewpoint image generating method
WO2014192487A1 (en) 2013-05-29 2014-12-04 日本電気株式会社 Multi-eye imaging system, acquired image synthesis method, and program
US10757399B2 (en) 2015-09-10 2020-08-25 Google Llc Stereo rendering system
US20210183096A1 (en) * 2016-03-15 2021-06-17 Sony Corporation Image processing apparatus, imaging apparatus, image processing method and program
JP6734253B2 (en) 2017-12-20 2020-08-05 ファナック株式会社 Imaging device including a visual sensor for imaging a workpiece
JP7332326B2 (en) * 2019-04-18 2023-08-23 日本放送協会 Video effect device and program
CN110675404B (en) * 2019-09-03 2023-03-21 RealMe重庆移动通信有限公司 Image processing method, image processing apparatus, storage medium, and terminal device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08201941A (en) * 1995-01-12 1996-08-09 Texas Instr Inc <Ti> Three-dimensional image formation
JP2001175863A (en) * 1999-12-21 2001-06-29 Nippon Hoso Kyokai <Nhk> Method and device for multi-viewpoint image interpolation
JP2005149127A (en) * 2003-11-14 2005-06-09 Sony Corp Imaging display device and method, and image sending and receiving system
JP2006039770A (en) * 2004-07-23 2006-02-09 Sony Corp Image processing apparatus, method, and program
JP2006285415A (en) * 2005-03-31 2006-10-19 Sony Corp Image processing method, device and program
JP2009211335A (en) * 2008-03-04 2009-09-17 Nippon Telegr & Teleph Corp <Ntt> Virtual viewpoint image generation method, virtual viewpoint image generation apparatus, virtual viewpoint image generation program, and recording medium from which same recorded program can be read by computer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHUN'ICHI KANEKO ET AL.: "Introducing Structurization of Edge Segment Groups Based on Prominency Entropy to Binocular Depth Calculation", THE TRANSACTIONS OF THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, vol. J75-D-II, no. 10, 25 October 1992 (1992-10-25), pages 1649 - 1659 *
YUTAKA KUNITA ET AL.: "Real-Time Rendering System of 3D Images Using Layered Probability Maps", THE JOURNAL OF THE INSTITUTE OF IMAGE INFORMATION AND TELEVISION ENGINEERS, vol. 60, no. 7, 1 July 2006 (2006-07-01), pages 1102 - 1110 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2954675A1 (en) * 2013-02-06 2015-12-16 Koninklijke Philips N.V. System for generating intermediate view images
EP2954674B1 (en) 2013-02-06 2017-03-08 Koninklijke Philips N.V. System for generating an intermediate view image
US9967537B2 (en) 2013-02-06 2018-05-08 Koninklijke Philips N.V. System for generating intermediate view images
CN110476415A (en) * 2017-04-05 2019-11-19 夏普株式会社 Image data generating means, picture reproducer, image data generation method, control program and recording medium
JP7469084B2 (en) 2019-03-19 2024-04-16 株式会社ソニー・インタラクティブエンタテインメント Method and system for generating an image - Patents.com
CN113112613A (en) * 2021-04-22 2021-07-13 北京房江湖科技有限公司 Model display method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
JP4939639B2 (en) 2012-05-30
JP2012073702A (en) 2012-04-12

Similar Documents

Publication Publication Date Title
JP4939639B2 (en) Image processing apparatus, image processing method, program, and recording medium
KR100950046B1 (en) Apparatus of multiview three-dimensional image synthesis for autostereoscopic 3d-tv displays and method thereof
US10070115B2 (en) Methods for full parallax compressed light field synthesis utilizing depth information
JP5238429B2 (en) Stereoscopic image capturing apparatus and stereoscopic image capturing system
JP6021541B2 (en) Image processing apparatus and method
JP5224124B2 (en) Imaging device
JP6027034B2 (en) 3D image error improving method and apparatus
ES2676055T3 (en) Effective image receiver for multiple views
JP4942221B2 (en) High resolution virtual focal plane image generation method
US7126598B2 (en) 3D image synthesis from depth encoded source view
EP2532166B1 (en) Method, apparatus and computer program for selecting a stereoscopic imaging viewpoint pair
JP4942106B2 (en) Depth data output device and depth data receiver
US20100134599A1 (en) Arrangement and method for the recording and display of images of a scene and/or an object
KR20110124473A (en) 3-dimensional image generation apparatus and method for multi-view image
JP2008257686A (en) Method and system for processing 3d scene light field
JP2013527646A5 (en)
JP6195076B2 (en) Different viewpoint image generation apparatus and different viewpoint image generation method
JP2010226500A (en) Device and method for displaying stereoscopic image
Berretty et al. Real-time rendering for multiview autostereoscopic displays
JP6128748B2 (en) Image processing apparatus and method
JP2010181826A (en) Three-dimensional image forming apparatus
WO2018127629A1 (en) Method and apparatus for video depth map coding and decoding
WO2011129164A1 (en) Multi-viewpoint image coding device
JP7437941B2 (en) Stereoscopic image generation device and its program
WO2019008233A1 (en) A method and apparatus for encoding media content

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11828559

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11828559

Country of ref document: EP

Kind code of ref document: A1