WO2022118442A1 - Parallax learning device, merge data generation device, parallax learning method, merge data generation method, and parallax learning program - Google Patents

Parallax learning device, merge data generation device, parallax learning method, merge data generation method, and parallax learning program Download PDF

Info

Publication number
WO2022118442A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
learning
parallax
label
patch
Prior art date
Application number
PCT/JP2020/045106
Other languages
French (fr)
Japanese (ja)
Inventor
智彦 長田
慎吾 安藤
潤 島村
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2020/045106 priority Critical patent/WO2022118442A1/en
Publication of WO2022118442A1 publication Critical patent/WO2022118442A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/593 Depth or shape recovery from multiple images from stereo images

Definitions

  • the disclosed techniques relate to a parallax learning device, a fusion data generator, a parallax learning method, a fusion data generation method, and a parallax learning program.
  • Conventionally, there is a stereo matching technique that estimates depth after obtaining the parallax amount between two images acquired using a stereo camera as a sensor (see Non-Patent Document 1 and Non-Patent Document 2).
  • Parallax generally refers to the difference in how an image appears that is caused by the difference between the viewpoints of two sensors.
  • the parallax estimation method refers to a method of estimating the amount of deviation between images as the parallax amount.
  • Among stereo-matching-based parallax estimation methods, a method using deep learning has been proposed in recent years.
  • In the conventional technique, the purpose is generally to estimate the distance to an object by depth estimation after obtaining the parallax amount.
  • However, the prior art has been premised on estimating the amount of parallax between images acquired by two sensors that acquire channels of the same type. Since images acquired by sensors that acquire different types of channels have different characteristics, there is a problem that it is difficult to estimate the amount of parallax between them.
  • The disclosed technique was made in view of the above points, and its object is to provide a parallax learning device, a fusion data generation device, a parallax learning method, a fusion data generation method, and a parallax learning program that enable the parallax amount to be accurately estimated and utilized even between images acquired by different sensors.
  • The first aspect of the present disclosure is a parallax learning device including: a filter unit that outputs a filtered label obtained by applying a filter to a correct label indicating the relationship between the amount of horizontal positional deviation between a first image for learning and a second image for learning and the correct answer for each position; and a similarity learning unit that learns the parameters of a model for outputting the similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with the coordinate information of the correct label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting in the horizontal direction with the coordinate information as a reference, and the filtered label.
  • The second aspect of the present disclosure is a parallax learning method in which a computer executes processing to: output a filtered label obtained by applying a filter to a correct label indicating the relationship between the amount of horizontal positional deviation between a first image for learning and a second image for learning and the correct answer for each position; and learn the parameters of a model for outputting the similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with the coordinate information of the correct label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting in the horizontal direction with the coordinate information as a reference, and the filtered label.
  • The third aspect of the present disclosure is a parallax learning program that causes a computer to: output a filtered label obtained by applying a filter to a correct label indicating the relationship between the amount of horizontal positional deviation between a first image for learning and a second image for learning and the correct answer for each position; and learn the parameters of a model for outputting the similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with the coordinate information of the correct label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting in the horizontal direction with the coordinate information as a reference, and the filtered label.
  • FIG. 1 is a schematic diagram showing that fusion data is generated from a visible image and an infrared image in which parallax occurs.
  • the parallax estimation method can also be used to link the corresponding points of two images to generate fusion data.
  • the fusion data refers to the data obtained by fusing the channels of the images acquired by the two sensors.
  • The task assumed in the present disclosure is to generate fusion data in which corresponding points are matched from two images: a visible image (RGB) taken by a visible light camera and an infrared image taken by an infrared camera.
  • the infrared image is an image in which temperature data is recorded for each pixel.
  • By using fusion data, it is possible to learn with richer information in the field of object recognition by machine learning or the field of semantic segmentation, and an improvement in performance can be expected.
  • As an example of a specific use case, a scene in which inspection is automated by deep learning is assumed in equipment-inspection applications where infrared images are used. By utilizing this fusion data, learning can draw on rich information, so a performance improvement can be expected.
  • However, as shown in FIG. 1, since parallax occurs between the visible image and the infrared image, how to estimate the parallax becomes a problem.
  • Visible images and infrared images have different properties. Visible images are used for object recognition because they have abundant appearance information such as texture. The infrared image, on the other hand, has a lower resolution than the visible image but can acquire temperature information of the object, and is used in the security field or for equipment inspection and the like. Further, although devices capable of simultaneously acquiring a visible image and an infrared image are commercially available, the positions of the visible light camera and the infrared camera differ, so parallax occurs between the two images. Because of the parallax, the position where the object is projected onto the image sensor shifts, so a shift occurs in the image data. In the following, the magnitude of this parallax shift in the image data is taken as the parallax amount to be estimated.
  • FIG. 2 is a diagram showing an example of searching for corresponding points in parallel stereo. The image taken by the left camera of the stereo camera is taken as the left image, and the image taken by the right camera of the stereo camera is taken as the right image.
  • The arrangement of the stereo camera is not limited to left and right, but in the following examples, the left image used as the reference is the first image 51 and the other, right image is the second image 52. As shown in FIG. 2, a corresponding point is searched for on the epipolar line 53 of the second image 52 with respect to the first image 51.
  • FIG. 3 is a diagram showing an example of a case where a patch image is created and a search is performed on an epipolar line. An N ⁇ N patch centered on the pixels of the first image is created, and this is used as a reference patch image (first patch image 54A). The central pixel of the reference first patch image is represented by (u, v).
  • Then, while shifting an N × N patch (second patch image 54B) along the epipolar line of the second image, each candidate patch is compared with the reference patch image, and the position where the two patch images overlap is found.
  • The deviation amount x from the reference position at that point is obtained as the parallax amount.
  • the overlapping position is the position where the same part of the object is projected. Therefore, the pixel of the second patch image corresponding to the central pixel (u, v) of the reference first patch image is obtained as (u, v + x).
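  • As a concrete illustration of this search, the following sketch implements the basic patch comparison along an epipolar line using a sum-of-squared-differences (SSD) cost. The function name, the SSD cost, and the bounds handling are illustrative assumptions; the present disclosure replaces such a hand-crafted cost with a learned similarity, as described later. Following the notation above, the patch center is (u, v) and the shift x is applied along the v (epipolar) direction.

```python
import numpy as np

def estimate_parallax_ssd(first_image, second_image, u, v, patch=11, d_min=0, d_max=64):
    """Search along the epipolar line of the second image for the patch that best
    matches the first-image patch centered at (u, v); the deviation x with the
    lowest SSD cost is returned as the parallax amount, so that the corresponding
    pixel is (u, v + x) as in the description above."""
    r = patch // 2
    ref = first_image[u - r:u + r + 1, v - r:v + r + 1].astype(np.float32)
    best_x, best_cost = d_min, np.inf
    for x in range(d_min, d_max + 1):
        cand = second_image[u - r:u + r + 1, v + x - r:v + x + r + 1].astype(np.float32)
        if cand.shape != ref.shape:          # candidate window would leave the image
            continue
        cost = np.sum((ref - cand) ** 2)     # SSD matching cost
        if cost < best_cost:
            best_cost, best_x = cost, x
    return best_x
```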
  • the visible image is 3-channel data obtained by converting visible light into RGB data
  • the infrared image is 1-channel data having temperature information in pixel units.
  • an N ⁇ N patch image is cut out centering on a certain pixel of one image.
  • a plurality of N ⁇ N patch images are cut out from the other image centering on a plurality of pixels, and the coordinates having the highest similarity of the pixel groups in the patch image are calculated to obtain the parallax amount.
  • FIG. 4 is a diagram showing an example of the relationship between the value of the correct parallax label and the amount of deviation. In the graph of FIG. 4, the vertical axis represents the value of the correct label, and the horizontal axis represents the deviation amount x at coordinates on the epipolar line 53. The correct label is represented by a two-dimensional graph of these values.
  • the deviation amount x corresponds to the horizontal position between the first image for learning and the second image for learning.
  • The correct label is represented by a graph showing the relationship between the correct or incorrect value and the position of the deviation amount x. Assuming that the amount of deviation on the epipolar line is x and the amount of parallax is d, the correct label f(x) is given by the following equation (1):

    f(x) = 1 (x = d), f(x) = 0 (otherwise)   ... (1)
  • If the correct label is denoted f(x), it takes the values f(x) ∈ {0, 1}. The deviation amount x at the position where the images overlap is the correct deviation amount, which equals the correct parallax amount d. Therefore, the value of the correct label is f(x) = 1 when x = d, and f(x) = 0 at positions of deviation x where the images do not overlap. As described above, the correct label f(x) is represented as a graph showing the relationship between the deviation amount x and the value of the correct label.
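  • For reference, the binary correct label of equation (1) can be written out over the search range as in the following sketch; the array representation and the range bounds d_min and d_max are assumptions made for illustration.

```python
import numpy as np

def binary_correct_label(d, d_min=0, d_max=64):
    """Equation (1): f(x) = 1 where the deviation x equals the correct parallax d,
    and f(x) = 0 elsewhere, evaluated over the search range d_min..d_max."""
    xs = np.arange(d_min, d_max + 1)
    return xs, (xs == d).astype(np.float32)
```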
  • the filtered label obtained by filtering the correct label in the above graph is used as the input for deep learning to deal with noisy data.
  • Applying a filter to the correct label means applying a filter to the values of the correct data plotted on the graph.
  • By applying the filter, the correct-answer value that was expressed as a binary value comes to be expressed as a distribution.
  • Here, an example is shown in which a LoG (Laplacian of Gaussian) filter is applied, that is, a Gaussian filter is applied for smoothing and then a Laplacian filter is applied.
  • The LoG filter is a filter that extracts edges by applying a Laplacian filter after removing noise.
  • FIG. 5 is a diagram showing an example of the relationship between the value of the correct parallax label and the deviation amount x when the LoG filter is applied. As shown in FIG. 5, in the filtered correct label, a distribution is formed with the position of the correct answer at its peak.
  • the filter is not limited to the LoG filter, and a filter capable of converting correct or incorrect values into a distribution, such as a Gaussian filter, may be applied.
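  • A minimal sketch of producing the filtered label of FIG. 5 from the binary label is shown below, using SciPy's Gaussian-Laplace operator. The kernel width, the sign flip that keeps the correct position at the peak, and the normalization are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

def filtered_label(f, sigma=2.0, use_log=True):
    """Convert the binary correct label f(x) into a distribution-shaped label.
    With use_log=True a LoG filter (Gaussian smoothing followed by a Laplacian)
    is applied and the result is negated so the correct position remains the peak;
    with use_log=False a plain Gaussian filter is used instead."""
    f = f.astype(np.float32)
    out = -gaussian_laplace(f, sigma=sigma) if use_log else gaussian_filter(f, sigma=sigma)
    return out / (np.abs(out).max() + 1e-12)   # normalize the peak magnitude to 1
```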
  • the parallax amount can be obtained by applying an inverse filter to the output of the similarity and obtaining the peak.
  • FIG. 7 is a block diagram showing the hardware configuration of the parallax learning device 100 and the fusion data generation device 200. Since the parallax learning device 100 and the fusion data generation device 200 have the same hardware configuration, the parallax learning device 100 will be described below.
  • The parallax learning device 100 includes a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication interface (I/F) 17.
  • the configurations are connected to each other via a bus 19 so as to be communicable with each other.
  • the CPU 11 is a central arithmetic processing unit that executes various programs and controls each part. That is, the CPU 11 reads the program from the ROM 12 or the storage 14, and executes the program using the RAM 13 as a work area. The CPU 11 controls each of the above configurations and performs various arithmetic processes according to the program stored in the ROM 12 or the storage 14. In the present embodiment, the parallax learning program is stored in the ROM 12 or the storage 14.
  • the ROM 12 stores various programs and various data.
  • the RAM 13 temporarily stores a program or data as a work area.
  • the storage 14 is composed of a storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores various programs including an operating system and various data.
  • the input unit 15 includes a pointing device such as a mouse and a keyboard, and is used for performing various inputs.
  • the display unit 16 is, for example, a liquid crystal display and displays various information.
  • the display unit 16 may adopt a touch panel method and function as an input unit 15.
  • the communication interface 17 is an interface for communicating with other devices such as terminals.
  • a wired communication standard such as Ethernet (registered trademark) or FDDI, or a wireless communication standard such as 4G, 5G, or Wi-Fi (registered trademark) is used.
  • the fusion data generation device 200 also has a CPU 21, a ROM 22, a RAM 23, a storage 24, an input unit 25, a display unit 26, and a communication I / F 27.
  • the configurations are connected to each other via a bus 29 so as to be communicable with each other.
  • the fusion data generation program is stored in the ROM 22 or the storage 24. Since the description of each part of the hardware configuration is the same as that of the parallax learning device 100, the description thereof will be omitted.
  • FIG. 8 is a block diagram showing the configuration of the parallax learning device 100 of the present embodiment.
  • Each functional configuration is realized by the CPU 11 reading the parallax learning program stored in the ROM 12 or the storage 14, expanding it into the RAM 13, and executing the program.
  • the parallax learning device 100 has learning processing units for processing the first image and the second image corresponding to the two cameras.
  • The learning processing units include a first image input unit 101, a first image preprocessing unit 102, a first patch image generation unit 103, a second image input unit 104, a second image preprocessing unit 105, and a second patch image generation unit 106.
  • the parallax learning device 100 includes a label input unit 107, a filter unit 108, a similarity learning unit 109, and a model storage unit 110.
  • The first image input unit 101 and the second image input unit 104 have the stereo camera configuration shown in FIG. 1; the first image input unit 101 corresponds to the left camera and the second image input unit 104 corresponds to the right camera. Either the visible image or the infrared image can be assigned to the left and right inputs.
  • each processing unit for learning may be used as an external device, and the parallax learning device 100 may receive the output.
  • the first image input unit 101 inputs the image taken by the left camera as digital data, and outputs the first image for learning to the first image preprocessing unit 102.
  • The first image preprocessing unit 102 receives the first image for learning from the first image input unit 101, performs preprocessing such as contour extraction, parallelization, and distortion correction on the first image, and outputs the preprocessed first image to the first patch image generation unit 103.
  • Coordinates (u, v) indicating the correct value of the correct label are input from the label input unit 107 to the first patch image generation unit 103 and the second patch image generation unit 106.
  • The first patch image generation unit 103 cuts out N × N pixels centered on the coordinates (u, v) of the input correct label from the preprocessed first image input from the first image preprocessing unit 102, and thereby generates the first patch image.
  • The first patch image generation unit 103 outputs the generated first patch image to the similarity learning unit 109. In this way, the first patch image is generated from the first image for learning with the coordinate information of the correct label as a reference.
  • the second image input unit 104 inputs the image taken by the right camera as digital data, and outputs the second image for learning to the second image preprocessing unit 105.
  • The second image preprocessing unit 105 receives the second image for learning from the second image input unit 104, performs preprocessing such as contour extraction, parallelization, and distortion correction, and outputs the preprocessed second image to the second patch image generation unit 106.
  • For the preprocessed second image input from the second image preprocessing unit 105, the second patch image generation unit 106 generates a plurality of second patch images by cutting out N × N pixels over the range d min to d max, with the coordinates (u, v) of the input correct label as a reference.
  • The range over which the second patch images are generated from the reference may be d min to d max, where the range of values that the parallax amount d can take may be determined in advance from information such as the stereo camera configuration.
  • a plurality of second patch images are generated by shifting the center of the patch one pixel at a time in the horizontal direction along the epipolar line and cutting out N ⁇ N pixels.
  • the second patch image generation unit 106 outputs the generated second patch image to the similarity learning unit 109. In this way, a plurality of second patch images are generated by shifting horizontally from the second image for learning with reference to the coordinate information of the correct answer label.
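  • For concreteness, a small sketch of this patch generation is given below; the coordinate convention (the shift x applied along the v direction, matching (u, v + x) above) follows the description, while the bounds handling and parameter defaults are illustrative assumptions.

```python
import numpy as np

def generate_second_patches(second_image, u, v, patch=11, d_min=0, d_max=64):
    """Cut out N x N patches from the second image centered on (u, v + x)
    for every shift x in d_min..d_max along the epipolar line."""
    r = patch // 2
    patches, shifts = [], []
    for x in range(d_min, d_max + 1):
        win = second_image[u - r:u + r + 1, v + x - r:v + x + r + 1]
        if win.shape[:2] == (patch, patch):   # skip shifts whose window leaves the image
            patches.append(win)
            shifts.append(x)
    return np.stack(patches), np.asarray(shifts)
```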
  • the label input unit 107 is provided with the coordinate information of the deviation amount x and the parallax amount d for each position as inputs.
  • the correct label for parallax is input as f (x) shown in Eq. (1).
  • the label input unit 107 outputs the coordinates (u, v) of the correct answer value of the correct answer label to the first patch image generation unit 103 and the second patch image generation unit 106. Further, the label input unit 107 outputs the correct parallax label to the filter unit 108.
  • the filter unit 108 applies a filter to the correct answer label of the parallax represented by the equation (1) input from the label input unit 107, and outputs the filtered correct answer label to which the filter is applied to the similarity learning unit 109.
  • That is, f_filter(x) represented by equation (2) is output as the filtered correct label.
  • the similarity learning unit 109 receives the first patch image from the first patch image generation unit 103, a plurality of second patch images from the second patch image generation unit 106, and the filtered correct label from the filter unit 108.
  • the similarity learning unit 109 learns the parameters of the model based on the first patch image, the plurality of second patch images, and the filtered correct answer label.
  • the similarity learning unit 109 stores the parameters of the learned model in the model storage unit 110.
  • the model is a model for outputting the similarity corresponding to the patch image pair in estimation.
  • a generally known method such as MC-CNN, which is one aspect of the deep learning method, can be used as a method for learning a model for estimating the similarity from two input images.
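  • As one possible realization of such a model, the sketch below defines a small MC-CNN-style siamese network in PyTorch that scores a (first patch, second patch) pair and is trained to regress the filtered-label value for each shift. The architecture, the channel counts (a 3-channel visible patch and a 1-channel infrared patch), and the MSE loss are assumptions made for illustration, not the specific network of the disclosure.

```python
import torch
import torch.nn as nn

class PatchSimilarityNet(nn.Module):
    """Scores the similarity of an N x N patch pair (e.g. visible vs. infrared)."""
    def __init__(self, ch1=3, ch2=1, feat=64):
        super().__init__()
        def branch(c_in):
            return nn.Sequential(
                nn.Conv2d(c_in, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, feat, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.branch1, self.branch2 = branch(ch1), branch(ch2)
        self.head = nn.Sequential(nn.Linear(2 * feat, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, p1, p2):
        feats = torch.cat([self.branch1(p1), self.branch2(p2)], dim=1)
        return self.head(feats).squeeze(1)       # one similarity score per pair

def train_step(model, optimizer, first_patch, second_patches, filtered_label):
    """first_patch: (1, C1, N, N) reference patch; second_patches: (K, C2, N, N)
    patches for the K shifts; filtered_label: (K,) target values taken from the
    filtered correct label. One gradient step of MSE regression."""
    optimizer.zero_grad()
    scores = model(first_patch.expand(second_patches.size(0), -1, -1, -1), second_patches)
    loss = nn.functional.mse_loss(scores, filtered_label)
    loss.backward()
    optimizer.step()
    return loss.item()
```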
  • the model storage unit 110 stores the parameters of the model learned by the similarity learning unit 109.
  • FIG. 9 is a flowchart showing the flow of the parallax learning process by the parallax learning device 100.
  • the parallax learning process is performed by the CPU 11 reading the parallax learning program from the ROM 12 or the storage 14, expanding it into the RAM 13, and executing the program.
  • the CPU 11 executes the following processing as each part of the parallax learning device 100.
  • In step S100, the CPU 11 receives the input of the first image for learning from the first image input unit 101, preprocesses the first image, and outputs the preprocessed first image to the first patch image generation unit 103.
  • In step S102, the CPU 11 receives the input of the second image for learning from the second image input unit 104, preprocesses the second image, and outputs the preprocessed second image to the second patch image generation unit 106.
  • step S104 the CPU 11 receives the input of the correct parallax label from the label input unit 107.
  • the coordinates (u, v) of the correct answer value of the correct answer label are output to the first patch image generation unit 103 and the second patch image generation unit 106, and the correct answer label itself is output to the filter unit 108.
  • step S106 the CPU 11 generates a first patch image for learning and a plurality of second patch images for learning. The details of the patch image generation process for learning in this step will be described later.
  • step S108 the CPU 11 applies a filter to the correct answer label of the parallax, and outputs the filtered correct answer label to which the filter is applied to the similarity learning unit 109.
  • step S110 the CPU 11 learns the model parameters based on the first patch image for learning, the second patch image for learning, and the filtered correct answer label.
  • step S112 the CPU 11 stores the learned model parameters in the model storage unit 110.
  • In step S130, the CPU 11 cuts out N × N pixels from the input preprocessed first image, centered on the coordinates (u, v) of the input correct label, to generate the first patch image for learning.
  • step S134 the CPU 11 generates a second patch image for learning centered on the coordinates (u, v + x).
  • In step S136, the CPU 11 determines whether or not x ≤ d max. If the condition is satisfied, the process proceeds to step S138, and if the condition is not satisfied, the process ends.
  • According to the parallax learning device 100 of the present embodiment, it is possible to learn the parameters of a model that enables the parallax amount to be accurately estimated and utilized even between images acquired by different sensors.
  • FIG. 11 is a block diagram showing the configuration of the fusion data generation device 200 of the present embodiment.
  • Each functional configuration is realized by the CPU 21 reading the fusion data generation program stored in the ROM 22 or the storage 24, deploying it in the RAM 23, and executing it.
  • the fusion data generation device 200 has each processing unit for estimation for processing the first image and the second image corresponding to the two cameras.
  • The estimation processing units include a first image input unit 201, a first image preprocessing unit 202, a first feature point extraction unit 203, a first patch image generation unit 204, a second image input unit 205, a second image preprocessing unit 206, and a second patch image generation unit 207.
  • The fusion data generation device 200 further includes a similarity calculation unit 208, a similarity totaling unit 209, an inverse filter unit 210, a parallax calculation unit 211, a parallax interpolation unit 212, a fusion data generation unit 213, and a model storage unit 230.
  • The first image input unit 201 and the second image input unit 205 have the stereo camera configuration shown in FIG. 1; the first image input unit 201 corresponds to the left camera and the second image input unit 205 corresponds to the right camera. Either the visible image or the infrared image can be assigned to the left and right inputs. Further, each estimation processing unit may be provided as an external device, and the fusion data generation device 200 may receive its output.
  • the first image input unit 201 inputs the image taken by the left camera as digital data, and outputs the first image to be estimated to the first image preprocessing unit 202.
  • The first image preprocessing unit 202 receives the first image to be estimated from the first image input unit 201, performs preprocessing such as contour extraction, parallelization, and distortion correction on the first image, and outputs the preprocessed first image to the first feature point extraction unit 203.
  • The first feature point extraction unit 203 extracts feature points from the input preprocessed first image and outputs the preprocessed first image and the coordinates of each extracted feature point to the first patch image generation unit 204. Further, the first feature point extraction unit 203 outputs the coordinates of each extracted feature point to the second patch image generation unit 207.
  • The first patch image generation unit 204 receives the preprocessed first image and the feature point coordinates from the first feature point extraction unit 203 and, for each input feature point, cuts out N × N pixels centered on the feature point coordinates and outputs the resulting first patch image to the similarity calculation unit 208.
  • In this way, a first patch image is generated from the first image with each feature point as a reference.
  • the second image input unit 205 inputs the image taken by the right camera as digital data, and outputs the second image to be estimated to the second image preprocessing unit 206.
  • The second image preprocessing unit 206 receives the second image to be estimated from the second image input unit 205, performs preprocessing such as contour extraction, parallelization, and distortion correction, and outputs the preprocessed second image to the second patch image generation unit 207.
  • For the preprocessed second image input from the second image preprocessing unit 206, the second patch image generation unit 207 generates, for each input feature point, a plurality of second patch images by cutting out N × N pixels over the range d min to d max, with the coordinates of the feature point as a reference.
  • the method of generating a plurality of images is the same as that of the parallax learning device 100, and a plurality of second patch images on the epipolar line are generated.
  • In this way, a plurality of second patch images are generated from the second image with each feature point as a reference.
  • the model storage unit 230 stores the parameters of the model for outputting the similarity corresponding to the patch image pair learned by the parallax learning device 100.
  • the similarity calculation unit 208 receives each of the feature points and the first patch image from the first patch image generation unit 204, and receives a plurality of second patch images from the second patch image generation unit 207.
  • the similarity calculation unit 208 processes a pair of the first patch image and each of the plurality of second patch images as a combination.
  • the similarity calculation unit 208 inputs each of the combinations into the model of the model storage unit 230 for each feature point, and outputs the similarity indicating the similarity of each combination as the output of the model using the parameters.
  • That is, with the first patch image as a reference, the matching cost for each combination with the plurality of second patch images is evaluated by the trained model and calculated as the degree of similarity.
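  • A small sketch of this scoring loop is shown below, reusing a trained patch-similarity model; the tensor shapes and the model interface are assumptions consistent with the training sketch given earlier.

```python
import torch

@torch.no_grad()
def compute_similarities(model, first_patch, second_patches):
    """first_patch: (1, C1, N, N) patch at a feature point; second_patches: (K, C2, N, N)
    patches cut out along the epipolar line. Returns a length-K vector with one
    matching cost (similarity) per candidate shift."""
    model.eval()
    scores = model(first_patch.expand(second_patches.size(0), -1, -1, -1), second_patches)
    return scores.cpu().numpy()
```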
  • FIG. 6 is a diagram showing an example of the similarity before the inverse filter and the similarity after the inverse filter. The degree of similarity for each combination corresponds to a point in FIG. 6.
  • the similarity which is the matching cost of each of the calculated combinations, is output to the similarity totaling unit 209.
  • the similarity totaling unit 209 totals the similarity of each of the output combinations for each feature point, and outputs the totaled result to the inverse filter unit 210. For each feature point, the aggregated result as shown in FIG. 6 is obtained.
  • The inverse filter unit 210 applies the inverse filter to the aggregated similarity result of the combinations for each feature point, and outputs the resulting post-inverse-filter estimation result to the parallax calculation unit 211.
  • the aggregated result is converted into the frequency domain by the Fourier transform, the inverse filter is applied in the frequency domain, and the inverse transform is performed to return to the spatial domain.
  • The parallax calculation unit 211 calculates parallax information indicating the amount of parallax for each feature point from the post-inverse-filter estimation result for each feature point, and outputs the parallax information to the parallax interpolation unit 212. From the post-inverse-filter estimation result for each feature point input from the inverse filter unit 210, the peak point is searched for, the deviation on the epipolar line at the peak is obtained as the parallax amount d, and the parallax amount d for each feature point is taken as the parallax information. By the processing up to this point, the parallax amount d for each feature point in the image can be calculated.
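  • A rough sketch of this inverse filtering and peak search is given below. The regularized (Wiener-style) spectral division and the use of the same negated LoG kernel as in the label filtering are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def inverse_filter_peak(similarity, sigma=2.0, eps=1e-3, d_min=0):
    """similarity: 1-D aggregated similarity over the shifts x = d_min..d_max.
    Its spectrum is divided by the spectrum of the label-filtering kernel in the
    frequency domain, transformed back to the spatial domain, and the position of
    the peak is returned as the parallax amount d."""
    n = similarity.shape[0]
    impulse = np.zeros(n, dtype=np.float32)
    impulse[n // 2] = 1.0
    kernel = -gaussian_laplace(impulse, sigma=sigma)   # same filter shape as used on the labels
    S = np.fft.fft(similarity)
    K = np.fft.fft(np.fft.ifftshift(kernel))
    restored = np.real(np.fft.ifft(S * np.conj(K) / (np.abs(K) ** 2 + eps)))  # regularized inverse
    return d_min + int(np.argmax(restored))
```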
  • the parallax interpolation unit 212 calculates the parallax interpolation information obtained by interpolating the parallax amount for each pixel from the parallax information for each feature point, and outputs it to the fusion data generation unit 213.
  • the parallax interpolation unit 212 obtains the parallax amount of the entire image based on the parallax amount d of the feature points.
  • the parallax between the feature points can be interpolated by approximating the parallax amount for each feature point by the least squares method or the like.
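  • One simple way to realize this interpolation is a least-squares fit of a low-order surface to the feature-point parallax values; the planar model below is an illustrative assumption, and other approximations could be substituted.

```python
import numpy as np

def interpolate_parallax(points, disparities, height, width):
    """points: (M, 2) feature-point coordinates (u, v); disparities: (M,) parallax d
    at each feature point. Fits d ~ a*u + b*v + c by least squares and returns a
    dense (height, width) parallax map covering every pixel."""
    u = points[:, 0].astype(np.float32)
    v = points[:, 1].astype(np.float32)
    A = np.stack([u, v, np.ones_like(u)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, disparities.astype(np.float32), rcond=None)
    uu, vv = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return coeffs[0] * uu + coeffs[1] * vv + coeffs[2]
```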
  • the fusion data generation unit 213 generates and outputs fusion data of the first image to be estimated and the second image to be estimated based on the parallax interpolation information.
  • the fusion data is generated by aligning the first image and the second image based on the parallax interpolation information.
  • In this example, the fusion data generation unit 213 generates 4-channel fusion data in which the 3 RGB channels of the visible image and the 1 temperature-data channel of the infrared image are superimposed.
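  • A minimal sketch of assembling the 4-channel fusion data from the interpolated parallax is shown below; the nearest-neighbor resampling of the infrared channel and the choice of which image is warped are illustrative assumptions.

```python
import numpy as np

def generate_fusion_data(visible_rgb, infrared, parallax_map):
    """visible_rgb: (H, W, 3) visible image; infrared: (H, W) per-pixel temperature data;
    parallax_map: (H, W) parallax from the interpolation step. Pixel (u, v) of the
    visible image is paired with pixel (u, v + x) of the infrared image and the
    result is stacked into 4-channel fusion data."""
    h, w, _ = visible_rgb.shape
    fused = np.zeros((h, w, 4), dtype=np.float32)
    fused[..., :3] = visible_rgb
    for u in range(h):
        for v in range(w):
            src = v + int(round(parallax_map[u, v]))
            if 0 <= src < w:
                fused[u, v, 3] = infrared[u, src]
    return fused
```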
  • FIG. 12 is a flowchart showing the flow of the fusion data generation process by the fusion data generation device 200.
  • the fusion data generation process is performed by the CPU 21 reading the fusion data generation program from the ROM 22 or the storage 24, expanding it into the RAM 23, and executing the fusion data generation program.
  • the CPU 21 executes the following processing as each part of the fusion data generation device 200.
  • In step S200, the CPU 21 receives the input of the first image to be estimated from the first image input unit 201, preprocesses the first image, and outputs the preprocessed first image to the first feature point extraction unit 203.
  • In step S202, the CPU 21 receives the input of the second image to be estimated from the second image input unit 205, preprocesses the second image, and outputs the preprocessed second image to the second patch image generation unit 207.
  • In step S204, the CPU 21 extracts feature points from the input preprocessed first image and outputs the preprocessed first image and the coordinates of each extracted feature point to the first patch image generation unit 204.
  • the total number of feature points extracted in step S204 is N.
  • step S208 the CPU 21 executes processing up to the calculation of parallax information for the selected feature point i.
  • the details of the feature point calculation process in this step will be described later.
  • step S210 the CPU 21 determines whether or not i ⁇ N. If the condition is satisfied, the process proceeds to step S212, and if the condition is not satisfied, the process proceeds to step S214.
  • step S214 the CPU 21 calculates parallax interpolation information obtained by interpolating the parallax amount for each pixel from the parallax information for each feature point obtained in the process of step S210.
  • step S216 the CPU 21 generates and outputs fusion data of the first image to be estimated and the second image to be estimated based on the parallax interpolation information.
  • Next, the feature point calculation process in step S208 will be described with reference to the flowchart of FIG. 13. The following is the processing for the selected feature point i.
  • step S230 the CPU 21 cuts out N ⁇ N pixels with respect to the input preprocessed first image centered on the coordinates (u, v) of the selected feature point i, and the first patch to be estimated. Generate an image.
  • step S234 the CPU 21 generates a second patch image to be estimated centered on the coordinates (u, v + x).
  • In step S236, the CPU 21 inputs, into the model of the model storage unit 230, the pair combination of the first patch image to be estimated generated in step S230 and the second patch image to be estimated generated in step S234. Then, from the output of the model using the learned parameters, the matching cost indicating the similarity of the combination is calculated as the degree of similarity.
  • In step S238, the CPU 21 determines whether or not x ≤ d max. If the condition is satisfied, the process proceeds to step S240, and if the condition is not satisfied, the process proceeds to step S242.
  • step S242 the CPU 21 aggregates the similarity of each of the output combinations and outputs the aggregated result to the inverse filter unit 210.
  • step S244 the CPU 21 outputs to the parallax calculation unit 211 the estimation result after the inverse filter to which the inverse filter is applied to the aggregation result of each estimation of the output combination.
  • step S246 the CPU 21 calculates the parallax information indicating the parallax amount of the feature point i selected from the estimation result after the filter, and outputs the parallax information to the parallax interpolation unit 212. As a result, parallax information for each feature point i is obtained.
  • According to the fusion data generation device 200 of the present embodiment, it is possible to generate fusion data by accurately estimating and utilizing the parallax amount even between images acquired by different sensors.
  • The parallax learning process or the fusion data generation process, which in the above embodiment is executed by the CPU reading software (a program), may be executed by various processors other than the CPU.
  • Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, which is a processor having a circuit configuration designed exclusively for executing specific processing, such as an ASIC (Application Specific Integrated Circuit).
  • The parallax learning process or the fusion data generation process may be executed by one of these various processors, or by a combination of two or more processors of the same or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA).
  • the hardware-like structure of these various processors is, more specifically, an electric circuit in which circuit elements such as semiconductor elements are combined.
  • the parallax learning program is stored (installed) in the storage 14 in advance, but the present invention is not limited to this.
  • The program may be provided in a form stored in a non-transitory storage medium such as a CD-ROM (Compact Disc Read Only Memory), a DVD-ROM (Digital Versatile Disc Read Only Memory), or a USB (Universal Serial Bus) memory. Further, the program may be downloaded from an external device via a network. The same applies to the fusion data generation program.
  • A parallax learning device configured to: output a filtered label obtained by applying a filter to a correct label indicating the relationship between the amount of horizontal positional deviation between a first image for learning and a second image for learning and the correct answer for each position; and learn the parameters of a model for outputting the similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with the coordinate information of the correct label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting in the horizontal direction with the coordinate information as a reference, and the filtered label.
  • A non-transitory storage medium storing a program executable by a computer to perform parallax learning processing, the processing comprising: outputting a filtered label obtained by applying a filter to a correct label indicating the relationship between the amount of horizontal positional deviation between a first image for learning and a second image for learning and the correct answer for each position; and learning the parameters of a model for outputting the similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with the coordinate information of the correct label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting in the horizontal direction with the coordinate information as a reference, and the filtered label.
  • 100 Parallax learning device, 101, 201 First image input unit, 102, 202 First image preprocessing unit, 103, 204 First patch image generation unit, 104, 205 Second image input unit, 105, 206 Second image preprocessing unit, 106, 207 Second patch image generation unit, 107 Label input unit, 108 Filter unit, 109 Similarity learning unit, 110, 230 Model storage unit, 200 Fusion data generation device, 203 First feature point extraction unit, 208 Similarity calculation unit, 209 Similarity totaling unit, 210 Inverse filter unit, 211 Parallax calculation unit, 212 Parallax interpolation unit, 213 Fusion data generation unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

The present invention makes it possible to accurately estimate and utilize a parallax amount, even between images acquired by different sensors. This parallax learning device outputs a filtered label obtained by applying a filter to a correct-answer label that indicates a relationship between the amount of misalignment of a horizontal-direction position between a first image for learning and a second image for learning and a correct answer to the position. The parallax learning device furthermore learns parameters for a model for outputting a degree of similarity that corresponds to a patch image pair, the learning being conducted on the basis of: a first patch image for learning that is generated from the first image for learning, with the coordinate information of the correct-answer label used as a reference; a plurality of second patch images for learning that are generated from the second image for learning by being shifted in the horizontal direction, with said coordinate information used as a reference; and the filtered label.

Description

Parallax learning device, fusion data generation device, parallax learning method, fusion data generation method, and parallax learning program
 The disclosed technique relates to a parallax learning device, a fusion data generation device, a parallax learning method, a fusion data generation method, and a parallax learning program.
 Conventionally, there is a stereo matching technique that estimates depth after obtaining the parallax amount between two images acquired using a stereo camera as a sensor (see Non-Patent Document 1 and Non-Patent Document 2). Parallax generally refers to the difference in how an image appears that is caused by the difference between the viewpoints of two sensors. The parallax estimation method refers to a method of estimating the amount of deviation between images as the parallax amount. Among stereo-matching-based parallax estimation methods, methods using deep learning have also been proposed in recent years.
 In the conventional technique, the purpose is generally to estimate the distance to an object by depth estimation after obtaining the parallax amount. However, the prior art has been premised on estimating the amount of parallax between images acquired by two sensors that acquire channels of the same type. Since images acquired by sensors that acquire different types of channels have different characteristics, there is a problem that it is difficult to estimate the amount of parallax between them.
 The disclosed technique was made in view of the above points, and its object is to provide a parallax learning device, a fusion data generation device, a parallax learning method, a fusion data generation method, and a parallax learning program that enable the parallax amount to be accurately estimated and utilized even between images acquired by different sensors.
 The first aspect of the present disclosure is a parallax learning device including: a filter unit that outputs a filtered label obtained by applying a filter to a correct label indicating the relationship between the amount of horizontal positional deviation between a first image for learning and a second image for learning and the correct answer for each position; and a similarity learning unit that learns the parameters of a model for outputting the similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with the coordinate information of the correct label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting in the horizontal direction with the coordinate information as a reference, and the filtered label.
 The second aspect of the present disclosure is a parallax learning method in which a computer executes processing to: output a filtered label obtained by applying a filter to a correct label indicating the relationship between the amount of horizontal positional deviation between a first image for learning and a second image for learning and the correct answer for each position; and learn the parameters of a model for outputting the similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with the coordinate information of the correct label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting in the horizontal direction with the coordinate information as a reference, and the filtered label.
 The third aspect of the present disclosure is a parallax learning program that causes a computer to: output a filtered label obtained by applying a filter to a correct label indicating the relationship between the amount of horizontal positional deviation between a first image for learning and a second image for learning and the correct answer for each position; and learn the parameters of a model for outputting the similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with the coordinate information of the correct label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting in the horizontal direction with the coordinate information as a reference, and the filtered label.
 According to the disclosed technique, it is possible to accurately estimate and utilize the parallax amount even between images acquired by different sensors.
FIG. 1 is a schematic diagram showing that fusion data is generated from a visible image and an infrared image in which parallax occurs.
FIG. 2 is a diagram showing an example of searching for corresponding points in parallel stereo.
FIG. 3 is a diagram showing an example of creating a patch image and searching on an epipolar line.
FIG. 4 is a diagram showing an example of the relationship between the value of the correct parallax label and the amount of deviation.
FIG. 5 is a diagram showing an example of the relationship between the value of the correct parallax label and the deviation amount x when the LoG filter is applied.
FIG. 6 is a diagram showing an example of the similarity before the inverse filter and the similarity after the inverse filter.
FIG. 7 is a block diagram showing the hardware configuration of the parallax learning device and the fusion data generation device.
FIG. 8 is a block diagram showing the configuration of the parallax learning device of the present embodiment.
FIG. 9 is a flowchart showing the flow of the parallax learning process by the parallax learning device.
FIG. 10 is a flowchart showing the flow of the patch image generation process for learning.
FIG. 11 is a block diagram showing the configuration of the fusion data generation device of the present embodiment.
FIG. 12 is a flowchart showing the flow of the fusion data generation process by the fusion data generation device.
FIG. 13 is a flowchart showing the flow of the feature point calculation process.
 Hereinafter, an example of an embodiment of the disclosed technique will be described with reference to the drawings. In each drawing, the same or equivalent components and parts are given the same reference numerals. The dimensional ratios in the drawings are exaggerated for convenience of explanation and may differ from the actual ratios.
 First, the outline of the present disclosure will be described. FIG. 1 is a schematic diagram showing that fusion data is generated from a visible image and an infrared image in which parallax occurs. As shown in FIG. 1, the parallax estimation method can also be used to link the corresponding points of two images and generate fusion data. Fusion data refers to data obtained by fusing the channels of images acquired by two sensors. The task assumed in the present disclosure is to generate fusion data in which corresponding points are matched from two images: a visible image (RGB) taken by a visible light camera and an infrared image taken by an infrared camera. The infrared image is an image in which temperature data is recorded for each pixel. If fusion data is used, it is possible to learn with richer information in the field of object recognition by machine learning or the field of semantic segmentation, and an improvement in performance can be expected. As an example of a specific use case, a scene in which inspection is automated by deep learning is assumed in equipment-inspection applications where infrared images are used. By utilizing this fusion data, learning can draw on rich information, so a performance improvement can be expected. However, as shown in FIG. 1, since parallax occurs between the visible image and the infrared image, how to estimate the parallax becomes a problem.
 Visible images and infrared images have different properties. Visible images are used for object recognition because they have abundant appearance information such as texture. The infrared image, on the other hand, has a lower resolution than the visible image but can acquire temperature information of the object, and is used in the security field or for equipment inspection and the like. Further, although devices capable of simultaneously acquiring a visible image and an infrared image are commercially available, the positions of the visible light camera and the infrared camera differ, so parallax occurs between the two images. Because of the parallax, the position where the object is projected onto the image sensor shifts, so a shift occurs in the image data. In the following, the magnitude of this parallax shift in the image data is taken as the parallax amount to be estimated.
 When performing object recognition or semantic segmentation using the information of the visible image and the infrared image, it is necessary to correct the shift due to parallax so that the corresponding points of the visible image and the infrared image overlap. As a conventional technique, there is a stereo matching technique in which depth is estimated by obtaining the parallax amount of two images from a stereo camera. In a stereo camera in a parallel stereo arrangement, it is known that for a pixel in one image, the corresponding pixel in the other image lies on a horizontal epipolar line. FIG. 2 is a diagram showing an example of searching for corresponding points in parallel stereo. The image taken by the left camera of the stereo camera is the left image, and the image taken by the right camera is the right image. The arrangement of the stereo camera is not limited to left and right, but in the following examples, the left image used as the reference is the first image 51 and the other, right image is the second image 52. As shown in FIG. 2, a corresponding point is searched for on the epipolar line 53 of the second image 52 with respect to the first image 51.
 The principle of the parallax estimation method will now be described. To obtain, with a basic stereo matching method using the above characteristics, the parallax amount of the pixel in the other image that corresponds to a certain pixel in the reference image, a search is performed using the patch image 54 (hereinafter, the reference numeral is omitted except where necessary for convenience of explanation). FIG. 3 is a diagram showing an example of creating a patch image and searching on an epipolar line. An N × N patch centered on a pixel of the first image is created, and this is used as the reference patch image (first patch image 54A). The central pixel of the reference first patch image is represented by (u, v). Then, while shifting an N × N patch (second patch image 54B) along the epipolar line of the second image, it is compared with the reference patch image, and the position where the two patch images overlap is found; the deviation amount x from the reference position is thereby obtained as the parallax amount. The overlapping position is the position where the same part of the object is projected. Therefore, the pixel of the second patch image corresponding to the central pixel (u, v) of the reference first patch image is obtained as (u, v + x).
 Here, the case where the conventional parallax estimation method described above is applied between different types of images, a visible image and an infrared image, is examined. The visible image is 3-channel data obtained by converting visible light into RGB data, and the infrared image is 1-channel data having temperature information for each pixel. In the parallax estimation method, an N × N patch image is cut out centered on a certain pixel of one image. Then, a plurality of N × N patch images are cut out from the other image centered on a plurality of pixels, and the coordinates at which the pixel groups in the patch images are most similar are found to obtain the parallax amount. However, the visible image and the infrared image record different kinds of information, and it is difficult to compute their similarity, so corresponding points cannot be obtained accurately. Therefore, it is difficult to obtain the parallax amount with the conventional parallax estimation method that presupposes the same kind of sensor. In addition, among parallax estimation methods, a method using deep learning may be able to learn flexibly, but since the infrared image has a lower resolution than the visible image, the corresponding points may shift, which can become noise at learning time or at estimation time.
 The present disclosure therefore proposes a data fusion method that finds per-pixel corresponding points between a visible image and an infrared image captured with different sensors, namely a visible-light camera and an infrared camera. Deep learning is used to compute the matching cost in the stereo matching method, but the way the correct label is given at training time differs from the conventional method. In the conventional deep-learning approach, the correct label is given as a binary correct/incorrect value: Positive (= 1) when the patches coincide and Negative (= 0) when they do not. FIG. 4 shows an example of the relationship between the value of the correct parallax label and the amount of shift. In the graph of FIG. 4, the vertical axis is the label value and the horizontal axis is the shift x along the epipolar line 53; the correct label is represented as a two-dimensional graph of these values. The shift x corresponds to the horizontal position between the first image for learning and the second image for learning, and the correct label is the graph relating the correct/incorrect value to that position. With x the shift along the epipolar line and d the parallax amount, the correct label f(x) is given by equation (1):

$$f(x) = \begin{cases} 1 & (x = d) \\ 0 & (x \neq d) \end{cases} \qquad \cdots (1)$$
 The correct label f(x) thus takes values in {0, 1}. The shift x at which the images coincide is the correct shift and is also the correct parallax amount d; the correct parallax d is given by the length of the correct shift x. Accordingly, the label value is f(x) = 1 when x = d, while f(x) = 0 at every shift x where the images do not coincide. The correct label f(x) is thus represented as a graph relating the shift x to the label value.
 However, as noted in the problem statement above, the position of the correct parallax may be displaced between the different sensors, the visible-light camera and the infrared camera, which can introduce noise during training. In the present method, a filtered label, obtained by applying a filter to the correct label of the above graph, is therefore used as the input to deep learning so as to cope with noisy data. Applying a filter to the correct label means applying the filter to the values of the correct-answer data plotted on the graph; as a result, the binary correct-answer values are expressed as a distribution. Here, an example is shown in which a LoG (Laplacian of Gaussian) filter is applied, i.e., a Gaussian filter is applied for smoothing and then a Laplacian filter is applied. The LoG filter extracts edges by applying a Laplacian filter after removing noise. FIG. 5 shows an example of the relationship between the value of the correct parallax label and the shift x when the LoG filter is applied. As shown in FIG. 5, the filtered correct label forms a distribution with its peak at the correct position. The filter is not limited to the LoG filter; any filter that can convert the correct/incorrect values into a distribution, such as a Gaussian filter, may be used.
 With g(x) a Gaussian filter and l(x) a Laplacian filter, the filtered correct label f_filt(x) is given by equation (2):

$$f_{\mathrm{filt}}(x) = \bigl(l * g * f\bigr)(x) \qquad \cdots (2)$$

where $*$ denotes convolution.
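 A minimal sketch of how such a filtered label could be produced, assuming SciPy is available: the binary label of equation (1) is smoothed with a Gaussian and then a discrete Laplacian is applied, which corresponds to equation (2) up to sign and normalization conventions; sigma and the search range are assumed values.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def filtered_label(d, d_min=0, d_max=64, sigma=2.0):
    xs = np.arange(d_min, d_max + 1)
    f = (xs == d).astype(np.float32)         # eq. (1): 1 at the true disparity, 0 elsewhere
    g = gaussian_filter1d(f, sigma)           # Gaussian smoothing of the label
    log = np.convolve(g, [1.0, -2.0, 1.0], mode="same")  # discrete Laplacian of the smoothed label
    return xs, f, log                         # log plays the role of f_filt(x) in eq. (2)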
 By filtering the correct label in this way, noise at training time can be expected to be suppressed. Note, however, that the similarity output of this method is expected to look like the upper graph of FIG. 6: because training is performed with the filtered correct label, the similarity output follows the same kind of distribution. To obtain the parallax amount from this similarity output, an inverse filter is applied to the output and the peak is located; the peak position gives the parallax amount.
 With the present disclosure, fusion data in which a visible image and an infrared image are accurately fused into a single data set can be generated. Furthermore, by performing deep learning that exploits both the visible-image and infrared-image information in the fusion data, improved recognition accuracy can be expected in object recognition and semantic segmentation.
 The configuration of the present embodiment is described below, separately for the parallax learning device and the fusion data generation device.
 FIG. 7 is a block diagram showing the hardware configuration of the parallax learning device 100 and the fusion data generation device 200. Since the two devices have the same hardware configuration, the parallax learning device 100 is described below.
 As shown in FIG. 7, the parallax learning device 100 has a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication interface (I/F) 17. These components are connected to one another via a bus 19 so that they can communicate with each other.
 The CPU 11 is a central processing unit that executes various programs and controls each unit. That is, the CPU 11 reads a program from the ROM 12 or the storage 14 and executes it using the RAM 13 as a working area. The CPU 11 controls the above components and performs various arithmetic operations according to the program stored in the ROM 12 or the storage 14. In the present embodiment, the parallax learning program is stored in the ROM 12 or the storage 14.
 The ROM 12 stores various programs and data. The RAM 13 temporarily stores programs or data as a working area. The storage 14 is a storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive) and stores various programs, including the operating system, and various data.
 The input unit 15 includes a pointing device such as a mouse and a keyboard, and is used for various inputs.
 The display unit 16 is, for example, a liquid crystal display and displays various information. The display unit 16 may employ a touch-panel scheme and also function as the input unit 15.
 The communication interface 17 is an interface for communicating with other equipment such as terminals. For this communication, a wired standard such as Ethernet (registered trademark) or FDDI, or a wireless standard such as 4G, 5G, or Wi-Fi (registered trademark), is used, for example. Similarly, the fusion data generation device 200 has a CPU 21, a ROM 22, a RAM 23, a storage 24, an input unit 25, a display unit 26, and a communication I/F 27, connected to one another via a bus 29 so that they can communicate with each other. The fusion data generation program is stored in the ROM 22 or the storage 24. The description of each hardware component is the same as for the parallax learning device 100 and is therefore omitted.
 Next, the functional configuration of the parallax learning device 100 is described. FIG. 8 is a block diagram showing the configuration of the parallax learning device 100 of the present embodiment. Each functional block is realized by the CPU 11 reading the parallax learning program stored in the ROM 12 or the storage 14, loading it into the RAM 13, and executing it.
 As shown in FIG. 8, the parallax learning device 100 has processing units for learning that process the first image and the second image corresponding to the two cameras: a first image input unit 101, a first image preprocessing unit 102, a first patch image generation unit 103, a second image input unit 104, a second image preprocessing unit 105, and a second patch image generation unit 106. The parallax learning device 100 further includes a label input unit 107, a filter unit 108, a similarity learning unit 109, and a model storage unit 110. The first image input unit 101 and the second image input unit 104 form the stereo camera configuration shown in FIG. 1, with the first image input unit 101 as the left camera and the second image input unit 104 as the right camera. Either the visible image or the infrared image can be assigned to the left or right input. Each learning processing unit may also be an external device whose output is received by the parallax learning device 100.
 The first image input unit 101 takes the image captured by the left camera as digital data and outputs the first image for learning to the first image preprocessing unit 102.
 The first image preprocessing unit 102 receives the first image for learning from the first image input unit 101, applies preprocessing such as contour extraction, rectification, and distortion correction, and outputs the preprocessed first image to the first patch image generation unit 103.
 The coordinates (u, v) indicating the correct value of the correct label are input from the label input unit 107 to the first patch image generation unit 103 and the second patch image generation unit 106.
 The first patch image generation unit 103 generates, from the preprocessed first image input from the first image preprocessing unit 102, a first patch image of N × N pixels cut out around the input correct-label coordinates (u, v), and outputs the generated first patch image to the similarity learning unit 109. The first patch image is thus generated from the first image for learning with the coordinate information of the correct label as the reference.
 The second image input unit 104 takes the image captured by the right camera as digital data and outputs the second image for learning to the second image preprocessing unit 105.
 The second image preprocessing unit 105 receives the second image for learning from the second image input unit 104, applies preprocessing such as contour extraction, rectification, and distortion correction, and outputs the preprocessed second image to the second patch image generation unit 106.
 The second patch image generation unit 106 generates, from the preprocessed second image input from the second image preprocessing unit 105, a plurality of second patch images of N × N pixels cut out over the range d_min to d_max with the input correct-label coordinates (u, v) as the reference. The range d_min to d_max over which second patch images are generated may be determined in advance, from information such as the stereo camera parameters, as the range the parallax amount d can take. The second patch images are generated by shifting the patch center one pixel at a time horizontally along the epipolar line and cutting out N × N pixels at each position. The second patch image generation unit 106 outputs the generated second patch images to the similarity learning unit 109. The second patch images are thus generated from the second image for learning by shifting horizontally with the coordinate information of the correct label as the reference.
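 The patch generation described here could look roughly as follows; crop() is a hypothetical helper, interior pixels are assumed, and n, d_min, and d_max are illustrative values, not values fixed by the disclosure.

import numpy as np

def crop(img, u, v, n):
    # n x n window centered at (row u, column v); interior pixels assumed
    h = n // 2
    return img[u - h:u + h + 1, v - h:v + h + 1]

def training_patches(img1, img2, u, v, n=9, d_min=0, d_max=64):
    # First patch at the correct-label coordinate of image 1, plus second patches
    # shifted one pixel at a time along the epipolar line (same row) of image 2.
    first = crop(img1, u, v, n)
    seconds = [crop(img2, u, v + x, n) for x in range(d_min, d_max + 1)]
    return first, seconds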
 The label input unit 107 is given as input the coordinate information of the shift x and the parallax amount d for each position. The correct parallax label is input as f(x) shown in equation (1). The label input unit 107 outputs the coordinates (u, v) of the correct value of the correct label to the first patch image generation unit 103 and the second patch image generation unit 106, and outputs the correct parallax label to the filter unit 108.
 The filter unit 108 applies a filter to the correct parallax label expressed by equation (1) and input from the label input unit 107, and outputs the resulting filtered correct label to the similarity learning unit 109. When the LoG filter is applied, f_filt(x) expressed by equation (2) is output as the filtered correct label.
 The similarity learning unit 109 receives the first patch image from the first patch image generation unit 103, the plurality of second patch images from the second patch image generation unit 106, and the filtered correct label from the filter unit 108. The similarity learning unit 109 learns the parameters of the model based on the first patch image, the plurality of second patch images, and the filtered correct label, and stores the learned parameters in the model storage unit 110. The model outputs the similarity corresponding to a patch image pair at estimation time. For learning the model parameters, a generally known technique for learning a model that estimates the similarity of two input images, such as MC-CNN, which is one form of deep learning, can be used.
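 The disclosure does not fix a network architecture; as one hedged sketch of a learnable similarity model in the spirit of MC-CNN, the following PyTorch fragment uses two small convolutional towers (separate towers, an assumption made here because the visible and infrared patches have different channel counts) and regresses the filtered label value with an MSE loss, which is also an assumed choice.

import torch
import torch.nn as nn

class PatchSimilarityNet(nn.Module):
    # Rough sketch: one tower per patch type, concatenated features, scalar similarity head.
    def __init__(self, in_ch1=3, in_ch2=1, feat=64):
        super().__init__()
        def tower(c):
            return nn.Sequential(
                nn.Conv2d(c, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, feat, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.t1, self.t2 = tower(in_ch1), tower(in_ch2)
        self.head = nn.Sequential(nn.Linear(2 * feat, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, p1, p2):
        return self.head(torch.cat([self.t1(p1), self.t2(p2)], dim=1)).squeeze(1)

# One illustrative training step against filtered-label values (dummy tensors).
model = PatchSimilarityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
p1, p2, y = torch.rand(8, 3, 9, 9), torch.rand(8, 1, 9, 9), torch.rand(8)
loss = nn.functional.mse_loss(model(p1, p2), y)
opt.zero_grad(); loss.backward(); opt.step()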
 The model storage unit 110 stores the parameters of the model learned by the similarity learning unit 109.
 Next, the operation of the parallax learning device 100 is described.
 FIG. 9 is a flowchart showing the flow of the parallax learning process performed by the parallax learning device 100. The parallax learning process is performed by the CPU 11 reading the parallax learning program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it. The CPU 11 executes the following processing as each unit of the parallax learning device 100.
 In step S100, the CPU 11 receives the input of the first image for learning from the first image input unit 101, preprocesses the first image, and outputs the preprocessed first image to the first patch image generation unit 103.
 In step S102, the CPU 11 receives the input of the second image for learning from the second image input unit 104, preprocesses the second image, and outputs the preprocessed second image to the second patch image generation unit 106.
 In step S104, the CPU 11 receives the input of the correct parallax label from the label input unit 107, outputs the coordinates (u, v) of the correct value of the correct label to the first patch image generation unit 103 and the second patch image generation unit 106, and outputs the correct label itself to the filter unit 108.
 In step S106, the CPU 11 generates a first patch image for learning and a plurality of second patch images for learning. The details of the patch image generation process for learning in this step are described later.
 In step S108, the CPU 11 applies the filter to the correct parallax label and outputs the filtered correct label to the similarity learning unit 109.
 In step S110, the CPU 11 learns the model parameters based on the first patch image for learning, the plurality of second patch images for learning, and the filtered correct label.
 In step S112, the CPU 11 stores the learned model parameters in the model storage unit 110.
 Next, the patch image generation process for learning in step S106 is described with reference to the flowchart of FIG. 10.
 In step S130, the CPU 11 generates, from the input preprocessed first image, a first patch image for learning by cutting out N × N pixels centered on the input correct-label coordinates (u, v).
 In step S132, the CPU 11 sets x = d_min for the position to be used as the patch center, with the correct-label coordinates (u, v) as the reference.
 In step S134, the CPU 11 generates a second patch image for learning centered on the coordinates (u, v + x).
 In step S136, the CPU 11 determines whether x ≤ d_max. If the condition is satisfied, the process proceeds to step S138; otherwise, the process ends.
 In step S138, the CPU 11 increments x as x = x + 1, returns to step S134, and repeats the generation of second patch images for learning. Through the above, the first patch image for learning and the plurality of second patch images are generated.
 As described above, the parallax learning device 100 of the present embodiment can learn the parameters of a model that makes it possible to estimate and use the parallax amount accurately even between images acquired by different sensors.
 Next, the functional configuration of the fusion data generation device 200 is described. FIG. 11 is a block diagram showing the configuration of the fusion data generation device 200 of the present embodiment. Each functional block is realized by the CPU 21 reading the fusion data generation program stored in the ROM 22 or the storage 24, loading it into the RAM 23, and executing it.
 As shown in FIG. 11, the fusion data generation device 200 has processing units for estimation that process the first image and the second image corresponding to the two cameras: a first image input unit 201, a first image preprocessing unit 202, a first feature point extraction unit 203, a first patch image generation unit 204, a second image input unit 205, a second image preprocessing unit 206, and a second patch image generation unit 207. The fusion data generation device 200 further includes a similarity calculation unit 208, a similarity aggregation unit 209, an inverse filter unit 210, a parallax calculation unit 211, a parallax interpolation unit 212, a fusion data generation unit 213, and a model storage unit 230. The first image input unit 201 and the second image input unit 205 form the stereo camera configuration shown in FIG. 1, with the first image input unit 201 as the left camera and the second image input unit 205 as the right camera. Either the visible image or the infrared image can be assigned to the left or right input. Each estimation processing unit may also be an external device whose output is received by the fusion data generation device 200.
 The first image input unit 201 takes the image captured by the left camera as digital data and outputs the first image to be estimated to the first image preprocessing unit 202.
 The first image preprocessing unit 202 receives the first image to be estimated from the first image input unit 201, applies preprocessing such as contour extraction, rectification, and distortion correction, and outputs the preprocessed first image to the first feature point extraction unit 203.
 The first feature point extraction unit 203 extracts feature points from the input preprocessed first image, outputs the preprocessed first image and the coordinates of the extracted feature points to the first patch image generation unit 204, and outputs the coordinates of the extracted feature points to the second patch image generation unit 207.
 The first patch image generation unit 204 receives the preprocessed first image and the coordinates of the feature points from the first feature point extraction unit 203, and for each input feature point outputs to the similarity calculation unit 208 a first patch image of N × N pixels cut out around the coordinates of that feature point. In this way, for each feature point of the first image, a first patch image of the first image is generated with that feature point as the reference.
 The second image input unit 205 takes the image captured by the right camera as digital data and outputs the second image to be estimated to the second image preprocessing unit 206.
 The second image preprocessing unit 206 receives the second image to be estimated from the second image input unit 205, applies preprocessing such as contour extraction, rectification, and distortion correction, and outputs the preprocessed second image to the second patch image generation unit 207.
 The second patch image generation unit 207 generates, from the preprocessed second image input from the second image preprocessing unit 206 and for each input feature point, a plurality of second patch images of N × N pixels cut out over the range d_min to d_max with the coordinates of that feature point as the reference. The procedure for generating the plurality of patches is the same as in the parallax learning device 100, and a plurality of second patch images along the epipolar line are generated. In this way, for each feature point of the first image, a plurality of second patch images are generated from the second image with that feature point as the reference.
 The model storage unit 230 stores the parameters of the model for outputting the similarity corresponding to a patch image pair, learned by the parallax learning device 100.
 The similarity calculation unit 208 receives the feature points and the first patch image from the first patch image generation unit 204 and the plurality of second patch images from the second patch image generation unit 207, and processes each pair consisting of the first patch image and one of the second patch images as a combination. For each feature point, the similarity calculation unit 208 inputs each combination into the model of the model storage unit 230 and outputs, as the model output using the learned parameters, a similarity indicating how similar the combination is. With the first patch image as the reference, the matching cost of each combination with the plurality of second patch images is evaluated by the learned model and computed as the similarity. FIG. 6 shows an example of the similarity before the inverse filter and the similarity after the inverse filter; the similarity of each combination corresponds to one point in FIG. 6. The similarity (matching cost) computed for each combination is output to the similarity aggregation unit 209.
 The similarity aggregation unit 209 aggregates, for each feature point, the similarities of the output combinations and outputs the aggregation result to the inverse filter unit 210. For each feature point, an aggregation result like that shown in FIG. 6 is obtained.
 The inverse filter unit 210 applies, for each feature point, an inverse filter to the aggregated similarities of the output combinations and outputs the post-inverse-filter estimation result to the parallax calculation unit 211. To reduce the amount of computation, the inverse filter converts the aggregation result to the frequency domain by Fourier transform, applies the inverse filter in the frequency domain, and returns the result to the spatial domain by inverse Fourier transform.
 The parallax calculation unit 211 calculates, from the post-inverse-filter estimation result for each feature point, parallax information indicating the parallax amount of that feature point and outputs it to the parallax interpolation unit 212. From the post-inverse-filter estimation result input from the inverse filter unit 210 for each feature point, the peak point is searched for, and the shift on the epipolar line at the peak is obtained as the parallax amount d; this parallax amount d for each feature point is the parallax information. Through the processing up to this point, the parallax amount d can be computed for each feature point in the image.
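 One way to realize the inverse filtering and peak search is sketched below, assuming the LoG kernel used for the labels is known and already padded/aligned to the same length as the similarity curve; the Wiener-style regularization (eps) is an assumption, since the disclosure only states that an inverse filter is applied in the frequency domain.

import numpy as np

def deconvolve_and_peak(similarity, log_kernel, d_min=0, eps=1e-3):
    # Undo the LoG filtering of the similarity curve in the frequency domain
    # (regularized division), then take the peak position as the disparity.
    n = len(similarity)
    S = np.fft.rfft(similarity, n)
    K = np.fft.rfft(log_kernel, n)
    restored = np.fft.irfft(S * np.conj(K) / (np.abs(K) ** 2 + eps), n)
    return d_min + int(np.argmax(restored))  # peak offset on the epipolar line = parallax d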
 The parallax interpolation unit 212 calculates parallax interpolation information in which the parallax amount of each pixel is interpolated from the parallax information of the feature points, and outputs it to the fusion data generation unit 213. The parallax interpolation unit 212 obtains the parallax of the entire image based on the parallax amounts d of the feature points. As a concrete example, in the structure inspection use case the object is often planar, so a plane is assumed; under this assumption, the parallax between feature points can be interpolated by approximating the per-feature-point parallax by the least squares method or the like. A robust estimation technique can also be incorporated into the interpolation to suppress the influence of outliers. Through this processing, the parallax of pixels other than the feature points is estimated, and the per-pixel parallax of the entire image is output as parallax interpolation information.
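 Under the planar-object assumption described above, the interpolation could be sketched as a least-squares plane fit over the feature-point disparities; robust estimation is omitted here, and the parameterization d = a·u + b·v + c is an illustrative choice.

import numpy as np

def interpolate_disparity(points, h, w):
    # points: rows of (u, v, d) for the feature points; returns an h x w disparity map.
    pts = np.asarray(points, dtype=np.float64)
    A = np.c_[pts[:, 0], pts[:, 1], np.ones(len(pts))]
    coef, *_ = np.linalg.lstsq(A, pts[:, 2], rcond=None)  # fit plane d = a*u + b*v + c
    uu, vv = np.mgrid[0:h, 0:w]
    return coef[0] * uu + coef[1] * vv + coef[2]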
 The fusion data generation unit 213 generates and outputs fusion data of the first image to be estimated and the second image to be estimated based on the parallax interpolation information. The fusion data are generated by aligning the first image and the second image based on the parallax interpolation information. The fusion data generation unit 213 generates four-channel fusion data in which the three RGB channels of the visible image and the one temperature-data channel of the infrared image are superimposed.
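 A minimal sketch of the four-channel fusion, assuming the visible image is the reference (first) image and using a nearest-pixel horizontal warp of the thermal channel; occlusions are not handled, and the sign of the disparity shift depends on the camera arrangement.

import numpy as np

def fuse(rgb, thermal, disparity):
    # rgb: (h, w, 3), thermal: (h, w), disparity: (h, w) per-pixel parallax.
    h, w, _ = rgb.shape
    fused = np.zeros((h, w, 4), dtype=np.float32)
    fused[..., :3] = rgb
    uu, vv = np.mgrid[0:h, 0:w]
    src_v = np.clip((vv + np.rint(disparity)).astype(int), 0, w - 1)  # (u, v) -> (u, v + d)
    fused[..., 3] = thermal[uu, src_v]  # temperature stacked as the fourth channel
    return fused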
 Next, the operation of the fusion data generation device 200 is described.
 FIG. 12 is a flowchart showing the flow of the fusion data generation process performed by the fusion data generation device 200. The fusion data generation process is performed by the CPU 21 reading the fusion data generation program from the ROM 22 or the storage 24, loading it into the RAM 23, and executing it. The CPU 21 executes the following processing as each unit of the fusion data generation device 200.
 In step S200, the CPU 21 receives the input of the first image to be estimated from the first image input unit 201, preprocesses the first image, and outputs the preprocessed first image to the first feature point extraction unit 203.
 In step S202, the CPU 21 receives the input of the second image to be estimated from the second image input unit 205, preprocesses the second image, and outputs the preprocessed second image to the second patch image generation unit 207.
 In step S204, the CPU 21 extracts feature points from the input preprocessed first image and outputs the preprocessed first image and the coordinates of the extracted feature points to the first patch image generation unit 204.
 In step S206, the CPU 21 sets the feature point selection to i = 1. Let N be the total number of feature points extracted in step S204.
 In step S208, the CPU 21 executes, for the selected feature point i, the processing up to the calculation of the parallax information. The details of the feature point calculation process in this step are described later.
 In step S210, the CPU 21 determines whether i ≤ N. If the condition is satisfied, the process proceeds to step S212; otherwise, the process proceeds to step S214.
 In step S212, the CPU 21 increments i as i = i + 1, selects the next feature point, and repeats the feature point calculation process.
 In step S214, the CPU 21 calculates parallax interpolation information in which the parallax amount of each pixel is interpolated from the per-feature-point parallax information obtained in the processing of step S210.
 In step S216, the CPU 21 generates and outputs fusion data of the first image to be estimated and the second image to be estimated based on the parallax interpolation information.
 Next, the feature point calculation process of step S208 is described with reference to the flowchart of FIG. 13. The following is the processing for the selected feature point i.
 In step S230, the CPU 21 generates, from the input preprocessed first image, a first patch image to be estimated by cutting out N × N pixels centered on the coordinates (u, v) of the selected feature point i.
 In step S232, the CPU 21 sets x = d_min for the position to be used as the patch center, with the coordinates of the feature point i as the reference.
 In step S234, the CPU 21 generates a second patch image to be estimated centered on the coordinates (u, v + x).
 In step S236, the CPU 21 treats the first patch image to be estimated generated in step S230 and the second patch image to be estimated generated in step S234 as a pair, inputs the pair into the model of the model storage unit 230, and, from the model output using the learned parameters, computes the matching cost indicating the similarity of the combination as the similarity.
 In step S238, the CPU 21 determines whether x ≤ d_max. If the condition is satisfied, the process proceeds to step S240; otherwise, the process proceeds to step S242.
 In step S240, the CPU 21 increments x as x = x + 1, returns to step S234, and repeats the generation of a second patch image to be estimated and the calculation of the similarity.
 In step S242, the CPU 21 aggregates the similarities of the output combinations and outputs the aggregation result to the inverse filter unit 210.
 In step S244, the CPU 21 applies an inverse filter to the aggregated estimates of the output combinations and outputs the post-inverse-filter estimation result to the parallax calculation unit 211.
 In step S246, the CPU 21 calculates, from the post-inverse-filter estimation result, the parallax information indicating the parallax amount of the selected feature point i and outputs it to the parallax interpolation unit 212. Parallax information is thereby obtained for each feature point i.
 As described above, the fusion data generation device 200 of the present embodiment can generate fusion data that make it possible to estimate and use the parallax amount accurately even between images acquired by different sensors.
 Note that the parallax learning process or the fusion data generation process, executed in the above embodiment by the CPU reading software (a program), may instead be executed by various processors other than a CPU. Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, which is a processor having a circuit configuration designed exclusively for executing specific processing, such as an ASIC (Application Specific Integrated Circuit). The parallax learning process or the fusion data generation process may be executed by one of these various processors, or by a combination of two or more processors of the same type or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these various processors is an electric circuit combining circuit elements such as semiconductor elements.
 In the above embodiment, the parallax learning program is described as being stored (installed) in advance in the storage 14, but the present disclosure is not limited to this. The program may be provided in a form stored on a non-transitory storage medium such as a CD-ROM (Compact Disc Read Only Memory), a DVD-ROM (Digital Versatile Disc Read Only Memory), or a USB (Universal Serial Bus) memory, or may be downloaded from an external device via a network. The same applies to the fusion data generation program.
 Regarding the above embodiments, the following supplementary notes are further disclosed.
 (Appendix 1)
 A parallax learning device comprising:
 a memory; and
 at least one processor connected to the memory, wherein the processor is configured to:
 output a filtered label obtained by applying a filter to a correct answer label indicating the amount of horizontal positional deviation between a first image for learning and a second image for learning and the relationship of the correct answer to the position; and
 learn parameters of a model for outputting a similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with coordinate information of the correct answer label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting horizontally with the coordinate information as a reference, and the filtered label.
 (Appendix 2)
 A non-transitory storage medium storing a program executable by a computer so as to perform a parallax learning process, the parallax learning process comprising:
 outputting a filtered label obtained by applying a filter to a correct answer label indicating the amount of horizontal positional deviation between a first image for learning and a second image for learning and the relationship of the correct answer to the position; and
 learning parameters of a model for outputting a similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with coordinate information of the correct answer label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting horizontally with the coordinate information as a reference, and the filtered label.
100 Parallax learning device
101, 201 First image input unit
102, 202 First image preprocessing unit
103, 204 First patch image generation unit
104, 205 Second image input unit
105, 206 Second image preprocessing unit
106, 207 Second patch image generation unit
107 Label input unit
108 Filter unit
109 Similarity learning unit
110, 230 Model storage unit
200 Fusion data generation device
203 First feature point extraction unit
208 Similarity calculation unit
209 Similarity aggregation unit
210 Inverse filter unit
211 Parallax calculation unit
212 Parallax interpolation unit
213 Fusion data generation unit

Claims (7)

  1.  A parallax learning device comprising:
     a filter unit that outputs a filtered label obtained by applying a filter to a correct answer label indicating the amount of horizontal positional deviation between a first image for learning and a second image for learning and the relationship of the correct answer to the position; and
     a similarity learning unit that learns parameters of a model for outputting a similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with coordinate information of the correct answer label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting horizontally with the coordinate information as a reference, and the filtered label.
  2.  The parallax learning device according to claim 1, wherein
     the correct or incorrect values of the correct answer label are given as binary values, and
     the filter unit uses a filter capable of converting the correct or incorrect values into a distribution.
  3.  A fusion data generation device comprising:
     a similarity calculation unit that receives, for each feature point of a first image to be estimated, each combination of a first patch image to be estimated generated from the first image to be estimated with the feature point as a reference and each of a plurality of second patch images to be estimated generated from a second image to be estimated with the feature point as a reference, together with the parameters of the model learned by the parallax learning device according to claim 1 or 2, inputs each combination into the model for each feature point, and outputs, as an output of the model using the parameters, an estimate indicating the similarity of each combination;
     an inverse filter unit that outputs, for each feature point, a post-inverse-filter estimation result obtained by applying an inverse filter to an aggregation result of the output estimates of the combinations;
     a parallax calculation unit that calculates parallax information indicating a parallax amount for each feature point from the post-inverse-filter estimation result for each feature point;
     a parallax interpolation unit that calculates parallax interpolation information in which a parallax amount for each pixel is interpolated from the parallax information for each feature point; and
     a fusion data generation unit that generates fusion data of the first image and the second image based on the parallax interpolation information.
  4.  A parallax learning method that causes a computer to execute processing comprising:
     outputting a filtered label obtained by applying a filter to a correct answer label indicating the amount of horizontal positional deviation between a first image for learning and a second image for learning and the relationship of the correct answer to the position; and
     learning parameters of a model for outputting a similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with coordinate information of the correct answer label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting horizontally with the coordinate information as a reference, and the filtered label.
  5.  The parallax learning method according to claim 4, wherein
     the correct or incorrect values of the correct answer label are given as binary values, and
     a filter capable of converting the correct or incorrect values into a distribution is used as the filter.
  6.  A fusion data generation method that causes a computer to execute processing comprising:
     receiving, for each feature point of a first image to be estimated, each combination of a first patch image to be estimated generated from the first image to be estimated with the feature point as a reference and each of a plurality of second patch images to be estimated generated from a second image to be estimated with the feature point as a reference, together with the parameters of the model learned by the parallax learning device according to claim 1 or 2;
     inputting, for each feature point, each combination into the model and outputting, as an output of the model using the parameters, an estimate indicating the similarity of each combination;
     outputting, for each feature point, a post-inverse-filter estimation result obtained by applying an inverse filter to an aggregation result of the output estimates of the combinations;
     calculating parallax information indicating a parallax amount for each feature point from the post-inverse-filter estimation result for each feature point;
     calculating parallax interpolation information in which a parallax amount for each pixel is interpolated from the parallax information for each feature point; and
     generating fusion data of the first image and the second image based on the parallax interpolation information.
  7.  A parallax learning program that causes a computer to execute processing comprising:
     outputting a filtered label obtained by applying a filter to a correct answer label indicating the amount of horizontal positional deviation between a first image for learning and a second image for learning and the relationship of the correct answer to the position; and
     learning parameters of a model for outputting a similarity corresponding to a patch image pair, based on a first patch image for learning generated from the first image for learning with coordinate information of the correct answer label as a reference, a plurality of second patch images for learning generated from the second image for learning by shifting horizontally with the coordinate information as a reference, and the filtered label.
PCT/JP2020/045106 2020-12-03 2020-12-03 Parallax learning device, merge data generation device, parallax learning method, merge data generation method, and parallax learning program WO2022118442A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/045106 WO2022118442A1 (en) 2020-12-03 2020-12-03 Parallax learning device, merge data generation device, parallax learning method, merge data generation method, and parallax learning program


Publications (1)

Publication Number Publication Date
WO2022118442A1 true WO2022118442A1 (en) 2022-06-09

Family

ID=81852701

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/045106 WO2022118442A1 (en) 2020-12-03 2020-12-03 Parallax learning device, merge data generation device, parallax learning method, merge data generation method, and parallax learning program

Country Status (1)

Country Link
WO (1) WO2022118442A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018133064A (en) * 2017-02-17 2018-08-23 キヤノン株式会社 Image processing apparatus, imaging apparatus, image processing method, and image processing program
WO2020188120A1 (en) * 2019-03-21 2020-09-24 Five AI Limited Depth extraction

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018133064A (en) * 2017-02-17 2018-08-23 キヤノン株式会社 Image processing apparatus, imaging apparatus, image processing method, and image processing program
WO2020188120A1 (en) * 2019-03-21 2020-09-24 Five AI Limited Depth extraction

Similar Documents

Publication Publication Date Title
US11132809B2 (en) Stereo matching method and apparatus, image processing apparatus, and training method therefor
US9754377B2 (en) Multi-resolution depth estimation using modified census transform for advanced driver assistance systems
EP3201881B1 (en) 3-dimensional model generation using edges
US8755630B2 (en) Object pose recognition apparatus and object pose recognition method using the same
US20120127275A1 (en) Image processing method for determining depth information from at least two input images recorded with the aid of a stereo camera system
Warren et al. Online calibration of stereo rigs for long-term autonomy
US20180091798A1 (en) System and Method for Generating a Depth Map Using Differential Patterns
EP3182369B1 (en) Stereo matching method, controller and system
Ma et al. A modified census transform based on the neighborhood information for stereo matching algorithm
Kumari et al. A survey on stereo matching techniques for 3D vision in image processing
Hua et al. Extended guided filtering for depth map upsampling
EP3309743B1 (en) Registration of multiple laser scans
JP2005037378A (en) Depth measurement method and depth measurement device
EP3293700A1 (en) 3d reconstruction for vehicle
US11042986B2 (en) Method for thinning and connection in linear object extraction from an image
US20230206594A1 (en) System and method for correspondence map determination
CN102447917A (en) Three-dimensional image matching method and equipment thereof
CN111105452A (en) High-low resolution fusion stereo matching method based on binocular vision
CN111739071B (en) Initial value-based rapid iterative registration method, medium, terminal and device
Mei et al. Radial lens distortion correction using cascaded one-parameter division model
JP6359985B2 (en) Depth estimation model generation device and depth estimation device
WO2022118442A1 (en) Parallax learning device, merge data generation device, parallax learning method, merge data generation method, and parallax learning program
CN110610503B (en) Three-dimensional information recovery method for electric knife switch based on three-dimensional matching
CN109741389B (en) Local stereo matching method based on region base matching
Le et al. A new depth image quality metric using a pair of color and depth images

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20964294

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20964294

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP