WO2023171559A1 - Occlusion-inference object detection device, occlusion-inference object detection method, and program - Google Patents

Occlusion-inference object detection device, occlusion-inference object detection method, and program

Info

Publication number
WO2023171559A1
WO2023171559A1 (PCT/JP2023/008004, JP2023008004W)
Authority
WO
WIPO (PCT)
Prior art keywords
occluded
image
estimation
learning
occlusion
Prior art date
Application number
PCT/JP2023/008004
Other languages
French (fr)
Japanese (ja)
Inventor
慧敏 陸
禹超 鄭
Original Assignee
国立大学法人九州工業大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国立大学法人九州工業大学
Publication of WO2023171559A1 publication Critical patent/WO2023171559A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/162Segmentation; Edge detection involving graph-based methods

Definitions

  • the present invention relates to an estimated occlusion object detection device, an estimated occlusion object detection method, and a program.
  • an object of the present invention is to provide a technique that improves the accuracy of estimating the part of a partially shielded object covered by the shielding object.
  • One aspect of the present invention provides an apparatus for estimating one or more occluded parts of an object whose one or more parts are occluded by one or more occluding objects, wherein the occluded parts are segmented.
  • in the process of estimating a segmented occluded part, the apparatus estimates the positional relationship between the images appearing in the estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. These estimations are performed by a mathematical model that a learning unit trains by self-supervised learning, using also geometric information possessed by the images appearing in the estimation target image.
  • One aspect of the present invention is an occlusion estimation object detection device including: a target data acquisition unit that acquires image data; and an estimation unit that estimates an occluded part of an image appearing in the image of the image data acquired by the target data acquisition unit, by executing a trained mathematical model obtained by a learning unit that performs self-supervised learning of a mathematical model which estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. The learning unit performs learning of the mathematical model using also geometric information possessed by the images appearing in the estimation target image.
  • One aspect of the present invention is an occlusion estimation object detection method having a learning step of performing self-supervised learning of a mathematical model which estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. In the learning step, the mathematical model is learned using also geometric information possessed by the images appearing in the estimation target image.
  • One aspect of the present invention is an occlusion estimation object detection method having: a target data acquisition step of acquiring image data; and an estimation step of estimating an occluded part of an image appearing in the image of the image data acquired in the target data acquisition step, by executing a trained mathematical model obtained through self-supervised learning of a mathematical model which estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. The learning is learning of the mathematical model that also uses geometric information possessed by the images appearing in the estimation target image.
  • One aspect of the present invention is a program for causing a computer to function as the above-mentioned occlusion estimation object detection device.
  • FIG. 4 is an explanatory diagram illustrating an overview of the estimated occlusion object detection device according to the embodiment.
  • FIG. 5 is a first explanatory diagram illustrating an algorithm executed by the learning device in the embodiment.
  • FIG. 6 is a second explanatory diagram illustrating an algorithm executed by the learning device in the embodiment.
  • FIG. 10 is an explanatory diagram illustrating an overview of the estimation device in the embodiment.
  • FIG. 13 is a flowchart illustrating an example of the flow of processing executed by the estimation device in the embodiment.
  • FIG. 15 is an explanatory diagram illustrating a bounding box in the embodiment.
  • FIG. 16 is a diagram showing an example of the configuration of the learning unit included in the learning device in the embodiment.
  • FIG. 1 is a diagram showing an example of an overall flowchart of the present invention.
  • FIG. 2 is a diagram showing an example of a flowchart for performing object detection segmentation, which is a component of the present invention.
  • FIG. 3 is a diagram showing an example of a flowchart for performing annotation, which is a component of the present invention. That is, FIG. 3 is a diagram showing an example of a flowchart for performing mathematical modeling for amodal labeling and correction of occlusion rate using geometric information, which is a characteristic component of the present invention.
  • the present invention determines, at a predetermined cycle, whether an image that satisfies a predetermined format-related condition has been input (step S101).
  • the predetermined format-related condition is, for example, that the image is 640 pixels x 480 pixels. If an image that satisfies the predetermined format-related condition has not been input (step S101: NO), the process returns to step S101.
  • on the other hand, if such an image has been input (step S101: YES), the present invention detects, among the images appearing in the image acquired in step S101, an image that satisfies a predetermined condition (step S102).
  • the detected image will be referred to as a detection object.
  • the predetermined conditions are, for example, conditions input by the user to the present invention.
  • the present invention performs segmentation on the image acquired in step S101 (step S103).
  • the present invention determines whether the number of masks is equal to the number of detected images (step S104). Specifically, the present invention determines whether the number of masks is equal to the number of detected images based on the segmentation results.
  • if they are not equal (step S104: NO), the process returns to step S104. On the other hand, if they are equal (step S104: YES), the present invention converts the segmentation model to global amodal (step S105).
  • the present invention determines whether the amodal information is correct based on the global amodal (step S106). If the amodal information is correct (step S106: YES), the present invention acquires information indicating the occlusion order (step S107). Next, the present invention executes a mask complementation model such as PCNet-M on the image input in step S101 (step S108).
  • next, the present invention executes a content complementation model such as PCNet-C on the result of executing the mask complementation model, such as PCNet-M, on the image input in step S101 (step S109).
  • next, the present invention determines whether the occlusion order has been correctly restored, based on the result of the execution of step S109 (step S110). If the occlusion order has been correctly restored (step S110: YES), the process ends. On the other hand, if the occlusion order has not been correctly restored (step S110: NO), the process returns to step S108.
  • in step S111, the present invention inspects the amodal information. That is, the present invention inspects whether an unlabeled object has a label with amodal information, and if there is no label, performs amodal labeling using the occlusion rate and geometric information.
  • in step S112, the present invention corrects the occlusion rate based on the inspection result of the amodal information. That is, the present invention uses geometric information to correct the occlusion rate. By correcting the occlusion rate, the calculation ratio is also corrected.
  • next, the present invention determines whether the detection object has a predetermined specific shape (step S113). If it does not have the specific shape (step S113: NO), the process returns to step S111. If it has the specific shape (step S113: YES), the process returns to step S105.
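  • As a rough illustration only, the control flow of steps S101 to S113 can be sketched as follows in Python. Every function name below is a hypothetical placeholder for a component of the flowcharts in FIGS. 1 to 3, and the loop-backs are simplified; this is not an implementation disclosed by the invention.

```python
# Hypothetical sketch of the flow of steps S101-S113; all callables are placeholders.

EXPECTED_SIZE = (640, 480)  # the predetermined format-related condition of step S101


def process_image(image, *, detect, segment, to_global_amodal, amodal_ok,
                  occlusion_order, run_pcnet_m, run_pcnet_c, order_restored,
                  inspect_amodal, correct_occlusion_rate, has_specific_shape):
    """One pass over a single input image, with simplified loop-backs."""
    if getattr(image, "size", None) != EXPECTED_SIZE:          # step S101
        return None                                            # wait for a valid input
    detections = detect(image)                                 # step S102
    masks = segment(image)                                     # step S103
    if len(masks) != len(detections):                          # step S104
        return None
    global_amodal = to_global_amodal(masks)                    # step S105
    while not amodal_ok(global_amodal):                        # step S106
        labels = inspect_amodal(global_amodal)                 # step S111
        global_amodal = correct_occlusion_rate(labels)         # step S112
        if has_specific_shape(global_amodal):                  # step S113: YES
            global_amodal = to_global_amodal(masks)            # back to step S105
    order = occlusion_order(global_amodal)                     # step S107
    while True:
        amodal_masks = run_pcnet_m(image, masks, order)        # step S108
        restored = run_pcnet_c(image, amodal_masks)            # step S109
        if order_restored(restored, order):                    # step S110
            return restored
```

  • In an actual device, the placeholder callables would correspond to the object detection, instance segmentation, global amodal conversion, and PCNet-M/PCNet-C completion components described below.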
  • FIG. 4 is an explanatory diagram illustrating an overview of the occlusion estimation object detection device 100 of the embodiment.
  • the estimated occlusion object detection device 100 is an example of the present invention, and for example, the estimated occlusion object detection device 100 executes the flowcharts shown in FIGS. 1 to 3.
  • the occlusion estimation object detection device 100 includes a learning device 1 and an estimation device 2.
  • FIG. 4 is also an explanatory diagram illustrating an overview of the algorithm executed by the learning device 1.
  • prior to explaining the learning device 1, Self-Supervised Scene De-occlusion, which is the scene de-occlusion algorithm executed by the learning device 1, will be explained using FIGS. 5 and 6 in addition to FIG. 4.
  • FIG. 5 is a first explanatory diagram illustrating an algorithm executed by the learning device 1 in the embodiment. More specifically, FIG. 5 is a diagram showing image G1 in FIG. 4 in more detail.
  • FIG. 6 is a second explanatory diagram illustrating an algorithm executed by the learning device 1 in the embodiment.
  • Self-Supervised Scene De-occlusion is an algorithm that complements occluded parts. This algorithm aims to restore the occlusion order and complement the invisible parts of the object covered by the occlusion.
  • the algorithm is also a self-supervised learning framework that tackles de-occlusion on real-world data without manually annotating occlusion orders or amodal masks. This framework performs occlusion order recovery as well as the completion of amodal masks and occluded content.
  • the learning device 1 uses two mathematical models described later, a mask complementation model and a content complementation model, to partially complete instances in a self-supervised manner.
  • the mathematical model is specifically expressed by a neural network. Note that the above-mentioned amodal complementation is a term meaning recognition by supplementing in the brain an invisible part that is occluded.
  • in Self-Supervised Scene De-occlusion, an off-the-shelf instance segmentation framework is used to obtain the modal mask of the object.
  • as a technique for obtaining the modal mask of an object, deep learning such as Mask R-CNN is used, for example.
  • however, amodal masks are not available, and it is very difficult to learn completion for instances with occlusion because it is not known whether a given mask is intact or not. Therefore, in Self-Supervised Scene De-occlusion, partial completion is performed in a self-supervised manner.
  • in the following, f_θ represents the full completion model.
  • this completion process can be decomposed as shown in equation (2) below, where M_k denotes an intermediate state and p_θ denotes the partial completion model.
  • PCNet-M is an example of a neural network that expresses a mask complementation model.
  • M A and M B are regarded as a set of pixels. There are two cases of input, and different inputs are fed into the network.
  • the first case corresponds to a partial completion strategy.
  • in the first case, M_B is regarded as an eraser, and the mask completion model uses it to erase part of M_A, obtaining M_AoutB.
  • the mask complementation model learns to restore the original modal mask MA from M AoutB with M B as a condition.
  • M AoutB is represented by the symbol of the following formula (3) in the following explanation.
  • AoutB represents a difference set between A and B.
  • the second case is a regularization that prevents the network from over-completing an instance when the instance has no occlusion. Specifically, an eraser mask that does not invade A is used, and the mask complementation model is encouraged to keep the original mask M_A unchanged, conditioned on that eraser. In the absence of case 2, the mask complementation model always encourages an increase in the number of pixels, which may result in excessive completion if the instances are not surrounded by adjacent instances.
  • the erased image patch functions as an auxiliary input.
  • the loss function is formulated as shown in Equation (4) and Equation (5) below.
  • in equations (4) and (5), P_θ^(m)(·) represents the mask complementation model, θ^(m) represents the parameters to be optimized, I represents the image patch, and L represents the binary cross-entropy loss.
  • the final loss function is defined as a weighted combination of the two cases, in which the weighting coefficient represents the probability of selecting the first case.
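  • To make the two-case partial completion scheme concrete, the following PyTorch-style sketch shows one possible training step for a mask completion network. The network interface (a 5-channel input producing one logit mask), the random case switch, and the variable names are assumptions made for illustration; they are not the disclosed PCNet-M implementation.

```python
import torch
import torch.nn.functional as F


def mask_completion_step(net, m_a, m_b, image, optimizer, gamma=0.5):
    """One training step of a PCNet-M-style mask completion network (sketch).

    m_a, m_b : float modal masks of instances A and B, shape (N, 1, H, W), values in {0, 1}
    image    : RGB image patch, shape (N, 3, H, W), used as auxiliary input
    gamma    : probability of choosing case 1 (B erases part of A)
    net      : assumed to take a 5-channel input and output a 1-channel logit mask
    """
    if torch.rand(()) < gamma:
        # Case 1: M_B acts as the eraser; the network must restore the
        # original modal mask M_A from M_AoutB, conditioned on M_B.
        m_erased = m_a * (1.0 - m_b)           # M_AoutB
        eraser = m_b
    else:
        # Case 2 (regularization): the eraser does not invade A, so the
        # network is encouraged to keep M_A unchanged.
        m_erased = m_a
        eraser = m_b * (1.0 - m_a)             # a mask disjoint from A

    erased_image = image * (1.0 - eraser)      # erased image patch as auxiliary input
    logits = net(torch.cat([m_erased, eraser, erased_image], dim=1))

    loss = F.binary_cross_entropy_with_logits(logits, m_a)  # restore / keep M_A
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```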
  • the content completion model follows the same intuition as the mask completion model, but the target to be completed is the RGB content.
  • as shown in FIG. 5 (image G1 in FIG. 4), the input instances A and B are the same as for the mask completion model.
  • the image pixels in the region M_AandB are erased, and the content completion model aims to predict the missing content.
  • AandB represents the following equation (7). That is, AandB represents the common part of A and B in set theory.
  • M AandB means the following equation (8).
  • the content completion model also incorporates the remaining mask of A (A ⁇ B) to indicate that it is A and not some other object. Therefore, it cannot be simply replaced with standard image filling approaches.
  • the loss of the content completion model for minimizing the loss is formulated as shown in the following equation (9).
  • A\B represents the difference set between set A and set B.
  • P ⁇ (c) is a content completion model
  • I is an image patch
  • L is a loss function consisting of losses commonly used in image completion, including an l1 loss, a perceptual loss, and an adversarial loss. Similar to the mask completion model, complete completion becomes possible by training the content completion model through partial completion learning.
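  • As a hedged illustration of how the losses named above can be combined for content completion, the following sketch sums an l1 term, a perceptual term, and an adversarial term. The weights and the two callables are placeholders, not values or interfaces from the disclosure.

```python
import torch.nn.functional as F


def content_completion_loss(pred_rgb, target_rgb, perceptual_fn, adversarial_fn,
                            w_l1=1.0, w_perc=0.1, w_adv=0.01):
    """Composite loss for a PCNet-C-style content completion network (sketch).

    perceptual_fn  : e.g. a VGG-feature distance between prediction and target
    adversarial_fn : generator-side adversarial term for the prediction
    The weights are illustrative assumptions.
    """
    l1 = F.l1_loss(pred_rgb, target_rgb)
    perceptual = perceptual_fn(pred_rgb, target_rgb)
    adversarial = adversarial_fn(pred_rgb)
    return w_l1 * l1 + w_perc * perceptual + w_adv * adversarial
```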
  • the objective ordered graph consists of pairwise occlusion relationships between all adjacent pairs of instances.
  • a proximate instance pair is defined as two instances whose modal masks are connected, so one can potentially be an occlusion of the other.
  • the modal mask M A1 of A1 is first targeted for complementation.
  • M_A2 plays the role of the eraser to obtain the increment of A_1, i.e., ΔA_1.
  • then, the order between A_1 and A_2 is determined by comparing their respective increments as follows.
  • the occlusion order of the scene is obtained, which is represented as a graph as shown in image G2-1 in image G2 in FIG. 4. Nodes in the graph represent objects, and edges indicate the direction of occlusion between adjacent objects.
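  • The pairwise ordering described above can be sketched as follows: for every adjacent pair of instances, each modal mask is completed with the other mask acting as the eraser, and the instance whose mask grows more is judged to be the occluded one, yielding a directed edge. The adjacency test by dilation and the `complete_mask` interface are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import binary_dilation


def are_adjacent(mask_a, mask_b, margin=2):
    """Treat two modal masks as adjacent if one touches the slightly dilated other."""
    return bool(np.logical_and(binary_dilation(mask_a, iterations=margin), mask_b).any())


def occlusion_order_edges(modal_masks, complete_mask):
    """Build directed occlusion edges from pairwise increments (sketch).

    modal_masks   : list of boolean arrays of shape (H, W), one per instance
    complete_mask : callable (target_mask, eraser_mask) -> completed boolean mask,
                    assumed to wrap a trained mask completion model
    Returns a list of edges (i, j) meaning "instance j occludes instance i".
    """
    edges = []
    n = len(modal_masks)
    for i in range(n):
        for j in range(i + 1, n):
            if not are_adjacent(modal_masks[i], modal_masks[j]):
                continue
            # Increment of each instance when the other plays the eraser.
            inc_i = complete_mask(modal_masks[i], modal_masks[j]) & ~modal_masks[i]
            inc_j = complete_mask(modal_masks[j], modal_masks[i]) & ~modal_masks[j]
            if inc_i.sum() > inc_j.sum():
                edges.append((i, j))   # A_i grows more, so A_j occludes A_i
            elif inc_j.sum() > inc_i.sum():
                edges.append((j, i))   # A_j grows more, so A_i occludes A_j
    return edges
```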
  • the graph of image G2-1 is a graph obtained for image G2-2.
  • the image G2-3 in the image G2 in FIG. 4 is a diagram in which each object mask is uniformly displayed in one diagram after completion of interpolation.
  • image G2-4 in image G2 in FIG. 4 also shows the result of RGB value complementation obtained using the preceding completion results.
  • C A is the decomposed content of A from the scene.
  • for the background content, the sum (union) of all instances is used as the eraser.
  • unlike image rendering that is not conscious of occlusion, content is supplemented for the estimated occlusion area.
  • Self-Supervised Scene De-occlusion is thus a self-supervised deep learning algorithm that obtains a mathematical model which performs both the acquisition of a directed graph restoring the order between adjacent objects and the completion of invisible parts, using the occlusion geometric relationships of objects. Note that the above-mentioned ordered graph (ordinal graph) is an example of a directed graph that restores the order between adjacent objects. Note also that an object here is an image appearing in the image to be estimated.
  • the learning device 1 that performs Self-Supervised Scene De-occlusion updates the occluded part estimation model using a self-supervised learning method.
  • the occluded part estimation model is a mathematical model that estimates the positional relationship between images appearing in the estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object.
  • hereinafter, an object that is partially occluded is referred to as an occluded object, and the occluded portion of the occluded object is referred to as the occluded part.
  • estimating a region specifically means, for example, estimating an image of the region. Therefore, the occluded part estimation model is a mathematical model that includes a mask complementation model, a content complementation model, and a double complementation model to be described later.
  • the mathematical model is updated until a predetermined end condition for learning (hereinafter referred to as "learning end condition") is met.
  • the occluded part estimation model at the time when the learning end condition is satisfied is used to estimate the image of the occluded part in various estimation target images, such as an estimation target image selected by the user or an estimation target image that satisfies predetermined conditions.
  • the double complementation model is a mathematical model that estimates the positional relationship between images in the estimation target image based on the estimation results of the mask complementation model.
  • the above-mentioned directed graph that restores the order between adjacent objects is an example of the positional relationship between images. Therefore, the double complementation model is, for example, a mathematical model that executes the above-mentioned double complementation process.
  • the occluded part estimation model is a mathematical model that estimates the occluded part reflected in the estimation target based on the image of the estimation target. Further, the occluded part estimation model is a mathematical model expressed by a neural network.
  • the neural network expressing the occluded part estimation model is, for example, a neural network including a deep neural network.
  • the neural network expressing the occluded region estimation model is, for example, a neural network including a convolutional neural network.
  • the mask complementation model and the content complementation model included in the occluded part estimation model are both mathematical models that are updated through learning, so the occluded part estimation model is a mathematical model that is updated through learning.
  • Both the mask complementation model and the content complementation model are mathematical models expressed, for example, by a neural network.
  • Both the neural network expressing the mask complementation model and the neural network expressing the content complementation model are, for example, deep neural networks.
  • Both the neural network representing the mask complementation model and the neural network representing the content complementation model are, for example, convolutional neural networks.
  • the mask completion model is trained to partially fill in the invisible mask of an object (occludee) that is hidden by an occluding object (occluder). Therefore, the mask complementation model is a mathematical model that estimates the shape of an image appearing in the estimation target image based on the estimation target image.
  • the content completion model is trained to partially fill the restored mask with RGB values. Therefore, the content complementation model estimates the RGB values of the image in the estimation target image based on the estimation result of the mask complementation model.
  • segmentation has already been performed on the execution target of the mask completion model before the execution of the mask completion model.
  • the target of the content completion model has also been segmented before the content completion model is executed. Note that in order to reduce the amount of calculation, the mask that was complemented in the previous work is used for content complementation, and then filling the mask with RGB values is started. That is, instance segmentation of each image is performed only once.
  • the target for execution of the double-completion model has also been segmented before the execution of the double-completion model. That is, in Self-Supervised Scene De-occlusion, segmentation is performed before executing the mask completion model, content completion model, and double completion model. Segmentation processing is included in the occluded region estimation model. By executing the occluded part estimation model, segmentation processing is executed before executing the mask complementation model, the content complementation model, and the double complementation model. Each of the mask complementation model, content complementation model, and double complementation model performs estimation using the results of segmentation.
  • FIG. 7 is a diagram showing an example of the hardware configuration of the learning device 1 in the embodiment.
  • the learning device 1 includes a control unit 11 including a processor 91 such as a CPU (Central Processing Unit) and a memory 92 connected via a bus, and executes a program.
  • the learning device 1 functions as a device including a control section 11, an input section 12, a communication section 13, a storage section 14, and an output section 15 by executing a program.
  • the processor 91 reads a program stored in the storage unit 14 and stores the read program in the memory 92.
  • the learning device 1 functions as a device including a control section 11, an input section 12, a communication section 13, a storage section 14, and an output section 15.
  • the control unit 11 controls the operations of various functional units included in the learning device 1.
  • the control unit 11 executes Self-Supervised Scene De-occlusion.
  • the control unit 11 controls, for example, the operation of the output unit 15 and causes the output unit 15 to output the execution result of Self-Supervised Scene De-occlusion.
  • the control unit 11 records, for example, various information generated by executing Self-Supervised Scene De-occlusion in the storage unit 14.
  • the various information stored in the storage unit 14 includes, for example, the results of execution of Self-Supervised Scene De-occlusion.
  • the input unit 12 includes input devices such as a mouse, a keyboard, and a touch panel.
  • the input unit 12 may be configured as an interface that connects these input devices to the learning device 1.
  • the input unit 12 receives input of various information to the learning device 1. For example, training data is input to the input unit 12 .
  • the training data may be image data of any image in which images appear.
  • however, when the usage scene of the trained occluded part estimation model is assumed in advance, it is desirable that images satisfying the conditions of the images expected to appear in the estimation target in that usage scene appear in the training images.
  • in that case, image data of such images is used as training data.
  • by doing so, the learned occluded part estimation model has high estimation accuracy when actually used in the assumed usage scene.
  • the learning device 1 itself can perform learning of the occluded part estimation model.
  • the communication unit 13 includes a communication interface for connecting the learning device 1 to an external device.
  • the communication unit 13 communicates with an external device via wire or wireless.
  • the external device is, for example, a device that is a source of training data.
  • the communication unit 13 acquires training data through communication with a transmission source of the training data.
  • the external device is, for example, the estimation device 2 described later.
  • the estimation device 2 is a device that performs estimation using a trained occluded part estimation model.
  • the communication unit 13 transmits the learned occluded part estimation model program to the estimating device 2 through communication with the estimating device 2 .
  • the storage unit 14 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device.
  • the storage unit 14 stores various information regarding the learning device 1.
  • the storage unit 14 stores information input via the input unit 12 or the communication unit 13, for example.
  • the storage unit 14 stores, for example, a shielded part estimation model.
  • the storage unit 14 stores various information generated by executing Self-Supervised Scene De-occlusion, for example.
  • the output unit 15 outputs various information.
  • the output unit 15 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, and an organic EL (Electro-Luminescence) display.
  • the output unit 15 may be configured as an interface that connects these display devices to the learning device 1.
  • the output unit 15 outputs information input to the input unit 12 or the communication unit 13, for example.
  • the output unit 15 may display the execution result of Self-Supervised Scene De-occlusion, for example.
  • FIG. 8 is a diagram showing an example of the configuration of the control unit 11 included in the learning device 1 in the embodiment.
  • the control unit 11 includes a training data acquisition unit 111, a learning unit 112, a storage control unit 113, and an output control unit 114.
  • the training data acquisition unit 111 acquires training data.
  • the training data acquisition unit 111 acquires training data input to the input unit 12 or the communication unit 13, for example.
  • the training data acquisition unit 111 may acquire training data by reading training data stored in the storage unit 14 in advance.
  • the learning unit 112 performs Self-Supervised Scene De-occlusion on the training data acquired by the training data acquisition unit 111.
  • the learning unit 112 executes the occluded part estimation model and updates the occluded part estimation model based on the execution result. The update is performed so that the accuracy of estimation by the occluded part estimation model is increased. That is, the learning unit 112 performs learning of the occluded part estimation model by executing Self-Supervised Scene De-occlusion.
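  • A minimal sketch of this learning loop (steps S201 to S203 of FIG. 9) is shown below; the three callables are placeholders for the training data acquisition unit 111, the Self-Supervised Scene De-occlusion procedure, and the learning end condition, respectively.

```python
def train_occluded_part_estimation_model(get_training_data, run_de_occlusion,
                                         learning_end_condition, max_steps=100_000):
    """Loop executed by the learning unit 112 (sketch; all callables are placeholders)."""
    for step in range(max_steps):
        batch = get_training_data()        # step S201: acquire training data
        run_de_occlusion(batch)            # step S202: Self-Supervised Scene De-occlusion
        if learning_end_condition(step):   # step S203: end condition satisfied?
            break                          # the model at this point is the trained model
```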
  • the storage control unit 113 records various information in the storage unit 14.
  • the output control section 114 controls the operation of the output section 15.
  • FIG. 9 is a flowchart showing an example of the flow of processing executed by the learning device 1 in the embodiment.
  • the training data acquisition unit 111 acquires training data (step S201).
  • the learning unit 112 performs Self-Supervised Scene De-occlusion on the obtained training data (Step S202).
  • the learning unit 112 determines whether the learning end condition is satisfied (step S203). If the learning end condition is satisfied (step S203: YES), the process ends.
  • the occluded part estimation model at the time when the learning end condition is satisfied is the trained occluded part estimation model. On the other hand, if the learning end condition is not satisfied (step S203: NO), the process returns to step S201.
  • the trained occluded part estimation model obtained in this way is used in the process of estimating the occluded part shown in the image indicated by the image data, based on the input image data.
  • An example of a device that executes such processing is the estimation device 2.
  • the estimation device 2 obtains the learned occluded part estimation model in advance by obtaining the learned occluded part estimation model from the learning device 1 through communication, for example, before executing the learned occluded part estimation model.
  • alternatively, the estimation device 2 may obtain the learned occluded part estimation model in advance before executing it by, for example, being equipped with a neural network that expresses the trained occluded part estimation model.
  • FIG. 10 is an explanatory diagram illustrating an overview of the estimation device 2 in the embodiment.
  • the estimation device 2 uses the trained occluded part estimation model obtained by the learning device 1 to estimate the occluded part appearing in the estimation target image. More specifically, the estimation device 2 receives the global amodal instance and executes segmentation annotation. The estimation device 2 generates an ordered directed graph for occlusion.
  • the estimation device 2 executes the learned PCNet-M.
  • PCNet-M is an example of a neural network expressing a mask completion model.
  • the estimation device 2 executes the learned PCNet-C.
  • PCNet-C is an example of a neural network expressing a content completion model.
  • the estimation device 2 outputs the occluded object of the target as a restored object.
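  • The inference flow of FIG. 10 can be sketched as follows; each callable is an assumed wrapper around the corresponding trained component, not a disclosed API.

```python
def estimate_occluded_parts(image, segment_annotate, build_order_graph,
                            run_pcnet_m, run_pcnet_c):
    """Sketch of the estimation device 2: segmentation annotation of global amodal
    instances -> ordered directed occlusion graph -> trained PCNet-M (mask completion)
    -> trained PCNet-C (content completion) -> restored objects."""
    instances = segment_annotate(image)                  # global amodal instances
    order = build_order_graph(instances)                 # ordered directed graph for occlusion
    amodal_masks = run_pcnet_m(image, instances, order)  # complete the invisible masks
    restored = run_pcnet_c(image, amodal_masks)          # fill completed regions with RGB content
    return restored
```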
  • FIG. 11 is a diagram showing an example of the hardware configuration of the estimation device 2 in the embodiment.
  • the estimation device 2 includes a control unit 21 including a processor 93 such as a CPU and a memory 94 connected via a bus, and executes a program.
  • the estimation device 2 functions as a device including a control section 21, an input section 22, a communication section 23, a storage section 24, and an output section 25 by executing a program.
  • the processor 93 reads the program stored in the storage unit 24 and stores the read program in the memory 94.
  • the estimation device 2 functions as a device including the control section 21, the input section 22, the communication section 23, the storage section 24, and the output section 25.
  • the control unit 21 controls the operations of various functional units included in the estimation device 2.
  • the control unit 21 executes the learned occluded part estimation model.
  • the control unit 21 controls, for example, the operation of the output unit 25 and causes the output unit 25 to output the execution result of the learned occluded part estimation model.
  • the control unit 21 records, for example, various types of information generated by executing the learned occluded part estimation model in the storage unit 24.
  • the various information stored in the storage unit 24 includes, for example, the execution results of the learned occluded part estimation model.
  • the input unit 22 includes input devices such as a mouse, a keyboard, and a touch panel.
  • the input unit 22 may be configured as an interface that connects these input devices to the estimation device 2.
  • the input unit 22 receives input of various information to the estimation device 2 .
  • image data on which a trained occluded region estimation model is to be executed is input to the input unit 22 .
  • the image data to be executed by the trained occluded part estimation model is the image data of the image to be estimated.
  • the communication unit 23 is configured to include a communication interface for connecting the estimation device 2 to an external device.
  • the communication unit 23 communicates with an external device via wire or wireless.
  • the external device is, for example, a device that is a source of image data of an image to be estimated.
  • the external device is, for example, the learning device 1.
  • the communication unit 23 may receive the trained occluded part estimation model from the learning device 1 through communication with the learning device 1 .
  • the storage unit 24 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device.
  • the storage unit 24 stores various information regarding the estimation device 2.
  • the storage unit 24 stores information input via the input unit 22 or the communication unit 23, for example.
  • the storage unit 24 stores, for example, a learned occluded part estimation model in advance before executing the learned occluded part estimation model.
  • the storage unit 24 stores, for example, various types of information generated by executing the learned occluded part estimation model.
  • the output unit 25 outputs various information.
  • the output section 25 is configured to include a display device such as a CRT display, a liquid crystal display, an organic EL display, or the like.
  • the output unit 25 may be configured as an interface that connects these display devices to the estimation device 2.
  • the output unit 25 outputs information input to the input unit 22 or the communication unit 23, for example.
  • the output unit 25 may display, for example, the execution result of the learned occluded part estimation model.
  • FIG. 12 is a diagram showing an example of the configuration of the control unit 21 included in the estimation device 2 in the embodiment.
  • the control unit 21 includes a target data acquisition unit 211, an estimation unit 212, a storage control unit 213, and an output control unit 214.
  • the target data acquisition unit 211 acquires image data on which the trained occluded part estimation model is to be executed.
  • the target data acquisition unit 211 acquires, for example, image data input via the input unit 22 or the communication unit 23 as image data to be executed by the learned occluded part estimation model.
  • the estimation unit 212 executes the learned occluded part estimation model on the image data acquired by the target data acquisition unit 211.
  • the estimation unit 212 estimates the occluded part appearing in the image indicated by the image data to be executed by executing the learned occluded part estimation model.
  • the storage control unit 213 records various information in the storage unit 24.
  • the output control section 214 controls the operation of the output section 25.
  • FIG. 13 is a flowchart illustrating an example of the flow of processing executed by the estimation device 2 in the embodiment.
  • the target data acquisition unit 211 acquires image data of an image to be estimated (step S301).
  • the estimation unit 212 executes the learned occluded part estimation model on the image data acquired in step S301 (step S302).
  • the output control unit 214 controls the operation of the output unit 25 to display the estimation result of the estimation unit 212 (step S303).
  • as described above, the learning device 1 of the embodiment performs self-supervised learning to obtain a mathematical model that performs both the acquisition of a directed graph restoring the order between adjacent objects and the completion of invisible parts, using the occlusion geometrical relationships of objects.
  • that is, a mathematical model is trained to estimate occluded areas. Therefore, the learning device 1 can improve the accuracy of estimating the portion of a partially occluded object that is covered by the occluder.
  • the learning device 1 of the embodiment configured in this way obtains a mathematical model in which processing of modal perception and amodal perception is performed through learning.
  • modal perception refers to the analysis of directly visible areas
  • amodal perception refers to the perception of the intact structure of an entity, including invisible areas. Therefore, the learning device 1 can improve the accuracy of estimating the portion of a partially occluded object that is covered by the occluder.
  • the estimation device 2 of the first embodiment configured as described above uses the trained occluded part estimation model obtained by the learning device 1 to estimate the occluded part reflected in the estimation target. Therefore, the estimation device 2 can improve the accuracy of estimating the part of the object that is partially obscured.
  • next, an application scene of the trained occluded part estimation model will be explained, followed by further explanation of the learning device 1 and the estimation device 2.
  • the trained occluded part estimation model is used, for example, in a factory that produces parts of a predetermined shape. In a factory, there are cases where it is necessary to estimate the occluded part of a part shown in an image as shown in FIG. 14 below, and in such a case, a trained occluded part estimation model is used.
  • FIG. 14 is a diagram showing an example of an image to be estimated in the embodiment.
  • the image in FIG. 14 shows three circular parts C1 to C3 inside a box. The three circular parts partially overlap. More specifically, the component C1 and the component C2 are located above the component C3, and the component C1 is located above the component C2.
  • Each of the bounding boxes B1 to B3 in FIG. 14 is the result of the bounding box regression used for component detection.
  • FIG. 15 is an explanatory diagram illustrating the bounding box in the embodiment.
  • FIG. 15 shows one bounding box B4.
  • circle C4 is inscribed in bounding box B4.
  • Circle C4 is the outline of the image reflected in the image.
  • the image in FIG. 15 is, for example, the outline of the part to be detected.
  • geometric information of an image, such as its size and shape, can be expressed mathematically using, for example, a polar coordinate system.
  • by using a rectangle (i.e., a bounding box), geometric information about the image can be obtained.
  • since circle C4 is inscribed in bounding box B4, the area of circle C4 is π/4 ≈ 0.785 times the area of bounding box B4.
  • Circle C4 is a circle in the example of FIG. 15 because it does not have a shielded part, but it is not necessarily a circle if the detection target has a shielded part.
  • the ratio of the area of the detection target to the area of the bounding box is 0.785, as described above. However, if the detection target has a shielded part, the ratio of the area of the detection target to the area of the bounding box is not necessarily 0.785.
  • the ratio of the area of the detection target to the area of the bounding box will be referred to as the detection area ratio.
  • the geometric information of the occluded part is, for example, the occlusion rate.
  • the occlusion rate is the ratio of the area of the occluded part to the area of the bounding box. Both the detection area ratio and the occlusion rate are examples of geometric information possessed by the detection target.
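  • The following small numeric sketch illustrates the detection area ratio and the occlusion rate for an inscribed circle; the helper names are placeholders, and for simplicity the bounding box is assumed to still cover the full part even when part of it is hidden.

```python
import math


def detection_area_ratio(visible_area, bbox_area):
    """Ratio of the detection target's visible area to its bounding-box area."""
    return visible_area / bbox_area


def occlusion_rate(occluded_area, bbox_area):
    """Ratio of the occluded part's area to the bounding-box area."""
    return occluded_area / bbox_area


# A circle of radius r inscribed in its square bounding box of side 2r.
r = 1.0
bbox_area = (2 * r) ** 2                     # 4 * r^2
circle_area = math.pi * r ** 2               # pi * r^2
print(detection_area_ratio(circle_area, bbox_area))          # pi/4 = 0.785...

# If 30% of the circle is hidden, the visible ratio drops below 0.785
# and the difference shows up as the occlusion rate.
hidden_area = 0.3 * circle_area
print(detection_area_ratio(circle_area - hidden_area, bbox_area))  # about 0.550
print(occlusion_rate(hidden_area, bbox_area))                      # about 0.236
```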
  • by using such geometric information, the accuracy of estimating the occluded part can be further improved. For example, for the above-mentioned components C1 and C2, when the respective detection area ratios are obtained, it can be determined that the one with the larger detection area ratio covers the one with the smaller detection area ratio.
  • the occluded part estimation model is, for example, a mathematical model that also estimates the geometric information of the image appearing in the estimation target image, and the learning unit 112 updates the occluded part estimation model based also on the geometric information estimated by the occluded part estimation model. That is, the learning unit 112 performs learning of the occluded part estimation model using also the geometric information of the image appearing in the estimation target image.
  • the trained occluded part estimation model used by the estimation device 2 may also perform estimation using the geometric information of the image appearing in the estimation target image.
  • An image data generation support process will be described as an example of a technique for supporting the generation of image data of an image that satisfies the above-mentioned conditions for an image that appears in an image to be estimated.
  • the conditions of the image reflected in the estimation target image are, for example, geometric information such as the detection area ratio.
  • in the image data generation support process, a predetermined function (hereinafter referred to as the "image function") that indicates the shape of the image appearing in the image data to be generated and has one or more parameters is used.
  • the values of the parameters of the image function follow a predetermined probability distribution.
  • One of the parameters of the image function is a value indicating geometric information such as detection area ratio.
  • Image data generation support processing generates various image data with different geometric information by changing the values of image function parameters according to a predetermined probability distribution each time image data is generated. .
  • the image function is, for example, a function that includes only parameters indicating the size of the figure, such as the detection area ratio, and that expresses a figure of constant shape regardless of the values of those parameters.
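  • As an illustration of the image data generation support process, the sketch below uses a circle as the figure expressed by the image function and samples its detection area ratio from a uniform distribution each time an image is generated; the shape, the distribution, and all names are assumptions, not details of the disclosure.

```python
import numpy as np


def render_circle_image(ratio, image_size=128):
    """Hypothetical 'image function': renders one circle whose detection area ratio
    relative to a fixed square bounding box equals `ratio` (0 < ratio <= pi/4)."""
    img = np.zeros((image_size, image_size), dtype=np.uint8)
    bbox_side = image_size // 2                        # side length of the bounding box
    radius = np.sqrt(ratio * bbox_side ** 2 / np.pi)   # area ratio -> radius
    cy = cx = image_size // 2
    yy, xx = np.ogrid[:image_size, :image_size]
    img[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 255
    return img


def generate_dataset(n_images, seed=0):
    """Sample the image-function parameter from a predetermined probability
    distribution (here uniform, as an assumption) for each generated image."""
    rng = np.random.default_rng(seed)
    ratios = rng.uniform(0.2, np.pi / 4, size=n_images)
    return [render_circle_image(r) for r in ratios]
```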
  • the neural network expressing the mask complementation model may be an object detection convolutional neural network such as YOLOv4.
  • the neural network expressing the content complementation model may be YOLOv4.
  • the neural network expressing the occluded part estimation model may be YOLOv4.
  • FIG. 16 is a diagram showing an example of the configuration of the learning section 112 included in the learning device 1 in the embodiment.
  • the learning unit 112 includes, for example, an object detection unit 121, an instance segmentation unit 122, a comparison unit 123, a conversion unit 124, an amodal information determination unit 125, an occlusion generation unit 126, a mask complementation model execution unit 127, a content complementation model execution unit 128, It includes a success/failure determination section 129, an amodal information inspection section 130, an occlusion rate modification section 131, and a shape determination section 132.
  • the object detection unit 121 executes the process of step S102, for example.
  • the process in step S101 is executed by, for example, the training data acquisition unit 111.
  • the instance segmentation unit 122 executes, for example, the process of step S103.
  • the comparison unit 123 executes the process of step S104, for example.
  • the conversion unit 124 executes the process of step S105, for example.
  • the amodal information determination unit 125 executes, for example, the process of step S106.
  • the occlusion generation unit 126 executes the process of step S107, for example.
  • the mask complementary model execution unit 127 executes, for example, the process of step S108.
  • the content complementation model execution unit 128 executes, for example, the process of step S109.
  • the success/failure determination unit 129 executes, for example, the process of step S110.
  • the amodal information inspection unit 130 executes, for example, the process of step S111.
  • the occlusion rate correction unit 131 executes, for example, the process of step S112.
  • the shape determination unit 132 executes, for example, the process of step S113.
  • the learning device 1 may be implemented using a plurality of information processing devices communicatively connected via a network.
  • each functional unit included in the learning device 1 may be distributed and implemented in a plurality of information processing devices.
  • estimation device 2 may be implemented using a plurality of information processing devices that are communicably connected via a network.
  • each functional unit included in the estimation device 2 may be distributed and implemented in a plurality of information processing devices.
  • the estimated occlusion object detection device 100 may be implemented using a plurality of information processing devices communicatively connected via a network.
  • in this case, each functional unit included in the estimated occlusion object detection device 100 may be distributed and implemented in a plurality of information processing devices.
  • the learning device 1 and the estimation device 2 do not necessarily need to be implemented as different devices, and may be implemented as one device that includes the functions of the learning device 1 and the functions of the estimation device 2.
  • the control unit 11 and the control unit 21 may be implemented as one control unit. That is, each functional unit provided in the control unit 11 and each functional unit provided in the control unit 21 may be implemented in one control unit instead of being implemented in different control units.
  • as described above, the occlusion estimation object detection device 100 of the present invention is a device for estimating one or more occluded parts of an object whose one or more parts are occluded by one or more occluding objects.
  • the occluded parts are segmented, and the device includes the learning unit 112, which performs self-supervised learning of a mathematical model that, in the process of estimating a segmented occluded part, estimates the positional relationship between images appearing in the estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image.
  • control unit 11 calculates the number of occlusions and the number of occluded objects in the segmentation of the occluded portion, performs global amodal conversion, and inputs annotation in the amodal information format.
  • the control unit 11 may generate the mask complementation model after obtaining the order of the segmented images in the amodal information format annotation, and may then apply a convolutional neural network.
  • the control unit 11 may apply a convolutional neural network using the order of the segmented images or the occlusion rate in the amodal information format annotation.
  • all or part of each function of the occlusion estimation object detection device 100, the learning device 1, and the estimation device 2 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array).
  • the program may be recorded on a computer-readable recording medium.
  • the computer-readable recording medium is, for example, a portable medium such as a flexible disk, magneto-optical disk, ROM, or CD-ROM, or a storage device such as a hard disk built into a computer system.
  • the program may be transmitted via a telecommunications line.
  • amodal completion means non-model completion.
  • modal completion means model completion.
  • global amodal completion means amodal completion that targets a large space.
  • 100...Occlusion estimation object detection device, 1...Learning device, 2...Estimation device, 11...Control unit, 12...Input unit, 13...Communication unit, 14...Storage unit, 15...Output unit, 111...Training data acquisition unit, 112...Learning unit, 113...Storage control unit, 114...Output control unit, 21...Control unit, 22...Input unit, 23...Communication unit, 24...Storage unit, 25...Output unit, 211...Target data acquisition unit, 212...Estimation unit, 213...Storage control unit, 214...Output control unit, 121...Object detection unit, 122...Instance segmentation unit, 123...Comparison unit, 124...Conversion unit, 125...Amodal information determination unit, 126...Occlusion generation unit, 127...Mask complementation model execution unit, 128...Content complementation model execution unit, 129...Success/failure determination unit, 130...Amodal information inspection unit, 131...Occlusion rate correction unit, 132...Shape determination unit, 91...Processor, 92...Memory, 93...Processor, 94...Memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Provided is an occlusion-inference object detection device that infers one or more occluded portions of an object that are occluded by one or more occluding objects, said device comprising a learning unit that trains a mathematical model by self-supervised learning; in a process of inferring an occluded portion resulting from segmentation of occluded portions, the mathematical model infers a positional relationship between objects appearing in an image undergoing inference and, on the basis of the positional relationship, infers an occluded part of an occluded object, which is an object appearing in the image undergoing inference and having an occluded portion; the learning unit also uses geometrical information pertaining to the object appearing in the image undergoing inference to train the mathematical model.

Description

Estimated occlusion object detection device, estimated occlusion object detection method, and program
The present invention relates to an estimated occlusion object detection device, an estimated occlusion object detection method, and a program.
This application claims priority based on Japanese Patent Application No. 2022-035212 filed in Japan on March 8, 2022, the contents of which are incorporated herein.
In the field of image recognition, deep learning, which automatically acquires the feature extraction process through learning, has been in the spotlight since the 2010s. Image recognition using deep learning has achieved overwhelming results in general object recognition compared to earlier methods. In recent years, image recognition technology with near-human visual recognition based on deep learning has been put to practical use in various fields such as surveillance cameras, autonomous driving, and robotics, where understanding the safety of the surrounding environment and detecting obstacles are required.
Japanese Patent Application Publication No. 2011-186633 (特開2011-186633号公報)
However, in the case of an image of a partially occluded object, some of the object's features are lost, and image recognition may become difficult. That is, there are cases where the part of a partially occluded object covered by the occluder cannot be estimated, or where the accuracy of estimating that part is poor, making image recognition difficult.
In view of the above circumstances, an object of the present invention is to provide a technique that improves the accuracy of estimating the part of a partially occluded object that is covered by the occluding object.
One aspect of the present invention is an occlusion estimation object detection device for estimating one or more occluded parts of an object whose one or more parts are occluded by one or more occluding objects, in which the occluded parts are segmented. The device comprises a learning unit that performs self-supervised learning of a mathematical model which, in the process of estimating a segmented occluded part, estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. The learning unit performs learning of the mathematical model using also geometric information possessed by the images appearing in the estimation target image.
Another aspect of the present invention is an occlusion estimation object detection device comprising: a target data acquisition unit that acquires image data; and an estimation unit that estimates an occluded part of an image appearing in the image of the image data acquired by the target data acquisition unit, by executing a trained mathematical model obtained by a learning unit that performs self-supervised learning of a mathematical model which estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. The learning unit performs learning of the mathematical model using also geometric information possessed by the images appearing in the estimation target image.
Another aspect of the present invention is an occlusion estimation object detection method having a learning step of performing self-supervised learning of a mathematical model which estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. In the learning step, the mathematical model is learned using also geometric information possessed by the images appearing in the estimation target image.
Another aspect of the present invention is an occlusion estimation object detection method having: a target data acquisition step of acquiring image data; and an estimation step of estimating an occluded part of an image appearing in the image of the image data acquired in the target data acquisition step, by executing a trained mathematical model obtained through self-supervised learning of a mathematical model which estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. The learning is learning of the mathematical model that also uses geometric information possessed by the images appearing in the estimation target image.
Another aspect of the present invention is a program for causing a computer to function as the above-mentioned occlusion estimation object detection device.
 本発明により、一部が遮蔽された物体の遮蔽物に覆われた部位の推定の精度を向上させることが可能となる。 According to the present invention, it is possible to improve the accuracy of estimating the part of an object that is partially obscured.
FIG. 1 is a diagram showing an example of the overall flowchart of the present invention.
FIG. 2 is a diagram showing an example of a flowchart for performing object detection segmentation, which is a component of the present invention.
FIG. 3 is a diagram showing an example of a flowchart for the mathematical modeling that performs amodal labeling and correction of the occlusion rate using geometric information, which is a characteristic component of the present invention.
FIG. 4 is an explanatory diagram illustrating an overview of the occlusion-inference object detection device of the embodiment.
FIG. 5 is a first explanatory diagram illustrating an algorithm executed by the learning device in the embodiment.
FIG. 6 is a second explanatory diagram illustrating an algorithm executed by the learning device in the embodiment.
FIG. 7 is a diagram showing an example of the hardware configuration of the learning device in the embodiment.
FIG. 8 is a diagram showing an example of the configuration of the control unit included in the learning device in the embodiment.
FIG. 9 is a flowchart showing an example of the flow of processing executed by the learning device in the embodiment.
FIG. 10 is an explanatory diagram illustrating an overview of the estimation device in the embodiment.
FIG. 11 is a diagram showing an example of the hardware configuration of the estimation device in the embodiment.
FIG. 12 is a diagram showing an example of the configuration of the control unit included in the estimation device in the embodiment.
FIG. 13 is a flowchart showing an example of the flow of processing executed by the estimation device in the embodiment.
FIG. 14 is a diagram showing an example of an estimation target image in the embodiment.
FIG. 15 is an explanatory diagram illustrating a bounding box in the embodiment.
FIG. 16 is a diagram showing an example of the configuration of the learning unit included in the learning device in the embodiment.
(Embodiment)
FIG. 1 is a diagram showing an example of the overall flowchart of the present invention. FIG. 2 is a diagram showing an example of a flowchart for performing object detection segmentation, which is a component of the present invention. FIG. 3 is a diagram showing an example of a flowchart for performing annotation, which is a component of the present invention; that is, FIG. 3 shows an example of a flowchart for the mathematical modeling that performs amodal labeling and correction of the occlusion rate using geometric information, which is a characteristic component of the present invention.
The present invention determines, at a predetermined cycle, whether an image satisfying a predetermined condition regarding its format has been input (step S101). The predetermined format condition is, for example, that the image is 640 pixels by 480 pixels. If no image satisfying the predetermined format condition has been input (step S101: NO), the process returns to step S101. On the other hand, if an image satisfying the predetermined format condition has been input (step S101: YES), the present invention detects, among the images appearing in the image acquired in step S101, the images that satisfy a predetermined condition (step S102). Hereinafter, a detected image is referred to as a detection object. The predetermined condition is, for example, a condition input to the present invention by the user.
Next, the present invention performs segmentation on the image acquired in step S101 (step S103). The present invention then determines whether the number of masks is equal to the number of detection objects (step S104). Specifically, the present invention makes this determination based on the segmentation results.
If they are not equal (step S104: NO), the process returns to step S104. On the other hand, if they are equal (step S104: YES), the present invention converts the segmentation model to a global amodal one (step S105).
Next, the present invention determines whether the amodal information is correct based on the global amodal result (step S106). If the amodal information is correct (step S106: YES), the present invention acquires information indicating the occlusion order (step S107). Next, the present invention executes a mask completion model such as PCNet-M on the image input in step S101 (step S108).
Next, the present invention executes a content completion model such as PCNet-C on the result of executing the mask completion model on the image input in step S101 (step S109). The present invention then determines, based on the result of step S109, whether the occlusion order has been correctly recovered (step S110). If the occlusion order has been correctly recovered (step S110: YES), the processing ends. On the other hand, if the occlusion order has not been correctly recovered (step S110: NO), the process returns to step S108.
On the other hand, if the amodal information is not correct in step S106 (step S106: NO), the present invention inspects the amodal information (step S111). That is, the present invention checks whether an unlabeled object has an amodal-information label, and if not, performs amodal labeling using the occlusion rate and geometric information. Next, the present invention corrects the occlusion rate based on the result of inspecting the amodal information (step S112). That is, the present invention corrects the occlusion rate using geometric information; correcting the occlusion rate also corrects the calculation ratio. Next, the present invention determines whether the object has a predetermined specific shape (step S113). If it does not have the specific shape (step S113: NO), the process returns to step S111. If it has the specific shape (step S113: YES), the process returns to step S105.
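For readers who prefer code to flowcharts, the sketch below restates steps S101 to S113 as ordinary control flow in Python. It is only an illustrative summary of the description above: every method on the helper object `h` is a hypothetical placeholder, not an API defined by this disclosure, and the unbounded loop-backs of the flowchart (S110 to S108, S113 to S105 or S111) are simplified to bounded retries so that the sketch always terminates.

```python
def process_image(image, h, max_retries=5):
    """Compact restatement of steps S101-S113 of FIGS. 1 to 3.
    All methods on `h` are hypothetical placeholders for the processing
    described in the text; loop-backs are simplified to bounded retries."""
    if not h.has_valid_format(image):                      # S101 (e.g. 640x480)
        return None
    detections = h.detect_objects(image)                   # S102
    masks = h.segment(image)                               # S103
    if len(masks) != len(detections):                      # S104
        return None
    for _ in range(max_retries):
        global_amodal = h.to_global_amodal(masks)          # S105
        if not h.amodal_info_is_correct(global_amodal):    # S106: NO
            h.label_unlabeled_amodal(global_amodal)        # S111 (amodal labeling)
            h.correct_occlusion_rate(global_amodal)        # S112 (uses geometry)
            if h.has_specific_shape(global_amodal):        # S113: YES -> back to S105
                continue
            return None                                    # S113: NO (simplified)
        order = h.occlusion_order(global_amodal)           # S107
        completed_masks = h.run_mask_completion(image, order)              # S108 (e.g. PCNet-M)
        completed_scene = h.run_content_completion(image, completed_masks) # S109 (e.g. PCNet-C)
        if h.order_recovered(completed_scene):             # S110
            return completed_scene
    return None
```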
FIG. 4 is an explanatory diagram illustrating an overview of the occlusion-inference object detection device 100 of the embodiment. The occlusion-inference object detection device 100 is an example of the present invention; for example, the occlusion-inference object detection device 100 executes the flowcharts shown in FIGS. 1 to 3. The occlusion-inference object detection device 100 includes a learning device 1 and an estimation device 2. FIG. 4 is also an explanatory diagram illustrating an overview of the algorithm executed by the learning device 1. Prior to describing the learning device 1, Self-Supervised Scene De-occlusion, which is the algorithm executed by the learning device 1 and an algorithm for scene de-occlusion, will be described with reference to FIGS. 5 and 6 in addition to FIG. 4. FIG. 5 is a first explanatory diagram illustrating the algorithm executed by the learning device 1 in the embodiment; more specifically, FIG. 5 shows the image G1 of FIG. 4 in more detail. FIG. 6 is a second explanatory diagram illustrating the algorithm executed by the learning device 1 in the embodiment.
<Self-Supervised Scene De-occlusion>
Self-Supervised Scene De-occlusion is an algorithm that completes occluded parts. It aims to recover the occlusion order and to complete the invisible parts of objects covered by occluders. It is also a self-supervised learning framework that tackles de-occlusion on real-world data without manually annotating occlusion orders or amodal masks. The framework performs occlusion order recovery and completes amodal masks and the content of occluded regions.
In order to cope with the absence of manual annotations of the occlusion order and amodal masks, the learning device 1 partially completes instances in a self-supervised manner using two mathematical models described later: a mask completion model and a content completion model. The mathematical models are concretely expressed by neural networks. Note that the term "amodal completion" mentioned above refers to recognizing an occluded, invisible part by mentally filling it in.
Given an image, the modal masks of the objects can be obtained using an off-the-shelf instance segmentation framework; as a technique for obtaining such masks, deep learning such as Mask R-CNN is used, for example. However, the corresponding amodal masks are not available, and since it is not even known whether the obtained masks are intact, learning to complete occluded instances is very difficult. For this reason, Self-Supervised Scene De-occlusion performs partial completion in a self-supervised manner.
The motivation is as follows. Suppose that the modal mask of an instance constitutes a pixel set M, and let G be its ground-truth amodal mask. A supervised approach solves the full completion problem of equation (1) below.
f_θ(M) = G    ... (1)
Here, f_θ denotes the full completion model. This completion process can be decomposed as in equation (2) below.
f_θ(M) = p_θ(p_θ( ... p_θ(M) ... )) = G    ... (2)
When the instance is covered by multiple occluders, M_k denotes an intermediate state of this decomposition, and p_θ denotes the partial completion model.
The mask completion model will now be described. An example of a neural network expressing the mask completion model is "PCNet-M" (see, for example, FIG. 4). First, training data is prepared. Given an instance A and its mask M_A from a dataset D with instance-level annotations, another instance B is randomly sampled from D and randomly placed to obtain a mask M_B. Here, M_A and M_B are regarded as sets of pixels. There are two cases of input, and different inputs are fed to the network.
The first case corresponds to the partial completion strategy. M_B is defined as the eraser. The mask completion model uses B to erase part of A, obtaining M_AoutB. The mask completion model then learns to restore the original modal mask M_A from M_AoutB, conditioned on M_B. In the following description, M_AoutB is written using the symbol of equation (3) below.
M_{A\B}    ... (3)
Note that AoutB denotes the set difference between A and B. The second case is a regularization that prevents the network from over-completing an instance when that instance is not occluded. Specifically, the eraser is taken to be the mask M_{B\A}, which does not intrude into A. In this case, the mask completion model is encouraged to keep the original mask M_A unchanged, conditioned on that eraser. Without case 2, the mask completion model would always be encouraged to increase the number of pixels, and could therefore over-complete an instance even when it is not surrounded by neighboring instances.
In either case, the erased image patch serves as an auxiliary input. The loss functions are formulated as in equations (4) and (5) below.
L_1 = L(P_θ^(m)(M_{A\B}, M_B, I), M_A)    ... (4)
L_2 = L(P_θ^(m)(M_A, M_{B\A}, I), M_A)    ... (5)
Here, P_θ^(m)(·) denotes the mask completion model, θ the parameters to be optimized, I the image patch, and L the binary cross-entropy loss. The final loss function is defined as follows.
L^(m) = γ·L_1 + (1 − γ)·L_2    ... (6)
Here, γ denotes the probability of selecting the first case. By randomly switching between the two cases, the network comes to understand the ordering relationship from the shapes and boundaries of two neighboring instances and to decide whether or not to complete an instance.
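The following is a minimal sketch of how one training pair for this kind of partial mask completion could be assembled, with the two cases switched with probability γ. It assumes plain numpy boolean masks and a toy stand-in prediction; the names make_training_case and binary_cross_entropy are illustrative assumptions and do not come from this disclosure or from the PCNet-M implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_case(mask_a, mask_b, gamma=0.8):
    """Assemble one partial-completion training triple (input, eraser, target).
    mask_a, mask_b: HxW boolean modal masks of instances A and B."""
    if rng.random() < gamma:
        # Case 1: B erases part of A; the model must recover the full M_A
        # from M_{A\B}, conditioned on the eraser M_B.
        input_mask = mask_a & ~mask_b      # M_{A\B}
        eraser = mask_b                    # M_B
    else:
        # Case 2 (regularization): the eraser M_{B\A} does not intrude into A,
        # so the model must learn to keep M_A unchanged.
        input_mask = mask_a                # M_A
        eraser = mask_b & ~mask_a          # M_{B\A}
    target = mask_a                        # the target is M_A in both cases
    return input_mask, eraser, target

def binary_cross_entropy(pred, target, eps=1e-7):
    """Per-pixel binary cross-entropy, the loss L referred to above."""
    pred = np.clip(pred, eps, 1 - eps)
    t = target.astype(float)
    return float(-(t * np.log(pred) + (1 - t) * np.log(1 - pred)).mean())

# Toy example with two overlapping square masks on a 16x16 grid.
mask_a = np.zeros((16, 16), dtype=bool); mask_a[2:10, 2:10] = True
mask_b = np.zeros((16, 16), dtype=bool); mask_b[6:14, 6:14] = True
inp, eraser, target = make_training_case(mask_a, mask_b)
# Stand-in "prediction"; a real PCNet-M would be a CNN fed (inp, eraser, image patch).
pred = inp.astype(float) * 0.9 + 0.05
print("toy loss:", round(binary_cross_entropy(pred, target), 3))
```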
The content completion model will now be described. The content completion model follows the same intuition as the mask completion model, but what it completes is RGB content. As shown in FIG. 5 (or image G1 of FIG. 4), the input instances A and B are the same as for the mask completion model. The image pixels in the region M_AandB are erased, and the content completion model aims to predict the missing content. Note that AandB denotes equation (7) below; that is, AandB denotes the intersection of A and B in set theory.
A ∩ B    ... (7)
Accordingly, M_AandB stands for equation (8) below.
M_{A∩B} = M_A ∩ M_B    ... (8)
The content completion model also takes in the remaining mask of A (A\B) in order to indicate that the region to be filled belongs to A and not to some other object. It therefore cannot simply be replaced by a standard image inpainting approach. In this case, the loss of the content completion model to be minimized is formulated as in equation (9) below. A\B denotes the set difference between set A and set B.
L^(c) = L(P_θ^(c)(I ⊙ (1 − M_{A∩B}), M_{A\B}, M_{A∩B}), I)    ... (9)
Here, P_θ^(c) denotes the content completion model, I the image patch, and L a loss function composed of losses commonly used in image inpainting, including the ℓ1 loss, a perceptual loss, and an adversarial loss. As with the mask completion model, training the content completion model through partial completion makes full completion possible.
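For concreteness, the sketch below assembles the three inputs that a content completion model of this kind receives for one training pair: the image with the region A∩B erased, the remaining mask A\B (which tells the model that the region to be filled belongs to A), and the erased region itself. The array shapes and the helper name are assumptions made for this example.

```python
import numpy as np

def content_completion_inputs(image, mask_a, mask_b):
    """Prepare (erased_image, remaining_mask, erased_mask) for one training pair.
    image: HxWx3 float array; mask_a, mask_b: HxW boolean modal masks."""
    erased_region = mask_a & mask_b                 # M_{A and B}, pixels to predict
    remaining = mask_a & ~mask_b                    # A\B, marks the region as part of A
    erased_image = image * (~erased_region)[..., None].astype(image.dtype)
    return erased_image, remaining, erased_region

# Toy example with a flat gray image and two overlapping square masks.
image = np.full((16, 16, 3), 0.5, dtype=np.float32)
mask_a = np.zeros((16, 16), dtype=bool); mask_a[2:10, 2:10] = True
mask_b = np.zeros((16, 16), dtype=bool); mask_b[6:14, 6:14] = True
erased_image, remaining, erased = content_completion_inputs(image, mask_a, mask_b)
print(remaining.sum(), "visible A pixels;", erased.sum(), "pixels to fill with RGB values")
```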
Dual completion for order recovery will now be described. The target ordering graph consists of pairwise occlusion relationships between all neighboring instance pairs. A neighboring instance pair is defined as two instances whose modal masks are connected, so that one may be an occluder of the other. As shown in FIG. 5, given a pair of neighboring instances A1 and A2, the modal mask M_A1 of A1 is first taken as the target of completion. M_A2 plays the role of the eraser, yielding the increment of A1, namely Δ_{A1|A2}. Conversely, the increment of A2 conditioned on A1, namely Δ_{A2|A1}, is obtained. The instance that gains the larger increment through partial completion is considered to be the occludee. Therefore, the order between A1 and A2 is determined by comparing the respective increments as follows.
Δ_{A1|A2} = P_θ^(m)(M_{A1}, M_{A2}, I) \ M_{A1}    ... (10)
Δ_{A2|A1} = P_θ^(m)(M_{A2}, M_{A1}, I) \ M_{A2}    ... (11)
A1 occludes A2 if |Δ_{A2|A1}| > |Δ_{A1|A2}|, and A2 occludes A1 otherwise    ... (12)
Note that, in the following, the symbol of equation (13) denotes M_{A1}, the symbol of equation (14) denotes M_{A2}, the symbol of equation (15) denotes Δ_{A1|A2}, and the symbol of equation (16) denotes Δ_{A2|A1}.
Here, the relation written as equation (17) indicates that A1 covers A2.
If A1 and A2 are not neighboring, equation (18) below is satisfied; that is, neither instance gains an increment from the other.
Δ_{A1|A2} = Δ_{A2|A1} = ∅    ... (18)
Performing this dual completion for all neighboring pairs yields the occlusion order of the scene, which is represented as a graph such as the one shown in image G2-1 within image G2 of FIG. 4. The nodes of the graph represent objects, and the edges indicate the direction of occlusion between neighboring objects. The graph of image G2-1 is the graph obtained for the image of image G2-2. Image G2-3 within image G2 of FIG. 4 is a figure in which, after completion, all of the completed object masks are displayed together in a single figure. Image G2-4 within image G2 of FIG. 4 is an example of a result obtained based on the graph of image G2-1, and shows the result of completing the RGB values of one of the objects appearing in image G2-2, also using the result shown in image G2-3.
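The sketch below illustrates the bookkeeping of the dual completion comparison and of building the ordering graph, using plain numpy masks. The function partial_complete is only a trivial stand-in that exists so the example runs; a real system would call the trained mask completion model (PCNet-M) at that point.

```python
import numpy as np

def partial_complete(target_mask, eraser_mask):
    """Trivial stand-in for a trained PCNet-M: 'complete' the target by letting
    it grow one pixel downward into the eraser.  Illustrative only."""
    grown = target_mask | np.roll(target_mask, 1, axis=0)
    return target_mask | (grown & eraser_mask)

def pairwise_order(mask_1, mask_2):
    """+1 if instance 1 occludes instance 2, -1 for the reverse, 0 if tied.
    The instance whose mask gains the larger increment is taken as the occludee."""
    inc_1 = partial_complete(mask_1, mask_2) & ~mask_1   # increment of A1 given A2
    inc_2 = partial_complete(mask_2, mask_1) & ~mask_2   # increment of A2 given A1
    if inc_2.sum() > inc_1.sum():
        return +1                                        # A1 covers A2
    if inc_1.sum() > inc_2.sum():
        return -1                                        # A2 covers A1
    return 0                                             # no order (e.g. not neighbors)

def ordering_graph(masks):
    """Directed occlusion graph as an adjacency dict: edge i -> j means i occludes j."""
    graph = {i: [] for i in range(len(masks))}
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            order = pairwise_order(masks[i], masks[j])
            if order > 0:
                graph[i].append(j)
            elif order < 0:
                graph[j].append(i)
    return graph

# Toy usage: mask_b sits directly below mask_a, so under the stand-in heuristic
# mask_a gains the larger increment, is treated as the occludee, and the graph
# records that instance 1 occludes instance 0.
mask_a = np.zeros((12, 12), dtype=bool); mask_a[2:6, 3:9] = True
mask_b = np.zeros((12, 12), dtype=bool); mask_b[6:10, 3:9] = True
print(ordering_graph([mask_a, mask_b]))   # {0: [], 1: [0]}
```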
Amodal and content completion will now be described. After the ordering graph has been estimated, order-grounded amodal completion can be performed. Suppose that an instance A needs to be completed. First, all of its ancestors on the graph, that is, its occluders, are found by breadth-first search (BFS). Since the graph is not necessarily acyclic, the BFS algorithm is adapted accordingly. The learned mask completion model generalizes to using the union of all ancestors as the eraser, so there is no need to iterate over the ancestors and apply the mask completion model repeatedly to complete A piece by piece. Instead, amodal completion is performed in a single step, conditioned on the union of the modal masks of all ancestors. Let the ancestors of A be the set of equation (19) below, and perform amodal completion as in equations (20) and (21) below.
{A_anc_1, A_anc_2, ..., A_anc_n}    ... (19)
M_anc = M_anc_1 ∪ M_anc_2 ∪ ... ∪ M_anc_n    ... (20)
M̂_A = P_θ^(m)(M_A, M_anc, I)    ... (21)
Here, the symbol of equation (22) denotes the resulting completed amodal mask M̂_A, and the symbol of equation (23) denotes the amodal mask M_anc_i of the i-th ancestor.
An example is shown in image G3 of FIG. 6. Next, the occluded content is completed. As shown in image G4 of FIG. 6, the intersection of the predicted amodal mask and the ancestor masks, M̂_A ∩ M_anc, indicates the missing portion of A and serves as the eraser for the content completion model. The learned content completion model is then applied to fill in the content as in equations (24) and (25) below. Note that FIG. 7 shows that the ancestor term indicated by equation (26) below is obtained using the "recovery complete mask" image of image G3.
M_erase = M̂_A ∩ M_anc    ... (24)
C_A = P_θ^(c)(I ⊙ (1 − M_erase), M̂_A \ M_erase, M_erase)    ... (25)
Note that the symbol of equation (26) below represents the ancestor term M̂_A ∩ M_anc used above, that is, the intersection of the predicted amodal mask and the union of the ancestor masks.
M̂_A ∩ M_anc    ... (26)
Here, C_A is the content of A decomposed from the scene. For the background content, the union of all foreground instances is used as the eraser. Unlike occlusion-unaware image inpainting, content completion is performed on the estimated occluded region.
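The ancestor search used for this one-step amodal completion can be sketched as follows. The graph format is the adjacency dictionary produced in the earlier ordering-graph sketch (an assumption of this example, not a format defined by the disclosure); the visited set keeps the search safe even if the graph contains cycles. The union of the returned ancestors' masks is what would be passed, once, to the mask completion model as in equations (20) and (21).

```python
from collections import deque

def ancestors(graph, target):
    """Collect all occluders (ancestors) of `target` by breadth-first search.

    `graph` maps each node to the list of nodes it occludes, so the edges are
    walked in reverse here.  The visited set also handles cyclic graphs."""
    reverse = {node: [] for node in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse[dst].append(src)          # src occludes dst
    seen, queue, result = {target}, deque([target]), []
    while queue:
        node = queue.popleft()
        for occluder in reverse[node]:
            if occluder not in seen:
                seen.add(occluder)
                result.append(occluder)
                queue.append(occluder)
    return result

# Usage with the toy graph {0: [1], 1: [2], 2: []} (0 occludes 1, 1 occludes 2):
print(ancestors({0: [1], 1: [2], 2: []}, 2))   # -> [1, 0]
```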
In this way, Self-Supervised Scene De-occlusion is an autocorrelation deep learning algorithm for obtaining a mathematical model that performs both the acquisition of a directed graph recovering the order between neighboring objects and the completion of invisible parts using the occlusion geometry of the objects. The ordering graph described above is an example of such a directed graph recovering the order between neighboring objects. An object here means an image appearing in the estimation target image.
Accordingly, the learning device 1 that executes Self-Supervised Scene De-occlusion updates the occluded part estimation model using a self-supervised learning method. The occluded part estimation model is a mathematical model that estimates the positional relationships among the images appearing in an estimation target image and, based on those positional relationships, estimates the occluded part of the image of an object appearing in the estimation target image that is partially occluded. Hereinafter, a partially occluded object is referred to as an occluded object, and the occluded portion of an occluded object is referred to as an occluded part. Note that estimating a part specifically means, for example, estimating the image of that part. The occluded part estimation model is therefore a mathematical model that includes the mask completion model, the content completion model, and a dual completion model described later.
The mathematical model is updated until a predetermined termination condition for learning (hereinafter referred to as the "learning termination condition") is satisfied. The occluded part estimation model at the time the learning termination condition is satisfied (that is, the trained model) is used to estimate the images of occluded parts appearing in various estimation target images, such as an estimation target image selected by the user or an estimation target image satisfying a predetermined condition.
The dual completion model is a mathematical model that estimates the positional relationships among the images appearing in the estimation target image, based on the estimation results of the mask completion model. The above-described directed graph recovering the order between neighboring objects is an example of such positional relationships. The dual completion model is therefore, for example, a mathematical model that executes the dual completion processing described above.
More specifically, the occluded part estimation model is a mathematical model that estimates, based on an estimation target image, the occluded parts appearing in the estimation target. The occluded part estimation model is a mathematical model expressed by a neural network. The neural network expressing the occluded part estimation model is, for example, a neural network including a deep neural network, and is, for example, a neural network including a convolutional neural network.
As described above, since both the mask completion model and the content completion model included in the occluded part estimation model are mathematical models updated by learning, the occluded part estimation model is itself a mathematical model updated by learning. Both the mask completion model and the content completion model are mathematical models expressed, for example, by neural networks. The neural network expressing the mask completion model and the neural network expressing the content completion model are both, for example, deep neural networks, and both are, for example, convolutional neural networks.
As also described above, the mask completion model is trained to partially fill in the invisible mask of an object (occludee) hidden by an occluding object (occluder). The mask completion model is therefore a mathematical model that estimates, based on the estimation target image, the shapes of the images appearing in the estimation target image.
As also described above, the content completion model is trained to partially fill the recovered mask with RGB values. The content completion model therefore estimates the RGB values of the images appearing in the estimation target image, based on the estimation results of the mask completion model.
Note that segmentation has already been performed on the target of the mask completion model before the mask completion model is executed. The target of the content completion model has likewise already been segmented before the content completion model is executed. To reduce the amount of computation, content completion uses the masks completed in the preceding step, and only then is the filling of those masks with RGB values started; that is, instance segmentation is performed only once per image.
The target of the dual completion model has also already been segmented before the dual completion model is executed. That is, in Self-Supervised Scene De-occlusion, segmentation is performed before the mask completion model, the content completion model, and the dual completion model are executed. The segmentation processing is included in the occluded part estimation model; by executing the occluded part estimation model, segmentation is performed before the mask completion model, the content completion model, and the dual completion model are executed. The mask completion model, the content completion model, and the dual completion model each perform estimation using the segmentation results as well.
This concludes the description of Self-Supervised Scene De-occlusion. Next, the hardware configuration of the learning device 1 will be described.
FIG. 7 is a diagram showing an example of the hardware configuration of the learning device 1 in the embodiment. The learning device 1 includes a control unit 11 including a processor 91 such as a CPU (Central Processing Unit) and a memory 92 connected by a bus, and executes a program. By executing the program, the learning device 1 functions as a device including the control unit 11, an input unit 12, a communication unit 13, a storage unit 14, and an output unit 15.
More specifically, the processor 91 reads the program stored in the storage unit 14 and stores the read program in the memory 92. By the processor 91 executing the program stored in the memory 92, the learning device 1 functions as a device including the control unit 11, the input unit 12, the communication unit 13, the storage unit 14, and the output unit 15.
The control unit 11 controls the operation of the various functional units included in the learning device 1. The control unit 11 executes Self-Supervised Scene De-occlusion. The control unit 11, for example, controls the operation of the output unit 15 and causes the output unit 15 to output the execution results of Self-Supervised Scene De-occlusion. The control unit 11 records, for example, various kinds of information generated by the execution of Self-Supervised Scene De-occlusion in the storage unit 14. The various kinds of information stored in the storage unit 14 include, for example, the results of executing Self-Supervised Scene De-occlusion.
The input unit 12 includes input devices such as a mouse, a keyboard, and a touch panel. The input unit 12 may be configured as an interface that connects these input devices to the learning device 1. The input unit 12 receives input of various kinds of information to the learning device 1. For example, training data is input to the input unit 12.
Since the learning device 1 trains the occluded part estimation model by executing Self-Supervised Scene De-occlusion, the training data may be any image data of images in which objects appear.
However, when a specific usage scene is envisaged as the destination to which the trained occluded part estimation model will be applied, it is preferable that image data of images containing objects that satisfy the conditions of the objects appearing in the estimation targets of that usage scene be used as the training data, because the estimation accuracy of the trained occluded part estimation model is then higher when it is actually used in the envisaged scene. In machine learning, however, it is often not easy to obtain image data of images containing objects that satisfy such predetermined conditions. An example of a technique for supporting the generation of image data of images containing objects that satisfy the conditions of the objects appearing in the estimation target image is therefore described in a modification (specifically, a second modification). Even without such a technique, the learning device 1 itself can perform learning of the occluded part estimation model.
The communication unit 13 includes a communication interface for connecting the learning device 1 to external devices. The communication unit 13 communicates with external devices in a wired or wireless manner. An external device is, for example, the device from which the training data is transmitted, and the communication unit 13 acquires the training data through communication with that transmission source. Another external device is, for example, the estimation device 2 described later. The estimation device 2 is a device that performs estimation using the trained occluded part estimation model. The communication unit 13 transmits the program of the trained occluded part estimation model to the estimation device 2 through communication with the estimation device 2.
The storage unit 14 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 14 stores various kinds of information regarding the learning device 1. The storage unit 14 stores, for example, information input via the input unit 12 or the communication unit 13. The storage unit 14 stores, for example, the occluded part estimation model. The storage unit 14 stores, for example, various kinds of information generated by the execution of Self-Supervised Scene De-occlusion.
The output unit 15 outputs various kinds of information. The output unit 15 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, or an organic EL (Electro-Luminescence) display. The output unit 15 may be configured as an interface that connects these display devices to the learning device 1. The output unit 15 outputs, for example, information input to the input unit 12 or the communication unit 13. The output unit 15 may display, for example, the execution results of Self-Supervised Scene De-occlusion.
FIG. 8 is a diagram showing an example of the configuration of the control unit 11 included in the learning device 1 in the embodiment. The control unit 11 includes a training data acquisition unit 111, a learning unit 112, a storage control unit 113, and an output control unit 114.
The training data acquisition unit 111 acquires training data. The training data acquisition unit 111 acquires, for example, training data input to the input unit 12 or the communication unit 13. The training data acquisition unit 111 may acquire training data by reading training data stored in advance in the storage unit 14.
The learning unit 112 executes Self-Supervised Scene De-occlusion on the training data acquired by the training data acquisition unit 111. By executing Self-Supervised Scene De-occlusion, the learning unit 112 executes the occluded part estimation model and updates the occluded part estimation model based on the execution results. The model is updated so that the accuracy of estimation by the occluded part estimation model increases. That is, the learning unit 112 trains the occluded part estimation model by executing Self-Supervised Scene De-occlusion.
The storage control unit 113 records various kinds of information in the storage unit 14. The output control unit 114 controls the operation of the output unit 15.
FIG. 9 is a flowchart showing an example of the flow of processing executed by the learning device 1 in the embodiment. The training data acquisition unit 111 acquires training data (step S201). Next, the learning unit 112 executes Self-Supervised Scene De-occlusion on the acquired training data (step S202). The learning unit 112 then determines whether the learning termination condition is satisfied (step S203). If the learning termination condition is satisfied (step S203: YES), the processing ends; the occluded part estimation model at the time the learning termination condition is satisfied is the trained occluded part estimation model. On the other hand, if the learning termination condition is not satisfied (step S203: NO), the process returns to step S201.
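A minimal sketch of this training loop is shown below; acquire_training_batch, run_de_occlusion_step, and termination_condition_met are hypothetical placeholders for steps S201 to S203, and the real learning unit 112 would update the PCNet-M and PCNet-C parameters inside run_de_occlusion_step.

```python
def train_occluded_part_model(model, acquire_training_batch,
                              run_de_occlusion_step, termination_condition_met):
    """Loop of FIG. 9: acquire data (S201), run one self-supervised
    de-occlusion step (S202), and stop when the learning termination
    condition holds (S203).  All callables are illustrative placeholders."""
    while True:
        batch = acquire_training_batch()                # S201
        model = run_de_occlusion_step(model, batch)     # S202 (updates the model)
        if termination_condition_met(model):            # S203
            return model                                # trained occluded part estimation model
```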
The trained occluded part estimation model obtained in this way is used, based on input image data, in processing for estimating the occluded parts appearing in the image indicated by that image data. One example of a device that executes such processing is the estimation device 2. The estimation device 2 acquires the trained occluded part estimation model in advance, before executing it, for example by obtaining it from the learning device 1 through communication. The estimation device 2 may instead acquire the trained occluded part estimation model in advance by being equipped with a neural network expressing the trained occluded part estimation model.
FIG. 10 is an explanatory diagram illustrating an overview of the estimation device 2 in the embodiment. The estimation device 2 uses the trained occluded part estimation model obtained by the learning device 1 to estimate the occluded parts appearing in the estimation target image. More specifically, the estimation device 2 takes global amodal instances as input and executes segmentation annotation. The estimation device 2 generates an ordered directed graph for the occlusions. The estimation device 2 executes the trained PCNet-M, which is an example of a neural network expressing the mask completion model. The estimation device 2 executes the trained PCNet-C, which is an example of a neural network expressing the content completion model. The estimation device 2 outputs the occluded target object as a restored object.
FIG. 11 is a diagram showing an example of the hardware configuration of the estimation device 2 in the embodiment. The estimation device 2 includes a control unit 21 including a processor 93 such as a CPU and a memory 94 connected by a bus, and executes a program. By executing the program, the estimation device 2 functions as a device including the control unit 21, an input unit 22, a communication unit 23, a storage unit 24, and an output unit 25.
More specifically, the processor 93 reads the program stored in the storage unit 24 and stores the read program in the memory 94. By the processor 93 executing the program stored in the memory 94, the estimation device 2 functions as a device including the control unit 21, the input unit 22, the communication unit 23, the storage unit 24, and the output unit 25.
The control unit 21 controls the operation of the various functional units included in the estimation device 2. The control unit 21 executes the trained occluded part estimation model. The control unit 21, for example, controls the operation of the output unit 25 and causes the output unit 25 to output the execution results of the trained occluded part estimation model. The control unit 21 records, for example, various kinds of information generated by executing the trained occluded part estimation model in the storage unit 24. The various kinds of information stored in the storage unit 24 include, for example, the execution results of the trained occluded part estimation model.
The input unit 22 includes input devices such as a mouse, a keyboard, and a touch panel. The input unit 22 may be configured as an interface that connects these input devices to the estimation device 2. The input unit 22 receives input of various kinds of information to the estimation device 2. For example, image data on which the trained occluded part estimation model is to be executed is input to the input unit 22; this image data is the image data of the estimation target image.
The communication unit 23 includes a communication interface for connecting the estimation device 2 to external devices. The communication unit 23 communicates with external devices in a wired or wireless manner. An external device is, for example, the device from which the image data of the estimation target image is transmitted. Another external device is, for example, the learning device 1. The communication unit 23 may receive the trained occluded part estimation model from the learning device 1 through communication with the learning device 1.
The storage unit 24 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 24 stores various kinds of information regarding the estimation device 2. The storage unit 24 stores, for example, information input via the input unit 22 or the communication unit 23. The storage unit 24 stores, for example, the trained occluded part estimation model in advance, before the trained occluded part estimation model is executed. The storage unit 24 stores, for example, various kinds of information generated by executing the trained occluded part estimation model.
The output unit 25 outputs various kinds of information. The output unit 25 includes a display device such as a CRT display, a liquid crystal display, or an organic EL display. The output unit 25 may be configured as an interface that connects these display devices to the estimation device 2. The output unit 25 outputs, for example, information input to the input unit 22 or the communication unit 23. The output unit 25 may display, for example, the execution results of the trained occluded part estimation model.
FIG. 12 is a diagram showing an example of the configuration of the control unit 21 included in the estimation device 2 in the embodiment. The control unit 21 includes a target data acquisition unit 211, an estimation unit 212, a storage control unit 213, and an output control unit 214.
The target data acquisition unit 211 acquires image data on which the trained occluded part estimation model is to be executed. The target data acquisition unit 211 acquires, for example, image data input via the input unit 22 or the communication unit 23 as the image data on which the trained occluded part estimation model is executed.
The estimation unit 212 executes the trained occluded part estimation model on the image data acquired by the target data acquisition unit 211. By executing the trained occluded part estimation model, the estimation unit 212 estimates the occluded parts appearing in the image indicated by the image data.
The storage control unit 213 records various kinds of information in the storage unit 24. The output control unit 214 controls the operation of the output unit 25.
FIG. 13 is a flowchart showing an example of the flow of processing executed by the estimation device 2 in the embodiment. The target data acquisition unit 211 acquires the image data of the estimation target image (step S301). Next, the estimation unit 212 executes the trained occluded part estimation model on the image data acquired in step S301 (step S302). By executing the trained occluded part estimation model, the occluded parts appearing in the image indicated by the image data are estimated. Next, the output control unit 214 controls the operation of the output unit 25 to display the estimation results of the estimation unit 212 (step S303).
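The corresponding inference flow can be sketched as follows; load_image, trained_model, and display are assumed placeholders standing in for the target data acquisition unit 211, the trained occluded part estimation model, and the output control via the output unit 25.

```python
def estimate_occluded_parts(image_path, load_image, trained_model, display):
    """Flow of FIG. 13: acquire the target image (S301), run the trained
    occluded part estimation model (S302), and display the result (S303).
    All callables are illustrative placeholders."""
    image = load_image(image_path)          # S301: target data acquisition
    result = trained_model(image)           # S302: amodal masks and completed content
    display(result)                         # S303: output control
    return result
```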
(Experiments)
According to experiments using the trained occluded part estimation model obtained by the learning device 1, an accuracy of 95% or higher was confirmed in various applications such as order recovery, amodal completion, and amodal instance segmentation.
The learning device 1 of the embodiment configured as described above trains a mathematical model for estimating occluded parts by executing Self-Supervised Scene De-occlusion, an autocorrelation deep learning algorithm for obtaining a mathematical model that performs both the acquisition of a directed graph recovering the order between neighboring objects and the completion of invisible parts using the occlusion geometry of the objects. The learning device 1 can therefore improve the accuracy of estimating the part of a partially occluded object that is covered by the occluding object.
The learning device 1 of the embodiment configured as described above obtains, through learning, a mathematical model in which both modal perception and amodal perception are processed. Modal perception refers to the analysis of directly visible regions, while amodal perception refers to perceiving the intact structure of an entity, including its invisible regions. The learning device 1 can therefore improve the accuracy of estimating the part of a partially occluded object that is covered by the occluding object.
The estimation device 2 of the embodiment configured as described above uses the trained occluded part estimation model obtained by the learning device 1 to estimate the occluded parts appearing in the estimation target. The estimation device 2 can therefore improve the accuracy of estimating the part of a partially occluded object that is covered by the occluding object.
(First Modification)
A scene in which the trained occluded part estimation model is applied will now be described, together with the learning device 1 and the estimation device 2. The trained occluded part estimation model is used, for example, in a factory that manufactures parts of a predetermined shape. In such a factory, it may be necessary to estimate the occluded parts of parts appearing in an image such as the one shown in FIG. 14 below, and in such a case the trained occluded part estimation model is used.
FIG. 14 is a diagram showing an example of an estimation target image in the embodiment. The image of FIG. 14 shows three circular parts, parts C1 to C3, inside a box. The three circular parts partially overlap; more specifically, part C1 and part C2 lie on top of part C3, and part C1 lies on top of part C2.
Each of the bounding boxes B1 to B3 in FIG. 14 is the result of the bounding box regression used to detect the parts.
FIG. 15 is an explanatory diagram illustrating a bounding box in the embodiment. FIG. 15 shows one bounding box B4. In FIG. 15, a circle C4 is inscribed in the bounding box B4. The circle C4 is the contour of an object appearing in the image; the object in FIG. 15 is, for example, the contour of a part to be detected. For both the bounding box and the contour, geometric information such as size and shape can be expressed mathematically using a polar coordinate system.
 そのため、像の輪郭に外接する四角形(すなわちバウンティングボックス)が検出されれば、像の幾何学的な情報が得られる。例えば、図15の例では、円C4の面積はバウンティングボックスB4の面積は0.785倍である。なお、検出対象は、被遮蔽部位推定モデル又は学習済みの被遮蔽部位推定モデルの推定対象の画像に写る場合に、検出対象は被遮蔽部位推定モデル又は学習済みの被遮蔽部位推定モデルによって被遮蔽部位が推定される対象である。 Therefore, if a rectangle (i.e. bounding box) circumscribing the outline of the image is detected, geometric information about the image can be obtained. For example, in the example of FIG. 15, the area of circle C4 is 0.785 times the area of bounding box B4. Note that when the detection target appears in the image of the estimation target of the occluded part estimation model or the trained occluded part estimation model, the detection target is occluded by the occluded part estimation model or the trained occluded part estimation model. The body part is the target to be estimated.
 In the example of FIG. 15, circle C4 is a full circle because it has no occluded part; when the detection target does have an occluded part, its contour is not necessarily a full circle. When the contour of the detection target is a circle with no occluded part, the ratio of the area of the detection target to the area of the bounding box is 0.785, as described above. If, however, the detection target has an occluded part, this ratio is not necessarily 0.785. Hereinafter, the ratio of the area of the detection target to the area of the bounding box is referred to as the detection area ratio.
 Thus, once the detection area ratio is obtained, geometric information about the occluded part can be obtained. The geometric information of the occluded part is, for example, the occlusion rate, which is the ratio of the area of the occluded part to the area of the bounding box. Both the detection area ratio and the occlusion rate are examples of geometric information possessed by the detection target.
 Using geometric information as well further improves the accuracy of estimating the occluded part. For example, for components C1 and C2 described above, once their respective detection area ratios are obtained, it can be determined that the component with the larger detection area ratio covers the component with the smaller one.
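 The following Python sketch is a minimal illustration of the detection area ratio, the occlusion rate, and the ordering heuristic just described, computed from binary masks; the mask-based formulation and the function names are assumptions made for illustration and are not the patent's implementation.

```python
import numpy as np

def bounding_box_area(mask: np.ndarray) -> int:
    """Area of the axis-aligned bounding box of a binary mask."""
    ys, xs = np.nonzero(mask)
    return (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)

def detection_area_ratio(visible_mask: np.ndarray) -> float:
    """Visible area of the detection target divided by its bounding-box area."""
    return visible_mask.sum() / bounding_box_area(visible_mask)

def occlusion_rate(visible_mask: np.ndarray, amodal_mask: np.ndarray) -> float:
    """Area of the occluded part divided by the bounding-box area of the full (amodal) shape."""
    occluded = np.logical_and(amodal_mask, np.logical_not(visible_mask))
    return occluded.sum() / bounding_box_area(amodal_mask)

def covers(ratio_a: float, ratio_b: float) -> str:
    """Heuristic from the text: the component with the larger detection area
    ratio is judged to cover the one with the smaller ratio."""
    return "A covers B" if ratio_a > ratio_b else "B covers A"

# Toy example: two overlapping discs; C1 is drawn on top of C2.
h, w = 128, 128
yy, xx = np.mgrid[0:h, 0:w]
disc1 = (yy - 64) ** 2 + (xx - 50) ** 2 <= 30 ** 2        # amodal mask of C1
disc2 = (yy - 64) ** 2 + (xx - 80) ** 2 <= 30 ** 2        # amodal mask of C2
visible1 = disc1                                           # C1 is unoccluded
visible2 = np.logical_and(disc2, np.logical_not(disc1))    # C2 loses the overlap region

r1 = detection_area_ratio(visible1)
r2 = detection_area_ratio(visible2)
print(round(r1, 3), round(r2, 3), covers(r1, r2))          # r1 ≈ 0.785, r2 < 0.785
print(round(occlusion_rate(visible2, disc2), 3))           # occlusion rate of C2
```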
 Therefore, so that estimation can also make use of geometric information possessed by the images appearing in the estimation target image, such as the detection area ratio, such geometric information may also be used when training the occluded part estimation model. Specifically, the training data used for learning may include, as ground-truth data, the geometric information of the images appearing in the estimation target image. In such a case, the occluded part estimation model is, for example, a mathematical model that also estimates the geometric information of the images appearing in the estimation target image, and the learning unit 112 updates the occluded part estimation model also based on the geometric information estimated by the model. In other words, the learning unit 112 trains the occluded part estimation model also using the geometric information possessed by the images appearing in the estimation target image.
 When the geometric information of the images appearing in the estimation target image is used in training the occluded part estimation model in this way, the trained occluded part estimation model used by the estimation device 2 may likewise perform estimation using the geometric information of the images appearing in the estimation target image.
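 One way the geometric information could enter the training is through an additional regression head whose error is added to the base loss. The PyTorch sketch below is only an assumed illustration of that idea; the head architecture, the L1 loss, and the weighting factor are not specified in the patent.

```python
import torch
import torch.nn as nn

class GeometryAwareHead(nn.Module):
    """Illustrative head that predicts a per-instance occlusion rate (a scalar in [0, 1])
    from an instance feature vector, alongside whatever the base model predicts."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid()
        )

    def forward(self, instance_features: torch.Tensor) -> torch.Tensor:
        return self.regressor(instance_features).squeeze(-1)

def training_step(head, base_loss, instance_features, gt_occlusion_rate, weight=0.1):
    """Combine the base (mask / ordering) loss with a geometric-information loss,
    so the model is also updated from the geometric ground truth in the training data."""
    pred_rate = head(instance_features)
    geo_loss = nn.functional.l1_loss(pred_rate, gt_occlusion_rate)
    return base_loss + weight * geo_loss

# Minimal usage with random tensors standing in for real features and labels.
head = GeometryAwareHead()
feats = torch.randn(8, 256)   # 8 detected instances
gt = torch.rand(8)            # ground-truth occlusion rates from the annotations
loss = training_step(head, base_loss=torch.tensor(1.0),
                     instance_features=feats, gt_occlusion_rate=gt)
loss.backward()
```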
(Second modification)
 As an example of the technique mentioned above for supporting the generation of image data in which the images satisfy the conditions required of images appearing in the estimation target image, an image data generation support process is described. In the image data generation support process, the condition required of the images appearing in the estimation target image is, for example, geometric information such as the detection area ratio. The process uses a predetermined function that represents the shape of the image appearing in the generated image data and that has one or more parameters (hereinafter referred to as the "image function"). The values of the parameters of the image function follow a predetermined probability distribution, and one of the parameters is a value indicating geometric information such as the detection area ratio.
 In the image data generation support process, each time image data is generated, the parameter values of the image function are varied according to the predetermined probability distribution, so that a variety of image data with different geometric information is generated. When the trained occluded part estimation model is used to estimate the occluded parts of components C1 to C3 described above, the image function is, for example, a function whose only parameter indicates the size of the figure, such as the detection area ratio, and which represents a figure whose shape is fixed regardless of the parameter value.
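 A minimal sketch of the image data generation support process under the description above: the image function draws a circle whose single size parameter follows a predetermined probability distribution, so each generated image carries different geometric information. The uniform distribution, the canvas size, and the function names are assumptions chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def image_function(size_param: float, canvas: int = 128) -> np.ndarray:
    """Image function for the circular components: the shape is always a circle,
    and the single parameter controls its size (and hence the geometric information)."""
    radius = size_param * canvas / 2.0
    yy, xx = np.mgrid[0:canvas, 0:canvas]
    return ((yy - canvas / 2) ** 2 + (xx - canvas / 2) ** 2 <= radius ** 2).astype(np.uint8)

def generate_training_images(n: int, canvas: int = 128):
    """Each generated image draws the size parameter from a predetermined
    probability distribution (here: uniform on [0.3, 0.9])."""
    images, params = [], []
    for _ in range(n):
        p = rng.uniform(0.3, 0.9)
        images.append(image_function(p, canvas))
        params.append(p)
    return images, params

images, params = generate_training_images(5)
print([round(p, 2) for p in params], [int(im.sum()) for im in images])
```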
(Third modification)
 The neural network representing the mask completion model may be an object detection convolutional neural network such as YOLOv4. Likewise, the neural network representing the content completion model may be YOLOv4, and the neural network representing the occluded part estimation model may be YOLOv4.
(An example of the details of the learning unit 112)
 FIG. 16 is a diagram showing an example of the configuration of the learning unit 112 included in the learning device 1 of the embodiment.
 The learning unit 112 includes, for example, an object detection unit 121, an instance segmentation unit 122, a comparison unit 123, a conversion unit 124, an amodal information determination unit 125, an occlusion generation unit 126, a mask completion model execution unit 127, a content completion model execution unit 128, a success/failure determination unit 129, an amodal information inspection unit 130, an occlusion rate correction unit 131, and a shape determination unit 132.
 The object detection unit 121 executes, for example, the process of step S102. The process of step S101 is executed by, for example, the training data acquisition unit 111. The instance segmentation unit 122 executes, for example, the process of step S103. The comparison unit 123 executes, for example, the process of step S104. The conversion unit 124 executes, for example, the process of step S105. The amodal information determination unit 125 executes, for example, the process of step S106.
 The occlusion generation unit 126 executes, for example, the process of step S107. The mask completion model execution unit 127 executes, for example, the process of step S108. The content completion model execution unit 128 executes, for example, the process of step S109. The success/failure determination unit 129 executes, for example, the process of step S110. The amodal information inspection unit 130 executes, for example, the process of step S111. The occlusion rate correction unit 131 executes, for example, the process of step S112. The shape determination unit 132 executes, for example, the process of step S113.
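 Taken together, the two paragraphs above assign one step of S101 to S113 to each component of the learning unit 112. The Python skeleton below sketches one possible way such a step dispatcher could be organised; the handler names and the shared-state convention are assumptions for illustration only, not the patent's implementation.

```python
class LearningPipeline:
    """Illustrative dispatcher: each step S101-S113 is handled by one component,
    mirroring the unit-to-step assignment described in the text."""

    def __init__(self, units: dict):
        # units maps a step label to a callable, e.g. "S102" -> object_detection_unit.run
        self.units = units
        self.order = [f"S{n}" for n in range(101, 114)]   # S101 .. S113

    def run(self, training_data):
        state = {"input": training_data}
        for step in self.order:
            handler = self.units.get(step)
            if handler is None:
                raise KeyError(f"no handler registered for step {step}")
            state = handler(state)   # each unit consumes and returns the shared state
        return state

# Minimal usage with stub handlers standing in for the real units 111 and 121-132.
def make_stub(name):
    def stub(state):
        state.setdefault("log", []).append(name)
        return state
    return stub

stub_names = ["training_data_acquisition_111", "object_detection_121",
              "instance_segmentation_122", "comparison_123", "conversion_124",
              "amodal_information_determination_125", "occlusion_generation_126",
              "mask_completion_model_127", "content_completion_model_128",
              "success_failure_determination_129", "amodal_information_inspection_130",
              "occlusion_rate_correction_131", "shape_determination_132"]
pipeline = LearningPipeline({f"S{101 + i}": make_stub(n) for i, n in enumerate(stub_names)})
print(pipeline.run(training_data=None)["log"][-1])   # "shape_determination_132"
```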
 The learning device 1 may be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units of the learning device 1 may be distributed across, and implemented in, the plurality of information processing devices.
 The estimation device 2 may likewise be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units of the estimation device 2 may be distributed across, and implemented in, the plurality of information processing devices.
 The occlusion-inference object detection device 100 may also be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units of the occlusion-inference object detection device 100 may be distributed across, and implemented in, the plurality of information processing devices.
 The learning device 1 and the estimation device 2 do not necessarily have to be implemented as separate devices; they may be implemented as a single device having both the functions of the learning device 1 and those of the estimation device 2. In such a case, for example, the control unit 11 and the control unit 21 may be implemented as a single control unit; that is, the functional units of the control unit 11 and those of the control unit 21 may be implemented in one control unit rather than in different control units.
 As described above, the occlusion-inference object detection device 100 of the present invention is a device for estimating one or more occluded parts of an object, one or more parts of which are occluded by one or more occluding objects, wherein the occluded parts are segmented and, in the process of estimating an occluded part from the segmented occluded parts, the device includes the learning unit 112 that performs self-supervised learning of a mathematical model that estimates the positional relationship between images appearing in the estimation target image and estimates, based on the positional relationship, the occluded part of the image of an occluded object, which is an object appearing in the estimation target image with a part thereof occluded.
 Also as described above, in the segmentation of the occluded portions, the control unit 11 calculates the number of occlusions and the number of occluded objects, performs a global amodal conversion, and receives annotations in an amodal information format as input.
 Also as described above, in the amodal information format annotation, the control unit 11 may obtain the order of the segmented images, then generate the mask completion model, and then apply a convolutional neural network.
 Also as described above, in the amodal information format annotation, the control unit 11 may apply a convolutional neural network using the order of the segmented images or the occlusion rate.
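 To make the amodal information format referred to above more tangible, the following sketch shows one possible annotation record holding the quantities mentioned in the text (number of occlusions, number of occluded objects, order of the segmented images, and occlusion rate). The field names, the dataclass layout, and the example values are assumptions, not the patent's actual format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AmodalAnnotation:
    """One possible amodal-information-format record for a single training image."""
    num_occlusions: int            # how many occlusion events occur in the image
    num_occluded_objects: int      # how many objects have at least one occluded part
    # Pairwise order of the segmented images: (front_instance_id, back_instance_id).
    occlusion_order: List[Tuple[int, int]] = field(default_factory=list)
    # Per-instance occlusion rates, indexed by instance id.
    occlusion_rates: dict = field(default_factory=dict)

# Example loosely matching the scene of FIG. 14: C1 is in front of C2 and C3,
# and C2 is in front of C3 (rates here are placeholder values).
annotation = AmodalAnnotation(
    num_occlusions=3,
    num_occluded_objects=2,
    occlusion_order=[(1, 2), (1, 3), (2, 3)],
    occlusion_rates={1: 0.0, 2: 0.12, 3: 0.25},
)
print(annotation.num_occluded_objects, annotation.occlusion_order)
```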
 All or some of the functions of the occlusion-inference object detection device 100, the learning device 1, and the estimation device 2 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The program may be recorded on a computer-readable recording medium. A computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The program may also be transmitted via a telecommunication line.
 Note that amodal completion means non-modality completion, and modal completion means modality completion. Global amodal completion means amodal completion that targets a large space.
 Although embodiments of the present invention have been described above in detail with reference to the drawings, the specific configuration is not limited to these embodiments and includes designs and the like that do not depart from the gist of the present invention.
 100... Occlusion-inference object detection device, 1... Learning device, 2... Estimation device, 11... Control unit, 12... Input unit, 13... Communication unit, 14... Storage unit, 15... Output unit, 111... Training data acquisition unit, 112... Learning unit, 113... Storage control unit, 114... Output control unit, 21... Control unit, 22... Input unit, 23... Communication unit, 24... Storage unit, 25... Output unit, 211... Target data acquisition unit, 212... Estimation unit, 213... Storage control unit, 214... Output control unit, 121... Object detection unit, 122... Instance segmentation unit, 123... Comparison unit, 124... Conversion unit, 125... Amodal information determination unit, 126... Occlusion generation unit, 127... Mask completion model execution unit, 128... Content completion model execution unit, 129... Success/failure determination unit, 130... Amodal information inspection unit, 131... Occlusion rate correction unit, 132... Shape determination unit, 91... Processor, 92... Memory, 93... Processor, 94... Memory

Claims (12)

  1.  An occlusion-inference object detection device for estimating one or more occluded parts of an object, one or more parts of which are occluded by one or more occluding objects, wherein the occluded parts are segmented and, in the process of estimating an occluded part from the segmented occluded parts, the device comprises:
     a learning unit that performs self-supervised learning of a mathematical model that estimates a positional relationship between images appearing in an image to be estimated and estimates, based on the positional relationship, an occluded part of an image of an occluded object, the occluded object being an object appearing in the image to be estimated with a part thereof occluded,
     wherein the learning unit trains the mathematical model also using geometric information of the images appearing in the image to be estimated.
  2.  The occlusion-inference object detection device according to claim 1, wherein a control unit including the learning unit calculates the number of occlusions and the number of occluded objects in the segmentation of the occluded parts, performs a global amodal conversion, and receives an annotation in an amodal information format as input.
  3.  The occlusion-inference object detection device according to claim 2, wherein the control unit including the learning unit obtains the order of the segmented images in the amodal information format annotation, then generates a mask completion model, and then applies a convolutional neural network.
  4.  The occlusion-inference object detection device according to claim 2, wherein the control unit including the learning unit applies a convolutional neural network using the order of the segmented images or the occlusion rate in the amodal information format annotation.
  5.  The occlusion-inference object detection device according to claim 1, wherein the positional relationship is a directed graph that restores the order between adjacent images.
  6.  The occlusion-inference object detection device according to claim 1, wherein the geometric information uses an occlusion rate, which is the ratio of the area of the image to the area of a rectangle circumscribing the image.
  7.  The occlusion-inference object detection device according to any one of claims 1 to 6, wherein the training data used for learning the mathematical model is obtained using an image function, which is a predetermined function that represents the shape of an image appearing in generated image data and has one or more parameters.
  8.  The occlusion-inference object detection device according to any one of claims 1 to 7, wherein the mathematical model is represented by an object detection convolutional neural network.
  9.  An occlusion-inference object detection device comprising:
     a target data acquisition unit that acquires image data; and
     an estimation unit that estimates an occluded part of an image appearing in the image of the image data acquired by the target data acquisition unit, by executing a trained mathematical model obtained by a learning unit that performs self-supervised learning of a mathematical model that estimates a positional relationship between images appearing in an image to be estimated and estimates, based on the positional relationship, an occluded part of an image of an occluded object, the occluded object being an object appearing in the image to be estimated with a part thereof occluded,
     wherein the learning unit trains the mathematical model also using geometric information of the images appearing in the image to be estimated.
  10.  An occlusion-inference object detection method comprising:
     a learning step of performing self-supervised learning of a mathematical model that estimates a positional relationship between images appearing in an image to be estimated and estimates, based on the positional relationship, an occluded part of an image of an occluded object, the occluded object being an object appearing in the image to be estimated with a part thereof occluded,
     wherein, in the learning step, the mathematical model is trained also using geometric information of the images appearing in the image to be estimated.
  11.  An occlusion-inference object detection method comprising:
     a target data acquisition step of acquiring image data; and
     an estimation step of estimating an occluded part of an image appearing in the image of the image data acquired in the target data acquisition step, by executing a trained mathematical model obtained by self-supervised learning of a mathematical model that estimates a positional relationship between images appearing in an image to be estimated and estimates, based on the positional relationship, an occluded part of an image of an occluded object, the occluded object being an object appearing in the image to be estimated with a part thereof occluded,
     wherein the learning is learning of the mathematical model that also uses geometric information of the images appearing in the image to be estimated.
  12.  A program for causing a computer to function as the occlusion-inference object detection device according to any one of claims 1 to 9.
PCT/JP2023/008004 2022-03-08 2023-03-03 Occlusion-inference object detection device, occlusion-inference object detection, and program WO2023171559A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022035212 2022-03-08
JP2022-035212 2022-03-08

Publications (1)

Publication Number Publication Date
WO2023171559A1 true WO2023171559A1 (en) 2023-09-14

Family

ID=87935005

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/008004 WO2023171559A1 (en) 2022-03-08 2023-03-03 Occlusion-inference object detection device, occlusion-inference object detection, and program

Country Status (1)

Country Link
WO (1) WO2023171559A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017211761A (en) * 2016-05-24 2017-11-30 株式会社東芝 Information processing apparatus and information processing method
JP2019192022A (en) * 2018-04-26 2019-10-31 キヤノン株式会社 Image processing apparatus, image processing method, and program
JP2021144631A (en) * 2020-03-13 2021-09-24 エヌ・ティ・ティ・ビズリンク株式会社 Animal behavior estimation system, animal behavior estimation support device, animal behavior estimation method, and program
JP2021141876A (en) * 2020-03-13 2021-09-24 エヌ・ティ・ティ・ビズリンク株式会社 Animal behavior estimation device, animal behavior estimation method, and program

Similar Documents

Publication Publication Date Title
CN111968235B (en) Object attitude estimation method, device and system and computer equipment
US20240070546A1 (en) System and method for end-to-end differentiable joint image refinement and perception
US10818014B2 (en) Image object segmentation based on temporal information
US20220277515A1 (en) Structure modelling
US11314989B2 (en) Training a generative model and a discriminative model
CN111161349B (en) Object posture estimation method, device and equipment
JP2021089724A (en) 3d auto-labeling with structural and physical constraints
KR20210002606A (en) Medical image processing method and apparatus, electronic device and storage medium
CN111340867A (en) Depth estimation method and device for image frame, electronic equipment and storage medium
US11209277B2 (en) Systems and methods for electronic mapping and localization within a facility
EP1887514B1 (en) Signal processing device
EP1363235A1 (en) Signal processing device
US11314986B2 (en) Learning device, classification device, learning method, classification method, learning program, and classification program
JP7161107B2 (en) generator and computer program
JP2018156640A (en) Learning method and program
CN113903028A (en) Target detection method and electronic equipment
US11132586B2 (en) Rolling shutter rectification in images/videos using convolutional neural networks with applications to SFM/SLAM with rolling shutter images/videos
CN111626134A (en) Dense crowd counting method, system and terminal based on hidden density distribution
CN114387392B (en) Method for reconstructing three-dimensional human body posture according to human shadow
WO2023171559A1 (en) Occlusion-inference object detection device, occlusion-inference object detection, and program
WO2023119922A1 (en) Image generating device, method, and program, training device, and training data
CN112800822A (en) 3D automatic tagging with structural and physical constraints
JP2006318232A (en) Analytical mesh correction device
CN117529749A (en) Unconstrained image stabilization
CN114359587A (en) Class activation graph generation method, interpretable method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23766735

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2024506134

Country of ref document: JP