WO2023171559A1 - Occlusion-inference object detection device, occlusion-inference object detection method, and program - Google Patents

Occlusion-inference object detection device, occlusion-inference object detection method, and program

Info

Publication number
WO2023171559A1
WO2023171559A1 (PCT/JP2023/008004, JP2023008004W)
Authority
WO
WIPO (PCT)
Prior art keywords
occluded
image
estimation
learning
occlusion
Prior art date
Application number
PCT/JP2023/008004
Other languages
French (fr)
Japanese (ja)
Inventor
慧敏 陸
禹超 鄭
Original Assignee
国立大学法人九州工業大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国立大学法人九州工業大学
Publication of WO2023171559A1 publication Critical patent/WO2023171559A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/162Segmentation; Edge detection involving graph-based methods

Definitions

  • the present invention relates to an estimated occlusion object detection device, an estimated occlusion object detection method, and a program.
  • an object of the present invention is to provide a technique that improves the accuracy of estimating the part of a partially shielded object covered by the shielding object.
  • One aspect of the present invention provides an apparatus for estimating one or more occluded parts of an object whose one or more parts are occluded by one or more occluding objects, wherein the occluded parts are segmented.
  • in the process of estimating a segmented occluded part, the apparatus estimates the positional relationship between the images appearing in the estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. These estimations are performed by a mathematical model that a learning unit trains by self-supervised learning, using also geometric information possessed by the images appearing in the estimation target image.
  • One aspect of the present invention is an occlusion estimation object detection device including: a target data acquisition unit that acquires image data; and an estimation unit that estimates an occluded part of an image appearing in the image of the image data acquired by the target data acquisition unit, by executing a trained mathematical model obtained by a learning unit that performs self-supervised learning of a mathematical model which estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. The learning unit performs learning of the mathematical model using also geometric information possessed by the images appearing in the estimation target image.
  • One aspect of the present invention is an occlusion estimation object detection method having a learning step of performing self-supervised learning of a mathematical model which estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. In the learning step, the mathematical model is learned using also geometric information possessed by the images appearing in the estimation target image.
  • One aspect of the present invention is an occlusion estimation object detection method having: a target data acquisition step of acquiring image data; and an estimation step of estimating an occluded part of an image appearing in the image of the image data acquired in the target data acquisition step, by executing a trained mathematical model obtained through self-supervised learning of a mathematical model which estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. The learning is learning of the mathematical model that also uses geometric information possessed by the images appearing in the estimation target image.
  • One aspect of the present invention is a program for causing a computer to function as the above-mentioned occlusion estimation object detection device.
  • FIG. 4 is an explanatory diagram illustrating an overview of the estimated occlusion object detection device according to the embodiment.
  • FIG. 5 is a first explanatory diagram illustrating an algorithm executed by the learning device in the embodiment.
  • FIG. 6 is a second explanatory diagram illustrating an algorithm executed by the learning device in the embodiment.
  • FIG. 10 is an explanatory diagram illustrating an overview of the estimation device in the embodiment.
  • FIG. 13 is a flowchart illustrating an example of the flow of processing executed by the estimation device in the embodiment.
  • FIG. 15 is an explanatory diagram illustrating a bounding box in the embodiment.
  • FIG. 16 is a diagram showing an example of the configuration of the learning unit included in the learning device in the embodiment.
  • FIG. 1 is a diagram showing an example of an overall flowchart of the present invention.
  • FIG. 2 is a diagram showing an example of a flowchart for performing object detection segmentation, which is a component of the present invention.
  • FIG. 3 is a diagram showing an example of a flowchart for performing annotation, which is a component of the present invention. That is, FIG. 3 is a diagram showing an example of a flowchart for performing mathematical modeling for amodal labeling and correction of occlusion rate using geometric information, which is a characteristic component of the present invention.
  • the present invention determines, at a predetermined cycle, whether an image that satisfies a predetermined format-related condition has been input (step S101).
  • the predetermined format-related condition is, for example, that the image is 640 pixels x 480 pixels. If an image that satisfies the predetermined format-related condition has not been input (step S101: NO), the process returns to step S101.
  • on the other hand, if such an image has been input (step S101: YES), the present invention detects, among the images appearing in the image acquired in step S101, an image that satisfies a predetermined condition (step S102).
  • the detected image will be referred to as a detection object.
  • the predetermined conditions are, for example, conditions input by the user to the present invention.
  • the present invention performs segmentation on the image acquired in step S101 (step S103).
  • the present invention determines whether the number of masks is equal to the number of detected images (step S104). Specifically, the present invention determines whether the number of masks is equal to the number of detected images based on the segmentation results.
  • if they are not equal (step S104: NO), the process returns to step S104. On the other hand, if they are equal (step S104: YES), the present invention converts the segmentation model to global amodal (step S105).
  • the present invention determines whether the amodal information is correct based on the global amodal (step S106). If the amodal information is correct (step S106: YES), the present invention acquires information indicating the occlusion order (step S107). Next, the present invention executes a mask complementation model such as PCNet-M on the image input in step S101 (step S108).
  • next, the present invention executes a content complementation model such as PCNet-C on the result of executing the mask complementation model, such as PCNet-M, on the image input in step S101 (step S109).
  • next, the present invention determines whether the occlusion order has been correctly restored, based on the result of the execution of step S109 (step S110). If the occlusion order has been correctly restored (step S110: YES), the process ends. On the other hand, if the occlusion order has not been correctly restored (step S110: NO), the process returns to step S108.
  • in step S111, the present invention inspects the amodal information. That is, the present invention inspects whether an unlabeled object has a label with amodal information, and if there is no label, performs amodal labeling using the occlusion rate and geometric information.
  • in step S112, the present invention corrects the occlusion rate based on the inspection result of the amodal information. That is, the present invention uses geometric information to correct the occlusion rate. By correcting the occlusion rate, the calculation ratio is also corrected.
  • next, the present invention determines whether the detection object has a predetermined specific shape (step S113). If it does not have the specific shape (step S113: NO), the process returns to step S111. If it has the specific shape (step S113: YES), the process returns to step S105.
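  • As a rough illustration only, the control flow of steps S101 to S113 can be sketched as follows in Python. Every function name below is a hypothetical placeholder for a component of the flowcharts in FIGS. 1 to 3, and the loop-backs are simplified; this is not an implementation disclosed by the invention.

```python
# Hypothetical sketch of the flow of steps S101-S113; all callables are placeholders.

EXPECTED_SIZE = (640, 480)  # the predetermined format-related condition of step S101


def process_image(image, *, detect, segment, to_global_amodal, amodal_ok,
                  occlusion_order, run_pcnet_m, run_pcnet_c, order_restored,
                  inspect_amodal, correct_occlusion_rate, has_specific_shape):
    """One pass over a single input image, with simplified loop-backs."""
    if getattr(image, "size", None) != EXPECTED_SIZE:          # step S101
        return None                                            # wait for a valid input
    detections = detect(image)                                 # step S102
    masks = segment(image)                                     # step S103
    if len(masks) != len(detections):                          # step S104
        return None
    global_amodal = to_global_amodal(masks)                    # step S105
    while not amodal_ok(global_amodal):                        # step S106
        labels = inspect_amodal(global_amodal)                 # step S111
        global_amodal = correct_occlusion_rate(labels)         # step S112
        if has_specific_shape(global_amodal):                  # step S113: YES
            global_amodal = to_global_amodal(masks)            # back to step S105
    order = occlusion_order(global_amodal)                     # step S107
    while True:
        amodal_masks = run_pcnet_m(image, masks, order)        # step S108
        restored = run_pcnet_c(image, amodal_masks)            # step S109
        if order_restored(restored, order):                    # step S110
            return restored
```

  • In an actual device, the placeholder callables would correspond to the object detection, instance segmentation, global amodal conversion, and PCNet-M/PCNet-C completion components described below.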
  • FIG. 4 is an explanatory diagram illustrating an overview of the occlusion estimation object detection device 100 of the embodiment.
  • the estimated occlusion object detection device 100 is an example of the present invention, and for example, the estimated occlusion object detection device 100 executes the flowcharts shown in FIGS. 1 to 3.
  • the occlusion estimation object detection device 100 includes a learning device 1 and an estimation device 2.
  • FIG. 4 is also an explanatory diagram illustrating an overview of the algorithm executed by the learning device 1.
  • prior to explaining the learning device 1, Self-Supervised Scene De-occlusion, which is the scene de-occlusion algorithm executed by the learning device 1, will be explained using FIGS. 5 and 6 in addition to FIG. 4.
  • FIG. 5 is a first explanatory diagram illustrating an algorithm executed by the learning device 1 in the embodiment. More specifically, FIG. 5 is a diagram showing image G1 in FIG. 4 in more detail.
  • FIG. 6 is a second explanatory diagram illustrating an algorithm executed by the learning device 1 in the embodiment.
  • Self-Supervised Scene De-occlusion is an algorithm that complements occluded parts. This algorithm aims to restore the occlusion order and complement the invisible parts of the object covered by the occlusion.
  • the algorithm is also a self-supervised learning framework that tackles de-occlusion on real-world data without manually annotating occlusion orders or amodal masks. This framework performs occlusion order recovery as well as the completion of amodal masks and occluded content.
  • the learning device 1 uses two mathematical models described later, a mask complementation model and a content complementation model, to partially complete instances in a self-supervised manner.
  • the mathematical model is specifically expressed by a neural network. Note that the above-mentioned amodal complementation is a term meaning recognition by supplementing in the brain an invisible part that is occluded.
  • in Self-Supervised Scene De-occlusion, an off-the-shelf instance segmentation framework is used to obtain the modal mask of the object.
  • as a technique for obtaining the modal mask of an object, deep learning such as Mask R-CNN is used, for example.
  • however, amodal masks are not available, and it is very difficult to learn completion for instances with occlusion because it is not known whether a given mask is intact or not. Therefore, in Self-Supervised Scene De-occlusion, partial completion is performed in a self-supervised manner.
  • in the following, f_θ represents the full completion model.
  • this completion process can be decomposed as shown in equation (2) below, where M_k denotes an intermediate state and p_θ denotes the partial completion model.
  • PCNet-M is an example of a neural network that expresses a mask complementation model.
  • M A and M B are regarded as a set of pixels. There are two cases of input, and different inputs are fed into the network.
  • the first case corresponds to a partial completion strategy.
  • in the first case, M_B is regarded as an eraser, and the mask completion model uses it to erase part of M_A, obtaining M_AoutB.
  • the mask complementation model learns to restore the original modal mask MA from M AoutB with M B as a condition.
  • M AoutB is represented by the symbol of the following formula (3) in the following explanation.
  • AoutB represents a difference set between A and B.
  • the second case is a regularization that prevents the network from over-completing an instance when the instance has no occlusion. Specifically, an eraser mask that does not invade A is used, and the mask complementation model is encouraged to keep the original mask M_A unchanged, conditioned on that eraser. In the absence of case 2, the mask complementation model always encourages an increase in the number of pixels, which may result in excessive completion if the instances are not surrounded by adjacent instances.
  • the erased image patch functions as an auxiliary input.
  • the loss function is formulated as shown in Equation (4) and Equation (5) below.
  • in equations (4) and (5), P_θ^(m)(·) represents the mask complementation model, θ^(m) represents the parameters to be optimized, I represents the image patch, and L represents the binary cross-entropy loss.
  • the final loss function is defined as a weighted combination of the two cases, in which the weighting coefficient represents the probability of selecting the first case.
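  • To make the two-case partial completion scheme concrete, the following PyTorch-style sketch shows one possible training step for a mask completion network. The network interface (a 5-channel input producing one logit mask), the random case switch, and the variable names are assumptions made for illustration; they are not the disclosed PCNet-M implementation.

```python
import torch
import torch.nn.functional as F


def mask_completion_step(net, m_a, m_b, image, optimizer, gamma=0.5):
    """One training step of a PCNet-M-style mask completion network (sketch).

    m_a, m_b : float modal masks of instances A and B, shape (N, 1, H, W), values in {0, 1}
    image    : RGB image patch, shape (N, 3, H, W), used as auxiliary input
    gamma    : probability of choosing case 1 (B erases part of A)
    net      : assumed to take a 5-channel input and output a 1-channel logit mask
    """
    if torch.rand(()) < gamma:
        # Case 1: M_B acts as the eraser; the network must restore the
        # original modal mask M_A from M_AoutB, conditioned on M_B.
        m_erased = m_a * (1.0 - m_b)           # M_AoutB
        eraser = m_b
    else:
        # Case 2 (regularization): the eraser does not invade A, so the
        # network is encouraged to keep M_A unchanged.
        m_erased = m_a
        eraser = m_b * (1.0 - m_a)             # a mask disjoint from A

    erased_image = image * (1.0 - eraser)      # erased image patch as auxiliary input
    logits = net(torch.cat([m_erased, eraser, erased_image], dim=1))

    loss = F.binary_cross_entropy_with_logits(logits, m_a)  # restore / keep M_A
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```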
  • the content completion model follows the same intuition as the mask completion model, but the target to be completed is the RGB content.
  • as shown in FIG. 5 (image G1 in FIG. 4), the input instances A and B are the same as for the mask completion model.
  • the image pixels in the region M_AandB are erased, and the content completion model aims to predict the missing content.
  • AandB represents the following equation (7). That is, AandB represents the common part of A and B in set theory.
  • M AandB means the following equation (8).
  • the content completion model also incorporates the remaining mask of A (A ⁇ B) to indicate that it is A and not some other object. Therefore, it cannot be simply replaced with standard image filling approaches.
  • the loss of the content completion model for minimizing the loss is formulated as shown in the following equation (9).
  • A\B represents the difference set between set A and set B.
  • P ⁇ (c) is a content completion model
  • I is an image patch
  • L is a loss function consisting of losses commonly used in image completion, including an l1 loss, a perceptual loss, and an adversarial loss. Similar to the mask completion model, complete completion becomes possible by training the content completion model through partial completion learning.
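  • As a hedged illustration of how the losses named above can be combined for content completion, the following sketch sums an l1 term, a perceptual term, and an adversarial term. The weights and the two callables are placeholders, not values or interfaces from the disclosure.

```python
import torch.nn.functional as F


def content_completion_loss(pred_rgb, target_rgb, perceptual_fn, adversarial_fn,
                            w_l1=1.0, w_perc=0.1, w_adv=0.01):
    """Composite loss for a PCNet-C-style content completion network (sketch).

    perceptual_fn  : e.g. a VGG-feature distance between prediction and target
    adversarial_fn : generator-side adversarial term for the prediction
    The weights are illustrative assumptions.
    """
    l1 = F.l1_loss(pred_rgb, target_rgb)
    perceptual = perceptual_fn(pred_rgb, target_rgb)
    adversarial = adversarial_fn(pred_rgb)
    return w_l1 * l1 + w_perc * perceptual + w_adv * adversarial
```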
  • the objective ordered graph consists of pairwise occlusion relationships between all adjacent pairs of instances.
  • a proximate instance pair is defined as two instances whose modal masks are connected, so one can potentially be an occlusion of the other.
  • the modal mask M A1 of A1 is first targeted for complementation.
  • M_A2 plays the role of the eraser to obtain the increment of A_1, i.e., ΔA_1.
  • then, the order between A_1 and A_2 is determined by comparing their respective increments as follows.
  • the occlusion order of the scene is obtained, which is represented as a graph as shown in image G2-1 in image G2 in FIG. 4. Nodes in the graph represent objects, and edges indicate the direction of occlusion between adjacent objects.
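  • The pairwise ordering described above can be sketched as follows: for every adjacent pair of instances, each modal mask is completed with the other mask acting as the eraser, and the instance whose mask grows more is judged to be the occluded one, yielding a directed edge. The adjacency test by dilation and the `complete_mask` interface are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import binary_dilation


def are_adjacent(mask_a, mask_b, margin=2):
    """Treat two modal masks as adjacent if one touches the slightly dilated other."""
    return bool(np.logical_and(binary_dilation(mask_a, iterations=margin), mask_b).any())


def occlusion_order_edges(modal_masks, complete_mask):
    """Build directed occlusion edges from pairwise increments (sketch).

    modal_masks   : list of boolean arrays of shape (H, W), one per instance
    complete_mask : callable (target_mask, eraser_mask) -> completed boolean mask,
                    assumed to wrap a trained mask completion model
    Returns a list of edges (i, j) meaning "instance j occludes instance i".
    """
    edges = []
    n = len(modal_masks)
    for i in range(n):
        for j in range(i + 1, n):
            if not are_adjacent(modal_masks[i], modal_masks[j]):
                continue
            # Increment of each instance when the other plays the eraser.
            inc_i = complete_mask(modal_masks[i], modal_masks[j]) & ~modal_masks[i]
            inc_j = complete_mask(modal_masks[j], modal_masks[i]) & ~modal_masks[j]
            if inc_i.sum() > inc_j.sum():
                edges.append((i, j))   # A_i grows more, so A_j occludes A_i
            elif inc_j.sum() > inc_i.sum():
                edges.append((j, i))   # A_j grows more, so A_i occludes A_j
    return edges
```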
  • the graph of image G2-1 is a graph obtained for image G2-2.
  • the image G2-3 in the image G2 in FIG. 4 is a diagram in which each object mask is uniformly displayed in one diagram after completion of interpolation.
  • image G2-4 in image G2 in FIG. 4 also shows the result of RGB value complementation obtained using the preceding completion results.
  • C A is the decomposed content of A from the scene.
  • for the background content, the sum (union) of all instances is used as the eraser.
  • unlike image rendering that is not conscious of occlusion, content is supplemented for the estimated occlusion area.
  • Self-Supervised Scene De-occlusion is thus a self-supervised deep learning algorithm that obtains a mathematical model which performs both the acquisition of a directed graph restoring the order between adjacent objects and the completion of invisible parts, using the occlusion geometric relationships of objects. Note that the above-mentioned ordered graph (ordinal graph) is an example of a directed graph that restores the order between adjacent objects. Note also that an object here is an image appearing in the image to be estimated.
  • the learning device 1 that performs Self-Supervised Scene De-occlusion updates the occluded part estimation model using a self-supervised learning method.
  • the occluded part estimation model is a mathematical model that estimates the positional relationship between images appearing in the estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object.
  • hereinafter, an object that is partially occluded is referred to as an occluded object, and the occluded portion of the occluded object is referred to as the occluded part.
  • estimating a region specifically means, for example, estimating an image of the region. Therefore, the occluded part estimation model is a mathematical model that includes a mask complementation model, a content complementation model, and a double complementation model to be described later.
  • the mathematical model is updated until a predetermined end condition for learning (hereinafter referred to as "learning end condition") is met.
  • the occluded part estimation model at the time when the learning end condition is satisfied is used to estimate the image of the occluded part in various estimation target images, such as an estimation target image selected by the user or an estimation target image that satisfies predetermined conditions.
  • the double complementation model is a mathematical model that estimates the positional relationship between images in the estimation target image based on the estimation results of the mask complementation model.
  • the above-mentioned directed graph that restores the order between adjacent objects is an example of the positional relationship between images. Therefore, the double complementation model is, for example, a mathematical model that executes the above-mentioned double complementation process.
  • the occluded part estimation model is a mathematical model that estimates the occluded part reflected in the estimation target based on the image of the estimation target. Further, the occluded part estimation model is a mathematical model expressed by a neural network.
  • the neural network expressing the occluded part estimation model is, for example, a neural network including a deep neural network.
  • the neural network expressing the occluded region estimation model is, for example, a neural network including a convolutional neural network.
  • the mask complementation model and the content complementation model included in the occluded part estimation model are both mathematical models that are updated through learning, so the occluded part estimation model is a mathematical model that is updated through learning.
  • Both the mask complementation model and the content complementation model are mathematical models expressed, for example, by a neural network.
  • Both the neural network expressing the mask complementation model and the neural network expressing the content complementation model are, for example, deep neural networks.
  • Both the neural network representing the mask complementation model and the neural network representing the content complementation model are, for example, convolutional neural networks.
  • the mask completion model is trained to partially fill in the invisible mask of an object (occludee) that is hidden by an occluding object (occluder). Therefore, the mask complementation model is a mathematical model that estimates the shape of an image appearing in the estimation target image based on the estimation target image.
  • the content completion model is trained to partially fill the restored mask with RGB values. Therefore, the content complementation model estimates the RGB values of the image in the estimation target image based on the estimation result of the mask complementation model.
  • segmentation has already been performed on the execution target of the mask completion model before the execution of the mask completion model.
  • the target of the content completion model has also been segmented before the content completion model is executed. Note that in order to reduce the amount of calculation, the mask that was complemented in the previous work is used for content complementation, and then filling the mask with RGB values is started. That is, instance segmentation of each image is performed only once.
  • the target for execution of the double-completion model has also been segmented before the execution of the double-completion model. That is, in Self-Supervised Scene De-occlusion, segmentation is performed before executing the mask completion model, content completion model, and double completion model. Segmentation processing is included in the occluded region estimation model. By executing the occluded part estimation model, segmentation processing is executed before executing the mask complementation model, the content complementation model, and the double complementation model. Each of the mask complementation model, content complementation model, and double complementation model performs estimation using the results of segmentation.
  • FIG. 7 is a diagram showing an example of the hardware configuration of the learning device 1 in the embodiment.
  • the learning device 1 includes a control unit 11 including a processor 91 such as a CPU (Central Processing Unit) and a memory 92 connected via a bus, and executes a program.
  • the learning device 1 functions as a device including a control section 11, an input section 12, a communication section 13, a storage section 14, and an output section 15 by executing a program.
  • the processor 91 reads a program stored in the storage unit 14 and stores the read program in the memory 92.
  • the learning device 1 functions as a device including a control section 11, an input section 12, a communication section 13, a storage section 14, and an output section 15.
  • the control unit 11 controls the operations of various functional units included in the learning device 1.
  • the control unit 11 executes Self-Supervised Scene De-occlusion.
  • the control unit 11 controls, for example, the operation of the output unit 15 and causes the output unit 15 to output the execution result of Self-Supervised Scene De-occlusion.
  • the control unit 11 records, for example, various information generated by executing Self-Supervised Scene De-occlusion in the storage unit 14.
  • the various information stored in the storage unit 14 includes, for example, the results of execution of Self-Supervised Scene De-occlusion.
  • the input unit 12 includes input devices such as a mouse, a keyboard, and a touch panel.
  • the input unit 12 may be configured as an interface that connects these input devices to the learning device 1.
  • the input unit 12 receives input of various information to the learning device 1. For example, training data is input to the input unit 12 .
  • the training data may be image data of any image in which images appear.
  • however, when the usage scene of the trained occluded part estimation model is assumed in advance, it is desirable that images satisfying the conditions of the images expected to appear in the estimation target in that usage scene appear in the training images.
  • in that case, image data of such images is used as training data.
  • by doing so, the learned occluded part estimation model has high estimation accuracy when actually used in the assumed usage scene.
  • the learning device 1 itself can perform learning of the occluded part estimation model.
  • the communication unit 13 includes a communication interface for connecting the learning device 1 to an external device.
  • the communication unit 13 communicates with an external device via wire or wireless.
  • the external device is, for example, a device that is a source of training data.
  • the communication unit 13 acquires training data through communication with a transmission source of the training data.
  • the external device is, for example, the estimation device 2 described later.
  • the estimation device 2 is a device that performs estimation using a trained occluded part estimation model.
  • the communication unit 13 transmits the learned occluded part estimation model program to the estimating device 2 through communication with the estimating device 2 .
  • the storage unit 14 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device.
  • the storage unit 14 stores various information regarding the learning device 1.
  • the storage unit 14 stores information input via the input unit 12 or the communication unit 13, for example.
  • the storage unit 14 stores, for example, a shielded part estimation model.
  • the storage unit 14 stores various information generated by executing Self-Supervised Scene De-occlusion, for example.
  • the output unit 15 outputs various information.
  • the output unit 15 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, and an organic EL (Electro-Luminescence) display.
  • the output unit 15 may be configured as an interface that connects these display devices to the learning device 1.
  • the output unit 15 outputs information input to the input unit 12 or the communication unit 13, for example.
  • the output unit 15 may display the execution result of Self-Supervised Scene De-occlusion, for example.
  • FIG. 8 is a diagram showing an example of the configuration of the control unit 11 included in the learning device 1 in the embodiment.
  • the control unit 11 includes a training data acquisition unit 111, a learning unit 112, a storage control unit 113, and an output control unit 114.
  • the training data acquisition unit 111 acquires training data.
  • the training data acquisition unit 111 acquires training data input to the input unit 12 or the communication unit 13, for example.
  • the training data acquisition unit 111 may acquire training data by reading training data stored in the storage unit 14 in advance.
  • the learning unit 112 performs Self-Supervised Scene De-occlusion on the training data acquired by the training data acquisition unit 111.
  • the learning unit 112 executes the occluded part estimation model and updates the occluded part estimation model based on the execution result. The update is performed so that the accuracy of estimation by the occluded part estimation model is increased. That is, the learning unit 112 performs learning of the occluded part estimation model by executing Self-Supervised Scene De-occlusion.
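  • A minimal sketch of this learning loop (steps S201 to S203 of FIG. 9) is shown below; the three callables are placeholders for the training data acquisition unit 111, the Self-Supervised Scene De-occlusion procedure, and the learning end condition, respectively.

```python
def train_occluded_part_estimation_model(get_training_data, run_de_occlusion,
                                         learning_end_condition, max_steps=100_000):
    """Loop executed by the learning unit 112 (sketch; all callables are placeholders)."""
    for step in range(max_steps):
        batch = get_training_data()        # step S201: acquire training data
        run_de_occlusion(batch)            # step S202: Self-Supervised Scene De-occlusion
        if learning_end_condition(step):   # step S203: end condition satisfied?
            break                          # the model at this point is the trained model
```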
  • the storage control unit 113 records various information in the storage unit 14.
  • the output control section 114 controls the operation of the output section 15.
  • FIG. 9 is a flowchart showing an example of the flow of processing executed by the learning device 1 in the embodiment.
  • the training data acquisition unit 111 acquires training data (step S201).
  • the learning unit 112 performs Self-Supervised Scene De-occlusion on the obtained training data (Step S202).
  • the learning unit 112 determines whether the learning end condition is satisfied (step S203). If the learning end condition is satisfied (step S203: YES), the process ends.
  • the occluded part estimation model at the time when the learning end condition is satisfied is the trained occluded part estimation model. On the other hand, if the learning end condition is not satisfied (step S203: NO), the process returns to step S201.
  • the trained occluded part estimation model obtained in this way is used in the process of estimating the occluded part shown in the image indicated by the image data, based on the input image data.
  • An example of a device that executes such processing is the estimation device 2.
  • the estimation device 2 obtains the learned occluded part estimation model in advance by obtaining the learned occluded part estimation model from the learning device 1 through communication, for example, before executing the learned occluded part estimation model.
  • alternatively, the estimation device 2 may obtain the learned occluded part estimation model in advance before executing it by, for example, being equipped with a neural network that expresses the trained occluded part estimation model.
  • FIG. 10 is an explanatory diagram illustrating an overview of the estimation device 2 in the embodiment.
  • the estimation device 2 uses the trained occluded part estimation model obtained by the learning device 1 to estimate the occluded part appearing in the estimation target image. More specifically, the estimation device 2 receives the global amodal instance and executes segmentation annotation. The estimation device 2 generates an ordered directed graph for occlusion.
  • the estimation device 2 executes the learned PCNet-M.
  • PCNet-M is an example of a neural network expressing a mask completion model.
  • the estimation device 2 executes the learned PCNet-C.
  • PCNet-C is an example of a neural network expressing a content completion model.
  • the estimation device 2 outputs the occluded object of the target as a restored object.
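  • The inference flow of FIG. 10 can be sketched as follows; each callable is an assumed wrapper around the corresponding trained component, not a disclosed API.

```python
def estimate_occluded_parts(image, segment_annotate, build_order_graph,
                            run_pcnet_m, run_pcnet_c):
    """Sketch of the estimation device 2: segmentation annotation of global amodal
    instances -> ordered directed occlusion graph -> trained PCNet-M (mask completion)
    -> trained PCNet-C (content completion) -> restored objects."""
    instances = segment_annotate(image)                  # global amodal instances
    order = build_order_graph(instances)                 # ordered directed graph for occlusion
    amodal_masks = run_pcnet_m(image, instances, order)  # complete the invisible masks
    restored = run_pcnet_c(image, amodal_masks)          # fill completed regions with RGB content
    return restored
```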
  • FIG. 11 is a diagram showing an example of the hardware configuration of the estimation device 2 in the embodiment.
  • the estimation device 2 includes a control unit 21 including a processor 93 such as a CPU and a memory 94 connected via a bus, and executes a program.
  • the estimation device 2 functions as a device including a control section 21, an input section 22, a communication section 23, a storage section 24, and an output section 25 by executing a program.
  • the processor 93 reads the program stored in the storage unit 24 and stores the read program in the memory 94.
  • the estimation device 2 functions as a device including the control section 21, the input section 22, the communication section 23, the storage section 24, and the output section 25.
  • the control unit 21 controls the operations of various functional units included in the estimation device 2.
  • the control unit 21 executes the learned occluded part estimation model.
  • the control unit 21 controls, for example, the operation of the output unit 25 and causes the output unit 25 to output the execution result of the learned occluded part estimation model.
  • the control unit 21 records, for example, various types of information generated by executing the learned occluded part estimation model in the storage unit 24.
  • the various information stored in the storage unit 24 includes, for example, the execution results of the learned occluded part estimation model.
  • the input unit 22 includes input devices such as a mouse, a keyboard, and a touch panel.
  • the input unit 22 may be configured as an interface that connects these input devices to the estimation device 2.
  • the input unit 22 receives input of various information to the estimation device 2 .
  • image data on which a trained occluded region estimation model is to be executed is input to the input unit 22 .
  • the image data to be executed by the trained occluded part estimation model is the image data of the image to be estimated.
  • the communication unit 23 is configured to include a communication interface for connecting the estimation device 2 to an external device.
  • the communication unit 23 communicates with an external device via wire or wireless.
  • the external device is, for example, a device that is a source of image data of an image to be estimated.
  • the external device is, for example, the learning device 1.
  • the communication unit 23 may receive the trained occluded part estimation model from the learning device 1 through communication with the learning device 1 .
  • the storage unit 24 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device.
  • the storage unit 24 stores various information regarding the estimation device 2.
  • the storage unit 24 stores information input via the input unit 22 or the communication unit 23, for example.
  • the storage unit 24 stores, for example, a learned occluded part estimation model in advance before executing the learned occluded part estimation model.
  • the storage unit 24 stores, for example, various types of information generated by executing the learned occluded part estimation model.
  • the output unit 25 outputs various information.
  • the output section 25 is configured to include a display device such as a CRT display, a liquid crystal display, an organic EL display, or the like.
  • the output unit 25 may be configured as an interface that connects these display devices to the estimation device 2.
  • the output unit 25 outputs information input to the input unit 22 or the communication unit 23, for example.
  • the output unit 25 may display, for example, the execution result of the learned occluded part estimation model.
  • FIG. 12 is a diagram showing an example of the configuration of the control unit 21 included in the estimation device 2 in the embodiment.
  • the control unit 21 includes a target data acquisition unit 211, an estimation unit 212, a storage control unit 213, and an output control unit 214.
  • the target data acquisition unit 211 acquires image data on which the trained occluded part estimation model is to be executed.
  • the target data acquisition unit 211 acquires, for example, image data input via the input unit 22 or the communication unit 23 as image data to be executed by the learned occluded part estimation model.
  • the estimation unit 212 executes the learned occluded part estimation model on the image data acquired by the target data acquisition unit 211.
  • the estimation unit 212 estimates the occluded part appearing in the image indicated by the image data to be executed by executing the learned occluded part estimation model.
  • the storage control unit 213 records various information in the storage unit 24.
  • the output control section 214 controls the operation of the output section 25.
  • FIG. 13 is a flowchart illustrating an example of the flow of processing executed by the estimation device 2 in the embodiment.
  • the target data acquisition unit 211 acquires image data of an image to be estimated (step S301).
  • the estimation unit 212 executes the learned occluded part estimation model on the image data acquired in step S301 (step S302).
  • the output control unit 214 controls the operation of the output unit 25 to display the estimation result of the estimation unit 212 (step S303).
  • as described above, the learning device 1 of the embodiment performs self-supervised learning to obtain a mathematical model that performs both the acquisition of a directed graph restoring the order between adjacent objects and the completion of invisible parts, using the occlusion geometrical relationships of objects.
  • that is, a mathematical model is trained to estimate occluded areas. Therefore, the learning device 1 can improve the accuracy of estimating the portion of a partially occluded object that is covered by the occluder.
  • the learning device 1 of the embodiment configured in this way obtains a mathematical model in which processing of modal perception and amodal perception is performed through learning.
  • modal perception refers to the analysis of directly visible areas
  • amodal perception refers to the perception of the intact structure of an entity, including invisible areas. Therefore, the learning device 1 can improve the accuracy of estimating the portion of a partially occluded object that is covered by the occluder.
  • the estimation device 2 of the first embodiment configured as described above uses the trained occluded part estimation model obtained by the learning device 1 to estimate the occluded part reflected in the estimation target. Therefore, the estimation device 2 can improve the accuracy of estimating the part of the object that is partially obscured.
  • next, an application scene of the trained occluded part estimation model will be explained, followed by further explanation of the learning device 1 and the estimation device 2.
  • the trained occluded part estimation model is used, for example, in a factory that produces parts of a predetermined shape. In a factory, there are cases where it is necessary to estimate the occluded part of a part shown in an image as shown in FIG. 14 below, and in such a case, a trained occluded part estimation model is used.
  • FIG. 14 is a diagram showing an example of an image to be estimated in the embodiment.
  • the image in FIG. 14 shows three circular parts C1 to C3 inside a box. The three circular parts partially overlap. More specifically, the component C1 and the component C2 are located above the component C3, and the component C1 is located above the component C2.
  • Each of the bounding boxes B1 to B3 in FIG. 14 is the result of the bounding box regression used for component detection.
  • FIG. 15 is an explanatory diagram illustrating the bounding box in the embodiment.
  • FIG. 15 shows one bounding box B4.
  • circle C4 is inscribed in bounding box B4.
  • Circle C4 is the outline of the image reflected in the image.
  • the image in FIG. 15 is, for example, the outline of the part to be detected.
  • geometric information of an image, such as its size and shape, can be expressed mathematically using, for example, a polar coordinate system.
  • by using a rectangle (i.e., a bounding box), geometric information about the image can be obtained.
  • since circle C4 is inscribed in bounding box B4, the area of circle C4 is π/4 ≈ 0.785 times the area of bounding box B4.
  • Circle C4 is a circle in the example of FIG. 15 because it does not have a shielded part, but it is not necessarily a circle if the detection target has a shielded part.
  • the ratio of the area of the detection target to the area of the bounding box is 0.785, as described above. However, if the detection target has a shielded part, the ratio of the area of the detection target to the area of the bounding box is not necessarily 0.785.
  • the ratio of the area of the detection target to the area of the bounding box will be referred to as the detection area ratio.
  • the geometric information of the occluded part is, for example, the occlusion rate.
  • the occlusion rate is the ratio of the area of the occluded part to the area of the bounding box. Both the detection area ratio and the occlusion rate are examples of geometric information possessed by the detection target.
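  • The following small numeric sketch illustrates the detection area ratio and the occlusion rate for an inscribed circle; the helper names are placeholders, and for simplicity the bounding box is assumed to still cover the full part even when part of it is hidden.

```python
import math


def detection_area_ratio(visible_area, bbox_area):
    """Ratio of the detection target's visible area to its bounding-box area."""
    return visible_area / bbox_area


def occlusion_rate(occluded_area, bbox_area):
    """Ratio of the occluded part's area to the bounding-box area."""
    return occluded_area / bbox_area


# A circle of radius r inscribed in its square bounding box of side 2r.
r = 1.0
bbox_area = (2 * r) ** 2                     # 4 * r^2
circle_area = math.pi * r ** 2               # pi * r^2
print(detection_area_ratio(circle_area, bbox_area))          # pi/4 = 0.785...

# If 30% of the circle is hidden, the visible ratio drops below 0.785
# and the difference shows up as the occlusion rate.
hidden_area = 0.3 * circle_area
print(detection_area_ratio(circle_area - hidden_area, bbox_area))  # about 0.550
print(occlusion_rate(hidden_area, bbox_area))                      # about 0.236
```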
  • by using such geometric information, the accuracy of estimating the occluded part can be further improved. For example, for the above-mentioned components C1 and C2, when the respective detection area ratios are obtained, it can be determined that the one with the larger detection area ratio covers the one with the smaller detection area ratio.
  • the occluded part estimation model is, for example, a mathematical model that also estimates the geometric information of the image appearing in the estimation target image, and the learning unit 112 updates the occluded part estimation model based also on the geometric information estimated by the occluded part estimation model. That is, the learning unit 112 performs learning of the occluded part estimation model using also the geometric information of the image appearing in the estimation target image.
  • the trained occluded part estimation model used by the estimation device 2 may also perform estimation using the geometric information of the image appearing in the estimation target image.
  • An image data generation support process will be described as an example of a technique for supporting the generation of image data of an image that satisfies the above-mentioned conditions for an image that appears in an image to be estimated.
  • the conditions of the image reflected in the estimation target image are, for example, geometric information such as the detection area ratio.
  • in the image data generation support process, a predetermined function (hereinafter referred to as the "image function") that indicates the shape of the image appearing in the image data to be generated and has one or more parameters is used.
  • the values of the parameters of the image function follow a predetermined probability distribution.
  • One of the parameters of the image function is a value indicating geometric information such as detection area ratio.
  • Image data generation support processing generates various image data with different geometric information by changing the values of image function parameters according to a predetermined probability distribution each time image data is generated. .
  • the image function is, for example, a function that includes only parameters indicating the size of the figure, such as the detection area ratio, and that expresses a figure of constant shape regardless of the values of those parameters.
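  • As an illustration of the image data generation support process, the sketch below uses a circle as the figure expressed by the image function and samples its detection area ratio from a uniform distribution each time an image is generated; the shape, the distribution, and all names are assumptions, not details of the disclosure.

```python
import numpy as np


def render_circle_image(ratio, image_size=128):
    """Hypothetical 'image function': renders one circle whose detection area ratio
    relative to a fixed square bounding box equals `ratio` (0 < ratio <= pi/4)."""
    img = np.zeros((image_size, image_size), dtype=np.uint8)
    bbox_side = image_size // 2                        # side length of the bounding box
    radius = np.sqrt(ratio * bbox_side ** 2 / np.pi)   # area ratio -> radius
    cy = cx = image_size // 2
    yy, xx = np.ogrid[:image_size, :image_size]
    img[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 255
    return img


def generate_dataset(n_images, seed=0):
    """Sample the image-function parameter from a predetermined probability
    distribution (here uniform, as an assumption) for each generated image."""
    rng = np.random.default_rng(seed)
    ratios = rng.uniform(0.2, np.pi / 4, size=n_images)
    return [render_circle_image(r) for r in ratios]
```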
  • the neural network expressing the mask complementation model may be an object detection convolutional neural network such as YOLOv4.
  • the neural network expressing the content complementation model may be YOLOv4.
  • the neural network expressing the occluded part estimation model may be YOLOv4.
  • FIG. 16 is a diagram showing an example of the configuration of the learning section 112 included in the learning device 1 in the embodiment.
  • the learning unit 112 includes, for example, an object detection unit 121, an instance segmentation unit 122, a comparison unit 123, a conversion unit 124, an amodal information determination unit 125, an occlusion generation unit 126, a mask complementation model execution unit 127, a content complementation model execution unit 128, It includes a success/failure determination section 129, an amodal information inspection section 130, an occlusion rate modification section 131, and a shape determination section 132.
  • the object detection unit 121 executes the process of step S102, for example.
  • the process in step S101 is executed by, for example, the training data acquisition unit 111.
  • the instance segmentation unit 122 executes, for example, the process of step S103.
  • the comparison unit 123 executes the process of step S104, for example.
  • the conversion unit 124 executes the process of step S105, for example.
  • the amodal information determination unit 125 executes, for example, the process of step S106.
  • the occlusion generation unit 126 executes the process of step S107, for example.
  • the mask complementary model execution unit 127 executes, for example, the process of step S108.
  • the content complementation model execution unit 128 executes, for example, the process of step S109.
  • the success/failure determination unit 129 executes, for example, the process of step S110.
  • the amodal information inspection unit 130 executes, for example, the process of step S111.
  • the occlusion rate correction unit 131 executes, for example, the process of step S112.
  • the shape determination unit 132 executes, for example, the process of step S113.
  • the learning device 1 may be implemented using a plurality of information processing devices communicatively connected via a network.
  • each functional unit included in the learning device 1 may be distributed and implemented in a plurality of information processing devices.
  • estimation device 2 may be implemented using a plurality of information processing devices that are communicably connected via a network.
  • each functional unit included in the estimation device 2 may be distributed and implemented in a plurality of information processing devices.
  • the estimated occlusion object detection device 100 may be implemented using a plurality of information processing devices communicatively connected via a network.
  • in this case, each functional unit included in the estimated occlusion object detection device 100 may be distributed and implemented in a plurality of information processing devices.
  • the learning device 1 and the estimation device 2 do not necessarily need to be implemented as different devices, and may be implemented as one device that includes the functions of the learning device 1 and the functions of the estimation device 2.
  • the control unit 11 and the control unit 21 may be implemented as one control unit. That is, each functional unit provided in the control unit 11 and each functional unit provided in the control unit 21 may be implemented in one control unit instead of being implemented in different control units.
  • as described above, the occlusion estimation object detection device 100 of the present invention is a device for estimating one or more occluded parts of an object whose one or more parts are occluded by one or more occluding objects.
  • the occluded parts are segmented, and the device includes the learning unit 112, which performs self-supervised learning of a mathematical model that, in the process of estimating a segmented occluded part, estimates the positional relationship between images appearing in the estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image.
  • control unit 11 calculates the number of occlusions and the number of occluded objects in the segmentation of the occluded portion, performs global amodal conversion, and inputs annotation in the amodal information format.
  • the control unit 11 may generate the mask complementation model after obtaining the order of the segmented images in the amodal information format annotation, and may then apply a convolutional neural network.
  • the control unit 11 may apply a convolutional neural network using the order of the segmented images or the occlusion rate in the amodal information format annotation.
  • all or part of each function of the occlusion estimation object detection device 100, the learning device 1, and the estimation device 2 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array).
  • the program may be recorded on a computer-readable recording medium.
  • the computer-readable recording medium is, for example, a portable medium such as a flexible disk, magneto-optical disk, ROM, or CD-ROM, or a storage device such as a hard disk built into a computer system.
  • the program may be transmitted via a telecommunications line.
  • amodal completion means non-model completion.
  • modal completion means model completion.
  • global amodal completion means amodal completion that targets a large space.
  • 100...Occlusion estimation object detection device, 1...Learning device, 2...Estimation device, 11...Control unit, 12...Input unit, 13...Communication unit, 14...Storage unit, 15...Output unit, 111...Training data acquisition unit, 112...Learning unit, 113...Storage control unit, 114...Output control unit, 21...Control unit, 22...Input unit, 23...Communication unit, 24...Storage unit, 25...Output unit, 211...Target data acquisition unit, 212...Estimation unit, 213...Storage control unit, 214...Output control unit, 121...Object detection unit, 122...Instance segmentation unit, 123...Comparison unit, 124...Conversion unit, 125...Amodal information determination unit, 126...Occlusion generation unit, 127...Mask complementation model execution unit, 128...Content complementation model execution unit, 129...Success/failure determination unit, 130...Amodal information inspection unit, 131...Occlusion rate correction unit, 132...Shape determination unit, 91...Processor, 92...Memory, 93...Processor, 94...Memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Provided is an occlusion-inference object detection device that infers one or more occluded portions of an object that are occluded by one or more occluding objects, said device comprising a learning unit that trains a mathematical model by self-supervised learning; in a process of inferring an occluded portion resulting from segmentation of occluded portions, the mathematical model infers a positional relationship between objects appearing in an image undergoing inference and, on the basis of the positional relationship, infers an occluded part of an occluded object, which is an object appearing in the image undergoing inference and having an occluded portion; the learning unit also uses geometrical information pertaining to the object appearing in the image undergoing inference to train the mathematical model.

Description

Estimated occlusion object detection device, estimated occlusion object detection method, and program
The present invention relates to an estimated occlusion object detection device, an estimated occlusion object detection method, and a program.
This application claims priority based on Japanese Patent Application No. 2022-035212 filed in Japan on March 8, 2022, the contents of which are incorporated herein.
In the field of image recognition, deep learning, which automatically acquires the feature extraction process through learning, has been in the spotlight since the 2010s. Image recognition using deep learning has achieved overwhelming results in general object recognition compared to earlier methods. In recent years, image recognition technology with near-human visual recognition based on deep learning has been put to practical use in various fields such as surveillance cameras, autonomous driving, and robotics, where understanding the safety of the surrounding environment and detecting obstacles are required.
Japanese Patent Application Publication No. 2011-186633 (特開2011-186633号公報)
However, in the case of an image of a partially occluded object, some of the object's features are lost, and image recognition may become difficult. That is, there are cases where the part of a partially occluded object covered by the occluder cannot be estimated, or where the accuracy of estimating that part is poor, making image recognition difficult.
In view of the above circumstances, an object of the present invention is to provide a technique that improves the accuracy of estimating the part of a partially occluded object that is covered by the occluding object.
One aspect of the present invention is an occlusion estimation object detection device for estimating one or more occluded parts of an object whose one or more parts are occluded by one or more occluding objects, in which the occluded parts are segmented. The device comprises a learning unit that performs self-supervised learning of a mathematical model which, in the process of estimating a segmented occluded part, estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. The learning unit performs learning of the mathematical model using also geometric information possessed by the images appearing in the estimation target image.
Another aspect of the present invention is an occlusion estimation object detection device comprising: a target data acquisition unit that acquires image data; and an estimation unit that estimates an occluded part of an image appearing in the image of the image data acquired by the target data acquisition unit, by executing a trained mathematical model obtained by a learning unit that performs self-supervised learning of a mathematical model which estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. The learning unit performs learning of the mathematical model using also geometric information possessed by the images appearing in the estimation target image.
Another aspect of the present invention is an occlusion estimation object detection method having a learning step of performing self-supervised learning of a mathematical model which estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. In the learning step, the mathematical model is learned using also geometric information possessed by the images appearing in the estimation target image.
Another aspect of the present invention is an occlusion estimation object detection method having: a target data acquisition step of acquiring image data; and an estimation step of estimating an occluded part of an image appearing in the image of the image data acquired in the target data acquisition step, by executing a trained mathematical model obtained through self-supervised learning of a mathematical model which estimates the positional relationship between images appearing in an estimation target image and, based on that positional relationship, estimates the occluded part of the image of the occluded object, which is a partially occluded object appearing in the estimation target image. The learning is learning of the mathematical model that also uses geometric information possessed by the images appearing in the estimation target image.
Another aspect of the present invention is a program for causing a computer to function as the above-mentioned occlusion estimation object detection device.
 本発明により、一部が遮蔽された物体の遮蔽物に覆われた部位の推定の精度を向上させることが可能となる。 According to the present invention, it is possible to improve the accuracy of estimating the part of an object that is partially obscured.
FIG. 1 is a diagram showing an example of the overall flowchart of the present invention.
FIG. 2 is a diagram showing an example of a flowchart for performing object detection segmentation, which is a component of the present invention.
FIG. 3 is a diagram showing an example of a flowchart for the mathematical modeling that performs amodal labeling and correction of the occlusion rate using geometric information, which is a characteristic component of the present invention.
FIG. 4 is an explanatory diagram illustrating an overview of the occlusion-inference object detection device of the embodiment.
FIG. 5 is a first explanatory diagram illustrating an algorithm executed by the learning device in the embodiment.
FIG. 6 is a second explanatory diagram illustrating an algorithm executed by the learning device in the embodiment.
FIG. 7 is a diagram showing an example of the hardware configuration of the learning device in the embodiment.
FIG. 8 is a diagram showing an example of the configuration of the control unit included in the learning device in the embodiment.
FIG. 9 is a flowchart showing an example of the flow of processing executed by the learning device in the embodiment.
FIG. 10 is an explanatory diagram illustrating an overview of the estimation device in the embodiment.
FIG. 11 is a diagram showing an example of the hardware configuration of the estimation device in the embodiment.
FIG. 12 is a diagram showing an example of the configuration of the control unit included in the estimation device in the embodiment.
FIG. 13 is a flowchart showing an example of the flow of processing executed by the estimation device in the embodiment.
FIG. 14 is a diagram showing an example of an estimation target image in the embodiment.
FIG. 15 is an explanatory diagram illustrating a bounding box in the embodiment.
FIG. 16 is a diagram showing an example of the configuration of the learning unit included in the learning device in the embodiment.
(Embodiment)
FIG. 1 is a diagram showing an example of the overall flowchart of the present invention. FIG. 2 is a diagram showing an example of a flowchart for performing object detection segmentation, which is a component of the present invention. FIG. 3 is a diagram showing an example of a flowchart for performing annotation, which is a component of the present invention; that is, FIG. 3 shows an example of a flowchart for the mathematical modeling that performs amodal labeling and correction of the occlusion rate using geometric information, which is a characteristic component of the present invention.
The present invention determines, at a predetermined cycle, whether an image satisfying a predetermined condition regarding its format has been input (step S101). The predetermined format condition is, for example, that the image is 640 pixels by 480 pixels. If no image satisfying the predetermined format condition has been input (step S101: NO), the process returns to step S101. On the other hand, if an image satisfying the predetermined format condition has been input (step S101: YES), the present invention detects, among the images appearing in the image acquired in step S101, the images that satisfy a predetermined condition (step S102). Hereinafter, a detected image is referred to as a detection object. The predetermined condition is, for example, a condition input to the present invention by the user.
Next, the present invention performs segmentation on the image acquired in step S101 (step S103). The present invention then determines whether the number of masks is equal to the number of detection objects (step S104). Specifically, the present invention makes this determination based on the segmentation results.
If they are not equal (step S104: NO), the process returns to step S104. On the other hand, if they are equal (step S104: YES), the present invention converts the segmentation model to a global amodal one (step S105).
Next, the present invention determines whether the amodal information is correct based on the global amodal result (step S106). If the amodal information is correct (step S106: YES), the present invention acquires information indicating the occlusion order (step S107). Next, the present invention executes a mask completion model such as PCNet-M on the image input in step S101 (step S108).
Next, the present invention executes a content completion model such as PCNet-C on the result of executing the mask completion model on the image input in step S101 (step S109). The present invention then determines, based on the result of step S109, whether the occlusion order has been correctly recovered (step S110). If the occlusion order has been correctly recovered (step S110: YES), the processing ends. On the other hand, if the occlusion order has not been correctly recovered (step S110: NO), the process returns to step S108.
On the other hand, if the amodal information is not correct in step S106 (step S106: NO), the present invention inspects the amodal information (step S111). That is, the present invention checks whether an unlabeled object has an amodal-information label, and if not, performs amodal labeling using the occlusion rate and geometric information. Next, the present invention corrects the occlusion rate based on the result of inspecting the amodal information (step S112). That is, the present invention corrects the occlusion rate using geometric information; correcting the occlusion rate also corrects the calculation ratio. Next, the present invention determines whether the object has a predetermined specific shape (step S113). If it does not have the specific shape (step S113: NO), the process returns to step S111. If it has the specific shape (step S113: YES), the process returns to step S105.
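For readers who prefer code to flowcharts, the sketch below restates steps S101 to S113 as ordinary control flow in Python. It is only an illustrative summary of the description above: every method on the helper object `h` is a hypothetical placeholder, not an API defined by this disclosure, and the unbounded loop-backs of the flowchart (S110 to S108, S113 to S105 or S111) are simplified to bounded retries so that the sketch always terminates.

```python
def process_image(image, h, max_retries=5):
    """Compact restatement of steps S101-S113 of FIGS. 1 to 3.
    All methods on `h` are hypothetical placeholders for the processing
    described in the text; loop-backs are simplified to bounded retries."""
    if not h.has_valid_format(image):                      # S101 (e.g. 640x480)
        return None
    detections = h.detect_objects(image)                   # S102
    masks = h.segment(image)                               # S103
    if len(masks) != len(detections):                      # S104
        return None
    for _ in range(max_retries):
        global_amodal = h.to_global_amodal(masks)          # S105
        if not h.amodal_info_is_correct(global_amodal):    # S106: NO
            h.label_unlabeled_amodal(global_amodal)        # S111 (amodal labeling)
            h.correct_occlusion_rate(global_amodal)        # S112 (uses geometry)
            if h.has_specific_shape(global_amodal):        # S113: YES -> back to S105
                continue
            return None                                    # S113: NO (simplified)
        order = h.occlusion_order(global_amodal)           # S107
        completed_masks = h.run_mask_completion(image, order)              # S108 (e.g. PCNet-M)
        completed_scene = h.run_content_completion(image, completed_masks) # S109 (e.g. PCNet-C)
        if h.order_recovered(completed_scene):             # S110
            return completed_scene
    return None
```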
FIG. 4 is an explanatory diagram illustrating an overview of the occlusion-inference object detection device 100 of the embodiment. The occlusion-inference object detection device 100 is an example of the present invention; for example, the occlusion-inference object detection device 100 executes the flowcharts shown in FIGS. 1 to 3. The occlusion-inference object detection device 100 includes a learning device 1 and an estimation device 2. FIG. 4 is also an explanatory diagram illustrating an overview of the algorithm executed by the learning device 1. Prior to describing the learning device 1, Self-Supervised Scene De-occlusion, which is the algorithm executed by the learning device 1 and an algorithm for scene de-occlusion, will be described with reference to FIGS. 5 and 6 in addition to FIG. 4. FIG. 5 is a first explanatory diagram illustrating the algorithm executed by the learning device 1 in the embodiment; more specifically, FIG. 5 shows the image G1 of FIG. 4 in more detail. FIG. 6 is a second explanatory diagram illustrating the algorithm executed by the learning device 1 in the embodiment.
<Self-Supervised Scene De-occlusion>
Self-Supervised Scene De-occlusion is an algorithm that completes occluded parts. It aims to recover the occlusion order and to complete the invisible parts of objects covered by occluders. It is also a self-supervised learning framework that tackles de-occlusion on real-world data without manually annotating occlusion orders or amodal masks. The framework performs occlusion order recovery and completes amodal masks and the content of occluded regions.
In order to cope with the absence of manual annotations of the occlusion order and amodal masks, the learning device 1 partially completes instances in a self-supervised manner using two mathematical models described later: a mask completion model and a content completion model. The mathematical models are concretely expressed by neural networks. Note that the term "amodal completion" mentioned above refers to recognizing an occluded, invisible part by mentally filling it in.
Given an image, the modal masks of the objects can be obtained using an off-the-shelf instance segmentation framework; as a technique for obtaining such masks, deep learning such as Mask R-CNN is used, for example. However, the corresponding amodal masks are not available, and since it is not even known whether the obtained masks are intact, learning to complete occluded instances is very difficult. For this reason, Self-Supervised Scene De-occlusion performs partial completion in a self-supervised manner.
The motivation is as follows. Suppose that the modal mask of an instance constitutes a pixel set M, and let G be its ground-truth amodal mask. A supervised approach solves the full completion problem of equation (1) below.
f_θ(M) = G    ... (1)
Here, f_θ denotes the full completion model. This completion process can be decomposed as in equation (2) below.
f_θ(M) = p_θ(p_θ( ... p_θ(M) ... )) = G    ... (2)
When the instance is covered by multiple occluders, M_k denotes an intermediate state of this decomposition, and p_θ denotes the partial completion model.
The mask completion model will now be described. An example of a neural network expressing the mask completion model is "PCNet-M" (see, for example, FIG. 4). First, training data is prepared. Given an instance A and its mask M_A from a dataset D with instance-level annotations, another instance B is randomly sampled from D and randomly placed to obtain a mask M_B. Here, M_A and M_B are regarded as sets of pixels. There are two cases of input, and different inputs are fed to the network.
The first case corresponds to the partial completion strategy. M_B is defined as the eraser. The mask completion model uses B to erase part of A, obtaining M_AoutB. The mask completion model then learns to restore the original modal mask M_A from M_AoutB, conditioned on M_B. In the following description, M_AoutB is written using the symbol of equation (3) below.
M_{A\B}    ... (3)
Note that AoutB denotes the set difference between A and B. The second case is a regularization that prevents the network from over-completing an instance when that instance is not occluded. Specifically, the eraser is taken to be the mask M_{B\A}, which does not intrude into A. In this case, the mask completion model is encouraged to keep the original mask M_A unchanged, conditioned on that eraser. Without case 2, the mask completion model would always be encouraged to increase the number of pixels, and could therefore over-complete an instance even when it is not surrounded by neighboring instances.
In either case, the erased image patch serves as an auxiliary input. The loss functions are formulated as in equations (4) and (5) below.
L_1 = L(P_θ^(m)(M_{A\B}, M_B, I), M_A)    ... (4)
L_2 = L(P_θ^(m)(M_A, M_{B\A}, I), M_A)    ... (5)
Here, P_θ^(m)(·) denotes the mask completion model, θ the parameters to be optimized, I the image patch, and L the binary cross-entropy loss. The final loss function is defined as follows.
L^(m) = γ·L_1 + (1 − γ)·L_2    ... (6)
Here, γ denotes the probability of selecting the first case. By randomly switching between the two cases, the network comes to understand the ordering relationship from the shapes and boundaries of two neighboring instances and to decide whether or not to complete an instance.
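The following is a minimal sketch of how one training pair for this kind of partial mask completion could be assembled, with the two cases switched with probability γ. It assumes plain numpy boolean masks and a toy stand-in prediction; the names make_training_case and binary_cross_entropy are illustrative assumptions and do not come from this disclosure or from the PCNet-M implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_case(mask_a, mask_b, gamma=0.8):
    """Assemble one partial-completion training triple (input, eraser, target).
    mask_a, mask_b: HxW boolean modal masks of instances A and B."""
    if rng.random() < gamma:
        # Case 1: B erases part of A; the model must recover the full M_A
        # from M_{A\B}, conditioned on the eraser M_B.
        input_mask = mask_a & ~mask_b      # M_{A\B}
        eraser = mask_b                    # M_B
    else:
        # Case 2 (regularization): the eraser M_{B\A} does not intrude into A,
        # so the model must learn to keep M_A unchanged.
        input_mask = mask_a                # M_A
        eraser = mask_b & ~mask_a          # M_{B\A}
    target = mask_a                        # the target is M_A in both cases
    return input_mask, eraser, target

def binary_cross_entropy(pred, target, eps=1e-7):
    """Per-pixel binary cross-entropy, the loss L referred to above."""
    pred = np.clip(pred, eps, 1 - eps)
    t = target.astype(float)
    return float(-(t * np.log(pred) + (1 - t) * np.log(1 - pred)).mean())

# Toy example with two overlapping square masks on a 16x16 grid.
mask_a = np.zeros((16, 16), dtype=bool); mask_a[2:10, 2:10] = True
mask_b = np.zeros((16, 16), dtype=bool); mask_b[6:14, 6:14] = True
inp, eraser, target = make_training_case(mask_a, mask_b)
# Stand-in "prediction"; a real PCNet-M would be a CNN fed (inp, eraser, image patch).
pred = inp.astype(float) * 0.9 + 0.05
print("toy loss:", round(binary_cross_entropy(pred, target), 3))
```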
The content completion model will now be described. The content completion model follows the same intuition as the mask completion model, but what it completes is RGB content. As shown in FIG. 5 (or image G1 of FIG. 4), the input instances A and B are the same as for the mask completion model. The image pixels in the region M_AandB are erased, and the content completion model aims to predict the missing content. Note that AandB denotes equation (7) below; that is, AandB denotes the intersection of A and B in set theory.
A ∩ B    ... (7)
Accordingly, M_AandB stands for equation (8) below.
M_{A∩B} = M_A ∩ M_B    ... (8)
The content completion model also takes in the remaining mask of A (A\B) in order to indicate that the region to be filled belongs to A and not to some other object. It therefore cannot simply be replaced by a standard image inpainting approach. In this case, the loss of the content completion model to be minimized is formulated as in equation (9) below. A\B denotes the set difference between set A and set B.
L^(c) = L(P_θ^(c)(I ⊙ (1 − M_{A∩B}), M_{A\B}, M_{A∩B}), I)    ... (9)
Here, P_θ^(c) denotes the content completion model, I the image patch, and L a loss function composed of losses commonly used in image inpainting, including the ℓ1 loss, a perceptual loss, and an adversarial loss. As with the mask completion model, training the content completion model through partial completion makes full completion possible.
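For concreteness, the sketch below assembles the three inputs that a content completion model of this kind receives for one training pair: the image with the region A∩B erased, the remaining mask A\B (which tells the model that the region to be filled belongs to A), and the erased region itself. The array shapes and the helper name are assumptions made for this example.

```python
import numpy as np

def content_completion_inputs(image, mask_a, mask_b):
    """Prepare (erased_image, remaining_mask, erased_mask) for one training pair.
    image: HxWx3 float array; mask_a, mask_b: HxW boolean modal masks."""
    erased_region = mask_a & mask_b                 # M_{A and B}, pixels to predict
    remaining = mask_a & ~mask_b                    # A\B, marks the region as part of A
    erased_image = image * (~erased_region)[..., None].astype(image.dtype)
    return erased_image, remaining, erased_region

# Toy example with a flat gray image and two overlapping square masks.
image = np.full((16, 16, 3), 0.5, dtype=np.float32)
mask_a = np.zeros((16, 16), dtype=bool); mask_a[2:10, 2:10] = True
mask_b = np.zeros((16, 16), dtype=bool); mask_b[6:14, 6:14] = True
erased_image, remaining, erased = content_completion_inputs(image, mask_a, mask_b)
print(remaining.sum(), "visible A pixels;", erased.sum(), "pixels to fill with RGB values")
```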
Dual completion for order recovery will now be described. The target ordering graph consists of pairwise occlusion relationships between all neighboring instance pairs. A neighboring instance pair is defined as two instances whose modal masks are connected, so that one may be an occluder of the other. As shown in FIG. 5, given a pair of neighboring instances A1 and A2, the modal mask M_A1 of A1 is first taken as the target of completion. M_A2 plays the role of the eraser, yielding the increment of A1, namely Δ_{A1|A2}. Conversely, the increment of A2 conditioned on A1, namely Δ_{A2|A1}, is obtained. The instance that gains the larger increment through partial completion is considered to be the occludee. Therefore, the order between A1 and A2 is determined by comparing the respective increments as follows.
Δ_{A1|A2} = P_θ^(m)(M_{A1}, M_{A2}, I) \ M_{A1}    ... (10)
Δ_{A2|A1} = P_θ^(m)(M_{A2}, M_{A1}, I) \ M_{A2}    ... (11)
A1 occludes A2 if |Δ_{A2|A1}| > |Δ_{A1|A2}|, and A2 occludes A1 otherwise    ... (12)
Note that, in the following, the symbol of equation (13) denotes M_{A1}, the symbol of equation (14) denotes M_{A2}, the symbol of equation (15) denotes Δ_{A1|A2}, and the symbol of equation (16) denotes Δ_{A2|A1}.
Here, the relation written as equation (17) indicates that A1 covers A2.
If A1 and A2 are not neighboring, equation (18) below is satisfied; that is, neither instance gains an increment from the other.
Δ_{A1|A2} = Δ_{A2|A1} = ∅    ... (18)
Performing this dual completion for all neighboring pairs yields the occlusion order of the scene, which is represented as a graph such as the one shown in image G2-1 within image G2 of FIG. 4. The nodes of the graph represent objects, and the edges indicate the direction of occlusion between neighboring objects. The graph of image G2-1 is the graph obtained for the image of image G2-2. Image G2-3 within image G2 of FIG. 4 is a figure in which, after completion, all of the completed object masks are displayed together in a single figure. Image G2-4 within image G2 of FIG. 4 is an example of a result obtained based on the graph of image G2-1, and shows the result of completing the RGB values of one of the objects appearing in image G2-2, also using the result shown in image G2-3.
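The sketch below illustrates the bookkeeping of the dual completion comparison and of building the ordering graph, using plain numpy masks. The function partial_complete is only a trivial stand-in that exists so the example runs; a real system would call the trained mask completion model (PCNet-M) at that point.

```python
import numpy as np

def partial_complete(target_mask, eraser_mask):
    """Trivial stand-in for a trained PCNet-M: 'complete' the target by letting
    it grow one pixel downward into the eraser.  Illustrative only."""
    grown = target_mask | np.roll(target_mask, 1, axis=0)
    return target_mask | (grown & eraser_mask)

def pairwise_order(mask_1, mask_2):
    """+1 if instance 1 occludes instance 2, -1 for the reverse, 0 if tied.
    The instance whose mask gains the larger increment is taken as the occludee."""
    inc_1 = partial_complete(mask_1, mask_2) & ~mask_1   # increment of A1 given A2
    inc_2 = partial_complete(mask_2, mask_1) & ~mask_2   # increment of A2 given A1
    if inc_2.sum() > inc_1.sum():
        return +1                                        # A1 covers A2
    if inc_1.sum() > inc_2.sum():
        return -1                                        # A2 covers A1
    return 0                                             # no order (e.g. not neighbors)

def ordering_graph(masks):
    """Directed occlusion graph as an adjacency dict: edge i -> j means i occludes j."""
    graph = {i: [] for i in range(len(masks))}
    for i in range(len(masks)):
        for j in range(i + 1, len(masks)):
            order = pairwise_order(masks[i], masks[j])
            if order > 0:
                graph[i].append(j)
            elif order < 0:
                graph[j].append(i)
    return graph

# Toy usage: mask_b sits directly below mask_a, so under the stand-in heuristic
# mask_a gains the larger increment, is treated as the occludee, and the graph
# records that instance 1 occludes instance 0.
mask_a = np.zeros((12, 12), dtype=bool); mask_a[2:6, 3:9] = True
mask_b = np.zeros((12, 12), dtype=bool); mask_b[6:10, 3:9] = True
print(ordering_graph([mask_a, mask_b]))   # {0: [], 1: [0]}
```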
Amodal and content completion will now be described. After the ordering graph has been estimated, order-grounded amodal completion can be performed. Suppose that an instance A needs to be completed. First, all of its ancestors on the graph, that is, its occluders, are found by breadth-first search (BFS). Since the graph is not necessarily acyclic, the BFS algorithm is adapted accordingly. The learned mask completion model generalizes to using the union of all ancestors as the eraser, so there is no need to iterate over the ancestors and apply the mask completion model repeatedly to complete A piece by piece. Instead, amodal completion is performed in a single step, conditioned on the union of the modal masks of all ancestors. Let the ancestors of A be the set of equation (19) below, and perform amodal completion as in equations (20) and (21) below.
{A_anc_1, A_anc_2, ..., A_anc_n}    ... (19)
M_anc = M_anc_1 ∪ M_anc_2 ∪ ... ∪ M_anc_n    ... (20)
M̂_A = P_θ^(m)(M_A, M_anc, I)    ... (21)
Here, the symbol of equation (22) denotes the resulting completed amodal mask M̂_A, and the symbol of equation (23) denotes the amodal mask M_anc_i of the i-th ancestor.
An example is shown in image G3 of FIG. 6. Next, the occluded content is completed. As shown in image G4 of FIG. 6, the intersection of the predicted amodal mask and the ancestor masks, M̂_A ∩ M_anc, indicates the missing portion of A and serves as the eraser for the content completion model. The learned content completion model is then applied to fill in the content as in equations (24) and (25) below. Note that FIG. 7 shows that the ancestor term indicated by equation (26) below is obtained using the "recovery complete mask" image of image G3.
M_erase = M̂_A ∩ M_anc    ... (24)
C_A = P_θ^(c)(I ⊙ (1 − M_erase), M̂_A \ M_erase, M_erase)    ... (25)
Note that the symbol of equation (26) below represents the ancestor term M̂_A ∩ M_anc used above, that is, the intersection of the predicted amodal mask and the union of the ancestor masks.
M̂_A ∩ M_anc    ... (26)
Here, C_A is the content of A decomposed from the scene. For the background content, the union of all foreground instances is used as the eraser. Unlike occlusion-unaware image inpainting, content completion is performed on the estimated occluded region.
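The ancestor search used for this one-step amodal completion can be sketched as follows. The graph format is the adjacency dictionary produced in the earlier ordering-graph sketch (an assumption of this example, not a format defined by the disclosure); the visited set keeps the search safe even if the graph contains cycles. The union of the returned ancestors' masks is what would be passed, once, to the mask completion model as in equations (20) and (21).

```python
from collections import deque

def ancestors(graph, target):
    """Collect all occluders (ancestors) of `target` by breadth-first search.

    `graph` maps each node to the list of nodes it occludes, so the edges are
    walked in reverse here.  The visited set also handles cyclic graphs."""
    reverse = {node: [] for node in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse[dst].append(src)          # src occludes dst
    seen, queue, result = {target}, deque([target]), []
    while queue:
        node = queue.popleft()
        for occluder in reverse[node]:
            if occluder not in seen:
                seen.add(occluder)
                result.append(occluder)
                queue.append(occluder)
    return result

# Usage with the toy graph {0: [1], 1: [2], 2: []} (0 occludes 1, 1 occludes 2):
print(ancestors({0: [1], 1: [2], 2: []}, 2))   # -> [1, 0]
```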
In this way, Self-Supervised Scene De-occlusion is an autocorrelation deep learning algorithm for obtaining a mathematical model that performs both the acquisition of a directed graph recovering the order between neighboring objects and the completion of invisible parts using the occlusion geometry of the objects. The ordering graph described above is an example of such a directed graph recovering the order between neighboring objects. An object here means an image appearing in the estimation target image.
Accordingly, the learning device 1 that executes Self-Supervised Scene De-occlusion updates the occluded part estimation model using a self-supervised learning method. The occluded part estimation model is a mathematical model that estimates the positional relationships among the images appearing in an estimation target image and, based on those positional relationships, estimates the occluded part of the image of an object appearing in the estimation target image that is partially occluded. Hereinafter, a partially occluded object is referred to as an occluded object, and the occluded portion of an occluded object is referred to as an occluded part. Note that estimating a part specifically means, for example, estimating the image of that part. The occluded part estimation model is therefore a mathematical model that includes the mask completion model, the content completion model, and a dual completion model described later.
The mathematical model is updated until a predetermined termination condition for learning (hereinafter referred to as the "learning termination condition") is satisfied. The occluded part estimation model at the time the learning termination condition is satisfied (that is, the trained model) is used to estimate the images of occluded parts appearing in various estimation target images, such as an estimation target image selected by the user or an estimation target image satisfying a predetermined condition.
The dual completion model is a mathematical model that estimates the positional relationships among the images appearing in the estimation target image, based on the estimation results of the mask completion model. The above-described directed graph recovering the order between neighboring objects is an example of such positional relationships. The dual completion model is therefore, for example, a mathematical model that executes the dual completion processing described above.
More specifically, the occluded part estimation model is a mathematical model that estimates, based on an estimation target image, the occluded parts appearing in the estimation target. The occluded part estimation model is a mathematical model expressed by a neural network. The neural network expressing the occluded part estimation model is, for example, a neural network including a deep neural network, and is, for example, a neural network including a convolutional neural network.
As described above, since both the mask completion model and the content completion model included in the occluded part estimation model are mathematical models updated by learning, the occluded part estimation model is itself a mathematical model updated by learning. Both the mask completion model and the content completion model are mathematical models expressed, for example, by neural networks. The neural network expressing the mask completion model and the neural network expressing the content completion model are both, for example, deep neural networks, and both are, for example, convolutional neural networks.
As also described above, the mask completion model is trained to partially fill in the invisible mask of an object (occludee) hidden by an occluding object (occluder). The mask completion model is therefore a mathematical model that estimates, based on the estimation target image, the shapes of the images appearing in the estimation target image.
As also described above, the content completion model is trained to partially fill the recovered mask with RGB values. The content completion model therefore estimates the RGB values of the images appearing in the estimation target image, based on the estimation results of the mask completion model.
Note that segmentation has already been performed on the target of the mask completion model before the mask completion model is executed. The target of the content completion model has likewise already been segmented before the content completion model is executed. To reduce the amount of computation, content completion uses the masks completed in the preceding step, and only then is the filling of those masks with RGB values started; that is, instance segmentation is performed only once per image.
The target of the dual completion model has also already been segmented before the dual completion model is executed. That is, in Self-Supervised Scene De-occlusion, segmentation is performed before the mask completion model, the content completion model, and the dual completion model are executed. The segmentation processing is included in the occluded part estimation model; by executing the occluded part estimation model, segmentation is performed before the mask completion model, the content completion model, and the dual completion model are executed. The mask completion model, the content completion model, and the dual completion model each perform estimation using the segmentation results as well.
This concludes the description of Self-Supervised Scene De-occlusion. Next, the hardware configuration of the learning device 1 will be described.
FIG. 7 is a diagram showing an example of the hardware configuration of the learning device 1 in the embodiment. The learning device 1 includes a control unit 11 including a processor 91 such as a CPU (Central Processing Unit) and a memory 92 connected by a bus, and executes a program. By executing the program, the learning device 1 functions as a device including the control unit 11, an input unit 12, a communication unit 13, a storage unit 14, and an output unit 15.
More specifically, the processor 91 reads the program stored in the storage unit 14 and stores the read program in the memory 92. By the processor 91 executing the program stored in the memory 92, the learning device 1 functions as a device including the control unit 11, the input unit 12, the communication unit 13, the storage unit 14, and the output unit 15.
The control unit 11 controls the operation of the various functional units included in the learning device 1. The control unit 11 executes Self-Supervised Scene De-occlusion. The control unit 11, for example, controls the operation of the output unit 15 and causes the output unit 15 to output the execution results of Self-Supervised Scene De-occlusion. The control unit 11 records, for example, various kinds of information generated by the execution of Self-Supervised Scene De-occlusion in the storage unit 14. The various kinds of information stored in the storage unit 14 include, for example, the results of executing Self-Supervised Scene De-occlusion.
The input unit 12 includes input devices such as a mouse, a keyboard, and a touch panel. The input unit 12 may be configured as an interface that connects these input devices to the learning device 1. The input unit 12 receives input of various kinds of information to the learning device 1. For example, training data is input to the input unit 12.
Since the learning device 1 trains the occluded part estimation model by executing Self-Supervised Scene De-occlusion, the training data may be any image data of images in which objects appear.
However, when a specific usage scene is envisaged as the destination to which the trained occluded part estimation model will be applied, it is preferable that image data of images containing objects that satisfy the conditions of the objects appearing in the estimation targets of that usage scene be used as the training data, because the estimation accuracy of the trained occluded part estimation model is then higher when it is actually used in the envisaged scene. In machine learning, however, it is often not easy to obtain image data of images containing objects that satisfy such predetermined conditions. An example of a technique for supporting the generation of image data of images containing objects that satisfy the conditions of the objects appearing in the estimation target image is therefore described in a modification (specifically, a second modification). Even without such a technique, the learning device 1 itself can perform learning of the occluded part estimation model.
The communication unit 13 includes a communication interface for connecting the learning device 1 to external devices. The communication unit 13 communicates with external devices in a wired or wireless manner. An external device is, for example, the device from which the training data is transmitted, and the communication unit 13 acquires the training data through communication with that transmission source. Another external device is, for example, the estimation device 2 described later. The estimation device 2 is a device that performs estimation using the trained occluded part estimation model. The communication unit 13 transmits the program of the trained occluded part estimation model to the estimation device 2 through communication with the estimation device 2.
The storage unit 14 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 14 stores various kinds of information regarding the learning device 1. The storage unit 14 stores, for example, information input via the input unit 12 or the communication unit 13. The storage unit 14 stores, for example, the occluded part estimation model. The storage unit 14 stores, for example, various kinds of information generated by the execution of Self-Supervised Scene De-occlusion.
The output unit 15 outputs various kinds of information. The output unit 15 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, or an organic EL (Electro-Luminescence) display. The output unit 15 may be configured as an interface that connects these display devices to the learning device 1. The output unit 15 outputs, for example, information input to the input unit 12 or the communication unit 13. The output unit 15 may display, for example, the execution results of Self-Supervised Scene De-occlusion.
FIG. 8 is a diagram showing an example of the configuration of the control unit 11 included in the learning device 1 in the embodiment. The control unit 11 includes a training data acquisition unit 111, a learning unit 112, a storage control unit 113, and an output control unit 114.
The training data acquisition unit 111 acquires training data. The training data acquisition unit 111 acquires, for example, training data input to the input unit 12 or the communication unit 13. The training data acquisition unit 111 may acquire training data by reading training data stored in advance in the storage unit 14.
The learning unit 112 executes Self-Supervised Scene De-occlusion on the training data acquired by the training data acquisition unit 111. By executing Self-Supervised Scene De-occlusion, the learning unit 112 executes the occluded part estimation model and updates the occluded part estimation model based on the execution results. The model is updated so that the accuracy of estimation by the occluded part estimation model increases. That is, the learning unit 112 trains the occluded part estimation model by executing Self-Supervised Scene De-occlusion.
The storage control unit 113 records various kinds of information in the storage unit 14. The output control unit 114 controls the operation of the output unit 15.
FIG. 9 is a flowchart showing an example of the flow of processing executed by the learning device 1 in the embodiment. The training data acquisition unit 111 acquires training data (step S201). Next, the learning unit 112 executes Self-Supervised Scene De-occlusion on the acquired training data (step S202). The learning unit 112 then determines whether the learning termination condition is satisfied (step S203). If the learning termination condition is satisfied (step S203: YES), the processing ends; the occluded part estimation model at the time the learning termination condition is satisfied is the trained occluded part estimation model. On the other hand, if the learning termination condition is not satisfied (step S203: NO), the process returns to step S201.
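A minimal sketch of this training loop is shown below; acquire_training_batch, run_de_occlusion_step, and termination_condition_met are hypothetical placeholders for steps S201 to S203, and the real learning unit 112 would update the PCNet-M and PCNet-C parameters inside run_de_occlusion_step.

```python
def train_occluded_part_model(model, acquire_training_batch,
                              run_de_occlusion_step, termination_condition_met):
    """Loop of FIG. 9: acquire data (S201), run one self-supervised
    de-occlusion step (S202), and stop when the learning termination
    condition holds (S203).  All callables are illustrative placeholders."""
    while True:
        batch = acquire_training_batch()                # S201
        model = run_de_occlusion_step(model, batch)     # S202 (updates the model)
        if termination_condition_met(model):            # S203
            return model                                # trained occluded part estimation model
```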
The trained occluded part estimation model obtained in this way is used, based on input image data, in processing for estimating the occluded parts appearing in the image indicated by that image data. One example of a device that executes such processing is the estimation device 2. The estimation device 2 acquires the trained occluded part estimation model in advance, before executing it, for example by obtaining it from the learning device 1 through communication. The estimation device 2 may instead acquire the trained occluded part estimation model in advance by being equipped with a neural network expressing the trained occluded part estimation model.
FIG. 10 is an explanatory diagram illustrating an overview of the estimation device 2 in the embodiment. The estimation device 2 uses the trained occluded part estimation model obtained by the learning device 1 to estimate the occluded parts appearing in the estimation target image. More specifically, the estimation device 2 takes global amodal instances as input and executes segmentation annotation. The estimation device 2 generates an ordered directed graph for the occlusions. The estimation device 2 executes the trained PCNet-M, which is an example of a neural network expressing the mask completion model. The estimation device 2 executes the trained PCNet-C, which is an example of a neural network expressing the content completion model. The estimation device 2 outputs the occluded target object as a restored object.
FIG. 11 is a diagram showing an example of the hardware configuration of the estimation device 2 in the embodiment. The estimation device 2 includes a control unit 21 including a processor 93 such as a CPU and a memory 94 connected by a bus, and executes a program. By executing the program, the estimation device 2 functions as a device including the control unit 21, an input unit 22, a communication unit 23, a storage unit 24, and an output unit 25.
More specifically, the processor 93 reads the program stored in the storage unit 24 and stores the read program in the memory 94. By the processor 93 executing the program stored in the memory 94, the estimation device 2 functions as a device including the control unit 21, the input unit 22, the communication unit 23, the storage unit 24, and the output unit 25.
The control unit 21 controls the operation of the various functional units included in the estimation device 2. The control unit 21 executes the trained occluded part estimation model. The control unit 21, for example, controls the operation of the output unit 25 and causes the output unit 25 to output the execution results of the trained occluded part estimation model. The control unit 21 records, for example, various kinds of information generated by executing the trained occluded part estimation model in the storage unit 24. The various kinds of information stored in the storage unit 24 include, for example, the execution results of the trained occluded part estimation model.
The input unit 22 includes input devices such as a mouse, a keyboard, and a touch panel. The input unit 22 may be configured as an interface that connects these input devices to the estimation device 2. The input unit 22 receives input of various kinds of information to the estimation device 2. For example, image data on which the trained occluded part estimation model is to be executed is input to the input unit 22; this image data is the image data of the estimation target image.
The communication unit 23 includes a communication interface for connecting the estimation device 2 to external devices. The communication unit 23 communicates with external devices in a wired or wireless manner. An external device is, for example, the device from which the image data of the estimation target image is transmitted. Another external device is, for example, the learning device 1. The communication unit 23 may receive the trained occluded part estimation model from the learning device 1 through communication with the learning device 1.
The storage unit 24 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 24 stores various kinds of information regarding the estimation device 2. The storage unit 24 stores, for example, information input via the input unit 22 or the communication unit 23. The storage unit 24 stores, for example, the trained occluded part estimation model in advance, before the trained occluded part estimation model is executed. The storage unit 24 stores, for example, various kinds of information generated by executing the trained occluded part estimation model.
The output unit 25 outputs various kinds of information. The output unit 25 includes a display device such as a CRT display, a liquid crystal display, or an organic EL display. The output unit 25 may be configured as an interface that connects these display devices to the estimation device 2. The output unit 25 outputs, for example, information input to the input unit 22 or the communication unit 23. The output unit 25 may display, for example, the execution results of the trained occluded part estimation model.
FIG. 12 is a diagram showing an example of the configuration of the control unit 21 included in the estimation device 2 in the embodiment. The control unit 21 includes a target data acquisition unit 211, an estimation unit 212, a storage control unit 213, and an output control unit 214.
The target data acquisition unit 211 acquires image data on which the trained occluded part estimation model is to be executed. The target data acquisition unit 211 acquires, for example, image data input via the input unit 22 or the communication unit 23 as the image data on which the trained occluded part estimation model is executed.
The estimation unit 212 executes the trained occluded part estimation model on the image data acquired by the target data acquisition unit 211. By executing the trained occluded part estimation model, the estimation unit 212 estimates the occluded parts appearing in the image indicated by the image data.
The storage control unit 213 records various kinds of information in the storage unit 24. The output control unit 214 controls the operation of the output unit 25.
FIG. 13 is a flowchart showing an example of the flow of processing executed by the estimation device 2 in the embodiment. The target data acquisition unit 211 acquires the image data of the estimation target image (step S301). Next, the estimation unit 212 executes the trained occluded part estimation model on the image data acquired in step S301 (step S302). By executing the trained occluded part estimation model, the occluded parts appearing in the image indicated by the image data are estimated. Next, the output control unit 214 controls the operation of the output unit 25 to display the estimation results of the estimation unit 212 (step S303).
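The corresponding inference flow can be sketched as follows; load_image, trained_model, and display are assumed placeholders standing in for the target data acquisition unit 211, the trained occluded part estimation model, and the output control via the output unit 25.

```python
def estimate_occluded_parts(image_path, load_image, trained_model, display):
    """Flow of FIG. 13: acquire the target image (S301), run the trained
    occluded part estimation model (S302), and display the result (S303).
    All callables are illustrative placeholders."""
    image = load_image(image_path)          # S301: target data acquisition
    result = trained_model(image)           # S302: amodal masks and completed content
    display(result)                         # S303: output control
    return result
```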
(Experiments)
According to experiments using the trained occluded part estimation model obtained by the learning device 1, an accuracy of 95% or higher was confirmed in various applications such as order recovery, amodal completion, and amodal instance segmentation.
The learning device 1 of the embodiment configured as described above trains a mathematical model for estimating occluded parts by executing Self-Supervised Scene De-occlusion, an autocorrelation deep learning algorithm for obtaining a mathematical model that performs both the acquisition of a directed graph recovering the order between neighboring objects and the completion of invisible parts using the occlusion geometry of the objects. The learning device 1 can therefore improve the accuracy of estimating the part of a partially occluded object that is covered by the occluding object.
The learning device 1 of the embodiment configured as described above obtains, through learning, a mathematical model in which both modal perception and amodal perception are processed. Modal perception refers to the analysis of directly visible regions, while amodal perception refers to perceiving the intact structure of an entity, including its invisible regions. The learning device 1 can therefore improve the accuracy of estimating the part of a partially occluded object that is covered by the occluding object.
The estimation device 2 of the embodiment configured as described above uses the trained occluded part estimation model obtained by the learning device 1 to estimate the occluded parts appearing in the estimation target. The estimation device 2 can therefore improve the accuracy of estimating the part of a partially occluded object that is covered by the occluding object.
(First Modification)
A scene in which the trained occluded part estimation model is applied will now be described, together with the learning device 1 and the estimation device 2. The trained occluded part estimation model is used, for example, in a factory that manufactures parts of a predetermined shape. In such a factory, it may be necessary to estimate the occluded parts of parts appearing in an image such as the one shown in FIG. 14 below, and in such a case the trained occluded part estimation model is used.
FIG. 14 is a diagram showing an example of an estimation target image in the embodiment. The image of FIG. 14 shows three circular parts, parts C1 to C3, inside a box. The three circular parts partially overlap; more specifically, part C1 and part C2 lie on top of part C3, and part C1 lies on top of part C2.
Each of the bounding boxes B1 to B3 in FIG. 14 is the result of the bounding box regression used to detect the parts.
FIG. 15 is an explanatory diagram illustrating a bounding box in the embodiment. FIG. 15 shows one bounding box B4. In FIG. 15, a circle C4 is inscribed in the bounding box B4. The circle C4 is the contour of an object appearing in the image; the object in FIG. 15 is, for example, the contour of a part to be detected. For both the bounding box and the contour, geometric information such as size and shape can be expressed mathematically using a polar coordinate system.
 そのため、像の輪郭に外接する四角形(すなわちバウンティングボックス)が検出されれば、像の幾何学的な情報が得られる。例えば、図15の例では、円C4の面積はバウンティングボックスB4の面積は0.785倍である。なお、検出対象は、被遮蔽部位推定モデル又は学習済みの被遮蔽部位推定モデルの推定対象の画像に写る場合に、検出対象は被遮蔽部位推定モデル又は学習済みの被遮蔽部位推定モデルによって被遮蔽部位が推定される対象である。 Therefore, if a rectangle (i.e. bounding box) circumscribing the outline of the image is detected, geometric information about the image can be obtained. For example, in the example of FIG. 15, the area of circle C4 is 0.785 times the area of bounding box B4. Note that when the detection target appears in the image of the estimation target of the occluded part estimation model or the trained occluded part estimation model, the detection target is occluded by the occluded part estimation model or the trained occluded part estimation model. The body part is the target to be estimated.
 In the example of FIG. 15, circle C4 is a full circle because it has no occluded part; when the detection target does have an occluded part, its contour is not necessarily a full circle. When the contour of the detection target is a circle with no occluded part, the ratio of the area of the detection target to the area of the bounding box is 0.785, as described above. If, however, the detection target has an occluded part, this ratio is not necessarily 0.785. Hereinafter, the ratio of the area of the detection target to the area of the bounding box is referred to as the detection area ratio.
 Thus, once the detection area ratio is obtained, geometric information about the occluded part can be obtained. The geometric information of the occluded part is, for example, the occlusion rate, which is the ratio of the area of the occluded part to the area of the bounding box. Both the detection area ratio and the occlusion rate are examples of geometric information possessed by the detection target.
 Using geometric information as well further improves the accuracy of estimating the occluded part. For example, for components C1 and C2 described above, once their respective detection area ratios are obtained, it can be determined that the component with the larger detection area ratio covers the component with the smaller one.
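 The following Python sketch is a minimal illustration of the detection area ratio, the occlusion rate, and the ordering heuristic just described, computed from binary masks; the mask-based formulation and the function names are assumptions made for illustration and are not the patent's implementation.

```python
import numpy as np

def bounding_box_area(mask: np.ndarray) -> int:
    """Area of the axis-aligned bounding box of a binary mask."""
    ys, xs = np.nonzero(mask)
    return (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)

def detection_area_ratio(visible_mask: np.ndarray) -> float:
    """Visible area of the detection target divided by its bounding-box area."""
    return visible_mask.sum() / bounding_box_area(visible_mask)

def occlusion_rate(visible_mask: np.ndarray, amodal_mask: np.ndarray) -> float:
    """Area of the occluded part divided by the bounding-box area of the full (amodal) shape."""
    occluded = np.logical_and(amodal_mask, np.logical_not(visible_mask))
    return occluded.sum() / bounding_box_area(amodal_mask)

def covers(ratio_a: float, ratio_b: float) -> str:
    """Heuristic from the text: the component with the larger detection area
    ratio is judged to cover the one with the smaller ratio."""
    return "A covers B" if ratio_a > ratio_b else "B covers A"

# Toy example: two overlapping discs; C1 is drawn on top of C2.
h, w = 128, 128
yy, xx = np.mgrid[0:h, 0:w]
disc1 = (yy - 64) ** 2 + (xx - 50) ** 2 <= 30 ** 2        # amodal mask of C1
disc2 = (yy - 64) ** 2 + (xx - 80) ** 2 <= 30 ** 2        # amodal mask of C2
visible1 = disc1                                           # C1 is unoccluded
visible2 = np.logical_and(disc2, np.logical_not(disc1))    # C2 loses the overlap region

r1 = detection_area_ratio(visible1)
r2 = detection_area_ratio(visible2)
print(round(r1, 3), round(r2, 3), covers(r1, r2))          # r1 ≈ 0.785, r2 < 0.785
print(round(occlusion_rate(visible2, disc2), 3))           # occlusion rate of C2
```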
 Therefore, so that estimation can also make use of geometric information possessed by the images appearing in the estimation target image, such as the detection area ratio, such geometric information may also be used when training the occluded part estimation model. Specifically, the training data used for learning may include, as ground-truth data, the geometric information of the images appearing in the estimation target image. In such a case, the occluded part estimation model is, for example, a mathematical model that also estimates the geometric information of the images appearing in the estimation target image, and the learning unit 112 updates the occluded part estimation model also based on the geometric information estimated by the model. In other words, the learning unit 112 trains the occluded part estimation model also using the geometric information possessed by the images appearing in the estimation target image.
 When the geometric information of the images appearing in the estimation target image is used in training the occluded part estimation model in this way, the trained occluded part estimation model used by the estimation device 2 may likewise perform estimation using the geometric information of the images appearing in the estimation target image.
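 One way the geometric information could enter the training is through an additional regression head whose error is added to the base loss. The PyTorch sketch below is only an assumed illustration of that idea; the head architecture, the L1 loss, and the weighting factor are not specified in the patent.

```python
import torch
import torch.nn as nn

class GeometryAwareHead(nn.Module):
    """Illustrative head that predicts a per-instance occlusion rate (a scalar in [0, 1])
    from an instance feature vector, alongside whatever the base model predicts."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.regressor = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid()
        )

    def forward(self, instance_features: torch.Tensor) -> torch.Tensor:
        return self.regressor(instance_features).squeeze(-1)

def training_step(head, base_loss, instance_features, gt_occlusion_rate, weight=0.1):
    """Combine the base (mask / ordering) loss with a geometric-information loss,
    so the model is also updated from the geometric ground truth in the training data."""
    pred_rate = head(instance_features)
    geo_loss = nn.functional.l1_loss(pred_rate, gt_occlusion_rate)
    return base_loss + weight * geo_loss

# Minimal usage with random tensors standing in for real features and labels.
head = GeometryAwareHead()
feats = torch.randn(8, 256)   # 8 detected instances
gt = torch.rand(8)            # ground-truth occlusion rates from the annotations
loss = training_step(head, base_loss=torch.tensor(1.0),
                     instance_features=feats, gt_occlusion_rate=gt)
loss.backward()
```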
(Second modification)
 As an example of the technique mentioned above for supporting the generation of image data in which the images satisfy the conditions required of images appearing in the estimation target image, an image data generation support process is described. In the image data generation support process, the condition required of the images appearing in the estimation target image is, for example, geometric information such as the detection area ratio. The process uses a predetermined function that represents the shape of the image appearing in the generated image data and that has one or more parameters (hereinafter referred to as the "image function"). The values of the parameters of the image function follow a predetermined probability distribution, and one of the parameters is a value indicating geometric information such as the detection area ratio.
 In the image data generation support process, each time image data is generated, the parameter values of the image function are varied according to the predetermined probability distribution, so that a variety of image data with different geometric information is generated. When the trained occluded part estimation model is used to estimate the occluded parts of components C1 to C3 described above, the image function is, for example, a function whose only parameter indicates the size of the figure, such as the detection area ratio, and which represents a figure whose shape is fixed regardless of the parameter value.
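 A minimal sketch of the image data generation support process under the description above: the image function draws a circle whose single size parameter follows a predetermined probability distribution, so each generated image carries different geometric information. The uniform distribution, the canvas size, and the function names are assumptions chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def image_function(size_param: float, canvas: int = 128) -> np.ndarray:
    """Image function for the circular components: the shape is always a circle,
    and the single parameter controls its size (and hence the geometric information)."""
    radius = size_param * canvas / 2.0
    yy, xx = np.mgrid[0:canvas, 0:canvas]
    return ((yy - canvas / 2) ** 2 + (xx - canvas / 2) ** 2 <= radius ** 2).astype(np.uint8)

def generate_training_images(n: int, canvas: int = 128):
    """Each generated image draws the size parameter from a predetermined
    probability distribution (here: uniform on [0.3, 0.9])."""
    images, params = [], []
    for _ in range(n):
        p = rng.uniform(0.3, 0.9)
        images.append(image_function(p, canvas))
        params.append(p)
    return images, params

images, params = generate_training_images(5)
print([round(p, 2) for p in params], [int(im.sum()) for im in images])
```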
(Third modification)
 The neural network representing the mask completion model may be an object detection convolutional neural network such as YOLOv4. Likewise, the neural network representing the content completion model may be YOLOv4, and the neural network representing the occluded part estimation model may be YOLOv4.
(An example of the details of the learning unit 112)
 FIG. 16 is a diagram showing an example of the configuration of the learning unit 112 included in the learning device 1 of the embodiment.
 The learning unit 112 includes, for example, an object detection unit 121, an instance segmentation unit 122, a comparison unit 123, a conversion unit 124, an amodal information determination unit 125, an occlusion generation unit 126, a mask completion model execution unit 127, a content completion model execution unit 128, a success/failure determination unit 129, an amodal information inspection unit 130, an occlusion rate correction unit 131, and a shape determination unit 132.
 The object detection unit 121 executes, for example, the process of step S102. The process of step S101 is executed by, for example, the training data acquisition unit 111. The instance segmentation unit 122 executes, for example, the process of step S103. The comparison unit 123 executes, for example, the process of step S104. The conversion unit 124 executes, for example, the process of step S105. The amodal information determination unit 125 executes, for example, the process of step S106.
 The occlusion generation unit 126 executes, for example, the process of step S107. The mask completion model execution unit 127 executes, for example, the process of step S108. The content completion model execution unit 128 executes, for example, the process of step S109. The success/failure determination unit 129 executes, for example, the process of step S110. The amodal information inspection unit 130 executes, for example, the process of step S111. The occlusion rate correction unit 131 executes, for example, the process of step S112. The shape determination unit 132 executes, for example, the process of step S113.
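 Taken together, the two paragraphs above assign one step of S101 to S113 to each component of the learning unit 112. The Python skeleton below sketches one possible way such a step dispatcher could be organised; the handler names and the shared-state convention are assumptions for illustration only, not the patent's implementation.

```python
class LearningPipeline:
    """Illustrative dispatcher: each step S101-S113 is handled by one component,
    mirroring the unit-to-step assignment described in the text."""

    def __init__(self, units: dict):
        # units maps a step label to a callable, e.g. "S102" -> object_detection_unit.run
        self.units = units
        self.order = [f"S{n}" for n in range(101, 114)]   # S101 .. S113

    def run(self, training_data):
        state = {"input": training_data}
        for step in self.order:
            handler = self.units.get(step)
            if handler is None:
                raise KeyError(f"no handler registered for step {step}")
            state = handler(state)   # each unit consumes and returns the shared state
        return state

# Minimal usage with stub handlers standing in for the real units 111 and 121-132.
def make_stub(name):
    def stub(state):
        state.setdefault("log", []).append(name)
        return state
    return stub

stub_names = ["training_data_acquisition_111", "object_detection_121",
              "instance_segmentation_122", "comparison_123", "conversion_124",
              "amodal_information_determination_125", "occlusion_generation_126",
              "mask_completion_model_127", "content_completion_model_128",
              "success_failure_determination_129", "amodal_information_inspection_130",
              "occlusion_rate_correction_131", "shape_determination_132"]
pipeline = LearningPipeline({f"S{101 + i}": make_stub(n) for i, n in enumerate(stub_names)})
print(pipeline.run(training_data=None)["log"][-1])   # "shape_determination_132"
```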
 The learning device 1 may be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units of the learning device 1 may be distributed across, and implemented in, the plurality of information processing devices.
 The estimation device 2 may likewise be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units of the estimation device 2 may be distributed across, and implemented in, the plurality of information processing devices.
 The occlusion-inference object detection device 100 may also be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units of the occlusion-inference object detection device 100 may be distributed across, and implemented in, the plurality of information processing devices.
 The learning device 1 and the estimation device 2 do not necessarily have to be implemented as separate devices; they may be implemented as a single device having both the functions of the learning device 1 and those of the estimation device 2. In such a case, for example, the control unit 11 and the control unit 21 may be implemented as a single control unit; that is, the functional units of the control unit 11 and those of the control unit 21 may be implemented in one control unit rather than in different control units.
 As described above, the occlusion-inference object detection device 100 of the present invention is a device for estimating one or more occluded parts of an object, one or more parts of which are occluded by one or more occluding objects, wherein the occluded parts are segmented and, in the process of estimating an occluded part from the segmented occluded parts, the device includes the learning unit 112 that performs self-supervised learning of a mathematical model that estimates the positional relationship between images appearing in the estimation target image and estimates, based on the positional relationship, the occluded part of the image of an occluded object, which is an object appearing in the estimation target image with a part thereof occluded.
 Also as described above, in the segmentation of the occluded portions, the control unit 11 calculates the number of occlusions and the number of occluded objects, performs a global amodal conversion, and receives annotations in an amodal information format as input.
 Also as described above, in the amodal information format annotation, the control unit 11 may obtain the order of the segmented images, then generate the mask completion model, and then apply a convolutional neural network.
 Also as described above, in the amodal information format annotation, the control unit 11 may apply a convolutional neural network using the order of the segmented images or the occlusion rate.
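 To make the amodal information format referred to above more tangible, the following sketch shows one possible annotation record holding the quantities mentioned in the text (number of occlusions, number of occluded objects, order of the segmented images, and occlusion rate). The field names, the dataclass layout, and the example values are assumptions, not the patent's actual format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AmodalAnnotation:
    """One possible amodal-information-format record for a single training image."""
    num_occlusions: int            # how many occlusion events occur in the image
    num_occluded_objects: int      # how many objects have at least one occluded part
    # Pairwise order of the segmented images: (front_instance_id, back_instance_id).
    occlusion_order: List[Tuple[int, int]] = field(default_factory=list)
    # Per-instance occlusion rates, indexed by instance id.
    occlusion_rates: dict = field(default_factory=dict)

# Example loosely matching the scene of FIG. 14: C1 is in front of C2 and C3,
# and C2 is in front of C3 (rates here are placeholder values).
annotation = AmodalAnnotation(
    num_occlusions=3,
    num_occluded_objects=2,
    occlusion_order=[(1, 2), (1, 3), (2, 3)],
    occlusion_rates={1: 0.0, 2: 0.12, 3: 0.25},
)
print(annotation.num_occluded_objects, annotation.occlusion_order)
```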
 All or some of the functions of the occlusion-inference object detection device 100, the learning device 1, and the estimation device 2 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The program may be recorded on a computer-readable recording medium. A computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The program may also be transmitted via a telecommunication line.
 Note that amodal completion means non-modality completion, and modal completion means modality completion. Global amodal completion means amodal completion that targets a large space.
 Although embodiments of the present invention have been described above in detail with reference to the drawings, the specific configuration is not limited to these embodiments and includes designs and the like that do not depart from the gist of the present invention.
 100... Occlusion-inference object detection device, 1... Learning device, 2... Estimation device, 11... Control unit, 12... Input unit, 13... Communication unit, 14... Storage unit, 15... Output unit, 111... Training data acquisition unit, 112... Learning unit, 113... Storage control unit, 114... Output control unit, 21... Control unit, 22... Input unit, 23... Communication unit, 24... Storage unit, 25... Output unit, 211... Target data acquisition unit, 212... Estimation unit, 213... Storage control unit, 214... Output control unit, 121... Object detection unit, 122... Instance segmentation unit, 123... Comparison unit, 124... Conversion unit, 125... Amodal information determination unit, 126... Occlusion generation unit, 127... Mask completion model execution unit, 128... Content completion model execution unit, 129... Success/failure determination unit, 130... Amodal information inspection unit, 131... Occlusion rate correction unit, 132... Shape determination unit, 91... Processor, 92... Memory, 93... Processor, 94... Memory

Claims (12)

  1.  An occlusion-inference object detection device for estimating one or more occluded parts of an object, one or more parts of which are occluded by one or more occluding objects, wherein the occluded parts are segmented and, in the process of estimating an occluded part from the segmented occluded parts, the device comprises:
     a learning unit that performs self-supervised learning of a mathematical model that estimates a positional relationship between images appearing in an image to be estimated and estimates, based on the positional relationship, an occluded part of an image of an occluded object, the occluded object being an object appearing in the image to be estimated with a part thereof occluded,
     wherein the learning unit trains the mathematical model also using geometric information of the images appearing in the image to be estimated.
  2.  The occlusion-inference object detection device according to claim 1, wherein a control unit including the learning unit calculates the number of occlusions and the number of occluded objects in the segmentation of the occluded parts, performs a global amodal conversion, and receives an annotation in an amodal information format as input.
  3.  The occlusion-inference object detection device according to claim 2, wherein the control unit including the learning unit obtains the order of the segmented images in the amodal information format annotation, then generates a mask completion model, and then applies a convolutional neural network.
  4.  The occlusion-inference object detection device according to claim 2, wherein the control unit including the learning unit applies a convolutional neural network using the order of the segmented images or the occlusion rate in the amodal information format annotation.
  5.  The occlusion-inference object detection device according to claim 1, wherein the positional relationship is a directed graph that restores the order between adjacent images.
  6.  The occlusion-inference object detection device according to claim 1, wherein the geometric information uses an occlusion rate, which is the ratio of the area of the image to the area of a rectangle circumscribing the image.
  7.  The occlusion-inference object detection device according to any one of claims 1 to 6, wherein the training data used for learning the mathematical model is obtained using an image function, which is a predetermined function that represents the shape of an image appearing in generated image data and has one or more parameters.
  8.  The occlusion-inference object detection device according to any one of claims 1 to 7, wherein the mathematical model is represented by an object detection convolutional neural network.
  9.  An occlusion-inference object detection device comprising:
     a target data acquisition unit that acquires image data; and
     an estimation unit that estimates an occluded part of an image appearing in the image of the image data acquired by the target data acquisition unit, by executing a trained mathematical model obtained by a learning unit that performs self-supervised learning of a mathematical model that estimates a positional relationship between images appearing in an image to be estimated and estimates, based on the positional relationship, an occluded part of an image of an occluded object, the occluded object being an object appearing in the image to be estimated with a part thereof occluded,
     wherein the learning unit trains the mathematical model also using geometric information of the images appearing in the image to be estimated.
  10.  An occlusion-inference object detection method comprising:
     a learning step of performing self-supervised learning of a mathematical model that estimates a positional relationship between images appearing in an image to be estimated and estimates, based on the positional relationship, an occluded part of an image of an occluded object, the occluded object being an object appearing in the image to be estimated with a part thereof occluded,
     wherein, in the learning step, the mathematical model is trained also using geometric information of the images appearing in the image to be estimated.
  11.  An occlusion-inference object detection method comprising:
     a target data acquisition step of acquiring image data; and
     an estimation step of estimating an occluded part of an image appearing in the image of the image data acquired in the target data acquisition step, by executing a trained mathematical model obtained by self-supervised learning of a mathematical model that estimates a positional relationship between images appearing in an image to be estimated and estimates, based on the positional relationship, an occluded part of an image of an occluded object, the occluded object being an object appearing in the image to be estimated with a part thereof occluded,
     wherein the learning is learning of the mathematical model that also uses geometric information of the images appearing in the image to be estimated.
  12.  A program for causing a computer to function as the occlusion-inference object detection device according to any one of claims 1 to 9.
PCT/JP2023/008004 2022-03-08 2023-03-03 Occlusion-inference object detection device, occlusion-inference object detection, and program WO2023171559A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022035212 2022-03-08
JP2022-035212 2022-03-08

Publications (1)

Publication Number Publication Date
WO2023171559A1 true WO2023171559A1 (en) 2023-09-14

Family

ID=87935005

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/008004 WO2023171559A1 (en) 2022-03-08 2023-03-03 Occlusion-inference object detection device, occlusion-inference object detection, and program

Country Status (1)

Country Link
WO (1) WO2023171559A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017211761A (en) * 2016-05-24 2017-11-30 株式会社東芝 Information processing apparatus and information processing method
JP2019192022A (en) * 2018-04-26 2019-10-31 キヤノン株式会社 Image processing apparatus, image processing method, and program
JP2021144631A (en) * 2020-03-13 2021-09-24 エヌ・ティ・ティ・ビズリンク株式会社 Animal behavior estimation system, animal behavior estimation support device, animal behavior estimation method, and program
JP2021141876A (en) * 2020-03-13 2021-09-24 エヌ・ティ・ティ・ビズリンク株式会社 Animal behavior estimation device, animal behavior estimation method, and program

Similar Documents

Publication Publication Date Title
CN111968235B (en) Object attitude estimation method, device and system and computer equipment
US20240070546A1 (en) System and method for end-to-end differentiable joint image refinement and perception
US10818014B2 (en) Image object segmentation based on temporal information
US20220277515A1 (en) Structure modelling
US11314989B2 (en) Training a generative model and a discriminative model
CN111161349B (en) Object posture estimation method, device and equipment
JP2021089724A (en) 3d auto-labeling with structural and physical constraints
KR20210002606A (en) Medical image processing method and apparatus, electronic device and storage medium
CN111340867A (en) Depth estimation method and device for image frame, electronic equipment and storage medium
US11209277B2 (en) Systems and methods for electronic mapping and localization within a facility
EP1887514B1 (en) Signal processing device
EP1363235A1 (en) Signal processing device
US11314986B2 (en) Learning device, classification device, learning method, classification method, learning program, and classification program
JP7161107B2 (en) generator and computer program
JP2018156640A (en) Learning method and program
CN113903028A (en) Target detection method and electronic equipment
US11132586B2 (en) Rolling shutter rectification in images/videos using convolutional neural networks with applications to SFM/SLAM with rolling shutter images/videos
CN111626134A (en) Dense crowd counting method, system and terminal based on hidden density distribution
CN114387392B (en) Method for reconstructing three-dimensional human body posture according to human shadow
WO2023171559A1 (en) Occlusion-inference object detection device, occlusion-inference object detection, and program
WO2023119922A1 (en) Image generating device, method, and program, training device, and training data
CN112800822A (en) 3D automatic tagging with structural and physical constraints
JP2006318232A (en) Analytical mesh correction device
CN117529749A (en) Unconstrained image stabilization
CN114359587A (en) Class activation graph generation method, interpretable method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23766735

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2024506134

Country of ref document: JP