WO2020152927A1

WO2020152927A1 - Training data generation method, training data generation device, and inference processing method

Info

Publication number: WO2020152927A1
Application number: PCT/JP2019/040667
Authority: WO
Inventors: 吉田　修一; 剛大濱; 勁峰今西; 良一今中
Original assignee: 日本金銭機械株式会社
Priority date: 2019-01-22
Filing date: 2019-10-16
Publication date: 2020-07-30
Also published as: JP2020119127A; JP6675691B1

Abstract

A training data generation system according to the present invention can quickly acquire a large amount of training data required for a training process in order to acquire a trained model to be used when executing an object detection process, an orientation detection process, and the like. In a training data generation system (1000), a training image is acquired by rendering, in a background image, a CG object having a known three-dimensional position in a three-dimensional space for which the background image has been acquired. It is therefore possible to, for example, acquire extremely accurate teaching data by acquiring a training position label and a training orientation label that identify the position and orientation of CG objects in the training image.

Description

Learning data generation method, learning data generation device, and inference processing method

The present invention relates to a technique for automatically generating learning data used for object detection processing, posture detection processing, and the like.

Conventionally, devices for gripping and carrying various objects have been known (for example, see Patent Document 1). For such devices, various techniques have been developed to efficiently grip and carry objects of various shapes. For example, Patent Document 2 discloses a technique that makes it possible to efficiently grip and carry objects of different shapes by performing a learning process that uses, as learning data, an image of the picking motion of a picking robot gripping an object and information indicating the motion state of the picking robot at that time.

In order to efficiently grip and carry objects of various shapes, it is important to perform, with high precision, processing that detects the target object to be gripped (object detection processing) and processing that detects the posture of the target object (posture detection processing). In recent years, techniques for executing object detection processing and the like with high accuracy have been developed using machine learning, as typified by deep learning.

Patent Document 1: Japanese Patent Publication No. 2018-504333
Patent Document 2: Japanese Patent Laid-Open No. 2018-83246

In order to execute object detection processing with high accuracy using machine learning, it is necessary to execute a learning process using a large amount of learning data and acquire a highly accurate learned model. For example, a large amount of learning data is required to acquire a learned model that performs object detection processing and posture detection processing on various objects. Normally, to obtain such a learned model, an image of the target object is captured, and the position information and posture information of the target object in the image are specified manually. The specified position information and posture information, together with the captured image, are then used as learning data.

However, when learning data is acquired by such a method, it is difficult to acquire a large amount of learning data in a short time, because the position information and posture information of the target object must be specified manually.

Therefore, in view of the above problem, an object of the present invention is to realize a learning data generation method that can acquire, in a short time, the large amount of learning data required for the learning process used to obtain a learned model for executing object detection processing, posture detection processing, and the like.

In order to solve the above problems, a first invention is a learning data generation method including a background image acquisition step and a learning image data acquisition step.

The background image acquisition step acquires a background image acquired by imaging a predetermined three-dimensional space.

The learning image data acquisition step acquires CG object generation data, which is data for computer graphics processing including at least one of an object shape and a texture, and acquires, as learning image data, a rendering image, which is an image obtained by combining a CG object generated based on the acquired CG object generation data with the background image so that the CG object is arranged at a predetermined coordinate position in the three-dimensional space in which the background image was captured.

In this learning data generation method, a learning image is acquired by rendering, onto the background image, a CG object whose three-dimensional position in the three-dimensional space in which the background image was acquired is known. Extremely accurate teacher data can therefore be obtained by acquiring, for the learning image, a learning position label and a learning posture label that specify the position and posture of each CG object. That is, since each CG object is generated by CG processing, it is possible to calculate accurately where the image area occupied by each CG object is located, and what posture each CG object takes, when the CG object is projected onto the background image. As a result, the learning position label and the learning posture label that specify the position and posture of each CG object, generated using the learning image data acquired by this learning data generation method, are extremely accurate.

Furthermore, with this learning data generation method, a CG object can be automatically generated by CG processing without human intervention. In this learning data generation method, a large number of learning images Img1 can be generated in a short time by projecting the generated CG object on the background image (by performing the rendering process).

Therefore, with this learning data generation method, the large amount of learning data required for the learning process can be acquired in a short time in order to obtain a learned model used when executing object detection processing, posture detection processing, and the like.

Then, by performing the learning process using the large amount of extremely accurate teacher data acquired by this learning data generation method, a learned model used when executing, for example, object detection processing or posture detection processing can be obtained accurately and efficiently.

Note that, in order to “combine with the background image”, for example, three-dimensional (3D) coordinate data may be projectively transformed to obtain two-dimensional (2D) data.
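
As an illustration of the projective transformation mentioned above, the following minimal sketch projects a 3D point given in camera coordinates onto the image plane. It assumes an ideal pinhole camera; the function name `project_point` and the parameters `focal_px`, `cx`, and `cy` are hypothetical and are not defined in this document.

```python
def project_point(point_3d, focal_px, cx, cy):
    """Project a 3D point (camera coordinates, z pointing forward) to 2D pixels.

    Assumption: ideal pinhole camera. focal_px is the focal length expressed
    in pixels and (cx, cy) is the principal point of the image.
    """
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # Perspective division followed by a shift to the image origin.
    u = focal_px * x / z + cx
    v = focal_px * y / z + cy
    return (u, v)
```

Applying such a function to every vertex of a CG object would give the two-dimensional coordinates that the object occupies on the background image.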

The second invention is the first invention and further comprises a learning position label acquisition step.

The learning position label acquisition step sets, from the learning image data, a two-dimensional bounding area, which is an area surrounding the CG object on the rendering image, and acquires coordinate information of the two-dimensional bounding area as a learning position label.
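
One way such a two-dimensional bounding area could be computed is sketched below: given the already projected 2D vertices of a CG object, the smallest axis-aligned rectangle enclosing them is taken as the label. The function name, the `(x_min, y_min, x_max, y_max)` layout, and the clipping to the image size are assumptions made for this example.

```python
def bounding_box_2d(points_2d, image_width, image_height):
    """Return (x_min, y_min, x_max, y_max) enclosing the projected vertices.

    Coordinates are clipped to the image so that the label never leaves the
    rendering image; this clipping is an illustrative assumption.
    """
    if not points_2d:
        raise ValueError("at least one projected vertex is required")
    xs = [min(max(x, 0), image_width - 1) for x, _ in points_2d]
    ys = [min(max(y, 0), image_height - 1) for _, y in points_2d]
    return (min(xs), min(ys), max(xs), max(ys))
```

Because the vertices come from the CG data rather than from manual annotation, the resulting rectangle is exact up to the projection model used.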

As a result, this learning data generation method can acquire both a learning image (rendering image), obtained by rendering and combining the CG objects generated by CG processing with the background image, and a learning position label (the coordinate data of the 2D bounding box of each CG object) that specifies the position of each CG object in that learning image. In this learning data generation method, a CG object whose three-dimensional position in the three-dimensional space is known is rendered onto the background image to acquire the learning image, and the learning position label (for example, the coordinate data of the 2D bounding box of each CG object) that specifies the position of each CG object in the learning image is acquired, so extremely accurate teacher data can be obtained. That is, since each CG object is generated by CG processing, the position of the image area occupied by each CG object when it is projected onto the background image can be calculated accurately. As a result, the learning position label (the coordinate data of the 2D bounding box of each CG object) that specifies the position of each CG object is extremely accurate.

Furthermore, with this learning data generation method, CG objects can be generated automatically by CG processing without human intervention. By projecting the generated CG objects onto the background image (that is, by performing the rendering process), a large number of learning images Img1 and learning position labels (the coordinate data of the 2D bounding box of each CG object) can be generated in a short time.

Therefore, with this learning data generation method, the large amount of learning data required for the learning process can be acquired in a short time in order to obtain a learned model used when executing object detection processing and the like.

Then, by performing the learning process using the large amount of extremely accurate teacher data acquired by this learning data generation method, a learned model used when executing, for example, object detection processing can be obtained accurately and efficiently.

The third invention is the first invention, and further comprises a posture detection image data acquisition step and a posture detection learning data acquisition step.

In the posture detection image data acquisition step, a cropped image, which is an image obtained by extracting from the learning image data an image area surrounding a CG object on the rendering image, is acquired as posture detection image data.

The posture detection learning data acquisition step acquires, as posture detection learning data, data in which the information regarding the posture of the CG object included in the posture detection image data and the posture detection image data are associated with each other.
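
The two steps above can be sketched as follows: each CG object's image area is cut out of the rendering image, and the cropped image is paired with the posture information already known from the CG data. The image is modelled as a row-major list of rows, and the function names, the exclusive upper bounds of the box, and the `posture_class` key are assumptions made for this example.

```python
def crop_image(image, box):
    """Extract box = (x_min, y_min, x_max, y_max) from a row-major image.

    Assumption: x_max and y_max are exclusive bounds.
    """
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def make_posture_samples(image, boxes, posture_classes):
    """Pair each cropped image with the known posture class of its CG object."""
    if len(boxes) != len(posture_classes):
        raise ValueError("one posture class is required per bounding box")
    return [
        {"image": crop_image(image, box), "posture_class": cls}
        for box, cls in zip(boxes, posture_classes)
    ]
```

No manual annotation is involved here: both the crop area and the posture label are derived from values that were fixed when the CG object was placed.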

According to this learning data generation method, a cropped image for each CG object, and a posture label that specifies the posture of the CG object in that cropped image, can be obtained from a rendering image acquired by rendering and combining the CG objects generated by CG processing with the background image.

In this learning data generation method, a CG object whose three-dimensional position and posture in the three-dimensional space are known is rendered onto the background image to obtain the rendering image, and the area defined by the 2D bounding box that specifies the position of each CG object in the rendering image is used as the crop area, so a cropped image containing each CG object can be acquired extremely accurately.

Further, since the CG objects contained in the cropped images are generated by CG processing, the posture of each CG object when it is projected onto the background image can be calculated accurately. As a result, the learning posture label that specifies the posture of each CG object on the cropped image (for example, a class number) is extremely accurate.

Furthermore, with this learning data generation method, CG objects can be generated automatically by CG processing without human intervention. By projecting the generated CG objects onto the background image (that is, by performing the rendering process), a large number of learning images (cropped images of each CG object) and learning posture labels (data, for example a class number, that specifies the posture of each CG object on the cropped image) can be generated in a short time.

Therefore, with this learning data generation method, the large amount of learning data required for the learning process can be acquired in a short time in order to obtain a learned model used when executing posture detection processing and the like.

Then, by performing the learning process using the large amount of extremely accurate teacher data acquired by this learning data generation method, a learned model used when executing, for example, posture detection processing can be obtained accurately and efficiently.

A fourth invention is any one of the first to third inventions, wherein, in the learning image data acquisition step, when the background image includes an actual processing target object, the rendering image is generated so that the CG object is arranged in an image area other than the image area that includes the processing target object.

Thus, in this learning data generation method, the learning data generation process can be executed by, for example, rendering the CG object in an area other than the image area of a manually set 2D bounding box (the image area that includes the actual processing target object).
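
A minimal sketch of this exclusion is shown below. It assumes that both the manually set 2D bounding box and the candidate image areas of the CG objects are axis-aligned rectangles written as `(x_min, y_min, x_max, y_max)`; the function names are hypothetical.

```python
def boxes_overlap(a, b):
    """Return True if axis-aligned boxes a and b share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def filter_placements(candidate_boxes, reserved_box):
    """Keep only candidate CG-object areas that avoid the reserved area.

    reserved_box stands for the image area containing the real object.
    """
    return [box for box in candidate_boxes if not boxes_overlap(box, reserved_box)]
```

Candidates that merely touch the edge of the reserved area are kept in this sketch; a stricter margin could be added if the real object must stay clearly visible.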

The fifth invention is the first invention, and the background image is an image including the first object.

The CG object is combined with the background image so that at least a part of it is arranged on the surface of the first object.

With this, this learning data generation method can generate learning data from an image in which the CG object is combined with the background image so that at least a part of the CG object is arranged on the surface of the first object.

The “first object” is, for example, an arbitrary object whose size is known. The first object is, for example, a rectangular parallelepiped object whose size is known.

A sixth invention is the first invention, and in the background image acquisition step, the first background image is acquired by combining the background image with an image including a first object.

The CG object is combined with the first background image so that at least a part of the CG object is arranged on the surface of the first object.

With this, in this learning data generation method, by combining an image of the first object with a background image in which the first object is not captured, a first background image can be acquired that resembles a background image in which the first object was actually captured. The learning data generation process can then be performed using the first background image in place of the background image.

The seventh invention is the fifth or sixth invention, wherein the CG object has a shape that forms a keyhole in the first object.

With this, this learning data generation method can generate learning data from an image in which a shape forming a keyhole is combined on the surface of the first object.

The eighth invention is a learning data generation device including a background image data acquisition unit and a learning image data acquisition unit.

The background image data acquisition unit acquires a background image acquired by imaging a predetermined three-dimensional space.

The learning image data acquisition unit acquires CG object generation data, which is data for computer graphics processing including at least one of an object shape and a texture, and acquires, as learning image data, a rendering image, which is an image obtained by combining a CG object generated based on the acquired CG object generation data with the background image so that the CG object is arranged at a predetermined coordinate position in the three-dimensional space in which the background image was captured.

With this, it is possible to realize a learning data generation device that achieves the same effects as the first invention.

The ninth invention is an inference processing method including a learned model acquisition step and a prediction processing step.

The learned model acquisition step acquires a learned model by executing a learning process using the learning data acquired by the learning data generating method according to any one of the fifth to seventh inventions.

In the prediction processing step, an image including a predetermined shape arranged on the surface of the first object is input, and a prediction process using the learned model is executed to output data for identifying the position of the predetermined shape.

With this, this inference processing method can acquire data (inference result data) for specifying the position of a predetermined shape.

The tenth invention is the ninth invention, and further includes a detection accuracy determination step and an imaging parameter adjustment step.

The detection accuracy determination step determines the detection accuracy of the data for identifying the position of the predetermined shape.

The imaging parameter adjustment step adjusts an imaging parameter of the imaging device that captures the image including the predetermined shape arranged on the surface of the first object.

Then, when the detection accuracy of the data for identifying the position of the predetermined shape is lower than a predetermined threshold value, the prediction processing step executes the prediction process after the imaging parameter adjustment step has changed the imaging parameter of the imaging device.

As a result, with this inference processing method, when the accuracy of the data (inference result data) for identifying the position of the predetermined shape is insufficient, the imaging parameter of the imaging device is adjusted so that the prediction process can be executed using an image with which the inference processing is likely to be performed accurately.
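
The interaction between the detection accuracy determination step, the imaging parameter adjustment step, and the prediction processing step can be sketched as a simple retry loop. The callables `capture`, `predict`, and `adjust`, and the upper limit `max_attempts`, are hypothetical stand-ins introduced for this example and are not defined in this document.

```python
def infer_with_adjustment(capture, predict, adjust, params, threshold, max_attempts=5):
    """Repeat prediction, changing the imaging parameters while accuracy is low.

    Assumed interfaces (all supplied by the caller):
      capture(params) -> image
      predict(image)  -> (result, accuracy)
      adjust(params)  -> new params
    """
    result, accuracy = None, 0.0
    for _ in range(max_attempts):
        image = capture(params)
        result, accuracy = predict(image)
        if accuracy >= threshold:
            break  # accuracy is sufficient; keep the current parameters
        params = adjust(params)  # e.g. change the zoom before retrying
    return result, accuracy, params
```

Bounding the number of attempts is a design choice of this sketch; it prevents an endless loop when no parameter setting reaches the threshold.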

When the size (actual size) of the first object is known, the three-dimensional distance from the imaging device to the target object (first object) can be obtained from (1) the focal length of the imaging device (an example of the imaging parameter) and (2) the ratio, in the image captured by the imaging device at that focal length, of the image area corresponding to the target object (first object) to the entire image area. Therefore, in the prediction processing step, the prediction process may be executed using the three-dimensional distance from the imaging device to the target object (first object) acquired as described above.
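
A minimal sketch of such a distance estimate is given below. It assumes a pinhole camera model and uses the ratio of the object's width to the full image width rather than an area ratio; the sensor width is an additional input that this example needs, and all names are hypothetical.

```python
def estimate_distance(focal_length_mm, sensor_width_mm, object_width_m, width_ratio):
    """Estimate the camera-to-object distance in metres (pinhole model).

    width_ratio is the object's width in the image divided by the full image
    width (0 < width_ratio <= 1). By similar triangles:
        distance = focal_length * real_width / width_on_sensor
    """
    if not 0.0 < width_ratio <= 1.0:
        raise ValueError("width_ratio must be in (0, 1]")
    width_on_sensor_mm = width_ratio * sensor_width_mm
    return focal_length_mm * object_width_m / width_on_sensor_mm
```

For example, under these assumptions an object 0.4 m wide that fills half the width of a 6.4 mm sensor at a focal length of 8 mm would be about 1 m from the camera.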

According to the present invention, it is possible to realize a learning data generation method that can acquire, in a short time, the large amount of learning data required for the learning process used to obtain a learned model for executing object detection processing, posture detection processing, and the like.

A schematic configuration diagram of the learning data generation system 1000 according to the first embodiment.
A flowchart of the processing executed by the learning data generation system 1000 when generating learning data for the object detection processing.
A diagram schematically showing the three-dimensional space SP1 (the three-dimensional space in the room Rm1) for acquiring a background image.
A diagram showing the background image Img0 (an example).
A diagram showing the image Img1 (rendering image Img1) acquired by the rendering process that projects and combines N (N=9) CG objects CG_obj1 to CG_obj9 onto the background image Img0.
A diagram showing an image in which the bounding box of each CG object is marked by a rectangle in the image Img1 (rendering image Img1) acquired by the rendering process that projects and combines N (N=9) CG objects CG_obj1 to CG_obj9 onto the background image Img0.
A diagram showing the image Img1A (rendering image Img1A) acquired, in the learning data generation system of the first modification of the first embodiment, by the rendering process that projects and combines N (N=9) CG objects CG_obj1 to CG_obj9 onto the background image Img0.
A diagram showing an image in which the bounding box of each CG object is marked by a rectangle in the image Img1A (rendering image Img1A) acquired by the rendering process that projects and combines N (N=9) CG objects CG_obj1 to CG_obj9 onto the background image Img0.
A schematic configuration diagram of the learning data generation system 2000 according to the second embodiment.
A diagram for explaining the method of specifying the posture of a CG object.
A flowchart of the processing executed by the learning data generation system 2000 when generating learning data for the posture detection processing.
A diagram showing the image Img2 (rendering image Img2) acquired by the rendering process that projects and combines N (N=10) CG objects CG_obj1 to CG_obj10 onto the background image Img0.
A diagram showing an image in which the bounding box of each CG object (corresponding to the image area to be cropped) is marked by a rectangle in the image Img2A (rendering image Img2A) acquired by the rendering process that projects and combines N (N=10) CG objects CG_obj1 to CG_obj10 onto the background image Img0.
A diagram showing the cropped images Img_crop(1) to Img_crop(9) of the N (N=10) CG objects CG_obj1 to CG_obj9 and the determined class numbers.
A schematic configuration diagram of the learning data generation system 3000 according to the third embodiment.
A diagram showing the background image Img0A in which the detection target object (real object) Real_obj appears.
A diagram showing the 2D bounding box Bbox_manual manually set on the background image Img0A in which the detection target object (real object) Real_obj appears.
A diagram showing the rendering image Img3 acquired by rendering CG objects onto the background image Img0A in which the detection target object (real object) Real_obj appears.
A schematic configuration diagram of the learning data generation system 4000 according to the fourth embodiment.
A diagram showing the background image Img4.
A diagram showing the background image Img4.
A flowchart of the processing executed by the learning data generation system 4000.
A diagram for explaining the process of combining a keyhole with the extracted image.
A diagram for explaining the process of combining a keyhole with the extracted image.
A schematic configuration diagram of the learning inference processing system Sys1 according to the fourth embodiment.
A schematic configuration diagram of the learning processing device 200 according to the fourth embodiment.
A schematic configuration diagram of the inference processing device 300 according to the fourth embodiment.
A flowchart of the inference processing of the inference processing device 300.
A diagram showing the input image Img5.
An explanatory diagram of a zoom image.
A diagram showing a CPU bus configuration.

[First Embodiment]
The first embodiment will be described below with reference to the drawings.

<1.1: Configuration of learning data generation system>
FIG. 1 is a schematic configuration diagram of a learning data generation system 1000 according to the first embodiment.

As shown in FIG. 1, the learning data generation system 1000 includes a background image data storage unit DB1, a learning data generation device 100, and a learning data storage unit DB2.

The background image data storage unit DB1 is a functional unit for storing background image data acquired by imaging a predetermined three-dimensional space. The background image data storage unit DB1 is realized by, for example, a database. The background image data storage unit DB1 stores an image acquired by imaging a predetermined three-dimensional space, together with information for specifying the three-dimensional space that was the imaging target when the image was captured (for example, imaging parameters such as the imaging point (the position of the camera, for example the center point of the imaging surface of the image sensor), the focus position, the focal length, the angle of view, the viewing angle, and the optical axis of the camera optical system).

As shown in FIG. 1, the learning data generation device 100 includes a background image data acquisition unit 1, a CG processing unit 2 (CG: Computer Graphics), a rendering processing unit 3, and a learning data generation unit 4.

The background image data acquisition unit 1 acquires predetermined background image data (data including a background image and information for specifying the three-dimensional space that was the imaging target when the background image was acquired) from the background image data storage unit DB1. The background image data acquisition unit 1 then outputs the background image extracted from the background image data to the rendering processing unit 3 as data D1. Further, the background image data acquisition unit 1 outputs the information (shape information of the three-dimensional space) for specifying the three-dimensional space that was the imaging target when the background image was acquired, extracted from the background image data, to the CG processing unit 2 and the rendering processing unit 3 as data Info_3D_space.

The CG processing unit 2 generates a CG object (an object generated by CG) to be arranged in the three-dimensional space in which the background image was captured, and generates the processing data required to combine the CG object with the background image. As shown in FIG. 1, the CG processing unit 2 includes a 3D arrangement coordinate determination unit 21, a posture determination unit 22, a collision detection unit 23, a texture setting unit 24, and a 3D-2D conversion unit 25.

The 3D arrangement coordinate determination unit 21 acquires coordinate information of a CG object (an object generated by CG) to be arranged in the three-dimensional space specified by the data Info_3D_space. For example, the 3D arrangement coordinate determination unit 21 acquires the coordinate information of the CG object to be arranged in the three-dimensional space using random numbers.

The posture determination unit 22 acquires information for determining the posture of the CG object to be arranged in the three-dimensional space specified by the data Info_3D_space.

When there are a plurality of CG objects to be arranged in the three-dimensional space specified by the data Info_3D_space, the collision detection unit 23 detects whether any CG object is set to be arranged in an area where it cannot physically be placed.

The texture setting unit 24 sets a texture to be attached to the surface of each CG object. Note that the texture setting unit 24 holds, for example, texture data of a plurality of patterns, and the texture can be set for each arbitrary pattern.

The 3D-2D conversion unit 25 acquires the two-dimensional coordinates on the background image that a CG object to be arranged in the three-dimensional space specified by the data Info_3D_space will have when it is combined with the background image, by applying 3D-2D conversion (projective transformation) to the three-dimensional coordinates of the CG object in the three-dimensional space.

The CG processing unit 2 outputs data including the information acquired by each of its functional units to the rendering processing unit 3 as data Data_CG_obj. Further, the CG processing unit 2 outputs, to the learning data generation unit 4, information on the bounding box that defines the boundary of the image area surrounding the CG object when the CG object generated by the CG processing unit 2 is displayed on the background image. When N (N: natural number) CG objects are generated by the CG processing unit 2, the bounding box information of the i-th (i: natural number, 1≦i≦N) CG object is written as “Data_for_training(BBox(i))”.

The rendering processing unit 3 receives the background image D1 and the data Info_3D_space output from the background image data acquisition unit 1, and the data Data_CG_obj output from the CG processing unit 2. Based on the data Info_3D_space and the data Data_CG_obj, the rendering processing unit 3 combines the CG objects generated by the CG processing unit 2 with the background image D1 to acquire combined image data D2 (image data of the combined image Img1), and outputs the acquired combined image data D2 to the learning data generation unit 4.

The learning data generation unit 4 receives the combined image data D2 output from the rendering processing unit 3 and the data Data_coordinate(BBox(i)), output from the CG processing unit 2, which includes the bounding box information of the CG objects. The learning data generation unit 4 generates learning data from the input data and outputs the generated data as data Dout to, for example, the learning data storage unit DB2.
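
As a rough illustration of what the learning data generation unit 4 might assemble, the sketch below bundles one rendered image with the bounding boxes of its CG objects into a single record. The record layout, the key names, and the function name are hypothetical; this document does not define the format of the data Dout.

```python
import json


def make_detection_record(image_name, boxes):
    """Bundle one rendered image with the bounding boxes of its CG objects.

    boxes is a list of (x_min, y_min, x_max, y_max) tuples, one per CG object.
    The dictionary layout is an assumption made for illustration.
    """
    return {
        "image": image_name,
        "labels": [
            {"object_index": i, "bbox": list(box)}
            for i, box in enumerate(boxes, start=1)
        ],
    }


if __name__ == "__main__":
    record = make_detection_record("Img1.png", [(10, 20, 50, 60), (70, 15, 120, 90)])
    print(json.dumps(record, indent=2))
```

A record of this kind could then be written to a database or a file, in the same way that the data Dout is stored in the learning data storage unit DB2.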

The learning data storage unit DB2 receives the data Dout output from the learning data generation unit 4 and stores and holds the data. The learning data storage unit DB2 is realized by, for example, a database.

The “learning image data acquisition unit” is a functional unit realized by the CG processing unit 2, the rendering processing unit 3, and the learning data generation unit 4.

<1.2: Operation of learning data generation system>
The operation of the learning data generation system 1000 configured as above will be described below.

In the following, a case will be described in which the learning data generation system 1000 generates learning data for object detection processing. For convenience of explanation, it is assumed that the object to be detected in the object detection process has a substantially rectangular parallelepiped shape.

FIG. 2 is a flowchart of a process executed by the learning data generation system 1000 when generating learning data for the object detection process.

FIG. 3 is a diagram schematically showing a three-dimensional space SP1 (the three-dimensional space in the room Rm1) for acquiring a background image. FIG. 3 is a view of the room Rm1 as seen from above. It is assumed that the camera Cam1 is arranged as shown, that the three-dimensional space (imaging target space) SP1 is set with the angle of view α, and that the background image Img0 (an example) is acquired by imaging with the optical axis of the optical system of the camera Cam1 being the optical axis Ax1. Further, as shown in FIG. 3, it is assumed that the x axis, the y axis, and the z axis are set.

FIG. 4 is a diagram showing a background image Img0 (an example).

(Step S11):
In step S11, the background image data acquisition unit 1 acquires one item of background image data from the background image data storage unit DB1. For convenience of explanation, the following describes the case in which the background image data acquisition unit 1 acquires, from the background image data storage unit DB1, the background image Img0 (FIG. 4) captured by the camera Cam1 in the situation shown in FIG. 3.

(Steps S12 to S14):
In step S12, the CG processing unit 2 sets the maximum number of CG objects that may be stacked when the CG objects are arranged in the three-dimensional space SP1, so that CG objects are not stacked beyond that number when they are arranged.

The texture setting unit 24 sets a texture to be attached to the surface of each CG object. In the present embodiment, one pattern texture is attached to each CG object. That is, it is assumed that the CG processing unit 2 generates one type of CG object.

The 3D arrangement coordinate determination unit 21 acquires shape information (three-dimensional coordinate information) of the CG objects (objects generated by CG) to be arranged in the three-dimensional space SP1 specified by the data Info_3D_space. The 3D arrangement coordinate determination unit 21 acquires the coordinate information of the CG objects to be arranged in the three-dimensional space SP1 using random numbers (step S13). Note that N CG objects are generated, and the coordinate information of the i-th CG object in the three-dimensional space SP1 is written as 3D_coordinate(i). The coordinate information 3D_coordinate(i) is, for example, data including the three-dimensional coordinates in the three-dimensional space SP1 of the eight vertices of the i-th CG object (a substantially rectangular parallelepiped).

Also, the posture determination unit 22 determines the posture of the CG object (the orientation of the CG object) using a random number (step S13).
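The random determination of coordinates and posture in step S13 can be pictured with the following minimal sketch. It is not the implementation of the embodiment: the function names, the choice of y as the vertical axis, the restriction of the posture to a single rotation angle (yaw), and the numeric ranges are all assumptions made for illustration.

```python
import math
import random

def cuboid_vertices(center, size, yaw):
    """Return the 8 vertices of a cuboid rotated by `yaw` (radians) about the y axis."""
    cx, cy, cz = center
    hx, hy, hz = (s / 2.0 for s in size)
    c, s = math.cos(yaw), math.sin(yaw)
    vertices = []
    for dx in (-hx, hx):
        for dy in (-hy, hy):
            for dz in (-hz, hz):
                # Rotate the local offset in the x-z plane, then translate to the centre.
                vertices.append((cx + c * dx + s * dz, cy + dy, cz - s * dx + c * dz))
    return vertices

def sample_objects(n, space_min, space_max, size, rng):
    """Draw a random centre and a random yaw for each of the n CG objects."""
    objects = []
    for i in range(1, n + 1):
        center = tuple(rng.uniform(lo, hi) for lo, hi in zip(space_min, space_max))
        yaw = rng.uniform(0.0, 2.0 * math.pi)
        objects.append({
            "index": i,
            "center": center,
            "yaw": yaw,
            # Plays the role of 3D_coordinate(i): the eight vertex coordinates.
            "3D_coordinate": cuboid_vertices(center, size, yaw),
        })
    return objects

# Hypothetical space extents and object size, in arbitrary units.
objects = sample_objects(9, (-1.0, 0.0, 2.0), (1.0, 0.5, 5.0), (0.3, 0.2, 0.4), random.Random(0))
```

A seeded random.Random instance is passed in so that a generated data set could be reproduced; whether the embodiment does this is not stated.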

The collision detection unit 23 detects whether any of the N CG objects arranged in the three-dimensional space SP1 as described above is arranged in a region where it cannot physically be placed (step S14). When the detection finds a CG object placed in such a physically unplaceable region, the placement of that CG object is canceled and the process returns to step S13. On the other hand, when no CG object is found in a physically unplaceable region, all N CG objects arranged in the three-dimensional space SP1 are in physically placeable regions, and the process proceeds to step S15.
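The loop between steps S13 and S14 amounts to rejection sampling: a placement is drawn, checked, and drawn again if it is not physically possible. The sketch below illustrates this under simplifying assumptions that are not taken from the embodiment: objects are treated as axis-aligned boxes, and "physically unplaceable" is approximated as interpenetrating an already placed object or leaving the space. All names and values are hypothetical.

```python
import random

def boxes_overlap(a, b):
    """True if two axis-aligned boxes, each given as (min_xyz, max_xyz), intersect."""
    return all(a[0][k] < b[1][k] and b[0][k] < a[1][k] for k in range(3))

def place_without_collision(n, space_min, space_max, size, rng, max_tries=10000):
    """Place n boxes of the given size at random so that no two of them overlap."""
    placed = []
    tries = 0
    while len(placed) < n:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all objects")
        # Step S13 (sketch): draw a candidate position that keeps the box inside the space.
        lo = tuple(rng.uniform(space_min[k], space_max[k] - size[k]) for k in range(3))
        box = (lo, tuple(lo[k] + size[k] for k in range(3)))
        # Step S14 (sketch): cancel the placement and draw again on collision.
        if any(boxes_overlap(box, other) for other in placed):
            continue
        placed.append(box)
    return placed

boxes = place_without_collision(9, (0.0, 0.0, 0.0), (4.0, 1.0, 4.0), (0.3, 0.2, 0.4), random.Random(1))
```

The max_tries guard is an addition of this sketch so that an over-crowded space fails loudly instead of looping forever.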

(Step S15):
In step S15, a rendering process for synthesizing the CG object generated by the CG processing unit 2 (the CG object whose coordinate information, orientation, etc. are determined by the above processing) with the background image Img0 is executed.

Specifically, the 3D-2D conversion unit 25 calculates the two-dimensional coordinates on the background image Img0 that each CG object arranged in the three-dimensional space SP1 specified by the data Info_3D_space occupies when it is combined with the background image Img0. These two-dimensional coordinates are obtained by applying 3D-2D conversion (projection conversion) to the three-dimensional coordinates of the CG object in the three-dimensional space SP1.
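As an illustration of 3D-2D conversion, the following sketch uses an ideal pinhole camera model. The model itself, the camera-frame convention (x to the right, y downward, z along the optical axis), the function name, and the use of a horizontal angle of view are assumptions; the embodiment states only that a projection conversion is applied.

```python
import math

def project_point(point, width, height, fov_x_deg):
    """Project a camera-frame 3D point to pixel coordinates (u, v) on a width x height image."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Focal length in pixels, derived from the horizontal angle of view.
    f = (width / 2.0) / math.tan(math.radians(fov_x_deg) / 2.0)
    return (width / 2.0 + f * x / z, height / 2.0 + f * y / z)

# A point on the optical axis lands at the image centre.
u_center, v_center = project_point((0.0, 0.0, 3.0), 640, 480, 90.0)
# A point on the edge of a 90-degree angle of view lands on the right image border.
u_edge, _ = project_point((2.0, 0.0, 2.0), 640, 480, 90.0)
```

Applying project_point to each vertex of a CG object would give the two-dimensional vertex coordinates used in the later steps.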

Then, based on the two-dimensional coordinates of each CG object on the background image Img0 acquired as described above, the rendering processing unit 3 projects each CG object from the three-dimensional space SP1 onto the two-dimensional space of the background image Img0 and combines the image corresponding to each CG object with the background image Img0. At this time, the rendering process projects and combines the CG objects onto the background image Img0 in order from the back of the line of sight (positions far from the camera Cam1) to the front of the line of sight (positions near the camera Cam1).
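Drawing from back to front means that nearer objects overwrite farther ones wherever they overlap. The following toy sketch shows only that ordering; the dictionary layout, the rectangle footprints, and the use of text labels in place of pixels are assumptions made for illustration.

```python
def painter_order(objects, camera=(0.0, 0.0, 0.0)):
    """Sort objects from farthest to nearest so that nearer ones are drawn last."""
    def dist_sq(obj):
        return sum((obj["center"][k] - camera[k]) ** 2 for k in range(3))
    return sorted(objects, key=dist_sq, reverse=True)

def composite(canvas, objects):
    """Paint each object's pixel rectangle (u0, v0, u1, v1) onto a copy of the canvas."""
    out = [row[:] for row in canvas]
    for obj in painter_order(objects):
        u0, v0, u1, v1 = obj["rect"]
        for v in range(v0, v1):
            for u in range(u0, u1):
                out[v][u] = obj["name"]
    return out

# A 6 x 4 "image" of labels stands in for the background image.
background = [["bg"] * 6 for _ in range(4)]
scene = [
    {"name": "near", "center": (0.0, 0.0, 2.0), "rect": (2, 1, 5, 3)},
    {"name": "far", "center": (0.0, 0.0, 5.0), "rect": (0, 0, 3, 2)},
]
image = composite(background, scene)
```

In the overlapping pixel the label of the nearer object remains, which is the effect the back-to-front order is meant to achieve.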

(Step S16):
In step S16, the learning data generation unit 4 stores the rendering result acquired in step S15, that is, the image (learning image) acquired by the rendering process in which each CG object is projected onto the background image Img0 and combined, in the learning data storage unit DB2.

FIG. 5 shows, as an example, an image Img1 (rendering image Img1) acquired by a rendering process in which N (N=9) CG objects CG_obj1 to CG_obj9 are projected onto the background image Img0 and combined.

(Steps S17 to S19):
In step S17, the 3D-2D conversion unit 25 of the CG processing unit 2 obtains the two-dimensional coordinates on the rendering image Img1 from the three-dimensional coordinates of each vertex of each CG object by projection conversion. Then, the 3D-2D conversion unit 25 determines a 2D bounding box that defines a region surrounding each CG object on the rendered image Img1 (step S18). The 3D-2D conversion unit 25 outputs information for specifying the determined 2D bounding box of each CG object to the learning data generation unit 4 as data Data_coordinate(Bbox(i)). Note that “Bbox(i)” is a notation indicating the 2D bounding box of the i-th CG object.
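Given the projected vertices of one CG object, a 2D bounding box can be taken as the smallest axis-aligned rectangle that contains them. The sketch below is one way to do this; the (u_min, v_min, u_max, v_max) layout and the clamping to the image borders are assumptions, since the format of Data_coordinate(Bbox(i)) is not specified here.

```python
def bounding_box_2d(points_2d, width, height):
    """Smallest axis-aligned rectangle containing the points, clamped to the image."""
    us = [p[0] for p in points_2d]
    vs = [p[1] for p in points_2d]
    u_min = max(0.0, min(us))
    v_min = max(0.0, min(vs))
    u_max = min(float(width), max(us))
    v_max = min(float(height), max(vs))
    return (u_min, v_min, u_max, v_max)

# Hypothetical projected vertices, one of which falls outside a 40 x 40 image.
bbox = bounding_box_2d([(10, 20), (30, 5), (-4, 50)], 40, 40)
```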

The learning data generation unit 4 outputs the data Data_coordinate(Bbox(i)) (learning position label), which includes the bounding box information of each CG object acquired in step S18, to the learning data storage unit DB2, where it is stored (step S19).

FIG. 6 shows, as an example, an image in which the bounding box of each CG object is indicated by a rectangle in the image Img1 (rendering image Img1) acquired by the rendering process in which N (N=9) CG objects CG_obj1 to CG_obj9 are projected onto the background image Img0 and combined.

As described above, the learning data generation system 1000 can acquire a learning image (rendering image Img1), obtained by rendering and compositing the CG objects generated by the CG processing unit 2 onto the background image Img0, together with a learning position label (coordinate data of the 2D bounding box of each CG object) that specifies the position of each CG object on the learning image (rendering image Img1). In the learning data generation system 1000, CG objects whose three-dimensional positions in the three-dimensional space SP1 are known are rendered onto the background image Img0 to acquire the learning image Img1, and the learning position label (the coordinate data of the 2D bounding box of each CG object) that specifies the position of each CG object in the learning image Img1 is determined from those known positions, so extremely accurate teacher data can be acquired. That is, since each CG object is generated by the CG processing unit 2 of the learning data generation device 100, the image area that each CG object occupies when projected onto the background image Img0 can be calculated accurately. As a result, the learning position label (the coordinate data of the 2D bounding box of each CG object) that specifies the position of each CG object becomes extremely accurate.

Furthermore, in the learning data generation system 1000, the CG processing unit 2 can automatically generate CG objects without human intervention. By projecting the generated CG objects onto the background image (by performing the rendering process), the learning data generation system 1000 can generate a large number of learning images Img1 and learning position labels (coordinate data of the 2D bounding box of each CG object) in a short time.

Therefore, the learning data generation system 1000 can acquire a large amount of learning data required for the learning process in a short time in order to acquire the learned model used when executing the object detection process and the like.

Then, by performing the learning process using the large amount of extremely accurate teacher data acquired by the learning data generation system 1000, a learned model used, for example, when executing the object detection process can be obtained with high accuracy and efficiency.

<<First Modification>>
Next, a first modified example of the first embodiment will be described.

The same parts as those in the first embodiment are designated by the same reference numerals, and detailed description thereof will be omitted.

The learning data generation system of this modification is different from the first embodiment in that the texture of the CG object is set to a plurality of types.

FIG. 7 is a diagram showing an image Img1A (rendering image Img1A) acquired by the learning data generation system of the first modified example of the first embodiment through a rendering process in which N (N=9) CG objects CG_obj1 to CG_obj9 are projected onto the background image Img0 and combined.

FIG. 8 is a diagram showing an image in which the bounding box of each CG object is indicated by a rectangle in the image Img1A (rendering image Img1A) acquired by the rendering process in which N (N=9) CG objects CG_obj1 to CG_obj9 are projected onto the background image Img0 and combined.

In the learning data generation system of this modification, the texture setting unit 24 sets the texture to be attached to the surface of each CG object.

For example, assuming that there are two types of CG objects, the texture setting unit 24 sets the texture to be attached to the surface of each CG object to one of two types (two patterns). For example, as shown in FIG. 7, the texture setting unit 24 sets the first pattern texture for the CG objects CG_obj1 to CG_obj3, CG_obj5, and CG_obj8, and sets the second pattern texture for the CG objects CG_obj4, CG_obj6, CG_obj7, and CG_obj9.

When two types of textures are set in this way, the learning data generation system of the present modification executes the same processing as that of the first embodiment, so that information specifying the bounding box of each CG object can be acquired accurately, as illustrated by the rectangles drawn on the rendering image Img1A. Therefore, for example, even when there are many kinds of objects to be subjected to the object detection processing, a large amount of learning data (teacher data) can be generated in a short time by the learning data generation system of this modification.

As described above, in the learning data generation system of the present modification, the texture can be changed in various ways, and by projecting the CG object having various textures onto the background image (by performing the rendering process), It is possible to generate a large number of learning images Img1A and learning position labels (coordinate data of the 2D bounding box of each CG object) in a short time.

Note that, in the present modification, there are two types of textures, but the present invention is not limited to this, and three or more types of textures may be used.

[Second Embodiment]
Next, a second embodiment will be described.

Note that the same parts as those in the above-described embodiment (including modified examples) are designated by the same reference numerals, and detailed description thereof will be omitted.

<2.1: Configuration of Learning Data Generation System 2000>
FIG. 9 is a schematic configuration diagram of the learning data generation system 2000 according to the second embodiment.

The learning data generation system 2000 of the second embodiment has a configuration in which the learning data generation device 100 is replaced with the learning data generation device 100A in the learning data generation system 1000 of the first embodiment. In the learning data generation device 100A, the CG processing unit 2 is replaced with the CG processing unit 2A, and the learning data generation unit 4 is replaced with the learning data generation unit 4A. Other than that, the learning data generation system 2000 of the second embodiment is the same as the learning data generation system 1000 of the first embodiment.

The CG processing unit 2A outputs, to the learning data generation unit 4A as data Label_posture, information specifying the posture (orientation) of each CG object determined (set) by the posture determination unit 22, that is, information specifying what posture the CG object has (which direction it faces) when it is projected onto the background image Img0. The information specifying the posture of the i-th CG object is denoted as data Label_posture(i).

For example, as shown in FIG. 10, when the CG object is a rectangular parallelepiped, three surfaces are visible when the 3D CG object is projected into 2D. A class is therefore set according to which surfaces are visible, and the posture (orientation) of the CG object when projected onto the background image Img0 is specified, for example, by the class number. In the case of FIG. 10, the surfaces visible on the background image Img0 (rendering image Img1) are the E surface as the upper surface, the A surface as the left side surface, and the B surface as the right side surface, and this case is set as “Class 1”, as shown in the figure. The posture (orientation) of the CG object when projected onto the background image Img0 can be specified by the class number set in this way.
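One way to derive such a class from a known pose is to test, for each face of the cuboid, whether the camera lies on the outer side of that face's plane, and then look up the resulting set of visible faces in a class table. The sketch below does this under assumptions that are not from the embodiment: which outward normal carries which letter, y as the vertical axis, rotation about y only, and the rule that unseen combinations receive the next free class number. Only the pairing of the visible set {E, A, B} with class 1 follows the text.

```python
import math

# Hypothetical assignment of face letters to outward normals in the object's local frame.
FACE_NORMALS = {
    "A": (0.0, 0.0, -1.0), "C": (0.0, 0.0, 1.0),
    "B": (1.0, 0.0, 0.0), "D": (-1.0, 0.0, 0.0),
    "E": (0.0, 1.0, 0.0), "F": (0.0, -1.0, 0.0),
}

def visible_faces(center, size, yaw, camera):
    """Return the faces whose plane has the camera on its outer side."""
    half = {"A": size[2] / 2, "C": size[2] / 2, "B": size[0] / 2,
            "D": size[0] / 2, "E": size[1] / 2, "F": size[1] / 2}
    c, s = math.cos(yaw), math.sin(yaw)
    to_cam = [camera[k] - center[k] for k in range(3)]
    faces = set()
    for name, (nx, ny, nz) in FACE_NORMALS.items():
        # Rotate the local normal about the y axis into the world frame.
        world = (c * nx + s * nz, ny, -s * nx + c * nz)
        if sum(world[k] * to_cam[k] for k in range(3)) > half[name]:
            faces.add(name)
    return frozenset(faces)

def posture_class(faces, table):
    """Look up the class number of a visible-face set, registering new sets as needed."""
    if faces not in table:
        table[faces] = len(table) + 1
    return table[faces]

class_table = {frozenset("ABE"): 1}
faces_a = visible_faces((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 0.0, (2.0, 2.0, -2.0))
faces_b = visible_faces((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), math.pi / 2, (2.0, 2.0, -2.0))
```

Because the pose is known exactly, the label follows from arithmetic alone, which is the reason the text can describe the posture label as accurate.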

The learning data generation unit 4A receives the data D2 output from the rendering processing unit 3 (the rendering image Img1, that is, the image obtained by combining the CG objects with the background image Img0), together with the data Data_coordinate(Bbox(i)) (data specifying the bounding box) and the data Label_posture(i) (data specifying the posture of each CG object in the rendering image Img1). Then, the learning data generation unit 4A generates learning data from the input data, and outputs the generated data as data Dout to, for example, the learning data storage unit DB2.

<2.2: Operation of Learning Data Generation System 2000>
The operation of the learning data generation system 2000 configured as above will be described below.

The following describes a case where the learning data generation system 2000 generates learning data for posture detection processing. Further, for convenience of explanation, it is assumed that the object to be detected in the posture detection process has a substantially rectangular parallelepiped shape.

FIG. 11 is a flowchart of a process executed by the learning data generation system 2000 when generating learning data for the posture detection process.

FIG. 12 is a diagram showing an image Img2 (rendering image Img2) obtained by a rendering process in which N (N=10) CG objects CG_obj1 to CG_obj10 are projected onto the background image Img0 and synthesized.

FIG. 13 is a diagram showing an image in which the bounding box of each CG object (corresponding to the image area to be cropped) is indicated by a rectangle in the image Img2A (rendering image Img2A) acquired by the rendering process in which N (N=10) CG objects CG_obj1 to CG_obj10 are projected onto the background image Img0 and combined.

FIG. 14 is a diagram showing the cropped images Img_crop(1) to Img_crop(9) of the CG objects CG_obj1 to CG_obj9, out of the N (N=10) CG objects, together with the determined class numbers.

(Steps S21 to S25):
The processing of steps S21 to S25 is the same as the processing of steps S11 to S15 of the first embodiment.

(Step S26):
In step S26, the 3D-2D conversion unit 25 of the CG processing unit 2A obtains the two-dimensional coordinates on the rendering image Img2 from the three-dimensional coordinates of the vertices of each CG object by projection conversion. Then, the 3D-2D conversion unit 25 detects, on the rendering image Img2, any CG object that is shielded by another CG object when viewed from the viewpoint (the camera position of the background image Img0), and excludes every CG object determined to be completely or partially shielded from the learning data acquisition targets. That is, through the process of step S26, only unshielded CG objects are set as learning data acquisition targets.
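The exclusion of shielded objects can be approximated as follows. This sketch is deliberately conservative and is not the method of the embodiment: it treats a CG object as shielded whenever its 2D bounding box overlaps the bounding box of any object nearer to the camera, which may also discard some objects that are in fact fully visible. The field names and values are hypothetical.

```python
def rects_overlap(a, b):
    """True if two rectangles (u_min, v_min, u_max, v_max) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def unshielded(objects):
    """Keep only objects whose bounding box is not overlapped by a nearer object."""
    kept = []
    for obj in objects:
        shielded = any(
            other is not obj
            and other["depth"] < obj["depth"]
            and rects_overlap(obj["bbox"], other["bbox"])
            for other in objects
        )
        if not shielded:
            kept.append(obj)
    return kept

candidates = [
    {"name": "a", "depth": 2.0, "bbox": (0, 0, 10, 10)},
    {"name": "b", "depth": 5.0, "bbox": (5, 5, 15, 15)},    # behind "a" and overlapping it
    {"name": "c", "depth": 6.0, "bbox": (20, 20, 30, 30)},  # far away but not overlapped
]
targets = unshielded(candidates)
```

An exact test would compare the projected silhouettes rather than the bounding boxes; the box test is used here only to keep the example short.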

(Step S27):
In step S27, the learning data generation unit 4A sets the area defined by the 2D bounding box (data that determines the area surrounding each CG object on the rendering image) as the crop area and extracts the image of the crop area. The image obtained by extracting the crop area of the i-th CG object is referred to as the cropped image Img_crop(i).

The cropped image Img_crop(i) acquired by the learning data generation unit 4A is stored in the learning data storage unit DB2 as a learning image.
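Extracting the crop area is a plain sub-image operation. The sketch below represents the rendering image as a list of pixel rows and rounds the bounding box outward so that the whole object stays inside the crop; both the image representation and the rounding rule are assumptions made for illustration.

```python
import math

def crop(image, bbox):
    """Return the sub-image covered by bbox = (u_min, v_min, u_max, v_max)."""
    u_min, v_min = math.floor(bbox[0]), math.floor(bbox[1])
    u_max, v_max = math.ceil(bbox[2]), math.ceil(bbox[3])
    return [row[u_min:u_max] for row in image[v_min:v_max]]

# A 5 x 4 toy image whose pixel value encodes (row, column) as row * 10 + column.
toy_image = [[r * 10 + c for c in range(5)] for r in range(4)]
img_crop = crop(toy_image, (0.6, 1.0, 3.2, 2.5))
```

Pairing each cropped image with its Label_posture(i) then gives one learning sample for posture detection.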

(Step S28):
In step S28, the learning data generation unit 4A acquires the data Label_posture(i) (data indicating the orientation of the i-th CG object on the rendering image) output from the CG processing unit 2A, and stores the data in the learning data storage unit DB2. The data Label_posture(i) is data indicating the orientation of the CG object included in the cropped image Img_crop(i).

In the learning data generation system 2000 of the present embodiment, as shown in FIG. 14, the cropped images Img_crop(1) to Img_crop(9) of the CG objects CG_obj1 to CG_obj9 can be acquired (the CG object CG_obj10 is shielded and is therefore excluded), and the posture label of each (data specifying the posture of the CG object on the cropped image) can also be acquired accurately.

As described above, in the learning data generation system 2000, the cropped image acquired for each CG object from the rendering image Img2 acquired by rendering and compositing the CG object generated by the CG processing unit on the background image Img0, and It is possible to acquire a posture label that specifies the posture of the CG object in the cropped image.

In the learning data generation system 2000, CG objects whose three-dimensional positions and postures in the three-dimensional space SP1 are known are rendered onto the background image Img0 to obtain the rendering image Img2, and the area defined by the 2D bounding box that specifies the position of each CG object in the rendering image Img2 is set as the crop area, so a cropped image containing each CG object can be acquired extremely accurately.
Furthermore, since the CG object included in each cropped image is generated by the CG processing unit 2A of the learning data generation device 100A, the posture of each CG object when projected onto the background image Img0 can be obtained accurately by calculation. As a result, the learning posture label that identifies the posture of each CG object on the cropped image (for example, the class number) becomes extremely accurate.

Further, in the learning data generation system 2000, the CG processing unit 2A can automatically generate CG objects without human intervention. By projecting the generated CG objects onto the background image (by performing the rendering process), the learning data generation system 2000 can generate, in a short time, a large number of learning images (cropped images of each CG object) and learning posture labels (data, for example a class number, that identifies the posture of each CG object on the cropped image).

Therefore, in the learning data generation system 2000, a large amount of learning data necessary for the learning process can be acquired in a short time in order to acquire the learned model used when executing the posture detection process and the like.

Then, by performing a learning process using the large amount of extremely accurate teacher data acquired by the learning data generation system 2000, a learned model used, for example, when executing a posture detection process can be obtained with high accuracy and efficiency.

Note that, similarly to the first modified example of the first embodiment, the learning data generation system 2000 of the present embodiment may have a plurality of types (a plurality of patterns) of textures of CG objects.

[Third Embodiment]
Next, a third embodiment will be described.

FIG. 15 is a schematic configuration diagram of a learning data generation system 3000 according to the third embodiment.

FIG. 16 is a diagram showing a background image Img0A in which the detection target object (real object) Real_obj is shown.

FIG. 17 is a diagram showing the 2D bounding box Bbox_manual manually set in the background image Img0A in which the detection target object (real object) Real_obj is shown.

FIG. 18 is a diagram showing a rendering image Img3 obtained by rendering a CG object on a background image Img0A in which the detection target object (actual object) Real_obj is shown.

The learning data generation system 3000 of the third embodiment has a configuration in which a manual bounding box information input unit 5 is added to the learning data generation device 100 of the learning data generation system 1000 of the first embodiment.

In the learning data generation system 3000, for example, in order to generate learning data for object detection, the image area of the detection target object (actual object) shown in the background image is specified manually, as in the conventional method, to set a 2D bounding box. The information of the manually set 2D bounding box is then acquired by the manual bounding box information input unit 5.

Then, in the learning data generation system 3000, the acquired information is input to the rendering processing unit 3, and the rendering processing unit 3 arranges the CG objects generated by CG in an area other than the image area inside the manually set 2D bounding box.
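Keeping CG objects out of the manually annotated area can be sketched as a rejection test on the image plane: a candidate footprint is drawn at random and discarded if it intersects Bbox_manual. The embodiment does not say how the area is avoided, so the approach, the function names, and the numeric values below are all assumptions.

```python
import random

def rects_overlap(a, b):
    """True if two rectangles (u_min, v_min, u_max, v_max) intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def sample_rect_outside(manual_bbox, obj_size, image_size, rng, max_tries=1000):
    """Draw a rectangle of obj_size inside the image that avoids manual_bbox."""
    w, h = obj_size
    width, height = image_size
    for _ in range(max_tries):
        u = rng.uniform(0, width - w)
        v = rng.uniform(0, height - h)
        rect = (u, v, u + w, v + h)
        if not rects_overlap(rect, manual_bbox):
            return rect
    raise RuntimeError("no free position found outside the manual bounding box")

# Hypothetical manual box around the real object in a 640 x 480 background image.
bbox_manual = (200, 150, 400, 350)
rng = random.Random(0)
cg_rects = [sample_rect_outside(bbox_manual, (60, 40), (640, 480), rng) for _ in range(5)]
```

With this constraint the manually labelled real object and the automatically labelled CG objects can share one image without covering each other.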

By doing so, learning data by the conventional method can be acquired using the background image Img0A in which the detection target object (actual object) is captured, and learning data can also be acquired using the CG objects generated by CG as described in the above embodiments.

In the learning data generation system 3000, for example, when processing is performed using the background image Img0A of FIG. 16, the processing of the above embodiments is performed on the area other than the image area of the manually set 2D bounding box Bbox_manual, as shown in FIG. 17.

By processing in this way, the learning data generation system 3000 can execute the processing described in the above embodiments while rendering CG objects in a region other than the image region of the manually set 2D bounding box Bbox_manual, as shown, for example, in FIG. 18.

In addition, the manual bounding box information input unit 5 may be added to the learning data generation device 100A of the second embodiment in the same way as in the present embodiment, and the learning data may likewise be acquired by rendering CG objects in an area other than the image area of the manually set 2D bounding box Bbox_manual.

[Fourth Embodiment]
Next, a fourth embodiment will be described.

<4.1: Configuration of Learning Data Generation System 4000>
FIG. 19 is a schematic configuration diagram of a learning data generation system 4000 according to the fourth embodiment.

The learning data generation system 4000 of the fourth embodiment has a configuration in which the learning data generation device 100 is replaced with the learning data generation device 100C in the learning data generation system 1000 of the first embodiment. In the learning data generation device 100C, the background image data acquisition unit 1 is replaced with the background image data acquisition unit 1A, the CG processing unit 2 is replaced with the CG processing unit 2B, the posture determination unit 22 is replaced with the key information determination unit 22A, the rendering processing unit 3 is replaced with the rendering processing unit 3A, and the learning data generation unit 4 is replaced with the learning data generation unit 4B. Other than that, the learning data generation system 4000 of the fourth embodiment is the same as the learning data generation system 1000 of the first embodiment.

Note that, in the fourth embodiment, for convenience of description, a case is described as an example in which, as shown in FIGS. 20 and 21, an image including a rectangular parallelepiped object arranged in a predetermined space (for example, an image obtained by cropping the region R1 in FIG. 21) is used as the background image (the image with which the CG object is combined), and an object (CG object) forming a keyhole is CG-composited onto the surface of the rectangular parallelepiped object.

The background image data acquisition unit 1A acquires, from the background image data storage unit DB1, predetermined background image data together with (1) information for specifying the three-dimensional space of the imaging target when the background image was acquired (vertical, horizontal, and height information of the three-dimensional space, and the like), (2) information on the shooting parameters (focal length of the camera, angle of view, and the like) when the background image was acquired, and (3) information such as the size and shape of the extraction target object (the object with which the keyhole is to be combined) included in the background image. The background image data acquisition unit 1A acquires data including the above information (1) to (3) at the time the background image was captured as the data Info1.

The background image data acquisition unit 1A specifies, for example, an area (for example, the area R1 in FIG. 21) on the background image of the extraction object (object to be combined with the keyhole) by the image recognition processing, and cuts out the area. The image is acquired as the extracted image D1A. In addition, the background image data acquisition unit 1A acquires data Info2 (D1A) including information for specifying which space of the three-dimensional space of the imaging target the cut-out image region corresponds to.

Then, the background image data acquisition unit 1A outputs the acquired data D1A (the image in which the region of the extraction target object (the object to be combined with the keyhole) is extracted) to the rendering processing unit 3A, and outputs the acquired data Info1 and data Info2 (D1A) to the CG processing unit 2B and the rendering processing unit 3A.

The CG processing unit 2B is a processing unit that generates a CG object (object generated by CG) to be arranged in the three-dimensional space in which the background image was captured, and generates the data necessary for combining the CG object with the background image (the image D1A acquired from the background image data acquisition unit 1A).

The key information determination unit 22A specifies the type of the key to be CG-synthesized, and also specifies position information in the three-dimensional space regarding the shape of the key to be CG-synthesized (information for specifying the three-dimensional position (three-dimensional shape) of the key when the key is arranged in the three-dimensional space by CG synthesis). The information for identifying the type of the key and the information for identifying the three-dimensional position (three-dimensional shape) of the key are stored in a predetermined storage unit (not shown). The position information in the three-dimensional space regarding the shape of the key to be CG-synthesized may be three-dimensional coordinates based on the coordinate system set in the three-dimensional space of the imaging target (data in absolute coordinates), or three-dimensional coordinates based on a coordinate system set in the space corresponding to the region from which the extraction target object (the object to be combined with the keyhole) is cut out (for example, data in relative coordinates whose origin is a predetermined point of the cut-out region, such as its left end point).

The CG processing unit 2B outputs the data Data_CG_obj necessary for combining the CG object (key-shaped CG object) with the background image D1A to the rendering processing unit 3A. The data Data_CG_obj is acquired by the 3D-2D conversion unit 25 converting the three-dimensional position data (three-dimensional coordinate data) of the keyhole-shaped object (CG object) specified by the key information determination unit 22A into two-dimensional coordinate data on the background image, for the case where the object is combined with the background image.

Further, the CG processing unit 2B acquires, as data Key_pos(i), information including the two-dimensional coordinate data of the CG object on the background image (the two-dimensional coordinate data acquired by 3D-2D conversion) for the case where the keyhole-shaped object (CG object) is combined with the background image.

The CG processing unit 2B also outputs, to the learning data generation unit 4B, information indicating the key type of the keyhole-shaped object (CG object) identified by the key information determination unit 22A (taken to be the i-th (i: natural number) CG object) as data Key_type(i), and information indicating the key position of the keyhole-shaped object (CG object) on the composite image as data Key_pos(i).

The rendering processing unit 3A receives the data Info1, the data Info2 (D1A), and the image data D1A (the extracted image of the object with which the keyhole is to be combined) output from the background image data acquisition unit 1A, and the data Data_CG_obj output from the CG processing unit 2B.

The rendering processing unit 3A combines the CG object (keyhole-shaped CG object) generated by the CG processing unit 2B with the image D1A based on the data Info1, the data Info2 (D1A), and the data Data_CG_obj to acquire the composite image data Img_render(i), and outputs the acquired composite image data Img_render(i) to the learning data generation unit 4B.

The learning data generation unit 4B receives the composite image data Img_render(i) output from the rendering processing unit 3A, and the data Key_type(i) (information indicating the key type of the CG object (keyhole-shaped CG object)) and the data Key_pos(i) (information indicating the key position of the keyhole-shaped object (CG object) on the composite image) output from the CG processing unit 2B.

The learning data generation unit 4B generates learning data from the input data, and outputs the generated data as data Dout to, for example, the learning data storage unit DB2. The data Dout is assumed to include the following data for the i-th CG object:
(1) Img_render(i) (image in which the keyhole is combined)
(2) Label_key(i) (label including key type information and key position information)
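The content of the data Dout can be pictured as one record per composite image, pairing the image with a label made of the key type and the key position. The following sketch uses a hypothetical record type and string stand-ins for the images; the enumeration of every combination of four key types and two positions mirrors the example of FIGS. 23 and 24 but is otherwise an assumption.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class LearningRecord:
    index: int         # i
    img_render: str    # stand-in for the composite image Img_render(i)
    label_key: tuple   # Label_key(i) = (Key_type(i), Key_pos(i))

def build_records(key_types, positions):
    """Create one record for every combination of key type and key position."""
    records = []
    for i, (key_type, pos) in enumerate(product(key_types, positions), start=1):
        records.append(LearningRecord(i, f"Img_render({i})", (key_type, pos)))
    return records

records = build_records(["key1", "key2", "key3", "key4"], ["pos_L", "pos_R"])
```

In a real pipeline the img_render field would hold image data or a file path, and pos_L and pos_R would hold two-dimensional coordinate data rather than symbolic names.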

The learning data storage unit DB2 receives the data Dout output from the learning data generation unit 4B, and stores and holds the data.

<4.2: Operation of Learning Data Generation System 4000>
The operation of the learning data generation system 4000 configured as above will be described below.

In the following, a case will be described in which learning data generation system 4000 generates learning data for keyhole detection processing. Further, for convenience of explanation, it is assumed that the object to be detected in the keyhole detection process has a substantially rectangular parallelepiped shape.

FIG. 22 is a flowchart of the processing executed by the learning data generation system 4000 when generating learning data for the keyhole detection processing.

The operation of the learning data generation system 4000 will be described below with reference to this flowchart.

(Step S31):
In step S31, the background image data acquisition unit 1A acquires one background image data from the background image data storage unit DB1. For convenience of explanation, the background image data acquisition unit 1A acquires the background image Img4 shown in FIG. 20 from the background image data storage unit DB1, and this case will be described below.

(Steps S32, S33):
The CG processing unit 2B identifies the key type Key_type(i) for CG combination by means of the key information determination unit 22A (step S32). It also performs 3D-2D conversion on the position information in the three-dimensional space regarding the shape of the key to be CG-combined to acquire two-dimensional coordinate data on the background image, and specifies the data Key_pos(i) including the acquired two-dimensional coordinate data (step S33). In step S33, the key position information Key_pos(i) may include information that can specify, on the background image (composite image), in which of the left and right areas of the surface of the object with which the key is combined the key is arranged.

The CG processing unit 2B outputs the data Data_CG_obj necessary for synthesizing the CG object (key-shaped CG object) to the background image D1A to the rendering processing unit 3A.

(Step S34):
In step S34, the rendering processing unit 3A combines the CG object (keyhole-shaped CG object) generated by the CG processing unit 2B with the image D1A, based on the data Info1, the data Info2 (D1A), and the image data D1A (the extracted image of the object with which the keyhole is to be combined) output from the background image data acquisition unit 1A and on the data Data_CG_obj output from the CG processing unit 2B, and thereby acquires the composite image data Img_render(i) (the rendering result image).

FIGS. 23 and 24 schematically show, as an example, a state in which four types of keyhole-shaped CG objects key1 to key4 are combined with the image data D1A (the extracted image of the object with which the keyhole is to be combined; written as Img_real(box1) in FIGS. 23 and 24). Note that, in FIGS. 23 and 24, the data Data_CG_obj of the keyhole-shaped CG object keyx is expressed as Data_CG_obj(keyx).

As shown in FIG. 23, the rendering processing unit 3A generates, for example, the following learning data.
(1) When i=1 (when the key key1 is combined with the left area on the front of the object to be combined)
Key type: key1
Composite image data: Img_render(1)
Learning label: Label_key(1)=(key1, pos_L)
pos_L: Position information of the keyhole-shaped CG object
(2) When i=2 (when the key key1 is combined with the right area on the front of the object to be combined)
Key type: key1
Composite image data: Img_render(2)
Learning label: Label_key(2)=(key1, pos_R)
pos_R: Position information of the keyhole-shaped CG object
(3) When i=3 (when the key key2 is combined with the left area on the front of the object to be combined)
Key type: key2
Composite image data: Img_render(3)
Learning label: Label_key(3)=(key2, pos_L)
pos_L: Position information of the keyhole-shaped CG object
(4) When i=4 (when the key key2 is combined with the right area on the front of the object to be combined)
Key type: key2
Composite image data: Img_render(4)
Learning label: Label_key(4)=(key2, pos_R)
pos_R: Position information of the keyhole-shaped CG object

Further, as shown in FIG. 24, the rendering processing unit 3A generates, for example, the following learning data.
(5) When i=5 (when the key key3 is combined with the left area on the front of the object to be combined)
Key type: key3
Composite image data: Img_render(5)
Learning label: Label_key(5)=(key3, pos_L)
pos_L: Position information of the keyhole-shaped CG object
(6) When i=6 (when the key key3 is combined with the right area on the front of the object to be combined)
Key type: key3
Composite image data: Img_render(6)
Learning label: Label_key(6)=(key3, pos_R)
pos_R: Position information of the keyhole-shaped CG object
(7) When i=7 (when the key key4 is combined with the left area on the front of the object to be combined)
Key type: key4
Composite image data: Img_render(7)
Learning label: Label_key(7)=(key4, pos_L)
pos_L: Position information of the keyhole-shaped CG object
(8) When i=8 (when the key key4 is combined with the right area on the front of the object to be combined)
Key type: key4
Composite image data: Img_render(8)
Learning label: Label_key(8)=(key4, pos_R)
pos_R: Position information of the keyhole-shaped CG object

Then, the rendering processing unit 3A outputs the acquired combined image data Img_render(i) to the learning data generation unit 4B.
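
The eight combinations of key type and key position listed above can be enumerated mechanically. The following is a minimal sketch, not the implementation of the rendering processing unit 3A: the class name KeyLabel and the function enumerate_labels are hypothetical names introduced only for illustration, and the position is reduced to the symbolic values pos_L and pos_R rather than actual coordinate data.

```python
from itertools import product
from typing import NamedTuple


class KeyLabel(NamedTuple):
    """Hypothetical container mirroring Label_key(i) = (key type, key position)."""
    key_type: str
    key_pos: str


def enumerate_labels(key_types, positions):
    """Return {i: KeyLabel}, with i starting at 1 and the key type varying slowest."""
    return {
        i: KeyLabel(key_type, pos)
        for i, (key_type, pos) in enumerate(product(key_types, positions), start=1)
    }


labels = enumerate_labels(["key1", "key2", "key3", "key4"], ["pos_L", "pos_R"])
for i, label in labels.items():
    print(f"Img_render({i}) -> Label_key({i}) = ({label.key_type}, {label.key_pos})")
```

Because the index i is derived from the same enumeration that would drive rendering, each composite image and its label stay paired without manual annotation.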

(Steps S35, S36):
The learning data generation unit 4B stores the combined image data Img_render(i) output from the rendering processing unit 3A as a learning image in the learning data storage unit DB2 (step S35).

Further, the learning data generation unit 4B generates the learning label Label_key(i) (a label including the key type information and the key position information) from the data Key_type(i), which indicates the key type of the CG object (keyhole-shaped CG object) output from the CG processing unit 2B, and the data Key_pos(i), which indicates the position of the keyhole-shaped CG object on the combined image, and stores the generated learning label Label_key(i) in the learning data storage unit DB2 (step S36).

As described above, the learning data generation system 4000 can acquire a learning image (rendering image Img_render(i)) obtained by rendering and combining the CG object generated by the CG processing unit 2B onto the background image D1A, together with a learning label Label_key(i) (a label including the key type information Key_type(i) and the key position information Key_pos(i)).

In the learning data generation system 4000, a CG object (keyhole-shaped CG object) whose three-dimensional position in the three-dimensional space SP1 is known is combined with a background image (the image to be combined with, for example, the extracted image of the region R1 in FIG. 21) to acquire the learning image Img_render(i) (keyhole composite image data). In addition, a learning label Label_key(i) (a label including the key type information Key_type(i) and the key position information Key_pos(i)) that specifies the position of each CG object (keyhole-shaped CG object) in the learning image Img_render(i) is acquired. Therefore, the learning data generation system 4000 can acquire extremely accurate teacher data. That is, since each CG object is generated by the CG processing unit 2B of the learning data generation device 100C, the type of key is known exactly, and when each CG object is projected onto the background image D1A, the position of the image area occupied by each CG object (keyhole-shaped CG object) can be calculated accurately.

Further, in the learning data generation system 4000, the CG processing unit 2B can automatically generate a CG object without human intervention. The learning data generation system 4000 can then generate a large number of learning images and learning labels in a short time by projecting the generated CG object onto the background image (that is, by performing rendering processing).

Therefore, the learning data generation system 4000 can acquire, in a short time, the large amount of learning data necessary for the learning process that produces the learned model used when executing the keyhole detection process and the like.

[Fifth Embodiment]
Next, a fifth embodiment will be described.

<5.1: Configuration of Learning Inference Processing System Sys1>
FIG. 25 is a schematic configuration diagram of the learning inference processing system Sys1 according to the fifth embodiment.

FIG. 26 is a schematic configuration diagram of the learning processing device 200 according to the fifth embodiment.

FIG. 27 is a schematic configuration diagram of the inference processing device 300 according to the fifth embodiment.

As shown in FIG. 25, the learning inference processing system Sys1 includes a learning data storage unit DB2, a learning processing device 200, an optimization parameter storage unit DB3, a camera C1, and an inference processing device 300.

It is assumed that the learning data storage unit DB2 stores the learning data generated by the learning data generation system 4000 of the fourth embodiment.

In the learning inference processing system Sys1, the learning process is performed using learning data consisting of the learning image (rendering image Img_render(i)) and the learning label Label_key(i) (a label including the key type information Key_type(i) and the key position information Key_pos(i)), and a learned model is acquired that, when an image is input, outputs data specifying the key type and the key position on the surface of the object included in the image.

Further, in the learning inference processing system Sys1, in the inference processing, the three-dimensional position of the key on the surface of the object included in the image acquired by capturing the image of the space to be captured is estimated.

In the following, the learning processing device 200 and the inference processing device 300 will be described separately.

<5.2: Learning processing device 200>
As shown in FIG. 26, the learning processing device 200 includes a learning data input unit 201, a learning model 202, a parameter updating unit 203, and a determining unit 204.

The learning data input unit 201 acquires the learning data DL_in from the learning data storage unit DB2. Then, the learning data input unit 201 takes out the image data Img_render(i) (keyhole composite image) included in the learning data DL_in and outputs it to the learning model 202. Further, the learning data input unit 201 takes out the learning label Label_key(i) (the label including the key type information Key_type(i) and the key position information Key_pos(i)) included in the learning data DL_in and outputs it to the determination unit 204 as teacher data DL_answer.

The learning model 202 is, for example, a model based on a neural network including an input layer, a plurality of intermediate layers, and an output layer. The weighting coefficient between the layers of the learning model 202 (weighting of synapse connection connecting the layers) is set (adjusted) by the parameter θ output from the parameter updating unit 203. The learning model 202 outputs the data output from the output layer as the data DL_out to the determination unit 204. The data DL_out has the same dimension as the learning label Label_key(i) (label including the key type information Key_type(i) and the key position information Key_pos(i)).

The parameter updating unit 203 inputs the control signal adj_prm output from the determining unit 204, and updates the parameter θ of the learning model 202 based on the control signal adj_prm (updates the weighting coefficient of each synapse connection).

The determination unit 204 receives the teacher data DL_answer output from the learning data input unit 201 and the data DL_out output from the learning model 202. The determination unit 204 compares the data DL_out with the teacher data DL_answer, generates the control signal adj_prm for updating the parameter so that the difference between them (for example, the norm of the difference between the two) becomes small, and outputs the generated control signal adj_prm to the parameter updating unit 203.

In addition, let x be the set of input data DL_img to the learning model 202, let y be the set of output data DL_out from the learning model 202, and let P(y|x) be the conditional probability that the output data y is output when the input data x is input to the learning model 202. The determination unit 204 acquires the optimum parameter θ_opt that satisfies

θ_opt = argmax_θ P(y|x)

by repeating the above process of updating (adjusting) the parameter. The conditional probability P(y|x) takes a larger value as the output data is closer to the teacher data.

For example, the conditional probability P(y|x) is set as follows.

σ: standard deviation

Note that x_i is a vector included in the set x, y_i is a vector included in the set y, and y_{i_correct} is the teacher data (correct data, given as vector data) for the input x_i. H(x_i; θ) represents an operator corresponding to the process of applying the processing of a neural network including a plurality of layers to the input x_i and acquiring an output. The parameter θ is, for example, a parameter that determines the weighting of the synaptic connections of the neural network. Note that H(x_i; θ) may include a non-linear operation.
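
The expression for P(y|x) itself is not reproduced in this text. One form that is consistent with the symbols defined here (σ, x_i, y_{i_correct}, H(x_i; θ)) and with the statement that P(y|x) becomes larger as the output approaches the teacher data is an isotropic Gaussian likelihood. The following is offered as an assumption, not necessarily as the expression of the original specification:

```latex
P(y \mid x) \;\propto\; \prod_{i}
  \exp\!\left( -\,\frac{\lVert\, y_{i\_correct} - H(x_i;\theta) \,\rVert^{2}}{2\sigma^{2}} \right)
```

Under this assumed form and a fixed σ, maximizing P(y|x) with respect to θ is equivalent to minimizing the sum over i of the squared norms of y_{i_correct} − H(x_i; θ), which agrees with the determination unit 204 updating the parameter so that the difference between DL_out and DL_answer becomes small.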

The determination unit 204 stores the acquired optimum parameter θ_opt in the optimization parameter storage unit DB3. The parameter θ and the optimum parameter θ_opt are vectors or tensors.

As described above, the learning processing device 200 acquires the optimum parameter θ_opt which is a parameter set in the learned model.
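
The interplay of the learning model 202, the determination unit 204, and the parameter updating unit 203 can be illustrated with a small sketch. It makes several assumptions that are not in the specification: the neural network is replaced by a two-parameter linear stand-in, the difference is measured by the mean squared error, the update is plain gradient descent, and the names model and train are hypothetical.

```python
def model(x, theta):
    """Stand-in for the learning model 202: H(x; theta) = theta[0] * x + theta[1]."""
    return theta[0] * x + theta[1]


def train(samples, lr=0.05, tol=1e-10, max_iter=10_000):
    """Repeat compare-and-update until the mean squared difference drops below tol."""
    theta = [0.0, 0.0]
    n = len(samples)
    for _ in range(max_iter):
        # Determination step: compare the model output with the teacher data.
        errors = [(model(x, theta) - y, x) for x, y in samples]
        loss = sum(e * e for e, _ in errors) / n
        if loss < tol:
            break
        # Update step: gradient of the mean squared difference with respect to theta.
        grad0 = sum(2 * e * x for e, x in errors) / n
        grad1 = sum(2 * e for e, _ in errors) / n
        theta = [theta[0] - lr * grad0, theta[1] - lr * grad1]
    return theta


# Toy teacher data generated from theta = (2, -1).
samples = [(x, 2.0 * x - 1.0) for x in (-1.0, 0.0, 1.0, 2.0)]
theta_opt = train(samples)
print(theta_opt)
```

In the device described above, the value returned by train corresponds to the optimum parameter θ_opt that is stored for later use by the inference processing device 300.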

<5.3.1: Configuration of the inference processing device 300>
Next, the configuration of the inference processing device 300 will be described.

As shown in FIG. 27, the inference processing device 300 includes an input interface 31, an image recognition extraction unit 32, a prediction unit 33, a 2D coordinate detection unit 34, a detection accuracy determination unit 35, a shooting parameter adjustment unit 36, a key parameter acquisition unit 37, and a 3D coordinate estimation unit 38. The inference processing device 300 can receive the image captured by the camera C1 through the input interface 31. Further, as shown in FIG. 27, in the inference processing device 300, the prediction unit 33 is connected to the optimization parameter storage unit DB3, the 2D coordinate detection unit 34 is connected to the keyhole pattern storage unit DB4, and the key parameter acquisition unit 37 is connected to the key parameter storage unit DB5.

The input interface 31 is an input interface with an external device. The input interface 31 inputs an image (or video) DPin captured by the camera C1, and outputs the input data as data DP1 to the image recognition extraction unit 32.

The image recognition extraction unit 32 receives the data DP1 (image DP1) output from the input interface 31 and extracts, from the data DP1, an image region including a predetermined target object (for example, the image region R1 when the input image is Img4 in FIG. 21). Then, the image recognition extraction unit 32 outputs the extracted image as data DP2 (image DP2) to the prediction unit 33. In addition, the image recognition extraction unit 32 outputs, to the 3D coordinate estimation unit 38 as data Info_3D_extracted_img, information for specifying which position in the three-dimensional space being imaged corresponds to the extracted image region. Note that the data Info_3D_extracted_img includes, for example, data indicating the ratio of the area of the image region occupied by the predetermined object to the area of the entire image DP1.

The prediction unit 33 includes an optimization parameter setting unit 331 and a prediction model (learned model) 332.

The optimization parameter setting unit 331 acquires, from the optimization parameter storage unit DB3, the optimum parameter θ_opt acquired by the learning processing device 200. Then, the optimization parameter setting unit 331 sets the optimum parameter θ_opt in the prediction model 332. As a result, the prediction model 332 becomes the same model as the learned model (the learning model 202 with the optimum parameter set) acquired by the learning processing device 200.

The prediction model 332 is a model having the same configuration as the learning model 202, and the parameters of the prediction model 332 are set by the optimization parameter setting unit 331. The prediction model 332 receives the image DP2 output from the image recognition extraction unit 32 and outputs, to the 2D coordinate detection unit 34 as output data DP3, data including the type of the key on the surface of the target object and the position information of the key.

The 2D coordinate detection unit 34 receives the data DP3 output from the prediction model 332 and the image DP2 output from the image recognition extraction unit 32. Further, the 2D coordinate detection unit 34 receives pattern matching template (keyhole pattern template) data from the keyhole pattern storage unit DB4. The 2D coordinate detection unit 34 specifies the approximate position of the key in the image DP2 (for example, the right side area or the left side area of the predetermined surface) based on the data DP3 acquired by the prediction model 332 and, based on the specified position, performs pattern matching using the keyhole pattern template acquired from the keyhole pattern storage unit DB4. Then, the 2D coordinate detection unit 34 outputs the data DP4 of the detection result of the pattern matching and the detection accuracy accr1 of the pattern matching to the detection accuracy determination unit 35.

The detection accuracy determination unit 35 receives the data DP4 of the detection result of the pattern matching output from the 2D coordinate detection unit 34 and the detection accuracy accr1 of the pattern matching, and determines, based on the input data, the accuracy of the pattern matching performed by the 2D coordinate detection unit 34. Then, the detection accuracy determination unit 35 outputs the data Rst1 indicating the determination result to the shooting parameter adjustment unit 36. Further, when the detection accuracy determination unit 35 determines that the accuracy of the pattern matching by the 2D coordinate detection unit 34 is sufficient, it outputs, to the 3D coordinate estimation unit 38 as the data DP5, data including information about the key pattern for which the predetermined accuracy was ensured by the pattern matching and data on the coordinate position of the key on the image DP4.

The shooting parameter adjustment unit 36 receives the shooting parameter Param_cam output from the camera C1 and the accuracy detection result data Rst1 output from the detection accuracy determination unit 35. When the accuracy detection result data Rst1 indicates that the accuracy is not sufficient, the shooting parameter adjustment unit 36 generates a control signal Ctl1 for changing a shooting parameter (for example, the focal length) of the camera C1 and outputs it to the camera C1. Further, the shooting parameter adjustment unit 36 outputs the shooting parameter Param_cam acquired from the camera C1 to the 3D coordinate estimation unit 38.

The key parameter acquisition unit 37 receives the request signal Req_key, output from the 3D coordinate estimation unit 38, for requesting acquisition of a key parameter. When the request signal Req_key is input, the key parameter acquisition unit 37 acquires the key parameter specified by the request signal Req_key from the key parameter storage unit DB5 and outputs data including the acquired key parameter to the 3D coordinate estimation unit 38 as data Prm_key.

The 3D coordinate estimation unit 38 receives the data DP5 output from the detection accuracy determination unit 35. Further, the 3D coordinate estimation unit 38 receives the data Info_3D_extracted_img output from the image recognition extraction unit 32, the information (data) Info_3D for specifying the captured three-dimensional space, the shooting parameter Param_cam output from the shooting parameter adjustment unit 36, and the key parameter data Prm_key output from the key parameter acquisition unit 37.

The 3D coordinate estimation unit 38 estimates the three-dimensional coordinates of the key on the surface of the target object shown in the image DP1, based on the data DP5, the data Info_3D_extracted_img, the data Info_3D, the shooting parameter Param_cam, and the key parameter data Prm_key. Then, the 3D coordinate estimation unit 38 acquires the estimation result data as the data DPout. Note that the 3D coordinate estimation unit 38 can acquire data on the size of the object (target object) with which a CG object (for example, a keyhole) is combined, and acquires the three-dimensional distance from the camera C1 to the target object from (1) the focal length of the camera C1 and (2) the ratio of the image region corresponding to the target object to the entire image region in the image (captured image DPin) captured by the camera C1 at that focal length.

<5.3.2: Operation of inference processing device 300>
The operation of the inference processing device 300 configured as above will be described.

FIG. 28 is a flowchart of the inference processing of the inference processing device 300.

In the following, a case will be described in which the inference processing device 300 performs key type/position determination processing as inference processing. Further, for convenience of explanation, it is assumed that the image (video) acquired by the camera C1 is the image Img5 shown in FIG.

(Step S41):
In step S41, the input interface 31 acquires the video frame from the camera C1 by inputting the data DPin (image Img5) captured by the camera C1.

(Steps S42 and S43):
The image recognition extraction unit 32 recognizes the target object (a rectangular parallelepiped object) shown in the image Img5 by image recognition processing, and extracts the image area of the target object (step S42). Then, the image recognition extraction unit 32 outputs the extracted image as the image DP2 to the prediction unit 33.

Further, the image recognition extraction unit 32 determines the type of the extracted object (step S43). This type determination of the object is preferably performed by, for example, an inference processing device using a learned model learned by the learning data (learning data for object detection) generated in the first embodiment.

The prediction unit 33 also inputs the image DP2 into the prediction model 332 to acquire the data DP3 including the type of key on the surface of the target object and the position information of the key.

Then, based on the data DP3 acquired by the prediction model 332, the 2D coordinate detection unit 34 specifies the approximate position of the key on the surface of the target object (rectangular parallelepiped object), that is, its two-dimensional coordinate position on the image (for example, the right side area or the left side area of a predetermined surface).

(Step S44):
In step S44, the 2D coordinate detection unit 34 performs pattern matching using the keyhole pattern template acquired from the keyhole pattern storage unit DB4, based on the approximate position of the key in the image DP2 (for example, the right side area or the left side area of the predetermined surface) specified from the data DP3 acquired by the prediction model 332. Then, the 2D coordinate detection unit 34 outputs the data DP4 of the detection result of the pattern matching and the detection accuracy accr1 of the pattern matching to the detection accuracy determination unit 35. The detection accuracy accr1 of the pattern matching is acquired, for example, by the following (1) and (2).
(1) The sum sum_error of the absolute values of the differences between the pixel value P(i,j) of each pixel of the pattern matching target image (image area) (the pixel at coordinates (i,j)) and the pixel value Pt(i,j) of the corresponding pixel of the keyhole pattern template (the pixel at coordinates (i,j)) is calculated over the entire image area targeted for pattern matching.
(2) From the sum sum_error calculated in (1), the detection accuracy accr1 of the pattern matching is acquired by processing corresponding to
accr1 = f1(sum_error)
f1(x): monotonically decreasing function of x (x≧0)
The function f1(x) is defined for x≧0 and is assumed to be a monotonically decreasing function of x (that is, a function that takes its maximum value at x=0).
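
Steps (1) and (2) can be sketched as follows. The specification only requires f1 to be monotonically decreasing for x≧0; the choice f1(x) = 1/(1+x) below is an assumption made for illustration, as are the function names and the representation of images as nested lists of pixel values.

```python
def sum_abs_error(region, template):
    """Step (1): sum of |P(i,j) - Pt(i,j)| over the whole matching region."""
    if len(region) != len(template) or any(
        len(row) != len(trow) for row, trow in zip(region, template)
    ):
        raise ValueError("region and template must have the same shape")
    return sum(
        abs(p - pt)
        for row, trow in zip(region, template)
        for p, pt in zip(row, trow)
    )


def f1(x):
    """Assumed f1: 1 / (1 + x), monotonically decreasing for x >= 0, maximum at x = 0."""
    return 1.0 / (1.0 + x)


def detection_accuracy(region, template):
    """Step (2): accr1 = f1(sum_error)."""
    return f1(sum_abs_error(region, template))


template = [[0, 255], [255, 0]]
exact = [[0, 255], [255, 0]]
noisy = [[1, 255], [253, 0]]
print(detection_accuracy(exact, template))  # 1.0
print(detection_accuracy(noisy, template))  # 0.25
```

With this f1, accr1 lies in (0, 1], so a threshold such as Th1 can be chosen on a fixed scale regardless of the template size, although sum_error itself still grows with the number of pixels compared.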

(Step S45):
In step S45, the detection accuracy determination unit 35 receives the pattern matching detection result data DP4 output from the 2D coordinate detection unit 34 and the pattern matching detection accuracy accr1, and determines, based on the input data, the accuracy of the pattern matching performed by the 2D coordinate detection unit 34. The accuracy of the pattern matching is determined, for example, by comparing the detection accuracy accr1 of the pattern matching with a predetermined threshold Th1.

Then, the detection accuracy determination unit 35 outputs the data Rst1 indicating the determination result to the shooting parameter adjustment unit 36. When the detection accuracy determination unit 35 determines that the accuracy of the pattern matching by the 2D coordinate detection unit 34 is sufficient (for example, when accr1>Th1) (Yes in step S45), the process proceeds to step S47. When it determines that the accuracy is not sufficient (No in step S45), the process proceeds to step S46.

(Step S46):
Since the accuracy detection result data Rst1 indicates that the accuracy is not sufficient, the shooting parameter adjustment unit 36 generates the control signal Ctl1 for changing a shooting parameter (for example, the focal length) of the camera C1 and outputs it to the camera C1. This makes it possible, for example, to zoom in on the area where the keyhole exists. For example, as shown in FIG. 30, the focal length of the camera C1 is adjusted so that the region R2 is enlarged, and the zoom image in the right diagram of FIG. 30 is acquired. Since the details of the keyhole can be recognized in the zoom image shown on the right side of FIG. 30, the detection accuracy of the keyhole can be improved by performing pattern matching using the zoom image.

After the processing of step S46, the processing is returned to step S44.
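
The loop formed by steps S44 to S46 can be sketched as below. This is an illustration under stated assumptions: capture and match are hypothetical callables standing in for the camera C1 and the 2D coordinate detection unit 34, the candidate focal lengths are supplied as a finite list so that the loop terminates, and the stub functions used in the example are invented purely to make the sketch runnable.

```python
def detect_with_zoom(capture, match, threshold, focal_lengths):
    """Try each focal length in turn until the matching accuracy exceeds the threshold.

    Returns (focal_length, result, accuracy) on success, or None if no
    candidate focal length gives sufficient accuracy.
    """
    for focal_length in focal_lengths:
        image = capture(focal_length)      # S46: change the shooting parameter
        result, accuracy = match(image)    # S44: pattern matching
        if accuracy > threshold:           # S45: compare accr1 with Th1
            return focal_length, result, accuracy
    return None


# Stub camera and matcher (hypothetical): accuracy improves as the view is zoomed in.
def fake_capture(focal_length):
    return {"focal_length": focal_length}


def fake_match(image):
    return (10, 20), min(1.0, image["focal_length"] / 100.0)


outcome = detect_with_zoom(fake_capture, fake_match, threshold=0.8,
                           focal_lengths=[35, 50, 85])
print(outcome)  # (85, (10, 20), 0.85)
```

Bounding the number of zoom attempts is a design choice of this sketch; the flow described above simply returns to step S44 after each adjustment.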

(Steps S47, S48):
When it is determined in step S45 that the detection accuracy is sufficient, the 3D coordinate estimation unit 38 acquires the data DP5 output from the detection accuracy determination unit 35 (data including information about the key pattern for which the predetermined accuracy was ensured by the pattern matching and data on the coordinate position of the key on the image), the data Info_3D_extracted_img (data for specifying the three-dimensional coordinates of the region corresponding to the extracted image), the data Info_3D (data for specifying the imaged three-dimensional space), the shooting parameter Param_cam, and the key parameter data Prm_key (step S47).

The 3D coordinate estimation unit 38 acquires the three-dimensional distance from the camera C1 to the target object from (1) the focal length of the camera C1 and (2) the ratio of the image region corresponding to the target object to the entire image region in the image (captured image DPin) captured by the camera C1 at that focal length. Since the size of the target object is known and the focal length of the camera C1 at the time the captured image DPin was acquired is known, the three-dimensional distance from the camera C1 to the target object can be obtained once the ratio of the captured image DPin occupied by the target object is known.
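
One way to carry out this calculation is sketched below, assuming an ideal pinhole camera, a focal length expressed in pixels, and a target face of known width and height that is parallel to the image plane. These assumptions, and the function name distance_from_area_ratio, are not taken from the specification. Under them the face projects to an area of (f·W/Z)·(f·H/Z) pixels, so Z follows from the area ratio.

```python
import math


def distance_from_area_ratio(focal_px, obj_w, obj_h, img_w_px, img_h_px, area_ratio):
    """Estimate camera-to-object distance Z from the fraction of the image the object fills.

    focal_px       : focal length in pixels (assumed known from the shooting parameter)
    obj_w, obj_h   : real width and height of the visible face (same unit as the result)
    img_w_px/h_px  : size of the captured image in pixels
    area_ratio     : object pixel area divided by total image pixel area
    """
    if not 0.0 < area_ratio <= 1.0:
        raise ValueError("area_ratio must be in (0, 1]")
    obj_area_px = area_ratio * img_w_px * img_h_px
    # (f * W / Z) * (f * H / Z) = obj_area_px  =>  Z = f * sqrt(W * H / obj_area_px)
    return focal_px * math.sqrt(obj_w * obj_h / obj_area_px)


# A 0.2 m x 0.1 m face filling 1 % of a 1000 x 500 pixel image, focal length 1000 px.
z = distance_from_area_ratio(1000.0, 0.2, 0.1, 1000, 500, 0.01)
print(z)
```

If the face is tilted relative to the image plane, its projected area shrinks and this estimate becomes too large, so a posture estimate would be needed to correct it.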

Then, the 3D coordinate estimation unit 38 estimates the three-dimensional coordinates of the key on the surface of the target object shown in the image DP1 based on the acquired data (step S48). That is, since the three-dimensional coordinate data of the target object, the position of the key on the surface of the target object, the pattern of the key, and the shape of the key are known from the data acquired above, the position of the key in the three-dimensional space can be estimated. The data thus estimated is acquired as the data DPout.

As described above, in the inference processing device 300, the model learned by the learning processing device 200 (prediction model (learned model) 332) recognizes the key type of the target object and the approximate position of the key, and the exact position of the key is then acquired by pattern matching with the key pattern. By processing the accurate key position thus obtained (the position on the extracted image) using the key pattern data and the three-dimensional coordinate data of the imaging space, the position of the key in the three-dimensional space can be estimated with high accuracy.

Furthermore, the inference processing device 300 can determine the accuracy of the pattern matching, and if the accuracy is insufficient, the zoom processing of the camera C1 can be performed to improve the accuracy of the pattern matching. As a result, the inference processing device 300 can perform highly accurate key position inference processing.

In the above description, for convenience of explanation, the learning processing device 200 and the inference processing device 300 are described as separate devices, but the invention is not limited to this. For example, the learning processing device 200 and the inference processing device 300 may be a single device provided with a learning processing mode and an inference processing mode, and the processing may be performed by that single device according to the mode. In this case, the learning model 202 and the prediction model (learned model) 332 may be shared (that is, the learning process may be performed on a single model and, once the optimum parameter has been acquired, the learned model may be obtained by fixing the parameter of that model to the optimum parameter).

[Other Embodiments]
A learning data generation system and a learning data generation device may be configured by combining the above embodiments and modifications.

In the above-described embodiment and modified example, the description has been made on the assumption that the coordinates are set by Cartesian coordinates, but the present invention is not limited to this, and another coordinate system such as polar coordinates may be used.

The shape of an object (CG object) created by CG by the CG processing unit may be a shape other than a substantially rectangular parallelepiped.

Note that the present invention may be applied to the case where the detection target has a substantially rectangular parallelepiped shape and is, for example, an object one surface of which can be identified (for example, a cash box in which a keyhole is always provided on one surface). This makes it possible, for example, to efficiently acquire a learned model that realizes, with high accuracy, the object detection process and the posture detection process used for setting an object such as a cash box at a predetermined position of a device in a predetermined posture.

Further, in the learning data generation system and the learning data generation device described in the above embodiments, each block may be individually made into one chip by a semiconductor device such as an LSI, or the blocks may be made into one chip so as to include some or all of them.

Note that although the term LSI is used here, it may also be called IC, system LSI, super LSI, or ultra LSI depending on the degree of integration.

Also, the method of circuit integration is not limited to LSI, and it may be realized by a dedicated circuit or a general-purpose processor. A field programmable gate array (FPGA) that can be programmed after the LSI is manufactured, or a reconfigurable processor in which the connections and settings of circuit cells inside the LSI can be reconfigured, may be used.

Also, some or all of the processing of each functional block of each of the above embodiments may be realized by a program. Then, a part or all of the processing of each functional block of each of the above-described embodiments is performed by a central processing unit (CPU) in a computer. A program for performing each processing is stored in a storage device such as a hard disk or a ROM, and is read out and executed in the ROM or the RAM.

Also, each process of the above embodiment may be realized by hardware, or may be realized by software (including a case where it is realized together with an OS (operating system), middleware, or a predetermined library). Further, it may be realized by mixed processing of software and hardware.

For example, when each functional unit of the above-described embodiments (including modified examples) is implemented by software, each functional unit may be realized by software processing using the hardware configuration shown in FIG. 31 (for example, a hardware configuration in which a CPU, a ROM, a RAM, an input unit, an output unit, and the like are connected by a bus Bus).

When each functional unit of the above-described embodiments is realized by software, the software may be realized using a single computer having the hardware configuration shown in FIG. 31, or may be realized by distributed processing using a plurality of computers.

Further, the execution order of the processing methods in the above embodiments is not necessarily limited to the description of the above embodiments, and the execution order can be changed without departing from the gist of the invention.

A computer program that causes a computer to execute the above-described method and a computer-readable recording medium on which the program is recorded are included in the scope of the present invention. Here, examples of the computer-readable recording medium include a flexible disk, a hard disk, a CD-ROM, an MO, a DVD, a DVD-ROM, a DVD-RAM, a Blu-ray (registered trademark) disc, a next-generation optical disk, and a semiconductor memory.

The computer program is not limited to the one recorded on the recording medium, and may be transmitted via an electric communication line, a wireless or wired communication line, a network typified by the Internet, or the like.

The specific configuration of the present invention is not limited to the above-described embodiment, and various changes and modifications can be made without departing from the spirit of the invention.

Sys1 learning inference processing system
1000, 1000, 3000 learning data generation system
100, 100A, 100B learning data generation device
200 learning processing device
300 inference processing device
1, 1A background image data acquisition unit
2, 2A, 2B CG processing unit
3, 3A rendering processing unit
4, 4A, 4B learning data generation unit
5 manual bounding box information input unit

Claims

A learning data generation method comprising:
a background image acquisition step of acquiring a background image obtained by imaging a predetermined three-dimensional space; and
a learning image data acquisition step of acquiring CG object generation data, which is data for computer graphics processing including at least one of a shape and a texture of an object, and acquiring, as learning image data, a rendered image obtained by combining a CG object generated based on the acquired CG object generation data with the background image so that the CG object is arranged at a predetermined coordinate position in the three-dimensional space that is the imaging target of the background image.
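
As a non-limiting illustration of combining a CG object with a background image at a known three-dimensional position, the following minimal sketch projects a 3D point into the image with a pinhole camera model and pastes a pre-rendered CG sprite at the projected pixel. The `Camera` parameters, the use of a pre-rendered sprite with a binary mask in place of a full renderer, and all function names are assumptions made for this example and are not taken from the embodiments.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # RGB
Image = List[List[Pixel]]      # row-major: image[y][x]
Mask = List[List[bool]]        # True where the CG object covers the sprite


@dataclass
class Camera:
    """Hypothetical pinhole intrinsics of the camera that imaged the background."""
    fx: float  # focal length in pixels (horizontal)
    fy: float  # focal length in pixels (vertical)
    cx: float  # principal point x
    cy: float  # principal point y


def project(cam: Camera, point: Tuple[float, float, float]) -> Tuple[int, int]:
    """Project a camera-frame 3D point (z > 0) to integer pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return round(cam.fx * x / z + cam.cx), round(cam.fy * y / z + cam.cy)


def composite(background: Image, sprite: Image, mask: Mask,
              center: Tuple[int, int]) -> Image:
    """Return a copy of the background with the sprite pasted around `center`."""
    out = [row[:] for row in background]
    h, w = len(sprite), len(sprite[0])
    u0, v0 = center[0] - w // 2, center[1] - h // 2
    for j in range(h):
        for i in range(w):
            u, v = u0 + i, v0 + j
            inside = 0 <= v < len(out) and 0 <= u < len(out[0])
            if mask[j][i] and inside:
                out[v][u] = sprite[j][i]
    return out


# Example: an 8x6 grey background and a 2x2 red CG sprite.
GREY, RED = (128, 128, 128), (255, 0, 0)
background = [[GREY] * 8 for _ in range(6)]
sprite = [[RED, RED], [RED, RED]]
mask = [[True, True], [True, True]]
cam = Camera(fx=10.0, fy=10.0, cx=4.0, cy=3.0)
center = project(cam, (0.5, 0.0, 5.0))   # known 3D position of the CG object
training_image = composite(background, sprite, mask, center)
```

Because the 3D position is chosen by the generator rather than measured, the pixel location of the CG object in `training_image` is known exactly, which is what makes the resulting labels accurate.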
The learning data generation method according to claim 1, further comprising a learning position label acquisition step of setting, from the learning image data, a two-dimensional bounding area that is an area surrounding the CG object on the rendered image, and acquiring coordinate information of the two-dimensional bounding area as a learning position label.
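
As a non-limiting illustration, a two-dimensional bounding area can be derived directly from the pixels that the CG object occupies in the rendered image. The sketch below assumes the renderer provides a boolean object mask the same size as the image; the mask representation and the `(x_min, y_min, x_max, y_max)` label layout are assumptions for this example.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def bounding_box(mask: List[List[bool]]) -> Optional[Box]:
    """Return the tightest axis-aligned box around True pixels, or None if empty."""
    xs = [x for row in mask for x, on in enumerate(row) if on]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


# Example: a 5x4 mask with an L-shaped object.
object_mask = [
    [False, False, False, False, False],
    [False, True,  True,  False, False],
    [False, False, True,  False, False],
    [False, False, False, False, False],
]
position_label = bounding_box(object_mask)
```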
The learning data generation method according to claim 1, further comprising:
an attitude detection image data acquisition step of acquiring, as attitude detection image data, a cropped image obtained by extracting, from the learning image data, an image region surrounding the CG object on the rendered image; and
an attitude detection learning data acquisition step of acquiring, as attitude detection learning data, data in which information on the attitude of the CG object included in the attitude detection image data is associated with the attitude detection image data.
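
As a non-limiting illustration, the cropped image and the attitude information can be paired in a single record. In the sketch below, the attitude is represented as three Euler angles in degrees and the image as a grid of integer pixel values; both representations, and the names `PoseSample` and `make_pose_sample`, are assumptions for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


@dataclass
class PoseSample:
    """One attitude-detection training record: cropped image plus attitude."""
    crop: List[List[int]]
    attitude: Tuple[float, float, float]  # assumed: roll, pitch, yaw in degrees


def crop_image(image: List[List[int]], box: Box) -> List[List[int]]:
    """Extract the region covered by an inclusive bounding box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1 + 1] for row in image[y0:y1 + 1]]


def make_pose_sample(image: List[List[int]], box: Box,
                     attitude: Tuple[float, float, float]) -> PoseSample:
    """Associate the crop around the CG object with the attitude used to render it."""
    return PoseSample(crop=crop_image(image, box), attitude=attitude)


# Example: a 4x4 image whose pixel value encodes its position as 10*y + x.
image = [[10 * y + x for x in range(4)] for y in range(4)]
sample = make_pose_sample(image, (1, 1, 2, 2), (0.0, 15.0, 90.0))
```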
The learning data generation method according to claim 1, wherein, when the background image includes an actual processing target object, the learning image data acquisition step generates the rendered image such that the CG object is arranged in an image area other than the image area including the processing target object.
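
As a non-limiting illustration, one way to keep the CG object out of the image area occupied by an actual object is rejection sampling: candidate positions are drawn at random and discarded if they intersect an occupied region. The sketch below assumes that the occupied regions are available as bounding boxes (for example, boxes entered manually); the sampling strategy and function names are assumptions for this example, not the method of the embodiments.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def overlaps(a: Box, b: Box) -> bool:
    """True if two inclusive axis-aligned boxes share at least one pixel."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def choose_placement(image_size: Tuple[int, int], object_size: Tuple[int, int],
                     occupied: List[Box], rng: random.Random,
                     max_tries: int = 1000) -> Optional[Box]:
    """Pick a box for the CG object that avoids every occupied box, or None."""
    width, height = image_size
    w, h = object_size
    if w > width or h > height:
        return None
    for _ in range(max_tries):
        x0 = rng.randint(0, width - w)    # randint bounds are inclusive
        y0 = rng.randint(0, height - h)
        candidate = (x0, y0, x0 + w - 1, y0 + h - 1)
        if not any(overlaps(candidate, box) for box in occupied):
            return candidate
    return None


# Example: the real object occupies the left half of a 100x80 image.
real_object = (0, 0, 49, 79)
placement = choose_placement((100, 80), (20, 20), [real_object], random.Random(0))
```

Returning `None` when no free position is found lets the caller skip that background image instead of producing a training image with an unlabeled overlap.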
The learning data generation method according to claim 1, wherein the background image is an image including a first object, and the CG object is combined with the background image such that at least a portion of the CG object is arranged on the surface of the first object.
The learning data generation method according to claim 1, wherein the background image acquisition step acquires a first background image by combining an image including a first object with the background image, and the CG object is combined with the first background image such that at least a portion of the CG object is arranged on the surface of the first object.
The learning data generation method according to claim 5, wherein the CG object has a shape that forms a keyhole in the first object.
A learning data generation device comprising:
a background image data acquisition unit that acquires a background image obtained by imaging a predetermined three-dimensional space; and
a learning image data acquisition unit that acquires CG object generation data, which is data for computer graphics processing including at least one of a shape and a texture of an object, and acquires, as learning image data, a rendered image obtained by combining a CG object generated based on the acquired CG object generation data with the background image so that the CG object is arranged at a predetermined coordinate position in the three-dimensional space that is the imaging target of the background image.
An inference processing method comprising:
a learned model acquisition step of acquiring a learned model by executing a learning process using the learning data acquired by the learning data generation method according to claim 5; and
a prediction processing step of inputting an image including a predetermined shape arranged on the surface of the first object, executing prediction processing with the learned model, and thereby outputting data for specifying the position of the predetermined shape.
The inference processing method according to claim 9, further comprising:
a detection accuracy determination step of determining the detection accuracy of the data for specifying the position of the predetermined shape; and
a photographing parameter adjusting step of adjusting a photographing parameter of an imaging device that captures the image including the predetermined shape arranged on the surface of the first object,
wherein, when the detection accuracy of the data for specifying the position of the predetermined shape is lower than a predetermined threshold value, the prediction processing step executes the prediction processing after the photographing parameter adjusting step changes the photographing parameter of the imaging device.
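
As a non-limiting illustration, re-running the prediction after changing a photographing parameter can be written as a simple loop over candidate parameter values. In the sketch below, the detection accuracy is treated as a confidence score returned together with the predicted position, and `capture` and `predict` are placeholder callables standing in for the imaging device and the learned model; these interfaces are assumptions for this example.

```python
from typing import Callable, Iterable, Optional, Tuple

Position = Tuple[int, int]


def detect_with_retry(
    capture: Callable[[float], object],
    predict: Callable[[object], Tuple[Position, float]],
    parameter_candidates: Iterable[float],
    threshold: float,
) -> Optional[Tuple[Position, float, float]]:
    """Try each photographing parameter until the confidence reaches the threshold.

    Returns (position, confidence, parameter) for the first acceptable result,
    or None if every candidate stays below the threshold.
    """
    for parameter in parameter_candidates:
        image = capture(parameter)              # re-image with the new parameter
        position, confidence = predict(image)   # prediction by the learned model
        if confidence >= threshold:
            return position, confidence, parameter
    return None


# Stand-ins for the imaging device and learned model (purely illustrative):
# the "image" is just the exposure value, and confidence equals that value.
def fake_capture(exposure: float) -> float:
    return exposure


def fake_predict(image: object) -> Tuple[Position, float]:
    return (10, 20), float(image)


result = detect_with_retry(fake_capture, fake_predict, [0.2, 0.4, 0.9], 0.8)
```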