WO2023100282A1 - Data generation system, model generation system, estimation system, trained model production method, robot control system, data generation method, and data generation program


Info

Publication number
WO2023100282A1
Authority
WO
WIPO (PCT)
Prior art keywords
region, data generation, unit, interest, robot
Prior art date
Application number
PCT/JP2021/044058
Other languages
French (fr)
Japanese (ja)
Inventor
光司 曽我部
次郎 村岡
Original Assignee
株式会社安川電機
Priority date
Filing date
Publication date
Application filed by 株式会社安川電機
Priority to PCT/JP2021/044058
Publication of WO2023100282A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis

Description

  • One aspect of the present disclosure relates to a data generation system, a model generation system, an estimation system, a trained model manufacturing method, a robot control system, a data generation method, and a data generation program.
  • Patent Document 1 describes a fraud presumption system that includes item information acquisition means for acquiring item information about an item, mark specifying means for specifying a mark of the item based on the item information, classification specifying means for specifying an item classification based on the item information, and inference means for presuming fraud for the item based on the specified mark and classification.
  • a data generation system according to one aspect of the present disclosure includes a detection unit that detects the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image, a specifying unit that specifies, from the input image, a region focused on by the machine learning model in the detection as a region of interest, and an annotation unit that associates an annotation corresponding to the detected object with the region of interest.
  • a data generation method according to one aspect of the present disclosure is executed by a data generation system including at least one processor.
  • This data generation method includes the steps of detecting the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image, specifying, from the input image, a region focused on by the machine learning model in the detection as a region of interest, and associating an annotation corresponding to the detected object with the region of interest.
  • a data generation program according to one aspect of the present disclosure causes a computer to execute the steps of detecting the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image, specifying, from the input image, a region focused on by the machine learning model in the detection as a region of interest, and associating an annotation corresponding to the detected object with the region of interest.
  • FIG. 3 is a flowchart showing an example of processing in the object detection system. FIG. 4 is a flowchart showing an example of annotation processing. FIG. 5 is a diagram showing an example of identification of a region of interest. FIG. 6 is a diagram showing an example of the functional configuration of a robot system. FIG. 7 is a flowchart showing an example of processing in the robot system. FIG. 8 is a diagram showing an example of robot control.
  • the object detection system 1 is a computer system that generates a trained model for detecting the position of an object from an image and uses this trained model to perform the detection.
  • a trained model is a computational model that detects a position where an object appears in an image as the position of the object.
  • a trained model is generated in advance by machine learning. Machine learning refers to a method of autonomously discovering laws or rules by repeatedly learning based on given information.
  • a trained model is built using algorithms and data structures.
  • the trained model is built by a neural network such as a convolutional neural network (CNN). Generating a trained model corresponds to the learning phase, and using the trained model corresponds to the operation phase. Therefore, the object detection system 1 performs both a learning phase and an operational phase.
  • the object is any tangible object and is set according to the purpose of use of the object detection system 1. For example, if the object detection system 1 is used for automatic harvesting of crops by a robot, the object is the crop. As another example, if the object detection system 1 is used for automatic boxing of items by a robot, the object is the item.
  • FIG. 1 is a diagram showing an example of the functional configuration of an object detection system 1.
  • object detection system 1 comprises data generation system 10 , model generation system 20 and estimation system 30 .
  • Data generation system 10, model generation system 20, and estimation system 30 are examples of a data generation system, a model generation system, and an estimation system, respectively, according to the present disclosure.
  • the data generation system 10 and model generation system 20 correspond to the learning phase
  • the estimation system 30 corresponds to the operation phase.
  • in the learning phase, the model generation system 20 generates the trained model 42 used by the estimation system 30. Generating the trained model 42 requires machine learning using teacher data that includes a plurality of annotated images. Machine learning generally requires a large amount of teacher data, so a large number of images must be annotated.
  • An annotation is information (metadata) related to an image and indicates, for example, a class value indicating the type of object and the position of the object in the image. Class values can be expressed in any form, such as numbers or text. The position of the object may be indicated by a rectangular bounding box set corresponding to that position. Conventionally, annotation is performed manually, which is very costly and time-consuming, and variations in annotations between workers can also occur. In the object detection system 1, the data generation system 10 executes the annotation to automatically generate at least part of the teacher data, so objects appearing in images can be annotated efficiently.
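For illustration only (not part of the patent text), an annotation of the kind described above can be represented as a small record that pairs a class value with a rectangular bounding box; the field names and box convention below are assumptions chosen for readability.

```python
# Hypothetical annotation record: a class value plus a rectangular bounding box.
# Field names and the (x_min, y_min, x_max, y_max) convention are assumptions.
annotation = {
    "class_value": 1,             # type of the detected object (e.g., 1 = "product Pa")
    "class_name": "product Pa",
    "bbox": [120, 64, 210, 175],  # circumscribing rectangle in pixel coordinates
}

# A second training image is then the input image together with its annotations.
second_training_image = {
    "image_path": "input_0001.png",
    "annotations": [annotation],
}
```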
  • the data generation system 10 uses a machine learning model 41 to annotate images.
  • the machine learning model 41 is a computational model for detecting the presence or absence of an object based on an image. In the present disclosure, detecting the presence or absence of an object may include processing for identifying the type of object, that is, the class value.
  • Machine learning model 41 processes the image to detect whether or not the object is visible in the image.
  • the machine learning model 41 is generated in advance by machine learning, so it can be said that this is also a trained model.
  • machine learning model 41 is constructed by a neural network such as a convolutional neural network (CNN).
  • in the present disclosure, the computational model used for annotating images in the data generation system 10, which corresponds to part of the learning phase, is referred to as the "machine learning model", and the computational model used by the estimation system 30, which corresponds to the operation phase, to detect the position of an object in an image is referred to as the "trained model".
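As a rough, hedged sketch of what a CNN-based presence/absence model such as the machine learning model 41 could look like, the following PyTorch module outputs per-class presence scores. The architecture is an assumption for illustration; the patent does not specify one.

```python
import torch
import torch.nn as nn

class PresenceClassifier(nn.Module):
    """Illustrative stand-in for the machine learning model 41: a small CNN that
    outputs class scores indicating whether the object appears in the image."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)    # Global Average Pooling (GAP)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fmap = self.features(x)               # feature maps later reused by CAM/Grad-CAM
        pooled = self.gap(fmap).flatten(1)
        return self.classifier(pooled)        # presence/absence class scores
```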
  • the object detection system 1 can access a first image database 51 and a second image database 52 . These databases may be provided outside the object detection system 1 or may be part of the object detection system 1 .
  • the first image database 51 is a device that stores a plurality of first training images with labels indicating the presence or absence of objects as first teacher data used to generate the machine learning model 41 .
  • a label is information (metadata) associated with an image, and indicates, for example, a class value indicating whether or not an object exists in the image.
  • the second image database 52 is a device that stores a plurality of second training images annotated with annotations corresponding to objects as second teacher data used to generate the trained model 42 .
  • Labels and annotations have in common that both are used as ground truth in machine learning. In the present disclosure, the metadata attached to the first training images is referred to as "labels", and the metadata attached to the second training images is referred to as "annotations".
  • the data generation system 10 includes a display control unit 11, a labeling unit 12, a preparation unit 13, a detection unit 14, an identification unit 15, and an annotation unit 16 as functional modules.
  • the display control unit 11 is a functional module that displays a user interface related to labels or annotations.
  • the labeling unit 12 is a functional module that generates a first training image by adding a label input via a user interface to a given image.
  • the labeling unit 12 stores the first training images in the first image database 51 .
  • the preparation unit 13 is a functional module that executes machine learning based on the labeled first training image to generate a machine learning model 41 .
  • the detection unit 14 is a functional module that uses the machine learning model 41 to detect the presence or absence of an object in the input image.
  • the detection unit 14 may detect whether or not a specific type of object exists in the input image. Alternatively, the detection unit 14 may detect whether or not each of a plurality of types of target objects exists in the input image.
  • the detection unit 14 may specify each type of one or more types of objects, that is, each class value.
  • the specifying unit 15 is a functional module that specifies, from the input image, a region focused on by the machine learning model 41 in its detection as a region of interest.
  • the annotation unit 16 is a functional module that associates the annotation corresponding to the detected object with the region of interest.
  • the annotation unit 16 stores the input image whose annotation is associated with the region of interest, that is, the input image to which the annotation is added, in the second image database 52 as a second training image.
  • the input image is an image processed by the machine learning model 41, and is treated as a second training image by annotating it.
  • the model generation system 20 includes a learning unit 21.
  • the learning unit 21 is a functional module that acquires second teacher data including second training images generated by the data generation system 10 and generates a trained model 42 based on this second teacher data. Therefore, the learning unit 21 also functions as an acquisition unit.
  • model generation system 20 may be constructed as a computer system including data generation system 10 and learning unit 21 .
  • the estimation system 30 includes an estimation unit 31 .
  • the estimation unit 31 is a functional module that inputs a target image to the trained model 42 and detects at least the position of the target object from the target image.
  • the estimation unit 31 may detect the position of each of two or more types of target objects from one target image.
  • a target image is an image to be processed by the trained model 42 .
  • No metadata is associated with the target image, and the trained model 42 at least detects the position of the target based on the pixel information of the target image.
  • the estimator 31 may further estimate a class value for each detected object.
  • the estimation system 30 may be constructed as a computer system that includes the model generation system 20 that may include the data generation system 10 and the estimation unit 31 .
  • the object detection system 1 can be realized by any kind of computer.
  • the computer may be a general-purpose computer such as a personal computer or a server for business use, or may be incorporated in a dedicated device that executes specific processing.
  • the object detection system 1 may be implemented by one computer, or may be implemented by a distributed system having a plurality of computers.
  • Each of the data generation system 10, the model generation system 20, and the estimation system 30 may be implemented by one computer, or may be implemented by a distributed system of multiple computers.
  • one computer may function as at least two of data generation system 10 , model generation system 20 and estimation system 30 .
  • FIG. 2 is a diagram showing an example of the hardware configuration of the computer 100 used in the object detection system 1.
  • computer 100 comprises main body 110 , monitor 120 and input device 130 .
  • the main body 110 is a device that executes the main functions of the computer.
  • the main body 110 has a circuit 160 which has at least one processor 161 , a memory 162 , a storage 163 , an input/output port 164 and a communication port 165 .
  • Storage 163 records programs for configuring each functional module of main body 110 .
  • the storage 163 is a computer-readable recording medium such as a hard disk, nonvolatile semiconductor memory, magnetic disk, or optical disk.
  • the memory 162 temporarily stores programs loaded from the storage 163, calculation results of the processor 161, and the like.
  • the processor 161 configures each functional module by executing a program in cooperation with the memory 162 .
  • the input/output port 164 inputs and outputs electrical signals to/from the monitor 120 or the input device 130 according to instructions from the processor 161 .
  • the input/output port 164 may input/output electrical signals to/from other devices.
  • Communication port 165 performs data communication with other devices via communication network N according to instructions from processor 161 .
  • the monitor 120 is a device for displaying information output from the main body 110 .
  • the monitor 120 may be of any type as long as it can display graphics, and a specific example thereof is a liquid crystal panel.
  • the input device 130 is a device for inputting information to the main body 110.
  • the input device 130 may be of any type as long as desired information can be input, and specific examples thereof include operation interfaces such as a keypad, mouse, and operation controller.
  • the monitor 120 and the input device 130 may be integrated as a touch panel.
  • the main body 110, the monitor 120, and the input device 130 may be integrated like a tablet computer.
  • Each functional module of the object detection system 1 is realized by loading an object detection program onto the processor 161 or memory 162 and causing the processor 161 to execute the object detection program.
  • the processor 161 operates the input/output port 164 or communication port 165 according to the object detection program to read and write data in the memory 162 or storage 163 .
  • the object detection program includes a data generation program, a model generation program, and an estimation program.
  • the data generation program includes code for realizing each functional module of data generation system 10 .
  • the model generation program includes code for realizing each functional module of the model generation system 20 .
  • the estimation program includes code for implementing each functional module of estimation system 30 .
  • the object detection program may be provided after being fixedly recorded on a non-transitory recording medium such as a CD-ROM, DVD-ROM, or semiconductor memory. Alternatively, the object detection program may be provided over a communication network as a data signal superimposed on a carrier wave.
  • the data generation program, model generation program, and estimation program may be provided separately. Alternatively, at least two of these three types of programs may be provided as one package.
  • FIG. 3 is a flowchart showing an example of processing in the object detection system 1 as a processing flow S1. That is, the object detection system 1 executes the processing flow S1.
  • in step S11, the data generation system 10 associates a given image with a label to generate a first training image.
  • this process is performed by the display control section 11 and the labeling section 12 .
  • the display control unit 11 displays on the monitor 120 a labeling user interface for labeling a given image.
  • the user inputs a label indicating whether or not the object is present in the image via the label user interface.
  • the labeling unit 12 assigns the label to the image to generate the first training image, and stores the first training image in the first image database 51 as at least part of the first teacher data.
  • in step S12, the data generation system 10 executes machine learning based on the first teacher data including at least one first training image to generate the machine learning model 41.
  • this processing is performed by the preparation unit 13 .
  • the preparation unit 13 accesses the first image database 51 and executes the following processing for each first training image. That is, the preparation unit 13 inputs the first training image to the first reference model, which is the calculation model on which the machine learning model 41 is based, and obtains the estimation result of the class value output from the first reference model.
  • the preparation unit 13 executes back propagation (error backpropagation method) based on the error between the estimation result and the label (correct answer) to update the parameter group in the first reference model.
  • the preparation unit 13 obtains a machine learning model 41 by repeating this learning until a given termination condition is satisfied.
  • the machine learning model 41 is a computational model estimated to be optimal for detecting the presence or absence of objects based on images. Note that the machine learning model 41 is not necessarily a "computational model that is optimal in reality".
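A minimal training-loop sketch for step S12 (generation of the machine learning model 41), assuming a PyTorch DataLoader that yields (image, label) batches built from the first training images. The optimizer, loss, and fixed epoch count stand in for the unspecified "given termination condition".

```python
import torch
import torch.nn as nn

def train_presence_model(model, loader, num_epochs: int = 10, lr: float = 1e-3):
    """Illustrative learning loop for the presence/absence model (machine learning model 41)."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(num_epochs):                # placeholder for the given termination condition
        for images, labels in loader:
            scores = model(images)             # estimation result (class scores)
            loss = criterion(scores, labels)   # error between estimate and label (correct answer)
            optimizer.zero_grad()
            loss.backward()                    # back propagation
            optimizer.step()                   # update the parameter group
    return model
```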
  • in step S13, the data generation system 10 generates a second training image by annotation using the machine learning model 41.
  • this processing is performed by the detection unit 14, the identification unit 15, and the annotation unit 16. The details of this processing will be described later.
  • in step S14, the model generation system 20 executes machine learning based on the second teacher data including at least one second training image to generate the trained model 42.
  • this processing is performed by the learning unit 21 .
  • the learning unit 21 accesses the second image database 52 and executes the following processing for each second training image. That is, the learning unit 21 inputs the second training image to the second reference model, which is the calculation model on which the trained model 42 is based, and obtains the estimation result of the object position output from the second reference model.
  • the estimation result may further indicate a class value indicating the type of object.
  • the learning unit 21 executes back propagation based on the error between the estimation result and the annotation (correct answer) to update the parameter group in the second reference model.
  • the learning unit 21 obtains a trained model 42 by repeating this learning until a given termination condition is satisfied.
  • trained model 42 is a computational model that is estimated to be optimal for locating objects based on images. Note that the trained model 42 is not necessarily the "actually optimal computational model".
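The patent does not name a detector architecture for the trained model 42; as one hedged example of the second learning step (S14), a torchvision Faster R-CNN can be fine-tuned on the annotated second training images. The loader is assumed to yield images and targets in the torchvision convention ({"boxes", "labels"}) derived from the annotations.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_and_train_detector(loader, num_classes: int, num_epochs: int = 10):
    """Illustrative stand-in for generating the trained model 42 from the second teacher data."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
    model.train()
    for _ in range(num_epochs):                    # placeholder termination condition
        for images, targets in loader:             # targets: [{"boxes": ..., "labels": ...}, ...]
            loss_dict = model(images, targets)     # position (and class) losses vs. the annotations
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()                        # back propagation
            optimizer.step()
    return model
```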
  • in step S15, the estimation system 30 performs estimation using the trained model 42.
  • this process is performed by the estimation unit 31 .
  • the estimation unit 31 inputs the target image to the trained model 42 and detects at least the position of the target object in the target image.
  • the estimation unit 31 outputs the detected position as an estimation result.
  • the estimating unit 31 may further detect class values, and thus the estimation result may further include class values.
  • the estimation unit 31 may superimpose a bounding box or the like indicating the position of the target object on the target image to generate an estimation result, and display the estimation result on the monitor 120 .
  • the estimation unit 31 may store the estimation result in a recording medium such as the storage 163 .
  • the estimation unit 31 may transmit the estimation result to another computer.
  • the estimation unit 31 may perform detection for each of the plurality of target images.
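A hedged sketch of the estimation step S15: run the trained model on a target image and superimpose bounding boxes on it. The model is assumed to follow the torchvision detector interface used in the sketch above; the drawing style (OpenCV rectangles and labels) is arbitrary.

```python
import cv2
import torch

def estimate_and_visualize(model, image_bgr, score_threshold: float = 0.5):
    """Detect object positions in a target image and draw bounding boxes on it."""
    tensor = torch.from_numpy(image_bgr[:, :, ::-1].copy()).permute(2, 0, 1).float() / 255.0
    model.eval()
    with torch.no_grad():
        result = model([tensor])[0]            # {"boxes", "labels", "scores"}
    for box, label, score in zip(result["boxes"], result["labels"], result["scores"]):
        if score < score_threshold:
            continue
        x1, y1, x2, y2 = (int(v) for v in box.tolist())
        cv2.rectangle(image_bgr, (x1, y1), (x2, y2), (0, 255, 0), 2)   # bounding box
        cv2.putText(image_bgr, f"class {int(label)}: {score:.2f}", (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return image_bgr, result
```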
  • FIG. 4 is a flowchart showing an example of annotation processing.
  • in step S131, the detection unit 14 acquires one input image.
  • the input image may be a still image, or may be one frame image forming a video.
  • the detector 14 may receive input images sent from a camera or other computer. Alternatively, the detection unit 14 may accept an input image input by a user, or read an input image from a given storage device based on user input.
  • in step S132, the detection unit 14 inputs the input image to the machine learning model 41 and detects the presence or absence of the object. That is, the detection unit 14 detects whether or not the object appears in the input image.
  • the detection unit 14 may use the machine learning model 41 to detect the presence or absence of each of multiple types of objects in the input image.
  • the detection unit 14 may specify a class value for each detected object.
  • in step S133, the specifying unit 15 executes a visualization method on the machine learning model 41 that has executed the detection, and calculates the degree of attention for each of the plurality of pixels forming the input image.
  • the degree of attention is an index indicating the degree to which the machine learning model 41 paid attention to a pixel.
  • the degree of attention can also be said to be an index indicating how much influence a pixel has on the decision by the machine learning model 41, or an index indicating the grounds for that decision.
  • a pixel with a higher degree of attention has a greater influence on the determination by the machine learning model 41.
  • the visualization method is executed based on values calculated in the machine learning model 41 that processed the input image, for example, calculated values corresponding to individual nodes and individual edges in the neural network.
  • the identification unit 15 uses Class Activation Mapping (CAM) as a visualization technique.
  • CAM is a technique for visualizing the grounds for judgment by a neural network based on a feature map and weights corresponding to edges from Global Average Pooling (GAP) to detected classes.
  • the identifying unit 15 may use Gradient-weighted CAM (Grad-CAM).
  • Grad-CAM is a method of substituting gradients during backpropagation for weights used in CAM calculations, which makes it possible to visualize the grounds for decisions in various types of neural networks.
  • the identification unit 15 may use Grad-CAM++, Score-CAM, Ablation-CAM, Eigen-CAM, or Integrated Grad-CAM.
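A minimal Grad-CAM sketch for step S133, assuming the small classifier sketched earlier (its last convolutional layer, e.g. model.features[-2], would be the target layer). It computes the gradient of the detected class score with respect to the feature maps, averages those gradients into channel weights, and forms a normalized per-pixel degree of attention; this is illustrative, not the patent's exact computation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image: torch.Tensor, target_layer, class_index: int = None):
    """Per-pixel degree of attention for the detection made by the model (Grad-CAM sketch)."""
    feature_maps, gradients = [], []

    def forward_hook(_module, _inputs, output):
        feature_maps.append(output)

    def backward_hook(_module, _grad_in, grad_out):
        gradients.append(grad_out[0])

    h1 = target_layer.register_forward_hook(forward_hook)
    h2 = target_layer.register_full_backward_hook(backward_hook)
    try:
        scores = model(image.unsqueeze(0))               # (1, num_classes)
        if class_index is None:
            class_index = scores.argmax(dim=1).item()    # detected class
        model.zero_grad()
        scores[0, class_index].backward()                # gradients during backpropagation

        fmap = feature_maps[0]                           # (1, C, h, w) feature maps
        grad = gradients[0]                              # (1, C, h, w) gradients
        weights = grad.mean(dim=(2, 3), keepdim=True)    # GAP of gradients -> channel weights
        cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear", align_corners=False)
        cam = cam[0, 0]
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # degree of attention in [0, 1]
    finally:
        h1.remove()
        h2.remove()
```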
  • in step S134, the specifying unit 15 selects, from the plurality of pixels of the input image, one or more pixels whose degree of attention is equal to or greater than a given threshold value Ta as pixels of interest. That is, the specifying unit 15 selects pixels that have a relatively large influence on the determination by the machine learning model 41 as pixels of interest.
  • in step S135, the identifying unit 15 identifies a region of interest based on the set of one or more selected pixels of interest.
  • the identifying unit 15 identifies a dense area, which is an area where a plurality of pixels of interest are concentrated, as the attention area.
  • a dense area is a limited range in which a plurality of pixels of interest are gathered at a density equal to or higher than a given reference value. At least part of the dense area may be formed by two or more target pixels that are continuously present.
  • the specifying unit 15 may specify at least one dense region by clustering the pixels of interest, and specify the dense region as the region of interest.
  • the specifying unit 15 may calculate the area of each of one or more dense areas, and specify a dense area whose area is equal to or larger than a given threshold value Tb as the attention area.
  • the specifying unit 15 may calculate the area of the circumscribed shape of each of one or more dense areas, and specify the dense areas whose area is equal to or larger than a given threshold value Tc as the attention area.
  • the circumscribed shape may be a circumscribed rectangle or a circumscribed circle.
  • the identifying unit 15 identifies from the input image the area that has been noticed by the machine learning model 41 in the detection as the area of interest.
  • the specifying unit 15 can specify a plurality of attention areas from one input image.
  • the specifying unit 15 may specify a region of interest for each of one or more types of target objects that are detected among the plurality of types of target objects.
  • the identifying unit 15 may identify the attention area based on a plurality of degrees of attention corresponding to a plurality of pixels forming the input image.
  • the identification unit 15 may execute Grad-CAM on the machine learning model to identify the region of interest.
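The selection of pixels of interest and dense regions in steps S134 and S135 can be sketched as follows, with connected-component labeling standing in for the "clustering" mentioned above and the thresholds Ta and Tc as placeholders.

```python
import numpy as np
from scipy import ndimage

def identify_regions_of_interest(attention_map: np.ndarray,
                                 ta: float = 0.5,      # attention threshold Ta
                                 tc: float = 100.0):   # circumscribed-rectangle area threshold Tc
    """Illustrative identification of regions of interest from a per-pixel attention map."""
    pixels_of_interest = attention_map >= ta
    labeled, num_regions = ndimage.label(pixels_of_interest)   # dense regions of adjacent pixels

    regions = []
    for region_id in range(1, num_regions + 1):
        ys, xs = np.nonzero(labeled == region_id)
        x1, y1, x2, y2 = xs.min(), ys.min(), xs.max(), ys.max()
        circumscribed_area = (x2 - x1 + 1) * (y2 - y1 + 1)      # area of the circumscribed rectangle
        if circumscribed_area >= tc:
            regions.append({"bbox": (int(x1), int(y1), int(x2), int(y2)),
                            "num_pixels": int(len(xs))})
    return regions
```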
  • in step S136, the annotation unit 16 associates the annotation with the region of interest.
  • the annotation unit 16 associates a class value indicating the type of object and an annotation indicating the position of the object with the region of interest.
  • the annotation unit 16 associates annotations with each attention area.
  • the annotation unit 16 may associate different annotations for each type of object with one or more attention regions corresponding to one or more types of detected objects. “Annotations different for each type of object” is, for example, an annotation including a class value indicating the type of object.
  • the annotation unit 16 associates the annotation corresponding to the object detected by the machine learning model 41 with the region of interest.
  • the annotation unit 16 may associate an annotation including a graphic representation corresponding to the region of interest with the region of interest.
  • This graphical representation can be used to indicate the position of the object in the input image.
  • the annotation unit 16 may generate a graphical representation rendered corresponding to the location of the region of interest.
  • This graphic representation is for example a bounding box.
  • the graphical representation may be a shape that encloses the entire region of interest, such as a circumscribing shape.
  • the graphical representation may have a shape that partially overlaps the area of interest, so that the area of interest may extend beyond the graphical representation.
  • the graphic representation may be a shape that surrounds the entire object located in the region of interest, or a shape that overlaps part of the object.
  • the annotation unit 16 may generate graphic representations that provide visual effects such as blinking, highlighting, and the like.
  • the annotation unit 16 may associate an annotation that indicates the position of an object and does not include a class value with the region of interest.
  • the annotation unit 16 may perform segmentation for setting annotations in units of pixels.
  • the annotation unit 16 may generate a graphic representation corresponding to the segment as an example of a graphic representation corresponding to the region of interest.
  • in step S137, the display control unit 11 accepts correction of the annotation.
  • the display control unit 11 displays on the monitor 120 a correction user interface for allowing the user to correct the annotation associated with the attention area by the annotation unit 16 .
  • the display control unit 11 displays the input image on which the annotation is superimposed on the correction user interface, and receives user input for correcting the annotation.
  • a user can use this modification user interface to modify the position or dimensions of a graphic representation such as a bounding shape, modify class values, add or delete annotations, and so on.
  • the annotation section 16 modifies the annotation based on the user's input. The user does not have to correct the annotation, and in this case the annotation unit 16 does not execute correction processing.
  • in step S138, the annotation unit 16 stores the processed input image, that is, the annotated input image, in the second image database 52 as a second training image.
  • in step S139, the detection unit 14 determines whether or not all input images have been processed. If there is an unprocessed input image (NO in step S139), the process returns to step S131. In step S131, the detection unit 14 acquires the next input image, and the processes of steps S132 to S138 are executed based on this input image. When all input images have been processed (YES in step S139), the data generation system 10 ends the process of step S13.
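Tying the annotation flow of steps S131 to S139 together, a hedged end-to-end sketch might look like the following; it reuses the hypothetical helpers sketched above (grad_cam, identify_regions_of_interest), assumes class 0 means "no object", and stubs out manual correction (S137) and database storage (S138).

```python
def annotate_input_images(model, target_layer, input_images, class_names):
    """Illustrative annotation loop producing second training images."""
    second_training_images = []
    for image_path, image_tensor in input_images:                          # step S131
        scores = model(image_tensor.unsqueeze(0))                          # step S132: presence detection
        class_index = int(scores.argmax(dim=1))
        if class_index == 0:                                               # assumption: class 0 = "no object"
            continue
        attention = grad_cam(model, image_tensor, target_layer, class_index)       # step S133
        regions = identify_regions_of_interest(attention.detach().numpy())         # steps S134-S135
        annotations = [{"class_value": class_index,                        # step S136
                        "class_name": class_names[class_index],
                        "bbox": region["bbox"]} for region in regions]
        second_training_images.append({"image_path": image_path,           # step S138 (storage stubbed out)
                                       "annotations": annotations})
    return second_training_images
```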
  • FIG. 5 is a diagram showing an example of identifying a region of interest from an input image.
  • data generation system 10 performs annotation on input image 200 .
  • An input image 200 is an image captured by a camera that captures the periphery of the end effector 3b of the robot, and shows three decorations 201-203.
  • Decorations 201 and 202 are products Pa, and decoration 203 is product Pb.
  • the detection unit 14 detects the product Pa as the object.
  • the detection unit 14 inputs the input image 200 to the machine learning model 41 and detects the decorations 201 and 202 as objects (step S132).
  • the specifying unit 15 executes a visualization method such as Grad-CAM on the machine learning model 41 that has processed the input image 200, and calculates the degree of attention for each pixel of the input image 200 (step S133).
  • Image 210 is a heat map that visualizes the degree of attention of each pixel. Looking at this image 210, it can be seen that a pixel group 211 located in the area of the decoration 201 and a pixel group 212 located in the area of the decoration 202 are highly noticeable.
  • the specifying unit 15 selects pixels whose degree of attention is equal to or higher than a given threshold as pixels of interest, and specifies a region of interest based on the set of pixels of interest (for example, a dense region) (steps S134 and S135).
  • Image 220 shows a dense region 221 obtained from pixel group 211 corresponding to decoration 201 and a dense region 222 obtained from pixel group 212 corresponding to decoration 202 .
  • the specifying unit 15 determines whether or not to specify the dense area as the attention area.
  • the annotation unit 16 associates the annotation with each attention area (step S136).
  • An image 230 shows that dense regions 221 and 222 are specified as regions of interest, an annotation 231 is associated with the dense region (region of interest) 221, and an annotation 232 is associated with the dense region (region of interest) 222.
  • annotations 231 and 232 related to product Pa are represented by circumscribing rectangles of dense areas (areas of interest).
  • the annotation unit 16 can modify at least one of the annotations 231, 232 based on user input (step S137).
  • Steps S13 and S14 are also an example of a trained model production method according to the present disclosure.
  • the production method is realized as follows. That is, the data generation system 10 uses the machine learning model 41 to detect the presence or absence of the object in the input image (step S132). The data generation system 10 specifies, from the input image, the region focused on by the machine learning model 41 in the detection as the region of interest (steps S133 to S135). The data generation system 10 associates the annotation corresponding to the detected object with the region of interest (step S136). The model generation system 20 generates the trained model 42 based on second teacher data including input images (that is, second training images) in which annotations are associated with regions of interest.
  • the method for producing a trained model according to the present disclosure includes executing the data generation method according to the present disclosure, and generating, based on teacher data including an input image in which an annotation is associated with a region of interest by the data generation method, a trained model for detecting at least the position of an object from an image.
  • an example robot control system is described below as a component of the robot system 2.
  • the robot system 2 is a mechanism for automating given work by causing a robot to execute a task, which is a series of processes for achieving a given purpose.
  • in a situation where the surrounding environment of the robot is unknown, the robot system 2 actively extracts an area in which the robot is expected to be able to perform a task, that is, an area in which it is expected to process an object, as a task area, and makes the robot approach the object (task area).
  • the robot system 2 uses the mechanism of the data generation system 10 that identifies the attention area from the input image.
  • the robot system 2 detects the presence or absence of an object in the input image, specifies, from the input image, a region focused on by the machine learning model in this detection as a region of interest, and selects a task area based on the region of interest.
  • the robot system 2 makes the robot approach the object while repeatedly extracting the task area.
  • the robot system 2 causes the robot to perform the task.
  • when the robot reaches the task area, it means that the robot has come close enough to the object to perform the task.
  • the task includes the step of bringing the robot 3 into contact with an object.
  • Examples of such tasks include grasping an object and pushing or pulling an object.
  • the robot system 2 extracts, as a task area, an area where the robot 3 can come into contact with the object in order to perform the task.
  • the robot system 2 uses active sensing, which is a technique for searching and collecting necessary information by actively changing sensor conditions. This technology allows the robot to recognize the target to be aimed at, even if the conditions regarding the object or surrounding environment change frequently or are impossible or difficult to model in advance. Active sensing is a technique for finding unknown targets, and thus differs from visual feedback, which positions a mechanical system toward a known target.
  • FIG. 6 is a diagram showing an example of the functional configuration of the robot system 2.
  • the robot system 2 comprises a robot control system 60 , one or more robots 3 , and one or more robot controllers 4 corresponding to the one or more robots 3 .
  • FIG. 6 shows one robot 3 and one robot controller 4 and shows a configuration in which one robot 3 is connected to one robot controller 4 .
  • a communication network that connects devices may be a wired network or a wireless network.
  • the communication network may comprise at least one of the Internet and an intranet. Alternatively, the communication network may simply be implemented by a single communication cable.
  • the robot control system 60 is a computer system for autonomously operating the robot 3 in at least some situations.
  • the robot control system 60 performs given operations to generate command signals for controlling the robot 3 .
  • the command signal includes data for controlling the robot 3, such as a path indicating the trajectory of the robot 3.
  • the trajectory of the robot 3 refers to the path of movement of the robot 3 or its components.
  • the trajectory of the robot 3 can be the trajectory of the tip.
  • the robot control system 60 transmits the generated command signal to the robot controller 4 .
  • the robot controller 4 is a device that operates the robot 3 according to command signals from the robot control system 60 .
  • the robot controller 4 calculates joint angle target values (angle target values of the joints of the robot 3) for matching the position and orientation of the tip portion with the target values indicated by the command signal, and controls the robot 3 according to the calculated joint angle target values.
  • the robot 3 is a device or machine that works on behalf of humans.
  • the robot 3 is a multi-axis serial link type vertical articulated robot.
  • the robot 3 includes a manipulator 3a and an end effector 3b, which is a tool attached to the tip of the manipulator 3a.
  • the robot 3 can perform various processes using its end effector 3b.
  • the robot 3 can freely change the position and posture of the end effector 3b within a given range.
  • the robot 3 may be a 6-axis vertical multi-joint robot or a 7-axis vertical multi-joint robot in which one redundant axis is added to the 6 axes.
  • the robot 3 operates under the control of the robot control system 60 to perform a given task.
  • the execution of the task by the robot 3 produces the result desired by the user of the robot system 2 .
  • a task is set to process some object, in which case the robot 3 processes the object.
  • Examples of tasks include "grab an object and place it on a conveyor”, “grab an object and attach it to a workpiece”, and "spray paint an object”.
  • the robot includes a camera 3c that captures the surroundings of the end effector 3b.
  • the coverage of the camera 3c may be set so as to capture at least part of the end effector 3b.
  • the camera 3c may be arranged on the manipulator 3a, for example attached near the tip of the manipulator 3a.
  • the camera 3c moves corresponding to the motion of the robot 3. This movement may include changes in at least one of the position and orientation of the camera 3c.
  • the camera 3 c may be provided at a different location from the robot 3 as long as it moves in response to the motion of the robot 3 .
  • the camera 3c may be attached to another robot, or may be movably provided on the ceiling, wall, or camera stand.
  • the robot control system 60 includes a display control unit 11, a labeling unit 12, a preparation unit 13, a detection unit 14, an identification unit 15, and a robot control unit 61 as functional modules.
  • Display control unit 11 , labeling unit 12 , preparation unit 13 , detection unit 14 , and identification unit 15 are the same as the functional modules shown in data generation system 10 .
  • the display control unit 11 displays a label user interface.
  • the labeling unit 12 generates a first training image by adding a label input through the label user interface to a given image, and stores the first training image in the first image database 51 .
  • the preparation unit 13 executes machine learning based on the labeled first training image to generate a machine learning model 41 .
  • the detection unit 14 uses the machine learning model 41 to detect the presence or absence of an object in the input image.
  • the specifying unit 15 specifies from the input image a region focused on by the machine learning model 41 in the detection as a region of interest.
  • the robot control unit 61 is a functional module that controls the robot 3 that processes an object based on its attention area.
  • the robot system 2 (robot control system 60) can access the first image database 51.
  • the first image database 51 may be provided outside the robot system 2 or the robot control system 60 or may be part of the robot system 2 or the robot control system 60 .
  • the first image database 51 is a device that stores a plurality of first training images with labels indicating the presence or absence of objects as first teacher data used to generate the machine learning model 41 .
  • the robot control system 60 may be implemented by any kind of computer, for example by the computer 100 shown in FIG. Each functional module of the robot control system 60 is implemented by loading a robot control program into the processor 161 or memory 162 and causing the processor 161 to execute the robot control program.
  • the processor 161 operates the input/output port 164 or the communication port 165 according to the robot control program to read and write data in the memory 162 or storage 163 .
  • the robot control program may be provided by a non-transitory recording medium or via a communication network.
  • FIG. 7 is a flowchart showing an example of processing in the robot system 2 (robot control system 60) as a processing flow S2. That is, the robot system 2 (robot control system 60) executes the processing flow S2.
  • the processing flow S2 is executed on the premise that the machine learning model 41 has already been prepared.
  • the machine learning model 41 generated by steps S11 and S12 of process flow S1 is used in process flow S2.
  • in step S21, the detection unit 14 acquires one input image.
  • the input image may be a still image, or may be one frame image forming a video.
  • the detection unit 14 may receive an input image sent from the camera 3c.
  • in step S22, the detection unit 14 inputs the input image to the machine learning model 41 and detects the presence or absence of the object. This process is the same as step S132.
  • in step S23, the identifying unit 15 executes the visualization method on the machine learning model 41 that has executed the detection, and calculates the degree of attention for each of the plurality of pixels forming the input image. This process is the same as step S133.
  • in step S24, the identifying unit 15 selects, from the plurality of pixels of the input image, one or more pixels whose degree of attention is equal to or greater than a given threshold value Ta as pixels of interest. This process is the same as step S134.
  • in step S25, the specifying unit 15 specifies a region of interest based on the set of one or more selected pixels of interest. This process is the same as step S135.
  • the identifying unit 15 identifies from the input image the area that has been noticed by the machine learning model 41 in the detection as the area of interest.
  • the specifying unit 15 can specify a plurality of attention areas from one input image.
  • the identifying unit 15 may identify the attention area based on a plurality of degrees of attention corresponding to a plurality of pixels forming the input image.
  • the identification unit 15 may execute Grad-CAM on the machine learning model to identify the region of interest.
  • in step S26, the robot control unit 61 selects a task area based on the region of interest. For example, the robot control unit 61 selects one of the one or more regions of interest as the task area. When a plurality of regions of interest are specified, the robot control section 61 may select the region of interest having the largest circumscribed-rectangle area, or may select the region of interest having the largest area. When the task area is selected in this manner, there is a high probability that the region of the object closest to the robot 3 within the coverage area will be selected as the task area.
  • in step S27, the robot control unit 61 determines whether the robot 3 has reached the task area. For example, the robot control section 61 may calculate the distance between the end effector 3b and the task area and execute the determination based on this distance. If the calculated distance is equal to or less than the given threshold value Td, the robot control unit 61 determines that the robot 3 has reached the task area. On the other hand, when the calculated distance is greater than the threshold Td, the robot control section 61 determines that the robot 3 has not reached the task area.
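A small sketch of steps S26 and S27 under the assumptions above: the region of interest with the largest circumscribed-rectangle area becomes the task area, and the robot is considered to have reached it when the end-effector distance falls to the threshold Td or below. The 3-D positions and Td value are placeholders.

```python
import numpy as np

def select_task_area(regions):
    """Step S26 (one of the selection rules): pick the region of interest whose
    circumscribed rectangle has the largest area as the task area."""
    def rect_area(region):
        x1, y1, x2, y2 = region["bbox"]
        return (x2 - x1 + 1) * (y2 - y1 + 1)
    return max(regions, key=rect_area) if regions else None

def has_reached_task_area(end_effector_pos, task_area_pos, td: float = 0.05):
    """Step S27: reached when the distance between end effector and task area is at most Td."""
    distance = np.linalg.norm(np.asarray(end_effector_pos) - np.asarray(task_area_pos))
    return distance <= td
```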
  • in step S28, the robot control unit 61 controls the robot 3 toward the task area as an example of controlling the robot 3 based on the region of interest. That is, the robot control unit 61 controls the robot 3 so that the robot 3 approaches the object.
  • the robot control unit 61 generates a path of the robot 3 for moving the end effector 3b from the current position to the task area by planning.
  • the robot control unit 61 may generate a path (trajectory) of the robot 3 by planning so that the distance to the task area is reduced and the task area appears in the center of the image of the camera 3c.
  • the robot control unit 61 outputs a command signal indicating the generated path to the robot controller 4, and the robot controller 4 controls the robot 3 according to the command signal. As a result, the robot 3 approaches the object along its path.
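As a toy stand-in for the planning in step S28 (not the patent's planner), the following computes a small camera-frame motion increment that advances toward the task area while nudging it toward the image center; a real system would plan a joint-space path and hand it to the robot controller 4.

```python
import numpy as np

def approach_step(task_area_bbox, image_size, step_gain: float = 0.001):
    """Illustrative motion increment: shrink the distance to the task area and
    drift the task area toward the center of the camera image."""
    x1, y1, x2, y2 = task_area_bbox
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0       # task-area center in the image
    width, height = image_size
    dx = step_gain * (cx - width / 2.0)             # lateral correction toward the image center
    dy = step_gain * (cy - height / 2.0)
    dz = 0.01                                       # small constant advance toward the object
    return np.array([dx, dy, dz])                   # motion increment in the camera frame
```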
  • after step S28, the process returns to step S21, and the robot control system 60 executes the processes from step S21 onward again.
  • the detection unit 14 acquires a new input image (step S21), and further detects the presence or absence of the object in the input image using the machine learning model 41 (step S22).
  • the detection unit 14 processes an input image captured after the robot 3 approaches the object as a new input image.
  • the specifying unit 15 specifies from the new input image, as a new attention area, the area that has been noticed by the machine learning model 41 in the detection (steps S23 to S25).
  • the robot control unit 61 further controls the robot 3 based on the new attention area (from step S26).
  • if the robot 3 has reached the task area (YES in step S27), the process proceeds to step S29.
  • in step S29, the robot control unit 61 causes the robot 3 to execute the task.
  • the robot control unit 61 generates a path for executing a task through planning, and outputs a command signal indicating the path to the robot controller 4 .
  • the robot controller 4 controls the robot 3 according to the command signal. As a result, the robot 3 executes the task.
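Putting processing flow S2 together, a hedged outer loop could look as follows; it reuses the hypothetical helpers sketched above, and camera.capture(), robot.end_effector_position(), robot.move_by(), robot.execute_task(), and estimate_task_area_position() are assumed interfaces, not APIs defined by the patent.

```python
def run_active_sensing_cycle(camera, model, target_layer, robot, max_iterations: int = 100):
    """Illustrative outer loop of processing flow S2 (approach the object, then execute the task)."""
    for _ in range(max_iterations):
        image_bgr, image_tensor = camera.capture()                          # step S21
        scores = model(image_tensor.unsqueeze(0))                           # step S22
        class_index = int(scores.argmax(dim=1))
        attention = grad_cam(model, image_tensor, target_layer, class_index)        # step S23
        regions = identify_regions_of_interest(attention.detach().numpy())          # steps S24-S25
        task_area = select_task_area(regions)                               # step S26
        if task_area is None:
            continue                                                        # nothing detected; try again
        # estimate_task_area_position() is an assumed helper that converts the
        # image-space region into a 3-D position (e.g., using depth data).
        if has_reached_task_area(robot.end_effector_position(),
                                 estimate_task_area_position(task_area)):   # step S27
            robot.execute_task(task_area)                                   # step S29
            break
        height, width = image_bgr.shape[:2]
        robot.move_by(approach_step(task_area["bbox"], (width, height)))    # step S28
```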
  • in step S30, the robot control unit 61 determines whether or not to end robot control.
  • the robot control section 61 may perform this determination based on any termination condition. For example, the robot control unit 61 may determine to end the robot control when the task has been executed a specified number of times, and may determine to continue the robot control when the number of task executions is less than the specified number of times. Alternatively, the robot control unit 61 may determine to end the robot control when an error occurs in the robot control, and determine to continue the robot control when the error does not occur.
  • if it is determined to end robot control (YES in step S30), the process proceeds to step S31.
  • in step S31, the robot control system 60 executes end processing.
  • the robot control unit 61 may return the robot 3 to its initial posture and position. Alternatively, the robot control unit 61 may notify the user by visual information or voice that all tasks have been completed.
  • if it is determined to continue robot control (NO in step S30), the process proceeds to step S32.
  • in step S32, the robot control unit 61 prepares for the next task. For example, the robot control unit 61 may return the robot 3 to its initial posture and position. Alternatively, the robot control unit 61 may notify the user by visual information or voice that the next task is to be executed.
  • FIG. 8 is a diagram showing an example of robot control by the robot control system 60.
  • the robot 3 performs the task of putting balls in its surroundings into the box 410. That is, in this example the object is a ball.
  • FIG. 8 sequentially represents a series of actions of the robot 3 by scenes S301 to S304. The following description also shows the correspondence with the processing flow S2.
  • the detection unit 14 acquires an input image showing the ball 421 (step S21), and uses the machine learning model 41 to detect the presence or absence of the object (ball) in the input image (step S22).
  • the specifying unit 15 specifies the attention area in the machine learning model 41 in the detection (steps S23 to S25). This attention area corresponds to the position where the ball 421 exists.
  • the robot control unit 61 selects a task area corresponding to the ball 421 based on the attention area (step S26).
  • the robot control unit 61 controls the robot 3 so that it approaches the ball 421 (steps S27 and S28). Through this control, the distance between the end effector 3b and the ball 421 is shortened. After that, the processing of steps S21 to S28 is repeated.
  • the robot control unit 61 causes the robot 3 to execute the task (step S29).
  • the robot 3 executes the task under the control of the robot control unit 61 (step S29).
  • the robot 3 grips the ball 421 with the end effector 3 b and puts the ball 421 into the box 410 .
  • as described above, a data generation system according to one aspect of the present disclosure includes a detection unit that detects the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image, a specifying unit that specifies, from the input image, a region focused on by the machine learning model in the detection as a region of interest, and an annotation unit that associates an annotation corresponding to the detected object with the region of interest.
  • a data generation method is executed by a data generation system including at least one processor.
  • This data generation method includes the steps of detecting the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image, specifying, from the input image, a region focused on by the machine learning model in the detection as a region of interest, and associating an annotation corresponding to the detected object with the region of interest.
  • a data generation program according to one aspect of the present disclosure causes a computer to execute the steps of detecting the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image, specifying, from the input image, a region focused on by the machine learning model in the detection as a region of interest, and associating an annotation corresponding to the detected object with the region of interest.
  • annotations corresponding to the object are automatically associated with the region focused on by the machine learning model in detecting the object. Therefore, it is possible to efficiently annotate an object in an image.
  • a data generation system may further include a preparation unit that executes machine learning based on a plurality of training images labeled to indicate the presence or absence of an object to generate a machine learning model.
  • a data generation system may further include a display control unit that displays a label user interface for assigning a label to a given image, and a labeling unit that assigns the label input via the label user interface to the given image to generate a training image. This configuration enables training images to be prepared as desired by the user.
  • a data generation system may further include a display control unit that displays a correction user interface for allowing a user to correct the annotation, and the annotation unit may correct the annotation based on user input via the correction user interface. This configuration gives the user the opportunity to modify the automatically set annotations. Also, since the annotation is corrected according to user input, the accuracy of the annotation can be improved.
  • the annotation unit may associate a graphical representation corresponding to the region of interest with the region of interest as at least part of the annotation. Annotations using graphical representations can clearly indicate objects in the input image.
  • the annotation unit may generate the circumscribed shape of the region of interest as a graphic representation. This circumscribed shape can clearly indicate the position of the object in the input image.
  • the detection unit may detect the presence or absence of each of a plurality of types of objects in the input image using the machine learning model, the specifying unit may specify a region of interest for each of one or more types of objects detected among the plurality of types of objects, and the annotation unit may associate annotations that differ for each type of object with one or more regions of interest corresponding to the one or more types of detected objects. In this case, annotations can be added to the input image so that the type of object can be determined.
  • the annotation unit may associate a class value for identifying the type of object with the region of interest as at least part of the annotation.
  • the specifying unit may calculate a degree of attention indicating the degree of attention given by the machine learning model to each of the plurality of pixels forming the input image, and identify the region of interest based on the plurality of degrees of attention corresponding to the plurality of pixels. Since the region of interest is specified based on the degree of attention for each pixel, the region of interest can be specified in detail.
  • the specifying unit may select, from the plurality of pixels, one or more pixels whose degree of attention is equal to or higher than a given threshold value as pixels of interest, and identify the region of interest based on the selected one or more pixels of interest. Since pixels with a relatively high degree of attention are set as the region of interest, an annotation can be associated with a position where there is a high probability that an object exists. As a result, the accuracy of the annotation can be further improved.
  • the specifying unit may calculate the area of each of one or more dense regions, which are regions where pixels of interest are concentrated, and identify a dense region whose area is equal to or larger than a given threshold value as the region of interest.
  • a region in which pixels of interest are concentrated over a relatively wide range is set as a region of interest, so that an annotation can be associated with an object clearly appearing in the input image.
  • the specifying unit may calculate the area of the circumscribed shape of each of one or more dense regions, and identify a dense region whose circumscribed-shape area is equal to or larger than a given threshold value as the region of interest.
  • since the region of interest is specified based not on the area of the dense region itself but on the area of its circumscribed shape, the area can be calculated easily, and the region of interest can accordingly be specified at high speed.
  • the identifying unit may execute Grad-CAM on the machine learning model to identify the region of interest.
  • Grad-CAM can be used to identify regions of interest for various types of machine learning models, such as various types of neural networks.
  • a model generation system includes the data generation system described above, an acquisition unit that acquires teacher data including an input image in which an annotation is associated with a region of interest by the data generation system, and based on the teacher data, a learning unit that generates a trained model for detecting at least the position of the object from the image.
  • a trained model for detecting the position of an object can be generated using teacher data including annotated input images.
  • An estimation system according to one aspect of the present disclosure includes the model generation system described above, and an estimation unit that inputs a target image to a trained model generated by the model generation system and detects at least the position of the object from the target image.
  • the trained model can be used to efficiently detect the position of the target object from the target image.
  • a robot control system according to one aspect of the present disclosure includes a detection unit that detects the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image, a specifying unit that specifies, from the input image, a region focused on by the machine learning model in the detection as a region of interest, and a robot control unit that controls, based on the region of interest, a robot that processes the object.
  • the robot since the robot is controlled based on the area focused on by the machine learning model in detecting the object, the robot can be operated autonomously according to the position of the object.
  • the robot control unit may control the robot so that the robot approaches the object.
  • the robot can autonomously approach the object according to the position of the object.
  • the detection unit may further detect, using the machine learning model, the presence or absence of the object in a new input image acquired after the robot approaches the object, the specifying unit may specify, from the new input image, a region noticed by the machine learning model in the further detection as a new region of interest, and the robot control unit may further control the robot based on the new region of interest.
  • in this case, the region of interest is specified again after the robot approaches the object, and the robot is further controlled based on that region of interest. This mechanism allows the robot to operate more accurately.
  • a method for producing a trained model according to an aspect of the present disclosure includes the data generation method described above, and generating, based on teacher data including an input image in which an annotation is associated with a region of interest by the data generation method, a trained model for detecting at least the position of the object from an image. Since an automatically annotated input image is used as at least part of the teacher data, a trained model for object detection can be generated efficiently.
  • the data generation system 10 may be constructed independently without including the model generation system 20 and the estimation system 30 .
  • the computer systems corresponding to model generation system 20 and estimation system 30 may be computer systems owned by different owners from data generation system 10 .
  • a combination of data generation system 10 and model generation system 20 may be constructed without including estimation system 30 .
  • the computer system corresponding to the estimation system 30 may be a computer system owned by a different owner from the data generation system 10 and the model generation system 20 .
  • the data generation system and robot control system according to the present disclosure may not include the display control unit 11, the labeling unit 12, and the preparation unit 13. That is, the data generation system and robot control system may use machine learning models generated by other computer systems. Because machine learning models and trained models are portable between computer systems, various systems according to the present disclosure can be implemented flexibly.
  • the hardware configuration of the system according to the present disclosure is not limited to the aspect of implementing each functional module by executing a program.
  • at least part of the functional module group described above may be configured by a logic circuit specialized for that function, or may be configured by an ASIC (Application Specific Integrated Circuit) that integrates the logic circuit.
  • the processing procedure of the method executed by at least one processor is not limited to the above examples. For example, some of the steps or processes described above may be omitted, or the steps may be performed in a different order. Also, two or more of the steps described above may be combined, and some of the steps may be modified or deleted. Alternatively, other steps may be performed in addition to the above steps.
  • in comparing the magnitudes of two numerical values, either of the two criteria "greater than or equal to" and "greater than" may be used, and either of the two criteria "less than or equal to" and "less than" may be used.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A data generation system according to one example comprises: a detection unit that detects the presence or absence of a target object in an input image using a machine learning model that detects the presence or absence of a target object on the basis of an image; an identification unit that identifies, from the input image and as an area of interest, the area targeted by the machine learning model at the time of the detection; and an annotation unit that associates an annotation corresponding to the detected target object with the area of interest.

Description

Data generation system, model generation system, estimation system, trained model manufacturing method, robot control system, data generation method, and data generation program
One aspect of the present disclosure relates to a data generation system, a model generation system, an estimation system, a trained model manufacturing method, a robot control system, a data generation method, and a data generation program.
Patent Document 1 describes a fraud estimation system that includes item information acquisition means for acquiring item information about an item, mark specifying means for specifying a mark of the item based on the item information, classification specifying means for specifying a classification of the item based on the item information, and estimation means for estimating fraud regarding the item based on the specified mark and classification.
International Publication No. 2020/240834 pamphlet
There is a demand for more efficient processing of objects appearing in images.
A data generation system according to one aspect of the present disclosure includes: a detection unit that detects the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of an object based on an image; a specifying unit that specifies, from the input image, a region noticed by the machine learning model in the detection as a region of interest; and an annotation unit that associates an annotation corresponding to the detected object with the region of interest.
A data generation method according to one aspect of the present disclosure is executed by a data generation system including at least one processor. This data generation method includes: detecting the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of an object based on an image; specifying, from the input image, a region noticed by the machine learning model in the detection as a region of interest; and associating an annotation corresponding to the detected object with the region of interest.
A data generation program according to one aspect of the present disclosure causes a computer to execute: detecting the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of an object based on an image; specifying, from the input image, a region noticed by the machine learning model in the detection as a region of interest; and associating an annotation corresponding to the detected object with the region of interest.
According to one aspect of the present disclosure, it is possible to improve the efficiency of processing related to an object appearing in an image.
FIG. 1 is a diagram showing an example of the functional configuration of an object detection system. FIG. 2 is a diagram showing an example of the hardware configuration of a computer used in the object detection system. FIG. 3 is a flowchart showing an example of processing in the object detection system. FIG. 4 is a flowchart showing an example of processing related to annotation. FIG. 5 is a diagram showing an example of specifying a region of interest. FIG. 6 is a diagram showing an example of the functional configuration of a robot system. FIG. 7 is a flowchart showing an example of processing in the robot system. FIG. 8 is a diagram showing an example of robot control.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In the description of the drawings, the same or equivalent elements are denoted by the same reference numerals, and overlapping descriptions are omitted.
[Object detection system]
(system overview)
In one example, the data generation system, model generation system, and estimation system according to the present disclosure are applied to the object detection system 1 . The object detection system 1 is a computer system that generates a trained model for detecting the position of an object from an image and uses this trained model to perform the detection. A trained model is a computational model that detects a position where an object appears in an image as the position of the object. A trained model is generated in advance by machine learning. Machine learning refers to a method of autonomously discovering laws or rules by repeatedly learning based on given information. A trained model is built using algorithms and data structures. In one example, the trained model is built by a neural network such as a convolutional neural network (CNN). Generating a trained model corresponds to the learning phase, and using the trained model corresponds to the operation phase. Therefore, the object detection system 1 performs both a learning phase and an operational phase.
The object is any tangible object and is set according to the purpose of use of the object detection system 1. For example, if the object detection system 1 is used to automatically harvest crops with a robot, the object is the crop. As another example, if the object detection system 1 is used to automatically box items with a robot, the object is the item.
(System configuration)
FIG. 1 is a diagram showing an example of the functional configuration of the object detection system 1. In this example, the object detection system 1 includes a data generation system 10, a model generation system 20, and an estimation system 30. The data generation system 10, the model generation system 20, and the estimation system 30 are examples of the data generation system, the model generation system, and the estimation system according to the present disclosure, respectively. The data generation system 10 and the model generation system 20 correspond to the learning phase, and the estimation system 30 corresponds to the operation phase.
In the learning phase, the model generation system 20 generates a trained model 42 used by the estimation system 30. In order to generate the trained model 42, machine learning using teacher data including a plurality of annotated images is required. Generally, a large amount of teacher data is required for that machine learning, so it is necessary to annotate a large number of images. An annotation is information (metadata) related to an image, and indicates, for example, a class value indicating the type of an object and the position of the object in the image. Class values can be expressed in any form, such as numbers or text. The position of the object may be indicated by a bounding box, which is a rectangle set corresponding to that position. Conventionally, annotation is performed manually, which is very costly and time consuming. In addition, variations in annotations between workers can also occur. In the object detection system 1, the data generation system 10 executes the annotation to automatically generate at least part of the teacher data. Therefore, annotation of objects appearing in images can be carried out efficiently.
The data generation system 10 uses a machine learning model 41 to annotate images. The machine learning model 41 is a computational model for detecting the presence or absence of an object based on an image. In the present disclosure, detecting the presence or absence of an object may include processing for identifying the type of the object, that is, the class value. The machine learning model 41 processes an image to detect whether or not the object appears in the image. The machine learning model 41 is generated in advance by machine learning, so it can also be said to be a trained model. In one example, the machine learning model 41 is constructed by a neural network such as a convolutional neural network (CNN).
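The disclosure does not fix a concrete architecture for the machine learning model 41 beyond "a neural network such as a CNN". The following is a minimal sketch, assuming PyTorch, of a presence/absence classifier of that kind; the layer sizes, the two-class output, and the use of global average pooling are illustrative assumptions (the pooling is chosen here only because it keeps the sketch compatible with the CAM-style visualization discussed later).

```python
# Minimal sketch of a presence/absence classifier in the spirit of machine
# learning model 41. PyTorch and the layer sizes are assumptions.
import torch
import torch.nn as nn

class PresenceClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):  # e.g. "object present" / "object absent"
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)       # global average pooling
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x)          # feature maps usable by CAM/Grad-CAM
        v = self.gap(f).flatten(1)    # pooled feature vector
        return self.classifier(v)     # class scores (presence/absence)

# Example: score one RGB image tensor of shape (1, 3, H, W).
model = PresenceClassifier(num_classes=2)
scores = model(torch.randn(1, 3, 224, 224))
print(scores.softmax(dim=1))
```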
In the present disclosure, for convenience of explanation, the computational model used to annotate images in the data generation system 10, which corresponds to a part of the learning phase, is referred to as the "machine learning model", and the computational model used to detect the position of an object in an image in the estimation system 30, which corresponds to the operation phase, is referred to as the "trained model".
In one example, the object detection system 1 can access a first image database 51 and a second image database 52. These databases may be provided outside the object detection system 1 or may be part of the object detection system 1. The first image database 51 is a device that stores a plurality of first training images, to which labels indicating the presence or absence of the object are attached, as first teacher data used to generate the machine learning model 41. A label is information (metadata) associated with an image, and indicates, for example, a class value indicating whether or not the object exists in the image. The second image database 52 is a device that stores a plurality of second training images, to which annotations corresponding to the object are attached, as second teacher data used to generate the trained model 42.
Labels and annotations are common in that they are used as ground truth in machine learning. In the present disclosure, for convenience of explanation, the metadata attached to the first training images is referred to as "labels", and the metadata attached to the second training images is referred to as "annotations".
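As an illustration of the distinction drawn above, the records below contrast a label (presence/absence only) with an annotation (class value plus position). The dictionary layout and field names are hypothetical; the disclosure does not prescribe a storage format for the first and second image databases.

```python
# Hypothetical record layouts contrasting a "label" (first training image)
# with an "annotation" (second training image). Field names are illustrative.
first_training_record = {
    "image_path": "images/0001.png",
    "label": {"class_value": 1},          # 1 = object present, 0 = absent
}

second_training_record = {
    "image_path": "images/0001.png",
    "annotations": [
        {
            "class_value": 1,             # type of the detected object
            "bbox": [120, 80, 210, 190],  # bounding box: x_min, y_min, x_max, y_max
        }
    ],
}
```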
In one example, the data generation system 10 includes a display control unit 11, a labeling unit 12, a preparation unit 13, a detection unit 14, a specifying unit 15, and an annotation unit 16 as functional modules. The display control unit 11 is a functional module that displays a user interface related to labels or annotations. The labeling unit 12 is a functional module that generates a first training image by attaching a label input via the user interface to a given image. The labeling unit 12 stores the first training image in the first image database 51. The preparation unit 13 is a functional module that executes machine learning based on the labeled first training images to generate the machine learning model 41. The detection unit 14 is a functional module that uses the machine learning model 41 to detect the presence or absence of an object in an input image. The detection unit 14 may detect whether or not one specific type of object exists in the input image. Alternatively, the detection unit 14 may detect, for each of a plurality of types of objects, whether or not that object exists in the input image. The detection unit 14 may specify the type of each of the one or more types of objects, that is, the class value of each object. The specifying unit 15 is a functional module that specifies, from the input image, a region noticed by the machine learning model 41 in the detection as a region of interest. The annotation unit 16 is a functional module that associates an annotation corresponding to the detected object with the region of interest. The annotation unit 16 stores the input image in which the annotation is associated with the region of interest, that is, the annotated input image, in the second image database 52 as a second training image. The input image is an image processed by the machine learning model 41 and is treated as a second training image once it has been annotated.
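One possible way to read the functional modules listed above is as a set of narrow interfaces. The sketch below expresses three of them as Python protocols; the method names, argument types, and the rectangle-based region type are assumptions made only for illustration.

```python
# Schematic interfaces for the detection, specifying, and annotation units.
# These protocols are an illustration; the disclosure defines the modules
# functionally, not as a concrete API.
from typing import Any, List, Protocol, Tuple

Region = Tuple[int, int, int, int]  # x_min, y_min, x_max, y_max

class DetectionUnit(Protocol):
    def detect(self, image: Any) -> List[int]:
        """Return class values of the object types detected in the image."""

class SpecifyingUnit(Protocol):
    def specify(self, image: Any, model: Any) -> List[Region]:
        """Return the regions of interest the model focused on."""

class AnnotationUnit(Protocol):
    def annotate(self, image: Any, regions: List[Region], class_values: List[int]) -> dict:
        """Associate an annotation with each region of interest."""
```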
In one example, the model generation system 20 includes a learning unit 21. The learning unit 21 is a functional module that acquires second teacher data including second training images generated by the data generation system 10 and generates the trained model 42 based on this second teacher data. Therefore, the learning unit 21 also functions as an acquisition unit. In one example, the model generation system 20 may be constructed as a computer system including the data generation system 10 and the learning unit 21.
In one example, the estimation system 30 includes an estimation unit 31. The estimation unit 31 is a functional module that inputs a target image to the trained model 42 and detects at least the position of the object from the target image. The estimation unit 31 may detect, from one target image, the position of each of two or more types of objects. The target image is an image to be processed by the trained model 42. No metadata is associated with the target image, and the trained model 42 detects at least the position of the object based on the pixel information of the target image. The estimation unit 31 may further estimate a class value for each detected object. In one example, the estimation system 30 may be constructed as a computer system including the model generation system 20, which may include the data generation system 10, and the estimation unit 31.
The object detection system 1 can be realized by any kind of computer. The computer may be a general-purpose computer such as a personal computer or a business server, or may be incorporated in a dedicated device that executes specific processing. The object detection system 1 may be implemented by one computer or by a distributed system having a plurality of computers. Each of the data generation system 10, the model generation system 20, and the estimation system 30 may be implemented by one computer or by a distributed system of a plurality of computers. Alternatively, one computer may function as at least two of the data generation system 10, the model generation system 20, and the estimation system 30.
FIG. 2 is a diagram showing an example of the hardware configuration of a computer 100 used in the object detection system 1. In this example, the computer 100 includes a main body 110, a monitor 120, and an input device 130.
The main body 110 is a device that executes the main functions of the computer. The main body 110 has a circuit 160, and the circuit 160 has at least one processor 161, a memory 162, a storage 163, an input/output port 164, and a communication port 165. The storage 163 records programs for configuring each functional module of the main body 110. The storage 163 is a computer-readable recording medium such as a hard disk, a nonvolatile semiconductor memory, a magnetic disk, or an optical disk. The memory 162 temporarily stores programs loaded from the storage 163, calculation results of the processor 161, and the like. The processor 161 configures each functional module by executing a program in cooperation with the memory 162. The input/output port 164 inputs and outputs electrical signals to and from the monitor 120 or the input device 130 according to instructions from the processor 161. The input/output port 164 may also input and output electrical signals to and from other devices. The communication port 165 performs data communication with other devices via a communication network N according to instructions from the processor 161.
The monitor 120 is a device for displaying information output from the main body 110. The monitor 120 may be of any type as long as it can display graphics, and a specific example thereof is a liquid crystal panel.
The input device 130 is a device for inputting information to the main body 110. The input device 130 may be of any type as long as desired information can be input, and specific examples thereof include operation interfaces such as a keypad, a mouse, and an operation controller.
The monitor 120 and the input device 130 may be integrated as a touch panel. For example, the main body 110, the monitor 120, and the input device 130 may be integrated like a tablet computer.
Each functional module of the object detection system 1 is realized by loading an object detection program onto the processor 161 or the memory 162 and causing the processor 161 to execute the program. The processor 161 operates the input/output port 164 or the communication port 165 according to the object detection program, and reads and writes data in the memory 162 or the storage 163.
In one example, the object detection program includes a data generation program, a model generation program, and an estimation program. The data generation program includes code for realizing each functional module of the data generation system 10. The model generation program includes code for realizing each functional module of the model generation system 20. The estimation program includes code for realizing each functional module of the estimation system 30.
The object detection program may be provided after being fixedly recorded on a non-transitory recording medium such as a CD-ROM, a DVD-ROM, or a semiconductor memory. Alternatively, the object detection program may be provided via a communication network as a data signal superimposed on a carrier wave. The data generation program, the model generation program, and the estimation program may be provided separately. Alternatively, at least two of these three types of programs may be provided as one package.
(Object detection method)
As an example of the object detection method according to the present disclosure, an example of a processing procedure executed by the object detection system 1 will be described with reference to FIG. 3 . FIG. 3 is a flowchart showing an example of processing in the object detection system 1 as a processing flow S1. That is, the object detection system 1 executes the processing flow S1.
In step S11, the data generation system 10 associates a label with a given image to generate a first training image. In one example, this processing is performed by the display control unit 11 and the labeling unit 12. The display control unit 11 displays, on the monitor 120, a label user interface for attaching a label to a given image. Via the label user interface, the user inputs a label indicating whether or not the object is present in the image. The labeling unit 12 attaches the label to the image to generate a first training image, and stores the first training image in the first image database 51 as at least part of the first teacher data.
In step S12, the data generation system 10 executes machine learning based on the first teacher data including at least one first training image to generate the machine learning model 41. In one example, this processing is performed by the preparation unit 13. The preparation unit 13 accesses the first image database 51 and executes the following processing for each first training image. That is, the preparation unit 13 inputs the first training image to a first reference model, which is the computational model on which the machine learning model 41 is based, and obtains the estimation result of the class value output from the first reference model. The preparation unit 13 executes backpropagation based on the error between the estimation result and the label (correct answer) to update the parameter group in the first reference model. The preparation unit 13 repeats this learning until a given termination condition is satisfied to obtain the machine learning model 41. In one example, the machine learning model 41 is a computational model estimated to be optimal for detecting the presence or absence of an object based on an image. Note that the machine learning model 41 is not necessarily a "computational model that is actually optimal".
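A minimal sketch of the step S12 loop described above, assuming PyTorch: the first reference model is run forward on a first training image, the class-value estimate is compared with the label, and backpropagation updates the parameter group until a termination condition (here simply a fixed number of epochs) is met. The optimizer, loss function, and data loader are implementation choices, not requirements of the disclosure.

```python
# Sketch of the step S12 training loop: forward pass, compare the class-value
# estimate with the label (correct answer), backpropagate, repeat.
import torch
import torch.nn as nn

def train_presence_model(model, loader, epochs: int = 10, lr: float = 1e-3):
    criterion = nn.CrossEntropyLoss()            # error between estimate and label
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                      # stands in for the termination condition
        for images, labels in loader:            # first training images and their labels
            optimizer.zero_grad()
            scores = model(images)               # estimated class values
            loss = criterion(scores, labels)
            loss.backward()                      # backpropagation
            optimizer.step()                     # update the parameter group
    return model                                 # plays the role of machine learning model 41
```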
In step S13, the data generation system 10 generates second training images by annotation using the machine learning model 41. In one example, this processing is performed by the detection unit 14, the specifying unit 15, and the annotation unit 16. The details of this processing will be described later.
In step S14, the model generation system 20 executes machine learning based on the second teacher data including at least one second training image to generate the trained model 42. In one example, this processing is performed by the learning unit 21. The learning unit 21 accesses the second image database 52 and executes the following processing for each second training image. That is, the learning unit 21 inputs the second training image to a second reference model, which is the computational model on which the trained model 42 is based, and obtains the estimation result of the position of the object output from the second reference model. The estimation result may further indicate a class value indicating the type of the object. The learning unit 21 executes backpropagation based on the error between the estimation result and the annotation (correct answer) to update the parameter group in the second reference model. The learning unit 21 repeats this learning until a given termination condition is satisfied to obtain the trained model 42. In one example, the trained model 42 is a computational model estimated to be optimal for detecting the position of an object based on an image. Note that the trained model 42 is not necessarily a "computational model that is actually optimal".
In step S15, the estimation system 30 performs estimation using the trained model 42. In one example, this processing is performed by the estimation unit 31. The estimation unit 31 inputs a target image to the trained model 42 and detects at least the position of the object in the target image. The estimation unit 31 outputs the detected position as an estimation result. The estimation unit 31 may further detect class values, in which case the estimation result may also include class values. The estimation unit 31 may generate the estimation result by superimposing a bounding box or the like indicating the position of the object on the target image, and may display the estimation result on the monitor 120. Alternatively, the estimation unit 31 may store the estimation result in a recording medium such as the storage 163, or may transmit the estimation result to another computer. The estimation unit 31 may perform the detection for each of a plurality of target images.
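As a sketch of step S15, the snippet below runs a detection model on a target image and reads out positions (bounding boxes), class values, and scores. A torchvision Faster R-CNN is used purely as a stand-in for the trained model 42; the disclosure does not prescribe this architecture, and the confidence threshold is an assumption.

```python
# Step S15 sketch: detect object positions and class values in a target image.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

target_image = torch.rand(3, 480, 640)          # placeholder for a real image tensor in [0, 1]
with torch.no_grad():
    (result,) = model([target_image])           # one result dict per input image

for box, label, score in zip(result["boxes"], result["labels"], result["scores"]):
    if score >= 0.5:                            # confidence threshold (assumption)
        print(int(label), [round(v, 1) for v in box.tolist()], float(score))
```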
(data generation method)
Details of step S13 will be described with reference to FIG. 4 as an example of the data generation method according to the present disclosure. FIG. 4 is a flowchart showing an example of annotation processing.
In step S131, the detection unit 14 acquires one input image. The input image may be a still image or one frame image forming a video. The detection unit 14 may receive an input image sent from a camera or another computer. Alternatively, the detection unit 14 may accept an input image entered by a user, or may read an input image from a given storage device based on user input.
In step S132, the detection unit 14 inputs the input image to the machine learning model 41 and detects the presence or absence of the object. That is, the detection unit 14 detects whether or not the object appears in the input image. The detection unit 14 may use the machine learning model 41 to detect the presence or absence of each of a plurality of types of objects in the input image. The detection unit 14 may specify a class value for each detected object.
In step S133, the specifying unit 15 executes a visualization method on the machine learning model 41 that has executed the detection, and calculates the degree of attention for each of the plurality of pixels forming the input image. In the present disclosure, the degree of attention refers to an index indicating the degree to which a pixel was noticed by the machine learning model 41. The degree of attention can be said to be an index indicating how much a pixel influenced the determination by the machine learning model 41, and can also be said to be an index representing the grounds for the determination by the machine learning model 41. A pixel with a higher degree of attention has a greater influence on the determination by the machine learning model 41.
The visualization method is executed based on values calculated in the machine learning model 41 that processed the input image, for example, calculated values corresponding to individual nodes and individual edges in the neural network. In one example, the specifying unit 15 uses Class Activation Mapping (CAM) as the visualization method. CAM is a method for visualizing the grounds for a determination by a neural network based on feature maps and the weights corresponding to the edges from Global Average Pooling (GAP) to the detected class. The specifying unit 15 may use Grad-CAM (Gradient-weighted CAM). Grad-CAM is a method that substitutes the gradients obtained during backpropagation for the weights used in the CAM calculation, which makes it possible to visualize the grounds for determinations in various types of neural networks. The specifying unit 15 may use Grad-CAM++, Score-CAM, Ablation-CAM, Eigen-CAM, or Integrated Grad-CAM.
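The snippet below is one compact way to obtain the per-pixel degree of attention described in step S133, using a Grad-CAM-style computation with forward and backward hooks in PyTorch. The choice of target layer and the normalization to [0, 1] are assumptions; dedicated libraries (for example pytorch-grad-cam) provide equivalent functionality, and any of the CAM variants listed above could be substituted.

```python
# Grad-CAM-style per-pixel attention map via hooks (a sketch, not the
# disclosure's mandated implementation).
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    try:
        scores = model(image)                      # forward pass, shape (1, num_classes)
        model.zero_grad()
        scores[0, class_idx].backward()            # gradient of the detected class score
    finally:
        h1.remove(); h2.remove()
    acts, grads = activations[0], gradients[0]     # both (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True) # GAP of gradients as channel weights
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0].detach()                      # per-pixel degree of attention in [0, 1]
```

The returned map can then be thresholded as described in step S134.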
In step S134, the specifying unit 15 selects, from the plurality of pixels of the input image, one or more pixels whose degree of attention is equal to or greater than a given threshold Ta as pixels of interest. That is, the specifying unit 15 selects, as pixels of interest, pixels that have a relatively large influence on the determination by the machine learning model 41.
In step S135, the specifying unit 15 specifies a region of interest based on the set of one or more selected pixels of interest. In one example, the specifying unit 15 specifies a dense region, which is a region where a plurality of pixels of interest are concentrated, as the region of interest. A dense region is a limited range in which a plurality of pixels of interest are gathered at a density equal to or higher than a given reference value. At least part of the dense region may be formed by two or more pixels of interest that are contiguous. In one example, the specifying unit 15 may specify at least one dense region by clustering the pixels of interest and specify that dense region as the region of interest. The specifying unit 15 may calculate the area of each of one or more dense regions and specify a dense region whose area is equal to or larger than a given threshold Tb as the region of interest. Alternatively, the specifying unit 15 may calculate, for each of one or more dense regions, the area of the circumscribed shape of the dense region, and specify a dense region whose circumscribed-shape area is equal to or larger than a given threshold Tc as the region of interest. The circumscribed shape may be a circumscribed rectangle or a circumscribed circle.
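A sketch of steps S134 and S135 under simple assumptions: the attention map is thresholded with Ta to obtain pixels of interest, connected components stand in for the dense regions, and a region is kept when the area of its circumscribed rectangle reaches Tc. OpenCV is an implementation choice; Ta and Tc are tuning parameters, not values given in the disclosure.

```python
# Threshold the attention map, group pixels of interest into regions, and
# keep regions whose circumscribed rectangle is large enough.
import cv2
import numpy as np

def regions_of_interest(attention: np.ndarray, ta: float = 0.5, tc: int = 100):
    mask = (attention >= ta).astype(np.uint8)          # pixels of interest (threshold Ta)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    boxes = []
    for i in range(1, n):                              # label 0 is the background
        x, y, w, h = (stats[i, cv2.CC_STAT_LEFT], stats[i, cv2.CC_STAT_TOP],
                      stats[i, cv2.CC_STAT_WIDTH], stats[i, cv2.CC_STAT_HEIGHT])
        if w * h >= tc:                                # circumscribed-rectangle area vs Tc
            boxes.append((int(x), int(y), int(x + w), int(y + h)))
    return boxes                                       # regions of interest as x_min, y_min, x_max, y_max
```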
As shown in steps S133 to S135, the specifying unit 15 specifies, from the input image, a region noticed by the machine learning model 41 in the detection as a region of interest. The specifying unit 15 can specify a plurality of regions of interest from one input image. The specifying unit 15 may specify a region of interest for each of one or more types of objects detected among the plurality of types of objects. The specifying unit 15 may specify the region of interest based on a plurality of degrees of attention corresponding to the plurality of pixels forming the input image. The specifying unit 15 may execute Grad-CAM on the machine learning model to specify the region of interest.
In step S136, the annotation unit 16 associates an annotation with the region of interest. In one example, the annotation unit 16 associates, with the region of interest, an annotation indicating a class value representing the type of the object and the position of the object. When a plurality of regions of interest are specified, the annotation unit 16 associates an annotation with each region of interest. The annotation unit 16 may associate, with each of the one or more regions of interest corresponding to the detected one or more types of objects, a different annotation for each type of object. An "annotation that differs for each type of object" is, for example, an annotation including a class value indicating the type of the object. The annotation unit 16 associates the annotation corresponding to the object detected by the machine learning model 41 with the region of interest.
In one example, the annotation unit 16 may associate, with the region of interest, an annotation including a graphic representation corresponding to the region of interest. This graphic representation can be used to indicate the position of the object in the input image. For example, the annotation unit 16 may generate a graphic representation that is drawn corresponding to the position of the region of interest. This graphic representation is, for example, a bounding box. The graphic representation may be a shape, such as a circumscribed shape, that encloses the entire region of interest. Alternatively, the graphic representation may be a shape that partially overlaps the region of interest, in which case the region of interest may extend beyond the graphic representation. Alternatively, the graphic representation may be a shape that encloses the entire object located in the region of interest, or a shape that overlaps part of the object. The annotation unit 16 may generate a graphic representation that provides a visual effect such as blinking or highlighting. In one example, the annotation unit 16 may associate, with the region of interest, an annotation that indicates the position of the object and does not include a class value.
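Continuing the hypothetical record format shown earlier, step S136 can be pictured as attaching a class value and a bounding-box graphic representation to each region of interest. The helper below is illustrative only; the disclosure does not mandate this layout.

```python
# Attach an annotation (class value + circumscribed rectangle) to each region
# of interest for one input image. Purely illustrative.
def annotate(image_path, regions, class_value):
    return {
        "image_path": image_path,
        "annotations": [
            {"class_value": class_value, "bbox": list(region)}  # circumscribed rectangle
            for region in regions
        ],
    }

# Example: two regions of interest detected for class value 1.
record = annotate("images/0002.png", [(120, 80, 210, 190), (300, 60, 380, 150)], class_value=1)
print(record)
```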
As an example of the processing for associating an annotation with a region of interest, the annotation unit 16 may execute segmentation, which sets the annotation in units of pixels. In this case, the annotation unit 16 may generate a graphic representation corresponding to the segment as an example of a graphic representation corresponding to the region of interest.
In step S137, the display control unit 11 accepts correction of the annotation. The display control unit 11 displays, on the monitor 120, a correction user interface for allowing the user to correct the annotation associated with the region of interest by the annotation unit 16. For example, the display control unit 11 displays the input image on which the annotation is superimposed on the correction user interface and receives user input for correcting the annotation. Using this correction user interface, the user can modify the position or dimensions of a graphic representation such as a circumscribed shape, modify class values, and add or delete annotations. When the user corrects the annotation, the annotation unit 16 corrects the annotation based on the user input. The user does not have to correct the annotation, in which case the annotation unit 16 does not execute the correction processing.
In step S138, the annotation unit 16 stores the processed input image, that is, the annotated input image, in the second image database 52 as a second training image.
In step S139, the detection unit 14 determines whether or not all input images have been processed. If there is an unprocessed input image (NO in step S139), the processing returns to step S131. In step S131, the detection unit 14 acquires the next input image, and the processing of steps S132 to S138 is executed based on this input image. When all input images have been processed (YES in step S139), the data generation system 10 ends the processing of step S13.
FIG. 5 is a diagram showing an example of specifying a region of interest from an input image. In this example, the data generation system 10 performs annotation on an input image 200. The input image 200 is an image captured by a camera that captures the periphery of an end effector 3b of a robot, and shows three decorations 201 to 203. The decorations 201 and 202 are a product Pa, and the decoration 203 is a product Pb. In this example, the detection unit 14 detects the product Pa as the object.
The detection unit 14 inputs the input image 200 to the machine learning model 41 and detects the decorations 201 and 202 as objects (step S132). The specifying unit 15 executes a visualization method such as Grad-CAM on the machine learning model 41 that has processed the input image 200, and calculates the degree of attention for each pixel of the input image 200 (step S133). An image 210 is a heat map that visualizes the degree of attention of each pixel. Looking at this image 210, it can be seen that the degree of attention is high in a pixel group 211 located in the region of the decoration 201 and in a pixel group 212 located in the region of the decoration 202. The specifying unit 15 selects pixels whose degree of attention is equal to or higher than a given threshold as pixels of interest, and specifies regions of interest based on the set of pixels of interest (for example, dense regions) (steps S134 and S135). An image 220 shows a dense region 221 obtained from the pixel group 211 corresponding to the decoration 201 and a dense region 222 obtained from the pixel group 212 corresponding to the decoration 202. Based on the area of the dense region itself or the area of the circumscribed shape of the dense region, the specifying unit 15 determines whether or not to specify the dense region as a region of interest. The annotation unit 16 associates an annotation with each region of interest (step S136). An image 230 shows that the dense regions 221 and 222 are specified as regions of interest, an annotation 231 is associated with the dense region (region of interest) 221, and an annotation 232 is associated with the dense region (region of interest) 222. In this example, the annotations 231 and 232 related to the product Pa are represented by circumscribed rectangles of the dense regions (regions of interest). The annotation unit 16 can correct at least one of the annotations 231 and 232 based on user input (step S137).
(Manufacturing method of learned model)
Steps S13 and S14 are also an example of the method for producing a trained model according to the present disclosure. For example, the production method is realized as follows. That is, the data generation system 10 uses the machine learning model 41 to detect the presence or absence of the object in the input image (step S132). The data generation system 10 specifies, from the input image, the region noticed by the machine learning model 41 in the detection as a region of interest (steps S133 to S135). The data generation system 10 associates the annotation corresponding to the detected object with the region of interest (step S136). The model generation system 20 generates the trained model 42 based on the second teacher data including the input images in which annotations are associated with regions of interest (that is, the second training images). In other words, in one example, the method for producing a trained model according to the present disclosure includes the data generation method according to the present disclosure, and generating, based on teacher data including an input image in which an annotation is associated with a region of interest by the data generation method, a trained model for detecting at least the position of the object from an image.
[Robot system]
(system overview)
In one example, an example robotic control system according to the present disclosure is shown as a component of the robotic system 2 . The robot system 2 is a mechanism for automating given work by causing a robot to execute a task, which is a series of processes for achieving a given purpose. In one example, the robot system 2 actively extracts an area in which the robot is expected to be able to perform a task, i.e., an area in which it is expected to process an object, as a task area in a situation where the surrounding environment of the robot is unknown. , the robot approaches the object (task area). In order to extract a task area, the robot system 2 uses the mechanism of the data generation system 10 that identifies the attention area from the input image. That is, the robot system 2 detects the presence or absence of an object in the input image, specifies an area of interest from the input image as an area of interest by the machine learning model in this detection, and selects a task area based on the area of interest. In one example, the robot system 2 makes the robot approach the object while repeatedly extracting the task area. When the robot reaches the task area, the robot system 2 causes the robot to perform the task. In this disclosure, when the robot reaches the task area, it means that the robot has come close enough to the object to perform the task.
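The approach-and-execute behavior described above can be pictured as a simple loop: capture an image, detect the object, specify the region of interest (used as the task area), and either move closer or execute the task. Every name in the sketch below (camera, detector, specifier, controller) is a placeholder; the actual sensing and motion interfaces depend on the robot system.

```python
# High-level sketch of the active-sensing approach loop. All objects passed
# in are placeholders for the real camera, model, and robot interfaces.
def approach_and_execute(camera, detector, specifier, controller, max_steps=20):
    for _ in range(max_steps):
        image = camera.capture()                 # new input image
        if not detector.detect(image):           # presence/absence of the object
            controller.search()                  # keep exploring (active sensing)
            continue
        region = specifier.specify(image)        # region of interest -> task area
        if controller.reached(region):           # close enough to perform the task
            controller.execute_task(region)
            return True
        controller.approach(region)              # move the robot toward the region
    return False                                 # give up after max_steps iterations
```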
In one example, the task includes a step in which the robot 3 comes into contact with the object. Examples of such tasks include a task that involves grasping the object and a task that involves pushing or pulling the object. When the robot 3 is caused to execute such a task, the robot system 2 extracts, as a task area, an area where the robot 3 can come into contact with the object in order to execute the task.
In order to recognize the unknown surrounding environment, the robot system 2 uses active sensing, which is a technique for searching for and collecting necessary information by actively changing the conditions of a sensor. This technique allows the robot to recognize the goal it should aim for even when the conditions regarding the object or the surrounding environment change frequently, or when modeling them in advance is impossible or difficult. Active sensing is a technique for discovering an unknown goal, and therefore differs from visual feedback, which positions a mechanical system toward a known goal.
(System configuration)
FIG. 6 is a diagram showing an example of the functional configuration of the robot system 2. In this example, the robot system 2 includes a robot control system 60, one or more robots 3, and one or more robot controllers 4 corresponding to the one or more robots 3. FIG. 6 shows one robot 3 and one robot controller 4, and shows a configuration in which one robot 3 is connected to one robot controller 4. However, neither the number of devices nor the connection method is limited to the example of FIG. 6. For example, one robot controller 4 may be connected to a plurality of robots 3. The communication network connecting the devices may be a wired network or a wireless network. The communication network may include at least one of the Internet and an intranet. Alternatively, the communication network may simply be realized by a single communication cable.
The robot control system 60 is a computer system for operating the robot 3 autonomously in at least some situations. The robot control system 60 executes given computations to generate command signals for controlling the robot 3. In one example, a command signal includes data for controlling the robot 3, for example a path indicating the trajectory of the robot 3. The trajectory of the robot 3 refers to the path of movement of the robot 3 or of its components; for example, it can be the trajectory of the tip of the robot. The robot control system 60 transmits the generated command signal to the robot controller 4.
The robot controller 4 is a device that operates the robot 3 in accordance with command signals from the robot control system 60. In one example, the robot controller 4 calculates joint angle targets (angle targets for each joint of the robot 3) for matching the position and posture of the tip with the target values indicated by the command signal, and controls the robot 3 in accordance with those angle targets.
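As a simplified illustration of this kind of computation, the following sketch solves the inverse kinematics of a hypothetical two-link planar arm in closed form. The link lengths and the planar geometry are assumptions made only for illustration; they do not reflect the actual kinematics of the robot 3 or the interface of the robot controller 4.

```python
import math

def joint_angle_targets(x, y, l1=0.4, l2=0.3):
    """Closed-form inverse kinematics for an assumed two-link planar arm.

    Returns joint angle targets (theta1, theta2) in radians that place the
    arm tip at the commanded position (x, y).
    """
    d2 = x * x + y * y
    # Law of cosines for the elbow joint.
    c2 = (d2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if not -1.0 <= c2 <= 1.0:
        raise ValueError("target position is out of reach")
    theta2 = math.acos(c2)                      # elbow-down solution
    theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2),
                                           l1 + l2 * math.cos(theta2))
    return theta1, theta2

# Example: joint targets that move the tip to (0.5, 0.2).
print(joint_angle_targets(0.5, 0.2))
```

A real six- or seven-axis articulated arm would require a numerical solver; the planar case is shown only because it admits a compact closed-form solution.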
The robot 3 is a device or machine that works on behalf of humans. In one example, the robot 3 is a multi-axis serial-link vertical articulated robot. The robot 3 includes a manipulator 3a and an end effector 3b, which is a tool attached to the tip of the manipulator 3a. The robot 3 can execute various processes using the end effector 3b and can freely change the position and posture of the end effector 3b within a given range. The robot 3 may be a six-axis vertical articulated robot, or a seven-axis vertical articulated robot in which one redundant axis is added to the six axes.
The robot 3 operates under the control of the robot control system 60 to execute a given task. By the robot 3 executing the task, the result desired by the user of the robot system 2 is obtained. For example, a task is set in order to process some object, in which case the robot 3 processes that object. Examples of tasks include "grasp an object and place it on a conveyor", "pick up an object and attach it to a workpiece", and "spray-paint an object".
An example of a sensor for recognizing the three-dimensional space around the robot is a visual sensor such as a camera. In one example, the robot includes a camera 3c that captures the surroundings of the end effector 3b. The field of view of the camera 3c may be set so as to capture at least part of the end effector 3b. The camera 3c may be arranged on the manipulator 3a, for example attached near the tip of the manipulator 3a. In one example, the camera 3c moves in response to the motion of the robot 3; this movement may include a change in at least one of the position and the posture of the camera 3c. As long as it moves in response to the motion of the robot 3, the camera 3c may be provided at a location other than the robot 3. For example, the camera 3c may be attached to another robot, or may be movably provided on a ceiling, a wall, or a camera stand.
In one example, the robot control system 60 includes, as functional modules, a display control unit 11, a labeling unit 12, a preparation unit 13, a detection unit 14, an identification unit 15, and a robot control unit 61. The display control unit 11, the labeling unit 12, the preparation unit 13, the detection unit 14, and the identification unit 15 are the same as the corresponding functional modules of the data generation system 10. The display control unit 11 displays a labeling user interface. The labeling unit 12 generates a first training image by assigning a label input through the labeling user interface to a given image and stores the first training image in the first image database 51. The preparation unit 13 executes machine learning based on the labeled first training images to generate the machine learning model 41. The detection unit 14 uses the machine learning model 41 to detect the presence or absence of an object in an input image. The identification unit 15 identifies, as a region of interest, the region noticed by the machine learning model 41 in that detection from the input image. The robot control unit 61 is a functional module that controls the robot 3, which processes the object, based on the region of interest.
In one example, the robot system 2 (robot control system 60) can access the first image database 51. The first image database 51 may be provided outside the robot system 2 or the robot control system 60, or may be part of the robot system 2 or the robot control system 60. The first image database 51 is a device that stores a plurality of first training images, each assigned a label indicating the presence or absence of an object, as first teacher data used to generate the machine learning model 41.
Like the object detection system 1, the robot control system 60 may be implemented by any kind of computer, for example by the computer 100 shown in FIG. 2. Each functional module of the robot control system 60 is implemented by loading a robot control program onto the processor 161 or the memory 162 and causing the processor 161 to execute that program. The processor 161 operates the input/output port 164 or the communication port 165 in accordance with the robot control program and reads and writes data in the memory 162 or the storage 163. Like the object detection program, the robot control program may be provided on a non-transitory recording medium or via a communication network.
(Robot control method)
As an example of the robot control method according to the present disclosure, an example of a processing procedure executed by the robot system 2 (robot control system 60) will be described with reference to FIG. 7. FIG. 7 is a flowchart showing an example of processing in the robot system 2 (robot control system 60) as a processing flow S2. That is, the robot system 2 (robot control system 60) executes the processing flow S2.
The processing flow S2 is executed on the premise that the machine learning model 41 has already been prepared. In one example, the machine learning model 41 generated in steps S11 and S12 of the processing flow S1 is used in the processing flow S2.
In step S21, the detection unit 14 acquires one input image. This processing is the same as step S131. The input image may be a still image or a single frame of a video. For example, the detection unit 14 may receive an input image sent from the camera 3c.
In step S22, the detection unit 14 inputs the input image to the machine learning model 41 and detects the presence or absence of the object. This processing is the same as step S132.
In step S23, the identification unit 15 executes a visualization method on the machine learning model 41 that performed the detection, and calculates an attention degree for each of the plurality of pixels forming the input image. This processing is the same as step S133.
In step S24, the identification unit 15 selects, from the plurality of pixels of the input image, one or more pixels whose attention degree is equal to or greater than a given threshold Ta as pixels of interest. This processing is the same as step S134.
In step S25, the identification unit 15 identifies a region of interest based on the set of one or more selected pixels of interest. This processing is the same as step S135.
As shown in steps S23 to S25, the identification unit 15 identifies, as a region of interest, the region noticed by the machine learning model 41 in the detection from the input image. The identification unit 15 may identify a plurality of regions of interest from one input image. The identification unit 15 may identify the region of interest based on a plurality of attention degrees corresponding to the plurality of pixels forming the input image, and may execute Grad-CAM on the machine learning model to identify the region of interest. These points regarding steps S23 to S25 are also the same as for steps S133 to S135.
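The following is a minimal sketch of how steps S23 to S25 could be realized with Grad-CAM for a CNN-based classifier. The torchvision ResNet-18 backbone, the two-class output, and the choice of the last convolutional block as the target layer are assumptions standing in for the machine learning model 41, not its actual structure.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Assumed stand-in for the machine learning model 41: a CNN classifier whose
# two outputs indicate the presence or absence of the object.
model = models.resnet18(weights=None, num_classes=2).eval()
target_layer = model.layer4            # assumed last convolutional block

activations, gradients = {}, {}

def _save(module, inputs, output):
    activations["value"] = output
    # Capture the gradient flowing back through this activation.
    output.register_hook(lambda grad: gradients.update(value=grad))

target_layer.register_forward_hook(_save)

def attention_map(image):
    """Per-pixel attention degrees in [0, 1] for a (1, 3, H, W) image."""
    logits = model(image)
    score = logits[0, logits.argmax(dim=1)]     # score of the detected class
    model.zero_grad()
    score.backward()
    acts, grads = activations["value"], gradients["value"]
    weights = grads.mean(dim=(2, 3), keepdim=True)        # GAP of gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    return (cam / (cam.max() + 1e-8))[0, 0].detach()      # (H, W) map

heatmap = attention_map(torch.rand(1, 3, 224, 224))
print(heatmap.shape)    # torch.Size([224, 224])
```

Thresholding this heatmap with Ta and grouping the remaining pixels then yields the regions of interest, as sketched later for the dense-region variant.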
In step S26, the robot control unit 61 selects a task area based on the region of interest. For example, the robot control unit 61 selects one of the one or more regions of interest as the task area. When a plurality of regions of interest have been identified, the robot control unit 61 may select the region of interest whose circumscribed rectangle has the largest area, or may select the region of interest that itself has the largest area. When the task area is selected in this way, there is a high probability that the region of the object closest to the robot 3 within the field of view will be selected as the task area.
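A possible implementation of this selection rule, assuming the attention map has already been computed, is sketched below using OpenCV; the threshold value and the use of contours are assumptions made for illustration.

```python
import cv2
import numpy as np

def select_task_area(attention, threshold=0.5):
    """Pick the region of interest whose circumscribed rectangle is largest.

    attention: (H, W) array of per-pixel attention degrees in [0, 1].
    Returns the bounding box (x, y, w, h) of the selected region, or None.
    """
    mask = (attention >= threshold).astype(np.uint8)      # assumed threshold Ta
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    boxes = [cv2.boundingRect(c) for c in contours]
    return max(boxes, key=lambda b: b[2] * b[3])          # largest w * h

print(select_task_area(np.random.rand(120, 160)))
```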
In step S27, the robot control unit 61 determines whether the robot 3 has reached the task area. For example, the robot control unit 61 may calculate the distance between the end effector 3b and the task area and make the determination based on this distance. If the calculated distance is equal to or less than a given threshold Td, the robot control unit 61 determines that the robot 3 has reached the task area; if the calculated distance is greater than the threshold Td, it determines that the robot 3 has not reached the task area.
If it is determined that the robot 3 has not reached the task area (NO in step S27), the processing proceeds to step S28. In step S28, the robot control unit 61 controls the robot 3 toward the task area as an example of controlling the robot 3 based on the region of interest. That is, the robot control unit 61 controls the robot 3 so that the robot 3 approaches the object. In one example, the robot control unit 61 generates, by planning, a path for the robot 3 that moves the end effector 3b from its current position to the task area. Alternatively, the robot control unit 61 may generate, by planning, a path (trajectory) for the robot 3 that reduces the distance to the task area while keeping the task area at the center of the image of the camera 3c. The robot control unit 61 outputs a command signal indicating the generated path to the robot controller 4, and the robot controller 4 controls the robot 3 in accordance with that command signal. As a result, the robot 3 approaches the object along the path.
After step S28, the processing returns to step S21, and the robot control system 60 executes the processing from step S21 onward again. In this repetition, the detection unit 14 acquires a new input image (step S21) and further detects the presence or absence of the object in that input image using the machine learning model 41 (step S22). For example, the detection unit 14 processes, as the new input image, an input image captured after the robot 3 has approached the object. The identification unit 15 identifies, from the new input image, the region noticed by the machine learning model 41 in that detection as a new region of interest (steps S23 to S25). The robot control unit 61 further controls the robot 3 based on the new region of interest (step S26 onward).
If it is determined that the robot 3 has reached the task area (YES in step S27), the processing proceeds to step S29. In step S29, the robot control unit 61 causes the robot 3 to execute the task. In one example, the robot control unit 61 generates a path for executing the task by planning and outputs a command signal indicating that path to the robot controller 4. The robot controller 4 controls the robot 3 in accordance with the command signal. As a result, the robot 3 executes the task.
In step S30, the robot control unit 61 determines whether to end the robot control. The robot control unit 61 may make this determination based on any termination condition. For example, the robot control unit 61 may determine to end the robot control when a specified number of tasks has been executed, and to continue the robot control when the number of executed tasks is less than that specified number. Alternatively, the robot control unit 61 may determine to end the robot control when an error has occurred in the robot control, and to continue it when no such error has occurred.
If it is determined that the robot control is to be ended (YES in step S30), the processing proceeds to step S31. In step S31, the robot control system 60 executes end processing. In this end processing, the robot control unit 61 may return the robot 3 to its initial posture and position. Alternatively, the robot control unit 61 may notify the user by visual information or sound that all tasks have been completed.
If the robot control is to be continued (NO in step S30), the processing proceeds to step S32. In step S32, the robot control unit 61 prepares for the next task. For example, the robot control unit 61 may return the robot 3 to its initial posture and position. Alternatively, the robot control unit 61 may notify the user by visual information or sound that the next task is to be executed.
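Putting steps S21 to S32 together, the following self-contained sketch simulates the overall loop of processing flow S2 in two dimensions and with a single task. The robot state, the noisy region-of-interest estimate, the step size, and the thresholds are all assumptions made only to keep the sketch runnable; they do not model the actual robot 3, the camera 3c, or the machine learning model 41.

```python
import math
import random

TD = 0.05     # assumed reach threshold Td (step S27)
STEP = 0.1    # assumed approach step per control cycle (step S28)

def identify_attention_region(ball):
    # Stand-in for steps S21-S25: the region of interest is a noisy
    # image-based estimate of where the object (the ball) currently appears.
    return (ball[0] + random.gauss(0, 0.01), ball[1] + random.gauss(0, 0.01))

def run_flow_s2(ball=(0.8, 0.5)):
    tip = [0.0, 0.0]    # assumed end-effector position
    while True:
        rx, ry = identify_attention_region(ball)         # S21-S26: task area
        dist = math.hypot(rx - tip[0], ry - tip[1])
        if dist <= TD:                                    # S27: reached?
            print("S29: executing task near", (round(rx, 3), round(ry, 3)))
            break                                         # S30/S31: finish
        # S28: approach the task area, then sense again (back to S21).
        step = min(STEP, dist)
        tip[0] += step * (rx - tip[0]) / dist
        tip[1] += step * (ry - tip[1]) / dist

run_flow_s2()
```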
An example of the motion of the robot based on the processing flow S2 will be described with reference to FIG. 8. FIG. 8 is a diagram showing an example of robot control by the robot control system 60. In this example, the robot 3 executes the task of putting balls located around it into a box 410; that is, the object in this example is a ball. FIG. 8 represents a series of actions of the robot 3 in order through scenes S301 to S304. The following description also indicates the correspondence with the processing flow S2.
In scene S301, the detection unit 14 acquires an input image showing a ball 421 (step S21) and uses the machine learning model 41 to detect the presence or absence of the object (the ball) in that input image (step S22). The identification unit 15 identifies the region of interest of the machine learning model 41 in that detection (steps S23 to S25). This region of interest corresponds to the position where the ball 421 exists. The robot control unit 61 selects the task area corresponding to the ball 421 based on that region of interest (step S26).
In scene S302, the robot control unit 61 controls the robot 3 so that it approaches the ball 421 (steps S27 and S28). Through this control, the distance between the end effector 3b and the ball 421 becomes shorter. The processing of steps S21 to S28 is then repeated.
In scene S303, in response to the robot 3 reaching the task area (YES in step S27), the robot control unit 61 causes the robot 3 to execute the task (step S29).
From scene S303 to scene S304, the robot 3 executes the task under the control of the robot control unit 61 (step S29). The robot 3 grips the ball 421 with the end effector 3b and puts the ball 421 into the box 410.
[Effects]
As described above, a data generation system according to one aspect of the present disclosure includes a detection unit that detects the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image, an identification unit that identifies, as a region of interest, the region noticed by the machine learning model in the detection from the input image, and an annotation unit that associates an annotation corresponding to the detected object with the region of interest.
A data generation method according to one aspect of the present disclosure is executed by a data generation system including at least one processor. This data generation method includes a step of detecting the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image, a step of identifying, as a region of interest, the region noticed by the machine learning model in the detection from the input image, and a step of associating an annotation corresponding to the detected object with the region of interest.
A data generation program according to one aspect of the present disclosure causes a computer to execute a step of detecting the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image, a step of identifying, as a region of interest, the region noticed by the machine learning model in the detection from the input image, and a step of associating an annotation corresponding to the detected object with the region of interest.
In these aspects, an annotation corresponding to the object is automatically associated with the region noticed by the machine learning model in detecting the object. Annotation of objects appearing in images can therefore be made more efficient.
A data generation system according to another aspect may further include a preparation unit that executes machine learning based on a plurality of training images, each assigned a label indicating the presence or absence of the object, to generate the machine learning model. With this configuration, the machine learning model for detecting the presence or absence of the object, that is, the machine learning model used to obtain the region of interest, can be generated.
A data generation system according to another aspect may further include a display control unit that displays a labeling user interface for assigning a label to a given image, and a labeling unit that assigns the label input through the labeling user interface to the given image to generate a training image. This configuration makes it possible to prepare training images as the user desires.
A data generation system according to another aspect may further include a display control unit that displays a correction user interface for allowing the user to correct the annotation, and the annotation unit may correct the annotation based on user input via the correction user interface. With this configuration, the user is given an opportunity to correct the automatically set annotation, and since the annotation is corrected in response to user input, the accuracy of the annotation can be improved.
In a data generation system according to another aspect, the annotation unit may associate a graphic representation corresponding to the region of interest with the region of interest as at least part of the annotation. Annotation using a graphic representation can indicate the object in the input image in an easily understandable way.
In a data generation system according to another aspect, the annotation unit may generate the circumscribed shape of the region of interest as the graphic representation. This circumscribed shape can clearly indicate the position of the object in the input image.
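As an illustration, the following sketch attaches a circumscribed rectangle to a region of interest and records it together with a class value. The binary region mask and the annotation dictionary format are assumptions made for the sketch, not a format defined in the present disclosure.

```python
import cv2
import numpy as np

def annotate_with_circumscribed_box(image, region_mask, class_value=1):
    """Attach a circumscribed rectangle to one region of interest.

    image: BGR image (H, W, 3); region_mask: binary (H, W) mask of the region.
    Returns the image with the box drawn and an assumed annotation record.
    """
    x, y, w, h = cv2.boundingRect(region_mask.astype(np.uint8))
    annotated = image.copy()
    cv2.rectangle(annotated, (x, y), (x + w, y + h), (0, 255, 0), 2)
    annotation = {"bbox": [x, y, w, h], "class": class_value}
    return annotated, annotation

# Example with a dummy image and a rectangular region of interest.
img = np.zeros((240, 320, 3), dtype=np.uint8)
mask = np.zeros((240, 320), dtype=np.uint8)
mask[80:150, 100:180] = 1
_, ann = annotate_with_circumscribed_box(img, mask)
print(ann)   # {'bbox': [100, 80, 80, 70], 'class': 1}
```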
In a data generation system according to another aspect, the detection unit may use the machine learning model to detect the presence or absence of each of a plurality of types of objects in the input image, the identification unit may identify a region of interest for each of one or more types of objects detected among the plurality of types, and the annotation unit may associate, with each of the one or more regions of interest corresponding to the detected one or more types of objects, an annotation that differs for each type of object. In this case, annotations can be assigned to the input image so that the type of object can be distinguished.
In a data generation system according to another aspect, the annotation unit may associate a class value for identifying the type of the object with the region of interest as at least part of the annotation. With this configuration, an annotation that clearly indicates what appears where in the input image can be assigned.
In a data generation system according to another aspect, the identification unit may calculate, for each of the plurality of pixels forming the input image, an attention degree indicating the degree to which that pixel was noticed by the machine learning model, and may identify the region of interest based on the plurality of attention degrees corresponding to the plurality of pixels. Since the region of interest is identified based on the per-pixel attention degrees, it can be identified in detail.
In a data generation system according to another aspect, the identification unit may select, from the plurality of pixels, one or more pixels whose attention degree is equal to or greater than a given threshold as pixels of interest, and may identify the region of interest based on the selected one or more pixels of interest. Since pixels with relatively high attention degrees are set as the region of interest, the annotation can be associated with a position where the object is highly likely to exist. As a result, the accuracy of the annotation can be further improved.
In a data generation system according to another aspect, the identification unit may identify a dense region, which is a region in which a plurality of pixels of interest are concentrated, as the region of interest when the area of the dense region is equal to or greater than a given threshold. Since a region in which pixels of interest are concentrated over a relatively wide range is set as the region of interest, the annotation can be associated with an object that appears clearly in the input image.
In a data generation system according to another aspect, the identification unit may identify a dense region, which is a region in which a plurality of pixels of interest are concentrated, as the region of interest when the area of the circumscribed shape of the dense region is equal to or greater than a given threshold. Since the region of interest is identified based on the area of the circumscribed shape of the dense region rather than the area of the dense region itself, the area can be calculated easily, and the region of interest can be identified correspondingly faster.
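The dense-region variants described above could be realized as sketched below, using connected-component labeling from SciPy; the attention threshold Ta and the area threshold on the circumscribed rectangle are assumed values chosen only for illustration.

```python
import numpy as np
from scipy import ndimage

def dense_attention_regions(attention, ta=0.5, min_box_area=100):
    """Identify regions of interest from a per-pixel attention map.

    attention: (H, W) array of attention degrees in [0, 1].
    ta: assumed attention threshold Ta; min_box_area: assumed threshold on
    the area of each cluster's circumscribed rectangle.
    Returns a list of boolean masks, one per accepted dense region.
    """
    mask = attention >= ta                       # pixels of interest
    labels, n = ndimage.label(mask)              # dense (connected) clusters
    regions = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        box_area = (xs.max() - xs.min() + 1) * (ys.max() - ys.min() + 1)
        if box_area >= min_box_area:             # keep sufficiently large ones
            regions.append(labels == i)
    return regions

print(len(dense_attention_regions(np.random.rand(224, 224))))
```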
In a data generation system according to another aspect, the identification unit may execute Grad-CAM on the machine learning model to identify the region of interest. Using Grad-CAM makes it possible to identify regions of interest for various types of machine learning models, for example various types of neural networks.
A model generation system according to one aspect of the present disclosure includes the above data generation system, an acquisition unit that acquires teacher data including the input image in which the annotation has been associated with the region of interest by the data generation system, and a learning unit that generates, based on the teacher data, a trained model for detecting at least the position of the object from an image. In this aspect, a trained model for detecting the position of the object can be generated using teacher data including annotated input images.
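As an illustration of such a learning unit, the following sketch fine-tunes an off-the-shelf detector on teacher data of the kind produced by the annotation step. The use of torchvision's Faster R-CNN, the dummy sample, and the two-class setting are assumptions and do not represent the actual trained model 42.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Assumed stand-in for the trained model 42: an off-the-shelf detector
# fine-tuned on annotated teacher data (image plus bounding box and class).
model = fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None,
                                num_classes=2)        # background + object
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

# One dummy teacher sample: an image with a single annotated box.
image = torch.rand(3, 480, 640)
target = {"boxes": torch.tensor([[100.0, 80.0, 180.0, 150.0]]),  # x1, y1, x2, y2
          "labels": torch.tensor([1])}                            # class value

model.train()
for _ in range(2):                         # a couple of dummy iterations
    loss_dict = model([image], [target])   # detection losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print({k: round(v.item(), 3) for k, v in loss_dict.items()})
```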
An estimation system according to one aspect of the present disclosure includes the above model generation system and an estimation unit that inputs a target image to the trained model generated by the model generation system and detects at least the position of the object from the target image. In this aspect, the position of the object can be efficiently detected from the target image using the trained model.
A robot control system according to one aspect of the present disclosure includes a detection unit that detects the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image, an identification unit that identifies, as a region of interest, the region noticed by the machine learning model in the detection from the input image, and a robot control unit that controls a robot, which processes the object, based on the region of interest.
In this aspect, the robot is controlled based on the region noticed by the machine learning model in detecting the object, so the robot can be operated autonomously in accordance with the position of the object.
In a robot control system according to another aspect, the robot control unit may control the robot so that the robot approaches the object. In this case, the robot can be made to approach the object autonomously in accordance with the position of the object.
In a robot control system according to another aspect, the detection unit may further detect, using the machine learning model, the presence or absence of the object in a new input image acquired after the robot has approached the object; the identification unit may identify, from the new input image, the region noticed by the machine learning model in that further detection as a new region of interest; and the robot control unit may further control the robot based on the new region of interest. When the robot approaches the object, the region of interest is identified again, and the robot is further controlled based on that region of interest. This mechanism allows the robot to operate with higher precision.
A trained model manufacturing method according to one aspect of the present disclosure includes the above data generation method and a step of generating, based on teacher data including the input image in which the annotation has been associated with the region of interest by the data generation method, a trained model for detecting at least the position of the object from an image. Since input images that have been automatically annotated are used as at least part of the teacher data, a trained model for object detection can be generated efficiently.
[Modifications]
The above has been a detailed description based on embodiments of the present disclosure. However, the present disclosure is not limited to the above examples. Various modifications can be made to the present disclosure without departing from its gist.
The functional configuration of the systems according to the present disclosure is not limited to the above examples. For example, the data generation system 10 may be constructed on its own, without the model generation system 20 and the estimation system 30. In that case, the computer systems corresponding to the model generation system 20 and the estimation system 30 may be computer systems with an owner different from that of the data generation system 10. Alternatively, a combination of the data generation system 10 and the model generation system 20 may be constructed without the estimation system 30, in which case the computer system corresponding to the estimation system 30 may be a computer system with an owner different from that of the data generation system 10 and the model generation system 20. The data generation system and the robot control system according to the present disclosure need not include the display control unit 11, the labeling unit 12, and the preparation unit 13; that is, they may use a machine learning model generated by another computer system. Since machine learning models and trained models are portable between computer systems, the various systems according to the present disclosure can be implemented flexibly.
The hardware configuration of the systems according to the present disclosure is not limited to an implementation in which each functional module is realized by executing a program. For example, at least some of the functional modules described above may be configured by a logic circuit specialized for their functions, or by an ASIC (Application Specific Integrated Circuit) integrating such logic circuits.
The processing procedure of a method executed by at least one processor is not limited to the above examples. For example, some of the steps or processes described above may be omitted, or the steps may be executed in a different order. Two or more of the steps described above may be combined, and some of the steps may be modified or deleted. Alternatively, other steps may be executed in addition to the above steps.
When comparing the magnitudes of two numerical values in a computer system or a computer, either of the two criteria "equal to or greater than" and "greater than" may be used, and either of the two criteria "equal to or less than" and "less than" may be used.
DESCRIPTION OF SYMBOLS: 1... Object detection system, 2... Robot system, 3... Robot, 3b... End effector, 4... Robot controller, 10... Data generation system, 11... Display control unit, 12... Labeling unit, 13... Preparation unit, 14... Detection unit, 15... Identification unit, 16... Annotation unit, 20... Model generation system, 21... Learning unit, 30... Estimation system, 31... Estimation unit, 41... Machine learning model, 42... Trained model, 51... First image database, 52... Second image database, 60... Robot control system, 61... Robot control unit, 200... Input image, 221, 222... Dense region, 231, 232... Annotation, 421... Ball (object).

Claims (21)

1. A data generation system comprising:
a detection unit that detects the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image;
a specifying unit that specifies, as a region of interest, a region noticed by the machine learning model in the detection from the input image; and
an annotation unit that associates an annotation corresponding to the detected object with the region of interest.
2. The data generation system according to claim 1, further comprising a preparation unit that executes machine learning based on a plurality of training images, each assigned a label indicating the presence or absence of the object, to generate the machine learning model.
3. The data generation system according to claim 2, further comprising:
a display control unit that displays a labeling user interface for assigning the label to a given image; and
a labeling unit that assigns the label input through the labeling user interface to the given image to generate the training image.
4. The data generation system according to any one of claims 1 to 3, further comprising a display control unit that displays a correction user interface for allowing a user to correct the annotation,
wherein the annotation unit corrects the annotation based on user input via the correction user interface.
5. The data generation system according to any one of claims 1 to 4, wherein the annotation unit associates a graphic representation corresponding to the region of interest with the region of interest as at least part of the annotation.
6. The data generation system according to claim 5, wherein the annotation unit generates a circumscribed shape of the region of interest as the graphic representation.
7. The data generation system according to any one of claims 1 to 6, wherein:
the detection unit uses the machine learning model to detect the presence or absence of each of a plurality of types of objects in the input image;
the specifying unit specifies the region of interest for each of one or more types of objects detected among the plurality of types of objects; and
the annotation unit associates, with each of one or more regions of interest corresponding to the detected one or more types of objects, an annotation that differs for each type of object.
8. The data generation system according to any one of claims 1 to 7, wherein the annotation unit associates a class value for identifying a type of the object with the region of interest as at least part of the annotation.
9. The data generation system according to any one of claims 1 to 8, wherein the specifying unit:
calculates, for each of a plurality of pixels forming the input image, an attention degree indicating a degree to which the pixel was noticed by the machine learning model; and
specifies the region of interest based on a plurality of the attention degrees corresponding to the plurality of pixels.
10. The data generation system according to claim 9, wherein the specifying unit:
selects, from the plurality of pixels, one or more pixels whose attention degree is equal to or greater than a given threshold as pixels of interest; and
specifies the region of interest based on the selected one or more pixels of interest.
11. The data generation system according to claim 10, wherein the specifying unit specifies a dense region, which is a region in which a plurality of the pixels of interest are concentrated, as the region of interest when an area of the dense region is equal to or greater than a given threshold.
12. The data generation system according to claim 10 or 11, wherein the specifying unit specifies a dense region, which is a region in which a plurality of the pixels of interest are concentrated, as the region of interest when an area of a circumscribed shape of the dense region is equal to or greater than a given threshold.
13. The data generation system according to any one of claims 1 to 12, wherein the specifying unit executes Grad-CAM on the machine learning model to specify the region of interest.
14. A model generation system comprising:
the data generation system according to any one of claims 1 to 13;
an acquisition unit that acquires teacher data including the input image in which the annotation has been associated with the region of interest by the data generation system; and
a learning unit that generates, based on the teacher data, a trained model for detecting at least a position of the object from an image.
15. An estimation system comprising:
the model generation system according to claim 14; and
an estimation unit that inputs a target image to the trained model generated by the model generation system and detects at least the position of the object from the target image.
16. A robot control system comprising:
a detection unit that detects the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image;
a specifying unit that specifies, as a region of interest, a region noticed by the machine learning model in the detection from the input image; and
a robot control unit that controls a robot that processes the object based on the region of interest.
17. The robot control system according to claim 16, wherein the robot control unit controls the robot so that the robot approaches the object.
18. The robot control system according to claim 16 or 17, wherein:
the detection unit further detects, using the machine learning model, the presence or absence of the object in a new input image acquired after the robot approaches the object;
the specifying unit specifies, from the new input image, a region noticed by the machine learning model in the further detection as a new region of interest; and
the robot control unit further controls the robot based on the new region of interest.
19. A data generation method executed by a data generation system comprising at least one processor, the method comprising:
detecting the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image;
specifying, as a region of interest, a region noticed by the machine learning model in the detection from the input image; and
associating an annotation corresponding to the detected object with the region of interest.
20. A trained model manufacturing method comprising:
the data generation method according to claim 19; and
generating, based on teacher data including the input image in which the annotation has been associated with the region of interest by the data generation method, a trained model for detecting at least a position of the object from an image.
21. A data generation program causing a computer to execute:
detecting the presence or absence of an object in an input image using a machine learning model that detects the presence or absence of the object based on an image;
specifying, as a region of interest, a region noticed by the machine learning model in the detection from the input image; and
associating an annotation corresponding to the detected object with the region of interest.
PCT/JP2021/044058 2021-12-01 2021-12-01 Data generation system, model generation system, estimation system, trained model production method, robot control system, data generation method, and data generation program WO2023100282A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/044058 WO2023100282A1 (en) 2021-12-01 2021-12-01 Data generation system, model generation system, estimation system, trained model production method, robot control system, data generation method, and data generation program

Publications (1)

Publication Number Publication Date
WO2023100282A1 true WO2023100282A1 (en) 2023-06-08

Family

ID=86611780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/044058 WO2023100282A1 (en) 2021-12-01 2021-12-01 Data generation system, model generation system, estimation system, trained model production method, robot control system, data generation method, and data generation program

Country Status (1)

Country Link
WO (1) WO2023100282A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019150628A1 (en) * 2018-01-30 2019-08-08 三菱電機株式会社 Entry area extraction device and entry area extraction program
JP2020107170A (en) * 2018-12-28 2020-07-09 住友電気工業株式会社 Annotation device, learning model, image sensor, annotation method, and computer program
JP2021033494A (en) * 2019-08-21 2021-03-01 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America Annotation support method, annotation support device, and annotation support program
WO2021125019A1 (en) * 2019-12-17 2021-06-24 株式会社Preferred Networks Information system, information processing method, information processing program and robot system
WO2021152727A1 (en) * 2020-01-29 2021-08-05 楽天グループ株式会社 Object recognition system, positional information acquisition method, and program
JP2021163078A (en) * 2020-03-31 2021-10-11 Jfeスチール株式会社 Foreign matter detection device, foreign matter removal device, and foreign matter detection method

Similar Documents

Publication Publication Date Title
CN114080583B (en) Visual teaching and repetitive movement manipulation system
Barth et al. Design of an eye-in-hand sensing and servo control framework for harvesting robotics in dense vegetation
US11407111B2 (en) Method and system to generate a 3D model for a robot scene
US11717959B2 (en) Machine learning methods and apparatus for semantic robotic grasping
SE526119C2 (en) Method and system for programming an industrial robot
CN111462154A (en) Target positioning method and device based on depth vision sensor and automatic grabbing robot
Schröder et al. Real-time hand tracking with a color glove for the actuation of anthropomorphic robot hands
Zhang et al. Sim2real learning of obstacle avoidance for robotic manipulators in uncertain environments
JP6973444B2 (en) Control system, information processing device and control method
CN113829343A (en) Real-time multi-task multi-person man-machine interaction system based on environment perception
Bertino et al. Experimental autonomous deep learning-based 3d path planning for a 7-dof robot manipulator
Cong Visual servoing control of 4-DOF palletizing robotic arm for vision based sorting robot system
Zhang et al. Recent advances on vision-based robot learning by demonstration
WO2023100282A1 (en) Data generation system, model generation system, estimation system, trained model production method, robot control system, data generation method, and data generation program
CN109934155B (en) Depth vision-based collaborative robot gesture recognition method and device
Ye et al. Design of Industrial Robot Teaching System Based on Machine Vision
WO2023286138A1 (en) Robot control system, robot system, robot control method, and robot control program
Al-Shanoon et al. DeepNet‐Based 3D Visual Servoing Robotic Manipulation
Maeda et al. Lighting-and occlusion-robust view-based teaching/playback for model-free robot programming
Al-Shanoon Developing a mobile manipulation system to handle unknown and unstructured objects
Chaudhary et al. Visual Feedback based Trajectory Planning to Pick an Object and Manipulation using Deep learning
KR20240096990A (en) Control Device of Robot for Moving the Position of Non-fixed Object
RAMÍREZ Visual Servoing for Reaching and Grasping Behaviors
Hu Research on trajectory control of multi‐degree‐of‐freedom industrial robot based on visual image
CA2625805A1 (en) System and method for image mapping and visual attention

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21966366

Country of ref document: EP

Kind code of ref document: A1