WO2021220191A2 - Automatic production of a training dataset for a neural network - Google Patents

Automatic production of a training dataset for a neural network

Info

Publication number
WO2021220191A2
WO2021220191A2 (PCT/IB2021/053529)
Authority
WO
WIPO (PCT)
Prior art keywords
objects
representation
scenography
work area
neural network
Prior art date
Application number
PCT/IB2021/053529
Other languages
French (fr)
Other versions
WO2021220191A3 (en)
Inventor
Carlo Bazzica
Original Assignee
Bazzica Engineering S.R.L.
Priority date
Filing date
Publication date
Application filed by Bazzica Engineering S.R.L. filed Critical Bazzica Engineering S.R.L.
Publication of WO2021220191A2 publication Critical patent/WO2021220191A2/en
Publication of WO2021220191A3 publication Critical patent/WO2021220191A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects


Abstract

A neural network training method comprising the phases of: virtually creating a scenography (2) that represents a physical work area (3) in which different objects (7) to be handled are arranged, the scenography (2) modeling a 3D representation of the objects (7) and implementing the laws of physics to which the objects (7) would be subjected in a real physical environment; processing 3D images (10) extracted from the scenography (2) by extracting from each image labeling data (11) associated with the objects (7) represented in the image, thus forming a training dataset (12); and providing the training dataset (12) to a neural network (13) to carry out a network setup phase that implements the artificial intelligence for recognizing different types of objects (7) and for determining the positions of the individual objects (7) in the physical work area.

Description

AUTOMATIC PRODUCTION OF A TRAINING DATASET FOR A NEURAL NETWORK
Cross Reference to Related Applications
This patent application claims priority to Italian patent application no. 102020000009283 filed on 28.04.2020, the disclosure of which is incorporated herein by reference.
Technical Field of the Invention
The present invention relates, in general, to the field of neural networks, and more particularly to the automatic production of a training dataset for a neural network.
State of the Art
As is known, an increasing number of objects are sold online through e-commerce platforms where a vast variety of different objects characterized by an ever-shorter life cycle is offered.
Companies operating in this field have found themselves managing an increasing number of objects in a limited time-frame and sometimes even in small quantities.
In other words, the number of Stock Keeping Units (SKUs), i.e. different objects to be managed, has increased, while at the same time the average batch size of identical objects to be managed and the lead times have significantly decreased.
For the physical handling of these objects from the warehouses to the shipping channels, logistic handling systems have been developed in which a robotic device is provided with a grasping member designed to pick up the objects moving along conveyors, for example a conveyor belt, and to direct them to different destinations.
In order to carry out the grasping operations in a completely automatic way, the robotic device is provided with a vision system that uses artificial intelligence for recognizing the different types of objects and determining the positions of the individual objects on the conveyor belt, in order to grasp the objects arranged randomly on a support (conveyor belt and/or logistic bin).
In particular, the vision systems can conveniently use:
- “Active Stereo” and “Structured-Light” 3D “Depth Vision” artificial vision sensors, and
- Recognition of objects to be grasped by means of software frameworks based on convolutional neural network models of the “Deep Learning - R-CNN” (Region Based Convolutional Neural Network) type.
As is known, neural networks need training datasets to train the neural network and enable it to produce the expected result with regard to the automatic recognition of the types of objects and of their positions.
Production of a neural network training dataset, for example an R-CNN (Region Based Convolutional Neural Network) neural network, is certainly the most critical phase during the logistic system setup.
During production of a training dataset it is necessary to manually take a series of photographs of objects arranged randomly on the conveyor belt in a work area.
An operator must then manually act on the images by labeling via a graphic interface the images of the objects considered interesting for the creation of the training dataset.
Figure 1 shows a prior art labeling during which the bounds of the objects considered interesting (A, B, C, D, E, F, G) are indicated with a polyline; the angles that the longitudinal axes of the objects (elongated objects are shown in the example) form with respect to a reference plane are also indicated. In this way, a series of labeling data is manually produced.
The set of all the labeling data of all the images forms the training dataset based on which the neural network is then trained.
The article by Stephan R. Richter et al., “Playing for Data: Ground Truth from Computer Games”, 17 September 2016 (2016-09-17), Big Data Analytics in the Social and Ubiquitous Context: 5th International Workshop on Modeling Social Media, Ubiquitous and Social Environments, MUSE 2014, and First International Workshop on Machine Learning, proposes a solution to the high-cost problem caused by the amount of human effort required to create large datasets with pixel-level labels. In this article, an approach is presented that enables the rapid creation of pixel-level-accurate semantic label maps for images extracted from modern computer games. Although the source code and the inner workings of commercial games are inaccessible, the associations between image patches can be reconstructed from the communication between the game and the graphics hardware. This allows for a rapid propagation of semantic labels within and across the images synthesized by the game, without having access to the source code or the content. The presented approach is validated by producing dense semantic pixel-level labeling for 25,000 images synthesized by an open-world photorealistic computer game. Experiments on semantic segmentation datasets show that using the acquired data to supplement real-world images significantly increases accuracy and that the acquired data can reduce the amount of hand-tagged real-world data.
Object and Summary of the Invention
The Applicant has experienced that, depending on the morphological complexity of the set of objects, it may be necessary to acquire even several hundred images, associating each of them with the labeling data. During this tedious work phase, which can take several days of work time, it is sufficient for the operator to make only a few mistakes, for example associating the labeling data of an object with a different object or drawing incorrect polylines (or any other kind of error caused by the repetitiveness of this type of work), to produce extremely negative effects on the neural network training procedure; the entire vision system then becomes unstable, inaccurate and, ultimately, unusable.
The aim of the present invention is to provide an automatic neural network training dataset production methodology that overcomes the drawbacks of the known methodologies, which require the intervention of an operator.
According to the present invention, a logistic system is provided, as claimed in the appended claims.
Brief Description of the Drawings
Figure 1 shows a prior art solution.
Figure 2 schematically shows the automatic neural network training dataset production methodology of the present invention.
Figures 3 and 4 show examples of working environments virtually generated according to the automatic neural network training dataset production methodology of the present invention.
Detailed Description of Preferred Embodiments of the Invention
The present invention will now be described in detail with reference to the attached figures so as to allow a person skilled in the art to develop and implement it. Various modifications to the embodiments described will be immediately evident to those skilled in the art and the generic principles described can be applied to other embodiments and applications without thereby departing from the scope of the present invention, as defined in the appended claims. Therefore, the present invention should not be considered limited to the embodiments described and illustrated, but should be accorded the broadest scope according to the principles and characteristics described and claimed herein.
Unless otherwise defined, all the technical and scientific terms used herein have the same meaning commonly used by persons of ordinary experience in the field pertaining to the present invention. In the event of conflict, the present description, including the definitions provided, will be binding. Furthermore, the examples are provided for illustrative purposes only and as such should not be considered limiting.
In particular, the block diagrams included in the attached figures and described in the following are not intended as a representation of structural characteristics or constructive limitations, but must be interpreted as a representation of functional characteristics, i.e. intrinsic properties of the devices defined by the effects obtained, in other words functional limitations, that can be implemented in different ways so as to protect the functionality thereof (possibility of functioning).
In order to facilitate the understanding of the embodiments described herein, reference will be made to some specific embodiments and a specific language will be used to describe the same. The terminology used herein has the purpose of describing only particular embodiments, and is not intended to limit the scope of the present invention.
As shown in Figure 2, the present invention comprises the following macro phases:
A) virtually creating a scenography 2 representing a physical work area 3 in which different objects 7 to be handled are arranged; the scenography 2 models a 3D representation of the objects 7 and implements the laws of physics (e.g., gravity, laws of motion, etc.) to which the objects 7 would be subjected in a real physical environment;
B) processing 3D images 10 extracted from the scenography 2 by extracting from each image labeling data 11 associated with the objects 7 represented in the image 10, thus automatically forming a training dataset 12; and
C) providing the training dataset 12 to a neural network 13 to carry out a set-up phase of the neural network 13 which implements the artificial intelligence for recognizing different types of objects 7 and determining positions of individual objects 7 in a real work area.
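Purely by way of illustration, the three macro phases can be sketched in code as follows; every name in this sketch (create_scenography, shoot_and_label, etc.) is a hypothetical placeholder invented for the example and does not correspond to any library or to code disclosed in this application.

```python
# Illustrative sketch of macro phases A)-C). All names are hypothetical
# placeholders, not taken from the patent disclosure.
import random
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    class_name: str
    pose: tuple = (0.0, 0.0, 0.0)            # x, y, z in the virtual work area

@dataclass
class LabeledImage:
    objects: list                             # objects visible in the image
    labeling_data: list = field(default_factory=list)  # labeling data (11)

def create_scenography(object_classes, n_objects=10):
    """Phase A): place objects in the virtual work area (first arrangement)."""
    return [SceneObject(random.choice(object_classes)) for _ in range(n_objects)]

def mix(scene):
    """Stand-in for the physics-based mixing that yields the second arrangement."""
    for obj in scene:
        obj.pose = tuple(random.uniform(0.0, 1.0) for _ in range(3))
    return scene

def shoot_and_label(scene):
    """Phase B): one 'layering' image per object, with its labeling data."""
    images = []
    for obj in scene:
        label = {"class": obj.class_name, "position": obj.pose}
        images.append(LabeledImage(objects=[obj], labeling_data=[label]))
    return images

def build_training_dataset(object_classes, n_scenes=100):
    dataset = []
    for _ in range(n_scenes):
        scene = mix(create_scenography(object_classes))
        dataset.extend(shoot_and_label(scene))
    return dataset                            # phase C) feeds this to the network

if __name__ == "__main__":
    print(len(build_training_dataset(["bottle", "box", "can"])))
```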
In the example shown in Figure 3, the scenography 2 represents a physical work area 3 where a robotic device 4 is provided with a grasping member 5 operable to grasp different objects 7 which move along a conveyor device 8 on which the objects 7 are randomly arranged; the robotic device 4 represented in the scenography 2 is driven by an object recognition system of the Region Based Convolutional Neural Network type. However, it goes without saying that the work area may be different; for example, it may be represented by a bin in which the objects 7 are arranged randomly overlapping.
In greater detail, in phase A), the scenography 2 comprises the 3D representation of each object 7 (see Figure 4) including the representation of the walls 15 that delimit the internal volume of the object 7, and the representation of the portions of walls 15p of the object 7 that co-operate with the grasping member 5 to allow the object 7 to be grasped from the conveyor device 8.
The 3D representation of each object 7 is associated with characteristic parameters of the object 7 such as size, weight, density, etc., which are used by the laws of physics implemented in the scenography 2 to characterize the physical-dynamic behavior of the object 7.
The 3D representation of the object 7 is produced by a 3D CAD system and/or by a 3D scanner operated to scan a physical object 7.
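A minimal sketch of how the 3D representation of an object 7 and its characteristic parameters might be held together in data is given below; the field names and the derivation of the mass from density and a bounding-box volume are illustrative assumptions, not details taken from this disclosure.

```python
# Sketch of per-object data used by the physics of the scenography (phase A).
# Field names and the mass computation are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ObjectModel:
    name: str
    mesh_file: str                 # e.g. exported from a 3D CAD system or a 3D scanner
    size_m: tuple                  # bounding dimensions (x, y, z) in metres
    density_kg_m3: float           # characteristic parameter used by the physics laws
    graspable_faces: list = field(default_factory=list)  # wall portions 15p

    @property
    def volume_m3(self) -> float:
        x, y, z = self.size_m
        return x * y * z           # coarse bounding-box volume

    @property
    def mass_kg(self) -> float:
        # The physics engine needs a mass; here it is derived from the density.
        return self.density_kg_m3 * self.volume_m3

box = ObjectModel("cardboard_box", "models/box.obj", (0.30, 0.20, 0.15),
                  density_kg_m3=120.0, graspable_faces=["top"])
print(f"{box.name}: mass = {box.mass_kg:.2f} kg")
```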
In greater detail, in the scenography 2 a mixing operation is modeled (in Figure 2 this operation is indicated with 2m) which rearranges the objects from a first arrangement into a second arrangement in which the objects are randomly arranged in space, one with respect to the other. This operation is extremely important as it serves to create a work environment in which the three-dimensional objects are arranged in bulk; the subsequent recognition operations thus take place in a “difficult” environment, in order to carry out an extremely competitive training.
In the example shown, the mixing operation is modeled by the fall of moving objects from a first (upper) position to a second (lower) position of the work area (Figure 4); the physical laws of dynamics which create the trajectories of falling objects and which represent the rebound of the objects contribute to the mixing operation.
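The following sketch illustrates a drop-based mixing operation of this kind using pybullet as an example physics engine; the application does not name any specific engine, and the object models and quantities used here are arbitrary stand-ins.

```python
# Sketch of the mixing operation 2m: bodies are spawned above the work area
# and dropped, and the physics engine computes trajectories, rebounds and the
# final random (second) arrangement. pybullet is only an example engine.
import random
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                               # headless simulation
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)
p.loadURDF("plane.urdf")                          # floor of the work area

# First (upper) arrangement: bodies spawned at random poses above the floor.
bodies = []
for _ in range(8):
    pos = [random.uniform(-0.2, 0.2), random.uniform(-0.2, 0.2),
           random.uniform(0.3, 0.8)]
    orn = p.getQuaternionFromEuler([random.uniform(0, 3.14) for _ in range(3)])
    bodies.append(p.loadURDF("cube_small.urdf", pos, orn))

# Let the objects fall and settle into the second (lower) arrangement.
for _ in range(2000):                             # about 8 s at the default 240 Hz step
    p.stepSimulation()

for body in bodies:
    pos, orn = p.getBasePositionAndOrientation(body)
    print(body, [round(c, 3) for c in pos])
p.disconnect()
```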
In phase B) the images are extracted by carrying out a shooting phase in which the representation of the set of objects, as arranged in the second arrangement, is taken.
In the example shown, the images of the objects arranged randomly one with respect to the other, following the fall, are taken.
In particular, each individual object is represented isolated from the other objects of the set by extracting a plurality of layering images in which the object maintains its position relative to the other objects, which are not represented.
Phase B) is carried out on the individual layering images by providing labeling data associated with the objects represented individually in each image.
As is known, the labeling data can be represented in different formats depending on the R-CNN framework for which the dataset is intended.
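As an example of one such format, the sketch below serializes labeling data into a COCO-style annotation file, a layout accepted by several R-CNN frameworks; the image, category and coordinate values are invented for illustration.

```python
# Example of serializing labeling data (11) into a COCO-style annotation file,
# one common layout for R-CNN frameworks. All concrete values are invented.
import json

dataset = {
    "images": [{"id": 1, "file_name": "scene_0001_layer_A.png",
                "width": 1280, "height": 960}],
    "categories": [{"id": 1, "name": "bottle"}],
    "annotations": [{
        "id": 1, "image_id": 1, "category_id": 1,
        "bbox": [412.0, 233.0, 96.0, 240.0],                         # x, y, width, height
        "segmentation": [[412, 233, 508, 233, 508, 473, 412, 473]],  # polyline outline
        "area": 96.0 * 240.0,
        "iscrowd": 0,
    }],
}

with open("train_annotations.json", "w") as f:
    json.dump(dataset, f, indent=2)
```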
In phase B) an automatic procedure can extract the outline of the images of a generic class of objects 7, and on the basis of this outline information the labeling data can be extracted. In phase B), some images on which the labeling data extraction operation proves difficult may also be discarded. The mixing operation described above, which creates a totally random spatial arrangement of the objects in the second arrangement, could produce an arrangement which, as represented in the extracted images, makes the outline operation difficult. At the end of phase B), statistical data can be provided on the rejected images and on the images actually used for the production of labeling data.
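A possible implementation of this outline-based labeling on a single layering image is sketched below: from a binary object mask it derives a bounding box and the angle of the longitudinal axis (as in the labeling of Figure 1), discards masks that are too small to outline reliably, and tallies accepted and rejected images. The thresholds and the toy mask are illustrative assumptions.

```python
# Sketch of the automatic labeling step of phase B) on one layering mask.
import numpy as np

def label_from_mask(mask: np.ndarray, min_pixels: int = 50):
    ys, xs = np.nonzero(mask)
    if xs.size < min_pixels:                    # outline too difficult: discard
        return None
    x0, y0, x1, y1 = xs.min(), ys.min(), xs.max(), ys.max()
    # Orientation of the longitudinal axis from the second central moments.
    x_c, y_c = xs.mean(), ys.mean()
    cov = np.cov(np.stack([xs - x_c, ys - y_c]))
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, np.argmax(eigvals)]
    angle_deg = float(np.degrees(np.arctan2(major[1], major[0])))
    return {"bbox": [int(x0), int(y0), int(x1), int(y1)], "angle_deg": angle_deg}

# Toy example: an elongated diagonal blob stands in for a rendered layer mask.
mask = np.zeros((100, 100), dtype=np.uint8)
for i in range(20, 80):
    mask[i, i - 5:i + 5] = 1

accepted, rejected = 0, 0
for m in [mask, np.zeros((100, 100), dtype=np.uint8)]:   # second mask is empty
    result = label_from_mask(m)
    if result is None:
        rejected += 1                            # statistics on discarded images
    else:
        accepted += 1
        print(result)
print(f"accepted={accepted}, rejected={rejected}")
```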
Conveniently, phases A) - B) are carried out on a software platform configured for the creation of video games.
The system described above is completely implemented by a computer and therefore provides a training dataset in a completely automatic way, avoiding the possibility that manual errors introduced by an operator lead to the creation of an unsuitable training dataset. In addition, a long and tedious manual work phase is eliminated. In the real work area, a logistic system is obtained in which a real robotic device 4 is provided with a physical grasping member 5 for objects 7 that differ from one another and are arranged randomly on a transport and/or accumulation device; the robotic device 4 is guided by an object recognition system which uses a neural network that has been set up using the training dataset obtained by means of the method described above.
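For completeness, the sketch below shows what the deployment side could look like: a detector trained on the synthetic dataset is applied to a real camera frame and the most confident detection is turned into a pixel-space grasp target for the grasping member. torchvision's Faster R-CNN is used only as one example of an R-CNN framework, and the weight file and camera frame are hypothetical.

```python
# Sketch of deployment: the network trained on synthetic data guides grasping.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, num_classes=4)                 # background + 3 object classes
# Hypothetical checkpoint produced by the training of phase C).
model.load_state_dict(torch.load("synthetic_trained_rcnn.pth"))
model.eval()

image = torch.rand(3, 480, 640)                  # stand-in for a real camera frame
with torch.no_grad():
    detections = model([image])[0]               # dict with boxes, labels, scores

# Pick the most confident detection and convert its box center to a grasp point.
if detections["scores"].numel() > 0:
    best = int(detections["scores"].argmax())
    x0, y0, x1, y1 = detections["boxes"][best].tolist()
    grasp_px = ((x0 + x1) / 2.0, (y0 + y1) / 2.0)   # pixel target for the grasping member
    print("class:", int(detections["labels"][best]), "grasp at", grasp_px)
else:
    print("no object detected")
```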

Claims

1. A logistic system comprising a robotic device (4) with a grasping member (5) operable to grasp different objects (7) randomly arranged on a transport and/or accumulation device in a work area (3); the robotic device (4) is driven by an object recognition system operating based on a neural network trained to recognize the different objects (7) and determine positions thereof in the work area (3) by using a dataset automatically produced by a computer-implemented video gaming software platform designed to:
A) virtually create a scenography (2) representing the work area (3) where the different objects (7) to be grasped are arranged; the scenography models a 3D representation of the objects (7) and implements the laws of physics to which the objects (7) would be subjected in a real environment; and
B) process a plurality of 3D images (10) extracted from the scenography (2) by extracting from each processed image labeling data (11) associated with the objects (7) represented in the processed image, thus forming a training dataset (12) for the neural network.
2. The logistic system of claim 1, wherein the 3D representation of an object (7) comprises the representation of external walls (15) that delimit an internal volume of the object (7), and the representation of portions (15p) of the object (7) intended to co-operate with a device, for example the grasping member (5), operating in the work area (3).
3. The logistic system of claim 1 or 2, wherein the 3D representation of an object (7) is associated with characteristic parameters of the object (7) such as size, weight, density, etc., which are used by the laws of physics implemented in the scenography (2) to characterize the physical-dynamic behavior of the object (7).
4. The logistic system of any one of the preceding claims, wherein the video gaming software platform is further designed to model in the scenography (2) an object mixing operation during which the arrangement of the objects (7) changes from a first arrangement to a second arrangement in which the objects (7) are mutually randomly spatially arranged.
5. The logistic system of claim 4, wherein the mixing operation is modeled to cause moving objects (7) to fall from an upper position to a lower position in the work area (3); the physical laws of dynamics that produce the trajectories of the falling objects (7) and that represent the rebound of objects (7) contribute to the mixing operation.
6. The logistic system of claim 4 or 5, wherein in phase B) the images are extracted by carrying out a shooting phase in which a representation of the objects (7) arranged in the second arrangement is taken.
7. The logistic system of claim 6, wherein in the shooting phase each individual object (7) is represented isolated from the other objects (7) by extracting layering images in which the object (7) maintains its position relative to the other objects (7) that are not represented.
8. The logistic system of claim 7, wherein phase B) is carried out on the layering images by providing labeling data (11) associated with the objects (7) individually represented in each image.
9. The logistic system of any of the preceding claims, wherein the 3D representation of an object (7) is obtained by means of a 3D CAD system or by means of a 3D scanner that scans the object (7).
PCT/IB2021/053529 2020-04-28 2021-04-28 Automatic production of a training dataset for a neural network WO2021220191A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IT102020000009283A IT202000009283A1 (en) 2020-04-28 2020-04-28 AUTOMATIC PRODUCTION OF A TRAINING DATA SET FOR A NEURAL NETWORK
IT102020000009283 2020-04-28

Publications (2)

Publication Number Publication Date
WO2021220191A2 true WO2021220191A2 (en) 2021-11-04
WO2021220191A3 WO2021220191A3 (en) 2022-01-13

Family

ID=71994716

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2021/053529 WO2021220191A2 (en) 2020-04-28 2021-04-28 Automatic production of a training dataset for a neural network

Country Status (2)

Country Link
IT (1) IT202000009283A1 (en)
WO (1) WO2021220191A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116416217A (en) * 2023-03-06 2023-07-11 赛那德科技有限公司 Method, system and equipment for generating unordered stacking parcel image

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6607261B2 (en) * 2015-12-24 2019-11-20 富士通株式会社 Image processing apparatus, image processing method, and image processing program
GB2568475A (en) * 2017-11-15 2019-05-22 Cubic Motion Ltd A method of generating training data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116416217A (en) * 2023-03-06 2023-07-11 赛那德科技有限公司 Method, system and equipment for generating unordered stacking parcel image
CN116416217B (en) * 2023-03-06 2023-11-28 赛那德科技有限公司 Method, system and equipment for generating unordered stacking parcel image

Also Published As

Publication number Publication date
IT202000009283A1 (en) 2021-10-28
WO2021220191A3 (en) 2022-01-13

Similar Documents

Publication Publication Date Title
US20230264109A1 (en) System and method for toy recognition
US11308689B2 (en) Three dimensional scanning and data extraction systems and processes for supply chain piece automation
KR102332603B1 (en) Robotic system for palletizing packages using real-time placement simulation
Mahler et al. Learning ambidextrous robot grasping policies
US10755437B2 (en) Information processing device, image recognition method and non-transitory computer readable medium
CN107463946B (en) Commodity type detection method combining template matching and deep learning
Bormann et al. Towards automated order picking robots for warehouses and retail
Wang et al. Dense robotic packing of irregular and novel 3D objects
Hinkle et al. Predicting object functionality using physical simulations
Hutabarat et al. Combining virtual reality enabled simulation with 3D scanning technologies towards smart manufacturing
WO2021220191A2 (en) Automatic production of a training dataset for a neural network
Garcia-Garcia et al. A study of the effect of noise and occlusion on the accuracy of convolutional neural networks applied to 3D object recognition
Periyasamy et al. Synpick: A dataset for dynamic bin picking scene understanding
Buls et al. Generation of synthetic training data for object detection in piles
Imtiaz et al. Prehensile and non-prehensile robotic pick-and-place of objects in clutter using deep reinforcement learning
Le et al. Deformation-aware data-driven grasp synthesis
Jia et al. Robot Online 3D Bin Packing Strategy Based on Deep Reinforcement Learning and 3D Vision
CN112633187B (en) Automatic robot carrying method, system and storage medium based on image analysis
Sauvet et al. Model-based grasping of unknown objects from a random pile
Gouda et al. DoPose-6D dataset for object segmentation and 6D pose estimation
Mojtahedzadeh Safe robotic manipulation to extract objects from piles: From 3D perception to object selection
JP6730091B2 (en) Loading procedure determination device and loading procedure determination program
Gouda et al. Object class-agnostic segmentation for practical CNN utilization in industry
Fichtl et al. Bootstrapping relational affordances of object pairs using transfer
Jonker Robotic bin-picking pipeline for chicken fillets with deep learning-based instance segmentation using synthetic data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21728113

Country of ref document: EP

Kind code of ref document: A2

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21728113

Country of ref document: EP

Kind code of ref document: A2