CN116363085B - Industrial part target detection method based on small sample learning and virtual synthesized data
- Publication number: CN116363085B (application CN202310274497.5A)
- Authority: CN (China)
- Prior art keywords: data, virtual, real, network, information
- Prior art date: 2023-03-21
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/0004 — Industrial image inspection (under G06T7/00 Image analysis; G06T7/0002 Inspection of images, e.g. flaw detection)
- G06N3/08 — Learning methods (under G06N3/02 Neural networks)
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06T2207/10004 — Still image; photographic image
- G06T2207/10012 — Stereo images
- G06T2207/20081 — Training; learning
- G06T2207/30108 — Industrial image inspection
- G06T2207/30164 — Workpiece; machine component
- Y02P90/30 — Computing systems specially adapted for manufacturing
Abstract
The invention discloses an industrial part target detection method based on small-sample learning and virtual synthetic data. First, a synthetic dataset of parts geometrically similar to the real parts is produced with virtual simulation software, and the labeling information of each image is output at the same time; the whole data acquisition process runs automatically without tedious manual operation. The acquired synthetic data are preprocessed, converting their image format and labeling information into the format required for training the target detection network, and the target detection network based on small-sample learning is then trained on the synthetic data so that it acquires the ability to detect on synthetic data. Images are then captured in the real industrial scene and labeled manually; after labeling, the target detection network is trained a second time to fine-tune its parameters. Finally, the trained network detects and outputs, in real time, the classes and positions of the parts in the current scene image.
Description
Technical Field
The invention relates to an industrial part target detection method based on small-sample learning and virtual synthetic data, and belongs to the technical fields of computer vision, target detection, virtual data synthesis, and intelligent robotics.
Background
Robot factories driven by automation technology and artificial intelligence aim to take over the repetitive, laborious tasks of daily production. For a robot to carry out production activities as accurately as a person, it needs to autonomously grasp objects in the industrial environment with a robot arm. Although deep learning has brought significant progress to robotic grasping, in practice training a deep neural network requires large-scale, manually labeled datasets, and acquiring high-quality RGB datasets that cover the many industrial scenarios and part arrangements is very time- and labor-consuming. For the target detection task in industrial scenes, a detection method that avoids excessive manual labeling work is therefore needed.
Existing approaches to the industrial-scene data acquisition problem fall into two categories according to the type of training data used: methods based on virtual synthetic data and methods based on small-sample learning.
Methods based on virtual synthetic data: existing virtual data generation methods are mainly applied to recognition research on scenes, pedestrians, and vehicles, covering image classification, semantic segmentation, target detection, and so on. The synthetic data are obtained through virtual software: a graphics engine quickly synthesizes and labels data in a virtual three-dimensional environment. In robotics research, reference [1] trains a robot recognition system using the virtual environments of various simulators, which focus on simulating the physical characteristics of the target object. Recent game engines can both simulate physics and render highly realistic images in real time, and computer vision researchers have begun building related studies on them. Reference [2] proposes training a classifier for pedestrian detection in a virtual environment, applying the trained classifier to pedestrian detection in real images, and migrating from the virtual dataset to the real dataset with a domain adaptation algorithm. Reference [3] proposes the SYNTHIA virtual dataset for semantic segmentation; SYNTHIA renders a large-scale virtual city scene and provides pixel-level labels for 11 classes of objects common in autonomous-driving scenes. Reference [4] proposes building a synthetic dataset with the latest graphics engines such as Unity, with the goal of fitting virtual data to real data; the engine can provide data for different computer vision tasks, including optical flow, instance semantic segmentation, target detection and tracking, and visual odometry.
Methods based on small-sample learning: small-sample (few-shot) learning trains a neural network with only a few annotated examples. Reference [5] uses Bayesian inference to generalize knowledge from pre-trained models and perform one-shot learning. LSTD [6] and RepMet [7] adopt a general transfer-learning framework that reduces overfitting by adapting a pre-trained detector to the small-sample scenario. Meta YOLO [8] designs a novel small-sample detection model on top of YOLOv2 [9]: it learns generalizable meta-features and automatically re-weights the features of new classes by generating class-specific activation coefficients from support examples. TFA [10] uses a simple two-stage tuning method that fine-tunes only the classifier in the second stage, yet achieves better performance. CoAE [11] proposes a non-local RPN and approaches one-shot detection from the tracking point of view by comparing itself with other tracking methods.
From the prior art above, methods based on virtual synthetic data suffer from low realism in the rendered images, limiting their applicability to computer vision tasks. Although game engines can render highly realistic images in real time, the realism and scene complexity of the virtual environment remain low: a network trained directly on a virtual dataset performs poorly on real data, and as the dataset grows the acquired images become increasingly similar, so the final dataset causes overfitting when training a model with a deep network structure. In addition, synthesizing a scene with virtual software requires accurately modeling the simulated object; although this yields abundant synthetic data, building the simulation environment through up-front modeling takes too long. Methods based on small-sample learning do not rely on large-scale datasets and train the network with only a few annotated examples. However, when training data are scarce, the biased distribution of the small amount of data often makes the network overfit, and performance is poor when the model is tested in the real scene. Existing small-sample target detection methods therefore predict with low precision in practice and cannot meet the task requirements of industrial scenes.
Disclosure of Invention
The invention aims to: with the rapid development of artificial intelligence, robots are used in industrial production to handle repetitive, monotonous work. However, each time a robot is applied to a new scene the target detection network must be retrained; manually collected training data cannot cover every possible part arrangement, and manual labeling is time- and labor-consuming. Target detection is the premise and basis of robot operation. The invention builds a digital twin of the real industrial scene in virtual software, collects synthetic data using models geometrically similar to the real parts, and outputs labeling information automatically. A small-sample target detection network is first trained on the collected synthetic data; data from a few real scenes are then collected and labeled, and a second training pass on only these few examples fine-tunes the network parameters. Finally, the trained target detection network helps the intelligent robot complete operations such as recognition, grasping, and obstacle avoidance.
To solve the difficulty of acquiring target detection data for industrial parts, the invention provides a target detection method that combines small-sample learning with virtual synthetic data. It trains the network with only a few labeled real-scene image examples, completes data acquisition and labeling automatically, avoids a large amount of manual work, and finally enables the network to detect the position and class of parts in the industrial scene in real time.
The technical scheme is as follows: an industrial part target detection method based on small-sample learning and virtual synthetic data. A synthetic dataset of parts geometrically similar to the real parts is generated with virtual simulation software, and the whole data acquisition process runs automatically without tedious manual operation. The acquired synthetic dataset is preprocessed: the image format and labeling information of the synthetic data are converted into the format required for training the target detection network; the synthetic data are then processed further by manually setting an occlusion-rate threshold and keeping only labels whose occlusion rate is below it; finally, the target detection network based on small-sample learning is trained on the processed synthetic data, so the network acquires detection capability on synthetic data. Several part-bin images containing parts are then captured in the real industrial scene and labeled manually; after labeling, the small-sample target detection network is trained a second time to fine-tune its parameters, and the trained network finally detects and outputs, in real time, the classes and positions of the parts in images captured in the current industrial scene.
The invention generates the synthetic dataset with virtual simulation software: computer graphics and virtual simulation carry out the rendering and labeling work, and a simulated scene with a virtual camera replaces the real scene and camera. The method is implemented on Nvidia Omniverse Isaac Sim, which performs real-time ray tracing and path tracing through RTX to provide realistic scene images. Simple models in the Isaac Sim platform imitate the geometry of the real parts to generate synthetic data, and the synthetic datasets are preprocessed to better approximate the color and arrangement distribution of the real data. Through transfer learning the method greatly improves detection precision, and with only a small amount of real sample data it can meet the task requirements of industrial scenes.
The synthetic dataset generation method is automated through a UI and scripting, and the generation process can be divided into static generation and dynamic generation. The virtual synthetic data method proceeds as follows:
For industrial scenes, generating the feed bin and the parts inside it is the most important part of the whole scene: a robot-driven smart factory usually mounts a camera at the end of the robot arm to capture the real-time state of the current feed bin, detect the positions of the parts, and enable grasping. The purpose of feed-bin generation is to accurately reproduce the bin structure and geometry used in the industrial setting, so the structure representing the bin must be defined before the virtual bin is generated. The feed bin has five surfaces, each rendered and synthesized with Isaac Sim's built-in Cube model. The simulated feed bin is loaded into a virtual scene using an example environment provided in Isaac Sim, and the whole virtual environment realistically reproduces indoor illumination with a simulated light source. A dynamic scene is then generated: an object geometrically similar to the industrial part is selected, a Circle model in Isaac Sim is fine-tuned through its parameters to imitate the part's shape geometry, and collision, gravity, friction, gloss, and other properties are added so that the part model behaves in the virtual environment with real-world physics. The part model is copied in batches, and the copies share parameters. Because the relationships between different parts in an industrial scene are relatively complex, a virtual environment of low complexity yields images whose part distribution cannot cover the distribution of the real scene, which often leads to overfitting when training the neural network. To solve this problem, the diversity of the virtual environment must be increased. The invention uses domain randomization to randomly assign, within limits, the position at which each copied part model is generated. The randomly generated pose (x, y, z, w) of the part model is confined to a region, where (x, y, z, w) is a set of quaternion-style values representing the position and rotation of the part model in the virtual environment. The part models are restricted to appear only within the feed bin; thanks to the physics engine, each generated model falls freely from a different height, and the collisions, occlusions, and other phenomena of a real fall are simulated during the drop. After many repetitions the diversity of the virtual environment increases further, so the generated dataset can train a more complex and deeper neural network with less overfitting.
To implement automatic labeling with a computer program, every object model in the virtual environment must be tracked; here objects are the models simulated in the virtual simulation software (e.g., the virtual part bin). In a three-dimensional graphics engine every object is a three-dimensional model whose basic data structure is a mesh composed of vertices and triangular faces; the same three-dimensional model may be composed of different meshes. During rendering, the corresponding graphics-API functions can color-code the different meshes: illumination, material, and similar information are ignored, and all pixels of the rasterized mesh are rendered in flat RGB colors. A script tracks the meshes under a common parent object through the viewport and stores them in the final annotation file. Capturing data in such a realistic virtual environment expands the quantity and diversity of training data far beyond what manual capture in a real scene can achieve, very efficiently, while also avoiding the cost of manual labeling, so high-quality synthetic data of the real physical scene can be obtained.
The invention preprocesses the generated synthetic data. First, the image data in png format in the synthetic dataset are converted to jpg format. The labeling information in the synthetic dataset is then screened and format-converted: only the 2D bounding-box labels are extracted and saved as xml annotation files. The original labels are stored in npy files as a numpy matrix that contains both the labels used for training the small-sample target detection network and some unused information; the unused information must be discarded, so screening is required. The labels of each image in the current synthetic dataset cover every object in the image, including severely occluded ones. If the network were trained only on the converted image data and labels, the detection model would also detect severely occluded objects and assign them high prediction scores, feeding wrong results back to the robot and causing it to grasp severely occluded objects first. Further label processing is therefore performed on the screened 2D bounding-box data. The occlusion relationship is judged from the 2D bounding boxes and the instance-segmentation information in the labels. Whenever occlusion occurs there must be an occluding object and an occluded object; when two adjacent pixels in the instance-segmentation labels carry different labels, the parts corresponding to those two pixels occlude each other. The 2D bounding boxes of the two parts are extracted and their intersection region is computed. From the instance-segmentation value of each pixel, the number of pixels belonging to each part within the current intersection region is counted: the part with more pixels is the occluding part, and the part with fewer pixels is the occluded part. Comparing all objects in each image of the synthetic dataset pairwise with this method yields an occlusion relation map, stored in a file as an N x N square matrix (N is the number of objects in the current image); if the element at [i, j] is -1, objects i and j occlude each other (i and j are the indices of different objects in the current image; the indices and the count N are available in the original, unpreprocessed labels). Then, for the parts involved in occlusion, the intersection of their 2D bounding boxes is taken, the number of pixels covering the occluded part in the intersection is counted, and the occlusion rate is calculated.
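As a concrete illustration of this screening step, the following Python sketch computes the pairwise occlusion map and a per-part occlusion rate from an instance-segmentation map and 2D bounding boxes. It is a minimal sketch, not the patent's actual implementation: the label layout, the function names, and the exact occlusion-rate formula (occluder pixels in the intersection over the occluded part's box area) are assumptions.

```python
import numpy as np

def occlusion_relations(inst_map, boxes):
    """Pairwise occlusion from an instance-segmentation map.

    inst_map : (H, W) int array, pixel value = object index 0..N-1.
    boxes    : (N, 4) float array of [x1, y1, x2, y2] 2D bounding boxes.
    Returns the N x N relation matrix (-1 marks an occluding pair) and
    an assumed per-object occlusion rate.
    """
    n = len(boxes)
    relation = np.zeros((n, n), dtype=int)
    occ_rate = np.zeros(n)
    for i in range(n):
        for j in range(i + 1, n):
            # Intersection region of the two 2D bounding boxes.
            x1 = max(boxes[i][0], boxes[j][0]); y1 = max(boxes[i][1], boxes[j][1])
            x2 = min(boxes[i][2], boxes[j][2]); y2 = min(boxes[i][3], boxes[j][3])
            if x1 >= x2 or y1 >= y2:
                continue                      # boxes disjoint: no occlusion
            region = inst_map[int(y1):int(y2), int(x1):int(x2)]
            pix_i, pix_j = int((region == i).sum()), int((region == j).sum())
            if pix_i == 0 or pix_j == 0:
                continue
            relation[i, j] = relation[j, i] = -1
            # Fewer visible pixels in the intersection -> the occluded part.
            occluded, hidden = (i, pix_j) if pix_i < pix_j else (j, pix_i)
            area = ((boxes[occluded][2] - boxes[occluded][0]) *
                    (boxes[occluded][3] - boxes[occluded][1]))
            occ_rate[occluded] = max(occ_rate[occluded], hidden / max(area, 1))
    return relation, occ_rate

def keep_labels(occ_rate, threshold=0.2):
    """Keep only labels whose occlusion rate is below the manual threshold."""
    return [k for k, r in enumerate(occ_rate) if r < threshold]
```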
According to the actual requirements of different industrial scenes, an occlusion-rate threshold is set manually and the labels are screened against it, keeping only labels that are unoccluded or lightly occluded. "Lightly occluded" is a human-defined notion: according to actual production needs, an operator may decide that an object less than 20% occluded counts as lightly occluded, and may modify the 20% threshold, which is not fixed and can change with the requirements of different tasks. Finally, batch color transfer is applied to the synthetic data using a real industrial-scene picture and the Reinhard algorithm, bringing the synthetic dataset closer to the real scene.
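A minimal sketch of the batch color-transfer step follows. The Reinhard algorithm matches per-channel mean and standard deviation between two images; the original paper works in the lαβ color space, and the OpenCV LAB space used here is a common approximation, so this is an assumed variant rather than the patent's exact code.

```python
import cv2
import numpy as np

def reinhard_transfer(src_bgr, ref_bgr):
    """Shift the LAB statistics of a synthetic image (src) toward those
    of a real industrial-scene picture (ref)."""
    src = cv2.cvtColor(src_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    ref = cv2.cvtColor(ref_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    for c in range(3):
        s_mean, s_std = src[..., c].mean(), src[..., c].std() + 1e-6
        r_mean, r_std = ref[..., c].mean(), ref[..., c].std()
        src[..., c] = (src[..., c] - s_mean) * (r_std / s_std) + r_mean
    return cv2.cvtColor(np.clip(src, 0, 255).astype(np.uint8),
                        cv2.COLOR_LAB2BGR)

# Batch migration: apply one real reference picture to every synthetic image.
# real = cv2.imread('real_scene.jpg')
# out = [reinhard_transfer(img, real) for img in synthetic_images]
```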
The detection model used in the invention first trains the feature extraction network with synthetic data, generated by virtual simulation software and geometrically similar to the real parts, as base-class data, so the network learns to extract geometric features of base-class targets. In the secondary training pass the detection network is trained with a small number of real samples, and the network parameters are fine-tuned through a contrastive learning strategy.
The base classes have a sufficient amount of data, while the new classes appearing in the secondary training have only a limited number of labeled samples, so the distribution statistics of the base classes can be estimated more accurately than those of the new classes. Treating the feature distribution of each class as Gaussian, the mean and variance of each class relate to its semantic similarity with other classes. When the similarity of two classes is high enough (e.g., reaches a preset value), the common features the model learned during the first training pass can be transferred from the base class to the new class. The distribution calibration strategy proposed by the invention, built on geometrically similar synthetic data, operates at the feature level and is independent of any feature extractor.
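The patent does not give a calibration formula; the sketch below follows the common feature-level distribution-calibration recipe under that reading: each scarce new-class feature borrows the Gaussian statistics of its most similar base classes, and extra features are sampled from the calibrated distribution. The hyper-parameters k, alpha, and n_sample are assumed values.

```python
import numpy as np

def calibrate_and_sample(novel_feats, base_means, base_covs,
                         k=2, alpha=0.2, n_sample=100):
    """novel_feats: (M, D) features of the labeled new-class samples.
    base_means: (B, D) per-base-class feature means.
    base_covs:  (B, D, D) per-base-class feature covariances."""
    d = base_means.shape[1]
    generated = []
    for f in novel_feats:
        # Similarity measured as distance to each base-class mean.
        near = np.argsort(np.linalg.norm(base_means - f, axis=1))[:k]
        mean = (base_means[near].sum(axis=0) + f) / (k + 1)
        # alpha inflates the covariance to spread the sampled features.
        cov = base_covs[near].mean(axis=0) + alpha * np.eye(d)
        generated.append(np.random.multivariate_normal(mean, cov, n_sample))
    return np.concatenate(generated, axis=0)
```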
In the target detection network structure, the invention uses the Faster R-CNN two-stage detection framework as the backbone. The RPN takes the backbone feature map as input and generates region proposals; the RoI head then classifies each candidate region and regresses a bounding box when the candidate region contains a foreground object. A generic detector cannot build robust feature representations for region proposals from limited data, which leads to mislabeled local objects and ultimately unsatisfactory detection accuracy. To learn a more robust object feature representation from less data, the invention uses a contrastive learning strategy to distinguish instance-level intra-class similarities from inter-class differences.
The invention introduces a contrastive branch into the RoI part of the Faster R-CNN network, parallel to the classification and regression branches. The contrastive branch is implemented as a one-layer multi-layer perceptron (MLP): the RoI features are encoded into contrastive features whose similarity scores measure the agreement between object-proposal representations, and the contrastive objective is optimized on the MLP-head-encoded RoI features to maximize the consistency between target proposals of the same class and sharpen the distinctness of proposals from different classes.
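A minimal PyTorch sketch of such a contrastive branch follows; the layer sizes, temperature, and loss form (a supervised contrastive loss over RoI embeddings) are assumptions, since the patent describes the branch only at the level above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveBranch(nn.Module):
    """Contrastive head parallel to the classification/regression branches:
    it embeds MLP-head RoI features and pulls together proposals of the
    same class while pushing apart proposals of different classes."""

    def __init__(self, in_dim=1024, feat_dim=128, tau=0.2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(),
                                 nn.Linear(in_dim, feat_dim))
        self.tau = tau

    def forward(self, roi_feats, labels):
        n = roi_feats.size(0)
        z = F.normalize(self.mlp(roi_feats), dim=1)      # (n, feat_dim)
        sim = z @ z.t() / self.tau                       # cosine similarities
        eye = torch.eye(n, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(eye, float('-inf'))        # drop self-pairs
        log_prob = F.log_softmax(sim, dim=1)
        pos = labels.view(-1, 1).eq(labels.view(1, -1)) & ~eye
        denom = pos.sum(dim=1).clamp(min=1)
        loss = -(log_prob.masked_fill(~pos, 0.0)).sum(dim=1) / denom
        return loss.mean()   # added to the usual Faster R-CNN losses
```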
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing an industrial part target detection method based on small sample learning and virtual composite data as described above when executing the computer program.
A computer-readable storage medium storing a computer program that performs the industrial part target detection method based on small sample learning and virtual composite data as described above.
The beneficial effects are that: compared with the prior art, the method solves the data acquisition problem in industrial scenes and achieves higher detection precision than methods of the same type. When a robot factory faces new application scenes and new part data, no up-front large-scale data acquisition or manual labeling is needed, which greatly shortens adaptation time, improves working efficiency, and thus raises the production efficiency of the factory.
The target detection model uses few computing resources, consumes little energy, is relatively simple to train, and is easy for beginners to pick up quickly. It also has an end-to-end structure without tedious multi-stage tuning, so its labor cost is low. In a future environment that combines artificial intelligence with real production, the method has very broad long-term application prospects.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present invention;
FIG. 2 is a schematic diagram of a virtual simulation scenario in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of virtual composite data generation in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of HV8 circular part synthetic data for an embodiment of the invention;
fig. 5 is a schematic diagram of a small sample target detection network in accordance with an embodiment of the present invention.
Detailed Description
The present invention is further illustrated below in conjunction with specific embodiments. It should be understood that these embodiments only illustrate the invention and do not limit its scope; modifications of equivalent forms made by those skilled in the art after reading the invention fall within the scope defined by the claims appended hereto.
An industrial part target detection method based on small-sample learning and virtual synthetic data: the flow in FIG. 1 is the method flow proposed by the invention. First, a synthetic dataset of parts geometrically similar to the real parts is produced with virtual simulation software, and the labeling information of each image is output at the same time; the whole data acquisition process runs automatically without tedious manual operation. The collected synthetic data are preprocessed, converting their image format and labeling information into the format required for training the target detection network; an occlusion-rate threshold is chosen according to the actual industrial scene, parts whose occlusion rate is below the threshold are kept, and the target detection network based on small-sample learning is trained on the synthetic data so that it acquires detection capability on synthetic data. Between 1 and 10 images of the real industrial scene are collected, i.e., no more than 10 examples. The images are labeled manually with the LabelMe tool; after labeling, the detection network is trained a second time to fine-tune its parameters, and the trained network finally detects and outputs the classes and positions of the parts in the current industrial scene in real time.
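For the manual labeling step, a small conversion script is typically needed to turn LabelMe output into the training format. The sketch below, assuming rectangle annotations and a Pascal-VOC-style xml target (the patent does not specify the exact format of the fine-tuning labels), illustrates the idea:

```python
import json
import xml.etree.ElementTree as ET

def labelme_to_voc(json_path, xml_path):
    """Convert a LabelMe rectangle annotation to a Pascal-VOC style XML
    file for the secondary (fine-tuning) training stage. Assumes each
    shape is a rectangle stored as two corner points."""
    with open(json_path, encoding='utf-8') as f:
        ann = json.load(f)
    root = ET.Element('annotation')
    ET.SubElement(root, 'filename').text = ann['imagePath']
    size = ET.SubElement(root, 'size')
    ET.SubElement(size, 'width').text = str(ann['imageWidth'])
    ET.SubElement(size, 'height').text = str(ann['imageHeight'])
    for shape in ann['shapes']:
        (x1, y1), (x2, y2) = shape['points']
        obj = ET.SubElement(root, 'object')
        ET.SubElement(obj, 'name').text = shape['label']
        box = ET.SubElement(obj, 'bndbox')
        ET.SubElement(box, 'xmin').text = str(int(min(x1, x2)))
        ET.SubElement(box, 'ymin').text = str(int(min(y1, y2)))
        ET.SubElement(box, 'xmax').text = str(int(max(x1, x2)))
        ET.SubElement(box, 'ymax').text = str(int(max(y1, y2)))
    ET.ElementTree(root).write(xml_path)
```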
The invention generates the synthetic dataset with virtual software: computer graphics and virtual simulation perform the rendering and labeling work, and a simulated scene with a virtual camera replaces the real scene and camera. The method is implemented on Nvidia Omniverse Isaac Sim. Isaac Sim is a robot simulation platform developed by NVIDIA that supports multi-platform development and optimization and performs real-time ray and path tracing through RTX, providing realistic images. Because Isaac Sim uses real-time ray tracing, as shown in FIG. 2, both the illumination and the object reflections in the simulated scene are very realistic, and the camera can accumulate a large amount of valuable data from such an environment. In the simulation environment the position and pose of every object are stored in the computer, so object data are easy to obtain, a large amount of labeled data can be gathered, and detection, recognition, and segmentation of objects are convenient. However, this route normally presupposes an accurate CAD model of the part to be generated, and in a real industrial scene different parts of the same model differ slightly, so obtaining an accurate CAD model involves certain difficulties. The target detection method of the invention, based on small-sample learning and virtual synthetic data, needs no specific model information of the parts in the data generation stage: a simple model in the Isaac Sim platform imitates the approximate geometry of the parts to generate data, and the synthetic data are preprocessed to better approximate the color and arrangement distribution of the real data. Through transfer learning the method greatly improves the detection precision of the model, and with a small amount of real sample data it can meet the task requirements of industrial scenes.
The synthetic dataset generation method is automated through a UI and scripting, and the generation process can be divided into static generation and dynamic generation. The specific virtual synthetic data process is shown in FIG. 3:
For industrial scenes, generating the feed bin and the parts inside it is the most important part of the whole scene: a robot-driven smart factory usually mounts a camera at the end of the robot arm to capture the real-time state of the current feed bin, detect the positions of the parts, and enable grasping. The purpose of feed-bin generation is to accurately reproduce the bin structure and geometry used in the industrial setting, so the structure representing the bin must be defined before the virtual bin is generated. The feed bin has five surfaces, each rendered and synthesized with Isaac Sim's built-in Cube model. The simulated feed bin is loaded into a virtual scene using an example environment provided in Isaac Sim, and the whole virtual environment realistically reproduces indoor illumination with a simulated light source. A dynamic scene is then generated: an object geometrically similar to the industrial part (taking the HV8 circular part as an example) is selected, a Circle model in Isaac Sim is fine-tuned through its parameters to imitate the circular geometry of the HV8 part, and collision, gravity, friction, gloss, and other properties are added so that the model behaves in the virtual environment with real-world physics. The circular model is copied in batches, and the copies share parameters. Because the relationships between different parts in an industrial scene are relatively complex, a virtual environment of low complexity yields images whose part distribution cannot cover the distribution of the real scene, which often leads to overfitting when training the neural network. To solve this problem, the diversity of the virtual environment must be increased. The invention uses domain randomization to randomly assign, within limits, the position at which each copy of the model is generated. The randomly generated pose (x, y, z, w) is confined to a region and the model is restricted to appear only within the feed bin; thanks to the physics engine, each generated model falls freely from a different height, and the collisions, occlusions, and other phenomena of a real fall are simulated during the drop. After many repetitions the diversity of the virtual environment increases further, so the generated dataset can train a more complex and deeper neural network with less overfitting. A sketch of this constrained random placement is shown below.
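The following Python sketch illustrates the constrained random placement just described. The bin extents, drop heights, and the `spawn_fn` hook standing in for the simulator's spawning call are all hypothetical; the patent does not disclose the actual Isaac Sim calls.

```python
import random

# Feed-bin interior footprint and drop heights, in scene units (assumed).
X_RANGE, Y_RANGE = (-0.25, 0.25), (-0.35, 0.35)
DROP_HEIGHTS = (0.4, 0.9)        # parts free-fall from varying heights

def random_pose():
    """Sample one constrained random (x, y, z, w) pose for a part copy.
    Position is confined to the bin footprint; the physics engine then
    lets the part fall, collide and settle."""
    x = random.uniform(*X_RANGE)
    y = random.uniform(*Y_RANGE)
    z = random.uniform(*DROP_HEIGHTS)
    w = random.uniform(0.0, 360.0)  # rotation component of the pose,
                                    # treated here as a yaw angle
    return x, y, z, w

def scatter_parts(spawn_fn, n_parts=30):
    """spawn_fn is a hypothetical stand-in for the simulator call that
    places one copied part model at a pose."""
    for _ in range(n_parts):
        spawn_fn(random_pose())
```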
To achieve automatic labeling with a computer program, every object in the virtual environment must be tracked. In a three-dimensional graphics engine every object is a three-dimensional model whose basic data structure is a mesh composed of vertices and triangular faces; the same three-dimensional model may be composed of different meshes. During rendering, the corresponding graphics-API functions can color-code the different meshes: illumination, material, and similar information are ignored, and all pixels of the rasterized mesh are rendered in flat RGB colors. A script tracks the meshes under a common parent object through the viewport and stores them in the final annotation file. Capturing data in such a realistic virtual environment expands the quantity and diversity of training data far beyond what manual capture in a real scene can achieve, very efficiently, while also avoiding the cost of manual labeling, so high-quality synthetic data of the real physical scene can be obtained. FIG. 4 shows a synthetic dataset acquired with the above method of the invention (HV8 circular parts as an example).
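As an illustration of the color-coding idea, the sketch below assigns each mesh a unique flat RGB color derived from its instance id, then recovers per-pixel ids and 2D bounding boxes from the rendered image. It is a minimal numpy sketch under the assumption of a lossless render; it is not the script actually used with the Isaac Sim viewport.

```python
import numpy as np

def id_to_color(i):
    """Encode an instance id as a unique RGB triple (one id per mesh)."""
    return (i & 0xFF, (i >> 8) & 0xFF, (i >> 16) & 0xFF)

def decode_labels(color_img):
    """Recover per-pixel instance ids from a color-coded render in which
    lighting and materials were disabled, as described above."""
    r = color_img[..., 0].astype(np.int64)
    g = color_img[..., 1].astype(np.int64)
    b = color_img[..., 2].astype(np.int64)
    return r | (g << 8) | (b << 16)

def boxes_from_ids(id_map):
    """Derive 2D bounding boxes from the decoded instance-id map."""
    boxes = {}
    for i in np.unique(id_map):
        ys, xs = np.nonzero(id_map == i)
        boxes[int(i)] = (xs.min(), ys.min(), xs.max(), ys.max())
    return boxes
```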
In a real scene, people usually grasp the topmost, unoccluded object first, and the robot should do the same: blindly grasping an occluded object lowers the robot's grasp success rate, and collisions between the robot's gripper and the occluding objects in the environment can also damage a precision gripper. Training the deep neural network directly with the 2D bounding boxes of all objects would make the detection network misjudge an occluded object as the best grasp target. To improve the system's perception of the environment, the invention preprocesses the generated synthetic data and judges occlusion relationships with the 2D bounding boxes and instance-segmentation information. Whenever occlusion occurs there must be an occluding object and an occluded object; when the labels of two adjacent pixels in the instance-segmentation information differ, the parts corresponding to those pixels occlude each other. The 2D bounding boxes of the two parts are extracted and their intersection region is computed. From the per-pixel labels, the number of pixels belonging to each part within the intersection is counted: the part with more pixels is the occluding part and the part with fewer pixels is the occluded part. Comparing all objects in the image pairwise yields the occlusion relation map. Then, for the parts involved in occlusion, the intersection of their 2D bounding boxes is taken, the number of pixels covering the occluded part in the intersection is counted, and the occlusion rate is calculated. The labels are screened against the occlusion rate, keeping only labels that are unoccluded or lightly occluded. Finally, batch color transfer is applied to the synthetic data with a real industrial-scene picture and the Reinhard algorithm, bringing the synthetic dataset closer to the real scene.
The target detection method based on small-sample learning proposed by the invention is shown in FIG. 5. The detection model first trains the feature extraction network with synthetic data, generated by virtual simulation software and geometrically similar to the real parts, as base-class data, so the network learns to extract geometric features of base-class targets. In the secondary training pass the detection network is trained with a small number of real samples, and the network parameters are fine-tuned through a contrastive learning strategy.
The base classes have a sufficient amount of data, while the new classes have only a limited number of labeled samples, so the distribution statistics of the base classes can be estimated more accurately than those of the new classes. Treating the feature distribution of each class as Gaussian, the mean and variance of each class relate to its semantic similarity with other classes. When two classes are sufficiently similar, their statistics can be transferred from the base class to the new class. The distribution calibration strategy proposed by the invention, built on geometrically similar synthetic data, operates at the feature level and is independent of any feature extractor.
In the network architecture, the invention uses the Faster R-CNN two-stage detection framework as the backbone. The RPN takes the backbone feature map as input and generates region proposals; the RoI head then classifies each proposal and regresses its bounding box if the prediction contains an object. A generic detector cannot build robust feature representations for region proposals from limited data, which leads to mislabeled local objects and ultimately unsatisfactory detection accuracy. To learn a more robust object feature representation from less data, the invention uses a contrastive learning strategy to distinguish instance-level intra-class similarities from inter-class differences.
The invention introduces a contrastive branch into the RoI part of the Faster R-CNN network, parallel to the classification and regression branches. The contrastive branch is implemented as a one-layer multi-layer perceptron (MLP): the RoI features are encoded into contrastive features whose similarity scores measure the agreement between object-proposal representations, and the contrastive objective is optimized on the MLP-head-encoded RoI features to maximize the consistency between target proposals of the same class and increase the distinctness of proposals from different classes.
The invention integrates and applies computer vision, intelligent robotics, image processing, and other fields. Compared with the prior art it has practical application capability: it solves the data acquisition problem in industrial scenes and achieves higher detection precision than methods of the same type. When a robot factory faces new application scenes and new part data, no up-front large-scale data acquisition or manual labeling is needed, which greatly shortens adaptation time, improves working efficiency, and thus raises the production efficiency of the factory.
The system in this research runs on a computer with an Intel i9-10940X CPU and an NVIDIA RTX A8000 GPU with 48 GB of memory; the algorithm is implemented in Python and can be deployed on multiple platforms and different hardware configurations. The model uses few computing resources, consumes little energy, is relatively simple to train, and is easy for beginners to pick up quickly. It also has an end-to-end structure without tedious multi-stage tuning, so its labor cost is low. In a future environment combining artificial intelligence with real production, this research has very broad long-term application prospects and room for improvement.
It will be apparent to those skilled in the art that the steps of the industrial part target detection method based on small-sample learning and virtual synthetic data of the embodiments above may be implemented by a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network of computing devices. Alternatively, they may be implemented as program code executable by a computing device and stored in a storage device for execution, and in some cases the steps shown or described may be performed in a different order than herein. They may also be fabricated as individual integrated-circuit modules, or multiple modules or steps among them may be fabricated as a single integrated-circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
Claims (6)
1. An industrial part target detection method based on small sample learning and virtual synthetic data, characterized in that a synthetic dataset is generated through virtual simulation software, the dataset comprising image data and labeling information corresponding to each image; the labeling information comprises depth information, 2D bounding-box information, and instance-segmentation information; the synthetic dataset is preprocessed, converting the image format and labeling information of the synthetic data into the format required for training a small-sample target detection network; the synthetic data are then further processed by manually setting an occlusion rate and keeping only the labels below the occlusion rate; finally the target detection network based on small-sample learning is trained on the processed synthetic data; part-bin images containing parts are acquired in the real industrial scene and the real-scene images are labeled; after labeling, the small-sample target detection network is trained a second time; and finally the trained network detects and outputs, in real time, the classes and positions of the parts in images acquired in the current scene;
the virtual simulation software generates the synthetic dataset by replacing the real scene and real camera with a simulated scene and a virtual camera; the method is implemented on Nvidia Omniverse Isaac Sim software, which performs real-time ray and path tracing through RTX and provides scene images; a model carried by the Isaac Sim platform simulates a model geometrically similar to the real part to generate the synthetic data;
firstly, virtual simulation software is used to generate synthetic data geometrically similar to the real parts as base-class data to train a feature extraction network, so the network acquires the ability to extract geometric features of base-class targets; a real sample is used to train the detection network in the secondary training process, and network parameters are fine-tuned through a contrastive learning strategy;
in the secondary training process, when the similarity of two classes reaches a preset value, the common features learned by the model in the first training process are transferred from the base class to the new class;
the generated synthetic data information is preprocessed: first, the image data in the previous synthetic dataset are converted into the required format; the labeling information in the synthetic dataset is screened and format-converted, and only the 2D bounding-box labels are extracted and saved as xml annotation files;
firstly, the occlusion relationship is judged from the 2D bounding boxes and instance-segmentation information in the labels; whenever occlusion occurs there must be an occluding object and an occluded object; when the labels of two adjacent pixels in the instance-segmentation information differ, the parts corresponding to the two pixels currently occlude each other; the 2D bounding-box information of the two parts is extracted and the intersection region of the 2D bounding boxes is calculated; from the instance-segmentation label of each pixel, the number of pixels belonging to each part within the current intersection region is counted, the part with more pixels being the occluding part and the part with fewer pixels being the occluded part; after all objects in the images of the synthetic dataset are compared pairwise with this method, an occlusion relation map is obtained, expressed in the file as an N x N square matrix, N being the number of objects in the current image; if the element at [i, j] is -1, the current object i and object j occlude each other;
then, for the parts with occlusion, the intersection of their 2D bounding boxes is taken, the number of pixels covering the occluded part in the intersection is counted, and the occlusion rate is calculated; according to the actual requirements of different industrial scenes an occlusion rate is set manually, the labels are screened with the occlusion rate as a threshold, and labels that are unoccluded or lightly occluded are kept;
finally, batch color transfer is applied to the synthetic data with a real industrial-scene picture and the Reinhard algorithm, bringing the synthetic dataset closer to the real scene;
to achieve automatic labeling through a computer program, every object in the virtual environment needs to be tracked; in the three-dimensional graphics engine every object is a three-dimensional model whose basic data structure is a mesh composed of vertices and triangular faces; the same three-dimensional model consists of different meshes; during rendering of the three-dimensional model, color-coded rendering is applied to the different meshes by calling the corresponding functions of the graphics API, namely illumination and materials are ignored and all pixels of the rasterized meshes are represented in RGB colors; after the script is compiled, the meshes under a common parent object are tracked through the viewport and stored in the final annotation file;
the synthetic dataset is generated through virtual simulation software, and models are built in the virtual environment, including the camera, the feed bin, and the parts inside the bin; the structure representing the feed bin is defined before the virtual bin is generated; the feed bin has five surfaces, each rendered and synthesized using Isaac Sim's Cube model; the simulated feed bin is loaded into a virtual scene using an example environment provided in Isaac Sim, and the whole virtual environment realistically restores the indoor illumination conditions with a simulated light source; a dynamic scene is then generated: an object geometrically similar to the industrial part is selected, a Circle model in Isaac Sim is fine-tuned through its parameter information to simulate the part's shape geometry, and collision, gravity, friction, and gloss information are added to the part model so that it has real-world physical properties in the virtual environment; the part model is copied in batches and the copies share parameters; a domain randomization method randomly assigns, within limits, each generated position of the copied part model; the randomly generated positions (x, y, z, w) are confined to a region and the part model is restricted to appear only within the feed bin; owing to the physics engine, each generated model falls freely from a different height, the collision and occlusion phenomena of a real fall are simulated during the drop, and the diversity of the virtual environment is further increased after repetition.
2. The industrial part target detection method based on small sample learning and virtual synthetic data according to claim 1, characterized in that virtual simulation software is first used to generate synthetic data geometrically similar to the real parts as base-class data to train a feature extraction network, so that the network acquires the ability to extract geometric features of base-class targets; the real samples are used to train the detection network in the secondary training process, and network parameters are fine-tuned through a contrastive learning strategy.
3. The industrial part target detection method based on small sample learning and virtual synthesized data according to claim 2, wherein, in the secondary training process, when the degree of similarity between the two classes reaches a preset value, the common features learned by the model in the first training stage are transferred from the base class to the new class.
4. The industrial part target detection method based on small sample learning and virtual synthesized data according to claim 2, wherein, in the target detection model network structure, the Faster R-CNN two-stage detection framework is used as the backbone network; the RPN takes the backbone feature map as input and generates region proposals, and the RoI head classifies each region proposal; if a proposal is predicted to contain an object, i.e., a foreground object exists in the current proposal region, a bounding box is regressed; a contrastive learning strategy is used to distinguish instance-level intra-class similarities from inter-class differences;
a contrastive branch is introduced into the RoI part of the Faster R-CNN network, parallel to the classification branch and the regression branch; the contrastive branch is implemented as a single-layer multi-layer perceptron that encodes the RoI features into contrastive features, which are used to measure similarity scores between object proposal representations; the contrastive objective is optimized on the MLP-head-encoded RoI features so as to maximize the agreement between target proposals from the same category.
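A plausible realization of the contrastive branch and its objective, in the spirit of supervised contrastive proposal encoding; the layer sizes, temperature, and exact loss form are assumptions, since the claims only specify an MLP branch parallel to classification and regression:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveBranch(nn.Module):
    """MLP head parallel to classification/regression: projects RoI features
    into a normalized embedding used for a supervised contrastive objective."""
    def __init__(self, in_dim: int = 1024, out_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(),
                                 nn.Linear(in_dim, out_dim))

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(roi_feats), dim=1)

def supervised_contrastive_loss(z: torch.Tensor, labels: torch.Tensor,
                                tau: float = 0.2) -> torch.Tensor:
    """Pull proposals of the same category together, push other categories apart."""
    sim = z @ z.t() / tau                             # pairwise similarity scores
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    pos.fill_diagonal_(0)                             # exclude self-pairs as positives
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()  # numerical stability
    exp = torch.exp(logits)
    not_self = torch.ones_like(exp).fill_diagonal_(0)
    log_prob = logits - torch.log((exp * not_self).sum(dim=1, keepdim=True))
    pos_per_row = pos.sum(dim=1).clamp(min=1)
    loss = -(pos * log_prob).sum(dim=1) / pos_per_row
    return loss.mean()
```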
5. A computer device, characterized by: the computer device comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the industrial part target detection method based on small sample learning and virtual synthesized data according to any one of claims 1-4.
6. A computer-readable storage medium, characterized by: the computer-readable storage medium stores a computer program for executing the industrial part target detection method based on small sample learning and virtual synthesized data according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310274497.5A CN116363085B (en) | 2023-03-21 | 2023-03-21 | Industrial part target detection method based on small sample learning and virtual synthesized data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116363085A CN116363085A (en) | 2023-06-30 |
CN116363085B true CN116363085B (en) | 2024-01-12 |
Family
ID=86912998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310274497.5A Active CN116363085B (en) | 2023-03-21 | 2023-03-21 | Industrial part target detection method based on small sample learning and virtual synthesized data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116363085B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20140085964A (en) * | 2012-12-28 | 2014-07-08 | 한국항공우주연구원 | Apparatus and method for topographical change detection using aerial images photographed in aircraft |
CN104359404A (en) * | 2014-11-24 | 2015-02-18 | 南京航空航天大学 | Quick visual detection method for plenty of guide holes of small sizes in airplane parts |
WO2018045472A1 (en) * | 2016-09-08 | 2018-03-15 | Polyvalor, Limited Partnership | Object analysis in images using electric potentials and electric fields |
CN110400315A (en) * | 2019-08-01 | 2019-11-01 | 北京迈格威科技有限公司 | A kind of defect inspection method, apparatus and system |
CN112150575A (en) * | 2020-10-30 | 2020-12-29 | 深圳市优必选科技股份有限公司 | Scene data acquisition method, model training method, device and computer equipment |
EP3886046A1 (en) * | 2020-03-26 | 2021-09-29 | Sony Group Corporation | Multi-view positioning using reflections |
CN113763569A (en) * | 2021-08-30 | 2021-12-07 | 之江实验室 | Image annotation method and device used in three-dimensional simulation and electronic equipment |
CN113781415A (en) * | 2021-08-30 | 2021-12-10 | 广州大学 | Defect detection method, device, equipment and medium for X-ray image |
CN113822368A (en) * | 2021-09-29 | 2021-12-21 | 成都信息工程大学 | Anchor-free incremental target detection method |
CN114612393A (en) * | 2022-02-25 | 2022-06-10 | 哈尔滨工业大学(深圳) | Monocular vision-based reflective part pose estimation method |
CN114952809A (en) * | 2022-06-24 | 2022-08-30 | 中国科学院宁波材料技术与工程研究所 | Workpiece identification and pose detection method and system and grabbing control method of mechanical arm |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130110804A1 (en) * | 2011-10-31 | 2013-05-02 | Elwha LLC, a limited liability company of the State of Delaware | Context-sensitive query enrichment |
Non-Patent Citations (2)
Title |
---|
Towards Virtual Commissioning of Image-based Information Systems for State Detection in Logistics; H. Borstell et al.; IFAC-PapersOnLine; full text *
Detection and Implementation of Occluded Targets Based on Deep Neural Networks; Yan Chao; China Master's Theses Full-text Database, Information Science and Technology; full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||