CN111311683B - Method and device for detecting pick-up point of object, robot, equipment and medium


Info

Publication number
CN111311683B
Authority
CN
China
Legal status
Active
Application number
CN202010223336.XA
Other languages
Chinese (zh)
Other versions
CN111311683A (en)
Inventor
吴华栋
周韬
成慧
高鸣岐
Current Assignee
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Application filed by Shenzhen Sensetime Technology Co Ltd
Priority to CN202010223336.XA
Publication of CN111311683A
Application granted
Publication of CN111311683B


Classifications

    • G06T 7/73: Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/253: Pattern recognition; fusion techniques of extracted features
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/048: Neural networks; activation functions
    • G06V 10/25: Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
    • G06T 2207/10028: Image acquisition modality; range image; depth image; 3D point clouds
    • G06T 2207/20081: Special algorithmic details; training; learning
    • G06T 2207/20084: Special algorithmic details; artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a method and apparatus for detecting a pick-up point of an object, a robot, a device and a medium. The method comprises the following steps: acquiring an image to be detected and a first depth map corresponding to the image to be detected; respectively extracting features of the image to be detected and the first depth map to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map; according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, carrying out feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map to obtain a feature fusion map corresponding to the image to be detected; and determining the position information of the point to be picked up of the object in the image to be detected according to the feature fusion map corresponding to the image to be detected.

Description

Method and device for detecting pick-up point of object, robot, equipment and medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a method and apparatus for detecting a pick-up point of an object, a robot, a device, and a medium.
Background
With the development of computer software and hardware technology, artificial intelligence technology has matured. Robots, as an important real-world application of artificial intelligence, have received a great deal of attention. Robots can be applied to fields such as national defense, industrial production and logistics. In processes such as logistics sorting and industrial production, how to accurately detect the pick-up point of each object when multiple objects of various types are placed, tightly or loosely, in a container (such as a transfer box) or on a table top is a problem to be solved.
Disclosure of Invention
The present disclosure provides a technical solution for detecting a pick-up point of an object.
According to an aspect of the present disclosure, there is provided a method of detecting a pickup point of an object, including:
acquiring an image to be detected and a first depth map corresponding to the image to be detected;
respectively extracting features of the image to be detected and the first depth map to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map;
according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, carrying out feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map to obtain a feature fusion map corresponding to the image to be detected;
And determining the position information of the point to be picked up of the object in the image to be detected according to the feature fusion map corresponding to the image to be detected.
In the method, features are extracted from an image to be detected and from a first depth map corresponding to the image to be detected, respectively, to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map; feature fusion is performed on the two feature maps according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, to obtain a feature fusion map corresponding to the image to be detected; and the position information of the point to be picked up of the object in the image to be detected is determined according to the feature fusion map corresponding to the image to be detected. In this way, the information in the image to be detected and the information in the first depth map can be fused more fully, and the position of the point to be picked up can be predicted more accurately. A device such as a robot or a mechanical arm can therefore pick up the object according to the position of the point to be picked up of the object in the image to be detected, which can improve the success rate of picking up the object.
In one possible implementation manner, after the determining the position information of the point to be picked up of the object in the image to be detected, the method further includes:
determining a depth value corresponding to the point to be picked up according to the first depth map;
and determining pose information of the picked component according to the position information of the point to be picked and the depth value corresponding to the point to be picked.
According to the pose information of the pick-up component determined in this implementation, a device such as a robot or a mechanical arm can pick up the object.
In a possible implementation manner, the feature extracting the to-be-detected image and the first depth map respectively to obtain a feature map corresponding to the to-be-detected image and a feature map corresponding to the first depth map includes:
respectively carrying out multistage feature extraction on the image to be detected and the first depth map to obtain a multistage feature map corresponding to the image to be detected and a multistage feature map corresponding to the first depth map;
the feature fusion of the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map is performed according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, so as to obtain a feature fusion map corresponding to the image to be detected, including:
For any one of the multiple levels, carrying out feature fusion on the level feature map corresponding to the image to be detected and the level feature map corresponding to the first depth map according to the weight of the level feature map corresponding to the image to be detected and the weight of the level feature map corresponding to the first depth map, so as to obtain the level feature fusion map corresponding to the image to be detected;
the determining the position information of the point to be picked up of the object in the image to be detected according to the feature fusion map corresponding to the image to be detected comprises:
and determining the position information of the point to be picked up of the object in the image to be detected according to the multi-level feature fusion map corresponding to the image to be detected.
In this implementation manner, multi-level feature extraction is performed on the image to be detected and the first depth map respectively to obtain a multi-level feature map corresponding to the image to be detected and a multi-level feature map corresponding to the first depth map, feature fusion is performed on the feature maps of each level respectively to obtain a feature fusion map of each level, and the position information of the point to be picked up of the object in the image to be detected is determined according to the multi-level feature fusion maps corresponding to the image to be detected. In this way, the information in the image to be detected and the information in the first depth map can be fully extracted, which can improve the accuracy of the position prediction of the pick-up point.
In a possible implementation manner, the feature extracting the to-be-detected image and the first depth map respectively to obtain a feature map corresponding to the to-be-detected image and a feature map corresponding to the first depth map includes:
inputting the image to be detected and the first depth map into a neural network, and respectively extracting features of the image to be detected and the first depth map through the neural network to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map;
the feature fusion of the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map is performed according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, so as to obtain a feature fusion map corresponding to the image to be detected, including:
determining the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map through the neural network, and carrying out feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map to obtain a feature fusion map corresponding to the image to be detected;
The determining the position information of the point to be picked up of the object in the image to be detected according to the feature fusion map corresponding to the image to be detected comprises:
obtaining a first position prediction graph of a pickup point corresponding to the image to be detected according to the feature fusion graph corresponding to the image to be detected through the neural network, wherein the first position prediction graph is used for representing the position of a pixel in the image to be detected as a prediction value of the confidence coefficient of the pickup point;
and determining the position information of the point to be picked up of the object in the image to be detected according to the first position prediction graph.
In this implementation manner, the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map are determined through the neural network, so that the relative contribution of the information drawn from the image to be detected and from the first depth map can be adjusted adaptively, and the two are fused according to their weights, so that more accurate position information of the point to be picked up can be predicted.
In a possible implementation manner, the determining the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map includes:
Performing convolution operation on the feature images corresponding to the images to be detected to obtain a first convolution result;
performing convolution operation on the feature map corresponding to the first depth map to obtain a second convolution result;
performing an activating operation on the sum of the first convolution result and the second convolution result to obtain a first activating result;
performing convolution operation on the first activation result to obtain a third convolution result;
and activating the third convolution result to obtain the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map.
According to this implementation manner, the weights of the image to be detected and of the first depth map in the decision of the neural network can be adaptively adjusted based on the attention mechanism, so that the information contained in the image to be detected and the first depth map can be learned and fused better and more fully, and various conditions can be dealt with (for example, depth maps of poor quality occasionally caused by ambient shadows and illumination).
In one possible implementation, before the inputting the image to be detected and the first depth map into a neural network, the method further includes:
acquiring a training image, a second depth map corresponding to the training image and a real position map of a pickup point corresponding to the training image, wherein the real position map is used for representing the position of a pixel in the training image as a real value of the confidence coefficient of the pickup point;
Inputting the training image and the second depth map into a neural network, respectively extracting features of the training image and the second depth map through the neural network to obtain a feature map corresponding to the training image and a feature map corresponding to the second depth map, determining weights of the feature map corresponding to the training image and the feature map corresponding to the second depth map, and performing feature fusion on the feature map corresponding to the training image and the feature map corresponding to the second depth map according to the weights of the feature map corresponding to the training image and the feature map corresponding to the second depth map to obtain a feature fusion map corresponding to the training image, and obtaining a second position prediction map of a pick-up point corresponding to the training image according to the feature fusion map corresponding to the training image, wherein the second position prediction map is used for representing the position of a pixel in the training image as a prediction value of the confidence degree of the pick-up point;
training the neural network based on the difference between the true position map and the second position prediction map.
The neural network trained in this implementation manner can adaptively adjust the relative contribution of the information drawn from the image to be detected and from the first depth map, and fuse the two according to their weights, so that more accurate position information of the point to be picked up can be predicted.
In one possible implementation manner, the acquiring a training image, a second depth map corresponding to the training image, and a real position map of a pickup point corresponding to the training image includes:
obtaining a training image according to an image of a simulation scene, wherein the simulation scene comprises an object model and a background model;
acquiring a depth map of the simulation scene as a second depth map corresponding to the training image;
and obtaining a real position map of the pick-up point corresponding to the training image according to the parameters of the object model.
According to this implementation, the neural network can be trained using simulation data and then applied to the problem of pick-up point detection of objects in real scenes. Collecting the real position maps of the pick-up points corresponding to the training images with a simulation system can greatly reduce the annotation cost, and thus the cost of the whole system.
In one possible implementation manner, the obtaining a training image according to the image of the simulation scene includes:
generating a simulation scene;
controlling at least one object model in the simulation scene to drop randomly onto a workbench model in the simulation scene from above the workbench model until the at least one object model comes to rest; and/or randomly adjusting the object model and/or the background model in the simulation scene to obtain a plurality of training images.
In this implementation manner, the object models in the simulation scene are controlled to drop randomly onto the workbench model from above the workbench model until the object models come to rest, so that the stacking of objects in real scenes can be simulated, and the neural network trained on this basis can learn to handle stacked objects in real scenes. By randomly adjusting the object model and/or the background model in the simulation scene, a large number of training images can be obtained, so that the neural network trained on them can have higher accuracy and robustness.
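As a purely illustrative sketch of such simulation-based data collection (the disclosure does not name a simulator; PyBullet, the URDF asset paths and the camera parameters below are assumptions introduced only for illustration):

```python
import random
import pybullet as p
import pybullet_data

def generate_training_sample(object_urdfs, num_objects=5, sim_steps=500):
    p.connect(p.DIRECT)                      # headless physics simulation
    p.setAdditionalSearchPath(pybullet_data.getDataPath())
    p.setGravity(0, 0, -9.8)
    p.loadURDF("plane.urdf")                 # stands in for the workbench model

    # Drop object models from random poses above the workbench until they settle.
    body_ids = []
    for _ in range(num_objects):
        urdf = random.choice(object_urdfs)   # hypothetical object model paths
        pos = [random.uniform(-0.2, 0.2), random.uniform(-0.2, 0.2), 0.5]
        orn = p.getQuaternionFromEuler([random.uniform(0, 3.14) for _ in range(3)])
        body_ids.append(p.loadURDF(urdf, basePosition=pos, baseOrientation=orn))
    for _ in range(sim_steps):
        p.stepSimulation()

    # Render an RGB image and a depth buffer of the settled scene from a top-down camera.
    view = p.computeViewMatrix(cameraEyePosition=[0, 0, 1.0],
                               cameraTargetPosition=[0, 0, 0],
                               cameraUpVector=[0, 1, 0])
    proj = p.computeProjectionMatrixFOV(fov=60, aspect=1.0, nearVal=0.01, farVal=2.0)
    width, height, rgb, depth, seg = p.getCameraImage(480, 480, view, proj)
    # `depth` is the OpenGL depth buffer; converting it to metric depth uses the near/far planes.

    # Settled object poses; the real position map of the pick-up points would be
    # rendered from these poses and the parameters of each object model.
    poses = [p.getBasePositionAndOrientation(b) for b in body_ids]
    p.disconnect()
    return rgb, depth, seg, poses
```

In such a setup, the ground-truth position map of the pick-up points would be derived from the settled object poses and the object model parameters, which is what keeps the annotation cost low.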
According to an aspect of the present disclosure, there is provided an apparatus for detecting a pickup point of an object, including:
the acquisition module is used for acquiring an image to be detected and a first depth map corresponding to the image to be detected;
the feature extraction module is used for respectively extracting features of the image to be detected and the first depth map to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map;
the feature fusion module is used for carrying out feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, so as to obtain a feature fusion map corresponding to the image to be detected;
And the first determining module is used for determining the position information of the point to be picked up of the object in the image to be detected according to the feature fusion map corresponding to the image to be detected.
In one possible implementation, the apparatus further includes:
the second determining module is used for determining a depth value corresponding to the point to be picked up according to the first depth map;
and the third determining module is used for determining pose information of the picked-up component according to the position information of the point to be picked up and the depth value corresponding to the point to be picked up.
In one possible implementation manner, the feature extraction module is configured to:
respectively carrying out multistage feature extraction on the image to be detected and the first depth map to obtain a multistage feature map corresponding to the image to be detected and a multistage feature map corresponding to the first depth map;
the feature fusion module is used for:
for any one of the multiple levels, carrying out feature fusion on the level feature map corresponding to the image to be detected and the level feature map corresponding to the first depth map according to the weight of the level feature map corresponding to the image to be detected and the weight of the level feature map corresponding to the first depth map, so as to obtain the level feature fusion map corresponding to the image to be detected;
The first determining module is used for:
and determining the position information of the point to be picked up of the object in the image to be detected according to the multi-level feature fusion map corresponding to the image to be detected.
In one possible implementation manner, the feature extraction module is configured to:
inputting the image to be detected and the first depth map into a neural network, and respectively extracting features of the image to be detected and the first depth map through the neural network to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map;
the feature fusion module is used for:
determining the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map through the neural network, and carrying out feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map to obtain a feature fusion map corresponding to the image to be detected;
the first determining module is used for:
obtaining a first position prediction graph of a pickup point corresponding to the image to be detected according to the feature fusion graph corresponding to the image to be detected through the neural network, wherein the first position prediction graph is used for representing the position of a pixel in the image to be detected as a prediction value of the confidence coefficient of the pickup point;
And determining the position information of the point to be picked up of the object in the image to be detected according to the first position prediction graph.
In one possible implementation manner, the feature fusion module is configured to:
performing convolution operation on the feature images corresponding to the images to be detected to obtain a first convolution result;
performing convolution operation on the feature map corresponding to the first depth map to obtain a second convolution result;
performing an activating operation on the sum of the first convolution result and the second convolution result to obtain a first activating result;
performing convolution operation on the first activation result to obtain a third convolution result;
and activating the third convolution result to obtain the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map.
In one possible implementation manner, the apparatus further includes a training module, where the training module is configured to:
acquiring a training image, a second depth map corresponding to the training image and a real position map of a pickup point corresponding to the training image, wherein the real position map is used for representing the position of a pixel in the training image as a real value of the confidence coefficient of the pickup point;
Inputting the training image and the second depth map into a neural network, respectively extracting features of the training image and the second depth map through the neural network to obtain a feature map corresponding to the training image and a feature map corresponding to the second depth map, determining weights of the feature map corresponding to the training image and the feature map corresponding to the second depth map, and performing feature fusion on the feature map corresponding to the training image and the feature map corresponding to the second depth map according to the weights of the feature map corresponding to the training image and the feature map corresponding to the second depth map to obtain a feature fusion map corresponding to the training image, and obtaining a second position prediction map of a pick-up point corresponding to the training image according to the feature fusion map corresponding to the training image, wherein the second position prediction map is used for representing the position of a pixel in the training image as a prediction value of the confidence degree of the pick-up point;
training the neural network based on the difference between the true position map and the second position prediction map.
In one possible implementation, the training module is configured to:
Obtaining a training image according to an image of a simulation scene, wherein the simulation scene comprises an object model and a background model;
acquiring a depth map of the simulation scene as a second depth map corresponding to the training image;
and obtaining a real position map of the pick-up point corresponding to the training image according to the parameters of the object model.
In one possible implementation, the training module is configured to:
generating a simulation scene;
controlling at least one object model in the simulation scene to drop randomly onto a workbench model in the simulation scene from above the workbench model until the at least one object model comes to rest; and/or randomly adjusting the object model and/or the background model in the simulation scene to obtain a plurality of training images.
According to an aspect of the present disclosure, there is provided a robot including a detection module and a pickup part, wherein the detection module is connected with the pickup part; the detection module is used for realizing the method so as to obtain the position information of the point to be picked up of the object in the image to be detected; the pickup part is used for picking up the object according to the position information of the point to be picked up of the object in the image to be detected.
According to an aspect of the present disclosure, there is provided an electronic apparatus including: one or more processors; a memory for storing executable instructions; wherein the one or more processors are configured to invoke the executable instructions stored by the memory to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiments of the present disclosure, feature extraction is performed on the image to be detected and on the first depth map corresponding to the image to be detected, respectively, to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map; feature fusion is performed on the two feature maps according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, to obtain a feature fusion map corresponding to the image to be detected; and the position information of the point to be picked up of the object in the image to be detected is determined according to the feature fusion map corresponding to the image to be detected. In this way, the information in the image to be detected and in the depth map can be fused more fully based on the attention mechanism, and the position of the point to be picked up can be predicted more accurately. A device such as a robot or a mechanical arm can therefore pick up the object according to the position of the point to be picked up of the object in the image to be detected, which can improve the success rate of picking up the object.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 shows a flowchart of a method of detecting a pick-up point of an object provided by an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a neural network in an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of an attention-based fusion module in an embodiment of the present disclosure.
Fig. 4 shows a block diagram of a robot provided by an embodiment of the present disclosure.
Fig. 5 shows a block diagram of an apparatus for detecting a pick-up point of an object provided by an embodiment of the present disclosure.
Fig. 6 shows a block diagram of an electronic device 800 provided by an embodiment of the present disclosure.
Fig. 7 shows a block diagram of an electronic device 1900 provided by an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
In the embodiments of the present disclosure, feature extraction is performed on the image to be detected and on the first depth map corresponding to the image to be detected, respectively, to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map; feature fusion is performed on the two feature maps according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, to obtain a feature fusion map corresponding to the image to be detected; and the position information of the point to be picked up of the object in the image to be detected is determined according to the feature fusion map corresponding to the image to be detected. In this way, the information in the image to be detected and in the depth map can be fused more fully based on the attention mechanism, and the position of the point to be picked up can be predicted more accurately. A device such as a robot or a mechanical arm can therefore pick up the object according to the position of the point to be picked up of the object in the image to be detected, which can improve the success rate of picking up the object.
Fig. 1 shows a flowchart of a method of detecting a pick-up point of an object provided by an embodiment of the present disclosure. The execution subject of the method may be an apparatus for detecting a pick-up point of an object. For example, the method may be performed by a terminal device, a server or other processing device. The terminal device may be a robot (e.g., a sorting robot), a mechanical arm, a user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like. In some possible implementations, the method of detecting a pick-up point of an object may be implemented by a processor invoking computer readable instructions stored in a memory. As shown in Fig. 1, the method of detecting a pick-up point of an object includes steps S11 to S14.
In step S11, an image to be detected and a first depth map corresponding to the image to be detected are obtained.
In the embodiment of the present disclosure, the image to be detected may be an image of a real scene. The first depth map represents a depth map of the image to be detected. The image to be detected and the first depth map may be acquired by an image acquisition device, or may be acquired from a database or other device in which the image to be detected and the first depth map are stored.
In one possible implementation, the image to be detected is a two-dimensional image. According to the implementation mode, on the premise that a three-dimensional model of an object is not required to be acquired in advance, the two-dimensional image to be detected and the depth map corresponding to the image to be detected are combined to predict the position of the point to be picked up of the object, so that hardware cost can be reduced, and calculation cost is reduced.
In one possible implementation, the image to be detected is an RGB (Red, Green, Blue) image.
In the embodiments of the disclosure, the pixel value of a pixel of the first depth map may represent the distance between the image acquisition module used for acquiring the first depth map and the corresponding surface point of an object in the image to be detected. The first depth map and the image to be detected may be acquired by the same image acquisition module, or by different image acquisition modules. In the case that the first depth map and the image to be detected are acquired by different image acquisition modules, the image acquisition module used for acquiring the first depth map and the image acquisition module used for acquiring the image to be detected may be arranged at adjacent positions, so that the two modules acquire images with the same visual angle.
In step S12, feature extraction is performed on the image to be detected and the first depth map, so as to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map.
In the embodiment of the disclosure, the feature extraction may be performed on the to-be-detected image and the first depth map by performing a convolution operation on the to-be-detected image and the first depth map, or performing a convolution operation and an activation operation on the to-be-detected image and the first depth map, or performing a convolution operation, an activation operation and a pooling operation on the to-be-detected image and the first depth map, so as to obtain a feature map corresponding to the to-be-detected image and a feature map corresponding to the first depth map.
In a possible implementation manner, the feature extracting the to-be-detected image and the first depth map respectively to obtain a feature map corresponding to the to-be-detected image and a feature map corresponding to the first depth map includes: and respectively carrying out multistage feature extraction on the image to be detected and the first depth map to obtain a multistage feature map corresponding to the image to be detected and a multistage feature map corresponding to the first depth map.
In this implementation, the image to be detected and the first depth map may be subjected to multi-level feature extraction by a plurality of convolution layers, respectively. For example, 3-level feature extraction may be performed on the image to be detected and the first depth map by 3-level convolution layers, respectively. Extracting features of the image to be detected to obtain a first feature map S_1-1 corresponding to the image to be detected, and extracting features of the first depth map to obtain a first feature map S_1-2 corresponding to the first depth map; performing feature extraction on a first feature map S_1-1 corresponding to an image to be detected to obtain a second feature map S_2-1 corresponding to the image to be detected, and performing feature extraction on the first feature map S_1-2 corresponding to the first depth map to obtain a second feature map S_2-2 corresponding to the first depth map; and carrying out feature extraction on the second feature map S_2-1 corresponding to the image to be detected to obtain a third feature map S_3-1 corresponding to the image to be detected, and carrying out feature extraction on the second feature map S_2-2 corresponding to the first depth map to obtain a third feature map S_3-2 corresponding to the first depth map.
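A minimal PyTorch sketch of the three-level, two-branch feature extraction in the example above; the channel counts, strides and activation functions are assumptions, not values taken from the disclosure:

```python
import torch.nn as nn

class ThreeLevelEncoder(nn.Module):
    """Produces the three feature maps for one input modality
    (RGB image to be detected, or first depth map)."""
    def __init__(self, in_channels):
        super().__init__()
        self.level1 = nn.Sequential(nn.Conv2d(in_channels, 32, 3, stride=2, padding=1),
                                    nn.ReLU(inplace=True))
        self.level2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1),
                                    nn.ReLU(inplace=True))
        self.level3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1),
                                    nn.ReLU(inplace=True))

    def forward(self, x):
        s1 = self.level1(x)   # S_1-1 for the image branch, S_1-2 for the depth branch
        s2 = self.level2(s1)  # S_2-1 / S_2-2
        s3 = self.level3(s2)  # S_3-1 / S_3-2
        return s1, s2, s3

rgb_encoder = ThreeLevelEncoder(in_channels=3)    # image to be detected (RGB)
depth_encoder = ThreeLevelEncoder(in_channels=1)  # first depth map
```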
In this implementation manner, the multi-level feature extraction is performed on the image to be detected and the first depth map respectively, so that the multi-level feature map corresponding to the image to be detected and the multi-level feature map corresponding to the first depth map are obtained, and therefore information in the image to be detected and the first depth map can be fully extracted, and accuracy of position prediction of the pick-up point can be improved.
In another possible implementation manner, only the to-be-detected image and the first depth map may be subjected to primary feature extraction respectively, so as to obtain a primary feature map corresponding to the to-be-detected image and a primary feature map corresponding to the first depth map.
In a possible implementation manner, the feature extracting the to-be-detected image and the first depth map respectively to obtain a feature map corresponding to the to-be-detected image and a feature map corresponding to the first depth map includes: inputting the image to be detected and the first depth map into a neural network, and respectively extracting features of the image to be detected and the first depth map through the neural network to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map. The neural network is trained in advance by combining a training image and a second depth map corresponding to the training image. In this implementation, feature extraction may be performed on the image to be detected and the first depth map respectively by a convolution layer in the neural network.
In step S13, feature fusion is performed on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, so as to obtain a feature fusion map corresponding to the image to be detected.
In the embodiment of the present disclosure, the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map may be obtained based on an attention mechanism. And according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, calculating the weighted sum of the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map, thereby obtaining the feature fusion map corresponding to the image to be detected.
In the embodiment of the present disclosure, the feature fusion graph may represent a graph obtained by feature fusion of the feature graph.
In the embodiment of the disclosure, feature fusion is performed on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, so that information in the image to be detected and the first depth map can be fully combined based on an attention mechanism.
In one possible implementation manner, the performing feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map to obtain a feature fusion map corresponding to the image to be detected includes: for any one of the multiple levels, performing feature fusion on the level feature map corresponding to the image to be detected and the level feature map corresponding to the first depth map according to the weight of the level feature map corresponding to the image to be detected and the weight of the level feature map corresponding to the first depth map, so as to obtain the level feature fusion map corresponding to the image to be detected.
For example, suppose the multiple levels include three levels. Feature fusion is performed on the first feature map S_1-1 corresponding to the image to be detected and the first feature map S_1-2 corresponding to the first depth map to obtain a feature fusion map F1; feature fusion is performed on the second feature map S_2-1 corresponding to the image to be detected and the second feature map S_2-2 corresponding to the first depth map to obtain a feature fusion map F2; and feature fusion is performed on the third feature map S_3-1 corresponding to the image to be detected and the third feature map S_3-2 corresponding to the first depth map to obtain a feature fusion map F3.
In this implementation manner, the feature fusion map of each level is obtained by performing feature fusion on the feature maps of each level respectively, so that the information in the image to be detected and the first depth map can be better utilized based on the attention mechanism, and the accuracy of the position prediction of the pick-up point can be improved.
In another possible implementation manner, if only one level of feature extraction is performed on the image to be detected and the first depth map, feature fusion may be performed on the level feature map corresponding to the image to be detected and the level feature map corresponding to the first depth map according to the weight of the level feature map corresponding to the image to be detected and the weight of the level feature map corresponding to the first depth map, so as to obtain the level feature fusion map corresponding to the image to be detected.
In one possible implementation manner, the performing feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map to obtain a feature fusion map corresponding to the image to be detected includes: and determining the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map through the neural network, and carrying out feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map to obtain a feature fusion map corresponding to the image to be detected.
In practical applications, scenes are various: sometimes the neural network needs to rely on the information of the image to be detected to make decisions, and sometimes it needs to rely on the information of the first depth map. In this implementation manner, the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map are determined through the neural network, so that the relative contribution of the information drawn from the image to be detected and from the first depth map can be adjusted adaptively, and the two are fused according to their weights, so that more accurate position information of the point to be picked up can be predicted.
As an example of this implementation manner, the determining the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map includes: performing convolution operation on the feature images corresponding to the images to be detected to obtain a first convolution result; performing convolution operation on the feature map corresponding to the first depth map to obtain a second convolution result; performing an activating operation on the sum of the first convolution result and the second convolution result to obtain a first activating result; performing convolution operation on the first activation result to obtain a third convolution result; and activating the third convolution result to obtain the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map.
In one example, a convolution layer may perform a 1×1 convolution operation on a feature map corresponding to the image to be detected, to obtain a first convolution result; carrying out 1×1 convolution operation on the feature map corresponding to the first depth map through the convolution layer to obtain a second convolution result; performing an activation operation on the sum of the first convolution result and the second convolution result through an activation function ReLU to obtain a first activation result; carrying out 1×1 convolution operation on the first activation result through the convolution layer to obtain a third convolution result; and activating the third convolution result through an activating function Sigmoid to obtain the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map. Of course, those skilled in the art may flexibly select the size of the convolution kernel and the type of the activation function according to the actual application scenario requirements, which is not limited by the embodiment of the present disclosure.
According to this example, the weights of the image to be detected and of the first depth map in the decision of the neural network can be adaptively adjusted based on the attention mechanism, whereby the information contained in the image to be detected and the first depth map can be learned and fused better and more fully, so that various situations can be dealt with (for example, depth maps of poor quality sometimes caused by ambient shadows and illumination).
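The following PyTorch sketch illustrates the attention-based weight computation of this example (1×1 convolutions on the two feature maps, ReLU on their sum, another 1×1 convolution, then a Sigmoid) together with the weighted-sum fusion of step S13. Treating the Sigmoid output as the weight of the image branch and its complement as the weight of the depth branch is an assumption; the disclosure only states that both weights are obtained from the activation result:

```python
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Attention-based fusion of one level's RGB feature map and depth feature map."""
    def __init__(self, channels):
        super().__init__()
        self.conv_rgb = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_depth = nn.Conv2d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv_out = nn.Conv2d(channels, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, feat_rgb, feat_depth):
        a = self.conv_rgb(feat_rgb)               # first convolution result
        b = self.conv_depth(feat_depth)           # second convolution result
        act = self.relu(a + b)                    # first activation result
        w_rgb = self.sigmoid(self.conv_out(act))  # weight of the RGB feature map
        w_depth = 1.0 - w_rgb                     # weight of the depth feature map (assumed complement)
        # Weighted sum gives the feature fusion map for this level.
        return w_rgb * feat_rgb + w_depth * feat_depth
```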
In step S14, position information of a point to be picked up of an object in the image to be detected is determined according to the feature fusion map corresponding to the image to be detected.
In the embodiments of the present disclosure, picking up may be grabbing or suction, and the pick-up point may correspondingly be a grabbing point or a suction point.
In a possible implementation manner, the determining, according to the feature fusion map corresponding to the image to be detected, the position information of the point to be picked up of the object in the image to be detected includes: and determining the position information of the point to be picked up of the object in the image to be detected according to the multi-level feature fusion map corresponding to the image to be detected.
In the implementation manner, the position information of the point to be picked up of the object in the image to be detected is determined according to the multi-level feature fusion map corresponding to the image to be detected, so that the information in the image to be detected and the first depth map can be fully utilized, and the accuracy of the predicted position of the point to be picked up can be improved.
As an example of this implementation manner, a first position prediction graph of a pickup point corresponding to the image to be detected may be obtained according to a multi-level feature fusion graph corresponding to the image to be detected, and position information of the pickup point of the object in the image to be detected may be determined according to the first position prediction graph.
For example, the multi-level feature fusion maps corresponding to the image to be detected may be respectively convolved and up-sampled, the up-sampling results corresponding to the multi-level feature fusion maps may be spliced, and the splicing result may then be further convolved and up-sampled to obtain the first position prediction map of the pick-up point corresponding to the image to be detected. The splicing may be performed by concatenation (concat) or the like.
For another example, the multi-level feature fusion maps corresponding to the image to be detected may be respectively convolved and up-sampled, the up-sampling results corresponding to the multi-level feature fusion maps may be added, and the sum may then be further convolved and up-sampled to obtain the first position prediction map of the pick-up point corresponding to the image to be detected.
For another example, the multi-level feature fusion maps corresponding to the image to be detected may be respectively convolved and up-sampled, and the up-sampling results corresponding to the multi-level feature fusion maps may then be added to obtain the first position prediction map of the pick-up point corresponding to the image to be detected.
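A sketch of the first aggregation variant above (convolve and up-sample each level's feature fusion map, splice, then convolve and up-sample the spliced result); the channel counts, the intermediate resolution, the interpolation mode and the final Sigmoid are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionPredictionHead(nn.Module):
    """Aggregates the per-level feature fusion maps F1, F2, F3 into a
    single-channel position prediction map."""
    def __init__(self, level_channels=(32, 64, 128), mid_channels=32):
        super().__init__()
        self.reducers = nn.ModuleList(
            [nn.Conv2d(c, mid_channels, kernel_size=3, padding=1) for c in level_channels])
        self.head = nn.Conv2d(mid_channels * len(level_channels), 1, kernel_size=3, padding=1)

    def forward(self, fusion_maps, out_size):
        mid_size = (out_size[0] // 2, out_size[1] // 2)
        ups = []
        for reduce, f in zip(self.reducers, fusion_maps):  # F1, F2, F3
            x = reduce(f)                                   # convolve
            x = F.interpolate(x, size=mid_size, mode='bilinear', align_corners=False)
            ups.append(x)                                   # up-sample
        x = torch.cat(ups, dim=1)                           # splice (concat)
        x = self.head(x)                                    # convolve the spliced result
        x = F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)
        return torch.sigmoid(x)  # per-pixel confidence of being a pick-up point
```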
In another possible implementation manner, if only one level of feature extraction is performed on the image to be detected and the first depth map, the position information of the point to be picked up of the object in the image to be detected may be determined according to the level feature fusion map corresponding to the image to be detected. For example, the level feature fusion map corresponding to the image to be detected may be convolved and up-sampled to obtain the position information of the point to be picked up of the object in the image to be detected.
In a possible implementation manner, the determining, according to the feature fusion map corresponding to the image to be detected, the position information of the point to be picked up of the object in the image to be detected includes: obtaining a first position prediction graph of a pickup point corresponding to the image to be detected according to the feature fusion graph corresponding to the image to be detected through the neural network, wherein the first position prediction graph is used for representing the position of a pixel in the image to be detected as a prediction value of the confidence coefficient of the pickup point; and determining the position information of the point to be picked up of the object in the image to be detected according to the first position prediction graph.
In this implementation, the first position prediction graph of the pick-up point corresponding to the image to be detected may have the same size as the image to be detected. The first position prediction graph represents the position prediction graph of the pick-up point corresponding to the image to be detected. The pixel value of any pixel in the first position prediction graph may represent the possibility that the object can be successfully picked up by using the position of that pixel as the pick-up point: the larger the pixel value of any pixel in the first position prediction graph (i.e., the larger the predicted value of the confidence that the position of the pixel is a pick-up point), the higher the success rate of picking up the object with the position of that pixel as the pick-up point. In this implementation, the neural network predicts the pick-up point of the object at the pixel level, so the accuracy of the determined pick-up point of the object can be improved, with strong robustness and good interpretability.
As an example of this implementation, the determining, according to the first position prediction graph, the position information of the point to be picked up of the object in the image to be detected includes: determining the coordinates of the pixel with the highest confidence in the first position prediction graph as the coordinates of the point to be picked up of the object in the image to be detected. With this example, the success rate of picking up the object can be improved. For example, if the pixel with the highest confidence in the first position prediction graph is the pixel in the x-th row and the y-th column, then (x, y) may be determined as the coordinates of the point to be picked up.
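As a brief illustration (an assumed helper, not defined in the disclosure), selecting the highest-confidence pixel from the first position prediction graph can be written as:

```python
import numpy as np

def pick_point_from_prediction(pred_map):
    """Return (row, column) of the pixel with the highest predicted confidence.
    pred_map: HxW array with values in [0, 1]. The document denotes these
    coordinates as (x, y), i.e. x-th row and y-th column."""
    x, y = np.unravel_index(np.argmax(pred_map), pred_map.shape)
    return int(x), int(y)
```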
In one possible implementation manner, after the determining the position information of the point to be picked up of the object in the image to be detected, the method further includes: determining a depth value corresponding to the point to be picked up according to the first depth map; and determining pose information of the pick-up component according to the position information of the point to be picked up and the depth value corresponding to the point to be picked up. The pick-up component may be a gripping component or a suction component. For example, the gripping component of a robotic arm may be a gripper, and the suction component may be a suction cup. For example, if the coordinates of the point to be picked up are (x, y), the pixel value z of the pixel in the x-th row and the y-th column of the first depth map may be determined as the depth value corresponding to the point to be picked up, and (x, y, z, rx, ry, rz) may be determined as the pose information of the pick-up component. Here, (rx, ry, rz) represents the normal vector corresponding to the point to be picked up, which can be determined according to the tangent plane of the point to be picked up. The tangent plane of the point to be picked up may be determined according to the three-dimensional coordinates of the point to be picked up and the three-dimensional coordinates of points on the object surface around the point to be picked up, where the three-dimensional coordinates of the point to be picked up may be (x, y, z). According to the pose information of the pick-up component determined in this implementation, six-degree-of-freedom pick-up can be realized by a device such as a robot or a mechanical arm.
In this implementation, before determining the pose information of the pick-up component, hand-eye calibration may be performed on the device such as a robot or a mechanical arm. After hand-eye calibration, the coordinates of the point to be picked up in the first position prediction graph are the coordinates of the point to be picked up in the coordinate system of the device such as the robot or the mechanical arm, and the depth value of the point to be picked up in the first depth map is the depth value of the point to be picked up in that same coordinate system. By combining the coordinates of the point to be picked up and the depth value corresponding to the point to be picked up, the pose information of the pick-up component of the device such as the robot or the mechanical arm can be obtained.
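The pose computation described above may be sketched as follows; this is an illustrative least-squares tangent-plane fit under an assumed window size and normal-orientation convention, not the patented procedure, and the function name is hypothetical:

```python
import numpy as np

def pick_pose(x, y, depth_map, window=5):
    """Build 6-DoF pose information (x, y, z, rx, ry, rz) for the pick-up component
    from the pick-up point coordinates (x = row, y = column, e.g. obtained with the
    helper sketched earlier) and the first depth map, assuming hand-eye calibration
    has already been performed."""
    z = float(depth_map[x, y])   # depth value corresponding to the point to be picked up

    # Collect 3D points on the object surface around the pick-up point and fit a
    # tangent plane by least squares; its normal gives (rx, ry, rz).
    h, w = depth_map.shape
    rows = np.arange(max(0, x - window), min(h, x + window + 1))
    cols = np.arange(max(0, y - window), min(w, y + window + 1))
    rr, cc = np.meshgrid(rows, cols, indexing='ij')
    pts = np.stack([rr.ravel(), cc.ravel(),
                    depth_map[rr, cc].ravel()], axis=1).astype(float)

    centered = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    rx, ry, rz = vt[-1]          # normal of the fitted tangent plane
    return (float(x), float(y), z, float(rx), float(ry), float(rz))
```

A suction or gripping component would then approach the point (x, y, z) along the direction (rx, ry, rz).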
In an embodiment of the disclosure, before the inputting the image to be detected and the first depth map into the neural network, the method may further include: acquiring a training image, a second depth map corresponding to the training image and a real position map of a pickup point corresponding to the training image, wherein the real position map is used for representing the real value of the confidence that the position of a pixel in the training image is a pickup point; inputting the training image and the second depth map into a neural network, respectively performing feature extraction on the training image and the second depth map through the neural network to obtain a feature map corresponding to the training image and a feature map corresponding to the second depth map, determining the weight of the feature map corresponding to the training image and the weight of the feature map corresponding to the second depth map, performing feature fusion on the feature map corresponding to the training image and the feature map corresponding to the second depth map according to the weight of the feature map corresponding to the training image and the weight of the feature map corresponding to the second depth map to obtain a feature fusion map corresponding to the training image, and obtaining a second position prediction map of the pickup point corresponding to the training image according to the feature fusion map corresponding to the training image, wherein the second position prediction map is used for representing the predicted value of the confidence that the position of a pixel in the training image is a pickup point; and training the neural network based on the difference between the real position map and the second position prediction map.
Wherein the second depth map represents a depth map of the training image. The pixel values of the pixels of the second depth map may represent distances between the image acquisition module for acquiring the second depth map and object surface points in the training image corresponding to the second depth map.
In the embodiment of the disclosure, the sizes of the training image, the second depth map corresponding to the training image, and the real position map of the pickup point corresponding to the training image may be the same.
In one possible implementation manner, the training image is a two-dimensional image. By combining the two-dimensional training image and the depth map for prediction, a more accurate pixel-level pick-up point position map can be obtained without acquiring a three-dimensional model of the object to be picked up in advance, so that both hardware cost and computation cost can be reduced.
In one possible implementation, the training image is an RGB image.
In the embodiment of the disclosure, the information in the training image and the depth map can be fully utilized to predict and obtain the pixel-level pick-up point position map by combining the training image and the depth map based on the attention mechanism, so that the trained neural network can predict and obtain the more accurate position of the point to be picked up.
In the embodiment of the disclosure, in the true position map of the pickup point corresponding to the training image, the pixel value of a pixel that can be used as a pickup point is different from the pixel value of a pixel that cannot be used as a pickup point. For example, in the true position map of the pickup point corresponding to the training image, the pixel value of a pixel that can be used as a pickup point is 1 (i.e., the true value of the confidence that the position of the pixel is a pickup point is 1), and the pixel value of a pixel that cannot be used as a pickup point is 0 (i.e., the true value of the confidence that the position of the pixel is a pickup point is 0). In one possible implementation manner, in the true position map of the pickup point corresponding to the training image, the pixel values of all pixels that can be used as pickup points are the same, and the pixel values of all pixels that cannot be used as pickup points are the same. Here, that a certain pixel can be used as a pickup point means that, if the object is picked up at the position of the pixel, the possibility that the object can be successfully picked up is high; that a pixel cannot be used as a pickup point means that, if the object is picked up at the position of the pixel, the possibility that the object can be successfully picked up is low. The number of pixels on an object that can be used as pickup points may be one or more.
In the embodiment of the disclosure, the second position prediction graph represents a position prediction graph of the pickup point corresponding to the training image. The pixel value of any pixel in the second position prediction graph may represent the possibility that the object can be successfully picked up by using the position of the pixel as the pick-up point. The larger the pixel value of any pixel in the second position prediction graph (i.e., the larger the predicted value of the confidence that the position of the pixel is a pick-up point), the higher the success rate of picking up the object with the position of the pixel as the pick-up point. By predicting the pick-up point of the object at the pixel level, the embodiment of the disclosure can improve the accuracy of the determined pick-up point of the object.
In an embodiment of the present disclosure, the second position prediction map of the pickup point corresponding to the training image may be the same as the training image in size.
In one possible implementation manner, the acquiring a training image, a second depth map corresponding to the training image, and a real position map of a pickup point corresponding to the training image includes: obtaining a training image according to an image of a simulation scene, wherein the simulation scene comprises an object model and a background model; acquiring a depth map of the simulation scene as a second depth map corresponding to the training image; and obtaining a real position diagram of the pick-up point corresponding to the training image according to the parameters of the object model. The object model can comprise models of various types of objects such as an express parcel model, a component model, a garbage model, a goods model and the like; the background models may include one or more models of the ground, tables, boxes, shelves, countertops (e.g., work tables), ambient lighting, and the like.
In this implementation, the parameters of the object model may include one or more of a type parameter, a shape parameter, a size parameter, a position parameter, and the like of the object model. Wherein the position parameters of the object model may represent the position of the object model in the simulation scene.
As one example of this implementation, according to the shape parameters, size parameters and position parameters of the object model, it may be determined which positions in the simulation scene have object models and, accordingly, which pixels in the training image correspond to objects; and the second depth map corresponding to the training image may be obtained according to the distance between the image acquisition device for the depth map in the simulation scene and the surface points of the object model.
As an example of this implementation, the true position map of the pickup point corresponding to the training image may be obtained according to the difference between the normal vector of the tangent plane of a pixel on the object model surface and the normal vectors of the tangent planes of pixels adjacent to the pixel. Here, the tangent plane of a pixel represents the tangent plane of the object surface point represented by the pixel. For example, if the included angle between the normal vector of the tangent plane of a certain pixel on the object surface and the normal vector of the tangent plane of an adjacent pixel of the pixel is small (for example, smaller than or equal to a first preset value), it can be determined that the vicinity of the pixel is relatively flat, so that the pixel can be determined to be usable as a pickup point, that is, the pixel value of the pixel in the true position map of the pickup point corresponding to the training image can be determined to be 1; if the included angle between the normal vector of the tangent plane of a certain pixel on the object model surface and the normal vector of the tangent plane of an adjacent pixel of the pixel is large (for example, larger than the first preset value), it may be determined that the vicinity of the pixel is not flat, so that the pixel can be determined to be unusable as a pickup point, that is, the pixel value of the pixel in the true position map of the pickup point corresponding to the training image can be determined to be 0. In one example, a pixel whose distance to any pixel is less than or equal to a second preset value may be determined as an adjacent pixel of that pixel. For example, if the distance between pixel A and pixel B is less than or equal to the second preset value, pixel A may be determined as an adjacent pixel of pixel B, and pixel B may be determined as an adjacent pixel of pixel A; if the distance between pixel A and pixel B is greater than the second preset value, pixel A and pixel B may not be regarded as adjacent pixels of each other.
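A hedged sketch of this labelling rule follows; the angle threshold, neighbourhood radius, and the availability of per-pixel surface normals and an object mask from the simulator are assumptions, and the function name is hypothetical:

```python
import numpy as np

def true_position_map(normals, object_mask, angle_threshold_deg=10.0, radius=1):
    """Build the true position map of pick-up points from per-pixel surface
    normals of the object model (normals: HxWx3, unit length). A pixel is
    labelled 1 if the angle between its normal and the normals of its
    neighbouring pixels stays below the first preset value, otherwise 0."""
    h, w, _ = normals.shape
    cos_thr = np.cos(np.deg2rad(angle_threshold_deg))
    gt = np.zeros((h, w), dtype=np.float32)
    for i in range(radius, h - radius):
        for j in range(radius, w - radius):
            if not object_mask[i, j]:
                continue
            n = normals[i, j]
            patch = normals[i - radius:i + radius + 1, j - radius:j + radius + 1]
            # Cosine of the angle between the centre normal and each neighbour normal.
            cos = np.einsum('ijk,k->ij', patch, n)
            gt[i, j] = 1.0 if cos.min() >= cos_thr else 0.0
    return gt
```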
According to the implementation mode, the neural network can be trained by using the simulation data, and the problem of pick-up point detection of objects in a real scene is solved. The simulation system is adopted to collect the real position diagram of the pick-up point corresponding to the training image, so that the marking cost can be greatly reduced, and the cost of the whole system can be reduced.
As an example of this implementation, multiple kinds of object models may be included in the simulation scenario, enabling the neural network to learn the ability to process different kinds of objects.
As an example of this implementation, the obtaining a training image according to the image of the simulation scene includes: generating a simulation scene; controlling at least one object model in the simulation scene to randomly drop on a workbench model in the simulation scene from the upper part of the workbench model until the at least one object model is stable; and/or randomly adjusting the object model and/or the background model in the simulation scene to obtain a plurality of training images.
In this example, the object model and the background model may be randomly selected to generate the simulation scene according to the generation instruction of the simulation scene, or the simulation scene may be generated according to the object model and the background model selected by the user. For example, a simulation scene similar to the real scene may be generated.
In this example, a domain randomization (Domain Randomization) method may be employed to randomly adjust the object model and/or the background model in the simulated scene. For example, the colors and textures of the ground in the simulated scene, the colors and textures of the table models, the colors and textures of the box models, the direction and intensity of the ambient light, the placement position and angle of the object models, the colors and textures of the object models, the size and shape of the object models, the types, the number and the placement modes of the object models, and the like can be randomly adjusted.
In this example, the background model may include a table model, and the simulated scene may include a plurality of object models therein. Wherein the plurality of object models may belong to a plurality of categories. In one example, a plurality of object models in the simulation scene may be controlled to randomly drop onto the workbench model from above the workbench model until the plurality of object models are stable, and then the object models and/or the background model in the simulation scene may be randomly adjusted. After the object model and/or the background model in the simulation scene are randomly adjusted, an image of the current simulation scene may be saved as a training image, for example, an RGB image of the current simulation scene may be saved as a training image. In one example, images of the simulated scene may also be saved from different perspectives, thereby enabling training images from different perspectives to be obtained.
In this example, by controlling a plurality of object models in the simulation scene to randomly drop onto the workbench model from above until the plurality of object models are stable, the situation of object stacking in a real scene can be simulated; training the neural network on this basis enables the neural network to learn to handle object stacking in real scenes. By randomly adjusting the object model and/or the background model in the simulation scene, a large number of training images can be obtained, so that the neural network trained on these images can have higher accuracy and robustness.
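Purely as an illustration of domain randomization, the following sketch perturbs a hypothetical simulation scene object; the attribute names (ground, table, light, objects, available_textures) are invented placeholders, not an actual simulator API:

```python
import random

def randomize_scene(scene, rng=random):
    """Randomly adjust colors, textures, lighting and object placement of a
    hypothetical scene object before rendering a training image."""
    scene.ground.color = [rng.random() for _ in range(3)]
    scene.table.texture = rng.choice(scene.available_textures)
    scene.light.intensity = rng.uniform(0.3, 1.5)
    scene.light.direction = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    for obj in scene.objects:
        obj.color = [rng.random() for _ in range(3)]
        obj.scale = rng.uniform(0.8, 1.2)
        obj.yaw = rng.uniform(0.0, 360.0)
    return scene
```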
In another possible implementation manner, the real position diagram of the pick-up point corresponding to the training image can be obtained through a manual labeling mode.
Fig. 2 shows a schematic diagram of a neural network in an embodiment of the present disclosure. In the training process of the neural network, the input image may be a training image and a depth map (e.g., a second depth map) corresponding to the training image; in an actual use process of the neural network, the input image may be an image to be detected and a depth map (for example, a first depth map) corresponding to the image to be detected. In one example, the training image and the image to be detected may be RGB images, the neural network may learn scene information from the RGB images and the depth map, respectively, and may sufficiently fuse information of different sources (RGB images and depth map) and different levels through feature fusion based on an attention mechanism, so as to predict a result more accurately.
The neural network may be a fully convolutional neural network, thereby helping to reduce the amount of computation and the number of network parameters. The neural network can receive RGB images and depth maps as input, and after feature extraction is performed through a plurality of convolution layers, information from different modalities (the RGB image modality and the depth map modality) is fused. The feature map corresponding to the RGB image (i.e., the feature map obtained by performing feature extraction on the RGB image) and the feature map corresponding to the depth map (i.e., the feature map obtained by performing feature extraction on the depth map) are fused based on an attention mechanism, and then the pixel-level position prediction map of the pick-up point is gradually generated through a plurality of convolution layers so as to clearly and accurately represent the position information of the pick-up point of the object.
The neural network may be composed of a plurality of convolutional layers, each of which may be followed by a batch normalization (Batch Normalization) process. Each layer of the neural network may use a ReLU (Rectified Linear Unit) function as the activation function. When the neural network is trained, tested and used, the input image may be normalized.
In one possible implementation, the neural network may be trained using a random gradient descent method, the batch size may be 64, and all parameters of the neural network may be randomly initialized.
In the example shown in fig. 2, the input images are an RGB image and a depth map. Feature extraction is performed on the RGB image through a convolution layer to obtain a first feature map S_1-1 corresponding to the RGB image, and feature extraction is performed on the depth map through a convolution layer to obtain a first feature map S_1-2 corresponding to the depth map; feature extraction is performed on the first feature map S_1-1 corresponding to the RGB image through a convolution layer to obtain a second feature map S_2-1 corresponding to the RGB image, and feature extraction is performed on the first feature map S_1-2 corresponding to the depth map through a convolution layer to obtain a second feature map S_2-2 corresponding to the depth map; and feature extraction is performed on the second feature map S_2-1 corresponding to the RGB image through a convolution layer to obtain a third feature map S_3-1 corresponding to the RGB image, and feature extraction is performed on the second feature map S_2-2 corresponding to the depth map through a convolution layer to obtain a third feature map S_3-2 corresponding to the depth map. In fig. 2, the fusion symbol represents a fusion module based on the attention mechanism. Feature fusion is carried out on the first feature map S_1-1 corresponding to the RGB image and the first feature map S_1-2 corresponding to the depth map through a fusion module based on an attention mechanism, so that a feature fusion map F1 is obtained; feature fusion is carried out on the second feature map S_2-1 corresponding to the RGB image and the second feature map S_2-2 corresponding to the depth map through a fusion module based on an attention mechanism, so that a feature fusion map F2 is obtained; feature fusion is carried out on the third feature map S_3-1 corresponding to the RGB image and the third feature map S_3-2 corresponding to the depth map through a fusion module based on an attention mechanism, so that a feature fusion map F3 is obtained; and a convolution operation is carried out on the feature fusion map F1, the feature fusion map F2 and the feature fusion map F3 through a convolution layer to obtain the position prediction map.
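One possible reading of the Fig. 2 topology is sketched below in PyTorch; the class name, channel widths, strides and the concatenation head are assumptions, and the attention-based fusion modules are taken as constructor arguments (a sketch of such a module is given after the description of Fig. 3):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    # Convolution + batch normalization + ReLU, following the description above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class PickPointNet(nn.Module):
    """Sketch of a two-branch fully convolutional network: three feature levels
    for the RGB image and the depth map, attention-based fusion at each level,
    and a convolutional head producing the pixel-level position prediction map."""

    def __init__(self, fusion_modules, channels=(32, 64, 128)):
        super().__init__()
        c1, c2, c3 = channels
        # RGB branch: S_1-1, S_2-1, S_3-1
        self.rgb1, self.rgb2, self.rgb3 = conv_block(3, c1), conv_block(c1, c2), conv_block(c2, c3)
        # Depth branch: S_1-2, S_2-2, S_3-2
        self.dep1, self.dep2, self.dep3 = conv_block(1, c1), conv_block(c1, c2), conv_block(c2, c3)
        # One attention-based fusion module per level.
        self.fuse1, self.fuse2, self.fuse3 = fusion_modules
        self.head = nn.Conv2d(c1 + c2 + c3, 1, kernel_size=3, padding=1)

    def forward(self, rgb, depth):
        r1, d1 = self.rgb1(rgb), self.dep1(depth)
        r2, d2 = self.rgb2(r1), self.dep2(d1)
        r3, d3 = self.rgb3(r2), self.dep3(d2)
        f1, f2, f3 = self.fuse1(r1, d1), self.fuse2(r2, d2), self.fuse3(r3, d3)  # F1, F2, F3
        size = rgb.shape[-2:]
        # Up-sample each fusion map to the input resolution and combine by convolution.
        ups = [F.interpolate(f, size=size, mode='bilinear', align_corners=False)
               for f in (f1, f2, f3)]
        return torch.sigmoid(self.head(torch.cat(ups, dim=1)))
```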
Fig. 3 shows a schematic diagram of an attention-based fusion module in an embodiment of the present disclosure. As shown in fig. 3, a convolution operation is performed on a feature map corresponding to an RGB image through a convolution layer, for example, a convolution operation of 1×1 may be performed, so as to obtain a first convolution result; and carrying out convolution operation on the feature map corresponding to the depth map through the convolution layer, for example, carrying out 1×1 convolution operation to obtain a second convolution result. And adding the first convolution result and the second convolution result, and performing an activation operation on the sum of the first convolution result and the second convolution result through an activation layer, for example, performing an activation operation on the sum of the first convolution result and the second convolution result through an activation function ReLU, so as to obtain a first activation result. The convolution operation is performed on the first activation result by the convolution layer, for example, a convolution operation of 1×1 may be performed, to obtain a third convolution result. And activating the third convolution result through an activation layer, for example, activating the third convolution result through an activation function Sigmoid to obtain a weight graph based on an attention mechanism. The weight map may be used to represent the weight of the feature map corresponding to the RGB image and the weight of the feature map corresponding to the depth map, for example, the pixel value of the pixel in the weight map may represent the ratio of the weight of the feature map corresponding to the depth map to the weight of the corresponding pixel in the feature map corresponding to the RGB image. The range of pixel values of the pixels of the weight map may be greater than 0 and less than 1. And performing dot multiplication on the feature map corresponding to the depth map and the weight map (namely, performing dot multiplication on the feature map corresponding to the depth map and the pixel value of the corresponding pixel in the weight map), and adding the dot multiplication result and the feature map corresponding to the RGB image to obtain a feature fusion map.
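Following the operation order described for Fig. 3, an attention-based fusion module might look like the sketch below; the class name and the channel count of the weight map are assumptions:

```python
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch of the fusion module: 1x1 convolutions on both feature maps, ReLU on
    their sum, another 1x1 convolution, Sigmoid to obtain the weight map, then the
    depth features are re-weighted and added to the RGB features."""

    def __init__(self, channels):
        super().__init__()
        self.conv_rgb = nn.Conv2d(channels, channels, kernel_size=1)    # first convolution result
        self.conv_depth = nn.Conv2d(channels, channels, kernel_size=1)  # second convolution result
        self.conv_attn = nn.Conv2d(channels, channels, kernel_size=1)   # third convolution result
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, rgb_feat, depth_feat):
        a = self.conv_rgb(rgb_feat)
        b = self.conv_depth(depth_feat)
        w = self.sigmoid(self.conv_attn(self.relu(a + b)))   # attention weight map in (0, 1)
        # Element-wise product of the depth features and the weight map, added to the RGB features.
        return rgb_feat + depth_feat * w
```

Plugged into the network sketch given after the description of Fig. 2, the three fusion modules could be instantiated as AttentionFusion(32), AttentionFusion(64) and AttentionFusion(128) to match the assumed channel widths.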
In the embodiment of the present disclosure, the range of values of the pixel values of the pixels in the position prediction map output by the neural network may be greater than or equal to 0 and less than or equal to 1. That is, the range of values of the predicted value of the confidence level of the pickup point at which the pixel in the training image or the image to be detected is located may be 0 or more and 1 or less. In the position prediction graph, a larger pixel value of a certain pixel can indicate that if the position of the pixel is taken as a pick-up point to pick up an object, the possibility of successfully picking up the object is higher; a smaller pixel value for a certain pixel may indicate that if an object is picked up using the position of the pixel as a pick-up point, the likelihood that the object can be picked up successfully is smaller.
In one possible implementation manner, a difference value graph can be obtained according to the difference value of the pixel value of the corresponding pixel in the real position graph and the second position prediction graph; obtaining a value of a loss function of the neural network according to the difference value graph; training the neural network according to the value of the loss function.
As an example of this implementation, the square of the difference between the pixel values of the corresponding pixels in the real position map and the second position prediction map may be determined as the pixel value of the corresponding pixel in the difference map; in the case where any one pixel includes pixel values of three channels, differences in pixel values of the corresponding channels of the corresponding pixel may be calculated, respectively.
As another example of this implementation, an absolute value of a difference value of pixel values of corresponding pixels in the real position map and the second position prediction map may be determined as the pixel value of corresponding pixels in the difference map.
As an example of this implementation, the sum of the pixel values of all pixels in the difference map may be calculated, resulting in a sum value; and determining the value of the loss function of the neural network according to the sum value. For example, the sum value may be taken as the value of the loss function of the neural network.
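The squared-difference variant of this loss can be written compactly as follows (a sketch; the function name and tensor shapes are assumptions):

```python
def pick_point_loss(pred_map, true_map):
    """Per-pixel squared difference between the real position map and the second
    position prediction map, summed over all pixels to give the loss value.
    pred_map, true_map: tensors of shape (N, 1, H, W) with values in [0, 1]."""
    diff_map = (true_map - pred_map) ** 2   # difference value map
    return diff_map.sum()                   # sum of all pixel values in the difference map
```

As noted above, this loss could then be minimized with stochastic gradient descent using a batch size of 64.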
In the embodiment of the disclosure, the prediction of the picked points is performed by combining the training image and the depth map based on the attention mechanism, so that the information in the training image and the depth map can be fully utilized to predict and obtain the pixel-level picked point position map, and the more accurate position of the points to be picked can be predicted and obtained.
In practical applications, scenes vary widely: the neural network sometimes needs to rely on the information of the RGB image to make decisions, and sometimes needs to rely on the information of the depth map. In the embodiment of the disclosure, the neural network is enabled to adaptively determine the weights of the RGB image and the depth map based on the attention mechanism, that is, to adaptively adjust the proportion of information drawn from the RGB image and from the depth map, fuse the RGB image and the depth map according to their weights, and then make the final decision. In this way, the information contained in the RGB image and the depth map can be learned and fused more fully, various conditions (such as poor depth map quality caused by ambient shadows and illumination) can be handled, and the method has stronger robustness and a higher success rate of picking up objects.
The method for detecting the pick-up point of an object provided by the embodiment of the disclosure requires only a small amount of computation, is easy to port, and can be widely applied to various scenes. For example, in a logistics sorting scene, the object may be an express parcel, and according to the position information of the point to be picked up of the object determined according to the embodiment of the present disclosure, a device such as a robot or a mechanical arm can accurately pick up the express parcel; in an industrial assembly scene, the object may be a component, and according to the position information of the point to be picked up of the object determined according to the embodiment of the disclosure, the device such as a robot or a mechanical arm can accurately pick up the component and place it on another component; in a garbage sorting scene, the object may be garbage, and according to the position information of the point to be picked up of the object determined according to the embodiment of the disclosure, the device such as a robot or a mechanical arm can accurately pick up the garbage and put it into the corresponding sorting bin; in an unmanned vending scene, the object may be goods, and according to the position information of the point to be picked up of the object determined according to the embodiment of the disclosure, the device such as a robot or a mechanical arm can accurately pick up the designated goods and hand them to the customer; in a cargo identification scene, the object may be cargo, and according to the position information of the point to be picked up of the object determined according to the embodiment of the disclosure, the device such as a robot or a mechanical arm can accurately pick up the cargo to scan its two-dimensional code.
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from their principles and logic; due to space limitations, details are not repeated in the present disclosure.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
In addition, the disclosure further provides a device for detecting the pick-up point of an object, a robot, an electronic device, a computer readable storage medium and a program, all of which can be used to implement any method for detecting the pick-up point of an object provided by the disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding descriptions of the method part, which are not repeated here.
Fig. 4 shows a block diagram of a robot provided by an embodiment of the present disclosure. As shown in fig. 4, the robot includes a detection module 41 and a pickup part 42, wherein the detection module 41 is connected to the pickup part 42; the detection module 41 is configured to implement the method for detecting a pick-up point of an object, so as to obtain position information of the pick-up point of the object in the image to be detected; the pickup part 42 is used for picking up the object according to the position information of the point to be picked up of the object in the image to be detected.
Fig. 5 shows a block diagram of an apparatus for detecting a pick-up point of an object provided by an embodiment of the present disclosure. As shown in fig. 5, the apparatus includes: an obtaining module 51, configured to obtain an image to be detected and a first depth map corresponding to the image to be detected; the feature extraction module 52 is configured to perform feature extraction on the image to be detected and the first depth map, so as to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map; the feature fusion module 53 is configured to perform feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, so as to obtain a feature fusion map corresponding to the image to be detected; the first determining module 54 is configured to determine location information of a point to be picked up of an object in the image to be detected according to a feature fusion map corresponding to the image to be detected.
In one possible implementation, the apparatus further includes: the second determining module is used for determining a depth value corresponding to the point to be picked up according to the first depth map; and the third determining module is used for determining pose information of the picked-up component according to the position information of the point to be picked up and the depth value corresponding to the point to be picked up.
In one possible implementation, the feature extraction module 52 is configured to: respectively carrying out multistage feature extraction on the image to be detected and the first depth map to obtain a multistage feature map corresponding to the image to be detected and a multistage feature map corresponding to the first depth map; the feature fusion module 53 is configured to: for any one of the multiple levels, carrying out feature fusion on the level feature map corresponding to the image to be detected and the level feature map corresponding to the first depth map according to the weight of the level feature map corresponding to the image to be detected and the weight of the level feature map corresponding to the first depth map, so as to obtain the level feature fusion map corresponding to the image to be detected; the first determining module 54 is configured to: and determining the position information of the point to be picked up of the object in the image to be detected according to the multi-level feature fusion map corresponding to the image to be detected.
In one possible implementation, the feature extraction module 52 is configured to: inputting the image to be detected and the first depth map into a neural network, and respectively extracting features of the image to be detected and the first depth map through the neural network to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map; the feature fusion module 53 is configured to: determining the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map through the neural network, and carrying out feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map to obtain a feature fusion map corresponding to the image to be detected; the first determining module 54 is configured to: obtaining a first position prediction graph of a pickup point corresponding to the image to be detected according to the feature fusion graph corresponding to the image to be detected through the neural network, wherein the first position prediction graph is used for representing the position of a pixel in the image to be detected as a prediction value of the confidence coefficient of the pickup point; and determining the position information of the point to be picked up of the object in the image to be detected according to the first position prediction graph.
In one possible implementation, the feature fusion module 53 is configured to: performing convolution operation on the feature images corresponding to the images to be detected to obtain a first convolution result; performing convolution operation on the feature map corresponding to the first depth map to obtain a second convolution result; performing an activating operation on the sum of the first convolution result and the second convolution result to obtain a first activating result; performing convolution operation on the first activation result to obtain a third convolution result; and activating the third convolution result to obtain the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map.
In one possible implementation manner, the apparatus further includes a training module, where the training module is configured to: acquiring a training image, a second depth map corresponding to the training image and a real position map of a pickup point corresponding to the training image, wherein the real position map is used for representing the position of a pixel in the training image as a real value of the confidence coefficient of the pickup point; inputting the training image and the second depth map into a neural network, respectively extracting features of the training image and the second depth map through the neural network to obtain a feature map corresponding to the training image and a feature map corresponding to the second depth map, determining weights of the feature map corresponding to the training image and the feature map corresponding to the second depth map, and performing feature fusion on the feature map corresponding to the training image and the feature map corresponding to the second depth map according to the weights of the feature map corresponding to the training image and the feature map corresponding to the second depth map to obtain a feature fusion map corresponding to the training image, and obtaining a second position prediction map of a pick-up point corresponding to the training image according to the feature fusion map corresponding to the training image, wherein the second position prediction map is used for representing the position of a pixel in the training image as a prediction value of the confidence degree of the pick-up point; training the neural network based on the difference between the true position map and the second position prediction map.
In one possible implementation, the training module is configured to: obtaining a training image according to an image of a simulation scene, wherein the simulation scene comprises an object model and a background model; acquiring a depth map of the simulation scene as a second depth map corresponding to the training image; and obtaining a real position diagram of the pick-up point corresponding to the training image according to the parameters of the object model.
In one possible implementation, the training module is configured to: generating a simulation scene; controlling at least one object model in the simulation scene to randomly drop on a workbench model in the simulation scene from the upper part of the workbench model until the at least one object model is stable; and/or randomly adjusting the object model and/or the background model in the simulation scene to obtain a plurality of training images.
In the embodiment of the disclosure, the feature extraction is performed on the to-be-detected image and the first depth image corresponding to the to-be-detected image respectively to obtain the feature image corresponding to the to-be-detected image and the feature image corresponding to the first depth image, the feature image corresponding to the to-be-detected image and the feature image corresponding to the first depth image are subjected to feature fusion according to the weight of the feature image corresponding to the to-be-detected image and the weight of the feature image corresponding to the first depth image to obtain the feature fusion image corresponding to the to-be-detected image, and the position information of the to-be-picked point of the object in the to-be-detected image is determined according to the feature fusion image corresponding to the to-be-detected image, so that the information in the to-be-detected image and the depth image can be fused more fully, and the positions of the to-be-picked point can be predicted more accurately based on the attention mechanism. Therefore, the robot or the mechanical arm and other equipment pick up the object according to the position of the point to be picked up of the object in the image to be detected, and the success rate of picking up the object can be improved.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. Wherein the computer readable storage medium may be a non-volatile computer readable storage medium or may be a volatile computer readable storage medium.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing a method of detecting a pick-up point of an object as provided in any of the embodiments above.
The disclosed embodiments also provide another computer program product for storing computer readable instructions that, when executed, cause a computer to perform the operations of the method for detecting a pick-up point of an object provided in any of the above embodiments.
The embodiment of the disclosure also provides an electronic device, including: one or more processors; a memory for storing executable instructions; wherein the one or more processors are configured to invoke the executable instructions stored by the memory to perform the above-described method.
The electronic device may be provided as a terminal, server or other form of device.
Fig. 6 shows a block diagram of an electronic device 800 provided by an embodiment of the present disclosure. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 6, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen between the electronic device 800 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the electronic device 800 and a relative positioning of components, such as the display and keypad of the electronic device 800; the sensor assembly 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of a user's contact with the electronic device 800, an orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the electronic device 800 and other devices, either wired or wireless. The electronic device 800 may access a wireless network based on a communication standard, such as Wi-Fi, 2G, 3G, 4G/LTE, 5G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including computer program instructions executable by processor 820 of electronic device 800 to perform the above-described methods.
Fig. 7 shows a block diagram of an electronic device 1900 provided by an embodiment of the disclosure. For example, electronic device 1900 may be provided as a server. Referring to FIG. 7, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows, Mac OS, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: portable computer disks, hard disks, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static Random Access Memory (SRAM), portable compact disk read-only memory (CD-ROM), digital Versatile Disks (DVD), memory sticks, floppy disks, mechanical coding devices, punch cards or in-groove structures such as punch cards or grooves having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, c++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limiting of the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A method of detecting a pick-up point of an object, comprising:
acquiring an image to be detected and a first depth map corresponding to the image to be detected;
inputting the image to be detected and the first depth map into a neural network, and respectively extracting features of the image to be detected and the first depth map through the neural network to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map;
determining the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map through the neural network, and performing feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, to obtain a feature fusion map corresponding to the image to be detected;
determining position information of a point to be picked up of an object in the image to be detected according to the feature fusion diagram corresponding to the image to be detected;
the determining, according to the feature fusion map corresponding to the image to be detected, position information of a point to be picked up of an object in the image to be detected includes:
obtaining, through the neural network, a first position prediction map of a pick-up point corresponding to the image to be detected according to the feature fusion map corresponding to the image to be detected, wherein the first position prediction map represents, for each pixel position in the image to be detected, a predicted value of the confidence that the position is a pick-up point;
and determining the position information of the point to be picked up of the object in the image to be detected according to the first position prediction map.
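For illustration only, the following is a minimal PyTorch-style sketch of the pipeline recited in claim 1. The layer sizes, the two-channel softmax gate, the sigmoid output, and all names (PickPointNet, rgb_branch, depth_branch) are assumptions made for this sketch and are not taken from the claim or the description.

```python
import torch
import torch.nn as nn

class PickPointNet(nn.Module):
    def __init__(self):
        super().__init__()
        # separate feature-extraction branches for the image to be detected and the depth map
        self.rgb_branch = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.depth_branch = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        # small gating sub-network producing one weight per branch and per pixel
        self.gate = nn.Sequential(nn.Conv2d(32, 2, 1), nn.Softmax(dim=1))
        # head mapping the fused features to a per-pixel pick-up confidence
        self.head = nn.Conv2d(32, 1, 1)

    def forward(self, rgb, depth):
        f_rgb = self.rgb_branch(rgb)        # feature map corresponding to the image
        f_dep = self.depth_branch(depth)    # feature map corresponding to the depth map
        w = self.gate(f_rgb + f_dep)        # weights of the two feature maps
        fused = w[:, 0:1] * f_rgb + w[:, 1:2] * f_dep   # feature fusion map
        return torch.sigmoid(self.head(fused))           # first position prediction map

# the pixel with the highest predicted confidence is taken as the point to be picked up
net = PickPointNet()
conf = net(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64))
flat_idx = int(torch.argmax(conf))
row, col = divmod(flat_idx, conf.shape[-1])
```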
2. The method according to claim 1, wherein after said determining the position information of the point to be picked up of the object in the image to be detected, the method further comprises:
determining a depth value corresponding to the point to be picked up according to the first depth map;
and determining pose information of the pick-up component according to the position information of the point to be picked up and the depth value corresponding to the point to be picked up.
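As one hypothetical way to realize claim 2, the depth value at the pick-up pixel can be back-projected through a pinhole camera model to obtain a 3-D position for the pick-up component; the intrinsics (fx, fy, cx, cy) and the vertical approach direction below are illustrative assumptions, not part of the claim.

```python
import numpy as np

def pickup_pose(u, v, depth_m, fx, fy, cx, cy):
    """Back-project the pick-up pixel (u, v) with its depth to camera coordinates."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    position = np.array([x, y, depth_m])    # 3-D position of the point to be picked up
    approach = np.array([0.0, 0.0, 1.0])    # assumed vertical approach direction
    return position, approach

pos, direction = pickup_pose(u=320, v=240, depth_m=0.45,
                             fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```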
3. The method according to claim 1 or 2, wherein the performing feature extraction on the image to be detected and the first depth map to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map includes:
respectively performing multi-level feature extraction on the image to be detected and the first depth map to obtain a multi-level feature map corresponding to the image to be detected and a multi-level feature map corresponding to the first depth map;
the performing feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map to obtain a feature fusion map corresponding to the image to be detected includes:
for any level of the multiple levels, performing feature fusion on the feature map of the level corresponding to the image to be detected and the feature map of the level corresponding to the first depth map according to the weight of the feature map of the level corresponding to the image to be detected and the weight of the feature map of the level corresponding to the first depth map, to obtain a feature fusion map of the level corresponding to the image to be detected;
the determining the position information of the point to be picked up of the object in the image to be detected according to the feature fusion map corresponding to the image to be detected comprises:
and determining the position information of the point to be picked up of the object in the image to be detected according to the multi-level feature fusion map corresponding to the image to be detected.
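A possible reading of the per-level fusion in claim 3 is sketched below, again in PyTorch. It assumes that every level shares the same channel width and that the per-level fusion maps are merged by upsampling to the finest level and summing; the claim does not prescribe either choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_levels(rgb_feats, depth_feats, gates):
    """Weighted fusion per level, as in claim 3; all levels assumed to share one channel width."""
    fused = []
    for f_rgb, f_dep, gate in zip(rgb_feats, depth_feats, gates):
        w = gate(f_rgb + f_dep)                                 # level-specific weights
        fused.append(w[:, 0:1] * f_rgb + w[:, 1:2] * f_dep)     # level feature fusion map
    # combine all levels at the finest resolution (an assumed decoder choice)
    size = fused[0].shape[-2:]
    return sum(F.interpolate(f, size=size, mode='bilinear', align_corners=False) for f in fused)

# example with two levels; each gate produces two softmax-normalised weight channels
gates = [nn.Sequential(nn.Conv2d(32, 2, 1), nn.Softmax(dim=1)) for _ in range(2)]
rgb_feats = [torch.rand(1, 32, 64, 64), torch.rand(1, 32, 32, 32)]
depth_feats = [torch.rand(1, 32, 64, 64), torch.rand(1, 32, 32, 32)]
fused_map = fuse_levels(rgb_feats, depth_feats, gates)
```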
4. The method according to claim 1 or 2, wherein the determining the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map includes:
performing a convolution operation on the feature map corresponding to the image to be detected to obtain a first convolution result;
performing a convolution operation on the feature map corresponding to the first depth map to obtain a second convolution result;
performing an activation operation on the sum of the first convolution result and the second convolution result to obtain a first activation result;
performing a convolution operation on the first activation result to obtain a third convolution result;
and performing an activation operation on the third convolution result to obtain the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map.
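The convolution-activation sequence of claim 4 can be read as a small gating sub-network. The sketch below assumes 1x1 convolutions, a ReLU for the first activation, and a two-channel softmax for the final activation; the claim itself only fixes the order of the convolution and activation operations.

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.conv_rgb = nn.Conv2d(channels, channels, 1)    # -> first convolution result
        self.conv_depth = nn.Conv2d(channels, channels, 1)  # -> second convolution result
        self.relu = nn.ReLU()                               # -> first activation result
        self.conv_mix = nn.Conv2d(channels, 2, 1)           # -> third convolution result
        self.softmax = nn.Softmax(dim=1)                    # final activation -> weights

    def forward(self, f_rgb, f_depth):
        a = self.relu(self.conv_rgb(f_rgb) + self.conv_depth(f_depth))
        w = self.softmax(self.conv_mix(a))
        # w[:, 0:1] weights the image feature map, w[:, 1:2] weights the depth feature map
        return w[:, 0:1], w[:, 1:2]

w_img, w_dep = FusionGate(32)(torch.rand(1, 32, 16, 16), torch.rand(1, 32, 16, 16))
```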
5. The method according to claim 1 or 2, characterized in that before said inputting the image to be detected and the first depth map into a neural network, the method further comprises:
acquiring a training image, a second depth map corresponding to the training image, and a real position map of a pick-up point corresponding to the training image, wherein the real position map represents, for each pixel position in the training image, a real value of the confidence that the position is a pick-up point;
inputting the training image and the second depth map into a neural network, respectively performing feature extraction on the training image and the second depth map through the neural network to obtain a feature map corresponding to the training image and a feature map corresponding to the second depth map, determining the weight of the feature map corresponding to the training image and the weight of the feature map corresponding to the second depth map, performing feature fusion on the feature map corresponding to the training image and the feature map corresponding to the second depth map according to the weight of the feature map corresponding to the training image and the weight of the feature map corresponding to the second depth map to obtain a feature fusion map corresponding to the training image, and obtaining a second position prediction map of a pick-up point corresponding to the training image according to the feature fusion map corresponding to the training image, wherein the second position prediction map represents, for each pixel position in the training image, a predicted value of the confidence that the position is a pick-up point;
and training the neural network based on the difference between the real position map and the second position prediction map.
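A training step consistent with claim 5 might look as follows. Binary cross-entropy is an assumed choice of loss for the difference between the real position map and the second position prediction map, and net can be any module that maps an image and a depth map to a per-pixel confidence map (for example, the PickPointNet sketched after claim 1).

```python
import torch
import torch.nn.functional as F

def train_step(net, optimizer, train_img, depth_map, real_pos_map):
    """One optimisation step: predict the position map and fit it to the real position map."""
    optimizer.zero_grad()
    pred_pos_map = net(train_img, depth_map)                    # second position prediction map
    loss = F.binary_cross_entropy(pred_pos_map, real_pos_map)   # assumed loss for the difference
    loss.backward()
    optimizer.step()
    return loss.item()
```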
6. The method according to claim 5, wherein the acquiring the training image, the second depth map corresponding to the training image, and the real position map of the pick-up point corresponding to the training image comprises:
obtaining a training image according to an image of a simulation scene, wherein the simulation scene comprises an object model and a background model;
acquiring a depth map of the simulation scene as a second depth map corresponding to the training image;
and obtaining a real position map of the pick-up point corresponding to the training image according to the parameters of the object model.
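One assumed way to build the real position map of claim 6 is to project the object model's known pick-up point into the training image and spread a Gaussian of confidence around that pixel; the Gaussian shape and its width are illustrative choices, not part of the claim.

```python
import numpy as np

def real_position_map(h, w, u, v, sigma=5.0):
    """Confidence map with value 1.0 at the projected pick-up pixel (u, v), decaying around it."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))

gt = real_position_map(64, 64, u=20, v=40)   # ground-truth confidence per pixel
```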
7. The method according to claim 6, wherein the obtaining a training image according to an image of the simulation scene comprises:
generating a simulation scene;
controlling at least one object model in the simulation scene to drop randomly onto a workbench model in the simulation scene from above the workbench model until the at least one object model is stable; and/or randomly adjusting the object model and/or the background model in the simulation scene, so as to obtain a plurality of training images.
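The drop-until-stable procedure of claim 7 can be prototyped in any rigid-body simulator; the sketch below uses PyBullet purely as an example, with plane.urdf and cube_small.urdf standing in for the workbench model and the object model.

```python
import random
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.8)
p.loadURDF("plane.urdf")                       # stands in for the workbench model
obj = p.loadURDF("cube_small.urdf",
                 basePosition=[random.uniform(-0.1, 0.1),
                               random.uniform(-0.1, 0.1),
                               0.5])            # drop from above the workbench

# step the simulation until the object model comes to rest
for _ in range(2000):
    p.stepSimulation()
    linear, angular = p.getBaseVelocity(obj)
    if max(abs(v) for v in linear + angular) < 1e-3:
        break

# the settled pose of the object model is what the real position map is derived from
position, orientation = p.getBasePositionAndOrientation(obj)
```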
8. An apparatus for detecting a pick-up point of an object, comprising:
the acquisition module is used for acquiring an image to be detected and a first depth map corresponding to the image to be detected;
the feature extraction module is used for inputting the image to be detected and the first depth map into a neural network, and respectively extracting features of the image to be detected and the first depth map through the neural network to obtain a feature map corresponding to the image to be detected and a feature map corresponding to the first depth map;
the feature fusion module is used for determining the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map through the neural network, and performing feature fusion on the feature map corresponding to the image to be detected and the feature map corresponding to the first depth map according to the weight of the feature map corresponding to the image to be detected and the weight of the feature map corresponding to the first depth map, to obtain a feature fusion map corresponding to the image to be detected;
the first determining module is used for obtaining, through the neural network, a first position prediction map of a pick-up point corresponding to the image to be detected according to the feature fusion map corresponding to the image to be detected, wherein the first position prediction map represents, for each pixel position in the image to be detected, a predicted value of the confidence that the position is a pick-up point; and determining the position information of the point to be picked up of the object in the image to be detected according to the first position prediction map.
9. A robot comprising a detection module and a pick-up component, wherein the detection module is connected with the pick-up component; the detection module is used for implementing the method according to any one of claims 1 to 7 to obtain position information of a point to be picked up of an object in an image to be detected; and the pick-up component is used for picking up the object according to the position information of the point to be picked up of the object in the image to be detected.
10. An electronic device, comprising:
one or more processors;
a memory for storing executable instructions;
wherein the one or more processors are configured to invoke the executable instructions stored in the memory to perform the method of any one of claims 1 to 7.
11. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 7.
CN202010223336.XA 2020-03-26 2020-03-26 Method and device for detecting pick-up point of object, robot, equipment and medium Active CN111311683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010223336.XA CN111311683B (en) 2020-03-26 2020-03-26 Method and device for detecting pick-up point of object, robot, equipment and medium

Publications (2)

Publication Number Publication Date
CN111311683A CN111311683A (en) 2020-06-19
CN111311683B true CN111311683B (en) 2023-08-15

Family

ID=71158883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010223336.XA Active CN111311683B (en) 2020-03-26 2020-03-26 Method and device for detecting pick-up point of object, robot, equipment and medium

Country Status (1)

Country Link
CN (1) CN111311683B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115115836B (en) * 2022-06-29 2023-06-13 抖音视界有限公司 Image recognition method, device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108389224A (en) * 2018-02-26 2018-08-10 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN109376667A (en) * 2018-10-29 2019-02-22 北京旷视科技有限公司 Object detection method, device and electronic equipment
CN109409365A (en) * 2018-10-25 2019-03-01 江苏德劭信息科技有限公司 It is a kind of that method is identified and positioned to fruit-picking based on depth targets detection

Also Published As

Publication number Publication date
CN111311683A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN111340886B (en) Method and device for detecting pick-up point of object, equipment, medium and robot
TWI747325B (en) Target object matching method, target object matching device, electronic equipment and computer readable storage medium
CN109697734B (en) Pose estimation method and device, electronic equipment and storage medium
CN112873212B (en) Grab point detection method and device, electronic equipment and storage medium
CN112162930B (en) Control identification method, related device, equipment and storage medium
CN110443366B (en) Neural network optimization method and device, and target detection method and device
CN111340766A (en) Target object detection method, device, equipment and storage medium
CN111062237A (en) Method and apparatus for recognizing sequence in image, electronic device, and storage medium
CN105335712A (en) Image recognition method, device and terminal
US11443438B2 (en) Network module and distribution method and apparatus, electronic device, and storage medium
CN111783986A (en) Network training method and device and posture prediction method and device
CN109145970B (en) Image-based question and answer processing method and device, electronic equipment and storage medium
CN104182127A (en) Icon movement method and device
CN111506758B (en) Method, device, computer equipment and storage medium for determining article name
CN109495616B (en) Photographing method and terminal equipment
CN114088061B (en) Target positioning method and device, electronic equipment and storage medium
CN112184635A (en) Target detection method, device, storage medium and equipment
CN111311683B (en) Method and device for detecting pick-up point of object, robot, equipment and medium
CN110633715B (en) Image processing method, network training method and device and electronic equipment
CN111339880A (en) Target detection method and device, electronic equipment and storage medium
CN114066856A (en) Model training method and device, electronic equipment and storage medium
CN105469411B (en) For the method, apparatus and terminal of detection image clarity
CN110837766B (en) Gesture recognition method, gesture processing method and device
CN111311672A (en) Method and device for detecting gravity center of object, electronic equipment and storage medium
CN111488964A (en) Image processing method and device and neural network training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant