WO2022227678A1

WO2022227678A1 - Three-dimensional target detection method and grabbing method, apparatus, and electronic device

Info

Publication number: WO2022227678A1
Application number: PCT/CN2021/143443
Authority: WO
Inventors: 刘亦芃; 杜国光; 赵开勇
Original assignee: 达闼机器人股份有限公司
Priority date: 2021-04-29
Filing date: 2021-12-30
Publication date: 2022-11-03
Also published as: CN113223091A; CN113223091B

Abstract

A three-dimensional target detection method and grabbing method, an apparatus, and an electronic device, which relate to the technical field of computer vision. The detection method comprises: obtaining a depth image comprising a target identification object (110); generating a camera point cloud corresponding to the depth image according to the depth image and a camera intrinsic parameter, the camera point cloud being a point cloud in a camera coordinate system (120); converting the camera point cloud into a world point cloud, the world point cloud being a point cloud in a world coordinate system (130); performing target detection on the world point cloud according to a preset target recognition model, so as to generate an external minimal cuboid for the target identification object in the world coordinate system (140); and generating an external minimal cuboid for the target identification object in the camera coordinate system according to the external minimal cuboid for the target identification object in the world coordinate system (150). The present method improves the detection quality of a three-dimensional target.

Description

Three-dimensional target detection method, grasping method, device and electronic equipment

cross reference

This application claims the priority of the Chinese patent application filed on April 29, 2021 with the application number "202110473106.3" and the invention titled "3D object detection method, grasping method, device and electronic equipment", the entire contents of which are by reference Incorporated in this application.

technical field

The embodiments of the present disclosure relate to the technical field of computer vision, and in particular, to a three-dimensional target detection method, a grasping method, an apparatus, and an electronic device.

Background technique

Three-dimensional object detection refers to the technology of detecting the three-dimensional space coordinates of objects. In the field of autonomous driving, vehicle collision can be avoided through 3D object detection to control vehicles; in the field of service robots, objects can be accurately grasped through 3D object detection.

3D target detection generally outputs the circumscribed minimum rectangle, category and corresponding confidence of the target recognition object according to the input point cloud data. However, when performing three-dimensional target detection in the related art, it is generally necessary to obtain camera extrinsic parameters, and convert the point cloud data in the camera coordinate system into the point cloud data in the world coordinate system according to the camera extrinsic parameters. When the external parameters of the camera cannot be obtained, the detection accuracy of the three-dimensional target in the related technology is low.

SUMMARY OF THE INVENTION

In view of the above problems, the embodiments of the present disclosure provide a three-dimensional target detection method, a grasping method, an apparatus, and an electronic device, which are used to solve the problem of low three-dimensional target detection accuracy existing in the prior art.

According to an aspect of the embodiments of the present disclosure, a three-dimensional target detection method is provided, the method comprising:

Obtain a depth image containing the target identifier;

generating a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, where the camera point cloud is a point cloud in a camera coordinate system;

converting the camera point cloud into a world point cloud, where the world point cloud is a point cloud in the world coordinate system;

Perform target detection on the world point cloud according to a preset target recognition model, so as to generate a circumscribed minimum rectangle of the target recognition object in the world coordinate system;

The minimum circumscribed rectangle of the target identifier in the camera coordinate system is generated according to the circumscribed minimum rectangle of the target identifier in the world coordinate system.

In an optional manner, the converting the camera point cloud into the world point cloud includes:

registering the camera point cloud with a preset plane point cloud to generate a transformation matrix from the camera coordinate system to the world coordinate system;

Transform the camera point cloud into a world point cloud according to the transformation matrix.

In an optional manner, the step of registering the camera point cloud with a preset plane point cloud to generate a transformation matrix from the camera coordinate system to the world coordinate system includes:

Calculate the mean value of the camera point cloud in three dimensions respectively;

Construct a homogeneous transformation matrix according to the mean value, and set the homogeneous transformation matrix as the initial value of the iterative closest point algorithm;

A transformation matrix from the camera coordinate system to the world coordinate system is generated according to the iterative closest point algorithm and the plane point cloud perpendicular to the gravity axis.

In an optional manner, the converting the camera point cloud into the world point cloud according to the transformation matrix includes:

determining the rotation matrix corresponding to the transformation matrix;

If the rotation angle corresponding to the rotation matrix is greater than 90 degrees, generate a world point cloud according to the rotation matrix and the camera point cloud;

If the rotation angle corresponding to the rotation matrix is not greater than 90 degrees, the world point cloud is generated according to the complementary angle rotation corresponding to the rotation matrix and the camera point cloud.

In an optional manner, the method further includes:

constructing a point cloud data training set, the point cloud data training set includes multiple sets of world point cloud data and label information corresponding to each set of world point cloud data;

A preset target recognition algorithm is trained by using the point cloud data training set to generate the target recognition model.

In an optional manner, the constructing a training set of point cloud data includes:

constructing a three-dimensional model library, the three-dimensional model library includes three-dimensional models of a plurality of identification objects;

After aligning each recognized object to the world coordinate system, calculate the initial value of the circumscribed minimum rectangle of each recognized object;

Place each identification object in a simulated position, and calculate the external minimum rectangle simulation value of each identification object at the simulation position;

Randomly generate a camera perspective, and render based on the camera perspective to generate camera point cloud data for each recognized object;

Converting the camera point cloud data of each identified object into the corresponding world point cloud data;

Add label information to the corresponding world point cloud data.

According to another aspect of the embodiments of the present disclosure, a three-dimensional target grasping method is provided, including the above-mentioned three-dimensional target detection method, and the three-dimensional target grasping method further includes:

Determine the spatial position of the target recognition object according to the circumscribed minimum rectangle of the target recognition object in the camera coordinate system;

A grasping instruction is generated according to the spatial position, so that the grasper grasps the target identification object according to the grasping instruction.

According to another aspect of the embodiments of the present disclosure, there is provided a three-dimensional target detection device, the device comprising:

The acquisition module is used to acquire the depth image containing the target recognition object;

a first generation module, configured to generate a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, where the camera point cloud is a point cloud in a camera coordinate system;

a conversion module for converting the camera point cloud into a world point cloud, where the world point cloud is a point cloud in the world coordinate system;

a second generation module, configured to perform target detection on the world point cloud according to a preset target recognition model, so as to generate a circumscribed minimum rectangle of the target recognition object in the world coordinate system;

The third generation module is configured to generate the minimum circumscribed rectangle of the target identifier in the camera coordinate system according to the circumscribed minimum rectangle of the target identifier in the world coordinate system.

In an optional manner, the conversion module includes:

a registration unit for registering the camera point cloud with a preset plane point cloud to generate a transformation matrix from the camera coordinate system to the world coordinate system;

A conversion unit, configured to convert the camera point cloud into a world point cloud according to the transformation matrix.

In an optional manner, the registration unit is configured to include:

In an optional manner, the conversion unit is configured to include:

determining the rotation matrix corresponding to the transformation matrix;

In an optional manner, the apparatus further includes a training module for:

In an optional manner, the training module is configured to include:

Randomly generating a camera perspective, and rendering based on the camera perspective to generate camera point cloud data for each identified object;

Add label information to the corresponding world point cloud data.

According to another aspect of the embodiments of the present disclosure, a three-dimensional target grasping device is provided, which is characterized by comprising the above-mentioned three-dimensional target detection device, and the three-dimensional target grasping device further includes:

a spatial determination module, configured to determine the spatial position of the target recognition object according to the circumscribed minimum rectangle of the target recognition object in the camera coordinate system;

A grasping module, configured to generate a grasping instruction according to the spatial position, so that the grasper grasps the target identification object according to the grasping instruction.

According to another aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor, a memory, a communication interface, and a communication bus, and the processor, the memory, and the communication interface communicate with each other through the communication bus. communication between;

The memory is used for storing at least one executable instruction, and the executable instruction enables the processor to perform the operations of the above-mentioned three-dimensional target detection method or the above-mentioned three-dimensional target grasping method.

According to yet another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, where at least one executable instruction is stored in the storage medium, and when the executable instruction is executed on an electronic device, the electronic device executes the above The operation of the three-dimensional target detection method or the above-mentioned three-dimensional target grasping method.

According to yet another aspect of the embodiments of the present disclosure, there is provided a computer program comprising instructions that, when executed on a computer, cause the computer to perform operations according to the above-mentioned three-dimensional target detection method or the above-mentioned three-dimensional target grasping method.

In the embodiment of the present disclosure, a camera point cloud corresponding to the depth image can be generated by using the depth image and the internal parameters of the camera; after the camera point cloud is converted into a world point cloud, target detection can be performed on the world point cloud according to a preset target recognition model to obtain Generate the minimum circumscribed rectangle of the target recognition object in the world coordinate system; further, the circumscribed minimum rectangle of the target recognition object in the camera coordinate system can be generated according to the circumscribed minimum rectangle of the target recognition object in the world coordinate system, so as to complete the target recognition detection of objects. It can be seen that the embodiment of the present disclosure can still generate the circumscribed minimum rectangle of the target recognition object in the camera coordinate system based on the camera point cloud without acquiring the external parameters of the camera, which can improve the detection accuracy of the target recognition object.

The above description is only an overview of the technical solutions of the embodiments of the present disclosure. In order to understand the technical means of the embodiments of the present disclosure more clearly, it can be implemented according to the contents of the description, and in order to make the above and other purposes, features and characteristics of the embodiments of the present disclosure. The advantages can be more clearly understood, and the specific embodiments of the present disclosure are given below.

Description of drawings

The drawings are for illustrative purposes only and are not to be considered limiting of the present disclosure. Also, the same components are denoted by the same reference numerals throughout the drawings. In the attached image:

FIG. 1 shows a schematic flowchart of a three-dimensional target detection method provided by an embodiment of the present disclosure;

FIG. 2( a ) shows a schematic diagram of a placement scene of an identification object and a simulated position of a corresponding camera provided by an embodiment of the present disclosure;

Fig. 2(b) shows a schematic diagram of the rendering effect of the camera in Fig. 2(a);

FIG. 3( a ) shows a schematic diagram of another identification object placement scene and a corresponding camera simulation position provided by an embodiment of the present disclosure;

Fig. 3(b) shows a schematic diagram of the rendering effect of the camera in Fig. 3(a);

FIG. 4 shows a schematic flowchart of a three-dimensional target grasping method provided by an embodiment of the present disclosure;

FIG. 5 shows a schematic structural diagram of a three-dimensional target detection apparatus provided by an embodiment of the present disclosure;

FIG. 6 shows a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.

Detailed ways

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein.

FIG. 1 shows a flowchart of a three-dimensional target detection method according to an embodiment of the present disclosure, and the method is executed by an electronic device. The memory of the electronic device is used to store at least one executable instruction, and the executable instruction enables the processor of the electronic device to perform the operations of the above-mentioned three-dimensional target detection method. The electronic device can be a robot, a car, a computer or other terminal equipment. As shown in Figure 1, the method includes the following steps:

Step 110: Acquire a depth image containing the target identifier.

The depth image may be an RGBD image, that is, a depth image in an RGB color mode. The target recognition object in the depth image is the recognition object that needs to be detected. The target identifier can be, for example, a water glass, a beverage bottle, a fruit, and the like. Generally speaking, a depth image containing the target recognition object can be obtained by photographing the scene containing the target recognition object by the depth camera.

Step 120: Generate a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, where the camera point cloud is a point cloud in a camera coordinate system.

The camera point cloud corresponding to the depth image can be generated according to the depth image and the camera internal parameters, and the camera point cloud is the point cloud in the camera coordinate system. The camera intrinsic parameter is a parameter related to the own characteristics of the camera that captures the depth image, and generally includes the focal length of the camera, the pixel size, and the like.

Step 130: Convert the camera point cloud into a world point cloud, where the world point cloud is a point cloud in a world coordinate system.

In an optional way, the camera point cloud can be registered with a preset plane point cloud to generate a transformation matrix from the camera coordinate system to the world coordinate system, and the camera point cloud can be converted into a world point cloud according to the transformation matrix . In order to obtain the transformation matrix, the mean value of the camera point cloud in the three dimensions can be calculated separately, the homogeneous transformation matrix can be constructed according to the mean value, and the homogeneous transformation matrix is set as the initial value of the iterative closest point algorithm. The plane point cloud of the gravity axis generates the transformation matrix from the camera coordinate system to the world coordinate system.

For example, first calculate the mean of each dimension of the camera point cloud in the three-dimensional space

and

Then construct a homogeneous transformation matrix

As the initial value of the iterative closest point algorithm; generate a plane point cloud perpendicular to the gravity axis (z-axis) of the world coordinate system to obtain the transformation matrix from the camera point cloud to the plane point cloud. Clouds are converted to world point clouds.

In an optional way, when converting the camera point cloud into the world point cloud according to the transformation matrix, first, determine the rotation matrix corresponding to the transformation matrix, if the rotation angle corresponding to the rotation matrix is greater than 90 degrees, then according to the rotation matrix Generate the world point cloud with the camera point cloud. If the rotation angle corresponding to the rotation matrix is not greater than 90 degrees, then generate the world point cloud according to the complementary angle rotation corresponding to the rotation matrix and the camera point cloud. For example, if the rotation angle does not exceed 90 degrees, the difference between 180 degrees and the rotation angle is used as the rotation angle of the rotation matrix.

Step 140 : perform target detection on the world point cloud according to a preset target recognition model, so as to generate a circumscribed minimum rectangle of the target recognition object in the world coordinate system.

Among them, the target detection can be performed on the world point cloud according to the preset target recognition model, so as to generate the circumscribed minimum rectangle of the target recognition object in the world coordinate system. The circumscribed minimum rectangle, that is, the circumscribed minimum cuboid, also known as the bounding box, is an algorithm used to solve the optimal enclosing space of a discrete point set. in place of complex geometric objects. The minimum circumscribed rectangle of the target identifier can be, for example, an AABB bounding box, a bounding sphere, an oriented bounding box OBB, and a fixed-direction convex hull FDH. Before the target detection is performed on the world point cloud according to the preset target recognition model, the target recognition algorithm can be trained based on deep learning to generate the target recognition model. The training process of the target recognition algorithm is described in detail below.

Before training the target recognition algorithm, a point cloud data training set needs to be constructed, and the point cloud data training set includes multiple sets of world point cloud data and label information corresponding to each set of world point cloud data. Use the point cloud data training set to train the preset target recognition algorithm to generate the target recognition model. In one embodiment of the present disclosure, the target recognition model may be, for example, a Vote Net network (three-dimensional target detection network). The Vote Net network is an end-to-end 3D object detection network based on the synergy of deep point set network and Hough voting.

In an optional way, the point cloud data training set can be constructed as follows:

Build a 3D model library, which includes 3D models of multiple objects, and align each object to the world coordinate system (the x-axis is rightward, the y-axis is forward, and the z-axis is upward), so that the object is vertical When placed, the long axis corresponds to the y axis, the width corresponds to the x axis, and the height corresponds to the z axis. Then, principal component analysis method can be used to calculate the circumscribed minimum rectangle of each recognized object. Further, an identification object placement scene for simulation is constructed, each identification object is placed in a simulation position under the placement scene, and the circumscribed minimum rectangle of each identification object at the simulation position is calculated. If the placement position includes multiple identification objects, collision detection can also be performed to ensure that the identification objects do not collide. The placement position is the spatial position of each identification object within the preset space range in the world coordinate system. After the identification object is rectified to the world coordinate system, the initial position of the identification object is determined, and the identification object is determined by the translation matrix and the rotation matrix. The placement position of , where the rotation matrix is the rotation matrix around the z-axis. Further, multiple camera perspectives can be randomly generated, and the world point cloud data can be rendered based on each camera perspective to generate camera point cloud data corresponding to the camera perspective of each recognized object, and save the recognition corresponding to the camera point cloud data. The object type, the centroid, length, width, height, and rotation angle around the z-axis of the corresponding smallest circumscribed rectangle.

Fig. 2(a) shows a schematic diagram of an object placement scene and a simulated position of a corresponding camera provided by an embodiment of the present disclosure, and Fig. 2(b) shows a schematic diagram of a rendering effect of the camera in Fig. 2(a); in Fig. 2(a) ), the camera perspective is randomly generated, and the point cloud data of the object in the world coordinate system is rendered based on the camera perspective, and the rendering effect in Figure 2(b) can be obtained. Similarly, FIG. 3(a) shows another object placement scene and a schematic diagram of a corresponding camera simulation position provided by an embodiment of the present disclosure, and FIG. 3(b) shows a schematic diagram of the rendering effect of the camera in FIG. 3(a); In the object placement scene in Figure 3(a), the camera perspective is randomly generated, and the object point cloud data in the world coordinate system is rendered based on the camera perspective, and the rendering effect in Figure 3(b) can be obtained. It should be noted that, for any object placement scene, multiple camera perspectives can be randomly generated, and the world point cloud of the recognized object is rendered based on each camera perspective to obtain the camera point cloud corresponding to the camera perspective.

The following describes the process of calculating the circumscribed minimum rectangle of the recognized object by using the principal component analysis method.

Suppose M is a 3×n matrix, which is used to represent the point cloud coordinates in three-dimensional space, and n is the number of point clouds. Assuming that mean(M) represents a matrix formed by the mean of M in three dimensions, that is, the mean(M) matrix is also a 3×n matrix, the elements of each row are equal, and the elements of each row are equal to the matrix M in The mean on the corresponding dimension. definition

calculate

The covariance matrix Corr of ,

And find the eigenvalue A and eigenvector V of Corr, so that CorrV=AV. Further, the column vectors of the feature vector V are rearranged to obtain the feature vectors V ^{, ,} corresponding to the six different placement modes of the identified objects.

Further, by calculating M ^′ =V ^, M, the corrected point clouds M ^{, ,} respectively, of the identified objects in 6 different placement states can be obtained. Translate M ^, to the origin, that is, M ^′ =M ^′ -mean(M ^′ ), then the circumscribed minimum rectangle B of the corrected point cloud M ^, can be calculated. Among them, xmin, ymin and zmin are the correction point cloud M ^, the minimum values in the x-axis direction, y-axis direction and z-axis direction respectively, xmax, ymax and zmax are the correction point cloud M ^, respectively in the x-axis direction, y-axis direction and the maximum value in the z-axis direction.

by rotation matrix

And the translation matrix t=[t _x , _ty , t _z ] ^T , the correction point cloud M ^can be randomly placed, and then the correction point cloud M ^, =RM ^, +t can be updated. Among them, θ is the correction point cloud M ^, the rotation angle around the z-axis, t _x , _ty and t _z are the correction point cloud M ^, the translation amount on the x-axis, y-axis and z-axis, respectively.

The following describes the process of randomly generating the camera perspective and rendering the point cloud in the world coordinate system based on the camera perspective.

Among them, the position matrix C _P =[x _p , y _p , z _p ] ^T of the virtual camera, the front facing matrix C _f =[x _f , y _f , z _f ] ^T and the upward facing matrix C _t =[x f ] T can be set _t , y _t , z _t ] ^T , then the left orientation matrix of the camera can be obtained as C _l =[y _t z _f -z _t y _f , z _t x _f -x _t z _f ,x _t y _f - y _t x _f ] ^T . The camera angle of view of the virtual camera at the corresponding position can be determined by the front facing matrix, the top facing matrix and the left facing matrix. Assuming that T _C is the homogeneous transformation matrix of the camera coordinate system relative to the world coordinate system, we can get

in,

is the extrinsic parameter matrix of the camera,

It is the orientation transformation matrix of the camera coordinate system relative to the world coordinate system.

By solving the above linear equation, we can get

Further, by inverting T _C , the homogeneous transformation matrix of the world coordinate system relative to the camera coordinate system can be obtained as

Furthermore, the camera point cloud coordinate M _C of the recognized object is

Because the embodiment of the present disclosure chooses to train the Vote Net network to obtain the target recognition model. Vote Net can only predict rotation around a single axis relatively well, so before training the Vote Net network based on deep learning, it is necessary to transform the camera point cloud of the recognized object to the world point cloud, even if the direction of gravity is aligned with the -z axis. Further, the camera point cloud of the recognized object can be converted into the recognized object world point cloud based on the iterative closest point algorithm. The process of transforming the camera point cloud of the recognized object to the world point cloud will be described below.

In an optional way, first calculate the mean value of each dimension of the camera point cloud of the recognized object in the three-dimensional space

and

Then, based on the mean of each dimension

and

Construct a homogeneous transformation matrix

as the initial value for the iterative closest point algorithm. Since the background desktop occupies a large proportion in the scene where the recognized objects are placed, and the point cloud corresponding to the background desktop is relatively large, a plane point cloud perpendicular to the z-axis is generated, and the iterative nearest point algorithm can be used to perform plane registration and calculate the recognition The transformation matrix from the camera point cloud of the object to the plane point cloud, the transformation matrix includes a translation matrix and a rotation matrix, and the rotation angle corresponding to the rotation matrix can be further determined.

It should be noted that since the robot grasps by default, the rotation angle of the ^{T vector should exceed 90 degrees; if the rotation angle of the (0,0,1) T} ^vector does not exceed 90 degrees , then the difference between 180 degrees and the rotation angle of the (0,0,1) ^T vector is used as the rotation angle of the rotation matrix. Finally, the camera point cloud is converted to the world point cloud through the rotation matrix, that is, the -z axis is consistent with the direction of gravity.

The point cloud data training set can be constructed by converting the camera point cloud data corresponding to the camera perspective at each placement position into the world point cloud data, and adding label information to the world point cloud data. The label information may include, for example, the category of the corresponding identifier, and the centroid, length, width, height, and rotation angle around the z-axis of the circumscribed smallest rectangle corresponding to the simulation position.

The Vote Net network takes the world point cloud as the input, and outputs the 3D circumscribed minimum rectangle, confidence and category of the target recognition object in the actual placement scene. To detect 3D targets through the Vote Net network, only the coordinate information of the world point cloud is needed, and there is no great dependence on the density of the world point cloud, and the generalization performance is very good. Although Vote Net has achieved good results in the task of 3D object detection in indoor scenes, it only deals with real data of large indoor objects. In this specification, Vote Net is used to process the simulation data, use the simulation data for training, and detect the world point cloud obtained from the real shooting data. Since the geometric features of the simulated data and the real shot data are not very different, this makes the embodiment of the present disclosure more feasible.

The following describes the training of the Vote Net network based on the point cloud data training set.

When training the Vote Net network, first construct a 2.5D point cloud in the simulated scene according to a similar density, and then shoot through a virtual camera, and generate world point cloud data according to the camera point cloud data obtained by shooting, and automatically obtain it The label information of each world point cloud data, which can improve the training speed of the target recognition model. Input the world point cloud data with label information into the Vote Net network for training, and determine the total number of training rounds according to the point cloud volume. After the training of the Vote Net network is completed, the 3D target detection is performed on the world point cloud processed by the iterative closest point algorithm, and the 3D circumscribed minimum rectangle, the confidence level and the type of the recognized object corresponding to the camera point cloud data can be obtained.

Step 150: Generate a minimum circumscribed rectangle of the target identifier in the camera coordinate system according to the smallest circumscribed rectangle of the target identifier in the world coordinate system.

Wherein, the minimum circumscribed rectangle of the target recognition object in the world coordinate system can be converted into the minimum circumscribed rectangle of the target recognition object in the camera coordinate system according to the above-mentioned rotation matrix. Further, the rotation matrix can be used to right multiply the circumscribed minimum rectangle matrix of the target identifier in the world coordinate system to obtain the circumscribed minimum rectangle matrix of the target identifier in the camera coordinate system.

FIG. 4 shows a flowchart of a three-dimensional object grasping method according to another embodiment of the present disclosure, and the method is executed by an electronic device. The memory of the electronic device is used to store at least one executable instruction, and the executable instruction enables the processor of the electronic device to perform the operations of the above-mentioned three-dimensional object grasping method. As shown in Figure 4, the method includes the following steps:

Step 210: Determine the spatial position of the target recognition object according to the circumscribed minimum rectangle of the target recognition object in the camera coordinate system.

Among them, the spatial position of the target recognition object can be determined according to the circumscribed minimum rectangle of the target recognition object in the camera coordinate system. The spatial position of the target identifier includes the spatial coordinates of the target identifier and the rotation angle of the target identifier in the three-dimensional space.

Step 220: Generate a grasping instruction according to the spatial position, so that the grasper grasps the target identification object according to the grasping instruction.

Wherein, a grabbing instruction may be generated according to the spatial position of the target identifier, and the grabbing instruction may be sent to a grabber for grabbing the target identifier. The grasper can determine the grasping path of the target identification object according to the grasping instruction, and grasp the target identification object according to the grasping path.

The embodiment of the present disclosure generates the minimum circumscribed rectangle of the target recognition object in the camera coordinate system based on the camera point cloud, determines the spatial position of the target recognition object according to the circumscribed minimum rectangle of the target recognition object in the camera coordinate system, and generates a grasping instruction according to the spatial position , so that the grasper can accurately grasp the target identification object according to the grasping instruction.

FIG. 5 shows a schematic structural diagram of a three-dimensional object detection apparatus according to an embodiment of the present disclosure. As shown in FIG. 5 , the apparatus 300 includes: an acquisition module 310 , a first generation module 320 , a conversion module 330 , a second generation module 340 and a third generation module 350 .

Wherein, the obtaining module 310 is used to obtain the depth image including the target identifier;

a first generating module 320, configured to generate a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, where the camera point cloud is a point cloud in a camera coordinate system;

a conversion module 330, configured to convert the camera point cloud into a world point cloud, where the world point cloud is a point cloud in a world coordinate system;

The second generation module 340 is configured to perform target detection on the world point cloud according to a preset target recognition model, so as to generate the circumscribed minimum rectangle of the target recognition object in the world coordinate system;

The third generating module 350 is configured to generate a minimum circumscribed rectangle of the target identifier in the camera coordinate system according to the circumscribed minimum rectangle of the target identifier in the world coordinate system.

In an optional manner, the conversion module 330 includes:

In an optional manner, the registration unit is configured to include:

In an optional way, the conversion unit is used to include:

determining the rotation matrix corresponding to the transformation matrix;

In an optional manner, the apparatus 300 further includes a training module for:

In an optional way, the training module is used to:

Add label information to the corresponding world point cloud data.

FIG. 6 shows a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, and the specific embodiment of the present disclosure does not limit the specific implementation of the electronic device.

As shown in FIG. 6 , the electronic device may include: a processor (processor) 402 , a communication interface (Communications Interface) 404 , a memory (memory) 406 , and a communication bus 408 .

The processor 402 , the communication interface 404 , and the memory 406 communicate with each other through the communication bus 408 . The communication interface 404 is used for communicating with network elements of other devices such as clients or other servers. The processor 402 is configured to execute the program 410, and specifically may execute the relevant steps in the foregoing embodiments of the three-dimensional target detection method.

Specifically, program 410 may include program code, which includes computer-executable instructions.

The processor 402 may be a central processing unit (CPU), or an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present disclosure. The one or more processors included in the electronic device may be the same type of processors, such as one or more CPUs; or may be different types of processors, such as one or more CPUs and one or more ASICs.

The memory 406 is used to store the program 410 . Memory 406 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk memory.

The program 410 can be specifically called by the processor 402 to make the electronic device perform the following operations:

Obtain a depth image containing the target identifier;

In an optional manner, the program 410 is invoked by the processor 402 to cause the electronic device to perform the following operations:

The camera point cloud is registered with the preset plane point cloud to generate a transformation matrix from the camera coordinate system to the world coordinate system;

determining the rotation matrix corresponding to the transformation matrix;

Add label information to the corresponding world point cloud data.

An embodiment of the present disclosure provides a computer-readable storage medium, where the storage medium stores at least one executable instruction, and when the executable instruction runs on an electronic device, causes the electronic device to execute any of the foregoing method embodiments. 3D object detection method.

An embodiment of the present disclosure provides a three-dimensional target detection apparatus, which is used for executing the above-mentioned three-dimensional target detection method.

An embodiment of the present disclosure provides a computer program, and the computer program can be invoked by a processor to cause an electronic device to execute the three-dimensional target detection method in any of the foregoing method embodiments.

An embodiment of the present disclosure provides a computer program product, the computer program product includes a computer program stored on a computer-readable storage medium, and the computer program includes program instructions, when the program instructions are executed on a computer, the computer is caused to execute any of the above The three-dimensional target detection method in the method embodiment.

The algorithms or displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems can also be used with teaching based on this. The structure required to construct such a system is apparent from the above description. Furthermore, embodiments of the present disclosure are not directed to any particular programming language. It is to be understood that various programming languages may be used to implement the disclosures described herein and that the descriptions of specific languages above are intended to disclose the best mode of the disclosure.

In the description provided herein, numerous specific details are set forth. It will be understood, however, that embodiments of the present disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it is to be understood that in the above descriptions of exemplary embodiments of the present disclosure, various features of embodiments of the present disclosure are sometimes grouped together into a single implementation in order to simplify the disclosure and to aid in the understanding of one or more of the various inventive aspects. examples, figures, or descriptions thereof. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim.

Those skilled in the art can understand that the modules in the device in the embodiment can be adaptively changed and arranged in one or more devices different from the embodiment. The modules or units or components in the embodiments may be combined into one module or unit or component, and they may be divided into multiple sub-modules or sub-units or sub-assemblies. All features disclosed in this specification (including accompanying claims, abstract and drawings) and any method so disclosed may be employed in any combination, unless at least some of such features and/or procedures or elements are mutually exclusive. All processes or units of equipment are combined. Each feature disclosed in this specification (including accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

It should be noted that the above-described embodiments illustrate rather than limit the disclosure, and that alternative embodiments may be devised by those skilled in the art without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present disclosure may be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, and third, etc. do not denote any order. These words can be interpreted as names. The steps in the above embodiments should not be construed as limitations on the execution order unless otherwise specified.

Claims

A three-dimensional target detection method, characterized in that the method comprises:

Obtain a depth image containing the target identifier;

generating a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, where the camera point cloud is a point cloud in a camera coordinate system;

converting the camera point cloud into a world point cloud, where the world point cloud is a point cloud in the world coordinate system;

Perform target detection on the world point cloud according to a preset target recognition model, so as to generate a circumscribed minimum rectangle of the target recognition object in the world coordinate system;

The minimum circumscribed rectangle of the target identifier in the camera coordinate system is generated according to the circumscribed minimum rectangle of the target identifier in the world coordinate system.
The method according to claim 1, wherein the converting the camera point cloud into a world point cloud comprises:

registering the camera point cloud with a preset plane point cloud to generate a transformation matrix from the camera coordinate system to the world coordinate system;

Transform the camera point cloud into a world point cloud according to the transformation matrix.
The method according to claim 2, wherein the registering the camera point cloud with a preset plane point cloud to generate a transformation matrix from the camera coordinate system to the world coordinate system comprises:

Calculate the mean value of the camera point cloud in three dimensions respectively;

Construct a homogeneous transformation matrix according to the mean value, and set the homogeneous transformation matrix as the initial value of the iterative closest point algorithm;

A transformation matrix from the camera coordinate system to the world coordinate system is generated according to the iterative closest point algorithm and the plane point cloud perpendicular to the gravity axis.
The method according to claim 2 or 3, wherein the converting the camera point cloud into a world point cloud according to the transformation matrix comprises:

determining the rotation matrix corresponding to the transformation matrix;

If the rotation angle corresponding to the rotation matrix is greater than 90 degrees, generate a world point cloud according to the rotation matrix and the camera point cloud;

If the rotation angle corresponding to the rotation matrix is not greater than 90 degrees, the world point cloud is generated according to the complementary angle rotation corresponding to the rotation matrix and the camera point cloud.
The method according to claim 1, wherein the method further comprises:

constructing a point cloud data training set, the point cloud data training set includes multiple sets of world point cloud data and label information corresponding to each set of world point cloud data;

A preset target recognition algorithm is trained by using the point cloud data training set to generate the target recognition model.
The method according to claim 5, wherein the constructing a training set of point cloud data comprises:

constructing a three-dimensional model library, the three-dimensional model library includes three-dimensional models of a plurality of identification objects;

After aligning each recognized object to the world coordinate system, calculate the initial value of the circumscribed minimum rectangle of each recognized object;

Place each identification object in a simulated position, and calculate the external minimum rectangle simulation value of each identification object at the simulation position;

Randomly generating a camera perspective, and rendering based on the camera perspective to generate camera point cloud data for each identified object;

Converting the camera point cloud data of each identified object into the corresponding world point cloud data;

Add label information to the corresponding world point cloud data.
A three-dimensional target grabbing method, characterized in that it includes the three-dimensional target detection method according to any one of claims 1-6, and the three-dimensional target grabbing method further comprises:

Determine the spatial position of the target recognition object according to the circumscribed minimum rectangle of the target recognition object in the camera coordinate system;

A grasping instruction is generated according to the spatial position, so that the grasper grasps the target identification object according to the grasping instruction.
A three-dimensional target detection device, characterized in that the device comprises:

The acquisition module is used to acquire the depth image containing the target recognition object;

a first generation module, configured to generate a camera point cloud corresponding to the depth image according to the depth image and the camera internal parameters, where the camera point cloud is a point cloud in a camera coordinate system;

a conversion module for converting the camera point cloud into a world point cloud, where the world point cloud is a point cloud in the world coordinate system;

a second generation module, configured to perform target detection on the world point cloud according to a preset target recognition model, so as to generate a circumscribed minimum rectangle of the target recognition object in the world coordinate system;

The third generation module is configured to generate the minimum circumscribed rectangle of the target identifier in the camera coordinate system according to the circumscribed minimum rectangle of the target identifier in the world coordinate system.
The device according to claim 8, wherein the conversion module comprises:

a registration unit for registering the camera point cloud with a preset plane point cloud to generate a transformation matrix from the camera coordinate system to the world coordinate system;

A conversion unit, configured to convert the camera point cloud into a world point cloud according to the transformation matrix.
The device according to claim 9, wherein the registration unit is configured to include:

Calculate the mean value of the camera point cloud in three dimensions respectively;

Construct a homogeneous transformation matrix according to the mean value, and set the homogeneous transformation matrix as the initial value of the iterative closest point algorithm;

A transformation matrix from the camera coordinate system to the world coordinate system is generated according to the iterative closest point algorithm and the plane point cloud perpendicular to the gravity axis.
The device according to claim 9 or 10, wherein the conversion unit is configured to include:

determining the rotation matrix corresponding to the transformation matrix;

If the rotation angle corresponding to the rotation matrix is greater than 90 degrees, generate a world point cloud according to the rotation matrix and the camera point cloud;

If the rotation angle corresponding to the rotation matrix is not greater than 90 degrees, the world point cloud is generated according to the complementary angle rotation corresponding to the rotation matrix and the camera point cloud.
The device according to claim 8, wherein the device further comprises a training module for:

constructing a point cloud data training set, the point cloud data training set includes multiple sets of world point cloud data and label information corresponding to each set of world point cloud data;

A preset target recognition algorithm is trained by using the point cloud data training set to generate the target recognition model.
The apparatus according to claim 12, wherein the training module is configured to include:

constructing a three-dimensional model library, the three-dimensional model library includes three-dimensional models of a plurality of identification objects;

After aligning each recognized object to the world coordinate system, calculate the initial value of the circumscribed minimum rectangle of each recognized object;

Place each identification object in a simulated position, and calculate the external minimum rectangle simulation value of each identification object at the simulation position;

Randomly generating a camera perspective, and rendering based on the camera perspective to generate camera point cloud data for each identified object;

Converting the camera point cloud data of each identified object into the corresponding world point cloud data;

Add label information to the corresponding world point cloud data.
A three-dimensional target grasping device, characterized in that it includes the three-dimensional target detection device according to any one of claims 9-13, and the three-dimensional target grasping device further comprises:

a spatial determination module, configured to determine the spatial position of the target recognition object according to the circumscribed minimum rectangle of the target recognition object in the camera coordinate system;

A grasping module, configured to generate a grasping instruction according to the spatial position, so that the grasper grasps the target identification object according to the grasping instruction.
An electronic device, characterized in that it comprises: a processor, a memory, a communication interface and a communication bus, and the processor, the memory and the communication interface communicate with each other through the communication bus;

The memory is used to store at least one executable instruction, and the executable instruction causes the processor to execute the three-dimensional target detection method according to any one of claims 1-6 or the three-dimensional target capture method according to claim 7. The operation of the fetch method.
A computer-readable storage medium, characterized in that, the storage medium stores at least one executable instruction, and when the executable instruction is executed on an electronic device, the electronic device executes any one of claims 1-6. The operation of the three-dimensional target detection method or the three-dimensional target grasping method as claimed in claim 7.
A computer program, comprising instructions that, when run on a computer, make the computer perform the operations of the three-dimensional target detection method according to any one of claims 1-6 or the three-dimensional target grasping method according to claim 7 .