WO2022116423A1 - Object posture estimation method and apparatus, and electronic device and computer storage medium - Google Patents

Object posture estimation method and apparatus, and electronic device and computer storage medium Download PDF

Info

Publication number
WO2022116423A1
Authority
WO
WIPO (PCT)
Prior art keywords
target object
loss value
point set
point
visibility
Prior art date
Application number
PCT/CN2021/083083
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
李泽远
朱星华
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022116423A1 publication Critical patent/WO2022116423A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to an object pose estimation method, apparatus, electronic device, and computer-readable storage medium.
  • the grasping and sorting tasks of industrial robotic arms mainly rely on the pose estimation of the objects to be grasped.
  • the pose estimation methods of objects mainly use point-by-point teaching or 2D visual perception methods.
  • the point-by-point teaching method is complex and time-consuming, and the 2D visual perception method will lead to inaccurate pose estimation of objects due to the cluttered placement of objects and the occlusion between objects.
  • An object pose estimation method provided by this application includes:
  • Hough voting is performed on the target object point set to obtain a key point set, and the key point loss value of the target object is calculated according to the key point set;
  • Semantic segmentation is performed on the pixels of the scene depth map to obtain the semantic loss value of the target object
  • the pose of the target object is calculated.
  • the present application also provides a device for estimating the pose of a target object, the device comprising:
  • a three-dimensional point cloud acquisition module configured to obtain a scene depth map of a target object by using a preset camera device, and calculate a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map;
  • a target object point set extraction module used for extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set
  • a visibility loss value calculation module configured to calculate the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
  • a key point loss value calculation module configured to perform Hough voting on the target object point set to obtain a key point set, and calculate the key point loss value of the target object according to the key point set;
  • a semantic loss value calculation module configured to perform semantic segmentation on the pixels of the scene depth map to obtain the semantic loss value of the target object
  • the pose calculation module is configured to calculate the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and the multi-task joint model obtained by pre-training.
  • the present application also provides an electronic device, the electronic device comprising:
  • the processor executes the computer program stored in the memory to implement the method for estimating the pose of an object as described below:
  • Hough voting is performed on the target object point set to obtain a key point set, and the key point loss value of the target object is calculated according to the key point set;
  • Semantic segmentation is performed on the pixels of the scene depth map to obtain the semantic loss value of the target object
  • the pose of the target object is calculated.
  • the present application also provides a computer-readable storage medium, including a storage data area and a storage program area, the storage data area stores created data, and the storage program area stores a computer program; wherein the computer program, when executed by a processor, implements the object pose estimation method described below:
  • Hough voting is performed on the target object point set to obtain a key point set, and the key point loss value of the target object is calculated according to the key point set;
  • Semantic segmentation is performed on the pixels of the scene depth map to obtain the semantic loss value of the target object
  • the pose of the target object is calculated.
  • FIG. 1 is a schematic flowchart of an object pose estimation method provided by an embodiment of the present application
  • FIG. 2 is a schematic block diagram of an object pose estimation apparatus provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an internal structure of an electronic device for implementing an object pose estimation method provided by an embodiment of the present application;
  • the embodiments of the present application provide a method for estimating the pose of an object.
  • the execution subject of the object pose estimation method includes, but is not limited to, at least one of electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server, a terminal, and the like.
  • the object pose estimation method may be executed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • the object pose estimation method includes:
  • S1 Use a preset camera device to acquire a scene depth map of a target object, and calculate a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map.
  • the camera device may be a 3D camera
  • the target object may be a target object to be grasped by a manipulator.
  • the scene depth image is also called a range image, and refers to an image in which the distance (depth) from the camera device to each point of the target object is taken as the pixel value.
  • the scene depth map can be calculated as point cloud data after coordinate transformation.
  • the scene depth map may be stored in a blockchain node.
  • the 3D point cloud of the scene depth map can be calculated according to the pixel points in the scene depth map through the following formula:
  • x, y, z are the coordinates of the point in the three-dimensional point cloud
  • u, v are the row and column where the pixel point is located in the scene depth map
  • c_x and c_y are the two-dimensional coordinates of the pixel point in the scene depth map, and f_x, f_y, d are the focal lengths of the camera device on the x-axis, the y-axis, and the z-axis, respectively.
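For illustration, a minimal NumPy sketch of this back-projection is given below. The patent's exact formula appears only as an image, so the sketch assumes the standard pinhole relation x = (col - c_x)·d/f_x, y = (row - c_y)·d/f_y, z = d for each pixel with depth value d; the function name and the validity filter are illustrative.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    # Back-project a depth map (H x W) into an (H*W) x 3 point cloud using a
    # standard pinhole model (assumed): z = d, x = (col - cx)*z/fx, y = (row - cy)*z/fy.
    rows, cols = np.indices(depth.shape)
    z = depth.astype(np.float64)
    x = (cols - cx) * z / fx
    y = (rows - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # keep only pixels with a valid depth reading
```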
  • the three-dimensional point cloud is the three-dimensional point cloud of the scene depth map of the target object to be grasped by the manipulator. Since there are many objects in the scene of the target object to be grasped, it is necessary to extract target points from the three-dimensional point cloud to obtain a target object point set.
  • the pre-built deep learning network is a convolutional neural network including a convolution layer, a pooling layer, and a fully connected layer.
  • the convolution layer uses a preset function to perform feature extraction on the three-dimensional point cloud, and the pooling layer compresses the data obtained by feature extraction, simplifies the computational complexity, and extracts main feature data.
  • the fully connected layer is:
  • the feature point set is obtained by concatenating all the data obtained by feature extraction.
  • the deep learning network further includes a classifier. Specifically, the classifier learns classification rules for the given categories from known training data, and then classifies the feature point set to obtain the target object point set and the non-target object point set.
  • a deep learning network to extract target points in the three-dimensional point cloud to obtain a target object point set, including:
  • the feature point set is classified into a target point set and a non-target point set by using the classifier in the deep learning network, and the target point set is extracted to obtain a target object point set.
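The patent does not specify layer sizes or the exact architecture; the PyTorch-style sketch below only illustrates the described pipeline (per-point convolutional feature extraction, pooling that compresses the features, fully connected fusion, and a classifier separating target from non-target points). All dimensions and the class labeling are assumptions.

```python
import torch
import torch.nn as nn

class TargetPointExtractor(nn.Module):
    # Sketch of the described network: shared 1x1 convolutions extract per-point
    # features, pooling compresses them into a global descriptor, fully connected
    # layers fuse the features, and a classifier labels each point.
    def __init__(self, num_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Linear(256, 128), nn.ReLU())
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, xyz):                                   # xyz: (B, N, 3) point cloud
        feat = self.conv(xyz.transpose(1, 2))                 # (B, 128, N) per-point features
        pooled = feat.max(dim=2, keepdim=True).values         # (B, 128, 1) pooled feature
        fused = torch.cat([feat, pooled.expand_as(feat)], dim=1)  # (B, 256, N)
        return self.classifier(self.fc(fused.transpose(1, 2)))    # (B, N, num_classes) logits

# Usage sketch: keep the points classified as "target" to form the target object point set.
# logits = model(points); target_set = points[0][logits[0].argmax(-1) == 1]
```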
  • visibility is the degree to which a target object can be seen. Some objects are occluded by other objects, which reduces their visibility and produces a visibility loss. Heavily occluded objects are not the objects that the robotic arm grasps first, because they are most likely at the bottom of the pile and do not provide enough information for pose estimation. To reduce the interference caused by these objects, the embodiment of this application calculates a visibility loss value for each object.
  • the visibility loss value of the target object is obtained by weighted calculation of the difference between the actual visibility and the predicted visibility of the target object.
  • N_i represents the number of points in the target object point set of target object i
  • N_max represents the number of points of the largest object point set contained in the 3D point cloud
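A minimal sketch of this computation follows. The exact loss form appears only in the patent's formula images, so the use of an absolute difference and a single weight are assumptions.

```python
def visibility_loss(n_points_i, n_points_max, predicted_visibility, weight=1.0):
    # Actual visibility: ratio of the object's point count to the largest object
    # point set in the cloud; loss: weighted gap between actual and predicted
    # visibility (absolute difference assumed).
    actual_visibility = n_points_i / n_points_max
    return weight * abs(actual_visibility - predicted_visibility)
```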
  • performing Hough voting on the target object point set to obtain a key point set including:
  • the target object sampling point set is obtained by sampling the target object point set, and the Euclidean distance offset of the target object sampling point is calculated to obtain the offset;
  • Voting is performed according to the offset, and the set of points whose votes exceed the preset threshold is used as the key point set.
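As a rough illustration of the voting step, the sketch below accumulates votes on a discretised grid. The patent only describes sampling, Euclidean offsets, voting, and a vote-count threshold, so the grid resolution, the vote target (point plus offset), and the returned cell centers are assumptions.

```python
import numpy as np

def hough_vote_keypoints(points, offsets, votes_threshold, grid=0.01):
    # points: (S, 3) sampled object points; offsets: (S, 3) predicted Euclidean
    # offsets from each sampled point toward a key point. Each sampled point
    # casts a vote at points + offsets; grid cells gathering more votes than
    # the threshold are returned as the key point set.
    targets = points + offsets
    cells = np.round(targets / grid).astype(int)           # discretise the vote space
    uniq, counts = np.unique(cells, axis=0, return_counts=True)
    return uniq[counts > votes_threshold] * grid           # approximate key point locations
```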
  • the key point set is divided into a common key point set and a central key point, and the key point loss value L_kps of the key point set is calculated with a point-wise feature regression algorithm using the following formula:
  • L_kp represents the loss of the common key points
  • N is the number of points in the target object point set
  • M is the number of common key points
  • L_c represents the loss of the center key point
  • Δx_i is the actual offset from a common key point to the center key point
  • γ_1 is the weight of the common key point loss
  • γ_2 is the weight of the center key point loss.
  • in the semantic segmentation, the semantic loss L_s of the target object is calculated from the pixel points of the scene depth map using the formula L_s = -α(1 - q_i)^γ log(q_i), where:
  • α represents the balance parameter of the camera device
  • γ represents the focus parameter of the camera device
  • q_i represents the confidence that the i-th pixel in the scene depth map belongs to the foreground point or the background point.
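This is the focal-loss style formula stated in the description; a minimal sketch follows. The default α and γ values are common focal-loss choices, not values taken from the patent.

```python
import torch

def semantic_loss(q, alpha=0.25, gamma=2.0):
    # q: tensor of per-pixel confidences that each pixel is assigned to its true
    # class (foreground or background); defaults for alpha / gamma are assumed.
    q = q.clamp(min=1e-7)                                    # avoid log(0)
    return (-alpha * (1.0 - q) ** gamma * torch.log(q)).mean()
```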
  • the pose of the target object refers to a six-dimensional quantity composed of a three-dimensional rotation matrix and a three-dimensional translation matrix.
  • L_kps represents the key point loss value
  • L_s represents the semantic loss value
  • L_v represents the visibility loss value
  • μ_1, μ_2, μ_3 represent the weights obtained by training the multi-task joint model.
  • the embodiment of the present application acquires a scene depth map of the target object, calculates the three-dimensional point cloud of the scene depth map, uses a deep learning network to extract the target object point set from the three-dimensional point cloud, calculates the visibility loss value, key point loss value, and semantic loss value of the target object from the three-dimensional point cloud and the target object point set, and finally obtains the pose of the target object from the visibility loss value, key point loss value, and semantic loss value.
  • the object pose estimation method proposed in the embodiment of the present application performs pose estimation on the target object according to the loss of visibility, key points, and semantics, and therefore, the accuracy of the object pose estimation can be improved.
  • FIG. 2 is a schematic block diagram of the object pose estimation apparatus of the present application.
  • the object pose estimation apparatus 100 described in this application may be installed in an electronic device.
  • the object pose estimation apparatus may include a three-dimensional point cloud acquisition module 101, a target object point set extraction module 102, a visibility loss value calculation module 103, a key point loss value calculation module 104, a semantic loss value calculation module 105, and a pose calculation module 106.
  • the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the three-dimensional point cloud acquiring module 101 is configured to acquire a scene depth map of a target object by using a preset camera device, and calculate a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map.
  • the camera device may be a 3D camera
  • the target object may be a target object to be grasped by a manipulator.
  • the scene depth image, also called a range image, refers to an image in which the distance (depth) from the camera device to each point of the target object is taken as the pixel value.
  • the scene depth map can be calculated as point cloud data after coordinate transformation.
  • the 3D point cloud of the scene depth map can be calculated according to the pixel points in the scene depth map through the following formula:
  • x, y, z are the coordinates of the point in the three-dimensional point cloud
  • u, v are the row and column where the pixel point is located in the scene depth map
  • c_x and c_y are the two-dimensional coordinates of the pixel point in the scene depth map, and f_x, f_y, d are the focal lengths of the camera device on the x-axis, the y-axis, and the z-axis, respectively.
  • the target object point set extraction module 102 uses a pre-built deep learning network to extract target points in the three-dimensional point cloud to obtain a target object point set.
  • the three-dimensional point cloud is the three-dimensional point cloud of the scene depth map of the target object to be grasped by the manipulator. Since there are many objects in the scene of the target object to be grasped, it is necessary to extract target points from the three-dimensional point cloud to obtain a target object point set.
  • the pre-built deep learning network is a convolutional neural network, including a convolution layer, a pooling layer, and a fully connected layer.
  • the convolution layer uses a preset function to perform feature extraction on the three-dimensional point cloud, and the pooling layer compresses the data obtained by feature extraction, simplifies the computational complexity, and extracts main feature data.
  • the fully connected layer is:
  • the feature point set is obtained by concatenating all the data obtained by feature extraction.
  • the deep learning network further includes a classifier. Specifically, the classifier learns classification rules for the given categories from known training data, and then classifies the feature point set to obtain the target object point set and the non-target object point set.
  • the target object point set extraction module 102 is specifically used for:
  • the feature point set is classified into a target point set and a non-target object point set by using the classifier in the deep learning network, and the target object point set is extracted.
  • the visibility loss value calculation module 103 is configured to calculate the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set.
  • visibility is the degree to which a target object can be seen. Some objects are occluded by other objects, which reduces their visibility and produces a visibility loss. Heavily occluded objects are not the objects that the robotic arm grasps first, because they are most likely at the bottom of the pile and do not provide enough information for pose estimation. To reduce the interference caused by these objects, the embodiment of this application calculates a visibility loss value for each object.
  • the visibility loss value calculation module 103 is specifically used for:
  • the visibility loss value of the target object is obtained by weighted calculation of the difference between the actual visibility and the predicted visibility of the target object.
  • N_i represents the number of points in the target object point set of target object i
  • N_max represents the number of points of the largest object point set contained in the 3D point cloud
  • the key point loss value calculation module 104 is configured to perform Hough voting on the target object point set to obtain a key point set, and calculate the key point loss value of the target object according to the key point set.
  • performing Hough voting on the target object point set to obtain a key point set including:
  • the target object sampling point set is obtained by sampling the target object point set, and the Euclidean distance offset of the target object sampling point is calculated to obtain the offset;
  • Voting is performed according to the offset, and the set of points whose votes exceed the preset threshold is used as the key point set.
  • the key point set is divided into a common key point set and a central key point, and the key point loss value L_kps of the key point set is calculated with a point-wise feature regression algorithm using the following formula:
  • L_kp represents the loss of the common key points
  • N is the number of points in the target object point set
  • M is the number of common key points
  • L_c represents the loss of the center key point
  • Δx_i is the actual offset from a common key point to the center key point
  • γ_1 is the weight of the common key point loss
  • γ_2 is the weight of the center key point loss.
  • the semantic loss value calculation module 105 is configured to perform semantic segmentation on the pixels of the scene depth map to obtain the semantic loss value of the target object.
  • in the semantic segmentation, the semantic loss L_s of the target object is calculated from the pixel points of the scene depth map using the formula L_s = -α(1 - q_i)^γ log(q_i), where:
  • α represents the balance parameter of the camera device
  • γ represents the focus parameter of the camera device
  • q_i represents the confidence that the i-th pixel in the scene depth map belongs to the foreground point or the background point.
  • the pose calculation module 106 is configured to calculate the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and the multi-task joint model obtained by pre-training.
  • the pose of the target object refers to a six-dimensional quantity composed of a three-dimensional rotation matrix and a three-dimensional translation matrix.
  • the pose calculation module 106 uses the following multi-task joint model to calculate the final loss value L_mt of the target object:
  • L_kps represents the key point loss value
  • L_s represents the semantic loss value
  • L_v represents the visibility loss value
  • μ_1, μ_2, μ_3 represent the weights obtained by training the multi-task joint model.
  • the embodiment of the present application further adjusts the predicted rotation matrix and the predicted translation matrix of the target object according to the final loss value to obtain the pose of the target object.
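As an illustration of how the final loss can drive this adjustment, the sketch below performs one gradient step on L_mt against the branch that predicts the pose; the predict_pose callable and the optimizer are hypothetical stand-ins, since the patent does not detail the adjustment procedure.

```python
def refine_pose_step(predict_pose, optimizer, l_kps, l_s, l_v, mu=(1.0, 1.0, 1.0)):
    # One illustrative optimisation step: combine the three losses into L_mt and
    # back-propagate it so that the branch predicting the rotation matrix R and
    # the translation matrix t moves toward the true 6D pose.
    l_mt = mu[0] * l_kps + mu[1] * l_s + mu[2] * l_v
    optimizer.zero_grad()
    l_mt.backward()
    optimizer.step()
    return predict_pose()  # re-evaluate to obtain the adjusted (R, t)
```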
  • the pose calculation module 106 sends the pose of the target object to a pre-built robotic arm, and uses the robotic arm to perform the target object grasping task.
  • FIG. 3 is a schematic structural diagram of an electronic device implementing the object pose estimation method of the present application.
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as an object pose estimation program 12.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium may be volatile or non-volatile.
  • the readable storage medium includes a flash memory, a mobile hard disk, a multimedia card, a card-type memory (eg, SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a removable hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable removable hard disk, a smart media card (SMC), or a secure digital (SD) card equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the object pose estimation program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like.
  • the processor 10 is the control unit of the electronic device; it connects the various components of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the object pose estimation program) and by calling the data stored in the memory 11.
  • the bus may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 3 only shows an electronic device with certain components. Those skilled in the art will understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the various components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the object pose estimation program 12 stored in the memory 11 in the electronic device 1 is a combination of multiple computer programs, and when running in the processor 10, it can realize:
  • Hough voting is performed on the target object point set to obtain a key point set, and the key point loss value of the target object is calculated according to the key point set;
  • Semantic segmentation is performed on the pixels of the scene depth map to obtain the semantic loss value of the target object
  • the pose of the target object is calculated.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), and the like.
  • the computer-usable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required for at least one function, and the like, and the stored data area may store the created data, and the like.
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database: a series of data blocks associated with one another using cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A target object posture estimation method, comprising: obtaining a three-dimensional point cloud according to a scene depth map of a target object; extracting a target object point set from the three-dimensional point cloud; calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set; calculating a key point loss value of the target object by means of performing Hough voting on the target object point set; performing semantic segmentation on pixel points of the scene depth map, so as to obtain a semantic loss value of the target object; and calculating a posture of the target object according to the visibility loss value, the key point loss value, the semantic loss value and a multi-task joint model. Further provided are a target object posture estimation apparatus, a device and a storage medium. The method further relates to blockchain technology, and a scene depth map can be stored in a blockchain node. By means of the method, the posture of a target object to be grabbed can be accurately analyzed, thereby improving the grabbing precision of a mechanical arm.

Description

Object pose estimation method, apparatus, electronic device and computer storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on December 01, 2020, with application number 202011385260.7 and the invention title "Object pose estimation method, apparatus, electronic device and computer storage medium", the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular, to an object pose estimation method, apparatus, electronic device, and computer-readable storage medium.
Background Art
The inventors realized that, with the continuous development of robotic arms in the industrial field and the in-depth application of intelligent vision systems, robotic arms equipped with intelligent vision systems have begun to undertake complex tasks such as intelligent sorting and flexible manufacturing, becoming industrial machinery that saves human resources.
The grasping and sorting tasks of industrial robotic arms mainly rely on pose estimation of the objects to be grasped. At present, object pose estimation mainly uses point-by-point teaching or 2D visual perception methods. However, in an industrial environment, the point-by-point teaching method is complex and time-consuming, and the 2D visual perception method leads to inaccurate object pose estimation because of the cluttered placement of objects and the occlusion between objects.
Summary of the Invention
An object pose estimation method provided by this application includes:
acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map;
extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
performing Hough voting on the target object point set to obtain a key point set, and calculating a key point loss value of the target object according to the key point set;
performing semantic segmentation on the pixels of the scene depth map to obtain a semantic loss value of the target object;
calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a multi-task joint model obtained by pre-training.
The present application also provides an apparatus for estimating the pose of a target object, the apparatus comprising:
a three-dimensional point cloud acquisition module, configured to acquire a scene depth map of a target object by using a preset camera device, and to calculate a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map;
a target object point set extraction module, configured to extract target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
a visibility loss value calculation module, configured to calculate a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
a key point loss value calculation module, configured to perform Hough voting on the target object point set to obtain a key point set, and to calculate a key point loss value of the target object according to the key point set;
a semantic loss value calculation module, configured to perform semantic segmentation on the pixels of the scene depth map to obtain a semantic loss value of the target object;
a pose calculation module, configured to calculate the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a multi-task joint model obtained by pre-training.
The present application also provides an electronic device, the electronic device comprising:
a memory storing at least one computer program; and
a processor that executes the computer program stored in the memory to implement the following object pose estimation method:
acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map;
extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
performing Hough voting on the target object point set to obtain a key point set, and calculating a key point loss value of the target object according to the key point set;
performing semantic segmentation on the pixels of the scene depth map to obtain a semantic loss value of the target object;
calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a multi-task joint model obtained by pre-training.
The present application also provides a computer-readable storage medium including a storage data area and a storage program area, wherein the storage data area stores created data and the storage program area stores a computer program, and wherein the computer program, when executed by a processor, implements the following object pose estimation method:
acquiring a scene depth map of a target object by using a preset camera device, and calculating a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map;
extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
performing Hough voting on the target object point set to obtain a key point set, and calculating a key point loss value of the target object according to the key point set;
performing semantic segmentation on the pixels of the scene depth map to obtain a semantic loss value of the target object;
calculating the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a multi-task joint model obtained by pre-training.
Description of the Drawings
FIG. 1 is a schematic flowchart of an object pose estimation method provided by an embodiment of the present application;
FIG. 2 is a schematic block diagram of an object pose estimation apparatus provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the internal structure of an electronic device implementing an object pose estimation method provided by an embodiment of the present application.
The realization of the objectives, functional characteristics, and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description of Embodiments
It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.
The embodiments of the present application provide an object pose estimation method. The execution subject of the object pose estimation method includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiments of the present application. In other words, the object pose estimation method may be executed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
Referring to FIG. 1, which is a schematic flowchart of an object pose estimation method provided by an embodiment of the present application, in this embodiment the object pose estimation method includes:
S1. Acquire a scene depth map of a target object by using a preset camera device, and calculate a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map.
In this embodiment of the present application, the camera device may be a 3D camera, and the target object may be a target object to be grasped by a manipulator. The scene depth image, also called a range image, refers to an image in which the distance (depth) from the camera device to each point of the target object is taken as the pixel value. The scene depth map can be converted into point cloud data after coordinate transformation.
In one embodiment of the present application, the scene depth map may be stored in a blockchain node.
In detail, in this embodiment of the present application, the three-dimensional point cloud of the scene depth map can be calculated from the pixels in the scene depth map using the following formula:
(point cloud conversion formula, image PCTCN2021083083-appb-000001)
where x, y, z are the coordinates of a point in the three-dimensional point cloud, u and v are the row and column of the pixel in the scene depth map, c_x and c_y are the two-dimensional coordinates of the pixel in the scene depth map, and f_x, f_y, d are the focal lengths of the camera device on the x-axis, the y-axis, and the z-axis, respectively.
S2. Use a pre-built deep learning network to extract target points in the three-dimensional point cloud to obtain a target object point set.
As described above, the three-dimensional point cloud is the three-dimensional point cloud of the scene depth map of the target object to be grasped by the manipulator. Since many objects exist in the scene of the target object to be grasped, target points need to be extracted from the three-dimensional point cloud to obtain a target object point set.
In this embodiment of the present application, the pre-built deep learning network is a convolutional neural network including a convolution layer, a pooling layer, and a fully connected layer. The convolution layer uses a preset function to perform feature extraction on the three-dimensional point cloud, the pooling layer compresses the data obtained by feature extraction to reduce computational complexity and extract the main feature data, and the fully connected layer concatenates all the data obtained by feature extraction to obtain a feature point set. Further, in this embodiment of the present application, the deep learning network also includes a classifier. In detail, the classifier learns classification rules for the given categories from known training data, and then classifies the feature point set to obtain the target object point set and the non-target object point set.
In detail, using the deep learning network to extract the target points in the three-dimensional point cloud to obtain the target object point set includes:
extracting a feature point set of the three-dimensional point cloud by using the convolution, pooling, and fully connected layers in the pre-built deep learning network;
classifying the feature point set into a target point set and a non-target point set by using the classifier in the deep learning network, and extracting the target point set to obtain the target object point set.
S3. Calculate the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set.
It can be understood that visibility is the degree to which a target object can be seen. Some objects are occluded by other objects, which reduces their visibility and produces a visibility loss. Heavily occluded objects are not the objects that the robotic arm grasps first, because they are most likely at the bottom of the pile and do not provide enough information for pose estimation. To reduce the interference caused by these objects, the embodiment of this application calculates a visibility loss value for each object.
One embodiment of the present application may calculate the visibility loss value of the target object by the following method:
calculating the actual visibility of the target object according to the ratio of the number of points in the target object point set to the number of points in the largest point set among all objects contained in the three-dimensional point cloud;
obtaining the visibility loss value of the target object by a weighted calculation of the difference between the actual visibility and the predicted visibility of the target object.
That is:
(visibility loss formulas, images PCTCN2021083083-appb-000002 and PCTCN2021083083-appb-000003)
where N_i denotes the number of points in the target object point set of target object i, N_max denotes the number of points of the largest object point set contained in the three-dimensional point cloud, and the symbol shown in image PCTCN2021083083-appb-000004 denotes the predicted visibility of target object i, that is, the maximum visibility of target object i without any occlusion.
S4. Perform Hough voting on the target object point set to obtain a key point set, and calculate the key point loss value of the key point set.
In detail, performing Hough voting on the target object point set to obtain a key point set includes:
sampling the target object point set to obtain a target object sampling point set, and calculating the Euclidean distance offsets of the target object sampling points to obtain offsets;
voting according to the offsets, and taking the set of points whose number of votes exceeds a preset threshold as the key point set.
Further, based on the property that there is one and only one central key point and that it is not affected by occlusion, the embodiment of the present application divides the key point set into a common key point set and a central key point, and calculates the key point loss value L_kps of the key point set with a point-wise feature regression algorithm using the following formulas:
(common key point loss L_kp and center key point loss L_c formulas, images PCTCN2021083083-appb-000005 and PCTCN2021083083-appb-000006)
L_kps = γ_1 L_kp + γ_2 L_c
where L_kp represents the loss of the common key points, N is the number of points in the target object point set, M is the number of common key points, the symbols shown in images PCTCN2021083083-appb-000007 and PCTCN2021083083-appb-000008 represent the actual position offset of the target object point set and its predicted position offset, respectively, L_c represents the loss of the center key point, Δx_i is the actual offset from a common key point to the center key point, the symbol shown in image PCTCN2021083083-appb-000009 is the predicted offset from a common key point to the center key point, γ_1 is the weight of the common key point loss, and γ_2 is the weight of the center key point loss.
S5. Perform semantic segmentation on the pixels of the scene depth map to obtain a semantic loss value.
In detail, in the semantic segmentation, the semantic loss L_s of the target object is calculated from the pixels of the scene depth map using the following formula:
L_s = -α(1 - q_i)^γ log(q_i)
where α represents the balance parameter of the camera device, γ represents the focus parameter of the camera device, and q_i represents the confidence that the i-th pixel in the scene depth map belongs to the foreground point or the background point.
S6. Calculate the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and a multi-task joint model obtained by pre-training.
In detail, in this embodiment of the present application, the pose of the target object refers to a six-dimensional quantity composed of a three-dimensional rotation matrix and a three-dimensional translation matrix.
The embodiment of the present application uses the following multi-task joint model to calculate the final loss value L_mt of the target object:
L_mt = μ_1 L_kps + μ_2 L_s + μ_3 L_v
where L_kps represents the key point loss value, L_s represents the semantic loss value, L_v represents the visibility loss value, and μ_1, μ_2, μ_3 represent the weights obtained by training the multi-task joint model.
The predicted rotation matrix and the predicted translation matrix of the target object are adjusted according to the final loss value to obtain the pose of the target object.
The embodiment of the present application acquires a scene depth map of the target object, calculates the three-dimensional point cloud of the scene depth map, uses a deep learning network to extract the target object point set from the three-dimensional point cloud, calculates the visibility loss value, the key point loss value, and the semantic loss value of the target object from the three-dimensional point cloud and the target object point set, and finally obtains the pose of the target object from these loss values. Because the object pose estimation method proposed in the embodiment of the present application estimates the pose of the target object from the visibility, key point, and semantic losses, the accuracy of object pose estimation can be improved.
FIG. 2 is a schematic block diagram of the object pose estimation apparatus of the present application.
The object pose estimation apparatus 100 described in this application may be installed in an electronic device. According to the implemented functions, the object pose estimation apparatus may include a three-dimensional point cloud acquisition module 101, a target object point set extraction module 102, a visibility loss value calculation module 103, a key point loss value calculation module 104, a semantic loss value calculation module 105, and a pose calculation module 106. The modules described in this application may also be referred to as units, which are a series of computer program segments that can be executed by the processor of an electronic device, can perform fixed functions, and are stored in the memory of the electronic device.
In this embodiment, the functions of the modules/units are as follows:
The three-dimensional point cloud acquisition module 101 is configured to acquire a scene depth map of a target object by using a preset camera device, and to calculate a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map.
In this embodiment of the present application, the camera device may be a 3D camera, and the target object may be a target object to be grasped by a manipulator. The scene depth image, also called a range image, refers to an image in which the distance (depth) from the camera device to each point of the target object is taken as the pixel value. The scene depth map can be converted into point cloud data after coordinate transformation. In detail, in this embodiment of the present application, the three-dimensional point cloud of the scene depth map can be calculated from the pixels in the scene depth map using the following formula:
(point cloud conversion formula, image PCTCN2021083083-appb-000010)
where x, y, z are the coordinates of a point in the three-dimensional point cloud, u and v are the row and column of the pixel in the scene depth map, c_x and c_y are the two-dimensional coordinates of the pixel in the scene depth map, and f_x, f_y, d are the focal lengths of the camera device on the x-axis, the y-axis, and the z-axis, respectively.
所述目标物体点集提取模块102,利用预构建的深度学习网络提取所述三维点云中的目标点,得到目标物体点集。The target object point set extraction module 102 uses a pre-built deep learning network to extract target points in the three-dimensional point cloud to obtain a target object point set.
如上述描述可知,所述三维点云是机械手待抓取的目标物体的场景深度图的三维点云。由于所述待抓取的目标物体的场景中会存在很多物体,因此,需要从所述三维点云中提取目标点,得到目标物体点集。As can be seen from the above description, the three-dimensional point cloud is the three-dimensional point cloud of the scene depth map of the target object to be grasped by the manipulator. Since there are many objects in the scene of the target object to be grasped, it is necessary to extract target points from the three-dimensional point cloud to obtain a target object point set.
In the embodiment of the present application, the pre-built deep learning network is a convolutional neural network that includes a convolution layer, a pooling layer and a fully connected layer. The convolution layer performs feature extraction on the three-dimensional point cloud using a preset function; the pooling layer compresses the extracted data to reduce computational complexity and retain the main feature data; and the fully connected layer combines all of the extracted data to obtain a feature point set. Further, in the embodiment of the present application, the deep learning network also includes a classifier. In detail, the classifier learns classification rules for the given categories from known training data and then classifies the feature point set, obtaining the target object point set and the non-target object point set.
详细地,本申请实施例中,所述目标物体点集提取模块102具体用于:In detail, in the embodiment of the present application, the target object point set extraction module 102 is specifically used for:
利用预构建的深度学习网络中的卷积、池化以及全连接层提取所述三维点云的特征点集;Extract the feature point set of the 3D point cloud by using the convolution, pooling and fully connected layers in the pre-built deep learning network;
利用所述深度学习网络中的分类器将所述特征点集分类为目标点集和非目标物体点集,并提取其中的目标物体点集。The feature point set is classified into a target point set and a non-target object point set by using the classifier in the deep learning network, and the target object point set is extracted.
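For illustration only, a rough PointNet-style sketch of such a network is given below; the layer sizes, the max-pooled global feature and the two-way (target / non-target) classifier head are assumptions of the example, not the specific network claimed here.

```python
import torch
import torch.nn as nn

class TargetPointSegmenter(nn.Module):
    """Per-point classifier: target object point vs. non-target (background) point."""
    def __init__(self):
        super().__init__()
        # Point-wise convolution layers shared across all points of the (B, 3, N) cloud.
        self.features = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
        )
        # Fully connected head (1x1 convolution applied point-wise) acting as the classifier.
        self.classifier = nn.Conv1d(128 + 128, 2, 1)

    def forward(self, pts):                                    # pts: (B, 3, N)
        local_feat = self.features(pts)                        # (B, 128, N)
        global_feat = torch.max(local_feat, dim=2, keepdim=True).values   # pooling step
        global_feat = global_feat.expand(-1, -1, pts.shape[2])
        logits = self.classifier(torch.cat([local_feat, global_feat], dim=1))
        return logits                                          # (B, 2, N) target / non-target scores

# The points predicted as "target" form the target object point set.
pts = torch.rand(1, 3, 1024)
labels = TargetPointSegmenter()(pts).argmax(dim=1)             # (1, 1024) of 0 / 1
target_object_point_set = pts[:, :, labels[0] == 1]
```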
所述可见度损失值计算模块103,用于根据所述三维点云和所述目标物体点集,计算所述目标物体的可见度损失值。The visibility loss value calculation module 103 is configured to calculate the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set.
It can be understood that visibility is the degree to which a target object can be seen under normal eyesight. Some objects are occluded by other objects or otherwise obscured, which reduces their visibility and produces a visibility loss value. Heavily occluded objects are not the objects that the robotic arm should grasp first, because they are most likely located at the bottom of the pile and there is not enough information to estimate their pose. To reduce the interference caused by such objects, the embodiment of the present application needs to calculate the visibility loss value of the objects.
本申请其中一个实施例,所述可见度损失值计算模块103具体用于:In one of the embodiments of the present application, the visibility loss value calculation module 103 is specifically used for:
根据所述目标物体的目标物体点集的点数与所述三维点云中包含的所有目标物体中的最大点集的点数的比值计算所述目标物体的实际可见度;Calculate the actual visibility of the target object according to the ratio of the number of points of the target object point set of the target object to the number of points of the largest point set among all the target objects included in the three-dimensional point cloud;
通过所述实际可见度与所述目标物体的预测可见度的差的加权计算得到所述目标物体的可见度损失值。The visibility loss value of the target object is obtained by weighted calculation of the difference between the actual visibility and the predicted visibility of the target object.
That is:

v_i = N_i / N_max

L_v = λ · (v̂_i - v_i)

where N_i represents the number of points in the target object point set of target object i, N_max represents the number of points in the largest point set among the target objects contained in the three-dimensional point cloud, v_i is the actual visibility of target object i, v̂_i represents the predicted visibility of target object i, that is, the maximum visibility of target object i without any occlusion, and λ is the weight applied to the difference.
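The calculation above can be sketched as follows; this is a minimal example, and the weight value and the array-based interface are illustrative assumptions.

```python
import numpy as np

def visibility_loss(point_counts, predicted_visibility, weight=1.0):
    """point_counts[i]: points extracted for object i;
    predicted_visibility[i]: visibility predicted for object i (1.0 = fully unoccluded)."""
    point_counts = np.asarray(point_counts, dtype=float)
    actual_visibility = point_counts / point_counts.max()       # v_i = N_i / N_max
    # Weighted difference between predicted and actual visibility, per object.
    return weight * (np.asarray(predicted_visibility, dtype=float) - actual_visibility)

# Example: three objects in the scene, the third one heavily occluded.
print(visibility_loss([900, 1000, 150], [1.0, 1.0, 1.0]))       # -> [0.1, 0.0, 0.85]
```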
所述关键点损失值计算模块104,用于对所述目标物体点集进行霍夫投票,得到关键点集,根据所述关键点集计算所述目标物体的关键点损失值。The key point loss value calculation module 104 is configured to perform Hough voting on the target object point set to obtain a key point set, and calculate the key point loss value of the target object according to the key point set.
详细地,所述对所述目标物体点集进行霍夫投票,得到关键点集,包括:Specifically, performing Hough voting on the target object point set to obtain a key point set, including:
从所述目标物体点集采样得到目标物体采样点集,计算所述目标物体采样点的欧式距离偏移,得到偏移量;The target object sampling point set is obtained by sampling the target object point set, and the Euclidean distance offset of the target object sampling point is calculated to obtain the offset;
根据所述偏移量进行投票,将票数超过预设阈值的点的集合作为关键点集。Voting is performed according to the offset, and the set of points whose votes exceed the preset threshold is used as the key point set.
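A toy sketch of this voting step is shown below; in practice the offsets would be predicted by the network, and the bin size and vote threshold are illustrative values only.

```python
import numpy as np

def hough_vote_keypoints(points, offsets, bin_size=0.01, vote_threshold=20):
    """points: (N, 3) sampled object points; offsets: (N, 3) Euclidean offsets from
    each point towards a keypoint. Each point casts one vote at points + offsets."""
    votes = points + offsets
    bins = np.round(votes / bin_size).astype(int)           # discretise the voting space
    uniq, counts = np.unique(bins, axis=0, return_counts=True)
    winners = uniq[counts >= vote_threshold]                 # bins whose votes exceed the threshold
    return winners * bin_size                                # candidate keypoint positions

pts = np.random.rand(500, 3)
offs = np.array([0.5, 0.5, 0.5]) - pts                       # toy offsets pointing at (0.5, 0.5, 0.5)
print(hough_vote_keypoints(pts, offs))                       # ~[[0.5, 0.5, 0.5]]
```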
Further, in the embodiment of the present application, based on the property that there is one and only one center keypoint and that it is not affected by occlusion, the keypoint set is divided into a common keypoint set and a center keypoint, and the keypoint loss value L_kps of the keypoint set is calculated with a point-wise feature regression algorithm using the following formulas:
L_kp = (1/N) · Σ_{i=1..N} Σ_{j=1..M} ‖Δd_i^j - Δd̂_i^j‖

L_c = (1/N) · Σ_{i=1..N} ‖Δx_i - Δx̂_i‖

L_kps = γ_1·L_kp + γ_2·L_c

where L_kp represents the common keypoint loss, N is the number of points in the target object point set, M is the number of common keypoints, Δd_i^j represents the actual position offset of the target object point set (from point i to common keypoint j) and Δd̂_i^j represents the corresponding predicted position offset, L_c represents the center keypoint loss, Δx_i is the actual offset from the common keypoints to the center keypoint and Δx̂_i is the corresponding predicted offset, and γ_1 and γ_2 are the weights of the common keypoint loss and the center keypoint loss, respectively.
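For illustration, the loss above can be sketched in a few lines; the tensor shapes and the equal default weights are assumptions of the example.

```python
import torch

def keypoint_loss(off_gt, off_pred, ctr_gt, ctr_pred, gamma1=1.0, gamma2=1.0):
    """off_*: (N, M, 3) offsets from each of N object points to M common keypoints;
    ctr_*: (N, 3) offsets to the single center keypoint."""
    l_kp = (off_pred - off_gt).norm(dim=-1).sum(dim=1).mean()   # common keypoint loss L_kp
    l_c = (ctr_pred - ctr_gt).norm(dim=-1).mean()               # center keypoint loss L_c
    return gamma1 * l_kp + gamma2 * l_c                         # L_kps

n_points, n_keypoints = 1024, 8
loss = keypoint_loss(torch.rand(n_points, n_keypoints, 3), torch.rand(n_points, n_keypoints, 3),
                     torch.rand(n_points, 3), torch.rand(n_points, 3))
```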
所述语义损失值计算模块105,用于对所述场景深度图的像素点进行语义分割,得到所述目标物体的语义损失值。The semantic loss value calculation module 105 is configured to perform semantic segmentation on the pixels of the scene depth map to obtain the semantic loss value of the target object.
In detail, the semantic segmentation calculates the semantic loss L_s of the target object from the pixels of the scene depth map using the following formula:

L_s = -α · (1 - q_i)^γ · log(q_i)

where α represents the balance parameter of the camera device, γ represents the focus parameter of the camera device, and q_i represents the confidence that the i-th pixel in the scene depth map belongs to a foreground point or a background point.
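A sketch of this focal-style loss follows; the α and γ values and the averaging over pixels are illustrative assumptions.

```python
import torch

def semantic_loss(q, alpha=0.25, gamma=2.0):
    """q: per-pixel confidence, in (0, 1), that the pixel is classified correctly
    as foreground or background. Implements L_s = -alpha * (1 - q)**gamma * log(q)."""
    return (-alpha * (1.0 - q) ** gamma * torch.log(q)).mean()

q = torch.tensor([0.9, 0.6, 0.99])    # confident, uncertain, and very confident pixels
print(semantic_loss(q))               # the uncertain pixel dominates the loss
```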
所述位姿计算模块106,用于根据所述可见度损失值、所述关键点损失值、所述语义损失值,以及预先训练得到的多任务联合模型,计算得到所述目标物体的位姿。The pose calculation module 106 is configured to calculate the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value, and the multi-task joint model obtained by pre-training.
详细地,本申请实施例中,所述目标物体的位姿是指三维的旋转矩阵和三维的平移矩阵组成的六维量。In detail, in the embodiment of the present application, the pose of the target object refers to a six-dimensional quantity composed of a three-dimensional rotation matrix and a three-dimensional translation matrix.
详细地,所述位姿计算模块106利用下述多任务联合模型计算所述目标物体的最终损失值L mtSpecifically, the pose calculation module 106 uses the following multi-task joint model to calculate the final loss value L mt of the target object:
L_mt = μ_1·L_kps + μ_2·L_s + μ_3·L_v

where L_kps represents the keypoint loss value, L_s represents the semantic loss value, L_v represents the visibility loss value, and μ_1, μ_2, μ_3 represent the weights obtained by training the multi-task joint model.
The embodiment of the present application further adjusts the predicted rotation matrix and the predicted translation matrix of the target object according to the final loss value, thereby obtaining the pose of the target object.
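A minimal sketch of combining the three losses and letting the final loss drive an update is given below; the placeholder loss values, the equal μ weights and the use of backward() are illustrative assumptions, not the training procedure claimed here.

```python
import torch

def multitask_loss(l_kps, l_s, l_v, mu1=1.0, mu2=1.0, mu3=1.0):
    """Final loss L_mt = mu1*L_kps + mu2*L_s + mu3*L_v; the mu weights are obtained
    by training the multi-task joint model (placeholder values here)."""
    return mu1 * l_kps + mu2 * l_s + mu3 * l_v

# Dummy per-branch losses standing in for the network outputs.
l_kps = torch.tensor(0.42, requires_grad=True)
l_s = torch.tensor(0.10, requires_grad=True)
l_v = torch.tensor(0.05, requires_grad=True)

l_mt = multitask_loss(l_kps, l_s, l_v)
l_mt.backward()   # gradients of L_mt are what adjust the predicted rotation and translation
```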
进一步地,所述位姿计算模块106将所述目标物体位姿发送给预构建的机械臂,利用所述机械臂执行目标物体抓取任务。Further, the pose calculation module 106 sends the pose of the target object to a pre-built robotic arm, and uses the robotic arm to perform the target object grasping task.
如图3所示,是本申请实现物体位姿估计方法的电子设备的结构示意图。As shown in FIG. 3 , it is a schematic structural diagram of an electronic device implementing the object pose estimation method of the present application.
所述电子设备1可以包括处理器10、存储器11和总线,还可以包括存储在所述存储器11中并可在所述处理器10上运行的计算机程序,如物体位姿估计程序12。The electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as an object pose estimation program 12.
其中,所述存储器11至少包括一种类型的可读存储介质,所述可读存储介质可以是易失性的,也可以是非易失性的。具体的,所述可读存储介质包括闪存、移动硬盘、多媒体卡、卡型存储器(例如:SD或DX存储器等)、磁性存储器、磁盘、光盘等。所述存储器11在一些实施例中可以是电子设备1的内部存储单元,例如该电子设备1的移动硬盘。所述存储器11在另一些实施例中也可以是电子设备1的外部存储设备,例如电子设备1上配备的插接式移动硬盘、智能存储卡(SmartMediaCard,SMC)、安全数字(SecureDigital,SD)卡、闪存卡(FlashCard)等。进一步地,所述存储器11还可以既包括电子设备1的内部存储单元也包括外部存储设备。所述存储器11不仅可以用于存储安装于电子设备1的应用软件及各类数据,例如物体位姿估计程序12的代码等,还可以用于暂时地存储已经输出或者将要输出的数据。Wherein, the memory 11 includes at least one type of readable storage medium, and the readable storage medium may be volatile or non-volatile. Specifically, the readable storage medium includes a flash memory, a mobile hard disk, a multimedia card, a card-type memory (eg, SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a mobile hard disk of the electronic device 1 . In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) equipped on the electronic device 1. card, flash memory card (FlashCard) and so on. Further, the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device. The memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the object pose estimation program 12, etc., but also can be used to temporarily store data that has been output or will be output.
所述处理器10在一些实施例中可以由集成电路组成,例如可以由单个封装的集成电路所组成,也可以是由多个相同功能或不同功能封装的集成电路所组成,包括一个或者多个中央处理器(CentralProcessingunit,CPU)、微处理器、数字处理芯片、图形处理器及各种控制芯片的组合等。所述处理器10是所述电子设备的控制核心(ControlUnit),利用各种接口和线路连接整个电子设备的各个部件,通过运行或执行存储在所述存储器11内的程序或者模块(例如执行物体位姿估计程序等),以及调用存储在所述存储器11内的数据,以执行电子设备1的各种功能和处理数据。In some embodiments, the processor 10 may be composed of integrated circuits, for example, may be composed of a single packaged integrated circuit, or may be composed of multiple integrated circuits packaged with the same function or different functions, including one or more integrated circuits. Central processing unit (Central Processing unit, CPU), microprocessor, digital processing chip, graphics processor and combination of various control chips, etc. The processor 10 is the control core (ControlUnit) of the electronic device, and uses various interfaces and lines to connect the various components of the entire electronic device, by running or executing the program or module (for example, executing the object) stored in the memory 11. pose estimation program, etc.), and call the data stored in the memory 11 to perform various functions of the electronic device 1 and process data.
所述总线可以是外设部件互连标准(peripheralcomponentinterconnect,简称PCI)总线或扩展工业标准结构(extendedindustrystandardarchitecture,简称EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。所述总线被设置为实现所述存储器11以及至少一个处理器10等之间的连接通信。The bus may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (extended industry standard architecture, EISA for short) bus or the like. The bus can be divided into address bus, data bus, control bus and so on. The bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
图3仅示出了具有部件的电子设备,本领域技术人员可以理解的是,图3示出的结构 并不构成对所述电子设备1的限定,可以包括比图示更少或者更多的部件,或者组合某些部件,或者不同的部件布置。FIG. 3 only shows an electronic device with components. Those skilled in the art can understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, and may include fewer or more components than those shown in the figure. components, or a combination of certain components, or a different arrangement of components.
例如,尽管未示出,所述电子设备1还可以包括给各个部件供电的电源(比如电池),优选地,电源可以通过电源管理装置与所述至少一个处理器10逻辑相连,从而通过电源管理装置实现充电管理、放电管理、以及功耗管理等功能。电源还可以包括一个或一个以上的直流或交流电源、再充电装置、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。所述电子设备1还可以包括多种传感器、蓝牙模块、Wi-Fi模块等,在此不再赘述。For example, although not shown, the electronic device 1 may also include a power supply (such as a battery) for powering the various components, preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management The device implements functions such as charge management, discharge management, and power consumption management. The power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components. The electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
进一步地,所述电子设备1还可以包括网络接口,可选地,所述网络接口可以包括有线接口和/或无线接口(如WI-FI接口、蓝牙接口等),通常用于在该电子设备1与其他电子设备之间建立通信连接。Further, the electronic device 1 may also include a network interface, optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a Bluetooth interface, etc.), which is usually used in the electronic device 1 Establish a communication connection with other electronic devices.
可选地,该电子设备1还可以包括用户接口,用户接口可以是显示器(Display)、输入单元(比如键盘(Keyboard)),可选地,用户接口还可以是标准的有线接口、无线接口。可选地,在一些实施例中,显示器可以是LED显示器、液晶显示器、触控式液晶显示器以及OLED(OrganicLight-EmittingDiode,有机发光二极管)触摸器等。其中,显示器也可以适当的称为显示屏或显示单元,用于显示在电子设备1中处理的信息以及用于显示可视化的用户界面。Optionally, the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like. The display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
应该了解,所述实施例仅为说明之用,在专利申请范围上并不受此结构的限制。It should be understood that the embodiments are only used for illustration, and are not limited by this structure in the scope of the patent application.
所述电子设备1中的所述存储器11存储的物体位姿估计程序12是多个计算机程序的组合,在所述处理器10中运行时,可以实现:The object pose estimation program 12 stored in the memory 11 in the electronic device 1 is a combination of multiple computer programs, and when running in the processor 10, it can realize:
利用预设的摄像装置获取目标物体的场景深度图,根据所述场景深度图中的像素点计算所述场景深度图的三维点云;Use a preset camera to obtain a scene depth map of the target object, and calculate a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map;
利用预构建的深度学习网络提取所述三维点云中的目标点,得到目标物体点集;Extract target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
根据所述三维点云和所述目标物体点集,计算所述目标物体的可见度损失值;Calculate the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
对所述目标物体点集进行霍夫投票,得到关键点集,根据所述关键点集计算所述目标物体的关键点损失值;Hough voting is performed on the target object point set to obtain a key point set, and the key point loss value of the target object is calculated according to the key point set;
对所述场景深度图的像素点进行语义分割,得到所述目标物体的语义损失值;Semantic segmentation is performed on the pixels of the scene depth map to obtain the semantic loss value of the target object;
根据所述可见度损失值、所述关键点损失值、所述语义损失值以及预先训练得到的多任务联合模型,计算得到所述目标物体的位姿。According to the visibility loss value, the key point loss value, the semantic loss value and the multi-task joint model obtained by pre-training, the pose of the target object is calculated.
进一步地,所述电子设备1集成的模块/单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。所述计算机可读存储介质可以是易失性的,也可以是非易失性的。具体的,所述计算机可读存储介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-OnlyMemory)。Further, if the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. Specifically, the computer-readable storage medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read Only Memory) -Only Memory).
进一步地,所述计算机可用存储介质可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序等;存储数据区可存储根据区块链节点的使用所创建的数据等。Further, the computer usable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required for at least one function, and the like; using the created data, etc.
在本申请所提供的几个实施例中,应该理解到,所揭露的设备,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。In the several embodiments provided in this application, it should be understood that the disclosed apparatus, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division of the modules is only a logical function division, and there may be other division manners in actual implementation.
所述作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。The modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
另外,在本申请各个实施例中的各功能模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既 可以采用硬件的形式实现,也可以采用硬件加软件功能模块的形式实现。In addition, each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
对于本领域技术人员而言,显然本申请不限于上述示范性实施例的细节,而且在不背离本申请的精神或基本特征的情况下,能够以其他的具体形式实现本申请。It will be apparent to those skilled in the art that the present application is not limited to the details of the above-described exemplary embodiments, but that the present application can be implemented in other specific forms without departing from the spirit or essential characteristics of the present application.
因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本申请的范围由所附权利要求而不是上述说明限定,因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本申请内。不应将权利要求中的任何附关联图表记视为限制所涉及的权利要求。Accordingly, the embodiments are to be regarded in all respects as illustrative and not restrictive, and the scope of the application is to be defined by the appended claims rather than the foregoing description, which is therefore intended to fall within the scope of the claims. All changes within the meaning and scope of the equivalents of , are included in this application. Any accompanying reference signs in the claims should not be construed as limiting the involved claims.
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm. Blockchain, essentially a decentralized database, is a series of data blocks associated with cryptographic methods. Each data block contains a batch of network transaction information to verify its Validity of information (anti-counterfeiting) and generation of the next block. The blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
此外,显然“包括”一词不排除其他单元或步骤,单数不排除复数。系统权利要求中陈述的多个单元或装置也可以由一个单元或装置通过软件或者硬件来实现。第二等词语用来表示名称,而并不表示任何特定的顺序。Furthermore, it is clear that the word "comprising" does not exclude other units or steps and the singular does not exclude the plural. Several units or means recited in the system claims can also be realized by one unit or means by means of software or hardware. Second-class terms are used to denote names and do not denote any particular order.
最后应说明的是,以上实施例仅用以说明本申请的技术方案而非限制,尽管参照较佳实施例对本申请进行了详细说明,本领域的普通技术人员应当理解,可以对本申请的技术方案进行修改或等同替换,而不脱离本申请技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application rather than limitations. Although the present application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present application can be Modifications or equivalent substitutions can be made without departing from the spirit and scope of the technical solutions of the present application.

Claims (20)

  1. 一种物体位姿估计方法,其中,所述方法包括:An object pose estimation method, wherein the method includes:
    利用预设的摄像装置获取目标物体的场景深度图,根据所述场景深度图中的像素点计算所述场景深度图的三维点云;Use a preset camera to obtain a scene depth map of the target object, and calculate a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map;
    利用预构建的深度学习网络提取所述三维点云中的目标点,得到目标物体点集;Extract target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
    根据所述三维点云和所述目标物体点集,计算所述目标物体的可见度损失值;Calculate the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
    对所述目标物体点集进行霍夫投票,得到关键点集,根据所述关键点集计算所述目标物体的关键点损失值;Hough voting is performed on the target object point set to obtain a key point set, and the key point loss value of the target object is calculated according to the key point set;
    对所述场景深度图的像素点进行语义分割,得到所述目标物体的语义损失值;Semantic segmentation is performed on the pixels of the scene depth map to obtain the semantic loss value of the target object;
    根据所述可见度损失值、所述关键点损失值、所述语义损失值以及预先训练得到的多任务联合模型,计算得到所述目标物体的位姿。According to the visibility loss value, the key point loss value, the semantic loss value and the multi-task joint model obtained by pre-training, the pose of the target object is calculated.
  2. 如权利要求1所述的物体位姿估计方法,其中,所述根据所述三维点云和所述目标物体点集,计算所述目标物体的可见度损失值,包括:The object pose estimation method according to claim 1, wherein calculating the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set, comprising:
    根据所述目标物体点集的点数与所述三维点云中包含的所有物体中的最大点集的点数的比值计算所述目标物体的实际可见度;Calculate the actual visibility of the target object according to the ratio of the point number of the target object point set to the point number of the largest point set in all objects included in the three-dimensional point cloud;
    通过所述实际可见度与所述目标物体的预测可见度的差的加权计算得到所述目标物体的可见度损失值。The visibility loss value of the target object is obtained by weighted calculation of the difference between the actual visibility and the predicted visibility of the target object.
  3. 如权利要求1所述的物体位姿估计方法,其中,所述利用预构建的深度学习网络提取所述三维点云中的目标点,得到目标物体点集,包括:The method for estimating object pose and pose according to claim 1, wherein, extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set, comprising:
    利用预构建的深度学习网络中的卷积、池化以及全连接层提取所述三维点云的特征点集;Extract the feature point set of the 3D point cloud by using the convolution, pooling and fully connected layers in the pre-built deep learning network;
    利用所述深度学习网络中的分类器将所述特征点集分类为目标点集和非目标点集,并提取其中的目标点集得到目标物体点集。The feature point set is classified into a target point set and a non-target point set by using the classifier in the deep learning network, and the target point set is extracted to obtain a target object point set.
  4. 如权利要求1所述的物体位姿估计方法,其中,所述对所述目标物体点集进行霍夫投票,得到关键点集,包括:The object pose estimation method according to claim 1, wherein the Hough voting is performed on the target object point set to obtain a key point set, comprising:
    从所述目标物体点集中采样得到采样点集,计算所述采样点集之间的欧式距离偏移,得到偏移量;The sampling point set is obtained by sampling from the target object point set, and the Euclidean distance offset between the sampling point sets is calculated to obtain the offset;
    根据所述偏移量进行投票,将票数超过预设阈值的点的集合作为关键点集。Voting is performed according to the offset, and the set of points whose votes exceed the preset threshold is used as the key point set.
  5. 如权利要求1所述的物体位姿估计方法,其中,所述对所述场景深度图的像素点进行语义分割,得到所述目标物体的语义损失值,包括:The object pose estimation method according to claim 1, wherein the semantic segmentation of the pixels of the scene depth map to obtain the semantic loss value of the target object comprises:
    利用如下公式计算得到所述目标物体的语义损失L sThe semantic loss L s of the target object is obtained by calculating the following formula;
    L_s = -α · (1 - q_i)^γ · log(q_i)
    其中,α表示所述摄像装置的平衡参数,γ表示所述摄像装置的焦点参数,q i代表场景深度图中第i个像素点属于前景点还是背景点的置信度。 Wherein, α represents the balance parameter of the camera, γ represents the focus parameter of the camera, and q i represents the confidence that the ith pixel in the scene depth map belongs to the foreground point or the background point.
  6. 如权利要求1至5中任意一项所述的物体位姿估计方法,其中,所述根据所述可见度损失值、所述关键点损失值、所述语义损失值以及预先训练得到的多任务联合模型,计算得到所述目标物体的位姿,包括:The method for estimating object pose and pose according to any one of claims 1 to 5, wherein the multi-task joint based on the visibility loss value, the keypoint loss value, the semantic loss value and the pre-trained model, and calculate the pose of the target object, including:
    利用下述多任务联合模型计算所述目标物体的最终损失值L mtThe final loss value L mt of the target object is calculated using the following multi-task joint model:
    L_mt = μ_1·L_kps + μ_2·L_s + μ_3·L_v
    where L_kps represents the keypoint loss value, L_s represents the semantic loss value, L_v represents the visibility loss value, and μ_1, μ_2, μ_3 represent the weights obtained after training the multi-task joint model;
    根据所述最终损失值调整所述目标物体的预测旋转矩阵和预测平移矩阵,得到所述目标物体的物姿。Adjust the predicted rotation matrix and predicted translation matrix of the target object according to the final loss value to obtain the pose of the target object.
  7. 如权利要求1至5中任意一项所述的物体位姿估计方法,其中,所述对所述目标点 进行多任务联合训练,得到目标物体的位姿之后,还包括:The object pose estimation method according to any one of claims 1 to 5, wherein the multi-task joint training is performed on the target point, and after obtaining the pose of the target object, the method further includes:
    将所述目标物体的位姿发送给预构建的机械臂,利用所述机械臂执行目标物体的抓取任务。The pose of the target object is sent to a pre-built robotic arm, and the robotic arm is used to perform the grasping task of the target object.
  8. 一种物体位姿估计装置,其中,所述装置包括:An object pose estimation device, wherein the device includes:
    三维点云获取模块,用于利用预设的摄像装置获取目标物体的场景深度图,根据所述场景深度图中的像素点计算所述场景深度图的三维点云;a three-dimensional point cloud acquisition module, configured to obtain a scene depth map of a target object by using a preset camera device, and calculate a three-dimensional point cloud of the scene depth map according to the pixel points in the scene depth map;
    目标物体点集提取模块,用于利用预构建的深度学习网络提取所述三维点云中的目标点,得到目标物体点集;A target object point set extraction module, used for extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
    可见度损失值计算模块,用于根据所述三维点云和所述目标物体点集,计算所述目标物体的可见度损失值;a visibility loss value calculation module, configured to calculate the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
    关键点损失值计算模块,用于对所述目标物体点集进行霍夫投票,得到关键点集,根据所述关键点集计算所述目标物体的关键点损失值;a key point loss value calculation module, configured to perform Hough voting on the target object point set to obtain a key point set, and calculate the key point loss value of the target object according to the key point set;
    语义损失值计算模块,用于对所述场景深度图的像素点进行语义分割,得到所述目标物体的语义损失值;a semantic loss value calculation module, configured to perform semantic segmentation on the pixels of the scene depth map to obtain the semantic loss value of the target object;
    位姿计算模块,用于根据所述可见度损失值、所述关键点损失值、所述语义损失值以及预先训练得到的多任务联合模型,计算得到所述目标物体的位姿。The pose calculation module is configured to calculate the pose of the target object according to the visibility loss value, the key point loss value, the semantic loss value and the multi-task joint model obtained by pre-training.
  9. 一种电子设备,其中,所述电子设备包括:An electronic device, wherein the electronic device comprises:
    至少一个处理器;以及,at least one processor; and,
    与所述至少一个处理器通信连接的存储器;其中,a memory communicatively coupled to the at least one processor; wherein,
    所述存储器存储有可被所述至少一个处理器执行的计算机程序指令,所述计算机程序指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行如下所述的物体位姿估计方法:The memory stores computer program instructions executable by the at least one processor, the computer program instructions being executed by the at least one processor to enable the at least one processor to perform an object pose as described below Estimation method:
    利用预设的摄像装置获取目标物体的场景深度图,根据所述场景深度图中的像素点计算所述场景深度图的三维点云;Use a preset camera to obtain a scene depth map of the target object, and calculate a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map;
    利用预构建的深度学习网络提取所述三维点云中的目标点,得到目标物体点集;Extract target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
    根据所述三维点云和所述目标物体点集,计算所述目标物体的可见度损失值;Calculate the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
    对所述目标物体点集进行霍夫投票,得到关键点集,根据所述关键点集计算所述目标物体的关键点损失值;Hough voting is performed on the target object point set to obtain a key point set, and the key point loss value of the target object is calculated according to the key point set;
    对所述场景深度图的像素点进行语义分割,得到所述目标物体的语义损失值;Semantic segmentation is performed on the pixels of the scene depth map to obtain the semantic loss value of the target object;
    根据所述可见度损失值、所述关键点损失值、所述语义损失值以及预先训练得到的多任务联合模型,计算得到所述目标物体的位姿。According to the visibility loss value, the key point loss value, the semantic loss value and the multi-task joint model obtained by pre-training, the pose of the target object is calculated.
  10. 如权利要求9所述的电子设备,其中,所述根据所述三维点云和所述目标物体点集,计算所述目标物体的可见度损失值,包括:The electronic device according to claim 9, wherein calculating the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set comprises:
    根据所述目标物体点集的点数与所述三维点云中包含的所有物体中的最大点集的点数的比值计算所述目标物体的实际可见度;Calculate the actual visibility of the target object according to the ratio of the point number of the target object point set to the point number of the largest point set in all objects included in the three-dimensional point cloud;
    通过所述实际可见度与所述目标物体的预测可见度的差的加权计算得到所述目标物体的可见度损失值。The visibility loss value of the target object is obtained by weighted calculation of the difference between the actual visibility and the predicted visibility of the target object.
  11. 如权利要求9所述的电子设备,其中,所述利用预构建的深度学习网络提取所述三维点云中的目标点,得到目标物体点集,包括:The electronic device according to claim 9, wherein, extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set, comprising:
    利用预构建的深度学习网络中的卷积、池化以及全连接层提取所述三维点云的特征点集;Extract the feature point set of the 3D point cloud by using the convolution, pooling and fully connected layers in the pre-built deep learning network;
    利用所述深度学习网络中的分类器将所述特征点集分类为目标点集和非目标点集,并提取其中的目标点集得到目标物体点集。The feature point set is classified into a target point set and a non-target point set by using the classifier in the deep learning network, and the target point set is extracted to obtain a target object point set.
  12. 如权利要求9所述的电子设备,其中,所述对所述目标物体点集进行霍夫投票,得到关键点集,包括:The electronic device according to claim 9, wherein, performing Hough voting on the target object point set to obtain a key point set, comprising:
    从所述目标物体点集中采样得到采样点集,计算所述采样点集之间的欧式距离偏移,得到偏移量;The sampling point set is obtained by sampling from the target object point set, and the Euclidean distance offset between the sampling point sets is calculated to obtain the offset;
    根据所述偏移量进行投票,将票数超过预设阈值的点的集合作为关键点集。Voting is performed according to the offset, and the set of points whose votes exceed the preset threshold is used as the key point set.
  13. 如权利要求9所述的电子设备,其中,所述对所述场景深度图的像素点进行语义分割,得到所述目标物体的语义损失值,包括:The electronic device according to claim 9, wherein the semantic segmentation of the pixels of the scene depth map to obtain the semantic loss value of the target object comprises:
    利用如下公式计算得到所述目标物体的语义损失L sThe semantic loss L s of the target object is obtained by calculating the following formula;
    L_s = -α · (1 - q_i)^γ · log(q_i)
    其中,α表示所述摄像装置的平衡参数,γ表示所述摄像装置的焦点参数,q i代表场景深度图中第i个像素点属于前景点还是背景点的置信度。 Wherein, α represents the balance parameter of the camera, γ represents the focus parameter of the camera, and q i represents the confidence that the ith pixel in the scene depth map belongs to the foreground point or the background point.
  14. 如权利要求9至13中任意一项所述的电子设备,其中,所述根据所述可见度损失值、所述关键点损失值、所述语义损失值以及预先训练得到的多任务联合模型,计算得到所述目标物体的位姿,包括:The electronic device according to any one of claims 9 to 13, wherein the calculation is performed according to the visibility loss value, the keypoint loss value, the semantic loss value and a pre-trained multi-task joint model Obtain the pose of the target object, including:
    利用下述多任务联合模型计算所述目标物体的最终损失值L mtThe final loss value L mt of the target object is calculated using the following multi-task joint model:
    L_mt = μ_1·L_kps + μ_2·L_s + μ_3·L_v
    where L_kps represents the keypoint loss value, L_s represents the semantic loss value, L_v represents the visibility loss value, and μ_1, μ_2, μ_3 represent the weights obtained after training the multi-task joint model;
    根据所述最终损失值调整所述目标物体的预测旋转矩阵和预测平移矩阵,得到所述目标物体的物姿。Adjust the predicted rotation matrix and predicted translation matrix of the target object according to the final loss value to obtain the pose of the target object.
  15. 一种计算机可读存储介质,包括存储数据区和存储程序区,存储数据区存储创建的数据,存储程序区存储有计算机程序;其中,所述计算机程序被处理器执行时实现如下所述的物体位姿估计方法:A computer-readable storage medium, comprising a storage data area and a storage program area, the storage data area stores created data, and the storage program area stores a computer program; wherein, when the computer program is executed by a processor, the following objects are realized Pose estimation method:
    利用预设的摄像装置获取目标物体的场景深度图,根据所述场景深度图中的像素点计算所述场景深度图的三维点云;Use a preset camera to obtain a scene depth map of the target object, and calculate a three-dimensional point cloud of the scene depth map according to the pixels in the scene depth map;
    利用预构建的深度学习网络提取所述三维点云中的目标点,得到目标物体点集;Extract target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set;
    根据所述三维点云和所述目标物体点集,计算所述目标物体的可见度损失值;Calculate the visibility loss value of the target object according to the three-dimensional point cloud and the target object point set;
    对所述目标物体点集进行霍夫投票,得到关键点集,根据所述关键点集计算所述目标物体的关键点损失值;Hough voting is performed on the target object point set to obtain a key point set, and the key point loss value of the target object is calculated according to the key point set;
    对所述场景深度图的像素点进行语义分割,得到所述目标物体的语义损失值;Semantic segmentation is performed on the pixels of the scene depth map to obtain the semantic loss value of the target object;
    根据所述可见度损失值、所述关键点损失值、所述语义损失值以及预先训练得到的多任务联合模型,计算得到所述目标物体的位姿。According to the visibility loss value, the key point loss value, the semantic loss value and the multi-task joint model obtained by pre-training, the pose of the target object is calculated.
  16. 如权利要求15所述的计算机可读存储介质,其中,所述根据所述三维点云和所述目标物体点集,计算所述目标物体的可见度损失值,包括:The computer-readable storage medium of claim 15, wherein the calculating a visibility loss value of the target object according to the three-dimensional point cloud and the target object point set comprises:
    根据所述目标物体点集的点数与所述三维点云中包含的所有物体中的最大点集的点数的比值计算所述目标物体的实际可见度;Calculate the actual visibility of the target object according to the ratio of the point number of the target object point set to the point number of the largest point set in all objects included in the three-dimensional point cloud;
    通过所述实际可见度与所述目标物体的预测可见度的差的加权计算得到所述目标物体的可见度损失值。The visibility loss value of the target object is obtained by weighted calculation of the difference between the actual visibility and the predicted visibility of the target object.
  17. 如权利要求15所述的计算机可读存储介质,其中,所述利用预构建的深度学习网络提取所述三维点云中的目标点,得到目标物体点集,包括:The computer-readable storage medium according to claim 15, wherein, extracting target points in the three-dimensional point cloud by using a pre-built deep learning network to obtain a target object point set, comprising:
    利用预构建的深度学习网络中的卷积、池化以及全连接层提取所述三维点云的特征点集;Extract the feature point set of the 3D point cloud by using the convolution, pooling and fully connected layers in the pre-built deep learning network;
    利用所述深度学习网络中的分类器将所述特征点集分类为目标点集和非目标点集,并提取其中的目标点集得到目标物体点集。The feature point set is classified into a target point set and a non-target point set by using the classifier in the deep learning network, and the target point set is extracted to obtain a target object point set.
  18. 如权利要求15所述的计算机可读存储介质,其中,所述对所述目标物体点集进行霍夫投票,得到关键点集,包括:The computer-readable storage medium according to claim 15, wherein the performing Hough voting on the target object point set to obtain a key point set, comprising:
    从所述目标物体点集中采样得到采样点集,计算所述采样点集之间的欧式距离偏移, 得到偏移量;Sampling from the target object point set to obtain a sampling point set, and calculating the Euclidean distance offset between the sampling point sets to obtain an offset;
    根据所述偏移量进行投票,将票数超过预设阈值的点的集合作为关键点集。Voting is performed according to the offset, and the set of points whose votes exceed the preset threshold is used as the key point set.
  19. 如权利要求15所述的计算机可读存储介质,其中,所述对所述场景深度图的像素点进行语义分割,得到所述目标物体的语义损失值,包括:The computer-readable storage medium of claim 15, wherein the semantically segmenting the pixels of the scene depth map to obtain the semantic loss value of the target object comprises:
    利用如下公式计算得到所述目标物体的语义损失L sThe semantic loss L s of the target object is obtained by calculating the following formula;
    L_s = -α · (1 - q_i)^γ · log(q_i)
    其中,α表示所述摄像装置的平衡参数,γ表示所述摄像装置的焦点参数,q i代表场景深度图中第i个像素点属于前景点还是背景点的置信度。 Wherein, α represents the balance parameter of the camera, γ represents the focus parameter of the camera, and q i represents the confidence that the ith pixel in the scene depth map belongs to the foreground point or the background point.
  20. 如权利要求15至19中任意一项所述的计算机可读存储介质,其中,所述根据所述可见度损失值、所述关键点损失值、所述语义损失值以及预先训练得到的多任务联合模型,计算得到所述目标物体的位姿,包括:The computer-readable storage medium according to any one of claims 15 to 19, wherein the multi-task joint based on the visibility loss value, the keypoint loss value, the semantic loss value and pre-trained model, and calculate the pose of the target object, including:
    利用下述多任务联合模型计算所述目标物体的最终损失值L mtThe final loss value L mt of the target object is calculated using the following multi-task joint model:
    L_mt = μ_1·L_kps + μ_2·L_s + μ_3·L_v
    where L_kps represents the keypoint loss value, L_s represents the semantic loss value, L_v represents the visibility loss value, and μ_1, μ_2, μ_3 represent the weights obtained after training the multi-task joint model;
    根据所述最终损失值调整所述目标物体的预测旋转矩阵和预测平移矩阵,得到所述目标物体的物姿。Adjust the predicted rotation matrix and predicted translation matrix of the target object according to the final loss value to obtain the pose of the target object.
PCT/CN2021/083083 2020-12-01 2021-03-25 Object posture estimation method and apparatus, and electronic device and computer storage medium WO2022116423A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011385260.7 2020-12-01
CN202011385260.7A CN112446919B (en) 2020-12-01 2020-12-01 Object pose estimation method and device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
WO2022116423A1 true WO2022116423A1 (en) 2022-06-09

Family

ID=74740242

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083083 WO2022116423A1 (en) 2020-12-01 2021-03-25 Object posture estimation method and apparatus, and electronic device and computer storage medium

Country Status (2)

Country Link
CN (1) CN112446919B (en)
WO (1) WO2022116423A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147488A (en) * 2022-07-06 2022-10-04 湖南大学 Workpiece pose estimation method based on intensive prediction and grasping system
CN115546216A (en) * 2022-12-02 2022-12-30 深圳海星智驾科技有限公司 Tray detection method, device, equipment and storage medium
CN115797565A (en) * 2022-12-20 2023-03-14 北京百度网讯科技有限公司 Three-dimensional reconstruction model training method, three-dimensional reconstruction device and electronic equipment
CN116630394A (en) * 2023-07-25 2023-08-22 山东中科先进技术有限公司 Multi-mode target object attitude estimation method and system based on three-dimensional modeling constraint
CN117226854A (en) * 2023-11-13 2023-12-15 之江实验室 Method and device for executing clamping task, storage medium and electronic equipment

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446919B (en) * 2020-12-01 2024-05-28 平安科技(深圳)有限公司 Object pose estimation method and device, electronic equipment and computer storage medium
CN113012291B (en) * 2021-04-01 2022-11-25 清华大学 Method and device for reconstructing three-dimensional model of object based on manipulator parameters
CN113095205B (en) * 2021-04-07 2022-07-12 北京航空航天大学 Point cloud target detection method based on improved Hough voting
CN113469947B (en) * 2021-06-08 2022-08-05 智洋创新科技股份有限公司 Method for measuring hidden danger and transmission conductor clearance distance suitable for various terrains
CN114399421A (en) * 2021-11-19 2022-04-26 腾讯科技(成都)有限公司 Storage method, device and equipment for three-dimensional model visibility data and storage medium
CN115482279A (en) * 2022-09-01 2022-12-16 北京有竹居网络技术有限公司 Object pose estimation method, device, medium, and apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning
US20170330375A1 (en) * 2015-02-04 2017-11-16 Huawei Technologies Co., Ltd. Data Processing Method and Apparatus
CN108665537A (en) * 2018-05-15 2018-10-16 清华大学 The three-dimensional rebuilding method and system of combined optimization human body figure and display model
CN111160280A (en) * 2019-12-31 2020-05-15 芜湖哈特机器人产业技术研究院有限公司 RGBD camera-based target object identification and positioning method and mobile robot
CN112446919A (en) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Object pose estimation method and device, electronic equipment and computer storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3065100B1 (en) * 2017-04-06 2019-04-12 B<>Com INSTALLATION ESTIMATING METHOD, DEVICE, SYSTEM AND COMPUTER PROGRAM THEREOF
JP2018189510A (en) * 2017-05-08 2018-11-29 株式会社マイクロ・テクニカ Method and device for estimating position and posture of three-dimensional object
CN108961339B (en) * 2018-07-20 2020-10-20 深圳辰视智能科技有限公司 Point cloud object attitude estimation method, device and equipment based on deep learning
CN111489394B (en) * 2020-03-16 2023-04-21 华南理工大学 Object posture estimation model training method, system, device and medium
CN111968129B (en) * 2020-07-15 2023-11-07 上海交通大学 Instant positioning and map construction system and method with semantic perception

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170330375A1 (en) * 2015-02-04 2017-11-16 Huawei Technologies Co., Ltd. Data Processing Method and Apparatus
CN107066935A (en) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 Hand gestures method of estimation and device based on deep learning
CN108665537A (en) * 2018-05-15 2018-10-16 清华大学 The three-dimensional rebuilding method and system of combined optimization human body figure and display model
CN111160280A (en) * 2019-12-31 2020-05-15 芜湖哈特机器人产业技术研究院有限公司 RGBD camera-based target object identification and positioning method and mobile robot
CN112446919A (en) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Object pose estimation method and device, electronic equipment and computer storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147488A (en) * 2022-07-06 2022-10-04 湖南大学 Workpiece pose estimation method based on intensive prediction and grasping system
CN115546216A (en) * 2022-12-02 2022-12-30 深圳海星智驾科技有限公司 Tray detection method, device, equipment and storage medium
CN115546216B (en) * 2022-12-02 2023-03-31 深圳海星智驾科技有限公司 Tray detection method, device, equipment and storage medium
CN115797565A (en) * 2022-12-20 2023-03-14 北京百度网讯科技有限公司 Three-dimensional reconstruction model training method, three-dimensional reconstruction device and electronic equipment
CN115797565B (en) * 2022-12-20 2023-10-27 北京百度网讯科技有限公司 Three-dimensional reconstruction model training method, three-dimensional reconstruction device and electronic equipment
CN116630394A (en) * 2023-07-25 2023-08-22 山东中科先进技术有限公司 Multi-mode target object attitude estimation method and system based on three-dimensional modeling constraint
CN116630394B (en) * 2023-07-25 2023-10-20 山东中科先进技术有限公司 Multi-mode target object attitude estimation method and system based on three-dimensional modeling constraint
CN117226854A (en) * 2023-11-13 2023-12-15 之江实验室 Method and device for executing clamping task, storage medium and electronic equipment
CN117226854B (en) * 2023-11-13 2024-02-02 之江实验室 Method and device for executing clamping task, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN112446919B (en) 2024-05-28
CN112446919A (en) 2021-03-05

Similar Documents

Publication Publication Date Title
WO2022116423A1 (en) Object posture estimation method and apparatus, and electronic device and computer storage medium
JP6745328B2 (en) Method and apparatus for recovering point cloud data
US10832039B2 (en) Facial expression detection method, device and system, facial expression driving method, device and system, and storage medium
CN110363817B (en) Target pose estimation method, electronic device, and medium
WO2020244075A1 (en) Sign language recognition method and apparatus, and computer device and storage medium
CN111723786A (en) Method and device for detecting wearing of safety helmet based on single model prediction
US20220262093A1 (en) Object detection method and system, and non-transitory computer-readable medium
CN110991513A (en) Image target recognition system and method with human-like continuous learning capability
CN112419326B (en) Image segmentation data processing method, device, equipment and storage medium
US20230020965A1 (en) Method and apparatus for updating object recognition model
WO2023083030A1 (en) Posture recognition method and related device
CN110222651A (en) A kind of human face posture detection method, device, terminal device and readable storage medium storing program for executing
CN116778527A (en) Human body model construction method, device, equipment and storage medium
Wang et al. Deep leaning-based ultra-fast stair detection
CN115511779A (en) Image detection method, device, electronic equipment and storage medium
CN112784102B (en) Video retrieval method and device and electronic equipment
Gheitasi et al. Estimation of hand skeletal postures by using deep convolutional neural networks
CN116453222B (en) Target object posture determining method, training device and storage medium
WO2023109086A1 (en) Character recognition method, apparatus and device, and storage medium
CN116309643A (en) Face shielding score determining method, electronic equipment and medium
CN114494857A (en) Indoor target object identification and distance measurement method based on machine vision
CN117036658A (en) Image processing method and related equipment
CN113869218A (en) Face living body detection method and device, electronic equipment and readable storage medium
Zhou et al. Vision sensor‐based SLAM problem for small UAVs in dynamic indoor environments
CN114627535B (en) Coordinate matching method, device, equipment and medium based on binocular camera

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21899467

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21899467

Country of ref document: EP

Kind code of ref document: A1