WO2024101532A1

WO2024101532A1 - Device and method for estimating three-dimensional location of object in real time

Info

Publication number: WO2024101532A1
Application number: PCT/KR2022/020601
Authority: WO
Inventors: 유수정; 이창식
Original assignee: 한국생산기술연구원
Priority date: 2022-11-07
Filing date: 2022-12-16
Publication date: 2024-05-16
Also published as: KR20240065772A

Abstract

The present invention provides a device for estimating the three-dimensional location of an object in real time, the device comprising: an input unit which receives an RGB image and a depth image; an object detection unit which extracts a two-dimensional bounding box, including the object, from the RGB image by using an artificial neural network; an object area segmentation unit which extracts a segmentation mask, corresponding to an object area, from the two-dimensional bounding box by applying explainable artificial intelligence to the two-dimensional bounding box; and an object location estimation unit which estimates the three-dimensional location of the object using the depth image and the segmentation mask.

Description

Apparatus and method for real-time estimation of 3D position of an object

The present invention relates to an apparatus and method for real-time estimation of the 3D position of an object.

The present invention was derived from research conducted as part of the regulation-free special zone innovation business development (R&D) of the Korea Institute for Advancement of Technology.

[Assignment number] 1425157863

[Assignment number] P0016943

[Ministry Name] Ministry of SMEs and Startups

[Project management (professional) organization name] Korea Institute for Advancement of Technology

[Research project name] Regulation-free special zone innovation business development (R&D)

[Research project name] Demonstration of operation of autonomous outdoor robot

[Contribution rate] 1/1

[Name of project carrying out organization] Twiny Co., Ltd.

[Research period] 2022.01.01 ~ 2022.12.31

3D object detection technology is an artificial intelligence technology that implements the human cognitive ability to determine the type of each object visible within the field of view and estimate the 3D location of the object using sensors and computing devices. When humans work, the ability to determine the type of surrounding object and estimate its location is an essential ability for an artificial intelligence system to perform advanced tasks. For example, 3D object detection technology can be used by autonomous robots to determine surrounding static and dynamic obstacles to determine driving policies. Additionally, it can be used to determine the object that the robot arm should pick up and calculate the arm's movement trajectory to the corresponding object.

Likewise, research to enable artificial intelligence systems to have 3D object detection capabilities has been continuously conducted in the field of computer vision for over 30 years. In recent years, since AlexNet demonstrated outstanding performance in the image discrimination problem of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, deep learning theory has emerged as an effective solution to various problems in the field of computer vision. , 3D object detection research published since then is mainly based on methods using deep learning.

One of the approaches commonly used in 3D object detection technology research is a method that utilizes RGB-D images and deep learning theory. Specifically, the method acquires visual characteristic information of nearby objects from RGB images, acquires 3D location information for each pixel in the image from depth images, and determines the type of nearby objects through deep learning theory. Estimate 3D position.

3D object detection approaches using RGB-D images can be broadly classified into three types. The first is a method that fuses 2D object detection results for RGB images and depth images, and the second is a 2D instance for RGB images. It is a method that fuses the instance segmentation results and depth images, and the third is an end-to-end learning method that obtains 3D object detection results by inputting RGB images and depth images together into an artificial neural network.

First, the method of fusing the 2D object detection results for the RGB image and the depth image consists of two steps. The first step is to obtain 2D object detection results. By inputting RGB images into a deep neural network, the type of each object in the image is determined, and a 2D bounding box expresses the location and size of each object. Acquire (bounding box). The second step is to estimate the 3D position of each object. The depth image value within the 2D bounding box obtained for each object is extracted and the 3 pixels within the 2D bounding box are calculated using the camera projection matrix. Obtain dimensional real world coordinates. The 3D position coordinates of the pixels obtained in this way are filtered to estimate the 3D position of the object. To apply this method, a deep artificial neural network that receives RGB images as input and outputs 2D object detection results must be trained, and the time required for labeling the learning dataset is approximately 38.1 seconds per image.

This 2D object detection method requires the least amount of calculation on average compared to other approaches described later. However, when performing 3D position estimation, estimation noise occurs due to objects or background objects blocking the object within the 2D bounding box and due to the diversity of object poses.

Next, the method of fusing the 2D instance segmentation results for the RGB image and the depth image consists of two steps. The first step is to obtain a 2D instance segmentation result. By inputting the RGB image into a deep artificial neural network, the type of each object in the image is determined, and a 2D instance mask representing the pixel area occupied by each object in the image is obtained. do. The second step is to estimate the 3D position of each object. The depth image value in the 2D instance mask obtained for each object is extracted and the 3D real world coordinates of the pixels in the 2D instance mask are calculated using the camera projection matrix. Acquire. The 3D position coordinates of the pixels obtained in this way are filtered to estimate the 3D position of the object.

In order to apply this method, a deep artificial neural network that receives RGB images as input and outputs 2D instance segmentation results must be trained, and the time required for labeling the learning dataset is approximately 239.7 seconds per image. Unlike the 2D object detection method, this 2D instance segmentation method does not require a noise filtering procedure. However, on average, it requires more calculations than the 2D object detection method.

Lastly, the end-to-end learning method of inputting the RGB image and the depth image together into the artificial neural network consists of one step. The RGB image and the depth image are input together into the deep artificial neural network to determine the type of each object in the image and create a 3D Obtain location. In order to apply this method, an artificial neural network that receives RGB images and depth images and outputs 3D object detection results must be trained, and the time required for labeling the learning dataset is approximately 714.4 seconds per image.

Unlike 2D object detection methods or 2D instance segmentation methods, this end-to-end learning method does not require noise filtering procedures and does not require separate calculation of real-world coordinates. However, on average, it requires more computation than the 2D object detection method and the 2D instance segmentation method.

The present invention was created to solve the above problems, and by applying explainable artificial intelligence techniques to the process of estimating the 3D position of an object, it is possible to accurately estimate the 3D position of an object at a low cost of building a dataset. The purpose is to provide a device and method for real-time estimation of the 3D position of an object.

The technical problems to be achieved in the present invention are not limited to the technical problems mentioned above, and other technical problems not mentioned can be clearly understood by those skilled in the art from the description below. There will be.

To this end, the present invention describes an input unit that receives an RGB image and a depth image, an object detection unit that extracts a two-dimensional bounding box containing an object from the RGB image using an artificial neural network, and the two-dimensional bounding box. An object area segmentation unit that extracts a segmentation mask corresponding to the object area from the two-dimensional bounding box by applying possible artificial intelligence, and an object location that estimates the three-dimensional position of the object using the depth image and the segmentation mask. A real-time 3D position estimation device for an object including an estimation unit is provided.

Here, the object area segmentation unit may extract the segmentation mask based on the degree to which the artificial neural network refers to each pixel of the RGB image when extracting the two-dimensional bounding box.

In addition, the object area segmentation unit expresses the degree to which the artificial neural network refers to each pixel of the RGB image in extracting the two-dimensional bounding box as a heat map, and extracts the segmentation mask based on the heat map. You can.

In addition, the object area segmentation unit calculates a heat map score by scoring the degree to which the artificial neural network refers to each pixel of the RGB image in extracting the two-dimensional bounding box, and compares the heat map score with a threshold. The segmentation mask can be extracted.

Additionally, the object area segmentation unit may extract the segmentation mask by selecting pixels whose heat map score is equal to or greater than a threshold.

Additionally, the object area segmentation unit may extract the segmentation mask by filtering pixels whose heat map score is less than a threshold.

Additionally, the object location estimation unit may extract a depth image value corresponding to each pixel in the segmentation mask.

Additionally, the object position estimation unit may extract the 3D position coordinates of each pixel using the image coordinates of each pixel in the segmentation mask and the camera projection matrix.

In addition, the present invention includes the steps of receiving an RGB image and a depth image, extracting a two-dimensional bounding box containing an object from the RGB image using an artificial neural network, and an artificial neural network that can explain the two-dimensional bounding box. 3 of the object, including applying intelligence to extract a segmentation mask corresponding to the object area from the two-dimensional bounding box, and estimating a three-dimensional position of the object using the depth image and the segmentation mask. Provides a real-time estimation method for dimensional position.

According to the present invention, the accuracy of object 3D position estimation technology using depth image values can be improved by applying an explainable artificial intelligence technique to the RGB image 2D object detection results to obtain an object area segmentation mask for each object. .

In addition, when developing a 3D object detection algorithm for a new environment where a learning dataset has not been built, such as the present invention, the instance segmentation method or end-to-end method applied in the method of using existing 2D object detection technology and depth image fusion Compared to learning methods, less cost is consumed in building the dataset. In other words, the present invention can estimate the 3D position of an object more accurately at a lower dataset construction cost than existing methods.

In addition, since the 2D object detection algorithm of the present invention requires a relatively small amount of calculation, it can be used to estimate the 3D position of a more accurate real-time RGB-D image object even in artificial intelligence systems equipped with low-power computing devices such as AR devices, drones, and mobile robots. becomes available.

The effects that can be obtained from the present invention are not limited to the effects mentioned above, and other effects not mentioned can be clearly understood by those skilled in the art from the description below. will be.

Figure 1 is a flowchart of a method for estimating the location of a 3D object by fusing a conventional 2D object detection result and a depth image.

Figure 2 is a block diagram showing a real-time 3D location estimation device for an object detected in an RGB-D image using an explainable artificial intelligence technique according to an embodiment of the present invention.

Figure 3 is a flowchart of a method for real-time estimation of the 3D position of an object detected in an RGB-D image using an explainable artificial intelligence technique according to an embodiment of the present invention.

Figure 4 is a flowchart of a method for extracting a segmentation mask in Figure 3.

Terms or words used in this specification and claims should not be construed as limited to their common or dictionary meanings, and the inventor may appropriately define the concept of terms in order to explain his or her invention in the best way. It must be interpreted with meaning and concept consistent with the technical idea of the present invention based on the principle that it is.

Therefore, the embodiments described in this specification and the configurations shown in the drawings are only the most preferred embodiments of the present invention and do not represent the entire technical idea of the present invention, and therefore various equivalents and It should be understood that variations may exist.

Referring to FIG. 1, a conventional method for estimating the position of a 3D object first detects a 2D bounding box containing the object in an RGB image (S10). Next, a 2D bounding box is applied to the depth image to extract the depth image value within the 2D bounding box (S20). Next, the real-world coordinates of pixels within the 2D bounding box are acquired (S30), and the 3D location of the object is estimated based on the obtained real-world coordinates (S40).

Specifically, 2D object detection technology uses an artificial neural network to determine the type (class) of objects existing in an input image, and determines the center pixel (C _u , C _v ) and width for each object. Express the location and size of the object by estimating a 2D bounding box with w and height h.

Here, the 2D object detection artificial neural network is

is input and the result of bounding box regression for each of the N objects existing in the image is

and the estimated value σ _j (q _i ) of the probability that the corresponding object belongs to each of the C object types (classes), and the estimated object type (class) y _i of the corresponding object are output. Here, i=0, 1, 2,..., N-1 are integers representing the order of objects, and j=0, 1, 2,..., C-1 are integers representing the order of classes. And the activation function σ(·) is a softmax function,

is the logit value of the probability that the ith object belongs to each of the C classes.

The process of estimating the 3D location of each object by fusing the 2D object detection result and the depth image is performed through camera inverse projection. Specifically, RGB images

, depth image

, and the camera projection matrix

Given, the coordinates of random pixels contained in the 2D bounding box of the ith object in the image.

The real world coordinates of

is calculated through Equation 1 below.

[Equation 1]

mistake here

is the depth value measured at the pixel coordinates (u _i , v _i ) of the depth image D, and matrix K ^-1 is the inverse matrix of the camera projection matrix K.

Equation 1 above is calculated as the coordinates of all pixels included in the two-dimensional bounding box of the ith object.

By applying it to a point cloud, which is a set of real world coordinates as shown in Figure 1

obtain. The real-world 3D position estimate of the ith object is calculated by calculating an estimator such as the average value, weighted average value, or median value from the point cloud obtained in this way.

obtain.

By performing the above calculation process on all bounding boxes resulting from 2D object detection in image I, real-world 3D position estimates P _i , i=0, 1, 2,… of N objects are obtained. , obtain N-1.

A limitation of the method of estimating the 3D position of an object by combining the above-described conventional 2D object detection and depth image is that components other than the object may be included in the 2D bounding box. As shown in Figure 1, the point cloud calculated for all pixels included in the two-dimensional bounding box includes not only the object but also components other than the object, such as occlusion and background. Therefore, in the process of calculating an estimator to obtain the 3D position of an object, there is a problem that components other than the object included in the point cloud generate noise in the position estimate value.

Figure 2 is a block diagram showing a real-time 3D location estimation device for an object detected in an RGB-D image using an explainable artificial intelligence technique according to an embodiment of the present invention, and Figure 3 is an explainable device according to an embodiment of the present invention. This is a flowchart of a method for real-time estimation of the 3D position of an object detected in an RGB-D image using artificial intelligence techniques, and FIG. 4 is a flowchart of a method for extracting a segmentation mask in FIG. 3.

Referring to FIG. 2, the real-time 3D location estimation system 100 of an object detected in an RGB-D image using an explainable artificial intelligence technique includes an input unit 110, a memory 120, an object detection unit 130, and an object area. It may include a division unit 140 and an object location estimation unit 150.

Referring to FIGS. 2 and 3 , the input unit 110 can receive sensor measurement values for an object for which the object is to be detected (S110). For example, it can be configured to receive RGB images and depth images captured by an RGB-D camera in real time. However, the input unit 110 is not limited to the RGB-D camera, and the present invention uses RGB images and 3D LiDAR image inputs, or RGB images and estimated depth images obtained by applying a deep artificial neural network to RGB images. It can also receive input and operate.

The memory 120 may be configured to store parameters necessary for driving the object detection unit 130, the object area dividing unit 140, and the object location estimation unit 150.

Specifically, the memory 120 may store a bounding box reliability threshold and a class estimation probability threshold, which are parameters used in the object detection unit 130, and a heat map score threshold, which are parameters used in the object area segmentation unit 140. The number of activation layers used when constructing a heat map can be stored, and the minimum and maximum values of depth image values used by the object 3D position estimation unit 150, camera projection matrix, etc. can be stored.

The object detector 130 may extract a two-dimensional bounding box containing an object from an RGB image using an artificial neural network (S120).

Specifically, the object detection unit 130 can apply a pre-trained 2D object detection artificial neural network to the RGB image received from the input unit 110 to obtain a 2D bounding box and object type determination result for each object in the image. there is. At this time, the 2D object detection artificial neural network can output the logit value of the class estimation probability for each object in the image along with the 2D object detection result.

The object area division unit 140 can extract a segmentation mask corresponding to the object area from the 2D bounding box by applying explainable artificial intelligence (XAI) to the 2D bounding box (S130).

Here, explainable artificial intelligence (XAI) is a method of expressing the basis for judgment of an artificial neural network, a black box structure that is difficult to interpret the judgment process, in a way that humans can understand.

The heat map representation method, which is a mainly studied method among explainable artificial intelligence technologies, is a method in which the artificial neural network expresses the mainly referenced part of the input data in order to calculate the result as a heat map with the same dimension as the input data. .

Here, heat map expression methods can be broadly classified into three types: the first is a perturbation based method, the second is a gradient based method, and the third is a class activation map.

First, the perturbation-based method is a method of partially changing the input data in various ways to find the input data change pattern that causes the greatest change in the output value. Compared to other approaches described later, it is possible to obtain a clearly interpretable heat map. However, to obtain a heat map, artificial neural network calculations must be performed multiple times on the same input data, and as the dimension and resolution of the input data increases, the amount of calculation increases. need.

Next, the gradient-based method calculates the gradient of the output value with respect to the input data and obtains the change pattern of the input value to increase the class estimation score of the artificial neural network most quickly. Since the gradient can be calculated using backpropagation after performing the artificial neural network operation once, the amount of calculation required is less than that of the perturbation-based method, but there is a lot of noise in the heat map due to the gradient shattering phenomenon.

Lastly, the class activation map is a method of expressing the portion that contributes to the class estimation score, the final output value of the artificial neural network, as a heat map in the activation map, which is the output value of each layer of the artificial neural network. This method has less noise than gradient-based methods and requires less computation than perturbation-based methods, but the heat maps generally have lower resolution.

Explainable artificial intelligence technologies that express heat maps are also used to solve problems other than artificial intelligence explanation. For example, perturbation-based methods are sometimes applied to research to develop an automatic labeling system to build a 2D instance segmentation learning dataset using an already trained 2D object detection artificial neural network, and the class activation map is used in the input image. When using an artificial neural network that determines the presence or absence of a specific object (ex: determining the presence or absence of a road crack, determining the presence or absence of product damage), it is also applied as a segmentation technique to find the location of the object in the input image.

The object area segmentation unit 140 may extract a segmentation mask based on the degree to which the artificial neural network references each pixel of the RGB image when extracting a 2D bounding box.

Specifically, the object area segmentation unit 140 can express the degree to which the artificial neural network refers to each pixel of the RGB image in extracting the two-dimensional bounding box as a heat map, and extract a segmentation mask based on the heat map. .

Here, the object area segmentation unit 140 calculates a heat map score by scoring the degree of reference for each pixel of the RGB image, and extracts a segmentation mask by comparing the heat map score with a threshold.

Specifically, the object area segmentation unit 140 may extract a segmentation mask by selecting pixels whose heat map score is equal to or greater than a threshold, or extract a segmentation mask by filtering pixels whose heat map score is less than the threshold.

For example, referring to FIG. 4, the object area segmentation unit 140 divides the two-dimensional bounding box output value and the class discrimination score output value for each object in the image acquired by the object detection unit 130 and the output value received from the input unit 110. By calculating the guided gradient, which is one of the artificial intelligence techniques that can be explained using RGB images, a heat map showing the correlation between the RGB input image and the 2D object detection result can be obtained (S131).

Then, the object region segmentation unit 140 may obtain an object region segmentation mask by filtering the obtained correlation heat map based on the threshold value stored in the memory 120 (S132). In the embodiment of the present invention, the derived gradient, which is one of the gradient-based methods, is specified, but among the explainable artificial intelligence heat map representation techniques excluding the perturbation-based method, which cannot obtain results in real time, a gradient-based method such as SmoothGrad or Grad- Class activation maps such as CAM and LayerCAM can be applied.

Explainable AI (XAI), an explainable artificial intelligence technique that expresses a heat map, expresses the part referenced in the input data by an artificial neural network as a heat map to calculate the discriminant score of a specific class.

Specifically, the

Receives input and estimates the probability that the image belongs to each of the C classes.

Assuming that the artificial neural network outputs the probability estimate σ _j (q) for a specific class j, the part referenced by the artificial neural network in the input image is used as a heat map.

is calculated and expressed.

Among several heat map expression techniques, the gradient based method uses the gradient of the input of the output value to express a heat map for artificial intelligence explanation. Specifically, the logit value of the artificial neural network estimated probability for a specific class j,

The slope for the input image I of

By calculating , a heat map H _j of the change pattern of the input image I is obtained so that the logit value q _j of the probability for class j increases the fastest. However, the gradient-based heat map H _j obtained in this way contains many noise components that are difficult for humans to interpret because both negative and positive gradients are considered in the backpropagation process to calculate the gradient. Guided backpropagation, which was proposed to compensate for this, considers only the positive gradient during the backpropagation process and expresses a heat map that can be interpreted by humans. At this time, the guided gradient heat map ψ _j calculated through guided backpropagation is calculated using Equation 2 below.

[Equation 2]

Here z _k , k=0, 1, 2,… , L-1 is the output value of the kth hidden layer in the artificial neural network, and the activation function Φ(·) is a ReLU (Rectified Linear Unit) function.

In summary, when the class discrimination artificial neural network calculates the estimated probability for the jth class by calculating the gradient derived through Equation 2 above, the input image

A heat map representing what visual features are referenced within

can be obtained.

As described above, the existing method of estimating the 3D location of an object by fusing the 2D object detection result and the depth image has the limitation that components other than the object are included in the object's bounding box. Previously, to avoid these problems, 2D instance segmentation or end-to-end learning methods were applied rather than object detection when solving 3D location estimation problems. However, 2D instance segmentation and end-to-end learning methods require more costs than 2D object detection technology in terms of dataset construction costs and the amount of computation of the algorithm itself. Therefore, it is necessary to research a method to more accurately estimate the 3D position of an object only by detecting 2D objects. For this purpose, the induced gradient XAI technique described above can be applied.

Specifically, the input image

2D object detection results for 2D bounding box estimates for N objects (C _u,i , C _v,i , w _i , h _i ) and probability estimates for each of a total of C object types.

, and the object type estimates yi, i=0, 1, 2,… , N-1, j=0, 1, 2,… , the derived gradient heat map of the ith object for the estimated class yi of the object, given C-1.

Can be calculated as shown in Equation 3 below.

[Equation 3]

gather here

is the set of pixel coordinates contained in the 2-dimensional bounding box (C _u,i , C _v,i , w _i , h _i ) of the ith object, and I({u _i , v _i }) is the 2-dimensional bounding box This is the pixel value of image I corresponding to my pixel coordinates. And q _i,yi is the logit value of the probability that the i th object belongs to the y _i th class estimated by the artificial neural network.

By performing the above calculation process on all bounding boxes resulting from 2D object detection in image I, the derived gradient heat map ψ _yi , i=0, 1, 2,… for N objects is generated. , obtain N-1.

Finally, a value higher than the threshold τ is selected from the derived gradient heat map ψ _yi of each object to obtain an object area segmentation mask that displays the pixel area of the object.

The object location estimation unit 150 may estimate the 3D location of the object using a depth image and a segmentation mask.

Specifically, the object location estimation unit 150 extracts the depth image value corresponding to each pixel in the object area segmentation mask for each object acquired by the object area segmentation unit 140, and the image coordinates and camera projection of the corresponding pixel. Using a matrix, you can obtain the 3D real-world coordinates of each pixel.

In addition, the object location estimation unit 150 can calculate the average value of the acquired real-world point cloud and finally obtain the 3D location coordinates of the object.

In an embodiment of the present invention, the average value of a point cloud is calculated to estimate the final 3D position of an object, but other types of estimators such as a weighted average or median may be applied.

As such, the present invention applies an explainable artificial intelligence technique to the RGB image 2D object detection results to obtain an object area segmentation mask for each object, thereby improving the accuracy of object 3D location estimation technology using depth image values. You can.

The above detailed description is illustrative of the present invention. In addition, the foregoing merely shows and describes preferred embodiments of the present invention, and the present invention can be used in various other combinations, modifications, and environments. In other words, changes or modifications can be made within the scope of the inventive concept disclosed in this specification, within the scope equivalent to the written disclosure, and/or within the scope of skill or knowledge in the art.

The above-described embodiments are intended to explain the best state in carrying out the present invention, and are implemented in other states known in the art when using other inventions such as the present invention, and are required in the specific application field and use of the invention. Various changes are also possible. Accordingly, the detailed description of the invention above is not intended to limit the invention to the disclosed embodiments. Additionally, the appended claims should be construed to include other embodiments as well.

The real-time 3D location estimation device according to the present invention can be used in various fields such as artificial intelligence technology such as AR devices, drones, and robots.

Claims

An input unit that receives RGB images and depth images;

an object detection unit that extracts a two-dimensional bounding box containing an object from the RGB image using an artificial neural network;

an object area segmentation unit that extracts a segmentation mask corresponding to the object area from the two-dimensional bounding box by applying explainable artificial intelligence to the two-dimensional bounding box; and

An object position estimation unit that estimates the 3D position of the object using the depth image and the segmentation mask.

A real-time 3D location estimation device for an object including a device.
According to claim 1,

The object area division unit

The artificial neural network extracts the segmentation mask based on the degree to which each pixel of the RGB image is referenced in extracting the two-dimensional bounding box.

A device for real-time estimation of the 3D position of an object.
According to claim 1,

The object area division unit

The degree to which the artificial neural network refers to each pixel of the RGB image in extracting the two-dimensional bounding box is expressed as a heat map, and the segmentation mask is extracted based on the heat map.

A device for real-time estimation of the 3D position of an object.
According to claim 1,

The object area division unit

In extracting the two-dimensional bounding box, the artificial neural network scores the degree of reference to each pixel of the RGB image to calculate a heat map score, and compares the heat map score with a threshold to extract the segmentation mask.

A device for real-time estimation of the 3D position of an object.
According to claim 4,

The object area division unit

Selecting pixels whose heat map score is greater than or equal to a threshold value to extract the segmentation mask.

A device for real-time estimation of the 3D position of an object.
According to claim 4,

The object area division unit

Extracting the segmentation mask by filtering pixels whose heat map score is less than a threshold

A device for real-time estimation of the 3D position of an object.
According to claim 1,

The object location estimation unit

Extracting the depth image value corresponding to each pixel in the segmentation mask

A device for real-time estimation of the 3D position of an object.
According to claim 1,

The object location estimation unit

Extracting the 3D position coordinates of each pixel using the image coordinates and camera projection matrix of each pixel in the segmentation mask.

A device for real-time estimation of the 3D position of an object.
Step of receiving RGB image and depth image;

extracting a two-dimensional bounding box containing an object from the RGB image using an artificial neural network;

extracting a segmentation mask corresponding to the object area from the two-dimensional bounding box by applying explainable artificial intelligence to the two-dimensional bounding box; and

Estimating the 3D location of the object using the depth image and the segmentation mask.

A method for real-time estimation of the 3D position of an object including.