WO2024101532A1 - Device and method for estimating three-dimensional location of object in real time - Google Patents

Device and method for estimating three-dimensional location of object in real time Download PDF

Info

Publication number
WO2024101532A1
WO2024101532A1 PCT/KR2022/020601 KR2022020601W
Authority
WO
WIPO (PCT)
Prior art keywords
bounding box
image
real
segmentation mask
neural network
Prior art date
Application number
PCT/KR2022/020601
Other languages
French (fr)
Korean (ko)
Inventor
유수정
이창식
Original Assignee
한국생산기술연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국생산기술연구원 filed Critical 한국생산기술연구원
Publication of WO2024101532A1 publication Critical patent/WO2024101532A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration using local operators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/40Image enhancement or restoration using histogram techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00Indexing scheme for image generation or computer graphics
    • G06T2210/12Bounding box

Definitions

  • the present invention relates to an apparatus and method for real-time estimation of the 3D position of an object.
  • the present invention was derived from research conducted as part of the regulation-free special zone innovation business development (R&D) of the Korea Institute for Advancement of Technology.
  • 3D object detection technology is an artificial intelligence technology that implements the human cognitive ability to determine the type of each object visible within the field of view and estimate the 3D location of the object using sensors and computing devices.
  • the ability to determine the types of surrounding objects and estimate their locations is essential for an artificial intelligence system to perform advanced tasks.
  • for example, 3D object detection technology can be used by autonomous driving robots to identify surrounding static and dynamic obstacles when deciding driving policies. Additionally, it can be used to determine the object that a robot arm should pick up and to calculate the arm's movement trajectory to that object.
  • One of the approaches commonly used in 3D object detection research is a method that utilizes RGB-D images and deep learning theory. Specifically, the method acquires visual feature information about nearby objects from the RGB image, acquires 3D location information for each pixel in the image from the depth image, and uses deep learning to determine the type of each nearby object and estimate its 3D position.
  • 3D object detection approaches using RGB-D images can be broadly classified into three types.
  • the first is a method that fuses 2D object detection results for the RGB image with the depth image.
  • the second is a method that fuses 2D instance segmentation results for the RGB image with the depth image.
  • the third is an end-to-end learning method that obtains 3D object detection results by inputting the RGB image and the depth image together into an artificial neural network.
  • the method of fusing the 2D object detection results for the RGB image and the depth image consists of two steps.
  • the first step is to obtain 2D object detection results.
  • by inputting the RGB image into a deep neural network, the type of each object in the image is determined and a 2D bounding box expressing the location and size of each object is obtained.
  • the second step is to estimate the 3D position of each object.
  • the depth image values within the 2D bounding box obtained for each object are extracted, and the 3D real-world coordinates of the pixels within the 2D bounding box are calculated using the camera projection matrix.
  • the 3D position coordinates of the pixels obtained in this way are filtered to estimate the 3D position of the object.
  • a deep artificial neural network that receives RGB images as input and outputs 2D object detection results must be trained, and the time required for labeling the learning dataset is approximately 38.1 seconds per image.
  • This 2D object detection method requires the least amount of calculation on average compared to other approaches described later.
  • however, when estimating the 3D position, estimation noise occurs due to occluding objects or background objects within the 2D bounding box and due to the diversity of object poses.
  • the method of fusing the 2D instance segmentation results for the RGB image and the depth image consists of two steps.
  • the first step is to obtain a 2D instance segmentation result.
  • by inputting the RGB image into a deep artificial neural network, the type of each object in the image is determined, and a 2D instance mask representing the pixel area occupied by each object in the image is obtained.
  • the second step is to estimate the 3D position of each object.
  • the depth image values within the 2D instance mask obtained for each object are extracted, and the 3D real-world coordinates of the pixels within the 2D instance mask are calculated using the camera projection matrix.
  • the 3D position coordinates of the pixels obtained in this way are filtered to estimate the 3D position of the object.
  • this 2D instance segmentation method does not require a noise filtering procedure. However, on average, it requires more calculations than the 2D object detection method.
  • the end-to-end learning method of inputting the RGB image and the depth image together into the artificial neural network consists of one step.
  • the RGB image and the depth image are input together into a deep artificial neural network to determine the type of each object in the image and obtain its 3D location.
  • an artificial neural network that receives RGB images and depth images and outputs 3D object detection results must be trained, and the time required for labeling the learning dataset is approximately 714.4 seconds per image.
  • this end-to-end learning method does not require noise filtering procedures and does not require separate calculation of real-world coordinates. However, on average, it requires more computation than the 2D object detection method and the 2D instance segmentation method.
  • the present invention was devised to solve the above problems, and its purpose is to provide a device and method for real-time estimation of the 3D position of an object that, by applying an explainable artificial intelligence technique to the 3D position estimation process, can accurately estimate the 3D position of an object at a low dataset construction cost.
  • to this end, the present invention provides a real-time 3D position estimation device for an object, including an input unit that receives an RGB image and a depth image, an object detection unit that extracts a two-dimensional bounding box containing an object from the RGB image using an artificial neural network, an object area segmentation unit that extracts a segmentation mask corresponding to the object area from the two-dimensional bounding box by applying explainable artificial intelligence to the two-dimensional bounding box, and an object location estimation unit that estimates the three-dimensional position of the object using the depth image and the segmentation mask.
  • the object area segmentation unit may extract the segmentation mask based on the degree to which the artificial neural network refers to each pixel of the RGB image when extracting the two-dimensional bounding box.
  • the object area segmentation unit may express the degree to which the artificial neural network refers to each pixel of the RGB image in extracting the two-dimensional bounding box as a heat map, and may extract the segmentation mask based on the heat map.
  • the object area segmentation unit may calculate a heat map score by scoring the degree to which the artificial neural network refers to each pixel of the RGB image in extracting the two-dimensional bounding box, and may extract the segmentation mask by comparing the heat map score with a threshold.
  • the object area segmentation unit may extract the segmentation mask by selecting pixels whose heat map score is equal to or greater than a threshold.
  • the object area segmentation unit may extract the segmentation mask by filtering pixels whose heat map score is less than a threshold.
  • the object location estimation unit may extract a depth image value corresponding to each pixel in the segmentation mask.
  • the object position estimation unit may extract the 3D position coordinates of each pixel using the image coordinates of each pixel in the segmentation mask and the camera projection matrix.
  • the present invention also provides a real-time 3D position estimation method for an object, including the steps of receiving an RGB image and a depth image, extracting a two-dimensional bounding box containing an object from the RGB image using an artificial neural network, applying explainable artificial intelligence to the two-dimensional bounding box to extract a segmentation mask corresponding to the object area from the two-dimensional bounding box, and estimating the three-dimensional position of the object using the depth image and the segmentation mask.
  • the accuracy of object 3D position estimation technology using depth image values can be improved by applying an explainable artificial intelligence technique to the RGB image 2D object detection results to obtain an object area segmentation mask for each object.
  • in addition, when developing a 3D object detection algorithm for a new environment for which no training dataset has been built, the present invention requires less dataset construction cost than the instance segmentation method or the end-to-end learning method applied in existing depth-image fusion approaches. In other words, the present invention can estimate the 3D position of an object more accurately at a lower dataset construction cost than existing methods.
  • furthermore, since the 2D object detection algorithm of the present invention requires a relatively small amount of computation, more accurate real-time 3D position estimation of objects in RGB-D images becomes available even in artificial intelligence systems equipped with low-power computing devices such as AR devices, drones, and mobile robots.
  • Figure 1 is a flowchart of a method for estimating the location of a 3D object by fusing a conventional 2D object detection result and a depth image.
  • Figure 2 is a block diagram showing a real-time 3D location estimation device for an object detected in an RGB-D image using an explainable artificial intelligence technique according to an embodiment of the present invention.
  • Figure 3 is a flowchart of a method for real-time estimation of the 3D position of an object detected in an RGB-D image using an explainable artificial intelligence technique according to an embodiment of the present invention.
  • Figure 4 is a flowchart of a method for extracting a segmentation mask in Figure 3.
  • Figure 1 is a flowchart of a method for estimating the location of a 3D object by fusing a conventional 2D object detection result and a depth image.
  • a conventional method for estimating the position of a 3D object first detects a 2D bounding box containing the object in an RGB image (S10). Next, a 2D bounding box is applied to the depth image to extract the depth image value within the 2D bounding box (S20). Next, the real-world coordinates of pixels within the 2D bounding box are acquired (S30), and the 3D location of the object is estimated based on the obtained real-world coordinates (S40).
  • 2D object detection technology uses an artificial neural network to determine the type (class) of the objects present in an input image, and expresses the location and size of each object by estimating a 2D bounding box with center pixel (C_u, C_v), width w, and height h.
  • the 2D object detection artificial neural network receives an image I as input and, for each of the N objects present in the image, outputs the bounding box regression result (C_u,i, C_v,i, w_i, h_i), the estimated value σ_j(q_i) of the probability that the object belongs to each of the C object types (classes), and the estimated object type (class) y_i of the object.
  • the activation function σ(·) is a softmax function, and q_i is the vector of logit values of the probabilities that the i-th object belongs to each of the C classes.
  • the process of estimating the 3D location of each object by fusing the 2D object detection result and the depth image is performed through camera inverse projection. Specifically, given the RGB image I, the depth image D, and the camera projection matrix K, the real-world coordinates (X, Y, Z) of an arbitrary pixel coordinate (u_i, v_i) contained in the 2D bounding box of the i-th object in the image are calculated through Equation 1: [X, Y, Z]^T = d(u_i, v_i) · K^{-1} · [u_i, v_i, 1]^T.
  • here, d(u_i, v_i) is the depth value measured at the pixel coordinates (u_i, v_i) of the depth image D, and the matrix K^{-1} is the inverse matrix of the camera projection matrix K.
  • by applying Equation 1 above to the coordinates of all pixels included in the two-dimensional bounding box of the i-th object, a point cloud, which is a set of real-world coordinates, is obtained as shown in Figure 1.
  • the real-world 3D position estimate of the i-th object is then obtained by calculating an estimator such as the average value, weighted average value, or median value from the point cloud obtained in this way.
  • a limitation of the method of estimating the 3D position of an object by combining the above-described conventional 2D object detection and depth image is that components other than the object may be included in the 2D bounding box.
  • the point cloud calculated for all pixels included in the two-dimensional bounding box includes not only the object but also components other than the object, such as occlusion and background. Therefore, in the process of calculating an estimator to obtain the 3D position of an object, there is a problem that components other than the object included in the point cloud generate noise in the position estimate value.
  • Figure 2 is a block diagram showing a real-time 3D location estimation device for an object detected in an RGB-D image using an explainable artificial intelligence technique according to an embodiment of the present invention.
  • Figure 3 is a flowchart of a method for real-time estimation of the 3D position of an object detected in an RGB-D image using an explainable artificial intelligence technique according to an embodiment of the present invention.
  • Figure 4 is a flowchart of the method for extracting a segmentation mask in Figure 3.
  • the real-time 3D location estimation system 100 for an object detected in an RGB-D image using an explainable artificial intelligence technique may include an input unit 110, a memory 120, an object detection unit 130, an object area segmentation unit 140, and an object location estimation unit 150.
  • the input unit 110 may receive sensor measurement values for the target from which objects are to be detected (S110).
  • for example, it may be configured to receive, in real time, an RGB image and a depth image captured by an RGB-D camera.
  • however, the input unit 110 is not limited to an RGB-D camera; the present invention can also operate by receiving an RGB image together with a 3D LiDAR measurement, or an RGB image together with an estimated depth image obtained by applying a deep artificial neural network to the RGB image.
  • the memory 120 may be configured to store the parameters needed to operate the object detection unit 130, the object area segmentation unit 140, and the object location estimation unit 150.
  • specifically, the memory 120 may store the bounding box confidence threshold and the class estimation probability threshold used in the object detection unit 130, the heat map score threshold and the number of activation layers used when constructing a heat map in the object area segmentation unit 140, and the minimum and maximum depth image values, the camera projection matrix, and the like used in the object location estimation unit 150.
  • the object detector 130 may extract a two-dimensional bounding box containing an object from an RGB image using an artificial neural network (S120).
  • the object detection unit 130 may apply a pre-trained 2D object detection artificial neural network to the RGB image received from the input unit 110 to obtain a 2D bounding box and an object type determination result for each object in the image. At this time, the 2D object detection artificial neural network can output the logit value of the class estimation probability for each object in the image along with the 2D object detection result.
  • the object area division unit 140 can extract a segmentation mask corresponding to the object area from the 2D bounding box by applying explainable artificial intelligence (XAI) to the 2D bounding box (S130).
  • here, explainable artificial intelligence (XAI) is a way of expressing, in a form humans can understand, the basis on which an artificial neural network, a black-box structure whose decision process is difficult to interpret, makes its judgments.
  • the heat map representation method, one of the most actively studied explainable artificial intelligence techniques, expresses the part of the input data that the artificial neural network mainly referenced in order to compute its output as a heat map with the same dimensions as the input data.
  • heat map expression methods can be broadly classified into three types: the first is a perturbation based method, the second is a gradient based method, and the third is a class activation map.
  • the perturbation-based method partially changes the input data in various ways to find the change pattern of the input data that causes the greatest change in the output value. Compared to the other approaches described later, it can produce a clearly interpretable heat map; however, to obtain a heat map, the artificial neural network must be evaluated multiple times on the same input data, and the amount of computation required grows as the dimension and resolution of the input data increase.
  • the gradient-based method calculates the gradient of the output value with respect to the input data and obtains the change pattern of the input value to increase the class estimation score of the artificial neural network most quickly. Since the gradient can be calculated using backpropagation after performing the artificial neural network operation once, the amount of calculation required is less than that of the perturbation-based method, but there is a lot of noise in the heat map due to the gradient shattering phenomenon.
  • the class activation map is a method of expressing the portion that contributes to the class estimation score, the final output value of the artificial neural network, as a heat map in the activation map, which is the output value of each layer of the artificial neural network. This method has less noise than gradient-based methods and requires less computation than perturbation-based methods, but the heat maps generally have lower resolution.
  • the object area segmentation unit 140 may extract a segmentation mask based on the degree to which the artificial neural network references each pixel of the RGB image when extracting a 2D bounding box.
  • the object area segmentation unit 140 can express the degree to which the artificial neural network refers to each pixel of the RGB image in extracting the two-dimensional bounding box as a heat map, and extract a segmentation mask based on the heat map.
  • the object area segmentation unit 140 calculates a heat map score by scoring the degree of reference for each pixel of the RGB image, and extracts a segmentation mask by comparing the heat map score with a threshold.
  • the object area segmentation unit 140 may extract a segmentation mask by selecting pixels whose heat map score is equal to or greater than a threshold, or extract a segmentation mask by filtering pixels whose heat map score is less than the threshold.
  • specifically, the object area segmentation unit 140 may apply the guided gradient, one of the explainable artificial intelligence techniques, to the two-dimensional bounding box output and the class discrimination score output for each object in the image acquired by the object detection unit 130, together with the RGB image received from the input unit 110, thereby obtaining a heat map showing the correlation between the RGB input image and the 2D object detection result (S131).
  • the object area segmentation unit 140 may then obtain an object area segmentation mask by filtering the obtained correlation heat map based on the threshold value stored in the memory 120 (S132).
  • in this embodiment the guided gradient, which is one of the gradient-based methods, is used; however, among the explainable artificial intelligence heat map representation techniques, excluding the perturbation-based methods that cannot produce results in real time, other gradient-based methods such as SmoothGrad, or class activation maps such as Grad-CAM and LayerCAM, can also be applied.
  • an explainable artificial intelligence technique that produces a heat map expresses, as a heat map, the part of the input data that the artificial neural network referenced in order to calculate the discrimination score of a specific class.
  • assume an artificial neural network that receives an input image I and estimates the probability that the image belongs to each of the C classes, outputting the probability estimate σ_j(q) for a specific class j; the part of the input image referenced by the artificial neural network is then calculated and expressed as a heat map.
  • the gradient-based method uses the gradient of the output value with respect to the input to express the explanatory heat map.
  • by calculating the gradient ∂q_j/∂I of the logit value q_j of the artificial neural network's estimated probability for a specific class j with respect to the input image I, a heat map H_j of the change pattern of the input image I along which the logit value q_j for class j increases the fastest is obtained.
  • the gradient-based heat map H_j obtained in this way contains many noise components that are difficult for humans to interpret, because both negative and positive gradients are considered in the backpropagation process used to calculate the gradient.
  • in Equation 2, the activation function σ(·) is a ReLU (Rectified Linear Unit) function, which suppresses the negative gradient components.
  • by calculating the guided gradient derived through Equation 2 for the estimated probability of the j-th class of the class discrimination artificial neural network, a heat map representing which visual features within the input image are referenced can be obtained.
  • the existing method of estimating the 3D location of an object by fusing the 2D object detection result and the depth image has the limitation that components other than the object are included in the object's bounding box.
  • for this reason, existing studies have applied 2D instance segmentation or end-to-end learning methods rather than 2D object detection when solving 3D location estimation problems.
  • 2D instance segmentation and end-to-end learning methods require more costs than 2D object detection technology in terms of dataset construction costs and the amount of computation of the algorithm itself. Therefore, it is necessary to research a method to more accurately estimate the 3D position of an object only by detecting 2D objects.
  • to this end, the guided gradient XAI technique described above can be applied.
  • given the 2D object detection results for the input image, that is, the 2D bounding box estimates (C_u,i, C_v,i, w_i, h_i) for the N objects and the probability estimates for each of the C object types, a guided gradient heat map is obtained for the estimated class y_i of each object as in Equation 3.
  • pixels whose value in the guided gradient heat map of each object is higher than the threshold are then selected to obtain an object area segmentation mask indicating the pixel area of the object.
  • the object location estimation unit 150 may estimate the 3D location of the object using a depth image and a segmentation mask.
  • specifically, the object location estimation unit 150 extracts the depth image value corresponding to each pixel in the object area segmentation mask of each object acquired by the object area segmentation unit 140, and can obtain the 3D real-world coordinates of each pixel using the image coordinates of the corresponding pixel and the camera projection matrix.
  • the object location estimation unit 150 can then calculate the average value of the acquired real-world point cloud to finally obtain the 3D location coordinates of the object.
  • in this embodiment the average value of the point cloud is calculated to estimate the final 3D position of the object, but other types of estimators such as a weighted average or median may be applied; an illustrative sketch of this mask-based position estimation is given at the end of this section.
  • as described above, the present invention applies an explainable artificial intelligence technique to the RGB image 2D object detection results to obtain an object area segmentation mask for each object, thereby improving the accuracy of object 3D location estimation using depth image values.
  • in addition, when developing a 3D object detection algorithm for a new environment for which no training dataset has been built, the present invention requires less dataset construction cost than the instance segmentation method or the end-to-end learning method applied in existing depth-image fusion approaches. In other words, the present invention can estimate the 3D position of an object more accurately at a lower dataset construction cost than existing methods.
  • furthermore, since the 2D object detection algorithm of the present invention requires a relatively small amount of computation, more accurate real-time 3D position estimation of objects in RGB-D images becomes available even in artificial intelligence systems equipped with low-power computing devices such as AR devices, drones, and mobile robots.
  • the real-time 3D location estimation device according to the present invention can be used in various artificial intelligence applications such as AR devices, drones, and robots.
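
The sketch below, referenced from the mask-extraction and position-estimation paragraphs above, is a minimal illustration rather than the patented implementation: it thresholds an already-computed heat map into an object area segmentation mask and back-projects only the masked depth pixels to estimate a 3D position. The function names, the use of NumPy, the example intrinsic matrix, and the mean estimator are assumptions for illustration; the heat map itself is taken as given (for example, from a guided-gradient or other XAI method).

```python
import numpy as np

def mask_from_heatmap(heatmap, threshold):
    """Keep pixels whose heat map score is at or above the threshold (step S130)."""
    return heatmap >= threshold

def estimate_position_from_mask(depth, K, mask):
    """Back-project only the masked pixels (Equation 1 per pixel) and average them (step S140)."""
    vs, us = np.nonzero(mask & (depth > 0))      # image coordinates of valid masked pixels
    d = depth[vs, us]
    pixels_h = np.stack([us, vs, np.ones_like(us)], axis=0).astype(np.float64)
    points = (np.linalg.inv(K) @ pixels_h) * d   # 3D points of the object region only
    return points.mean(axis=1)                   # a median could be used instead of the mean

# Usage with assumed inputs: a full-image heat map and depth image, an example intrinsic
# matrix K, and a heat map score threshold of the kind stored in the memory 120.
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])
heatmap = np.random.rand(480, 640)
depth = np.random.uniform(0.5, 3.0, size=(480, 640))
mask = mask_from_heatmap(heatmap, threshold=0.6)
print(estimate_position_from_mask(depth, K, mask))
```

Because only pixels that the detector actually referenced survive the threshold, occluding and background pixels inside the bounding box are largely excluded from the averaged point cloud.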

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a device for estimating the three-dimensional location of an object in real time, the device comprising: an input unit which receives an RGB image and a depth image; an object detection unit which extracts a two-dimensional bounding box, including the object, from the RGB image by using an artificial neural network; an object area segmentation unit which extracts a segmentation mask, corresponding to an object area, from the two-dimensional bounding box by applying explainable artificial intelligence to the two-dimensional bounding box; and an object location estimation unit which estimates the three-dimensional location of the object using the depth image and the segmentation mask.

Description

Apparatus and method for real-time estimation of the 3D position of an object
The present invention relates to an apparatus and method for real-time estimation of the 3D position of an object.
The present invention was derived from research conducted as part of the Regulation-Free Special Zone Innovation Business Development (R&D) program of the Korea Institute for Advancement of Technology.
[Unique project number] 1425157863
[Project number] P0016943
[Ministry name] Ministry of SMEs and Startups
[Project management (specialized) organization name] Korea Institute for Advancement of Technology
[Research program name] Regulation-Free Special Zone Innovation Business Development (R&D)
[Research project name] Demonstration of operation of autonomous outdoor robots
[Contribution rate] 1/1
[Name of organization carrying out the project] Twiny Co., Ltd.
[Research period] 2022.01.01 ~ 2022.12.31
3D object detection technology is an artificial intelligence technology that implements, with sensors and computing devices, the human cognitive ability to determine the type of each object visible within the field of view and to estimate its 3D location. As when humans work, the ability to determine the types of surrounding objects and estimate their locations is essential for an artificial intelligence system to perform advanced tasks. For example, 3D object detection technology can be used by autonomous driving robots to identify surrounding static and dynamic obstacles when deciding driving policies. It can also be used to determine the object that a robot arm should pick up and to calculate the arm's movement trajectory to that object.
Research to give artificial intelligence systems such 3D object detection capabilities has been conducted in the computer vision field for over 30 years. Since AlexNet demonstrated outstanding performance on the image classification problem of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, deep learning theory has emerged as an effective solution to many problems in computer vision, and 3D object detection research published since then has mainly been based on deep learning methods.
One of the approaches commonly used in 3D object detection research is a method that utilizes RGB-D images and deep learning theory. Specifically, the method acquires visual feature information about nearby objects from the RGB image, acquires 3D location information for each pixel in the image from the depth image, and uses deep learning to determine the type of each nearby object and estimate its 3D position.
3D object detection approaches using RGB-D images can be broadly classified into three types. The first is a method that fuses 2D object detection results for the RGB image with the depth image, the second is a method that fuses 2D instance segmentation results for the RGB image with the depth image, and the third is an end-to-end learning method that obtains 3D object detection results by inputting the RGB image and the depth image together into an artificial neural network.
First, the method of fusing the 2D object detection results for the RGB image with the depth image consists of two steps. The first step obtains the 2D object detection results: the RGB image is input into a deep neural network to determine the type of each object in the image and to obtain a 2D bounding box expressing the location and size of each object. The second step estimates the 3D position of each object: the depth image values within the 2D bounding box obtained for each object are extracted, and the 3D real-world coordinates of the pixels within the 2D bounding box are calculated using the camera projection matrix. The 3D position coordinates of the pixels obtained in this way are filtered to estimate the 3D position of the object. To apply this method, a deep artificial neural network that receives an RGB image as input and outputs 2D object detection results must be trained, and labeling the training dataset for this purpose takes approximately 38.1 seconds per image.
This 2D object detection method requires, on average, the least amount of computation compared to the other approaches described later. However, when estimating the 3D position, estimation noise occurs due to occluding objects or background objects within the 2D bounding box and due to the diversity of object poses.
Next, the method of fusing the 2D instance segmentation results for the RGB image with the depth image also consists of two steps. The first step obtains the 2D instance segmentation result: the RGB image is input into a deep artificial neural network to determine the type of each object in the image and to obtain a 2D instance mask representing the pixel area each object occupies in the image. The second step estimates the 3D position of each object: the depth image values within the 2D instance mask obtained for each object are extracted, and the 3D real-world coordinates of the pixels within the 2D instance mask are calculated using the camera projection matrix. The 3D position coordinates of the pixels obtained in this way are filtered to estimate the 3D position of the object.
To apply this method, a deep artificial neural network that receives an RGB image as input and outputs 2D instance segmentation results must be trained, and labeling the training dataset for this purpose takes approximately 239.7 seconds per image. Unlike the 2D object detection method, this 2D instance segmentation method does not require a noise filtering procedure, but on average it requires more computation than the 2D object detection method.
Lastly, the end-to-end learning method that inputs the RGB image and the depth image together into an artificial neural network consists of a single step: the RGB image and the depth image are input together into a deep artificial neural network to determine the type of each object in the image and obtain its 3D location. To apply this method, an artificial neural network that receives an RGB image and a depth image and outputs 3D object detection results must be trained, and labeling the training dataset for this purpose takes approximately 714.4 seconds per image.
Unlike the 2D object detection method or the 2D instance segmentation method, this end-to-end learning method requires neither a noise filtering procedure nor a separate calculation of real-world coordinates. However, on average it requires more computation than both the 2D object detection method and the 2D instance segmentation method.
The present invention was devised to solve the above problems, and its purpose is to provide a device and method for real-time estimation of the 3D position of an object that, by applying an explainable artificial intelligence technique to the 3D position estimation process, can accurately estimate the 3D position of an object at a low dataset construction cost.
The technical problems to be achieved by the present invention are not limited to the technical problems mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.
To this end, the present invention provides a real-time 3D position estimation device for an object, including an input unit that receives an RGB image and a depth image, an object detection unit that extracts a two-dimensional bounding box containing an object from the RGB image using an artificial neural network, an object area segmentation unit that extracts a segmentation mask corresponding to the object area from the two-dimensional bounding box by applying explainable artificial intelligence to the two-dimensional bounding box, and an object location estimation unit that estimates the three-dimensional position of the object using the depth image and the segmentation mask.
Here, the object area segmentation unit may extract the segmentation mask based on the degree to which the artificial neural network referred to each pixel of the RGB image when extracting the two-dimensional bounding box.
The object area segmentation unit may also express the degree to which the artificial neural network referred to each pixel of the RGB image when extracting the two-dimensional bounding box as a heat map, and extract the segmentation mask based on the heat map.
The object area segmentation unit may also calculate a heat map score by scoring the degree to which the artificial neural network referred to each pixel of the RGB image when extracting the two-dimensional bounding box, and extract the segmentation mask by comparing the heat map score with a threshold.
The object area segmentation unit may extract the segmentation mask by selecting pixels whose heat map score is equal to or greater than the threshold.
The object area segmentation unit may also extract the segmentation mask by filtering out pixels whose heat map score is less than the threshold.
The object location estimation unit may extract the depth image value corresponding to each pixel in the segmentation mask.
The object location estimation unit may also extract the 3D position coordinates of each pixel using the image coordinates of each pixel in the segmentation mask and the camera projection matrix.
The present invention also provides a real-time 3D position estimation method for an object, including the steps of receiving an RGB image and a depth image, extracting a two-dimensional bounding box containing an object from the RGB image using an artificial neural network, applying explainable artificial intelligence to the two-dimensional bounding box to extract a segmentation mask corresponding to the object area from the two-dimensional bounding box, and estimating the three-dimensional position of the object using the depth image and the segmentation mask.
According to the present invention, the accuracy of object 3D position estimation using depth image values can be improved by applying an explainable artificial intelligence technique to the RGB image 2D object detection results to obtain an object area segmentation mask for each object.
In addition, when developing a 3D object detection algorithm for a new environment for which no training dataset has been built, the present invention requires less dataset construction cost than the instance segmentation method or the end-to-end learning method applied in existing depth-image fusion approaches. In other words, the present invention can estimate the 3D position of an object more accurately at a lower dataset construction cost than existing methods.
Furthermore, since the 2D object detection algorithm of the present invention requires a relatively small amount of computation, more accurate real-time 3D position estimation of objects in RGB-D images becomes available even in artificial intelligence systems equipped with low-power computing devices such as AR devices, drones, and mobile robots.
The effects that can be obtained from the present invention are not limited to the effects mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.
Figure 1 is a flowchart of a method for estimating the location of a 3D object by fusing a conventional 2D object detection result and a depth image.
Figure 2 is a block diagram showing a real-time 3D location estimation device for an object detected in an RGB-D image using an explainable artificial intelligence technique according to an embodiment of the present invention.
Figure 3 is a flowchart of a method for real-time estimation of the 3D position of an object detected in an RGB-D image using an explainable artificial intelligence technique according to an embodiment of the present invention.
Figure 4 is a flowchart of the method for extracting a segmentation mask in Figure 3.
Terms and words used in this specification and claims should not be construed as limited to their ordinary or dictionary meanings; based on the principle that an inventor may appropriately define the concepts of terms in order to describe his or her own invention in the best way, they must be interpreted with meanings and concepts consistent with the technical idea of the present invention.
Therefore, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent the entire technical idea of the present invention, and it should be understood that various equivalents and modifications capable of replacing them may exist at the time of this application.
Figure 1 is a flowchart of a method for estimating the location of a 3D object by fusing a conventional 2D object detection result and a depth image.
Referring to Figure 1, a conventional method for estimating the position of a 3D object first detects a 2D bounding box containing the object in an RGB image (S10). Next, the 2D bounding box is applied to the depth image to extract the depth image values within the 2D bounding box (S20). Next, the real-world coordinates of the pixels within the 2D bounding box are acquired (S30), and the 3D location of the object is estimated based on the obtained real-world coordinates (S40).
Specifically, 2D object detection technology uses an artificial neural network to determine the type (class) of the objects present in an input image, and expresses the location and size of each object by estimating a 2D bounding box with center pixel (C_u, C_v), width w, and height h.
Here, the 2D object detection artificial neural network receives an image I as input and, for each of the N objects present in the image, outputs the bounding box regression result (C_u,i, C_v,i, w_i, h_i), the estimated value σ_j(q_i) of the probability that the object belongs to each of the C object types (classes), and the estimated object type (class) y_i of the object. Here, i = 0, 1, 2, ..., N-1 is an integer indicating the object index, and j = 0, 1, 2, ..., C-1 is an integer indicating the class index. The activation function σ(·) is a softmax function, and q_i is the vector of logit values of the probabilities that the i-th object belongs to each of the C classes.
The process of estimating the 3D location of each object by fusing the 2D object detection result with the depth image is performed through camera inverse projection. Specifically, given the RGB image I, the depth image D, and the camera projection matrix K, the real-world coordinates (X, Y, Z) of an arbitrary pixel coordinate (u_i, v_i) contained in the 2D bounding box of the i-th object in the image are calculated through Equation 1 below.
[Equation 1]
[X, Y, Z]^T = d(u_i, v_i) · K^{-1} · [u_i, v_i, 1]^T
Here, the real number d(u_i, v_i) is the depth value measured at the pixel coordinates (u_i, v_i) of the depth image D, and the matrix K^{-1} is the inverse matrix of the camera projection matrix K.
By applying Equation 1 above to the coordinates of all pixels included in the two-dimensional bounding box of the i-th object, a point cloud, which is a set of real-world coordinates, is obtained as shown in Figure 1. The real-world 3D position estimate of the i-th object is then obtained by calculating an estimator such as the average value, weighted average value, or median value from the point cloud obtained in this way.
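As an illustration of the conventional bounding-box fusion described above, the following minimal sketch applies the inverse projection of Equation 1 to every pixel inside a 2D bounding box and averages the resulting point cloud. It is a sketch under assumptions, not code from the patent: the function name, the example pinhole intrinsic matrix, and the choice of the mean as the estimator are illustrative.

```python
import numpy as np

def estimate_position_from_bbox(depth, K, bbox):
    """Estimate an object's 3D position from the depth values inside its 2D bounding box.

    depth : (H, W) depth image (0 where no measurement is available)
    K     : (3, 3) camera projection matrix
    bbox  : (cu, cv, w, h) center pixel, width and height of the 2D bounding box
    """
    cu, cv, w, h = bbox
    u0, u1 = int(cu - w / 2), int(cu + w / 2)
    v0, v1 = int(cv - h / 2), int(cv + h / 2)

    # All pixel coordinates inside the bounding box.
    us, vs = np.meshgrid(np.arange(u0, u1), np.arange(v0, v1))
    d = depth[v0:v1, u0:u1]

    # Keep only pixels with a valid depth measurement.
    valid = d > 0
    us, vs, d = us[valid], vs[valid], d[valid]

    # Equation 1: (X, Y, Z) = d(u, v) * K^-1 * (u, v, 1) for every pixel in the box.
    pixels_h = np.stack([us, vs, np.ones_like(us)], axis=0).astype(np.float64)
    points = (np.linalg.inv(K) @ pixels_h) * d   # point cloud in camera coordinates

    # Any estimator may be used; the mean is shown, a median is a common alternative.
    return points.mean(axis=1)

# Example call with an assumed intrinsic matrix and a synthetic depth image.
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])
depth = np.random.uniform(0.5, 3.0, size=(480, 640))
print(estimate_position_from_bbox(depth, K, bbox=(320, 240, 80, 60)))
```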
By performing the above calculation process on all of the bounding boxes resulting from 2D object detection in image I, the real-world 3D position estimates P_i, i = 0, 1, 2, ..., N-1 of the N objects are obtained.
A limitation of the method described above, which estimates the 3D position of an object by fusing conventional 2D object detection with the depth image, is that components other than the object may be included in the 2D bounding box. As shown in Figure 1, the point cloud calculated for all pixels included in the two-dimensional bounding box contains not only the object but also components other than the object, such as occluding objects and the background. Therefore, in the process of calculating an estimator to obtain the 3D position of the object, the non-object components included in the point cloud introduce noise into the position estimate.
Figure 2 is a block diagram showing a real-time 3D location estimation device for an object detected in an RGB-D image using an explainable artificial intelligence technique according to an embodiment of the present invention, Figure 3 is a flowchart of a method for real-time estimation of the 3D position of an object detected in an RGB-D image using an explainable artificial intelligence technique according to an embodiment of the present invention, and Figure 4 is a flowchart of the method for extracting a segmentation mask in Figure 3.
Referring to Figure 2, the real-time 3D location estimation system 100 for an object detected in an RGB-D image using an explainable artificial intelligence technique may include an input unit 110, a memory 120, an object detection unit 130, an object area segmentation unit 140, and an object location estimation unit 150.
Referring to Figures 2 and 3, the input unit 110 may receive sensor measurement values for the target from which objects are to be detected (S110). For example, it may be configured to receive, in real time, an RGB image and a depth image captured by an RGB-D camera. However, the input unit 110 is not limited to an RGB-D camera; the present invention can also operate by receiving an RGB image together with a 3D LiDAR measurement, or an RGB image together with an estimated depth image obtained by applying a deep artificial neural network to the RGB image.
The memory 120 may be configured to store the parameters needed to operate the object detection unit 130, the object area segmentation unit 140, and the object location estimation unit 150.
Specifically, the memory 120 may store the bounding box confidence threshold and the class estimation probability threshold used in the object detection unit 130, the heat map score threshold and the number of activation layers used when constructing a heat map in the object area segmentation unit 140, and the minimum and maximum depth image values, the camera projection matrix, and the like used in the object location estimation unit 150.
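Purely for illustration, the parameters described above could be grouped into a single configuration structure. The names and values below are assumptions made for this sketch, not values specified by the patent.

```python
import numpy as np

# Hypothetical parameter set of the kind the memory 120 would hold (illustrative values only).
PARAMS = {
    "bbox_confidence_threshold": 0.5,    # object detection unit 130
    "class_probability_threshold": 0.5,  # object detection unit 130
    "heatmap_score_threshold": 0.6,      # object area segmentation unit 140
    "num_activation_layers": 3,          # layers used when constructing the heat map
    "depth_min_m": 0.3,                  # object location estimation unit 150
    "depth_max_m": 10.0,                 # object location estimation unit 150
    "camera_projection_matrix": np.array([[525.0, 0.0, 320.0],
                                          [0.0, 525.0, 240.0],
                                          [0.0, 0.0, 1.0]]),
}
```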
The object detection unit 130 may extract, using an artificial neural network, a 2D bounding box containing an object from the RGB image (S120).
Specifically, the object detection unit 130 may apply a pre-trained 2D object detection neural network to the RGB image received from the input unit 110 to obtain a 2D bounding box and an object class decision for each object in the image. Here, the 2D object detection network may output, together with the 2D detection result, the logit value of the class probability estimate for each object in the image.
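As an illustration of step S120, the sketch below runs an off-the-shelf pretrained detector to obtain per-object boxes, labels, and scores. The torchvision Faster R-CNN used here is only an assumed stand-in for the pre-trained network of the embodiment, and it exposes post-softmax scores rather than the raw class logits mentioned above.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Assumed stand-in detector; the embodiment does not prescribe a specific architecture.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()  # requires a recent torchvision

def detect_objects(rgb_np, score_threshold=0.5):
    # rgb_np: H x W x 3 uint8 RGB image (e.g., from the input sketch above)
    img = torch.from_numpy(rgb_np).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = detector([img])[0]  # dict with 'boxes' (x1, y1, x2, y2), 'labels', 'scores'
    keep = out["scores"] >= score_threshold
    return out["boxes"][keep], out["labels"][keep], out["scores"][keep]
```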
The object area segmentation unit 140 may extract a segmentation mask corresponding to the object area from the 2D bounding box by applying explainable artificial intelligence (XAI) to the 2D bounding box (S130).
Here, explainable artificial intelligence (XAI) is a family of methods for expressing, in a form that humans can understand, the basis of decisions made by an artificial neural network, which is otherwise a black-box structure whose decision process is difficult to interpret.
Among XAI techniques, the heat map representation method, which is the most actively studied approach, expresses the parts of the input data that the neural network mainly referred to when computing its output as a heat map with the same dimensions as the input data.
Heat map representation methods can be broadly classified into three types: the first is the perturbation-based method, the second is the gradient-based method, and the third is the class activation map.
First, the perturbation-based method partially perturbs the input data in various ways and searches for the perturbation pattern that changes the output value the most. Compared with the other approaches described below, it yields a clearly interpretable heat map, but it requires running the neural network many times on the same input, and the amount of computation grows rapidly as the dimension and resolution of the input increase.
Next, the gradient-based method computes the gradient of the output value with respect to the input data, thereby obtaining the input change pattern along which the network's class score increases fastest. Because the gradient can be computed with a single forward pass followed by backpropagation, it requires far less computation than the perturbation-based method, but the resulting heat map is noisy due to the gradient shattering phenomenon.
Finally, the class activation map expresses as a heat map the portion of each layer's output (the activation map) that contributes to the network's final output, the class score. This method is less noisy than gradient-based methods and requires less computation than perturbation-based methods, but the resulting heat map generally has low resolution.
XAI techniques that produce heat maps are also used to solve problems other than explaining the model itself. For example, perturbation-based methods have been applied in research on automatic labeling systems that build 2D instance segmentation training datasets using an already trained 2D object detection network, and class activation maps have been applied as a segmentation technique to localize a target in the input image when a network merely classifies whether the target is present (e.g., detecting the presence of road cracks or product damage).
The object area segmentation unit 140 may extract the segmentation mask based on the degree to which the artificial neural network referred to each pixel of the RGB image when extracting the 2D bounding box.
Specifically, the object area segmentation unit 140 may express, as a heat map, the degree to which the network referred to each pixel of the RGB image when extracting the 2D bounding box, and may extract the segmentation mask based on the heat map.
Here, the object area segmentation unit 140 may score the degree of reference for each pixel of the RGB image to compute a heat map score, and may extract the segmentation mask by comparing the heat map score with a threshold.
Specifically, the object area segmentation unit 140 may extract the segmentation mask by selecting pixels whose heat map score is greater than or equal to the threshold, or by filtering out pixels whose heat map score is below the threshold.
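The two thresholding variants just described (selecting pixels at or above the threshold, or filtering out pixels below it) reduce to the same boolean mask; a minimal sketch follows, in which the heat map and threshold values are placeholders.

```python
import numpy as np

def heatmap_to_mask(heatmap, tau):
    # heatmap: H x W array of per-pixel heat map scores for one detected object
    # Selecting scores >= tau and filtering out scores < tau yield the same mask.
    return heatmap >= tau

mask = heatmap_to_mask(np.random.rand(480, 640), tau=0.2)  # placeholder data
```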
For example, referring to FIG. 4, the object area segmentation unit 140 may obtain a heat map representing the correlation between the RGB input image and the 2D object detection result by computing the guided gradient, one of the explainable AI techniques, from the 2D bounding box outputs and the class score outputs obtained by the object detection unit 130 for each object in the image, together with the RGB image received from the input unit 110 (S131).
The object area segmentation unit 140 may then obtain an object area segmentation mask by filtering the obtained correlation heat map against the threshold stored in the memory 120 (S132). Although the embodiment of the present invention specifies the guided gradient, which is one of the gradient-based methods, any explainable AI heat map technique that can produce results in real time may be applied instead, such as another gradient-based method like SmoothGrad or a class activation map like Grad-CAM or LayerCAM; perturbation-based methods are excluded because they cannot produce results in real time.
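Where a class activation map is preferred over the guided gradient, a Grad-CAM-style heat map could be computed roughly as sketched below. The classifier and the choice of the last convolutional block as the target layer are assumptions made for illustration, not part of the disclosure.

```python
import torch
import torch.nn.functional as F
import torchvision

# Assumed classifier backbone; any network with a convolutional feature map could be used.
model = torchvision.models.resnet18(weights="DEFAULT").eval()

def grad_cam(image, class_idx):
    # image: 1 x 3 x H x W float tensor; returns an H x W heat map for class_idx
    feats = {}
    handle = model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
    logits = model(image)
    handle.remove()
    acts = feats["a"]                                            # target-layer activations
    grads = torch.autograd.grad(logits[0, class_idx], acts)[0]   # d(score) / d(activations)
    weights = grads.mean(dim=(2, 3), keepdim=True)               # global-average-pooled gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))      # weighted sum of channels
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8))[0, 0]                      # normalized heat map
```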
An explainable AI (XAI) technique that produces a heat map expresses, as a heat map, the part of the input data that the artificial neural network referred to when computing the decision score of a specific class.
Specifically, a heat-map XAI technique assumes that the deep neural network solving the class decision problem consists of L hidden layers, receives an arbitrary image I as input, and outputs probability estimates σ(q) = [σ0(q), …, σC-1(q)] that the image belongs to each of the C classes. To explain the network's probability estimate σj(q) for a specific class j, the part of the input image that the network referred to is computed and expressed as a heat map Hj with the same spatial dimensions as I.
Among the various heat map representation techniques, the gradient-based method uses the gradient of the output value with respect to the input to express a heat map for explaining the network. Specifically, the gradient ∂qj/∂I of qj, the logit value of the network's estimated probability for a specific class j, with respect to the input image I is computed to obtain a heat map Hj describing how the input image I should change for the logit qj of class j to increase fastest. However, the gradient-based heat map Hj obtained in this way contains many noise components that are difficult for humans to interpret, because the backpropagation used to compute the gradient accumulates both negative and positive gradients. Guided backpropagation was proposed to compensate for this: it considers only the positive gradients during the backpropagation process and thus produces a human-interpretable heat map. The guided gradient heat map ψj computed through guided backpropagation is given by Equation 2 below.
[Equation 2] (presented as an image in the original publication; it defines the guided-gradient heat map ψj by retaining only the positive gradients at each hidden-layer output zk while backpropagating the logit qj)
Here, zk, k = 0, 1, 2, …, L−1 is the output of the k-th hidden layer of the artificial neural network, and the activation function Φ(·) is the ReLU (Rectified Linear Unit) function.
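A possible realization of the guided-backpropagation rule of Equation 2 is sketched below using backward hooks that zero out negative gradients at every ReLU. The classifier, input size, and the reduction of the per-channel gradient to a single-channel map are assumptions made for illustration.

```python
import torch
import torchvision

# Assumed stand-in classifier whose hidden activations are nn.ReLU.
model = torchvision.models.resnet18(weights="DEFAULT").eval()

hooks = []
for m in model.modules():
    if isinstance(m, torch.nn.ReLU):
        m.inplace = False  # full backward hooks require out-of-place ReLU
        # Keep only positive gradients flowing backward through each ReLU (the Φ rule).
        hooks.append(m.register_full_backward_hook(
            lambda mod, grad_in, grad_out: tuple(
                torch.clamp(g, min=0.0) if g is not None else g for g in grad_in)))

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for the input image I
logits = model(image)                                    # q: class logits
j = int(logits.argmax(dim=1))                            # class of interest
logits[0, j].backward()                                  # guided dq_j / dI
psi_j = image.grad.abs().max(dim=1)[0]                   # heat map psi_j, shape 1 x H x W

for h in hooks:
    h.remove()
```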
In summary, by computing the guided gradient through Equation 2 above, a heat map ψj is obtained that expresses which visual features in the input image I the class decision network refers to when computing its estimated probability for the j-th class.
As described above, the existing method of estimating an object's 3D position by fusing the 2D object detection result with the depth image has the limitation that non-object components are included in the object's bounding box. To avoid this problem, previous work has applied 2D instance segmentation or end-to-end learning rather than object detection when solving the 3D position estimation problem. However, 2D instance segmentation and end-to-end learning require more resources than 2D object detection, both for dataset construction and for the computation of the algorithm itself. A method is therefore needed that estimates an object's 3D position more accurately using 2D object detection alone, and the guided gradient XAI technique described above can be applied for this purpose.
Specifically, suppose that the 2D object detection result for an input image I consists of 2D bounding box estimates (Cu,i, Cv,i, wi, hi) for N objects, probability estimates for each of the C object classes, and class estimates yi, for i = 0, 1, 2, …, N−1 and j = 0, 1, 2, …, C−1. The guided gradient heat map ψyi of the i-th object for its estimated class yi can then be computed as in Equation 3 below.
[Equation 3] (presented as an image in the original publication; it defines ψyi as the guided gradient of the logit qi,yi with respect to the pixel values I({ui, vi}) inside the i-th bounding box)
Here, {(ui, vi)} is the set of pixel coordinates contained in the 2D bounding box (Cu,i, Cv,i, wi, hi) of the i-th object, I({ui, vi}) is the pixel value of the image I at those coordinates, and qi,yi is the logit value of the probability, estimated by the network, that the i-th object belongs to the yi-th class.
By performing the above computation for every bounding box in the 2D object detection result for the image I, the guided gradient heat maps ψyi, i = 0, 1, 2, …, N−1 for the N objects are obtained.
Finally, the values higher than the threshold τ are selected from the guided gradient heat map ψyi of each object to obtain an object area segmentation mask indicating the pixel region of that object.
The object position estimation unit 150 may estimate the 3D position of the object using the depth image and the segmentation mask.
Specifically, the object position estimation unit 150 may extract the depth image value corresponding to each pixel in the object area segmentation mask obtained by the object area segmentation unit 140 for each object, and may obtain the 3D real-world coordinates of each pixel using the image coordinates of that pixel and the camera projection matrix.
The object position estimation unit 150 may then compute the mean of the resulting real-world point cloud to finally obtain the 3D position coordinates of the object.
Although the embodiment of the present invention computes the mean of the point cloud to estimate the final 3D position of the object, other kinds of estimators such as a weighted mean or a median may be applied instead.
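A minimal sketch of this position estimation step is given below, assuming a pinhole camera model with intrinsics (fx, fy, cx, cy) taken from the stored camera matrix. The depth-range check and the choice between mean and median follow the description above, while the variable names and default values are illustrative.

```python
import numpy as np

def estimate_object_position(depth_m, mask, K, depth_min=0.3, depth_max=10.0, estimator="mean"):
    # depth_m: H x W depth image in meters; mask: H x W boolean segmentation mask
    # K: 3x3 camera intrinsic matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    v, u = np.nonzero(mask)                       # pixel coordinates inside the mask
    z = depth_m[v, u]
    valid = (z >= depth_min) & (z <= depth_max)   # discard out-of-range depth values
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx                         # back-project to 3D camera coordinates
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)          # per-pixel point cloud for the object
    reduce = np.median if estimator == "median" else np.mean
    return reduce(points, axis=0)                 # 3D position estimate of the object
```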
As described above, the present invention applies an explainable AI technique to the 2D object detection result of the RGB image to obtain an object area segmentation mask for each object, thereby improving the accuracy of object 3D position estimation based on depth image values.
In addition, when developing a 3D object detection algorithm for a new environment for which no training dataset has yet been built, the present invention incurs far less dataset construction cost than the instance segmentation or end-to-end learning methods applied in the existing approach that fuses 2D object detection with depth images. In other words, the present invention can estimate an object's 3D position more accurately at a lower dataset construction cost than existing methods.
Furthermore, because the 2D object detection algorithm of the present invention requires relatively little computation, more accurate real-time 3D position estimation of objects in RGB-D images becomes available even to artificial intelligence systems equipped with low-power computing devices, such as AR devices, drones, and mobile robots.
The above detailed description is illustrative of the present invention. The foregoing merely shows and describes preferred embodiments of the present invention, and the present invention can be used in various other combinations, modifications, and environments. That is, changes or modifications are possible within the scope of the inventive concept disclosed in this specification, within the scope equivalent to the written disclosure, and/or within the scope of skill or knowledge in the art.
The above-described embodiments describe the best mode for carrying out the present invention; implementations in other modes known in the art, as well as the various modifications required by the specific application fields and uses of the invention, are also possible. Therefore, the detailed description above is not intended to limit the invention to the disclosed embodiments. Furthermore, the appended claims should be construed to include other embodiments as well.
The real-time 3D position estimation apparatus according to the present invention can be used in various fields involving artificial intelligence technology, such as AR devices, drones, and robots.

Claims (9)

  1. An apparatus for real-time estimation of the three-dimensional position of an object, the apparatus comprising:
    an input unit that receives an RGB image and a depth image;
    an object detection unit that extracts, using an artificial neural network, a two-dimensional bounding box containing an object from the RGB image;
    an object area segmentation unit that extracts a segmentation mask corresponding to the object area from the two-dimensional bounding box by applying explainable artificial intelligence to the two-dimensional bounding box; and
    an object position estimation unit that estimates the three-dimensional position of the object using the depth image and the segmentation mask.
  2. The apparatus of claim 1, wherein the object area segmentation unit extracts the segmentation mask based on the degree to which the artificial neural network referred to each pixel of the RGB image when extracting the two-dimensional bounding box.
  3. The apparatus of claim 1, wherein the object area segmentation unit expresses, as a heat map, the degree to which the artificial neural network referred to each pixel of the RGB image when extracting the two-dimensional bounding box, and extracts the segmentation mask based on the heat map.
  4. The apparatus of claim 1, wherein the object area segmentation unit scores the degree to which the artificial neural network referred to each pixel of the RGB image when extracting the two-dimensional bounding box to compute a heat map score, and extracts the segmentation mask by comparing the heat map score with a threshold.
  5. The apparatus of claim 4, wherein the object area segmentation unit extracts the segmentation mask by selecting pixels whose heat map score is greater than or equal to the threshold.
  6. The apparatus of claim 4, wherein the object area segmentation unit extracts the segmentation mask by filtering out pixels whose heat map score is below the threshold.
  7. The apparatus of claim 1, wherein the object position estimation unit extracts a depth image value corresponding to each pixel in the segmentation mask.
  8. The apparatus of claim 1, wherein the object position estimation unit extracts three-dimensional position coordinates of each pixel using the image coordinates of each pixel in the segmentation mask and a camera projection matrix.
  9. A method for real-time estimation of the three-dimensional position of an object, the method comprising:
    receiving an RGB image and a depth image;
    extracting, using an artificial neural network, a two-dimensional bounding box containing an object from the RGB image;
    extracting a segmentation mask corresponding to the object area from the two-dimensional bounding box by applying explainable artificial intelligence to the two-dimensional bounding box; and
    estimating the three-dimensional position of the object using the depth image and the segmentation mask.
PCT/KR2022/020601 2022-11-07 2022-12-16 Device and method for estimating three-dimensional location of object in real time WO2024101532A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2022-0146927 2022-11-07
KR1020220146927A KR20240065772A (en) 2022-11-07 2022-11-07 Apparatus and method for real-time estimation of a three-dimensional position of object

Publications (1)

Publication Number Publication Date
WO2024101532A1 true WO2024101532A1 (en) 2024-05-16

Family

ID=91033080

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/020601 WO2024101532A1 (en) 2022-11-07 2022-12-16 Device and method for estimating three-dimensional location of object in real time

Country Status (2)

Country Link
KR (1) KR20240065772A (en)
WO (1) WO2024101532A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021167586A1 (en) * 2020-02-18 2021-08-26 Google Llc Systems and methods for object detection including pose and size estimation
KR102378105B1 (en) * 2021-10-13 2022-03-24 (주)이노시뮬레이션 Apparatus for estimating 3d position in augmented reality and method performing thereof
KR102439429B1 (en) * 2021-12-21 2022-09-05 주식회사 인피닉 Annotation method for easy object extraction and a computer program recorded on a recording medium to execute the annotation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BENNETOT ADRIEN, FRANCHI GIANNI; SER JAVIER DEL; CHATILA RAJA; DÍAZ-RODRÍGUEZ NATALIA: "Greybox XAI: A Neural-Symbolic learning framework to produce interpretable predictions for image classification", KNOWLEDGE-BASED SYSTEMS, vol. 258, 1 December 2022 (2022-12-01), pages 109947, XP093170409, DOI: 10.1016/j.knosys.2022.109947 *
LEE JUNHEE, CHO HYEONSEO; JANG PYUN YUN; KANG SUK-JU; NAM HYOUNGSIK: "Heatmap Assisted Accuracy Score Evaluation Method for Machine-Centric Explainable Deep Neural Networks", IEEE ACCESS, IEEE, USA, vol. 10, 1 January 2022 (2022-01-01), USA , pages 64832 - 64849, XP093170410, ISSN: 2169-3536, DOI: 10.1109/ACCESS.2022.3184453 *

Also Published As

Publication number Publication date
KR20240065772A (en) 2024-05-14

Similar Documents

Publication Publication Date Title
Wang et al. Self-supervised drivable area and road anomaly segmentation using rgb-d data for robotic wheelchairs
CN110349250B (en) RGBD camera-based three-dimensional reconstruction method for indoor dynamic scene
KR100474848B1 (en) System and method for detecting and tracking a plurality of faces in real-time by integrating the visual ques
CN108168539B (en) Blind person navigation method, device and system based on computer vision
CN112734765B (en) Mobile robot positioning method, system and medium based on fusion of instance segmentation and multiple sensors
Fitzpatrick First contact: an active vision approach to segmentation
Rout A survey on object detection and tracking algorithms
CN113066129A (en) Visual positioning and mapping system based on target detection in dynamic environment
Yu et al. Drso-slam: A dynamic rgb-d slam algorithm for indoor dynamic scenes
CN114332394A (en) Semantic information assistance-based dynamic scene three-dimensional reconstruction method
CN115049954A (en) Target identification method, device, electronic equipment and medium
WO2019088333A1 (en) Method for recognizing human body activity on basis of depth map information and apparatus therefor
Henein et al. Exploiting rigid body motion for SLAM in dynamic environments
WO2024101532A1 (en) Device and method for estimating three-dimensional location of object in real time
Ganguly et al. An unsupervised learning approach for road anomaly segmentation using RGB-D sensor for advanced driver assistance system
Yamanaka et al. Tactile Tile Detection Integrated with Ground Detection using an RGB-Depth Sensor.
WO2018131729A1 (en) Method and system for detection of moving object in image using single camera
CN113487738A (en) Building based on virtual knowledge migration and shielding area monomer extraction method thereof
CN113850750A (en) Target track checking method, device, equipment and storage medium
Fukui et al. Multiple object tracking system with three level continuous processes
Shanmugavel et al. Automatic Generation of Segmented Labels for Road Anomaly Detection: An Application for Robotic Wheelchair
Li et al. RGB-D Based Visual SLAM Algorithm for Indoor Crowd Environment
CN115775325B (en) Pose determining method and device, electronic equipment and storage medium
Qiao et al. Deep learning based optical flow estimation for change detection: A case study in Indonesia earthquake
CN113744397B (en) Real-time object-level semantic map construction and updating method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22965301

Country of ref document: EP

Kind code of ref document: A1