KR20230043653A

KR20230043653A - Hand joint 3d pose estimation method and hand joint 3d pose estimation system based on stereo camera input

Info

Publication number: KR20230043653A
Application number: KR1020220005655A
Authority: KR
Inventors: 조현중; 서경은
Original assignee: 고려대학교 세종산학협력단
Priority date: 2021-09-24
Filing date: 2022-01-14
Publication date: 2023-03-31

Abstract

A joint position estimation method based on stereo camera input according to an embodiment of the disclosed invention comprises: (a) a step of receiving a first image acquired by a first camera and a second image acquired by a second camera by a stereo image input unit; (b) a step of extracting a first feature from the first image and a second feature from the second image using a deep learning model by an image feature extraction unit; (c) a step of generating first joint 2D position information and second joint 2D position information based on the first feature and the second feature by a stereo attention block unit; and (d) a step of determining the 3D position of the hand joint using a stereo deep learning model based on the first joint 2D position information and the second joint 2D position information, by a hand joint 3D position determination unit. Accordingly, the present invention can reduce the cost of data collection.

Description

Hand joint 3D position estimation method and hand joint 3D position estimation system based on stereo camera input

The present invention relates to a hand joint 3D position estimation method and a hand joint 3D position estimation system capable of estimating the 3D position of hand joints based on images received from a stereo camera.

As various types of wearable cameras have come onto the market, a range of VR/AR-based technologies that use them has been introduced, such as rehabilitation therapy, medical simulation training, life pattern analysis, and AR applications. Because the user wears the device, applications built on wearable cameras interact from the user's own point of view. The most useful input tool for such interaction is the hand, so hand pose estimation is essential for unconstrained interaction.

Hand pose estimation is either absolute or relative. In absolute position estimation, the origin is the camera, and the hand joint positions are estimated with respect to the camera position. In relative position estimation, the origin is one of the hand joints, and the positions of the other joints are estimated relative to that joint. Because absolute position estimation yields the position of the user's hand relative to the camera, it has the advantage that the hand position can be used to provide a variety of interactions.

Depth information is essential for absolute position estimation. It can be obtained with an RGBD camera or with a stereo camera. An RGBD camera emits an infrared pattern and derives depth information from it. However, this approach can yield inaccurate information when other light sources interfere, and emitting the infrared pattern consumes considerable power. A stereo camera, in contrast, obtains depth information from the parallax (disparity) between its two cameras, so it is free of light interference and needs no additional power.
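For intuition on how parallax yields depth, the following is a minimal sketch of the textbook relation Z = f * B / d for a rectified stereo pair. It is not taken from the patent: the function name, the pinhole and rectification assumptions, and the numeric values (focal length, baseline, disparity) are all hypothetical.

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth along the optical axis for a rectified stereo pair: Z = f * B / d.

    Assumes ideal pinhole cameras with parallel optical axes (an assumption,
    not something the source specifies).
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return focal_px * baseline_m / disparity_px


# Hypothetical values: 600 px focal length, 6 cm baseline, 90 px disparity.
z = depth_from_disparity(disparity_px=90.0, focal_px=600.0, baseline_m=0.06)
print(f"{z:.2f} m")  # prints "0.40 m"
```

Because depth is inversely proportional to disparity, small disparity errors on distant points translate into large depth errors, which is one reason accurate per-joint disparity matters.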

However, current stereo-camera-based techniques for estimating the absolute position of hand joints have limitations. First, existing deep learning-based techniques do not use the geometric information of the stereo camera in training. Stereo-camera-based deep learning algorithms have been proposed, but they learn only the hand joint positions and do not exploit geometric information. Second, current techniques estimate the absolute hand joint position under several assumptions, for example that the hand is in front of all other objects. Such assumptions restrict free use. Third, training a deep learning model requires a dataset (pairs of images and ground-truth labels), but the ground-truth labels are currently created by hand. This takes more time as the dataset grows, and the resulting labels may be inaccurate.

An object of the present invention is to provide a hand joint 3D position estimation method and a hand joint 3D position estimation system capable of estimating the 3D position of hand joints more accurately than conventional hand joint position estimation methods.

Another object of the present invention is to provide a hand joint 3D position estimation method and a hand joint 3D position estimation system that perform accurate hand pose estimation without special assumptions, so that a user in a stereo-camera-based wearable environment can carry out a variety of tasks without being restricted to particular types of task.

A further object of the present invention is to provide a hand joint 3D position estimation method and a hand joint 3D position estimation system that can accurately and quickly collect the data needed to train the machine learning model used for hand joint 3D position estimation, and that can reduce the cost of data collection.

A joint position estimation method based on stereo camera input according to an aspect of the disclosed invention may include: (a) receiving, by a stereo image input unit, a first image acquired by a first camera and a second image acquired by a second camera; (b) extracting, by an image feature extraction unit, a first feature from the first image and a second feature from the second image using a deep learning model; (c) generating, by a stereo attention block unit, first joint 2D position information and second joint 2D position information based on the first feature and the second feature; and (d) determining, by a hand joint 3D position determination unit, a hand joint 3D position using a stereo deep learning model based on the first joint 2D position information and the second joint 2D position information.

The method may further include outputting, by a heatmap block unit, a first joint 2D position map based on the first feature and a second joint 2D position map based on the second feature.

Also, step (c) may include: generating, by the stereo attention block unit, a combined feature by combining the first feature and the second feature; and generating, by the stereo attention block unit, a mask map whose components have values from 0 to 1 inclusive, based on the combined feature.

Also, step (c) may include generating, by the stereo attention block unit, the first joint 2D position information based on the mask map and the first joint 2D position map, and generating the second joint 2D position information based on the mask map and the second joint 2D position map.
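One way to picture how a mask map with values in [0, 1] could act on a joint 2D position map is sketched below. The source states only that the mask components lie between 0 and 1 and that the 2D position information is generated from the mask map and the position map; the sigmoid, the element-wise product, and the arg-max readout used here are assumptions made for illustration, as are all names and values.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mask_map(scores):
    """Squash a 2D grid of raw scores into a mask with values in [0, 1].

    `scores` stands in for whatever the network derives from the combined
    feature (hypothetical input).
    """
    return [[sigmoid(v) for v in row] for row in scores]


def joint_2d_position(mask, heatmap):
    """Return (x, y) of the largest mask-weighted heatmap value."""
    best_value, best_xy = -math.inf, (0, 0)
    for y, (mask_row, heat_row) in enumerate(zip(mask, heatmap)):
        for x, (m, h) in enumerate(zip(mask_row, heat_row)):
            if m * h > best_value:
                best_value, best_xy = m * h, (x, y)
    return best_xy


# Hypothetical 2x3 grids. The raw heatmap peaks at (0, 0), but the mask
# suppresses that cell, so the weighted peak moves to (1, 1).
scores = [[-4.0, 0.0, 4.0], [-4.0, 4.0, 0.0]]
heatmap = [[0.9, 0.1, 0.5], [0.2, 0.6, 0.3]]
print(joint_2d_position(mask_map(scores), heatmap))  # prints "(1, 1)"
```

In this sketch the mask behaves like a soft attention weight: cells it rates near 0 contribute little, regardless of how strongly the single-view heatmap responds there.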

Also, step (c) may include generating, by the stereo attention block unit, same-joint disparity information based on the combined feature, and step (d) may include determining, by the hand joint 3D position determination unit, the hand joint 3D position based on the first joint 2D position information, the second joint 2D position information, and the same-joint disparity information.

Also, generating the same-joint disparity information may include generating, by the stereo attention block unit, the same-joint disparity information based on the coordinates in the first image and the coordinates in the second image of the same object appearing in the first image and the second image.
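As a rough illustration of disparity computed from the coordinates of the same joint in both images, the sketch below subtracts horizontal coordinates joint by joint. It assumes a horizontally rectified pair so that only the x coordinate differs; the source does not state this, and the function name and sample coordinates are hypothetical.

```python
def same_joint_disparity(joints_first, joints_second):
    """Per-joint horizontal offset between two views (first minus second).

    Each argument is a list of (x, y) pixel coordinates, one entry per joint,
    in the same joint order for both images.
    """
    if len(joints_first) != len(joints_second):
        raise ValueError("both views must list the same joints")
    return [x1 - x2 for (x1, _), (x2, _) in zip(joints_first, joints_second)]


# Hypothetical coordinates for two joints seen by both cameras.
first = [(120.0, 80.0), (140.0, 95.0)]
second = [(100.0, 80.0), (122.5, 95.0)]
print(same_joint_disparity(first, second))  # prints "[20.0, 17.5]"
```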

The method may further include: receiving, by the stereo image input unit, a plurality of training image pairs; generating, by the stereo attention block unit, joint 2D position information and same-joint disparity information for the training image pairs; and training, by an artificial intelligence learning unit, the stereo deep learning model based on the joint 2D position information and the same-joint disparity information of the training image pairs.

Also, training the stereo deep learning model may include: calculating, by the artificial intelligence learning unit, a first loss function based on the generated joint 2D position information of the training image pair and reference joint 2D position information of the training image pair; calculating, by the artificial intelligence learning unit, a second loss function based on the generated same-joint disparity information of the training image pair and reference same-joint disparity information of the training image pair; calculating, by the artificial intelligence learning unit, a third loss function based on the generated joint 2D position information of the training image pair and projected position information generated by projecting the 2D joint positions of the training image pair onto the opposite image; and calculating, by the artificial intelligence learning unit, a combined loss function based on the first loss function, the second loss function, and the third loss function, and training the stereo deep learning model so that the combined loss function decreases.
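The three loss terms and their combination can be sketched as follows. The source does not give the form of any loss, so everything concrete here is an assumption: mean squared error for each term, a weighted sum as the combination, use of x coordinates only, and projection onto the opposite image by subtracting the predicted disparity (which presumes a rectified pair). All names are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(pred_x_first, pred_x_second, pred_disp,
                  ref_x_first, ref_x_second, ref_disp,
                  weights=(1.0, 1.0, 1.0)):
    """Weighted sum of position, disparity, and cross-view projection losses."""
    # First term: predicted 2D positions against reference positions.
    loss_position = mse(pred_x_first + pred_x_second, ref_x_first + ref_x_second)
    # Second term: predicted same-joint disparity against reference disparity.
    loss_disparity = mse(pred_disp, ref_disp)
    # Third term: joints of the first view projected into the second view
    # (x_second ~ x_first - disparity) against the second-view prediction.
    projected = [x - d for x, d in zip(pred_x_first, pred_disp)]
    loss_projection = mse(pred_x_second, projected)
    w1, w2, w3 = weights
    return w1 * loss_position + w2 * loss_disparity + w3 * loss_projection


# Hypothetical, geometrically consistent values: every term is zero.
x_first, x_second, disp = [120.0, 140.0], [100.0, 122.5], [20.0, 17.5]
print(combined_loss(x_first, x_second, disp, x_first, x_second, disp))  # prints "0.0"
```

The third term needs no labels: it penalizes predictions whose two views disagree with the predicted disparity, which is one way geometric information from the stereo rig can enter training.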

The method may further include: generating, by a modified training image generation unit, a plurality of modified training images by changing the brightness of an input training image; and generating, by a training data generation unit, joint position information for each of the modified training images, and calculating reference same-joint disparity information by comparing the joint position information of the plurality of modified training images and reducing the differences among them.
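The brightness-variation step might look like the sketch below, which produces several copies of one image at different brightness levels. The source does not say how brightness is changed; multiplicative scaling with clamping to the 8-bit range, the grayscale nested-list image, and the factor values are assumptions. The subsequent step of reconciling the joint positions across the variants is not shown.

```python
def brightness_variants(image, factors=(0.6, 0.8, 1.0, 1.2)):
    """Return one brightness-scaled copy of `image` per factor.

    `image` is a 2D list of 8-bit grayscale pixel values; results are rounded
    and clamped to the range 0..255.
    """
    return [
        [[min(255, max(0, round(pixel * factor))) for pixel in row] for row in image]
        for factor in factors
    ]


# Hypothetical 2x2 image.
image = [[100, 200], [250, 0]]
darker, _, unchanged, brighter = brightness_variants(image)
print(darker)    # prints "[[60, 120], [150, 0]]"
print(brighter)  # prints "[[120, 240], [255, 0]]"
```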

A computer program according to an aspect of the disclosed invention may be stored in a computer-readable recording medium to execute the joint position estimation method based on stereo camera input.

A joint position estimation system based on stereo camera input according to an aspect of the disclosed invention may include: a stereo image input unit configured to receive a first image acquired by a first camera and a second image acquired by a second camera; an image feature extraction unit configured to extract a first feature from the first image and a second feature from the second image using a deep learning model; a stereo attention block unit configured to generate first joint 2D position information and second joint 2D position information based on the first feature and the second feature; and a hand joint 3D position determination unit configured to determine a hand joint 3D position using a stereo deep learning model based on the first joint 2D position information and the second joint 2D position information.

The system may further include a heatmap block unit configured to output a first joint 2D position map based on the first feature and a second joint 2D position map based on the second feature.

Also, the stereo attention block unit may be configured to: generate a combined feature by combining the first feature and the second feature; and generate, based on the combined feature, a mask map whose components have values from 0 to 1 inclusive.

Also, the stereo attention block unit may be configured to: generate the first joint 2D position information based on the mask map and the first joint 2D position map; and generate the second joint 2D position information based on the mask map and the second joint 2D position map.

Also, the stereo attention block unit may be configured to generate same-joint disparity information based on the combined feature, and the hand joint 3D position determination unit may be configured to determine the hand joint 3D position based on the first joint 2D position information, the second joint 2D position information, and the same-joint disparity information.

Also, the stereo attention block unit may be configured to generate the same-joint disparity information based on the coordinates in the first image and the coordinates in the second image of the same object appearing in the first image and the second image.

Also, the stereo image input unit may be configured to receive a plurality of training image pairs, the stereo attention block unit may be configured to generate joint 2D position information and same-joint disparity information for the training image pairs, and the system may further include an artificial intelligence learning unit configured to train the stereo deep learning model based on the joint 2D position information and the same-joint disparity information of the training image pairs.

Also, the artificial intelligence learning unit may: calculate a first loss function based on the generated joint 2D position information of the training image pair and reference joint 2D position information of the training image pair; calculate a second loss function based on the generated same-joint disparity information of the training image pair and reference same-joint disparity information of the training image pair; calculate a third loss function based on the generated joint 2D position information of the training image pair and projected position information generated by projecting the 2D joint positions of the training image pair onto the opposite image; and calculate a combined loss function based on the first loss function, the second loss function, and the third loss function, and train the stereo deep learning model so that the combined loss function decreases.

The system may further include: a modified training image generation unit configured to generate a plurality of modified training images by changing the brightness of an input training image; and a training data generation unit configured to generate joint position information for each of the modified training images and to calculate reference same-joint disparity information by comparing the joint position information of the plurality of modified training images and reducing the differences among them.

According to an aspect of the disclosed invention, the 3D position of hand joints can be estimated more accurately than with conventional hand joint position estimation methods.

In addition, according to an embodiment of the present invention, accurate hand pose estimation is performed without special assumptions, so that a user in a stereo-camera-based wearable environment can carry out a variety of tasks without being restricted to particular types of task.

Finally, according to an embodiment of the present invention, the data needed to train the machine learning model used for hand joint 3D position estimation can be collected accurately and quickly, and the cost of data collection can be reduced.

FIG. 1 is a configuration diagram of a joint estimation system according to an embodiment.
FIG. 2 is a diagram illustrating a joint position estimation process according to an embodiment.
FIG. 3 is a diagram illustrating a process of determining a hand joint 3D position according to an embodiment.
FIG. 4 is a diagram illustrating the operation of a stereo attention block unit according to an embodiment.
FIG. 5 is a diagram illustrating same-joint disparity information according to an embodiment.
FIG. 6 is a diagram illustrating a process of generating training data according to an embodiment.
FIG. 7 is a flowchart of a joint position estimation method according to an embodiment.
FIG. 8 is a graph showing the degree to which a joint position estimation method according to an embodiment improves on conventional joint position estimation methods.

Like reference numerals designate like elements throughout the specification. This specification does not describe every element of the embodiments, and content that is general in the technical field to which the disclosed invention belongs, or that overlaps between embodiments, is omitted. The term 'unit' used in the specification may be implemented in software or hardware; depending on the embodiment, a plurality of 'units' may be implemented as a single component, or a single 'unit' may include a plurality of components.

In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Terms such as first and second are used to distinguish one component from another, and the components are not limited by these terms.

Singular expressions include plural expressions unless the context clearly indicates otherwise.

The reference symbols attached to the steps are used for convenience of description and do not indicate the order of the steps; the steps may be performed in an order different from the stated order unless the context clearly specifies a particular order.

Hereinafter, the working principle and embodiments of the disclosed invention are described with reference to the accompanying drawings.

FIG. 1 is a configuration diagram of a joint estimation system according to an embodiment, and FIG. 2 is a diagram illustrating a joint position estimation process according to an embodiment.

Referring to FIGS. 1 and 2, a joint position estimation system 100 according to an embodiment of the present invention may include a stereo image input unit 110, an image feature extraction unit 120, a stereo attention block unit 130, a hand joint 3D position determination unit 140, a heatmap block unit 150, an artificial intelligence learning unit 160, and a memory 170.

The wearable device 200 may be a device that a person can wear over the eyes like glasses. The wearable device 200 may include a first camera 201 and a second camera 202. The first camera 201 and the second camera 202 may be located near the user's left eye and right eye, respectively. That is, the first camera 201 and the second camera 202 may acquire image information from the positions from which the person's left eye and right eye look, respectively.

The stereo image input unit 110 may receive a first image 301 acquired by the first camera 201 and a second image 302 acquired by the second camera 202.

The first image 301 may be an image acquired by the first camera 201 near the user's left eye (left image), and the second image 302 may be an image acquired by the second camera 202 near the user's right eye (right image), but the invention is not limited thereto. Conversely, the first image 301 may be acquired near the user's right eye and the second image 302 near the user's left eye. Also, when the first camera 201 and the second camera 202 are arranged one above the other rather than side by side, the first image 301 and the second image 302 may be images acquired at a relatively upper and a relatively lower position, respectively, rather than at a relatively left and a relatively right position.

In summary, the first image 301 and the second image 302 may be images obtained by two cameras photographing a specific location, a specific area, or the same object from different positions at a given point in time.

The stereo image input unit 110 may receive the first image 301 and the second image 302 wirelessly from the first camera 201 and the second camera 202. In this case, the stereo image input unit 110 may communicate wirelessly with the wearable device 200 by any method, such as Wi-Fi, LTE, 4G, or 5G. The stereo image input unit 110 may also receive the first image 301 and the second image 302 from the wearable device 200 over a wired connection.

Meanwhile, the stereo image input unit 110 may extract a first region image from the first image 301 and a second region image from the second image 302. The first region image and the second region image may be cropped images in which the region containing the user's hand is extracted, in the form of a bounding box, from the first image 301 and the second image 302, respectively. The region containing the user's hand may be detected using an object detection deep learning model, for example YOLOv3, but the invention is not limited thereto.
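The cropping step can be illustrated with a minimal sketch that cuts a bounding box out of an image stored as a 2D list. In the source the box would come from an object detector such as YOLOv3; here the box is simply hard-coded, and the function name, the (x_min, y_min, x_max, y_max) convention with exclusive maxima, and the clamping behavior are assumptions.

```python
def crop(image, box):
    """Return the sub-image inside `box`, clamped to the image bounds.

    `image` is a 2D list of pixel values (rows of columns); `box` is
    (x_min, y_min, x_max, y_max) in pixels with exclusive maxima.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    x_min, y_min, x_max, y_max = box
    x_min, y_min = max(0, x_min), max(0, y_min)
    x_max, y_max = min(width, x_max), min(height, y_max)
    return [row[x_min:x_max] for row in image[y_min:y_max]]


# Hypothetical 4x4 image whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(4)] for r in range(4)]
print(crop(image, (1, 1, 3, 3)))  # prints "[[11, 12], [21, 22]]"
```

Joint coordinates predicted on a cropped region are relative to the crop, so the box offset (x_min, y_min) would have to be added back before comparing positions between the two full images.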

The image feature extraction unit 120 may extract features from an image. A feature of a given image may be information representing various characteristics of that image. For example, a feature may be information about the color, brightness, edges, and the like at each pixel of the image, but is not limited thereto.

The image feature extraction unit 120 may use a deep learning model 171 to extract a first feature 401 from the first image 301 and a second feature 402 from the second image 302. In this case, the image feature extraction unit 120 may extract the first feature 401 from the first region image and the second feature 402 from the second image 302. That is, the first feature 401 may be a feature of the first image 301, and the second feature 402 may be a feature of the second image 302.

Deep learning-based object detection may use a deep learning model 171 that has been trained in advance, on data, to extract features from images. A convolutional neural network (CNN) structure with multiple stacked convolution layers may be used to learn how to extract features from an image, but the invention is not limited thereto.

The stereo attention block unit 130 may generate first joint 2D position information 601 and second joint 2D position information 602 based on the first feature 401 and the second feature 402.

The first joint 2D position information 601 may be information on the planar coordinates at which the joints of the photographed hand are located in the planar first image 301. The second joint 2D position information 602 may be information on the planar coordinates at which the joints of the photographed hand are located in the planar second image 302. Even when the joint position information is generated for the same hand, the planar coordinates of the joints in the first image 301 and in the second image 302 may differ because the first camera 201 and the second camera 202 are at different positions.

The hand joint 3D position determination unit 140 may determine the hand joint 3D position (absolute 3D hand pose) using the stereo deep learning model 172, based on the first joint 2D position information 601 and the second joint 2D position information 602.

The stereo deep learning model 172 may be an artificial intelligence model trained by machine learning. The hand joint 3D position may be information on the coordinate values at which the joints of the photographed hand are located in three-dimensional space.

Each of the stereo image input unit 110, the image feature extraction unit 120, the stereo attention block unit 130, the hand joint 3D position determination unit 140, the heat map block unit 150, and the artificial intelligence learning unit 160 may include any one of a plurality of processors included in the joint position estimation system 100. In addition, the joint position estimation method according to the embodiments described so far and those described below may be implemented in the form of a program executable by a processor.

Here, the program may include program instructions, data files, data structures, and the like, alone or in combination. The program may be designed and written in machine code or high-level language code. The program may be specially designed to implement the joint position estimation method described above, or may be implemented using various functions or definitions that are known and available to those skilled in the field of computer software. The program for implementing the joint position estimation method may be recorded on a processor-readable recording medium. The recording medium may be the memory 170.

The memory 170 may store a program that performs the operations described above and below, and the stored program may be executed by the processor. When there are a plurality of processors and memories 170, they may be integrated on a single chip or provided at physically separate locations. The memory 170 may include volatile memory such as static random access memory (S-RAM) and dynamic random access memory (D-RAM) for temporarily storing data. In addition, the memory 170 may include non-volatile memory such as read-only memory (ROM), erasable programmable read-only memory (EPROM), and electrically erasable programmable read-only memory (EEPROM) for long-term storage of control programs and control data.

The processor may include various logic circuits and arithmetic circuits, may process data according to a program provided from the memory 170, and may generate a control signal according to the processing result.

FIG. 3 is a diagram illustrating a process of determining a hand joint 3D position according to an embodiment, and FIG. 4 is a diagram for explaining the operation of the stereo attention block unit according to an embodiment.

Referring to FIG. 3, a first feature (X_L) 401 and a second feature (X_R) 402 are extracted from the first image (left image) 301 and the second image (right image) 302, respectively. A first joint 2D position map (M_{2,L}) 501 is generated based on the first feature 401, and a second joint 2D position map (M_{2,R}) 502 is generated based on the second feature 402.

In addition, first joint 2D position information (2D left keypoints) 601, second joint 2D position information (2D right keypoints) 602, and same-joint disparity information (disparities) 603 are generated based on the first feature 401, the second feature 402, the first joint 2D position map 501, and the second joint 2D position map 502.

The heat map block unit 150 may output the first joint 2D position map 501 based on the first feature 401 and the second joint 2D position map 502 based on the second feature 402.

The first joint 2D position map 501 may be a feature map representing coordinate information of the joints of the hand captured in the first image 301. The second joint 2D position map 502 may be a feature map representing coordinate information of the joints of the hand captured in the second image 302.

In summary, the heat map block unit 150 may output, based on the first feature 401, a feature map of the positions where the joints of the hand are expected to be in the first image 301, and, based on the second feature 402, a feature map of the positions where the joints of the hand are expected to be in the second image 302.

Referring to FIGS. 3 and 4, the stereo attention block unit 130 may generate a combined feature 403 by combining the first feature 401 and the second feature 402. That is, the combined feature 403 may contain both the information related to the first image 301 and the information related to the second image 302. Generating the combined feature 403 may include a deconvolution step that increases the size of the feature.

The stereo attention block unit 130 may generate, based on the combined feature 403, a mask map (A(X_L) and A(X_R)) 503 whose components have values of 0 or more and 1 or less. The mask map 503 may be generated by scaling the components of the combined feature 403 so that they fall between 0 and 1. In summary, the components of the combined feature 403 may exceed 1, but the stereo attention block unit 130 scales the combined feature 403 to generate an attention mask in which each component has a value between 0 and 1.

The stereo attention block unit 130 may generate the first joint 2D position information 601 based on the mask map 503 and the first joint 2D position map 501, and may generate the second joint 2D position information 602 based on the mask map 503 and the second joint 2D position map 502.

Specifically, the stereo attention block unit 130 may determine the first joint 2D position information 601 as the product of the feature map for the first image 301 and the mask map 503, plus the feature map for the first image 301. Likewise, it may determine the second joint 2D position information 602 as the product of the feature map for the second image 302 and the mask map 503, plus the feature map for the second image 302.

[Equation 1]

M'_{2,R} = A(C(X_R, X_L)) \cdot M_{2,R}(X_R) + M_{2,R}(X_R)

Referring to [Equation 1], M'_{2,R} is the second joint 2D position information 602, A(C(X_R, X_L)) is the mask map 503, and M_{2,R}(X_R) is the second joint 2D position map 502. The second joint 2D position information 602 may be calculated by multiplying the second joint 2D position map 502, which is the feature map for the second image 302, by the mask map 503 and adding the second joint 2D position map 502 to the result. The first joint 2D position information 601 may be calculated in the same way.
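The computation described for [Equation 1] can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the scaling to the range 0 to 1 is a logistic sigmoid (the source only states that the components are scaled into that range), it assumes the multiplication is element-wise, and it uses plain nested lists in place of tensors. The function names are hypothetical.

```python
import math


def sigmoid_mask(combined):
    """Scale each component of a 2D combined feature into (0, 1).

    Assumption: a logistic sigmoid is used; the source only says the
    components are scaled to lie between 0 and 1.
    """
    return [[1.0 / (1.0 + math.exp(-value)) for value in row] for row in combined]


def apply_attention(position_map, mask):
    """Return mask * map + map, element by element (assumed element-wise)."""
    return [
        [a * m + m for a, m in zip(mask_row, map_row)]
        for mask_row, map_row in zip(mask, position_map)
    ]


# Toy 2x2 inputs, chosen only for illustration.
combined = [[0.0, 2.0], [-2.0, 0.0]]
position_map = [[1.0, 0.5], [0.0, 2.0]]

mask = sigmoid_mask(combined)
refined = apply_attention(position_map, mask)
```

Because the position map is added back after masking, a mask value of 0 leaves the original map unchanged instead of erasing it.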

The first joint 2D position information and the second joint 2D position information may be heat maps. That is, the stereo attention block unit 130 may generate two kinds of heat maps in the manner described above. A heat map may be a map for estimating the relative positions of the hand joints.

Specifically, the stereo attention block unit 130 may generate a heat map (M'_{2,L}) for the first image 301 based on the mask map 503 for the first image 301 and the first joint 2D position map 501, and may generate a heat map (M'_{2,R}) for the second image 302 based on the mask map 503 for the second image 302 and the second joint 2D position map 502.

A heat map may be expressed as a two-dimensional (2D) map. The stereo attention block unit 130 may generate as many heat maps for the first image 301 as there are joints, and as many heat maps for the second image 302 as there are joints. For example, if 21 joints are to be estimated, 21 2D maps may be created for the first image 301 and 21 2D maps for the second image 302. Each 2D map holds information on the probability that the joint is present at each location.

The stereo attention block unit 130 may generate a same-joint disparity map (D) based on the combined feature 403. As with the heat maps, one same-joint disparity map may be created per joint.

According to an embodiment, a calculation such as soft-argmax may be applied to the heat maps obtained in the manner described above, yielding actual estimated values rather than probability values. In this case, the 2D coordinates of a hand joint may be expressed as the same location on the heat map.
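The soft-argmax step mentioned above can be sketched as follows. This is a generic formulation of soft-argmax under stated assumptions, not the source's exact calculation: the heat map is a nested list of scores, `u` indexes columns and `v` indexes rows, and `beta` is a hypothetical sharpness parameter that does not appear in the source.

```python
import math


def soft_argmax_2d(heat_map, beta=1.0):
    """Return the expected (u, v) coordinate under softmax(beta * heat_map).

    heat_map is a list of rows; u is the column index and v is the row index.
    beta is a hypothetical sharpness parameter, not taken from the source.
    """
    peak = max(max(row) for row in heat_map)  # subtracted for numerical stability
    weights = [[math.exp(beta * (value - peak)) for value in row] for row in heat_map]
    total = sum(sum(row) for row in weights)
    u = sum(w * col for row in weights for col, w in enumerate(row)) / total
    v = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return u, v


# A uniform map has its expected position at the centre of the grid.
uniform_u, uniform_v = soft_argmax_2d([[0.0] * 3 for _ in range(3)])

# A sharp peak at column 2, row 0 pulls the estimate to that cell.
peaked_u, peaked_v = soft_argmax_2d(
    [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], beta=50.0
)
```

Unlike a hard argmax, this weighted average is differentiable and returns sub-pixel coordinates, which is why it is commonly used to turn a heat map into a coordinate estimate.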

The stereo attention block unit 130 may generate same-joint disparity information (D) 603 based on the combined feature 403. The same-joint disparity information 603 is used to estimate the disparity between the two stereo images and may be used when generating depth information. One map of same-joint disparity information 603 may also be generated per joint.

FIG. 5 is a diagram for explaining same-joint disparity information according to an embodiment.

Referring to FIG. 5, another two-dimensional map (a 2D disparity map) may be used to estimate the disparity between the two stereo images.

The stereo attention block unit 130 may generate the same-joint disparity information 603 based on the coordinates, in the first image 301 and in the second image 302, of the same object appearing in both images. That is, the stereo attention block unit 130 may generate the 2D disparity map using binocular disparity (the difference in position of an object as seen in the two images).

[Equation 2]

Specifically, the u_L position value may be used for the x-axis and the u_R value for the y-axis. For example, if (u_L, v_L) is (2,3) and (u_R, v_R) is (4,5), the joint is represented as (2,4) in the 2D disparity map. The relative disparity is calculated from this value. The relative disparity is a disparity whose magnitude has been reduced, and it may be calculated using [Equation 2]. Here, w_b and the other constant appearing in [Equation 2] may be arbitrary values.
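The indexing rule in the example above can be sketched as follows. The function name is hypothetical, and the sketch covers only the mapping of a joint to its cell in the 2D disparity map; it does not reproduce [Equation 2].

```python
def disparity_map_cell(left_uv, right_uv):
    """Map one joint seen in both images to its (x, y) cell in the 2D disparity map.

    The x-axis takes the horizontal coordinate from the left image (u_L) and
    the y-axis takes the horizontal coordinate from the right image (u_R).
    The vertical coordinates are not used for the cell position.
    """
    u_left, _v_left = left_uv
    u_right, _v_right = right_uv
    return (u_left, u_right)


# The worked example from the text: (u_L, v_L) = (2, 3) and (u_R, v_R) = (4, 5).
cell = disparity_map_cell((2, 3), (4, 5))
```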

Referring again to FIGS. 3 and 4, the hand joint 3D position determination unit 140 may determine the hand joint 3D position based on the first joint 2D position information 601, the second joint 2D position information 602, and the same-joint disparity information 603.

Specifically, the hand joint 3D position determination unit 140 may generate information on the absolute 3D joint positions through [Equation 3], [Equation 4], and [Equation 5], based on the first joint 2D position information 601 and the second joint 2D position information 602, which are information on the relative joint positions.

[Equation 3]

Referring to [Equation 3], one of the two joint positions has its origin at the upper right corner of the original image, and the other may be the position whose origin has been adjusted to the upper right corner of the image cropped by hand region detection.

[Equation 4]

Referring to [Equation 4], the absolute disparity may be calculated from the relative disparity by reusing, unchanged, the w_b and the other constant that were used to calculate the relative disparity of the stereo images.

Thereafter, the hand joint 3D position determination unit 140 may obtain the absolute joint position, that is, the hand joint 3D position, using the generated information on the absolute 3D joint positions and the absolute disparity value. The absolute hand joint 3D position is three-dimensional: a depth dimension is added to the existing (x, y). The process of obtaining the hand joint 3D position may use a 3D position reconstruction method commonly used in stereo vision.

[Equation 5]

In [Equation 5], the first pair of parameters is the focal length, the second pair is the principal point, and B may be the distance between the first camera 201 and the second camera 202.
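For orientation, the 3D reconstruction commonly used with rectified stereo cameras can be sketched as follows. This is the textbook pinhole formulation, offered as an assumption about the general shape of the step and not as a reproduction of [Equation 5]. The parameter names `fx`, `fy`, `cx`, `cy`, and `baseline` are hypothetical labels for the focal length, principal point, and camera separation B, and the numeric values are made up.

```python
def triangulate_rectified(u, v, disparity, fx, fy, cx, cy, baseline):
    """Recover (X, Y, Z) in the camera frame from a pixel and its disparity.

    Assumes rectified cameras, so depth follows Z = fx * baseline / disparity.
    """
    if disparity <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    z = fx * baseline / disparity
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z


# Made-up calibration values, for illustration only.
point = triangulate_rectified(
    u=400.0, v=300.0, disparity=50.0,
    fx=500.0, fy=500.0, cx=320.0, cy=240.0, baseline=0.06,
)
```

With these values the depth is 500 × 0.06 / 50 = 0.6 in the units of the baseline, and the lateral coordinates follow by back-projecting the pixel offset from the principal point.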

After all of the above steps, the hand joint 3D position determination unit 140 can finally determine 3D absolute joint positions for the right and left images. Because this yields two values, the hand joint 3D position determination unit 140 may average them to obtain a single hand joint 3D position.

Meanwhile, the hand joint 3D position may be found by the hand joint 3D position determination unit 140 using the stereo deep learning model 172. Therefore, determining the hand joint 3D position according to an embodiment requires training the stereo deep learning model 172 in advance by machine learning on training images.

Referring again to FIG. 1, the stereo image input unit 110 may receive a plurality of training images. In this case, the stereo image input unit 110 may receive training image pairs, each consisting of images of the same object or the same background captured by the first camera 201 and the second camera 202, respectively.

The stereo attention block unit 130 may generate joint 2D position information and same-joint disparity information 603 for a training image pair. It may do so in the same manner that is used, outside the training stage, to find the hand joint 3D position for an ordinary first image 301 and second image 302.

The artificial intelligence learning unit 160 may train the stereo deep learning model 172 based on the joint 2D position information and the same-joint disparity information 603 of the training image pair.

The artificial intelligence learning unit 160 may compute a first loss function, a second loss function, and a third loss function, and may train the stereo deep learning model 172 based on the three loss functions.

The artificial intelligence learning unit 160 may compute the first loss function based on the joint 2D position information generated for the training image pair and the reference joint 2D position information of the training image pair.

[Equation 6]

Referring to [Equation 6], the first loss function is used to minimize the difference between the estimated relative 2D joint positions of a training image pair and the predetermined ground-truth relative 2D joint positions of that pair. The 2D joint positions may be 2D pixel coordinates. In [Equation 6], one term is the estimated relative position of the j-th joint, and the other may be the ground-truth relative position of the corresponding j-th joint. [symbol missing] may consist of [symbol missing].

The artificial intelligence learning unit 160 may compute the second loss function based on the same-joint disparity information 603 generated for the training image pair and the reference same-joint disparity information 603 of the training image pair.

[Equation 7]

Referring to [Equation 7], the second loss function is used to minimize the difference between the estimated relative disparity and the ground-truth relative disparity. In [Equation 7], one term is the relative disparity of the j-th joint, and the other may be the corresponding ground-truth disparity.

The artificial intelligence learning unit 160 may compute the third loss function based on the joint 2D position information generated for the training image pair and projected position information generated by projecting the 2D joint positions of the training image pair onto the opposite image.

[Equation 8]

Referring to [Equation 8], the third loss function may be used to minimize the difference that arises when the 3D joint position estimated from the left image is projected onto the right image. If the 3D joint position has been estimated correctly, the difference should be 0 when the joint position estimated from the left image is projected onto the right image. The two projection matrices in [Equation 8] may be used to project from the left image to the right image and from the right image to the left image, respectively. The remaining two terms may be the absolute positions of the j-th hand joint for the left and right images.
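The idea behind the third loss function can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of [Equation 8]: it assumes a 3x4 pinhole projection matrix, a squared Euclidean pixel distance, and a right camera displaced by the baseline along the x-axis of the left camera frame. All names and numeric values are hypothetical.

```python
def project(matrix, point_3d):
    """Project a 3D point with a 3x4 projection matrix and return pixel (u, v)."""
    x, y, z = point_3d
    homogeneous = (x, y, z, 1.0)
    u_h, v_h, w_h = (
        sum(m * p for m, p in zip(row, homogeneous)) for row in matrix
    )
    return u_h / w_h, v_h / w_h


def consistency_loss(projection_to_right, joints_3d_left, joints_2d_right):
    """Sum of squared pixel distances between projected and estimated joints.

    Assumption: squared Euclidean distance; the source does not state the norm.
    """
    total = 0.0
    for joint_3d, (u_ref, v_ref) in zip(joints_3d_left, joints_2d_right):
        u, v = project(projection_to_right, joint_3d)
        total += (u - u_ref) ** 2 + (v - v_ref) ** 2
    return total


# Made-up calibration: the right camera sits `baseline` along +x of the left one.
fx = fy = 500.0
cx, cy = 320.0, 240.0
baseline = 0.06
projection_to_right = [
    [fx, 0.0, cx, -fx * baseline],
    [0.0, fy, cy, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

joints_3d_left = [(0.096, 0.072, 0.6)]   # one joint, left camera frame
consistent_loss = consistency_loss(projection_to_right, joints_3d_left, [(350.0, 300.0)])
shifted_loss = consistency_loss(projection_to_right, joints_3d_left, [(351.0, 300.0)])
```

When the 3D estimate agrees with the right-image estimate the loss is zero, and a one-pixel disagreement contributes a squared error of about one.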

The artificial intelligence learning unit 160 may compute an integrated loss function based on the first loss function, the second loss function, and the third loss function.

[Equation 9]

Referring to [Equation 9], the integrated loss function may be obtained by summing the first loss function, the second loss function, and the third loss function, each multiplied by an arbitrary value (α, β, or γ).
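The structure described for [Equation 9] can be written out as follows. The symbols L_1, L_2, L_3, and L_total are hypothetical stand-ins for the first, second, third, and integrated loss functions, since the source's own symbols are not available here; only the weights α, β, and γ come from the text.

```latex
L_{\text{total}} = \alpha \, L_{1} + \beta \, L_{2} + \gamma \, L_{3}
```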

The artificial intelligence learning unit 160 may train the stereo deep learning model 172 through iterative machine learning so that the integrated loss function decreases. The trained stereo deep learning model 172 may be stored in the memory 170.

Machine learning may mean using a model composed of many parameters and optimizing those parameters with given data. Depending on the form of the learning problem, machine learning may include supervised learning, unsupervised learning, and reinforcement learning. Supervised learning learns the mapping between inputs and outputs and can be applied when input-output pairs are given as data. Unsupervised learning is applied when there are only inputs and no outputs, and it can discover regularities among the inputs. However, machine learning according to an embodiment is not necessarily limited to the learning methods described above.

The machine learning unit may train the stereo deep learning model 172 in various ways. For example, the machine learning unit may learn features extracted from a plurality of training images using a deep learning-based method. A convolutional neural network (CNN) structure, in which several convolution layers are stacked, may be used to learn how to extract features related to the 3D hand joint position from an image, but the training method of the machine learning unit is not necessarily limited to one that uses a CNN structure.

Meanwhile, to carry out machine learning in the manner described above, the ground-truth relative joint positions and the ground-truth disparity need to be determined in advance as a reference for each training image pair.

FIG. 6 is a diagram for explaining a process of generating training data according to an embodiment.

Referring to FIG. 6, when a training image is input, 3D hand joint position candidate data may first be generated. To generate the 3D hand joint position candidate data, the 2D hand joint positions in the left image and the right image may be estimated and then converted to 3D. For the 2D hand joint position estimation, a method that generates several candidates while adjusting the brightness may be used. The number of candidates may be chosen arbitrarily.

The joint position estimation system 100 may further include a modified training image generation unit and a training data generation unit.

The modified training image generation unit may generate a plurality of modified training images by changing the brightness of an input training image. That is, for any one training image pair, a plurality of modified training images that differ only in brightness may be generated.
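Generating brightness variants of a training image can be sketched as follows. This is a minimal illustration that assumes brightness is changed by multiplying pixel values by a factor and clipping to the 8-bit range; the source only states that the brightness is changed. The function name and the factor values are hypothetical.

```python
def brightness_variants(image, factors):
    """Return one copy of a grayscale image per factor.

    Each pixel is multiplied by the factor, rounded, and clipped to 0..255.
    Assumption: multiplicative scaling; the source does not specify the method.
    """
    variants = []
    for factor in factors:
        variants.append(
            [[min(255, max(0, round(pixel * factor))) for pixel in row] for row in image]
        )
    return variants


# Toy 2x2 grayscale image, for illustration only.
image = [[100, 200], [0, 255]]
variants = brightness_variants(image, factors=[0.5, 1.0, 1.5])
```

Each variant shows the same hand pose, so joint estimates obtained from the variants can be treated as multiple candidates for the same ground truth.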

The training data generation unit may generate joint position information for each modified training image.

The training data generation unit may estimate the optimal joint positions based on the joint position information of the plurality of modified training images. To estimate the optimal joint positions, the training data generation unit may calculate, as in [Equation 10], the difference between the joint position information of each modified image and the joint position information of a reference modified image. The reference modified image may be selected arbitrarily from among the plurality of modified images.

[Equation 10]

[Equation 11]

Based on the calculated differences in joint position information, the training data generation unit may determine the optimal joint positions for any one training image pair through an optimization algorithm, as shown in [Equation 11].

That is, the training data generation unit compares the joint position information of the plurality of modified training images and reduces the differences among them, thereby producing the ground-truth values needed for the machine learning process: the reference joint 2D position information and the reference same-joint disparity information 603.
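The candidate comparison described above can be sketched as follows. [Equation 10] and [Equation 11] are not available here, so this is only an assumed stand-in: the first function measures how far each candidate pose is from a reference candidate, and the second fuses the candidates by taking the per-joint mean, which is the point that minimizes the sum of squared distances to the candidates. The source's actual optimization algorithm may differ. All names and values are hypothetical.

```python
import math


def distances_to_reference(candidates, reference_index=0):
    """Summed per-joint Euclidean distance from each candidate pose to the reference."""
    reference = candidates[reference_index]
    return [
        sum(math.dist(joint, ref_joint) for joint, ref_joint in zip(pose, reference))
        for pose in candidates
    ]


def fuse_candidates(candidates):
    """Per-joint mean over all candidate poses (assumed stand-in for the optimization)."""
    count = len(candidates)
    return [
        tuple(sum(coords) / count for coords in zip(*joint_candidates))
        for joint_candidates in zip(*candidates)
    ]


# Three candidate poses with two joints each, made up for illustration.
candidates = [
    [(10.0, 20.0), (30.0, 40.0)],
    [(13.0, 24.0), (30.0, 40.0)],
    [(10.0, 20.0), (33.0, 44.0)],
]
differences = distances_to_reference(candidates)
fused = fuse_candidates(candidates)
```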

FIG. 7 is a flowchart of a joint position estimation method according to an embodiment. This is only a preferred embodiment for achieving the object of the present invention, and some steps may of course be added or removed as needed.

도 7을 참조하면, 스테레오 이미지 입력부(110)는 제1 카메라(201)에 의해 획득된 제1 이미지(301) 및 제2 카메라(202)에 의해 획득된 제2 이미지(302)를 입력받을 수 있다(1001).Referring to FIG. 7 , the stereo image input unit 110 may receive a first image 301 obtained by a first camera 201 and a second image 302 obtained by a second camera 202. Yes (1001).

이미지 특징 추출부(120)는 딥러닝 모델(171)을 이용하여 제1 이미지(301)로부터 제1 특징(401)을 추출하고, 제2 이미지(302)로부터 제2 특징(402)을 추출할 수 있다(1002).The image feature extractor 120 extracts a first feature 401 from the first image 301 and a second feature 402 from the second image 302 using the deep learning model 171. can (1002).

히트맵 블록부(150)는 제1 특징(401)을 기초로 제1 관절 2D 위치 맵(501)을 출력하고, 제2 특징(402)을 기초로 제2 관절 2D 위치 맵(502)을 출력할 수 있다(1003).The heat map block unit 150 outputs a first joint 2D position map 501 based on the first feature 401 and outputs a second joint 2D position map 502 based on the second feature 402. can (1003).

스테레오 어텐션 블록부(130)는 제1 특징(401) 및 제2 특징(402)을 기초로 결합 특징(403)을 생성하고, 결합 특징(403)을 기초로 마스크 맵(503)을 생성할 수 있다(1004).The stereo attention block unit 130 may generate a combined feature 403 based on the first feature 401 and the second feature 402, and generate a mask map 503 based on the combined feature 403. Yes (1004).

스테레오 어텐션 블록부(130)는 마스크 맵(503) 및 제1 관절 2D 위치 맵(501)을 기초로 제1 관절 2D 위치 정보(601)를 생성하고, 마스크 맵(503) 및 제2 관절 2D 위치 맵(502)을 기초로 제2 관절 2D 위치 정보(602)를 생성할 수 있다(1005).The stereo attention block unit 130 generates first joint 2D position information 601 based on the mask map 503 and the first joint 2D position map 501, and the mask map 503 and the second joint 2D position Second joint 2D position information 602 may be generated based on the map 502 (1005).

The stereo attention block unit 130 may generate same-joint parallax information 603 based on the combined feature 403 (1006).
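For orientation, the quantity involved in step 1006 can be pictured as the horizontal offset of the same joint between the two images. The sketch below computes that offset directly from known coordinates. This is only a definition-style illustration under the assumption of horizontally aligned (rectified) cameras; in the described method the value is produced from the combined feature, and the name `same_joint_parallax` is hypothetical.

```python
def same_joint_parallax(first_joints, second_joints):
    """Horizontal offset of each joint between the two views.

    Both arguments are lists of (x, y) pixel coordinates, with the
    same joint stored at the same index in each list.
    """
    return [x1 - x2 for (x1, _), (x2, _) in zip(first_joints, second_joints)]


first_joints = [(120.0, 80.0), (140.0, 95.0)]
second_joints = [(100.0, 80.0), (125.0, 95.0)]
parallax = same_joint_parallax(first_joints, second_joints)
```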

The hand joint 3D position determination unit 140 may determine the hand joint 3D position based on the first joint 2D position information 601, the second joint 2D position information 602, and the same-joint parallax information 603 (1007).
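To show why 2D positions plus parallax are sufficient to recover depth, the sketch below uses classical pinhole triangulation for a rectified stereo pair: depth is focal length times baseline divided by parallax. This is a swapped-in textbook technique, not the learned stereo deep learning model that the described method uses for step 1007. The camera parameters (`focal_px`, `baseline_mm`, `cx`, `cy`) are hypothetical values chosen for the example.

```python
def triangulate(u, v, parallax, focal_px, baseline_mm, cx, cy):
    """Recover a 3D point (in mm) from one view's pixel and its parallax.

    Assumes rectified cameras with identical intrinsics:
      Z = focal * baseline / parallax
      X = (u - cx) * Z / focal
      Y = (v - cy) * Z / focal
    """
    if parallax <= 0:
        raise ValueError("parallax must be positive for a point in front of the cameras")
    z = focal_px * baseline_mm / parallax
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return x, y, z


# Hypothetical camera: 500 px focal length, 60 mm baseline,
# principal point at (320, 240).
point = triangulate(u=370.0, v=240.0, parallax=100.0,
                    focal_px=500.0, baseline_mm=60.0, cx=320.0, cy=240.0)
```

Because depth is inversely proportional to parallax, small parallax errors matter more for distant joints than for near ones.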

To verify the performance of the joint position estimation method according to an embodiment of the present invention, an experiment was conducted on images acquired with a pair of stereo cameras.

FIG. 8 is a graph showing the degree to which the joint position estimation method according to an embodiment improves on conventional joint position estimation methods.

Referring to FIG. 8, it can be seen that the joint position estimation methods according to an embodiment (StereoNet, StereoDMap) produce fewer errors than the conventional methods (ResNet, AttentionNet, baseline, etc.).

Specifically, the x-axis value (error threshold) of each graph represents the difference between the ground-truth 3D hand joint position and the estimated 3D hand joint position. The y-axis represents how frequently each such difference occurs, that is, the frequency of occurrence for each degree of error. For example, with the joint position estimation method (StereoNet), errors of 10 mm or less occur with a frequency of about 30%, whereas with the conventional methods the frequency of errors of 10 mm or less is below 15%.
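A curve of this kind can be computed as sketched below: measure the Euclidean distance between each estimated and ground-truth 3D joint, then report the fraction of joints whose distance is at or below a given threshold. Reading the y-axis as a cumulative fraction is an assumption based on the "10 mm or less" example above, and the sample coordinates are invented for illustration, not experimental data.

```python
import math


def joint_errors(predicted, reference):
    """Euclidean distance (mm) between matching 3D joint positions."""
    return [math.dist(p, r) for p, r in zip(predicted, reference)]


def fraction_within(errors, threshold_mm):
    """Fraction of joints whose error is at or below the threshold."""
    return sum(e <= threshold_mm for e in errors) / len(errors)


# Four made-up joints with errors of 5, 12, 8 and 30 mm.
reference = [(0.0, 0.0, 0.0)] * 4
predicted = [(0.0, 0.0, 5.0), (0.0, 12.0, 0.0), (8.0, 0.0, 0.0), (0.0, 0.0, 30.0)]

errors = joint_errors(predicted, reference)
curve = [(t, fraction_within(errors, t)) for t in (10, 20, 30)]
```

Evaluating `fraction_within` over a range of thresholds yields one point per threshold, and a method whose curve lies higher at a given threshold has more joints within that error.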

In summary, it can be confirmed that the joint position estimation methods according to an embodiment (StereoNet, StereoDMap) yield a smaller difference between the ground-truth and estimated 3D hand joint positions than the conventional methods.

The disclosed embodiments have been described above with reference to the accompanying drawings. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be practiced in forms different from the disclosed embodiments without changing its technical spirit or essential features. The disclosed embodiments are illustrative and should not be construed as limiting.

100: joint position estimation system
110: stereo image input unit
120: image feature extraction unit
130: stereo attention block unit
140: hand joint 3D position determination unit
150: heat map block unit
160: artificial intelligence learning unit
170: memory
171: deep learning model
172: stereo deep learning model
200: wearable device
201: first camera
202: second camera
301: first image
302: second image
401: first feature
402: second feature
403: combined feature
501: first joint 2D position map
502: second joint 2D position map
503: mask map
601: first joint 2D position information
602: second joint 2D position information
603: same-joint parallax information

Claims

A joint position estimation method based on stereo camera input, comprising:
(a) receiving, by a stereo image input unit, a first image acquired by a first camera and a second image acquired by a second camera;
(b) extracting, by an image feature extraction unit, a first feature from the first image and a second feature from the second image using a deep learning model;
(c) generating, by a stereo attention block unit, first joint 2D position information and second joint 2D position information based on the first feature and the second feature; and
(d) determining, by a hand joint 3D position determination unit, a hand joint 3D position using a stereo deep learning model based on the first joint 2D position information and the second joint 2D position information.

The method of claim 1, further comprising:
outputting, by a heat map block unit, a first joint 2D position map based on the first feature and a second joint 2D position map based on the second feature.

The method of claim 2,
wherein step (c) comprises:
generating, by the stereo attention block unit, a combined feature by combining the first feature and the second feature; and
generating, by the stereo attention block unit, a mask map whose components have values of 0 or more and 1 or less, based on the combined feature.

The method of claim 3,
wherein step (c) comprises:
generating, by the stereo attention block unit, the first joint 2D position information based on the mask map and the first joint 2D position map, and the second joint 2D position information based on the mask map and the second joint 2D position map.

The method of claim 4,
wherein step (c) comprises:
generating, by the stereo attention block unit, same-joint parallax information based on the combined feature; and
wherein step (d) comprises:
determining, by the hand joint 3D position determination unit, the hand joint 3D position based on the first joint 2D position information, the second joint 2D position information, and the same-joint parallax information.

The method of claim 5,
wherein generating the same-joint parallax information comprises:
generating, by the stereo attention block unit, the same-joint parallax information based on the coordinates in the first image and the coordinates in the second image of the same object appearing in the first image and the second image.

The method of claim 6, further comprising:
receiving, by the stereo image input unit, a plurality of training image pairs;
generating, by the stereo attention block unit, joint 2D position information and same-joint parallax information of the training image pair; and
training, by an artificial intelligence learning unit, the stereo deep learning model based on the joint 2D position information and the same-joint parallax information of the training image pair.

The method of claim 7,
wherein training the stereo deep learning model comprises:
calculating, by the artificial intelligence learning unit, a first loss function based on the joint 2D position information of the training image pair and reference joint 2D position information of the training image pair;
calculating, by the artificial intelligence learning unit, a second loss function based on the same-joint parallax information of the training image pair and reference same-joint parallax information of the training image pair;
calculating, by the artificial intelligence learning unit, a third loss function based on the joint 2D position information of the training image pair and projected position information that the artificial intelligence learning unit generates by projecting the joint 2D positions of the training image pair onto the opposite image; and
calculating, by the artificial intelligence learning unit, an integrated loss function based on the first loss function, the second loss function, and the third loss function, and training the stereo deep learning model so that the integrated loss function decreases.
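The three losses recited above can be pictured with the following minimal sketch. It assumes mean squared error for every term, a weighted sum as the integrated loss, and a projection that shifts first-image x-coordinates by the parallax to obtain positions in the second image. None of these choices is stated in the claim, and only x-coordinates are used to keep the example short. All function and parameter names are hypothetical.

```python
def mean_squared_error(a, b):
    """Average squared difference between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def project_to_opposite(first_xs, parallax):
    """Shift first-image x-coordinates by the parallax (assumed projection)."""
    return [x - d for x, d in zip(first_xs, parallax)]


def integrated_loss(pred_first_xs, pred_second_xs, pred_parallax,
                    ref_first_xs, ref_second_xs, ref_parallax,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the 2D, parallax, and projection-consistency losses."""
    # First loss: predicted 2D positions vs. reference 2D positions.
    loss_2d = mean_squared_error(pred_first_xs + pred_second_xs,
                                 ref_first_xs + ref_second_xs)
    # Second loss: predicted parallax vs. reference parallax.
    loss_parallax = mean_squared_error(pred_parallax, ref_parallax)
    # Third loss: first-view joints projected into the second view
    # should agree with the second-view predictions.
    projected = project_to_opposite(pred_first_xs, pred_parallax)
    loss_projection = mean_squared_error(projected, pred_second_xs)
    w1, w2, w3 = weights
    return w1 * loss_2d + w2 * loss_parallax + w3 * loss_projection


ref_first, ref_second, ref_parallax = [120.0, 140.0], [100.0, 125.0], [20.0, 15.0]
perfect = integrated_loss(ref_first, ref_second, ref_parallax,
                          ref_first, ref_second, ref_parallax)
off_by_two = integrated_loss(ref_first, ref_second, [22.0, 15.0],
                             ref_first, ref_second, ref_parallax)
```

With a parallax error of 2 px on one joint, both the parallax term and the projection term rise, which illustrates how the third term ties the two views together.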

The method of claim 8, further comprising:
generating, by a changed training image generation unit, a plurality of changed training images by changing the brightness of an input training image; and
generating, by a training data generation unit, joint position information for each of the changed training images, and calculating the reference joint 2D position information and the reference same-joint parallax information by comparing the joint position information of the plurality of changed training images and reducing the differences among them.
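A simple way to picture this step is sketched below: scale pixel intensities to create brightness variants, estimate a joint position on each variant, and take the mean as the reference, since the mean is the single value that minimizes the summed squared differences to the individual estimates. The averaging rule, the brightness factors, and the names `change_brightness` and `consensus_position` are assumptions for this example. The per-variant estimates are made-up numbers standing in for model outputs.

```python
def change_brightness(image, factor):
    """Scale 8-bit pixel intensities and clamp them to [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def consensus_position(positions):
    """Average per-variant (x, y) estimates into one reference position."""
    n = len(positions)
    return (sum(x for x, _ in positions) / n,
            sum(y for _, y in positions) / n)


image = [[100, 200]]
variants = [change_brightness(image, f) for f in (0.5, 1.0, 1.5)]

# Hypothetical joint estimates obtained on the three variants.
estimates = [(10.0, 20.0), (12.0, 22.0), (14.0, 24.0)]
reference = consensus_position(estimates)
```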

A computer program stored in a computer-readable recording medium to execute the joint position estimation method based on stereo camera input according to any one of claims 1 to 9.

A joint position estimation system based on stereo camera input, comprising:
a stereo image input unit configured to receive a first image acquired by a first camera and a second image acquired by a second camera;
an image feature extraction unit configured to extract a first feature from the first image and a second feature from the second image using a deep learning model;
a stereo attention block unit configured to generate first joint 2D position information and second joint 2D position information based on the first feature and the second feature; and
a hand joint 3D position determination unit configured to determine a hand joint 3D position using a stereo deep learning model based on the first joint 2D position information and the second joint 2D position information.

The system of claim 11, further comprising:
a heat map block unit configured to output a first joint 2D position map based on the first feature and a second joint 2D position map based on the second feature.

The system of claim 12,
wherein the stereo attention block unit is configured to:
generate a combined feature by combining the first feature and the second feature; and
generate a mask map whose components have values of 0 or more and 1 or less, based on the combined feature.

The system of claim 13,
wherein the stereo attention block unit is configured to:
generate the first joint 2D position information based on the mask map and the first joint 2D position map; and
generate the second joint 2D position information based on the mask map and the second joint 2D position map.

The system of claim 14,
wherein the stereo attention block unit is configured to generate same-joint parallax information based on the combined feature, and
wherein the hand joint 3D position determination unit is
configured to determine the hand joint 3D position based on the first joint 2D position information, the second joint 2D position information, and the same-joint parallax information.

The system of claim 15,
wherein the stereo attention block unit is
configured to generate the same-joint parallax information based on the coordinates in the first image and the coordinates in the second image of the same object appearing in the first image and the second image.

The system of claim 16,
wherein the stereo image input unit is
configured to receive a plurality of training image pairs,
wherein the stereo attention block unit is
configured to generate joint 2D position information and same-joint parallax information of the training image pair, and
wherein the system further comprises an artificial intelligence learning unit configured to train the stereo deep learning model based on the joint 2D position information and the same-joint parallax information of the training image pair.

The system of claim 17,
wherein the artificial intelligence learning unit is configured to:
calculate a first loss function based on the generated joint 2D position information of the training image pair and reference joint 2D position information of the training image pair;
calculate a second loss function based on the generated same-joint parallax information of the training image pair and reference same-joint parallax information of the training image pair;
calculate a third loss function based on the generated joint 2D position information of the training image pair and projected position information generated by projecting the joint 2D positions of the training image pair onto the opposite image; and
calculate an integrated loss function based on the first loss function, the second loss function, and the third loss function, and train the stereo deep learning model so that the integrated loss function decreases.

The system of claim 18, further comprising:
a changed training image generation unit configured to generate a plurality of changed training images by changing the brightness of an input training image; and
a training data generation unit configured to generate joint position information for each of the changed training images, and to calculate the reference joint 2D position information and the reference same-joint parallax information by comparing the joint position information of the plurality of changed training images and reducing the differences among them.