KR20240018122A

KR20240018122A - Apparatus and Method for Detecting Pointing Gesture

Info

Publication number: KR20240018122A
Application number: KR1020220095990A
Authority: KR
Inventors: 유철환; 유장희; 김호원; 한병옥
Original assignee: 한국전자통신연구원
Priority date: 2022-08-02
Filing date: 2022-08-02
Publication date: 2024-02-13

Abstract

An apparatus and method for detecting a pointing gesture are disclosed. A pointing gesture detection apparatus according to an embodiment of the present invention includes a memory in which at least one program is recorded and a processor that executes the program, wherein the program may perform the steps of: predicting a three-dimensional (3D) posture of a user's body from an input image; detecting a hand region and an arm region using the predicted posture information; and inferring, based on deep learning, whether a pointing gesture is being performed, using each of the detected hand region and arm region as input.

Description

Apparatus and Method for Detecting Pointing Gesture

The described embodiments relate to technology for recognizing pointing gestures.

In computer vision and image processing, pointing-related technologies that use RGB or RGBD cameras can be applied to tasks such as robot manipulation and virtual cursors.

Conventional pointing gesture recognition technologies predict the 3D joint coordinates of the user's body and then determine which object in 3D space the user is pointing at by computing the intersection, in 3D space, of a ray connecting joint coordinates, such as the line from the elbow to the hand or the line from the eye to the index finger.
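
The ray-intersection idea behind these conventional methods can be sketched as follows. This is only an illustration under stated assumptions: the pointed-at target is modeled as a plane (for example, a screen), and the function name `pointing_ray_hits_plane` and its parameters are hypothetical, not taken from any cited work.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def _dot(a: Vec3, b: Vec3) -> float:
    return sum(x * y for x, y in zip(a, b))


def pointing_ray_hits_plane(
    origin: Vec3,        # ray start, e.g. the elbow (or eye) joint
    through: Vec3,       # second joint on the ray, e.g. the hand (or index finger)
    plane_point: Vec3,   # any point on the target plane (assumed target model)
    plane_normal: Vec3,  # normal vector of the target plane
    eps: float = 1e-9,
) -> Optional[Vec3]:
    """Return the point where the joint-to-joint ray meets the plane, or None."""
    direction = tuple(t - o for o, t in zip(origin, through))
    denom = _dot(direction, plane_normal)
    if abs(denom) < eps:
        return None  # ray is parallel to the plane
    offset = tuple(p - o for o, p in zip(origin, plane_point))
    t = _dot(offset, plane_normal) / denom
    if t < 0:
        return None  # plane lies behind the ray origin
    return tuple(o + t * d for o, d in zip(origin, direction))
```

Note that such a routine returns an intersection whenever the geometry allows it, regardless of whether the user actually intends to point, which is the limitation the following paragraphs discuss.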

However, most conventional techniques operate under the implicit assumption that the user is performing a pointing gesture.

That is, the pointing direction prediction algorithm always runs without determining whether a pointing gesture is present, which causes inefficiency through unnecessary computation and power consumption for predicting a pointing direction even when no pointing gesture is being made.

In addition, because a pointing direction prediction result is output even when no pointing gesture is being made, the user may be exposed to inconvenience, and even danger, from unwanted malfunctions.

For example, when a menu or icon is selected on a kiosk or smart TV through non-contact sensing based on pointing direction prediction, malfunctions caused by a pointing direction prediction algorithm that runs even when no pointing gesture is being performed can reduce the usability of the product. Moreover, malfunctions in industrial applications, such as infotainment operation through the HUD of an autonomous vehicle or drone operation, risk leading to serious safety accidents.

As described above, although there is a clear need for technology that determines whether a pointing gesture is present, conventional deep-learning-based gesture recognition technologies focus on classifying symbols such as numbers (0 to 10) and letters (A to Z) through static gesture recognition, or on recognizing dynamic gestures such as left, right, up, down, run, and stop.

Among conventional techniques, there are methods, such as "Visual pointing gestures for bi-directional human robot interaction in a pick-and-place task." (2015 24th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 2015), that determine the operating condition of the pointing direction prediction algorithm based on recognition of the hand other than the pointing hand. However, this imposes an unnecessary constraint on the other hand and is difficult to use in situations such as driving.

There are also some conventional methods, such as "Pointing gesture recognition based on 3d-tracking of face, hands and head orientation." (Proceedings of the 5th international conference on Multimodal interfaces. 2003), that determine whether a pointing gesture occurs using a probabilistic model built on traditional machine learning techniques such as the Hidden Markov Model (HMM). However, these are limited in performance compared with recent neural-network-based approaches, which excel at modeling dependencies between input data and at data representation. In addition, because the HMM is generally a modeling method for time-series data, its inference results have low reliability when the time series is too short or when only a single image is available.

An object of the described embodiment is to solve the problems of inefficiency, malfunction, and risk that arise from performing pointing direction prediction without determining whether a pointing gesture is present.

An object of the described embodiment is to solve the problem that the operating condition of a pointing direction prediction algorithm is difficult to use because of physical or situational constraints.

An object of the described embodiment is to use deep learning to improve the performance of real-time detection of the presence or absence of a pointing gesture, based on the user's body parts, even from a single image.

A pointing gesture detection apparatus according to an embodiment includes a memory in which at least one program is recorded and a processor that executes the program, wherein the program may perform the steps of: predicting a 3D posture of a user's body from an input image; detecting a hand region and an arm region using the predicted posture information; and inferring, based on deep learning, whether a pointing gesture is being performed, using each of the detected hand region and arm region as input.

Here, the input image may be acquired through an RGB camera or an RGBD camera installed in equipment that uses non-contact sensing.

Here, the program may perform the predicting, detecting, and inferring steps for each input image frame, and may further perform a step of determining that a pointing gesture has been detected when the inference results for at least a predetermined number of input image frames within a certain period of time indicate a pointing gesture.
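
A minimal sketch of this frame-level voting is given below. It assumes that the "certain period of time" is approximated by a sliding window of the most recent frames; the class name `PointingVoter` and the parameters `window_size` and `min_positive` are hypothetical and are not specified in the source.

```python
from collections import deque


class PointingVoter:
    """Sliding-window vote over per-frame binary inference results."""

    def __init__(self, window_size: int, min_positive: int) -> None:
        if not 0 < min_positive <= window_size:
            raise ValueError("require 0 < min_positive <= window_size")
        self.min_positive = min_positive
        # A deque with maxlen drops the oldest result once the window is full.
        self.results = deque(maxlen=window_size)

    def update(self, frame_is_pointing: bool) -> bool:
        """Record one frame's result; return True if the gesture is detected."""
        self.results.append(frame_is_pointing)
        # Detected when enough frames in the current window were positive.
        return sum(self.results) >= self.min_positive
```

Requiring several positive frames before reporting a detection trades a small delay for robustness against isolated false positives in single frames.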

Here, the program may further perform a step of determining the confidence of landmarks in the predicted 3D posture and, when the landmark confidence is not greater than a predetermined threshold, may skip the detecting and inferring steps for the corresponding input image frame.
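
The confidence gate can be sketched as below. The source does not say how per-landmark confidences are aggregated, so this sketch assumes that every landmark needed for region detection must individually exceed the threshold; the function name `landmarks_reliable` and the joint names are hypothetical.

```python
from typing import Dict, Iterable


def landmarks_reliable(
    confidences: Dict[str, float],  # per-joint confidence from the pose estimator
    required: Iterable[str],        # joints needed downstream (assumed set)
    threshold: float,
) -> bool:
    """True only if every required landmark's confidence is strictly above threshold."""
    # A missing landmark is treated as confidence 0.0, so the frame is skipped.
    return all(confidences.get(name, 0.0) > threshold for name in required)
```

A frame for which this returns False would bypass region detection and inference, and processing would continue with the next frame.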

Here, the 3D posture information is the 3D joint coordinates representing each joint of the user's body, and in the detecting step the program may detect the hand region and the arm region on the assumption that, among the 3D joint coordinates of the user's body, the elbow, wrist, and palm lie on the same line.

Here, in detecting the hand region, the program may perform the steps of: inferring, as the hand position, the position reached by extending the straight line connecting the 3D elbow coordinates and the 3D wrist coordinates toward the hand by a predetermined multiple of the length of that line; generating a bounding box in 3D space centered on the hand position; and projecting the bounding box in 3D space back into the pixel coordinate system using the camera intrinsic parameters to obtain 2D bounding box coordinates corresponding to the hand region.
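
The three steps above can be sketched as follows, assuming an axis-aligned cubic box in camera coordinates and an undistorted pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy`. The function names, the `ratio` and `half_size` parameters, and the cubic box shape are illustrative assumptions; the source does not fix them.

```python
from itertools import product
from typing import Tuple

Vec3 = Tuple[float, float, float]
Box2D = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max)


def estimate_hand_position(elbow: Vec3, wrist: Vec3, ratio: float) -> Vec3:
    """Extend the elbow-to-wrist segment past the wrist by ratio times its length."""
    return tuple(w + ratio * (w - e) for e, w in zip(elbow, wrist))


def project_point(p: Vec3, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def hand_box_2d(hand: Vec3, half_size: float,
                fx: float, fy: float, cx: float, cy: float) -> Box2D:
    """Project the 8 corners of a cube centered on the hand and enclose them in 2D."""
    corners = [
        tuple(c + s * half_size for c, s in zip(hand, signs))
        for signs in product((-1.0, 1.0), repeat=3)
    ]
    pixels = [project_point(c, fx, fy, cx, cy) for c in corners]
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (min(us), min(vs), max(us), max(vs))
```

Because the box is defined in metric 3D space before projection, its pixel size shrinks automatically as the user moves away from the camera.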

Here, the size of the bounding box may be set differently depending on the user's age.

Here, in detecting the arm region, the program may detect the arm region by generating the smallest rectangular patch that contains the bounding box coordinates of the hand region, the 3D elbow coordinates, and the 3D shoulder coordinates.
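
One way to read this step is sketched below. It assumes the elbow and shoulder 3D coordinates have already been projected to pixel coordinates, so that the patch is computed entirely in the image plane; the function name `arm_patch` is hypothetical.

```python
from typing import Tuple

Box2D = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max)
Pixel = Tuple[float, float]


def arm_patch(hand_box: Box2D, elbow_px: Pixel, shoulder_px: Pixel) -> Box2D:
    """Smallest axis-aligned rectangle containing the hand box, elbow, and shoulder."""
    u_min, v_min, u_max, v_max = hand_box
    us = (u_min, u_max, elbow_px[0], shoulder_px[0])
    vs = (v_min, v_max, elbow_px[1], shoulder_px[1])
    return (min(us), min(vs), max(us), max(vs))
```

The resulting patch always contains the hand region, so the arm input gives the network a coarser view of the same gesture.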

Here, in the inferring step, the program may compute the average of the outputs of the deep learning network that receives the detected hand region data and the deep learning network that receives the detected arm region data, and may perform binary classification through an argmax operation.
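
The averaging and argmax can be sketched as follows, assuming each network outputs two class scores ordered as (not pointing, pointing). The class ordering, the constant `POINTING`, and the function name `ensemble_is_pointing` are assumptions made for illustration.

```python
from typing import Sequence

POINTING = 1  # assumed class index for "pointing"; index 0 is "not pointing"


def ensemble_is_pointing(hand_scores: Sequence[float], arm_scores: Sequence[float]) -> bool:
    """Average the two networks' class scores and take the argmax."""
    if len(hand_scores) != len(arm_scores):
        raise ValueError("both networks must output the same number of classes")
    avg = [(h + a) / 2.0 for h, a in zip(hand_scores, arm_scores)]
    # argmax over class indices; on an exact tie the lower index (not pointing) wins.
    predicted = max(range(len(avg)), key=avg.__getitem__)
    return predicted == POINTING
```

For example, hand scores of (0.3, 0.7) and arm scores of (0.6, 0.4) average to roughly (0.45, 0.55), so the frame is classified as pointing even though the arm network alone would disagree.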

A pointing gesture detection method according to an embodiment may include the steps of: predicting a 3D posture of a user's body from an input image; detecting a hand region and an arm region using the predicted posture information; and inferring, based on deep learning, whether a pointing gesture is being performed, using each of the detected hand region and arm region as input.

Here, the predicting, detecting, and inferring steps are performed for each input image frame, and the method may further include a step of determining that a pointing gesture has been detected when the inference results for at least a predetermined number of input image frames within a certain period of time indicate a pointing gesture.

Here, the pointing gesture detection method according to the embodiment may further include a step of determining the confidence of landmarks in the predicted 3D posture, and when the landmark confidence is not greater than a predetermined threshold, the detecting and inferring steps for the corresponding input image frame may be skipped.

Here, the 3D posture information is the 3D joint coordinates representing each joint of the user's body, and the detecting step may detect the hand region and the arm region on the assumption that, among the 3D joint coordinates of the user's body, the elbow, wrist, and palm lie on the same line.

Here, in detecting the hand region, the detecting step may include the steps of: inferring, as the hand position, the position reached by extending the straight line connecting the 3D elbow coordinates and the 3D wrist coordinates toward the hand by a predetermined multiple of the length of that line; generating a bounding box in 3D space centered on the hand position; and projecting the bounding box in 3D space back into the pixel coordinate system using the camera intrinsic parameters to obtain 2D bounding box coordinates corresponding to the hand region.

Here, in detecting the arm region, the detecting step may detect the arm region by generating the smallest rectangular patch that contains the bounding box coordinates of the hand region, the 3D elbow coordinates, and the 3D shoulder coordinates.

Here, the inferring step may compute the average of the outputs of the deep learning network that receives the detected hand region data and the deep learning network that receives the detected arm region data, and may perform binary classification through an argmax operation.

A pointing gesture detection method according to an embodiment may include the steps of: predicting 3D joint coordinates, as the 3D posture of a user's body, from an input image frame; detecting a hand region and an arm region using the predicted posture information; and performing deep-learning-based ensemble inference using each of the detected hand region and arm region as input, wherein the predicting, detecting, and inferring steps are performed for each frame of the input image, and the method may further include a step of determining that the user's pointing gesture has been detected when the per-frame deep-learning-based ensemble inference results for a predetermined number of frames indicate a pointing gesture.

Here, the detecting step may include the steps of: inferring, as the hand position, the position reached by extending the straight line connecting the 3D coordinates of the elbow and wrist toward the hand by a predetermined multiple of the length of that line; generating a bounding box in 3D space centered on the hand position; projecting the bounding box in 3D space back into the pixel coordinate system using the camera intrinsic parameters to obtain 2D bounding box coordinates corresponding to the hand region; and detecting the arm region by generating the smallest rectangular patch that contains the bounding box coordinates of the hand region, the 3D elbow coordinates, and the 3D shoulder coordinates.

According to the described embodiment, the problems of inefficiency, malfunction, and risk that arise from performing pointing direction prediction without determining whether a pointing gesture is present can be solved.

That is, the embodiment can serve as a precondition for the pointing direction prediction technologies widely studied in computer vision, so that the pointing direction is predicted only when it is determined that a pointing gesture is being performed. This not only improves the efficiency and usability of the application system but can also remove risks in industrial settings caused by unwanted malfunctions.

In addition, according to the described embodiment, the problem that the operating condition of a pointing direction prediction algorithm is difficult to use because of physical or situational constraints can be solved.

In addition, according to the described embodiment, the technology can serve as an assistive technology for the early screening of children with Autism Spectrum Disorder (ASD) by determining whether a child performs the pointing gesture, a key element of the nonverbal communication characteristics of children with ASD. That is, compared with typically developing children, children with ASD show great difficulty in using pointing gestures as a form of nonverbal communication. Automatic pointing gesture detection can therefore ease the financial and time burden of the existing diagnostic process conducted by trained experts and, by using RGB cameras that can easily be installed at home as well as in ASD screening spaces, can be used as an assistive technology for the early screening of children with ASD.

FIG. 1 is a schematic block diagram of a pointing gesture detection apparatus according to an embodiment.
FIG. 2 is an exemplary diagram illustrating use of the pointing gesture detection apparatus according to an embodiment.
FIG. 3 is a flowchart illustrating a pointing gesture detection method according to an embodiment.
FIG. 4 is a diagram showing the configuration of a computer system according to an embodiment.

The advantages and features of the present invention, and methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the present invention pertains, and the present invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Although terms such as "first" and "second" are used to describe various components, these components are not limited by such terms. Such terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may also be a second component within the technical spirit of the present invention.

The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, singular forms also include plural forms unless the context specifically states otherwise. As used in this specification, "comprises" or "comprising" means that the stated component or step does not exclude the presence or addition of one or more other components or steps.

Unless otherwise defined, all terms used in this specification may be interpreted with the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless clearly and specifically defined.

Hereinafter, an apparatus and method for detecting a pointing gesture according to an embodiment are described in detail with reference to FIGS. 1 to 4.

FIG. 1 is a schematic block diagram of a pointing gesture detection apparatus according to an embodiment, and FIG. 2 is an exemplary diagram illustrating use of the pointing gesture detection apparatus according to an embodiment.

Referring to FIG. 1, a pointing gesture detection apparatus 100 according to an embodiment can detect, based on deep learning, whether a user is performing a pointing gesture in an input image.

In detail, the pointing gesture detection apparatus 100 according to the embodiment may include an image-based body posture detection unit 110, a region of interest detection unit 120, a deep learning inference unit 130, and a pointing gesture detection unit 140.

The image-based body posture detection unit 110 predicts the 3D posture of the user's body from an input image and may include an image acquisition unit 111 and a body posture detection unit 112.

The image acquisition unit 111 may acquire the input image through an RGBD camera 11 and an RGB camera 12.

Here, because the RGBD camera 11 provides a distance value in addition to RGB values, 3D pose estimation can be performed easily with a single camera.

With the RGB camera 12, using only a single camera can introduce ambiguity in the distance value. However, deep learning networks recently trained on large-scale data can infer reasonably accurate 3D coordinates from a single image.

The RGBD camera 11 and the RGB camera 12 may be attached to a kiosk, smart TV, autonomous vehicle, robot, drone, elevator, ASD screening space, or the like.

That is, the pointing gesture technology according to the embodiment can be used as a precondition for non-contact-sensing-based pointing algorithms that replace hardware such as physical touch, joysticks, and remote controls, in industrial fields that require human-robot interaction, such as robot navigation, robot arm manipulation, or drone operation, and in various other fields such as kiosks, smart TVs, and elevators. By activating the non-contact-sensing-based pointing algorithm only when it is determined that the user is performing a pointing gesture, the technology improves efficiency by preventing unnecessary power consumption and memory use, and can also help prevent reduced usability and accident risks caused by unwanted operations.

The body posture detection unit 112 infers 3D posture information from the acquired image.

Here, the 3D posture information may be 3D joint coordinates representing each joint of the user's body, expressed in the camera coordinate system. For example, referring to FIG. 2, the 3D posture information may be 3D joint coordinate information 210 representing each joint of the user's body.

In detail, the body posture detection unit 112 may detect at least one user in the input image using a top-down algorithm, such as HRNet or the Stacked Hourglass network, and may then infer precise 2D joint positions by performing pose inference on each detected user region.

When there are many people in the input image, bottom-up methods such as OpenPose and HigherHRNet may be used instead to improve inference speed.

The body posture detection unit 112 may back-project the 2D coordinates detected as described above, which are expressed in the pixel coordinate system, into 3D coordinates using the distance value and the camera intrinsic parameters.
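
Back-projection can be sketched as follows, assuming an undistorted pinhole camera whose distance value is the depth along the optical axis. The function name `back_project` and the intrinsic parameter names `fx`, `fy`, `cx`, `cy` are illustrative and are not taken from the source.

```python
from typing import Tuple


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Map a pixel (u, v) with known depth to a 3D point in camera coordinates."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    # Inverse of the pinhole projection u = fx * X / Z + cx, v = fy * Y / Z + cy.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

With fx = fy = 100, cx = 320, and cy = 240, the pixel (370, 290) at a depth of 2.0 maps to the camera-frame point (1.0, 1.0, 2.0).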

The 3D joint coordinates predicted in this way can later be used as landmarks for detecting regions of interest.

Depending on the embodiment, the body posture detection unit 112 may determine the confidence of these landmarks. When the landmark confidence is not greater than a predetermined threshold, the 3D posture information for the corresponding image frame is not used, and the 3D posture is estimated for the next input image frame.

The region of interest detection unit 120 uses the predicted 3D joint coordinates of the user's body to detect at least one region of interest (ROI), namely a body part required for deep learning inference.

In the conventional action recognition field, a person region is typically detected in the images of time-series input data, and a deep learning network is trained using the detected whole-body region. In contrast, in the embodiment, the deep learning network for pointing gesture detection uses still images, specifically the hand and arm body-part regions, as data for training and inference.

The reasons the embodiment thereby differs from the conventional action recognition and gesture recognition fields are as follows.

First, there is no large-scale public training dataset for determining whether a pointing gesture is present, so deep learning inference accuracy is likely to drop because of the domain gap arising from differences in data characteristics between training and inference. For training and inference of pointing actions using time-series data, the behavioral patterns and durations of pointing gestures vary widely, and collecting training data that covers this entire spectrum would consume vast additional human and time resources.

To address this, still-image-based deep learning training and inference can be attempted. However, still-image-based gesture recognition raises additional problems, such as the lack of motion information along the time axis, high intra-class variance, and low inter-class variance.

Therefore, to solve these problems, the embodiment can use the hand region and the arm region, the body parts most relevant to pointing gesture detection, as input data for the deep learning network. Unlike conventional gesture recognition technologies that distinguish among fine-grained classes, pointing gesture recognition aims at a simple binary classification of whether a pointing gesture is being performed. By directly pre-refining input data specific to the pointing action and passing it to the deep learning network, the domain gap between the training data and the data used at inference time is minimized, which increases training stability.

That is, as described in "Pointing: Where language, culture, and cognition meet." (Psychology Press, 2003), the pointing action has clear characteristics that distinguish it from other motions, such as extension of the index finger, curling of the remaining fingers toward the palm, and extension of the arm. By using the appearance information of these local body parts, the hand and the arm, the domain gap between data can be minimized and, at the same time, a pointing gesture can be readily identified from a single still image that lacks motion information.

To this end, the region-of-interest detection unit 120 includes an arm region detection unit 121 and a hand region detection unit 122.

Here, the hand region and the arm region, the body parts required for multi-scale deep learning inference, are detected as regions of interest on the assumption that the elbow, wrist, and palm among the predicted 3D joint coordinates are collinear.

A conventional CNN-based body part detector (e.g., a hand detector) trained on cropped body part data depends on the angle, size, and similar properties of the body part in the image. In particular, it cannot provide stable detection performance when the scale the body part occupies in the image differs from that of the data used for training.

Therefore, to detect the hand region and the arm region stably, the embodiment detects these body regions by assuming that the elbow, wrist, and hand among the detected 3D body joints lie on a straight line, similarly to the method disclosed in "Hand keypoint detection in single images using multiview bootstrapping." (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017).

The hand region detection unit 121 may infer, as the hand position, the point reached by extending the line connecting the 3D elbow coordinates and the 3D wrist coordinates toward the hand by a predetermined multiple of that line's length.

For example, as shown in FIG. 2, the 3D position of the hand can be inferred under the heuristic assumption that the hand position 212 lies beyond the wrist at a distance of about 0.5 times the length 211 between the elbow and the wrist.
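The extrapolation described above can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the function name infer_hand_position, the parameter name extension_ratio, and the sample coordinates are assumptions introduced here; only the 0.5 ratio comes from the text.

```python
def infer_hand_position(elbow, wrist, extension_ratio=0.5):
    """Extrapolate a 3D hand position along the elbow-to-wrist line.

    elbow, wrist: (x, y, z) joint coordinates in the same 3D frame.
    extension_ratio: fraction of the elbow-to-wrist length by which the
    line is extended past the wrist (about 0.5 in the description).
    """
    # Vector from the elbow to the wrist (the forearm direction and length).
    forearm = [w - e for e, w in zip(elbow, wrist)]
    # Continue past the wrist along the same line (collinearity assumption).
    return tuple(w + extension_ratio * f for w, f in zip(wrist, forearm))


# Illustrative coordinates in millimetres (hypothetical values).
elbow = (0.0, 0.0, 1000.0)
wrist = (200.0, 0.0, 1000.0)
hand = infer_hand_position(elbow, wrist)
```

With these sample joints the forearm is 200 mm long, so the inferred hand position lies 100 mm beyond the wrist on the same line.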

Then, the hand region detection unit 121 generates a bounding box in 3D space centered on the hand position.

The size of the bounding box may vary with the size of the user's hand, for example for an adult or a child. In the embodiment, it may be set to 140 mm based on heuristic judgment and experimental experience.

The hand region detection unit 121 then projects the bounding box in 3D space back into the pixel coordinate system using the camera intrinsic parameters, thereby obtaining the 2D bounding box coordinates corresponding to the hand region.
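A minimal sketch of the box generation and projection, under several assumptions not stated in the text: the 3D box is treated as an axis-aligned cube whose side is the 140 mm size, coordinates are expressed in the camera frame in millimetres, and the intrinsic parameters are the focal lengths and principal point (fx, fy, cx, cy) of an ideal pinhole camera without lens distortion. The function name hand_box_2d and the sample intrinsics are hypothetical.

```python
from itertools import product


def hand_box_2d(hand, fx, fy, cx, cy, box_size_mm=140.0):
    """Project a cube centered on the 3D hand position into pixel coordinates.

    Returns (u_min, v_min, u_max, v_max), the 2D box enclosing the
    projections of the eight cube corners.
    """
    half = box_size_mm / 2.0
    # Eight corners of the axis-aligned cube around the hand position.
    corners = [
        (hand[0] + dx, hand[1] + dy, hand[2] + dz)
        for dx, dy, dz in product((-half, half), repeat=3)
    ]
    if any(z <= 0 for _, _, z in corners):
        raise ValueError("box must lie in front of the camera (z > 0)")
    # Pinhole projection: u = fx * X / Z + cx, v = fy * Y / Z + cy.
    us = [fx * x / z + cx for x, _, z in corners]
    vs = [fy * y / z + cy for _, y, z in corners]
    return min(us), min(vs), max(us), max(vs)


# Hypothetical intrinsics and a hand 1 m in front of the camera.
box = hand_box_2d((0.0, 0.0, 1000.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Taking the minimum and maximum over all eight projected corners keeps the 2D box valid even though nearer corners project farther from the center than distant ones.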

The arm region detection unit 122 may detect the arm region by generating the minimum rectangular patch that contains the bounding box coordinates of the hand region detected by the hand region detection unit 121, the 3D elbow coordinates, and the 3D shoulder coordinates. For example, as shown in FIG. 2, the rectangular patch 222 containing the hand region 221 and the shoulder coordinates may be the arm region.
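The minimum enclosing patch can be sketched as below. The sketch assumes that the elbow and shoulder joints have already been projected to pixel coordinates, so that all inputs share the image plane; the function name arm_box_2d and the sample values are illustrative only.

```python
def arm_box_2d(hand_box, elbow_px, shoulder_px):
    """Smallest axis-aligned rectangle containing the hand box and two joints.

    hand_box: (u_min, v_min, u_max, v_max) of the hand region in pixels.
    elbow_px, shoulder_px: (u, v) pixel coordinates of the projected joints.
    """
    u_min, v_min, u_max, v_max = hand_box
    us = [u_min, u_max, elbow_px[0], shoulder_px[0]]
    vs = [v_min, v_max, elbow_px[1], shoulder_px[1]]
    return min(us), min(vs), max(us), max(vs)


# Hypothetical pixel coordinates.
arm_box = arm_box_2d((300, 200, 360, 260), elbow_px=(250, 240), shoulder_px=(180, 150))
```

Because the patch always contains the hand box, the arm input shows the same gesture at a coarser scale than the hand input.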

The deep learning inference unit 130 may infer, based on deep learning, whether a pointing gesture is being performed, taking each of the detected hand region and arm region as input.

To this end, the deep learning inference unit 130 may include a model ensemble inference unit 131 and a time-series result synthesis unit 132.

For a single image, the model ensemble inference unit 131 combines the inference results of deep learning-based static pose classification models that take the detected hand region and arm region as multiple inputs.

That is, as shown in FIG. 2, the data of the hand region 221 is input to the first deep learning network 231 and the data of the arm region 222 is input to the second deep learning network 232, and inference is performed on each.

Here, the input data for each of the hand and arm regions may be resized to 224x224 for deep learning training and inference, scaled so that the RGB pixel range maps to values from 0 to 1, and then normalized using the mean and variance of ImageNet.
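The preprocessing steps can be sketched in plain Python as follows. Several details are assumptions rather than statements of the embodiment: nearest-neighbour sampling stands in for the unspecified resize method, the constants are the widely used ImageNet per-channel mean and standard deviation, and normalization divides by the standard deviation (the common convention) rather than by the variance. The function name preprocess is hypothetical.

```python
# Widely used ImageNet per-channel statistics for RGB values scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def preprocess(patch, size=224):
    """Resize, scale, and normalize a cropped region.

    patch: H x W nested list of (R, G, B) tuples with values in 0..255.
    Returns a size x size nested list of normalized (R, G, B) tuples.
    """
    height, width = len(patch), len(patch[0])
    output = []
    for i in range(size):
        # Nearest-neighbour row lookup (assumed resize method).
        source_row = patch[min(height - 1, i * height // size)]
        row = []
        for j in range(size):
            pixel = source_row[min(width - 1, j * width // size)]
            # Scale to [0, 1], then subtract the mean and divide by the std.
            row.append(tuple(
                (value / 255.0 - mean) / std
                for value, mean, std in zip(pixel, IMAGENET_MEAN, IMAGENET_STD)
            ))
        output.append(row)
    return output


# A tiny 2 x 4 patch of a single colour, for illustration.
sample = preprocess([[(255, 0, 128)] * 4 for _ in range(2)])
```

In practice an image library would perform the resize with interpolation; the loop above only makes the order of operations explicit.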

In addition, any of various deep learning networks such as VGG, ResNet, and ViT may be freely used as the deep learning networks 231 and 232 according to the embodiment. The last layer of each deep learning network may be a fully connected layer with an output dimension of 2 for binary classification.

As shown in FIG. 2, the model ensemble inference unit 131 ensembles (240) the outputs of the first deep learning network 231, which receives the detected hand region data, and the second deep learning network 232, which receives the detected arm region data. Specifically, it may compute the average of the output values and perform binary classification through an argmax operation.
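The averaging and argmax steps can be sketched as follows. The text does not say whether the two-dimensional outputs are raw scores or probabilities, nor which index denotes pointing; this sketch assumes that index 1 means "pointing" and works with either kind of output. The function name ensemble_predict is hypothetical.

```python
def ensemble_predict(hand_output, arm_output):
    """Average two 2-dimensional network outputs and take the argmax.

    hand_output, arm_output: sequences of length 2 from the hand-region
    and arm-region networks. Index 1 is assumed to mean "pointing".
    Returns (label, averaged_output).
    """
    averaged = [(h + a) / 2.0 for h, a in zip(hand_output, arm_output)]
    # argmax: index of the largest averaged value.
    label = max(range(len(averaged)), key=averaged.__getitem__)
    return label, averaged


# The hand network favours "pointing", the arm network mildly disagrees.
label, averaged = ensemble_predict([0.2, 0.8], [0.6, 0.4])
```

In this example the averaged output favours index 1, so the frame is classified as pointing even though the arm network alone would not have done so.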

This mechanism is consistent with ensemble inference using multi-scale or multi-modal inputs, and it is classified as a method of combining prediction results through late fusion.

The time-series result synthesis unit 132 aggregates the result values along the time axis to reduce frame-to-frame fluctuation and derive more stable results. Specifically, the per-frame inference results may be stored in a buffer with a size of 3 frames and then aggregated.

The pointing gesture determination unit 140 determines whether the user is currently performing a pointing motion based on the inference result of the deep learning inference unit 130. That is, when the inference results of at least a predetermined number of input image frames within a certain period indicate a pointing gesture, it may determine that a pointing gesture has been detected.
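The buffering and decision rule can be sketched with a fixed-length queue. The buffer size of 3 comes from the text; the required count of 2 positive frames is an assumed value for the "predetermined number", and the class name PointingSmoother is hypothetical.

```python
from collections import deque


class PointingSmoother:
    """Keep the most recent per-frame labels and apply a count threshold."""

    def __init__(self, buffer_size=3, min_positive=2):
        # deque with maxlen discards the oldest label automatically.
        self.buffer = deque(maxlen=buffer_size)
        self.min_positive = min_positive  # assumed "predetermined number"

    def update(self, frame_label):
        """Add one frame label (1 = pointing, 0 = not) and return the decision."""
        self.buffer.append(frame_label)
        return sum(self.buffer) >= self.min_positive


smoother = PointingSmoother()
decisions = [smoother.update(label) for label in [1, 0, 1, 1, 0, 0]]
```

A single misclassified frame inside a pointing motion does not flip the decision, which is the stabilizing effect the aggregation is meant to provide.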

As described above, when the pointing gesture detection device 100 according to the embodiment determines that a pointing gesture is being made, pointing gesture recognition may be performed in equipment or settings that require non-contact sensing, such as kiosks, smart TVs, self-driving cars, robots, drones, elevators, and ASD examination spaces.

For example, as shown in FIG. 2, the result may be used in application systems such as one that detects the object 250 the user is pointing at in a linked 3D space through a ray casting method 230 that connects the eye and the index finger.
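One way such a ray casting application might look is sketched below. The text does not specify how objects are represented or how the intersection is computed; this sketch assumes that each object is approximated by a bounding sphere and selects the nearest sphere crossed by the ray from the eye through the index fingertip. The function name pointed_object and the scene contents are hypothetical.

```python
import math


def pointed_object(eye, fingertip, objects):
    """Return the name of the nearest object hit by the eye-to-fingertip ray.

    eye, fingertip: (x, y, z) points in the shared 3D space.
    objects: dict mapping a name to (center, radius) of a bounding sphere.
    Returns None when the ray hits nothing.
    """
    direction = [f - e for e, f in zip(eye, fingertip)]
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0:
        return None
    direction = [c / norm for c in direction]

    best_name, best_distance = None, math.inf
    for name, (center, radius) in objects.items():
        to_center = [c - e for e, c in zip(eye, center)]
        # Distance along the ray to the point closest to the sphere center.
        along = sum(a * b for a, b in zip(to_center, direction))
        if along <= 0:
            continue  # the object is behind the user
        # Squared perpendicular distance from the sphere center to the ray.
        off_axis_sq = sum(c * c for c in to_center) - along * along
        if off_axis_sq <= radius * radius and along < best_distance:
            best_name, best_distance = name, along
    return best_name


scene = {
    "lamp": ((0.0, 0.0, 2000.0), 100.0),
    "door": ((1000.0, 0.0, 2000.0), 100.0),
}
target = pointed_object((0.0, 0.0, 0.0), (0.0, 0.0, 500.0), scene)
```

Running the ray test only after a pointing gesture has been detected avoids reporting objects that merely happen to lie along the line from eye to finger.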

FIG. 3 is a flowchart illustrating a pointing gesture detection method according to an embodiment. The pointing gesture detection method according to the embodiment may be performed by the device 100 described above.

Referring to FIG. 3, the pointing gesture detection method according to the embodiment may include predicting a 3D posture of the user's body from an input image (S310), detecting a hand region and an arm region using the predicted posture information (S340 to S345), and inferring, based on deep learning, whether a pointing gesture is being performed, taking each of the detected hand region and arm region as input (S350 to S370).

First, the device 100 estimates at least one 3D posture for a user within the camera's field of view (S310).

The device 100 then analyzes, in the at least one estimated 3D posture, the confidence of the landmarks used for detecting the hand region and the arm region (S320).

The landmarks for detecting the hand region and the arm region may include the shoulder, elbow, and wrist.

Next, the device 100 determines whether the analyzed landmark confidence is greater than a predetermined threshold T₁ (S330).

If the determination in S330 shows that the landmark confidence is not greater than the predetermined threshold T₁, pointing gesture detection for that frame is skipped, and the process returns to S310 to perform posture estimation on the next frame.

Otherwise, if the landmark confidence is greater than the predetermined threshold T₁, the device 100 detects the regions of interest based on the selected landmarks (S340, S345). That is, the hand region is detected in S340, and the arm region is detected in S345 based on the predicted hand region coordinates.
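The gating in S320 and S330 can be sketched as a single check. The text does not state how the confidences of the individual landmarks are combined; this sketch assumes the weakest of the shoulder, elbow, and wrist confidences is compared against T₁, and the threshold value 0.5 is a placeholder. The function name landmarks_reliable is hypothetical.

```python
REQUIRED_LANDMARKS = ("shoulder", "elbow", "wrist")


def landmarks_reliable(confidences, threshold_t1):
    """Return True when every required landmark exceeds the threshold.

    confidences: dict mapping landmark names to confidence scores.
    threshold_t1: the predetermined threshold T1 (value is application-specific).
    """
    # Assumed aggregation: the least confident required landmark decides.
    weakest = min(confidences[name] for name in REQUIRED_LANDMARKS)
    return weakest > threshold_t1


# The wrist is poorly estimated here, so the frame would be skipped.
frame_ok = landmarks_reliable({"shoulder": 0.9, "elbow": 0.8, "wrist": 0.4}, 0.5)
```

The comparison is strict, matching the description that a frame is skipped when the confidence is not greater than T₁.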

Here, the 3D posture information consists of 3D joint coordinates representing the joints of the user's body, and in the detecting steps (S340, S345) the hand region and the arm region may be detected on the assumption that the elbow, wrist, and palm among the 3D joint coordinates of the user's body are collinear.

The step of detecting the hand region (S340) may include inferring, as the hand position, the point reached by extending the straight line connecting the 3D elbow coordinates and the 3D wrist coordinates toward the hand by a predetermined multiple of that line's length; generating a bounding box in 3D space centered on the hand position; and projecting the bounding box in 3D space back into the pixel coordinate system using the camera intrinsic parameters to obtain the 2D bounding box coordinates corresponding to the hand region.

In the step of detecting the arm region (S345), the arm region may be detected by generating the minimum rectangular patch that contains the hand region coordinates, the elbow coordinates, and the shoulder coordinates.

Then, the device 100 performs hand region-based deep learning inference (S350) and arm region-based deep learning inference (S355) on the detected hand region and arm region, respectively, using deep learning networks.

The device 100 combines the hand region-based and arm region-based deep learning inference results through a model ensemble (S360). That is, it may compute the average of the outputs of the deep learning network that receives the detected hand region data and the deep learning network that receives the detected arm region data, and perform binary classification through an argmax operation.

The per-frame inference results from S310 to S360 may be stored in a buffer.

Then, the device 100 aggregates the stored time-series data to determine whether a pointing gesture is present (S370). That is, when the inference results of at least a predetermined number of input image frames within a certain period indicate a pointing gesture, it may determine that a pointing gesture has been detected.
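The control flow of S310 to S370 can be summarized in a short skeleton. The pose estimator, region detector, and classifier are passed in as callables because the text leaves their implementations open; their names and signatures, the function name detect_pointing, and the default values for the buffer size and required count are assumptions made for illustration.

```python
from collections import deque


def detect_pointing(frames, estimate_pose, detect_regions, classify,
                    threshold_t1, buffer_size=3, min_positive=2):
    """Run the per-frame flow and return one decision per frame.

    estimate_pose(frame) -> (joints, landmark_confidence)
    detect_regions(frame, joints) -> (hand_patch, arm_patch)
    classify(hand_patch, arm_patch) -> 1 for pointing, 0 otherwise
    A decision of None marks a frame skipped by the confidence check.
    """
    buffer = deque(maxlen=buffer_size)
    decisions = []
    for frame in frames:
        joints, confidence = estimate_pose(frame)               # S310, S320
        if confidence <= threshold_t1:                          # S330
            decisions.append(None)                              # skip this frame
            continue
        hand_patch, arm_patch = detect_regions(frame, joints)   # S340, S345
        buffer.append(classify(hand_patch, arm_patch))          # S350 to S360
        decisions.append(sum(buffer) >= min_positive)           # S370
    return decisions


# Stub components: each "frame" is just its own landmark confidence.
results = detect_pointing(
    frames=[0.9, 0.2, 0.9, 0.9],
    estimate_pose=lambda frame: ({}, frame),
    detect_regions=lambda frame, joints: ("hand", "arm"),
    classify=lambda hand, arm: 1,
    threshold_t1=0.5,
)
```

In this sketch a skipped frame leaves the buffer unchanged; whether skipped frames should instead count as negative results is a design choice the text does not settle.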

FIG. 4 is a diagram showing the configuration of a computer system according to an embodiment.

The pointing gesture detection device 100 according to the embodiment may be implemented in a computer system 1000, such as a computer-readable recording medium.

The computer system 1000 may include one or more processors 1010, a memory 1030, a user interface input device 1040, a user interface output device 1050, and storage 1060, which communicate with one another through a bus 1020. The computer system 1000 may further include a network interface 1070 connected to a network 1080. The processor 1010 may be a central processing unit or a semiconductor device that executes programs or processing instructions stored in the memory 1030 or the storage 1060. The memory 1030 and the storage 1060 may each be a storage medium including at least one of a volatile medium, a non-volatile medium, a removable medium, a non-removable medium, a communication medium, and an information delivery medium. For example, the memory 1030 may include ROM 1031 or RAM 1032.

The foregoing describes how the present invention differs from conventional methods, along with implementation examples of the invention, at a level that a person with knowledge of the field can implement. A body part-based pointing gesture detection system using still images according to an embodiment of the present invention detects body parts through a pose estimation algorithm in order to determine whether a pointing gesture is being performed and, taking the detected parts as input, applies deep learning inference, model ensembling, and time-series result aggregation, thereby presenting a methodology that differs from conventional techniques.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the invention may be practiced in other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

100: Pointing gesture detection device
110: Image-based body posture extraction unit 111: Image acquisition unit
112: Body posture detection unit 120: Region-of-interest detection unit
121: Arm region detection unit 122: Hand region detection unit
130: Deep learning inference unit 131: Model ensemble inference unit
132: Time-series result synthesis unit 140: Pointing gesture detection unit

Claims

A pointing gesture detection device comprising:
a memory in which at least one program is recorded; and
a processor that executes the program,
wherein the program performs:
predicting a 3D posture of a user's body from an input image;
detecting a hand region and an arm region using the predicted posture information; and
inferring, based on deep learning, whether a pointing gesture is being performed, taking each of the detected hand region and arm region as input.

The device of claim 1, wherein the input image is
obtained through an RGB camera or an RGBD camera installed in equipment that uses non-contact sensing.

The device of claim 1, wherein the program
performs the predicting, detecting, and inferring steps for each input image frame, and
further performs determining that a pointing gesture has been detected when the inference results of at least a predetermined number of input image frames within a certain period indicate a pointing gesture.

The device of claim 3, wherein the program
further performs determining a confidence of a landmark in the predicted 3D posture, and
omits the detecting and inferring steps for the corresponding input image frame when the confidence of the landmark is not greater than a predetermined threshold.

The device of claim 1, wherein the 3D posture information
consists of 3D joint coordinates representing each joint of the user's body, and
wherein, in the detecting step, the program
detects the hand region and the arm region on the assumption that the elbow, wrist, and palm among the 3D joint coordinates of the user's body are collinear.

The device of claim 5, wherein, in detecting the hand region, the program performs:
inferring, as the hand position, the point reached by extending the straight line connecting the 3D elbow coordinates and the 3D wrist coordinates toward the hand by a predetermined multiple of the length of the straight line;
generating a bounding box in 3D space centered on the hand position; and
obtaining 2D bounding box coordinates corresponding to the hand region by projecting the bounding box in 3D space back into the pixel coordinate system using camera intrinsic parameters.

The device of claim 6, wherein the bounding box
has a size that is set differently depending on the user's age.

The device of claim 6, wherein the program,
in detecting the arm region, detects the arm region by generating a minimum rectangular patch containing the bounding box coordinates of the hand region, the 3D elbow coordinates, and the 3D shoulder coordinates.

The device of claim 8, wherein the program,
in the inferring step, computes the average of the outputs of the deep learning network that receives the detected hand region data and the deep learning network that receives the detected arm region data, and performs binary classification through an argmax operation.

A pointing gesture detection method comprising:
predicting a 3D posture of a user's body from an input image;
detecting a hand region and an arm region using the predicted posture information; and
inferring, based on deep learning, whether a pointing gesture is being performed, taking each of the detected hand region and arm region as input.

The method of claim 10, wherein the input image is
obtained through an RGB camera or an RGBD camera installed in equipment that uses non-contact sensing.

The method of claim 10, wherein the predicting, detecting, and inferring steps
are performed for each input image frame,
the method further comprising determining that a pointing gesture has been detected when the inference results of at least a predetermined number of input image frames within a certain period indicate a pointing gesture.

The method of claim 12,
further comprising determining a confidence of a landmark in the predicted 3D posture,
wherein the detecting and inferring steps for the corresponding input image frame are omitted when the confidence of the landmark is not greater than a predetermined threshold.

The method of claim 10, wherein the 3D posture information
consists of 3D joint coordinates representing each joint of the user's body, and
wherein the detecting step
detects the hand region and the arm region on the assumption that the elbow, wrist, and palm among the 3D joint coordinates of the user's body are collinear.

The method of claim 14, wherein the detecting step comprises:
in detecting the hand region, inferring, as the hand position, the point reached by extending the straight line connecting the 3D elbow coordinates and the 3D wrist coordinates toward the hand by a predetermined multiple of the length of the straight line;
generating a bounding box in 3D space centered on the hand position; and
obtaining 2D bounding box coordinates corresponding to the hand region by projecting the bounding box in 3D space back into the pixel coordinate system using camera intrinsic parameters.

The method of claim 15, wherein the bounding box
has a size that is set differently depending on the user's age.

The method of claim 15, wherein the detecting step
detects the arm region by generating a minimum rectangular patch containing the bounding box coordinates of the hand region, the 3D elbow coordinates, and the 3D shoulder coordinates.

The method of claim 17, wherein the inferring step
computes the average of the outputs of the deep learning network that receives the detected hand region data and the deep learning network that receives the detected arm region data, and performs binary classification through an argmax operation.

A pointing gesture detection method comprising:
predicting, from an input image frame, 3D joint coordinates as a 3D posture of a user's body;
detecting a hand region and an arm region using the predicted posture information; and
performing deep learning-based ensemble inference, taking each of the detected hand region and arm region as input,
wherein the predicting, detecting, and inferring steps are performed for each frame of the input image, and
wherein the method further comprises determining that a pointing gesture of the user has been detected when the deep learning-based ensemble inference results for at least a predetermined number of frames indicate a pointing gesture.

The method of claim 19, wherein the detecting step comprises:
inferring, as the hand position, the point reached by extending the straight line connecting the 3D elbow coordinates and the 3D wrist coordinates toward the hand by a predetermined multiple of the length of the straight line;
generating a bounding box in 3D space centered on the hand position;
obtaining 2D bounding box coordinates corresponding to the hand region by projecting the bounding box in 3D space back into the pixel coordinate system using camera intrinsic parameters; and
detecting the arm region by generating a minimum rectangular patch containing the bounding box coordinates of the hand region, the 3D elbow coordinates, and the 3D shoulder coordinates.