KR20210079542A

KR20210079542A - User Motion Recognition Method and System using 3D Skeleton Information

Info

Publication number: KR20210079542A
Application number: KR1020190171440A
Authority: KR
Inventors: 김성흠; 류한나
Original assignee: 한국전자기술연구원
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2021-06-30
Also published as: KR102338486B1

Abstract

Provided are a user motion recognition method using 3D skeleton information and a system thereof. The motion recognition method according to an embodiment of the present invention comprises the steps of: detecting a person in a current frame; extracting joints of the detected person; classifying the extracted joints into a plurality of joint sets; calculating spatial features of the skeleton based on distance information between the classified joint sets; calculating spatio-temporal features, in which temporal features are added to the spatial features of the skeleton, based on the calculated spatial features and the change in position of core joints between frames; and inferring the person's motion by inputting the calculated spatio-temporal features to an artificial intelligence model trained for human motion recognition. Accordingly, the method extracts the user's 3D joint information from consecutive 2D images, calculates the spatial features of the skeleton, and then adds the temporal features that appear as changes in the position of each joint within a motion section in order to recognize the user's motion. The user's motion can therefore be recognized/estimated using only dynamic calculation of the 3D skeleton information.

Description

User Motion Recognition Method and System using 3D Skeleton Information

The present invention relates to image processing technology, and more particularly, to a method and system for recognizing/estimating a user's motion using skeletal information.

Motion recognition technology estimates the movement of a person present in an input image, and is applied in various industrial fields such as social robots and interfaces between computers and users.

With recent advances in 3D distance sensors (e.g., Kinect) and deep learning technology, it has become possible to estimate a person's skeletal structure and movement patterns from two-dimensional images, which has largely overcome the difficulties of motion recognition.

However, methods that use a 3D distance sensor are limited in their ability to use the wide variety of images available on the Internet. In addition, existing motion estimation methods based on template matching, histograms, and the like have the disadvantage of requiring additional elements beyond human joint information.

The present invention has been devised to solve the above problems, and an object of the present invention is to provide a method and system for recognizing/estimating a user's motion using only dynamic calculation of 3D skeletal information.

According to an embodiment of the present invention for achieving the above object, a motion recognition method includes: detecting a person in a current frame; extracting joints of the detected person; classifying the extracted joints into a plurality of joint sets; calculating spatial features of the skeleton based on distance information between the classified joint sets; calculating spatio-temporal features, in which temporal features are added to the spatial features of the skeleton, based on the calculated spatial features and the change in position of core joints between frames; and inferring the person's motion by inputting the calculated spatio-temporal features to an artificial intelligence model trained for human motion recognition.

The classifying step may include: rearranging the extracted joints in consideration of their physical connectivity; and classifying the rearranged joints into a plurality of joint sets.

The extracting step may extract 3D joint coordinates, and the classifying step may further include: normalizing the extracted 3D joint coordinates; and calculating the center of gravity of the skeleton from the normalized 3D joint coordinates. In this case, the joints are rearranged around the calculated center of gravity in consideration of their physical connectivity, and the rearranged joints are classified into a plurality of joint sets with reference to the calculated center of gravity.

The step of calculating the spatial features of the skeleton may calculate the spatial features by reflecting intrinsic distance information between the classified joint sets in physical distance information between the classified joint sets.

The physical distance information may be a relative distance value between joints, and the intrinsic distance information may be determined by the relationship between the joint sets that contain the joints.

The step of calculating the spatio-temporal features may calculate the spatio-temporal features by reflecting, in the spatial features, the change in position of the core joints within the time window that contains the current frame.

The step of calculating the spatio-temporal features may further include: selecting core joints from among the extracted joints; calculating the inter-frame position change of the core joints; selecting, with reference to the calculated position change, key frames that delimit the person's motions; and setting a time window based on the selected key frames.

The selecting step may select the end joints of the skeleton as the core joints.

The step of calculating the spatio-temporal features may extract N frames, in descending order of the change in position of the core joints, from the time window that contains the current frame, and combine the spatio-temporal features calculated from the extracted frames into one.

According to another aspect of the present invention, there is provided a motion recognition system comprising: a camera that generates an image; and a computing system that detects a person in a current frame of the image, extracts joints of the detected person, classifies the extracted joints into a plurality of joint sets, calculates spatial features of the skeleton based on distance information between the classified joint sets, calculates spatio-temporal features, in which temporal features are added to the spatial features of the skeleton, based on the calculated spatial features and the change in position of core joints between frames, and infers the person's motion by inputting the calculated spatio-temporal features to an artificial intelligence model trained for human motion recognition.

According to another aspect of the present invention, there is provided a motion recognition method comprising: classifying joints of a person detected in a current frame into a plurality of joint sets; calculating spatial features of the skeleton based on distance information between the classified joint sets; calculating spatio-temporal features, in which temporal features are added to the spatial features of the skeleton, based on the calculated spatial features and the change in position of core joints between frames; and inferring the person's motion using the calculated spatio-temporal features.

According to another aspect of the present invention, there is provided a computer-readable recording medium on which is recorded a program capable of performing a motion recognition method comprising: classifying joints of a person detected in a current frame into a plurality of joint sets; calculating spatial features of the skeleton based on distance information between the classified joint sets; calculating spatio-temporal features, in which temporal features are added to the spatial features of the skeleton, based on the calculated spatial features and the change in position of core joints between frames; and inferring the person's motion using the calculated spatio-temporal features.

As described above, according to embodiments of the present invention, the user's 3D joint information is extracted from consecutive 2D images, the spatial features of the skeleton are calculated, and temporal features that appear as changes in the position of each joint within a motion section are then added to recognize the user's motion. The user's motion can thus be recognized/estimated using only dynamic calculation of the 3D skeleton information.

FIG. 1 shows the result of classifying a user's specific motion within each image from images captured by a camera;
FIG. 2 is a flowchart provided to explain a user motion recognition method according to an embodiment of the present invention;
FIG. 3 is a detailed flowchart of the joint classification process;
FIG. 4 shows the result of rearranging 14 joint nodes;
FIG. 5 is a graph representation of FIG. 4;
FIG. 6 shows the joint set classification result;
FIG. 7 is a detailed flowchart of the process of calculating spatio-temporal features by reflecting the change in position of the core joints in the spatial features; and
FIG. 8 is a block diagram of a motion recognition system according to another embodiment of the present invention.

Hereinafter, the present invention will be described in more detail with reference to the drawings.

Embodiments of the present invention present a method and system for recognizing a user's motion using 3D skeleton information.

In the motion recognition method according to an embodiment of the present invention, the user's 3D joint information is extracted from consecutive 2D images, the spatial features of the skeleton are calculated, and temporal features that appear as changes in the position of each joint within a motion section are then added to recognize the user's motion.

In an embodiment of the present invention, the user's joint information is extracted and the motion is recognized on the basis of deep learning. Rather than comparing similarity with a demonstration motion, the user's joint features are mapped onto a demonstration image, which enables accurate estimation and correction of the user's posture.

In addition, unlike existing methodologies such as template matching and histograms, an embodiment of the present invention estimates the user's motion using only joint information, can be trained with a small database, and has shorter training time and faster processing than existing approaches.

The training images needed for user motion recognition are constructed as follows.

First, as shown in FIG. 1, the user's specific motion in each image is classified from images captured by one or more cameras. Because most captured videos contain a mixture of several motions, the user's motion in each video must be labeled as a single motion. The videos obtained in this way are classified, according to the person's motion sequence, into categories such as daily activities, exercise, and household activities, and each category consists of subcategories.

For example, the 'daily activities' category can be further divided into subcategories such as 'drinking water' and 'reading a newspaper'. For additional specific motions, the categories needed for motion recognition can be expanded continuously by recording new videos in-house.

Meanwhile, a publicly recognized database (DB) may also be used to reflect the positional characteristics of joints under single-view and multi-view conditions.

Hereinafter, a method of recognizing a user's motion will be described in detail with reference to FIG. 2. FIG. 2 is a flowchart provided to explain a user motion recognition method according to an embodiment of the present invention.

As shown, a person is first detected in the current frame (S110), and the joints of the detected person are extracted (S120). In step S120, 14 3D joint coordinates are extracted using a deep learning-based joint estimator.

Next, the joints extracted in step S120 are classified into a plurality of joint sets (S130). Step S130 will be described in detail with reference to FIG. 3. FIG. 3 is a detailed flowchart of the joint classification process.

As shown, the 3D joint coordinates extracted in step S120 are first normalized to values between -1 and 1 (S131), and the center of gravity of the skeleton is calculated from the normalized 3D joint coordinates (S132).

Because the range of the joint coordinates in the image varies with the distance between the camera and the user, the range of the joint coordinates needs to be normalized in step S131.

Next, with the calculated center of gravity as the origin, the extracted joints are rearranged in consideration of their physical connectivity (S133), and the rearranged joints are classified into a plurality of joint sets with reference to the calculated center of gravity (S134).

As a result, absolute positions centered on the skeleton can be used instead of the user's relative position information within the image.
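Steps S131 to S133 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the text only states that coordinates are normalized to between -1 and 1 and re-expressed around the center of gravity, so the uniform bounding-box scaling, the use of the unweighted mean as the center of gravity, and the function names `normalize_joints`, `center_of_gravity`, and `recenter` are all assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def normalize_joints(joints: List[Point]) -> List[Point]:
    """Scale joints uniformly into [-1, 1].

    Assumed scheme: shift by the bounding-box midpoint and divide by the
    largest half-extent, so the skeleton's proportions are preserved.
    """
    mins = [min(p[a] for p in joints) for a in range(3)]
    maxs = [max(p[a] for p in joints) for a in range(3)]
    mids = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    # Fall back to 1.0 when all joints coincide, to avoid dividing by zero.
    half = max((hi - lo) / 2.0 for lo, hi in zip(mins, maxs)) or 1.0
    return [tuple((p[a] - mids[a]) / half for a in range(3)) for p in joints]


def center_of_gravity(joints: List[Point]) -> Point:
    """Unweighted mean of the joint positions (an assumption)."""
    n = len(joints)
    return tuple(sum(p[a] for p in joints) / n for a in range(3))


def recenter(joints: List[Point]) -> List[Point]:
    """Express every joint relative to the skeleton's center of gravity."""
    c = center_of_gravity(joints)
    return [tuple(p[a] - c[a] for a in range(3)) for p in joints]
```

Calling `recenter(normalize_joints(joints))` gives coordinates that no longer depend on where the person stands in the image or how far they are from the camera, which is the property the paragraph above relies on.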

FIG. 4 shows the result of defining the annotation of the joint coordinates and rearranging the 14 joint nodes with the center of gravity of the skeleton as the origin (starting point). FIG. 5 shows the physical connectivity between the joints of FIG. 4 as a graph. FIG. 6 shows the result of classifying the joints into five joint sets in consideration of the physical connectivity from the center of gravity of the skeleton (start node) in FIG. 5 to each end point (terminal node).
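The grouping into joint sets can be illustrated by walking a skeleton tree from the center of gravity to each terminal node. The figures are not reproduced in this text, so the 14-joint layout and joint names in `SKELETON` below are hypothetical, as is the function name `joint_sets`; the sketch only shows how five root-to-leaf paths yield five joint sets.

```python
from typing import Dict, List

# Hypothetical 14-joint skeleton rooted at the center of gravity ("center").
# The actual annotation of FIG. 4 / FIG. 5 is not available in this text.
SKELETON: Dict[str, List[str]] = {
    "center": ["neck", "r_shoulder", "l_shoulder", "r_hip", "l_hip"],
    "neck": ["head"],
    "r_shoulder": ["r_elbow"],
    "r_elbow": ["r_wrist"],
    "l_shoulder": ["l_elbow"],
    "l_elbow": ["l_wrist"],
    "r_hip": ["r_knee"],
    "r_knee": ["r_ankle"],
    "l_hip": ["l_knee"],
    "l_knee": ["l_ankle"],
}


def joint_sets(tree: Dict[str, List[str]], root: str = "center") -> List[List[str]]:
    """Return one joint set per terminal node: the path from root to that leaf.

    The root itself is excluded because it is the shared origin.
    """
    sets: List[List[str]] = []
    stack = [(child, [child]) for child in tree.get(root, [])]
    while stack:
        node, path = stack.pop()
        children = tree.get(node, [])
        if not children:
            sets.append(path)  # reached an end joint: the path is a joint set
        for child in children:
            stack.append((child, path + [child]))
    return sets
```

With this hypothetical layout, the five sets correspond to the head, the two arms, and the two legs, and their last elements are the end joints that the description later treats as core joints.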

The description now returns to FIG. 2.

After the joints are classified into sets, the spatial features of the skeleton are calculated based on distance information between the classified joint sets (S140). The spatial features are calculated by estimating physical distance information between the classified joint sets and reflecting intrinsic distance information between those joint sets in the estimated physical distance information.

The Euclidean distances between all joint coordinates are calculated, and the relative distance values between joints can be estimated from them. If either of the two joint coordinates in a distance calculation is a missing coordinate, the overall distance weights may be biased; to minimize this effect, the distance is defined as "0" in that case.
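A minimal sketch of this pairwise distance computation is shown below. Representing a missing joint as `None` and the function name `physical_distances` are assumptions made for illustration; the rule that any pair involving a missing coordinate gets distance 0 follows the paragraph above.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


def physical_distances(joints: List[Optional[Point]]) -> List[List[float]]:
    """Symmetric matrix of Euclidean distances between all joint pairs.

    A joint that could not be estimated is passed as None (assumed
    representation); every pair that involves it keeps the distance 0.0.
    """
    n = len(joints)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b = joints[i], joints[j]
            if a is None or b is None:
                continue  # missing coordinate: leave the entry at 0.0
            dist[i][j] = dist[j][i] = math.dist(a, b)
    return dist
```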

In addition, to reflect the intrinsic (implicit) weight between joints, intrinsic distance information between joint nodes is obtained based on the hierarchical graph structure of FIG. 5 and is expressed as follows. [expression not reproduced in the source text]

The value Ji of a joint node is defined as the minimum distance from the center of gravity of the skeleton (J0) to the i-th joint node, and the weight of every edge in the graph is assumed to be "1". For example, when two particular nodes (the i-th and j-th joint nodes) belong to the same joint set, the distance information between them is expressed as |Ji-Jj|. When they belong to different joint sets, the distance information between the two nodes is always expressed not by the distance weight between the two nodes but by the distance between a node and the center of gravity, taken as the minimum distance.

Accordingly, the final intrinsic distance weight between two joint nodes is defined as min(Ji, Jj, |Ji-Jj|), and in an embodiment of the present invention the spatial features of the joints combine the physical distance information with the intrinsic (implicit) distance information.
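The intrinsic distance weight can be sketched as below. The code takes min(Ji, Jj, |Ji-Jj|) literally, with Ji computed as the hop count from the center of gravity on a graph whose edges all have weight 1. The skeleton layout and the names `SKELETON`, `node_depths`, and `intrinsic_distance` are hypothetical, and how this weight is merged with the physical distances is not specified in the text, so that step is left out.

```python
from collections import deque
from typing import Dict, List

# Hypothetical 14-joint skeleton; the real graph of FIG. 5 is not available here.
SKELETON: Dict[str, List[str]] = {
    "center": ["neck", "r_shoulder", "l_shoulder", "r_hip", "l_hip"],
    "neck": ["head"],
    "r_shoulder": ["r_elbow"],
    "r_elbow": ["r_wrist"],
    "l_shoulder": ["l_elbow"],
    "l_elbow": ["l_wrist"],
    "r_hip": ["r_knee"],
    "r_knee": ["r_ankle"],
    "l_hip": ["l_knee"],
    "l_knee": ["l_ankle"],
}


def node_depths(tree: Dict[str, List[str]], root: str = "center") -> Dict[str, int]:
    """Breadth-first search giving Ji: the hop count from J0 to each node."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in tree.get(node, []):
            depths[child] = depths[node] + 1
            queue.append(child)
    return depths


def intrinsic_distance(depths: Dict[str, int], i: str, j: str) -> int:
    """Literal reading of min(Ji, Jj, |Ji - Jj|) from the description."""
    ji, jj = depths[i], depths[j]
    return min(ji, jj, abs(ji - jj))
```

For instance, with these hypothetical depths a wrist (depth 3) and the elbow on the same arm (depth 2) get an intrinsic distance of 1.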

Next, based on the calculated spatial features and the change in position of the core joints between frames, spatio-temporal features are calculated in which temporal features are added to the spatial features of the skeleton (S150).

Step S150 extracts a core motion section from the entire motion section and calculates the spatio-temporal features by reflecting, in the spatial features, the change in position of the core joints within the extracted core motion section. The detailed process is shown in FIG. 7.

To calculate the spatio-temporal features, as shown in FIG. 7, core joints are first selected from among the extracted joints (S151).

Because actual human motion is continuous, it is difficult to estimate the entire motion from a single image within the motion section. The recognition model must therefore be built in consideration of the relationships between the individual frames that make up the video.

To this end, an embodiment of the present invention recognizes continuous motion from two or more input images by using the user's spatial information together with the positional change of each corresponding joint over time, converted into a time weight.

First, the positions of the joints change as the motion changes, and the correlation between joint sets changes accordingly. For example, in the 'walking' motion the 'wrist' and 'ankle' joints show a high correlation, whereas the 'wrist' joint shows a low correlation with the other joints.

In addition, the joint sets that contribute to identifying the 'walking' motion differ from those that contribute to identifying the 'making a phone call' motion. To determine a change of motion in the video, the correlation of the joint sets for each motion type must therefore be identified.

Consequently, not all inter-joint relationships are needed to estimate the user's continuous motion, and not every joint set influences every motion decision. Most motions are determined by the end points of the skeleton (the terminal nodes of the graph in FIG. 5), and these are defined as the core joints that determine continuous motion.

Next, the inter-frame position change of the core joints is calculated (S152), key frames that delimit the person's motions are selected with reference to the calculated position change (S153), and a time window is set based on the selected key frames (S154).

In other words, the time window is set based on the amount of position change of the corresponding core joints (the end joints of the skeleton) between frames.

Specifically, to set the time window (τ, the motion section) relative to the input time point of the image, the position change of each corresponding core joint is first calculated for every frame [expression not reproduced in the source text]. The position change (velocity) may have a negative sign; to remove the sign, the Euclidean distance is used to estimate the degree of the user's movement in each frame.

When the value estimated in this way is 0, the user's movement has paused momentarily, and the frame can be regarded as a key frame that delimits a particular motion. In practice, however, the distance difference cannot be exactly 0 because of the FPS of the input video, noise in the 3D joints, and similar factors, so a threshold is set and a frame is regarded as a key frame when the value is below the threshold. A time window is then set based on the key frames.
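Steps S152 to S154 can be sketched as follows. The text does not give the exact expression, the threshold value, or how windows are delimited, so the sketch makes three labeled assumptions: per-frame movement is the sum of the core joints' Euclidean displacements from the previous frame, the threshold is a free parameter, and a time window is the span between two consecutive non-adjacent key frames. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def movement_per_frame(frames: List[List[Point]]) -> List[float]:
    """Movement of frame t (t >= 1) relative to frame t - 1.

    frames[t] holds the positions of the core joints at frame t, in a fixed
    order. The sum over core joints is an assumed aggregation.
    """
    return [
        sum(math.dist(a, b) for a, b in zip(prev, cur))
        for prev, cur in zip(frames, frames[1:])
    ]


def key_frames(frames: List[List[Point]], threshold: float) -> List[int]:
    """Indices of frames whose movement falls below the threshold."""
    moves = movement_per_frame(frames)
    # moves[t] describes the transition into frame t + 1.
    return [t + 1 for t, m in enumerate(moves) if m < threshold]


def time_windows(keys: List[int]) -> List[Tuple[int, int]]:
    """Assumed windowing: spans between consecutive, non-adjacent key frames."""
    return [(a, b) for a, b in zip(keys, keys[1:]) if b - a > 1]
```

The threshold has to be tuned to the frame rate and to the noise level of the joint estimator, since both affect how small the displacement is while the user is effectively still.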

Next, for the frames of the time window that contains the current frame, the time weights for each section, derived from the change in distance of the core joints, are combined so that the temporal features of the motion are added to the spatial features of the skeleton and the spatio-temporal features are calculated (S155). The specific method is given by the following expression. [expression not reproduced in the source text]

Then, from among the other frames of the time window that contains the current frame, the N frames whose joint positions differ most from the current frame are extracted (S156), and the spatio-temporal feature vectors of the extracted frames are combined into one and output as the final feature vector (S157).

As a result, the final feature vector combines the spatial properties of the skeleton (distance, angle, graph structure) with the temporal change properties of the core joints.
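Steps S156 and S157 can be illustrated with the sketch below. It assumes that the per-frame spatio-temporal feature vectors have already been computed, that "position change" is the summed Euclidean displacement of the core joints relative to the current frame, and that the selected vectors are concatenated in chronological order; none of these details is fixed by the text, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def top_n_frames(core_joints: List[List[Point]], current: int, n: int) -> List[int]:
    """Pick the n frames of the window that differ most from the current frame.

    core_joints[t] holds the core joint positions at frame t of the window.
    The chosen indices are returned in chronological order (an assumption).
    """
    def displacement(t: int) -> float:
        return sum(math.dist(a, b) for a, b in zip(core_joints[current], core_joints[t]))

    others = [t for t in range(len(core_joints)) if t != current]
    chosen = sorted(others, key=displacement, reverse=True)[:n]
    return sorted(chosen)


def final_feature_vector(features: List[List[float]], chosen: List[int]) -> List[float]:
    """Concatenate the per-frame feature vectors of the chosen frames."""
    out: List[float] = []
    for t in chosen:
        out.extend(features[t])
    return out
```

Fixing N keeps the final vector the same length for every window, which a fixed-input classifier such as the random forest mentioned in step S160 requires.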

After the spatio-temporal features are calculated, they are input to an artificial intelligence model (such as a random forest) trained for human motion recognition, and the person's motion is inferred and recognized (S160).

FIG. 8 is a block diagram of a motion recognition system according to another embodiment of the present invention. As shown, the motion recognition system according to an embodiment of the present invention includes a camera 210, a computing system 220, and a display 230.

The camera 210 is an imaging device that generates consecutive 2D images. The computing system 220 analyzes the user's movement in the images generated by the camera 210 to infer/recognize the motion. The display 230 shows the images generated by the camera 210 and the inference/recognition results from the computing system 220.

For motion recognition, the computing system 220 detects a person in the current frame, extracts the joints of the detected person, classifies the extracted joints into a plurality of joint sets, and calculates the spatial features of the skeleton based on distance information between the classified joint sets.

Furthermore, the computing system 220 calculates spatio-temporal features, in which temporal features are added to the spatial features of the skeleton, based on the calculated spatial features and the change in position of the core joints between frames, and inputs the calculated spatio-temporal features to an artificial intelligence model trained for human motion recognition to infer and recognize the person's motion.

The method and system for recognizing a user's motion using 3D skeleton information have been described in detail above with reference to preferred embodiments.

In the above embodiments, the spatio-temporal geometric relationship of the skeleton is treated as the key element of human motion recognition, and the position information of all joints is re-expressed from an image-centered frame to a skeleton-centered frame, so that joint position information can be used regardless of the angle and position relative to the camera.

In addition, the physical connection relationship is obtained using distance and angle information between joints, and the physical relationship between joints is expressed as a graph rooted at the center of gravity of the skeleton; from this graph the joint sets are defined and the intrinsic connection relationship between joints is obtained.

Finally, the temporal information of a motion can be estimated by analyzing, relative to the current frame, the position changes and correlations between joint sets within the time window (section).

Meanwhile, the technical idea of the present invention can of course also be applied to a computer-readable recording medium containing a computer program that performs the functions of the apparatus and method according to the present embodiment. The technical ideas according to various embodiments of the present invention may also be implemented in the form of computer-readable code recorded on a computer-readable recording medium. The computer-readable recording medium may be any data storage device that can be read by a computer and can store data. For example, the computer-readable recording medium may be a ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical disk, hard disk drive, or the like. In addition, the computer-readable code or program stored on the computer-readable recording medium may be transmitted over a network connecting computers.

Although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described. Various modifications may be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or outlook of the present invention.

210: camera
220: computing system
230: display

Claims

1. A motion recognition method comprising:
detecting a person in a current frame;
extracting joints of the detected person;
classifying the extracted joints into a plurality of joint sets;
calculating spatial features of a skeleton based on distance information between the classified joint sets;
calculating spatio-temporal features, in which temporal features are added to the spatial features of the skeleton, based on the calculated spatial features and an inter-frame position change of core joints; and
inferring a motion of the person by inputting the calculated spatio-temporal features to an artificial intelligence model trained for human motion recognition.

2. The method according to claim 1,
wherein the classifying step comprises:
rearranging the extracted joints in consideration of the physical connectivity of the extracted joints; and
classifying the rearranged joints into a plurality of joint sets.

3. The method according to claim 2,
wherein the extracting step extracts 3D joint coordinates,
wherein the classifying step further comprises:
normalizing the extracted 3D joint coordinates; and
calculating a center of gravity of the skeleton from the normalized 3D joint coordinates,
wherein the joints are rearranged, based on the calculated center of gravity, in consideration of the physical connectivity of the joints, and
wherein the rearranged joints are classified into the plurality of joint sets based on the calculated center of gravity.

4. The method according to claim 1,
wherein the step of calculating the spatial features of the skeleton calculates the spatial features by reflecting intrinsic distance information between the classified joint sets in physical distance information between the classified joint sets.

5. The method according to claim 4,
wherein the physical distance information is a relative distance between joints, and
wherein the intrinsic distance information is determined by a relationship between the joint sets that include the joints.

6. The method according to claim 1,
wherein the spatio-temporal feature calculation step calculates the spatio-temporal features by reflecting the position change of the core joints in the spatial features within a time window that includes the current frame.

7. The method according to claim 6,
wherein the spatio-temporal feature calculation step further comprises:
selecting core joints from among the extracted joints;
calculating the inter-frame position change of the core joints;
selecting key frames that distinguish the motion of the person with reference to the calculated position change; and
setting the time window based on the selected key frames.

8. The method according to claim 7,
wherein the selecting step selects end joints of the skeleton as the core joints.

9. The method according to claim 6,
wherein the spatio-temporal feature calculation step extracts N frames in descending order of the position change of the core joints within the time window that includes the current frame, and combines the spatio-temporal features calculated using the extracted frames into one.

10. A motion recognition system comprising:
a camera that generates an image; and
a computing system that detects a person in a current frame of the image, extracts joints of the detected person, classifies the extracted joints into a plurality of joint sets, calculates spatial features of a skeleton based on distance information between the classified joint sets, calculates spatio-temporal features, in which temporal features are added to the spatial features of the skeleton, based on the calculated spatial features and an inter-frame position change of core joints, and infers a motion of the person by inputting the calculated spatio-temporal features to an artificial intelligence model trained for human motion recognition.

11. A motion recognition method comprising:
classifying joints of a person detected in a current frame into a plurality of joint sets;
calculating spatial features of a skeleton based on distance information between the classified joint sets;
calculating spatio-temporal features, in which temporal features are added to the spatial features of the skeleton, based on the calculated spatial features and an inter-frame position change of core joints; and
inferring a motion of the person using the calculated spatio-temporal features.

12. A computer-readable recording medium on which is recorded a program capable of performing a motion recognition method, the method comprising:
classifying joints of a person detected in a current frame into a plurality of joint sets;
calculating spatial features of a skeleton based on distance information between the classified joint sets;
calculating spatio-temporal features, in which temporal features are added to the spatial features of the skeleton, based on the calculated spatial features and an inter-frame position change of core joints; and
inferring a motion of the person using the calculated spatio-temporal features.