KR20230040849A

KR20230040849A - Method and apparatus for classifying action based on hand tracking

Info

Publication number: KR20230040849A
Application number: KR1020220042990A
Authority: KR
Inventors: 이예림; 고준규; 서동주; 김준호
Original assignee: 국민대학교산학협력단; (주)그렙
Priority date: 2021-09-16
Filing date: 2022-04-06
Publication date: 2023-03-23
Also published as: KR102542683B1

Abstract

Disclosed are a method and a device for classifying actions based on hand tracking. According to an embodiment of the present disclosure, a method for classifying actions based on hand tracking includes the steps of: obtaining an image of a non-face-to-face user; detecting a hand region in the non-face-to-face user image and classifying the action of the non-face-to-face user, based on a pre-trained action classification algorithm; and determining, based on the action classification result, whether the non-face-to-face user is using a disallowed object. High accuracy can therefore be obtained even when the acquired image shows a hand occluded by an obstacle.

Description

Method and apparatus for classifying action based on hand tracking {METHOD AND APPARATUS FOR CLASSIFYING ACTION BASED ON HAND TRACKING}

The present disclosure relates to a hand tracking-based action classification method and apparatus that determine whether a person in a video performs a disallowed action, using a deep learning-based model that classifies the person's actions from key points describing the hand posture and from a hand region image.

Recently, non-face-to-face methods have been adopted in every area of society. Non-face-to-face interaction can also be described as digital communication, and it presupposes digital transformation. Online education, video conferencing, and telecommuting are examples of digital transformation.

In particular, online exams were often introduced abruptly, which raises concerns about cheating. They nevertheless have the advantage of reducing the social cost and time of large-scale on-site exams, in keeping with the era of the Fourth Industrial Revolution.

Companies now commonly conduct exams online, from recruitment tests to qualification exams for licenses. Beyond companies, kindergartens, middle and high schools, and universities increasingly hold classes online and administer midterm and final exams online as well.

However, various forms of cheating can occur in online exams. Typically, cheating is checked by examining the student's access time, IP address, and the like. In addition, video of the student captured in real time through a webcam installed on the student's computer can be reviewed to check for cheating.

In this case, proctors must be assigned to monitor the video of every test taker. This makes real-time processing difficult and is inefficient: checking for cheating takes considerable time and money, and accuracy is low.

Accordingly, there is a growing need for an algorithm that can automatically detect cheating from video, and for ways to improve the algorithm's accuracy and reduce its cost.

The background art described above is technical information that the inventors possessed in order to derive the present invention or acquired in the course of deriving it, and is not necessarily known art disclosed to the general public before the filing of the present invention.

Prior art 1: Korean Patent Registration No. 10-2195401 (2020.12.19)

One object of an embodiment of the present disclosure is to automatically detect cheating by determining whether a person in a video performs a disallowed action, using a deep learning-based model that classifies the person's actions based on key points describing the hand posture and a hand region image.

The objects of the embodiments of the present disclosure are not limited to the one mentioned above. Other objects and advantages of the present invention that are not mentioned can be understood from the following description and will be understood more clearly from the embodiments of the present invention. It will also be appreciated that the objects and advantages of the present invention can be realized by the means indicated in the claims and combinations thereof.

A hand tracking-based action classification method according to an embodiment of the present disclosure may include: acquiring an image of a non-face-to-face user; detecting a hand region in the non-face-to-face user image and classifying the non-face-to-face user's action, based on a pre-trained action classification algorithm; and determining, based on the action classification result, whether the non-face-to-face user is using a disallowed object.

In addition, other methods and other systems for implementing the present invention, and a computer-readable recording medium storing a computer program for executing the method, may further be provided.

Other aspects, features, and advantages beyond those described above will become apparent from the following drawings, claims, and detailed description of the invention.

According to an embodiment of the present disclosure, a deep learning-based model that classifies a person's actions in a video based on key points describing the hand posture and a hand region image accurately determines, in real time, whether the person performs a disallowed action, so that cheating can be detected and responded to immediately.

In addition, because no large delay occurs when processing the video of a non-face-to-face user transmitted in real time, real-time operation is feasible in an actual service environment.

In addition, rather than using the image alone, the ROI image of the hand and the joint keypoints are used together. The image provides information about the appearance of the hand and surrounding objects, and the keypoints allow the shape of the hand to be learned more accurately, so high accuracy can be achieved even when the acquired image shows a hand occluded by an obstacle.

The effects of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a diagram schematically illustrating a hand tracking-based action classification system according to an embodiment.
FIG. 2 is a block diagram schematically illustrating an action classification device according to an embodiment.
FIG. 3 is a diagram schematically illustrating the network structure of an action classification algorithm according to an embodiment.
FIG. 4 is an exemplary diagram of keypoints inferred by the hand region detection network of the action classification algorithm according to an embodiment.
FIG. 5 is a diagram visualizing the input data of the action detection network of the action classification algorithm according to an embodiment.
FIG. 6 is an exemplary diagram of the structure of an action classification algorithm that takes a certain frame interval of the user image as input, according to an embodiment.
FIG. 7 is a diagram schematically illustrating the network structure of an action classification algorithm that includes a gaze tracking network, according to an embodiment.
FIG. 8 is an exemplary diagram showing a result of action classification including gaze tracking, according to an embodiment.
FIG. 9 is a diagram showing experimental results of the action classification algorithm (loss and accuracy graphs for a first data set) according to an embodiment.
FIG. 10 is a diagram showing experimental results of the action classification algorithm (confusion matrix when only two classes are learned) according to an embodiment.
FIG. 11 is a diagram showing experimental results of the action classification algorithm (loss and accuracy graphs for a second data set) according to an embodiment.
FIG. 12 is a diagram showing experimental results of the action classification algorithm (confusion matrix when trained using five classes) according to an embodiment.
FIG. 13 is a flowchart illustrating an action classification method according to an embodiment.
FIG. 14 is a flowchart illustrating an action classification method including gaze tracking according to an embodiment.

Advantages and features of the present disclosure, and methods of achieving them, will become clear with reference to the embodiments described in detail together with the accompanying drawings.

However, the present disclosure is not limited to the embodiments presented below. It may be implemented in various different forms and should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. The embodiments presented below are provided so that the present disclosure is complete and so that those of ordinary skill in the art to which it pertains are fully informed of its scope. In describing the present disclosure, detailed descriptions of related known technologies are omitted where they are judged likely to obscure the gist of the present disclosure.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present disclosure. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof. Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. These terms are used only to distinguish one component from another.

Hereinafter, embodiments according to the present disclosure are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

FIG. 1 is a diagram schematically illustrating a hand tracking-based action classification system according to an embodiment.

Referring to FIG. 1, the hand tracking-based action classification system 1 may include an action classification device 100, a user terminal 200, a server 300, and a network 400.

The hand tracking-based action classification system 1 photographs a non-face-to-face user through a camera provided at the user's location and uses a deep learning-based model to infer, from the non-face-to-face user image, whether a disallowed action is being performed.

For example, the hand tracking-based action classification system 1 may detect cheating, such as the use of a disallowed object, from an image of a user taking an online exam.

In an embodiment, the action classes for a non-face-to-face user image may be defined as normal and disallowed object use. For example, data of the normal class may consist of images of writing and of natural behavior during an exam, such as brushing back one's hair or touching one's face.
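The two action classes described above can be written down as a small enumeration. The following is a minimal sketch, not the patent's implementation: the names `ActionClass` and `uses_disallowed_object` and the idea of a per-class score mapping are assumptions introduced here for illustration.

```python
from enum import Enum
from typing import Mapping


class ActionClass(Enum):
    """The two action classes named in the text; identifiers are illustrative."""
    NORMAL = "normal"  # e.g. writing, brushing hair back, touching the face
    DISALLOWED_OBJECT_USE = "disallowed_object_use"


def uses_disallowed_object(scores: Mapping[ActionClass, float]) -> bool:
    """Pick the highest-scoring class and report whether it is the disallowed one.

    `scores` is a hypothetical per-class score mapping produced by a classifier.
    """
    predicted = max(scores, key=lambda cls: scores[cls])
    return predicted is ActionClass.DISALLOWED_OBJECT_USE
```

Taking the highest-scoring class is only one possible decision rule; a deployed system might instead apply a threshold tuned to limit false alarms.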

That is, to determine the classes defined above, the region where the non-face-to-face user's hand is located must be detected in an image that captures the user's entire surroundings. In addition, in an embodiment, not only the image information of the hand but also a keypoint for each finger joint is detected, and the two kinds of information are used together so that the hand posture can be identified accurately. Keypoints may refer to representative landmark points for identifying a hand. The term "keypoint" is used consistently hereinafter.

Accordingly, the hand tracking-based action classification system 1 may detect the non-face-to-face user's hand region and hand keypoints in the non-face-to-face user image and, based on the detected hand region image and hand keypoints, recognize cheating from the object captured in the image and the shape of the hand.

In an embodiment, the hand tracking-based action classification system 1 may consist of two networks: a hand region detection network, which acquires a non-face-to-face user image and extracts a cropped image of the user's hand region together with keypoint information describing the hand shape, and an action classification network, which receives data combining the hand image and the keypoint information and infers whether the non-face-to-face user is currently using a disallowed object.

That is, the hand tracking-based action classification system 1 may infer whether the non-face-to-face user is currently using a disallowed object, based on an action classification algorithm composed of the hand region detection network and the action classification network.
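The two-stage structure described above (hand region detection followed by action classification) can be sketched as a simple composition of two callables. This is an illustrative outline under stated assumptions, not the disclosed networks: the type names, the toy image representation, and the choice to treat "no hand detected" as "no disallowed object use" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Keypoint = Tuple[float, float]  # (x, y) of one hand landmark
Image = List[List[int]]         # toy grayscale image as rows of pixel values


@dataclass
class HandObservation:
    """Output of the first stage: what the second stage receives as input."""
    crop: Image                 # cropped hand-region image
    keypoints: List[Keypoint]   # hand keypoints describing the hand shape


# Stage 1: full image -> hand observation (None when no hand is found).
HandDetector = Callable[[Image], Optional[HandObservation]]
# Stage 2: hand observation -> True if a disallowed object is in use.
ActionClassifier = Callable[[HandObservation], bool]


def classify_frame(image: Image,
                   detect_hand: HandDetector,
                   classify_action: ActionClassifier) -> bool:
    """Run both stages on one frame and return the disallowed-object decision."""
    hand = detect_hand(image)
    if hand is None:
        # Assumption: a frame without a detected hand is not flagged.
        return False
    return classify_action(hand)
```

Keeping the two stages behind separate callables mirrors the split in the text and lets either network be swapped or tested on its own.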

Accordingly, the hand tracking-based action classification system 1 can automate work that, for example, previously required a proctor to watch every test taker's video in person, by having AI make the determination in real time.

In addition, the hand tracking-based action classification system 1 uses not only the image of the hand but also the positions of specific joints, which allows a more accurate model to be trained.

Meanwhile, in an embodiment, users may access an application or website running on the user terminal 200 to perform processes such as creating and training the network of the action classification device 100.

The user terminal 200 may be a user-operated desktop computer, smartphone, notebook computer, tablet PC, smart TV, mobile phone, personal digital assistant (PDA), laptop, media player, micro server, global positioning system (GPS) device, e-book reader, digital broadcast terminal, navigation device, kiosk, MP3 player, digital camera, home appliance, or other mobile or non-mobile computing device, but is not limited thereto.

In addition, the user terminal 200 may be a wearable terminal with communication and data processing functions, such as a watch, glasses, a hair band, or a ring. The user terminal 200 is not limited to the above, and any terminal capable of web browsing may be used without limitation.

In an embodiment, the hand tracking-based action classification system 1 may be implemented by the action classification device 100 and/or the server 300.

In an embodiment, the action classification device 100 may be implemented in the server 300. In this case, the server 300 may be a server for operating the hand tracking-based action classification system 1 that includes the action classification device 100, or a server that implements part or all of the action classification device 100.

In an embodiment, the server 300 may be a server that controls the operation of the action classification device 100 for the overall process of acquiring an image of a non-face-to-face user, extracting a cropped image of the user's hand region and keypoint information describing the hand shape, and inferring, based on data combining the hand image and the keypoint information, whether the non-face-to-face user is currently using a disallowed object.

In addition, the server 300 may be a database server that provides the data for operating the action classification device 100. The server 300 may also include a web server, an application server, or a server that provides deep learning networks.

The server 300 may further include a big data server and an AI server needed to apply various artificial intelligence algorithms, a computation server that performs the computations of various algorithms, and the like.

In this embodiment, the server 300 may include the servers described above or may be networked with them. That is, the server 300 may include the web server and AI server above or may be networked with such servers.

In the hand tracking-based action classification system 1, the action classification device 100 and the server 300 may be connected by the network 400. The network 400 may encompass wired networks such as local area networks (LANs), wide area networks (WANs), metropolitan area networks (MANs), and integrated services digital networks (ISDNs), as well as wireless networks such as wireless LANs, CDMA, Bluetooth, and satellite communication, but the scope of the present disclosure is not limited thereto. The network 400 may also transmit and receive information using short-range and/or long-range communication.

In addition, the network 400 may include connections between network elements such as hubs, bridges, routers, switches, and gateways. The network 400 may include one or more connected networks, for example a multi-network environment, including public networks such as the Internet and private networks such as a secure corporate private network. Access to the network 400 may be provided through one or more wired or wireless access networks. Furthermore, the network 400 may support an Internet of Things (IoT) network, which exchanges and processes information between distributed components such as things, and/or 5G communication.

FIG. 2 is a block diagram schematically illustrating an action classification device according to an embodiment.

Referring to FIG. 2, the action classification device 100 may include a communication unit 110, a user interface 120, a memory 130, and a processor 140.

The communication unit 110 may work with the network 400 to provide the communication interface needed to deliver signals exchanged with external devices in the form of packet data. The communication unit 110 may also be a device that includes the hardware and software needed to transmit and receive signals, such as control signals or data signals, to and from other network devices over wired or wireless connections.

That is, the processor 140 may receive various data or information from an external device connected through the communication unit 110 and may also transmit various data or information to the external device.

In an embodiment, the user interface 120 may include an input interface through which user requests and commands for controlling the operation of the action classification device 100 (e.g., changing network parameters or changing network training conditions) are entered.

In an embodiment, the user interface 120 may also include an output interface that outputs the action classification result or warns a user determined to be cheating. That is, the user interface 120 may output results in response to user requests and commands. The input interface and output interface of the user interface 120 may be implemented in the same interface.

The memory 130 stores the various information needed to control (compute) the operation of the action classification device 100 and may store control software. It may include a volatile or non-volatile recording medium.

The memory 130 may be connected to one or more processors 140 through an electrical or internal communication interface and may store code that, when executed by the processor 140, causes the processor 140 to control the action classification device 100.

Here, the memory 130 may include a non-transitory storage medium such as magnetic storage media or flash storage media, or a transitory storage medium such as RAM, but the scope of the present invention is not limited thereto. The memory 130 may include internal and/or external memory, and may include volatile memory such as DRAM, SRAM, or SDRAM; non-volatile memory such as one-time programmable ROM (OTPROM), PROM, EPROM, EEPROM, mask ROM, flash ROM, NAND flash memory, or NOR flash memory; a flash drive such as an SSD, compact flash (CF) card, SD card, Micro-SD card, Mini-SD card, xD card, or memory stick; or a storage device such as an HDD.

The memory 130 may also store information related to the algorithm for performing training according to the present disclosure. Various other information needed to achieve the objects of the present disclosure may be stored in the memory 130, and the stored information may be updated as it is received from a server or external device or entered by a user.

The processor 140 may control the overall operation of the action classification device 100. Specifically, the processor 140 is connected to the components of the action classification device 100, including the memory 130, and may control the overall operation of the device by executing at least one instruction stored in the memory 130.

The processor 140 may be implemented in various ways. For example, the processor 140 may be implemented as at least one of an application-specific integrated circuit (ASIC), an embedded processor, a microprocessor, hardware control logic, a hardware finite state machine (FSM), and a digital signal processor (DSP).

The processor 140, as a kind of central processing unit, may control the operation of the action classification device 100 by running the control software loaded in the memory 130. The processor 140 may include any kind of device capable of processing data. Here, a "processor" may refer to a data processing device embedded in hardware that has physically structured circuitry for performing functions expressed as code or instructions contained in a program.

도 3은 일 실시 예에 따른 행위 분류 알고리즘의 네트워크 구조를 개략적으로 나타낸 도면이고, 도 4는 일 실시 예에 따른 행위 분류 알고리즘의 손 영역 검출 네트워크에서 추론하는 키포인트의 예시도이며, 도 5는 일 실시 예에 행위 분류 알고리즘의 행위 검출 네트워크의 입력 데이터를 시각화한 도면이다.3 is a diagram schematically showing a network structure of an action classification algorithm according to an embodiment, FIG. 4 is an example of keypoints inferred in a hand region detection network of an action classification algorithm according to an embodiment, and FIG. It is a diagram visualizing input data of an action detection network of an action classification algorithm according to an embodiment.

In an embodiment, the processor 140 may run an algorithm that detects cheating in a user image capturing a non-face-to-face user (for example, a user taking an online exam). The processor 140 may detect the hand region and hand keypoints of the non-face-to-face user in the image and recognize cheating from the objects captured in the image and the shape of the hand.

That is, the processor 140 may train an action classification algorithm that recognizes cheating by detecting the hand region and hand keypoints of the non-face-to-face user in the user image, and may use the pre-trained action classification algorithm to determine, through hand-tracking-based action classification, whether a disallowed object is being used.

Referring to FIG. 3, in an embodiment the action classification algorithm can be broadly divided into a hand region detection network, which detects the hand region and hand keypoints, and an action classification network, which infers from the objects captured in the image and the shape of the hand whether a disallowed object is being used according to the action classification result.

In an embodiment, an open-source cross-platform framework (for example, Mediapipe) may be used to detect the hand region and hand keypoints of the non-face-to-face user.

Such an open-source cross-platform framework is graph-based and provides machine learning solutions for live and streaming media.

Cross-platform, also called multi-platform, is a term meaning that a computer program, operating system, computer language, programming language, or other computer software can run on several kinds of computer platforms. A cross-platform application can run on two or more platforms. Software of this kind is also called multi-platform software.

The open-source cross-platform framework of an embodiment may provide solutions such as Face Detection, Face Mesh, Pose, Hands, and Object Detection; in particular, the Hands solution, which tracks the shape of the hand and fingers, may be used to detect cheating. For example, Hands can use machine learning to infer 21 three-dimensional keypoints for the hand joints from a single frame.

Based on the hand region detection network of the action classification algorithm, the processor 140 may detect, for example, the 21 hand keypoints shown in FIG. 4. The hand region detection network may in turn consist of two models: a hand detection model and a hand keypoint detection model.

The hand detection model (for example, the Palm Detection Model), which runs first, can locate a person's palm or back of the hand in an image. For example, the network model for locating the palm or back of the hand may take as input a 128 X 128 X 3 tensor whose RGB values lie in [-1.0, 1.0] and output a 1 X 896 X 18 tensor.

To infer the hand keypoints, the location where the hand appears must first be detected in the whole image. However, accurately locating the hand region regardless of differences in hand size between people, complex hand shapes, the degree of occlusion, and so on is a difficult task. A face has easily distinguishable patterns such as the eyes and mouth, but such visual features are hard to find in a hand.

Accordingly, the processor 140 may first detect a part whose shape changes little, such as a palm or a fist. Because the palm is a smaller object than the whole hand, the non-maximum suppression algorithm can work relatively well. In addition, because the palm has a width-to-height ratio of roughly 1:1, the anchor boxes used with the non-maximum suppression algorithm can simply be defined as squares.

An object detection algorithm generally samples a large number of regions, checks whether each region contains an object of interest, and adjusts the edges of the region to accurately predict the ground-truth bounding box of the target. The anchor box method can be used to find such bounding boxes.

The anchor box method generates bounding boxes of different sizes and aspect ratios centered on each pixel; these bounding boxes are called anchor boxes, and objects are detected through them.
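
As an illustration of the non-maximum suppression step mentioned above, the following is a minimal sketch in Python. The box format `(x1, y1, x2, y2)`, the function names, and the 0.5 overlap threshold are assumptions made for this example; they are not taken from the disclosure or from any particular library.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept
```

With square candidate boxes, as suggested for palms, two heavily overlapping detections of the same palm collapse to the higher-scoring one, while a detection of a second, distant palm is kept.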

At this time, the processor 140 may expand the palm region of interest (ROI) detected in the non-face-to-face user image to a size large enough to enclose the hand with all fingers spread. In other words, the ROI of the palm region is expanded to the ROI of the whole hand region.

The hand keypoint detection model (for example, the Hand Landmark Model), which runs second, takes the expanded whole-hand ROI image as input and can predict three-dimensional coordinates for the hand keypoints. Here, the ROI image may be input as a color (RGB) image.

At this time, to reduce the capacity required of the hand region detection network of the action classification algorithm, the processor 140 may rotate the ROI image so that the line segment connecting the center of the wrist and the middle finger is parallel to the y-axis of the image.
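
The rotation described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, image coordinates are taken with y pointing down, and the aligned hand is assumed to point "up" (toward negative y); the disclosure only states that the wrist-to-middle-finger segment becomes parallel to the y-axis.

```python
import math


def alignment_angle(wrist, middle_finger):
    """Angle (radians) that rotates the wrist->middle-finger vector onto (0, -1)."""
    vx = middle_finger[0] - wrist[0]
    vy = middle_finger[1] - wrist[1]
    # atan2 gives the current direction; -pi/2 is the direction of (0, -1).
    return -math.pi / 2 - math.atan2(vy, vx)


def rotate_point(p, center, theta):
    """Rotate point p about center by theta using the standard 2D rotation matrix."""
    c, s = math.cos(theta), math.sin(theta)
    dx, dy = p[0] - center[0], p[1] - center[1]
    return (center[0] + c * dx - s * dy, center[1] + s * dx + c * dy)
```

Applying `rotate_point` with the angle from `alignment_angle` to every pixel coordinate (or, in practice, passing the angle to an image-warping routine) puts every hand crop into the same canonical orientation, so the keypoint model does not have to learn arbitrary in-plane rotations.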

In the overall pipeline of an embodiment, which processes a continuous stream of input images, the hand may be tracked by reusing the hand position identified in the previous frame in the next frame. If the processor 140 can no longer find the hand in the given ROI image, it may start over from the step of detecting the palm or back of the hand in the user image.

In an embodiment, because the capacity of the network of the action classification algorithm is reduced through the above process, the algorithm can run in real time even in a mobile environment.

The action classification network of the action classification algorithm is described below. The action classification network is a kind of classifier: it uses the hand ROI image and the keypoint information detected in the non-face-to-face user image to determine whether the hand in the current frame is using a disallowed object.

In an embodiment, a pretrained MobileNetV2 may be used as the backbone network so that the action classification network runs in real time. MobileNet is a model that greatly reduces the amount of computation by introducing depthwise separable convolution, and it aims to run well even in environments with limited computing resources, such as mobile devices. Its successor, MobileNetV2, adopts an inverted residual structure and reduces the amount of computation further.

Accordingly, in an embodiment, a MobileNetV2 trained on ImageNet may be used in consideration of training time and dataset similarity.

In an embodiment, the action classification network may take a 160 X 160 X 24 tensor as input. The first three channels are the RGB channels of the hand ROI image, and the remaining 21 channels may be heatmaps, each marking one of the 21 keypoints.

Referring to FIG. 5, FIG. 5(b) shows keypoint 0 of the hand in the hand ROI image of FIG. 5(a), and FIG. 5(c) shows all 21 keypoints of the hand in the hand ROI image of FIG. 5(a). FIGS. 5(d) and 5(e) visualize the keypoints of all channels merged into a single channel and overlaid on the hand ROI image.

In an embodiment, a pipeline in which multiple nodes are connected and operate as a graph can continuously process image frames transmitted in real time. The input of the action classification network can be built in real time within this pipeline.

More specifically, the node that runs the action classification network first receives a 160 X 160 X 3 hand ROI image (RGB image) from the node that produces an ROI image large enough to contain the whole hand. The processor 140 may then receive, as a packet, the keypoint coordinates from the node that infers the 21 keypoints by feeding the same ROI image to the hand keypoint detection network.

For each of the 21 keypoints, the processor 140 may create an empty grid of the same size as the ROI image, mark the coordinate of that keypoint, and apply a Gaussian blur. Once the 21 heatmaps, each containing one keypoint, have been created, the processor 140 may append them after the RGB channels of the hand ROI image received in the packet to form an input tensor with 24 channels in total.
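
The construction of the 24-channel input can be sketched as below. This is a simplified illustration, not the disclosed implementation: instead of marking a point and then blurring it, the sketch evaluates a Gaussian centred on each keypoint directly, which yields a comparable bump. The function names, the nested-list layout `[row][col][channel]`, and the value of `sigma` are assumptions.

```python
import math


def keypoint_heatmap(x, y, height, width, sigma=2.0):
    """One-channel heatmap with a Gaussian bump centred on pixel (x, y)."""
    return [
        [math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2.0 * sigma ** 2))
         for c in range(width)]
        for r in range(height)
    ]


def build_input_tensor(rgb, keypoints, sigma=2.0):
    """Append one heatmap per keypoint after the RGB channels.

    rgb is indexed [row][col][channel]; keypoints is a list of (x, y)
    pixel coordinates. The result has 3 + len(keypoints) channels.
    """
    height, width = len(rgb), len(rgb[0])
    maps = [keypoint_heatmap(x, y, height, width, sigma) for x, y in keypoints]
    return [
        [list(rgb[r][c]) + [m[r][c] for m in maps] for c in range(width)]
        for r in range(height)
    ]
```

For a 160 X 160 ROI and 21 keypoints this produces a 160 X 160 X 24 structure in which channel 3 + k peaks at the location of keypoint k.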

In an embodiment, the 160 X 160 X 24 input tensor may be given as the input of the action classification network. For example, an action classification network that uses its ImageNet-pretrained input layer without modification expects a 160 X 160 X 3 tensor as input.

Therefore, in an embodiment, the action classification network may place a 1 X 1 2D convolution layer at its very front to reduce the incoming tensor to 160 X 160 X 3. As a result of this 1 X 1 convolution, which converts 24 channels into 3, each element of the output tensor holds a value that jointly encodes a pixel of the ROI image and the information of the 21 keypoints at that pixel.
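
A 1 X 1 convolution is simply the same small linear map applied independently at every pixel, which the following sketch makes explicit. The function name and the nested-list layout are assumptions for illustration; in the described setup the weights would be learned, whereas here they are supplied by the caller.

```python
def conv1x1(tensor, weights, bias):
    """Apply a 1 x 1 convolution to an H x W x C_in tensor.

    weights is indexed [out_channel][in_channel] and bias has one entry
    per output channel. The result is H x W x C_out.
    """
    return [
        [
            [sum(w * v for w, v in zip(w_row, pixel)) + b
             for w_row, b in zip(weights, bias)]
            for pixel in row
        ]
        for row in tensor
    ]
```

With a 3 X 24 weight matrix, each of the three output values at a pixel is a weighted mix of that pixel's RGB values and its 21 heatmap values, which is why the reduced tensor can still carry keypoint information into a backbone that only accepts three channels.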

That is, in an embodiment, the 160 X 160 X 3 tensor generated in this way passes through the remaining layers. The final fully-connected (FC) layer of the action classification network may be configured with a number of outputs matching the number of action classes.

FIG. 6 illustrates an example structure of an action classification algorithm that takes a fixed span of frames from the user images as input, according to an embodiment.

Referring to FIG. 6, in an embodiment, when a continuous stream of images is processed, the action of the hand is predicted from a single image frame, and the hand is tracked by reusing the hand position identified in the previous frame in the next frame.

However, if the backbone network of the model (for example, MobileNetV2) is replaced with another network, various scenarios suited to the new backbone can be applied.

First, the action classification network can be replaced with a model that takes video input, that is, a fixed span of frames. If a network from the video classification field such as SlowFast, I3D, or X3D is used instead, a fixed span of the non-face-to-face user images received in real time can be defined as the input, and the user's action within that span can be determined. In this case, the entire input/output pipeline of the action detection network can be configured as shown in FIG. 6.

FIG. 7 schematically illustrates the network structure of an action classification algorithm that includes a gaze tracking network according to an embodiment, and FIG. 8 illustrates example action classification results that include gaze tracking according to an embodiment.

Referring to FIGS. 7 and 8, in an embodiment, gaze tracking may additionally be performed to detect cheating.

The processor 140 may track the user's gaze in an image containing a frontal or profile face. That is, the processor 140 may derive the gaze direction in three-dimensional space from the position of the iris in image space. The processor 140 may also implement a process that recognizes specified gaze patterns such as looking at the screen or looking to the side.

In other words, given the two-dimensional position of the iris in image space and the three-dimensional pose of the face in camera space, the processor 140 can determine the gaze in three-dimensional space and reliably recognize gaze patterns.

That is, in an embodiment, gaze tracking information may be combined with tracking of the non-face-to-face user's hand to determine whether cheating is taking place. In other words, the processor 140 may compute not only information about the user's hand but also where the gaze is directed.

Through a gaze detection network (for example, the Iris graph, one of the features provided by Mediapipe), the processor 140 may infer the facial keypoints of the person in the image and crop the eye region image using the keypoints that correspond to the eye contour.

The processor 140 may then predict keypoints for the iris and the pupil from the cropped eye region image. Because the iris keypoints have three-dimensional coordinates, a vector indicating the direction in which the non-face-to-face user's eye is pointing within the space of the image can be calculated from the cross product of the iris and pupil keypoint values.

Using the gaze vector of the non-face-to-face user, the processor 140 may determine whether the user is looking outside the approximate monitor area defined in advance.

The processor 140 may additionally define, from the keypoints of the non-face-to-face user's hand, the bounding box occupied by the hand in three-dimensional space. In other words, the processor 140 may compute whether the user's gaze vector points toward this bounding box in three-dimensional space. For the action being performed by the hand, the disallowed-object result predicted by the action classification algorithm described above may be used.
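
One way to test whether a gaze vector points toward the hand's bounding box is a ray versus axis-aligned box test (the "slab" method), sketched below. The choice of an axis-aligned box, the function names, and the use of the eye position as the ray origin are assumptions; the disclosure does not specify how the test is carried out.

```python
def hand_bounding_box(keypoints):
    """Axis-aligned 3D box (min corner, max corner) enclosing the keypoints."""
    xs, ys, zs = zip(*keypoints)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def ray_hits_box(origin, direction, box_min, box_max, eps=1e-12):
    """Slab test: does the ray origin + t * direction (t >= 0) hit the box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < eps:
            # Ray is parallel to this pair of faces: must already be between them.
            if o < lo or o > hi:
                return False
            continue
        t1, t2 = (lo - o) / d, (hi - o) / d
        if t1 > t2:
            t1, t2 = t2, t1
        t_near, t_far = max(t_near, t1), min(t_far, t2)
        if t_near > t_far:
            return False
    return True
```

A positive result from `ray_hits_box` could then be combined with the classifier output (for example, "mobile phone use") as corroborating evidence.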

For example, if the disallowed-object determination indicates that the non-face-to-face user is using a mobile phone and the user's gaze is also found to be directed at the hand, the use of a disallowed object can be determined more effectively.

More specifically, for a face included in the user image, the processor 140 may detect the two-dimensional positions of the irises of both eyes in image space and the three-dimensional face pose in camera space. Using these, the processor 140 may determine the gaze direction in three-dimensional space and track the gaze to recognize gaze patterns such as looking at the screen or looking to the side.

The processor 140 may track the face based on a given image. The processor 140 may detect the two-dimensional positions of a finite set of predefined keypoints on the face. For example, the processor 140 may track the face using a learning-based model such as a convolutional neural network (CNN) that has been trained on labeled keypoints until it reaches a certain level of accuracy.

By defining keypoints on the iris, the processor 140 can infer the two-dimensional position of the iris, and when a canonical landmark mesh of the face exists in three-dimensional space, it can infer the three-dimensional pose of the face by formulating a Procrustes problem. Here, the Procrustes problem is to find the orthogonal matrix that most closely maps one given set of points onto another given set of points. The one-to-one correspondence between the points of the two sets must be known a priori.

When the focal length of the camera used to capture the image is known, the processor 140 may define a perspective projection matrix that transforms camera space into image space. The processor 140 may then compute the inverse of the projection matrix and, using the iris position in image space obtained earlier, calculate the straight line on which the iris position in camera space lies.

This can be described with reference to equations as follows. Given the focal length, the aspect ratio, and the principal point, the perspective projection matrix may be defined as in Equation 1.

The processor 140 may compute the inverse of the perspective projection matrix and, as in Equation 2, use the iris position in image space to derive, in terms of a variable, the straight line in camera space that passes through the origin and the iris position.
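
The back-projection of an image point to a line through the camera origin can be sketched with a simple pinhole model. The symbols `f` (focal length), `a` (aspect ratio), `cx`, `cy` (principal point), and `t` (line parameter), as well as the specific projection form used here, are assumptions made for illustration; they are not necessarily the form of Equations 1 and 2.

```python
def project(point, f, a, cx, cy):
    """Pinhole projection of a camera-space point (x, y, z) to pixel (u, v)."""
    x, y, z = point
    return (f * x / z + cx, a * f * y / z + cy)


def backproject_ray(pixel, f, a, cx, cy):
    """Direction of the camera-space line through the origin that maps to pixel."""
    u, v = pixel
    return ((u - cx) / f, (v - cy) / (a * f), 1.0)


def point_on_ray(direction, t):
    """Point at parameter t on the line through the origin along direction."""
    return tuple(t * d for d in direction)
```

Every point `point_on_ray(direction, t)` with `t > 0` projects back to the same pixel, so an additional constraint (such as the eye surface) is needed to pick the value of `t` that gives the actual iris position.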

In an embodiment, the iris position in camera space can be obtained by finding an appropriate value of this variable.

The processor 140 may calculate the iris position in camera space by finding the intersection between the straight line passing through the center of the camera and the center of the iris and the geometry that models the surface of the eye, which is defined in face space and positioned according to the given face pose.

To obtain the value of the variable that corresponds to the intersection with the eye surface geometry in camera space, the processor 140 may transform the eye surface geometry defined in face space into camera space and find the intersection of the transformed geometry and the straight line.

Given a point on the plane of one eye and the normal of that plane in face space, the processor 140 may transform these values into camera space using the rotation matrix and the translation vector that represent the pose of the face, and find the value of the variable that satisfies Equation 3.

The value of the variable that satisfies this equation may be calculated as in Equation 4.
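
The intersection of the back-projected line with the eye plane can be sketched as a standard line-plane intersection. The names below and the exact algebraic form are assumptions for illustration: the plane point and normal are moved from face space to camera space with a rotation matrix and translation vector, and the line parameter is solved from the condition that the point on the line lies in the plane.

```python
def mat_vec(m, v):
    """Multiply a 3 x 3 matrix (list of rows) by a 3-vector."""
    return tuple(sum(mij * vj for mij, vj in zip(row, v)) for row in m)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ray_plane_parameter(direction, plane_point_face, plane_normal_face,
                        rotation, translation):
    """Parameter s such that s * direction lies on the eye plane, or None."""
    # Transform the plane from face space to camera space.
    p_cam = tuple(p + t for p, t in
                  zip(mat_vec(rotation, plane_point_face), translation))
    n_cam = mat_vec(rotation, plane_normal_face)  # normals only rotate
    denom = dot(n_cam, direction)
    if abs(denom) < 1e-12:
        return None  # line is parallel to the plane
    # Solve n . (s * d - p) = 0 for s.
    return dot(n_cam, p_cam) / denom
```

Multiplying the line direction by the returned parameter gives the estimated iris position in camera space.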

The processor 140 may calculate the gaze direction in camera space using the difference between the center of the eyeball and the position of the iris. That is, in an embodiment, assuming that the center position of one eyeball in face space is given, the gaze direction in camera space may be defined as in Equation 5.
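
A sketch of this gaze-direction computation follows. It assumes, for illustration only, that the eyeball center is first moved from face space to camera space with the face rotation and translation and that the gaze is the normalized vector from that center to the iris; the function name and this exact formulation are not taken from the disclosure.

```python
import math


def gaze_direction(iris_cam, eye_center_face, rotation, translation):
    """Unit vector from the eyeball centre to the iris, in camera space."""
    # Eyeball centre in camera space: R * c + t.
    center_cam = [sum(r * c for r, c in zip(row, eye_center_face)) + t
                  for row, t in zip(rotation, translation)]
    diff = [i - c for i, c in zip(iris_cam, center_cam)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("iris and eyeball centre coincide")
    return tuple(d / norm for d in diff)
```

Normalizing the difference keeps later angle comparisons independent of the eyeball radius.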

The processor 140 may then calculate the position of the gaze on a predetermined screen area. Assuming that the Y axis of the face rotation is parallel to the plane of the screen, the plane of the screen can be calculated and its intersection with the straight line of the gaze can be found.

In an embodiment, when the face pose in camera space is defined as in Equation 6, a matrix that transforms from camera space to the screen coordinate system can be defined. The screen coordinate system is defined by a rotation about the X axis of the camera coordinate system, and the two coordinate systems share the same origin.

In an embodiment, the position of the gaze on the screen is obtained using this transformation matrix as in Equation 7. The resulting point is defined as the point on the gaze line whose Z component is 0.

The processor 140 may determine whether the user is looking at the screen by checking whether the point obtained above, which lies on a plane parallel to the screen, falls within the predefined axis-aligned bounding box of the screen. The processor 140 may also compute the angle between the gaze direction vector and a predetermined reference vector, and determine that the user is looking to the side when this angle exceeds a certain threshold. The angle may be obtained as in Equation 8.
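
The two checks described here can be sketched as follows. The function names, the two-dimensional screen box representation, and the use of the arccosine of the normalized dot product for the angle are assumptions for illustration; the threshold is a free parameter that the disclosure does not specify.

```python
import math


def inside_screen(point_xy, screen_min, screen_max):
    """True if a 2D point lies inside the axis-aligned screen box."""
    return all(lo <= p <= hi
               for p, lo, hi in zip(point_xy, screen_min, screen_max))


def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def looking_sideways(gaze, reference, threshold_rad):
    """True if the gaze deviates from the reference by more than the threshold."""
    return angle_between(gaze, reference) > threshold_rad
```

For example, with a reference vector pointing straight at the camera, `looking_sideways(gaze, reference, math.radians(30))` would flag gazes that deviate by more than 30 degrees; the 30-degree figure is purely illustrative.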

Experimental results on the performance of an implementation of the action classification algorithm are described below with reference to FIGS. 9 to 12.

FIG. 9 shows experimental results of the action classification algorithm according to an embodiment (loss and accuracy graphs for the first dataset), FIG. 10 shows experimental results of the action classification algorithm according to an embodiment (confusion matrix when only two classes were trained), FIG. 11 shows experimental results of the action classification algorithm according to an embodiment (loss and accuracy graphs for the second dataset), and FIG. 12 shows experimental results of the action classification algorithm according to an embodiment (confusion matrix when trained with five classes).

In an embodiment, an experiment was conducted to verify the performance of the action classification apparatus 100. For example, a first dataset and a second dataset may be used to train the network of the action classification algorithm.

The first dataset consists of image frames extracted from videos recorded by three graduate students. There are only two classes, normal and disallowed object use; with both classes combined, the training set has 10,766 frames and the test set has 5,054 frames. Because the frames were extracted at regular intervals from videos in which the subject performs only one of the two actions, normal behavior or using a mobile phone (the disallowed object), the dataset is large but shows little variation in motion.

The second dataset is a broader version with more classes and more participants than the first. It covers five classes (normal, keyboard typing, mobile phone use, handwriting, and reference book use) and consists of 198 images for each of 18 participants.

The normal class comprises actions a person might naturally perform while taking an online exam: clasping the hands, touching the ears, hair, forehead, or collar, resting the chin on a hand, and resting the hands on the desk without doing anything.

The keyboard typing class covers mechanical, membrane, and other keyboards commonly used with desktops, as well as laptop keyboards. For the mobile phone use class, participants were recorded using the phone with only the left or right hand and with both hands; the handwriting class consists of recordings of writing on scrap paper or in a notebook using only a pencil, ballpoint pen, or mechanical pencil. Finally, the reference book use class consists of turning pages and holding a book with both hands.

In one embodiment, several constraints were placed on recording so that the videos collected for the data set would resemble the actual scenario, an online exam, as closely as possible. Each of the 18 participants recorded themselves with a camera they set up at a position of their choosing, in a place where they judged they could take an exam alone without anyone else's involvement.

To diversify the camera position and the angle at which the camera faces the participant, the cameras were placed using the concepts of latitude and longitude. Here, latitude refers to the height at which the camera is placed, and longitude refers to whether the camera is placed to the left of, in front of, or to the right of the participant. For each class, a participant had to record with combinations of three latitude settings and three longitude settings.

Meanwhile, the experimental conditions of one embodiment may be set as follows. TensorFlow 2.3 is used, with training for 20 epochs on the first data set and 100 epochs on the second data set. The batch size may be set to 8 and the optimizer to Adam; the initial learning rate may be set to 0.3 and multiplied by 0.5 whenever the validation accuracy has not improved for two epochs.
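The learning-rate rule described above can be illustrated with a minimal, framework-free sketch. The function name `reduce_on_plateau`, its parameters, and the choice to reset the counter after each reduction are assumptions made for illustration; the embodiment itself uses TensorFlow 2.3, and the source does not specify these details.

```python
def reduce_on_plateau(val_accuracies, initial_lr=0.3, factor=0.5, patience=2):
    """Return the learning rate in effect during each epoch.

    Hypothetical helper: the rate is multiplied by `factor` once the
    validation accuracy has failed to improve for `patience` epochs in a row.
    """
    lr = initial_lr
    best = float("-inf")
    stale_epochs = 0
    schedule = []
    for acc in val_accuracies:
        schedule.append(lr)  # rate used while training this epoch
        if acc > best:
            best = acc
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                lr *= factor
                stale_epochs = 0  # assumption: restart counting after a cut
    return schedule


if __name__ == "__main__":
    # Illustrative accuracies only, not results from the disclosure.
    print(reduce_on_plateau([0.5, 0.6, 0.6, 0.6, 0.7, 0.65, 0.65]))
```

With the illustrative accuracies in the usage lines, the rate is halved only after two consecutive epochs without a new best value.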

First, consider training on the first data set. In one embodiment, an NVIDIA TITAN Xp was used, and training took about 19 hours. The training loss changed as shown in the graph of FIG. 9(a), and the training accuracy changed as shown in FIG. 9(b).

FIG. 9(c) shows the validation loss measured on the test data set at every epoch, and FIG. 9(d) shows the validation accuracy. In one embodiment, an accuracy of 88.8% was achieved on the test data set after training was completed.

FIG. 10 shows confusion matrices from which the true positives, true negatives, false positives, and false negatives can be read; FIG. 10(a) shows the result of training on the first data set.
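As a reminder of how the four cells of a binary confusion matrix relate to the reported accuracy, the following sketch computes accuracy, precision, and recall from the counts. The function name `binary_metrics` and the counts in the usage lines are hypothetical; they are not values read from FIG. 10.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute summary metrics from the cells of a 2x2 confusion matrix."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
```

Here "positive" would correspond to the non-allowed object class, so a false negative is a missed use of a non-allowed object.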

Next, training on the second data set can be divided broadly into two cases.

In the first case, binary classification was trained using only the normal class and the mobile phone use class, matching the data in the first data set. In the second case, multi-class classification was trained using all five classes of the second data set.

In one embodiment, to keep the training data balanced, the number of samples in each class of the second data set was matched to that of the smallest class. Accordingly, in the first case, which trains only the normal and mobile phone use classes, 790 training samples are used per class for a total of 1,580; in the second case, which trains all five classes, the training data is set to 320 samples per class, for a total of 1,600.
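Matching every class to the size of the smallest class can be sketched as a simple random downsampling step. The function name `balance_by_smallest_class`, the `(item, label)` sample layout, and the fixed random seed are assumptions; the source does not say how the retained samples were chosen.

```python
import random
from collections import defaultdict


def balance_by_smallest_class(samples, seed=0):
    """Downsample every class to the size of the smallest one.

    `samples` is a list of (item, label) pairs. Random selection is an
    assumption made for this sketch.
    """
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)
    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        for item in rng.sample(by_label[label], target):
            balanced.append((item, label))
    return balanced


if __name__ == "__main__":
    data = ([(i, "normal") for i in range(5)]
            + [(i, "phone") for i in range(3)]
            + [(i, "writing") for i in range(4)])
    print(len(balance_by_smallest_class(data)))
```

In the usage lines the smallest class has three samples, so three samples are kept from each of the three classes.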

The model trained on only the two classes, normal and mobile phone use, achieved an accuracy of 86.6% on the test data set; the corresponding confusion matrix is shown in FIG. 10(b).

In one embodiment, when training uses the first data set, the test data set may include other videos of people who also appear in the training data set. When training uses the second data set, the test data set may consist only of people who are not included in the training data set.

Thus the model trained on the second data set was tested on people it had never seen, yet its accuracy is not much lower than that of the model trained on the first data set, which consists of only three people and is overfitted. This suggests that its generalization performance is better.

In one embodiment, the changes in loss and accuracy are shown in FIG. 11; training took about 7 hours on an NVIDIA GeForce RTX 2080 SUPER.

Meanwhile, the model trained on all five classes recorded an accuracy of 47.1% on the test data set. As described above, the number of samples per class was matched to the smallest class to keep the training data balanced, so a total of 1,600 samples was used.

Although this is almost the same amount of data as for the model trained on only the normal and mobile phone use classes, there are five classes, so training may have been insufficient. Collecting more data for every class and training on it should allow the model's performance to improve further.

FIG. 12 shows the confusion matrix of the model trained on five classes; in one embodiment, training took about 10 hours on an NVIDIA TITAN Xp.

FIG. 13 is a flowchart illustrating an action classification method according to an embodiment.

Referring to FIG. 13, in step S100, the action classification apparatus 100 acquires an image of a non-face-to-face user.

In step S200, the action classification apparatus 100 detects the hand region in the non-face-to-face user image based on a pre-trained action classification algorithm and classifies the action of the non-face-to-face user.

Here, the action classification apparatus 100 may detect a hand region image and hand keypoint information from the acquired non-face-to-face user image.

That is, the action classification apparatus 100 may detect a region of interest (ROI) image of the entire hand region in the non-face-to-face user image and predict a three-dimensional coordinate value for each hand keypoint in the detected ROI image.

The action classification apparatus 100 may also detect an ROI image of the palm or back-of-hand region in the non-face-to-face user image and expand that ROI to an ROI of the entire hand region.
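One simple way to picture the expansion from a palm (or back-of-hand) ROI to a whole-hand ROI is to scale the box about its center and clamp it to the image. This is only a sketch: the function name `expand_roi`, the `(x_min, y_min, x_max, y_max)` box layout, and the scale factor are hypothetical, and the source does not state how the expansion is computed.

```python
def expand_roi(box, scale, image_width, image_height):
    """Scale a box about its center and clamp it to the image bounds.

    `box` is (x_min, y_min, x_max, y_max) in pixels. The scale factor is an
    illustrative assumption, not a value from the disclosure.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    half_w = (x_max - x_min) * scale / 2.0
    half_h = (y_max - y_min) * scale / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(image_width), cx + half_w),
        min(float(image_height), cy + half_h),
    )


if __name__ == "__main__":
    print(expand_roi((40, 40, 60, 60), 2.0, 100, 100))
```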

In addition, the action classification apparatus 100 may locate the center of the wrist and the middle finger in the detected ROI image of the entire hand region, and obtain the ROI image of the entire hand region by rotating it so that the line segment connecting the center of the wrist and the middle finger is parallel to the y-axis of the image.
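The rotation that brings the wrist-to-middle-finger segment parallel to the image y-axis can be worked out with basic trigonometry. In the sketch below, the function names and the choice to point the middle finger toward negative y (upward in image coordinates) are assumptions; the source only requires that the segment be parallel to the y-axis.

```python
import math


def alignment_angle(wrist, middle_finger):
    """Angle (radians) that rotates the wrist->finger vector onto (0, -1)."""
    dx = middle_finger[0] - wrist[0]
    dy = middle_finger[1] - wrist[1]
    # Direction (0, -1) has angle -pi/2; subtract the current direction angle.
    return -math.pi / 2 - math.atan2(dy, dx)


def rotate_about(point, center, angle):
    """Rotate `point` around `center` by `angle` using the standard 2D matrix."""
    x = point[0] - center[0]
    y = point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + c * x - s * y, center[1] + s * x + c * y)


if __name__ == "__main__":
    wrist, finger = (10.0, 20.0), (13.0, 24.0)
    theta = alignment_angle(wrist, finger)
    print(rotate_about(finger, wrist, theta))
```

Applying the same angle to every pixel or keypoint of the ROI, with the wrist as the center, yields a hand crop with a consistent orientation.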

The action classification apparatus 100 may then classify the action of the non-face-to-face user by detecting the object in the hand region and the posture of the hand based on the hand region image and the hand keypoint information.

Here, the action classification apparatus 100 may determine the posture of the non-face-to-face user's hand based on the ROI image and the three-dimensional coordinate value of each hand keypoint.

In step S300, the action classification apparatus 100 determines, based on the action classification result, whether the non-face-to-face user is using a non-allowed object.

That is, in one embodiment, the pre-trained action classification algorithm may be a learning model trained so that, when a non-face-to-face user image is input, it takes as its input query the ROI image of the entire hand region and the hand keypoint information detected from that image, and outputs whether a non-allowed object is being used according to the action class mapped to the object in the hand region and the posture of the hand.

This pre-trained action classification algorithm is trained through a training phase. The training phase may include: when a non-face-to-face user image is input, detecting an ROI image of the entire hand region in the non-face-to-face user image; predicting a three-dimensional coordinate value for each hand keypoint in the detected ROI image of the entire hand region; classifying the action of the non-face-to-face user by detecting the object in the hand region and the posture of the hand based on the ROI image and the three-dimensional coordinate value of each hand keypoint; and inferring, from the action classification result, whether the non-face-to-face user is using a non-allowed object.

Here, the step of classifying the action of the non-face-to-face user in the training phase may include: generating an empty grid of the same size as the ROI image of the entire hand region; marking each keypoint coordinate on the empty grid based on the three-dimensional coordinate value of each hand keypoint; and applying a Gaussian blur to the grid on which each keypoint coordinate is marked.

The step of classifying the action of the non-face-to-face user in the training phase may also include classifying the action of the non-face-to-face user based on an input tensor composed of channels based on the color values of the ROI image of the entire hand region and one channel per keypoint.
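The grid-marking, Gaussian-blur, and channel-stacking steps can be sketched in plain Python as follows. All function names, the kernel radius and sigma, the zero padding at the borders, and the use of only the x and y components of each three-dimensional keypoint are assumptions made for illustration; the source does not give these details.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(grid, kernel):
    """Convolve each row with the kernel, treating out-of-range cells as 0."""
    radius = len(kernel) // 2
    height, width = len(grid), len(grid[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                cc = c + k - radius
                if 0 <= cc < width:
                    acc += weight * grid[r][cc]
            out[r][c] = acc
    return out


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def keypoint_heatmap(height, width, keypoint, sigma=1.0, radius=3):
    """Mark one keypoint on an empty grid, then apply a separable blur."""
    grid = [[0.0] * width for _ in range(height)]
    x, y = int(round(keypoint[0])), int(round(keypoint[1]))  # z is ignored here
    if 0 <= x < width and 0 <= y < height:
        grid[y][x] = 1.0
    kernel = gaussian_kernel(sigma, radius)
    blurred = blur_rows(grid, kernel)                            # horizontal pass
    return transpose(blur_rows(transpose(blurred), kernel))     # vertical pass


def build_input_tensor(color_channels, keypoints, sigma=1.0):
    """Stack the ROI color channels with one heatmap channel per keypoint."""
    height, width = len(color_channels[0]), len(color_channels[0][0])
    heatmaps = [keypoint_heatmap(height, width, kp, sigma) for kp in keypoints]
    return list(color_channels) + heatmaps


if __name__ == "__main__":
    rgb = [[[0.0] * 8 for _ in range(8)] for _ in range(3)]
    tensor = build_input_tensor(rgb, [(2, 3, 0.1), (5, 5, 0.0)])
    print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

The resulting channel count is the number of color channels plus the number of keypoints, which is the layout a convolutional classifier would consume under these assumptions.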

FIG. 14 is a flowchart illustrating an action classification method that includes gaze tracking, according to an embodiment.

In one embodiment, after steps S100 to S300 of FIG. 13 are performed, an additional process based on gaze information may be performed.

That is, when it is determined in steps S100 to S300, based on the result of classifying the non-face-to-face user's action from the hand region image and the hand keypoint information, that the non-face-to-face user is using a non-allowed object, the action classification apparatus 100 determines in step S400 whether the gaze direction of the non-face-to-face user is toward the position of the hand.

Here, the action classification apparatus 100 may detect a face region image and face keypoint information from the acquired non-face-to-face user image, and determine the gaze direction of the non-face-to-face user based on the face region image and the face keypoint information.

Based on the face region image and the face keypoint information, the action classification apparatus 100 may derive the two-dimensional position of the iris in image space and the face pose in three-dimensional camera space, and detect the gaze direction of the non-face-to-face user from the direction of the iris in three-dimensional space.

Then, in step S500, when the gaze direction of the non-face-to-face user is toward the position of the hand, the action classification apparatus 100 may additionally determine that the non-face-to-face user is using a non-allowed object.
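The check in steps S400 and S500 can be pictured as an angle test between the gaze direction and the direction from the eye to the hand, combined with the hand-based result. The function names, the three-dimensional point layout, and the 15-degree threshold below are hypothetical; the source does not state how "toward the position of the hand" is measured.

```python
import math


def gaze_points_at_hand(eye_pos, gaze_dir, hand_pos, max_angle_deg=15.0):
    """True if the gaze direction is within `max_angle_deg` of the eye->hand ray.

    All inputs are 3D tuples in the same (assumed) camera coordinate frame.
    """
    to_hand = [h - e for h, e in zip(hand_pos, eye_pos)]
    norm_gaze = math.sqrt(sum(g * g for g in gaze_dir))
    norm_hand = math.sqrt(sum(t * t for t in to_hand))
    if norm_gaze == 0.0 or norm_hand == 0.0:
        return False  # direction undefined, so no confirmation
    cos_angle = sum(g * t for g, t in zip(gaze_dir, to_hand)) / (norm_gaze * norm_hand)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle)) <= max_angle_deg


def confirm_non_allowed_use(hand_flag, eye_pos, gaze_dir, hand_pos):
    """Gaze is consulted only when the hand-based classifier raised a flag."""
    return bool(hand_flag) and gaze_points_at_hand(eye_pos, gaze_dir, hand_pos)


if __name__ == "__main__":
    print(confirm_non_allowed_use(True, (0, 0, 0), (0, -1, 1), (0, -10, 10)))
```

This mirrors the sequential variant, in which the gaze check runs only after the hand-based result indicates use of a non-allowed object.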

However, an embodiment is not limited to this; whether the non-face-to-face user is using a non-allowed object may be determined from both the result of classifying the user's action based on the hand region image and the hand keypoint information, and the user's gaze direction based on the face region image and the face keypoint information.

That is, depending on the embodiment, rather than detecting the gaze direction only after the hand tracking-based action classification result indicates that the user has used a non-allowed object, the action classification result may be inferred from the hand information and the gaze information together in order to determine whether a non-allowed object is being used.

Meanwhile, in step S600, the action classification apparatus 100 may output a warning to the non-face-to-face user when use of a non-allowed object is determined.

In one embodiment, the warning may be output to the terminal that the non-face-to-face user is using or at the place where the non-face-to-face user is located, and at the same time a notification about the user who performed the non-allowed action may be output to an administrator terminal.

The warning method is not specifically limited. Depending on the embodiment, the warning output after cheating is determined by hand tracking may differ from the warning output after cheating is additionally determined by gaze tracking.

The embodiments of the present disclosure described above may be implemented in the form of a computer program executable on a computer through various components, and such a computer program may be recorded on a computer-readable medium. The medium may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

The computer program may be specially designed and configured for the present disclosure, or may be known and available to those skilled in the field of computer software. Examples of computer programs include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like.

In the specification of the present disclosure (particularly in the claims), the use of the term "said" and similar referring terms may correspond to both the singular and the plural. In addition, when a range is described in the present disclosure, it includes the invention to which each individual value within the range is applied (unless stated otherwise), which is the same as setting out each individual value constituting the range in the detailed description of the invention.

Unless an order is explicitly stated for the steps constituting the method according to the present disclosure, or there is a statement to the contrary, the steps may be performed in any suitable order. The present disclosure is not necessarily limited by the order in which the steps are described. All examples and exemplary terms (for example, "etc.") in the present disclosure are used simply to describe the present disclosure in detail, and the scope of the present disclosure is not limited by those examples or exemplary terms unless limited by the claims. In addition, those skilled in the art will appreciate that various modifications, combinations, and changes may be made according to design conditions and factors within the scope of the appended claims or their equivalents.

Therefore, the spirit of the present disclosure should not be confined to the embodiments described above; not only the claims set out below but also all scopes equivalent to, or equivalently modified from, those claims fall within the scope of the spirit of the present disclosure.

1: Image restoration system
100: Image restoration apparatus
110: Communication unit
120: User interface
130: Memory
140: Processor
200: User terminal
300: Server
400: Network

Claims

A hand tracking-based action classification method for determining an action of a non-face-to-face user based on a hand image of a user using content in a non-face-to-face manner, at least part of each step being performed by a processor,
Obtaining an image of a non-face-to-face user;
Classifying the behavior of the non-face-to-face user by detecting a hand region of the non-face-to-face user image based on a pre-learned behavior classification algorithm; and
Based on the action classification result, determining whether the non-face-to-face user uses a non-allowed object,
Hand tracking-based behavior classification method.

According to claim 1,
The step of classifying the behavior of the non-face-to-face user,
Detecting a hand region image and hand keypoint information based on the acquired non-face-to-face user image; and
Classifying an action of the non-face-to-face user by detecting an object in the hand region and a posture of the hand based on the hand region image and the hand keypoint information,
Hand tracking-based behavior classification method.

According to claim 2,
The step of detecting the hand region image and hand keypoint information,
detecting a region of interest (ROI) image of an entire hand region from the non-face-to-face user image; and
Predicting three-dimensional coordinate values for each hand keypoint in the ROI image of the entire detected hand region,
Hand tracking-based behavior classification method.

According to claim 3,
The step of detecting the ROI image of the entire region of the hand,
detecting an ROI image of a palm or back of the hand region in the non-face-to-face user image; and
Expanding the ROI of the palm or back of the hand region to the ROI of the entire hand region,
Hand tracking-based behavior classification method.

According to claim 4,
The step of detecting the ROI image of the entire region of the hand,
detecting positions of the center of the wrist and the middle finger in the detected ROI image of the entire region of the hand; and
Obtaining an ROI image of the entire detected hand region by rotating the identified line segment connecting the center of the wrist and the middle finger in parallel with the y-axis of the image,
Hand tracking-based behavior classification method.

According to claim 4,
The step of classifying the behavior of the non-face-to-face user,
Determining the posture of the hand of the non-face-to-face user based on the ROI image and the three-dimensional coordinate values for each of the hand keypoints,
Hand tracking-based behavior classification method.

According to claim 1,
The pre-learned behavior classification algorithm,
When the non-face-to-face user image is input, the ROI image of the entire hand region and hand keypoint information detected based on the non-face-to-face user image are queried as input, and the action mapped to the object in the hand region and the posture of the hand A learning model learned to output whether or not to use a disallowed object according to the class,
Hand tracking-based behavior classification method.

According to claim 7,
The pre-learned behavior classification algorithm is trained through a training phase,
The training phase is
detecting an ROI image of an entire hand region from the non-face-to-face user image when the non-face-to-face user image is input;
predicting three-dimensional coordinate values for each hand keypoint in the detected ROI image of the entire region of the hand;
Classifying an action of the non-face-to-face user by detecting an object in the hand region and a posture of the hand based on the ROI image and the three-dimensional coordinate values for each keypoint of the hand; and
Including the step of inferring whether the non-face-to-face user uses a non-allowed object according to the action classification result,
Hand tracking-based behavior classification method.

According to claim 8,
Classifying the behavior of the non-face-to-face user in the training phase,
generating an empty grid having the same size as the ROI image of the entire region of the hand;
marking each keypoint coordinate point on the empty grid based on the three-dimensional coordinate value for each keypoint of the hand; and
Applying Gaussian blur to a grid marked with each keypoint coordinate point,
Hand tracking-based behavior classification method.

According to claim 9,
Classifying the behavior of the non-face-to-face user in the training phase,
Classifying the action of the non-face-to-face user based on an input tensor composed of a channel based on a color value of the ROI image of the entire hand region and a channel for each keypoint,
Hand tracking-based behavior classification method.

According to claim 1,
When it is determined, based on the action classification result for the non-face-to-face user obtained from the hand region image and the hand keypoint information, that the non-face-to-face user uses a non-allowed object, determining whether the gaze direction of the non-face-to-face user is toward the position of the hand; and
Further comprising additionally determining that the non-face-to-face user is using a non-allowed object when the gaze direction of the non-face-to-face user is toward the position of the hand,
Hand tracking-based behavior classification method.

According to claim 11,
Determining whether the non-face-to-face user's gaze direction is toward the hand position,
detecting a face region image and face keypoint information based on the acquired non-face-to-face user image; and
Determining the direction of the gaze of the non-face-to-face user based on the face region image and the face keypoint information,
Hand tracking-based behavior classification method.

According to claim 12,
Determining the direction of the gaze of the non-face-to-face user,
deriving a two-dimensional position of the iris in an image space and a face posture in a three-dimensional space in a camera space based on the face region image and the facial keypoint information; and
Detecting the direction of the gaze of the non-face-to-face user through the direction of the iris on the three-dimensional space,
Hand tracking-based behavior classification method.

According to claim 1,
The step of determining whether the non-face-to-face user uses a non-allowed object,
Further comprising determining whether the non-face-to-face user uses a non-allowed object based on the action classification result for the non-face-to-face user obtained from the hand region image and the hand keypoint information, and on the gaze direction of the non-face-to-face user obtained from a face region image and face keypoint information,
Hand tracking-based behavior classification method.

According to claim 1,
Further comprising outputting a warning to the non-face-to-face user according to the determination that a non-allowed object is used,
Hand tracking-based behavior classification method.

A hand tracking-based action classification device for determining a non-face-to-face user's action based on a hand image of the user using non-face-to-face content,
Memory; and
a processor coupled with the memory and configured to execute computer readable instructions contained in the memory;
The processor is configured to perform operations of:
Obtaining an image of a non-face-to-face user;
Detecting a hand region of the non-face-to-face user image based on a pre-learned behavior classification algorithm and classifying the behavior of the non-face-to-face user; and
Determining, based on the action classification result, whether the non-face-to-face user uses a non-allowed object,
A hand tracking-based behavior classification device.

According to claim 16,
The operation of classifying the behavior of the non-face-to-face user,
An operation of detecting a hand region image and hand keypoint information based on the acquired non-face-to-face user image, and
Based on the hand region image and the hand keypoint information, detecting an object in the hand region and a posture of the hand, and classifying the action of the non-face-to-face user,
A hand tracking-based behavior classification device.

According to claim 16,
The pre-learned behavior classification algorithm,
When the non-face-to-face user image is input, the ROI image of the entire hand region and hand keypoint information detected based on the non-face-to-face user image are queried as input, and the action mapped to the object in the hand region and the posture of the hand A learning model learned to output whether or not to use a disallowed object according to the class,
A hand tracking-based behavior classification device.

According to claim 16,
The processor further performs:
An operation of determining, when it is determined based on the action classification result for the non-face-to-face user obtained from the hand region image and the hand keypoint information that the non-face-to-face user uses a non-allowed object, whether the gaze direction of the non-face-to-face user is toward the position of the hand, and
An operation of additionally determining that the non-face-to-face user is using a non-allowed object when the gaze direction of the non-face-to-face user is toward the position of the hand,
A hand tracking-based behavior classification device.

According to claim 16,
The operation of determining whether the non-face-to-face user uses a non-allowed object,
Further comprising an operation of determining whether the non-face-to-face user uses a non-allowed object based on the action classification result for the non-face-to-face user obtained from the hand region image and the hand keypoint information, and on the gaze direction of the non-face-to-face user obtained from a face region image and face keypoint information,
A hand tracking-based behavior classification device.