KR102167760B1

KR102167760B1 - Sign language analysis Algorithm System using Recognition of Sign Language Motion process and motion tracking pre-trained model

Info

Publication number: KR102167760B1
Application number: KR1020200092505A
Authority: KR
Inventors: 유승수; 금청
Original assignee: 주식회사 멀틱스
Priority date: 2020-07-24
Filing date: 2020-07-24
Publication date: 2020-10-19

Abstract

The present invention relates to a sign language motion analysis algorithm system using a sign language motion recognition processing procedure and a motion tracking pre-trained model. The system builds sign language motion datasets on which deep learning can be trained for the manual elements of sign language (handshape, hand location, hand movement, palm orientation, body movement) and for non-manual signals (facial expression). It recognizes sign language motions accurately through a deep learning algorithm that computes human body skeleton, facial feature point, and finger skeleton tracking information on the basis of a pre-trained model, and it uses a hand movement avatar (character) to communicate with hearing-impaired people. The system comprises a plurality of cameras, a motion processing unit, a hand motion extraction unit, a facial expression extraction unit, a hand motion feature point extraction unit, a facial expression feature point extraction unit, a data conversion unit, an image generation unit, a word model unit, a sentence model unit, an intention analysis unit, a sentence conversion unit, and a text sentence generation unit.

Description

Sign language motion analysis algorithm system using a sign language motion recognition processing procedure and a motion tracking pre-trained model.

The present invention relates to a sign language motion analysis algorithm system using a sign language motion recognition processing procedure and a motion tracking pre-trained model. More specifically, it builds a sign language motion database on which deep learning can be trained for the manual elements of sign language (handshape, hand location, hand movement, palm orientation, body movement) and for non-manual signals (facial expression), and it recognizes sign language motions accurately through a deep learning algorithm that computes human body skeleton, facial feature point, and finger skeleton tracking information on the basis of a pre-trained model, thereby providing a sign language translation service through which hearing-impaired people can communicate.

In general, sign language consists of manual elements (handshape, hand location, hand movement, palm orientation, body movement) and non-manual signals (facial expression). The meaning of a sign changes with the shape of the hand and fingers (handshape), the direction of the palm (palm orientation), the position of the hand (hand location), and the movement of the hand (hand movement). In addition, the same motion takes on a different meaning depending on the facial expression that accompanies it.

For this reason, the state designated fingerspelling (finger letters) as standard sign language. However, because fingerspelling is expressed through handshape alone, it is a very inefficient way to express sign language and is rarely used in everyday life.

Prior patent 1 describes a system that, to support communication for hearing-impaired people, renders what a speaker says as sign language on a display device by way of a dedicated device: speech (sound) is converted to text, and the text is then expressed as the motions (sign language) of a 3D-animated character.

Prior patent 1 consists of a speech recognition unit, a character recognition unit, and a sign language display unit. The speech recognition unit transcribes (dictates) the input speech signal into a caption. The character recognition unit and the sign language display unit encode the caption and link the code to a database of the 3D character's arm motions, hand motions, and facial expressions, so that the sign language motion is shown as 3D animation.

However, prior patent 1 converts speech into sign language. It has no configuration for the reverse direction, that is, for converting sign language into sentences for verification, or for converting sign language into written sentences that non-signers can understand.

Prior patent 2 describes a sign language animation generating device that lets even a person with no knowledge of sign language create sign language animations, so that information in emergencies such as disasters or accidents can be conveyed clearly to hearing-impaired people through sign language in addition to text and speech. The device prepares, for each frequently used sign language sentence, a sign language sentence template having a variable part. It includes a means for selecting the sign language word to be substituted into the variable part, and a means for generating a fingerspelled sign language animation from the Japanese kana reading dictionary entry for the headword when CG data for a required sign language word has not been prepared. In this way, even a person who does not know sign language can quickly create and transmit a sign language animation.

However, in prior patent 2 the person generating the sign language sentence produces the sign language animation data manually, by searching for and selecting a sign language sentence template and then selecting and looking up the variable part, so an operator must always be on standby. Because only one camera films the person, the exact sign language motion cannot be checked and converted down to its fine details. Moreover, because the configuration does not convert sign language into written sentences, the sign language cannot be verified.

In addition, the converted sign language animation cannot be verified.

Prior patent 3 describes a configuration that includes a source text analysis module and an animation data generation module. The source text analysis module automatically analyzes incoming text information and extracts an identifier for the standard text together with information on the characters of the variable text part. The animation data generation module queries a database with the information output by the source text analysis module, extracts fixed-text-part animation data and fingerspelling animation data, and combines them to generate animation data corresponding to the incoming standard text.

However, prior patent 3 cannot verify the sign language animation when converting language into sign language animation, and it does not convert sign language into written sentences that non-signers can read, so non-signers cannot understand the sign language.

Prior patent 1: Korean Patent Application Publication No. 10-2001-0107877 (2001.12.07.)
Prior patent 2: Japanese Patent Application Laid-Open No. Hei 9-274428
Prior patent 3: Korean Registered Patent Publication No. 10-2110445 (2020.05.07.)

The primary language of hearing-impaired people is sign language, yet most of the information they encounter is in written form (Korean sentences), and, unlike non-disabled people, hearing-impaired people have limited ability to read written text. As a result, limited access to information in sign language, such as everyday information, public service information, culture and arts information, and emergency disaster/safety text alerts, can threaten their daily lives and their right to survival in many situations. An object of the present invention is to recognize the motion information and non-manual information of sign language, infer the intention through deep learning analysis, and provide accurate sign language motion information to hearing-impaired people, so that the invention can be applied to services that give them the same information as non-disabled people and let them grasp that information accurately and respond to it quickly.

In addition, because the left and right hands (fingerspelling) alone do not carry enough information to generate a sentence from sign language motion, another object is to raise the recognition rate of sign language motions. For a sign language motion video input through a video input device using three cameras, the motion processing unit applies a deep learning model (algorithm) based on a motion tracking (estimation) pre-trained model and extracts key points of the left and right hands (fingerspelling) and of the facial expression (non-manual signal).

To achieve the above objects, the sign language motion analysis algorithm system using a sign language motion recognition processing procedure and a motion tracking pre-trained model according to the present invention comprises:
a plurality of cameras installed at predetermined intervals in three or more directions in front of a sign language user to photograph the user's upper body;
a motion processing unit that converts the video captured by the plurality of cameras into images, separates the user's motion from the background in the images using a motion tracking pre-trained model, and extracts only the user's motion through deep learning;
a hand motion extraction unit that extracts arm and hand movements from the extracted user motion;
a facial expression extraction unit that extracts the facial expression from the extracted user motion;
a hand motion feature point extraction unit that measures all point coordinates associated with the hand movements extracted by the hand motion extraction unit;
a facial expression feature point extraction unit that measures all point coordinates associated with the facial expression extracted by the facial expression extraction unit;
a data conversion unit that converts the hand motion feature point coordinates and facial expression coordinates extracted by the hand motion feature point extraction unit and the facial expression feature point extraction unit into data;
an image generation unit that generates a feature point image for the sign language from the data converted by the data conversion unit;
a word model unit that identifies, through deep learning, the characters of a word and the word corresponding to the feature points of the image, using the image for the word converted by the image generation unit;
a sentence model unit that identifies, through deep learning, the feature points of the sentence the user signed, using the image for the word converted by the image generation unit;
an intention analysis unit that analyzes, through deep learning, whether the words and sentences identified by the word model unit and the sentence model unit match or resemble the intention of the sentence the user originally signed;
a sentence conversion unit that, depending on whether the intentions of the words and sentences analyzed by the intention analysis unit match, arranges a written sentence from the sentence expressing the meaning of the words and signs and from the user's video images; and
a text sentence generation unit that generates a sentence made of characters from the written sentence converted by the sentence conversion unit, following the sentence arrangement.

The hand motion feature point extraction unit receives the user's sign language motion video through the cameras and, using a deep learning model based on a motion tracking (estimation) pre-trained model, extracts 21 key points from each of the left and right hands (fingerspelling), for a total of 42 hand key points. When extracting the facial expression (non-manual signal) from the camera video, the facial expression feature point extraction unit extracts 70 key points from the nose, eyes, and lips, so that the facial expression (non-manual signal) is added to the fingerspelling.
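The key point counts above (21 per hand, 42 for both hands, 70 for the face) can be illustrated with a minimal Python sketch. The container name `FrameKeypoints`, the `(x, y)` tuple layout, and the flattening order are assumptions made for illustration; the patent does not specify a data structure.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Counts stated in the text: 21 key points per hand, 70 for the face.
HAND_KEYPOINTS = 21
FACE_KEYPOINTS = 70

Point = Tuple[float, float]  # (x, y) in image pixels; this layout is an assumption


@dataclass
class FrameKeypoints:
    """Key points for one video frame (hypothetical container, not from the patent)."""

    left_hand: List[Point]
    right_hand: List[Point]
    face: List[Point]

    def __post_init__(self) -> None:
        # Reject frames that do not carry the stated number of key points.
        if len(self.left_hand) != HAND_KEYPOINTS or len(self.right_hand) != HAND_KEYPOINTS:
            raise ValueError("each hand must have 21 key points")
        if len(self.face) != FACE_KEYPOINTS:
            raise ValueError("the face must have 70 key points")

    def hands(self) -> List[Point]:
        """Left and right hands combined: 42 key points."""
        return self.left_hand + self.right_hand

    def flatten(self) -> List[float]:
        """Flat vector of (42 + 70) points x 2 coordinates = 224 values."""
        return [coord for point in self.hands() + self.face for coord in point]
```

Keeping the hand and face lists as separate fields mirrors the text's separation of hand (fingerspelling) and facial expression (non-manual) key points.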

The data conversion unit builds a standard database of the left/right hand (fingerspelling) key points and the facial expression (non-manual) key points for sign language motions. It stores the left/right hand key points and the facial expression key points generated by the hand motion feature point extraction unit and the facial expression feature point extraction unit separately, so that, for the same sign language motion, the collected set of hand key points and the collected set of facial expression key points are kept apart. It then combines the left/right hands (fingerspelling) with the facial expressions (non-manual signals) to form a new database, which is used to train the sign language motion recognition model through deep learning and thereby increase the recognition rate. The extracted left/right hand key points and facial expression key points are each written to a file in JSON format, and the JSON files are stored in the database. The information stored in the JSON format consists, for each key point in the image and for as many frames as there are, of an X coordinate (0 to image width), a Y coordinate (0 to image height), and an accuracy value (0 to 1).
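The per-frame JSON record described above can be sketched as follows. This is only an illustration under stated assumptions: the field names `width`, `height`, `frames`, `frame`, and `keypoints`, and the flat `[x, y, accuracy, ...]` ordering, are hypothetical, since the patent gives the value ranges but not the exact schema.

```python
import json
from typing import Dict, List, Sequence, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, accuracy), ranges as stated in the text


def validate_keypoint(kp: Keypoint, width: int, height: int) -> None:
    """Check 0 <= x <= width, 0 <= y <= height, 0 <= accuracy <= 1."""
    x, y, accuracy = kp
    if not (0 <= x <= width and 0 <= y <= height and 0.0 <= accuracy <= 1.0):
        raise ValueError(f"key point out of range: {kp}")


def frames_to_json(frames: Sequence[Sequence[Keypoint]], width: int, height: int) -> str:
    """Serialize one entry per frame; the field names are hypothetical."""
    records: List[Dict[str, object]] = []
    for index, keypoints in enumerate(frames):
        flat: List[float] = []
        for kp in keypoints:
            validate_keypoint(kp, width, height)
            flat.extend(kp)  # stored as x, y, accuracy, x, y, accuracy, ...
        records.append({"frame": index, "keypoints": flat})
    return json.dumps({"width": width, "height": height, "frames": records})
```

Under this sketch, hand key points and facial expression key points would be passed through `frames_to_json` separately, giving one JSON document for each, in line with the separate storage the text describes.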

The word model unit consists of a feature point extraction module made up of 16 output layers for image recognition, and a word classification module. The word classification module applies, to the signal vector (feature map) produced at the output layer of the feature point extraction module, an inference output activation function (given as "풀이", literally "solution", in the original) for sign language words obtained from the standard data of the sign language motion database. The 16 output layers for image recognition form a deep learning model: when it receives a sign language motion as input, it extracts features so that key points can be extracted from the motion, and the subsequent convolution layers recognize the shape or form of the hand. The word model unit takes the left/right hand key points and the facial expression key points as input. The left/right hand image features output by the convolution layers of the 16-layer feature point extraction module that processes the hands are merged with the image features output for the facial expression into a single feature map. This feature map is then passed as input to the word classification module, which applies the output activation function, to generate a word.
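The final step of the word model, merging hand and facial expression features into one feature map and applying an output activation to pick a word, can be sketched in plain Python. The patent does not name the activation function or the classifier layout; softmax over a single linear layer is an assumption here, and the 16-layer feature extractor is not reproduced. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax (assumed output activation)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_word(
    hand_features: Sequence[float],
    face_features: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    vocabulary: Sequence[str],
) -> Tuple[str, float]:
    """Concatenate the two feature vectors and score each vocabulary word."""
    # One merged feature map: hand features followed by facial expression features.
    feature_map = list(hand_features) + list(face_features)
    # One linear score per word (assumed classifier head).
    logits = [
        sum(w * f for w, f in zip(row, feature_map)) + b
        for row, b in zip(weights, bias)
    ]
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return vocabulary[best], probabilities[best]
```

Concatenation is the simplest way to merge the two feature sets; it lets the classifier weigh a facial expression feature differently for each word, which matches the text's point that the same hand motion can mean different things with different expressions.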

Because the signals that precede and follow one another in a signed sentence are correlated, the sentence model unit consists of an encoder and a decoder for inferring the signed sentence. The encoder extracts the feature map, which is the feature vector of the image, together with the attention, and the decoder uses these as weights: the attention is applied to the feature map input to the decoder to yield a vector used for extracting the closest word. The feature map is also used as the input to a sentence classification module. The sentence classification module works on a time-series basis and generates a sentence by applying an output control vector to the value produced by an activation function computed from the product of the input and an input control vector.
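The two operations named above, attention-weighted use of the feature map and a gated time step (input multiplied by an input control vector, passed through an activation, then scaled by an output control vector), can be sketched as follows. This is a simplified illustration, not the patent's model: softmax-normalized attention scores and a `tanh` activation are assumptions, and the cell state and other gates of a full recurrent unit such as an LSTM are omitted. Function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax used to normalize attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(feature_maps: Sequence[Sequence[float]], scores: Sequence[float]) -> List[float]:
    """Weighted sum of per-frame feature vectors using attention weights."""
    weights = softmax(scores)
    dim = len(feature_maps[0])
    return [sum(w * fm[i] for w, fm in zip(weights, feature_maps)) for i in range(dim)]


def gated_step(
    x: Sequence[float],
    input_gate: Sequence[float],
    output_gate: Sequence[float],
) -> List[float]:
    """Element-wise output_gate * tanh(input_gate * x) for one time step."""
    return [o * math.tanh(i * v) for v, i, o in zip(x, input_gate, output_gate)]
```

In this sketch a decoder would call `attend` once per output word to build a context vector from the encoder's per-frame features, then pass that vector through `gated_step`; how the gate vectors themselves are learned is outside the scope of the sketch.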

The present invention obtains key point information from the sign language motion video input through three cameras and, to improve the recognition result, generates integrated key points that are used as input to the word model and the sentence model. This raises the recognition rate of the input sign language motion and makes it possible to provide an appropriate response in sign language.

In addition, by individually combining the sets of left/right hand (fingerspelling) key points with the sets of facial expression (non-manual) key points for sign language motions, additional training data for future word recognition can be generated, and using this training data in the motion recognition model can raise the recognition rate.
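One way to read the combination of separately stored hand and facial expression key point sets is as a per-sign cross product: every hand recording of a sign is paired with every facial expression recording of the same sign. The sketch below assumes that reading; the function name, the dictionary layout keyed by sign label, and the pairing rule are all hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple, TypeVar

Hand = TypeVar("Hand")
Face = TypeVar("Face")


def combine_samples(
    hand_sets: Dict[str, Sequence[Hand]],
    face_sets: Dict[str, Sequence[Face]],
) -> List[Tuple[str, Hand, Face]]:
    """Pair each hand recording with each face recording of the same sign."""
    combined: List[Tuple[str, Hand, Face]] = []
    # Only signs present in both collections can be combined.
    for label in sorted(hand_sets.keys() & face_sets.keys()):
        for hand, face in product(hand_sets[label], face_sets[label]):
            combined.append((label, hand, face))
    return combined
```

With 2 hand recordings and 3 facial expression recordings of one sign, this yields 6 training samples from 5 recordings, which is the kind of additional training data the paragraph above refers to.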

In addition, because sign language is converted into written sentences, non-signers can read the content of the sign language as text and hold a conversation.

FIG. 1 is a diagram showing the overall configuration of a sign language motion analysis algorithm system using a sign language motion recognition processing procedure and a motion tracking pre-trained model according to an embodiment of the present invention.
FIG. 2 is a diagram showing how video of a sign language user is captured and processed through the plurality of cameras of the sign language motion analysis algorithm system according to the present invention.
FIG. 3 is a diagram showing the word model unit of the sign language motion analysis algorithm system according to the present invention.
FIG. 4 is a diagram showing the sentence model unit of the sign language motion analysis algorithm system according to the present invention.

Specific embodiments for carrying out the present invention are described below with reference to the drawings. The embodiments are provided to explain a single invention, and the scope of rights is not limited to the illustrated embodiments. For clarity, the drawings enlarge only the essential parts and omit incidental ones, so the invention should not be construed as limited to the drawings.

The present invention comprises:
a plurality of cameras (10) installed at predetermined intervals in three or more directions in front of a sign language user to photograph the user's upper body;
a motion processing unit (20) that converts the video captured by the plurality of cameras (10) into images, separates the user's motion from the background in the images using a motion tracking pre-trained model, and extracts only the user's motion through deep learning;
a hand motion extraction unit (30) that extracts arm and hand movements from the extracted user motion;
a facial expression extraction unit (40) that extracts the facial expression from the extracted user motion;
a hand motion feature point extraction unit (50) that measures all point coordinates associated with the hand movements extracted by the hand motion extraction unit (30);
a facial expression feature point extraction unit (60) that measures all point coordinates associated with the facial expression extracted by the facial expression extraction unit (40);
a data conversion unit (70) that converts the hand motion feature point coordinates and facial expression coordinates extracted by the hand motion feature point extraction unit (50) and the facial expression feature point extraction unit (60) into data;
an image generation unit (80) that generates a feature point image for the sign language from the data converted by the data conversion unit (70);
a word model unit (90) that identifies, through deep learning, the characters of a word and the word corresponding to the feature points of the image, using the image for the word converted by the image generation unit (80);
a sentence model unit (100) that identifies, through deep learning, the feature points of the sentence the user signed, using the image for the word converted by the image generation unit (80);
an intention analysis unit (110) that analyzes, through deep learning, whether the words and sentences identified by the word model unit (90) and the sentence model unit (100) match or resemble the intention of the sentence the user originally signed;
a sentence conversion unit (120) that, depending on whether the intentions of the words and sentences analyzed by the intention analysis unit (110) match, arranges a written sentence from the sentence expressing the meaning of the words and signs and from the user's video images; and
a text sentence generation unit (130) that generates a sentence made of characters from the written sentence converted by the sentence conversion unit (120), following the sentence arrangement.

The hand motion feature point extraction unit (50) receives the user's sign language motion video through the cameras (10) and, using a deep learning model based on a motion tracking (estimation) pre-trained model, extracts 21 key points from each of the left and right hands (fingerspelling), for a total of 42 hand key points. When extracting the facial expression (non-manual signal) from the video of the cameras (10), the facial expression feature point extraction unit (60) extracts 70 key points from the nose, eyes, and lips, so that the facial expression (non-manual signal) is added to the fingerspelling.

The data conversion unit (70) builds a standard database of the left/right hand (fingerspelling) key points and the facial expression (non-manual) key points for sign language motions. It stores the left/right hand key points and the facial expression key points generated by the hand motion feature point extraction unit (50) and the facial expression feature point extraction unit (60) separately, so that, for the same sign language motion, the collected set of hand key points and the collected set of facial expression key points are kept apart. It then combines the left/right hands (fingerspelling) with the facial expressions (non-manual signals) to form a new database, which is used to train the sign language motion recognition model through deep learning and thereby increase the recognition rate. The extracted left/right hand key points and facial expression key points are each written to a file in JSON format, and the JSON files are stored in the database. The information stored in the JSON format consists, for each key point in the image and for as many frames as there are, of an X coordinate (0 to image width), a Y coordinate (0 to image height), and an accuracy value (0 to 1).

The word model unit (90) consists of a feature point extraction module made up of 16 output layers for image recognition, and a word classification module (140). The word classification module (140) applies, to the signal vector (Feature Map (F)) produced at the output layer of the feature point extraction module, an inference output activation function (given as "풀이", literally "solution", in the original) for sign language words obtained from the standard data of the sign language motion database. The 16 output layers for image recognition form a deep learning model: when it receives a sign language motion as input, it extracts features so that key points can be extracted from the motion, and the subsequent convolution layers recognize the shape or form of the hand. The word model unit (90) takes the left/right hand key points and the facial expression key points as input. The left/right hand image features output by the convolution layers of the 16-layer feature point extraction module that processes the hands are merged with the image features output for the facial expression into a single Feature Map (F). This Feature Map (F) is then passed as input to the word classification module (140), which applies the output activation function, to generate a word.

Because the preceding and following signs in a sign language sentence are correlated with one another, the sentence model unit (100) consists of an encoder (E) and a decoder (D) for inferring the sentence. The encoder (E) extracts the Feature Map (F), which is the feature vector of the image, and the Attention (A), which the decoder (D) uses as weights. The decoder (D) applies the Attention (A) to the Feature Map (F) it receives and uses the result as a vector for extracting the closest word. The Feature Map (F) is also used as input to the sentence classification module (150), which, on a time-series basis, multiplies the input by an input control vector, passes the product through an activation function, and applies an output control vector to the resulting value to generate a sentence.

FIG. 1 shows the overall configuration of a sign language motion analysis algorithm system using a sign language motion recognition processing procedure and a motion tracking pre-trained model according to an embodiment of the present invention. The system includes:
a plurality of cameras (10) installed at predetermined intervals in three or more directions in front of the sign language user to photograph the user's upper body;
a motion processing unit (20) that converts the video captured by the plurality of cameras (10) into images, separates the user's motion from the background in the images using a motion tracking pre-trained model, and extracts only the user's motion through deep learning;
a hand motion extraction unit (30) that extracts arm and hand movements from the extracted user motion;
a facial expression extraction unit (40) that extracts the facial expression from the extracted user motion;
a hand motion feature point extraction unit (50) that measures all point coordinates following the hand movement extracted by the hand motion extraction unit (30);
a facial expression feature point extraction unit (60) that measures all point coordinates following the facial expression extracted by the facial expression extraction unit (40);
a data conversion unit (70) that converts into data the hand motion feature point coordinates and facial expression coordinates extracted by the hand motion feature point extraction unit (50) and the facial expression feature point extraction unit (60);
an image generation unit (80) that generates a feature point image for the sign language from the data converted by the data conversion unit (70);
a word model unit (90) that identifies, through deep learning and from the image generated by the image generation unit (80), the characters of a word and the word corresponding to the feature points of the image;
a sentence model unit (100) that identifies, through deep learning and from the image generated by the image generation unit (80), the feature points of the sentence the user signed;
an intention analysis unit (110) that analyzes, through deep learning, whether the words and sentences identified by the word model unit (90) and the sentence model unit (100) match or are similar to the intent of the sentence the user originally signed;
a sentence conversion unit (120) that, depending on whether the intent of the words and sentences analyzed by the intention analysis unit (110) matches, arranges a text sentence from the sentences expressing the meaning of the words and signs and from the user's video images; and
a character sentence generation unit (130) that generates, from the text sentence converted by the sentence conversion unit (120), a sentence made up of characters in the arranged order.
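The dataflow among the units of FIG. 1 can be sketched as a dependency graph. This is a minimal illustration only: the dictionary layout and the use of Python's standard `graphlib` module are assumptions made for the sketch, and the edges are one reading of the description, while the unit names and reference numerals come from the text.

```python
from graphlib import TopologicalSorter

# Reference numerals and unit names as given in the description.
UNITS = {
    10: "cameras",
    20: "motion processing unit",
    30: "hand motion extraction unit",
    40: "facial expression extraction unit",
    50: "hand motion feature point extraction unit",
    60: "facial expression feature point extraction unit",
    70: "data conversion unit",
    80: "image generation unit",
    90: "word model unit",
    100: "sentence model unit",
    110: "intention analysis unit",
    120: "sentence conversion unit",
    130: "character sentence generation unit",
}

# Each unit maps to the units whose output it consumes.
# Assumption: these edges are inferred from the prose, not stated as a graph.
DEPENDS_ON = {
    20: {10},
    30: {20},
    40: {20},
    50: {30},
    60: {40},
    70: {50, 60},
    80: {70},
    90: {80},
    100: {80},
    110: {90, 100},
    120: {110},
    130: {120},
}


def processing_order():
    """Return one valid order in which the units can run."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())
```

Calling `processing_order()` yields the cameras first and the character sentence generation unit last, with the hand and face branches free to run in either order before the data conversion unit.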

FIG. 2 shows how the system according to the present invention captures and processes images of a sign language user through the plurality of cameras, FIG. 3 shows the word model unit of the system, and FIG. 4 shows the sentence model unit of the system.

Referring to FIGS. 1 and 2, the plurality of cameras (10) photograph the sign language user from the front. The cameras (10) are spaced apart at predetermined intervals and capture images of the user's front from various angles.

Because the cameras (10) capture the user from different angles, the user's hand motions and facial expressions can be captured clearly.

The cameras (10) photograph the user from three or more directions in front of the user so that finger movements and the like can be captured clearly.

The images from the cameras (10) are transmitted to the motion processing unit (20), which separates the user's motion from the background.

The motion processing unit (20) converts the video captured by the plurality of cameras (10) into images, uses a motion tracking pre-trained model to separate the user's motion from the background in each image, and extracts only the user's motion through deep learning.

Once the motion processing unit (20) has isolated the user's motion, the hand motion extraction unit (30) extracts the hand movements and the facial expression extraction unit (40) extracts the facial expression, each separately.

For the hand motions and facial expressions extracted in this way, the hand motion feature point extraction unit (50) measures all point coordinates that follow the hand movement extracted by the hand motion extraction unit (30), and the facial expression feature point extraction unit (60) measures all point coordinates that follow the facial expression extracted by the facial expression extraction unit (40).

The hand motion feature point extraction unit (50) receives the user's sign language video through the cameras (10) and, using a deep learning model to which a motion tracking (estimation) pre-trained model is applied, extracts 21 key points from each of the left and right hands (fingerspelling), for a combined total of 42 key points.

When extracting the facial expression (non-manual signal) from the camera (10) images, the facial expression feature point extraction unit (60) extracts 70 key points from the nose, eyes, and lips, and adds these facial-expression (non-manual) key points to the left- and right-hand (fingerspelling) key points.
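As a rough illustration of how the 42 hand key points and 70 face key points might be combined for one frame, the sketch below concatenates them into a single list of 112 entries. The tuple layout, the function name `combine_key_points`, and the left/right/face ordering are assumptions made for this example; only the counts (21 per hand, 70 for the face) come from the description.

```python
from typing import List, Tuple

# One key point as (x, y, accuracy). This layout is an assumption for the sketch.
KeyPoint = Tuple[float, float, float]

HAND_KEY_POINTS = 21  # per hand, as stated in the description
FACE_KEY_POINTS = 70  # nose, eyes, and lips, as stated in the description


def combine_key_points(left: List[KeyPoint], right: List[KeyPoint],
                       face: List[KeyPoint]) -> List[KeyPoint]:
    """Concatenate left-hand, right-hand, and face key points for one frame."""
    if len(left) != HAND_KEY_POINTS or len(right) != HAND_KEY_POINTS:
        raise ValueError("each hand must supply exactly 21 key points")
    if len(face) != FACE_KEY_POINTS:
        raise ValueError("the face must supply exactly 70 key points")
    # Ordering (left, right, face) is hypothetical; any fixed order would do.
    return list(left) + list(right) + list(face)
```

Validating the counts up front makes a dropped detection (for example, a hand that leaves the frame) fail loudly instead of silently shifting every later index.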

Here, the pre-trained model refers to the final output obtained by running the training images and code locally.

The hand motion feature point coordinates and facial expression coordinates extracted by the hand motion feature point extraction unit (50) and the facial expression feature point extraction unit (60) are converted into data by the data conversion unit (70).

The data conversion unit (70) builds a standard database of left- and right-hand (fingerspelling) key points and facial-expression (non-manual) key points for sign language motions.

The left- and right-hand (fingerspelling) key points and the facial-expression (non-manual) key points generated by the hand motion feature point extraction unit (50) and the facial expression feature point extraction unit (60) are stored separately, so that, for the same sign language motion, the collected set of hand key points and the set of facial-expression key points are kept apart. The left and right hands (fingerspelling) and the facial expressions (non-manual) are then combined to form a new database, which is used to train the sign language motion recognition model through deep learning and thereby increase the recognition rate.

The left- and right-hand (fingerspelling) key points and facial-expression (non-manual) key points, now recognized at the increased rate, are extracted and written to a file each in JSON format, and the JSON files are stored in the database. For each key point, the information stored in the JSON format consists of the X coordinate (0 to image width), the Y coordinate (0 to image height), and the accuracy (0 to 1) in the image, repeated for the number of frames.
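The stored format can be illustrated with a small serialization sketch. The description specifies only the three values per key point and their ranges; the field names (`x`, `y`, `accuracy`, `frames`, `width`, `height`) and the helper names below are hypothetical choices made for this example.

```python
import json


def build_record(frames, width, height):
    """Build a JSON-ready dict from per-frame lists of (x, y, accuracy) tuples."""
    out = []
    for points in frames:
        frame = []
        for x, y, accuracy in points:
            # Ranges follow the description: X in [0, width], Y in [0, height],
            # accuracy in [0, 1].
            if not (0 <= x <= width and 0 <= y <= height and 0.0 <= accuracy <= 1.0):
                raise ValueError(f"key point out of range: {(x, y, accuracy)}")
            frame.append({"x": x, "y": y, "accuracy": accuracy})
        out.append(frame)
    return {"width": width, "height": height, "frames": out}


def to_json(frames, width, height):
    """Serialize the key points of every frame to a JSON string."""
    return json.dumps(build_record(frames, width, height))
```

For example, `to_json([[(10, 20, 0.9)], [(11, 21, 0.8)]], 640, 480)` produces a document with two frames of one key point each.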

The image generation unit (80) generates a feature point image for the sign language from the data converted by the data conversion unit (70).

The feature point image is formed by plotting the feature point coordinates in the converted data as points and connecting them with curves or straight lines to form a kind of map.
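A minimal sketch of this step, assuming integer pixel coordinates and straight-line connections only (the description also allows curves), could rasterize the key points onto a grid. The `edges` list of index pairs describing which points are joined is a hypothetical input, since the description does not specify the skeleton topology, and the line routine is a standard Bresenham implementation chosen for illustration.

```python
def draw_line(grid, x0, y0, x1, y1):
    """Mark the cells on a straight line between two pixels (Bresenham)."""
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        grid[y0][x0] = 1
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def render_key_points(points, edges, width, height):
    """Return a height x width grid with key points and their links set to 1.

    points: list of (x, y) integer pixel coordinates inside the grid.
    edges:  list of (i, j) index pairs to connect (assumed topology).
    """
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        grid[y][x] = 1
    for a, b in edges:
        draw_line(grid, *points[a], *points[b])
    return grid
```

A real system would more likely draw onto an image buffer with a graphics library; the nested-list grid keeps the example dependency-free.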

The image generated by the image generation unit (80) is sent to the word model unit (90) and the sentence model unit (100). The word model unit (90) uses deep learning to identify, from the image, the characters of a word and the word corresponding to the feature points of the image. The sentence model unit (100) uses deep learning to identify, from the image, the feature points of the sentence the user signed.

Referring to FIG. 3, the word model unit (90) consists of a feature point extraction module made up of 16 output layers for image recognition, and a word classification module (140) that applies, to the signal vector (Feature Map (F)) fed from the output layer of the feature point extraction module, an inference output activation function (풀이, literally "solution") for sign language words obtained from the standard data of the sign language motion database.

In the word model unit (90), the 16 output layers of the feature point extraction module form a deep learning model. When it receives a sign language motion as input, the deep learning model extracts features so that key points can be extracted from the motion, and the following convolution layer recognizes the shape or form of the hand.

The word model unit (90) takes the left- and right-hand key points and the facial-expression key points as input. The left- and right-hand image features output from the convolution layer of the feature point extraction module that processes the hands are merged with the image features output for the facial expression into a single Feature Map (F).

The Feature Map (F) produced by the feature point extraction module from the left- and right-hand images and the facial expression is passed as input to the word classification module (140), which applies the output activation function to generate a word.
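The merge-then-classify step can be sketched as follows. The description does not name the output activation function; softmax over a single linear layer is assumed here purely for illustration, and `classify_word`, `weights`, `bias`, and `vocab` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_word(hand_features, face_features, weights, bias, vocab):
    """Merge hand and face features into one vector and pick the likeliest word.

    weights: one row of coefficients per vocabulary entry (assumed linear layer).
    """
    feature_map = list(hand_features) + list(face_features)  # merged Feature Map (F)
    logits = [sum(w * f for w, f in zip(row, feature_map)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best], probs
```

In practice the feature vectors would come from the convolutional layers rather than being supplied by hand, and the classifier weights would be learned from the standard database.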

Referring to FIG. 4, because the preceding and following signs in a sign language sentence are correlated with one another, the sentence model unit (100) consists of an encoder (E) and a decoder (D) for inferring the sentence.

The encoder (E) extracts the Feature Map (F), which is the feature vector of the image, and the Attention (A), which the decoder (D) uses as weights. The decoder (D) applies the Attention (A) to the Feature Map (F) it receives and uses the result as a vector for extracting the closest word.
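One common way to apply attention weights to per-frame features is dot-product attention, sketched below. The description does not say how Attention (A) is computed, so the scoring rule, the `query` vector standing in for the decoder state, and the function name `attend` are all assumptions of this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frame_features, query):
    """Weight per-frame feature vectors by their similarity to a query vector.

    Returns (weights, context): weights sum to 1, and context is the
    weighted sum of the frame features.
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in frame_features]
    weights = softmax(scores)
    dim = len(frame_features[0])
    context = [sum(w * f[i] for w, f in zip(weights, frame_features))
               for i in range(dim)]
    return weights, context
```

The frame whose features align best with the query receives the largest weight, so the context vector is pulled toward the frames most relevant to the word being decoded.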

The Feature Map (F), which is the feature vector of the image, is used as input to the sentence classification module (150). On a time-series basis, the sentence classification module (150) multiplies the input by an input control vector, passes the product through an activation function, and applies an output control vector to the resulting value to generate a sentence.
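The "input control vector" and "output control vector" described here resemble the input and output gates of an LSTM cell. The description does not name the architecture, so the following single LSTM time step is only one possible reading; the parameter names (`Wi`, `Wf`, `Wo`, `Wg` and their biases) are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, z, b):
    """Compute W @ z + b for a list-of-rows matrix W."""
    return [sum(w * z_i for w, z_i in zip(row, z)) + b_i for row, b_i in zip(W, b)]


def lstm_step(x, h, c, params):
    """Run one LSTM time step and return the new hidden and cell states.

    x: input at this time step, h: previous hidden state, c: previous cell state.
    """
    z = list(x) + list(h)  # concatenated input and previous hidden state
    i = [sigmoid(v) for v in affine(params["Wi"], z, params["bi"])]  # input gate
    f = [sigmoid(v) for v in affine(params["Wf"], z, params["bf"])]  # forget gate
    o = [sigmoid(v) for v in affine(params["Wo"], z, params["bo"])]  # output gate
    g = [math.tanh(v) for v in affine(params["Wg"], z, params["bg"])]  # candidate
    c_new = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c, i, g)]
    h_new = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_new)]
    return h_new, c_new
```

Under this reading, the input gate `i` plays the role of the input control vector and the output gate `o` plays the role of the output control vector; iterating `lstm_step` over the frames gives the time-series behavior the description mentions.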

The intention analysis unit (110) analyzes, through deep learning, whether the words and sentences identified by the word model unit (90) and the sentence model unit (100) match or are similar to the intent of the sentence the user originally signed.

Depending on whether the intent of the words and sentences analyzed by the intention analysis unit (110) matches, the sentence conversion unit (120) arranges a text sentence from the sentences expressing the meaning of the words and signs and from the user's video images.

The character sentence generation unit (130) then turns the text sentence converted by the sentence conversion unit (120) into a sentence made up of characters, in the arranged order, thereby converting the user's sign language video into text.

The videos, images, words, and sentences produced by each of the above components are stored in a DB server so that they can be loaded from the DB server and used when output is generated through deep learning.

The present invention configured in this way obtains key point information from sign language motion video input through three cameras, generates integrated key points to improve the recognition result, and uses them as input to the word model and the sentence model. This improves recognition of the input sign language motion and makes it possible to provide an appropriate sign language response.

The sign language motion analysis algorithm system described above is not limited to the configuration and operation of the embodiments described above. All or part of the embodiments may be selectively combined so that various modifications can be made.

10: camera 20: motion processing unit
30: hand motion extraction unit 40: facial expression extraction unit
50: hand motion feature point extraction unit 60: facial expression feature point extraction unit
70: data conversion unit 80: image generation unit
90: word model unit 100: sentence model unit
110: intention analysis unit 120: sentence conversion unit
130: character sentence generation unit
F: Feature Map A: Attention
140: word classification module 150: sentence classification module
E: encoder D: decoder

Claims

A sign language motion analysis algorithm system using a sign language motion recognition processing procedure and a motion tracking pre-trained model, the system comprising:
a plurality of cameras (10) installed at predetermined intervals in three or more directions in front of a sign language user to photograph the user's upper body;
a motion processing unit (20) that converts the video captured by the plurality of cameras (10) into images, separates the user's motion from the background in the images using a motion tracking pre-trained model, and extracts only the user's motion through deep learning;
a hand motion extraction unit (30) that extracts arm and hand movements from the extracted user motion;
a facial expression extraction unit (40) that extracts a facial expression from the extracted user motion;
a hand motion feature point extraction unit (50) that measures all point coordinates following the hand movement extracted by the hand motion extraction unit (30);
a facial expression feature point extraction unit (60) that measures all point coordinates following the facial expression extracted by the facial expression extraction unit (40);
a data conversion unit (70) that converts into data the hand motion feature point coordinates and facial expression coordinates extracted by the hand motion feature point extraction unit (50) and the facial expression feature point extraction unit (60);
an image generation unit (80) that generates a feature point image for the sign language from the data converted by the data conversion unit (70);
a word model unit (90) that identifies, through deep learning and from the image generated by the image generation unit (80), the characters of a word and the word corresponding to the feature points of the image;
a sentence model unit (100) that identifies, through deep learning and from the image generated by the image generation unit (80), the feature points of the sentence signed by the user;
an intention analysis unit (110) that analyzes, through deep learning, whether the words and sentences identified by the word model unit (90) and the sentence model unit (100) match or are similar to the intent of the sentence the user originally signed;
a sentence conversion unit (120) that, depending on whether the intent of the words and sentences analyzed by the intention analysis unit (110) matches, arranges a text sentence from the sentences expressing the meaning of the words and signs and from the user's video images; and
a character sentence generation unit (130) that generates, from the text sentence converted by the sentence conversion unit (120), a sentence made up of characters in the arranged order.

The system of claim 1,
wherein the hand motion feature point extraction unit (50) receives the user's sign language video through the cameras (10) and, using a deep learning model to which a motion tracking (estimation) pre-trained model is applied, extracts 21 key points from each of the left and right hands (fingerspelling) and uses the combined 42 key points of both hands, and
wherein the facial expression feature point extraction unit (60), when extracting the facial expression (non-manual signal) from the images of the cameras (10), extracts 70 key points from the nose, eyes, and lips and adds the facial expression (non-manual) key points to the key points of the left and right hands (fingerspelling).

The system of claim 1,
wherein the data conversion unit (70) builds a standard database of left- and right-hand (fingerspelling) key points and facial-expression (non-manual) key points for sign language motions,
stores separately the left- and right-hand (fingerspelling) key points and the facial-expression (non-manual) key points generated by the hand motion feature point extraction unit (50) and the facial expression feature point extraction unit (60), so that the set of hand key points and the set of facial-expression key points collected for the same sign language motion are kept apart, and combines the left and right hands (fingerspelling) with the facial expressions (non-manual) to form a new database that is used to train the sign language motion recognition model through deep learning and thereby increase the recognition rate,
extracts the left- and right-hand (fingerspelling) key points and the facial-expression (non-manual) key points, generates a file for each in JSON format, and stores the JSON files in the database, and
wherein the information stored in the JSON format consists, for each key point, of the X coordinate (0 to image width), the Y coordinate (0 to image height), and the accuracy (0 to 1) in the image, repeated for the number of frames.

The system of claim 1,
wherein the word model unit (90) consists of a feature point extraction module made up of 16 output layers for image recognition, and
a word classification module (140) that applies, to the signal vector (Feature Map (F)) fed from the output layer of the feature point extraction module, an inference output activation function (풀이, literally "solution") for sign language words obtained from the standard data of the sign language motion database,
wherein the 16 output layers of the feature point extraction module form a deep learning model that, on receiving a sign language motion as input, extracts features so that key points can be extracted from the motion, and whose following convolution layer recognizes the shape or form of the hand,
wherein the word model unit (90) takes the left- and right-hand key points and the facial-expression key points as input and generates a single Feature Map (F) by merging the left- and right-hand image features output from the convolution layer of the feature point extraction module that processes the hands with the image features output for the facial expression, and
wherein the Feature Map (F) produced by the feature point extraction module from the left- and right-hand images and the facial expression is passed as input to the word classification module (140), which applies the output activation function to generate a word.

The system of claim 1,
wherein the sentence model unit (100) consists of an encoder (E) and a decoder (D) for inferring a sign language sentence, because the preceding and following signs in a sign language sentence are correlated with one another,
wherein the encoder (E) extracts the Feature Map (F), which is the feature vector of the image, and the Attention (A), which is used as weights in the decoder (D),
wherein the decoder (D) applies the Attention (A) to the Feature Map (F) it receives and uses the result as a vector for extracting the closest word, and
wherein the Feature Map (F), which is the feature vector of the image, is used as input to the sentence classification module (150), and the sentence classification module (150), on a time-series basis, multiplies the input by an input control vector, passes the product through an activation function, and applies an output control vector to the resulting value to generate a sentence.