KR102518690B1

KR102518690B1 - Apparatus for purifying data for early screening of developmental disabilities based on learning model and method therefor

Info

Publication number: KR102518690B1
Application number: KR1020210112889A
Authority: KR
Inventors: 김정준; 김민규; 김주현; 이만기; 김경호; 손동섭
Original assignee: 한국로봇융합연구원
Priority date: 2021-08-26
Filing date: 2021-08-26
Publication date: 2023-04-05
Also published as: KR20230030794A

Abstract

An apparatus for purifying data includes: a data processing unit that divides a streaming video containing images and audio into predetermined time units to generate a plurality of unit videos and outputs the unit videos in the playback order of the streaming video; a data discrimination unit that uses a learning model to analyze the images and audio of each unit video to identify its speaker, classifies the unit video as a question part or a response part according to the identified speaker, analyzes the unit videos of the question part to classify the type of question, and analyzes the unit videos of the response part to classify the category of the reaction performed as the response; and a data purification unit that labels and stores, for the streaming video, the playback times of the question part and the response part, the type of question of the question part, and the category of reaction of the response part.

Description

Apparatus for purifying data for early screening of developmental disabilities based on learning model and method therefor

The present invention relates to a technique for purifying data and, more particularly, to an apparatus for purifying data for early screening of developmental disabilities based on a learning model, and a method therefor.

Developmental disability is a group of disorders of brain and nervous system development that begin in infancy, in which the development expected in areas such as language and communication, cognition, and social skills is severely delayed or not achieved. People with developmental disabilities are defined as those with intellectual disabilities and those with autistic disabilities. In Korea, the total number of people with disabilities is decreasing each year, but because only some of those diagnosed with a developmental disability are estimated to register as disabled, the actual number of people with developmental disabilities is estimated to be far greater than the registered figure.

Autism spectrum disorder (ASD) can be diagnosed at around two years of age, persists throughout life, and causes abnormalities in many of the most basic areas of development, so it has a major impact on a child's independent development, education, and the family's quality of life. Early detection and early intervention are critical in both clinical practice and research on ASD: infancy and early childhood are a period of high brain plasticity, which offers an opportunity for development to shift toward a typical pattern and also helps prevent secondary neurological damage and the gradual accumulation of the serious secondary behavioral problems that follow from it.

Diagnosing ASD requires direct observation, information provided by caregivers and teachers, a detailed developmental history, objective and quantitative evaluation of cognitive ability and other psychological functions, tests for differential diagnosis, neurological evaluation, brain function tests, and the like. Existing ASD screening tools require very long training before they can be used, produce inconsistent diagnoses depending on the experience and ability of individual experts, and demand at least 6 to 7 hours of testing time and extensive resources to diagnose a single child. In addition, depending on where and with whom the diagnostic test is performed, an infant's or child's behavior may differ considerably from the behavior usually seen in daily life. There is therefore a need to develop an ASD screening tool for infants and children that incorporates artificial intelligence technology, together with content applying it and an AI-based solution that can efficiently perform objective, quantitative evaluation of cognitive ability and psychological change. That is, the problem needs to be addressed through an early ASD screening system that collects multi-sensory data from infants and children based on a new screening tool and analyzes it automatically to recognize non-verbal communication, recognize abnormal or stereotyped target behaviors, and predict psychological state from combined information.

Korean Patent Application Publication No. 2020-0085766, published July 15, 2020 (Title: Cognitive dysfunction diagnostic device and cognitive dysfunction diagnostic program)

An object of the present invention is to provide an apparatus for purifying data for early screening of developmental disabilities based on a learning model, and a method therefor.

To achieve the above object, an apparatus for purifying data according to a preferred embodiment of the present invention includes: a data processing unit that divides a streaming video containing images and audio into predetermined time units to generate a plurality of unit videos and outputs the unit videos in the playback order of the streaming video; a data discrimination unit that uses a learning model to analyze the images and audio of each unit video to identify its speaker, classifies the unit video as a question part or a response part according to the identified speaker, analyzes the unit videos of the question part to classify the type of question, and analyzes the unit videos of the response part to classify the category of the reaction performed as the response; and a data purification unit that labels and stores, for the streaming video, the playback times of the question part and the response part, the type of question of the question part, and the category of reaction of the response part.

The learning model includes: an image discrimination network that receives the image of the unit video and calculates, from the input image, an image discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult; an audio discrimination network that receives the audio corresponding to the image of the unit video and calculates, from the input audio, an audio discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult; and a conversation discrimination network that merges the image discrimination vector and the audio discrimination vector to calculate a conversation discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult.

The learning model further includes: a response discrimination network that, when the conversation discrimination vector gives a higher probability that the speaker is a child than that the speaker is an adult, calculates a response discrimination vector representing the probability that the reaction shown through the image and the audio belongs to each of a plurality of preset reaction categories; and a question discrimination network that, when the conversation discrimination vector gives a higher probability that the speaker is an adult than that the speaker is a child, calculates a question discrimination vector representing the probability that the question in the audio belongs to each of a plurality of preset question types.

To achieve the above object, a method for purifying data according to a preferred embodiment of the present invention includes: a step in which a data processing unit divides a streaming video containing images and audio into predetermined time units to generate a plurality of unit videos and outputs the unit videos in the playback order of the streaming video; a step in which a data discrimination unit uses a learning model to analyze the images and audio of each unit video to identify its speaker, classifies the unit video as a question part or a response part according to the identified speaker, analyzes the unit videos of the question part to classify the type of question, and analyzes the unit videos of the response part to classify the category of the reaction performed as the response; and a step in which a data purification unit labels and stores, for the streaming video, the playback times of the question part and the response part, the type of question of the question part, and the category of reaction of the response part.

The classifying step includes: a step in which the image discrimination network of the learning model receives the image of the unit video and calculates, from the input image, an image discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult; a step in which the audio discrimination network of the learning model receives the audio corresponding to the image of the unit video and calculates, from the input audio, an audio discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult; and a step in which the conversation discrimination network of the learning model merges the image discrimination vector and the audio discrimination vector to calculate a conversation discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult.

After the step of calculating the conversation discrimination vector, the classifying step further includes: a step in which the response discrimination network of the learning model, when the conversation discrimination vector gives a higher probability that the speaker is a child than that the speaker is an adult, calculates a response discrimination vector representing the probability that the reaction shown through the image and the audio belongs to each of a plurality of preset reaction categories; and a step in which the question discrimination network of the learning model, when the conversation discrimination vector gives a higher probability that the speaker is an adult than that the speaker is a child, calculates a question discrimination vector representing the probability that the question in the audio belongs to each of a plurality of preset question types.
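The routing between the sub-networks described above can be sketched in a few lines. This is only an illustrative sketch, not the patented implementation: the function name `route_unit_video`, the dictionary it returns, and the lambda stand-ins for trained networks are all hypothetical, and simple averaging is used as a placeholder for the conversation discrimination network, whose actual merging operation the text does not specify. The text also does not say what happens when the two probabilities are equal, so the sketch leaves that case undetermined.

```python
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def route_unit_video(
    image,
    audio,
    vdn: Callable[[object], Vector],
    adn: Callable[[object], Vector],
    cdn: Callable[[Vector, Vector], Vector],
    rdn: Callable[[object, object], Vector],
    qdn: Callable[[object], Vector],
) -> Dict[str, object]:
    """Route one unit video to the response or question sub-network.

    Speaker vectors are ordered [P(child), P(adult)].
    """
    image_vec = vdn(image)                # image discrimination vector
    audio_vec = adn(audio)                # audio discrimination vector
    conv_vec = cdn(image_vec, audio_vec)  # conversation discrimination vector
    p_child, p_adult = conv_vec
    if p_child > p_adult:
        # Child is speaking: classify the reaction from image and audio.
        return {"part": "response", "vector": rdn(image, audio)}
    if p_adult > p_child:
        # Adult is speaking: classify the question from the audio.
        return {"part": "question", "vector": qdn(audio)}
    # Tie: not covered by the text, so no label is assigned here.
    return {"part": "undetermined", "vector": None}


# Hypothetical stand-ins for trained networks (assumptions for illustration).
def average(a: Vector, b: Vector) -> Vector:
    return [(x + y) / 2 for x, y in zip(a, b)]


result = route_unit_video(
    image=None,
    audio=None,
    vdn=lambda img: [0.2, 0.8],
    adn=lambda aud: [0.3, 0.7],
    cdn=average,
    rdn=lambda img, aud: [0.5, 0.5],
    qdn=lambda aud: [0.1, 0.9],
)
```

With these stand-ins the merged vector favors the adult, so the unit video is treated as a question part and only the question network is evaluated.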

According to the present invention, in a test for early screening of developmental disabilities, the playback time of the part in which the examiner asks a question, the playback time of the part in which the subject reacts to the question, the type of the examiner's question, and the category of the subject's reaction are classified by a learning model, and the corresponding labels can be assigned automatically to the streaming video. This makes it possible to efficiently collect the data needed to develop ASD screening tools and to research AI-based techniques that can efficiently perform objective, quantitative evaluation of the content applying such tools and of cognitive ability and psychological change.

FIG. 1 is a diagram illustrating the configuration of an apparatus for purifying data for early screening of developmental disabilities based on a learning model according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the detailed configuration of an apparatus for purifying data for early screening of developmental disabilities based on a learning model according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the configuration of a learning model for purifying data for early screening of developmental disabilities based on a learning model according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method of training a learning model (ML) for purifying data according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a method for purifying data for early screening of developmental disabilities using a learning model (ML) according to an embodiment of the present invention.
FIG. 6 is an example screen illustrating a method for purifying data for early screening of developmental disabilities using a learning model (ML) according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating a computing device according to an embodiment of the present invention.

Before the detailed description of the present invention, it should be noted that the terms and words used in this specification and the claims are not to be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical ideas, so it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Note that the same components are denoted by the same reference numerals throughout the drawings wherever possible. Detailed descriptions of well-known functions and configurations that could obscure the gist of the present invention are omitted. For the same reason, some components in the drawings are exaggerated, omitted, or shown schematically, and the size of each component does not fully reflect its actual size.

First, an apparatus for purifying data for early screening of developmental disabilities based on a learning model according to an embodiment of the present invention is described. FIG. 1 illustrates the configuration of the apparatus, and FIG. 2 illustrates its detailed configuration.

Referring first to FIG. 1, the data purification apparatus (10) according to an embodiment of the present invention includes a camera unit (11), a sensor unit (12), an input unit (14), a display unit (15), a storage unit (16), and a control unit (17).

The camera unit (11) captures a color video that includes a plurality of consecutive color frames and audio. The camera unit (11) includes a plurality of cameras, each of which can photograph the examiner and the subject from a different direction to generate a color video.

The sensor unit (12) includes a plurality of Kinect sensors. Each Kinect sensor corresponds to one of the cameras of the camera unit (11), emits infrared light toward the examiner and the subject from the same direction as its corresponding camera, and thereby captures a depth video including a plurality of consecutive depth (IR depth) frames. The depth video is captured in synchronization with the color video.

The input unit (14) receives user operations for controlling the data purification apparatus (10), generates an input signal, and passes it to the control unit (17). The input unit (14) includes various keys, buttons, and the like for controlling the data purification apparatus (10).

The display unit (15) displays screens and can visually present the menus of the data purification apparatus (10), input data, function setting information, and various other information to the user. The display unit (15) may be formed of a liquid crystal display (LCD), organic light emitting diodes (OLED), active matrix organic light emitting diodes (AMOLED), or the like. The display unit (15) may also be implemented as a touch screen, in which case it includes a touch sensor that detects the user's touch input. The touch sensor may be a touch-sensing sensor of the capacitive overlay, pressure, resistive overlay, or infrared beam type, or may be a pressure sensor. Besides these, any kind of sensor device capable of detecting contact or pressure from an object may be used as the touch sensor of the present invention. The touch sensor detects the user's touch input, generates a detection signal containing input coordinates indicating the touched position, and transmits it to the control unit (17).

The storage unit (16) stores the programs and data required for the operation of the data purification apparatus (10). The storage unit (16) can store the streaming videos captured by the cameras of the camera unit (11), the depth videos captured by the Kinect sensors of the sensor unit (12), and the audio collected by the audio processing unit (13). The various data stored in the storage unit (16) can be deleted, changed, or added according to user operations.

The control unit (17) controls the overall operation of the data purification apparatus (10) and the signal flow between its internal blocks, and performs a data processing function. The control unit (17) also controls the various functions of the data purification apparatus (10). Examples of the control unit (17) include a central processing unit (CPU) and a digital signal processor (DSP).

Referring to FIG. 2, the control unit (17) includes, for labeling, a model generation unit (100), a data processing unit (200), a data discrimination unit (300), and a data purification unit (400).

First, the model generation unit (100) trains the learning model (LM: Learning Model). The learning model (LM) may be a deep neural network, a representative example being a convolutional neural network (CNN). The model generation unit (100) provides the trained learning model (LM) to the data discrimination unit (300). The learning model (LM) is described in more detail below.

When the data processing unit (200) receives, through the camera unit (11) and the sensor unit (12), a color video including a plurality of consecutive color frames and audio and a depth video including a plurality of consecutive depth frames, it generates a multi-channel streaming video containing audio and images, where the images comprise the consecutive color frames and the consecutive depth frames. The data processing unit (200) can then divide this streaming video into predetermined time units to generate a plurality of unit videos, and it outputs the unit videos in the playback order of the streaming video.
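The division into unit videos can be illustrated by computing the playback-time boundaries of each unit. The sketch below is a simplified illustration under stated assumptions: the function name `split_into_unit_videos` is hypothetical, it works only on time boundaries rather than on actual frames or audio samples, and it assumes a trailing segment shorter than the time unit is kept as its own unit video, which the text does not specify.

```python
from typing import List, Tuple


def split_into_unit_videos(
    duration_s: float, unit_s: float
) -> List[Tuple[float, float]]:
    """Return (start, end) playback times of unit videos, in playback order."""
    if duration_s < 0 or unit_s <= 0:
        raise ValueError("duration_s must be >= 0 and unit_s must be > 0")
    segments = []
    index = 0
    while index * unit_s < duration_s:
        start = index * unit_s
        # Assumption: the last unit may be shorter than unit_s.
        end = min(start + unit_s, duration_s)
        segments.append((start, end))
        index += 1
    return segments


# A 10-second stream cut into 4-second units.
units = split_into_unit_videos(10.0, 4.0)
```

Because the list is built in increasing start time, iterating over it yields the unit videos in the playback order of the streaming video.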

The data discrimination unit (300) uses the learning model (LM) to analyze the images and audio of each unit video to identify its speaker, and classifies the unit video as a question part or a response part according to the identified speaker. The data discrimination unit (300) can also use the learning model (LM) to analyze the unit videos of the question part to classify the type of question, and to analyze the unit videos of the response part to classify the category of the reaction performed as the response.

The data purification unit (400) can label and store, for the streaming video, the playback times of the question part and the response part, the type of question of the question part, and the category of reaction of the response part.
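One way to picture the stored labels is as a list of time intervals, each carrying a part (question or response) and a category. The sketch below is an assumption-laden illustration: the `Label` record, the `merge_labels` helper, and the category names in the example are hypothetical, and merging adjacent unit videos that share the same classification into one interval is a design choice made here, not something the text prescribes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Label:
    start_s: float  # playback time where the part begins
    end_s: float    # playback time where the part ends
    part: str       # "question" or "response"
    category: str   # question type or reaction category


def merge_labels(units: List[Tuple[float, float, str, str]]) -> List[Label]:
    """Merge adjacent unit videos that share the same part and category."""
    merged: List[Label] = []
    for start, end, part, category in units:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.part == part
            and last.category == category
            and last.end_s == start
        ):
            last.end_s = end  # extend the current interval
        else:
            merged.append(Label(start, end, part, category))
    return merged


# Hypothetical per-unit classification results, in playback order.
labels = merge_labels([
    (0.0, 2.0, "question", "question_type_1"),
    (2.0, 4.0, "question", "question_type_1"),
    (4.0, 6.0, "response", "reaction_category_2"),
])
```

Here the first two unit videos collapse into a single question interval from 0 to 4 seconds, followed by a response interval from 4 to 6 seconds.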

Next, the configuration of the learning model for purifying data for early screening of developmental disabilities according to an embodiment of the present invention is described. FIG. 3 illustrates the configuration of this learning model.

Referring to FIG. 3, the learning model (ML) consists of a plurality of sub-networks. The sub-networks include an image discrimination network (VDN), an audio discrimination network (ADN), a conversation discrimination network (CDN), a response discrimination network (RDN), and a question discrimination network (QDN).

Each sub-network includes an input layer (IL), at least one pair of alternating convolution layers (CL) and pooling layers (PL), at least one fully connected layer (FL), and an output layer (OL). Each sub-network computes an output value by performing operations on the input value in which weights between the layers are applied. In particular, the convolution layer (CL) and the pooling layer (PL) each consist of at least one feature map (FM). A feature map (FM) is derived by taking as input the values obtained by applying weights and thresholds to the operation results of the previous layer and performing an operation on those values. The weights are applied through a filter or kernel (W), which is a weight matrix of a predetermined size. In this embodiment, a first filter (W1) is used for the convolution operation of the convolution layer (CL), and a second filter (W2) is used for the pooling operation of the pooling layer (PL).

When an input value (for example, an image, audio, or a vector) is fed to the input layer (IL), the convolution layer (CL) performs a convolution operation using the first filter (W1), followed by an activation function, on the frame fed to the input layer (IL) to derive at least one first feature map (FM1). The pooling layer (PL) then performs a pooling (or sub-sampling) operation using the second filter (W2) on the at least one first feature map (FM1) of the convolution layer (CL) to derive at least one second feature map (FM2).
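A small numeric sketch may help make the two operations concrete. The code below is illustrative only and rests on assumptions not stated in the text: it uses a single-channel frame, stride 1 with no padding, ReLU as the activation function, and non-overlapping max pooling in place of the second filter (W2). The function names `conv2d_valid` and `max_pool` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(frame: Matrix, w1: Matrix) -> Matrix:
    """Slide filter w1 over frame (no padding, stride 1), then apply ReLU."""
    kh, kw = len(w1), len(w1[0])
    out_h = len(frame) - kh + 1
    out_w = len(frame[0]) - kw + 1
    return [
        [
            max(0.0, sum(frame[i + a][j + b] * w1[a][b]
                         for a in range(kh) for b in range(kw)))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(fm: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling with a size x size window."""
    return [
        [
            max(fm[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fm[0]) - size + 1, size)
        ]
        for i in range(0, len(fm) - size + 1, size)
    ]


# A 5x5 frame whose pixel at (i, j) has value 5*i + j.
frame = [[float(5 * i + j) for j in range(5)] for i in range(5)]
w1 = [[1.0, 0.0], [0.0, 1.0]]  # first filter: sums each pixel with its diagonal neighbor
fm1 = conv2d_valid(frame, w1)  # first feature map, 4x4
fm2 = max_pool(fm1)            # second feature map, 2x2
```

The 5x5 frame shrinks to a 4x4 first feature map after the 2x2 filter and then to a 2x2 second feature map after pooling, which shows how each convolution and pooling pair reduces spatial resolution.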

The fully connected layer (FL) consists of a plurality of operation nodes (F1 to Fm). The operation nodes (F1 to Fm) of the fully connected layer (FL) compute a plurality of operation values by applying an activation function to the at least one second feature map (FM2) of the pooling layer (PL).

The output layer (OL) includes a plurality of output nodes (O1 to On). Each of the operation nodes (F1 to Fm) of the fully connected layer (FL) is connected to the output nodes (O1 to On) of the output layer (OL) through a channel that has a weight (W). In other words, the operation values of the operation nodes (F1 to Fm) are weighted by (W) and fed to each of the output nodes (O1 to On). The output nodes (O1 to On) of the output layer (OL) then compute output values by applying an activation function to the weighted operation values from the fully connected layer (FL).
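The weighted connection between the operation nodes and the output nodes can be sketched as a matrix-vector product followed by an activation. The sketch below assumes softmax as the output activation, which is one plausible choice because it yields probabilities that sum to 1, and omits bias terms because the text mentions only weights. The names `softmax`, `output_layer`, and the example numbers are hypothetical.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer(fc_values: List[float], weights: List[List[float]]) -> List[float]:
    """Compute output node values from fully connected layer values.

    weights[k][j] is the weight of the channel from operation node F(j+1)
    to output node O(k+1).
    """
    weighted = [sum(w * f for w, f in zip(row, fc_values)) for row in weights]
    return softmax(weighted)


# Three operation nodes feeding two output nodes (illustrative numbers).
outputs = output_layer(
    [1.0, 2.0, 3.0],
    [[0.1, 0.2, 0.3],   # weights into O1
     [0.3, 0.2, 0.1]],  # weights into O2
)
```

Each output node receives every operation value scaled by its own channel weight, so changing a single row of `weights` changes only the corresponding output node's pre-activation sum.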

Examples of the activation function used in the convolution layer (CL), the fully connected layer (FL), and the output layer (OL) described above include Sigmoid, hyperbolic tangent (tanh), ELU (Exponential Linear Unit), ReLU (Rectified Linear Unit), Leaky ReLU, Maxout, Minout, and Softmax. Any one of these activation functions may be selected and applied to the convolution layer (CL), the fully connected layer (FL), and the output layer (OL).
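
For reference, several of the listed activation functions, and the weighted connection from the operation nodes to the output nodes, can be written as follows. This is a minimal sketch: the default `alpha` values and the helper name `output_layer` are assumptions for illustration, and choosing Softmax at the output layer is one option among those the text allows.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    return max(0.0, x)


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    return x if x > 0 else alpha * x


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer(op_values: Sequence[float],
                 weights: Sequence[Sequence[float]]) -> List[float]:
    """weights[k][i] is the weight on the channel from node F_i to node O_k."""
    weighted = [sum(w * v for w, v in zip(row, op_values)) for row in weights]
    return softmax(weighted)


# Hyperbolic tangent is available directly as math.tanh.
print(output_layer([1.0, 2.0], [[0.5, 0.5], [0.5, 0.5]]))
```

With Softmax at the output layer, the output values can be read directly as class probabilities, which matches how the output nodes are interpreted in the description.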

In particular, the output nodes (O1 to On) of the output layer (OL) represent the probability that the speaker is a child and the probability that the speaker is an adult, or the probability that the response expressed through the image and the audio belongs to each of a plurality of preset response categories, or the probability that the question appearing in the audio belongs to each of a plurality of preset question types. For example, suppose there are two output nodes (O1, O2), where the first output node (O1) corresponds to the probability that the speaker is a child and the second output node (O2) corresponds to the probability that the speaker is an adult. In this case, the output value of the first output node (O1) is the probability that the speaker is a child, and the output value of the second output node (O2) is the probability that the speaker is an adult. If the output values of the output nodes (O1, O2) are 0.256 and 0.744, the output value of the first output node (O1) is 0.256, so the probability that the speaker is a child is about 26%, and the output value of the second output node (O2) is 0.744, so the probability that the speaker is an adult is about 74%. The output values of the nodes may form a vector whose number of elements corresponds to the number of nodes. Accordingly, the data discrimination unit 300 may classify the input data according to the probabilities contained in the vector output from a sub-network. As in the example above, if the output values of the output nodes (O1, O2) are 0.256 and 0.744, the vector is [0.256, 0.744], which indicates that the probability that the speaker is a child is about 26% and the probability that the speaker is an adult is about 74%. Because the 74% probability that the speaker is an adult is higher than the 26% probability that the speaker is a child, the data discrimination unit 300 may classify the speaker of the input data as an adult.
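
The classification rule in this example amounts to picking the element with the highest probability. A minimal sketch, in which the class ordering (`child` first, `adult` second) follows the example and the function name `classify` is hypothetical:

```python
from typing import Sequence, Tuple

SPEAKER_CLASSES = ("child", "adult")  # order of output nodes O1, O2


def classify(vector: Sequence[float],
             classes: Sequence[str]) -> Tuple[str, float]:
    """Return the class with the highest probability and that probability."""
    best = max(range(len(vector)), key=lambda k: vector[k])
    return classes[best], vector[best]


label, prob = classify([0.256, 0.744], SPEAKER_CLASSES)
print(label, prob)
```

The same function applies unchanged to response categories or question types, given the corresponding vector and class list.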

As described above, each of the plurality of sub-networks according to an embodiment of the present invention calculates a vector by performing operations in which weights between a plurality of layers are applied to an input value. Specifically, the image discrimination network (VDN) receives the image of a unit video and calculates, from the input image, an image discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult. The audio discrimination network (ADN) receives the audio corresponding to the image of the unit video and calculates, from the input audio, an audio discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult. The dialogue discrimination network (CDN) merges the image discrimination vector and the audio discrimination vector to calculate a dialogue discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult. When the dialogue discrimination vector indicates that the probability that the speaker is a child is higher than the probability that the speaker is an adult, the response discrimination network (RDN) calculates a response discrimination vector representing the probability that the response expressed through the image and the audio belongs to each of a plurality of preset response categories. When the dialogue discrimination vector indicates that the probability that the speaker is an adult is higher than the probability that the speaker is a child, the question discrimination network (QDN) calculates a question discrimination vector representing the probability that the question appearing in the audio belongs to each of a plurality of preset question types.
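
The data flow among the five sub-networks can be sketched as a single function that takes each network as a callable. The function name, the dictionary keys, and the stub networks in the usage example are assumptions made for illustration; in particular, the averaging stub stands in for the trained CDN and is not the method of the source.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def run_sub_networks(
    image: Sequence[float],
    audio: Sequence[float],
    vdn: Callable[[Sequence[float]], Vector],
    adn: Callable[[Sequence[float]], Vector],
    cdn: Callable[[Vector, Vector], Vector],
    rdn: Callable[[Sequence[float], Sequence[float]], Vector],
    qdn: Callable[[Sequence[float]], Vector],
) -> Dict[str, Vector]:
    """Route one unit video through VDN/ADN, then CDN, then RDN or QDN."""
    image_vec = vdn(image)                    # [P(child), P(adult)] from image
    audio_vec = adn(audio)                    # [P(child), P(adult)] from audio
    dialogue_vec = cdn(image_vec, audio_vec)  # merged [P(child), P(adult)]
    result = {"dialogue": dialogue_vec}
    if dialogue_vec[0] > dialogue_vec[1]:     # child more likely: response part
        result["response"] = rdn(image, audio)
    elif dialogue_vec[1] > dialogue_vec[0]:   # adult more likely: question part
        result["question"] = qdn(audio)
    return result


# Usage with stub networks (hypothetical fixed outputs, for illustration only).
out = run_sub_networks(
    image=[0.1], audio=[0.2],
    vdn=lambda img: [0.3, 0.7],
    adn=lambda aud: [0.2, 0.8],
    cdn=lambda v, a: [(v[0] + a[0]) / 2, (v[1] + a[1]) / 2],
    rdn=lambda img, aud: [0.5, 0.5],
    qdn=lambda aud: [0.9, 0.1],
)
print(sorted(out))
```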

Next, a method for refining data for early screening of developmental disabilities based on a learning model according to an embodiment of the present invention will be described. Before the method for refining data is described in detail, a method for training the learning model (ML) will be described. FIG. 4 is a flowchart illustrating a method of training the learning model (ML) for refining data according to an embodiment of the present invention.

Referring to FIG. 4, the model generation unit 200 prepares training data in step S110. Here, the training data includes a plurality of training unit videos, obtained by dividing a streaming video containing images and audio of an examiner and an examinee into predetermined time units, and labels for the plurality of training unit videos. In particular, the labels include a speaker label, a question label, and a response label. The speaker label indicates whether the speaker of the training unit video is a child or an adult. When the unit video is a question part, in which the examiner asks the examinee a question, the question label indicates the type of the question. Examples of the question type include the ADOS (Autism Diagnostic Observation Schedule) format and the BeDevel (Behavior Development Screening for Toddler) format. When the unit video is a response part, in which the examinee reacts to the examiner's question, the response label indicates which of a plurality of preset response categories the examinee's reaction belongs to. Examples of the response category include pointing at an object, conventional gesture, distressed vocalization, biting, eye contact with the examiner, holding up and showing an object, looking at an object, and no response.
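
One way to represent a labelled training unit video is a small record type. The sketch below is an assumption about data layout, not something the source specifies: the class name `TrainingUnit`, its field names, and the use of start and end times in seconds are all hypothetical, while the label vocabularies are taken from the examples in the text.

```python
from dataclasses import dataclass
from typing import Optional

SPEAKERS = ("child", "adult")
QUESTION_TYPES = ("ADOS", "BeDevel")
RESPONSE_CATEGORIES = (
    "pointing at an object",
    "conventional gesture",
    "distressed vocalization",
    "biting",
    "eye contact with the examiner",
    "holding up and showing an object",
    "looking at an object",
    "no response",
)


@dataclass(frozen=True)
class TrainingUnit:
    """Labels for one training unit video (hypothetical layout)."""
    start_sec: float
    end_sec: float
    speaker: str                              # speaker label
    question_type: Optional[str] = None       # set only for question parts
    response_category: Optional[str] = None   # set only for response parts

    def __post_init__(self) -> None:
        if self.speaker not in SPEAKERS:
            raise ValueError(f"unknown speaker label: {self.speaker!r}")
        if (self.question_type is not None
                and self.question_type not in QUESTION_TYPES):
            raise ValueError(f"unknown question type: {self.question_type!r}")
        if (self.response_category is not None
                and self.response_category not in RESPONSE_CATEGORIES):
            raise ValueError(
                f"unknown response category: {self.response_category!r}")


sample = TrainingUnit(0.0, 2.0, "adult", question_type="ADOS")
print(sample)
```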

In step S120, the model generation unit 200 trains the image discrimination network (VDN) and the audio discrimination network (ADN) individually using the training data. In step S120, the model generation unit 200 inputs the image of a training unit video to the image discrimination network (VDN), and the image discrimination network (VDN) performs operations in which untrained weights between a plurality of layers are applied to the input image, thereby calculating an image discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult. The model generation unit 200 then calculates a loss representing the difference between the image discrimination vector and the speaker label, and performs optimization that modifies the weights of the image discrimination network (VDN) so that the loss is minimized. Also in step S120, the model generation unit 200 inputs the audio corresponding to the image of the training unit video supplied to the image discrimination network (VDN) to the audio discrimination network (ADN), and the audio discrimination network (ADN) performs operations in which untrained weights between a plurality of layers are applied to the input audio, thereby calculating an audio discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult. The model generation unit 200 then calculates a loss representing the difference between the audio discrimination vector and the speaker label, and performs optimization that modifies the weights of the audio discrimination network (ADN) so that the loss is minimized.
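
The loss calculation and weight update in step S120 can be illustrated with a deliberately small model. The source names neither the loss function nor the optimizer, so cross-entropy loss and plain gradient descent are assumptions here, as are the single linear layer, the feature values, and the names `cross_entropy` and `train_step`.

```python
import math
from typing import List, Sequence, Tuple


def softmax(values: Sequence[float]) -> List[float]:
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pred: Sequence[float], label: Sequence[float]) -> float:
    """Loss between a predicted vector and a one-hot speaker label."""
    return -sum(t * math.log(p) for p, t in zip(pred, label) if t > 0)


def train_step(weights: List[List[float]], features: Sequence[float],
               label: Sequence[float],
               lr: float = 0.1) -> Tuple[List[List[float]], float]:
    """One gradient-descent update of a linear layer followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    pred = softmax(logits)
    loss = cross_entropy(pred, label)
    # For softmax with cross-entropy, d(loss)/d(logit_k) = pred_k - label_k.
    new_weights = [
        [w - lr * (p - t) * x for w, x in zip(row, features)]
        for row, p, t in zip(weights, pred, label)
    ]
    return new_weights, loss


weights = [[0.0, 0.0], [0.0, 0.0]]  # untrained weights
features = [1.0, 0.5]               # hypothetical features of one unit video
label = [0.0, 1.0]                  # speaker label: adult

losses = []
for _ in range(20):
    weights, loss = train_step(weights, features, label)
    losses.append(loss)
print(losses[0], losses[-1])
```

The loss falls across iterations as the weights move toward the label, which is the behaviour the optimization in step S120 is meant to achieve for each network separately.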

Next, in step S130, the model generation unit 200 trains the dialogue discrimination network (CDN) using the training data. In step S130, the model generation unit 200 inputs the image of a training unit video to the image discrimination network (VDN), and inputs the audio corresponding to that image to the audio discrimination network (ADN). The image discrimination network (VDN) then performs operations in which weights between a plurality of layers are applied to the input image, thereby calculating an image discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult, and the audio discrimination network (ADN) performs operations in which weights between a plurality of layers are applied to the input audio, thereby calculating an audio discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult. The calculated image discrimination vector and audio discrimination vector are input to the dialogue discrimination network (CDN), and the dialogue discrimination network (CDN) performs operations in which weights between a plurality of layers are applied to the image discrimination vector and the audio discrimination vector, thereby calculating a dialogue discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult. The model generation unit 200 then calculates a first loss representing the difference between the image discrimination vector and the speaker label, a second loss representing the difference between the audio discrimination vector and the speaker label, and a third loss representing the difference between the dialogue discrimination vector and the speaker label. Subsequently, the model generation unit 200 performs optimization that modifies the weights of the image discrimination network (VDN), the audio discrimination network (ADN), and the dialogue discrimination network (CDN) so that the third loss is smaller than the first loss and the second loss, and so that the first loss, the second loss, and the third loss are minimized.
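
The source states the goal of step S130 (a third loss smaller than the first and second, with all three minimized) but not how the three losses are combined. One possible formulation, offered purely as an assumption, is to sum the losses and add a penalty whenever the third loss exceeds either of the others; the function name `joint_loss` and the `penalty` coefficient are hypothetical.

```python
def joint_loss(loss1: float, loss2: float, loss3: float,
               penalty: float = 1.0) -> float:
    """Sum of the three losses plus a hinge penalty when loss3 is not smallest.

    loss1: image discrimination vector vs. speaker label (VDN)
    loss2: audio discrimination vector vs. speaker label (ADN)
    loss3: dialogue discrimination vector vs. speaker label (CDN)
    """
    violation = max(0.0, loss3 - loss1) + max(0.0, loss3 - loss2)
    return loss1 + loss2 + loss3 + penalty * violation


# Constraint satisfied: loss3 is the smallest, so no penalty is added.
print(joint_loss(0.5, 0.4, 0.3))
# Constraint violated: loss3 exceeds loss1 by 0.1, so 0.1 * penalty is added.
print(joint_loss(0.2, 0.4, 0.3))
```

Minimizing such a combined value with respect to the weights of all three networks would push the merged prediction to be at least as accurate as either single-modality prediction, which appears to be the intent of the stated condition.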

Next, in step S140, the model generation unit 200 trains the question discrimination network (QDN). In step S140, the model generation unit 200 inputs the image of a training unit video to the image discrimination network (VDN), and inputs the audio corresponding to that image to the audio discrimination network (ADN). The image discrimination network (VDN) then performs operations in which weights between a plurality of layers are applied to the input image, thereby calculating an image discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult, and the audio discrimination network (ADN) performs operations in which weights between a plurality of layers are applied to the input audio, thereby calculating an audio discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult. The calculated image discrimination vector and audio discrimination vector are input to the dialogue discrimination network (CDN), and the dialogue discrimination network (CDN) performs operations in which weights between a plurality of layers are applied to the image discrimination vector and the audio discrimination vector, thereby calculating a dialogue discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult. The calculated dialogue discrimination vector is input to the question discrimination network (QDN), and at the same time the audio that was input to the audio discrimination network (ADN) is input to the question discrimination network (QDN) through a shortcut (SC). From the dialogue discrimination vector and the audio, the question discrimination network (QDN) calculates a question discrimination vector representing the probability that the question appearing in the audio belongs to each of a plurality of preset question types. Here, examples of the preset question types include the ADOS format and the BeDevel format. In particular, when the dialogue discrimination vector indicates that the probability that the speaker is a child is higher than the probability that the speaker is an adult, the dialogue discrimination vector input to the question discrimination network (QDN) serves to cancel the input audio; when the dialogue discrimination vector indicates that the probability that the speaker is an adult is higher than the probability that the speaker is a child, the dialogue discrimination vector input to the question discrimination network (QDN) serves to pass the input audio through unchanged. Therefore, the question discrimination network (QDN) can calculate the question discrimination vector, representing the probability that the question appearing in the audio belongs to each of the preset question types, only when the probability that the speaker is an adult is higher than the probability that the speaker is a child. When the question discrimination vector has been calculated, the model generation unit 200 calculates a loss representing the difference between the question discrimination vector and the question label, and performs optimization that modifies the weights of the question discrimination network (QDN) so that the calculated loss is minimized, while the weights of the image discrimination network (VDN), the audio discrimination network (ADN), and the dialogue discrimination network (CDN) are kept fixed.

Next, in step S150, the model generation unit 200 trains the response discrimination network (RDN). In step S150, the model generation unit 200 inputs the image of a training unit video to the image discrimination network (VDN), and inputs the audio corresponding to that image to the audio discrimination network (ADN). The image discrimination network (VDN) then performs operations in which weights between a plurality of layers are applied to the input image, thereby calculating an image discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult, and the audio discrimination network (ADN) performs operations in which weights between a plurality of layers are applied to the input audio, thereby calculating an audio discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult. The calculated image discrimination vector and audio discrimination vector are input to the dialogue discrimination network (CDN), and the dialogue discrimination network (CDN) performs operations in which weights between a plurality of layers are applied to the image discrimination vector and the audio discrimination vector, thereby calculating a dialogue discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult. The calculated dialogue discrimination vector is input to the response discrimination network (RDN), and at the same time the image that was input to the image discrimination network (VDN) and the audio that was input to the audio discrimination network (ADN) are input to the response discrimination network (RDN) through a shortcut (SC). From the dialogue discrimination vector, the image, and the audio, the response discrimination network (RDN) calculates a response discrimination vector representing the probability that the response appearing in the image and the audio belongs to each of a plurality of preset response categories. Here, examples of the preset response categories include pointing at an object, conventional gesture, distressed vocalization, biting, eye contact with the examiner, holding up and showing an object, looking at an object, and no response. In particular, when the dialogue discrimination vector indicates that the probability that the speaker is an adult is higher than the probability that the speaker is a child, the dialogue discrimination vector input to the response discrimination network (RDN) serves to cancel the input image and audio; when the dialogue discrimination vector indicates that the probability that the speaker is a child is higher than the probability that the speaker is an adult, the dialogue discrimination vector input to the response discrimination network (RDN) serves to pass the input image and audio through unchanged. Therefore, the response discrimination network (RDN) can calculate the response discrimination vector, representing the probability that the response appearing in the image and the audio belongs to each of the preset response categories, only when the probability that the speaker is a child is higher than the probability that the speaker is an adult. When the response discrimination vector has been calculated, the model generation unit 200 calculates a loss representing the difference between the response discrimination vector and the response label, and performs optimization that modifies the weights of the response discrimination network (RDN) so that the calculated loss is minimized, while the weights of the image discrimination network (VDN), the audio discrimination network (ADN), and the dialogue discrimination network (CDN) are kept fixed.
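
The cancel-or-pass role that the dialogue discrimination vector plays for the question and response discrimination networks can be sketched as a multiplicative gate. The source does not say how the cancellation is implemented, so the hard 0/1 gate below is an assumption, as are the function name `gate_input`, the `pass_when` parameter, and the choice to cancel the input when the two probabilities are equal.

```python
from typing import List, Optional, Sequence


def gate_input(dialogue_vec: Sequence[float], signal: Sequence[float],
               pass_when: str) -> List[float]:
    """Pass `signal` unchanged if the likelier speaker equals `pass_when`.

    dialogue_vec is [P(child), P(adult)]. Use pass_when="adult" for the
    question network (QDN) and pass_when="child" for the response network
    (RDN). Otherwise the signal is cancelled (set to zero).
    """
    p_child, p_adult = dialogue_vec
    likely: Optional[str]
    if p_adult > p_child:
        likely = "adult"
    elif p_child > p_adult:
        likely = "child"
    else:
        likely = None  # tie: not covered by the source, cancelled here
    gate = 1.0 if likely == pass_when else 0.0
    return [gate * s for s in signal]


audio = [0.1, -0.2, 0.3]
to_qdn = gate_input([0.256, 0.744], audio, pass_when="adult")  # passed through
to_rdn = gate_input([0.256, 0.744], audio, pass_when="child")  # cancelled
print(to_qdn, to_rdn)
```

With a gate of this kind, at most one of the two downstream networks receives a non-zero input for any unit video, so only one of the question and response discrimination vectors is meaningful at a time.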

As described above, when the training of the image discrimination network (VDN), the audio discrimination network (ADN), the dialogue discrimination network (CDN), the question discrimination network (QDN), and the response discrimination network (RDN) is completed, the training of the learning model (ML) is completed.

Next, a method for refining data for early screening of developmental disabilities using the trained learning model (ML) will be described. FIG. 5 is a flowchart illustrating a method for refining data for early screening of developmental disabilities using the learning model (ML) according to an embodiment of the present invention. FIG. 6 is an example screen illustrating the method for refining data for early screening of developmental disabilities using the learning model (ML) according to an embodiment of the present invention.

Referring to FIG. 5, when a streaming video containing images and audio of at least two people, for example an examiner and an examinee, is input in step S210, the data processing unit 100 divides the streaming video into predetermined time units in step S220 to generate a plurality of unit videos, and outputs the plurality of unit videos according to the playback order of the streaming video.
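
The division in step S220 can be sketched as computing the playback interval of each unit video. The unit length is left open in the source ("predetermined time units"), so the 4-second value below is only an example, and the function name `split_into_units` and the handling of a shorter final segment are assumptions.

```python
from typing import List, Tuple


def split_into_units(duration_sec: float,
                     unit_sec: float) -> List[Tuple[float, float]]:
    """Return (start, end) playback times of the unit videos, in order."""
    if unit_sec <= 0:
        raise ValueError("unit_sec must be positive")
    units = []
    index = 0
    while index * unit_sec < duration_sec:
        start = index * unit_sec
        # The last unit may be shorter than unit_sec (assumed behaviour).
        units.append((start, min(start + unit_sec, duration_sec)))
        index += 1
    return units


units = split_into_units(duration_sec=10.0, unit_sec=4.0)
print(units)
```

Because the intervals are produced in ascending order of start time, iterating over the list feeds the unit videos to the learning model in playback order.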

In step S230, the data discrimination unit 300 sequentially inputs the plurality of unit videos to the learning model (ML) according to the playback order of the streaming video. The image of each unit video is input to the image discrimination network (VDN), and the audio corresponding to that image is input to the audio discrimination network (ADN).

Then, in step S230, the image discrimination network (VDN) performs operations in which weights between a plurality of layers are applied to the input image, thereby calculating an image discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult, and the audio discrimination network (ADN) performs operations in which weights between a plurality of layers are applied to the input audio, thereby calculating an audio discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult. The calculated image discrimination vector and audio discrimination vector are input to the dialogue discrimination network (CDN), and in step S240 the dialogue discrimination network (CDN) performs operations in which weights between a plurality of layers are applied to the image discrimination vector and the audio discrimination vector, thereby calculating a dialogue discrimination vector representing the probability that the speaker is a child and the probability that the speaker is an adult.

The calculated dialogue discrimination vector is input to each of the question discrimination network (QDN) and the response discrimination network (RDN). Through a shortcut (SC), the image that was input to the image discrimination network (VDN) and the audio that was input to the audio discrimination network (ADN) are input to the response discrimination network (RDN), and, through a shortcut (SC), the audio that was input to the audio discrimination network (ADN) is input to the question discrimination network (QDN). Accordingly, in step S260, the response discrimination network (RDN) calculates a response discrimination vector representing the probability that the response appearing in the image and the audio belongs to each of a plurality of preset response categories, and the question discrimination network (QDN) calculates a question discrimination vector representing the probability that the question appearing in the audio belongs to each of a plurality of preset question types.

Step S260 is described in more detail as follows. When the dialogue discrimination vector indicates that the probability that the speaker is an adult is higher than the probability that the speaker is a child, the dialogue discrimination vector input to the response discrimination network (RDN) serves to cancel the input image and audio; when the dialogue discrimination vector indicates that the probability that the speaker is a child is higher than the probability that the speaker is an adult, the dialogue discrimination vector input to the response discrimination network (RDN) serves to pass the input image and audio through unchanged. Therefore, in step S260, the response discrimination network (RDN) calculates the response discrimination vector, representing the probability that the response appearing in the image and the audio belongs to each of the preset response categories, only when the probability that the speaker is a child is higher than the probability that the speaker is an adult. Likewise, when the dialogue discrimination vector indicates that the probability that the speaker is a child is higher than the probability that the speaker is an adult, the dialogue discrimination vector input to the question discrimination network (QDN) serves to cancel the input audio; when the dialogue discrimination vector indicates that the probability that the speaker is an adult is higher than the probability that the speaker is a child, the dialogue discrimination vector input to the question discrimination network (QDN) serves to pass the input audio through unchanged. Therefore, in step S260, the question discrimination network (QDN) calculates the question discrimination vector, representing the probability that the question appearing in the audio belongs to each of the preset question types, only when the probability that the speaker is an adult is higher than the probability that the speaker is a child.

As described above, when the dialogue discrimination vector and either the response discrimination vector or the question discrimination vector have been calculated, the data discrimination unit 300, in step S270, determines the speaker of the unit video from the dialogue discrimination vector, classifies the unit video as a question part or a response part according to the determined speaker, classifies the category of the reaction performed as the response from the response discrimination vector, and classifies the type of the question from the question discrimination vector.

That is, the dialogue discrimination vector represents the probability that the speaker is a child and the probability that the speaker is an adult. When the probability that the speaker is an adult is higher than the probability that the speaker is a child, the data discrimination unit 300 may determine that the speaker is the examiner. When the probability that the speaker is a child is higher than the probability that the speaker is an adult, the data discrimination unit 300 may determine that the speaker is the examinee. In addition, the data discrimination unit 300 may classify the unit video as a question part when the speaker is determined to be the examiner, and as a response part when the speaker is determined to be the examinee.

The response discrimination vector represents the probability that the reaction shown in the unit video belongs to each of a plurality of preset reaction categories. Accordingly, the data discrimination unit 300 classifies the reaction category with the highest probability among the plurality of reaction categories as the reaction the examinee performed as a response in that unit video.

The question discrimination vector represents the probability that the question appearing in the unit video belongs to each of a plurality of preset question types. Accordingly, the data discrimination unit 300 classifies the question type with the highest probability among the plurality of question types as the question the examiner asked in that unit video.
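
The classification in step S270 amounts to comparing the two speaker probabilities and then taking the highest-probability entry of whichever vector was calculated. The sketch below illustrates this under stated assumptions: the name `classify_unit`, the `(p_child, p_adult)` ordering, and the placeholder labels in `RESPONSE_CATEGORIES` and `QUESTION_TYPES` are hypothetical, since the description does not enumerate the actual categories or types. A tie between the speaker probabilities is treated as a child speaker here, which is also an assumption.

```python
from typing import Dict, Optional, Sequence

# Placeholder labels; the actual preset categories and types are not given in the text.
RESPONSE_CATEGORIES = ["category_a", "category_b", "category_c"]
QUESTION_TYPES = ["type_a", "type_b"]


def argmax(vec: Sequence[float]) -> int:
    """Return the index of the largest value in vec."""
    return max(range(len(vec)), key=lambda i: vec[i])


def classify_unit(
    dialog_vec: Sequence[float],
    response_vec: Optional[Sequence[float]] = None,
    question_vec: Optional[Sequence[float]] = None,
) -> Dict[str, str]:
    """Classify one unit video as a question part or a response part."""
    p_child, p_adult = dialog_vec  # assumed ordering
    if p_adult > p_child:
        if question_vec is None:
            raise ValueError("question_vec is required for an adult speaker")
        label = QUESTION_TYPES[argmax(question_vec)]
        return {"speaker": "examiner", "part": "question", "label": label}
    if response_vec is None:
        raise ValueError("response_vec is required for a child speaker")
    label = RESPONSE_CATEGORIES[argmax(response_vec)]
    return {"speaker": "examinee", "part": "response", "label": label}


question_unit = classify_unit((0.3, 0.7), question_vec=[0.1, 0.9])
response_unit = classify_unit((0.9, 0.1), response_vec=[0.2, 0.5, 0.3])
```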

Next, in step S280, the data purification unit 400 may, according to the classification made by the data discrimination unit 300 and as shown in FIG. 6, label the streaming video with the playback times of the question part and the response part, the type of question, and the category of reaction, and store the result.
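
One way to turn the per-unit classifications into playback-time labels is to merge consecutive unit videos that share the same part and label into a single segment. The sketch below shows this idea only; the function name `label_segments`, the dictionary layout, and the merging rule are assumptions for illustration, and the label format actually shown in FIG. 6 may differ.

```python
from typing import Dict, List


def label_segments(units: List[Dict[str, str]], unit_seconds: float) -> List[Dict]:
    """Merge consecutive units with the same part and label into timed segments.

    units are assumed to be in playback order, each with "part" and "label" keys,
    and every unit is assumed to last unit_seconds.
    """
    segments: List[Dict] = []
    for index, unit in enumerate(units):
        start = index * unit_seconds
        end = start + unit_seconds
        last = segments[-1] if segments else None
        if last and last["part"] == unit["part"] and last["label"] == unit["label"]:
            last["end"] = end  # extend the current segment
        else:
            segments.append(
                {"start": start, "end": end, "part": unit["part"], "label": unit["label"]}
            )
    return segments


units = [
    {"part": "question", "label": "type_a"},
    {"part": "question", "label": "type_a"},
    {"part": "response", "label": "category_b"},
]
segments = label_segments(units, unit_seconds=2.0)
```

Each resulting segment carries a start time, an end time, the part (question or response), and the classified label, which is the information step S280 describes storing.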

According to the method described above, in a test for early screening of developmental disabilities, the playback time of the part in which the examiner asks a question, the playback time of the part in which the examinee reacts to the question, the type of the examiner's question, and the category of the examinee's reaction can be classified by a learning model, and the corresponding labels can be assigned to the streaming video automatically. This makes it possible to efficiently collect the data needed to develop ASD screening tools and content that applies them, and to research AI-based techniques capable of efficient, objective, and quantitative evaluation of cognitive ability or psychological change.

FIG. 7 is a diagram illustrating a computing device according to an embodiment of the present invention. The computing device TN100 of FIG. 7 may be any of the devices described in this specification, for example the data purification device 10.

In the embodiment of FIG. 7, the computing device TN100 may include at least one processor TN110, a transceiver TN120, and a memory TN130. The computing device TN100 may further include a storage device TN140, an input interface device TN150, an output interface device TN160, and the like. The components of the computing device TN100 may be connected by a bus TN170 and communicate with one another.

The processor TN110 may execute program commands stored in at least one of the memory TN130 and the storage device TN140. The processor TN110 may be a central processing unit (CPU), a graphics processing unit (GPU), or a dedicated processor on which the methods according to embodiments of the present invention are performed. The processor TN110 may be configured to implement the procedures, functions, and methods described in relation to embodiments of the present invention. The processor TN110 may control each component of the computing device TN100.

Each of the memory TN130 and the storage device TN140 may store various information related to the operation of the processor TN110. Each of the memory TN130 and the storage device TN140 may consist of at least one of a volatile storage medium and a non-volatile storage medium. For example, the memory TN130 may consist of at least one of read only memory (ROM) and random access memory (RAM).

The transceiver TN120 may transmit or receive a wired signal or a wireless signal. The transceiver TN120 may be connected to a network to perform communication.

Meanwhile, the method according to the embodiment of the present invention described above may be implemented in the form of a program readable by various computer means and recorded on a computer-readable recording medium. Here, the recording medium may include program commands, data files, data structures, and the like, alone or in combination. The program commands recorded on the recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of the recording medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program commands, such as ROM, RAM, and flash memory. Examples of program commands include not only machine language code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. Such a hardware device may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

The present invention has been described above using several preferred embodiments, but these embodiments are illustrative and not limiting. Those of ordinary skill in the art to which the present invention belongs will understand that various changes and modifications can be made under the doctrine of equivalents without departing from the spirit of the present invention and the scope of rights set forth in the appended claims.

10: data purification device 11: camera unit
12: sensor unit 13: voice collection unit
14: input unit 15: display unit
16: storage unit 17: control unit
100: data processing unit 200: model generation unit
300: data discrimination unit 400: data purification unit

Claims

An apparatus for purifying data, comprising:
a data processing unit that generates a plurality of unit videos by dividing a streaming video including image and voice into predetermined time units, and outputs the plurality of unit videos according to a playback order of the streaming video;
a data discrimination unit that analyzes the image and voice of the unit video using a learning model to determine the speaker of the unit video, classifies the unit video into a question part or a response part according to the determined speaker, analyzes the unit video of the question part to classify the type of question, and analyzes the unit video of the response part to classify the category of the reaction performed as the response; and
a data purification unit that labels the streaming video with the playback times of the question part and the response part, the type of question of the question part, and the category of reaction of the response part, and stores the result.

The apparatus of claim 1, wherein the learning model comprises:
an image discrimination network that receives the image of the unit video and calculates, from the input image, an image discrimination vector representing a probability that the speaker is a child and a probability that the speaker is an adult;
a voice discrimination network that receives the voice corresponding to the image of the unit video and calculates, from the input voice, a voice discrimination vector representing a probability that the speaker is a child and a probability that the speaker is an adult; and
a dialogue discrimination network that merges the image discrimination vector and the voice discrimination vector to calculate a dialogue discrimination vector representing a probability that the speaker is a child and a probability that the speaker is an adult.

The apparatus of claim 2, wherein the learning model further comprises:
a response discrimination network that, when the dialogue discrimination vector indicates that the probability that the speaker is a child is higher than the probability that the speaker is an adult, calculates a response discrimination vector representing a probability that the reaction shown through the image and the voice belongs to each of a plurality of preset reaction categories; and
a question discrimination network that, when the dialogue discrimination vector indicates that the probability that the speaker is an adult is higher than the probability that the speaker is a child, calculates a question discrimination vector representing a probability that the question expressed in the voice belongs to each of a plurality of preset question types.

A method for purifying data, comprising:
generating, by a data processing unit, a plurality of unit videos by dividing a streaming video including image and voice into predetermined time units, and outputting the plurality of unit videos according to a playback order of the streaming video;
analyzing, by a data discrimination unit, the image and voice of the unit video using a learning model to determine the speaker of the unit video, classifying the unit video into a question part or a response part according to the determined speaker, analyzing the unit video of the question part to classify the type of question, and analyzing the unit video of the response part to classify the category of the reaction performed as the response; and
labeling, by a data purification unit, the streaming video with the playback times of the question part and the response part, the type of question of the question part, and the category of reaction of the response part, and storing the result.

The method of claim 4, wherein the classifying step comprises:
receiving, by an image discrimination network of the learning model, the image of the unit video, and calculating, from the input image, an image discrimination vector representing a probability that the speaker is a child and a probability that the speaker is an adult;
receiving, by a voice discrimination network of the learning model, the voice corresponding to the image of the unit video, and calculating, from the input voice, a voice discrimination vector representing a probability that the speaker is a child and a probability that the speaker is an adult; and
calculating, by a dialogue discrimination network of the learning model, a dialogue discrimination vector representing a probability that the speaker is a child and a probability that the speaker is an adult by merging the image discrimination vector and the voice discrimination vector.

The method of claim 5, wherein the classifying step further comprises, after calculating the dialogue discrimination vector:
calculating, by a response discrimination network of the learning model, when the dialogue discrimination vector indicates that the probability that the speaker is a child is higher than the probability that the speaker is an adult, a response discrimination vector representing a probability that the reaction shown through the image and the voice belongs to each of a plurality of preset reaction categories; and
calculating, by a question discrimination network of the learning model, when the dialogue discrimination vector indicates that the probability that the speaker is an adult is higher than the probability that the speaker is a child, a question discrimination vector representing a probability that the question expressed in the voice belongs to each of a plurality of preset question types.