KR101984283B1

KR101984283B1 - Automated Target Analysis System Using Machine Learning Model, Method, and Computer-Readable Medium Thereof

Info

Publication number: KR101984283B1
Application number: KR1020170156158A
Authority: KR
Inventors: 이재욱; 박주경; 이영복; 유대훈
Original assignee: 주식회사 제네시스랩
Priority date: 2017-11-22
Filing date: 2017-11-22
Publication date: 2019-05-30

Abstract

The present invention relates to an automated evaluated person analysis system using a machine learning model, a method thereof, and a computer readable medium thereof capable of automatically providing a more accurate analysis result using a machine learning model while taking into account an evaluation situation and a state of an evaluated person. According to one embodiment of the present invention, the system comprises: a feature information generating part for generating voice feature information and image feature information from input image information and voice information; a speech state inference part for generating speech state information on whether the evaluated person is speaking from the information included in the image feature information; and an analysis result inference part for deriving the analysis result information of the evaluated person from the information including the voice feature information, the image feature information, and the speech state information.

Description

Technical Field

The present invention relates to an automated evaluated-person analysis system, method, and computer-readable medium using a machine learning model and, more particularly, to such a system, method, and computer-readable medium capable of automatically providing more accurate analysis results using a machine learning model while taking into account the evaluation situation and the state of the evaluated person.

More and more people teach themselves or train at specialized academies to prepare their communication skills or job interviews. This reflects the fact that people are growing more accustomed to online environments such as social networking services than to offline ones and, in the same way, communicate through messengers such as KakaoTalk or through SMS rather than by phone, so that their social skills in face-to-face situations are declining.

Various systems and studies are under way in online environments to help such people, but the conventional techniques have the following limitations.

First, conventional techniques cannot evaluate an answer in relation to the type of question asked. For example, suppose there is a system that automatically evaluates online interview results for recruitment. Conventional techniques basically evaluate an answer on the basis of facial expression, manner of speaking, tone, and gestures, regardless of what question was asked. Such techniques can therefore produce inappropriate evaluation results that ignore the question and the situation. For example, when a humorous question is asked as part of a personality evaluation, an interviewee who answers with a smile in a relaxed manner and tone should be rated well, whereas when a question assessing expertise or a pressure question is asked in a technical interview, keeping a serious expression and tone should be rated better than answering with a smile. However, conventional techniques do not reflect such factors of the evaluation situation, and the voice, facial expression, and so on of the evaluated person are judged by a uniform standard. That is, conventional techniques cannot evaluate responses differently according to the situation and are limited to deriving a generalized evaluation result by learning the whole interview, or a part of it, as a whole.

Second, although listening is a very important part of communication ability, conventional techniques diagnose it poorly. Speaking matters in communication, but so does the listening attitude, that is, how well one listens to the other party. For example, suppose that two people, A and B, talk via video chat or a video interview, and that B is being evaluated. In an existing system where the camera is pointed at B and the microphone is on, while A is speaking to B the image information comes from B but the voice information comes from A. Thus, while B is not speaking, the input voice information may in effect be noise from another source. In this case, for an accurate diagnosis of B, the voice information should be discarded and the evaluation should rely as much as possible on the image information, but conventional techniques cannot adapt to such a situation.

Third, communication requires expressing emotion and speaking in different ways depending on the place and situation, and it should be evaluated differently when analyzed. Communication should vary not only with the content of the question, as mentioned above, but also with the place and situation. A sadder emotion would be appropriate for a conversation at a funeral home, and a livelier feeling would suit a conversation at an amusement park. However, conventional techniques do not take such situational awareness into account, which makes appropriate diagnosis and feedback difficult.

Finally, in most conventional techniques a person selects (hand-crafts) the feature information considered important for evaluating the evaluated person, and machine learning is performed on that basis. In this case there is a risk that feature information needed to derive the final result is omitted. Even when diagnosing an evaluation through facial expressions, landmark information is derived and learning uses the corresponding position points. It is difficult for a person to find clear criteria for which feature information of the actual image and voice information must be used to obtain results similar to those of a human evaluator, and higher performance in diagnosing social skills can be obtained only by also using the latent information that is reflected unconsciously during evaluation.

(Deleted)

Korean Patent No. 10-1198445 (Apparatus and method for analyzing interview video, registered October 31, 2012)

An object of the present invention is to provide an automated evaluated-person analysis system, method, and computer-readable medium using a machine learning model that can automatically provide more accurate analysis results using a machine learning model while taking into account the evaluation situation and the state of the evaluated person.

To solve the above problem, one embodiment of the present invention provides an automated evaluated-person analysis system comprising one or more processors and one or more memories, the system comprising: a feature information generating unit that generates voice feature information and image feature information from input image information and voice information; a speech state inference unit that generates, from information included in the image feature information, speech state information on whether the evaluated person is speaking; and an analysis result inference unit that derives analysis result information of the evaluated person from information including the voice feature information, the image feature information, and the speech state information.

In one embodiment of the present invention, the feature information generating unit includes a voice feature information generating unit that generates the voice feature information and an image feature information generating unit that generates the image feature information, and the image feature information generating unit may include: a first image feature information extracting unit that extracts, from the input image information, one or more elements inside the face other than the mouth region; and a second image feature information extracting unit that extracts elements of the mouth region from the input image information.

In one embodiment of the present invention, the speech state inference unit may generate the speech state information from the second image feature information.

In one embodiment of the present invention, the analysis result inference unit may include: a voice feature map generating unit that generates a voice feature map from the voice feature information; a first image feature map generating unit that generates a first image feature map from the first image feature information; and a second image feature map generating unit that generates a second image feature map from the second image feature information.

In one embodiment of the present invention, the analysis result inference unit may further include a composite feature map generating unit that assigns a respective weight to the voice feature map and to the second image feature map on the basis of the speech state information, and generates a composite feature map from the voice feature map, the first image feature map, and the second image feature map.
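The weighting described in this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text only says that weights are assigned on the basis of the speech state information, so the specific rule used here (voice weight equal to the speaking probability, mouth weight equal to its complement), the fusion by concatenation, and the name `composite_feature_map` are all hypothetical.

```python
import numpy as np


def composite_feature_map(voice_fm, image1_fm, image2_fm, p_speaking):
    """Fuse three per-frame feature maps into one composite feature map.

    voice_fm, image1_fm, image2_fm: arrays of shape (T, D_voice), (T, D_1), (T, D_2).
    p_speaking: length-T sequence, speech state as a probability in [0, 1].
    """
    p = np.asarray(p_speaking, dtype=float)[:, None]  # (T, 1) for broadcasting
    voice_weight = p        # assumption: trust the audio only while the subject speaks
    mouth_weight = 1.0 - p  # assumption: damp mouth features distorted by speaking
    # Assumption: the maps are fused by concatenation along the feature axis.
    return np.concatenate(
        [voice_weight * voice_fm, image1_fm, mouth_weight * image2_fm], axis=1
    )


# Two frames: the subject is silent in frame 0 and speaking in frame 1.
voice = np.ones((2, 3))
face = np.ones((2, 4))
mouth = np.ones((2, 2))
fused = composite_feature_map(voice, face, mouth, [0.0, 1.0])
print(fused.shape)  # (2, 9)
```

In a trained system these weights could themselves be learned, in which case the two hand-written rules above would be replaced by parameters of the model.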

In one embodiment of the present invention, the automated evaluated-person analysis system may further include a context-aware parameter generating unit that generates, from the input voice information and the speech state information, a context-aware parameter including information on the evaluation situation of the evaluated person.

In one embodiment of the present invention, the context-aware parameter generating unit may include a situation determining unit that uses whether voice is present in the voice information, together with the speech state information, to extract context voice information, that is, the part of the voice information that belongs to a person other than the evaluated person.
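One way to read the situation determining unit is as a per-frame mask: audio counts as context voice information when voice is present but the evaluated person is not speaking. The sketch below assumes frame-aligned boolean flags for both signals; the function name `extract_context_voice` and the frame representation are hypothetical and are not taken from the patent.

```python
import numpy as np


def extract_context_voice(audio_frames, voice_present, subject_speaking):
    """Return the audio frames attributed to someone other than the evaluated person.

    audio_frames: array of shape (T, ...) holding one entry per time frame.
    voice_present: length-T booleans, True where the audio contains voice.
    subject_speaking: length-T booleans, the speech state of the evaluated person.
    """
    audio_frames = np.asarray(audio_frames)
    voice_present = np.asarray(voice_present, dtype=bool)
    subject_speaking = np.asarray(subject_speaking, dtype=bool)
    # Voice is heard, but the evaluated person's mouth indicates silence.
    context_mask = voice_present & ~subject_speaking
    return audio_frames[context_mask], context_mask


frames = np.arange(10).reshape(5, 2)
voice_present = [True, True, False, True, True]
subject_speaking = [False, False, False, True, True]
context, mask = extract_context_voice(frames, voice_present, subject_speaking)
print(mask.tolist())  # [True, True, False, False, False]
```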

In one embodiment of the present invention, the context-aware parameter generating unit may include: a content feature map generating unit that generates, from the context voice information, a content feature map for the content of the speech; and a non-content feature map generating unit that generates, from the context voice information, a non-content feature map for the utterance characteristics.

In one embodiment of the present invention, the analysis result inference unit may derive the analysis result information from the voice feature information, the image feature information, the speech state information, and the context-aware parameter.

In one embodiment of the present invention, the analysis result inference unit may derive the analysis result information by applying the context-aware parameter to a composite feature map generated on the basis of the voice feature information, the image feature information, and the speech state information.

To solve the above problem, one embodiment of the present invention provides an automated evaluated-person analysis method implemented by a computing device including one or more processors and one or more memories, the method comprising: a feature information generating step of generating voice feature information and image feature information from input image information and voice information; a speech state inference step of generating, from information included in the image feature information, speech state information on whether the evaluated person is speaking; and an analysis result inference step of deriving analysis result information of the evaluated person from information including the voice feature information, the image feature information, and the speech state information.

In one embodiment of the present invention, the feature information generating step includes a voice feature information generating step of generating the voice feature information and an image feature information generating step of generating the image feature information, and the image feature information generating step may include: a first image feature information extracting step of extracting, from the input image information, one or more elements inside the face other than the mouth region; and a second image feature information extracting step of extracting elements of the mouth region from the input image information.

In one embodiment of the present invention, the speech state inference step may generate the speech state information from the second image feature information.

In one embodiment of the present invention, the analysis result inference step may include: a voice feature map generating step of generating a voice feature map from the voice feature information; a first image feature map generating step of generating a first image feature map from the first image feature information; and a second image feature map generating step of generating a second image feature map from the second image feature information.

In one embodiment of the present invention, the analysis result inference step may further include a composite feature map generating step of assigning a respective weight to the voice feature map and to the second image feature map on the basis of the speech state information, and generating a composite feature map from the voice feature map, the first image feature map, and the second image feature map.

In one embodiment of the present invention, the automated evaluated-person analysis method may further include a context-aware parameter generating step of generating, from the input voice information and the speech state information, a context-aware parameter including information on the evaluation situation of the evaluated person.

In one embodiment of the present invention, the context-aware parameter generating step may include a situation determining step of using whether voice is present in the voice information, together with the speech state information, to extract context voice information, that is, the part of the voice information that belongs to a person other than the evaluated person.

In one embodiment of the present invention, the context-aware parameter generating step may include: a content feature map generating step of generating, from the context voice information, a content feature map for the content of the speech; and a non-content feature map generating step of generating, from the context voice information, a non-content feature map for the utterance characteristics.

In one embodiment of the present invention, the analysis result inference step may derive the analysis result information from the voice feature information, the image feature information, the speech state information, and the context-aware parameter.

In one embodiment of the present invention, the analysis result inference step may derive the analysis result information by applying the context-aware parameter to a composite feature map generated on the basis of the voice feature information, the image feature information, and the speech state information.

To solve the above problem, one embodiment of the present invention provides a computer-readable medium storing instructions that cause a computing device to perform the following steps: a feature information generating step of generating voice feature information and image feature information from input image information and voice information; a speech state inference step of generating, from information included in the image feature information, speech state information on whether the evaluated person is speaking; and an analysis result inference step of deriving analysis result information of the evaluated person from information including the voice feature information, the image feature information, and the speech state information.

According to one embodiment of the present invention, when the diagnosis must differ depending on the interview or the question, accurate analysis results for the evaluated person can be derived in accordance with the question asked.

According to one embodiment of the present invention, a job seeker can train and improve how to respond to different interview or evaluation questions.

According to one embodiment of the present invention, a one-way diagnosis centered on the evaluated person can be obtained simply by inputting script information, and the evaluated person's listening behavior during a two-way conversation can also be included in the diagnostic evaluation.

According to one embodiment of the present invention, communication ability suited to the place or atmosphere can be improved. Because the invention provides a structure that can learn and diagnose situations in a variety of environments, such as a funeral home, an amusement park, a service worker dealing with rude customers, a salesperson who must persuade a customer, or a conversation with a strict superior, it can diagnose social skills that depend on time and place and, through feedback on them, provide an environment in which users can identify their weaknesses and improve their abilities.

According to one embodiment of the present invention, more accurate analysis results can be derived by removing noise caused by the mouth shape in image evaluation.

According to one embodiment of the present invention, more accurate analysis results can be derived in voice evaluation by filtering so that only the voice of the actual evaluator is retained.

According to one embodiment of the present invention, by using a composite feature map that combines image and voice together with a machine learning model, recognition results similar to or better than those of a human evaluator can be provided.

According to one embodiment of the present invention, when image and voice are evaluated in combination, more accurate analysis results can be provided by taking the mouth shape of the evaluated person into account.

FIG. 1 schematically illustrates the operating environment of an automated evaluated-person analysis system according to an embodiment of the present invention.
FIG. 2 schematically illustrates the internal configuration of an automated evaluated-person analysis system according to an embodiment of the present invention.
FIG. 3 schematically illustrates the operation of an analysis result inference unit according to an embodiment of the present invention.
FIG. 4 schematically illustrates the internal configuration of a feature information generating unit according to an embodiment of the present invention.
FIG. 5A schematically illustrates the internal configuration of an analysis result inference unit according to an embodiment of the present invention.
FIG. 5B schematically illustrates the internal configuration of an analysis result inference unit according to an embodiment of the present invention.
FIG. 5C schematically illustrates the internal configuration of an analysis result inference unit according to an embodiment of the present invention.
FIG. 6 schematically illustrates the sampling of voice feature information and image feature information and the generation of feature maps according to an embodiment of the present invention.
FIG. 7 schematically illustrates the operation of a composite feature map generating unit according to an embodiment of the present invention.
FIG. 8 schematically illustrates the operation of a context-aware adaptive inference unit according to an embodiment of the present invention.
FIG. 9 schematically illustrates the internal configuration of a context-aware parameter generating unit according to an embodiment of the present invention.
FIG. 10 schematically illustrates the operation of a situation determining unit according to an embodiment of the present invention.
FIG. 11 schematically illustrates the internal configuration of a non-content feature map generating unit and a content feature map generating unit according to an embodiment of the present invention.
FIG. 12 schematically illustrates the steps of an automated evaluated-person analysis method according to an embodiment of the present invention.
FIG. 13 schematically illustrates a feature map generation process according to an embodiment of the present invention.
FIG. 14 schematically illustrates the overall configuration of an automated evaluated-person analysis system according to an embodiment of the present invention.
FIG. 15 schematically illustrates a user screen showing the analysis results of an automated evaluated-person analysis system according to an embodiment of the present invention.
FIG. 16 illustrates, by way of example, the internal configuration of a computing device according to an embodiment of the present invention.

Various embodiments and/or aspects are now described with reference to the drawings. In the following description, numerous specific details are set forth for purposes of explanation in order to provide a thorough understanding of one or more aspects. However, a person of ordinary skill in the art will recognize that such aspect(s) may be practiced without these specific details. The following description and the accompanying drawings set forth in detail certain illustrative aspects of the one or more aspects. These aspects are illustrative, however, and indicate only some of the various ways in which the principles of the various aspects may be employed; the description is intended to include all such aspects and their equivalents.

Various aspects and features will also be presented in terms of systems that may include a number of devices, components, and/or modules. It should be understood and appreciated that the various systems may include additional devices, components, and/or modules, and/or may not include all of the devices, components, and modules discussed in connection with the drawings.

As used herein, the terms "embodiment", "example", "aspect", "illustration", and the like are not to be construed as meaning that any aspect or design described is better or more advantageous than other aspects or designs. The terms "unit", "component", "module", "system", "interface", and the like used below generally refer to a computer-related entity and may mean, for example, hardware, a combination of hardware and software, or software.

In addition, the terms "comprises" and/or "comprising" mean that the stated feature and/or component is present, but should be understood not to exclude the presence or addition of one or more other features, components, and/or groups thereof.

In addition, terms including ordinal numbers such as "first" and "second" may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

In addition, in the embodiments of the present invention, unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless explicitly defined in the embodiments of the present invention, are not to be interpreted in an idealized or overly formal sense.

FIG. 1 is a diagram schematically illustrating the operating environment of an automated evaluated person analysis system (1000) according to an embodiment of the present invention.

An embodiment of the present invention provides a system, a method, and a computer-readable medium that take into account the problems of the prior art described above. The present invention basically receives image information and voice information as input and derives analysis result information through a model trained by a machine learning technique such as deep learning.

Here, a module that derives the overall or detailed results of the evaluated person analysis system (1000) of the present invention may correspond to a trained model. In addition, the voice and image information may be obtained by extracting them from a video, or by receiving the voice and the image as separate inputs.

The automated evaluated person analysis system (1000) according to an embodiment of the present invention includes a model that infers from the mouth shape whether the evaluated person is speaking. Specifically, the present invention includes a model that separately detects the mouth region of the face contained in the image information and uses a recurrent neural network technique that incorporates a temporal concept, such as an RNN, LSTM, or GRU, to distinguish whether or not the person is speaking.

Based on this information, the system is configured to automatically adjust how much weight the image information and the voice information each carry in the evaluation. Preferably, the model can be configured so that these weights are also learned.

In such a configuration, when image information and voice information are input and the mouth shape indicates that the evaluated person or speaker is not speaking, the system automatically treats the image information as more important (assigning a relatively higher weight than in the normal state) and the voice information as less important (assigning a relatively lower weight than in the normal state).

In addition, the automated evaluated person analysis system (1000) according to an embodiment of the present invention allows context information, such as the type of question, the place, or the situation, to be additionally input or to be learned by the system itself from the video or voice information, so that social skills can be diagnosed or evaluated according to this information. Therefore, even when the same image and voice information of an evaluated person is input, different diagnosis results can be derived depending on the context information.

In addition, because the evaluation of social skills such as communication skills must make use of latent information that is difficult for humans to characterize, the automated evaluated person analysis system (1000) according to an embodiment of the present invention goes beyond traditional machine learning techniques and uses a model based on neural network technology such as deep learning.

According to an embodiment of the present invention, the data derived in this way may also be fed back to the user using statistical methods, thereby providing an environment in which users can identify their actual strengths and weaknesses and train accordingly.

In FIG. 1(A), voice and image information is input, and the evaluated person analysis system (1000) automatically derives information about the context from the voice and image information and, taking this context information into account, can derive the analysis result information of the evaluated person.

In FIG. 1(B), context information is input in addition to the voice and image information (such context information may be entered directly by the user or input by loading preset context information), and the evaluated person analysis system (1000) can derive the analysis result information of the evaluated person by taking the context information into account together with the voice and image information.

Such an evaluated person analysis system (1000) may derive analysis result information from already stored voice and image information or video data, or from voice and image information input in real time.

FIG. 2 is a diagram schematically illustrating the internal configuration of the automated evaluated person analysis system (1000) according to an embodiment of the present invention.

The evaluated person analysis system (1000) according to the above embodiment may be implemented by a computing device having one or more processors and one or more memories.

Such a computing device may include a processor (A), a bus (corresponding to the bidirectional arrows between the processor, the memory, and the network interface), a network interface (B), and a memory (C). The memory (C) may store an operating system and inference unit training data, that is, the trained data for implementing the artificial neural network that is used by the inference or prediction modules of the present invention described below. Alternatively, the inference unit training data may refer to the deep-learned modeling information itself. The processor (A) may execute the feature information generation unit (100), the analysis result inference unit (200), the speech state inference unit (300), and the context-aware parameter generation unit (400). In other embodiments, the automated evaluated person analysis system (1000) may include more components than those shown in FIG. 2.

The memory is a computer-readable recording medium and may include random access memory (RAM), read-only memory (ROM), and a permanent mass storage device such as a disk drive. These software components may be loaded from a computer-readable recording medium separate from the memory using a drive mechanism (not shown). Such a separate computer-readable recording medium (not shown) may include a floppy drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, or the like. In other embodiments, the software components may be loaded into the memory through the network interface (B) rather than from a computer-readable recording medium.

The bus may enable communication and data transfer between the components of the computing device. The bus may be configured using a high-speed serial bus, a parallel bus, a storage area network (SAN), and/or other suitable communication technology.

The network interface (B) may be a computer hardware component for connecting the computing device that implements the automated evaluated person analysis system (1000) to a computer network. The network interface (B) may connect the automated evaluated person analysis system (1000) to the computer network through a wireless or wired connection.

The processor (A) may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and the input/output operations that implement the automated evaluated person analysis system (1000). The instructions may be provided to the processor by the memory (C) or the network interface (B) via the bus. The processor may be configured to execute program code for the feature information generation unit (100), the analysis result inference unit (200), the speech state inference unit (300), and the context-aware parameter generation unit (400). Such program code may be stored in a recording device such as the memory.

The feature information generation unit (100), the analysis result inference unit (200), the speech state inference unit (300), and the context-aware parameter generation unit (400) may be configured to perform the automated evaluated person analysis method described below. Depending on the automated evaluated person analysis method, some components of the processor may be omitted, additional components not shown may be included, or two or more components may be combined.

Meanwhile, the computing device preferably corresponds to a personal computer or a server, and in some cases may correspond to a smartphone, a tablet, a mobile phone, a video phone, an e-book reader, a desktop PC, a laptop PC, a netbook PC, a personal digital assistant (PDA), a portable multimedia player (PMP), an MP3 player, a mobile medical device, a camera, or a wearable device (for example, a head-mounted device (HMD), electronic clothing, an electronic bracelet, an electronic necklace, an electronic appcessory, an electronic tattoo, or a smart watch).

That is, the automated evaluated person analysis system (1000) according to an embodiment of the present invention includes: a feature information generation unit (100) that generates voice feature information and image feature information from input image information and voice information; a speech state inference unit (300) that generates speech state information on whether the evaluated person is speaking from information included in the image feature information; and an analysis result inference unit (200) that derives analysis result information of the evaluated person from information including the voice feature information, the image feature information, and the speech state information.

Preferably, the automated evaluated person analysis system (1000) further includes a context-aware parameter generation unit (400) that generates, from the input voice information and the speech state information, a context-aware parameter containing information about the evaluation situation of the evaluated person.

The detailed components of the evaluated person analysis system (1000) are described below.

FIG. 3 schematically illustrates the operation of the analysis result inference unit (200) according to an embodiment of the present invention.

The feature information generation unit (100) generates the voice feature information, the image feature information, and the speech state information from the image information and the voice information.

Meanwhile, the image feature information includes first image feature information and second image feature information, which correspond to feature information on the region outside the mouth and feature information on the mouth region, respectively.

The feature information generated by the feature information generation unit (100) in this way is input to the analysis result inference unit (200), and the analysis result inference unit (200) derives analysis result information from the voice feature information, the image feature information, the speech state information, and the context-aware parameter.

Rather than applying the image and voice information directly to an artificial neural network, the automated evaluated person analysis system (1000) according to an embodiment of the present invention first derives feature information from it as a primary processing step and applies this to the trained artificial neural network, thereby reducing the computational load and implementing a system better suited to the purpose of analyzing the evaluated person.

FIG. 4 schematically illustrates the internal configuration of the feature information generation unit (100) according to an embodiment of the present invention.

As shown in FIG. 4, the feature information generation unit (100) includes a voice feature information generation unit (110) that generates voice feature information, and an image feature information generation unit (120) that generates image feature information.

That is, according to an embodiment of the present invention, the voice information and the image information are first separated, feature information is derived from each of them, and feature maps are then generated with the two streams synchronized.
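
By way of a non-limiting illustration, the synchronization step mentioned above could be sketched as a nearest-timestamp match between video frames and audio feature frames. The specification does not state how synchronization is performed; the function name `align_to_video` and the nearest-neighbor rule are assumptions made only for this sketch.

```python
from typing import List, Sequence


def align_to_video(video_times: Sequence[float],
                   audio_times: Sequence[float]) -> List[int]:
    """For each video timestamp, return the index of the nearest audio frame.

    Hypothetical helper: nearest-timestamp matching is an assumed
    synchronization rule, not one taken from the specification.
    """
    if not audio_times:
        raise ValueError("audio_times must not be empty")
    return [
        min(range(len(audio_times)), key=lambda j: abs(audio_times[j] - t))
        for t in video_times
    ]


# Three video frames matched against four audio feature frames (seconds).
pairs = align_to_video([0.0, 0.5, 1.0], [0.0, 0.4, 0.9, 1.2])
```

Each resulting index pairs one video frame with one audio feature frame, so that the feature maps generated downstream describe the same instant.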

The voice feature information generation unit (110) includes: a voice preprocessing unit (111) that removes noise from the voice information or extracts the portion to be processed from the entire voice information; and a voice feature information extraction unit (112) that extracts, from the voice information preprocessed by the voice preprocessing unit (111), the noise-removed raw signal or feature information such as prosody, MFCC, eGeMAPS, and frequency information.
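
As a simplified, non-limiting illustration of frame-wise voice feature extraction, the sketch below computes short-time energy and zero-crossing rate for each frame. These two features are stand-ins chosen because they need only the standard library; they are not the prosody, MFCC, or eGeMAPS features named above, and the function name `frame_features` is hypothetical.

```python
from typing import List, Sequence, Tuple


def frame_features(signal: Sequence[float],
                   frame_len: int) -> List[Tuple[float, float]]:
    """Return (short-time energy, zero-crossing rate) per non-overlapping frame.

    Samples that do not fill a complete final frame are dropped.
    """
    if frame_len < 1:
        raise ValueError("frame_len must be at least 1")
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # Energy: mean of squared samples in the frame.
        energy = sum(s * s for s in frame) / frame_len
        # Zero-crossing rate: fraction of adjacent pairs that change sign.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        zcr = crossings / (frame_len - 1) if frame_len > 1 else 0.0
        features.append((energy, zcr))
    return features


features = frame_features([1.0, -1.0, 1.0, -1.0, 0.5], frame_len=2)
```

In practice a dedicated audio library would normally compute MFCC or eGeMAPS descriptors; the sketch only shows the framing pattern that such extraction follows.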

The image feature information generation unit (120) includes: an image preprocessing unit (121) that removes noise from the image information, removes lighting effects, or adjusts brightness, saturation, or hue values according to the surrounding environment; a face detection unit (122) that detects a face from the preprocessed image; a first image feature information extraction unit (123) that extracts one or more elements inside the face other than the mouth region from the detected face region of the input image information; and a second image feature information extraction unit (124) that extracts elements of the mouth region from the detected face region of the input image information.

When the evaluated person speaks, the mouth shape changes, and if this is applied as-is to a model based on an artificial neural network, the mouth shape can affect the feature map or analysis information related to facial expression.

In an embodiment of the present invention, in order to make optimal use of the artificial neural network and provide more accurate analysis results, feature information or images are extracted separately for the elements of the mouth region and the elements of the region outside the mouth, a feature map is generated for each, and the two feature maps are then merged with the weight of the mouth-region feature map determined according to whether the speaker is speaking.

Here, the image feature information may correspond to the image itself or to processed information extracted from the image. For example, the second image feature information may correspond to the cropped image of the mouth region itself, the cropped mouth-region image after predetermined image processing, or feature information derived from the cropped mouth-region image according to a preset rule or algorithm.

FIG. 5A schematically illustrates the internal configuration of the analysis result inference unit (200) according to an embodiment of the present invention.

The feature information generation unit (100) generates voice feature information, first image feature information, and second image feature information from the input image information and voice information.

The voice feature information, the first image feature information, and the second image feature information are input to the analysis result inference unit (200), and the second image feature information is input to the speech state inference unit (300).

The speech state inference unit (300) generates speech state information on whether the evaluated person is speaking from the information included in the second image feature information, and passes it to the analysis result inference unit (200) and the context-aware parameter generation unit (400).

When mouth-shape images are input sequentially over time, the speech state inference unit (300) derives a probability value P_s in the range [0, 1] indicating whether the evaluated person is speaking. For example, it derives a value close to 1 if the person is speaking and a value close to 0 if not, and this value is subsequently used by the analysis result inference unit (200) and the context-aware parameter generation unit (400).

Specifically, the speech state inference unit (300) infers the speech state from one or more mouth-shape images present in the sampling interval for which it is to be determined whether the person is currently speaking. For example, to determine whether the evaluated person is speaking between 2 and 3 seconds, the determination may be made from five items of mouth-shape information (second image feature information) taken between 2 and 3 seconds. Alternatively, the determination may be made from a single item of mouth-shape information (second image feature information) taken between 2 and 3 seconds.
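
The sampling interval in the example above can be illustrated with a minimal sketch that selects the mouth-shape items whose timestamps fall within a given interval. The helper name `frames_in_interval`, the half-open interval convention, and the sample timestamps are assumptions introduced only for illustration.

```python
from typing import List, Sequence, Tuple


def frames_in_interval(frames: Sequence[Tuple[float, str]],
                       start_s: float, end_s: float) -> List[str]:
    """Return the payloads of frames whose timestamp t satisfies start_s <= t < end_s."""
    return [payload for t, payload in frames if start_s <= t < end_s]


# Hypothetical mouth crops sampled every 0.2 s; five fall between 2 s and 3 s.
timestamps = [1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2]
frames = [(t, f"mouth_{i}") for i, t in enumerate(timestamps)]
window = frames_in_interval(frames, 2.0, 3.0)
```

The selected items would then be passed in temporal order to the speech state model for that interval.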

According to an embodiment of the present invention, the speech state inference unit (300) may be implemented with a ConvNet (which generates a feature map from a single image without considering the temporal element), an LSTM (which generates a feature map from two or more images in the sampling interval, taking the temporal concept into account), a sigmoid (which normalizes the LSTM output to a value between 0 and 1), or a combination of these. Preferably, an image-level feature map is first derived with the ConvNet, a secondary feature map is derived through the LSTM from a plurality of such feature maps in consideration of time, the result is connected through an FC (fully connected) layer or a global average pooling layer, and a probability value is then obtained through a function such as the sigmoid. The speech state inference unit (300) is preferably trained in advance on a variety of training images.
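
The preferred chain described above (per-image feature, LSTM over time, a connecting layer, then a sigmoid) can be sketched in miniature as follows. This is only an illustrative sketch under strong assumptions: the ConvNet is replaced by a mean-intensity stand-in, the LSTM has a single hidden unit, and all weights are arbitrary, untrained values. The names `frame_feature`, `lstm_step`, and `speaking_probability` are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Image = Sequence[Sequence[float]]
GateWeights = Dict[str, Tuple[float, float, float]]  # gate -> (w_x, w_h, bias)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def frame_feature(image: Image) -> float:
    """Stand-in for the ConvNet: mean pixel intensity of one mouth crop."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def lstm_step(x: float, h: float, c: float,
              w: GateWeights) -> Tuple[float, float]:
    """One step of a single-unit LSTM cell (input, forget, output, candidate)."""
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h + b)

    i = gate("i", sigmoid)
    f = gate("f", sigmoid)
    o = gate("o", sigmoid)
    g = gate("g", math.tanh)
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new


def speaking_probability(images: Sequence[Image], w: GateWeights,
                         fc_w: float, fc_b: float) -> float:
    """Run the frames through the LSTM in order, then FC and sigmoid."""
    h = c = 0.0
    for image in images:
        h, c = lstm_step(frame_feature(image), h, c, w)
    return sigmoid(fc_w * h + fc_b)


# Arbitrary, untrained weights purely for demonstration.
weights = {name: (1.0, 1.0, 0.0) for name in ("i", "f", "o", "g")}
crops = [[[0.2, 0.4], [0.6, 0.8]]] * 5
p_s = speaking_probability(crops, weights, fc_w=1.0, fc_b=0.0)
```

A practical implementation would use a deep learning framework with trained convolutional and recurrent layers; the sketch only shows how the stages compose into a single probability value.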

The speech state information derived by the speech state inference unit (300) is input to the composite feature map generation unit (240) of the analysis result inference unit (200). When the evaluated person is not speaking, the voice feature map may represent noise or the voice of someone other than the evaluated person, so the influence of the voice feature map is reduced before the evaluation is performed. This can be implemented by lowering the weight of the voice feature map when the composite feature map generation unit (240) merges the feature maps. Adjusting the voice feature map according to the speech state information in this way makes it possible to diagnose the evaluated person's listening attitude.

Conversely, when the evaluated person is speaking, the influence of the second image feature map, which is the feature map for the mouth shape among the image feature maps, is reduced before the evaluation is performed. This can be implemented by lowering the influence of the second image feature map when the composite feature map generation unit (240) merges the feature maps. Various facial information is used when evaluating social skills, and emotional information plays a very important role; because the mouth shape carries much of this information, it can be hard to distinguish the mouth shape of someone who is speaking from that of someone who is surprised or smiling. To solve this problem, the present invention determines, for each preset fixed or variable interval, whether the evaluated person is speaking, and the composite feature map generation unit (240) adjusts the weight of each feature map accordingly to generate the composite feature map.

Meanwhile, the analysis result inference unit (200) derives the analysis result information of the evaluated person from information including the voice feature information, the first image feature information, the second image feature information, and the speech state information, and the speech state inference unit (300) generates the speech state information from the second image feature information.

The analysis result inference unit (200) includes: a voice feature map generation unit (210) that generates a voice feature map from the voice feature information; a first image feature map generation unit (220) that generates a first image feature map from the first image feature information; a second image feature map generation unit (230) that generates a second image feature map from the second image feature information; a composite feature map generation unit (240) that generates a composite feature map from the voice feature map, the first image feature map, and the second image feature map; and a context-aware adaptive inference unit (250) that applies the context-aware parameter to the composite feature map to derive analysis result information.

Here, a feature map may include data that is the analysis result of a known artificial neural network, intermediate data generated during the analysis process of an artificial neural network, analysis result values derived through a model, or feature information data derived by a preset method.

The voice feature map generation unit (210), the first image feature map generation unit (220), and the second image feature map generation unit (230) may be implemented using one or more artificial neural network models used for image analysis; in an embodiment of the present invention, a CNN-family artificial neural network model may be used.

Meanwhile, the composite feature map generation unit (240) assigns a weight to each of the voice feature map and the second image feature map based on the speech state information from the speech state inference unit (300), and generates a composite feature map from the voice feature map, the first image feature map, and the second image feature map.

In an embodiment of the present invention, a multiplication-type operation with the speech state information (P_s), taken as a probability value, may be applied to the voice feature map and the second image feature map.
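
One possible reading of this multiplication-type weighting is sketched below. The specification states only that a multiplication-type operation with P_s is used; scaling the voice feature map by P_s, scaling the second image (mouth) feature map by (1 - P_s), and merging by concatenation are assumptions chosen to match the behavior described for the speaking and non-speaking cases. The function name `merge_feature_maps` is hypothetical.

```python
from typing import List, Sequence


def merge_feature_maps(voice: Sequence[float], face: Sequence[float],
                       mouth: Sequence[float], p_s: float) -> List[float]:
    """Weight the voice and mouth feature maps by P_s and concatenate all three.

    Assumed rule: voice is scaled by p_s (suppressed when not speaking) and
    mouth is scaled by (1 - p_s) (suppressed when speaking). The first image
    (non-mouth face) feature map is passed through unweighted.
    """
    if not 0.0 <= p_s <= 1.0:
        raise ValueError("p_s must lie in [0, 1]")
    weighted_voice = [p_s * v for v in voice]
    weighted_mouth = [(1.0 - p_s) * m for m in mouth]
    return weighted_voice + list(face) + weighted_mouth


# With P_s = 0.25 the person is probably not speaking: voice is damped.
composite = merge_feature_maps([1.0, 2.0], [3.0], [4.0], p_s=0.25)
```

Because the specification also notes that such weights may themselves be learned, a fixed rule like this one is best regarded as a baseline rather than the intended final form.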

Meanwhile, the context-aware parameter generation unit (400) generates, from the input voice information and the speech state information, a context-aware parameter containing information about the evaluation situation of the evaluated person. This context-aware parameter corresponds to information about the current evaluation situation of the evaluated person and may take the form of a matrix or a vector according to a preset rule.

Preferably, the context-aware parameter generation unit (400) converts information such as the type or context of the question, the place, and the atmosphere by means of a suitable permutation or hashing, or uses a machine learning model to generate a suitable parameter vector from it.
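
As a non-limiting illustration of the hashing option mentioned above, the sketch below maps categorical context information to a fixed-length vector using feature hashing. The function name `context_vector`, the dictionary keys, the vector dimension, and the choice of SHA-256 are all assumptions made for this example and do not come from the specification.

```python
import hashlib
from typing import Dict, List


def context_vector(context: Dict[str, str], dim: int = 8) -> List[float]:
    """Hash each key=value pair of the context into one slot of a vector.

    Colliding pairs accumulate in the same slot, as in ordinary feature hashing.
    """
    if dim < 1:
        raise ValueError("dim must be at least 1")
    vector = [0.0] * dim
    for key in sorted(context):
        token = f"{key}={context[key]}".encode("utf-8")
        digest = hashlib.sha256(token).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vector[index] += 1.0
    return vector


# Hypothetical context description for one evaluation session.
params = context_vector({"question_type": "self-introduction",
                         "place": "interview room",
                         "atmosphere": "formal"})
```

The same context always produces the same vector, which is the property required for the parameter to follow a preset rule.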

Preferably, the context-aware parameter generation unit (400) may also accept appropriate values for information such as the type or context of the question, the place, and the atmosphere according to user settings, without the input voice information and the speech state information. The internal configuration for such an embodiment is shown in FIG. 5B.

When voice is present, the context-aware parameter generation unit (400) uses the speech state inference unit (300) to determine whether or not it is the evaluated person who is speaking; if it is determined that the voice is being uttered by someone other than the evaluated person, the voice information of that interval is treated as question information, and a feature map can be generated from it. Specifically, feature maps are generated separately for the content information and the vocal information (intonation, pitch, intensity, and so on) of the uttered voice, the generated feature maps are combined using an FC (fully connected) layer or a global average pooling layer, and a parameter vector can then be generated from the result.
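
The interval selection described above can be sketched as follows: frames in which voice is present while the speaking probability of the evaluated person is low are grouped into candidate question intervals. The function name `question_intervals`, the per-frame boolean voice-activity input, and the 0.5 threshold are assumptions introduced only for this illustration.

```python
from typing import List, Sequence, Tuple


def question_intervals(voice_active: Sequence[bool], p_s: Sequence[float],
                       threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges treated as question audio.

    A frame counts as question audio when voice is present and the speaking
    probability of the evaluated person is below the threshold.
    """
    n = min(len(voice_active), len(p_s))
    intervals = []
    start = None
    for idx in range(n):
        is_question = voice_active[idx] and p_s[idx] < threshold
        if is_question and start is None:
            start = idx
        elif not is_question and start is not None:
            intervals.append((start, idx))
            start = None
    if start is not None:
        intervals.append((start, n))
    return intervals


active = [True, True, False, True, True]
speaking = [0.1, 0.2, 0.1, 0.9, 0.3]
ranges = question_intervals(active, speaking)
```

The audio in each returned range would then be the input from which the content and vocal feature maps are generated.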

In an embodiment of the present invention, rather than automatically generating the context-aware parameter from the voice and image information as described above, information about the context may be received directly from the user and the context-aware parameter generated from it; alternatively, a final context-aware parameter may be generated from both the context information received from the user and the automatically generated context-aware parameter described above.

Meanwhile, the composite feature map generated by the composite feature map generation unit 240 and the context-aware parameters are input to the context-aware adaptive inference unit 250, which can generate analysis result information from them. This analysis result information includes each item of evaluation information for the evaluated person.

Specifically, the context-aware adaptive inference unit 250 may include: a recurrent neural network module (for example, an LSTM) that accepts multiple inputs so that a first final feature map can be generated from the plurality of composite feature maps while taking their temporal order into account; a dynamic parameter layer that receives the context-aware parameters and generates, from the first final feature map, a second final feature map to which the context-aware parameters have been applied; and an FC layer that connects the second final feature map.

FIG. 5B schematically shows the internal configuration of the analysis result inference unit 200 according to an embodiment of the present invention.

As shown in FIG. 5B, the context-aware parameter generation unit generates the context-aware parameters based on user input information.

In one embodiment of the present invention, the user input information may include a script and context information such as the question type or the location. In this case, the context-aware parameter generation unit uses a configuration like the content feature map generation unit 430 of FIG. 11 to generate a first content feature map from the character string contained in the script, and then generates a first context-aware parameter from the first content feature map. In addition, the context-aware parameter generation unit may generate a second context-aware parameter from the context information such as the question type or the location, and generate the context-aware parameters from the first context-aware parameter and the second context-aware parameter. This corresponds to the semi-automatic mode of FIG. 14.

Alternatively, in one embodiment of the present invention, the user input information may include context information such as the question type or the location. In this case, the context-aware parameter generation unit may generate the context-aware parameters directly from that context information. This corresponds to the manual mode of FIG. 14.

FIG. 5C schematically shows the internal configuration of the analysis result inference unit 200 according to an embodiment of the present invention.

The embodiment of FIG. 5C differs structurally from the embodiment of FIG. 5A in that the input to the speech state inference unit is face image information rather than the second image feature information.

Specifically, as shown in FIG. 5C, the speech state inference unit extracts the speech state information from the face image information.

Preferably, the speech state inference unit receives an image of the face region preprocessed by the face preprocessing unit 121 and the face detection unit 122 of the feature information generation unit 100. The speech state inference unit 300 generates, from information that includes the face image information, speech state information indicating whether the evaluated person is speaking, and passes it to the analysis result inference unit 200 and the context-aware parameter generation unit 400. Alternatively, as with the context-aware parameter generation unit 400 of FIG. 5B, the speech state information may be input only to the composite feature map generation unit 240.

When mouth-shape images are input sequentially over time, the speech state inference unit 300 derives a probability value P_s in the range [0, 1] indicating whether the evaluated person is speaking. For example, the value is close to 1 when the person is speaking and close to 0 when the person is not. This value is subsequently used by the analysis result inference unit 200 and/or the context-aware parameter generation unit 400. This is the same as described for FIGS. 5A and 5B, so a duplicate description is omitted.

FIG. 6 schematically illustrates the sampling of voice feature information and image feature information and the generation of feature maps according to an embodiment of the present invention.

The analysis result inference unit 200 includes: a voice feature map generation unit 210 that generates a voice feature map from the voice feature information; a first image feature map generation unit 220 that generates a first image feature map from the first image feature information; and a second image feature map generation unit 230 that generates a second image feature map from the second image feature information.

Here, the voice information and the image information are sampled at a preset period. For example, as shown in FIG. 6, when the image information contains 8 frames per second and the sampling timing is set to 1 second, a voice feature map can be generated from the voice information of each of the 0 to 1 second, 1 to 2 second, and 2 to 3 second intervals.

For the image feature maps, on the other hand, one frame is extracted from, for example, each of the 0 to 1 second, 1 to 2 second, and 2 to 3 second intervals, and an image feature map can be generated for each extracted frame.

Meanwhile, the speech state inference unit 300 may extract the speech state information indicating whether the person is speaking during, for example, the 0 to 1 second interval either from all 8 frames of that interval or from a single frame of that interval (as shown in FIG. 6).

Meanwhile, each voice feature map, first image feature map, and second image feature map may carry time information synchronized with the speech state information. That is, within a given time interval the voice feature map, the first image feature map, the second image feature map, and the speech state information are synchronized in time, or alternatively the voice feature map, the first image feature map, and the second image feature map may include a data field for time information as part of their data format.
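The sampling scheme described for FIG. 6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `SampledInterval` container, the function name, and the choice of the first frame of each interval as the representative frame are assumptions made for the example; the text only states that one frame per interval is used for the image feature maps and that one or all frames may be used for the speech state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampledInterval:
    """One sampling interval with time-aligned inputs (hypothetical container)."""
    index: int
    start_s: float
    end_s: float
    audio: List[float]         # audio samples that fall inside the interval
    frame: int                 # single frame index used for the image feature maps
    speech_frames: List[int]   # frame indices available for speech state inference


def split_into_intervals(audio, audio_rate, n_frames, fps=8, period_s=1.0):
    """Cut audio and video into synchronized intervals of length period_s."""
    samples_per_interval = int(audio_rate * period_s)
    frames_per_interval = int(fps * period_s)
    n_intervals = min(len(audio) // samples_per_interval,
                      n_frames // frames_per_interval)
    intervals = []
    for i in range(n_intervals):
        first_frame = i * frames_per_interval
        frames = list(range(first_frame, first_frame + frames_per_interval))
        intervals.append(SampledInterval(
            index=i,
            start_s=i * period_s,
            end_s=(i + 1) * period_s,
            audio=audio[i * samples_per_interval:(i + 1) * samples_per_interval],
            frame=frames[0],        # assumption: first frame represents the interval
            speech_frames=frames,   # all 8 frames, per the first option in the text
        ))
    return intervals
```

Because every interval carries its own start and end time, the feature maps and the speech state information derived from it can be matched by index or by time field, which is the synchronization the description relies on.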

FIG. 7 schematically illustrates the operation of the composite feature map generation unit 240 according to an embodiment of the present invention.

The composite feature map generation unit 240 assigns respective weights to the voice feature map and the second image feature map based on the speech state information, and generates a composite feature map from the voice feature map, the first image feature map, and the second image feature map. That is, the composite feature map generation unit 240 includes a feature map merging unit 241 that merges the weighted feature maps.

Specifically, FIG. 7 shows an embodiment in which there are time intervals or timings #1 through #N and, for the #n-th time interval, a composite feature map is generated from the voice feature map, the first image feature map, the second image feature map, and the speech state information, where the speech state information has a probability value P_s.

Specifically, the composite feature map generation unit 240 assigns weights to the input feature maps and then merges, combines, concatenates, or hashes them. Any of the various feature map merging algorithms known in the field of artificial neural networks may be used for this merging.

According to one embodiment of the present invention, when the speech state information is implemented as a probability value indicating whether the person is speaking (P_s, between 0 and 1), the weight P_s is applied to the voice feature map before it is input to the feature map merging unit, and the weight 1 - P_s is applied to the second image feature map for the mouth shape. Therefore, when the evaluated person is speaking, the composite feature map is generated with the influence of the voice feature map increased and the influence of the second image feature map for the mouth shape decreased. Conversely, when the evaluated person is not speaking, the composite feature map is generated with the influence of the voice feature map decreased and the influence of the second image feature map increased. The speech state information can be applied in this way by multiplying the corresponding feature map by the probability value.
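The weighting rule above can be illustrated with a short sketch. It assumes that each feature map is a flat list of floats and that merging is done by concatenation, which is only one of the merging options the description mentions; the function name is hypothetical.

```python
from typing import List


def merge_feature_maps(voice: List[float],
                       face: List[float],
                       mouth: List[float],
                       p_s: float) -> List[float]:
    """Weight the voice map by P_s and the mouth map by 1 - P_s, then concatenate.

    voice: voice feature map
    face:  first image feature map (face elements outside the mouth), left unweighted
    mouth: second image feature map (mouth region)
    p_s:   probability that the evaluated person is speaking
    """
    if not 0.0 <= p_s <= 1.0:
        raise ValueError("p_s must lie in [0, 1]")
    weighted_voice = [p_s * v for v in voice]
    weighted_mouth = [(1.0 - p_s) * m for m in mouth]
    # Concatenation is an assumed merge strategy for this example.
    return weighted_voice + list(face) + weighted_mouth
```

With `p_s = 0.75`, for instance, the voice values keep three quarters of their magnitude while the mouth values keep one quarter, so the voice dominates the composite feature map while the person is speaking.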

FIG. 8 schematically illustrates the operation of the context-aware adaptive inference unit 250 according to an embodiment of the present invention.

In one embodiment of the present invention, the analysis result inference unit 200 derives the analysis result information from the voice feature information, the image feature information, the speech state information, and the context-aware parameters.

Specifically, the analysis result inference unit 200 derives the analysis result information by applying the context-aware parameters to the composite feature map generated based on the voice feature information, the image feature information, and the speech state information.

As shown in FIG. 8, a composite feature map is generated for each sampling interval. For example, when there are sampling intervals #1 through #N, N composite feature maps can be generated.

Here, sampling intervals #1 through #N may cover the entire duration of the image and voice information, or only part of it.

Meanwhile, the N composite feature maps are input to the context-aware adaptive inference unit 250. Specifically, the context-aware adaptive inference unit 250 includes a multiple-input machine learning model unit 242.1 such as the recurrent neural network described above, a context-aware parameter application unit 242.2 such as a dynamic parameter layer, and an FC module 242.3 that processes, with an FC layer, the feature map derived through the context-aware parameter application unit 242.2.

In this way, the N composite feature maps over the whole sampling range (each containing information about the voice and the image) are input to the multiple-input machine learning model unit 242.1 to generate a first final feature map, and the context-aware parameter application unit 242.2 converts the first final feature map into a second final feature map according to the input context-aware parameters. The second final feature map is then passed through the FC layer and may be output in the form of diagnostic result information.
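The three-stage data flow described above (temporal aggregation, application of the context-aware parameters, FC output) can be sketched as below. This is a toy illustration of the data flow only, under loud assumptions: the recurrent step is a simple untrained tanh recurrence standing in for an LSTM, the dynamic parameter layer is modeled as element-wise scaling, and all function names, the `decay` constant, and the weights are hypothetical.

```python
import math
from typing import List, Sequence


def recurrent_aggregate(feature_maps: Sequence[Sequence[float]],
                        decay: float = 0.5) -> List[float]:
    """Stand-in for unit 242.1: fold N composite maps into one, in temporal order."""
    state = [0.0] * len(feature_maps[0])
    for fmap in feature_maps:
        # The new state depends on the previous state, so order matters.
        state = [math.tanh(decay * s + x) for s, x in zip(state, fmap)]
    return state  # plays the role of the first final feature map


def apply_context(first_final: Sequence[float],
                  context_params: Sequence[float]) -> List[float]:
    """Stand-in for unit 242.2: element-wise scaling is an assumed form."""
    return [f * c for f, c in zip(first_final, context_params)]


def fully_connected(vec: Sequence[float],
                    weights: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> List[float]:
    """Stand-in for unit 242.3: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def infer(feature_maps, context_params, weights, bias) -> List[float]:
    first_final = recurrent_aggregate(feature_maps)
    second_final = apply_context(first_final, context_params)
    return fully_connected(second_final, weights, bias)
```

In a real system each stage would be a trained layer; the sketch only shows that the context-aware parameters act on the aggregated representation before the final FC stage, rather than on the individual per-interval maps.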

FIG. 9 schematically shows the internal configuration of the context-aware parameter generation unit 400 according to an embodiment of the present invention.

The context-aware parameter generation unit 400 includes: a situation determination unit 410 that extracts, based on whether voice is present in the voice information and on the speech state information, context voice information, which is the voice information of a person other than the evaluated person; a non-content feature map generation unit 420 that generates a non-content feature map of the utterance characteristics from the voice information; a content feature map generation unit 430 that generates a content feature map of the actual content from the voice information; a feature map matching unit 440 that matches or merges the non-content feature map and the content feature map; and a context-aware parameter derivation unit 450 that generates the context-aware parameters from the matched feature map.

That is, the context-aware parameter generation unit 400 includes: a content feature map generation unit 430 that generates a content feature map of the content of the voice from the context voice information; and a non-content feature map generation unit 420 that generates a non-content feature map of the utterance characteristics from the context voice information.

In another embodiment of the present invention, the context-aware parameters may be generated directly from user input.

In another embodiment of the present invention, the context-aware parameters may be generated directly by taking into account both the matched feature map and the user input information.

Meanwhile, the context-aware parameter generation unit 400 may generate the context-aware parameters from the voice information of a preset interval or of the entire interval. In this case, the speech state information (#1 through #N) for that interval is additionally taken into account. This speech state information serves to extract, from the whole voice signal, the voice of the evaluator rather than that of the evaluated person.

FIG. 10 schematically shows the operation of the situation determination unit 410 according to an embodiment of the present invention.

As shown in FIG. 10, the situation determination unit 410 extracts, based on whether voice is present in the voice information and on the speech state information, context voice information, which is the voice information of a person other than the evaluated person.

Specifically, given the voice information for the entire interval, the situation determination unit 410 classifies an interval as context voice information when voice is present and the speech state information indicates that the evaluated person is not speaking, or when the speaking probability P_s is at or below a preset threshold.
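The decision rule above can be sketched as a simple per-interval filter. The function name, the list-based inputs, and the default threshold of 0.5 are assumptions for illustration; the description only states that a preset threshold is used.

```python
from typing import List, Sequence


def find_context_intervals(voice_active: Sequence[bool],
                           p_s_values: Sequence[float],
                           threshold: float = 0.5) -> List[int]:
    """Return indices of intervals whose voice is attributed to someone
    other than the evaluated person.

    voice_active: whether any voice is present in each interval
    p_s_values:   speaking probability of the evaluated person per interval
    threshold:    hypothetical preset threshold; P_s at or below it means
                  the evaluated person is treated as not speaking
    """
    if len(voice_active) != len(p_s_values):
        raise ValueError("inputs must cover the same intervals")
    return [i for i, (active, p_s) in enumerate(zip(voice_active, p_s_values))
            if active and p_s <= threshold]
```

The returned indices select the portions of the voice signal that would then be passed on as context voice information, for example the interviewer's questions.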

The non-content feature map generation unit 420 and the content feature map generation unit 430 generate feature maps from this context voice information.

FIG. 11 schematically shows the internal configuration of the non-content feature map generation unit 420 and the content feature map generation unit 430 according to an embodiment of the present invention.

The context-aware parameter generation unit 400 includes: a content feature map generation unit 430 that generates a content feature map of the content of the voice from the context voice information; and a non-content feature map generation unit 420 that generates a non-content feature map of the utterance characteristics from the context voice information.

The content feature map generation unit converts the voice uttered by the evaluated person into text and generates a feature map of its substantive content. For example, when the evaluated person answers a given question, the feature map of the substantive content of that answer is the content feature map. Lexical features are one example of such substantive content. For instance, a feature map based on characteristics such as the frequency of particular kinds of expression (one or more of positive, negative, evasive, defensive, and ambiguous expressions, among others) may serve as one example of a content feature map.
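As a rough illustration of the lexical-frequency example above, the sketch below counts how often words from a few expression categories occur in a transcript. The category word lists, the English tokenization, and the function name are all hypothetical; the description does not specify how expressions are recognized, and in the described system such features would feed a trained model rather than be used directly.

```python
import re
from collections import Counter
from typing import Dict, List, Set

# Hypothetical category vocabularies, for illustration only.
CATEGORY_TERMS: Dict[str, Set[str]] = {
    "positive": {"yes", "definitely", "glad"},
    "negative": {"no", "never", "cannot"},
    "evasive": {"maybe", "perhaps", "somehow"},
}


def lexical_feature_vector(text: str,
                           categories: Dict[str, Set[str]] = CATEGORY_TERMS
                           ) -> List[float]:
    """Return the relative frequency of each expression category in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    counts: Counter = Counter()
    for token in tokens:
        for name, terms in categories.items():
            if token in terms:
                counts[name] += 1
    # One value per category, in the dictionary's insertion order.
    return [counts[name] / total if total else 0.0 for name in categories]
```

Normalizing by the token count keeps the vector comparable between short and long answers.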

The non-content feature map, on the other hand, corresponds to the utterance characteristics of the voice uttered by the evaluated person. For example, a feature map based on one or more characteristics such as the intonation, tone, pitch, speed, breathing, and prosody of the utterance may serve as one example of a non-content feature map.

Specifically, the non-content feature map generation unit 420 includes: a context-aware voice feature information extraction unit 421 that extracts the utterance characteristics of the voice in a partial sampling interval; a non-content single-input machine learning model unit 422 that generates a preliminary feature map from the extracted utterance characteristics; and a non-content multiple-input machine learning model unit 423 that generates the non-content feature map from the feature maps of the partial sampling intervals.

The non-content single-input machine learning model unit 422 may be implemented as an artificial neural network model, such as a DNN or CNN, that extracts an analysis result from a single input, and the non-content multiple-input machine learning model unit 423 may be implemented as a recurrent neural network model, such as an LSTM, that extracts an analysis result from multiple inputs.

Meanwhile, the content feature map generation unit 430 includes: a text extraction unit 431 that converts voice into text information; and a content multiple-input machine learning model unit 432 that generates the content feature map from the text information extracted by the text extraction unit.

The content multiple-input machine learning model unit 432 may be implemented as an artificial neural network model that uses a recurrent neural network to extract an analysis result from multiple inputs.

FIG. 12 schematically illustrates the steps of an automated evaluated person analysis method according to an embodiment of the present invention.

The automated evaluated person analysis method can be performed by the automated evaluated person analysis system described above, and some of the parts that overlap with the description of the system are omitted for convenience.

The automated evaluated person analysis method according to an embodiment of the present invention is implemented on a computing device that includes one or more processors and one or more memories.

Specifically, the method includes: a feature information generation step (S100) of generating voice feature information and image feature information from the input image information and voice information; a speech state inference step (S200) of generating, from information included in the image feature information, speech state information indicating whether the evaluated person is speaking; and an analysis result inference step (S400) of deriving the analysis result information of the evaluated person from information including the voice feature information, the image feature information, and the speech state information.

Preferably, the method further includes a context-aware parameter generation step (S300) of generating, from the input voice information and the speech state information, context-aware parameters that include information about the evaluation situation of the evaluated person, and the analysis result inference step may derive the analysis result information by additionally taking the context-aware parameters into account.

Preferably, the feature information generation step includes a voice feature information generation step of generating the voice feature information and an image feature information generation step of generating the image feature information, and the image feature information generation step may include: a first image feature information extraction step of extracting, from the input image information, one or more elements of the face other than the mouth region; and a second image feature information extraction step of extracting, from the input image information, the elements of the mouth region.

Preferably, the speech state inference step may generate the speech state information from the second image feature information.

Preferably, the analysis result inference step may include: a voice feature map generation step of generating a voice feature map from the voice feature information; a first image feature map generation step of generating a first image feature map from the first image feature information; and a second image feature map generation step of generating a second image feature map from the second image feature information.

Preferably, the analysis result inference step may further include a composite feature map generation step of assigning respective weights to the voice feature map and the second image feature map based on the speech state information and generating a composite feature map from the voice feature map, the first image feature map, and the second image feature map.

Preferably, the automated evaluated person analysis method may further include a context-aware parameter generation step of generating, from the input voice information and the speech state information, context-aware parameters that include information about the evaluation situation of the evaluated person.

Preferably, the context-aware parameter generation step may include a situation determination step of extracting, based on whether voice is present in the voice information and on the speech state information, context voice information, which is the voice information of a person other than the evaluated person.

Preferably, the context-aware parameter generation step may include: a content feature map generation step of generating a content feature map of the content of the voice from the context voice information; and a non-content feature map generation step of generating a non-content feature map of the utterance characteristics from the context voice information.

Preferably, the analysis result inference step may derive the analysis result information from the voice feature information, the image feature information, the speech state information, and the context-aware parameters.

Preferably, the analysis result inference step may derive the analysis result information by applying the context-aware parameters to the composite feature map generated based on the voice feature information, the image feature information, and the speech state information.

FIG. 13 schematically illustrates a feature map generation process according to an embodiment of the present invention.

FIG. 13 shows, by way of example, a process of generating a feature map for the mouth region using a CNN artificial neural network model. The method shown in FIG. 13 may be performed by the second image feature map generation unit 230.

Specifically, referring to FIG. 3, second image feature information of size 64 X 64, for example, is convolved with 32 filters of size 5 X 5 to produce 32 outputs of size 60 X 60, which are reduced to size 30 X 30 by 3 X 3 max pooling. The 32 outputs of size 30 X 30 are again convolved with 32 filters of size 5 X 5 and again passed through 3 X 3 max pooling, producing 32 outputs of size 13 X 13. Convolving these with 64 filters of size 5 X 5 yields 64 outputs of size 9 X 9; the boundary is filled with zeros to make them 10 X 10, and 3 X 3 max pooling is then applied to obtain 64 outputs of size 5 X 5.

After that, an FC layer with 512 weights and an FC layer with 128 weights produce a final feature map consisting of 128 values. For example, the number of final output values may equal the number of items of characteristic information about the mouth shape that are to be obtained.

The second image feature map may be the final 128 output values themselves, or it may be a convolved feature map taken from an intermediate stage. A CNN-based artificial neural network module of this kind can be built by combining one or more convolution layers, pooling layers, activation functions such as ReLU, and an FC layer or a Global Average Pooling layer.
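The spatial sizes quoted for the convolution and pooling stages can be checked with a few lines of arithmetic. The sketch below assumes unpadded, stride-1 convolutions and 3 X 3 max pooling with stride 2 and ceiling rounding; the stride and rounding mode are not stated in the description and are chosen here only because they reproduce the stated sizes. The function names are hypothetical.

```python
import math
from typing import List


def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int = 3, stride: int = 2) -> int:
    """Output width of max pooling, assuming ceiling rounding."""
    return math.ceil((size - kernel) / stride) + 1


def trace_sizes(size: int = 64) -> List[int]:
    """Follow the spatial size through the stages described for FIG. 13."""
    trace = [size]
    size = conv_out(size, 5); trace.append(size)   # 5 X 5 conv, 32 filters
    size = pool_out(size);    trace.append(size)   # 3 X 3 max pooling
    size = conv_out(size, 5); trace.append(size)   # 5 X 5 conv, 32 filters
    size = pool_out(size);    trace.append(size)   # 3 X 3 max pooling
    size = conv_out(size, 5); trace.append(size)   # 5 X 5 conv, 64 filters
    size = size + 1;          trace.append(size)   # zero fill from 9 to 10
    size = pool_out(size);    trace.append(size)   # 3 X 3 max pooling
    return trace


print(trace_sizes())  # [64, 60, 30, 26, 13, 9, 10, 5]
```

The 26 X 26 intermediate size is not quoted in the description; it follows from applying an unpadded 5 X 5 convolution to a 30 X 30 input. The final stage leaves 64 maps of size 5 X 5, that is 1,600 values, as the input to the FC layers.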

In addition, the components that derive the feature maps described above, for example the voice feature map generation unit 210, the first image feature map generation unit 220, the second image feature map generation unit 230, the multiple-input machine learning model unit 242.1, the non-content feature map generation unit 420, and the content feature map generation unit 430, may be implemented with any of various modules capable of realizing an artificial neural network model. What they have in common is that they derive output information for a given single or multiple input based on what has been learned.

Meanwhile, the voice feature map generation unit (210), the first image feature map generation unit (220), the second image feature map generation unit (230), the multi-input machine learning model unit (242.1), the non-content feature map generation unit (420), and the content feature map generation unit (430) are assumed to have been trained on various training data, and may also continue to be trained.

The present invention does not train on the given video information, or on the image and voice information as a whole, and then use a single trained artificial neural network model. Instead, it separates the voice and the image, generates a feature map for each using its own artificial neural network model, and merges the feature maps thus generated according to the speech state information. It then applies the context-aware parameter and analyzes the evaluated person using a further artificial neural network model. Thus, unlike applying an artificial neural network model to the process as a whole, an artificial neural network model is applied to each sub-process, further data processing is performed on the results, and the processed data is applied to an artificial neural network model again, so that more accurate results can be derived.

FIG. 14 schematically illustrates the overall configuration of the automated evaluated person analysis system (1000) according to an embodiment of the present invention.

In the embodiment shown in FIG. 14, the image of the face region and the voice information are separated and extracted from the given video information or from the voice and image information. The voice information is converted into a feature map by a CNN artificial neural network model. For the image information, the mouth region and the region outside the mouth are each converted into a feature map by a CNN artificial neural network model. The three feature maps are then converted into a composite feature map according to the speech state information, which is a probability value.
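The merging step can be illustrated with a short sketch. The text states only that the three feature maps are combined according to the speech state probability, and the claims state that a higher speaking probability raises the weight of the voice feature map and lowers the weight of the second (mouth) image feature map. The specific rule below is an assumption made for illustration: voice weight p, mouth weight 1 - p, the first image feature map left unweighted, and the three maps concatenated. The function name is hypothetical.

```python
def composite_feature_map(voice, face, mouth, speaking_prob):
    """Weight the voice and mouth maps by the speaking probability, then join.

    voice, face, mouth: flat lists of floats (feature maps).
    speaking_prob: probability in [0, 1] that the evaluated person is speaking.
    """
    if not 0.0 <= speaking_prob <= 1.0:
        raise ValueError("speaking_prob must lie in [0, 1]")
    voice_weight = speaking_prob          # rises as speaking becomes likely
    mouth_weight = 1.0 - speaking_prob    # falls as speaking becomes likely
    weighted_voice = [voice_weight * v for v in voice]
    weighted_mouth = [mouth_weight * m for m in mouth]
    return weighted_voice + list(face) + weighted_mouth


merged = composite_feature_map([1.0, 2.0], [3.0], [4.0], speaking_prob=0.75)
print(merged)
```

With a speaking probability of 0.75 the voice entries are scaled by 0.75 and the mouth entry by 0.25, so mouth movement caused by talking contributes less to the composite feature map than it would for a silent evaluated person.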

The composite feature map is then input to a recurrent neural network model such as an LSTM. Next, the context-aware parameter and the feature map output by the recurrent neural network are input to a dynamic parameter layer, and the final output passes through an FC layer to produce the diagnosis result.

Meanwhile, in FIG. 14, in automatic mode the context-aware parameter for the whole video or for a partial section of it is derived with reference to the speech state information. In semi-automatic mode, a string-based feature map is generated from a manually written text script by a recurrent neural network such as an LSTM, and the context-aware parameter is derived from that feature map. In manual mode, the user's input is converted directly into the context-aware parameter.
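The three modes amount to a dispatch on where the context-aware parameter comes from. The sketch below shows only that control flow. The two helper functions are deliberately trivial stand-ins and are assumptions: `mean_speaking` replaces whatever model derives the parameter from the speech state information, and `word_count` replaces the recurrent network that the text says encodes the script. All names are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    AUTOMATIC = "automatic"
    SEMI_AUTOMATIC = "semi_automatic"
    MANUAL = "manual"


def mean_speaking(speech_state):
    """Stand-in: summarize per-frame speaking probabilities by their mean."""
    return [sum(speech_state) / len(speech_state)]


def word_count(script):
    """Stand-in for the script encoder (an LSTM in the text)."""
    return [float(len(script.split()))]


def context_parameter(mode, speech_state=None, script=None, user_input=None):
    """Return a context-aware parameter vector for the selected mode."""
    if mode is Mode.AUTOMATIC:
        if not speech_state:
            raise ValueError("automatic mode needs speech state information")
        return mean_speaking(speech_state)
    if mode is Mode.SEMI_AUTOMATIC:
        if not script:
            raise ValueError("semi-automatic mode needs a text script")
        return word_count(script)
    if mode is Mode.MANUAL:
        if user_input is None:
            raise ValueError("manual mode needs user input")
        return list(user_input)   # used directly, with no model in between
    raise ValueError(f"unknown mode: {mode!r}")


print(context_parameter(Mode.AUTOMATIC, speech_state=[0.0, 1.0, 1.0, 0.0]))
print(context_parameter(Mode.MANUAL, user_input=(1.0, 0.0)))
```

Keeping the mode as an explicit argument means the downstream layers receive a parameter vector in every case and do not need to know how it was produced.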

FIG. 15 schematically illustrates a user screen showing the analysis results of the automated evaluated person analysis system (1000) according to an embodiment of the present invention.

As shown in FIG. 15, in an embodiment of the present invention, the diagnosis result information may be divided into the following social skill evaluation items: liveliness, smiling face, eye contact, friendliness, reliability, confidence, concentration, speech rate, engagement, recommendability, calmness, authenticity, and the like. Each such evaluation item may correspond to the value of one element of the vector obtained through the FC layer in the context-aware adaptive inference unit (250).

In addition, as with the emotional features, the diagnosis result information may include surprise, happiness, love, contempt, anger, fear, sadness, and displeasure, and these too may correspond to the values of the elements of the vector obtained through the FC layer in the context-aware adaptive inference unit (250).
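Reading the FC output as named scores is a matter of pairing each vector element with its item. The sketch below assumes that the social skill items and the emotion items arrive as two separate vectors and that elements appear in the order in which the items are listed in the text. Neither point is specified in the text, and the English item names are translations chosen for this example.

```python
SOCIAL_SKILL_ITEMS = [
    "liveliness", "smiling face", "eye contact", "friendliness",
    "reliability", "confidence", "concentration", "speech rate",
    "engagement", "recommendability", "calmness", "authenticity",
]

EMOTION_ITEMS = [
    "surprise", "happiness", "love", "contempt",
    "anger", "fear", "sadness", "displeasure",
]


def label_result(vector, items):
    """Pair each FC output value with its evaluation item (order assumed)."""
    if len(vector) != len(items):
        raise ValueError("vector length must match the number of items")
    return dict(zip(items, vector))


# Made-up scores, for illustration only.
emotion_scores = label_result(
    [0.1, 0.7, 0.05, 0.0, 0.02, 0.03, 0.05, 0.05], EMOTION_ITEMS
)
print(emotion_scores["happiness"])
```

The length check guards against silently mislabeling scores if the network's output width and the item list drift apart.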

According to the present invention, latent items that may be important factors in evaluating the evaluated person are detected by the artificial neural network model and reflected in the analysis, so that result values for social skills and emotions can be derived more accurately.

In the prior art, when the evaluated person is evaluated automatically based on preset feature information (for example, the length-to-width ratio of the eye shape), such latent factors may not be reflected. On the other hand, when an artificial neural network module is built for the entire video, the computational load becomes excessive and accurate results may not be obtained.

In contrast, the automated evaluated person analysis system (1000) according to an embodiment of the present invention optimizes the artificial neural network model while handling the speech state and the like through separate data processing. This further reduces the noise load on the learning model and can thereby lower the malfunction rate of the learning model.

FIG. 16 illustrates an example internal configuration of a computing device according to an embodiment of the present invention.

As shown in FIG. 16, the computing device (11000) may include at least one processor (11100), a memory (11200), a peripheral interface (11300), an input/output subsystem (I/O subsystem) (11400), a power circuit (11500), and a communication circuit (11600). Here, the computing device (11000) may correspond to the user terminal (A) connected to the tactile interface device or to the computing device (B) described above.

The memory (11200) may include, for example, high-speed random access memory, a magnetic disk, SRAM, DRAM, ROM, flash memory, or non-volatile memory. The memory (11200) may contain the software modules, instruction sets, or other various data needed for the operation of the computing device (11000).

Access to the memory (11200) by other components, such as the processor (11100) or the peripheral interface (11300), may be controlled by the processor (11100). The processor (11100) may be a single processor or a plurality of processors, and may include GPU-type and TPU-type processors to increase computation speed.

The peripheral interface (11300) may couple the input and/or output peripherals of the computing device (11000) to the processor (11100) and the memory (11200). The processor (11100) may execute software modules or instruction sets stored in the memory (11200) to perform various functions for the computing device (11000) and to process data.

The input/output subsystem (11400) may couple various input/output peripherals to the peripheral interface (11300). For example, the input/output subsystem (11400) may include a controller for coupling peripherals such as a monitor, keyboard, mouse, or printer, or, as needed, a touch screen or sensor, to the peripheral interface (11300). According to another aspect, the input/output peripherals may be coupled to the peripheral interface (11300) without going through the input/output subsystem (11400).

The power circuit (11500) may supply power to all or some of the components of the terminal. For example, the power circuit (11500) may include a power management system, one or more power sources such as a battery or alternating current (AC), a charging system, a power failure detection circuit, a power converter or inverter, a power status indicator, or any other components for generating, managing, and distributing power.

The communication circuit (11600) may enable communication with other computing devices using at least one external port.

Alternatively, as described above, the communication circuit (11600) may, where necessary, include RF circuitry and enable communication with other computing devices by transmitting and receiving RF signals, also known as electromagnetic signals.

The embodiment of FIG. 16 is only one example of the computing device (11000). The computing device (11000) may omit some of the components shown in FIG. 16, may further include additional components not shown in FIG. 16, or may have a configuration or arrangement that combines two or more components. For example, a computing device for a communication terminal in a mobile environment may further include a touch screen, sensors, and the like in addition to the components shown in FIG. 16, and the communication circuit (11600) may include circuitry for RF communication using various schemes (WiFi, 3G, LTE, Bluetooth, NFC, Zigbee, etc.). The components that may be included in the computing device (11000) may be implemented in hardware including one or more signal-processing or application-specific integrated circuits, in software, or in a combination of hardware and software.

The methods according to embodiments of the present invention may be implemented in the form of program instructions executable by various computing devices and recorded on a computer-readable medium. In particular, the program according to this embodiment may be a PC-based program or an application dedicated to mobile terminals. An application to which the present invention is applied may be installed on a user terminal through a file provided by a file distribution system. For example, the file distribution system may include a file transmission unit (not shown) that transmits the file in response to a request from the user terminal.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computing devices and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiment, and vice versa.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

An automated evaluated person analysis system, comprising:
a feature information generation unit that generates voice feature information and image feature information from input image information and voice information;
a speech state inference unit that generates, from information included in the image feature information, speech state information on whether the evaluated person is speaking; and
an analysis result inference unit that derives analysis result information on the evaluated person from information including the voice feature information, the image feature information, and the speech state information,
wherein the feature information generation unit comprises:
a voice feature information generation unit that generates the voice feature information; and
an image feature information generation unit that generates the image feature information,
wherein the image feature information generation unit comprises:
a first image feature information extraction unit that extracts, from the input image information, one or more elements of the face other than the mouth region; and
a second image feature information extraction unit that extracts elements of the mouth region from the input image information,
wherein the analysis result inference unit comprises:
a voice feature map generation unit that generates a voice feature map from the voice feature information;
a first image feature map generation unit that generates a first image feature map from the first image feature information;
a second image feature map generation unit that generates a second image feature map from the second image feature information; and
a composite feature map generation unit that assigns a weight to each of the voice feature map and the second image feature map based on the speech state information and generates a composite feature map from the voice feature map, the first image feature map, and the second image feature map,
wherein, when the probability value in the speech state information that the evaluated person is speaking is high, the composite feature map generation unit generates the composite feature map by relatively raising the weight of the voice feature map and relatively lowering the weight of the second image feature map, compared with when that probability value is low.

delete

The system according to claim 1,
wherein the speech state inference unit
generates the speech state information from face image information of the input image information or from the second image feature information.

delete

The system according to claim 1,
wherein the automated evaluated person analysis system
further comprises a context-aware parameter generation unit that generates, from the input voice information and the speech state information, a context-aware parameter including information on the evaluation situation of the evaluated person.

The system according to claim 1,
wherein the automated evaluated person analysis system
further comprises a context-aware parameter generation unit that generates, according to a user's input, a context-aware parameter including information on the evaluation situation of the evaluated person.

The system according to claim 6,
wherein the context-aware parameter generation unit comprises:
a situation determination unit that extracts context voice information, which is voice information of a person other than the evaluated person, based on whether voice is present in the voice information and on the speech state information.

The system according to claim 8,
wherein the context-aware parameter generation unit comprises:
a content feature map generation unit that generates, from the context voice information, a content feature map for the content of the speech; and
a non-content feature map generation unit that generates, from the context voice information, a non-content feature map for the characteristics of the speech.

delete

An automated evaluated person analysis method implemented in a computing device comprising at least one processor and at least one memory, the method comprising:
a feature information generation step of generating voice feature information and image feature information from input image information and voice information;
a speech state inference step of generating, from information included in the image feature information, speech state information on whether the evaluated person is speaking; and
an analysis result inference step of deriving analysis result information on the evaluated person from information including the voice feature information, the image feature information, and the speech state information,
wherein the feature information generation step comprises:
a voice feature information generation step of generating the voice feature information; and
an image feature information generation step of generating the image feature information,
wherein the image feature information generation step comprises:
a first image feature information extraction step of extracting, from the input image information, one or more elements of the face other than the mouth region; and
a second image feature information extraction step of extracting elements of the mouth region from the input image information,
wherein the analysis result inference step comprises:
a voice feature map generation step of generating a voice feature map from the voice feature information;
a first image feature map generation step of generating a first image feature map from the first image feature information;
a second image feature map generation step of generating a second image feature map from the second image feature information; and
a composite feature map generation step of assigning a weight to each of the voice feature map and the second image feature map based on the speech state information and generating a composite feature map from the voice feature map, the first image feature map, and the second image feature map,
wherein, when the probability value in the speech state information that the evaluated person is speaking is high, the composite feature map generation step generates the composite feature map by relatively raising the weight of the voice feature map and relatively lowering the weight of the second image feature map, compared with when that probability value is low.

delete

The method of claim 12,
wherein the speech state inference step
generates the speech state information from face image information of the input image information or from the second image feature information.

delete

The method of claim 12,
wherein the automated evaluated person analysis method
further comprises a context-aware parameter generation step of generating, from the input voice information and the speech state information, a context-aware parameter including information on the evaluation situation of the evaluated person.

The method of claim 12,
wherein the automated evaluated person analysis method
further comprises a context-aware parameter generation step of generating, according to a user's input, a context-aware parameter including information on the evaluation situation of the evaluated person.

The method of claim 17,
wherein the context-aware parameter generation step comprises:
a situation determination step of extracting context voice information, which is voice information of a person other than the evaluated person, based on whether voice is present in the voice information and on the speech state information.

The method of claim 19,
wherein the context-aware parameter generation step comprises:
a content feature map generation step of generating, from the context voice information, a content feature map for the content of the speech; and
a non-content feature map generation step of generating, from the context voice information, a non-content feature map for the characteristics of the speech.

delete

A computer-readable medium
storing instructions that cause a computing device to perform the following steps:
a feature information generation step of generating voice feature information and image feature information from input image information and voice information;
a speech state inference step of generating, from information included in the image feature information, speech state information on whether the evaluated person is speaking; and
an analysis result inference step of deriving analysis result information on the evaluated person from information including the voice feature information, the image feature information, and the speech state information,
wherein the feature information generation step comprises:
a voice feature information generation step of generating the voice feature information; and
an image feature information generation step of generating the image feature information,
wherein the image feature information generation step comprises:
a first image feature information extraction step of extracting, from the input image information, one or more elements of the face other than the mouth region; and
a second image feature information extraction step of extracting elements of the mouth region from the input image information,
wherein the analysis result inference step comprises:
a voice feature map generation step of generating a voice feature map from the voice feature information;
a first image feature map generation step of generating a first image feature map from the first image feature information;
a second image feature map generation step of generating a second image feature map from the second image feature information; and
a composite feature map generation step of assigning a weight to each of the voice feature map and the second image feature map based on the speech state information and generating a composite feature map from the voice feature map, the first image feature map, and the second image feature map,
wherein, when the probability value in the speech state information that the evaluated person is speaking is high, the composite feature map generation step generates the composite feature map by relatively raising the weight of the voice feature map and relatively lowering the weight of the second image feature map, compared with when that probability value is low.