KR20190133361A

KR20190133361A - An apparatus for data input based on user video, system and method thereof, computer readable storage medium

Info

Publication number: KR20190133361A
Application number: KR1020180058183A
Authority: KR
Inventors: 이재석
Original assignee: 카페24 주식회사
Priority date: 2018-05-23
Filing date: 2018-05-23
Publication date: 2019-12-03
Also published as: KR102114368B1

Abstract

Provided is a method for inputting information based on a user image. The method comprises the steps of: generating a plurality of training data sets each including mouth shape image information for a specific user and text information corresponding to the mouth shape image information; generating, based on the training data sets, an image recognition model that outputs the text information corresponding to inputted mouth shape image information; and determining, based on the image recognition model, input text information corresponding to an input image of the specific user. The method can therefore input information by recognizing the mouth shape of the specific user with improved accuracy.

Description

Information input device, method, system and computer readable storage medium based on user image {AN APPARATUS FOR DATA INPUT BASED ON USER VIDEO, SYSTEM AND METHOD THEREOF, COMPUTER READABLE STORAGE MEDIUM}

The present invention relates to information input and, more particularly, to an apparatus, a method, and a system for inputting information using an image of a user, and to a computer readable storage medium.

With the development of information and communication technology, a variety of computing devices have come into use, and demand is increasing for information input means other than the conventional keyboard and mouse. In particular, mobile devices such as smartphones and tablet PCs mainly rely on touching a keyboard displayed on a touch screen to input information, but some users find this kind of input inconvenient. Accordingly, in terms of mobility, methods of inputting information and controlling the functions of a mobile device by performing speech recognition have been actively developed in order to increase the freedom of the user's hands. However, the accuracy of such speech recognition may drop significantly depending on the presence of ambient noise, and speech recognition cannot be used in an environment in which it is difficult for the user to speak aloud.

Korean Laid-Open Patent Publication No. 2001-0012024 ("Voice Recognition Portal Service System and Method Using Multi-Level Voice Recognition", KT Corporation)

The accuracy of speech recognition may drop significantly depending on the presence of ambient noise, and speech recognition cannot be used in an environment in which it is difficult for the user to speak aloud.

As a means of replacing speech recognition, a method of inputting information by recognizing the shape of a user's mouth may be considered. However, because mouth shapes differ from user to user, a way of improving the accuracy of mouth shape recognition is required.

An object of the present invention for solving the above problems is to provide an information input method capable of inputting information by recognizing the mouth shape of a specific user with improved accuracy, by generating an image recognition model based on training data that includes mouth shape image information for the specific user and text information corresponding thereto.

Another object of the present invention for solving the above problems is to provide an information input apparatus capable of inputting information by recognizing the mouth shape of a specific user with improved accuracy, by generating an image recognition model based on training data that includes mouth shape image information for the specific user and text information corresponding thereto.

However, the problems to be solved by the present invention are not limited thereto, and may be variously extended without departing from the spirit and scope of the present invention.

To achieve the above object, an information input method based on a user image according to an embodiment of the present invention may include: generating a plurality of training data sets each including mouth shape image information for a specific user and text information corresponding to the mouth shape image information; generating, based on the training data sets, an image recognition model that outputs text information corresponding to input mouth shape image information; and determining, based on the image recognition model, input text information corresponding to an input image of the specific user.
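The three steps of the method can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the summary does not fix a model architecture, so a nearest-neighbor lookup stands in for the image recognition model, and the names `TrainingDataSet`, `NearestNeighborLipModel`, `build_model`, and `determine_input_text`, as well as the representation of mouth shape image information as a short feature vector, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumption: mouth shape image information is reduced to a fixed-length feature vector.
Features = Tuple[float, ...]


@dataclass(frozen=True)
class TrainingDataSet:
    """One training data set: mouth shape image information plus its text."""
    mouth_shape: Features
    text: str


class NearestNeighborLipModel:
    """Stand-in for the image recognition model (hypothetical, for illustration)."""

    def __init__(self, samples: Sequence[TrainingDataSet]) -> None:
        if not samples:
            raise ValueError("at least one training data set is required")
        self._samples: List[TrainingDataSet] = list(samples)

    def predict(self, mouth_shape: Features) -> str:
        # Return the text of the stored sample closest to the input (squared Euclidean distance).
        def squared_distance(sample: TrainingDataSet) -> float:
            return sum((a - b) ** 2 for a, b in zip(sample.mouth_shape, mouth_shape))

        return min(self._samples, key=squared_distance).text


def build_model(samples: Sequence[TrainingDataSet]) -> NearestNeighborLipModel:
    """Step 2: generate a user-specific model from that user's training data sets."""
    return NearestNeighborLipModel(samples)


def determine_input_text(model: NearestNeighborLipModel, input_mouth_shape: Features) -> str:
    """Step 3: determine the input text for the mouth shape found in an input image."""
    return model.predict(input_mouth_shape)


# Step 1: training data sets collected for one specific user (made-up values).
samples = [
    TrainingDataSet((0.0, 0.0), "hello"),
    TrainingDataSet((1.0, 1.0), "call"),
]
model = build_model(samples)
recognized = determine_input_text(model, (0.9, 1.1))
```

Because the model is built only from one user's samples, it reflects that user's particular mouth shapes, which is the point the method relies on for improved accuracy.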

According to an aspect, determining the input text information may include: receiving an input image of the specific user acquired by an image input unit; and determining, based on the image recognition model, text information corresponding to mouth shape image information included in the input image as the input text information.

According to an aspect, the training data sets may include i) mouth shape image information included in video call data of the specific user and ii) speech recognition text information, which is a speech recognition result for the voice corresponding to the mouth shape image information.

According to an aspect, generating the training data sets may include: obtaining video call data of the specific user, the video call data including a call video and a call voice; associating, based on time information, first mouth shape image information that is at least a portion of the call video with first voice information that is at least a portion of the call voice; obtaining, based on a speech recognition model, first speech recognition text information that is text information corresponding to the first voice information; and storing the first mouth shape image information and the first speech recognition text information as a first training data set.
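The video-call-based steps can be sketched as follows. This is a hedged illustration rather than the disclosed implementation: the `VideoFrame` and `AudioChunk` types, the function `make_call_training_set`, and the `recognize_speech` callable (a placeholder for whatever speech recognition model is used) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class VideoFrame:
    t: float            # capture time in seconds
    mouth_image: bytes  # placeholder for the mouth region of the frame


@dataclass(frozen=True)
class AudioChunk:
    t: float            # capture time in seconds
    samples: bytes      # placeholder for raw audio


# Hypothetical signature for the speech recognition model.
SpeechRecognizer = Callable[[Sequence[AudioChunk]], str]


def make_call_training_set(
    call_video: Sequence[VideoFrame],
    call_voice: Sequence[AudioChunk],
    start: float,
    end: float,
    recognize_speech: SpeechRecognizer,
) -> Tuple[List[VideoFrame], str]:
    """Pair the video and audio of one time segment and label the video with recognized text."""
    # Associate video and voice through time information: same [start, end) window.
    first_mouth_shape = [f for f in call_video if start <= f.t < end]
    first_voice = [c for c in call_voice if start <= c.t < end]
    # Obtain the text for the voice segment from the speech recognition model.
    first_text = recognize_speech(first_voice)
    # The returned pair is what would be stored as the first training data set.
    return first_mouth_shape, first_text
```

In this sketch the speech recognizer acts as an automatic labeler, so ordinary video calls can yield labeled mouth shape data without manual transcription.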

According to an aspect, the time length of the first mouth shape image information and the first voice information may be set to a predetermined time length.
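A fixed, predetermined segment length can be sketched as below. The function name `fixed_length_segments` and the use of integer milliseconds (chosen to avoid floating-point drift at segment boundaries) are assumptions for illustration.

```python
from typing import List, Tuple


def fixed_length_segments(total_ms: int, segment_ms: int) -> List[Tuple[int, int]]:
    """Split a recording of total_ms milliseconds into windows of segment_ms milliseconds.

    The final window is shortened when the recording length is not an exact multiple.
    """
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    return [
        (start, min(start + segment_ms, total_ms))
        for start in range(0, total_ms, segment_ms)
    ]


# A 10-second call cut into 4-second segments.
segments = fixed_length_segments(10_000, 4_000)
```

Each resulting window would then be applied to both the call video and the call voice, so the two stay aligned.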

According to an aspect, the first mouth shape image information and the first voice information may have a time length extending from a first time point to a second time point, where the first time point indicates a time point at which the call voice is at or below a predetermined threshold magnitude, and the second time point indicates a time point, subsequent to the first time point, at which the call voice is at or below the predetermined threshold magnitude.
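The threshold-based boundary rule can be sketched as follows, assuming the call voice is available as a sequence of (time, magnitude) samples. The function name `silence_bounded_segments` and the additional requirement that a segment contain at least one sample above the threshold are illustrative assumptions, not details taken from the text.

```python
from typing import List, Optional, Sequence, Tuple


def silence_bounded_segments(
    envelope: Sequence[Tuple[float, float]],  # (time in seconds, voice magnitude)
    threshold: float,
) -> List[Tuple[float, float]]:
    """Return (first time point, second time point) pairs bounded by quiet samples."""
    segments: List[Tuple[float, float]] = []
    start: Optional[float] = None  # time of the most recent quiet sample
    saw_speech = False             # whether anything louder followed that quiet sample
    for t, magnitude in envelope:
        if magnitude <= threshold:
            # A quiet sample closes the open segment only if speech occurred in between.
            if start is not None and saw_speech:
                segments.append((start, t))
            start = t
            saw_speech = False
        elif start is not None:
            saw_speech = True
    return segments


envelope = [
    (0.0, 0.00), (0.5, 0.80), (1.0, 0.90), (1.5, 0.05),
    (2.0, 0.02), (2.5, 0.70), (3.0, 0.00),
]
segments = silence_bounded_segments(envelope, threshold=0.1)
```

With this rule, pauses in speech serve as natural segment boundaries, so a segment tends to cover one utterance.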

According to an aspect, the first mouth shape image information and the first voice information may have a time length extending from a first time point to a second time point, where the first time point indicates a time point at which the mouth shape image information of the specific user included in the call video matches predetermined first triggering mouth shape information, and the second time point indicates a time point, subsequent to the first time point, at which the mouth shape image information of the specific user matches the predetermined first triggering mouth shape information.

According to an aspect, the first mouth shape image information and the first voice information may have a time length extending from a first time point to a second time point, where the first time point indicates a time point at which the mouth shape image information of the specific user included in the call video matches predetermined first triggering mouth shape information, and the second time point indicates a time point at which the mouth shape image information of the specific user included in the call video matches predetermined second triggering mouth shape information.
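Both trigger-based variants (the same triggering mouth shape at both ends, or distinct first and second triggering mouth shapes) can be sketched with one function. Several points are assumptions for illustration: mouth shapes are represented as string labels produced by some upstream detector, each trigger occurrence appears as a single entry, and a closing trigger does not also open the next segment. The name `trigger_bounded_segments` is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def trigger_bounded_segments(
    shapes: Sequence[Tuple[float, str]],  # (time in seconds, detected mouth shape label)
    first_trigger: str,
    second_trigger: Optional[str] = None,
) -> List[Tuple[float, float]]:
    """Return segments that open at first_trigger and close at second_trigger.

    When second_trigger is omitted, the first trigger is used to close the segment as well.
    """
    closing = first_trigger if second_trigger is None else second_trigger
    segments: List[Tuple[float, float]] = []
    start: Optional[float] = None
    for t, shape in shapes:
        if start is None:
            if shape == first_trigger:
                start = t  # first time point
        elif shape == closing:
            segments.append((start, t))  # second time point
            start = None
    return segments


shapes = [
    (0.0, "neutral"), (0.5, "open_wide"), (1.0, "talking"), (2.0, "open_wide"),
    (2.5, "talking"), (3.0, "open_wide"), (3.5, "talking"), (4.0, "pursed"),
]
same_trigger = trigger_bounded_segments(shapes, "open_wide")
two_triggers = trigger_bounded_segments(shapes, "open_wide", "pursed")
```

An open segment that never meets its closing trigger is dropped in this sketch, as with the segment opened at 3.0 seconds in the single-trigger case.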

According to an aspect, the training data sets may include i) example text information and ii) mouth shape image information of the specific user reading the example text information.

According to an aspect, generating the training data sets may include: displaying example text information; obtaining a reading video, which is a video of the specific user captured while the specific user reads the example text information; and storing second mouth shape image information included in the reading video and the example text information as a second training data set.
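The display, capture, and store sequence can be sketched with the device-specific parts passed in as callables. The function `collect_reading_sample` and its parameters `display`, `capture_video`, and `extract_mouth_shape` are hypothetical placeholders for the display unit, the image input unit, and a mouth-region extraction step.

```python
from typing import Callable, Tuple


def collect_reading_sample(
    example_text: str,
    display: Callable[[str], None],
    capture_video: Callable[[], bytes],
    extract_mouth_shape: Callable[[bytes], bytes],
) -> Tuple[bytes, str]:
    """Show example text, record the user reading it, and return (mouth shape, text)."""
    display(example_text)            # display the example text information
    reading_video = capture_video()  # video of the user while reading
    # The displayed text itself is the label, so no speech recognition is needed here.
    return extract_mouth_shape(reading_video), example_text
```

Since the label is the text the user was asked to read, this variant can also work when the user only mouths the words silently.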

According to an aspect, generating the training data sets may include: displaying a plurality of pieces of example text information; obtaining, while the specific user reads the example text information aloud, a reading video that is a video of the specific user and a reading voice that is the voice of the specific user; associating, based on time information, third mouth shape image information that is at least a portion of the reading video with third voice information that is at least a portion of the reading voice; obtaining, based on a speech recognition model, third speech recognition text information that is text information corresponding to the third voice information; and, in response to determining that the third speech recognition text information is identical to third example text information that is any one of the plurality of pieces of example text information, storing the third mouth shape image information and the third example text information as a third training data set.
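The final check, which stores a sample only when the recognized text equals one of the displayed examples, can be sketched as below. The names `normalize` and `make_validated_training_set` are hypothetical, and comparing after case and whitespace normalization is an assumption; the text only states that the two must be determined to be identical.

```python
from typing import Optional, Sequence, Tuple


def normalize(text: str) -> str:
    """Assumed comparison rule: ignore letter case and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def make_validated_training_set(
    mouth_shape: str,               # placeholder for the third mouth shape image information
    recognized_text: str,           # third speech recognition text information
    example_texts: Sequence[str],   # the displayed example text information
) -> Optional[Tuple[str, str]]:
    """Return (mouth shape, example text) if the recognized text matches an example, else None."""
    recognized = normalize(recognized_text)
    for example in example_texts:
        if normalize(example) == recognized:
            return mouth_shape, example
    return None  # no match: the segment is not stored as training data


examples = ["Call mom", "Open the camera"]
stored = make_validated_training_set("clip-1", "call  mom", examples)
rejected = make_validated_training_set("clip-2", "call tom", examples)
```

The match acts as a filter against misread sentences and speech recognition errors, so only segments whose label is confirmed end up in the training data.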

To solve the above problems, an information input apparatus based on a user image according to another embodiment of the present invention includes: an image input unit configured to obtain image information; a voice input unit configured to obtain voice information; a memory configured to store image information, voice information, and text information; and a processor. The processor may be configured to: generate a plurality of training data sets each including mouth shape image information for a specific user and text information corresponding to the mouth shape image information; generate, based on the training data sets, an image recognition model that outputs text information corresponding to input mouth shape image information; and determine, based on the image recognition model, input text information corresponding to an input image of the specific user.

According to an aspect, determining the input text information may include the processor: receiving an input image of the specific user acquired by the image input unit; and determining, based on the image recognition model, text information corresponding to mouth shape image information included in the input image as the input text information.

According to an aspect, generating the training data sets may include the processor: obtaining video call data of the specific user, the video call data including a call video and a call voice; associating, based on time information, first mouth shape image information that is at least a portion of the call video with first voice information that is at least a portion of the call voice; obtaining, based on a speech recognition model, first speech recognition text information that is text information corresponding to the first voice information; and storing the first mouth shape image information and the first speech recognition text information in the storage unit as a first training data set.

According to an aspect, the first mouth shape image information and the first voice information may have a predetermined time length.

According to an aspect, the first mouth shape image information and the first voice information may have a time length extending from a first time point to a second time point, where the first time point indicates a time point at which the mouth shape image information of the specific user included in the call video matches predetermined first triggering mouth shape information, and the second time point indicates a time point, subsequent to the first time point, at which the mouth shape image information of the specific user matches the predetermined first triggering mouth shape information.

According to an aspect, generating the training data sets may include the processor: displaying example text information on a display unit included in the information input apparatus based on the user image; obtaining, using the image input unit, a reading video that is a video of the specific user captured while the specific user reads the example text information; and storing second mouth shape image information included in the reading video and the example text information in the storage unit as a second training data set.

According to an aspect, generating the training data sets may include the processor: displaying a plurality of pieces of example text information on a display unit included in the information input apparatus based on the user image; obtaining, using the image input unit and the voice input unit, a reading video that is a video of the specific user and a reading voice that is the voice of the specific user while the specific user reads the example text information aloud; associating, based on time information, third mouth shape image information that is at least a portion of the reading video with third voice information that is at least a portion of the reading voice; obtaining, based on a speech recognition model, third speech recognition text information that is text information corresponding to the third voice information; associating the third speech recognition text information with third example text information that is any one of the plurality of pieces of example text information; and, in response to determining that the third speech recognition text information is identical to the third example text information, storing the third mouth shape image information and the third example text information in the storage unit as a third training data set.

To solve the above problems, an information input system based on a user image according to another embodiment of the present invention may include: a server configured to obtain a plurality of training data sets each including mouth shape image information for a specific user and text information corresponding to the mouth shape image information, and to generate, based on the training data sets, an image recognition model that outputs text information corresponding to input mouth shape image information; and a terminal configured to acquire at least one of image information and voice information of the specific user, and to determine, based on the image recognition model, input text information corresponding to an input image of the specific user.

To solve the above problems, a computer readable storage medium according to another embodiment of the present invention includes instructions executable by a processor. The instructions, when executed by the processor, may be configured to: generate a plurality of training data sets each including mouth shape image information for a specific user and text information corresponding to the mouth shape image information; generate, based on the training data sets, an image recognition model that outputs text information corresponding to input mouth shape image information; and determine, based on the image recognition model, input text information corresponding to an input image of the specific user.

The disclosed technology may have the following effects. However, this does not mean that a specific embodiment must include all of the following effects or only the following effects, and the scope of the disclosed technology should not be understood as being limited thereby.

According to the information input method and apparatus based on a user image according to the embodiment of the present invention described above, an image recognition model is generated based on training data that includes mouth shape image information for a specific user and text information corresponding thereto, so that information can be input by recognizing the mouth shape of the specific user with improved accuracy.

In addition, because information is input by recognizing the shape of the user's mouth, information can be input easily, and the functions of a mobile device can be controlled on that basis, even in an environment with noise around the user or in a quiet environment in which the user cannot speak aloud.

FIG. 1 is a block diagram illustrating the configuration of an information input apparatus based on a user image according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram of recognition models implemented on the processor of FIG. 1.
FIG. 3 is a conceptual diagram of the image recognition model and the speech recognition model of FIG. 2.
FIG. 4 is an exemplary diagram of information stored in the memory of FIG. 2.
FIG. 5 is a conceptual diagram of an information input system based on a user image according to an embodiment of the present invention.
FIG. 6 is a conceptual diagram of training data set generation based on video call data.
FIG. 7 is a flowchart of an information input method based on a user image according to an embodiment of the present invention.
FIG. 8 is a detailed flowchart of the input text information determination step of FIG. 7.
FIG. 9 is a detailed flowchart of training data set generation based on video call data.
FIG. 10 is an exemplary diagram of a method for setting the time length of a segment according to a preset value.
FIG. 11 is an exemplary diagram of a method for setting the time length of a segment according to a threshold magnitude of the voice.
FIG. 12 is an exemplary diagram of a method for setting the time length of a segment according to a triggering mouth shape.
FIG. 13 is a conceptual diagram of a first embodiment of training data set generation based on example text information.
FIG. 14 is a conceptual diagram of a second embodiment of training data set generation based on example text information.
FIG. 15 is a flowchart of the first embodiment of training data set generation based on example text information.
FIG. 16 is a flowchart of the second embodiment of training data set generation based on example text information.

Because the present invention may be modified in various ways and may have several embodiments, specific embodiments are illustrated in the drawings and described in detail.

However, this is not intended to limit the present invention to specific embodiments, and the present invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the accompanying drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted.

Information input based on user image

As discussed above, the accuracy of speech recognition may drop significantly depending on the presence of ambient noise, and speech recognition cannot be used in an environment in which it is difficult for the user to speak aloud. As a means of replacing speech recognition, a method of inputting information by recognizing the shape of the user's mouth may be considered; however, because mouth shapes differ from person to person, a way of improving recognition accuracy relative to speech recognition is required. Information input based on a user image according to an embodiment of the present invention addresses this problem and can improve the accuracy of mouth shape recognition by generating a mouth shape recognition model specialized for a specific user. More specifically, information input based on a user image according to an embodiment of the present invention acquires at least a certain amount of training data including mouth shape image information for a specific user and text information corresponding thereto, and generates an image recognition model based on the acquired training data sets for the specific user, so that information can be input by recognizing the mouth shape of the specific user with improved accuracy.

FIG. 1 is a block diagram showing the configuration of an information input apparatus based on a user image according to an embodiment of the present invention, FIG. 7 is a flowchart of an information input method based on a user image according to an embodiment of the present invention, and FIG. 8 is a detailed flowchart of the input text information determination step of FIG. 7. Hereinafter, the information input method and apparatus based on a user image according to an embodiment of the present invention will be described in more detail with reference to FIGS. 1, 7, and 8.

First, as shown in FIG. 1, the information input apparatus 100 based on a user image according to an embodiment of the present invention may include an image input unit 110, a voice input unit 120, a processor 130, a display unit 140, a memory 150, and a communication unit 160.

The image input unit 110 is configured to acquire image information of the user of the information input apparatus 100 based on a user image, and may be, for example, a camera device. The voice input unit 120 is configured to acquire voice information of the user of the information input apparatus 100 based on a user image, and may be, for example, a microphone device.

The processor 130 may be configured to perform data processing for the information input procedure based on a user image according to an embodiment of the present invention. FIG. 2 is an exemplary diagram of the recognition models implemented on the processor of FIG. 1, and FIG. 3 is a conceptual diagram of the image recognition model and the voice recognition model of FIG. 2. As shown in FIG. 2, the processor 130 of the information input apparatus 100 based on a user image according to an embodiment of the present invention may include an image recognition unit 131 and a voice recognition unit 132. The image recognition unit 131 and the voice recognition unit 132 may exist as logical modules within a single processor, may each exist as a logical module on a separate processor, or may be included as an image recognition module or a voice recognition module implemented in hardware.

Meanwhile, the image recognition model 132 may be executed on the image recognition unit 132. As shown in FIG. 3, the image recognition model 132 may be a logical model that, when mouth shape image information for a specific user is input, outputs the text information corresponding to it. According to one aspect, the image recognition model 132 may be a model generated by training an artificial neural network (ANN) on a plurality of training data sets, each including mouth shape image information for the specific user and the corresponding text information. Also according to one aspect, the image recognition model 132 may include a mouth position identification model 133 and a mouth shape identification model 134, as shown in FIG. 2. For example, the mouth position identification model 133 may be configured to extract the region of the input image data in which the user's mouth is located, and the mouth shape identification model 134 may be configured to output the corresponding text information based on the image information for the mouth region output by the mouth position identification model 133. According to one aspect, the mouth position identification model 133 may be a known artificial intelligence algorithm model, or may be the result of training an artificial intelligence model, according to an embodiment of the present invention, using images of the specific user and the mouth regions in them as training data sets.
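As a minimal, non-limiting sketch of this two-stage arrangement, the Python fragment below composes a mouth position stage and a mouth shape stage into a single model. The names `ImageRecognitionModel`, `locate_mouth`, `read_mouth` and `crop`, the bounding-box format, and the list-of-rows frame representation are assumptions made for illustration only; the two callables merely stand in for whatever trained models play the roles of the mouth position identification model 133 and the mouth shape identification model 134.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = List[List[int]]          # assumed: grayscale frame as rows of pixel values
Box = Tuple[int, int, int, int]  # assumed: (top, left, bottom, right), end-exclusive


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the mouth region out of a full frame."""
    top, left, bottom, right = box
    return [row[left:right] for row in frame[top:bottom]]


@dataclass
class ImageRecognitionModel:
    locate_mouth: Callable[[Frame], Box]          # plays the role of model 133
    read_mouth: Callable[[Sequence[Frame]], str]  # plays the role of model 134

    def predict(self, frames: Sequence[Frame]) -> str:
        # Stage 1: extract the mouth region from every input frame.
        mouth_crops = [crop(f, self.locate_mouth(f)) for f in frames]
        # Stage 2: map the sequence of mouth regions to text.
        return self.read_mouth(mouth_crops)


# Stub stages for demonstration only (not trained models).
model = ImageRecognitionModel(
    locate_mouth=lambda frame: (1, 1, 3, 3),
    read_mouth=lambda crops: "frames:%d" % len(crops),
)
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
result = model.predict([frame, frame])
```

Keeping the two stages behind separate callables reflects the option described above of using a known model for mouth localization while training only the mouth shape stage on the specific user's data.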

Referring back to FIG. 2, the voice recognition model 136 may be executed on the voice recognition unit 135. As shown in FIG. 3, the voice recognition model 136 may be a logical model that, when voice information for a specific user is input, outputs the text information corresponding to it. According to one aspect, the voice recognition model 136 may be a model generated by training an artificial neural network on a plurality of training data sets, each including voice information and the corresponding text information. Alternatively, any one of the known voice recognition models that convert voice information into text may be selected and applied as the voice recognition model 136.

FIG. 4 is an exemplary diagram of the information stored in the memory 150 of FIG. 2. Referring to FIG. 4, the memory 150 included in the information input apparatus 100 based on a user image according to an embodiment of the present invention may store, for example, at least one of a plurality of training data sets 410, video call data 420 for a specific user, and example text information 430. The training data sets 410 may be used to train the image recognition model 132 and, as described in detail later in this specification, may be obtained, for example, by processing the video call data 420 or by a data acquisition and processing procedure that uses the example text information 430.

Referring back to FIG. 2, the display unit 140 may be included in the information input apparatus based on a user image and configured to display at least one of image and text information, and the communication unit 160 may be configured to allow the information input apparatus 100 based on a user image to exchange data with external devices. The communication unit 160 may include, for example, at least one of a long-range wireless communication module for connecting to a cellular network, a long-range wireless communication module for communicating with a server, a wireless communication module for short-range communication with nearby devices, and a wired communication module.

The information input apparatus 100 based on a user image according to an embodiment of the present invention may be included in a mobile device such as a smartphone or a tablet PC, or may be implemented as the mobile device itself.

As shown in FIG. 7, the information input method based on a user image according to an embodiment of the present invention may first generate a plurality of training data sets, each including mouth shape image information for a specific user and text information corresponding to that mouth shape image information (step 710). As described above, mouth shapes differ from user to user, and even when the same word is pronounced, different mouth shapes may appear depending on an individual's pronunciation characteristics or habits. Consequently, if an image recognition model is trained on mouth shape data from many users without distinguishing the specific user, the accuracy of text conversion by mouth shape recognition inevitably decreases.

Therefore, according to an embodiment of the present invention, the accuracy of mouth shape recognition can be improved by generating, for the specific user whose mouth shape is to be analyzed, a plurality of training data items that each include mouth shape image information containing a mouth shape image of that user and text information representing the corresponding character string, and by training the image recognition model on them. Referring to FIG. 4, the generated training data sets 410 may be stored in the memory 150 and may include a first training data set 410-1 through an n-th training data set 410-n. The first training data set 410-1 may include first mouth shape image information 411-1 and first text information 412-1 containing the corresponding character string, and the n-th training data set 410-n may include n-th mouth shape image information 411-n and n-th text information 412-n containing the corresponding character string. According to one embodiment, the mouth shape image information may be contained in image data acquired through the image input unit 110 of the information input apparatus 100 based on a user image.
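The structure of the training data sets 410 can be pictured with a short Python sketch. It is illustrative only: the class name `TrainingDataSet`, the helper `build_training_sets`, and the use of frame identifiers in place of real image data are assumptions, not part of the described embodiment.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TrainingDataSet:
    mouth_frames: Tuple[str, ...]  # mouth shape image information (411-k); frame ids here
    text: str                      # corresponding text information (412-k)


def build_training_sets(
    clips: Sequence[Sequence[str]], texts: Sequence[str]
) -> List[TrainingDataSet]:
    """Pair the k-th mouth clip of one user with the k-th text string."""
    if len(clips) != len(texts):
        raise ValueError("each clip needs exactly one text label")
    return [TrainingDataSet(tuple(c), t) for c, t in zip(clips, texts)]


# Two hypothetical sets for a single user, mirroring 410-1 and 410-2.
sets_410 = build_training_sets(
    clips=[["f0", "f1"], ["f2", "f3", "f4"]],
    texts=["hello", "thank you"],
)
```

All sets in one collection belong to the same user, which is what allows a model trained on them to specialize to that user's mouth shapes.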

Referring again to FIG. 7, an image recognition model capable of outputting text information corresponding to input mouth shape image information may be generated based on the training data sets generated above. As described above with reference to FIG. 3, the image recognition model 132 may be a software module that, in response to the input of mouth shape image information of a specific user, outputs text information containing the corresponding character string, and may be generated by training an artificial neural network with the plurality of training data sets.

Once the training data sets have been generated and the image recognition model has been trained on them, mouth shape recognition can be performed on the current input image of the specific user. As shown in FIG. 7, determining the input text information corresponding to the input image of the specific user based on the image recognition model (step 730) makes it possible to input text information using mouth shape recognition for that user. As shown in more detail in FIG. 8, determining the input text information (step 730) may include, for example, receiving an input image of the specific user captured by the image input unit 110 included in the information input apparatus 100 based on a user image (step 731), and determining, based on the previously generated image recognition model, the text information corresponding to the mouth shape image information contained in the received input image as the input text information (step 733). According to one aspect, the image recognition model 132 running on the processor 130 included in the information input apparatus 100 based on a user image may be used, and it may be configured to extract the image containing the mouth shape from the input image through the mouth position identification model 133 and to output the corresponding text information by feeding that image into the mouth shape identification model 134.

According to the information input apparatus 100 based on a user image according to an embodiment of the present invention, the generation of the training data sets (step 710), the generation of the image recognition model (step 720), and the determination of the input text information corresponding to the input image (step 730) described above may be performed by the processor 130. Meanwhile, FIG. 5 is a conceptual diagram of an information input system based on a user image according to an embodiment of the present invention. As shown in FIG. 5, in an information input system made up of a plurality of entities, for example a terminal 100 and a server 200 that can communicate with the terminal 100 over a network, steps 710, 720, and 730 may be performed jointly by the terminal 100 (or the processor 130 included in the terminal) and the server 200. For example, the terminal 100 may include the image input unit 110 and the voice input unit 120 so as to acquire image information and/or voice information for a specific user, and may be configured to use the communication unit 160 to transmit to the server 200 either the image information and/or voice information itself or a plurality of training data sets produced by processing that information. The server 200 may be configured to obtain a plurality of training data sets, each including mouth shape image information for the specific user and text information corresponding to the mouth shape image information, and to generate, based on the training data sets, an image recognition model that outputs text information corresponding to input mouth shape image information. The generated image recognition model may be stored in at least one of the server 200 and the memory 150 of the terminal 100. When performing mouth shape recognition for the specific user, the terminal 100 may transmit the input image information of the specific user to the server 200 and receive the corresponding text information obtained using the image recognition model stored on the server 200, or may obtain the corresponding text information using the image recognition model stored in the memory 150 of the terminal 100. From the perspective of the terminal 100, the steps of generating the training data sets (step 710), generating the image recognition model (step 720), and determining the input text information (step 730) according to an embodiment of the present invention should be understood to include not only the processor 130 of the terminal 100 performing those steps directly, but also transmitting and/or receiving the relevant data to and/or from an external device such as the server 200 in order to perform them.

Furthermore, from the perspective of a computing device such as the server 200, receiving training data sets, and/or video and audio material for generating training data sets, from a mobile device such as a smartphone or a tablet PC to obtain the training data sets, and generating, based on them, an image recognition model that outputs text information corresponding to input mouth shape data

Generating training data sets based on video call data

As discussed above, improving the accuracy of mouth shape recognition for a specific user requires generating the image recognition model from training data sets for that user. This in turn calls for a way of securing enough training data sets, each including mouth shape image information of the specific user and the corresponding text information, to make training of the image recognition model possible.

FIG. 6 is a conceptual diagram of training data set generation based on video call data. As shown in FIG. 6, the information input apparatus 100 based on a user image according to an embodiment of the present invention may be a mobile device used by a specific user, and may include an image input unit 610, a voice input unit 620, and a display unit 640. The specific user of the mobile device may hold a video call with any other person. During the video call, the image of the other party is displayed in the counterpart region 643 of the display unit 640, and the image of the specific user acquired through the image input unit 110 of the mobile device may be displayed in the self region 641 of the mobile device. The voice of the specific user during the video call may be acquired through the voice input unit 620, and the image information and voice information of the specific user may be transmitted to the other party's device through the communication unit.

Thus, when the specific user holds a video call on the mobile device, mouth shape image information captured while the user is speaking and the corresponding voice information are obtained naturally. Accordingly, during a video call the mobile device may transmit the image and voice information of its specific user to the other party's device and also store that information as video call data. According to one aspect, the video call data 420 may be stored in the memory 150 of the information input apparatus 100 based on a user image, and the video call data 420 may include call video data 421, which is image data containing the user's mouth shape information, and call voice data 423, which contains information about the user's voice.

According to an embodiment of the present invention, training data sets for generating the image recognition model may be generated from the video call data 420. Such training data sets may include i) mouth shape image information contained in the video call data of the specific user and ii) voice recognition text information, which is the voice recognition result for the voice corresponding to that mouth shape image information. That is, because the image information and the voice information of the user contained in the video call data may each carry time information, the image information and the voice information can be matched to each other based on that time information; and because the voice information can be converted into text information by a given voice recognition model, the image information for the specific user (or the mouth shape image information contained in it) can ultimately be matched to the text information. Mutually corresponding mouth shape image information and text information can therefore be obtained and stored as one training data set.
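The timestamp-based matching described above can be sketched as follows. This is a simplified illustration under stated assumptions: frames and audio segments are represented by identifiers with capture times in seconds, and `transcribe` is a hypothetical stand-in for any speech-to-text model rather than a real library call.

```python
from typing import Callable, List, Sequence, Tuple

TimedFrame = Tuple[float, str]           # (capture time in seconds, frame id)
AudioSegment = Tuple[float, float, str]  # (start time, end time, audio clip id)


def make_training_pairs(
    frames: Sequence[TimedFrame],
    audio_segments: Sequence[AudioSegment],
    transcribe: Callable[[str], str],
) -> List[Tuple[List[str], str]]:
    """Match video frames to each audio segment by time, then label them with text."""
    pairs = []
    for start, end, audio in audio_segments:
        text = transcribe(audio)  # voice information -> text information
        clip = [frame for t, frame in frames if start <= t < end]
        if text and clip:         # skip segments with no speech or no frames
            pairs.append((clip, text))
    return pairs


frames = [(0.0, "f0"), (0.5, "f1"), (1.0, "f2"), (1.5, "f3")]
audio_segments = [(0.0, 1.0, "a0"), (1.0, 2.0, "a1")]
fake_transcripts = {"a0": "hello", "a1": ""}  # hypothetical recognizer output
pairs = make_training_pairs(frames, audio_segments, lambda a: fake_transcripts[a])
```

In this sketch a segment whose transcription is empty is dropped, so silent stretches of a call do not become training data; that filtering rule is a design assumption, not something the description mandates.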

Because video calls are made to meet the user's own needs, they allow data to be collected naturally. In addition, video calls are usually made in environments with little ambient noise so that both parties can hear each other, so voice recognition can be performed on the voice information in video call data with high accuracy, yielding accurate text information. This makes it possible to generate highly accurate training data sets for the specific user.

FIG. 9 is a detailed flowchart of training data set generation based on video call data. As shown in FIG. 9, according to the video call data based training data set generation procedure of an embodiment of the present invention, video call data of the specific user of the mobile device, which includes a call video and a call voice, may first be acquired (step 711). As discussed above, the mobile device has an image input unit and a voice input unit, so it can acquire the call video and call voice of the specific user during a video call and store them, for example in the memory 150, as video call data 420 including call video data 421 and call voice data 423.

Next, first mouth shape image information, which is at least part of the call video, may be matched to first voice information, which is at least part of the call voice, based on the time information (step 713). The acquired call video and call voice may be divided into segments of a given time length and processed segment by segment. For example, the first mouth shape image information may be any one of the segments making up the call video, and the first voice information may be any one of the segments making up the call voice. According to one aspect, an image extracted by detecting the mouth position in the call video may be used as the first mouth shape image information. Because the call video and the call voice may carry the time information from when they were acquired, the mouth shape image information and the voice information can still be matched using that time information after being divided into segments. According to one aspect, the first mouth shape image information may be a single frame of the call video; according to another aspect, it may include a plurality of frames.

Meanwhile, a criterion may be set for dividing the call video data and/or the call voice data into segments. FIG. 10 is an exemplary diagram of a method of setting the segment time length by a preset value. As shown in FIG. 10, the call voice and the call video contained in the video call data each exist as a time series carrying time information. According to one aspect, the time length of the first mouth shape image information, which may be any one of the segments of the call video usable for generating training data, and of the first voice information, which may be any one of the segments of the call voice, may be fixed at a predetermined time length (t). Accordingly, as shown in FIG. 10, all voice segments (10a-1, 10a-2, 10a-3, 10a-4, 10a-5, 10a-6, 10a-7, 10a-8) are set to the same time length and may be designated as the first through eighth voice information, respectively. Likewise, all video segments (10b-1, 10b-2, 10b-3, 10b-4, 10b-5, 10b-6, 10b-7, 10b-8) are set to the same time length and may be designated as the first through eighth mouth shape image information, respectively.
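A minimal sketch of the fixed-length scheme of FIG. 10 is given below. The function name `fixed_length_segments` and the use of integer milliseconds are assumptions for illustration; a trailing remainder shorter than the preset length is dropped here so that every segment has the same length, which is one possible reading of the description.

```python
from typing import List, Tuple


def fixed_length_segments(duration_ms: int, segment_ms: int) -> List[Tuple[int, int]]:
    """Split [0, duration_ms) into equal segments of segment_ms (end-exclusive)."""
    if segment_ms <= 0:
        raise ValueError("segment_ms must be positive")
    # Stop early enough that the last segment is still full length.
    starts = range(0, duration_ms - segment_ms + 1, segment_ms)
    return [(s, s + segment_ms) for s in starts]


# The same boundaries are applied to the call voice and the call video,
# so the k-th voice segment and the k-th video segment cover the same time span.
boundaries = fixed_length_segments(duration_ms=8000, segment_ms=1000)
```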

FIG. 11 is an exemplary diagram of a method of setting the segment time length by a voice threshold level. As shown in FIG. 11, according to an embodiment of the present invention, the first mouth shape image information and the first voice information may be set to span the time from a first time point (for example, t₁) to a second time point (for example, t₂), where the first time point (for example, t₁) is a time point at which the call voice is at or below a predetermined threshold level, and the second time point (for example, t₂) is a subsequent time point, following the first time point (for example, t₁), at which the call voice is at or below the predetermined threshold level. In other words, the points at which the call voice and the call video are divided into segments can be chosen as the points at which the level of the call voice falls below the preset threshold level. Accordingly, for successive time points (t₁, t₂, t₃, t₄, t₅, t₆), the voice segments (11a-1, 11a-2, 11a-3, 11a-4, 11a-5) may each have a different time length, and the video segments (11b-1, 11b-2, 11b-3, 11b-4, 11b-5) may likewise each have a different length.
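The threshold-based scheme of FIG. 11 can be illustrated with the sketch below. It assumes the call voice has already been reduced to a list of loudness values, one per sample or frame; the function names and the choice of the first sample of each quiet run as the split point are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def quiet_points(levels: Sequence[float], threshold: float) -> List[int]:
    """Indices where the level first drops to or below the threshold."""
    points = []
    was_quiet = False
    for i, level in enumerate(levels):
        is_quiet = level <= threshold
        if is_quiet and not was_quiet:
            points.append(i)  # start of a quiet run: candidate split point
        was_quiet = is_quiet
    return points


def segments_between(points: Sequence[int]) -> List[Tuple[int, int]]:
    """Each segment runs from one split point to the next."""
    return list(zip(points, points[1:]))


levels = [0, 5, 6, 0, 7, 8, 9, 0, 4, 0]
points = quiet_points(levels, threshold=1)
segments = segments_between(points)
```

Because the split points follow the pauses in speech, the resulting segments have different lengths, as in the figure.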

Meanwhile, although not shown in FIG. 11, according to one aspect the segment division points may be set to the points at which the level of the call voice stays below the preset threshold level for at least a preset duration. In this case, segments may be extracted so as to exclude the intervals in which the level of the call voice stays below the preset threshold level for at least the preset duration.
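One way to picture this variant is the sketch below, which splits only at quiet runs lasting at least `min_quiet` samples and leaves those runs out of the extracted segments. The function name, the per-sample loudness list, and the handling of short pauses (kept inside the segment) are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def speech_segments(
    levels: Sequence[float], threshold: float, min_quiet: int
) -> List[Tuple[int, int]]:
    """Return end-exclusive index ranges of speech, split at long quiet runs."""
    segments = []
    start: Optional[int] = None  # start index of the segment being built
    quiet_run = 0                # length of the current run below the threshold
    for i, level in enumerate(levels):
        if level < threshold:
            quiet_run += 1
            if start is not None and quiet_run == min_quiet:
                # The quiet run is long enough: close the segment before it.
                segments.append((start, i - min_quiet + 1))
                start = None
        else:
            quiet_run = 0
            if start is None:
                start = i
    if start is not None:
        # Close the last segment, trimming any short trailing quiet run.
        segments.append((start, len(levels) - quiet_run))
    return segments


levels = [0, 0, 5, 6, 0, 7, 0, 0, 0, 8, 9]
segments = speech_segments(levels, threshold=1, min_quiet=2)
```

In the example, the one-sample pause at index 4 is too short to split on and stays inside the first segment, while the three-sample pause at indices 6 to 8 separates the two segments and is excluded from both.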

FIG. 12 illustrates an example of a method for setting the time length of a segment according to triggering mouth shapes. As shown in FIG. 12, according to an embodiment of the present invention, the first mouth shape image information and the first voice information may be set to have a time length from a first time point (for example, t₁) to a second time point (for example, t₂), where the first time point (for example, t₁) is a time point at which mouth shape image information (121-1) of the specific user included in the call video matches predetermined first triggering mouth shape information, and the second time point (for example, t₂) is a time point at which mouth shape image information (122-1) of the specific user included in the call video matches predetermined second triggering mouth shape information. Here, the first mouth shape image information (12b-1) may be set to include the image information from the mouth shape image information (121-1) through the mouth shape image information (122-1). That is, the points at which the call voice and the call video are divided into segments may be set so that a segment starts when the mouth shape is the first triggering mouth shape, which is a specific first shape, and ends when the mouth shape is the second triggering mouth shape, which is a specific second shape. Accordingly, for successive time points (t₁, t₂, t₃, t₄, t₅), the voice segments (12a-1, 12a-2, 12a-3, 12a-4) may each have a different time length, and the plurality of image segments (12b-1, 12b-2, 12b-3, 12b-4) may likewise each be determined to have a different size. The segments may be started by start segments (121-1, 121-2, 121-3, 121-4), respectively, and ended by end segments (122-1, 122-2, 122-3, 122-4), respectively. Although not shown in FIG. 12, a trimmed region that is not extracted as a segment may be included between adjacent segments.
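The trigger-based division can be illustrated with the short sketch below. It assumes that a hypothetical upstream classifier has already assigned one mouth-shape label to each video frame; the labels `"A"` and `"B"` and the function name `trigger_segments` are illustrative and do not come from the source. A segment opens at a frame matching the first triggering shape and closes at the next frame matching the second triggering shape, and frames outside any segment are simply not extracted.

```python
from typing import List, Optional, Tuple


def trigger_segments(
    shapes: List[str], start_shape: str, end_shape: str
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, both inclusive, for each segment."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # frame index where the open segment began
    for i, shape in enumerate(shapes):
        if start is None:
            if shape == start_shape:
                start = i
        elif shape == end_shape:
            segments.append((start, i))
            start = None
    return segments


# "A" stands for the first triggering shape, "B" for the second.
shapes = ["x", "A", "m", "n", "B", "x", "x", "A", "o", "B"]
print(trigger_segments(shapes, "A", "B"))  # [(1, 4), (7, 9)]
```

Frames 5 and 6 in the example fall between two segments and correspond to a region that is not extracted as a segment.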

In addition, although not shown in FIG. 12, a single triggering mouth shape may be used for both the start and the end of a segment. That is, the first time point (for example, t₁) may be a time point at which mouth shape image information (for example, 121-1) of the specific user included in the call video matches the predetermined first triggering mouth shape information, and the second time point (for example, t₂) may be a time point, subsequent to the first time point, at which mouth shape image information (121-2) of the specific user matches the predetermined first triggering mouth shape information. In this case, the first segment (12b-1) may start at the mouth shape image information (121-1) and end before the mouth shape image information (121-2).

Meanwhile, according to one aspect, segment division may be performed in response to the mouth shape image information of the specific user included in the call video matching the predetermined first triggering mouth shape information for at least a preset length of time. According to one aspect, the first triggering mouth shape may be a closed mouth. Alternatively, the first triggering mouth shape information may be mouth shape motion information that includes a sequence of a plurality of mouth shapes.

Referring again to FIG. 9, the first voice information, which is at least a part of the call voice, may be converted into text information (step 715). That is, first speech recognition text information, which is text information corresponding to the first voice information, may be obtained based on a speech recognition model. Here, any known speech recognition model may be used. As mentioned above, video call data can yield a high speech recognition success rate. However, according to one aspect, a speech recognition model generated by training an artificial neural network with the specific user's voice and the corresponding text information as training data sets may also be used.

As shown in FIG. 9, a training data set based on video call data may be generated by storing the first mouth shape image information and the first speech recognition text information as a first training data set (step 717). As mentioned above, the first mouth shape image information and the first voice information can be matched based on time information, so the first mouth shape image information and the first speech recognition text information can be matched as well. Accordingly, the first mouth shape image information and the first speech recognition text information, which correspond to each other, may be stored as the first training data set. Thereafter, first to n-th training data sets may be generated and stored by repeating the same data processing on second to n-th mouth shape image information and second to n-th voice information.
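The pairing of time-matched segments into training data sets can be sketched as follows. This is an illustration only: the `TrainingExample` type, the `build_training_sets` function, and the `recognize` callable (a stand-in for whichever speech recognition model is used) are all hypothetical names, and the segment boundaries are assumed to be given in seconds by one of the division methods described earlier.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class TrainingExample:
    mouth_frames: list  # video frames of the mouth for one segment
    text: str           # speech recognition text for the same time span


def build_training_sets(
    video_frames: Sequence,
    audio_samples: Sequence,
    boundaries: List[Tuple[float, float]],
    fps: int,
    sample_rate: int,
    recognize: Callable[[Sequence], str],
) -> List[TrainingExample]:
    """Cut video and audio at the same time boundaries and pair each
    video segment with the text recognized from its audio segment."""
    examples = []
    for t_start, t_end in boundaries:
        frames = video_frames[int(t_start * fps):int(t_end * fps)]
        audio = audio_samples[int(t_start * sample_rate):int(t_end * sample_rate)]
        examples.append(TrainingExample(list(frames), recognize(audio)))
    return examples


# Toy data: 5 seconds of video at 2 frames/s and audio at 4 samples/s.
frames = list(range(10))
audio = list(range(20))
fake_asr = lambda samples: f"{len(samples)} samples"  # placeholder recognizer
for example in build_training_sets(frames, audio, [(0, 2), (2, 5)], 2, 4, fake_asr):
    print(example)
```

Because both streams are sliced with the same time boundaries, each stored pair contains mouth shape images and text that refer to the same span of the call.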

Accordingly, while the specific user is making video calls, training data sets can be generated naturally, without burdening the user, and with high accuracy.

Generation of Training Data Sets Based on Example Text Information

According to an embodiment of the present invention, a training data set can be generated by displaying, through a predetermined application, example text information that includes an example character string and having the specific user read that text information. That is, the training data sets according to an embodiment of the present invention may include i) example text information and ii) mouth shape image information of the specific user reading the example text information.

FIG. 13 is a conceptual diagram of a first embodiment of generating a training data set based on example text information. As shown in FIG. 13, the information input apparatus (100) based on a user image according to an embodiment of the present invention may be a mobile device used by the specific user and may include a display unit (1300).

FIG. 15 is a flowchart of the first embodiment of generating a training data set based on example text information. Hereinafter, the first embodiment of generating a training data set based on example text information according to the present invention is described in more detail with reference to FIGS. 13 and 15.

As shown in FIG. 15, in the procedure for generating a training data set based on example text information according to one aspect of the present invention, example text information may first be displayed (step 1510). The example text information may be stored as example text information (430) in the memory (150) of the information input apparatus (100) based on a user image according to an embodiment of the present invention, as shown, for example, in FIGS. 1 and 4. For example, the processor (130) of the information input apparatus (100) based on a user image may display the text information (430) on the display unit (140). As shown in FIG. 13, at least one of the pieces of example text information (1310, 1320, 1330, 1340) may be displayed on the display unit (1300).

Thereafter, a read image, which is an image of the specific user captured while the specific user reads at least one of the pieces of example text information (1310, 1320, 1330, 1340), may be obtained (step 1520). According to one aspect, in response to the specific user touching the play button (1311) for the example text information (1310), the image input unit (110) may start operating, obtain image information of the specific user while the user reads the example text information (1310), and store it as the read image. The specific user does not necessarily have to read the example text (1310) aloud.

Because the image of the specific user was obtained while the user was reading the example text information (1310), the second mouth shape image information included in the read image and the example text information (1310) correspond to each other, and a second training data set may be generated by storing the second mouth shape image information and the example text information (1310) as the second training data set (step 1530). The same data processing may be performed sequentially for the example text information (1320, 1330, 1340), in which case the buttons (1312, 1313, 1314) may be used, respectively.

FIG. 14 is a conceptual diagram of a second embodiment of generating a training data set based on example text information. As shown in FIG. 14, the information input apparatus (100) based on a user image according to an embodiment of the present invention may be a mobile device used by the specific user and may include a display unit (1400).

FIG. 16 is a flowchart of the second embodiment of generating a training data set based on example text information. Hereinafter, the second embodiment of generating a training data set based on example text information according to the present invention is described in more detail with reference to FIGS. 14 and 16.

As shown in FIG. 16, in the procedure for generating a training data set based on example text information according to one aspect of the present invention, a plurality of pieces of example text information may first be displayed (step 1610). The example text information may be stored as example text information (430) in the memory (150) of the information input apparatus (100) based on a user image according to an embodiment of the present invention, as shown, for example, in FIGS. 1 and 4, and the example text information (430) may include a plurality of pieces of example text information, from a first example text (430-1) to an n-th example text (430-n). For example, the processor (130) of the information input apparatus (100) based on a user image may display this plurality of pieces of text information (430-1 to 430-n) on the display unit (140). As shown in FIG. 14, a plurality of pieces of example text information (1410) may be displayed on the display unit (1400). A scroll bar (1420) may also be displayed on the display unit (1400) so that the specific user can see where the currently displayed example text information is located within the plurality of pieces of text information.

Thereafter, while the specific user reads the displayed plurality of pieces of example text information (1410) aloud, a read image, which is an image of the specific user, and a read voice, which is the voice of the specific user, may be obtained (step 1620). According to one aspect, the image input unit (110) of the information input apparatus (100) based on a user image may be used to obtain the read image while the specific user reads the plurality of example texts (1410) aloud, and the voice input unit (120) may be used to obtain the read voice while the specific user reads the plurality of example texts (1410) aloud.

Subsequently, based on time information, third mouth shape image information, which is at least a part of the read image, may be associated with third voice information, which is at least a part of the read voice (step 1630). Similarly to the segment division procedures described above for video call data, the image data of the read image and the voice data of the read voice may be divided into a plurality of segments. The third mouth shape image information may be any one of the plurality of image segments, and the third voice information may be the one of the plurality of voice segments that matches the third mouth shape image information based on the time information.

The third voice information may be converted into text information. That is, third speech recognition text information, which is text information corresponding to the third voice information, may be obtained based on a speech recognition model (step 1640). As in the processing of video call data described above, any known speech recognition model may be used.

Once the third speech recognition text information, which includes the character string corresponding to the third voice information, has been determined, it is determined whether any of the plurality of pieces of example text information matches the third speech recognition text information. The example text information subject to this identity determination may include a plurality of pieces of divided example text information obtained by subdividing the same sentence into different lengths. If the third speech recognition text information is determined to be identical to third example text information, which is any one of the plurality of pieces of example text information, the third mouth shape image information and the third example text information may be stored as a third training data set (step 1650).
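The identity check of step 1650 can be sketched as below. The function names are hypothetical, and the comparison rule is an assumption: the source only states that the two texts must be identical, so this sketch compares them after collapsing whitespace and nothing more. A segment is stored only when its recognized text matches one of the example texts, and the stored label is the example text rather than the recognizer output.

```python
from typing import List, Optional, Tuple


def match_example_text(recognized: str, example_texts: List[str]) -> Optional[str]:
    """Return the example text identical to the recognized text, or None.

    Assumption: texts are compared after collapsing runs of whitespace.
    """
    normalized = " ".join(recognized.split())
    for example in example_texts:
        if " ".join(example.split()) == normalized:
            return example
    return None


def store_if_matched(
    mouth_frames: list,
    recognized: str,
    example_texts: List[str],
    dataset: List[Tuple[list, str]],
) -> bool:
    """Append (mouth_frames, example_text) to the dataset only on a match."""
    example = match_example_text(recognized, example_texts)
    if example is None:
        return False
    dataset.append((mouth_frames, example))
    return True


# The example texts may include the same sentence divided at different lengths.
examples = ["good morning", "good morning everyone", "see you tomorrow"]
dataset: List[Tuple[list, str]] = []
print(store_if_matched(["f0", "f1"], "good  morning", examples, dataset))  # True
print(store_if_matched(["f2"], "good mourning", examples, dataset))        # False
print(dataset)  # [(['f0', 'f1'], 'good morning')]
```

Discarding segments whose recognized text matches no example text is what gives the double check: a misrecognized segment never enters the training data.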

Through this procedure, the specific user can generate training data sets simply by reading the given plurality of pieces of example text information, without having to match particular mouth shape information to particular example text information. In addition, because the data passes a double check, namely speech recognition and matching against the example text information, more accurate training data sets can be generated.

The method according to the present invention described above can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording medium that stores data readable by a computer system. Examples include ROM (read-only memory), RAM (random access memory), magnetic tape, magnetic disks, flash memory, and optical data storage devices. The computer-readable recording medium may also be distributed across computer systems connected through a computer network, so that the code is stored and executed in a distributed manner.

Although the present invention has been described above with reference to the drawings and embodiments, this does not mean that the scope of protection of the present invention is limited by those drawings or embodiments, and those skilled in the art will understand that the present invention can be modified and changed in various ways without departing from the spirit and scope of the invention set forth in the claims below.

Specifically, the described features may be implemented in digital electronic circuitry, or in computer hardware, firmware, or combinations thereof. The features may be implemented in a computer program product embodied in a machine-readable storage device, for execution by a programmable processor. The features may be performed by a programmable processor executing a program of instructions to perform functions of the described embodiments by operating on input data and generating output. The described features may be implemented in one or more computer programs executable on a programmable system that includes at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program includes a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

Suitable processors for the execution of a program of instructions include, for example, both general-purpose and special-purpose microprocessors, and the sole processor or one of multiple processors of any kind of computer. Storage devices suitable for embodying computer program instructions and data that implement the described features include all forms of non-volatile memory, including, for example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic devices such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory may be integrated in, or supplemented by, ASICs (application-specific integrated circuits).

Although the present invention has been described above on the basis of a series of functional blocks, it is not limited by the foregoing embodiments and the accompanying drawings, and it will be apparent to those of ordinary skill in the art to which the present invention pertains that various substitutions, modifications, and changes are possible without departing from the technical spirit of the present invention.

Combinations of the foregoing embodiments are not limited to those described above, and various other combinations may be provided depending on the implementation and/or need.

In the foregoing embodiments, the methods are described on the basis of flowcharts as a series of steps or blocks, but the present invention is not limited to the order of the steps, and some steps may occur in a different order from, or simultaneously with, other steps described above. In addition, those of ordinary skill in the art will understand that the steps shown in a flowchart are not exclusive, that other steps may be included, and that one or more steps of a flowchart may be deleted without affecting the scope of the present invention.

The foregoing embodiments include examples of various aspects. Although not every possible combination for representing the various aspects can be described, those of ordinary skill in the art will recognize that other combinations are possible. Accordingly, the present invention is intended to include all other replacements, modifications, and changes that fall within the scope of the following claims.

Claims

An information input method based on a user image, the method comprising:
generating a plurality of training data sets, each including mouth shape image information of a specific user and text information corresponding to the mouth shape image information;
generating, based on the training data sets, an image recognition model that outputs text information corresponding to input mouth shape image information; and
determining, based on the image recognition model, input text information corresponding to an input image of the specific user.

The method of claim 1,
wherein determining the input text information comprises:
receiving an input image of the specific user acquired by an image input unit; and
determining, as the input text information and based on the image recognition model, text information corresponding to mouth shape image information included in the input image.

The method of claim 1,
wherein the training data sets include i) mouth shape image information included in video call data of the specific user and ii) speech recognition text information that is a speech recognition result for a voice corresponding to the mouth shape image information.

The method of claim 1,
wherein generating the training data sets comprises:
obtaining video call data of the specific user, the video call data including a call video and a call voice;
associating, based on time information, first mouth shape image information that is at least a part of the call video with first voice information that is at least a part of the call voice;
obtaining, based on a speech recognition model, first speech recognition text information that is text information corresponding to the first voice information; and
storing the first mouth shape image information and the first speech recognition text information as a first training data set.

The method of claim 4,
wherein the time length of the first mouth shape image information and the first voice information is set to a predetermined time length.

The method of claim 4,
wherein the first mouth shape image information and the first voice information have a time length from a first time point to a second time point, and
wherein the first time point is a time point at which the call voice is at or below a predetermined threshold level, and the second time point is a time point, subsequent to the first time point, at which the call voice is at or below the predetermined threshold level.

The method of claim 4,
wherein the first mouth shape image information and the first voice information have a time length from a first time point to a second time point,
wherein the first time point is a time point at which mouth shape image information of the specific user included in the call video matches predetermined first triggering mouth shape information, and
wherein the second time point is a time point, subsequent to the first time point, at which mouth shape image information of the specific user matches the predetermined first triggering mouth shape information.

The method of claim 4,
wherein the first mouth shape image information and the first voice information have a time length from a first time point to a second time point,
wherein the first time point is a time point at which mouth shape image information of the specific user included in the call video matches predetermined first triggering mouth shape information, and
wherein the second time point is a time point at which mouth shape image information of the specific user included in the call video matches predetermined second triggering mouth shape information.

The method of claim 1,
wherein the training data sets include i) example text information and ii) mouth shape image information of the specific user reading the example text information.

The method of claim 1,
wherein generating the training data sets comprises:
displaying example text information;
obtaining a read image, which is an image of the specific user captured while the specific user reads the example text information; and
storing second mouth shape image information included in the read image and the example text information as a second training data set.

The method of claim 1,
wherein generating the training data sets comprises:
displaying a plurality of pieces of example text information;
obtaining a read image, which is an image of the specific user, and a read voice, which is the voice of the specific user, while the specific user reads the example text information aloud;
associating, based on time information, third mouth shape image information that is at least a part of the read image with third voice information that is at least a part of the read voice;
obtaining, based on a speech recognition model, third speech recognition text information that is text information corresponding to the third voice information; and
in response to determining that the third speech recognition text information is identical to third example text information, which is any one of the plurality of pieces of example text information, storing the third mouth shape image information and the third example text information as a third training data set.

An information input device based on a user image, the device comprising:
an image input unit that acquires image information;
a voice input unit that acquires voice information;
a memory that stores image information, voice information, and text information; and
a processor,
wherein the processor is configured to:
generate a plurality of training data sets, each including mouth shape image information of a specific user and text information corresponding to the mouth shape image information;
generate, based on the training data sets, an image recognition model that outputs text information corresponding to input mouth shape image information; and
determine, based on the image recognition model, input text information corresponding to an input image of the specific user.

The device of claim 12,
wherein, to determine the input text information, the processor is configured to:
receive an input image of the specific user acquired by the image input unit; and
determine, as the input text information and based on the image recognition model, text information corresponding to mouth shape image information included in the input image.

The device of claim 12,
Wherein the training data sets include: i) mouth shape image information included in video call data of the specific user; and ii) speech recognition text information that is a speech recognition result for a voice corresponding to the mouth shape image information.

The device of claim 12,
Wherein, in generating the training data sets, the processor is configured to:
Obtain video call data of the specific user, wherein the video call data includes a call video and a call voice;
Correlate first mouth shape image information, which is at least a portion of the call video, with first voice information, which is at least a portion of the call voice, based on time information;
Acquire first speech recognition text information, which is text information corresponding to the first voice information, based on a speech recognition model; and
Store the first mouth shape image information and the first speech recognition text information in the storage unit as a first training data set.
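
The time-based correlation above can be sketched as follows, assuming video frames carry timestamps in seconds and the call voice is sampled at a fixed rate. The function names and the `recognize` callback are hypothetical; the callback stands in for whatever speech recognition model is used.

```python
from typing import Callable, List, Tuple


def frames_in_window(frame_times: List[float], start: float, end: float) -> List[int]:
    """Indices of video frames whose timestamp falls in [start, end)."""
    return [i for i, t in enumerate(frame_times) if start <= t < end]


def audio_slice(sample_rate: int, start: float, end: float) -> Tuple[int, int]:
    """Audio sample range covering the same [start, end) interval."""
    return int(round(start * sample_rate)), int(round(end * sample_rate))


def make_training_pair(
    frame_times: List[float],
    sample_rate: int,
    start: float,
    end: float,
    recognize: Callable[[int, int], str],
) -> Tuple[List[int], str]:
    # Select the frames and the audio that share one time interval,
    # then label the frames with the recognized text for that audio.
    frame_idx = frames_in_window(frame_times, start, end)
    first_sample, last_sample = audio_slice(sample_rate, start, end)
    return frame_idx, recognize(first_sample, last_sample)
```

The shared interval is what ties the mouth shape frames to the text: both are cut from the same span of the call.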

The device of claim 15,
Wherein the first mouth shape image information and the first voice information have a predetermined time length.
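
One way to obtain segments of a predetermined time length is to cut the call into equal windows. The sketch below assumes times in integer milliseconds and drops a trailing partial window; both choices, and the function name, are assumptions.

```python
from typing import List, Tuple


def fixed_length_windows(total_ms: int, window_ms: int) -> List[Tuple[int, int]]:
    """Split [0, total_ms) into consecutive windows of exactly window_ms."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    # A trailing remainder shorter than window_ms is dropped (assumption).
    return [
        (start, start + window_ms)
        for start in range(0, total_ms - window_ms + 1, window_ms)
    ]
```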

The device of claim 15,
Wherein the first mouth shape image information and the first voice information have a time length from a first time point to a second time point, and
Wherein the first time point indicates a time point at which the call voice is less than or equal to a predetermined threshold magnitude, and the second time point indicates a time point, subsequent to the first time point, at which the call voice is less than or equal to the predetermined threshold magnitude.
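
A minimal sketch of threshold-based segmentation, assuming the call voice has been reduced to one loudness value per time step, and assuming each segment runs from a quiet step through louder speech to the next quiet step. The function name and this reading of the two time points are assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def silence_bounded_segments(
    levels: Sequence[float], threshold: float
) -> List[Tuple[int, int]]:
    """Return (first, second) index pairs bounded by quiet steps."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, level in enumerate(levels):
        quiet = level <= threshold
        next_is_loud = i + 1 < len(levels) and levels[i + 1] > threshold
        if start is None:
            # Open a segment at a quiet step that is followed by louder audio.
            if quiet and next_is_loud:
                start = i
        elif quiet:
            # Close the segment at the next quiet step.
            segments.append((start, i))
            # The same quiet step may also open the following segment.
            start = i if next_is_loud else None
    return segments
```

Each returned pair brackets one stretch of speech, so the matching video frames contain a complete utterance rather than an arbitrary cut.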

The device of claim 15,
Wherein the first mouth shape image information and the first voice information have a time length from a first time point to a second time point,
Wherein the first time point indicates a time point at which mouth shape image information of the specific user included in the call video coincides with predetermined first triggering mouth shape information, and
Wherein the second time point indicates a time point, subsequent to the first time point, at which the mouth shape image information of the specific user coincides with the predetermined first triggering mouth shape information.

The device of claim 15,
Wherein the first mouth shape image information and the first voice information have a time length from a first time point to a second time point,
Wherein the first time point indicates a time point at which mouth shape image information of the specific user included in the call video coincides with predetermined first triggering mouth shape information, and
Wherein the second time point indicates a time point at which mouth shape image information of the specific user included in the call video coincides with predetermined second triggering mouth shape information.
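
Segmentation by triggering mouth shapes can be sketched as below, assuming each video frame has already been mapped to a mouth shape label by some classifier (not shown). The function name and the labels are hypothetical. Omitting `end_trigger` covers the variant in which the same triggering mouth shape marks both time points, and the second time point is taken as the first matching frame after the first time point.

```python
from typing import List, Optional, Tuple


def trigger_segment(
    shapes: List[str], start_trigger: str, end_trigger: Optional[str] = None
) -> Optional[Tuple[int, int]]:
    """Find (first, second) frame indices delimited by triggering mouth shapes."""
    if end_trigger is None:
        end_trigger = start_trigger  # same trigger opens and closes the segment
    try:
        first = shapes.index(start_trigger)
    except ValueError:
        return None  # the opening trigger never appears
    for i in range(first + 1, len(shapes)):
        if shapes[i] == end_trigger:
            return first, i
    return None  # the segment was opened but never closed
```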

The device of claim 12,
Wherein the training data sets include: i) example text information; and ii) mouth shape image information captured while the specific user reads the example text information.

The device of claim 12,
Wherein, in generating the training data sets, the processor is configured to:
Display example text information on a display unit included in the information input device;
Obtain, using the image input unit, a read image, which is an image of the specific user captured while the specific user reads the example text information; and
Store second mouth shape image information included in the read image and the example text information in the storage unit as a second training data set.

The device of claim 12,
Wherein, in generating the training data sets, the processor is configured to:
Display a plurality of pieces of example text information on a display unit included in the information input device;
Obtain, using the image input unit and the voice input unit, a read image, which is an image of the specific user, and a read voice, which is a voice of the specific user, while the specific user reads out the example text information;
Correlate third mouth shape image information, which is at least a part of the read image, with third voice information, which is at least a part of the read voice, based on time information;
Acquire third speech recognition text information, which is text information corresponding to the third voice information, based on a speech recognition model;
Correlate the third speech recognition text information with third example text information, which is any one of the plurality of pieces of example text information; and
In response to determining that the third speech recognition text information is the same as the third example text information, store the third mouth shape image information and the third example text information in the storage unit as a third training data set.

An information input system based on a user image, comprising:
A server configured to acquire a plurality of training data sets, each including mouth shape image information for a specific user and text information corresponding to the mouth shape image information, and to generate, based on the training data sets, an image recognition model that outputs text information corresponding to input mouth shape image information; and
A terminal configured to acquire at least one of image information and voice information for the specific user, and to determine input text information corresponding to an input image for the specific user based on the image recognition model.

A computer readable storage medium comprising instructions executable by a processor, wherein the instructions, when executed by the processor, cause the processor to:
Generate a plurality of training data sets, each including mouth shape image information for a specific user and text information corresponding to the mouth shape image information;
Generate, based on the training data sets, an image recognition model that outputs text information corresponding to input mouth shape image information; and
Determine input text information corresponding to an input image for the specific user based on the image recognition model.